# AsyncWebCrawler

The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.

**Recommended usage**:

1. **Create** a `BrowserConfig` for global browser settings.
2. **Instantiate** `AsyncWebCrawler(config=browser_config)`.
3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually.
4. **Call** `arun(url, config=crawler_run_config)` for each page you want.

---
## 1. Constructor Overview

```python
class AsyncWebCrawler:
    def __init__(
        self,
        crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
        config: Optional[BrowserConfig] = None,
        always_bypass_cache: bool = False,            # deprecated
        always_by_pass_cache: Optional[bool] = None,  # also deprecated
        base_directory: str = ...,
        thread_safe: bool = False,
        **kwargs,
    ):
        """
        Create an AsyncWebCrawler instance.

        Args:
            crawler_strategy:
                (Advanced) Provide a custom crawler strategy if needed.
            config:
                A BrowserConfig object specifying how the browser is set up.
            always_bypass_cache:
                (Deprecated) Use CrawlerRunConfig.cache_mode instead.
            base_directory:
                Folder for storing caches/logs (if relevant).
            thread_safe:
                If True, attempts some concurrency safeguards. Usually False.
            **kwargs:
                Additional legacy or debugging parameters.
        """
```
### Typical Initialization

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_cfg = BrowserConfig(
    browser_type="chromium",
    headless=True,
    verbose=True
)
crawler = AsyncWebCrawler(config=browser_cfg)
```

**Notes**:

- **Legacy** parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set **caching** in `CrawlerRunConfig`.

---
## 2. Lifecycle: Start/Close or Context Manager

### 2.1 Context Manager (Recommended)

```python
async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com")
    # The crawler automatically starts/closes resources
```

When the `async with` block ends, the crawler cleans up (closes the browser, etc.).

### 2.2 Manual Start & Close

```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()

result1 = await crawler.arun("https://example.com")
result2 = await crawler.arun("https://another.com")

await crawler.close()
```

Use this style if you have a **long-running** application or need full control of the crawler's lifecycle.

---
## 3. Primary Method: `arun()`

```python
async def arun(
    self,
    url: str,
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters for backward compatibility...
) -> CrawlResult:
    ...
```

### 3.1 New Approach

You pass a `CrawlerRunConfig` object that sets up everything about a crawl—content filtering, caching, session reuse, JS code, screenshots, etc.

```python
import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode

run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    css_selector="main.article",
    word_count_threshold=10,
    screenshot=True
)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com/news", config=run_cfg)
    print("Crawled HTML length:", len(result.cleaned_html))
    if result.screenshot:
        print("Screenshot base64 length:", len(result.screenshot))
```

### 3.2 Legacy Parameters Still Accepted

For **backward** compatibility, `arun()` can still accept direct arguments like `css_selector=...`, `word_count_threshold=...`, etc., but we strongly advise migrating them into a **`CrawlerRunConfig`**.
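As a minimal sketch of the migration, the same options move from the call itself into a config object. This reuses the `crawler` and `run_cfg` names from the snippets above; the URL is a placeholder:

```python
# Legacy style (still accepted, but discouraged):
result = await crawler.arun(
    "https://example.com/news",
    css_selector="main.article",
    word_count_threshold=10,
)

# Preferred style: put the same options into a CrawlerRunConfig.
run_cfg = CrawlerRunConfig(
    css_selector="main.article",
    word_count_threshold=10,
)
result = await crawler.arun("https://example.com/news", config=run_cfg)
```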
---
## 4. Batch Processing: `arun_many()`

```python
async def arun_many(
    self,
    urls: List[str],
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters maintained for backwards compatibility...
) -> List[CrawlResult]:
    """
    Process multiple URLs with intelligent rate limiting and resource monitoring.
    """
```

### 4.1 Resource-Aware Crawling

The `arun_many()` method uses an intelligent dispatcher that:

- Monitors system memory usage
- Implements adaptive rate limiting
- Provides detailed progress monitoring
- Manages concurrent crawls efficiently

### 4.2 Example Usage

See [Multi-url Crawling](../advanced/multi-url-crawling.md) for a detailed example of how to use `arun_many()`; a minimal sketch follows below.
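The sketch below batches a handful of placeholder URLs; `browser_cfg` and `run_cfg` simply mirror the earlier examples:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def crawl_batch():
    browser_cfg = BrowserConfig(headless=True)
    run_cfg = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    # Placeholder URLs for illustration
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ]

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        results = await crawler.arun_many(urls, config=run_cfg)
        for result in results:
            if result.success:
                print(result.url, "->", len(result.cleaned_html), "chars")
            else:
                print(result.url, "failed:", result.error_message)

asyncio.run(crawl_batch())
```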
### 4.3 Key Features

1. **Rate Limiting**

    - Automatic delay between requests
    - Exponential backoff on rate limit detection
    - Domain-specific rate limiting
    - Configurable retry strategy

2. **Resource Monitoring**

    - Memory usage tracking
    - Adaptive concurrency based on system load
    - Automatic pausing when resources are constrained

3. **Progress Monitoring**

    - Detailed or aggregated progress display
    - Real-time status updates
    - Memory usage statistics

4. **Error Handling**

    - Graceful handling of rate limits
    - Automatic retries with backoff
    - Detailed error reporting

---
## 5. `CrawlResult` Output

Each `arun()` returns a **`CrawlResult`** containing:

- `url`: Final URL (if redirected).
- `html`: Original HTML.
- `cleaned_html`: Sanitized HTML.
- `markdown_v2`: Deprecated. Instead just use regular `markdown`.
- `extracted_content`: If an extraction strategy was used (JSON for CSS/LLM strategies).
- `screenshot`, `pdf`: If screenshots/PDF requested.
- `media`, `links`: Information about discovered images/links.
- `success`, `error_message`: Status info.
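A minimal sketch of consuming these fields (the URL is a placeholder, and the `"internal"` grouping on `links` is an assumption here):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        if result.success:
            print("Final URL:", result.url)
            print("Cleaned HTML length:", len(result.cleaned_html))
            # Assumes `links` is a dict grouping discovered links, e.g. under "internal".
            print("Internal links:", len(result.links.get("internal", [])))
        else:
            print("Crawl failed:", result.error_message)

asyncio.run(main())
```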
For details, see [CrawlResult doc](./crawl-result.md).
---
## 6. Quick Example

Below is an example hooking it all together:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json

async def main():
    # 1. Browser config
    browser_cfg = BrowserConfig(
        browser_type="firefox",
        headless=False,
        verbose=True
    )

    # 2. Run config
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {
                "name": "title",
                "selector": "h2",
                "type": "text"
            },
            {
                "name": "url",
                "selector": "a",
                "type": "attribute",
                "attribute": "href"
            }
        ]
    }

    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        word_count_threshold=15,
        remove_overlay_elements=True,
        wait_for="css:.post"  # Wait for posts to appear
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",
            config=run_cfg
        )
        if result.success:
            print("Cleaned HTML length:", len(result.cleaned_html))
            if result.extracted_content:
                articles = json.loads(result.extracted_content)
                print("Extracted articles:", articles[:2])
        else:
            print("Error:", result.error_message)

asyncio.run(main())
```

**Explanation**:

- We define a **`BrowserConfig`** with Firefox, headless disabled, and `verbose=True`.
- We define a **`CrawlerRunConfig`** that **bypasses the cache**, uses a **CSS** extraction schema, has a `word_count_threshold=15`, etc.
- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.

---
## 7. Best Practices & Migration Notes

1. **Use** `BrowserConfig` for **global** settings about the browser's environment.
2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions).
3. **Avoid** passing legacy parameters like `css_selector` or `word_count_threshold` directly to `arun()`. Instead:

    ```python
    run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
    result = await crawler.arun(url="...", config=run_cfg)
    ```

4. **Context Manager** usage is simplest unless you want a persistent crawler across many calls.

---
## 8. Summary

**AsyncWebCrawler** is your entry point to asynchronous crawling:

- **Constructor** accepts **`BrowserConfig`** (or defaults).
- **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls.
- **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs.
- For advanced lifecycle control, use `start()` and `close()` explicitly.

**Migration**:

- If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.

This modular approach ensures your code is **clean**, **scalable**, and **easy to maintain**. For any advanced or rarely used parameters, see the [BrowserConfig docs](../api/parameters.md).