# AsyncWebCrawler

The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.

**Recommended usage**:
1. **Create** a `BrowserConfig` for global browser settings.  
2. **Instantiate** `AsyncWebCrawler(config=browser_config)`.  
3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually.  
4. **Call** `arun(url, config=crawler_run_config)` for each page you want.

---

## 1. Constructor Overview

```python
class AsyncWebCrawler:
    def __init__(
        self,
        crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
        config: Optional[BrowserConfig] = None,
        always_bypass_cache: bool = False,           # deprecated
        always_by_pass_cache: Optional[bool] = None, # also deprecated
        base_directory: str = ...,
        thread_safe: bool = False,
        **kwargs,
    ):
        """
        Create an AsyncWebCrawler instance.

        Args:
            crawler_strategy: 
                (Advanced) Provide a custom crawler strategy if needed.
            config: 
                A BrowserConfig object specifying how the browser is set up.
            always_bypass_cache: 
                (Deprecated) Use CrawlerRunConfig.cache_mode instead.
            base_directory:     
                Folder for storing caches/logs (if relevant).
            thread_safe: 
                If True, attempts some concurrency safeguards. Usually False.
            **kwargs: 
                Additional legacy or debugging parameters.
        """
```

### Typical Initialization

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_cfg = BrowserConfig(
    browser_type="chromium",
    headless=True,
    verbose=True
)

crawler = AsyncWebCrawler(config=browser_cfg)
```

**Notes**:
- **Legacy** parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set **caching** in `CrawlerRunConfig`.

---

## 2. Lifecycle: Start/Close or Context Manager

### 2.1 Context Manager (Recommended)

```python
async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com")
    # The crawler automatically starts/closes resources
```

When the `async with` block ends, the crawler cleans up (closes the browser, etc.).
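
As a rough illustration of what the `async with` form does, the sketch below uses a stand-in class (`DemoCrawler`, hypothetical and not part of Crawl4AI) whose `__aenter__` and `__aexit__` call `start()` and `close()`. It assumes the real crawler follows the same general protocol; the actual implementation may differ.

```python
import asyncio


class DemoCrawler:
    """Hypothetical stand-in; it only records lifecycle events."""

    def __init__(self):
        self.events = []

    async def start(self):
        self.events.append("start")
        return self

    async def close(self):
        self.events.append("close")

    async def arun(self, url):
        self.events.append(f"arun:{url}")
        return url

    async def __aenter__(self):
        # Entering the block starts the resources
        return await self.start()

    async def __aexit__(self, exc_type, exc, tb):
        # Leaving the block closes them, even if the body raised
        await self.close()


async def demo():
    async with DemoCrawler() as crawler:
        await crawler.arun("https://example.com")
    return crawler.events


events = asyncio.run(demo())
print(events)
```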

### 2.2 Manual Start & Close

```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()

result1 = await crawler.arun("https://example.com")
result2 = await crawler.arun("https://another.com")

await crawler.close()
```

Use this style if you have a **long-running** application or need full control of the crawler’s lifecycle.
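
With manual lifecycle control, cleanup becomes your responsibility. One common pattern is to wrap the crawl calls in `try`/`finally` so that `close()` runs even if a crawl raises. The sketch below uses a hypothetical stand-in (`FakeCrawler`) and assumes only that `start()`, `arun()`, and `close()` are awaitable, as in the snippet above.

```python
import asyncio


class FakeCrawler:
    """Hypothetical stand-in used only to show the cleanup pattern."""

    def __init__(self):
        self.closed = False

    async def start(self):
        pass

    async def arun(self, url):
        # Simulate a failure for URLs containing "bad"
        if "bad" in url:
            raise RuntimeError(f"failed to crawl {url}")
        return url

    async def close(self):
        self.closed = True


async def crawl_all(crawler, urls):
    await crawler.start()
    try:
        return [await crawler.arun(u) for u in urls]
    finally:
        # Always release resources, whether or not a crawl failed
        await crawler.close()


async def demo():
    crawler = FakeCrawler()
    try:
        await crawl_all(crawler, ["https://example.com", "https://bad.example"])
    except RuntimeError:
        pass
    return crawler.closed


closed_after_error = asyncio.run(demo())
```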

---

## 3. Primary Method: `arun()`

```python
async def arun(
    self,
    url: str,
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters for backward compatibility...
) -> CrawlResult:
    ...
```

### 3.1 New Approach

You pass a `CrawlerRunConfig` object that sets up everything about a crawl—content filtering, caching, session reuse, JS code, screenshots, etc.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    css_selector="main.article",
    word_count_threshold=10,
    screenshot=True
)

async def main():
    # browser_cfg is the BrowserConfig created under "Typical Initialization"
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun("https://example.com/news", config=run_cfg)
        print("Cleaned HTML length:", len(result.cleaned_html))
        if result.screenshot:
            print("Screenshot base64 length:", len(result.screenshot))

asyncio.run(main())
```

### 3.2 Legacy Parameters Still Accepted

For **backward** compatibility, `arun()` can still accept direct arguments like `css_selector=...`, `word_count_threshold=...`, etc., but we strongly advise migrating them into a **`CrawlerRunConfig`**.

---

## 4. Batch Processing: `arun_many()`

```python
async def arun_many(
    self,
    urls: List[str],
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters maintained for backwards compatibility...
) -> List[CrawlResult]:
    """
    Process multiple URLs with intelligent rate limiting and resource monitoring.
    """
```

### 4.1 Resource-Aware Crawling

The `arun_many()` method now uses an intelligent dispatcher that:
- Monitors system memory usage
- Implements adaptive rate limiting
- Provides detailed progress monitoring
- Manages concurrent crawls efficiently
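
To make "manages concurrent crawls" more concrete, here is a generic sketch of capping concurrency with `asyncio.Semaphore`. It is not Crawl4AI's dispatcher: `fetch` is a hypothetical placeholder for a single crawl, and `max_concurrent` is an assumed name for the concurrency limit.

```python
import asyncio


async def run_bounded(urls, max_concurrent=3):
    """Run one task per URL, with at most `max_concurrent` active at once."""
    semaphore = asyncio.Semaphore(max_concurrent)
    active = 0
    peak = 0

    async def fetch(url):
        nonlocal active, peak
        async with semaphore:          # wait here if the limit is reached
            active += 1
            peak = max(peak, active)   # track the highest concurrency seen
            await asyncio.sleep(0.01)  # simulate network I/O
            active -= 1
            return f"done:{url}"

    results = await asyncio.gather(*(fetch(u) for u in urls))
    return results, peak


urls = [f"https://example.com/page{i}" for i in range(10)]
results, peak = asyncio.run(run_bounded(urls, max_concurrent=3))
```

`asyncio.gather` preserves input order, so `results` lines up with `urls` regardless of which task finishes first.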

### 4.2 Example Usage

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, RateLimitConfig
from crawl4ai.dispatcher import DisplayMode

# Configure browser
browser_cfg = BrowserConfig(headless=True)

# Configure crawler with rate limiting
run_cfg = CrawlerRunConfig(
    # Enable rate limiting
    enable_rate_limiting=True,
    rate_limit_config=RateLimitConfig(
        base_delay=(1.0, 2.0),  # Random delay between 1-2 seconds
        max_delay=30.0,         # Maximum delay after rate limit hits
        max_retries=2,          # Number of retries before giving up
        rate_limit_codes=[429, 503]  # Status codes that trigger rate limiting
    ),
    # Resource monitoring
    memory_threshold_percent=70.0,  # Pause if memory exceeds this
    check_interval=0.5,            # How often to check resources
    max_session_permit=3,          # Maximum concurrent crawls
    display_mode=DisplayMode.DETAILED.value  # Show detailed progress
)

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    results = await crawler.arun_many(urls, config=run_cfg)
    for result in results:
        print(f"URL: {result.url}, Success: {result.success}")
```

### 4.3 Key Features

1. **Rate Limiting**
   - Automatic delay between requests
   - Exponential backoff on rate limit detection
   - Domain-specific rate limiting
   - Configurable retry strategy

2. **Resource Monitoring**
   - Memory usage tracking
   - Adaptive concurrency based on system load
   - Automatic pausing when resources are constrained

3. **Progress Monitoring**
   - Detailed or aggregated progress display
   - Real-time status updates
   - Memory usage statistics

4. **Error Handling**
   - Graceful handling of rate limits
   - Automatic retries with backoff
   - Detailed error reporting
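
The retry behaviour listed above can be sketched generically. The helper names (`backoff_delay`, `should_retry`) are hypothetical, and the doubling formula is a common convention assumed here; Crawl4AI's actual backoff calculation may differ. The defaults mirror the values used in the example configuration (`max_delay=30.0`, `max_retries=2`, status codes 429 and 503).

```python
def backoff_delay(attempt, base_delay=1.0, max_delay=30.0):
    """Seconds to wait before retry number `attempt` (0-based), capped at max_delay."""
    return min(base_delay * (2 ** attempt), max_delay)


def should_retry(status_code, attempt, max_retries=2, rate_limit_codes=(429, 503)):
    """Retry only on rate-limit status codes and while retries remain."""
    return status_code in rate_limit_codes and attempt < max_retries


# Delays for the first six attempts: doubling until the cap is hit
delays = [backoff_delay(n) for n in range(6)]
print(delays)
```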

---

## 5. `CrawlResult` Output

Each `arun()` returns a **`CrawlResult`** containing:

- `url`: Final URL (if redirected).
- `html`: Original HTML.
- `cleaned_html`: Sanitized HTML.
- `markdown_v2` (or future `markdown`): Markdown outputs (raw, fit, etc.).
- `extracted_content`: If an extraction strategy was used (JSON for CSS/LLM strategies).
- `screenshot`, `pdf`: If screenshots/PDF requested.
- `media`, `links`: Information about discovered images/links.
- `success`, `error_message`: Status info.

For details, see [CrawlResult doc](./crawl-result.md).
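
When processing many results, it can help to split successes from failures using the `success` and `error_message` fields. The sketch below uses a hypothetical stand-in dataclass (`FakeResult`) that exposes only three of the fields listed above, so it runs without a browser; real `CrawlResult` objects carry many more attributes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResult:
    """Hypothetical stand-in exposing three of the fields listed above."""
    url: str
    success: bool
    error_message: Optional[str] = None


def summarize(results):
    """Return (list of successful URLs, dict of failed URL -> error message)."""
    ok = [r.url for r in results if r.success]
    failed = {r.url: r.error_message for r in results if not r.success}
    return ok, failed


# Made-up sample data for illustration
sample = [
    FakeResult("https://example.com/page1", True),
    FakeResult("https://example.com/page2", False, "timeout"),
    FakeResult("https://example.com/page3", True),
]
ok, failed = summarize(sample)
```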

---

## 6. Quick Example

The following example ties everything together:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json

async def main():
    # 1. Browser config
    browser_cfg = BrowserConfig(
        browser_type="firefox",
        headless=False,
        verbose=True
    )

    # 2. Run config
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {
                "name": "title", 
                "selector": "h2", 
                "type": "text"
            },
            {
                "name": "url", 
                "selector": "a", 
                "type": "attribute", 
                "attribute": "href"
            }
        ]
    }

    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        word_count_threshold=15,
        remove_overlay_elements=True,
        wait_for="css:.post"  # Wait for posts to appear
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",
            config=run_cfg
        )

        if result.success:
            print("Cleaned HTML length:", len(result.cleaned_html))
            if result.extracted_content:
                articles = json.loads(result.extracted_content)
                print("Extracted articles:", articles[:2])
        else:
            print("Error:", result.error_message)

asyncio.run(main())
```

**Explanation**:
- We define a **`BrowserConfig`** that uses Firefox with headless mode disabled and `verbose=True`.  
- We define a **`CrawlerRunConfig`** that **bypasses cache**, uses a **CSS** extraction schema, has a `word_count_threshold=15`, etc.  
- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.
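
After `json.loads(result.extracted_content)`, a small post-processing step is often useful. The sketch below works on a made-up payload shaped like the "Articles" schema above and resolves relative links against the page URL with `urllib.parse.urljoin`. It assumes extracted `href` values may be relative or missing; whether that holds depends on the target site. `normalize_articles` is a hypothetical helper, not part of Crawl4AI.

```python
import json
from urllib.parse import urljoin

# Made-up payload shaped like the "Articles" schema (title + url fields)
extracted_content = json.dumps([
    {"title": " First post ", "url": "/blog/first-post"},
    {"title": "Second post", "url": "https://example.com/blog/second-post"},
    {"title": "No link"},
])


def normalize_articles(raw_json, base_url):
    """Strip titles and turn relative links into absolute URLs."""
    articles = []
    for item in json.loads(raw_json):
        href = item.get("url")
        articles.append({
            "title": item.get("title", "").strip(),
            "url": urljoin(base_url, href) if href else None,
        })
    return articles


articles = normalize_articles(extracted_content, "https://example.com/blog")
```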

---

## 7. Best Practices & Migration Notes

1. **Use** `BrowserConfig` for **global** settings about the browser’s environment.  
2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions).  
3. **Avoid** legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead:

   ```python
   run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
   result = await crawler.arun(url="...", config=run_cfg)
   ```

4. **Context Manager** usage is simplest unless you want a persistent crawler across many calls.

---

## 8. Summary

**AsyncWebCrawler** is your entry point to asynchronous crawling:

- **Constructor** accepts **`BrowserConfig`** (or defaults).  
- **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls.  
- **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs.  
- For advanced lifecycle control, use `start()` and `close()` explicitly.  

**Migration**:  
- If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.

This modular approach ensures your code is **clean**, **scalable**, and **easy to maintain**. For any advanced or rarely used parameters, see the [BrowserConfig docs](../api/parameters.md).
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
+								# AsyncWebCrawler
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								**Recommended usage**:
 . **Create** a `BrowserConfig` for global browser settings.
 . **Instantiate** `AsyncWebCrawler(config=browser_config)`.
 . **Use** the crawler in an async context manager (`async with`) or manage start/close manually.
 . **Call** `arun(url, config=crawler_run_config)` for each page you want.
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								---
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								## 1. Constructor Overview
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
 								```python
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								class AsyncWebCrawler:
 								    def __init__(
 								        self,
 								        crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
 								        config: Optional[BrowserConfig] = None,
 								        always_bypass_cache: bool = False,           # deprecated
 								        always_by_pass_cache: Optional[bool] = None, # also deprecated
 								        base_directory: str = ...,
 								        thread_safe: bool = False,
 								        **kwargs,
 								    ):
 								        """
 								        Create an AsyncWebCrawler instance.
 								        Args:
-												Update all documents

											
										
										
											2025-01-08 19:31:31 +08:00
+								            crawler_strategy:
 								                (Advanced) Provide a custom crawler strategy if needed.
 								            config:
 								                A BrowserConfig object specifying how the browser is set up.
 								            always_bypass_cache:
 								                (Deprecated) Use CrawlerRunConfig.cache_mode instead.
 								            base_directory:
 								                Folder for storing caches/logs (if relevant).
 								            thread_safe:
 								                If True, attempts some concurrency safeguards. Usually False.
 								            **kwargs:
 								                Additional legacy or debugging parameters.
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								        """
-												Update all documents

											
										
										
											2025-01-08 19:31:31 +08:00
+								    )
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								### Typical Initialization
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
 								```python
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								from crawl4ai import AsyncWebCrawler, BrowserConfig
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								browser_cfg = BrowserConfig(
 								    browser_type="chromium",
 								    headless=True,
 								    verbose=True
 								)
 								crawler = AsyncWebCrawler(config=browser_cfg)
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
+								```
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								**Notes**:
 								- **Legacy** parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set **caching** in `CrawlerRunConfig`.
 								---
 								## 2. Lifecycle: Start/Close or Context Manager
 								### 2.1 Context Manager (Recommended)
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
+								```python
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								async with AsyncWebCrawler(config=browser_cfg) as crawler:
 								    result = await crawler.arun("https://example.com")
 								    # The crawler automatically starts/closes resources
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
+								```
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								When the `async with` block ends, the crawler cleans up (closes the browser, etc.).
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								### 2.2 Manual Start & Close
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
 								```python
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								crawler = AsyncWebCrawler(config=browser_cfg)
 								await crawler.start()
 								result1 = await crawler.arun("https://example.com")
 								result2 = await crawler.arun("https://another.com")
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								await crawler.close()
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
+								```
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								Use this style if you have a **long-running** application or need full control of the crawler’s lifecycle.
 								---
 								## 3. Primary Method: `arun()`
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
+								```python
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								async def arun(
 								    self,
 								    url: str,
 								    config: Optional[CrawlerRunConfig] = None,
 								    # Legacy parameters for backward compatibility...
 								) -> CrawlResult:
 								    ...
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
+								```
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								### 3.1 New Approach
 								You pass a `CrawlerRunConfig` object that sets up everything about a crawl—content filtering, caching, session reuse, JS code, screenshots, etc.
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
 								```python
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								import asyncio
 								from crawl4ai import CrawlerRunConfig, CacheMode
 								run_cfg = CrawlerRunConfig(
 								    cache_mode=CacheMode.BYPASS,
 								    css_selector="main.article",
 								    word_count_threshold=10,
 								    screenshot=True
 								)
 								async with AsyncWebCrawler(config=browser_cfg) as crawler:
 								    result = await crawler.arun("https://example.com/news", config=run_cfg)
 								    print("Crawled HTML length:", len(result.cleaned_html))
 								    if result.screenshot:
 								        print("Screenshot base64 length:", len(result.screenshot))
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
+								```
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								### 3.2 Legacy Parameters Still Accepted
 								For **backward** compatibility, `arun()` can still accept direct arguments like `css_selector=...`, `word_count_threshold=...`, etc., but we strongly advise migrating them into a **`CrawlerRunConfig`**.
 								---
-												refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring

Reorganize dispatcher functionality into separate components:
- Create dedicated dispatcher classes (MemoryAdaptive, Semaphore)
- Add RateLimiter for smart request throttling
- Implement CrawlerMonitor for real-time progress tracking
- Move dispatcher config from CrawlerRunConfig to separate classes

BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.

											
										
										
											2025-01-11 21:10:27 +08:00
+								## 4. Batch Processing: `arun_many()`
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
+								```python
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								async def arun_many(
 								    self,
 								    urls: List[str],
 								    config: Optional[CrawlerRunConfig] = None,
-												refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring

Reorganize dispatcher functionality into separate components:
- Create dedicated dispatcher classes (MemoryAdaptive, Semaphore)
- Add RateLimiter for smart request throttling
- Implement CrawlerMonitor for real-time progress tracking
- Move dispatcher config from CrawlerRunConfig to separate classes

BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.

											
										
										
											2025-01-11 21:10:27 +08:00
+								    # Legacy parameters maintained for backwards compatibility...
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								) -> List[CrawlResult]:
    """
    Process multiple URLs with intelligent rate limiting and resource monitoring.
    """
```

### 4.1 Resource-Aware Crawling

The `arun_many()` method now uses an intelligent dispatcher that:
- Monitors system memory usage
- Implements adaptive rate limiting
- Provides detailed progress monitoring
- Manages concurrent crawls efficiently
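
To make the idea of memory-aware concurrency concrete, here is a minimal sketch that uses only `asyncio`. It is not Crawl4AI's dispatcher implementation: `run_with_memory_gate`, `fetch`, and `memory_percent` are hypothetical names chosen for illustration, and the memory probe is passed in so the sketch runs without extra dependencies (in practice something like `psutil.virtual_memory().percent` could supply the reading).

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_with_memory_gate(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],   # hypothetical per-URL coroutine
    memory_percent: Callable[[], float],      # hypothetical memory probe
    memory_threshold_percent: float = 70.0,
    check_interval: float = 0.5,
    max_session_permit: int = 3,
) -> List[str]:
    """Run fetch() over urls, capping concurrency and pausing while memory is high."""
    permits = asyncio.Semaphore(max_session_permit)

    async def worker(url: str) -> str:
        # Hold off starting new work until memory drops below the threshold.
        while memory_percent() >= memory_threshold_percent:
            await asyncio.sleep(check_interval)
        async with permits:  # at most max_session_permit fetches run at once
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(worker(u) for u in urls))
```

The parameter names mirror the settings used in the example that follows (`memory_threshold_percent`, `check_interval`, `max_session_permit`), but how the library combines them internally is an assumption here.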

### 4.2 Example Usage

```python
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, RateLimitConfig
from crawl4ai.dispatcher import DisplayMode

# Configure browser
browser_cfg = BrowserConfig(headless=True)

# Configure crawler with rate limiting
run_cfg = CrawlerRunConfig(
    # Enable rate limiting
    enable_rate_limiting=True,
    rate_limit_config=RateLimitConfig(
        base_delay=(1.0, 2.0),  # Random delay between 1-2 seconds
        max_delay=30.0,         # Maximum delay after rate limit hits
        max_retries=2,          # Number of retries before giving up
        rate_limit_codes=[429, 503]  # Status codes that trigger rate limiting
    ),
    # Resource monitoring
    memory_threshold_percent=70.0,  # Pause if memory exceeds this
    check_interval=0.5,            # How often to check resources
    max_session_permit=3,          # Maximum concurrent crawls
    display_mode=DisplayMode.DETAILED.value  # Show detailed progress
)

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

async def main():
    # `async with` must run inside a coroutine, so wrap the crawl in main()
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        results = await crawler.arun_many(urls, config=run_cfg)
        for result in results:
            print(f"URL: {result.url}, Success: {result.success}")

asyncio.run(main())
```

### 4.3 Key Features

1. **Rate Limiting**
   - Automatic delay between requests
   - Exponential backoff on rate limit detection
   - Domain-specific rate limiting
   - Configurable retry strategy

2. **Resource Monitoring**
   - Memory usage tracking
   - Adaptive concurrency based on system load
   - Automatic pausing when resources are constrained

3. **Progress Monitoring**
   - Detailed or aggregated progress display
   - Real-time status updates
   - Memory usage statistics

4. **Error Handling**
   - Graceful handling of rate limits
   - Automatic retries with backoff
   - Detailed error reporting
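
The rate-limiting and retry behavior listed above can be illustrated with a small, library-independent sketch. The helpers `next_delay` and `should_retry` are hypothetical, and the doubling factor is an assumption; the source only says the backoff is exponential, with a random base delay, a cap, a retry limit, and a set of status codes that trigger it.

```python
import random
from typing import Tuple


def next_delay(
    attempt: int,
    base_delay: Tuple[float, float] = (1.0, 2.0),
    max_delay: float = 30.0,
) -> float:
    """Seconds to wait before retry number `attempt` (0 = no rate-limit hit yet)."""
    low, high = base_delay
    # Random base delay, doubled per previous rate-limit hit (assumed factor),
    # and capped at max_delay.
    return min(random.uniform(low, high) * (2 ** attempt), max_delay)


def should_retry(
    status_code: int,
    attempt: int,
    max_retries: int = 2,
    rate_limit_codes: Tuple[int, ...] = (429, 503),
) -> bool:
    """Retry only on rate-limit status codes, and only while retries remain."""
    return status_code in rate_limit_codes and attempt < max_retries
```

With the defaults shown, the first wait falls between 1 and 2 seconds, and a URL that keeps returning 429 is abandoned after two retries.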

---

## 5. `CrawlResult` Output

Each `arun()` returns a **`CrawlResult`** containing:
- `url`: Final URL (after any redirects).
- `html`: Original HTML.
- `cleaned_html`: Sanitized HTML.
- `markdown_v2` (or future `markdown`): Markdown outputs (raw, fit, etc.).
- `extracted_content`: Present if an extraction strategy was used (JSON for CSS/LLM strategies).
- `screenshot`, `pdf`: Present if a screenshot or PDF was requested.
- `media`, `links`: Information about discovered images/links.
- `success`, `error_message`: Status info.

For details, see [CrawlResult doc](./crawl-result.md).
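
As a rough illustration of how these fields are typically consumed, the sketch below builds a small summary from any object exposing the attributes listed above. `summarize_result` is a hypothetical helper, not part of Crawl4AI, and it assumes `extracted_content` is a JSON string whenever it is set.

```python
import json
from typing import Any, Dict


def summarize_result(result: Any) -> Dict[str, Any]:
    """Build a small summary dict from a CrawlResult-like object."""
    if not result.success:
        # On failure, only the URL and the error message are meaningful.
        return {"url": result.url, "ok": False, "error": result.error_message}

    summary: Dict[str, Any] = {
        "url": result.url,
        "ok": True,
        "html_chars": len(result.html or ""),
        "cleaned_html_chars": len(result.cleaned_html or ""),
    }
    # Assumed: extracted_content holds JSON text when an extraction strategy ran.
    if result.extracted_content:
        summary["extracted"] = json.loads(result.extracted_content)
    return summary
```

Checking `success` first keeps the rest of the code from touching fields that may be empty on a failed crawl.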

---

## 6. Quick Example

Below is an example that ties it all together:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json

async def main():
    # 1. Browser config
    browser_cfg = BrowserConfig(
        browser_type="firefox",
        headless=False,
        verbose=True
    )

    # 2. Run config
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {
                "name": "title",
                "selector": "h2",
                "type": "text"
            },
            {
                "name": "url",
                "selector": "a",
                "type": "attribute",
                "attribute": "href"
            }
        ]
    }

    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        word_count_threshold=15,
        remove_overlay_elements=True,
        wait_for="css:.post"  # Wait for posts to appear
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",
            config=run_cfg
        )

        if result.success:
            print("Cleaned HTML length:", len(result.cleaned_html))
            if result.extracted_content:
                articles = json.loads(result.extracted_content)
                print("Extracted articles:", articles[:2])
        else:
            print("Error:", result.error_message)

asyncio.run(main())
```

**Explanation**:
- We define a **`BrowserConfig`** with Firefox, headless mode disabled, and `verbose=True`.
- We define a **`CrawlerRunConfig`** that **bypasses cache**, uses a **CSS** extraction schema, sets `word_count_threshold=15`, removes overlay elements, and waits for `.post` elements to appear.
- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.

---

## 7. Best Practices & Migration Notes

1. **Use** `BrowserConfig` for **global** settings about the browser’s environment.
2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions).
3. **Avoid** legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead:
   ```python
   run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
   result = await crawler.arun(url="...", config=run_cfg)
   ```
4. **Context Manager** usage is simplest unless you want a persistent crawler across many calls.
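
For the persistent-crawler case mentioned in the last item, a manual lifecycle might look like the sketch below. `crawl_with_manual_lifecycle` is a hypothetical helper; it assumes only that the crawler object exposes the `start()`, `arun()`, and `close()` coroutines this page describes, so it is written against that shape rather than importing the library.

```python
from typing import Any, Iterable, List


async def crawl_with_manual_lifecycle(
    crawler: Any, urls: Iterable[str], run_cfg: Any
) -> List[Any]:
    """Start the crawler once, reuse it for every URL, and always close it."""
    await crawler.start()
    try:
        results = []
        for url in urls:
            # One shared crawler serves every URL in turn.
            results.append(await crawler.arun(url=url, config=run_cfg))
        return results
    finally:
        # Runs even if a crawl raises, so the crawler is not left open.
        await crawler.close()
```

With a real crawler this would be awaited as `crawl_with_manual_lifecycle(AsyncWebCrawler(config=browser_cfg), urls, run_cfg)`. The `try`/`finally` gives the same cleanup guarantee that `async with` provides automatically.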

---

## 8. Summary

**AsyncWebCrawler** is your entry point to asynchronous crawling:
- **Constructor** accepts **`BrowserConfig`** (or defaults).
- **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls.
- **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs.
- For advanced lifecycle control, use `start()` and `close()` explicitly.

**Migration**:
- If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.

This modular approach ensures your code is **clean**, **scalable**, and **easy to maintain**. For any advanced or rarely used parameters, see the [BrowserConfig docs](../api/parameters.md).