# AsyncWebCrawler
The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.
**Recommended usage**:
1. **Create** a `BrowserConfig` for global browser settings.
2. **Instantiate** `AsyncWebCrawler(config=browser_config)`.
3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually.
4. **Call** `arun(url, config=crawler_run_config)` for each page you want.
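
A minimal sketch of those four steps (the URL is only a placeholder):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(headless=True)   # 1. global browser settings
    run_cfg = CrawlerRunConfig()                 # per-crawl settings (defaults here)

    # 2 & 3. instantiate and use as an async context manager
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        # 4. run a crawl for each page you want
        result = await crawler.arun("https://example.com", config=run_cfg)
        print(result.success)

asyncio.run(main())
```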
---
## 1. Constructor Overview
```python
class AsyncWebCrawler:
    def __init__(
        self,
        crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
        config: Optional[BrowserConfig] = None,
        always_bypass_cache: bool = False,            # deprecated
        always_by_pass_cache: Optional[bool] = None,  # also deprecated
        base_directory: str = ...,
        thread_safe: bool = False,
        **kwargs,
    ):
        """
        Create an AsyncWebCrawler instance.

        Args:
            crawler_strategy:
                (Advanced) Provide a custom crawler strategy if needed.
            config:
                A BrowserConfig object specifying how the browser is set up.
            always_bypass_cache:
                (Deprecated) Use CrawlerRunConfig.cache_mode instead.
            base_directory:
                Folder for storing caches/logs (if relevant).
            thread_safe:
                If True, attempts some concurrency safeguards. Usually False.
            **kwargs:
                Additional legacy or debugging parameters.
        """
```
### Typical Initialization
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_cfg = BrowserConfig(
    browser_type="chromium",
    headless=True,
    verbose=True
)

crawler = AsyncWebCrawler(config=browser_cfg)
```
**Notes**:
- **Legacy** parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set **caching** in `CrawlerRunConfig`.
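
For example, rather than constructing the crawler with `always_bypass_cache=True`, the preferred pattern controls caching per run (a minimal sketch; the URL is a placeholder):

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

browser_cfg = BrowserConfig(headless=True)
run_cfg = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)  # replaces always_bypass_cache=True

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com", config=run_cfg)
```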
---
## 2. Lifecycle: Start/Close or Context Manager
### 2.1 Context Manager (Recommended)
```python
async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com")
    # The crawler automatically starts/closes resources
```
When the `async with` block ends, the crawler cleans up (closes the browser, etc.).
### 2.2 Manual Start & Close
```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()
result1 = await crawler.arun("https://example.com")
result2 = await crawler.arun("https://another.com")
await crawler.close()
```
Use this style if you have a **long-running** application or need full control of the crawler's lifecycle.
---
## 3. Primary Method: `arun()`
```python
async def arun(
    self,
    url: str,
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters for backward compatibility...
) -> CrawlResult:
    ...
```
### 3.1 New Approach
You pass a `CrawlerRunConfig` object that sets up everything about a crawl—content filtering, caching, session reuse, JS code, screenshots, etc.
```python
import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode

run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    css_selector="main.article",
    word_count_threshold=10,
    screenshot=True
)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com/news", config=run_cfg)
    print("Crawled HTML length:", len(result.cleaned_html))
    if result.screenshot:
        print("Screenshot base64 length:", len(result.screenshot))
```
### 3.2 Legacy Parameters Still Accepted
For **backward** compatibility, `arun()` can still accept direct arguments like `css_selector=...`, `word_count_threshold=...`, etc., but we strongly advise migrating them into a **`CrawlerRunConfig`**.
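
For instance, a legacy call and its config-based equivalent might look like this (the URL and selector are placeholders):

```python
# Legacy style: options passed directly to arun() (still accepted)
result = await crawler.arun(
    "https://example.com",
    css_selector="main.article",
    word_count_threshold=10
)

# Preferred: collect the same options in a CrawlerRunConfig
run_cfg = CrawlerRunConfig(css_selector="main.article", word_count_threshold=10)
result = await crawler.arun("https://example.com", config=run_cfg)
```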
---
## 4. Helper Methods
### 4.1 `arun_many()`
```python
async def arun_many(
    self,
    urls: List[str],
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters...
) -> List[CrawlResult]:
    ...
```
Crawls multiple URLs concurrently. Accepts the same style of `CrawlerRunConfig`. Example:
```python
run_cfg = CrawlerRunConfig(
    # e.g., concurrency, wait_for, caching, extraction, etc.
    semaphore_count=5
)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    results = await crawler.arun_many(
        urls=["https://example.com", "https://another.com"],
        config=run_cfg
    )
    for r in results:
        print(r.url, ":", len(r.cleaned_html))
```
### 4.2 `start()` & `close()`
Allows manual lifecycle usage instead of context manager:
```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()
# Perform multiple operations
resultA = await crawler.arun("https://exampleA.com", config=run_cfg)
resultB = await crawler.arun("https://exampleB.com", config=run_cfg)
await crawler.close()
```
---
## 5. `CrawlResult` Output
Each `arun()` returns a **`CrawlResult`** containing:
- `url`: The final URL of the page (after any redirects).
- `html`: Original HTML.
- `cleaned_html`: Sanitized HTML.
- `markdown_v2` (or future `markdown`): Markdown outputs (raw, fit, etc.).
- `extracted_content`: If an extraction strategy was used (JSON for CSS/LLM strategies).
- `screenshot`, `pdf`: If screenshots/PDF requested.
- `media`, `links`: Information about discovered images/links.
- `success`, `error_message`: Status info.
For details, see [CrawlResult doc](./crawl-result.md).
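
A short sketch of reading these fields (the `images` and `internal` dict keys are assumed here per the CrawlResult docs):

```python
result = await crawler.arun("https://example.com", config=run_cfg)

if result.success:
    print("Final URL:", result.url)
    print("Cleaned HTML length:", len(result.cleaned_html))
    # media and links are dicts; "images"/"internal" keys assumed per the CrawlResult docs
    print("Images found:", len(result.media.get("images", [])))
    print("Internal links:", len(result.links.get("internal", [])))
else:
    print("Crawl failed:", result.error_message)
```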
---
## 6. Quick Example

Below is an example putting it all together:
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # 1. Browser config
    browser_cfg = BrowserConfig(
        browser_type="firefox",
        headless=False,
        verbose=True
    )

    # 2. Run config
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {
                "name": "title",
                "selector": "h2",
                "type": "text"
            },
            {
                "name": "url",
                "selector": "a",
                "type": "attribute",
                "attribute": "href"
            }
        ]
    }

    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        word_count_threshold=15,
        remove_overlay_elements=True,
        wait_for="css:.post"  # Wait for posts to appear
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",
            config=run_cfg
        )

        if result.success:
            print("Cleaned HTML length:", len(result.cleaned_html))
            if result.extracted_content:
                articles = json.loads(result.extracted_content)
                print("Extracted articles:", articles[:2])
        else:
            print("Error:", result.error_message)

asyncio.run(main())
```
**Explanation**:
- We define a **`BrowserConfig`** with Firefox, `headless=False`, and `verbose=True`.
- We define a **`CrawlerRunConfig`** that **bypasses cache**, uses a **CSS** extraction schema, has a `word_count_threshold=15`, etc.
- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.
---
## 7. Best Practices & Migration Notes
1. **Use** `BrowserConfig` for **global** settings for the browser's environment.
2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions).
3. **Avoid** legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead:
```python
run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
result = await crawler.arun(url="...", config=run_cfg)
```
4. **Context Manager** usage is simplest unless you want a persistent crawler across many calls.
---
## 8. Summary
**AsyncWebCrawler** is your entry point to asynchronous crawling:
- **Constructor** accepts **`BrowserConfig`** (or defaults).
- **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls.
- **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs.
- For advanced lifecycle control, use `start()` and `close()` explicitly.
**Migration**:
- If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.
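
A sketch of that migration (`.main-content` and the URL are placeholders):

```python
# Old style (browser and crawl options mixed into one call):
# crawler = AsyncWebCrawler(browser_type="chromium")
# result = await crawler.arun("https://example.com", css_selector=".main-content")

# New style: split global browser settings from per-crawl logic
browser_cfg = BrowserConfig(browser_type="chromium", headless=True)
run_cfg = CrawlerRunConfig(css_selector=".main-content")

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com", config=run_cfg)
```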
This modular approach ensures your code is **clean**, **scalable**, and **easy to maintain**. For any advanced or rarely used parameters, see the [BrowserConfig docs](../api/parameters.md).