# `arun()` Parameter Guide (New Approach)
In Crawl4AI's **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide:
```python
await crawler.arun(
    url="https://example.com",
    config=my_run_config
)
```
Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For **Browser** settings (e.g., `headless`, `browser_type`), see [BrowserConfig](./parameters.md).
---
## 1. Core Usage
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    run_config = CrawlerRunConfig(
        verbose=True,                  # Detailed logging
        cache_mode=CacheMode.ENABLED,  # Use normal read/write cache
        check_robots_txt=True,         # Respect robots.txt rules
        # ... other parameters
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        # Check if blocked by robots.txt
        if not result.success and result.status_code == 403:
            print(f"Error: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
**Key Fields**:
- `verbose=True` logs each crawl step.
- `cache_mode` decides how to read/write the local crawl cache.
---
## 2. Cache Control
**`cache_mode`** (default: `CacheMode.ENABLED`)
Use a built-in enum from `CacheMode`:
- `ENABLED`: Normal caching—reads if available, writes if missing.
- `DISABLED`: No caching—always refetch pages.
- `READ_ONLY`: Reads from cache only; no new writes.
- `WRITE_ONLY`: Writes to cache but doesn't read existing data.
- `BYPASS`: Skips reading cache for this crawl (though it might still write if set up that way).
```python
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS
)
```
**Additional flags**:
- `bypass_cache=True` acts like `CacheMode.BYPASS`.
- `disable_cache=True` acts like `CacheMode.DISABLED`.
- `no_cache_read=True` acts like `CacheMode.WRITE_ONLY`.
- `no_cache_write=True` acts like `CacheMode.READ_ONLY`.
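These boolean shortcuts map onto the enum values as listed above. A minimal sketch of the equivalence (the explicit enum form is usually the clearer choice):

```python
from crawl4ai import CrawlerRunConfig, CacheMode

# The boolean shortcut...
flag_config = CrawlerRunConfig(no_cache_write=True)

# ...is intended to behave like the explicit enum value (per the mapping above).
enum_config = CrawlerRunConfig(cache_mode=CacheMode.READ_ONLY)
```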
---
## 3. Content Processing & Selection
### 3.1 Text Processing
```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,     # Ignore text blocks <10 words
    only_text=False,             # If True, tries to remove non-text elements
    keep_data_attributes=False   # Keep or discard data-* attributes
)
```
### 3.2 Content Selection
```python
run_config = CrawlerRunConfig(
    css_selector=".main-content",   # Focus on the .main-content region only
    excluded_tags=["form", "nav"],  # Remove entire tag blocks
    remove_forms=True,              # Specifically strip <form> elements
    remove_overlay_elements=True,   # Attempt to remove modals/popups
)
```
### 3.3 Link Handling
```python
run_config = CrawlerRunConfig(
    exclude_external_links=True,          # Remove external links from final content
    exclude_social_media_links=True,      # Remove links to known social sites
    exclude_domains=["ads.example.com"],  # Exclude links to these domains
    exclude_social_media_domains=["facebook.com", "twitter.com"],  # Extend the default list
)
```
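To verify the effect of these exclusions, you can inspect the links attached to the crawl result. A minimal sketch, assuming the result object groups discovered links under `result.links["internal"]` and `result.links["external"]`:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    run_config = CrawlerRunConfig(
        exclude_external_links=True,
        exclude_social_media_links=True,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=run_config)
        if result.success:
            # Assumption: links are grouped by scope on the result object.
            internal = result.links.get("internal", [])
            external = result.links.get("external", [])
            print(f"Internal links: {len(internal)}, external links kept: {len(external)}")

asyncio.run(main())
```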
### 3.4 Media Filtering
```python
run_config = CrawlerRunConfig(
    exclude_external_images=True  # Strip images from other domains
)
```
---
## 4. Page Navigation & Timing
### 4.1 Basic Browser Flow
```python
run_config = CrawlerRunConfig(
    wait_for="css:.dynamic-content",  # Wait for .dynamic-content
    delay_before_return_html=2.0,     # Wait 2s before capturing final HTML
    page_timeout=60000,               # Navigation & script timeout (ms)
)
```
**Key Fields**:
- `wait_for`:
    - `"css:selector"` or
    - `"js:() => boolean"`,
      e.g. `js:() => document.querySelectorAll('.item').length > 10`.
- `mean_delay` & `max_range`: define random delays for `arun_many()` calls.
- `semaphore_count`: concurrency limit when crawling multiple URLs.
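For multi-URL runs, these fields go into the same config object. A minimal sketch using `arun_many()`, assuming it accepts the same `config` argument as `arun()`:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    run_config = CrawlerRunConfig(
        mean_delay=1.0,     # mean of the random delay between requests (seconds)
        max_range=0.5,      # random range added on top of that mean
        semaphore_count=5,  # crawl at most 5 URLs concurrently
    )
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ]
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls, config=run_config)
        for result in results:
            print(result.url, "ok" if result.success else result.error_message)

asyncio.run(main())
```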
### 4.2 JavaScript Execution
```python
run_config = CrawlerRunConfig(
    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more')?.click();"
    ],
    js_only=False
)
```
- `js_code` can be a single string or a list of strings.
- `js_only=True` means “I'm continuing in the same session with new JS steps, no new full navigation.”
### 4.3 Anti-Bot
```python
run_config = CrawlerRunConfig(
    magic=True,
    simulate_user=True,
    override_navigator=True
)
```
- `magic=True` tries multiple stealth features.
- `simulate_user=True` mimics mouse movements or random delays.
- `override_navigator=True` fakes some navigator properties (like user agent checks).
---
## 5. Session Management
**`session_id`**:
```python
run_config = CrawlerRunConfig(
    session_id="my_session123"
)
```
If re-used in subsequent `arun()` calls, the same tab/page context is continued (helpful for multi-step tasks or stateful browsing).
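A minimal two-step sketch: the first call opens the session, the second re-uses it with `js_only=True` so only the new JS steps run, without a fresh navigation (the URL and selectors are illustrative):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    session_id = "my_session123"
    async with AsyncWebCrawler() as crawler:
        # Step 1: initial navigation opens the tab/page tied to this session.
        first = await crawler.arun(
            "https://example.com/list",
            config=CrawlerRunConfig(session_id=session_id),
        )

        # Step 2: same session, new JS steps only -- no new full navigation.
        second = await crawler.arun(
            "https://example.com/list",
            config=CrawlerRunConfig(
                session_id=session_id,
                js_only=True,
                js_code="document.querySelector('.load-more')?.click();",
                wait_for="css:.item",
            ),
        )
        print(first.success, second.success)

asyncio.run(main())
```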
---
## 6. Screenshot, PDF & Media Options
```python
run_config = CrawlerRunConfig(
    screenshot=True,                          # Grab a screenshot as base64
    screenshot_wait_for=1.0,                  # Wait 1s before capturing
    pdf=True,                                 # Also produce a PDF
    image_description_min_word_threshold=5,   # If analyzing alt text
    image_score_threshold=3,                  # Filter out low-score images
)
```
**Where they appear**:
- `result.screenshot` → Base64 screenshot string.
- `result.pdf` → Byte array with PDF data.
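Both can be written to disk once the crawl finishes. A minimal sketch, assuming the screenshot is a base64-encoded image and the PDF is raw bytes as noted above (file names are arbitrary):

```python
import asyncio
import base64
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    run_config = CrawlerRunConfig(screenshot=True, pdf=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=run_config)
        if result.success:
            if result.screenshot:
                # Decode the base64 string back into image bytes before saving.
                with open("page.png", "wb") as f:
                    f.write(base64.b64decode(result.screenshot))
            if result.pdf:
                # The PDF is already raw bytes.
                with open("page.pdf", "wb") as f:
                    f.write(result.pdf)

asyncio.run(main())
```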
---
## 7. Extraction Strategy
**For advanced data extraction** (CSS/LLM-based), set `extraction_strategy`:
```python
run_config = CrawlerRunConfig(
    extraction_strategy=my_css_or_llm_strategy
)
```
The extracted data will appear in `result.extracted_content`.
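Since the extracted content is typically a JSON string, it can be parsed back into Python objects. A minimal sketch, reusing the placeholder strategy from the snippet above:

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # `my_css_or_llm_strategy` is the placeholder strategy from the snippet above.
    run_config = CrawlerRunConfig(extraction_strategy=my_css_or_llm_strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=run_config)
        if result.success and result.extracted_content:
            data = json.loads(result.extracted_content)  # assumes JSON-serialized output
            print(f"Extracted {len(data)} records")

asyncio.run(main())
```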
---
## 8. Comprehensive Example
Below is a snippet combining many parameters:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Example schema
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    run_config = CrawlerRunConfig(
        # Core
        verbose=True,
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True,  # Respect robots.txt rules

        # Content
        word_count_threshold=10,
        css_selector="main.content",
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,

        # Page & JS
        js_code="document.querySelector('.show-more')?.click();",
        wait_for="css:.loaded-block",
        page_timeout=30000,

        # Extraction
        extraction_strategy=JsonCssExtractionStrategy(schema),

        # Session
        session_id="persistent_session",

        # Media
        screenshot=True,
        pdf=True,

        # Anti-bot
        simulate_user=True,
        magic=True,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/posts", config=run_config)
        if result.success:
            print("HTML length:", len(result.cleaned_html))
            print("Extraction JSON:", result.extracted_content)
            if result.screenshot:
                print("Screenshot length:", len(result.screenshot))
            if result.pdf:
                print("PDF bytes length:", len(result.pdf))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
**What we covered**:
1. **Crawling** the main content region, ignoring external links.
2. Running **JavaScript** to click “.show-more”.
3. **Waiting** for “.loaded-block” to appear.
4. Generating a **screenshot** & **PDF** of the final page.
5. Extracting repeated “article.post” elements with a **CSS-based** extraction strategy.
---
## 9. Best Practices
1. **Use `BrowserConfig` for global browser** settings (headless, user agent).
2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshot, extraction, etc.
3. Keep your **parameters consistent** in run configs, especially if you're part of a large codebase with multiple crawls (see the sketch after this list for one reusable pattern).
4. **Limit** large concurrency (`semaphore_count`) if the site or your system can't handle it.
5. For dynamic pages, set `js_code` or `scan_full_page` so you load all content.
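A minimal sketch of that split (practices 1-3): one `BrowserConfig` for global browser settings, plus reusable `CrawlerRunConfig` objects per task. It assumes `AsyncWebCrawler` accepts the browser settings via its `config` argument, as covered in the [BrowserConfig](./parameters.md) docs:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # Global browser settings live in BrowserConfig...
    browser_config = BrowserConfig(headless=True)

    # ...while per-crawl behavior lives in reusable CrawlerRunConfig objects.
    article_config = CrawlerRunConfig(
        css_selector="main.content",
        cache_mode=CacheMode.ENABLED,
    )
    listing_config = CrawlerRunConfig(
        css_selector=".listing",
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        article = await crawler.arun("https://example.com/post/1", config=article_config)
        listing = await crawler.arun("https://example.com/posts", config=listing_config)
        print(article.success, listing.success)

asyncio.run(main())
```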
---
## 10. Conclusion
All parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**. This approach:
- Makes code **clearer** and **more maintainable**.
- Minimizes confusion about which arguments affect global vs. per-crawl behavior.
- Allows you to create **reusable** config objects for different pages or tasks.
For a **full** reference, check out the [CrawlerRunConfig Docs](./parameters.md).
Happy crawling with your **structured, flexible** config approach!