
# `arun()` Parameter Guide (New Approach)

In Crawl4AI’s **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide:

```python
await crawler.arun(
    url="https://example.com",
    config=my_run_config
)
```

Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For **Browser** settings (e.g., `headless`, `browser_type`), see [BrowserConfig](./parameters.md).

---

## 1. Core Usage

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    run_config = CrawlerRunConfig(
        verbose=True,                  # Detailed logging
        cache_mode=CacheMode.ENABLED,  # Use normal read/write cache
        check_robots_txt=True,         # Respect robots.txt rules
        # ... other parameters
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )

        # Check if blocked by robots.txt
        if not result.success and result.status_code == 403:
            print(f"Error: {result.error_message}")
```

**Key Fields**:

- `verbose=True` logs each crawl step.
- `cache_mode` decides how to read/write the local crawl cache.

---

## 2. Cache Control

**`cache_mode`** (default: `CacheMode.ENABLED`)
Use a built-in enum from `CacheMode`:

- `ENABLED`: Normal caching—reads if available, writes if missing.
- `DISABLED`: No caching—always refetch pages.
- `READ_ONLY`: Reads from cache only; no new writes.
- `WRITE_ONLY`: Writes to cache but doesn’t read existing data.
- `BYPASS`: Skips reading cache for this crawl (though it might still write if set up that way).

```python
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS
)
```

**Additional flags**:

- `bypass_cache=True` acts like `CacheMode.BYPASS`.
- `disable_cache=True` acts like `CacheMode.DISABLED`.
- `no_cache_read=True` acts like `CacheMode.WRITE_ONLY`.
- `no_cache_write=True` acts like `CacheMode.READ_ONLY`.

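If you prefer the flags over the enum, a minimal sketch of the two equivalent forms (both produce a config that skips reading the cache for this crawl):

```python
# Two equivalent ways to skip reading the cache for a single crawl:
enum_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
flag_config = CrawlerRunConfig(bypass_cache=True)  # legacy-style flag, acts like CacheMode.BYPASS
```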

---

## 3. Content Processing & Selection

### 3.1 Text Processing

```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,    # Ignore text blocks <10 words
    only_text=False,            # If True, tries to remove non-text elements
    keep_data_attributes=False  # Keep or discard data-* attributes
)
```

### 3.2 Content Selection

```python
run_config = CrawlerRunConfig(
    css_selector=".main-content",   # Focus on .main-content region only
    excluded_tags=["form", "nav"],  # Remove entire tag blocks
    remove_forms=True,              # Specifically strip <form> elements
    remove_overlay_elements=True,   # Attempt to remove modals/popups
)
```

### 3.3 Link Handling

```python
run_config = CrawlerRunConfig(
    exclude_external_links=True,          # Remove external links from final content
    exclude_social_media_links=True,      # Remove links to known social sites
    exclude_domains=["ads.example.com"],  # Exclude links to these domains
    exclude_social_media_domains=["facebook.com", "twitter.com"],  # Extend the default list
)
```

### 3.4 Media Filtering

```python
run_config = CrawlerRunConfig(
    exclude_external_images=True  # Strip images from other domains
)
```

---

## 4. Page Navigation & Timing

### 4.1 Basic Browser Flow

```python
run_config = CrawlerRunConfig(
    wait_for="css:.dynamic-content",  # Wait for .dynamic-content
    delay_before_return_html=2.0,     # Wait 2s before capturing final HTML
    page_timeout=60000,               # Navigation & script timeout (ms)
)
```

**Key Fields**:

- `wait_for`:
  - `"css:selector"` or
  - `"js:() => boolean"`, e.g. `js:() => document.querySelectorAll('.item').length > 10`.

- `mean_delay` & `max_range`: define random delays for `arun_many()` calls.
- `semaphore_count`: concurrency limit when crawling multiple URLs.

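As a rough sketch of how those multi-URL fields fit together with `arun_many()` (the URLs and values here are placeholders, not recommendations):

```python
run_config = CrawlerRunConfig(
    mean_delay=1.0,      # Average delay between requests in arun_many()
    max_range=0.5,       # Extra random delay added on top of mean_delay
    semaphore_count=5    # Crawl at most 5 URLs concurrently
)

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(
        urls=["https://example.com/a", "https://example.com/b"],
        config=run_config
    )
```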
### 4.2 JavaScript Execution

```python
run_config = CrawlerRunConfig(
    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more')?.click();"
    ],
    js_only=False
)
```

- `js_code` can be a single string or a list of strings.
- `js_only=True` means “I’m continuing in the same session with new JS steps, no new full navigation.”

### 4.3 Anti-Bot

```python
run_config = CrawlerRunConfig(
    magic=True,
    simulate_user=True,
    override_navigator=True
)
```

- `magic=True` tries multiple stealth features.
- `simulate_user=True` mimics mouse movements or random delays.
- `override_navigator=True` fakes some navigator properties (like user agent checks).

---

## 5. Session Management

**`session_id`**:

```python
run_config = CrawlerRunConfig(
    session_id="my_session123"
)
```

If the same `session_id` is re-used in subsequent `arun()` calls, the same tab/page context is continued, which is helpful for multi-step tasks or stateful browsing.

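For example, a minimal two-step sketch (the URL and selector are placeholders) that combines `session_id` with the `js_code`/`js_only` fields from section 4.2:

```python
session_config = CrawlerRunConfig(session_id="my_session123")

async with AsyncWebCrawler() as crawler:
    # Step 1: normal navigation opens the page in a named session
    first = await crawler.arun("https://example.com/step1", config=session_config)

    # Step 2: reuse the same tab; run extra JS without a fresh navigation
    follow_up = CrawlerRunConfig(
        session_id="my_session123",
        js_code="document.querySelector('.next-page')?.click();",  # placeholder selector
        js_only=True
    )
    second = await crawler.arun("https://example.com/step1", config=follow_up)
```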

---

## 6. Screenshot, PDF & Media Options

```python
run_config = CrawlerRunConfig(
    screenshot=True,          # Grab a screenshot as base64
    screenshot_wait_for=1.0,  # Wait 1s before capturing
    pdf=True,                 # Also produce a PDF
    image_description_min_word_threshold=5,  # If analyzing alt text
    image_score_threshold=3,  # Filter out low-score images
)
```

**Where they appear**:

- `result.screenshot` → Base64 screenshot string.
- `result.pdf` → Byte array with PDF data.

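For example, a small sketch for saving both artifacts to disk (file names are placeholders); the screenshot is base64-decoded first, while the PDF bytes can be written directly:

```python
import base64

# Decode the base64 screenshot string and write it as an image file
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))

# The PDF is already raw bytes, so write it as-is
if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)
```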

---

## 7. Extraction Strategy

**For advanced data extraction** (CSS/LLM-based), set `extraction_strategy`:

```python
run_config = CrawlerRunConfig(
    extraction_strategy=my_css_or_llm_strategy
)
```

The extracted data will appear in `result.extracted_content`.

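If your strategy emits JSON (as the CSS/LLM strategies typically do), a minimal sketch for turning `result.extracted_content` back into Python objects; this assumes the strategy returns a JSON string:

```python
import json

# Assumes result.extracted_content holds a JSON string produced by the strategy
if result.success and result.extracted_content:
    items = json.loads(result.extracted_content)
    print(len(items), "items extracted")
```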

---

## 8. Comprehensive Example

Below is a snippet combining many parameters:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Example schema
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    run_config = CrawlerRunConfig(
        # Core
        verbose=True,
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True,  # Respect robots.txt rules

        # Content
        word_count_threshold=10,
        css_selector="main.content",
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,

        # Page & JS
        js_code="document.querySelector('.show-more')?.click();",
        wait_for="css:.loaded-block",
        page_timeout=30000,

        # Extraction
        extraction_strategy=JsonCssExtractionStrategy(schema),

        # Session
        session_id="persistent_session",

        # Media
        screenshot=True,
        pdf=True,

        # Anti-bot
        simulate_user=True,
        magic=True,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/posts", config=run_config)
        if result.success:
            print("HTML length:", len(result.cleaned_html))
            print("Extraction JSON:", result.extracted_content)
            if result.screenshot:
                print("Screenshot length:", len(result.screenshot))
            if result.pdf:
                print("PDF bytes length:", len(result.pdf))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

**What we covered**:

1. **Crawling** the main content region, ignoring external links.
2. Running **JavaScript** to click “.show-more”.
3. **Waiting** for “.loaded-block” to appear.
4. Generating a **screenshot** & **PDF** of the final page.
5. Extracting repeated “article.post” elements with a **CSS-based** extraction strategy.

---

## 9. Best Practices

1. **Use `BrowserConfig` for global browser** settings (headless, user agent).
2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshots, extraction, etc.
3. Keep your **parameters consistent** across run configs—especially if you’re working in a large codebase with multiple crawls.
4. **Limit** large concurrency (`semaphore_count`) if the site or your system can’t handle it.
5. For dynamic pages, set `js_code` or `scan_full_page` so you load all content.

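For instance, a minimal sketch for lazy-loading pages; `scan_full_page` is mentioned above, while the `scroll_delay` companion shown here is an assumption about your installed version:

```python
run_config = CrawlerRunConfig(
    scan_full_page=True,  # Auto-scroll so lazy-loaded content is triggered
    scroll_delay=0.5      # Pause between scroll steps (seconds); assumed companion setting
)
```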

---

## 10. Conclusion

All parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**. This approach:

- Makes code **clearer** and **more maintainable**.
- Minimizes confusion about which arguments affect global vs. per-crawl behavior.
- Allows you to create **reusable** config objects for different pages or tasks.

For a **full** reference, check out the [CrawlerRunConfig Docs](./parameters.md).

Happy crawling with your **structured, flexible** config approach!