# `arun()` Parameter Guide (New Approach)
In Crawl4AI's **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide:
```python
await crawler.arun(
    url="https://example.com",
    config=my_run_config
)
```
Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For **Browser** settings (e.g., `headless`, `browser_type`), see [BrowserConfig](./parameters.md).
---
## 1. Core Usage
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    run_config = CrawlerRunConfig(
        verbose=True,                  # Detailed logging
        cache_mode=CacheMode.ENABLED,  # Use normal read/write cache
        check_robots_txt=True,         # Respect robots.txt rules
        # ... other parameters
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )

        # Check if blocked by robots.txt
        if not result.success and result.status_code == 403:
            print(f"Error: {result.error_message}")

asyncio.run(main())
```
**Key Fields**:
- `verbose=True` logs each crawl step. 
- `cache_mode` decides how to read/write the local crawl cache.
---
## 2. Cache Control
**`cache_mode`** (default: `CacheMode.ENABLED`)
Use a built-in enum from `CacheMode`:
- `ENABLED`: Normal caching—reads if available, writes if missing.
- `DISABLED`: No caching—always refetch pages.
- `READ_ONLY`: Reads from cache only; no new writes.
- `WRITE_ONLY`: Writes to cache but doesn't read existing data.
- `BYPASS`: Skips reading cache for this crawl (though it might still write if set up that way).
```python
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS
)
```
**Additional flags**:
- `bypass_cache=True` acts like `CacheMode.BYPASS`.
- `disable_cache=True` acts like `CacheMode.DISABLED`.
- `no_cache_read=True` acts like `CacheMode.WRITE_ONLY`.
- `no_cache_write=True` acts like `CacheMode.READ_ONLY`.
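For instance, here is a minimal sketch of how one of these boolean flags maps onto its equivalent `CacheMode` value:
```python
# These two configs behave the same way: read from cache, never write to it.
run_config = CrawlerRunConfig(cache_mode=CacheMode.READ_ONLY)
run_config = CrawlerRunConfig(no_cache_write=True)
```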
---
## 3. Content Processing & Selection
### 3.1 Text Processing
```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,     # Ignore text blocks <10 words
    only_text=False,             # If True, tries to remove non-text elements
    keep_data_attributes=False   # Keep or discard data-* attributes
)
```
### 3.2 Content Selection
```python
run_config = CrawlerRunConfig(
    css_selector=".main-content",   # Focus on .main-content region only
    excluded_tags=["form", "nav"],  # Remove entire tag blocks
    remove_forms=True,              # Specifically strip <form> elements
    remove_overlay_elements=True,   # Attempt to remove modals/popups
)
```
### 3.3 Link Handling
```python
run_config = CrawlerRunConfig(
    exclude_external_links=True,          # Remove external links from final content
    exclude_social_media_links=True,      # Remove links to known social sites
    exclude_domains=["ads.example.com"],  # Exclude links to these domains
    exclude_social_media_domains=["facebook.com", "twitter.com"],  # Extend the default list
)
```
### 3.4 Media Filtering
```python
run_config = CrawlerRunConfig(
    exclude_external_images=True  # Strip images from other domains
)
```
---
## 4. Page Navigation & Timing
### 4.1 Basic Browser Flow
```python
run_config = CrawlerRunConfig(
    wait_for="css:.dynamic-content",  # Wait for .dynamic-content
    delay_before_return_html=2.0,     # Wait 2s before capturing final HTML
    page_timeout=60000,               # Navigation & script timeout (ms)
)
```
**Key Fields**:
- `wait_for`: either
  - `"css:selector"`, or
  - `"js:() => boolean"`, e.g. `js:() => document.querySelectorAll('.item').length > 10`.
- `mean_delay` & `max_range`: define random delays for `arun_many()` calls (see the sketch below).
- `semaphore_count`: concurrency limit when crawling multiple URLs.
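Here is a minimal sketch of how those pacing and concurrency fields might be combined with `arun_many()`, assuming the field behavior described above:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    run_config = CrawlerRunConfig(
        mean_delay=1.0,     # average delay (seconds) between requests
        max_range=0.5,      # random jitter added on top of mean_delay
        semaphore_count=5,  # crawl at most 5 URLs concurrently
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=["https://example.com/a", "https://example.com/b"],
            config=run_config,
        )
        for result in results:
            print(result.url, result.success)

asyncio.run(main())
```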
### 4.2 JavaScript Execution
```python
run_config = CrawlerRunConfig(
    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more')?.click();"
    ],
    js_only=False
)
```
- `js_code` can be a single string or a list of strings. 
- `js_only=True` means “I'm continuing in the same session with new JS steps, no new full navigation.”
### 4.3 Anti-Bot
```python
run_config = CrawlerRunConfig(
    magic=True,
    simulate_user=True,
    override_navigator=True
)
```
- `magic=True` tries multiple stealth features. 
- `simulate_user=True` mimics mouse movements or random delays. 
- `override_navigator=True` fakes some navigator properties (like user agent checks).
---
## 5. Session Management
**`session_id`**:
```python
run_config = CrawlerRunConfig(
    session_id="my_session123"
)
```
If the same `session_id` is reused in subsequent `arun()` calls, the same tab/page context is kept alive (helpful for multi-step tasks or stateful browsing), as in the sketch below.
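A minimal sketch of a two-step crawl in one session, assuming a hypothetical `.next-page` button on the target page; the second call sets `js_only=True` so it runs JS in the existing tab instead of performing a fresh navigation:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        # First call opens the page and keeps the tab alive under "my_session123".
        await crawler.arun(
            url="https://example.com/step1",
            config=CrawlerRunConfig(session_id="my_session123"),
        )
        # Second call reuses the same tab and just runs new JS steps.
        result = await crawler.arun(
            url="https://example.com/step1",
            config=CrawlerRunConfig(
                session_id="my_session123",
                js_code="document.querySelector('.next-page')?.click();",  # hypothetical selector
                js_only=True,
            ),
        )
        print(result.success)

asyncio.run(main())
```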
---
## 6. Screenshot, PDF & Media Options
```python
run_config = CrawlerRunConfig(
    screenshot=True,                         # Grab a screenshot as base64
    screenshot_wait_for=1.0,                 # Wait 1s before capturing
    pdf=True,                                # Also produce a PDF
    image_description_min_word_threshold=5,  # If analyzing alt text
    image_score_threshold=3,                 # Filter out low-score images
)
```
**Where they appear**:
- `result.screenshot` → Base64 screenshot string.
- `result.pdf` → Byte array with PDF data.
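A minimal sketch of persisting both artifacts, assuming `result` comes from an `arun()` call with the config above and the decoded screenshot is a PNG image:
```python
import base64

# result.screenshot is a base64-encoded string; result.pdf is raw bytes.
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)
```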
---
## 7. Extraction Strategy
**For advanced data extraction** (CSS/LLM-based), set `extraction_strategy`:
```python
run_config = CrawlerRunConfig(
    extraction_strategy=my_css_or_llm_strategy
)
```
The extracted data will appear in `result.extracted_content`.
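For example, a minimal sketch of consuming that field, assuming the strategy emits a JSON string (as the CSS-based strategy in the next section does):
```python
import json

if result.success and result.extracted_content:
    items = json.loads(result.extracted_content)
    print(f"Extracted {len(items)} items")
```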
---
## 8. Comprehensive Example
Below is a snippet combining many parameters:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Example schema
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    run_config = CrawlerRunConfig(
        # Core
        verbose=True,
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True,  # Respect robots.txt rules

        # Content
        word_count_threshold=10,
        css_selector="main.content",
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,

        # Page & JS
        js_code="document.querySelector('.show-more')?.click();",
        wait_for="css:.loaded-block",
        page_timeout=30000,

        # Extraction
        extraction_strategy=JsonCssExtractionStrategy(schema),

        # Session
        session_id="persistent_session",

        # Media
        screenshot=True,
        pdf=True,

        # Anti-bot
        simulate_user=True,
        magic=True,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/posts", config=run_config)
        if result.success:
            print("HTML length:", len(result.cleaned_html))
            print("Extraction JSON:", result.extracted_content)
            if result.screenshot:
                print("Screenshot length:", len(result.screenshot))
            if result.pdf:
                print("PDF bytes length:", len(result.pdf))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
**What we covered**:
1. **Crawling** the main content region, ignoring external links. 
2. Running **JavaScript** to click “.show-more”. 
3. **Waiting** for “.loaded-block” to appear. 
4. Generating a **screenshot** & **PDF** of the final page. 
5. Extracting repeated “article.post” elements with a **CSS-based** extraction strategy.
---
## 9. Best Practices
1. **Use `BrowserConfig` for global browser** settings (headless, user agent). 
2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshot, extraction, etc. 
3. Keep your **parameters consistent** in run configs, especially if you're part of a large codebase with multiple crawls. 
4. **Limit** large concurrency (`semaphore_count`) if the site or your system can't handle it. 
5. For dynamic pages, set `js_code` or `scan_full_page` so all content loads (see the sketch below).
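A minimal sketch of that split between global and per-crawl settings, assuming `BrowserConfig` is imported from `crawl4ai` alongside the run config:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # Global browser settings (apply to every crawl made by this crawler).
    browser_config = BrowserConfig(headless=True)
    # Per-crawl behavior: caching, dynamic-content handling, etc.
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        scan_full_page=True,  # scroll the page so lazily loaded content appears
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://example.com", config=run_config)
        print(result.success)

asyncio.run(main())
```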
---
## 10. Conclusion
All parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**. This approach:
- Makes code **clearer** and **more maintainable**. 
- Minimizes confusion about which arguments affect global vs. per-crawl behavior. 
- Allows you to create **reusable** config objects for different pages or tasks.
For a **full** reference, check out the [CrawlerRunConfig Docs](./parameters.md). 
Happy crawling with your **structured, flexible** config approach!