
`arun()` Parameter Guide (New Approach)
In Crawl4AI's latest configuration model, nearly all parameters that once went directly to `arun()` are now part of `CrawlerRunConfig`. When calling `arun()`, you provide:
```python
await crawler.arun(
    url="https://example.com",
    config=my_run_config
)
```
Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For browser settings (e.g., `headless`, `browser_type`), see BrowserConfig.
1. Core Usage
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    run_config = CrawlerRunConfig(
        verbose=True,                  # Detailed logging
        cache_mode=CacheMode.ENABLED,  # Use normal read/write cache
        check_robots_txt=True,         # Respect robots.txt rules
        # ... other parameters
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )

        # Check if blocked by robots.txt
        if not result.success and result.status_code == 403:
            print(f"Error: {result.error_message}")
```
Key Fields:

- `verbose=True` logs each crawl step.
- `cache_mode` decides how to read/write the local crawl cache.
2. Cache Control
`cache_mode` (default: `CacheMode.ENABLED`)

Use a built-in enum value from `CacheMode`:

- `ENABLED`: Normal caching. Reads if available, writes if missing.
- `DISABLED`: No caching. Always refetch pages.
- `READ_ONLY`: Reads from cache only; no new writes.
- `WRITE_ONLY`: Writes to cache but doesn't read existing data.
- `BYPASS`: Skips reading cache for this crawl (though it might still write if set up that way).
```python
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS
)
```
Additional flags (see the sketch below):

- `bypass_cache=True` acts like `CacheMode.BYPASS`.
- `disable_cache=True` acts like `CacheMode.DISABLED`.
- `no_cache_read=True` acts like `CacheMode.WRITE_ONLY`.
- `no_cache_write=True` acts like `CacheMode.READ_ONLY`.
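For illustration, here is a minimal sketch of a boolean flag next to its enum equivalent; the enum form is the one used throughout this guide:

```python
from crawl4ai import CrawlerRunConfig, CacheMode

flag_config = CrawlerRunConfig(bypass_cache=True)            # boolean flag form
enum_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)  # equivalent enum form
```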
3. Content Processing & Selection
3.1 Text Processing
```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,    # Ignore text blocks <10 words
    only_text=False,            # If True, tries to remove non-text elements
    keep_data_attributes=False  # Keep or discard data-* attributes
)
```
3.2 Content Selection
```python
run_config = CrawlerRunConfig(
    css_selector=".main-content",   # Focus on the .main-content region only
    excluded_tags=["form", "nav"],  # Remove entire tag blocks
    remove_forms=True,              # Specifically strip <form> elements
    remove_overlay_elements=True,   # Attempt to remove modals/popups
)
```
3.3 Link Handling
```python
run_config = CrawlerRunConfig(
    exclude_external_links=True,          # Remove external links from final content
    exclude_social_media_links=True,      # Remove links to known social sites
    exclude_domains=["ads.example.com"],  # Exclude links to these domains
    exclude_social_media_domains=["facebook.com", "twitter.com"],  # Extend the default list
)
```
3.4 Media Filtering
```python
run_config = CrawlerRunConfig(
    exclude_external_images=True  # Strip images from other domains
)
```
4. Page Navigation & Timing
4.1 Basic Browser Flow
```python
run_config = CrawlerRunConfig(
    wait_for="css:.dynamic-content",  # Wait for .dynamic-content to appear
    delay_before_return_html=2.0,     # Wait 2s before capturing final HTML
    page_timeout=60000,               # Navigation & script timeout (ms)
)
```
Key Fields:

- `wait_for`: either `"css:selector"` or `"js:() => boolean"`, e.g. `js:() => document.querySelectorAll('.item').length > 10`.
- `mean_delay` & `max_range`: define random delays for `arun_many()` calls (see the sketch after this list).
- `semaphore_count`: concurrency limit when crawling multiple URLs.
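As a rough sketch of how the multi-URL fields could be combined with `arun_many()`; the URLs and values here are placeholders, and the delay/concurrency semantics are as described in the list above:

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

run_config = CrawlerRunConfig(
    mean_delay=1.0,     # average pause (seconds) between requests in arun_many()
    max_range=0.5,      # random jitter added on top of the mean delay
    semaphore_count=5,  # crawl at most 5 URLs concurrently
)

async def crawl_batch():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=["https://example.com/a", "https://example.com/b"],
            config=run_config,
        )
        for result in results:
            print(result.url, result.success)
```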
4.2 JavaScript Execution
```python
run_config = CrawlerRunConfig(
    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more')?.click();"
    ],
    js_only=False
)
```
- `js_code` can be a single string or a list of strings.
- `js_only=True` means "I'm continuing in the same session with new JS steps, no new full navigation" (see the sketch below).
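As a minimal sketch, a follow-up step in an existing session might pair `js_only=True` with the `session_id` introduced in the next section; the session name and selectors here are purely illustrative:

```python
from crawl4ai import CrawlerRunConfig

# Follow-up step: run more JS in the already-open page, no fresh navigation (sketch)
next_step_config = CrawlerRunConfig(
    session_id="my_session123",  # reuse the session opened by an earlier arun() call
    js_only=True,                # skip navigation; just execute the JS below
    js_code="document.querySelector('.load-more')?.click();",
    wait_for="css:.dynamic-content",
)
```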
4.3 Anti-Bot
```python
run_config = CrawlerRunConfig(
    magic=True,
    simulate_user=True,
    override_navigator=True
)
```
- `magic=True` tries multiple stealth features.
- `simulate_user=True` mimics mouse movements or random delays.
- `override_navigator=True` fakes some navigator properties (like user agent checks).
5. Session Management
`session_id`:

```python
run_config = CrawlerRunConfig(
    session_id="my_session123"
)
```
If the same `session_id` is re-used in subsequent `arun()` calls, the same tab/page context is continued (helpful for multi-step tasks or stateful browsing).
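A short sketch of two sequential `arun()` calls sharing one session, where the second call only runs JS without a full reload; the URL and JS here are placeholders:

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def multi_step():
    async with AsyncWebCrawler() as crawler:
        # Step 1: open the page and keep the tab alive under this session_id
        first = await crawler.arun(
            url="https://example.com",
            config=CrawlerRunConfig(session_id="my_session123"),
        )

        # Step 2: continue in the same tab; js_only=True avoids a full re-navigation
        second = await crawler.arun(
            url="https://example.com",
            config=CrawlerRunConfig(
                session_id="my_session123",
                js_only=True,
                js_code="window.scrollTo(0, document.body.scrollHeight);",
            ),
        )
        return first, second
```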
6. Screenshot, PDF & Media Options
```python
run_config = CrawlerRunConfig(
    screenshot=True,                         # Grab a screenshot as base64
    screenshot_wait_for=1.0,                 # Wait 1s before capturing
    pdf=True,                                # Also produce a PDF
    image_description_min_word_threshold=5,  # If analyzing alt text
    image_score_threshold=3,                 # Filter out low-score images
)
```
Where they appear (a sketch for saving them to disk follows):

- `result.screenshot` → Base64 screenshot string.
- `result.pdf` → Byte array with PDF data.
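A minimal sketch for persisting both outputs, assuming `result` is the object returned by `arun()` above; the file names are arbitrary, and the screenshot is assumed to be PNG:

```python
import base64

if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))  # base64 string -> raw image bytes

if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)  # already raw PDF bytes
```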
7. Extraction Strategy
For advanced data extraction (CSS/LLM-based), set `extraction_strategy`:
```python
run_config = CrawlerRunConfig(
    extraction_strategy=my_css_or_llm_strategy
)
```
The extracted data will appear in `result.extracted_content`.
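For JSON-producing strategies such as the `JsonCssExtractionStrategy` used in the comprehensive example below, `extracted_content` is typically a JSON string you can parse; a small sketch:

```python
import json

if result.extracted_content:
    data = json.loads(result.extracted_content)  # shape follows your schema/strategy
    print(data[:2] if isinstance(data, list) else data)
```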
8. Comprehensive Example
Below is a snippet combining many parameters:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Example schema
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    run_config = CrawlerRunConfig(
        # Core
        verbose=True,
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True,  # Respect robots.txt rules

        # Content
        word_count_threshold=10,
        css_selector="main.content",
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,

        # Page & JS
        js_code="document.querySelector('.show-more')?.click();",
        wait_for="css:.loaded-block",
        page_timeout=30000,

        # Extraction
        extraction_strategy=JsonCssExtractionStrategy(schema),

        # Session
        session_id="persistent_session",

        # Media
        screenshot=True,
        pdf=True,

        # Anti-bot
        simulate_user=True,
        magic=True,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/posts", config=run_config)

        if result.success:
            print("HTML length:", len(result.cleaned_html))
            print("Extraction JSON:", result.extracted_content)
            if result.screenshot:
                print("Screenshot length:", len(result.screenshot))
            if result.pdf:
                print("PDF bytes length:", len(result.pdf))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
What we covered:

1. Crawling the main content region, ignoring external links.
2. Running JavaScript to click `.show-more`.
3. Waiting for `.loaded-block` to appear.
4. Generating a screenshot & PDF of the final page.
5. Extracting repeated `article.post` elements with a CSS-based extraction strategy.
9. Best Practices
1. Use `BrowserConfig` for global browser settings (headless mode, user agent).
2. Use `CrawlerRunConfig` for the specific crawl's needs: content filtering, caching, JS, screenshots, extraction, etc. (see the sketch after this list).
3. Keep your parameters consistent across run configs, especially in a large codebase with multiple crawls.
4. Limit large concurrency (`semaphore_count`) if the site or your system can't handle it.
5. For dynamic pages, set `js_code` or `scan_full_page` so you load all content.
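A brief sketch of this split between global and per-crawl settings; the exact `BrowserConfig` fields you need may differ:

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

browser_config = BrowserConfig(headless=True)  # global: applies to every crawl by this crawler
run_config = CrawlerRunConfig(                 # per-crawl: tune for a specific page or task
    cache_mode=CacheMode.ENABLED,
    exclude_external_links=True,
)

async def crawl():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        return await crawler.arun("https://example.com", config=run_config)
```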
10. Conclusion
All parameters that used to be direct arguments to `arun()` now belong in `CrawlerRunConfig`. This approach:

- Makes code clearer and more maintainable.
- Minimizes confusion about which arguments affect global vs. per-crawl behavior.
- Allows you to create reusable config objects for different pages or tasks.
For a full reference, check out the CrawlerRunConfig Docs.
Happy crawling with your structured, flexible config approach!