
arun() Parameter Guide (New Approach)

In Crawl4AI's latest configuration model, nearly all parameters that once went directly to arun() are now part of CrawlerRunConfig. When calling arun(), you provide:

await crawler.arun(
    url="https://example.com",  
    config=my_run_config
)

Below is an organized look at the parameters that can go inside CrawlerRunConfig, divided by their functional areas. For Browser settings (e.g., headless, browser_type), see BrowserConfig.


1. Core Usage

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    run_config = CrawlerRunConfig(
        verbose=True,            # Detailed logging
        cache_mode=CacheMode.ENABLED,  # Use normal read/write cache
        check_robots_txt=True,   # Respect robots.txt rules
        # ... other parameters
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        
        # Check if blocked by robots.txt
        if not result.success and result.status_code == 403:
            print(f"Error: {result.error_message}")

Key Fields:

  • verbose=True logs each crawl step.
  • cache_mode decides how to read/write the local crawl cache.
  • check_robots_txt=True checks and respects the site's robots.txt rules before crawling.

2. Cache Control

cache_mode (default: CacheMode.ENABLED)
Use a built-in enum from CacheMode:

  • ENABLED: Normal caching—reads if available, writes if missing.
  • DISABLED: No caching—always refetch pages.
  • READ_ONLY: Reads from cache only; no new writes.
  • WRITE_ONLY: Writes to cache but doesn't read existing data.
  • BYPASS: Skips reading cache for this crawl (though it might still write if set up that way).
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS
)

Additional flags (see the equivalence sketch after this list):

  • bypass_cache=True acts like CacheMode.BYPASS.
  • disable_cache=True acts like CacheMode.DISABLED.
  • no_cache_read=True acts like CacheMode.WRITE_ONLY.
  • no_cache_write=True acts like CacheMode.READ_ONLY.
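
As a quick sketch, a legacy flag and its enum form are meant to be interchangeable per the mapping above:

# These two configs request the same caching behavior
run_config_a = CrawlerRunConfig(bypass_cache=True)
run_config_b = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)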

3. Content Processing & Selection

3.1 Text Processing

run_config = CrawlerRunConfig(
    word_count_threshold=10,   # Ignore text blocks <10 words
    only_text=False,           # If True, tries to remove non-text elements
    keep_data_attributes=False # Keep or discard data-* attributes
)

3.2 Content Selection

run_config = CrawlerRunConfig(
    css_selector=".main-content",  # Focus on .main-content region only
    excluded_tags=["form", "nav"], # Remove entire tag blocks
    remove_forms=True,             # Specifically strip <form> elements
    remove_overlay_elements=True,  # Attempt to remove modals/popups
)

3.3 Link Filtering

run_config = CrawlerRunConfig(
    exclude_external_links=True,         # Remove external links from final content
    exclude_social_media_links=True,     # Remove links to known social sites
    exclude_domains=["ads.example.com"], # Exclude links to these domains
    exclude_social_media_domains=["facebook.com","twitter.com"], # Extend the default list
)

3.4 Media Filtering

run_config = CrawlerRunConfig(
    exclude_external_images=True  # Strip images from other domains
)

4. Page Navigation & Timing

4.1 Basic Browser Flow

run_config = CrawlerRunConfig(
    wait_for="css:.dynamic-content", # Wait for .dynamic-content
    delay_before_return_html=2.0,    # Wait 2s before capturing final HTML
    page_timeout=60000,             # Navigation & script timeout (ms)
)

Key Fields:

  • wait_for:

    • "css:selector" or
    • "js:() => boolean"
      e.g. js:() => document.querySelectorAll('.item').length > 10.
  • mean_delay & max_range: define randomized delays between requests in arun_many() calls (see the sketch after this list).

  • semaphore_count: concurrency limit when crawling multiple URLs.
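
A minimal sketch of these multi-URL fields, assuming arun_many() accepts the same config object and returns a list of results:

run_config = CrawlerRunConfig(
    mean_delay=1.0,      # Average delay (seconds) between requests
    max_range=0.5,       # Random extra delay added on top of mean_delay
    semaphore_count=5    # At most 5 URLs crawled concurrently
)

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(
        urls=["https://example.com/page1", "https://example.com/page2"],
        config=run_config
    )
    for result in results:
        print(result.url, result.success)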

4.2 JavaScript Execution

run_config = CrawlerRunConfig(
    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more')?.click();"
    ],
    js_only=False
)
  • js_code can be a single string or a list of strings.
  • js_only=True means “I'm continuing in the same session with new JS steps, no new full navigation.”

4.3 Anti-Bot

run_config = CrawlerRunConfig(
    magic=True,
    simulate_user=True,
    override_navigator=True
)
  • magic=True tries multiple stealth features.
  • simulate_user=True mimics mouse movements or random delays.
  • override_navigator=True fakes some navigator properties (like user agent checks).

5. Session Management

session_id:

run_config = CrawlerRunConfig(
    session_id="my_session123"
)

If the same session_id is re-used in subsequent arun() calls, the same tab/page context is continued (helpful for multi-step tasks or stateful browsing).
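
A minimal sketch of a two-step crawl re-using one session; the “.load-more” and “.item” selectors are hypothetical:

session_config = CrawlerRunConfig(session_id="my_session123")

async with AsyncWebCrawler() as crawler:
    # Step 1: normal navigation opens the tab for this session
    result1 = await crawler.arun("https://example.com/items", config=session_config)

    # Step 2: same tab, run JS only -- no new full navigation
    next_step_config = CrawlerRunConfig(
        session_id="my_session123",
        js_code="document.querySelector('.load-more')?.click();",  # Hypothetical button
        js_only=True,
        wait_for="css:.item"   # Hypothetical selector for newly loaded items
    )
    result2 = await crawler.arun("https://example.com/items", config=next_step_config)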


6. Screenshot, PDF & Media Options

run_config = CrawlerRunConfig(
    screenshot=True,             # Grab a screenshot as base64
    screenshot_wait_for=1.0,     # Wait 1s before capturing
    pdf=True,                    # Also produce a PDF
    image_description_min_word_threshold=5,  # If analyzing alt text
    image_score_threshold=3,                # Filter out low-score images
)

Where they appear:

  • result.screenshot → Base64 screenshot string.
  • result.pdf → Byte array with PDF data.
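
For example, a small sketch of persisting both artifacts to disk (assuming the screenshot decodes to a PNG image):

import base64

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://example.com", config=run_config)
    if result.success:
        if result.screenshot:
            with open("page.png", "wb") as f:
                f.write(base64.b64decode(result.screenshot))  # Base64 string -> raw bytes
        if result.pdf:
            with open("page.pdf", "wb") as f:
                f.write(result.pdf)  # Already raw PDF bytes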

7. Extraction Strategy

For advanced data extraction (CSS/LLM-based), set extraction_strategy:

run_config = CrawlerRunConfig(
    extraction_strategy=my_css_or_llm_strategy
)

The extracted data will appear in result.extracted_content.
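
Since extracted_content is typically a JSON string (as in the comprehensive example below), it can be parsed directly; a small sketch:

import json

result = await crawler.arun("https://example.com/posts", config=run_config)
if result.success and result.extracted_content:
    data = json.loads(result.extracted_content)  # e.g. a list of extracted items
    print(len(data), "items extracted")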


8. Comprehensive Example

Below is a snippet combining many parameters:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Example schema
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link",  "selector": "a",  "type": "attribute", "attribute": "href"}
        ]
    }

    run_config = CrawlerRunConfig(
        # Core
        verbose=True,
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True,   # Respect robots.txt rules
        
        # Content
        word_count_threshold=10,
        css_selector="main.content",
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,
        
        # Page & JS
        js_code="document.querySelector('.show-more')?.click();",
        wait_for="css:.loaded-block",
        page_timeout=30000,
        
        # Extraction
        extraction_strategy=JsonCssExtractionStrategy(schema),
        
        # Session
        session_id="persistent_session",
        
        # Media
        screenshot=True,
        pdf=True,
        
        # Anti-bot
        simulate_user=True,
        magic=True,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/posts", config=run_config)
        if result.success:
            print("HTML length:", len(result.cleaned_html))
            print("Extraction JSON:", result.extracted_content)
            if result.screenshot:
                print("Screenshot length:", len(result.screenshot))
            if result.pdf:
                print("PDF bytes length:", len(result.pdf))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

What we covered:

1. Crawling the main content region, ignoring external links.
2. Running JavaScript to click “.show-more”.
3. Waiting for “.loaded-block” to appear.
4. Generating a screenshot & PDF of the final page.
5. Extracting repeated “article.post” elements with a CSS-based extraction strategy.


9. Best Practices

1. Use BrowserConfig for global browser settings (headless, user agent); see the sketch after this list.
2. Use CrawlerRunConfig to handle the specific crawl needs: content filtering, caching, JS, screenshot, extraction, etc.
3. Keep your parameters consistent across run configs, especially in a large codebase with multiple crawls.
4. Limit large concurrency (semaphore_count) if the site or your system can't handle it.
5. For dynamic pages, set js_code or scan_full_page so you load all content.
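
A minimal sketch of that split, assuming BrowserConfig accepts headless and user_agent (see the BrowserConfig docs for the exact fields):

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

browser_config = BrowserConfig(      # Global browser settings
    headless=True,
    user_agent="MyCrawler/1.0"       # Hypothetical UA string
)

run_config = CrawlerRunConfig(       # Per-crawl settings
    cache_mode=CacheMode.BYPASS,
    excluded_tags=["nav", "footer"],
    screenshot=True
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://example.com", config=run_config)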


10. Conclusion

All parameters that used to be direct arguments to arun() now belong in CrawlerRunConfig. This approach:

  • Makes code clearer and more maintainable.
  • Minimizes confusion about which arguments affect global vs. per-crawl behavior.
  • Allows you to create reusable config objects for different pages or tasks.

For a full reference, check out the CrawlerRunConfig Docs.

Happy crawling with your structured, flexible config approach!