
arun() Parameter Guide (New Approach)

In Crawl4AI's latest configuration model, nearly all parameters that once went directly to arun() are now part of CrawlerRunConfig. When calling arun(), you provide:

await crawler.arun(
    url="https://example.com",  
    config=my_run_config
)

Below is an organized look at the parameters that can go inside CrawlerRunConfig, divided by their functional areas. For Browser settings (e.g., headless, browser_type), see BrowserConfig.


1. Core Usage

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    run_config = CrawlerRunConfig(
        verbose=True,            # Detailed logging
        cache_mode=CacheMode.ENABLED,  # Use normal read/write cache
        check_robots_txt=True,   # Respect robots.txt rules
        # ... other parameters
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        
        # Check if blocked by robots.txt
        if not result.success and result.status_code == 403:
            print(f"Error: {result.error_message}")

Key Fields:

  • verbose=True logs each crawl step.
  • cache_mode decides how to read/write the local crawl cache.
  • check_robots_txt=True checks and respects the site's robots.txt rules before crawling.

2. Cache Control

cache_mode (default: CacheMode.ENABLED)
Use a built-in enum from CacheMode:

  • ENABLED: Normal caching—reads if available, writes if missing.
  • DISABLED: No caching—always refetch pages.
  • READ_ONLY: Reads from cache only; no new writes.
  • WRITE_ONLY: Writes to cache but doesn't read existing data.
  • BYPASS: Skips reading cache for this crawl (though it might still write if set up that way).
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS
)

Additional flags (see the equivalence sketch after this list):

  • bypass_cache=True acts like CacheMode.BYPASS.
  • disable_cache=True acts like CacheMode.DISABLED.
  • no_cache_read=True acts like CacheMode.WRITE_ONLY.
  • no_cache_write=True acts like CacheMode.READ_ONLY.
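
As a quick sketch, a legacy flag and its enum form are meant to be interchangeable per the mapping above:

# These two configs request the same caching behavior
run_config_a = CrawlerRunConfig(bypass_cache=True)
run_config_b = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)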

3. Content Processing & Selection

3.1 Text Processing

run_config = CrawlerRunConfig(
    word_count_threshold=10,   # Ignore text blocks <10 words
    only_text=False,           # If True, tries to remove non-text elements
    keep_data_attributes=False # Keep or discard data-* attributes
)

3.2 Content Selection

run_config = CrawlerRunConfig(
    css_selector=".main-content",  # Focus on .main-content region only
    excluded_tags=["form", "nav"], # Remove entire tag blocks
    remove_forms=True,             # Specifically strip <form> elements
    remove_overlay_elements=True,  # Attempt to remove modals/popups
)

3.3 Link Filtering

run_config = CrawlerRunConfig(
    exclude_external_links=True,         # Remove external links from final content
    exclude_social_media_links=True,     # Remove links to known social sites
    exclude_domains=["ads.example.com"], # Exclude links to these domains
    exclude_social_media_domains=["facebook.com","twitter.com"], # Extend the default list
)

3.4 Media Filtering

run_config = CrawlerRunConfig(
    exclude_external_images=True  # Strip images from other domains
)

4. Page Navigation & Timing

4.1 Basic Browser Flow

run_config = CrawlerRunConfig(
    wait_for="css:.dynamic-content", # Wait for .dynamic-content
    delay_before_return_html=2.0,    # Wait 2s before capturing final HTML
    page_timeout=60000,             # Navigation & script timeout (ms)
)

Key Fields:

  • wait_for:

    • "css:selector" or
    • "js:() => boolean"
      e.g. js:() => document.querySelectorAll('.item').length > 10.
  • mean_delay & max_range: define randomized delays between requests in arun_many() calls (see the sketch after this list).

  • semaphore_count: concurrency limit when crawling multiple URLs.
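
A minimal sketch of these multi-URL fields, assuming arun_many() accepts the same config object and returns a list of results:

run_config = CrawlerRunConfig(
    mean_delay=1.0,      # Average delay (seconds) between requests
    max_range=0.5,       # Random extra delay added on top of mean_delay
    semaphore_count=5    # At most 5 URLs crawled concurrently
)

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(
        urls=["https://example.com/page1", "https://example.com/page2"],
        config=run_config
    )
    for result in results:
        print(result.url, result.success)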

4.2 JavaScript Execution

run_config = CrawlerRunConfig(
    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more')?.click();"
    ],
    js_only=False
)
  • js_code can be a single string or a list of strings.
  • js_only=True means “I'm continuing in the same session with new JS steps, no new full navigation.”

4.3 Anti-Bot

run_config = CrawlerRunConfig(
    magic=True,
    simulate_user=True,
    override_navigator=True
)
  • magic=True tries multiple stealth features.
  • simulate_user=True mimics mouse movements or random delays.
  • override_navigator=True fakes some navigator properties (like user agent checks).

5. Session Management

session_id:

run_config = CrawlerRunConfig(
    session_id="my_session123"
)

If the same session_id is re-used in subsequent arun() calls, the same tab/page context is continued (helpful for multi-step tasks or stateful browsing).
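
A minimal sketch of a two-step crawl re-using one session; the “.load-more” and “.item” selectors are hypothetical:

session_config = CrawlerRunConfig(session_id="my_session123")

async with AsyncWebCrawler() as crawler:
    # Step 1: normal navigation opens the tab for this session
    result1 = await crawler.arun("https://example.com/items", config=session_config)

    # Step 2: same tab, run JS only -- no new full navigation
    next_step_config = CrawlerRunConfig(
        session_id="my_session123",
        js_code="document.querySelector('.load-more')?.click();",  # Hypothetical button
        js_only=True,
        wait_for="css:.item"   # Hypothetical selector for newly loaded items
    )
    result2 = await crawler.arun("https://example.com/items", config=next_step_config)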


6. Screenshot, PDF & Media Options

run_config = CrawlerRunConfig(
    screenshot=True,             # Grab a screenshot as base64
    screenshot_wait_for=1.0,     # Wait 1s before capturing
    pdf=True,                    # Also produce a PDF
    image_description_min_word_threshold=5,  # If analyzing alt text
    image_score_threshold=3,                # Filter out low-score images
)

Where they appear:

  • result.screenshot → Base64 screenshot string.
  • result.pdf → Byte array with PDF data.
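
For example, a small sketch of persisting both artifacts to disk (assuming the screenshot decodes to a PNG image):

import base64

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://example.com", config=run_config)
    if result.success:
        if result.screenshot:
            with open("page.png", "wb") as f:
                f.write(base64.b64decode(result.screenshot))  # Base64 string -> raw bytes
        if result.pdf:
            with open("page.pdf", "wb") as f:
                f.write(result.pdf)  # Already raw PDF bytes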

7. Extraction Strategy

For advanced data extraction (CSS/LLM-based), set extraction_strategy:

run_config = CrawlerRunConfig(
    extraction_strategy=my_css_or_llm_strategy
)

The extracted data will appear in result.extracted_content.
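
Since extracted_content is typically a JSON string (as in the comprehensive example below), it can be parsed directly; a small sketch:

import json

result = await crawler.arun("https://example.com/posts", config=run_config)
if result.success and result.extracted_content:
    data = json.loads(result.extracted_content)  # e.g. a list of extracted items
    print(len(data), "items extracted")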


8. Comprehensive Example

Below is a snippet combining many parameters:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Example schema
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link",  "selector": "a",  "type": "attribute", "attribute": "href"}
        ]
    }

    run_config = CrawlerRunConfig(
        # Core
        verbose=True,
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True,   # Respect robots.txt rules
        
        # Content
        word_count_threshold=10,
        css_selector="main.content",
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,
        
        # Page & JS
        js_code="document.querySelector('.show-more')?.click();",
        wait_for="css:.loaded-block",
        page_timeout=30000,
        
        # Extraction
        extraction_strategy=JsonCssExtractionStrategy(schema),
        
        # Session
        session_id="persistent_session",
        
        # Media
        screenshot=True,
        pdf=True,
        
        # Anti-bot
        simulate_user=True,
        magic=True,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/posts", config=run_config)
        if result.success:
            print("HTML length:", len(result.cleaned_html))
            print("Extraction JSON:", result.extracted_content)
            if result.screenshot:
                print("Screenshot length:", len(result.screenshot))
            if result.pdf:
                print("PDF bytes length:", len(result.pdf))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

What we covered:

1. Crawling the main content region, ignoring external links.
2. Running JavaScript to click “.show-more”.
3. Waiting for “.loaded-block” to appear.
4. Generating a screenshot & PDF of the final page.
5. Extracting repeated “article.post” elements with a CSS-based extraction strategy.


9. Best Practices

1. Use BrowserConfig for global browser settings (headless, user agent); see the sketch after this list.
2. Use CrawlerRunConfig to handle the specific crawl needs: content filtering, caching, JS, screenshot, extraction, etc.
3. Keep your parameters consistent across run configs, especially in a large codebase with multiple crawls.
4. Limit large concurrency (semaphore_count) if the site or your system can't handle it.
5. For dynamic pages, set js_code or scan_full_page so you load all content.
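
A minimal sketch of that split, assuming BrowserConfig accepts headless and user_agent (see the BrowserConfig docs for the exact fields):

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

browser_config = BrowserConfig(      # Global browser settings
    headless=True,
    user_agent="MyCrawler/1.0"       # Hypothetical UA string
)

run_config = CrawlerRunConfig(       # Per-crawl settings
    cache_mode=CacheMode.BYPASS,
    excluded_tags=["nav", "footer"],
    screenshot=True
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://example.com", config=run_config)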


10. Conclusion

All parameters that used to be direct arguments to arun() now belong in CrawlerRunConfig. This approach:

  • Makes code clearer and more maintainable.
  • Minimizes confusion about which arguments affect global vs. per-crawl behavior.
  • Allows you to create reusable config objects for different pages or tasks.

For a full reference, check out the CrawlerRunConfig Docs.

Happy crawling with your structured, flexible config approach!