crawl4ai/docs/md_v2/core/browser-crawler-config.md

# Browser, Crawler & LLM Configuration (Quick Overview)

Crawl4AI’s flexibility stems from two key classes:

1. **`BrowserConfig`** – Dictates **how** the browser is launched and behaves (e.g., headless or visible, proxy, user agent).  
2. **`CrawlerRunConfig`** – Dictates **how** each **crawl** operates (e.g., caching, extraction, timeouts, JavaScript code to run, etc.).  
3. **`LlmConfig`** - Dictates **how** LLM providers are configured. (model, api token, base url, temperature etc.)

In most examples, you create **one** `BrowserConfig` for the entire crawler session, then pass a **fresh** or re-used `CrawlerRunConfig` whenever you call `arun()`. This tutorial shows the most commonly used parameters. If you need advanced or rarely used fields, see the [Configuration Parameters](../api/parameters.md).

---

## 1. BrowserConfig Essentials

```python
class BrowserConfig:
    def __init__(
        browser_type="chromium",
        headless=True,
        proxy_config=None,
        viewport_width=1080,
        viewport_height=600,
        verbose=True,
        use_persistent_context=False,
        user_data_dir=None,
        cookies=None,
        headers=None,
        user_agent=None,
        text_mode=False,
        light_mode=False,
        extra_args=None,
        # ... other advanced parameters omitted here
    ):
        ...
```

### Key Fields to Note


1. **`browser_type`**  
- Options: `"chromium"`, `"firefox"`, or `"webkit"`.  
- Defaults to `"chromium"`.  
- If you need a different engine, specify it here.

2. **`headless`**  
   - `True`: Runs the browser in headless mode (invisible browser).  
   - `False`: Runs the browser in visible mode, which helps with debugging.

3. **`proxy_config`**  
   - A dictionary with fields like:  
```json
{
    "server": "http://proxy.example.com:8080", 
    "username": "...", 
    "password": "..."
}
```
   - Leave as `None` if a proxy is not required.

4. **`viewport_width` & `viewport_height`**:  
   - The initial window size.  
   - Some sites behave differently with smaller or bigger viewports.

5. **`verbose`**:  
   - If `True`, prints extra logs.  
   - Handy for debugging.

6. **`use_persistent_context`**:  
   - If `True`, uses a **persistent** browser profile, storing cookies/local storage across runs.  
   - Typically also set `user_data_dir` to point to a folder.

7. **`cookies`** & **`headers`**:  
   - If you want to start with specific cookies or add universal HTTP headers, set them here.  
   - E.g. `cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}]`.

8. **`user_agent`**:  
   - Custom User-Agent string. If `None`, a default is used.  
   - You can also set `user_agent_mode="random"` for randomization (if you want to fight bot detection).

9. **`text_mode`** & **`light_mode`**:  
   - `text_mode=True` disables images, possibly speeding up text-only crawls.  
   - `light_mode=True` turns off certain background features for performance.  

10. **`extra_args`**:  
    - Additional flags for the underlying browser.  
    - E.g. `["--disable-extensions"]`.

### Helper Methods

Both configuration classes provide a `clone()` method to create modified copies:

```python
# Create a base browser config
base_browser = BrowserConfig(
    browser_type="chromium",
    headless=True,
    text_mode=True
)

# Create a visible browser config for debugging
debug_browser = base_browser.clone(
    headless=False,
    verbose=True
)
```

**Minimal Example**:

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_conf = BrowserConfig(
    browser_type="firefox",
    headless=False,
    text_mode=True
)

async with AsyncWebCrawler(config=browser_conf) as crawler:
    result = await crawler.arun("https://example.com")
    print(result.markdown[:300])
```

---

## 2. CrawlerRunConfig Essentials

```python
class CrawlerRunConfig:
    def __init__(
        word_count_threshold=200,
        extraction_strategy=None,
        markdown_generator=None,
        cache_mode=None,
        js_code=None,
        wait_for=None,
        screenshot=False,
        pdf=False,
        enable_rate_limiting=False,
        rate_limit_config=None,
        memory_threshold_percent=70.0,
        check_interval=1.0,
        max_session_permit=20,
        display_mode=None,
        verbose=True,
        stream=False,  # Enable streaming for arun_many()
        # ... other advanced parameters omitted
    ):
        ...
```

### Key Fields to Note

1. **`word_count_threshold`**:  
   - The minimum word count before a block is considered.  
   - If your site has lots of short paragraphs or items, you can lower it.

2. **`extraction_strategy`**:  
   - Where you plug in JSON-based extraction (CSS, LLM, etc.).  
   - If `None`, no structured extraction is done (only raw/cleaned HTML + markdown).

3. **`markdown_generator`**:  
   - E.g., `DefaultMarkdownGenerator(...)`, controlling how HTML→Markdown conversion is done.  
   - If `None`, a default approach is used.

4. **`cache_mode`**:  
   - Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`, etc.).  
   - If `None`, defaults to some level of caching or you can specify `CacheMode.ENABLED`.

5. **`js_code`**:  
   - A string or list of JS strings to execute.  
   - Great for “Load More” buttons or user interactions.  

6. **`wait_for`**:  
   - A CSS or JS expression to wait for before extracting content.  
   - Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`.

7. **`screenshot`** & **`pdf`**:  
   - If `True`, captures a screenshot or PDF after the page is fully loaded.  
   - The results go to `result.screenshot` (base64) or `result.pdf` (bytes).

8. **`verbose`**:  
   - Logs additional runtime details.  
   - Overlaps with the browser’s verbosity if also set to `True` in `BrowserConfig`.

9. **`enable_rate_limiting`**:  
   - If `True`, enables rate limiting for batch processing.  
   - Requires `rate_limit_config` to be set.

10. **`memory_threshold_percent`**:  
    - The memory threshold (as a percentage) to monitor.  
    - If exceeded, the crawler will pause or slow down.

11. **`check_interval`**:  
    - The interval (in seconds) to check system resources.  
    - Affects how often memory and CPU usage are monitored.

12. **`max_session_permit`**:  
    - The maximum number of concurrent crawl sessions.  
    - Helps prevent overwhelming the system.

13. **`display_mode`**:  
    - The display mode for progress information (`DETAILED`, `BRIEF`, etc.).  
    - Affects how much information is printed during the crawl.

### Helper Methods

The `clone()` method is particularly useful for creating variations of your crawler configuration:

```python
# Create a base configuration
base_config = CrawlerRunConfig(
    cache_mode=CacheMode.ENABLED,
    word_count_threshold=200,
    wait_until="networkidle"
)

# Create variations for different use cases
stream_config = base_config.clone(
    stream=True,  # Enable streaming mode
    cache_mode=CacheMode.BYPASS
)

debug_config = base_config.clone(
    page_timeout=120000,  # Longer timeout for debugging
    verbose=True
)
```

The `clone()` method:
- Creates a new instance with all the same settings
- Updates only the specified parameters
- Leaves the original configuration unchanged
- Perfect for creating variations without repeating all parameters

---


## 3. LlmConfig Essentials

### Key fields to note

1. **`provider`**:  
- Which LLM provoder to use. 
- Possible values are `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)*

2. **`api_token`**:  
    - Optional. When not provided explicitly, api_token will be read from environment variables based on provider. For example: If a gemini model is passed as provider then,`"GEMINI_API_KEY"` will be read from environment variables  
    - API token of LLM provider <br/> eg: `api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv"`
    - Environment variable - use with prefix "env:" <br/> eg:`api_token = "env: GROQ_API_KEY"`            

3. **`base_url`**:  
   - If your provider has a custom endpoint

```python
llmConfig = LlmConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
```

## 4. Putting It All Together

In a typical scenario, you define **one** `BrowserConfig` for your crawler session, then create **one or more** `CrawlerRunConfig` & `LlmConfig` depending on each call’s needs:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LlmConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # 1) Browser config: headless, bigger viewport, no proxy
    browser_conf = BrowserConfig(
        headless=True,
        viewport_width=1280,
        viewport_height=720
    )

    # 2) Example extraction strategy
    schema = {
        "name": "Articles",
        "baseSelector": "div.article",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }
    extraction = JsonCssExtractionStrategy(schema)

    # 3) Example LLM content filtering

    gemini_config = LlmConfig(
        provider="gemini/gemini-1.5-pro" 
        api_token = "env:GEMINI_API_TOKEN"
    )

    # Initialize LLM filter with specific instruction
    filter = LLMContentFilter(
        llmConfig=gemini_config,  # or your preferred provider
        instruction="""
        Focus on extracting the core educational content.
        Include:
        - Key concepts and explanations
        - Important code examples
        - Essential technical details
        Exclude:
        - Navigation elements
        - Sidebars
        - Footer content
        Format the output as clean markdown with proper code blocks and headers.
        """,
        chunk_token_threshold=500,  # Adjust based on your needs
        verbose=True
    )

    md_generator = DefaultMarkdownGenerator(
    content_filter=filter,
    options={"ignore_links": True}

    # 4) Crawler run config: skip cache, use extraction
    run_conf = CrawlerRunConfig(
        markdown_generator=md_generator,
        extraction_strategy=extraction,
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        # 4) Execute the crawl
        result = await crawler.arun(url="https://example.com/news", config=run_conf)

        if result.success:
            print("Extracted content:", result.extracted_content)
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

---

## 5. Next Steps

For a **detailed list** of available parameters (including advanced ones), see:

- [BrowserConfig, CrawlerRunConfig & LlmConfig Reference](../api/parameters.md)  

You can explore topics like:

- **Custom Hooks & Auth** (Inject JavaScript or handle login forms).  
- **Session Management** (Re-use pages, preserve state across multiple calls).  
- **Magic Mode** or **Identity-based Crawling** (Fight bot detection by simulating user behavior).  
- **Advanced Caching** (Fine-tune read/write cache modes).  

---

## 6. Conclusion

**BrowserConfig**, **CrawlerRunConfig** and **LlmConfig** give you straightforward ways to define:

- **Which** browser to launch, how it should run, and any proxy or user agent needs.  
- **How** each crawl should behave—caching, timeouts, JavaScript code, extraction strategies, etc.
- **Which** LLM provider to use, api token, temperature and base url for custom endpoints

Use them together for **clear, maintainable** code, and when you need more specialized behavior, check out the advanced parameters in the [reference docs](../api/parameters.md). Happy crawling!
-												Feat/llm config (#724)

* feature: Add LlmConfig to easily configure and pass LLM configs to different strategies

* pulled in next branch and resolved conflicts

* feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions

* Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params

* updated tests, docs and readme
											
										
										
											2025-02-21 13:11:37 +05:30
+								# Browser, Crawler & LLM Configuration (Quick Overview)
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
 								Crawl4AI’s flexibility stems from two key classes:
 . **`BrowserConfig`** – Dictates **how** the browser is launched and behaves (e.g., headless or visible, proxy, user agent).
-												Feat/llm config (#724)

* feature: Add LlmConfig to easily configure and pass LLM configs to different strategies

* pulled in next branch and resolved conflicts

* feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions

* Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params

* updated tests, docs and readme
											
										
										
											2025-02-21 13:11:37 +05:30
+. **`CrawlerRunConfig`** – Dictates **how** each **crawl** operates (e.g., caching, extraction, timeouts, JavaScript code to run, etc.).
 . **`LlmConfig`** - Dictates **how** LLM providers are configured. (model, api token, base url, temperature etc.)
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
 								In most examples, you create **one** `BrowserConfig` for the entire crawler session, then pass a **fresh** or re-used `CrawlerRunConfig` whenever you call `arun()`. This tutorial shows the most commonly used parameters. If you need advanced or rarely used fields, see the [Configuration Parameters](../api/parameters.md).
 								---
 								## 1. BrowserConfig Essentials
 								```python
 								class BrowserConfig:
 								    def __init__(
 								        browser_type="chromium",
 								        headless=True,
 								        proxy_config=None,
 								        viewport_width=1080,
 								        viewport_height=600,
 								        verbose=True,
 								        use_persistent_context=False,
 								        user_data_dir=None,
 								        cookies=None,
 								        headers=None,
 								        user_agent=None,
 								        text_mode=False,
 								        light_mode=False,
 								        extra_args=None,
 								        # ... other advanced parameters omitted here
 								    ):
 								        ...
 								```
 								### Key Fields to Note
 . **`browser_type`**
 								- Options: `"chromium"`, `"firefox"`, or `"webkit"`.
 								- Defaults to `"chromium"`.
 								- If you need a different engine, specify it here.
 . **`headless`**
 								   - `True`: Runs the browser in headless mode (invisible browser).
 								   - `False`: Runs the browser in visible mode, which helps with debugging.
 . **`proxy_config`**
 								   - A dictionary with fields like:
 								```json
 								{
 								    "server": "http://proxy.example.com:8080",
 								    "username": "...",
 								    "password": "..."
 								}
 								```
 								   - Leave as `None` if a proxy is not required.
 . **`viewport_width` & `viewport_height`**:
 								   - The initial window size.
 								   - Some sites behave differently with smaller or bigger viewports.
 . **`verbose`**:
 								   - If `True`, prints extra logs.
 								   - Handy for debugging.
 . **`use_persistent_context`**:
 								   - If `True`, uses a **persistent** browser profile, storing cookies/local storage across runs.
 								   - Typically also set `user_data_dir` to point to a folder.
 . **`cookies`** & **`headers`**:
 								   - If you want to start with specific cookies or add universal HTTP headers, set them here.
 								   - E.g. `cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}]`.
 . **`user_agent`**:
 								   - Custom User-Agent string. If `None`, a default is used.
 								   - You can also set `user_agent_mode="random"` for randomization (if you want to fight bot detection).
 . **`text_mode`** & **`light_mode`**:
 								   - `text_mode=True` disables images, possibly speeding up text-only crawls.
 								   - `light_mode=True` turns off certain background features for performance.
 . **`extra_args`**:
 								    - Additional flags for the underlying browser.
 								    - E.g. `["--disable-extensions"]`.
-												docs(api): add streaming mode documentation and examples

Add comprehensive documentation for the new streaming mode feature in arun_many():
- Update arun_many() API docs to reflect streaming return type
- Add streaming examples in quickstart and multi-url guides
- Document stream parameter in configuration classes
- Add clone() helper method documentation for configs

This change improves documentation for processing large numbers of URLs efficiently.

											
										
										
											2025-01-19 18:21:34 +08:00
+								### Helper Methods
 								Both configuration classes provide a `clone()` method to create modified copies:
 								```python
 								# Create a base browser config
 								base_browser = BrowserConfig(
 								    browser_type="chromium",
 								    headless=True,
 								    text_mode=True
 								)
 								# Create a visible browser config for debugging
 								debug_browser = base_browser.clone(
 								    headless=False,
 								    verbose=True
 								)
 								```
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								**Minimal Example**:
 								```python
 								from crawl4ai import AsyncWebCrawler, BrowserConfig
 								browser_conf = BrowserConfig(
 								    browser_type="firefox",
 								    headless=False,
 								    text_mode=True
 								)
 								async with AsyncWebCrawler(config=browser_conf) as crawler:
 								    result = await crawler.arun("https://example.com")
 								    print(result.markdown[:300])
 								```
 								---
 								## 2. CrawlerRunConfig Essentials
 								```python
 								class CrawlerRunConfig:
 								    def __init__(
 								        word_count_threshold=200,
 								        extraction_strategy=None,
 								        markdown_generator=None,
 								        cache_mode=None,
 								        js_code=None,
 								        wait_for=None,
 								        screenshot=False,
 								        pdf=False,
-												refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring

Reorganize dispatcher functionality into separate components:
- Create dedicated dispatcher classes (MemoryAdaptive, Semaphore)
- Add RateLimiter for smart request throttling
- Implement CrawlerMonitor for real-time progress tracking
- Move dispatcher config from CrawlerRunConfig to separate classes

BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.

											
										
										
											2025-01-11 21:10:27 +08:00
+								        enable_rate_limiting=False,
 								        rate_limit_config=None,
 								        memory_threshold_percent=70.0,
 								        check_interval=1.0,
 								        max_session_permit=20,
 								        display_mode=None,
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								        verbose=True,
-												docs(api): add streaming mode documentation and examples

Add comprehensive documentation for the new streaming mode feature in arun_many():
- Update arun_many() API docs to reflect streaming return type
- Add streaming examples in quickstart and multi-url guides
- Document stream parameter in configuration classes
- Add clone() helper method documentation for configs

This change improves documentation for processing large numbers of URLs efficiently.

											
										
										
											2025-01-19 18:21:34 +08:00
+								        stream=False,  # Enable streaming for arun_many()
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								        # ... other advanced parameters omitted
 								    ):
 								        ...
 								```
 								### Key Fields to Note
 . **`word_count_threshold`**:
 								   - The minimum word count before a block is considered.
 								   - If your site has lots of short paragraphs or items, you can lower it.
 . **`extraction_strategy`**:
 								   - Where you plug in JSON-based extraction (CSS, LLM, etc.).
 								   - If `None`, no structured extraction is done (only raw/cleaned HTML + markdown).
 . **`markdown_generator`**:
 								   - E.g., `DefaultMarkdownGenerator(...)`, controlling how HTML→Markdown conversion is done.
 								   - If `None`, a default approach is used.
 . **`cache_mode`**:
 								   - Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`, etc.).
 								   - If `None`, defaults to some level of caching or you can specify `CacheMode.ENABLED`.
 . **`js_code`**:
 								   - A string or list of JS strings to execute.
 								   - Great for “Load More” buttons or user interactions.
 . **`wait_for`**:
 								   - A CSS or JS expression to wait for before extracting content.
 								   - Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`.
 . **`screenshot`** & **`pdf`**:
 								   - If `True`, captures a screenshot or PDF after the page is fully loaded.
 								   - The results go to `result.screenshot` (base64) or `result.pdf` (bytes).
 . **`verbose`**:
 								   - Logs additional runtime details.
 								   - Overlaps with the browser’s verbosity if also set to `True` in `BrowserConfig`.
-												refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring

Reorganize dispatcher functionality into separate components:
- Create dedicated dispatcher classes (MemoryAdaptive, Semaphore)
- Add RateLimiter for smart request throttling
- Implement CrawlerMonitor for real-time progress tracking
- Move dispatcher config from CrawlerRunConfig to separate classes

BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.

											
										
										
											2025-01-11 21:10:27 +08:00
+. **`enable_rate_limiting`**:
 								   - If `True`, enables rate limiting for batch processing.
 								   - Requires `rate_limit_config` to be set.
-												feat(proxy): add proxy rotation strategy

Implements a new proxy rotation system with the following changes:
- Add ProxyRotationStrategy abstract base class
- Add RoundRobinProxyStrategy concrete implementation
- Integrate proxy rotation with AsyncWebCrawler
- Add proxy_rotation_strategy parameter to CrawlerRunConfig
- Add example script demonstrating proxy rotation usage
- Remove deprecated synchronous WebCrawler code
- Clean up rate limiting documentation

BREAKING CHANGE: Removed synchronous WebCrawler support and related rate limiting configurations

											
										
										
											2025-02-09 18:49:10 +08:00
+. **`memory_threshold_percent`**:
-												refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring

Reorganize dispatcher functionality into separate components:
- Create dedicated dispatcher classes (MemoryAdaptive, Semaphore)
- Add RateLimiter for smart request throttling
- Implement CrawlerMonitor for real-time progress tracking
- Move dispatcher config from CrawlerRunConfig to separate classes

BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.

											
										
										
											2025-01-11 21:10:27 +08:00
+								    - The memory threshold (as a percentage) to monitor.
 								    - If exceeded, the crawler will pause or slow down.
-												feat(proxy): add proxy rotation strategy

Implements a new proxy rotation system with the following changes:
- Add ProxyRotationStrategy abstract base class
- Add RoundRobinProxyStrategy concrete implementation
- Integrate proxy rotation with AsyncWebCrawler
- Add proxy_rotation_strategy parameter to CrawlerRunConfig
- Add example script demonstrating proxy rotation usage
- Remove deprecated synchronous WebCrawler code
- Clean up rate limiting documentation

BREAKING CHANGE: Removed synchronous WebCrawler support and related rate limiting configurations

											
										
										
											2025-02-09 18:49:10 +08:00
+. **`check_interval`**:
-												refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring

Reorganize dispatcher functionality into separate components:
- Create dedicated dispatcher classes (MemoryAdaptive, Semaphore)
- Add RateLimiter for smart request throttling
- Implement CrawlerMonitor for real-time progress tracking
- Move dispatcher config from CrawlerRunConfig to separate classes

BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.

											
										
										
											2025-01-11 21:10:27 +08:00
+								    - The interval (in seconds) to check system resources.
 								    - Affects how often memory and CPU usage are monitored.
-												feat(proxy): add proxy rotation strategy

Implements a new proxy rotation system with the following changes:
- Add ProxyRotationStrategy abstract base class
- Add RoundRobinProxyStrategy concrete implementation
- Integrate proxy rotation with AsyncWebCrawler
- Add proxy_rotation_strategy parameter to CrawlerRunConfig
- Add example script demonstrating proxy rotation usage
- Remove deprecated synchronous WebCrawler code
- Clean up rate limiting documentation

BREAKING CHANGE: Removed synchronous WebCrawler support and related rate limiting configurations

											
										
										
											2025-02-09 18:49:10 +08:00
+. **`max_session_permit`**:
-												refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring

Reorganize dispatcher functionality into separate components:
- Create dedicated dispatcher classes (MemoryAdaptive, Semaphore)
- Add RateLimiter for smart request throttling
- Implement CrawlerMonitor for real-time progress tracking
- Move dispatcher config from CrawlerRunConfig to separate classes

BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.

											
										
										
											2025-01-11 21:10:27 +08:00
+								    - The maximum number of concurrent crawl sessions.
 								    - Helps prevent overwhelming the system.
-												feat(proxy): add proxy rotation strategy

Implements a new proxy rotation system with the following changes:
- Add ProxyRotationStrategy abstract base class
- Add RoundRobinProxyStrategy concrete implementation
- Integrate proxy rotation with AsyncWebCrawler
- Add proxy_rotation_strategy parameter to CrawlerRunConfig
- Add example script demonstrating proxy rotation usage
- Remove deprecated synchronous WebCrawler code
- Clean up rate limiting documentation

BREAKING CHANGE: Removed synchronous WebCrawler support and related rate limiting configurations

											
										
										
											2025-02-09 18:49:10 +08:00
+. **`display_mode`**:
-												refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring

Reorganize dispatcher functionality into separate components:
- Create dedicated dispatcher classes (MemoryAdaptive, Semaphore)
- Add RateLimiter for smart request throttling
- Implement CrawlerMonitor for real-time progress tracking
- Move dispatcher config from CrawlerRunConfig to separate classes

BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.

											
										
										
											2025-01-11 21:10:27 +08:00
+								    - The display mode for progress information (`DETAILED`, `BRIEF`, etc.).
 								    - Affects how much information is printed during the crawl.
-												docs(api): add streaming mode documentation and examples

Add comprehensive documentation for the new streaming mode feature in arun_many():
- Update arun_many() API docs to reflect streaming return type
- Add streaming examples in quickstart and multi-url guides
- Document stream parameter in configuration classes
- Add clone() helper method documentation for configs

This change improves documentation for processing large numbers of URLs efficiently.

											
										
										
											2025-01-19 18:21:34 +08:00
+								### Helper Methods
 								The `clone()` method is particularly useful for creating variations of your crawler configuration:
 								```python
 								# Create a base configuration
 								base_config = CrawlerRunConfig(
 								    cache_mode=CacheMode.ENABLED,
 								    word_count_threshold=200,
 								    wait_until="networkidle"
 								)
 								# Create variations for different use cases
 								stream_config = base_config.clone(
 								    stream=True,  # Enable streaming mode
 								    cache_mode=CacheMode.BYPASS
 								)
 								debug_config = base_config.clone(
 								    page_timeout=120000,  # Longer timeout for debugging
 								    verbose=True
 								)
 								```
 								The `clone()` method:
 								- Creates a new instance with all the same settings
 								- Updates only the specified parameters
 								- Leaves the original configuration unchanged
 								- Perfect for creating variations without repeating all parameters
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								---
-												Feat/llm config (#724)

* feature: Add LlmConfig to easily configure and pass LLM configs to different strategies

* pulled in next branch and resolved conflicts

* feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions

* Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params

* updated tests, docs and readme
											
										
										
											2025-02-21 13:11:37 +05:30
 								## 3. LlmConfig Essentials
 								### Key fields to note
 . **`provider`**:
 								- Which LLM provoder to use.
 								- Possible values are `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)*
 . **`api_token`**:
 								    - Optional. When not provided explicitly, api_token will be read from environment variables based on provider. For example: If a gemini model is passed as provider then,`"GEMINI_API_KEY"` will be read from environment variables
 								    - API token of LLM provider <br/> eg: `api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv"`
 								    - Environment variable - use with prefix "env:" <br/> eg:`api_token = "env: GROQ_API_KEY"`
 . **`base_url`**:
 								   - If your provider has a custom endpoint
 								```python
 								llmConfig = LlmConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
 								```
 								## 4. Putting It All Together
 								In a typical scenario, you define **one** `BrowserConfig` for your crawler session, then create **one or more** `CrawlerRunConfig` & `LlmConfig` depending on each call’s needs:
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
 								```python
 								import asyncio
-												Feat/llm config (#724)

* feature: Add LlmConfig to easily configure and pass LLM configs to different strategies

* pulled in next branch and resolved conflicts

* feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions

* Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params

* updated tests, docs and readme
											
										
										
											2025-02-21 13:11:37 +05:30
+								from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LlmConfig
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
 								async def main():
 								    # 1) Browser config: headless, bigger viewport, no proxy
 								    browser_conf = BrowserConfig(
 								        headless=True,
 								        viewport_width=1280,
 								        viewport_height=720
 								    )
 								    # 2) Example extraction strategy
 								    schema = {
 								        "name": "Articles",
 								        "baseSelector": "div.article",
 								        "fields": [
 								            {"name": "title", "selector": "h2", "type": "text"},
 								            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
 								        ]
 								    }
 								    extraction = JsonCssExtractionStrategy(schema)
-												Feat/llm config (#724)

* feature: Add LlmConfig to easily configure and pass LLM configs to different strategies

* pulled in next branch and resolved conflicts

* feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions

* Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params

* updated tests, docs and readme
											
										
										
											2025-02-21 13:11:37 +05:30
+								    # 3) Example LLM content filtering
 								    gemini_config = LlmConfig(
 								        provider="gemini/gemini-1.5-pro"
 								        api_token = "env:GEMINI_API_TOKEN"
 								    )
 								    # Initialize LLM filter with specific instruction
 								    filter = LLMContentFilter(
 								        llmConfig=gemini_config,  # or your preferred provider
 								        instruction="""
 								        Focus on extracting the core educational content.
 								        Include:
 								        - Key concepts and explanations
 								        - Important code examples
 								        - Essential technical details
 								        Exclude:
 								        - Navigation elements
 								        - Sidebars
 								        - Footer content
 								        Format the output as clean markdown with proper code blocks and headers.
 								        """,
 								        chunk_token_threshold=500,  # Adjust based on your needs
 								        verbose=True
 								    )
 								    md_generator = DefaultMarkdownGenerator(
 								    content_filter=filter,
 								    options={"ignore_links": True}
 								    # 4) Crawler run config: skip cache, use extraction
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								    run_conf = CrawlerRunConfig(
-												Feat/llm config (#724)

* feature: Add LlmConfig to easily configure and pass LLM configs to different strategies

* pulled in next branch and resolved conflicts

* feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions

* Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params

* updated tests, docs and readme
											
										
										
											2025-02-21 13:11:37 +05:30
+								        markdown_generator=md_generator,
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								        extraction_strategy=extraction,
-												refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring

Reorganize dispatcher functionality into separate components:
- Create dedicated dispatcher classes (MemoryAdaptive, Semaphore)
- Add RateLimiter for smart request throttling
- Implement CrawlerMonitor for real-time progress tracking
- Move dispatcher config from CrawlerRunConfig to separate classes

BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.

											
										
										
											2025-01-11 21:10:27 +08:00
+								        cache_mode=CacheMode.BYPASS,
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								    )
 								    async with AsyncWebCrawler(config=browser_conf) as crawler:
 								        # 4) Execute the crawl
 								        result = await crawler.arun(url="https://example.com/news", config=run_conf)
 								        if result.success:
 								            print("Extracted content:", result.extracted_content)
 								        else:
 								            print("Error:", result.error_message)
 								if __name__ == "__main__":
 								    asyncio.run(main())
 								```
 								---
-												Feat/llm config (#724)

* feature: Add LlmConfig to easily configure and pass LLM configs to different strategies

* pulled in next branch and resolved conflicts

* feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions

* Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params

* updated tests, docs and readme
											
										
										
											2025-02-21 13:11:37 +05:30
+								## 5. Next Steps
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
 								For a **detailed list** of available parameters (including advanced ones), see:
-												Feat/llm config (#724)

* feature: Add LlmConfig to easily configure and pass LLM configs to different strategies

* pulled in next branch and resolved conflicts

* feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions

* Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params

* updated tests, docs and readme
											
										
										
											2025-02-21 13:11:37 +05:30
+								- [BrowserConfig, CrawlerRunConfig & LlmConfig Reference](../api/parameters.md)
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
 								You can explore topics like:
 								- **Custom Hooks & Auth** (Inject JavaScript or handle login forms).
 								- **Session Management** (Re-use pages, preserve state across multiple calls).
 								- **Magic Mode** or **Identity-based Crawling** (Fight bot detection by simulating user behavior).
 								- **Advanced Caching** (Fine-tune read/write cache modes).
 								---
-												Feat/llm config (#724)

* feature: Add LlmConfig to easily configure and pass LLM configs to different strategies

* pulled in next branch and resolved conflicts

* feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions

* Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params

* updated tests, docs and readme
											
										
										
											2025-02-21 13:11:37 +05:30
+								## 6. Conclusion
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
-												Feat/llm config (#724)

* feature: Add LlmConfig to easily configure and pass LLM configs to different strategies

* pulled in next branch and resolved conflicts

* feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions

* Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params

* updated tests, docs and readme
											
										
										
											2025-02-21 13:11:37 +05:30
+								**BrowserConfig**, **CrawlerRunConfig** and **LlmConfig** give you straightforward ways to define:
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
 								- **Which** browser to launch, how it should run, and any proxy or user agent needs.
 								- **How** each crawl should behave—caching, timeouts, JavaScript code, extraction strategies, etc.
-												Feat/llm config (#724)

* feature: Add LlmConfig to easily configure and pass LLM configs to different strategies

* pulled in next branch and resolved conflicts

* feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions

* Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params

* updated tests, docs and readme
											
										
										
											2025-02-21 13:11:37 +05:30
+								- **Which** LLM provider to use, api token, temperature and base url for custom endpoints
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
 								Use them together for **clear, maintainable** code, and when you need more specialized behavior, check out the advanced parameters in the [reference docs](../api/parameters.md). Happy crawling!