
# Hooks & Auth in AsyncWebCrawler

Crawl4AI’s **hooks** let you customize the crawler at specific points in the pipeline:

1. **`on_browser_created`** – After browser creation.  
2. **`on_page_context_created`** – After a new context & page are created.  
3. **`before_goto`** – Just before navigating to a page.  
4. **`after_goto`** – Right after navigation completes.  
5. **`on_user_agent_updated`** – Whenever the user agent changes.  
6. **`on_execution_started`** – Once custom JavaScript execution begins.  
7. **`before_retrieve_html`** – Just before the crawler retrieves final HTML.  
8. **`before_return_html`** – Right before returning the HTML content.

**Important**: Avoid heavy tasks in `on_browser_created` since you don’t yet have a page context. If you need to *log in*, do so in **`on_page_context_created`**.

> note "Important Hook Usage Warning"
>
> **Avoid Misusing Hooks**: Do not manipulate page objects in the wrong hook or at the wrong time, as it can crash the pipeline or produce incorrect results. A common mistake is attempting to handle authentication prematurely, such as creating or closing pages in `on_browser_created`.
>
> **Use the Right Hook for Auth**: If you need to log in or set tokens, use `on_page_context_created`. This ensures you have a valid page/context to work with, without disrupting the main crawling flow.
>
> **Identity-Based Crawling**: For robust auth, consider identity-based crawling (or passing a session ID) to preserve state. Run your initial login steps in a separate, well-defined process, then feed that session to your main crawl, rather than shoehorning complex authentication into early hooks. Check out [Identity-Based Crawling](../advanced/identity-based-crawling.md) for more details.
>
> **Be Cautious**: Overwriting or removing elements in the wrong hook can compromise the final crawl. Keep hooks focused on smaller tasks (like route filters, custom headers), and let your main logic (crawling, data extraction) proceed normally.


The following example registers all eight hooks and shows the recommended usage of each.

---

## Example: Using Hooks in AsyncWebCrawler

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from playwright.async_api import Page, BrowserContext

async def main():
    print("🔗 Hooks Example: Demonstrating recommended usage")

    # 1) Configure the browser
    browser_config = BrowserConfig(
        headless=True,
        verbose=True
    )

    # 2) Configure the crawler run
    crawler_run_config = CrawlerRunConfig(
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        wait_for="body",
        cache_mode=CacheMode.BYPASS
    )

    # 3) Create the crawler instance
    crawler = AsyncWebCrawler(config=browser_config)

    #
    # Define Hook Functions
    #

    async def on_browser_created(browser, **kwargs):
        # Called once the browser instance is created (but no pages or contexts yet)
        print("[HOOK] on_browser_created - Browser created successfully!")
        # Typically, do minimal setup here if needed
        return browser

    async def on_page_context_created(page: Page, context: BrowserContext, **kwargs):
        # Called right after a new page + context are created (ideal for auth or route config).
        print("[HOOK] on_page_context_created - Setting up page & context.")
        
        # Example 1: Route filtering (e.g., block images)
        async def route_filter(route):
            if route.request.resource_type == "image":
                print(f"[HOOK] Blocking image request: {route.request.url}")
                await route.abort()
            else:
                await route.continue_()

        await context.route("**", route_filter)

        # Example 2: (Optional) Simulate a login scenario
        # (We do NOT create or close pages here, just do quick steps if needed)
        # e.g., await page.goto("https://example.com/login")
        # e.g., await page.fill("input[name='username']", "testuser")
        # e.g., await page.fill("input[name='password']", "password123")
        # e.g., await page.click("button[type='submit']")
        # e.g., await page.wait_for_selector("#welcome")
        # e.g., await context.add_cookies([...])
        # Then continue

        # Example 3: Adjust the viewport
        await page.set_viewport_size({"width": 1080, "height": 600})
        return page

    async def before_goto(
        page: Page, context: BrowserContext, url: str, **kwargs
    ):
        # Called before navigating to each URL.
        print(f"[HOOK] before_goto - About to navigate: {url}")
        # e.g., inject custom headers
        await page.set_extra_http_headers({
            "Custom-Header": "my-value"
        })
        return page

    async def after_goto(
        page: Page, context: BrowserContext, 
        url: str, response, **kwargs
    ):
        # Called after navigation completes.
        print(f"[HOOK] after_goto - Successfully loaded: {url}")
        # e.g., wait for a certain element if we want to verify
        try:
            await page.wait_for_selector('.content', timeout=1000)
            print("[HOOK] Found .content element!")
        except Exception:
            print("[HOOK] .content not found, continuing anyway.")
        return page

    async def on_user_agent_updated(
        page: Page, context: BrowserContext, 
        user_agent: str, **kwargs
    ):
        # Called whenever the user agent updates.
        print(f"[HOOK] on_user_agent_updated - New user agent: {user_agent}")
        return page

    async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
        # Called after custom JavaScript execution begins.
        print("[HOOK] on_execution_started - JS code is running!")
        return page

    async def before_retrieve_html(page: Page, context: BrowserContext, **kwargs):
        # Called before final HTML retrieval.
        print("[HOOK] before_retrieve_html - We can do final actions")
        # Example: Scroll again
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
        return page

    async def before_return_html(
        page: Page, context: BrowserContext, html: str, **kwargs
    ):
        # Called just before returning the HTML in the result.
        print(f"[HOOK] before_return_html - HTML length: {len(html)}")
        return page

    #
    # Attach Hooks
    #

    crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
    crawler.crawler_strategy.set_hook(
        "on_page_context_created", on_page_context_created
    )
    crawler.crawler_strategy.set_hook("before_goto", before_goto)
    crawler.crawler_strategy.set_hook("after_goto", after_goto)
    crawler.crawler_strategy.set_hook(
        "on_user_agent_updated", on_user_agent_updated
    )
    crawler.crawler_strategy.set_hook(
        "on_execution_started", on_execution_started
    )
    crawler.crawler_strategy.set_hook(
        "before_retrieve_html", before_retrieve_html
    )
    crawler.crawler_strategy.set_hook(
        "before_return_html", before_return_html
    )

    await crawler.start()

    # 4) Run the crawler on an example page
    url = "https://example.com"
    result = await crawler.arun(url, config=crawler_run_config)
    
    if result.success:
        print("\nCrawled URL:", result.url)
        print("HTML length:", len(result.html))
    else:
        print("Error:", result.error_message)

    await crawler.close()

if __name__ == "__main__":
    asyncio.run(main())
```

---

## Hook Lifecycle Summary

1. **`on_browser_created`**:  
   - Browser is up, but **no** pages or contexts yet.  
   - Light setup only—don’t try to open or close pages here (that belongs in `on_page_context_created`).

2. **`on_page_context_created`**:  
   - Perfect for advanced **auth** or route blocking.  
   - You have a **page** + **context** ready but haven’t navigated to the target URL yet.

3. **`before_goto`**:  
   - Right before navigation. Typically used for setting **custom headers** or logging the target URL.

4. **`after_goto`**:  
   - After page navigation is done. Good place for verifying content or waiting on essential elements. 

5. **`on_user_agent_updated`**:  
   - Whenever the user agent changes (for stealth or different UA modes).

6. **`on_execution_started`**:  
   - If you set `js_code` or run custom scripts, this fires once execution of that JS has begun.

7. **`before_retrieve_html`**:  
   - Just before the final HTML snapshot is taken. Often you do a final scroll or lazy-load triggers here.

8. **`before_return_html`**:  
   - The last hook before returning HTML to the `CrawlResult`. Good for logging HTML length or minor modifications.
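
To see this lifecycle on a real crawl, it can help to attach a tracer to every hook and inspect the order afterwards. The sketch below is one way to do that. `HOOK_NAMES`, `make_tracer`, and `build_tracers` are hypothetical helper names, not part of Crawl4AI, and the tracer assumes each hook receives its main object (browser or page) as the first positional argument, as in the example above.

```python
# Hook names as listed in this document.
HOOK_NAMES = [
    "on_browser_created",
    "on_page_context_created",
    "before_goto",
    "after_goto",
    "on_user_agent_updated",
    "on_execution_started",
    "before_retrieve_html",
    "before_return_html",
]


def make_tracer(name, log):
    """Build a hook that only records its own name, then passes its first argument through."""

    async def tracer(first, *args, **kwargs):
        log.append(name)
        # Return the browser/page unchanged, like the hooks in the main example.
        return first

    tracer.__name__ = name
    return tracer


def build_tracers(log):
    """Return one tracer per hook name, all appending to the same list."""
    return {name: make_tracer(name, log) for name in HOOK_NAMES}
```

Each tracer can then be attached with `crawler.crawler_strategy.set_hook(name, tracer)` before calling `crawler.start()`. After `arun()` returns, the shared list holds the hook names in the order they fired for that run.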

---

## When to Handle Authentication

**Recommended**: Use **`on_page_context_created`** if you need to:

- Navigate to a login page or fill forms
- Set cookies or localStorage tokens
- Block resource routes to avoid ads

This ensures the newly created context is under your control **before** `arun()` navigates to the main URL.
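
As a minimal sketch of the cookie and localStorage cases, the hook below seeds a fresh context with credentials using Playwright's `BrowserContext.add_cookies` and `BrowserContext.add_init_script`. The cookie name, token value, and storage key are placeholders chosen for illustration. Adapt them to whatever your target site actually expects.

```python
import json

# Hypothetical credentials, for illustration only.
SESSION_COOKIE = {
    "name": "session_id",
    "value": "example-session",
    "url": "https://example.com",
}
AUTH_TOKEN = "example-token"


async def on_page_context_created(page, context, **kwargs):
    """Seed the new context with auth state before the crawler navigates."""
    # Cookies live on the context, so requests from this page will send them.
    await context.add_cookies([SESSION_COOKIE])

    # localStorage only exists once a document has loaded, so register an
    # init script that writes the token at the start of every new document.
    # json.dumps produces a safely quoted JavaScript string literal.
    await context.add_init_script(
        script=f"window.localStorage.setItem('auth_token', {json.dumps(AUTH_TOKEN)});"
    )
    return page
```

Note that an init script runs for every document opened in the context, including pages on other origins, so this approach fits best when the crawl stays on a single site.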

---

## Additional Considerations

- **Session Management**: If you want multiple `arun()` calls to reuse a single session, pass `session_id=` in your `CrawlerRunConfig`. Hooks remain the same.  
- **Performance**: Hooks can slow down crawling if they do heavy tasks. Keep them concise.  
- **Error Handling**: If a hook fails, the overall crawl might fail. Catch exceptions or handle them gracefully.  
- **Concurrency**: If you run `arun_many()`, hooks for different URLs can run concurrently. Ensure any state your hooks share is thread/async-safe.
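
For the error-handling point, one possible pattern is to wrap non-critical hooks so that an exception is logged instead of propagating. The `safe_hook` decorator below is a hypothetical helper, not a Crawl4AI API. It assumes the hook's first positional argument is the object it would normally return (browser or page).

```python
import functools


def safe_hook(hook):
    """Wrap a hook so that an exception is reported rather than raised."""

    @functools.wraps(hook)
    async def wrapper(first, *args, **kwargs):
        try:
            return await hook(first, *args, **kwargs)
        except Exception as exc:
            print(f"[HOOK] {hook.__name__} failed: {exc!r}")
            # Fall back to the untouched browser/page so the crawl can go on.
            return first

    return wrapper
```

It could be applied at registration time, for example `crawler.crawler_strategy.set_hook("after_goto", safe_hook(after_goto))`. Swallowing errors is a trade-off: it suits optional steps such as logging or extra scrolling, while a failed login hook is usually better left to raise.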

---

## Conclusion

Hooks provide **fine-grained** control over:

- **Browser** creation (light tasks only)
- **Page** and **context** creation (auth, route blocking)
- **Navigation** phases
- **Final HTML** retrieval

Follow the recommended usage:

- **Login** or advanced tasks in `on_page_context_created`  
- **Custom headers** or logs in `before_goto` / `after_goto`  
- **Scrolling** or final checks in `before_retrieve_html` / `before_return_html`
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								# Hooks & Auth in AsyncWebCrawler
-												ADD MKDocs

											
										
										
											2024-06-21 17:56:54 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								Crawl4AI’s **hooks** let you customize the crawler at specific points in the pipeline:
-												ADD MKDocs

											
										
										
											2024-06-21 17:56:54 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+. **`on_browser_created`** – After browser creation.
 . **`on_page_context_created`** – After a new context & page are created.
 . **`before_goto`** – Just before navigating to a page.
 . **`after_goto`** – Right after navigation completes.
 . **`on_user_agent_updated`** – Whenever the user agent changes.
 . **`on_execution_started`** – Once custom JavaScript execution begins.
 . **`before_retrieve_html`** – Just before the crawler retrieves final HTML.
 . **`before_return_html`** – Right before returning the HTML content.
-												ADD MKDocs

											
										
										
											2024-06-21 17:56:54 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								**Important**: Avoid heavy tasks in `on_browser_created` since you don’t yet have a page context. If you need to *log in*, do so in **`on_page_context_created`**.
-												ADD MKDocs

											
										
										
											2024-06-21 17:56:54 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								> note "Important Hook Usage Warning"
 								    **Avoid Misusing Hooks**: Do not manipulate page objects in the wrong hook or at the wrong time, as it can crash the pipeline or produce incorrect results. A common mistake is attempting to handle authentication prematurely—such as creating or closing pages in `on_browser_created`.
-												ADD MKDocs

											
										
										
											2024-06-21 17:56:54 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								>   **Use the Right Hook for Auth**: If you need to log in or set tokens, use `on_page_context_created`. This ensures you have a valid page/context to work with, without disrupting the main crawling flow.
-												ADD MKDocs

											
										
										
											2024-06-21 17:56:54 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								>    **Identity-Based Crawling**: For robust auth, consider identity-based crawling (or passing a session ID) to preserve state. Run your initial login steps in a separate, well-defined process, then feed that session to your main crawl—rather than shoehorning complex authentication into early hooks. Check out [Identity-Based Crawling](../advanced/identity-based-crawling.md) for more details.
 								>    **Be Cautious**: Overwriting or removing elements in the wrong hook can compromise the final crawl. Keep hooks focused on smaller tasks (like route filters, custom headers), and let your main logic (crawling, data extraction) proceed normally.
-												ADD MKDocs

											
										
										
											2024-06-21 17:56:54 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
 								Below is an example demonstration.
 								---
 								## Example: Using Hooks in AsyncWebCrawler
-												ADD MKDocs

											
										
										
											2024-06-21 17:56:54 +08:00
 								```python
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								import asyncio
 								import json
 								from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
 								from playwright.async_api import Page, BrowserContext
-												Push async version last changes for merge to main branch

											
										
										
											2024-09-24 20:52:08 +08:00
+								async def main():
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								    print("🔗 Hooks Example: Demonstrating recommended usage")
-												  Enhance Crawl4AI with new features and documentation

  - Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags.
  - Introduced Managed Browsers for enhanced crawling experience.
  - Updated documentation for clearer navigation on configuration.
  - Changed 'text_only' to 'text_mode' in configuration and methods.
  - Improved performance and relevance in content filtering strategies.

											
										
										
											2024-12-19 21:02:29 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								    # 1) Configure the browser
-												  Enhance Crawl4AI with new features and documentation

  - Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags.
  - Introduced Managed Browsers for enhanced crawling experience.
  - Updated documentation for clearer navigation on configuration.
  - Changed 'text_only' to 'text_mode' in configuration and methods.
  - Improved performance and relevance in content filtering strategies.

											
										
										
											2024-12-19 21:02:29 +08:00
+								    browser_config = BrowserConfig(
 								        headless=True,
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								        verbose=True
-												  Enhance Crawl4AI with new features and documentation

  - Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags.
  - Introduced Managed Browsers for enhanced crawling experience.
  - Updated documentation for clearer navigation on configuration.
  - Changed 'text_only' to 'text_mode' in configuration and methods.
  - Improved performance and relevance in content filtering strategies.

											
										
										
											2024-12-19 21:02:29 +08:00
+								    )
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
 								    # 2) Configure the crawler run
-												  Enhance Crawl4AI with new features and documentation

  - Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags.
  - Introduced Managed Browsers for enhanced crawling experience.
  - Updated documentation for clearer navigation on configuration.
  - Changed 'text_only' to 'text_mode' in configuration and methods.
  - Improved performance and relevance in content filtering strategies.

											
										
										
											2024-12-19 21:02:29 +08:00
+								    crawler_run_config = CrawlerRunConfig(
 								        js_code="window.scrollTo(0, document.body.scrollHeight);",
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								        wait_for="body",
 								        cache_mode=CacheMode.BYPASS
-												  Enhance Crawl4AI with new features and documentation

  - Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags.
  - Introduced Managed Browsers for enhanced crawling experience.
  - Updated documentation for clearer navigation on configuration.
  - Changed 'text_only' to 'text_mode' in configuration and methods.
  - Improved performance and relevance in content filtering strategies.

											
										
										
											2024-12-19 21:02:29 +08:00
+								    )
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								    # 3) Create the crawler instance
 								    crawler = AsyncWebCrawler(config=browser_config)
 								    #
 								    # Define Hook Functions
 								    #
 								    async def on_browser_created(browser, **kwargs):
 								        # Called once the browser instance is created (but no pages or contexts yet)
 								        print("[HOOK] on_browser_created - Browser created successfully!")
 								        # Typically, do minimal setup here if needed
 								        return browser
 								    async def on_page_context_created(page: Page, context: BrowserContext, **kwargs):
 								        # Called right after a new page + context are created (ideal for auth or route config).
 								        print("[HOOK] on_page_context_created - Setting up page & context.")
 								        # Example 1: Route filtering (e.g., block images)
 								        async def route_filter(route):
 								            if route.request.resource_type == "image":
 								                print(f"[HOOK] Blocking image request: {route.request.url}")
 								                await route.abort()
 								            else:
 								                await route.continue_()
 								        await context.route("**", route_filter)
 								        # Example 2: (Optional) Simulate a login scenario
 								        # (We do NOT create or close pages here, just do quick steps if needed)
 								        # e.g., await page.goto("https://example.com/login")
 								        # e.g., await page.fill("input[name='username']", "testuser")
 								        # e.g., await page.fill("input[name='password']", "password123")
 								        # e.g., await page.click("button[type='submit']")
 								        # e.g., await page.wait_for_selector("#welcome")
 								        # e.g., await context.add_cookies([...])
 								        # Then continue
 								        # Example 3: Adjust the viewport
 								        await page.set_viewport_size({"width": 1080, "height": 600})
 								        return page
 								    async def before_goto(
 								        page: Page, context: BrowserContext, url: str, **kwargs
 								    ):
 								        # Called before navigating to each URL.
 								        print(f"[HOOK] before_goto - About to navigate: {url}")
 								        # e.g., inject custom headers
 								        await page.set_extra_http_headers({
 								            "Custom-Header": "my-value"
 								        })
 								        return page
 								    async def after_goto(
 								        page: Page, context: BrowserContext,
 								        url: str, response, **kwargs
 								    ):
 								        # Called after navigation completes.
 								        print(f"[HOOK] after_goto - Successfully loaded: {url}")
 								        # e.g., wait for a certain element if we want to verify
 								        try:
 								            await page.wait_for_selector('.content', timeout=1000)
 								            print("[HOOK] Found .content element!")
 								        except:
 								            print("[HOOK] .content not found, continuing anyway.")
 								        return page
 								    async def on_user_agent_updated(
 								        page: Page, context: BrowserContext,
 								        user_agent: str, **kwargs
 								    ):
 								        # Called whenever the user agent updates.
 								        print(f"[HOOK] on_user_agent_updated - New user agent: {user_agent}")
 								        return page
 								    async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
 								        # Called after custom JavaScript execution begins.
 								        print("[HOOK] on_execution_started - JS code is running!")
 								        return page
 								    async def before_retrieve_html(page: Page, context: BrowserContext, **kwargs):
 								        # Called before final HTML retrieval.
 								        print("[HOOK] before_retrieve_html - We can do final actions")
 								        # Example: Scroll again
 								        await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
 								        return page
 								    async def before_return_html(
 								        page: Page, context: BrowserContext, html: str, **kwargs
 								    ):
 								        # Called just before returning the HTML in the result.
 								        print(f"[HOOK] before_return_html - HTML length: {len(html)}")
 								        return page
 								    #
 								    # Attach Hooks
 								    #
 								    crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
 								    crawler.crawler_strategy.set_hook(
 								        "on_page_context_created", on_page_context_created
 								    )
 								    crawler.crawler_strategy.set_hook("before_goto", before_goto)
 								    crawler.crawler_strategy.set_hook("after_goto", after_goto)
 								    crawler.crawler_strategy.set_hook(
 								        "on_user_agent_updated", on_user_agent_updated
 								    )
 								    crawler.crawler_strategy.set_hook(
 								        "on_execution_started", on_execution_started
 								    )
 								    crawler.crawler_strategy.set_hook(
 								        "before_retrieve_html", before_retrieve_html
 								    )
 								    crawler.crawler_strategy.set_hook(
 								        "before_return_html", before_return_html
 								    )
-												  Enhance Crawl4AI with new features and documentation

  - Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags.
  - Introduced Managed Browsers for enhanced crawling experience.
  - Updated documentation for clearer navigation on configuration.
  - Changed 'text_only' to 'text_mode' in configuration and methods.
  - Improved performance and relevance in content filtering strategies.

											
										
										
											2024-12-19 21:02:29 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								    await crawler.start()
 								    # 4) Run the crawler on an example page
 								    url = "https://example.com"
 								    result = await crawler.arun(url, config=crawler_run_config)
 								    if result.success:
 								        print("\nCrawled URL:", result.url)
 								        print("HTML length:", len(result.html))
 								    else:
 								        print("Error:", result.error_message)
    await crawler.close()

if __name__ == "__main__":
    asyncio.run(main())
```

---

## Hook Lifecycle Summary

1. **`on_browser_created`**:
   - Browser is up, but **no** pages or contexts yet.
   - Light setup only; don’t try to open or close pages here (that belongs in `on_page_context_created`).

2. **`on_page_context_created`**:
   - Perfect for advanced **auth** or route blocking.
   - You have a **page** + **context** ready but haven’t navigated to the target URL yet.

3. **`before_goto`**:
   - Right before navigation. Typically used for setting **custom headers** or logging the target URL.

4. **`after_goto`**:
   - After page navigation is done. Good place for verifying content or waiting on essential elements.

5. **`on_user_agent_updated`**:
   - Whenever the user agent changes (for stealth or different UA modes).

6. **`on_execution_started`**:
   - If you set `js_code` or run custom scripts, this runs once your JS is about to start.

7. **`before_retrieve_html`**:
   - Just before the final HTML snapshot is taken. Often you do a final scroll or lazy-load triggers here.

8. **`before_return_html`**:
   - The last hook before returning HTML to the `CrawlResult`. Good for logging HTML length or minor modifications.
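
One way to see this lifecycle on a real crawl is to attach a tracing hook to every hook point and inspect the order afterwards. The sketch below is a minimal illustration, not part of Crawl4AI: `HOOK_NAMES`, `make_tracing_hooks`, and `register_hooks` are hypothetical helpers, and the only library call assumed is the strategy's `set_hook(name, fn)` method used elsewhere in this guide.

```python
# Hook names in the order the summary above lists them.
HOOK_NAMES = [
    "on_browser_created",
    "on_page_context_created",
    "before_goto",
    "after_goto",
    "on_user_agent_updated",
    "on_execution_started",
    "before_retrieve_html",
    "before_return_html",
]


def make_tracing_hooks(trace):
    """Build one async hook per name; each appends its name to `trace`.

    Hypothetical helper for this sketch, not a Crawl4AI function.
    """
    def build(name):
        async def hook(first, *args, **kwargs):
            trace.append(name)
            # Hand back the object the hook received first (browser or page).
            return first
        return hook

    return {name: build(name) for name in HOOK_NAMES}


def register_hooks(strategy, hooks):
    """Attach every hook through the strategy's `set_hook(name, fn)` method."""
    for name, fn in hooks.items():
        strategy.set_hook(name, fn)
```

With a real crawler, `strategy` would presumably be `crawler.crawler_strategy`. After `arun()` returns, `trace` holds the hooks that actually fired, in order; hooks whose trigger never occurred (for example `on_execution_started` when no `js_code` is set) would simply be absent.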

---

## When to Handle Authentication

**Recommended**: Use **`on_page_context_created`** if you need to:

- Navigate to a login page or fill forms
- Set cookies or localStorage tokens
- Block resource routes to avoid ads

This ensures the newly created context is under your control **before** `arun()` navigates to the main URL.
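
As a rough sketch of those tasks, the hook below sets a cookie and installs a route filter before any navigation happens. The cookie values and the glob pattern are placeholders chosen for illustration, and the filter blocks images rather than ads to keep the example short. It relies only on Playwright's `BrowserContext.add_cookies`, `BrowserContext.route`, and `Route.abort`.

```python
# Hypothetical cookie for illustration; replace with your own values.
AUTH_COOKIE = {
    "name": "session",
    "value": "example-token",
    "url": "https://example.com",
}

# Glob pattern (Playwright syntax) matching common image extensions.
IMAGE_PATTERN = "**/*.{png,jpg,jpeg,gif,webp}"


async def on_page_context_created(page, context, **kwargs):
    """Prepare the fresh context before `arun()` navigates to the target URL."""
    # add_cookies takes a list of cookie dicts (name, value, and url or domain/path).
    await context.add_cookies([AUTH_COOKIE])

    # Abort matching requests; everything else passes through untouched.
    async def block_images(route):
        await route.abort()

    await context.route(IMAGE_PATTERN, block_images)

    # Return the page so the crawler can continue with it.
    return page
```

The hook would be registered like any other, through `crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created)`. A form-based login would follow the same shape, using `page.goto(...)` and `page.fill(...)` inside the hook instead of a cookie.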

---

## Additional Considerations

- **Session Management**: If you want multiple `arun()` calls to reuse a single session, pass `session_id=` in your `CrawlerRunConfig`. Hooks remain the same.
- **Performance**: Hooks can slow down crawling if they do heavy tasks. Keep them concise.
- **Error Handling**: If a hook fails, the overall crawl might fail. Catch exceptions or handle them gracefully.
- **Concurrency**: If you run `arun_many()`, each URL triggers these hooks in parallel. Ensure your hooks are thread/async-safe.
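
The error-handling and concurrency points can be sketched with plain `asyncio`. Both helpers below are hypothetical and not part of Crawl4AI. `safe_hook` logs a failing hook instead of letting the exception propagate, and `SharedToken` uses an `asyncio.Lock` so that hooks running in parallel fetch a shared value only once.

```python
import asyncio
import functools
import logging

logger = logging.getLogger(__name__)


def safe_hook(hook):
    """Wrap an async hook so an exception is logged rather than raised."""
    @functools.wraps(hook)
    async def wrapper(first, *args, **kwargs):
        try:
            return await hook(first, *args, **kwargs)
        except Exception:
            logger.exception("Hook %r failed; continuing", hook.__name__)
            # Fall back to the object the hook received (page or browser).
            return first
    return wrapper


class SharedToken:
    """Fetch a value once, even when many hooks request it concurrently."""

    def __init__(self, fetch):
        self._fetch = fetch  # async callable that returns the token
        self._lock = asyncio.Lock()
        self._token = None

    async def get(self):
        # The lock covers the check-then-await-then-assign sequence.
        async with self._lock:
            if self._token is None:
                self._token = await self._fetch()
            return self._token
```

Swallowing errors suits optional hooks such as logging or scrolling; for a login hook it is usually better to let the exception surface so the crawl does not continue unauthenticated. Creating `SharedToken` inside your `main()` coroutine keeps the lock tied to the running event loop on older Python versions.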

---

## Conclusion

Hooks provide **fine-grained** control over:

- **Browser** creation (light tasks only)
- **Page** and **context** creation (auth, route blocking)
- **Navigation** phases
- **Final HTML** retrieval

Follow the recommended usage:

- **Login** or advanced tasks in `on_page_context_created`
- **Custom headers** or logs in `before_goto` / `after_goto`
- **Scrolling** or final checks in `before_retrieve_html` / `before_return_html`