crawl4ai/docs/examples/hello_world.py

import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    DefaultMarkdownGenerator,
    PruningContentFilter,
    CrawlResult
)

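async def example_cdp():
    """Run a JavaScript snippet through a browser reached over CDP."""
    # Attach to a browser that is already exposing a CDP endpoint on port 9223.
    browser_conf = BrowserConfig(
        headless=False,
        cdp_url="http://localhost:9223",
    )
    # js_only=True asks the crawler to run the script in the session's page
    # instead of performing a full page load.
    crawler_config = CrawlerRunConfig(
        session_id="test",
        js_code="""(() => { return {"result": "Hello World!"} })()""",
        js_only=True,
    )
    async with AsyncWebCrawler(
        config=browser_conf,
        verbose=True,
    ) as crawler:
        result: CrawlResult = await crawler.arun(
            url="https://www.helloworld.org",
            config=crawler_config,
        )
        print(result.js_execution_result)
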

async def main():
    """Crawl a page headlessly and print the start of its markdown."""
    browser_config = BrowserConfig(headless=True, verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Bypass the cache and attach a pruning content filter to the
        # markdown generator.
        crawler_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=PruningContentFilter(
                    threshold=0.48, threshold_type="fixed", min_word_threshold=0
                )
            ),
        )
        result: CrawlResult = await crawler.arun(
            url="https://www.helloworld.org", config=crawler_config
        )
        # Print only the first 500 characters of the raw markdown.
        print(result.markdown.raw_markdown[:500])

if __name__ == "__main__":
    asyncio.run(main())