# Simple Crawling
This guide covers the basics of web crawling with Crawl4AI. You'll learn how to set up a crawler, make your first request, and understand the response.
## Basic Usage
Set up a simple crawl using `BrowserConfig` and `CrawlerRunConfig`:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig()  # Default browser configuration
    run_config = CrawlerRunConfig()   # Default crawl run configuration

    async with AsyncWebCrawler(browser_config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        print(result.markdown)  # Print clean markdown content

if __name__ == "__main__":
    asyncio.run(main())
```
## Understanding the Response
The `arun()` method returns a `CrawlResult` object with several useful properties. Here's a quick overview (see [CrawlResult](../api/crawl-result.md) for complete details):
```python
result = await crawler.arun(
    url="https://example.com",
    config=CrawlerRunConfig(fit_markdown=True)
)

# Different content formats
print(result.html)          # Raw HTML
print(result.cleaned_html)  # Cleaned HTML
print(result.markdown)      # Markdown version
print(result.fit_markdown)  # Most relevant content in markdown

# Check success status
print(result.success)      # True if crawl succeeded
print(result.status_code)  # HTTP status code (e.g., 200, 404)

# Access extracted media and links
print(result.media)        # Dictionary of found media (images, videos, audio)
print(result.links)        # Dictionary of internal and external links
```
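These properties make it easy to persist output and take a quick inventory of what was extracted. The snippet below is a minimal sketch, assuming `result` comes from a successful `arun()` call as above; the output filename is arbitrary:

```python
if result.success:
    # Save the markdown for later processing (illustrative filename)
    with open("page.md", "w", encoding="utf-8") as f:
        f.write(result.markdown)

    # Quick inventory of extracted media and links
    print(f"Images found:   {len(result.media.get('images', []))}")
    print(f"Internal links: {len(result.links.get('internal', []))}")
    print(f"External links: {len(result.links.get('external', []))}")
```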
## Adding Basic Options
Customize your crawl using `CrawlerRunConfig`:
```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,       # Minimum words per content block
    exclude_external_links=True,   # Remove external links
    remove_overlay_elements=True,  # Remove popups/modals
    process_iframes=True           # Process iframe content
)

result = await crawler.arun(
    url="https://example.com",
    config=run_config
)
```
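Because `CrawlerRunConfig` is a plain configuration object, the same instance can be reused across multiple crawls. A minimal sketch, assuming the `browser_config` and `run_config` from the examples above and an illustrative list of URLs:

```python
urls = ["https://example.com", "https://example.org"]  # illustrative URLs

async with AsyncWebCrawler(browser_config=browser_config) as crawler:
    for url in urls:
        result = await crawler.arun(url=url, config=run_config)
        if result.success:
            print(f"{url}: {len(result.markdown)} characters of markdown")
```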
## Handling Errors
Always check if the crawl was successful:
```python
run_config = CrawlerRunConfig()
result = await crawler.arun(url="https://example.com", config=run_config)

if not result.success:
    print(f"Crawl failed: {result.error_message}")
    print(f"Status code: {result.status_code}")
```
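For transient failures (timeouts, flaky networks) you may want to retry before giving up. The helper below is a minimal sketch, not part of the Crawl4AI API; the attempt count and delay are arbitrary:

```python
import asyncio

async def crawl_with_retries(crawler, url, config, attempts=3, delay=2.0):
    """Retry a crawl a few times before giving up (illustrative helper)."""
    for attempt in range(1, attempts + 1):
        result = await crawler.arun(url=url, config=config)
        if result.success:
            return result
        print(f"Attempt {attempt} failed: {result.error_message}")
        await asyncio.sleep(delay)
    return result  # Last (failed) result
```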
## Logging and Debugging
Enable verbose logging in `BrowserConfig`:
```python
browser_config = BrowserConfig(verbose=True)

async with AsyncWebCrawler(browser_config=browser_config) as crawler:
    run_config = CrawlerRunConfig()
    result = await crawler.arun(url="https://example.com", config=run_config)
```
## Complete Example
Here's a more comprehensive example demonstrating common usage patterns:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        # Content filtering
        word_count_threshold=10,
        excluded_tags=['form', 'header'],
        exclude_external_links=True,

        # Content processing
        process_iframes=True,
        remove_overlay_elements=True,

        # Cache control
        cache_mode=CacheMode.ENABLED  # Use cache if available
    )

    async with AsyncWebCrawler(browser_config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )

        if result.success:
            # Print clean content
            print("Content:", result.markdown[:500])  # First 500 chars

            # Process images
            for image in result.media["images"]:
                print(f"Found image: {image['src']}")

            # Process links
            for link in result.links["internal"]:
                print(f"Internal link: {link['href']}")
        else:
            print(f"Crawl failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```