# Simple Crawling
This guide covers the basics of web crawling with Crawl4AI. You'll learn how to set up a crawler, make your first request, and understand the response.
## Basic Usage
Set up a simple crawl using `BrowserConfig` and `CrawlerRunConfig`:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig()  # Default browser configuration
    run_config = CrawlerRunConfig()   # Default crawl run configuration

    async with AsyncWebCrawler(browser_config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        print(result.markdown)  # Print clean markdown content

if __name__ == "__main__":
    asyncio.run(main())
```
## Understanding the Response
The `arun()` method returns a `CrawlResult` object with several useful properties. Here's a quick overview (see [CrawlResult](../api/crawl-result.md) for complete details):
```python
result = await crawler.arun(
    url="https://example.com",
    config=CrawlerRunConfig(fit_markdown=True)
)

# Different content formats
print(result.html)          # Raw HTML
print(result.cleaned_html)  # Cleaned HTML
print(result.markdown)      # Markdown version
print(result.fit_markdown)  # Most relevant content in markdown

# Check success status
print(result.success)      # True if crawl succeeded
print(result.status_code)  # HTTP status code (e.g., 200, 404)

# Access extracted media and links
print(result.media)        # Dictionary of found media (images, videos, audio)
print(result.links)        # Dictionary of internal and external links
```
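These properties make it easy to persist output and take a quick inventory of what was extracted. The snippet below is a minimal sketch, assuming `result` comes from a successful `arun()` call as above; the output filename is arbitrary:

```python
if result.success:
    # Save the markdown for later processing (illustrative filename)
    with open("page.md", "w", encoding="utf-8") as f:
        f.write(result.markdown)

    # Quick inventory of extracted media and links
    print(f"Images found:   {len(result.media.get('images', []))}")
    print(f"Internal links: {len(result.links.get('internal', []))}")
    print(f"External links: {len(result.links.get('external', []))}")
```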
## Adding Basic Options
Customize your crawl using `CrawlerRunConfig`:
```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,       # Minimum words per content block
    exclude_external_links=True,   # Remove external links
    remove_overlay_elements=True,  # Remove popups/modals
    process_iframes=True           # Process iframe content
)

result = await crawler.arun(
    url="https://example.com",
    config=run_config
)
```
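Because `CrawlerRunConfig` is a plain configuration object, the same instance can be reused across multiple crawls. A minimal sketch, assuming the `browser_config` and `run_config` from the examples above and an illustrative list of URLs:

```python
urls = ["https://example.com", "https://example.org"]  # illustrative URLs

async with AsyncWebCrawler(browser_config=browser_config) as crawler:
    for url in urls:
        result = await crawler.arun(url=url, config=run_config)
        if result.success:
            print(f"{url}: {len(result.markdown)} characters of markdown")
```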
## Handling Errors
Always check if the crawl was successful:
```python
run_config = CrawlerRunConfig()
result = await crawler.arun(url="https://example.com", config=run_config)

if not result.success:
    print(f"Crawl failed: {result.error_message}")
    print(f"Status code: {result.status_code}")
```
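For transient failures (timeouts, flaky networks) you may want to retry before giving up. The helper below is a minimal sketch, not part of the Crawl4AI API; the attempt count and delay are arbitrary:

```python
import asyncio

async def crawl_with_retries(crawler, url, config, attempts=3, delay=2.0):
    """Retry a crawl a few times before giving up (illustrative helper)."""
    for attempt in range(1, attempts + 1):
        result = await crawler.arun(url=url, config=config)
        if result.success:
            return result
        print(f"Attempt {attempt} failed: {result.error_message}")
        await asyncio.sleep(delay)
    return result  # Last (failed) result
```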
## Logging and Debugging
Enable verbose logging in `BrowserConfig`:
```python
browser_config = BrowserConfig(verbose=True)

async with AsyncWebCrawler(browser_config=browser_config) as crawler:
    run_config = CrawlerRunConfig()
    result = await crawler.arun(url="https://example.com", config=run_config)
```
## Complete Example
Here's a more comprehensive example demonstrating common usage patterns:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        # Content filtering
        word_count_threshold=10,
        excluded_tags=['form', 'header'],
        exclude_external_links=True,

        # Content processing
        process_iframes=True,
        remove_overlay_elements=True,

        # Cache control
        cache_mode=CacheMode.ENABLED  # Use cache if available
    )

    async with AsyncWebCrawler(browser_config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )

        if result.success:
            # Print clean content
            print("Content:", result.markdown[:500])  # First 500 chars

            # Process images
            for image in result.media["images"]:
                print(f"Found image: {image['src']}")

            # Process links
            for link in result.links["internal"]:
                print(f"Internal link: {link['href']}")
        else:
            print(f"Crawl failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```