# Quick Start Guide 🚀
Welcome to the Crawl4AI Quickstart Guide! In this tutorial, we'll walk you through the basic usage of Crawl4AI, covering everything from initial setup to advanced features like chunking and extraction strategies, using asynchronous programming. Let's dive in! 🌟
---
## Getting Started 🛠️
Set up your environment with `BrowserConfig` and create an `AsyncWebCrawler` instance.
```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

async def main():
    browser_config = BrowserConfig(verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Add your crawling logic here
        pass

if __name__ == "__main__":
    asyncio.run(main())
```
---
### Basic Usage
Provide a URL and let Crawl4AI do the work!

```python
from crawl4ai.async_configs import CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(verbose=True)
    crawl_config = CrawlerRunConfig(url="https://www.nbcnews.com/business")
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(config=crawl_config)
        print(f"Basic crawl result: {result.markdown[:500]}")  # Print first 500 characters

if __name__ == "__main__":
    asyncio.run(main())
```
---
### Taking Screenshots 📸
Capture and save webpage screenshots with `CrawlerRunConfig`:
```python
import base64

from crawl4ai.async_configs import CacheMode

async def capture_and_save_screenshot(url: str, output_path: str):
    browser_config = BrowserConfig(verbose=True)
    crawl_config = CrawlerRunConfig(
        url=url,
        screenshot=True,
        cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(config=crawl_config)

        if result.success and result.screenshot:
            # result.screenshot holds base64-encoded image data
            screenshot_data = base64.b64decode(result.screenshot)
            with open(output_path, 'wb') as f:
                f.write(screenshot_data)
            print(f"Screenshot saved successfully to {output_path}")
        else:
            print("Failed to capture screenshot")
```
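The decode-and-save step above can be sanity-checked without running a crawl, since it works on any base64-encoded bytes. A stdlib-only sketch, where `fake_screenshot` stands in for `result.screenshot`:

```python
import base64
import os
import tempfile

def save_base64_screenshot(b64_data: str, output_path: str) -> int:
    """Decode base64 screenshot data, write it to disk, and return bytes written."""
    raw = base64.b64decode(b64_data)
    with open(output_path, "wb") as f:
        f.write(raw)
    return len(raw)

# Stand-in for result.screenshot: the 8-byte PNG magic header, base64-encoded
fake_screenshot = base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii")

path = os.path.join(tempfile.gettempdir(), "demo_screenshot.png")
written = save_base64_screenshot(fake_screenshot, path)
print(f"Wrote {written} bytes to {path}")
```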
---
### Browser Selection 🌐
Choose from multiple browser engines using `BrowserConfig`:
```python
from crawl4ai.async_configs import BrowserConfig

# The snippets below run inside an async function.

# Use Firefox
firefox_config = BrowserConfig(browser_type="firefox", verbose=True, headless=True)
async with AsyncWebCrawler(config=firefox_config) as crawler:
    result = await crawler.arun(config=CrawlerRunConfig(url="https://www.example.com"))

# Use WebKit
webkit_config = BrowserConfig(browser_type="webkit", verbose=True, headless=True)
async with AsyncWebCrawler(config=webkit_config) as crawler:
    result = await crawler.arun(config=CrawlerRunConfig(url="https://www.example.com"))

# Use Chromium (default)
chromium_config = BrowserConfig(verbose=True, headless=True)
async with AsyncWebCrawler(config=chromium_config) as crawler:
    result = await crawler.arun(config=CrawlerRunConfig(url="https://www.example.com"))
```
---
### User Simulation 🎭
Simulate real user behavior to bypass detection:
```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig, CacheMode

browser_config = BrowserConfig(verbose=True, headless=True)
crawl_config = CrawlerRunConfig(
    url="YOUR-URL-HERE",
    cache_mode=CacheMode.BYPASS,
    simulate_user=True,       # Random mouse movements and clicks
    override_navigator=True   # Makes the browser appear like a real user
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(config=crawl_config)
```
---
### Understanding Parameters 🧠
Explore caching and forcing fresh crawls:
```python
async def main():
    browser_config = BrowserConfig(verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # First crawl (uses cache)
        result1 = await crawler.arun(config=CrawlerRunConfig(url="https://www.nbcnews.com/business"))
        print(f"First crawl result: {result1.markdown[:100]}...")

        # Force fresh crawl
        result2 = await crawler.arun(
            config=CrawlerRunConfig(url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS)
        )
        print(f"Second crawl result: {result2.markdown[:100]}...")

if __name__ == "__main__":
    asyncio.run(main())
```
---
### Adding a Chunking Strategy 🧩
Split content into chunks using `RegexChunking`:
```python
from crawl4ai.chunking_strategy import RegexChunking

async def main():
    browser_config = BrowserConfig(verbose=True)
    crawl_config = CrawlerRunConfig(
        url="https://www.nbcnews.com/business",
        chunking_strategy=RegexChunking(patterns=["\n\n"])
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(config=crawl_config)
        print(f"RegexChunking result: {result.extracted_content[:200]}...")

if __name__ == "__main__":
    asyncio.run(main())
```
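Under the hood, regex-based chunking amounts to splitting the text on the configured patterns and discarding empty pieces. A stand-alone sketch of the idea (not Crawl4AI's implementation; `regex_chunk` is a made-up helper):

```python
import re

def regex_chunk(text: str, patterns: list[str]) -> list[str]:
    """Split text on any of the given regex patterns, dropping empty chunks."""
    combined = "|".join(patterns)  # alternation over all patterns
    return [chunk.strip() for chunk in re.split(combined, text) if chunk.strip()]

article = "First paragraph.\n\nSecond paragraph.\n\n\n\nThird paragraph."
chunks = regex_chunk(article, [r"\n\n"])
print(chunks)  # ['First paragraph.', 'Second paragraph.', 'Third paragraph.']
```

Splitting on `"\n\n"` yields one chunk per paragraph, which is why it is a common default pattern for chunking article text.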
---
### Advanced Features and Configurations
For advanced examples (LLM strategies, knowledge graphs, pagination handling), ensure all code aligns with the `BrowserConfig` and `CrawlerRunConfig` pattern shown above.
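As one concrete direction, CSS-based structured extraction is typically driven by a plain schema dict passed to an extraction strategy (e.g. `JsonCssExtractionStrategy` via `CrawlerRunConfig(extraction_strategy=...)`). The selectors and field names below are illustrative only; check the extraction docs for the exact schema keys your Crawl4AI version expects:

```python
# Illustrative extraction schema; selectors and names are made up for this sketch.
schema = {
    "name": "News Articles",
    "baseSelector": "article.tease-card",  # one match per extracted item
    "fields": [
        {"name": "title",   "selector": "h2", "type": "text"},
        {"name": "summary", "selector": "p",  "type": "text"},
        {"name": "link",    "selector": "a",  "type": "attribute", "attribute": "href"},
    ],
}

# Each field maps a CSS selector (relative to baseSelector) to an output key
print([field["name"] for field in schema["fields"]])  # ['title', 'summary', 'link']
```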