# Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling & AI Integration Solution

Crawl4AI, the **#1 trending GitHub repository**, streamlines web content extraction into AI-ready formats. Perfect for AI assistants, semantic search engines, or data pipelines, Crawl4AI transforms raw HTML into structured Markdown or JSON effortlessly. Integrate it with LLMs, open-source models, or your own retrieval-augmented generation (RAG) workflows.

**What Crawl4AI is not:**

Crawl4AI is not a replacement for traditional web scraping libraries, Selenium, or Playwright, and it is not designed as a general-purpose web automation tool. Instead, it has a specific, focused goal:

- To generate perfect, AI-friendly data (particularly for LLMs) from web content
- To maximize speed and efficiency in data extraction and processing
- To operate at scale, from a Raspberry Pi to cloud infrastructure

Crawl4AI is engineered with a "scale-first" mindset, aiming to handle millions of links while maintaining exceptional performance. It is optimized to:

1. Transform raw web content into structured, LLM-ready formats (Markdown/JSON)
2. Implement intelligent extraction strategies to reduce reliance on costly API calls
3. Provide a streamlined pipeline for AI data preparation and ingestion

In essence, Crawl4AI bridges the gap between web content and AI systems, focusing on delivering high-quality, processed data rather than offering broad web automation capabilities.

**Key Links:**

- **Website:** [https://crawl4ai.com](https://crawl4ai.com)
- **GitHub:** [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- **Colab Notebook:** [Try on Google Colab](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
- **Quickstart Code Example:** [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py)
- **Examples Folder:** [Crawl4AI Examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples)

---
## Table of Contents

- [Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling \& AI Integration Solution](#crawl4ai-quick-start-guide-your-all-in-one-ai-ready-web-crawling--ai-integration-solution)
  - [Table of Contents](#table-of-contents)
  - [1. Introduction \& Key Concepts](#1-introduction--key-concepts)
  - [2. Installation \& Environment Setup](#2-installation--environment-setup)
    - [Test Your Installation](#test-your-installation)
  - [3. Core Concepts \& Configuration](#3-core-concepts--configuration)
  - [4. Basic Crawling \& Simple Extraction](#4-basic-crawling--simple-extraction)
  - [5. Markdown Generation \& AI-Optimized Output](#5-markdown-generation--ai-optimized-output)
  - [6. Structured Data Extraction (CSS, XPath, LLM)](#6-structured-data-extraction-css-xpath-llm)
  - [7. Advanced Extraction: LLM \& Open-Source Models](#7-advanced-extraction-llm--open-source-models)
  - [8. Page Interactions, JS Execution, \& Dynamic Content](#8-page-interactions-js-execution--dynamic-content)
  - [9. Media, Links, \& Metadata Handling](#9-media-links--metadata-handling)
  - [10. Authentication \& Identity Preservation](#10-authentication--identity-preservation)
    - [Manual Setup via User Data Directory](#manual-setup-via-user-data-directory)
    - [Using `storage_state`](#using-storage_state)
  - [11. Proxy \& Security Enhancements](#11-proxy--security-enhancements)
  - [12. Screenshots, PDFs \& File Downloads](#12-screenshots-pdfs--file-downloads)
  - [13. Caching \& Performance Optimization](#13-caching--performance-optimization)
  - [14. Hooks for Custom Logic](#14-hooks-for-custom-logic)
  - [15. Dockerization \& Scaling](#15-dockerization--scaling)
  - [16. Troubleshooting \& Common Pitfalls](#16-troubleshooting--common-pitfalls)
  - [17. Comprehensive End-to-End Example](#17-comprehensive-end-to-end-example)
  - [18. Further Resources \& Community](#18-further-resources--community)

---
## 1. Introduction & Key Concepts

Crawl4AI transforms websites into structured, AI-friendly data. It efficiently handles large-scale crawling, integrates with both proprietary and open-source LLMs, and optimizes content for semantic search or RAG pipelines.

**Quick Test:**

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def test_run():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown)

asyncio.run(test_run())
```

If you see Markdown output, everything is working!

**More info:** [See /docs/introduction](#) or [1_introduction.ex.md](https://github.com/unclecode/crawl4ai/blob/main/introduction.ex.md)

---
## 2. Installation & Environment Setup

```bash
# Install the package
pip install crawl4ai
crawl4ai-setup

# Install Playwright with system dependencies (recommended)
playwright install --with-deps            # Installs all browsers

# Or install specific browsers:
playwright install --with-deps chrome    # Recommended for Colab/Linux
playwright install --with-deps firefox
playwright install --with-deps webkit
playwright install --with-deps chromium

# Keep Playwright updated periodically
playwright install
```

> **Note**: For Google Colab and some Linux environments, use `chrome` instead of `chromium`; it tends to work more reliably.

### Test Your Installation

Try these one-liners:

```bash
# Visible browser test
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); page = browser.new_page(); page.goto('https://example.com'); input('Press Enter to close...')"

# Headless test (for servers/CI)
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=True); page = browser.new_page(); page.goto('https://example.com'); print(f'Title: {page.title()}'); browser.close()"
```

In the visible test you should see a browser window loading example.com. If you get errors, try Firefox with `playwright install --with-deps firefox`.

**Try in Colab:**
[Open Colab Notebook](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)

**More info:** [See /docs/configuration](#) or [2_configuration.md](https://github.com/unclecode/crawl4ai/blob/main/configuration.md)

---
## 3. Core Concepts & Configuration

Use `AsyncWebCrawler`, `CrawlerRunConfig`, and `BrowserConfig` to control crawling.

**Example config:**

```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    headless=True,
    verbose=True,
    viewport_width=1080,
    viewport_height=600,
    text_mode=False,
    ignore_https_errors=True,
    java_script_enabled=True
)

run_config = CrawlerRunConfig(
    css_selector="article.main",
    word_count_threshold=50,
    excluded_tags=['nav', 'footer'],
    exclude_external_links=True,
    wait_for="css:.article-loaded",
    page_timeout=60000,
    delay_before_return_html=1.0,
    mean_delay=0.1,
    max_range=0.3,
    process_iframes=True,
    remove_overlay_elements=True,
    js_code="""
    (async () => {
        window.scrollTo(0, document.body.scrollHeight);
        await new Promise(r => setTimeout(r, 2000));
        document.querySelector('.load-more')?.click();
    })();
    """
)

# Cache modes: ENABLED, DISABLED, BYPASS, READ_ONLY, WRITE_ONLY
# run_config.cache_mode = CacheMode.ENABLED
```

**URL prefixes** (a usage sketch follows the list):

- `http://` or `https://` for live pages
- `file://` for local HTML files
- `raw:<html>` for raw HTML strings
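
All three prefixes go through the same `crawler.arun()` call. A minimal usage sketch (the local file path is a hypothetical placeholder):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def demo_prefixes():
    async with AsyncWebCrawler() as crawler:
        live = await crawler.arun("https://example.com")
        local = await crawler.arun("file:///tmp/page.html")  # hypothetical local file
        raw = await crawler.arun("raw:<html><body><h1>Hello</h1></body></html>")
        print(len(live.markdown), len(local.markdown), len(raw.markdown))

asyncio.run(demo_prefixes())
```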
**More info:** [See /docs/async_webcrawler](#) or [3_async_webcrawler.ex.md](https://github.com/unclecode/crawl4ai/blob/main/async_webcrawler.ex.md)

---
## 4. Basic Crawling & Simple Extraction

```python
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://news.example.com/article", config=run_config)
    print(result.markdown)  # Basic markdown content
```
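Each crawl returns a `CrawlResult`; it is worth checking the status before using the output. A minimal sketch, assuming the `success`, `status_code`, and `error_message` fields used in the project's examples:

```python
# Hedged sketch: inspect the CrawlResult before consuming the markdown.
if result.success:
    print("HTTP status:", result.status_code)
    print(result.markdown[:300])
else:
    print("Crawl failed:", result.error_message)
```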
**More info:** [See /docs/browser_context_page](#) or [4_browser_context_page.ex.md](https://github.com/unclecode/crawl4ai/blob/main/browser_context_page.ex.md)

---
## 5. Markdown Generation & AI-Optimized Output

After crawling, `result.markdown_v2` provides:

- `raw_markdown`: Unfiltered markdown
- `markdown_with_citations`: Links as references at the bottom
- `references_markdown`: A separate list of reference links
- `fit_markdown`: Filtered, relevant markdown (e.g., after BM25 filtering)
- `fit_html`: The HTML used to produce `fit_markdown`

**Example:**

```python
print("RAW:", result.markdown_v2.raw_markdown[:200])
print("CITED:", result.markdown_v2.markdown_with_citations[:200])
print("REFERENCES:", result.markdown_v2.references_markdown)
print("FIT MARKDOWN:", result.markdown_v2.fit_markdown)
```

For AI training, `fit_markdown` focuses on the most relevant content.
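To populate it, attach a markdown generator with a content filter to the run config. A minimal sketch, assuming the `DefaultMarkdownGenerator` and `BM25ContentFilter` helpers described in the markdown-generation docs (adjust imports to your installed version):

```python
# Hedged sketch: a content filter is what makes fit_markdown/fit_html available.
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter

run_config.markdown_generator = DefaultMarkdownGenerator(
    content_filter=BM25ContentFilter(user_query="main article content")
)
# After arun(), result.markdown_v2.fit_markdown holds the filtered text.
```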
**More info:** [See /docs/markdown_generation](#) or [5_markdown_generation.ex.md](https://github.com/unclecode/crawl4ai/blob/main/markdown_generation.ex.md)

---
## 6. Structured Data Extraction (CSS, XPath, LLM)

Extract JSON data without LLMs:

**CSS:**

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Products",
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"}
    ]
}
run_config.extraction_strategy = JsonCssExtractionStrategy(schema)
```

**XPath:**

```python
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy

xpath_schema = {
    "name": "Articles",
    "baseSelector": "//div[@class='article']",
    "fields": [
        {"name": "headline", "selector": ".//h1", "type": "text"},
        {"name": "summary", "selector": ".//p[@class='summary']", "type": "text"}
    ]
}
run_config.extraction_strategy = JsonXPathExtractionStrategy(xpath_schema)
```
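Either strategy writes its output to `result.extracted_content` as a JSON string. A minimal sketch of running the crawl and parsing it (the URL is a placeholder; field names follow the schema above):

```python
import json

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://shop.example.com", config=run_config)
    products = json.loads(result.extracted_content)  # list of dicts matching the schema
    for product in products[:3]:
        print(product["title"], "->", product["price"])
```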
**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md)

---
## 7. Advanced Extraction: LLM & Open-Source Models

Use `LLMExtractionStrategy` for complex tasks. It works with OpenAI as well as open-source models (e.g., via Ollama).

```python
from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class TravelData(BaseModel):
    destination: str
    attractions: list

run_config.extraction_strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=TravelData.schema(),
    instruction="Extract destination and top attractions."
)
```
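To point the same strategy at a hosted model, swap the provider string and supply a key. A sketch assuming the `api_token` parameter used in the project's LLM examples (the provider/model string is illustrative):

```python
import os
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Hedged sketch: hosted-model variant of the strategy above.
run_config.extraction_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",           # illustrative provider/model string
    api_token=os.getenv("OPENAI_API_KEY"),
    schema=TravelData.schema(),
    instruction="Extract destination and top attractions."
)
```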
**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md)

---
## 8. Page Interactions, JS Execution, & Dynamic Content

Insert `js_code` and use `wait_for` to ensure content loads. Example:

```python
run_config.js_code = """
(async () => {
    document.querySelector('.load-more')?.click();
    await new Promise(r => setTimeout(r, 2000));
})();
"""
run_config.wait_for = "css:.item-loaded"
```
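For multi-step interactions on the same page, you can keep a session alive between calls. A sketch assuming the `session_id` and `js_only` options described in the page-interaction docs (URL and selectors are placeholders):

```python
# Hedged sketch: the first call loads the page; the second only runs JS in the
# same open session instead of navigating again.
from crawl4ai.async_configs import CrawlerRunConfig

first_cfg = CrawlerRunConfig(session_id="news_session", wait_for="css:.article-loaded")
next_cfg = CrawlerRunConfig(
    session_id="news_session",
    js_only=True,
    js_code="document.querySelector('.next-page')?.click();",
    wait_for="css:.article-loaded"
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    page1 = await crawler.arun("https://news.example.com", config=first_cfg)
    page2 = await crawler.arun("https://news.example.com", config=next_cfg)
```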
**More info:** [See /docs/page_interaction](#) or [11_page_interaction.md](https://github.com/unclecode/crawl4ai/blob/main/page_interaction.md)

---
## 9. Media, Links, & Metadata Handling

- `result.media["images"]`: list of images with `src`, `score`, `alt`; the score indicates relevance.
- `result.media["videos"]` and `result.media["audios"]`: similar info for other media.
- `result.links["internal"]`, `result.links["external"]`, `result.links["social"]`: categorized links, each with `href`, `text`, `context`, `type`.
- `result.metadata`: title, description, keywords, author.

**Example:**

```python
# Images
for img in result.media["images"]:
    print("Image:", img["src"], "Score:", img["score"], "Alt:", img.get("alt", "N/A"))

# Links
for link in result.links["external"]:
    print("External Link:", link["href"], "Text:", link["text"])

# Metadata
print("Page Title:", result.metadata["title"])
print("Description:", result.metadata["description"])
```

**More info:** [See /docs/content_selection](#) or [8_content_selection.ex.md](https://github.com/unclecode/crawl4ai/blob/main/content_selection.ex.md)

---
## 10. Authentication & Identity Preservation

### Manual Setup via User Data Directory

1. **Open Chrome with a custom user data dir:**

   ```bash
   "C:\Program Files\Google\Chrome\Application\chrome.exe" --user-data-dir="C:\MyChromeProfile"
   ```

   On macOS:

   ```bash
   "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="/Users/username/ChromeProfiles/MyProfile"
   ```

2. **Log in to sites, solve CAPTCHAs, adjust settings manually.**
   The browser saves cookies/localStorage in that directory.

3. **Use `user_data_dir` in `BrowserConfig`:**

   ```python
   browser_config = BrowserConfig(
       headless=True,
       user_data_dir="/Users/username/ChromeProfiles/MyProfile"
   )
   ```

Now the crawler starts with those cookies, sessions, etc.

### Using `storage_state`

Alternatively, export and reuse storage states:

```python
browser_config = BrowserConfig(
    headless=True,
    storage_state="mystate.json"  # Pre-saved state
)
```

No repeated logins needed.
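If you don't already have a `mystate.json`, you can export one from a logged-in browser session using Playwright's standard `storage_state` API; a minimal sketch (the login URL is a placeholder):

```python
# Hedged sketch: log in once in a visible browser, then save the state that
# Crawl4AI can reuse via BrowserConfig(storage_state="mystate.json").
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example.com/login")
    input("Log in manually, then press Enter...")
    context.storage_state(path="mystate.json")  # saves cookies + localStorage
    browser.close()
```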
**More info:** [See /docs/storage_state](#) or [16_storage_state.md](https://github.com/unclecode/crawl4ai/blob/main/storage_state.md)

---
## 11. Proxy & Security Enhancements

Use `proxy_config` for authenticated proxies:

```python
browser_config.proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "proxyuser",
    "password": "proxypass"
}
```

Combine with `headers` or `ignore_https_errors` as needed.

**More info:** [See /docs/proxy_security](#) or [14_proxy_security.md](https://github.com/unclecode/crawl4ai/blob/main/proxy_security.md)

---
## 12. Screenshots, PDFs & File Downloads

Enable `screenshot=True` or `pdf=True` in `CrawlerRunConfig`:

```python
run_config.screenshot = True
run_config.pdf = True
```

After crawling:

```python
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(result.screenshot)

if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)
```

**File Downloads:**

```python
browser_config.accept_downloads = True
browser_config.downloads_path = "./downloads"
run_config.js_code = """document.querySelector('a.download')?.click();"""

# After crawl:
print("Downloaded files:", result.downloaded_files)
```

**More info:** [See /docs/screenshot_and_pdf_export](#) or [15_screenshot_and_pdf_export.md](https://github.com/unclecode/crawl4ai/blob/main/screenshot_and_pdf_export.md), and also [10_file_download.md](https://github.com/unclecode/crawl4ai/blob/main/file_download.md)

---
## 13. Caching & Performance Optimization

Set `cache_mode` to reuse fetched results:

```python
from crawl4ai import CacheMode

run_config.cache_mode = CacheMode.ENABLED
```

Adjust delays, increase concurrency, or use `text_mode=True` for faster extraction.
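For concurrency, the crawler can fetch a batch of URLs in one session; a sketch assuming the `arun_many()` helper shown in the project's examples (URLs are placeholders):

```python
# Hedged sketch: crawl several URLs concurrently, reusing the cache settings above.
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

async with AsyncWebCrawler(config=browser_config) as crawler:
    results = await crawler.arun_many(urls, config=run_config)
    for r in results:
        print(r.url, "->", len(r.markdown or ""), "chars")
```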
**More info:** [See /docs/cache_modes](#) or [9_cache_modes.md](https://github.com/unclecode/crawl4ai/blob/main/cache_modes.md)

---
## 14. Hooks for Custom Logic

Hooks let you run custom code at specific points in the crawl lifecycle. Avoid creating pages manually inside `on_browser_created`; instead, use `on_page_context_created` to apply routing or modify the page/context before the URL is crawled.

**Example Hook:**

```python
async def on_page_context_created_hook(context, page, **kwargs):
    # Block all images to speed up load
    await context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
    print("[HOOK] Image requests blocked")

async with AsyncWebCrawler(config=browser_config) as crawler:
    crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created_hook)
    result = await crawler.arun("https://imageheavy.example.com", config=run_config)
    print("Crawl finished with images blocked.")
```

This hook doesn't create a separate page itself; it only modifies the context/page the crawler is about to use.

**More info:** [See /docs/hooks_auth](#) or [13_hooks_auth.md](https://github.com/unclecode/crawl4ai/blob/main/hooks_auth.md)

---
## 15. Dockerization & Scaling

Use Docker images:

- AMD64 basic:

  ```bash
  docker pull unclecode/crawl4ai:basic-amd64
  docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
  ```

- ARM64 for M1/M2:

  ```bash
  docker pull unclecode/crawl4ai:basic-arm64
  docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
  ```

- GPU support:

  ```bash
  docker pull unclecode/crawl4ai:gpu-amd64
  docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu-amd64
  ```

Scale with load balancers or Kubernetes.
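Once a container is running, it exposes an HTTP service on port 11235. A rough sketch of submitting a crawl job over REST, assuming the `/crawl` submit-and-poll endpoints described in the project README (verify the payload against the image version you run):

```python
# Hedged sketch: drive the Dockerized service over HTTP.
# Endpoint names and response fields are assumptions based on the README;
# check them against your running image.
import time
import requests

resp = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com", "priority": 10},
)
task_id = resp.json()["task_id"]

# Poll until the task completes.
while True:
    status = requests.get(f"http://localhost:11235/task/{task_id}").json()
    if status.get("status") == "completed":
        print(status["result"]["markdown"][:300])
        break
    time.sleep(1)
```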
**More info:** [See the Docker instructions in the project README](#)

---
## 16. Troubleshooting & Common Pitfalls

- Empty results? Relax filters, check selectors.
- Timeouts? Increase `page_timeout` or refine `wait_for`.
- CAPTCHAs? Use `user_data_dir` or `storage_state` after manual solving.
- JS errors? Try headful mode for debugging (see the sketch below).
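
A minimal headful-debugging sketch: flip to a visible browser with verbose logging while you inspect selectors and injected JS (reusing only options shown earlier in this guide):

```python
# Hedged sketch: watch the page load and read the verbose logs to see where
# selectors, waits, or injected JS fail.
debug_browser = BrowserConfig(headless=False, verbose=True)
debug_run = CrawlerRunConfig(page_timeout=120000, wait_for="css:.article-loaded")

async with AsyncWebCrawler(config=debug_browser) as crawler:
    result = await crawler.arun("https://news.example.com/article", config=debug_run)
```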
Check the [examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) folder & [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) for more code.

---
## 17. Comprehensive End-to-End Example

Combine hooks, JS execution, PDF saving, and LLM extraction; see [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) for a full example.
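As a rough outline, here is a sketch assembled only from the pieces shown above (not a substitute for the linked example; URLs and selectors are placeholders):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def full_run():
    browser_cfg = BrowserConfig(headless=True, verbose=True)
    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        js_code="document.querySelector('.load-more')?.click();",
        wait_for="css:.item-loaded",
        screenshot=True,
        pdf=True,
        extraction_strategy=LLMExtractionStrategy(
            provider="ollama/nemotron",
            instruction="Summarize the page and list the key entities."
        ),
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun("https://news.example.com/article", config=run_cfg)
        if result.success:
            print(result.markdown[:300])
            print(result.extracted_content)
            if result.pdf:
                with open("article.pdf", "wb") as f:
                    f.write(result.pdf)

asyncio.run(full_run())
```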
---
## 18. Further Resources & Community

- **Docs:** [https://crawl4ai.com](https://crawl4ai.com)
- **Issues & PRs:** [https://github.com/unclecode/crawl4ai/issues](https://github.com/unclecode/crawl4ai/issues)

Follow [@unclecode](https://x.com/unclecode) for news & community updates.

**Happy Crawling!**
Leverage Crawl4AI to feed your AI models with clean, structured web data today.