- **`crawl4ai-setup`** installs and configures Playwright (Chromium by default).
We cover advanced installation and Docker in the [Installation](#installation) section.
---
## 3. Your First Crawl
Here’s a minimal Python script that creates an **`AsyncWebCrawler`**, fetches a webpage, and prints the first 300 characters of its Markdown output:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])  # Print first 300 chars

if __name__ == "__main__":
    asyncio.run(main())
```
**What’s happening?**
- **`AsyncWebCrawler`** launches a headless browser (Chromium by default).
- It fetches `https://example.com`.
- Crawl4AI automatically converts the HTML into Markdown.
You now have a simple, working crawl!
---
## 4. Basic Configuration (Light Introduction)
Crawl4AI’s crawler can be heavily customized using two main classes:
1. **`BrowserConfig`**: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.).
2. **`CrawlerRunConfig`**: Controls how each crawl runs (caching, extraction, timeouts, hooks, etc.).
Below is an example with minimal usage:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

async def main():
    browser_conf = BrowserConfig(headless=True)  # or False to see the browser
    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)  # skip the cache, fetch fresh content

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_conf
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
We’ll explore more advanced configuration (proxies, PDF output, multi-tab sessions, and more) in later tutorials. For now, note the pattern: `BrowserConfig` is passed to the `AsyncWebCrawler` constructor, and `CrawlerRunConfig` is passed to each `arun()` call.
---
## 5. Generating Markdown Output
By default, Crawl4AI automatically generates Markdown from each crawled page. However, the exact output depends on whether you specify a **markdown generator** or **content filter**.
- **`result.markdown`**: The direct HTML-to-Markdown conversion.
- **`result.markdown.fit_markdown`**: The same content after applying any configured **content filter** (e.g., `PruningContentFilter`).
### Example: Using a Filter with `DefaultMarkdownGenerator`
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
```
**Note**: If you do **not** specify a content filter or markdown generator, you’ll typically see only the raw Markdown. We’ll dive deeper into these strategies in a dedicated **Markdown Generation** tutorial.
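The imports shown above can be combined into a complete run roughly as follows. This is a minimal sketch under stated assumptions: the `threshold` values and the target URL are illustrative, `describe_markdown` is a hypothetical helper, and `raw_markdown` is assumed to be the attribute that holds the unfiltered conversion.

```python
def describe_markdown(raw_markdown: str, fit_markdown: str) -> str:
    """Report the size of the raw and filtered Markdown and the share that survived filtering."""
    kept = 100 * len(fit_markdown) / len(raw_markdown) if raw_markdown else 0
    return f"raw: {len(raw_markdown)} chars, fit: {len(fit_markdown)} chars ({kept:.0f}% kept)"


async def main() -> None:
    # Crawl4AI is imported here so that describe_markdown() can be reused
    # without the browser stack installed.
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
    from crawl4ai.content_filter_strategy import PruningContentFilter
    from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

    # threshold=0.4 and threshold_type="fixed" are illustrative values (assumptions).
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )
    config = CrawlerRunConfig(markdown_generator=md_generator)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        # raw_markdown is assumed to hold the unfiltered conversion.
        print(describe_markdown(result.markdown.raw_markdown, result.markdown.fit_markdown))
```

Run it with `asyncio.run(main())`. On a content-heavy page, the fit Markdown should come out noticeably shorter than the raw Markdown.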
---
## 6. Simple Data Extraction (CSS-based)
Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors. Below is a minimal CSS-based example:
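```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Illustrative schema and inline HTML (assumptions for this sketch)
schema = {"name": "Example Items", "baseSelector": "div.item",
          "fields": [{"name": "title", "selector": "h2", "type": "text"}]}
raw_html = "<div class='item'><h2>Item 1</h2></div>"

async def main():
    config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="raw://" + raw_html, config=config)
        # The JSON output is stored in 'extracted_content'
        data = json.loads(result.extracted_content)
        print(data)

if __name__ == "__main__":
    asyncio.run(main())
```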
**Why is this helpful?**
- Great for repetitive page structures (e.g., item listings, articles).
- No AI usage or costs.
- The crawler returns a JSON string you can parse or store.
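Because `extracted_content` arrives as a JSON string, a small helper can parse it and persist the records. The sketch below uses only the standard library; `save_extracted`, the sample payload, and the output file name are hypothetical and not part of Crawl4AI, and the payload is assumed to be a JSON array.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def save_extracted(extracted_content: Optional[str], path: Path) -> int:
    """Parse an extraction result (a JSON array as a string) and write it to `path`.

    Returns the number of records written. An empty or missing string
    is treated as "no records".
    """
    records = json.loads(extracted_content) if extracted_content else []
    path.write_text(json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8")
    return len(records)


# Sample payload standing in for `result.extracted_content` (an assumption).
sample = '[{"title": "Item 1"}, {"title": "Item 2"}]'
out_path = Path(tempfile.gettempdir()) / "crawl4ai_items.json"
count = save_extracted(sample, out_path)
print(f"Saved {count} records to {out_path}")
```

In a real crawl, you would pass `result.extracted_content` in place of `sample`.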
---
## 7. Simple Data Extraction (LLM-based)
For more complex or irregular pages, a language model can parse text intelligently into a structure you define. Crawl4AI supports both **open-source** and **closed-source** providers. A typical LLM-based extraction works as follows:
- We define a Pydantic schema (`PricingInfo`) describing the fields we want.
- The LLM extraction strategy uses that schema and your instructions to transform raw text into structured JSON.
- Depending on the **provider** and **api_token**, you can use local models or a remote API.
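The last step of that flow, checking that the returned JSON matches the fields you asked for, can be sketched without an LLM call. The block below is a dependency-free stand-in and not Crawl4AI's API: it uses a standard-library dataclass where the text uses a Pydantic model, the three `PricingInfo` fields are hypothetical (the fields are not listed here), and `parse_pricing` and the sample payload are illustrative.

```python
import json
from dataclasses import dataclass, fields
from typing import List


@dataclass
class PricingInfo:
    # Hypothetical fields, chosen only for illustration.
    model_name: str
    input_fee: str
    output_fee: str


def parse_pricing(extracted_content: str) -> List[PricingInfo]:
    """Convert a JSON array of objects into PricingInfo records.

    Items that are not objects or that lack a required field are skipped.
    """
    required = {f.name for f in fields(PricingInfo)}
    records = []
    for item in json.loads(extracted_content):
        if isinstance(item, dict) and required.issubset(item):
            records.append(PricingInfo(**{name: str(item[name]) for name in required}))
    return records


# Sample payload standing in for an LLM extraction result (an assumption).
sample = json.dumps([
    {"model_name": "example-model", "input_fee": "$1.00", "output_fee": "$2.00"},
    {"model_name": "incomplete-entry"},  # missing fees, so it is skipped
])
pricing = parse_pricing(sample)
print(pricing)
```

With a Pydantic model, the model's own validation would take the place of the manual field check.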
---
## 8. Next Steps
Congratulations! You have:
1. Installed Crawl4AI (via pip, with Docker as an option).
2. Performed a simple crawl and printed Markdown.
3. Seen how adding a **markdown generator** + **content filter** can produce “fit” Markdown.
4. Experimented with **CSS-based** extraction for repetitive data.
5. Learned the basics of **LLM-based** extraction (open-source and closed-source).
If you are ready for more, check out:
- **Installation**: Learn more about installing Crawl4AI and setting up Playwright.
- **Focus on Configuration**: Learn to customize browser settings, caching modes, advanced timeouts, etc.
- **Markdown Generation Basics**: Dive deeper into content filtering and “fit markdown” usage.
- **Dynamic Pages & Hooks**: Tackle sites with “Load More” buttons, login forms, or JavaScript complexities.
- **Deployment**: Run Crawl4AI in Docker containers and scale across multiple nodes.
- **Explanations & How-To Guides**: Explore browser contexts, identity-based crawling, hooking, performance, and more.
Crawl4AI is a powerful tool for extracting data and generating Markdown from virtually any website. Enjoy exploring, and we hope you build amazing AI-powered applications with it!