Below is a sample Markdown file (`tutorials/async-webcrawler-basics.md`) illustrating how you might teach new users the fundamentals of `AsyncWebCrawler`. This tutorial builds on the **Getting Started** section by introducing key configuration parameters and the structure of the crawl result. Feel free to adjust the code snippets, wording, or format to match your style.

---

# AsyncWebCrawler Basics

In this tutorial, you’ll learn how to:

1. Create and configure an `AsyncWebCrawler` instance
2. Understand the `CrawlResult` object returned by `arun()`
3. Use basic `BrowserConfig` and `CrawlerRunConfig` options to tailor your crawl

> **Prerequisites**
> - You’ve already completed the [Getting Started](./getting-started.md) tutorial (or have equivalent knowledge).
> - You have **Crawl4AI** installed and configured with Playwright.

---
## 1. What is `AsyncWebCrawler`?

`AsyncWebCrawler` is the central class for running asynchronous crawling operations in Crawl4AI. It manages browser sessions, handles dynamic pages (if needed), and provides you with a structured result object for each crawl. Essentially, it’s your high-level interface for collecting page data.

```python
from crawl4ai import AsyncWebCrawler

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://example.com")
    print(result)
```

---
## 2. Creating a Basic `AsyncWebCrawler` Instance

Below is a simple code snippet showing how to create and use `AsyncWebCrawler`. This goes one step beyond the minimal example you saw in [Getting Started](./getting-started.md).

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import BrowserConfig, CrawlerRunConfig

async def main():
    # 1. Set up configuration objects (optional if you want defaults)
    browser_config = BrowserConfig(
        browser_type="chromium",
        headless=True,
        verbose=True
    )
    crawler_config = CrawlerRunConfig(
        page_timeout=30000,  # 30 seconds
        wait_for_images=True,
        verbose=True
    )

    # 2. Initialize AsyncWebCrawler with your chosen browser config
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # 3. Run a single crawl
        url_to_crawl = "https://example.com"
        result = await crawler.arun(url=url_to_crawl, config=crawler_config)

        # 4. Inspect the result
        if result.success:
            print(f"Successfully crawled: {result.url}")
            print(f"HTML length: {len(result.html)}")
            print(f"Markdown snippet: {result.markdown[:200]}...")
        else:
            print(f"Failed to crawl {result.url}. Error: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

### Key Points

1. **`BrowserConfig`** is optional, but it’s the place to specify browser-related settings (e.g., `headless`, `browser_type`).
2. **`CrawlerRunConfig`** deals with how you want the crawler to behave for this particular run (timeouts, waiting for images, etc.).
3. **`arun()`** is the main method to crawl a single URL. We’ll see how `arun_many()` works in later tutorials.

---
## 3. Understanding `CrawlResult`

When you call `arun()`, you get back a `CrawlResult` object containing all the relevant data from that crawl attempt. Some common fields include:

```python
class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    screenshot: Optional[str] = None  # base64-encoded screenshot if requested
    pdf: Optional[bytes] = None       # binary PDF data if requested
    markdown: Optional[Union[str, MarkdownGenerationResult]] = None
    markdown_v2: Optional[MarkdownGenerationResult] = None
    error_message: Optional[str] = None
    # ... plus other fields like status_code, ssl_certificate, extracted_content, etc.
```

### Commonly Used Fields

- **`success`**: `True` if the crawl succeeded, `False` otherwise.
- **`html`**: The raw HTML (or final rendered state if JavaScript was executed).
- **`markdown` / `markdown_v2`**: The automatically generated Markdown representation of the page.
- **`media`**: A dictionary with lists of extracted images, videos, or audio elements.
- **`links`**: A dictionary with lists of “internal” and “external” link objects.
- **`error_message`**: If `success` is `False`, this often contains a description of the error.

**Example**:

```python
if result.success:
    print("Snippet of the raw HTML:", result.html[:200])
    if result.markdown:
        print("Markdown snippet:", result.markdown[:200])
    print("Links found:", len(result.links.get("internal", [])), "internal links")
else:
    print("Error crawling:", result.error_message)
```
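The `media` and `links` dictionaries can be inspected the same way. Here is a minimal sketch; the `"images"`, `"src"`, and `"href"` key names are assumptions for illustration and may differ in your version, so inspect `result.media` and `result.links` to confirm what your crawl actually returned:

```python
# Hypothetical sketch: the "images", "src", and "href" keys are assumptions,
# not guaranteed by this tutorial. Print result.media / result.links to verify.
if result.success:
    images = result.media.get("images", [])
    print(f"Found {len(images)} images")
    for img in images[:5]:
        print(" -", img.get("src"))

    for link in result.links.get("external", [])[:5]:
        print("External link:", link.get("href"))
```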
---

## 4. Relevant Basic Parameters

Below are a few `BrowserConfig` and `CrawlerRunConfig` parameters you might tweak early on. We’ll cover more advanced ones (like proxies, PDF, or screenshots) in later tutorials.

### 4.1 `BrowserConfig` Essentials

| Parameter | Description | Default |
|-----------|-------------|---------|
| `browser_type` | Which browser engine to use: `"chromium"`, `"firefox"`, `"webkit"`. | `"chromium"` |
| `headless` | Run the browser with no UI window. If `False`, you see the browser. | `True` |
| `verbose` | Print extra logs for debugging. | `True` |
| `java_script_enabled` | Toggle JavaScript. When `False`, you might speed up loads but lose dynamic content. | `True` |
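
For instance, a configuration for visually debugging a crawl might look like the sketch below. It only uses parameters from the table above, and the exact defaults may differ in your installed version:

```python
from crawl4ai import BrowserConfig

# Watch the browser while you debug: show the UI and keep verbose logging on.
debug_browser_cfg = BrowserConfig(
    browser_type="chromium",   # or "firefox" / "webkit"
    headless=False,            # show the browser window
    java_script_enabled=True,  # keep dynamic content
    verbose=True               # extra logs while troubleshooting
)
```

You would pass this to `AsyncWebCrawler(config=debug_browser_cfg)` exactly as in Section 2.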
### 4.2 `CrawlerRunConfig` Essentials

| Parameter | Description | Default |
|-----------|-------------|---------|
| `page_timeout` | Maximum time in ms to wait for the page to load or scripts. | `30000` (30s) |
| `wait_for_images` | Wait for images to fully load. Good for accurate rendering. | `True` |
| `css_selector` | Target only certain elements for extraction. | `None` |
| `excluded_tags` | Skip certain HTML tags (like `nav`, `footer`, etc.). | `None` |
| `verbose` | Print logs for debugging. | `True` |
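
As a quick sketch using only parameters from the table above (the selector and tag names are illustrative, not prescriptive), a run config that skips boilerplate markup and focuses on the main content might look like:

```python
from crawl4ai import CrawlerRunConfig

# Focus extraction on the main content region and drop navigation chrome.
focused_run_cfg = CrawlerRunConfig(
    page_timeout=60000,               # allow up to 60s for slow pages
    wait_for_images=False,            # skip image loading for faster crawls
    css_selector="main",              # illustrative: extract only the <main> region
    excluded_tags=["nav", "footer"],  # skip navigation and footer markup
    verbose=True
)
```

You would pass this as `config=focused_run_cfg` to `arun()`, just like `crawler_config` in Section 2.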
> **Tip**: Don’t worry if you see lots of parameters. You’ll learn them gradually in later tutorials.

---
## 5. Putting It All Together

Here’s a slightly more in-depth example that shows off a few key config parameters at once:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(
        browser_type="chromium",
        headless=True,
        java_script_enabled=True,
        verbose=False
    )

    crawler_cfg = CrawlerRunConfig(
        page_timeout=30000,  # wait up to 30 seconds
        wait_for_images=True,
        css_selector=".article-body",  # only extract content under this CSS selector
        verbose=True
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun("https://news.example.com", config=crawler_cfg)

        if result.success:
            print("[OK] Crawled:", result.url)
            print("HTML length:", len(result.html))
            print("Extracted Markdown:", result.markdown_v2.raw_markdown[:300])
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

**Key Observations**:

- `css_selector=".article-body"` ensures we only focus on the main content region.
- `page_timeout=30000` helps if the site is slow.
- We turned off `verbose` logs for the browser but kept them on for the crawler config.
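
One caveat worth noting: as the `CrawlResult` definition in Section 3 shows, both `markdown_v2` and `markdown` are optional, and `markdown` may be either a plain string or a `MarkdownGenerationResult`. A defensive way to read whichever is present (a sketch, not the library’s official recommendation) is:

```python
# Prefer markdown_v2 when available, otherwise fall back to markdown.
md = result.markdown_v2 or result.markdown
text = md.raw_markdown if hasattr(md, "raw_markdown") else (md or "")
print(text[:300])
```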
---

## 6. Next Steps

- **Smart Crawling Techniques**: Learn to handle iframes, advanced caching, and selective extraction in the [next tutorial](./smart-crawling.md).
- **Hooks & Custom Code**: See how to inject custom logic before and after navigation in a dedicated [Hooks Tutorial](./hooks-custom.md).
- **Reference**: For a complete list of every parameter in `BrowserConfig` and `CrawlerRunConfig`, check out the [Reference section](../../reference/configuration.md).

---

## Summary

You now know the basics of **AsyncWebCrawler**:

- How to create it with optional browser/crawler configs
- How `arun()` works for single-page crawls
- Where to find your crawled data in `CrawlResult`
- A handful of frequently used configuration parameters

From here, you can refine your crawler to handle more advanced scenarios, like focusing on specific content or dealing with dynamic elements. Let’s move on to **[Smart Crawling Techniques](./smart-crawling.md)** to learn how to handle iframes, advanced caching, and more.

---

**Last updated**: 2024-XX-XX

Keep exploring! If you get stuck, remember to check out the [How-To Guides](../../how-to/) for targeted solutions or the [Explanations](../../explanations/) for deeper conceptual background.