# `CrawlResult` Reference
The **`CrawlResult`** class encapsulates everything returned after a single crawl operation. It provides the **raw or processed content**, details on links and media, plus optional metadata (like screenshots, PDFs, or extracted JSON).
**Location**: `crawl4ai/crawler/models.py` (for reference)
```python
class CrawlResult(BaseModel):
url: str
html: str
success: bool
cleaned_html: Optional[str] = None
media: Dict[str, List[Dict]] = {}
links: Dict[str, List[Dict]] = {}
downloaded_files: Optional[List[str]] = None
screenshot: Optional[str] = None
    pdf: Optional[bytes] = None
mhtml: Optional[str] = None
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
extracted_content: Optional[str] = None
metadata: Optional[dict] = None
error_message: Optional[str] = None
session_id: Optional[str] = None
response_headers: Optional[dict] = None
status_code: Optional[int] = None
ssl_certificate: Optional[SSLCertificate] = None
dispatch_result: Optional[DispatchResult] = None
...
```
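For orientation, here is a minimal sketch of how a `CrawlResult` is typically obtained (assuming the standard `AsyncWebCrawler.arun()` entry point):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # arun() returns a CrawlResult for the crawled page
        result = await crawler.arun(url="https://example.com")
        print(result.success, result.status_code)

asyncio.run(main())
```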
Below is a **field-by-field** explanation and possible usage patterns.
---
## 1. Basic Crawl Info
### 1.1 **`url`** *(str)*
**What**: The final crawled URL (after any redirects).
**Usage**:
```python
print(result.url) # e.g., "https://example.com/"
```
### 1.2 **`success`** *(bool)*
**What**: `True` if the crawl pipeline ended without major errors; `False` otherwise.
**Usage**:
```python
if not result.success:
print(f"Crawl failed: {result.error_message}")
```
### 1.3 **`status_code`** *(Optional[int])*
**What**: The page’s HTTP status code (e.g., 200, 404).
**Usage**:
```python
if result.status_code == 404:
print("Page not found!")
```
### 1.4 **`error_message`** *(Optional[str])*
**What**: If `success=False`, a textual description of the failure.
**Usage**:
```python
if not result.success:
print("Error:", result.error_message)
```
### 1.5 **`session_id`** *(Optional[str])*
**What**: The ID used for reusing a browser context across multiple calls.
**Usage**:
```python
# If you used session_id="login_session" in CrawlerRunConfig, see it here:
print("Session:", result.session_id)
```
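As a sketch of how a session ID flows from config to result (assuming the `session_id` parameter on `CrawlerRunConfig`, as the comment above suggests):

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_with_session():
    config = CrawlerRunConfig(session_id="login_session")
    async with AsyncWebCrawler() as crawler:
        # Both calls reuse the same browser context via the shared session_id
        first = await crawler.arun(url="https://example.com/login", config=config)
        second = await crawler.arun(url="https://example.com/dashboard", config=config)
        print(first.session_id, second.session_id)  # both "login_session"
```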
### 1.6 **`response_headers`** *(Optional[dict])*
**What**: Final HTTP response headers.
**Usage**:
```python
if result.response_headers:
print("Server:", result.response_headers.get("Server", "Unknown"))
```
### 1.7 **`ssl_certificate`** *(Optional[SSLCertificate])*
**What**: If `fetch_ssl_certificate=True` in your `CrawlerRunConfig`, **`result.ssl_certificate`** contains an [**`SSLCertificate`**](../advanced/ssl-certificate.md) object describing the site’s certificate. You can export the cert in multiple formats (PEM/DER/JSON) or access its properties like `issuer`, `subject`, `valid_from`, `valid_until`, etc.
**Usage**:
```python
if result.ssl_certificate:
print("Issuer:", result.ssl_certificate.issuer)
```
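A sketch of exporting the certificate, assuming the `to_pem`/`to_json` helpers described in the linked [`SSLCertificate`](../advanced/ssl-certificate.md) guide:

```python
if result.ssl_certificate:
    cert = result.ssl_certificate
    print("Subject:", cert.subject)
    print("Valid until:", cert.valid_until)
    # Export helpers as described in the SSL certificate guide
    cert.to_pem("certificate.pem")
    cert.to_json("certificate.json")
```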
---
## 2. Raw / Cleaned Content
### 2.1 **`html`** *(str)*
**What**: The **original** unmodified HTML from the final page load.
**Usage**:
```python
# Possibly large
print(len(result.html))
```
### 2.2 **`cleaned_html`** *(Optional[str])*
**What**: A sanitized HTML version—scripts, styles, or excluded tags are removed based on your `CrawlerRunConfig`.
**Usage**:
```python
print((result.cleaned_html or "")[:500])  # Show a snippet; field may be None
```
### 2.3 **`fit_html`** *(Optional[str])*
**What**: If a **content filter** or heuristic (e.g., Pruning/BM25) modifies the HTML, the “fit” or post-filter version.
**When**: This is **only** present if your `markdown_generator` or `content_filter` produces it.
**Usage**:
```python
if result.markdown and result.markdown.fit_html:
    print("High-value HTML content:", result.markdown.fit_html[:300])
```
---
## 3. Markdown Fields
### 3.1 The Markdown Generation Approach
Crawl4AI can convert HTML→Markdown, optionally including:
- **Raw** markdown
- **Links as citations** (with a references section)
- **Fit** markdown if a **content filter** is used (like Pruning or BM25)
**`MarkdownGenerationResult`** includes:
- **`raw_markdown`** *(str)*: The full HTML→Markdown conversion.
- **`markdown_with_citations`** *(str)*: Same markdown, but with link references as academic-style citations.
- **`references_markdown`** *(str)*: The reference list or footnotes at the end.
- **`fit_markdown`** *(Optional[str])*: If content filtering (Pruning/BM25) was applied, the filtered “fit” text.
- **`fit_html`** *(Optional[str])*: The HTML that led to `fit_markdown`.
**Usage**:
```python
if result.markdown:
md_res = result.markdown
print("Raw MD:", md_res.raw_markdown[:300])
print("Citations MD:", md_res.markdown_with_citations[:300])
print("References:", md_res.references_markdown)
if md_res.fit_markdown:
print("Pruned text:", md_res.fit_markdown[:300])
```
### 3.2 **`markdown`** *(Optional[Union[str, MarkdownGenerationResult]])*
**What**: Holds the `MarkdownGenerationResult` (the `str` side of the `Union` is legacy; see the deprecation notes in section 8).
**Usage**:
```python
if result.markdown:
    print(result.markdown.raw_markdown[:200])
    print(result.markdown.fit_markdown)
    print(result.markdown.fit_html)
```
**Important**: “Fit” content (`fit_markdown`/`fit_html`) exists in `result.markdown` only if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.
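As a rough sketch of wiring up a filter so that fit content is produced (assuming `DefaultMarkdownGenerator` and `PruningContentFilter` from the `markdown_generation_strategy` and `content_filter_strategy` modules):

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter()  # or BM25ContentFilter(...)
    )
)

async def crawl_with_fit():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        if result.markdown and result.markdown.fit_markdown:
            print("Fit Markdown:", result.markdown.fit_markdown[:300])
```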
---
## 4. Media & Links
### 4.1 **`media`** *(Dict[str, List[Dict]])*
**What**: Contains info about discovered images, videos, or audio. Typical keys: `"images"`, `"videos"`, `"audios"`.
**Common Fields** in each item:
- `src` *(str)*: Media URL
- `alt` or `title` *(str)*: Descriptive text
- `score` *(float)*: Relevance score if the crawler’s heuristic found it “important”
- `desc` or `description` *(Optional[str])*: Additional context extracted from surrounding text
**Usage**:
```python
images = result.media.get("images", [])
for img in images:
if img.get("score", 0) > 5:
print("High-value image:", img["src"])
```
### 4.2 **`links`** *(Dict[str, List[Dict]])*
**What**: Holds internal and external link data. Usually two keys: `"internal"` and `"external"`.
**Common Fields**:
- `href` *(str)*: The link target
- `text` *(str)*: Link text
- `title` *(str)*: Title attribute
- `context` *(str)*: Surrounding text snippet
- `domain` *(str)*: If external, the domain
**Usage**:
```python
for link in result.links.get("internal", []):
    print(f"Internal link to {link['href']} with text {link['text']}")
```
---
## 5. Additional Fields
### 5.1 **`extracted_content`** *(Optional[str])*
**What**: If you used an **`extraction_strategy`** (CSS, LLM, etc.), the structured output (JSON).
**Usage**:
```python
import json

if result.extracted_content:
    data = json.loads(result.extracted_content)
    print(data)
```
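A minimal sketch of producing `extracted_content` with a CSS-based strategy (assuming `JsonCssExtractionStrategy`; the schema and its selectors here are hypothetical and must be adapted to your target page):

```python
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Hypothetical schema for illustration; adjust selectors to your target page
schema = {
    "name": "Articles",
    "baseSelector": "article",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

async def extract_structured():
    config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        if result.extracted_content:
            print(json.loads(result.extracted_content))
```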
### 5.2 **`downloaded_files`** *(Optional[List[str]])*
**What**: If `accept_downloads=True` (plus a `downloads_path`) is set in your `BrowserConfig`, lists the local file paths of downloaded items.
**Usage**:
```python
if result.downloaded_files:
for file_path in result.downloaded_files:
print("Downloaded:", file_path)
```
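A sketch of enabling downloads so this field gets populated (assuming the `accept_downloads` and `downloads_path` options on `BrowserConfig` noted above; the path is illustrative):

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(
    accept_downloads=True,
    downloads_path="./downloads",  # illustrative local directory
)

async def crawl_with_downloads():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com/files")
        print(result.downloaded_files or "No files downloaded")
```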
### 5.3 **`screenshot`** *(Optional[str])*
**What**: Base64-encoded screenshot if `screenshot=True` in `CrawlerRunConfig`.
**Usage**:
```python
import base64
if result.screenshot:
with open("page.png", "wb") as f:
f.write(base64.b64decode(result.screenshot))
```
### 5.4 **`pdf`** *(Optional[bytes])*
**What**: Raw PDF bytes if `pdf=True` in `CrawlerRunConfig`.
**Usage**:
```python
if result.pdf:
with open("page.pdf", "wb") as f:
f.write(result.pdf)
```
### 5.5 **`mhtml`** *(Optional[str])*
**What**: MHTML snapshot of the page if `capture_mhtml=True` in `CrawlerRunConfig`. The MHTML (MIME HTML) format preserves the entire web page with all its resources (CSS, images, scripts, etc.) in a single file.
**Usage**:
```python
if result.mhtml:
with open("page.mhtml", "w", encoding="utf-8") as f:
f.write(result.mhtml)
```
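To capture all three artifacts from sections 5.3–5.5 in one pass, a sketch assuming the `screenshot`, `pdf`, and `capture_mhtml` flags on `CrawlerRunConfig` mentioned above:

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

config = CrawlerRunConfig(screenshot=True, pdf=True, capture_mhtml=True)

async def capture_artifacts():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print("Screenshot captured:", result.screenshot is not None)
        print("PDF captured:", result.pdf is not None)
        print("MHTML captured:", result.mhtml is not None)
```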
### 5.6 **`metadata`** *(Optional[dict])*
**What**: Page-level metadata if discovered (title, description, OG data, etc.).
**Usage**:
```python
if result.metadata:
print("Title:", result.metadata.get("title"))
print("Author:", result.metadata.get("author"))
```
---
## 6. `dispatch_result` (optional)
A `DispatchResult` object providing additional concurrency and resource usage information when crawling URLs in parallel (e.g., via `arun_many()` with custom dispatchers). It contains:
- **`task_id`** *(str)*: A unique identifier for the parallel task.
- **`memory_usage`** *(float)*: The memory (in MB) used at the time of completion.
- **`peak_memory`** *(float)*: The peak memory usage (in MB) recorded during the task’s execution.
- **`start_time`** / **`end_time`** *(datetime)*: Time range for this crawling task.
- **`error_message`** *(str)*: Any dispatcher- or concurrency-related error encountered.
```python
# Example usage:
for result in results:
if result.success and result.dispatch_result:
dr = result.dispatch_result
print(f"URL: {result.url}, Task ID: {dr.task_id}")
print(f"Memory: {dr.memory_usage:.1f} MB (Peak: {dr.peak_memory:.1f} MB)")
print(f"Duration: {dr.end_time - dr.start_time}")
```
> **Note**: This field is typically populated when using `arun_many(...)` alongside a **dispatcher** (e.g., `MemoryAdaptiveDispatcher` or `SemaphoreDispatcher`). If no concurrency or dispatcher is used, `dispatch_result` may remain `None`.
---
## 7. Example: Accessing Everything
```python
async def handle_result(result: CrawlResult):
if not result.success:
print("Crawl error:", result.error_message)
return
# Basic info
print("Crawled URL:", result.url)
print("Status code:", result.status_code)
# HTML
print("Original HTML size:", len(result.html))
print("Cleaned HTML size:", len(result.cleaned_html or ""))
# Markdown output
if result.markdown:
print("Raw Markdown:", result.markdown.raw_markdown[:300])
print("Citations Markdown:", result.markdown.markdown_with_citations[:300])
if result.markdown.fit_markdown:
print("Fit Markdown:", result.markdown.fit_markdown[:200])
# Media & Links
if "images" in result.media:
print("Image count:", len(result.media["images"]))
if "internal" in result.links:
print("Internal link count:", len(result.links["internal"]))
# Extraction strategy result
if result.extracted_content:
print("Structured data:", result.extracted_content)
# Screenshot/PDF/MHTML
if result.screenshot:
print("Screenshot length:", len(result.screenshot))
if result.pdf:
print("PDF bytes length:", len(result.pdf))
if result.mhtml:
print("MHTML length:", len(result.mhtml))
```
---
## 8. Key Points & Future
1. **Deprecated legacy properties of CrawlResult**
- `markdown_v2` - Deprecated in v0.5. Just use `markdown`; it now holds the `MarkdownGenerationResult`.
- `fit_markdown` and `fit_html` - Deprecated in v0.5. Access them via the `MarkdownGenerationResult` in `result.markdown`, e.g., `result.markdown.fit_markdown` and `result.markdown.fit_html`.
2. **Fit Content**
- **`fit_markdown`** and **`fit_html`** appear in `MarkdownGenerationResult` only if you used a content filter (like **PruningContentFilter** or **BM25ContentFilter**) inside your **MarkdownGenerationStrategy** or set them directly.
- If no filter is used, they remain `None`.
3. **References & Citations**
- If you enable link citations in your `DefaultMarkdownGenerator` (`options={"citations": True}`), you’ll see `markdown_with_citations` plus a **`references_markdown`** block (see the sketch after this list). This helps large language models or academic-style referencing.
4. **Links & Media**
- `links["internal"]` and `links["external"]` split discovered anchors by whether they point to the crawled domain or elsewhere.
- `media["images"]` / `["videos"]` / `["audios"]` store extracted media elements with optional scoring or context.
5. **Error Cases**
- If `success=False`, check `error_message` (e.g., timeouts, invalid URLs).
- `status_code` might be `None` if we failed before an HTTP response.
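
For point 3, a minimal sketch of enabling citations (assuming the `options` dict accepted by `DefaultMarkdownGenerator`):

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(options={"citations": True})
)
# Crawling with this config populates result.markdown.markdown_with_citations
# and result.markdown.references_markdown.
```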
Use **`CrawlResult`** to glean all final outputs and feed them into your data pipelines, AI models, or archives. With the synergy of a properly configured **BrowserConfig** and **CrawlerRunConfig**, the crawler can produce robust, structured results here in **`CrawlResult`**.