# `CrawlResult` Reference
The **`CrawlResult`** class encapsulates everything returned after a single crawl operation. It provides the **raw or processed content**, details on links and media, plus optional metadata (like screenshots, PDFs, or extracted JSON).
**Location**: `crawl4ai/crawler/models.py` (for reference)
```python
class CrawlResult(BaseModel):
url: str
html: str
success: bool
cleaned_html: Optional[str] = None
media: Dict[str, List[Dict]] = {}
links: Dict[str, List[Dict]] = {}
downloaded_files: Optional[List[str]] = None
screenshot: Optional[str] = None
    pdf: Optional[bytes] = None
mhtml: Optional[str] = None
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
extracted_content: Optional[str] = None
metadata: Optional[dict] = None
error_message: Optional[str] = None
session_id: Optional[str] = None
response_headers: Optional[dict] = None
status_code: Optional[int] = None
ssl_certificate: Optional[SSLCertificate] = None
dispatch_result: Optional[DispatchResult] = None
...
```
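For orientation, here is a minimal sketch of how a `CrawlResult` is typically obtained (assuming the standard `AsyncWebCrawler.arun()` entry point):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # arun() returns a CrawlResult for the crawled page
        result = await crawler.arun(url="https://example.com")
        print(result.success, result.status_code)

asyncio.run(main())
```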
Below is a **field-by-field** explanation and possible usage patterns.
---
## 1. Basic Crawl Info
### 1.1 **`url`** *(str)*
**What**: The final crawled URL (after any redirects).
**Usage**:
```python
print(result.url) # e.g., "https://example.com/"
```
### 1.2 **`success`** *(bool)*
**What**: `True` if the crawl pipeline ended without major errors; `False` otherwise.
**Usage**:
```python
if not result.success:
print(f"Crawl failed: {result.error_message}")
```
### 1.3 **`status_code`** *(Optional[int])*
**What**: The page’s HTTP status code (e.g., 200, 404).
**Usage**:
```python
if result.status_code == 404:
print("Page not found!")
```
### 1.4 **`error_message`** *(Optional[str])*
**What**: If `success=False`, a textual description of the failure.
**Usage**:
```python
if not result.success:
print("Error:", result.error_message)
```
### 1.5 **`session_id`** *(Optional[str])*
**What**: The ID used for reusing a browser context across multiple calls.
**Usage**:
```python
# If you used session_id="login_session" in CrawlerRunConfig, see it here:
print("Session:", result.session_id)
```
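As a sketch of how a session ID flows from config to result (assuming the `session_id` parameter on `CrawlerRunConfig`, as the comment above suggests):

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_with_session():
    config = CrawlerRunConfig(session_id="login_session")
    async with AsyncWebCrawler() as crawler:
        # Both calls reuse the same browser context via the shared session_id
        first = await crawler.arun(url="https://example.com/login", config=config)
        second = await crawler.arun(url="https://example.com/dashboard", config=config)
        print(first.session_id, second.session_id)  # both "login_session"
```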
### 1.6 **`response_headers`** *(Optional[dict])*
**What**: Final HTTP response headers.
**Usage**:
```python
if result.response_headers:
print("Server:", result.response_headers.get("Server", "Unknown"))
```
### 1.7 **`ssl_certificate`** *(Optional[SSLCertificate])*
**What**: If `fetch_ssl_certificate=True` in your `CrawlerRunConfig`, **`result.ssl_certificate`** contains an [**`SSLCertificate`**](../advanced/ssl-certificate.md) object describing the site’s certificate. You can export the cert in multiple formats (PEM/DER/JSON) or access its properties like `issuer`, `subject`, `valid_from`, `valid_until`, etc.
**Usage**:
```python
if result.ssl_certificate:
print("Issuer:", result.ssl_certificate.issuer)
```
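A sketch of exporting the certificate, assuming the `to_pem`/`to_json` helpers described in the linked [`SSLCertificate`](../advanced/ssl-certificate.md) guide:

```python
if result.ssl_certificate:
    cert = result.ssl_certificate
    print("Subject:", cert.subject)
    print("Valid until:", cert.valid_until)
    # Export helpers as described in the SSL certificate guide
    cert.to_pem("certificate.pem")
    cert.to_json("certificate.json")
```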
---
## 2. Raw / Cleaned Content
### 2.1 **`html`** *(str)*
**What**: The **original** unmodified HTML from the final page load.
**Usage**:
```python
# Possibly large
print(len(result.html))
```
### 2.2 **`cleaned_html`** *(Optional[str])*
**What**: A sanitized HTML version—scripts, styles, or excluded tags are removed based on your `CrawlerRunConfig`.
**Usage**:
```python
print((result.cleaned_html or "")[:500])  # Show a snippet; field may be None
```
### 2.3 **`fit_html`** *(Optional[str])*
**What**: If a **content filter** or heuristic (e.g., Pruning/BM25) modifies the HTML, the “fit” or post-filter version.
**When**: This is **only** present if your `markdown_generator` or `content_filter` produces it.
**Usage**:
```python
if result.markdown and result.markdown.fit_html:
    print("High-value HTML content:", result.markdown.fit_html[:300])
```
---
## 3. Markdown Fields
### 3.1 The Markdown Generation Approach
Crawl4AI can convert HTML→Markdown, optionally including:
- **Raw** markdown
- **Links as citations** (with a references section)
- **Fit** markdown if a **content filter** is used (like Pruning or BM25)
**`MarkdownGenerationResult`** includes:
- **`raw_markdown`** *(str)*: The full HTML→Markdown conversion.
- **`markdown_with_citations`** *(str)*: Same markdown, but with link references as academic-style citations.
- **`references_markdown`** *(str)*: The reference list or footnotes at the end.
- **`fit_markdown`** *(Optional[str])*: If content filtering (Pruning/BM25) was applied, the filtered “fit” text.
- **`fit_html`** *(Optional[str])*: The HTML that led to `fit_markdown`.
**Usage**:
```python
if result.markdown:
md_res = result.markdown
print("Raw MD:", md_res.raw_markdown[:300])
print("Citations MD:", md_res.markdown_with_citations[:300])
print("References:", md_res.references_markdown)
if md_res.fit_markdown:
print("Pruned text:", md_res.fit_markdown[:300])
```
### 3.2 **`markdown`** *(Optional[Union[str, MarkdownGenerationResult]])*
**What**: Holds the `MarkdownGenerationResult` (the `str` side of the `Union` is legacy; see the deprecation notes in section 8).
**Usage**:
```python
if result.markdown:
    print(result.markdown.raw_markdown[:200])
    print(result.markdown.fit_markdown)
    print(result.markdown.fit_html)
```
**Important**: “Fit” content (`fit_markdown`/`fit_html`) exists in `result.markdown` only if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.
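As a rough sketch of wiring up a filter so that fit content is produced (assuming `DefaultMarkdownGenerator` and `PruningContentFilter` from the `markdown_generation_strategy` and `content_filter_strategy` modules):

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter()  # or BM25ContentFilter(...)
    )
)

async def crawl_with_fit():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        if result.markdown and result.markdown.fit_markdown:
            print("Fit Markdown:", result.markdown.fit_markdown[:300])
```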
---
## 4. Media & Links
### 4.1 **`media`** *(Dict[str, List[Dict]])*
**What**: Contains info about discovered images, videos, or audio. Typical keys: `"images"`, `"videos"`, `"audios"`.
**Common Fields** in each item:
- `src` *(str)*: Media URL
- `alt` or `title` *(str)*: Descriptive text
- `score` *(float)*: Relevance score if the crawler’s heuristic found it “important”
- `desc` or `description` *(Optional[str])*: Additional context extracted from surrounding text
**Usage**:
```python
images = result.media.get("images", [])
for img in images:
if img.get("score", 0) > 5:
print("High-value image:", img["src"])
```
### 4.2 **`links`** *(Dict[str, List[Dict]])*
**What**: Holds internal and external link data. Usually two keys: `"internal"` and `"external"`.
**Common Fields**:
- `href` *(str)*: The link target
- `text` *(str)*: Link text
- `title` *(str)*: Title attribute
- `context` *(str)*: Surrounding text snippet
- `domain` *(str)*: If external, the domain
**Usage**:
```python
for link in result.links.get("internal", []):
    print(f"Internal link to {link['href']} with text {link['text']}")
```
---
## 5. Additional Fields
### 5.1 **`extracted_content`** *(Optional[str])*
**What**: If you used an **`extraction_strategy`** (CSS, LLM, etc.), the structured output (JSON).
**Usage**:
```python
import json

if result.extracted_content:
    data = json.loads(result.extracted_content)
    print(data)
```
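A minimal sketch of producing `extracted_content` with a CSS-based strategy (assuming `JsonCssExtractionStrategy`; the schema and its selectors here are hypothetical and must be adapted to your target page):

```python
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Hypothetical schema for illustration; adjust selectors to your target page
schema = {
    "name": "Articles",
    "baseSelector": "article",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

async def extract_structured():
    config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        if result.extracted_content:
            print(json.loads(result.extracted_content))
```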
### 5.2 **`downloaded_files`** *(Optional[List[str]])*
**What**: If `accept_downloads=True` (plus a `downloads_path`) is set in your `BrowserConfig`, lists the local file paths of downloaded items.
**Usage**:
```python
if result.downloaded_files:
for file_path in result.downloaded_files:
print("Downloaded:", file_path)
```
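A sketch of enabling downloads so this field gets populated (assuming the `accept_downloads` and `downloads_path` options on `BrowserConfig` noted above; the path is illustrative):

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(
    accept_downloads=True,
    downloads_path="./downloads",  # illustrative local directory
)

async def crawl_with_downloads():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com/files")
        print(result.downloaded_files or "No files downloaded")
```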
### 5.3 **`screenshot`** *(Optional[str])*
**What**: Base64-encoded screenshot if `screenshot=True` in `CrawlerRunConfig`.
**Usage**:
```python
import base64
if result.screenshot:
with open("page.png", "wb") as f:
f.write(base64.b64decode(result.screenshot))
```
### 5.4 **`pdf`** *(Optional[bytes])*
**What**: Raw PDF bytes if `pdf=True` in `CrawlerRunConfig`.
**Usage**:
```python
if result.pdf:
with open("page.pdf", "wb") as f:
f.write(result.pdf)
```
### 5.5 **`mhtml`** *(Optional[str])*
**What**: MHTML snapshot of the page if `capture_mhtml=True` in `CrawlerRunConfig`. The MHTML (MIME HTML) format preserves the entire web page with all its resources (CSS, images, scripts, etc.) in a single file.
**Usage**:
```python
if result.mhtml:
with open("page.mhtml", "w", encoding="utf-8") as f:
f.write(result.mhtml)
```
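To capture all three artifacts from sections 5.3–5.5 in one pass, a sketch assuming the `screenshot`, `pdf`, and `capture_mhtml` flags on `CrawlerRunConfig` mentioned above:

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

config = CrawlerRunConfig(screenshot=True, pdf=True, capture_mhtml=True)

async def capture_artifacts():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print("Screenshot captured:", result.screenshot is not None)
        print("PDF captured:", result.pdf is not None)
        print("MHTML captured:", result.mhtml is not None)
```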
### 5.6 **`metadata`** *(Optional[dict])*
**What**: Page-level metadata if discovered (title, description, OG data, etc.).
**Usage**:
```python
if result.metadata:
print("Title:", result.metadata.get("title"))
print("Author:", result.metadata.get("author"))
```
---
## 6. `dispatch_result` (optional)
A `DispatchResult` object providing additional concurrency and resource usage information when crawling URLs in parallel (e.g., via `arun_many()` with custom dispatchers). It contains:
- **`task_id`** *(str)*: A unique identifier for the parallel task.
- **`memory_usage`** *(float)*: The memory (in MB) used at the time of completion.
- **`peak_memory`** *(float)*: The peak memory usage (in MB) recorded during the task’s execution.
- **`start_time`** / **`end_time`** *(datetime)*: Time range for this crawling task.
- **`error_message`** *(str)*: Any dispatcher- or concurrency-related error encountered.
```python
# Example usage:
for result in results:
if result.success and result.dispatch_result:
dr = result.dispatch_result
print(f"URL: {result.url}, Task ID: {dr.task_id}")
print(f"Memory: {dr.memory_usage:.1f} MB (Peak: {dr.peak_memory:.1f} MB)")
print(f"Duration: {dr.end_time - dr.start_time}")
```
> **Note**: This field is typically populated when using `arun_many(...)` alongside a **dispatcher** (e.g., `MemoryAdaptiveDispatcher` or `SemaphoreDispatcher`). If no concurrency or dispatcher is used, `dispatch_result` may remain `None`.
---
## 7. Example: Accessing Everything
```python
async def handle_result(result: CrawlResult):
if not result.success:
print("Crawl error:", result.error_message)
return
# Basic info
print("Crawled URL:", result.url)
print("Status code:", result.status_code)
# HTML
print("Original HTML size:", len(result.html))
print("Cleaned HTML size:", len(result.cleaned_html or ""))
# Markdown output
if result.markdown:
print("Raw Markdown:", result.markdown.raw_markdown[:300])
print("Citations Markdown:", result.markdown.markdown_with_citations[:300])
if result.markdown.fit_markdown:
print("Fit Markdown:", result.markdown.fit_markdown[:200])
# Media & Links
if "images" in result.media:
print("Image count:", len(result.media["images"]))
if "internal" in result.links:
print("Internal link count:", len(result.links["internal"]))
# Extraction strategy result
if result.extracted_content:
print("Structured data:", result.extracted_content)
# Screenshot/PDF/MHTML
if result.screenshot:
print("Screenshot length:", len(result.screenshot))
if result.pdf:
print("PDF bytes length:", len(result.pdf))
if result.mhtml:
print("MHTML length:", len(result.mhtml))
```
---
## 8. Key Points & Future
1. **Deprecated legacy properties of CrawlResult**
- `markdown_v2` - Deprecated in v0.5. Just use `markdown`; it now holds the `MarkdownGenerationResult`.
- `fit_markdown` and `fit_html` - Deprecated in v0.5. Access them via the `MarkdownGenerationResult` in `result.markdown`, e.g., `result.markdown.fit_markdown` and `result.markdown.fit_html`.
2. **Fit Content**
- **`fit_markdown`** and **`fit_html`** appear in `MarkdownGenerationResult` only if you used a content filter (like **PruningContentFilter** or **BM25ContentFilter**) inside your **MarkdownGenerationStrategy** or set them directly.
- If no filter is used, they remain `None`.
3. **References & Citations**
- If you enable link citations in your `DefaultMarkdownGenerator` (`options={"citations": True}`), you’ll see `markdown_with_citations` plus a **`references_markdown`** block (see the sketch after this list). This helps large language models or academic-style referencing.
4. **Links & Media**
- `links["internal"]` and `links["external"]` split discovered anchors by whether they point to the crawled domain or elsewhere.
- `media["images"]` / `["videos"]` / `["audios"]` store extracted media elements with optional scoring or context.
5. **Error Cases**
- If `success=False`, check `error_message` (e.g., timeouts, invalid URLs).
- `status_code` might be `None` if we failed before an HTTP response.
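
For point 3, a minimal sketch of enabling citations (assuming the `options` dict accepted by `DefaultMarkdownGenerator`):

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(options={"citations": True})
)
# Crawling with this config populates result.markdown.markdown_with_citations
# and result.markdown.references_markdown.
```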
Use **`CrawlResult`** to glean all final outputs and feed them into your data pipelines, AI models, or archives. With the synergy of a properly configured **BrowserConfig** and **CrawlerRunConfig**, the crawler can produce robust, structured results here in **`CrawlResult`**.