CrawlResult Reference
The CrawlResult class encapsulates everything returned after a single crawl operation. It provides the raw or processed content, details on links and media, plus optional metadata (like screenshots, PDFs, or extracted JSON).
Location: crawl4ai/crawler/models.py (for reference)
class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    downloaded_files: Optional[List[str]] = None
    screenshot: Optional[str] = None
    pdf: Optional[bytes] = None
    markdown: Optional[Union[str, MarkdownGenerationResult]] = None
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
    session_id: Optional[str] = None
    response_headers: Optional[dict] = None
    status_code: Optional[int] = None
    ssl_certificate: Optional[SSLCertificate] = None
    dispatch_result: Optional[DispatchResult] = None
    ...
Below is a field-by-field explanation and possible usage patterns.
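For orientation, here is a minimal end-to-end sketch that produces a CrawlResult (using the standard AsyncWebCrawler entry point; the URL is illustrative):
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # arun() returns a CrawlResult with all the fields described below
        result = await crawler.arun(url="https://example.com")
        print(result.success, result.status_code)

asyncio.run(main())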
1. Basic Crawl Info
1.1 url (str)
What: The final crawled URL (after any redirects).
Usage:
print(result.url) # e.g., "https://example.com/"
1.2 success (bool)
What: True if the crawl pipeline ended without major errors; False otherwise.
Usage:
if not result.success:
    print(f"Crawl failed: {result.error_message}")
1.3 status_code (Optional[int])
What: The page’s HTTP status code (e.g., 200, 404).
Usage:
if result.status_code == 404:
    print("Page not found!")
1.4 error_message (Optional[str])
What: If success=False, a textual description of the failure.
Usage:
if not result.success:
    print("Error:", result.error_message)
1.5 session_id (Optional[str])
What: The ID used for reusing a browser context across multiple calls.
Usage:
# If you used session_id="login_session" in CrawlerRunConfig, see it here:
print("Session:", result.session_id)
1.6 response_headers (Optional[dict])
What: Final HTTP response headers.
Usage:
if result.response_headers:
    print("Server:", result.response_headers.get("Server", "Unknown"))
1.7 ssl_certificate (Optional[SSLCertificate])
What: If fetch_ssl_certificate=True in your CrawlerRunConfig, result.ssl_certificate contains an SSLCertificate object describing the site’s certificate. You can export the cert in multiple formats (PEM/DER/JSON) or access its properties like issuer, subject, valid_from, valid_until, etc.
Usage:
if result.ssl_certificate:
    print("Issuer:", result.ssl_certificate.issuer)
2. Raw / Cleaned Content
2.1 html (str)
What: The original unmodified HTML from the final page load.
Usage:
# Possibly large
print(len(result.html))
2.2 cleaned_html (Optional[str])
What: A sanitized HTML version—scripts, styles, or excluded tags are removed based on your CrawlerRunConfig.
Usage:
print(result.cleaned_html[:500]) # Show a snippet
2.3 fit_html (Optional[str])
What: If a content filter or heuristic (e.g., Pruning/BM25) modifies the HTML, the “fit” or post-filter version.
When: This is only present if your markdown_generator or content_filter produces it.
Usage:
if result.markdown.fit_html:
    print("High-value HTML content:", result.markdown.fit_html[:300])
3. Markdown Fields
3.1 The Markdown Generation Approach
Crawl4AI can convert HTML→Markdown, optionally including:
- Raw markdown
- Links as citations (with a references section)
- Fit markdown if a content filter is used (like Pruning or BM25)
MarkdownGenerationResult includes:
- raw_markdown (str): The full HTML→Markdown conversion.
- markdown_with_citations (str): Same markdown, but with link references as academic-style citations.
- references_markdown (str): The reference list or footnotes at the end.
- fit_markdown (Optional[str]): If content filtering (Pruning/BM25) was applied, the filtered “fit” text.
- fit_html (Optional[str]): The HTML that produced fit_markdown.
Usage:
if result.markdown:
    md_res = result.markdown
    print("Raw MD:", md_res.raw_markdown[:300])
    print("Citations MD:", md_res.markdown_with_citations[:300])
    print("References:", md_res.references_markdown)
    if md_res.fit_markdown:
        print("Pruned text:", md_res.fit_markdown[:300])
3.2 markdown (Optional[Union[str, MarkdownGenerationResult]])
What: Holds the MarkdownGenerationResult. For backward compatibility it can also behave as a plain string (resolving to the raw markdown) when treated as one.
Usage:
print(result.markdown.raw_markdown[:200])
print(result.markdown.fit_markdown)
print(result.markdown.fit_html)
Important: “Fit” content (in fit_markdown/fit_html) exists in result.markdown only if you used a filter (like PruningContentFilter or BM25ContentFilter) within a MarkdownGenerationStrategy.
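A sketch of wiring up a filter so fit content is produced (module paths follow the crawl4ai package layout; the threshold value is illustrative):
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.5)  # illustrative threshold
    )
)
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", config=config)
    if result.markdown.fit_markdown:
        print(result.markdown.fit_markdown[:200])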
4. Media & Links
4.1 media (Dict[str, List[Dict]])
What: Contains info about discovered images, videos, or audio. Typically keys: "images", "videos", "audios".
Common Fields in each item:
- src (str): Media URL
- alt or title (str): Descriptive text
- score (float): Relevance score if the crawler’s heuristic found it “important”
- desc or description (Optional[str]): Additional context extracted from surrounding text
Usage:
images = result.media.get("images", [])
for img in images:
    if img.get("score", 0) > 5:
        print("High-value image:", img["src"])
4.2 links (Dict[str, List[Dict]])
What: Holds internal and external link data. Usually two keys: "internal" and "external".
Common Fields:
- href (str): The link target
- text (str): Link text
- title (str): Title attribute
- context (str): Surrounding text snippet
- domain (str): If external, the domain
Usage:
for link in result.links["internal"]:
    print(f"Internal link to {link['href']} with text {link['text']}")
5. Additional Fields
5.1 extracted_content (Optional[str])
What: If you used extraction_strategy (CSS, LLM, etc.), the structured output (JSON).
Usage:
import json

if result.extracted_content:
    data = json.loads(result.extracted_content)
    print(data)
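A sketch of producing extracted_content with a CSS-based strategy (the schema below is purely illustrative):
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Articles",  # illustrative schema
    "baseSelector": "article",
    "fields": [{"name": "title", "selector": "h2", "type": "text"}],
}
config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", config=config)
    if result.extracted_content:
        print(json.loads(result.extracted_content))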
5.2 downloaded_files (Optional[List[str]])
What: If accept_downloads=True and a downloads_path are set in your BrowserConfig, this lists the local file paths of downloaded items.
Usage:
if result.downloaded_files:
    for file_path in result.downloaded_files:
        print("Downloaded:", file_path)
5.3 screenshot (Optional[str])
What: Base64-encoded screenshot if screenshot=True in CrawlerRunConfig.
Usage:
import base64
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
5.4 pdf (Optional[bytes])
What: Raw PDF bytes if pdf=True in CrawlerRunConfig.
Usage:
if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)
5.5 metadata (Optional[dict])
What: Page-level metadata if discovered (title, description, OG data, etc.).
Usage:
if result.metadata:
    print("Title:", result.metadata.get("title"))
    print("Author:", result.metadata.get("author"))
6. dispatch_result (Optional[DispatchResult])
A DispatchResult object providing additional concurrency and resource usage information when crawling URLs in parallel (e.g., via arun_many() with custom dispatchers). It contains:
- task_id: A unique identifier for the parallel task.
- memory_usage (float): The memory (in MB) used at the time of completion.
- peak_memory (float): The peak memory usage (in MB) recorded during the task’s execution.
- start_time / end_time (datetime): Time range for this crawling task.
- error_message (str): Any dispatcher- or concurrency-related error encountered.
# Example usage:
for result in results:
    if result.success and result.dispatch_result:
        dr = result.dispatch_result
        print(f"URL: {result.url}, Task ID: {dr.task_id}")
        print(f"Memory: {dr.memory_usage:.1f} MB (Peak: {dr.peak_memory:.1f} MB)")
        print(f"Duration: {dr.end_time - dr.start_time}")
Note: This field is typically populated when using arun_many(...) alongside a dispatcher (e.g., MemoryAdaptiveDispatcher or SemaphoreDispatcher). If no concurrency or dispatcher is used, dispatch_result may remain None.
7. Example: Accessing Everything
async def handle_result(result: CrawlResult):
    if not result.success:
        print("Crawl error:", result.error_message)
        return

    # Basic info
    print("Crawled URL:", result.url)
    print("Status code:", result.status_code)

    # HTML
    print("Original HTML size:", len(result.html))
    print("Cleaned HTML size:", len(result.cleaned_html or ""))

    # Markdown output
    if result.markdown:
        print("Raw Markdown:", result.markdown.raw_markdown[:300])
        print("Citations Markdown:", result.markdown.markdown_with_citations[:300])
        if result.markdown.fit_markdown:
            print("Fit Markdown:", result.markdown.fit_markdown[:200])

    # Media & Links
    if "images" in result.media:
        print("Image count:", len(result.media["images"]))
    if "internal" in result.links:
        print("Internal link count:", len(result.links["internal"]))

    # Extraction strategy result
    if result.extracted_content:
        print("Structured data:", result.extracted_content)

    # Screenshot/PDF
    if result.screenshot:
        print("Screenshot length:", len(result.screenshot))
    if result.pdf:
        print("PDF bytes length:", len(result.pdf))
8. Key Points & Future
1. Deprecated legacy properties of CrawlResult
- markdown_v2: Deprecated in v0.5. Just use markdown; it holds the MarkdownGenerationResult now.
- fit_markdown and fit_html: Deprecated in v0.5. Access them via the MarkdownGenerationResult in result.markdown, e.g., result.markdown.fit_markdown and result.markdown.fit_html. In code, migration looks like the sketch below.
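# Deprecated (pre-v0.5):
# raw = result.markdown_v2.raw_markdown
# fit = result.fit_markdown

# v0.5+:
raw = result.markdown.raw_markdown
fit = result.markdown.fit_markdown
html = result.markdown.fit_html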
2. Fit Content
- fit_markdown and fit_html appear in MarkdownGenerationResult only if you used a content filter (like PruningContentFilter or BM25ContentFilter) inside your MarkdownGenerationStrategy or set them directly.
- If no filter is used, they remain None.
3. References & Citations
- If you enable link citations in your DefaultMarkdownGenerator (options={"citations": True}), you’ll see markdown_with_citations plus a references_markdown block. This keeps links readable for large language models and supports academic-style referencing.
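For example (a sketch; the options dict matches what is shown above):
from crawl4ai import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(options={"citations": True})
)
# After crawling with this config, inspect:
# result.markdown.markdown_with_citations and result.markdown.references_markdown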
4. Links & Media
links["internal"]andlinks["external"]group discovered anchors by domain.media["images"]/["videos"]/["audios"]store extracted media elements with optional scoring or context.
5. Error Cases
- If success=False, check error_message (e.g., timeouts, invalid URLs).
- status_code might be None if the crawl failed before an HTTP response was received.
Use CrawlResult to gather all final outputs and feed them into your data pipelines, AI models, or archives. With a properly configured BrowserConfig and CrawlerRunConfig, the crawler produces robust, structured results in CrawlResult.