CrawlResult Reference
The CrawlResult class encapsulates everything returned after a single crawl operation. It provides the raw or processed content, details on links and media, plus optional metadata (like screenshots, PDFs, or extracted JSON).
Location: crawl4ai/crawler/models.py (for reference)
class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    downloaded_files: Optional[List[str]] = None
    screenshot: Optional[str] = None
    pdf: Optional[bytes] = None
    markdown: Optional[Union[str, MarkdownGenerationResult]] = None
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
    session_id: Optional[str] = None
    response_headers: Optional[dict] = None
    status_code: Optional[int] = None
    ssl_certificate: Optional[SSLCertificate] = None
    dispatch_result: Optional[DispatchResult] = None
    ...
Below is a field-by-field explanation and possible usage patterns.
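For orientation, here is a minimal end-to-end sketch that produces a CrawlResult (using the standard AsyncWebCrawler entry point; the URL is illustrative):
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # arun() returns a CrawlResult with all the fields described below
        result = await crawler.arun(url="https://example.com")
        print(result.success, result.status_code)

asyncio.run(main())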
1. Basic Crawl Info
1.1 url (str)
What: The final crawled URL (after any redirects).
Usage:
print(result.url) # e.g., "https://example.com/"
1.2 success (bool)
What: True if the crawl pipeline ended without major errors; False otherwise.
Usage:
if not result.success:
    print(f"Crawl failed: {result.error_message}")
1.3 status_code (Optional[int])
What: The page’s HTTP status code (e.g., 200, 404).
Usage:
if result.status_code == 404:
    print("Page not found!")
1.4 error_message (Optional[str])
What: If success=False, a textual description of the failure.
Usage:
if not result.success:
    print("Error:", result.error_message)
1.5 session_id (Optional[str])
What: The ID used for reusing a browser context across multiple calls.
Usage:
# If you used session_id="login_session" in CrawlerRunConfig, see it here:
print("Session:", result.session_id)
1.6 response_headers (Optional[dict])
What: Final HTTP response headers.
Usage:
if result.response_headers:
    print("Server:", result.response_headers.get("Server", "Unknown"))
1.7 ssl_certificate (Optional[SSLCertificate])
What: If fetch_ssl_certificate=True in your CrawlerRunConfig, result.ssl_certificate contains an SSLCertificate object describing the site’s certificate. You can export the cert in multiple formats (PEM/DER/JSON) or access its properties like issuer, subject, valid_from, valid_until, etc.
Usage:
if result.ssl_certificate:
    print("Issuer:", result.ssl_certificate.issuer)
2. Raw / Cleaned Content
2.1 html (str)
What: The original unmodified HTML from the final page load.
Usage:
# Possibly large
print(len(result.html))
2.2 cleaned_html (Optional[str])
What: A sanitized HTML version—scripts, styles, or excluded tags are removed based on your CrawlerRunConfig.
Usage:
print(result.cleaned_html[:500]) # Show a snippet
2.3 fit_html (Optional[str])
What: If a content filter or heuristic (e.g., Pruning/BM25) modifies the HTML, the “fit” or post-filter version.
When: This is only present if your markdown_generator or content_filter produces it.
Usage:
if result.markdown.fit_html:
    print("High-value HTML content:", result.markdown.fit_html[:300])
3. Markdown Fields
3.1 The Markdown Generation Approach
Crawl4AI can convert HTML→Markdown, optionally including:
- Raw markdown
- Links as citations (with a references section)
- Fit markdown if a content filter is used (like Pruning or BM25)
MarkdownGenerationResult includes:
- raw_markdown (str): The full HTML→Markdown conversion.
- markdown_with_citations (str): Same markdown, but with link references as academic-style citations.
- references_markdown (str): The reference list or footnotes at the end.
- fit_markdown (Optional[str]): If content filtering (Pruning/BM25) was applied, the filtered “fit” text.
- fit_html (Optional[str]): The HTML that produced fit_markdown.
Usage:
if result.markdown:
    md_res = result.markdown
    print("Raw MD:", md_res.raw_markdown[:300])
    print("Citations MD:", md_res.markdown_with_citations[:300])
    print("References:", md_res.references_markdown)
    if md_res.fit_markdown:
        print("Pruned text:", md_res.fit_markdown[:300])
3.2 markdown (Optional[Union[str, MarkdownGenerationResult]])
What: Holds the MarkdownGenerationResult. For backward compatibility it can also behave as a plain string (resolving to the raw markdown) when treated as one.
Usage:
print(result.markdown.raw_markdown[:200])
print(result.markdown.fit_markdown)
print(result.markdown.fit_html)
Important: “Fit” content (in fit_markdown/fit_html) exists in result.markdown only if you used a filter (like PruningContentFilter or BM25ContentFilter) within a MarkdownGenerationStrategy.
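A sketch of wiring up a filter so fit content is produced (module paths follow the crawl4ai package layout; the threshold value is illustrative):
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.5)  # illustrative threshold
    )
)
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", config=config)
    if result.markdown.fit_markdown:
        print(result.markdown.fit_markdown[:200])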
4. Media & Links
4.1 media (Dict[str, List[Dict]])
What: Contains info about discovered images, videos, or audio. Typically keys: "images", "videos", "audios".
Common Fields in each item:
- src (str): Media URL
- alt or title (str): Descriptive text
- score (float): Relevance score if the crawler’s heuristic found it “important”
- desc or description (Optional[str]): Additional context extracted from surrounding text
Usage:
images = result.media.get("images", [])
for img in images:
    if img.get("score", 0) > 5:
        print("High-value image:", img["src"])
4.2 links (Dict[str, List[Dict]])
What: Holds internal and external link data. Usually two keys: "internal" and "external".
Common Fields:
- href (str): The link target
- text (str): Link text
- title (str): Title attribute
- context (str): Surrounding text snippet
- domain (str): If external, the domain
Usage:
for link in result.links["internal"]:
    print(f"Internal link to {link['href']} with text {link['text']}")
5. Additional Fields
5.1 extracted_content (Optional[str])
What: If you used extraction_strategy (CSS, LLM, etc.), the structured output (JSON).
Usage:
import json

if result.extracted_content:
    data = json.loads(result.extracted_content)
    print(data)
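A sketch of producing extracted_content with a CSS-based strategy (the schema below is purely illustrative):
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Articles",  # illustrative schema
    "baseSelector": "article",
    "fields": [{"name": "title", "selector": "h2", "type": "text"}],
}
config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", config=config)
    if result.extracted_content:
        print(json.loads(result.extracted_content))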
5.2 downloaded_files (Optional[List[str]])
What: If accept_downloads=True and a downloads_path are set in your BrowserConfig, this lists the local file paths of downloaded items.
Usage:
if result.downloaded_files:
    for file_path in result.downloaded_files:
        print("Downloaded:", file_path)
5.3 screenshot (Optional[str])
What: Base64-encoded screenshot if screenshot=True in CrawlerRunConfig.
Usage:
import base64
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
5.4 pdf (Optional[bytes])
What: Raw PDF bytes if pdf=True in CrawlerRunConfig.
Usage:
if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)
5.5 metadata (Optional[dict])
What: Page-level metadata if discovered (title, description, OG data, etc.).
Usage:
if result.metadata:
    print("Title:", result.metadata.get("title"))
    print("Author:", result.metadata.get("author"))
6. dispatch_result (Optional[DispatchResult])
A DispatchResult object providing additional concurrency and resource usage information when crawling URLs in parallel (e.g., via arun_many() with custom dispatchers). It contains:
- task_id: A unique identifier for the parallel task.
- memory_usage (float): The memory (in MB) used at the time of completion.
- peak_memory (float): The peak memory usage (in MB) recorded during the task’s execution.
- start_time / end_time (datetime): Time range for this crawling task.
- error_message (str): Any dispatcher- or concurrency-related error encountered.
# Example usage:
for result in results:
    if result.success and result.dispatch_result:
        dr = result.dispatch_result
        print(f"URL: {result.url}, Task ID: {dr.task_id}")
        print(f"Memory: {dr.memory_usage:.1f} MB (Peak: {dr.peak_memory:.1f} MB)")
        print(f"Duration: {dr.end_time - dr.start_time}")
Note: This field is typically populated when using arun_many(...) alongside a dispatcher (e.g., MemoryAdaptiveDispatcher or SemaphoreDispatcher). If no concurrency or dispatcher is used, dispatch_result may remain None.
7. Example: Accessing Everything
async def handle_result(result: CrawlResult):
    if not result.success:
        print("Crawl error:", result.error_message)
        return

    # Basic info
    print("Crawled URL:", result.url)
    print("Status code:", result.status_code)

    # HTML
    print("Original HTML size:", len(result.html))
    print("Cleaned HTML size:", len(result.cleaned_html or ""))

    # Markdown output
    if result.markdown:
        print("Raw Markdown:", result.markdown.raw_markdown[:300])
        print("Citations Markdown:", result.markdown.markdown_with_citations[:300])
        if result.markdown.fit_markdown:
            print("Fit Markdown:", result.markdown.fit_markdown[:200])

    # Media & Links
    if "images" in result.media:
        print("Image count:", len(result.media["images"]))
    if "internal" in result.links:
        print("Internal link count:", len(result.links["internal"]))

    # Extraction strategy result
    if result.extracted_content:
        print("Structured data:", result.extracted_content)

    # Screenshot/PDF
    if result.screenshot:
        print("Screenshot length:", len(result.screenshot))
    if result.pdf:
        print("PDF bytes length:", len(result.pdf))
8. Key Points & Future
1. Deprecated legacy properties of CrawlResult
- markdown_v2: Deprecated in v0.5. Just use markdown; it holds the MarkdownGenerationResult now.
- fit_markdown and fit_html: Deprecated in v0.5. Access them via the MarkdownGenerationResult in result.markdown, e.g., result.markdown.fit_markdown and result.markdown.fit_html. In code, migration looks like the sketch below.
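# Deprecated (pre-v0.5):
# raw = result.markdown_v2.raw_markdown
# fit = result.fit_markdown

# v0.5+:
raw = result.markdown.raw_markdown
fit = result.markdown.fit_markdown
html = result.markdown.fit_html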
2. Fit Content
- fit_markdown and fit_html appear in MarkdownGenerationResult only if you used a content filter (like PruningContentFilter or BM25ContentFilter) inside your MarkdownGenerationStrategy or set them directly.
- If no filter is used, they remain None.
3. References & Citations
- If you enable link citations in your DefaultMarkdownGenerator (options={"citations": True}), you’ll see markdown_with_citations plus a references_markdown block. This keeps links readable for large language models and supports academic-style referencing.
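For example (a sketch; the options dict matches what is shown above):
from crawl4ai import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(options={"citations": True})
)
# After crawling with this config, inspect:
# result.markdown.markdown_with_citations and result.markdown.references_markdown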
4. Links & Media
links["internal"]andlinks["external"]group discovered anchors by domain.media["images"]/["videos"]/["audios"]store extracted media elements with optional scoring or context.
5. Error Cases
- If success=False, check error_message (e.g., timeouts, invalid URLs).
- status_code might be None if the crawl failed before an HTTP response was received.
Use CrawlResult to gather all final outputs and feed them into your data pipelines, AI models, or archives. With a properly configured BrowserConfig and CrawlerRunConfig, the crawler produces robust, structured results in CrawlResult.