# Prefix-Based Input Handling in Crawl4AI

This guide will walk you through using the Crawl4AI library to crawl web pages, local HTML files, and raw HTML strings. We'll demonstrate these capabilities using a Wikipedia page as an example.
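
At a glance, the three input types are distinguished purely by the prefix of the `url` argument (the path and HTML below are placeholders):

```python
web_url = "https://en.wikipedia.org/wiki/apple"                    # live web page
file_url = "file:///path/to/apple.html"                            # local HTML file
raw_url = "raw:<html><body><h1>Hello, World!</h1></body></html>"   # raw HTML string
```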

## Table of Contents

- [Crawling a Web URL](#crawling-a-web-url)
- [Crawling a Local HTML File](#crawling-a-local-html-file)
- [Crawling Raw HTML Content](#crawling-raw-html-content)
- [Complete Example](#complete-example)
- [How It Works](#how-it-works)
- [Running the Example](#running-the-example)
- [Conclusion](#conclusion)

## Crawling a Web URL

To crawl a live web page, provide the URL starting with `http://` or `https://`.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_web():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://en.wikipedia.org/wiki/apple", bypass_cache=True)
        if result.success:
            print("Markdown Content:")
            print(result.markdown)
        else:
            print(f"Failed to crawl: {result.error_message}")

asyncio.run(crawl_web())
```

## Crawling a Local HTML File

To crawl a local HTML file, prefix the file path with `file://`.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_local_file():
    local_file_path = "/path/to/apple.html"  # Replace with your file path
    file_url = f"file://{local_file_path}"

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url=file_url, bypass_cache=True)
        if result.success:
            print("Markdown Content from Local File:")
            print(result.markdown)
        else:
            print(f"Failed to crawl local file: {result.error_message}")

asyncio.run(crawl_local_file())
```
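
The hard-coded path above is a placeholder. If you build the path at runtime, resolving it to an absolute path first (as the complete example below does) keeps the `file://` URL unambiguous. A minimal sketch, assuming a local `apple.html`:

```python
from pathlib import Path

# Resolve the relative path to an absolute one before building the file:// URL
html_path = Path("apple.html").resolve()  # "apple.html" is a hypothetical local file
file_url = f"file://{html_path}"
```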

## Crawling Raw HTML Content

To crawl raw HTML content, prefix the HTML string with `raw:`.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_raw_html():
    raw_html = "<html><body><h1>Hello, World!</h1></body></html>"
    raw_html_url = f"raw:{raw_html}"

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url=raw_html_url, bypass_cache=True)
        if result.success:
            print("Markdown Content from Raw HTML:")
            print(result.markdown)
        else:
            print(f"Failed to crawl raw HTML: {result.error_message}")

asyncio.run(crawl_raw_html())
```

## Complete Example

Below is a comprehensive script that:

  1. Crawls the Wikipedia page for "Apple".
  2. Saves the HTML content to a local file (apple.html).
  3. Crawls the local HTML file and verifies the markdown length matches the original crawl.
  4. Crawls the raw HTML content from the saved file and verifies consistency.

```python
import os
import sys
import asyncio
from pathlib import Path

# Add the parent directory to sys.path so the local crawl4ai package can be imported
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)

from crawl4ai import AsyncWebCrawler

async def main():
    # Define the URL to crawl
    wikipedia_url = "https://en.wikipedia.org/wiki/apple"
    
    # Define the path to save the HTML file
    # Save the file in the same directory as the script
    script_dir = Path(__file__).parent
    html_file_path = script_dir / "apple.html"
    
    async with AsyncWebCrawler(verbose=True) as crawler:
        print("\n=== Step 1: Crawling the Wikipedia URL ===")
        # Crawl the Wikipedia URL
        result = await crawler.arun(url=wikipedia_url, bypass_cache=True)
        
        # Check if crawling was successful
        if not result.success:
            print(f"Failed to crawl {wikipedia_url}: {result.error_message}")
            return
        
        # Save the HTML content to a local file
        with open(html_file_path, 'w', encoding='utf-8') as f:
            f.write(result.html)
        print(f"Saved HTML content to {html_file_path}")
        
        # Store the length of the generated markdown
        web_crawl_length = len(result.markdown)
        print(f"Length of markdown from web crawl: {web_crawl_length}\n")
        
        print("=== Step 2: Crawling from the Local HTML File ===")
        # Construct the file URL with 'file://' prefix
        file_url = f"file://{html_file_path.resolve()}"
        
        # Crawl the local HTML file
        local_result = await crawler.arun(url=file_url, bypass_cache=True)
        
        # Check if crawling was successful
        if not local_result.success:
            print(f"Failed to crawl local file {file_url}: {local_result.error_message}")
            return
        
        # Store the length of the generated markdown from local file
        local_crawl_length = len(local_result.markdown)
        print(f"Length of markdown from local file crawl: {local_crawl_length}")
        
        # Compare the lengths
        assert web_crawl_length == local_crawl_length, (
            f"Markdown length mismatch: Web crawl ({web_crawl_length}) != Local file crawl ({local_crawl_length})"
        )
        print("✅ Markdown length matches between web crawl and local file crawl.\n")
        
        print("=== Step 3: Crawling Using Raw HTML Content ===")
        # Read the HTML content from the saved file
        with open(html_file_path, 'r', encoding='utf-8') as f:
            raw_html_content = f.read()
        
        # Prefix the raw HTML content with 'raw:'
        raw_html_url = f"raw:{raw_html_content}"
        
        # Crawl using the raw HTML content
        raw_result = await crawler.arun(url=raw_html_url, bypass_cache=True)
        
        # Check if crawling was successful
        if not raw_result.success:
            print(f"Failed to crawl raw HTML content: {raw_result.error_message}")
            return
        
        # Store the length of the generated markdown from raw HTML
        raw_crawl_length = len(raw_result.markdown)
        print(f"Length of markdown from raw HTML crawl: {raw_crawl_length}")
        
        # Compare the lengths
        assert web_crawl_length == raw_crawl_length, (
            f"Markdown length mismatch: Web crawl ({web_crawl_length}) != Raw HTML crawl ({raw_crawl_length})"
        )
        print("✅ Markdown length matches between web crawl and raw HTML crawl.\n")
        
        print("All tests passed successfully!")
        
    # Clean up by removing the saved HTML file
    if html_file_path.exists():
        os.remove(html_file_path)
        print(f"Removed the saved HTML file: {html_file_path}")

# Run the main function
if __name__ == "__main__":
    asyncio.run(main())
```

## How It Works

1. **Step 1: Crawl the Web URL**
    - Crawls `https://en.wikipedia.org/wiki/apple`.
    - Saves the HTML content to `apple.html`.
    - Records the length of the generated markdown.
2. **Step 2: Crawl from the Local HTML File**
    - Uses the `file://` prefix to crawl `apple.html`.
    - Ensures the markdown length matches the original web crawl.
3. **Step 3: Crawl Using Raw HTML Content**
    - Reads the HTML from `apple.html`.
    - Prefixes it with `raw:` and crawls it.
    - Verifies the markdown length matches the previous results.
4. **Cleanup**
    - Deletes the `apple.html` file after testing.

## Running the Example

1. **Save the Script**
    - Save the code above as `test_crawl4ai.py` in your project directory.
2. **Execute the Script**
    - Run it with `python test_crawl4ai.py`.
3. **Observe the Output**
    - The script prints logs detailing each step.
    - Assertions ensure consistency across the different crawling methods.
    - On success, it confirms that all markdown lengths match.

## Conclusion

With the new prefix-based input handling in Crawl4AI, you can effortlessly crawl web URLs, local HTML files, and raw HTML strings using a unified `url` parameter. This enhancement simplifies the API usage and provides greater flexibility for diverse crawling scenarios.
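
As a closing sketch, the three prefixes make it easy to funnel any kind of input through the same `arun()` call. The helper below is hypothetical (not part of the library); it simply maps a web URL, an existing local path, or an HTML string onto the appropriate prefixed form:

```python
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler

def to_crawl_url(source: str) -> str:
    """Hypothetical helper: normalize input into a prefixed url for arun()."""
    if source.startswith(("http://", "https://", "file://", "raw:")):
        return source  # already in a supported form
    if Path(source).exists():
        return f"file://{Path(source).resolve()}"  # treat as a local HTML file
    return f"raw:{source}"  # fall back to raw HTML content

async def crawl_any(source: str):
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url=to_crawl_url(source), bypass_cache=True)
        print(result.markdown if result.success else result.error_message)

asyncio.run(crawl_any("<html><body><h1>Hello, World!</h1></body></html>"))
```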