Quick Start Guide 🚀
Welcome to the Crawl4AI Quickstart Guide! In this tutorial, we'll walk you through the basic usage of Crawl4AI with a friendly and humorous tone. We'll cover everything from basic usage to advanced features like chunking and extraction strategies, all with the power of asynchronous programming. Let's dive in! 🌟
Getting Started 🛠️
First, let's import the necessary modules and create an instance of `AsyncWebCrawler`. We'll use an async context manager, which handles the setup and teardown of the crawler for us.
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # We'll add our crawling code here
        pass

if __name__ == "__main__":
    asyncio.run(main())
```
Basic Usage
Simply provide a URL and let Crawl4AI do the magic!
```python
async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(f"Basic crawl result: {result.markdown[:500]}")  # Print first 500 characters

asyncio.run(main())
```
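The returned result object carries more than Markdown. The quick sketch below prints a few commonly available fields; attribute names can vary slightly between Crawl4AI releases, so treat them as assumptions to check against your installed version:

```python
async def inspect_result():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.success)            # Whether the crawl completed
        print(len(result.markdown))      # Size of the Markdown rendering
        print(len(result.cleaned_html))  # Size of the cleaned HTML
        print(result.links.keys())       # Usually 'internal' and 'external' link lists

asyncio.run(inspect_result())
```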
Taking Screenshots 📸
Capture screenshots of web pages easily:
```python
import base64

async def capture_and_save_screenshot(url: str, output_path: str):
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=url,
            screenshot=True,
            bypass_cache=True
        )

        if result.success and result.screenshot:
            screenshot_data = base64.b64decode(result.screenshot)
            with open(output_path, 'wb') as f:
                f.write(screenshot_data)
            print(f"Screenshot saved successfully to {output_path}")
        else:
            print("Failed to capture screenshot")
```
Browser Selection 🌐
Crawl4AI supports multiple browser engines. Here's how to use different browsers:
```python
# Use Firefox
async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless=True) as crawler:
    result = await crawler.arun(url="https://www.example.com", bypass_cache=True)

# Use WebKit
async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless=True) as crawler:
    result = await crawler.arun(url="https://www.example.com", bypass_cache=True)

# Use Chromium (default)
async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
    result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
```
User Simulation 🎭
Simulate real user behavior to avoid detection:
```python
async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
    result = await crawler.arun(
        url="YOUR-URL-HERE",
        bypass_cache=True,
        simulate_user=True,      # Causes random mouse movements and clicks
        override_navigator=True  # Makes the browser appear more like a real user
    )
```
Understanding Parameters 🧠
By default, Crawl4AI caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action.
```python
async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # First crawl (caches the result)
        result1 = await crawler.arun(url="https://www.nbcnews.com/business")
        print(f"First crawl result: {result1.markdown[:100]}...")

        # Force to crawl again
        result2 = await crawler.arun(url="https://www.nbcnews.com/business", bypass_cache=True)
        print(f"Second crawl result: {result2.markdown[:100]}...")

asyncio.run(main())
```
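To see the cache at work, you can time the two calls. This is a small sketch; the exact speedup depends on your network and the page:

```python
import time

async def compare_cache_timing():
    async with AsyncWebCrawler(verbose=True) as crawler:
        start = time.perf_counter()
        await crawler.arun(url="https://www.nbcnews.com/business")  # hits the network, fills the cache
        first = time.perf_counter() - start

        start = time.perf_counter()
        await crawler.arun(url="https://www.nbcnews.com/business")  # served from the cache
        second = time.perf_counter() - start

        print(f"First crawl: {first:.2f}s, cached crawl: {second:.2f}s")

asyncio.run(compare_cache_timing())
```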
Adding a Chunking Strategy 🧩
Let's add a chunking strategy: `RegexChunking`! This strategy splits the text based on a given regex pattern.
```python
from crawl4ai.chunking_strategy import RegexChunking

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            chunking_strategy=RegexChunking(patterns=["\n\n"])
        )
        print(f"RegexChunking result: {result.extracted_content[:200]}...")

asyncio.run(main())
```
Using LLMExtractionStrategy with Different Providers 🤖
Crawl4AI supports multiple LLM providers for extraction:
```python
import os

from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

# The calls below run inside an async function (e.g. your main())

# OpenAI
await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))

# Hugging Face
await extract_structured_data_using_llm(
    "huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct",
    os.getenv("HUGGINGFACE_API_KEY")
)

# Ollama
await extract_structured_data_using_llm("ollama/llama3.2")

# With custom headers
custom_headers = {
    "Authorization": "Bearer your-custom-token",
    "X-Custom-Header": "Some-Value"
}
await extract_structured_data_using_llm(extra_headers=custom_headers)
```
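These snippets call a helper named `extract_structured_data_using_llm` that isn't defined in this guide. The sketch below shows one way it could look, building on the imports and the `OpenAIModelFee` model above; the default provider, the pricing-page URL, and forwarding `extra_headers` through `**extra_args` are assumptions to adapt to your setup:

```python
async def extract_structured_data_using_llm(
    provider: str = "openai/gpt-4o",  # assumed default; pass any supported provider string
    api_token: str = None,
    extra_headers: dict = None,
):
    # Forward optional headers to the LLM call; omit the kwarg entirely when unused
    extra_args = {"extra_headers": extra_headers} if extra_headers else {}

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",  # example target page (assumption)
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider=provider,
                api_token=api_token,
                schema=OpenAIModelFee.model_json_schema(),
                extraction_type="schema",
                instruction="From the crawled content, extract all model names "
                            "along with their input and output token fees.",
                **extra_args,
            ),
            bypass_cache=True,
        )
        print(result.extracted_content)
```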
Knowledge Graph Generation 🕸️
Generate knowledge graphs from web content:
```python
import os
from typing import List

from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Entity(BaseModel):
    name: str
    description: str

class Relationship(BaseModel):
    entity1: Entity
    entity2: Entity
    description: str
    relation_type: str

class KnowledgeGraph(BaseModel):
    entities: List[Entity]
    relationships: List[Relationship]

extraction_strategy = LLMExtractionStrategy(
    provider='openai/gpt-4o-mini',
    api_token=os.getenv('OPENAI_API_KEY'),
    schema=KnowledgeGraph.model_json_schema(),
    extraction_type="schema",
    instruction="Extract entities and relationships from the given text."
)

# Inside an async function:
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://paulgraham.com/love.html",
        bypass_cache=True,
        extraction_strategy=extraction_strategy
    )
```
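The extracted graph comes back as a string in `result.extracted_content`. Assuming schema extraction returns JSON matching the `KnowledgeGraph` model (the usual behavior, but worth verifying on your version), you can load and inspect it like this:

```python
import json

graph_data = json.loads(result.extracted_content)
# Some providers return a single object, others a list of extracted blocks
blocks = graph_data if isinstance(graph_data, list) else [graph_data]
for block in blocks:
    print(f"Entities: {len(block.get('entities', []))}, "
          f"Relationships: {len(block.get('relationships', []))}")
```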
Advanced Session-Based Crawling with Dynamic Content 🔄
For modern web applications with dynamic content loading, here's how to handle pagination and content updates:
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def crawl_dynamic_content():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"

        js_next_page = """
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """

        wait_for = """() => {
            const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
            if (commits.length === 0) return false;
            const firstCommit = commits[0].textContent.trim();
            return firstCommit !== window.firstCommit;
        }"""

        schema = {
            "name": "Commit Extractor",
            "baseSelector": "li.Box-sc-g0xbh4-0",
            "fields": [
                {
                    "name": "title",
                    "selector": "h4.markdown-title",
                    "type": "text",
                    "transform": "strip",
                },
            ],
        }
        extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

        for page in range(3):  # Crawl 3 pages
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                extraction_strategy=extraction_strategy,
                js_code=js_next_page if page > 0 else None,
                wait_for=wait_for if page > 0 else None,
                js_only=page > 0,
                bypass_cache=True,
                headless=False,
            )

        await crawler.crawler_strategy.kill_session(session_id)
```
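To actually run the session-based crawl defined above:

```python
# Kick off the paginated, session-based crawl
asyncio.run(crawl_dynamic_content())
```

Each `result.extracted_content` inside the loop should hold the extracted commits as a JSON string, so collecting or printing them per page is a natural next step.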
Handling Overlays and Fitting Content 📏
Remove overlay elements and fit content appropriately:
```python
async with AsyncWebCrawler(headless=False) as crawler:
    result = await crawler.arun(
        url="your-url-here",
        bypass_cache=True,
        word_count_threshold=10,
        remove_overlay_elements=True,
        screenshot=True
    )
```
Performance Comparison 🏎️
Crawl4AI is built for speed. Here's a simple way to time it side by side with another solution, Firecrawl:
```python
import os
import time

from firecrawl import FirecrawlApp

# Firecrawl comparison
app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
start = time.time()
scrape_status = app.scrape_url(
    'https://www.nbcnews.com/business',
    params={'formats': ['markdown', 'html']}
)
end = time.time()

# Crawl4AI comparison (inside an async function)
async with AsyncWebCrawler() as crawler:
    start = time.time()
    result = await crawler.arun(
        url="https://www.nbcnews.com/business",
        word_count_threshold=0,
        bypass_cache=True,
        verbose=False,
    )
    end = time.time()
```
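The snippets above record `start` and `end` but never report them; printing the elapsed time right after each block makes the comparison visible (variable names match the code above):

```python
print(f"Elapsed: {end - start:.2f} seconds")
```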
Note: Performance comparisons should be conducted in environments with stable and fast internet connections for accurate results.
Congratulations! 🎉
You've made it through the updated Crawl4AI Quickstart Guide! Now you're equipped with even more powerful features to crawl the web asynchronously like a pro! 🕸️
Happy crawling! 🚀