Summarization Example with AsyncWebCrawler

This example demonstrates how to use Crawl4AI's AsyncWebCrawler to extract a summary from a web page asynchronously. The goal is to obtain the title, a detailed summary, a brief summary, and a list of keywords from the given page.

Step-by-Step Guide

  1. Import Necessary Modules

    First, import the necessary modules and classes:

    import os
    import json
    import asyncio
    from crawl4ai import AsyncWebCrawler
    from crawl4ai.extraction_strategy import LLMExtractionStrategy
    from crawl4ai.chunking_strategy import RegexChunking
    from pydantic import BaseModel, Field
    
  2. Define the URL to be Crawled

    Set the URL of the web page you want to summarize:

    url = 'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot'
    
  3. Define the Data Model

    Use Pydantic to define the structure of the extracted data:

    class PageSummary(BaseModel):
        title: str = Field(..., description="Title of the page.")
        summary: str = Field(..., description="Summary of the page.")
        brief_summary: str = Field(..., description="Brief summary of the page.")
        keywords: list = Field(..., description="Keywords assigned to the page.")
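
    Printing the generated JSON schema is a quick way to see exactly what the extraction strategy will receive (an optional check, not required for the example):

    # Optional: inspect the JSON schema passed to the extraction strategy
    print(json.dumps(PageSummary.model_json_schema(), indent=2))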
    
  4. Create the Extraction Strategy

    Set up the LLMExtractionStrategy with the necessary parameters:

    extraction_strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o", 
        api_token=os.getenv('OPENAI_API_KEY'), 
        schema=PageSummary.model_json_schema(),
        extraction_type="schema",
        apply_chunking=False,
        instruction=(
            "From the crawled content, extract the following details: "
            "1. Title of the page "
            "2. Summary of the page, which is a detailed summary "
            "3. Brief summary of the page, which is a paragraph text "
            "4. Keywords assigned to the page, which is a list of keywords. "
            'The extracted JSON format should look like this: '
            '{ "title": "Page Title", "summary": "Detailed summary of the page.", '
            '"brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }'
        )
    )
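
    Provider strings follow the LiteLLM "provider/model" convention, so other backends can be swapped in. The snippet below is only a sketch assuming a locally running Ollama server; the model name "ollama/llama3" is illustrative, and local providers generally ignore the API token:

    # Hypothetical local-model variant (assumes an Ollama server is running)
    local_strategy = LLMExtractionStrategy(
        provider="ollama/llama3",          # illustrative model name
        api_token="not-needed-for-local",  # typically ignored by local providers
        schema=PageSummary.model_json_schema(),
        extraction_type="schema",
        apply_chunking=False,
        instruction="Extract the title, detailed summary, brief summary, and keywords as JSON."
    )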
    
  5. Define the Async Crawl Function

    Create an asynchronous function to run the crawler:

    async def crawl_and_summarize(url):
        async with AsyncWebCrawler(verbose=True) as crawler:
            result = await crawler.arun(
                url=url,
                word_count_threshold=1,  # keep even very short text blocks
                extraction_strategy=extraction_strategy,
                chunking_strategy=RegexChunking(),
                bypass_cache=True,  # force a fresh crawl instead of using the cache
            )
            return result
    
  6. Run the Crawler and Process Results

    Use asyncio to run the crawler and process the results:

    async def main():
        result = await crawl_and_summarize(url)
    
        if result.success:
            page_summary = json.loads(result.extracted_content)
            print("Extracted Page Summary:")
            print(json.dumps(page_summary, indent=2))
    
            # Save the extracted data (ensure the output directory exists first)
            os.makedirs(".data", exist_ok=True)
            with open(".data/page_summary.json", "w", encoding="utf-8") as f:
                json.dump(page_summary, f, indent=2)
            print("Page summary saved to .data/page_summary.json")
        else:
            print(f"Failed to crawl and summarize the page. Error: {result.error_message}")
    
    # Run the async main function
    asyncio.run(main())
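
    As an extra safeguard inside the if result.success: branch, you can round-trip the parsed JSON through the Pydantic model. This is a sketch: depending on the Crawl4AI version, the extracted content may be a single object or a list of extracted blocks, so both cases are handled:

    # Optional: re-validate the parsed JSON against the PageSummary model.
    candidate = page_summary[0] if isinstance(page_summary, list) else page_summary
    validated = PageSummary.model_validate(candidate)
    print(f"Validated title: {validated.title}")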
    

Explanation

  • Importing Modules: We import the necessary modules, including AsyncWebCrawler and LLMExtractionStrategy from Crawl4AI.
  • URL Definition: We set the URL of the web page to crawl and summarize.
  • Data Model Definition: We define the structure of the data to extract using Pydantic's BaseModel.
  • Extraction Strategy Setup: We create an instance of LLMExtractionStrategy with the schema and detailed instructions for the extraction process.
  • Async Crawl Function: We define an asynchronous function crawl_and_summarize that uses AsyncWebCrawler to perform the crawling and extraction.
  • Main Execution: In the main function, we run the crawler, process the results, and save the extracted data.

Advanced Usage: Crawling Multiple URLs

To demonstrate the power of AsyncWebCrawler, here's how you can summarize multiple pages concurrently:

async def crawl_multiple_urls(urls):
    async with AsyncWebCrawler(verbose=True) as crawler:
        tasks = [crawler.arun(
            url=url,
            word_count_threshold=1,
            extraction_strategy=extraction_strategy,
            chunking_strategy=RegexChunking(),
            bypass_cache=True
        ) for url in urls]
        results = await asyncio.gather(*tasks)
    return results

async def main():
    urls = [
        'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot',
        'https://marketplace.visualstudio.com/items?itemName=GitHub.copilot',
        'https://marketplace.visualstudio.com/items?itemName=ms-python.python'
    ]
    results = await crawl_multiple_urls(urls)
    
    for i, result in enumerate(results):
        if result.success:
            page_summary = json.loads(result.extracted_content)
            print(f"\nSummary for URL {i+1}:")
            print(json.dumps(page_summary, indent=2))
        else:
            print(f"\nFailed to summarize URL {i+1}. Error: {result.error_message}")

asyncio.run(main())

This advanced example uses asyncio.gather to run the crawls concurrently, which can substantially reduce total processing time compared to crawling the same pages one after another.
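
When crawling many URLs at once, it is often wise to cap concurrency so you do not overwhelm the target site or exhaust local resources. Crawl4AI does not require this, but a plain asyncio.Semaphore works well; the sketch below is illustrative, and max_concurrent is a parameter of this helper, not a Crawl4AI option:

async def crawl_with_limit(urls, max_concurrent=3):
    # Cap the number of in-flight crawls with a semaphore
    semaphore = asyncio.Semaphore(max_concurrent)
    async with AsyncWebCrawler(verbose=True) as crawler:
        async def bounded_crawl(url):
            async with semaphore:
                return await crawler.arun(
                    url=url,
                    word_count_threshold=1,
                    extraction_strategy=extraction_strategy,
                    chunking_strategy=RegexChunking(),
                    bypass_cache=True,
                )
        return await asyncio.gather(*(bounded_crawl(u) for u in urls))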

By leveraging the asynchronous capabilities of Crawl4AI, you can perform advanced web crawling and data extraction tasks with improved efficiency and scalability.