crawl4ai/docs/md _sync/examples/summarization.md

## Summarization Example

This example demonstrates how to use `Crawl4AI` to extract a summary from a web page. The goal is to obtain the title, a detailed summary, a brief summary, and a list of keywords from the given page.

### Step-by-Step Guide

1. **Import Necessary Modules**

    First, import the necessary modules and classes.

    ```python
    import os
    import time
    import json
    from crawl4ai.web_crawler import WebCrawler
    from crawl4ai.chunking_strategy import *
    from crawl4ai.extraction_strategy import *
    from crawl4ai.crawler_strategy import *
    from pydantic import BaseModel, Field
    ```

2. **Define the URL to be Crawled**

    Set the URL of the web page you want to summarize.

    ```python
    url = r'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot'
    ```

3. **Initialize the WebCrawler**

    Create an instance of the `WebCrawler` and call the `warmup` method.

    ```python
    crawler = WebCrawler()
    crawler.warmup()
    ```

4. **Define the Data Model**

    Use Pydantic to define the structure of the extracted data.

    ```python
    class PageSummary(BaseModel):
        title: str = Field(..., description="Title of the page.")
        summary: str = Field(..., description="Summary of the page.")
        brief_summary: str = Field(..., description="Brief summary of the page.")
        keywords: list = Field(..., description="Keywords assigned to the page.")
    ```

5. **Run the Crawler**

    Set up and run the crawler with the `LLMExtractionStrategy`. Provide the necessary parameters, including the schema for the extracted data and the instruction for the LLM.

    ```python
    result = crawler.run(
        url=url,
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            provider="openai/gpt-4o",
            api_token=os.getenv('OPENAI_API_KEY'),
            schema=PageSummary.model_json_schema(),
            extraction_type="schema",
            apply_chunking=False,
            instruction=(
                "From the crawled content, extract the following details: "
                "1. Title of the page "
                "2. Summary of the page, which is a detailed summary "
                "3. Brief summary of the page, which is a paragraph text "
                "4. Keywords assigned to the page, which is a list of keywords. "
                'The extracted JSON format should look like this: '
                '{ "title": "Page Title", "summary": "Detailed summary of the page.", '
                '"brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }'
            )
        ),
        bypass_cache=True,
    )
    ```

6. **Process the Extracted Data**

    Load the extracted content into a JSON object and print it.

    ```python
    page_summary = json.loads(result.extracted_content)
    print(page_summary)
    ```

7. **Save the Extracted Data**

    Save the extracted data to a file for further use.

    ```python
    with open(".data/page_summary.json", "w", encoding="utf-8") as f:
        f.write(result.extracted_content)
    ```

### Explanation

- **Importing Modules**: Import the necessary modules, including `WebCrawler` and `LLMExtractionStrategy` from `Crawl4AI`.
- **URL Definition**: Set the URL of the web page you want to crawl and summarize.
- **WebCrawler Initialization**: Create an instance of `WebCrawler` and call the `warmup` method to prepare the crawler.
- **Data Model Definition**: Define the structure of the data you want to extract using Pydantic's `BaseModel`.
- **Crawler Execution**: Run the crawler with the `LLMExtractionStrategy`, providing the schema and detailed instructions for the extraction process.
- **Data Processing**: Load the extracted content into a JSON object and print it to verify the results.
- **Data Saving**: Save the extracted data to a file for further use.

This example demonstrates how to harness the power of `Crawl4AI` to perform advanced web crawling and data extraction tasks with minimal code.