This example demonstrates how to use `Crawl4AI` to extract a summary from a web page. The goal is to obtain the title, a detailed summary, a brief summary, and a list of keywords from the given page.
### Step-by-Step Guide
1.**Import Necessary Modules**
First, import the necessary modules and classes.
```python
import os
import time
import json
from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
from crawl4ai.crawler_strategy import *
from pydantic import BaseModel, Field
```
2.**Define the URL to be Crawled**
Set the URL of the web page you want to summarize.
Create an instance of the `WebCrawler` and call the `warmup` method.
```python
crawler = WebCrawler()
crawler.warmup()
```
4.**Define the Data Model**
Use Pydantic to define the structure of the extracted data.
```python
class PageSummary(BaseModel):
title: str = Field(..., description="Title of the page.")
summary: str = Field(..., description="Summary of the page.")
brief_summary: str = Field(..., description="Brief summary of the page.")
keywords: list = Field(..., description="Keywords assigned to the page.")
```
5.**Run the Crawler**
Set up and run the crawler with the `LLMExtractionStrategy`. Provide the necessary parameters, including the schema for the extracted data and the instruction for the LLM.
```python
result = crawler.run(
url=url,
word_count_threshold=1,
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
schema=PageSummary.model_json_schema(),
extraction_type="schema",
apply_chunking=False,
instruction=(
"From the crawled content, extract the following details: "
"1. Title of the page "
"2. Summary of the page, which is a detailed summary "
"3. Brief summary of the page, which is a paragraph text "
"4. Keywords assigned to the page, which is a list of keywords. "
'The extracted JSON format should look like this: '
'{ "title": "Page Title", "summary": "Detailed summary of the page.", '
'"brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }'
)
),
bypass_cache=True,
)
```
6.**Process the Extracted Data**
Load the extracted content into a JSON object and print it.