## Summarization Example
This example demonstrates how to use `Crawl4AI` to extract a summary from a web page. The goal is to obtain the title, a detailed summary, a brief summary, and a list of keywords from the given page.
### Step-by-Step Guide
1. **Import Necessary Modules**
    First, import the necessary modules and classes.

    ```python
    import os
    import json

    from crawl4ai.web_crawler import WebCrawler
    from crawl4ai.extraction_strategy import LLMExtractionStrategy
    from pydantic import BaseModel, Field
    ```
2. **Define the URL to be Crawled**
    Set the URL of the web page you want to summarize.

    ```python
    url = 'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot'
    ```
3. **Initialize the WebCrawler**
    Create an instance of the `WebCrawler` and call the `warmup` method to prepare the crawler.

    ```python
    crawler = WebCrawler()
    crawler.warmup()
    ```
4. **Define the Data Model**
    Use Pydantic to define the structure of the extracted data.

    ```python
    class PageSummary(BaseModel):
        title: str = Field(..., description="Title of the page.")
        summary: str = Field(..., description="Detailed summary of the page.")
        brief_summary: str = Field(..., description="Brief summary of the page.")
        keywords: list = Field(..., description="Keywords assigned to the page.")
    ```
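    This is the same schema that gets handed to the extraction strategy in step 5. To inspect what the LLM will be asked to fill in, you can render the model as JSON Schema (Pydantic v2 shown; on Pydantic v1 use `PageSummary.schema()` instead):

    ```python
    # Pretty-print the JSON Schema generated from the Pydantic model.
    print(json.dumps(PageSummary.model_json_schema(), indent=2))
    ```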
5. **Run the Crawler**
    Set up and run the crawler with the `LLMExtractionStrategy`. Provide the necessary parameters, including the schema for the extracted data and the instruction for the LLM.

    ```python
    result = crawler.run(
        url=url,
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            provider="openai/gpt-4o",
            api_token=os.getenv('OPENAI_API_KEY'),
            schema=PageSummary.model_json_schema(),
            extraction_type="schema",
            apply_chunking=False,
            instruction=(
                "From the crawled content, extract the following details: "
                "1. Title of the page "
                "2. Summary of the page, which is a detailed summary "
                "3. Brief summary of the page, which is a paragraph text "
                "4. Keywords assigned to the page, which is a list of keywords. "
                'The extracted JSON format should look like this: '
                '{ "title": "Page Title", "summary": "Detailed summary of the page.", '
                '"brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }'
            )
        ),
        bypass_cache=True,
    )
    ```
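    Before parsing the output, it is worth confirming the crawl actually succeeded. A minimal guard, assuming the `success` and `error_message` fields that Crawl4AI's crawl result exposes:

    ```python
    if not result.success:
        # Surface the crawler's error instead of failing later on empty content.
        raise RuntimeError(f"Crawl failed: {result.error_message}")
    ```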
6. **Process the Extracted Data**
    Load the extracted content into a JSON object and print it.

    ```python
    page_summary = json.loads(result.extracted_content)
    print(page_summary)
    ```
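    Depending on the provider and chunking settings, `extracted_content` may decode to a single object or to a list wrapping one object. A defensive sketch that normalizes the shape and validates it against the model (the list-handling branch is an assumption, not guaranteed behavior):

    ```python
    # Unwrap a single-element list if the strategy returned one (assumption).
    data = page_summary[0] if isinstance(page_summary, list) else page_summary
    # Validate against the schema; raises pydantic.ValidationError on missing fields.
    summary = PageSummary.model_validate(data)
    print(summary.title, summary.keywords)
    ```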
7. **Save the Extracted Data**
    Save the extracted data to a file for further use.

    ```python
    os.makedirs(".data", exist_ok=True)  # make sure the output directory exists
    with open(".data/page_summary.json", "w", encoding="utf-8") as f:
        f.write(result.extracted_content)
    ```
### Explanation
- **Importing Modules**: Import the necessary modules, including `WebCrawler` and `LLMExtractionStrategy` from `Crawl4AI`.
- **URL Definition**: Set the URL of the web page you want to crawl and summarize.
- **WebCrawler Initialization**: Create an instance of `WebCrawler` and call the `warmup` method to prepare the crawler.
- **Data Model Definition**: Define the structure of the data you want to extract using Pydantic's `BaseModel`.
- **Crawler Execution**: Run the crawler with the `LLMExtractionStrategy`, providing the schema and detailed instructions for the extraction process.
- **Data Processing**: Load the extracted content into a JSON object and print it to verify the results.
- **Data Saving**: Save the extracted data to a file for further use.
This example demonstrates how to harness the power of `Crawl4AI` to perform advanced web crawling and data extraction tasks with minimal code.
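
For convenience, here is the complete example as a single script, combining the steps above with the success check and output-directory guard discussed along the way:

```python
import os
import json

from crawl4ai.web_crawler import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field


class PageSummary(BaseModel):
    title: str = Field(..., description="Title of the page.")
    summary: str = Field(..., description="Detailed summary of the page.")
    brief_summary: str = Field(..., description="Brief summary of the page.")
    keywords: list = Field(..., description="Keywords assigned to the page.")


url = 'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot'

crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=PageSummary.model_json_schema(),
        extraction_type="schema",
        apply_chunking=False,
        instruction=(
            "From the crawled content, extract the following details: "
            "1. Title of the page "
            "2. Summary of the page, which is a detailed summary "
            "3. Brief summary of the page, which is a paragraph text "
            "4. Keywords assigned to the page, which is a list of keywords. "
            'The extracted JSON format should look like this: '
            '{ "title": "Page Title", "summary": "Detailed summary of the page.", '
            '"brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }'
        )
    ),
    bypass_cache=True,
)

# Fail fast if the crawl did not succeed (assumes success/error_message fields).
if not result.success:
    raise RuntimeError(f"Crawl failed: {result.error_message}")

page_summary = json.loads(result.extracted_content)
print(page_summary)

os.makedirs(".data", exist_ok=True)  # make sure the output directory exists
with open(".data/page_summary.json", "w", encoding="utf-8") as f:
    f.write(result.extracted_content)
```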