Summarization Example with AsyncWebCrawler
This example demonstrates how to use Crawl4AI's AsyncWebCrawler to extract a summary from a web page asynchronously. The goal is to obtain the title, a detailed summary, a brief summary, and a list of keywords from the given page.
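For instance, for the extension page crawled below, the final result might look roughly like this (the values are purely illustrative, not actual output):

# Purely illustrative output; real values depend on the page and the model.
example_output = {
    "title": "Groqopilot",
    "summary": "A detailed, multi-sentence summary of the page content...",
    "brief_summary": "A one-paragraph overview of the page.",
    "keywords": ["vscode", "extension", "ai-assistant"],
}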
Step-by-Step Guide
- Import Necessary Modules

First, import the necessary modules and classes:

import os
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.chunking_strategy import RegexChunking
from pydantic import BaseModel, Field
- Define the URL to be Crawled

Set the URL of the web page you want to summarize:

url = 'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot'
- Define the Data Model

Use Pydantic to define the structure of the extracted data:

class PageSummary(BaseModel):
    title: str = Field(..., description="Title of the page.")
    summary: str = Field(..., description="Summary of the page.")
    brief_summary: str = Field(..., description="Brief summary of the page.")
    keywords: list = Field(..., description="Keywords assigned to the page.")
- Create the Extraction Strategy

Set up the LLMExtractionStrategy with the necessary parameters:

extraction_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o",
    api_token=os.getenv('OPENAI_API_KEY'),
    schema=PageSummary.model_json_schema(),
    extraction_type="schema",
    apply_chunking=False,
    instruction=(
        "From the crawled content, extract the following details: "
        "1. Title of the page "
        "2. Summary of the page, which is a detailed summary "
        "3. Brief summary of the page, which is a paragraph text "
        "4. Keywords assigned to the page, which is a list of keywords. "
        'The extracted JSON format should look like this: '
        '{ "title": "Page Title", "summary": "Detailed summary of the page.", '
        '"brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }'
    )
)
- Define the Async Crawl Function

Create an asynchronous function to run the crawler:

async def crawl_and_summarize(url):
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=url,
            word_count_threshold=1,
            extraction_strategy=extraction_strategy,
            chunking_strategy=RegexChunking(),
            bypass_cache=True,
        )
        return result
- Run the Crawler and Process Results

Use asyncio to run the crawler and process the results:

async def main():
    result = await crawl_and_summarize(url)
    if result.success:
        page_summary = json.loads(result.extracted_content)
        print("Extracted Page Summary:")
        print(json.dumps(page_summary, indent=2))

        # Save the extracted data (create the output directory if it does not exist)
        os.makedirs(".data", exist_ok=True)
        with open(".data/page_summary.json", "w", encoding="utf-8") as f:
            json.dump(page_summary, f, indent=2)
        print("Page summary saved to .data/page_summary.json")
    else:
        print(f"Failed to crawl and summarize the page. Error: {result.error_message}")

# Run the async main function
asyncio.run(main())
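Since result.extracted_content is a JSON string, you can also validate it against the PageSummary model before saving. Below is a minimal sketch, assuming the code above has already run; depending on the extraction strategy, the content may arrive as a single object or as a list of extracted blocks, so both cases are handled:

from typing import Optional
from pydantic import ValidationError

def parse_page_summary(extracted_content: str) -> Optional[PageSummary]:
    # Validate the raw extracted JSON against the PageSummary model (Pydantic v2).
    data = json.loads(extracted_content)
    # Some strategies return a list of extracted blocks; use the first one if so.
    if isinstance(data, list):
        data = data[0] if data else {}
    try:
        return PageSummary.model_validate(data)
    except ValidationError as exc:
        print(f"Extracted content did not match PageSummary: {exc}")
        return None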
Explanation
- Importing Modules: We import the necessary modules, including AsyncWebCrawler and LLMExtractionStrategy from Crawl4AI.
- URL Definition: We set the URL of the web page to crawl and summarize.
- Data Model Definition: We define the structure of the data to extract using Pydantic's BaseModel.
- Extraction Strategy Setup: We create an instance of LLMExtractionStrategy with the schema and detailed instructions for the extraction process.
- Async Crawl Function: We define an asynchronous function, crawl_and_summarize, that uses AsyncWebCrawler to perform the crawling and extraction.
- Main Execution: In the main function, we run the crawler, process the results, and save the extracted data.
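Because the provider and token are just parameters, the same pipeline can target a different LLM backend. Here is a minimal sketch, assuming your installed Crawl4AI version accepts LiteLLM-style provider strings; the specific model name and environment variable below are illustrative assumptions, not verified values:

# Illustrative only: the provider string and env-var name are assumptions;
# check which backends your Crawl4AI/LiteLLM versions actually support.
alt_extraction_strategy = LLMExtractionStrategy(
    provider="groq/llama3-70b-8192",        # hypothetical alternative backend
    api_token=os.getenv("GROQ_API_KEY"),    # token for that backend, if required
    schema=PageSummary.model_json_schema(),
    extraction_type="schema",
    apply_chunking=False,
    instruction="Summarize the page into the PageSummary schema.",
)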
Advanced Usage: Crawling Multiple URLs
To demonstrate the power of AsyncWebCrawler, here's how you can summarize multiple pages concurrently:
async def crawl_multiple_urls(urls):
    async with AsyncWebCrawler(verbose=True) as crawler:
        tasks = [crawler.arun(
            url=url,
            word_count_threshold=1,
            extraction_strategy=extraction_strategy,
            chunking_strategy=RegexChunking(),
            bypass_cache=True
        ) for url in urls]
        results = await asyncio.gather(*tasks)
        return results

async def main():
    urls = [
        'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot',
        'https://marketplace.visualstudio.com/items?itemName=GitHub.copilot',
        'https://marketplace.visualstudio.com/items?itemName=ms-python.python'
    ]
    results = await crawl_multiple_urls(urls)
    for i, result in enumerate(results):
        if result.success:
            page_summary = json.loads(result.extracted_content)
            print(f"\nSummary for URL {i+1}:")
            print(json.dumps(page_summary, indent=2))
        else:
            print(f"\nFailed to summarize URL {i+1}. Error: {result.error_message}")

asyncio.run(main())
This advanced example shows how to use AsyncWebCrawler to efficiently summarize multiple web pages concurrently, significantly reducing the total processing time compared to sequential crawling.
By leveraging the asynchronous capabilities of Crawl4AI, you can perform advanced web crawling and data extraction tasks with improved efficiency and scalability.
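If you scale this up to many URLs, you may also want to cap how many crawls run at once. Below is a minimal sketch reusing the extraction_strategy defined earlier; the limit of 5 is an arbitrary illustration, not a recommended value:

async def crawl_with_limit(urls, max_concurrent=5):
    # Cap concurrency with a semaphore so at most max_concurrent crawls run at a time.
    semaphore = asyncio.Semaphore(max_concurrent)

    async with AsyncWebCrawler(verbose=True) as crawler:
        async def crawl_one(url):
            async with semaphore:
                return await crawler.arun(
                    url=url,
                    word_count_threshold=1,
                    extraction_strategy=extraction_strategy,
                    chunking_strategy=RegexChunking(),
                    bypass_cache=True,
                )

        return await asyncio.gather(*(crawl_one(u) for u in urls))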