crawl4ai/docs/md/examples/llm_extraction.md

# LLM Extraction

Crawl4AI allows you to use Language Models (LLMs) to extract structured data or relevant content from web pages. Below are two examples demonstrating how to use LLMExtractionStrategy for different purposes.

## Example 1: Extract Structured Data

In this example, we use the `LLMExtractionStrategy` to extract structured data (model names and their fees) from the OpenAI pricing page.

```python
import os
import time
from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
from crawl4ai.crawler_strategy import *

url = r'https://openai.com/api/pricing/'

crawler = WebCrawler()
crawler.warmup()

from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy= LLMExtractionStrategy(
        provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'), 
        schema=OpenAIModelFee.model_json_schema(),
        extraction_type="schema",
        instruction="From the crawled content, extract all mentioned model names along with their "\
            "fees for input and output tokens. Make sure not to miss anything in the entire content. "\
            'One extracted model JSON format should look like this: '\
            '{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'
    ),
    bypass_cache=True,
)

model_fees = json.loads(result.extracted_content)

print(len(model_fees))

with open(".data/data.json", "w", encoding="utf-8") as f:
    f.write(result.extracted_content)
```

## Example 2: Extract Relevant Content

In this example, we instruct the LLM to extract only content related to technology from the NBC News business page.

```python
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
        url="https://www.nbcnews.com/business",
        extraction_strategy=LLMExtractionStrategy(
            provider="openai/gpt-4o",
            api_token=os.getenv('OPENAI_API_KEY'),
            instruction="Extract only content related to technology"
        ),
    bypass_cache=True,
    )

model_fees = json.loads(result.extracted_content)

print(len(model_fees))

with open(".data/data.json", "w", encoding="utf-8") as f:
    f.write(result.extracted_content)
```

## Customizing LLM Provider

Under the hood, Crawl4AI uses the `litellm` library, which allows you to use any LLM provider you want. Just pass the correct model name and API token.

```python
extraction_strategy=LLMExtractionStrategy(
    provider="your_llm_provider/model_name",
    api_token="your_api_token",
    instruction="Your extraction instruction"
)
```

This flexibility allows you to integrate with various LLM providers and tailor the extraction process to your specific needs.
ADD MKDocs 2024-06-21 17:56:54 +08:00			`# LLM Extraction`

			`Crawl4AI allows you to use Language Models (LLMs) to extract structured data or relevant content from web pages. Below are two examples demonstrating how to use LLMExtractionStrategy for different purposes.`

			`## Example 1: Extract Structured Data`

			In this example, we use the `LLMExtractionStrategy` to extract structured data (model names and their fees) from the OpenAI pricing page.

			```python
			`import os`
			`import time`
			`from crawl4ai.web_crawler import WebCrawler`
			`from crawl4ai.chunking_strategy import *`
			`from crawl4ai.extraction_strategy import *`
			`from crawl4ai.crawler_strategy import *`

			`url = r'https://openai.com/api/pricing/'`

			`crawler = WebCrawler()`
			`crawler.warmup()`

			`from pydantic import BaseModel, Field`

			`class OpenAIModelFee(BaseModel):`
			`model_name: str = Field(..., description="Name of the OpenAI model.")`
			`input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")`
			`output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")`

			`result = crawler.run(`
			`url=url,`
			`word_count_threshold=1,`
			`extraction_strategy= LLMExtractionStrategy(`
			`provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),`
			`schema=OpenAIModelFee.model_json_schema(),`
			`extraction_type="schema",`
			`instruction="From the crawled content, extract all mentioned model names along with their "\`
			`"fees for input and output tokens. Make sure not to miss anything in the entire content. "\`
			`'One extracted model JSON format should look like this: '\`
			`'{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'`
			`),`
			`bypass_cache=True,`
			`)`

			`model_fees = json.loads(result.extracted_content)`

			`print(len(model_fees))`

## [v0.2.74] - 2024-07-08 A slew of exciting updates to improve the crawler's stability and robustness! 🎉 - 💻 UTF encoding fix: Resolved the Windows \"charmap\" error by adding UTF encoding. - 🛡️ Error handling: Implemented MaxRetryError exception handling in LocalSeleniumCrawlerStrategy. - 🧹 Input sanitization: Improved input sanitization and handled encoding issues in LLMExtractionStrategy. - 🚮 Database cleanup: Removed existing database file and initialized a new one. 2024-07-08 16:33:25 +08:00			`with open(".data/data.json", "w", encoding="utf-8") as f:`
ADD MKDocs 2024-06-21 17:56:54 +08:00			`f.write(result.extracted_content)`
			```

			`## Example 2: Extract Relevant Content`

			`In this example, we instruct the LLM to extract only content related to technology from the NBC News business page.`

			```python
			`crawler = WebCrawler()`
			`crawler.warmup()`

			`result = crawler.run(`
			`url="https://www.nbcnews.com/business",`
			`extraction_strategy=LLMExtractionStrategy(`
			`provider="openai/gpt-4o",`
			`api_token=os.getenv('OPENAI_API_KEY'),`
			`instruction="Extract only content related to technology"`
			`),`
			`bypass_cache=True,`
			`)`

			`model_fees = json.loads(result.extracted_content)`

			`print(len(model_fees))`

## [v0.2.74] - 2024-07-08 A slew of exciting updates to improve the crawler's stability and robustness! 🎉 - 💻 UTF encoding fix: Resolved the Windows \"charmap\" error by adding UTF encoding. - 🛡️ Error handling: Implemented MaxRetryError exception handling in LocalSeleniumCrawlerStrategy. - 🧹 Input sanitization: Improved input sanitization and handled encoding issues in LLMExtractionStrategy. - 🚮 Database cleanup: Removed existing database file and initialized a new one. 2024-07-08 16:33:25 +08:00			`with open(".data/data.json", "w", encoding="utf-8") as f:`
ADD MKDocs 2024-06-21 17:56:54 +08:00			`f.write(result.extracted_content)`
			```

			`## Customizing LLM Provider`

			Under the hood, Crawl4AI uses the `litellm` library, which allows you to use any LLM provider you want. Just pass the correct model name and API token.

			```python
			`extraction_strategy=LLMExtractionStrategy(`
			`provider="your_llm_provider/model_name",`
			`api_token="your_api_token",`
			`instruction="Your extraction instruction"`
			`)`
			```

			`This flexibility allows you to integrate with various LLM providers and tailor the extraction process to your specific needs.`