# Extracting JSON (LLM)
In some cases, you need to extract **complex or unstructured** information from a webpage that a simple CSS/XPath schema cannot easily parse, or you want **AI**-driven insights, classification, or summarization. For these scenarios, Crawl4AI provides an **LLM-based extraction strategy** that:

1. Works with **any** large language model supported by [LiteLLM](https://github.com/BerriAI/litellm) (Ollama, OpenAI, Claude, and more).
2. Automatically splits content into chunks (if desired) to handle token limits, then combines results.
3. Lets you define a **schema** (like a Pydantic model) or a simpler “block” extraction approach.

**Important**: LLM-based extraction can be slower and costlier than schema-based approaches. If your page data is highly structured, consider using [`JsonCssExtractionStrategy`](./json-extraction-basic.md) or [`JsonXPathExtractionStrategy`](./json-extraction-basic.md) first. But if you need AI to interpret or reorganize content, read on!

---
## 1. Why Use an LLM?
- **Complex Reasoning**: If the site’s data is unstructured, scattered, or full of natural language context.
- **Semantic Extraction**: Summaries, knowledge graphs, or relational data that require comprehension.
- **Flexible**: You can pass instructions to the model to do more advanced transformations or classification.

---
## 2. Provider-Agnostic via LiteLLM
Crawl4AI uses a “provider string” (e.g., `"openai/gpt-4o"`, `"ollama/llama2"`, `"aws/titan"`) to identify your LLM. **Any** model that LiteLLM supports is fair game. You just provide:

- **`provider`**: The `<provider>/<model_name>` identifier (e.g., `"openai/gpt-4"`, `"ollama/llama2"`, `"huggingface/google-flan"`, etc.).
- **`api_token`**: If needed (for OpenAI, HuggingFace, etc.); local models or Ollama might not require it.
- **`api_base`** (optional): If your provider has a custom endpoint.

This means you **aren’t locked** into a single LLM vendor. Switch or experiment easily.
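For example, switching from a hosted model to a local Ollama instance is mostly a matter of changing the provider string and endpoint. Here is a rough sketch; the model names, placeholder token, and endpoint below are illustrative, not requirements:

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Hosted model: provider string plus an API key.
cloud_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",
    api_token="YOUR_OPENAI_KEY",
    instruction="Extract product names and prices as JSON.",
)

# Local model via Ollama: same strategy, different provider string.
local_strategy = LLMExtractionStrategy(
    provider="ollama/llama2",
    api_token="no-token",               # placeholder; local models typically ignore it
    api_base="http://localhost:11434",  # assumption: the usual Ollama default endpoint
    instruction="Extract product names and prices as JSON.",
)
```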
---
## 3. How LLM Extraction Works
### 3.1 Flow
1. **Chunking** (optional): The HTML or markdown is split into smaller segments if it’s very long (based on `chunk_token_threshold`, overlap, etc.).
2. **Prompt Construction**: For each chunk, the library forms a prompt that includes your **`instruction`** (and possibly schema or examples).
3. **LLM Inference**: Each chunk is sent to the model in parallel or sequentially (depending on your concurrency).
4. **Combining**: The results from each chunk are merged and parsed into JSON.
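In code terms, the loop looks roughly like this (a conceptual sketch, not the library’s internals; `llm_call` and `chunker` stand in for your model call and splitting logic):

```python
import json

def llm_extract(content: str, instruction: str, llm_call, chunker) -> list[dict]:
    """Conceptual sketch of the four steps above (not Crawl4AI's actual code)."""
    merged: list[dict] = []
    for chunk in chunker(content):                      # 1. chunking
        prompt = f"{instruction}\n\nCONTENT:\n{chunk}"  # 2. prompt construction
        response = llm_call(prompt)                     # 3. LLM inference
        merged.extend(json.loads(response))             # 4. combine parsed JSON
    return merged
```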
### 3.2 `extraction_type`
- **`"schema"`**: The model tries to return JSON conforming to your Pydantic-based schema.
- **`"block"`**: The model returns freeform text, or smaller JSON structures, which the library collects.

For structured data, `"schema"` is recommended. You provide `schema=YourPydanticModel.model_json_schema()`.
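For instance, a small Pydantic model turns into the schema dict like this (the `Product` model here is just an illustration):

```python
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: str

# Pydantic v2: model_json_schema() returns a dict describing the fields.
# Pass it straight to LLMExtractionStrategy(schema=..., extraction_type="schema").
schema = Product.model_json_schema()
print(schema["properties"])  # {'name': {...}, 'price': {...}}
```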
---
## 4. Key Parameters
Below is an overview of important LLM extraction parameters. All are typically set inside `LLMExtractionStrategy(...)`. You then put that strategy in your `CrawlerRunConfig(..., extraction_strategy=...)`.

1. **`provider`** (str): e.g., `"openai/gpt-4"`, `"ollama/llama2"`.
2. **`api_token`** (str): The API key or token for that model. May not be needed for local models.
3. **`schema`** (dict): A JSON schema describing the fields you want. Usually generated by `YourModel.model_json_schema()`.
4. **`extraction_type`** (str): `"schema"` or `"block"`.
5. **`instruction`** (str): Prompt text telling the LLM what you want extracted. E.g., “Extract these fields as a JSON array.”
6. **`chunk_token_threshold`** (int): Maximum tokens per chunk. If your content is huge, you can break it up for the LLM.
7. **`overlap_rate`** (float): Overlap ratio between adjacent chunks. E.g., `0.1` means 10% of each chunk is repeated to preserve context continuity.
8. **`apply_chunking`** (bool): Set `True` to chunk automatically. If you want a single pass, set `False`.
9. **`input_format`** (str): Determines **which** crawler result is passed to the LLM. Options include:
    - `"markdown"`: The raw markdown (default).
    - `"fit_markdown"`: The filtered “fit” markdown if you used a content filter.
    - `"html"`: The cleaned or raw HTML.
10. **`extra_args`** (dict): Additional LLM parameters like `temperature`, `max_tokens`, `top_p`, etc.
11. **`show_usage()`**: A method you can call to print out usage info (token usage per chunk, total cost if known).

**Example**:
```python
extraction_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4",
    api_token="YOUR_OPENAI_KEY",
    schema=MyModel.model_json_schema(),
    extraction_type="schema",
    instruction="Extract a list of items from the text with 'name' and 'price' fields.",
    chunk_token_threshold=1200,
    overlap_rate=0.1,
    apply_chunking=True,
    input_format="html",
    extra_args={"temperature": 0.1, "max_tokens": 1000},
    verbose=True
)
```
---

## 5. Putting It in `CrawlerRunConfig`
**Important**: In Crawl4AI, all strategy definitions should go inside the `CrawlerRunConfig`, not directly as a param in `arun()`. Here’s a full example:
```python
import os
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Product(BaseModel):
    name: str
    price: str

async def main():
    # 1. Define the LLM extraction strategy
    llm_strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",           # e.g. "ollama/llama2"
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=Product.model_json_schema(),      # JSON schema dict (Pydantic v2)
        extraction_type="schema",
        instruction="Extract all product objects with 'name' and 'price' from the content.",
        chunk_token_threshold=1000,
        overlap_rate=0.0,
        apply_chunking=True,
        input_format="markdown",                 # or "html", "fit_markdown"
        extra_args={"temperature": 0.0, "max_tokens": 800}
    )

    # 2. Build the crawler config
    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS
    )

    # 3. Create a browser config if needed
    browser_cfg = BrowserConfig(headless=True)

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        # 4. Let's say we want to crawl a single page
        result = await crawler.arun(
            url="https://example.com/products",
            config=crawl_config
        )

        if result.success:
            # 5. The extracted content is presumably JSON
            data = json.loads(result.extracted_content)
            print("Extracted items:", data)

            # 6. Show usage stats
            llm_strategy.show_usage()  # prints token usage
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
---

## 6. Chunking Details

### 6.1 `chunk_token_threshold`
If your page is large, you might exceed your LLM’s context window. **`chunk_token_threshold`** sets the approximate max tokens per chunk. The library calculates word→token ratio using `word_token_rate` (often ~0.75 by default). If chunking is enabled (`apply_chunking=True`), the text is split into segments.
### 6.2 `overlap_rate`
To keep context continuous across chunks, we can overlap them. E.g., `overlap_rate=0.1` means each subsequent chunk includes 10% of the previous chunk’s text. This is helpful if your needed info might straddle chunk boundaries.
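To build intuition for how these two settings interact, here is a rough word-based sketch of the splitting they imply (an illustration only, not the library’s internal implementation; the word-per-token assumption follows the description above):

```python
def split_into_chunks(text: str,
                      chunk_token_threshold: int = 1000,
                      overlap_rate: float = 0.1,
                      word_token_rate: float = 0.75) -> list[str]:
    """Rough illustration of chunking with overlap (not Crawl4AI's internals)."""
    words = text.split()
    # Assume ~0.75 words per token, so a 1000-token budget holds roughly 750 words.
    words_per_chunk = max(1, int(chunk_token_threshold * word_token_rate))
    overlap_words = int(words_per_chunk * overlap_rate)
    step = max(1, words_per_chunk - overlap_words)

    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + words_per_chunk]))
        if start + words_per_chunk >= len(words):
            break
    return chunks

# With overlap_rate=0.1, each chunk re-reads the last ~10% of the previous one,
# so a fact that straddles a boundary still appears intact in at least one chunk.
```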
### 6.3 Performance & Parallelism
By chunking, you can potentially process multiple chunks in parallel (depending on your concurrency settings and the LLM provider). This reduces total time if the site is huge or has many sections.

---

## 7. Input Format
By default, **LLMExtractionStrategy** uses `input_format="markdown"`, meaning the **crawler’s final markdown** is fed to the LLM. You can change to:
- **`html`**: The cleaned HTML or raw HTML (depending on your crawler config) goes into the LLM.
- **`fit_markdown`**: If you used, for instance, `PruningContentFilter`, the “fit” version of the markdown is used. This can drastically reduce tokens if you trust the filter.
- **`markdown`**: Standard markdown output from the crawler’s `markdown_generator`.

This setting is crucial: if the LLM instructions rely on HTML tags, pick `"html"`. If you prefer a text-based approach, pick `"markdown"`.
```python
LLMExtractionStrategy(
    # ...
    input_format="html",  # Instead of "markdown" or "fit_markdown"
)
```
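If you want the token-saving `fit_markdown` path mentioned above, the crawler also needs a content filter that produces the “fit” markdown. Below is one way to wire that up; the `DefaultMarkdownGenerator` and `PruningContentFilter` import paths and the `threshold` value are assumptions to check against your installed version:

```python
from crawl4ai import CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator  # path may differ by version
from crawl4ai.content_filter_strategy import PruningContentFilter

llm_strategy = LLMExtractionStrategy(
    # ... provider, api_token, schema, instruction as in the earlier examples ...
    input_format="fit_markdown",  # read the filtered markdown instead of the full page
)

crawl_config = CrawlerRunConfig(
    extraction_strategy=llm_strategy,
    # The content filter is what produces the "fit" markdown the LLM will see.
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.5)  # threshold is illustrative
    ),
    cache_mode=CacheMode.BYPASS,
)
```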
---

## 8. Token Usage & Show Usage
To keep track of tokens and cost, each chunk is processed with an LLM call. We record usage in:
- **`usages`** (list): token usage per chunk or call.
- **`total_usage`**: sum of all chunk calls.
- **`show_usage()`**: prints a usage report (if the provider returns usage data).
```python
|
|||
|
llm_strategy = LLMExtractionStrategy(...)
|
|||
|
# ...
|
|||
|
llm_strategy.show_usage()
|
|||
|
# e.g. “Total usage: 1241 tokens across 2 chunk calls”
|
|||
|
```
If your model provider doesn’t return usage info, these fields might be partial or empty.
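If you prefer raw numbers over the printed report, the same counters live on the strategy object; the exact shape of each entry depends on the provider and version, so treat the attribute access below as a sketch to verify:

```python
llm_strategy.show_usage()  # human-readable report

# Programmatic access (shape of entries varies by provider/version):
print("Chunk calls made:", len(llm_strategy.usages))
print("Aggregate usage:", llm_strategy.total_usage)
```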
---

## 9. Example: Building a Knowledge Graph
Below is a snippet combining **`LLMExtractionStrategy`** with a Pydantic schema for a knowledge graph. Notice how we pass an **`instruction`** telling the model what to parse.
```python
import os
import json
import asyncio
from typing import List
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Entity(BaseModel):
    name: str
    description: str

class Relationship(BaseModel):
    entity1: Entity
    entity2: Entity
    description: str
    relation_type: str

class KnowledgeGraph(BaseModel):
    entities: List[Entity]
    relationships: List[Relationship]

async def main():
    # LLM extraction strategy
    llm_strat = LLMExtractionStrategy(
        provider="openai/gpt-4",
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=KnowledgeGraph.model_json_schema(),  # JSON schema dict (Pydantic v2)
        extraction_type="schema",
        instruction="Extract entities and relationships from the content. Return valid JSON.",
        chunk_token_threshold=1400,
        apply_chunking=True,
        input_format="html",
        extra_args={"temperature": 0.1, "max_tokens": 1500}
    )

    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strat,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        # Example page
        url = "https://www.nbcnews.com/business"
        result = await crawler.arun(url=url, config=crawl_config)

        if result.success:
            with open("kb_result.json", "w", encoding="utf-8") as f:
                f.write(result.extracted_content)
            llm_strat.show_usage()
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
**Key Observations**:
- **`extraction_type="schema"`** ensures we get JSON fitting our `KnowledgeGraph`.
|
|||
|
- **`input_format="html"`** means we feed HTML to the model.
|
|||
|
- **`instruction`** guides the model to output a structured knowledge graph.
|
|||
|
|
|||
|
---
|
|||
|
|
|||
|
## 10. Best Practices & Caveats
1. **Cost & Latency**: LLM calls can be slow or expensive. Consider chunking or smaller coverage if you only need partial data.
2. **Model Token Limits**: If your page + instruction exceed the context window, chunking is essential.
3. **Instruction Engineering**: Well-crafted instructions can drastically improve output reliability.
4. **Schema Strictness**: `"schema"` extraction tries to parse the model output as JSON. If the model returns invalid JSON, partial extraction might happen, or you might get an error.
5. **Parallel vs. Serial**: The library can process multiple chunks in parallel, but you must watch out for rate limits on certain providers.
6. **Check Output**: Sometimes, an LLM might omit fields or produce extraneous text. You may want to post-validate with Pydantic or do additional cleanup, as in the sketch below.
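A minimal version of that post-validation step, reusing the `Product` model from the earlier example (this assumes the LLM returned a JSON array of product-like objects):

```python
import json
from pydantic import BaseModel, ValidationError

class Product(BaseModel):
    name: str
    price: str

def validate_products(extracted_content: str) -> list[Product]:
    """Parse the strategy's JSON output and keep only entries that fit the schema."""
    try:
        raw_items = json.loads(extracted_content)
    except json.JSONDecodeError as err:
        print("Model returned invalid JSON:", err)
        return []

    valid = []
    for item in raw_items:
        try:
            valid.append(Product.model_validate(item))
        except ValidationError as err:
            print("Skipping malformed item:", item, err)
    return valid

# Usage after a successful crawl:
# products = validate_products(result.extracted_content)
```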
---

## 11. Conclusion
**LLM-based extraction** in Crawl4AI is **provider-agnostic**, letting you choose from hundreds of models via LiteLLM. It’s perfect for **semantically complex** tasks or generating advanced structures like knowledge graphs. However, it’s **slower** and potentially costlier than schema-based approaches. Keep these tips in mind:
- Put your LLM strategy **in `CrawlerRunConfig`**.
- Use **`input_format`** to pick which form (markdown, HTML, fit_markdown) the LLM sees.
- Tweak **`chunk_token_threshold`**, **`overlap_rate`**, and **`apply_chunking`** to handle large content efficiently.
- Monitor token usage with `show_usage()`.

If your site’s data is consistent or repetitive, consider [`JsonCssExtractionStrategy`](./json-extraction-basic.md) first for speed and simplicity. But if you need an **AI-driven** approach, `LLMExtractionStrategy` offers a flexible, multi-provider solution for extracting structured JSON from any website.

**Next Steps**:
1. **Experiment with Different Providers**
   - Try switching the `provider` (e.g., `"ollama/llama2"`, `"openai/gpt-4o"`, etc.) to see differences in speed, accuracy, or cost.
   - Pass different `extra_args` like `temperature`, `top_p`, and `max_tokens` to fine-tune your results.

2. **Combine With Other Strategies**
   - Use [content filters](../../how-to/content-filters.md) like BM25 or Pruning prior to LLM extraction to remove noise and reduce token usage.
   - Apply a [CSS or XPath extraction strategy](./json-extraction-basic.md) first for obvious, structured data, then send only the tricky parts to the LLM.

3. **Performance Tuning**
   - If pages are large, tweak `chunk_token_threshold`, `overlap_rate`, or `apply_chunking` to optimize throughput.
   - Check the usage logs with `show_usage()` to keep an eye on token consumption and identify potential bottlenecks.

4. **Validate Outputs**
   - If using `extraction_type="schema"`, parse the LLM’s JSON with a Pydantic model for a final validation step.
   - Log or handle any parse errors gracefully, especially if the model occasionally returns malformed JSON.

5. **Explore Hooks & Automation**
   - Integrate LLM extraction with [hooks](./hooks-custom.md) for complex pre/post-processing.
   - Use a multi-step pipeline: crawl, filter, LLM-extract, then store or index results for further analysis.

6. **Scale and Deploy**
   - Combine your LLM extraction setup with [Docker or other deployment solutions](./docker-quickstart.md) to run at scale.
   - Monitor memory usage and concurrency if you call LLMs frequently.

**Last Updated**: 2024-XX-XX

---
That’s it for **Extracting JSON (LLM)**—now you can harness AI to parse, classify, or reorganize data on the web. Happy crawling!