crawl4ai/docs/md_v2/api/strategies.md
Aravind 2af958e12c
Feat/llm config (#724)
* feature: Add LlmConfig to easily configure and pass LLM configs to different strategies

* pulled in next branch and resolved conflicts

* feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions

* Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params

* updated tests, docs and readme
2025-02-21 15:41:37 +08:00

257 lines
6.7 KiB
Markdown
Raw Permalink Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Extraction & Chunking Strategies API
This documentation covers the API reference for extraction and chunking strategies in Crawl4AI.
## Extraction Strategies
All extraction strategies inherit from the base `ExtractionStrategy` class and implement two key methods:
- `extract(url: str, html: str) -> List[Dict[str, Any]]`
- `run(url: str, sections: List[str]) -> List[Dict[str, Any]]`
### LLMExtractionStrategy
Used for extracting structured data using Language Models.
```python
LLMExtractionStrategy(
# Required Parameters
provider: str = DEFAULT_PROVIDER, # LLM provider (e.g., "ollama/llama2")
api_token: Optional[str] = None, # API token
# Extraction Configuration
instruction: str = None, # Custom extraction instruction
schema: Dict = None, # Pydantic model schema for structured data
extraction_type: str = "block", # "block" or "schema"
# Chunking Parameters
chunk_token_threshold: int = 4000, # Maximum tokens per chunk
overlap_rate: float = 0.1, # Overlap between chunks
word_token_rate: float = 0.75, # Word to token conversion rate
apply_chunking: bool = True, # Enable/disable chunking
# API Configuration
base_url: str = None, # Base URL for API
extra_args: Dict = {}, # Additional provider arguments
verbose: bool = False # Enable verbose logging
)
```
### CosineStrategy
Used for content similarity-based extraction and clustering.
```python
CosineStrategy(
# Content Filtering
semantic_filter: str = None, # Topic/keyword filter
word_count_threshold: int = 10, # Minimum words per cluster
sim_threshold: float = 0.3, # Similarity threshold
# Clustering Parameters
max_dist: float = 0.2, # Maximum cluster distance
linkage_method: str = 'ward', # Clustering method
top_k: int = 3, # Top clusters to return
# Model Configuration
model_name: str = 'sentence-transformers/all-MiniLM-L6-v2', # Embedding model
verbose: bool = False # Enable verbose logging
)
```
### JsonCssExtractionStrategy
Used for CSS selector-based structured data extraction.
```python
JsonCssExtractionStrategy(
schema: Dict[str, Any], # Extraction schema
verbose: bool = False # Enable verbose logging
)
# Schema Structure
schema = {
"name": str, # Schema name
"baseSelector": str, # Base CSS selector
"fields": [ # List of fields to extract
{
"name": str, # Field name
"selector": str, # CSS selector
"type": str, # Field type: "text", "attribute", "html", "regex"
"attribute": str, # For type="attribute"
"pattern": str, # For type="regex"
"transform": str, # Optional: "lowercase", "uppercase", "strip"
"default": Any # Default value if extraction fails
}
]
}
```
## Chunking Strategies
All chunking strategies inherit from `ChunkingStrategy` and implement the `chunk(text: str) -> list` method.
### RegexChunking
Splits text based on regex patterns.
```python
RegexChunking(
patterns: List[str] = None # Regex patterns for splitting
# Default: [r'\n\n']
)
```
### SlidingWindowChunking
Creates overlapping chunks with a sliding window approach.
```python
SlidingWindowChunking(
window_size: int = 100, # Window size in words
step: int = 50 # Step size between windows
)
```
### OverlappingWindowChunking
Creates chunks with specified overlap.
```python
OverlappingWindowChunking(
window_size: int = 1000, # Chunk size in words
overlap: int = 100 # Overlap size in words
)
```
## Usage Examples
### LLM Extraction
```python
from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.async_configs import LlmConfig
# Define schema
class Article(BaseModel):
title: str
content: str
author: str
# Create strategy
strategy = LLMExtractionStrategy(
llmConfig = LlmConfig(provider="ollama/llama2"),
schema=Article.schema(),
instruction="Extract article details"
)
# Use with crawler
result = await crawler.arun(
url="https://example.com/article",
extraction_strategy=strategy
)
# Access extracted data
data = json.loads(result.extracted_content)
```
### CSS Extraction
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
# Define schema
schema = {
"name": "Product List",
"baseSelector": ".product-card",
"fields": [
{
"name": "title",
"selector": "h2.title",
"type": "text"
},
{
"name": "price",
"selector": ".price",
"type": "text",
"transform": "strip"
},
{
"name": "image",
"selector": "img",
"type": "attribute",
"attribute": "src"
}
]
}
# Create and use strategy
strategy = JsonCssExtractionStrategy(schema)
result = await crawler.arun(
url="https://example.com/products",
extraction_strategy=strategy
)
```
### Content Chunking
```python
from crawl4ai.chunking_strategy import OverlappingWindowChunking
from crawl4ai.async_configs import LlmConfig
# Create chunking strategy
chunker = OverlappingWindowChunking(
window_size=500, # 500 words per chunk
overlap=50 # 50 words overlap
)
# Use with extraction strategy
strategy = LLMExtractionStrategy(
llmConfig = LlmConfig(provider="ollama/llama2"),
chunking_strategy=chunker
)
result = await crawler.arun(
url="https://example.com/long-article",
extraction_strategy=strategy
)
```
## Best Practices
1. **Choose the Right Strategy**
- Use `LLMExtractionStrategy` for complex, unstructured content
- Use `JsonCssExtractionStrategy` for well-structured HTML
- Use `CosineStrategy` for content similarity and clustering
2. **Optimize Chunking**
```python
# For long documents
strategy = LLMExtractionStrategy(
chunk_token_threshold=2000, # Smaller chunks
overlap_rate=0.1 # 10% overlap
)
```
3. **Handle Errors**
```python
try:
result = await crawler.arun(
url="https://example.com",
extraction_strategy=strategy
)
if result.success:
content = json.loads(result.extracted_content)
except Exception as e:
print(f"Extraction failed: {e}")
```
4. **Monitor Performance**
```python
strategy = CosineStrategy(
verbose=True, # Enable logging
word_count_threshold=20, # Filter short content
top_k=5 # Limit results
)
```