# Crawl4AI 0.4.3: Major Performance Boost & LLM Integration
We're excited to announce Crawl4AI 0.4.3, focusing on three key areas: Speed & Efficiency, LLM Integration, and Core Platform Improvements. This release significantly improves crawling performance while adding powerful new LLM-powered features.
## ⚡ Speed & Efficiency Improvements
### 1. Memory-Adaptive Dispatcher System
The new dispatcher system provides intelligent resource management and real-time monitoring:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DisplayMode
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher, CrawlerMonitor

async def main():
    urls = ["https://example1.com", "https://example2.com"] * 50

    # Configure memory-aware dispatch
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=80.0,  # Auto-throttle at 80% memory
        check_interval=0.5,             # Check memory every 0.5 seconds
        max_session_permit=20,          # Max concurrent sessions
        monitor=CrawlerMonitor(         # Real-time monitoring
            display_mode=DisplayMode.DETAILED
        )
    )

    async with AsyncWebCrawler() as crawler:
        results = await dispatcher.run_urls(
            urls=urls,
            crawler=crawler,
            config=CrawlerRunConfig()
        )
        print(f"Crawled {len(results)} pages")

if __name__ == "__main__":
    asyncio.run(main())
```
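
When memory usage crosses the configured threshold, the dispatcher holds back new sessions until usage drops, so large URL batches throttle gracefully instead of exhausting RAM.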
### 2. Streaming Support
Process crawled URLs in real-time instead of waiting for all results:
```python
config = CrawlerRunConfig(stream=True)

async with AsyncWebCrawler() as crawler:
    async for result in await crawler.arun_many(urls, config=config):
        print(f"Got result for {result.url}")
        # Process each result immediately
```
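
With the default `stream=False`, `arun_many` collects every result into a list before returning, so streaming is worth enabling whenever results can be processed incrementally.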
### 3. LXML-Based Scraping
New LXML scraping strategy offering up to 20x faster parsing:
```python
from crawl4ai import CacheMode, CrawlerRunConfig, LXMLWebScrapingStrategy

config = CrawlerRunConfig(
    scraping_strategy=LXMLWebScrapingStrategy(),
    cache_mode=CacheMode.ENABLED
)
```
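
The speedup is most noticeable on large, well-formed documents; for heavily malformed HTML, the default scraping strategy may still be the safer choice.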
## 🤖 LLM Integration
### 1. LLM-Powered Markdown Generation
Smart content filtering and organization using LLMs:
```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import LLMContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=LLMContentFilter(
            provider="openai/gpt-4o",
            instruction="Extract technical documentation and code examples"
        )
    )
)
```
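
`LLMContentFilter` calls out to the configured provider, so you will also need credentials, typically an `api_token` argument or the provider's usual environment variable (e.g. `OPENAI_API_KEY` for OpenAI models).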
### 2. Automatic Schema Generation
Generate extraction schemas instantly with an LLM instead of writing CSS/XPath selectors by hand:
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# html_content: sample HTML from a page with the target structure
schema = JsonCssExtractionStrategy.generate_schema(
    html_content,
    schema_type="CSS",
    query="Extract product name, price, and description"
)
```
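
Once generated, the schema can be reused for fast, LLM-free extraction. A minimal sketch, assuming the `schema` dict from above and a product page URL of your own:

```python
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_products(url: str, schema: dict) -> list:
    # Reuse the LLM-generated schema with the plain CSS strategy:
    # extraction now runs without any further LLM calls.
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema)
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url, config=config)
        return json.loads(result.extracted_content)
```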
## 🔧 Core Improvements
### 1. Proxy Support & Rotation
Integrated proxy support with automatic rotation and verification:
```python
config = CrawlerRunConfig(
    proxy_config={
        "server": "http://proxy:8080",
        "username": "user",
        "password": "pass"
    }
)
```
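
For rotation across a pool of proxies, one simple approach is to cycle through configs at the call site; the sketch below is illustrative (the pool and helper are hypothetical, not a built-in API):

```python
from itertools import cycle

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Hypothetical proxy pool; substitute your own endpoints.
PROXY_POOL = cycle([
    {"server": "http://proxy1:8080", "username": "user", "password": "pass"},
    {"server": "http://proxy2:8080", "username": "user", "password": "pass"},
])

async def crawl_rotating(urls):
    # Each request takes the next proxy from the pool.
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            config = CrawlerRunConfig(proxy_config=next(PROXY_POOL))
            yield await crawler.arun(url, config=config)
```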
### 2. Robots.txt Compliance
Built-in robots.txt support with SQLite caching:
```python
config = CrawlerRunConfig(check_robots_txt=True)
result = await crawler.arun(url, config=config)
if result.status_code == 403:
print("Access blocked by robots.txt")
```
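
Lookups are cached in a local SQLite store, so repeated requests to the same domain avoid refetching robots.txt.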
### 3. URL Redirection Tracking
Track final URLs after redirects:
```python
result = await crawler.arun(url)
print(f"Initial URL: {url}")
print(f"Final URL: {result.redirected_url}")
```
## Performance Impact
- Memory usage reduced by up to 40% with adaptive dispatcher
- Parsing speed increased up to 20x with LXML strategy
- Streaming reduces memory footprint for large crawls by ~60%
## Getting Started
```bash
pip install -U crawl4ai
```
For complete examples, check our [demo repository](https://github.com/unclecode/crawl4ai/examples).
## Stay Connected
- Star us on [GitHub](https://github.com/unclecode/crawl4ai)
- Follow [@unclecode](https://twitter.com/unclecode)
- Join our [Discord](https://discord.gg/crawl4ai)
Happy crawling! 🕷️