Update README

This commit is contained in:
UncleCode 2025-04-22 23:21:42 +08:00
parent c98ffe2130
commit b0aa8bc9f7
2 changed files with 652 additions and 630 deletions

104
README.md
View File

@ -21,9 +21,9 @@
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease. Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
[✨ Check out latest update v0.5.0](#-recent-updates) [✨ Check out latest update v0.6.0rc1](#-recent-updates)
🎉 **Version 0.5.0 is out!** This major release introduces Deep Crawling with BFS/DFS/BestFirst strategies, Memory-Adaptive Dispatcher, Multiple Crawling Strategies (Playwright and HTTP), Docker Deployment with FastAPI, Command-Line Interface (CLI), and more! [Read the release notes →](https://docs.crawl4ai.com/blog) 🎉 **Version 0.6.0rc1 is now available!** This release candidate introduces World-aware Crawling with geolocation and locale settings, Table-to-DataFrame extraction, Browser pooling with pre-warming, Network and console traffic capture, MCP integration for AI tools, and a completely revamped Docker deployment! [Read the release notes →](https://docs.crawl4ai.com/blog)
<details> <details>
<summary>🤓 <strong>My Personal Story</strong></summary> <summary>🤓 <strong>My Personal Story</strong></summary>
@ -253,24 +253,29 @@ pip install -e ".[all]" # Install all optional features
<details> <details>
<summary>🐳 <strong>Docker Deployment</strong></summary> <summary>🐳 <strong>Docker Deployment</strong></summary>
> 🚀 **Major Changes Coming!** We're developing a completely new Docker implementation that will make deployment even more efficient and seamless. The current Docker setup is being deprecated in favor of this new solution. > 🚀 **Now Available!** Our completely redesigned Docker implementation is here! This new solution makes deployment more efficient and seamless than ever.
### Current Docker Support ### New Docker Features
The existing Docker implementation is being deprecated and will be replaced soon. If you still need to use Docker with the current version: The new Docker implementation includes:
- **Browser pooling** with page pre-warming for faster response times
- **Interactive playground** to test and generate request code
- **MCP integration** for direct connection to AI tools like Claude Code
- **Comprehensive API endpoints** including HTML extraction, screenshots, PDF generation, and JavaScript execution
- **Multi-architecture support** with automatic detection (AMD64/ARM64)
- **Optimized resources** with improved memory management
- 📚 [Deprecated Docker Setup](./docs/deprecated/docker-deployment.md) - Instructions for the current Docker implementation ### Getting Started
- ⚠️ Note: This setup will be replaced in the next major release
### What's Coming Next? ```bash
# Pull and run the latest release candidate
docker pull unclecode/crawl4ai:0.6.0rc1-r1
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.6.0rc1-r1
Our new Docker implementation will bring: # Visit the playground at http://localhost:11235/playground
- Improved performance and resource efficiency ```
- Streamlined deployment process
- Better integration with Crawl4AI features
- Enhanced scalability options
Stay connected with our [GitHub repository](https://github.com/unclecode/crawl4ai) for updates! For complete documentation, see our [Docker Deployment Guide](https://docs.crawl4ai.com/core/docker-deployment/).
</details> </details>
@ -500,31 +505,60 @@ async def test_news_crawl():
## ✨ Recent Updates ## ✨ Recent Updates
### Version 0.5.0 Major Release Highlights ### Version 0.6.0rc1 Release Highlights
- **🚀 Deep Crawling System**: Explore websites beyond initial URLs with three strategies: - **🌎 World-aware Crawling**: Set geolocation, language, and timezone for authentic locale-specific content:
- **BFS Strategy**: Breadth-first search explores websites level by level ```python
- **DFS Strategy**: Depth-first search explores each branch deeply before backtracking crawler_config = CrawlerRunConfig(
- **BestFirst Strategy**: Uses scoring functions to prioritize which URLs to crawl next geo_locale={"city": "Tokyo", "lang": "ja", "timezone": "Asia/Tokyo"}
- **Page Limiting**: Control the maximum number of pages to crawl with `max_pages` parameter )
- **Score Thresholds**: Filter URLs based on relevance scores ```
- **⚡ Memory-Adaptive Dispatcher**: Dynamically adjusts concurrency based on system memory with built-in rate limiting
- **🔄 Multiple Crawling Strategies**: - **📊 Table-to-DataFrame Extraction**: Extract HTML tables directly to CSV or pandas DataFrames:
- **AsyncPlaywrightCrawlerStrategy**: Browser-based crawling with JavaScript support (Default) ```python
- **AsyncHTTPCrawlerStrategy**: Fast, lightweight HTTP-only crawler for simple tasks crawler_config = CrawlerRunConfig(extract_tables=True)
- **🐳 Docker Deployment**: Easy deployment with FastAPI server and streaming/non-streaming endpoints # Access tables via result.tables or result.tables_as_dataframe
- **💻 Command-Line Interface**: New `crwl` CLI provides convenient terminal access to all features with intuitive commands and configuration options ```
- **👤 Browser Profiler**: Create and manage persistent browser profiles to save authentication states, cookies, and settings for seamless crawling of protected content
- **🧠 Crawl4AI Coding Assistant**: AI-powered coding assistant to answer your question for Crawl4ai, and generate proper code for crawling. - **🚀 Browser Pooling**: Pages launch hot with pre-warmed browser instances for lower latency and memory usage
- **🏎️ LXML Scraping Mode**: Fast HTML parsing using the `lxml` library for improved performance
- **🌐 Proxy Rotation**: Built-in support for proxy switching with `RoundRobinProxyStrategy` - **🕸️ Network and Console Capture**: Full traffic logs and MHTML snapshots for debugging:
```python
crawler_config = CrawlerRunConfig(
capture_network=True,
capture_console=True,
mhtml=True
)
```
- **🔌 MCP Integration**: Connect to AI tools like Claude Code through the Model Context Protocol
```bash
# Add Crawl4AI to Claude Code
claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse
```
- **🖥️ Interactive Playground**: Test configurations and generate API requests with the built-in web interface at `/playground`
- **🐳 Revamped Docker Deployment**: Streamlined multi-architecture Docker image with improved resource efficiency
- **📱 Multi-stage Build System**: Optimized Dockerfile with platform-specific performance enhancements
Read the full details in our [0.6.0rc1 Release Notes](https://docs.crawl4ai.com/blog/releases/0.6.0.html) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
### Previous Version: 0.5.0 Major Release Highlights
- **🚀 Deep Crawling System**: Explore websites beyond initial URLs with BFS, DFS, and BestFirst strategies
- **⚡ Memory-Adaptive Dispatcher**: Dynamically adjusts concurrency based on system memory
- **🔄 Multiple Crawling Strategies**: Browser-based and lightweight HTTP-only crawlers
- **💻 Command-Line Interface**: New `crwl` CLI provides convenient terminal access
- **👤 Browser Profiler**: Create and manage persistent browser profiles
- **🧠 Crawl4AI Coding Assistant**: AI-powered coding assistant
- **🏎️ LXML Scraping Mode**: Fast HTML parsing using the `lxml` library
- **🌐 Proxy Rotation**: Built-in support for proxy switching
- **🤖 LLM Content Filter**: Intelligent markdown generation using LLMs - **🤖 LLM Content Filter**: Intelligent markdown generation using LLMs
- **📄 PDF Processing**: Extract text, images, and metadata from PDF files - **📄 PDF Processing**: Extract text, images, and metadata from PDF files
- **🔗 URL Redirection Tracking**: Automatically follow and record HTTP redirects
- **🤖 LLM Schema Generation**: Easily create extraction schemas with LLM assistance
- **🔍 robots.txt Compliance**: Respect website crawling rules
Read the full details in our [0.5.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.5.0.html) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md). Read the full details in our [0.5.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.5.0.html).
## Version Numbering in Crawl4AI ## Version Numbering in Crawl4AI

File diff suppressed because it is too large Load Diff