Update README

parent c98ffe2130
commit b0aa8bc9f7

README.md (104 changed lines)
@@ -21,9 +21,9 @@
 
 Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
 
-[✨ Check out latest update v0.5.0](#-recent-updates)
+[✨ Check out latest update v0.6.0rc1](#-recent-updates)
 
-🎉 **Version 0.5.0 is out!** This major release introduces Deep Crawling with BFS/DFS/BestFirst strategies, Memory-Adaptive Dispatcher, Multiple Crawling Strategies (Playwright and HTTP), Docker Deployment with FastAPI, Command-Line Interface (CLI), and more! [Read the release notes →](https://docs.crawl4ai.com/blog)
+🎉 **Version 0.6.0rc1 is now available!** This release candidate introduces World-aware Crawling with geolocation and locale settings, Table-to-DataFrame extraction, Browser pooling with pre-warming, Network and console traffic capture, MCP integration for AI tools, and a completely revamped Docker deployment! [Read the release notes →](https://docs.crawl4ai.com/blog)
 
 <details>
 <summary>🤓 <strong>My Personal Story</strong></summary>
@@ -253,24 +253,29 @@ pip install -e ".[all]" # Install all optional features
 <details>
 <summary>🐳 <strong>Docker Deployment</strong></summary>
 
-> 🚀 **Major Changes Coming!** We're developing a completely new Docker implementation that will make deployment even more efficient and seamless. The current Docker setup is being deprecated in favor of this new solution.
+> 🚀 **Now Available!** Our completely redesigned Docker implementation is here! This new solution makes deployment more efficient and seamless than ever.
 
-### Current Docker Support
+### New Docker Features
 
-The existing Docker implementation is being deprecated and will be replaced soon. If you still need to use Docker with the current version:
+The new Docker implementation includes:
+- **Browser pooling** with page pre-warming for faster response times
+- **Interactive playground** to test and generate request code
+- **MCP integration** for direct connection to AI tools like Claude Code
+- **Comprehensive API endpoints** including HTML extraction, screenshots, PDF generation, and JavaScript execution
+- **Multi-architecture support** with automatic detection (AMD64/ARM64)
+- **Optimized resources** with improved memory management
 
-- 📚 [Deprecated Docker Setup](./docs/deprecated/docker-deployment.md) - Instructions for the current Docker implementation
-- ⚠️ Note: This setup will be replaced in the next major release
+### Getting Started
 
-### What's Coming Next?
+```bash
+# Pull and run the latest release candidate
+docker pull unclecode/crawl4ai:0.6.0rc1-r1
+docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.6.0rc1-r1
 
-Our new Docker implementation will bring:
-- Improved performance and resource efficiency
-- Streamlined deployment process
-- Better integration with Crawl4AI features
-- Enhanced scalability options
+# Visit the playground at http://localhost:11235/playground
+```
 
-Stay connected with our [GitHub repository](https://github.com/unclecode/crawl4ai) for updates!
+For complete documentation, see our [Docker Deployment Guide](https://docs.crawl4ai.com/core/docker-deployment/).
 
 </details>
 
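Note: the new "Getting Started" block above only pulls and runs the image. As an illustration of the "comprehensive API endpoints" bullet, here is a minimal sketch of sending a crawl request to the running container from Python. The `/crawl` endpoint name and the payload shape are assumptions made for this example; confirm the exact request schema in the [Docker Deployment Guide](https://docs.crawl4ai.com/core/docker-deployment/) or via the playground at `http://localhost:11235/playground`.

```python
# Minimal sketch: POST a crawl job to the Dockerized server started above.
# Assumptions: the server listens on port 11235 (as in the docker run command)
# and exposes a JSON /crawl endpoint; verify the schema in the Docker guide.
import requests

payload = {"urls": ["https://example.com"]}  # pages to crawl
resp = requests.post("http://localhost:11235/crawl", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json())  # crawl results as JSON
```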
@@ -500,31 +505,60 @@ async def test_news_crawl():
 
 ## ✨ Recent Updates
 
-### Version 0.5.0 Major Release Highlights
+### Version 0.6.0rc1 Release Highlights
 
-- **🚀 Deep Crawling System**: Explore websites beyond initial URLs with three strategies:
-  - **BFS Strategy**: Breadth-first search explores websites level by level
-  - **DFS Strategy**: Depth-first search explores each branch deeply before backtracking
-  - **BestFirst Strategy**: Uses scoring functions to prioritize which URLs to crawl next
-  - **Page Limiting**: Control the maximum number of pages to crawl with `max_pages` parameter
-  - **Score Thresholds**: Filter URLs based on relevance scores
-- **⚡ Memory-Adaptive Dispatcher**: Dynamically adjusts concurrency based on system memory with built-in rate limiting
-- **🔄 Multiple Crawling Strategies**:
-  - **AsyncPlaywrightCrawlerStrategy**: Browser-based crawling with JavaScript support (Default)
-  - **AsyncHTTPCrawlerStrategy**: Fast, lightweight HTTP-only crawler for simple tasks
-- **🐳 Docker Deployment**: Easy deployment with FastAPI server and streaming/non-streaming endpoints
-- **💻 Command-Line Interface**: New `crwl` CLI provides convenient terminal access to all features with intuitive commands and configuration options
-- **👤 Browser Profiler**: Create and manage persistent browser profiles to save authentication states, cookies, and settings for seamless crawling of protected content
-- **🧠 Crawl4AI Coding Assistant**: AI-powered coding assistant to answer your question for Crawl4ai, and generate proper code for crawling.
-- **🏎️ LXML Scraping Mode**: Fast HTML parsing using the `lxml` library for improved performance
-- **🌐 Proxy Rotation**: Built-in support for proxy switching with `RoundRobinProxyStrategy`
+- **🌎 World-aware Crawling**: Set geolocation, language, and timezone for authentic locale-specific content:
+```python
+crawler_config = CrawlerRunConfig(
+    geo_locale={"city": "Tokyo", "lang": "ja", "timezone": "Asia/Tokyo"}
+)
+```
+
+- **📊 Table-to-DataFrame Extraction**: Extract HTML tables directly to CSV or pandas DataFrames:
+```python
+crawler_config = CrawlerRunConfig(extract_tables=True)
+# Access tables via result.tables or result.tables_as_dataframe
+```
+
+- **🚀 Browser Pooling**: Pages launch hot with pre-warmed browser instances for lower latency and memory usage
+
+- **🕸️ Network and Console Capture**: Full traffic logs and MHTML snapshots for debugging:
+```python
+crawler_config = CrawlerRunConfig(
+    capture_network=True,
+    capture_console=True,
+    mhtml=True
+)
+```
+
+- **🔌 MCP Integration**: Connect to AI tools like Claude Code through the Model Context Protocol
+```bash
+# Add Crawl4AI to Claude Code
+claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse
+```
+
+- **🖥️ Interactive Playground**: Test configurations and generate API requests with the built-in web interface at `/playground`
+
+- **🐳 Revamped Docker Deployment**: Streamlined multi-architecture Docker image with improved resource efficiency
+
+- **📱 Multi-stage Build System**: Optimized Dockerfile with platform-specific performance enhancements
+
+Read the full details in our [0.6.0rc1 Release Notes](https://docs.crawl4ai.com/blog/releases/0.6.0.html) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
+
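Aside: the 0.6.0rc1 snippets above only build a `CrawlerRunConfig`. For context, a minimal sketch of passing such a config to an actual crawl, using the same `AsyncWebCrawler` API shown elsewhere in this README; the `extract_tables` flag and the `result.tables` attribute are taken from the bullets above and should be verified against the release notes.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Any of the configs shown above (geo, tables, capture) are passed the same way.
    run_config = CrawlerRunConfig(extract_tables=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=run_config)
        print(str(result.markdown)[:300])          # first part of the generated markdown
        print(len(result.tables or []), "tables")  # tables extracted, per the bullet above

asyncio.run(main())
```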
+### Previous Version: 0.5.0 Major Release Highlights
+
+- **🚀 Deep Crawling System**: Explore websites beyond initial URLs with BFS, DFS, and BestFirst strategies
+- **⚡ Memory-Adaptive Dispatcher**: Dynamically adjusts concurrency based on system memory
+- **🔄 Multiple Crawling Strategies**: Browser-based and lightweight HTTP-only crawlers
+- **💻 Command-Line Interface**: New `crwl` CLI provides convenient terminal access
+- **👤 Browser Profiler**: Create and manage persistent browser profiles
+- **🧠 Crawl4AI Coding Assistant**: AI-powered coding assistant
+- **🏎️ LXML Scraping Mode**: Fast HTML parsing using the `lxml` library
+- **🌐 Proxy Rotation**: Built-in support for proxy switching
 - **🤖 LLM Content Filter**: Intelligent markdown generation using LLMs
 - **📄 PDF Processing**: Extract text, images, and metadata from PDF files
-- **🔗 URL Redirection Tracking**: Automatically follow and record HTTP redirects
-- **🤖 LLM Schema Generation**: Easily create extraction schemas with LLM assistance
-- **🔍 robots.txt Compliance**: Respect website crawling rules
 
-Read the full details in our [0.5.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.5.0.html) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
+Read the full details in our [0.5.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.5.0.html).
 
 ## Version Numbering in Crawl4AI
 
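Aside: for the 0.5.0 deep-crawling highlight kept above, a rough sketch of wiring a BFS deep crawl into `CrawlerRunConfig`. The `BFSDeepCrawlStrategy` class name, its import path, and the `max_depth`/`max_pages` parameters are recalled from the project docs rather than shown in this diff, so treat them as assumptions and check the 0.5.0 release notes.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
# Assumed import path and class name; verify against the deep-crawling docs.
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def main():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,             # follow links up to two levels from the start URL
            include_external=False,  # stay on the same domain
            max_pages=20,            # page limit, per the 0.5.0 "max_pages" highlight
        ),
    )
    async with AsyncWebCrawler() as crawler:
        # In non-streaming mode a deep crawl returns a list of results.
        results = await crawler.arun("https://docs.crawl4ai.com", config=config)
        print(f"Crawled {len(results)} pages")

asyncio.run(main())
```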