feat(docker): update Docker deployment for v0.6.0
Major updates to Docker deployment infrastructure:

- Switch default port to 11235 for all services
- Add MCP (Model Context Protocol) support with WebSocket/SSE endpoints
- Simplify docker-compose.yml with auto-platform detection
- Update documentation with new features and examples
- Consolidate configuration and improve resource management

BREAKING CHANGE: Default port changed from 8020 to 11235. Update your configurations and deployment scripts accordingly.
Parent: f3ebb38edf
Commit: 4812f08a73

CHANGELOG.md (47 lines changed)
@@ -5,6 +5,53 @@ All notable changes to Crawl4AI will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.6.0rc1-r1] - 2025-04-22

### Added

- Browser pooling with page pre-warming and fine-grained **geolocation, locale, and timezone** controls
- Crawler pool manager (SDK + Docker API) for smarter resource allocation
- Network & console log capture plus MHTML snapshot export
- **Table extractor**: turn HTML `<table>`s into DataFrames or CSV with one flag
- High-volume stress-test framework in `tests/memory` and API load scripts
- MCP protocol endpoints with socket & SSE support; playground UI scaffold
- Docs v2 revamp: TOC, GitHub badge, copy-code buttons, Docker API demo
- "Ask AI" helper button *(work in progress, shipping soon)*
- New examples: geolocation usage, network/console capture, Docker API, markdown source selection, crypto analysis
- Expanded automated test suites for browser, Docker, MCP, and memory benchmarks

### Changed

- Consolidated and renamed browser strategies; legacy Docker strategy modules removed
- `ProxyConfig` moved to `async_configs`
- Server migrated to pool-based crawler management
- FastAPI validators replace custom query validation
- Docker build now uses a Chromium base image
- Large-scale repo tidy-up (≈36k insertions, ≈5k deletions)

### Fixed

- Async crawler session leak, duplicate-visit handling, and URL normalisation
- Target-element regressions in scraping strategies
- Logged-URL readability, encoded-URL decoding, and middle truncation for long URLs
- Closed issues: #701, #733, #756, #774, #804, #822, #839, #841, #842, #843, #867, #902, #911

### Removed

- Obsolete modules under `crawl4ai/browser/*`, superseded by the new pooled browser layer

### Deprecated

- Old markdown generator names now alias `DefaultMarkdownGenerator` and emit deprecation warnings

---

#### Upgrade notes

1. Update any direct imports from `crawl4ai/browser/*` to the new pooled browser modules.
2. If you override `AsyncPlaywrightCrawlerStrategy.get_page`, adopt the new signature.
3. Rebuild Docker images to pull the new Chromium layer.
4. Switch to `DefaultMarkdownGenerator` (or silence the deprecation warning).

---

`121 files changed, ≈36,223 insertions, ≈4,975 deletions`

### [Feature] 2025-04-21

- Implemented MCP protocol for machine-to-machine communication
- Added WebSocket and SSE transport for MCP server

@@ -1,5 +1,10 @@
 FROM python:3.10-slim

+# C4ai version
+ARG C4AI_VER=0.6.0
+ENV C4AI_VERSION=$C4AI_VER
+LABEL c4ai.version=$C4AI_VER
+
 # Set build arguments
 ARG APP_HOME=/app
 ARG GITHUB_REPO=https://github.com/unclecode/crawl4ai.git

@@ -1,2 +1,3 @@
 # crawl4ai/_version.py
-__version__ = "0.5.0.post8"
+__version__ = "0.6.0rc1"

@@ -1,644 +0,0 @@

# Crawl4AI Docker Guide 🐳

## Table of Contents

- [Prerequisites](#prerequisites)
- [Installation](#installation)
  - [Option 1: Using Docker Compose (Recommended)](#option-1-using-docker-compose-recommended)
  - [Option 2: Manual Local Build & Run](#option-2-manual-local-build--run)
  - [Option 3: Using Pre-built Docker Hub Images](#option-3-using-pre-built-docker-hub-images)
- [Dockerfile Parameters](#dockerfile-parameters)
- [Using the API](#using-the-api)
  - [Understanding Request Schema](#understanding-request-schema)
  - [REST API Examples](#rest-api-examples)
  - [Python SDK](#python-sdk)
- [Metrics & Monitoring](#metrics--monitoring)
- [Deployment Scenarios](#deployment-scenarios)
- [Complete Examples](#complete-examples)
- [Server Configuration](#server-configuration)
  - [Understanding config.yml](#understanding-configyml)
  - [JWT Authentication](#jwt-authentication)
  - [Configuration Tips and Best Practices](#configuration-tips-and-best-practices)
  - [Customizing Your Configuration](#customizing-your-configuration)
  - [Configuration Recommendations](#configuration-recommendations)
- [Getting Help](#getting-help)

## Prerequisites

Before we dive in, make sure you have:

- Docker installed and running (version 20.10.0 or higher), including `docker compose` (usually bundled with Docker Desktop).
- `git` for cloning the repository.
- At least 4GB of RAM available for the container (more recommended for heavy use).
- Python 3.10+ (if using the Python SDK).
- Node.js 16+ (if using the Node.js examples).

> 💡 **Pro tip**: Run `docker info` to check your Docker installation and available resources.

## Installation

We offer several ways to get the Crawl4AI server running. Docker Compose is the easiest way to manage local builds and runs.

### Option 1: Using Docker Compose (Recommended)

Docker Compose simplifies building and running the service, especially for local development and testing across different platforms.

#### 1. Clone Repository

```bash
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
```

#### 2. Environment Setup (API Keys)

If you plan to use LLMs, copy the example environment file and add your API keys. This file should be in the **project root directory**.

```bash
# Make sure you are in the 'crawl4ai' root directory
cp deploy/docker/.llm.env.example .llm.env

# Now edit .llm.env and add your API keys
# Example content:
# OPENAI_API_KEY=sk-your-key
# ANTHROPIC_API_KEY=your-anthropic-key
# ...
```

> 🔑 **Note**: Keep your API keys secure! Never commit `.llm.env` to version control.

#### 3. Build and Run with Compose

The `docker-compose.yml` file in the project root defines services for different scenarios using **profiles**.

* **Build and Run Locally (AMD64):**

  ```bash
  # Builds the image locally using Dockerfile and runs it
  docker compose --profile local-amd64 up --build -d
  ```

* **Build and Run Locally (ARM64):**

  ```bash
  # Builds the image locally using Dockerfile and runs it
  docker compose --profile local-arm64 up --build -d
  ```

* **Run Pre-built Image from Docker Hub (AMD64):**

  ```bash
  # Pulls and runs the specified AMD64 image from Docker Hub
  # (Set VERSION env var for specific tags, e.g., VERSION=0.5.1-d1)
  docker compose --profile hub-amd64 up -d
  ```

* **Run Pre-built Image from Docker Hub (ARM64):**

  ```bash
  # Pulls and runs the specified ARM64 image from Docker Hub
  docker compose --profile hub-arm64 up -d
  ```

> The server will be available at `http://localhost:11235`.

#### 4. Stopping Compose Services

```bash
# Stop the service(s) associated with a profile (e.g., local-amd64)
docker compose --profile local-amd64 down
```

### Option 2: Manual Local Build & Run

Use this approach if you prefer not to use Docker Compose for local builds.

#### 1. Clone Repository & Setup Environment

Follow steps 1 and 2 from the Docker Compose section above (clone the repo, `cd crawl4ai`, create `.llm.env` in the root).

#### 2. Build the Image (Multi-Arch)

Use `docker buildx` to build the image. This example builds for multiple platforms and loads the image matching your host architecture into the local Docker daemon.

```bash
# Make sure you are in the 'crawl4ai' root directory
docker buildx build --platform linux/amd64,linux/arm64 -t crawl4ai-local:latest --load .
```

#### 3. Run the Container

* **Basic run (no LLM support):**

  ```bash
  # Replace --platform if your host is ARM64
  docker run -d \
    -p 11235:11235 \
    --name crawl4ai-standalone \
    --shm-size=1g \
    --platform linux/amd64 \
    crawl4ai-local:latest
  ```

* **With LLM support:**

  ```bash
  # Make sure .llm.env is in the current directory (project root)
  # Replace --platform if your host is ARM64
  docker run -d \
    -p 11235:11235 \
    --name crawl4ai-standalone \
    --env-file .llm.env \
    --shm-size=1g \
    --platform linux/amd64 \
    crawl4ai-local:latest
  ```

> The server will be available at `http://localhost:11235`.

#### 4. Stopping the Manual Container

```bash
docker stop crawl4ai-standalone && docker rm crawl4ai-standalone
```

### Option 3: Using Pre-built Docker Hub Images

Pull and run images directly from Docker Hub without building locally.

#### 1. Pull the Image

We use a versioning scheme like `LIBRARY_VERSION-dREVISION` (e.g., `0.5.1-d1`). The `latest` tag points to the most recent stable release. Images are built with multi-arch manifests, so Docker usually pulls the correct version for your system automatically.

```bash
# Pull a specific version (recommended for stability)
docker pull unclecode/crawl4ai:0.5.1-d1

# Or pull the latest stable version
docker pull unclecode/crawl4ai:latest
```

#### 2. Setup Environment (API Keys)

If using LLMs, create the `.llm.env` file in a directory of your choice, similar to Step 2 in the Compose section.

#### 3. Run the Container

* **Basic run:**

  ```bash
  docker run -d \
    -p 11235:11235 \
    --name crawl4ai-hub \
    --shm-size=1g \
    unclecode/crawl4ai:0.5.1-d1 # Or use :latest
  ```

* **With LLM support:**

  ```bash
  # Make sure .llm.env is in the directory you are running docker from
  docker run -d \
    -p 11235:11235 \
    --name crawl4ai-hub \
    --env-file .llm.env \
    --shm-size=1g \
    unclecode/crawl4ai:0.5.1-d1 # Or use :latest
  ```

> The server will be available at `http://localhost:11235`.

#### 4. Stopping the Hub Container

```bash
docker stop crawl4ai-hub && docker rm crawl4ai-hub
```

#### Docker Hub Versioning Explained

* **Image Name:** `unclecode/crawl4ai`
* **Tag Format:** `LIBRARY_VERSION-dREVISION`
  * `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library included (e.g., `0.5.1`).
  * `dREVISION`: An incrementing number (starting at `d1`) for Docker build changes made *without* changing the library version (e.g., base image updates, dependency fixes). Resets to `d1` for each new `LIBRARY_VERSION`.
* **Example:** `unclecode/crawl4ai:0.5.1-d1`
* **`latest` Tag:** Points to the most recent stable `LIBRARY_VERSION-dREVISION`.
* **Multi-Arch:** Images support `linux/amd64` and `linux/arm64`; Docker automatically selects the correct architecture.

---

*(The rest of the document remains largely the same, but with key updates below.)*

---

## Dockerfile Parameters

You can customize the image build process using build arguments (`--build-arg`). These are typically used via `docker buildx build` or within the `docker-compose.yml` file.

```bash
# Example: Build with 'all' features using buildx
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --build-arg INSTALL_TYPE=all \
  -t yourname/crawl4ai-all:latest \
  --load \
  . # Build from root context
```

### Build Arguments Explained

| Argument | Description | Default | Options |
| :----------- | :--------------------------------------- | :-------- | :--------------------------------- |
| INSTALL_TYPE | Feature set | `default` | `default`, `all`, `torch`, `transformer` |
| ENABLE_GPU | GPU support (CUDA for AMD64) | `false` | `true`, `false` |
| APP_HOME | Install path inside container (advanced) | `/app` | any valid path |
| USE_LOCAL | Install library from local source | `true` | `true`, `false` |
| GITHUB_REPO | Git repo to clone if USE_LOCAL=false | *(see Dockerfile)* | any git URL |
| GITHUB_BRANCH | Git branch to clone if USE_LOCAL=false | `main` | any branch name |

*(Note: PYTHON_VERSION is fixed by the `FROM` instruction in the Dockerfile.)*

### Build Best Practices

1. **Choose the Right Install Type**
   * `default`: Basic installation, smallest image size. Suitable for most standard web scraping and markdown generation.
   * `all`: Full features including `torch` and `transformers` for advanced extraction strategies (e.g., CosineStrategy, certain LLM filters). Significantly larger image; ensure you need these extras.
2. **Platform Considerations**
   * Use `buildx` for building multi-architecture images, especially for pushing to registries.
   * Use `docker compose` profiles (`local-amd64`, `local-arm64`) for easy platform-specific local builds.
3. **Performance Optimization**
   * The image automatically includes platform-specific optimizations (OpenMP for AMD64, OpenBLAS for ARM64).

---

## Using the API

Communicate with the running Docker server via its REST API (defaulting to `http://localhost:11235`). You can use the Python SDK or make direct HTTP requests.

### Python SDK

Install the SDK: `pip install crawl4ai`

```python
import asyncio
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode  # Assuming you have crawl4ai installed

async def main():
    # Point to the correct server port
    async with Crawl4aiDockerClient(base_url="http://localhost:11235", verbose=True) as client:
        # If JWT is enabled on the server, authenticate first:
        # await client.authenticate("user@example.com")  # See Server Configuration section

        # Example Non-streaming crawl
        print("--- Running Non-Streaming Crawl ---")
        results = await client.crawl(
            ["https://httpbin.org/html"],
            browser_config=BrowserConfig(headless=True),  # Use library classes for config aid
            crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        )
        if results:  # client.crawl returns None on failure
            print(f"Non-streaming results success: {results.success}")
            if results.success:
                for result in results:  # Iterate through the CrawlResultContainer
                    print(f"URL: {result.url}, Success: {result.success}")
        else:
            print("Non-streaming crawl failed.")

        # Example Streaming crawl
        print("\n--- Running Streaming Crawl ---")
        stream_config = CrawlerRunConfig(stream=True, cache_mode=CacheMode.BYPASS)
        try:
            async for result in await client.crawl(  # client.crawl returns an async generator for streaming
                ["https://httpbin.org/html", "https://httpbin.org/links/5/0"],
                browser_config=BrowserConfig(headless=True),
                crawler_config=stream_config
            ):
                print(f"Streamed result: URL: {result.url}, Success: {result.success}")
        except Exception as e:
            print(f"Streaming crawl failed: {e}")

        # Example Get schema
        print("\n--- Getting Schema ---")
        schema = await client.get_schema()
        print(f"Schema received: {bool(schema)}")  # Print whether schema was received

if __name__ == "__main__":
    asyncio.run(main())
```

*(SDK parameters like timeout, verify_ssl etc. remain the same.)*
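
For reference, these client-level options are passed when constructing the client. A minimal sketch, assuming the `timeout` and `verify_ssl` parameter names mentioned in the note above (check your installed SDK version for the exact signature and defaults):

```python
from crawl4ai.docker_client import Crawl4aiDockerClient

# Sketch only: `timeout` / `verify_ssl` are the SDK parameters referenced above;
# names and defaults may differ slightly between crawl4ai versions.
client = Crawl4aiDockerClient(
    base_url="http://localhost:11235",
    timeout=300.0,     # generous timeout for long crawls (assumed parameter)
    verify_ssl=False,  # e.g., self-signed certificates behind a proxy (assumed parameter)
    verbose=True,
)
```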

### Second Approach: Direct API Calls

Crucially, when sending configurations directly via JSON, they **must** follow the `{"type": "ClassName", "params": {...}}` structure for any non-primitive value (like config objects or strategies). Dictionaries must be wrapped as `{"type": "dict", "value": {...}}`.

*(Keep the detailed explanation of Configuration Structure, Basic Pattern, Simple vs Complex, Strategy Pattern, Complex Nested Example, Quick Grammar Overview, Important Rules, Pro Tip.)*
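
As a quick illustration of the basic pattern (a sketch built from the wrapping rules above; parameter values mirror the REST examples later in this guide):

```python
# Minimal payload sketch showing the two wrapping rules:
#  - non-primitive values use {"type": "ClassName", "params": {...}}
#  - plain dictionaries use {"type": "dict", "value": {...}}
payload = {
    "urls": ["https://httpbin.org/html"],
    "browser_config": {
        "type": "BrowserConfig",
        "params": {
            "headless": True,
            # a plain dict value must be wrapped as type/dict + value
            "viewport": {"type": "dict", "value": {"width": 1200, "height": 800}},
        },
    },
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {"cache_mode": "bypass"},  # enum values are sent as their string form
    },
}
```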

#### More Examples *(Ensure the schema example uses the type/value wrapper.)*

**Advanced Crawler Configuration**

*(Keep example; ensure `cache_mode` uses a valid enum value like `"bypass"`.)*

**Extraction Strategy**

```json
{
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "extraction_strategy": {
                "type": "JsonCssExtractionStrategy",
                "params": {
                    "schema": {
                        "type": "dict",
                        "value": {
                            "baseSelector": "article.post",
                            "fields": [
                                {"name": "title", "selector": "h1", "type": "text"},
                                {"name": "content", "selector": ".content", "type": "html"}
                            ]
                        }
                    }
                }
            }
        }
    }
}
```

**LLM Extraction Strategy** *(Keep example; ensure the schema uses the type/value wrapper.)*

*(Keep Deep Crawler Example.)*

### REST API Examples

Update URLs to use port `11235`.

#### Simple Crawl

```python
import requests

# Configuration objects converted to the required JSON structure
browser_config_payload = {
    "type": "BrowserConfig",
    "params": {"headless": True}
}
crawler_config_payload = {
    "type": "CrawlerRunConfig",
    "params": {"stream": False, "cache_mode": "bypass"}  # Use string value of enum
}

crawl_payload = {
    "urls": ["https://httpbin.org/html"],
    "browser_config": browser_config_payload,
    "crawler_config": crawler_config_payload
}
response = requests.post(
    "http://localhost:11235/crawl",  # Updated port
    # headers={"Authorization": f"Bearer {token}"},  # If JWT is enabled
    json=crawl_payload
)
print(f"Status Code: {response.status_code}")
if response.ok:
    print(response.json())
else:
    print(f"Error: {response.text}")
```

#### Streaming Results

```python
import json
import httpx  # Use httpx for the async streaming example

async def test_stream_crawl(token: str = None):  # Made token optional
    """Test the /crawl/stream endpoint with multiple URLs."""
    url = "http://localhost:11235/crawl/stream"  # Updated port
    payload = {
        "urls": [
            "https://httpbin.org/html",
            "https://httpbin.org/links/5/0",
        ],
        "browser_config": {
            "type": "BrowserConfig",
            "params": {"headless": True, "viewport": {"type": "dict", "value": {"width": 1200, "height": 800}}}  # Viewport needs type:dict
        },
        "crawler_config": {
            "type": "CrawlerRunConfig",
            "params": {"stream": True, "cache_mode": "bypass"}
        }
    }

    headers = {}
    # if token:
    #     headers = {"Authorization": f"Bearer {token}"}  # If JWT is enabled

    try:
        async with httpx.AsyncClient() as client:
            async with client.stream("POST", url, json=payload, headers=headers, timeout=120.0) as response:
                print(f"Status: {response.status_code} (Expected: 200)")
                response.raise_for_status()  # Raise exception for bad status codes

                # Read streaming response line-by-line (NDJSON)
                async for line in response.aiter_lines():
                    if line:
                        try:
                            data = json.loads(line)
                            # Check for completion marker
                            if data.get("status") == "completed":
                                print("Stream completed.")
                                break
                            print(f"Streamed Result: {json.dumps(data, indent=2)}")
                        except json.JSONDecodeError:
                            print(f"Warning: Could not decode JSON line: {line}")

    except httpx.HTTPStatusError as e:
        print(f"HTTP error occurred: {e.response.status_code} - {e.response.text}")
    except Exception as e:
        print(f"Error in streaming crawl test: {str(e)}")

# To run this example:
# import asyncio
# asyncio.run(test_stream_crawl())
```

---

## Metrics & Monitoring

Keep an eye on your crawler with these endpoints:

- `/health` - Quick health check
- `/metrics` - Detailed Prometheus metrics
- `/schema` - Full API schema

Example health check:

```bash
curl http://localhost:11235/health
```
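
The same checks can be scripted from Python, for example as a readiness gate before running crawls. A small sketch, assuming the server is reachable on the default port:

```python
import time
import requests

BASE = "http://localhost:11235"

# Wait for /health to come up (the compose healthcheck allows ~40s start time)
for _ in range(30):
    try:
        if requests.get(f"{BASE}/health", timeout=2).ok:
            print("Server is healthy")
            break
    except requests.ConnectionError:
        pass
    time.sleep(2)

# /metrics (Prometheus text) and /schema (full API schema) are plain GET endpoints
print("metrics:", requests.get(f"{BASE}/metrics", timeout=5).status_code)
print("schema:", requests.get(f"{BASE}/schema", timeout=5).status_code)
```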

---

*(The Deployment Scenarios and Complete Examples sections remain the same; update links if examples moved.)*

---

## Server Configuration

The server's behavior can be customized through the `config.yml` file.

### Understanding config.yml

The configuration file is loaded from `/app/config.yml` inside the container. By default, the file from `deploy/docker/config.yml` in the repository is copied there during the build.

Here's a detailed breakdown of the configuration options (using defaults from `deploy/docker/config.yml`):

```yaml
# Application Configuration
app:
  title: "Crawl4AI API"
  version: "1.0.0"  # Consider setting this to match the library version, e.g., "0.5.1"
  host: "0.0.0.0"
  port: 8020  # NOTE: This port is used ONLY when running server.py directly. Gunicorn overrides this (see supervisord.conf).
  reload: False  # Default set to False - suitable for production
  timeout_keep_alive: 300

# Default LLM Configuration
llm:
  provider: "openai/gpt-4o-mini"
  api_key_env: "OPENAI_API_KEY"
  # api_key: sk-...  # If you pass the API key directly, api_key_env will be ignored

# Redis Configuration (used by the internal Redis server managed by supervisord)
redis:
  host: "localhost"
  port: 6379
  db: 0
  password: ""
  # ... other redis options ...

# Rate Limiting Configuration
rate_limiting:
  enabled: True
  default_limit: "1000/minute"
  trusted_proxies: []
  storage_uri: "memory://"  # Use "redis://localhost:6379" if you need persistent/shared limits

# Security Configuration
security:
  enabled: false  # Master toggle for security features
  jwt_enabled: false  # Enable JWT authentication (requires security.enabled=true)
  https_redirect: false  # Force HTTPS (requires security.enabled=true)
  trusted_hosts: ["*"]  # Allowed hosts (use specific domains in production)
  headers:  # Security headers (applied if security.enabled=true)
    x_content_type_options: "nosniff"
    x_frame_options: "DENY"
    content_security_policy: "default-src 'self'"
    strict_transport_security: "max-age=63072000; includeSubDomains"

# Crawler Configuration
crawler:
  memory_threshold_percent: 95.0
  rate_limiter:
    base_delay: [1.0, 2.0]  # Min/max delay between requests in seconds for the dispatcher
  timeouts:
    stream_init: 30.0  # Timeout for stream initialization
    batch_process: 300.0  # Timeout for non-streaming /crawl processing

# Logging Configuration
logging:
  level: "INFO"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

# Observability Configuration
observability:
  prometheus:
    enabled: True
    endpoint: "/metrics"
  health_check:
    endpoint: "/health"
```

*(The JWT Authentication section remains the same; just note the default port is now 11235 for requests.)*
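
If you enable `security.jwt_enabled`, the SDK flow shown earlier applies. A rough sketch (the email value is illustrative; for raw HTTP requests the token goes in the `Authorization: Bearer` header, as hinted in the commented-out headers above):

```python
import asyncio
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig

async def main():
    async with Crawl4aiDockerClient(base_url="http://localhost:11235") as client:
        # With security.enabled and jwt_enabled set to true, authenticate first
        # (same call shown commented out in the SDK example above).
        await client.authenticate("user@example.com")
        results = await client.crawl(
            ["https://httpbin.org/html"],
            browser_config=BrowserConfig(headless=True),
            crawler_config=CrawlerRunConfig(),
        )
        print("crawl succeeded:", bool(results and results.success))

asyncio.run(main())
```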

*(Configuration Tips and Best Practices remain the same.)*

### Customizing Your Configuration

You can override the default `config.yml`.

#### Method 1: Modify Before Build

1. Edit the `deploy/docker/config.yml` file in your local repository clone.
2. Build the image using `docker buildx` or `docker compose --profile local-... up --build`. The modified file will be copied into the image.

#### Method 2: Runtime Mount (Recommended for Custom Deploys)

1. Create your custom configuration file, e.g., `my-custom-config.yml`, locally. Ensure it contains all necessary sections.
2. Mount it when running the container:

   * **Using `docker run`:**

     ```bash
     # Assumes my-custom-config.yml is in the current directory
     docker run -d -p 11235:11235 \
       --name crawl4ai-custom-config \
       --env-file .llm.env \
       --shm-size=1g \
       -v $(pwd)/my-custom-config.yml:/app/config.yml \
       unclecode/crawl4ai:latest # Or your specific tag
     ```

   * **Using `docker-compose.yml`:** Add a `volumes` section to the service definition:

     ```yaml
     services:
       crawl4ai-hub-amd64: # Or your chosen service
         image: unclecode/crawl4ai:latest
         profiles: ["hub-amd64"]
         <<: *base-config
         volumes:
           # Mount local custom config over the default one in the container
           - ./my-custom-config.yml:/app/config.yml
           # Keep the shared memory volume from base-config
           - /dev/shm:/dev/shm
     ```

   *(Note: Ensure `my-custom-config.yml` is in the same directory as `docker-compose.yml`.)*

> 💡 When mounting, your custom file *completely replaces* the default one. Ensure it's a valid and complete configuration.
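
Because the mounted file fully replaces the default, it can help to check that your custom file at least parses and keeps the top-level sections before mounting it. A small local sketch (run on your machine, not in the container; requires PyYAML, and the section names come from the reference config above):

```python
# Quick local sanity check before mounting my-custom-config.yml over /app/config.yml.
# This only validates YAML syntax and the presence of the top-level sections
# shown in the reference config; it does not validate individual values.
import sys
import yaml  # pip install pyyaml

REQUIRED_SECTIONS = {
    "app", "llm", "redis", "rate_limiting",
    "security", "crawler", "logging", "observability",
}

with open("my-custom-config.yml") as f:
    cfg = yaml.safe_load(f)

missing = REQUIRED_SECTIONS - set(cfg or {})
if missing:
    sys.exit(f"Missing sections: {sorted(missing)}")
print("Config looks structurally complete.")
```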

### Configuration Recommendations

1. **Security First** 🔒
   - Always enable security in production
   - Use specific trusted_hosts instead of wildcards
   - Set up proper rate limiting to protect your server
   - Consider your environment before enabling HTTPS redirect

2. **Resource Management** 💻
   - Adjust memory_threshold_percent based on available RAM
   - Set timeouts according to your content size and network conditions
   - Use Redis for rate limiting in multi-container setups

3. **Monitoring** 📊
   - Enable Prometheus if you need metrics
   - Use DEBUG logging in development and INFO in production
   - Regular health check monitoring is crucial

4. **Performance Tuning** ⚡
   - Start with conservative rate limiter delays
   - Increase the batch_process timeout for large content
   - Adjust the stream_init timeout based on initial response times

## Getting Help

We're here to help you succeed with Crawl4AI! Here's how to get support:

- 📖 Check our [full documentation](https://docs.crawl4ai.com)
- 🐛 Found a bug? [Open an issue](https://github.com/unclecode/crawl4ai/issues)
- 💬 Join our [Discord community](https://discord.gg/crawl4ai)
- ⭐ Star us on GitHub to show support!

## Summary

In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment:

- Building and running the Docker container
- Configuring the environment
- Making API requests with proper typing
- Using the Python SDK
- Monitoring your deployment

Remember, the examples in the `examples` folder are your friends - they show real-world usage patterns that you can adapt for your needs.

Keep exploring, and don't hesitate to reach out if you need help! We're building something amazing together. 🚀

Happy crawling! 🕷️

*(One file's diff is suppressed because it is too large.)*

@@ -3,9 +3,9 @@ app:
   title: "Crawl4AI API"
   version: "1.0.0"
   host: "0.0.0.0"
-  port: 8020
+  port: 11235
   reload: False
-  workers: 4
+  workers: 1
   timeout_keep_alive: 300

 # Default LLM Configuration

@@ -1,5 +1,5 @@
-fastapi==0.115.12
+fastapi>=0.115.12
-uvicorn==0.34.2
+uvicorn>=0.34.2
 gunicorn>=23.0.0
 slowapi==0.1.9
 prometheus-fastapi-instrumentator>=7.1.0
@@ -8,8 +8,9 @@ jwt>=1.3.1
 dnspython>=2.7.0
 email-validator==2.2.0
 sse-starlette==2.2.1
-pydantic==2.11
+pydantic>=2.11
 rank-bm25==0.2.2
 anyio==4.9.0
 PyJWT==2.10.1
+mcp>=1.6.0
+websockets>=15.0.1
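
If you maintain a custom image or environment, a quick way to confirm the new MCP-related dependencies resolved after a rebuild (a sketch; the distribution names are taken from the requirement lines above):

```python
# Verify the new server dependencies are installed and report their versions.
from importlib.metadata import version, PackageNotFoundError

for dist, minimum in [("mcp", "1.6.0"), ("websockets", "15.0.1")]:
    try:
        print(f"{dist} {version(dist)} (want >= {minimum})")
    except PackageNotFoundError:
        print(f"{dist} is NOT installed")
```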

@@ -629,6 +629,7 @@ async def get_context(

 # attach MCP layer (adds /mcp/ws, /mcp/sse, /mcp/schema)
+print(f"MCP server running on {config['app']['host']}:{config['app']['port']}")
 attach_mcp(
     app,
     base_url=f"http://{config['app']['host']}:{config['app']['port']}"
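
A quick way to confirm the MCP layer is attached once the server is up (a sketch; it assumes the default port and that `/mcp/schema`, listed in the comment above, answers a plain GET):

```python
import requests

# attach_mcp() registers /mcp/ws, /mcp/sse and /mcp/schema on the FastAPI app.
resp = requests.get("http://localhost:11235/mcp/schema", timeout=5)
print(resp.status_code)   # expect 200 once the MCP layer is mounted
print(resp.text[:200])    # first part of the advertised tool schema
```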
@@ -536,10 +536,14 @@
 const endpointMap = {
     crawl: '/crawl',
+};
+
+/*const endpointMap = {
+    crawl: '/crawl',
     crawl_stream: '/crawl/stream',
     md: '/md',
     llm: '/llm'
-};
+};*/

 const api = endpointMap[endpoint];
 const payload = {

@@ -14,7 +14,7 @@ stderr_logfile=/dev/stderr ; Redirect redis stderr to container stderr
 stderr_logfile_maxbytes=0

 [program:gunicorn]
-command=/usr/local/bin/gunicorn --bind 0.0.0.0:11235 --workers 2 --threads 2 --timeout 120 --graceful-timeout 30 --keep-alive 60 --log-level info --worker-class uvicorn.workers.UvicornWorker server:app
+command=/usr/local/bin/gunicorn --bind 0.0.0.0:11235 --workers 1 --threads 4 --timeout 1800 --graceful-timeout 30 --keep-alive 300 --log-level info --worker-class uvicorn.workers.UvicornWorker server:app
 directory=/app ; Working directory for the app
 user=appuser ; Run gunicorn as our non-root user
 autorestart=true

@@ -1,19 +1,11 @@
-# docker-compose.yml
+version: '3.8'

-# Base configuration anchor for reusability
+# Shared configuration for all environments
 x-base-config: &base-config
   ports:
-    # Map host port 11235 to container port 11235 (where Gunicorn will listen)
-    - "11235:11235"
-    # - "8080:8080" # Uncomment if needed
+    - "11235:11235" # Gunicorn port

-  # Load API keys primarily from .llm.env file
-  # Create .llm.env in the root directory .llm.env.example
   env_file:
-    - .llm.env
+    - .llm.env # API keys (create from .llm.env.example)

-  # Define environment variables, allowing overrides from host environment
-  # Syntax ${VAR:-} uses host env var 'VAR' if set, otherwise uses value from .llm.env
   environment:
     - OPENAI_API_KEY=${OPENAI_API_KEY:-}
     - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
@@ -22,10 +14,8 @@ x-base-config: &base-config
     - TOGETHER_API_KEY=${TOGETHER_API_KEY:-}
     - MISTRAL_API_KEY=${MISTRAL_API_KEY:-}
     - GEMINI_API_TOKEN=${GEMINI_API_TOKEN:-}

   volumes:
-    # Mount /dev/shm for Chromium/Playwright performance
-    - /dev/shm:/dev/shm
+    - /dev/shm:/dev/shm # Chromium performance

   deploy:
     resources:
       limits:
@@ -34,47 +24,26 @@ x-base-config: &base-config
         memory: 1G
   restart: unless-stopped
   healthcheck:
-    # IMPORTANT: Ensure Gunicorn binds to 11235 in supervisord.conf
     test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
     interval: 30s
     timeout: 10s
     retries: 3
-    start_period: 40s # Give the server time to start
+    start_period: 40s
-  # Run the container as the non-root user defined in the Dockerfile
   user: "appuser"

 services:
-  # --- Local Build Services ---
-  crawl4ai-local-amd64:
+  crawl4ai:
+    # 1. Default: Pull multi-platform test image from Docker Hub
+    # 2. Override with local image via: IMAGE=local-test docker compose up
+    image: ${IMAGE:-unclecode/crawl4ai:${TAG:-latest}}
+
+    # Local build config (used with --build)
     build:
-      context: . # Build context is the root directory
-      dockerfile: Dockerfile # Dockerfile is in the root directory
+      context: .
+      dockerfile: Dockerfile
       args:
         INSTALL_TYPE: ${INSTALL_TYPE:-default}
         ENABLE_GPU: ${ENABLE_GPU:-false}
-        # PYTHON_VERSION arg is omitted as it's fixed by 'FROM python:3.10-slim' in Dockerfile
-    platform: linux/amd64
-    profiles: ["local-amd64"]
-    <<: *base-config # Inherit base configuration
-
-  crawl4ai-local-arm64:
-    build:
-      context: . # Build context is the root directory
-      dockerfile: Dockerfile # Dockerfile is in the root directory
-      args:
-        INSTALL_TYPE: ${INSTALL_TYPE:-default}
-        ENABLE_GPU: ${ENABLE_GPU:-false}
-    platform: linux/arm64
-    profiles: ["local-arm64"]
-    <<: *base-config
-
-  # --- Docker Hub Image Services ---
-  crawl4ai-hub-amd64:
-    image: unclecode/crawl4ai:${VERSION:-latest}-amd64
-    profiles: ["hub-amd64"]
-    <<: *base-config
-
-  crawl4ai-hub-arm64:
-    image: unclecode/crawl4ai:${VERSION:-latest}-arm64
-    profiles: ["hub-arm64"]
+    # Inherit shared config
     <<: *base-config

docs/md_v2/blog/releases/0.6.0.md (new file, 51 lines)

@@ -0,0 +1,51 @@

# Crawl4AI 0.6.0

*Release date: 2025-04-22*

0.6.0 is the **biggest jump** since the 0.5 series, packing a smarter browser core, pool-based crawlers, and a ton of DX candy. Expect faster runs, lower RAM burn, and richer diagnostics.

---

## 🚀 Key upgrades

| Area | What changed |
|------|--------------|
| **Browser** | New browser management with pooling, page pre-warming, and geolocation + locale + timezone switches |
| **Crawler** | Console and network log capture, MHTML snapshots, safer `get_page` API |
| **Server & API** | **Crawler Pool Manager** endpoint, MCP socket + SSE support |
| **Docs** | v2 layout, floating Ask-AI helper, GitHub stats badge, copy-code buttons, Docker API demo |
| **Tests** | Memory + load benchmarks, 90+ new cases covering MCP and Docker |

---

## ⚠️ Breaking changes

1. **`get_page` signature** – now returns `(html, metadata)` instead of plain HTML (see the sketch below).
2. **Docker** – new Chromium base layer; rebuild your images.
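
A rough sketch of the call-site change for code that consumed the old single-value return. Only the new `(html, metadata)` return shape is taken from this note; the argument list of `get_page` is left unspecified:

```python
# Sketch: adapting a caller of an overridden AsyncPlaywrightCrawlerStrategy.get_page.
# The exact parameters are not shown here; only the new tuple return is assumed.

async def fetch_html(strategy, *args, **kwargs) -> str:
    html, metadata = await strategy.get_page(*args, **kwargs)  # 0.6.0: returns a tuple
    # metadata now carries the extra page info; old callers only need html
    return html
```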

---

## How to upgrade

```bash
pip install -U crawl4ai==0.6.0
```

---

## Full changelog

The diff between `main` and `next` spans **36k insertions and 4.9k deletions** over 121 files. Read the [compare view](https://github.com/unclecode/crawl4ai/compare/0.5.0.post8...0.6.0) or see `CHANGELOG.md` for the granular list.

---

## Upgrade tips

* Using the Docker API? Pull `unclecode/crawl4ai:0.6.0`; the new args are documented in `/deploy/docker/README.md`.
* Stress-test your stack with `tests/memory/run_benchmark.py` before a production rollout.
* Markdown generators are renamed but aliased; update when convenient, and warnings will remind you.

---

Happy crawling, ping `@unclecode` on X for questions or memes.

@@ -8,7 +8,7 @@ dynamic = ["version"]
 description = "🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & scraper"
 readme = "README.md"
 requires-python = ">=3.9"
-license = {text = "MIT"}
+license = {text = "Apache-2.0"}
 authors = [
     {name = "Unclecode", email = "unclecode@kidocode.com"}
 ]

@@ -101,19 +101,19 @@ async def test_context(s: ClientSession):


 async def main() -> None:
-    async with websocket_client("ws://localhost:8020/mcp/ws") as (r, w):
+    async with websocket_client("ws://localhost:11235/mcp/ws") as (r, w):
         async with ClientSession(r, w) as s:
             await s.initialize()  # handshake
             tools = (await s.list_tools()).tools
             print("tools:", [t.name for t in tools])

             # await test_list()
-            # await test_crawl(s)
+            await test_crawl(s)
-            # await test_md(s)
+            await test_md(s)
-            # await test_screenshot(s)
+            await test_screenshot(s)
-            # await test_pdf(s)
+            await test_pdf(s)
-            # await test_execute_js(s)
+            await test_execute_js(s)
-            # await test_html(s)
+            await test_html(s)
             await test_context(s)

 anyio.run(main)
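
For a self-contained version of this smoke test against the new port, something along these lines should work. A sketch: the import paths for the MCP client helpers are assumptions based on the `mcp>=1.6.0` dependency added above and may differ between SDK versions:

```python
# Minimal standalone sketch: list the tools exposed at /mcp/ws on the new port.
# Import paths are assumptions; adjust them if your installed mcp SDK version
# organises its client helpers differently.
import anyio
from mcp import ClientSession
from mcp.client.websocket import websocket_client

async def main() -> None:
    async with websocket_client("ws://localhost:11235/mcp/ws") as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()                  # handshake
            tools = (await session.list_tools()).tools  # tools advertised by the server
            print("tools:", [t.name for t in tools])

anyio.run(main)
```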