# Deploying with Docker (Quickstart)
> **⚠️ WARNING: Experimental & Legacy**
> Our current Docker solution for Crawl4AI is **not stable** and **will be discontinued** soon. A more robust Docker/orchestration strategy is in development, with a planned stable release in **2025**. If you choose to use this Docker approach, please proceed cautiously and avoid production deployment without thorough testing.
>
> Crawl4AI is **open-source** and under **active development**. We appreciate your interest, but strongly recommend you make **informed decisions** if you need a production environment. Expect breaking changes in future versions.
---
## 1. Installation & Environment Setup (Outside Docker)
Before we jump into Docker usage, here's a quick reminder of how to install Crawl4AI locally (from the legacy docs). For **non-Docker** deployments or local dev:
```bash
# 1. Install the package
pip install crawl4ai
crawl4ai-setup
# 2. Install playwright dependencies (all browsers or specific ones)
playwright install --with-deps
# or
playwright install --with-deps chromium
# or
playwright install --with-deps chrome
```
**Testing** your installation:
```bash
# Visible browser test
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); page = browser.new_page(); page.goto('https://example.com'); input('Press Enter to close...')"
```
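If you also want to verify the crawler itself (not just Playwright), a minimal async check along these lines should work. This is a sketch using the standard `AsyncWebCrawler` API; adjust it to the version you have installed:
```python
# Minimal sanity check for a local (non-Docker) install.
# Assumes the standard AsyncWebCrawler API; details may differ between versions.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        # Print the first few hundred characters of the generated markdown
        print((result.markdown or "")[:300])

asyncio.run(main())
```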
---
## 2. Docker Overview
This Docker approach allows you to run a **Crawl4AI** service via REST API. You can:
1. **POST** a request (e.g., URLs, extraction config)
2. **Retrieve** your results from a task-based endpoint
> **Note**: This Docker solution is **temporary**. We plan a more robust, stable Docker approach in the near future. For now, you can experiment, but do not rely on it for mission-critical production.
---
## 3. Pulling and Running the Image
### Basic Run
```bash
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```
This starts a container on port `11235`. You can `POST` requests to `http://localhost:11235/crawl`.
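For a quick smoke test from the shell, you can submit a crawl with `curl` (the response shape may vary slightly between image versions, but it includes a `task_id`):
```bash
# Submit a crawl task; the response JSON contains a task_id
curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{"urls": "https://example.com", "priority": 10}'

# Check the task status (replace <task_id> with the value from the response above)
curl http://localhost:11235/task/<task_id>
```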
### Using an API Token
```bash
docker run -p 11235:11235 \
  -e CRAWL4AI_API_TOKEN=your_secret_token \
  unclecode/crawl4ai:basic
```
If **`CRAWL4AI_API_TOKEN`** is set, you must include `Authorization: Bearer <token>` in your requests. Otherwise, the service is open to anyone.
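The same `curl` request with a token looks like this:
```bash
curl -X POST http://localhost:11235/crawl \
  -H "Authorization: Bearer your_secret_token" \
  -H "Content-Type: application/json" \
  -d '{"urls": "https://example.com", "priority": 10}'
```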
---
## 4. Docker Compose for Multi-Container Workflows
You can also use **Docker Compose** to manage multiple services. Below is an **experimental** snippet:
```yaml
version: '3.8'

services:
  crawl4ai:
    image: unclecode/crawl4ai:basic
    ports:
      - "11235:11235"
    environment:
      - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
      # Additional env variables as needed
    volumes:
      - /dev/shm:/dev/shm
```
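The `${VAR:-}` defaults let Compose read values from your shell environment or from a `.env` file placed next to `docker-compose.yml`. A minimal `.env` sketch (placeholder values, keep this file out of version control):
```bash
# .env — placeholder values only
CRAWL4AI_API_TOKEN=your_secret_token
OPENAI_API_KEY=sk-...
```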
To run:
```bash
docker-compose up -d
```
And to stop:
```bash
docker-compose down
```
**Troubleshooting**:
- **Check logs**: `docker-compose logs -f crawl4ai`
- **Remove orphan containers**: `docker-compose down --remove-orphans`
- **Remove networks**: `docker network rm <network_name>`
---
## 5. Making Requests to the Container
**Base URL**: `http://localhost:11235`
### Example: Basic Crawl
```python
import requests
task_request = {
    "urls": "https://example.com",
    "priority": 10
}
response = requests.post("http://localhost:11235/crawl", json=task_request)
task_id = response.json()["task_id"]

# Poll for status
status_url = f"http://localhost:11235/task/{task_id}"
status = requests.get(status_url).json()
print(status)
```
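A single GET right after submission will usually show a pending task, so in practice you poll until it finishes. A small helper along these lines works; the `"completed"`/`"failed"` status values follow the example in section 7, and the helper name itself is just illustrative:
```python
import time
import requests

def wait_for_task(task_id: str, base_url: str = "http://localhost:11235",
                  timeout: float = 60.0, interval: float = 2.0) -> dict:
    """Poll the task endpoint until the task reports completion or failure."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = requests.get(f"{base_url}/task/{task_id}").json()
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"Task {task_id} did not finish within {timeout} seconds")

# Usage, continuing the snippet above:
# status = wait_for_task(task_id)
# print(status["status"])
```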
If you used an API token, do:
```python
headers = {"Authorization": "Bearer your_secret_token"}
response = requests.post(
    "http://localhost:11235/crawl",
    headers=headers,
    json=task_request
)
```
---
## 6. Docker + New Crawler Config Approach
### Using `BrowserConfig` & `CrawlerRunConfig` in Requests
The Docker-based solution can accept **crawler configurations** in the request JSON. The legacy docs may show direct parameters, but we recommend embedding them in `crawler_params` or `extra` to align with the new `BrowserConfig` / `CrawlerRunConfig` approach. For example:
```python
import requests
request_data = {
    "urls": "https://www.nbcnews.com/business",
    "crawler_params": {
        "headless": True,
        "browser_type": "chromium",
        "verbose": True,
        "page_timeout": 30000,
        # ... any other BrowserConfig-like fields
    },
    "extra": {
        "word_count_threshold": 50,
        "bypass_cache": True
    }
}
response = requests.post("http://localhost:11235/crawl", json=request_data)
task_id = response.json()["task_id"]
```
This is the recommended style if you want to replicate `BrowserConfig` and `CrawlerRunConfig` settings in Docker mode.
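Once the task completes, the result can be read from the same task endpoint. A short continuation of the snippet above; note that only `extracted_content` appears in this doc's examples (section 7), so treat other result field names as version-dependent:
```python
import requests

# Check the task once (loop as in the section 5 helper for longer crawls)
status = requests.get(f"http://localhost:11235/task/{task_id}").json()
if status["status"] == "completed":
    # Inspect which fields this image version actually returns
    print(list(status["result"].keys()))
```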
---
## 7. Example: JSON Extraction in Docker
```python
import requests
import json
# Define a schema for CSS extraction
schema = {
    "name": "Coinbase Crypto Prices",
    "baseSelector": ".cds-tableRow-t45thuk",
    "fields": [
        {
            "name": "crypto",
            "selector": "td:nth-child(1) h2",
            "type": "text"
        },
        {
            "name": "symbol",
            "selector": "td:nth-child(1) p",
            "type": "text"
        },
        {
            "name": "price",
            "selector": "td:nth-child(2)",
            "type": "text"
        }
    ]
}

request_data = {
    "urls": "https://www.coinbase.com/explore",
    "extraction_config": {
        "type": "json_css",
        "params": {"schema": schema}
    },
    "crawler_params": {
        "headless": True,
        "verbose": True
    }
}

resp = requests.post("http://localhost:11235/crawl", json=request_data)
task_id = resp.json()["task_id"]

# Poll for status
status = requests.get(f"http://localhost:11235/task/{task_id}").json()
if status["status"] == "completed":
    extracted_content = status["result"]["extracted_content"]
    data = json.loads(extracted_content)
    print("Extracted:", len(data), "entries")
else:
    print("Task still in progress or failed.")
```
---
## 8. Why This Docker Is Temporary
**We are building a new, stable approach**:
- The current Docker container is **experimental** and might break with future releases.
- We plan a stable release in **2025** with a more robust API, versioning, and orchestration.
- If you use this Docker in production, do so at your own risk and be prepared for **breaking changes**.
**Community**: Because Crawl4AI is open-source, you can track progress or contribute to the new Docker approach. Check the [GitHub repository](https://github.com/unclecode/crawl4ai) for roadmaps and updates.
---
## 9. Known Limitations & Next Steps
1. **Not Production-Ready**: This Docker approach lacks extensive security, logging, or advanced config for large-scale usage.
2. **Ongoing Changes**: Expect API changes. The official stable version is targeted for **2025**.
3. **LLM Integrations**: Docker images are big if you want GPU or multiple model providers. We might unify these in a future build.
4. **Performance**: For concurrency or large crawls, you may need to tune resources (memory, CPU) and watch out for ephemeral storage.
5. **Version Pinning**: If you must deploy, pin your Docker tag to a specific version (e.g., `:basic-0.3.7`) to avoid surprise updates (see the example below).
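For example, a pinned pull and run; the tag shown is the one mentioned above and is only illustrative, so check Docker Hub for the tags that actually exist:
```bash
# Pin to an explicit tag instead of :basic so upgrades are deliberate
docker pull unclecode/crawl4ai:basic-0.3.7
docker run -p 11235:11235 unclecode/crawl4ai:basic-0.3.7
```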
### Next Steps
- **Watch the Repository**: For announcements on the new Docker architecture.
- **Experiment**: Use this Docker for test or dev environments, but keep an eye out for breakage.
- **Contribute**: If you have ideas or improvements, open a PR or discussion.
- **Check Roadmaps**: See our [GitHub issues](https://github.com/unclecode/crawl4ai/issues) or [Roadmap doc](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md) to find upcoming releases.
---
## 10. Summary
**Deploying with Docker** can simplify running Crawl4AI as a service. However:
- **This Docker** approach is **legacy** and subject to removal/overhaul.
- For production, please weigh the risks carefully.
- The detailed, fully supported Docker approach is planned for **2025**.
We hope this guide helps you do a quick spin-up of Crawl4AI in Docker for **experimental** usage. Stay tuned for the fully-supported version!