Crawl4AI CLI Guide

Table of Contents

  • Basic Usage
  • Quick Example of Advanced Usage
  • Configuration
  • Advanced Features
  • Output Formats
  • Complete Examples
  • Best Practices & Tips
  • Recap

Basic Usage

The Crawl4AI CLI (crwl) provides a simple interface to the Crawl4AI library:

# Basic crawling
crwl https://example.com

# Get markdown output
crwl https://example.com -o markdown

# Verbose JSON output with cache bypass
crwl https://example.com -o json -v --bypass-cache

# See usage examples
crwl --example

Quick Example of Advanced Usage

If you clone the repository and run the following command, you will get the page content as JSON, structured according to the referenced JSON-CSS schema:

crwl "https://www.infoq.com/ai-ml-data-eng/" -e docs/examples/cli/extract_css.yml -s docs/examples/cli/css_schema.json -o json
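
The result is a JSON list of objects whose keys match the field names in the referenced schema. An illustrative shape only (placeholder values, not real output):

[
  {
    "title": "Example article title",
    "link": "https://www.infoq.com/example-article"
  }
]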

Configuration

Browser Configuration

Browser settings can be configured via YAML file or command line parameters:

# browser.yml
headless: true
viewport_width: 1280
user_agent_mode: "random"
verbose: true
ignore_https_errors: true

# Using config file
crwl https://example.com -B browser.yml

# Using direct parameters
crwl https://example.com -b "headless=true,viewport_width=1280,user_agent_mode=random"

Crawler Configuration

Control crawling behavior:

# crawler.yml
cache_mode: "bypass"
wait_until: "networkidle"
page_timeout: 30000
delay_before_return_html: 0.5
word_count_threshold: 100
scan_full_page: true
scroll_delay: 0.3
process_iframes: false
remove_overlay_elements: true
magic: true
verbose: true

# Using config file
crwl https://example.com -C crawler.yml

# Using direct parameters
crwl https://example.com -c "css_selector=#main,delay_before_return_html=2,scan_full_page=true"

Extraction Configuration

Two types of extraction are supported:

  1. CSS/XPath-based extraction:
# extract_css.yml
type: "json-css"
params:
  verbose: true

// css_schema.json
{
  "name": "ArticleExtractor",
  "baseSelector": ".article",
  "fields": [
    {
      "name": "title",
      "selector": "h1.title",
      "type": "text"
    },
    {
      "name": "link",
      "selector": "a.read-more",
      "type": "attribute",
      "attribute": "href"
    }
  ]
}

  2. LLM-based extraction:
# extract_llm.yml
type: "llm"
provider: "openai/gpt-4"
instruction: "Extract all articles with their titles and links"
api_token: "your-token"
params:
  temperature: 0.3
  max_tokens: 1000

// llm_schema.json
{
  "title": "Article",
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "The title of the article"
    },
    "link": {
      "type": "string",
      "description": "URL to the full article"
    }
  }
}
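
To make the CSS schema in item 1 concrete: given HTML like the following (markup is illustrative), extraction yields one object per .article element, with the h1.title text and the read-more link's href:

<div class="article">
  <h1 class="title">Example article title</h1>
  <a class="read-more" href="/articles/example">Read more</a>
</div>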

Advanced Features

LLM Q&A

Ask questions about crawled content:

# Simple question
crwl https://example.com -q "What is the main topic discussed?"

# View content then ask questions
crwl https://example.com -o markdown  # See content first
crwl https://example.com -q "Summarize the key points"
crwl https://example.com -q "What are the conclusions?"

# Combined with advanced crawling
crwl https://example.com \
    -B browser.yml \
    -c "css_selector=article,scan_full_page=true" \
    -q "What are the pros and cons mentioned?"

First-time setup:

  • Prompts for LLM provider and API token
  • Saves configuration in ~/.crawl4ai/global.yml
  • Supports various providers (openai/gpt-4, anthropic/claude-3-sonnet, etc.)
  • For Ollama, no API token is required
  • See LiteLLM Providers for the full list of supported providers
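
Providers that follow LiteLLM's conventions typically read their standard environment variables, so you can export the token instead of storing it in global.yml (OPENAI_API_KEY is shown here as an example; use the variable for your provider):

# Supply the API token via the environment instead of the config file
export OPENAI_API_KEY="your-token"
crwl https://example.com -q "What is the main topic discussed?"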

Structured Data Extraction

Extract structured data using CSS selectors:

crwl https://example.com \
    -e extract_css.yml \
    -s css_schema.json \
    -o json

Or using LLM-based extraction:

crwl https://example.com \
    -e extract_llm.yml \
    -s llm_schema.json \
    -o json

Content Filtering

Filter content for relevance:

# filter_bm25.yml
type: "bm25"
query: "target content"
threshold: 1.0

# filter_pruning.yml
type: "pruning"
query: "focus topic"
threshold: 0.48

crwl https://example.com -f filter_bm25.yml -o markdown-fit
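
The pruning filter is applied the same way; only the config file changes:

crwl https://example.com -f filter_pruning.yml -o markdown-fit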

Output Formats

  • all - Full crawl result including metadata
  • json - Extracted structured data (when using extraction)
  • markdown / md - Raw markdown output
  • markdown-fit / md-fit - Filtered markdown for better readability
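
For example, to inspect everything the crawler returns, including metadata:

crwl https://example.com -o all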

Complete Examples

  1. Basic Extraction:
crwl https://example.com \
    -B browser.yml \
    -C crawler.yml \
    -o json

  2. Structured Data Extraction:
crwl https://example.com \
    -e extract_css.yml \
    -s css_schema.json \
    -o json \
    -v

  3. LLM Extraction with Filtering:
crwl https://example.com \
    -B browser.yml \
    -e extract_llm.yml \
    -s llm_schema.json \
    -f filter_bm25.yml \
    -o json

  4. Interactive Q&A:
# First crawl and view
crwl https://example.com -o markdown

# Then ask questions
crwl https://example.com -q "What are the main points?"
crwl https://example.com -q "Summarize the conclusions"

Best Practices & Tips

  1. Configuration Management:

    • Keep common configurations in YAML files
    • Use CLI parameters for quick overrides
    • Store sensitive data (API tokens) in ~/.crawl4ai/global.yml
  2. Performance Optimization:

    • Use --bypass-cache for fresh content
    • Enable scan_full_page for infinite scroll pages
    • Adjust delay_before_return_html for dynamic content (see the combined example after this list)
  3. Content Extraction:

    • Use CSS extraction for structured content
    • Use LLM extraction for unstructured content
    • Combine with filters for focused results
  4. Q&A Workflow:

    • View content first with -o markdown
    • Ask specific questions
    • Broaden or narrow the crawled context with appropriate css_selector values
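
Combining the performance options from tip 2, a fresh crawl of a dynamically loaded page might look like this (parameter values are illustrative):

# Fresh content from an infinite-scroll page
crwl https://example.com \
    --bypass-cache \
    -c "scan_full_page=true,scroll_delay=0.3,delay_before_return_html=1"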

Recap

The Crawl4AI CLI provides:

  • Flexible configuration via files and parameters
  • Multiple extraction strategies (CSS, XPath, LLM)
  • Content filtering and optimization
  • Interactive Q&A capabilities
  • Various output formats