- Debug
- Refactor code for new version
This commit is contained in:
unclecode 2024-05-16 17:31:44 +08:00
parent f6e59157bf
commit 5b80be956d
23 changed files with 3116 additions and 1019 deletions

552
README.md
View File

@ -8,16 +8,90 @@
Crawl4AI is a powerful, free web crawling service designed to extract useful information from web pages and make it accessible for large language models (LLMs) and AI applications. 🆓🌐
## 🚧 Work in Progress 👷‍♂️
## Recent Changes
- 🔧 Separate Crawl and Extract Semantic Chunk: Enhancing efficiency in large-scale tasks.
- 🔍 Colab Integration: Exploring integration with Google Colab for easy experimentation.
- 🎯 XPath and CSS Selector Support: Adding support for selective retrieval of specific elements.
- 📷 Image Captioning: Incorporating image captioning capabilities to extract descriptions from images.
- 💾 Embedding Vector Data: Generate and store embedding data for each crawled website.
- 🔍 Semantic Search Engine: Building a semantic search engine that fetches content, performs vector search similarity, and generates labeled chunk data based on user queries and URLs.
- 🚀 10x faster!!
- 📜 Execute custom JavaScript before crawling!
- 🤝 Colab friendly!
- 📚 Chunking strategies: topic-based, regex, sentence, and more!
- 🧠 Extraction strategies: cosine clustering, LLM, and more!
- 🎯 CSS selector support
- 📝 Pass instructions/keywords to refine extraction
## Power and Simplicity of Crawl4AI 🚀
Crawl4AI makes even complex web crawling tasks simple and intuitive. Below is an example of how you can execute JavaScript, filter data using keywords, and use a CSS selector to extract specific content—all in one go!
**Example Task:**
1. Execute custom JavaScript to click a "Load More" button.
2. Filter the data to include only content related to "technology".
3. Use a CSS selector to extract only paragraphs (`<p>` tags).
**Example Code:**
```python
# Import necessary modules
import os
from crawl4ai import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
from crawl4ai.crawler_strategy import *
# Define the JavaScript code to click the "Load More" button
js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
# Define the crawling strategy
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
# Create the WebCrawler instance with the defined strategy
crawler = WebCrawler(crawler_strategy=crawler_strategy)
# Run the crawler with keyword filtering and CSS selector
result = crawler.run(
url="https://www.example.com",
extraction_strategy=CosineStrategy(
semantic_filter="technology",
),
)
# Run the crawler with LLM extraction strategy
result = crawler.run(
url="https://www.example.com",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="Extract only content related to technology"
),
css_selector="p"
)
# Display the extracted result
print(result)
```
With Crawl4AI, you can perform advanced web crawling and data extraction tasks with just a few lines of code. This example demonstrates how you can harness the power of Crawl4AI to simplify your workflow and get the data you need efficiently.
---
*Continue reading to learn more about the features, installation process, usage, and more.*
## Table of Contents
1. [Features](#features)
2. [Installation](#installation)
3. [REST API/Local Server](#using-the-local-server-or-rest-api)
4. [Python Library Usage](#usage)
5. [Parameters](#parameters)
6. [Chunking Strategies](#chunking-strategies)
7. [Extraction Strategies](#extraction-strategies)
8. [Contributing](#contributing)
9. [License](#license)
10. [Contact](#contact)
For more details, refer to the [CHANGELOG.md](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md) file.
## Features ✨
@ -26,26 +100,28 @@ For more details, refer to the [CHANGELOG.md](https://github.com/unclecode/crawl
- 🌍 Supports crawling multiple URLs simultaneously
- 🌃 Replaces media tags with their ALT text.
- 🆓 Completely free to use and open-source
## Getting Started 🚀
To get started with Crawl4AI, simply visit our web application at [https://crawl4ai.uccode.io](https://crawl4ai.uccode.io) (Available now!) and enter the URL(s) you want to crawl. The application will process the URLs and provide you with the extracted data in various formats.
- 📜 Execute custom JavaScript before crawling
- 📚 Chunking strategies: topic-based, regex, sentence, and more
- 🧠 Extraction strategies: cosine clustering, LLM, and more
- 🎯 CSS selector support
- 📝 Pass instructions/keywords to refine extraction
## Installation 💻
There are two ways to use Crawl4AI: as a library in your Python projects or as a standalone local server.
### Using Crawl4AI as a Library 📚
There are three ways to use Crawl4AI:
1. As a library (Recommended)
2. As a local server (Docker) or using the REST API
3. As a Google Colab notebook. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk)
To install Crawl4AI as a library, follow these steps:
1. Install the package from GitHub:
```sh
```bash
pip install git+https://github.com/unclecode/crawl4ai.git
```
Alternatively, you can clone the repository and install the package locally:
```sh
2. Alternatively, you can clone the repository and install the package locally:
```bash
virtualenv venv
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
@ -53,133 +129,193 @@ cd crawl4ai
pip install -e .
```
2. Import the necessary modules in your Python script:
```python
from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
import os
crawler = WebCrawler()
crawler.warmup() # IMPORTANT: Warmup the engine before running the first crawl
# Single page crawl
result = crawler.run(
url='https://www.nbcnews.com/business',
word_count_threshold=5, # Minimum word count for an HTML tag to be considered a worthy block
chunking_strategy=RegexChunking(patterns=["\n\n"]), # Default is RegexChunking
extraction_strategy=CosineStrategy(word_count_threshold=20, max_dist=0.2, linkage_method='ward', top_k=3), # Default is CosineStrategy
# extraction_strategy= LLMExtractionStrategy(provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY')),
bypass_cache=False,
extract_blocks=True, # Whether to extract semantic blocks of text from the HTML
css_selector = "", # Eg: "div.article-body"
verbose=True,
include_raw_html=True, # Whether to include the raw HTML content in the response
)
print(result.model_dump())
```
The first run downloads the Chrome driver for Selenium and creates a SQLite database file `crawler_data.db` in the current directory. This file stores the crawled data for future reference.
The response model is a `CrawlResult` object that contains the following attributes:
```python
class CrawlResult(BaseModel):
url: str
html: str
success: bool
cleaned_html: str = None
markdown: str = None
parsed_json: str = None
error_message: str = None
```
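For example, once a crawl has finished you can read these attributes directly (a minimal sketch, assuming the `result` object returned by the example above):
```python
# Minimal sketch: inspect the CrawlResult returned by crawler.run() above
if result.success:
    print(result.markdown[:200])   # first 200 characters of the generated markdown
    print(result.parsed_json)      # extracted blocks, serialized as a JSON string
else:
    print("Crawl failed:", result.error_message)
```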
### Running Crawl4AI as a Local Server 🚀
To run Crawl4AI as a standalone local server, follow these steps:
1. Clone the repository:
```sh
git clone https://github.com/unclecode/crawl4ai.git
```
2. Navigate to the project directory:
```sh
cd crawl4ai
```
3. Open `crawler/config.py` and set your favorite LLM provider and API token.
4. Build the Docker image:
```sh
docker build -t crawl4ai .
```
For Mac users, use the following command instead:
```sh
docker build --platform linux/amd64 -t crawl4ai .
```
5. Run the Docker container:
```sh
3. Use docker to run the local server:
```bash
docker build -t crawl4ai .
# For Mac users
# docker build --platform linux/amd64 -t crawl4ai .
docker run -d -p 8000:80 crawl4ai
```
6. Access the application at `http://localhost:8000`.
For more information about how to run Crawl4AI as a local server, please refer to the [GitHub repository](https://github.com/unclecode/crawl4ai).
- CURL Example:
Set the api_token to your OpenAI API key or any other provider you are using.
```sh
curl -X POST -H "Content-Type: application/json" -d '{"urls":["https://techcrunch.com/"],"provider_model":"openai/gpt-3.5-turbo","api_token":"your_api_token","include_raw_html":true,"forced":false,"extract_blocks_flag":false,"word_count_threshold":10}' http://localhost:8000/crawl
```
Set `extract_blocks_flag` to True to enable the LLM to generate semantically clustered chunks and return them as JSON. Depending on the model and data size, this may take up to 1 minute. Without this setting, it takes between 5 and 20 seconds.
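For example, a minimal sketch of the same request with the flag enabled, sent from Python against the local server (field names mirror the curl example above; `your_api_token` is a placeholder):
```python
import requests

# Sketch: same payload as the curl example above, with extract_blocks_flag enabled
payload = {
    "urls": ["https://techcrunch.com/"],
    "provider_model": "openai/gpt-3.5-turbo",
    "api_token": "your_api_token",
    "include_raw_html": True,
    "forced": False,
    "extract_blocks_flag": True,
    "word_count_threshold": 10,
}
response = requests.post("http://localhost:8000/crawl", json=payload)
print(response.json())
```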
## Using the Local Server or REST API 🌐
- Python Example:
```python
import requests
import os
You can also use Crawl4AI through the REST API. This method allows you to send HTTP requests to the Crawl4AI server and receive structured data in response. The base URL for the API is `https://crawl4ai.com/crawl`. If you run the local server, use `http://localhost:8000/crawl` (the port depends on your Docker configuration).
data = {
"urls": [
"https://www.nbcnews.com/business"
],
"provider_model": "groq/llama3-70b-8192",
"include_raw_html": true,
"bypass_cache": false,
"extract_blocks": true,
"word_count_threshold": 10,
"extraction_strategy": "CosineStrategy",
"chunking_strategy": "RegexChunking",
"css_selector": "",
"verbose": true
### Example Usage
To use the REST API, send a POST request to `https://crawl4ai.com/crawl` with the following parameters in the request body.
**Example Request:**
```json
{
"urls": ["https://www.example.com"],
"include_raw_html": false,
"bypass_cache": true,
"word_count_threshold": 5,
"extraction_strategy": "CosineStrategy",
"chunking_strategy": "RegexChunking",
"css_selector": "p",
"verbose": true,
"extraction_strategy_args": {
"semantic_filter": "finance economy and stock market",
"word_count_threshold": 20,
"max_dist": 0.2,
"linkage_method": "ward",
"top_k": 3
},
"chunking_strategy_args": {
"patterns": ["\n\n"]
}
}
response = requests.post("http://crawl4ai.uccode.io/crawl", json=data) # or http://localhost:8000/crawl if you run the server locally
if response.status_code == 200:
result = response.json()["results"][0]
print("Parsed JSON:")
print(result["parsed_json"])
print("\nCleaned HTML:")
print(result["cleaned_html"])
print("\nMarkdown:")
print(result["markdown"])
else:
print("Error:", response.status_code, response.text)
```
This code sends a POST request to the Crawl4AI server (here `http://crawl4ai.uccode.io/crawl`; use `http://localhost:8000/crawl` if you run the server locally), specifying the target URLs and the desired options. The server processes the request and returns the crawled data in JSON format.
**Example Response:**
```json
{
"status": "success",
"data": [
{
"url": "https://www.example.com",
"extracted_content": "...",
"html": "...",
"markdown": "...",
"metadata": {...}
}
]
}
```
The response from the server includes the semantic clusters, cleaned HTML, and markdown representations of the crawled webpage. You can access and use this data in your Python application as needed.
For more information about the available parameters and their descriptions, refer to the [Parameters](#parameters) section.
Make sure to replace `"http://localhost:8000/crawl"` with the appropriate server URL if your Crawl4AI server is running on a different host or port.
Choose the approach that best suits your needs. If you want to integrate Crawl4AI into your existing Python projects, installing it as a library is the way to go. If you prefer to run Crawl4AI as a standalone service and interact with it via API endpoints, running it as a local server using Docker is the recommended approach.
## Python Library Usage 🚀
**Make sure to check `config.py` to set the required environment variables.**
### Quickstart Guide
That's it! You can now integrate Crawl4AI into your Python projects and leverage its web crawling capabilities. 🎉
Create an instance of WebCrawler and call the `warmup()` function.
```python
crawler = WebCrawler()
crawler.warmup()
```
## 📖 Parameters
### Understanding 'bypass_cache' and 'include_raw_html' parameters
First crawl (caches the result):
```python
result = crawler.run(url="https://www.nbcnews.com/business")
```
Second crawl (Force to crawl again):
```python
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
```
💡 Don't forget to set `bypass_cache` to True if you want to try different strategies for the same URL. Otherwise, the cached result will be returned. You can also set `always_by_pass_cache` to True in the constructor to always bypass the cache.
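For example (a minimal sketch of the constructor flag mentioned above):
```python
# Sketch: always bypass the cache for every run on this crawler instance
crawler = WebCrawler(always_by_pass_cache=True)
crawler.warmup()
result = crawler.run(url="https://www.nbcnews.com/business")
```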
Crawl result without raw HTML content:
```python
result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
```
### Adding a chunking strategy: RegexChunking
Using RegexChunking:
```python
result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=RegexChunking(patterns=["\n\n"])
)
```
Using NlpSentenceChunking:
```python
result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=NlpSentenceChunking()
)
```
### Extraction strategy: CosineStrategy
Using CosineStrategy:
```python
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(
semantic_filter="",
word_count_threshold=10,
max_dist=0.2,
linkage_method="ward",
top_k=3
)
)
```
You can set `semantic_filter` to filter relevant documents before clustering. Documents are filtered based on their cosine similarity to the keyword filter embedding.
```python
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(
semantic_filter="finance economy and stock market",
word_count_threshold=10,
max_dist=0.2,
linkage_method="ward",
top_k=3
)
)
```
### Using LLMExtractionStrategy
Without instructions:
```python
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY')
)
)
```
With instructions:
```python
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="I am interested in only financial news"
)
)
```
### Targeted extraction using CSS selector
Extract only H2 tags:
```python
result = crawler.run(
url="https://www.nbcnews.com/business",
css_selector="h2"
)
```
### Passing JavaScript code to click the 'Load More' button
Using JavaScript to click the 'Load More' button:
```python
js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business")
```
## Parameters 📖
| Parameter | Description | Required | Default Value |
|-----------------------|-------------------------------------------------------------------------------------------------------|----------|---------------------|
@ -193,49 +329,134 @@ That's it! You can now integrate Crawl4AI into your Python projects and leverage
| `css_selector` | The CSS selector to target specific parts of the HTML for extraction. | No | `None` |
| `verbose` | Whether to enable verbose logging. | No | `true` |
## 🛠️ Configuration
Crawl4AI allows you to configure various parameters and settings in the `crawler/config.py` file. Here's an example of how you can adjust the parameters:
## Chunking Strategies 📚
### RegexChunking
`RegexChunking` is a text chunking strategy that splits a given text into smaller parts using regular expressions. This is useful for preparing large texts for processing by language models, ensuring they are divided into manageable segments.
**Constructor Parameters:**
- `patterns` (list, optional): A list of regular expression patterns used to split the text. Default is to split by double newlines (`['\n\n']`).
**Example usage:**
```python
import os
from dotenv import load_dotenv
load_dotenv() # Load environment variables from .env file
# Default provider, ONLY used when the extraction strategy is LLMExtractionStrategy
DEFAULT_PROVIDER = "openai/gpt-4-turbo"
# Provider-model dictionary, ONLY used when the extraction strategy is LLMExtractionStrategy
PROVIDER_MODELS = {
"ollama/llama3": "no-token-needed", # Any model from Ollama no need for API token
"groq/llama3-70b-8192": os.getenv("GROQ_API_KEY"),
"groq/llama3-8b-8192": os.getenv("GROQ_API_KEY"),
"openai/gpt-3.5-turbo": os.getenv("OPENAI_API_KEY"),
"openai/gpt-4-turbo": os.getenv("OPENAI_API_KEY"),
"openai/gpt-4o": os.getenv("OPENAI_API_KEY"),
"anthropic/claude-3-haiku-20240307": os.getenv("ANTHROPIC_API_KEY"),
"anthropic/claude-3-opus-20240229": os.getenv("ANTHROPIC_API_KEY"),
"anthropic/claude-3-sonnet-20240229": os.getenv("ANTHROPIC_API_KEY"),
}
# Chunk token threshold
CHUNK_TOKEN_THRESHOLD = 1000
# Threshold for the minimum number of words in an HTML tag to be considered
MIN_WORD_THRESHOLD = 5
chunker = RegexChunking(patterns=[r'\n\n', r'\. '])
chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
```
In the `crawler/config.py` file, you can:
### NlpSentenceChunking
REMEMBER: You only need to set API keys for the providers if you choose LLMExtractionStrategy as the extraction strategy. If you choose CosineStrategy, you don't need to set any API keys.
`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.
- Set the default provider using the `DEFAULT_PROVIDER` variable.
- Add or modify the provider-model dictionary (`PROVIDER_MODELS`) to include your desired providers and their corresponding API keys. Crawl4AI supports various providers such as Groq, OpenAI, Anthropic, and more. You can add any provider supported by LiteLLM, as well as Ollama.
- Adjust the `CHUNK_TOKEN_THRESHOLD` value to control the splitting of web content into chunks for parallel processing. A higher value means fewer chunks and faster processing, but it may cause issues with weaker LLMs during extraction.
- Modify the `MIN_WORD_THRESHOLD` value to set the minimum number of words an HTML tag must contain to be considered a meaningful block.
**Constructor Parameters:**
- `model` (str, optional): The SpaCy model to use for sentence detection. Default is `'en_core_web_sm'`.
Make sure to set the appropriate API keys for each provider in the `PROVIDER_MODELS` dictionary. You can either directly provide the API key or use environment variables to store them securely.
**Example usage:**
```python
chunker = NlpSentenceChunking(model='en_core_web_sm')
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
```
Remember to update the `crawler/config.py` file based on your specific requirements and the providers you want to use with Crawl4AI.
### TopicSegmentationChunking
`TopicSegmentationChunking` uses the TextTiling algorithm to segment a given text into topic-based chunks. This method identifies thematic boundaries in the text.
**Constructor Parameters:**
- `num_keywords` (int, optional): The number of keywords to extract for each topic segment. Default is `3`.
**Example usage:**
```python
chunker = TopicSegmentationChunking(num_keywords=3)
chunks = chunker.chunk("This is a sample text. It will be split into topic-based segments.")
```
### FixedLengthWordChunking
`FixedLengthWordChunking` splits a given text into chunks of fixed length, based on the number of words.
**Constructor Parameters:**
- `chunk_size` (int, optional): The number of words in each chunk. Default is `100`.
**Example usage:**
```python
chunker = FixedLengthWordChunking(chunk_size=100)
chunks = chunker.chunk("This is a sample text. It will be split into fixed-length word chunks.")
```
### SlidingWindowChunking
`SlidingWindowChunking` uses a sliding window approach to chunk a given text. Each chunk has a fixed length, and the window slides by a specified step size.
**Constructor Parameters:**
- `window_size` (int, optional): The number of words in each chunk. Default is `100`.
- `step` (int, optional): The number of words to slide the window. Default is `50`.
**Example usage:**
```python
chunker = SlidingWindowChunking(window_size=100, step=50)
chunks = chunker.chunk("This is a sample text. It will be split using a sliding window approach.")
```
## Extraction Strategies 🧠
### NoExtractionStrategy
`NoExtractionStrategy` is a basic extraction strategy that returns the entire HTML content without any modification. It is useful for cases where no specific extraction is required.
**Constructor Parameters:**
None.
**Example usage:**
```python
extractor = NoExtractionStrategy()
extracted_content = extractor.extract(url, html)
```
### LLMExtractionStrategy
`LLMExtractionStrategy` uses a Language Model (LLM) to extract meaningful blocks or chunks from the given HTML content. This strategy leverages an external provider for language model completions.
**Constructor Parameters:**
- `provider` (str, optional): The provider to use for the language model completions. Default is `DEFAULT_PROVIDER` (e.g., openai/gpt-4).
- `api_token` (str, optional): The API token for the provider. If not provided, it will try to load from the environment variable `OPENAI_API_KEY`.
- `instruction` (str, optional): An instruction to guide the LLM on how to perform the extraction. This allows users to specify the type of data they are interested in or set the tone of the response. Default is `None`.
**Example usage:**
```python
extractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token', instruction='Extract only news about AI.')
extracted_content = extractor.extract(url, html)
```
### CosineStrategy
`CosineStrategy` uses hierarchical clustering based on cosine similarity to extract clusters of text from the given HTML content. This strategy is suitable for identifying related content sections.
**Constructor Parameters:**
- `semantic_filter` (str, optional): A string containing keywords for filtering relevant documents before clustering. If provided, documents are filtered based on their cosine similarity to the keyword filter embedding. Default is `None`.
- `word_count_threshold` (int, optional): Minimum number of words per cluster. Default is `20`.
- `max_dist` (float, optional): The maximum cophenetic distance on the dendrogram to form clusters. Default is `0.2`.
- `linkage_method` (str, optional): The linkage method for hierarchical clustering. Default is `'ward'`.
- `top_k` (int, optional): Number of top categories to extract. Default is `3`.
- `model_name` (str, optional): The model name for embedding generation. Default is `'BAAI/bge-small-en-v1.5'`.
**Example usage:**
```python
extractor = CosineStrategy(semantic_filter='artificial intelligence', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')
extracted_content = extractor.extract(url, html)
```
### TopicExtractionStrategy
`TopicExtractionStrategy` uses the TextTiling algorithm to segment the HTML content into topics and extracts keywords for each segment. This strategy is useful for identifying and summarizing thematic content.
**Constructor Parameters:**
- `num_keywords` (int, optional): Number of keywords to represent each topic segment. Default is `3`.
**Example usage:**
```python
extractor = TopicExtractionStrategy(num_keywords=3)
extracted_content = extractor.extract(url, html)
```
## Contributing 🤝
@ -259,5 +480,6 @@ If you have any questions, suggestions, or feedback, please feel free to reach o
- GitHub: [unclecode](https://github.com/unclecode)
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [crawl4ai.com](https://crawl4ai.com)
Let's work together to make the web more accessible and useful for AI applications! 💪🌐🤖

View File

@ -38,7 +38,12 @@ class RegexChunking(ChunkingStrategy):
class NlpSentenceChunking(ChunkingStrategy):
def __init__(self, model='en_core_web_sm'):
import spacy
self.nlp = spacy.load(model)
try:
self.nlp = spacy.load(model)
except IOError:
spacy.cli.download(model)  # download the requested spaCy model if it is missing
self.nlp = spacy.load(model)
# raise ImportError(f"Spacy model '{model}' not found. Please download the model using 'python -m spacy download {model}'")
def chunk(self, text: str) -> list:
doc = self.nlp(text)

View File

@ -18,15 +18,16 @@ class CrawlerStrategy(ABC):
pass
class CloudCrawlerStrategy(CrawlerStrategy):
def crawl(self, url: str, use_cached_html = False, css_selector = None) -> str:
def __init__(self, use_cached_html = False):
super().__init__()
self.use_cached_html = use_cached_html
def crawl(self, url: str) -> str:
data = {
"urls": [url],
"provider_model": "",
"api_token": "token",
"include_raw_html": True,
"forced": True,
"extract_blocks": False,
"word_count_threshold": 10
}
response = requests.post("http://crawl4ai.uccode.io/crawl", json=data)
@ -35,19 +36,24 @@ class CloudCrawlerStrategy(CrawlerStrategy):
return html
class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
def __init__(self):
def __init__(self, use_cached_html=False, js_code=None):
super().__init__()
self.options = Options()
self.options.headless = True
self.options.add_argument("--no-sandbox")
self.options.add_argument("--disable-dev-shm-usage")
self.options.add_argument("--disable-gpu")
self.options.add_argument("--disable-extensions")
self.options.add_argument("--headless")
self.use_cached_html = use_cached_html
self.js_code = js_code
# chromedriver_autoinstaller.install()
self.service = Service(chromedriver_autoinstaller.install())
self.driver = webdriver.Chrome(service=self.service, options=self.options)
def crawl(self, url: str, use_cached_html = False, css_selector = None) -> str:
if use_cached_html:
def crawl(self, url: str) -> str:
if self.use_cached_html:
cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", url.replace("/", "_"))
if os.path.exists(cache_file_path):
with open(cache_file_path, "r") as f:
@ -58,6 +64,15 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
WebDriverWait(self.driver, 10).until(
EC.presence_of_all_elements_located((By.TAG_NAME, "html"))
)
# Execute JS code if provided
if self.js_code:
self.driver.execute_script(self.js_code)
# Optionally, wait for some condition after executing the JS code
WebDriverWait(self.driver, 10).until(
lambda driver: driver.execute_script("return document.readyState") == "complete"
)
html = self.driver.page_source
# Store in cache

View File

@ -8,9 +8,9 @@ DB_PATH = os.path.join(Path.home(), ".crawl4ai")
os.makedirs(DB_PATH, exist_ok=True)
DB_PATH = os.path.join(DB_PATH, "crawl4ai.db")
def init_db(db_path: str):
def init_db():
global DB_PATH
conn = sqlite3.connect(db_path)
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE IF NOT EXISTS crawled_data (
@ -18,13 +18,12 @@ def init_db(db_path: str):
html TEXT,
cleaned_html TEXT,
markdown TEXT,
parsed_json TEXT,
extracted_content TEXT,
success BOOLEAN
)
''')
conn.commit()
conn.close()
DB_PATH = db_path
def check_db_path():
if not DB_PATH:
@ -35,7 +34,7 @@ def get_cached_url(url: str) -> Optional[Tuple[str, str, str, str, str, bool]]:
try:
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute('SELECT url, html, cleaned_html, markdown, parsed_json, success FROM crawled_data WHERE url = ?', (url,))
cursor.execute('SELECT url, html, cleaned_html, markdown, extracted_content, success FROM crawled_data WHERE url = ?', (url,))
result = cursor.fetchone()
conn.close()
return result
@ -43,21 +42,21 @@ def get_cached_url(url: str) -> Optional[Tuple[str, str, str, str, str, bool]]:
print(f"Error retrieving cached URL: {e}")
return None
def cache_url(url: str, html: str, cleaned_html: str, markdown: str, parsed_json: str, success: bool):
def cache_url(url: str, html: str, cleaned_html: str, markdown: str, extracted_content: str, success: bool):
check_db_path()
try:
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute('''
INSERT INTO crawled_data (url, html, cleaned_html, markdown, parsed_json, success)
INSERT INTO crawled_data (url, html, cleaned_html, markdown, extracted_content, success)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT(url) DO UPDATE SET
html = excluded.html,
cleaned_html = excluded.cleaned_html,
markdown = excluded.markdown,
parsed_json = excluded.parsed_json,
extracted_content = excluded.extracted_content,
success = excluded.success
''', (url, html, cleaned_html, markdown, parsed_json, success))
''', (url, html, cleaned_html, markdown, extracted_content, success))
conn.commit()
conn.close()
except Exception as e:
@ -85,4 +84,15 @@ def clear_db():
conn.commit()
conn.close()
except Exception as e:
print(f"Error clearing database: {e}")
print(f"Error clearing database: {e}")
def flush_db():
check_db_path()
try:
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute('DROP TABLE crawled_data')
conn.commit()
conn.close()
except Exception as e:
print(f"Error flushing database: {e}")

View File

@ -3,19 +3,20 @@ from typing import Any, List, Dict, Optional, Union
from concurrent.futures import ThreadPoolExecutor, as_completed
import json, time
# from optimum.intel import IPEXModel
from .prompts import PROMPT_EXTRACT_BLOCKS
from .prompts import PROMPT_EXTRACT_BLOCKS, PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION
from .config import *
from .utils import *
from functools import partial
from .model_loader import load_bert_base_uncased, load_bge_small_en_v1_5, load_spacy_model
from transformers import pipeline
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
class ExtractionStrategy(ABC):
"""
Abstract base class for all extraction strategies.
"""
def __init__(self):
def __init__(self, **kwargs):
self.DEL = "<|DEL|>"
self.name = self.__class__.__name__
@ -38,12 +39,12 @@ class ExtractionStrategy(ABC):
:param sections: List of sections (strings) to process.
:return: A list of processed JSON blocks.
"""
parsed_json = []
extracted_content = []
with ThreadPoolExecutor() as executor:
futures = [executor.submit(self.extract, url, section, **kwargs) for section in sections]
for future in as_completed(futures):
parsed_json.extend(future.result())
return parsed_json
extracted_content.extend(future.result())
return extracted_content
class NoExtractionStrategy(ExtractionStrategy):
def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]:
@ -53,37 +54,41 @@ class NoExtractionStrategy(ExtractionStrategy):
return [{"index": i, "tags": [], "content": section} for i, section in enumerate(sections)]
class LLMExtractionStrategy(ExtractionStrategy):
def __init__(self, provider: str = DEFAULT_PROVIDER, api_token: Optional[str] = None):
def __init__(self, provider: str = DEFAULT_PROVIDER, api_token: Optional[str] = None, instruction:str = None, **kwargs):
"""
Initialize the strategy with clustering parameters.
:param word_count_threshold: Minimum number of words per cluster.
:param max_dist: The maximum cophenetic distance on the dendrogram to form clusters.
:param linkage_method: The linkage method for hierarchical clustering.
:param provider: The provider to use for extraction.
:param api_token: The API token for the provider.
:param instruction: The instruction to use for the LLM model.
"""
super().__init__()
self.provider = provider
self.api_token = api_token or PROVIDER_MODELS.get(provider, None) or os.getenv("OPENAI_API_KEY")
self.instruction = instruction
if not self.api_token:
raise ValueError("API token must be provided for LLMExtractionStrategy. Update the config.py or set OPENAI_API_KEY environment variable.")
def extract(self, url: str, html: str) -> List[Dict[str, Any]]:
print("[LOG] Extracting blocks from URL:", url)
def extract(self, url: str, ix:int, html: str) -> List[Dict[str, Any]]:
# print("[LOG] Extracting blocks from URL:", url)
print(f"[LOG] Call LLM for {url} - block index: {ix}")
variable_values = {
"URL": url,
"HTML": escape_json_string(sanitize_html(html)),
}
if self.instruction:
variable_values["REQUEST"] = self.instruction
prompt_with_variables = PROMPT_EXTRACT_BLOCKS
prompt_with_variables = PROMPT_EXTRACT_BLOCKS if not self.instruction else PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION
for variable in variable_values:
prompt_with_variables = prompt_with_variables.replace(
"{" + variable + "}", variable_values[variable]
)
response = perform_completion_with_backoff(self.provider, prompt_with_variables, self.api_token)
try:
blocks = extract_xml_data(["blocks"], response.choices[0].message.content)['blocks']
blocks = json.loads(blocks)
@ -101,7 +106,7 @@ class LLMExtractionStrategy(ExtractionStrategy):
"content": unparsed
})
print("[LOG] Extracted", len(blocks), "blocks from URL:", url)
print("[LOG] Extracted", len(blocks), "blocks from URL:", url, "block index:", ix)
return blocks
def _merge(self, documents):
@ -130,29 +135,30 @@ class LLMExtractionStrategy(ExtractionStrategy):
"""
merged_sections = self._merge(sections)
parsed_json = []
extracted_content = []
if self.provider.startswith("groq/"):
# Sequential processing with a delay
for section in merged_sections:
parsed_json.extend(self.extract(url, section))
for ix, section in enumerate(merged_sections):
extracted_content.extend(self.extract(url, ix, section))
time.sleep(0.5) # 500 ms delay between each processing
else:
# Parallel processing using ThreadPoolExecutor
with ThreadPoolExecutor(max_workers=4) as executor:
extract_func = partial(self.extract, url)
futures = [executor.submit(extract_func, section) for section in merged_sections]
futures = [executor.submit(extract_func, ix, section) for ix, section in enumerate(merged_sections)]
for future in as_completed(futures):
parsed_json.extend(future.result())
extracted_content.extend(future.result())
return parsed_json
return extracted_content
class CosineStrategy(ExtractionStrategy):
def __init__(self, word_count_threshold=20, max_dist=0.2, linkage_method='ward', top_k=3, model_name = 'BAAI/bge-small-en-v1.5'):
def __init__(self, semantic_filter = None, word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name = 'BAAI/bge-small-en-v1.5', **kwargs):
"""
Initialize the strategy with clustering parameters.
:param semantic_filter: A keyword filter for document filtering.
:param word_count_threshold: Minimum number of words per cluster.
:param max_dist: The maximum cophenetic distance on the dendrogram to form clusters.
:param linkage_method: The linkage method for hierarchical clustering.
@ -163,11 +169,14 @@ class CosineStrategy(ExtractionStrategy):
from transformers import AutoTokenizer, AutoModel
import spacy
self.semantic_filter = semantic_filter
self.word_count_threshold = word_count_threshold
self.max_dist = max_dist
self.linkage_method = linkage_method
self.top_k = top_k
self.timer = time.time()
self.buffer_embeddings = np.array([])
if model_name == "bert-base-uncased":
self.tokenizer, self.model = load_bert_base_uncased()
@ -177,13 +186,42 @@ class CosineStrategy(ExtractionStrategy):
self.nlp = load_spacy_model()
print(f"[LOG] Model loaded {model_name}, models/reuters, took " + str(time.time() - self.timer) + " seconds")
def get_embeddings(self, sentences: List[str]):
def filter_documents_embeddings(self, documents: List[str], semantic_filter: str, threshold: float = 0.5) -> List[str]:
"""
Filter documents based on the cosine similarity of their embeddings with the semantic_filter embedding.
:param documents: List of text chunks (documents).
:param semantic_filter: A string containing the keywords for filtering.
:param threshold: Cosine similarity threshold for filtering documents.
:return: Filtered list of documents.
"""
if not semantic_filter:
return documents
# Compute embedding for the keyword filter
query_embedding = self.get_embeddings([semantic_filter])[0]
# Compute embeddings for the documents
document_embeddings = self.get_embeddings(documents)
# Calculate cosine similarity between the query embedding and document embeddings
similarities = cosine_similarity([query_embedding], document_embeddings).flatten()
# Filter documents based on the similarity threshold
filtered_docs = [doc for doc, sim in zip(documents, similarities) if sim >= threshold]
return filtered_docs
def get_embeddings(self, sentences: List[str], bypass_buffer=True):
"""
Get BERT embeddings for a list of sentences.
:param sentences: List of text chunks (sentences).
:return: NumPy array of embeddings.
"""
# if self.buffer_embeddings.any() and not bypass_buffer:
# return self.buffer_embeddings
import torch
# Tokenize sentences and convert to tensor
encoded_input = self.tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
@ -193,6 +231,7 @@ class CosineStrategy(ExtractionStrategy):
# Get embeddings from the last hidden state (mean pooling)
embeddings = model_output.last_hidden_state.mean(1)
self.buffer_embeddings = embeddings.numpy()
return embeddings.numpy()
def hierarchical_clustering(self, sentences: List[str]):
@ -206,7 +245,7 @@ class CosineStrategy(ExtractionStrategy):
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
self.timer = time.time()
embeddings = self.get_embeddings(sentences)
embeddings = self.get_embeddings(sentences, bypass_buffer=False)
# print(f"[LOG] 🚀 Embeddings computed in {time.time() - self.timer:.2f} seconds")
# Compute pairwise cosine distances
distance_matrix = pdist(embeddings, 'cosine')
@ -247,6 +286,12 @@ class CosineStrategy(ExtractionStrategy):
# Assume `html` is a list of text chunks for this strategy
t = time.time()
text_chunks = html.split(self.DEL) # Split by lines or paragraphs as needed
# Pre-filter documents using embeddings and semantic_filter
text_chunks = self.filter_documents_embeddings(text_chunks, self.semantic_filter)
if not text_chunks:
return []
# Perform clustering
labels = self.hierarchical_clustering(text_chunks)
@ -290,7 +335,7 @@ class CosineStrategy(ExtractionStrategy):
return self.extract(url, self.DEL.join(sections), **kwargs)
class TopicExtractionStrategy(ExtractionStrategy):
def __init__(self, num_keywords: int = 3):
def __init__(self, num_keywords: int = 3, **kwargs):
"""
Initialize the topic extraction strategy with parameters for topic segmentation.
@ -358,7 +403,7 @@ class TopicExtractionStrategy(ExtractionStrategy):
return self.extract(url, self.DEL.join(sections), **kwargs)
class ContentSummarizationStrategy(ExtractionStrategy):
def __init__(self, model_name: str = "sshleifer/distilbart-cnn-12-6"):
def __init__(self, model_name: str = "sshleifer/distilbart-cnn-12-6", **kwargs):
"""
Initialize the content summarization strategy with a specific model.

View File

@ -11,5 +11,6 @@ class CrawlResult(BaseModel):
success: bool
cleaned_html: str = None
markdown: str = None
parsed_json: str = None
extracted_content: str = None
metadata: dict = None
error_message: str = None

View File

@ -59,7 +59,7 @@ Please provide your output within <blocks> tags, like this:
Remember, the output should be a complete, parsable JSON wrapped in <blocks> tags, with no omissions or errors. The JSON objects should semantically break down the content into relevant blocks, maintaining the original order."""
PROMPT_EXTRACT_BLOCKS = """YHere is the URL of the webpage:
PROMPT_EXTRACT_BLOCKS = """Here is the URL of the webpage:
<url>{URL}</url>
And here is the cleaned HTML content of that webpage:
@ -107,4 +107,61 @@ Please provide your output within <blocks> tags, like this:
}]
</blocks>
Remember, the output should be a complete, parsable JSON wrapped in <blocks> tags, with no omissions or errors. The JSON objects should semantically break down the content into relevant blocks, maintaining the original order."""
PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION = """Here is the URL of the webpage:
<url>{URL}</url>
And here is the cleaned HTML content of that webpage:
<html>
{HTML}
</html>
Your task is to break down this HTML content into semantically relevant blocks, following the provided user's REQUEST, and for each block, generate a JSON object with the following keys:
- index: an integer representing the index of the block in the content
- content: a list of strings containing the text content of the block
This is the user's REQUEST, pay attention to it:
<request>
{REQUEST}
</request>
To generate the JSON objects:
1. Carefully read through the HTML content and identify logical breaks or shifts in the content that would warrant splitting it into separate blocks.
2. For each block:
a. Assign it an index based on its order in the content.
b. Analyze the content and generate ONE semantic tag that describe what the block is about.
c. Extract the text content, EXACTLY THE SAME AS THE GIVEN DATA, clean it up if needed, and store it as a list of strings in the "content" field.
3. Ensure that the order of the JSON objects matches the order of the blocks as they appear in the original HTML content.
4. Double-check that each JSON object includes all required keys (index, tag, content) and that the values are in the expected format (integer, list of strings, etc.).
5. Make sure the generated JSON is complete and parsable, with no errors or omissions.
6. Make sure to escape any special characters in the HTML content, including single or double quotes, to avoid JSON parsing issues.
7. Never alter the extracted content, just copy and paste it as it is.
Please provide your output within <blocks> tags, like this:
<blocks>
[{
"index": 0,
"tags": ["introduction"],
"content": ["This is the first paragraph of the article, which provides an introduction and overview of the main topic."]
},
{
"index": 1,
"tags": ["background"],
"content": ["This is the second paragraph, which delves into the history and background of the topic.",
"It provides context and sets the stage for the rest of the article."]
}]
</blocks>
**Make sure to follow the user's REQUEST and extract blocks that align with that instruction.**
Remember, the output should be a complete, parsable JSON wrapped in <blocks> tags, with no omissions or errors. The JSON objects should semantically break down the content into relevant blocks, maintaining the original order."""

View File

@ -461,17 +461,17 @@ def merge_chunks_based_on_token_threshold(chunks, token_threshold):
return merged_sections
def process_sections(url: str, sections: list, provider: str, api_token: str) -> list:
parsed_json = []
extracted_content = []
if provider.startswith("groq/"):
# Sequential processing with a delay
for section in sections:
parsed_json.extend(extract_blocks(url, section, provider, api_token))
extracted_content.extend(extract_blocks(url, section, provider, api_token))
time.sleep(0.5) # 500 ms delay between each processing
else:
# Parallel processing using ThreadPoolExecutor
with ThreadPoolExecutor() as executor:
futures = [executor.submit(extract_blocks, url, section, provider, api_token) for section in sections]
for future in as_completed(futures):
parsed_json.extend(future.result())
extracted_content.extend(future.result())
return parsed_json
return extracted_content

View File

@ -1,8 +1,9 @@
import os, time
os.environ["TOKENIZERS_PARALLELISM"] = "false"
from pathlib import Path
from .models import UrlModel, CrawlResult
from .database import init_db, get_cached_url, cache_url, DB_PATH
from .database import init_db, get_cached_url, cache_url, DB_PATH, flush_db
from .utils import *
from .chunking_strategy import *
from .extraction_strategy import *
@ -16,11 +17,13 @@ from .config import *
class WebCrawler:
def __init__(
self,
db_path: str = None,
# db_path: str = None,
crawler_strategy: CrawlerStrategy = LocalSeleniumCrawlerStrategy(),
always_by_pass_cache: bool = False,
):
self.db_path = db_path
# self.db_path = db_path
self.crawler_strategy = crawler_strategy
self.always_by_pass_cache = always_by_pass_cache
# Create the .crawl4ai folder in the user's home directory if it doesn't exist
self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
@ -28,10 +31,11 @@ class WebCrawler:
os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
# If db_path is not provided, use the default path
if not db_path:
self.db_path = f"{self.crawl4ai_folder}/crawl4ai.db"
# if not db_path:
# self.db_path = f"{self.crawl4ai_folder}/crawl4ai.db"
init_db(self.db_path)
flush_db()
init_db()
self.ready = False
@ -93,7 +97,7 @@ class WebCrawler:
word_count_threshold = MIN_WORD_THRESHOLD
# Check cache first
if not bypass_cache:
if not bypass_cache and not self.always_by_pass_cache:
cached = get_cached_url(url)
if cached:
return CrawlResult(
@ -102,7 +106,7 @@ class WebCrawler:
"html": cached[1],
"cleaned_html": cached[2],
"markdown": cached[3],
"parsed_json": cached[4],
"extracted_content": cached[4],
"success": cached[5],
"error_message": "",
}
@ -130,7 +134,7 @@ class WebCrawler:
f"[LOG] 🚀 Crawling done for {url}, success: {success}, time taken: {time.time() - t} seconds"
)
parsed_json = []
extracted_content = []
if verbose:
print(f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}")
t = time.time()
@ -138,10 +142,10 @@ class WebCrawler:
sections = chunking_strategy.chunk(markdown)
# sections = merge_chunks_based_on_token_threshold(sections, CHUNK_TOKEN_THRESHOLD)
parsed_json = extraction_strategy.run(
extracted_content = extraction_strategy.run(
url, sections,
)
parsed_json = json.dumps(parsed_json)
extracted_content = json.dumps(extracted_content)
if verbose:
print(
@ -155,7 +159,7 @@ class WebCrawler:
html,
cleaned_html,
markdown,
parsed_json,
extracted_content,
success,
)
@ -164,7 +168,7 @@ class WebCrawler:
html=html,
cleaned_html=cleaned_html,
markdown=markdown,
parsed_json=parsed_json,
extracted_content=extracted_content,
success=success,
error_message=error_message,
)

View File

@ -1,9 +1,9 @@
{
"NoExtractionStrategy": "### NoExtractionStrategy\n\n`NoExtractionStrategy` is a basic extraction strategy that returns the entire HTML content without any modification. It is useful for cases where no specific extraction is required. Only clean html, and amrkdown.\n\n#### Constructor Parameters:\nNone.\n\n#### Example usage:\n```python\nextractor = NoExtractionStrategy()\nextracted_content = extractor.extract(url, html)\n```",
"LLMExtractionStrategy": "### LLMExtractionStrategy\n\n`LLMExtractionStrategy` uses a Language Model (LLM) to extract meaningful blocks or chunks from the given HTML content. This strategy leverages an external provider for language model completions.\n\n#### Constructor Parameters:\n- `provider` (str, optional): The provider to use for the language model completions. Default is `DEFAULT_PROVIDER` (following provider/model eg. openai/gpt-4o).\n- `api_token` (str, optional): The API token for the provider. If not provided, it will try to load from the environment variable `OPENAI_API_KEY`.\n\n#### Example usage:\n```python\nextractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token')\nextracted_content = extractor.extract(url, html)\n```",
"LLMExtractionStrategy": "### LLMExtractionStrategy\n\n`LLMExtractionStrategy` uses a Language Model (LLM) to extract meaningful blocks or chunks from the given HTML content. This strategy leverages an external provider for language model completions.\n\n#### Constructor Parameters:\n- `provider` (str, optional): The provider to use for the language model completions. Default is `DEFAULT_PROVIDER` (e.g., openai/gpt-4).\n- `api_token` (str, optional): The API token for the provider. If not provided, it will try to load from the environment variable `OPENAI_API_KEY`.\n- `instruction` (str, optional): An instruction to guide the LLM on how to perform the extraction. This allows users to specify the type of data they are interested in or set the tone of the response. Default is `None`.\n\n#### Example usage:\n```python\nextractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token', instruction='Extract only news about AI.')\nextracted_content = extractor.extract(url, html)\n```\n\nBy providing clear instructions, users can tailor the extraction process to their specific needs, enhancing the relevance and utility of the extracted content.",
"CosineStrategy": "### CosineStrategy\n\n`CosineStrategy` uses hierarchical clustering based on cosine similarity to extract clusters of text from the given HTML content. This strategy is suitable for identifying related content sections.\n\n#### Constructor Parameters:\n- `word_count_threshold` (int, optional): Minimum number of words per cluster. Default is `20`.\n- `max_dist` (float, optional): The maximum cophenetic distance on the dendrogram to form clusters. Default is `0.2`.\n- `linkage_method` (str, optional): The linkage method for hierarchical clustering. Default is `'ward'`.\n- `top_k` (int, optional): Number of top categories to extract. Default is `3`.\n- `model_name` (str, optional): The model name for embedding generation. Default is `'BAAI/bge-small-en-v1.5'`.\n\n#### Example usage:\n```python\nextractor = CosineStrategy(word_count_threshold=20, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')\nextracted_content = extractor.extract(url, html)\n```",
"CosineStrategy": "### CosineStrategy\n\n`CosineStrategy` uses hierarchical clustering based on cosine similarity to extract clusters of text from the given HTML content. This strategy is suitable for identifying related content sections.\n\n#### Constructor Parameters:\n- `semantic_filter` (str, optional): A string containing keywords for filtering relevant documents before clustering. If provided, documents are filtered based on their cosine similarity to the keyword filter embedding. Default is `None`.\n- `word_count_threshold` (int, optional): Minimum number of words per cluster. Default is `20`.\n- `max_dist` (float, optional): The maximum cophenetic distance on the dendrogram to form clusters. Default is `0.2`.\n- `linkage_method` (str, optional): The linkage method for hierarchical clustering. Default is `'ward'`.\n- `top_k` (int, optional): Number of top categories to extract. Default is `3`.\n- `model_name` (str, optional): The model name for embedding generation. Default is `'BAAI/bge-small-en-v1.5'`.\n\n#### Example usage:\n```python\nextractor = CosineStrategy(semantic_filter='artificial intelligence', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')\nextracted_content = extractor.extract(url, html)\n```\n\n#### Cosine Similarity Filtering\n\nWhen a `semantic_filter` is provided, the `CosineStrategy` applies an embedding-based filtering process to select relevant documents before performing hierarchical clustering.",
"TopicExtractionStrategy": "### TopicExtractionStrategy\n\n`TopicExtractionStrategy` uses the TextTiling algorithm to segment the HTML content into topics and extracts keywords for each segment. This strategy is useful for identifying and summarizing thematic content.\n\n#### Constructor Parameters:\n- `num_keywords` (int, optional): Number of keywords to represent each topic segment. Default is `3`.\n\n#### Example usage:\n```python\nextractor = TopicExtractionStrategy(num_keywords=3)\nextracted_content = extractor.extract(url, html)\n```"
}

View File

@ -1,22 +1,195 @@
import os
import os, time
from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
from crawl4ai.crawler_strategy import *
from rich import print
from rich.console import Console
console = Console()
def print_result(result):
# Print each key on one line with just the first 20 characters of its value followed by three dots
console.print(f"\t[bold]Result:[/bold]")
for key, value in result.model_dump().items():
if type(value) == str and value:
console.print(f"\t{key}: [green]{value[:20]}...[/green]")
def cprint(message, press_any_key=False):
console.print(message)
if press_any_key:
console.print("Press any key to continue...", style="")
input()
def main():
# 🚀 Let's get started with the basics!
cprint("🌟 [bold green]Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun! 🌐[/bold green]")
# Basic usage: Just provide the URL
cprint("⛳️ [bold cyan]First Step: Create an instance of WebCrawler and call the `warmup()` function.[/bold cyan]")
cprint("If this is the first time you're running Crawl4ai, this might take a few seconds to lead required model files.", True)
crawler = WebCrawler()
crawler.warmup()
cprint("🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]")
result = crawler.run(url="https://www.nbcnews.com/business")
cprint("[LOG] 📦 [bold yellow]Basic crawl result:[/bold yellow]")
print_result(result)
# Explanation of bypass_cache and include_raw_html
cprint("\n🧠 [bold cyan]Understanding 'bypass_cache' and 'include_raw_html' parameters:[/bold cyan]")
cprint("By default, Crawl4ai caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action. Becuase we already crawled this URL, the result will be fetched from the cache. Let's try it out!")
# Reads from cache
cprint("1⃣ First crawl (caches the result):", True)
start_time = time.time()
result = crawler.run(url="https://www.nbcnews.com/business")
end_time = time.time()
cprint(f"[LOG] 📦 [bold yellow]First crawl took {end_time - start_time} seconds and result (from cache):[/bold yellow]")
print_result(result)
# Force to crawl again
cprint("2⃣ Second crawl (Force to crawl again):", True)
start_time = time.time()
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
end_time = time.time()
cprint(f"[LOG] 📦 [bold yellow]Second crawl took {end_time - start_time} seconds and result (forced to crawl):[/bold yellow]")
print_result(result)
# Retrieve raw HTML content
cprint("\n🔄 [bold cyan]By default 'include_raw_html' is set to True, which includes the raw HTML content in the response.[/bold cyan]", True)
result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
cprint("[LOG] 📦 [bold yellow]Craw result (without raw HTML content):[/bold yellow]")
print_result(result)
cprint("\n📄 The 'include_raw_html' parameter, when set to True, includes the raw HTML content in the response. By default is set to True. Let's move on to exploring different chunking strategies now!")
cprint("For the rest of this guide, I set crawler.always_by_pass_cache to True to force the crawler to bypass the cache. This is to ensure that we get fresh results for each run.", True)
crawler.always_by_pass_cache = True
# Adding a chunking strategy: RegexChunking
cprint("\n🧩 [bold cyan]Let's add a chunking strategy: RegexChunking![/bold cyan]", True)
cprint("RegexChunking is a simple chunking strategy that splits the text based on a given regex pattern. Let's see it in action!")
result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=RegexChunking(patterns=["\n\n"])
)
cprint("[LOG] 📦 [bold yellow]RegexChunking result:[/bold yellow]")
print_result(result)
# Adding another chunking strategy: NlpSentenceChunking
cprint("\n🔍 [bold cyan]Time to explore another chunking strategy: NlpSentenceChunking![/bold cyan]", True)
cprint("NlpSentenceChunking uses NLP techniques to split the text into sentences. Let's see how it performs!")
result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=NlpSentenceChunking()
)
cprint("[LOG] 📦 [bold yellow]NlpSentenceChunking result:[/bold yellow]")
print_result(result)
cprint("There are more chunking strategies to explore, make sure to check document, but let's move on to extraction strategies now!")
# Adding an extraction strategy: CosineStrategy
cprint("\n🧠 [bold cyan]Let's get smarter with an extraction strategy: CosineStrategy![/bold cyan]", True)
cprint("CosineStrategy uses cosine similarity to extract semantically similar blocks of text. Let's see it in action!")
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
)
cprint("[LOG] 📦 [bold yellow]CosineStrategy result:[/bold yellow]")
print_result(result)
cprint("You can pass other parameters like 'semantic_filter' to the CosineStrategy to extract semantically similar blocks of text. Let's see it in action!")
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(
semantic_filter="inflation rent prices",
)
)
cprint("[LOG] 📦 [bold yellow]CosineStrategy result with semantic filter:[/bold yellow]")
print_result(result)
# Adding an LLM extraction strategy without instructions
cprint("\n🤖 [bold cyan]Time to bring in the big guns: LLMExtractionStrategy without instructions![/bold cyan]", True)
cprint("LLMExtractionStrategy uses a large language model to extract relevant information from the web page. Let's see it in action!")
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
)
cprint("[LOG] 📦 [bold yellow]LLMExtractionStrategy (no instructions) result:[/bold yellow]")
print_result(result)
cprint("You can pass other providers like 'groq/llama3-70b-8192' or 'ollama/llama3' to the LLMExtractionStrategy.")
# Adding an LLM extraction strategy with instructions
cprint("\n📜 [bold cyan]Let's make it even more interesting: LLMExtractionStrategy with instructions![/bold cyan]", True)
cprint("Let's say we are only interested in financial news. Let's see how LLMExtractionStrategy performs with instructions!")
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="I am interested in only financial news"
)
)
cprint("[LOG] 📦 [bold yellow]LLMExtractionStrategy (with instructions) result:[/bold yellow]")
print_result(result)
cprint("You can pass other instructions like 'Extract only content related to technology' to the LLMExtractionStrategy.")
result = crawler.run(
url="https://www.example.com",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="Extract only content related to technology"
)
)
cprint("There are more extraction strategies to explore, make sure to check the documentation!")
# Using a CSS selector to extract only H2 tags
cprint("\n🎯 [bold cyan]Targeted extraction: Let's use a CSS selector to extract only H2 tags![/bold cyan]", True)
result = crawler.run(
url="https://www.nbcnews.com/business",
css_selector="h2"
)
cprint("[LOG] 📦 [bold yellow]CSS Selector (H2 tags) result:[/bold yellow]")
print_result(result)
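# 'css_selector' accepts any valid CSS selector, e.g. "div.article-body" to scope the crawl to
# the article body, or "h2" as above to keep only the headings.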
# Passing JavaScript code to interact with the page
cprint("\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]", True)
cprint("In this example we try to click the 'Load More' button on the page using JavaScript code.")
js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
result = crawler.run(
url="https://www.nbcnews.com/business",
)
cprint("[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]")
print_result(result)
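# The js_code snippet runs inside the Selenium-driven browser before the HTML is captured,
# so any content revealed by the click (extra articles, in this case) ends up in the crawl result.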
cprint("\n🎉 [bold green]Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! 🕸️[/bold green]")
if __name__ == "__main__":
main()
def old_main():
js_code = """const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"""
# js_code = None
crawler = WebCrawler( crawler_strategy=LocalSeleniumCrawlerStrategy(use_cached_html=False, js_code=js_code))
crawler.warmup()
# Single page crawl
result = crawler.run(
url="https://www.nbcnews.com/business",
word_count_threshold=5, # Minimum word count for an HTML tag to be considered a worthy block
chunking_strategy=RegexChunking(patterns=["\n\n"]), # Default is RegexChunking
extraction_strategy=CosineStrategy(
word_count_threshold=20, max_dist=0.2, linkage_method="ward", top_k=3
), # Default is CosineStrategy
# extraction_strategy= LLMExtractionStrategy(provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY')),
# extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3), # Default is CosineStrategy
extraction_strategy= LLMExtractionStrategy(provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'), instruction = "I am intrested in only financial news"),
bypass_cache=True,
extract_blocks=True, # Whether to extract semantic blocks of text from the HTML
css_selector="", # Eg: "div.article-body" or all H2 tags like "h2"
@ -28,6 +201,3 @@ def main():
print("[LOG] 📦 Crawl result:")
print(result.model_dump())
if __name__ == "__main__":
main()

27
main.py
View File

@ -7,6 +7,7 @@ from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import HTMLResponse, JSONResponse
from fastapi.staticfiles import StaticFiles
from fastapi.middleware.cors import CORSMiddleware
from fastapi.templating import Jinja2Templates
from pydantic import BaseModel, HttpUrl
from concurrent.futures import ThreadPoolExecutor, as_completed
@ -35,7 +36,7 @@ app.add_middleware(
# Mount the pages directory as a static directory
app.mount("/pages", StaticFiles(directory=__location__ + "/pages"), name="pages")
templates = Jinja2Templates(directory=__location__ + "/pages")
# chromedriver_autoinstaller.install() # Ensure chromedriver is installed
@lru_cache()
def get_crawler():
@ -51,16 +52,24 @@ class CrawlRequest(BaseModel):
extract_blocks: bool = True
word_count_threshold: Optional[int] = 5
extraction_strategy: Optional[str] = "CosineStrategy"
extraction_strategy_args: Optional[dict] = {}
chunking_strategy: Optional[str] = "RegexChunking"
chunking_strategy_args: Optional[dict] = {}
css_selector: Optional[str] = None
verbose: Optional[bool] = True
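# Illustrative request body for POST /crawl (field names come from this model; the values are
# placeholders, and the *_args dictionaries are forwarded to the matching strategy constructor):
# {
#   "urls": ["https://www.nbcnews.com/business"],
#   "extraction_strategy": "CosineStrategy",
#   "extraction_strategy_args": {"semantic_filter": "inflation rent prices"},
#   "chunking_strategy": "RegexChunking",
#   "chunking_strategy_args": {"patterns": ["\n\n"]},
#   "css_selector": "p",
#   "word_count_threshold": 10
# }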
@app.get("/", response_class=HTMLResponse)
async def read_index():
with open(f"{__location__}/pages/index.html", "r") as file:
html_content = file.read()
return HTMLResponse(content=html_content, status_code=200)
async def read_index(request: Request):
partials_dir = os.path.join(__location__, "pages", "partial")
partials = {}
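# Each partial is exposed to the template under its filename stem (e.g. "try_it.html" becomes
# the "try_it" variable), which is how index.html renders blocks like {{ try_it | safe }}.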
for filename in os.listdir(partials_dir):
if filename.endswith(".html"):
with open(os.path.join(partials_dir, filename), "r") as file:
partials[filename[:-5]] = file.read()
return templates.TemplateResponse("index.html", {"request": request, **partials})
@app.get("/total-count")
async def get_total_url_count():
@ -73,11 +82,11 @@ async def clear_database():
clear_db()
return JSONResponse(content={"message": "Database cleared."})
def import_strategy(module_name: str, class_name: str):
def import_strategy(module_name: str, class_name: str, *args, **kwargs):
try:
module = importlib.import_module(module_name)
strategy_class = getattr(module, class_name)
return strategy_class()
return strategy_class(*args, **kwargs)
except ImportError:
raise HTTPException(status_code=400, detail=f"Module {module_name} not found.")
except AttributeError:
@ -95,8 +104,8 @@ async def crawl_urls(crawl_request: CrawlRequest, request: Request):
current_requests += 1
try:
extraction_strategy = import_strategy("crawl4ai.extraction_strategy", crawl_request.extraction_strategy)
chunking_strategy = import_strategy("crawl4ai.chunking_strategy", crawl_request.chunking_strategy)
extraction_strategy = import_strategy("crawl4ai.extraction_strategy", crawl_request.extraction_strategy, **crawl_request.extraction_strategy_args)
chunking_strategy = import_strategy("crawl4ai.chunking_strategy", crawl_request.chunking_strategy, **crawl_request.chunking_strategy_args)
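# The *_args dictionaries are forwarded straight to the strategy constructors, so an
# LLMExtractionStrategy request can pass provider/api_token/instruction the same way the
# Python library call does.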
# Use ThreadPoolExecutor to run the synchronous WebCrawler in async manner
with ThreadPoolExecutor() as executor:

131
pages/app.css Normal file
View File

@ -0,0 +1,131 @@
:root {
--ifm-font-size-base: 100%;
--ifm-line-height-base: 1.65;
--ifm-font-family-base: system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans, sans-serif,
BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji",
"Segoe UI Symbol";
}
html {
-webkit-font-smoothing: antialiased;
-webkit-text-size-adjust: 100%;
text-size-adjust: 100%;
font: var(--ifm-font-size-base) / var(--ifm-line-height-base) var(--ifm-font-family-base);
}
body {
background-color: #1a202c;
color: #fff;
}
.tab-content {
max-height: 400px;
overflow: auto;
}
pre {
white-space: pre-wrap;
font-size: 14px;
}
pre code {
width: 100%;
}
/* Custom styling for docs-item class and Markdown generated elements */
.docs-item {
background-color: #2d3748; /* bg-gray-800 */
padding: 1rem; /* p-4 */
border-radius: 0.375rem; /* rounded */
box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1); /* shadow-md */
margin-bottom: 1rem; /* space between items */
line-height: 1.5; /* leading-normal */
}
.docs-item h3,
.docs-item h4 {
color: #ffffff; /* text-white */
font-size: 1.25rem; /* text-xl */
font-weight: 700; /* font-bold */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item h4 {
font-size: 1rem; /* text-xl */
}
.docs-item p {
color: #e2e8f0; /* text-gray-300 */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item code {
background-color: #1a202c; /* bg-gray-900 */
color: #e2e8f0; /* text-gray-300 */
padding: 0.25rem 0.5rem; /* px-2 py-1 */
border-radius: 0.25rem; /* rounded */
font-size: 0.875rem; /* text-sm */
}
.docs-item pre {
background-color: #1a202c; /* bg-gray-900 */
color: #e2e8f0; /* text-gray-300 */
padding: 0.5rem; /* p-2 */
border-radius: 0.375rem; /* rounded */
overflow: auto; /* overflow-auto */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item div {
color: #e2e8f0; /* text-gray-300 */
font-size: 1rem; /* prose prose-sm */
line-height: 1.25rem; /* line-height for readability */
}
/* Adjustments to make prose class more suitable for dark mode */
.prose {
max-width: none; /* max-w-none */
}
.prose p,
.prose ul {
margin-bottom: 1rem; /* mb-4 */
}
.prose code {
/* background-color: #4a5568; */ /* bg-gray-700 */
color: #65a30d; /* text-lime-600 */
padding: 0.25rem 0.5rem; /* px-1 py-0.5 */
border-radius: 0.25rem; /* rounded */
display: inline-block; /* inline-block */
}
.prose pre {
background-color: #1a202c; /* bg-gray-900 */
color: #ffffff; /* text-white */
padding: 0.5rem; /* p-2 */
border-radius: 0.375rem; /* rounded */
}
.prose h3 {
color: #65a30d; /* text-lime-600 */
font-size: 1.25rem; /* text-xl */
font-weight: 700; /* font-bold */
margin-bottom: 0.5rem; /* mb-2 */
}
body {
background-color: #1a1a1a;
color: #b3ff00;
}
.sidebar {
color: #b3ff00;
border-right: 1px solid #333;
}
.sidebar a {
color: #b3ff00;
text-decoration: none;
}
.sidebar a:hover {
background-color: #555;
}
.content-section {
display: none;
}
.content-section.active {
display: block;
}

303
pages/app.js Normal file
View File

@ -0,0 +1,303 @@
// JavaScript to manage dynamic form changes and logic
document.getElementById("extraction-strategy-select").addEventListener("change", function () {
const strategy = this.value;
const providerModelSelect = document.getElementById("provider-model-select");
const tokenInput = document.getElementById("token-input");
const instruction = document.getElementById("instruction");
const semantic_filter = document.getElementById("semantic_filter");
const instruction_div = document.getElementById("instruction_div");
const semantic_filter_div = document.getElementById("semantic_filter_div");
const llm_settings = document.getElementById("llm_settings");
if (strategy === "LLMExtractionStrategy") {
// providerModelSelect.disabled = false;
// tokenInput.disabled = false;
// semantic_filter.disabled = true;
// instruction.disabled = false;
llm_settings.classList.remove("hidden");
instruction_div.classList.remove("hidden");
semantic_filter_div.classList.add("hidden");
} else if (strategy === "NoExtractionStrategy") {
semantic_filter_div.classList.add("hidden");
instruction_div.classList.add("hidden");
llm_settings.classList.add("hidden");
} else {
// providerModelSelect.disabled = true;
// tokenInput.disabled = true;
// semantic_filter.disabled = false;
// instruction.disabled = true;
llm_settings.classList.add("hidden");
instruction_div.classList.add("hidden");
semantic_filter_div.classList.remove("hidden");
}
});
// Get the selected provider model and token from local storage
const storedProviderModel = localStorage.getItem("provider_model");
const storedToken = localStorage.getItem(storedProviderModel);
if (storedProviderModel) {
document.getElementById("provider-model-select").value = storedProviderModel;
}
if (storedToken) {
document.getElementById("token-input").value = storedToken;
}
// Handle provider model dropdown change
document.getElementById("provider-model-select").addEventListener("change", () => {
const selectedProviderModel = document.getElementById("provider-model-select").value;
const storedToken = localStorage.getItem(selectedProviderModel);
if (storedToken) {
document.getElementById("token-input").value = storedToken;
} else {
document.getElementById("token-input").value = "";
}
});
// Fetch total count from the database
axios
.get("/total-count")
.then((response) => {
document.getElementById("total-count").textContent = response.data.count;
})
.catch((error) => console.error(error));
// Handle crawl button click
document.getElementById("crawl-btn").addEventListener("click", () => {
// validate input to have both URL and API token
if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) {
alert("Please enter both URL(s) and API token.");
return;
}
const selectedProviderModel = document.getElementById("provider-model-select").value;
const apiToken = document.getElementById("token-input").value;
const extractBlocks = document.getElementById("extract-blocks-checkbox").checked;
const bypassCache = document.getElementById("bypass-cache-checkbox").checked;
// Save the selected provider model and token to local storage
localStorage.setItem("provider_model", selectedProviderModel);
localStorage.setItem(selectedProviderModel, apiToken);
const urlsInput = document.getElementById("url-input").value;
const urls = urlsInput.split(",").map((url) => url.trim());
const data = {
urls: urls,
provider_model: selectedProviderModel,
api_token: apiToken,
include_raw_html: true,
bypass_cache: bypassCache,
extract_blocks: extractBlocks,
word_count_threshold: parseInt(document.getElementById("threshold").value),
extraction_strategy: document.getElementById("extraction-strategy-select").value,
extraction_strategy_args: {
provider: selectedProviderModel,
api_token: apiToken,
instruction: document.getElementById("instruction").value,
semantic_filter: document.getElementById("semantic_filter").value,
},
chunking_strategy: document.getElementById("chunking-strategy-select").value,
chunking_strategy_args: {},
css_selector: document.getElementById("css-selector").value,
// instruction: document.getElementById("instruction").value,
// semantic_filter: document.getElementById("semantic_filter").value,
verbose: true,
};
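// extraction_strategy_args / chunking_strategy_args are passed through verbatim to the matching
// strategy constructors on the server (see import_strategy in main.py), so the keys here must
// match that strategy's keyword arguments.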
// save api token to local storage
localStorage.setItem("api_token", document.getElementById("token-input").value);
document.getElementById("loading").classList.remove("hidden");
document.getElementById("result").classList.add("hidden");
document.getElementById("code_help").classList.add("hidden");
axios
.post("/crawl", data)
.then((response) => {
const result = response.data.results[0];
const parsedJson = JSON.parse(result.extracted_content);
document.getElementById("json-result").textContent = JSON.stringify(parsedJson, null, 2);
document.getElementById("cleaned-html-result").textContent = result.cleaned_html;
document.getElementById("markdown-result").textContent = result.markdown;
// Update code examples dynamically
const extractionStrategy = data.extraction_strategy;
const isLLMExtraction = extractionStrategy === "LLMExtractionStrategy";
document.getElementById(
"curl-code"
).textContent = `curl -X POST -H "Content-Type: application/json" -d '${JSON.stringify({
...data,
api_token: isLLMExtraction ? "your_api_token" : undefined,
})}' http://crawl4ai.uccode.io/crawl`;
document.getElementById("python-code").textContent = `import requests\n\ndata = ${JSON.stringify(
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
)}\n\nresponse = requests.post("http://crawl4ai.uccode.io/crawl", json=data) # OR localhost if you run locally \nprint(response.json())`;
document.getElementById(
"nodejs-code"
).textContent = `const axios = require('axios');\n\nconst data = ${JSON.stringify(
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
)};\n\naxios.post("http://crawl4ai.uccode.io/crawl", data) // OR localhost if you run locally \n	.then(response => console.log(response.data))\n	.catch(error => console.error(error));`;
document.getElementById(
"library-code"
).textContent = `from crawl4ai.web_crawler import WebCrawler\nfrom crawl4ai.extraction_strategy import *\nfrom crawl4ai.chunking_strategy import *\n\ncrawler = WebCrawler()\ncrawler.warmup()\n\nresult = crawler.run(\n url='${
urls[0]
}',\n word_count_threshold=${data.word_count_threshold},\n extraction_strategy=${
isLLMExtraction
? `${extractionStrategy}(provider="${data.provider_model}", api_token="${data.api_token}")`
: extractionStrategy + "()"
},\n chunking_strategy=${data.chunking_strategy}(),\n bypass_cache=${
data.bypass_cache
},\n css_selector="${data.css_selector}"\n)\nprint(result)`;
// Highlight code syntax
hljs.highlightAll();
// Select JSON tab by default
document.querySelector('.tab-btn[data-tab="json"]').click();
document.getElementById("loading").classList.add("hidden");
document.getElementById("result").classList.remove("hidden");
document.getElementById("code_help").classList.remove("hidden");
// increment the total count
document.getElementById("total-count").textContent =
parseInt(document.getElementById("total-count").textContent) + 1;
})
.catch((error) => {
console.error(error);
document.getElementById("loading").classList.add("hidden");
});
});
// Handle tab clicks
document.querySelectorAll(".tab-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const tab = btn.dataset.tab;
document.querySelectorAll(".tab-btn").forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
btn.classList.add("bg-lime-700", "text-white");
document.querySelectorAll(".tab-content.code pre").forEach((el) => el.classList.add("hidden"));
document.getElementById(`${tab}-result`).parentElement.classList.remove("hidden");
});
});
// Handle code tab clicks
document.querySelectorAll(".code-tab-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const tab = btn.dataset.tab;
document.querySelectorAll(".code-tab-btn").forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
btn.classList.add("bg-lime-700", "text-white");
document.querySelectorAll(".tab-content.result pre").forEach((el) => el.classList.add("hidden"));
document.getElementById(`${tab}-code`).parentElement.classList.remove("hidden");
});
});
// Handle copy to clipboard button clicks
async function copyToClipboard(text) {
if (navigator.clipboard && navigator.clipboard.writeText) {
return navigator.clipboard.writeText(text);
} else {
return fallbackCopyTextToClipboard(text);
}
}
function fallbackCopyTextToClipboard(text) {
return new Promise((resolve, reject) => {
const textArea = document.createElement("textarea");
textArea.value = text;
// Avoid scrolling to bottom
textArea.style.top = "0";
textArea.style.left = "0";
textArea.style.position = "fixed";
document.body.appendChild(textArea);
textArea.focus();
textArea.select();
try {
const successful = document.execCommand("copy");
if (successful) {
resolve();
} else {
reject();
}
} catch (err) {
reject(err);
}
document.body.removeChild(textArea);
});
}
document.querySelectorAll(".copy-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const target = btn.dataset.target;
const code = document.getElementById(target).textContent;
//navigator.clipboard.writeText(code).then(() => {
copyToClipboard(code).then(() => {
btn.textContent = "Copied!";
setTimeout(() => {
btn.textContent = "Copy";
}, 2000);
});
});
});
document.addEventListener("DOMContentLoaded", async () => {
try {
const extractionResponse = await fetch("/strategies/extraction");
const extractionStrategies = await extractionResponse.json();
const chunkingResponse = await fetch("/strategies/chunking");
const chunkingStrategies = await chunkingResponse.json();
renderStrategies("extraction-strategies", extractionStrategies);
renderStrategies("chunking-strategies", chunkingStrategies);
} catch (error) {
console.error("Error fetching strategies:", error);
}
});
function renderStrategies(containerId, strategies) {
const container = document.getElementById(containerId);
container.innerHTML = ""; // Clear any existing content
strategies = JSON.parse(strategies);
Object.entries(strategies).forEach(([strategy, description]) => {
const strategyElement = document.createElement("div");
strategyElement.classList.add("bg-zinc-800", "p-4", "rounded", "shadow-md", "docs-item");
const strategyDescription = document.createElement("div");
strategyDescription.classList.add("text-gray-300", "prose", "prose-sm");
strategyDescription.innerHTML = marked.parse(description);
strategyElement.appendChild(strategyDescription);
container.appendChild(strategyElement);
});
}
document.querySelectorAll(".sidebar a").forEach((link) => {
link.addEventListener("click", function (event) {
event.preventDefault();
document.querySelectorAll(".content-section").forEach((section) => {
section.classList.remove("active");
});
const target = event.target.getAttribute("data-target");
document.getElementById(target).classList.add("active");
});
});
// Highlight code syntax
hljs.highlightAll();

971
pages/index copy.html Normal file
View File

@ -0,0 +1,971 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Crawl4AI</title>
<link rel="preconnect" href="https://fonts.googleapis.com" />
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@100..900&display=swap" rel="stylesheet" />
<!-- <link href="https://cdn.jsdelivr.net/npm/tailwindcss@3.4.3/dist/tailwind.min.css" rel="stylesheet" /> -->
<script src="https://cdn.tailwindcss.com"></script>
<script src="https://cdn.jsdelivr.net/npm/axios/dist/axios.min.js"></script>
<link
rel="stylesheet"
href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/styles/monokai.min.css"
/>
<script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/highlight.min.js"></script>
<style>
:root {
--ifm-font-size-base: 100%;
--ifm-line-height-base: 1.65;
--ifm-font-family-base: system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans,
sans-serif, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji",
"Segoe UI Emoji", "Segoe UI Symbol";
}
html {
-webkit-font-smoothing: antialiased;
-webkit-text-size-adjust: 100%;
text-size-adjust: 100%;
font: var(--ifm-font-size-base) / var(--ifm-line-height-base) var(--ifm-font-family-base);
}
body {
background-color: #1a202c;
color: #fff;
}
.tab-content {
max-height: 400px;
overflow: auto;
}
pre {
white-space: pre-wrap;
font-size: 14px;
}
pre code {
width: 100%;
}
</style>
<style>
/* Custom styling for docs-item class and Markdown generated elements */
.docs-item {
background-color: #2d3748; /* bg-gray-800 */
padding: 1rem; /* p-4 */
border-radius: 0.375rem; /* rounded */
box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1); /* shadow-md */
margin-bottom: 1rem; /* space between items */
}
.docs-item h3,
.docs-item h4 {
color: #ffffff; /* text-white */
font-size: 1.25rem; /* text-xl */
font-weight: 700; /* font-bold */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item p {
color: #e2e8f0; /* text-gray-300 */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item code {
background-color: #1a202c; /* bg-gray-900 */
color: #e2e8f0; /* text-gray-300 */
padding: 0.25rem 0.5rem; /* px-2 py-1 */
border-radius: 0.25rem; /* rounded */
}
.docs-item pre {
background-color: #1a202c; /* bg-gray-900 */
color: #e2e8f0; /* text-gray-300 */
padding: 0.5rem; /* p-2 */
border-radius: 0.375rem; /* rounded */
overflow: auto; /* overflow-auto */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item div {
color: #e2e8f0; /* text-gray-300 */
font-size: 1rem; /* prose prose-sm */
line-height: 1.25rem; /* line-height for readability */
}
/* Adjustments to make prose class more suitable for dark mode */
.prose {
max-width: none; /* max-w-none */
}
.prose p,
.prose ul {
margin-bottom: 1rem; /* mb-4 */
}
.prose code {
/* background-color: #4a5568; */ /* bg-gray-700 */
color: #65a30d; /* text-lime-600 */
padding: 0.25rem 0.5rem; /* px-1 py-0.5 */
border-radius: 0.25rem; /* rounded */
display: inline-block; /* inline-block */
}
.prose pre {
background-color: #1a202c; /* bg-gray-900 */
color: #ffffff; /* text-white */
padding: 0.5rem; /* p-2 */
border-radius: 0.375rem; /* rounded */
}
.prose h3 {
color: #65a30d; /* text-lime-600 */
font-size: 1.25rem; /* text-xl */
font-weight: 700; /* font-bold */
margin-bottom: 0.5rem; /* mb-2 */
}
</style>
</head>
<body class="bg-black text-gray-200">
<header class="bg-zinc-950 text-white py-4 flex">
<div class="mx-auto px-4">
<h1 class="text-2xl font-bold">🔥🕷️ Crawl4AI: Web Data for your Thoughts</h1>
</div>
<div class="mx-auto px-4 flex font-bold text-xl gap-2">
<span>📊 Total Website Processed</span>
<span id="total-count" class="text-lime-400">2</span>
</div>
</header>
<section class="try-it py-8 px-16 pb-20">
<div class="container mx-auto px-4">
<h2 class="text-2xl font-bold mb-4">Try It Now</h2>
<div class="grid grid-cols-1 lg:grid-cols-3 gap-4">
<div class="space-y-4">
<div class="flex flex-col">
<label for="url-input" class="text-lime-500 font-bold text-xs">URL(s)</label>
<input
type="text"
id="url-input"
value="https://www.nbcnews.com/business"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
placeholder="Enter URL(s) separated by commas"
/>
</div>
<div class="flex flex-col">
<label for="threshold" class="text-lime-500 font-bold text-xs">Min Words Threshold</label>
<select
id="threshold"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
>
<option value="5">5</option>
<option value="10" selected>10</option>
<option value="15">15</option>
<option value="20">20</option>
<option value="25">25</option>
</select>
</div>
<div class="flex flex-col">
<label for="css-selector" class="text-lime-500 font-bold text-xs">CSS Selector</label>
<input
type="text"
id="css-selector"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
placeholder="Enter CSS Selector"
/>
</div>
<div class="flex flex-col">
<label for="extraction-strategy-select" class="text-lime-500 font-bold text-xs"
>Extraction Strategy</label
>
<select
id="extraction-strategy-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-lime-500"
>
<option value="CosineStrategy">CosineStrategy</option>
<option value="LLMExtractionStrategy">LLMExtractionStrategy</option>
<option value="NoExtractionStrategy">NoExtractionStrategy</option>
</select>
</div>
<div class="flex flex-col">
<label for="chunking-strategy-select" class="text-lime-500 font-bold text-xs"
>Chunking Strategy</label
>
<select
id="chunking-strategy-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-lime-500"
>
<option value="RegexChunking">RegexChunking</option>
<option value="NlpSentenceChunking">NlpSentenceChunking</option>
<option value="TopicSegmentationChunking">TopicSegmentationChunking</option>
<option value="FixedLengthWordChunking">FixedLengthWordChunking</option>
<option value="SlidingWindowChunking">SlidingWindowChunking</option>
</select>
</div>
<div class="flex flex-col">
<label for="provider-model-select" class="text-lime-500 font-bold text-xs"
>Provider Model</label
>
<select
id="provider-model-select"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
disabled
>
<option value="groq/llama3-70b-8192">groq/llama3-70b-8192</option>
<option value="groq/llama3-8b-8192">groq/llama3-8b-8192</option>
<option value="openai/gpt-4-turbo">gpt-4-turbo</option>
<option value="openai/gpt-3.5-turbo">gpt-3.5-turbo</option>
<option value="anthropic/claude-3-haiku-20240307">claude-3-haiku</option>
<option value="anthropic/claude-3-opus-20240229">claude-3-opus</option>
<option value="anthropic/claude-3-sonnet-20240229">claude-3-sonnet</option>
</select>
</div>
<div class="flex flex-col">
<label for="token-input" class="text-lime-500 font-bold text-xs">API Token</label>
<input
type="password"
id="token-input"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
placeholder="Enter Groq API token"
disabled
/>
</div>
<div class="flex gap-3">
<div class="flex items-center gap-2">
<input type="checkbox" id="bypass-cache-checkbox" />
<label for="bypass-cache-checkbox" class="text-lime-500 font-bold">Bypass Cache</label>
</div>
<div class="flex items-center gap-2">
<input type="checkbox" id="extract-blocks-checkbox" checked />
<label for="extract-blocks-checkbox" class="text-lime-500 font-bold"
>Extract Blocks</label
>
</div>
<button id="crawl-btn" class="bg-lime-600 text-black font-bold px-4 py-0 rounded">
Crawl
</button>
</div>
</div>
<div id="result" class=" ">
<div id="loading" class="hidden">
<p class="text-white">Loading... Please wait.</p>
</div>
<div class="tab-buttons flex gap-2">
<button
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="json"
>
JSON
</button>
<button
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="cleaned-html"
>
Cleaned HTML
</button>
<button
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="markdown"
>
Markdown
</button>
</div>
<div class="tab-content code bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
<pre class="h-full flex"><code id="json-result" class="language-json"></code></pre>
<pre
class="hidden h-full flex"
><code id="cleaned-html-result" class="language-html"></code></pre>
<pre
class="hidden h-full flex"
><code id="markdown-result" class="language-markdown"></code></pre>
</div>
</div>
<div id="code_help" class=" ">
<div class="tab-buttons flex gap-2">
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="curl"
>
cURL
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="library"
>
Python Library
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="python"
>
Python (Request)
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="nodejs"
>
Node.js
</button>
</div>
<div class="tab-content result bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
<pre class="h-full flex relative">
<code id="curl-code" class="language-bash"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="curl-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="python-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="python-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="nodejs-code" class="language-javascript"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="nodejs-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="library-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="library-code">Copy</button>
</pre>
</div>
</div>
</div>
</div>
</section>
<section class="bg-zinc-900 text-zinc-300 p-6 px-20">
<div class="grid grid-cols-2 gap-4 p-4 bg-zinc-900 text-lime-500">
<!-- Step 1 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🌟 <strong>Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">
First Step: Create an instance of WebCrawler and call the <code>warmup()</code> function.
</div>
<div>
<pre><code class="language-python">crawler = WebCrawler()
crawler.warmup()</code></pre>
</div>
<!-- Step 2 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🧠 <strong>Understanding 'bypass_cache' and 'include_raw_html' parameters:</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">First crawl (caches the result):</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business")</code></pre>
</div>
<div class="bg-zinc-800 p-2 rounded">Second crawl (Force to crawl again):</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)</code></pre>
</div>
<div class="bg-zinc-800 p-2 rounded">Crawl result without raw HTML content:</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)</code></pre>
</div>
<!-- Step 3 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
📄
<strong
>The 'include_raw_html' parameter, when set to True, includes the raw HTML content in the
response. By default, it is set to True.</strong
>
</div>
<div class="bg-zinc-800 p-2 rounded">Set <code>always_by_pass_cache</code> to True:</div>
<div>
<pre><code class="language-python">crawler.always_by_pass_cache = True</code></pre>
</div>
<!-- Step 4 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🧩 <strong>Let's add a chunking strategy: RegexChunking!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">Using RegexChunking:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=RegexChunking(patterns=["\n\n"])
)</code></pre>
</div>
<div class="bg-zinc-800 p-2 rounded">Using NlpSentenceChunking:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=NlpSentenceChunking()
)</code></pre>
</div>
<!-- Step 5 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🧠 <strong>Let's get smarter with an extraction strategy: CosineStrategy!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">Using CosineStrategy:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
)</code></pre>
</div>
<!-- Step 6 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🤖 <strong>Time to bring in the big guns: LLMExtractionStrategy without instructions!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">Using LLMExtractionStrategy without instructions:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
)</code></pre>
</div>
<!-- Step 7 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
📜 <strong>Let's make it even more interesting: LLMExtractionStrategy with instructions!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">Using LLMExtractionStrategy with instructions:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="I am interested in only financial news"
)
)</code></pre>
</div>
<!-- Step 8 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🎯 <strong>Targeted extraction: Let's use a CSS selector to extract only H2 tags!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">Using CSS selector to extract H2 tags:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
css_selector="h2"
)</code></pre>
</div>
<!-- Step 9 -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🖱️ <strong>Let's get interactive: Passing JavaScript code to click 'Load More' button!</strong>
</div>
<div class="bg-zinc-800 p-2 rounded">Using JavaScript to click 'Load More' button:</div>
<div>
<pre><code class="language-python">js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business")</code></pre>
</div>
<!-- Conclusion -->
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
🎉
<strong
>Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl
the web like a pro! 🕸️</strong
>
</div>
</div>
</section>
<section class="bg-zinc-900 text-zinc-300 p-6 px-20">
<h1 class="text-3xl font-bold mb-4">Installation 💻</h1>
<p class="mb-4">
There are two ways to use Crawl4AI: as a library in your Python projects or as a standalone local
server.
</p>
<p class="mb-4">
You can also try Crawl4AI in a Google Colab
<a href="https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk"
><img
src="https://colab.research.google.com/assets/colab-badge.svg"
alt="Open In Colab"
style="display: inline-block; width: 100px; height: 20px"
/></a>
</p>
<h2 class="text-2xl font-bold mb-2">Using Crawl4AI as a Library 📚</h2>
<p class="mb-4">To install Crawl4AI as a library, follow these steps:</p>
<ol class="list-decimal list-inside mb-4">
<li class="mb-2">
Install the package from GitHub:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code>pip install git+https://github.com/unclecode/crawl4ai.git</code></pre>
</li>
<li class="mb-2">
Alternatively, you can clone the repository and install the package locally:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class = "language-python bash">virtualenv venv
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
</code></pre>
</li>
<li>
Import the necessary modules in your Python script:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class = "language-python hljs">from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
import os
crawler = WebCrawler()
# Single page crawl
single_url = UrlModel(url='https://www.nbcnews.com/business', forced=False)
result = crawl4ai.fetch_page(
url='https://www.nbcnews.com/business',
word_count_threshold=5, # Minimum word count for a HTML tag to be considered as a worthy block
chunking_strategy= RegexChunking( patterns = ["\\n\\n"]), # Default is RegexChunking
extraction_strategy= CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3) # Default is CosineStrategy
# extraction_strategy= LLMExtractionStrategy(provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY')),
bypass_cache=False,
extract_blocks =True, # Whether to extract semantical blocks of text from the HTML
css_selector = "", # Eg: "div.article-body"
verbose=True,
include_raw_html=True, # Whether to include the raw HTML content in the response
)
print(result.model_dump())
</code></pre>
</li>
</ol>
<p class="mb-4">
For more information about how to run Crawl4AI as a local server, please refer to the
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
</p>
</section>
<section class="bg-zinc-900 text-zinc-300 p-6 px-20">
<h1 class="text-3xl font-bold mb-4">📖 Parameters</h1>
<div class="overflow-x-auto">
<table class="min-w-full bg-zinc-800 border border-zinc-700">
<thead>
<tr>
<th class="py-2 px-4 border-b border-zinc-700">Parameter</th>
<th class="py-2 px-4 border-b border-zinc-700">Description</th>
<th class="py-2 px-4 border-b border-zinc-700">Required</th>
<th class="py-2 px-4 border-b border-zinc-700">Default Value</th>
</tr>
</thead>
<tbody>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">urls</td>
<td class="py-2 px-4 border-b border-zinc-700">
A list of URLs to crawl and extract data from.
</td>
<td class="py-2 px-4 border-b border-zinc-700">Yes</td>
<td class="py-2 px-4 border-b border-zinc-700">-</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">include_raw_html</td>
<td class="py-2 px-4 border-b border-zinc-700">
Whether to include the raw HTML content in the response.
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">false</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">bypass_cache</td>
<td class="py-2 px-4 border-b border-zinc-700">
Whether to force a fresh crawl even if the URL has been previously crawled.
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">false</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">extract_blocks</td>
<td class="py-2 px-4 border-b border-zinc-700">
Whether to extract semantical blocks of text from the HTML.
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">true</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">word_count_threshold</td>
<td class="py-2 px-4 border-b border-zinc-700">
The minimum number of words a block must contain to be considered meaningful (minimum
value is 5).
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">5</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">extraction_strategy</td>
<td class="py-2 px-4 border-b border-zinc-700">
The strategy to use for extracting content from the HTML (e.g., "CosineStrategy").
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">CosineStrategy</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">chunking_strategy</td>
<td class="py-2 px-4 border-b border-zinc-700">
The strategy to use for chunking the text before processing (e.g., "RegexChunking").
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">RegexChunking</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">css_selector</td>
<td class="py-2 px-4 border-b border-zinc-700">
The CSS selector to target specific parts of the HTML for extraction.
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">None</td>
</tr>
<tr>
<td class="py-2 px-4">verbose</td>
<td class="py-2 px-4">Whether to enable verbose logging.</td>
<td class="py-2 px-4">No</td>
<td class="py-2 px-4">true</td>
</tr>
</tbody>
</table>
</div>
</section>
<section id="extraction" class="py-8 px-20">
<div class="overflow-x-auto mx-auto px-6">
<h2 class="text-2xl font-bold mb-4">Extraction Strategies</h2>
<div id="extraction-strategies" class="space-y-4"></div>
</div>
</section>
<section id="chunking" class="py-8 px-20">
<div class="overflow-x-auto mx-auto px-6">
<h2 class="text-2xl font-bold mb-4">Chunking Strategies</h2>
<div id="chunking-strategies" class="space-y-4"></div>
</div>
</section>
<section class="hero bg-zinc-900 py-8 px-20">
<div class="container mx-auto px-4">
<h2 class="text-3xl font-bold mb-4">🤔 Why building this?</h2>
<p class="text-lg mb-4">
In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging
for services that should rightfully be accessible to everyone. 🌍💸 One such example is scraping and
crawling web pages and transforming them into a format suitable for Large Language Models (LLMs).
🕸️🤖 We believe that building a business around this is not the right approach; instead, it should
definitely be open-source. 🆓🌟 So, if you possess the skills to build such tools and share our
philosophy, we invite you to join our "Robinhood" band and help set these products free for the
benefit of all. 🤝💪
</p>
</div>
</section>
<section class="installation py-8 px-20">
<div class="container mx-auto px-4">
<h2 class="text-2xl font-bold mb-4">⚙️ Installation</h2>
<p class="mb-4">
To install and run Crawl4AI as a library or a local server, please refer to the 📚
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
</p>
</div>
</section>
<footer class="bg-zinc-900 text-white py-4">
<div class="container mx-auto px-4">
<div class="flex justify-between items-center">
<p>© 2024 Crawl4AI. All rights reserved.</p>
<div class="social-links">
<a
href="https://github.com/unclecode/crawl4ai"
class="text-white hover:text-gray-300 mx-2"
target="_blank"
>😺 GitHub</a
>
<a
href="https://twitter.com/unclecode"
class="text-white hover:text-gray-300 mx-2"
target="_blank"
>🐦 Twitter</a
>
</div>
</div>
</div>
</footer>
<script>
// JavaScript to manage dynamic form changes and logic
document.getElementById("extraction-strategy-select").addEventListener("change", function () {
const strategy = this.value;
const providerModelSelect = document.getElementById("provider-model-select");
const tokenInput = document.getElementById("token-input");
if (strategy === "LLMExtractionStrategy") {
providerModelSelect.disabled = false;
tokenInput.disabled = false;
} else {
providerModelSelect.disabled = true;
tokenInput.disabled = true;
}
});
// Get the selected provider model and token from local storage
const storedProviderModel = localStorage.getItem("provider_model");
const storedToken = localStorage.getItem(storedProviderModel);
if (storedProviderModel) {
document.getElementById("provider-model-select").value = storedProviderModel;
}
if (storedToken) {
document.getElementById("token-input").value = storedToken;
}
// Handle provider model dropdown change
document.getElementById("provider-model-select").addEventListener("change", () => {
const selectedProviderModel = document.getElementById("provider-model-select").value;
const storedToken = localStorage.getItem(selectedProviderModel);
if (storedToken) {
document.getElementById("token-input").value = storedToken;
} else {
document.getElementById("token-input").value = "";
}
});
// Fetch total count from the database
axios
.get("/total-count")
.then((response) => {
document.getElementById("total-count").textContent = response.data.count;
})
.catch((error) => console.error(error));
// Handle crawl button click
document.getElementById("crawl-btn").addEventListener("click", () => {
// validate input to have both URL and API token
if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) {
alert("Please enter both URL(s) and API token.");
return;
}
const selectedProviderModel = document.getElementById("provider-model-select").value;
const apiToken = document.getElementById("token-input").value;
const extractBlocks = document.getElementById("extract-blocks-checkbox").checked;
const bypassCache = document.getElementById("bypass-cache-checkbox").checked;
// Save the selected provider model and token to local storage
localStorage.setItem("provider_model", selectedProviderModel);
localStorage.setItem(selectedProviderModel, apiToken);
const urlsInput = document.getElementById("url-input").value;
const urls = urlsInput.split(",").map((url) => url.trim());
const data = {
urls: urls,
provider_model: selectedProviderModel,
api_token: apiToken,
include_raw_html: true,
bypass_cache: bypassCache,
extract_blocks: extractBlocks,
word_count_threshold: parseInt(document.getElementById("threshold").value),
extraction_strategy: document.getElementById("extraction-strategy-select").value,
chunking_strategy: document.getElementById("chunking-strategy-select").value,
css_selector: document.getElementById("css-selector").value,
verbose: true,
};
// save api token to local storage
localStorage.setItem("api_token", document.getElementById("token-input").value);
document.getElementById("loading").classList.remove("hidden");
//document.getElementById("result").classList.add("hidden");
//document.getElementById("code_help").classList.add("hidden");
axios
.post("/crawl", data)
.then((response) => {
const result = response.data.results[0];
const parsedJson = JSON.parse(result.extracted_content);
document.getElementById("json-result").textContent = JSON.stringify(parsedJson, null, 2);
document.getElementById("cleaned-html-result").textContent = result.cleaned_html;
document.getElementById("markdown-result").textContent = result.markdown;
// Update code examples dynamically
const extractionStrategy = data.extraction_strategy;
const isLLMExtraction = extractionStrategy === "LLMExtractionStrategy";
document.getElementById(
"curl-code"
).textContent = `curl -X POST -H "Content-Type: application/json" -d '${JSON.stringify({
...data,
api_token: isLLMExtraction ? "your_api_token" : undefined,
})}' http://crawl4ai.uccode.io/crawl`;
document.getElementById(
"python-code"
).textContent = `import requests\n\ndata = ${JSON.stringify(
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
)}\n\nresponse = requests.post("http://crawl4ai.uccode.io/crawl", json=data) # OR localhost if you run locally \nprint(response.json())`;
document.getElementById(
"nodejs-code"
).textContent = `const axios = require('axios');\n\nconst data = ${JSON.stringify(
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
)};\n\naxios.post("http://crawl4ai.uccode.io/crawl", data) // OR localhost if you run locally \n	.then(response => console.log(response.data))\n	.catch(error => console.error(error));`;
document.getElementById(
"library-code"
).textContent = `from crawl4ai.web_crawler import WebCrawler\nfrom crawl4ai.extraction_strategy import *\nfrom crawl4ai.chunking_strategy import *\n\ncrawler = WebCrawler()\ncrawler.warmup()\n\nresult = crawler.run(\n url='${
urls[0]
}',\n word_count_threshold=${data.word_count_threshold},\n extraction_strategy=${
isLLMExtraction
? `${extractionStrategy}(provider="${data.provider_model}", api_token="${data.api_token}")`
: extractionStrategy + "()"
},\n chunking_strategy=${data.chunking_strategy}(),\n bypass_cache=${
data.bypass_cache
},\n css_selector="${data.css_selector}"\n)\nprint(result)`;
// Highlight code syntax
hljs.highlightAll();
// Select JSON tab by default
document.querySelector('.tab-btn[data-tab="json"]').click();
document.getElementById("loading").classList.add("hidden");
document.getElementById("result").classList.remove("hidden");
document.getElementById("code_help").classList.remove("hidden");
// increment the total count
document.getElementById("total-count").textContent =
parseInt(document.getElementById("total-count").textContent) + 1;
})
.catch((error) => {
console.error(error);
document.getElementById("loading").classList.add("hidden");
});
});
// Handle tab clicks
document.querySelectorAll(".tab-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const tab = btn.dataset.tab;
document
.querySelectorAll(".tab-btn")
.forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
btn.classList.add("bg-lime-700", "text-white");
document.querySelectorAll(".tab-content.code pre").forEach((el) => el.classList.add("hidden"));
document.getElementById(`${tab}-result`).parentElement.classList.remove("hidden");
});
});
// Handle code tab clicks
document.querySelectorAll(".code-tab-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const tab = btn.dataset.tab;
document
.querySelectorAll(".code-tab-btn")
.forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
btn.classList.add("bg-lime-700", "text-white");
document.querySelectorAll(".tab-content.result pre").forEach((el) => el.classList.add("hidden"));
document.getElementById(`${tab}-code`).parentElement.classList.remove("hidden");
});
});
// Handle copy to clipboard button clicks
async function copyToClipboard(text) {
if (navigator.clipboard && navigator.clipboard.writeText) {
return navigator.clipboard.writeText(text);
} else {
return fallbackCopyTextToClipboard(text);
}
}
function fallbackCopyTextToClipboard(text) {
return new Promise((resolve, reject) => {
const textArea = document.createElement("textarea");
textArea.value = text;
// Avoid scrolling to bottom
textArea.style.top = "0";
textArea.style.left = "0";
textArea.style.position = "fixed";
document.body.appendChild(textArea);
textArea.focus();
textArea.select();
try {
const successful = document.execCommand("copy");
if (successful) {
resolve();
} else {
reject();
}
} catch (err) {
reject(err);
}
document.body.removeChild(textArea);
});
}
document.querySelectorAll(".copy-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const target = btn.dataset.target;
const code = document.getElementById(target).textContent;
//navigator.clipboard.writeText(code).then(() => {
copyToClipboard(code).then(() => {
btn.textContent = "Copied!";
setTimeout(() => {
btn.textContent = "Copy";
}, 2000);
});
});
});
document.addEventListener("DOMContentLoaded", async () => {
try {
const extractionResponse = await fetch("/strategies/extraction");
const extractionStrategies = await extractionResponse.json();
const chunkingResponse = await fetch("/strategies/chunking");
const chunkingStrategies = await chunkingResponse.json();
renderStrategies("extraction-strategies", extractionStrategies);
renderStrategies("chunking-strategies", chunkingStrategies);
} catch (error) {
console.error("Error fetching strategies:", error);
}
});
function renderStrategies(containerId, strategies) {
const container = document.getElementById(containerId);
container.innerHTML = ""; // Clear any existing content
strategies = JSON.parse(strategies);
Object.entries(strategies).forEach(([strategy, description]) => {
const strategyElement = document.createElement("div");
strategyElement.classList.add("bg-zinc-800", "p-4", "rounded", "shadow-md", "docs-item");
const strategyDescription = document.createElement("div");
strategyDescription.classList.add("text-gray-300", "prose", "prose-sm");
strategyDescription.innerHTML = marked.parse(description);
strategyElement.appendChild(strategyDescription);
container.appendChild(strategyElement);
});
}
// Highlight code syntax
hljs.highlightAll();
</script>
</body>
</html>

pages/index.html
View File

@ -12,6 +12,7 @@
<!-- <link href="https://cdn.jsdelivr.net/npm/tailwindcss@3.4.3/dist/tailwind.min.css" rel="stylesheet" /> -->
<script src="https://cdn.tailwindcss.com"></script>
<script src="https://cdn.jsdelivr.net/npm/axios/dist/axios.min.js"></script>
<link rel="stylesheet" href="/pages/app.css" />
<link
rel="stylesheet"
href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/styles/monokai.min.css"
@ -19,116 +20,10 @@
<script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/highlight.min.js"></script>
<style>
:root {
--ifm-font-size-base: 100%;
--ifm-line-height-base: 1.65;
--ifm-font-family-base: system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans,
sans-serif, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji",
"Segoe UI Emoji", "Segoe UI Symbol";
}
html {
-webkit-font-smoothing: antialiased;
-webkit-text-size-adjust: 100%;
text-size-adjust: 100%;
font: var(--ifm-font-size-base) / var(--ifm-line-height-base) var(--ifm-font-family-base);
}
body {
background-color: #1a202c;
color: #fff;
}
.tab-content {
max-height: 400px;
overflow: auto;
}
pre {
white-space: pre-wrap;
font-size: 14px;
}
pre code {
width: 100%;
}
</style>
<style>
/* Custom styling for docs-item class and Markdown generated elements */
.docs-item {
background-color: #2d3748; /* bg-gray-800 */
padding: 1rem; /* p-4 */
border-radius: 0.375rem; /* rounded */
box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1); /* shadow-md */
margin-bottom: 1rem; /* space between items */
}
.docs-item h3,
.docs-item h4 {
color: #ffffff; /* text-white */
font-size: 1.25rem; /* text-xl */
font-weight: 700; /* font-bold */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item p {
color: #e2e8f0; /* text-gray-300 */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item code {
background-color: #1a202c; /* bg-gray-900 */
color: #e2e8f0; /* text-gray-300 */
padding: 0.25rem 0.5rem; /* px-2 py-1 */
border-radius: 0.25rem; /* rounded */
}
.docs-item pre {
background-color: #1a202c; /* bg-gray-900 */
color: #e2e8f0; /* text-gray-300 */
padding: 0.5rem; /* p-2 */
border-radius: 0.375rem; /* rounded */
overflow: auto; /* overflow-auto */
margin-bottom: 0.5rem; /* mb-2 */
}
.docs-item div {
color: #e2e8f0; /* text-gray-300 */
font-size: 1rem; /* prose prose-sm */
line-height: 1.25rem; /* line-height for readability */
}
/* Adjustments to make prose class more suitable for dark mode */
.prose {
max-width: none; /* max-w-none */
}
.prose p,
.prose ul {
margin-bottom: 1rem; /* mb-4 */
}
.prose code {
/* background-color: #4a5568; */ /* bg-gray-700 */
color: #65a30d; /* text-white */
padding: 0.25rem 0.5rem; /* px-1 py-0.5 */
border-radius: 0.25rem; /* rounded */
display: inline-block; /* inline-block */
}
.prose pre {
background-color: #1a202c; /* bg-gray-900 */
color: #ffffff; /* text-white */
padding: 0.5rem; /* p-2 */
border-radius: 0.375rem; /* rounded */
}
.prose h3 {
color: #65a30d; /* text-white */
font-size: 1.25rem; /* text-xl */
font-weight: 700; /* font-bold */
margin-bottom: 0.5rem; /* mb-2 */
}
</style>
</head>
<body class="bg-black text-gray-200">
<header class="bg-zinc-950 text-white py-4 flex">
<header class="bg-zinc-950 text-lime-500 py-4 flex">
<div class="mx-auto px-4">
<h1 class="text-2xl font-bold">🔥🕷️ Crawl4AI: Web Data for your Thoughts</h1>
</div>
@ -137,675 +32,42 @@
<span id="total-count" class="text-lime-400">2</span>
</div>
</header>
{{ try_it | safe }}
<section class="try-it py-8 px-16 pb-20">
<div class="container mx-auto px-4">
<h2 class="text-2xl font-bold mb-4">Try It Now</h2>
<div class="grid grid-cols-1 lg:grid-cols-3 gap-4">
<div class="space-y-4">
<div class="flex flex-col">
<label for="url-input" class="text-lime-500 font-bold text-xs">URL(s)</label>
<input
type="text"
id="url-input"
value="https://www.nbcnews.com/business"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
placeholder="Enter URL(s) separated by commas"
/>
</div>
<div class="flex flex-col">
<label for="threshold" class="text-lime-500 font-bold text-xs">Min Words Threshold</label>
<select
id="threshold"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
>
<option value="5">5</option>
<option value="10" selected>10</option>
<option value="15">15</option>
<option value="20">20</option>
<option value="25">25</option>
</select>
</div>
<div class="flex flex-col">
<label for="css-selector" class="text-lime-500 font-bold text-xs">CSS Selector</label>
<input
type="text"
id="css-selector"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
placeholder="Enter CSS Selector"
/>
</div>
<div class="flex flex-col">
<label for="extraction-strategy-select" class="text-lime-500 font-bold text-xs"
>Extraction Strategy</label
>
<select
id="extraction-strategy-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-lime-500"
>
<option value="CosineStrategy">CosineStrategy</option>
<option value="LLMExtractionStrategy">LLMExtractionStrategy</option>
<option value="NoExtractionStrategy">NoExtractionStrategy</option>
</select>
</div>
<div class="flex flex-col">
<label for="chunking-strategy-select" class="text-lime-500 font-bold text-xs"
>Chunking Strategy</label
>
<select
id="chunking-strategy-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-lime-500"
>
<option value="RegexChunking">RegexChunking</option>
<option value="NlpSentenceChunking">NlpSentenceChunking</option>
<option value="TopicSegmentationChunking">TopicSegmentationChunking</option>
<option value="FixedLengthWordChunking">FixedLengthWordChunking</option>
<option value="SlidingWindowChunking">SlidingWindowChunking</option>
</select>
</div>
<div class="flex flex-col">
<label for="provider-model-select" class="text-lime-500 font-bold text-xs"
>Provider Model</label
>
<select
id="provider-model-select"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
disabled
>
<option value="groq/llama3-70b-8192">groq/llama3-70b-8192</option>
<option value="groq/llama3-8b-8192">groq/llama3-8b-8192</option>
<option value="openai/gpt-4-turbo">gpt-4-turbo</option>
<option value="openai/gpt-3.5-turbo">gpt-3.5-turbo</option>
<option value="anthropic/claude-3-haiku-20240307">claude-3-haiku</option>
<option value="anthropic/claude-3-opus-20240229">claude-3-opus</option>
<option value="anthropic/claude-3-sonnet-20240229">claude-3-sonnet</option>
</select>
</div>
<div class="flex flex-col">
<label for="token-input" class="text-lime-500 font-bold text-xs">API Token</label>
<input
type="password"
id="token-input"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
placeholder="Enter Groq API token"
disabled
/>
</div>
<div class="flex gap-3">
<div class="flex items-center gap-2">
<input type="checkbox" id="bypass-cache-checkbox" />
<label for="bypass-cache-checkbox" class="text-lime-500 font-bold">Bypass Cache</label>
</div>
<div class="flex items-center gap-2">
<input type="checkbox" id="extract-blocks-checkbox" checked />
<label for="extract-blocks-checkbox" class="text-lime-500 font-bold"
>Extract Blocks</label
>
</div>
<button id="crawl-btn" class="bg-lime-600 text-black font-bold px-4 py-0 rounded">
Crawl
</button>
</div>
<div class="mx-auto p-4 bg-zinc-950 text-lime-500 min-h-screen">
<div class="container mx-auto">
<div class="flex h-full px-20">
<div class="sidebar w-1/4 p-4">
<h2 class="text-lg font-bold mb-4">Outline</h2>
<ul>
<li class="mb-2"><a href="#" data-target="installation">Installation</a></li>
<li class="mb-2"><a href="#" data-target="how-to-guide">How to Guide</a></li>
<li class="mb-2"><a href="#" data-target="chunking-strategies">Chunking Strategies</a></li>
<li class="mb-2">
<a href="#" data-target="extraction-strategies">Extraction Strategies</a>
</li>
</ul>
</div>
<div id="result" class=" ">
<div id="loading" class="hidden">
<p class="text-white">Loading... Please wait.</p>
</div>
<div class="tab-buttons flex gap-2">
<button
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="json"
>
JSON
</button>
<button
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="cleaned-html"
>
Cleaned HTML
</button>
<button
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="markdown"
>
Markdown
</button>
</div>
<div class="tab-content code bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
<pre class="h-full flex"><code id="json-result" class="language-json"></code></pre>
<pre
class="hidden h-full flex"
><code id="cleaned-html-result" class="language-html"></code></pre>
<pre
class="hidden h-full flex"
><code id="markdown-result" class="language-markdown"></code></pre>
</div>
</div>
<!-- Main Content -->
<div class="w-3/4 p-4">
{{installation | safe}} {{how_to_guide | safe}}
<div id="code_help" class=" ">
<div class="tab-buttons flex gap-2">
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="curl"
>
cURL
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="library"
>
Python Library
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="python"
>
Python (Request)
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="nodejs"
>
Node.js
</button>
</div>
<div class="tab-content result bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
<pre class="h-full flex relative">
<code id="curl-code" class="language-bash"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="curl-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="python-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="python-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="nodejs-code" class="language-javascript"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="nodejs-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="library-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="library-code">Copy</button>
</pre>
</div>
<section id="chunking-strategies" class="content-section">
<h1 class="text-2xl font-bold">Chunking Strategies</h1>
<p>Content for chunking strategies...</p>
</section>
<section id="extraction-strategies" class="content-section">
<h1 class="text-2xl font-bold">Extraction Strategies</h1>
<p>Content for extraction strategies...</p>
</section>
</div>
</div>
</div>
</section>
<section class="bg-zinc-900 text-zinc-300 p-6 px-20">
<h1 class="text-3xl font-bold mb-4">Installation 💻</h1>
<p class="mb-4">There are two ways to use Crawl4AI: as a library in your Python projects or as a standalone local server.</p>
<p class="mb-4">You can also try Crawl4AI in a Google Colab <a href = "https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" style="display: inline-block; width: 100px; height: 20px;"/></a></p>
</div>
<h2 class="text-2xl font-bold mb-2">Using Crawl4AI as a Library 📚</h2>
<p class="mb-4">To install Crawl4AI as a library, follow these steps:</p>
<ol class="list-decimal list-inside mb-4">
<li class="mb-2">
Install the package from GitHub:
<pre class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"><code>pip install git+https://github.com/unclecode/crawl4ai.git</code></pre>
</li>
<li class="mb-2">
Alternatively, you can clone the repository and install the package locally:
<pre class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"><code class = "language-python bash">virtualenv venv
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
</code></pre>
</li>
<li>
Import the necessary modules in your Python script:
<pre class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"><code class = "language-python hljs">from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
import os
crawler = WebCrawler()
crawler.warmup()
# Single page crawl
result = crawler.run(
    url='https://www.nbcnews.com/business',
    word_count_threshold=5, # Minimum word count for an HTML tag to be considered a worthy block
    chunking_strategy=RegexChunking(patterns=["\\n\\n"]), # Default is RegexChunking
    extraction_strategy=CosineStrategy(word_count_threshold=20, max_dist=0.2, linkage_method='ward', top_k=3), # Default is CosineStrategy
    # extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
    bypass_cache=False,
    extract_blocks=True, # Whether to extract semantic blocks of text from the HTML
    css_selector="", # E.g. "div.article-body"
    verbose=True,
    include_raw_html=True, # Whether to include the raw HTML content in the response
)
print(result.model_dump())
</code></pre>
</li>
</ol>
<p class="mb-4">For more information about how to run Crawl4AI as a local server, please refer to the <a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.</p>
<a href="
</section>
<section class="bg-zinc-900 text-zinc-300 p-6 px-20">
<h1 class="text-3xl font-bold mb-4">📖 Parameters</h1>
<div class="overflow-x-auto">
<table class="min-w-full bg-zinc-800 border border-zinc-700">
<thead>
<tr>
<th class="py-2 px-4 border-b border-zinc-700">Parameter</th>
<th class="py-2 px-4 border-b border-zinc-700">Description</th>
<th class="py-2 px-4 border-b border-zinc-700">Required</th>
<th class="py-2 px-4 border-b border-zinc-700">Default Value</th>
</tr>
</thead>
<tbody>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">urls</td>
<td class="py-2 px-4 border-b border-zinc-700">
A list of URLs to crawl and extract data from.
</td>
<td class="py-2 px-4 border-b border-zinc-700">Yes</td>
<td class="py-2 px-4 border-b border-zinc-700">-</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">include_raw_html</td>
<td class="py-2 px-4 border-b border-zinc-700">
Whether to include the raw HTML content in the response.
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">false</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">bypass_cache</td>
<td class="py-2 px-4 border-b border-zinc-700">
Whether to force a fresh crawl even if the URL has been previously crawled.
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">false</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">extract_blocks</td>
<td class="py-2 px-4 border-b border-zinc-700">
Whether to extract semantic blocks of text from the HTML.
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">true</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">word_count_threshold</td>
<td class="py-2 px-4 border-b border-zinc-700">
The minimum number of words a block must contain to be considered meaningful (minimum
value is 5).
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">5</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">extraction_strategy</td>
<td class="py-2 px-4 border-b border-zinc-700">
The strategy to use for extracting content from the HTML (e.g., "CosineStrategy").
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">CosineStrategy</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">chunking_strategy</td>
<td class="py-2 px-4 border-b border-zinc-700">
The strategy to use for chunking the text before processing (e.g., "RegexChunking").
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">RegexChunking</td>
</tr>
<tr>
<td class="py-2 px-4 border-b border-zinc-700">css_selector</td>
<td class="py-2 px-4 border-b border-zinc-700">
The CSS selector to target specific parts of the HTML for extraction.
</td>
<td class="py-2 px-4 border-b border-zinc-700">No</td>
<td class="py-2 px-4 border-b border-zinc-700">None</td>
</tr>
<tr>
<td class="py-2 px-4">verbose</td>
<td class="py-2 px-4">Whether to enable verbose logging.</td>
<td class="py-2 px-4">No</td>
<td class="py-2 px-4">true</td>
</tr>
</tbody>
</table>
</div>
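<p class="mb-4 mt-4">
These parameters correspond closely to the keyword arguments of <code>crawler.run()</code> when Crawl4AI is used as a library. The snippet below is only an illustrative sketch showing how the defaults from the table map onto a call (the REST API takes a list of <code>urls</code>, while the library call takes a single <code>url</code>):
</p>
<pre class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"><code class="language-python">from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import RegexChunking
from crawl4ai.extraction_strategy import CosineStrategy

crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url="https://www.nbcnews.com/business",  # one URL from the urls list
    include_raw_html=False,                  # default: false
    bypass_cache=False,                      # default: false
    extract_blocks=True,                     # default: true
    word_count_threshold=5,                  # default: 5
    extraction_strategy=CosineStrategy(),    # default: CosineStrategy
    chunking_strategy=RegexChunking(),       # default: RegexChunking
    css_selector="",                         # default: no selector
    verbose=True,                            # default: true
)
print(result)
</code></pre>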
</section>
<section id="extraction" class="py-8 px-20">
<div class="overflow-x-auto mx-auto px-6">
<h2 class="text-2xl font-bold mb-4">Extraction Strategies</h2>
<div id="extraction-strategies" class="space-y-4"></div>
</div>
</section>
<section id="chunking" class="py-8 px-20">
<div class="overflow-x-auto mx-auto px-6">
<h2 class="text-2xl font-bold mb-4">Chunking Strategies</h2>
<div id="chunking-strategies" class="space-y-4"></div>
</div>
</section>
<section class="hero bg-zinc-900 py-8 px-20">
<div class="container mx-auto px-4">
<h2 class="text-3xl font-bold mb-4">🤔 Why building this?</h2>
<p class="text-lg mb-4">
In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging
for services that should rightfully be accessible to everyone. 🌍💸 One such example is scraping and
crawling web pages and transforming them into a format suitable for Large Language Models (LLMs).
🕸️🤖 We believe that building a business around this is not the right approach; instead, it should
definitely be open-source. 🆓🌟 So, if you possess the skills to build such tools and share our
philosophy, we invite you to join our "Robinhood" band and help set these products free for the
benefit of all. 🤝💪
</p>
</div>
</section>
<section class="installation py-8 px-20">
<div class="container mx-auto px-4">
<h2 class="text-2xl font-bold mb-4">⚙️ Installation</h2>
<p class="mb-4">
To install and run Crawl4AI as a library or a local server, please refer to the 📚
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
</p>
</div>
</section>
<footer class="bg-zinc-900 text-white py-4">
<div class="container mx-auto px-4">
<div class="flex justify-between items-center">
<p>© 2024 Crawl4AI. All rights reserved.</p>
<div class="social-links">
<a
href="https://github.com/unclecode/crawl4ai"
class="text-white hover:text-gray-300 mx-2"
target="_blank"
>😺 GitHub</a
>
<a
href="https://twitter.com/unclecode"
class="text-white hover:text-gray-300 mx-2"
target="_blank"
>🐦 Twitter</a
>
</div>
</div>
</div>
</footer>
<script>
// JavaScript to manage dynamic form changes and logic
document.getElementById("extraction-strategy-select").addEventListener("change", function () {
const strategy = this.value;
const providerModelSelect = document.getElementById("provider-model-select");
const tokenInput = document.getElementById("token-input");
if (strategy === "LLMExtractionStrategy") {
providerModelSelect.disabled = false;
tokenInput.disabled = false;
} else {
providerModelSelect.disabled = true;
tokenInput.disabled = true;
}
});
// Get the selected provider model and token from local storage
const storedProviderModel = localStorage.getItem("provider_model");
const storedToken = localStorage.getItem(storedProviderModel);
if (storedProviderModel) {
document.getElementById("provider-model-select").value = storedProviderModel;
}
if (storedToken) {
document.getElementById("token-input").value = storedToken;
}
// Handle provider model dropdown change
document.getElementById("provider-model-select").addEventListener("change", () => {
const selectedProviderModel = document.getElementById("provider-model-select").value;
const storedToken = localStorage.getItem(selectedProviderModel);
if (storedToken) {
document.getElementById("token-input").value = storedToken;
} else {
document.getElementById("token-input").value = "";
}
});
// Fetch total count from the database
axios
.get("/total-count")
.then((response) => {
document.getElementById("total-count").textContent = response.data.count;
})
.catch((error) => console.error(error));
// Handle crawl button click
document.getElementById("crawl-btn").addEventListener("click", () => {
// validate input to have both URL and API token
if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) {
alert("Please enter both URL(s) and API token.");
return;
}
const selectedProviderModel = document.getElementById("provider-model-select").value;
const apiToken = document.getElementById("token-input").value;
const extractBlocks = document.getElementById("extract-blocks-checkbox").checked;
const bypassCache = document.getElementById("bypass-cache-checkbox").checked;
// Save the selected provider model and token to local storage
localStorage.setItem("provider_model", selectedProviderModel);
localStorage.setItem(selectedProviderModel, apiToken);
const urlsInput = document.getElementById("url-input").value;
const urls = urlsInput.split(",").map((url) => url.trim());
const data = {
urls: urls,
provider_model: selectedProviderModel,
api_token: apiToken,
include_raw_html: true,
bypass_cache: bypassCache,
extract_blocks: extractBlocks,
word_count_threshold: parseInt(document.getElementById("threshold").value),
extraction_strategy: document.getElementById("extraction-strategy-select").value,
chunking_strategy: document.getElementById("chunking-strategy-select").value,
css_selector: document.getElementById("css-selector").value,
verbose: true,
};
// save api token to local storage
localStorage.setItem("api_token", document.getElementById("token-input").value);
document.getElementById("loading").classList.remove("hidden");
//document.getElementById("result").classList.add("hidden");
//document.getElementById("code_help").classList.add("hidden");
axios
.post("/crawl", data)
.then((response) => {
const result = response.data.results[0];
const parsedJson = JSON.parse(result.parsed_json);
document.getElementById("json-result").textContent = JSON.stringify(parsedJson, null, 2);
document.getElementById("cleaned-html-result").textContent = result.cleaned_html;
document.getElementById("markdown-result").textContent = result.markdown;
// Update code examples dynamically
const extractionStrategy = data.extraction_strategy;
const isLLMExtraction = extractionStrategy === "LLMExtractionStrategy";
document.getElementById(
"curl-code"
).textContent = `curl -X POST -H "Content-Type: application/json" -d '${JSON.stringify({
...data,
api_token: isLLMExtraction ? "your_api_token" : undefined,
})}' http://crawl4ai.uccode.io/crawl`;
document.getElementById(
"python-code"
).textContent = `import requests\n\ndata = ${JSON.stringify(
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
)}\n\nresponse = requests.post("http://crawl4ai.uccode.io/crawl", json=data) # or your localhost URL if you run the server locally \nprint(response.json())`;
document.getElementById(
"nodejs-code"
).textContent = `const axios = require('axios');\n\nconst data = ${JSON.stringify(
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
null,
2
)};\n\naxios.post("http://crawl4ai.uccode.io/crawl", data) // or your localhost URL if you run the server locally \n  .then(response => console.log(response.data))\n  .catch(error => console.error(error));`;
document.getElementById(
"library-code"
).textContent = `from crawl4ai.web_crawler import WebCrawler\nfrom crawl4ai.extraction_strategy import *\nfrom crawl4ai.chunking_strategy import *\n\ncrawler = WebCrawler()\ncrawler.warmup()\n\nresult = crawler.run(\n url='${
urls[0]
}',\n word_count_threshold=${data.word_count_threshold},\n extraction_strategy=${
isLLMExtraction
? `${extractionStrategy}(provider="${data.provider_model}", api_token="${data.api_token}")`
: extractionStrategy + "()"
},\n chunking_strategy=${data.chunking_strategy}(),\n bypass_cache=${
data.bypass_cache
},\n css_selector="${data.css_selector}"\n)\nprint(result)`;
// Highlight code syntax
hljs.highlightAll();
// Select JSON tab by default
document.querySelector('.tab-btn[data-tab="json"]').click();
document.getElementById("loading").classList.add("hidden");
document.getElementById("result").classList.remove("hidden");
document.getElementById("code_help").classList.remove("hidden");
// increment the total count
document.getElementById("total-count").textContent =
parseInt(document.getElementById("total-count").textContent) + 1;
})
.catch((error) => {
console.error(error);
document.getElementById("loading").classList.add("hidden");
});
});
// Handle tab clicks
document.querySelectorAll(".tab-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const tab = btn.dataset.tab;
document
.querySelectorAll(".tab-btn")
.forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
btn.classList.add("bg-lime-700", "text-white");
document.querySelectorAll(".tab-content.code pre").forEach((el) => el.classList.add("hidden"));
document.getElementById(`${tab}-result`).parentElement.classList.remove("hidden");
});
});
// Handle code tab clicks
document.querySelectorAll(".code-tab-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const tab = btn.dataset.tab;
document
.querySelectorAll(".code-tab-btn")
.forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
btn.classList.add("bg-lime-700", "text-white");
document.querySelectorAll(".tab-content.result pre").forEach((el) => el.classList.add("hidden"));
document.getElementById(`${tab}-code`).parentElement.classList.remove("hidden");
});
});
// Handle copy to clipboard button clicks
async function copyToClipboard(text) {
if (navigator.clipboard && navigator.clipboard.writeText) {
return navigator.clipboard.writeText(text);
} else {
return fallbackCopyTextToClipboard(text);
}
}
function fallbackCopyTextToClipboard(text) {
return new Promise((resolve, reject) => {
const textArea = document.createElement("textarea");
textArea.value = text;
// Avoid scrolling to bottom
textArea.style.top = "0";
textArea.style.left = "0";
textArea.style.position = "fixed";
document.body.appendChild(textArea);
textArea.focus();
textArea.select();
try {
const successful = document.execCommand("copy");
if (successful) {
resolve();
} else {
reject();
}
} catch (err) {
reject(err);
}
document.body.removeChild(textArea);
});
}
document.querySelectorAll(".copy-btn").forEach((btn) => {
btn.addEventListener("click", () => {
const target = btn.dataset.target;
const code = document.getElementById(target).textContent;
//navigator.clipboard.writeText(code).then(() => {
copyToClipboard(code).then(() => {
btn.textContent = "Copied!";
setTimeout(() => {
btn.textContent = "Copy";
}, 2000);
});
});
});
document.addEventListener("DOMContentLoaded", async () => {
try {
const extractionResponse = await fetch("/strategies/extraction");
const extractionStrategies = await extractionResponse.json();
const chunkingResponse = await fetch("/strategies/chunking");
const chunkingStrategies = await chunkingResponse.json();
renderStrategies("extraction-strategies", extractionStrategies);
renderStrategies("chunking-strategies", chunkingStrategies);
} catch (error) {
console.error("Error fetching strategies:", error);
}
});
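// Note (illustrative only): renderStrategies below assumes each /strategies/* endpoint
// returns a JSON-encoded string (hence the extra JSON.parse) mapping strategy names to
// Markdown descriptions, e.g.
// '{"CosineStrategy": "Clusters related content using cosine similarity...", "RegexChunking": "Splits text with regular expressions..."}'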
function renderStrategies(containerId, strategies) {
const container = document.getElementById(containerId);
container.innerHTML = ""; // Clear any existing content
strategies = JSON.parse(strategies);
Object.entries(strategies).forEach(([strategy, description]) => {
const strategyElement = document.createElement("div");
strategyElement.classList.add("bg-zinc-800", "p-4", "rounded", "shadow-md", "docs-item");
const strategyDescription = document.createElement("div");
strategyDescription.classList.add("text-gray-300", "prose", "prose-sm");
strategyDescription.innerHTML = marked.parse(description);
strategyElement.appendChild(strategyDescription);
container.appendChild(strategyElement);
});
}
// Highlight code syntax
hljs.highlightAll();
</script>
{{ footer | safe }}
<script src="/pages/app.js"></script>
</body>
</html>

View File

@ -283,7 +283,7 @@
.post("/crawl", data)
.then((response) => {
const result = response.data.results[0];
const parsedJson = JSON.parse(result.parsed_json);
const parsedJson = JSON.parse(result.extracted_content);
document.getElementById("json-result").textContent = JSON.stringify(parsedJson, null, 2);
document.getElementById("cleaned-html-result").textContent = result.cleaned_html;
document.getElementById("markdown-result").textContent = result.markdown;

36
pages/partial/footer.html Normal file
View File

@ -0,0 +1,36 @@
<section class="hero bg-zinc-900 py-8 px-20 text-zinc-400">
<div class="container mx-auto px-4">
<h2 class="text-3xl font-bold mb-4">🤔 Why building this?</h2>
<p class="text-lg mb-4">
In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging
for services that should rightfully be accessible to everyone. 🌍💸 One such example is scraping and
crawling web pages and transforming them into a format suitable for Large Language Models (LLMs).
🕸️🤖 We believe that building a business around this is not the right approach; instead, it should
definitely be open-source. 🆓🌟 So, if you possess the skills to build such tools and share our
philosophy, we invite you to join our "Robinhood" band and help set these products free for the
benefit of all. 🤝💪
</p>
</div>
</section>
<footer class="bg-zinc-900 text-zinc-400 py-4">
<div class="container mx-auto px-4">
<div class="flex justify-between items-center">
<p>© 2024 Crawl4AI. All rights reserved.</p>
<div class="social-links">
<a
href="https://github.com/unclecode/crawl4ai"
class="text-zinc-400 hover:text-gray-300 mx-2"
target="_blank"
>😺 GitHub</a
>
<a
href="https://twitter.com/unclecode"
class="text-zinc-400 hover:text-gray-300 mx-2"
target="_blank"
>🐦 Twitter</a
>
</div>
</div>
</div>
</footer>

View File

@ -0,0 +1,160 @@
<section id="how-to-guide" class="content-section">
<h1 class="text-2xl font-bold">How to Guide</h1>
<div class="flex flex-col gap-4 p-4 bg-zinc-900 text-lime-500">
<!-- Step 1 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🌟
<strong
>Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling
fun!</strong
>
</div>
<div class="">
First Step: Create an instance of WebCrawler and call the
<code>warmup()</code> function.
</div>
<div>
<pre><code class="language-python">crawler = WebCrawler()
crawler.warmup()</code></pre>
</div>
<!-- Step 2 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧠 <strong>Understanding 'bypass_cache' and 'include_raw_html' parameters:</strong>
</div>
<div class="">First crawl (caches the result):</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business")</code></pre>
</div>
<div class="">Second crawl (Force to crawl again):</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)</code></pre>
<div class="bg-red-900 p-2 text-zinc-50">
⚠️ Don't forget to set <code>`bypass_cache`</code> to True if you want to try different strategies for the same URL. Otherwise, the cached result will be returned. You can also set <code>`always_by_pass_cache`</code> to True in the constructor to always bypass the cache.
</div>
</div>
<div class="">Crawl result without raw HTML content:</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)</code></pre>
</div>
<!-- Step 3 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
📄
<strong
>The 'include_raw_html' parameter, when set to True, includes the raw HTML content
in the response. By default, it is set to True.</strong
>
</div>
<div class="">Set <code>always_by_pass_cache</code> to True:</div>
<div>
<pre><code class="language-python">crawler.always_by_pass_cache = True</code></pre>
</div>
<!-- Step 4 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧩 <strong>Let's add a chunking strategy: RegexChunking!</strong>
</div>
<div class="">Using RegexChunking:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=RegexChunking(patterns=["\n\n"])
)</code></pre>
</div>
<div class="">Using NlpSentenceChunking:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=NlpSentenceChunking()
)</code></pre>
</div>
<!-- Step 5 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧠 <strong>Let's get smarter with an extraction strategy: CosineStrategy!</strong>
</div>
<div class="">Using CosineStrategy:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
)</code></pre>
</div>
<!-- Step 6 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🤖
<strong
>Time to bring in the big guns: LLMExtractionStrategy without instructions!</strong
>
</div>
<div class="">Using LLMExtractionStrategy without instructions:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
)</code></pre>
</div>
<!-- Step 7 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
📜
<strong
>Let's make it even more interesting: LLMExtractionStrategy with
instructions!</strong
>
</div>
<div class="">Using LLMExtractionStrategy with instructions:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="I am interested in only financial news"
)
)</code></pre>
</div>
<!-- Step 8 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🎯
<strong>Targeted extraction: Let's use a CSS selector to extract only H2 tags!</strong>
</div>
<div class="">Using CSS selector to extract H2 tags:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
css_selector="h2"
)</code></pre>
</div>
<!-- Step 9 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🖱️
<strong
>Let's get interactive: Passing JavaScript code to click 'Load More' button!</strong
>
</div>
<div class="">Using JavaScript to click 'Load More' button:</div>
<div>
<pre><code class="language-python">js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business")</code></pre>
</div>
<!-- Conclusion -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🎉
<strong
>Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth
and crawl the web like a pro! 🕸️</strong
>
</div>
</div>
</section>

View File

@ -0,0 +1,56 @@
<section id="installation" class="content-section active">
<h1 class="text-2xl font-bold">Installation 💻</h1>
<p class="mb-4">
There are three ways to use Crawl4AI:
<ol class="list-decimal list-inside mb-4">
<li class="">
As a library
</li>
<li class="">
As a local server (Docker)
</li>
<li class="">
As a Google Colab notebook. <a href="https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk"
><img
src="https://colab.research.google.com/assets/colab-badge.svg"
alt="Open In Colab"
style="display: inline-block; width: 100px; height: 20px"
/></a>
</li>
</p>
<p class="my-4">To install Crawl4AI as a library, follow these steps:</p>
<ol class="list-decimal list-inside mb-4">
<li class="mb-4">
Install the package from GitHub:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code>pip install git+https://github.com/unclecode/crawl4ai.git</code></pre>
</li>
<li class="mb-4">
Alternatively, you can clone the repository and install the package locally:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class = "language-python bash">virtualenv venv
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
</code></pre>
</li>
<li class="">
Use Docker to run the local server (an example request follows this list):
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class = "language-python bash">docker build -t crawl4ai .
# docker build --platform linux/amd64 -t crawl4ai . For Mac users
docker run -d -p 8000:80 crawl4ai</code></pre>
</li>
</ol>
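<p class="mb-4">
Once the Docker container is up, a minimal request to the local server could look like the sketch below. The payload fields mirror the parameters used elsewhere in these docs (the same fields the demo page sends to <code>/crawl</code>); adjust them to your needs. The container maps host port 8000 to port 80.
</p>
<pre class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"><code class="language-python">import requests

# Illustrative sketch of a /crawl request against the local Docker server
data = {
    "urls": ["https://www.nbcnews.com/business"],
    "word_count_threshold": 10,
    "extraction_strategy": "CosineStrategy",
    "chunking_strategy": "RegexChunking",
    "extract_blocks": True,
    "bypass_cache": False,
}
response = requests.post("http://localhost:8000/crawl", json=data)
print(response.json())
</code></pre>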
<p class="mb-4">
For more information about how to run Crawl4AI as a local server, please refer to the
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
</p>
</section>

204
pages/partial/try_it.html Normal file
View File

@ -0,0 +1,204 @@
<section class="try-it py-8 px-16 pb-20 bg-zinc-900">
<div class="container mx-auto ">
<h2 class="text-2xl font-bold mb-4 text-lime-500">Try It Now</h2>
<div class="flex gap-4">
<div class="flex flex-col flex-1 gap-2">
<div class="flex flex-col">
<label for="url-input" class="text-lime-500 font-bold text-xs">URL(s)</label>
<input
type="text"
id="url-input"
value="https://www.nbcnews.com/business"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300"
placeholder="Enter URL(s) separated by commas"
/>
</div>
<div class="flex gap-2">
<div class="flex flex-col">
<label for="threshold" class="text-lime-500 font-bold text-xs">Min Words Threshold</label>
<select
id="threshold"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
>
<option value="5">5</option>
<option value="10" selected>10</option>
<option value="15">15</option>
<option value="20">20</option>
<option value="25">25</option>
</select>
</div>
<div class="flex flex-col flex-1">
<label for="css-selector" class="text-lime-500 font-bold text-xs">CSS Selector</label>
<input
type="text"
id="css-selector"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300 placeholder-lime-700"
placeholder="CSS Selector (e.g. .content, #main, article)"
/>
</div>
</div>
<div class="flex gap-2">
<div class="flex flex-col">
<label for="extraction-strategy-select" class="text-lime-500 font-bold text-xs"
>Extraction Strategy</label
>
<select
id="extraction-strategy-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
>
<option value="CosineStrategy">CosineStrategy</option>
<option value="LLMExtractionStrategy">LLMExtractionStrategy</option>
<option value="NoExtractionStrategy">NoExtractionStrategy</option>
</select>
</div>
<div class="flex flex-col">
<label for="chunking-strategy-select" class="text-lime-500 font-bold text-xs"
>Chunking Strategy</label
>
<select
id="chunking-strategy-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
>
<option value="RegexChunking">RegexChunking</option>
<option value="NlpSentenceChunking">NlpSentenceChunking</option>
<option value="TopicSegmentationChunking">TopicSegmentationChunking</option>
<option value="FixedLengthWordChunking">FixedLengthWordChunking</option>
<option value="SlidingWindowChunking">SlidingWindowChunking</option>
</select>
</div>
</div>
<div id = "llm_settings" class="flex gap-2 hidden hidden">
<div class="flex flex-col">
<label for="provider-model-select" class="text-lime-500 font-bold text-xs"
>Provider Model</label
>
<select
id="provider-model-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
>
<option value="groq/llama3-70b-8192">groq/llama3-70b-8192</option>
<option value="groq/llama3-8b-8192">groq/llama3-8b-8192</option>
<option value="groq/mixtral-8x7b-32768">groq/mixtral-8x7b-32768</option>
<option value="openai/gpt-4-turbo">gpt-4-turbo</option>
<option value="openai/gpt-3.5-turbo">gpt-3.5-turbo</option>
<option value="openai/gpt-4o">gpt-4o</option>
<option value="anthropic/claude-3-haiku-20240307">claude-3-haiku</option>
<option value="anthropic/claude-3-opus-20240229">claude-3-opus</option>
<option value="anthropic/claude-3-sonnet-20240229">claude-3-sonnet</option>
</select>
</div>
<div class="flex flex-col flex-1">
<label for="token-input" class="text-lime-500 font-bold text-xs">API Token</label>
<input
type="password"
id="token-input"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300"
placeholder="Enter Groq API token"
/>
</div>
</div>
<div class="flex gap-2">
<!-- Add two textarea one for getting Keyword Filter and another one Instruction, make both grow whole with-->
<div id = "semantic_filter_div" class="flex flex-col flex-1">
<label for="keyword-filter" class="text-lime-500 font-bold text-xs">Keyword Filter</label>
<textarea
id="semantic_filter"
rows="3"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300 placeholder-zinc-700"
placeholder="Enter keywords for CosineStrategy to narrow down the content."
></textarea>
</div>
<div id = "instruction_div" class="flex flex-col flex-1 hidden">
<label for="instruction" class="text-lime-500 font-bold text-xs">Instruction</label>
<textarea
id="instruction"
rows="3"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300 placeholder-zinc-700"
placeholder="Enter instruction for the LLMEstrategy to instruct the model."
></textarea>
</div>
</div>
<div class="flex gap-3">
<div class="flex items-center gap-2">
<input type="checkbox" id="bypass-cache-checkbox" />
<label for="bypass-cache-checkbox" class="text-lime-500 font-bold">Bypass Cache</label>
</div>
<div class="flex items-center gap-2">
<input type="checkbox" id="extract-blocks-checkbox" checked />
<label for="extract-blocks-checkbox" class="text-lime-500 font-bold">Extract Blocks</label>
</div>
<button id="crawl-btn" class="bg-lime-600 text-black font-bold px-4 py-0 rounded">Crawl</button>
</div>
</div>
<div id="result" class="flex-1">
<div id="loading" class="hidden">
<p class="text-white">Loading... Please wait.</p>
</div>
<div class="tab-buttons flex gap-2">
<button class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="json">
JSON
</button>
<button
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="cleaned-html"
>
Cleaned HTML
</button>
<button class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="markdown">
Markdown
</button>
</div>
<div class="tab-content code bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
<pre class="h-full flex"><code id="json-result" class="language-json"></code></pre>
<pre class="hidden h-full flex"><code id="cleaned-html-result" class="language-html"></code></pre>
<pre class="hidden h-full flex"><code id="markdown-result" class="language-markdown"></code></pre>
</div>
</div>
<div id="code_help" class="flex-1">
<div class="tab-buttons flex gap-2">
<button class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="curl">
cURL
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="library"
>
Python
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="python"
>
REST API
</button>
<!-- <button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="nodejs"
>
Node.js
</button> -->
</div>
<div class="tab-content result bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
<pre class="h-full flex relative">
<code id="curl-code" class="language-bash"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="curl-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="python-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="python-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="nodejs-code" class="language-javascript"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="nodejs-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="library-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="library-code">Copy</button>
</pre>
</div>
</div>
</div>
</div>
</section>

435
pages/tmp.html Normal file
View File

@ -0,0 +1,435 @@
<div class="w-3/4 p-4">
<section id="installation" class="content-section active">
<h1 class="text-2xl font-bold">Installation 💻</h1>
<p class="mb-4">There are three ways to use Crawl4AI:</p>
<ol class="list-decimal list-inside mb-4">
<li class="">As a library</li>
<li class="">As a local server (Docker)</li>
<li class="">
As a Google Colab notebook.
<a href="https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk"
><img
src="https://colab.research.google.com/assets/colab-badge.svg"
alt="Open In Colab"
style="display: inline-block; width: 100px; height: 20px"
/></a>
</li>
<p></p>
<p class="my-4">To install Crawl4AI as a library, follow these steps:</p>
<ol class="list-decimal list-inside mb-4">
<li class="mb-4">
Install the package from GitHub:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class="hljs language-bash">pip install git+https://github.com/unclecode/crawl4ai.git</code></pre>
</li>
<li class="mb-4">
Alternatively, you can clone the repository and install the package locally:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class="language-python bash hljs">virtualenv venv
source venv/<span class="hljs-built_in">bin</span>/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
</code></pre>
</li>
<li class="">
Use docker to run the local server:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class="language-python bash hljs">docker build -t crawl4ai .
<span class="hljs-comment"># docker build --platform linux/amd64 -t crawl4ai . For Mac users</span>
docker run -d -p <span class="hljs-number">8000</span>:<span class="hljs-number">80</span> crawl4ai</code></pre>
</li>
</ol>
<p class="mb-4">
For more information about how to run Crawl4AI as a local server, please refer to the
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
</p>
</ol>
</section>
<section id="how-to-guide" class="content-section">
<h1 class="text-2xl font-bold">How to Guide</h1>
<div class="flex flex-col gap-4 p-4 bg-zinc-900 text-lime-500">
<!-- Step 1 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🌟
<strong>Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun!</strong>
</div>
<div class="">
First Step: Create an instance of WebCrawler and call the
<code>warmup()</code> function.
</div>
<div>
<pre><code class="language-python hljs">crawler = WebCrawler()
crawler.warmup()</code></pre>
</div>
<!-- Step 2 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧠 <strong>Understanding 'bypass_cache' and 'include_raw_html' parameters:</strong>
</div>
<div class="">First crawl (caches the result):</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>)</code></pre>
</div>
<div class="">Second crawl (Force to crawl again):</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>, bypass_cache=<span class="hljs-literal">True</span>)</code></pre>
<div class="bg-red-900 p-2 text-zinc-50">
⚠️ Don't forget to set <code>`bypass_cache`</code> to True if you want to try different strategies
for the same URL. Otherwise, the cached result will be returned. You can also set
<code>`always_by_pass_cache`</code> in constructor to True to always bypass the cache.
</div>
</div>
<div class="">Crawl result without raw HTML content:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>, include_raw_html=<span class="hljs-literal">False</span>)</code></pre>
</div>
<!-- Step 3 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
📄
<strong
>The 'include_raw_html' parameter, when set to True, includes the raw HTML content in the response.
By default, it is set to True.</strong
>
</div>
<div class="">Set <code>always_by_pass_cache</code> to True:</div>
<div>
<pre><code class="language-python hljs">crawler.always_by_pass_cache = <span class="hljs-literal">True</span></code></pre>
</div>
<!-- Step 4 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧩 <strong>Let's add a chunking strategy: RegexChunking!</strong>
</div>
<div class="">Using RegexChunking:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
chunking_strategy=RegexChunking(patterns=[<span class="hljs-string">"\n\n"</span>])
)</code></pre>
</div>
<div class="">Using NlpSentenceChunking:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
chunking_strategy=NlpSentenceChunking()
)</code></pre>
</div>
<!-- Step 5 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧠 <strong>Let's get smarter with an extraction strategy: CosineStrategy!</strong>
</div>
<div class="">Using CosineStrategy:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
extraction_strategy=CosineStrategy(word_count_threshold=<span class="hljs-number">20</span>, max_dist=<span class="hljs-number">0.2</span>, linkage_method=<span class="hljs-string">"ward"</span>, top_k=<span class="hljs-number">3</span>)
)</code></pre>
</div>
<!-- Step 6 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🤖
<strong>Time to bring in the big guns: LLMExtractionStrategy without instructions!</strong>
</div>
<div class="">Using LLMExtractionStrategy without instructions:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
extraction_strategy=LLMExtractionStrategy(provider=<span class="hljs-string">"openai/gpt-4o"</span>, api_token=os.getenv(<span class="hljs-string">'OPENAI_API_KEY'</span>))
)</code></pre>
</div>
<!-- Step 7 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
📜
<strong>Let's make it even more interesting: LLMExtractionStrategy with instructions!</strong>
</div>
<div class="">Using LLMExtractionStrategy with instructions:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
extraction_strategy=LLMExtractionStrategy(
provider=<span class="hljs-string">"openai/gpt-4o"</span>,
api_token=os.getenv(<span class="hljs-string">'OPENAI_API_KEY'</span>),
instruction=<span class="hljs-string">"I am interested in only financial news"</span>
)
)</code></pre>
</div>
<!-- Step 8 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🎯
<strong>Targeted extraction: Let's use a CSS selector to extract only H2 tags!</strong>
</div>
<div class="">Using CSS selector to extract H2 tags:</div>
<div>
<pre><code class="language-python hljs">result = crawler.run(
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
css_selector=<span class="hljs-string">"h2"</span>
)</code></pre>
</div>
<!-- Step 9 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🖱️
<strong>Let's get interactive: Passing JavaScript code to click 'Load More' button!</strong>
</div>
<div class="">Using JavaScript to click 'Load More' button:</div>
<div>
<pre><code class="language-python hljs">js_code = <span class="hljs-string">"""
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button =&gt; button.textContent.includes('Load More'));
loadMoreButton &amp;&amp; loadMoreButton.click();
"""</span>
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=<span class="hljs-literal">True</span>)
result = crawler.run(url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>)</code></pre>
</div>
<!-- Conclusion -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🎉
<strong
>Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the
web like a pro! 🕸️</strong
>
</div>
</div>
</section>
<section id="chunking-strategies" class="content-section">
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>RegexChunking</h3>
<p>
<code>RegexChunking</code> is a text chunking strategy that splits a given text into smaller parts
using regular expressions. This is useful for preparing large texts for processing by language
models, ensuring they are divided into manageable segments.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>patterns</code> (list, optional): A list of regular expression patterns used to split the
text. Default is to split by double newlines (<code>['\n\n']</code>).
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">chunker = RegexChunking(patterns=[r'\n\n', r'\. '])
chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
</code></pre>
</div>
</div>
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>NlpSentenceChunking</h3>
<p>
<code>NlpSentenceChunking</code> uses a natural language processing model to chunk a given text into
sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>model</code> (str, optional): The SpaCy model to use for sentence detection. Default is
<code>'en_core_web_sm'</code>.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">chunker = NlpSentenceChunking(model='en_core_web_sm')
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
</code></pre>
</div>
</div>
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>TopicSegmentationChunking</h3>
<p>
<code>TopicSegmentationChunking</code> uses the TextTiling algorithm to segment a given text into
topic-based chunks. This method identifies thematic boundaries in the text.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>num_keywords</code> (int, optional): The number of keywords to extract for each topic
segment. Default is <code>3</code>.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">chunker = TopicSegmentationChunking(num_keywords=3)
chunks = chunker.chunk("This is a sample text. It will be split into topic-based segments.")
</code></pre>
</div>
</div>
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>FixedLengthWordChunking</h3>
<p>
<code>FixedLengthWordChunking</code> splits a given text into chunks of fixed length, based on the
number of words.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>chunk_size</code> (int, optional): The number of words in each chunk. Default is
<code>100</code>.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">chunker = FixedLengthWordChunking(chunk_size=100)
chunks = chunker.chunk("This is a sample text. It will be split into fixed-length word chunks.")
</code></pre>
</div>
</div>
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>SlidingWindowChunking</h3>
<p>
<code>SlidingWindowChunking</code> uses a sliding window approach to chunk a given text. Each chunk
has a fixed length, and the window slides by a specified step size.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>window_size</code> (int, optional): The number of words in each chunk. Default is
<code>100</code>.
</li>
<li>
<code>step</code> (int, optional): The number of words to slide the window. Default is
<code>50</code>.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">chunker = SlidingWindowChunking(window_size=100, step=50)
chunks = chunker.chunk("This is a sample text. It will be split using a sliding window approach.")
</code></pre>
</div>
</div>
</section>
<section id="extraction-strategies" class="content-section">
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>NoExtractionStrategy</h3>
<p>
<code>NoExtractionStrategy</code> is a basic extraction strategy that returns the entire HTML
content without any modification. It is useful when no specific extraction is required and only the
cleaned HTML and markdown output are needed.
</p>
<h4>Constructor Parameters:</h4>
<p>None.</p>
<h4>Example usage:</h4>
<pre><code class="language-python">extractor = NoExtractionStrategy()
extracted_content = extractor.extract(url, html)
</code></pre>
</div>
</div>
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>LLMExtractionStrategy</h3>
<p>
<code>LLMExtractionStrategy</code> uses a Language Model (LLM) to extract meaningful blocks or
chunks from the given HTML content. This strategy leverages an external provider for language model
completions.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>provider</code> (str, optional): The provider to use for the language model completions.
Default is <code>DEFAULT_PROVIDER</code> (e.g., openai/gpt-4).
</li>
<li>
<code>api_token</code> (str, optional): The API token for the provider. If not provided, it will
try to load from the environment variable <code>OPENAI_API_KEY</code>.
</li>
<li>
<code>instruction</code> (str, optional): An instruction to guide the LLM on how to perform the
extraction. This allows users to specify the type of data they are interested in or set the tone
of the response. Default is <code>None</code>.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">extractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token', instruction='Extract only news about AI.')
extracted_content = extractor.extract(url, html)
</code></pre>
<p>
By providing clear instructions, users can tailor the extraction process to their specific needs,
enhancing the relevance and utility of the extracted content.
</p>
</div>
</div>
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>CosineStrategy</h3>
<p>
<code>CosineStrategy</code> uses hierarchical clustering based on cosine similarity to extract
clusters of text from the given HTML content. This strategy is suitable for identifying related
content sections.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>semantic_filter</code> (str, optional): A string containing keywords for filtering relevant
documents before clustering. If provided, documents are filtered based on their cosine
similarity to the keyword filter embedding. Default is <code>None</code>.
</li>
<li>
<code>word_count_threshold</code> (int, optional): Minimum number of words per cluster. Default
is <code>20</code>.
</li>
<li>
<code>max_dist</code> (float, optional): The maximum cophenetic distance on the dendrogram to
form clusters. Default is <code>0.2</code>.
</li>
<li>
<code>linkage_method</code> (str, optional): The linkage method for hierarchical clustering.
Default is <code>'ward'</code>.
</li>
<li>
<code>top_k</code> (int, optional): Number of top categories to extract. Default is
<code>3</code>.
</li>
<li>
<code>model_name</code> (str, optional): The model name for embedding generation. Default is
<code>'BAAI/bge-small-en-v1.5'</code>.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">extractor = CosineStrategy(semantic_filter='artificial intelligence', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')
extracted_content = extractor.extract(url, html)
</code></pre>
<h4>Cosine Similarity Filtering</h4>
<p>
When a <code>semantic_filter</code> is provided, the <code>CosineStrategy</code> applies an
embedding-based filtering process to select relevant documents before performing hierarchical
clustering.
</p>
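<p>
As a rough, illustrative sketch (not the library's internal implementation), the filtering step can be thought of as follows, assuming the <code>sentence-transformers</code> package is available and using an arbitrary similarity threshold:
</p>
<pre><code class="language-python">from sentence_transformers import SentenceTransformer
import numpy as np

def filter_by_semantic_similarity(documents, semantic_filter, threshold=0.3,
                                  model_name='BAAI/bge-small-en-v1.5'):
    # Embed the keyword filter and the candidate documents
    model = SentenceTransformer(model_name)
    filter_emb = model.encode([semantic_filter])[0]
    doc_embs = model.encode(documents)
    # Cosine similarity of each document against the filter embedding
    sims = doc_embs @ filter_emb / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(filter_emb) + 1e-10)
    # Keep only documents above the (illustrative) similarity threshold
    return [doc for doc, sim in zip(documents, sims) if sim >= threshold]
</code></pre>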
</div>
</div>
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
<div class="text-gray-300 prose prose-sm">
<h3>TopicExtractionStrategy</h3>
<p>
<code>TopicExtractionStrategy</code> uses the TextTiling algorithm to segment the HTML content into
topics and extracts keywords for each segment. This strategy is useful for identifying and
summarizing thematic content.
</p>
<h4>Constructor Parameters:</h4>
<ul>
<li>
<code>num_keywords</code> (int, optional): Number of keywords to represent each topic segment.
Default is <code>3</code>.
</li>
</ul>
<h4>Example usage:</h4>
<pre><code class="language-python">extractor = TopicExtractionStrategy(num_keywords=3)
extracted_content = extractor.extract(url, html)
</code></pre>
</div>
</div>
</section>
</div>

View File

@ -13,4 +13,5 @@ litellm
python-dotenv
nltk
lazy_import
rich
# spacy