mirror of https://github.com/unclecode/crawl4ai.git
synced 2025-12-27 18:38:32 +00:00

Update:
- Debug - Refactor code for new version

This commit is contained in:
parent f6e59157bf
commit 5b80be956d

552 README.md
@ -8,16 +8,90 @@
Crawl4AI is a powerful, free web crawling service designed to extract useful information from web pages and make it accessible for large language models (LLMs) and AI applications. 🆓🌐

## 🚧 Work in Progress 👷‍♂️

## Recent Changes

- 🔧 Separate Crawl and Extract Semantic Chunk: Enhancing efficiency in large-scale tasks.
- 🔍 Colab Integration: Exploring integration with Google Colab for easy experimentation.
- 🎯 XPath and CSS Selector Support: Adding support for selective retrieval of specific elements.
- 📷 Image Captioning: Incorporating image captioning capabilities to extract descriptions from images.
- 💾 Embedding Vector Data: Generate and store embedding data for each crawled website.
- 🔍 Semantic Search Engine: Building a semantic search engine that fetches content, performs vector-similarity search, and generates labeled chunk data based on user queries and URLs.
- 🚀 10x faster!!
- 📜 Execute custom JavaScript before crawling!
- 🤝 Colab friendly!
- 📚 Chunking strategies: topic-based, regex, sentence, and more!
- 🧠 Extraction strategies: cosine clustering, LLM, and more!
- 🎯 CSS selector support
- 📝 Pass instructions/keywords to refine extraction

## Power and Simplicity of Crawl4AI 🚀

Crawl4AI makes even complex web crawling tasks simple and intuitive. Below is an example of how you can execute JavaScript, filter data using keywords, and use a CSS selector to extract specific content—all in one go!

**Example Task:**

1. Execute custom JavaScript to click a "Load More" button.
2. Filter the data to include only content related to "technology".
3. Use a CSS selector to extract only paragraphs (`<p>` tags).

**Example Code:**

```python
# Import necessary modules
import os

from crawl4ai import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
from crawl4ai.crawler_strategy import *

# Define the JavaScript code to click the "Load More" button
js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""

# Define the crawling strategy
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)

# Create the WebCrawler instance with the defined strategy
crawler = WebCrawler(crawler_strategy=crawler_strategy)

# Run the crawler with keyword filtering
result = crawler.run(
    url="https://www.example.com",
    extraction_strategy=CosineStrategy(
        semantic_filter="technology",
    ),
)

# Run the crawler with the LLM extraction strategy and a CSS selector
result = crawler.run(
    url="https://www.example.com",
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        instruction="Extract only content related to technology"
    ),
    css_selector="p"
)

# Display the extracted result
print(result)
```

With Crawl4AI, you can perform advanced web crawling and data extraction tasks with just a few lines of code. This example demonstrates how you can harness the power of Crawl4AI to simplify your workflow and get the data you need efficiently.

---

*Continue reading to learn more about the features, installation process, usage, and more.*

## Table of Contents

1. [Features](#features)
2. [Installation](#installation)
3. [REST API/Local Server](#using-the-local-server-or-rest-api)
4. [Python Library Usage](#usage)
5. [Parameters](#parameters)
6. [Chunking Strategies](#chunking-strategies)
7. [Extraction Strategies](#extraction-strategies)
8. [Contributing](#contributing)
9. [License](#license)
10. [Contact](#contact)

For more details, refer to the [CHANGELOG.md](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md) file.

## Features ✨

@ -26,26 +100,28 @@ For more details, refer to the [CHANGELOG.md](https://github.com/unclecode/crawl
- 🌍 Supports crawling multiple URLs simultaneously
- 🌃 Replaces media tags with ALT text.
- 🆓 Completely free to use and open-source

## Getting Started 🚀

To get started with Crawl4AI, simply visit our web application at [https://crawl4ai.uccode.io](https://crawl4ai.uccode.io) (Available now!) and enter the URL(s) you want to crawl. The application will process the URLs and provide you with the extracted data in various formats.

- 📜 Execute custom JavaScript before crawling
- 📚 Chunking strategies: topic-based, regex, sentence, and more
- 🧠 Extraction strategies: cosine clustering, LLM, and more
- 🎯 CSS selector support
- 📝 Pass instructions/keywords to refine extraction

## Installation 💻

There are three ways to use Crawl4AI:

1. As a library (Recommended)
2. As a local server (Docker) or using the REST API
3. As a Google Colab notebook. [Open In Colab](https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk)

### Using Crawl4AI as a Library 📚

To install Crawl4AI as a library, follow these steps:

1. Install the package from GitHub:

```bash
pip install git+https://github.com/unclecode/crawl4ai.git
```

2. Alternatively, you can clone the repository and install the package locally:

```bash
virtualenv venv
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
@ -53,133 +129,193 @@ cd crawl4ai
pip install -e .
```

3. Import the necessary modules in your Python script:

```python
from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
import os

crawler = WebCrawler()
crawler.warmup()  # IMPORTANT: warm up the engine before running the first crawl

# Single page crawl
result = crawler.run(
    url='https://www.nbcnews.com/business',
    word_count_threshold=5,  # Minimum word count for an HTML tag to be considered a worthy block
    chunking_strategy=RegexChunking(patterns=["\n\n"]),  # Default is RegexChunking
    extraction_strategy=CosineStrategy(word_count_threshold=20, max_dist=0.2, linkage_method='ward', top_k=3),  # Default is CosineStrategy
    # extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
    bypass_cache=False,
    extract_blocks=True,  # Whether to extract semantic blocks of text from the HTML
    css_selector="",  # E.g. "div.article-body"
    verbose=True,
    include_raw_html=True,  # Whether to include the raw HTML content in the response
)

print(result.model_dump())
```

Running for the first time will download the Chrome driver for Selenium. It also creates a SQLite database file `crawler_data.db` that stores the crawled data for future reference.

The response model is a `CrawlResult` object that contains the following attributes:

```python
class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: str = None
    markdown: str = None
    parsed_json: str = None
    error_message: str = None
```

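For instance, a minimal sketch of reading a few of these fields after a crawl (attribute names taken from the model above; which optional fields are populated depends on the strategies you pass to `run`):

```python
# Hedged sketch: inspect individual CrawlResult attributes instead of dumping the whole model.
result = crawler.run(url='https://www.nbcnews.com/business')

if result.success:
    print(result.url)
    print((result.markdown or "")[:200])  # first 200 characters of the markdown rendering
    print(result.parsed_json)             # extracted blocks as a JSON string, if a strategy produced them
else:
    print(result.error_message)
```
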
### Running Crawl4AI as a Local Server 🚀

To run Crawl4AI as a standalone local server, follow these steps:

1. Clone the repository:
```bash
git clone https://github.com/unclecode/crawl4ai.git
```

2. Navigate to the project directory:
```bash
cd crawl4ai
```

3. Open `crawler/config.py` and set your favorite LLM provider and API token.

4. Build the Docker image:
```bash
docker build -t crawl4ai .
```
For Mac users, use the following command instead:
```bash
docker build --platform linux/amd64 -t crawl4ai .
```

5. Run the Docker container:
```bash
docker run -d -p 8000:80 crawl4ai
```

6. Access the application at `http://localhost:8000`.

For more information about how to run Crawl4AI as a local server, please refer to the [GitHub repository](https://github.com/unclecode/crawl4ai).

- cURL example (set `api_token` to your OpenAI API key or the key of whichever provider you are using):
```bash
curl -X POST -H "Content-Type: application/json" -d '{"urls":["https://techcrunch.com/"],"provider_model":"openai/gpt-3.5-turbo","api_token":"your_api_token","include_raw_html":true,"forced":false,"extract_blocks_flag":false,"word_count_threshold":10}' http://localhost:8000/crawl
```

Set `extract_blocks_flag` to `true` to enable the LLM to generate semantically clustered chunks and return them as JSON. Depending on the model and data size, this may take up to one minute. Without this setting, it takes between 5 and 20 seconds.

## Using the Local Server or REST API 🌐

You can also use Crawl4AI through the REST API. This method allows you to send HTTP requests to the Crawl4AI server and receive structured data in response. The base URL for the API is `https://crawl4ai.com/crawl`. If you run the local server, you can use `http://localhost:8000/crawl`. (The port depends on your Docker configuration.)

- Python example:

```python
import requests
import os

data = {
    "urls": [
        "https://www.nbcnews.com/business"
    ],
    "provider_model": "groq/llama3-70b-8192",
    "include_raw_html": True,
    "bypass_cache": False,
    "extract_blocks": True,
    "word_count_threshold": 10,
    "extraction_strategy": "CosineStrategy",
    "chunking_strategy": "RegexChunking",
    "css_selector": "",
    "verbose": True
}

response = requests.post("http://crawl4ai.uccode.io/crawl", json=data)  # or http://localhost:8000/crawl if you run locally

if response.status_code == 200:
    result = response.json()["results"][0]
    print("Parsed JSON:")
    print(result["parsed_json"])
    print("\nCleaned HTML:")
    print(result["cleaned_html"])
    print("\nMarkdown:")
    print(result["markdown"])
else:
    print("Error:", response.status_code, response.text)
```

This code sends a POST request to the Crawl4AI server, specifying the target URL (`http://crawl4ai.uccode.io/crawl`) and the desired options. The server processes the request and returns the crawled data in JSON format.

### Example Usage

To use the REST API, send a POST request to `https://crawl4ai.com/crawl` with the following parameters in the request body.

**Example Request:**
```json
{
    "urls": ["https://www.example.com"],
    "include_raw_html": false,
    "bypass_cache": true,
    "word_count_threshold": 5,
    "extraction_strategy": "CosineStrategy",
    "chunking_strategy": "RegexChunking",
    "css_selector": "p",
    "verbose": true,
    "extraction_strategy_args": {
        "semantic_filter": "finance economy and stock market",
        "word_count_threshold": 20,
        "max_dist": 0.2,
        "linkage_method": "ward",
        "top_k": 3
    },
    "chunking_strategy_args": {
        "patterns": ["\n\n"]
    }
}
```

**Example Response:**
```json
{
    "status": "success",
    "data": [
        {
            "url": "https://www.example.com",
            "extracted_content": "...",
            "html": "...",
            "markdown": "...",
            "metadata": {...}
        }
    ]
}
```

The response from the server includes the semantic clusters, cleaned HTML, and markdown representations of the crawled webpage. You can access and use this data in your Python application as needed.
For more information about the available parameters and their descriptions, refer to the [Parameters](#parameters) section.

Make sure to replace `"http://localhost:8000/crawl"` with the appropriate server URL if your Crawl4AI server is running on a different host or port.

Choose the approach that best suits your needs. If you want to integrate Crawl4AI into your existing Python projects, installing it as a library is the way to go. If you prefer to run Crawl4AI as a standalone service and interact with it via API endpoints, running it as a local server using Docker is the recommended approach.

## Python Library Usage 🚀
|
||||
|
||||
**Make sure to check `config.py` to set the required environment variables.**

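For example, a minimal sketch of that environment setup (assuming you keep provider keys such as `OPENAI_API_KEY` in a `.env` file, which matches the `load_dotenv()` call used in `crawler/config.py`):

```python
# Hypothetical .env workflow: crawler/config.py loads environment variables at import time,
# so the provider keys only need to exist in the environment (or in a .env file next to your script).
from dotenv import load_dotenv

load_dotenv()  # reads OPENAI_API_KEY, GROQ_API_KEY, ANTHROPIC_API_KEY, ... from .env
```
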
### Quickstart Guide
|
||||
|
||||
That's it! You can now integrate Crawl4AI into your Python projects and leverage its web crawling capabilities. 🎉
|
||||
Create an instance of WebCrawler and call the `warmup()` function.
|
||||
```python
|
||||
crawler = WebCrawler()
|
||||
crawler.warmup()
|
||||
```
|
||||
|
||||
## 📖 Parameters
|
||||
### Understanding 'bypass_cache' and 'include_raw_html' parameters
|
||||
|
||||
First crawl (caches the result):
|
||||
```python
|
||||
result = crawler.run(url="https://www.nbcnews.com/business")
|
||||
```
|
||||
|
||||
Second crawl (Force to crawl again):
|
||||
```python
|
||||
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
|
||||
```
|
||||
💡 Don't forget to set `bypass_cache` to `True` if you want to try different strategies for the same URL. Otherwise, the cached result will be returned. You can also set `always_by_pass_cache` to `True` in the constructor to always bypass the cache.

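For example, a small sketch of that constructor flag (the same flag appears in the "Load More" JavaScript example later in this README):

```python
# Create a crawler that never serves cached results; every run() re-crawls the URL.
crawler = WebCrawler(always_by_pass_cache=True)
crawler.warmup()
result = crawler.run(url="https://www.nbcnews.com/business")
```
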
Crawl result without raw HTML content:
|
||||
```python
|
||||
result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
|
||||
```
|
||||
|
||||
### Adding a chunking strategy: RegexChunking
|
||||
|
||||
Using RegexChunking:
|
||||
```python
|
||||
result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
chunking_strategy=RegexChunking(patterns=["\n\n"])
|
||||
)
|
||||
```
|
||||
|
||||
Using NlpSentenceChunking:
|
||||
```python
|
||||
result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
chunking_strategy=NlpSentenceChunking()
|
||||
)
|
||||
```
|
||||
|
||||
### Extraction strategy: CosineStrategy
|
||||
|
||||
Using CosineStrategy:
|
||||
```python
|
||||
result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=CosineStrategy(
|
||||
semantic_filter="",
|
||||
word_count_threshold=10,
|
||||
max_dist=0.2,
|
||||
linkage_method="ward",
|
||||
top_k=3
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
You can set `semantic_filter` to filter relevant documents before clustering. Documents are filtered based on their cosine similarity to the keyword filter embedding.
|
||||
|
||||
```python
|
||||
result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=CosineStrategy(
|
||||
semantic_filter="finance economy and stock market",
|
||||
word_count_threshold=10,
|
||||
max_dist=0.2,
|
||||
linkage_method="ward",
|
||||
top_k=3
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
### Using LLMExtractionStrategy
|
||||
|
||||
Without instructions:
|
||||
```python
|
||||
result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o",
|
||||
api_token=os.getenv('OPENAI_API_KEY')
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
With instructions:
|
||||
```python
|
||||
result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o",
|
||||
api_token=os.getenv('OPENAI_API_KEY'),
|
||||
instruction="I am interested in only financial news"
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
### Targeted extraction using CSS selector
|
||||
|
||||
Extract only H2 tags:
|
||||
```python
|
||||
result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
css_selector="h2"
|
||||
)
|
||||
```
|
||||
|
||||
### Passing JavaScript code to click 'Load More' button
|
||||
|
||||
Using JavaScript to click 'Load More' button:
|
||||
```python
|
||||
js_code = """
|
||||
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
|
||||
loadMoreButton && loadMoreButton.click();
|
||||
"""
|
||||
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
|
||||
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
|
||||
result = crawler.run(url="https://www.nbcnews.com/business")
|
||||
```
|
||||
|
||||
## Parameters 📖
|
||||
|
||||
| Parameter | Description | Required | Default Value |
|
||||
|-----------------------|-------------------------------------------------------------------------------------------------------|----------|---------------------|
|
||||
@ -193,49 +329,134 @@ That's it! You can now integrate Crawl4AI into your Python projects and leverage
|
||||
| `css_selector` | The CSS selector to target specific parts of the HTML for extraction. | No | `None` |
|
||||
| `verbose` | Whether to enable verbose logging. | No | `true` |
|
||||
|
||||
## 🛠️ Configuration

Crawl4AI allows you to configure various parameters and settings in the `crawler/config.py` file. Here's an example of how you can adjust the parameters:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # Load environment variables from .env file

# Default provider, ONLY used when the extraction strategy is LLMExtractionStrategy
DEFAULT_PROVIDER = "openai/gpt-4-turbo"

# Provider-model dictionary, ONLY used when the extraction strategy is LLMExtractionStrategy
PROVIDER_MODELS = {
    "ollama/llama3": "no-token-needed",  # Any model from Ollama, no API token needed
    "groq/llama3-70b-8192": os.getenv("GROQ_API_KEY"),
    "groq/llama3-8b-8192": os.getenv("GROQ_API_KEY"),
    "openai/gpt-3.5-turbo": os.getenv("OPENAI_API_KEY"),
    "openai/gpt-4-turbo": os.getenv("OPENAI_API_KEY"),
    "openai/gpt-4o": os.getenv("OPENAI_API_KEY"),
    "anthropic/claude-3-haiku-20240307": os.getenv("ANTHROPIC_API_KEY"),
    "anthropic/claude-3-opus-20240229": os.getenv("ANTHROPIC_API_KEY"),
    "anthropic/claude-3-sonnet-20240229": os.getenv("ANTHROPIC_API_KEY"),
}

# Chunk token threshold
CHUNK_TOKEN_THRESHOLD = 1000

# Threshold for the minimum number of words in an HTML tag to be considered
MIN_WORD_THRESHOLD = 5
```

In the `crawler/config.py` file, you can:

- Set the default provider using the `DEFAULT_PROVIDER` variable.
- Add or modify the provider-model dictionary (`PROVIDER_MODELS`) to include your desired providers and their corresponding API keys. Crawl4AI supports various providers such as Groq, OpenAI, Anthropic, and more. You can add any provider supported by LiteLLM, as well as Ollama.
- Adjust the `CHUNK_TOKEN_THRESHOLD` value to control the splitting of web content into chunks for parallel processing. A higher value means fewer chunks and faster processing, but it may cause issues with weaker LLMs during extraction.
- Modify the `MIN_WORD_THRESHOLD` value to set the minimum number of words an HTML tag must contain to be considered a meaningful block.

REMEMBER: You only need to set the API keys for the providers if you choose LLMExtractionStrategy as the extraction strategy. If you choose CosineStrategy, you don't need to set the API keys.

Make sure to set the appropriate API keys for each provider in the `PROVIDER_MODELS` dictionary. You can either provide the API key directly or use environment variables to store them securely.

Remember to update the `crawler/config.py` file based on your specific requirements and the providers you want to use with Crawl4AI.

## Chunking Strategies 📚

### RegexChunking

`RegexChunking` is a text chunking strategy that splits a given text into smaller parts using regular expressions. This is useful for preparing large texts for processing by language models, ensuring they are divided into manageable segments.

**Constructor Parameters:**
- `patterns` (list, optional): A list of regular expression patterns used to split the text. Default is to split by double newlines (`['\n\n']`).

**Example usage:**
```python
chunker = RegexChunking(patterns=[r'\n\n', r'\. '])
chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
```

### NlpSentenceChunking

`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.

**Constructor Parameters:**
- `model` (str, optional): The SpaCy model to use for sentence detection. Default is `'en_core_web_sm'`.

**Example usage:**
```python
chunker = NlpSentenceChunking(model='en_core_web_sm')
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
```

### TopicSegmentationChunking
|
||||
|
||||
`TopicSegmentationChunking` uses the TextTiling algorithm to segment a given text into topic-based chunks. This method identifies thematic boundaries in the text.
|
||||
|
||||
**Constructor Parameters:**
|
||||
- `num_keywords` (int, optional): The number of keywords to extract for each topic segment. Default is `3`.
|
||||
|
||||
**Example usage:**
|
||||
```python
|
||||
chunker = TopicSegmentationChunking(num_keywords=3)
|
||||
chunks = chunker.chunk("This is a sample text. It will be split into topic-based segments.")
|
||||
```
|
||||
|
||||
### FixedLengthWordChunking
|
||||
|
||||
`FixedLengthWordChunking` splits a given text into chunks of fixed length, based on the number of words.
|
||||
|
||||
**Constructor Parameters:**
|
||||
- `chunk_size` (int, optional): The number of words in each chunk. Default is `100`.
|
||||
|
||||
**Example usage:**
|
||||
```python
|
||||
chunker = FixedLengthWordChunking(chunk_size=100)
|
||||
chunks = chunker.chunk("This is a sample text. It will be split into fixed-length word chunks.")
|
||||
```
|
||||
|
||||
### SlidingWindowChunking
|
||||
|
||||
`SlidingWindowChunking` uses a sliding window approach to chunk a given text. Each chunk has a fixed length, and the window slides by a specified step size.
|
||||
|
||||
**Constructor Parameters:**
|
||||
- `window_size` (int, optional): The number of words in each chunk. Default is `100`.
|
||||
- `step` (int, optional): The number of words to slide the window. Default is `50`.
|
||||
|
||||
**Example usage:**
|
||||
```python
|
||||
chunker = SlidingWindowChunking(window_size=100, step=50)
|
||||
chunks = chunker.chunk("This is a sample text. It will be split using a sliding window approach.")
|
||||
```
|
||||
|
||||
## Extraction Strategies 🧠
|
||||
|
||||
### NoExtractionStrategy
|
||||
|
||||
`NoExtractionStrategy` is a basic extraction strategy that returns the entire HTML content without any modification. It is useful for cases where no specific extraction is required.
|
||||
|
||||
**Constructor Parameters:**
|
||||
None.
|
||||
|
||||
**Example usage:**
|
||||
```python
|
||||
extractor = NoExtractionStrategy()
|
||||
extracted_content = extractor.extract(url, html)
|
||||
```
|
||||
|
||||
### LLMExtractionStrategy
|
||||
|
||||
`LLMExtractionStrategy` uses a Language Model (LLM) to extract meaningful blocks or chunks from the given HTML content. This strategy leverages an external provider for language model completions.
|
||||
|
||||
**Constructor Parameters:**
|
||||
- `provider` (str, optional): The provider to use for the language model completions. Default is `DEFAULT_PROVIDER` (e.g., openai/gpt-4).
|
||||
- `api_token` (str, optional): The API token for the provider. If not provided, it will try to load from the environment variable `OPENAI_API_KEY`.
|
||||
- `instruction` (str, optional): An instruction to guide the LLM on how to perform the extraction. This allows users to specify the type of data they are interested in or set the tone of the response. Default is `None`.
|
||||
|
||||
**Example usage:**
|
||||
```python
|
||||
extractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token', instruction='Extract only news about AI.')
|
||||
extracted_content = extractor.extract(url, html)
|
||||
```
|
||||
|
||||
### CosineStrategy
|
||||
|
||||
`CosineStrategy` uses hierarchical clustering based on cosine similarity to extract clusters of text from the given HTML content. This strategy is suitable for identifying related content sections.
|
||||
|
||||
**Constructor Parameters:**
|
||||
- `semantic_filter` (str, optional): A string containing keywords for filtering relevant documents before clustering. If provided, documents are filtered based on their cosine similarity to the keyword filter embedding. Default is `None`.
|
||||
- `word_count_threshold` (int, optional): Minimum number of words per cluster. Default is `20`.
|
||||
- `max_dist` (float, optional): The maximum cophenetic distance on the dendrogram to form clusters. Default is `0.2`.
|
||||
- `linkage_method` (str, optional): The linkage method for hierarchical clustering. Default is `'ward'`.
|
||||
- `top_k` (int, optional): Number of top categories to extract. Default is `3`.
|
||||
- `model_name` (str, optional): The model name for embedding generation. Default is `'BAAI/bge-small-en-v1.5'`.
|
||||
|
||||
**Example usage:**
|
||||
```python
|
||||
extractor = CosineStrategy(semantic_filter='artificial intelligence', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')
|
||||
extracted_content = extractor.extract(url, html)
|
||||
```
|
||||
|
||||
### TopicExtractionStrategy
|
||||
|
||||
`TopicExtractionStrategy` uses the TextTiling algorithm to segment the HTML content into topics and extracts keywords for each segment. This strategy is useful for identifying and summarizing thematic content.
|
||||
|
||||
**Constructor Parameters:**
|
||||
- `num_keywords` (int, optional): Number of keywords to represent each topic segment. Default is `3`.
|
||||
|
||||
**Example usage:**
|
||||
```python
|
||||
extractor = TopicExtractionStrategy(num_keywords=3)
|
||||
extracted_content = extractor.extract(url, html)
|
||||
```
|
||||
|
||||
## Contributing 🤝
|
||||
|
||||
@ -259,5 +480,6 @@ If you have any questions, suggestions, or feedback, please feel free to reach o
|
||||
|
||||
- GitHub: [unclecode](https://github.com/unclecode)
|
||||
- Twitter: [@unclecode](https://twitter.com/unclecode)
|
||||
- Website: [crawl4ai.com](https://crawl4ai.com)
|
||||
|
||||
Let's work together to make the web more accessible and useful for AI applications! 💪🌐🤖
|
||||
|
||||
@ -38,7 +38,12 @@ class RegexChunking(ChunkingStrategy):
|
||||
class NlpSentenceChunking(ChunkingStrategy):
|
||||
def __init__(self, model='en_core_web_sm'):
|
||||
import spacy
|
||||
self.nlp = spacy.load(model)
|
||||
try:
|
||||
self.nlp = spacy.load(model)
|
||||
except IOError:
|
||||
spacy.cli.download("en_core_web_sm")
|
||||
self.nlp = spacy.load(model)
|
||||
# raise ImportError(f"Spacy model '{model}' not found. Please download the model using 'python -m spacy download {model}'")
|
||||
|
||||
def chunk(self, text: str) -> list:
|
||||
doc = self.nlp(text)
|
||||
|
||||
@ -18,15 +18,16 @@ class CrawlerStrategy(ABC):
|
||||
pass
|
||||
|
||||
class CloudCrawlerStrategy(CrawlerStrategy):
|
||||
def crawl(self, url: str, use_cached_html = False, css_selector = None) -> str:
|
||||
def __init__(self, use_cached_html = False):
|
||||
super().__init__()
|
||||
self.use_cached_html = use_cached_html
|
||||
|
||||
def crawl(self, url: str) -> str:
|
||||
data = {
|
||||
"urls": [url],
|
||||
"provider_model": "",
|
||||
"api_token": "token",
|
||||
"include_raw_html": True,
|
||||
"forced": True,
|
||||
"extract_blocks": False,
|
||||
"word_count_threshold": 10
|
||||
}
|
||||
|
||||
response = requests.post("http://crawl4ai.uccode.io/crawl", json=data)
|
||||
@ -35,19 +36,24 @@ class CloudCrawlerStrategy(CrawlerStrategy):
|
||||
return html
|
||||
|
||||
class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
|
||||
def __init__(self):
|
||||
def __init__(self, use_cached_html=False, js_code=None):
|
||||
super().__init__()
|
||||
self.options = Options()
|
||||
self.options.headless = True
|
||||
self.options.add_argument("--no-sandbox")
|
||||
self.options.add_argument("--disable-dev-shm-usage")
|
||||
self.options.add_argument("--disable-gpu")
|
||||
self.options.add_argument("--disable-extensions")
|
||||
self.options.add_argument("--headless")
|
||||
self.use_cached_html = use_cached_html
|
||||
self.js_code = js_code
|
||||
|
||||
# chromedriver_autoinstaller.install()
|
||||
self.service = Service(chromedriver_autoinstaller.install())
|
||||
self.driver = webdriver.Chrome(service=self.service, options=self.options)
|
||||
|
||||
def crawl(self, url: str, use_cached_html = False, css_selector = None) -> str:
|
||||
if use_cached_html:
|
||||
def crawl(self, url: str) -> str:
|
||||
if self.use_cached_html:
|
||||
cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", url.replace("/", "_"))
|
||||
if os.path.exists(cache_file_path):
|
||||
with open(cache_file_path, "r") as f:
|
||||
@ -58,6 +64,15 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
|
||||
WebDriverWait(self.driver, 10).until(
|
||||
EC.presence_of_all_elements_located((By.TAG_NAME, "html"))
|
||||
)
|
||||
|
||||
# Execute JS code if provided
|
||||
if self.js_code:
|
||||
self.driver.execute_script(self.js_code)
|
||||
# Optionally, wait for some condition after executing the JS code
|
||||
WebDriverWait(self.driver, 10).until(
|
||||
lambda driver: driver.execute_script("return document.readyState") == "complete"
|
||||
)
|
||||
|
||||
html = self.driver.page_source
|
||||
|
||||
# Store in cache
|
||||
|
||||
@ -8,9 +8,9 @@ DB_PATH = os.path.join(Path.home(), ".crawl4ai")
|
||||
os.makedirs(DB_PATH, exist_ok=True)
|
||||
DB_PATH = os.path.join(DB_PATH, "crawl4ai.db")
|
||||
|
||||
def init_db(db_path: str):
|
||||
def init_db():
|
||||
global DB_PATH
|
||||
conn = sqlite3.connect(db_path)
|
||||
conn = sqlite3.connect(DB_PATH)
|
||||
cursor = conn.cursor()
|
||||
cursor.execute('''
|
||||
CREATE TABLE IF NOT EXISTS crawled_data (
|
||||
@ -18,13 +18,12 @@ def init_db(db_path: str):
|
||||
html TEXT,
|
||||
cleaned_html TEXT,
|
||||
markdown TEXT,
|
||||
parsed_json TEXT,
|
||||
extracted_content TEXT,
|
||||
success BOOLEAN
|
||||
)
|
||||
''')
|
||||
conn.commit()
|
||||
conn.close()
|
||||
DB_PATH = db_path
|
||||
|
||||
def check_db_path():
|
||||
if not DB_PATH:
|
||||
@ -35,7 +34,7 @@ def get_cached_url(url: str) -> Optional[Tuple[str, str, str, str, str, bool]]:
|
||||
try:
|
||||
conn = sqlite3.connect(DB_PATH)
|
||||
cursor = conn.cursor()
|
||||
cursor.execute('SELECT url, html, cleaned_html, markdown, parsed_json, success FROM crawled_data WHERE url = ?', (url,))
|
||||
cursor.execute('SELECT url, html, cleaned_html, markdown, extracted_content, success FROM crawled_data WHERE url = ?', (url,))
|
||||
result = cursor.fetchone()
|
||||
conn.close()
|
||||
return result
|
||||
@ -43,21 +42,21 @@ def get_cached_url(url: str) -> Optional[Tuple[str, str, str, str, str, bool]]:
|
||||
print(f"Error retrieving cached URL: {e}")
|
||||
return None
|
||||
|
||||
def cache_url(url: str, html: str, cleaned_html: str, markdown: str, parsed_json: str, success: bool):
|
||||
def cache_url(url: str, html: str, cleaned_html: str, markdown: str, extracted_content: str, success: bool):
|
||||
check_db_path()
|
||||
try:
|
||||
conn = sqlite3.connect(DB_PATH)
|
||||
cursor = conn.cursor()
|
||||
cursor.execute('''
|
||||
INSERT INTO crawled_data (url, html, cleaned_html, markdown, parsed_json, success)
|
||||
INSERT INTO crawled_data (url, html, cleaned_html, markdown, extracted_content, success)
|
||||
VALUES (?, ?, ?, ?, ?, ?)
|
||||
ON CONFLICT(url) DO UPDATE SET
|
||||
html = excluded.html,
|
||||
cleaned_html = excluded.cleaned_html,
|
||||
markdown = excluded.markdown,
|
||||
parsed_json = excluded.parsed_json,
|
||||
extracted_content = excluded.extracted_content,
|
||||
success = excluded.success
|
||||
''', (url, html, cleaned_html, markdown, parsed_json, success))
|
||||
''', (url, html, cleaned_html, markdown, extracted_content, success))
|
||||
conn.commit()
|
||||
conn.close()
|
||||
except Exception as e:
|
||||
@ -85,4 +84,15 @@ def clear_db():
|
||||
conn.commit()
|
||||
conn.close()
|
||||
except Exception as e:
|
||||
print(f"Error clearing database: {e}")
|
||||
print(f"Error clearing database: {e}")
|
||||
|
||||
def flush_db():
|
||||
check_db_path()
|
||||
try:
|
||||
conn = sqlite3.connect(DB_PATH)
|
||||
cursor = conn.cursor()
|
||||
cursor.execute('DROP TABLE crawled_data')
|
||||
conn.commit()
|
||||
conn.close()
|
||||
except Exception as e:
|
||||
print(f"Error flushing database: {e}")
|
||||
@ -3,19 +3,20 @@ from typing import Any, List, Dict, Optional, Union
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
import json, time
|
||||
# from optimum.intel import IPEXModel
|
||||
from .prompts import PROMPT_EXTRACT_BLOCKS
|
||||
from .prompts import PROMPT_EXTRACT_BLOCKS, PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION
|
||||
from .config import *
|
||||
from .utils import *
|
||||
from functools import partial
|
||||
from .model_loader import load_bert_base_uncased, load_bge_small_en_v1_5, load_spacy_model
|
||||
|
||||
|
||||
from transformers import pipeline
|
||||
from sklearn.metrics.pairwise import cosine_similarity
|
||||
import numpy as np
|
||||
class ExtractionStrategy(ABC):
|
||||
"""
|
||||
Abstract base class for all extraction strategies.
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
def __init__(self, **kwargs):
|
||||
self.DEL = "<|DEL|>"
|
||||
self.name = self.__class__.__name__
|
||||
|
||||
@ -38,12 +39,12 @@ class ExtractionStrategy(ABC):
|
||||
:param sections: List of sections (strings) to process.
|
||||
:return: A list of processed JSON blocks.
|
||||
"""
|
||||
parsed_json = []
|
||||
extracted_content = []
|
||||
with ThreadPoolExecutor() as executor:
|
||||
futures = [executor.submit(self.extract, url, section, **kwargs) for section in sections]
|
||||
for future in as_completed(futures):
|
||||
parsed_json.extend(future.result())
|
||||
return parsed_json
|
||||
extracted_content.extend(future.result())
|
||||
return extracted_content
|
||||
|
||||
class NoExtractionStrategy(ExtractionStrategy):
|
||||
def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]:
|
||||
@ -53,37 +54,41 @@ class NoExtractionStrategy(ExtractionStrategy):
|
||||
return [{"index": i, "tags": [], "content": section} for i, section in enumerate(sections)]
|
||||
|
||||
class LLMExtractionStrategy(ExtractionStrategy):
|
||||
def __init__(self, provider: str = DEFAULT_PROVIDER, api_token: Optional[str] = None):
|
||||
def __init__(self, provider: str = DEFAULT_PROVIDER, api_token: Optional[str] = None, instruction:str = None, **kwargs):
|
||||
"""
|
||||
Initialize the strategy with clustering parameters.
|
||||
|
||||
:param word_count_threshold: Minimum number of words per cluster.
|
||||
:param max_dist: The maximum cophenetic distance on the dendrogram to form clusters.
|
||||
:param linkage_method: The linkage method for hierarchical clustering.
|
||||
:param provider: The provider to use for extraction.
|
||||
:param api_token: The API token for the provider.
|
||||
:param instruction: The instruction to use for the LLM model.
|
||||
"""
|
||||
super().__init__()
|
||||
self.provider = provider
|
||||
self.api_token = api_token or PROVIDER_MODELS.get(provider, None) or os.getenv("OPENAI_API_KEY")
|
||||
self.instruction = instruction
|
||||
|
||||
if not self.api_token:
|
||||
raise ValueError("API token must be provided for LLMExtractionStrategy. Update the config.py or set OPENAI_API_KEY environment variable.")
|
||||
|
||||
|
||||
def extract(self, url: str, html: str) -> List[Dict[str, Any]]:
|
||||
print("[LOG] Extracting blocks from URL:", url)
|
||||
def extract(self, url: str, ix:int, html: str) -> List[Dict[str, Any]]:
|
||||
# print("[LOG] Extracting blocks from URL:", url)
|
||||
print(f"[LOG] Call LLM for {url} - block index: {ix}")
|
||||
variable_values = {
|
||||
"URL": url,
|
||||
"HTML": escape_json_string(sanitize_html(html)),
|
||||
}
|
||||
|
||||
if self.instruction:
|
||||
variable_values["REQUEST"] = self.instruction
|
||||
|
||||
prompt_with_variables = PROMPT_EXTRACT_BLOCKS
|
||||
prompt_with_variables = PROMPT_EXTRACT_BLOCKS if not self.instruction else PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION
|
||||
for variable in variable_values:
|
||||
prompt_with_variables = prompt_with_variables.replace(
|
||||
"{" + variable + "}", variable_values[variable]
|
||||
)
|
||||
|
||||
response = perform_completion_with_backoff(self.provider, prompt_with_variables, self.api_token)
|
||||
|
||||
try:
|
||||
blocks = extract_xml_data(["blocks"], response.choices[0].message.content)['blocks']
|
||||
blocks = json.loads(blocks)
|
||||
@ -101,7 +106,7 @@ class LLMExtractionStrategy(ExtractionStrategy):
|
||||
"content": unparsed
|
||||
})
|
||||
|
||||
print("[LOG] Extracted", len(blocks), "blocks from URL:", url)
|
||||
print("[LOG] Extracted", len(blocks), "blocks from URL:", url, "block index:", ix)
|
||||
return blocks
|
||||
|
||||
def _merge(self, documents):
|
||||
@ -130,29 +135,30 @@ class LLMExtractionStrategy(ExtractionStrategy):
|
||||
"""
|
||||
|
||||
merged_sections = self._merge(sections)
|
||||
parsed_json = []
|
||||
extracted_content = []
|
||||
if self.provider.startswith("groq/"):
|
||||
# Sequential processing with a delay
|
||||
for section in merged_sections:
|
||||
parsed_json.extend(self.extract(url, section))
|
||||
for ix, section in enumerate(merged_sections):
|
||||
extracted_content.extend(self.extract(ix, url, section))
|
||||
time.sleep(0.5) # 500 ms delay between each processing
|
||||
else:
|
||||
# Parallel processing using ThreadPoolExecutor
|
||||
with ThreadPoolExecutor(max_workers=4) as executor:
|
||||
extract_func = partial(self.extract, url)
|
||||
futures = [executor.submit(extract_func, section) for section in merged_sections]
|
||||
futures = [executor.submit(extract_func, ix, section) for ix, section in enumerate(merged_sections)]
|
||||
|
||||
for future in as_completed(futures):
|
||||
parsed_json.extend(future.result())
|
||||
extracted_content.extend(future.result())
|
||||
|
||||
|
||||
return parsed_json
|
||||
return extracted_content
|
||||
|
||||
class CosineStrategy(ExtractionStrategy):
|
||||
def __init__(self, word_count_threshold=20, max_dist=0.2, linkage_method='ward', top_k=3, model_name = 'BAAI/bge-small-en-v1.5'):
|
||||
def __init__(self, semantic_filter = None, word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name = 'BAAI/bge-small-en-v1.5', **kwargs):
|
||||
"""
|
||||
Initialize the strategy with clustering parameters.
|
||||
|
||||
:param semantic_filter: A keyword filter for document filtering.
|
||||
:param word_count_threshold: Minimum number of words per cluster.
|
||||
:param max_dist: The maximum cophenetic distance on the dendrogram to form clusters.
|
||||
:param linkage_method: The linkage method for hierarchical clustering.
|
||||
@ -163,11 +169,14 @@ class CosineStrategy(ExtractionStrategy):
|
||||
from transformers import AutoTokenizer, AutoModel
|
||||
import spacy
|
||||
|
||||
self.semantic_filter = semantic_filter
|
||||
self.word_count_threshold = word_count_threshold
|
||||
self.max_dist = max_dist
|
||||
self.linkage_method = linkage_method
|
||||
self.top_k = top_k
|
||||
self.timer = time.time()
|
||||
|
||||
self.buffer_embeddings = np.array([])
|
||||
|
||||
if model_name == "bert-base-uncased":
|
||||
self.tokenizer, self.model = load_bert_base_uncased()
|
||||
@ -177,13 +186,42 @@ class CosineStrategy(ExtractionStrategy):
|
||||
self.nlp = load_spacy_model()
|
||||
print(f"[LOG] Model loaded {model_name}, models/reuters, took " + str(time.time() - self.timer) + " seconds")
|
||||
|
||||
def get_embeddings(self, sentences: List[str]):
|
||||
|
||||
def filter_documents_embeddings(self, documents: List[str], semantic_filter: str, threshold: float = 0.5) -> List[str]:
|
||||
"""
|
||||
Filter documents based on the cosine similarity of their embeddings with the semantic_filter embedding.
|
||||
|
||||
:param documents: List of text chunks (documents).
|
||||
:param semantic_filter: A string containing the keywords for filtering.
|
||||
:param threshold: Cosine similarity threshold for filtering documents.
|
||||
:return: Filtered list of documents.
|
||||
"""
|
||||
if not semantic_filter:
|
||||
return documents
|
||||
# Compute embedding for the keyword filter
|
||||
query_embedding = self.get_embeddings([semantic_filter])[0]
|
||||
|
||||
# Compute embeddings for the documents
|
||||
document_embeddings = self.get_embeddings(documents)
|
||||
|
||||
# Calculate cosine similarity between the query embedding and document embeddings
|
||||
similarities = cosine_similarity([query_embedding], document_embeddings).flatten()
|
||||
|
||||
# Filter documents based on the similarity threshold
|
||||
filtered_docs = [doc for doc, sim in zip(documents, similarities) if sim >= threshold]
|
||||
|
||||
return filtered_docs
|
||||
|
||||
def get_embeddings(self, sentences: List[str], bypass_buffer=True):
|
||||
"""
|
||||
Get BERT embeddings for a list of sentences.
|
||||
|
||||
:param sentences: List of text chunks (sentences).
|
||||
:return: NumPy array of embeddings.
|
||||
"""
|
||||
# if self.buffer_embeddings.any() and not bypass_buffer:
|
||||
# return self.buffer_embeddings
|
||||
|
||||
import torch
|
||||
# Tokenize sentences and convert to tensor
|
||||
encoded_input = self.tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
|
||||
@ -193,6 +231,7 @@ class CosineStrategy(ExtractionStrategy):
|
||||
|
||||
# Get embeddings from the last hidden state (mean pooling)
|
||||
embeddings = model_output.last_hidden_state.mean(1)
|
||||
self.buffer_embeddings = embeddings.numpy()
|
||||
return embeddings.numpy()
|
||||
|
||||
def hierarchical_clustering(self, sentences: List[str]):
|
||||
@ -206,7 +245,7 @@ class CosineStrategy(ExtractionStrategy):
|
||||
from scipy.cluster.hierarchy import linkage, fcluster
|
||||
from scipy.spatial.distance import pdist
|
||||
self.timer = time.time()
|
||||
embeddings = self.get_embeddings(sentences)
|
||||
embeddings = self.get_embeddings(sentences, bypass_buffer=False)
|
||||
# print(f"[LOG] 🚀 Embeddings computed in {time.time() - self.timer:.2f} seconds")
|
||||
# Compute pairwise cosine distances
|
||||
distance_matrix = pdist(embeddings, 'cosine')
|
||||
@ -247,6 +286,12 @@ class CosineStrategy(ExtractionStrategy):
|
||||
# Assume `html` is a list of text chunks for this strategy
|
||||
t = time.time()
|
||||
text_chunks = html.split(self.DEL) # Split by lines or paragraphs as needed
|
||||
|
||||
# Pre-filter documents using embeddings and semantic_filter
|
||||
text_chunks = self.filter_documents_embeddings(text_chunks, self.semantic_filter)
|
||||
|
||||
if not text_chunks:
|
||||
return []
|
||||
|
||||
# Perform clustering
|
||||
labels = self.hierarchical_clustering(text_chunks)
|
||||
@ -290,7 +335,7 @@ class CosineStrategy(ExtractionStrategy):
|
||||
return self.extract(url, self.DEL.join(sections), **kwargs)
|
||||
|
||||
class TopicExtractionStrategy(ExtractionStrategy):
|
||||
def __init__(self, num_keywords: int = 3):
|
||||
def __init__(self, num_keywords: int = 3, **kwargs):
|
||||
"""
|
||||
Initialize the topic extraction strategy with parameters for topic segmentation.
|
||||
|
||||
@ -358,7 +403,7 @@ class TopicExtractionStrategy(ExtractionStrategy):
|
||||
return self.extract(url, self.DEL.join(sections), **kwargs)
|
||||
|
||||
class ContentSummarizationStrategy(ExtractionStrategy):
|
||||
def __init__(self, model_name: str = "sshleifer/distilbart-cnn-12-6"):
|
||||
def __init__(self, model_name: str = "sshleifer/distilbart-cnn-12-6", **kwargs):
|
||||
"""
|
||||
Initialize the content summarization strategy with a specific model.
|
||||
|
||||
|
||||
@ -11,5 +11,6 @@ class CrawlResult(BaseModel):
|
||||
success: bool
|
||||
cleaned_html: str = None
|
||||
markdown: str = None
|
||||
parsed_json: str = None
|
||||
extracted_content: str = None
|
||||
metadata: dict = None
|
||||
error_message: str = None
|
||||
@ -59,7 +59,7 @@ Please provide your output within <blocks> tags, like this:
|
||||
|
||||
Remember, the output should be a complete, parsable JSON wrapped in <blocks> tags, with no omissions or errors. The JSON objects should semantically break down the content into relevant blocks, maintaining the original order."""
|
||||
|
||||
PROMPT_EXTRACT_BLOCKS = """YHere is the URL of the webpage:
|
||||
PROMPT_EXTRACT_BLOCKS = """Here is the URL of the webpage:
|
||||
<url>{URL}</url>
|
||||
|
||||
And here is the cleaned HTML content of that webpage:
|
||||
@ -107,4 +107,61 @@ Please provide your output within <blocks> tags, like this:
|
||||
}]
|
||||
</blocks>
|
||||
|
||||
Remember, the output should be a complete, parsable JSON wrapped in <blocks> tags, with no omissions or errors. The JSON objects should semantically break down the content into relevant blocks, maintaining the original order."""
|
||||
|
||||
PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION = """Here is the URL of the webpage:
|
||||
<url>{URL}</url>
|
||||
|
||||
And here is the cleaned HTML content of that webpage:
|
||||
<html>
|
||||
{HTML}
|
||||
</html>
|
||||
|
||||
Your task is to break down this HTML content into semantically relevant blocks, following the provided user's REQUEST, and for each block, generate a JSON object with the following keys:
|
||||
|
||||
- index: an integer representing the index of the block in the content
|
||||
- content: a list of strings containing the text content of the block
|
||||
|
||||
This is the user's REQUEST, pay attention to it:
|
||||
<request>
|
||||
{REQUEST}
|
||||
</request>
|
||||
|
||||
To generate the JSON objects:
|
||||
|
||||
1. Carefully read through the HTML content and identify logical breaks or shifts in the content that would warrant splitting it into separate blocks.
|
||||
|
||||
2. For each block:
|
||||
a. Assign it an index based on its order in the content.
|
||||
b. Analyze the content and generate ONE semantic tag that describe what the block is about.
|
||||
c. Extract the text content, EXACTLY SAME AS GIVE DATA, clean it up if needed, and store it as a list of strings in the "content" field.
|
||||
|
||||
3. Ensure that the order of the JSON objects matches the order of the blocks as they appear in the original HTML content.
|
||||
|
||||
4. Double-check that each JSON object includes all required keys (index, tag, content) and that the values are in the expected format (integer, list of strings, etc.).
|
||||
|
||||
5. Make sure the generated JSON is complete and parsable, with no errors or omissions.
|
||||
|
||||
6. Make sur to escape any special characters in the HTML content, and also single or double quote to avoid JSON parsing issues.
|
||||
|
||||
7. Never alter the extracted content, just copy and paste it as it is.
|
||||
|
||||
Please provide your output within <blocks> tags, like this:
|
||||
|
||||
<blocks>
|
||||
[{
|
||||
"index": 0,
|
||||
"tags": ["introduction"],
|
||||
"content": ["This is the first paragraph of the article, which provides an introduction and overview of the main topic."]
|
||||
},
|
||||
{
|
||||
"index": 1,
|
||||
"tags": ["background"],
|
||||
"content": ["This is the second paragraph, which delves into the history and background of the topic.",
|
||||
"It provides context and sets the stage for the rest of the article."]
|
||||
}]
|
||||
</blocks>
|
||||
|
||||
**Make sure to follow the user instruction to extract blocks aligin with the instruction.**
|
||||
|
||||
Remember, the output should be a complete, parsable JSON wrapped in <blocks> tags, with no omissions or errors. The JSON objects should semantically break down the content into relevant blocks, maintaining the original order."""
|
||||
@ -461,17 +461,17 @@ def merge_chunks_based_on_token_threshold(chunks, token_threshold):
|
||||
return merged_sections
|
||||
|
||||
def process_sections(url: str, sections: list, provider: str, api_token: str) -> list:
|
||||
parsed_json = []
|
||||
extracted_content = []
|
||||
if provider.startswith("groq/"):
|
||||
# Sequential processing with a delay
|
||||
for section in sections:
|
||||
parsed_json.extend(extract_blocks(url, section, provider, api_token))
|
||||
extracted_content.extend(extract_blocks(url, section, provider, api_token))
|
||||
time.sleep(0.5) # 500 ms delay between each processing
|
||||
else:
|
||||
# Parallel processing using ThreadPoolExecutor
|
||||
with ThreadPoolExecutor() as executor:
|
||||
futures = [executor.submit(extract_blocks, url, section, provider, api_token) for section in sections]
|
||||
for future in as_completed(futures):
|
||||
parsed_json.extend(future.result())
|
||||
extracted_content.extend(future.result())
|
||||
|
||||
return parsed_json
|
||||
return extracted_content
|
||||
@ -1,8 +1,9 @@
|
||||
import os, time
|
||||
os.environ["TOKENIZERS_PARALLELISM"] = "false"
|
||||
from pathlib import Path
|
||||
|
||||
from .models import UrlModel, CrawlResult
|
||||
from .database import init_db, get_cached_url, cache_url, DB_PATH
|
||||
from .database import init_db, get_cached_url, cache_url, DB_PATH, flush_db
|
||||
from .utils import *
|
||||
from .chunking_strategy import *
|
||||
from .extraction_strategy import *
|
||||
@ -16,11 +17,13 @@ from .config import *
|
||||
class WebCrawler:
|
||||
def __init__(
|
||||
self,
|
||||
db_path: str = None,
|
||||
# db_path: str = None,
|
||||
crawler_strategy: CrawlerStrategy = LocalSeleniumCrawlerStrategy(),
|
||||
always_by_pass_cache: bool = False,
|
||||
):
|
||||
self.db_path = db_path
|
||||
# self.db_path = db_path
|
||||
self.crawler_strategy = crawler_strategy
|
||||
self.always_by_pass_cache = always_by_pass_cache
|
||||
|
||||
# Create the .crawl4ai folder in the user's home directory if it doesn't exist
|
||||
self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
|
||||
@ -28,10 +31,11 @@ class WebCrawler:
|
||||
os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
|
||||
|
||||
# If db_path is not provided, use the default path
|
||||
if not db_path:
|
||||
self.db_path = f"{self.crawl4ai_folder}/crawl4ai.db"
|
||||
# if not db_path:
|
||||
# self.db_path = f"{self.crawl4ai_folder}/crawl4ai.db"
|
||||
|
||||
init_db(self.db_path)
|
||||
flush_db()
|
||||
init_db()
|
||||
|
||||
self.ready = False
|
||||
|
||||
@ -93,7 +97,7 @@ class WebCrawler:
|
||||
word_count_threshold = MIN_WORD_THRESHOLD
|
||||
|
||||
# Check cache first
|
||||
if not bypass_cache:
|
||||
if not bypass_cache and not self.always_by_pass_cache:
|
||||
cached = get_cached_url(url)
|
||||
if cached:
|
||||
return CrawlResult(
|
||||
@ -102,7 +106,7 @@ class WebCrawler:
|
||||
"html": cached[1],
|
||||
"cleaned_html": cached[2],
|
||||
"markdown": cached[3],
|
||||
"parsed_json": cached[4],
|
||||
"extracted_content": cached[4],
|
||||
"success": cached[5],
|
||||
"error_message": "",
|
||||
}
|
||||
@ -130,7 +134,7 @@ class WebCrawler:
|
||||
f"[LOG] 🚀 Crawling done for {url}, success: {success}, time taken: {time.time() - t} seconds"
|
||||
)
|
||||
|
||||
parsed_json = []
|
||||
extracted_content = []
|
||||
if verbose:
|
||||
print(f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}")
|
||||
t = time.time()
|
||||
@ -138,10 +142,10 @@ class WebCrawler:
|
||||
sections = chunking_strategy.chunk(markdown)
|
||||
# sections = merge_chunks_based_on_token_threshold(sections, CHUNK_TOKEN_THRESHOLD)
|
||||
|
||||
parsed_json = extraction_strategy.run(
|
||||
extracted_content = extraction_strategy.run(
|
||||
url, sections,
|
||||
)
|
||||
parsed_json = json.dumps(parsed_json)
|
||||
extracted_content = json.dumps(extracted_content)
|
||||
|
||||
if verbose:
|
||||
print(
|
||||
@ -155,7 +159,7 @@ class WebCrawler:
|
||||
html,
|
||||
cleaned_html,
|
||||
markdown,
|
||||
parsed_json,
|
||||
extracted_content,
|
||||
success,
|
||||
)
|
||||
|
||||
@ -164,7 +168,7 @@ class WebCrawler:
|
||||
html=html,
|
||||
cleaned_html=cleaned_html,
|
||||
markdown=markdown,
|
||||
parsed_json=parsed_json,
|
||||
extracted_content=extracted_content,
|
||||
success=success,
|
||||
error_message=error_message,
|
||||
)
|
||||
|
||||
@ -1,9 +1,9 @@
|
||||
{
|
||||
"NoExtractionStrategy": "### NoExtractionStrategy\n\n`NoExtractionStrategy` is a basic extraction strategy that returns the entire HTML content without any modification. It is useful for cases where no specific extraction is required. Only clean html, and amrkdown.\n\n#### Constructor Parameters:\nNone.\n\n#### Example usage:\n```python\nextractor = NoExtractionStrategy()\nextracted_content = extractor.extract(url, html)\n```",
|
||||
|
||||
"LLMExtractionStrategy": "### LLMExtractionStrategy\n\n`LLMExtractionStrategy` uses a Language Model (LLM) to extract meaningful blocks or chunks from the given HTML content. This strategy leverages an external provider for language model completions.\n\n#### Constructor Parameters:\n- `provider` (str, optional): The provider to use for the language model completions. Default is `DEFAULT_PROVIDER` (following provider/model eg. openai/gpt-4o).\n- `api_token` (str, optional): The API token for the provider. If not provided, it will try to load from the environment variable `OPENAI_API_KEY`.\n\n#### Example usage:\n```python\nextractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token')\nextracted_content = extractor.extract(url, html)\n```",
|
||||
"LLMExtractionStrategy": "### LLMExtractionStrategy\n\n`LLMExtractionStrategy` uses a Language Model (LLM) to extract meaningful blocks or chunks from the given HTML content. This strategy leverages an external provider for language model completions.\n\n#### Constructor Parameters:\n- `provider` (str, optional): The provider to use for the language model completions. Default is `DEFAULT_PROVIDER` (e.g., openai/gpt-4).\n- `api_token` (str, optional): The API token for the provider. If not provided, it will try to load from the environment variable `OPENAI_API_KEY`.\n- `instruction` (str, optional): An instruction to guide the LLM on how to perform the extraction. This allows users to specify the type of data they are interested in or set the tone of the response. Default is `None`.\n\n#### Example usage:\n```python\nextractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token', instruction='Extract only news about AI.')\nextracted_content = extractor.extract(url, html)\n```\n\nBy providing clear instructions, users can tailor the extraction process to their specific needs, enhancing the relevance and utility of the extracted content.",
|
||||
|
||||
"CosineStrategy": "### CosineStrategy\n\n`CosineStrategy` uses hierarchical clustering based on cosine similarity to extract clusters of text from the given HTML content. This strategy is suitable for identifying related content sections.\n\n#### Constructor Parameters:\n- `word_count_threshold` (int, optional): Minimum number of words per cluster. Default is `20`.\n- `max_dist` (float, optional): The maximum cophenetic distance on the dendrogram to form clusters. Default is `0.2`.\n- `linkage_method` (str, optional): The linkage method for hierarchical clustering. Default is `'ward'`.\n- `top_k` (int, optional): Number of top categories to extract. Default is `3`.\n- `model_name` (str, optional): The model name for embedding generation. Default is `'BAAI/bge-small-en-v1.5'`.\n\n#### Example usage:\n```python\nextractor = CosineStrategy(word_count_threshold=20, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')\nextracted_content = extractor.extract(url, html)\n```",
|
||||
"CosineStrategy": "### CosineStrategy\n\n`CosineStrategy` uses hierarchical clustering based on cosine similarity to extract clusters of text from the given HTML content. This strategy is suitable for identifying related content sections.\n\n#### Constructor Parameters:\n- `semantic_filter` (str, optional): A string containing keywords for filtering relevant documents before clustering. If provided, documents are filtered based on their cosine similarity to the keyword filter embedding. Default is `None`.\n- `word_count_threshold` (int, optional): Minimum number of words per cluster. Default is `20`.\n- `max_dist` (float, optional): The maximum cophenetic distance on the dendrogram to form clusters. Default is `0.2`.\n- `linkage_method` (str, optional): The linkage method for hierarchical clustering. Default is `'ward'`.\n- `top_k` (int, optional): Number of top categories to extract. Default is `3`.\n- `model_name` (str, optional): The model name for embedding generation. Default is `'BAAI/bge-small-en-v1.5'`.\n\n#### Example usage:\n```python\nextractor = CosineStrategy(semantic_filter='artificial intelligence', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')\nextracted_content = extractor.extract(url, html)\n```\n\n#### Cosine Similarity Filtering\n\nWhen a `semantic_filter` is provided, the `CosineStrategy` applies an embedding-based filtering process to select relevant documents before performing hierarchical clustering.",
|
||||
|
||||
"TopicExtractionStrategy": "### TopicExtractionStrategy\n\n`TopicExtractionStrategy` uses the TextTiling algorithm to segment the HTML content into topics and extracts keywords for each segment. This strategy is useful for identifying and summarizing thematic content.\n\n#### Constructor Parameters:\n- `num_keywords` (int, optional): Number of keywords to represent each topic segment. Default is `3`.\n\n#### Example usage:\n```python\nextractor = TopicExtractionStrategy(num_keywords=3)\nextracted_content = extractor.extract(url, html)\n```"
|
||||
}
|
||||
|
||||
@ -1,22 +1,195 @@
|
||||
import os
|
||||
import os, time
|
||||
from crawl4ai.web_crawler import WebCrawler
|
||||
from crawl4ai.chunking_strategy import *
|
||||
from crawl4ai.extraction_strategy import *
|
||||
from crawl4ai.crawler_strategy import *
|
||||
from rich import print
|
||||
from rich.console import Console
|
||||
|
||||
console = Console()
|
||||
|
||||
def print_result(result):
|
||||
# Print each key on one line with just the first 20 characters of its value and three dots
|
||||
console.print(f"\t[bold]Result:[/bold]")
|
||||
for key, value in result.model_dump().items():
|
||||
if isinstance(value, str) and value:
|
||||
console.print(f"\t{key}: [green]{value[:20]}...[/green]")
|
||||
|
||||
def cprint(message, press_any_key=False):
|
||||
console.print(message)
|
||||
if press_any_key:
|
||||
console.print("Press any key to continue...", style="")
|
||||
input()
|
||||
|
||||
def main():
|
||||
# 🚀 Let's get started with the basics!
|
||||
cprint("🌟 [bold green]Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun! 🌐[/bold green]")
|
||||
|
||||
# Basic usage: Just provide the URL
|
||||
cprint("⛳️ [bold cyan]First Step: Create an instance of WebCrawler and call the `warmup()` function.[/bold cyan]")
|
||||
cprint("If this is the first time you're running Crawl4ai, this might take a few seconds to lead required model files.", True)
|
||||
|
||||
crawler = WebCrawler()
|
||||
crawler.warmup()
|
||||
cprint("🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]")
|
||||
result = crawler.run(url="https://www.nbcnews.com/business")
|
||||
cprint("[LOG] 📦 [bold yellow]Basic crawl result:[/bold yellow]")
|
||||
print_result(result)
|
||||
|
||||
# Explanation of bypass_cache and include_raw_html
|
||||
cprint("\n🧠 [bold cyan]Understanding 'bypass_cache' and 'include_raw_html' parameters:[/bold cyan]")
|
||||
cprint("By default, Crawl4ai caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action. Becuase we already crawled this URL, the result will be fetched from the cache. Let's try it out!")
|
||||
|
||||
# Reads from cache
|
||||
cprint("1️⃣ First crawl (caches the result):", True)
|
||||
start_time = time.time()
|
||||
result = crawler.run(url="https://www.nbcnews.com/business")
|
||||
end_time = time.time()
|
||||
cprint(f"[LOG] 📦 [bold yellow]First crawl took {end_time - start_time} seconds and result (from cache):[/bold yellow]")
|
||||
print_result(result)
|
||||
|
||||
# Force to crawl again
|
||||
cprint("2️⃣ Second crawl (Force to crawl again):", True)
|
||||
start_time = time.time()
|
||||
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
|
||||
end_time = time.time()
|
||||
cprint(f"[LOG] 📦 [bold yellow]Second crawl took {end_time - start_time} seconds and result (forced to crawl):[/bold yellow]")
|
||||
print_result(result)
|
||||
|
||||
# Exclude the raw HTML content from the result
|
||||
cprint("\n🔄 [bold cyan]By default 'include_raw_html' is set to True, which includes the raw HTML content in the response.[/bold cyan]", True)
|
||||
result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
|
||||
cprint("[LOG] 📦 [bold yellow]Craw result (without raw HTML content):[/bold yellow]")
|
||||
print_result(result)
|
||||
|
||||
cprint("\n📄 The 'include_raw_html' parameter, when set to True, includes the raw HTML content in the response. By default is set to True. Let's move on to exploring different chunking strategies now!")
|
||||
|
||||
cprint("For the rest of this guide, I set crawler.always_by_pass_cache to True to force the crawler to bypass the cache. This is to ensure that we get fresh results for each run.", True)
|
||||
crawler.always_by_pass_cache = True
|
||||
|
||||
# Adding a chunking strategy: RegexChunking
|
||||
cprint("\n🧩 [bold cyan]Let's add a chunking strategy: RegexChunking![/bold cyan]", True)
|
||||
cprint("RegexChunking is a simple chunking strategy that splits the text based on a given regex pattern. Let's see it in action!")
|
||||
result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
chunking_strategy=RegexChunking(patterns=["\n\n"])
|
||||
)
|
||||
cprint("[LOG] 📦 [bold yellow]RegexChunking result:[/bold yellow]")
|
||||
print_result(result)
|
||||
|
||||
# Adding another chunking strategy: NlpSentenceChunking
|
||||
cprint("\n🔍 [bold cyan]Time to explore another chunking strategy: NlpSentenceChunking![/bold cyan]", True)
|
||||
cprint("NlpSentenceChunking uses NLP techniques to split the text into sentences. Let's see how it performs!")
|
||||
result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
chunking_strategy=NlpSentenceChunking()
|
||||
)
|
||||
cprint("[LOG] 📦 [bold yellow]NlpSentenceChunking result:[/bold yellow]")
|
||||
print_result(result)
|
||||
|
||||
cprint("There are more chunking strategies to explore, make sure to check document, but let's move on to extraction strategies now!")
|
||||
|
||||
# Adding an extraction strategy: CosineStrategy
|
||||
cprint("\n🧠 [bold cyan]Let's get smarter with an extraction strategy: CosineStrategy![/bold cyan]", True)
|
||||
cprint("CosineStrategy uses cosine similarity to extract semantically similar blocks of text. Let's see it in action!")
|
||||
result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
|
||||
)
|
||||
cprint("[LOG] 📦 [bold yellow]CosineStrategy result:[/bold yellow]")
|
||||
print_result(result)
|
||||
|
||||
cprint("You can pass other parameters like 'semantic_filter' to the CosineStrategy to extract semantically similar blocks of text. Let's see it in action!")
|
||||
result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=CosineStrategy(
|
||||
semantic_filter="inflation rent prices",
|
||||
)
|
||||
)
|
||||
|
||||
cprint("[LOG] 📦 [bold yellow]CosineStrategy result with semantic filter:[/bold yellow]")
|
||||
print_result(result)
|
||||
|
||||
# Adding an LLM extraction strategy without instructions
|
||||
cprint("\n🤖 [bold cyan]Time to bring in the big guns: LLMExtractionStrategy without instructions![/bold cyan]", True)
|
||||
cprint("LLMExtractionStrategy uses a large language model to extract relevant information from the web page. Let's see it in action!")
|
||||
result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
|
||||
)
|
||||
cprint("[LOG] 📦 [bold yellow]LLMExtractionStrategy (no instructions) result:[/bold yellow]")
|
||||
print_result(result)
|
||||
|
||||
cprint("You can pass other providers like 'groq/llama3-70b-8192' or 'ollama/llama3' to the LLMExtractionStrategy.")
|
||||
|
||||
# Adding an LLM extraction strategy with instructions
|
||||
cprint("\n📜 [bold cyan]Let's make it even more interesting: LLMExtractionStrategy with instructions![/bold cyan]", True)
|
||||
cprint("Let's say we are only interested in financial news. Let's see how LLMExtractionStrategy performs with instructions!")
|
||||
result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o",
|
||||
api_token=os.getenv('OPENAI_API_KEY'),
|
||||
instruction="I am interested in only financial news"
|
||||
)
|
||||
)
|
||||
cprint("[LOG] 📦 [bold yellow]LLMExtractionStrategy (with instructions) result:[/bold yellow]")
|
||||
print_result(result)
|
||||
|
||||
result = crawler.run(
|
||||
url="https://www.example.com",
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o",
|
||||
api_token=os.getenv('OPENAI_API_KEY'),
|
||||
instruction="Extract only content related to technology"
|
||||
)
|
||||
)
|
||||
|
||||
cprint("You can pass other instructions like 'Extract only content related to technology' to the LLMExtractionStrategy.")
|
||||
|
||||
cprint("There are more extraction strategies to explore, make sure to check the documentation!")
|
||||
|
||||
# Using a CSS selector to extract only H2 tags
|
||||
cprint("\n🎯 [bold cyan]Targeted extraction: Let's use a CSS selector to extract only H2 tags![/bold cyan]", True)
|
||||
result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
css_selector="h2"
|
||||
)
|
||||
cprint("[LOG] 📦 [bold yellow]CSS Selector (H2 tags) result:[/bold yellow]")
|
||||
print_result(result)
|
||||
|
||||
# Passing JavaScript code to interact with the page
|
||||
cprint("\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]", True)
|
||||
cprint("In this example we try to click the 'Load More' button on the page using JavaScript code.")
|
||||
js_code = """
|
||||
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
|
||||
loadMoreButton && loadMoreButton.click();
|
||||
"""
|
||||
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
|
||||
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
|
||||
result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
)
|
||||
cprint("[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]")
|
||||
print_result(result)
|
||||
|
||||
cprint("\n🎉 [bold green]Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! 🕸️[/bold green]")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
||||
def old_main():
|
||||
js_code = """const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"""
|
||||
# js_code = None
|
||||
crawler = WebCrawler(crawler_strategy=LocalSeleniumCrawlerStrategy(use_cached_html=False, js_code=js_code))
|
||||
crawler.warmup()
|
||||
# Single page crawl
|
||||
result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
word_count_threshold=5, # Minimum word count for an HTML tag to be considered a worthy block
|
||||
chunking_strategy=RegexChunking(patterns=["\n\n"]), # Default is RegexChunking
|
||||
extraction_strategy=CosineStrategy(
|
||||
word_count_threshold=20, max_dist=0.2, linkage_method="ward", top_k=3
|
||||
), # Default is CosineStrategy
|
||||
# extraction_strategy= LLMExtractionStrategy(provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY')),
|
||||
# extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3), # Default is CosineStrategy
|
||||
extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'), instruction="I am interested in only financial news"),
|
||||
bypass_cache=True,
|
||||
extract_blocks=True, # Whether to extract semantic blocks of text from the HTML
|
||||
css_selector="", # Eg: "div.article-body" or all H2 tags liek "h2"
|
||||
@ -28,6 +201,3 @@ def main():
|
||||
print("[LOG] 📦 Crawl result:")
|
||||
print(result.model_dump())
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
||||
27
main.py
@ -7,6 +7,7 @@ from fastapi import FastAPI, HTTPException, Request
|
||||
from fastapi.responses import HTMLResponse, JSONResponse
|
||||
from fastapi.staticfiles import StaticFiles
|
||||
from fastapi.middleware.cors import CORSMiddleware
|
||||
from fastapi.templating import Jinja2Templates
|
||||
|
||||
from pydantic import BaseModel, HttpUrl
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
@ -35,7 +36,7 @@ app.add_middleware(
|
||||
|
||||
# Mount the pages directory as a static directory
|
||||
app.mount("/pages", StaticFiles(directory=__location__ + "/pages"), name="pages")
|
||||
|
||||
templates = Jinja2Templates(directory=__location__ + "/pages")
|
||||
# chromedriver_autoinstaller.install() # Ensure chromedriver is installed
|
||||
@lru_cache()
|
||||
def get_crawler():
|
||||
@ -51,16 +52,24 @@ class CrawlRequest(BaseModel):
|
||||
extract_blocks: bool = True
|
||||
word_count_threshold: Optional[int] = 5
|
||||
extraction_strategy: Optional[str] = "CosineStrategy"
|
||||
extraction_strategy_args: Optional[dict] = {}
|
||||
chunking_strategy: Optional[str] = "RegexChunking"
|
||||
chunking_strategy_args: Optional[dict] = {}
|
||||
css_selector: Optional[str] = None
|
||||
verbose: Optional[bool] = True
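# Example payload for POST /crawl built from the fields above (illustrative values only;
# the *_args dictionaries are forwarded to the corresponding strategy constructors):
# {
#     "urls": ["https://www.nbcnews.com/business"],
#     "extraction_strategy": "CosineStrategy",
#     "extraction_strategy_args": {"semantic_filter": "inflation rent prices"},
#     "chunking_strategy": "RegexChunking",
#     "css_selector": "p"
# }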
|
||||
|
||||
|
||||
@app.get("/", response_class=HTMLResponse)
|
||||
async def read_index():
|
||||
with open(f"{__location__}/pages/index.html", "r") as file:
|
||||
html_content = file.read()
|
||||
return HTMLResponse(content=html_content, status_code=200)
|
||||
async def read_index(request: Request):
|
||||
partials_dir = os.path.join(__location__, "pages", "partial")
|
||||
partials = {}
|
||||
|
||||
for filename in os.listdir(partials_dir):
|
||||
if filename.endswith(".html"):
|
||||
with open(os.path.join(partials_dir, filename), "r") as file:
|
||||
partials[filename[:-5]] = file.read()
|
||||
|
||||
return templates.TemplateResponse("index.html", {"request": request, **partials})
|
||||
|
||||
@app.get("/total-count")
|
||||
async def get_total_url_count():
|
||||
@ -73,11 +82,11 @@ async def clear_database():
|
||||
clear_db()
|
||||
return JSONResponse(content={"message": "Database cleared."})
|
||||
|
||||
def import_strategy(module_name: str, class_name: str):
|
||||
def import_strategy(module_name: str, class_name: str, *args, **kwargs):
|
||||
try:
|
||||
module = importlib.import_module(module_name)
|
||||
strategy_class = getattr(module, class_name)
|
||||
return strategy_class()
|
||||
return strategy_class(*args, **kwargs)
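# e.g. import_strategy("crawl4ai.extraction_strategy", "CosineStrategy", semantic_filter="technology")
# now builds CosineStrategy(semantic_filter="technology") instead of a no-argument instance
# (illustrative arguments; any keyword accepted by the strategy constructor can be forwarded).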
|
||||
except ImportError:
|
||||
raise HTTPException(status_code=400, detail=f"Module {module_name} not found.")
|
||||
except AttributeError:
|
||||
@ -95,8 +104,8 @@ async def crawl_urls(crawl_request: CrawlRequest, request: Request):
|
||||
current_requests += 1
|
||||
|
||||
try:
|
||||
extraction_strategy = import_strategy("crawl4ai.extraction_strategy", crawl_request.extraction_strategy)
|
||||
chunking_strategy = import_strategy("crawl4ai.chunking_strategy", crawl_request.chunking_strategy)
|
||||
extraction_strategy = import_strategy("crawl4ai.extraction_strategy", crawl_request.extraction_strategy, **crawl_request.extraction_strategy_args)
|
||||
chunking_strategy = import_strategy("crawl4ai.chunking_strategy", crawl_request.chunking_strategy, **crawl_request.chunking_strategy_args)
|
||||
|
||||
# Use ThreadPoolExecutor to run the synchronous WebCrawler in async manner
|
||||
with ThreadPoolExecutor() as executor:
|
||||
|
||||
131
pages/app.css
Normal file
@ -0,0 +1,131 @@
|
||||
:root {
|
||||
--ifm-font-size-base: 100%;
|
||||
--ifm-line-height-base: 1.65;
|
||||
--ifm-font-family-base: system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans, sans-serif,
|
||||
BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji",
|
||||
"Segoe UI Symbol";
|
||||
}
|
||||
html {
|
||||
-webkit-font-smoothing: antialiased;
|
||||
-webkit-text-size-adjust: 100%;
|
||||
text-size-adjust: 100%;
|
||||
font: var(--ifm-font-size-base) / var(--ifm-line-height-base) var(--ifm-font-family-base);
|
||||
}
|
||||
body {
|
||||
background-color: #1a202c;
|
||||
color: #fff;
|
||||
}
|
||||
.tab-content {
|
||||
max-height: 400px;
|
||||
overflow: auto;
|
||||
}
|
||||
pre {
|
||||
white-space: pre-wrap;
|
||||
font-size: 14px;
|
||||
}
|
||||
pre code {
|
||||
width: 100%;
|
||||
}
|
||||
|
||||
/* Custom styling for docs-item class and Markdown generated elements */
|
||||
.docs-item {
|
||||
background-color: #2d3748; /* bg-gray-800 */
|
||||
padding: 1rem; /* p-4 */
|
||||
border-radius: 0.375rem; /* rounded */
|
||||
box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1); /* shadow-md */
|
||||
margin-bottom: 1rem; /* space between items */
|
||||
line-height: 1.5; /* leading-normal */
|
||||
}
|
||||
|
||||
.docs-item h3,
|
||||
.docs-item h4 {
|
||||
color: #ffffff; /* text-white */
|
||||
font-size: 1.25rem; /* text-xl */
|
||||
font-weight: 700; /* font-bold */
|
||||
margin-bottom: 0.5rem; /* mb-2 */
|
||||
}
|
||||
.docs-item h4 {
|
||||
font-size: 1rem; /* text-xl */
|
||||
}
|
||||
|
||||
.docs-item p {
|
||||
color: #e2e8f0; /* text-gray-300 */
|
||||
margin-bottom: 0.5rem; /* mb-2 */
|
||||
}
|
||||
|
||||
.docs-item code {
|
||||
background-color: #1a202c; /* bg-gray-900 */
|
||||
color: #e2e8f0; /* text-gray-300 */
|
||||
padding: 0.25rem 0.5rem; /* px-2 py-1 */
|
||||
border-radius: 0.25rem; /* rounded */
|
||||
font-size: 0.875rem; /* text-sm */
|
||||
}
|
||||
|
||||
.docs-item pre {
|
||||
background-color: #1a202c; /* bg-gray-900 */
|
||||
color: #e2e8f0; /* text-gray-300 */
|
||||
padding: 0.5rem; /* p-2 */
|
||||
border-radius: 0.375rem; /* rounded */
|
||||
overflow: auto; /* overflow-auto */
|
||||
margin-bottom: 0.5rem; /* mb-2 */
|
||||
}
|
||||
|
||||
.docs-item div {
|
||||
color: #e2e8f0; /* text-gray-300 */
|
||||
font-size: 1rem; /* prose prose-sm */
|
||||
line-height: 1.25rem; /* line-height for readability */
|
||||
}
|
||||
|
||||
/* Adjustments to make prose class more suitable for dark mode */
|
||||
.prose {
|
||||
max-width: none; /* max-w-none */
|
||||
}
|
||||
|
||||
.prose p,
|
||||
.prose ul {
|
||||
margin-bottom: 1rem; /* mb-4 */
|
||||
}
|
||||
|
||||
.prose code {
|
||||
/* background-color: #4a5568; */ /* bg-gray-700 */
|
||||
color: #65a30d; /* text-white */
|
||||
padding: 0.25rem 0.5rem; /* px-1 py-0.5 */
|
||||
border-radius: 0.25rem; /* rounded */
|
||||
display: inline-block; /* inline-block */
|
||||
}
|
||||
|
||||
.prose pre {
|
||||
background-color: #1a202c; /* bg-gray-900 */
|
||||
color: #ffffff; /* text-white */
|
||||
padding: 0.5rem; /* p-2 */
|
||||
border-radius: 0.375rem; /* rounded */
|
||||
}
|
||||
|
||||
.prose h3 {
|
||||
color: #65a30d; /* text-white */
|
||||
font-size: 1.25rem; /* text-xl */
|
||||
font-weight: 700; /* font-bold */
|
||||
margin-bottom: 0.5rem; /* mb-2 */
|
||||
}
|
||||
|
||||
body {
|
||||
background-color: #1a1a1a;
|
||||
color: #b3ff00;
|
||||
}
|
||||
.sidebar {
|
||||
color: #b3ff00;
|
||||
border-right: 1px solid #333;
|
||||
}
|
||||
.sidebar a {
|
||||
color: #b3ff00;
|
||||
text-decoration: none;
|
||||
}
|
||||
.sidebar a:hover {
|
||||
background-color: #555;
|
||||
}
|
||||
.content-section {
|
||||
display: none;
|
||||
}
|
||||
.content-section.active {
|
||||
display: block;
|
||||
}
|
||||
303
pages/app.js
Normal file
@ -0,0 +1,303 @@
|
||||
// JavaScript to manage dynamic form changes and logic
|
||||
document.getElementById("extraction-strategy-select").addEventListener("change", function () {
|
||||
const strategy = this.value;
|
||||
const providerModelSelect = document.getElementById("provider-model-select");
|
||||
const tokenInput = document.getElementById("token-input");
|
||||
const instruction = document.getElementById("instruction");
|
||||
const semantic_filter = document.getElementById("semantic_filter");
|
||||
const instruction_div = document.getElementById("instruction_div");
|
||||
const semantic_filter_div = document.getElementById("semantic_filter_div");
|
||||
const llm_settings = document.getElementById("llm_settings");
|
||||
|
||||
if (strategy === "LLMExtractionStrategy") {
|
||||
// providerModelSelect.disabled = false;
|
||||
// tokenInput.disabled = false;
|
||||
// semantic_filter.disabled = true;
|
||||
// instruction.disabled = false;
|
||||
llm_settings.classList.remove("hidden");
|
||||
instruction_div.classList.remove("hidden");
|
||||
semantic_filter_div.classList.add("hidden");
|
||||
} else if (strategy === "NoExtractionStrategy") {
|
||||
semantic_filter_div.classList.add("hidden");
|
||||
instruction_div.classList.add("hidden");
|
||||
llm_settings.classList.add("hidden");
|
||||
} else {
|
||||
// providerModelSelect.disabled = true;
|
||||
// tokenInput.disabled = true;
|
||||
// semantic_filter.disabled = false;
|
||||
// instruction.disabled = true;
|
||||
llm_settings.classList.add("hidden");
|
||||
instruction_div.classList.add("hidden");
|
||||
semantic_filter_div.classList.remove("hidden");
|
||||
}
|
||||
|
||||
|
||||
});
|
||||
|
||||
// Get the selected provider model and token from local storage
|
||||
const storedProviderModel = localStorage.getItem("provider_model");
|
||||
const storedToken = localStorage.getItem(storedProviderModel);
|
||||
|
||||
if (storedProviderModel) {
|
||||
document.getElementById("provider-model-select").value = storedProviderModel;
|
||||
}
|
||||
|
||||
if (storedToken) {
|
||||
document.getElementById("token-input").value = storedToken;
|
||||
}
|
||||
|
||||
// Handle provider model dropdown change
|
||||
document.getElementById("provider-model-select").addEventListener("change", () => {
|
||||
const selectedProviderModel = document.getElementById("provider-model-select").value;
|
||||
const storedToken = localStorage.getItem(selectedProviderModel);
|
||||
|
||||
if (storedToken) {
|
||||
document.getElementById("token-input").value = storedToken;
|
||||
} else {
|
||||
document.getElementById("token-input").value = "";
|
||||
}
|
||||
});
|
||||
|
||||
// Fetch total count from the database
|
||||
axios
|
||||
.get("/total-count")
|
||||
.then((response) => {
|
||||
document.getElementById("total-count").textContent = response.data.count;
|
||||
})
|
||||
.catch((error) => console.error(error));
|
||||
|
||||
// Handle crawl button click
|
||||
document.getElementById("crawl-btn").addEventListener("click", () => {
|
||||
// validate input to have both URL and API token
|
||||
if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) {
|
||||
alert("Please enter both URL(s) and API token.");
|
||||
return;
|
||||
}
|
||||
|
||||
const selectedProviderModel = document.getElementById("provider-model-select").value;
|
||||
const apiToken = document.getElementById("token-input").value;
|
||||
const extractBlocks = document.getElementById("extract-blocks-checkbox").checked;
|
||||
const bypassCache = document.getElementById("bypass-cache-checkbox").checked;
|
||||
|
||||
// Save the selected provider model and token to local storage
|
||||
localStorage.setItem("provider_model", selectedProviderModel);
|
||||
localStorage.setItem(selectedProviderModel, apiToken);
|
||||
|
||||
const urlsInput = document.getElementById("url-input").value;
|
||||
const urls = urlsInput.split(",").map((url) => url.trim());
|
||||
const data = {
|
||||
urls: urls,
|
||||
provider_model: selectedProviderModel,
|
||||
api_token: apiToken,
|
||||
include_raw_html: true,
|
||||
bypass_cache: bypassCache,
|
||||
extract_blocks: extractBlocks,
|
||||
word_count_threshold: parseInt(document.getElementById("threshold").value),
|
||||
extraction_strategy: document.getElementById("extraction-strategy-select").value,
|
||||
extraction_strategy_args: {
|
||||
provider: selectedProviderModel,
|
||||
api_token: apiToken,
|
||||
instruction: document.getElementById("instruction").value,
|
||||
semantic_filter: document.getElementById("semantic_filter").value,
|
||||
},
|
||||
chunking_strategy: document.getElementById("chunking-strategy-select").value,
|
||||
chunking_strategy_args: {},
|
||||
css_selector: document.getElementById("css-selector").value,
|
||||
// instruction: document.getElementById("instruction").value,
|
||||
// semantic_filter: document.getElementById("semantic_filter").value,
|
||||
verbose: true,
|
||||
};
|
||||
|
||||
// save api token to local storage
|
||||
localStorage.setItem("api_token", document.getElementById("token-input").value);
|
||||
|
||||
document.getElementById("loading").classList.remove("hidden");
|
||||
document.getElementById("result").classList.add("hidden");
|
||||
document.getElementById("code_help").classList.add("hidden");
|
||||
|
||||
axios
|
||||
.post("/crawl", data)
|
||||
.then((response) => {
|
||||
const result = response.data.results[0];
|
||||
const parsedJson = JSON.parse(result.extracted_content);
|
||||
document.getElementById("json-result").textContent = JSON.stringify(parsedJson, null, 2);
|
||||
document.getElementById("cleaned-html-result").textContent = result.cleaned_html;
|
||||
document.getElementById("markdown-result").textContent = result.markdown;
|
||||
|
||||
// Update code examples dynamically
|
||||
const extractionStrategy = data.extraction_strategy;
|
||||
const isLLMExtraction = extractionStrategy === "LLMExtractionStrategy";
|
||||
|
||||
document.getElementById(
|
||||
"curl-code"
|
||||
).textContent = `curl -X POST -H "Content-Type: application/json" -d '${JSON.stringify({
|
||||
...data,
|
||||
api_token: isLLMExtraction ? "your_api_token" : undefined,
|
||||
})}' http://crawl4ai.uccode.io/crawl`;
|
||||
|
||||
document.getElementById("python-code").textContent = `import requests\n\ndata = ${JSON.stringify(
|
||||
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
|
||||
null,
|
||||
2
|
||||
)}\n\nresponse = requests.post("http://crawl4ai.uccode.io/crawl", json=data) # or localhost if you run it locally \nprint(response.json())`;
|
||||
|
||||
document.getElementById(
|
||||
"nodejs-code"
|
||||
).textContent = `const axios = require('axios');\n\nconst data = ${JSON.stringify(
|
||||
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
|
||||
null,
|
||||
2
|
||||
)};\n\naxios.post("http://crawl4ai.uccode.io/crawl", data) // or localhost if you run it locally \n    .then(response => console.log(response.data))\n    .catch(error => console.error(error));`;
|
||||
|
||||
document.getElementById(
|
||||
"library-code"
|
||||
).textContent = `from crawl4ai.web_crawler import WebCrawler\nfrom crawl4ai.extraction_strategy import *\nfrom crawl4ai.chunking_strategy import *\n\ncrawler = WebCrawler()\ncrawler.warmup()\n\nresult = crawler.run(\n url='${
|
||||
urls[0]
|
||||
}',\n word_count_threshold=${data.word_count_threshold},\n extraction_strategy=${
|
||||
isLLMExtraction
|
||||
? `${extractionStrategy}(provider="${data.provider_model}", api_token="${data.api_token}")`
|
||||
: extractionStrategy + "()"
|
||||
},\n chunking_strategy=${data.chunking_strategy}(),\n bypass_cache=${
|
||||
data.bypass_cache
|
||||
},\n css_selector="${data.css_selector}"\n)\nprint(result)`;
|
||||
|
||||
// Highlight code syntax
|
||||
hljs.highlightAll();
|
||||
|
||||
// Select JSON tab by default
|
||||
document.querySelector('.tab-btn[data-tab="json"]').click();
|
||||
|
||||
document.getElementById("loading").classList.add("hidden");
|
||||
|
||||
document.getElementById("result").classList.remove("hidden");
|
||||
document.getElementById("code_help").classList.remove("hidden");
|
||||
|
||||
// increment the total count
|
||||
document.getElementById("total-count").textContent =
|
||||
parseInt(document.getElementById("total-count").textContent) + 1;
|
||||
})
|
||||
.catch((error) => {
|
||||
console.error(error);
|
||||
document.getElementById("loading").classList.add("hidden");
|
||||
});
|
||||
});
|
||||
|
||||
// Handle tab clicks
|
||||
document.querySelectorAll(".tab-btn").forEach((btn) => {
|
||||
btn.addEventListener("click", () => {
|
||||
const tab = btn.dataset.tab;
|
||||
document.querySelectorAll(".tab-btn").forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
|
||||
btn.classList.add("bg-lime-700", "text-white");
|
||||
document.querySelectorAll(".tab-content.code pre").forEach((el) => el.classList.add("hidden"));
|
||||
document.getElementById(`${tab}-result`).parentElement.classList.remove("hidden");
|
||||
});
|
||||
});
|
||||
|
||||
// Handle code tab clicks
|
||||
document.querySelectorAll(".code-tab-btn").forEach((btn) => {
|
||||
btn.addEventListener("click", () => {
|
||||
const tab = btn.dataset.tab;
|
||||
document.querySelectorAll(".code-tab-btn").forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
|
||||
btn.classList.add("bg-lime-700", "text-white");
|
||||
document.querySelectorAll(".tab-content.result pre").forEach((el) => el.classList.add("hidden"));
|
||||
document.getElementById(`${tab}-code`).parentElement.classList.remove("hidden");
|
||||
});
|
||||
});
|
||||
|
||||
// Handle copy to clipboard button clicks
|
||||
|
||||
async function copyToClipboard(text) {
|
||||
if (navigator.clipboard && navigator.clipboard.writeText) {
|
||||
return navigator.clipboard.writeText(text);
|
||||
} else {
|
||||
return fallbackCopyTextToClipboard(text);
|
||||
}
|
||||
}
|
||||
|
||||
function fallbackCopyTextToClipboard(text) {
|
||||
return new Promise((resolve, reject) => {
|
||||
const textArea = document.createElement("textarea");
|
||||
textArea.value = text;
|
||||
|
||||
// Avoid scrolling to bottom
|
||||
textArea.style.top = "0";
|
||||
textArea.style.left = "0";
|
||||
textArea.style.position = "fixed";
|
||||
|
||||
document.body.appendChild(textArea);
|
||||
textArea.focus();
|
||||
textArea.select();
|
||||
|
||||
try {
|
||||
const successful = document.execCommand("copy");
|
||||
if (successful) {
|
||||
resolve();
|
||||
} else {
|
||||
reject();
|
||||
}
|
||||
} catch (err) {
|
||||
reject(err);
|
||||
}
|
||||
|
||||
document.body.removeChild(textArea);
|
||||
});
|
||||
}
|
||||
|
||||
document.querySelectorAll(".copy-btn").forEach((btn) => {
|
||||
btn.addEventListener("click", () => {
|
||||
const target = btn.dataset.target;
|
||||
const code = document.getElementById(target).textContent;
|
||||
//navigator.clipboard.writeText(code).then(() => {
|
||||
copyToClipboard(code).then(() => {
|
||||
btn.textContent = "Copied!";
|
||||
setTimeout(() => {
|
||||
btn.textContent = "Copy";
|
||||
}, 2000);
|
||||
});
|
||||
});
|
||||
});
|
||||
|
||||
document.addEventListener("DOMContentLoaded", async () => {
|
||||
try {
|
||||
const extractionResponse = await fetch("/strategies/extraction");
|
||||
const extractionStrategies = await extractionResponse.json();
|
||||
|
||||
const chunkingResponse = await fetch("/strategies/chunking");
|
||||
const chunkingStrategies = await chunkingResponse.json();
|
||||
|
||||
renderStrategies("extraction-strategies", extractionStrategies);
|
||||
renderStrategies("chunking-strategies", chunkingStrategies);
|
||||
} catch (error) {
|
||||
console.error("Error fetching strategies:", error);
|
||||
}
|
||||
});
|
||||
|
||||
function renderStrategies(containerId, strategies) {
|
||||
const container = document.getElementById(containerId);
|
||||
container.innerHTML = ""; // Clear any existing content
|
||||
strategies = JSON.parse(strategies);
|
||||
Object.entries(strategies).forEach(([strategy, description]) => {
|
||||
const strategyElement = document.createElement("div");
|
||||
strategyElement.classList.add("bg-zinc-800", "p-4", "rounded", "shadow-md", "docs-item");
|
||||
|
||||
const strategyDescription = document.createElement("div");
|
||||
strategyDescription.classList.add("text-gray-300", "prose", "prose-sm");
|
||||
strategyDescription.innerHTML = marked.parse(description);
|
||||
|
||||
strategyElement.appendChild(strategyDescription);
|
||||
|
||||
container.appendChild(strategyElement);
|
||||
});
|
||||
}
|
||||
document.querySelectorAll(".sidebar a").forEach((link) => {
|
||||
link.addEventListener("click", function (event) {
|
||||
event.preventDefault();
|
||||
document.querySelectorAll(".content-section").forEach((section) => {
|
||||
section.classList.remove("active");
|
||||
});
|
||||
const target = event.target.getAttribute("data-target");
|
||||
document.getElementById(target).classList.add("active");
|
||||
});
|
||||
});
|
||||
// Highlight code syntax
|
||||
hljs.highlightAll();
|
||||
971
pages/index copy.html
Normal file
@ -0,0 +1,971 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="UTF-8" />
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>Crawl4AI</title>
|
||||
|
||||
<link rel="preconnect" href="https://fonts.googleapis.com" />
|
||||
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
|
||||
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@100..900&display=swap" rel="stylesheet" />
|
||||
|
||||
<!-- <link href="https://cdn.jsdelivr.net/npm/tailwindcss@3.4.3/dist/tailwind.min.css" rel="stylesheet" /> -->
|
||||
<script src="https://cdn.tailwindcss.com"></script>
|
||||
<script src="https://cdn.jsdelivr.net/npm/axios/dist/axios.min.js"></script>
|
||||
<link
|
||||
rel="stylesheet"
|
||||
href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/styles/monokai.min.css"
|
||||
/>
|
||||
<script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
|
||||
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/highlight.min.js"></script>
|
||||
<style>
|
||||
:root {
|
||||
--ifm-font-size-base: 100%;
|
||||
--ifm-line-height-base: 1.65;
|
||||
--ifm-font-family-base: system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans,
|
||||
sans-serif, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji",
|
||||
"Segoe UI Emoji", "Segoe UI Symbol";
|
||||
}
|
||||
html {
|
||||
-webkit-font-smoothing: antialiased;
|
||||
-webkit-text-size-adjust: 100%;
|
||||
text-size-adjust: 100%;
|
||||
font: var(--ifm-font-size-base) / var(--ifm-line-height-base) var(--ifm-font-family-base);
|
||||
}
|
||||
body {
|
||||
background-color: #1a202c;
|
||||
color: #fff;
|
||||
}
|
||||
.tab-content {
|
||||
max-height: 400px;
|
||||
overflow: auto;
|
||||
}
|
||||
pre {
|
||||
white-space: pre-wrap;
|
||||
font-size: 14px;
|
||||
}
|
||||
pre code {
|
||||
width: 100%;
|
||||
}
|
||||
</style>
|
||||
<style>
|
||||
/* Custom styling for docs-item class and Markdown generated elements */
|
||||
.docs-item {
|
||||
background-color: #2d3748; /* bg-gray-800 */
|
||||
padding: 1rem; /* p-4 */
|
||||
border-radius: 0.375rem; /* rounded */
|
||||
box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1); /* shadow-md */
|
||||
margin-bottom: 1rem; /* space between items */
|
||||
}
|
||||
|
||||
.docs-item h3,
|
||||
.docs-item h4 {
|
||||
color: #ffffff; /* text-white */
|
||||
font-size: 1.25rem; /* text-xl */
|
||||
font-weight: 700; /* font-bold */
|
||||
margin-bottom: 0.5rem; /* mb-2 */
|
||||
}
|
||||
|
||||
.docs-item p {
|
||||
color: #e2e8f0; /* text-gray-300 */
|
||||
margin-bottom: 0.5rem; /* mb-2 */
|
||||
}
|
||||
|
||||
.docs-item code {
|
||||
background-color: #1a202c; /* bg-gray-900 */
|
||||
color: #e2e8f0; /* text-gray-300 */
|
||||
padding: 0.25rem 0.5rem; /* px-2 py-1 */
|
||||
border-radius: 0.25rem; /* rounded */
|
||||
}
|
||||
|
||||
.docs-item pre {
|
||||
background-color: #1a202c; /* bg-gray-900 */
|
||||
color: #e2e8f0; /* text-gray-300 */
|
||||
padding: 0.5rem; /* p-2 */
|
||||
border-radius: 0.375rem; /* rounded */
|
||||
overflow: auto; /* overflow-auto */
|
||||
margin-bottom: 0.5rem; /* mb-2 */
|
||||
}
|
||||
|
||||
.docs-item div {
|
||||
color: #e2e8f0; /* text-gray-300 */
|
||||
font-size: 1rem; /* prose prose-sm */
|
||||
line-height: 1.25rem; /* line-height for readability */
|
||||
}
|
||||
|
||||
/* Adjustments to make prose class more suitable for dark mode */
|
||||
.prose {
|
||||
max-width: none; /* max-w-none */
|
||||
}
|
||||
|
||||
.prose p,
|
||||
.prose ul {
|
||||
margin-bottom: 1rem; /* mb-4 */
|
||||
}
|
||||
|
||||
.prose code {
|
||||
/* background-color: #4a5568; */ /* bg-gray-700 */
|
||||
color: #65a30d; /* text-white */
|
||||
padding: 0.25rem 0.5rem; /* px-1 py-0.5 */
|
||||
border-radius: 0.25rem; /* rounded */
|
||||
display: inline-block; /* inline-block */
|
||||
}
|
||||
|
||||
.prose pre {
|
||||
background-color: #1a202c; /* bg-gray-900 */
|
||||
color: #ffffff; /* text-white */
|
||||
padding: 0.5rem; /* p-2 */
|
||||
border-radius: 0.375rem; /* rounded */
|
||||
}
|
||||
|
||||
.prose h3 {
|
||||
color: #65a30d; /* text-white */
|
||||
font-size: 1.25rem; /* text-xl */
|
||||
font-weight: 700; /* font-bold */
|
||||
margin-bottom: 0.5rem; /* mb-2 */
|
||||
}
|
||||
</style>
|
||||
</head>
|
||||
<body class="bg-black text-gray-200">
|
||||
<header class="bg-zinc-950 text-white py-4 flex">
|
||||
<div class="mx-auto px-4">
|
||||
<h1 class="text-2xl font-bold">🔥🕷️ Crawl4AI: Web Data for your Thoughts</h1>
|
||||
</div>
|
||||
<div class="mx-auto px-4 flex font-bold text-xl gap-2">
|
||||
<span>📊 Total Website Processed</span>
|
||||
<span id="total-count" class="text-lime-400">2</span>
|
||||
</div>
|
||||
</header>
|
||||
|
||||
<section class="try-it py-8 px-16 pb-20">
|
||||
<div class="container mx-auto px-4">
|
||||
<h2 class="text-2xl font-bold mb-4">Try It Now</h2>
|
||||
<div class="grid grid-cols-1 lg:grid-cols-3 gap-4">
|
||||
<div class="space-y-4">
|
||||
<div class="flex flex-col">
|
||||
<label for="url-input" class="text-lime-500 font-bold text-xs">URL(s)</label>
|
||||
<input
|
||||
type="text"
|
||||
id="url-input"
|
||||
value="https://www.nbcnews.com/business"
|
||||
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
|
||||
placeholder="Enter URL(s) separated by commas"
|
||||
/>
|
||||
</div>
|
||||
<div class="flex flex-col">
|
||||
<label for="threshold" class="text-lime-500 font-bold text-xs">Min Words Threshold</label>
|
||||
<select
|
||||
id="threshold"
|
||||
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
|
||||
>
|
||||
<option value="5">5</option>
|
||||
<option value="10" selected>10</option>
|
||||
<option value="15">15</option>
|
||||
<option value="20">20</option>
|
||||
<option value="25">25</option>
|
||||
</select>
|
||||
</div>
|
||||
<div class="flex flex-col">
|
||||
<label for="css-selector" class="text-lime-500 font-bold text-xs">CSS Selector</label>
|
||||
<input
|
||||
type="text"
|
||||
id="css-selector"
|
||||
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
|
||||
placeholder="Enter CSS Selector"
|
||||
/>
|
||||
</div>
|
||||
<div class="flex flex-col">
|
||||
<label for="extraction-strategy-select" class="text-lime-500 font-bold text-xs"
|
||||
>Extraction Strategy</label
|
||||
>
|
||||
<select
|
||||
id="extraction-strategy-select"
|
||||
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-lime-500"
|
||||
>
|
||||
<option value="CosineStrategy">CosineStrategy</option>
|
||||
<option value="LLMExtractionStrategy">LLMExtractionStrategy</option>
|
||||
<option value="NoExtractionStrategy">NoExtractionStrategy</option>
|
||||
</select>
|
||||
</div>
|
||||
<div class="flex flex-col">
|
||||
<label for="chunking-strategy-select" class="text-lime-500 font-bold text-xs"
|
||||
>Chunking Strategy</label
|
||||
>
|
||||
<select
|
||||
id="chunking-strategy-select"
|
||||
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-lime-500"
|
||||
>
|
||||
<option value="RegexChunking">RegexChunking</option>
|
||||
<option value="NlpSentenceChunking">NlpSentenceChunking</option>
|
||||
<option value="TopicSegmentationChunking">TopicSegmentationChunking</option>
|
||||
<option value="FixedLengthWordChunking">FixedLengthWordChunking</option>
|
||||
<option value="SlidingWindowChunking">SlidingWindowChunking</option>
|
||||
</select>
|
||||
</div>
|
||||
<div class="flex flex-col">
|
||||
<label for="provider-model-select" class="text-lime-500 font-bold text-xs"
|
||||
>Provider Model</label
|
||||
>
|
||||
<select
|
||||
id="provider-model-select"
|
||||
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
|
||||
disabled
|
||||
>
|
||||
<option value="groq/llama3-70b-8192">groq/llama3-70b-8192</option>
|
||||
<option value="groq/llama3-8b-8192">groq/llama3-8b-8192</option>
|
||||
<option value="openai/gpt-4-turbo">gpt-4-turbo</option>
|
||||
<option value="openai/gpt-3.5-turbo">gpt-3.5-turbo</option>
|
||||
<option value="anthropic/claude-3-haiku-20240307">claude-3-haiku</option>
|
||||
<option value="anthropic/claude-3-opus-20240229">claude-3-opus</option>
|
||||
<option value="anthropic/claude-3-sonnet-20240229">claude-3-sonnet</option>
|
||||
</select>
|
||||
</div>
|
||||
<div class="flex flex-col">
|
||||
<label for="token-input" class="text-lime-500 font-bold text-xs">API Token</label>
|
||||
<input
|
||||
type="password"
|
||||
id="token-input"
|
||||
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
|
||||
placeholder="Enter Groq API token"
|
||||
disabled
|
||||
/>
|
||||
</div>
|
||||
<div class="flex gap-3">
|
||||
<div class="flex items-center gap-2">
|
||||
<input type="checkbox" id="bypass-cache-checkbox" />
|
||||
<label for="bypass-cache-checkbox" class="text-lime-500 font-bold">Bypass Cache</label>
|
||||
</div>
|
||||
<div class="flex items-center gap-2">
|
||||
<input type="checkbox" id="extract-blocks-checkbox" checked />
|
||||
<label for="extract-blocks-checkbox" class="text-lime-500 font-bold"
|
||||
>Extract Blocks</label
|
||||
>
|
||||
</div>
|
||||
<button id="crawl-btn" class="bg-lime-600 text-black font-bold px-4 py-0 rounded">
|
||||
Crawl
|
||||
</button>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div id="result" class=" ">
|
||||
<div id="loading" class="hidden">
|
||||
<p class="text-white">Loading... Please wait.</p>
|
||||
</div>
|
||||
<div class="tab-buttons flex gap-2">
|
||||
<button
|
||||
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
|
||||
data-tab="json"
|
||||
>
|
||||
JSON
|
||||
</button>
|
||||
<button
|
||||
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
|
||||
data-tab="cleaned-html"
|
||||
>
|
||||
Cleaned HTML
|
||||
</button>
|
||||
<button
|
||||
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
|
||||
data-tab="markdown"
|
||||
>
|
||||
Markdown
|
||||
</button>
|
||||
</div>
|
||||
<div class="tab-content code bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
|
||||
<pre class="h-full flex"><code id="json-result" class="language-json"></code></pre>
|
||||
<pre
|
||||
class="hidden h-full flex"
|
||||
><code id="cleaned-html-result" class="language-html"></code></pre>
|
||||
<pre
|
||||
class="hidden h-full flex"
|
||||
><code id="markdown-result" class="language-markdown"></code></pre>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div id="code_help" class=" ">
|
||||
<div class="tab-buttons flex gap-2">
|
||||
<button
|
||||
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
|
||||
data-tab="curl"
|
||||
>
|
||||
cURL
|
||||
</button>
|
||||
<button
|
||||
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
|
||||
data-tab="library"
|
||||
>
|
||||
Python Library
|
||||
</button>
|
||||
<button
|
||||
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
|
||||
data-tab="python"
|
||||
>
|
||||
Python (Request)
|
||||
</button>
|
||||
<button
|
||||
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
|
||||
data-tab="nodejs"
|
||||
>
|
||||
Node.js
|
||||
</button>
|
||||
</div>
|
||||
<div class="tab-content result bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
|
||||
<pre class="h-full flex relative">
|
||||
<code id="curl-code" class="language-bash"></code>
|
||||
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="curl-code">Copy</button>
|
||||
</pre>
|
||||
<pre class="hidden h-full flex relative">
|
||||
<code id="python-code" class="language-python"></code>
|
||||
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="python-code">Copy</button>
|
||||
</pre>
|
||||
<pre class="hidden h-full flex relative">
|
||||
<code id="nodejs-code" class="language-javascript"></code>
|
||||
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="nodejs-code">Copy</button>
|
||||
</pre>
|
||||
<pre class="hidden h-full flex relative">
|
||||
<code id="library-code" class="language-python"></code>
|
||||
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="library-code">Copy</button>
|
||||
</pre>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
<section class="bg-zinc-900 text-zinc-300 p-6 px-20">
|
||||
<div class="grid grid-cols-2 gap-4 p-4 bg-zinc-900 text-lime-500">
|
||||
<!-- Step 1 -->
|
||||
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
|
||||
🌟 <strong>Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun!</strong>
|
||||
</div>
|
||||
<div class="bg-zinc-800 p-2 rounded">
|
||||
First Step: Create an instance of WebCrawler and call the <code>warmup()</code> function.
|
||||
</div>
|
||||
<div>
|
||||
<pre><code class="language-python">crawler = WebCrawler()
|
||||
crawler.warmup()</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 2 -->
|
||||
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
|
||||
🧠 <strong>Understanding 'bypass_cache' and 'include_raw_html' parameters:</strong>
|
||||
</div>
|
||||
<div class="bg-zinc-800 p-2 rounded">First crawl (caches the result):</div>
|
||||
<div>
|
||||
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business")</code></pre>
|
||||
</div>
|
||||
<div class="bg-zinc-800 p-2 rounded">Second crawl (Force to crawl again):</div>
|
||||
<div>
|
||||
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)</code></pre>
|
||||
</div>
|
||||
<div class="bg-zinc-800 p-2 rounded">Crawl result without raw HTML content:</div>
|
||||
<div>
|
||||
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 3 -->
|
||||
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
|
||||
📄
|
||||
<strong
|
||||
>The 'include_raw_html' parameter, when set to True, includes the raw HTML content in the
|
||||
response. By default, it is set to True.</strong
|
||||
>
|
||||
</div>
|
||||
<div class="bg-zinc-800 p-2 rounded">Set <code>always_by_pass_cache</code> to True:</div>
|
||||
<div>
|
||||
<pre><code class="language-python">crawler.always_by_pass_cache = True</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 4 -->
|
||||
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
|
||||
🧩 <strong>Let's add a chunking strategy: RegexChunking!</strong>
|
||||
</div>
|
||||
<div class="bg-zinc-800 p-2 rounded">Using RegexChunking:</div>
|
||||
<div>
|
||||
<pre><code class="language-python">result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
chunking_strategy=RegexChunking(patterns=["\n\n"])
|
||||
)</code></pre>
|
||||
</div>
|
||||
<div class="bg-zinc-800 p-2 rounded">Using NlpSentenceChunking:</div>
|
||||
<div>
|
||||
<pre><code class="language-python">result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
chunking_strategy=NlpSentenceChunking()
|
||||
)</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 5 -->
|
||||
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
|
||||
🧠 <strong>Let's get smarter with an extraction strategy: CosineStrategy!</strong>
|
||||
</div>
|
||||
<div class="bg-zinc-800 p-2 rounded">Using CosineStrategy:</div>
|
||||
<div>
|
||||
<pre><code class="language-python">result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
|
||||
)</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 6 -->
|
||||
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
|
||||
🤖 <strong>Time to bring in the big guns: LLMExtractionStrategy without instructions!</strong>
|
||||
</div>
|
||||
<div class="bg-zinc-800 p-2 rounded">Using LLMExtractionStrategy without instructions:</div>
|
||||
<div>
|
||||
<pre><code class="language-python">result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
|
||||
)</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 7 -->
|
||||
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
|
||||
📜 <strong>Let's make it even more interesting: LLMExtractionStrategy with instructions!</strong>
|
||||
</div>
|
||||
<div class="bg-zinc-800 p-2 rounded">Using LLMExtractionStrategy with instructions:</div>
|
||||
<div>
|
||||
<pre><code class="language-python">result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o",
|
||||
api_token=os.getenv('OPENAI_API_KEY'),
|
||||
instruction="I am interested in only financial news"
|
||||
)
|
||||
)</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 8 -->
|
||||
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
|
||||
🎯 <strong>Targeted extraction: Let's use a CSS selector to extract only H2 tags!</strong>
|
||||
</div>
|
||||
<div class="bg-zinc-800 p-2 rounded">Using CSS selector to extract H2 tags:</div>
|
||||
<div>
|
||||
<pre><code class="language-python">result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
css_selector="h2"
|
||||
)</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 9 -->
|
||||
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
|
||||
🖱️ <strong>Let's get interactive: Passing JavaScript code to click 'Load More' button!</strong>
|
||||
</div>
|
||||
<div class="bg-zinc-800 p-2 rounded">Using JavaScript to click 'Load More' button:</div>
|
||||
<div>
|
||||
<pre><code class="language-python">js_code = """
|
||||
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
|
||||
loadMoreButton && loadMoreButton.click();
|
||||
"""
|
||||
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
|
||||
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
|
||||
result = crawler.run(url="https://www.nbcnews.com/business")</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Conclusion -->
|
||||
<div class="col-span-2 bg-yellow-500 p-2 rounded text-zinc-900">
|
||||
🎉
|
||||
<strong
|
||||
>Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl
|
||||
the web like a pro! 🕸️</strong
|
||||
>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
<section class="bg-zinc-900 text-zinc-300 p-6 px-20">
|
||||
<h1 class="text-3xl font-bold mb-4">Installation 💻</h1>
|
||||
<p class="mb-4">
|
||||
There are two ways to use Crawl4AI: as a library in your Python projects or as a standalone local
|
||||
server.
|
||||
</p>
|
||||
|
||||
<p class="mb-4">
|
||||
You can also try Crawl4AI in a Google Colab notebook
|
||||
<a href="https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk"
|
||||
><img
|
||||
src="https://colab.research.google.com/assets/colab-badge.svg"
|
||||
alt="Open In Colab"
|
||||
style="display: inline-block; width: 100px; height: 20px"
|
||||
/></a>
|
||||
</p>
|
||||
|
||||
<h2 class="text-2xl font-bold mb-2">Using Crawl4AI as a Library 📚</h2>
|
||||
<p class="mb-4">To install Crawl4AI as a library, follow these steps:</p>
|
||||
|
||||
<ol class="list-decimal list-inside mb-4">
|
||||
<li class="mb-2">
|
||||
Install the package from GitHub:
|
||||
<pre
|
||||
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
|
||||
><code>pip install git+https://github.com/unclecode/crawl4ai.git</code></pre>
|
||||
</li>
|
||||
<li class="mb-2">
|
||||
Alternatively, you can clone the repository and install the package locally:
|
||||
<pre
|
||||
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
|
||||
><code class = "language-python bash">virtualenv venv
|
||||
source venv/bin/activate
|
||||
git clone https://github.com/unclecode/crawl4ai.git
|
||||
cd crawl4ai
|
||||
pip install -e .
|
||||
</code></pre>
|
||||
</li>
|
||||
<li>
|
||||
Import the necessary modules in your Python script:
|
||||
<pre
|
||||
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
|
||||
><code class = "language-python hljs">from crawl4ai.web_crawler import WebCrawler
|
||||
from crawl4ai.chunking_strategy import *
|
||||
from crawl4ai.extraction_strategy import *
|
||||
import os
|
||||
|
||||
crawler = WebCrawler()
|
||||
|
||||
# Single page crawl
|
||||
result = crawler.run(
|
||||
url='https://www.nbcnews.com/business',
|
||||
word_count_threshold=5, # Minimum word count for an HTML tag to be considered a worthy block
|
||||
chunking_strategy=RegexChunking(patterns=["\\n\\n"]), # Default is RegexChunking
|
||||
extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3), # Default is CosineStrategy
|
||||
# extraction_strategy= LLMExtractionStrategy(provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY')),
|
||||
bypass_cache=False,
|
||||
extract_blocks=True, # Whether to extract semantic blocks of text from the HTML
|
||||
css_selector = "", # Eg: "div.article-body"
|
||||
verbose=True,
|
||||
include_raw_html=True, # Whether to include the raw HTML content in the response
|
||||
)
|
||||
print(result.model_dump())
|
||||
</code></pre>
|
||||
</li>
|
||||
</ol>
|
||||
<p class="mb-4">
|
||||
For more information about how to run Crawl4AI as a local server, please refer to the
|
||||
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
|
||||
</p>
|
||||
|
||||
</section>
|
||||
|
||||
<section class="bg-zinc-900 text-zinc-300 p-6 px-20">
|
||||
<h1 class="text-3xl font-bold mb-4">📖 Parameters</h1>
|
||||
<div class="overflow-x-auto">
|
||||
<table class="min-w-full bg-zinc-800 border border-zinc-700">
|
||||
<thead>
|
||||
<tr>
|
||||
<th class="py-2 px-4 border-b border-zinc-700">Parameter</th>
|
||||
<th class="py-2 px-4 border-b border-zinc-700">Description</th>
|
||||
<th class="py-2 px-4 border-b border-zinc-700">Required</th>
|
||||
<th class="py-2 px-4 border-b border-zinc-700">Default Value</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">urls</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">
|
||||
A list of URLs to crawl and extract data from.
|
||||
</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">Yes</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">include_raw_html</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">
|
||||
Whether to include the raw HTML content in the response.
|
||||
</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">No</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">false</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">bypass_cache</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">
|
||||
Whether to force a fresh crawl even if the URL has been previously crawled.
|
||||
</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">No</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">false</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">extract_blocks</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">
|
||||
Whether to extract semantic blocks of text from the HTML.
|
||||
</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">No</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">true</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">word_count_threshold</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">
|
||||
The minimum number of words a block must contain to be considered meaningful (minimum
|
||||
value is 5).
|
||||
</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">No</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">5</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">extraction_strategy</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">
|
||||
The strategy to use for extracting content from the HTML (e.g., "CosineStrategy").
|
||||
</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">No</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">CosineStrategy</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">chunking_strategy</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">
|
||||
The strategy to use for chunking the text before processing (e.g., "RegexChunking").
|
||||
</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">No</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">RegexChunking</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">css_selector</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">
|
||||
The CSS selector to target specific parts of the HTML for extraction.
|
||||
</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">No</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">None</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="py-2 px-4">verbose</td>
|
||||
<td class="py-2 px-4">Whether to enable verbose logging.</td>
|
||||
<td class="py-2 px-4">No</td>
|
||||
<td class="py-2 px-4">true</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
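<p class="mb-4">
  Because every parameter except <code>urls</code> has a default, a bare call only needs a URL. Note that
  <code>urls</code> (a list) applies to the REST endpoint, while the library's <code>run()</code> takes a single
  <code>url</code>. A minimal sketch relying on the defaults listed above, then overriding a few of them:
</p>
<pre
  class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class="language-python hljs">from crawl4ai.web_crawler import WebCrawler

crawler = WebCrawler()
crawler.warmup()

# Defaults from the table: CosineStrategy, RegexChunking, word_count_threshold=5, verbose=True, ...
result = crawler.run(url="https://www.nbcnews.com/business")

# Overriding a few of the documented parameters
result = crawler.run(
    url="https://www.nbcnews.com/business",
    word_count_threshold=10,
    css_selector="h2",
    bypass_cache=True,
    include_raw_html=False,
)
print(result)
</code></pre>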
|
||||
</section>
|
||||
|
||||
<section id="extraction" class="py-8 px-20">
|
||||
<div class="overflow-x-auto mx-auto px-6">
|
||||
<h2 class="text-2xl font-bold mb-4">Extraction Strategies</h2>
|
||||
<div id="extraction-strategies" class="space-y-4"></div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section id="chunking" class="py-8 px-20">
|
||||
<div class="overflow-x-auto mx-auto px-6">
|
||||
<h2 class="text-2xl font-bold mb-4">Chunking Strategies</h2>
|
||||
<div id="chunking-strategies" class="space-y-4"></div>
|
||||
</div>
|
||||
</section>
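<p class="mb-4">
  The two lists above are filled in dynamically from the server's <code>/strategies/extraction</code> and
  <code>/strategies/chunking</code> endpoints (see the script at the bottom of this page). As a rough sketch of how a
  chunking strategy and an extraction strategy combine in library code, using class names from the form options on
  this page:
</p>
<pre
  class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class="language-python hljs">from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import NlpSentenceChunking
from crawl4ai.extraction_strategy import CosineStrategy

crawler = WebCrawler()
crawler.warmup()

# Split the page into sentence-level chunks, then cluster them by cosine similarity
result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=NlpSentenceChunking(),
    extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3),
    bypass_cache=True,  # try a different strategy on an already-crawled URL
)
print(result)
</code></pre>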
|
||||
|
||||
<section class="hero bg-zinc-900 py-8 px-20">
|
||||
<div class="container mx-auto px-4">
|
||||
<h2 class="text-3xl font-bold mb-4">🤔 Why building this?</h2>
|
||||
<p class="text-lg mb-4">
|
||||
In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging
|
||||
for services that should rightfully be accessible to everyone. 🌍💸 One such example is scraping and
|
||||
crawling web pages and transforming them into a format suitable for Large Language Models (LLMs).
|
||||
🕸️🤖 We believe that building a business around this is not the right approach; instead, it should
|
||||
definitely be open-source. 🆓🌟 So, if you possess the skills to build such tools and share our
|
||||
philosophy, we invite you to join our "Robinhood" band and help set these products free for the
|
||||
benefit of all. 🤝💪
|
||||
</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section class="installation py-8 px-20">
|
||||
<div class="container mx-auto px-4">
|
||||
<h2 class="text-2xl font-bold mb-4">⚙️ Installation</h2>
|
||||
<p class="mb-4">
|
||||
To install and run Crawl4AI as a library or a local server, please refer to the 📚
|
||||
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
|
||||
</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<footer class="bg-zinc-900 text-white py-4">
|
||||
<div class="container mx-auto px-4">
|
||||
<div class="flex justify-between items-center">
|
||||
<p>© 2024 Crawl4AI. All rights reserved.</p>
|
||||
<div class="social-links">
|
||||
<a
|
||||
href="https://github.com/unclecode/crawl4ai"
|
||||
class="text-white hover:text-gray-300 mx-2"
|
||||
target="_blank"
|
||||
>😺 GitHub</a
|
||||
>
|
||||
<a
|
||||
href="https://twitter.com/unclecode"
|
||||
class="text-white hover:text-gray-300 mx-2"
|
||||
target="_blank"
|
||||
>🐦 Twitter</a
|
||||
>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</footer>
|
||||
|
||||
<script>
|
||||
// JavaScript to manage dynamic form changes and logic
|
||||
document.getElementById("extraction-strategy-select").addEventListener("change", function () {
|
||||
const strategy = this.value;
|
||||
const providerModelSelect = document.getElementById("provider-model-select");
|
||||
const tokenInput = document.getElementById("token-input");
|
||||
|
||||
if (strategy === "LLMExtractionStrategy") {
|
||||
providerModelSelect.disabled = false;
|
||||
tokenInput.disabled = false;
|
||||
} else {
|
||||
providerModelSelect.disabled = true;
|
||||
tokenInput.disabled = true;
|
||||
}
|
||||
});
|
||||
|
||||
// Get the selected provider model and token from local storage
|
||||
const storedProviderModel = localStorage.getItem("provider_model");
|
||||
const storedToken = localStorage.getItem(storedProviderModel);
|
||||
|
||||
if (storedProviderModel) {
|
||||
document.getElementById("provider-model-select").value = storedProviderModel;
|
||||
}
|
||||
|
||||
if (storedToken) {
|
||||
document.getElementById("token-input").value = storedToken;
|
||||
}
|
||||
|
||||
// Handle provider model dropdown change
|
||||
document.getElementById("provider-model-select").addEventListener("change", () => {
|
||||
const selectedProviderModel = document.getElementById("provider-model-select").value;
|
||||
const storedToken = localStorage.getItem(selectedProviderModel);
|
||||
|
||||
if (storedToken) {
|
||||
document.getElementById("token-input").value = storedToken;
|
||||
} else {
|
||||
document.getElementById("token-input").value = "";
|
||||
}
|
||||
});
|
||||
|
||||
// Fetch total count from the database
|
||||
axios
|
||||
.get("/total-count")
|
||||
.then((response) => {
|
||||
document.getElementById("total-count").textContent = response.data.count;
|
||||
})
|
||||
.catch((error) => console.error(error));
|
||||
|
||||
// Handle crawl button click
|
||||
document.getElementById("crawl-btn").addEventListener("click", () => {
|
||||
// validate input to have both URL and API token
|
||||
if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) {
|
||||
alert("Please enter both URL(s) and API token.");
|
||||
return;
|
||||
}
|
||||
|
||||
const selectedProviderModel = document.getElementById("provider-model-select").value;
|
||||
const apiToken = document.getElementById("token-input").value;
|
||||
const extractBlocks = document.getElementById("extract-blocks-checkbox").checked;
|
||||
const bypassCache = document.getElementById("bypass-cache-checkbox").checked;
|
||||
|
||||
// Save the selected provider model and token to local storage
|
||||
localStorage.setItem("provider_model", selectedProviderModel);
|
||||
localStorage.setItem(selectedProviderModel, apiToken);
|
||||
|
||||
const urlsInput = document.getElementById("url-input").value;
|
||||
const urls = urlsInput.split(",").map((url) => url.trim());
|
||||
const data = {
|
||||
urls: urls,
|
||||
provider_model: selectedProviderModel,
|
||||
api_token: apiToken,
|
||||
include_raw_html: true,
|
||||
bypass_cache: bypassCache,
|
||||
extract_blocks: extractBlocks,
|
||||
word_count_threshold: parseInt(document.getElementById("threshold").value),
|
||||
extraction_strategy: document.getElementById("extraction-strategy-select").value,
|
||||
chunking_strategy: document.getElementById("chunking-strategy-select").value,
|
||||
css_selector: document.getElementById("css-selector").value,
|
||||
verbose: true,
|
||||
};
|
||||
|
||||
// save api token to local storage
|
||||
localStorage.setItem("api_token", document.getElementById("token-input").value);
|
||||
|
||||
document.getElementById("loading").classList.remove("hidden");
|
||||
//document.getElementById("result").classList.add("hidden");
|
||||
//document.getElementById("code_help").classList.add("hidden");
|
||||
|
||||
axios
|
||||
.post("/crawl", data)
|
||||
.then((response) => {
|
||||
const result = response.data.results[0];
|
||||
const parsedJson = JSON.parse(result.extracted_content);
|
||||
document.getElementById("json-result").textContent = JSON.stringify(parsedJson, null, 2);
|
||||
document.getElementById("cleaned-html-result").textContent = result.cleaned_html;
|
||||
document.getElementById("markdown-result").textContent = result.markdown;
|
||||
|
||||
// Update code examples dynamically
|
||||
const extractionStrategy = data.extraction_strategy;
|
||||
const isLLMExtraction = extractionStrategy === "LLMExtractionStrategy";
|
||||
|
||||
document.getElementById(
|
||||
"curl-code"
|
||||
).textContent = `curl -X POST -H "Content-Type: application/json" -d '${JSON.stringify({
|
||||
...data,
|
||||
api_token: isLLMExtraction ? "your_api_token" : undefined,
|
||||
})}' http://crawl4ai.uccode.io/crawl`;
|
||||
|
||||
document.getElementById(
|
||||
"python-code"
|
||||
).textContent = `import requests\n\ndata = ${JSON.stringify(
|
||||
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
|
||||
null,
|
||||
2
|
||||
)}\n\nresponse = requests.post("http://crawl4ai.uccode.io/crawl", json=data) # or localhost if you run locally \nprint(response.json())`;
|
||||
|
||||
document.getElementById(
|
||||
"nodejs-code"
|
||||
).textContent = `const axios = require('axios');\n\nconst data = ${JSON.stringify(
|
||||
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
|
||||
null,
|
||||
2
|
||||
)};\n\naxios.post("http://crawl4ai.uccode.io/crawl", data) // or localhost if you run locally \n .then(response => console.log(response.data))\n .catch(error => console.error(error));`;
|
||||
|
||||
document.getElementById(
|
||||
"library-code"
|
||||
).textContent = `from crawl4ai.web_crawler import WebCrawler\nfrom crawl4ai.extraction_strategy import *\nfrom crawl4ai.chunking_strategy import *\n\ncrawler = WebCrawler()\ncrawler.warmup()\n\nresult = crawler.run(\n url='${
|
||||
urls[0]
|
||||
}',\n word_count_threshold=${data.word_count_threshold},\n extraction_strategy=${
|
||||
isLLMExtraction
|
||||
? `${extractionStrategy}(provider="${data.provider_model}", api_token="${data.api_token}")`
|
||||
: extractionStrategy + "()"
|
||||
},\n chunking_strategy=${data.chunking_strategy}(),\n bypass_cache=${
|
||||
data.bypass_cache
|
||||
},\n css_selector="${data.css_selector}"\n)\nprint(result)`;
|
||||
|
||||
// Highlight code syntax
|
||||
hljs.highlightAll();
|
||||
|
||||
// Select JSON tab by default
|
||||
document.querySelector('.tab-btn[data-tab="json"]').click();
|
||||
|
||||
document.getElementById("loading").classList.add("hidden");
|
||||
document.getElementById("result").classList.remove("hidden");
|
||||
document.getElementById("code_help").classList.remove("hidden");
|
||||
|
||||
// increment the total count
|
||||
document.getElementById("total-count").textContent =
|
||||
parseInt(document.getElementById("total-count").textContent) + 1;
|
||||
})
|
||||
.catch((error) => {
|
||||
console.error(error);
|
||||
document.getElementById("loading").classList.add("hidden");
|
||||
});
|
||||
});
|
||||
|
||||
// Handle tab clicks
|
||||
document.querySelectorAll(".tab-btn").forEach((btn) => {
|
||||
btn.addEventListener("click", () => {
|
||||
const tab = btn.dataset.tab;
|
||||
document
|
||||
.querySelectorAll(".tab-btn")
|
||||
.forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
|
||||
btn.classList.add("bg-lime-700", "text-white");
|
||||
document.querySelectorAll(".tab-content.code pre").forEach((el) => el.classList.add("hidden"));
|
||||
document.getElementById(`${tab}-result`).parentElement.classList.remove("hidden");
|
||||
});
|
||||
});
|
||||
|
||||
// Handle code tab clicks
|
||||
document.querySelectorAll(".code-tab-btn").forEach((btn) => {
|
||||
btn.addEventListener("click", () => {
|
||||
const tab = btn.dataset.tab;
|
||||
document
|
||||
.querySelectorAll(".code-tab-btn")
|
||||
.forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
|
||||
btn.classList.add("bg-lime-700", "text-white");
|
||||
document.querySelectorAll(".tab-content.result pre").forEach((el) => el.classList.add("hidden"));
|
||||
document.getElementById(`${tab}-code`).parentElement.classList.remove("hidden");
|
||||
});
|
||||
});
|
||||
|
||||
// Handle copy to clipboard button clicks
|
||||
|
||||
async function copyToClipboard(text) {
|
||||
if (navigator.clipboard && navigator.clipboard.writeText) {
|
||||
return navigator.clipboard.writeText(text);
|
||||
} else {
|
||||
return fallbackCopyTextToClipboard(text);
|
||||
}
|
||||
}
|
||||
|
||||
function fallbackCopyTextToClipboard(text) {
|
||||
return new Promise((resolve, reject) => {
|
||||
const textArea = document.createElement("textarea");
|
||||
textArea.value = text;
|
||||
|
||||
// Avoid scrolling to bottom
|
||||
textArea.style.top = "0";
|
||||
textArea.style.left = "0";
|
||||
textArea.style.position = "fixed";
|
||||
|
||||
document.body.appendChild(textArea);
|
||||
textArea.focus();
|
||||
textArea.select();
|
||||
|
||||
try {
|
||||
const successful = document.execCommand("copy");
|
||||
if (successful) {
|
||||
resolve();
|
||||
} else {
|
||||
reject();
|
||||
}
|
||||
} catch (err) {
|
||||
reject(err);
|
||||
}
|
||||
|
||||
document.body.removeChild(textArea);
|
||||
});
|
||||
}
|
||||
|
||||
document.querySelectorAll(".copy-btn").forEach((btn) => {
|
||||
btn.addEventListener("click", () => {
|
||||
const target = btn.dataset.target;
|
||||
const code = document.getElementById(target).textContent;
|
||||
//navigator.clipboard.writeText(code).then(() => {
|
||||
copyToClipboard(code).then(() => {
|
||||
btn.textContent = "Copied!";
|
||||
setTimeout(() => {
|
||||
btn.textContent = "Copy";
|
||||
}, 2000);
|
||||
});
|
||||
});
|
||||
});
|
||||
|
||||
document.addEventListener("DOMContentLoaded", async () => {
|
||||
try {
|
||||
const extractionResponse = await fetch("/strategies/extraction");
|
||||
const extractionStrategies = await extractionResponse.json();
|
||||
|
||||
const chunkingResponse = await fetch("/strategies/chunking");
|
||||
const chunkingStrategies = await chunkingResponse.json();
|
||||
|
||||
renderStrategies("extraction-strategies", extractionStrategies);
|
||||
renderStrategies("chunking-strategies", chunkingStrategies);
|
||||
} catch (error) {
|
||||
console.error("Error fetching strategies:", error);
|
||||
}
|
||||
});
|
||||
|
||||
function renderStrategies(containerId, strategies) {
|
||||
const container = document.getElementById(containerId);
|
||||
container.innerHTML = ""; // Clear any existing content
|
||||
strategies = JSON.parse(strategies);
|
||||
Object.entries(strategies).forEach(([strategy, description]) => {
|
||||
const strategyElement = document.createElement("div");
|
||||
strategyElement.classList.add("bg-zinc-800", "p-4", "rounded", "shadow-md", "docs-item");
|
||||
|
||||
const strategyDescription = document.createElement("div");
|
||||
strategyDescription.classList.add("text-gray-300", "prose", "prose-sm");
|
||||
strategyDescription.innerHTML = marked.parse(description);
|
||||
|
||||
strategyElement.appendChild(strategyDescription);
|
||||
|
||||
container.appendChild(strategyElement);
|
||||
});
|
||||
}
|
||||
|
||||
// Highlight code syntax
|
||||
hljs.highlightAll();
|
||||
</script>
|
||||
</body>
|
||||
</html>
|
||||
802 pages/index.html
@@ -12,6 +12,7 @@
|
||||
<!-- <link href="https://cdn.jsdelivr.net/npm/tailwindcss@3.4.3/dist/tailwind.min.css" rel="stylesheet" /> -->
|
||||
<script src="https://cdn.tailwindcss.com"></script>
|
||||
<script src="https://cdn.jsdelivr.net/npm/axios/dist/axios.min.js"></script>
|
||||
<link rel="stylesheet" href="/pages/app.css" />
|
||||
<link
|
||||
rel="stylesheet"
|
||||
href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/styles/monokai.min.css"
|
||||
@@ -19,116 +20,10 @@
|
||||
<script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
|
||||
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/highlight.min.js"></script>
|
||||
<style>
|
||||
:root {
|
||||
--ifm-font-size-base: 100%;
|
||||
--ifm-line-height-base: 1.65;
|
||||
--ifm-font-family-base: system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans,
|
||||
sans-serif, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji",
|
||||
"Segoe UI Emoji", "Segoe UI Symbol";
|
||||
}
|
||||
html {
|
||||
-webkit-font-smoothing: antialiased;
|
||||
-webkit-text-size-adjust: 100%;
|
||||
text-size-adjust: 100%;
|
||||
font: var(--ifm-font-size-base) / var(--ifm-line-height-base) var(--ifm-font-family-base);
|
||||
}
|
||||
body {
|
||||
background-color: #1a202c;
|
||||
color: #fff;
|
||||
}
|
||||
.tab-content {
|
||||
max-height: 400px;
|
||||
overflow: auto;
|
||||
}
|
||||
pre {
|
||||
white-space: pre-wrap;
|
||||
font-size: 14px;
|
||||
}
|
||||
pre code {
|
||||
width: 100%;
|
||||
}
|
||||
</style>
|
||||
<style>
|
||||
/* Custom styling for docs-item class and Markdown generated elements */
|
||||
.docs-item {
|
||||
background-color: #2d3748; /* bg-gray-800 */
|
||||
padding: 1rem; /* p-4 */
|
||||
border-radius: 0.375rem; /* rounded */
|
||||
box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1); /* shadow-md */
|
||||
margin-bottom: 1rem; /* space between items */
|
||||
}
|
||||
|
||||
.docs-item h3,
|
||||
.docs-item h4 {
|
||||
color: #ffffff; /* text-white */
|
||||
font-size: 1.25rem; /* text-xl */
|
||||
font-weight: 700; /* font-bold */
|
||||
margin-bottom: 0.5rem; /* mb-2 */
|
||||
}
|
||||
|
||||
.docs-item p {
|
||||
color: #e2e8f0; /* text-gray-300 */
|
||||
margin-bottom: 0.5rem; /* mb-2 */
|
||||
}
|
||||
|
||||
.docs-item code {
|
||||
background-color: #1a202c; /* bg-gray-900 */
|
||||
color: #e2e8f0; /* text-gray-300 */
|
||||
padding: 0.25rem 0.5rem; /* px-2 py-1 */
|
||||
border-radius: 0.25rem; /* rounded */
|
||||
}
|
||||
|
||||
.docs-item pre {
|
||||
background-color: #1a202c; /* bg-gray-900 */
|
||||
color: #e2e8f0; /* text-gray-300 */
|
||||
padding: 0.5rem; /* p-2 */
|
||||
border-radius: 0.375rem; /* rounded */
|
||||
overflow: auto; /* overflow-auto */
|
||||
margin-bottom: 0.5rem; /* mb-2 */
|
||||
}
|
||||
|
||||
.docs-item div {
|
||||
color: #e2e8f0; /* text-gray-300 */
|
||||
font-size: 1rem; /* prose prose-sm */
|
||||
line-height: 1.25rem; /* line-height for readability */
|
||||
}
|
||||
|
||||
/* Adjustments to make prose class more suitable for dark mode */
|
||||
.prose {
|
||||
max-width: none; /* max-w-none */
|
||||
}
|
||||
|
||||
.prose p,
|
||||
.prose ul {
|
||||
margin-bottom: 1rem; /* mb-4 */
|
||||
}
|
||||
|
||||
.prose code {
|
||||
/* background-color: #4a5568; */ /* bg-gray-700 */
|
||||
color: #65a30d; /* text-white */
|
||||
padding: 0.25rem 0.5rem; /* px-1 py-0.5 */
|
||||
border-radius: 0.25rem; /* rounded */
|
||||
display: inline-block; /* inline-block */
|
||||
}
|
||||
|
||||
.prose pre {
|
||||
background-color: #1a202c; /* bg-gray-900 */
|
||||
color: #ffffff; /* text-white */
|
||||
padding: 0.5rem; /* p-2 */
|
||||
border-radius: 0.375rem; /* rounded */
|
||||
}
|
||||
|
||||
.prose h3 {
|
||||
color: #65a30d; /* text-white */
|
||||
font-size: 1.25rem; /* text-xl */
|
||||
font-weight: 700; /* font-bold */
|
||||
margin-bottom: 0.5rem; /* mb-2 */
|
||||
}
|
||||
</style>
|
||||
</head>
|
||||
<body class="bg-black text-gray-200">
|
||||
<header class="bg-zinc-950 text-white py-4 flex">
|
||||
<header class="bg-zinc-950 text-lime-500 py-4 flex">
|
||||
|
||||
<div class="mx-auto px-4">
|
||||
<h1 class="text-2xl font-bold">🔥🕷️ Crawl4AI: Web Data for your Thoughts</h1>
|
||||
</div>
|
||||
@@ -137,675 +32,42 @@
|
||||
<span id="total-count" class="text-lime-400">2</span>
|
||||
</div>
|
||||
</header>
|
||||
|
||||
{{ try_it | safe }}
|
||||
|
||||
<section class="try-it py-8 px-16 pb-20">
|
||||
<div class="container mx-auto px-4">
|
||||
<h2 class="text-2xl font-bold mb-4">Try It Now</h2>
|
||||
<div class="grid grid-cols-1 lg:grid-cols-3 gap-4">
|
||||
<div class="space-y-4">
|
||||
<div class="flex flex-col">
|
||||
<label for="url-input" class="text-lime-500 font-bold text-xs">URL(s)</label>
|
||||
<input
|
||||
type="text"
|
||||
id="url-input"
|
||||
value="https://www.nbcnews.com/business"
|
||||
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
|
||||
placeholder="Enter URL(s) separated by commas"
|
||||
/>
|
||||
</div>
|
||||
<div class="flex flex-col">
|
||||
<label for="threshold" class="text-lime-500 font-bold text-xs">Min Words Threshold</label>
|
||||
<select
|
||||
id="threshold"
|
||||
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
|
||||
>
|
||||
<option value="5">5</option>
|
||||
<option value="10" selected>10</option>
|
||||
<option value="15">15</option>
|
||||
<option value="20">20</option>
|
||||
<option value="25">25</option>
|
||||
</select>
|
||||
</div>
|
||||
<div class="flex flex-col">
|
||||
<label for="css-selector" class="text-lime-500 font-bold text-xs">CSS Selector</label>
|
||||
<input
|
||||
type="text"
|
||||
id="css-selector"
|
||||
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
|
||||
placeholder="Enter CSS Selector"
|
||||
/>
|
||||
</div>
|
||||
<div class="flex flex-col">
|
||||
<label for="extraction-strategy-select" class="text-lime-500 font-bold text-xs"
|
||||
>Extraction Strategy</label
|
||||
>
|
||||
<select
|
||||
id="extraction-strategy-select"
|
||||
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-lime-500"
|
||||
>
|
||||
<option value="CosineStrategy">CosineStrategy</option>
|
||||
<option value="LLMExtractionStrategy">LLMExtractionStrategy</option>
|
||||
<option value="NoExtractionStrategy">NoExtractionStrategy</option>
|
||||
</select>
|
||||
</div>
|
||||
<div class="flex flex-col">
|
||||
<label for="chunking-strategy-select" class="text-lime-500 font-bold text-xs"
|
||||
>Chunking Strategy</label
|
||||
>
|
||||
<select
|
||||
id="chunking-strategy-select"
|
||||
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-lime-500"
|
||||
>
|
||||
<option value="RegexChunking">RegexChunking</option>
|
||||
<option value="NlpSentenceChunking">NlpSentenceChunking</option>
|
||||
<option value="TopicSegmentationChunking">TopicSegmentationChunking</option>
|
||||
<option value="FixedLengthWordChunking">FixedLengthWordChunking</option>
|
||||
<option value="SlidingWindowChunking">SlidingWindowChunking</option>
|
||||
</select>
|
||||
</div>
|
||||
<div class="flex flex-col">
|
||||
<label for="provider-model-select" class="text-lime-500 font-bold text-xs"
|
||||
>Provider Model</label
|
||||
>
|
||||
<select
|
||||
id="provider-model-select"
|
||||
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
|
||||
disabled
|
||||
>
|
||||
<option value="groq/llama3-70b-8192">groq/llama3-70b-8192</option>
|
||||
<option value="groq/llama3-8b-8192">groq/llama3-8b-8192</option>
|
||||
<option value="openai/gpt-4-turbo">gpt-4-turbo</option>
|
||||
<option value="openai/gpt-3.5-turbo">gpt-3.5-turbo</option>
|
||||
<option value="anthropic/claude-3-haiku-20240307">claude-3-haiku</option>
|
||||
<option value="anthropic/claude-3-opus-20240229">claude-3-opus</option>
|
||||
<option value="anthropic/claude-3-sonnet-20240229">claude-3-sonnet</option>
|
||||
</select>
|
||||
</div>
|
||||
<div class="flex flex-col">
|
||||
<label for="token-input" class="text-lime-500 font-bold text-xs">API Token</label>
|
||||
<input
|
||||
type="password"
|
||||
id="token-input"
|
||||
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-lime-500"
|
||||
placeholder="Enter Groq API token"
|
||||
disabled
|
||||
/>
|
||||
</div>
|
||||
<div class="flex gap-3">
|
||||
<div class="flex items-center gap-2">
|
||||
<input type="checkbox" id="bypass-cache-checkbox" />
|
||||
<label for="bypass-cache-checkbox" class="text-lime-500 font-bold">Bypass Cache</label>
|
||||
</div>
|
||||
<div class="flex items-center gap-2">
|
||||
<input type="checkbox" id="extract-blocks-checkbox" checked />
|
||||
<label for="extract-blocks-checkbox" class="text-lime-500 font-bold"
|
||||
>Extract Blocks</label
|
||||
>
|
||||
</div>
|
||||
<button id="crawl-btn" class="bg-lime-600 text-black font-bold px-4 py-0 rounded">
|
||||
Crawl
|
||||
</button>
|
||||
</div>
|
||||
<div class="mx-auto p-4 bg-zinc-950 text-lime-500 min-h-screen">
|
||||
<div class="container mx-auto">
|
||||
<div class="flex h-full px-20">
|
||||
<div class="sidebar w-1/4 p-4">
|
||||
<h2 class="text-lg font-bold mb-4">Outline</h2>
|
||||
<ul>
|
||||
<li class="mb-2"><a href="#" data-target="installation">Installation</a></li>
|
||||
<li class="mb-2"><a href="#" data-target="how-to-guide">How to Guide</a></li>
|
||||
<li class="mb-2"><a href="#" data-target="chunking-strategies">Chunking Strategies</a></li>
|
||||
<li class="mb-2">
|
||||
<a href="#" data-target="extraction-strategies">Extraction Strategies</a>
|
||||
</li>
|
||||
</ul>
|
||||
</div>
|
||||
|
||||
<div id="result" class=" ">
|
||||
<div id="loading" class="hidden">
|
||||
<p class="text-white">Loading... Please wait.</p>
|
||||
</div>
|
||||
<div class="tab-buttons flex gap-2">
|
||||
<button
|
||||
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
|
||||
data-tab="json"
|
||||
>
|
||||
JSON
|
||||
</button>
|
||||
<button
|
||||
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
|
||||
data-tab="cleaned-html"
|
||||
>
|
||||
Cleaned HTML
|
||||
</button>
|
||||
<button
|
||||
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
|
||||
data-tab="markdown"
|
||||
>
|
||||
Markdown
|
||||
</button>
|
||||
</div>
|
||||
<div class="tab-content code bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
|
||||
<pre class="h-full flex"><code id="json-result" class="language-json"></code></pre>
|
||||
<pre
|
||||
class="hidden h-full flex"
|
||||
><code id="cleaned-html-result" class="language-html"></code></pre>
|
||||
<pre
|
||||
class="hidden h-full flex"
|
||||
><code id="markdown-result" class="language-markdown"></code></pre>
|
||||
</div>
|
||||
</div>
|
||||
<!-- Main Content -->
|
||||
<div class="w-3/4 p-4">
|
||||
{{installation | safe}} {{how_to_guide | safe}}
|
||||
|
||||
<div id="code_help" class=" ">
|
||||
<div class="tab-buttons flex gap-2">
|
||||
<button
|
||||
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
|
||||
data-tab="curl"
|
||||
>
|
||||
cURL
|
||||
</button>
|
||||
<button
|
||||
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
|
||||
data-tab="library"
|
||||
>
|
||||
Python Library
|
||||
</button>
|
||||
<button
|
||||
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
|
||||
data-tab="python"
|
||||
>
|
||||
Python (Request)
|
||||
</button>
|
||||
<button
|
||||
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
|
||||
data-tab="nodejs"
|
||||
>
|
||||
Node.js
|
||||
</button>
|
||||
</div>
|
||||
<div class="tab-content result bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
|
||||
<pre class="h-full flex relative">
|
||||
<code id="curl-code" class="language-bash"></code>
|
||||
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="curl-code">Copy</button>
|
||||
</pre>
|
||||
<pre class="hidden h-full flex relative">
|
||||
<code id="python-code" class="language-python"></code>
|
||||
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="python-code">Copy</button>
|
||||
</pre>
|
||||
<pre class="hidden h-full flex relative">
|
||||
<code id="nodejs-code" class="language-javascript"></code>
|
||||
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="nodejs-code">Copy</button>
|
||||
</pre>
|
||||
<pre class="hidden h-full flex relative">
|
||||
<code id="library-code" class="language-python"></code>
|
||||
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="library-code">Copy</button>
|
||||
</pre>
|
||||
</div>
|
||||
<section id="chunking-strategies" class="content-section">
|
||||
<h1 class="text-2xl font-bold">Chunking Strategies</h1>
|
||||
<p>Content for chunking strategies...</p>
|
||||
</section>
|
||||
<section id="extraction-strategies" class="content-section">
|
||||
<h1 class="text-2xl font-bold">Extraction Strategies</h1>
|
||||
<p>Content for extraction strategies...</p>
|
||||
</section>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
<section class="bg-zinc-900 text-zinc-300 p-6 px-20">
|
||||
<h1 class="text-3xl font-bold mb-4">Installation 💻</h1>
|
||||
<p class="mb-4">There are two ways to use Crawl4AI: as a library in your Python projects or as a standalone local server.</p>
|
||||
|
||||
<p class="mb-4">You can also try Crawl4AI in a Google Colab <a href = "https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" style="display: inline-block; width: 100px; height: 20px;"/></a></p>
|
||||
</div>
|
||||
|
||||
<h2 class="text-2xl font-bold mb-2">Using Crawl4AI as a Library 📚</h2>
|
||||
<p class="mb-4">To install Crawl4AI as a library, follow these steps:</p>
|
||||
|
||||
<ol class="list-decimal list-inside mb-4">
|
||||
<li class="mb-2">
|
||||
Install the package from GitHub:
|
||||
<pre class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"><code>pip install git+https://github.com/unclecode/crawl4ai.git</code></pre>
|
||||
</li>
|
||||
<li class="mb-2">
|
||||
Alternatively, you can clone the repository and install the package locally:
|
||||
<pre class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"><code class = "language-python bash">virtualenv venv
|
||||
source venv/bin/activate
|
||||
git clone https://github.com/unclecode/crawl4ai.git
|
||||
cd crawl4ai
|
||||
pip install -e .
|
||||
</code></pre>
|
||||
</li>
|
||||
<li>
|
||||
Import the necessary modules in your Python script:
|
||||
<pre class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"><code class = "language-python hljs">from crawl4ai.web_crawler import WebCrawler
|
||||
from crawl4ai.chunking_strategy import *
|
||||
from crawl4ai.extraction_strategy import *
|
||||
import os
|
||||
|
||||
crawler = WebCrawler()
|
||||
|
||||
# Single page crawl
|
||||
single_url = UrlModel(url='https://www.nbcnews.com/business', forced=False)
|
||||
result = crawl4ai.fetch_page(
|
||||
url='https://www.nbcnews.com/business',
|
||||
word_count_threshold=5, # Minimum word count for a HTML tag to be considered as a worthy block
|
||||
chunking_strategy= RegexChunking( patterns = ["\\n\\n"]), # Default is RegexChunking
|
||||
extraction_strategy= CosineStrategy(word_count_threshold=20, max_dist=0.2, linkage_method='ward', top_k=3) # Default is CosineStrategy
|
||||
# extraction_strategy= LLMExtractionStrategy(provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY')),
|
||||
bypass_cache=False,
|
||||
extract_blocks =True, # Whether to extract semantical blocks of text from the HTML
|
||||
css_selector = "", # Eg: "div.article-body"
|
||||
verbose=True,
|
||||
include_raw_html=True, # Whether to include the raw HTML content in the response
|
||||
)
|
||||
print(result.model_dump())
|
||||
</code></pre>
|
||||
</li>
|
||||
</ol>
|
||||
<p class="mb-4">For more information about how to run Crawl4AI as a local server, please refer to the <a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.</p>
|
||||
<a href="
|
||||
</section>
|
||||
|
||||
<section class="bg-zinc-900 text-zinc-300 p-6 px-20">
|
||||
<h1 class="text-3xl font-bold mb-4">📖 Parameters</h1>
|
||||
<div class="overflow-x-auto">
|
||||
<table class="min-w-full bg-zinc-800 border border-zinc-700">
|
||||
<thead>
|
||||
<tr>
|
||||
<th class="py-2 px-4 border-b border-zinc-700">Parameter</th>
|
||||
<th class="py-2 px-4 border-b border-zinc-700">Description</th>
|
||||
<th class="py-2 px-4 border-b border-zinc-700">Required</th>
|
||||
<th class="py-2 px-4 border-b border-zinc-700">Default Value</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">urls</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">
|
||||
A list of URLs to crawl and extract data from.
|
||||
</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">Yes</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">include_raw_html</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">
|
||||
Whether to include the raw HTML content in the response.
|
||||
</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">No</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">false</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">bypass_cache</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">
|
||||
Whether to force a fresh crawl even if the URL has been previously crawled.
|
||||
</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">No</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">false</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">extract_blocks</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">
|
||||
Whether to extract semantical blocks of text from the HTML.
|
||||
</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">No</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">true</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">word_count_threshold</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">
|
||||
The minimum number of words a block must contain to be considered meaningful (minimum
|
||||
value is 5).
|
||||
</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">No</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">5</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">extraction_strategy</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">
|
||||
The strategy to use for extracting content from the HTML (e.g., "CosineStrategy").
|
||||
</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">No</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">CosineStrategy</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">chunking_strategy</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">
|
||||
The strategy to use for chunking the text before processing (e.g., "RegexChunking").
|
||||
</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">No</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">RegexChunking</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">css_selector</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">
|
||||
The CSS selector to target specific parts of the HTML for extraction.
|
||||
</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">No</td>
|
||||
<td class="py-2 px-4 border-b border-zinc-700">None</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="py-2 px-4">verbose</td>
|
||||
<td class="py-2 px-4">Whether to enable verbose logging.</td>
|
||||
<td class="py-2 px-4">No</td>
|
||||
<td class="py-2 px-4">true</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section id="extraction" class="py-8 px-20">
|
||||
<div class="overflow-x-auto mx-auto px-6">
|
||||
<h2 class="text-2xl font-bold mb-4">Extraction Strategies</h2>
|
||||
<div id="extraction-strategies" class="space-y-4"></div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section id="chunking" class="py-8 px-20">
|
||||
<div class="overflow-x-auto mx-auto px-6">
|
||||
<h2 class="text-2xl font-bold mb-4">Chunking Strategies</h2>
|
||||
<div id="chunking-strategies" class="space-y-4"></div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section class="hero bg-zinc-900 py-8 px-20">
|
||||
<div class="container mx-auto px-4">
|
||||
<h2 class="text-3xl font-bold mb-4">🤔 Why building this?</h2>
|
||||
<p class="text-lg mb-4">
|
||||
In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging
|
||||
for services that should rightfully be accessible to everyone. 🌍💸 One such example is scraping and
|
||||
crawling web pages and transforming them into a format suitable for Large Language Models (LLMs).
|
||||
🕸️🤖 We believe that building a business around this is not the right approach; instead, it should
|
||||
definitely be open-source. 🆓🌟 So, if you possess the skills to build such tools and share our
|
||||
philosophy, we invite you to join our "Robinhood" band and help set these products free for the
|
||||
benefit of all. 🤝💪
|
||||
</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section class="installation py-8 px-20">
|
||||
<div class="container mx-auto px-4">
|
||||
<h2 class="text-2xl font-bold mb-4">⚙️ Installation</h2>
|
||||
<p class="mb-4">
|
||||
To install and run Crawl4AI as a library or a local server, please refer to the 📚
|
||||
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
|
||||
</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<footer class="bg-zinc-900 text-white py-4">
|
||||
<div class="container mx-auto px-4">
|
||||
<div class="flex justify-between items-center">
|
||||
<p>© 2024 Crawl4AI. All rights reserved.</p>
|
||||
<div class="social-links">
|
||||
<a
|
||||
href="https://github.com/unclecode/crawl4ai"
|
||||
class="text-white hover:text-gray-300 mx-2"
|
||||
target="_blank"
|
||||
>😺 GitHub</a
|
||||
>
|
||||
<a
|
||||
href="https://twitter.com/unclecode"
|
||||
class="text-white hover:text-gray-300 mx-2"
|
||||
target="_blank"
|
||||
>🐦 Twitter</a
|
||||
>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</footer>
|
||||
|
||||
<script>
|
||||
// JavaScript to manage dynamic form changes and logic
|
||||
document.getElementById("extraction-strategy-select").addEventListener("change", function () {
|
||||
const strategy = this.value;
|
||||
const providerModelSelect = document.getElementById("provider-model-select");
|
||||
const tokenInput = document.getElementById("token-input");
|
||||
|
||||
if (strategy === "LLMExtractionStrategy") {
|
||||
providerModelSelect.disabled = false;
|
||||
tokenInput.disabled = false;
|
||||
} else {
|
||||
providerModelSelect.disabled = true;
|
||||
tokenInput.disabled = true;
|
||||
}
|
||||
});
|
||||
|
||||
// Get the selected provider model and token from local storage
|
||||
const storedProviderModel = localStorage.getItem("provider_model");
|
||||
const storedToken = localStorage.getItem(storedProviderModel);
|
||||
|
||||
if (storedProviderModel) {
|
||||
document.getElementById("provider-model-select").value = storedProviderModel;
|
||||
}
|
||||
|
||||
if (storedToken) {
|
||||
document.getElementById("token-input").value = storedToken;
|
||||
}
|
||||
|
||||
// Handle provider model dropdown change
|
||||
document.getElementById("provider-model-select").addEventListener("change", () => {
|
||||
const selectedProviderModel = document.getElementById("provider-model-select").value;
|
||||
const storedToken = localStorage.getItem(selectedProviderModel);
|
||||
|
||||
if (storedToken) {
|
||||
document.getElementById("token-input").value = storedToken;
|
||||
} else {
|
||||
document.getElementById("token-input").value = "";
|
||||
}
|
||||
});
|
||||
|
||||
// Fetch total count from the database
|
||||
axios
|
||||
.get("/total-count")
|
||||
.then((response) => {
|
||||
document.getElementById("total-count").textContent = response.data.count;
|
||||
})
|
||||
.catch((error) => console.error(error));
|
||||
|
||||
// Handle crawl button click
|
||||
document.getElementById("crawl-btn").addEventListener("click", () => {
|
||||
// validate input to have both URL and API token
|
||||
if (!document.getElementById("url-input").value || !document.getElementById("token-input").value) {
|
||||
alert("Please enter both URL(s) and API token.");
|
||||
return;
|
||||
}
|
||||
|
||||
const selectedProviderModel = document.getElementById("provider-model-select").value;
|
||||
const apiToken = document.getElementById("token-input").value;
|
||||
const extractBlocks = document.getElementById("extract-blocks-checkbox").checked;
|
||||
const bypassCache = document.getElementById("bypass-cache-checkbox").checked;
|
||||
|
||||
// Save the selected provider model and token to local storage
|
||||
localStorage.setItem("provider_model", selectedProviderModel);
|
||||
localStorage.setItem(selectedProviderModel, apiToken);
|
||||
|
||||
const urlsInput = document.getElementById("url-input").value;
|
||||
const urls = urlsInput.split(",").map((url) => url.trim());
|
||||
const data = {
|
||||
urls: urls,
|
||||
provider_model: selectedProviderModel,
|
||||
api_token: apiToken,
|
||||
include_raw_html: true,
|
||||
bypass_cache: bypassCache,
|
||||
extract_blocks: extractBlocks,
|
||||
word_count_threshold: parseInt(document.getElementById("threshold").value),
|
||||
extraction_strategy: document.getElementById("extraction-strategy-select").value,
|
||||
chunking_strategy: document.getElementById("chunking-strategy-select").value,
|
||||
css_selector: document.getElementById("css-selector").value,
|
||||
verbose: true,
|
||||
};
|
||||
|
||||
// save api token to local storage
|
||||
localStorage.setItem("api_token", document.getElementById("token-input").value);
|
||||
|
||||
document.getElementById("loading").classList.remove("hidden");
|
||||
//document.getElementById("result").classList.add("hidden");
|
||||
//document.getElementById("code_help").classList.add("hidden");
|
||||
|
||||
axios
|
||||
.post("/crawl", data)
|
||||
.then((response) => {
|
||||
const result = response.data.results[0];
|
||||
const parsedJson = JSON.parse(result.parsed_json);
|
||||
document.getElementById("json-result").textContent = JSON.stringify(parsedJson, null, 2);
|
||||
document.getElementById("cleaned-html-result").textContent = result.cleaned_html;
|
||||
document.getElementById("markdown-result").textContent = result.markdown;
|
||||
|
||||
// Update code examples dynamically
|
||||
const extractionStrategy = data.extraction_strategy;
|
||||
const isLLMExtraction = extractionStrategy === "LLMExtractionStrategy";
|
||||
|
||||
document.getElementById(
|
||||
"curl-code"
|
||||
).textContent = `curl -X POST -H "Content-Type: application/json" -d '${JSON.stringify({
|
||||
...data,
|
||||
api_token: isLLMExtraction ? "your_api_token" : undefined,
|
||||
})}' http://crawl4ai.uccode.io/crawl`;
|
||||
|
||||
document.getElementById(
|
||||
"python-code"
|
||||
).textContent = `import requests\n\ndata = ${JSON.stringify(
|
||||
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
|
||||
null,
|
||||
2
|
||||
)}\n\nresponse = requests.post("http://crawl4ai.uccode.io/crawl", json=data) # or localhost if you run locally \nprint(response.json())`;
|
||||
|
||||
document.getElementById(
|
||||
"nodejs-code"
|
||||
).textContent = `const axios = require('axios');\n\nconst data = ${JSON.stringify(
|
||||
{ ...data, api_token: isLLMExtraction ? "your_api_token" : undefined },
|
||||
null,
|
||||
2
|
||||
)};\n\naxios.post("http://crawl4ai.uccode.io/crawl", data) // or localhost if you run locally \n .then(response => console.log(response.data))\n .catch(error => console.error(error));`;
|
||||
|
||||
document.getElementById(
|
||||
"library-code"
|
||||
).textContent = `from crawl4ai.web_crawler import WebCrawler\nfrom crawl4ai.extraction_strategy import *\nfrom crawl4ai.chunking_strategy import *\n\ncrawler = WebCrawler()\ncrawler.warmup()\n\nresult = crawler.run(\n url='${
|
||||
urls[0]
|
||||
}',\n word_count_threshold=${data.word_count_threshold},\n extraction_strategy=${
|
||||
isLLMExtraction
|
||||
? `${extractionStrategy}(provider="${data.provider_model}", api_token="${data.api_token}")`
|
||||
: extractionStrategy + "()"
|
||||
},\n chunking_strategy=${data.chunking_strategy}(),\n bypass_cache=${
|
||||
data.bypass_cache
|
||||
},\n css_selector="${data.css_selector}"\n)\nprint(result)`;
|
||||
|
||||
// Highlight code syntax
|
||||
hljs.highlightAll();
|
||||
|
||||
// Select JSON tab by default
|
||||
document.querySelector('.tab-btn[data-tab="json"]').click();
|
||||
|
||||
document.getElementById("loading").classList.add("hidden");
|
||||
document.getElementById("result").classList.remove("hidden");
|
||||
document.getElementById("code_help").classList.remove("hidden");
|
||||
|
||||
// increment the total count
|
||||
document.getElementById("total-count").textContent =
|
||||
parseInt(document.getElementById("total-count").textContent) + 1;
|
||||
})
|
||||
.catch((error) => {
|
||||
console.error(error);
|
||||
document.getElementById("loading").classList.add("hidden");
|
||||
});
|
||||
});
|
||||
|
||||
// Handle tab clicks
|
||||
document.querySelectorAll(".tab-btn").forEach((btn) => {
|
||||
btn.addEventListener("click", () => {
|
||||
const tab = btn.dataset.tab;
|
||||
document
|
||||
.querySelectorAll(".tab-btn")
|
||||
.forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
|
||||
btn.classList.add("bg-lime-700", "text-white");
|
||||
document.querySelectorAll(".tab-content.code pre").forEach((el) => el.classList.add("hidden"));
|
||||
document.getElementById(`${tab}-result`).parentElement.classList.remove("hidden");
|
||||
});
|
||||
});
|
||||
|
||||
// Handle code tab clicks
|
||||
document.querySelectorAll(".code-tab-btn").forEach((btn) => {
|
||||
btn.addEventListener("click", () => {
|
||||
const tab = btn.dataset.tab;
|
||||
document
|
||||
.querySelectorAll(".code-tab-btn")
|
||||
.forEach((b) => b.classList.remove("bg-lime-700", "text-white"));
|
||||
btn.classList.add("bg-lime-700", "text-white");
|
||||
document.querySelectorAll(".tab-content.result pre").forEach((el) => el.classList.add("hidden"));
|
||||
document.getElementById(`${tab}-code`).parentElement.classList.remove("hidden");
|
||||
});
|
||||
});
|
||||
|
||||
// Handle copy to clipboard button clicks
|
||||
|
||||
async function copyToClipboard(text) {
|
||||
if (navigator.clipboard && navigator.clipboard.writeText) {
|
||||
return navigator.clipboard.writeText(text);
|
||||
} else {
|
||||
return fallbackCopyTextToClipboard(text);
|
||||
}
|
||||
}
|
||||
|
||||
function fallbackCopyTextToClipboard(text) {
|
||||
return new Promise((resolve, reject) => {
|
||||
const textArea = document.createElement("textarea");
|
||||
textArea.value = text;
|
||||
|
||||
// Avoid scrolling to bottom
|
||||
textArea.style.top = "0";
|
||||
textArea.style.left = "0";
|
||||
textArea.style.position = "fixed";
|
||||
|
||||
document.body.appendChild(textArea);
|
||||
textArea.focus();
|
||||
textArea.select();
|
||||
|
||||
try {
|
||||
const successful = document.execCommand("copy");
|
||||
if (successful) {
|
||||
resolve();
|
||||
} else {
|
||||
reject();
|
||||
}
|
||||
} catch (err) {
|
||||
reject(err);
|
||||
}
|
||||
|
||||
document.body.removeChild(textArea);
|
||||
});
|
||||
}
|
||||
|
||||
document.querySelectorAll(".copy-btn").forEach((btn) => {
|
||||
btn.addEventListener("click", () => {
|
||||
const target = btn.dataset.target;
|
||||
const code = document.getElementById(target).textContent;
|
||||
//navigator.clipboard.writeText(code).then(() => {
|
||||
copyToClipboard(code).then(() => {
|
||||
btn.textContent = "Copied!";
|
||||
setTimeout(() => {
|
||||
btn.textContent = "Copy";
|
||||
}, 2000);
|
||||
});
|
||||
});
|
||||
});
|
||||
|
||||
document.addEventListener("DOMContentLoaded", async () => {
|
||||
try {
|
||||
const extractionResponse = await fetch("/strategies/extraction");
|
||||
const extractionStrategies = await extractionResponse.json();
|
||||
|
||||
const chunkingResponse = await fetch("/strategies/chunking");
|
||||
const chunkingStrategies = await chunkingResponse.json();
|
||||
|
||||
renderStrategies("extraction-strategies", extractionStrategies);
|
||||
renderStrategies("chunking-strategies", chunkingStrategies);
|
||||
} catch (error) {
|
||||
console.error("Error fetching strategies:", error);
|
||||
}
|
||||
});
|
||||
|
||||
function renderStrategies(containerId, strategies) {
|
||||
const container = document.getElementById(containerId);
|
||||
container.innerHTML = ""; // Clear any existing content
|
||||
strategies = JSON.parse(strategies);
|
||||
Object.entries(strategies).forEach(([strategy, description]) => {
|
||||
const strategyElement = document.createElement("div");
|
||||
strategyElement.classList.add("bg-zinc-800", "p-4", "rounded", "shadow-md", "docs-item");
|
||||
|
||||
const strategyDescription = document.createElement("div");
|
||||
strategyDescription.classList.add("text-gray-300", "prose", "prose-sm");
|
||||
strategyDescription.innerHTML = marked.parse(description);
|
||||
|
||||
strategyElement.appendChild(strategyDescription);
|
||||
|
||||
container.appendChild(strategyElement);
|
||||
});
|
||||
}
|
||||
|
||||
// Highlight code syntax
|
||||
hljs.highlightAll();
|
||||
</script>
|
||||
{{ footer | safe }}
|
||||
<script src="/pages/app.js"></script>
|
||||
</body>
|
||||
</html>
|
||||
|
||||
@@ -283,7 +283,7 @@
|
||||
.post("/crawl", data)
|
||||
.then((response) => {
|
||||
const result = response.data.results[0];
|
||||
const parsedJson = JSON.parse(result.parsed_json);
|
||||
const parsedJson = JSON.parse(result.extracted_content);
|
||||
document.getElementById("json-result").textContent = JSON.stringify(parsedJson, null, 2);
|
||||
document.getElementById("cleaned-html-result").textContent = result.cleaned_html;
|
||||
document.getElementById("markdown-result").textContent = result.markdown;
|
||||
|
||||
36 pages/partial/footer.html (new file)
@@ -0,0 +1,36 @@
|
||||
<section class="hero bg-zinc-900 py-8 px-20 text-zinc-400">
|
||||
<div class="container mx-auto px-4">
|
||||
<h2 class="text-3xl font-bold mb-4">🤔 Why building this?</h2>
|
||||
<p class="text-lg mb-4">
|
||||
In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging
|
||||
for services that should rightfully be accessible to everyone. 🌍💸 One such example is scraping and
|
||||
crawling web pages and transforming them into a format suitable for Large Language Models (LLMs).
|
||||
🕸️🤖 We believe that building a business around this is not the right approach; instead, it should
|
||||
definitely be open-source. 🆓🌟 So, if you possess the skills to build such tools and share our
|
||||
philosophy, we invite you to join our "Robinhood" band and help set these products free for the
|
||||
benefit of all. 🤝💪
|
||||
</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<footer class="bg-zinc-900 text-zinc-400 py-4">
|
||||
<div class="container mx-auto px-4">
|
||||
<div class="flex justify-between items-center">
|
||||
<p>© 2024 Crawl4AI. All rights reserved.</p>
|
||||
<div class="social-links">
|
||||
<a
|
||||
href="https://github.com/unclecode/crawl4ai"
|
||||
class="text-zinc-400 hover:text-gray-300 mx-2"
|
||||
target="_blank"
|
||||
>😺 GitHub</a
|
||||
>
|
||||
<a
|
||||
href="https://twitter.com/unclecode"
|
||||
class="text-zinc-400 hover:text-gray-300 mx-2"
|
||||
target="_blank"
|
||||
>🐦 Twitter</a
|
||||
>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</footer>
|
||||
160 pages/partial/how_to_guide.html (new file)
@@ -0,0 +1,160 @@
|
||||
<section id="how-to-guide" class="content-section">
|
||||
<h1 class="text-2xl font-bold">How to Guide</h1>
|
||||
<div class="flex flex-col gap-4 p-4 bg-zinc-900 text-lime-500">
|
||||
<!-- Step 1 -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
🌟
|
||||
<strong
|
||||
>Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling
|
||||
fun!</strong
|
||||
>
|
||||
</div>
|
||||
<div class="">
|
||||
First Step: Create an instance of WebCrawler and call the
|
||||
<code>warmup()</code> function.
|
||||
</div>
|
||||
<div>
|
||||
<pre><code class="language-python">crawler = WebCrawler()
|
||||
crawler.warmup()</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 2 -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
🧠 <strong>Understanding 'bypass_cache' and 'include_raw_html' parameters:</strong>
|
||||
</div>
|
||||
<div class="">First crawl (caches the result):</div>
|
||||
<div>
|
||||
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business")</code></pre>
|
||||
</div>
|
||||
<div class="">Second crawl (Force to crawl again):</div>
|
||||
<div>
|
||||
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)</code></pre>
|
||||
<div class="bg-red-900 p-2 text-zinc-50">
|
||||
⚠️ Don't forget to set <code>`bypass_cache`</code> to True if you want to try different strategies for the same URL; otherwise, the cached result will be returned. You can also set <code>`always_by_pass_cache`</code> to True in the constructor to always bypass the cache.
|
||||
</div>
|
||||
</div>
|
||||
<div class="">Crawl result without raw HTML content:</div>
|
||||
<div>
|
||||
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 3 -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
📄
|
||||
<strong
|
||||
>The 'include_raw_html' parameter, when set to True, includes the raw HTML content
|
||||
in the response. By default, it is set to True.</strong
|
||||
>
|
||||
</div>
|
||||
<div class="">Set <code>always_by_pass_cache</code> to True:</div>
|
||||
<div>
|
||||
<pre><code class="language-python">crawler.always_by_pass_cache = True</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 4 -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
🧩 <strong>Let's add a chunking strategy: RegexChunking!</strong>
|
||||
</div>
|
||||
<div class="">Using RegexChunking:</div>
|
||||
<div>
|
||||
<pre><code class="language-python">result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
chunking_strategy=RegexChunking(patterns=["\n\n"])
|
||||
)</code></pre>
|
||||
</div>
|
||||
<div class="">Using NlpSentenceChunking:</div>
|
||||
<div>
|
||||
<pre><code class="language-python">result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
chunking_strategy=NlpSentenceChunking()
|
||||
)</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 5 -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
🧠 <strong>Let's get smarter with an extraction strategy: CosineStrategy!</strong>
|
||||
</div>
|
||||
<div class="">Using CosineStrategy:</div>
|
||||
<div>
|
||||
<pre><code class="language-python">result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
|
||||
)</code></pre>
|
||||
</div>
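<div class="">
Chunking and extraction can also be combined in a single call; here is a small sketch, assuming <code>run()</code> accepts both keyword arguments together (each is shown separately above):
</div>
<div>
<pre><code class="language-python">result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=RegexChunking(patterns=["\n\n"]),
    extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
)</code></pre>
</div>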
|
||||
|
||||
<!-- Step 6 -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
🤖
|
||||
<strong
|
||||
>Time to bring in the big guns: LLMExtractionStrategy without instructions!</strong
|
||||
>
|
||||
</div>
|
||||
<div class="">Using LLMExtractionStrategy without instructions:</div>
|
||||
<div>
|
||||
<pre><code class="language-python">result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
|
||||
)</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 7 -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
📜
|
||||
<strong
|
||||
>Let's make it even more interesting: LLMExtractionStrategy with
|
||||
instructions!</strong
|
||||
>
|
||||
</div>
|
||||
<div class="">Using LLMExtractionStrategy with instructions:</div>
|
||||
<div>
|
||||
<pre><code class="language-python">result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o",
|
||||
api_token=os.getenv('OPENAI_API_KEY'),
|
||||
instruction="I am interested in only financial news"
|
||||
)
|
||||
)</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 8 -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
🎯
|
||||
<strong>Targeted extraction: Let's use a CSS selector to extract only H2 tags!</strong>
|
||||
</div>
|
||||
<div class="">Using CSS selector to extract H2 tags:</div>
|
||||
<div>
|
||||
<pre><code class="language-python">result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
css_selector="h2"
|
||||
)</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 9 -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
🖱️
|
||||
<strong
|
||||
>Let's get interactive: Passing JavaScript code to click 'Load More' button!</strong
|
||||
>
|
||||
</div>
|
||||
<div class="">Using JavaScript to click 'Load More' button:</div>
|
||||
<div>
|
||||
<pre><code class="language-python">js_code = """
|
||||
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
|
||||
loadMoreButton && loadMoreButton.click();
|
||||
"""
|
||||
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
|
||||
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
|
||||
result = crawler.run(url="https://www.nbcnews.com/business")</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Conclusion -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
🎉
|
||||
<strong
|
||||
>Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth
|
||||
and crawl the web like a pro! 🕸️</strong
|
||||
>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
56 pages/partial/installation.html Normal file
@@ -0,0 +1,56 @@
|
||||
<section id="installation" class="content-section active">
|
||||
<h1 class="text-2xl font-bold">Installation 💻</h1>
|
||||
<p class="mb-4">
|
||||
There are three ways to use Crawl4AI:
|
||||
<ol class="list-decimal list-inside mb-4">
|
||||
<li class="">
|
||||
As a library
|
||||
</li>
|
||||
<li class="">
|
||||
As a local server (Docker)
|
||||
</li>
|
||||
<li class="">
|
||||
As a Google Colab notebook. <a href="https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk"
|
||||
><img
|
||||
src="https://colab.research.google.com/assets/colab-badge.svg"
|
||||
alt="Open In Colab"
|
||||
style="display: inline-block; width: 100px; height: 20px"
|
||||
/></a>
|
||||
</li>
|
||||
</ol>
|
||||
|
||||
|
||||
<p class="my-4">To install Crawl4AI as a library, follow these steps:</p>
|
||||
|
||||
<ol class="list-decimal list-inside mb-4">
|
||||
<li class="mb-4">
|
||||
Install the package from GitHub:
|
||||
<pre
|
||||
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
|
||||
><code>pip install git+https://github.com/unclecode/crawl4ai.git</code></pre>
|
||||
</li>
|
||||
<li class="mb-4">
|
||||
Alternatively, you can clone the repository and install the package locally:
|
||||
<pre
|
||||
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
|
||||
><code class="language-bash">virtualenv venv
|
||||
source venv/bin/activate
|
||||
git clone https://github.com/unclecode/crawl4ai.git
|
||||
cd crawl4ai
|
||||
pip install -e .
|
||||
</code></pre>
|
||||
</li>
|
||||
<li class="">
|
||||
Use Docker to run the local server:
|
||||
<pre
|
||||
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
|
||||
><code class="language-bash">docker build -t crawl4ai .
# For Mac users: docker build --platform linux/amd64 -t crawl4ai .
|
||||
docker run -d -p 8000:80 crawl4ai</code></pre>
|
||||
</li>
|
||||
</ol>
|
||||
<p class="mb-4">
|
||||
For more information about how to run Crawl4AI as a local server, please refer to the
|
||||
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
|
||||
</p>
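<p class="my-4">
To verify the library installation, a minimal sanity check (mirroring the quickstart guide) is to create a <code>WebCrawler</code>, warm it up, and run a single crawl:
</p>
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class="language-python">from crawl4ai import WebCrawler

# Create the crawler, load its models, and fetch one page
crawler = WebCrawler()
crawler.warmup()
result = crawler.run(url="https://www.nbcnews.com/business")
print(result)</code></pre>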
|
||||
</section>
|
||||
204 pages/partial/try_it.html Normal file
@@ -0,0 +1,204 @@
|
||||
<section class="try-it py-8 px-16 pb-20 bg-zinc-900">
|
||||
<div class="container mx-auto ">
|
||||
<h2 class="text-2xl font-bold mb-4 text-lime-500">Try It Now</h2>
|
||||
<div class="flex gap-4">
|
||||
<div class="flex flex-col flex-1 gap-2">
|
||||
<div class="flex flex-col">
|
||||
<label for="url-input" class="text-lime-500 font-bold text-xs">URL(s)</label>
|
||||
<input
|
||||
type="text"
|
||||
id="url-input"
|
||||
value="https://www.nbcnews.com/business"
|
||||
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300"
|
||||
placeholder="Enter URL(s) separated by commas"
|
||||
/>
|
||||
</div>
|
||||
<div class="flex gap-2">
|
||||
<div class="flex flex-col">
|
||||
<label for="threshold" class="text-lime-500 font-bold text-xs">Min Words Threshold</label>
|
||||
<select
|
||||
id="threshold"
|
||||
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
|
||||
>
|
||||
<option value="5">5</option>
|
||||
<option value="10" selected>10</option>
|
||||
<option value="15">15</option>
|
||||
<option value="20">20</option>
|
||||
<option value="25">25</option>
|
||||
</select>
|
||||
</div>
|
||||
<div class="flex flex-col flex-1">
|
||||
<label for="css-selector" class="text-lime-500 font-bold text-xs">CSS Selector</label>
|
||||
<input
|
||||
type="text"
|
||||
id="css-selector"
|
||||
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300 placeholder-lime-700"
|
||||
placeholder="CSS Selector (e.g. .content, #main, article)"
|
||||
/>
|
||||
</div>
|
||||
</div>
|
||||
<div class="flex gap-2">
|
||||
<div class="flex flex-col">
|
||||
<label for="extraction-strategy-select" class="text-lime-500 font-bold text-xs"
|
||||
>Extraction Strategy</label
|
||||
>
|
||||
<select
|
||||
id="extraction-strategy-select"
|
||||
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
|
||||
>
|
||||
<option value="CosineStrategy">CosineStrategy</option>
|
||||
<option value="LLMExtractionStrategy">LLMExtractionStrategy</option>
|
||||
<option value="NoExtractionStrategy">NoExtractionStrategy</option>
|
||||
</select>
|
||||
</div>
|
||||
<div class="flex flex-col">
|
||||
<label for="chunking-strategy-select" class="text-lime-500 font-bold text-xs"
|
||||
>Chunking Strategy</label
|
||||
>
|
||||
<select
|
||||
id="chunking-strategy-select"
|
||||
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
|
||||
>
|
||||
<option value="RegexChunking">RegexChunking</option>
|
||||
<option value="NlpSentenceChunking">NlpSentenceChunking</option>
|
||||
<option value="TopicSegmentationChunking">TopicSegmentationChunking</option>
|
||||
<option value="FixedLengthWordChunking">FixedLengthWordChunking</option>
|
||||
<option value="SlidingWindowChunking">SlidingWindowChunking</option>
|
||||
</select>
|
||||
</div>
|
||||
</div>
|
||||
<div id = "llm_settings" class="flex gap-2 hidden hidden">
|
||||
<div class="flex flex-col">
|
||||
<label for="provider-model-select" class="text-lime-500 font-bold text-xs"
|
||||
>Provider Model</label
|
||||
>
|
||||
<select
|
||||
id="provider-model-select"
|
||||
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
|
||||
>
|
||||
<option value="groq/llama3-70b-8192">groq/llama3-70b-8192</option>
|
||||
<option value="groq/llama3-8b-8192">groq/llama3-8b-8192</option>
|
||||
<option value="groq/mixtral-8x7b-32768">groq/mixtral-8x7b-32768</option>
|
||||
<option value="openai/gpt-4-turbo">gpt-4-turbo</option>
|
||||
<option value="openai/gpt-3.5-turbo">gpt-3.5-turbo</option>
|
||||
<option value="openai/gpt-4o">gpt-4o</option>
|
||||
<option value="anthropic/claude-3-haiku-20240307">claude-3-haiku</option>
|
||||
<option value="anthropic/claude-3-opus-20240229">claude-3-opus</option>
|
||||
<option value="anthropic/claude-3-sonnet-20240229">claude-3-sonnet</option>
|
||||
</select>
|
||||
</div>
|
||||
<div class="flex flex-col flex-1">
|
||||
<label for="token-input" class="text-lime-500 font-bold text-xs">API Token</label>
|
||||
<input
|
||||
type="password"
|
||||
id="token-input"
|
||||
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300"
|
||||
placeholder="Enter Groq API token"
|
||||
/>
|
||||
</div>
|
||||
</div>
|
||||
<div class="flex gap-2">
|
||||
<!-- Two text areas: one for the keyword filter (CosineStrategy) and one for the instruction (LLMExtractionStrategy); both grow to full width -->
|
||||
<div id = "semantic_filter_div" class="flex flex-col flex-1">
|
||||
<label for="keyword-filter" class="text-lime-500 font-bold text-xs">Keyword Filter</label>
|
||||
<textarea
|
||||
id="semantic_filter"
|
||||
rows="3"
|
||||
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300 placeholder-zinc-700"
|
||||
placeholder="Enter keywords for CosineStrategy to narrow down the content."
|
||||
></textarea>
|
||||
</div>
|
||||
<div id = "instruction_div" class="flex flex-col flex-1 hidden">
|
||||
<label for="instruction" class="text-lime-500 font-bold text-xs">Instruction</label>
|
||||
<textarea
|
||||
id="instruction"
|
||||
rows="3"
|
||||
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300 placeholder-zinc-700"
|
||||
placeholder="Enter instruction for the LLMEstrategy to instruct the model."
|
||||
></textarea>
|
||||
</div>
|
||||
</div>
|
||||
<div class="flex gap-3">
|
||||
<div class="flex items-center gap-2">
|
||||
<input type="checkbox" id="bypass-cache-checkbox" />
|
||||
<label for="bypass-cache-checkbox" class="text-lime-500 font-bold">Bypass Cache</label>
|
||||
</div>
|
||||
<div class="flex items-center gap-2">
|
||||
<input type="checkbox" id="extract-blocks-checkbox" checked />
|
||||
<label for="extract-blocks-checkbox" class="text-lime-500 font-bold">Extract Blocks</label>
|
||||
</div>
|
||||
<button id="crawl-btn" class="bg-lime-600 text-black font-bold px-4 py-0 rounded">Crawl</button>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div id="result" class="flex-1">
|
||||
<div id="loading" class="hidden">
|
||||
<p class="text-white">Loading... Please wait.</p>
|
||||
</div>
|
||||
<div class="tab-buttons flex gap-2">
|
||||
<button class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="json">
|
||||
JSON
|
||||
</button>
|
||||
<button
|
||||
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
|
||||
data-tab="cleaned-html"
|
||||
>
|
||||
Cleaned HTML
|
||||
</button>
|
||||
<button class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="markdown">
|
||||
Markdown
|
||||
</button>
|
||||
</div>
|
||||
<div class="tab-content code bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
|
||||
<pre class="h-full flex"><code id="json-result" class="language-json"></code></pre>
|
||||
<pre class="hidden h-full flex"><code id="cleaned-html-result" class="language-html"></code></pre>
|
||||
<pre class="hidden h-full flex"><code id="markdown-result" class="language-markdown"></code></pre>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div id="code_help" class="flex-1">
|
||||
<div class="tab-buttons flex gap-2">
|
||||
<button class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="curl">
|
||||
cURL
|
||||
</button>
|
||||
<button
|
||||
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
|
||||
data-tab="library"
|
||||
>
|
||||
Python
|
||||
</button>
|
||||
<button
|
||||
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
|
||||
data-tab="python"
|
||||
>
|
||||
REST API
|
||||
</button>
|
||||
<!-- <button
|
||||
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
|
||||
data-tab="nodejs"
|
||||
>
|
||||
Node.js
|
||||
</button> -->
|
||||
</div>
|
||||
<div class="tab-content result bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
|
||||
<pre class="h-full flex relative">
|
||||
<code id="curl-code" class="language-bash"></code>
|
||||
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="curl-code">Copy</button>
|
||||
</pre>
|
||||
<pre class="hidden h-full flex relative">
|
||||
<code id="python-code" class="language-python"></code>
|
||||
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="python-code">Copy</button>
|
||||
</pre>
|
||||
<pre class="hidden h-full flex relative">
|
||||
<code id="nodejs-code" class="language-javascript"></code>
|
||||
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="nodejs-code">Copy</button>
|
||||
</pre>
|
||||
<pre class="hidden h-full flex relative">
|
||||
<code id="library-code" class="language-python"></code>
|
||||
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="library-code">Copy</button>
|
||||
</pre>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
435 pages/tmp.html Normal file
@@ -0,0 +1,435 @@
|
||||
<div class="w-3/4 p-4">
|
||||
<section id="installation" class="content-section active">
|
||||
<h1 class="text-2xl font-bold">Installation 💻</h1>
|
||||
<p class="mb-4">There are three ways to use Crawl4AI:</p>
|
||||
<ol class="list-decimal list-inside mb-4">
|
||||
<li class="">As a library</li>
|
||||
<li class="">As a local server (Docker)</li>
|
||||
<li class="">
|
||||
As a Google Colab notebook.
|
||||
<a href="https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk"
|
||||
><img
|
||||
src="https://colab.research.google.com/assets/colab-badge.svg"
|
||||
alt="Open In Colab"
|
||||
style="display: inline-block; width: 100px; height: 20px"
|
||||
/></a>
|
||||
</li>
|
||||
</ol>
|
||||
|
||||
<p class="my-4">To install Crawl4AI as a library, follow these steps:</p>
|
||||
|
||||
<ol class="list-decimal list-inside mb-4">
|
||||
<li class="mb-4">
|
||||
Install the package from GitHub:
|
||||
<pre
|
||||
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
|
||||
><code class="hljs language-bash">pip install git+https://github.com/unclecode/crawl4ai.git</code></pre>
|
||||
</li>
|
||||
<li class="mb-4">
|
||||
Alternatively, you can clone the repository and install the package locally:
|
||||
<pre
|
||||
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
|
||||
><code class="language-python bash hljs">virtualenv venv
|
||||
source venv/<span class="hljs-built_in">bin</span>/activate
|
||||
git clone https://github.com/unclecode/crawl4ai.git
|
||||
cd crawl4ai
|
||||
pip install -e .
|
||||
</code></pre>
|
||||
</li>
|
||||
<li class="">
|
||||
Use docker to run the local server:
|
||||
<pre
|
||||
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
|
||||
><code class="language-python bash hljs">docker build -t crawl4ai .
|
||||
<span class="hljs-comment"># docker build --platform linux/amd64 -t crawl4ai . For Mac users</span>
|
||||
docker run -d -p <span class="hljs-number">8000</span>:<span class="hljs-number">80</span> crawl4ai</code></pre>
|
||||
</li>
|
||||
</ol>
|
||||
<p class="mb-4">
|
||||
For more information about how to run Crawl4AI as a local server, please refer to the
|
||||
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
|
||||
</p>
|
||||
|
||||
</section>
|
||||
<section id="how-to-guide" class="content-section">
|
||||
<h1 class="text-2xl font-bold">How to Guide</h1>
|
||||
<div class="flex flex-col gap-4 p-4 bg-zinc-900 text-lime-500">
|
||||
<!-- Step 1 -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
🌟
|
||||
<strong>Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun!</strong>
|
||||
</div>
|
||||
<div class="">
|
||||
First Step: Create an instance of WebCrawler and call the
|
||||
<code>warmup()</code> function.
|
||||
</div>
|
||||
<div>
|
||||
<pre><code class="language-python hljs">crawler = WebCrawler()
|
||||
crawler.warmup()</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 2 -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
🧠 <strong>Understanding 'bypass_cache' and 'include_raw_html' parameters:</strong>
|
||||
</div>
|
||||
<div class="">First crawl (caches the result):</div>
|
||||
<div>
|
||||
<pre><code class="language-python hljs">result = crawler.run(url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>)</code></pre>
|
||||
</div>
|
||||
<div class="">Second crawl (Force to crawl again):</div>
|
||||
<div>
|
||||
<pre><code class="language-python hljs">result = crawler.run(url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>, bypass_cache=<span class="hljs-literal">True</span>)</code></pre>
|
||||
<div class="bg-red-900 p-2 text-zinc-50">
|
||||
⚠️ Don't forget to set <code>bypass_cache</code> to True if you want to try different strategies
for the same URL; otherwise, the cached result will be returned. You can also set
<code>always_by_pass_cache</code> to True in the constructor to always bypass the cache.
|
||||
</div>
|
||||
</div>
|
||||
<div class="">Crawl result without raw HTML content:</div>
|
||||
<div>
|
||||
<pre><code class="language-python hljs">result = crawler.run(url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>, include_raw_html=<span class="hljs-literal">False</span>)</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 3 -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
📄
|
||||
<strong
|
||||
>The 'include_raw_html' parameter, when set to True, includes the raw HTML content in the response.
|
||||
By default, it is set to True.</strong
|
||||
>
|
||||
</div>
|
||||
<div class="">Set <code>always_by_pass_cache</code> to True:</div>
|
||||
<div>
|
||||
<pre><code class="language-python hljs">crawler.always_by_pass_cache = <span class="hljs-literal">True</span></code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 4 -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
🧩 <strong>Let's add a chunking strategy: RegexChunking!</strong>
|
||||
</div>
|
||||
<div class="">Using RegexChunking:</div>
|
||||
<div>
|
||||
<pre><code class="language-python hljs">result = crawler.run(
|
||||
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
|
||||
chunking_strategy=RegexChunking(patterns=[<span class="hljs-string">"\n\n"</span>])
|
||||
)</code></pre>
|
||||
</div>
|
||||
<div class="">Using NlpSentenceChunking:</div>
|
||||
<div>
|
||||
<pre><code class="language-python hljs">result = crawler.run(
|
||||
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
|
||||
chunking_strategy=NlpSentenceChunking()
|
||||
)</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 5 -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
🧠 <strong>Let's get smarter with an extraction strategy: CosineStrategy!</strong>
|
||||
</div>
|
||||
<div class="">Using CosineStrategy:</div>
|
||||
<div>
|
||||
<pre><code class="language-python hljs">result = crawler.run(
|
||||
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
|
||||
extraction_strategy=CosineStrategy(word_count_threshold=<span class="hljs-number">20</span>, max_dist=<span class="hljs-number">0.2</span>, linkage_method=<span class="hljs-string">"ward"</span>, top_k=<span class="hljs-number">3</span>)
|
||||
)</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 6 -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
🤖
|
||||
<strong>Time to bring in the big guns: LLMExtractionStrategy without instructions!</strong>
|
||||
</div>
|
||||
<div class="">Using LLMExtractionStrategy without instructions:</div>
|
||||
<div>
|
||||
<pre><code class="language-python hljs">result = crawler.run(
|
||||
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
|
||||
extraction_strategy=LLMExtractionStrategy(provider=<span class="hljs-string">"openai/gpt-4o"</span>, api_token=os.getenv(<span class="hljs-string">'OPENAI_API_KEY'</span>))
|
||||
)</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 7 -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
📜
|
||||
<strong>Let's make it even more interesting: LLMExtractionStrategy with instructions!</strong>
|
||||
</div>
|
||||
<div class="">Using LLMExtractionStrategy with instructions:</div>
|
||||
<div>
|
||||
<pre><code class="language-python hljs">result = crawler.run(
|
||||
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider=<span class="hljs-string">"openai/gpt-4o"</span>,
|
||||
api_token=os.getenv(<span class="hljs-string">'OPENAI_API_KEY'</span>),
|
||||
instruction=<span class="hljs-string">"I am interested in only financial news"</span>
|
||||
)
|
||||
)</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 8 -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
🎯
|
||||
<strong>Targeted extraction: Let's use a CSS selector to extract only H2 tags!</strong>
|
||||
</div>
|
||||
<div class="">Using CSS selector to extract H2 tags:</div>
|
||||
<div>
|
||||
<pre><code class="language-python hljs">result = crawler.run(
|
||||
url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>,
|
||||
css_selector=<span class="hljs-string">"h2"</span>
|
||||
)</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Step 9 -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
🖱️
|
||||
<strong>Let's get interactive: Passing JavaScript code to click 'Load More' button!</strong>
|
||||
</div>
|
||||
<div class="">Using JavaScript to click 'Load More' button:</div>
|
||||
<div>
|
||||
<pre><code class="language-python hljs">js_code = <span class="hljs-string">"""
|
||||
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
|
||||
loadMoreButton && loadMoreButton.click();
|
||||
"""</span>
|
||||
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
|
||||
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=<span class="hljs-literal">True</span>)
|
||||
result = crawler.run(url=<span class="hljs-string">"https://www.nbcnews.com/business"</span>)</code></pre>
|
||||
</div>
|
||||
|
||||
<!-- Conclusion -->
|
||||
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
||||
🎉
|
||||
<strong
|
||||
>Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the
|
||||
web like a pro! 🕸️</strong
|
||||
>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section id="chunking-strategies" class="content-section">
|
||||
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
|
||||
<div class="text-gray-300 prose prose-sm">
|
||||
<h3>RegexChunking</h3>
|
||||
<p>
|
||||
<code>RegexChunking</code> is a text chunking strategy that splits a given text into smaller parts
|
||||
using regular expressions. This is useful for preparing large texts for processing by language
|
||||
models, ensuring they are divided into manageable segments.
|
||||
</p>
|
||||
<h4>Constructor Parameters:</h4>
|
||||
<ul>
|
||||
<li>
|
||||
<code>patterns</code> (list, optional): A list of regular expression patterns used to split the
|
||||
text. Default is to split by double newlines (<code>['\n\n']</code>).
|
||||
</li>
|
||||
</ul>
|
||||
<h4>Example usage:</h4>
|
||||
<pre><code class="language-python">chunker = RegexChunking(patterns=[r'\n\n', r'\. '])
|
||||
chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
|
||||
</code></pre>
|
||||
</div>
|
||||
</div>
|
||||
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
|
||||
<div class="text-gray-300 prose prose-sm">
|
||||
<h3>NlpSentenceChunking</h3>
|
||||
<p>
|
||||
<code>NlpSentenceChunking</code> uses a natural language processing model to chunk a given text into
|
||||
sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.
|
||||
</p>
|
||||
<h4>Constructor Parameters:</h4>
|
||||
<ul>
|
||||
<li>
|
||||
<code>model</code> (str, optional): The SpaCy model to use for sentence detection. Default is
|
||||
<code>'en_core_web_sm'</code>.
|
||||
</li>
|
||||
</ul>
|
||||
<h4>Example usage:</h4>
|
||||
<pre><code class="language-python">chunker = NlpSentenceChunking(model='en_core_web_sm')
|
||||
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
|
||||
</code></pre>
|
||||
</div>
|
||||
</div>
|
||||
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
|
||||
<div class="text-gray-300 prose prose-sm">
|
||||
<h3>TopicSegmentationChunking</h3>
|
||||
<p>
|
||||
<code>TopicSegmentationChunking</code> uses the TextTiling algorithm to segment a given text into
|
||||
topic-based chunks. This method identifies thematic boundaries in the text.
|
||||
</p>
|
||||
<h4>Constructor Parameters:</h4>
|
||||
<ul>
|
||||
<li>
|
||||
<code>num_keywords</code> (int, optional): The number of keywords to extract for each topic
|
||||
segment. Default is <code>3</code>.
|
||||
</li>
|
||||
</ul>
|
||||
<h4>Example usage:</h4>
|
||||
<pre><code class="language-python">chunker = TopicSegmentationChunking(num_keywords=3)
|
||||
chunks = chunker.chunk("This is a sample text. It will be split into topic-based segments.")
|
||||
</code></pre>
|
||||
</div>
|
||||
</div>
|
||||
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
|
||||
<div class="text-gray-300 prose prose-sm">
|
||||
<h3>FixedLengthWordChunking</h3>
|
||||
<p>
|
||||
<code>FixedLengthWordChunking</code> splits a given text into chunks of fixed length, based on the
|
||||
number of words.
|
||||
</p>
|
||||
<h4>Constructor Parameters:</h4>
|
||||
<ul>
|
||||
<li>
|
||||
<code>chunk_size</code> (int, optional): The number of words in each chunk. Default is
|
||||
<code>100</code>.
|
||||
</li>
|
||||
</ul>
|
||||
<h4>Example usage:</h4>
|
||||
<pre><code class="language-python">chunker = FixedLengthWordChunking(chunk_size=100)
|
||||
chunks = chunker.chunk("This is a sample text. It will be split into fixed-length word chunks.")
|
||||
</code></pre>
|
||||
</div>
|
||||
</div>
|
||||
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
|
||||
<div class="text-gray-300 prose prose-sm">
|
||||
<h3>SlidingWindowChunking</h3>
|
||||
<p>
|
||||
<code>SlidingWindowChunking</code> uses a sliding window approach to chunk a given text. Each chunk
|
||||
has a fixed length, and the window slides by a specified step size.
|
||||
</p>
|
||||
<h4>Constructor Parameters:</h4>
|
||||
<ul>
|
||||
<li>
|
||||
<code>window_size</code> (int, optional): The number of words in each chunk. Default is
|
||||
<code>100</code>.
|
||||
</li>
|
||||
<li>
|
||||
<code>step</code> (int, optional): The number of words to slide the window. Default is
|
||||
<code>50</code>.
|
||||
</li>
|
||||
</ul>
|
||||
<h4>Example usage:</h4>
|
||||
<pre><code class="language-python">chunker = SlidingWindowChunking(window_size=100, step=50)
|
||||
chunks = chunker.chunk("This is a sample text. It will be split using a sliding window approach.")
|
||||
</code></pre>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
<section id="extraction-strategies" class="content-section">
|
||||
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
|
||||
<div class="text-gray-300 prose prose-sm">
|
||||
<h3>NoExtractionStrategy</h3>
|
||||
<p>
|
||||
<code>NoExtractionStrategy</code> is a basic extraction strategy that returns the entire HTML
|
||||
content without any modification. It is useful for cases where no specific extraction is required
and you only need the cleaned HTML and Markdown output.
|
||||
</p>
|
||||
<h4>Constructor Parameters:</h4>
|
||||
<p>None.</p>
|
||||
<h4>Example usage:</h4>
|
||||
<pre><code class="language-python">extractor = NoExtractionStrategy()
|
||||
extracted_content = extractor.extract(url, html)
|
||||
</code></pre>
|
||||
</div>
|
||||
</div>
|
||||
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
|
||||
<div class="text-gray-300 prose prose-sm">
|
||||
<h3>LLMExtractionStrategy</h3>
|
||||
<p>
|
||||
<code>LLMExtractionStrategy</code> uses a Language Model (LLM) to extract meaningful blocks or
|
||||
chunks from the given HTML content. This strategy leverages an external provider for language model
|
||||
completions.
|
||||
</p>
|
||||
<h4>Constructor Parameters:</h4>
|
||||
<ul>
|
||||
<li>
|
||||
<code>provider</code> (str, optional): The provider to use for the language model completions.
|
||||
Default is <code>DEFAULT_PROVIDER</code> (e.g., openai/gpt-4).
|
||||
</li>
|
||||
<li>
|
||||
<code>api_token</code> (str, optional): The API token for the provider. If not provided, it will
|
||||
try to load from the environment variable <code>OPENAI_API_KEY</code>.
|
||||
</li>
|
||||
<li>
|
||||
<code>instruction</code> (str, optional): An instruction to guide the LLM on how to perform the
|
||||
extraction. This allows users to specify the type of data they are interested in or set the tone
|
||||
of the response. Default is <code>None</code>.
|
||||
</li>
|
||||
</ul>
|
||||
<h4>Example usage:</h4>
|
||||
<pre><code class="language-python">extractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token', instruction='Extract only news about AI.')
|
||||
extracted_content = extractor.extract(url, html)
|
||||
</code></pre>
|
||||
<p>
|
||||
By providing clear instructions, users can tailor the extraction process to their specific needs,
|
||||
enhancing the relevance and utility of the extracted content.
|
||||
</p>
|
||||
</div>
|
||||
</div>
|
||||
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
|
||||
<div class="text-gray-300 prose prose-sm">
|
||||
<h3>CosineStrategy</h3>
|
||||
<p>
|
||||
<code>CosineStrategy</code> uses hierarchical clustering based on cosine similarity to extract
|
||||
clusters of text from the given HTML content. This strategy is suitable for identifying related
|
||||
content sections.
|
||||
</p>
|
||||
<h4>Constructor Parameters:</h4>
|
||||
<ul>
|
||||
<li>
|
||||
<code>semantic_filter</code> (str, optional): A string containing keywords for filtering relevant
|
||||
documents before clustering. If provided, documents are filtered based on their cosine
|
||||
similarity to the keyword filter embedding. Default is <code>None</code>.
|
||||
</li>
|
||||
<li>
|
||||
<code>word_count_threshold</code> (int, optional): Minimum number of words per cluster. Default
|
||||
is <code>20</code>.
|
||||
</li>
|
||||
<li>
|
||||
<code>max_dist</code> (float, optional): The maximum cophenetic distance on the dendrogram to
|
||||
form clusters. Default is <code>0.2</code>.
|
||||
</li>
|
||||
<li>
|
||||
<code>linkage_method</code> (str, optional): The linkage method for hierarchical clustering.
|
||||
Default is <code>'ward'</code>.
|
||||
</li>
|
||||
<li>
|
||||
<code>top_k</code> (int, optional): Number of top categories to extract. Default is
|
||||
<code>3</code>.
|
||||
</li>
|
||||
<li>
|
||||
<code>model_name</code> (str, optional): The model name for embedding generation. Default is
|
||||
<code>'BAAI/bge-small-en-v1.5'</code>.
|
||||
</li>
|
||||
</ul>
|
||||
<h4>Example usage:</h4>
|
||||
<pre><code class="language-python">extractor = CosineStrategy(semantic_filter='artificial intelligence', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')
|
||||
extracted_content = extractor.extract(url, html)
|
||||
</code></pre>
|
||||
<h4>Cosine Similarity Filtering</h4>
|
||||
<p>
|
||||
When a <code>semantic_filter</code> is provided, the <code>CosineStrategy</code> applies an
|
||||
embedding-based filtering process to select relevant documents before performing hierarchical
|
||||
clustering.
|
||||
</p>
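<h4>Example with semantic filtering:</h4>
<p>
A minimal sketch (building on the parameters documented above, not a separate API) that passes a
<code>semantic_filter</code> through <code>crawler.run</code> so only content related to the given
keywords is clustered and returned:
</p>
<pre><code class="language-python">result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=CosineStrategy(semantic_filter="technology", word_count_threshold=10, top_k=3)
)
</code></pre>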
|
||||
</div>
|
||||
</div>
|
||||
<div class="bg-zinc-800 p-4 rounded shadow-md docs-item">
|
||||
<div class="text-gray-300 prose prose-sm">
|
||||
<h3>TopicExtractionStrategy</h3>
|
||||
<p>
|
||||
<code>TopicExtractionStrategy</code> uses the TextTiling algorithm to segment the HTML content into
|
||||
topics and extracts keywords for each segment. This strategy is useful for identifying and
|
||||
summarizing thematic content.
|
||||
</p>
|
||||
<h4>Constructor Parameters:</h4>
|
||||
<ul>
|
||||
<li>
|
||||
<code>num_keywords</code> (int, optional): Number of keywords to represent each topic segment.
|
||||
Default is <code>3</code>.
|
||||
</li>
|
||||
</ul>
|
||||
<h4>Example usage:</h4>
|
||||
<pre><code class="language-python">extractor = TopicExtractionStrategy(num_keywords=3)
|
||||
extracted_content = extractor.extract(url, html)
|
||||
</code></pre>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
</div>
|
||||
@@ -13,4 +13,5 @@ litellm
|
||||
python-dotenv
|
||||
nltk
|
||||
lazy_import
|
||||
rich
|
||||
# spacy
|
||||