# Crawl4AI v0.2.77 🕷️🤖
[Stars](https://github.com/unclecode/crawl4ai/stargazers) · [Forks](https://github.com/unclecode/crawl4ai/network/members) · [Issues](https://github.com/unclecode/crawl4ai/issues) · [Pull Requests](https://github.com/unclecode/crawl4ai/pulls) · [License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
Crawl4AI simplifies web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
#### [v0.2.77] - 2024-08-02
Major improvements in functionality, performance, and cross-platform compatibility! 🚀
- 🐳 **Docker enhancements**:
  - Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
- 🌐 **Official Docker Hub image**:
  - Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai).
- 🔧 **Selenium upgrade**:
  - Removed the ChromeDriver dependency; Selenium's built-in driver management is now used for better compatibility.
- 🖼️ **Image description**:
  - Added the ability to generate textual descriptions for images extracted from web pages.
- ⚡ **Performance boost**:
  - Various improvements to enhance overall speed and performance.
## Try it Now!
✨ Play around with this [Colab Notebook](https://colab.research.google.com/drive/1sJPAmeLj5PMrg2VgOwMJ2ubGIcK0cJeX?usp=sharing)

✨ Visit our [Documentation Website](https://crawl4ai.com/mkdocs/)

✨ Check out the [Demo](https://crawl4ai.com/mkdocs/demo)
## Features ✨
- 🆓 Completely free and open-source
- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
- 🌍 Supports crawling multiple URLs simultaneously
- 🎨 Extracts and returns all media tags (images, audio, and video)
- 🔗 Extracts all external and internal links
- 📚 Extracts metadata from the page
- 🔄 Custom hooks for authentication, headers, and page modifications before crawling
- 🕵️ User-agent customization
- 🖼️ Takes screenshots of the page
- 📜 Executes multiple custom JavaScript snippets before crawling
- 📚 Various chunking strategies: topic-based, regex, sentence, and more
- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
- 🎯 CSS selector support
- 📝 Passes instructions/keywords to refine extraction
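
Several of these features come together in a single `run` call. Here is a minimal sketch; the `screenshot` and `user_agent` keyword arguments and the `result.media` / `result.links` fields are assumptions about this version's API, so check the docs for the exact names:

```python
from crawl4ai import WebCrawler

crawler = WebCrawler()
crawler.warmup()

# NOTE: the screenshot/user_agent kwargs and the media/links result fields
# are assumptions; verify against the documentation for your version.
result = crawler.run(
    url="https://www.nbcnews.com/business",
    screenshot=True,  # capture a screenshot of the rendered page
    user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
)

print(result.markdown)           # LLM-friendly markdown
print(result.media["images"])    # extracted image tags
print(result.links["external"])  # external links found on the page
```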
## 🌟 Shoutout to Contributors of v0.2.77!
A big thank you to the amazing contributors who've made this release possible:

- [@aravindkarnam](https://github.com/aravindkarnam) for the new image description feature
- [@FractalMind](https://github.com/FractalMind) for our official Docker Hub image
- [@ketonkss4](https://github.com/ketonkss4) for helping streamline our Selenium setup

Your contributions are driving Crawl4AI forward! 🚀
## Cool Examples 🚀
### Quick Start
```python
from crawl4ai import WebCrawler

# Create an instance of WebCrawler
crawler = WebCrawler()

# Warm up the crawler (load the necessary models)
crawler.warmup()

# Run the crawler on a URL
result = crawler.run(url="https://www.nbcnews.com/business")

# Print the extracted content as markdown
print(result.markdown)
```
## How to Install 🛠

### Using pip 🐍
```bash
virtualenv venv
source venv/bin/activate
pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git"
```
### Using Docker 🐳
```bash
# For Mac users (M1/M2)
# docker build --platform linux/amd64 -t crawl4ai .
docker build -t crawl4ai .
docker run -d -p 8000:80 crawl4ai
```
### Using Docker Hub 🐳
```bash
docker pull unclecode/crawl4ai:latest
docker run -d -p 8000:80 unclecode/crawl4ai:latest
```
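
With the container running, the crawler is reachable over HTTP on the mapped port 8000. Below is a hypothetical request sketch in Python; the `/crawl` endpoint path and payload shape are assumptions, so consult the image's documentation for the actual API:

```python
import requests

# Hypothetical endpoint and payload: verify against the Docker image's docs.
response = requests.post(
    "http://localhost:8000/crawl",
    json={"urls": ["https://www.nbcnews.com/business"]},
)
print(response.json())
```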
## Speed-First Design 🚀
Speed is perhaps the most important design principle for this library: it needs to handle many links and resources in parallel, as quickly as possible. Combine that speed with fast LLMs like Groq, and the results are truly amazing.
```python
import time
from crawl4ai.web_crawler import WebCrawler

crawler = WebCrawler()
crawler.warmup()

start = time.time()
url = "https://www.nbcnews.com/business"
result = crawler.run(url, word_count_threshold=10, bypass_cache=True)
end = time.time()

print(f"Time taken: {end - start}")
```
Let's take a look at the timing for the code snippet above:
```bash
[LOG] 🚀 Crawling done, success: True, time taken: 1.3623387813568115 seconds
[LOG] 🚀 Content extracted, success: True, time taken: 0.05715131759643555 seconds
[LOG] 🚀 Extraction, time taken: 0.05750393867492676 seconds.
Time taken: 1.439958095550537
```
Fetching the content from the page took 1.3623 seconds, and extracting the content took 0.0575 seconds. 🚀
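
That per-page speed compounds when you fan out over many URLs. Here is a minimal parallelization sketch using a thread pool; creating one crawler per worker is a conservative, unofficial pattern (not a documented API), since sharing a single browser session across threads may not be safe:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from crawl4ai import WebCrawler

# Example URLs; swap in your own list.
urls = [
    "https://www.nbcnews.com/business",
    "https://www.nbcnews.com/tech",
]

def crawl(url):
    # One crawler per worker thread (a conservative, unofficial pattern).
    crawler = WebCrawler()
    crawler.warmup()
    return crawler.run(url=url, word_count_threshold=10, bypass_cache=True)

start = time.time()
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(crawl, urls))
print(f"Crawled {len(results)} pages in {time.time() - start:.2f} seconds")
```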
### Extract Structured Data from Web Pages 📊
Crawl all OpenAI models and their fees from the official page.
```python
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input tokens for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output tokens for the OpenAI model.")

url = "https://openai.com/api/pricing/"
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv("OPENAI_API_KEY"),
        schema=OpenAIModelFee.schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
        Do not miss any models in the entire content. One extracted model JSON format should look like this:
        {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
    ),
    bypass_cache=True,
)

print(result.extracted_content)
```
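
`result.extracted_content` comes back as a string; assuming it holds a JSON array matching the schema, it can be validated straight back through the Pydantic model:

```python
import json

# Assumes extracted_content is a JSON array of objects matching
# the OpenAIModelFee schema defined above.
models = [OpenAIModelFee(**item) for item in json.loads(result.extracted_content)]
for model in models:
    print(f"{model.model_name}: {model.input_fee} in / {model.output_fee} out")
```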
### Execute JS, Filter Data with CSS Selector, and Clustering
```python
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

# Click the "Load More" button (if present) before extracting content
js_code = [
    "const loadMoreButton = Array.from(document.querySelectorAll('button'))"
    ".find(button => button.textContent.includes('Load More'));"
    " loadMoreButton && loadMoreButton.click();"
]

crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url="https://www.nbcnews.com/business",
    js=js_code,
    css_selector="p",
    extraction_strategy=CosineStrategy(semantic_filter="technology"),
)

print(result.extracted_content)
```
### Extract Structured Data from Web Pages with a Proxy and Custom Base URL
```python
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

def create_crawler():
    crawler = WebCrawler(verbose=True, proxy="http://127.0.0.1:7890")
    crawler.warmup()
    return crawler

crawler = create_crawler()

result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token="sk-",
        base_url="https://api.openai.com/v1",
    ),
)

print(result.markdown)
```
## Documentation 📚
For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
## Contributing 🤝
We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information.
## License 📄
Crawl4AI is released under the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE).
## Contact 📧
For questions, suggestions, or feedback, feel free to reach out:
- GitHub: [unclecode](https://github.com/unclecode)
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [crawl4ai.com](https://crawl4ai.com)
Happy Crawling! 🕸️🚀
## Star History
[Star History Chart](https://star-history.com/#unclecode/crawl4ai&Date)