crawl4ai/docs/md_v2/index.md

# 🚀🤖 Crawl4AI: Open-Source LLM-Friendly Web Crawler & Scraper

<div align="center">

<a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>

</div>

[![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
[![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
[![PyPI version](https://badge.fury.io/py/crawl4ai.svg)](https://badge.fury.io/py/crawl4ai)
[![Python Version](https://img.shields.io/pypi/pyversions/crawl4ai)](https://pypi.org/project/crawl4ai/)
[![Downloads](https://static.pepy.tech/badge/crawl4ai/month)](https://pepy.tech/project/crawl4ai)
[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)


Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for large language models, AI agents, and data pipelines. Fully open source, flexible, and built for real-time performance, **Crawl4AI** empowers developers with unmatched speed, precision, and deployment ease.

---

## My Personal Journey

I’ve always loved exploring the web development, back from when HTML and JavaScript were hardly intertwined. My curiosity drove me into web development, mathematics, AI, and machine learning, always keeping a close tie to real industrial applications. In 2009–2010, as a postgraduate student, I created platforms to gather and organize published papers for Master’s and PhD researchers. Faced with post-grad students’ data challenges, I built a helper app to crawl newly published papers and public data. Relying on Internet Explorer and DLL hacks was far more cumbersome than modern tools, highlighting my longtime background in data extraction.

Fast-forward to 2023: I needed to fetch web data and transform it into neat **markdown** for my AI pipeline. All solutions I found were either **closed-source**, overpriced, or produced low-quality output. As someone who has built large edu-tech ventures (like KidoCode), I believe **data belongs to the people**. We shouldn’t pay $16 just to parse the web’s publicly available content. This friction led me to create my own library, **Crawl4AI**, in a matter of days to meet my immediate needs. Unexpectedly, it went **viral**, accumulating thousands of GitHub stars.

Now, in **January 2025**, Crawl4AI has surpassed **21,000 stars** and remains the #1 trending repository. It’s my way of giving back to the community after benefiting from open source for years. I’m thrilled by how many of you share that passion. Thank you for being here, join our Discord, file issues, submit PRs, or just spread the word. Let’s build the best data extraction, crawling, and scraping library **together**.

---

## What Does Crawl4AI Do?

Crawl4AI is a feature-rich crawler and scraper that aims to:

1. **Generate Clean Markdown**: Perfect for RAG pipelines or direct ingestion into LLMs.  
2. **Structured Extraction**: Parse repeated patterns with CSS, XPath, or LLM-based extraction.  
3. **Advanced Browser Control**: Hooks, proxies, stealth modes, session re-use—fine-grained control.  
4. **High Performance**: Parallel crawling, chunk-based extraction, real-time use cases.  
5. **Open Source**: No forced API keys, no paywalls—everyone can access their data.  

**Core Philosophies**:
- **Democratize Data**: Free to use, transparent, and highly configurable.  
- **LLM Friendly**: Minimally processed, well-structured text, images, and metadata, so AI models can easily consume it.

---

## Documentation Structure

To help you get started, we’ve organized our docs into clear sections:

- **Setup & Installation**  
  Basic instructions to install Crawl4AI via pip or Docker.  
- **Quick Start**  
  A hands-on introduction showing how to do your first crawl, generate Markdown, and do a simple extraction.  
- **Core**  
  Deeper guides on single-page crawling, advanced browser/crawler parameters, content filtering, and caching.  
- **Advanced**  
  Explore link & media handling, lazy loading, hooking & authentication, proxies, session management, and more.  
- **Extraction**  
  Detailed references for no-LLM (CSS, XPath) vs. LLM-based strategies, chunking, and clustering approaches.  
- **API Reference**  
  Find the technical specifics of each class and method, including `AsyncWebCrawler`, `arun()`, and `CrawlResult`.

Throughout these sections, you’ll find code samples you can **copy-paste** into your environment. If something is missing or unclear, raise an issue or PR.

---

## How You Can Support

- **Star & Fork**: If you find Crawl4AI helpful, star the repo on GitHub or fork it to add your own features.  
- **File Issues**: Encounter a bug or missing feature? Let us know by filing an issue, so we can improve.  
- **Pull Requests**: Whether it’s a small fix, a big feature, or better docs—contributions are always welcome.  
- **Join Discord**: Come chat about web scraping, crawling tips, or AI workflows with the community.  
- **Spread the Word**: Mention Crawl4AI in your blog posts, talks, or on social media.  

**Our mission**: to empower everyone—students, researchers, entrepreneurs, data scientists—to access, parse, and shape the world’s data with speed, cost-efficiency, and creative freedom.

---

## Quick Links

- **[GitHub Repo](https://github.com/unclecode/crawl4ai)**  
- **[Installation Guide](./core/installation.md)**  
- **[Quick Start](./core/quickstart.md)**  
- **[API Reference](./api/async-webcrawler.md)**  
- **[Changelog](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md)**  

Thank you for joining me on this journey. Let’s keep building an **open, democratic** approach to data extraction and AI together.

Happy Crawling!  
— *Unclecde, Founder & Maintainer of Crawl4AI*
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								# 🚀🤖 Crawl4AI: Open-Source LLM-Friendly Web Crawler & Scraper
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								<div align="center">
 								<a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
 								</div>
 								[![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
 								[![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
 								[![PyPI version](https://badge.fury.io/py/crawl4ai.svg)](https://badge.fury.io/py/crawl4ai)
 								[![Python Version](https://img.shields.io/pypi/pyversions/crawl4ai)](https://pypi.org/project/crawl4ai/)
 								[![Downloads](https://static.pepy.tech/badge/crawl4ai/month)](https://pepy.tech/project/crawl4ai)
 								[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
 								[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 								[![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
 								Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for large language models, AI agents, and data pipelines. Fully open source, flexible, and built for real-time performance, **Crawl4AI** empowers developers with unmatched speed, precision, and deployment ease.
 								---
 								## My Personal Journey
 								I’ve always loved exploring the web development, back from when HTML and JavaScript were hardly intertwined. My curiosity drove me into web development, mathematics, AI, and machine learning, always keeping a close tie to real industrial applications. In 2009–2010, as a postgraduate student, I created platforms to gather and organize published papers for Master’s and PhD researchers. Faced with post-grad students’ data challenges, I built a helper app to crawl newly published papers and public data. Relying on Internet Explorer and DLL hacks was far more cumbersome than modern tools, highlighting my longtime background in data extraction.
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								Fast-forward to 2023: I needed to fetch web data and transform it into neat **markdown** for my AI pipeline. All solutions I found were either **closed-source**, overpriced, or produced low-quality output. As someone who has built large edu-tech ventures (like KidoCode), I believe **data belongs to the people**. We shouldn’t pay $16 just to parse the web’s publicly available content. This friction led me to create my own library, **Crawl4AI**, in a matter of days to meet my immediate needs. Unexpectedly, it went **viral**, accumulating thousands of GitHub stars.
 								Now, in **January 2025**, Crawl4AI has surpassed **21,000 stars** and remains the #1 trending repository. It’s my way of giving back to the community after benefiting from open source for years. I’m thrilled by how many of you share that passion. Thank you for being here, join our Discord, file issues, submit PRs, or just spread the word. Let’s build the best data extraction, crawling, and scraping library **together**.
 								---
 								## What Does Crawl4AI Do?
 								Crawl4AI is a feature-rich crawler and scraper that aims to:
 . **Generate Clean Markdown**: Perfect for RAG pipelines or direct ingestion into LLMs.
 . **Structured Extraction**: Parse repeated patterns with CSS, XPath, or LLM-based extraction.
 . **Advanced Browser Control**: Hooks, proxies, stealth modes, session re-use—fine-grained control.
 . **High Performance**: Parallel crawling, chunk-based extraction, real-time use cases.
 . **Open Source**: No forced API keys, no paywalls—everyone can access their data.
 								**Core Philosophies**:
 								- **Democratize Data**: Free to use, transparent, and highly configurable.
 								- **LLM Friendly**: Minimally processed, well-structured text, images, and metadata, so AI models can easily consume it.
 								---
 								## Documentation Structure
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								To help you get started, we’ve organized our docs into clear sections:
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								- **Setup & Installation**
 								  Basic instructions to install Crawl4AI via pip or Docker.
 								- **Quick Start**
 								  A hands-on introduction showing how to do your first crawl, generate Markdown, and do a simple extraction.
 								- **Core**
 								  Deeper guides on single-page crawling, advanced browser/crawler parameters, content filtering, and caching.
 								- **Advanced**
 								  Explore link & media handling, lazy loading, hooking & authentication, proxies, session management, and more.
 								- **Extraction**
 								  Detailed references for no-LLM (CSS, XPath) vs. LLM-based strategies, chunking, and clustering approaches.
 								- **API Reference**
 								  Find the technical specifics of each class and method, including `AsyncWebCrawler`, `arun()`, and `CrawlResult`.
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								Throughout these sections, you’ll find code samples you can **copy-paste** into your environment. If something is missing or unclear, raise an issue or PR.
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								---
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								## How You Can Support
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								- **Star & Fork**: If you find Crawl4AI helpful, star the repo on GitHub or fork it to add your own features.
 								- **File Issues**: Encounter a bug or missing feature? Let us know by filing an issue, so we can improve.
 								- **Pull Requests**: Whether it’s a small fix, a big feature, or better docs—contributions are always welcome.
 								- **Join Discord**: Come chat about web scraping, crawling tips, or AI workflows with the community.
 								- **Spread the Word**: Mention Crawl4AI in your blog posts, talks, or on social media.
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								**Our mission**: to empower everyone—students, researchers, entrepreneurs, data scientists—to access, parse, and shape the world’s data with speed, cost-efficiency, and creative freedom.
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								---
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								## Quick Links
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								- **[GitHub Repo](https://github.com/unclecode/crawl4ai)**
 								- **[Installation Guide](./core/installation.md)**
 								- **[Quick Start](./core/quickstart.md)**
 								- **[API Reference](./api/async-webcrawler.md)**
 								- **[Changelog](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md)**
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								Thank you for joining me on this journey. Let’s keep building an **open, democratic** approach to data extraction and AI together.
-												Update Documentation

											
										
										
											2024-10-27 19:24:46 +08:00
-												refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

											
										
										
											2025-01-07 20:49:50 +08:00
+								Happy Crawling!
 								— *Unclecde, Founder & Maintainer of Crawl4AI*