15 Commits

Author SHA1 Message Date
UncleCode
d024749633 refactor(deep-crawl): add max_pages limit and improve crawl control
Add max_pages parameter to all deep crawling strategies to limit total pages crawled.
Add score_threshold parameter to BFS/DFS strategies for quality control.
Remove legacy parameter handling in AsyncWebCrawler.
Improve error handling and logging in crawl strategies.

BREAKING CHANGE: Removed support for legacy parameters in AsyncWebCrawler.run_many()
2025-03-03 21:51:11 +08:00
Aravind
a9e24307cc
Release prep (#749)
* fix: Update export of URLPatternFilter

* chore: Add dependancy for cchardet in requirements

* docs: Update example for deep crawl in release note for v0.5

* Docs: update the example for memory dispatcher

* docs: updated example for crawl strategies

* Refactor: Removed wrapping in if __name__==main block since this is a markdown file.

* chore: removed cchardet from dependancy list, since unclecode is planning to remove it

* docs: updated the example for proxy rotation to a working example

* feat: Introduced ProxyConfig param

* Add tutorial for deep crawl & update contributor list for bug fixes in feb alpha-1

* chore: update and test new dependancies

* feat:Make PyPDF2 a conditional dependancy

* updated tutorial and release note for v0.5

* docs: update docs for deep crawl, and fix a typo in docker-deployment markdown filename

* refactor: 1. Deprecate markdown_v2 2. Make markdown backward compatible to behave as a string when needed. 3. Fix LlmConfig usage in cli 4. Deprecate markdown_v2 in cli 5. Update AsyncWebCrawler for changes in CrawlResult

* fix: Bug in serialisation of markdown in acache_url

* Refactor: Added deprecation errors for fit_html and fit_markdown directly on markdown. Now access them via markdown

* fix: remove deprecated markdown_v2 from docker

* Refactor: remove deprecated fit_markdown and fit_html from result

* refactor: fix cache retrieval for markdown as a string

* chore: update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown
2025-02-28 19:53:35 +08:00
Aravind
2af958e12c
Feat/llm config (#724)
* feature: Add LlmConfig to easily configure and pass LLM configs to different strategies

* pulled in next branch and resolved conflicts

* feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions

* Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params

* updated tests, docs and readme
2025-02-21 15:41:37 +08:00
Aravind
dad592c801
2025 feb alpha 1 (#685)
* spelling change in prompt

* gpt-4o-mini support

* Remove leading Y before here

* prompt spell correction

* (Docs) Fix numbered list end-of-line formatting

Added the missing "two spaces" to add a line break

* fix: access downloads_path through browser_config in _handle_download method - Fixes #585

* crawl

* fix: https://github.com/unclecode/crawl4ai/issues/592

* fix: https://github.com/unclecode/crawl4ai/issues/583

* Docs update: https://github.com/unclecode/crawl4ai/issues/649

* fix: https://github.com/unclecode/crawl4ai/issues/570

* Docs: updated example for content-selection to reflect new changes in yc newsfeed css

* Refactor: Removed old filters and replaced with optimised filters

* fix:Fixed imports as per the new names of filters

* Tests: For deep crawl filters

* Refactor: Remove old scorers and replace with optimised ones: Fix imports forall filters and scorers.

* fix: awaiting on filters that are async in nature eg: content relevance and seo filters

* fix: https://github.com/unclecode/crawl4ai/issues/592

* fix: https://github.com/unclecode/crawl4ai/issues/715

---------

Co-authored-by: DarshanTank <darshan.tank@gnani.ai>
Co-authored-by: Tuhin Mallick <tuhin.mllk@gmail.com>
Co-authored-by: Serhat Soydan <ssoydan@gmail.com>
Co-authored-by: cardit1 <maneesh@cardit.in>
Co-authored-by: Tautik Agrahari <tautikagrahari@gmail.com>
2025-02-19 14:13:17 +08:00
UncleCode
91a5fea11f feat(cli): add command line interface with comprehensive features
Implements a full-featured CLI for Crawl4AI with the following capabilities:
- Basic and advanced web crawling
- Configuration management via YAML/JSON files
- Multiple extraction strategies (CSS, XPath, LLM)
- Content filtering and optimization
- Interactive Q&A capabilities
- Various output formats
- Comprehensive documentation and examples

Also includes:
- Home directory setup for configuration and cache
- Environment variable support for API tokens
- Test suite for CLI functionality
2025-02-10 16:58:52 +08:00
UncleCode
19df96ed56 feat(proxy): add proxy rotation strategy
Implements a new proxy rotation system with the following changes:
- Add ProxyRotationStrategy abstract base class
- Add RoundRobinProxyStrategy concrete implementation
- Integrate proxy rotation with AsyncWebCrawler
- Add proxy_rotation_strategy parameter to CrawlerRunConfig
- Add example script demonstrating proxy rotation usage
- Remove deprecated synchronous WebCrawler code
- Clean up rate limiting documentation

BREAKING CHANGE: Removed synchronous WebCrawler support and related rate limiting configurations
2025-02-09 18:49:10 +08:00
UncleCode
16b8d4945b feat(release): prepare v0.4.3 beta release
Prepare the v0.4.3 beta release with major feature additions and improvements:
- Add JsonXPathExtractionStrategy and LLMContentFilter to exports
- Update version to 0.4.3b1
- Improve documentation for dispatchers and markdown generation
- Update development status to Beta
- Reorganize changelog format

BREAKING CHANGE: Memory threshold in MemoryAdaptiveDispatcher increased to 90% and SemaphoreDispatcher parameter renamed to max_session_permit
2025-01-21 21:03:11 +08:00
UncleCode
2cec527a22 feat(extraction): add LLM-powered schema generation utility
Adds new static method generate_schema() to JsonElementExtractionStrategy classes
that can automatically generate extraction schemas using LLM (OpenAI or Ollama).
This provides a convenient way to bootstrap extraction schemas while maintaining
the performance benefits of selector-based extraction.

Key changes:
- Added generate_schema() static method to base extraction strategy
- Added support for both CSS and XPath schema generation
- Updated documentation with examples and best practices
- Added new prompt templates for schema generation
2025-01-20 17:28:00 +08:00
UncleCode
8b6fe6a98f docs(api): add streaming mode documentation and examples
Add comprehensive documentation for the new streaming mode feature in arun_many():
- Update arun_many() API docs to reflect streaming return type
- Add streaming examples in quickstart and multi-url guides
- Document stream parameter in configuration classes
- Add clone() helper method documentation for configs

This change improves documentation for processing large numbers of URLs efficiently.
2025-01-19 18:21:34 +08:00
UncleCode
3d09b6a221 feat(content-filter): add LLMContentFilter for intelligent markdown generation
Add new LLMContentFilter class that uses LLMs to generate high-quality markdown content:
- Implement intelligent content filtering with customizable instructions
- Add chunk processing for handling large documents
- Support parallel processing of content chunks
- Include caching mechanism for filtered results
- Add usage tracking and statistics
- Update documentation with examples and use cases

Also includes minor changes:
- Disable Pydantic warnings in __init__.py
- Add new prompt template for content filtering
2025-01-18 19:31:07 +08:00
UncleCode
c3370ec5da refactor(scraping): replace ScrapingMode enum with strategy pattern
Replace the ScrapingMode enum with a proper strategy pattern implementation for content scraping.
This change introduces:
- New ContentScrapingStrategy abstract base class
- Concrete WebScrapingStrategy and LXMLWebScrapingStrategy implementations
- New Pydantic models for structured scraping results
- Updated documentation reflecting the new strategy-based approach

BREAKING CHANGE: ScrapingMode enum has been removed. Users should now use ContentScrapingStrategy implementations instead.
2025-01-13 17:53:12 +08:00
UncleCode
f3ae5a657c feat(scraping): add LXML-based scraping mode for improved performance
Adds a new ScrapingMode enum to allow switching between BeautifulSoup and LXML parsing.
LXML mode offers 10-20x better performance for large HTML documents.

Key changes:
- Added ScrapingMode enum with BEAUTIFULSOUP and LXML options
- Implemented LXMLWebScrapingStrategy class
- Added LXML-based metadata extraction
- Updated documentation with scraping mode usage and performance considerations
- Added cssselect dependency

BREAKING CHANGE: None
2025-01-12 20:46:23 +08:00
UncleCode
825c78a048 refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring
Reorganize dispatcher functionality into separate components:
- Create dedicated dispatcher classes (MemoryAdaptive, Semaphore)
- Add RateLimiter for smart request throttling
- Implement CrawlerMonitor for real-time progress tracking
- Move dispatcher config from CrawlerRunConfig to separate classes

BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.
2025-01-11 21:10:27 +08:00
UncleCode
e8b4ac6046 docs(urls): update documentation URLs to new domain
Update all documentation URLs from crawl4ai.com/mkdocs to docs.crawl4ai.com
Improve badges styling and layout in documentation
Increase code font size in documentation CSS

BREAKING CHANGE: Documentation URLs have changed from crawl4ai.com/mkdocs to docs.crawl4ai.com
2025-01-09 16:22:41 +08:00
UncleCode
ca3e33122e refactor(docs): reorganize documentation structure and update styles
Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized
2025-01-07 20:49:50 +08:00