655 Commits

Author SHA1 Message Date
UncleCode
4aeb7ef9ad refactor(proxy): consolidate proxy configuration handling
Moves ProxyConfig from the configs/ directory into proxy_strategy.py to improve code organization and reduce fragmentation. Updates all imports and type hints to reflect the new location.

Key changes:
- Moved ProxyConfig class from configs/proxy_config.py to proxy_strategy.py
- Updated type hints in async_configs.py to support ProxyConfig
- Fixed proxy configuration handling in browser_manager.py
- Updated documentation and examples to use new import path

BREAKING CHANGE: ProxyConfig import path has changed from crawl4ai.configs to crawl4ai.proxy_strategy
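
For downstream code the migration is a one-line import change; a minimal sketch (only the import path comes from this message, the constructor field shown is an assumption):

    # before: from crawl4ai.configs import ProxyConfig
    from crawl4ai.proxy_strategy import ProxyConfig

    proxy = ProxyConfig(server="http://proxy.example.com:8080")  # field name assumed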
2025-03-07 23:14:11 +08:00
UncleCode
a68cbb232b feat(browser): add standalone CDP browser launch and lxml extraction strategy
Add new features to enhance browser automation and HTML extraction:
- Add CDP browser launch capability with customizable ports and profiles
- Implement JsonLxmlExtractionStrategy for faster HTML parsing
- Add CLI command 'crwl cdp' for launching standalone CDP browsers
- Support connecting to external CDP browsers via URL
- Optimize selector caching and context-sensitive queries

BREAKING CHANGE: LLMConfig import path changed from crawl4ai.types to crawl4ai
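
For downstream code, LLMConfig now imports from the package root; the lxml strategy sketch below is hypothetical, assuming it accepts a schema like the existing JSON/CSS extraction strategies:

    # updated import after the breaking change
    from crawl4ai import LLMConfig

    # hypothetical usage; export location and schema shape are assumptions
    from crawl4ai import JsonLxmlExtractionStrategy
    strategy = JsonLxmlExtractionStrategy(
        schema={"name": "items", "baseSelector": "li",
                "fields": [{"name": "title", "selector": "a", "type": "text"}]}
    )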
2025-03-07 20:55:56 +08:00
UncleCode
f78c46446b feat(deep-crawling): improve URL normalization and domain filtering
Enhance URL handling in deep crawling with:
- New URL normalization functions for consistent URL formats
- Improved domain filtering with subdomain support
- Added URLPatternFilter to public API
- Better URL deduplication in BFS strategy

These changes improve crawling accuracy and reduce duplicate visits.
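
With URLPatternFilter now public, callers can restrict deep crawls to matching URLs; a minimal sketch (the patterns keyword is an assumption):

    from crawl4ai import URLPatternFilter

    # keep only documentation and blog pages; keyword name assumed
    docs_filter = URLPatternFilter(patterns=["*docs*", "*blog*"])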
2025-03-06 22:45:57 +08:00
UncleCode
1b72880007 chore(version): bump version to 0.5.0.post3 2025-03-06 20:32:32 +08:00
UncleCode
29f7915b79 fix(models): support float timestamps in CrawlStats
Modify CrawlStats class to handle both datetime and float timestamp formats for start_time and end_time fields. This change improves compatibility with different time formats while maintaining existing functionality.

Other minor changes:
- Add datetime import in async_dispatcher
- Update JsonElementExtractionStrategy kwargs handling

No breaking changes.
2025-03-06 20:30:57 +08:00
UncleCode
2327db6fdc refactor(crawler): introduce CrawlResultContainer and simplify interfaces
Introduces a new generic CrawlResultContainer class to standardize return types and
improve type safety. Removes legacy parameter handling and simplifies method signatures.
This change makes the API more consistent and easier to maintain.

BREAKING CHANGE: Synchronous crawler methods now always return CrawlResultContainer
instead of raw CrawlResult or List[CrawlResult]. Legacy parameters have been removed
from method signatures.
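
A rough sketch of what this implies for callers, assuming the container is iterable and proxies attribute access to the first result (neither is spelled out in this message):

    container = crawler.run("https://example.com")   # hypothetical synchronous call
    print(container.url)                             # single-result access, assumed to delegate
    for result in container:                         # batch access, assumed to iterate results
        print(result.url)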
2025-03-05 22:23:08 +08:00
UncleCode
3a234ec950 fix(auth): make JWT authentication optional with fallback
Modify authentication system to gracefully handle cases where JWT is not enabled or token is missing. This includes:
- Making HTTPBearer auto_error=False to prevent automatic 403 errors
- Updating token dependency to return None when JWT is disabled
- Fixing model deserialization in CrawlResult
- Updating documentation links
- Cleaning up imports

BREAKING CHANGE: Authentication behavior changed to be more permissive when JWT is disabled
2025-03-05 17:14:42 +08:00
UncleCode
9e89d27fcd chore(version): bump version to 0.5.0.post2 2025-03-05 14:18:29 +08:00
UncleCode
b3ec7ce960 Merge branch 'vr0.5.0.post1' into next 2025-03-05 14:17:19 +08:00
UncleCode
baee4949d3 refactor(llm): rename LlmConfig to LLMConfig for consistency
Rename LlmConfig to LLMConfig across the codebase to follow consistent naming conventions.
Update all imports and usages to use the new name.
Update documentation and examples to reflect the change.

BREAKING CHANGE: LlmConfig has been renamed to LLMConfig. Users need to update their imports and usage.
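
Migration for callers is just the rename; a minimal sketch (the package-root import reflects the final 0.5 layout, and the provider/api_token fields are shown for illustration):

    import os
    from crawl4ai import LLMConfig   # formerly LlmConfig

    llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))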
2025-03-05 14:17:04 +08:00
UncleCode
9c58e4ce2e fix(docs): correct section numbering in deepcrawl_example.py tutorial v0.5.0.post1 2025-03-04 20:57:33 +08:00
UncleCode
df6a6d5f4f refactor(docs): reorganize tutorial sections and update wrap-up example 2025-03-04 20:55:09 +08:00
UncleCode
e896c08f9c chore(version): bump version to 0.5.0.post1 2025-03-04 20:29:27 +08:00
UncleCode
56bc3c6e45 refactor(cli): improve CLI default command handling
Make 'crawl' the default command when no command is specified.
This improves user experience by allowing direct URL input without
explicitly specifying the 'crawl' command.

Also removes unnecessary blank lines in example code for better readability.
2025-03-04 20:28:16 +08:00
UncleCode
cbef406f9b docs: update README for version 0.5.0 release with new features and CLI commands 2025-03-04 19:24:46 +08:00
UncleCode
8a76563018 chore(docs): update site version to v0.5.x in mkdocs configuration 2025-03-04 18:30:03 +08:00
UncleCode
415c1c5bee refactor(core): replace float('inf') with math.inf
Replace float('inf') and float('-inf') with math.inf and -math.inf from the math module for better readability and performance. Also clean up imports and remove unused speed comparison code.

No breaking changes.
2025-03-04 18:23:55 +08:00
UncleCode
f334daa979 feat(deep-crawling): add max_pages and score_threshold parameters for improved crawling control 2025-03-03 21:54:58 +08:00
UncleCode
d024749633 refactor(deep-crawl): add max_pages limit and improve crawl control
Add max_pages parameter to all deep crawling strategies to limit total pages crawled.
Add score_threshold parameter to BFS/DFS strategies for quality control.
Remove legacy parameter handling in AsyncWebCrawler.
Improve error handling and logging in crawl strategies.

BREAKING CHANGE: Removed support for legacy parameters in AsyncWebCrawler.run_many()
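
A minimal sketch of the new controls on a BFS strategy (the class name and import path are assumptions; only the parameter names come from this message):

    from crawl4ai.deep_crawling import BFSDeepCrawlStrategy  # class/module names assumed

    strategy = BFSDeepCrawlStrategy(
        max_depth=2,
        max_pages=100,        # stop after 100 pages in total
        score_threshold=0.3,  # skip URLs scored below this value
    )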
2025-03-03 21:51:11 +08:00
UncleCode
c612f9a852 feat(profiles): add CLI command for crawling with browser profiles
Adds new functionality to crawl websites using saved browser profiles directly from the CLI.
This includes:
- New CLI option to use profiles for crawling
- Helper functions for profile-based crawling
- Fixed type hints for config parameters
- Updated example to show browser window by default

This makes it easier for users to leverage saved browser profiles for crawling without writing code.
2025-03-02 21:33:33 +08:00
UncleCode
95175cb394 feat(cli): add browser profile management functionality
Adds new interactive browser profile management system that allows users to:
- Create and manage browser profiles for authenticated crawling
- List existing profiles with detailed information
- Delete unused profiles
- Use profiles during crawling with the new -p/--profile flag

Also restructures CLI to use Click groups and adds humanize dependency for better size formatting.
2025-03-02 20:54:45 +08:00
UncleCode
cba4a466e5 feat(browser): add BrowserProfiler class for identity-based browsing
Adds a new BrowserProfiler class that provides comprehensive management of browser profiles for identity-based crawling. Features include:
- Interactive profile creation and management
- Profile listing, retrieval, and deletion
- Guided console interface
- Migration of profile management from ManagedBrowser
- New example script for identity-based browsing

ALSO:
- Updates logging format in AsyncWebCrawler
- Removes content filter from hello_world example
- Relaxes httpx version constraint

BREAKING CHANGE: Profile management methods from ManagedBrowser are now deprecated and delegate to BrowserProfiler
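
A rough sketch of driving the profiler programmatically (method names are inferred from the feature list above and should be treated as assumptions):

    import asyncio
    from crawl4ai import BrowserProfiler  # export location assumed

    async def main():
        profiler = BrowserProfiler()
        path = await profiler.create_profile(profile_name="my-login")  # interactive creation; signature assumed
        print(path, profiler.list_profiles())                          # listing; method name assumed

    asyncio.run(main())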
2025-03-02 20:32:29 +08:00
Aravind
a9e24307cc
Release prep (#749)
* fix: Update export of URLPatternFilter

* chore: Add dependency for cchardet in requirements

* docs: Update example for deep crawl in release note for v0.5

* Docs: update the example for memory dispatcher

* docs: updated example for crawl strategies

* Refactor: Removed wrapping in the if __name__ == '__main__' block since this is a markdown file.

* chore: removed cchardet from dependency list, since unclecode is planning to remove it

* docs: updated the example for proxy rotation to a working example

* feat: Introduced ProxyConfig param

* Add tutorial for deep crawl & update contributor list for bug fixes in feb alpha-1

* chore: update and test new dependencies

* feat: Make PyPDF2 a conditional dependency

* updated tutorial and release note for v0.5

* docs: update docs for deep crawl, and fix a typo in docker-deployment markdown filename

* refactor: 1. Deprecate markdown_v2. 2. Make markdown backward compatible to behave as a string when needed. 3. Fix LlmConfig usage in CLI. 4. Deprecate markdown_v2 in CLI. 5. Update AsyncWebCrawler for changes in CrawlResult.

* fix: Bug in serialisation of markdown in acache_url

* Refactor: Added deprecation errors for accessing fit_html and fit_markdown directly on the result; access them via markdown instead

* fix: remove deprecated markdown_v2 from docker

* Refactor: remove deprecated fit_markdown and fit_html from result

* refactor: fix cache retrieval for markdown as a string

* chore: update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown
2025-02-28 19:53:35 +08:00
UncleCode
3a87b4e43b fix(dependencies): update cchardet to faust-cchardet for compatibility 2025-02-26 18:25:58 +08:00
UncleCode
4bcd4cbda1 refactor(pdf): improve PDF processor dependency handling
Make PyPDF2 an optional dependency and improve import handling in PDF processor.
Move imports inside methods to allow for lazy loading and better error handling.
Add new 'pdf' optional dependency group in pyproject.toml.
Clean up unused imports and remove deprecated files.

BREAKING CHANGE: PyPDF2 is now an optional dependency. Users need to install with 'pip install crawl4ai[pdf]' to use PDF processing features.
2025-02-25 22:27:55 +08:00
UncleCode
71ce01c9e1 feat(browser): add cdp_url parameter to BrowserManager initialization 2025-02-24 14:48:02 +08:00
UncleCode
c6d48080a4 feat(logger): add abstract logger base class and file logger implementation
Add AsyncLoggerBase abstract class to standardize logger interface and introduce AsyncFileLogger for file-only logging. Remove deprecated always_bypass_cache parameter and clean up AsyncWebCrawler initialization.

BREAKING CHANGE: Removed deprecated 'always_by_pass_cache' parameter. Use BrowserConfig cache settings instead.
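
A rough sketch of wiring the file-only logger into a crawler (the constructor argument and the logger= keyword are assumptions):

    from crawl4ai import AsyncWebCrawler
    from crawl4ai.async_logger import AsyncFileLogger  # import location assumed

    logger = AsyncFileLogger("./crawl4ai.log")   # writes to file only; argument name assumed
    crawler = AsyncWebCrawler(logger=logger)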
2025-02-23 21:23:41 +08:00
UncleCode
46d2f12851 chore: remove old Dockerfile and server script 2025-02-22 13:45:04 +08:00
UncleCode
367cd71db9 feat(core): release version 0.5.0 with deep crawling and CLI
This major release adds deep crawling capabilities, memory-adaptive dispatcher,
multiple crawling strategies, Docker deployment, and a new CLI. It also includes
significant improvements to proxy handling, PDF processing, and LLM integration.

BREAKING CHANGES:
- Add memory-adaptive dispatcher as default for arun_many()
- Move max_depth to CrawlerRunConfig
- Replace ScrapingMode enum with strategy pattern
- Update BrowserContext API
- Make model fields optional with defaults
- Remove content_filter parameter from CrawlerRunConfig
- Remove synchronous WebCrawler and old CLI
- Update Docker deployment configuration
- Replace FastFilterChain with FilterChain
- Change license to Apache 2.0 with attribution clause
2025-02-21 19:55:02 +08:00
Aravind
2af958e12c
Feat/llm config (#724)
* feature: Add LlmConfig to easily configure and pass LLM configs to different strategies

* pulled in next branch and resolved conflicts

* feat: Add Gemini and DeepSeek providers. Make ignore_cache in the LLM content filter default to true to avoid confusion

* Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params

* updated tests, docs and readme
2025-02-21 15:41:37 +08:00
UncleCode
3cb28875c3 refactor(config): enhance serialization and config handling
- Add ignore_default_value option to to_serializable_dict
- Add viewport dict support in BrowserConfig
- Replace FastFilterChain with FilterChain
- Add deprecation warnings for unwanted properties
- Clean up unused imports
- Rename example files for consistency
- Add comprehensive Docker configuration tutorial

BREAKING CHANGE: FastFilterChain has been replaced with FilterChain
2025-02-19 17:23:25 +08:00
Aravind
dad592c801
2025 feb alpha 1 (#685)
* spelling change in prompt

* gpt-4o-mini support

* Remove leading Y before here

* prompt spell correction

* (Docs) Fix numbered list end-of-line formatting

Added the missing "two spaces" to add a line break

* fix: access downloads_path through browser_config in _handle_download method - Fixes #585

* crawl

* fix: https://github.com/unclecode/crawl4ai/issues/592

* fix: https://github.com/unclecode/crawl4ai/issues/583

* Docs update: https://github.com/unclecode/crawl4ai/issues/649

* fix: https://github.com/unclecode/crawl4ai/issues/570

* Docs: updated example for content-selection to reflect new changes in yc newsfeed css

* Refactor: Removed old filters and replaced with optimised filters

* fix: Fixed imports as per the new names of filters

* Tests: For deep crawl filters

* Refactor: Remove old scorers and replace with optimised ones; fix imports for all filters and scorers.

* fix: await filters that are async in nature, e.g. content relevance and SEO filters

* fix: https://github.com/unclecode/crawl4ai/issues/592

* fix: https://github.com/unclecode/crawl4ai/issues/715

---------

Co-authored-by: DarshanTank <darshan.tank@gnani.ai>
Co-authored-by: Tuhin Mallick <tuhin.mllk@gmail.com>
Co-authored-by: Serhat Soydan <ssoydan@gmail.com>
Co-authored-by: cardit1 <maneesh@cardit.in>
Co-authored-by: Tautik Agrahari <tautikagrahari@gmail.com>
2025-02-19 14:13:17 +08:00
UncleCode
c171891999 Merge branch 'main' into next
# Conflicts:
#	.gitignore
2025-02-19 13:26:42 +08:00
UncleCode
3b1025abbb Merge branch 'main' of https://github.com/unclecode/crawl4ai 2025-02-19 13:24:26 +08:00
UncleCode
f00dcc276f Update README.md (#562) 2025-02-19 13:24:04 +08:00
UncleCode
392c923980 feat(docker): add JWT authentication and improve server architecture
Add JWT token-based authentication to Docker server and client.
Refactor server architecture for better code organization and error handling.
Move Dockerfile to root deploy directory and update configuration.
Add comprehensive documentation and examples.

BREAKING CHANGE: Docker server now requires authentication by default.
Endpoints require JWT tokens when security.jwt_enabled is true in config.
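
Client calls then carry a standard JWT bearer header; a generic sketch against a hypothetical endpoint (the URL, port, and payload shape are illustrative, not taken from this message):

    import requests

    token = "<JWT issued by the server>"
    resp = requests.post(
        "http://localhost:11235/crawl",                # hypothetical endpoint
        headers={"Authorization": f"Bearer {token}"},  # standard bearer-token header
        json={"urls": ["https://example.com"]},        # payload shape assumed
    )
    resp.raise_for_status()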
2025-02-18 22:07:13 +08:00
UncleCode
2864015469 feat(docker): implement supervisor and secure API endpoints
Add supervisor configuration for managing Redis and Gunicorn processes
Replace direct process management with supervisord
Add secure and token-free API server variants
Implement JWT authentication for protected endpoints
Update datetime handling in async dispatcher
Add email domain verification

BREAKING CHANGE: Server startup now uses supervisord instead of direct process management
2025-02-17 20:31:20 +08:00
UncleCode
8bb799068e feat(crawler): add HTTP crawler strategy for lightweight web scraping
Implements a new AsyncHTTPCrawlerStrategy class that provides a fast, memory-efficient alternative to browser-based crawling. Features include:
- Support for HTTP/HTTPS requests with configurable methods, headers, and timeouts
- File and raw content handling capabilities
- Streaming response processing for large files
- Customizable request/response hooks
- Comprehensive error handling

Also refactors browser management code into separate module for better organization.
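
A rough sketch of opting into the HTTP-only strategy (the import path and the crawler_strategy= keyword are assumptions):

    import asyncio
    from crawl4ai import AsyncWebCrawler
    from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy  # import path assumed

    async def main():
        # no browser is launched; plain HTTP(S) requests only
        async with AsyncWebCrawler(crawler_strategy=AsyncHTTPCrawlerStrategy()) as crawler:
            result = await crawler.arun("https://example.com")
            print(result.status_code)

    asyncio.run(main())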
2025-02-15 19:26:30 +08:00
UncleCode
063df572b0 docs(examples): add SERP API project example
Add comprehensive example demonstrating Google Search Results Page (SERP) API implementation using crawl4ai. The example includes:
- Basic web crawling setup
- LLM-based extraction
- Schema generation
- Golden standard implementation
- CrawlerHub usage

The example serves as a reference for implementing SERP API functionality with various extraction strategies.
2025-02-14 23:06:16 +08:00
UncleCode
966fb47e64 feat(config): enhance serialization and add deep crawling exports
Improve configuration serialization with better handling of frozensets and slots.
Expand deep crawling module exports and documentation.
Add comprehensive API usage examples in Docker README.

- Add support for frozenset serialization
- Improve error handling in config loading
- Export additional deep crawling components
- Enhance Docker API documentation with detailed examples
- Fix ContentTypeFilter initialization
2025-02-13 21:45:19 +08:00
UncleCode
43e09da694 refactor(crawler): remove content filter functionality
Remove content filter related code and parameters as part of simplifying the crawler configuration. This includes:
- Removing ContentFilter import and related classes
- Removing content_filter parameter from CrawlerRunConfig
- Cleaning up LLMExtractionStrategy constructor parameters

BREAKING CHANGE: Removed content_filter parameter from CrawlerRunConfig. Users should migrate to using extraction strategies for content filtering.
2025-02-12 21:59:19 +08:00
UncleCode
69705df0b3 fix(install): ensure proper exit after running doctor command 2025-02-11 19:48:23 +08:00
UncleCode
91a5fea11f feat(cli): add command line interface with comprehensive features
Implements a full-featured CLI for Crawl4AI with the following capabilities:
- Basic and advanced web crawling
- Configuration management via YAML/JSON files
- Multiple extraction strategies (CSS, XPath, LLM)
- Content filtering and optimization
- Interactive Q&A capabilities
- Various output formats
- Comprehensive documentation and examples

Also includes:
- Home directory setup for configuration and cache
- Environment variable support for API tokens
- Test suite for CLI functionality
2025-02-10 16:58:52 +08:00
UncleCode
467be9ac76 feat(deep-crawling): add DFS strategy and update exports; refactor CLI entry point 2025-02-09 20:23:40 +08:00
UncleCode
19df96ed56 feat(proxy): add proxy rotation strategy
Implements a new proxy rotation system with the following changes:
- Add ProxyRotationStrategy abstract base class
- Add RoundRobinProxyStrategy concrete implementation
- Integrate proxy rotation with AsyncWebCrawler
- Add proxy_rotation_strategy parameter to CrawlerRunConfig
- Add example script demonstrating proxy rotation usage
- Remove deprecated synchronous WebCrawler code
- Clean up rate limiting documentation

BREAKING CHANGE: Removed synchronous WebCrawler support and related rate limiting configurations
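
A minimal sketch of round-robin rotation (the import location and the shape of the proxy list are assumptions; the class and config parameter names come from this message):

    from crawl4ai import CrawlerRunConfig
    from crawl4ai.proxy_strategy import RoundRobinProxyStrategy  # import location assumed

    # constructor argument and entry shape assumed
    proxies = [{"server": "http://proxy1:8080"}, {"server": "http://proxy2:8080"}]
    config = CrawlerRunConfig(proxy_rotation_strategy=RoundRobinProxyStrategy(proxies))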
2025-02-09 18:49:10 +08:00
UncleCode
b957ff2ecd refactor(crawler): improve HTML handling and cleanup codebase
- Add HTML attribute preservation in GoogleSearchCrawler
- Fix lxml import references in utils.py
- Remove unused ssl_certificate.json
- Clean up imports and code organization in hub.py
- Update test case formatting and remove unused image search test

BREAKING CHANGE: Removed ssl_certificate.json file which might affect existing certificate validations
2025-02-07 21:56:27 +08:00
UncleCode
91073c1244 refactor(crawling): improve type hints and code cleanup
- Added proper return type hints for DeepCrawlStrategy.arun method
- Added __call__ method to DeepCrawlStrategy for easier usage
- Removed redundant comments and imports
- Cleaned up type hints in DFS strategy
- Removed empty docker_client.py and .continuerules
- Added .private/ to gitignore

BREAKING CHANGE: DeepCrawlStrategy.arun now returns Union[CrawlResultT, List[CrawlResultT], AsyncGenerator[CrawlResultT, None]]
2025-02-07 19:01:59 +08:00
Sezer Bozkır
926beee832
base-config structure is changed (#618)
refactor(docker): restructure docker-compose for modular configuration

- Added reusable base configuration block (x-base-config) for ports, environment variables, volumes, deployment resources, restart policy, and health check.
- Updated services to include base configuration directly using `<<: *base-config` syntax.
- Removed redundant `base-config` service definition.
2025-02-07 17:11:51 +08:00
UncleCode
a9415aaaf6 refactor(deep-crawling): reorganize deep crawling strategies and add new implementations
Split deep crawling code into separate strategy files for better organization and maintainability. Added new BFF (Best First) and DFS crawling strategies. Introduced base strategy class and common types.

BREAKING CHANGE: Deep crawling implementation has been split into multiple files. Import paths for deep crawling strategies have changed.
2025-02-05 22:50:39 +08:00
UncleCode
c308a794e8 refactor(deep-crawl): reorganize deep crawling functionality into dedicated module
Restructure deep crawling code into a dedicated module with improved organization:
- Move deep crawl logic from async_deep_crawl.py to deep_crawling/
- Create separate files for BFS strategy, filters, and scorers
- Improve code organization and maintainability
- Add optimized implementations for URL filtering and scoring
- Rename DeepCrawlHandler to DeepCrawlDecorator for clarity

BREAKING CHANGE: DeepCrawlStrategy and BreadthFirstSearchStrategy imports need to be updated to new package structure
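
Downstream imports move to the new package; a sketch of the likely form (re-export at the package level is an assumption, the old module name comes from this message):

    # before: from crawl4ai.async_deep_crawl import BreadthFirstSearchStrategy
    from crawl4ai.deep_crawling import BreadthFirstSearchStrategy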
2025-02-04 23:28:17 +08:00