crawl4ai

mirror of https://github.com/unclecode/crawl4ai.git synced 2026-01-01 04:44:45 +00:00

Author	SHA1	Message	Date
UncleCode	415c1c5bee	refactor(core): replace float('inf') with math.inf Replace float('inf') and float('-inf') with math.inf and -math.inf from the math module for better readability and performance. Also clean up imports and remove unused speed comparison code. No breaking changes.	2025-03-04 18:23:55 +08:00
UncleCode	d024749633	refactor(deep-crawl): add max_pages limit and improve crawl control Add max_pages parameter to all deep crawling strategies to limit total pages crawled. Add score_threshold parameter to BFS/DFS strategies for quality control. Remove legacy parameter handling in AsyncWebCrawler. Improve error handling and logging in crawl strategies. BREAKING CHANGE: Removed support for legacy parameters in AsyncWebCrawler.run_many()	2025-03-03 21:51:11 +08:00
UncleCode	c612f9a852	feat(profiles): add CLI command for crawling with browser profiles Adds new functionality to crawl websites using saved browser profiles directly from the CLI. This includes: - New CLI option to use profiles for crawling - Helper functions for profile-based crawling - Fixed type hints for config parameters - Updated example to show browser window by default This makes it easier for users to leverage saved browser profiles for crawling without writing code.	2025-03-02 21:33:33 +08:00
UncleCode	cba4a466e5	feat(browser): add BrowserProfiler class for identity-based browsing Adds a new BrowserProfiler class that provides comprehensive management of browser profiles for identity-based crawling. Features include: - Interactive profile creation and management - Profile listing, retrieval, and deletion - Guided console interface - Migration of profile management from ManagedBrowser - New example script for identity-based browsing ALSO: - Updates logging format in AsyncWebCrawler - Removes content filter from hello_world example - Relaxes httpx version constraint BREAKING CHANGE: Profile management methods from ManagedBrowser are now deprecated and delegate to BrowserProfiler	2025-03-02 20:32:29 +08:00
Aravind	a9e24307cc	Release prep (#749) * fix: Update export of URLPatternFilter * chore: Add dependency for cchardet in requirements * docs: Update example for deep crawl in release note for v0.5 * Docs: update the example for memory dispatcher * docs: updated example for crawl strategies * Refactor: Removed wrapping in if __name__==main block since this is a markdown file. * chore: removed cchardet from dependency list, since unclecode is planning to remove it * docs: updated the example for proxy rotation to a working example * feat: Introduced ProxyConfig param * Add tutorial for deep crawl & update contributor list for bug fixes in feb alpha-1 * chore: update and test new dependencies * feat: Make PyPDF2 a conditional dependency * updated tutorial and release note for v0.5 * docs: update docs for deep crawl, and fix a typo in docker-deployment markdown filename * refactor: 1. Deprecate markdown_v2 2. Make markdown backward compatible to behave as a string when needed. 3. Fix LlmConfig usage in cli 4. Deprecate markdown_v2 in cli 5. Update AsyncWebCrawler for changes in CrawlResult * fix: Bug in serialisation of markdown in acache_url * Refactor: Added deprecation errors for fit_html and fit_markdown directly on markdown. Now access them via markdown * fix: remove deprecated markdown_v2 from docker * Refactor: remove deprecated fit_markdown and fit_html from result * refactor: fix cache retrieval for markdown as a string * chore: update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown	2025-02-28 19:53:35 +08:00
UncleCode	4bcd4cbda1	refactor(pdf): improve PDF processor dependency handling Make PyPDF2 an optional dependency and improve import handling in PDF processor. Move imports inside methods to allow for lazy loading and better error handling. Add new 'pdf' optional dependency group in pyproject.toml. Clean up unused imports and remove deprecated files. BREAKING CHANGE: PyPDF2 is now an optional dependency. Users need to install with 'pip install crawl4ai[pdf]' to use PDF processing features.	2025-02-25 22:27:55 +08:00
UncleCode	c6d48080a4	feat(logger): add abstract logger base class and file logger implementation Add AsyncLoggerBase abstract class to standardize logger interface and introduce AsyncFileLogger for file-only logging. Remove deprecated always_bypass_cache parameter and clean up AsyncWebCrawler initialization. BREAKING CHANGE: Removed deprecated 'always_by_pass_cache' parameter. Use BrowserConfig cache settings instead.	2025-02-23 21:23:41 +08:00
Aravind	2af958e12c	Feat/llm config (#724) * feature: Add LlmConfig to easily configure and pass LLM configs to different strategies * pulled in next branch and resolved conflicts * feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter true by default to avoid confusion * Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params * updated tests, docs and readme	2025-02-21 15:41:37 +08:00
UncleCode	3cb28875c3	refactor(config): enhance serialization and config handling - Add ignore_default_value option to to_serializable_dict - Add viewport dict support in BrowserConfig - Replace FastFilterChain with FilterChain - Add deprecation warnings for unwanted properties - Clean up unused imports - Rename example files for consistency - Add comprehensive Docker configuration tutorial BREAKING CHANGE: FastFilterChain has been replaced with FilterChain	2025-02-19 17:23:25 +08:00
UncleCode	392c923980	feat(docker): add JWT authentication and improve server architecture Add JWT token-based authentication to Docker server and client. Refactor server architecture for better code organization and error handling. Move Dockerfile to root deploy directory and update configuration. Add comprehensive documentation and examples. BREAKING CHANGE: Docker server now requires authentication by default. Endpoints require JWT tokens when security.jwt_enabled is true in config.	2025-02-18 22:07:13 +08:00
UncleCode	063df572b0	docs(examples): add SERP API project example Add comprehensive example demonstrating Google Search Results Page (SERP) API implementation using crawl4ai. The example includes: - Basic web crawling setup - LLM-based extraction - Schema generation - Golden standard implementation - CrawlerHub usage The example serves as a reference for implementing SERP API functionality with various extraction strategies.	2025-02-14 23:06:16 +08:00
UncleCode	91a5fea11f	feat(cli): add command line interface with comprehensive features Implements a full-featured CLI for Crawl4AI with the following capabilities: - Basic and advanced web crawling - Configuration management via YAML/JSON files - Multiple extraction strategies (CSS, XPath, LLM) - Content filtering and optimization - Interactive Q&A capabilities - Various output formats - Comprehensive documentation and examples Also includes: - Home directory setup for configuration and cache - Environment variable support for API tokens - Test suite for CLI functionality	2025-02-10 16:58:52 +08:00
UncleCode	19df96ed56	feat(proxy): add proxy rotation strategy Implements a new proxy rotation system with the following changes: - Add ProxyRotationStrategy abstract base class - Add RoundRobinProxyStrategy concrete implementation - Integrate proxy rotation with AsyncWebCrawler - Add proxy_rotation_strategy parameter to CrawlerRunConfig - Add example script demonstrating proxy rotation usage - Remove deprecated synchronous WebCrawler code - Clean up rate limiting documentation BREAKING CHANGE: Removed synchronous WebCrawler support and related rate limiting configurations	2025-02-09 18:49:10 +08:00
UncleCode	33a21d6a7a	refactor(docker): improve server architecture and configuration Complete overhaul of Docker deployment setup with improved architecture: - Add Redis integration for task management - Implement rate limiting and security middleware - Add Prometheus metrics and health checks - Improve error handling and logging - Add support for streaming responses - Implement proper configuration management - Add platform-specific optimizations for ARM64/AMD64 BREAKING CHANGE: Docker deployment now requires Redis and new config.yml structure	2025-02-02 20:19:51 +08:00
UncleCode	09ac7ed008	feat(demo): uncomment feature demos and add fake-useragent dependency Uncomments demonstration code for memory dispatcher, streaming support, content scraping, JSON schema generation, LLM markdown, and robots compliance in the v0.4.3b2 features demo file. Also adds fake-useragent package as a project dependency. This change makes all feature demonstrations active by default and ensures proper user agent handling capabilities.	2025-01-25 21:56:08 +08:00
UncleCode	97796f39d2	docs(examples): update proxy rotation demo and disable other demos Modify proxy rotation example to include empty user agent setting and comment out other demo functions for focused testing. This change simplifies the demo file to focus specifically on proxy rotation functionality. No breaking changes.	2025-01-25 21:52:35 +08:00
UncleCode	65d33bcc0f	style(docs): improve code formatting in features demo Clean up whitespace and improve readability in v0_4_3b2_features_demo.py: - Remove excessive blank lines between functions - Improve config formatting for better readability - Uncomment memory dispatcher demo in main function No breaking changes.	2025-01-23 22:36:58 +08:00
UncleCode	6a01008a2b	docs(multi-url): improve documentation clarity and update examples - Restructure multi-URL crawling documentation with better formatting and examples - Update code examples to use new API syntax (arun_many) - Add detailed parameter explanations for RateLimiter and Dispatchers - Enhance CSS styling for better documentation readability - Fix outdated method calls in feature demo script BREAKING CHANGE: Updated dispatcher.run_urls() to crawler.arun_many() in examples	2025-01-23 22:33:36 +08:00
UncleCode	260b9120c3	docs(examples): update v0.4.3 features demo to v0.4.3b2 Rename and replace the features demo file to reflect the beta 2 version number. The old v0.4.3 demo file is removed and replaced with a new beta 2 version. Renames: - docs/examples/v0_4_3_features_demo.py -> docs/examples/v0_4_3b2_features_demo.py	2025-01-22 20:41:43 +08:00
UncleCode	976ea52167	docs(examples): update demo scripts and fix output formats Update example scripts to reflect latest API changes and improve demonstrations: - Increase test URLs in dispatcher example from 20 to 40 pages - Comment out unused dispatcher strategies for cleaner output - Fix scraping strategies performance script to use correct object notation - Update v0_4_3_features_demo with additional feature mentions and uncomment demo sections These changes make the examples more current and better aligned with the actual API.	2025-01-22 20:40:03 +08:00
UncleCode	2d69bf2366	refactor(models): rename final_url to redirected_url for consistency Renames the final_url field to redirected_url across all components to maintain consistent terminology throughout the codebase. This change affects: - AsyncCrawlResponse model - AsyncPlaywrightCrawlerStrategy - Documentation and examples No functional changes, purely naming consistency improvement.	2025-01-22 17:14:24 +08:00
UncleCode	dee5fe9851	feat(proxy): add proxy rotation support and documentation Implements dynamic proxy rotation functionality with authentication support and IP verification. Updates include: - Added proxy rotation demo in features example - Updated proxy configuration handling in BrowserManager - Added proxy rotation documentation - Updated README with new proxy rotation feature - Bumped version to 0.4.3b2 This change enables users to dynamically switch between proxies and verify IP addresses for each request.	2025-01-22 16:11:01 +08:00
UncleCode	16b8d4945b	feat(release): prepare v0.4.3 beta release Prepare the v0.4.3 beta release with major feature additions and improvements: - Add JsonXPathExtractionStrategy and LLMContentFilter to exports - Update version to 0.4.3b1 - Improve documentation for dispatchers and markdown generation - Update development status to Beta - Reorganize changelog format BREAKING CHANGE: Memory threshold in MemoryAdaptiveDispatcher increased to 90% and SemaphoreDispatcher parameter renamed to max_session_permit	2025-01-21 21:03:11 +08:00
UncleCode	1221be30a3	feat(browser): improve browser context management and add shared data support Add shared_data parameter to CrawlerRunConfig to allow data sharing between hooks. Implement browser context reuse based on config signatures to improve memory usage. Fix Firefox/Webkit channel settings. Add config parameter to hook callbacks for better context access. Remove debug print statements. BREAKING CHANGE: Hook callback signatures now include config parameter	2025-01-19 17:12:03 +08:00
UncleCode	8ec12d7d68	Apply Ruff Corrections	2025-01-13 19:19:58 +08:00
UncleCode	825c78a048	refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring Reorganize dispatcher functionality into separate components: - Create dedicated dispatcher classes (MemoryAdaptive, Semaphore) - Add RateLimiter for smart request throttling - Implement CrawlerMonitor for real-time progress tracking - Move dispatcher config from CrawlerRunConfig to separate classes BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.	2025-01-11 21:10:27 +08:00
UncleCode	3865342c93	Merge branch 'next' into next-cdp	2025-01-10 16:01:49 +08:00
UncleCode	ac5f461d40	feat(crawler): add memory-adaptive dispatcher with rate limiting Implements a new MemoryAdaptiveDispatcher class to manage concurrent crawling operations with memory monitoring and rate limiting capabilities. Changes include: - Added RateLimitConfig dataclass for configuring rate limiting behavior - Extended CrawlerRunConfig with dispatcher-related settings - Refactored arun_many to use the new dispatcher system - Added memory threshold and session permit controls - Integrated optional progress monitoring display BREAKING CHANGE: The arun_many method now uses MemoryAdaptiveDispatcher by default, which may affect concurrent crawling behavior	2025-01-10 16:01:18 +08:00
UncleCode	e8b4ac6046	docs(urls): update documentation URLs to new domain Update all documentation URLs from crawl4ai.com/mkdocs to docs.crawl4ai.com Improve badges styling and layout in documentation Increase code font size in documentation CSS BREAKING CHANGE: Documentation URLs have changed from crawl4ai.com/mkdocs to docs.crawl4ai.com	2025-01-09 16:22:41 +08:00
UncleCode	ca3e33122e	refactor(docs): reorganize documentation structure and update styles Reorganize documentation into core/advanced/extraction sections for better navigation. Update terminal theme styles and add rich library for better CLI output. Remove redundant tutorial files and consolidate content into core sections. Add personal story to index page for project context. BREAKING CHANGE: Documentation structure has been significantly reorganized	2025-01-07 20:49:50 +08:00
UncleCode	24b3da717a	refactor(): - Update hello world example	2025-01-02 17:53:30 +08:00
UncleCode	98acc4254d	refactor: - Update hello_world.py example	2025-01-01 19:47:22 +08:00
UncleCode	aa4f92f458	refactor(crawler): - Update hello_world example with proper content filtering	2025-01-01 19:39:42 +08:00
UncleCode	67f65f958b	refactor(build): simplify setup.py configuration - Remove dependency management from setup.py - Remove entry points configuration (moved to pyproject.toml) - Keep minimal setup.py for backwards compatibility - Clean up package metadata structure	2025-01-01 15:52:01 +08:00
UncleCode	bd66befcf0	Fix issue in 0.4.24 walkthrough	2024-12-31 21:07:58 +08:00
UncleCode	19b0a5ae82	Update 0.4.24 walkthrough	2024-12-31 21:01:46 +08:00
UncleCode	bd71f7f4ea	Add 0.4.24 walkthrough	2024-12-31 20:22:33 +08:00
UncleCode	67d0999bc3	chore: resolve merge conflicts for v0.4.24	2024-12-31 19:24:03 +08:00
UncleCode	fb33a24891	- Added examples for Amazon product data extraction methods - Updated configuration options and enhanced documentation - Minor refactoring for improved performance and readability - Cleaned up version control settings.	2024-12-29 20:05:18 +08:00
UncleCode	f2d9912697	Renames browser_config param to config in AsyncWebCrawler Standardizes parameter naming convention across the codebase by renaming browser_config to the more concise config in AsyncWebCrawler constructor. Updates all documentation examples and internal usages to reflect the new parameter name for consistency. Also improves hook execution by adding url/response parameters to goto hooks and fixes parameter ordering in before_return_html hook.	2024-12-26 16:34:36 +08:00
UncleCode	9a4ed6bbd7	Enhance crawler capabilities and documentation - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation management to streamline user experience.	2024-12-26 15:17:07 +08:00
UncleCode	d5ed451299	Enhance crawler capabilities and documentation - Add llm.txt generator - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation.	2024-12-25 21:34:31 +08:00
Haopeng138	bacbeb3ed4	Fix #340 example llm_extraction (#358 ) @Haopeng138 Thank you so much. They are still part of the library. I forgot to update them since I moved the asynchronous versions years ago. I really appreciate it. I have to say that I feel weak in the documentation. That's why I spent a lot of time on it last week. Now, when you mention some of the things in the example folder, I realize I forgot about the example folder. I'll try to update it more. If you find anything else, please help and support. Thank you. I will add your name to contributor name as well.	2024-12-24 19:56:07 +08:00
UncleCode	e9e5b5642d	Fix js_snippet issue in 0.4.21, bump to 0.4.22	2024-12-15 19:49:30 +08:00
UncleCode	7524aa7b5e	Feature: Add Markdown generation to CrawlerRunConfig - Added markdown generator parameter to CrawlerRunConfig in `async_configs.py`. - Implemented logic for Markdown generation in content scraping in `async_webcrawler.py`. - Updated version number to 0.4.21 in `__version__.py`.	2024-12-13 21:51:38 +08:00
UncleCode	0982c639ae	Enhance AsyncWebCrawler and related configurations - Introduced new configuration classes: BrowserConfig and CrawlerRunConfig. - Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management. - Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters. - Improved error handling with detailed context extraction during exceptions. - Enhanced overall maintainability and usability of the web crawler.	2024-12-12 19:35:09 +08:00
UncleCode	5188b7a6a0	Add full-page screenshot and PDF export features - Introduced a new approach for capturing full-page screenshots by exporting them as PDFs first, enhancing reliability and performance. - Added documentation for the feature in `docs/examples/full_page_screenshot_and_pdf_export.md`. - Refactored `perform_completion_with_backoff` in `crawl4ai/utils.py` to include necessary extra parameters. - Updated `quickstart_async.py` to utilize LLM extraction with refined arguments.	2024-12-10 20:59:31 +08:00
UncleCode	5431fa2d0c	Add PDF & screenshot functionality, new tutorial - Added support for exporting pages as PDFs - Enhanced screenshot functionality for long pages - Created a tutorial on dynamic content loading with 'Load More' buttons. - Updated web crawler to handle PDF data in responses.	2024-12-10 20:10:39 +08:00
UncleCode	e130fd8db9	Implement new async crawler features and stability updates - Introduced new async crawl strategy with session management. - Added BrowserManager for improved browser management. - Enhanced documentation, focusing on storage state and usage examples. - Improved error handling and logging for sessions. - Added JavaScript snippets for customizing navigator properties.	2024-12-10 17:55:29 +08:00
UncleCode	c51e901f68	feat: Enhance AsyncPlaywrightCrawlerStrategy with text-only and light modes, dynamic viewport adjustment, and session management ### New Features: - Text-Only Mode: Added support for text-only crawling by disabling images, JavaScript, GPU, and other non-essential features. - Light Mode: Optimized browser settings to reduce resource usage and improve efficiency during crawling. - Dynamic Viewport Adjustment: Automatically adjusts viewport dimensions based on content size, ensuring accurate rendering and scaling. - Full Page Scanning: Introduced a feature to scroll and capture dynamic content for pages with infinite scroll or lazy-loading elements. - Session Management: Added `create_session` method for creating and managing browser sessions with unique IDs. ### Improvements: - Unified viewport handling across contexts by dynamically setting dimensions using `self.viewport_width` and `self.viewport_height`. - Enhanced logging and error handling for viewport adjustments, page scanning, and content evaluation. - Reduced resource usage with additional browser flags for both `light_mode` and `text_only` configurations. - Improved handling of cookies, headers, and proxies in session creation. ### Refactoring: - Removed hardcoded viewport dimensions and replaced them with dynamic configurations. - Cleaned up unused and commented-out code for better readability and maintainability. - Introduced defaults for frequently used parameters like `delay_before_return_html`. ### Fixes: - Resolved potential inconsistencies in viewport handling. - Improved robustness of content loading and dynamic adjustments to avoid failures and timeouts. ### Docs Update: - Updated schema usage in `quickstart_async.py` example: - Changed `OpenAIModelFee.schema()` to `OpenAIModelFee.model_json_schema()` for compatibility. - Enhanced LLM extraction instruction documentation. 
This commit introduces significant enhancements to improve efficiency, flexibility, and reliability of the crawler strategy.	2024-12-08 20:04:44 +08:00
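
Note on commit 415c1c5bee: the message describes replacing float('inf') and float('-inf') with math.inf and -math.inf. A minimal sketch of that pattern follows; the function name and the sample scores are hypothetical and are not taken from the crawl4ai source.

```python
import math


def best_score(scores):
    """Return the highest score, or -math.inf when there are none.

    Hypothetical helper for illustration only.
    """
    best = -math.inf  # sentinel lower bound, no string parsing as in float('-inf')
    for score in scores:
        if score > best:
            best = score
    return best


# math.inf and float('inf') are the same IEEE 754 value, so the swap
# described in the commit does not change comparison behaviour.
assert math.inf == float("inf")
assert -math.inf == float("-inf")
```

Because the two spellings compare equal, the change is a readability refactor, which matches the "No breaking changes" note in the commit message.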

113 Commits
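
Note on commit 19df96ed56: the message mentions a round-robin proxy rotation strategy. The general technique can be sketched as below. The class name, method name, and proxy URLs are hypothetical, and this is not the interface of the library's RoundRobinProxyStrategy, which the log does not show.

```python
from itertools import cycle
from typing import Iterator, List


class SimpleRoundRobin:
    """Hypothetical stand-in that hands out proxies in a fixed rotation."""

    def __init__(self, proxies: List[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        # cycle() repeats the list endlessly in its original order.
        self._pool: Iterator[str] = cycle(proxies)

    def next_proxy(self) -> str:
        """Return the next proxy, wrapping around after the last one."""
        return next(self._pool)


rotation = SimpleRoundRobin(["http://proxy-a:8080", "http://proxy-b:8080"])
first_three = [rotation.next_proxy() for _ in range(3)]
```

Each request takes the next entry in order, so load is spread evenly across the pool without tracking per-proxy state.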