crawl4ai

mirror of https://github.com/unclecode/crawl4ai.git synced 2025-12-01 05:14:22 +00:00

Author	SHA1	Message	Date
UncleCode	d0586f09a9	Merge branch 'vr0.4.3b3'	2025-01-25 21:57:29 +08:00
UncleCode	6a01008a2b	docs(multi-url): improve documentation clarity and update examples - Restructure multi-URL crawling documentation with better formatting and examples - Update code examples to use new API syntax (arun_many) - Add detailed parameter explanations for RateLimiter and Dispatchers - Enhance CSS styling for better documentation readability - Fix outdated method calls in feature demo script BREAKING CHANGE: Updated dispatcher.run_urls() to crawler.arun_many() in examples	2025-01-23 22:33:36 +08:00
UncleCode	45809d1c91	Merge branch 'vr0.4.3b2'	2025-01-22 20:51:46 +08:00
UncleCode	2d69bf2366	refactor(models): rename final_url to redirected_url for consistency Renames the final_url field to redirected_url across all components to maintain consistent terminology throughout the codebase. This change affects: - AsyncCrawlResponse model - AsyncPlaywrightCrawlerStrategy - Documentation and examples No functional changes, purely naming consistency improvement.	2025-01-22 17:14:24 +08:00
UncleCode	dee5fe9851	feat(proxy): add proxy rotation support and documentation Implements dynamic proxy rotation functionality with authentication support and IP verification. Updates include: - Added proxy rotation demo in features example - Updated proxy configuration handling in BrowserManager - Added proxy rotation documentation - Updated README with new proxy rotation feature - Bumped version to 0.4.3b2 This change enables users to dynamically switch between proxies and verify IP addresses for each request.	2025-01-22 16:11:01 +08:00
UncleCode	16b8d4945b	feat(release): prepare v0.4.3 beta release Prepare the v0.4.3 beta release with major feature additions and improvements: - Add JsonXPathExtractionStrategy and LLMContentFilter to exports - Update version to 0.4.3b1 - Improve documentation for dispatchers and markdown generation - Update development status to Beta - Reorganize changelog format BREAKING CHANGE: Memory threshold in MemoryAdaptiveDispatcher increased to 90% and SemaphoreDispatcher parameter renamed to max_session_permit	2025-01-21 21:03:11 +08:00
UncleCode	d09c611d15	feat(robots): add robots.txt compliance support Add support for checking and respecting robots.txt rules before crawling websites: - Implement RobotsParser class with SQLite caching - Add check_robots_txt parameter to CrawlerRunConfig - Integrate robots.txt checking in AsyncWebCrawler - Update documentation with robots.txt compliance examples - Add tests for robot parser functionality The cache uses WAL mode for better concurrency and has a default TTL of 7 days.	2025-01-21 17:54:13 +08:00
UncleCode	9247877037	feat(proxy): add proxy configuration support to CrawlerRunConfig Add proxy_config parameter to CrawlerRunConfig to support dynamic proxy configuration per crawl request. This enables users to specify different proxy settings for each crawl operation without modifying the browser config. - Added proxy_config parameter to CrawlerRunConfig - Updated BrowserManager to apply proxy settings from CrawlerRunConfig - Updated proxy-security documentation with new usage examples	2025-01-20 22:14:05 +08:00
UncleCode	2cec527a22	feat(extraction): add LLM-powered schema generation utility Adds new static method generate_schema() to JsonElementExtractionStrategy classes that can automatically generate extraction schemas using LLM (OpenAI or Ollama). This provides a convenient way to bootstrap extraction schemas while maintaining the performance benefits of selector-based extraction. Key changes: - Added generate_schema() static method to base extraction strategy - Added support for both CSS and XPath schema generation - Updated documentation with examples and best practices - Added new prompt templates for schema generation	2025-01-20 17:28:00 +08:00
UncleCode	4b1309cbf2	feat(crawler): add URL redirection tracking Add capability to track and return final URLs after redirects in crawler responses. This enhancement helps users understand the actual destination of crawled URLs after any redirections. Changes include: - Added final_url tracking in AsyncPlaywrightCrawlerStrategy - Added redirected_url field to CrawlResult model - Updated AsyncWebCrawler to properly handle and store redirect URLs - Fixed typo in documentation signature	2025-01-19 19:53:38 +08:00
UncleCode	8b6fe6a98f	docs(api): add streaming mode documentation and examples Add comprehensive documentation for the new streaming mode feature in arun_many(): - Update arun_many() API docs to reflect streaming return type - Add streaming examples in quickstart and multi-url guides - Document stream parameter in configuration classes - Add clone() helper method documentation for configs This change improves documentation for processing large numbers of URLs efficiently.	2025-01-19 18:21:34 +08:00
UncleCode	3d09b6a221	feat(content-filter): add LLMContentFilter for intelligent markdown generation Add new LLMContentFilter class that uses LLMs to generate high-quality markdown content: - Implement intelligent content filtering with customizable instructions - Add chunk processing for handling large documents - Support parallel processing of content chunks - Include caching mechanism for filtered results - Add usage tracking and statistics - Update documentation with examples and use cases Also includes minor changes: - Disable Pydantic warnings in __init__.py - Add new prompt template for content filtering	2025-01-18 19:31:07 +08:00
UncleCode	c3370ec5da	refactor(scraping): replace ScrapingMode enum with strategy pattern Replace the ScrapingMode enum with a proper strategy pattern implementation for content scraping. This change introduces: - New ContentScrapingStrategy abstract base class - Concrete WebScrapingStrategy and LXMLWebScrapingStrategy implementations - New Pydantic models for structured scraping results - Updated documentation reflecting the new strategy-based approach BREAKING CHANGE: ScrapingMode enum has been removed. Users should now use ContentScrapingStrategy implementations instead.	2025-01-13 17:53:12 +08:00
UncleCode	f3ae5a657c	feat(scraping): add LXML-based scraping mode for improved performance Adds a new ScrapingMode enum to allow switching between BeautifulSoup and LXML parsing. LXML mode offers 10-20x better performance for large HTML documents. Key changes: - Added ScrapingMode enum with BEAUTIFULSOUP and LXML options - Implemented LXMLWebScrapingStrategy class - Added LXML-based metadata extraction - Updated documentation with scraping mode usage and performance considerations - Added cssselect dependency BREAKING CHANGE: None	2025-01-12 20:46:23 +08:00
UncleCode	825c78a048	refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring Reorganize dispatcher functionality into separate components: - Create dedicated dispatcher classes (MemoryAdaptive, Semaphore) - Add RateLimiter for smart request throttling - Implement CrawlerMonitor for real-time progress tracking - Move dispatcher config from CrawlerRunConfig to separate classes BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.	2025-01-11 21:10:27 +08:00
UncleCode	f9c601eb7e	docs(urls): update documentation URLs to new domain Update all documentation URLs from crawl4ai.com/mkdocs to docs.crawl4ai.com across README, examples, and documentation files. This change reflects the new documentation hosting domain. Also add todo/ directory to .gitignore.	2025-01-09 16:24:41 +08:00
UncleCode	e8b4ac6046	docs(urls): update documentation URLs to new domain Update all documentation URLs from crawl4ai.com/mkdocs to docs.crawl4ai.com Improve badges styling and layout in documentation Increase code font size in documentation CSS BREAKING CHANGE: Documentation URLs have changed from crawl4ai.com/mkdocs to docs.crawl4ai.com	2025-01-09 16:22:41 +08:00
UncleCode	1c9464b988	Update all documents	2025-01-08 19:31:31 +08:00
UncleCode	ca3e33122e	refactor(docs): reorganize documentation structure and update styles Reorganize documentation into core/advanced/extraction sections for better navigation. Update terminal theme styles and add rich library for better CLI output. Remove redundant tutorial files and consolidate content into core sections. Add personal story to index page for project context. BREAKING CHANGE: Documentation structure has been significantly reorganized	2025-01-07 20:49:50 +08:00
UncleCode	fb33a24891	Commit Message: - Added examples for Amazon product data extraction methods - Updated configuration options and enhance documentation - Minor refactoring for improved performance and readability - Cleaned up version control settings.	2024-12-29 20:05:18 +08:00
UncleCode	f2d9912697	Renames browser_config param to config in AsyncWebCrawler Standardizes parameter naming convention across the codebase by renaming browser_config to the more concise config in AsyncWebCrawler constructor. Updates all documentation examples and internal usages to reflect the new parameter name for consistency. Also improves hook execution by adding url/response parameters to goto hooks and fixes parameter ordering in before_return_html hook.	2024-12-26 16:34:36 +08:00
UncleCode	d5ed451299	Enhance crawler capabilities and documentation - Add llm.txt generator - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation.	2024-12-25 21:34:31 +08:00
UncleCode	849765712f	Enhance Crawl4AI with new features and documentation - Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags. - Introduced Managed Browsers for enhanced crawling experience. - Updated documentation for clearer navigation on configuration. - Changed 'text_only' to 'text_mode' in configuration and methods. - Improved performance and relevance in content filtering strategies.	2024-12-19 21:02:29 +08:00
UncleCode	a11d9646e3	Enhance crawler features and improve documentation - Added detailed CrawlerRunConfig parameters documentation. - Introduced plans for real-time event-driven crawling. - Updated async logger default level to DEBUG for better insights. - Improved structure and readability in configuration file. - Enhanced documentation on future capabilities in new blog entries.	2024-12-16 18:52:51 +08:00
UncleCode	4a72c5ea6e	Add release notes and documentation for version 0.4.2: Configurable Crawlers, Session Management, and Enhanced Screenshot/PDF features	2024-12-12 20:15:50 +08:00
UncleCode	0982c639ae	Enhance AsyncWebCrawler and related configurations - Introduced new configuration classes: BrowserConfig and CrawlerRunConfig. - Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management. - Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters. - Improved error handling with detailed context extraction during exceptions. - Enhanced overall maintainability and usability of the web crawler.	2024-12-12 19:35:09 +08:00
UncleCode	e130fd8db9	Implement new async crawler features and stability updates - Introduced new async crawl strategy with session management. - Added BrowserManager for improved browser management. - Enhanced documentation, focusing on storage state and usage examples. - Improved error handling and logging for sessions. - Added JavaScript snippets for customizing navigator properties.	2024-12-10 17:55:29 +08:00
UncleCode	c51e901f68	feat: Enhance AsyncPlaywrightCrawlerStrategy with text-only and light modes, dynamic viewport adjustment, and session management ### New Features: - Text-Only Mode: Added support for text-only crawling by disabling images, JavaScript, GPU, and other non-essential features. - Light Mode: Optimized browser settings to reduce resource usage and improve efficiency during crawling. - Dynamic Viewport Adjustment: Automatically adjusts viewport dimensions based on content size, ensuring accurate rendering and scaling. - Full Page Scanning: Introduced a feature to scroll and capture dynamic content for pages with infinite scroll or lazy-loading elements. - Session Management: Added `create_session` method for creating and managing browser sessions with unique IDs. ### Improvements: - Unified viewport handling across contexts by dynamically setting dimensions using `self.viewport_width` and `self.viewport_height`. - Enhanced logging and error handling for viewport adjustments, page scanning, and content evaluation. - Reduced resource usage with additional browser flags for both `light_mode` and `text_only` configurations. - Improved handling of cookies, headers, and proxies in session creation. ### Refactoring: - Removed hardcoded viewport dimensions and replaced them with dynamic configurations. - Cleaned up unused and commented-out code for better readability and maintainability. - Introduced defaults for frequently used parameters like `delay_before_return_html`. ### Fixes: - Resolved potential inconsistencies in viewport handling. - Improved robustness of content loading and dynamic adjustments to avoid failures and timeouts. ### Docs Update: - Updated schema usage in `quickstart_async.py` example: - Changed `OpenAIModelFee.schema()` to `OpenAIModelFee.model_json_schema()` for compatibility. - Enhanced LLM extraction instruction documentation. This commit introduces significant enhancements to improve efficiency, flexibility, and reliability of the crawler strategy.	2024-12-08 20:04:44 +08:00
UncleCode	b02544bc0b	docs: update README and blog for version 0.4.0 release, highlighting new features and improvements	2024-12-03 21:28:52 +08:00
unclecode	293f299c08	Add PruningContentFilter with unit tests and update documentation - Introduced the PruningContentFilter for better content relevance. - Implemented comprehensive unit tests for verification of functionality. - Enhanced existing BM25ContentFilter tests for edge case coverage. - Updated documentation to include usage examples for new filter.	2024-12-01 19:17:33 +08:00
UncleCode	24723b2f10	Enhance features and documentation - Updated version to 0.3.743 - Improved ManagedBrowser configuration with dynamic host/port - Implemented fast HTML formatting in web crawler - Enhanced markdown generation with a new generator class - Improved sanitization and utility functions - Added contributor details and pull request acknowledgments - Updated documentation for clearer usage scenarios - Adjusted tests to reflect class name changes	2024-11-28 12:45:05 +08:00
UncleCode	b6af94cbbb	Merge remote-tracking branch 'origin/main' into 0.3.74	2024-11-18 21:15:04 +08:00
UncleCode	852729ff38	feat(docker): add Docker Compose configurations for local and hub deployment; enhance GPU support checks in Dockerfile feat(requirements): update requirements.txt to include snowballstemmer fix(version_manager): correct version parsing to use __version__.__version__ feat(main): introduce chunking strategy and content filter in CrawlRequest model feat(content_filter): enhance BM25 algorithm with priority tag scoring for improved content relevance feat(logger): implement new async logger engine replacing print statements throughout library fix(database): resolve version-related deadlock and circular lock issues in database operations docs(docker): expand Docker deployment documentation with usage instructions for Docker Compose	2024-11-18 21:00:06 +08:00
UncleCode	df63a40606	feat(docs): update examples and documentation to replace bypass_cache with cache_mode for improved clarity	2024-11-17 19:44:45 +08:00
UncleCode	f9fe6f89fe	feat(database): implement version management and migration checks during initialization	2024-11-17 18:09:33 +08:00
UncleCode	2a82455b3d	feat(crawl): implement direct crawl functionality and introduce CacheMode for improved caching control	2024-11-17 17:17:34 +08:00
UncleCode	4b45b28f25	feat(docs): enhance deployment documentation with one-click setup, API security details, and Docker Compose examples	2024-11-16 18:44:47 +08:00
UncleCode	9139ef3125	feat(docker): update Dockerfile for improved installation process and enhance deployment documentation with Docker Compose setup and API token security	2024-11-16 18:19:44 +08:00
UncleCode	c38ac29edb	perf(crawler): major performance improvements & raw HTML support - Switch to lxml parser (~4x speedup) - Add raw HTML & local file crawling support - Fix cache headers & async cleanup - Add browser process monitoring - Optimize BeautifulSoup operations - Pre-compile regex patterns Breaking: Raw HTML handling requires new URL prefixes Fixes: #256, #253	2024-11-13 19:40:40 +08:00
Mahesh	00026b5f8b	feat(config): Adding a configurable way of setting the cache directory for constrained environments	2024-11-12 14:52:51 -07:00
UncleCode	c5aa1bec18	Merge pull request #229 from bizrockman/main Preventing NoneType has no attribute get Errors	2024-11-06 07:31:07 +01:00
UncleCode	67a23c3182	feat(core): Release v0.3.73 with Browser Takeover and Docker Support Major changes: - Add browser takeover feature using CDP for authentic browsing - Implement Docker support with full API server documentation - Enhance Mockdown with tag preservation system - Improve parallel crawling performance This release focuses on authenticity and scalability, introducing the ability to use users' own browsers while providing containerized deployment options. Breaking changes include modified browser handling and API response structure. See CHANGELOG.md for detailed migration guide.	2024-11-05 20:04:18 +08:00
bizrockman	796dbaf08c	Rename episode_11_3_Extraction_Strategies:_Cosine.md to episode_11_3_Extraction_Strategies_Cosine.md Name that will work in Windows	2024-11-04 20:19:43 +01:00
bizrockman	3a3c88a2d0	Rename episode_11_2_Extraction_Strategies:_LLM.md to episode_11_2_Extraction_Strategies_LLM.md Name that will work in Windows	2024-11-04 20:19:20 +01:00
bizrockman	870296fa7e	Rename episode_11_1_Extraction_Strategies:_JSON_CSS.md to episode_11_1_Extraction_Strategies_JSON_CSS.md Name that will work in Windows	2024-11-04 20:18:58 +01:00
bizrockman	a28046c233	Rename episode_08_Media_Handling:_Images,_Videos,_and_Audio.md to episode_08_Media_Handling_Images_Videos_and_Audio.md Name that will work in Windows	2024-11-04 20:18:26 +01:00
UncleCode	19c3f3efb2	Refactor tutorial markdown files: Update numbering and formatting	2024-10-30 20:58:07 +08:00
UncleCode	9307c19f35	Update documents, upload new version of quickstart.	2024-10-30 20:39:35 +08:00
UncleCode	3529c2e732	Update new tutorial documents and added to the docs folder.	2024-10-30 00:16:18 +08:00
UncleCode	4239654722	Update Documentation	2024-10-27 19:24:46 +08:00

50 Commits