crawl4ai

mirror of https://github.com/unclecode/crawl4ai.git synced 2025-12-12 19:12:05 +00:00

Author	SHA1	Message	Date
UncleCode	c51e901f68	feat: Enhance AsyncPlaywrightCrawlerStrategy with text-only and light modes, dynamic viewport adjustment, and session management ### New Features: - Text-Only Mode: Added support for text-only crawling by disabling images, JavaScript, GPU, and other non-essential features. - Light Mode: Optimized browser settings to reduce resource usage and improve efficiency during crawling. - Dynamic Viewport Adjustment: Automatically adjusts viewport dimensions based on content size, ensuring accurate rendering and scaling. - Full Page Scanning: Introduced a feature to scroll and capture dynamic content for pages with infinite scroll or lazy-loading elements. - Session Management: Added `create_session` method for creating and managing browser sessions with unique IDs. ### Improvements: - Unified viewport handling across contexts by dynamically setting dimensions using `self.viewport_width` and `self.viewport_height`. - Enhanced logging and error handling for viewport adjustments, page scanning, and content evaluation. - Reduced resource usage with additional browser flags for both `light_mode` and `text_only` configurations. - Improved handling of cookies, headers, and proxies in session creation. ### Refactoring: - Removed hardcoded viewport dimensions and replaced them with dynamic configurations. - Cleaned up unused and commented-out code for better readability and maintainability. - Introduced defaults for frequently used parameters like `delay_before_return_html`. ### Fixes: - Resolved potential inconsistencies in viewport handling. - Improved robustness of content loading and dynamic adjustments to avoid failures and timeouts. ### Docs Update: - Updated schema usage in `quickstart_async.py` example: - Changed `OpenAIModelFee.schema()` to `OpenAIModelFee.model_json_schema()` for compatibility. - Enhanced LLM extraction instruction documentation. This commit introduces significant enhancements to improve efficiency, flexibility, and reliability of the crawler strategy.	2024-12-08 20:04:44 +08:00
UncleCode	b02544bc0b	docs: update README and blog for version 0.4.0 release, highlighting new features and improvements	2024-12-03 21:28:52 +08:00
unclecode	293f299c08	Add PruningContentFilter with unit tests and update documentation - Introduced the PruningContentFilter for better content relevance. - Implemented comprehensive unit tests for verification of functionality. - Enhanced existing BM25ContentFilter tests for edge case coverage. - Updated documentation to include usage examples for new filter.	2024-12-01 19:17:33 +08:00
UncleCode	24723b2f10	Enhance features and documentation - Updated version to 0.3.743 - Improved ManagedBrowser configuration with dynamic host/port - Implemented fast HTML formatting in web crawler - Enhanced markdown generation with a new generator class - Improved sanitization and utility functions - Added contributor details and pull request acknowledgments - Updated documentation for clearer usage scenarios - Adjusted tests to reflect class name changes	2024-11-28 12:45:05 +08:00
UncleCode	b6af94cbbb	Merge remote-tracking branch 'origin/main' into 0.3.74	2024-11-18 21:15:04 +08:00
UncleCode	852729ff38	feat(docker): add Docker Compose configurations for local and hub deployment; enhance GPU support checks in Dockerfile; feat(requirements): update requirements.txt to include snowballstemmer; fix(version_manager): correct version parsing to use __version__.__version__; feat(main): introduce chunking strategy and content filter in CrawlRequest model; feat(content_filter): enhance BM25 algorithm with priority tag scoring for improved content relevance; feat(logger): implement new async logger engine replacing print statements throughout library; fix(database): resolve version-related deadlock and circular lock issues in database operations; docs(docker): expand Docker deployment documentation with usage instructions for Docker Compose	2024-11-18 21:00:06 +08:00
UncleCode	df63a40606	feat(docs): update examples and documentation to replace bypass_cache with cache_mode for improved clarity	2024-11-17 19:44:45 +08:00
UncleCode	f9fe6f89fe	feat(database): implement version management and migration checks during initialization	2024-11-17 18:09:33 +08:00
UncleCode	2a82455b3d	feat(crawl): implement direct crawl functionality and introduce CacheMode for improved caching control	2024-11-17 17:17:34 +08:00
UncleCode	4b45b28f25	feat(docs): enhance deployment documentation with one-click setup, API security details, and Docker Compose examples	2024-11-16 18:44:47 +08:00
UncleCode	9139ef3125	feat(docker): update Dockerfile for improved installation process and enhance deployment documentation with Docker Compose setup and API token security	2024-11-16 18:19:44 +08:00
UncleCode	c38ac29edb	perf(crawler): major performance improvements & raw HTML support - Switch to lxml parser (~4x speedup) - Add raw HTML & local file crawling support - Fix cache headers & async cleanup - Add browser process monitoring - Optimize BeautifulSoup operations - Pre-compile regex patterns Breaking: Raw HTML handling requires new URL prefixes Fixes: #256, #253	2024-11-13 19:40:40 +08:00
Mahesh	00026b5f8b	feat(config): Adding a configurable way of setting the cache directory for constrained environments	2024-11-12 14:52:51 -07:00
UncleCode	c5aa1bec18	Merge pull request #229 from bizrockman/main Preventing NoneType has no attribute get Errors	2024-11-06 07:31:07 +01:00
UncleCode	67a23c3182	feat(core): Release v0.3.73 with Browser Takeover and Docker Support Major changes: - Add browser takeover feature using CDP for authentic browsing - Implement Docker support with full API server documentation - Enhance Markdown with tag preservation system - Improve parallel crawling performance This release focuses on authenticity and scalability, introducing the ability to use users' own browsers while providing containerized deployment options. Breaking changes include modified browser handling and API response structure. See CHANGELOG.md for detailed migration guide.	2024-11-05 20:04:18 +08:00
bizrockman	796dbaf08c	Rename episode_11_3_Extraction_Strategies:_Cosine.md to episode_11_3_Extraction_Strategies_Cosine.md Name that will work in Windows	2024-11-04 20:19:43 +01:00
bizrockman	3a3c88a2d0	Rename episode_11_2_Extraction_Strategies:_LLM.md to episode_11_2_Extraction_Strategies_LLM.md Name that will work in Windows	2024-11-04 20:19:20 +01:00
bizrockman	870296fa7e	Rename episode_11_1_Extraction_Strategies:_JSON_CSS.md to episode_11_1_Extraction_Strategies_JSON_CSS.md Name that will work in Windows	2024-11-04 20:18:58 +01:00
bizrockman	a28046c233	Rename episode_08_Media_Handling:_Images,_Videos,_and_Audio.md to episode_08_Media_Handling_Images_Videos_and_Audio.md Name that will work in Windows	2024-11-04 20:18:26 +01:00
UncleCode	19c3f3efb2	Refactor tutorial markdown files: Update numbering and formatting	2024-10-30 20:58:07 +08:00
UncleCode	9307c19f35	Update documents, upload new version of quickstart.	2024-10-30 20:39:35 +08:00
UncleCode	3529c2e732	Update new tutorial documents and add them to the docs folder.	2024-10-30 00:16:18 +08:00
UncleCode	4239654722	Update Documentation	2024-10-27 19:24:46 +08:00

23 Commits
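
The four rename commits by bizrockman listed above drop characters from tutorial file names so that the files can be checked out on Windows. A minimal sketch of that kind of cleanup follows. The helper name `windows_safe_name` is hypothetical and not part of crawl4ai, and removing the comma is an assumption inferred from the `episode_08` rename (Windows itself permits commas in file names).

```python
import re

# Characters that Windows rejects in file names, plus the comma.
# Dropping the comma is an assumption taken from the episode_08 rename
# in the log; Windows does not forbid it.
_UNSAFE_CHARS = re.compile(r'[<>:"/\\|?*,]')


def windows_safe_name(name: str) -> str:
    """Return `name` with the unsafe characters removed (hypothetical helper)."""
    return _UNSAFE_CHARS.sub("", name)


if __name__ == "__main__":
    # File names as they appear in the rename commits above.
    for old in (
        "episode_11_3_Extraction_Strategies:_Cosine.md",
        "episode_08_Media_Handling:_Images,_Videos,_and_Audio.md",
    ):
        print(old, "->", windows_safe_name(old))
```

Removing the characters, rather than substituting a placeholder, mirrors what the commits did: each offending character was already followed by an underscore, so the resulting names stay readable.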