5 Commits

Author SHA1 Message Date
UncleCode
367cd71db9 feat(core): release version 0.5.0 with deep crawling and CLI
This major release adds deep crawling capabilities, memory-adaptive dispatcher,
multiple crawling strategies, Docker deployment, and a new CLI. It also includes
significant improvements to proxy handling, PDF processing, and LLM integration.

BREAKING CHANGES:
- Add memory-adaptive dispatcher as default for arun_many()
- Move max_depth to CrawlerRunConfig
- Replace ScrapingMode enum with strategy pattern
- Update BrowserContext API
- Make model fields optional with defaults
- Remove content_filter parameter from CrawlerRunConfig
- Remove synchronous WebCrawler and old CLI
- Update Docker deployment configuration
- Replace FastFilterChain with FilterChain
- Change license to Apache 2.0 with attribution clause
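
To illustrate what replacing a mode enum with a strategy pattern can look like, here is a minimal sketch. Every name in it (`ScrapingStrategy`, `TagStrippingStrategy`, `TitleOnlyStrategy`, `run_scrape`) is hypothetical and chosen for illustration only; the classes actually shipped in 0.5.0 may be named and shaped differently.

```python
import re
from abc import ABC, abstractmethod


class ScrapingStrategy(ABC):
    """Hypothetical interface; not the library's actual class."""

    @abstractmethod
    def scrape(self, html: str) -> str:
        """Return the extracted text for one HTML document."""


class TagStrippingStrategy(ScrapingStrategy):
    """Naive strategy: drop every tag and collapse whitespace."""

    def scrape(self, html: str) -> str:
        text = re.sub(r"<[^>]+>", " ", html)
        return " ".join(text.split())


class TitleOnlyStrategy(ScrapingStrategy):
    """Alternative strategy: return only the <title> contents."""

    def scrape(self, html: str) -> str:
        match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
        return match.group(1).strip() if match else ""


def run_scrape(html: str, strategy: ScrapingStrategy) -> str:
    # The caller passes a strategy object instead of an enum member,
    # so new behaviours can be added without touching this function.
    return strategy.scrape(html)
```

With an enum, each new mode needs another branch in the crawler; with strategy objects, callers supply their own subclass and the dispatch code stays unchanged.
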
2025-02-21 19:55:02 +08:00
UncleCode
1c9464b988 Update all documents 2025-01-08 19:31:31 +08:00
UncleCode
4a72c5ea6e Add release notes and documentation for version 0.4.2: Configurable Crawlers, Session Management, and Enhanced Screenshot/PDF features 2024-12-12 20:15:50 +08:00
UncleCode
c51e901f68 feat: Enhance AsyncPlaywrightCrawlerStrategy with text-only and light modes, dynamic viewport adjustment, and session management
### New Features:
- **Text-Only Mode**: Added support for text-only crawling by disabling images, JavaScript, GPU, and other non-essential features.
- **Light Mode**: Optimized browser settings to reduce resource usage and improve efficiency during crawling.
- **Dynamic Viewport Adjustment**: Automatically adjusts viewport dimensions based on content size, ensuring accurate rendering and scaling.
- **Full Page Scanning**: Introduced a feature to scroll and capture dynamic content for pages with infinite scroll or lazy-loading elements.
- **Session Management**: Added `create_session` method for creating and managing browser sessions with unique IDs.

### Improvements:
- Unified viewport handling across contexts by dynamically setting dimensions using `self.viewport_width` and `self.viewport_height`.
- Enhanced logging and error handling for viewport adjustments, page scanning, and content evaluation.
- Reduced resource usage with additional browser flags for both `light_mode` and `text_only` configurations.
- Improved handling of cookies, headers, and proxies in session creation.

### Refactoring:
- Removed hardcoded viewport dimensions and replaced them with dynamic configurations.
- Cleaned up unused and commented-out code for better readability and maintainability.
- Introduced defaults for frequently used parameters like `delay_before_return_html`.

### Fixes:
- Resolved potential inconsistencies in viewport handling.
- Improved robustness of content loading and dynamic adjustments to avoid failures and timeouts.

### Docs Update:
- Updated schema usage in `quickstart_async.py` example:
  - Changed `OpenAIModelFee.schema()` to `OpenAIModelFee.model_json_schema()` for compatibility with Pydantic v2, where `.schema()` is deprecated.
- Enhanced LLM extraction instruction documentation.
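
The schema change can be sketched as follows, assuming Pydantic v2 is installed. The field names on `OpenAIModelFee` below are assumptions made for illustration and may not match the fields in `quickstart_async.py`.

```python
from pydantic import BaseModel, Field


class OpenAIModelFee(BaseModel):
    # Field names are illustrative assumptions, not copied from the example file.
    name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input tokens.")
    output_fee: str = Field(..., description="Fee for output tokens.")


# Pydantic v2 spelling; the v1-style OpenAIModelFee.schema() is deprecated in v2.
schema = OpenAIModelFee.model_json_schema()
print(sorted(schema["properties"]))
```

The resulting dictionary is a plain JSON Schema object, so it can be passed wherever the earlier `.schema()` output was used.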

This commit improves the efficiency, flexibility, and reliability of the crawler strategy.
2024-12-08 20:04:44 +08:00
UncleCode
b02544bc0b docs: update README and blog for version 0.4.0 release, highlighting new features and improvements 2024-12-03 21:28:52 +08:00