6 Commits

Author SHA1 Message Date
UncleCode
9547bada3a feat(content): add target_elements parameter for selective content extraction
Adds new target_elements parameter to CrawlerRunConfig that allows more flexible content selection than css_selector. This enables focusing markdown generation and data extraction on specific elements while still processing the entire page for links and media.

Key changes:
- Added target_elements list parameter to CrawlerRunConfig
- Modified WebScrapingStrategy and LXMLWebScrapingStrategy to handle target_elements
- Updated documentation with examples and comparison between css_selector and target_elements
- Fixed table extraction in content_scraping_strategy.py

BREAKING CHANGE: Table extraction logic has been modified to better handle thead/tbody structures
2025-03-10 18:54:51 +08:00
UncleCode
976ea52167 docs(examples): update demo scripts and fix output formats
Update example scripts to reflect latest API changes and improve demonstrations:
- Increase test URLs in dispatcher example from 20 to 40 pages
- Comment out unused dispatcher strategies for cleaner output
- Fix scraping strategies performance script to use correct object notation
- Update v0_4_3_features_demo with additional feature mentions and uncomment demo sections

These changes make the examples more current and better aligned with the actual API.
2025-01-22 20:40:03 +08:00
UncleCode
16b8d4945b feat(release): prepare v0.4.3 beta release
Prepare the v0.4.3 beta release with major feature additions and improvements:
- Add JsonXPathExtractionStrategy and LLMContentFilter to exports
- Update version to 0.4.3b1
- Improve documentation for dispatchers and markdown generation
- Update development status to Beta
- Reorganize changelog format

BREAKING CHANGE: Memory threshold in MemoryAdaptiveDispatcher increased to 90% and SemaphoreDispatcher parameter renamed to max_session_permit
2025-01-21 21:03:11 +08:00
UncleCode
8ec12d7d68 Apply Ruff Corrections 2025-01-13 19:19:58 +08:00
UncleCode
825c78a048 refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring
Reorganize dispatcher functionality into separate components:
- Create dedicated dispatcher classes (MemoryAdaptive, Semaphore)
- Add RateLimiter for smart request throttling
- Implement CrawlerMonitor for real-time progress tracking
- Move dispatcher config from CrawlerRunConfig to separate classes

BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.
2025-01-11 21:10:27 +08:00
UncleCode
ac5f461d40 feat(crawler): add memory-adaptive dispatcher with rate limiting
Implements a new MemoryAdaptiveDispatcher class to manage concurrent crawling operations with memory monitoring and rate limiting capabilities. Changes include:

- Added RateLimitConfig dataclass for configuring rate limiting behavior
- Extended CrawlerRunConfig with dispatcher-related settings
- Refactored arun_many to use the new dispatcher system
- Added memory threshold and session permit controls
- Integrated optional progress monitoring display

BREAKING CHANGE: The arun_many method now uses MemoryAdaptiveDispatcher by default, which may affect concurrent crawling behavior
2025-01-10 16:01:18 +08:00