- Added examples for Amazon product data extraction methods
- Updated configuration options and enhance documentation
- Minor refactoring for improved performance and readability
- Cleaned up version control settings.
Standardizes parameter naming convention across the codebase by renaming browser_config to the more concise config in AsyncWebCrawler constructor.
Updates all documentation examples and internal usages to reflect the new parameter name for consistency.
Also improves hook execution by adding url/response parameters to goto hooks and fixes parameter ordering in before_return_html hook.
Enhance crawler capabilities and documentation
- Added SSL certificate extraction in AsyncWebCrawler.
- Introduced new content filters and chunking strategies for more robust data extraction.
- Updated documentation management to streamline user experience.
- Add llm.txt generator
- Added SSL certificate extraction in AsyncWebCrawler.
- Introduced new content filters and chunking strategies for more robust data extraction.
- Updated documentation.
@Haopeng138 Thank you so much. They are still part of the library. I forgot to update them since I moved the asynchronous versions years ago. I really appreciate it. I have to say that I feel weak in the documentation. That's why I spent a lot of time on it last week. Now, when you mention some of the things in the example folder, I realize I forgot about the example folder. I'll try to update it more. If you find anything else, please help and support. Thank you. I will add your name to contributor name as well.
- Added markdown generator parameter to CrawlerRunConfig in `async_configs.py`.
- Implemented logic for Markdown generation in content scraping in `async_webcrawler.py`.
- Updated version number to 0.4.21 in `__version__.py`.
- Introduced new configuration classes: BrowserConfig and CrawlerRunConfig.
- Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management.
- Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters.
- Improved error handling with detailed context extraction during exceptions.
- Enhanced overall maintainability and usability of the web crawler.
- Introduced a new approach for capturing full-page screenshots by exporting them as PDFs first, enhancing reliability and performance.
- Added documentation for the feature in `docs/examples/full_page_screenshot_and_pdf_export.md`.
- Refactored `perform_completion_with_backoff` in `crawl4ai/utils.py` to include necessary extra parameters.
- Updated `quickstart_async.py` to utilize LLM extraction with refined arguments.
- Added support for exporting pages as PDFs
- Enhanced screenshot functionality for long pages
- Created a tutorial on dynamic content loading with 'Load More' buttons.
- Updated web crawler to handle PDF data in responses.
- Introduced new async crawl strategy with session management.
- Added BrowserManager for improved browser management.
- Enhanced documentation, focusing on storage state and usage examples.
- Improved error handling and logging for sessions.
- Added JavaScript snippets for customizing navigator properties.
### New Features:
- **Text-Only Mode**: Added support for text-only crawling by disabling images, JavaScript, GPU, and other non-essential features.
- **Light Mode**: Optimized browser settings to reduce resource usage and improve efficiency during crawling.
- **Dynamic Viewport Adjustment**: Automatically adjusts viewport dimensions based on content size, ensuring accurate rendering and scaling.
- **Full Page Scanning**: Introduced a feature to scroll and capture dynamic content for pages with infinite scroll or lazy-loading elements.
- **Session Management**: Added `create_session` method for creating and managing browser sessions with unique IDs.
### Improvements:
- Unified viewport handling across contexts by dynamically setting dimensions using `self.viewport_width` and `self.viewport_height`.
- Enhanced logging and error handling for viewport adjustments, page scanning, and content evaluation.
- Reduced resource usage with additional browser flags for both `light_mode` and `text_only` configurations.
- Improved handling of cookies, headers, and proxies in session creation.
### Refactoring:
- Removed hardcoded viewport dimensions and replaced them with dynamic configurations.
- Cleaned up unused and commented-out code for better readability and maintainability.
- Introduced defaults for frequently used parameters like `delay_before_return_html`.
### Fixes:
- Resolved potential inconsistencies in viewport handling.
- Improved robustness of content loading and dynamic adjustments to avoid failures and timeouts.
### Docs Update:
- Updated schema usage in `quickstart_async.py` example:
- Changed `OpenAIModelFee.schema()` to `OpenAIModelFee.model_json_schema()` for compatibility.
- Enhanced LLM extraction instruction documentation.
This commit introduces significant enhancements to improve efficiency, flexibility, and reliability of the crawler strategy.
- Enhanced the web scraping strategy with new methods for optimized media handling.
- Added new utility functions for better content processing.
- Refined existing features for improved accuracy and efficiency in scraping tasks.
- Introduced more robust filtering criteria for media elements.
- Enhanced error handling in async crawler.
- Added flexible options in Markdown generation.
- Updated user agent settings for improved reliability.
- Reflected changes in documentation and examples.
- Introduced the PruningContentFilter for better content relevance.
- Implemented comprehensive unit tests for verification of functionality.
- Enhanced existing BM25ContentFilter tests for edge case coverage.
- Updated documentation to include usage examples for new filter.
- Added new Docker commands for platform-specific builds.
- Updated README with comprehensive installation and setup instructions.
- Introduced `post_install` method in setup script for automation.
- Refined migration processes with enhanced error logging.
- Bump version to 0.3.746 and updated dependencies.
- Added a post-installation setup script for initialization.
- Updated README with installation notes for Playwright setup.
- Enhanced migration logging for better error visibility.
- Added 'pydantic' to requirements.
- Bumped version to 0.3.746.
feat(requirements): update requirements.txt to include snowballstemmer
fix(version_manager): correct version parsing to use __version__.__version__
feat(main): introduce chunking strategy and content filter in CrawlRequest model
feat(content_filter): enhance BM25 algorithm with priority tag scoring for improved content relevance
feat(logger): implement new async logger engine replacing print statements throughout library
fix(database): resolve version-related deadlock and circular lock issues in database operations
docs(docker): expand Docker deployment documentation with usage instructions for Docker Compose
chore(requirements): add colorama dependency
refactor(config): add SHOW_DEPRECATION_WARNINGS flag and clean up code
fix(docs): update example scripts for clarity and consistency
- Update version number to 0.3.71
- Add sleep_on_close option to AsyncPlaywrightCrawlerStrategy
- Enhance context creation with additional options
- Improve error message formatting and visibility
- Update quickstart documentation
- Implement playwright_stealth for better bot detection avoidance
- Add user simulation and navigator override options
- Improve iframe processing and browser selection
- Enhance error reporting and debugging capabilities
- Optimize image processing and parallel crawling
- Add new example for user simulation feature
- Added support for including links in Markdown content, by definin g a new flag `include_links_on_markdown` in `crawl` method.
- Add browser type selection (Chromium, Firefox, WebKit)
- Implement iframe content extraction
- Improve image processing and dimension updates
- Add custom headers support in AsyncPlaywrightCrawlerStrategy
- Enhance delayed content retrieval with new parameter
- Optimize HTML sanitization and Markdown conversion
- Update examples in quickstart_async.py for new features
- Add before_retrieve_html hook and delay_before_return_html option
- Implement flexible page_timeout for smart_wait function
- Support extra_args and custom headers in LLM extraction
- Allow arbitrary kwargs in AsyncWebCrawler initialization
- Improve perform_completion_with_backoff for custom API calls
- Update examples with new features and diverse LLM providers
- Implement smart_wait function in AsyncPlaywrightCrawlerStrategy
- Add screenshot support to AsyncCrawlResponse and AsyncWebCrawler
- Improve error handling and timeout management in crawling process
- Fix typo in CrawlResult model (responser_headers -> response_headers)
- Update .gitignore to exclude additional files
- Adjust import path in test_basic_crawling.py