* fix: Update export of URLPatternFilter
* chore: Add dependancy for cchardet in requirements
* docs: Update example for deep crawl in release note for v0.5
* Docs: update the example for memory dispatcher
* docs: updated example for crawl strategies
* Refactor: Removed wrapping in if __name__==main block since this is a markdown file.
* chore: removed cchardet from dependancy list, since unclecode is planning to remove it
* docs: updated the example for proxy rotation to a working example
* feat: Introduced ProxyConfig param
* Add tutorial for deep crawl & update contributor list for bug fixes in feb alpha-1
* chore: update and test new dependancies
* feat:Make PyPDF2 a conditional dependancy
* updated tutorial and release note for v0.5
* docs: update docs for deep crawl, and fix a typo in docker-deployment markdown filename
* refactor: 1. Deprecate markdown_v2 2. Make markdown backward compatible to behave as a string when needed. 3. Fix LlmConfig usage in cli 4. Deprecate markdown_v2 in cli 5. Update AsyncWebCrawler for changes in CrawlResult
* fix: Bug in serialisation of markdown in acache_url
* Refactor: Added deprecation errors for fit_html and fit_markdown directly on markdown. Now access them via markdown
* fix: remove deprecated markdown_v2 from docker
* Refactor: remove deprecated fit_markdown and fit_html from result
* refactor: fix cache retrieval for markdown as a string
* chore: update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown
Adds a new ScrapingMode enum to allow switching between BeautifulSoup and LXML parsing.
LXML mode offers 10-20x better performance for large HTML documents.
Key changes:
- Added ScrapingMode enum with BEAUTIFULSOUP and LXML options
- Implemented LXMLWebScrapingStrategy class
- Added LXML-based metadata extraction
- Updated documentation with scraping mode usage and performance considerations
- Added cssselect dependency
BREAKING CHANGE: None
Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.
BREAKING CHANGE: Documentation structure has been significantly reorganized
- Add llm.txt generator
- Added SSL certificate extraction in AsyncWebCrawler.
- Introduced new content filters and chunking strategies for more robust data extraction.
- Updated documentation.
- Added a post-installation setup script for initialization.
- Updated README with installation notes for Playwright setup.
- Enhanced migration logging for better error visibility.
- Added 'pydantic' to requirements.
- Bumped version to 0.3.746.
feat(requirements): update requirements.txt to include snowballstemmer
fix(version_manager): correct version parsing to use __version__.__version__
feat(main): introduce chunking strategy and content filter in CrawlRequest model
feat(content_filter): enhance BM25 algorithm with priority tag scoring for improved content relevance
feat(logger): implement new async logger engine replacing print statements throughout library
fix(database): resolve version-related deadlock and circular lock issues in database operations
docs(docker): expand Docker deployment documentation with usage instructions for Docker Compose
chore(requirements): add colorama dependency
refactor(config): add SHOW_DEPRECATION_WARNINGS flag and clean up code
fix(docs): update example scripts for clarity and consistency
- Introduced AsyncDatabaseManager for async DB management.
- Added migration feature to transition to file-based storage.
- Enhanced web crawler with improved caching logic.
- Updated requirements and setup for async processing.
- Implemented new async crawler strategy using Playwright.
- Introduced ManagedBrowser for better browser management.
- Added support for persistent browser sessions and improved error handling.
- Updated version from 0.3.73 to 0.3.731.
- Enhanced logic in main.py for conditional mounting of static files.
- Updated requirements to replace playwright_stealth with tf-playwright-stealth.
According to #102 the requirements specified are minimum version. Currently they are defined as fixed versions in requirements.txt and setup.py leading to projects consuming this package are limited to using exactly these requirements instead of a more flexible range. This PR addresses this.
This commit updates the Dockerfile to install Crawl4AI with the specified options. The `INSTALL_OPTION` build argument is used to determine which additional packages to install. If the option is set to "all", all models will be downloaded. If the option is set to "torch", only torch models will be downloaded. If the option is set to "transformer", only transformer models will be downloaded. If no option is specified, the default installation will be used. This change improves the flexibility and customization of the Crawl4AI installation process.
Set the `image_size` variable to 0 in the `get_content_of_website_optimized` function to temporarily disable fetching the image file size. This change addresses performance issues and will be improved in a future update.
Update Dockerfile for linuz users