* feature: Add LlmConfig to easily configure and pass LLM configs to different strategies
* pulled in next branch and resolved conflicts
* feat: Add gemini and deepseek providers. Set ignore_cache to true by default in the LLM content filter to avoid confusion
* Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params
* updated tests, docs and readme
* spelling change in prompt
* gpt-4o-mini support
* Remove leading Y before here
* prompt spell correction
* (Docs) Fix numbered list end-of-line formatting
Added the missing "two spaces" to add a line break
* fix: access downloads_path through browser_config in _handle_download method - Fixes #585
* crawl
* fix: https://github.com/unclecode/crawl4ai/issues/592
* fix: https://github.com/unclecode/crawl4ai/issues/583
* Docs update: https://github.com/unclecode/crawl4ai/issues/649
* fix: https://github.com/unclecode/crawl4ai/issues/570
* Docs: updated example for content-selection to reflect new changes in yc newsfeed css
* Refactor: Removed old filters and replaced with optimised filters
* fix: Fixed imports as per the new names of filters
* Tests: For deep crawl filters
* Refactor: Remove old scorers and replace with optimised ones. Fix imports for all filters and scorers.
* fix: await filters that are async in nature, e.g. content relevance and SEO filters
* fix: https://github.com/unclecode/crawl4ai/issues/592
* fix: https://github.com/unclecode/crawl4ai/issues/715
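
For context on the async filter fix listed above, here is a minimal sketch of the general pattern, not the project's actual code. The names `has_scheme`, `is_relevant`, `apply_filters` and `check` are hypothetical and exist only for illustration. The point is that calling an async filter returns a coroutine object, which is always truthy, so a mixed chain has to await such results before testing them.

```python
import asyncio
import inspect


def has_scheme(url: str) -> bool:
    """Synchronous filter (hypothetical): keep only http(s) URLs."""
    return url.startswith(("http://", "https://"))


async def is_relevant(url: str) -> bool:
    """Async filter (hypothetical) standing in for a relevance or SEO check."""
    await asyncio.sleep(0)  # placeholder for real async work such as a fetch
    return "docs" in url


async def apply_filters(url: str, filters) -> bool:
    """Return True only if every filter in the chain accepts the URL."""
    for f in filters:
        result = f(url)
        # An async filter hands back a coroutine, which is always truthy.
        # Without this await, the URL would pass regardless of the verdict.
        if inspect.isawaitable(result):
            result = await result
        if not result:
            return False
    return True


def check(url: str) -> bool:
    """Convenience wrapper that runs the mixed sync/async chain to completion."""
    return asyncio.run(apply_filters(url, [has_scheme, is_relevant]))
```

With this shape, sync and async filters can share one chain, and the caller does not need to know which kind each filter is.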
---------
Co-authored-by: DarshanTank <darshan.tank@gnani.ai>
Co-authored-by: Tuhin Mallick <tuhin.mllk@gmail.com>
Co-authored-by: Serhat Soydan <ssoydan@gmail.com>
Co-authored-by: cardit1 <maneesh@cardit.in>
Co-authored-by: Tautik Agrahari <tautikagrahari@gmail.com>
Adds a new static method generate_schema() to the JsonElementExtractionStrategy classes
that can automatically generate extraction schemas using an LLM (OpenAI or Ollama).
This provides a convenient way to bootstrap extraction schemas while maintaining
the performance benefits of selector-based extraction.
Key changes:
- Added generate_schema() static method to base extraction strategy
- Added support for both CSS and XPath schema generation
- Updated documentation with examples and best practices
- Added new prompt templates for schema generation
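
To illustrate why a generated schema keeps the performance benefits of selector-based extraction, here is a rough sketch under stated assumptions. It does not call generate_schema() or any library API. The `SCHEMA` dict is a hand-written stand-in for what an LLM might produce once, its key names (`baseSelector`, `fields`, `selector`, `type`) are assumed for illustration, and `extract` is a hypothetical helper built on the standard library's limited XPath support. Once the schema exists, every later extraction is plain selector matching with no LLM call.

```python
import xml.etree.ElementTree as ET

# Hypothetical XPath-style schema, written by hand here; in practice a
# schema like this would be generated once and then reused.
SCHEMA = {
    "name": "Products",
    "baseSelector": ".//div[@class='product']",
    "fields": [
        {"name": "title", "selector": "./h2", "type": "text"},
        {"name": "price", "selector": "./span[@class='price']", "type": "text"},
    ],
}

# Well-formed sample markup so the stdlib XML parser can read it.
HTML = """
<html><body>
  <div class="product"><h2>Widget</h2><span class="price">$9.99</span></div>
  <div class="product"><h2>Gadget</h2><span class="price">$19.50</span></div>
</body></html>
"""


def extract(markup: str, schema: dict) -> list:
    """Apply the schema to the markup and return one dict per matched base node."""
    root = ET.fromstring(markup.strip())
    rows = []
    for node in root.findall(schema["baseSelector"]):
        row = {}
        for field in schema["fields"]:
            match = node.find(field["selector"])
            # Missing elements become None rather than raising.
            row[field["name"]] = (match.text or "").strip() if match is not None else None
        rows.append(row)
    return rows


ROWS = extract(HTML, SCHEMA)
```

The same `SCHEMA` can be applied to any number of pages that share the layout, which is where the cost saving over per-page LLM extraction comes from.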
Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.
BREAKING CHANGE: Documentation structure has been significantly reorganized
- Add llm.txt generator
- Added SSL certificate extraction in AsyncWebCrawler.
- Introduced new content filters and chunking strategies for more robust data extraction.
- Updated documentation.