33 Commits

Author SHA1 Message Date
unclecode
bccadec887 Remove dependency on psutil, PyYaml, and extend requests version range 2024-09-29 17:07:06 +08:00
unclecode
8b6e88c85c Update .gitignore to ignore temporary and test directories 2024-09-26 15:09:49 +08:00
unclecode
f1eee09cf4 Update README, add manifest, make selenium optional library 2024-09-25 16:35:14 +08:00
unclecode
4d48bd31ca Push async version last changes for merge to main branch 2024-09-24 20:52:08 +08:00
unclecode
b179aa9b6f Refactor website content and setup.py descriptions for consistent terminology 2024-09-12 16:50:52 +08:00
unclecode
dec3d44224 refactor: Update extraction strategy to handle schema extraction with non-empty schema
This code change updates the `LLMExtractionStrategy` class to handle schema extraction when the schema is non-empty. Previously, the schema extraction was only triggered when the `extract_type` was set to "schema", regardless of whether a schema was provided. With this update, the schema extraction will only be performed if the `extract_type` is "schema" and a non-empty schema is provided. This ensures that the extraction strategy behaves correctly and avoids unnecessary schema extraction when not needed. Also "numpy" is removed from default installation mode.
2024-08-19 15:37:07 +08:00
unclecode
e5e6a34e80 ## [v0.2.77] - 2024-08-04
Significant improvements in text processing and performance:

- 🚀 **Dependency reduction**: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy.
- 🤖 **Transformer upgrade**: Implemented text sequence classification using a transformer model for labeling text chunks.
-  **Performance enhancement**: Improved model loading speed due to removal of spaCy dependency.
- 🔧 **Future-proofing**: Laid groundwork for potential complete removal of spaCy dependency in future versions.

These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.
2024-08-04 14:54:18 +08:00
unclecode
659c8cd953 refactor: Update image description minimum word threshold in get_content_of_website_optimized 2024-08-02 15:55:32 +08:00
unclecode
fa5516aad6 chore: Refactor setup.py to use pathlib and shutil for folder creation and removal, to remove cache folder in cross platform manner. 2024-07-09 13:25:00 +08:00
unclecode
4d283ab386 ## [v0.2.74] - 2024-07-08
A slew of exciting updates to improve the crawler's stability and robustness! 🎉

- 💻 **UTF encoding fix**: Resolved the Windows \"charmap\" error by adding UTF encoding.
- 🛡️ **Error handling**: Implemented MaxRetryError exception handling in LocalSeleniumCrawlerStrategy.
- 🧹 **Input sanitization**: Improved input sanitization and handled encoding issues in LLMExtractionStrategy.
- 🚮 **Database cleanup**: Removed existing database file and initialized a new one.
2024-07-08 16:33:25 +08:00
unclecode
3ff2a0d0e7 Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-07-03 15:26:47 +08:00
unclecode
9926eb9f95 feat: Bump version to v0.2.73 and update documentation
This commit updates the version number to v0.2.73 and makes corresponding changes in the README.md and Dockerfile.

Docker file install the default mode, this resolve many of installation issues.

Additionally, the installation instructions are updated to include support for different modes. Setup.py doesn't have anymore dependancy on Spacy.

The change log is also updated to reflect these changes.

Supporting websites need with-head browser.
2024-07-03 15:19:22 +08:00
shiv
a08f21d66c Fix UnicodeDecodeError by reading README.md with UTF-8 encoding 2024-06-30 20:27:33 +05:30
unclecode
685706e0aa Update version, and change log 2024-06-30 00:17:43 +08:00
unclecode
61ae2de841 1/Update setup.py to support following modes:
- default (most frequent mode)
- torch
- transformers
- all
2/ Update Docker file
3/ Update documentation as well.
2024-06-30 00:15:29 +08:00
unclecode
d11a83c232 ## [0.2.71] 2024-06-26
• Refactored `crawler_strategy.py` to handle exceptions and improve error messages
• Improved `get_content_of_website_optimized` function in `utils.py` for better performance
• Updated `utils.py` with latest changes
• Migrated to `ChromeDriverManager` for resolving Chrome driver download issues
2024-06-26 15:34:15 +08:00
unclecode
78cfad8b2f chore: Update version to 0.2.7 and improve extraction function speed 2024-06-24 22:39:56 +08:00
unclecode
2c2362b4d3 issue 19 is resolved
- Update Dockerfile to install mkdocs and build documentation
2024-06-22 17:18:00 +08:00
unclecode
539263a8ba chore: Update configuration values for chunk token threshold, overlap rate, and minimum word threshold. Create a new example for LLMExtraction Strategy, update Dockerfile, and README 2024-06-19 18:32:20 +08:00
unclecode
853b9d59d8 feat: Add hooks for enhanced control over Selenium drivers
- Added six hooks: on_driver_created, before_get_url, after_get_url, before_return_html, on_user_agent_updated.
- Included example usage in quickstart.py.
- Updated README and changelog.
2024-06-18 20:00:51 +08:00
unclecode
42a5da854d Update version and change log. 2024-06-17 14:47:58 +08:00
unclecode
0533aeb814 v0.2.3:
- Extract all media tags
- Take screenshot of the page
2024-06-07 15:23:13 +08:00
unclecode
51f26d12fe Update for v0.2.2
- Support multiple JS scripts
- Fixed some of bugs
- Resolved a few issue relevant to Colab installation
2024-06-02 15:40:18 +08:00
unclecode
52c4be0696 Update setup.py version to 0.2.1 2024-05-19 22:30:59 +08:00
UncleCode
bc27982992
Update setup.py Handle Spacy installation 2024-05-17 22:11:00 +08:00
unclecode
957a2458b1 chore: Update web crawler URLs to use NBC News business section 2024-05-17 18:11:13 +08:00
unclecode
3593f017d7 chore: Update setup.py to exclude torch, transformers, and nltk dependencies
This commit updates the setup.py file to exclude the torch, transformers, and nltk dependencies from the install_requires section. Instead, it creates separate extras_require sections for different environments, including all requirements, excluding torch for Colab, and excluding torch, transformers, and nltk for the crawl environment.
2024-05-17 16:01:04 +08:00
unclecode
e7bb76f19b chore: Update torch dependency to version 2.3.0 2024-05-17 15:52:39 +08:00
unclecode
bf3b040f10 chore: Update pip installation command and requirements, add new dependencies 2024-05-17 15:21:45 +08:00
unclecode
4006f5f4e2 chore: Update pip installation command to use sys.executable 2024-05-16 20:24:48 +08:00
unclecode
7e0682e0de chore: Update dependencies and installation process 2024-05-16 20:22:50 +08:00
unclecode
8e28eb9efb Add model loader, update requirements.txt 2024-05-16 20:08:21 +08:00
unclecode
b8e743cd8d Initial Commit 2024-05-09 19:10:25 +08:00