54 Commits

Author SHA1 Message Date
UncleCode
c4f5651199 chore(deps): upgrade to Python 3.12 and prepare for 0.6.0 release
- Update Docker base image to Python 3.12-slim-bookworm
- Bump version from 0.6.0rc1 to 0.6.0
- Update documentation to reflect release version changes
- Fix license specification in pyproject.toml and setup.py
- Clean up code formatting in demo_docker_api.py

BREAKING CHANGE: Base Python version upgraded from 3.10 to 3.12
2025-04-23 16:35:15 +08:00
UncleCode
8ec12d7d68 Apply Ruff Corrections 2025-01-13 19:19:58 +08:00
UncleCode
67f65f958b refactor(build): simplify setup.py configuration
- Remove dependency management from setup.py
- Remove entry points configuration (moved to pyproject.toml)
- Keep minimal setup.py for backwards compatibility
- Clean up package metadata structure
2025-01-01 15:52:01 +08:00
UncleCode
78b6ba5cef build: modernize package configuration with pyproject.toml
- Add pyproject.toml for PEP 517 build system support
- Configure dependencies, scripts, and metadata in pyproject.toml
- Set Python requirement to >=3.9 and add support up to 3.13
- Keep setup.py for backwards compatibility
- Move package dependencies and entry points to pyproject.toml
2025-01-01 15:45:27 +08:00
UncleCode
3f019d34cc docs: update project description emojis
- Change project description emojis from 🔥🕷️ to 🚀🤖
- Update emojis consistently in both setup.py and pyproject.toml
2025-01-01 15:39:33 +08:00
UncleCode
84b311760f Commit Message:
Enhance Crawl4AI with CLI and documentation updates
  - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py`
  - Added chunking strategies and their documentation in `llm.txt`
2024-12-21 14:26:56 +08:00
UncleCode
e9e5b5642d Fix js_snipprt issue 0.4.21
bump to 0.4.22
2024-12-15 19:49:30 +08:00
UncleCode
d202f3539b Enhance installation and migration processes
- Added a post-installation setup script for initialization.
  - Updated README with installation notes for Playwright setup.
  - Enhanced migration logging for better error visibility.
  - Added 'pydantic' to requirements.
  - Bumped version to 0.3.746.
2024-11-29 18:48:44 +08:00
UncleCode
12e73d4898 refactor: remove legacy build hooks and setup files, migrate to setup.cfg and pyproject.toml 2024-11-29 16:01:19 +08:00
unclecode
449dd7cc0b Migrating from the classic setup.py to a using PyProject approach. 2024-11-29 14:45:04 +08:00
UncleCode
aa3e2d0fe6 Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-11-28 20:03:43 +08:00
UncleCode
7d81c17cca fix: improve handling of CRAWL4_AI_BASE_DIRECTORY environment variable in setup.py 2024-11-28 20:02:39 +08:00
UncleCode
1d83c493af Enhance setup process and update contributors list
- Acknowledge contributor paulokuong for fixing RAWL4_AI_BASE_DIRECTORY issue
  - Refine base directory handling in `setup.py`
  - Clarify Playwright installation instructions and improve error handling
2024-11-28 19:58:40 +08:00
Paulo Kuong
cf35cbe59e
CRAWL4_AI_BASE_DIRECTORY should be Path object instead of string (#298)
Thank you so much for your point. Yes, that's correct. I accept your pull request, and I add your name to a contribution list. Thank you again.
2024-11-28 19:46:36 +08:00
UncleCode
b6af94cbbb Merge remote-tracking branch 'origin/main' into 0.3.74 2024-11-18 21:15:04 +08:00
UncleCode
f9fe6f89fe feat(database): implement version management and migration checks during initialization 2024-11-17 18:09:33 +08:00
UncleCode
5098442086 refactor: migrate versioning to __version__.py and remove deprecated _version.py 2024-11-16 15:30:24 +08:00
UncleCode
d0014c6793 New async database manager and migration support
- Introduced AsyncDatabaseManager for async DB management.
  - Added migration feature to transition to file-based storage.
  - Enhanced web crawler with improved caching logic.
  - Updated requirements and setup for async processing.
2024-11-16 14:54:41 +08:00
Mahesh
00026b5f8b feat(config): Adding a configurable way of setting the cache directory for constrained environments 2024-11-12 14:52:51 -07:00
UncleCode
67a23c3182 feat(core): Release v0.3.73 with Browser Takeover and Docker Support
Major changes:
- Add browser takeover feature using CDP for authentic browsing
- Implement Docker support with full API server documentation
- Enhance Mockdown with tag preservation system
- Improve parallel crawling performance

This release focuses on authenticity and scalability, introducing the ability
to use users' own browsers while providing containerized deployment options.
Breaking changes include modified browser handling and API response structure.

See CHANGELOG.md for detailed migration guide.
2024-11-05 20:04:18 +08:00
UncleCode
e6c914d2fa Refactor version management and remove deprecated gitignore.dev file 2024-11-04 16:51:59 +08:00
unclecode
bccadec887 Remove dependency on psutil, PyYaml, and extend requests version range 2024-09-29 17:07:06 +08:00
unclecode
8b6e88c85c Update .gitignore to ignore temporary and test directories 2024-09-26 15:09:49 +08:00
unclecode
f1eee09cf4 Update README, add manifest, make selenium optional library 2024-09-25 16:35:14 +08:00
unclecode
4d48bd31ca Push async version last changes for merge to main branch 2024-09-24 20:52:08 +08:00
unclecode
b179aa9b6f Refactor website content and setup.py descriptions for consistent terminology 2024-09-12 16:50:52 +08:00
unclecode
dec3d44224 refactor: Update extraction strategy to handle schema extraction with non-empty schema
This code change updates the `LLMExtractionStrategy` class to handle schema extraction when the schema is non-empty. Previously, the schema extraction was only triggered when the `extract_type` was set to "schema", regardless of whether a schema was provided. With this update, the schema extraction will only be performed if the `extract_type` is "schema" and a non-empty schema is provided. This ensures that the extraction strategy behaves correctly and avoids unnecessary schema extraction when not needed. Also "numpy" is removed from default installation mode.
2024-08-19 15:37:07 +08:00
unclecode
e5e6a34e80 ## [v0.2.77] - 2024-08-04
Significant improvements in text processing and performance:

- 🚀 **Dependency reduction**: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy.
- 🤖 **Transformer upgrade**: Implemented text sequence classification using a transformer model for labeling text chunks.
-  **Performance enhancement**: Improved model loading speed due to removal of spaCy dependency.
- 🔧 **Future-proofing**: Laid groundwork for potential complete removal of spaCy dependency in future versions.

These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.
2024-08-04 14:54:18 +08:00
unclecode
659c8cd953 refactor: Update image description minimum word threshold in get_content_of_website_optimized 2024-08-02 15:55:32 +08:00
unclecode
fa5516aad6 chore: Refactor setup.py to use pathlib and shutil for folder creation and removal, to remove cache folder in cross platform manner. 2024-07-09 13:25:00 +08:00
unclecode
4d283ab386 ## [v0.2.74] - 2024-07-08
A slew of exciting updates to improve the crawler's stability and robustness! 🎉

- 💻 **UTF encoding fix**: Resolved the Windows \"charmap\" error by adding UTF encoding.
- 🛡️ **Error handling**: Implemented MaxRetryError exception handling in LocalSeleniumCrawlerStrategy.
- 🧹 **Input sanitization**: Improved input sanitization and handled encoding issues in LLMExtractionStrategy.
- 🚮 **Database cleanup**: Removed existing database file and initialized a new one.
2024-07-08 16:33:25 +08:00
unclecode
3ff2a0d0e7 Merge branch 'main' of https://github.com/unclecode/crawl4ai 2024-07-03 15:26:47 +08:00
unclecode
9926eb9f95 feat: Bump version to v0.2.73 and update documentation
This commit updates the version number to v0.2.73 and makes corresponding changes in the README.md and Dockerfile.

Docker file install the default mode, this resolve many of installation issues.

Additionally, the installation instructions are updated to include support for different modes. Setup.py doesn't have anymore dependancy on Spacy.

The change log is also updated to reflect these changes.

Supporting websites need with-head browser.
2024-07-03 15:19:22 +08:00
shiv
a08f21d66c Fix UnicodeDecodeError by reading README.md with UTF-8 encoding 2024-06-30 20:27:33 +05:30
unclecode
685706e0aa Update version, and change log 2024-06-30 00:17:43 +08:00
unclecode
61ae2de841 1/Update setup.py to support following modes:
- default (most frequent mode)
- torch
- transformers
- all
2/ Update Docker file
3/ Update documentation as well.
2024-06-30 00:15:29 +08:00
unclecode
d11a83c232 ## [0.2.71] 2024-06-26
• Refactored `crawler_strategy.py` to handle exceptions and improve error messages
• Improved `get_content_of_website_optimized` function in `utils.py` for better performance
• Updated `utils.py` with latest changes
• Migrated to `ChromeDriverManager` for resolving Chrome driver download issues
2024-06-26 15:34:15 +08:00
unclecode
78cfad8b2f chore: Update version to 0.2.7 and improve extraction function speed 2024-06-24 22:39:56 +08:00
unclecode
2c2362b4d3 issue 19 is resolved
- Update Dockerfile to install mkdocs and build documentation
2024-06-22 17:18:00 +08:00
unclecode
539263a8ba chore: Update configuration values for chunk token threshold, overlap rate, and minimum word threshold. Create a new example for LLMExtraction Strategy, update Dockerfile, and README 2024-06-19 18:32:20 +08:00
unclecode
853b9d59d8 feat: Add hooks for enhanced control over Selenium drivers
- Added six hooks: on_driver_created, before_get_url, after_get_url, before_return_html, on_user_agent_updated.
- Included example usage in quickstart.py.
- Updated README and changelog.
2024-06-18 20:00:51 +08:00
unclecode
42a5da854d Update version and change log. 2024-06-17 14:47:58 +08:00
unclecode
0533aeb814 v0.2.3:
- Extract all media tags
- Take screenshot of the page
2024-06-07 15:23:13 +08:00
unclecode
51f26d12fe Update for v0.2.2
- Support multiple JS scripts
- Fixed some of bugs
- Resolved a few issue relevant to Colab installation
2024-06-02 15:40:18 +08:00
unclecode
52c4be0696 Update setup.py version to 0.2.1 2024-05-19 22:30:59 +08:00
UncleCode
bc27982992
Update setup.py Handle Spacy installation 2024-05-17 22:11:00 +08:00
unclecode
957a2458b1 chore: Update web crawler URLs to use NBC News business section 2024-05-17 18:11:13 +08:00
unclecode
3593f017d7 chore: Update setup.py to exclude torch, transformers, and nltk dependencies
This commit updates the setup.py file to exclude the torch, transformers, and nltk dependencies from the install_requires section. Instead, it creates separate extras_require sections for different environments, including all requirements, excluding torch for Colab, and excluding torch, transformers, and nltk for the crawl environment.
2024-05-17 16:01:04 +08:00
unclecode
e7bb76f19b chore: Update torch dependency to version 2.3.0 2024-05-17 15:52:39 +08:00
unclecode
bf3b040f10 chore: Update pip installation command and requirements, add new dependencies 2024-05-17 15:21:45 +08:00