crawl4ai

Author	SHA1	Message	Date
UncleCode	c4f5651199	chore(deps): upgrade to Python 3.12 and prepare for 0.6.0 release - Update Docker base image to Python 3.12-slim-bookworm - Bump version from 0.6.0rc1 to 0.6.0 - Update documentation to reflect release version changes - Fix license specification in pyproject.toml and setup.py - Clean up code formatting in demo_docker_api.py BREAKING CHANGE: Base Python version upgraded from 3.10 to 3.12	2025-04-23 16:35:15 +08:00
UncleCode	8ec12d7d68	Apply Ruff Corrections	2025-01-13 19:19:58 +08:00
UncleCode	67f65f958b	refactor(build): simplify setup.py configuration - Remove dependency management from setup.py - Remove entry points configuration (moved to pyproject.toml) - Keep minimal setup.py for backwards compatibility - Clean up package metadata structure	2025-01-01 15:52:01 +08:00
UncleCode	78b6ba5cef	build: modernize package configuration with pyproject.toml - Add pyproject.toml for PEP 517 build system support - Configure dependencies, scripts, and metadata in pyproject.toml - Set Python requirement to >=3.9 and add support up to 3.13 - Keep setup.py for backwards compatibility - Move package dependencies and entry points to pyproject.toml	2025-01-01 15:45:27 +08:00
UncleCode	3f019d34cc	docs: update project description emojis - Change project description emojis from 🔥🕷️ to 🚀🤖 - Update emojis consistently in both setup.py and pyproject.toml	2025-01-01 15:39:33 +08:00
UncleCode	84b311760f	Commit Message: Enhance Crawl4AI with CLI and documentation updates - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py` - Added chunking strategies and their documentation in `llm.txt`	2024-12-21 14:26:56 +08:00
UncleCode	e9e5b5642d	Fix js_snippet issue in 0.4.21; bump to 0.4.22	2024-12-15 19:49:30 +08:00
UncleCode	d202f3539b	Enhance installation and migration processes - Added a post-installation setup script for initialization. - Updated README with installation notes for Playwright setup. - Enhanced migration logging for better error visibility. - Added 'pydantic' to requirements. - Bumped version to 0.3.746.	2024-11-29 18:48:44 +08:00
UncleCode	12e73d4898	refactor: remove legacy build hooks and setup files, migrate to setup.cfg and pyproject.toml	2024-11-29 16:01:19 +08:00
unclecode	449dd7cc0b	Migrating from the classic setup.py to a pyproject.toml-based approach.	2024-11-29 14:45:04 +08:00
UncleCode	aa3e2d0fe6	Merge branch 'main' of https://github.com/unclecode/crawl4ai	2024-11-28 20:03:43 +08:00
UncleCode	7d81c17cca	fix: improve handling of CRAWL4_AI_BASE_DIRECTORY environment variable in setup.py	2024-11-28 20:02:39 +08:00
UncleCode	1d83c493af	Enhance setup process and update contributors list - Acknowledge contributor paulokuong for fixing CRAWL4_AI_BASE_DIRECTORY issue - Refine base directory handling in `setup.py` - Clarify Playwright installation instructions and improve error handling	2024-11-28 19:58:40 +08:00
Paulo Kuong	cf35cbe59e	CRAWL4_AI_BASE_DIRECTORY should be Path object instead of string (#298) Thank you so much for your point. Yes, that's correct. I accept your pull request, and I add your name to a contribution list. Thank you again.	2024-11-28 19:46:36 +08:00
UncleCode	b6af94cbbb	Merge remote-tracking branch 'origin/main' into 0.3.74	2024-11-18 21:15:04 +08:00
UncleCode	f9fe6f89fe	feat(database): implement version management and migration checks during initialization	2024-11-17 18:09:33 +08:00
UncleCode	5098442086	refactor: migrate versioning to __version__.py and remove deprecated _version.py	2024-11-16 15:30:24 +08:00
UncleCode	d0014c6793	New async database manager and migration support - Introduced AsyncDatabaseManager for async DB management. - Added migration feature to transition to file-based storage. - Enhanced web crawler with improved caching logic. - Updated requirements and setup for async processing.	2024-11-16 14:54:41 +08:00
Mahesh	00026b5f8b	feat(config): Adding a configurable way of setting the cache directory for constrained environments	2024-11-12 14:52:51 -07:00
UncleCode	67a23c3182	feat(core): Release v0.3.73 with Browser Takeover and Docker Support Major changes: - Add browser takeover feature using CDP for authentic browsing - Implement Docker support with full API server documentation - Enhance Mockdown with tag preservation system - Improve parallel crawling performance This release focuses on authenticity and scalability, introducing the ability to use users' own browsers while providing containerized deployment options. Breaking changes include modified browser handling and API response structure. See CHANGELOG.md for detailed migration guide.	2024-11-05 20:04:18 +08:00
UncleCode	e6c914d2fa	Refactor version management and remove deprecated gitignore.dev file	2024-11-04 16:51:59 +08:00
unclecode	bccadec887	Remove dependency on psutil, PyYaml, and extend requests version range	2024-09-29 17:07:06 +08:00
unclecode	8b6e88c85c	Update .gitignore to ignore temporary and test directories	2024-09-26 15:09:49 +08:00
unclecode	f1eee09cf4	Update README, add manifest, make selenium optional library	2024-09-25 16:35:14 +08:00
unclecode	4d48bd31ca	Push async version last changes for merge to main branch	2024-09-24 20:52:08 +08:00
unclecode	b179aa9b6f	Refactor website content and setup.py descriptions for consistent terminology	2024-09-12 16:50:52 +08:00
unclecode	dec3d44224	refactor: Update extraction strategy to handle schema extraction with non-empty schema This code change updates the `LLMExtractionStrategy` class to handle schema extraction when the schema is non-empty. Previously, the schema extraction was only triggered when the `extract_type` was set to "schema", regardless of whether a schema was provided. With this update, the schema extraction will only be performed if the `extract_type` is "schema" and a non-empty schema is provided. This ensures that the extraction strategy behaves correctly and avoids unnecessary schema extraction when not needed. Also "numpy" is removed from default installation mode.	2024-08-19 15:37:07 +08:00
unclecode	e5e6a34e80	## [v0.2.77] - 2024-08-04 Significant improvements in text processing and performance: - 🚀 Dependency reduction: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy. - 🤖 Transformer upgrade: Implemented text sequence classification using a transformer model for labeling text chunks. - ⚡ Performance enhancement: Improved model loading speed due to removal of spaCy dependency. - 🔧 Future-proofing: Laid groundwork for potential complete removal of spaCy dependency in future versions. These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.	2024-08-04 14:54:18 +08:00
unclecode	659c8cd953	refactor: Update image description minimum word threshold in get_content_of_website_optimized	2024-08-02 15:55:32 +08:00
unclecode	fa5516aad6	chore: Refactor setup.py to use pathlib and shutil for folder creation and removal, to remove cache folder in cross platform manner.	2024-07-09 13:25:00 +08:00
unclecode	4d283ab386	## [v0.2.74] - 2024-07-08 A slew of exciting updates to improve the crawler's stability and robustness! 🎉 - 💻 UTF encoding fix: Resolved the Windows "charmap" error by adding UTF encoding. - 🛡️ Error handling: Implemented MaxRetryError exception handling in LocalSeleniumCrawlerStrategy. - 🧹 Input sanitization: Improved input sanitization and handled encoding issues in LLMExtractionStrategy. - 🚮 Database cleanup: Removed existing database file and initialized a new one.	2024-07-08 16:33:25 +08:00
unclecode	3ff2a0d0e7	Merge branch 'main' of https://github.com/unclecode/crawl4ai	2024-07-03 15:26:47 +08:00
unclecode	9926eb9f95	feat: Bump version to v0.2.73 and update documentation This commit updates the version number to v0.2.73 and makes corresponding changes in the README.md and Dockerfile. The Dockerfile now installs the default mode, which resolves many installation issues. Additionally, the installation instructions are updated to include support for different modes. setup.py no longer depends on spaCy. The change log is also updated to reflect these changes. Adds support for websites that need a headed (non-headless) browser.	2024-07-03 15:19:22 +08:00
shiv	a08f21d66c	Fix UnicodeDecodeError by reading README.md with UTF-8 encoding	2024-06-30 20:27:33 +05:30
unclecode	685706e0aa	Update version, and change log	2024-06-30 00:17:43 +08:00
unclecode	61ae2de841	1/Update setup.py to support following modes: - default (most frequent mode) - torch - transformers - all 2/ Update Docker file 3/ Update documentation as well.	2024-06-30 00:15:29 +08:00
unclecode	d11a83c232	## [0.2.71] 2024-06-26 • Refactored `crawler_strategy.py` to handle exceptions and improve error messages • Improved `get_content_of_website_optimized` function in `utils.py` for better performance • Updated `utils.py` with latest changes • Migrated to `ChromeDriverManager` for resolving Chrome driver download issues	2024-06-26 15:34:15 +08:00
unclecode	78cfad8b2f	chore: Update version to 0.2.7 and improve extraction function speed	2024-06-24 22:39:56 +08:00
unclecode	2c2362b4d3	issue 19 is resolved - Update Dockerfile to install mkdocs and build documentation	2024-06-22 17:18:00 +08:00
unclecode	539263a8ba	chore: Update configuration values for chunk token threshold, overlap rate, and minimum word threshold. Create a new example for LLMExtraction Strategy, update Dockerfile, and README	2024-06-19 18:32:20 +08:00
unclecode	853b9d59d8	feat: Add hooks for enhanced control over Selenium drivers - Added five hooks: on_driver_created, before_get_url, after_get_url, before_return_html, on_user_agent_updated. - Included example usage in quickstart.py. - Updated README and changelog.	2024-06-18 20:00:51 +08:00
unclecode	42a5da854d	Update version and change log.	2024-06-17 14:47:58 +08:00
unclecode	0533aeb814	v0.2.3: - Extract all media tags - Take screenshot of the page	2024-06-07 15:23:13 +08:00
unclecode	51f26d12fe	Update for v0.2.2 - Support multiple JS scripts - Fixed some bugs - Resolved a few issues related to Colab installation	2024-06-02 15:40:18 +08:00
unclecode	52c4be0696	Update setup.py version to 0.2.1	2024-05-19 22:30:59 +08:00
UncleCode	bc27982992	Update setup.py Handle Spacy installation	2024-05-17 22:11:00 +08:00
unclecode	957a2458b1	chore: Update web crawler URLs to use NBC News business section	2024-05-17 18:11:13 +08:00
unclecode	3593f017d7	chore: Update setup.py to exclude torch, transformers, and nltk dependencies This commit updates the setup.py file to exclude the torch, transformers, and nltk dependencies from the install_requires section. Instead, it creates separate extras_require sections for different environments, including all requirements, excluding torch for Colab, and excluding torch, transformers, and nltk for the crawl environment.	2024-05-17 16:01:04 +08:00
unclecode	e7bb76f19b	chore: Update torch dependency to version 2.3.0	2024-05-17 15:52:39 +08:00
unclecode	bf3b040f10	chore: Update pip installation command and requirements, add new dependencies	2024-05-17 15:21:45 +08:00

54 Commits