crawl4ai

mirror of https://github.com/unclecode/crawl4ai.git synced 2025-12-29 19:39:41 +00:00

Author	SHA1	Message	Date
unclecode	2fada16abb	chore: Update crawl4ai package with AsyncWebCrawler and JsonCssExtractionStrategy	2024-09-03 23:32:27 +08:00
unclecode	c37614cbc8	Add Async Version, JsonCss Extractor	2024-09-03 01:27:00 +08:00
unclecode	3116f95c1a	Merge branch 'pull-84' into staging	2024-09-01 16:44:06 +08:00
unclecode	b0e8b66666	Merge branch 'proxy-support' into staging	2024-09-01 16:35:14 +08:00
unclecode	3caf48c9be	refactor: Update LocalSeleniumCrawlerStrategy to execute JS code if provided	2024-09-01 16:34:51 +08:00
Umut CAN	3c6ebb73ae	Update web_crawler.py Improve code efficiency, readability, and maintainability in web_crawler.py	2024-08-30 15:30:06 +03:00
UncleCode	0d9b638636	Merge pull request #75 from aravindkarnam/main Added support to source tags wrapped inside video and audio tags. Ext…	2024-08-30 12:54:15 +02:00
datehoer	2ba70b9501	add use proxy and llm baseurl examples	2024-08-27 10:14:54 +08:00
datehoer	16f98cebc0	replace base64 image url to ''	2024-08-27 09:44:35 +08:00
datehoer	fe9ff498ce	add proxy and add ai base_url	2024-08-26 16:12:49 +08:00
Datehoer	eba831ca30	fix spelling mistake	2024-08-26 15:29:23 +08:00
unclecode	dec3d44224	refactor: Update extraction strategy to handle schema extraction with non-empty schema This code change updates the `LLMExtractionStrategy` class to handle schema extraction when the schema is non-empty. Previously, the schema extraction was only triggered when the `extract_type` was set to "schema", regardless of whether a schema was provided. With this update, the schema extraction will only be performed if the `extract_type` is "schema" and a non-empty schema is provided. This ensures that the extraction strategy behaves correctly and avoids unnecessary schema extraction when not needed. Also "numpy" is removed from default installation mode.	2024-08-19 15:37:07 +08:00
Aravind Karnam	9ed1551125	Added support to source tags wrapped inside video and audio tags. Extended the text extraction to video and audio elements in media. https://github.com/unclecode/crawl4ai/issues/71	2024-08-14 11:07:26 +05:30
unclecode	e5e6a34e80	## [v0.2.77] - 2024-08-04 Significant improvements in text processing and performance: - 🚀 Dependency reduction: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy. - 🤖 Transformer upgrade: Implemented text sequence classification using a transformer model for labeling text chunks. - ⚡ Performance enhancement: Improved model loading speed due to removal of spaCy dependency. - 🔧 Future-proofing: Laid groundwork for potential complete removal of spaCy dependency in future versions. These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI. v0.2.77	2024-08-04 14:54:18 +08:00
unclecode	897e766728	Update README	2024-08-02 16:04:14 +08:00
unclecode	9200a6731d	## [v0.2.76] - 2024-08-02 Major improvements in functionality, performance, and cross-platform compatibility! 🚀 - 🐳 Docker enhancements: Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows. - 🌐 Official Docker Hub image: Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai). - 🔧 Selenium upgrade: Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility. - 🖼️ Image description: Implemented ability to generate textual descriptions for extracted images from web pages. - ⚡ Performance boost: Various improvements to enhance overall speed and performance.	2024-08-02 16:02:42 +08:00
unclecode	61c166ab19	refactor: Update Crawl4AI version to v0.2.76 This commit updates the Crawl4AI version from v0.2.7765 to v0.2.76. The version number is updated in the README.md file. This change ensures consistency and reflects the correct version of the software.	2024-08-02 15:55:53 +08:00
unclecode	659c8cd953	refactor: Update image description minimum word threshold in get_content_of_website_optimized	2024-08-02 15:55:32 +08:00
unclecode	9ee988753d	refactor: Update image description minimum word threshold in get_content_of_website_optimized	2024-08-02 14:53:11 +08:00
unclecode	8ae6c43ca4	refactor: Update Dockerfile to install Crawl4AI with specified options	2024-08-01 20:13:06 +08:00
unclecode	b6713870ef	refactor: Update Dockerfile to install Crawl4AI with specified options This commit updates the Dockerfile to install Crawl4AI with the specified options. The `INSTALL_OPTION` build argument is used to determine which additional packages to install. If the option is set to "all", all models will be downloaded. If the option is set to "torch", only torch models will be downloaded. If the option is set to "transformer", only transformer models will be downloaded. If no option is specified, the default installation will be used. This change improves the flexibility and customization of the Crawl4AI installation process.	2024-08-01 17:56:19 +08:00
unclecode	40477493d3	refactor: Remove image format dot in get_content_of_website_optimized The code change removes the dot from the image format in the `get_content_of_website_optimized` function. This change ensures consistency in the image format and improves the functionality.	2024-07-31 16:15:55 +08:00
Kevin Moturi	efcf3ac6eb	Update LocalSeleniumCrawlerStrategy to resolve ChromeDriver version mismatch issue This resolves the following error: `selenium.common.exceptions.SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 114` Windows users are getting.	2024-07-31 13:33:09 +08:00
unclecode	9e43f7beda	refactor: Temporarily disable fetching image file size in get_content_of_website_optimized Set the `image_size` variable to 0 in the `get_content_of_website_optimized` function to temporarily disable fetching the image file size. This change addresses performance issues and will be improved in a future update. Update Dockerfile for Linux users.	2024-07-31 13:29:23 +08:00
unclecode	aa9412e1b4	refactor: Set image_size to 0 in get_content_of_website_optimized The code change sets the `image_size` variable to 0 in the `get_content_of_website_optimized` function. This change is made to temporarily disable fetching the image file size, which was causing performance issues. The image size will be fetched in a future update to improve the functionality.	2024-07-23 13:08:53 +08:00
Aravind Karnam	cf6c835e18	moved score threshold to config.py & replaced the separator for tag.get_text in find_closest_parent_with_useful_text fn from period(.) to space( ) to keep the text more neutral.	2024-07-21 15:18:23 +05:30
Aravind Karnam	e5ecf291f3	Implemented filtering for images and grabbing the contextual text from nearest parent	2024-07-21 15:03:17 +05:30
Aravind Karnam	9d0cafcfa6	fixed import error in model_loader.py	2024-07-21 14:55:58 +05:30
unclecode	7715623430	chore: Fix typos and update .gitignore These changes fix typos in `chunking_strategy.py` and `crawler_strategy.py` to improve code readability. Additionally, the `.test_pads/` directory is removed from the `.gitignore` file to keep the repository clean and organized. v0.0.75	2024-07-19 17:42:39 +08:00
unclecode	f5a4e80e2c	chore: Fix typo in chunking_strategy.py and crawler_strategy.py The commit fixes a typo in the `chunking_strategy.py` file where `nl.toknize.TextTilingTokenizer()` was corrected to `nl.tokenize.TextTilingTokenizer()`. Additionally, in the `crawler_strategy.py` file, the commit converts the screenshot image to RGB mode before saving it as a JPEG. This ensures consistent image quality and compression.	2024-07-19 17:40:31 +08:00
unclecode	8463aabedf	chore: Remove .test_pads/ directory from .gitignore	2024-07-19 17:09:29 +08:00
unclecode	7f30144ef2	chore: Remove .tests/ directory from .gitignore	2024-07-09 15:10:18 +08:00
unclecode	fa5516aad6	chore: Refactor setup.py to use pathlib and shutil for folder creation and removal, to remove cache folder in cross platform manner.	2024-07-09 13:25:00 +08:00
unclecode	ca0336af9e	feat: Add error handling for rate limit exceeded in form submission This commit adds error handling for rate limit exceeded in the form submission process. If the server returns a 429 status code, the client will display an error message indicating the rate limit has been exceeded and provide information on when the user can try again. This improves the user experience by providing clear feedback and guidance when rate limits are reached.	2024-07-08 20:24:00 +08:00
unclecode	65ed1aeade	feat: Add rate limiting functionality with custom handlers	2024-07-08 20:02:12 +08:00
unclecode	4d283ab386	## [v0.2.74] - 2024-07-08 A slew of exciting updates to improve the crawler's stability and robustness! 🎉 - 💻 UTF encoding fix: Resolved the Windows "charmap" error by adding UTF encoding. - 🛡️ Error handling: Implemented MaxRetryError exception handling in LocalSeleniumCrawlerStrategy. - 🧹 Input sanitization: Improved input sanitization and handled encoding issues in LLMExtractionStrategy. - 🚮 Database cleanup: Removed existing database file and initialized a new one. v0.2.74	2024-07-08 16:33:25 +08:00
unclecode	3ff2a0d0e7	Merge branch 'main' of https://github.com/unclecode/crawl4ai v0.2.73	2024-07-03 15:26:47 +08:00
unclecode	3cd1b3719f	Bump version to v0.2.73, update documentation, and resolve installation issues	2024-07-03 15:26:43 +08:00
unclecode	9926eb9f95	feat: Bump version to v0.2.73 and update documentation This commit updates the version number to v0.2.73 and makes corresponding changes in the README.md and Dockerfile. The Dockerfile now installs the default mode, which resolves many installation issues. Additionally, the installation instructions are updated to include support for different modes. setup.py no longer depends on spaCy. The change log is also updated to reflect these changes. Adds support for websites that need a headed (non-headless) browser.	2024-07-03 15:19:22 +08:00
UncleCode	3abaa82501	Merge pull request #37 from shivkumar0757/fix-readme-encoding @shivkumar0757 Great work! I value your contribution and have merged your pull request. You will be credited in the upcoming change-log. Thank you for your continuous support in advancing this library, to democratize an open access crawler to everyone.	2024-07-01 07:31:07 +02:00
unclecode	88d8cd8650	feat: Add page load check for LocalSeleniumCrawlerStrategy This commit adds a page load check for the LocalSeleniumCrawlerStrategy in the `crawl` method. The `_ensure_page_load` method is introduced to ensure that the page has finished loading before proceeding. This helps to prevent issues with incomplete page sources and improves the reliability of the crawler.	2024-07-01 00:07:32 +08:00
shiv	a08f21d66c	Fix UnicodeDecodeError by reading README.md with UTF-8 encoding	2024-06-30 20:27:33 +05:30
unclecode	d58286989c	UPDATE DOCUMENTS	2024-06-30 00:34:02 +08:00
unclecode	b58af3349c	chore: Update installation instructions with support for different modes v0.2.72	2024-06-30 00:22:17 +08:00
unclecode	940df4631f	Update ChangeLog	2024-06-30 00:18:40 +08:00
unclecode	685706e0aa	Update version, and change log	2024-06-30 00:17:43 +08:00
unclecode	7b0979e134	Update README and Dockerfile	2024-06-30 00:15:43 +08:00
unclecode	61ae2de841	1/Update setup.py to support following modes: - default (most frequent mode) - torch - transformers - all 2/ Update Docker file 3/ Update documentation as well.	2024-06-30 00:15:29 +08:00
unclecode	5b28eed2c0	Add a temporary solution for when we can't crawl websites in headless mode.	2024-06-29 23:25:50 +08:00
unclecode	f8a11779fe	Update change log	2024-06-26 16:48:36 +08:00

220 Commits