crawl4ai

Author	SHA1	Message	Date
UncleCode	2d6b19e1a2	refactor(browser): improve browser path management Implement more robust browser executable path handling using playwright's built-in browser management. This change: - Adds async browser path resolution - Implements path caching in the home folder - Removes hardcoded browser paths - Adds httpx dependency - Removes obsolete test result files This change makes the browser path resolution more reliable across different platforms and environments.	2025-01-17 22:14:37 +08:00
UncleCode	8ec12d7d68	Apply Ruff Corrections	2025-01-13 19:19:58 +08:00
UncleCode	d5ed451299	Enhance crawler capabilities and documentation - Add llm.txt generator - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation.	2024-12-25 21:34:31 +08:00
UncleCode	2d31915f0a	Commit Message: Enhance Async Crawler with storage state handling - Updated Async Crawler to support storage state management. - Added error handling for URL validation in Async Web Crawler. - Modified README logo and improved .gitignore entries. - Fixed issues in multiple files for better code robustness.	2024-12-09 20:04:59 +08:00
UncleCode	93bf3e8a1f	Refactor Dockerfile and clean up main.py - Enhanced Dockerfile for platform-specific installations - Added ARG for TARGETPLATFORM and BUILDPLATFORM - Improved GPU support conditional on TARGETPLATFORM - Removed static pages mounting in main.py - Streamlined code structure to improve maintainability	2024-11-29 20:08:09 +08:00
UncleCode	852729ff38	feat(docker): add Docker Compose configurations for local and hub deployment; enhance GPU support checks in Dockerfile feat(requirements): update requirements.txt to include snowballstemmer fix(version_manager): correct version parsing to use __version__.__version__ feat(main): introduce chunking strategy and content filter in CrawlRequest model feat(content_filter): enhance BM25 algorithm with priority tag scoring for improved content relevance feat(logger): implement new async logger engine replacing print statements throughout library fix(database): resolve version-related deadlock and circular lock issues in database operations docs(docker): expand Docker deployment documentation with usage instructions for Docker Compose	2024-11-18 21:00:06 +08:00
UncleCode	2a82455b3d	feat(crawl): implement direct crawl functionality and introduce CacheMode for improved caching control	2024-11-17 17:17:34 +08:00
UncleCode	4b45b28f25	feat(docs): enhance deployment documentation with one-click setup, API security details, and Docker Compose examples	2024-11-16 18:44:47 +08:00
UncleCode	6360d0545a	feat(api): add API token authentication and update Dockerfile description	2024-11-16 18:08:56 +08:00
UncleCode	90df6921b7	feat(crawl_sync): add synchronous crawl endpoint and corresponding test	2024-11-16 15:34:30 +08:00
UncleCode	b6d6631b12	Enhance Async Crawler with Playwright support - Implemented new async crawler strategy using Playwright. - Introduced ManagedBrowser for better browser management. - Added support for persistent browser sessions and improved error handling. - Updated version from 0.3.73 to 0.3.731. - Enhanced logic in main.py for conditional mounting of static files. - Updated requirements to replace playwright_stealth with tf-playwright-stealth.	2024-11-12 12:10:58 +08:00
UncleCode	f7574230a1	Update API server request object, text_docker file, and Readme	2024-11-07 19:29:31 +08:00
UncleCode	b51263664e	feat(api): add CORS support and static file serving, update root redirect	2024-11-05 21:02:47 +08:00
UncleCode	67a23c3182	feat(core): Release v0.3.73 with Browser Takeover and Docker Support Major changes: - Add browser takeover feature using CDP for authentic browsing - Implement Docker support with full API server documentation - Enhance Markdown with tag preservation system - Improve parallel crawling performance This release focuses on authenticity and scalability, introducing the ability to use users' own browsers while providing containerized deployment options. Breaking changes include modified browser handling and API response structure. See CHANGELOG.md for detailed migration guide.	2024-11-05 20:04:18 +08:00
UncleCode	c4c6227962	Creating the API server component	2024-11-04 20:33:15 +08:00
unclecode	ca0336af9e	feat: Add error handling for rate limit exceeded in form submission This commit adds error handling for rate limit exceeded in the form submission process. If the server returns a 429 status code, the client will display an error message indicating the rate limit has been exceeded and provide information on when the user can try again. This improves the user experience by providing clear feedback and guidance when rate limits are reached.	2024-07-08 20:24:00 +08:00
unclecode	65ed1aeade	feat: Add rate limiting functionality with custom handlers	2024-07-08 20:02:12 +08:00
unclecode	d58286989c	UPDATE DOCUMENTS	2024-06-30 00:34:02 +08:00
unclecode	144cfa0eda	Switch to ChromeDriverManager due to some issues with downloading the Chrome driver	2024-06-26 13:00:17 +08:00
unclecode	8c77a760fc	Fixed: - Redirect "/" to mkdocs	2024-06-22 20:54:32 +08:00
unclecode	b9bf8ac9d7	Fix mounting the "/" to mkdocs site folder	2024-06-22 20:41:39 +08:00
unclecode	d6182bedd7	chore: - Add demo page to the new mkdocs - Set website home page to mkdocs	2024-06-22 20:36:01 +08:00
unclecode	e7705e661a	ADD MKDocs	2024-06-21 17:56:54 +08:00
unclecode	b3a0edaa6d	- User agent - Extract Links - Extract Metadata - Update Readme - Update REST API document	2024-06-08 17:59:42 +08:00
unclecode	8e73a482a2	feat: Add screenshot functionality to crawl_urls The code changes in this commit add the `screenshot` parameter to the `crawl_urls` function in `main.py`. This allows users to specify whether they want to take a screenshot of the page during the crawling process. The default value is `False`.	2024-06-07 15:23:32 +08:00
unclecode	0533aeb814	v0.2.3: - Extract all media tags - Take screenshot of the page	2024-06-07 15:23:13 +08:00
UncleCode	7381fa95e6	Merge pull request #3 from QIN2DIM/main fix(main): UnicodeDecodeError	2024-05-23 09:29:28 +08:00
Unclecode	53d1176d53	chore: Update extraction strategy to support GPU, MPS, and CPU, add batch processing for CPU devices	2024-05-19 16:18:58 +00:00
QIN2DIM	5cee084340	fix(main): UnicodeDecodeError File "T:\_GitHubProjects\Forks\crawl4ai\main.py", line 70, in read_index partials[filename[:-5]] = file.read() UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 149: illegal multibyte sequence	2024-05-18 23:31:11 +08:00
Unclecode	bf00c26a83	chore: Update Dockerfile to install chromium-chromedriver and spacy library	2024-05-18 09:16:52 +00:00
unclecode	d7b37e849d	chore: Update CrawlRequest model to use NoExtractionStrategy as default	2024-05-17 16:50:38 +08:00
unclecode	5b80be956d	Update: - Debug - Refactor code for new version	2024-05-16 17:31:44 +08:00
unclecode	f6e59157bf	- Test all methods - Update index.html - Update Readme - Resolve some bugs	2024-05-14 21:27:41 +08:00
ntohidi	aa126e436b	Add CORS middleware for allowing all origins to make requests	2024-05-10 12:27:40 +02:00
unclecode	3ff1d15702	Change the project folder name from crawler to crawl4ai	2024-05-09 22:16:28 +08:00
unclecode	181250cb93	chore: Add function to clear the database	2024-05-09 19:42:43 +08:00
unclecode	b8e743cd8d	Initial Commit	2024-05-09 19:10:25 +08:00

37 Commits