790 Commits

Author SHA1 Message Date
UncleCode
2140d9aca4 fix(browser): correct headless mode default behavior
Modify BrowserConfig to respect explicit headless parameter setting instead of forcing True. Update version to 0.6.2 and clean up code formatting in examples.

BREAKING CHANGE: BrowserConfig no longer forces headless=True when headless=False is passed explicitly
2025-04-26 21:09:50 +08:00
UncleCode
ccec40ed17 feat(models): add dedicated tables field to CrawlResult
- Add tables field to CrawlResult model while maintaining backward compatibility
- Update async_webcrawler.py to extract tables from media and pass to tables field
- Update crypto_analysis_example.py to use the new tables field
- Add /config/dump examples to demo_docker_api.py
- Bump version to 0.6.1
2025-04-24 18:36:25 +08:00
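For commit ccec40ed17, a minimal sketch of how calling code might read tables so that it works both with and without the dedicated field. The `read_tables` helper and the `media["tables"]` key are assumptions made for illustration and are not confirmed by the commit message; `SimpleNamespace` stands in for a real CrawlResult.

```python
from types import SimpleNamespace


def read_tables(result):
    """Prefer the dedicated `tables` field, else fall back to media (assumed key)."""
    tables = getattr(result, "tables", None)
    if tables:
        return tables
    media = getattr(result, "media", None) or {}
    return media.get("tables", [])


# Stand-ins for a result with the new field and one that only has media.
new_style = SimpleNamespace(tables=[{"headers": ["coin", "price"]}], media={})
old_style = SimpleNamespace(media={"tables": [{"headers": ["coin"]}]})

print(len(read_tables(new_style)), len(read_tables(old_style)))
```

Reading through one helper like this keeps older call sites working while new code moves to the dedicated field.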
UncleCode
ad4dfb21e1 Remove "rc1" 2025-04-23 21:00:00 +08:00
UncleCode
7784b2468e feat(docs): enhance Ask AI button UX and add v0.6.0 release notes
Improve Ask AI button with better mobile support, animations, and positioning:
- Add button animations and hover effects
- Improve mobile responsiveness
- Add icon to button
- Fix positioning logic for different viewport sizes
- Add keyboard (Escape) support

Add comprehensive v0.6.0 release documentation:
- Create detailed release notes
- Update blog index with latest release
- Document all major features and breaking changes

BREAKING CHANGE: Documentation structure updated with new v0.6.0 section
2025-04-23 20:07:03 +08:00
UncleCode
146f9d415f Update README v0.6.0 2025-04-23 19:50:33 +08:00
UncleCode
37fd80e4b9 feat(docs): add mobile-friendly navigation menu
Implements a responsive hamburger menu for mobile devices with the following changes:
- Add new mobile_menu.js for handling mobile navigation
- Update layout.css with mobile-specific styles and animations
- Enhance README with updated geolocation example
- Register mobile_menu.js in mkdocs.yml

The mobile menu includes:
- Hamburger button animation
- Slide-out sidebar
- Backdrop overlay
- Touch-friendly navigation
- Proper event handling
2025-04-23 19:44:25 +08:00
UncleCode
949a93982e feat(docs): update documentation and disable Ask AI feature
Major documentation updates including:
- Add comprehensive code examples page
- Add video tutorial to homepage
- Update Docker deployment instructions for v0.6.0
- Temporarily disable Ask AI feature
- Add table border styling
- Update site version to v0.6.x

BREAKING CHANGE: Ask AI feature temporarily disabled pending launch
2025-04-23 19:02:39 +08:00
UncleCode
c4f5651199 chore(deps): upgrade to Python 3.12 and prepare for 0.6.0 release
- Update Docker base image to Python 3.12-slim-bookworm
- Bump version from 0.6.0rc1 to 0.6.0
- Update documentation to reflect release version changes
- Fix license specification in pyproject.toml and setup.py
- Clean up code formatting in demo_docker_api.py

BREAKING CHANGE: Base Python version upgraded from 3.10 to 3.12
2025-04-23 16:35:15 +08:00
UncleCode
b0aa8bc9f7 Update README v0.6.0rc1 2025-04-22 23:21:42 +08:00
UncleCode
c98ffe2130 Update CHANGELOG 2025-04-22 22:36:41 +08:00
UncleCode
4812f08a73 feat(docker): update Docker deployment for v0.6.0
Major updates to Docker deployment infrastructure:
- Switch default port to 11235 for all services
- Add MCP (Model Context Protocol) support with WebSocket/SSE endpoints
- Simplify docker-compose.yml with auto-platform detection
- Update documentation with new features and examples
- Consolidate configuration and improve resource management

BREAKING CHANGE: Default port changed from 8020 to 11235. Update your configurations and deployment scripts accordingly.
2025-04-22 22:35:25 +08:00
unclecode
f3ebb38edf Merge PR #899 into next, resolve conflicts in server.py and docs/browser-crawler-config.md 2025-04-22 14:56:47 +08:00
UncleCode
0007aea204 Update changelog 2025-04-21 23:21:49 +08:00
UncleCode
b5c25731e6 feat(browser): add geolocation, locale and timezone support
Add support for controlling browser geolocation, locale and timezone settings:
- New GeolocationConfig class for managing GPS coordinates
- Add locale and timezone_id parameters to CrawlerRunConfig
- Update browser context creation to handle location settings
- Add example script for geolocation usage
- Update documentation with location-based identity features

This enables more precise control over browser identity and location reporting.
2025-04-21 23:20:59 +08:00
UncleCode
5297e362f3 feat(mcp): Implement MCP protocol and enhance server capabilities
This commit introduces several significant enhancements to the Crawl4AI Docker deployment:

  1. Add MCP Protocol Support:
     - Implement WebSocket and SSE transport layers for MCP server communication
     - Create mcp_bridge.py to expose existing API endpoints via MCP protocol
     - Add comprehensive tests for both socket and SSE transport methods

  2. Enhance Docker Server Capabilities:
     - Add PDF generation endpoint with file saving functionality
     - Add screenshot capture endpoint with configurable wait time
     - Implement JavaScript execution endpoint for dynamic page interaction
     - Add intelligent file path handling for saving generated assets

  3. Improve Search and Context Functionality:
     - Implement syntax-aware code function chunking using AST parsing
     - Add BM25-based intelligent document search with relevance scoring
     - Create separate code and documentation context endpoints
     - Enhance response format with structured results and scores

  4. Rename and Fix File Organization:
     - Fix typo in test_docker_config_gen.py filename
     - Update import statements and dependencies
     - Add FileResponse for context endpoints

  This enhancement significantly improves the machine-to-machine communication
  capabilities of Crawl4AI, making it more suitable for integration with LLM agents
  and other automated systems.

2025-04-21 22:22:02 +08:00
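For commit 5297e362f3, a rough sketch of the two search ideas it names: splitting code into function-level chunks via AST parsing, then ranking those chunks with BM25. Everything here is an assumption for illustration: the `chunk_functions` and `bm25_scores` names, the sample source, the tokenizer, and the `k1`/`b` values are hypothetical, and the scoring uses a common BM25 variant with `+ 1` inside the logarithm. It is not the server's actual implementation.

```python
import ast
import math
import re
from collections import Counter

SOURCE = '''
def fetch_page(url):
    """Download a page and return its HTML."""
    return url

def parse_tables(html):
    """Extract tables from HTML markup."""
    return []

class Pool:
    def acquire(self):
        """Take a browser from the pool."""
        return None
'''


def chunk_functions(source):
    """Return (name, code) for every function definition found in the AST."""
    chunks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


chunks = chunk_functions(SOURCE)
scores = bm25_scores("extract tables html", [code for _, code in chunks])
ranked = sorted(zip(scores, (name for name, _ in chunks)), reverse=True)
print(ranked[0][1])
```

Chunking on function boundaries keeps each search hit a complete, readable unit instead of an arbitrary window of lines.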
UncleCode
a58c8000aa refactor(server): migrate to pool-based crawler management
Replace crawler_manager.py with simpler crawler_pool.py implementation:
- Add global page semaphore for hard concurrency cap
- Implement browser pool with idle cleanup
- Add playground UI for testing and stress testing
- Update API handlers to use pooled crawlers
- Enhance logging levels and symbols

BREAKING CHANGE: Removes CrawlerManager class in favor of simpler pool-based approach
2025-04-20 20:14:26 +08:00
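For commit a58c8000aa, a minimal sketch of what a global page semaphore acting as a hard concurrency cap can look like with `asyncio`. The cap value, the `crawl` coroutine, and the `stats` counter are hypothetical and exist only to make the effect observable; the real crawler_pool.py may be structured differently.

```python
import asyncio

MAX_PAGES = 2  # hypothetical hard cap on concurrently open pages
stats = {"active": 0, "peak": 0}


async def crawl(url, page_slots):
    # Every crawl must take a slot from the shared semaphore first.
    async with page_slots:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # placeholder for real page work
        stats["active"] -= 1
    return f"done:{url}"


async def main():
    page_slots = asyncio.Semaphore(MAX_PAGES)  # one semaphore shared by all tasks
    urls = [f"https://example.com/{i}" for i in range(5)]
    return await asyncio.gather(*(crawl(u, page_slots) for u in urls))


results = asyncio.run(main())
print(stats["peak"], len(results))
```

Because all tasks share one semaphore, the number of pages in flight never exceeds the cap regardless of how many requests arrive.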
Aravind Karnam
b27bb367e8 merge next. Resolve conflicts. Fix some import errors and error handling in server.py 2025-04-19 20:27:47 +05:30
Aravind Karnam
d2648eaa39 fix: solved with deepcopy of elements https://github.com/unclecode/crawl4ai/issues/902 2025-04-19 20:08:36 +05:30
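For commit d2648eaa39, a small sketch of why deep-copying elements helps: edits made to the copy leave the original tree untouched. `xml.etree.ElementTree` is used here only as a standard-library stand-in, and the sample markup is made up; the actual scraping strategies work on a different parser's elements.

```python
import copy
import xml.etree.ElementTree as ET

root = ET.fromstring("<body><nav>menu</nav><main><p>text</p></main></body>")

# Work on a deep copy of the target element instead of the element itself.
target = root.find("main")
working = copy.deepcopy(target)
working.remove(working.find("p"))

# The copy lost its paragraph, the original tree still has it.
print(len(working), len(target))
```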
Aravind Karnam
c2902fd200 revert: last change in order of execution, since it introduced a new issue in the generated content. https://github.com/unclecode/crawl4ai/issues/902 2025-04-19 19:46:20 +05:30
UncleCode
16b2318242 feat(api): implement crawler pool manager for improved resource handling
Adds a new CrawlerManager class to handle browser instance pooling and failover:
- Implements auto-scaling based on system resources
- Adds primary/backup crawler management
- Integrates memory monitoring and throttling
- Adds streaming support with memory tracking
- Updates API endpoints to use pooled crawlers

BREAKING CHANGE: API endpoints now require CrawlerManager initialization
2025-04-18 22:26:24 +08:00
UncleCode
907cba194f Merge branch 'next-stress' into next 2025-04-17 22:34:43 +08:00
UncleCode
3bf78ff47a refactor(docker-demo): enhance error handling and output formatting
Improve the Docker API demo script with better error handling, more detailed output,
and enhanced visualization:
- Add detailed error messages and stack traces for debugging
- Implement better status code handling and display
- Enhance JSON output formatting with monokai theme and word wrap
- Add depth information display for deep crawls
- Improve proxy usage reporting
- Fix port number inconsistency

No breaking changes.
2025-04-17 22:32:58 +08:00
UncleCode
921e0c46b6 feat(tests): implement high volume stress testing framework
Add comprehensive stress testing solution for SDK using arun_many and dispatcher system:
- Create test_stress_sdk.py for running high volume crawl tests
- Add run_benchmark.py for orchestrating tests with predefined configs
- Implement benchmark_report.py for generating performance reports
- Add memory tracking and local test site generation
- Support both streaming and batch processing modes
- Add detailed documentation in README.md

The framework enables testing SDK performance, concurrency handling,
and memory behavior under high-volume scenarios.
2025-04-17 22:31:51 +08:00
UncleCode
fd899f66aa Merge branch 'next-fix-markdown-source' into next 2025-04-17 20:16:15 +08:00
UncleCode
30ec4f571f feat(docs): add comprehensive Docker API demo script
Add a new example script demonstrating Docker API usage with extensive features:
- Basic crawling with single/multi URL support
- Markdown generation with various filters
- Parameter demonstrations (CSS, JS, screenshots, SSL, proxies)
- Extraction strategies using CSS and LLM
- Deep crawling capabilities with streaming
- Integration examples with proxy rotation and SSL certificate fetching

Also includes minor formatting improvements in async_webcrawler.py
2025-04-17 20:16:11 +08:00
UncleCode
7db6b468d9 feat(markdown): add content source selection for markdown generation
Adds a new content_source parameter to MarkdownGenerationStrategy that allows
selecting which HTML content to use for markdown generation:
- cleaned_html (default): uses post-processed HTML
- raw_html: uses original webpage HTML
- fit_html: uses preprocessed HTML for schema extraction

Changes include:
- Added content_source parameter to MarkdownGenerationStrategy
- Updated AsyncWebCrawler to handle HTML source selection
- Added examples and tests for the new feature
- Updated documentation with new parameter details

BREAKING CHANGE: Renamed cleaned_html parameter to input_html in generate_markdown()
method signature to better reflect its generalized purpose
2025-04-17 20:13:53 +08:00
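For commit 7db6b468d9, a minimal sketch of the selection logic the message describes: one of three HTML variants is picked by name, with cleaned_html as the default. The `select_html` function and its dictionary argument are hypothetical; only the three source names come from the commit message.

```python
VALID_SOURCES = ("cleaned_html", "raw_html", "fit_html")


def select_html(html_by_source, content_source="cleaned_html"):
    """Return the HTML variant named by content_source, rejecting unknown names."""
    if content_source not in VALID_SOURCES:
        raise ValueError(
            f"content_source must be one of {VALID_SOURCES}, got {content_source!r}"
        )
    return html_by_source[content_source]


# Made-up variants of the same page.
variants = {
    "raw_html": "<html><body><nav>menu</nav><p>Hi</p></body></html>",
    "cleaned_html": "<p>Hi</p>",
    "fit_html": "<body><p>Hi</p></body>",
}

print(select_html(variants))
print(select_html(variants, "raw_html"))
```

Rejecting unknown names early gives a clear error instead of silently generating markdown from the wrong input.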
Aravind Karnam
eed7f88f29 Merge branch 'next' into 2025-MAR-ALPHA-1 2025-04-17 10:50:02 +05:30
UncleCode
94d486579c docs(tests): clarify server URL comments in deep crawl tests
Improve documentation of test configuration URLs by adding clearer
comments explaining when to use each URL configuration - Docker vs
development mode.

No functional changes, only comment improvements.
2025-04-15 22:32:27 +08:00
UncleCode
5206c6f2d6 Modify the test file 2025-04-15 22:28:01 +08:00
UncleCode
230f22da86 refactor(proxy): move ProxyConfig to async_configs and improve LLM token handling
Moved ProxyConfig class from proxy_strategy.py to async_configs.py for better organization.
Improved LLM token handling with new PROVIDER_MODELS_PREFIXES.
Added test cases for deep crawling and proxy rotation.
Removed docker_config from BrowserConfig as it's handled separately.

BREAKING CHANGE: ProxyConfig import path changed from crawl4ai.proxy_strategy to crawl4ai
2025-04-15 22:27:18 +08:00
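For commit 230f22da86, a sketch of an import that tolerates both sides of the breaking change. The two import paths are the ones named in the commit message; the fallback to `None` when the package is not installed is an addition for illustration so the snippet runs anywhere.

```python
try:
    # New location after this commit.
    from crawl4ai import ProxyConfig
except ImportError:
    try:
        # Old location before this commit.
        from crawl4ai.proxy_strategy import ProxyConfig
    except ImportError:
        # crawl4ai is not installed in this environment.
        ProxyConfig = None

print("ProxyConfig available:", ProxyConfig is not None)
```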
UncleCode
793668a413 Remove parameter_updates.txt 2025-04-14 23:05:24 +08:00
UncleCode
82aa53aa59 Merge branch 'next-alpine-docker' into next 2025-04-14 23:01:22 +08:00
UncleCode
cd7ff6f9c1 feat(docs): add AI assistant interface and code copy button
Add new AI assistant chat interface with features:
- Real-time chat with markdown support
- Chat history management
- Citation tracking
- Selection-to-query functionality

Also adds code copy button to documentation code blocks and adjusts layout/styling.

Breaking changes: None
2025-04-14 23:00:47 +08:00
UncleCode
c56974cf59 feat(docs): enhance documentation UI with ToC and GitHub stats
Add new features to documentation UI:
- Add table of contents with scroll spy functionality
- Add GitHub repository statistics badge
- Implement new centered layout system with fixed sidebar
- Add conditional Playwright installation based on CRAWL4AI_MODE

Breaking changes: None
2025-04-14 20:46:32 +08:00
Aravind Karnam
dcc265458c fix: Add a nominal wait time for removing overlay elements, since it's already controllable through delay_before_return_html 2025-04-14 12:39:05 +05:30
UncleCode
ecec53a8c1 Docker tested on Windows machine. 2025-04-13 20:14:41 +08:00
Aravind Karnam
7d8e81fb2e fix: fix target_elements, in a less invasive and more efficient way simply by changing order of execution :) https://github.com/unclecode/crawl4ai/issues/902 2025-04-12 12:44:00 +05:30
Aravind Karnam
9fc5d315af fix: revert the old target_elms code in LXMLwebscraping strategy 2025-04-12 12:07:04 +05:30
Aravind Karnam
d84508b4d5 fix: revert the old target_elms code in regular webscraping strategy 2025-04-12 12:05:17 +05:30
Aravind Karnam
022f5c9e25 Merged next branch 2025-04-12 10:47:02 +05:30
UncleCode
3179d6ad0c fix(core): improve error handling and stability in core components
Enhance error handling and stability across multiple components:
- Add safety checks in async_configs.py for type and params existence
- Fix browser manager initialization and cleanup logic
- Add default LLM config fallback in extraction strategy
- Add comprehensive Docker deployment guide and server tests

BREAKING CHANGE: BrowserManager.start() now automatically closes existing instances
2025-04-11 20:58:39 +08:00
UncleCode
18e8227dfb feat(crawler): add console message capture functionality
Add ability to capture browser console messages during crawling:
- Implement _capture_console_messages method to collect console logs
- Update crawl method to support console message capture
- Modify browser_manager page creation to accept full CrawlerRunConfig
- Fix request failure text formatting

This enhancement allows debugging and monitoring of JavaScript console output during crawling operations.
2025-04-10 23:26:09 +08:00
UncleCode
7c358a1aee fix(browser): add null check for crawlerRunConfig.url
Add additional null check when accessing crawlerRunConfig.url in cookie configuration to prevent potential null pointer exceptions. Previously, the code only checked if crawlerRunConfig existed but not its url property.

Fixes potential runtime error when crawlerRunConfig.url is undefined.
2025-04-10 23:25:07 +08:00
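For commit 7c358a1aee, a minimal sketch of the guard pattern described: check the config object first, then its url, before using it. `RunConfigSketch` and `cookie_target_url` are hypothetical names, and the real cookie-configuration code may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunConfigSketch:
    """Hypothetical stand-in for a run config whose url may be unset."""
    url: Optional[str] = None


def cookie_target_url(config: Optional[RunConfigSketch]) -> Optional[str]:
    # Checking only `config` is not enough: its url can still be missing.
    if config is None or not config.url:
        return None
    return config.url


print(cookie_target_url(RunConfigSketch(url="https://example.com")))
print(cookie_target_url(RunConfigSketch()))
```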
UncleCode
108b2a8bfb Fixed capturing console messages for the case where the URL is a local file. Update Docker configuration (work in progress) 2025-04-10 23:22:38 +08:00
unclecode
66ac07b4f3 feat(crawler): add network request and console message capturing
Implement comprehensive network request and console message capturing functionality:
- Add capture_network_requests and capture_console_messages config parameters
- Add network_requests and console_messages fields to models
- Implement Playwright event listeners to capture requests, responses, and console output
- Create detailed documentation and examples
- Add comprehensive tests

This feature enables deep visibility into web page activity for debugging,
security analysis, performance profiling, and API discovery in web applications.
2025-04-10 16:03:48 +08:00
UncleCode
a2061bf31e feat(crawler): add MHTML capture functionality
Add ability to capture web pages as MHTML format, which includes all page resources
in a single file. This enables complete page archival and offline viewing.

- Add capture_mhtml parameter to CrawlerRunConfig
- Implement MHTML capture using CDP in AsyncPlaywrightCrawlerStrategy
- Add mhtml field to CrawlResult and AsyncCrawlResponse models
- Add comprehensive tests for MHTML capture functionality
- Update documentation with MHTML capture details
- Add exclude_all_images option for better memory management

Breaking changes: None
2025-04-09 15:39:04 +08:00
Aravind Karnam
6f7ab9c927 fix: Revert changes to session management in AsyncHttpWebcrawler and solve the underlying issue by removing the session closure in finally block of session context. 2025-04-08 18:31:00 +05:30
UncleCode
9038e9acbd Merge branch 'main' into next 2025-04-08 17:43:42 +08:00
UncleCode
02e627e0bd fix(crawler): simplify page retrieval logic in AsyncPlaywrightCrawlerStrategy 2025-04-08 17:43:36 +08:00
UncleCode
5b66208a7e Refactor next branch 2025-04-06 18:33:09 +08:00