75 Commits

Author SHA1 Message Date
UncleCode
4812f08a73 feat(docker): update Docker deployment for v0.6.0
Major updates to Docker deployment infrastructure:
- Switch default port to 11235 for all services
- Add MCP (Model Context Protocol) support with WebSocket/SSE endpoints
- Simplify docker-compose.yml with auto-platform detection
- Update documentation with new features and examples
- Consolidate configuration and improve resource management

BREAKING CHANGE: Default port changed from 8020 to 11235. Update your configurations and deployment scripts accordingly.
2025-04-22 22:35:25 +08:00
UncleCode
5297e362f3 feat(mcp): Implement MCP protocol and enhance server capabilities
This commit introduces several significant enhancements to the Crawl4AI Docker deployment:

  1. Add MCP Protocol Support:
     - Implement WebSocket and SSE transport layers for MCP server communication
     - Create mcp_bridge.py to expose existing API endpoints via MCP protocol
     - Add comprehensive tests for both socket and SSE transport methods

  2. Enhance Docker Server Capabilities:
     - Add PDF generation endpoint with file saving functionality
     - Add screenshot capture endpoint with configurable wait time
     - Implement JavaScript execution endpoint for dynamic page interaction
     - Add intelligent file path handling for saving generated assets

  3. Improve Search and Context Functionality:
     - Implement syntax-aware code function chunking using AST parsing
     - Add BM25-based intelligent document search with relevance scoring
     - Create separate code and documentation context endpoints
     - Enhance response format with structured results and scores

  4. Rename and Fix File Organization:
     - Fix typo in test_docker_config_gen.py filename
     - Update import statements and dependencies
     - Add FileResponse for context endpoints

  This enhancement significantly improves the machine-to-machine communication
  capabilities of Crawl4AI, making it more suitable for integration with LLM agents
  and other automated systems.

  The CHANGELOG update has been applied successfully, highlighting the key features and improvements made in this release. The commit message provides a detailed explanation of all the
  changes, which will be helpful for tracking the project's evolution.
2025-04-21 22:22:02 +08:00
UncleCode
a58c8000aa refactor(server): migrate to pool-based crawler management
Replace crawler_manager.py with simpler crawler_pool.py implementation:
- Add global page semaphore for hard concurrency cap
- Implement browser pool with idle cleanup
- Add playground UI for testing and stress testing
- Update API handlers to use pooled crawlers
- Enhance logging levels and symbols

BREAKING CHANGE: Removes CrawlerManager class in favor of simpler pool-based approach
2025-04-20 20:14:26 +08:00
UncleCode
16b2318242 feat(api): implement crawler pool manager for improved resource handling
Adds a new CrawlerManager class to handle browser instance pooling and failover:
- Implements auto-scaling based on system resources
- Adds primary/backup crawler management
- Integrates memory monitoring and throttling
- Adds streaming support with memory tracking
- Updates API endpoints to use pooled crawlers

BREAKING CHANGE: API endpoints now require CrawlerManager initialization
2025-04-18 22:26:24 +08:00
UncleCode
907cba194f Merge branch 'next-stress' into next 2025-04-17 22:34:43 +08:00
UncleCode
921e0c46b6 feat(tests): implement high volume stress testing framework
Add comprehensive stress testing solution for SDK using arun_many and dispatcher system:
- Create test_stress_sdk.py for running high volume crawl tests
- Add run_benchmark.py for orchestrating tests with predefined configs
- Implement benchmark_report.py for generating performance reports
- Add memory tracking and local test site generation
- Support both streaming and batch processing modes
- Add detailed documentation in README.md

The framework enables testing SDK performance, concurrency handling,
and memory behavior under high-volume scenarios.
2025-04-17 22:31:51 +08:00
UncleCode
7db6b468d9 feat(markdown): add content source selection for markdown generation
Adds a new content_source parameter to MarkdownGenerationStrategy that allows
selecting which HTML content to use for markdown generation:
- cleaned_html (default): uses post-processed HTML
- raw_html: uses original webpage HTML
- fit_html: uses preprocessed HTML for schema extraction

Changes include:
- Added content_source parameter to MarkdownGenerationStrategy
- Updated AsyncWebCrawler to handle HTML source selection
- Added examples and tests for the new feature
- Updated documentation with new parameter details

BREAKING CHANGE: Renamed cleaned_html parameter to input_html in generate_markdown()
method signature to better reflect its generalized purpose
2025-04-17 20:13:53 +08:00
UncleCode
94d486579c docs(tests): clarify server URL comments in deep crawl tests
Improve documentation of test configuration URLs by adding clearer
comments explaining when to use each URL configuration - Docker vs
development mode.

No functional changes, only comment improvements.
2025-04-15 22:32:27 +08:00
UncleCode
5206c6f2d6 Modify the test file 2025-04-15 22:28:01 +08:00
UncleCode
230f22da86 refactor(proxy): move ProxyConfig to async_configs and improve LLM token handling
Moved ProxyConfig class from proxy_strategy.py to async_configs.py for better organization.
Improved LLM token handling with new PROVIDER_MODELS_PREFIXES.
Added test cases for deep crawling and proxy rotation.
Removed docker_config from BrowserConfig as it's handled separately.

BREAKING CHANGE: ProxyConfig import path changed from crawl4ai.proxy_strategy to crawl4ai
2025-04-15 22:27:18 +08:00
UncleCode
ecec53a8c1 Docker tested on Windows machine. 2025-04-13 20:14:41 +08:00
UncleCode
3179d6ad0c fix(core): improve error handling and stability in core components
Enhance error handling and stability across multiple components:
- Add safety checks in async_configs.py for type and params existence
- Fix browser manager initialization and cleanup logic
- Add default LLM config fallback in extraction strategy
- Add comprehensive Docker deployment guide and server tests

BREAKING CHANGE: BrowserManager.start() now automatically closes existing instances
2025-04-11 20:58:39 +08:00
unclecode
66ac07b4f3 feat(crawler): add network request and console message capturing
Implement comprehensive network request and console message capturing functionality:
- Add capture_network_requests and capture_console_messages config parameters
- Add network_requests and console_messages fields to models
- Implement Playwright event listeners to capture requests, responses, and console output
- Create detailed documentation and examples
- Add comprehensive tests

This feature enables deep visibility into web page activity for debugging,
security analysis, performance profiling, and API discovery in web applications.
2025-04-10 16:03:48 +08:00
UncleCode
a2061bf31e feat(crawler): add MHTML capture functionality
Add ability to capture web pages as MHTML format, which includes all page resources
in a single file. This enables complete page archival and offline viewing.

- Add capture_mhtml parameter to CrawlerRunConfig
- Implement MHTML capture using CDP in AsyncPlaywrightCrawlerStrategy
- Add mhtml field to CrawlResult and AsyncCrawlResponse models
- Add comprehensive tests for MHTML capture functionality
- Update documentation with MHTML capture details
- Add exclude_all_images option for better memory management

Breaking changes: None
2025-04-09 15:39:04 +08:00
UncleCode
555455d710 feat(browser): implement browser pooling and page pre-warming
Adds a new BrowserManager implementation with browser pooling and page pre-warming capabilities:
- Adds support for managing multiple browser instances per configuration
- Implements page pre-warming for improved performance
- Adds configurable behavior for when no browsers are available
- Includes comprehensive status reporting and monitoring
- Maintains backward compatibility with existing API
- Adds demo script showcasing new features

BREAKING CHANGE: BrowserManager API now returns a strategy instance along with page and context
2025-03-31 21:55:07 +08:00
UncleCode
bb02398086 refactor(browser): improve browser strategy architecture and lifecycle management
Major refactoring of browser strategy implementations to improve code organization and reliability:
- Move CrawlResultContainer and RunManyReturn types from async_webcrawler to models.py
- Simplify browser lifecycle management in AsyncWebCrawler
- Standardize browser strategy interface with _generate_page method
- Improve headless mode handling and browser args construction
- Clean up Docker and Playwright strategy implementations
- Fix session management and context handling across strategies

BREAKING CHANGE: Browser strategy interface has changed with new _generate_page method requirement
2025-03-30 20:58:39 +08:00
UncleCode
3ff7eec8f3 refactor(browser): consolidate browser strategy implementations
Moves common browser functionality into BaseBrowserStrategy class to reduce code duplication and improve maintainability. Key changes:
- Adds shared browser argument building and session management to base class
- Standardizes storage state handling across strategies
- Improves process cleanup and error handling
- Consolidates CDP URL management and container lifecycle

BREAKING CHANGE: Changes browser_mode="custom" to "cdp" for consistency
2025-03-28 22:47:28 +08:00
UncleCode
64f20ab44a refactor(docker): update Dockerfile and browser strategy to use Chromium 2025-03-28 15:59:02 +08:00
UncleCode
c635f6b9a2 refactor(browser): reorganize browser strategies and improve Docker implementation
Reorganize browser strategy code into separate modules for better maintainability and separation of concerns. Improve Docker implementation with:
- Add Alpine and Debian-based Dockerfiles for better container options
- Enhance Docker registry to share configuration with BuiltinBrowserStrategy
- Add CPU and memory limits to container configuration
- Improve error handling and logging
- Update documentation and examples

BREAKING CHANGE: DockerConfig, DockerRegistry, and DockerUtils have been moved to new locations and their APIs have been updated.
2025-03-27 21:35:13 +08:00
UncleCode
7f93e88379 refactor(tests): remove unused imports in test_docker_browser.py 2025-03-26 15:19:29 +08:00
UncleCode
8c08521301 feat(browser): add Docker-based browser automation strategy
Implements a new browser strategy that runs Chrome in Docker containers,
providing better isolation and cross-platform consistency. Features include:
- Connect and launch modes for different container configurations
- Persistent storage support for maintaining browser state
- Container registry for efficient reuse
- Comprehensive test suite for Docker browser functionality

This addition allows users to run browser automation workloads in isolated
containers, improving security and resource management.
2025-03-24 21:36:58 +08:00
UncleCode
462d5765e2 fix(browser): improve storage state persistence in CDP strategy
Enhance storage state persistence mechanism in CDP browser strategy by:
- Explicitly saving storage state for each browser context
- Using proper file path for storage state
- Removing unnecessary sleep delay

Also includes test improvements:
- Simplified test configurations in playwright tests
- Temporarily disabled some CDP tests
2025-03-23 21:06:41 +08:00
UncleCode
0094cac675 refactor(browser): improve parallel crawling and browser management
Remove PagePoolConfig in favor of direct page management in browser strategies.
Add get_pages() method for efficient parallel page creation.
Improve storage state handling and persistence.
Add comprehensive parallel crawling tests and performance analysis.

BREAKING CHANGE: Removed PagePoolConfig class and related functionality.
2025-03-23 18:53:24 +08:00
UncleCode
4ab0893ffb feat(browser): implement modular browser management system
Adds a new browser management system with strategy pattern implementation:
- Introduces BrowserManager class with strategy pattern support
- Adds PlaywrightBrowserStrategy, CDPBrowserStrategy, and BuiltinBrowserStrategy
- Implements BrowserProfileManager for profile management
- Adds PagePoolConfig for browser page pooling
- Includes comprehensive test suite for all browser strategies

BREAKING CHANGE: Browser management has been moved to browser/ module. Direct usage of browser_manager.py and browser_profiler.py is deprecated.
2025-03-21 22:50:00 +08:00
UncleCode
6432ff1257 feat(browser): add builtin browser management system
Implements a persistent browser management system that allows running a single shared browser instance
that can be reused across multiple crawler sessions. Key changes include:

- Added browser_mode config option with 'builtin', 'dedicated', and 'custom' modes
- Implemented builtin browser management in BrowserProfiler
- Added CLI commands for managing builtin browser (start, stop, status, restart, view)
- Modified browser process handling to support detached processes
- Added automatic builtin browser setup during package installation

BREAKING CHANGE: The browser_mode config option changes how browser instances are managed
2025-03-20 12:13:59 +08:00
UncleCode
b750542e6d feat(crawler): optimize single URL handling and add performance comparison
Add special handling for single URL requests in Docker API to use arun() instead of arun_many()
Add new example script demonstrating performance differences between sequential and parallel crawling
Update cache mode from aggressive to bypass in examples and tests
Remove unused dependencies (zstandard, msgpack)

BREAKING CHANGE: Changed default cache_mode from aggressive to bypass in examples
2025-03-13 22:15:15 +08:00
UncleCode
dc36997a08 feat(schema): improve HTML preprocessing for schema generation
Add new preprocess_html_for_schema utility function to better handle HTML cleaning
for schema generation. This replaces the previous optimize_html function in the
GoogleSearchCrawler and includes smarter attribute handling and pattern detection.

Other changes:
- Update default provider to gpt-4o
- Add DEFAULT_PROVIDER_API_KEY constant
- Make LLMConfig creation more flexible with create_llm_config helper
- Add new dependencies: zstandard and msgpack

This change improves schema generation reliability while reducing noise in the
processed HTML.
2025-03-12 22:40:46 +08:00
UncleCode
1630fbdafe feat(monitor): add real-time crawler monitoring system with memory management
Implements a comprehensive monitoring and visualization system for tracking web crawler operations in real-time. The system includes:
- Terminal-based dashboard with rich UI for displaying task statuses
- Memory pressure monitoring and adaptive dispatch control
- Queue statistics and performance metrics tracking
- Detailed task progress visualization
- Stress testing framework for memory management

This addition helps operators track crawler performance and manage memory usage more effectively.
2025-03-12 19:05:24 +08:00
UncleCode
a68cbb232b feat(browser): add standalone CDP browser launch and lxml extraction strategy
Add new features to enhance browser automation and HTML extraction:
- Add CDP browser launch capability with customizable ports and profiles
- Implement JsonLxmlExtractionStrategy for faster HTML parsing
- Add CLI command 'crwl cdp' for launching standalone CDP browsers
- Support connecting to external CDP browsers via URL
- Optimize selector caching and context-sensitive queries

BREAKING CHANGE: LLMConfig import path changed from crawl4ai.types to crawl4ai
2025-03-07 20:55:56 +08:00
UncleCode
baee4949d3 refactor(llm): rename LlmConfig to LLMConfig for consistency
Rename LlmConfig to LLMConfig across the codebase to follow consistent naming conventions.
Update all imports and usages to use the new name.
Update documentation and examples to reflect the change.

BREAKING CHANGE: LlmConfig has been renamed to LLMConfig. Users need to update their imports and usage.
2025-03-05 14:17:04 +08:00
Aravind
a9e24307cc
Release prep (#749)
* fix: Update export of URLPatternFilter

* chore: Add dependancy for cchardet in requirements

* docs: Update example for deep crawl in release note for v0.5

* Docs: update the example for memory dispatcher

* docs: updated example for crawl strategies

* Refactor: Removed wrapping in if __name__==main block since this is a markdown file.

* chore: removed cchardet from dependancy list, since unclecode is planning to remove it

* docs: updated the example for proxy rotation to a working example

* feat: Introduced ProxyConfig param

* Add tutorial for deep crawl & update contributor list for bug fixes in feb alpha-1

* chore: update and test new dependancies

* feat:Make PyPDF2 a conditional dependancy

* updated tutorial and release note for v0.5

* docs: update docs for deep crawl, and fix a typo in docker-deployment markdown filename

* refactor: 1. Deprecate markdown_v2 2. Make markdown backward compatible to behave as a string when needed. 3. Fix LlmConfig usage in cli 4. Deprecate markdown_v2 in cli 5. Update AsyncWebCrawler for changes in CrawlResult

* fix: Bug in serialisation of markdown in acache_url

* Refactor: Added deprecation errors for fit_html and fit_markdown directly on markdown. Now access them via markdown

* fix: remove deprecated markdown_v2 from docker

* Refactor: remove deprecated fit_markdown and fit_html from result

* refactor: fix cache retrieval for markdown as a string

* chore: update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown
2025-02-28 19:53:35 +08:00
UncleCode
c6d48080a4 feat(logger): add abstract logger base class and file logger implementation
Add AsyncLoggerBase abstract class to standardize logger interface and introduce AsyncFileLogger for file-only logging. Remove deprecated always_bypass_cache parameter and clean up AsyncWebCrawler initialization.

BREAKING CHANGE: Removed deprecated 'always_by_pass_cache' parameter. Use BrowserConfig cache settings instead.
2025-02-23 21:23:41 +08:00
Aravind
2af958e12c
Feat/llm config (#724)
* feature: Add LlmConfig to easily configure and pass LLM configs to different strategies

* pulled in next branch and resolved conflicts

* feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions

* Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params

* updated tests, docs and readme
2025-02-21 15:41:37 +08:00
Aravind
dad592c801
2025 feb alpha 1 (#685)
* spelling change in prompt

* gpt-4o-mini support

* Remove leading Y before here

* prompt spell correction

* (Docs) Fix numbered list end-of-line formatting

Added the missing "two spaces" to add a line break

* fix: access downloads_path through browser_config in _handle_download method - Fixes #585

* crawl

* fix: https://github.com/unclecode/crawl4ai/issues/592

* fix: https://github.com/unclecode/crawl4ai/issues/583

* Docs update: https://github.com/unclecode/crawl4ai/issues/649

* fix: https://github.com/unclecode/crawl4ai/issues/570

* Docs: updated example for content-selection to reflect new changes in yc newsfeed css

* Refactor: Removed old filters and replaced with optimised filters

* fix:Fixed imports as per the new names of filters

* Tests: For deep crawl filters

* Refactor: Remove old scorers and replace with optimised ones: Fix imports forall filters and scorers.

* fix: awaiting on filters that are async in nature eg: content relevance and seo filters

* fix: https://github.com/unclecode/crawl4ai/issues/592

* fix: https://github.com/unclecode/crawl4ai/issues/715

---------

Co-authored-by: DarshanTank <darshan.tank@gnani.ai>
Co-authored-by: Tuhin Mallick <tuhin.mllk@gmail.com>
Co-authored-by: Serhat Soydan <ssoydan@gmail.com>
Co-authored-by: cardit1 <maneesh@cardit.in>
Co-authored-by: Tautik Agrahari <tautikagrahari@gmail.com>
2025-02-19 14:13:17 +08:00
UncleCode
392c923980 feat(docker): add JWT authentication and improve server architecture
Add JWT token-based authentication to Docker server and client.
Refactor server architecture for better code organization and error handling.
Move Dockerfile to root deploy directory and update configuration.
Add comprehensive documentation and examples.

BREAKING CHANGE: Docker server now requires authentication by default.
Endpoints require JWT tokens when security.jwt_enabled is true in config.
2025-02-18 22:07:13 +08:00
UncleCode
8bb799068e feat(crawler): add HTTP crawler strategy for lightweight web scraping
Implements a new AsyncHTTPCrawlerStrategy class that provides a fast, memory-efficient alternative to browser-based crawling. Features include:
- Support for HTTP/HTTPS requests with configurable methods, headers, and timeouts
- File and raw content handling capabilities
- Streaming response processing for large files
- Customizable request/response hooks
- Comprehensive error handling

Also refactors browser management code into separate module for better organization.
2025-02-15 19:26:30 +08:00
UncleCode
966fb47e64 feat(config): enhance serialization and add deep crawling exports
Improve configuration serialization with better handling of frozensets and slots.
Expand deep crawling module exports and documentation.
Add comprehensive API usage examples in Docker README.

- Add support for frozenset serialization
- Improve error handling in config loading
- Export additional deep crawling components
- Enhance Docker API documentation with detailed examples
- Fix ContentTypeFilter initialization
2025-02-13 21:45:19 +08:00
UncleCode
91a5fea11f feat(cli): add command line interface with comprehensive features
Implements a full-featured CLI for Crawl4AI with the following capabilities:
- Basic and advanced web crawling
- Configuration management via YAML/JSON files
- Multiple extraction strategies (CSS, XPath, LLM)
- Content filtering and optimization
- Interactive Q&A capabilities
- Various output formats
- Comprehensive documentation and examples

Also includes:
- Home directory setup for configuration and cache
- Environment variable support for API tokens
- Test suite for CLI functionality
2025-02-10 16:58:52 +08:00
UncleCode
b957ff2ecd refactor(crawler): improve HTML handling and cleanup codebase
- Add HTML attribute preservation in GoogleSearchCrawler
- Fix lxml import references in utils.py
- Remove unused ssl_certificate.json
- Clean up imports and code organization in hub.py
- Update test case formatting and remove unused image search test

BREAKING CHANGE: Removed ssl_certificate.json file which might affect existing certificate validations
2025-02-07 21:56:27 +08:00
UncleCode
a9415aaaf6 refactor(deep-crawling): reorganize deep crawling strategies and add new implementations
Split deep crawling code into separate strategy files for better organization and maintainability. Added new BFF (Best First) and DFS crawling strategies. Introduced base strategy class and common types.

BREAKING CHANGE: Deep crawling implementation has been split into multiple files. Import paths for deep crawling strategies have changed.
2025-02-05 22:50:39 +08:00
UncleCode
c308a794e8 refactor(deep-crawl): reorganize deep crawling functionality into dedicated module
Restructure deep crawling code into a dedicated module with improved organization:
- Move deep crawl logic from async_deep_crawl.py to deep_crawling/
- Create separate files for BFS strategy, filters, and scorers
- Improve code organization and maintainability
- Add optimized implementations for URL filtering and scoring
- Rename DeepCrawlHandler to DeepCrawlDecorator for clarity

BREAKING CHANGE: DeepCrawlStrategy and BreadthFirstSearchStrategy imports need to be updated to new package structure
2025-02-04 23:28:17 +08:00
UncleCode
04bc643cec feat(api): improve cache handling and add API tests
Changes cache mode from BYPASS to WRITE_ONLY when cache is disabled to ensure
results are still cached for future use. Also adds error handling for non-JSON
LLM responses and comprehensive API test suite.

- Changes default cache fallback from BYPASS to WRITE_ONLY
- Adds error handling for LLM JSON parsing
- Introduces new test suite for API endpoints
2025-02-02 20:53:31 +08:00
UncleCode
2f15976b34 feat(docker): enhance Docker deployment setup and configuration
Add comprehensive Docker deployment configuration with:
- New .dockerignore and .llm.env.example files
- Enhanced Dockerfile with multi-stage build and optimizations
- Detailed README with setup instructions and environment configurations
- Improved requirements.txt with Gunicorn
- Better error handling in async_configs.py

BREAKING CHANGE: Docker deployment now requires .llm.env file for API keys
2025-02-01 19:33:27 +08:00
UncleCode
53ac3ec0b4 feat(docker): add Docker service integration and config serialization
Add Docker service integration with FastAPI server and client implementation.
Implement serialization utilities for BrowserConfig and CrawlerRunConfig to support
Docker service communication. Clean up imports and improve error handling.

- Add Crawl4aiDockerClient class
- Implement config serialization/deserialization
- Add FastAPI server with streaming support
- Add health check endpoint
- Clean up imports and type hints
2025-01-31 18:00:16 +08:00
UncleCode
f81712eb91 refactor(core): reorganize project structure and remove legacy code
Major reorganization of the project structure:
- Moved legacy synchronous crawler code to legacy folder
- Removed deprecated CLI and docs manager
- Consolidated version manager into utils.py
- Added CrawlerHub to __init__.py exports
- Fixed type hints in async_webcrawler.py
- Fixed minor bugs in chunking and crawler strategies

BREAKING CHANGE: Removed synchronous WebCrawler, CLI, and docs management functionality. Users should migrate to AsyncWebCrawler.
2025-01-30 19:35:06 +08:00
UncleCode
d09c611d15 feat(robots): add robots.txt compliance support
Add support for checking and respecting robots.txt rules before crawling websites:
- Implement RobotsParser class with SQLite caching
- Add check_robots_txt parameter to CrawlerRunConfig
- Integrate robots.txt checking in AsyncWebCrawler
- Update documentation with robots.txt compliance examples
- Add tests for robot parser functionality

The cache uses WAL mode for better concurrency and has a default TTL of 7 days.
2025-01-21 17:54:13 +08:00
UncleCode
2cec527a22 feat(extraction): add LLM-powered schema generation utility
Adds new static method generate_schema() to JsonElementExtractionStrategy classes
that can automatically generate extraction schemas using LLM (OpenAI or Ollama).
This provides a convenient way to bootstrap extraction schemas while maintaining
the performance benefits of selector-based extraction.

Key changes:
- Added generate_schema() static method to base extraction strategy
- Added support for both CSS and XPath schema generation
- Updated documentation with examples and best practices
- Added new prompt templates for schema generation
2025-01-20 17:28:00 +08:00
UncleCode
91463e34f1 feat(config): add streaming support and config cloning
Add streaming capability to crawler configurations and introduce clone() methods
for both BrowserConfig and CrawlerRunConfig to support immutable config updates.
Move stream parameter from arun_many() method to CrawlerRunConfig.

BREAKING CHANGE: Removed stream parameter from AsyncWebCrawler.arun_many() method.
Use config.stream=True instead.
2025-01-19 17:51:47 +08:00
UncleCode
1221be30a3 feat(browser): improve browser context management and add shared data support
Add shared_data parameter to CrawlerRunConfig to allow data sharing between hooks.
Implement browser context reuse based on config signatures to improve memory usage.
Fix Firefox/Webkit channel settings.
Add config parameter to hook callbacks for better context access.
Remove debug print statements.

BREAKING CHANGE: Hook callback signatures now include config parameter
2025-01-19 17:12:03 +08:00
UncleCode
e363234172 feat(dispatcher): add streaming support for URL processing
Add new streaming capability to the MemoryAdaptiveDispatcher and AsyncWebCrawler
to allow processing URLs with real-time result streaming. This enables
processing results as they become available rather than waiting for all
URLs to complete.

Key changes:
- Add run_urls_stream method to MemoryAdaptiveDispatcher
- Update AsyncWebCrawler.arun_many to support streaming mode
- Add result queue for better result handling
- Improve type hints and documentation

BREAKING CHANGE: The return type of arun_many now depends on the 'stream'
parameter, returning either List[CrawlResult] or AsyncGenerator[CrawlResult, None]
2025-01-19 14:03:34 +08:00