790 Commits

Author SHA1 Message Date
UncleCode
591f55edc7 refactor(browser): rename methods and update type hints in BrowserHub for clarity 2025-04-06 18:22:05 +08:00
UncleCode
e1d9e2489c refactor(docs): update import statement in quickstart.py for improved clarity 2025-04-05 23:12:06 +08:00
UncleCode
b1693b1c21 Remove old quickstart files 2025-04-05 23:10:25 +08:00
UncleCode
49d904ca0a refactor(docs): enhance quickstart_examples.py with improved configuration and file handling 2025-04-05 22:57:45 +08:00
UncleCode
ca9351252a refactor(docs): update import paths and clean up example code in quickstart_examples.py 2025-04-05 22:55:56 +08:00
UncleCode
935d9d39f8 Add quickstart example set 2025-04-05 21:37:25 +08:00
UncleCode
f8213c32b9 Merge branch 'vr0.5.0.post8' 2025-04-05 21:36:17 +08:00
UncleCode
14894b4d70 feat(config): set DefaultMarkdownGenerator as the default markdown generator in CrawlerRunConfig
feat(logger): add color mapping for log message formatting options
2025-04-03 20:34:19 +08:00
Aravind Karnam
7155778eac chore: move from faust-cchardet to chardet 2025-04-03 17:42:51 +05:30
Aravind Karnam
4133e5460d typo-fix: https://github.com/unclecode/crawl4ai/pull/918 2025-04-03 17:42:24 +05:30
Aravind Karnam
73fda8a6ec fix: address the PR review: https://github.com/unclecode/crawl4ai/pull/899#discussion_r2024639193 2025-04-03 13:47:13 +05:30
UncleCode
86df20234b fix(crawler): handle exceptions in get_page call to ensure page retrieval 2025-04-02 21:25:24 +08:00
UncleCode
179921a131 fix(crawler): update get_page call to include additional return value 2025-04-02 19:01:30 +08:00
Aravind Karnam
9e16a4bb26 Merge next and resolve conflicts 2025-04-02 12:18:23 +05:30
UncleCode
c5cac2b459 feat(browser): add BrowserHub for centralized browser management and resource sharing 2025-04-01 20:35:02 +08:00
UncleCode
555455d710 feat(browser): implement browser pooling and page pre-warming
Adds a new BrowserManager implementation with browser pooling and page pre-warming capabilities:
- Adds support for managing multiple browser instances per configuration
- Implements page pre-warming for improved performance
- Adds configurable behavior for when no browsers are available
- Includes comprehensive status reporting and monitoring
- Maintains backward compatibility with existing API
- Adds demo script showcasing new features

BREAKING CHANGE: BrowserManager API now returns a strategy instance along with page and context
2025-03-31 21:55:07 +08:00
Aravind
765f856ed4
Merge pull request #808 from dvschuyl/bug/parse-srcset-fix-float-width
🐛 Truncate width to integer string in srcset
2025-03-31 18:21:09 +05:30
Aravind Karnam
757e3177ed fix: https://github.com/unclecode/crawl4ai/issues/839 2025-03-31 17:10:04 +05:30
Aravind
d8357e80d2
Merge pull request #915 from maggie-edkey/css-selector
fix(#911): css_selector is not working properly
2025-03-31 13:03:35 +05:30
Aravind Karnam
ef1f0c4102 fix:https://github.com/unclecode/crawl4ai/issues/701 2025-03-31 12:43:32 +05:30
maggie.wang
1119f2f5b5 fix: https://github.com/unclecode/crawl4ai/issues/911 2025-03-31 14:05:54 +08:00
UncleCode
bb02398086 refactor(browser): improve browser strategy architecture and lifecycle management
Major refactoring of browser strategy implementations to improve code organization and reliability:
- Move CrawlResultContainer and RunManyReturn types from async_webcrawler to models.py
- Simplify browser lifecycle management in AsyncWebCrawler
- Standardize browser strategy interface with _generate_page method
- Improve headless mode handling and browser args construction
- Clean up Docker and Playwright strategy implementations
- Fix session management and context handling across strategies

BREAKING CHANGE: Browser strategy interface has changed with new _generate_page method requirement
2025-03-30 20:58:39 +08:00
UncleCode
3ff7eec8f3 refactor(browser): consolidate browser strategy implementations
Moves common browser functionality into BaseBrowserStrategy class to reduce code duplication and improve maintainability. Key changes:
- Adds shared browser argument building and session management to base class
- Standardizes storage state handling across strategies
- Improves process cleanup and error handling
- Consolidates CDP URL management and container lifecycle

BREAKING CHANGE: Changes browser_mode="custom" to "cdp" for consistency
2025-03-28 22:47:28 +08:00
Aravind Karnam
d8cbeff386 fix: https://github.com/unclecode/crawl4ai/issues/842 2025-03-28 19:31:05 +05:30
UncleCode
64f20ab44a refactor(docker): update Dockerfile and browser strategy to use Chromium 2025-03-28 15:59:02 +08:00
Aravind Karnam
57e0423b3a fix:target_element should not affect link extraction. -> https://github.com/unclecode/crawl4ai/issues/902 2025-03-28 12:56:37 +05:30
UncleCode
c635f6b9a2 refactor(browser): reorganize browser strategies and improve Docker implementation
Reorganize browser strategy code into separate modules for better maintainability and separation of concerns. Improve Docker implementation with:
- Add Alpine and Debian-based Dockerfiles for better container options
- Enhance Docker registry to share configuration with BuiltinBrowserStrategy
- Add CPU and memory limits to container configuration
- Improve error handling and logging
- Update documentation and examples

BREAKING CHANGE: DockerConfig, DockerRegistry, and DockerUtils have been moved to new locations and their APIs have been updated.
2025-03-27 21:35:13 +08:00
Aravind Karnam
7be5427283 Merge branch 'next' into 2025-MAR-ALPHA-1 2025-03-27 12:29:32 +05:30
UncleCode
7f93e88379 refactor(tests): remove unused imports in test_docker_browser.py 2025-03-26 15:19:29 +08:00
UncleCode
40d4dd36c9 chore(version): bump version to 0.5.0.post8 and update post-installation setup 2025-03-25 21:56:49 +08:00
UncleCode
d8f38f2298 chore(version): bump version to 0.5.0.post7 2025-03-25 21:47:19 +08:00
UncleCode
5c88d1310d feat(cli): add output file option and integrate LXML web scraping strategy 2025-03-25 21:38:24 +08:00
UncleCode
4a20d7f7c2 feat(cli): add quick JSON extraction and global config management
Adds new features to improve user experience and configuration:
- Quick JSON extraction with -j flag for direct LLM-based structured data extraction
- Global configuration management with 'crwl config' commands
- Enhanced LLM extraction with better JSON handling and error management
- New user settings for default behaviors (LLM provider, browser settings, etc.)

Breaking changes: None
2025-03-25 20:30:25 +08:00
Aravind Karnam
585e5e5973 fix: https://github.com/unclecode/crawl4ai/issues/733 2025-03-25 15:17:59 +05:30
Aravind Karnam
e3111d0a32 fix: prevent session closing after each request to maintain connection pool. Fixes: https://github.com/unclecode/crawl4ai/issues/867 2025-03-25 13:46:55 +05:30
Aravind Karnam
2f0e217751 Chore: Add brotli as dependancy to fix: https://github.com/unclecode/crawl4ai/issues/867 2025-03-25 13:44:41 +05:30
UncleCode
6405cf0a6f Merge branch 'vr0.5.0.post5' into next 2025-03-25 14:51:29 +08:00
UncleCode
6eed4adc65 Merge branch 'vr0.5.0.post5' 2025-03-25 12:24:07 +08:00
UncleCode
bdd9db579a chore(version): bump version to 0.5.0.post6
refactor(cli): remove unused import from FastAPI
2025-03-25 12:01:36 +08:00
UncleCode
1107fa1d62 feat(cli): enhance markdown generation with default content filters
Add DefaultMarkdownGenerator integration and automatic content filtering for markdown output formats. When using 'markdown-fit' or 'md-fit' output formats, automatically apply PruningContentFilter with default settings if no filter config is provided.

This change improves the user experience by providing sensible defaults for markdown generation while maintaining the ability to customize filtering behavior.
2025-03-25 11:56:00 +08:00
Aravind Karnam
efa73257c5 Merge branch 'next' into 2025-MAR-ALPHA-1 2025-03-24 21:57:29 +05:30
UncleCode
8c08521301 feat(browser): add Docker-based browser automation strategy
Implements a new browser strategy that runs Chrome in Docker containers,
providing better isolation and cross-platform consistency. Features include:
- Connect and launch modes for different container configurations
- Persistent storage support for maintaining browser state
- Container registry for efficient reuse
- Comprehensive test suite for Docker browser functionality

This addition allows users to run browser automation workloads in isolated
containers, improving security and resource management.
2025-03-24 21:36:58 +08:00
UncleCode
462d5765e2 fix(browser): improve storage state persistence in CDP strategy
Enhance storage state persistence mechanism in CDP browser strategy by:
- Explicitly saving storage state for each browser context
- Using proper file path for storage state
- Removing unnecessary sleep delay

Also includes test improvements:
- Simplified test configurations in playwright tests
- Temporarily disabled some CDP tests
2025-03-23 21:06:41 +08:00
UncleCode
6eeb2e4076 feat(browser): enhance browser context creation with user data directory support and improved storage state handling 2025-03-23 19:07:13 +08:00
UncleCode
0094cac675 refactor(browser): improve parallel crawling and browser management
Remove PagePoolConfig in favor of direct page management in browser strategies.
Add get_pages() method for efficient parallel page creation.
Improve storage state handling and persistence.
Add comprehensive parallel crawling tests and performance analysis.

BREAKING CHANGE: Removed PagePoolConfig class and related functionality.
2025-03-23 18:53:24 +08:00
UncleCode
4ab0893ffb feat(browser): implement modular browser management system
Adds a new browser management system with strategy pattern implementation:
- Introduces BrowserManager class with strategy pattern support
- Adds PlaywrightBrowserStrategy, CDPBrowserStrategy, and BuiltinBrowserStrategy
- Implements BrowserProfileManager for profile management
- Adds PagePoolConfig for browser page pooling
- Includes comprehensive test suite for all browser strategies

BREAKING CHANGE: Browser management has been moved to browser/ module. Direct usage of browser_manager.py and browser_profiler.py is deprecated.
2025-03-21 22:50:00 +08:00
Aravind Karnam
e01d1e73e1 fix: link normalisation in BestFirstStrategy 2025-03-21 17:34:13 +05:30
Aravind Karnam
471d110c5e fix: url normalisation ref: https://github.com/unclecode/crawl4ai/issues/841 2025-03-21 16:48:07 +05:30
Aravind Karnam
f89113377a fix: Move adding of visited urls to the 'visited' set, when queueing the URLs instead of after dequeuing, this is to prevent duplicate crawls. https://github.com/unclecode/crawl4ai/issues/843 2025-03-21 13:44:57 +05:30
Aravind Karnam
6740e87b4d fix: remove trailing slash when the path is empty. This is causing dupicate crawls 2025-03-21 13:41:31 +05:30