533 Commits

Author SHA1 Message Date
UncleCode
ca3e33122e refactor(docs): reorganize documentation structure and update styles
Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized
2025-01-07 20:49:50 +08:00
UncleCode
ae376f15fb docs(extraction): add clarifying comments for CSS selector behavior
Add explanatory comments to JsonCssExtractionStrategy._get_elements() method to clarify that it returns all matching elements using select() instead of select_one(). This helps developers understand the method's behavior and its difference from single element selection.

Removed trailing whitespace at end of file.
2025-01-05 19:39:15 +08:00
UncleCode
72fbdac467 fix(extraction): JsonCss selector and crawler improvements
- Fix JsonCssExtractionStrategy._get_elements to return all matching elements instead of just one
- Add robust error handling to page_need_scroll with default fallback
- Improve JSON extraction strategies documentation
- Refactor content scraping strategy
- Update version to 0.4.247
2025-01-05 19:26:46 +08:00
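The first fix in the commit above changes a "first match" lookup into an "all matches" lookup. The difference can be sketched without any third-party dependency using the standard library's ElementTree, where find() and findall() play roughly the roles of BeautifulSoup's select_one() and select(). This is a stand-in for illustration, not the project's code; the helper names and the sample markup are assumptions.

```python
import xml.etree.ElementTree as ET

# Sample markup (hypothetical): three sibling elements matching the same path.
DOC = """
<ul>
  <li class="item">first</li>
  <li class="item">second</li>
  <li class="item">third</li>
</ul>
""".strip()

root = ET.fromstring(DOC)


def get_first(element, path):
    """Single-match behaviour: return at most one element, wrapped in a list."""
    match = element.find(path)
    return [] if match is None else [match]


def get_all(element, path):
    """All-match behaviour: return every element that matches the path."""
    return element.findall(path)


first_only = [li.text for li in get_first(root, "li")]
all_items = [li.text for li in get_all(root, "li")]
```

With a single-match lookup, a schema that expects a list of repeated items would silently receive only the first one, which is the kind of bug the commit message describes.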
UncleCode
0857c7b448 Merge branch 'main' of https://github.com/unclecode/crawl4ai into next 2025-01-05 17:05:59 +08:00
Guilume
07b4c1c0ed
fix: not working long page screenshot (#403) 2025-01-05 17:04:34 +08:00
UncleCode
196dc79ec7 fix: prevent memory leaks by ensuring proper closure of Playwright pages
- Fixes critical memory leak issue where browser pages remained open
- Ensures proper cleanup of Playwright resources after page operations
- Improves resource management in browser farm implementation

This is an urgent fix to address resource leakage that could impact system stability.
2025-01-03 21:17:23 +08:00
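The cleanup pattern described in the commit above can be sketched as an async context manager that always awaits close() on the page, even when the page operation raises. This is a minimal sketch, not the project's implementation: closing_page and FakePage are hypothetical names, and FakePage stands in for a real Playwright page so the example runs without a browser.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def closing_page(page):
    """Yield the page and always await page.close() afterwards."""
    try:
        yield page
    finally:
        # Runs on success and on error, so the page is never left open.
        await page.close()


class FakePage:
    """Hypothetical stand-in for a browser page exposing an async close()."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


async def demo():
    page = FakePage()
    try:
        async with closing_page(page):
            raise RuntimeError("simulated failure during a page operation")
    except RuntimeError:
        pass
    return page.closed


page_was_closed = asyncio.run(demo())
```

Putting the close call in a finally block is what prevents the leak: an exception raised mid-operation no longer skips the cleanup.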
UncleCode
24b3da717a refactor():
- Update hello world example
2025-01-02 17:53:30 +08:00
UncleCode
98acc4254d refactor:
- Update hello_world.py example
2025-01-01 19:47:22 +08:00
UncleCode
eac78c7993 Merge branch 'vr0.4.246' 2025-01-01 19:43:01 +08:00
UncleCode
da1bc0f7bf Update version file 2025-01-01 19:42:35 +08:00
UncleCode
aa4f92f458 refactor(crawler):
- Update hello_world example with proper content filtering
2025-01-01 19:39:42 +08:00
UncleCode
a96e05d4ae refactor(crawler): optimize response handling and default settings
- Set wait_for_images default to false for better performance
- Simplify response attribute copying in AsyncWebCrawler
- Update hello_world example with proper content filtering
2025-01-01 19:39:02 +08:00
UncleCode
5c95fd92b4 fix(browser): resolve merge conflicts in browser channel configuration 2025-01-01 19:05:47 +08:00
UncleCode
4cb2a62551 Update README 2025-01-01 18:59:55 +08:00
UncleCode
5b4fad9e25 - Bump version to 0.4.244 2025-01-01 18:58:43 +08:00
UncleCode
ea0ac25f38 refactor(browser):
Update browser channel default to 'chromium' in BrowserConfig.from_args method
2025-01-01 18:58:15 +08:00
UncleCode
7688aca7d6 Update Version 2025-01-01 18:44:27 +08:00
UncleCode
a7215ad972 fix(browser): update default browser channel to chromium and simplify channel selection logic 2025-01-01 18:38:33 +08:00
Arno.Edwards
8e2403a7da
fix(browser)!: default to Chromium channel for new headless mode (#387)
BREAKING CHANGE: Updated `chrome_channel` to "chromium" to fix compatibility with the new Chromium headless implementation. This resolves the error `playwright._impl._errors.Error: BrowserType.launch: Chromium distribution 'chrome' is not found`, caused by the removal of the old headless mode in Chromium.

With this change, channels like "chrome" and "msedge" now default to the new headless mode, aligning with upstream updates in Playwright v1.49. The new headless mode uses the real Chrome browser, offering more authenticity, reliability, and feature parity with the full browser.

Additionally, simplified fallback logic by directly assigning `chrome_channel` based on `browser_type` or defaulting to "chromium".

Refer to:
- https://playwright.dev/python/docs/browsers#chromium
- https://github.com/microsoft/playwright/issues/33566
2025-01-01 18:37:50 +08:00
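One plausible reading of the simplified fallback described in the commit above is a single chained expression: use an explicit channel if given, otherwise the browser type, otherwise "chromium". The function below is a hypothetical sketch of that reading; its name and signature are assumptions and are not taken from the repository.

```python
def resolve_chrome_channel(browser_type=None, chrome_channel=None):
    """Pick a channel: explicit value, else browser type, else 'chromium'.

    Hypothetical helper; empty strings and None are both treated as "not set".
    """
    return chrome_channel or browser_type or "chromium"


default_channel = resolve_chrome_channel()
from_type = resolve_chrome_channel(browser_type="chromium")
explicit = resolve_chrome_channel(browser_type="chromium", chrome_channel="msedge")
```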
UncleCode
318554e6bf Merge branch 'v0.4.243' v0.4.243 2025-01-01 18:11:15 +08:00
UncleCode
c64979b8dd docs: update README 2025-01-01 18:10:38 +08:00
UncleCode
bfe21b29d4 build: streamline package discovery and bump to v0.4.243
- Replace explicit package listing with setuptools.find
- Include all crawl4ai.* packages automatically
- Use `packages = {find = {where = ["."], include = ["crawl4ai*"]}}` syntax
- Bump version to 0.4.243

This change simplifies package maintenance by automatically discovering
all subpackages under crawl4ai namespace instead of listing them manually.
2025-01-01 17:55:59 +08:00
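The include pattern in the commit above is a glob matched against dotted package names. A rough way to see what "crawl4ai*" selects is to apply the same kind of glob with the standard library's fnmatch module. This is a sketch of the matching rule only, not of setuptools itself; the candidate names other than crawl4ai and its js_snippet folder are made up for illustration.

```python
from fnmatch import fnmatchcase


def select_packages(candidates, include):
    """Keep candidate package names that match at least one glob pattern."""
    return [
        name
        for name in candidates
        if any(fnmatchcase(name, pattern) for pattern in include)
    ]


# Illustrative candidates; "tests" and "docs.examples" are hypothetical.
candidates = [
    "crawl4ai",
    "crawl4ai.js_snippet",
    "tests",
    "docs.examples",
]

selected = select_packages(candidates, ["crawl4ai*"])
```

Because the pattern has no dot before the asterisk, it matches the top-level package and every subpackage in one rule. It would also match a sibling package whose name merely starts with crawl4ai.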
UncleCode
e9d9a6ffe8 fix: ensure js_snippet files are included in package
- Add js_snippet to packages list in pyproject.toml
- Verified JS files are properly included in installed package
- Bump version to 0.4.242
2025-01-01 17:38:59 +08:00
UncleCode
5313c71a0d docs: update README browser installation command
- Remove Chrome from manual installation command
- Keep Chromium as the only default browser in docs
2025-01-01 17:24:44 +08:00
UncleCode
d36ef3d424 refactor(install): use chromium as default browser
- Remove Chrome installation to reduce setup time
- Keep Chromium as default browser for better cross-platform compatibility
2025-01-01 17:19:54 +08:00
UncleCode
4a4f613238 docs: simplify installation instructions
- Add crawl4ai-doctor command to verify installation
- Update browser installation instructions in README and docs
- Move optional features to documentation
- Add manual browser installation steps as fallback
- Update getting-started guide with verification step
2025-01-01 16:54:03 +08:00
UncleCode
dc6a24618e feat(install): add doctor command and force browser install
- Add --force flag to Playwright browser installation
- Add doctor command to test crawling functionality
- Install Chrome and Chromium browsers explicitly
- Add crawl4ai-doctor entry point in pyproject.toml
- Implement simple health check focused on crawling test
2025-01-01 16:33:43 +08:00
UncleCode
74a7c6dbb6 feat(install): specify chrome and chromium for playwright
- Install Chrome and Chromium browsers explicitly
- Split browser installation into separate commands
2025-01-01 16:10:08 +08:00
UncleCode
67f65f958b refactor(build): simplify setup.py configuration
- Remove dependency management from setup.py
- Remove entry points configuration (moved to pyproject.toml)
- Keep minimal setup.py for backwards compatibility
- Clean up package metadata structure
2025-01-01 15:52:01 +08:00
UncleCode
78b6ba5cef build: modernize package configuration with pyproject.toml
- Add pyproject.toml for PEP 517 build system support
- Configure dependencies, scripts, and metadata in pyproject.toml
- Set Python requirement to >=3.9 and add support up to 3.13
- Keep setup.py for backwards compatibility
- Move package dependencies and entry points to pyproject.toml
2025-01-01 15:45:27 +08:00
UncleCode
3f019d34cc docs: update project description emojis
- Change project description emojis from 🔥🕷️ to 🚀🤖
- Update emojis consistently in both setup.py and pyproject.toml
2025-01-01 15:39:33 +08:00
UncleCode
304260e484 refactor(install): simplify Playwright installation error handling
- Remove setup_docs() call from post_install()
- Simplify error messages for Playwright installation failures
- Use sys.executable for more accurate Python path in error messages
- Add --with-deps flag to Playwright install command
2025-01-01 15:33:36 +08:00
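The last two points in the commit above can be sketched as a small command builder: using sys.executable ensures the Playwright module is run by the same interpreter that installed the package, and --with-deps asks Playwright to install system dependencies too. The function name and defaults are assumptions for illustration; the command is only built here, not executed.

```python
import sys


def playwright_install_command(browsers=("chromium",), with_deps=True, force=False):
    """Build the argument list for installing Playwright browsers.

    Hypothetical helper: returns the command without running it.
    """
    cmd = [sys.executable, "-m", "playwright", "install"]
    if with_deps:
        cmd.append("--with-deps")
    if force:
        cmd.append("--force")
    cmd.extend(browsers)
    return cmd


command = playwright_install_command()
# The list could be passed to subprocess.run(command, check=True) by a caller.
```

Printing this same list in an error message gives the user a command they can copy and run against the correct Python environment.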
UncleCode
704bd66b63 Upgrade Playwright installation command to install dependencies 2025-01-01 15:23:16 +08:00
UncleCode
1acc162c18 Bump version to v0.4.241 2025-01-01 15:16:06 +08:00
UncleCode
553c97a0c1 Fix bug reported in issue https://github.com/unclecode/crawl4ai/issues/396 2025-01-01 15:15:14 +08:00
UncleCode
bd66befcf0 Fix issue in 0.4.24 walkthrough 2024-12-31 21:07:58 +08:00
UncleCode
3e769a9c6c Fix issue in 0.4.24 walkthrough 2024-12-31 21:07:33 +08:00
UncleCode
19b0a5ae82 Update 0.4.24 walkthrough 2024-12-31 21:01:46 +08:00
UncleCode
bd71f7f4ea Add 0.4.24 walkthrough 2024-12-31 20:22:33 +08:00
UncleCode
171ce25ba6 Fix typo in CHANGELOG 2024-12-31 19:49:00 +08:00
UncleCode
6c5a44f774 chore: bump version to 0.4.25 2024-12-31 19:45:48 +08:00
UncleCode
5c3c05bf93 docs: update README badges and Docker section, reorganize documentation structure 2024-12-31 19:45:02 +08:00
UncleCode
67d0999bc3 chore: resolve merge conflicts for v0.4.24 v0.4.24 2024-12-31 19:24:03 +08:00
UncleCode
553a4622bf chore: prepare for version 0.4.24 2024-12-31 19:18:36 +08:00
UncleCode
6f81ef006d Remove .local folder from remote repository 2024-12-31 17:37:50 +08:00
UncleCode
a04870a662 Remove .do folder 2024-12-31 17:37:14 +08:00
UncleCode
f7d26390c5 Remove .do folder 2024-12-31 17:36:22 +08:00
UncleCode
141783fb2d Remove .do folder from remote repository 2024-12-31 17:35:57 +08:00
UncleCode
2fedd4876e Update gitignore 2024-12-31 17:35:34 +08:00
UncleCode
e187b0aaf0 update gitignore 2024-12-31 17:34:31 +08:00