Commit Graph

  • 0d3f9e65b0 Add MEMORY.md to gitignore develop unclecode 2025-12-30 03:04:30 +00:00
  • db61ab8559 Update URL seeder docs with smart TTL cache parameters unclecode 2025-12-30 03:03:41 +00:00
  • 3d78001c30 Add smart TTL cache for sitemap URL seeder unclecode 2025-12-30 01:59:09 +00:00
  • 2550f3d2d5 Add browser pipeline support for raw:/file:// URLs unclecode 2025-12-27 12:32:42 +00:00
  • a43256b27a Add proxy support to HTTP crawler strategy unclecode 2025-12-26 13:17:28 +00:00
  • 9e7f5aa44b Updates on proxy rotation and proxy configuration unclecode 2025-12-26 12:45:57 +00:00
  • c85f56b085 Merge pull request #1677 from unclecode/sponsors/thor_data main UncleCode 2025-12-25 12:08:21 +08:00
  • fde4e9f0c6 Add prefetch mode for two-phase deep crawling unclecode 2025-12-25 01:55:08 +00:00
  • 3937efcf0b Add base_url parameter to CrawlerRunConfig for raw HTML processing unclecode 2025-12-24 06:05:55 +00:00
  • 624e34164d Fix: HTTP strategy raw: URL parsing truncates at # character unclecode 2025-12-24 04:31:57 +00:00
  • a234959b12 sponsors: Add thor data as sponsor sponsors/thor_data Aravind Karnam 2025-12-23 20:45:00 +05:30
  • da82f0ada5 sponsors: Add thor data as sponsor Aravind Karnam 2025-12-23 16:28:26 +05:30
  • 31ebf37252 Add crash recovery for deep crawl strategies unclecode 2025-12-22 14:51:10 +00:00
  • 67e03d64b8 Add PDF and MHTML support for raw: and file:// URLs unclecode 2025-12-22 01:24:51 +00:00
  • 444cb14f82 Add _generate_screenshot_from_html for raw: and file:// URLs unclecode 2025-12-22 01:10:20 +00:00
  • 48426f73f0 Some debugging for caching unclecode 2025-12-21 04:45:52 +00:00
  • f6b29a8f9f Update gitignore unclecode 2025-12-21 03:15:15 +00:00
  • 02acad1dc6 Fix CDP connection handling: support WS URLs and proper cleanup unclecode 2025-12-18 22:04:52 +08:00
  • d5a0866e03 fix: pdf processing to target only css_selector, thereby give users a choice to discard unnecssary element from page in pdf generated pdf_processing Aravind Karnam 2025-12-18 16:32:16 +05:30
  • d10ca38599 Add init_scripts support to BrowserConfig for pre-page-load JS injection unclecode 2025-12-14 01:58:11 +00:00
  • ecedb6113e Add context caching to create_isolated_context branch unclecode 2025-12-13 08:58:21 +00:00
  • 55eb968a8d Add create_isolated_context flag for concurrent CDP crawls unclecode 2025-12-13 08:29:05 +00:00
  • 6185d3cb32 Revert context matching attempts - Playwright cannot see CDP-created contexts unclecode 2025-12-13 07:57:29 +00:00
  • 8014805c17 Fix: use CDP to find context by browserContextId for concurrent sessions unclecode 2025-12-13 07:02:23 +00:00
  • c1e485e0b0 Fix: use target_id to find correct page in get_page unclecode 2025-12-13 06:51:54 +00:00
  • b2e4a1f2e3 Fix: find context by target_id for concurrent CDP connections unclecode 2025-12-13 06:41:13 +00:00
  • d22825eea4 Fix: add cdp_cleanup_on_close to from_kwargs unclecode 2025-12-13 06:33:26 +00:00
  • 66941a59e8 Add cdp_cleanup_on_close flag to prevent memory leaks in cloud/server scenarios unclecode 2025-12-13 06:25:25 +00:00
  • 8ae908bede Add browser_context_id and target_id parameters to BrowserConfig unclecode 2025-12-13 02:42:48 +00:00
  • 306ddcbf3d Merge branch 'main' into develop ntohidi 2025-12-11 11:18:30 +01:00
  • a87e8c1c9e Release/v0.7.8 (#1662) Nasrin 2025-12-11 18:04:52 +08:00
  • 61be862ab0 fix: add disk cleanup step to Docker workflow docker-rebuild-v0.7.8 release/v0.7.8 ntohidi 2025-12-11 10:28:15 +01:00
  • 835e3c56fe Add disk cleanup step in Docker release workflow UncleCode 2025-12-11 09:49:27 +01:00
  • b0b2b2761c fix:Make JsonCssExtractionStrategy.generate_schema resilient to markdown tags generated by LLMs https://github.com/unclecode/crawl4ai/issues/1663 patch/generate_schema Aravind Karnam 2025-12-09 15:23:56 +05:30
  • 9672afded2 docs: add section for Crawl4AI Cloud API closed beta with application link ntohidi 2025-12-09 10:27:15 +01:00
  • 60d6173914 Merge pull request #1661 from unclecode/waitlist v0.7.8 Nasrin 2025-12-09 16:44:15 +08:00
  • 48c31c4cb9 Release v0.7.8: Stability & Bug Fix Release ntohidi 2025-12-08 15:42:29 +01:00
  • 48b6283e71 announcement: add application form for cloud API closed beta Aravind Karnam 2025-12-08 14:00:57 +05:30
  • 5a8fb57795 Merge pull request #1648 from christopher-w-murphy/fix/content-relevance-filter Nasrin 2025-12-03 18:36:07 +08:00
  • df4d87ed78 refactor: replace PyPDF2 with pypdf across the codebase. ref #1412 ntohidi 2025-12-03 10:59:18 +01:00
  • f32cfc6db0 Merge pull request #1645 from unclecode/fix/configurable-backoff Nasrin 2025-12-02 21:07:49 +08:00
  • d06c39e8ab Merge pull request #1641 from unclecode/fix/serialize-proxy-config Nasrin 2025-12-02 21:06:02 +08:00
  • afc31e144a Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop ntohidi 2025-12-02 13:01:11 +01:00
  • 07ccf13be6 Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268 ntohidi 2025-12-02 13:00:54 +01:00
  • 3a07c5962c Sponsors/new (#1643) Aravind 2025-12-02 05:19:39 +05:30
  • 6893094f58 parameterized tests Chris Murphy 2025-12-01 16:19:19 -05:00
  • 3a8f8298d3 import modules from enhanceable deserialization Chris Murphy 2025-12-01 16:18:59 -05:00
  • e95e8e1a97 generalized query in ContentRelevanceFilter to be a str or list Chris Murphy 2025-12-01 16:16:31 -05:00
  • eb76df2c0d added missing deep crawling objects to init Chris Murphy 2025-12-01 16:15:58 -05:00
  • 6ec6bc4d8a pass timeout parameter to docker client request Chris Murphy 2025-12-01 16:15:27 -05:00
  • 33a3cc3933 reproduced AttributeError from #1642 Chris Murphy 2025-12-01 11:31:07 -05:00
  • 7a133e22cc feat: make LLM backoff configurable end-to-end fix/configurable-backoff Soham Kukreti 2025-11-28 18:50:04 +05:30
  • dcb77c94bf Merge pull request #1623 from unclecode/fix/deprecated_pydantic Nasrin 2025-11-27 20:05:42 +08:00
  • 6695a21a41 Fix: enhance fallback scoring for failed head extraction in LinkPreview. ref #1638 fix/linkPreviewScoring ntohidi 2025-11-27 12:14:08 +01:00
  • a0c5f0f79a fix: ensure BrowserConfig.to_dict serializes proxy_config fix/serialize-proxy-config Soham Kukreti 2025-11-26 17:44:06 +05:30
  • 6eb3baed50 feat: Add ConfigHealthMonitor for automated crawler configuration health monitoring feature/configHealthMonitor Soham Kukreti 2025-11-25 23:49:15 +05:30
  • b36c6daa5c Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638 ntohidi 2025-11-25 11:51:59 +01:00
  • 94c8a833bf Merge pull request #1447 from rbushri/fix/wrong_url_raw Nasrin 2025-11-25 17:49:44 +08:00
  • 84bfea8bd1 Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621 ntohidi 2025-11-25 10:46:00 +01:00
  • 0024c82cdc Sponsors/new (#1637) Aravind 2025-11-24 17:59:33 +05:30
  • 7771ed3894 Merge branch 'develop' into fix/wrong_url_raw Rachel Bushrian 2025-11-24 13:54:07 +02:00
  • af77800a6b Implement CORS handling with --disable-web-security in BrowserManager and add corresponding tests fix-cors-disable-web-security AHMET YILMAZ 2025-11-18 16:18:49 +08:00
  • eca04b0368 Refactor Pydantic model configuration to use ConfigDict for arbitrary types fix/deprecated_pydantic AHMET YILMAZ 2025-11-18 15:40:17 +08:00
  • 43a2088eb0 Fix redirect target verification in AsyncUrlSeeder and enhance tests fix-async-url-seeder-redirect-verification AHMET YILMAZ 2025-11-18 11:43:47 +08:00
  • c2c4d42be4 Fix #1181: Preserve whitespace in code blocks during HTML scraping ntohidi 2025-11-17 12:21:23 +01:00
  • f68e7531e3 Sponsors/scrapeless (#1619) Aravind 2025-11-17 12:14:52 +05:30
  • cb637fb5c4 Merge pull request #1613 from unclecode/release/v0.7.7 UncleCode 2025-11-16 12:26:54 +01:00
  • 6244f56f36 Release v0.7.7 v0.7.7 docker-rebuild-v0.7.7 release/v0.7.7 ntohidi 2025-11-14 10:23:31 +01:00
  • 2c973b1183 Merge branch 'develop' into release/v0.7.7 ntohidi 2025-11-13 14:54:05 +01:00
  • f3146de969 Merge pull request #1609 from unclecode/fix/update-config-documentation Nasrin 2025-11-13 21:52:53 +08:00
  • d6b6d11a2d docs: update browser and crawler run config documentation to match async_configs.py implementation Soham Kukreti 2025-11-13 14:54:16 +05:30
  • b58579548c Bump version to 0.7.7 for stable release ntohidi 2025-11-13 09:52:18 +01:00
  • 466be69e72 Merge pull request #1607 from unclecode/fix/dfs_deep_crawling Nasrin 2025-11-13 16:43:47 +08:00
  • ceade853c3 Enhance DFSDeepCrawlStrategy documentation for clarity and detail fix/dfs_deep_crawling AHMET YILMAZ 2025-11-13 16:39:08 +08:00
  • 998c809e08 Rename folder name for NSTProxy integration examples for crawl4ai ntohidi 2025-11-13 09:36:39 +01:00
  • d0fb53540d Update proxy-security documentation ntohidi 2025-11-13 09:23:44 +01:00
  • 8116b15b63 Merge pull request #1596 from unclecode/docs-proxy-security Nasrin 2025-11-13 16:22:28 +08:00
  • fe353c4e27 Refactor proxy configuration documentation for clarity and consistency docs-proxy-security AHMET YILMAZ 2025-11-13 11:20:24 +08:00
  • 89cc29fe44 Merge branch 'fix/docker' into develop ntohidi 2025-11-12 17:06:31 +01:00
  • cdcb8836b7 Merge pull request #1605 from Nstproxy/feat/nstproxy Nasrin 2025-11-12 23:56:14 +08:00
  • b207ae2848 Merge pull request #1528 from unclecode/fix/managed-browser-cdp-timing Nasrin 2025-11-12 23:53:57 +08:00
  • be00fc3a42 Merge pull request #1598 from unclecode/fix/sitemap_seeder Nasrin 2025-11-12 18:09:34 +08:00
  • 124ac583bb Merge pull request #1599 from unclecode/docs-llm-strategies-update Nasrin 2025-11-12 17:54:26 +08:00
  • 1bd3de6a47 #1510 : Add DFS deep crawler demonstration script and enhance DFS strategy with seen URL tracking AHMET YILMAZ 2025-11-12 17:44:43 +08:00
  • 80452166c8 feat: Add Nstproxy Proxies nstproxy 2025-11-12 16:25:39 +08:00
  • a99cd37c0e Merge pull request #1597 from unclecode/sponsors/capsolver UncleCode 2025-11-11 14:50:44 +08:00
  • 2e8f8c9b49 #1551 : Fix casing and variable name consistency for LLMConfig in documentation docs-llm-strategies-update AHMET YILMAZ 2025-11-10 15:38:14 +08:00
  • 80745bceb9 #1559 :Add tests for sitemap parsing and URL normalization in AsyncUrlSeeder fix/sitemap_seeder AHMET YILMAZ 2025-11-10 14:15:54 +08:00
  • 4bee230c37 docs: Add a tip for captcha solving usecases using a third party integration Aravind Karnam 2025-11-10 11:20:48 +05:30
  • 006e29f308 Merge pull request #1589 from capsolver/main Aravind 2025-11-10 10:45:16 +05:30
  • 263ac890fd #1591 : Enhance proxy configuration documentation with security features, SSL analysis, and improved examples AHMET YILMAZ 2025-11-10 11:42:07 +08:00
  • 78120df47e chore: update .gitignore from main feature/agent-oai unclecode 2025-11-09 19:19:52 +08:00
  • 1a22fb4d4f docs: rename Docker deployment to self-hosting guide with comprehensive monitoring documentation fix/docker unclecode 2025-11-09 13:31:52 +08:00
  • 81b5312629 Update gitignore unclecode 2025-11-09 10:49:42 +08:00
  • c003cb6e4f fix #1563 (cdp): resolve page leaks and race conditions in concurrent crawling bugfix/arun-many-cdp-managed-browser AHMET YILMAZ 2025-11-07 15:42:37 +08:00
  • d56b0eb9a9 Merge pull request #1495 from unclecode/fix/viewport_in_managed_browser Nasrin 2025-11-06 18:42:45 +08:00
  • 66175e132b Merge pull request #1590 from unclecode/fix/async-llm-extraction-arunMany Nasrin 2025-11-06 18:40:42 +08:00
  • a30548a98f This commit resolves issue #1055 where LLM extraction was blocking async execution, causing URLs to be processed sequentially instead of in parallel. fix/async-llm-extraction-arunMany ntohidi 2025-11-06 11:22:45 +01:00
  • c1c5dfc49b Add smoke test and comprehensive documentation copilot/modify-page-creation-and-logging copilot-swe-agent[bot] 2025-11-06 08:20:39 +00:00
  • 2507720cc7 Refactor imports for PEP 8 compliance and clarity copilot-swe-agent[bot] 2025-11-06 08:18:48 +00:00