285 Commits

Author SHA1 Message Date
Nicolas
b05327dbbe Nick: fix py sdk validation error 2025-08-30 18:41:47 -04:00
devin-ai-integration[bot]
7bea613ec0
feat: add maxPages parameter to PDF parser in v2 scrape API (#2047)
* feat: add maxPages parameter to PDF parser

- Extend parsersSchema to support both string array ['pdf'] and object array [{'type':'pdf','maxPages':10}] formats
- Add shouldParsePDF and getPDFMaxPages helper functions for consistent parser handling
- Update PDF processing to respect maxPages limit in both RunPod MU and PdfParse processors
- Modify billing calculation to use actual pages processed instead of total pages
- Add comprehensive tests for object format parsers, page limiting, and validation
- Maintain backward compatibility with existing string array format

The maxPages parameter is optional and defaults to unlimited when not specified.
Page limiting occurs before processing to avoid unnecessary computation, and
billing is based on the effective page count for fairness.
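
A minimal sketch of the two accepted payload shapes described above (the URL is illustrative; only the "parsers" field and its formats come from this PR):

    # Both shapes accepted by the updated parsersSchema, per this commit.
    legacy_payload = {
        "url": "https://example.com/report.pdf",
        "parsers": ["pdf"],  # string shorthand: no page limit
    }
    limited_payload = {
        "url": "https://example.com/report.pdf",
        "parsers": [{"type": "pdf", "maxPages": 10}],  # object form: stop after 10 pages
    }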

Co-Authored-By: thomas@sideguide.dev <thomas@sideguide.dev>

* fix: correct parsersSchema to handle individual parser items

- Change union from array-level to item-level in parsersSchema
- Now accepts array where each item is either string 'pdf' or object {'type':'pdf','maxPages':10}
- When parser is string 'pdf', maxPages is undefined (no limit)
- When parser is object, use specified maxPages value
- Maintains backward compatibility with existing ['pdf'] format

Co-Authored-By: thomas@sideguide.dev <thomas@sideguide.dev>

* fix: remove maxPages logic from scrapePDFWithParsePDF per PR feedback

- Remove maxPages parameter and truncation logic from scrapePDFWithParsePDF
- Keep maxPages logic only in scrapePDFWithRunPodMU where it provides cost savings
- Addresses feedback from mogery: pdf-parse doesn't cost anything extra to process all pages

Co-Authored-By: thomas@sideguide.dev <thomas@sideguide.dev>

* test: add maxPages parameter tests for crawl and search endpoints

- Add crawl endpoint test with PDF maxPages parameter
- Add search endpoint test with PDF maxPages parameter
- Verify maxPages works end-to-end across all endpoints (scrape, crawl, search)
- Ensure schema inheritance and data flow work correctly

Co-Authored-By: thomas@sideguide.dev <thomas@sideguide.dev>

* fix: remove problematic crawl and search tests for maxPages

- Remove crawl test that incorrectly uses direct PDF URL
- Remove search test that relies on unreliable external search results
- maxPages functionality verified through schema inheritance and data flow analysis
- Comprehensive tests already exist in parsers.test.ts for core functionality

Co-Authored-By: thomas@sideguide.dev <thomas@sideguide.dev>

* feat: add maxPages parameter support to Python and JavaScript SDKs

- Add PDFParser class to Python SDK with max_pages field validation (1-1000)
- Update Python SDK parsers field to support Union[List[str], List[Union[str, PDFParser]]]
- Add parsers preprocessing in Python SDK to convert snake_case to camelCase
- Update JavaScript SDK parsers type to Array<string | { type: 'pdf'; maxPages?: number }>
- Add maxPages validation to JavaScript SDK ensureValidScrapeOptions
- Maintain backward compatibility with existing ['pdf'] string array format
- Support mixed formats in both SDKs
- Add comprehensive test files for both SDKs

Addresses GitHub comment requesting SDK support for maxPages parameter.

Co-Authored-By: thomas@sideguide.dev <thomas@sideguide.dev>

* cleanup: remove temporary test files

Co-Authored-By: thomas@sideguide.dev <thomas@sideguide.dev>

* fix: correct parsers schema to support mixed string and object arrays

- Fix parsers schema to properly handle mixed arrays like ['pdf', {type: 'pdf', maxPages: 5}]
- Resolves backward compatibility issue that was causing webhook test failures
- All parser formats now work: ['pdf'], [{type: 'pdf'}], [{type: 'pdf', maxPages: 10}], mixed arrays

Co-Authored-By: thomas@sideguide.dev <thomas@sideguide.dev>

* Delete SDK_MAXPAGES_IMPLEMENTATION.md

* feat: increase maxPages limit from 1000 to 10000 pages

- Update backend Zod schema validation in types.ts
- Update JavaScript SDK client-side validation
- Update API test cases to use new 10000 limit
- Addresses GitHub comment feedback from nickscamara

Co-Authored-By: thomas@sideguide.dev <thomas@sideguide.dev>

* fix: update Python SDK maxPages limit from 1000 to 10000

- Fix validation discrepancy between Python SDK (1000) and backend/JS SDK (10000)
- Ensures consistent maxPages validation across all SDKs
- Addresses critical bug identified in PR review

Co-Authored-By: thomas@sideguide.dev <thomas@sideguide.dev>

* fix: remove SDK-side maxPages validation per PR feedback

- Remove maxPages range validation from JavaScript SDK validation.ts
- Remove maxPages range validation from Python SDK types.py
- Keep backend API validation as single source of truth
- Addresses GitHub comment from mogery

Co-Authored-By: thomas@sideguide.dev <thomas@sideguide.dev>

* Nick:

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: thomas@sideguide.dev <thomas@sideguide.dev>
Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-08-29 20:13:58 -04:00
Rafael Miller
815963890f
feat(sdks): next cursor pagination (#2067)
* feat(sdks): next cursor pagination

- Auto-pagination is enabled by default; pass { autoPaginate: false } (JS) or PaginationConfig(auto_paginate=False) (Python) to restore single-page behavior (sketched below).
- This means potentially larger responses and fewer API calls by default.
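
A sketch of opting out in the Python SDK; PaginationConfig(auto_paginate=False) comes from this commit, while the import paths, client class, and method name are assumptions:

    from firecrawl import Firecrawl                  # assumed v2 entry point
    from firecrawl.v2.types import PaginationConfig  # assumed import path

    client = Firecrawl(api_key="fc-YOUR-KEY")
    # Auto-pagination is on by default; disabling it returns a single page
    # plus a next cursor instead of the fully aggregated result.
    status = client.get_crawl_status(
        "crawl-id",
        pagination_config=PaginationConfig(auto_paginate=False),
    )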

* good to go

* bump sdks version

* fixed tests and endpoints

* addressed cubic's comments

* docs

* cubic's comments

* Update apps/js-sdk/firecrawl/src/v2/methods/crawl.ts

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

* here we go

* Update apps/python-sdk/firecrawl/v2/utils/http_client.py

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

* Update apps/python-sdk/firecrawl/v2/methods/crawl.py

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

* Update apps/python-sdk/firecrawl/__tests__/unit/v2/methods/test_pagination.py

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

* rafa:

* Update apps/python-sdk/firecrawl/v2/methods/crawl.py

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

* Update example_pagination.py

---------

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-08-29 20:01:51 -03:00
rafaelmmiller
293b532629 chore(sdks): bumped sdks 2025-08-27 11:23:13 -03:00
Rafael Miller
7ac607617f
fix(python-sdk): missing methods in client (#2050)
added get_active_crawls and start_extract methods that were missing in the sync client
2025-08-27 08:09:12 -03:00
Vishnu Krishnan
0589994ec6
feat(api): support extraction of data-* attributes in scrape endpoints (#2006)
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
Co-authored-by: Gergő Móricz <mo.geryy@gmail.com>
2025-08-27 09:53:11 +02:00
Vishnu Krishnan
30c6bdd938
feat(api): add image extraction support to v2 scrape endpoint (#2008) 2025-08-27 09:50:06 +02:00
Nicolas
54d8d92c99 Nick: sdks now have search categories 2025-08-23 16:59:39 -07:00
Nicolas
44715278b3 Nick: updated sdks with search categories 2025-08-23 16:47:43 -07:00
Nicolas
f3c73d00c0
(feat/search) Search Categories (#2019)
* Nick: search categories

* Nick:

* Update apps/api/src/lib/search-query-builder.ts

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

* Update apps/api/src/lib/search-query-builder.ts

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

* Update types.ts

---------

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-08-23 16:28:56 -07:00
rafaelmmiller
46567088f4 (sdks): fix unit tests 2025-08-22 17:09:04 -03:00
devin-ai-integration[bot]
85a2402dde
fix: add explicit pydantic>=2.0 dependency requirement (#2010)
- Add pydantic>=2.0 constraint to pyproject.toml, setup.py, and requirements.txt
- Bump version from 3.2.0 to 3.2.1
- Fixes ImportError when importing field_validator, which exists only in pydantic v2
- Resolves issue where users with pydantic v1 would get import errors
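
The import at issue; it exists only in pydantic>=2.0, so v1 installs raised the ImportError this pin prevents:

    # Available in pydantic>=2.0 only; under v1 this line raises ImportError.
    from pydantic import field_validator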

Fixes: https://linear.app/firecrawl/issue/ENG-3267

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: rafael@sideguide.dev <rafael@sideguide.dev>
2025-08-22 15:31:14 -03:00
Rafael Miller
61c3abcb5b
(python-sdk): fixed search types + added max_keepalive_connections to fix the "event loop is closed" bug (#2005) 2025-08-22 15:29:47 +02:00
Micah Stairs
ac278ccc1b
Fix search validation of custom date ranges and "qdr:h" (#1993)
* Fix search validation to support custom date ranges

- Add regex import to search.py
- Update _validate_search_request to support the cdr:1,cd_min:MM/DD/YYYY,cd_max:MM/DD/YYYY format (sketched below)
- Add missing qdr:h (hourly) predefined value
- Maintain backward compatibility with existing predefined values
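
A sketch of validation consistent with the formats named above; the exact SDK regex and the full predefined set are assumptions (only qdr:h is named by this commit):

    import re

    PREDEFINED_TBS = {"qdr:h", "qdr:d", "qdr:w", "qdr:m", "qdr:y"}  # qdr:h added here; rest assumed
    CUSTOM_RANGE = re.compile(
        r"^cdr:1,cd_min:\d{1,2}/\d{1,2}/\d{4},cd_max:\d{1,2}/\d{1,2}/\d{4}$"
    )

    def is_valid_tbs(tbs: str) -> bool:
        # Accept either a predefined window or a custom date range.
        return tbs in PREDEFINED_TBS or bool(CUSTOM_RANGE.match(tbs))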

Co-Authored-By: Micah Stairs <micah.stairs@gmail.com>

* Add verification script for custom date range validation

- Comprehensive test script that validates the fix works correctly
- Tests both valid and invalid custom date range formats
- Verifies backward compatibility with predefined values
- All tests pass, confirming the fix works

Co-Authored-By: Micah Stairs <micah.stairs@gmail.com>

* Add test cases for custom date range validation

- Add test_validate_custom_date_ranges for valid custom date formats
- Add test_validate_invalid_custom_date_ranges for invalid formats
- Include test_custom_date_range.py for additional verification
- Ensure comprehensive test coverage for the validation fix

Co-Authored-By: Micah Stairs <micah.stairs@gmail.com>

* Fix tests

Co-Authored-By: Micah Stairs <micah.stairs@gmail.com>

* version bump

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Rafael Miller <150964962+rafaelsideguide@users.noreply.github.com>
2025-08-22 09:23:31 -03:00
Rafael Miller
f5d901b9d9
(python-sdk)fix: added normalizer for document.metadata, also added m… (#1986)
* (python-sdk)fix: added normalizer for document.metadata, also added metadata_dict and metadata_typed properties
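
Hypothetical access patterns for the two properties this commit adds (names from the title; behavior inferred from "normalizer"):

    # Assuming `doc` is a Document returned by a scrape call.
    raw = doc.metadata_dict     # metadata as a plain dict
    typed = doc.metadata_typed  # metadata normalized into the SDK's typed model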

* address cubic's comment

* Update __init__.py

---------

Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-08-20 18:31:38 -07:00
Nicolas
e3eebfb7db Update __init__.py 2025-08-18 21:53:15 -07:00
rafaelmmiller
99dcc36e6e (sdks): fixed origin, batch aio and tests 2025-08-18 16:14:20 -03:00
Gergő Móricz
2f3bc4e7a7 mendableai -> firecrawl 2025-08-18 20:46:41 +02:00
Nicolas
4ef24355f7 Create example_v2.py 2025-08-17 20:13:56 -07:00
rafaelmmiller
0e3b9d2ffb fix python types + json/pydantic 2025-08-17 18:45:35 -03:00
Nicolas
6849d938c4 Nick: FirecrawlApp compatible 2025-08-17 14:15:50 -07:00
rafaelmmiller
c4c2bbd803 delete cache 2025-08-14 09:22:18 -03:00
rafaelmmiller
55e8d443fd (sdks): added summary, fixed usage tests and v2 client for python 2025-08-14 09:21:54 -03:00
Gergő Móricz
e251516a8e fix usage stuff 2025-08-13 20:07:02 +02:00
Gergő Móricz
c1700e06c6 fix in python sdk 2025-08-13 19:37:03 +02:00
Gergő Móricz
356b04fb65 further crawl-errors improvements 2025-08-13 17:49:29 +02:00
rafaelmmiller
537f6c4ec0 (python/js/ts-sdks): readmes, e2e tests with idmux etc. all good 2025-08-12 18:00:37 -03:00
rafaelmmiller
d44baed8f2 (js-sdk): mostly done 2025-08-12 13:50:52 -03:00
rafaelmmiller
ec69c30992 (js-sdk): methods done. todo: e2e/unit tests 2025-08-12 10:25:41 -03:00
rafaelmmiller
10b7202898 (python-sdk): extract v2 2025-08-11 16:35:49 -03:00
rafaelmmiller
d31d39d664 (python-sdk): batch, map, ws improv and aio methods. e2e tests done. 2025-08-11 14:29:26 -03:00
Gergő Móricz
21cbfec398 Merge branch 'main' into nsc/v2 2025-08-10 18:25:08 +02:00
rafaelmmiller
dd6b46d373 (python-sdk): removed client duplication, bunch of type fixing, added map method + e2e/unit tests 2025-08-08 11:56:05 -03:00
rafaelmmiller
08c8f42091 (python-sdk): scrape is done! 2025-08-07 18:10:32 -03:00
rafaelmmiller
d2b325f815 (python-sdk): get_crawl_errors and active_crawls, got rid of useless tests 2025-08-07 15:52:07 -03:00
rafaelmmiller
7a85b9f433 (python-sdk): crawl done 2025-08-07 11:18:07 -03:00
rafaelmmiller
631dc981e3 (python-sdk): wip - crawl endpoints
a few tests still failing
2025-08-06 18:41:54 -03:00
devin-ai-integration[bot]
71829dbde3
feat(python-sdk): add agent parameter support to scrape_url method (#1919)
* feat(python-sdk): add agent parameter support to scrape_url method

- Add agent parameter to FirecrawlApp.scrape_url method signature
- Add agent parameter to AsyncFirecrawlApp.scrape_url method signature
- Update validation logic to allow agent parameter for scrape_url
- Add agent parameter handling following batch methods pattern
- Add test case for agent parameter functionality

Resolves issue where agent parameter was only supported in batch methods
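
A hypothetical call shape; only the parameter name comes from this commit, and the payload keys are illustrative:

    from firecrawl import FirecrawlApp

    app = FirecrawlApp(api_key="fc-YOUR-KEY")
    # agent was previously accepted only by the batch methods; this commit
    # adds it to scrape_url. The dict contents are not a documented shape.
    doc = app.scrape_url("https://example.com", agent={"model": "FIRE-1"})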

Co-Authored-By: Micah Stairs <micah.stairs@gmail.com>

* chore: revert test file changes and bump version to 2.16.5

- Remove test case for agent parameter from test file
- Bump Python SDK version from 2.16.4 to 2.16.5

Co-Authored-By: Micah Stairs <micah.stairs@gmail.com>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Micah Stairs <micah.stairs@gmail.com>
2025-08-06 15:34:56 -04:00
rafaelmmiller
f22fba6295 (python-sdk): wip - base structure and search endpoint 2025-08-05 17:41:12 -03:00
Gergő Móricz
7400e10eac
feat(v2): parsers + merge w/ main (#1907)
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Micah Stairs <micah.stairs@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: mogery <mo.gery@gmail.com>
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Rafael Miller <150964962+rafaelsideguide@users.noreply.github.com>
Co-authored-by: Nicolas <nicolascamara29@gmail.com>
Co-authored-by: Chetan Goti <chetan.goti7@gmail.com>
fix: improve robots.txt HTML filtering to check content structure (#1880)
fix(html-to-markdown): reinitialize converter lib for every conversion (#1872)
fix(go): add mutex to prevent concurrent access issues in html-to-markdown (#1883)
fix(js-sdk): add retry logic for socket hang up errors in monitorJobStatus (ENG-3029) (#1893)
fix(go): add mutex to prevent concurrent access issues in html-to-markdown (#1883)
Fix Pydantic field name shadowing issues causing import NameError (#1800)
fix(crawl-redis): attempt to cleanup crawl memory post finish (#1901)
fix(docs): correct link to Map (#1904)
2025-08-01 20:36:32 +02:00
Rafael Miller
4c0234079c
Improve error handling in Python SDK for non-JSON responses (#1827)
* Improve error handling in Python SDK for non-JSON responses

- Enhanced _handle_error method to gracefully handle non-JSON server responses (sketched below)
- Added fallback error messages when JSON parsing fails
- Improved error details with response content preview (limited to 500 chars)
- Fixed return statement bug in _get_error_message for 403 status code
- Better user experience when server returns HTML error pages or empty responses
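
A minimal sketch of the fallback described above, assuming a requests.Response-like object (not the SDK's exact code):

    def error_message(response):
        try:
            # Normal path: the server returned a JSON error body.
            return response.json().get("error", "Unknown error")
        except ValueError:
            # Fallback: HTML error page or empty body; include a preview
            # capped at 500 chars, per this commit.
            preview = (response.text or "")[:500]
            return f"Non-JSON response (status {response.status_code}): {preview}"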

* version bump

* Update apps/python-sdk/firecrawl/firecrawl.py

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

---------

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-07-31 15:56:02 -03:00
devin-ai-integration[bot]
269c7097cb
feat(python-sdk): implement missing crawl_entire_domain parameter (#1896)
* feat(python-sdk): implement missing crawl_entire_domain parameter

- Add crawlEntireDomain field to CrawlParams schema
- Fix crawl_url_and_watch to pass crawl_entire_domain parameter
- Add test for crawl_entire_domain functionality

The parameter was already implemented in most places but was missing:
1. crawlEntireDomain field in the CrawlParams schema
2. crawl_entire_domain parameter passing in crawl_url_and_watch method
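
A hypothetical call once the parameter is plumbed through; the kwarg name comes from this commit, and crawl_url accepting it like other CrawlParams fields is an assumption:

    from firecrawl import FirecrawlApp

    app = FirecrawlApp(api_key="fc-YOUR-KEY")
    # crawlEntireDomain in the schema maps to this snake_case kwarg.
    app.crawl_url("https://example.com/blog", crawl_entire_domain=True)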

Co-Authored-By: rafael@sideguide.dev <rafael@sideguide.dev>

* fix(python-sdk): add missing SearchParams import in test file

Co-Authored-By: rafael@sideguide.dev <rafael@sideguide.dev>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: rafael@sideguide.dev <rafael@sideguide.dev>
2025-07-31 15:55:20 -03:00
devin-ai-integration[bot]
21103b5a58
fix: convert timeout from milliseconds to seconds in Python SDK (#1894)
* fix: convert timeout from milliseconds to seconds in Python SDK

- Fix timeout conversion in scrape_url method (line 596)
- Fix timeout conversion in _post_request method (line 2207)
- Add comprehensive tests for timeout functionality
- Resolves issue #1848

The Python SDK was incorrectly passing timeout values in milliseconds
directly to requests.post(), which expects seconds, causing timeouts
to be 1000x longer than intended (e.g. 60s became 16.6 hours).

Co-Authored-By: rafael@sideguide.dev <rafael@sideguide.dev>

* fix: handle timeout=0 edge case in conversion logic

- Change condition from 'if timeout' to 'if timeout is not None'
- Ensures timeout=0 is converted to 5.0 seconds instead of None
- All timeout conversion tests now pass (5/5)

Co-Authored-By: rafael@sideguide.dev <rafael@sideguide.dev>

* feat: change default timeout from None to 30s (30000ms)

- Update all timeout parameter defaults from None to 30000ms across SDK
- ScrapeOptions, MapParams, and all method signatures now default to 30s
- Update tests to verify new default timeout behavior (35s total with 5s buffer)
- Add test for _post_request when no timeout key is present in data
- Maintains backward compatibility for explicit timeout values
- All 6 timeout conversion tests pass
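
A consolidated sketch of the conversion these commits describe; the helper name is hypothetical, while the 5s buffer, the timeout=0 behavior, and the 30000ms default come from the bullets above:

    DEFAULT_TIMEOUT_MS = 30000  # new default per this commit

    def to_request_timeout(timeout_ms):
        # The API takes milliseconds; requests.post() expects seconds.
        # 'is not None' (rather than truthiness) so timeout=0 still converts,
        # landing on the 5s buffer instead of None.
        if timeout_ms is not None:
            return timeout_ms / 1000.0 + 5.0  # e.g. 30000ms -> 35.0s total
        return None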

Co-Authored-By: rafael@sideguide.dev <rafael@sideguide.dev>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: rafael@sideguide.dev <rafael@sideguide.dev>
2025-07-31 15:55:00 -03:00
devin-ai-integration[bot]
a7aa0cb2f4
Fix Pydantic field name shadowing issues causing import NameError (#1800) 2025-07-30 16:00:25 -03:00
devin-ai-integration[bot]
26926e56e7
fix(python-sdk): add max_age parameter to scrape_url validation (#1825)
* fix(python-sdk): add max_age parameter to scrape_url validation

- Add max_age to allowed parameters in _validate_kwargs for scrape_url method
- Add comprehensive tests for max_age parameter validation
- Fixes issue where max_age parameter was implemented but not validated properly

The max_age parameter is used for caching and speeding up scrapes. It was already
implemented in the scrape_url method and converted to maxAge in API requests,
but was missing from the validation whitelist in the _validate_kwargs method.
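
A hypothetical call per the description above; the kwarg name comes from this commit, the unit of the value is an assumption:

    from firecrawl import FirecrawlApp

    app = FirecrawlApp(api_key="fc-YOUR-KEY")
    # max_age is converted to maxAge in the API request; assuming a
    # milliseconds cache window, this accepts results up to an hour old.
    doc = app.scrape_url("https://example.com", max_age=3600000)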

Co-Authored-By: Micah Stairs <micah@sideguide.dev>

* Remove test file as requested - keep only core validation fix

Co-Authored-By: Micah Stairs <micah@sideguide.dev>

* fix(python-sdk): add missing validation call to scrape_url method

- Add self._validate_kwargs(kwargs, 'scrape_url') call to main FirecrawlApp.scrape_url method
- Follows the same pattern as all other methods in the codebase
- Addresses PR reviewer feedback that the fix was incomplete

Co-Authored-By: Micah Stairs <micah@sideguide.dev>

* chore(python-sdk): bump version to 2.16.3

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Micah Stairs <micah@sideguide.dev>
Co-authored-by: Micah Stairs <micah.stairs@gmail.com>
2025-07-23 09:18:13 -03:00
Rafael Miller
e6f0b1ec16
fixes actions dict attributeError (#1824) 2025-07-22 16:19:45 -03:00
Rafael Miller
a818946eae
sdk-fix: ensure async error handling in AsyncFirecrawlApp methods, update version to 2.16.1 (#1802) 2025-07-15 15:20:16 -03:00
Rafael Miller
ad967d4dbb
[sdk] fixes missing headers param in scrape_url (#1795)
* [sdk] fixes missing headers param in scrape_url

* Update __init__.py

---------

Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-07-14 17:59:10 -03:00
Nicolas
87baa23197 Nick: reverting ssl changes to py sdk 2025-07-04 17:57:09 -03:00
Rafael Miller
179737104d
bugfix: zero_data_retention param and certifi dependency (#1749) 2025-07-02 14:04:21 -03:00