3554 Commits

Author SHA1 Message Date
Micah Stairs
9a5d40c3cf
Allow international URLs to pass validation (#1717) 2025-06-26 13:16:42 -04:00
devin-ai-integration[bot]
1919799bed
feat(python-sdk): add parsePDF parameter support (#1713)
* feat(python-sdk): add parsePDF parameter support

- Add parsePDF field to ScrapeOptions class for Search API usage
- Add parse_pdf parameter to both sync and async scrape_url methods
- Add parameter handling logic to pass parsePDF to API requests
- Add comprehensive tests for parsePDF functionality
- Maintain backward compatibility with existing API

The parsePDF parameter controls PDF processing behavior:
- When true (default): PDF content extracted and converted to markdown
- When false: PDF returned in base64 encoding with flat credit rate

Resolves missing parsePDF support in Python SDK v2.9.0
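
A minimal sketch of the parameter handling described above, assuming the snake_case `parse_pdf` argument is forwarded to the API as a camelCase `parsePDF` field. The helper name `build_scrape_payload` and the payload shape are hypothetical and are not taken from the SDK source.

```python
from typing import Any, Dict


def build_scrape_payload(url: str, parse_pdf: bool = True) -> Dict[str, Any]:
    """Hypothetical helper: map the SDK argument to the API request field."""
    # The commit message states that parsePDF defaults to true.
    return {"url": url, "parsePDF": parse_pdf}


# Default: PDF content is extracted and converted to markdown.
default_payload = build_scrape_payload("https://example.com/report.pdf")

# Opt out: the PDF is returned base64-encoded at a flat credit rate.
raw_payload = build_scrape_payload("https://example.com/report.pdf", parse_pdf=False)
```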

Co-Authored-By: Micah Stairs <micah@sideguide.dev>

* Update __init__.py

* Update test.py

* Update __init__.py

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Micah Stairs <micah@sideguide.dev>
Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-06-26 16:34:43 +00:00
devin-ai-integration[bot]
89e57ace3c
Add temporary exception for Faire team ID to bypass job expiration (#1716)
* Add temporary exception for Faire team ID to bypass job expiration

- Add TEMP_FAIRE_TEAM_ID constant for team f96ad1a4-8102-4b35-9904-36fd517d3616
- Modify job expiration logic to skip 24-hour timeout for this team
- Add tests to verify Faire team bypasses expiration and others don't
- Temporary solution to allow Faire team access to expired crawl jobs
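
The expiration bypass above can be sketched as follows. This is an illustration only: the function name `is_job_expired` and its signature are assumptions, and the actual change lives in the TypeScript API rather than in Python. Only the constant name, the team ID, and the 24-hour timeout come from the commit message.

```python
from datetime import datetime, timedelta, timezone

# Constant name and value as given in the commit message.
TEMP_FAIRE_TEAM_ID = "f96ad1a4-8102-4b35-9904-36fd517d3616"

# The 24-hour timeout mentioned in the commit message.
JOB_EXPIRY = timedelta(hours=24)


def is_job_expired(team_id: str, created_at: datetime, now: datetime) -> bool:
    """Hypothetical check: exempt one team, apply the timeout to everyone else."""
    if team_id == TEMP_FAIRE_TEAM_ID:
        return False  # temporary exception: this team's jobs never expire
    return now - created_at > JOB_EXPIRY


created = datetime(2025, 6, 24, 12, 0, tzinfo=timezone.utc)
later = created + timedelta(hours=30)

faire_expired = is_job_expired(TEMP_FAIRE_TEAM_ID, created, later)
other_expired = is_job_expired("some-other-team", created, later)
```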

Co-Authored-By: Micah Stairs <micah@sideguide.dev>

* Update apps/api/src/__tests__/snips/crawl.test.ts

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Micah Stairs <micah@sideguide.dev>
Co-authored-by: Gergő Móricz <mo.geryy@gmail.com>
2025-06-26 13:42:34 +00:00
Gergő Móricz
f4714f4849
fix(js-sdk/extract): use same zod fallback logic (#1711) 2025-06-25 17:59:58 +00:00
Gergő Móricz
3d04c2087e
fix(api): cached acuc didn't have the is_extract flag set (#1712)
cosmetic issue only (error message), no behavioural change
2025-06-25 16:43:53 +02:00
Gergő Móricz
bc9065810d
fix(concurrency-limit): overlogging (#1709) 2025-06-24 17:01:32 +00:00
Gergő Móricz
cc3afa2578
fix(concurrency-limit): scan instead of taking jobs (#1708) 2025-06-24 13:32:22 -03:00
Gergő Móricz
ae94edd43e
feat(api/ci): idmux (#1707)
* feat(api/ci): idmux

* fix: bad merge

* no more default identity

* fix change tracking test

* fix httpstatus going down lol

* fix change tracking tests

* bump timeout

* fix ct self-hosted

* further fixes

* one more httpstatus bug

* bs

* it's being weird, blockAds testing
2025-06-24 15:36:05 +02:00
Gergő Móricz
86603de664
fix(api): instantiate Storage only once (#1706) 2025-06-24 00:18:07 +02:00
Gergő Móricz
11f469488e
fix(api/batch/scrape): maxConcurrency field support when using ignoreInvalidURLs (#1705)
* fix(api/batch/scrape): maxConcurrency field support when using ignoreInvalidURLs

* fix(tests): timeouts
2025-06-23 21:44:55 +02:00
Gergő Móricz
e7a62dd490
fix(api): pdf bug + testing bugs (#1704) 2025-06-23 19:57:27 +02:00
Gergő Móricz
fe9057559b
fix(v1): check credits variable scope collision (#1703)
This is what’s been causing the weird insufficient credits errors.
2025-06-23 19:19:01 +02:00
Gergő Móricz
e3948ae5b1
feat(api): pdf action + housekeeping (#1702)
* feat(api): pdf action + housekeeping

* fix TS build
2025-06-23 19:03:35 +02:00
Ademílson Tonato
78a3579d6e
feat: add relevanceai as part of the integrations 2025-06-23 16:41:19 +01:00
Gergő Móricz
439619ffc6
fix(api/v1/crawl/ongoing): only crawls, no batch scrape (#1701) 2025-06-23 16:33:02 +02:00
Gergő Móricz
1fdf95913d
feat(api): optimize job count query and improve error handling (#1700) 2025-06-23 16:18:55 +02:00
Gergő Móricz
c31172493e
fix(api): handle errors better in redis-less crawl status (#1699) 2025-06-23 15:56:20 +02:00
Gergő Móricz
66cde50a2a
fix(api): enhance error handler with optional ACUC data (#1698)
Update error handler to use RequestWithMaybeACUC type, allowing
access to optional ACUC properties on the request object. Include
team_id from ACUC in error logging to improve context for debugging.
2025-06-23 15:42:38 +02:00
Gergő Móricz
e06ec2d047
fix(api): improve error logging with structured error object (#1697) 2025-06-23 15:26:15 +02:00
Gergő Móricz
7ed19c0ac0 feat(scrapeURL): separate URL rewrites to different function 2025-06-21 02:02:29 +02:00
Gergő Móricz
9174e0c8a0
fix(api): CI (#1692)
* add scrapeTimeout parameter

* fix(api/ci): allow webhook server some time to settle

* fix(api/ci): extract time extension

* fix(api/ci): switch index location tests to a more reliable proxy

* check crawl errors + extend index cooldown

* fix lib
2025-06-20 22:57:23 +02:00
Meet Soni
2082243cb5
feat(scrape): support Google Slides (#1693)
* feat(scrape): support Google Slides

* feat(scrape): add test for scraping Google Slides links
2025-06-20 21:58:45 +02:00
Gergő Móricz
4b03ffca36
fix(search): respect parsePDF in pricing (#1690) 2025-06-20 21:15:14 +02:00
Gergő Móricz
125e1ada45
feat(scrapeURL): support cookies in safeFetch (#1688) 2025-06-20 20:43:04 +02:00
Gergő Móricz
3f0b8b8e27
Remove old cache mechanisms (redis cache, PDF cache, crawl maps, etc.) (FIR-2266) (#1667)
* feat(api): remove old indexes pt. 1

* feat(map): better subdomain support

* more culling

* adjust map maxage

* feat(api/tests): add tests for pdf caching

* fix(scrapeURL/index): pdf caching

* restore pdf cache

* fix __experimental_cache

* sitemap fetching

* remove extra var
2025-06-20 19:40:28 +02:00
Nicolas
363afb8048 Nick: updated openapi specs 2025-06-20 14:30:37 -03:00
Nicolas
80f7177473 Nick: bump version v1.12.0 2025-06-20 12:05:15 -03:00
devin-ai-integration[bot]
09aabbedb5
feat: add followInternalLinks parameter as semantic replacement for allowBackwardLinks (#1684)
* feat: add followInternalLinks parameter as semantic replacement for allowBackwardLinks

- Add followInternalLinks parameter to crawl API with same functionality as allowBackwardLinks
- Update transformation logic to use followInternalLinks with precedence over allowBackwardLinks
- Add parameter to Python SDK crawl methods with proper precedence handling
- Add parameter to Node.js SDK CrawlParams interface
- Add comprehensive tests for new parameter and backward compatibility
- Maintain full backward compatibility for existing allowBackwardLinks usage
- Add deprecation notices in documentation while preserving functionality

Co-Authored-By: Nick <nicolascamara29@gmail.com>

* fix: revert accidental cache=True changes to preserve original cache parameter handling

- Revert cache=True back to cache=cache in generate_llms_text methods
- Preserve original parameter passing behavior for cache parameter
- Fix accidental hardcoding of cache parameter to True

Co-Authored-By: Nick <nicolascamara29@gmail.com>

* refactor: rename followInternalLinks to crawlEntireDomain across API, SDKs, and tests

- Rename followInternalLinks parameter to crawlEntireDomain in API schema
- Update Node.js SDK CrawlParams interface to use crawlEntireDomain
- Update Python SDK methods to use crawl_entire_domain parameter
- Update test cases to use new crawlEntireDomain parameter name
- Maintain backward compatibility with allowBackwardLinks
- Update transformation logic to use crawlEntireDomain with precedence
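
One way to read the precedence rule above: the new parameter wins whenever it is set, and the deprecated one is used only as a fallback. The sketch below is a hedged illustration; the function name `resolve_crawl_entire_domain` and the final default of `False` are assumptions, not the SDK's actual code.

```python
from typing import Optional


def resolve_crawl_entire_domain(
    crawl_entire_domain: Optional[bool] = None,
    allow_backward_links: Optional[bool] = None,
) -> bool:
    """Hypothetical resolver: prefer the new parameter, fall back to the old one."""
    if crawl_entire_domain is not None:
        return crawl_entire_domain  # new parameter takes precedence
    if allow_backward_links is not None:
        return allow_backward_links  # deprecated parameter still honored
    return False  # assumed default when neither is provided


# Both set: the new parameter decides.
both_set = resolve_crawl_entire_domain(crawl_entire_domain=False, allow_backward_links=True)

# Only the deprecated parameter set: existing callers keep working.
legacy_only = resolve_crawl_entire_domain(allow_backward_links=True)
```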

Co-Authored-By: Nick <nicolascamara29@gmail.com>

* fix: add missing cache parameter to generate_llms_text and update documentation references

Co-Authored-By: Nick <nicolascamara29@gmail.com>

* Update apps/python-sdk/firecrawl/firecrawl.py

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Nick <nicolascamara29@gmail.com>
Co-authored-by: Gergő Móricz <mo.geryy@gmail.com>
2025-06-20 12:02:23 -03:00
Gergő Móricz
f939428264
feat(scrape): support Google Docs (FIR-1365) (#1686)
* feat(scrape): support Google Docs

* fixes
2025-06-20 11:42:41 +02:00
Gergő Móricz
f8983fffb7
Concurrency limit refactor + maxConcurrency parameter (FIR-2191) (#1643) 2025-06-20 10:45:36 +02:00
Gergő Móricz
a8e3c29664
feat(scrape, extract): creditsUsed, tokensUsed fields (FIR-2336) (#1683)
* fix(scrape): log FIRE-1 credits billed on failures properly

* fix dumb things

* feat(scrape, extract): creditsUsed fields

* fix(extract): call it tokensUsed

* Trigger Build

* dumb mistake, search does separate billing
2025-06-18 21:49:20 +02:00
Gergő Móricz
fbd81b4168
fix(scrape): log FIRE-1 credits billed on failures properly (FIR-2331) (#1682)
* fix(scrape): log FIRE-1 credits billed on failures properly

* fix dumb things
2025-06-18 21:47:58 +02:00
Gergő Móricz
ebc1de9d60
feat(crawl-status): refactor to work after a redis flush (#1664) 2025-06-18 18:58:04 +02:00
devin-ai-integration[bot]
cd2e0f868c
Add deployment type field to bug report template (#1681)
- Add 'Deployment Type' field to Environment section
- Allows users to specify Cloud (firecrawl.dev) vs Self-hosted
- Helps maintainers better triage issues based on deployment context
- Positioned logically after OS field in existing template structure

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Nick <nicolascamara29@gmail.com>
2025-06-18 12:26:15 -03:00
Thomas Kosmas
199115c7be stop testing new mu 2025-06-18 00:48:50 +03:00
Thomas Kosmas
f46f845efc fix: send the request to new mu version before the main one to achieve better sync 2025-06-17 20:37:45 +03:00
Thomas Kosmas
ee7b29b3f6
feat: Test mu v3 (#1678)
* Test mu v3

* fix env
2025-06-17 20:13:19 +03:00
Gergő Móricz
5ca8e2e98e
feat(index): store short titles and descriptions (#1677) 2025-06-17 19:09:07 +02:00
devin-ai-integration[bot]
9710bdffc0
Improve URL filtering error messages with specific denial reasons (FIR-2352) (#1676)
* Improve URL filtering error messages with specific denial reasons

- Add FilterResult and FilterLinksResult interfaces for structured error reporting
- Define DenialReason enum with specific, human-readable error messages
- Update filterURL method to return structured results with denial reasons
- Update filterLinks method to collect and return denial reasons for each URL
- Modify error handling in queue-worker.ts to use specific denial reasons
- Add comprehensive tests for different URL filtering scenarios
- Maintain backward compatibility while improving error specificity

Fixes: Misleading 'includePaths/excludePaths rules' error now shows actual denial reason (robots.txt, exclude patterns, depth limits, etc.)
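
The structured results described above could look roughly like this. It is a Python analog for illustration, not the TypeScript implementation: the enum members, their messages, the field names other than `links`, and the depth and pattern checks are all assumptions. A robots.txt check is omitted to keep the sketch self-contained.

```python
import re
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional
from urllib.parse import urlparse


class DenialReason(Enum):
    # Illustrative members and messages; the real enum may differ.
    ROBOTS_TXT = "URL blocked by robots.txt"
    EXCLUDE_PATTERN = "URL matches an exclude pattern"
    DEPTH_LIMIT = "URL exceeds the maximum crawl depth"


@dataclass
class FilterResult:
    allowed: bool
    denial_reason: Optional[DenialReason] = None


@dataclass
class FilterLinksResult:
    links: List[str] = field(default_factory=list)
    denial_reasons: Dict[str, DenialReason] = field(default_factory=dict)


def filter_url(url: str, exclude_patterns: List[str], max_depth: int) -> FilterResult:
    """Return a structured verdict instead of a bare boolean."""
    path = urlparse(url).path
    depth = len([segment for segment in path.split("/") if segment])
    if depth > max_depth:
        return FilterResult(False, DenialReason.DEPTH_LIMIT)
    if any(re.search(pattern, path) for pattern in exclude_patterns):
        return FilterResult(False, DenialReason.EXCLUDE_PATTERN)
    return FilterResult(True)


def filter_links(urls: List[str], exclude_patterns: List[str], max_depth: int) -> FilterLinksResult:
    """Collect allowed links and record why each rejected URL was denied."""
    result = FilterLinksResult()
    for url in urls:
        outcome = filter_url(url, exclude_patterns, max_depth)
        if outcome.allowed:
            result.links.append(url)
        elif outcome.denial_reason is not None:
            result.denial_reasons[url] = outcome.denial_reason
    return result


filtered = filter_links(
    [
        "https://example.com/blog/post",
        "https://example.com/admin/users",
        "https://example.com/a/b/c/d",
    ],
    exclude_patterns=[r"^/admin"],
    max_depth=3,
)
```

Callers read `filtered.links` for the surviving URLs and can surface `filtered.denial_reasons[url].value` in an error message when a specific URL was rejected.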

Co-Authored-By: mogery@sideguide.dev <mogery@sideguide.dev>

* Fix test compilation error for FilterLinksResult interface

- Update crawler.test.ts to use filteredLinks.links.length instead of filteredLinks.length
- Update test expectations to use filteredLinks.links array
- Resolves TypeScript compilation error preventing CI from passing

Co-Authored-By: mogery@sideguide.dev <mogery@sideguide.dev>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: mogery@sideguide.dev <mogery@sideguide.dev>
2025-06-17 19:00:29 +02:00
Nicolas
c6482eaf2d Nick: prevent additional logging on /extract scrapes 2025-06-13 18:17:17 -03:00
Gergő Móricz
ea321b4936 fix search test timeouts 2025-06-13 17:42:55 +02:00
Thomas Kosmas
38c5795282
feat(vertex): fix vertex ai provider bug and update model references to use "gemini-2.5-pro" (#1668) 2025-06-13 18:29:03 +03:00
Gergő Móricz
0bf23071ff
feat(index): add domain splitting for improved map querying (#1666) v1.11.0 2025-06-13 15:22:45 +02:00
Gergő Móricz
07224b8cd4
feat: use index in search and extract (#1660) 2025-06-13 12:30:28 +02:00
Gergő Móricz
f296342731
feat(index): remove unused columns (#1662) 2025-06-12 16:51:40 +02:00
Gergő Móricz
89e42b1137
fix(api): remove query parameter sanitization that was breaking extracts (#1661) 2025-06-12 15:37:45 +02:00
Gergő Móricz
3c03d07051
feat: add credits_billed everywhere (FIR-2286) (#1655)
* feat: add credits_billed everywhere

also a bit of logging improvement for logJob

* fix(queue-worker): db auth check before doing rpc for crawl/batch_scrape
2025-06-11 23:06:55 +02:00
Nicolas
bf3b2a359a
Improve concurrency limit email notifications (#1658)
* Update email_notification.ts

* Update email_notification.ts

* Update email_notification.ts
2025-06-11 17:14:54 -03:00
Pulkit Saini
255be2a2ff
Fix PLAYWRIGHT_MICROSERVICE_URL env var to use /scrape endpoint (#1654)
The environment variable should be set to PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/scrape, not PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html.
2025-06-11 16:53:32 +02:00
Gergő Móricz
19dd086eb3 improve auto recharge logging 2025-06-11 16:26:06 +02:00