3580 Commits

Author SHA1 Message Date
Ademílson Tonato
30b7e17327
feat(firecrawl): add integration parameter support and enhance kwargs handling 2025-07-02 14:15:39 +01:00
Nicolas
dcbed186a9
Add local environment configuration to docker-compose services (#1742)
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
2025-07-02 00:27:01 +02:00
Gergő Móricz
e3dc2e87db chore: bump js sdk 2025-07-01 20:29:00 +02:00
Gergő Móricz
4506b21185
feat(api): zero data retention (ENG-2376) (#1687)
* add ZDR flag and v0 lockout

* zdr walls on v1 and propagation

* zdr within scrapeurl

* fixes

* more fixes

* zdr flag on queue-worker logging

* final stretch + testing, needs f-e changes

* fixes

* self-serve ZDR through request body

* request-level zdr

* improved zdrcleaner

* generalize schema to allow for different data retention times in the future

* update Go version on CI

* feat(api/tests/zdr): test that nothing is logged

* fix(api/tests/zdr): correct log name

* fix(ci): envs

* fix(zdrcleaner): lower bound on db query

* zdr test with idmux

* WIP Assignments

* fix bad merge
remove unused identity

* fix stupid jest globals thing

* feat(scrapeURL/zdr): blacklist pdf action

* fix(concurrency-limit): zdr logging enforcement

* temp: remove extra billing for zdr

* SDK support

* final zdr business logic
fix rename

* fix test log filtering

* fix log filtering... again

* fix(tests/zdr): more logging exceptions

---------

Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-07-01 20:07:26 +02:00
Gergő Móricz
1f1f733011
fix(map): pass timeout to sitemap fetch (#1741) 2025-07-01 14:26:43 -03:00
Nicolas
ec298f58b6 Update search.ts 2025-07-01 13:43:00 -03:00
Nicolas
3e09f9fb8a
Nick: init (#1740) 2025-07-01 11:39:45 -03:00
Gergő Móricz
6c2f432d49
feat(crawl-status): better creditsUsed field (#1738) 2025-07-01 16:34:30 +02:00
Gergő Móricz
ebf98e3c16
feat(queue-worker): decrease job lock duration to pick up jobs on dead workers faster (#1737) 2025-07-01 11:27:38 -03:00
Gergő Móricz
400d497fca
feat(scrapeURL): ask user to increase timeout if there's a DOM.getDocument or queryAXTree error (#1739)
* feat(scrapeURL): ask user to increase timeout if there's a DOM.getDocument or queryAXTree error

* fix: move result tracking to meta
2025-07-01 16:25:49 +02:00
devin-ai-integration[bot]
1816cfc4c8
feat: implement IDN support with Punycode encoding (#1735)
- Update URL validation regex to accept xn-- prefixed domains
- Add normalizeHostnameForComparison utility for consistent IDN handling
- Update domain comparison functions to use Punycode normalization
- Expand test coverage for various IDN scripts (Chinese, Arabic, Russian)
- Ensure backward compatibility with existing URL processing

Fixes ENG-2510

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: mogery@sideguide.dev <mogery@sideguide.dev>
2025-07-01 15:39:34 +02:00
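The Punycode normalization this commit describes can be sketched roughly as follows. This is an illustrative Python version, not the repository's TypeScript code: `normalize_hostname_for_comparison` and `same_domain` are hypothetical stand-ins for the utility named in the message, and the sketch assumes Python's built-in `idna` codec (IDNA 2003) is an acceptable approximation of the encoding used.

```python
def normalize_hostname_for_comparison(hostname: str) -> str:
    """Lowercase a hostname and convert Unicode labels to Punycode (xn--)."""
    host = hostname.strip().rstrip(".").lower()
    try:
        # The built-in "idna" codec applies ToASCII to each label.
        return host.encode("idna").decode("ascii")
    except UnicodeError:
        # Hostnames that cannot be encoded are compared as-is.
        return host


def same_domain(a: str, b: str) -> bool:
    """Compare two hostnames after Punycode normalization."""
    return normalize_hostname_for_comparison(a) == normalize_hostname_for_comparison(b)
```

With this approach a Unicode hostname and its `xn--` form compare as equal, which is the property the domain comparison functions need.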
Gergő Móricz
8a282e3fc8
fix(auto_charge): bad hourly counter logic (#1736) 2025-07-01 10:37:37 -03:00
Nicolas
b4eedce3e0
(feat/ledger) Ledger events (#1728)
* Nick: ledger init

* Update email_notification.ts

* Update tracking.ts

* Nick: removed unused events

* Update email_notification.ts

* Apply suggestions from code review

* Update tracking.ts

* Update tracking.ts

* Update email_notification.ts

* Nick: conc limit ledger

---------

Co-authored-by: Gergő Móricz <mo.geryy@gmail.com>
2025-06-30 12:48:06 -03:00
Gergő Móricz
13f012c583
add pdf prefetch log for debugging (ENG-2542) (#1734)
* feat

* feat: pdf prefetch anti-loop error
2025-06-30 17:37:31 +02:00
Gergő Móricz
9162952744
proxy used improvement (#1727) 2025-06-30 17:37:19 +02:00
Gergő Móricz
9b95a17c0d
fix json format on search (#1729) 2025-06-30 12:17:52 -03:00
Nicolas
17ff8be67b
Nick; (#1726) v1.13.0 2025-06-27 12:02:15 -03:00
Gergő Móricz
57b8e66bc8
feat(api/worker): liveness check in queueing -- don't take jobs when the worker is dying (#1725) 2025-06-27 11:51:40 -03:00
Nicolas
c4adc687ea Update index.ts 2025-06-27 11:42:26 -03:00
Nicolas
caec228f60 Nick: version bump 2025-06-27 11:33:36 -03:00
devin-ai-integration[bot]
fa5b96c521
Add parsePDF parameter to JS SDK (#1720)
* Add parsePDF parameter to JS SDK (clean implementation)

- Add parsePDF boolean parameter to CrawlScrapeOptions interface
- Parameter automatically flows through scrape and crawl operations via spread operator
- Add comprehensive test cases for parsePDF functionality in both scrape and crawl scenarios
- Tests verify parsePDF=true and parsePDF=false behavior with PDF files

Co-Authored-By: Micah Stairs <micah@sideguide.dev>

* Fix parsePDF tests to match actual API behavior

- Update parsePDF=false test to expect base64 data instead of markdown
- Tests now properly verify the difference between parsePDF=true and parsePDF=false
- Address GitHub comment about 'hallucinated' tests by fixing unrealistic expectations

Co-Authored-By: Micah Stairs <micah@sideguide.dev>

* Update index.test.ts

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Micah Stairs <micah@sideguide.dev>
Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-06-27 11:30:49 -03:00
devin-ai-integration[bot]
070d1c1d98
Fix unreachable allowSubdomains code in crawler filterURL method (#1719)
- Move subdomain check logic before external link denial to make it reachable
- Add comprehensive tests for allowSubdomains functionality
- Ensure subdomain URLs are properly allowed/filtered based on configuration

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Micah Stairs <micah@sideguide.dev>
2025-06-27 11:28:58 -03:00
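The ordering fix described in this commit can be illustrated with a small sketch. It is a simplified Python model under assumed names (`filter_url`, `allow_subdomains`, `allow_external`), not the crawler's actual `filterURL` method: the point is only that the subdomain check has to run before the blanket denial of external links, otherwise it can never be reached.

```python
from urllib.parse import urlparse


def filter_url(url: str, base_url: str,
               allow_subdomains: bool = False,
               allow_external: bool = False) -> bool:
    """Return True if the crawler should keep this URL (hypothetical model)."""
    host = (urlparse(url).hostname or "").lower()
    base = (urlparse(base_url).hostname or "").lower()

    if host == base:
        return True

    # Checked before the external-link denial so the branch is reachable.
    if allow_subdomains and host.endswith("." + base):
        return True

    # Everything else is an external link.
    return allow_external
```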
Gergő Móricz
2b87ea6599
feat: improve DNS resolution error message (#1724) 2025-06-27 11:27:25 -03:00
Gergő Móricz
55d5c1f41d
feat(scrapeURL/skipTlsVerification): improve error message (#1723) 2025-06-27 11:27:04 -03:00
Gergő Móricz
9ed26e1e07
feat(sdk/python): add pdf action (ENG-2515) (#1722)
* feat(sdk/python): add pdf action
result

* bump

---------

Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-06-27 11:26:27 -03:00
Nicolas
d8796e4536
feat: Screenshot quality (#1721)
* Nick: init

* Update index.ts

* Nick: sdks support
2025-06-27 11:07:14 -03:00
Micah Stairs
9a5d40c3cf
Allow international URLs to pass validation (#1717) 2025-06-26 13:16:42 -04:00
devin-ai-integration[bot]
1919799bed
feat(python-sdk): add parsePDF parameter support (#1713)
* feat(python-sdk): add parsePDF parameter support

- Add parsePDF field to ScrapeOptions class for Search API usage
- Add parse_pdf parameter to both sync and async scrape_url methods
- Add parameter handling logic to pass parsePDF to API requests
- Add comprehensive tests for parsePDF functionality
- Maintain backward compatibility with existing API

The parsePDF parameter controls PDF processing behavior:
- When true (default): PDF content extracted and converted to markdown
- When false: PDF returned in base64 encoding with flat credit rate

Resolves missing parsePDF support in Python SDK v2.9.0

Co-Authored-By: Micah Stairs <micah@sideguide.dev>

* Update __init__.py

* Update test.py

* Update __init__.py

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Micah Stairs <micah@sideguide.dev>
Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-06-26 16:34:43 +00:00
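The parameter handling this commit describes (a snake_case `parse_pdf` argument passed to the API as `parsePDF`) might look roughly like the sketch below. `build_scrape_payload` is a hypothetical helper written for illustration and is not part of the SDK. It assumes that omitting the field lets the API apply its documented default of `true`.

```python
from typing import Any, Dict, Optional


def build_scrape_payload(url: str, parse_pdf: Optional[bool] = None) -> Dict[str, Any]:
    """Build a request body, mapping parse_pdf to the API's parsePDF field."""
    payload: Dict[str, Any] = {"url": url}
    # Only send the field when the caller set it, so the API default applies otherwise.
    if parse_pdf is not None:
        payload["parsePDF"] = parse_pdf
    return payload
```

Leaving the field out when `parse_pdf` is `None` keeps existing callers' requests unchanged, which matches the backward-compatibility goal stated in the message.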
devin-ai-integration[bot]
89e57ace3c
Add temporary exception for Faire team ID to bypass job expiration (#1716)
* Add temporary exception for Faire team ID to bypass job expiration

- Add TEMP_FAIRE_TEAM_ID constant for team f96ad1a4-8102-4b35-9904-36fd517d3616
- Modify job expiration logic to skip 24-hour timeout for this team
- Add tests to verify Faire team bypasses expiration and others don't
- Temporary solution to allow Faire team access to expired crawl jobs

Co-Authored-By: Micah Stairs <micah@sideguide.dev>

* Update apps/api/src/__tests__/snips/crawl.test.ts

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Micah Stairs <micah@sideguide.dev>
Co-authored-by: Gergő Móricz <mo.geryy@gmail.com>
2025-06-26 13:42:34 +00:00
Gergő Móricz
f4714f4849
fix(js-sdk/extract): use same zod fallback logic (#1711) 2025-06-25 17:59:58 +00:00
Gergő Móricz
3d04c2087e
fix(api): cached acuc didn't have the is_extract flag set (#1712)
cosmetic issue only (error message), no behavioural change
2025-06-25 16:43:53 +02:00
Gergő Móricz
bc9065810d
fix(concurrency-limit): overlogging (#1709) 2025-06-24 17:01:32 +00:00
Gergő Móricz
cc3afa2578
fix(concurrency-limit): scan instead of taking jobs (#1708) 2025-06-24 13:32:22 -03:00
Gergő Móricz
ae94edd43e
feat(api/ci): idmux (#1707)
* feat(api/ci): idmux

* fix: bad merge

* no more default identity

* fix change tracking test

* fix httpstatus going down lol

* fix change tracking tests

* bump timeout

* fix ct self-hosted

* further fixes

* one more httpstatus bug

* bs

* it's being weird, blockAds testing
2025-06-24 15:36:05 +02:00
Gergő Móricz
86603de664
fix(api): instantiate Storage only once (#1706) 2025-06-24 00:18:07 +02:00
Gergő Móricz
11f469488e
fix(api/batch/scrape): maxConcurrency field support when using ignoreInvalidURLs (#1705)
* fix(api/batch/scrape): maxConcurrency field support when using ignoreInvalidURLs

* fix(tests): timeouts
2025-06-23 21:44:55 +02:00
Gergő Móricz
e7a62dd490
fix(api): pdf bug + testing bugs (#1704) 2025-06-23 19:57:27 +02:00
Gergő Móricz
fe9057559b
fix(v1): check credits variable scope collision (#1703)
This is what’s been causing the weird insufficient credits errors.
2025-06-23 19:19:01 +02:00
Gergő Móricz
e3948ae5b1
feat(api): pdf action + housekeeping (#1702)
* feat(api): pdf action + housekeeping

* fix TS build
2025-06-23 19:03:35 +02:00
Ademílson Tonato
78a3579d6e
feat: add relevanceai as part of the integrations 2025-06-23 16:41:19 +01:00
Gergő Móricz
439619ffc6
fix(api/v1/crawl/ongoing): only crawls, no batch scrape (#1701) 2025-06-23 16:33:02 +02:00
Gergő Móricz
1fdf95913d
feat(api): optimize job count query and improve error handling (#1700) 2025-06-23 16:18:55 +02:00
Gergő Móricz
c31172493e
fix(api): handle errors better in redis-less crawl status (#1699) 2025-06-23 15:56:20 +02:00
Gergő Móricz
66cde50a2a
fix(api): enhance error handler with optional ACUC data (#1698)
Update error handler to use RequestWithMaybeACUC type, allowing
access to optional ACUC properties on the request object. Include
team_id from ACUC in error logging to improve context for debugging.
2025-06-23 15:42:38 +02:00
Gergő Móricz
e06ec2d047
fix(api): improve error logging with structured error object (#1697) 2025-06-23 15:26:15 +02:00
Gergő Móricz
7ed19c0ac0 feat(scrapeURL): separate URL rewrites to different function 2025-06-21 02:02:29 +02:00
Gergő Móricz
9174e0c8a0
fix(api): CI (#1692)
* add scrapeTimeout parameter

* fix(api/ci): allow webhook server some time to settle

* fix(api/ci): extract time extension

* fix(api/ci): switch index location tests to a more reliable proxy

* check crawl errors + extend index cooldown

* fix lib
2025-06-20 22:57:23 +02:00
Meet Soni
2082243cb5
feat(scrape): support Google Slides (#1693)
* feat(scrape): support Google Slides

* feat(scrape): add test for scraping Google Slides links
2025-06-20 21:58:45 +02:00
Gergő Móricz
4b03ffca36
fix(search): respect parsePDF in pricing (#1690) 2025-06-20 21:15:14 +02:00
Gergő Móricz
125e1ada45
feat(scrapeURL): support cookies in safeFetch (#1688) 2025-06-20 20:43:04 +02:00