2263 Commits

Author SHA1 Message Date
Gergő Móricz
5e760aacbb fixes 2025-06-20 11:28:22 +02:00
Gergő Móricz
3cf22f9167 feat(scrape): support Google Docs 2025-06-20 11:12:05 +02:00
Gergő Móricz
f8983fffb7
Concurrency limit refactor + maxConcurrency parameter (FIR-2191) (#1643) 2025-06-20 10:45:36 +02:00
Gergő Móricz
a8e3c29664
feat(scrape, extract): creditsUsed, tokensUsed fields (FIR-2336) (#1683)
* fix(scrape): log FIRE-1 credits billed on failures properly

* fix dumb things

* feat(scrape, extract): creditsUsed fields

* fix(extract): call it tokensUsed

* Trigger Build

* dumb mistake, search does separate billing
2025-06-18 21:49:20 +02:00
Gergő Móricz
fbd81b4168
fix(scrape): log FIRE-1 credits billed on failures properly (FIR-2331) (#1682)
* fix(scrape): log FIRE-1 credits billed on failures properly

* fix dumb things
2025-06-18 21:47:58 +02:00
Gergő Móricz
ebc1de9d60
feat(crawl-status): refactor to work after a redis flush (#1664) 2025-06-18 18:58:04 +02:00
Thomas Kosmas
199115c7be stop testing new mu 2025-06-18 00:48:50 +03:00
Thomas Kosmas
f46f845efc fix: send the request to new mu version before the main one to achieve better sync 2025-06-17 20:37:45 +03:00
Thomas Kosmas
ee7b29b3f6
feat: Test mu v3 (#1678)
* Test mu v3

* fix env
2025-06-17 20:13:19 +03:00
Gergő Móricz
5ca8e2e98e
feat(index): store short titles and descriptions (#1677) 2025-06-17 19:09:07 +02:00
devin-ai-integration[bot]
9710bdffc0
Improve URL filtering error messages with specific denial reasons (FIR-2352) (#1676)
* Improve URL filtering error messages with specific denial reasons

- Add FilterResult and FilterLinksResult interfaces for structured error reporting
- Define DenialReason enum with specific, human-readable error messages
- Update filterURL method to return structured results with denial reasons
- Update filterLinks method to collect and return denial reasons for each URL
- Modify error handling in queue-worker.ts to use specific denial reasons
- Add comprehensive tests for different URL filtering scenarios
- Maintain backward compatibility while improving error specificity

Fixes: Misleading 'includePaths/excludePaths rules' error now shows actual denial reason (robots.txt, exclude patterns, depth limits, etc.)

Co-Authored-By: mogery@sideguide.dev <mogery@sideguide.dev>

* Fix test compilation error for FilterLinksResult interface

- Update crawler.test.ts to use filteredLinks.links.length instead of filteredLinks.length
- Update test expectations to use filteredLinks.links array
- Resolves TypeScript compilation error preventing CI from passing

Co-Authored-By: mogery@sideguide.dev <mogery@sideguide.dev>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: mogery@sideguide.dev <mogery@sideguide.dev>
2025-06-17 19:00:29 +02:00
Nicolas
c6482eaf2d Nick: prevent additional logging on /extract scrapes 2025-06-13 18:17:17 -03:00
Gergő Móricz
ea321b4936 fix search test timeouts 2025-06-13 17:42:55 +02:00
Thomas Kosmas
38c5795282
feat(vertex): fix vertex ai provider bug and update model references to use "gemini-2.5-pro" (#1668) 2025-06-13 18:29:03 +03:00
Gergő Móricz
0bf23071ff
feat(index): add domain splitting for improved map querying (#1666) 2025-06-13 15:22:45 +02:00
Gergő Móricz
07224b8cd4
feat: use index in search and extract (#1660) 2025-06-13 12:30:28 +02:00
Gergő Móricz
f296342731
feat(index): remove unused columns (#1662) 2025-06-12 16:51:40 +02:00
Gergő Móricz
89e42b1137
fix(api): remove query parameter sanitization that was breaking extracts (#1661) 2025-06-12 15:37:45 +02:00
Gergő Móricz
3c03d07051
feat: add credits_billed everywhere (FIR-2286) (#1655)
* feat: add credits_billed everywhere

also a bit of logging improvement for logJob

* fix(queue-worker): db auth check before doing rpc for crawl/batch_scrape
2025-06-11 23:06:55 +02:00
Nicolas
bf3b2a359a
Improve concurrency limit email notifications (#1658)
* Update email_notification.ts

* Update email_notification.ts

* Update email_notification.ts
2025-06-11 17:14:54 -03:00
Pulkit Saini
255be2a2ff
Fix PLAYWRIGHT_MICROSERVICE_URL env var to use /scrape endpoint (#1654)
The correct environment variable should be PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/scrape instead of PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html
2025-06-11 16:53:32 +02:00
Gergő Móricz
19dd086eb3 improve auto recharge logging 2025-06-11 16:26:06 +02:00
Nicolas
9964d11c20 Update v1.ts 2025-06-09 17:13:56 -03:00
Gergő Móricz
4659155b76 remove logs 2025-06-06 00:25:28 +02:00
Gergő Móricz
3b6be76d3e debug(index): time insights 2025-06-06 00:03:34 +02:00
Gergő Móricz
8ef3e8484a
feat(gcs-jobs): ditch exists check to cut lookup time in half (#1641) 2025-06-05 23:43:30 +02:00
Gergő Móricz
6d1b9bf1fe debug(api/scrape): more logging 2025-06-05 22:52:41 +02:00
Gergő Móricz
0c7f864ea4 debug(api/scrape): increased logging to diagnose scrape fluke length 2025-06-05 22:51:25 +02:00
Gergő Móricz
1de0ae392c
Index testing improvements (FIR-2214) (#1637)
* feat(api/tests/scrape): index improvements

* fix(api/test/scrape): add waits to allow batch insert to happen

* fix: ...
2025-06-05 22:10:06 +02:00
Gergő Móricz
78580f65df
feat(webhook): refactor callWebhook and add logWebhook (FIR-2218) (#1629)
* feat(webhook): refactor callWebhook and add logWebhook

* feat(queue-worker): fix crawl pre-finishing logic (#1628)

* feat(ci): verify typescript errors

* fix(ci):

* feat(api/tests): add webhook tests + refactor batch scrape lib (#1630)

* feat(api/tests): add webhook tests + refactor batch scrape lib

* fix(ci):

* feat(webhook/log): insert queue
2025-06-05 22:04:22 +02:00
Gergő Móricz
f050b169e2
feat(api/index): port queryIndexAtSplitLevel to RPC (FIR-2241) (#1640)
* feat(api/index): port queryIndexAtSplitLevel to RPC

* Update apps/api/src/services/index.ts
2025-06-05 22:02:41 +02:00
Gergő Móricz
a08d52e45d feat(scrapeURL/index): don't put results by "dumb" engines into the index 2025-06-05 22:01:29 +02:00
Thomas Kosmas
af88218fad
feat: update mu (#1639)
* update to mu v2

* feat(ci): add RUNPOD_MUV2_POD_ID

* stupid change to make CI run

---------

Co-authored-by: Gergő Móricz <mo.geryy@gmail.com>
2025-06-05 22:27:00 +03:00
Nicolas
6ca551a887 Merge branch 'main' of https://github.com/mendableai/firecrawl 2025-06-05 15:39:50 -03:00
Nicolas
8c40271796 Update map.ts 2025-06-05 15:39:48 -03:00
Thomas Kosmas
4bf64d2c01
feat(scraper): runpod v2 parallel testing (#1636)
* feat(scraper): runpod v2 parallel testing

* fix catch
2025-06-05 20:28:01 +03:00
Ademílson Tonato
6e1f8d6c10
feat(api): propagate integration field in queue worker job processing 2025-06-05 16:39:20 +01:00
Ademílson Tonato
71caf8ae57
Merge pull request #1632 from mendableai/feat/api-integration-parameter
feat(api): add integration field to jobs and update related controllers and types
2025-06-05 11:44:02 +01:00
Gergő Móricz
34a18a2d2f
feat(changeTracking): support tags (FIR-1940) (#1631)
* feat(changeTracking): support tags

* test 408 fixes
2025-06-04 23:50:54 +02:00
Gergő Móricz
6e63528b61 fix(crawl-redis): bad logic 2025-06-04 20:46:04 +02:00
Ademílson Tonato
4c49bb9fc6
refactor: remove unnecessary logs and set integration as default to null 2025-06-04 19:29:47 +01:00
Ademílson Tonato
57a0aed484
feat(api): add integration field to jobs and update related controllers and types 2025-06-04 19:17:57 +01:00
Gergő Móricz
a05c4ae97d
feat(api): GET /crawl/ongoing (FIR-2189) (#1620)
* feat(api): GET /crawl/ongoing

* fix: routers in wrong order

* feat(api/crawl/ongoing): return more details

---------

Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-06-04 18:14:23 +02:00
Gergő Móricz
077c5dd8ec
feat(api/tests): add webhook tests + refactor batch scrape lib (#1630)
* feat(api/tests): add webhook tests + refactor batch scrape lib

* fix(ci):
2025-06-04 16:11:47 +02:00
Gergő Móricz
0f394a10c6
feat(queue-worker): fix crawl pre-finishing logic (#1628) 2025-06-04 15:19:48 +02:00
Gergő Móricz
8dd5bf7bd9
feat(api/tests/scrape): Playwright test improvements (#1626)
* feat(api/tests/scrape): verify that proxy works on Playwright

* debug: logs

* remove logs

* feat(playwright): add contentType relaying

* fix tests

* debug

* fix json
2025-06-04 01:24:19 +02:00
Gergő Móricz
95f204aab7
Index (FIR-2177) (#1605)
* poc progress

* poc

* url splits and better url normalization

* feat(index): integrate into map

* fix on selfhost

* feat: modifiers

* separate index supa logic

* debug

* fix language comparison

* feat: dontStoreInCache

* feat(index): some rudimentary testing

* feat: use url split columns

* feat(queue-worker/kickoff): use index links to kickoff crawl

* feat(scrapeURL/index): behaviour on non-200 index entries

* feat/added benchmark for scrapes

* feat(map): ignoreIndex

* feat(index): batch insert

* fix(api/tests/scrape): fix index test to work with batching

* disable cacheable lookup for self hosting tests

* feat(js-sdk): dontStoreInCache

* chore(js-sdk): bump

* feat(index): FIRECRAWL_INDEX_WRITE_ONLY

* feat(api/test): index envs

* map benchmarks

* cleanup

* further fixes

* clean up on map

* remove extraneous log

* workflow test run

* asd

* improve fns

* try again

* wow i'm an idiot

* ok fixed

* wth

* revert

* async saving to index

* feat: enhance metadata extraction by including 'itemprop' attribute in HTML (#1624)

* feat(selfhost): deploy a playwright image (#1625)

* Testing improvements (FIR-2209) (#1623)

* yeet ad blocking tests until further notice

* feat: re-enable billing tests

* more timeout

* cache issues with billing test

* weird thing

* fix(api/tests/scrape/status): propagation time

* stupid

* no log

* sws

---------

Co-authored-by: rafaelmmiller <150964962+rafaelsideguide@users.noreply.github.com>
Co-authored-by: Ademílson Tonato <ademilsonft@outlook.com>
2025-06-03 21:30:19 +02:00
Gergő Móricz
406d696667
Testing improvements (FIR-2209) (#1623)
* yeet ad blocking tests until further notice

* feat: re-enable billing tests

* more timeout

* cache issues with billing test

* weird thing

* fix(api/tests/scrape/status): propagation time

* stupid

* no log

* sws
2025-06-03 21:16:36 +02:00
Ademílson Tonato
41897139da
feat: enhance metadata extraction by including 'itemprop' attribute in HTML (#1624) 2025-06-03 18:16:46 +02:00
Nicolas
e108ff3525 Update search.ts 2025-06-02 23:46:55 -03:00