3485 Commits

Author SHA1 Message Date
Gergő Móricz
5730b626f6 feat(sdk): Index parameters + other missing parameters 2025-06-05 19:24:53 +02:00
Ademílson Tonato
b2e0f657bd
Merge pull request #1635 from mendableai/refactor/api-integration-parameter
feat(api): propagate integration field in queue worker job processing
2025-06-05 17:31:13 +01:00
Ademílson Tonato
6e1f8d6c10
feat(api): propagate integration field in queue worker job processing 2025-06-05 16:39:20 +01:00
Ademílson Tonato
71caf8ae57
Merge pull request #1632 from mendableai/feat/api-integration-parameter
feat(api): add integration field to jobs and update related controllers and types
2025-06-05 11:44:02 +01:00
Gergő Móricz
abb8919e2e feat(js-sdk): changeTrackingOptions.tag 2025-06-04 23:51:53 +02:00
Gergő Móricz
34a18a2d2f
feat(changeTracking): support tags (FIR-1940) (#1631)
* feat(changeTracking): support tags

* test 408 fixes
2025-06-04 23:50:54 +02:00
Gergő Móricz
6e63528b61 fix(crawl-redis): bad logic 2025-06-04 20:46:04 +02:00
Ademílson Tonato
4c49bb9fc6
refactor: remove unnecessary logs and set integration as default to null 2025-06-04 19:29:47 +01:00
Ademílson Tonato
57a0aed484
feat(api): add integration field to jobs and update related controllers and types 2025-06-04 19:17:57 +01:00
Gergő Móricz
a05c4ae97d
feat(api): GET /crawl/ongoing (FIR-2189) (#1620)
* feat(api): GET /crawl/ongoing

* fix: routers in wrong order

* feat(api/crawl/ongoing): return more details

---------

Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-06-04 18:14:23 +02:00
Gergő Móricz
077c5dd8ec
feat(api/tests): add webhook tests + refactor batch scrape lib (#1630)
* feat(api/tests): add webhook tests + refactor batch scrape lib

* fix(ci):
2025-06-04 16:11:47 +02:00
Gergő Móricz
122ccd5eb0 fix(ci): 2025-06-04 15:49:45 +02:00
Gergő Móricz
11c1178ca1 feat(ci): verify typescript errors 2025-06-04 15:20:53 +02:00
Gergő Móricz
0f394a10c6
feat(queue-worker): fix crawl pre-finishing logic (#1628) 2025-06-04 15:19:48 +02:00
Gergő Móricz
8dd5bf7bd9
feat(api/tests/scrape): Playwright test improvements (#1626)
* feat(api/tests/scrape): verify that proxy works on Playwright

* debug: logs

* remove logs

* feat(playwright): add contentType relaying

* fix tests

* debug

* fix json
2025-06-04 01:24:19 +02:00
Gergő Móricz
95f204aab7
Index (FIR-2177) (#1605)
* poc progress

* poc

* url splits and better url normalization

* feat(index): integrate into map

* fix on selfhost

* feat: modifiers

* separate index supa logic

* debug

* fix language comparison

* feat: dontStoreInCache

* feat(index): some rudimentary testing

* feat: use url split columns

* feat(queue-worker/kickoff): use index links to kickoff crawl

* feat(scrapeURL/index): behaviour on non-200 index entries

* feat/added benchmark for scrapes

* feat(map): ignoreIndex

* feat(index): batch insert

* fix(api/tests/scrape): fix index test to work with batching

* disable cacheable lookup for self hosting tests

* feat(js-sdk): dontStoreInCache

* chore(js-sdk): bump

* feat(index): FIRECRAWL_INDEX_WRITE_ONLY

* feat(api/test): index envs

* map benchmarks

* cleanup

* further fixes

* clean up on map

* remove extraneous log

* workflow test run

* asd

* improve fns

* try again

* wow i'm an idiot

* ok fixed

* wth

* revert

* async saving to index

* feat: enhance metadata extraction by including 'itemprop' attribute in HTML (#1624)

* feat(selfhost): deploy a playwright image (#1625)

* Testing improvements (FIR-2209) (#1623)

* yeet ad blocking tests until further notice

* feat: re-enable billing tests

* more timeout

* cache issues with billing test

* weird thing

* fix(api/tests/scrape/status): propagation time

* stupid

* no log

* sws

---------

Co-authored-by: rafaelmmiller <150964962+rafaelsideguide@users.noreply.github.com>
Co-authored-by: Ademílson Tonato <ademilsonft@outlook.com>
v1.10.0
2025-06-03 21:30:19 +02:00
Gergő Móricz
406d696667
Testing improvements (FIR-2209) (#1623)
* yeet ad blocking tests until further notice

* feat: re-enable billing tests

* more timeout

* cache issues with billing test

* weird thing

* fix(api/tests/scrape/status): propagation time

* stupid

* no log

* sws
2025-06-03 21:16:36 +02:00
Gergő Móricz
e297cf8a0d
feat(selfhost): deploy a playwright image (#1625) 2025-06-03 19:19:08 +02:00
Ademílson Tonato
41897139da
feat: enhance metadata extraction by including 'itemprop' attribute in HTML (#1624) 2025-06-03 18:16:46 +02:00
Nicolas
e108ff3525 Update search.ts 2025-06-02 23:46:55 -03:00
Nicolas
9347de6a41 Update scrape.ts 2025-06-02 23:15:59 -03:00
Nicolas
86a9d3525b Update queue-jobs.ts 2025-06-02 23:09:09 -03:00
Nicolas
cbc47305cc Update search.ts 2025-06-02 23:09:02 -03:00
Nicolas
ce425d966f Merge branch 'nsc/bypass-billing-internal' 2025-06-02 22:37:56 -03:00
Nicolas
8c661f5329 Update scrape.ts 2025-06-02 22:37:49 -03:00
Nicolas
dc8cc99b1d
Nick: bypass billing (#1622) 2025-06-02 21:57:28 -03:00
Nicolas
8967b31465 Nick: bypass billing 2025-06-02 21:51:46 -03:00
Nicolas
bf919ceb82 Nick: __searchPreviewToken 2025-06-02 21:16:34 -03:00
Nicolas
ef789ce8d7 Nick: __experimental 2025-06-02 19:58:56 -03:00
Gergő Móricz
72be73473f
feat(api/scrape): credits_billed column + handle billing for /scrape calls on worker side with stricter timeout enforcement (FIR-2162) (#1607)
* feat(api/scrape): stricten timeout and handle billing and logging on queue-worker

* fix: abortsignal pre-check

* fix: proper level

* add comment to clarify is_scrape

* reenable billing tests

* Revert "reenable billing tests"

This reverts commit 98236fdfa03dde8cecdd6b763fcf86810e468a28.

* oof

* fix searxng logging

---------

Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-06-02 17:56:27 -03:00
Gergő Móricz
4167ec53eb
fix(scrapeURL): only allow disabling the adblock on playwright (FIR-2200) (#1616)
* fix(scrapeURL): only allow disabling the adblock on playwright

* feat(api/tests/scrape): re-enable ad blocking tests
2025-06-02 22:48:16 +02:00
Gergő Móricz
7a8be13220 remove indexes that are no longer used 2025-06-02 22:09:55 +02:00
Gergő Móricz
98ceda9bd5
feat(search): ignore concurrency limit for search (FIR-2187) (#1617)
* feat(search): ignore concurrency limit for search (temp)

* feat(search): only for low tier users for good DX
2025-06-02 17:07:44 -03:00
Gergő Móricz
1396451d31 bump rust version pt.2 2025-06-02 18:10:14 +02:00
Gergő Móricz
07fb651a91 bump rust version 2025-06-02 18:09:12 +02:00
Supasin Liulak
6a76ccfacb
webhook param for crawl (#1609) 2025-06-02 18:08:32 +02:00
Nicolas
9297afd1ff Nick: search 2025-05-29 17:00:13 -03:00
Gergő Móricz
a8e0482718 feat(search): bill for PDFs properly 2025-05-29 20:59:15 +02:00
Gergő Móricz
a2f41fb650 feat(api/server): wait 60s for GCE load balancer drain timeout
To minimize 502s.
2025-05-29 20:08:52 +02:00
Gergő Móricz
3ea221b093 fix(api/queue): tighten expiries on indexQueue jobs 2025-05-29 16:36:55 +02:00
Gergő Móricz
c9dd0e609a fix(api/queue): tighten expiries on billingQueue jobs 2025-05-29 16:26:52 +02:00
Gergő Móricz
93655b5c0b
feat(scrapeURL/pdf): bill n credits per page (FIR-1934) (#1553)
* feat(scrapeURL/pdf): bill n credits per page

* Update scrape.ts

* Update queue-worker.ts

* separate billing logi

---------

Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-05-29 16:01:08 +02:00
Gergő Móricz
38c96b524f
feat(scrapeURL): handle contentType JSON better in markdown conversion (#1604) 2025-05-29 15:26:07 +02:00
Gergő Móricz
7e73b01599 fix(queue-worker): call webhook after job is in DB 2025-05-29 14:40:47 +02:00
Gergő Móricz
706d378a89 feat(api/v1/scrape-status): log supa lookup errors 2025-05-29 13:02:54 +02:00
Gergő Móricz
3557c90210
feat(js-sdk): auto mode proxy (FIR-2145) (#1602)
* feat(js-sdk): auto mode proxy

* Nick: py sdk

---------

Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-05-28 14:31:48 -03:00
Gergő Móricz
a5efff07f9
feat(apps/api): add support for a separate, non-eviction Redis (#1600)
* feat(apps/api): add support for a separate, non-eviction Redis

* fix: misimport
2025-05-28 09:58:04 +02:00
Nicolas
756b452a01 Update batch_billing.ts 2025-05-27 19:05:00 -03:00
Nicolas
299e3e29e0 Update batch_billing.ts 2025-05-27 18:44:24 -03:00
Gergő Móricz
a36c6a4f40
feat(scrapeURL): add unnormalizedSourceURL for url matching DX (FIR-2137) (#1601)
* feat(scrapeURL): add unnormalizedSourceURL for url matching DX

* fix(tests): fixc
2025-05-27 21:33:44 +02:00