116 Commits

Author SHA1 Message Date
Nicolas
c4adc687ea Update index.ts 2025-06-27 11:42:26 -03:00
Gergő Móricz
2b87ea6599
feat: improve DNS resolution error message (#1724) 2025-06-27 11:27:25 -03:00
Gergő Móricz
55d5c1f41d
feat(scrapeURL/skipTlsVerification): improve error message (#1723) 2025-06-27 11:27:04 -03:00
Nicolas
d8796e4536
feat: Screenshot quality (#1721)
* Nick: init

* Update index.ts

* Nick: sdks support
2025-06-27 11:07:14 -03:00
Gergő Móricz
e7a62dd490
fix(api): pdf bug + testing bugs (#1704) 2025-06-23 19:57:27 +02:00
Gergő Móricz
e3948ae5b1
feat(api): pdf action + housekeeping (#1702)
* feat(api): pdf action + housekeeping

* fix TS build
2025-06-23 19:03:35 +02:00
Gergő Móricz
125e1ada45
feat(scrapeURL): support cookies in safeFetch (#1688) 2025-06-20 20:43:04 +02:00
Gergő Móricz
3f0b8b8e27
Remove old cache mechanisms (redis cache, PDF cache, crawl maps, etc.) (FIR-2266) (#1667)
* feat(api): remove old indexes pt. 1

* feat(map): better subdomain support

* more culling

* adjust map maxage

* feat(api/tests): add tests for pdf caching

* fix(scrapeURL/index): pdf caching

* restore pdf cache

* fix __experimental_cache

* sitemap fetching

* remove extra var
2025-06-20 19:40:28 +02:00
Gergő Móricz
f939428264
feat(scrape): support Google Docs (FIR-1365) (#1686)
* feat(scrape): support Google Docs

* fixes
2025-06-20 11:42:41 +02:00
Thomas Kosmas
199115c7be stop testing new mu 2025-06-18 00:48:50 +03:00
Thomas Kosmas
f46f845efc fix: send the request to new mu version before the main one to achieve better sync 2025-06-17 20:37:45 +03:00
Thomas Kosmas
ee7b29b3f6
feat: Test mu v3 (#1678)
* Test mu v3

* fix env
2025-06-17 20:13:19 +03:00
Gergő Móricz
5ca8e2e98e
feat(index): store short titles and descriptions (#1677) 2025-06-17 19:09:07 +02:00
Gergő Móricz
0bf23071ff
feat(index): add domain splitting for improved map querying (#1666) 2025-06-13 15:22:45 +02:00
Gergő Móricz
f296342731
feat(index): remove unused columns (#1662) 2025-06-12 16:51:40 +02:00
Gergő Móricz
4659155b76 remove logs 2025-06-06 00:25:28 +02:00
Gergő Móricz
3b6be76d3e debug(index): time insights 2025-06-06 00:03:34 +02:00
Gergő Móricz
0c7f864ea4 debug(api/scrape): increased logging to diagnose scrape fluke length 2025-06-05 22:51:25 +02:00
Gergő Móricz
1de0ae392c
Index testing improvements (FIR-2214) (#1637)
* feat(api/tests/scrape): index improvements

* fix(api/test/scrape): add waits to allow batch insert to happen

* fix: ...
2025-06-05 22:10:06 +02:00
Gergő Móricz
a08d52e45d feat(scrapeURL/index): don't put results by "dumb" engines into the index 2025-06-05 22:01:29 +02:00
Thomas Kosmas
af88218fad
feat: update mu (#1639)
* update to mu v2

* feat(ci): add RUNPOD_MUV2_POD_ID

* stupid change to make CI run

---------

Co-authored-by: Gergő Móricz <mo.geryy@gmail.com>
2025-06-05 22:27:00 +03:00
Thomas Kosmas
4bf64d2c01
feat(scraper): runpod v2 parallel testing (#1636)
* feat(scraper): runpod v2 parallel testing

* fix catch
2025-06-05 20:28:01 +03:00
Gergő Móricz
8dd5bf7bd9
feat(api/tests/scrape): Playwright test improvements (#1626)
* feat(api/tests/scrape): verify that proxy works on Playwright

* debug: logs

* remove logs

* feat(playwright): add contentType relaying

* fix tests

* debug

* fix json
2025-06-04 01:24:19 +02:00
Gergő Móricz
95f204aab7
Index (FIR-2177) (#1605)
* poc progress

* poc

* url splits and better url normalization

* feat(index): integrate into map

* fix on selfhost

* feat: modifiers

* separate index supa logic

* debug

* fix language comparison

* feat: dontStoreInCache

* feat(index): some rudimentary testing

* feat: use url split columns

* feat(queue-worker/kickoff): use index links to kickoff crawl

* feat(scrapeURL/index): behaviour on non-200 index entries

* feat/added benchmark for scrapes

* feat(map): ignoreIndex

* feat(index): batch insert

* fix(api/tests/scrape): fix index test to work with batching

* disable cacheable lookup for self hosting tests

* feat(js-sdk): dontStoreInCache

* chore(js-sdk): bump

* feat(index): FIRECRAWL_INDEX_WRITE_ONLY

* feat(api/test): index envs

* map benchmarks

* cleanup

* further fixes

* clean up on map

* remove extraneous log

* workflow test run

* asd

* improve fns

* try again

* wow i'm an idiot

* ok fixed

* wth

* revert

* async saving to index

* feat: enhance metadata extraction by including 'itemprop' attribute in HTML (#1624)

* feat(selfhost): deploy a playwright image (#1625)

* Testing improvements (FIR-2209) (#1623)

* yeet ad blocking tests until further notice

* feat: re-enable billing tests

* more timeout

* cache issues with billing test

* weird thing

* fix(api/tests/scrape/status): propagation time

* stupid

* no log

* sws

---------

Co-authored-by: rafaelmmiller <150964962+rafaelsideguide@users.noreply.github.com>
Co-authored-by: Ademílson Tonato <ademilsonft@outlook.com>
2025-06-03 21:30:19 +02:00
Gergő Móricz
4167ec53eb
fix(scrapeURL): only allow disabling the adblock on playwright (FIR-2200) (#1616)
* fix(scrapeURL): only allow disabling the adblock on playwright

* feat(api/tests/scrape): re-enable ad blocking tests
2025-06-02 22:48:16 +02:00
Gergő Móricz
38c96b524f
feat(scrapeURL): handle contentType JSON better in markdown conversion (#1604) 2025-05-29 15:26:07 +02:00
Gergő Móricz
c3738063cf less logs even more 2025-05-25 15:50:20 +02:00
Gergő Móricz
492d97e889 reduce logging 2025-05-24 00:09:13 +02:00
Gergő Móricz
a7894a2714 fix(scrapeURL/pdf): even better timeout detection 2025-05-23 16:29:28 +02:00
Gergő Móricz
f41af8241e fix(scrapeURL/pdf): better timeout error 2025-05-23 13:59:53 +02:00
Gergő Móricz
bfe731309c fix(scrapeURL/pdf/mu): remove log 2025-05-23 13:47:34 +02:00
Gergő Móricz
b03670a8b7
feat: parse PDFs on fc side and reject if too long for timeout (FIR-2083) (#1592)
* feat: pdf-parser, implementation in scrapeURL

* use pdf-parser for page count instead of mu

* fix(pdf-parser): bindings

* feat(scrapeURL/pdf): adjust MILLISECONDS_PER_PAGE

* implement post-runsync polling and fix

* fix(Dockerfile): copy in the pdf-parser source code

* fix(scrapeURL/pdf): better error for timeout below 0
2025-05-23 13:45:53 +02:00
Gergő Móricz
fd74299134
feat(scrapeURL, logJob): log pdf page count to db (FIR-2068) (#1587)
* feat(scrapeURL, logJob): log pdf page count to db

* devin stop the test littering pls
2025-05-22 17:26:01 -03:00
Gergő Móricz
f838190ba6
hotfix: kill zombie workers, respect timeouts better (FIR-2034) (#1575)
* feat(scrapeURL): add strict timeouts everywhere

* feat(queue-worker/liveness): add networking check

* fix(queue-worker): typo

* fix(queue-worker/liveness): do not parse

* fix(queue-worker): check local network instead

* fix(queue-worker/liveness): typo
2025-05-20 17:35:32 +02:00
Gergő Móricz
192d056bef
feat(scrapeURL/pdf/mu): add timeout and created_at (#1570) 2025-05-19 21:36:15 +02:00
Gergő Móricz
fab4f00536
feat(scrapeURL): proxy auto mode (FIR-1853) (#1551)
* feat(scrapeURL): proxy auto mode

* feat(api/tests/snips/proxy/auto): add test for stealth pick
2025-05-19 19:43:03 +02:00
devin-ai-integration[bot]
526165e1b9
Add caching for RunPod PDF markdown results in GCS (#1561)
* Add caching for RunPod PDF markdown results in GCS

Co-Authored-By: thomas@sideguide.dev <thomas@sideguide.dev>

* Update PDF caching to hash base64 directly and add metadata

Co-Authored-By: thomas@sideguide.dev <thomas@sideguide.dev>

* Fix PDF caching to directly hash content and fix test expectations

Co-Authored-By: thomas@sideguide.dev <thomas@sideguide.dev>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: thomas@sideguide.dev <thomas@sideguide.dev>
2025-05-16 12:04:38 -03:00
Gergő Móricz
bd9673e104
Mog/cachable lookup (#1560)
* feat(scrapeURL): use cacheableLookup

* feat(queue-worker): add cacheablelookup

* fix(cacheable-lookup): make it work with tailscale on local

* add devenv

* try again

* allow querying all

* log

* fixes

* asd

* fix:

* fix(lookup):

* lookup
2025-05-16 15:44:52 +02:00
Gergő Móricz
d46ba95924 Revert "feat: use cacheable lookup everywhere (#1559)"
This reverts commit b8703b2a720765b92f5c4cab94cc90ea624198a8.
2025-05-16 15:31:06 +02:00
Gergő Móricz
b8703b2a72
feat: use cacheable lookup everywhere (#1559)
* feat(scrapeURL): use cacheableLookup

* feat(queue-worker): add cacheablelookup

* fix(cacheable-lookup): make it work with tailscale on local

* add devenv

* try again

* allow querying all

* log

* fixes

* asd

* fix:

* fix(lookup):
2025-05-16 15:27:24 +02:00
Gergő Móricz
cee481a3a9 fix(fire-engine): sslerror passthrough 2025-05-14 23:50:57 +02:00
Gergő Móricz
3db2294b97
feat(scrapeURL): better error for SSL failures (#1552) 2025-05-14 23:34:59 +02:00
Rafael Miller
eee613d1bc
[feat] Implement GCS storage option for scrape results across controllers an… (#1500)
* Implement GCS storage option for scrape results across controllers and update GCS document retrieval functionality

* done!

* Update gcs-jobs.ts
2025-04-29 15:15:44 -03:00
Rafael Miller
37dabce1ed
[feat] added second scrapeURLWithFireEngine (#1494) 2025-04-23 20:36:36 +02:00
Nicolas
1c421f2d74
Nick: (#1492) 2025-04-22 21:42:37 -04:00
Gergő Móricz
46048bc94d
feat(scrapeURL): return js returns from f-e (FIR-1535) (#1385)
* feat(scrapeURL): return js returns from f-e

* feat(js-sdk): handle new results
2025-03-28 12:42:25 +01:00
Grass Huang
7bf04d409a
fix(scraper): improve charset detection regex to accurately parse meta tags (#1265) 2025-02-26 17:31:06 +01:00
Gergő Móricz
283a3bfef3
fix(scrapeURL/engines/fetch): discover charset and re-decode (#1221)
* fix(scrapeURL/engines/fetch): discover charset and re-decode

* fix(snips/scrape): allow more time for stealth proxy
2025-02-20 18:56:15 +01:00
Gergő Móricz
c38dcd0432
feat(self-host): proxy support (FIR-1111) (#1212)
* feat(self-host): proxy support

* fix(playwright-service-ts): return untreated text/plain
2025-02-20 14:20:03 +01:00
Gergő Móricz
da1670b78c
feat(map): mock support (FIR-1109) (#1213)
* feat(map,fetch): mock support

* feat(snips/map): mock out long-running test

* fix(snips/scrape): use more reliable site for adblock testing
2025-02-20 10:41:43 +01:00