116 Commits

Author SHA1 Message Date
Gergő Móricz
11ed679274 feat(scrapeURL/pdf): support PDF prefetch when parsePDF is off 2025-02-20 09:28:13 +01:00
Gergő Móricz
55d047b6b3
feat(scrapeURL): handle PDFs behind anti-bot (#1198) 2025-02-20 04:11:30 +01:00
Gergő Móricz
46b187bc64
feat(v1/map): stop mapping if timed out via AbortController (#1205) 2025-02-20 00:42:13 +01:00
Gergő Móricz
2200f084f3
SELFHOST FIXES (#1207)
* fix(extract): construct OpenAI on demand

Fixes a hard crash if the API key is not specified in a self-hosted environment.

* fix(ci): try sleeping

* fix(ci): override host

* fix(ci): wait for server to start

* Support /extract and /crawl for self-hosted (FIR-1097) (#1137)

* Support /extract for self-hosted

This returns the job response from Redis rather than Supabase when DB auth is disabled (self-hosted mode)

* Use getJob for extract and use correct types

* fix(v1/crawl-status): only poll DB for total count if DB is enabled

* feat(snips): TEST_SUITE_SELF_HOSTED

* fix(ci/test-server-self-host): use pr trigger

* fix(scrapeURL): f-e mocking in selfhosted env

* fix(snips): do not try to eval json format on selfhost

* fix(scrapeURL): further f-e mocking

* fix(snips): don't timeout on hard fail polling

* fix(v1/extract-status): fix-up the db-agnostic impl

unfortunately had to separate the functions since the schema
was too divergent :(

* fix(snips): boost screenshot delay

* feat(ci): test with openai

* feat(ci): extract, search testing

* fix(ci): matrix

* fix(ci): bleh

* Update: fix default google search (#1174)

* fix log title

* search should always work

* asd

* fix ci

---------

Co-authored-by: Nick Roth <nlr06886@gmail.com>
Co-authored-by: William <sdustusun@gmail.com>
2025-02-20 00:41:22 +01:00
Gergő Móricz
b136e42b53
feat(v1): proxy option / stealthProxy flag (FIR-1050) (#1196)
* feat(v1): proxy option / stealthProxy flag

* feat(js-sdk): add proxy option
2025-02-18 18:03:10 +01:00
Gergő Móricz
5c4021e8cf fix(scrapeURL/fire-engine): screenshot broken hotfix 2025-02-17 18:59:16 +01:00
Gergő Móricz
445b906af1
fix(scrapeURL/fire-engine): perform format screenshot after specified actions (#1192) 2025-02-17 12:35:55 -03:00
Gergő Móricz
1491b5b141
fix(scrapeURL/sb): enforce timeout (FIR-980) (#1183)
* fix(scrapeURL/scrapingbee): enforce timeout

* fix(scrapeURL/sb): types

* fix the test

* fixup: remove nix files
2025-02-16 15:55:03 +01:00
Gergő Móricz
892f3a41f3
fix(scrape): allow getting valid JSON via rawHtml (FIR-852) (#1138)
* fix(scrape): allow getting valid JSON via rawHtml

* fix(scrape/test):
2025-02-06 18:35:28 +01:00
Nicolas
6bfd24d903 Nick: waitFor fixes 2025-01-30 23:23:03 -03:00
Gergő Móricz
d09e0603f8
feat(scrapeUrl/fire-engine): add blockAds flag (FIR-692) (#1106)
* feat(scrapeUrl/fire-engine): add blockAds flag

* feat(v1/scrape): blockAds test
2025-01-29 15:03:37 +01:00
Móricz Gergő
bee2b2873e fix(sitemap): better ordering 2025-01-23 08:58:18 +01:00
Nicolas
498558d358 Nick: formatting done 2025-01-22 18:47:44 -03:00
Nicolas
04916f17e2 Nick: bug fixes + acuc fixes + cache fixes 2025-01-21 19:17:06 -03:00
Gergő Móricz
5c62bb1195
feat: new snips test framework (FIR-414) (#1033)
* feat: new snips test framework

* Update mock.ts

---------

Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-01-13 20:50:47 +01:00
Nicolas
f4d10c5031 Nick: formatting fixes 2025-01-10 18:35:10 -03:00
Móricz Gergő
3c614a2e5c fix(scrapeURL/engines/pdf,docx): support authorization 2025-01-09 10:03:27 +01:00
Nicolas
aef040b41e Nick: from cache fixes 2025-01-03 23:07:15 -03:00
Nicolas
a4f7c38834 Nick: fixed 2025-01-03 22:15:23 -03:00
Nicolas
6b2e1cbb28 Nick: cache /extract scrapes 2025-01-03 21:19:40 -03:00
Nicolas
c3fd13a82b Nick: fixed re-ranker and enabled url cache of 2hrs 2024-12-31 18:06:07 -03:00
Gergő Móricz
4d1f92f4c8 fix(scrapeURL/fetch): block loopback and link-local IPs 2024-12-29 17:35:14 +01:00
Nicolas
e255301005 Update index.ts 2024-12-27 21:31:29 -03:00
Nicolas
1eca61bffb Update index.ts 2024-12-27 20:59:18 -03:00
Nicolas
f9d55efba8 Update index.ts 2024-12-27 20:54:26 -03:00
Nicolas
b8d7f9f257 Nick: we are using runpod 2024-12-27 19:59:05 -03:00
Nicolas
5fcf3fa97e Merge branch 'main' into mog/mineru 2024-12-27 19:53:09 -03:00
Gergő Móricz
4772951313 feat(scrapeURL/fire-engine): explicitly delete job after scrape 2024-12-27 16:44:41 +01:00
Gergő Móricz
0b55fb836b feat(scrapeURL/pdf): switch to MinerU 2024-12-27 16:37:32 +01:00
Gergő Móricz
c543f4f76c feat(scrapeURL/pdf): update mock Blob implementation to pass TypeScript 2024-12-26 20:31:51 +01:00
Gergő Móricz
f15ef0e758 feat(scrapeURL/fire-engine/chrome-cdp): handle file downloads 2024-12-26 20:29:09 +01:00
Gergő Móricz
071b9a01c3 fix(scrapeURL/fire-engine): pass geolocation 2024-12-19 18:23:21 +01:00
Nicolas
3b6edef9fa chore: formatting 2024-12-17 16:58:57 -03:00
Eric Ciarla
a20a003c74 revert to pdf parse 2024-12-17 12:12:22 -05:00
Eric Ciarla
1402831a0a Replace pdf parse with pdf to md 2024-12-17 09:59:52 -05:00
Eric Ciarla
ed7d15d2af Update index.ts 2024-12-17 09:50:29 -05:00
Gergő Móricz
47b968fede fix(scrapeURL/fire-engine): timeout calculation issues 2024-12-17 13:17:55 +01:00
Nicolas
1214d219e1 Nick: fix actions errors 2024-12-15 15:43:12 -03:00
Gergő Móricz
0f3a27bf27 fix(scrapeURL/engines): better timeouts 2024-12-15 18:58:29 +01:00
Nicolas
a5256827c0 Update index.ts 2024-12-15 14:36:09 -03:00
Gergő Móricz
b4a5e1a6e9 fix(scrapeURL/fire-engine): timeout handling 2024-12-15 16:04:17 +01:00
Gergő Móricz
afbd01299a fix(scrapeURL/fire-engine): timeouts 2024-12-15 15:58:27 +01:00
Nicolas
8a1c404918 Nick: revert trailing comma 2024-12-11 19:51:08 -03:00
Nicolas
00335e2ba9 Nick: fixed prettier 2024-12-11 19:46:11 -03:00
Gergő Móricz
d276a23da0 fix(scrapeURL/pdf): handle if a presumed PDF link returns HTML (e.g. 404) 2024-12-10 23:24:33 +01:00
Gergő Móricz
ecad76978d feat(scrapeURL/pdf): extend amount of time we're willing to wait for PDFs in crawl/batch scrape mode 2024-12-10 21:43:00 +01:00
Gergő Móricz
85cbfbb5bb fix(crawl): disable smart wait
This increases the reliability and determinism of crawls.
2024-12-10 21:12:31 +01:00
Nicolas
4d2f4aad11 Update index.ts 2024-12-03 21:07:45 -03:00
Nicolas
f3aa32863f Revert "Merge branch 'nsc/crawl-n--1-fixes'"
This reverts commit 6d325b7ce7af912b326369eace62f89f897b536b, reversing
changes made to 3d5704b73e6c4802f0344dc2e17042af9b6de0f5.
2024-12-03 20:53:14 -03:00
Nicolas
64800a1c02 Nick: rm fe for test 2024-12-03 20:34:14 -03:00