| Author | Commit | Message | Date |
|---|---|---|---|
| Eric Ciarla | a20a003c74 | revert to pdf parse | 2024-12-17 12:12:22 -05:00 |
| Eric Ciarla | 1402831a0a | Replace pdf parse with pdf to md | 2024-12-17 09:59:52 -05:00 |
| Eric Ciarla | ed7d15d2af | Update index.ts | 2024-12-17 09:50:29 -05:00 |
| Gergő Móricz | 654d6c6e0b | fix(scrapeURL): increase timeToRun | 2024-12-17 13:21:24 +01:00 |
| Gergő Móricz | 47b968fede | fix(scrapeURL/fire-engine): timeout calculation issues | 2024-12-17 13:17:55 +01:00 |
| Gergő Móricz | 7f57c868be | Revert "fix(scrapeURL): better timeToRun distribution" This reverts commit 284a6ccedd1baede825571ee933eb7e4f773e2de. | 2024-12-16 23:08:20 +01:00 |
| Gergő Móricz | 284a6ccedd | fix(scrapeURL): better timeToRun distribution | 2024-12-16 23:01:34 +01:00 |
| Nicolas | 1214d219e1 | Nick: fix actions errors | 2024-12-15 15:43:12 -03:00 |
| Gergő Móricz | 0f3a27bf27 | fix(scrapeURL/engines): better timeouts | 2024-12-15 18:58:29 +01:00 |
| Nicolas | a5256827c0 | Update index.ts | 2024-12-15 14:36:09 -03:00 |
| Gergő Móricz | b4a5e1a6e9 | fix(scrapeURL/fire-engine): timeout handling | 2024-12-15 16:04:17 +01:00 |
| Gergő Móricz | afbd01299a | fix(scrapeURL/fire-engine): timeouts | 2024-12-15 15:58:27 +01:00 |
| Gergő Móricz | 842b522b44 | feat: add scrapeOptions.fastMode | 2024-12-15 14:28:47 +01:00 |
| Gergő Móricz | e74e4bcefc | feat(runWebScraper): retry a scrape max 3 times in a crawl if the status code is failure | 2024-12-14 00:54:05 +01:00 |
| Nicolas | 8a1c404918 | Nick: revert trailing comma | 2024-12-11 19:51:08 -03:00 |
| Nicolas | 00335e2ba9 | Nick: fixed prettier | 2024-12-11 19:46:11 -03:00 |
| Gergő Móricz | d276a23da0 | fix(scrapeURL/pdf): handle if a presumed PDF link returns HTML (e.g. 404) | 2024-12-10 23:24:33 +01:00 |
| Gergő Móricz | ecad76978d | feat(scrapeURL/pdf): extend amount of time we're willing to wait for PDFs in crawl/batch scrape mode | 2024-12-10 21:43:00 +01:00 |
| Gergő Móricz | 85cbfbb5bb | fix(crawl): disable smart wait This increases the reliability/deterministic-ness of crawls. | 2024-12-10 21:12:31 +01:00 |
| Gergő Móricz | 6b1f30e0fb | fix(scrapeURL/removeUnwantedElements): try to fix onlyMainContent for poorly structured sites | 2024-12-04 19:05:12 +01:00 |
| Nicolas | 4d2f4aad11 | Update index.ts | 2024-12-03 21:07:45 -03:00 |
| Nicolas | f3aa32863f | Revert "Merge branch 'nsc/crawl-n--1-fixes'" This reverts commit 6d325b7ce7af912b326369eace62f89f897b536b, reversing changes made to 3d5704b73e6c4802f0344dc2e17042af9b6de0f5. | 2024-12-03 20:53:14 -03:00 |
| Nicolas | 64800a1c02 | Nick: rm fe for test | 2024-12-03 20:34:14 -03:00 |
| Nicolas | eb2e51e50b | Nick: fixed /extract without a schema | 2024-12-03 12:08:15 -03:00 |
| Gergő Móricz | 42980c899d | fix(scrapeURL/fire-engine): fast fail on chrome error | 2024-11-28 18:41:48 +01:00 |
| rafaelmmiller | 943bbae88d | fixed nested data inside extract | 2024-11-27 18:29:37 -03:00 |
| Nicolas | 18b864eace | Update index.ts | 2024-11-24 19:48:13 -08:00 |
| Nicolas | 3eaa3b38ab | Nick: formatting | 2024-11-20 16:42:42 -08:00 |
| Nicolas | 3de4997f4d | Loggin num tokens | 2024-11-20 13:09:46 -08:00 |
| Nicolas | 67a2989874 | Nick: fixes | 2024-11-20 12:48:10 -08:00 |
| Nicolas | 28696da6b2 | Nick: gpt-4o | 2024-11-20 12:25:50 -08:00 |
| Nicolas | d49f62fb56 | Nick: extract fixes | 2024-11-20 11:50:14 -08:00 |
| Nicolas | c9b0a80522 | Nick: | 2024-11-20 10:23:44 -08:00 |
| rafaelmmiller | 2fb8a3c8dc | fix schema | 2024-11-19 10:04:42 -03:00 |
| rafaelmmiller | 36cf49c959 | Merge remote-tracking branch 'origin/main' into nsc/new-extract | 2024-11-19 09:34:08 -03:00 |
| Gergő Móricz | 63787bc504 | fix(scrapeURL/fire-engine): wait longer if timeout is not specified | 2024-11-15 20:25:16 +01:00 |
| Gergő Móricz | 4cddcd5206 | fix(scrapeURL/fire-engine): timeout-less scrape support (initial) | 2024-11-15 20:15:25 +01:00 |
| Móricz Gergő | 3a342bfbf0 | fix(scrapeURL/playwright): JSON body fix | 2024-11-15 15:18:40 +01:00 |
| Gergő Móricz | 359c30fbda | fix(cache): don't cache on failure error code | 2024-11-14 19:49:34 +01:00 |
| Gergő Móricz | 49ff37afb4 | feat: cache | 2024-11-14 19:47:12 +01:00 |
| Móricz Gergő | 5519f077aa | fix(scrapeURL): adjust error message for clarity | 2024-11-14 10:13:48 +01:00 |
| Nicolas | a1c018fdb0 | Merge branch 'main' into nsc/new-extract | 2024-11-13 17:14:43 -05:00 |
| rafaelmmiller | 904c904971 | wip | 2024-11-13 18:06:20 -03:00 |
| Gergő Móricz | 16e850288c | fix(scrapeURL/pdf,docx): ignore SSL when downloading PDF | 2024-11-12 22:46:58 +01:00 |
| Gergő Móricz | 7081beff1f | fix(scrapeURL/pdf): retry | 2024-11-12 22:26:36 +01:00 |
| Gergő Móricz | 9ace2ad071 | fix(scrapeURL/pdf): fix llamaparse upload | 2024-11-12 20:55:14 +01:00 |
| Gergő Móricz | 9f8b8c190f | feat(scrapeURL): log URL for easy searching | 2024-11-12 17:54:48 +01:00 |
| Gergő Móricz | e95b6656fa | fix(scrapeURL): don't log fetch request | 2024-11-12 17:53:44 +01:00 |
| Gergő Móricz | f42740a109 | fix(scrapeURL): don't log engineResult | 2024-11-12 17:52:32 +01:00 |
| Gergő Móricz | 2ca22659d3 | fix(scrapeURL/llmExtract): fix schema-less LLM extract | 2024-11-11 21:07:37 +01:00 |