| Author | Commit | Message | Date |
|---|---|---|---|
| Gergő Móricz | e74e4bcefc | feat(runWebScraper): retry a scrape max 3 times in a crawl if the status code is failure | 2024-12-14 00:54:05 +01:00 |
| rafaelmmiller | bdbc05a4c7 | added check for object and trycatch as workaround for 502s | 2024-12-13 18:33:39 -03:00 |
| Nicolas | 6b17a53d4b | Update package.json | 2024-12-12 21:53:15 -03:00 |
| Nicolas | 13afe4c733 | Update index.ts | 2024-12-12 21:52:20 -03:00 |
| Nicolas | 6b41916e1a | Merge pull request #971 from mendableai/Hash-Urls: Remove Block List | 2024-12-12 18:19:51 -03:00 |
| Nicolas | 3b0d192d1b | Update types.ts | 2024-12-12 18:14:11 -03:00 |
| Eric Ciarla | a2998d4499 | Hash Urls | 2024-12-12 16:10:10 -05:00 |
| Nicolas | e22a0b596c | Nick: custom metadata | 2024-12-12 13:30:00 -03:00 |
| Nicolas | 1d1a936f2c | Merge pull request #954 from mendableai/rafa/fix-schema-base-model-extract: Fixes schema base model extract | 2024-12-11 20:14:35 -03:00 |
| Nicolas | de57e7f4dd | Nick: from dependencies to dev-dependencies | 2024-12-11 20:07:05 -03:00 |
| Nicolas | 8a1c404918 | Nick: revert trailing comma | 2024-12-11 19:51:08 -03:00 |
| Nicolas | 52f2e733e2 | Nick: fixes | 2024-12-11 19:48:22 -03:00 |
| Nicolas | 00335e2ba9 | Nick: fixed prettier | 2024-12-11 19:46:11 -03:00 |
| Gergő Móricz | f877fbfb8f | fix(WebCrawler/isFile): add .wav | 2024-12-10 23:24:53 +01:00 |
| Gergő Móricz | d276a23da0 | fix(scrapeURL/pdf): handle if a presumed PDF link returns HTML (e.g. 404) | 2024-12-10 23:24:33 +01:00 |
| Gergő Móricz | d9e017e5e2 | feat(queue-worker/crawl): solidify redirect behaviour | 2024-12-10 22:34:26 +01:00 |
| Gergő Móricz | ce460a3a56 | fix(v1/crawl/status): completed more than total if some scrape jobs fail or are discarded | 2024-12-10 22:33:53 +01:00 |
| Gergő Móricz | ecad76978d | feat(scrapeURL/pdf): extend amount of time we're willing to wait for PDFs in crawl/batch scrape mode | 2024-12-10 21:43:00 +01:00 |
| Gergő Móricz | 85cbfbb5bb | fix(crawl): disable smart wait. This increases the reliability/deterministic-ness of crawls. | 2024-12-10 21:12:31 +01:00 |
| Nicolas | 2d35a52efe | Merge pull request #958 from mendableai/remove-microsoft | 2024-12-10 12:00:49 -03:00 |
| rafaelmmiller | 468b8cdeb9 | removing microsoft from blocklist | 2024-12-10 11:29:36 -03:00 |
| Gergő Móricz | 877f072e3c | feat: crawl log parser (poc) | 2024-12-09 23:40:47 +01:00 |
| Nicolas | 4dbe0e6236 | Update requests.http | 2024-12-09 19:26:33 -03:00 |
| Nicolas | a47e278c97 | Nick: bump node sdk | 2024-12-09 19:25:48 -03:00 |
| rafaelmmiller | 5c81ea1803 | fixed optional+default bug on llm schema | 2024-12-09 15:34:50 -03:00 |
| Gergő Móricz | 91a1a9a1fc | fix(crawl-redis/lockURL): reduce logging | 2024-12-09 19:29:42 +01:00 |
| Gergő Móricz | 6776aee1c3 | feat(auth): extend rate limiter logging to make it easier to debug | 2024-12-09 19:29:32 +01:00 |
| Gergő Móricz | fe6b003fcf | fix(js-sdk/batchScrapeUrls): zod support | 2024-12-09 18:49:48 +01:00 |
| rafaelmmiller | ff878bc6f5 | bump version | 2024-12-09 14:35:22 -03:00 |
| rafaelmmiller | d8847bb4ce | fixes schema warning | 2024-12-09 14:34:50 -03:00 |
| Nicolas | f007f2439e | Update email_notification.ts | 2024-12-08 22:24:16 -03:00 |
| Nicolas | 4d287bb77f | Nick: moving acuc temp to read replica | 2024-12-06 13:06:26 -03:00 |
| Gergő Móricz | 934363b409 | feat(queue-worker): add better logging for worker | 2024-12-05 22:06:07 +01:00 |
| Gergő Móricz | f82b9c205c | fix(crawl-redis): oops | 2024-12-05 21:42:08 +01:00 |
| Gergő Móricz | 845c2744a9 | feat(app): add extra crawl logging (app-side only for now) | 2024-12-05 20:50:36 +01:00 |
| Gergő Móricz | cce94289ee | fix(v1/batch/scrape): horrid memory usage | 2024-12-05 20:49:28 +01:00 |
| Gergő Móricz | f8e619b5df | fix(crawl-status): returnvalue filtering on active jobs | 2024-12-05 18:20:21 +01:00 |
| Gergő Móricz | 41d859203f | feat(v1/batch/scrape): appendToId | 2024-12-04 23:35:29 +01:00 |
| Gergő Móricz | 7bde034020 | auth: log team id | 2024-12-04 23:12:55 +01:00 |
| Nicolas | 64546f1259 | Update types.ts | 2024-12-04 18:00:51 -03:00 |
| Nicolas | f7207f91b4 | Nick: temp e-s-1 | 2024-12-04 16:25:43 -03:00 |
| Gergő Móricz | 6b1f30e0fb | fix(scrapeURL/removeUnwantedElements): try to fix onlyMainContent for poorly structured sites | 2024-12-04 19:05:12 +01:00 |
| Gergő Móricz | 88a16b18a3 | fix(crawl-status): ts error | 2024-12-04 17:55:51 +01:00 |
| Gergő Móricz | d8613899e3 | fix(crawl-status): handle failed jobs (oops) | 2024-12-04 17:52:47 +01:00 |
| Gergő Móricz | 712a138404 | fix(crawl-status): hard error bug | 2024-12-04 17:47:37 +01:00 |
| Nicolas | 51a6b83f45 | Nick: fixed the crawl + n - not respecting limit | 2024-12-04 12:56:47 -03:00 |
| Nicolas | 39ff49a8f3 | Nick: reverted redirect fix | 2024-12-04 12:42:56 -03:00 |
| Nicolas | da96acdb94 | Merge pull request #943 from mendableai/fix/key-error-data: fixed keyerror for data on sdk | 2024-12-04 11:45:01 -03:00 |
| rafaelmmiller | 7e9ad3cba7 | fixed keyerror for data on sdk | 2024-12-04 11:17:19 -03:00 |
| Nicolas | 4d2f4aad11 | Update index.ts | 2024-12-03 21:07:45 -03:00 |