2271 Commits

Author SHA1 Message Date
Gergő Móricz
0013bdfcb4 feat(v1/scrape): add more context to timeout logs 2024-12-16 22:42:51 +01:00
Gergő Móricz
139e2c9a05 fix(runWebScraper): proper error handling 2024-12-16 22:24:00 +01:00
Rafael Miller
2c233bd321 Update requests.http 2024-12-16 11:48:48 -03:00
rafaelmmiller
d8150c6171 added type to reqs example 2024-12-16 11:46:56 -03:00
rafaelmmiller
b6802bc443 merged with main 2024-12-16 11:41:59 -03:00
Rafael Miller
8192d756e9 Merge branch 'main' into rafa/fix-default-on-schema-llm-extract 2024-12-16 09:33:36 -03:00
rafaelmmiller
eab30c474b added unit tests 2024-12-16 09:30:40 -03:00
Gergő Móricz
2de659d810 fix(queue-jobs): fix concurrency limit 2024-12-15 23:54:52 +01:00
Gergő Móricz
72d6a8179e fix(rate-limiter): raise crawlStatus limits 2024-12-15 23:08:23 +01:00
Gergő Móricz
e97ee4a4be fix(WebScraper/tryGetSitemap): deduplicate sitemap links list 2024-12-15 22:33:36 +01:00
Gergő Móricz
37f58efe45 fix(crawl-redis/lockURL): only add to visited_unique if lock succeeds 2024-12-15 21:01:31 +01:00
Gergő Móricz
30fa78cd9e feat(queue-worker): fix redirect slipping 2024-12-15 20:16:52 +01:00
Nicolas
126b46ee2c Update issue_credits.ts 2024-12-15 15:53:24 -03:00
Nicolas
1214d219e1 Nick: fix actions errors 2024-12-15 15:43:12 -03:00
Gergő Móricz
0f3a27bf27 fix(scrapeURL/engines): better timeouts 2024-12-15 18:58:29 +01:00
Nicolas
a5256827c0 Update index.ts 2024-12-15 14:36:09 -03:00
Gergő Móricz
98f27b0acc fix(crawl-redis/addCrawlJobDone): further ensure that completed doesn't go over total 2024-12-15 16:29:09 +01:00
Gergő Móricz
b4a5e1a6e9 fix(scrapeURL/fire-engine): timeout handling 2024-12-15 16:04:17 +01:00
Gergő Móricz
afbd01299a fix(scrapeURL/fire-engine): timeouts 2024-12-15 15:58:27 +01:00
Gergő Móricz
842b522b44 feat: add scrapeOptions.fastMode 2024-12-15 14:28:47 +01:00
Nicolas
588f747ee8 chore: formatting 2024-12-15 02:54:49 -03:00
Nicolas
4987880b32 Nick: random fixes 2024-12-15 02:52:06 -03:00
Nicolas
664ba69f08 Nick: f-eng monitoring test 2024-12-14 21:40:46 -03:00
Nicolas
ccbae4b155 Update auth.ts 2024-12-14 00:20:14 -03:00
Gergő Móricz
4b5014d7fe feat(v1/batch/scrape): add ignoreInvalidURLs option 2024-12-14 01:11:43 +01:00
Gergő Móricz
e74e4bcefc feat(runWebScraper): retry a scrape max 3 times in a crawl if the status code is failure 2024-12-14 00:54:05 +01:00
Nicolas
6b41916e1a Merge pull request #971 from mendableai/Hash-Urls
Remove Block List
2024-12-12 18:19:51 -03:00
Nicolas
3b0d192d1b Update types.ts 2024-12-12 18:14:11 -03:00
Eric Ciarla
a2998d4499 Hash Urls 2024-12-12 16:10:10 -05:00
Nicolas
e22a0b596c Nick: custom metadata 2024-12-12 13:30:00 -03:00
Nicolas
de57e7f4dd Nick: from dependencies to dev-dependencies 2024-12-11 20:07:05 -03:00
Nicolas
8a1c404918 Nick: revert trailing comma 2024-12-11 19:51:08 -03:00
Nicolas
52f2e733e2 Nick: fixes 2024-12-11 19:48:22 -03:00
Nicolas
00335e2ba9 Nick: fixed prettier 2024-12-11 19:46:11 -03:00
Gergő Móricz
f877fbfb8f fix(WebCrawler/isFile): add .wav 2024-12-10 23:24:53 +01:00
Gergő Móricz
d276a23da0 fix(scrapeURL/pdf): handle if a presumed PDF link returns HTML (e.g. 404) 2024-12-10 23:24:33 +01:00
Gergő Móricz
d9e017e5e2 feat(queue-worker/crawl): solidify redirect behaviour 2024-12-10 22:34:26 +01:00
Gergő Móricz
ce460a3a56 fix(v1/crawl/status): completed more than total if some scrape jobs fail or are discarded 2024-12-10 22:33:53 +01:00
Gergő Móricz
ecad76978d feat(scrapeURL/pdf): extend amount of time we're willing to wait for PDFs in crawl/batch scrape mode 2024-12-10 21:43:00 +01:00
Gergő Móricz
85cbfbb5bb fix(crawl): disable smart wait
This increases the reliability/deterministic-ness of crawls.
2024-12-10 21:12:31 +01:00
Nicolas
2d35a52efe Merge pull request #958 from mendableai/remove-microsoft 2024-12-10 12:00:49 -03:00
rafaelmmiller
468b8cdeb9 removing microsoft from blocklist 2024-12-10 11:29:36 -03:00
Gergő Móricz
877f072e3c feat: crawl log parser (poc) 2024-12-09 23:40:47 +01:00
Nicolas
4dbe0e6236 Update requests.http 2024-12-09 19:26:33 -03:00
Nicolas
a47e278c97 Nick: bump node sdk 2024-12-09 19:25:48 -03:00
rafaelmmiller
5c81ea1803 fixed optional+default bug on llm schema 2024-12-09 15:34:50 -03:00
Gergő Móricz
91a1a9a1fc fix(crawl-redis/lockURL): reduce logging 2024-12-09 19:29:42 +01:00
Gergő Móricz
6776aee1c3 feat(auth): extend rate limiter logging to make it easier to debug 2024-12-09 19:29:32 +01:00
Nicolas
f007f2439e Update email_notification.ts 2024-12-08 22:24:16 -03:00
Nicolas
4d287bb77f Nick: moving acuc temp to read replica 2024-12-06 13:06:26 -03:00