157 Commits

Author | SHA1 | Message | Date
Nicolas | 78badf8f72 | Nick: wip | 2024-10-28 16:02:07 -03:00
Nicolas | d965f2ce7d | Nick: fixes | 2024-10-24 23:13:30 -03:00
Nicolas | d8abd15716 | Nick: from bulk to batch | 2024-10-23 15:37:24 -03:00
Nicolas | 66e505317e | Merge branch 'main' into mog/bulk-scrape | 2024-10-23 14:36:26 -03:00
Thomas Kosmas | acde353e56 | skipTlsVerification on robots.txt scraping | 2024-10-23 01:07:03 +03:00
Thomas Kosmas | bd55464b52 | skipTlsVerification | 2024-10-22 22:28:02 +03:00
Nicolas | d2344aa14b | Revert "Nick: improved map ranking algorithm"; This reverts commit 7acd8d2edb6abc45a63fe1060377d2acb398ec36. | 2024-10-21 16:11:32 -03:00
Nicolas | 7acd8d2edb | Nick: improved map ranking algorithm | 2024-10-19 13:27:47 -03:00
Gergő Móricz | 03b37998fd | feat: bulk scrape | 2024-10-17 19:40:18 +02:00
Nicolas | 081d7407b3 | Merge pull request #788 from mendableai/nsc/log-extractpr-options; Extractor options logging v1 fix | 2024-10-16 23:51:22 -03:00
Nicolas | 06b8d24a4c | Update scrape.ts | 2024-10-16 23:50:21 -03:00
Nicolas | a73b06589c | Merge pull request #785 from mendableai/nsc/support-for-all-metadata; Return all the website metadata | 2024-10-16 23:37:26 -03:00
Nicolas | c0384ea381 | Nick: added tests | 2024-10-16 23:32:44 -03:00
Nicolas | b4f6a0f919 | Nick: geolocation | 2024-10-15 21:12:33 -03:00
rafaelsideguide | 4afcd16e02 | performance improv for ws | 2024-10-15 10:12:27 -03:00
rafaelsideguide | 3afaab13d9 | feat/improv-crawl-status-filters | 2024-10-14 18:14:00 -03:00
Nicolas | 961b1010cf | Nick: rm the cache for map for 24hrs | 2024-10-12 17:48:37 -03:00
rafaelsideguide | 2d3d7c827a | fix/added unkwown status to job filter | 2024-10-11 15:40:29 -03:00
rafaelsideguide | 8cbd94ed2d | fix/filters failed and unknown jobs now | 2024-10-11 09:45:51 -03:00
busaud | c6ebbc6f6a | bugfix: self-host crawling doesnt respect limit | 2024-10-09 22:52:49 +00:00
Nicolas | 497ac3328b | Merge pull request #732 from mendableai/fix/url-validation-params; [BUG] Fixed URLs with params | 2024-10-03 17:43:37 -03:00
rafaelsideguide | cfd776a5de | fix: now urls with params are passing validation; example: https://www.granitecreek.com?asljhda=akjshd | 2024-10-03 17:37:04 -03:00
Nicolas | 49bd95327e | Update types.ts | 2024-10-03 17:00:33 -03:00
Nicolas | 1a1ac9fd60 | Nick: | 2024-10-03 16:37:58 -03:00
Nicolas | c6717fecaa | Nick: got rid of job interval sleep and math.min | 2024-10-01 16:11:12 -03:00
Nicolas | 18f9cd09e1 | Nick: fixed more stuff | 2024-10-01 16:04:39 -03:00
Nicolas | 37299fc035 | Update types.ts | 2024-10-01 15:18:11 -03:00
Nicolas | 4d5477f357 | Nick: resolved conflicts | 2024-10-01 14:39:57 -03:00
Nicolas | 96245e387d | Update crawl.ts | 2024-10-01 14:29:53 -03:00
Nicolas | 445fc432e9 | Reapply "fix(v1/crawl): always use sitemap"; This reverts commit 339b19ce9d57fd15b11820e1cfbe4d7b5f44cf30. | 2024-10-01 14:03:07 -03:00
Nicolas | 339b19ce9d | Revert "fix(v1/crawl): always use sitemap"; This reverts commit 5dc0fcf644bfc64b2b30dd345b2a61b64a4c1262. | 2024-10-01 13:59:49 -03:00
Gergő Móricz | 5dc0fcf644 | fix(v1/crawl): always use sitemap | 2024-10-01 18:49:44 +02:00
Nicolas | 1af26fe1b4 | Nick: sitemap fix | 2024-10-01 12:38:48 -03:00
Gergő Móricz | 3621e191bd | feat(concurrency-limit): set limit based on plan | 2024-09-28 00:19:54 +02:00
Gergő Móricz | d5e2a80e4a | fix(crawl-status): keep 10 megabyte pages if they're the only thing in the output | 2024-09-27 20:41:41 +02:00
Gergő Móricz | e98f858eb6 | fix(api): playground scrape errors | 2024-09-26 22:28:14 +02:00
Gergő Móricz | 84bff8add8 | fix(billTeam): update cached ACUC after billing | 2024-09-26 22:15:15 +02:00
Gergő Móricz | f22ab5ffaf | feat(db): implement bill_team RPC | 2024-09-26 22:15:15 +02:00
Gergő Móricz | f8c70fe5dd | feat(db): implement auth_credit_usage_chunk RPC | 2024-09-26 22:15:15 +02:00
Gergő Móricz | 29815e084b | feat(v1/Document): add warning field | 2024-09-26 21:19:05 +02:00
Gergő Móricz | b696bfc854 | fix(crawl-status): avoid race conditions where crawl may be deemed failed | 2024-09-26 21:00:27 +02:00
Gergő Móricz | e67cbc2ca1 | fix(billTeam): update cached ACUC after billing | 2024-09-25 21:37:01 +02:00
Gergő Móricz | 5a8eb17a82 | feat(db): implement bill_team RPC | 2024-09-25 20:57:45 +02:00
Gergő Móricz | 331e826bca | feat(db): implement auth_credit_usage_chunk RPC | 2024-09-25 19:25:18 +02:00
Gergő Móricz | f00c0b82f9 | fix(v1/scrape): add total wait specified in request to timeout | 2024-09-24 21:56:22 +02:00
Gergő Móricz | 3e661a2087 | fix(v1/crawl-cancel): avoid double authing | 2024-09-24 20:01:34 +02:00
Gergő Móricz | a59b5836d5 | Revert error tallying | 2024-09-24 10:27:49 +02:00
Nicolas | db161ac55a | Nick: press + write | 2024-09-20 19:45:23 -04:00
Nicolas | 0690cfeaad | Merge branch 'main' into feat/actions | 2024-09-20 18:24:13 -04:00
Gergő Móricz | d663bbf0ca | feat(actions): add scroll | 2024-09-20 21:41:53 +02:00