2271 Commits

Author SHA1 Message Date
Nicolas
86e34d7c6c Nick: wip 2025-01-07 12:13:12 -03:00
Móricz Gergő
7a03275575 add comment 2025-01-07 13:57:47 +01:00
Móricz Gergő
7d73ebdbf1 fix(crawl): never invalidate first crawl scrape if redirects 2025-01-07 13:57:23 +01:00
Móricz Gergő
b96b97ed72 fix(crawl): don't push rawhtml to db unless requested 2025-01-07 10:09:15 +01:00
Móricz Gergő
35d1d85978 fix(crawler): also take the hostname of the base url when determining isInternalLink 2025-01-07 09:29:58 +01:00
Nicolas
bb27594443 Merge branch 'main' into nsc/extract-queue 2025-01-06 13:01:15 -03:00
Gergő Móricz
461842fe8c fix(v1/crawl-status): handle job's returnvalue being explicitly null (db race) 2025-01-04 17:24:33 +01:00
Gergő Móricz
b92a4eb79b fix(queue-worker): only do redirect handling logic on crawls, not batch scrape 2025-01-04 16:59:35 +01:00
Nicolas
d48ddb8820 Update canonical-url.test.ts 2025-01-03 23:55:05 -03:00
Nicolas
f2e0bfbfe3 Nick: url normalization 2025-01-03 23:54:03 -03:00
Nicolas
f25c0c6d21 Nick: added canonical tests 2025-01-03 23:16:33 -03:00
Nicolas
aef040b41e Nick: from cache fixes 2025-01-03 23:07:15 -03:00
Nicolas
e8a9d8ddcd Merge branch 'main' of https://github.com/mendableai/firecrawl 2025-01-03 22:55:42 -03:00
Nicolas
05e845a971 Update cache.ts 2025-01-03 22:55:38 -03:00
Nicolas
c655c6859f Nick: fixed 2025-01-03 22:50:53 -03:00
Nicolas
a4f7c38834 Nick: fixed 2025-01-03 22:15:23 -03:00
Nicolas
8df1c67961 Update queue-worker.ts 2025-01-03 21:48:28 -03:00
Nicolas
499479c85e Update url-processor.ts 2025-01-03 21:28:52 -03:00
Nicolas
432b410678 Update queue-worker.ts 2025-01-03 21:26:05 -03:00
Nicolas
6b2e1cbb28 Nick: cache /extract scrapes 2025-01-03 21:19:40 -03:00
Nicolas
27457ed5db Nick: init 2025-01-03 20:44:27 -03:00
Nicolas
81cf05885b Merge branch 'main' into nsc/semantic-index-extract 2025-01-03 19:57:29 -03:00
Nicolas
ad49503f8a Update search.ts 2025-01-02 21:15:47 -03:00
Nicolas
cbe0716439 Update search.ts 2025-01-02 21:13:24 -03:00
Nicolas
e37ab8431a Update search.ts 2025-01-02 21:07:14 -03:00
Nicolas
8b64e915b3 Update search.ts 2025-01-02 21:02:55 -03:00
Nicolas
7ce780ac81 Update search.ts 2025-01-02 20:40:38 -03:00
Nicolas
21bf89b6cc Update search.ts 2025-01-02 19:57:51 -03:00
Nicolas
22ae1730bd Update search.ts 2025-01-02 19:57:41 -03:00
Nicolas
a0dbf20c40 Update types.ts 2025-01-02 19:55:28 -03:00
Nicolas
35d7202894 Update search.ts 2025-01-02 19:33:21 -03:00
Nicolas
d2742bec4d Nick: v1 search 2025-01-02 19:31:03 -03:00
rafaelmmiller
ef0fc8d0d3 broader search if didnt find results 2025-01-02 18:00:18 -03:00
Nicolas
c9d91af86f Merge branch 'main' into nsc/semantic-index-extract 2025-01-02 15:26:40 -03:00
Nicolas
c3fd13a82b Nick: fixed re-ranker and enabled url cache of 2hrs 2024-12-31 18:06:07 -03:00
Nicolas
07f4b714af Update removeUnwantedElements.ts 2024-12-31 15:23:02 -03:00
Nicolas
33632d2fe3 Update extraction-service.ts 2024-12-31 15:22:50 -03:00
Nicolas
bd81b41d5f Update queue-worker.ts 2024-12-30 21:43:59 -03:00
Nicolas
e6da214aeb Nick: async background index 2024-12-30 21:42:01 -03:00
Nicolas
7a31306be5 Nick: url normalization + max metadata size 2024-12-30 20:04:22 -03:00
Nicolas
bf9d41d0b2 Nick: index exploration 2024-12-30 19:37:48 -03:00
Nicolas
0847a6038e Merge pull request #1014 from mendableai/nsc/extract-url-trace (/extract URL trace) 2024-12-30 19:00:58 -03:00
Gergő Móricz
71a8f7452c fix(WebScraper/sitemap): await urlsHandler to fix race condition 2024-12-30 16:09:22 +01:00
Nicolas
8ae34a0d31 Nick: rm .xml from isFile 2024-12-30 11:57:01 -03:00
Gergő Móricz
9005757de3 fix(queue-worker): do not follow redirect URLs if they are not allowed by the crawl options 2024-12-30 14:41:31 +01:00
Gergő Móricz
4d1f92f4c8 fix(scrapeURL/fetch): block loopback and link-local IPs 2024-12-29 17:35:14 +01:00
Nicolas
e255301005 Update index.ts 2024-12-27 21:31:29 -03:00
Nicolas
1eca61bffb Update index.ts 2024-12-27 20:59:18 -03:00
Nicolas
f9d55efba8 Update index.ts 2024-12-27 20:54:26 -03:00
Nicolas
b8d7f9f257 Nick: we are using runpod 2024-12-27 19:59:05 -03:00