107 Commits

Author SHA1 Message Date
Nicolas
4d0acc9722 Merge branch 'main' into v1-webscraper 2024-08-26 16:22:05 -03:00
Nicolas
173f4ee1bf Nick: chrome cdp main | simple autoscaler 2024-08-23 20:09:59 -03:00
Gergő Móricz
e7f267b6fe Merge branch 'main' into v1-webscraper 2024-08-23 17:21:54 +02:00
rafaelsideguide
7473b74021 fix: html and rawlhtmls for pdfs 2024-08-22 15:15:45 -03:00
rafaelsideguide
fe2e8c0b7a includehtml fix 2024-08-21 15:54:00 -03:00
Gergő Móricz
1368f9a87f fix: treat existing screenshot as a scraper success condition 2024-08-20 22:24:18 +02:00
rafaelsideguide
ecd472356b added variables to beta customers 2024-08-19 16:41:54 -03:00
rafaelsideguide
7a61325500 map + search + scrape markdown bug 2024-08-16 17:57:11 -03:00
rafaelsideguide
3f998b688d scrape ready 2024-08-16 15:14:37 -03:00
Gergő Móricz
29f0d9ec94 propagate priority to fire-engine 2024-08-15 19:04:46 +02:00
Gergő Móricz
5fc7fcb77c Merge branch 'main' into feat/queue-scrapes 2024-08-07 16:35:44 +02:00
Gergo Moricz
b60ee30dba fix(single_url): accept 500 2024-08-06 18:00:56 +02:00
rafaelsideguide
4d24a99d50 fix params 2024-08-06 09:34:43 -03:00
rafaelsideguide
3edc3a3d15 added fullpagescreenshot capabilities, wip on fire-engine side 2024-08-05 18:17:37 -03:00
Nicolas
7b813883ef Nick: first layer 2024-07-29 20:31:51 -04:00
rafaelsideguide
96cec2a673 fix checking scrape log success content length 2024-07-26 12:00:52 -03:00
Nicolas
f82ca3be17 Nick: 2024-07-25 19:53:29 -04:00
Nicolas
01fab6e036 Update single_url.ts 2024-07-25 17:51:41 -04:00
Nicolas
56042d090c Update single_url.ts 2024-07-25 17:48:44 -04:00
Nicolas
3242872503 Update single_url.ts 2024-07-25 17:43:55 -04:00
Gergo Moricz
4d35ad073c feat(monitoring/scrape): include url, worker, response_size 2024-07-24 16:43:39 +02:00
Gergo Moricz
64bcedeefc fix(monitoring): bad success check on scrape 2024-07-24 16:21:59 +02:00
Gergo Moricz
7cd9bf92e3 feat: scrape event logging to DB 2024-07-24 14:31:25 +02:00
rafaelsideguide
6208ecdbc0 added logger 2024-07-23 17:30:46 -03:00
Nicolas
d2de01d342 Nick: fixes 2024-07-18 13:19:44 -04:00
Nicolas
f11137352c Merge branch 'main' into feat/fire-engine-chrome-cdp 2024-07-18 12:48:42 -04:00
Caleb Peffer
c5d1e7260d Caleb: made changes per Rafaels requests 2024-07-17 11:29:05 -07:00
Caleb Peffer
d39d3be649 Caleb: now extracting and returning a list of all links on the page for a customer 2024-07-16 18:38:03 -07:00
Thomas Kosmas
5c65ec58e5 Support chrome-cdp and restructure sitemap fire-engine support. 2024-07-15 18:40:43 +03:00
Nicolas
066d92f643 Update single_url.ts 2024-07-03 18:38:17 -03:00
Nicolas
90c54c32fd Nick: refactor 2024-07-03 18:01:17 -03:00
Nicolas
90cf799a3c Update single_url.ts 2024-07-03 17:56:21 -03:00
Nicolas
b36406e465 Nick: log scrpaers 2024-07-03 17:28:53 -03:00
rafaelsideguide
7b7154ba1e bugfixed pageStatusCode 2024-07-02 10:51:35 -03:00
Nicolas
42cd58a679 Merge pull request #332 from mendableai/feat/rawHtmlExtraction: Adds pageOptions.includeRawHtml and new extraction mode "llm-extraction-from-raw-html" 2024-07-01 18:23:26 -03:00
rafaelsideguide
16aac7f8c5 Update single_url.ts 2024-07-01 18:21:15 -03:00
Eric Ciarla
87b54488d3 update to includeRawHtml 2024-06-28 17:07:47 -04:00
Eric Ciarla
70fcf2ce03 init 2024-06-28 16:39:09 -04:00
Nicolas
9bf74bc774 Update single_url.ts 2024-06-28 15:51:18 -03:00
Nicolas
7e17498bcf Update single_url.ts 2024-06-28 15:45:16 -03:00
Nicolas
e7be17db92 Nick: metadata fixes and lock duration for bull decreased to 2 hrs 2024-06-25 15:21:14 -03:00
rafaelsideguide
3ebdf93342 removed console.logs 2024-06-24 16:43:12 -03:00
rafaelsideguide
21d29de819 testing crawl with new.abb.com case; many unnecessary console.logs for tracing the code execution 2024-06-24 16:25:07 -03:00
rafaelsideguide
9c539e9113 Fixed includeHTML to use cleanedHtml as response 2024-06-18 16:26:54 -03:00
rafaelsideguide
6c726a02eb Moved to utils/removeUnwantedElements, added unit tests 2024-06-18 09:46:42 -03:00
AndyMik90
8b3c3aae91 Added support for RegEx in removeTags 2024-06-18 07:31:46 +02:00
rafaelsideguide
ad7795f973 Merge remote-tracking branch 'origin/main' into test/load-testing 2024-06-14 15:14:01 -03:00
Rafael Miller
f9c7ca9388 Merge branch 'main' into feat/issue-266 2024-06-14 11:47:58 -03:00
Rafael Miller
3e2e76311c Merge branch 'main' into feat/issue-205 2024-06-14 11:25:20 -03:00
rafaelsideguide
5dd18ca79b fixed edge cases 2024-06-14 09:46:55 -03:00