97 Commits

Author SHA1 Message Date
rafaelsideguide
c3aeed510b Update single_url.ts 2024-08-12 16:40:31 -03:00
Kevin Swiber
ba2af74adf Ensuring USE_DB_AUTHENTICATION is true in single URL scraper. 2024-08-09 15:29:18 -07:00
rafaelsideguide
4d24a99d50 fix params 2024-08-06 09:34:43 -03:00
rafaelsideguide
3edc3a3d15 added fullpagescreenshot capabilities, wip on fire-engine side 2024-08-05 18:17:37 -03:00
Nicolas
7b813883ef Nick: first layer 2024-07-29 20:31:51 -04:00
rafaelsideguide
96cec2a673 fix checking scrape log success content length 2024-07-26 12:00:52 -03:00
Nicolas
f82ca3be17 Nick: 2024-07-25 19:53:29 -04:00
Nicolas
01fab6e036 Update single_url.ts 2024-07-25 17:51:41 -04:00
Nicolas
56042d090c Update single_url.ts 2024-07-25 17:48:44 -04:00
Nicolas
3242872503 Update single_url.ts 2024-07-25 17:43:55 -04:00
Gergo Moricz
4d35ad073c feat(monitoring/scrape): include url, worker, response_size 2024-07-24 16:43:39 +02:00
Gergo Moricz
64bcedeefc fix(monitoring): bad success check on scrape 2024-07-24 16:21:59 +02:00
Gergo Moricz
7cd9bf92e3 feat: scrape event logging to DB 2024-07-24 14:31:25 +02:00
rafaelsideguide
6208ecdbc0 added logger 2024-07-23 17:30:46 -03:00
Nicolas
d2de01d342 Nick: fixes 2024-07-18 13:19:44 -04:00
Nicolas
f11137352c Merge branch 'main' into feat/fire-engine-chrome-cdp 2024-07-18 12:48:42 -04:00
Caleb Peffer
c5d1e7260d Caleb: made changes per Rafaels requests 2024-07-17 11:29:05 -07:00
Caleb Peffer
d39d3be649 Caleb: now extracting and returning a list of all links on the page for a customer 2024-07-16 18:38:03 -07:00
Thomas Kosmas
5c65ec58e5 Support chrome-cdp and restructure sitemap fire-engine support. 2024-07-15 18:40:43 +03:00
Nicolas
066d92f643 Update single_url.ts 2024-07-03 18:38:17 -03:00
Nicolas
90c54c32fd Nick: refactor 2024-07-03 18:01:17 -03:00
Nicolas
90cf799a3c Update single_url.ts 2024-07-03 17:56:21 -03:00
Nicolas
b36406e465 Nick: log scrapers 2024-07-03 17:28:53 -03:00
rafaelsideguide
7b7154ba1e bugfixed pageStatusCode 2024-07-02 10:51:35 -03:00
Nicolas
42cd58a679 Merge pull request #332 from mendableai/feat/rawHtmlExtraction
Adds pageOptions.includeRawHtml and new extraction mode "llm-extraction-from-raw-html"
2024-07-01 18:23:26 -03:00
rafaelsideguide
16aac7f8c5 Update single_url.ts 2024-07-01 18:21:15 -03:00
Eric Ciarla
87b54488d3 update to includeRawHtml 2024-06-28 17:07:47 -04:00
Eric Ciarla
70fcf2ce03 init 2024-06-28 16:39:09 -04:00
Nicolas
9bf74bc774 Update single_url.ts 2024-06-28 15:51:18 -03:00
Nicolas
7e17498bcf Update single_url.ts 2024-06-28 15:45:16 -03:00
Nicolas
e7be17db92 Nick: metadata fixes and lock duration for bull decreased to 2 hrs 2024-06-25 15:21:14 -03:00
rafaelsideguide
3ebdf93342 removed console.logs 2024-06-24 16:43:12 -03:00
rafaelsideguide
21d29de819 testing crawl with new.abb.com case
many unnecessary console.logs for tracing the code execution
2024-06-24 16:25:07 -03:00
rafaelsideguide
9c539e9113 Fixed includeHTML to use cleanedHtml as response 2024-06-18 16:26:54 -03:00
rafaelsideguide
6c726a02eb Moved to utils/removeUnwantedElements, added unit tests 2024-06-18 09:46:42 -03:00
AndyMik90
8b3c3aae91 Added support for RegEx in removeTags 2024-06-18 07:31:46 +02:00
rafaelsideguide
ad7795f973 Merge remote-tracking branch 'origin/main' into test/load-testing 2024-06-14 15:14:01 -03:00
Rafael Miller
f9c7ca9388 Merge branch 'main' into feat/issue-266 2024-06-14 11:47:58 -03:00
Rafael Miller
3e2e76311c Merge branch 'main' into feat/issue-205 2024-06-14 11:25:20 -03:00
rafaelsideguide
5dd18ca79b fixed edge cases 2024-06-14 09:46:55 -03:00
rafaelsideguide
bb859ae9a7 Added metadata.pageStatusCode and metadata.pageError properties to the responses 2024-06-13 17:08:40 -03:00
rafaelsideguide
676d6e8ab5 Added pageOptions.removeTags 2024-06-13 10:51:05 -03:00
rafaelsideguide
e37d151404 added parsePDF option to pageOptions
user can decide if they are going to let us take care of the parse or they are going to parse the pdf by themselves
2024-06-12 15:06:47 -03:00
Nicolas
7ae9778642 Update single_url.ts 2024-06-10 16:57:31 -07:00
Nicolas
913c1dd568 Nick: fetch -> axios and fix timeouts 2024-06-10 16:49:03 -07:00
rafaelsideguide
164676c70a bugfix screenshot for readme pages 2024-06-05 15:34:42 -03:00
rafaelsideguide
0d51b11dcd missing breaks 2024-06-05 15:02:28 -03:00
Rafael Miller
9e000ded03 Merge branch 'main' into feat/better-gdrive-pdf-fetch 2024-06-05 14:07:56 -03:00
rafaelsideguide
ccc55127d6 Added scroll xpaths on fire-engine for handling readme docs 2024-06-05 11:48:41 -03:00
rafaelsideguide
b5045d1661 [feat] improved the scrape for gdrive pdfs 2024-06-04 17:47:28 -03:00