| rafaelsideguide | c3aeed510b | Update single_url.ts | 2024-08-12 16:40:31 -03:00 |
| Kevin Swiber | ba2af74adf | Ensuring USE_DB_AUTHENTICATION is true in single URL scraper. | 2024-08-09 15:29:18 -07:00 |
| rafaelsideguide | 4d24a99d50 | fix params | 2024-08-06 09:34:43 -03:00 |
| rafaelsideguide | 3edc3a3d15 | added fullpagescreenshot capabilities, wip on fire-engine side | 2024-08-05 18:17:37 -03:00 |
| Nicolas | 7b813883ef | Nick: first layer | 2024-07-29 20:31:51 -04:00 |
| rafaelsideguide | 96cec2a673 | fix checking scrape log success content length | 2024-07-26 12:00:52 -03:00 |
| Nicolas | f82ca3be17 | Nick: | 2024-07-25 19:53:29 -04:00 |
| Nicolas | 01fab6e036 | Update single_url.ts | 2024-07-25 17:51:41 -04:00 |
| Nicolas | 56042d090c | Update single_url.ts | 2024-07-25 17:48:44 -04:00 |
| Nicolas | 3242872503 | Update single_url.ts | 2024-07-25 17:43:55 -04:00 |
| Gergo Moricz | 4d35ad073c | feat(monitoring/scrape): include url, worker, response_size | 2024-07-24 16:43:39 +02:00 |
| Gergo Moricz | 64bcedeefc | fix(monitoring): bad success check on scrape | 2024-07-24 16:21:59 +02:00 |
| Gergo Moricz | 7cd9bf92e3 | feat: scrape event logging to DB | 2024-07-24 14:31:25 +02:00 |
| rafaelsideguide | 6208ecdbc0 | added logger | 2024-07-23 17:30:46 -03:00 |
| Nicolas | d2de01d342 | Nick: fixes | 2024-07-18 13:19:44 -04:00 |
| Nicolas | f11137352c | Merge branch 'main' into feat/fire-engine-chrome-cdp | 2024-07-18 12:48:42 -04:00 |
| Caleb Peffer | c5d1e7260d | Caleb: made changes per Rafaels requests | 2024-07-17 11:29:05 -07:00 |
| Caleb Peffer | d39d3be649 | Caleb: now extracting and returning a list of all links on the page for a customer | 2024-07-16 18:38:03 -07:00 |
| Thomas Kosmas | 5c65ec58e5 | Support chrome-cdp and restructure sitemap fire-engine support. | 2024-07-15 18:40:43 +03:00 |
| Nicolas | 066d92f643 | Update single_url.ts | 2024-07-03 18:38:17 -03:00 |
| Nicolas | 90c54c32fd | Nick: refactor | 2024-07-03 18:01:17 -03:00 |
| Nicolas | 90cf799a3c | Update single_url.ts | 2024-07-03 17:56:21 -03:00 |
| Nicolas | b36406e465 | Nick: log scrpaers | 2024-07-03 17:28:53 -03:00 |
| rafaelsideguide | 7b7154ba1e | bugfixed pageStatusCode | 2024-07-02 10:51:35 -03:00 |
| Nicolas | 42cd58a679 | Merge pull request #332 from mendableai/feat/rawHtmlExtraction: Adds pageOptions.includeRawHtml and new extraction mode "llm-extraction-from-raw-html" | 2024-07-01 18:23:26 -03:00 |
| rafaelsideguide | 16aac7f8c5 | Update single_url.ts | 2024-07-01 18:21:15 -03:00 |
| Eric Ciarla | 87b54488d3 | update to includeRawHtml | 2024-06-28 17:07:47 -04:00 |
| Eric Ciarla | 70fcf2ce03 | init | 2024-06-28 16:39:09 -04:00 |
| Nicolas | 9bf74bc774 | Update single_url.ts | 2024-06-28 15:51:18 -03:00 |
| Nicolas | 7e17498bcf | Update single_url.ts | 2024-06-28 15:45:16 -03:00 |
| Nicolas | e7be17db92 | Nick: metadata fixes and lock duration for bull decreased to 2 hrs | 2024-06-25 15:21:14 -03:00 |
| rafaelsideguide | 3ebdf93342 | removed console.logs | 2024-06-24 16:43:12 -03:00 |
| rafaelsideguide | 21d29de819 | testing crawl with new.abb.com case; many unnecessary console.logs for tracing the code execution | 2024-06-24 16:25:07 -03:00 |
| rafaelsideguide | 9c539e9113 | Fixed includeHTML to use cleanedHtml as response | 2024-06-18 16:26:54 -03:00 |
| rafaelsideguide | 6c726a02eb | Moved to utils/removeUnwantedElements, added unit tests | 2024-06-18 09:46:42 -03:00 |
| AndyMik90 | 8b3c3aae91 | Added support for RegEx in removeTags | 2024-06-18 07:31:46 +02:00 |
| rafaelsideguide | ad7795f973 | Merge remote-tracking branch 'origin/main' into test/load-testing | 2024-06-14 15:14:01 -03:00 |
| Rafael Miller | f9c7ca9388 | Merge branch 'main' into feat/issue-266 | 2024-06-14 11:47:58 -03:00 |
| Rafael Miller | 3e2e76311c | Merge branch 'main' into feat/issue-205 | 2024-06-14 11:25:20 -03:00 |
| rafaelsideguide | 5dd18ca79b | fixed edge cases | 2024-06-14 09:46:55 -03:00 |
| rafaelsideguide | bb859ae9a7 | Added metadata.pageStatusCode and metadata.pageError properties to the responses | 2024-06-13 17:08:40 -03:00 |
| rafaelsideguide | 676d6e8ab5 | Added pageOptions.removeTags | 2024-06-13 10:51:05 -03:00 |
| rafaelsideguide | e37d151404 | added parsePDF option to pageOptions; user can decide if they are going to let us take care of the parse or they are going to parse the pdf by themselves | 2024-06-12 15:06:47 -03:00 |
| Nicolas | 7ae9778642 | Update single_url.ts | 2024-06-10 16:57:31 -07:00 |
| Nicolas | 913c1dd568 | Nick: fetch -> axios and fix timeouts | 2024-06-10 16:49:03 -07:00 |
| rafaelsideguide | 164676c70a | bugfix screenshot for readme pages | 2024-06-05 15:34:42 -03:00 |
| rafaelsideguide | 0d51b11dcd | missing breaks | 2024-06-05 15:02:28 -03:00 |
| Rafael Miller | 9e000ded03 | Merge branch 'main' into feat/better-gdrive-pdf-fetch | 2024-06-05 14:07:56 -03:00 |
| rafaelsideguide | ccc55127d6 | Added scroll xpaths on fire-engine for handling readme docs | 2024-06-05 11:48:41 -03:00 |
| rafaelsideguide | b5045d1661 | [feat] improved the scrape for gdrive pdfs | 2024-06-04 17:47:28 -03:00 |