| Author | Commit | Message | Date |
|---|---|---|---|
| Gergő Móricz | e7f267b6fe | Merge branch 'main' into v1-webscraper | 2024-08-23 17:21:54 +02:00 |
| Gergő Móricz | 8e3c2b2855 | fix(crawler): verify URL | 2024-08-22 23:30:19 +02:00 |
| Gergő Móricz | fbbc3878f1 | fix(crawler): make sure includes/excludes is an array | 2024-08-22 13:18:26 +02:00 |
| Gergő Móricz | 55009e51f5 | fix: filter out invalid URLs from crawl links | 2024-08-21 20:49:25 +02:00 |
| rafaelsideguide | e1c9cbf709 | bug fixed. crawl should not stop if sitemap url is invalid | 2024-08-20 09:11:58 -03:00 |
| Gergő Móricz | aabfaf0ac5 | clean up crawl-status, fix db ddos | 2024-08-16 23:29:39 +02:00 |
| Gergo Moricz | 86e136beca | feat: crawl to scrape conversion | 2024-08-13 20:51:43 +02:00 |
| rafaelsideguide | 8568b61015 | bugfix for sitemaps | 2024-08-02 11:03:01 -03:00 |
| rafaelsideguide | f48ff36b32 | added .inc files and forced lower case comparison | 2024-07-31 09:28:43 -03:00 |
| Nicolas | e5b797549e | Merge branch 'main' into feat/scrape-monitoring | 2024-07-25 16:21:02 -04:00 |
| rafaelsideguide | e720e1bacf | Merge remote-tracking branch 'origin/main' into feat/logger | 2024-07-25 09:49:27 -03:00 |
| Gergo Moricz | 7cd9bf92e3 | feat: scrape event logging to DB | 2024-07-24 14:31:25 +02:00 |
| Rafael Miller | 5e728c1a4d | Update apps/api/src/scraper/WebScraper/crawler.ts<br>no need for regex<br>Co-authored-by: Gergő Móricz <mo.geryy@gmail.com> | 2024-07-24 08:33:00 -03:00 |
| rafaelsideguide | 6208ecdbc0 | added logger | 2024-07-23 17:30:46 -03:00 |
| rafaelsideguide | a684bd3c5d | added regex for links in sitemap | 2024-07-23 09:07:23 -03:00 |
| rafaelsideguide | 5c02dbe20c | fix(isFile): added .tiff extension | 2024-07-18 17:07:21 -03:00 |
| Gergo Moricz | f0e95ce399 | fix(WebCrawler): filter out file URLs when taking URLs from sitemap | 2024-07-18 21:49:37 +02:00 |
| Nicolas | e098e88ea7 | Nick: | 2024-07-12 22:02:08 -04:00 |
| rafaelsideguide | 9ad06fdf56 | added fire-engine fallback for getting sitemaps | 2024-07-09 16:07:53 -03:00 |
| Nicolas | 90c54c32fd | Nick: refactor | 2024-07-03 18:01:17 -03:00 |
| rafaelsideguide | 4d6e25619b | minor spacing and comment stuff | 2024-07-01 16:05:34 -03:00 |
| Jeff Pereira | a5fb45988c | new feature allowExternalContentLinks | 2024-06-28 17:23:40 -07:00 |
| Nicolas | 90b7fff366 | Update crawler.ts | 2024-06-24 16:52:01 -03:00 |
| rafaelsideguide | 3ebdf93342 | removed console.logs | 2024-06-24 16:43:12 -03:00 |
| Nicolas | 56d42d9c9b | Nick: | 2024-06-24 16:33:07 -03:00 |
| rafaelsideguide | 21d29de819 | testing crawl with new.abb.com case<br>many unnecessary console.logs for tracing the code execution | 2024-06-24 16:25:07 -03:00 |
| Eric Ciarla | b1eb608295 | Merge branch 'main' into feat/maxDepthRelative | 2024-06-15 16:50:27 -04:00 |
| Eric Ciarla | 34e37c5671 | Add unit tests to replace e2e | 2024-06-15 16:43:37 -04:00 |
| Eric Ciarla | a6b7197737 | Fix for maxDepth | 2024-06-14 19:40:37 -04:00 |
| Nicolas | e88cb314c8 | Update crawler.ts | 2024-06-14 13:44:54 -07:00 |
| Eric Ciarla | 2c5f5c0ea2 | Merge branch 'main' into feat/maxDepthRelative | 2024-06-14 11:49:12 -04:00 |
| Eric Ciarla | ab9de0f5ab | Update maxDepth tests | 2024-06-13 18:46:30 -04:00 |
| rafaelsideguide | bb859ae9a7 | Added metadata.pageStatusCode and metadata.pageError properties to the responses | 2024-06-13 17:08:40 -03:00 |
| rafaelsideguide | ee282c3d55 | Added allowBackwardCrawling option | 2024-06-11 15:24:39 -03:00 |
| Nicolas | f6b06ac27a | Nick: ignoreSitemap, better crawling algo | 2024-06-10 18:12:41 -07:00 |
| Nicolas | 3091f0134c | Nick: | 2024-06-10 16:27:10 -07:00 |
| rafaelsideguide | f4a3469b9e | Merge branch 'main' into bug/crawl-limit | 2024-05-22 14:27:28 -03:00 |
| Nicolas | 0d187f0425 | Merge pull request #77 from tractorjuice/patch-1<br>Add additional file extensions to crawler.ts | 2024-05-22 10:16:49 -07:00 |
| Nicolas | 9e61d431f0 | Nick: hyper dx integration init | 2024-05-20 13:36:34 -07:00 |
| Nicolas | 9d635cb2a3 | Nick: docx support | 2024-05-16 11:48:02 -07:00 |
| Nicolas | 24be4866c5 | Nick: | 2024-05-15 17:16:20 -07:00 |
| Nicolas | ade4e05cff | Nick: working | 2024-05-15 17:13:04 -07:00 |
| Nicolas | bfccaf670d | Nick: fixes most of it | 2024-05-15 15:30:37 -07:00 |
| rafaelsideguide | fa014defc7 | Fixing child links only bug | 2024-05-15 18:35:09 -03:00 |
| Nicolas | a0fdc6f7c6 | Nick: | 2024-05-14 12:12:40 -07:00 |
| Nicolas | 7f31959be7 | Nick: | 2024-05-14 12:04:36 -07:00 |
| Nicolas | 8a72cf556b | Nick: | 2024-05-13 21:10:58 -07:00 |
| Nicolas | a96fc5b96d | Nick: 4x speed | 2024-05-13 20:45:11 -07:00 |
| rafaelsideguide | bc6b929b43 | [Bug] Fixing /crawl limit | 2024-05-10 12:15:54 -03:00 |
| rafaelsideguide | 83f3408634 | Added max depth option | 2024-05-07 11:06:26 -03:00 |