474 Commits

Author SHA1 Message Date
Nicolas
5e8ffcf505 Update website_params.ts 2024-07-29 20:43:47 -04:00
Nicolas
7b813883ef Nick: first layer 2024-07-29 20:31:51 -04:00
Nicolas
968a2dc753 Nick: 2024-07-29 18:37:09 -04:00
rafaelsideguide
49e3e64787 bugfix for pdfs and logging pdf events, also added trycatchs for docx 2024-07-29 14:13:46 -03:00
Nicolas
4c9d62f6d3 Nick: fixing sitemap fallback 2024-07-26 18:25:44 -04:00
Nicolas
cb97871ff9 Merge branch 'main' of https://github.com/mendableai/firecrawl 2024-07-26 17:21:11 -04:00
Nicolas
ff4266f09e Update pdfProcessor.ts 2024-07-26 17:21:09 -04:00
rafaelsideguide
96cec2a673 fix checking scrape log success content length 2024-07-26 12:00:52 -03:00
Nicolas
f82ca3be17 Nick: 2024-07-25 19:53:29 -04:00
Nicolas
01fab6e036 Update single_url.ts 2024-07-25 17:51:41 -04:00
Nicolas
56042d090c Update single_url.ts 2024-07-25 17:48:44 -04:00
Nicolas
3242872503 Update single_url.ts 2024-07-25 17:43:55 -04:00
Nicolas
e5b797549e Merge branch 'main' into feat/scrape-monitoring 2024-07-25 16:21:02 -04:00
rafaelsideguide
e720e1bacf Merge remote-tracking branch 'origin/main' into feat/logger 2024-07-25 09:49:27 -03:00
rafaelsideguide
309728a482 updated logs 2024-07-25 09:48:06 -03:00
Nicolas
2c1221750b Merge pull request #449 from mendableai/bugfix/malformed-url-sitemap (Added regex for links in sitemap) 2024-07-24 20:37:35 -04:00
Nicolas
3a1b8a9797 Update website_params.ts 2024-07-24 11:04:47 -04:00
Nicolas
8b48ec8d30 Update website_params.ts 2024-07-24 11:02:20 -04:00
Gergo Moricz
4d35ad073c feat(monitoring/scrape): include url, worker, response_size 2024-07-24 16:43:39 +02:00
Gergo Moricz
64bcedeefc fix(monitoring): bad success check on scrape 2024-07-24 16:21:59 +02:00
Gergo Moricz
7cd9bf92e3 feat: scrape event logging to DB 2024-07-24 14:31:25 +02:00
Rafael Miller
5e728c1a4d Update apps/api/src/scraper/WebScraper/crawler.ts (no need for regex; Co-authored-by: Gergő Móricz <mo.geryy@gmail.com>) 2024-07-24 08:33:00 -03:00
rafaelsideguide
6208ecdbc0 added logger 2024-07-23 17:30:46 -03:00
Nicolas
f0b07b509b Update index.ts 2024-07-23 15:15:56 -04:00
rafaelsideguide
a684bd3c5d added regex for links in sitemap 2024-07-23 09:07:23 -03:00
Nicolas
8916fec66c Update index.ts 2024-07-22 19:14:53 -04:00
Nicolas
e31a5007d5 Nick: speed improvements 2024-07-22 18:30:58 -04:00
rafaelsideguide
5c02dbe20c fix(isFile): added .tiff extension 2024-07-18 17:07:21 -03:00
Gergo Moricz
f0e95ce399 fix(WebCrawler): filter out file URLs when taking URLs from sitemap 2024-07-18 21:49:37 +02:00
Nicolas
5f14f4f788 Update blocklist.ts 2024-07-18 14:20:19 -04:00
Nicolas
f10f3f886b Merge pull request #410 from mendableai/feat/fire-engine-chrome-cdp (Support chrome-cdp and restructure sitemap fire-engine support.) 2024-07-18 13:52:08 -04:00
Nicolas
d2de01d342 Nick: fixes 2024-07-18 13:19:44 -04:00
Gergo Moricz
0b8047c7a0 fix(WebScraper): infinite regex leading to fly.io instance hangs 2024-07-18 19:13:43 +02:00
Nicolas
f11137352c Merge branch 'main' into feat/fire-engine-chrome-cdp 2024-07-18 12:48:42 -04:00
Caleb Peffer
8d5ebc9b9f Merge pull request #423 from mendableai/cjp/linksOnPage (Caleb: Return a list of links on a page by default) 2024-07-17 12:36:07 -06:00
Caleb Peffer
5b24d26c84 Caleb; fixed test 2024-07-17 11:33:12 -07:00
Caleb Peffer
c5d1e7260d Caleb: made changes per Rafaels requests 2024-07-17 11:29:05 -07:00
Caleb Peffer
da3c6bca37 Caleb: added a simple test 2024-07-16 21:23:22 -07:00
Caleb Peffer
0b3c0ede49 Added tests per @nicks request 2024-07-16 21:15:59 -07:00
Caleb Peffer
98c788ca7a Caleb: added a test to ensure links on page exists and isn't zero on mendable 2024-07-16 21:13:52 -07:00
Caleb Peffer
d39d3be649 Caleb: now extracting and returning a list of all links on the page for a customer 2024-07-16 18:38:03 -07:00
rafaelsideguide
dba1fb2dc8 Update removeUnwantedElements.ts 2024-07-16 18:22:56 -03:00
Thomas Kosmas
5c65ec58e5 Support chrome-cdp and restructure sitemap fire-engine support. 2024-07-15 18:40:43 +03:00
Nicolas
949791049f Nick: 2024-07-12 23:20:26 -04:00
Nicolas
d0c8d3ecde Merge branch 'main' into nsc/sitemap-fix-fire-engine 2024-07-12 22:15:06 -04:00
Nicolas
a3b1703b68 Update fireEngine.ts 2024-07-12 22:15:00 -04:00
Nicolas
e098e88ea7 Nick: 2024-07-12 22:02:08 -04:00
Nicolas
5da03a8fbd Update fireEngine.ts 2024-07-12 14:59:49 -04:00
rafaelsideguide
9ad06fdf56 added fire-engine fallback for getting sitemaps 2024-07-09 16:07:53 -03:00
rafaelsideguide
c2bba54b4f Added veeva to special case params 2024-07-05 16:58:07 -03:00