339 Commits

Author SHA1 Message Date
Gergo Moricz
7bb922071c fix(queue-worker): manually renew lock (testing) 2024-08-07 14:35:20 +02:00
Nicolas
3321ca9398
Merge pull request #504 from mendableai/feat/fullpage-screenshot
[Feat] Added fullpagescreenshot capabilities
2024-08-06 13:52:29 -04:00
Gergo Moricz
b60ee30dba fix(single_url): accept 500 2024-08-06 18:00:56 +02:00
rafaelsideguide
4d24a99d50 fix params 2024-08-06 09:34:43 -03:00
rafaelsideguide
3edc3a3d15 added fullpagescreenshot capabilities, wip on fire-engine side 2024-08-05 18:17:37 -03:00
rafaelsideguide
f32e8de156 fixes the empty excludes.filter undefined bug 2024-08-05 18:13:31 -03:00
Nicolas
1742e4ceae Nick: 2024-08-02 19:25:15 -04:00
Nicolas
b448e3c3ad Update website_params.ts 2024-08-02 14:26:35 -04:00
rafaelsideguide
4051630632 Update sitemap.ts 2024-08-02 11:32:48 -03:00
rafaelsideguide
8568b61015 bugfix for sitemaps 2024-08-02 11:03:01 -03:00
Nicolas
af68b7a785
Merge pull request #475 from mendableai/bugfix/issue-466
[Bug] pdfs and logging pdf events, also added trycatchs for docx
2024-08-01 22:05:26 -04:00
rafaelsideguide
f48ff36b32 added .inc files and forced lower case comparison 2024-07-31 09:28:43 -03:00
Nicolas
ad6f6eff4b Update fireEngine.ts 2024-07-30 19:15:54 -04:00
Nicolas
6d99dedd3c Nick: fixed tests 2024-07-30 19:11:01 -04:00
rafaelsideguide
d25d7e7244 special case: developer.apple.com 2024-07-30 10:13:09 -03:00
Nicolas
5e8ffcf505 Update website_params.ts 2024-07-29 20:43:47 -04:00
Nicolas
7b813883ef Nick: first layer 2024-07-29 20:31:51 -04:00
Nicolas
968a2dc753 Nick: 2024-07-29 18:37:09 -04:00
rafaelsideguide
49e3e64787 bugfix for pdfs and logging pdf events, also added trycatchs for docx 2024-07-29 14:13:46 -03:00
Nicolas
4c9d62f6d3 Nick: fixing sitemap fallback 2024-07-26 18:25:44 -04:00
Nicolas
cb97871ff9 Merge branch 'main' of https://github.com/mendableai/firecrawl 2024-07-26 17:21:11 -04:00
Nicolas
ff4266f09e Update pdfProcessor.ts 2024-07-26 17:21:09 -04:00
rafaelsideguide
96cec2a673 fix checking scrape log success content length 2024-07-26 12:00:52 -03:00
Nicolas
f82ca3be17 Nick: 2024-07-25 19:53:29 -04:00
Nicolas
01fab6e036 Update single_url.ts 2024-07-25 17:51:41 -04:00
Nicolas
56042d090c Update single_url.ts 2024-07-25 17:48:44 -04:00
Nicolas
3242872503 Update single_url.ts 2024-07-25 17:43:55 -04:00
Nicolas
e5b797549e Merge branch 'main' into feat/scrape-monitoring 2024-07-25 16:21:02 -04:00
rafaelsideguide
e720e1bacf Merge remote-tracking branch 'origin/main' into feat/logger 2024-07-25 09:49:27 -03:00
rafaelsideguide
309728a482 updated logs 2024-07-25 09:48:06 -03:00
Nicolas
2c1221750b
Merge pull request #449 from mendableai/bugfix/malformed-url-sitemap
Added regex for links in sitemap
2024-07-24 20:37:35 -04:00
Nicolas
3a1b8a9797 Update website_params.ts 2024-07-24 11:04:47 -04:00
Nicolas
8b48ec8d30 Update website_params.ts 2024-07-24 11:02:20 -04:00
Gergo Moricz
4d35ad073c feat(monitoring/scrape): include url, worker, response_size 2024-07-24 16:43:39 +02:00
Gergo Moricz
64bcedeefc fix(monitoring): bad success check on scrape 2024-07-24 16:21:59 +02:00
Gergo Moricz
7cd9bf92e3 feat: scrape event logging to DB 2024-07-24 14:31:25 +02:00
Rafael Miller
5e728c1a4d
Update apps/api/src/scraper/WebScraper/crawler.ts
no need for regex

Co-authored-by: Gergő Móricz <mo.geryy@gmail.com>
2024-07-24 08:33:00 -03:00
rafaelsideguide
6208ecdbc0 added logger 2024-07-23 17:30:46 -03:00
Nicolas
f0b07b509b Update index.ts 2024-07-23 15:15:56 -04:00
rafaelsideguide
a684bd3c5d added regex for links in sitemap 2024-07-23 09:07:23 -03:00
Nicolas
8916fec66c Update index.ts 2024-07-22 19:14:53 -04:00
Nicolas
e31a5007d5 Nick: speed improvements 2024-07-22 18:30:58 -04:00
rafaelsideguide
5c02dbe20c fix(isFile): added .tiff extension 2024-07-18 17:07:21 -03:00
Gergo Moricz
f0e95ce399 fix(WebCrawler): filter out file URLs when taking URLs from sitemap 2024-07-18 21:49:37 +02:00
Nicolas
5f14f4f788 Update blocklist.ts 2024-07-18 14:20:19 -04:00
Nicolas
f10f3f886b
Merge pull request #410 from mendableai/feat/fire-engine-chrome-cdp
Support chrome-cdp and restructure sitemap fire-engine support.
2024-07-18 13:52:08 -04:00
Nicolas
d2de01d342 Nick: fixes 2024-07-18 13:19:44 -04:00
Gergo Moricz
0b8047c7a0 fix(WebScraper): infinite regex leading to fly.io instance hangs 2024-07-18 19:13:43 +02:00
Nicolas
f11137352c Merge branch 'main' into feat/fire-engine-chrome-cdp 2024-07-18 12:48:42 -04:00
Caleb Peffer
8d5ebc9b9f
Merge pull request #423 from mendableai/cjp/linksOnPage
Caleb: Return a list of links on a page by default
2024-07-17 12:36:07 -06:00