91 Commits

Author SHA1 Message Date
Eric Ciarla
a6b7197737 Fix for maxDepth 2024-06-14 19:40:37 -04:00
Eric Ciarla
2c5f5c0ea2 Merge branch 'main' into feat/maxDepthRelative 2024-06-14 11:49:12 -04:00
Rafael Miller
f9c7ca9388 Merge branch 'main' into feat/issue-266 2024-06-14 11:47:58 -03:00
Rafael Miller
3e2e76311c Merge branch 'main' into feat/issue-205 2024-06-14 11:25:20 -03:00
Eric Ciarla
59451754f5 Add tests 2024-06-14 10:14:07 -04:00
Eric Ciarla
71c98d8b80 Update logic 2024-06-13 18:00:52 -04:00
Eric Ciarla
095951aa4d Update test 2024-06-13 17:40:00 -04:00
Eric Ciarla
5e8aa92788 Update index.ts 2024-06-13 17:33:13 -04:00
Eric Ciarla
65d63bae45 Update index.ts 2024-06-13 17:17:44 -04:00
Eric Ciarla
32e814bedc Update index.ts 2024-06-13 17:02:30 -04:00
rafaelsideguide
bb859ae9a7 Added metadata.pageStatusCode and metadata.pageError properties to the responses 2024-06-13 17:08:40 -03:00
rafaelsideguide
676d6e8ab5 Added pageOptions.removeTags 2024-06-13 10:51:05 -03:00
rafaelsideguide
e37d151404 added parsePDF option to pageOptions
user can decide if they are going to let us take care of the parse or they are going to parse the pdf by themselves
2024-06-12 15:06:47 -03:00
rafaelsideguide
dc6acbf1f0 Merge remote-tracking branch 'origin/main' into feat/allowbackwardcrawling-option 2024-06-12 11:01:05 -03:00
Nicolas
520739c9f4 Nick: fixed bugs associated with absolute path replacements 2024-06-11 12:43:16 -07:00
rafaelsideguide
ee282c3d55 Added allowBackwardCrawling option 2024-06-11 15:24:39 -03:00
Nicolas
f6b06ac27a Nick: ignoreSitemap, better crawling algo 2024-06-10 18:12:41 -07:00
Nicolas
3091f0134c Nick: 2024-06-10 16:27:10 -07:00
Nicolas
b4c6819a54 Nick: 2024-06-05 11:11:09 -07:00
Rafael Miller
02fe470e20 Merge pull request #148 from mendableai/nsc/improvemnts-fixes-misc
Better fallbacks for initial crawl start
2024-06-04 14:31:10 -03:00
rafaelsideguide
6920ec8a61 bugfixing. already on main 2024-06-04 11:05:50 -03:00
Nicolas
918059ee9e Merge branch 'main' into nsc/improvemnts-fixes-misc 2024-06-03 16:46:02 -07:00
Nicolas
df6c3d1e7d Merge branch 'main' into detect-pdfs 2024-05-17 09:55:51 -07:00
Nicolas
9d635cb2a3 Nick: docx support 2024-05-16 11:48:02 -07:00
Nicolas
098db17913 Update index.ts 2024-05-15 17:37:09 -07:00
Nicolas
6ca368327f Merge branch 'main' into test/crawl-options 2024-05-15 17:18:25 -07:00
Nicolas
ade4e05cff Nick: working 2024-05-15 17:13:04 -07:00
Nicolas
bfccaf670d Nick: fixes most of it 2024-05-15 15:30:37 -07:00
rafaelsideguide
d91043376c not working yet 2024-05-15 18:54:40 -03:00
rafaelsideguide
fa014defc7 Fixing child links only bug 2024-05-15 18:35:09 -03:00
Nicolas
2ba743fb1a Merge pull request #27 from eltociear/patch-1
refactor: fix typo in WebScraper/index.ts
2024-05-15 13:28:38 -07:00
Nicolas
1b0d6341d3 Update index.ts 2024-05-15 11:48:12 -07:00
Nicolas
d10f81e7fe Nick: fixes 2024-05-15 11:28:20 -07:00
Nicolas
87570bdfa1 Update index.ts 2024-05-15 11:06:03 -07:00
Ikko Eltociear Ashimine
e91c122c69 Merge branch 'main' into patch-1 2024-05-15 12:14:52 +09:00
Nicolas
a0fdc6f7c6 Nick: 2024-05-14 12:12:40 -07:00
Nicolas
7f31959be7 Nick: 2024-05-14 12:04:36 -07:00
Nicolas
8a72cf556b Nick: 2024-05-13 21:10:58 -07:00
Nicolas
26a092f780 Update index.ts 2024-05-13 21:04:49 -07:00
Nicolas
8101cbee37 Update index.ts 2024-05-13 21:02:47 -07:00
Nicolas
86b8439844 Nick: 2024-05-13 20:51:42 -07:00
Nicolas
a96fc5b96d Nick: 4x speed 2024-05-13 20:45:11 -07:00
rafaelsideguide
8eb2e95f19 Cleaned up 2024-05-13 16:13:10 -03:00
Nicolas
2ce045912f Nick: disable vision right now 2024-05-13 10:56:08 -07:00
rafaelsideguide
f4348024c6 Added check during scraping to deal with pdfs
Checks if the URL is a PDF during the scraping process (single_url.ts).

TODO: Run integration tests - Does this strat affect the running time?

ps. Some comments need to be removed if we decide to proceed with this strategy.
2024-05-13 09:13:42 -03:00
Rafael Miller
5a2712fa5a Merge branch 'main' into detect-pdfs 2024-05-10 15:53:13 -03:00
Nicolas
dcedb8d798 Merge branch 'main' into feat/max-depth 2024-05-07 10:20:49 -07:00
Nicolas
6505bf6bf2 Merge branch 'main' into feat/max-depth 2024-05-07 10:20:44 -07:00
Nicolas
bdbee963f7 Merge branch 'main' into nsc/cancel-job 2024-05-07 10:13:43 -07:00
rafaelsideguide
61d615c04b Added tests 2024-05-07 14:03:00 -03:00
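
Note on commits f4348024c6 and e37d151404: the log says only that the scraper checks whether a URL is a PDF during scraping, and that a parsePDF option in pageOptions lets the caller opt out of parsing. It does not show how either is implemented. The sketch below is one possible reading, not the repository's code. The names isPdfUrl and shouldParseAsPdf are hypothetical, the extension-based check is an assumption, and so is the default of true for parsePDF.

```typescript
// Subset of the page options; only the parsePDF name comes from the log.
interface PageOptions {
  parsePDF?: boolean;
}

// Hypothetical helper. Assumption: a URL counts as a PDF when its path
// ends in ".pdf" (case-insensitive). Query strings and fragments are
// ignored because only the pathname is inspected.
function isPdfUrl(rawUrl: string): boolean {
  try {
    const { pathname } = new URL(rawUrl);
    return pathname.toLowerCase().endsWith(".pdf");
  } catch {
    // Strings that are not absolute URLs are treated as non-PDF.
    return false;
  }
}

// Hypothetical helper. Assumption: parsing is on unless the caller
// explicitly sets parsePDF to false.
function shouldParseAsPdf(rawUrl: string, options: PageOptions = {}): boolean {
  const parsePDF = options.parsePDF ?? true;
  return parsePDF && isPdfUrl(rawUrl);
}
```

An extension check is cheap but misses PDFs served from extensionless URLs. Inspecting the Content-Type response header would catch those at the cost of a request, which may be what the commit's TODO about running time refers to.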