rafaelsideguide
e37d151404
added parsePDF option to pageOptions
...
Users can decide whether to let us handle the PDF parsing or to parse the PDF themselves.
2024-06-12 15:06:47 -03:00
rafaelsideguide
dc6acbf1f0
Merge remote-tracking branch 'origin/main' into feat/allowbackwardcrawling-option
2024-06-12 11:01:05 -03:00
Nicolas
520739c9f4
Nick: fixed bugs associated with absolute path replacements
2024-06-11 12:43:16 -07:00
rafaelsideguide
ee282c3d55
Added allowBackwardCrawling option
2024-06-11 15:24:39 -03:00
Nicolas
f6b06ac27a
Nick: ignoreSitemap, better crawling algo
2024-06-10 18:12:41 -07:00
Nicolas
3091f0134c
Nick:
2024-06-10 16:27:10 -07:00
Nicolas
b4c6819a54
Nick:
2024-06-05 11:11:09 -07:00
Rafael Miller
02fe470e20
Merge pull request #148 from mendableai/nsc/improvemnts-fixes-misc
...
Better fallbacks for initial crawl start
2024-06-04 14:31:10 -03:00
rafaelsideguide
6920ec8a61
bugfixing. already on main
2024-06-04 11:05:50 -03:00
Nicolas
918059ee9e
Merge branch 'main' into nsc/improvemnts-fixes-misc
2024-06-03 16:46:02 -07:00
Nicolas
df6c3d1e7d
Merge branch 'main' into detect-pdfs
2024-05-17 09:55:51 -07:00
Nicolas
9d635cb2a3
Nick: docx support
2024-05-16 11:48:02 -07:00
Nicolas
098db17913
Update index.ts
2024-05-15 17:37:09 -07:00
Nicolas
6ca368327f
Merge branch 'main' into test/crawl-options
2024-05-15 17:18:25 -07:00
Nicolas
ade4e05cff
Nick: working
2024-05-15 17:13:04 -07:00
Nicolas
bfccaf670d
Nick: fixes most of it
2024-05-15 15:30:37 -07:00
rafaelsideguide
d91043376c
not working yet
2024-05-15 18:54:40 -03:00
rafaelsideguide
fa014defc7
Fixing child links only bug
2024-05-15 18:35:09 -03:00
Nicolas
2ba743fb1a
Merge pull request #27 from eltociear/patch-1
...
refactor: fix typo in WebScraper/index.ts
2024-05-15 13:28:38 -07:00
Nicolas
1b0d6341d3
Update index.ts
2024-05-15 11:48:12 -07:00
Nicolas
d10f81e7fe
Nick: fixes
2024-05-15 11:28:20 -07:00
Nicolas
87570bdfa1
Update index.ts
2024-05-15 11:06:03 -07:00
Ikko Eltociear Ashimine
e91c122c69
Merge branch 'main' into patch-1
2024-05-15 12:14:52 +09:00
Nicolas
a0fdc6f7c6
Nick:
2024-05-14 12:12:40 -07:00
Nicolas
7f31959be7
Nick:
2024-05-14 12:04:36 -07:00
Nicolas
8a72cf556b
Nick:
2024-05-13 21:10:58 -07:00
Nicolas
26a092f780
Update index.ts
2024-05-13 21:04:49 -07:00
Nicolas
8101cbee37
Update index.ts
2024-05-13 21:02:47 -07:00
Nicolas
86b8439844
Nick:
2024-05-13 20:51:42 -07:00
Nicolas
a96fc5b96d
Nick: 4x speed
2024-05-13 20:45:11 -07:00
rafaelsideguide
8eb2e95f19
Cleaned up
2024-05-13 16:13:10 -03:00
Nicolas
2ce045912f
Nick: disable vision right now
2024-05-13 10:56:08 -07:00
rafaelsideguide
f4348024c6
Added check during scraping to deal with pdfs
...
Checks whether the URL is a PDF during the scraping process (single_url.ts).
TODO: Run integration tests. Does this strategy affect the running time?
P.S. Some comments need to be removed if we decide to proceed with this strategy.
2024-05-13 09:13:42 -03:00
Rafael Miller
5a2712fa5a
Merge branch 'main' into detect-pdfs
2024-05-10 15:53:13 -03:00
Nicolas
dcedb8d798
Merge branch 'main' into feat/max-depth
2024-05-07 10:20:49 -07:00
Nicolas
6505bf6bf2
Merge branch 'main' into feat/max-depth
2024-05-07 10:20:44 -07:00
Nicolas
bdbee963f7
Merge branch 'main' into nsc/cancel-job
2024-05-07 10:13:43 -07:00
rafaelsideguide
61d615c04b
Added tests
2024-05-07 14:03:00 -03:00
rafaelsideguide
e1f52c538f
nested includeHtml inside pageOptions
2024-05-07 13:40:24 -03:00
Nicolas
f46bf19fa5
Nick:
2024-05-07 09:26:52 -07:00
rafaelsideguide
83f3408634
Added max depth option
2024-05-07 11:06:26 -03:00
Nicolas
6d5da358cc
Nick: cancel job
2024-05-06 17:16:43 -07:00
rafaelsideguide
509250c4ef
changed to includeHtml
2024-05-06 19:45:56 -03:00
rafaelsideguide
538355f1af
Added toMarkdown option
2024-05-06 11:36:44 -03:00
Nicolas
15b774e974
Update index.ts
2024-05-04 12:44:30 -07:00
Nicolas
2aa09a3000
Nick: partial docs working, cleaner
2024-05-04 12:30:12 -07:00
Nicolas
00373228fa
Update index.ts
2024-05-04 11:53:16 -07:00
Nicolas
cbd9e88b77
Merge branch 'main' into llm-extraction
2024-04-30 14:49:20 -07:00
Nicolas
4f526cff92
Nick: cleanup
2024-04-30 12:19:43 -07:00
Caleb Peffer
3ca9e5153f
Caleb: trying to get logging working
2024-04-30 09:20:15 -07:00