17 Commits

Author SHA1 Message Date
rafaelsideguide
49e3e64787 bugfix for pdfs and logging pdf events, also added try/catch blocks for docx 2024-07-29 14:13:46 -03:00
Nicolas
ff4266f09e Update pdfProcessor.ts 2024-07-26 17:21:09 -04:00
rafaelsideguide
6208ecdbc0 added logger 2024-07-23 17:30:46 -03:00
Nicolas
56d42d9c9b Nick: 2024-06-24 16:33:07 -03:00
rafaelsideguide
21d29de819 testing crawl with new.abb.com case
many unnecessary console.logs for tracing the code execution
2024-06-24 16:25:07 -03:00
Rafael Miller
f9c7ca9388 Merge branch 'main' into feat/issue-266 2024-06-14 11:47:58 -03:00
rafaelsideguide
bb859ae9a7 Added metadata.pageStatusCode and metadata.pageError properties to the responses 2024-06-13 17:08:40 -03:00
rafaelsideguide
e37d151404 added parsePDF option to pageOptions
Users can decide whether we parse the PDF for them or they parse it themselves.
2024-06-12 15:06:47 -03:00
Nicolas
cbf8d79cce Update pdfProcessor.ts 2024-06-04 00:13:37 -07:00
Nicolas
5be208f595 Nick: fixed 2024-05-17 10:40:44 -07:00
rafaelsideguide
8eb2e95f19 Cleaned up 2024-05-13 16:13:10 -03:00
rafaelsideguide
f4348024c6 Added check during scraping to deal with pdfs
Checks if the URL is a PDF during the scraping process (single_url.ts).

TODO: Run integration tests - Does this strategy affect the running time?

ps. Some comments need to be removed if we decide to proceed with this strategy.
2024-05-13 09:13:42 -03:00
rafaelsideguide
f8b207793f changed the request to do a HEAD to check for a PDF instead 2024-04-29 15:15:32 -03:00
Nicolas
c5cb268b61 Update pdfProcessor.ts 2024-04-19 13:13:42 -07:00
Nicolas
43cfcec326 Nick: disabling in crawl and sitemap for now 2024-04-19 13:12:08 -07:00
Nicolas
140529c609 Nick: fixes pdfs not found 2024-04-19 13:05:21 -07:00
rafaelsideguide
57e5b36014 [Feat] Adding pdf parser 2024-04-18 11:43:57 -03:00
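
Commit f8b207793f above mentions switching to a HEAD request to check for a PDF. The following is a minimal TypeScript sketch of that idea, not code taken from the repository. The names isPdfContentType, looksLikePdf, and HeadFetcher are hypothetical, and the sketch assumes the server returns a Content-Type header on HEAD responses (some servers do not, in which case a fallback such as checking the file extension would be needed).

```typescript
// Hypothetical sketch: detect a PDF via a HEAD request instead of downloading the body.
// None of these names come from the repository.

// Minimal shape of a fetch-like function, injected so the check can be tested offline.
type HeadFetcher = (
  url: string,
  init: { method: "HEAD" }
) => Promise<{ headers: { get(name: string): string | null } }>;

// Returns true when a Content-Type header value denotes a PDF.
function isPdfContentType(contentType: string | null): boolean {
  if (!contentType) return false;
  // Drop parameters such as "; charset=binary" before comparing.
  const mediaType = contentType.split(";")[0].trim().toLowerCase();
  return mediaType === "application/pdf";
}

// Issues a HEAD request and inspects only the response headers.
// Any network error is treated as "not a PDF" so the caller can fall back
// to its normal scraping path.
async function looksLikePdf(url: string, fetcher: HeadFetcher): Promise<boolean> {
  try {
    const response = await fetcher(url, { method: "HEAD" });
    return isPdfContentType(response.headers.get("content-type"));
  } catch {
    return false;
  }
}
```

Keeping the header check in a separate pure function makes it testable without network access. The fetcher is passed in rather than hard-coded, so a caller can supply whichever HTTP client the project already uses.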