rafaelsideguide
49e3e64787
bugfix for pdfs and logging pdf events, also added try/catch blocks for docx
2024-07-29 14:13:46 -03:00
Nicolas
ff4266f09e
Update pdfProcessor.ts
2024-07-26 17:21:09 -04:00
rafaelsideguide
6208ecdbc0
added logger
2024-07-23 17:30:46 -03:00
Nicolas
56d42d9c9b
Nick:
2024-06-24 16:33:07 -03:00
rafaelsideguide
21d29de819
testing crawl with new.abb.com case
many unnecessary console.logs for tracing the code execution
2024-06-24 16:25:07 -03:00
Rafael Miller
f9c7ca9388
Merge branch 'main' into feat/issue-266
2024-06-14 11:47:58 -03:00
rafaelsideguide
bb859ae9a7
Added metadata.pageStatusCode and metadata.pageError properties to the responses
2024-06-13 17:08:40 -03:00
rafaelsideguide
e37d151404
added parsePDF option to pageOptions
The user can decide whether to let us handle the parsing or to parse the PDF themselves.
2024-06-12 15:06:47 -03:00
Nicolas
cbf8d79cce
Update pdfProcessor.ts
2024-06-04 00:13:37 -07:00
Nicolas
5be208f595
Nick: fixed
2024-05-17 10:40:44 -07:00
rafaelsideguide
8eb2e95f19
Cleaned up
2024-05-13 16:13:10 -03:00
rafaelsideguide
f4348024c6
Added check during scraping to deal with pdfs
Checks if the URL is a PDF during the scraping process (single_url.ts).
TODO: Run integration tests. Does this strategy affect the running time?
P.S. Some comments need to be removed if we decide to proceed with this strategy.
2024-05-13 09:13:42 -03:00
rafaelsideguide
f8b207793f
changed the request to a HEAD request to check for a PDF instead
2024-04-29 15:15:32 -03:00
Nicolas
c5cb268b61
Update pdfProcessor.ts
2024-04-19 13:13:42 -07:00
Nicolas
43cfcec326
Nick: disabling in crawl and sitemap for now
2024-04-19 13:12:08 -07:00
Nicolas
140529c609
Nick: fixes pdfs not found
2024-04-19 13:05:21 -07:00
rafaelsideguide
57e5b36014
[Feat] Adding pdf parser
2024-04-18 11:43:57 -03:00