53 Commits

Author SHA1 Message Date
Nicolas
7ae9778642 Update single_url.ts 2024-06-10 16:57:31 -07:00
Nicolas
913c1dd568 Nick: fetch -> axios and fix timeouts 2024-06-10 16:49:03 -07:00
rafaelsideguide
164676c70a bugfix screenshot for readme pages 2024-06-05 15:34:42 -03:00
rafaelsideguide
0d51b11dcd missing breaks 2024-06-05 15:02:28 -03:00
Rafael Miller
9e000ded03 Merge branch 'main' into feat/better-gdrive-pdf-fetch 2024-06-05 14:07:56 -03:00
rafaelsideguide
ccc55127d6 Added scroll xpaths on fire-engine for handling readme docs 2024-06-05 11:48:41 -03:00
rafaelsideguide
b5045d1661 [feat] improved the scrape for gdrive pdfs 2024-06-04 17:47:28 -03:00
Nicolas
674500affa Nick: 2024-06-04 12:15:39 -07:00
rafaelsideguide
5ae4d1caf5 Update single_url.ts 2024-06-04 15:28:09 -03:00
rafaelsideguide
64a4338ff0 Update single_url.ts 2024-06-04 14:40:05 -03:00
Rafael Miller
b80fb374e5 Merge branch 'main' into playwright-service-bug-222 2024-06-04 11:57:17 -03:00
Nicolas
2ea01f1456 Update single_url.ts 2024-06-03 23:42:39 -07:00
Nicolas
854d5b3cb3 Update single_url.ts 2024-06-03 23:32:55 -07:00
Nicolas
d30ced4394 Merge pull request #221 from mendableai/nsc/fwd-header-auth
feat: Ability to forward headers to reliable providers for auth etc...
2024-06-03 16:33:40 -07:00
rafaelsideguide
1fc3a15149 Update single_url.ts 2024-06-03 15:24:40 -03:00
Nicolas
fde522c3e1 Update single_url.ts 2024-06-02 20:23:45 -07:00
Matt Joyce
deefe65cbe Change the way the playwright response is parsed
Was failing with a Type Error, but actually looked ok.
This fixes the type error and stops the scraper fallback.
2024-06-01 19:16:56 +10:00
Nicolas
3b8059edb6 Update single_url.ts 2024-05-31 15:43:06 -07:00
Nicolas
6bea803120 Nick: 2024-05-31 15:39:54 -07:00
Nicolas
6c939d534d Nick: small refactor 2024-05-29 19:43:51 -07:00
Eric Ciarla
37915e11e8 Final push 2024-05-29 21:18:24 -04:00
Eric Ciarla
a0e404f94e init commit 2024-05-29 18:56:57 -04:00
rafaelsideguide
ee9a2184e2 Added custom scraping conditions for readme docs 2024-05-29 13:39:43 -03:00
Nicolas
1b3547dcf2 Nick: 2024-05-28 12:56:24 -07:00
Nicolas
a8ff295977 Update single_url.ts 2024-05-21 18:50:42 -07:00
Nicolas
a5e718b084 Nick: improvements 2024-05-21 18:34:23 -07:00
Nicolas
df6c3d1e7d Merge branch 'main' into detect-pdfs 2024-05-17 09:55:51 -07:00
Nicolas
d10f81e7fe Nick: fixes 2024-05-15 11:28:20 -07:00
Nicolas
a96fc5b96d Nick: 4x speed 2024-05-13 20:45:11 -07:00
rafaelsideguide
8eb2e95f19 Cleaned up 2024-05-13 16:13:10 -03:00
rafaelsideguide
f4348024c6 Added check during scraping to deal with pdfs
Checks if the URL is a PDF during the scraping process (single_url.ts).

TODO: Run integration tests - Does this strat affect the running time?

ps. Some comments need to be removed if we decide to proceed with this strategy.
2024-05-13 09:13:42 -03:00
Nicolas
d21091bb06 Update single_url.ts 2024-05-09 17:52:46 -07:00
Nicolas
be85008622 Nick: better 2024-05-09 17:48:11 -07:00
Nicolas
be5661a768 Nick: a lot better 2024-05-09 17:45:16 -07:00
rafaelsideguide
e1f52c538f nested includeHtml inside pageOptions 2024-05-07 13:40:24 -03:00
rafaelsideguide
509250c4ef changed to includeHtml 2024-05-06 19:45:56 -03:00
rafaelsideguide
538355f1af Added toMarkdown option 2024-05-06 11:36:44 -03:00
Nicolas
768166b066 Update single_url.ts 2024-04-30 16:57:44 -07:00
Caleb Peffer
3ca9e5153f Caleb: trying to get logging working 2024-04-30 09:20:15 -07:00
Nicolas
b69feab916 Merge branch 'main' into llm-extraction 2024-04-29 08:40:44 -07:00
Caleb Peffer
6ee1f2d3bc Caleb: initially pulled inspiration code from https://github.com/mishushakov/llm-scraper 2024-04-28 13:59:35 -07:00
Nicolas
68838c9e0d Update single_url.ts 2024-04-28 12:44:00 -07:00
Nicolas
8e44696c4d Nick: 2024-04-28 11:34:25 -07:00
Nicolas
fdb2789eaa Nick: added url as return param 2024-04-23 17:14:34 -07:00
Nicolas
f0695c7123 Update single_url.ts 2024-04-23 17:04:10 -07:00
Nicolas
0146157876 Nick: mvp 2024-04-23 15:28:32 -07:00
Nicolas
306cfe4ce1 Nick: 2024-04-23 11:15:11 -07:00
Nicolas
ca2bf9cc12 Update single_url.ts 2024-04-17 18:27:08 -07:00
Nicolas
36abe0f7f9 Nick: 2024-04-17 18:24:46 -07:00
Nicolas
08ed68ff55 Nick: fixes 2024-04-17 12:44:23 -07:00