rafaelsideguide
ad7795f973
Merge remote-tracking branch 'origin/main' into test/load-testing
2024-06-14 15:14:01 -03:00
Rafael Miller
f9c7ca9388
Merge branch 'main' into feat/issue-266
2024-06-14 11:47:58 -03:00
Rafael Miller
3e2e76311c
Merge branch 'main' into feat/issue-205
2024-06-14 11:25:20 -03:00
rafaelsideguide
5dd18ca79b
fixed edge cases
2024-06-14 09:46:55 -03:00
rafaelsideguide
bb859ae9a7
Added metadata.pageStatusCode and metadata.pageError properties to the responses
2024-06-13 17:08:40 -03:00
rafaelsideguide
676d6e8ab5
Added pageOptions.removeTags
2024-06-13 10:51:05 -03:00
rafaelsideguide
e37d151404
added parsePDF option to pageOptions
The user can decide whether to let us parse the PDF or to parse it themselves.
2024-06-12 15:06:47 -03:00
Nicolas
7ae9778642
Update single_url.ts
2024-06-10 16:57:31 -07:00
Nicolas
913c1dd568
Nick: fetch -> axios and fix timeouts
2024-06-10 16:49:03 -07:00
rafaelsideguide
164676c70a
bugfix screenshot for readme pages
2024-06-05 15:34:42 -03:00
rafaelsideguide
0d51b11dcd
missing breaks
2024-06-05 15:02:28 -03:00
Rafael Miller
9e000ded03
Merge branch 'main' into feat/better-gdrive-pdf-fetch
2024-06-05 14:07:56 -03:00
rafaelsideguide
ccc55127d6
Added scroll xpaths on fire-engine for handling readme docs
2024-06-05 11:48:41 -03:00
rafaelsideguide
b5045d1661
[feat] improved the scrape for gdrive pdfs
2024-06-04 17:47:28 -03:00
Nicolas
674500affa
Nick:
2024-06-04 12:15:39 -07:00
rafaelsideguide
5ae4d1caf5
Update single_url.ts
2024-06-04 15:28:09 -03:00
rafaelsideguide
64a4338ff0
Update single_url.ts
2024-06-04 14:40:05 -03:00
Rafael Miller
b80fb374e5
Merge branch 'main' into playwright-service-bug-222
2024-06-04 11:57:17 -03:00
Nicolas
2ea01f1456
Update single_url.ts
2024-06-03 23:42:39 -07:00
Nicolas
854d5b3cb3
Update single_url.ts
2024-06-03 23:32:55 -07:00
Nicolas
d30ced4394
Merge pull request #221 from mendableai/nsc/fwd-header-auth
feat: Ability to forward headers to reliable providers for auth etc...
2024-06-03 16:33:40 -07:00
rafaelsideguide
1fc3a15149
Update single_url.ts
2024-06-03 15:24:40 -03:00
Nicolas
fde522c3e1
Update single_url.ts
2024-06-02 20:23:45 -07:00
Matt Joyce
deefe65cbe
Change the way the playwright response is parsed
The parse was failing with a TypeError even though the response looked OK.
This fixes the type error and stops the scraper fallback.
2024-06-01 19:16:56 +10:00
Nicolas
3b8059edb6
Update single_url.ts
2024-05-31 15:43:06 -07:00
Nicolas
6bea803120
Nick:
2024-05-31 15:39:54 -07:00
Nicolas
6c939d534d
Nick: small refactor
2024-05-29 19:43:51 -07:00
Eric Ciarla
37915e11e8
Final push
2024-05-29 21:18:24 -04:00
Eric Ciarla
a0e404f94e
init commit
2024-05-29 18:56:57 -04:00
rafaelsideguide
ee9a2184e2
Added custom scraping conditions for readme docs
2024-05-29 13:39:43 -03:00
Nicolas
1b3547dcf2
Nick:
2024-05-28 12:56:24 -07:00
rafaelsideguide
aa6df4305e
crawl load tests 6 and 7
2024-05-22 18:20:24 -03:00
Nicolas
a8ff295977
Update single_url.ts
2024-05-21 18:50:42 -07:00
Nicolas
a5e718b084
Nick: improvements
2024-05-21 18:34:23 -07:00
Nicolas
df6c3d1e7d
Merge branch 'main' into detect-pdfs
2024-05-17 09:55:51 -07:00
Nicolas
d10f81e7fe
Nick: fixes
2024-05-15 11:28:20 -07:00
Nicolas
a96fc5b96d
Nick: 4x speed
2024-05-13 20:45:11 -07:00
rafaelsideguide
8eb2e95f19
Cleaned up
2024-05-13 16:13:10 -03:00
rafaelsideguide
f4348024c6
Added check during scraping to deal with pdfs
Checks if the URL is a PDF during the scraping process (single_url.ts).
TODO: Run integration tests. Does this strategy affect the running time?
P.S. Some comments need to be removed if we decide to proceed with this strategy.
2024-05-13 09:13:42 -03:00
Nicolas
d21091bb06
Update single_url.ts
2024-05-09 17:52:46 -07:00
Nicolas
be85008622
Nick: better
2024-05-09 17:48:11 -07:00
Nicolas
be5661a768
Nick: a lot better
2024-05-09 17:45:16 -07:00
rafaelsideguide
e1f52c538f
nested includeHtml inside pageOptions
2024-05-07 13:40:24 -03:00
rafaelsideguide
509250c4ef
changed to includeHtml
2024-05-06 19:45:56 -03:00
rafaelsideguide
538355f1af
Added toMarkdown option
2024-05-06 11:36:44 -03:00
Nicolas
768166b066
Update single_url.ts
2024-04-30 16:57:44 -07:00
Caleb Peffer
3ca9e5153f
Caleb: trying to get logging working
2024-04-30 09:20:15 -07:00
Nicolas
b69feab916
Merge branch 'main' into llm-extraction
2024-04-29 08:40:44 -07:00
Caleb Peffer
6ee1f2d3bc
Caleb: initially pulled inspiration code from https://github.com/mishushakov/llm-scraper
2024-04-28 13:59:35 -07:00
Nicolas
68838c9e0d
Update single_url.ts
2024-04-28 12:44:00 -07:00