10 Commits

Author SHA1 Message Date
Habeeb Shopeju
2ca843782c
Connector for Biomedical Literature (#345)
The implementation involves the introduction of SimpleBiomedConfig, BiomedIngestDoc and BiomedConnector which ingests documents from the PDF Download.
2023-03-11 01:09:54 +00:00
Alvaro Bartolome
5291a96616
Add AzureBlobStorageConnector (#353)
* Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting
from `FsspecConnector`
* Start deprecation life cycle for `unstructured-ingest --s3-url` option, to be deprecated in
  favor of `--remote-url`.

---------

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
2023-03-10 15:43:40 -08:00
Tom Aarsen
1580c1bf8e
feat: Add GitLab ingest connector (#349)
Add GitLab data connector for ingest.

Involves more general Git functionality that is shared between the GitHub and GitLab data connectors.

Prevent code duplication for functionality between GitHub and GitLab ingest connectors.

Renamed github-access-token, github-branch and github-file-glob to git-access-token, git-branch and git-file-glob, respectively.

These work for GitHub and GitLab.
2023-03-08 00:15:21 -08:00
Habeeb Shopeju
4117f57e14
Connector for Google Drive (#294)
Implements issue #244
2023-03-07 06:01:02 +00:00
Tom Aarsen
54a6db1c2c
feat: Add Wikipedia ingest connector (#299)
The connector can process a Wikipedia page
and output the HTML,
the plain text contents,
and the summary.
No API key required
Also add test case verifying that 3 files are indeed created (one for HTML, one for text, one for the summary).
2023-02-28 08:25:11 +00:00
Tom Aarsen
ded60afda9
feat: Add GitHub data connector; add Markdown partitioner (#284) 2023-02-27 14:36:44 -08:00
Tom Aarsen
486c7987fc
feat: Add Reddit ingest connector (#293)
Add Reddit data connector for ingest.
* The connector can process a subreddit.
* Either via a search query,
*  or via hot posts.
* The texts in the submissions are converted to markdown files including the post title and the text body, if any (i.e. no images or videos).
* The number of posts to fetch can be changed with the CLI.
2023-02-27 00:11:04 -08:00
cragwolfe
87fd0d01dc
feat: Ingest refactors, doc updates (#243)
- Creates ABC's for ingest connectors
- Updates the s3_connector classes to inherit from ABC's
- Moves s3 test script to it's own file to establish pattern for additional connectors
- Rewrites the Ingest.md doc, including instructions how how to add a connector
- Updates the example s3 ingest script to use the new location for main.py

Note that there were no logic changes, this is essentially a refactoring PR.

Test instructions:

Run ./test_unstructured_ingest/test-ingest.sh and ./examples/ingest/s3-small-batch/ingest.sh.
2023-02-21 10:15:33 -08:00
cragwolfe
3c1b089071
feat: Ingest CLI flags and test fixture updates (#227)
* Many command line options added. The sample ingest project is now an easy to use CLI (no code editing
   necessary), capable of processing large numbers of files from S3 in a re-entrant manner. See Ingest.md.
* Fixes issue where text fixtures had been truncated
  * Adds a check to make sure this doesn't happen again
* Moves fixture outputs for the existing connector one subdir lower, 
  to make room for future connector outputs.
2023-02-16 16:45:50 +00:00
cragwolfe
ab542ca3c6
feat: Sample ingest project with S3 connector (#218) 2023-02-14 12:27:45 -08:00