unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-15 10:05:19 +00:00

Author	SHA1	Message	Date
Sebastian Laverde Alfonso	508ce48d54	Feat: notebook for Elasticsearch integration (#681 ) * feat: nb elasticsearch unstructured sentiment * chore: refactor readme for elasticsearch nb * fix: update es-credentials.ini * chore: update es-credentials.ini * fix: typo in nb load-into-es.ipynb exist --> exists * fix: typo 2 in nb load-into-es.ipynb obtaing --> obtain	2023-06-05 19:05:08 +00:00
Matt Robinson	c35fff2972	feat: Add `stage_for_weaviate` and schema creation function (#672 ) * add weaviate docker compose * added staging brick and tests for weaviate * initial notebook and requirements file * add commentary to weaviate notebook * weaviate readme * update docs * version and change log * install weaviate client * install weaviate; skip for docker * linting, linting, linting * install weaviate client with deps * comments on weaviate client * fix module not found error for docker container * skipped wrong test in docker * fix typos * add in local-inference	2023-06-01 20:48:54 +00:00
Yuming Long	ab5f92dd79	Fix(ingest): Deprecate `--s3-url` in favor of `--remote-url` (#616 ) * deprecation s3-url * changelog and version * download dir not now	2023-05-19 12:11:40 -04:00
Mallori Harrell	34d563c1fc	feat: Create spacy notebook example (#593 ) * add new notebook for spacy	2023-05-17 15:42:15 -05:00
Trevor Bossert	830d67f653	Feat: Discord connector (#515 ) * Initial commit of discord connector based off of initial work by @tnachen with modifications https://github.com/tnachen/unstructured/tree/tnachen/discord_connector * Add test file change format of imports * working version of the connector More work to be done to tidy it up and add any additional options * add to test fixtures update * fix spacing * tests working, switching to bot testing channel * add additional channel add reprocess to tests * add try clause to allow for exit on error Update changelog and bump version * add updated expected output files * add logic to check if --discord-period is an integer Add more to option description * fix lint error * Update discord reqs * PR feedback * add newline * another newline --------- Co-authored-by: Justin Bossert <packerbacker21@hotmail.com>	2023-05-16 11:46:30 -07:00
Matt Robinson	e052c2a9b2	docs: example of how to use `unstructured` with `pgvector` (#571 ) * pgvector requirements * first pass on pgvector notebook and sql alchemy file * created code for loading vectors into db * added query for embedding distance * updates to pgvector notebook * update function with time decay * update pgvector notebook to use example code * remove old create table script * add readme for pgvector * update example to use get_date()	2023-05-12 13:54:38 -04:00
Matt Robinson	19beb24e03	docs: `unstructured` -> MySQL example (#557 ) * added requirements for mysql * first bit of mysql notebook * update requirements file * wrap with mysql example * update readme with install instructions	2023-05-09 13:26:49 +00:00
pravin-unstructured	4020da56ad	Went through this demo notebook with Matt. Decision was made to add it to our collection of examples for use later. (#484 )	2023-04-17 11:53:25 -04:00
Trevor Bossert	cff7f4fd5a	Slack connector (#462 ) This connector takes a slack channel id, token and other options to pull conversation history for a channel and store it as a text file that is then processed by unstructured into expected output.	2023-04-16 19:34:43 +00:00
natygyoon	7f6e094c1f	feat: add local file system connector for unstructured-ingest (#399 ) * added local connector to unstructured-ingest	2023-03-29 15:53:23 -07:00
cragwolfe	ce9fc26009	feat: add ability to pass headers in partition_html (#397 ) Also adds pytest-mock requirement, those fixtures are nice to have! Implements issue/feature #396 .	2023-03-23 20:14:57 -07:00
Habeeb Shopeju	2ca843782c	Connector for Biomedical Literature (#345 ) The implementation involves the introduction of SimpleBiomedConfig, BiomedIngestDoc and BiomedConnector which ingests documents from the PDF Download.	2023-03-11 01:09:54 +00:00
Alvaro Bartolome	5291a96616	Add `AzureBlobStorageConnector` (#353 ) * Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting from `FsspecConnector` * Start deprecation life cycle for `unstructured-ingest --s3-url` option, to be deprecated in favor of `--remote-url`. --------- Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>	2023-03-10 15:43:40 -08:00
Tom Aarsen	1580c1bf8e	feat: Add GitLab ingest connector (#349 ) Add GitLab data connector for ingest. Involves more general Git functionality that is shared between the GitHub and GitLab data connectors. Prevent code duplication for functionality between GitHub and GitLab ingest connectors. Renamed github-access-token, github-branch and github-file-glob to git-access-token, git-branch and git-file-glob, respectively. These work for GitHub and GitLab.	2023-03-08 00:15:21 -08:00
Habeeb Shopeju	4117f57e14	Connector for Google Drive (#294 ) Implements issue #244	2023-03-07 06:01:02 +00:00
Ikko Eltociear Ashimine	213077e2ab	docs: update sec-sentiment-analysis.ipynb (#342 ) Huggingface -> Hugging Face	2023-03-06 15:16:14 +00:00
Tom Aarsen	54a6db1c2c	feat: Add Wikipedia ingest connector (#299 ) The connector can process a Wikipedia page and output the HTML, the plain text contents, and the summary. No API key required Also add test case verifying that 3 files are indeed created (one for HTML, one for text, one for the summary).	2023-02-28 08:25:11 +00:00
Tom Aarsen	ded60afda9	feat: Add GitHub data connector; add Markdown partitioner (#284 )	2023-02-27 14:36:44 -08:00
Matt Robinson	9b0dbc7026	build(deps): bump dependencies; resolve security issues in example dependencies (#300 ) * bump cryptography version * re pip-compile for latest versions * update argilla example requirements * dependency updates * bump versions * pin unstructured-inference due to multithreading issue * linting, linting, linting * dependency on one line	2023-02-27 12:45:28 -05:00
Tom Aarsen	5eb1466acc	Resolve various style issues to improve overall code quality (#282 ) * Apply import sorting ruff . --select I --fix * Remove unnecessary open mode parameter ruff . --select UP015 --fix * Use f-string formatting rather than .format * Remove extraneous parentheses Also use "" instead of str() * Resolve missing trailing commas ruff . --select COM --fix * Rewrite list() and dict() calls using literals ruff . --select C4 --fix * Add () to pytest.fixture, use tuples for parametrize, etc. ruff . --select PT --fix * Simplify code: merge conditionals, context managers ruff . --select SIM --fix * Import without unnecessary alias ruff . --select PLR0402 --fix * Apply formatting via black * Rewrite ValueError somewhat Slightly unrelated to the rest of the PR * Apply formatting to tests via black * Update expected exception message to match 0d81564 * Satisfy E501 line too long in test * Update changelog & version * Add ruff to make tidy and test deps * Run 'make tidy' * Update changelog & version * Update changelog & version * Add ruff to 'check' target Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.	2023-02-27 11:30:54 -05:00
Matt Robinson	5db94fdee6	docs: add getting started section and remove outdated docs (#277 ) * add getting started section to the docs * remove old examples * update example notebook * change to convert_to_dict * various and sundry edits	2023-02-27 15:10:53 +00:00
Tom Aarsen	486c7987fc	feat: Add Reddit ingest connector (#293 ) Add Reddit data connector for ingest. * The connector can process a subreddit. * Either via a search query, * or via hot posts. * The texts in the submissions are converted to markdown files including the post title and the text body, if any (i.e. no images or videos). * The number of posts to fetch can be changed with the CLI.	2023-02-27 00:11:04 -08:00
Tom Aarsen	9062d25d0d	Resolve numerous typos (#280 ) * Resolve numerous typos * Resolve typo in mime type	2023-02-24 17:48:23 -08:00
cragwolfe	87fd0d01dc	feat: Ingest refactors, doc updates (#243 ) - Creates ABCs for ingest connectors - Updates the s3_connector classes to inherit from ABCs - Moves s3 test script to its own file to establish pattern for additional connectors - Rewrites the Ingest.md doc, including instructions on how to add a connector - Updates the example s3 ingest script to use the new location for main.py Note that there were no logic changes, this is essentially a refactoring PR. Test instructions: Run ./test_unstructured_ingest/test-ingest.sh and ./examples/ingest/s3-small-batch/ingest.sh.	2023-02-21 10:15:33 -08:00
Matt Robinson	9bbd4a1d56	docs: file exploration training notebook (#221 )	2023-02-16 20:33:02 +00:00
cragwolfe	3c1b089071	feat: Ingest CLI flags and test fixture updates (#227 ) * Many command line options added. The sample ingest project is now an easy to use CLI (no code editing necessary), capable of processing large numbers of files from S3 in a re-entrant manner. See Ingest.md. * Fixes issue where text fixtures had been truncated * Adds a check to make sure this doesn't happen again * Moves fixture outputs for the existing connector one subdir lower, to make room for future connector outputs.	2023-02-16 16:45:50 +00:00
cragwolfe	ab542ca3c6	feat: Sample ingest project with S3 connector (#218 )	2023-02-14 12:27:45 -08:00
Matt Robinson	f890972139	docs: add bricks training notebook (#211 ) * added bricks notebook * more unicode quotes; isd dataframe column fix * fix remove_punctuation docs * typo fixes * put staging bricks in code	2023-02-10 14:39:14 +00:00
Matt Robinson	7fb3797165	docs: core concepts training notebook (#207 ) * added to_dict to elements * first training notebook * bump changelog, rerun notebook * remove coordinates and id * rerun notebook * has -> have * partitioning -> partition * various and sundry typos * switch to using convert_to_isd	2023-02-09 14:34:34 +00:00
Matt Robinson	d0bf8904fa	docs: example notebooks from community repo (#187 )	2023-01-31 10:37:32 -05:00

30 Commits