unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-12-09 22:30:15 +00:00

Author	SHA1	Message	Date
Christine Straub	b30d6a601e	Fix/1209 tweak xycut ordering output (#1630 ) Closes GH Issue #1209. ### Summary - add swapped `xycut` sorting - update `xycut` sorting evaluation script PDFs: - [sbaa031.073.pdf](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7234218/pdf/sbaa031.073.pdf) - [multi-column-2p.pdf](https://github.com/Unstructured-IO/unstructured/files/12796147/multi-column-2p.pdf) - [11723901.pdf](https://github.com/Unstructured-IO/unstructured-inference/files/12360085/11723901.pdf) ### Testing ``` elements = partition_pdf("sbaa031.073.pdf", strategy="hi_res") print("\n\n".join([str(el) for el in elements])) ``` ### Evaluation ``` PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py sbaa031.073.pdf hi_res xycut_only ```	2023-10-05 07:41:38 +00:00
Roman Isecke	9d81971fcb	update ingest python doc (#1446 ) ### Description Updating the python version of the example docs to show how to run the same code that the CLI runs, but using python. Rather than copying the same command that would be run via the terminal and using the subprocess library to run it, this updates it to use the supported code exposed in the inference directory. For now only the wikipedia one has been updated to get some opinions on this before updating all other connector docs. Would close out https://github.com/Unstructured-IO/unstructured/issues/1445	2023-10-03 10:01:41 -04:00
Roman Isecke	bd49cfbab7	feat: adds Azure Cognitive Search (full text) destination connector (#1459 ) ### Description New [Azure Cognitive Search](https://azure.microsoft.com/en-us/products/ai-services/cognitive-search) destination connector added. Writes each json element from the created json files via partition and writes that content to an index. Bonus bug fix: Due to a recent change where the default version of python used in the repo was bumped to `3.10` from `3.8`, this means running `pip-compile` now runs it against that version rather than the lowest we support which is still `3.8`. This breaks the setup for those lower versions because some of the versions pulled in by `pip-compile` exist for `3.10` but not `3.8`. `pip-compile` was updated to run as a script that checks the version of python being used first, which helps guarantee that all dependencies meet the minimum python version requirement. Closes out https://github.com/Unstructured-IO/unstructured/issues/1466	2023-09-25 10:27:42 -04:00
Trevor Bossert	3e04110bab	Chore: Pin unstructured-inference in extra-pdf-image (#1474 ) This is so users are able to upgrade it when unstructured library is updated.	2023-09-22 09:41:53 -07:00
Ahmet Melek	9e88929a8c	feat: document embeddings (#1368 ) Closes https://github.com/Unstructured-IO/unstructured/issues/1319, closes https://github.com/Unstructured-IO/unstructured/issues/1372 This module: - implements EmbeddingEncoder classes which track embedding related data - implements embed_documents method which receives a list of Elements, obtains embeddings for the text within Elements, updates the Elements with an attribute named embeddings , and returns the updated Elements - the module uses langchain to obtain the embeddings ----- - The PR additionally fixes a JSON de-serialization issue on the metadata fields. To test the changes, run `examples/embed/example.py`	2023-09-20 19:55:30 +00:00
Ryan Nikolaidis	8c1d03e5cf	update slack invite	2023-09-20 00:02:03 -07:00
Roman Isecke	333558494e	roman/delta lake dest connector (#1385 ) ### Description Add delta table downstream destination connector Closes https://github.com/Unstructured-IO/unstructured/issues/1415	2023-09-15 22:13:39 +00:00
Roman Isecke	59e850bbd9	Roman/downstream connector cli subcommand (#1302 ) ### Description Update all other connectors to use the new downstream architecture that was recently introduced for the s3 connector. Closes #1313 and #1311	2023-09-11 11:40:56 -04:00
Ahmet Melek	09cc4bfa5f	feat: jira connector (cloud) (#1238 ) This connector: - takes a Jira Cloud URL, user email and api token; to authenticate into Jira Cloud - ingests: - either all issues in all projects in a Jira Cloud Organization - or - issues in user specified projects, boards - user specified issues - processes this kind of data: - text fields such as issue summary, description, and comments - dropdown fields such as issue type, status, priority, assignee, reporter, labels, and components - other data such as issue id, issue key, project id, information on subtasks - notes down attachment URLs, however does not process attachments - stores each downloaded issue in a txt file, in a predefined template form (consisting of the data above) - then processes each downloaded issue document into elements using unstructured library - related to: https://github.com/Unstructured-IO/unstructured/issues/263 To test the changes, make the necessary setups and run the relevant ingest test scripts. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-09-06 10:10:48 +00:00
cragwolfe	a475b447e8	doc: add colab link for xycut sorting (#1288 )	2023-09-03 20:19:40 +00:00
David Potter	b710bafa89	feat: add salesforce connector (#1168 )	2023-09-02 08:50:31 -07:00
Roman Isecke	ed7f991ab9	Add s3 writer (#1223 ) ### Description Convert s3 cli code to also support writing to s3. Writers are added as optional subcommands to the parent command with their own arguments. Custom `click.Group` introduced to add some custom formatting and text in help messages. To limit the scope of this PR, most existing files were not touched but instead new files were added for the new flow. This allowed _only_ the s3 connector to be updated without breaking any other ones.	2023-08-31 22:19:53 +00:00
Christine Straub	483b09b3c9	Feat/1136 elements ordering for pdf (#1161 ) ### Summary Address [#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for `hi_res` and `fast` strategies. The `ocr_only` strategy does not include coordinates. - add functionality to switch sort mode between the current `basic` sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies - add the script to evaluate the `xy-cut` sorting approach - add jupyter notebook to provide evaluation and visualization for the `xy-cut` sorting approach ### Evaluation ``` export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy> ``` Here, the file should be under the project root directory. For example, ``` export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast ```	2023-08-24 17:46:19 -07:00
Roman Isecke	106ee965a6	Roman/delta table connector (#1132 ) ### Description Add delta table connector and test against a delta table generated via delta.io and uploaded to s3. Shows an example of how to use the connection options to leverage s3. I was able to get this to work with s3 if I pass in the access and secret keys as storage options. Even though the s3 bucket being used is public, would not work without those. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-08-22 10:19:46 -04:00
Sebastian Laverde Alfonso	fe5048a834	feat: chipper local inference notebook (#1116 ) Download chipper model for local use and demonstrate how to partition a .pdf document through the unstructured and unstructured_inference libraries.	2023-08-15 20:43:23 -07:00
Mark Risher	612f9da6e8	Update news-of-the-day.ipynb - typo (#1113 ) Fixed typo	2023-08-14 16:48:49 +00:00
John	f63a66dbef	Capture section and chapter in the metadata for epubs under `epub_section` (#1005 ) Capture section and chapter in the metadata for epubs under epub_section. Closes Github issue #459	2023-08-12 21:02:06 +00:00
Ahmet Melek	627f78c16f	feat: airtable connector (#1012 ) * add the first version of airtable connector * change imports as inline to fail gracefully in case of lacking dependency * parse tables as csv rather than plain text * add relevant logic to be able to use --airtable-list-of-paths * add script for creation of resources for testing, add test script (large) for testing with a large number of tables to validate scroll functionality, update test script (diff) based on the new settings * fix ingest test names * add scripts for the large table test * remove large table test from diff test * make base and table ids explicit * add and remove comments * use -ne instead of != * update code based on the recent ingest refactor, update changelog and version * shellcheck fix * update comments * update check-num-rows-and-columns-output error message Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * update help comments * update help comments * update help comments * update workflows to set auth tokens and to run make install * add comments on create_scale_test_components * separate component ids from the test script, add comments to document test component creation * add LARGE_BASE test, implement LARGE_BASE component creation, replace component id * shellcheck fixes * shellcheck fixes * update docs * update comment * bump version * add wrongly deleted file * sort columns before saving to process * Update ingest test fixtures (#1098) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-08-11 12:02:51 -07:00
Ronny H	b31c62fa84	replace Weaviate nearText with BM25 query algorithm (#1078 )	2023-08-10 22:15:27 +00:00
rvztz	dee9b405cd	feat: Sharepoint connector (#918 )	2023-08-10 09:37:58 -07:00
Roman Isecke	50389f15a8	roman/Add notion connector (#1033 ) * Add notion connector and supporting code * minor fixes * Add notion deps to extras * Use the same return type for both helper methods * Don't ignore types that aren't recognized when mapping json * Add support for recursively getting docs * Add recursive search for databases * fix logging * fix linting * remove debugging code	2023-08-08 22:01:25 -04:00
Matt Robinson	ac7efa19e7	docs: news of the day (chroma + langchain) (#1054 ) * news of the day notebook * readme and requirements * change to single mode instead of elements	2023-08-08 11:36:04 -04:00
Sebastian Laverde Alfonso	084ead173a	chore: custom layout order example notebook (#1024 ) * chore: CFR double column sample Federal Regulations document for example notebook in `examples/custom-layout-order` * chore: custom-layout-order example dir * feat: helper methods to plot and reorder layouts Helper methods: `plot_image_with_bounding_boxes_coloured` and `reorder_elements_in_double_columns` * chore: delete __init__.py --------- Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>	2023-08-02 18:29:04 -06:00
David Potter	1542607892	feat: adds Box connector (#996 )	2023-08-01 01:10:10 +00:00
Roman Isecke	28214a6cc3	Roman/ingest refactor (#978 ) * Pull out s3 code as subcommand * Pull out dropbox code as subcommand * Pull out azure code as subcommand * Pull out fsspec code as subcommand * Pull out github code as subcommand * Pull out gitlab code as subcommand * Pull out reddit code as subcommand * Pull out slack code as subcommand * Pull out discord code as subcommand * Pull out wikipedia code as subcommand * Pull out gdrive code as subcommand * Pull out biomed code as subcommand * rename parameters * Pull out onedrive code as subcommand * Pull out outlook code as subcommand * Pull out local code as subcommand * Pull out elasticsearch code as subcommand * Pull out confluence code as subcommand * Drop previous main file * update changelog * Add back in mp.Pool * Fix mypy issues with click * Make sure all tests run with verbose flag * refactor approach to dynamically add common options to each subcommand, scrub logging of options for sensitive data * Pull out some more shared options * Support running code via python as well as cli * update ingest readme and move it to the ingest folder * update usage in connector docs * move local command arg in test * Separate out cli code from logic running unstructured * Make some cli fields required rather than optional * rename process -> processor * Improve logger to avoid duplicate handlers --------- Co-authored-by: Ryan Nikolaidis <1208590+ryannikolaidis@users.noreply.github.com>	2023-07-31 13:20:10 -04:00
David Potter	f7e46af22f	feat: adds Outlook connector (#939 ) * bonus: fixes issue with email partitioning where From field was being assigned the To field value.	2023-07-26 04:09:26 +00:00
Ahmet Melek	b7674fb97e	feat: confluence connector (cloud) (#906 ) * Add confluence connector and an example script * add test script, add dependency installations * add authentication secret variables for ci tests and actions * add dependency installation commands for workflows * add dependency installation commands for workflows * Update ingest test fixtures (#907) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add add ingest test fixtures update workflow for python 3.10, update example script with dummy values * change workflow name to avoid confusion * change workflow name to avoid confusion * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions * Update ingest test fixtures (#911) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * revert back the test python version matrix * recompile dependencies * modifications for shellcheck * update changelog and version * changelog and version * remove comments * Update ingest test fixtures (#915) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add the option to state the number of spaces to be fetched * add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users * add help message * add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test * change test names * rename connector arg Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * change arg name for connector Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * add comment to example * change arg names * add new tests to ingest test * shellcheck remove redundant statement * Update ingest test fixtures (#932) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * Update ingest test fixtures (#936) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * linting * change file extensions to parse as html * Update ingest test fixtures (#943) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * remove old fixtures * update version to 0.8.2-dev3 * change file to trigger CI * change file to trigger CI * change file to trigger CI * change file to trigger CI --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-07-18 19:29:41 +01:00
rvztz	ce20c3f2bc	feat: add OneDrive connector (#834 )	2023-07-13 20:57:54 +00:00
Ahmet Melek	5ea216cf07	feat: elasticsearch connector (#817 )	2023-07-01 17:45:28 +00:00
David Potter	bec733cdf8	feat: add Dropbox connector (#844 )	2023-06-30 17:08:27 -07:00
David Potter	3b472cb7df	feat: add google cloud storage connector (#746 )	2023-06-21 15:14:50 -07:00
Sebastian Laverde Alfonso	508ce48d54	Feat: notebook for Elasticsearch integration (#681 ) * feat: nb elasticsearch unstructured sentiment * chore: refactor readme for elasticsearch nb * fix: update es-credentials.ini * chore: update es-credentials.ini * fix: typo in nb load-into-es.ipynb exist --> exists * fix: typo 2 in nb load-into-es.ipynb obtaing --> obtain	2023-06-05 19:05:08 +00:00
Matt Robinson	c35fff2972	feat: Add `stage_for_weaviate` and schema creation function (#672 ) * add weaviate docker compose * added staging brick and tests for weaviate * initial notebook and requirements file * add commentary to weaviate notebook * weaviate readme * update docs * version and change log * install weaviate client * install weaviate; skip for docker * linting, linting, linting * install weaviate client with deps * comments on weaviate client * fix module not found error for docker container * skipped wrong test in docker * fix typos * add in local-inference	2023-06-01 20:48:54 +00:00
Yuming Long	ab5f92dd79	Fix(ingest): Deprecate `--s3-url` in favor of `--remote-url` (#616 ) * deprecation s3-url * changelog and version * download dir not now	2023-05-19 12:11:40 -04:00
Mallori Harrell	34d563c1fc	feat: Create spacy notebook example (#593 ) * add new notebook for spacy	2023-05-17 15:42:15 -05:00
Trevor Bossert	830d67f653	Feat: Discord connector (#515 ) * Initial commit of discord connector based off of initial work by @tnachen with modifications https://github.com/tnachen/unstructured/tree/tnachen/discord_connector * Add test file change format of imports * working version of the connector More work to be done to tidy it up and add any additional options * add to test fixtures update * fix spacing * tests working, switching to bot testing channel * add additional channel add reprocess to tests * add try clause to allow for exit on error Update changelog and bump version * add updated expected output files * add logic to check if --discord-period is an integer Add more to option description * fix lint error * Update discord reqs * PR feedback * add newline * another newline --------- Co-authored-by: Justin Bossert <packerbacker21@hotmail.com>	2023-05-16 11:46:30 -07:00
Matt Robinson	e052c2a9b2	docs: example of how to use `unstructured` with `pgvector` (#571 ) * pgvector requirements * first pass on pgvector notebook and sql alchemy file * created code for loading vectors into db * added query for embedding distance * updates to pgvector notebook * update function with time decay * update pgvector notebook to use example code * remove old create table script * add readme for pgvector * update example to use get_date()	2023-05-12 13:54:38 -04:00
Matt Robinson	19beb24e03	docs: `unstructured` -> MySQL example (#557 ) * added requirements for mysql * first bit of mysql notebook * update requirements file * wrap with mysql example * update readme with install instructions	2023-05-09 13:26:49 +00:00
pravin-unstructured	4020da56ad	Went through this demo notebook with Matt. Decision was made to add it to our collection of examples for use later. (#484 )	2023-04-17 11:53:25 -04:00
Trevor Bossert	cff7f4fd5a	Slack connector (#462 ) This connector takes a slack channel id, token and other options to pull conversation history for a channel and store it as a text file that is then processed by unstructured into expected output.	2023-04-16 19:34:43 +00:00
natygyoon	7f6e094c1f	feat: add local file system connector for unstructured-ingest (#399 ) * added local connector to unstructured-ingest	2023-03-29 15:53:23 -07:00
cragwolfe	ce9fc26009	feat: add ability to pass headers in partition_html (#397 ) Also adds pytest-mock requirement, those fixtures are nice to have! Implements issue/feature #396 .	2023-03-23 20:14:57 -07:00
Habeeb Shopeju	2ca843782c	Connector for Biomedical Literature (#345 ) The implementation involves the introduction of SimpleBiomedConfig, BiomedIngestDoc and BiomedConnector which ingests documents from the PDF Download.	2023-03-11 01:09:54 +00:00
Alvaro Bartolome	5291a96616	Add `AzureBlobStorageConnector` (#353 ) * Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting from `FsspecConnector` * Start deprecation life cycle for `unstructured-ingest --s3-url` option, to be deprecated in favor of `--remote-url`. --------- Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>	2023-03-10 15:43:40 -08:00
Tom Aarsen	1580c1bf8e	feat: Add GitLab ingest connector (#349 ) Add GitLab data connector for ingest. Involves more general Git functionality that is shared between the GitHub and GitLab data connectors. Prevent code duplication for functionality between GitHub and GitLab ingest connectors. Renamed github-access-token, github-branch and github-file-glob to git-access-token, git-branch and git-file-glob, respectively. These work for GitHub and GitLab.	2023-03-08 00:15:21 -08:00
Habeeb Shopeju	4117f57e14	Connector for Google Drive (#294 ) Implements issue #244	2023-03-07 06:01:02 +00:00
Ikko Eltociear Ashimine	213077e2ab	docs: update sec-sentiment-analysis.ipynb (#342 ) Huggingface -> Hugging Face	2023-03-06 15:16:14 +00:00
Tom Aarsen	54a6db1c2c	feat: Add Wikipedia ingest connector (#299 ) The connector can process a Wikipedia page and output the HTML, the plain text contents, and the summary. No API key required Also add test case verifying that 3 files are indeed created (one for HTML, one for text, one for the summary).	2023-02-28 08:25:11 +00:00
Tom Aarsen	ded60afda9	feat: Add GitHub data connector; add Markdown partitioner (#284 )	2023-02-27 14:36:44 -08:00
Matt Robinson	9b0dbc7026	build(deps): bump dependencies; resolve security issues in example dependencies (#300 ) * bump cryptography version * re pip-compile for latest versions * update argilla example requirements * dependency updates * bump versions * pin unstructured-inference due to multithreading issue * linting, linting, linting * dependency on one line	2023-02-27 12:45:28 -05:00
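
Several commits above (#1161, #1630) refer to `xy-cut` sorting for ordering PDF elements in multi-column layouts. The following is a minimal sketch of the general recursive XY-cut idea, not the code from those commits: the names `Box`, `split_on_gaps`, and `xy_cut` are hypothetical, boxes are assumed to be `(x0, y0, x1, y1)` tuples with `y` growing downward, and this greedy variant cuts at every empty gap it finds, so it may order grid-aligned layouts differently from the library's implementation.

```python
from typing import List, Tuple

# Hypothetical box type for illustration: (x0, y0, x1, y1), y grows downward.
Box = Tuple[float, float, float, float]


def split_on_gaps(boxes: List[Box], axis: int) -> List[List[Box]]:
    """Group boxes separated by empty gaps along one axis (0 = x, 1 = y)."""
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups = [[ordered[0]]]
    reach = ordered[0][axis + 2]  # farthest extent covered so far
    for box in ordered[1:]:
        if box[axis] > reach:
            groups.append([box])  # empty gap found: start a new group
        else:
            groups[-1].append(box)
        reach = max(reach, box[axis + 2])
    return groups


def xy_cut(boxes: List[Box], axis: int = 0) -> List[Box]:
    """Return boxes in an approximate reading order via recursive cuts."""
    if len(boxes) <= 1:
        return list(boxes)
    groups = split_on_gaps(boxes, axis)
    if len(groups) == 1:
        # No cut possible on this axis, so try the other one.
        axis = 1 - axis
        groups = split_on_gaps(boxes, axis)
        if len(groups) == 1:
            # No cut on either axis: fall back to top-to-bottom, left-to-right.
            return sorted(boxes, key=lambda b: (b[1], b[0]))
    ordered: List[Box] = []
    for group in groups:
        # Alternate the cut direction inside each group.
        ordered.extend(xy_cut(group, 1 - axis))
    return ordered
```

Calling `xy_cut` on the boxes of a page with a full-width title above two columns should yield the title first, then the left column top to bottom, then the right column, provided the columns are separated by a clear vertical gap.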


61 Commits