- Adds a destination connector to upload processed output into a
PostgreSQL/SQLite database instance.
- Users are responsible for providing their own instances. This PR includes a
couple of configuration examples.
- Defines the scripts required to set up a PostgreSQL instance with the
unstructured elements schema.
- Validates postgres/pgvector embedding storage and retrieval
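For context, a minimal sketch of the kind of pgvector round trip the validation exercises; the connection string, table name, and vector size here are placeholders, not the shipped schema or scripts.
```python
# Hedged illustration only: store and read back an embedding in a pgvector column.
# Connection string, table name, and vector size are assumptions for the example.
import psycopg2

conn = psycopg2.connect("postgresql://user:pass@localhost:5432/unstructured")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS elements "
        "(id serial PRIMARY KEY, text text, embeddings vector(3));"
    )
    cur.execute(
        "INSERT INTO elements (text, embeddings) VALUES (%s, %s);",
        ("hello world", "[0.1,0.2,0.3]"),
    )
    cur.execute("SELECT text, embeddings FROM elements LIMIT 1;")
    print(cur.fetchone())
conn.close()
```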
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
To test:
> cd docs && make html
Sections:
- New User sign-up: (i) registration form, (ii) payment processing, and
(iii) use API key & URL
- API Account maintenance: (i) update billing, (ii) opt-in email, (iii)
rotate API key, and (iv) cancel plan
- Get Support
This PR intends to add [Qdrant](https://qdrant.tech/) as a supported
ingestion destination.
- Implements CLI and programmatic usage.
- Documentation update
- Integration test script
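For reference, a rough sketch of the upsert the destination performs; the collection name, vector size, and payload shape are placeholders, and the collection is assumed to already exist.
```python
# Illustrative only: upsert one point per element, with the embedding as the
# vector and the element JSON as the payload. Names and sizes are assumptions.
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url="http://localhost:6333")
points = [
    PointStruct(id=i, vector=[0.0] * 384, payload={"text": f"element {i}"})
    for i in range(3)
]
client.upsert(collection_name="unstructured-elements", points=points)
```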
---
Clone of #2315 to run with CI secrets
---------
Co-authored-by: Anush008 <anushshetty90@gmail.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Closes https://github.com/Unstructured-IO/unstructured/issues/1842
Closes https://github.com/Unstructured-IO/unstructured/issues/2202
Closes https://github.com/Unstructured-IO/unstructured/issues/2203
This PR:
- Adds an Elasticsearch destination connector so that documents from any
supported source can be ingested, embedded, and the embeddings / documents
written into Elasticsearch.
- Defines an example unstructured elements schema so users can set up their
unstructured Elasticsearch indexes easily.
- Includes parallelized upload and lazy processing for elasticsearch
destination connector.
- Rearranges elasticsearch test helpers to source, destination, and
common folders.
- Adds util functions to batch iterables in a lazy way for uploads (a sketch
of the idea follows this list).
- Fixes a bug where removing the optional parameter `--fields` broke the
connector due to an integer processing error.
- Fixes a bug where using an [elasticsearch
config](8fa5cbf036/unstructured/ingest/connector/elasticsearch.py (L26-L35))
for a destination connector resulted in a serialization issue when the
optional parameter `--fields` was not provided.
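For illustration, a minimal sketch of the lazy batching idea behind those util functions (not the connector's actual helper; `bulk_upload` in the usage comment is hypothetical):
```python
# Generator-based batching: nothing is materialized until a batch is consumed,
# so large element lists can be streamed to the bulk-upload call.
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def generator_batching(iterable: Iterable[T], batch_size: int = 100) -> Iterator[List[T]]:
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch

# usage: for batch in generator_batching(element_dicts, 500): bulk_upload(batch)
```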
Adds Chroma (also known as ChromaDB) as a vector destination.
Currently Chroma is an in-memory, single-process-oriented library, with plans
for a hosted and/or more production-ready solution:
https://docs.trychroma.com/deployment
Though they now claim to support multiple clients hitting the database at
once, I found that to be inconsistent. Sometimes multiprocessing worked (maybe
1 out of 3 times), but the other times I would get different errors, so I kept
it single-process.
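For context, roughly the shape of the write the connector performs, kept single-process as noted above; the collection name, persistence path, and metadata fields are placeholders.
```python
# Hedged sketch: partition a document and add its elements to a local Chroma
# collection. Collection name, persistence path, and fields are assumptions.
import chromadb
from unstructured.partition.auto import partition

elements = partition(filename="example-docs/fake-text.txt")
client = chromadb.PersistentClient(path="chroma-db")
collection = client.get_or_create_collection("unstructured-elements")
collection.add(
    ids=[element.id for element in elements],
    documents=[element.text for element in elements],
    metadatas=[{"category": element.category} for element in elements],
)
print(collection.count())
```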
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
Adds a source connector for SFTP, which uses fsspec with Paramiko under the
hood. Paramiko is the standard SFTP package for Python, used by pysftp and
others.
```
--username foo \
--password bar \
--remote-url sftp://localhost:47474/upload/
```
It will only download a specifically requested file if it has an extension
(e.g. `--remote-url sftp://localhost:47474/upload/bob.zip`); any other
remote_url is treated as a folder path. This is intentional.
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
### Summary
This PR is the second part of the "image extraction" refactor to move it
from the `unstructured-inference` repo to the `unstructured` repo; the first
part was done in
https://github.com/Unstructured-IO/unstructured-inference/pull/299. This
PR adds logic to support extracting images.
### Testing
`git clone -b refactor/remove_image_extraction_code --single-branch
https://github.com/Unstructured-IO/unstructured-inference.git && cd
unstructured-inference && pip install -e . && cd ../`
```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="example-docs/embedded-images.pdf",
    strategy="hi_res",
    extract_images_in_pdf=True,
)
print("\n\n".join([str(el) for el in elements]))
```
Closes #1781.
- Adds a Weaviate destination connector
- The connector receives a host for the weaviate instance and a weaviate
class name.
- Defines a weaviate schema for json elements.
- Defines the pre-processing to conform unstructured's schema to the
proposed weaviate schema.
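As a rough illustration of that pre-processing (the property names here are hypothetical, not the schema shipped in this PR), the element dict can be flattened so nested metadata becomes top-level Weaviate properties:
```python
# Hypothetical sketch: flatten an element's dict so it fits a flat Weaviate
# class schema. Property names are assumptions for illustration only.
from unstructured.partition.auto import partition

def element_to_weaviate_object(element) -> dict:
    data = element.to_dict()
    metadata = data.pop("metadata", {}) or {}
    # Weaviate properties are flat, so metadata fields are lifted to the top level.
    return {"text": data.get("text"), "category": data.get("type"), **metadata}

objects = [element_to_weaviate_object(el) for el in partition(filename="example-docs/fake-text.txt")]
print(objects[0].keys())
```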
### Summary
This PR is the second part of the `pdfminer` refactor to move it from the
`unstructured-inference` repo to the `unstructured` repo; the first part was
done in
https://github.com/Unstructured-IO/unstructured-inference/pull/294. This
PR adds logic to merge the extracted layout with the inferred layout.
The updated workflow for the `hi_res` strategy:
* pass the document (as data/filename) to the `inference` repo to get
`inferred_layout` (DocumentLayout)
* pass the `inferred_layout` returned from the `inference` repo and the
document (as data/filename) to the `pdfminer_processing` module, which
first opens the document (create temp file/dir as needed), and splits
the document by pages
* if `is_image` is `True`, return the passed `inferred_layout` (DocumentLayout)
* if `is_image` is `False`:
  * get `extracted_layout` (TextRegions) from the passed document
(data/filename) by `pdfminer`
  * merge `extracted_layout` (TextRegions) with the passed
`inferred_layout` (DocumentLayout)
  * return the `inferred_layout` (DocumentLayout) with updated elements
(all merged LayoutElements) as `merged_layout` (DocumentLayout)
* pass merged_layout and the document (as data/filename) to the `OCR`
module, which first opens the document (create temp file/dir as needed),
and splits the document by pages (convert PDF pages to image pages for
PDF file)
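As a structural sketch of the merge step described above (toy classes and a trivial merge stand in for the real `DocumentLayout`/`TextRegions` types and the bbox-matching logic in `unstructured-inference`):
```python
# Toy stand-ins purely to illustrate the per-page flow; the real classes and
# merge logic live in unstructured-inference / pdfminer_processing and differ.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PageLayout:
    elements: List[str] = field(default_factory=list)

@dataclass
class DocumentLayout:
    pages: List[PageLayout] = field(default_factory=list)

def merge_with_extracted_layout(
    inferred_layout: DocumentLayout,
    extracted_pages: List[List[str]],
    is_image: bool,
) -> DocumentLayout:
    if is_image:
        # image inputs have no pdfminer text layer; pass the inferred layout through
        return inferred_layout
    for page, extracted in zip(inferred_layout.pages, extracted_pages):
        # the real merge matches regions to inferred elements by bounding box
        page.elements = page.elements + extracted
    return inferred_layout  # i.e. the merged_layout handed on to the OCR module
```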
### Note
This PR also fixes issue #2164 by using functionality similar to the one
implemented in the `fast` strategy workflow when extracting elements by
`pdfminer`.
### TODO
* image extraction refactor to move it from `unstructured-inference`
repo to `unstructured` repo
* improving natural reading order by applying the current default
`xycut` sorting to the elements extracted by `pdfminer`
### Summary
Closes #2033
Updates `partition_via_api` to use `UnstructuredClient` for api calls
instead of `requests`.
Updates associated tests.
Note: This PR does **not** update `partition_multiple_via_api` as
documentation in `unstructured-python-client` indicates it does not
support multiple files. A new issue should be opened to add that
functionality to `unstructured-python-client`.
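A quick way to exercise the change (the file path and API key are placeholders):
```python
# Calls the hosted API through partition_via_api, which now goes through
# UnstructuredClient under the hood. Filename and api_key are placeholders.
from unstructured.partition.api import partition_via_api

elements = partition_via_api(
    filename="example-docs/layout-parser-paper-fast.pdf",
    api_key="<your-api-key>",
)
print(elements[:3])
```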
---------
Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
To test:
> cd docs && make html
Change logs:
* Examples are reorganized onto their own page
* Removed two old examples, i.e. "file-utils" & "sentiment analysis".
* Added two examples: "RAG with Unstructured, LangChain, and ChromaDB" &
"Multi-Files Processing with S3 Connector and API"
* Reorganized and added detailed API documentation: (i) usage, (ii)
SDKs, (iii) Azure Marketplace, (iv) AWS Marketplace, (v) parameters and
validation errors
### Summary
Add a procedure to repair a PDF when the PDF structure is invalid for
`PDFminer` to process.
This PR handles two cases of `PSSyntaxError Invalid dictionary
construct: ...`:
* PDFminer opens the entire document and creates a pages generator on
`PDFPage.get_pages(fp)`: [sentry log
example](https://unstructuredio.sentry.io/issues/4655715023/?alert_rule_id=14681339&alert_type=issue&notification_uuid=d8db4cf4-686f-4504-8a22-74a79a8e966f&project=4505909127086080&referrer=slack)
* PDFminer's interpreter processes a single page on
`interpreter.process_page(page)`: [sentry log
example](https://unstructuredio.sentry.io/issues/4655898781/?referrer=slack&notification_uuid=0d929d48-f490-4db8-8dad-5d431c8460bc&alert_rule_id=14681339&alert_type=issue)
**Additional tech details:**
* Add new dependency `pikepdf` in `requirements/extra-pdf-image.in`,
which is used for repairing PDFs (a rough sketch of the idea follows this list).
* Add new dependency `pypdf` in `requirements/extra-pdf-image.in`,
which is used to find the error page in the entire document by reading the
PDF file again (couldn't find a way to split the PDF in PDFminer).
* Refactor the `is null` check for `get_uris_from_annots`: the
root cause is that `get_uris` passed a `None` `annots` to
`get_uris_from_annots`, so the null check should happen in `get_uris`.
* Add more type protection in `get_uris_from_annots` when using any
`PDFObjRef.resolve()` as `dict` (it could still be a `PDFObjRef`). This
should fix :
* https://github.com/Unstructured-IO/unstructured/issues/1922 where
`annotation_dict` is a `PDFObjRef`
* https://github.com/Unstructured-IO/unstructured/issues/1921 where
`rect` is a `PDFObjRef`
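A minimal sketch of the repair idea (not the exact code in this PR): pikepdf rewrites the file, which normalizes many invalid dictionary constructs, and the repaired copy is handed back to PDFminer.
```python
# Hedged sketch: write a repaired copy of a broken PDF with pikepdf.
import os
import tempfile

import pikepdf

def repair_pdf(input_path: str) -> str:
    fd, repaired_path = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)
    with pikepdf.open(input_path) as pdf:
        pdf.save(repaired_path)  # rewriting the file fixes many structural errors
    return repaired_path
```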
### Test
Added three test files (each larger than 500 KB) for unittests to
test:
* Repair entire doc
* Repair one page
* Reprocess failure after repairing one page (just return the elements
before error page in this case).
* Also, it seems like splitting the document into smaller pages could fix
this problem, but I'm not sure why. For example, I saw an error from
reprocessing the whole
[cancer.pdf](https://github.com/Unstructured-IO/unstructured/files/13461616/cancer.pdf)
doc, but no error when I split the pdf at the error page.
* I tested whether I could repair the entire doc again in this case and saw a
different error, which means repairing isn't helping, in my opinion.
* PDFminer can process the whole doc after pikepdf repaired the entire doc up
front, but we can't repair by pages this way.
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
### Description
This adds the basic implementation of pushing the generated json output
of partition to MongoDB. None of this code provisions the MongoDB
instance, so things like adding a search index around the embedding
content must be done by the user. Any sort of schema validation would
also have to take place via user-specific configuration on the database.
This update makes no assumptions about the configuration of the database
itself.
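Conceptually, the write amounts to something like the following (connection string, database, and collection names are placeholders):
```python
# Hedged sketch: push partitioned element dicts into a user-provisioned MongoDB
# collection. URI, database, and collection names are assumptions.
from pymongo import MongoClient
from unstructured.partition.auto import partition

elements = partition(filename="example-docs/fake-text.txt")
client = MongoClient("mongodb://localhost:27017")
collection = client["unstructured"]["elements"]
collection.insert_many([element.to_dict() for element in elements])
print(collection.count_documents({}))
```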
Summary:
Close: https://github.com/Unstructured-IO/unstructured/issues/1920
* stop passing an empty string from `languages` to tesseract, which would
result in passing an empty string to the language config `-l` for the tesseract
CLI
* also stop passing duplicate language codes from `languages` to
tesseract OCR
* if we fail to convert any ISO languages from the `languages`
parameter, proceed with OCR using `eng` as the default
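For illustration only, a hypothetical helper showing the kind of cleanup described above (not the actual implementation in unstructured):
```python
# Drop empty and duplicate codes before building tesseract's "-l" argument,
# falling back to English when nothing usable remains.
from typing import List, Optional

def clean_tesseract_languages(languages: Optional[List[str]]) -> str:
    seen: List[str] = []
    for lang in languages or []:
        code = lang.strip()
        if code and code not in seen:
            seen.append(code)
    return "+".join(seen) if seen else "eng"  # tesseract joins codes with "+"

assert clean_tesseract_languages(["", "eng", "eng"]) == "eng"
assert clean_tesseract_languages(None) == "eng"
```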
### Test
* First confirm the tesseract error `Estimating resolution as X` before
this:
* on the `unstructured-api` repo with main branch, run `make
run-web-app`
* curl to test error from empty string, or just any wrong input like `-F
'languages="eng,de"'`:
```
curl -X 'POST' 'http://0.0.0.0:8000/general/v0/general' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
-F 'languages=""' \
-F 'strategy=hi_res' \
-F 'pdf_infer_table_structure=True' \
| jq -C . | less -R
```
* after this change:
  * in your unstructured API env, cd to the unstructured repo and install it locally with `pip install -e .`
  * check out this branch
  * run `make run-web-app` again in the api repo
  * the curl command returns output, and a warning appears in the log
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Per @tabossert we're now using a link shortener behind which we can
rotate the link to keep it current. That way we (🤞 ) never have to
update this here again.
#### Testing:
Links should work. No more links should exist in the documentation
except this one.
### Description
Update all destination tests to match pattern:
* Don't omit any metadata to check full schema
* Move azure cognitive dest test from src to dest
* Split delta table test into separate src and dest tests
* Fix azure cognitive search and add it to the dest tests being run (it wasn't
being run originally)
### Summary
We no longer use the "bricks" terminology for partitioning functions, etc.,
in the library. This PR updates various references to bricks within the
repo and the docs. This is just an initial pass to swap the terminology
out; it'll likely be helpful to reorganize the docs a bit as well.
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
### Description
Add in the fsspec configs needed for the fsspec-based connectors.
To match the behavior of the original CLI, the default used by the click
option was mirrored in the base config for the API endpoint.
Carries `skip_infer_table_types` through to `infer_table_structure` in the
partition flow. Now Table elements from PPT/X, DOC/X, etc. should not have a
`text_as_html` field.
Note: I've continued to exclude this var from partitioners that go
through the HTML flow. If we've already got the HTML, it doesn't make
sense to carry the infer variable along, since we're not 'infer-ing' the
HTML table in these cases.
TODO:
✅ add unit tests
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: amanda103 <amanda103@users.noreply.github.com>
Summary: Added support for AWS Bedrock embeddings. Leverages
"amazon.titan-tg1-large" for the embedding model.
Test
- find your AWS secret access key and key ID; make sure the account has
access to Bedrock's Titan embedding model
- follow the instructions in
d5e797cd44/docs/source/bricks/embedding.rst (bedrockembeddingencoder)
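If it helps, here is a hedged sketch of roughly what the encoder wires together via LangChain; the client service name, region, and constructor arguments are assumptions and may differ from the actual `BedrockEmbeddingEncoder` internals.
```python
# Assumed wiring, for illustration only: a boto3 Bedrock runtime client plus
# LangChain's BedrockEmbeddings with the Titan model named above.
import boto3
from langchain.embeddings import BedrockEmbeddings

client = boto3.client(
    "bedrock-runtime",
    region_name="us-west-2",
    aws_access_key_id="<key id>",
    aws_secret_access_key="<secret access key>",
)
embeddings = BedrockEmbeddings(client=client, model_id="amazon.titan-tg1-large")
vector = embeddings.embed_query("Unstructured makes documents LLM-ready.")
print(len(vector))
```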
---------
Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Ahmet Melek <ahmetmeleq@gmail.com>
### Description
* Update all existing connector docs to use new pipeline approach
### Additional changes:
* Some defaults were set for the runners to match those in the configs
to make those easy to handle, e.g. the biomed runner:
```python
max_retries: int = 5,
max_request_time: int = 45,
decay: float = 0.3,
```
### Description
Currently, linting only takes place over the base unstructured directory,
but we support Python files throughout the repo. It makes sense for all
those files to abide by the same linting rules, so the entire repo
was set to be inspected when the linters are run. Along with that,
autoflake was added as a linter; it has added benefits such
as removing unused imports that would otherwise break flake and
require manual intervention.
The only real relevant changes in this PR are in the `Makefile`,
`setup.cfg`, and `requirements/test.in`. The rest is the result of
running the linters.
This PR:
- defines rbac_data as a SourceMetadata field,
- manages connections to an external api for obtaining rbac data with
ConnectorRBAC class,
- serializes rbac data and saves it to disk,
- matches the rbac_data on disk to each IngestDoc, using a common
field,
- forwards rbac data to Elements, via the partition() function
To test the changes, run `examples/ingest/sharepoint/ingest.sh` with the
relevant rbac & connector credentials
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
This PR adds the `max_characters` (hard max) param to non-table element
chunking. Additionally updates the `num_characters` metadata to
`max_characters` to make it clearer which param we're referencing.
To test:
```
from unstructured.partition.html import partition_html
filename = "example-docs/example-10k-1p.html"
chunk_elements = partition_html(
    filename,
    chunking_strategy="by_title",
    combine_text_under_n_chars=0,
    new_after_n_chars=50,
    max_characters=100,
)

for chunk in chunk_elements:
    print(len(chunk.text))

# previously we were only respecting the "soft max" (default of 500) for elements other than tables;
# now we should see that all the elements have text fields under 100 chars.
```
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
### Description
As we add more and more steps to the pipeline (i.e. chunking, embedding,
table manipulation), it helps to separate the responsibility of each
of these into their own processes, running each in parallel and using json
files to share data across them. This will also help guarantee data is
serializable if this code is used in an actual pipeline. A flow diagram of
the proposed changes is included at the end of this description. As part of this change:
* A parent pipeline class will be responsible for running each `node`,
which can optionally be run via multiprocessing if it supports it.
Possible nodes at this moment:
* Doc factory: creates all the ingest docs via the source connector
* Source: reads/downloads all of the content to process to the local
filesystem to the location set by the `download_dir` parameter.
* Partition: runs partition on all of the downloaded content in json
format.
* Any number of reformat nodes that modify the partitioned content. This
can include chunking, embedding, etc.
* Write: push the final json into the destination via the destination
connector
* This pipeline relies on the ingest docs' information being available via
their serialization. An optimization was introduced with
the `IngestDocJsonMixin`, which adds all the `@property` fields to the
serialized json already being created via the `DataClassJsonMixin`.
* For all intermediate steps (partitioning, reformatting), the content
is saved to a dedicated location on the local filesystem. Right now it's
set to `$HOME/.cache/unstructured/ingest/pipeline/STEP_NAME/`.
* Minor changes: it made sense to move some of the config parameters
between the read and partition configs when I explicitly divided the
responsibility of downloading vs. partitioning the content in the pipeline.
* The pipeline class only makes the doc factory, source and partition
nodes required, keeping with the logic that has been supported so far.
All reformatting nodes and write node are optional.
* Long term, there should also be some changes to the base configs
supported by the CLI to support pipeline specific configs, but for now
what exists was used to minimize changes in this PR.
* Final step to copy the final output to the location designated by the
`_output_filename` value of the ingest doc.
* Hashing occurs at each step by hashing the parameters of that step
(i.e. partition configs) along with the previous step via the filename
used. This allows each step to be the same _if_ all the parameters for
it have not changed and the content so far is the same.
* The only data that is shared and written to across processes is the
dictionary of ingest json data. This dict is created using the
`multiprocessing.managers.DictProxy` to make sure any interaction with it
is behind a lock.
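A minimal sketch of that shared-state pattern (names are illustrative, not the pipeline's actual code):
```python
# A Manager-backed dict proxy keeps writes from worker processes behind a lock.
from multiprocessing import Manager, Pool

def record_result(args):
    shared, doc_id = args
    shared[doc_id] = {"status": "processed"}

if __name__ == "__main__":
    with Manager() as manager:
        shared = manager.dict()
        with Pool(processes=4) as pool:
            pool.map(record_result, [(shared, i) for i in range(10)])
        print(dict(shared))
```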
### Minor refactors included:
* Utility methods added to extract configs from the click options
* Utility method to add common options to click commands.
* All writers moved to using the class approach which extracts a lot of
the common code so there's less copy-paste when new runners are added.
* Use `@property` for source metadata on base ingest doc to add logic to
call `update_source_metadata` if it's still `None` at the time it's
fetched.
### Additional bug fixes included
* Fsspec connectors were not serializable due to the `ingest_doc_cls`.
This was removed from the fields captured by the `@dataclass` decorator
and added in a `__post_init__` method (see the sketch after this list).
* Various reddit connector params were missing. This doesn't have an
explicit ingest test at the moment so was never caught.
* Fsspec connector had the parent `update_source_metadata` misnamed as
`update_source_metadata_metadata` so it was never being called.
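The serialization fix mentioned in the first bullet follows this general pattern (class name and attribute contents simplified for illustration):
```python
# Keep the non-serializable class reference out of the dataclass fields and
# attach it as a plain attribute in __post_init__ instead.
from dataclasses import dataclass

@dataclass
class ExampleFsspecConnector:
    remote_url: str

    def __post_init__(self):
        # not a dataclass field, so it never shows up in the serialized output
        self.ingest_doc_cls = dict  # stand-in for the real ingest doc class
```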
### Flow Diagram

* Updated Metadata page: add common and additional metadata fields by
document types and connectors
* Updated specific installation extra by document types and connectors
* Added embedding brick page in Sphinx TOC
* Fixed Sphinx warnings in new pages
### Description
Updates the Python version of the example docs to show how to run the
same code that the CLI runs, but using Python. Rather than copying the
command that would be run via the terminal and running it with the subprocess
library, this updates the examples to use the supported code exposed in
the inference directory.
For now only the wikipedia one has been updated to get some opinions on
this before updating all other connector docs.
Would close out
https://github.com/Unstructured-IO/unstructured/issues/1445
### Description
This PR is two-fold:
**Embeddings:**
* Embeddings incorporated into the sharepoint source connector, which
will now call out to OpenAI and create embeddings if the flag is passed
in and the api key provided.
**Writing vector content (embeddings) to Azure cognitive search index:**
* The schema for the index expected to exist in Azure has been updated
to include the vector field type, and a test script has been added to
test the new content produced by the Sharepoint connector when pushing
the embedding content.
Some important notes about other changes in here:
* The embedding code had to be updated to patch the `to_dict` method on
elements to add `embeddings` to the dict output when it is present. While
the code originally added the embedding content, it was lost when `to_dict`
was called to save the content as json.
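The patch amounts to something along these lines (simplified; the real change lives on the element/embedding code rather than a free function):
```python
# Make sure embeddings survive the element -> dict conversion used when
# writing JSON output. Simplified illustration only.
def element_to_dict_with_embeddings(element) -> dict:
    data = element.to_dict()
    embeddings = getattr(element, "embeddings", None)
    if embeddings is not None:
        data["embeddings"] = embeddings
    return data
```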
This updates the docker image download URL to pass through the Scarf
gateway, which allows anonymous tracking of downloads.
Related to:
https://github.com/Unstructured-IO/unstructured#chart_with_upwards_trend-analytics
Testing:
`docker pull downloads.unstructured.io/unstructured-io/unstructured:latest`
Result:
The image should download.
### Description
A new [Azure Cognitive
Search](https://azure.microsoft.com/en-us/products/ai-services/cognitive-search)
destination connector has been added. It takes each json element from the
json files created via partition and writes that content to an index.
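Conceptually the write step looks roughly like this (endpoint, key, index name, and output path are placeholders; the connector batches the documents rather than uploading them all at once):
```python
# Hedged sketch: upload partitioned JSON elements into an existing index.
import json

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint="https://<service>.search.windows.net",
    index_name="unstructured-elements",
    credential=AzureKeyCredential("<admin key>"),
)
with open("structured-output/example.json") as f:
    elements = json.load(f)
client.upload_documents(documents=elements)
```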
**Bonus bug fix:** Due to a recent change, the default version of
Python used in the repo was bumped from `3.8` to `3.10`, which means
running `pip-compile` now runs against that version rather than the
lowest we support, which is still `3.8`. This breaks the setup for those
lower versions because some of the versions pulled in by `pip-compile`
exist for `3.10` but not `3.8`. `pip-compile` was updated to run as a
script that checks the version of Python being used first, which helps
guarantee that all dependencies meet the minimum Python version
requirement.
Closes out https://github.com/Unstructured-IO/unstructured/issues/1466
Addresses
[#1332](https://github.com/Unstructured-IO/unstructured/issues/1332)
with `unstructured-inference` PR
[#208](https://github.com/Unstructured-IO/unstructured-inference/pull/208).
### Summary
- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements ignored due to garbage text if
`el.metadata.image_path` is `True`
### Testing
from unstructured.partition.pdf import partition_pdf
f_path = "example-docs/embedded-images.pdf"
# default image output directory
elements = partition_pdf(
f_path,
strategy=strategy,
extract_images_in_pdf=True,
)
# specific image output directory
elements = partition_pdf(
f_path,
strategy=strategy,
extract_images_in_pdf=True,
image_output_dir_path=<directory path>,
)
Closes https://github.com/Unstructured-IO/unstructured/issues/1319,
closes https://github.com/Unstructured-IO/unstructured/issues/1372
This module:
- implements EmbeddingEncoder classes which track embedding-related data
- implements an embed_documents method which receives a list of Elements,
obtains embeddings for the text within the Elements, updates the Elements
with an attribute named embeddings, and returns the updated Elements
- the module uses langchain to obtain the embeddings
-----
- The PR additionally fixes a JSON de-serialization issue on the
metadata fields.
To test the changes, run `examples/embed/example.py`
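If the example script is not at hand, a minimal sketch of the intended usage (the encoder's constructor arguments are assumptions and may differ by version):
```python
# Assumed usage, for illustration: embed a couple of Text elements with the
# OpenAI-backed encoder and inspect the attached embeddings.
from unstructured.documents.elements import Text
from unstructured.embed.openai import OpenAIEmbeddingEncoder

encoder = OpenAIEmbeddingEncoder(api_key="<your OpenAI API key>")
elements = encoder.embed_documents(elements=[Text("hello"), Text("unstructured")])
print(len(elements[0].embeddings))
```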