unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-24 14:20:09 +00:00

Author	SHA1	Message	Date
Filip Knefel	5defe79bf2	docs: add information about MIME type of extracted images (#2515 ) Include information about what mime type is expected when extracting images. Co-authored-by: Filip Knefel <filip@unstructured.io>	2024-02-07 08:40:24 +00:00
David Potter	c100ce28a7	feat: add Vectara destination connector (#2357 ) Thanks to Ofer at Vectara, we now have a Vectara destination connector. - There are no dependencies since it is all REST calls to API - --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-01 14:38:34 +00:00
John	db67805ec6	feat: add support for partitioning .heic files (#2454 ) .heic files are an image filetype we have not supported. #### Testing ``` from unstructured.partition.image import partition_image png_filename = "example-docs/DA-1p.png" heic_filename = "example-docs/DA-1p.heic" png_elements = partition_image(png_filename, strategy="hi_res") heic_elements = partition_image(heic_filename, strategy="hi_res") for i in range(len(heic_elements)): print(heic_elements[i].text == png_elements[i].text) ``` --------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-01-30 04:49:00 +00:00
John	9320311a19	fix: check languages args (#2435 ) This PR is the last in a series of PRs for refactoring and fixing the language parameters (`languages` and `ocr_languages`) so we can address incorrect input by users. See #2293. It is recommended to go through this PR commit-by-commit and note the commit message. The most significant commit is "update check_languages..."	2024-01-29 20:12:08 +00:00
Ronny H	d5a6f4b82c	Docs updates (#2458 ) To test: > cd docs && make html Change logs: * Updates the best practice for table extraction to use `skip_infer_table_types` instead of `pdf_infer_table_structure`. * Fixed CSS issue with a duplicate search box. * Fixed RST warning message * Fixed typo on the Intro page.	2024-01-25 20:31:28 +00:00
Roman Isecke	a8de52e94f	feat: databricks volumes dest added (#2391 ) ### Description This adds a destination connector to write content to the Databricks Unity Catalog Volumes service. Currently there is an internal account that can be used for manual testing, but there is no dedicated test account, so this is not being added to the automated ingest tests that run in CI. To test locally: ```shell #!/usr/bin/env bash path="testpath/$(uuidgen)" PYTHONPATH=. python ./unstructured/ingest/main.py local \ --num-processes 4 \ --output-dir azure-test \ --strategy fast \ --verbose \ --input-path example-docs/fake-memo.pdf \ --recursive \ databricks-volumes \ --catalog "utic-dev-tech-fixtures" \ --volume "small-pdf-set" \ --volume-path "$path" \ --username "$DATABRICKS_USERNAME" \ --password "$DATABRICKS_PASSWORD" \ --host "$DATABRICKS_HOST" ```	2024-01-23 01:25:51 +00:00
Ronny H	4c772f6ed7	Updated docs on API Params and Filetype Supports (#2433 ) To test: > cd docs && make html Changelogs: * Fixed sphinx error due to malformed rst table on partition page * Updated API Params, ie. `extract_image_block_types` and `extract_image_block_to_payload` * Updated image filetype supports	2024-01-19 16:07:57 -08:00
Christine Straub	7378a378f6	enhancement: allow setting image block crop padding parameter (#2415 ) Closes #2320 . ### Summary In certain circumstances, adjusting the image block crop padding can improve image block extraction by preventing extracted image blocks from being clipped. ### Testing - PDF: [LM339-D_2-2.pdf](https://github.com/Unstructured-IO/unstructured/files/13968952/LM339-D_2-2.pdf) - Set two environment variables `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD` and `EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD` (e.g. `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD = 40`, `EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD = 20`) ``` elements = partition_pdf( filename="LM339-D_2-2.pdf", extract_image_block_types=["image"], ) ```	2024-01-19 06:28:32 +00:00
Ronny H	96fe7dd5e5	Kapa.ai widget installation (#2418 ) To test: > cd docs && make html > click "Ask AI" button on the bottom right-hand corner Changelogs: * Installed kapa.ai widget * fixed sphinx errors in opensearch & elasticsearch documentation	2024-01-18 00:17:11 +00:00
Matt Robinson	4d5038d9fd	enhancement: add support for bitmap images (#2414 ) ### Summary Adds support for bitmap images (`.bmp`) in both file detection and partitioning. Bitmap images will be processed with `partition_image` just like JPGs and PNGs. ### Testing ```python from unstructured.file_utils.filetype import detect_filetype from unstructured.partition.auto import partition from PIL import Image filename = "example-docs/layout-parser-paper-with-table.jpg" bmp_filename = "~/tmp/ayout-parser-paper-with-table.bmp" img = Image.open(filename) img.save(bmp_filename) detect_filetype(filename=bmp_filename) # Should be FileType.BMP elements = partition(filename=bmp_filename) ```	2024-01-17 22:50:36 +00:00
Ronny H	8e6bc10ba1	Docs various updates (#2386 ) To test: > cd docs && make html Changelogs: * Added verbiage about the cap limit and data usage for the Freemium API * Added deprecation warning on Staging bricks * Added warning and code examples to use the SaaS API Endpoints using CLI-vs-SDKs * Fixed example page formatting * Added deprecation warning on ``model_name`` param in favor of ``hi_res_model_name`` * Added ``extract_images_in_pdf`` usage and code example in ``partition_pdf`` section * Reorganized and improved the documentation Intro section	2024-01-17 21:01:01 +00:00
David Potter	bc791d53f4	feat: add opensearch source and destination connector (#2349 ) Adds OpenSearch as a source and destination. Since OpenSearch is a fork of Elasticsearch, these connectors rely heavily on inheriting the Elasticsearch connectors whenever possible. - Adds OpenSearch source connector to be able to ingest documents from OpenSearch. - Adds OpenSearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into OpenSearch. - Defines an example unstructured elements schema for users to be able to setup their unstructured OpenSearch indexes easily. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-17 04:31:49 +00:00
David Potter	d7f4c24e21	fix documentation for chroma (#2403 ) To test: cd docs && make HTML changelogs: point main readme to the correct connector html page point chroma docs to correct sample code --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-17 01:53:52 +00:00
David Potter	76e0d10e61	feat: add MongoDB source connector (#2393 ) Adds MongoDB as a source (we already had it as a destination connector) --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-16 20:56:29 +00:00
ryannikolaidis	2ce829ddd0	test: update test Elasticsearch mappings to validate embedding search (#2397 ) Currently in the Elasticsearch Destination ingest test we are writing the embeddings to a "float" type field. In order to leverage this field for similarity search it should be mapped as "dense_vector" with the respective dimensions assigned. This PR updates that mapping and adds a test query to validate that this works as expected.	2024-01-14 19:27:56 +00:00
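
The mapping change described in the commit above can be sketched as a plain Python dictionary. This is only an illustration: the field name `embeddings`, the dimension count, and the `cosine` similarity setting are assumptions, not values taken from the repository's test fixtures.

```python
# Hypothetical values: the real test may use a different field name and size.
EMBEDDING_DIMS = 384


def embedding_mapping(dims: int) -> dict:
    """Build an index mapping whose embedding field supports similarity search.

    A plain "float" field stores numbers but cannot be used for vector search;
    "dense_vector" with an explicit "dims" can.
    """
    return {
        "properties": {
            "embeddings": {  # assumed field name
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": "cosine",  # assumed similarity metric
            }
        }
    }


mapping = embedding_mapping(EMBEDDING_DIMS)
```

The resulting dictionary would be passed as the `mappings` body when creating the index.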
Steve Canny	2f2c48acd5	feat(ingest): add basic chunking to ingest (#2380 ) The new "basic" chunking strategy and overlap options need to be available from the ingest CLI. An ingest test of those features is also welcome, both to verify the ingest feature and to defend against regressions in the chunking code. Add a local ingest test exercising both the "basic" chunking strategy and intra-chunk overlap. Since there is no new source connector involved, use the local ingest source and destination. Update documentation to suit, filling in some details that hadn't made it into the docs yet.	2024-01-12 20:27:34 +00:00
jakub-sandomierz-deepsense-ai	411aa98bbf	feat: Salesforce connector accepts key path or value (#2321 ) (#2327 ) Solution to issue https://github.com/Unstructured-IO/unstructured/issues/2321. simple_salesforce API allows for passing private key path or value. This PR introduces this support for Ingest connector. Salesforce parameter "private-key-file" has been renamed to "private-key". It can contain one of following: - path to PEM encoded key file (as string) - key contents (PEM encoded string) If the provided value cannot be parsed as PEM encoded private key, then the file existence is checked. This way private key contents are not exposed to unnecessary underlying function calls.	2024-01-11 11:15:24 +00:00
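
The "contents first, then path" order described in the commit above can be sketched as follows. This is a simplified stand-in, not the connector's code: `resolve_private_key` is a hypothetical name, and a regular expression over the PEM armor lines replaces real key parsing, which the commit does not show.

```python
import re
from pathlib import Path
from typing import Tuple

# Assumption: matching the BEGIN/END armor lines stands in for actually
# parsing the key with a cryptography library.
PEM_PATTERN = re.compile(
    r"-----BEGIN [A-Z ]*PRIVATE KEY-----.+-----END [A-Z ]*PRIVATE KEY-----",
    re.DOTALL,
)


def resolve_private_key(value: str) -> Tuple[str, str]:
    """Classify `value` as PEM key contents or as a path to a key file.

    The PEM check runs first, so key contents are never handed to
    filesystem calls.
    """
    if PEM_PATTERN.search(value):
        return ("contents", value)
    if Path(value).is_file():
        return ("path", value)
    raise ValueError("value is neither PEM key contents nor an existing file")
```

Checking the contents first means the filesystem is only consulted for values that do not look like a key.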
Ronny H	98a0de30b4	Fix sphinx error (#2384 ) To test: > cd docs && make HTML changelogs: - remove unindented line in destination connector's sql.rst file - add elasticsearch page into destination_connector.rst file	2024-01-10 22:25:18 +00:00
Steve Canny	23edf2e911	feature(chunking): add basic strategy and overlap (#2367 ) This PR culminates the restructuring of chunking over my prior dozen-or-so commits by adding the new options to the API and documentation. Separately I'll be adding a new ingest test to defend against regression, although the integration test included in this PR will do a pretty good job of that too.	2024-01-10 22:19:24 +00:00
rvztz	950e5d68f9	feat: adds postgresql/sqlite destination connector (#2005 ) - Adds a destination connector to upload processed output into a PostgreSQL/Sqlite database instance. - Users are responsible to provide their instances. This PR includes a couple of configuration examples. - Defines the scripts required to setup a PostgreSQL instance with the unstructured elements schema. - Validates postgres/pgvector embedding storage and retrieval --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-04 19:33:16 +00:00
Ronny H	8e2bfcab18	Unstructured SaaS API subscription guide (#2341 ) To test: > cd docs && make html Sections: - New User sign-up: (i) registration form, (ii) payment processing, and (iii) use API key & URL - API Account maintenance: (i) update billing, (ii) opt-in email, (iii) rotate API key, and (iv) cancel plan - Get Supports	2024-01-03 14:38:03 -08:00
ryannikolaidis	dd1443ab6f	feat: add Qdrant ingest destination connector (#2338 ) This PR intends to add [Qdrant](https://qdrant.tech/) as a supported ingestion destination. - Implements CLI and programmatic usage. - Documentation update - Integration test script --- Clone of #2315 to run with CI secrets --------- Co-authored-by: Anush008 <anushshetty90@gmail.com> Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>	2024-01-02 22:08:20 +00:00
Ronny H	ac380ce989	Added AWS Marketplace docs and improved Azure Marketplace docs (#2248 ) To test: > cd docs && make HTML Change logs: - Added AWS Marketplace documentation - Improved Azure Marketplace documentation - Networking section	2023-12-20 20:13:47 +00:00
Ahmet Melek	fd293b3e78	feat: add elasticsearch destination connector (#2152 ) Closes https://github.com/Unstructured-IO/unstructured/issues/1842 Closes https://github.com/Unstructured-IO/unstructured/issues/2202 Closes https://github.com/Unstructured-IO/unstructured/issues/2203 This PR: - Adds Elasticsearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into Elasticsearch. - Defines an example unstructured elements schema for users to be able to setup their unstructured elasticsearch indexes easily. - Includes parallelized upload and lazy processing for elasticsearch destination connector. - Rearranges elasticsearch test helpers to source, destination, and common folders. - Adds util functions to be able to batch iterables in a lazy way for uploads - Fixes a bug where removing the optional parameter `--fields` broke the connector due to an integer processing error. - Fixes a bug where using an [elasticsearch config](`8fa5cbf036/unstructured/ingest/connector/elasticsearch.py (L26-L35)`) for a destination connector resulted in a serialization issue when optional parameter `--fields` was not provided.	2023-12-20 01:26:58 +00:00
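
Lazy batching of an iterable, as mentioned in the commit above, can be sketched with the standard library alone. The name `batch_lazily` is hypothetical and is not the utility added in that PR.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_lazily(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to `batch_size` items, pulling from `items` on demand.

    Only one batch is held in memory at a time, so the input can be a
    generator of arbitrary length.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

An uploader could then iterate over `batch_lazily(elements, 100)` and send one request per batch.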
David Potter	4b8352e0f5	feat: add chroma destination connector (#2240 ) Adds Chroma (also known as ChromaDB) as a vector destination. Currently Chroma is an in-memory, single-process oriented library with plans for a hosted and/or more production-ready solution: https://docs.trychroma.com/deployment Though they now claim to support multiple clients hitting the database at once, I found that it was inconsistent. Sometimes multiprocessing worked (maybe 1 out of 3 times), but the other times I would get different errors. So I kept it single process. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2023-12-19 16:58:23 +00:00
cragwolfe	bd8a74d686	chore: shell scripts default indent of 2 instead of 4 (#2287 ) Given the tendency for shell scripts to easily enter into a few levels of indentation and long line lengths, update the default to 2 spaces.	2023-12-19 07:48:21 +00:00
cragwolfe	9efc22c0fc	build: release commit for 0.11.5 (#2285 ) Also fix broken link in docs.	2023-12-16 18:25:55 -08:00
Roman Isecke	ac302689a0	chore: update sphinx ingest docs with new connectors (#2245 ) Replacing https://github.com/Unstructured-IO/unstructured/pull/2243	2023-12-11 21:29:41 +00:00
David Potter	cde11d1eb0	feat: Add sftp source connector (#2163 ) Adds source connector for SFTP which uses fsspec and paramiko via fsspec. Paramiko is the standard sftp package for python used in pysftp etc... ``` --username foo \ --password bar \ --remote-url sftp://localhost:47474/upload/ ``` Will only download a specifically requested file if it has an extension. (i.e. `--remote-url sftp://localhost:47474/upload/bob.zip`) It will treat any other remote_url as a folder path. This is intentional. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2023-12-07 19:33:19 +00:00
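
The file-versus-folder rule in the commit above can be sketched as below. `looks_like_file` is a hypothetical helper, and the connector's actual check may differ.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def looks_like_file(remote_url: str) -> bool:
    """Return True when the last path segment of the URL has an extension.

    Under the rule sketched here, URLs without an extension are treated
    as folder paths.
    """
    path = urlparse(remote_url).path
    return PurePosixPath(path).suffix != ""
```

With this rule a directory whose name contains a dot would also be classified as a file.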
rvztz	ce905dd098	feat: Weaviate destination connector (#1963 ) Closes #1781. - Adds a Weaviate destination connector - The connector receives a host for the weaviate instance and a weaviate class name. - Defines a weaviate schema for json elements. - Defines the pre-processing to conform unstructured's schema to the proposed weaviate schema.	2023-12-01 22:27:41 +00:00
Ronny H	d80abf0714	Reorganized the Examples section in Documentation & add Databricks example (#1855 ) To test: > cd docs && make html Change logs: * Examples are reorganized to have its own page * Removed two old examples, ie. "file-utils" & "sentiment analysis". * Added two examples: "RAG with Unstructured, LangChain, and ChromaDB" & "Multi-Files Processing with S3 Connector and API" * Reorganized and added detailed API documentation: (i) usage, (ii) SDKs, (iii) Azure Marketplace, (iv) AWS Marketplace, (v) parameters and validation errors	2023-11-30 01:24:43 +00:00
Ahmet Melek	ed08773de7	feat: add pinecone destination connector (#1774 ) Closes https://github.com/Unstructured-IO/unstructured/issues/1414 Closes #2039 This PR: - Uses Pinecone python cli to implement a destination connector for Pinecone and provides the ingest readme requirements [(here)](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest#the-checklist) for the connector - Updates documentation for the s3 destination connector - Alphabetically sorts setup.py contents - Updates logs for the chunking node in ingest pipeline - Adds a baseline session handle implementation for destination connectors, to be able to parallelize their operations - For the [bug](https://github.com/Unstructured-IO/unstructured/issues/1892) related to persisting element data to ingest embedding nodes; this PR tests the [solution](https://github.com/Unstructured-IO/unstructured/pull/1893) with its ingest test - Solves a bug on ingest chunking params with [bugfix on chunking params and implementing related test](`69e1949a6f`) --------- Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>	2023-11-29 22:37:32 +00:00
qued	1576e0b891	docs: update docker image link (#2186 ) Updated docker image link in documentation to be consistent with README.	2023-11-29 11:40:09 -06:00
ryannikolaidis	b718db22b8	docs: add mongodb destination connector to ingest docs (#2136 ) The mongodb documentation page exists but is missing from the list of destination connectors, meaning it's not actually visible in the documentation.	2023-11-21 19:11:25 +00:00
Roman Isecke	b8af2f18bb	add mongo db destination connector (#2068 ) ### Description This adds the basic implementation of pushing the generated json output of partition to mongodb. None of this code provisions the mongo db instance so things like adding a search index around the embedding content must be done by the user. Any sort of schema validation would also have to take place via user-specific configuration on the database. This update makes no assumptions about the configuration of the database itself.	2023-11-16 22:40:22 +00:00
qued	ad09a869b5	fix: update slack link to link shortener (#2010 ) Per @tabossert we're now using a link shortener behind which we can rotate the link to keep it current. That way we (🤞 ) never have to update this here again. #### Testing: Links should work. No more links should exist in the documentation except this one.	2023-11-06 15:47:18 +00:00
Matt Robinson	e5bcd36475	docs: update slack links (#1990 ) ### Summary A user in the [Community Slack](https://unstructuredw-kbe4326.slack.com/archives/C043YA29U0J/p1698933003702919) reported having difficulty signing up for Slack using the links from the documentation. Updated the links to use the invite link that worked for him, which came from [this blog post](https://medium.com/unstructured-io/setting-up-a-private-retrieval-augmented-generation-rag-system-with-local-vector-database-d42f34692ca7).	2023-11-05 11:26:34 -08:00
Roman Isecke	d09c8c0cab	test: update ingest dest tests to follow set pattern (#1991 ) ### Description Update all destination tests to match pattern: * Don't omit any metadata to check full schema * Move azure cognitive dest test from src to dest * Split delta table test into separate src and dest tests * Fix azure cognitive search and add to dest tests being run (wasn't being run originally)	2023-11-03 12:46:56 +00:00
Roman Isecke	901704b6c0	update sphinx docs with ingest content (#1969 ) ### Description Create a new structure for ingest content in the docs, update with all configs	2023-11-02 20:40:35 +00:00
Matt Robinson	d9c035edb1	docs: no more bricks (#1967 ) ### Summary We no longer use the "bricks" terminology for partitioning functions, etc. in the library. This PR updates various references to bricks within the repo and the docs. This is just an initial pass to swap the terminology out; it'll likely be helpful to reorganize the docs a bit as well. --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>	2023-11-02 09:43:26 -05:00
Roman Isecke	922bc84cee	Update fsspecs-specific source connector docs (#1898 ) ### Description Add in the fsspec configs needed for the fsspec-based connectors To match the behavior of the original CLI, the default used by the click option was mirrored in the base config for the api endpoint.	2023-10-31 16:09:46 +00:00
Matt Robinson	21b45ae8b0	docs: update to new logo (#1937 ) ### Summary Updates the docs and README to use the new Unstructured logo. The README links to the raw GitHub user content, so the change isn't reflected in the README on the branch, but will update after the image is merged to main. ### Testing Here's what the updated docs look like locally: <img width="237" alt="image" src="https://github.com/Unstructured-IO/unstructured/assets/1635179/f13d8b4b-3098-4823-bd16-a6c8dfcffe67"> <img width="1509" alt="image" src="https://github.com/Unstructured-IO/unstructured/assets/1635179/3b8aae5e-34aa-48c0-90f9-f5f3f0f1e26d"> <img width="1490" alt="image" src="https://github.com/Unstructured-IO/unstructured/assets/1635179/e82a876f-b19a-4573-b6bb-1c0215d2d7a9">	2023-10-31 15:39:19 +00:00
Amanda Cameron	0584e1d031	chore: fix infer_table bug (#1833 ) Carrying `skip_infer_table_types` to `infer_table_structure` in partition flow. Now PPT/X, DOC/X, etc. Table elements should not have a `text_as_html` field. Note: I've continued to exclude this var from partitioners that go through html flow, I think if we've already got the html it doesn't make sense to carry the infer variable along, since we're not 'infer-ing' the html table in these cases. TODO: ✅ add unit tests --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: amanda103 <amanda103@users.noreply.github.com>	2023-10-24 00:11:53 +00:00
Mallori Harrell	00635744ed	feat: Adds local embedding model (#1619 ) This PR adds a local embedding model option as an alternative to using our OpenAI embedding brick. This brick uses LangChain's HuggingFaceEmbeddings.	2023-10-19 11:51:36 -05:00
Jack Retterer	b8f24ba67e	Added AWS Bedrock embeddings (#1738 ) Summary: Added support for AWS Bedrock embeddings. Leverages "amazon.titan-tg1-large" for the embedding model. Test - find your aws secret access key and key id; make sure the account has access to bedrock's titan embed model - follow the instructions in `d5e797cd44/docs/source/bricks/embedding.rst (bedrockembeddingencoder)` --------- Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com> Co-authored-by: Yao You <yao@unstructured.io> Co-authored-by: Yao You <theyaoyou@gmail.com> Co-authored-by: Ahmet Melek <ahmetmeleq@gmail.com>	2023-10-18 19:36:51 -05:00
Roman Isecke	adacd8e5b1	roman/update ingest pipeline docs (#1689 ) ### Description * Update all existing connector docs to use new pipeline approach ### Additional changes: * Some defaults were set for the runners to match those in the configs to make those easy to handle, i.e. the biomed runner: ```python max_retries: int = 5, max_request_time: int = 45, decay: float = 0.3, ```	2023-10-17 16:11:16 +00:00
Roman Isecke	b265d8874b	refactoring linting (#1739 ) ### Description Currently linting only takes place over the base unstructured directory but we support python files throughout the repo. It makes sense for all those files to also abide by the same linting rules so the entire repo was set to be inspected when the linters are run. Along with that autoflake was added as a linter which has a lot of added benefits such as removing unused imports for you that would currently break flake and require manual intervention. The only real relevant changes in this PR are in the `Makefile`, `setup.cfg`, and `requirements/test.in`. The rest is the result of running the linters.	2023-10-17 12:45:12 +00:00
Amanda Cameron	d0c84d605c	chore: updating table docs with file extensions (#1702 ) gh issue: https://github.com/Unstructured-IO/unstructured/issues/1691 Adding filetype extensions from this [list](`f98d5e65ca/unstructured/file_utils/filetype.py (L154-L200)`) where applicable. --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Crag Wolfe <crag@unstructuredai.io>	2023-10-14 14:14:52 -07:00
Ahmet Melek	94836cfad4	feat: add file-based access permissions for SharePoint ingest (#1628 ) This PR: - defines rbac_data as a SourceMetadata field, - manages connections to an external api for obtaining rbac data with ConnectorRBAC class, - serializes rbac data and saves it to the disk, - matches the rbac_data in the disk to each IngestDoc, using a common field, - forwards rbac data to Elements, via the partition() function To test the changes, run `examples/ingest/sharepoint/ingest.sh` with the relevant rbac & connector credentials --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-10-13 00:38:08 +00:00
Dev Khant	f09b87da23	Doc : replace link `upstream connectors` with `source connectors` (#1683 ) Fixes #1502 Here I have replaced `stream_connectors.html` with `source_connectors.html`.	2023-10-09 21:37:51 -07:00

219 Commits