unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-30 04:21:20 +00:00

Author	SHA1	Message	Date
ryannikolaidis	d25e6081d8	chore: add opensearch extra (#2419 )	2024-01-18 05:21:37 +00:00
Ronny H	96fe7dd5e5	Kapa.ai widget installation (#2418 ) To test: > cd docs && make html > click "Ask AI" button on the bottom right-hand corner Changelogs: * Installed kapa.ai widget * fixed sphinx errors in opensearch & elasticsearch documentation	2024-01-18 00:17:11 +00:00
Matt Robinson	4d5038d9fd	enhancement: add support for bitmap images (#2414 ) ### Summary Adds support for bitmap images (`.bmp`) in both file detection and partitioning. Bitmap images will be processed with `partition_image` just like JPGs and PNGs. ### Testing ```python from unstructured.file_utils.filetype import detect_filetype from unstructured.partition.auto import partition from PIL import Image filename = "example-docs/layout-parser-paper-with-table.jpg" bmp_filename = "~/tmp/layout-parser-paper-with-table.bmp" img = Image.open(filename) img.save(bmp_filename) detect_filetype(filename=bmp_filename) # Should be FileType.BMP elements = partition(filename=bmp_filename) ```	2024-01-17 22:50:36 +00:00
Ronny H	8e6bc10ba1	Docs various updates (#2386 ) To test: > cd docs && make html Changelogs: * Added verbiage about the cap limit and data usage for the Freemium API * Added deprecated warning on Staging bricks * Added warning and code examples to use the SaaS API Endpoints using CLI-vs-SDKs * Fixed example page formatting * Added deprecation warning on ``model_name`` param in favor of ``hi_res_model_name`` * Added ``extract_images_in_pdf`` usage and code example in ``partition_pdf`` section * Reorganized and improved the documentation Intro section	2024-01-17 21:01:01 +00:00
ryannikolaidis	f23f20c1dc	fix: postgres destination connector serialization (#2411 ) This fixes the serialization of the Elasticsearch destination connector. Presence of the _client object breaks serialization due to TypeError: cannot pickle '_thread.lock' object. This removes that object before serialization.	2024-01-17 17:39:32 +00:00
Yao You	ae24136238	chore: update installation instructions for conda (#2409 ) - bump the pytorch version for conda to match that in requirements/extra-pdf-image.txt (to 2.1.2)	2024-01-17 17:27:37 +00:00
David Potter	bc791d53f4	feat: add opensearch source and destination connector (#2349 ) Adds OpenSearch as a source and destination. Since OpenSearch is a fork of Elasticsearch, these connectors rely heavily on inheriting the Elasticsearch connectors whenever possible. - Adds OpenSearch source connector to be able to ingest documents from OpenSearch. - Adds OpenSearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into OpenSearch. - Defines an example unstructured elements schema for users to be able to setup their unstructured OpenSearch indexes easily. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-17 04:31:49 +00:00
David Potter	d7f4c24e21	fix documentation for chroma (#2403 ) To test: cd docs && make html changelogs: point main readme to the correct connector html page point chroma docs to correct sample code --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-17 01:53:52 +00:00
Austin Walker	aaf3fd982b	chore: bump base image (#2410 ) Propagating the openssl fix from Unstructured-IO/base-images#12	2024-01-17 01:32:58 +00:00
Steve Canny	fcc919b9f5	rfctr(chunking): add chunking arg constants (#2408 ) There are several public interface points for chunking and they all provide a default for arguments like `max_characters`. These defaults are provided by literal values. Keeping these synchronized has become a problem. Declare constant values for chunking argument default values and use those wherever a non-trivial default is used in an end-user facing API function.	2024-01-16 21:48:36 +00:00
David Potter	76e0d10e61	feat: add MongoDB source connector (#2393 ) Adds MongoDB as a source (we already had it as a destination connector) --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-16 20:56:29 +00:00
John	125b63cd7c	refactor: extract language helper functions (#2370 ) This PR is one in a series of PRs for refactoring and fixing the `languages` parameter so it can address incorrect input by users. #2293 Refactor `_convert_language_code_to_pytesseract_lang_code` and extract `_get_iso639_language_object` to its own function ``` from unstructured.partition.lang import _convert_language_code_to_pytesseract_lang_code as convert convert("English") # this will raise an error on both main and this branch convert("en") # this will return "eng" on both branches ```	2024-01-16 17:51:03 +00:00
jakub-sandomierz-deepsense-ai	ee0441efea	enhancement: normalize Salesforce artifact extensions (#2402 ) Connectors use predictable result file naming convention so consumers of library can write code in abstraction of particular connector. This change introduces compatibility with said naming convention. `_output_filename` returns now filename with format.	2024-01-16 10:36:00 +00:00
Christine Straub	ee06260987	feat: keep all image elements when using `hi_res` strategy. (#2382 ) ### Summary The goal of this PR is to keep all image elements when using "hi_res" strategy. Previously, `Image` elements with small chunks of text were ignored unless the image block extraction parameters (`extract_images_in_pdf` or `extract_image_block_types`) were specified. Now, all image elements are kept regardless of whether the image block extraction parameters are specified. ### Testing - on `main` branch, ``` elements = partition_pdf( filename="example-docs/embedded-images.pdf", strategy="hi_res", ) image_elements = [el for el in elements if el.category == ElementType.IMAGE] print("number of image elements: ", len(image_elements)) ``` The above code will display `number of image elements: 0`. - on this `feature` branch, The same code will display `number of image elements: 3` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-01-15 23:19:17 +00:00
John	1f0826ab0a	pin unstructured-client (#2392 ) Replacement for #2311 since python 3.8 was dropped as a supported version. Unstructured-client added `api_key_auth` as a param to `UnstructuredClient` in [version 0.9.0](`8c93115c92`). This pins the version of `unstructured-client` so users do not receive `TypeError: UnstructuredClient.__init__() got an unexpected keyword argument 'api_key_auth'`	2024-01-15 17:26:38 +00:00
Matt Robinson	36faf677c0	enhancement: file detection for `.wav` files (#2387 ) ### Summary Adds filetype detection for `.wav` audio files ### Testing ```python from unstructured.file_utils.filetype import detect_filetype filename = "example-docs/CantinaBand3.wav" detect_filetype(filename=filename) # Should be FileType.WAV ```	2024-01-15 16:50:49 +00:00
ryannikolaidis	d7980b3665	fix: elasticsearch serialization issue (#2399 ) This fixes the serialization of the Elasticsearch destination connector. Presence of the _client object breaks serialization due to TypeError: cannot pickle '_thread.lock' object. This removes that object before serialization.	2024-01-14 23:07:37 +00:00
ryannikolaidis	f07fc6e03a	chore: make Elasticsearch Destination connector write settings optional (#2398 ) * set required=False to all write config options * update num_processes to default to 1 since that will always work	2024-01-14 22:31:05 +00:00
ryannikolaidis	2ce829ddd0	test: update test Elasticsearch mappings to validate embedding search (#2397 ) Currently in the Elasticsearch Destination ingest test we are writing the embeddings to a "float" type field. In order to leverage this field for similarity search it should be mapped as "dense_vector" with the respective dimensions assigned. This PR updates that mapping and adds a test query to validate that this works as expected.	2024-01-14 19:27:56 +00:00
ryannikolaidis	018cd7f71b	fix: pinecone serialization issue (#2394 ) This fixes the serialization of the Pinecone destination connector. Presence of the PineconeIndex object breaks serialization due to TypeError: cannot pickle '_thread.lock' object. This removes that object before serialization.	2024-01-13 00:08:33 +00:00
Steve Canny	2f2c48acd5	feat(ingest): add basic chunking to ingest (#2380 ) The new "basic" chunking strategy and overlap options need to be available from the ingest CLI. An ingest test of those features is also welcome, both to verify the ingest feature and to defend against regressions in the chunking code. Add a local ingest test exercising both the "basic" chunking strategy and intra-chunk overlap. Since there is no new source connector involved, use the local ingest source and destination. Update documentation to suit, filling in some details that hadn't made it into the docs yet.	2024-01-12 20:27:34 +00:00
Ahmet Melek	50f142d4e0	chore(ingest): update pinecone index creation specifications (#2389 ) This PR updates Pinecone index creation in the ingest test due to a recent update in Pinecone API. Due to a change in Pinecone API, it is not allowed anymore to specify both number of replicas and number of pods: `Cannot specify both replicas and pods` We solve it by removing the replica specification while sending the index creation request. ``` Creating index ingest-test-28418 Index creation success: 201 ```	2024-01-12 02:49:09 +00:00
jakub-sandomierz-deepsense-ai	411aa98bbf	feat: Salesforce connector accepts key path or value (#2321 ) (#2327 ) Solution to issue https://github.com/Unstructured-IO/unstructured/issues/2321. simple_salesforce API allows for passing private key path or value. This PR introduces this support for Ingest connector. Salesforce parameter "private-key-file" has been renamed to "private-key". It can contain one of following: - path to PEM encoded key file (as string) - key contents (PEM encoded string) If the provided value cannot be parsed as PEM encoded private key, then the file existence is checked. This way private key contents are not exposed to unnecessary underlying function calls.	2024-01-11 11:15:24 +00:00
jakub-sandomierz-deepsense-ai	5581e6a4c4	fix: Ingest GCS accepts JSON auth token (#2322 ) (#2371 ) FSSpec serialization caused conversion of JSON token to string with single quotes. GCS requires JSON token in form of dict so this format is now assured. Other forms of auth are not modified but there is improved validation for all of the options.	2024-01-11 09:03:47 +00:00
John	bfd0258ba5	chore: refactor _convert_to_standard_langcode (#2369 ) This PR is one in a series of PRs for refactoring and fixing the `languages` parameter so it can address incorrect input by users. #2293 This PR adds a dictionary for helping map fully spelled out languages to tesseract language codes --------- Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>	2024-01-11 00:34:13 +00:00
Roman Isecke	8dc130c920	fix: ensure consistency in method signatures across destination connectors (#2381 ) ### Description * Make sure all destination connectors implement the base abstract methods using the same signatures. * Also leverage conform dict in the base methods to make sure it's called in a consistent fashion. * Additional updates to move the common code into the base destination connector class	2024-01-11 00:19:49 +00:00
Ronny H	98a0de30b4	Fix sphinx error (#2384 ) To test: > cd docs && make html changelogs: - remove unindented line in destination connector's sql.rst file - add elasticsearch page into destination_connector.rst file	2024-01-10 22:25:18 +00:00
Steve Canny	23edf2e911	feature(chunking): add basic strategy and overlap (#2367 ) This PR culminates the restructuring of chunking over my prior dozen-or-so commits by adding the new options to the API and documentation. Separately I'll be adding a new ingest test to defend against regression, although the integration test included in this PR will do a pretty good job of that too.	2024-01-10 22:19:24 +00:00
Roman Isecke	a8a103bc5c	bug: don't redact text during serialization if there is no value (#2379 ) ### Description The current approach injects the redacted text for all sensitive fields regardless of whether they have a value. This updates the code to only replace the value with the redacted text if the value exists.	2024-01-10 18:52:43 +00:00
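The behavior described in a8a103bc5c can be illustrated with a minimal sketch. This is not the project's code: `redact_sensitive`, `REDACTED`, and the sample keys are hypothetical names, and "the value exists" is assumed here to mean "is not None".

```python
from typing import Any, Dict, Set

# Hypothetical placeholder text; the string the project actually uses is not shown in the log.
REDACTED = "***REDACTED***"


def redact_sensitive(config: Dict[str, Any], sensitive_keys: Set[str]) -> Dict[str, Any]:
    """Return a copy of config with sensitive values replaced, but only when a value is set."""
    redacted: Dict[str, Any] = {}
    for key, value in config.items():
        if key in sensitive_keys and value is not None:
            # A real value is present, so hide it.
            redacted[key] = REDACTED
        else:
            # Non-sensitive field, or a sensitive field that was never set: keep as-is.
            redacted[key] = value
    return redacted
```

With this shape, an unset sensitive field stays `None` in the serialized output instead of appearing as if a secret had been provided.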
Roman Isecke	22c0bad246	bug: weaviate serialization broken (#2378 ) ### Description This PR handles two things: * Fixes the serialization of the weaviate destination connector since the client content breaks serialization when present due to `TypeError: cannot pickle '_thread.lock' object`. * Set finer auth control rather than generic dictionary on the CLI and access config.	2024-01-10 17:22:37 +00:00
Roman Isecke	b37b4689bc	drop python3.8 (#2372 ) ### Description Remove all uses of python3.8 --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com> 0.12.0	2024-01-09 23:37:30 +00:00
Christine Straub	e2f0de3c50	chore: bump unstructured-inference=0.7.21 (#2361 )	2024-01-08 21:05:04 +00:00
Roman Isecke	7caf255316	bug: omit session handler from serialization to avoid mp issues (#2366 ) ### Description The session handler variable can be anything, because it's specific to the SDK being used for the connector. This can break the serialization depending on what that is. To avoid this all together, the session handler itself is not serialized. Instead, it needs to be recreated if an object is serialized and then deserialized.	2024-01-08 19:14:26 +00:00
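A minimal sketch of the pattern described in 7caf255316 (and in the related `_thread.lock` serialization fixes elsewhere in this log), assuming a connector that lazily creates its client. `FakeClient` and `Connector` are hypothetical stand-ins, not classes from the library.

```python
import pickle
import threading


class FakeClient:
    """Stand-in for an SDK client; it holds a lock, so it cannot be pickled."""

    def __init__(self, url: str) -> None:
        self.url = url
        self._lock = threading.Lock()


class Connector:
    """Hypothetical connector that caches its client and omits it from serialization."""

    def __init__(self, url: str) -> None:
        self.url = url
        self._client = None

    @property
    def client(self) -> FakeClient:
        # Recreate the client on demand, e.g. after deserialization in a worker process.
        if self._client is None:
            self._client = FakeClient(self.url)
        return self._client

    def __getstate__(self) -> dict:
        # Copy the instance state and drop the unpicklable client before serialization.
        state = self.__dict__.copy()
        state["_client"] = None
        return state


if __name__ == "__main__":
    connector = Connector("http://localhost:9200")
    _ = connector.client  # force client creation
    restored = pickle.loads(pickle.dumps(connector))  # succeeds: the client is not serialized
    assert restored.client.url == connector.url       # client is rebuilt lazily
```

Without the `__getstate__` override, pickling the connector after the client has been created would fail on the lock held by the client.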
jakub-sandomierz-deepsense-ai	0ca154a0f3	Fix: MongoDB connector URI password redaction, basic unit tests for Git connector (#2268 ) MongoDB connector: Issue: [MongoDB documentation](https://www.mongodb.com/docs/manual/reference/connection-string/) states that characters `$ : / ? # [ ] @` must be percent encoded. URI with password containing such special character will not be redacted. Fix: This fix removes usage of `unquote_plus` on password which allows detected password to match with one inside URI and successfully replace it. Git connector: Added very basic unit tests for repository filtering methods. Their impact is rather minimal but showcases current limitation in `is_file_type_supported` method.	2024-01-08 11:27:08 +00:00
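The MongoDB redaction issue in 0ca154a0f3 can be sketched with the standard library alone. `redact_password` and the placeholder string are hypothetical; the point is only that the password must be matched in its percent-encoded form, because that is the form that appears in the URI.

```python
from urllib.parse import unquote_plus, urlsplit


def redact_password(uri: str, placeholder: str = "***REDACTED***") -> str:
    """Replace the password in a connection URI, matching it exactly as written in the URI."""
    # urlsplit() leaves the password percent-encoded, so it still matches the URI text.
    password = urlsplit(uri).password
    if not password:
        return uri
    return uri.replace(f":{password}@", f":{placeholder}@", 1)


if __name__ == "__main__":
    uri = "mongodb://user:p%40ss%3Aword@host:27017/db"
    print(redact_password(uri))
    # Decoding first yields "p@ss:word", which does not occur in the URI,
    # so a replace based on the decoded value would leave the password in place.
    print(unquote_plus("p%40ss%3Aword") in uri)
```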
Klaijan	e65a44eabb	feat: update cct eval for text dir (#2299 ) This change edits the `measure_text_extraction_accuracy` function to allow a directory of txt files as well as json. The function also takes an `output_type` input, which must be either "json" or "txt", and checks whether the files under the given directory/list are all of the specified file type. To test this feature, run the following code: ```PYTHONPATH=. python unstructured/ingest/evaluate.py measure-text-extraction-accuracy-command --output_dir <clean-text-path> --source_dir <cct-label-path> --output_type txt```	2024-01-05 23:34:53 +00:00
Ahmet Melek	d6674ba27e	chore: update ingest azure cognitive search endpoint (#2353 ) This PR: - updates ingest azure cognitive search destination connector test to move into a new service. - changes response parsing logic in the test.	2024-01-05 05:26:12 +00:00
Steve Canny	7a1e732aa1	feat(chunking): add inter-chunk overlap (#2309 ) Reviewer: This PR probably reviews faster commit-by-commit. Each of the commits is groomed and focuses on a separate clear aspect of this implementation. This PR adds inter-chunk overlap capability to chunking. It does not yet expose it via the API. Inter-chunk overlap is overlap between whole pre-chunks, prior to any text-splitting required for oversized chunks. Contrast with intra-chunk overlap implemented in the prior PR which implements overlap on these latter text-splitting boundaries. Inter-chunk overlap is disabled by default since a pre-chunk already has a "clean" semantic boundary (composed of whole elements) and adding overlap there introduces noise from the adjacent context. If the user wants inter-chunk overlap they must specify `overlap_all=True` in the options. Inter-chunk overlap uses the same `overlap` length value used by intra-chunk overlap and does not overlap when that value is 0.	2024-01-05 01:24:12 +00:00
Steve Canny	22cbdce7ca	fix(html): unequal row lengths in HTMLTable.text_as_html (#2345 ) Fixes #2339 Fixes to HTML partitioning introduced with v0.11.0 removed the use of `tabulate` for forming the HTML placed in `HTMLTable.text_as_html`. This had several benefits, but part of `tabulate`'s behavior was to make row-length (cell-count) uniform across the rows of the table. Lacking this prior uniformity produced the downstream problem reported in the issue referenced above. On closer inspection, the method used to "harvest" cell-text was producing more text-nodes than there were cells and was sensitive to where whitespace was used to format the HTML. It also "moved" text to different columns in certain rows. Refine the cell-text gathering mechanism to get exactly one text string for each row cell, eliminating whitespace formatting nodes and producing strict correspondence between the number of cells in the original HTML table row and that placed in `HTMLTable.text_as_html`. HTML tables that are uniform (every row has the same number of cells) will produce a uniform table in `.text_as_html`. Merged cells may still produce a non-uniform table in `.text_as_html` (because the source table is non-uniform).	2024-01-04 21:53:19 +00:00
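The "exactly one text string for each row cell" idea from 22cbdce7ca can be sketched as follows. This is an illustration only: it uses the standard-library XML parser on a well-formed snippet rather than the project's HTML parser, and `table_rows` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import List


def table_rows(html: str) -> List[List[str]]:
    """Return one normalized text string per <td>/<th> cell, row by row."""
    root = ET.fromstring(html)
    rows: List[List[str]] = []
    for tr in root.iter("tr"):
        cells: List[str] = []
        for cell in tr:
            if cell.tag not in ("td", "th"):
                continue
            # Join every text node inside the cell, then collapse whitespace,
            # so indentation between tags never becomes an extra "cell".
            cells.append(" ".join("".join(cell.itertext()).split()))
        rows.append(cells)
    return rows


if __name__ == "__main__":
    html = "<table>\n <tr>\n  <td>a</td>\n  <td><b>b</b> c</td>\n </tr>\n <tr><td>d</td></tr>\n</table>"
    print(table_rows(html))
```

Because cells are counted from the cell elements rather than from text nodes, the number of strings per row matches the number of cells in the source row, including rows that are shorter than others.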
rvztz	950e5d68f9	feat: adds postgresql/sqlite destination connector (#2005 ) - Adds a destination connector to upload processed output into a PostgreSQL/Sqlite database instance. - Users are responsible to provide their instances. This PR includes a couple of configuration examples. - Defines the scripts required to setup a PostgreSQL instance with the unstructured elements schema. - Validates postgres/pgvector embedding storage and retrieval --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-04 19:33:16 +00:00
Christine Straub	5b0ae3fd8b	Refactor: rename image extraction kwargs (#2303 ) Currently, we're using different kwarg names in partition() and partition_pdf(), which has implications for the API since it goes through partition(). ### Summary - rename `extract_element_types` -> `extract_image_block_types` - rename `image_output_dir_path` to `extract_image_block_output_dir` - rename `extract_to_payload` -> `extract_image_block_to_payload` - rename `pdf_extract_images` -> `extract_images_in_pdf` in `partition.auto` - add unit tests to test element extraction for `pdf/image` via `partition.auto` ### Testing CI should pass.	2024-01-04 17:52:00 +00:00
Ronny H	8e2bfcab18	Unstructured SaaS API subscription guide (#2341 ) To test: > cd docs && make html Sections: - New User sign-up: (i) registration form, (ii) payment processing, and (iii) use API key & URL - API Account maintenance: (i) update billing, (ii) opt-in email, (iii) rotate API key, and (iv) cancel plan - Get Supports 0.11.8	2024-01-03 14:38:03 -08:00
Austin Walker	91b892c79d	fix: Fix api_url param to partition_via_api (#2342 ) Closes #2340 We need to make sure the custom url is passed to our client. The client constructor takes the base url, so for compatibility we can continue to take the full url and strip off the path. To verify, run the api locally and confirm you can make calls to it. ``` # In unstructured-api make run-web-app # In ipython in this repo from unstructured.partition.api import partition_via_api filename = "example-docs/layout-parser-paper.pdf" partition_via_api(filename=filename, api_url="http://localhost:8000") ``` 0.11.7	2024-01-03 20:08:48 +00:00
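The "take the full url and strip off the path" step from 91b892c79d might look like the sketch below. `base_url_from` is a hypothetical helper, and the example path is only illustrative; the actual handling inside `partition_via_api` is not shown in this log.

```python
from urllib.parse import urlsplit


def base_url_from(api_url: str) -> str:
    """Keep only scheme and host[:port], dropping any path, query, or fragment."""
    parts = urlsplit(api_url)
    return f"{parts.scheme}://{parts.netloc}"


if __name__ == "__main__":
    # A full endpoint URL and a bare base URL both reduce to the same base.
    print(base_url_from("http://localhost:8000/general/v0/general"))
    print(base_url_from("http://localhost:8000"))
```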
Yao You	1b70ea86b3	fix: update table structure eval to use new table inference interface (#2306 ) Provide OCR tokens for the table eval script. Right now `unstructured-inference` can compute OCR components when they are not passed in, but in a future release we will be required to pass OCR results into the table structure extraction model: `d3b2981313/CHANGELOG.md (0719)` This PR prepares for the upcoming change by passing OCR tokens into the table structure extraction process. ## test Create a new virtual env that follows the setup in the readme, then upgrade `inference` with `pip install unstructured-inference --upgrade`. The test `PYTHONPATH=. pytest test_unstructured/metrics/test_table_structure.py` fails on the main branch and passes with this PR. --------- Co-authored-by: Austin Walker <awalk89@gmail.com>	2024-01-03 19:41:51 +00:00
ryannikolaidis	dd1443ab6f	feat: add Qdrant ingest destination connector (#2338 ) This PR intends to add [Qdrant](https://qdrant.tech/) as a supported ingestion destination. - Implements CLI and programmatic usage. - Documentation update - Integration test script --- Clone of #2315 to run with CI secrets --------- Co-authored-by: Anush008 <anushshetty90@gmail.com> Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>	2024-01-02 22:08:20 +00:00
Christine Straub	9459af435d	Fix: element extraction not working when using "auto" strategy for pdf (#2324 ) Closes #2323. ### Summary - update logic to return "hi_res" if either `extract_images_in_pdf` or `extract_element_types` is set - refactor: remove unused `file` parameter from `determine_pdf_or_image_strategy()` ### Testing ``` from unstructured.documents.elements import ElementType from unstructured.partition.pdf import partition_pdf elements = partition_pdf( filename="example-docs/embedded-images-tables.pdf", extract_element_types=["Image"], extract_to_payload=True, ) image_elements = [el for el in elements if el.category == ElementType.IMAGE] print(image_elements) ```	2023-12-28 22:25:30 +00:00
Christine Straub	dd144456de	Feat: return base64 encoded images for PDF's (#2310 ) Closes #2302. ### Summary - add functionality to get a Base64 encoded string from a PIL image - store base64 encoded image data in two metadata fields: `image_base64` and `image_mime_type` - update the "image element filter" logic to keep all image elements in the output if a user specifies image extraction ### Testing ``` from unstructured.partition.pdf import partition_pdf elements = partition_pdf( filename="example-docs/embedded-images-tables.pdf", strategy="hi_res", extract_element_types=["Image", "Table"], extract_to_payload=True, ) ``` or ``` from unstructured.partition.auto import partition elements = partition( filename="example-docs/embedded-images-tables.pdf", strategy="hi_res", pdf_extract_element_types=["Image", "Table"], pdf_extract_to_payload=True, ) ```	2023-12-27 05:39:01 +00:00
Roman Isecke	8ba9fadf8a	feat: improve dataclass use for encoders (#2318 ) ### Description Leverage a similar pattern to what is used for connectors, where there is a nested config dataclass as a field, along with cached content for things like the client and sample embedding for each. This required an update on the embeddings config in ingest and I left a TODO in there because the current approach breaks on other encoders such as bedrock because the parameters in that config don't map to all encoders. But this keeps the existing functionality working. This update makes sure all variables associated with the dataclass exist when it's instantiated rather than being added in the `__post_init__()` method or the `initialize()`, allowing other libraries like pydantic to appropriately generate schemas from it. It also now follows the pattern of the connectors in that each class has a nested config class used to instantiate the client itself as well as a field/property approach used to cache the client.	2023-12-26 22:33:19 +00:00
Roman Isecke	bfef183f77	feat: update encoders to be dataclasses (#2313 ) ### Description Convert all encoders to be based off dataclasses. Purpose: this will allow encoders to be used in a generic way amongst other dataclasses. Otherwise, it'll break validation in those parent dataclasses.	2023-12-26 14:48:00 +00:00
Steve Canny	eb1b022ff8	feat(chunking): add overlap on chunk-splits (#2305 ) There are two distinct overlap operations with completely different implementations. This is "intra-chunk" overlap, applying overlap to chunks resulting from text-splitting an oversized element. So if an oversized element had text "abcd efgh ijkl mnop qrst" and was split at 15 chars with overlap of 5, it would produce "abcd efgh ijkl" and "ijkl mnop qrst". Any inter-chunk overlap from the prior chunk and applied at the beginning of the string (before "abcd") is handled in a separate operation in the next PR.	2023-12-22 20:35:18 +00:00
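The intra-chunk overlap example in eb1b022ff8 can be reproduced with a small sketch. This is not the library's implementation: `split_with_overlap` is a hypothetical function that splits on the last space at or before the size limit and prefixes each following chunk with the tail of the previous one.

```python
from typing import List


def split_with_overlap(text: str, max_chars: int, overlap: int) -> List[str]:
    """Split text into chunks of at most max_chars, repeating the tail of each chunk in the next."""
    chunks: List[str] = []
    remainder = text.strip()
    while len(remainder) > max_chars:
        # Prefer to cut at the last space at or before the limit; otherwise cut mid-word.
        cut = remainder.rfind(" ", 0, max_chars + 1)
        if cut <= 0:
            cut = max_chars
        chunk = remainder[:cut].rstrip()
        chunks.append(chunk)
        rest = remainder[cut:].lstrip()
        tail = chunk[-overlap:].lstrip() if overlap > 0 else ""
        candidate = f"{tail} {rest}" if tail else rest
        # Guard against a non-shrinking remainder when the overlap is large relative to the cut.
        remainder = candidate if len(candidate) < len(remainder) else rest
    if remainder:
        chunks.append(remainder)
    return chunks


if __name__ == "__main__":
    # The example from the commit message: split at 15 chars with an overlap of 5.
    print(split_with_overlap("abcd efgh ijkl mnop qrst", max_chars=15, overlap=5))
```

For the commit's example input this yields the two chunks quoted in the message, "abcd efgh ijkl" and "ijkl mnop qrst".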
John	5c0043aa7d	chore: add hi_res_model_name kwarg (#2289 ) Closes #2160 Explicitly adds `hi_res_model_name` as kwarg to relevant functions and notes that `model_name` is to be deprecated. Testing: ``` from unstructured.partition.auto import partition filename = "example-docs/DA-1p.pdf" elements = partition(filename, strategy="hi_res", hi_res_model_name="yolox") ``` --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Steve Canny <stcanny@gmail.com> Co-authored-by: Christine Straub <christinemstraub@gmail.com> Co-authored-by: Yao You <yao@unstructured.io> Co-authored-by: Yao You <theyaoyou@gmail.com>	2023-12-22 15:06:54 +00:00

1393 Commits