unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-12-04 19:16:03 +00:00

Author	SHA1	Message	Date
Pawel Kmiecik	dc376053dd	feat(eval): Correct table metrics evaluations (#2615 ) This PR: - replaces the `-1.0` value in table metrics with `nan`s - corrects row filtering based on the above	2024-03-06 15:37:32 +00:00
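A minimal sketch of why the sentinel change matters, assuming scores are plain floats and that `-1.0` marked "metric not available". The helper name `mean_ignoring_missing` is hypothetical and is not part of the `unstructured` evaluation code.

```python
import math
from typing import List


def mean_ignoring_missing(scores: List[float]) -> float:
    """Average the scores after turning the -1.0 sentinel into NaN.

    Keeping -1.0 in the data would silently drag the mean down; NaN makes
    the missing rows explicit so they can be filtered out.
    """
    # Replace the sentinel with NaN (hypothetical convention for this sketch).
    cleaned = [math.nan if score == -1.0 else score for score in scores]
    # Filter rows on the NaN marker rather than on a magic number.
    valid = [score for score in cleaned if not math.isnan(score)]
    return sum(valid) / len(valid) if valid else math.nan
```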
Steve Canny	4096a38371	rfctr(chunking): extract chunking-strategy dispatch (#2545 ) Summary This is the final step in adding pluggable chunking-strategies. It introduces the `chunk()` function to replace calls to strategy-specific chunkers in the `@add_chunking_strategy` decorator. The `chunk()` function then uses a mapping of chunking-strategy names (e.g. "by_title", "basic") to chunking functions (chunkers) to dispatch the chunking call. This allows other chunkers to be added at runtime rather than requiring a code change, which is what "pluggable" chunkers is. Additional Information - Move the `@add_chunking_strategy` to the new `chunking.dispatch` module since it coheres strongly with that operation, but publish it from `chunking(.__init__)` (as it was before) so users don't couple to the way we organize the chunking sub-package. Also remove the third level of nesting as it's unrequired in this case. - Add unit tests for the `@add_chunking_strategy` decorator which was previously uncovered by any direct test.	2024-03-05 23:19:29 +00:00
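The dispatch described in this commit can be sketched roughly as follows. This is an illustration only: it chunks plain strings instead of document elements, and `register_chunker`, `basic_chunker`, and the `max_characters` handling are assumptions made for the example, not the signatures used in `unstructured.chunking.dispatch`.

```python
from typing import Callable, Dict, List

# A chunker takes text pieces and a size limit and returns chunks.
Chunker = Callable[[List[str], int], List[str]]

# Mapping of chunking-strategy name -> chunker (hypothetical registry).
_CHUNKERS: Dict[str, Chunker] = {}


def register_chunker(name: str, chunker: Chunker) -> None:
    """Add a chunker at runtime, which is what makes strategies pluggable."""
    _CHUNKERS[name] = chunker


def chunk(elements: List[str], chunking_strategy: str, max_characters: int = 500) -> List[str]:
    """Look up the chunker by strategy name and delegate the call to it."""
    try:
        chunker = _CHUNKERS[chunking_strategy]
    except KeyError:
        raise ValueError(f"unrecognized chunking strategy {chunking_strategy!r}") from None
    return chunker(elements, max_characters)


def basic_chunker(elements: List[str], max_characters: int) -> List[str]:
    """Greedily pack text pieces into chunks of at most max_characters."""
    chunks: List[str] = []
    current = ""
    for text in elements:
        candidate = f"{current} {text}".strip()
        if current and len(candidate) > max_characters:
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


register_chunker("basic", basic_chunker)
```

With a registry like this, a new strategy only needs a `register_chunker(...)` call; the dispatching function itself does not change.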
Klaijan	3ff6de4f50	refactor: refactor var name for consistency (#2609 ) refactor variable name for consistency.	2024-03-05 09:08:25 +00:00
John	3783b44d0b	fix documentation html links example (#2608 ) Closes #2577 Testing: ``` from unstructured.partition.html import partition_html cnn_lite_url = "https://lite.cnn.com/" elements = partition_html(url=cnn_lite_url) links = [] for element in elements: if element.metadata.link_urls: relative_link = element.metadata.link_urls[0][1:] if relative_link.startswith("2024"): links.append(f"{cnn_lite_url}{relative_link}") print(links) ``` --------- Co-authored-by: ron-unstructured <ronny@unstructured.io> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-04 18:33:42 +00:00
Michał Martyniak	b9aa4b7452	fix: Install pandoc consistently, via Makefile recipe (version that supports .rtf files as input format) (#2593 ) ## Problem Description In some cases pandoc cannot process an `rtf` file as an input format, because older versions simply do not support it. ``` RuntimeError: Invalid input format! Got "rtf" but expected one of these: commonmark, creole, csv, docbook, docx, dokuwiki, epub, fb2, gfm, haddock, html, ipynb, jats, jira, json, latex, man, markdown, markdown_github, markdown_mmd, markdown_phpextra, markdown_strict, mediawiki, muse, native, odt, opml, org, rst, t2t, textile, tikiwiki, twiki, vimwiki ``` Basically, a user may install the wrong version. The `README.md` is not precise enough when mentioning RTF file support: `47b35ccdd6/README.md (L120-L122)` ## Example Installing `pandoc` from a [stable repository, like Debian](https://packages.debian.org/source/bullseye/pandoc) will give you `2.9`, and the official documentation shows clearly that support for rtf was introduced in `2.14` https://pandoc.org/releases.html#pandoc-2.14.2-2021-08-21 ![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/3d5199f1-5e39-46ad-ac90-fff9cc5543a8) ### Note that `rtf` is not there ![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/de90ebaf-86f2-4b21-83fb-085e27eeea38) ### More detail ![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/59fbb91f-1650-4091-bdcb-15aa035416c8) ## Proposed Solution - [x] I've simply added/copied `make install-pandoc` calls, mimicking other recipes in order to ensure that `3.1.2` will be installed in all cases. Side note: `make install-pandoc` calls `./scripts/install-pandoc.sh` under the hood. 
- [x] Update README file - mention that `make install-pandoc` is recommended (`>=2.14.2`) - [x] Verify tests that cover `rtf` cases: `47b35ccdd6/test_unstructured/file_utils/test_file_conversion.py (L14)` - [x] Update `setup_ubuntu.sh` if needed?: `47b35ccdd6/scripts/setup_ubuntu.sh (L87)` -	2024-03-04 11:02:32 +00:00
David Potter	43250d5576	bug CORE-3971: fix deserialization in google-drive source connector key path (#2586 ) Google Drive Service account key can be a dict or a file path(str) We have successfully been using the path. But the dict can also end up being stored as a string that needs to be deserialized. The deserialization can have issues with single and double quotes.	2024-03-03 15:30:35 +00:00
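One way to handle the three shapes of key described here (dict, file path, or a dict serialized to a string with either quote style) is sketched below. The function name and the "starts with `{`" heuristic are assumptions for illustration; this is not the connector's actual code.

```python
import ast
import json
from pathlib import Path
from typing import Union


def load_service_account_key(key: Union[dict, str]) -> Union[dict, Path]:
    """Return a dict for inline keys, or a Path when the value is a file path."""
    if isinstance(key, dict):
        return key
    if not isinstance(key, str):
        raise TypeError(f"expected dict or str, got {type(key).__name__}")

    text = key.strip()
    if text.startswith("{"):
        try:
            # Proper JSON uses double quotes.
            parsed = json.loads(text)
        except json.JSONDecodeError:
            # A dict that went through str() uses single quotes; literal_eval
            # parses Python literals without executing arbitrary code.
            parsed = ast.literal_eval(text)
        if not isinstance(parsed, dict):
            raise ValueError("service account key did not deserialize to a dict")
        return parsed

    # Anything else is treated as a path to a key file (assumption).
    return Path(text)
```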
Klaijan	6a4b7a134b	feat: element type accuracy grouping (#2594 ) This PR adds grouping functionality to `evaluate.py`. To test: Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py` or call `get_mean_grouping(<doctype or connector>, <dataframe or path to tsv file>, <export directory>, "element_type")` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2024-03-01 15:18:37 +00:00
ryannikolaidis	71d5d513ef	fix: handling of varied SharePoint date formats (#2591 ) We are seeing occurrences of inconsistency in the timestamps returned by office365.sharepoint when fetching created and modified dates. Furthermore, in future versions of this library, a datetime object will be returned rather than a string. ## Changes - This adds logic to guarantee SharePoint dates will be properly formatted as ISO, regardless of the format provided by the sharepoint library. - Bumps timestamp format output to include timezone offset (as we do with others) ## Testing Unit test added to validate this datetime handling across various formats. --------- Co-authored-by: David Potter <potterdavidm@gmail.com>	2024-02-28 16:11:53 +00:00
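A sketch of this kind of normalization, accepting either a `datetime` or a string in one of several formats and always emitting ISO 8601 with a timezone offset. The list of candidate formats and the choice to treat naive values as UTC are assumptions for the example, not the formats the SharePoint library is known to return.

```python
from datetime import datetime, timezone
from typing import Union

# Candidate input formats (assumed for illustration).
_CANDIDATE_FORMATS = (
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%dT%H:%M:%S.%f%z",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%m/%d/%Y %H:%M:%S",
)


def to_iso_with_offset(value: Union[str, datetime]) -> str:
    """Return an ISO 8601 timestamp that always carries a timezone offset."""
    if isinstance(value, datetime):
        parsed = value
    else:
        text = value.strip()
        # Normalize a trailing "Z" to an explicit UTC offset.
        if text.endswith("Z"):
            text = text[:-1] + "+00:00"
        parsed = None
        for fmt in _CANDIDATE_FORMATS:
            try:
                parsed = datetime.strptime(text, fmt)
                break
            except ValueError:
                continue
        if parsed is None:
            raise ValueError(f"unrecognized date format: {value!r}")

    if parsed.tzinfo is None:
        # Assumption: naive timestamps are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.isoformat()
```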
Christine Straub	47b35ccdd6	build(release): release commit for 0.12.5 (#2585 ) 0.12.5	2024-02-26 21:16:21 +00:00
Christine Straub	ee8b0f93dc	feat: pass list type parameters via client sdk (#2567 ) The purpose of this PR is to support using the same type of parameters as `partition_()` when using `partition_via_api()`. This PR works together with `unstructured-api` [PR #368](https://github.com/Unstructured-IO/unstructured-api/pull/368). Note: This PR will support extracting image blocks ("Image", "Table") via `partition_via_api()`. ### Summary - update `partition_via_api()` to convert all list type parameters to JSON formatted strings before passing them to the unstructured client SDK - add a unit test function to test extracting image blocks via `partition_via_api()` - add a unit test function to test list type parameters passed to API via unstructured client sdk ### Testing ``` from unstructured.partition.api import partition_via_api elements = partition_via_api( filename="example-docs/embedded-images-tables.pdf", api_key="YOUR-API-KEY", strategy="hi_res", extract_image_block_types=["image", "table"], ) image_block_elements = [el for el in elements if el.category == "Image" or el.category == "Table"] print("\n\n".join([el.metadata.image_mime_type for el in image_block_elements])) print("\n\n".join([el.metadata.image_base64 for el in image_block_elements])) ```	2024-02-26 19:17:06 +00:00
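The conversion step in the summary (list type parameters become JSON formatted strings) could look roughly like this. `serialize_list_params` is a hypothetical helper written for the example, not a function in `unstructured.partition.api`.

```python
import json
from typing import Any, Dict


def serialize_list_params(params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of params where every list value is a JSON string.

    Non-list values are passed through unchanged.
    """
    return {
        name: json.dumps(value) if isinstance(value, list) else value
        for name, value in params.items()
    }
```

For example, `extract_image_block_types=["image", "table"]` would be sent as the string `'["image", "table"]'`, while `strategy="hi_res"` is left as it is.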
Klaijan	8f7853894e	bug: add main function to ingest/evaluate.py (#2583 ) The main function was lost to conflicts in the previous two merges. Add it back.	2024-02-26 05:01:34 +00:00
Christine Straub	5cb6504d5a	docs: update image block extraction docs (#2578 ) This PR removes `extract_image_block_to_payload` section from "API Parameters" page. The "unstructured" API does not support the `extract_image_block_to_payload` parameter, and it is always set to `True` internally on the API side when trying to extract image blocks via the API. Users only need to specify `extract_image_block_types` parameter when extracting image blocks via the API. NOTE: The `extract_image_block_to_payload` parameter is only used when calling `partition()`, `partition_pdf()`, and `partition_image()` functions directly. ### Testing CI should pass.	2024-02-24 04:36:58 +00:00
David Potter	e8ec09c8b9	feat: astra dest connector (#2571 ) Thanks to Eric Hare @erichare at DataStax we have a new destination connector. This Pull Request implements an integration with [Astra DB](https://datastax.com) which allows for the Astra DB Vector Database to be compatible with Unstructured's set of integrations. To create your Astra account and authenticate with your `ASTRA_DB_APPLICATION_TOKEN`, and `ASTRA_DB_API_ENDPOINT`, follow these steps: 1. Create an account at https://astra.datastax.com 2. Login and create a new database 3. From the database page, in the right hand panel, you will find your API Endpoint 4. Beneath that, you can create a Token to be used Some notes about Astra DB: - Astra DB is a Vector Database which allows for high-performance database transactions, and enables modern GenAI apps [See here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html) - It supports similarity search via a number of methods [See here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html#metrics) - It also supports non-vector tables / collections	2024-02-23 20:50:50 +00:00
Steve Canny	51cf6bf716	rfctr(chunking): extract strategy-specific chunking options (#2556 ) Summary A pluggable chunking strategy needs its own local set of chunking options that subclasses a base-class in `unstructured`. Extract distinct `_ByTitleChunkingOptions` and `_BasicChunkingOptions` for the existing two chunking strategies and move their strategy-specific option setting and validation to the respective subclass. This was also a good opportunity for us to clean up a few odds and ends we'd been meaning to. Might be worth looking at the commits individually as they are cohesive incremental steps toward the goal.	2024-02-23 18:22:44 +00:00
Matt Robinson	b4d9ad8130	enhancement: detect headers in `partition_pdf` with fast strategy (#2455 ) ### Summary Detects headers and footers when using `partition_pdf` with the fast strategy. Identifies elements that are positioned in the top or bottom 5% of the page as headers or footers. If no coordinate information is available, an element won't be detected as a header or footer. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2024-02-23 16:56:09 +00:00
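The positional rule can be sketched as below. The coordinate convention (y measured downward from the top of the page) and the requirement that the element lie entirely inside the 5% band are assumptions for this example; the commit message does not spell out either detail.

```python
from typing import Optional


def classify_by_position(
    top: Optional[float],
    bottom: Optional[float],
    page_height: Optional[float],
    margin: float = 0.05,
) -> Optional[str]:
    """Return "Header", "Footer", or None for an element's vertical extent.

    `top` and `bottom` are y-coordinates measured from the top of the page
    (assumed convention). Without coordinates nothing is detected.
    """
    if top is None or bottom is None or not page_height:
        return None
    if bottom <= page_height * margin:
        return "Header"  # element sits inside the top 5% band
    if top >= page_height * (1 - margin):
        return "Footer"  # element sits inside the bottom 5% band
    return None
```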
Klaijan	daaf1775b4	feat: separate evaluate grouping function (#2572 ) Separate the aggregating functionality of `text_extraction_accuracy` to a stand-alone function to avoid duplicated eval effort if the granular level eval is already available. To test: Run `PYTHONPATH=. pytest test_unstructured/metrics/test_evaluate.py` locally	2024-02-23 05:45:20 +00:00
Steve Canny	d3242fb546	rfctr(xlsx): extract connected components (#2575 ) Summary Refactoring as part of `partition_xlsx()` algorithm replacement that was delayed by some CI challenges. A separate PR because it is cohesive and relatively independent from the prior PR.	2024-02-22 22:50:48 +00:00
Pawel Kmiecik	ff9d46f9dc	feat(eval): table evaluation metrics (#2558 ) This PR adds new table evaluation metrics prepared by @leah1985 The metrics include: - `table count` (check) - `table_level_acc` - accuracy of table detection - `element_col_level_index_acc` - accuracy of cell detection in columns - `element_row_level_index_acc` - accuracy of cell detection in rows - `element_col_level_content_acc` - accuracy of content detected in columns - `element_row_level_content_acc` - accuracy of content detected in rows TODO in next steps: - create a minimal dataset and upload to s3 for ingest tests - generate and add metrics on the above dataset to `test_unstructured_ingest/metrics`	2024-02-22 16:35:46 +00:00
Steve Canny	1947375b2e	rfctr(chunking): preparation for plug-in chunkers, Part I (#2550 ) Summary In order to accommodate customized chunkers other than those directly provided by `unstructured`, some further modularization is necessary such that a new chunker can be added as a "plug-in" without modifying the `unstructured` library code. This PR is the straightforward refactoring required for this process like typing changes. There are also some other small changes we've been meaning to make like making all chunking options accept `None` to represent their default value so the broad field of callers (e.g. ingest, unstructured-api, SDK) don't need to determine and set default values for chunking arguments leading to diverging defaults. Isolating these "noisy" but easy to accept changes in this preparatory PR reduces the noise in the more substantive changes to follow.	2024-02-21 23:16:13 +00:00
erjieyong	4d12c61cb8	added parent_element as output for overlapping cases (#2507 ) To provide more utility to the `catch_overlapping_and_nested_bboxes` and `identify_overlapping_or_nesting_case` functions, parent_element is now included as part of the output. This allows the user to - identify the parent element in the overlapping case: `nested {type} in {type}`. Currently, if the element types are the same, an example case output would be `nested Image in Image`, which is confusing. - easily identify elements to keep or delete	2024-02-21 00:13:09 -08:00
Steve Canny	f1c52c3e3f	fix(json): partition_json() does not chunk (#2564 ) Summary For whatever reason, the `@add_chunking_strategy` decorator was not present on `partition_json()`. This broke the only way to accomplish a "chunking-only" workflow using the REST API. This PR remedies that problem.	2024-02-21 01:35:16 +00:00
Austin Walker	6d17b9a7e4	Fix a parameter name in the js-client example usage (#2560 ) `files` should be `fileName` as noted in https://github.com/Unstructured-IO/unstructured-js-client/issues/24 Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-02-18 17:28:15 +00:00
Klaijan	d06936d35a	feat: modify test-ingest-src and evaluation-metrics to allow EXPORT_DIR (#2551 ) The current `test-ingest-src.sh` and `evaluation-metrics` do not allow passing the `EXPORT_DIR` (`OUTPUT_ROOT` in `evaluation-metrics`). Output is currently saved in the current working directory (`unstructured/test_unstructured_ingest`). When running the eval from `core-product`, all output is saved at `core-product/upstream-unstructured/test_unstructured_ingest`, which is undesirable. This PR modifies two scripts to accommodate such behavior: 1. `test-ingest-src.sh` - assigns `EVAL_OUTPUT_ROOT` the value set within the environment if it exists, or the current working directory if not. Then calls `evaluation-metrics.sh`. 2. `evaluation-metrics.sh` - accepts the param from `test-ingest-src.sh` if it exists, or the value set within the environment if it exists, or the current directory if not. (Note: I also added the param to `evaluation-metrics.sh` because it makes sense to allow a separate run to specify an export directory) This PR should work in sync with another PR under `core-product`, which I will add the link here later. To test: Run the script below, change `$SCRIPT_DIR` as needed to see the result. ``` export OVERWRITE_FIXTURES=true ./upstream-unstructured/test_unstructured_ingest/src/s3.sh SCRIPT_DIR=$(dirname "$(realpath "$0")") bash -x ./upstream-unstructured/test_unstructured_ingest/evaluation-metrics.sh text-extraction "$SCRIPT_DIR" ``` ---- This PR also updates the requirements by `make pip-compile` since the `click` module was not found.	2024-02-17 05:21:15 +00:00
Ronny H	ad561b7939	Fixed broken links and improved readability in `key concepts` page (#2533 ) To test: > cd docs && make html	2024-02-14 18:19:47 +00:00
Filip Knefel	f048695a55	feat: include text from shapes in docx (#2510 ) Reported bug: Text from docx shapes is not included in the `partition` output. Fix: Extend docx partition to search for text tags nested inside structures responsible for creating the shape. --------- Co-authored-by: Filip Knefel <filip@unstructured.io>	2024-02-14 17:48:38 +00:00
Ronny H	51427b3103	Renamed OpenAiEmbeddingConfig dataclass (#2546 )	2024-02-14 17:24:52 +00:00
Matt Robinson	882370022e	fix: don't treat double quote enclosed text as JSON (#2544 ) ### Summary Closes #2444. Treats JSON serializable content that results in a string as plain text. Even though such content is valid JSON per [RFC 4627](https://www.ietf.org/rfc/rfc4627.txt), in almost every case we really want to treat it as a text file. ### Testing 1. Put `"This is not a JSON"` in a text file `notajson.txt` 2. Run the following ```python from unstructured.file_utils.filetype import _is_text_file_a_json _is_text_file_a_json(filename="notajson.txt") # Should be False ```	2024-02-14 13:41:43 +00:00
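The rule in the summary (content that parses as JSON but yields a bare string is treated as plain text) can be sketched as follows. `looks_like_json_document` is a hypothetical stand-in written for this example; it is not the body of `_is_text_file_a_json`.

```python
import json


def looks_like_json_document(text: str) -> bool:
    """Return True only when text parses as JSON and is not a bare string."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False
    # A quoted sentence parses successfully, but it is really just text.
    return not isinstance(parsed, str)
```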
Christine Straub	d11a83ce65	refactor: embedded text processing modules (#2535 ) This PR is similar to ocr module refactoring PR - https://github.com/Unstructured-IO/unstructured/pull/2492. ### Summary - refactor "embedded text extraction" related modules to use decorator - `@requires_dependencies` on functions that require external libraries and import those libraries inside those functions instead of on module level. - add missing test cases for `pdf_image_utils.py` module to improve average test coverage ### Testing CI should pass.	2024-02-13 21:19:07 -08:00
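The pattern described here, checking for an external library when the function is called and importing it inside the function body, can be illustrated with a small stand-in decorator. The `requires` decorator below is written for this example and only handles top-level module names; it is not the implementation of `@requires_dependencies` in `unstructured`.

```python
import importlib.util
from functools import wraps
from typing import Any, Callable


def requires(*module_names: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Fail with a clear ImportError at call time if a module is missing."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            missing = [name for name in module_names if importlib.util.find_spec(name) is None]
            if missing:
                raise ImportError(f"{func.__name__} requires: {', '.join(missing)}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires("json")
def dump(obj: Any) -> str:
    # The import lives inside the function, so importing this module
    # never pulls in the dependency by itself.
    import json

    return json.dumps(obj)
```

The benefit is that the module defining `dump` can be imported even when the dependency is absent; the error only surfaces if the function is actually used.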
Steve Canny	d9f8467187	fix(xlsx): xlsx subtable algorithm (#2534 ) Reviewers: It may be easier to review each of the two commits separately. The first adds the new `_SubtableParser` object with its unit-tests and the second one uses that object to replace the flawed existing subtable-parsing algorithm. Summary There are a cluster of bugs in `partition_xlsx()` that all derive from flaws in the algorithm we use to detect "subtables". These are encountered when the user wants to get multiple document-elements from each worksheet, which is the default (argument `find_subtable = True`). This PR replaces the flawed existing algorithm with a `_SubtableParser` object that encapsulates all that logic and has thorough unit-tests. Additional Context This is a summary of the failure cases. There are a few other cases but they're closely related and this was enough evidence and scope for my purposes. This PR fixes all these bugs: ```python # # -- ✅ CASE 1: There are no leading or trailing single-cell rows. # -> this subtable functions never get called, subtable is emitted as the only element # # a b -> Table(a, b, c, d) # c d # -- ✅ CASE 2: There is exactly one leading single-cell row. # -> Leading single-cell row emitted as `Title` element, core-table properly identified. # # a -> [ Title(a), # b c Table(b, c, d, e) ] # d e # -- ❌ CASE 3: There are two-or-more leading single-cell rows. # -> leading single-cell rows are included in subtable # # a -> [ Table(a, b, c, d, e, f) ] # b # c d # e f # -- ❌ CASE 4: There is exactly one trailing single-cell row. # -> core table is dropped. trailing single-cell row is emitted as Title # (this is the behavior in the reported bug) # # a b -> [ Title(e) ] # c d # e # -- ❌ CASE 5: There are two-or-more trailing single-cell rows. # -> core table is dropped. trailing single-cell rows are each emitted as a Title # # a b -> [ Title(e), # c d Title(f) ] # e # f # -- ✅ CASE 6: There are exactly one each leading and trailing single-cell rows. 
# -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b c Table(b, c, d, e), # d e Title(f) ] # f # -- ✅ CASE 7: There are two leading and one trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b Title(b), # c d Table(c, d, e, f), # e f Title(g) ] # g # -- ✅ CASE 8: There are two-or-more leading and trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b Title(b), # c d Table(c, d, e, f), # e f Title(g), # g Title(h) ] # h # -- ❌ CASE 9: Single-row subtable, no single-cell rows above or below. # -> First cell is mistakenly emitted as title, remaining cells are dropped. # # a b c -> [ Title(a) ] # -- ❌ CASE 10: Single-row subtable with one leading single-cell row. # -> Leading single-row cell is correctly identified as title, core-table is mis-identified # as a `Title` and truncated. # # a -> [ Title(a), # b c d Title(b) ] ```	2024-02-13 20:29:17 -08:00
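As an illustration of the behavior the cases above point toward, the sketch below peels leading and trailing single-cell rows off a contiguous region and emits them as titles around a core table. It is not the `_SubtableParser` from this PR; the function name, the tuple-based output, and the expected results for the ❌ cases (inferred by analogy with the ✅ cases) are assumptions.

```python
from typing import Any, List, Tuple

Element = Tuple[str, Any]


def split_subtable(rows: List[List[str]]) -> List[Element]:
    """Split one contiguous region into leading titles, core table, trailing titles.

    Each row is the list of its non-empty cell values.
    """
    # Count leading single-cell rows.
    start = 0
    while start < len(rows) and len(rows[start]) == 1:
        start += 1
    # Count trailing single-cell rows, never crossing into the leading run.
    end = len(rows)
    while end > start and len(rows[end - 1]) == 1:
        end -= 1

    elements: List[Element] = [("Title", row[0]) for row in rows[:start]]
    if start < end:
        # Everything between the title runs is the core table, flattened here.
        elements.append(("Table", [cell for row in rows[start:end] for cell in row]))
    elements.extend(("Title", row[0]) for row in rows[end:])
    return elements
```

Single-cell rows in the interior of the region stay inside the table in this sketch; only the leading and trailing runs are split off.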
David Potter	1a706771fa	feature: add octoai for embeddings (#2538 ) Thanks to Pedro at OctoAI we have a new embedding option. The following PR adds support for the use of OctoAI embeddings. Forked from the original OpenAI embeddings class. We removed the use of the LangChain adaptor, and use OpenAI's SDK directly instead. Also updated out-of-date example script. Including new test file for OctoAI. # Testing Get a token from our platform at: https://www.octoai.cloud/ For testing one can do the following: ``` export OCTOAI_TOKEN=<your octo token> python3 examples/embed/example_octoai.py ``` ## Testing done Validated running the above script from within a locally built container via `make docker-start-dev` --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-10 15:27:06 +00:00
David Potter	d11c70cf83	bug: fix check_connection for (#2497 ) fixes check_connection for: azure opensearch postgres For Azure, the check_connection in fsspec.py actually worked better. Adding check_connection for Databricks Volumes --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-09 14:33:12 +00:00
Steve Canny	dd6576c603	rfctr(xlsx): cleaning in prep for XLSX algorithm replacement (#2524 ) Reviewers: It may be faster to review each of the three commits separately since they are groomed to only make one type of change each (typing, docstrings, test-cleanup). Summary There are a cluster of bugs in `partition_xlsx()` that all derive from flaws in the algorithm we use to detect "subtables". These are encountered when the user wants to get multiple document-elements from each worksheet, which is the default (argument `find_subtable = True`). These commits clean up typing, lint, and other non-behavior-changing aspects of the code in preparation for installing a new algorithm that correctly identifies and partitions contiguous sub-regions of an Excel worksheet into distinct elements. Additional Context This is a summary of the failure cases. There are a few other cases but they're closely related and this was enough evidence and scope for my purposes: ```python # # -- ✅ CASE 1: There are no leading or trailing single-cell rows. # -> this subtable functions never get called, subtable is emitted as the only element # # a b -> Table(a, b, c, d) # c d # -- ✅ CASE 2: There is exactly one leading single-cell row. # -> Leading single-cell row emitted as `Title` element, core-table properly identified. # # a -> [ Title(a), # b c Table(b, c, d, e) ] # d e # -- ❌ CASE 3: There are two-or-more leading single-cell rows. # -> leading single-cell rows are included in subtable # # a -> [ Table(a, b, c, d, e, f) ] # b # c d # e f # -- ❌ CASE 4: There is exactly one trailing single-cell row. # -> core table is dropped. trailing single-cell row is emitted as Title # (this is the behavior in the reported bug) # # a b -> [ Title(e) ] # c d # e # -- ❌ CASE 5: There are two-or-more trailing single-cell rows. # -> core table is dropped. 
trailing single-cell rows are each emitted as a Title # # a b -> [ Title(e), # c d Title(f) ] # e # f # -- ✅ CASE 6: There are exactly one each leading and trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b c Table(b, c, d, e), # d e Title(f) ] # f # -- ✅ CASE 7: There are two leading and one trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b Title(b), # c d Table(c, d, e, f), # e f Title(g) ] # g # -- ✅ CASE 8: There are two-or-more leading and trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b Title(b), # c d Table(c, d, e, f), # e f Title(g), # g Title(h) ] # h # -- ❌ CASE 9: Single-row subtable, no single-cell rows above or below. # -> First cell is mistakenly emitted as title, remaining cells are dropped. # # a b c -> [ Title(a) ] # -- ❌ CASE 10: Single-row subtable with one leading single-cell row. # -> Leading single-row cell is correctly identified as title, core-table is mis-identified # as a `Title` and truncated. # # a -> [ Title(a), # b c d Title(b) ] ```	2024-02-08 23:33:41 +00:00
Ahmet Melek	f9f2cacb58	build(release): release commit for 0.12.4 (#2525 ) Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> 0.12.4	2024-02-08 21:18:29 +00:00
Matt Robinson	389dbb63d7	fix: add missing dep files to manifest (#2516 ) ### Summary Closes #2484. Adds missing dependency files to `MANIFEST.in` so they are included in the Python distribution. Also updates the manifest to look for ingest dependencies in the `requirements/ingest` subdirectory. --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com> Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>	2024-02-08 01:30:13 +00:00
Matt Robinson	ccf0477080	enhancement: process `.p7s` files with `partition_email` (#2521 ) ### Summary Closes #2489, which reported an inability to process `.p7s` files. PR implements two changes: - If the user selected content type for the email is not available and there is another valid content type available, fall back to the other valid content type. - For signed message, extract the signature and add it to the metadata ### Testing ```python from unstructured.partition.auto import partition filename = "example-docs/eml/signed-doc.p7s" elements = partition(filename=filename) # should get a message about fall back logic print(elements[0]) # "This is a test" elements[0].metadata.to_dict() # Will see the signature ```	2024-02-07 22:31:49 +00:00
David Potter	0c834517d8	fix: change opensearch port (#2517 ) change opensearch port to see if fixes CI. We think there may be a conflict with the elasticsearch docker port. Also adding simple retry to vector query. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-07 21:25:04 +00:00
Ronny H	67a8fd9809	Added example to use SaaS API URL in partition_via_api (#2512 ) To test: > cd docs && make html Changelog: added an example to use `SaaS API URL` in `partition_via_api` using `api_url` param. --------- Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>	2024-02-07 19:25:42 +00:00
Filip Knefel	5defe79bf2	docs: add information about MIME type of extracted images (#2515 ) Include information about what mime type is expected when extracting images. Co-authored-by: Filip Knefel <filip@unstructured.io>	2024-02-07 08:40:24 +00:00
Ahmet Melek	be71633415	refactor: isolate ingest dependencies into local scopes (#2509 ) This PR: - Moves ingest dependencies into local scopes to be able to import ingest connector classes without the need of installing imported external dependencies. This allows lightweight use of the classes (not the instances. to use the instances as intended you'll still need the dependencies). - Upgrades the embed module dependencies from `langchain` to `langchain-community` module (to pass CI [rather than introducing a pin]) - Does pip-compile - Does minor refactors in other files to pass `ruff 2.0` checks which were introduced by pip-compile	2024-02-06 21:28:55 +00:00
David Potter	138625438f	fix: add title to Vectara upload (#2511 ) Small improvement to Vectara requested by Ofer at Vectara In the "Document" construct, every document can have a title. If it's there, in the UI it will show up above the document (otherwise you get "Untitled") --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-06 19:49:53 +00:00
Christine Straub	29b9ea7ba6	refactor: ocr modules (#2492 ) The purpose of this PR is to refactor OCR-related modules to reduce unnecessary module imports to avoid potential issues (most likely due to a "circular import"). ### Summary - add `inference_utils` module (unstructured/partition/pdf_image/inference_utils.py) to define unstructured-inference library related utility functions, which will reduce importing unstructured-inference library functions in other files - add `conftest.py` in `test_unstructured/partition/pdf_image/` directory to define fixtures that are available to all tests in the same directory and its subdirectories ### Testing CI should pass	2024-02-06 17:11:55 +00:00
David Potter	0f0b58dfe7	bug: remove vectara requirements (#2491 ) I accidentally added Vectara to setup and makefile. But there are no dependencies for Vectara This removes Vectara from those files. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-01 20:41:53 +00:00
David Potter	c100ce28a7	feat: add Vectara destination connector (#2357 ) Thanks to Ofer at Vectara, we now have a Vectara destination connector. - There are no dependencies since it is all REST calls to API - --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-01 14:38:34 +00:00
Christine Straub	94001a208d	feat: improve table cell data (#2457 ) The purpose of this PR is to pass embedded text through the table processing sub-pipeline for later use.	2024-02-01 05:29:19 +00:00
Christophe Jolif	ccc2302b33	feat: add the ability to specify a custom OCR besides the ones natively supported (#2462 ) It is nice to natively support both Tesseract and Paddle. However, one might already use another OCR and might want to keep using it (for quality reasons, for cost reasons, etc.). This PR adds the ability for users to specify their own OCR agent implementation, which is then called by unstructured. I am new to unstructured so don't hesitate to let me know if you would prefer this being done differently and I will rework the PR. --------- Co-authored-by: Yao You <theyaoyou@gmail.com> Co-authored-by: Yao You <yao@unstructured.io>	2024-01-31 16:38:14 -06:00
Christine Straub	8b1de4c2b8	fix: `partition_pdf()` not working when using chipper model with file (#2479 ) Closes #2480. ### Summary - fixed an error introduced by PR [#2347](https://github.com/Unstructured-IO/unstructured/pull/2347) - https://github.com/Unstructured-IO/unstructured/pull/2347/files#diff-cefa2d296ae7ffcf5c28b5734d5c7d506fbdb225c05a0bc27c6b755d5424ffdaL373 - updated `test_partition_pdf_with_model_name()` to test more model names ### Testing The updated test function `test_partition_pdf_with_model_name()` should work on this branch, but fails on the `main` branch.	2024-01-31 17:36:59 +00:00
qued	399dd60311	build(deps): unpin pillow (#2472 ) Removed `pillow` pin and recompiled. I think it was originally there to address a conflict, which, as far as I can tell, no longer exists. Also a security vulnerability was discovered in the older version of `pillow`. #### Testing: CI should pass.	2024-01-30 21:29:08 +00:00
John	5adc04ac27	bug: fix typo in makefile (#2474 ) fix typo in makefile: `.PHONE` -> `.PHONY`	2024-01-30 18:12:35 +00:00
qued	007fc45739	chore: new black changes (#2473 ) Update `black` and apply changes to affected files. I separated this PR so we can have a look at the changes and decide whether we want to: 1. Go forward with the new formatting 2. Change the black config to make the old formatting valid 3. Get rid of black entirely and just use `ruff` 4. Do something I haven't thought of	2024-01-30 17:12:35 +00:00
John	db67805ec6	feat: add support for partitioning .heic files (#2454 ) .heic files are an image filetype we have not supported. #### Testing ``` from unstructured.partition.image import partition_image png_filename = "example-docs/DA-1p.png" heic_filename = "example-docs/DA-1p.heic" png_elements = partition_image(png_filename, strategy="hi_res") heic_elements = partition_image(heic_filename, strategy="hi_res") for i in range(len(heic_elements)): print(heic_elements[i].text == png_elements[i].text) ``` --------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-01-30 04:49:00 +00:00


1418 Commits