unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-08-01 13:29:45 +00:00

Author	SHA1	Message	Date
Pawel Kmiecik	ff9d46f9dc	feat(eval): table evaluation metrics (#2558 ) This PR adds new table evaluation metrics prepared by @leah1985 The metrics include: - `table count` (check) - `table_level_acc` - accuracy of table detection - `element_col_level_index_acc` - accuracy of cell detection in columns - `element_row_level_index_acc` - accuracy of cell detection in rows - `element_col_level_content_acc` - accuracy of content detected in columns - `element_row_level_content_acc` - accuracy of content detected in rows TODO in next steps: - create a minimal dataset and upload to s3 for ingest tests - generate and add metrics on the above dataset to `test_unstructured_ingest/metrics`	2024-02-22 16:35:46 +00:00
Steve Canny	1947375b2e	rfctr(chunking): preparation for plug-in chunkers, Part I (#2550 ) Summary In order to accommodate customized chunkers other than those directly provided by `unstructured`, some further modularization is necessary such that a new chunker can be added as a "plug-in" without modifying the `unstructured` library code. This PR is the straightforward refactoring required for this process like typing changes. There are also some other small changes we've been meaning to make like making all chunking options accept `None` to represent their default value so the broad field of callers (e.g. ingest, unstructured-api, SDK) don't need to determine and set default values for chunking arguments leading to diverging defaults. Isolating these "noisy" but easy to accept changes in this preparatory PR reduces the noise in the more substantive changes to follow.	2024-02-21 23:16:13 +00:00
erjieyong	4d12c61cb8	added parent_element as output for overlapping cases (#2507 ) To provide more utility to the `catch_overlapping_and_nested_bboxes` and `identify_overlapping_or_nesting_case` functions, included parent_element as part of the output. This allows the user to - identify the parent element in the overlapping case: `nested {type} in {type}`. Currently, if the element types are the same, an example case output would be `nested Image in Image`, which is confusing. - easily identify elements to keep or delete	2024-02-21 00:13:09 -08:00
Steve Canny	f1c52c3e3f	fix(json): partition_json() does not chunk (#2564 ) Summary For whatever reason, the `@add_chunking_strategy` decorator was not present on `partition_json()`. This broke the only way to accomplish a "chunking-only" workflow using the REST API. This PR remedies that problem.	2024-02-21 01:35:16 +00:00
Filip Knefel	f048695a55	feat: include text from shapes in docx (#2510 ) Reported bug: Text from docx shapes is not included in the `partition` output. Fix: Extend docx partition to search for text tags nested inside structures responsible for creating the shape. --------- Co-authored-by: Filip Knefel <filip@unstructured.io>	2024-02-14 17:48:38 +00:00
Ronny H	51427b3103	Renamed OpenAiEmbeddingConfig dataclass (#2546 )	2024-02-14 17:24:52 +00:00
Matt Robinson	882370022e	fix: don't treat double quote enclosed text as JSON (#2544 ) ### Summary Closes #2444. Treats JSON serializable content that results in a string as plain text. Even though this is valid JSON per [RFC 4627](https://www.ietf.org/rfc/rfc4627.txt), in almost every case we really want to treat this as a text file. ### Testing 1. Put `"This is not a JSON"` in a text file `notajson.txt` 2. Run the following ```python from unstructured.file_utils.filetype import _is_text_file_a_json _is_text_file_a_json(filename="notajson.txt") # Should be False ```	2024-02-14 13:41:43 +00:00
Christine Straub	d11a83ce65	refactor: embedded text processing modules (#2535 ) This PR is similar to ocr module refactoring PR - https://github.com/Unstructured-IO/unstructured/pull/2492. ### Summary - refactor "embedded text extraction" related modules to use decorator - `@requires_dependencies` on functions that require external libraries and import those libraries inside those functions instead of on module level. - add missing test cases for `pdf_image_utils.py` module to improve average test coverage ### Testing CI should pass.	2024-02-13 21:19:07 -08:00
Steve Canny	d9f8467187	fix(xlsx): xlsx subtable algorithm (#2534 ) Reviewers: It may be easier to review each of the two commits separately. The first adds the new `_SubtableParser` object with its unit-tests and the second one uses that object to replace the flawed existing subtable-parsing algorithm. Summary There are a cluster of bugs in `partition_xlsx()` that all derive from flaws in the algorithm we use to detect "subtables". These are encountered when the user wants to get multiple document-elements from each worksheet, which is the default (argument `find_subtable = True`). This PR replaces the flawed existing algorithm with a `_SubtableParser` object that encapsulates all that logic and has thorough unit-tests. Additional Context This is a summary of the failure cases. There are a few other cases but they're closely related and this was enough evidence and scope for my purposes. This PR fixes all these bugs: ```python # # -- ✅ CASE 1: There are no leading or trailing single-cell rows. # -> this subtable functions never get called, subtable is emitted as the only element # # a b -> Table(a, b, c, d) # c d # -- ✅ CASE 2: There is exactly one leading single-cell row. # -> Leading single-cell row emitted as `Title` element, core-table properly identified. # # a -> [ Title(a), # b c Table(b, c, d, e) ] # d e # -- ❌ CASE 3: There are two-or-more leading single-cell rows. # -> leading single-cell rows are included in subtable # # a -> [ Table(a, b, c, d, e, f) ] # b # c d # e f # -- ❌ CASE 4: There is exactly one trailing single-cell row. # -> core table is dropped. trailing single-cell row is emitted as Title # (this is the behavior in the reported bug) # # a b -> [ Title(e) ] # c d # e # -- ❌ CASE 5: There are two-or-more trailing single-cell rows. # -> core table is dropped. trailing single-cell rows are each emitted as a Title # # a b -> [ Title(e), # c d Title(f) ] # e # f # -- ✅ CASE 6: There are exactly one each leading and trailing single-cell rows. 
# -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b c Table(b, c, d, e), # d e Title(f) ] # f # -- ✅ CASE 7: There are two leading and one trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b Title(b), # c d Table(c, d, e, f), # e f Title(g) ] # g # -- ✅ CASE 8: There are two-or-more leading and trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b Title(b), # c d Table(c, d, e, f), # e f Title(g), # g Title(h) ] # h # -- ❌ CASE 9: Single-row subtable, no single-cell rows above or below. # -> First cell is mistakenly emitted as title, remaining cells are dropped. # # a b c -> [ Title(a) ] # -- ❌ CASE 10: Single-row subtable with one leading single-cell row. # -> Leading single-row cell is correctly identified as title, core-table is mis-identified # as a `Title` and truncated. # # a -> [ Title(a), # b c d Title(b) ] ```	2024-02-13 20:29:17 -08:00
David Potter	1a706771fa	feature: add octoai for embeddings (#2538 ) Thanks to Pedro at OctoAI we have a new embedding option. The following PR adds support for the use of OctoAI embeddings. Forked from the original OpenAI embeddings class. We removed the use of the LangChain adaptor, and use OpenAI's SDK directly instead. Also updated out-of-date example script. Including new test file for OctoAI. # Testing Get a token from our platform at: https://www.octoai.cloud/ For testing one can do the following: ``` export OCTOAI_TOKEN=<your octo token> python3 examples/embed/example_octoai.py ``` ## Testing done Validated running the above script from within a locally built container via `make docker-start-dev` --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-10 15:27:06 +00:00
Steve Canny	dd6576c603	rfctr(xlsx): cleaning in prep for XLSX algorithm replacement (#2524 ) Reviewers: It may be faster to review each of the three commits separately since they are groomed to only make one type of change each (typing, docstrings, test-cleanup). Summary There are a cluster of bugs in `partition_xlsx()` that all derive from flaws in the algorithm we use to detect "subtables". These are encountered when the user wants to get multiple document-elements from each worksheet, which is the default (argument `find_subtable = True`). These commits clean up typing, lint, and other non-behavior-changing aspects of the code in preparation for installing a new algorithm that correctly identifies and partitions contiguous sub-regions of an Excel worksheet into distinct elements. Additional Context This is a summary of the failure cases. There are a few other cases but they're closely related and this was enough evidence and scope for my purposes: ```python # # -- ✅ CASE 1: There are no leading or trailing single-cell rows. # -> this subtable functions never get called, subtable is emitted as the only element # # a b -> Table(a, b, c, d) # c d # -- ✅ CASE 2: There is exactly one leading single-cell row. # -> Leading single-cell row emitted as `Title` element, core-table properly identified. # # a -> [ Title(a), # b c Table(b, c, d, e) ] # d e # -- ❌ CASE 3: There are two-or-more leading single-cell rows. # -> leading single-cell rows are included in subtable # # a -> [ Table(a, b, c, d, e, f) ] # b # c d # e f # -- ❌ CASE 4: There is exactly one trailing single-cell row. # -> core table is dropped. trailing single-cell row is emitted as Title # (this is the behavior in the reported bug) # # a b -> [ Title(e) ] # c d # e # -- ❌ CASE 5: There are two-or-more trailing single-cell rows. # -> core table is dropped. 
trailing single-cell rows are each emitted as a Title # # a b -> [ Title(e), # c d Title(f) ] # e # f # -- ✅ CASE 6: There are exactly one each leading and trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b c Table(b, c, d, e), # d e Title(f) ] # f # -- ✅ CASE 7: There are two leading and one trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b Title(b), # c d Table(c, d, e, f), # e f Title(g) ] # g # -- ✅ CASE 8: There are two-or-more leading and trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b Title(b), # c d Table(c, d, e, f), # e f Title(g), # g Title(h) ] # h # -- ❌ CASE 9: Single-row subtable, no single-cell rows above or below. # -> First cell is mistakenly emitted as title, remaining cells are dropped. # # a b c -> [ Title(a) ] # -- ❌ CASE 10: Single-row subtable with one leading single-cell row. # -> Leading single-row cell is correctly identified as title, core-table is mis-identified # as a `Title` and truncated. # # a -> [ Title(a), # b c d Title(b) ] ```	2024-02-08 23:33:41 +00:00
Matt Robinson	ccf0477080	enhancement: process `.p7s` files with `partition_email` (#2521 ) ### Summary Closes #2489, which reported an inability to process `.p7s` files. PR implements two changes: - If the user selected content type for the email is not available and there is another valid content type available, fall back to the other valid content type. - For signed message, extract the signature and add it to the metadata ### Testing ```python from unstructured.partition.auto import partition filename = "example-docs/eml/signed-doc.p7s" elements = partition(filename=filename) # should get a message about fall back logic print(elements[0]) # "This is a test" elements[0].metadata.to_dict() # Will see the signature ```	2024-02-07 22:31:49 +00:00
Ahmet Melek	be71633415	refactor: isolate ingest dependencies into local scopes (#2509 ) This PR: - Moves ingest dependencies into local scopes to be able to import ingest connector classes without the need of installing imported external dependencies. This allows lightweight use of the classes (not the instances. to use the instances as intended you'll still need the dependencies). - Upgrades the embed module dependencies from `langchain` to `langchain-community` module (to pass CI [rather than introducing a pin]) - Does pip-compile - Does minor refactors in other files to pass `ruff 2.0` checks which were introduced by pip-compile	2024-02-06 21:28:55 +00:00
Christine Straub	29b9ea7ba6	refactor: ocr modules (#2492 ) The purpose of this PR is to refactor OCR-related modules to reduce unnecessary module imports to avoid potential issues (most likely due to a "circular import"). ### Summary - add `inference_utils` module (unstructured/partition/pdf_image/inference_utils.py) to define unstructured-inference library related utility functions, which will reduce importing unstructured-inference library functions in other files - add `conftest.py` in `test_unstructured/partition/pdf_image/` directory to define fixtures that are available to all tests in the same directory and its subdirectories ### Testing CI should pass	2024-02-06 17:11:55 +00:00
Christine Straub	94001a208d	feat: improve table cell data (#2457 ) The purpose of this PR is to pass embedded text through the table processing sub-pipeline for later use.	2024-02-01 05:29:19 +00:00
Christophe Jolif	ccc2302b33	feat: add the ability to specify a custom OCR besides the ones natively supported (#2462 ) It is nice to natively support both Tesseract and Paddle. However, one might already use another OCR and might want to keep using it (for quality reasons, for cost reasons, etc.). This PR adds the ability for users to specify their own OCR agent implementation that is then called by unstructured. I am new to unstructured so don't hesitate to let me know if you would prefer this being done differently and I will rework the PR. --------- Co-authored-by: Yao You <theyaoyou@gmail.com> Co-authored-by: Yao You <yao@unstructured.io>	2024-01-31 16:38:14 -06:00
Christine Straub	8b1de4c2b8	fix: `partition_pdf()` not working when using chipper model with file (#2479 ) Closes #2480. ### Summary - fixed an error introduced by PR [#2347](https://github.com/Unstructured-IO/unstructured/pull/2347) - https://github.com/Unstructured-IO/unstructured/pull/2347/files#diff-cefa2d296ae7ffcf5c28b5734d5c7d506fbdb225c05a0bc27c6b755d5424ffdaL373 - updated `test_partition_pdf_with_model_name()` to test more model names ### Testing The updated test function `test_partition_pdf_with_model_name()` should work on this branch, but fails on the `main` branch.	2024-01-31 17:36:59 +00:00
John	db67805ec6	feat: add support for partitioning .heic files (#2454 ) .heic files are an image filetype we have not supported. #### Testing ``` from unstructured.partition.image import partition_image png_filename = "example-docs/DA-1p.png" heic_filename = "example-docs/DA-1p.heic" png_elements = partition_image(png_filename, strategy="hi_res") heic_elements = partition_image(heic_filename, strategy="hi_res") for i in range(len(heic_elements)): print(heic_elements[i].text == png_elements[i].text) ``` --------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-01-30 04:49:00 +00:00
John	9320311a19	fix: check languages args (#2435 ) This PR is the last in a series of PRs for refactoring and fixing the language parameters (`languages` and `ocr_languages`) so we can address incorrect input by users. See #2293 It is recommended to go through this PR commit-by-commit and note the commit message. The most significant commit is "update check_languages..."	2024-01-29 20:12:08 +00:00
Yao You	97fb10db4a	fix: default hi_res model rely on inference setting (#2441 ) - there are multiple places setting the default `hi_res_model_name` in both `unstructured` and `unstructured-inference` - they lead to inconsistency and unexpected behaviors - this fix removes a helper in `unstructured` that tries to set the default hi_res layout detection model; instead we rely on the `unstructured-inference` to provide that default when no explicit model name is passed in ## test ```bash UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true ipython ``` ```python from unstructured.partition.auto import partition # find a pdf file elements = partition("foo.pdf", strategy="hi_res") assert elements[0].metadata.detection_origin == "yolox" ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2024-01-29 16:44:41 +00:00
Antonio Jose Jimeno Yepes	d8b3bdb919	Check chipper version and prevent running pdfminer with chipper (#2347 ) We have added a new version of Chipper (Chipperv3), which requires unstructured to work effectively with all the current Chipper versions. This implies resizing images with the appropriate resolution and making sure that Chipper elements are not sorted by unstructured. In addition, it seems that PDFMiner is being called when calling Chipper, which adds repeated elements from Chipper and PDFMiner. To evaluate this PR, you can test the code below with the attached PDF. The code writes a JSON file with the generated elements. The output can be examined with `cat out.un.json | python -m json.tool`. There are three things to check: 1. The size of the image passed to Chipper, which can be identified in the layout_height and layout_width attributes, which should have values 3301 and 2550 as shown in the example below: ``` [ { "element_id": "c0493a7872f227e4172c4192c5f48a06", "metadata": { "coordinates": { "layout_height": 3301, "layout_width": 2550, ``` 2. There should be no repeated elements. 3. Order should be closer to reading order. 
The script to run Chipper from unstructured is: ``` from unstructured import __version__ print(__version__.__version__) import json from unstructured.partition.auto import partition from unstructured.staging.base import elements_to_json elements = json.loads(elements_to_json(partition("Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf", strategy="hi_res", model_name="chipperv3"))) with open('out.un.json', 'w') as w: json.dump(elements, w) ``` [Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf](https://github.com/Unstructured-IO/unstructured/files/13817273/Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf) --------- Co-authored-by: Antonio Jimeno Yepes <antonio@unstructured.io>	2024-01-25 02:33:32 +00:00
Matt Robinson	4613e52e11	fix: treat yaml files as plain text (#2446 ) ### Summary Closes #2412. Adds support for YAML MIME types and treats them as plain text. In response to `500` errors that the API currently returns if the MIME type is `text/yaml`.	2024-01-24 17:48:36 +00:00
David Potter	9fea85dc21	fix: remove none value keys from flattened dictionary (#2442 ) When a partitioned or embedded document json has null values, those get converted to a dictionary with None values. This happens in the metadata. I have not seen it in other keys. Chroma and Pinecone do not like those None values. `flatten_dict` has been modified with a `remove_none` arg to remove keys with None values. Also, Pinecone has been pinned at 2.2.4 because at 3.0 and above it breaks our code. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-23 21:52:11 +00:00
John	c34fac9c3a	enhancement: add _clean_ocr_languages_arg helper function (#2413 ) This PR is one in a series of PRs for refactoring and fixing the languages parameter so it can address incorrect input by users. #2293 This PR adds _clean_ocr_languages_arg. There are no calls to this function yet, but it will be called in later PRs related to this series.	2024-01-19 19:59:08 +00:00
Christine Straub	7378a378f6	enhancement: allow setting image block crop padding parameter (#2415 ) Closes #2320 . ### Summary In certain circumstances, adjusting the image block crop padding can improve image block extraction by preventing extracted image blocks from being clipped. ### Testing - PDF: [LM339-D_2-2.pdf](https://github.com/Unstructured-IO/unstructured/files/13968952/LM339-D_2-2.pdf) - Set two environment variables `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD` and `EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD` (e.g. `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD = 40`, `EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD = 20`) ``` elements = partition_pdf( filename="LM339-D_2-2.pdf", extract_image_block_types=["image"], ) ```	2024-01-19 06:28:32 +00:00
Ahmet Melek	a9ad8ac8d1	fix: update flatten dict to support flattening tuples (#2423 ) This PR updates flatten_dict function to support flattening tuples. This is necessary for objects like Coordinates, when the object is not written to the disk, therefore not being converted to a list before getting flattened.	2024-01-19 00:21:22 +00:00
John	fa9f6ccc17	refactor: use _get_iso639_language_object (#2424 ) This refactor removes `_convert_to_standard_langcode` and replaces it with calling `_get_iso639_language_object` with a string slice. Use of TESSERACT_LANGUAGES_AND_CODES, which was added to `_convert_to_standard_langcode` previously, is moved to the relevant part where `_convert_to_standard_langcode` was previously called. If/else statements replace the list comprehension for readability and `langdetect_langs.append("zho")` replaces `_convert_to_standard_langcode("zh")` since that always returned `"zho"`.	2024-01-19 00:14:45 +00:00
Matt Robinson	4d5038d9fd	enhancement: add support for bitmap images (#2414 ) ### Summary Adds support for bitmap images (`.bmp`) in both file detection and partitioning. Bitmap images will be processed with `partition_image` just like JPGs and PNGs. ### Testing ```python from unstructured.file_utils.filetype import detect_filetype from unstructured.partition.auto import partition from PIL import Image filename = "example-docs/layout-parser-paper-with-table.jpg" bmp_filename = "~/tmp/ayout-parser-paper-with-table.bmp" img = Image.open(filename) img.save(bmp_filename) detect_filetype(filename=bmp_filename) # Should be FileType.BMP elements = partition(filename=bmp_filename) ```	2024-01-17 22:50:36 +00:00
John	125b63cd7c	refactor: extract language helper functions (#2370 ) This PR is one in a series of PRs for refactoring and fixing the `languages` parameter so it can address incorrect input by users. #2293 Refactor `_convert_language_code_to_pytesseract_lang_code` and extract `_get_iso639_language_object` to its own function ``` from unstructured.partition.lang import _convert_language_code_to_pytesseract_lang_code as convert convert("English") # this will raise an error on both main and this branch convert("en") # this will return "eng" on both branches ```	2024-01-16 17:51:03 +00:00
Christine Straub	ee06260987	feat: keep all image elements when using `hi_res` strategy. (#2382 ) ### Summary The goal of this PR is to keep all image elements when using "hi_res" strategy. Previously, `Image` elements with small chunks of text were ignored unless the image block extraction parameters (`extract_images_in_pdf` or `extract_image_block_types`) were specified. Now, all image elements are kept regardless of whether the image block extraction parameters are specified. ### Testing - on `main` branch, ``` elements = partition_pdf( filename="example-docs/embedded-images.pdf", strategy="hi_res", ) image_elements = [el for el in elements if el.category == ElementType.IMAGE] print("number of image elements: ", len(image_elements)) ``` The above code will display `number of image elements: 0`. - on this `feature` branch, The same code will display `number of image elements: 3` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-01-15 23:19:17 +00:00
Matt Robinson	36faf677c0	enhancement: file detection for `.wav` files (#2387 ) ### Summary Adds filetype detection for `.wav` audio files ### Testing ```python from unstructured.file_utils.filetype import detect_filetype filename = "example-docs/CantinaBand3.wav" detect_filetype(filename=filename) # Should be FileType.WAV ```	2024-01-15 16:50:49 +00:00
John	bfd0258ba5	chore: refactor _convert_to_standard_langcode (#2369 ) This PR is one in a series of PRs for refactoring and fixing the `languages` parameter so it can address incorrect input by users. #2293 This PR adds a dictionary for helping map fully spelled out languages to tesseract language codes --------- Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>	2024-01-11 00:34:13 +00:00
Steve Canny	23edf2e911	feature(chunking): add basic strategy and overlap (#2367 ) This PR culminates the restructuring of chunking over my prior dozen-or-so commits by adding the new options to the API and documentation. Separately I'll be adding a new ingest test to defend against regression, although the integration test included in this PR will do a pretty good job of that too.	2024-01-10 22:19:24 +00:00
Klaijan	e65a44eabb	feat: update cct eval for text dir (#2299 ) The code edits the `measure_text_extraction_accuracy` function to allow a directory of txt files as well as json. The function also takes an `output_type` input, which must be either "json" or "txt", and checks whether the files under the given directory/list contain only the specified file type. To test this feature, run the following code: ```PYTHONPATH=. python unstructured/ingest/evaluate.py measure-text-extraction-accuracy-command --output_dir <clean-text-path> --source_dir <cct-label-path> --output_type txt```	2024-01-05 23:34:53 +00:00
Steve Canny	7a1e732aa1	feat(chunking): add inter-chunk overlap (#2309 ) Reviewer: This PR probably reviews faster commit-by-commit. Each of the commits is groomed and focuses on a separate clear aspect of this implementation. This PR adds inter-chunk overlap capability to chunking. It does not yet expose it via the API. Inter-chunk overlap is overlap between whole pre-chunks, prior to any text-splitting required for oversized chunks. Contrast with intra-chunk overlap implemented in the prior PR which implements overlap on these latter text-splitting boundaries. Inter-chunk overlap is disabled by default since a pre-chunk already has a "clean" semantic boundary (composed of whole elements) and adding overlap there introduces noise from the adjacent context. If the user wants inter-chunk overlap they must specify `overlap_all=True` in the options. Inter-chunk overlap uses the same `overlap` length value used by intra-chunk overlap and does not overlap when that value is 0.	2024-01-05 01:24:12 +00:00
Steve Canny	22cbdce7ca	fix(html): unequal row lengths in HTMLTable.text_as_html (#2345 ) Fixes #2339 Fixes to HTML partitioning introduced with v0.11.0 removed the use of `tabulate` for forming the HTML placed in `HTMLTable.text_as_html`. This had several benefits, but part of `tabulate`'s behavior was to make row-length (cell-count) uniform across the rows of the table. Lacking this prior uniformity produced a downstream problem reported in On closer inspection, the method used to "harvest" cell-text was producing more text-nodes than there were cells and was sensitive to where whitespace was used to format the HTML. It also "moved" text to different columns in certain rows. Refine the cell-text gathering mechanism to get exactly one text string for each row cell, eliminating whitespace formatting nodes and producing strict correspondence between the number of cells in the original HTML table row and that placed in HTML.text_as_html. HTML tables that are uniform (every row has the same number of cells) will produce a uniform table in `.text_as_html`. Merged cells may still produce a non-uniform table in `.text_as_html` (because the source table is non-uniform).	2024-01-04 21:53:19 +00:00
Christine Straub	5b0ae3fd8b	Refactor: rename image extraction kwargs (#2303 ) Currently, we're using different kwarg names in partition() and partition_pdf(), which has implications for the API since it goes through partition(). ### Summary - rename `extract_element_types` -> `extract_image_block_types` - rename `image_output_dir_path` to `extract_image_block_output_dir` - rename `extract_to_payload` -> `extract_image_block_to_payload` - rename `pdf_extract_images` -> `extract_images_in_pdf` in `partition.auto` - add unit tests to test element extraction for `pdf/image` via `partition.auto` ### Testing CI should pass.	2024-01-04 17:52:00 +00:00
Austin Walker	91b892c79d	fix: Fix api_url param to partition_via_api (#2342 ) Closes #2340 We need to make sure the custom url is passed to our client. The client constructor takes the base url, so for compatibility we can continue to take the full url and strip off the path. To verify, run the api locally and confirm you can make calls to it. ``` # In unstructured-api make run-web-app # In ipython in this repo from unstructured.partition.api import partition_via_api filename = "example-docs/layout-parser-paper.pdf" partition_via_api(filename=filename, api_url="http://localhost:8000") ```	2024-01-03 20:08:48 +00:00
Christine Straub	9459af435d	Fix: element extraction not working when using "auto" strategy for pdf (#2324 ) Closes #2323. ### Summary - update logic to return "hi_res" if either `extract_images_in_pdf` or `extract_element_types` is set - refactor: remove unused `file` parameter from `determine_pdf_or_image_strategy()` ### Testing ``` from unstructured.partition.pdf import partition_pdf elements = partition_pdf( filename="example-docs/embedded-images-tables.pdf", extract_element_types=["Image"], extract_to_payload=True, ) image_elements = [el for el in elements if el.category == ElementType.IMAGE] print(image_elements) ```	2023-12-28 22:25:30 +00:00
Christine Straub	dd144456de	Feat: return base64 encoded images for PDF's (#2310 ) Closes #2302. ### Summary - add functionality to get a Base64 encoded string from a PIL image - store base64 encoded image data in two metadata fields: `image_base64` and `image_mime_type` - update the "image element filter" logic to keep all image elements in the output if a user specifies image extraction ### Testing ``` from unstructured.partition.pdf import partition_pdf elements = partition_pdf( filename="example-docs/embedded-images-tables.pdf", strategy="hi_res", extract_element_types=["Image", "Table"], extract_to_payload=True, ) ``` or ``` from unstructured.partition.auto import partition elements = partition( filename="example-docs/embedded-images-tables.pdf", strategy="hi_res", pdf_extract_element_types=["Image", "Table"], pdf_extract_to_payload=True, ) ```	2023-12-27 05:39:01 +00:00
Roman Isecke	8ba9fadf8a	feat: improve dataclass use for encoders (#2318 ) ### Description Leverage a similar pattern to what is used for connectors, where there is a nested config dataclass as a field, along with cached content for things like the client and sample embedding for each. This required an update on the embeddings config in ingest and I left a TODO in there because the current approach breaks on other encoders such as bedrock because the parameters in that config don't map to all encoders. But this keeps the existing functionality working. This update makes sure all variables associated with the dataclass exist when it's instantiated rather than being added in the `__post_init__()` method or the `initialize()`, allowing other libraries like pydantic to appropriately generate schemas from it. It also now follows the pattern of the connectors in that each class has a nested config class used to instantiate the client itself as well as a field/property approach used to cache the client.	2023-12-26 22:33:19 +00:00
Steve Canny	eb1b022ff8	feat(chunking): add overlap on chunk-splits (#2305 ) There are two distinct overlap operations with completely different implementations. This is "intra-chunk" overlap, applying overlap to chunks resulting from text-splitting an oversized element. So if an oversized element had text "abcd efgh ijkl mnop qrst" and was split at 15 chars with overlap of 5, it would produce "abcd efgh ijkl" and "ijkl mnop qrst". Any inter-chunk overlap from the prior chunk and applied at the beginning of the string (before "abcd") is handled in a separate operation in the next PR.	2023-12-22 20:35:18 +00:00
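The intra-chunk overlap example in #2305 can be reproduced with a minimal sketch. This is not the `unstructured` implementation; the function name `split_with_overlap` and its parameters are hypothetical, and it assumes splits prefer the last space inside the window and that the overlap is the tail of the previous chunk.

```python
def split_with_overlap(text: str, maxlen: int, overlap: int) -> list[str]:
    """Split `text` into chunks of at most `maxlen` chars, repeating a tail as overlap.

    Hypothetical helper for illustration only, not part of `unstructured`.
    """
    if not 0 <= overlap < maxlen:
        raise ValueError("overlap must satisfy 0 <= overlap < maxlen")

    chunks: list[str] = []
    remainder = text.strip()
    while len(remainder) > maxlen:
        # Prefer the last space inside the window; fall back to a hard cut.
        cut = remainder.rfind(" ", 0, maxlen + 1)
        on_word_boundary = cut > 0
        if not on_word_boundary:
            cut = maxlen
        chunk = remainder[:cut].rstrip()
        chunks.append(chunk)
        rest = remainder[cut:].lstrip()
        # Only take a tail when the chunk is longer than the overlap, so that
        # the remainder is guaranteed to shrink on every pass.
        tail = chunk[-overlap:].lstrip() if 0 < overlap < len(chunk) else ""
        joiner = " " if tail and on_word_boundary else ""
        remainder = tail + joiner + rest
    if remainder:
        chunks.append(remainder)
    return chunks


chunks = split_with_overlap("abcd efgh ijkl mnop qrst", maxlen=15, overlap=5)
print(chunks)
```

With the text from the commit message, a 15-character limit, and an overlap of 5, the sketch yields the two chunks quoted there: "abcd efgh ijkl" and "ijkl mnop qrst".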
John	5c0043aa7d	chore: add hi_res_model_name kwarg (#2289)

Closes #2160

Explicitly adds `hi_res_model_name` as kwarg to relevant functions and notes that `model_name` is to be deprecated.

Testing:
```
from unstructured.partition.auto import partition

filename = "example-docs/DA-1p.pdf"
elements = partition(filename, strategy="hi_res", hi_res_model_name="yolox")
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Steve Canny <stcanny@gmail.com>
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
	2023-12-22 15:06:54 +00:00
Steve Canny	093a11d058	rfctr(chunking): split oversized chunks on word boundary (#2297 ) The text of an oversized chunk is split on an arbitrary character boundary (mid-word). The `chunk_by_character()` strategy introduces the idea of allowing the user to specify a separator to use for chunk-splitting. For `langchain` this is typically "\n\n", "\n", or " "; blank-line, newline, or word boundaries respectively. Even if the user is allowed to specify a separator, we must provide fall-back for when a chunk contains no such character. This can be done incrementally, like blank-line is preferable to newline, newline is preferable to word, and word is preferable to arbitrary character. Further, there is nothing particular to `chunk_by_character()` in providing such a fall-back text-splitting strategy. It would be preferable for all strategies to split oversized chunks on even-word boundaries for example. Note that while a "blank-line" ("\n\n") may be common in plain text, it is unlikely to appear in the text of an element because it would have been interpreted as an element boundary during partitioning. Add _TextSplitter with basic separator preferences and fall-back and apply it to chunk-splitting for all strategies. The `by_character` chunking strategy may enhance this behavior by adding the option for a user to specify a particular separator suited to their use case.	2023-12-21 05:45:36 +00:00
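The separator fall-back described in #2297 can be sketched as follows. This is an illustration only and not the `_TextSplitter` code; the function name `split_preferring_separators` and its signature are assumptions. It tries blank-line, newline, and space in that order and cuts at an arbitrary character only when none of them occurs inside the window.

```python
from typing import Sequence, Tuple


def split_preferring_separators(
    text: str,
    maxlen: int,
    separators: Sequence[str] = ("\n\n", "\n", " "),
) -> Tuple[str, str]:
    """Return (head, remainder) with `head` no longer than `maxlen` chars.

    Hypothetical helper for illustration only, not part of `unstructured`.
    """
    if len(text) <= maxlen:
        return text, ""

    window = text[:maxlen]
    for sep in separators:
        # Split on the last occurrence of the most-preferred separator present.
        idx = window.rfind(sep)
        if idx > 0:
            return text[:idx], text[idx + len(sep):]

    # No separator in the window: fall back to an arbitrary character boundary.
    return text[:maxlen], text[maxlen:]


head, remainder = split_preferring_separators("line one\nline two words", maxlen=15)
print(repr(head), repr(remainder))
```

Because separators are tried in preference order, the newline wins in this call even though a space appears later in the window.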
John	04f4c3ab16	create teardown fixture for tests (#2269) Closes #2263 Files were being created as a side effect of running the tests in `test_unstructured/metrics/test_evaluate.py`. The added decorator removes the created directory and its files after the tests run. Testing: on the main branch, run `make test` or `pytest test_unstructured/metrics/test_evaluate.py` and files will be created. On this branch, no files are created.	2023-12-20 17:50:12 +00:00
Andy Li	4ae49419c9	feat: support base64-encoded text in partition_email (#2277 ) closes #816 ## Description Added functionality for `partition_email` to automatically decode base64 text before passing it to `partition_text` or `partition_html`. Also adds base64 encoded email text test cases.	2023-12-19 23:37:17 -08:00
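The decoding step described in #2277 can be illustrated with the standard library alone. This sketch does not show how `partition_email` is implemented; the helper name `decoded_text` and the sample message are assumptions. It relies on `Message.get_payload(decode=True)`, which honors the `Content-Transfer-Encoding` header.

```python
import base64
from email import message_from_string

# A made-up single-part message whose body is base64-encoded text.
raw_email = (
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + base64.b64encode("Hello, world!".encode("utf-8")).decode("ascii")
    + "\n"
)


def decoded_text(raw: str) -> str:
    """Return the body of a single-part email as text (hypothetical helper)."""
    msg = message_from_string(raw)
    # decode=True reverses the transfer encoding (base64 here) and returns bytes.
    payload = msg.get_payload(decode=True)
    charset = msg.get_content_charset() or "utf-8"
    return payload.decode(charset)


print(decoded_text(raw_email))
```

The decoded string is what would then be handed to a text or HTML partitioner.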
Steve Canny	82714cad98	rfctr(chunking): extract BasePreChunker (#2294 ) The `_split_elements_by_title_and_table()` function fulfills the pre-chunker role for `chunk_by_title()`, but most of its operation is not strategy-specific and can be reused by other chunking strategies. Extract `BasePreChunker` and use it as the base class for `_ByTitlePreChunker` which now only needs to provide the boundary predicates specific to that strategy.	2023-12-20 06:30:21 +00:00
Steve Canny	4e2ba2c9b2	rfctr(chunking): extract boundary predicates (#2284 ) `chunk_by_title()` respects certain semantic boundaries while chunking. Those are sections introduced by a `Title` element, sections introduced by a `metadata.section` value change, and optionally page-breaks. "Respecting" in this context means that elements on opposite sides of a semantic boundary never appear in the same chunk. The `metadata_differs()` function used for this purpose is clumsy to use requiring the caller to maintain state (prior element). It also combines what are independent predicates such that they cannot be individually reused. Introduce the `BoundaryPredicate` type which takes an element and returns bool, indicating whether the element introduces a new semantic boundary. These can be reused by any chunking strategy that needs them and allows the pre-chunking operation to be generalized for use by any chunking strategy, which it will be in the following PR.	2023-12-19 18:20:05 +00:00
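The boundary-predicate idea from #2284 can be sketched as below. The `Element` dataclass, the predicate names, and `pre_chunks` are hypothetical stand-ins rather than the library's classes; only the shape "takes an element, returns bool" comes from the commit message. Stateful predicates keep their own prior value in a closure, so the caller no longer tracks the prior element.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Element:
    """Minimal stand-in for a document element (assumption, not the real class)."""
    category: str
    text: str
    section: Optional[str] = None


BoundaryPredicate = Callable[[Element], bool]


def is_title(element: Element) -> bool:
    """A Title element introduces a new section."""
    return element.category == "Title"


def section_changed() -> BoundaryPredicate:
    """Build a predicate that fires when the section value changes."""
    prior: Optional[str] = None

    def predicate(element: Element) -> bool:
        nonlocal prior
        changed = element.section is not None and element.section != prior
        if element.section is not None:
            prior = element.section
        return changed

    return predicate


def pre_chunks(
    elements: Iterable[Element], predicates: List[BoundaryPredicate]
) -> Iterator[List[Element]]:
    """Group elements so that no group spans a semantic boundary."""
    group: List[Element] = []
    for element in elements:
        # Evaluate every predicate (no short-circuit) so stateful ones stay in sync.
        starts_new = any([p(element) for p in predicates])
        if starts_new and group:
            yield group
            group = []
        group.append(element)
    if group:
        yield group


elements = [
    Element("Title", "Intro"),
    Element("NarrativeText", "a"),
    Element("Title", "Methods"),
    Element("NarrativeText", "b"),
]
groups = [[e.text for e in g] for g in pre_chunks(elements, [is_title])]
print(groups)
```

Each predicate is independent, so a strategy can pass only the ones it needs.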
David Potter	4b8352e0f5	feat: add chroma destination connector (#2240) Adds Chroma (also known as ChromaDB) as a vector destination. Currently Chroma is an in-memory, single-process oriented library with plans for a hosted and/or more production-ready solution: https://docs.trychroma.com/deployment Though they now claim to support multiple clients hitting the database at once, I found that it was inconsistent. Sometimes multiprocessing worked (maybe 1 out of 3 times), but the other times I would get different errors. So I kept it single process. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2023-12-19 16:58:23 +00:00
Steve Canny	0c7f64ecaa	rfctr(chunking): generalize PreChunkBuilder (#2283) To implement inter-pre-chunk overlap, we need a context that sees every pre-chunk both before and after it is accumulated (from elements). - We need access to the pre-chunk when it is completed so we can extract the "tail" overlap to be applied to the next chunk. - We need access to the as-yet-unpopulated pre-chunk so we can add the prior tail to it as a prefix. This "visibility" is split between `PreChunkBuilder` and the pre-chunker itself, which handles `TablePreChunk`s without the builder. Move `Table` element and `TablePreChunk` formation into `PreChunkBuilder` such that _all_ element types (adding `Table` elements in particular) pass through it. Then `PreChunkBuilder` becomes the context we require. The actual overlap harvesting and application will come in a subsequent commit.	2023-12-18 22:21:34 +00:00

725 Commits