unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-24 17:41:15 +00:00

Author	SHA1	Message	Date
David Potter	9fea85dc21	fix: remove none value keys from flattened dictionary (#2442 ) When a partitioned or embedded document json has null values, those get converted to a dictionary with None values. This happens in the metadata. I have not seen it in other keys. Chroma and Pinecone do not like those None values. `flatten_dict` has been modified with a `remove_none` arg to remove keys with None values. Also, Pinecone has been pinned at 2.2.4 because at 3.0 and above it breaks our code. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-23 21:52:11 +00:00
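The `remove_none` behavior described in the commit above can be illustrated with a minimal sketch. This is not the library's `flatten_dict`; the function name `flatten_dict_sketch`, its `separator` parameter, and the sample metadata are assumptions made for illustration only.

```python
from typing import Any, Dict


def flatten_dict_sketch(
    data: Dict[str, Any],
    parent_key: str = "",
    separator: str = "_",
    remove_none: bool = False,
) -> Dict[str, Any]:
    """Flatten nested dicts into one level (hypothetical stand-in, not the real helper)."""
    flat: Dict[str, Any] = {}
    for key, value in data.items():
        new_key = f"{parent_key}{separator}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key prefix along.
            flat.update(flatten_dict_sketch(value, new_key, separator, remove_none))
        elif value is None and remove_none:
            # Drop keys whose value is None when the caller asks for it.
            continue
        else:
            flat[new_key] = value
    return flat


element = {"text": "hi", "metadata": {"page_number": 1, "url": None}}
flattened = flatten_dict_sketch(element, remove_none=True)
```

With `remove_none=True`, the `metadata_url` key is absent from `flattened`, which is the shape a vector store that rejects `None` values would need.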
John	c34fac9c3a	enhancement: add _clean_ocr_languages_arg helper function (#2413 ) This PR is one in a series of PRs for refactoring and fixing the languages parameter so it can address incorrect input by users. #2293 This PR adds _clean_ocr_languages_arg. There are no calls to this function yet, but it will be called in later PRs related to this series.	2024-01-19 19:59:08 +00:00
Christine Straub	7378a378f6	enhancement: allow setting image block crop padding parameter (#2415 ) Closes #2320 . ### Summary In certain circumstances, adjusting the image block crop padding can improve image block extraction by preventing extracted image blocks from being clipped. ### Testing - PDF: [LM339-D_2-2.pdf](https://github.com/Unstructured-IO/unstructured/files/13968952/LM339-D_2-2.pdf) - Set two environment variables `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD` and `EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD` (e.g. `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD = 40`, `EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD = 20` ``` elements = partition_pdf( filename="LM339-D_2-2.pdf", extract_image_block_types=["image"], ) ```	2024-01-19 06:28:32 +00:00
Ahmet Melek	a9ad8ac8d1	fix: update flatten dict to support flattening tuples (#2423 ) This PR updates flatten_dict function to support flattening tuples. This is necessary for objects like Coordinates, when the object is not written to the disk, therefore not being converted to a list before getting flattened.	2024-01-19 00:21:22 +00:00
John	fa9f6ccc17	refactor: use _get_iso639_language_object (#2424 ) This refactor removes `_convert_to_standard_langcode` and replaces it with calling `_get_iso639_language_object` with a string slice. Use of TESSERACT_LANGUAGES_AND_CODES, which was added to `_convert_to_standard_langcode` previously, is moved to the relevant part where `_convert_to_standard_langcode` was previously called. If/else statements replace the list comprehension for readability and `langdetect_langs.append("zho")` replaces `_convert_to_standard_langcode("zh")` since that always returned `"zho"`.	2024-01-19 00:14:45 +00:00
Matt Robinson	4d5038d9fd	enhancement: add support for bitmap images (#2414 ) ### Summary Adds support for bitmap images (`.bmp`) in both file detection and partitioning. Bitmap images will be processed with `partition_image` just like JPGs and PNGs. ### Testing ```python from unstructured.file_utils.filetype import detect_filetype from unstructured.partition.auto import partition from PIL import Image filename = "example-docs/layout-parser-paper-with-table.jpg" bmp_filename = "~/tmp/ayout-parser-paper-with-table.bmp" img = Image.open(filename) img.save(bmp_filename) detect_filetype(filename=bmp_filename) # Should be FileType.BMP elements = partition(filename=bmp_filename) ```	2024-01-17 22:50:36 +00:00
John	125b63cd7c	refactor: extract language helper functions (#2370 ) This PR is one in a series of PRs for refactoring and fixing the `languages` parameter so it can address incorrect input by users. #2293 Refactor `_convert_language_code_to_pytesseract_lang_code` and extract `_get_iso639_language_object` to its own function ``` from unstructured.partition.lang import _convert_language_code_to_pytesseract_lang_code as convert convert("English") # this will raise an error on both main and this branch convert("en") # this will return "eng" on both branches ```	2024-01-16 17:51:03 +00:00
Christine Straub	ee06260987	feat: keep all image elements when using `hi_res` strategy. (#2382 ) ### Summary The goal of this PR is to keep all image elements when using "hi_res" strategy. Previously, `Image` elements with small chunks of text were ignored unless the image block extraction parameters (`extract_images_in_pdf` or `extract_image_block_types`) were specified. Now, all image elements are kept regardless of whether the image block extraction parameters are specified. ### Testing - on `main` branch, ``` elements = partition_pdf( filename="example-docs/embedded-images.pdf", strategy="hi_res", ) image_elements = [el for el in elements if el.category == ElementType.IMAGE] print("number of image elements: ", len(image_elements)) ``` The above code will display `number of image elements: 0`. - on this `feature` branch, The same code will display `number of image elements: 3` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-01-15 23:19:17 +00:00
Matt Robinson	36faf677c0	enhancement: file detection for `.wav` files (#2387 ) ### Summary Adds filetype detection for `.wav` audio files ### Testing ```python from unstructured.file_utils.filetype import detect_filetype filename = "example-docs/CantinaBand3.wav" detect_filetype(filename=filename) # Should be FileType.WAV ```	2024-01-15 16:50:49 +00:00
John	bfd0258ba5	chore: refactor _convert_to_standard_langcode (#2369 ) This PR is one in a series of PRs for refactoring and fixing the `languages` parameter so it can address incorrect input by users. #2293 This PR adds a dictionary for helping map fully spelled out languages to tesseract language codes --------- Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>	2024-01-11 00:34:13 +00:00
Steve Canny	23edf2e911	feature(chunking): add basic strategy and overlap (#2367 ) This PR culminates the restructuring of chunking over my prior dozen-or-so commits by adding the new options to the API and documentation. Separately I'll be adding a new ingest test to defend against regression, although the integration test included in this PR will do a pretty good job of that too.	2024-01-10 22:19:24 +00:00
Klaijan	e65a44eabb	feat: update cct eval for text dir (#2299 ) This change edits the `measure_text_extraction_accuracy` function to allow a directory of txt files as well as json. The function also takes an `output_type` input, which must be either "json" or "txt", and checks whether the files under the given directory/list are all of the specified file type. To test this feature, run the following code: ```PYTHONPATH=. python unstructured/ingest/evaluate.py measure-text-extraction-accuracy-command --output_dir <clean-text-path> --source_dir <cct-label-path> --output_type txt```	2024-01-05 23:34:53 +00:00
Steve Canny	7a1e732aa1	feat(chunking): add inter-chunk overlap (#2309 ) Reviewer: This PR probably reviews faster commit-by-commit. Each of the commits is groomed and focuses on a separate clear aspect of this implementation. This PR adds inter-chunk overlap capability to chunking. It does not yet expose it via the API. Inter-chunk overlap is overlap between whole pre-chunks, prior to any text-splitting required for oversized chunks. Contrast with intra-chunk overlap implemented in the prior PR which implements overlap on these latter text-splitting boundaries. Inter-chunk overlap is disabled by default since a pre-chunk already has a "clean" semantic boundary (composed of whole elements) and adding overlap there introduces noise from the adjacent context. If the user wants inter-chunk overlap they must specify `overlap_all=True` in the options. Inter-chunk overlap uses the same `overlap` length value used by intra-chunk overlap and does not overlap when that value is 0.	2024-01-05 01:24:12 +00:00
Steve Canny	22cbdce7ca	fix(html): unequal row lengths in HTMLTable.text_as_html (#2345 ) Fixes #2339 Fixes to HTML partitioning introduced with v0.11.0 removed the use of `tabulate` for forming the HTML placed in `HTMLTable.text_as_html`. This had several benefits, but part of `tabulate`'s behavior was to make row-length (cell-count) uniform across the rows of the table. Lacking this prior uniformity produced a downstream problem reported in On closer inspection, the method used to "harvest" cell-text was producing more text-nodes than there were cells and was sensitive to where whitespace was used to format the HTML. It also "moved" text to different columns in certain rows. Refine the cell-text gathering mechanism to get exactly one text string for each row cell, eliminating whitespace formatting nodes and producing strict correspondence between the number of cells in the original HTML table row and that placed in HTML.text_as_html. HTML tables that are uniform (every row has the same number of cells) will produce a uniform table in `.text_as_html`. Merged cells may still produce a non-uniform table in `.text_as_html` (because the source table is non-uniform).	2024-01-04 21:53:19 +00:00
Christine Straub	5b0ae3fd8b	Refactor: rename image extraction kwargs (#2303 ) Currently, we're using different kwarg names in partition() and partition_pdf(), which has implications for the API since it goes through partition(). ### Summary - rename `extract_element_types` -> `extract_image_block_types` - rename `image_output_dir_path` to `extract_image_block_output_dir` - rename `extract_to_payload` -> `extract_image_block_to_payload` - rename `pdf_extract_images` -> `extract_images_in_pdf` in `partition.auto` - add unit tests to test element extraction for `pdf/image` via `partition.auto` ### Testing CI should pass.	2024-01-04 17:52:00 +00:00
Austin Walker	91b892c79d	fix: Fix api_url param to partition_via_api (#2342 ) Closes #2340 We need to make sure the custom url is passed to our client. The client constructor takes the base url, so for compatibility we can continue to take the full url and strip off the path. To verify, run the api locally and confirm you can make calls to it. ``` # In unstructured-api make run-web-app # In ipython in this repo from unstructured.partition.api import partition_via_api filename = "example-docs/layout-parser-paper.pdf" partition_via_api(filename=filename, api_url="http://localhost:8000") ```	2024-01-03 20:08:48 +00:00
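One way to reduce a full endpoint URL to the base URL that a client constructor expects, as the commit above describes, is sketched below with the standard library. The helper name `base_url_from` and the example path are hypothetical; the actual change inside `partition_via_api` may be implemented differently.

```python
from urllib.parse import urlsplit


def base_url_from(api_url: str) -> str:
    """Keep only scheme and host:port, dropping any path, query, or fragment."""
    parts = urlsplit(api_url)
    return f"{parts.scheme}://{parts.netloc}"


# Illustrative inputs: a URL with a path and one that is already a base URL.
full = base_url_from("http://localhost:8000/general/v0/general")
bare = base_url_from("http://localhost:8000")
```

Both calls yield the same base URL, so callers can keep passing either form.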
Christine Straub	9459af435d	Fix: element extraction not working when using "auto" strategy for pdf (#2324 ) Closes #2323. ### Summary - update logic to return "hi_res" if either `extract_images_in_pdf` or `extract_element_types` is set - refactor: remove unused `file` parameter from `determine_pdf_or_image_strategy()` ### Testing ``` from unstructured.partition.pdf import partition_pdf elements = partition_pdf( filename="example-docs/embedded-images-tables.pdf", extract_element_types=["Image"], extract_to_payload=True, ) image_elements = [el for el in elements if el.category == ElementType.IMAGE] print(image_elements) ```	2023-12-28 22:25:30 +00:00
Christine Straub	dd144456de	Feat: return base64 encoded images for PDF's (#2310 ) Closes #2302. ### Summary - add functionality to get a Base64 encoded string from a PIL image - store base64 encoded image data in two metadata fields: `image_base64` and `image_mime_type` - update the "image element filter" logic to keep all image elements in the output if a user specifies image extraction ### Testing ``` from unstructured.partition.pdf import partition_pdf elements = partition_pdf( filename="example-docs/embedded-images-tables.pdf", strategy="hi_res", extract_element_types=["Image", "Table"], extract_to_payload=True, ) ``` or ``` from unstructured.partition.auto import partition elements = partition( filename="example-docs/embedded-images-tables.pdf", strategy="hi_res", pdf_extract_element_types=["Image", "Table"], pdf_extract_to_payload=True, ) ```	2023-12-27 05:39:01 +00:00
Roman Isecke	8ba9fadf8a	feat: improve dataclass use for encoders (#2318 ) ### Description Leverage a similar pattern to what is used for connectors, where there is a nested config dataclass as a field, along with cached content for things like the client and sample embedding for each. This required an update on the embeddings config in ingest and I left a TODO in there because the current approach breaks on other encoders such as bedrock because the parameters in that config don't map to all encoders. But this keeps the existing functionality working. This update makes sure all variables associated with the dataclass exist when it's instantiated rather than being added in the `__post_init__()` method or the `initialize()`, allowing other libraries like pydantic to appropriately generate schemas from it. It also now follows the pattern of the connectors in that each class has a nested config class used to instantiate the client itself as well as a field/property approach used to cache the client.	2023-12-26 22:33:19 +00:00
Steve Canny	eb1b022ff8	feat(chunking): add overlap on chunk-splits (#2305 ) There are two distinct overlap operations with completely different implementations. This is "intra-chunk" overlap, applying overlap to chunks resulting from text-splitting an oversized element. So if an oversized element had text "abcd efgh ijkl mnop qrst" and was split at 15 chars with overlap of 5, it would produce "abcd efgh ijkl" and "ijkl mnop qrst". Any inter-chunk overlap from the prior chunk and applied at the beginning of the string (before "abcd") is handled in a separate operation in the next PR.	2023-12-22 20:35:18 +00:00
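The intra-chunk overlap example in the commit above ("abcd efgh ijkl mnop qrst" split at 15 characters with an overlap of 5) can be reproduced with a small sketch. This is not the library's implementation; the function name `split_with_overlap` and the progress guard are assumptions, and the tail is whitespace-trimmed the same way the quoted example implies.

```python
from typing import List


def split_with_overlap(text: str, max_chars: int, overlap: int) -> List[str]:
    """Split `text` into pieces of at most `max_chars`, prefixing each with the prior tail."""
    chunks: List[str] = []
    remaining = text.strip()
    while len(remaining) > max_chars:
        # Prefer the last space that keeps the chunk within max_chars.
        cut = remaining.rfind(" ", 0, max_chars + 1)
        if cut <= 0:
            cut = max_chars  # no word boundary available: hard cut
        chunk = remaining[:cut].rstrip()
        chunks.append(chunk)
        tail = chunk[-overlap:].strip() if overlap > 0 else ""
        rest = remaining[cut:].lstrip()
        candidate = f"{tail} {rest}" if tail else rest
        # Guard (an assumption of this sketch): only apply the tail if we still make progress.
        remaining = candidate if len(candidate) < len(remaining) else rest
    if remaining:
        chunks.append(remaining)
    return chunks


pieces = split_with_overlap("abcd efgh ijkl mnop qrst", max_chars=15, overlap=5)
```

Here `pieces` matches the two strings quoted in the commit message. A tail taken by character count can begin mid-word on other inputs; the sketch does not try to correct that.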
John	5c0043aa7d	chore: add hi_res_model_name kwarg (#2289 ) Closes #2160 Explicitly adds `hi_res_model_name` as kwarg to relevant functions and notes that `model_name` is to be deprecated. Testing: ``` from unstructured.partition.auto import partition filename = "example-docs/DA-1p.pdf" elements = partition(filename, strategy="hi_res", hi_res_model_name="yolox") ``` --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Steve Canny <stcanny@gmail.com> Co-authored-by: Christine Straub <christinemstraub@gmail.com> Co-authored-by: Yao You <yao@unstructured.io> Co-authored-by: Yao You <theyaoyou@gmail.com>	2023-12-22 15:06:54 +00:00
Steve Canny	093a11d058	rfctr(chunking): split oversized chunks on word boundary (#2297 ) The text of an oversized chunk is split on an arbitrary character boundary (mid-word). The `chunk_by_character()` strategy introduces the idea of allowing the user to specify a separator to use for chunk-splitting. For `langchain` this is typically "\n\n", "\n", or " "; blank-line, newline, or word boundaries respectively. Even if the user is allowed to specify a separator, we must provide fall-back for when a chunk contains no such character. This can be done incrementally, like blank-line is preferable to newline, newline is preferable to word, and word is preferable to arbitrary character. Further, there is nothing particular to `chunk_by_character()` in providing such a fall-back text-splitting strategy. It would be preferable for all strategies to split oversized chunks on even-word boundaries for example. Note that while a "blank-line" ("\n\n") may be common in plain text, it is unlikely to appear in the text of an element because it would have been interpreted as an element boundary during partitioning. Add _TextSplitter with basic separator preferences and fall-back and apply it to chunk-splitting for all strategies. The `by_character` chunking strategy may enhance this behavior by adding the option for a user to specify a particular separator suited to their use case.	2023-12-21 05:45:36 +00:00
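The separator preference with fall-back that the commit above describes (newline preferred over word boundary, word boundary preferred over an arbitrary character) might look roughly like the following. This is a sketch under assumptions, not `_TextSplitter` itself; the function name `split_once` and its default separator tuple are hypothetical.

```python
from typing import Sequence, Tuple


def split_once(
    text: str, max_chars: int, separators: Sequence[str] = ("\n", " ")
) -> Tuple[str, str]:
    """Return (fragment, remainder), splitting on the most preferred separator that fits."""
    if len(text) <= max_chars:
        return text, ""
    # A separator sitting right at the limit still yields a fragment that fits.
    window = text[: max_chars + 1]
    for sep in separators:
        cut = window.rfind(sep)
        if cut > 0:
            return text[:cut].rstrip(), text[cut + len(sep):].lstrip()
    # No preferred separator in range: fall back to an arbitrary character boundary.
    return text[:max_chars], text[max_chars:]


on_newline = split_once("alpha beta\ngamma delta", 16)
on_space = split_once("alpha beta gamma", 12)
hard_cut = split_once("abcdefghij", 4)
```

Calling `split_once` repeatedly on the remainder would split an oversized chunk completely; a user-specified separator could be placed at the front of `separators`.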
John	04f4c3ab16	create teardown fixture for tests (#2269 ) Closes #2263 Files were being created as a side effect from running tests in `test_unstructured/metrics/test_evaluate.py`. The added decorator removes the created directory and its files after the tests run. Testing on the main branch, run `make test` or `pytest test_unstructured/metrics/test_evaluate.py` and files will be created. On this branch no files are created	2023-12-20 17:50:12 +00:00
Andy Li	4ae49419c9	feat: support base64-encoded text in partition_email (#2277 ) closes #816 ## Description Added functionality for `partition_email` to automatically decode base64 text before passing it to `partition_text` or `partition_html`. Also adds base64 encoded email text test cases.	2023-12-19 23:37:17 -08:00
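For context on what decoding base64 text in an email involves, here is a standard-library sketch. It does not show how `partition_email` does it; the function name `decoded_body` and the sample message are assumptions, and the sketch handles only a single-part message.

```python
import base64
from email import message_from_string


def decoded_body(raw_email: str) -> str:
    """Decode a single-part message body according to its Content-Transfer-Encoding."""
    message = message_from_string(raw_email)
    # decode=True makes the email package undo base64 / quoted-printable encoding.
    payload = message.get_payload(decode=True)
    charset = message.get_content_charset() or "utf-8"
    return payload.decode(charset)


# Build an illustrative base64-encoded message.
encoded = base64.b64encode("Hello, world".encode("utf-8")).decode("ascii")
raw = (
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    f"{encoded}\n"
)
body = decoded_body(raw)
```

The decoded string is what would then be handed on to a text or HTML partitioner.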
Steve Canny	82714cad98	rfctr(chunking): extract BasePreChunker (#2294 ) The `_split_elements_by_title_and_table()` function fulfills the pre-chunker role for `chunk_by_title()`, but most of its operation is not strategy-specific and can be reused by other chunking strategies. Extract `BasePreChunker` and use it as the base class for `_ByTitlePreChunker` which now only needs to provide the boundary predicates specific to that strategy.	2023-12-20 06:30:21 +00:00
Steve Canny	4e2ba2c9b2	rfctr(chunking): extract boundary predicates (#2284 ) `chunk_by_title()` respects certain semantic boundaries while chunking. Those are sections introduced by a `Title` element, sections introduced by a `metadata.section` value change, and optionally page-breaks. "Respecting" in this context means that elements on opposite sides of a semantic boundary never appear in the same chunk. The `metadata_differs()` function used for this purpose is clumsy to use requiring the caller to maintain state (prior element). It also combines what are independent predicates such that they cannot be individually reused. Introduce the `BoundaryPredicate` type which takes an element and returns bool, indicating whether the element introduces a new semantic boundary. These can be reused by any chunking strategy that needs them and allows the pre-chunking operation to be generalized for use by any chunking strategy, which it will be in the following PR.	2023-12-19 18:20:05 +00:00
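The idea of a boundary predicate (a callable that takes an element and returns a bool, keeping any state it needs internally) can be sketched as follows. `FakeElement`, `is_title`, `is_on_next_page`, and `pre_chunk` are stand-ins invented for this illustration and are not the classes or functions in `unstructured.chunking`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class FakeElement:
    """Hypothetical stand-in for a document element."""
    category: str
    page_number: Optional[int] = None


BoundaryPredicate = Callable[[FakeElement], bool]


def is_title(element: FakeElement) -> bool:
    return element.category == "Title"


def is_on_next_page() -> BoundaryPredicate:
    """Build a predicate that remembers the current page in a closure."""
    current_page: Optional[int] = None

    def predicate(element: FakeElement) -> bool:
        nonlocal current_page
        page = element.page_number
        if page is None or page == current_page:
            return False
        first_seen = current_page is None
        current_page = page
        return not first_seen  # the first page seen is not a boundary

    return predicate


def pre_chunk(
    elements: Sequence[FakeElement], predicates: Sequence[BoundaryPredicate]
) -> List[List[FakeElement]]:
    groups: List[List[FakeElement]] = []
    current: List[FakeElement] = []
    for element in elements:
        # Evaluate every predicate (no short-circuit) so stateful ones stay up to date.
        results = [predicate(element) for predicate in predicates]
        if any(results) and current:
            groups.append(current)
            current = []
        current.append(element)
    if current:
        groups.append(current)
    return groups


doc = [
    FakeElement("Title", 1),
    FakeElement("NarrativeText", 1),
    FakeElement("NarrativeText", 2),
    FakeElement("Title", 2),
    FakeElement("NarrativeText", 2),
]
groups = pre_chunk(doc, [is_title, is_on_next_page()])
```

Because the caller no longer tracks the prior element, each predicate can be reused independently by any strategy that needs it.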
David Potter	4b8352e0f5	feat: add chroma destination connector (#2240 ) Adds Chroma (also known as ChromaDB) as a vector destination. Currently Chroma is an in-memory single-process oriented library with plans of a hosted and/or more production ready solution -https://docs.trychroma.com/deployment Though they now claim to support multiple Clients hitting the database at once, I found that it was inconsistent. Sometimes multiprocessing worked (maybe 1 out of 3 times) But the other times I would get different errors. So I kept it single process. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2023-12-19 16:58:23 +00:00
Steve Canny	0c7f64ecaa	rfctr(chunking): generalize PreChunkBuilder (#2283 ) To implement inter-pre-chunk overlap, we need a context that sees every pre-chunk both before and after it is accumulated (from elements). - We need access to the pre-chunk when it is completed so we can extract the "tail" overlap to be applied to the next chunk. - We need access to the as-yet-unpopulated pre-chunk so we can add the prior tail to it as a prefix. This "visibility" is split between `PreChunkBuilder` and the pre-chunker itself, which handles `TablePreChunk`s without the builder. Move `Table` element and `TablePreChunk` formation into `PreChunkBuilder` such that _all_ element types (adding `Table` elements in particular) pass through it. Then `PreChunkBuilder` becomes the context we require. The actual overlap harvesting and application will come in a subsequent commit.	2023-12-18 22:21:34 +00:00
Steve Canny	36e81c3367	rfctr(chunking): extract general-purpose objects to base (#2281 ) Many of the classes defined in `unstructured.chunking.title` are applicable to any chunking strategy and will shortly be used for the "by-character" chunking strategy as well. Move these and their tests to `unstructured.chunking.base`. Along the way, rename `TextPreChunkBuilder` to `PreChunkBuilder` because it will be generalized in a subsequent PR to also take `Table` elements such that inter-pre-chunk overlap can be implemented. Otherwise, no logic changes, just moves.	2023-12-16 17:28:15 +00:00
Christine Straub	a7c3f5f570	Refactor: importation consistency for `partition_pdf()` and `partition_image()` (#2282 ) Closes #2278. This PR also removes the `extract_tables_in_pdf` mentioned in issue #2280.	2023-12-15 22:29:58 +00:00
Steve Canny	70cf141036	rfctr: extract ChunkingOptions (#2266 ) Chunking options for things like chunk-size are largely independent of chunking strategy. Further, validating the args and applying defaults based on call arguments is sophisticated to make its use easy for the caller. These details distract from what the chunker is actually doing and would need to be repeated for every chunking strategy if left where they are. Extract these settings and the rules governing chunking behavior based on options into its own immutable object that can be passed to any component that is subject to optional behavior (pretty much all of them).	2023-12-15 19:51:02 +00:00
Yao You	5f5ff6319f	fix: consider text in cid code as invalid in hi_res (#2259 ) This PR addresses [CORE-2969](https://unstructured-ai.atlassian.net/browse/CORE-2969) - pdfminer sometimes fails to decode text in a pdf file and returns cid codes as text - now such text will be considered invalid and be replaced with ocr results in `hi_res` mode ## test This PR adds unit test for the utility functions. In addition the file below would return elements with text in cid code on main but proper ascii text with this PR: [005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf](https://github.com/Unstructured-IO/unstructured/files/13662984/005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf) This change improves both cct accuracy and %missing scores: before: ``` metric average sample_sd population_sd count -------------------------------------------------- cct-accuracy 0.681 0.267 0.266 105 cct-%missing 0.086 0.159 0.159 105 ``` after: ``` metric average sample_sd population_sd count -------------------------------------------------- cct-accuracy 0.697 0.251 0.250 105 cct-%missing 0.071 0.123 0.122 105 ``` [CORE-2969]: https://unstructured-ai.atlassian.net/browse/CORE-2969?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2023-12-14 06:49:23 +00:00
Austin Walker	d594c06a3e	fix: handle delimiter bug in partition_csv (#2224 ) Closes #2218. When a csv has commas in its content, and the delimiter is something else, Pandas may throw an error. We can sniff the csv and get the correct delimiter to pass to Pandas. To verify, try partitioning the file in the linked bug.	2023-12-13 23:57:46 +00:00
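Delimiter sniffing of the kind the commit above mentions can be illustrated with the standard library's `csv.Sniffer`. This sketch is not the code in `partition_csv`; the function names, the candidate delimiter set, and the sample data are assumptions. It reads rows with `csv.reader` to stay dependency-free, and the detected delimiter could equally be passed to `pandas.read_csv` through its `sep` parameter.

```python
import csv
import io
from typing import List


def sniff_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the delimiter from a text sample, restricted to the candidate characters."""
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


def read_rows(text: str) -> List[List[str]]:
    delimiter = sniff_delimiter(text)
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Semicolon-delimited content whose cells happen to contain commas.
sample = "name;comment\nalice;hello, world\nbob;fine, thanks\n"
rows = read_rows(sample)
```

The commas inside the cells stay part of the cell text because the semicolon is detected as the delimiter.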
Steve Canny	cbeaed21ef	rfctr: rename pre chunk (#2261 ) The original naming for the precursor to a chunk in `chunk_by_title()` was conflated with the idea of how these element subsequences were bounded (by document-section) for that strategy. I mistakenly picked that up as a universal concept but in fact no notion of section arises in the `by_character` or other chunking strategies. Fix this misconception by using the name `pre-chunk` for this concept throughout.	2023-12-13 23:13:57 +00:00
Steve Canny	74d089d942	rfctr: skip CheckBox elements during chunking (#2253 ) `CheckBox` elements get special treatment during chunking. `CheckBox` does not derive from `Text` and can contribute no text to a chunk. It is considered "non-combinable" and so is emitted as-is as a chunk of its own. A consequence of this is it breaks an otherwise contiguous chunk into two wherever it occurs. This is problematic, but becomes much more so when overlap is introduced. Each chunk accepts a "tail" text fragment from its preceding element and contributes its own tail fragment to the next chunk. These tails represent the "overlap" between chunks. However, a non-text chunk can neither accept nor provide a tail-fragment and so interrupts the overlap. None of the possible solutions are terrific. Give `Element` a `.text` attribute such that _all_ elements have a `.text` attribute, even though its value is the empty-string for element-types such as CheckBox and PageBreak which inherently have no text. As a consequence, several `cast()` wrappers are no longer required to satisfy strict type-checking. This also allows a `CheckBox` element to be combined with `Text` subtypes during chunking, essentially the same way `PageBreak` is, contributing no text to the chunk. Also, remove the `_NonTextSection` object which previously wrapped a `CheckBox` element during pre-chunking as it is no longer required.	2023-12-13 20:22:25 +00:00
Yao You	36e4639e05	fix: image may be scaled too large for tesseract (#2252 ) This PR addresses [CORE-2965](https://unstructured-ai.atlassian.net/browse/CORE-2965) by limiting zoom factor so that the scaled image can still be processed by tesseract. - tesseract has a 2^31 byte limit on image data - occasionally an image may be scaled too much and become larger than that size - fix limits the scaling factor so that we never scale an image larger than what tesseract can handle ## test A unit test is added in this PR to test an unlikely case where we'd scale an image a few thousand times and massively exceed the limit without the fix. Unstructured reviewers can also use the document in the ticket to test. [CORE-2965]: https://unstructured-ai.atlassian.net/browse/CORE-2965?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ	2023-12-13 19:35:05 +00:00
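The arithmetic behind capping a zoom factor against a byte limit can be worked through in a few lines. This is a sketch of the general idea only; the function name `clamp_zoom` and the assumption of 4 bytes per pixel are mine, and the actual fix may compute the bound differently.

```python
import math

TESSERACT_MAX_BYTES = 2**31  # limit quoted in the commit message above


def clamp_zoom(width: int, height: int, zoom: float, bytes_per_pixel: int = 4) -> float:
    """Return `zoom`, reduced if the scaled image would exceed the byte limit.

    Scaled size is (width * zoom) * (height * zoom) * bytes_per_pixel, so the
    largest admissible zoom is sqrt(limit / (width * height * bytes_per_pixel)).
    """
    max_zoom = math.sqrt(TESSERACT_MAX_BYTES / (width * height * bytes_per_pixel))
    return min(zoom, max_zoom)


# An extreme request, similar in spirit to the "few thousand times" case in the test note.
capped = clamp_zoom(1000, 1000, zoom=3000.0)
unchanged = clamp_zoom(1000, 1000, zoom=2.0)
```

A modest zoom passes through untouched, while the extreme one is reduced so that the scaled pixel buffer stays under the limit.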
John	d3a404cfb5	pdfminer bug (#2244 ) Closes #2212. ### Summary This PR implements logic to fall back to the "inferred_layout + OCR" if pdfminer fails in the `hi_res` pipeline (discussed in [this slack channel](https://unstructuredw-kbe4326.slack.com/archives/C057R3F8F7A/p1701807299018929)). ### Testing PDF: [NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf](https://github.com/Unstructured-IO/unstructured/files/13554149/NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf) ``` elements = partition_pdf( filename="NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf", strategy="hi_res", ) ``` --------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2023-12-13 00:51:38 +00:00
Steve Canny	21bc67f52f	rfctr: improve element typing (#2247 ) In preparation for work on generalized chunking including `chunk_by_character()` and overlap, get `elements` module and tests passing strict type-checking.	2023-12-12 23:12:23 +00:00
Christine Straub	da7ac625b1	Feat: save tables in PDF's as images (#2229 ) closes #2222. ### Summary The "table" elements are saved as `table-<pageN>-<tableN>.jpg`. This filename is presented in the `image_path` metadata field for the Table element. The default would be to not do this. ### Testing PDF: [124_PDFsam_Basel III - Finalising post-crisis reforms.pdf](https://github.com/Unstructured-IO/unstructured/files/13591714/124_PDFsam_Basel.III.-.Finalising.post-crisis.reforms.pdf) ``` elements = partition_pdf( filename="124_PDFsam_Basel III - Finalising post-crisis reforms.pdf", strategy="hi_res", infer_table_structure=True, extract_element_types=['Table'], ) ```	2023-12-11 19:14:41 +00:00
Christine Straub	ed76b11b1a	Refactor: support image extraction (#2201 ) ### Summary This PR is the second part of the "image extraction" refactor to move it from unstructured-inference repo to unstructured repo, the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/299. This PR adds logic to support extracting images. ### Testing `git clone -b refactor/remove_image_extraction_code --single-branch https://github.com/Unstructured-IO/unstructured-inference.git && cd unstructured-inference && pip install -e . && cd ../` ``` elements = partition_pdf( filename="example-docs/embedded-images.pdf", strategy="hi_res", extract_images_in_pdf=True, ) print("\n\n".join([str(el) for el in elements])) ```	2023-12-05 18:22:29 +00:00
John	8fa5cbf036	build(ci): rm unneeded call to get_api_key in test (#2199 ) Follow-up PR to [https://github.com/Unstructured-IO/unstructured/pull/2195](https://github.com/Unstructured-IO/unstructured/pull/2195). Removes unnecessary calls to `get_api_key()`. That helper function is supposed to only be used for tests decorated by @pytest.mark.skipif(skip_outside_ci, reason="Skipping test run outside of CI") (which are skipped because those tests are partitioning pdf/jpg files). These tests are partitioning emails and rely on the MockResponse at the top of the file, so they don't need to call `get_api_key()` and it can simply be removed from them.	2023-12-03 21:28:05 -08:00
rvztz	ce905dd098	feat: Weaviate destination connector (#1963 ) Closes #1781. - Adds a Weaviate destination connector - The connector receives a host for the weaviate instance and a weaviate class name. - Defines a weaviate schema for json elements. - Defines the pre-processing to conform unstructured's schema to the proposed weaviate schema.	2023-12-01 22:27:41 +00:00
Christine Straub	69d0ee1aea	Refactor: support merging `extracted` layout with `inferred` layout (#2158 ) ### Summary This PR is the second part of `pdfminer` refactor to move it from `unstructured-inference` repo to `unstructured` repo, the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/294. This PR adds logic to merge the extracted layout with the inferred layout. The updated workflow for the `hi_res` strategy: * pass the document (as data/filename) to the `inference` repo to get `inferred_layout` (DocumentLayout) * pass the `inferred_layout` returned from the `inference` repo and the document (as data/filename) to the `pdfminer_processing` module, which first opens the document (create temp file/dir as needed), and splits the document by pages * if is_image is `True`, return the passed inferred_layout(DocumentLayout) * if is_image is `False`: * get extracted_layout (TextRegions) from the passed document(data/filename) by pdfminer * merge `extracted_layout` (TextRegions) with the passed `inferred_layout` (DocumentLayout) * return the `inferred_layout `(DocumentLayout) with updated elements (all merged LayoutElements) as merged_layout (DocumentLayout) * pass merged_layout and the document (as data/filename) to the `OCR` module, which first opens the document (create temp file/dir as needed), and splits the document by pages (convert PDF pages to image pages for PDF file) ### Note This PR also fixes issue #2164 by using functionality similar to the one implemented in the `fast` strategy workflow when extracting elements by `pdfminer`. ### TODO * image extraction refactor to move it from `unstructured-inference` repo to `unstructured` repo * improving natural reading order by applying the current default `xycut` sorting to the elements extracted by `pdfminer`	2023-12-01 20:56:31 +00:00
John	e5bdf7fb43	chore: unstructured python client (#2195 ) ### Summary Closes #2033 Updates `partition_via_api` to use `UnstructuredClient` for api calls instead of `requests`. Updates associated tests. Note: This PR does not update `partition_multiple_via_api` as documentation in `unstructured-python-client` indicates it does not support multiple files. A new issue should be opened to add that functionality to `unstructured-python-client`. --------- Co-authored-by: Klaijan <klaijan@unstructured.io> Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-12-01 18:49:59 +00:00
pravin-unstructured	341f0f428c	Add coco staging brick to unstructured base (#2180 )	2023-11-29 20:55:23 +00:00
Yuming Long	92dae8cd1a	Chore: Repair invalid PDF structure for PDFminer when PSSyntaxError (#2137 ) ### Summary Add a procedure to repair PDF when the PDF structure is invalid for `PDFminer` to process. This PR handles two cases of `PSSyntaxError Invalid dictionary construct: ...`: * PDFminer opens the entire document and creates the pages generator on `PDFPage.get_pages(fp)`: [sentry log example](https://unstructuredio.sentry.io/issues/4655715023/?alert_rule_id=14681339&alert_type=issue&notification_uuid=d8db4cf4-686f-4504-8a22-74a79a8e966f&project=4505909127086080&referrer=slack) * PDFminer's interpreter processes a single page on `interpreter.process_page(page)`: [sentry log example](https://unstructuredio.sentry.io/issues/4655898781/?referrer=slack&notification_uuid=0d929d48-f490-4db8-8dad-5d431c8460bc&alert_rule_id=14681339&alert_type=issue) Additional tech details: * Add new dependency `pikepdf` in `requirements/extra-pdf-image.in`, which is used for repairing PDF. * Add new dependency `pypdf` in `requirements/extra-pdf-image.in`, which is used to find the error page from entire document by reading the PDF file again (can't find a way to split pdf in PDFminer). * Refactor the `is null` check for `get_uris_from_annots`, since the root cause is that `get_uris` passed a None `annots` to `get_uris_from_annots`, so the Null check should happen in `get_uris`. * Add more type protection in `get_uris_from_annots` when using any `PDFObjRef.resolve()` as `dict` (it could still be a `PDFObjRef`). This should fix : * https://github.com/Unstructured-IO/unstructured/issues/1922 where `annotation_dict` is a `PDFObjRef` * https://github.com/Unstructured-IO/unstructured/issues/1921 where `rect` is a `PDFObjRef` ### Test Added three test files (both are larger than 500 KB) for unittests to test: * Repair entire doc * Repair one page * Reprocess failure after repairing one page (just return the elements before error page in this case). 
* Also seems like splitting the document into smaller pages could fix this problem, but not sure why. For example, I saw error from reprocess in the whole [cancer.pdf](https://github.com/Unstructured-IO/unstructured/files/13461616/cancer.pdf) doc, but no error when i split the pdf by error page.... * tested if i can repair the entire doc again in this case, saw other error which means repairing is not helping imo * PDFminer can process the whole doc after pikepdf only repaired the entire doc in the first place, but we can't repair by pages in this way --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-11-29 19:00:15 +00:00
Klaijan	0aae1faa54	feat: add visualize param to command and add test (#2178 ) - Add `visualize` parameter to the click command -- now callable using `--visualize` flag to show the progress bar. - Refactor the name.	2023-11-29 01:05:55 +00:00
Yuming Long	6c08c136ae	ci: fix broken API unit test for using unsupported `fast` strategy for images (#2144 ) ### Summary This should fix the broken unit test on main CI * change the strategy in `test_partition_multiple_via_api_valid_request_data_kwargs` from `fast` to `auto`, since the test was using `fast` for images, and we don't support it.	2023-11-22 17:35:04 -08:00
Steve Canny	02e8c962aa	fix(docx): tables in header/footer dropped (#2135 ) A DOCX header or footer is a so-called "story part" meaning like the document body (which is also a story part) it can contain both paragraphs and tables. The implementation of `Header.text` and `Footer.text` gather only the paragraphs. Add a new method to extract all content from a header or footer, including table content, suitable for use as the `.text` attribute of that element. Fixes #2126.	2023-11-22 15:39:25 -08:00
Klaijan	2c2d5b65ca	refactor: measure_text_edit_distance function for aggregation (#2108 ) - Refactor `metrics/evaluation.py` to accepts `grouping` as parameter. - Switch to `DataFrame` for easier analysis and aggregation.	2023-11-22 13:30:16 -08:00


553 Commits