unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-31 18:14:51 +00:00

Author	SHA1	Message	Date
Yao You	a11ad22609	bump `unstructured-inference` (#3711 ) This PR bumps `unstructured-inference` to `0.8.0`, which introduces a vectorized data structure for layout elements and text regions. This PR also cleans up a few places in CI that have repeated definitions of env variables or missing installation of testing dependencies in cache. A few document ingest results are changed: - two places for `biomed-api` (actually processed locally on runner) are due to very small changes in numerical results of the bounding box areas: one results in a duplicated page number/header and another results in a deduplication of a word of a sentence that starts in a new line. (yes, the two cases go in opposite directions) - the layout parser paper now outputs the code lines with page number inside the code box as list items --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-10-21 21:55:08 +00:00
Steve Canny	3240e3d17a	rfctr(pptx): minify HTML and table.text is cct (#3734 ) Summary Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html` HTML introduced by `partition_pptx()`. Produce minified `.text_as_html` consistent with that formed by chunking. Additional Context - PPTX `.metadata.text_as_html` is minified (no extra whitespace or thead, tbody, tfoot elements). - `table.text` is clean-concatenated-text (CCT) of table. - Last use of `tabulate` library is removed and that dependency is removed from `base.in`.	2024-10-21 16:23:15 +00:00
Steve Canny	208c7edc52	rfctr(csv): minify HTML and table text is cct (#3733 ) Summary Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html` HTML introduced by `partition_csv()`. Produce minified `.text_as_html` consistent with that formed by chunking. Additional Context - CSV `.metadata.text_as_html` is minified (no extra whitespace or thead, tbody, tfoot elements). - `table.text` is clean-concatenated-text (CCT) of table. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-10-19 06:49:09 +00:00
Steve Canny	c85f29e6ca	fix(xlsx): XLSX emits std minified .text_as_html (#3558 ) Summary Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html` HTML introduced by `partition_xlsx()`. Produce minified `.text_as_html` consistent with that formed by chunking. Additional Context - XLSX `.text_as_html` is minified (no extra whitespace or thead, tbody, tfoot elements). - `table.text` is clean-concatenated-text (CCT) of table. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-10-17 22:05:11 +00:00
Nathan Van Gheem	b092d45816	Remove unsupported chipper model (#3728 ) The chipper model is no longer supported.	2024-10-17 17:40:45 +00:00
Steve Canny	1eceac26c8	rfctr(email): eml partitioner rewrite (#3694 ) Summary Initial attempts to incrementally refactor `partition_email()` into shape to allow pluggable partitioning quickly became too complex for ready code-review. Prepare separate rewritten module and tests and swap them out whole. Additional Context - Uses the modern stdlib `email` module to reliably accomplish several manual decoding steps in the legacy code. - Remove obsolete email-specific element-types which were replaced 18 months or so ago with email-specific metadata fields for things like Cc: addresses, subject, etc. - Remove accepting an email as `text: str` because MIME-email is inherently a binary format which can and often does contain multiple and contradictory character-encodings. - Remove the `encoding` parameter as it is now unused. An email file is not a text file and as such does not have a single overall encoding. Character encoding is specified individually for each MIME-part within the message and often varies from one part to another in the same message. - Remove the need for a caller to specify `attachment_partitioner`. There is only one reasonable choice for this which is `auto.partition()`, consistent with the same interface and operation in `partition_msg()`. - Fixes #3671 along the way by silently skipping attachments with a file-type for which there is no partitioner. - Substantially extend the test-suite to cover multiple transport-encoding/charset combinations. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-10-16 02:02:33 +00:00
Roman Isecke	9049e4e2be	feat/remove ingest code, use new dep for tests (#3595 ) ### Description Alternative to https://github.com/Unstructured-IO/unstructured/pull/3572 but maintaining all ingest tests, running them by pulling in the latest version of unstructured-ingest. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com> Co-authored-by: Christine Straub <christinemstraub@gmail.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-10-15 10:01:34 -05:00
David Blore	ecf0267b85	fix: add `language` to `OCRAgentGoogleVision` constructor (#3696 ) This PR addresses issue #3659 by adding an optional `language` parameter to the `OCRAgentGoogleVision` class constructor. This parameter serves as a "language hint" for the `document_text_detection` method in the `ImageAnnotatorClient`. For more information on language hints, refer to the [Google Cloud Vision documentation](https://cloud.google.com/vision/docs/languages). Default Behavior: The language parameter defaults to None, allowing Google Cloud Vision to auto-detect the language, as recommended in their documentation. Purpose: This change is necessary because the `OCRAgent`'s `get_instance` method expects all `OCRAgent`s to include a language parameter in their constructors. Context on Issue: When trying to parse a PDF with `OCR_AGENT=unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`, an error occurs in the `get_instance` method. The method expects a `language` parameter, which the current `OCRAgentGoogleVision` constructor does not support, leading to a positional argument error. --------- Co-authored-by: Christine Straub <christinemstraub@gmail.com>	2024-10-14 05:35:05 +00:00
Steve Canny	2f496f867c	fix(auto): quick fix for auto test failing in CI (#3715 ) Better fix to follow.	2024-10-10 18:44:00 +00:00
John	27fa2a39d8	add tests describing the behavior of set_element_hierarchy (#3700 ) Small pr adding tests describing the behavior of `set_element_hierarchy`. No tests were changed, just added.	2024-10-04 22:49:38 +00:00
Steve Canny	718891a447	rfctr(part): remove double-decoration 5 (#3692 ) Summary Remove double-decoration from EML and MSG. Additional Context - These needed to wait to the end because `partition_email()` and `partition_msg()` can use any other partitioner for one of their attachments. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-10-04 21:01:32 +00:00
Steve Canny	4711a8dc26	rfctr(part): remove double-decoration 4 (#3690 ) Summary Install new `@apply_metadata()` on TXT. Additional Context - Both EML and MSG delegate to both HTML and TXT to partition the message-body, depending on which MIME-part body payload is selected (`text/plain` or `text/html`). This PR prepares the way to remove decorators from EML and MSG in the next PR. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-10-03 16:41:31 +00:00
Steve Canny	9bd91a836e	rfctr(part): remove double-decoration 3 (#3687 ) Summary Install new `@apply_metadata()` on HTML and remove decorators from delegating partitioners EPUB, MD, ORG, RST, and RTF. Additional Context - All five of these delegating partitioners delegate to `partition_html()` so they're something of a matched set. EML and MSG also partially delegate to HTML but that's a harder problem (they also delegate to all other partitioners for attachments) that we'll address a couple PRs later . - Replace use of `@process_metadata()` and `@add_metadata_with_filetype()` decorators with `@apply_metadata()` on `partition_html()`. - Remove all decorators from delegating partitioners; this removes the "double-decorating".	2024-10-02 21:04:37 +00:00
Steve Canny	17092198d0	rfctr(part): remove double-decoration 2 (#3686 ) Summary Install new `@apply_metadata()` on PPTX, TSV, XLSX, and XML and remove decoration from PPT. Additional Context - Alphabetical order turns out to be hard, so this is the remaining "easy" delegating partitioner and the remaining principal partitioners. - Replace use of `@process_metadata()` and `@add_metadata_with_filetype()` decorators with `@apply_metadata()` on principal partitioners (those that do not delegate to other partitioners). - Remove all decorators from delegating partitioners (PPT in this case); this removes the "double-decorating".	2024-10-02 18:52:59 +00:00
Steve Canny	bba60260b2	rfctr(part): remove double-decoration 1 (#3685 ) Summary Install new `@apply_metadata()` on CSV and DOCX and remove decoration from DOC and ODT. Additional Context - Working in alphabetical order and keeping PR size manageable, replace use of `@process_metadata()` and `@add_metadata_with_filetype()` decorators with `@apply_metadata()` on principal partitioners (those that do not delegate to other partitioners). - Remove all decorators from delegating partitioners (DOC and ODT in this case); this removes the "double-decorating".	2024-10-01 22:40:58 +00:00
Steve Canny	c05e1babf1	rfctr(meta): refine @apply_metadata() decorator (#3667 ) Summary Refine `@apply_metadata()` replacement decorator. Note it has not been installed yet. - Apply `metadata_last_modified` arg with the `@apply_metadata()` decorator. No need for redundant processing in each partitioner. - Add "unique-ify" step to fix any cases where the same `Element` or `ElementMetadata` instance was used more than once in the element stream. This prevents unexpected "multi-mutation" in downstream processes. - Apply "global" metadata items before computing hash-ids. In particular, `.metadata.filename` is used in the hash computation and will produce different results if that's not already settled. - Compute hash-ids _before_ computing `.metadata.parent_id`. This removes the need for mapping UUID element-ids to their hash counterpart and doing a fixup of `.parent_id` after applying hash-ids to elements. Additional Context - The `@apply_metadata()` decorator replaces the four metadata-related decorators: `@process_metadata()`, `@add_metadata_with_filetype()`, `@add_metadata()`, and `@add_filetype()`. - It will be installed on each partitioner in a series of following PRs.	2024-10-01 21:28:32 +00:00
Tomasz Cąkała	75c4998bc7	Fix: partition on empty or whitespace-only text files (#3675 ) This fixes this [bug](https://github.com/Unstructured-IO/unstructured/issues/3674): auto partition fails on text files that are empty or contain only whitespace. Inference of the .txt file type fails if the file contains only whitespace. To Reproduce: ``` from tempfile import NamedTemporaryFile from unstructured.partition.auto import partition with NamedTemporaryFile(mode="w", suffix=".txt") as f: f.write(" \n") f.seek(0) elements = partition(filename=f.name) ```	2024-09-28 21:16:33 -07:00
Steve Canny	50d75c47d3	rfctr(part): add new decorator to replace four (#3650 ) Summary In preparation for pluggable auto-partitioners, add a new metadata decorator to replace the four existing ones. Additional Context "Global" metadata items, those applied to all elements on all partitioners, are applied using a decorator. Currently there are four decorators where there only needs to be one. Consolidate those into a single metadata decorator. One or two additional behaviors of the new decorator will allow us to remove decorators from delegating partitioners which is a prerequisite for pluggable auto-partitioners.	2024-09-25 23:15:50 +00:00
Steve Canny	44bad216f3	rfctr(part): prepare for pluggable auto-partitioners 3 (#3661 ) Summary Remove unused `include_metadata` parameter. Additional Context - The `include_metadata` parameter was originally added circa v0.7.12 as a mechanism for avoiding the "double-decorating" problem on delegating partitioners. - It turns out it doesn't fully address that problem, is now unused, and is unnecessary for the solution we'll be adding as part of pluggable partitioners. - Remove the unnecessary complexity introduced by this unused parameter.	2024-09-25 18:17:48 +00:00
Steve Canny	086b8d6f8a	rfctr(part): prepare for pluggable auto-partitioners 2 (#3657 ) Summary Step 2 in prep for pluggable auto-partitioners, remove `regex_metadata` field from `ElementMetadata`. Additional Context - "regex-metadata" was an experimental feature that didn't pan out. - It's implemented by one of the post-partitioning metadata decorators, so get rid of it as part of the cleanup before consolidating those decorators.	2024-09-24 17:33:25 +00:00
Yao You	903efb0c6d	fix: fix occasional key error when mapping parent id (#3658 ) This PR fixes an occasional `KeyError` when calling `assign_and_map_hash_ids`. - This happens when the input `elements` has duplicated element instances or metadata. - When there are duplications, the logic that iterates through all elements and maps their parent ids raises an error when an already mapped parent id is up for mapping. - The fix adds logic to check whether the parent id exists in `old_to_new_mapping`; if it doesn't, we skip mapping it. ## test This PR adds a unit test for this case; the test would fail without the fix.	2024-09-24 16:39:11 +00:00
Austin Walker	6428d19e5a	fix: update python SDK syntax for forward compatibility (#3656 ) Wrap the `shared.PartitionParameters` usage with `operations.PartitionRequest`. This syntax has been deprecated since v0.23.0 of the SDK, and will be unsupported in v0.26.0.	2024-09-24 16:37:38 +00:00
Steve Canny	3bab9d93e6	rfctr(part): prepare for pluggable auto-partitioners 1 (#3655 ) Summary In preparation for pluggable auto-partitioners simplify metadata as discussed. Additional Context - Pluggable auto-partitioners require partitioners to have a consistent call signature. An arbitrary partitioner provided at runtime needs to have a call signature that is known and consistent. Basically `partition_x(filename, *, file, **kwargs)`. - The current `auto.partition()` is highly coupled to each distinct file-type partitioner, deciding which arguments to forward to each. - This is driven by the existence of "delegating" partitioners, those that convert their file-type and then call a second partitioner to do the actual partitioning. Both the delegating and proxy partitioners are decorated with metadata-post-processing decorators and those decorators are not idempotent. We call the situation where those decorators would run twice "double-decorating". For example, EPUB converts to HTML and calls `partition_html()` and both `partition_epub()` and `partition_html()` are decorated. - The way double-decorating has been avoided in the past is to avoid sending the proxy partitioner the arguments the metadata decorators are sensitive to. This is very obscure, complex to reason about, error-prone, and just overall not a viable strategy. The better solution is to not decorate delegating partitioners and let the proxy partitioner handle all the metadata. - This first step in preparation for that is part of simplifying the metadata processing by removing unused or unwanted legacy parameters. - `date_from_file_object` is a misnomer because a file-object never contains last-modified data. - It can never produce useful results in the API where last-modified information must be provided by `metadata_last_modified`. - It is an undocumented parameter so not in use. - Using it can produce incorrect metadata.	2024-09-23 22:23:10 +00:00
Steve Canny	03c2bf8f1f	rfctr(part): extract partition.common submodules (#3649 ) Summary In preparation for consolidating post-partitioning metadata decorators, extract `partition.common` module into a sub-package (directory) and extract `partition.common.metadata` module to house metadata-specific objects shared by partitioners. Additional Context - This new module will be the home of the new consolidated metadata decorator. - The consolidated decorator is a step toward removing post-processing decorators from _delegating_ partitioners. A delegating partitioner is one that converts its file to a different format and "delegates" actual partitioning to the partitioner for that target format. 10 of the 20 partitioners are delegating partitioners. - Removing decorators from delegating partitioners will allow us to avoid "double-decorating", i.e. running those decorators twice, once on the principal partitioner and again on the proxy partitioner. - This will allow us to send `**kwargs` to either partitioner, removing the knowledge of which arguments to send for each file-type from auto-partition. - And this will allow pluggable auto-partitioners which all have a `partition_x(filename, *, file, **kwargs) -> list[Element]` interface.	2024-09-20 20:35:28 +00:00
Christine Straub	0ed69a1ac3	refactor: pdfminer image cleanup (#3648 ) This PR aims to remove `clean_pdfminer_duplicate_image_elements()` function, as its functionality has already been integrated into the `remove_duplicate_elements()` function in [PR #3630](https://github.com/Unstructured-IO/unstructured/pull/3630).	2024-09-19 18:57:02 +00:00
Christine Straub	be88eef06f	perf: optimize pdfminer image cleanup process for improved performance (#3630 ) This PR enhances `pdfminer` image cleanup process by repositioning the duplicate image removal step. It optimizes the removal of duplicated pdfminer images by performing the cleanup before merging elements, rather than after. This improvement reduces execution time and enhances the overall processing speed of PDF documents. --------- Co-authored-by: Yao You <theyaoyou@gmail.com>	2024-09-19 14:05:05 +00:00
Steve Canny	cd074bb32b	chore(file): remove dead code (#3645 ) Summary Remove dead code in `unstructured.file_utils`. Additional Context These modules were added in 12/2022 and 1/2023 and are not referenced by any code. Removing to reduce unnecessary complexity. These can of course be recovered from Git history if we decide we want them again in future.	2024-09-19 06:45:33 +00:00
Christine Straub	87a88a3c87	feat: improve pdfminer element processing (#3618 ) This PR implements splitting of `pdfminer` elements (`groups of text chunks`) into smaller bounding boxes (`text lines`). This implementation prevents loss of information from the object detection model and facilitates more effective removal of duplicated `pdfminer` text. This PR also addresses #3430. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-09-12 21:17:27 +00:00
John	ab94c6c5d1	chore: remove pins (#3579 ) - Remove constraint pins for `Office365-REST-Python-Client`, `weaviate-client`, and `platformdirs`. Removing the pin for `Office365` brought to light some bugs in the Onedrive connector, so some changes were also made to `unstructured/ingest/v2/processes/connectors/onedrive.py`. - Also, as part of updating dependencies `unstructured-client` was updated to `0.25.8`, which introduced a new default for the `strategy` param and required updating a test fixture. - The `hubspot.sh` integration test was failing and is now ignored in CI with this PR per discussion with @rbiseck3. May be easiest to review commit-by-commit.	2024-09-12 13:48:59 +00:00
cragwolfe	3bb0ee1e79	chore: fix tests breaking on main (#3603 ) Fix API tests (really more like integration tests) that run only on main. Also use less compute-intensive files to speed up test time and remove a useless test. Tests in `test_unstructured/partition/test_api.py` pass, temporarily running outside of main per screenshot: ![image](https://github.com/user-attachments/assets/f15d440a-2574-40f2-98b4-adf57fbae704) https://github.com/Unstructured-IO/unstructured/actions/runs/10754098974/job/29824415513	2024-09-08 21:25:52 +00:00
Yao You	d51fb134e6	Feat/improve iou speed (#3582 ) This PR vectorizes the computation of element overlap to speed up the deduplication process of extracted elements. ## test This PR adds unit tests for the new vectorized IOU and subregion computation functions. In addition, running partition on large files with many elements like this slide: [002489.pdf](https://github.com/user-attachments/files/16823176/002489.pdf) shows a reduction of runtime from around 15min on the main branch to less than 4min with this branch. Profiling results show that the new implementation greatly reduces the time cost of computation and now most of the time is spent on getting the coordinates from a list of bboxes. ![Screenshot 2024-08-30 at 9 29 27 PM](https://github.com/user-attachments/assets/6c186838-54c7-483b-ac3e-7342c23ff3a6)	2024-09-03 00:06:18 +00:00
Pawel Kmiecik	404f780bbb	feat: make analysis drawing more flexible (#3574 ) This PR changes the way the analysis tools can be used: - by default if `analysis` is set to `True` in `partition_pdf` and the strategy is resolved to `hi_res`: - for each file 4 layout dumps are produced and saved as JSON files (`object_detection`, `extracted`, `ocr`, `final`) - similar way to the current `object_detection` dump - the drawing functions/classes now accept these dumps accordingly instead of the internal classes instances (like `TextRegion`, `DocumentLayout`) - it makes it possible to use the lightweight JSON files to render the bboxes of a given file after the partition is done - `_partition_pdf_or_image_local` has been refactored and most of the analysis code is now encapsulated in `save_analysis_artifiacts` function - to do this, helper function `render_bboxes_for_file` is added <img width="338" alt="Screenshot 2024-08-28 at 14 37 56" src="https://github.com/user-attachments/assets/10b6fbbd-7824-448d-8c11-52fc1b1b0dd0">	2024-09-02 11:06:11 +00:00
Matt Robinson	6ba8135bf9	fix: check ole storage content to differentiate filetypes (#3581 ) ### Summary Updates the file detection logic for OLE files to check the storage content of the file to more reliably differentiate between DOC, PPT, XLS and MSG files. This corrects a bug that caused file type detection to be incorrect in cases where the `filetype` library guessed an incorrect MIME type, such as `'application/vnd.ms-excel'` for a `.msg` file. As part of this work, the `"msg"` extra was removed because the `python-oxmsg` package is now a base dependency. ### Testing Using a test `.msg` file that returns `'application/vnd.ms-excel'` from `filetype.guess_mime`. ```python from unstructured.file_utils.filetype import detect_filetype filename = "test-file.msg" detect_filetype(filename=filename) # result should be FileType.MSG ```	2024-08-30 15:12:46 -04:00
Austin Walker	f440eb476c	feat: Support encoding parameter in partition_csv (#3564 ) See added test file. Added support for the encoding parameter, which can be passed directly to `pd.read_csv`.	2024-08-28 14:19:58 +00:00
John	f21c853ade	bug: fix file_conversion disk leak (#3562 ) Fix disk space leaks and Windows errors when accessing file.name on a NamedTemporaryFile Uses of `NamedTemporaryFile(..., delete=False)` and/or uses of `file.name` of NamedTemporaryFile have been replaced with TemporaryFileDirectory to avoid a known issue: - https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile - https://github.com/Unstructured-IO/unstructured/issues/3390 The first 7 commits each address an individual occurrence of the issue if reviewers want to review commit-by-commit.	2024-08-27 22:02:24 +00:00
David Potter	ddba928344	Potter/mixedbread embedder (#3513 ) Thanks to @huangrpablo and @juliuslipp we now have a mixedbread.ai embedder!	2024-08-27 14:52:13 +00:00
Steve Canny	32bb77aafb	fix(file): no default OLE subtype (#3516 ) Summary Do not assume MSG format when an OLE "container" file cannot be differentiated into DOC, PPT, XLS, or MSG. Fall back to extension-based identification in that case. Additional Context DOC, MSG, PPT, and XLS are all OLE files. An OLE file is, very roughly, a Microsoft-proprietary Zip format which "contains" a filesystem of discrete files and directories. An OLE "container" is easily identified by inspecting the first 8 bytes of the file, so all we need to do is differentiate between the four subtypes we can process. The `filetype` module does a good job of this but it is not perfect and does not identify MSG files. Previously we assumed MSG format when none of DOC, PPT, or XLS was detected, but we discovered that `filetype` is not completely reliable at detecting these types. Change the behavior to remove the assumption of MSG format. `_OleFileDifferentiator` returns `None` in this case and filetype detection falls back to use filename-extension. Note a file with no filename and no metadata_filename or an incorrect extension will not be correctly identified in this case, however we're assuming for now that will be rare in practice.	2024-08-22 19:16:53 +00:00
Steve Canny	03e0ed3519	rfctr(docx): DOCX emits std minified .text_as_html (#3545 ) Summary Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html` HTML introduced by `partition_docx()`. Produce minified `.text_as_html` consistent with that formed by chunking. Additional Context - nested tables appear as their extracted text in the parent cell (no nested `<table>` elements in `.text_as_html`). - DOCX `.text_as_html` is minified (no extra whitespace or thead, tbody, tfoot elements).	2024-08-21 18:54:21 +00:00
John	604cadfb7e	chore: remove ipython pin (#3548 ) this pr is stacked on https://github.com/Unstructured-IO/unstructured/pull/3538 and https://github.com/Unstructured-IO/unstructured/pull/3547 This pr removes dependency pins for IPython, anyio, and pyparsing. It also updates the label-studio-sdk import statement so we don't have to have that pinned and make some minor type hinting edits. Label Studio had a breaking change in their 1.13.0 [release](https://github.com/HumanSignal/label-studio/releases/tag/1.13.0)	2024-08-21 00:06:31 +00:00
Steve Canny	a861ed8fe7	feat(chunk): split tables on even row boundaries (#3504 ) Summary Use more sophisticated algorithm for splitting oversized `Table` elements into `TableChunk` elements during chunking to ensure element text and HTML are "synchronized" and HTML is always parseable. Additional Context Table splitting now has the following characteristics: - `TableChunk.metadata.text_as_html` is always a parseable HTML `<table>` subtree. - `TableChunk.text` is always the text in the HTML version of the table fragment in `.metadata.text_as_html`. Text and HTML are "synchronized". - The table is divided at a whole-row boundary whenever possible. - A row is broken at an even-cell boundary when a single row is larger than the chunking window. - A cell is broken at an even-word boundary when a single cell is larger than the chunking window. - `.text_as_html` is "minified", removing all extraneous whitespace and unneeded elements or attributes. This maximizes the semantic "density" of each chunk.	2024-08-19 18:56:53 +00:00
Christine Straub	fc26426310	feat: replace `pytesseract` with `unstructured.pytesseract` fork (#3528 ) This PR reverts `pytesseract` dependency to `unstructured.pytesseract` fork due to the unavailability of some recent release versions of `pytesseract` on PyPI. This PR also addresses an issue encountered during the publication of `unstructured==0.15.4` to PyPI. The error was due to the fact that PyPI does not allow direct dependencies from Version Control System URLs like GitHub in the `install_requires` or `extras_require` sections of the `setup.py` file.	2024-08-16 10:34:22 -04:00
Matt Robinson	7437f0a084	fix(CVE-2024-39705): update to latest `nltk` version (#3512 ) ### Summary Addresses [CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705) by updating to `nltk==3.8.2` and closes #3511. This CVE had previously been mitigated in #3361. --------- Co-authored-by: Christine Straub <christinemstraub@gmail.com>	2024-08-13 09:39:29 -04:00
Steve Canny	cbe1b35621	rfctr(chunk): prep for adding TableSplitter (#3510 ) Summary Mechanical refactoring in preparation for adding (pre-chunk) `TableSplitter` in a PR stacked on this one.	2024-08-12 18:04:49 +00:00
Jake Zerrer	051be5aead	Remove unstructured.pytesseract fork (#3454 ) A second attempt at https://github.com/Unstructured-IO/unstructured/pull/3360, this PR removes unstructured's dependency on its own fork of `pytesseract`. (The original reason for the fork, the addition of `run_and_get_multiple_output`, was removed [here](https://github.com/madmaze/pytesseract/releases/tag/v0.3.12).) --------- Co-authored-by: Christine Straub <christinemstraub@gmail.com>	2024-08-09 04:28:48 +00:00
John	24a1f298e5	chore: small edits (#3480 ) Add comments and fix decorators on some tests.	2024-08-06 19:21:43 +00:00
Steve Canny	73bef27ef1	fix(pptx): accommodate invalid image/jpg MIME-type (#3475 ) As described in #3381, some clients, perhaps including Adobe PDF Converter, map JPEG images to the invalid `image/jpg` MIME-type. Prior to v1.0.0, `python-pptx` would not load these images, which caused image extraction to fail. Update the `python-pptx` dependency to `v1.0.1` or above to ensure this upstream fix is always available. Fixes: #3381	2024-08-06 18:48:15 +00:00
Steve Canny	a468b2de3b	rfctr(csv): accommodate single column CSV files (#3483 ) Summary Improve factoring, type-annotation, and tests for `partition_csv()` and accommodate single-column CSV files. Fixes: #2616	2024-08-06 00:48:37 +00:00
Maciej Kurzawa	b749b891a7	fix: disabled checking max pages for images (#3473 ) Added fix related to https://github.com/Unstructured-IO/unstructured/pull/3431, which disables checking max pages for images	2024-08-02 14:25:08 +00:00
John	147514f6b5	feat: msg and email metadata (#3444 ) Update partition_eml and partition_msg to capture cc, bcc, and message id fields. Docs PR: https://github.com/Unstructured-IO/docs/pull/135/files Testing ``` from unstructured.partition.email import partition_email from test_unstructured.unit_utils import example_doc_path elements = partition_email(filename=example_doc_path("eml/fake-email-header.eml"), include_headers=True) print(elements) elements[0].metadata.to_dict() ``` Note to reviewers: Tests in `test_unstructured/partition/test_email.py` were refactored and rearranged to group similar tests together, so it will be easiest to review those changes commit by commit. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>	2024-08-01 19:24:17 +00:00
Maciej Kurzawa	8fd216cc9f	feat/pdf-page-limit-in-hi-res (#3431 ) # Description: Passing the `max_pages` argument allows rejecting PDF files that exceed this page limit when the `hi_res` strategy is chosen. By default, PDF files with any number of pages are parsed. # Testing: ```python from unstructured.partition.auto import partition elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res') # should pass elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=4) # should pass elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=2) # should raise PdfMaxPagesExceededError ```	2024-07-30 16:52:17 +00:00
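
Note on the "double-decorating" problem named in the `rfctr(part)` commits above (#3655, #3685 through #3692): the sketch below is only an illustration of why a non-idempotent post-processing decorator misbehaves when both a delegating partitioner and the partitioner it delegates to are decorated. Every name in it (`count_metadata_passes`, `partition_target`, and the two delegating functions) is hypothetical and is not part of `unstructured`; elements are plain dicts rather than `Element` objects.

```python
from functools import wraps
from typing import Callable, Dict, List

Elements = List[Dict[str, object]]


def count_metadata_passes(func: Callable[..., Elements]) -> Callable[..., Elements]:
    """Hypothetical stand-in for a metadata decorator that is NOT idempotent."""

    @wraps(func)
    def wrapper(*args, **kwargs) -> Elements:
        elements = func(*args, **kwargs)
        for element in elements:
            # Each pass mutates the element, so a second pass changes the result.
            element["metadata_passes"] = int(element.get("metadata_passes", 0)) + 1
        return elements

    return wrapper


@count_metadata_passes
def partition_target(text: str) -> Elements:
    """Principal partitioner: does the real work and owns the metadata step."""
    return [{"text": line} for line in text.splitlines() if line.strip()]


@count_metadata_passes
def partition_delegating_decorated(text: str) -> Elements:
    """Decorated delegating partitioner: the metadata step runs twice."""
    return partition_target(text.upper())


def partition_delegating_plain(text: str) -> Elements:
    """Undecorated delegating partitioner: the metadata step runs once."""
    return partition_target(text.upper())


double = partition_delegating_decorated("a\nb")
single = partition_delegating_plain("a\nb")
print(double[0]["metadata_passes"], single[0]["metadata_passes"])
```

Leaving the delegating function undecorated, as the commits describe, means the metadata step runs exactly once regardless of which arguments are forwarded.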

656 Commits
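
Note on the OLE detection described in #3516 and #3581 above: the commit message says an OLE "container" can be identified from the first 8 bytes of the file, with a fall back to the filename extension when the subtype cannot be determined. A minimal sketch of that idea follows. It is not the library's `_OleFileDifferentiator`; the function names and the extension table are assumptions made for illustration, and the storage-content inspection added in #3581 is deliberately left out.

```python
import os
from typing import Optional

# Signature that opens an OLE / Compound File Binary container.
OLE_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical extension table for the four OLE subtypes named in the commit.
OLE_EXTENSIONS = {".doc": "DOC", ".msg": "MSG", ".ppt": "PPT", ".xls": "XLS"}


def is_ole_container(head: bytes) -> bool:
    """Return True when the leading bytes carry the OLE signature."""
    return head[:8] == OLE_SIGNATURE


def ole_type_from_extension(head: bytes, filename: Optional[str]) -> Optional[str]:
    """Extension-based fallback; returns None rather than assuming MSG."""
    if not is_ole_container(head) or filename is None:
        return None
    extension = os.path.splitext(filename)[1].lower()
    return OLE_EXTENSIONS.get(extension)


sample_head = OLE_SIGNATURE + b"\x00" * 8
print(ole_type_from_extension(sample_head, "mail.MSG"))
print(ole_type_from_extension(sample_head, None))
```

As the commit notes, a file with no filename or with an incorrect extension is not identified by this fallback.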