unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-08-15 04:08:49 +00:00

Author	SHA1	Message	Date
Steve Canny	3d6e30a1f7	rfctr(auto): improve expression in tests (#3384 ) Summary In preparation for further work on auto-partitioning, improve the expression in the test-suite.	2024-07-11 19:57:28 +00:00
Steve Canny	c27e0d0062	rfctr(html): replace html parser (#3218 ) Summary Replace legacy HTML parser with recursive version that captures all content and provides flexibility to add new metadata. It's also substantially faster although that's just a happy side-effect. Additional Context The prior HTML parsing algorithm that makes up the core of HTML partitioning was buggy and very difficult to reason about because it did not conform to the inherently recursive structure of HTML. The new version retains `lxml` as the performant and reliable base library but uses `lxml`'s custom element classes to efficiently classify HTML elements by their behaviors (block-item and inline (phrasing) primarily) and give those elements the desired partitioning behaviors. This solves a host of existing problems with content being skipped and elements (paragraphs) being divided improperly, but also provides a clear domain model for reasoning about its behavior and reliably adjusting it to suit our existing and future purposes. The parser's operation is recursive, closely modeling the recursive structure of HTML itself. Its behaviors are based on the HTML Standard and reliably produce proper and explainable results even for novel cases. Fixes #2325 Fixes #2562 Fixes #2675 Fixes #3168 Fixes #3227 Fixes #3228 Fixes #3230 Fixes #3237 Fixes #3245 Fixes #3247 Fixes #3255 Fixes #3309 ### BEHAVIOR DIFFERENCES #### `emphasized_text_tags` encoding is changed: - `<strong>` is encoded as `"b"` rather than `"strong"`. - `<em>` is encoded as `"i"` rather than `"em"`. - `<span>` is no longer recorded in `emphasized_text_tags` (because without the CSS we can't tell whether it's used for emphasis or if so what kind). - nested emphasis (e.g. bold+italic) is encoded as multiple characters ("bi"). 
- `emphasized_text_contents` is broken on emphasis-change boundaries, like: ```html `<p>foo <b>bar <i>baz</i> bada</b> bing</p>` ``` produces: ```json { "emphasized_text_contents": ["bar", "baz", "bada"], "emphasized_text_tags": ["b", "bi", "b"] } ``` whereas previously it would have produced: ```json { "emphasized_text_contents": ["bar baz bada", "baz"], "emphasized_text_tags": ["b", "i"] } ``` #### `<pre>` text is preserved as it appears in the html Except that a leading newline is removed if present (has to be in position 0 of text). Also, a trailing newline is stripped but only if it appears in the very last position ([-1]) of the `<pre>` text. Old parser stripped all leading and trailing whitespace. Result is that: ```html <pre> foo bar baz </pre> ``` parses to `"foo\nbar\nbaz"` which is the same result produced for: ```html <pre>foo bar baz</pre> ``` This equivalence is the same behavior exhibited by a browser, which is why we did the extra work to make it this way. #### Whitespace normalization Leading and trailing whitespace are removed from element text, just as it is removed in the browser. Runs of whitespace within the element text are reduced to a single space character (like in the browser). Note this means that `\t`, `\n`, and ` ` are replaced with a regular space character. All text derived from elements is whitespace normalized except the text within a `<pre>` tag. Any leading or trailing newline is trimmed from `<pre>` element text; all other whitespace is preserved just as it appeared in the HTML source. #### `link_start_indexes` metadata is no longer captured. Rationale: - It was frequently wrong, often `-1`. - It was deprecated but then added back in a community PR. - Maintaining it across any possible downstream transformations (e.g. chunking) would be expensive and almost certainly lead to wrong values as distant code evolves. 
- It is complex to compute and recompute when whitespace is normalized, adding substantial complexity to the code and reducing readability and maintainability #### `<br/>` element is replaced with a single newline (`"\n"`) but that is usually replaced with a space in `Element.text` when it is normalized. The newline is preserved within a `<pre>` element. - Related: _No paragraph-break on `<br/><br/>`_ #### Empty `h1..h6` elements are dropped. HTML heading elements (`<h1..h6>`) are "skipped" (do not generate a `Title` element) when they contain no text or contain only whitespace. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-07-11 00:14:28 +00:00
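The whitespace rules described in the commit above can be illustrated with a small standalone sketch. This is not the parser's actual code; the function names `normalize_text` and `normalize_pre_text` are hypothetical and only mirror the behavior the message describes (collapse whitespace runs for ordinary element text; for `<pre>` text, drop one newline at position 0 and one at position -1 and preserve everything else).

```python
def normalize_text(text: str) -> str:
    """Collapse each run of whitespace to one space and trim both ends."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and discards leading/trailing runs.
    return " ".join(text.split())


def normalize_pre_text(text: str) -> str:
    """Drop one leading and one trailing newline; keep all other whitespace."""
    if text.startswith("\n"):
        text = text[1:]
    if text.endswith("\n"):
        text = text[:-1]
    return text


# Ordinary element text: tabs and newlines become single spaces.
print(repr(normalize_text("  foo\t bar\n baz  ")))
# <pre> text: both spellings from the commit message give the same result.
print(repr(normalize_pre_text("\nfoo\nbar\nbaz\n")))
print(repr(normalize_pre_text("foo\nbar\nbaz")))
```

Keeping the two rules in separate functions reflects the split the message draws between `<pre>` and every other element.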
Steve Canny	0c562d8050	rfctr(auto): fix auto-partition test xfails and skips (#3367 ) Summary Improve expression in auto-partition tests and fix xfails and skips. Add issues for the two hard-fails where xfail needed to stay.	2024-07-10 05:29:07 +00:00
Steve Canny	00e1d5c05b	rfctr(html): refine HTML parser (#3351 ) Note This refines the new HTML parser but _does not install it_. This is why no changes to ingest test expectations or other unit-tests are required here. Installing the new parser will happen in the next PR #3218. Summary The initial version of the parser (purposely) raised on a block element nested inside a phrasing element. While such nesting is not valid according to the HTML Standard, it is accepted by the browser and does happen in the wild. The refinements here handle this situation similarly to how the browser does, breaking phrasing at the block element boundaries and starting it up again after the block element. Unfortunately this adds complexity to the parser, but it makes the parser robust against pretty much any HTML we're likely to encounter and partitions it consistent with how it would be rendered in the browser.	2024-07-09 01:10:03 +00:00
Matt Robinson	7b25dfc337	fix(CVE-2024-39705): remove nltk download (#3361 ) ### Summary Addresses [CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705), which highlights the risk of remote code execution when running `nltk.download` . Removes `nltk.download` in favor of a `.tgz` file with the appropriate NLTK data files and checking the SHA256 hash to validate the download. An error is now raised if `nltk.download` is invoked. The logic for determining the NLTK download directory is borrowed from `nltk`, so users can still set `NLTK_DATA` as they did previously. ### Testing 1. Create a directory called `~/tmp/nltk_test`. Set `NLTK_DATA=${HOME}/tmp/nltk_test`. 2. From a python interactive session, run: ```python from unstructured.nlp.tokenize import download_nltk_packages download_nltk_packages() ``` 3. Run `ls ~/tmp/nltk_test/nltk_data`. You should see the downloaded data. --------- Co-authored-by: Steve Canny <stcanny@gmail.com>	2024-07-08 22:55:36 +00:00
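The hash check mentioned in the commit above can be sketched with the standard library alone. This is an illustration of the general technique, not the code from the PR; `sha256_of_file` and `verify_download` are hypothetical names, and the expected digest would come from wherever the project pins it.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the file's digest differs from the pinned one."""
    actual = sha256_of_file(path)
    if actual != expected_sha256:
        raise ValueError(
            f"SHA256 mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
```

Verifying before extraction means a tampered or truncated archive is rejected before any of its contents are used.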
Steve Canny	d48fa3b163	rfctr(auto): improve typing and organize auto tests (#3355 ) Summary In preparation for further work on auto-partitioning (`partition()`), improve typing and organize `test_auto.py` by introducing categories.	2024-07-08 21:25:17 +00:00
Pluto	609a08a95f	remove unused _with_spans metric (#3342 ) The table metrics considering spans is not used and it messes with the output thus I have cleaned the code from it. Though, I have left table_as_cells in the source code - it still may be useful for the users	2024-07-08 16:59:53 +00:00
Pluto	caea73c8e3	Tables detection f1 (#3341 ) This pull request adds table detection metrics. One case I considered: Case: Two tables are predicted and matched with one table in ground truth. Question: Is this matching correct in both cases or just for one table? There are two subcases: - the table was predicted by OD as two sub-tables (split in half, so there are two non-overlapping sub-tables) -> in my opinion both are correct - it is a false positive from the table-matching script in get_table_level_alignment -> 1 good, 1 wrong. As we don't have bounding boxes, I followed the notebook calculation script and assumed the pessimistic, second-subcase version	2024-07-08 13:29:52 +00:00
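To make the pessimistic choice in the commit above concrete, here is a sketch using the standard F1 definition. It is an assumption that the evaluation script uses this exact formula; `detection_f1` is a hypothetical helper, not a function from the repository.

```python
def detection_f1(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Standard F1 score from detection counts; 0.0 when nothing matched."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Pessimistic reading: two predictions match one ground-truth table,
# so one counts as correct and the other as a false positive.
pessimistic = detection_f1(true_positives=1, false_positives=1, false_negatives=0)
print(round(pessimistic, 3))
```

Under this reading precision is 0.5 and recall is 1.0, so the file scores below a perfect match even though the ground-truth table was found.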
Christine Straub	493bfccddd	fix: exception handling for OCRAgent.get_agent() (#3335 ) The purpose of this PR is to help investigate https://github.com/Unstructured-IO/unstructured/issues/3202.	2024-07-03 17:58:04 +00:00
John	0046f58a4f	revert unstructured-client pin and make pip-compile (#3298 ) Change unstructured-client pin to setting minimum version instead of max version and `make pip-compile`. Integration tests that were dependent on the old version of the client are removed. These tests should be replicated in/moved to the SDK repo(s).	2024-07-02 16:42:03 +00:00
Pluto	5d89b41b1a	Fix not counting false negatives and false positives in table metrics (#3300 ) This pull request fixes the table metrics calculation for three cases: - False Negatives: when a table exists in ground truth but none of the predicted tables matches it, the table should count as 0 and the file should not be completely skipped (before it was np.NaN). - False Positives: when a predicted table doesn't match any ground truth table it should be counted as 0; right now it is skipped in processing (matched_indices==-1). - The file should be completely skipped only if there are no tables in ground truth and in prediction. In short, the previous metric calculation didn't consider OD mistakes	2024-07-02 10:07:24 +00:00
Steve Canny	087adb218f	feat(docx): differentiate no-file from not-ZIP (#3306 ) Summary The `python-docx` error `docx.opc.exceptions.PackageNotFoundError` arises both when no file exists at the given path and when the file exists but is not a ZIP archive (and so is not a DOCX file). This ambiguity is unwelcome when diagnosing the error as the two possible conditions generally indicate a different course of action to resolve the error. Add detailed validation to `DocxPartitionerOptions` to distinguish these two and provide more precise exception messages. Additional Context - `python-pptx` shares the same OPC-Package (file) loading code used by `python-docx`, so the same ambiguity will be present in `python-pptx`. - It would be preferable for this distinguished exception behavior to be upstream in `python-docx` and `python-pptx`. If we're willing to take the version bump it might be worth considering doing that instead.	2024-06-27 00:18:56 +00:00
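The distinction drawn in the commit above can be sketched with two standard-library checks. This is not the validation code in `DocxPartitionerOptions`; `validate_docx_path` is a hypothetical function and the exception types are chosen for illustration only.

```python
import os
import zipfile


def validate_docx_path(path: str) -> None:
    """Raise a distinct error for a missing file versus a non-ZIP file."""
    if not os.path.isfile(path):
        # Condition 1: nothing exists at the given path.
        raise FileNotFoundError(f"no such file: {path!r}")
    if not zipfile.is_zipfile(path):
        # Condition 2: the file exists but cannot be a DOCX package.
        raise ValueError(f"not a ZIP archive (so not a DOCX file): {path!r}")
```

A caller can then tell the user either to fix the path or to check the file's format, which the single `PackageNotFoundError` did not allow.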
Pawel Kmiecik	575957b2d2	feat: enhance analysis options with od model dump and better vis (#3234 ) This PR adds new capabilities for drawing bboxes for each layout (extracted, inferred, ocr and final) + OD model output dump as a json file for better analysis. --------- Co-authored-by: Christine Straub <christinemstraub@gmail.com> Co-authored-by: Michal Martyniak <michal.martyniak@deepsense.ai>	2024-06-26 13:14:55 +00:00
Steve Canny	f2fee0c32f	fix(auto): partition() passes strategy to DOC,ODT (#3278 ) Summary Remedy gap where `strategy` argument passed to `partition()` was not forwarded to `partition_doc()` or `partition_odt()` and so was not making its way to `partition_docx()`.	2024-06-26 00:29:47 +00:00
Yao You	c32aeaac44	fix: wait to run soffice until there is no other soffice process running (#3287 ) ## Summary This PR addresses an issue where the code could attempt to run `soffice` in multiple processes and closes #3284 The fix is to add a wait mechanism when there is another `soffice` process already running. ## Diagnosis of issue - `soffice` can only have one process running when using the command `soffice` as is. - on main branch the function `partition.common.convert_office_doc` simply spawns a subprocess to run the `soffice` command to convert a `doc` or `ppt` file into `docx` or `pptx` format. - if there are multiple partition calls to process `doc` or `ppt` files and they all want to spawn `soffice` subprocesses, only one will succeed while the other processes will simply fail and return 1 from the subprocess - downstream this will lead to errors like `PackageNotFoundError: Package not found at '/tmp/tmpac6lcu4w/document.docx'` ## solution While there are [ways](https://www.reddit.com/r/libreoffice/comments/agk3os/how_to_open_more_than_one_calc_instance_under/) to circumvent the limit of `soffice` by setting a tmp file as user installation env, these kinds of solutions rely on the internals of `soffice` and add maintenance cost to track its changes. This PR solves this problem by adding a wait mechanism: - we first spawn a subprocess to run `soffice` - if the `stdout` is empty and we still have wait time budget left, the function first checks if there is another `soffice` running * If yes then the function waits for 0.01s before checking again; * if no the function spawns a subprocess to run `soffice` and returns to the beginning of this step * we need to return to the beginning to check if `stdout` is empty because we could have another collision right after `soffice` becomes available. ## test This PR adds two unit tests. Additionally this can be tested by running partition of `.doc` files locally with multiprocessing.	2024-06-25 18:49:27 +00:00
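The wait mechanism outlined in the commit above can be sketched as a small retry loop. This is an illustration under stated assumptions, not the code in `convert_office_doc`: `run_with_wait` and its parameters are hypothetical, and the two callables stand in for "spawn `soffice` and return its stdout" and "is another `soffice` process running".

```python
import time
from typing import Callable


def run_with_wait(
    run_command: Callable[[], str],
    other_instance_running: Callable[[], bool],
    wait_budget: float = 10.0,
    poll_interval: float = 0.01,
) -> str:
    """Retry a command that yields empty output while another instance is busy."""
    deadline = time.monotonic() + wait_budget
    output = run_command()
    # Empty output is treated as "collided with another instance".
    while output == "" and time.monotonic() < deadline:
        if other_instance_running():
            # Someone else holds the single slot: wait briefly, then re-check.
            time.sleep(poll_interval)
        else:
            # Slot looks free: try again, then loop back to re-test the output
            # in case another collision happened right after it became free.
            output = run_command()
    return output
```

Passing the two operations in as callables keeps the loop testable without LibreOffice installed; a real caller would wrap `subprocess.run` and a process lookup.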
Yao You	edddf9f6ee	Feat/pass down strategy to partition ppt as well (#3274 ) Following the same pattern of https://github.com/Unstructured-IO/unstructured/pull/3273 and pass down `strategy` parameter to `partition_ppt` as well.	2024-06-22 02:23:58 +00:00
Steve Canny	16df6944dd	fix(auto): partition() passes strategy to PPTX,DOCX (#3273 ) Summary Remedy gap where `strategy` argument passed to `partition()` was not forwarded to `partition_pptx()` or `partition_docx()`.	2024-06-22 00:16:39 +00:00
Steve Canny	6fe1c9980e	rfctr(html): prepare for new html parser (#3257 ) Summary Extract as much mechanical refactoring from the HTML parser change-over into the PR as possible. This leaves the next PR focused on installing the new parser and the ingest-test impact. Reviewers: Commits are well groomed and reviewing commit-by-commit is probably easier. Additional Context This PR introduces the rewritten HTML parser. Its general design is recursive, consistent with the recursive structure of HTML (tree of elements). It also adds the unit tests for that parser but it does not _install_ the parser. So the behavior of `partition_html()` is unchanged by this PR. The next PR in this series will do that and handle the ingest and other unit test changes required to reflect the dozen or so bug-fixes the new parser provides.	2024-06-21 20:59:48 +00:00
Austin Walker	0b73978b92	fix: fix `IndexError` when partitioning a pdf with `starting_page_number` (#3246 ) The Issue: When extracting images from pdfs, we use the metadata page number to index into a list of the images. However, the metadata page number can now be changed via `starting_page_number`. To get the true page index, we need to subtract this value. Testing: Run this snippet in a python shell. Before the fix, this throws an IndexError. On this branch, it will return the elements. ``` from unstructured.partition.auto import partition filename = "example-docs/layout-parser-paper-with-table.pdf" partition(filename, strategy="hi_res", extract_image_block_types=["Image", "Table"], starting_page_number=20) ``` --------- Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-06-19 18:20:54 +00:00
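The index arithmetic behind the fix above can be shown in isolation. This is a sketch, not the repository's code: `image_list_index` is a hypothetical helper, and the default of 1 for `starting_page_number` is an assumption.

```python
def image_list_index(metadata_page_number: int, starting_page_number: int = 1) -> int:
    """Convert a reported page number into a zero-based list index."""
    return metadata_page_number - starting_page_number


# A three-page document whose pages are reported as 20, 21 and 22.
page_images = ["page-0.png", "page-1.png", "page-2.png"]

# Using the reported number directly (page_images[21]) would raise IndexError;
# subtracting the starting page number gives a valid index.
index = image_list_index(metadata_page_number=21, starting_page_number=20)
print(page_images[index])
```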
Pawel Kmiecik	c3af03d5ac	feat: expose converters deckerd -> html and back (#3233 ) This PR exposes functions in the evaluation module for easy conversion between tables in Deckerd and HTML formats, which are useful in evaluation experiments.	2024-06-19 07:03:38 +00:00
Steve Canny	77a9e1b54d	rfctr(html): drop convert_and_partition_html() (#3215 ) Summary Remove `unstructured.partition.html.convert_and_partition_html()`. Move file-type conversion (to HTML) responsibility to each brokering partitioner that uses that strategy and let them call `partition_html()` for themselves with the result. Additional Context Rationale: - `partition_html()` does not want or need to know which partitioners might broker partitioning to it. - Different brokering partitioners have their own methods to convert their format to HTML and quirks that may be involved for their format. Avoid coupling them so they can evolve independently. - The core of the conversion work is already encapsulated in `unstructured.partition.common.convert_file_to_html_text_using_pandoc()`. - `convert_and_partition_html()` represents an additional brokering layer with the entailed complexities of an additional site for default parameter values to be (mis-)applied and/or dropped and is an additional location for new parameters to be added.	2024-06-17 19:43:18 +00:00
Steve Canny	9fae0111d9	rfctr(html): drop HTML-specific elements (#3207 ) Summary Remove HTML-specific element types and return "regular" elements like `Title` and `NarrativeText` from `partition_html()`. Additional Context - An aspect of the legacy HTML partitioner was the use of HTML-specific element types used to track metadata during partitioning. - That role is no longer necessary or desirable. - HTML-specific elements like `HTMLTitle` and `HTMLNarrativeText` were returned from partitioning HTML but also the seven other file-formats that broker partitioning to HTML (convert-to-HTML and partition_html()). This does not cause immediate breakage because these are still `Text` element subtypes, but it produces a confusing developer experience. - Remove the prior metadata roles from HTML-specific elements and remove those element types entirely.	2024-06-15 00:14:22 +00:00
Matt Robinson	08383a27de	build: pull from wolfi base image (#3213 ) ### Summary Updates the `wolfi` image to pull from the upstream `wolfi-base` base image to avoid maintaining the base layers in both locations. Closes #3105 by pulling in the fix from upstream. ### Testing `test_dockerfile` should continue to pass with the changes.	2024-06-14 20:41:27 +00:00
Christine Straub	9552fbbfbf	chore: bump unstructured-inference 0.7.35 (#3205 ) ### Summary - bump unstructured-inference to `0.7.35` which fixed syntax for generated HTML tables - update unit tests and ingest test fixtures to reflect changes in the generated HTML tables - cut a release for `0.14.6` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-06-14 18:11:38 +00:00
Pawel Kmiecik	29e64eb281	feat: table evaluations for fixed html table generation (#3196 ) Update to the evaluation script to handle correct HTML syntax for tables. See https://github.com/Unstructured-IO/unstructured-inference/pull/355 for details. This change: - modifies transforming HTML tables to evaluation internal `cells` format - fixes the indexing of the output (internal format cells) when HTML cells use spans	2024-06-14 09:03:27 +00:00
Steve Canny	f5ebb209a4	rfctr(html): drop page concept (#3184 ) Summary Pagination of HTML documents is currently unused. The `Page` class and concept were deeply embedded in the legacy organization of HTML partitioning code due to the legacy `Document` (= pages of elements) domain model. Remove this concept from the code such that elements are available directly from the partitioner. Additional Context - Pagination can be re-added later if we decide we want it again. A re-implementation would be much simpler and much lower impact to the structure of the code and introduce much less additional complexity, similar to the approach we take in `partition_docx()`.	2024-06-13 18:19:42 +00:00
Filip Knefel	c2065db716	fix API-297: List parameters incorrectly passed to API requests (#3154 ) In two places parameters passed to the python client when using either Ingest workflow and `partition_via_api` function directly we parse the parameters with list values to strings e.g. ```python extract_image_block_types=["image"] -> extract_image_block_types='["image"]' ``` as of now these parameters are parsed incorrectly when given as strings and correctly when given as lists. This PR removes parsing from `PartitionConfig` and `partition_via_api`. --------- Co-authored-by: Filip Knefel <filip@unstructured.io>	2024-06-11 21:00:41 +00:00
Steve Canny	2f0400f279	rfctr(html): break coupling to DocumentLayout (#3180 ) Summary Remove use of `partition.common.document_to_element_list()` by `HTMLDocument`. The transitive coupling with layout-inference through this shared function has been a source of frustration and a drain on engineering time and there's no compelling reason for the two to share this code. Additional Context `partition_html()` uses `partition.common.document_to_element_list()` to get finalized elements from `HTMLDocument` (pages). This gives rise to a very nasty coupling between `DocumentLayout`, used by `unstructured_inference`, and `HTMLDocument`. `document_to_element_list()` has evolved to work for both callers, but they share very few common characteristics with each other. This coupling is bad news for us and also, importantly, for the inference and page layout folks working on PDF and images. Break that coupling so those inference-related functions can evolve whatever way they need to without being dragged down by legacy `HTMLDocument` connections. The initial step is to extract a `document_to_element_list()` function of our own, getting rid of the coordinates and other `DocumentLayout`-related bits we don't need. As you'll see in the next few PRs, all of this `document_to_element_list()` code will end up either going away or being relocated closer to where it's used in `HTMLDocument`.	2024-06-11 20:54:11 +00:00
Steve Canny	e39ee16161	rfctr(html): promote HTMLDoc candidate methods (#3177 ) Summary Make `._find_articles()` and `._find_main` into `._articles` and `._main` properties on HTMLDocument, respectively. Additional Context After prior refactorings, these two functions now each require only `self` and can become `@lazyproperty`s on `HTMLDocument`. This ensures they are computed at most once. In addition, their close relationship to `HTMLDocument` is indicated by their membership as methods rather than "loose" functions.	2024-06-10 22:07:21 +00:00
Steve Canny	a66661a7bf	rfctr(html): drop now dead XMLDocument and Document (#3165 ) Summary `HTMLDocument` is the class handling the core of HTML parsing. This is critical code because 8 of the 20 file-type partitioners end up using this code (`partition_html()` + 7 brokering partitioners like EPUB, MD, and RST). For historical reasons, `HTMLDocument` subclassed `XMLDocument` which in turn subclassed `Document`, both of which are no longer relevant and unnecessarily complicate reasoning about `HTMLDocument` behavior. Remove that inheritance and dependency and drop both `XMLDocument` and `Document` modules which become dead code after no longer being used by `HTMLDocument`.	2024-06-08 07:36:18 +00:00
Steve Canny	a883fc9df2	rfctr(html): improve SNR in HTMLDocument (#3162 ) Summary Remove dead code and organize helpers of HTMLDocument in preparation for improvements and bug-fixes to follow	2024-06-06 21:21:33 +00:00
Steve Canny	8378ddaa3b	rfctr(html): organize and improve HTMLDocument tests (#3161 ) Summary In preparation for further work on HTMLDocument, organize the organic growth in `documents/tests_html.py` and improve typing and expression. Reviewers: Commits are groomed and review is probably eased by going commit-by-commit	2024-06-06 18:16:02 +00:00
Steve Canny	f1cab248ce	rfctr(msg): remove temporary new_msg.py (#3157 ) Summary Remove temporary `new_msg.py` module. Additional Context The rewrite of `partition_msg()` was placed in a separate file `new_msg.py` to avoid a messy diff for code-review. This PR makes that `new_msg.py` the new `msg.py`. No code changes were made in the process.	2024-06-06 08:31:56 +00:00
Steve Canny	ddbe90f6bb	rfctr(html): clean html tests in prep for PRs to follow (#3156 ) Summary Clean `tests_unstructured/partition/test_html.py` in preparation for broader refactor of HTML partitioner to follow. That refactor will address a cluster of bugs. Temporarily remove blank lines in tests so reordering tests in following commit is easier to follow. Those will go back in after that.	2024-06-05 23:11:58 +00:00
Steve Canny	e4158deaff	fix(msg): use python-oxmsg for MSG email parsing (#3142 ) Summary `partition_msg()` previously used the `msg_parser` library for parsing Outlook MSG email files (.msg files). The `msg_parser` library is unmaintained and has several major shortcomings such as not being able to parse MSG files with 8-bit encoded strings and not reliably extracting attachments. Use the new and permissively licenced `python-oxmsg` library instead. Additional Context For reviewability purposes, this PR temporarily places the new `partition_msg()` implementation in `new_msg.py` and references that implementation from `msg.py`. `new_msg.py` will be renamed to `msg.py` in a closely following PR. This avoids a very messy interleaving of hunks in a diff between the old and re-written `partition_msg()` implementation. Fixes #2481 Fixes #3006	2024-06-05 21:12:27 +00:00
Matt Robinson	0e16bf4bf0	enhancement: apply tar filters when using python 3.12 or above (#3124 ) ### Summary Applies tar filters when using Python 3.12 or above. This was added to the [Python `tarfile` library in 3.12](https://docs.python.org/3/library/tarfile.html#extraction-filters) and guards against malicious content being extracted from `.tar.gz` files. ### Testing Added smoke test. If this passes for all Python versions, we're good.	2024-06-05 18:28:59 +00:00
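A minimal sketch of the version-gated extraction described in the commit above, using the documented `filter` argument of `TarFile.extractall`. The function name `safe_extract` is hypothetical, and choosing the `"data"` filter is an assumption; the commit message does not say which filter the PR applies.

```python
import sys
import tarfile


def safe_extract(archive_path: str, destination: str) -> None:
    """Extract a tar archive, applying an extraction filter on Python 3.12+."""
    with tarfile.open(archive_path) as tar:
        if sys.version_info >= (3, 12):
            # The "data" filter rejects absolute paths, path traversal and
            # special files such as devices.
            tar.extractall(path=destination, filter="data")
        else:
            # Assumed fallback for interpreters predating the commit's cutoff.
            tar.extractall(path=destination)
```

On interpreters below 3.12 this sketch extracts without a filter, so archives from untrusted sources still need separate care there.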
Steve Canny	f2e67539b1	rfctr: clean MSG partitioner and tests as prep (#3107 ) Summary Fix type errors and generally prepare `partition_msg()` and its tests for refactoring to use `python-oxmsg` library instead of the problematic `msg_parser` library for partitioning Outlook MSG files.	2024-05-29 21:36:05 +00:00
Christine Straub	f4457249a7	fix: `partition_pdf()` removes spaces from the text (#3106 ) Closes #2896. This PR aims to fix `partition_pdf()` to keep spaces in text. The control character `\t` is now replaced with a space instead of being removed when merging inferred and embedded elements. ### Testing PDF: [rok_20230930_1-1.pdf](https://github.com/Unstructured-IO/unstructured/files/15001636/rok_20230930_1-1.pdf) ``` elements = partition_pdf( filename="rok_20230930_1-1.pdf", strategy="hi_res", ) print(str(elements[20])) ``` Results: - PR ``` Name of each exchange on which registered New York Stock Exchange ``` - main branch ``` Nameofeachexchangeonwhichregistered NewYorkStockExchange ```	2024-05-29 04:53:17 +00:00
Matt Robinson	6b400b46fe	feat: add VoyageAI embeddings (#3069 ) (#3099 ) Original PR was #3069. Merged in to a feature branch to fix dependency and linting issues. Application code changes from the original PR were already reviewed and approved. ------------ Original PR description: Adding VoyageAI embeddings Voyage AI’s embedding models and rerankers are state-of-the-art in retrieval accuracy. --------- Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com> Co-authored-by: Liuhong99 <39693953+Liuhong99@users.noreply.github.com>	2024-05-24 21:48:35 +00:00
Christine Straub	35ec21ecd0	fix: decide table extraction (#3090 ) This PR aims to add backward compatibility for the deprecated `pdf_infer_table_structure` parameter. A missing part of turning table extraction for PDFs and Images off by default in https://github.com/Unstructured-IO/unstructured/pull/3035, which was turned on in https://github.com/Unstructured-IO/unstructured/pull/2588. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-05-23 20:37:15 +00:00
Steve Canny	47d28612f7	feat(docx): add pluggable picture sub-partitioner (#3081 ) Summary Allow registration of a custom sub-partitioner that extracts images from a DOCX paragraph. Additional Context - A custom image sub-partitioner must implement the `PicturePartitionerT` interface defined in this PR. Basically have an `.iter_elements()` classmethod that takes the paragraph and generates zero or more `Image` elements from it. - The custom image sub-partitioner must be registered by passing the class to `register_picture_partitioner()`. - The default image sub-partitioner is `_NullPicturePartitioner` that does nothing. - The registered picture partitioner is called once for each paragraph.	2024-05-23 18:46:30 +00:00
Hubert Rutkowski	b8d894f963	feat/Move the category field to Element (#3056 ) It's pretty basic change, just literally moved the category field to Element class. Can't think of other changes that are needed here, because I think pretty much everything expected the category to be directly in elements list. For local testing, IDE's and linters should see difference in that `category` is now in Element.	2024-05-23 10:43:26 +00:00
Steve Canny	b4ee019170	rfctr: flatten test_unstructured/partition (#3073 ) Summary Some partitioner test modules are placed in directories by themselves or with one other test module. This unnecessarily obscures where to find the test module corresponding to a partitioner. Move partitioner test modules to mirror the directory structure of `unstructured/partition`.	2024-05-23 00:51:08 +00:00
Steve Canny	30e5a0cd4e	rfctr(docx): organize docx tests (#3070 ) Summary In preparation for adding DOCX pluggable image extraction, organize a few of the DOCX tests to be parallel to very similar tests for the DOC and ODT partitioners.	2024-05-21 22:11:46 +00:00
Christine Straub	b0d8a779da	feat: `partition_pdf()` set inferred elements text (#3061 ) This PR adds the ability to fill inferred elements' text from embedded text (`pdfminer`) without depending on the `unstructured-inference` library. This PR is the second part of moving embedded-text-related code from `unstructured-inference` to `unstructured` and works together with https://github.com/Unstructured-IO/unstructured-inference/pull/349.	2024-05-21 19:43:38 +00:00
Matt Robinson	acda4d0707	fix: set `skip_infer_tables` explicitly in `test_partition_via_api_with_no_strategy` (#3057 ) ### Summary A `partition_via_api` test that only runs on `main` was [failing](https://github.com/Unstructured-IO/unstructured/actions/runs/9159429513/job/25181600959) with the following output, likely due to the change in the default behavior for `skip_infer_table_types`. This PR explicitly sets the `skip_infer_table_types` param to avoid the failure.. ```python =========================== short test summary info ============================ FAILED test_unstructured/partition/test_api.py::test_partition_via_api_with_no_strategy - AssertionError: assert 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' != 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' + where 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' = <unstructured.documents.elements.Text object at 0x7fb9069fc610>.text + and 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' = <unstructured.documents.elements.Text object at 0x7fb90648ad90>.text = 1 failed, 2299 passed, 9 skipped, 2 deselected, 2 xfailed, 9 xpassed, 14 warnings in 1241.64s (0:20:41) = make: *** [Makefile:302: test] Error 1 ``` ### Testing After temporarily removing the "skip if not on `main`" `pytest` mark, the [unit tests pass](https://github.com/Unstructured-IO/unstructured/actions/runs/9163268381/job/25192040902?pr=3057O) on the feature branch.	2024-05-20 19:05:13 -04:00
Matt Robinson	d7608014c0	improve: add Python 3.12 support (#3033 ) (#3047 ) ### Summary Closes #2959. Updates the dependency and CI to add support for Python 3.12. The MongoDB ingest tests were disabled due to jobs like [this one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333) failing due to issues with the `bson` package. `bson` is a dependency for the AstraDB connector, but `pymongo` does not work when `bson` is installed from `pip`. This issue is documented by MongoDB [here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun off #3049 to resolve this. Issue seems unrelated to Python 3.12, though unsure why this didn't surface previously. Disables the `argilla` tests because `argilla` does not yet support Python 3.12. We can add the `argilla` tests back in once the PR references below is merged. You can still use the `stage_for_argilla` function if you're on `python<3.12` and you install `argilla` yourself. - https://github.com/argilla-io/argilla/pull/4837 --------- Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>	2024-05-19 23:03:15 +00:00
Christine Straub	76831f154b	refactor: `partition_pdf()` pass `kwargs` through `fast` strategy pipeline (#3040 ) This PR aims to pass `kwargs` through `fast` strategy pipeline, which was missing as part of the previous PR - https://github.com/Unstructured-IO/unstructured/pull/3030. I also did some code refactoring in this PR, so I recommend reviewing this PR commit by commit. ### Summary - pass `kwargs` through `fast` strategy pipeline, which will allow users to specify additional params like `sort_mode` - refactor: code reorganization - cut a release for `0.14.0` ### Testing CI should pass	2024-05-17 20:55:11 +00:00
amadeusz-ds	1c8b2b23eb	feat: add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR config parameters (#3014 ) This PR introduces GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR controlling where temporary files are stored during partition flow, via tempfile.tempdir. #### Edit: Renamed prefixes from STORAGE_ to UNSTRUCTURED_CACHE_ #### Edit 2: Renamed prefixes from UNSTRUCTURED_CACHE to GLOBAL_WORKING_DIR_	2024-05-17 19:16:10 +00:00
Matt Robinson	ec987dcbb2	BREAKING CHANGE: revert table extraction off by default for PDFs and images (#3035 ) ### Summary Closes #3021 . Turns table extraction for PDFs and images off by default. The default behavior originally changed in #2588 . The reason for reversion is that some users did not realize turning off table extraction was an option and experience long processing times for PDFs and images with the new default behavior. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2024-05-17 15:28:11 +00:00

596 Commits