unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-30 20:39:54 +00:00

Author	SHA1	Message	Date
Steve Canny	9fae0111d9	rfctr(html): drop HTML-specific elements (#3207 ) Summary Remove HTML-specific element types and return "regular" elements like `Title` and `NarrativeText` from `partition_html()`. Additional Context - An aspect of the legacy HTML partitioner was the use of HTML-specific element types used to track metadata during partitioning. - That role is no longer necessary or desirable. - HTML-specific elements like `HTMLTitle` and `HTMLNarrativeText` were returned from partitioning HTML but also the seven other file-formats that broker partitioning to HTML (convert-to-HTML and partition_html()). This does not cause immediate breakage because these are still `Text` element subtypes, but it produces a confusing developer experience. - Remove the prior metadata roles from HTML-specific elements and remove those element types entirely.	2024-06-15 00:14:22 +00:00
Matt Robinson	08383a27de	build: pull from wolfi base image (#3213 ) ### Summary Updates the `wolfi` image to pull from the upstream `wolfi-base` base image to avoid maintaining the base layers in both locations. Closes #3105 by pulling in the fix from upstream. ### Testing `test_dockerfile` should continue to pass with the changes.	2024-06-14 20:41:27 +00:00
Christine Straub	9552fbbfbf	chore: bump unstructured-inference 0.7.35 (#3205 ) ### Summary - bump unstructured-inference to `0.7.35` which fixed syntax for generated HTML tables - update unit tests and ingest test fixtures to reflect changes in the generated HTML tables - cut a release for `0.14.6` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-06-14 18:11:38 +00:00
Pawel Kmiecik	29e64eb281	feat: table evaluations for fixed html table generation (#3196 ) Update to the evaluation script to handle correct HTML syntax for tables. See https://github.com/Unstructured-IO/unstructured-inference/pull/355 for details. This change: - modifies transforming HTML tables to evaluation internal `cells` format - fixes the indexing of the output (internal format cells) when HTML cells use spans	2024-06-14 09:03:27 +00:00
Steve Canny	f5ebb209a4	rfctr(html): drop page concept (#3184 ) Summary Pagination of HTML documents is currently unused. The `Page` class and concept were deeply embedded in the legacy organization of HTML partitioning code due to the legacy `Document` (= pages of elements) domain model. Remove this concept from the code such that elements are available directly from the partitioner. Additional Context - Pagination can be re-added later if we decide we want it again. A re-implementation would be much simpler and much lower impact to the structure of the code and introduce much less additional complexity, similar to the approach we take in `partition_docx()`.	2024-06-13 18:19:42 +00:00
Filip Knefel	c2065db716	fix API-297: List parameters incorrectly passed to API requests (#3154 ) In two places where parameters are passed to the Python client (the Ingest workflow and the `partition_via_api` function used directly), we parse parameters with list values to strings, e.g. ```python extract_image_block_types=["image"] -> extract_image_block_types='["image"]' ``` As of now, these parameters are parsed incorrectly when given as strings and correctly when given as lists. This PR removes parsing from `PartitionConfig` and `partition_via_api`. --------- Co-authored-by: Filip Knefel <filip@unstructured.io>	2024-06-11 21:00:41 +00:00
Steve Canny	2f0400f279	rfctr(html): break coupling to DocumentLayout (#3180 ) Summary Remove use of `partition.common.document_to_element_list()` by `HTMLDocument`. The transitive coupling with layout-inference through this shared function has been the source of frustration and a drain on engineering time and there's no compelling reason for the two to share this code. Additional Context `partition_html()` uses `partition.common.document_to_element_list()` to get finalized elements from `HTMLDocument` (pages). This gives rise to a very nasty coupling between `DocumentLayout`, used by `unstructured_inference`, and `HTMLDocument`. `document_to_element_list()` has evolved to work for both callers, but they share very few common characteristics with each other. This coupling is bad news for us and also, importantly, for the inference and page layout folks working on PDF and images. Break that coupling so those inference-related functions can evolve whatever way they need to without being dragged down by legacy `HTMLDocument` connections. The initial step is to extract a `document_to_element_list()` function of our own, getting rid of the coordinates and other `DocumentLayout`-related bits we don't need. As you'll see in the next few PRs, all of this `document_to_element_list()` code will end up either going away or being relocated closer to where it's used in `HTMLDocument`.	2024-06-11 20:54:11 +00:00
Steve Canny	e39ee16161	rfctr(html): promote HTMLDoc candidate methods (#3177 ) Summary Make `._find_articles()` and `._find_main` into `._articles` and `._main` properties on HTMLDocument, respectively. Additional Context After prior refactorings, these two functions now each require only `self` and can become `@lazyproperty`s on `HTMLDocument`. This ensures they are computed at most once. In addition, their close relationship to `HTMLDocument` is indicated by their membership as methods rather than "loose" functions.	2024-06-10 22:07:21 +00:00
Steve Canny	a66661a7bf	rfctr(html): drop now dead XMLDocument and Document (#3165 ) Summary `HTMLDocument` is the class handling the core of HTML parsing. This is critical code because 8 of the 20 file-type partitioners end up using this code (`partition_html()` + 7 brokering partitioners like EPUB, MD, and RST). For historical reasons, `HTMLDocument` subclassed `XMLDocument` which in turn subclassed `Document`, both of which are no longer relevant and unnecessarily complicate reasoning about `HTMLDocument` behavior. Remove that inheritance and dependency and drop both `XMLDocument` and `Document` modules which become dead code after no longer being used by `HTMLDocument`.	2024-06-08 07:36:18 +00:00
Steve Canny	a883fc9df2	rfctr(html): improve SNR in HTMLDocument (#3162 ) Summary Remove dead code and organize helpers of HTMLDocument in preparation for improvements and bug-fixes to follow	2024-06-06 21:21:33 +00:00
Steve Canny	8378ddaa3b	rfctr(html): organize and improve HTMLDocument tests (#3161 ) Summary In preparation for further work on HTMLDocument, organize the organic growth in `documents/tests_html.py` and improve typing and expression. Reviewers: Commits are groomed and review is probably eased by going commit-by-commit.	2024-06-06 18:16:02 +00:00
Steve Canny	f1cab248ce	rfctr(msg): remove temporary new_msg.py (#3157 ) Summary Remove temporary `new_msg.py` module. Additional Context The rewrite of `partition_msg()` was placed in a separate file `new_msg.py` to avoid a messy diff for code-review. This PR makes that `new_msg.py` the new `msg.py`. No code changes were made in the process.	2024-06-06 08:31:56 +00:00
Steve Canny	ddbe90f6bb	rfctr(html): clean html tests in prep for PRs to follow (#3156 ) Summary Clean `tests_unstructured/partition/test_html.py` in preparation for broader refactor of HTML partitioner to follow. That refactor will address a cluster of bugs. Temporarily remove blank lines in tests so reordering tests in following commit is easier to follow. Those will go back in after that.	2024-06-05 23:11:58 +00:00
Steve Canny	e4158deaff	fix(msg): use python-oxmsg for MSG email parsing (#3142 ) Summary `partition_msg()` previously used the `msg_parser` library for parsing Outlook MSG email files (.msg files). The `msg_parser` library is unmaintained and has several major shortcomings such as not being able to parse MSG files with 8-bit encoded strings and not reliably extracting attachments. Use the new and permissively licenced `python-oxmsg` library instead. Additional Context For reviewability purposes, this PR temporarily places the new `partition_msg()` implementation in `new_msg.py` and references that implementation from `msg.py`. `new_msg.py` will be renamed to `msg.py` in a closely following PR. This avoids a very messy interleaving of hunks in a diff between the old and re-written `partition_msg()` implementation. Fixes #2481 Fixes #3006	2024-06-05 21:12:27 +00:00
Matt Robinson	0e16bf4bf0	enhancement: apply tar filters when using python 3.12 or above (#3124 ) ### Summary Applies tar filters when using Python 3.12 or above. This was added to the [Python `tarfile` library in 3.12](https://docs.python.org/3/library/tarfile.html#extraction-filters) and guards against malicious content being extracted from `.tar.gz` files. ### Testing Added smoke test. If this passes for all Python versions, we're good.	2024-06-05 18:28:59 +00:00
Steve Canny	f2e67539b1	rfctr: clean MSG partitioner and tests as prep (#3107 ) Summary Fix type errors and generally prepare `partition_msg()` and its tests for refactoring to use `python-oxmsg` library instead of the problematic `msg_parser` library for partitioning Outlook MSG files.	2024-05-29 21:36:05 +00:00
Christine Straub	f4457249a7	fix: `partition_pdf()` removes spaces from the text (#3106 ) Closes #2896. This PR aims to fix `partition_pdf()` to keep spaces in text. The control character `\t` is now replaced with a space instead of being removed when merging inferred and embedded elements. ### Testing PDF: [rok_20230930_1-1.pdf](https://github.com/Unstructured-IO/unstructured/files/15001636/rok_20230930_1-1.pdf) ``` elements = partition_pdf( filename="rok_20230930_1-1.pdf", strategy="hi_res", ) print(str(elements[20])) ``` Results: - PR ``` Name of each exchange on which registered New York Stock Exchange ``` - main branch ``` Nameofeachexchangeonwhichregistered NewYorkStockExchange ```	2024-05-29 04:53:17 +00:00
Matt Robinson	6b400b46fe	feat: add VoyageAI embeddings (#3069 ) (#3099 ) Original PR was #3069. Merged in to a feature branch to fix dependency and linting issues. Application code changes from the original PR were already reviewed and approved. ------------ Original PR description: Adding VoyageAI embeddings Voyage AI’s embedding models and rerankers are state-of-the-art in retrieval accuracy. --------- Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com> Co-authored-by: Liuhong99 <39693953+Liuhong99@users.noreply.github.com>	2024-05-24 21:48:35 +00:00
Christine Straub	35ec21ecd0	fix: decide table extraction (#3090 ) This PR aims to add backward compatibility for the deprecated `pdf_infer_table_structure` parameter. A missing part of turning table extraction for PDFs and Images off by default in https://github.com/Unstructured-IO/unstructured/pull/3035, which was turned on in https://github.com/Unstructured-IO/unstructured/pull/2588. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-05-23 20:37:15 +00:00
Steve Canny	47d28612f7	feat(docx): add pluggable picture sub-partitioner (#3081 ) Summary Allow registration of a custom sub-partitioner that extracts images from a DOCX paragraph. Additional Context - A custom image sub-partitioner must implement the `PicturePartitionerT` interface defined in this PR. Basically have an `.iter_elements()` classmethod that takes the paragraph and generates zero or more `Image` elements from it. - The custom image sub-partitioner must be registered by passing the class to `register_picture_partitioner()`. - The default image sub-partitioner is `_NullPicturePartitioner` that does nothing. - The registered picture partitioner is called once for each paragraph.	2024-05-23 18:46:30 +00:00
Hubert Rutkowski	b8d894f963	feat/Move the category field to Element (#3056 ) It's pretty basic change, just literally moved the category field to Element class. Can't think of other changes that are needed here, because I think pretty much everything expected the category to be directly in elements list. For local testing, IDE's and linters should see difference in that `category` is now in Element.	2024-05-23 10:43:26 +00:00
Steve Canny	b4ee019170	rfctr: flatten test_unstructured/partition (#3073 ) Summary Some partitioner test modules are placed in directories by themselves or with one other test module. This unnecessarily obscures where to find the test module corresponding to a partitioner. Move partitioner test modules to mirror the directory structure of `unstructured/partition`.	2024-05-23 00:51:08 +00:00
Steve Canny	30e5a0cd4e	rfctr(docx): organize docx tests (#3070 ) Summary In preparation for adding DOCX pluggable image extraction, organize a few of the DOCX tests to be parallel to very similar tests for the DOC and ODT partitioners.	2024-05-21 22:11:46 +00:00
Christine Straub	b0d8a779da	feat: `partition_pdf()` set inferred elements text (#3061 ) This PR adds the ability to fill inferred elements text from embedded text (`pdfminer`) without depending on `unstructured-inference` library. This PR is the second part of moving embedded text related code from `unstructured-inference` to `unstructured` and works together with https://github.com/Unstructured-IO/unstructured-inference/pull/349.	2024-05-21 19:43:38 +00:00
Matt Robinson	acda4d0707	fix: set `skip_infer_tables` explicitly in `test_partition_via_api_with_no_strategy` (#3057 ) ### Summary A `partition_via_api` test that only runs on `main` was [failing](https://github.com/Unstructured-IO/unstructured/actions/runs/9159429513/job/25181600959) with the following output, likely due to the change in the default behavior for `skip_infer_table_types`. This PR explicitly sets the `skip_infer_table_types` param to avoid the failure.. ```python =========================== short test summary info ============================ FAILED test_unstructured/partition/test_api.py::test_partition_via_api_with_no_strategy - AssertionError: assert 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' != 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' + where 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' = <unstructured.documents.elements.Text object at 0x7fb9069fc610>.text + and 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' = <unstructured.documents.elements.Text object at 0x7fb90648ad90>.text = 1 failed, 2299 passed, 9 skipped, 2 deselected, 2 xfailed, 9 xpassed, 14 warnings in 1241.64s (0:20:41) = make: *** [Makefile:302: test] Error 1 ``` ### Testing After temporarily removing the "skip if not on `main`" `pytest` mark, the [unit tests pass](https://github.com/Unstructured-IO/unstructured/actions/runs/9163268381/job/25192040902?pr=3057O) on the feature branch.	2024-05-20 19:05:13 -04:00
Matt Robinson	d7608014c0	improve: add Python 3.12 support (#3033 ) (#3047 ) ### Summary Closes #2959. Updates the dependency and CI to add support for Python 3.12. The MongoDB ingest tests were disabled due to jobs like [this one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333) failing due to issues with the `bson` package. `bson` is a dependency for the AstraDB connector, but `pymongo` does not work when `bson` is installed from `pip`. This issue is documented by MongoDB [here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun off #3049 to resolve this. Issue seems unrelated to Python 3.12, though unsure why this didn't surface previously. Disables the `argilla` tests because `argilla` does not yet support Python 3.12. We can add the `argilla` tests back in once the PR referenced below is merged. You can still use the `stage_for_argilla` function if you're on `python<3.12` and you install `argilla` yourself. - https://github.com/argilla-io/argilla/pull/4837 --------- Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>	2024-05-19 23:03:15 +00:00
Christine Straub	76831f154b	refactor: `partition_pdf()` pass `kwargs` through `fast` strategy pipeline (#3040 ) This PR aims to pass `kwargs` through `fast` strategy pipeline, which was missing as part of the previous PR - https://github.com/Unstructured-IO/unstructured/pull/3030. I also did some code refactoring in this PR, so I recommend reviewing this PR commit by commit. ### Summary - pass `kwargs` through `fast` strategy pipeline, which will allow users to specify additional params like `sort_mode` - refactor: code reorganization - cut a release for `0.14.0` ### Testing CI should pass	2024-05-17 20:55:11 +00:00
amadeusz-ds	1c8b2b23eb	feat: add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR config parameters (#3014 ) This PR introduces GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR controlling where temporary files are stored during partition flow, via tempfile.tempdir. #### Edit: Renamed prefixes from STORAGE_ to UNSTRUCTURED_CACHE_ #### Edit 2: Renamed prefixes from UNSTRUCTURED_CACHE to GLOBAL_WORKING_DIR_	2024-05-17 19:16:10 +00:00
Matt Robinson	ec987dcbb2	BREAKING CHANGE: revert table extraction off by default for PDFs and images (#3035 ) ### Summary Closes #3021 . Turns table extraction for PDFs and images off by default. The default behavior originally changed in #2588 . The reason for reversion is that some users did not realize turning off table extraction was an option and experience long processing times for PDFs and images with the new default behavior. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2024-05-17 15:28:11 +00:00
Steve Canny	f320889b4f	feat(docx): add strategy parameter to DOC and ODT (#3042 ) Summary Because DOCX now supports the `strategy` argument to control aspects of image extraction, `partition_doc()` and `partition_odt()` will need to support it too because they delegate partitioning to `partition_docx()`. This will allow image extraction to work the same way for those two additional document-types.	2024-05-16 22:14:02 +00:00
Jan Kanty Milczek	e6ada05c55	Feat: form parsing placeholders (#3034 ) Allows introduction of form extraction in the future - sets up the FormKeysValues element & format, puts in an empty function call in the partition_pdf_or_image pipeline.	2024-05-16 14:21:31 +00:00
Christine Straub	1fb0fe5cf5	enhancement: `partition_pdf()` skip unnecessary element sorting (#3030 ) This PR aims to skip element sorting when determining whether embedded text can be extracted. The extracted elements in this step are returned as final elements only for the `fast` strategy pipeline and are never used for other strategy pipelines (`hi_res`, `ocr`). Removing element sorting in this step and adding it to the `fast` strategy pipeline later will improve performance and reduce execution time. ### Summary - skip element sorting when determining whether embedded text can be extracted. - add `_partition_pdf_with_pdfparser()` function for `fast` strategy pipeline ### Testing CI should pass.	2024-05-16 06:02:56 +00:00
Steve Canny	aeca8bef88	rfctr(odt): organize and improve test_odt.py (#3031 ) Summary In preparation for adding more tests related to image extraction, improve the `partition_odt()` test suite: - Add type annotations to type-check clean on strict mode. - Improve test names. - Simplify tests where possible. - Remove a couple duplicated tests	2024-05-16 01:04:06 +00:00
Matt Robinson	612905e311	build: wolfi base image for Dockerfile (#3016 ) ### Summary Updates the `Dockerfile` to use the Chainguard `wolfi-base` image to reduce CVEs. Also adds a step in the docker publish job that scans the images and checks for CVEs before publishing. The job will fail if there are high or critical vulnerabilities. ### Testing Run `make docker-run-dev` and then `python3.11` once you're in. At that point, you can try: ```python from unstructured.partition.auto import partition elements = partition(filename="example-docs/DA-1p.pdf", skip_infer_table_types=["pdf"]) elements ``` Stop the container once you're done.	2024-05-15 22:53:15 +00:00
Steve Canny	094e3542cb	feat(docx): add strategy parameter to partition_docx() (#3026 ) Summary The behavior of an image sub-partitioner can be partially determined by the partitioning strategy, for example whether it is "hi_res" or "fast". Add this parameter to `partition_docx()` so it can pass it along to `DocxPartitionerOptions` which will make it available to any image sub-partitioners.	2024-05-15 21:05:32 +00:00
Steve Canny	a164b01c7e	rfctr(doc): spruce up test_doc.py (#3024 ) Summary In preparation for adding more tests related to image extraction, improve the `partition_doc()` test suite: - Remove redundant DOCX -> DOC file conversions on most tests. - Add type annotations to type-check clean on strict mode. - Improve test names. - Simplify tests where possible. - Remove one duplicated test Speed was roughly doubled: 24 tests in 20s -> 23 tests in 8s.	2024-05-15 18:32:51 +00:00
Steve Canny	12b30d2810	rfctr(docx): extract DocxPartitionerOptions (#3018 ) Reviewers: Probably easier to review first and second commits separately as the first one adds all the new code and tests (without installing it), and the second one installs it into the partitioner along with the required changes to code and tests. Summary Enable communication of partitioning options to sub-partitioners, in particular to the pluggable `PicturePartitioner` coming in a closely subsequent PR to implement image-extraction and OCR for DOCX, DOC, and ODT formats. Additional Context In general, validation of partitioning options as well as assigning default values and computing derived partitioning settings can be extracted from partitioners into a neatly encapsulated separate object. This simplifies the core partitioning code by removing the noise associated with computing metadata values and deciding how to access the source document, etc. However, better factoring aside, having the partition-time "settings" available in a single object allows partitioning of certain document features, for example images, to be readily _delegated_ to a sub-partitioner while still giving it access to all the relevant partitioning settings for the current document. This is particularly important when a sub-partitioner is "pluggable" at runtime and must rely on a clearly-defined (and simple as possible) interface to operate smoothly.	2024-05-15 00:50:31 +00:00
Steve Canny	db186dc23b	rfctr(doc): organize test_doc.py (#3017 ) Summary Organize DOC tests into related groups with markers. This makes it easier to assess coverage and find tests related to particular behaviors. This is in preparation for adding tests related to DOC image extraction. No code changes, purely line-block moves. - Move module-level fixtures to the bottom. - Organize tests into related groups with markers.	2024-05-14 20:57:31 +00:00
Steve Canny	b4a6009c09	rfctr(docx): improve typing etc. in prep for docx image extraction (#3015 ) Summary Noisy but trivial changes to `partition_docx()` environs and tests in preparation for DOCX image extraction. These changes are extracted here so they don't distract from the changes of substance to follow in the next PR.	2024-05-14 19:32:17 +00:00
Steve Canny	3f8e6b79c5	rfctr(docx): move docx unit tests to bottom (#3011 ) No code changes, strictly this single block move. Move `Describe_DocxPartitioner` unit-test class to bottom so `DescribeDocxPartitionerOptions` unit-test to follow in subsequent commit will be together with it. Integration tests first, then unit tests, for consistency with other test modules e.g. test_pptx. I added `Describe_DocxPartitioner` soon after I arrived, before we adopted the convention of placing unit-tests after integration tests. Move this so we can maintain that consistency with the block of tests to follow in a closely subsequent PR.	2024-05-13 22:05:12 +00:00
Steve Canny	e4c895923d	fix(csv): partition_csv() raises on long lines (#2998 ) Summary The CSV delimiter-sniffer requires whole lines to properly detect the delimiter character. Limiting bytes read produced partial lines when lines were very long. Limit bytes but read whole lines. Fixes #2643.	2024-05-10 21:19:31 +00:00
John	593aa47802	fix: ppt parameters include_page_breaks and include_slide_notes (#2996 ) Pass the parameters `include_slide_notes` and `include_page_breaks` to `partition_pptx` from `partition_ppt`. Also update the .ppt example doc we use for testing so it has slide notes and a PageBreak (and second page)	2024-05-10 17:57:36 +00:00
John	d829b669e6	Add starting_page_num param to partition_image (#2987 ) Add missing starting_page_num param to partition_image Closes #2985	2024-05-09 21:31:35 +00:00
Michał Martyniak	2f25d8f79e	Support for concurrent processing of documents during evaluation (#2973 ) Currently, CCT eval takes a long time for any of the test_metrics CI runs. Documents in an eval set are evaluated sequentially, and it appears that a max of 1 CPU core is currently utilized. This implies there could be a large speedup by running eval across multiple docs concurrently (probably with multiprocessing). Things done in this PR: - [x] concurrent.futures.ProcessPoolExecutor instead of sequential for-loop - [x] refactor/reorganization of redundant pieces of code without changing the inner logic too much. Without that we'd have 3 places where documents are being processed. Take a look at `BaseMetricsCalculator` class and classes that inherit from it. - [x] string paths manipulation is now reworked and relies on `pathlib.Path()`	2024-05-09 21:25:47 +00:00
John	ef47d530f6	feat: add chunking to partition_tsv (#2982 ) Closes #2980	2024-05-07 23:09:27 +00:00
Pluto	4397dd6a10	Add calculation of table related metrics based on table_as_cells (#2898 ) This pull request adds metrics that are calculated based on table_as_cells instead of text_as_html. This change is required for comprehensive metrics calculation, as previously every predicted colspan or rowspan was considered an incorrect prediction (even if it was a correct prediction). This change has to be merged after https://github.com/Unstructured-IO/unstructured/pull/2892 which introduces the table_as_cells field.	2024-05-07 13:57:38 +00:00
Christine Straub	0cd07d78f9	feat: `partition_pdf()` add ability to get `cid` ratio (#2970 ) This PR adds the ability to get the ratio of `cid` characters in embedded text extracted by `pdfminer`. This PR is the second part of moving `cid` related code from `unstructured-inference` to `unstructured` and works together with https://github.com/Unstructured-IO/unstructured-inference/pull/342.	2024-05-04 05:21:27 +00:00
Steve Canny	cb55245f70	rfctr: extract OCRAgent.get_agent() out of PDF subtree (#2965 ) Summary File-types other than PDF need to use OCR on extracted images. Extract `OCRAgent.get_agent()` such that any file-type partitioner can use it without risking dependency on PDF-only extras.	2024-05-03 19:39:22 +00:00
Steve Canny	39b74a2370	fix(test): Remedy macOS-only test failure not triggered by CI (#2957 ) Summary A crude and OS-specific mechanism was used to detect when a path represented a temp-file. Change that to be robust across operating systems and localized configurations. The specific problem was for DOC files but this PR fixes it for PPT too which was prone to the same problem.	2024-05-02 18:21:18 +00:00
Steve Canny	7dea2fa4a1	rfctr: tidy up ppt+doc tests (#2956 ) Summary Make tests for DOC and PPT formats more concise and readable in preparation for adding one or two.	2024-05-02 16:00:00 +00:00
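
Note on 0e16bf4bf0 (#3124): the commit message describes applying tar extraction filters on Python 3.12 or above. A minimal sketch of that idea follows, assuming the standard-library `"data"` filter is the one intended; the helper name `extract_tar_safely` is hypothetical and is not taken from the repository.

```python
import sys
import tarfile


def extract_tar_safely(tar_path: str, dest_dir: str) -> None:
    """Extract `tar_path` into `dest_dir`, using an extraction filter when available.

    Hypothetical helper for illustration only; not the repository's code.
    """
    with tarfile.open(tar_path) as tar:
        if sys.version_info >= (3, 12):
            # The "data" filter rejects members that would land outside
            # dest_dir, such as absolute paths or links escaping the tree.
            tar.extractall(path=dest_dir, filter="data")
        else:
            # Older interpreters may not accept the `filter` argument.
            tar.extractall(path=dest_dir)
```

Checking `sys.version_info` mirrors the wording of the commit message; testing `hasattr(tarfile, "data_filter")` is an alternative that would also cover security releases of older Python versions that gained the filter.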


725 Commits
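
Note on e4c895923d (#2998): the fix is summarized as "limit bytes but read whole lines" so the CSV delimiter-sniffer never sees a partial line. One way this could be sketched with the standard library is shown below. The function name `sniff_delimiter`, the `max_bytes` default, and the UTF-8 decoding are assumptions made for illustration, not the partitioner's actual implementation.

```python
import csv
import io
from typing import IO


def sniff_delimiter(file: IO[bytes], max_bytes: int = 8192) -> str:
    """Guess the CSV delimiter from a bounded sample made of whole lines.

    Hypothetical helper for illustration only; not the repository's code.
    """
    # readlines(hint) stops once the lines read so far total at least
    # `hint` bytes, but it never cuts a line in the middle.
    lines = file.readlines(max_bytes)
    sample = b"".join(lines).decode("utf-8")  # assumes UTF-8 input
    return csv.Sniffer().sniff(sample).delimiter


if __name__ == "__main__":
    data = io.BytesIO(b"a;b;c\n1;2;3\n4;5;6\n")
    print(sniff_delimiter(data, max_bytes=10))
```

By contrast, `file.read(max_bytes)` would return exactly `max_bytes` bytes and could end mid-line, which is the failure mode the commit describes for very long lines.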