656 Commits

Author SHA1 Message Date
Yao You
a11ad22609
bump unstructured-inference (#3711)
This PR bumps `unstructured-inference` to `0.8.0`, which introduces
vectorized data structure for layout elements and text regions.
This PR also cleans up a few places in CI that has repeated definition
of env variables or missing installation of testing dependencies in
cache.

A few document ingest results are changed:
- two places for `biomed-api` (actually processed locally on runner) are
due to very small changes in numerical results of the bounding box
areas: one results in a duplicated page number/header and another
results in a deduplication of a word of a sentence that starts in a new
line. (yes, two cases goes in opposite directions)
- the layout parser paper now outputs the code lines with page number
inside the code box as list items

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-10-21 21:55:08 +00:00
Steve Canny
3240e3d17a
rfctr(pptx): minify HTML and table.text is cct (#3734)
**Summary**
Eliminate historical "idiosyncracies" of `table.metadata.text_as_html`
HTML introduced by `partition_pptx()`. Produce minified `.text_as_html`
consistent with that formed by chunking.

**Additional Context**
- PPTX `.metadata.text_as_html` is minified (no extra whitespace or
thead, tbody, tfoot elements).
- `table.text` is clean-concatenated-text (CCT) of table.
- Last use of `tabulate` library is removed and that dependency is
removed from `base.in`.
2024-10-21 16:23:15 +00:00
Steve Canny
208c7edc52
rfctr(csv): minify HTML and table text is cct (#3733)
**Summary**
Eliminate historical "idiosyncracies" of `table.metadata.text_as_html`
HTML introduced by `partition_csv()`. Produce minified `.text_as_html`
consistent with that formed by chunking.

**Additional Context**
- CSV `.metadata.text_as_html` is minified (no extra whitespace or
thead, tbody, tfoot elements).
- `table.text` is clean-concatenated-text (CCT) of table.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-10-19 06:49:09 +00:00
Steve Canny
c85f29e6ca
fix(xlsx): XLSX emits std minified .text_as_html (#3558)
**Summary**
Eliminate historical "idiosyncracies" of `table.metadata.text_as_html`
HTML introduced by `partition_xlsx()`. Produce minified `.text_as_html`
consistent with that formed by chunking.

**Additional Context**
- XLSX `.text_as_html` is minified (no extra whitespace or thead, tbody,
tfoot elements).
- `table.text` is clean-concatenated-text (CCT) of table.

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-10-17 22:05:11 +00:00
Nathan Van Gheem
b092d45816
Remove unsupported chipper model (#3728)
The chipper model is no longer supported.
2024-10-17 17:40:45 +00:00
Steve Canny
1eceac26c8
rfctr(email): eml partitioner rewrite (#3694)
**Summary**
Initial attempts to incrementally refactor `partition_email()` into
shape to allow pluggable partitioning quickly became too complex for
ready code-review. Prepare separate rewritten module and tests and swap
them out whole.

**Additional Context**
- Uses the modern stdlib `email` module to reliably accomplish several
manual decoding steps in the legacy code.
- Remove obsolete email-specific element-types which were replaced 18
months or so ago with email-specific metadata fields for things like Cc:
addresses, subject, etc.
- Remove accepting an email as `text: str` because MIME-email is
inherently a binary format which can and often does contain multiple and
contradictory character-encodings.
- Remove `encoding` parameters as it is now unused. An email file is not
a text file and as such does not have a single overall encoding.
Character encoding is specified individually for each MIME-part within
the message and often varies from one part to another in the same
message.
- Remove the need for a caller to specify `attachment_partitioner`.
There is only one reasonable choice for this which is
`auto.partition()`, consistent with the same interface and operation in
`partition_msg()`.
- Fixes #3671 along the way by silently skipping attachments with a
file-type for which there is no partitioner.
- Substantially extend the test-suite to cover multiple
transport-encoding/charset combinations.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-10-16 02:02:33 +00:00
Roman Isecke
9049e4e2be
feat/remove ingest code, use new dep for tests (#3595)
### Description
Alternative to https://github.com/Unstructured-IO/unstructured/pull/3572
but maintaining all ingest tests, running them by pulling in the latest
version of unstructured-ingest.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-10-15 10:01:34 -05:00
David Blore
ecf0267b85
fix: add language to OCRAgentGoogleVision constructor (#3696)
This PR addresses issue #3659 by adding an optional `language` parameter
to the `OCRAgentGoogleVision` class constructor.

This parameter serves as a "language hint" for the
`document_text_detection` method in the `ImageAnnotatorClient`. For more
information on language hints, refer to the [Google Cloud Vision
documentation](https://cloud.google.com/vision/docs/languages).


**Default Behavior**: 
The language parameter defaults to None, allowing Google Cloud Vision to
auto-detect the language, as recommended in their documentation.

**Purpose**: 
This change is necessary because the `OCRAgent`'s `get_instance` method
expects all `OCRAgent`s to include a language parameter in their
constructors.

**Context on Issue:**
When trying to parse a PDF with
`OCR_AGENT=unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`,
an error occurs in the `get_instance` method. The method expects a
`language` parameter, which the current `OCRAgentGoogleVision`
constructor does not support, leading to a positional argument error.

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
2024-10-14 05:35:05 +00:00
Steve Canny
2f496f867c
fix(auto): quick fix for auto test failing in CI (#3715)
Better fix to follow.
2024-10-10 18:44:00 +00:00
John
27fa2a39d8
add tests describing the behavior of set_element_hierarchy (#3700)
Small pr adding tests describing the behavior of
`set_element_hierarchy`. No tests were changed, just added.
2024-10-04 22:49:38 +00:00
Steve Canny
718891a447
rfctr(part): remove double-decoration 5 (#3692)
**Summary**
Remove double-decoration from EML and MSG.

**Additional Context**
- These needed to wait to the end because `partition_email()` and
`partition_msg()` can use any other partitioner for one of their
attachments.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-10-04 21:01:32 +00:00
Steve Canny
4711a8dc26
rfctr(part): remove double-decoration 4 (#3690)
**Summary**
Install new `@apply_metadata()` on TXT.

**Additional Context**
- Both EML and MSG delegate to both HTML and TXT to partition the
message-body, depending on which MIME-part body payload is selected
(`text/plain` or `text/html`). This PR prepares the way to remove
decorators from EML and MSG in the next PR.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-10-03 16:41:31 +00:00
Steve Canny
9bd91a836e
rfctr(part): remove double-decoration 3 (#3687)
**Summary**
Install new `@apply_metadata()` on HTML and remove decorators from
delegating partitioners EPUB, MD, ORG, RST, and RTF.

**Additional Context**
- All five of these delegating partitioners delegate to
`partition_html()` so they're something of a matched set. EML and MSG
also partially delegate to HTML but that's a harder problem (they also
delegate to all other partitioners for attachments) that we'll address a
couple PRs later .
- Replace use of `@process_metadata()` and
`@add_metadata_with_filetype()` decorators with `@apply_metadata()` on
`partition_html()`.
- Remove all decorators from delegating partitioners; this removes the
"double-decorating".
2024-10-02 21:04:37 +00:00
Steve Canny
17092198d0
rfctr(part): remove double-decoration 2 (#3686)
**Summary**
Install new `@apply_metadata()` on PPTX, TSV, XLSX, and XML and remove
decoration from PPT.

**Additional Context**
- Alphabetical order turns out to be hard, so this is the remaining
"easy" delegating partitioner and the remaining principal partitioners.
- Replace use of `@process_metadata()` and
`@add_metadata_with_filetype()` decorators with `@apply_metadata()` on
principal partitioners (those that do not delegate to other
partitioners.
- Remove all decorators from delegating partitioners (PPT in this case);
this removes the "double-decorating".
2024-10-02 18:52:59 +00:00
Steve Canny
bba60260b2
rfctr(part): remove double-decoration 1 (#3685)
**Summary**
Install new `@apply_metadata()` on CSV and DOCX and remove decoration
from DOC and ODT.

**Additional Context**
- Working in alphabetical order and keeping PR size manageable, replace
use of `@process_metadata()` and `@add_metadata_with_filetype()`
decorators with `@apply_metadata()` on principal partitioners (those
that do not delegate to other partitioners.
- Remove all decorators from delegating partitioners (DOC and ODT in
this case); this removes the "double-decorating".
2024-10-01 22:40:58 +00:00
Steve Canny
c05e1babf1
rfctr(meta): refine @apply_metadata() decorator (#3667)
**Summary**
Refine `@apply_metadata()` replacement decorator. Note it has not been
installed yet.

- Apply `metadata_last_modified` arg with the `@apply_metadata()`
decorator. No need for redundant processing in each partitioner.
- Add "unique-ify" step to fix any cases where the same `Element` or
`ElementMetadata` instance was used more than once in the element
stream. This prevents unexpected "multi-mutation" in downstream
processes.
- Apply "global" metadata items before computing hash-ids. In
particular, `.metadata.filename` is used in the hash computation and
will produce different results if that's not already settled.
- Compute hash-ids _before_ computing `.metadata.parent_id`. This
removes the need for mapping UUID element-ids to their hash counterpart
and doing a fixup of `.parent_id` after applying hash-ids to elements.

**Additional Context**
- The `@apply_metadata()` decorator replaces the four metadata-related
decorators: `@process_metadata()`, `@add_metadata_with_filetype()`,
`@add_metadata()`, and `@add_filetype()`.
- It will be installed on each partitioner in a series of following PRs.
2024-10-01 21:28:32 +00:00
Tomasz Cąkała
75c4998bc7
Fix: partition on empty or whitespace-only text files (#3675)
This is a fix for this
[bug](https://github.com/Unstructured-IO/unstructured/issues/3674), auto partition fails on text files which are empty or contain only whitespaces

Inference of .txt file type fails if the file has only whitespaces.

To Reproduce:

```
from tempfile import NamedTemporaryFile

from unstructured.partition.auto import partition

with NamedTemporaryFile(mode="w", suffix=".txt") as f:
    f.write("   \n")
    f.seek(0)
    elements = partition(filename=f.name)
```
2024-09-28 21:16:33 -07:00
Steve Canny
50d75c47d3
rfctr(part): add new decorator to replace four (#3650)
**Summary**
In preparation for pluggable auto-partitioners, add a new metadata
decorator to replace the four existing ones.

**Additional Context**
"Global" metadata items, those applied to all element on all
partitioners, are applied using a decorator.

Currently there are four decorators where there only needs to be one.
Consolidate those into a single metadata decorator.
One or two additional behaviors of the new decorator will allow us to
remove decorators from delegating partitioners which is a prerequisite
for pluggable auto-partitioners.
2024-09-25 23:15:50 +00:00
Steve Canny
44bad216f3
rfctr(part): prepare for pluggable auto-partitioners 3 (#3661)
**Summary**
Remove unused `include_metadata` parameter.

**Additional Context**
- The `include_metadata` parameter was originally added circa v0.7.12 as
a mechanism for avoiding the "double-decorating" problem on delegating
partitioners.
- It turns out it doesn't fully address that problem, is now unused, and
is unnecessary for the solution we'll be adding as part of pluggable
partitioners.
- Remove the unnecessary complexity introduced by this unused parameter.
2024-09-25 18:17:48 +00:00
Steve Canny
086b8d6f8a
rfctr(part): prepare for pluggable auto-partitioners 2 (#3657)
**Summary**
Step 2 in prep for pluggable auto-partitioners, remove `regex_metadata`
field from `ElementMetadata`.

**Additional Context**
- "regex-metadata" was an experimental feature that didn't pan out.
- It's implemented by one of the post-partitioning metadata decorators,
so get rid of it as part of the cleanup before consolidating those
decorators.
2024-09-24 17:33:25 +00:00
Yao You
903efb0c6d
fix: fix occasional key error when mapping parent id (#3658)
This PR fixes an occasional `KeyError` when calling
`assign_and_map_hash_ids`.

- This happens when the input `elements` has duplicated element
instances or metadata.
- When there are duplications the logic to iterate through all elements
and map their parent ids will raise an error when an already mapped
parent id is up for mapping.
- The fix adds a logic to check if the parent id exists in
`old_to_new_mapping` and if it doesn't we skip mapping it

## test

This PR adds a unit test on this case and the test would fail without
the fix.
2024-09-24 16:39:11 +00:00
Austin Walker
6428d19e5a
fix: update python SDK syntax for forward compatibility (#3656)
Wrap the `shared.PartitionParameters` usage with
`operations.PartitionRequest`. This syntax has been deprecated since
v0.23.0 of the SDK, and will be unsupported in v0.26.0.
2024-09-24 16:37:38 +00:00
Steve Canny
3bab9d93e6
rfctr(part): prepare for pluggable auto-partitioners 1 (#3655)
**Summary**
In preparation for pluggable auto-partitioners simplify metadata as
discussed.

**Additional Context**
- Pluggable auto-partitioners requires partitioners to have a consistent
call signature. An arbitrary partitioner provided at runtime needs to
have a call signature that is known and consistent. Basically
`partition_x(filename, *, file, **kwargs)`.
- The current `auto.partition()` is highly coupled to each distinct
file-type partitioner, deciding which arguments to forward to each.
- This is driven by the existence of "delegating" partitioners, those
that convert their file-type and then call a second partitioner to do
the actual partitioning. Both the delegating and proxy partitioners are
decorated with metadata-post-processing decorators and those decorators
are not idempotent. We call the situation where those decorators would
run twice "double-decorating". For example, EPUB converts to HTML and
calls `partition_html()` and both `partition_epub()` and
`partition_html()` are decorated.
- The way double-decorating has been avoided in the past is to avoid
sending the arguments the metadata decorators are sensitive to to the
proxy partitioner. This is very obscure, complex to reason about,
error-prone, and just overall not a viable strategy. The better solution
is to not decorate delegating partitioners and let the proxy partitioner
handle all the metadata.
- This first step in preparation for that is part of simplifying the
metadata processing by removing unused or unwanted legacy parameters.
- `date_from_file_object` is a misnomer because a file-object never
contains last-modified data.
- It can never produce useful results in the API where last-modified
information must be provided by `metadata_last_modified`.
- It is an undocumented parameter so not in use.
- Using it can produce incorrect metadata.
2024-09-23 22:23:10 +00:00
Steve Canny
03c2bf8f1f
rfctr(part): extract partition.common submodules (#3649)
**Summary**
In preparation for consolidating post-partitioning metadata decorators,
extract `partition.common` module into a sub-package (directory) and
extract `partition.common.metadata` module to house metadata-specific
object shared by partitioners.

**Additional Context**
- This new module will be the home of the new consolidated metadata
decorator.
- The consolidated decorator is a step toward removing post-processing
decorators from _delegating_ partitioners. A delegating partitioner is
one that convert its file to a different format and "delegates" actual
partitioning to the partitioner for that target format. 10 of the 20
partitioners are delegating partitioners.
- Removing decorators from delegating partitioners will allow us to
avoid "double-decorating", i.e. running those decorators twice, once on
the principal partitioner and again on the proxy partitioner.
- This will allow us to send `**kwargs` to either partitioner, removing
the knowledge of which arguments to send for each file-type from
auto-partition.
- And this will allow pluggable auto-partitioners which all have a
`partition_x(filename, *, file, **kwargs) -> list[Element]` interface.
2024-09-20 20:35:28 +00:00
Christine Straub
0ed69a1ac3
refactor: pdfminer image cleanup (#3648)
This PR aims to remove `clean_pdfminer_duplicate_image_elements()`
function, as its functionality has already been integrated into the
`remove_duplicate_elements()` function in [PR
#3630](https://github.com/Unstructured-IO/unstructured/pull/3630).
2024-09-19 18:57:02 +00:00
Christine Straub
be88eef06f
perf: optimize pdfminer image cleanup process for improved performance (#3630)
This PR enhances `pdfminer` image cleanup process by repositioning the
duplicate image removal step. It optimizes the removal of duplicated
pdfminer images by performing the cleanup before merging elements,
rather than after. This improvement reduces execution time and enhances
the overall processing speed of PDF documents.

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
2024-09-19 14:05:05 +00:00
Steve Canny
cd074bb32b
chore(file): remove dead code (#3645)
**Summary**
Remove dead code in `unstructured.file_utils`.

**Additional Context**
These modules were added in 12/2022 and 1/2023 and are not referenced by
any code. Removing to reduce unnecessary complexity. These can of course
be recovered from Git history if we decide we want them again in future.
2024-09-19 06:45:33 +00:00
Christine Straub
87a88a3c87
feat: improve pdfminer element processing (#3618)
This PR implements splitting of `pdfminer` elements (`groups of text
chunks`) into smaller bounding boxes (`text lines`). This implementation
prevents loss of information from the object detection model and
facilitates more effective removal of duplicated `pdfminer` text. This
PR also addresses #3430.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-09-12 21:17:27 +00:00
John
ab94c6c5d1
chore: remove pins (#3579)
- Remove constraint pins for `Office365-REST-Python-Client`,
`weaviate-client`, and `platformdirs`. Removing the pin for `Office365`
brought to light some bugs in the Onedrive connector, so some changes
were also made to
`unstructured/ingest/v2/processes/connectors/onedrive.py`.
- Also, as part of updating dependencies `unstructured-client` was
updated to `0.25.8`, which introduced a new default for the `strategy`
param and required updating a test fixture.
- The `hubspot.sh` integration test was failing and is now ignored in CI
with this PR per discussion with @rbiseck3.

May be easiest to review commit-by-commit.
2024-09-12 13:48:59 +00:00
cragwolfe
3bb0ee1e79
chore: fix tests breaking on main (#3603)
Fix API tests (really more like integration tests) that run only on
main. Also use less compute intensive files to speedup test time and
remove a useless test.

Tests in `test_unstructured/partition/test_api.py` pass, temporarily
running outside of main per per screenshot:

![image](https://github.com/user-attachments/assets/f15d440a-2574-40f2-98b4-adf57fbae704)


https://github.com/Unstructured-IO/unstructured/actions/runs/10754098974/job/29824415513
2024-09-08 21:25:52 +00:00
Yao You
d51fb134e6
Feat/improve iou speed (#3582)
This PR vectorizes the computation of element overlap to speed up
deduplication process of extracted elements.

## test

This PR adds unit test to the new vectorized IOU and subregion
computation functions.

In addition, running partition on large files with many elements like
this slide:

[002489.pdf](https://github.com/user-attachments/files/16823176/002489.pdf)

shows a reduction of runtime from around 15min on the main branch to
less than 4min with this branch.

Profiling results show that the new implementation greatly reduces the
time cost of computation and now most of the time is spend on getting
the coordinates from a list of bboxes.

![Screenshot 2024-08-30 at 9 29
27 PM](https://github.com/user-attachments/assets/6c186838-54c7-483b-ac3e-7342c23ff3a6)
2024-09-03 00:06:18 +00:00
Pawel Kmiecik
404f780bbb
feat: make analysis drawing more flexible (#3574)
This PR changes the way the analysis tools can be used:
- by default if `analysis` is set to `True` in `partition_pdf` and the
strategy is resolved to `hi_res`:
- for each file 4 layout dumps are produced and saved as JSON files
(`object_detection`, `extracted`, `ocr`, `final`) - similar way to the
current `object_detection` dump
- the drawing functions/classes now accept these dumps accordingly
instead of the internal classes instances (like `TextRegion`,
`DocumentLayout`
- it makes it possible to use the lightweight JSON files to render the
bboxes of a given file after the partition is done
- `_partition_pdf_or_image_local` has been refactored and most of the
analysis code is now encapsulated in `save_analysis_artifiacts` function
- to do this, helper function `render_bboxes_for_file` is added
<img width="338" alt="Screenshot 2024-08-28 at 14 37 56"
src="https://github.com/user-attachments/assets/10b6fbbd-7824-448d-8c11-52fc1b1b0dd0">
2024-09-02 11:06:11 +00:00
Matt Robinson
6ba8135bf9
fix: check ole storage content to differentiate filetypes (#3581)
### Summary

Updates the file detection logic for OLE files to check the storage
content of the file to more reliable differentiate between DOC, PPT, XLS
and MSG files. This corrects a bug that caused file type detection to be
incorrect in cases where the `filetype` library guessed and incorrect
MIME type, such as `'application/vnd.ms-excel'` for a `.msg` file.

As part of this work, the `"msg"` extra was removed because the
`python-oxmsg` package is now a base dependency.

### Testing

Using a test `.msg` file that returns `'application/vnd.ms-excel'` from
`filetype.guess_mime`.

```python
from unstructured.file_utils.filetype import detect_filetype

filename = "test-file.msg"
detect_filetype(filename=filename) # result should be FileType.MSG
```
2024-08-30 15:12:46 -04:00
Austin Walker
f440eb476c
feat: Support encoding parameter in partition_csv (#3564)
See added test file. Added support for the encoding parameter, which can
be passed directly to `pd.read_csv`.
2024-08-28 14:19:58 +00:00
John
f21c853ade
bug: fix file_conversion disk leak (#3562)
Fix disk space leaks and Windows errors when accessing file.name on a
NamedTemporaryFile

Uses of `NamedTemporaryFile(..., delete=False)` and/or uses of
`file.name` of NamedTemporaryFile have been replaced with
TemporaryFileDirectory to avoid a known issue:
-
https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile
- https://github.com/Unstructured-IO/unstructured/issues/3390

The first 7 commits each address an individual occurrence of the issue
if reviewers want to review commit-by-commit.
2024-08-27 22:02:24 +00:00
David Potter
ddba928344
Potter/mixedbread embedder (#3513)
Thanks to @huangrpablo and @juliuslipp we now have a mixedbread.ai
embedder!
2024-08-27 14:52:13 +00:00
Steve Canny
32bb77aafb
fix(file): no default OLE subtype (#3516)
**Summary**
Do not assume MSG format when an OLE "container" file cannot be
differentiated into DOC, PPT, XLS, or MSG. Fall back to extention-based
identification in that case.

**Additional Context**
DOC, MSG, PPT, and XLS are all OLE files. An OLE file is, very roughly,
a Microsoft-proprietary Zip format which "contains" a filesystem of
discrete files and directories.

An OLE "container" is easily identified by inspecting the first 8 bytes
of the file, so all we need to do is differentiate between the four
subtypes we can process. The `filetype` module does a good job of this
but it not perfect and does not identify MSG files.

Previously we assumed MSG format when none of DOC, PPT, or XLS was
detected, but we discovered that `filetype` is not completely reliable
at detecting these types.

Change the behavior to remove the assumption of MSG format.
`_OleFileDifferentiator` returns `None` in this case and filetype
detection falls back to use filename-extension.

Note a file with no filename and no metadata_filename or an incorrect
extension will not be correctly identified in this case, however we're
assuming for now that will be rare in practice.
2024-08-22 19:16:53 +00:00
Steve Canny
03e0ed3519
rfctr(docx): DOCX emits std minified .text_as_html (#3545)
**Summary**
Eliminate historical "idiosyncracies" of `table.metadata.text_as_html`
HTML introduced by `partition_docx()`. Produce minified `.text_as_html`
consistent with that formed by chunking.

**Additional Context**
- nested tables appear as their extracted text in the parent cell (no
nested `<table>` elements in `.text_as_html`).
- DOCX `.text_as_html` is minified (no extra whitespace or thead, tbody,
tfoot elements).
2024-08-21 18:54:21 +00:00
John
604cadfb7e
chore: remove ipython pin (#3548)
this pr is stacked on
https://github.com/Unstructured-IO/unstructured/pull/3538 and
https://github.com/Unstructured-IO/unstructured/pull/3547

This pr removes dependency pins for IPython, anyio, and pyparsing. It
also updates the label-studio-sdk import statement so we don't have to
have that pinned and make some minor type hinting edits. Label Studio
had a breaking change in their 1.13.0
[release](https://github.com/HumanSignal/label-studio/releases/tag/1.13.0)
2024-08-21 00:06:31 +00:00
Steve Canny
a861ed8fe7
feat(chunk): split tables on even row boundaries (#3504)
**Summary**
Use more sophisticated algorithm for splitting oversized `Table`
elements into `TableChunk` elements during chunking to ensure element
text and HTML are "synchronized" and HTML is always parseable.

**Additional Context**
Table splitting now has the following characteristics:
- `TableChunk.metadata.text_as_html` is always a parseable HTML
`<table>` subtree.
- `TableChunk.text` is always the text in the HTML version of the table
fragment in `.metadata.text_as_html`. Text and HTML are "synchronized".
- The table is divided at a whole-row boundary whenever possible.
- A row is broken at an even-cell boundary when a single row is larger
than the chunking window.
- A cell is broken at an even-word boundary when a single cell is larger
than the chunking window.
- `.text_as_html` is "minified", removing all extraneous whitespace and
unneeded elements or attributes. This maximizes the semantic "density"
of each chunk.
2024-08-19 18:56:53 +00:00
Christine Straub
fc26426310
feat: replace pytesseract with unstructured.pytesseract fork (#3528)
This PR reverts `pytesseract` dependency to `unstructured.pytesseract`
fork due to the unavailability of some recent release versions of
`pytesseract` on PyPI.

This PR also addresses an issue encountered during the publication of
`unstructured==0.15.4` to PyPI. The error was due to the fact that PyPI
does not allow direct dependencies from Version Control System URLs like
GitHub in the `install_requires` or `extras_require` sections of the
`setup.py` file.
2024-08-16 10:34:22 -04:00
Matt Robinson
7437f0a084
fix(CVE-2024-39705): update to latest nltk version (#3512)
### Summary

Addresses
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705) by
updating to `nltk==3.8.2` and closes #3511. This CVE had previously been
mitigated in #3361.

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
2024-08-13 09:39:29 -04:00
Steve Canny
cbe1b35621
rfctr(chunk): prep for adding TableSplitter (#3510)
**Summary**
Mechanical refactoring in preparation for adding (pre-chunk)
`TableSplitter` in a PR stacked on this one.
2024-08-12 18:04:49 +00:00
Jake Zerrer
051be5aead
Remove unstructured.pytesseract fork (#3454)
A second attempt at
https://github.com/Unstructured-IO/unstructured/pull/3360, this PR
removes unstructured's dependency on its own fork of `pytesseract`. (The
original reason for the fork, the addition of
`run_and_get_multiple_output`, was removed
[here](https://github.com/madmaze/pytesseract/releases/tag/v0.3.12).)

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
2024-08-09 04:28:48 +00:00
John
24a1f298e5
chore: small edits (#3480)
Add comments and fix decorators on some tests.
2024-08-06 19:21:43 +00:00
Steve Canny
73bef27ef1
fix(pptx): accommodate invalid image/jpg MIME-type (#3475)
As described in #3381, some clients, perhaps including Adobe PDF
Converter, map JPEG images to the invalid `image/jpg` MIME-type. Prior
to v1.0.0, `python-pptx` would not load these images, which caused image
extraction to fail.

Update the `python-pptx` dependency to `v1.0.1` or above to ensure this
upstream fix is always available.

Fixes: #3381
2024-08-06 18:48:15 +00:00
Steve Canny
a468b2de3b
rfctr(csv): accommodate single column CSV files (#3483)
**Summary**
Improve factoring, type-annotation, and tests for `partition_csv()` and
accommodate single-column CSV files.

Fixes: #2616
2024-08-06 00:48:37 +00:00
Maciej Kurzawa
b749b891a7
fix: disabled checking max pages for images (#3473)
Added fix related to
https://github.com/Unstructured-IO/unstructured/pull/3431, which
disables checking max pages for images
2024-08-02 14:25:08 +00:00
John
147514f6b5
feat: msg and email metadata (#3444)
Update partition_eml and partition_msg to capture cc, bcc, and message
id fields.

Docs PR: https://github.com/Unstructured-IO/docs/pull/135/files

Testing
```
from unstructured.partition.email import partition_email
from test_unstructured.unit_utils import example_doc_path

elements = partition_email(filename=example_doc_path("eml/fake-email-header.eml"), include_headers=True)
print(elements)
elements[0].metadata.to_dict()
```

Note to reviewers:
Tests in `test_unstructured/partition/test_email.py` were refactored and
rearranged to group similar tests together, so it will be easiest to
review those changes commit by commit.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
2024-08-01 19:24:17 +00:00
Maciej Kurzawa
8fd216cc9f
feat/pdf-page-limit-in-hi-res (#3431)
# Description:
Passing `max_pages` argument allows rejecting pdf files which exceeds
this page number limit while `high_res` strategy is chosen. By default
it will allow parsing pdf files with unlimited number of pages.

# Testing:
```python
from unstructured.partition.auto import partition

elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res')  # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=4)  # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=2)  # should raise PdfMaxPagesExceededError
```
2024-07-30 16:52:17 +00:00