438 Commits

Author SHA1 Message Date
Jake Zerrer
051be5aead
Remove unstructured.pytesseract fork (#3454)
A second attempt at
https://github.com/Unstructured-IO/unstructured/pull/3360, this PR
removes unstructured's dependency on its own fork of `pytesseract`. (The
original reason for the fork, the addition of
`run_and_get_multiple_output`, was removed
[here](https://github.com/madmaze/pytesseract/releases/tag/v0.3.12).)

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
2024-08-09 04:28:48 +00:00
John
24a1f298e5
chore: small edits (#3480)
Add comments and fix decorators on some tests.
2024-08-06 19:21:43 +00:00
Steve Canny
73bef27ef1
fix(pptx): accommodate invalid image/jpg MIME-type (#3475)
As described in #3381, some clients, perhaps including Adobe PDF
Converter, map JPEG images to the invalid `image/jpg` MIME-type. Prior
to v1.0.0, `python-pptx` would not load these images, which caused image
extraction to fail.

Update the `python-pptx` dependency to `v1.0.1` or above to ensure this
upstream fix is always available.

Fixes: #3381
2024-08-06 18:48:15 +00:00
Steve Canny
a468b2de3b
rfctr(csv): accommodate single column CSV files (#3483)
**Summary**
Improve factoring, type-annotation, and tests for `partition_csv()` and
accommodate single-column CSV files.

Fixes: #2616
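
A minimal sketch of why single-column files need special handling, assuming the difficulty is delimiter sniffing: with only one column there is no delimiter to detect. The helper name `read_csv_rows` and the comma fallback are illustrative assumptions, not the code in `partition_csv()`.

```python
import csv
import io


def read_csv_rows(text: str) -> list[list[str]]:
    """Parse CSV text, falling back to a comma when no delimiter can be sniffed."""
    try:
        # Restrict the candidates so ordinary letters are never chosen as a delimiter.
        dialect = csv.Sniffer().sniff(text, delimiters=",;\t|")
        delimiter = dialect.delimiter
    except csv.Error:
        # Typical for single-column data: there is nothing to sniff.
        delimiter = ","
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


rows = read_csv_rows("name\nalice\nbob\n")
```

Each row of a single-column file comes back as a one-item list, so downstream code can treat it like any other table.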
2024-08-06 00:48:37 +00:00
Maciej Kurzawa
b749b891a7
fix: disabled checking max pages for images (#3473)
Added fix related to
https://github.com/Unstructured-IO/unstructured/pull/3431, which
disables checking max pages for images
2024-08-02 14:25:08 +00:00
John
147514f6b5
feat: msg and email metadata (#3444)
Update partition_eml and partition_msg to capture cc, bcc, and message
id fields.

Docs PR: https://github.com/Unstructured-IO/docs/pull/135/files

Testing
```
from unstructured.partition.email import partition_email
from test_unstructured.unit_utils import example_doc_path

elements = partition_email(filename=example_doc_path("eml/fake-email-header.eml"), include_headers=True)
print(elements)
elements[0].metadata.to_dict()
```

Note to reviewers:
Tests in `test_unstructured/partition/test_email.py` were refactored and
rearranged to group similar tests together, so it will be easiest to
review those changes commit by commit.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
2024-08-01 19:24:17 +00:00
Maciej Kurzawa
8fd216cc9f
feat/pdf-page-limit-in-hi-res (#3431)
# Description:
Passing the `max_pages` argument rejects PDF files that exceed this
page limit when the `hi_res` strategy is chosen. By default, PDF files
with any number of pages are parsed.

# Testing:
```python
from unstructured.partition.auto import partition

elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res')  # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=4)  # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=2)  # should raise PdfMaxPagesExceededError
```
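
The limit check itself can be sketched as below. This is an illustration only: the exception class is a local stand-in that borrows the name from the snippet above, and `check_page_limit` is a hypothetical helper, not a function in the library.

```python
from typing import Optional


class PdfMaxPagesExceededError(ValueError):
    """Local stand-in for the exception named in the commit message."""


def check_page_limit(page_count: int, max_pages: Optional[int] = None) -> None:
    """Raise when `page_count` exceeds `max_pages`; `None` means no limit."""
    if max_pages is not None and page_count > max_pages:
        raise PdfMaxPagesExceededError(
            f"document has {page_count} pages, limit is {max_pages}"
        )


check_page_limit(3)               # no limit given: passes
check_page_limit(3, max_pages=4)  # under the limit: passes
```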
2024-07-30 16:52:17 +00:00
Steve Canny
4e61acc1c6
fix(file): fix OLE-based file-type auto-detection (#3437)
**Summary**
A DOC, PPT, or XLS file sent to partition() as a file-like object is
misidentified as a MSG file and raises an exception in python-oxmsg
(which is used to process MSG files).

**Fix**
DOC, PPT, XLS, and MSG are all Microsoft OLE-based files, aka. Compound
File Binary Format (CFBF). These can be reliably distinguished by
inspecting magic bytes in certain locations. `libmagic` is unreliable at
this or doesn't try, reporting the generic `"application/x-ole-storage"`
which corresponds to the "container" CFBF format (vaguely like a
Microsoft Zip format) that all these document types are stored in.

Unconditionally use `filetype.guess_mime()` provided by the `filetype`
package that is part of the base unstructured install. Unlike
`libmagic`, this package reliably detects the distinguished MIME-type
(e.g. `"application/msword"`) for OLE file subtypes.
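
For context, a small sketch of recognizing the CFBF container from its leading bytes. The eight-byte signature is the standard CFBF header; the helper name `is_cfbf` is made up for illustration. Note this only identifies the container. Telling DOC, PPT, XLS, and MSG apart needs a deeper look, which this commit leaves to `filetype.guess_mime()`.

```python
# Standard 8-byte signature at the start of every CFBF (OLE) file.
CFBF_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def is_cfbf(head: bytes) -> bool:
    """Return True when `head` (the first bytes of a file) starts a CFBF container."""
    return head.startswith(CFBF_SIGNATURE)


looks_like_ole = is_cfbf(CFBF_SIGNATURE + b"\x00" * 8)
```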

Fixes #3364
2024-07-25 17:25:41 +00:00
Steve Canny
3fe5c094fa
rfctr(file): refactor detect_filetype() (#3429)
**Summary**
In preparation for fixing a cluster of bugs with automatic file-type
detection and paving the way for some reliability improvements, refactor
`unstructured.file_utils.filetype` module and improve thoroughness of
tests.

**Additional Context**
Factor type-recognition process into three distinct strategies that are
attempted in sequence. Attempted in order of preference,
type-recognition falls to the next strategy when the one before it is
not applicable or cannot determine the file-type. This provides a clear
basis for organizing the code and tests at the top level.
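
The fall-through arrangement described above can be sketched as a list of strategies tried in order. The two strategies shown (`from_magic_bytes`, `from_extension`) and the string results are hypothetical placeholders; the commit message does not name the three real strategies.

```python
import os
from typing import Callable, Optional, Sequence

# A strategy returns a file-type name, or None when it cannot decide.
Strategy = Callable[[str, bytes], Optional[str]]


def from_magic_bytes(path: str, head: bytes) -> Optional[str]:
    return "pdf" if head.startswith(b"%PDF-") else None


def from_extension(path: str, head: bytes) -> Optional[str]:
    ext = os.path.splitext(path)[1].lower()
    return {".csv": "csv", ".txt": "txt"}.get(ext)


def detect(path: str, head: bytes, strategies: Sequence[Strategy]) -> Optional[str]:
    """Try each strategy in order of preference; the first definite answer wins."""
    for strategy in strategies:
        result = strategy(path, head)
        if result is not None:
            return result
    return None


STRATEGIES = [from_magic_bytes, from_extension]
detected = detect("report.bin", b"%PDF-1.7", STRATEGIES)
```

Keeping each strategy a separate function is what lets the tests be grouped by strategy.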

Consolidate the existing tests around these strategies, adding
additional cases to achieve better coverage.

Several bugs were uncovered in the process. Small ones were just fixed,
bigger ones will be remedied in following PRs.
2024-07-23 23:18:48 +00:00
Steve Canny
49c4bd34be
rfctr(auto): add _PartitionerLoader (#3418)
**Summary**
Replace conditional explicit import of partitioner modules in
`.partition.auto` with the new `_PartitionerLoader` class. This avoids
unbound variable warnings and is much less noisy.

`_PartitionerLoader` makes use of the new `FileType` property
`.importable_package_dependencies` to determine whether all required
packages are importable before dispatching the file to its partitioner.
It uses `FileType.extra_name` to form a helpful error message when a
dependency is not installed, so the caller knows which `pip install`
extra to specify to remedy the error.

`_PartitionerLoader` uses the `FileType` properties
`.partitioner_module_qname` and `.partitioner_function_name` to load
the partitioner once its dependencies are verified. Loaded partitioners
are cached with module lifetime scope for efficiency.
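
A rough sketch of the load-and-cache idea, using only the standard library. `SketchPartitionerLoader`, its method signature, and the wording of the error message are assumptions for illustration and differ from the real `_PartitionerLoader`.

```python
import importlib
import importlib.util
from typing import Callable, Dict, Sequence, Tuple


class SketchPartitionerLoader:
    """Hypothetical stand-in; not the library's `_PartitionerLoader`."""

    # Module-lifetime cache of already-loaded functions.
    _cache: Dict[Tuple[str, str], Callable] = {}

    @classmethod
    def get(
        cls,
        module_qname: str,
        function_name: str,
        dependencies: Sequence[str] = (),
        extra_name: str = "all",
    ) -> Callable:
        key = (module_qname, function_name)
        if key not in cls._cache:
            # Verify top-level packages are importable before importing the module.
            missing = [d for d in dependencies if importlib.util.find_spec(d) is None]
            if missing:
                raise ImportError(
                    f"missing {missing}; install the '{extra_name}' extra to fix this"
                )
            module = importlib.import_module(module_qname)
            cls._cache[key] = getattr(module, function_name)
        return cls._cache[key]


dumps = SketchPartitionerLoader.get("json", "dumps", dependencies=["json"])
```

Here the standard-library `json` module plays the role of a partitioner module so the sketch runs anywhere.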
2024-07-22 06:03:55 +00:00
Christine Straub
ec59abfabc
enhancement: improve text clearing process in email partitioning (#3422)
### Summary
Currently, the email partitioner removes only `=\n` characters during
the clearing process. However, email content sometimes contains `=\r\n`
characters, especially when read from file-like objects such as
`SpooledTemporaryFile` (the file type used in our API). This PR updates
the email partitioner to remove both `=\n` and `=\r\n` characters during
the clearing process.
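
The cleanup described above amounts to removing both soft line-break forms. A minimal sketch, with `remove_soft_line_breaks` as a hypothetical name rather than the partitioner's actual helper:

```python
def remove_soft_line_breaks(text: str) -> str:
    """Drop quoted-printable soft line breaks in both CRLF and LF form."""
    return text.replace("=\r\n", "").replace("=\n", "")


cleaned = remove_soft_line_breaks("Make sure to =\r\nRSVP!")
```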

### Testing

```
import tempfile

from unstructured.partition.email import partition_email

filename = "example-docs/eml/family-day.eml"

elements = partition_email(
    filename=filename,
)
print(f"From filename: {elements[3].text}")

with open(filename, "rb") as test_file:
    spooled_temp_file = tempfile.SpooledTemporaryFile()
    spooled_temp_file.write(test_file.read())
    spooled_temp_file.seek(0)
    elements = partition_email(file=spooled_temp_file)
    print(f"From spooled_temp_file: {elements[3].text}")
```

**Results:**
- on `main`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to = RSVP!
```
- on `PR`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to RSVP!
```
2024-07-19 18:18:02 +00:00
Christine Straub
0eb461acc2
refactor: restructure PDF/Image example document organization (#3410)
This PR aims to improve the organization and readability of our example
documents used in unit tests, specifically focusing on PDF and image
files.

### Summary
- Created two new subdirectories in the `example-docs` folder:
  - `pdf/`: for all PDF example files
  - `img/`: for all image example files
- Moved relevant PDF files from `example-docs/` to `example-docs/pdf/`
- Moved relevant image files from `example-docs/` to `example-docs/img/`
- Updated file paths in affected unit & ingest tests to reflect the new
directory structure

### Testing
All unit & ingest tests should be updated and verified to work with the
new file structure.

## Notes
Other file types (e.g., office documents, HTML files) remain in the root
of `example-docs/` for now.

## Next Steps
Consider similar reorganization for other file types if this structure
proves to be beneficial.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-07-18 22:21:32 +00:00
Steve Canny
e99e5a8abd
rfctr(file): make FileType enum a file-type descriptor (#3411)
**Summary**
Elaborate the `FileType` enum to be a complete descriptor of file-types.
Add methods to allow `STR_TO_FILETYPE`, `EXT_TO_FILETYPE` and
`FILETYPE_TO_MIMETYPE` mappings to be replaced, removing those redundant
and noisy declarations.

In the process, fix some lingering file-type identification and
`.metadata.filetype` errors that had been skipped in the tests.

**Additional Context**
Gathering the various attributes of a file-type into the `FileType` enum
eliminates the duplication inherent in the separate `STR_TO_FILETYPE`
etc. mappings and makes access to those values convenient for callers.
These attributes include what MIME-type a file-type should record in
metadata and what MIME-types and extensions map to that file-type. These
values and others are made available as methods and properties directly
on the `FileType` class and members. Because all attributes are defined
in the `FileType` enum there is no risk of inconsistency across multiple
locations and any changes happen in one and only one place. Further
attributes and methods will be added in later commits to support other
file-type related operations like mapping to a partitioner and verifying
its dependencies are installed.
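
To illustrate the descriptor idea, here is a two-member miniature. `SketchFileType`, its attribute names, and its lookup methods are assumptions made for this example; the real `FileType` enum has more members and attributes.

```python
import enum
from typing import Optional, Tuple


class SketchFileType(enum.Enum):
    """Hypothetical miniature of a descriptor-style enum, not the real `FileType`."""

    CSV = ("csv", (".csv",), "text/csv")
    HTML = ("html", (".html", ".htm"), "text/html")

    def __init__(self, short_name: str, extensions: Tuple[str, ...], mime_type: str):
        # Each member's tuple value is unpacked into named attributes.
        self.short_name = short_name
        self.extensions = extensions
        self.mime_type = mime_type

    @classmethod
    def from_extension(cls, extension: str) -> Optional["SketchFileType"]:
        ext = extension.lower()
        return next((m for m in cls if ext in m.extensions), None)

    @classmethod
    def from_mime_type(cls, mime_type: str) -> Optional["SketchFileType"]:
        return next((m for m in cls if m.mime_type == mime_type), None)


html_type = SketchFileType.from_extension(".HTM")
```

Because every attribute lives on the member, separate lookup tables are unnecessary and cannot drift out of sync.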
2024-07-18 02:05:33 +00:00
Christine Straub
48bdf94656
feat: partition_pdf() support language specification for PaddleOCR (#3400)
Closes #3159.

This PR extends language specification capability to `PaddleOCR` in
addition to `TesseractOCR`. Users can now specify OCR languages for both
OCR engines when using `partition_pdf()`.

### Testing

```
import os

from unstructured.partition.pdf import partition_pdf

os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"

elements = partition_pdf(
    filename=<file_path>,
    strategy=strategy,
    languages=["chi_sim"], # chinese - simplified
    infer_table_structure=True,
)
```
2024-07-16 22:19:25 +00:00
Steve Canny
3d6e30a1f7
rfctr(auto): improve expression in tests (#3384)
**Summary**
In preparation for further work on auto-partitioning, improve the
expression in the test-suite.
2024-07-11 19:57:28 +00:00
Steve Canny
c27e0d0062
rfctr(html): replace html parser (#3218)
**Summary**
Replace legacy HTML parser with recursive version that captures all
content and provides flexibility to add new metadata. It's also
substantially faster although that's just a happy side-effect.

**Additional Context**
The prior HTML parsing algorithm that makes up the core of HTML
partitioning was buggy and very difficult to reason about because it did
not conform to the inherently recursive structure of HTML. The new
version retains `lxml` as the performant and reliable base library but
uses `lxml`'s custom element classes to efficiently classify HTML
elements by their behaviors (block-item and inline (phrasing) primarily)
and give those elements the desired partitioning behaviors.

This solves a host of existing problems with content being skipped and
elements (paragraphs) being divided improperly, but also provides a
clear domain model for reasoning about its behavior and reliably
adjusting it to suit our existing and future purposes.

The parser's operation is recursive, closely modeling the recursive
structure of HTML itself. Its behaviors are based on the HTML Standard
and reliably produce proper and explainable results even for novel
cases.

Fixes #2325 
Fixes #2562
Fixes #2675
Fixes #3168
Fixes #3227
Fixes #3228 
Fixes #3230 
Fixes #3237 
Fixes #3245 
Fixes #3247 
Fixes #3255
Fixes #3309 

### BEHAVIOR DIFFERENCES

#### `emphasized_text_tags` encoding is changed:
- `<strong>` is encoded as `"b"` rather than `"strong"`.
- `<em>` is encoded as `"i"` rather than `"em"`.
- `<span>` is no longer recorded in `emphasized_text_tags` (because
without the CSS we can't tell whether it's used for emphasis or if so
what kind).
- nested emphasis (e.g. bold+italic) is encoded as multiple characters
("bi").
- `emphasized_text_contents` is broken on emphasis-change boundaries,
like:
  ```html
   <p>foo <b>bar <i>baz</i> bada</b> bing</p>
  ```
  produces:
  ```json
  {
    "emphasized_text_contents": ["bar", "baz", "bada"],
    "emphasized_text_tags": ["b", "bi", "b"]
  }
  ```
   whereas previously it would have produced:
  ```json
  {
    "emphasized_text_contents": ["bar baz bada", "baz"],
    "emphasized_text_tags": ["b", "i"]
  }
  ```
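
The boundary-splitting behavior above can be reproduced with a small standard-library sketch. It uses `html.parser` rather than the `lxml` element classes the new parser is built on, and `EmphasisCollector` is a made-up name; ordering the combined tag characters alphabetically is an assumption.

```python
from html.parser import HTMLParser


class EmphasisCollector(HTMLParser):
    """Collect emphasized text runs, split wherever the emphasis changes."""

    _CODES = {"b": "b", "strong": "b", "i": "i", "em": "i"}

    def __init__(self) -> None:
        super().__init__()
        self._open: list[str] = []  # codes of currently open emphasis tags
        self.contents: list[str] = []
        self.tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        code = self._CODES.get(tag)
        if code:
            self._open.append(code)

    def handle_endtag(self, tag):
        code = self._CODES.get(tag)
        # Close the most recently opened matching tag, if any.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i] == code:
                del self._open[i]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self._open:
            self.contents.append(text)
            self.tags.append("".join(sorted(set(self._open))))


collector = EmphasisCollector()
collector.feed("<p>foo <b>bar <i>baz</i> bada</b> bing</p>")
collector.close()
```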

#### `<pre>` text is preserved as it appears in the html
Except that a leading newline is removed if present (has to be in
position 0 of text). Also, a trailing newline is stripped but only if it
appears in the very last position ([-1]) of the `<pre>` text. Old parser
stripped all leading and trailing whitespace.

Result is that:
```html
<pre>
foo
bar
baz
</pre>
```
parses to `"foo\nbar\nbaz"` which is the same result produced for:
```html
<pre>foo
bar
baz</pre>
```
This equivalence is the same behavior exhibited by a browser, which is
why we did the extra work to make it this way.
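
A minimal sketch of that trimming rule, with `pre_text` as a hypothetical helper name: at most one newline is removed from each end and everything else is kept.

```python
def pre_text(raw: str) -> str:
    """Trim one leading and one trailing newline; preserve all other whitespace."""
    if raw.startswith("\n"):
        raw = raw[1:]
    if raw.endswith("\n"):
        raw = raw[:-1]
    return raw


with_newlines = pre_text("\nfoo\nbar\nbaz\n")
without_newlines = pre_text("foo\nbar\nbaz")
```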

#### Whitespace normalization
Leading and trailing whitespace are removed from element text, just as
it is removed in the browser. Runs of whitespace within the element text
are reduced to a single space character (like in the browser). Note this
means that `\t`, `\n`, and `&nbsp;` are replaced with a regular space
character. All text derived from elements is whitespace normalized
except the text within a `<pre>` tag. Any leading or trailing newline is
trimmed from `<pre>` element text; all other whitespace is preserved
just as it appeared in the HTML source.
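
Outside `<pre>`, the normalization described here can be approximated in one line. This is a sketch (`normalize_whitespace` is a made-up name) that relies on `str.split()` treating tabs, newlines, and non-breaking spaces as whitespace.

```python
def normalize_whitespace(text: str) -> str:
    """Strip the ends and collapse each run of whitespace to one space."""
    return " ".join(text.split())


normalized = normalize_whitespace("  foo\t\nbar\xa0 baz ")
```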

#### `link_start_indexes` metadata is no longer captured. Rationale:
- It was frequently wrong, often `-1`.
- It was deprecated but then added back in a community PR.
- Maintaining it across any possible downstream transformations (e.g.
chunking) would be expensive and almost certainly lead to wrong values
as distant code evolves.
- It is complex to compute and recompute when whitespace is normalized,
adding substantial complexity to the code and reducing readability and
maintainability

#### `<br/>` element is replaced with a single newline (`"\n"`)
but that is usually replaced with a space in `Element.text` when it is
normalized. The newline is preserved within a `<pre>` element.
  - Related: _No paragraph-break on `<br/><br/>`_

#### Empty `h1..h6` elements are dropped.
HTML heading elements (`<h1..h6>`) are "skipped" (do not generate a
`Title` element) when they contain no text or contain only whitespace.

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-07-11 00:14:28 +00:00
Steve Canny
0c562d8050
rfctr(auto): fix auto-partition test xfails and skips (#3367)
**Summary**
Improve expression in auto-partition tests and fix xfails and skips. Add
issues for the two hard-fails where xfail needed to stay.
2024-07-10 05:29:07 +00:00
Steve Canny
00e1d5c05b
rfctr(html): refine HTML parser (#3351)
**Note**
This refines the new HTML parser but _does not install it_. This is why
no changes to ingest test expectations or other unit-tests are required
here. Installing the new parser will happen in the next PR #3218.

**Summary**
The initial version of the parser (purposely) raised on a block element
nested inside a phrasing element. While such nesting is not valid
according to the HTML Standard, it is accepted by the browser and does
happen in the wild.

The refinements here handle this situation similarly to how the browser
does, breaking phrasing at the block element boundaries and starting it
up again after the block element.

Unfortunately this adds complexity to the parser, but it makes the
parser robust against pretty much any HTML we're likely to encounter and
partitions it consistent with how it would be rendered in the browser.
2024-07-09 01:10:03 +00:00
Steve Canny
d48fa3b163
rfctr(auto): improve typing and organize auto tests (#3355)
**Summary**
In preparation for further work on auto-partitioning (`partition()`),
improve typing and organize `test_auto.py` by introducing categories.
2024-07-08 21:25:17 +00:00
Christine Straub
493bfccddd
fix: exception handling for OCRAgent.get_agent() (#3335)
The purpose of this PR is to help investigate
https://github.com/Unstructured-IO/unstructured/issues/3202.
2024-07-03 17:58:04 +00:00
John
0046f58a4f
revert unstructured-client pin and make pip-compile (#3298)
Change unstructured-client pin to setting minimum version instead of max
version and `make pip-compile`.

Integration tests that were dependent on the old version of the client
are removed. These tests should be replicated in/moved to the SDK
repo(s).
2024-07-02 16:42:03 +00:00
Steve Canny
087adb218f
feat(docx): differentiate no-file from not-ZIP (#3306)
**Summary**
The `python-docx` error `docx.opc.exceptions.PackageNotFoundError`
arises both when no file exists at the given path and when the file
exists but is not a ZIP archive (and so is not a DOCX file).

This ambiguity is unwelcome when diagnosing the error as the two
possible conditions generally indicate a different course of action to
resolve the error.

Add detailed validation to `DocxPartitionerOptions` to distinguish these
two and provide more precise exception messages.
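
The distinction can be sketched with two standard-library checks. `validate_docx_path` and the exception types chosen here are assumptions for illustration; the validation in `DocxPartitionerOptions` may raise different exceptions.

```python
import os
import zipfile


def validate_docx_path(path: str) -> None:
    """Raise a distinct error for a missing file versus a non-ZIP file."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"no such file: {path!r}")
    if not zipfile.is_zipfile(path):
        raise ValueError(f"not a ZIP archive (so not a DOCX file): {path!r}")
```

A caller that sees the first error checks the path; one that sees the second checks the file format.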

**Additional Context**
- `python-pptx` shares the same OPC-Package (file) loading code used by
`python-docx`, so the same ambiguity will be present in `python-pptx`.
- It would be preferable for this distinguished exception behavior to be
upstream in `python-docx` and `python-pptx`. If we're willing to take
the version bump it might be worth considering doing that instead.
2024-06-27 00:18:56 +00:00
Pawel Kmiecik
575957b2d2
feat: enhance analysis options with od model dump and better vis (#3234)
This PR adds new capabilities for drawing bboxes for each layout
(extracted, inferred, ocr and final) + OD model output dump as a json
file for better analysis.

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: Michal Martyniak <michal.martyniak@deepsense.ai>
2024-06-26 13:14:55 +00:00
Steve Canny
f2fee0c32f
fix(auto): partition() passes strategy to DOC,ODT (#3278)
**Summary**
Remedy gap where `strategy` argument passed to `partition()` was not
forwarded to `partition_doc()` or `partition_odt()` and so was not
making its way to `partition_docx()`.
2024-06-26 00:29:47 +00:00
Yao You
c32aeaac44
fix: wait to run soffice until there is no other soffice process running (#3287)
## Summary

This PR addresses an issue where the code could attempt to run `soffice`
in multiple processes at once, and closes #3284.
The fix is to add a wait mechanism when another `soffice` process is
already running.

## Diagnosis of issue

- `soffice` can only have one process running when using the command
`soffice` as is.
- on main branch the function `partition.common.convert_office_doc`
simply spawns a subprocess to run `soffice` command to convert a `doc`
or `ppt` file into `docx` or `pptx` format.
- if there are multiple partition calls to process `doc` or `ppt` files
and they all want to spawn `soffice` subprocesses only one will succeed
while other processes will simply fail and return 1 from the subprocess
- in downstream this will lead to errors like `PackageNotFoundError:
Package not found at '/tmp/tmpac6lcu4w/document.docx'`

## solution

While there are
[ways](https://www.reddit.com/r/libreoffice/comments/agk3os/how_to_open_more_than_one_calc_instance_under/)
to circumvent the limit of `soffice` by setting a tmp file as user
installation env, these kinds of solutions rely on the internals of
`soffice` and add maintenance cost to track its changes.

This PR solves this problem by adding a wait mechanism:
- we first spawn a subprocess to run `soffice`
- if the `stdout` is empty and we still have wait time budget left, the
function first checks if there is another `soffice` running
  * if yes, the function waits for 0.01s before checking again;
  * if no, the function spawns a subprocess to run `soffice` and returns
  to the beginning of this step
  * we need to return to the beginning to check if `stdout` is empty
  because we could have another collision right after `soffice` becomes
  available.
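
The loop described in these steps can be sketched as follows. The two callables are injected so the sketch runs without LibreOffice; `convert_with_wait`, its parameters, and the 10-second default budget are hypothetical and not taken from `convert_office_doc`.

```python
import time
from typing import Callable


def convert_with_wait(
    run_conversion: Callable[[], str],
    other_instance_running: Callable[[], bool],
    wait_budget: float = 10.0,
    poll_interval: float = 0.01,
) -> str:
    """Retry `run_conversion` until it reports output or the budget runs out."""
    deadline = time.monotonic() + wait_budget
    output = run_conversion()
    # Empty output is taken to mean the run collided with another instance.
    while output == "" and time.monotonic() < deadline:
        if other_instance_running():
            time.sleep(poll_interval)  # wait, then check again
        else:
            output = run_conversion()  # retry, then re-check the output
    return output
```

Re-checking the output after every retry is what covers a second collision right after the other process exits.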

## test

This PR adds two unit tests.
Additionally this can be tested by running partition of `.doc` files
locally with multiprocessing.
2024-06-25 18:49:27 +00:00
Yao You
edddf9f6ee
Feat/pass down strategy to partition ppt as well (#3274)
Following the same pattern of
https://github.com/Unstructured-IO/unstructured/pull/3273 and pass down
`strategy` parameter to `partition_ppt` as well.
2024-06-22 02:23:58 +00:00
Steve Canny
16df6944dd
fix(auto): partition() passes strategy to PPTX,DOCX (#3273)
**Summary**
Remedy gap where `strategy` argument passed to `partition()` was not
forwarded to `partition_pptx()` or `partition_docx()`.
2024-06-22 00:16:39 +00:00
Steve Canny
6fe1c9980e
rfctr(html): prepare for new html parser (#3257)
**Summary**
Extract as much mechanical refactoring from the HTML parser change-over
into this PR as possible. This leaves the next PR focused on installing
the new parser and the ingest-test impact.

**Reviewers:** Commits are well groomed and reviewing commit-by-commit
is probably easier.

**Additional Context**
This PR introduces the rewritten HTML parser. Its general design is
recursive, consistent with the recursive structure of HTML (tree of
elements). It also adds the unit tests for that parser but it does not
_install_ the parser. So the behavior of `partition_html()` is unchanged
by this PR. The next PR in this series will do that and handle the
ingest and other unit test changes required to reflect the dozen or so
bug-fixes the new parser provides.
2024-06-21 20:59:48 +00:00
Austin Walker
0b73978b92
fix: fix IndexError when partitioning a pdf with starting_page_number (#3246)
The Issue:

When extracting images from pdfs, we use the metadata page number to
index into a list of the images. However, the metadata page number can
now be changed via `starting_page_number`. To get the true page index,
we need to subtract this value.
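
The index arithmetic is simple enough to sketch. The function and list names below are hypothetical, and a default `starting_page_number` of 1 is assumed.

```python
def image_index_for_page(metadata_page_number: int, starting_page_number: int = 1) -> int:
    """Map a (possibly offset) metadata page number back to a zero-based list index."""
    return metadata_page_number - starting_page_number


page_images = ["page-a.png", "page-b.png"]  # one entry per physical page
second_image = page_images[image_index_for_page(21, starting_page_number=20)]
```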

Testing:

Run this snippet in a python shell. Before the fix, this throws an
IndexError. On this branch, it will return the elements.
```
from unstructured.partition.auto import partition
filename = "example-docs/layout-parser-paper-with-table.pdf"
partition(filename, strategy="hi_res", extract_image_block_types=["Image", "Table"], starting_page_number=20)
```

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-06-19 18:20:54 +00:00
Steve Canny
77a9e1b54d
rfctr(html): drop convert_and_partition_html() (#3215)
**Summary**
Remove `unstructured.partition.html.convert_and_partition_html()`. Move
file-type conversion (to HTML) responsibility to each brokering
partitioner that uses that strategy and let them call `partition_html()`
for themselves with the result.

**Additional Context**

Rationale:
- `partition_html()` does not want or need to know which partitioners
might broker partitioning to it.
- Different brokering partitioners have their own methods to convert
their format to HTML and quirks that may be involved for their format.
Avoid coupling them so they can evolve independently.
- The core of the conversion work is already encapsulated in
`unstructured.partition.common.convert_file_to_html_text_using_pandoc()`.
- `convert_and_partition_html()` represents an additional brokering
layer with the entailed complexities of an additional site for default
parameter values to be (mis-)applied and/or dropped and is an additional
location for new parameters to be added.
2024-06-17 19:43:18 +00:00
Steve Canny
9fae0111d9
rfctr(html): drop HTML-specific elements (#3207)
**Summary**
Remove HTML-specific element types and return "regular" elements like
`Title` and `NarrativeText` from `partition_html()`.

**Additional Context**
- An aspect of the legacy HTML partitioner was the use of HTML-specific
element types used to track metadata during partitioning.
- That role is no longer necessary or desirable.
- HTML-specific elements like `HTMLTitle` and `HTMLNarrativeText` were
returned from partitioning HTML but also the seven other file-formats
that broker partitioning to HTML (convert-to-HTML and partition_html()).
This does not cause immediate breakage because these are still `Text`
element subtypes, but it produces a confusing developer experience.
- Remove the prior metadata roles from HTML-specific elements and remove
those element types entirely.
2024-06-15 00:14:22 +00:00
Matt Robinson
08383a27de
build: pull from wolfi base image (#3213)
### Summary

Updates the `wolfi` image to pull from the upstream `wolfi-base` base
image to avoid maintaining the base layers in both locations. Closes
#3105 by pulling in the fix from upstream.

### Testing

`test_dockerfile` should continue to pass with the changes.
2024-06-14 20:41:27 +00:00
Christine Straub
9552fbbfbf
chore: bump unstructured-inference 0.7.35 (#3205)
### Summary
- bump unstructured-inference to `0.7.35` which fixed syntax for
generated HTML tables
- update unit tests and ingest test fixtures to reflect changes in the
generated HTML tables
- cut a release for `0.14.6`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-06-14 18:11:38 +00:00
Steve Canny
f5ebb209a4
rfctr(html): drop page concept (#3184)
**Summary**
Pagination of HTML documents is currently unused. The `Page` class and
concept were deeply embedded in the legacy organization of HTML
partitioning code due to the legacy `Document` (= pages of elements)
domain model. Remove this concept from the code such that elements are
available directly from the partitioner.

**Additional Context**
- Pagination can be re-added later if we decide we want it again. A
re-implementation would be much simpler and much lower impact to the
structure of the code and introduce much less additional complexity,
similar to the approach we take in `partition_docx()`.
2024-06-13 18:19:42 +00:00
Filip Knefel
c2065db716
fix API-297: List parameters incorrectly passed to API requests (#3154)
In two places, when parameters are passed to the Python client (via the
Ingest workflow or by calling `partition_via_api` directly), we convert
parameters with list values to strings, e.g.
```python
extract_image_block_types=["image"] -> extract_image_block_types='["image"]'
```
As of now, these parameters are parsed incorrectly when given as strings
and correctly when given as lists.

This PR removes parsing from `PartitionConfig` and `partition_via_api`.

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
2024-06-11 21:00:41 +00:00
Steve Canny
2f0400f279
rfctr(html): break coupling to DocumentLayout (#3180)
**Summary**
Remove use of `partition.common.document_to_element_list()` by
`HTMLDocument`. The transitive coupling with layout-inference through
this shared function has been the source of frustration and a drain on
engineering time and there's no compelling reason for the two to share
this code.

**Additional Context**
`partition_html()` uses `partition.common.document_to_element_list()` to
get finalized elements from `HTMLDocument` (pages). This gives rise to a
very nasty coupling between `DocumentLayout`, used by
`unstructured_inference`, and `HTMLDocument`.
`document_to_element_list()` has evolved to work for both callers, but
they share very few common characteristics with each other.

This coupling is bad news for us and also, importantly, for the
inference and page layout folks working on PDF and images.

Break that coupling so those inference-related functions can evolve
whatever way they need to without being dragged down by legacy
`HTMLDocument` connections.

The initial step is to extract a `document_to_element_list()` function
of our own, getting rid of the coordinates and other
`DocumentLayout`-related bits we don't need. As you'll see in the next
few PRs, all of this `document_to_element_list()` code will end up
either going away or being relocated closer to where it's used in
`HTMLDocument`.
2024-06-11 20:54:11 +00:00
Steve Canny
a883fc9df2
rfctr(html): improve SNR in HTMLDocument (#3162)
**Summary**
Remove dead code and organize helpers of HTMLDocument in preparation for
improvements and bug-fixes to follow
2024-06-06 21:21:33 +00:00
Steve Canny
f1cab248ce
rfctr(msg): remove temporary new_msg.py (#3157)
**Summary**
Remove temporary `new_msg.py` module.

**Additional Context**
The rewrite of `partition_msg()` was placed in a separate file
`new_msg.py` to avoid a messy diff for code-review. This PR makes that
`new_msg.py` the new `msg.py`.

No code changes were made in the process.
2024-06-06 08:31:56 +00:00
Steve Canny
ddbe90f6bb
rfctr(html): clean html tests in prep for PRs to follow (#3156)
**Summary**
Clean `test_unstructured/partition/test_html.py` in preparation for
broader refactor of HTML partitioner to follow. That refactor will
address a cluster of bugs.

Temporarily remove blank lines in tests so reordering tests in following
commit is easier to follow. Those will go back in after that.
2024-06-05 23:11:58 +00:00
Steve Canny
e4158deaff
fix(msg): use python-oxmsg for MSG email parsing (#3142)
**Summary**
`partition_msg()` previously used the `msg_parser` library for parsing
Outlook MSG email files (.msg files). The `msg_parser` library is
unmaintained and has several major shortcomings such as not being able
to parse MSG files with 8-bit encoded strings and not reliably
extracting attachments.

Use the new and permissively licenced `python-oxmsg` library instead.

**Additional Context**
For reviewability purposes, this PR temporarily places the new
`partition_msg()` implementation in `new_msg.py` and references that
implementation from `msg.py`. `new_msg.py` will be renamed to `msg.py`
in a closely following PR. This avoids a very messy interleaving of
hunks in a diff between the old and re-written `partition_msg()`
implementation.

Fixes #2481 
Fixes #3006
2024-06-05 21:12:27 +00:00
Steve Canny
f2e67539b1
rfctr: clean MSG partitioner and tests as prep (#3107)
**Summary**
Fix type errors and generally prepare `partition_msg()` and its tests
for refactoring to use `python-oxmsg` library instead of the problematic
`msg_parser` library for partitioning Outlook MSG files.
2024-05-29 21:36:05 +00:00
Christine Straub
f4457249a7
fix: partition_pdf() removes spaces from the text (#3106)
Closes #2896.

This PR aims to fix `partition_pdf()` to keep spaces in text. The
control character `\t` is now replaced with a space instead of being
removed when merging inferred and embedded elements.

### Testing
PDF:
[rok_20230930_1-1.pdf](https://github.com/Unstructured-IO/unstructured/files/15001636/rok_20230930_1-1.pdf)
```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="rok_20230930_1-1.pdf",
    strategy="hi_res",
)

print(str(elements[20]))
```
**Results:**
- PR
```
Name of each exchange on which registered New York Stock Exchange
```
- main branch
```
Nameofeachexchangeonwhichregistered NewYorkStockExchange
```
2024-05-29 04:53:17 +00:00
Christine Straub
35ec21ecd0
fix: decide table extraction (#3090)
This PR adds backward compatibility for the deprecated
`pdf_infer_table_structure` parameter. This was a missing part of turning
table extraction for PDFs and images off by default in
https://github.com/Unstructured-IO/unstructured/pull/3035; table
extraction had been turned on in
https://github.com/Unstructured-IO/unstructured/pull/2588.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-05-23 20:37:15 +00:00
Steve Canny
47d28612f7
feat(docx): add pluggable picture sub-partitioner (#3081)
**Summary**
Allow registration of a custom sub-partitioner that extracts images from
a DOCX paragraph.

**Additional Context**
- A custom image sub-partitioner must implement the
`PicturePartitionerT` interface defined in this PR. In essence, it must
have an `.iter_elements()` classmethod that takes the paragraph and
generates zero or more `Image` elements from it.
- The custom image sub-partitioner must be registered by passing the
class to `register_picture_partitioner()`.
- The default image sub-partitioner is `_NullPicturePartitioner` that
does nothing.
- The registered picture partitioner is called once for each paragraph.
2024-05-23 18:46:30 +00:00
Hubert Rutkowski
b8d894f963
feat/Move the category field to Element (#3056)
It's a pretty basic change: the category field was literally just moved
to the Element class. I can't think of other changes that are needed
here, because I think pretty much everything already expected the
category to be directly available on the elements in a list.

For local testing, IDEs and linters should see the difference in that
`category` is now in Element.
2024-05-23 10:43:26 +00:00
Steve Canny
b4ee019170
rfctr: flatten test_unstructured/partition (#3073)
**Summary**
Some partitioner test modules are placed in directories by themselves or
with one other test module. This unnecessarily obscures where to find
the test module corresponding to a partitioner.

Move partitioner test modules to mirror the directory structure of
`unstructured/partition`.
2024-05-23 00:51:08 +00:00
Steve Canny
30e5a0cd4e
rfctr(docx): organize docx tests (#3070)
**Summary**
In preparation for adding DOCX pluggable image extraction, organize a few
of the DOCX tests to be parallel to very similar tests for the DOC and
ODT partitioners.
2024-05-21 22:11:46 +00:00
Christine Straub
b0d8a779da
feat: partiton_pdf() set inferred elements text (#3061)
This PR adds the ability to fill the text of inferred elements from
embedded text (`pdfminer`) without depending on the
`unstructured-inference` library. This PR is the second part of moving
embedded-text-related code from `unstructured-inference` to
`unstructured` and works together with
https://github.com/Unstructured-IO/unstructured-inference/pull/349.
2024-05-21 19:43:38 +00:00
Matt Robinson
acda4d0707
fix: set skip_infer_tables explicitly in test_partition_via_api_with_no_strategy (#3057)
### Summary

A `partition_via_api` test that only runs on `main` was
[failing](https://github.com/Unstructured-IO/unstructured/actions/runs/9159429513/job/25181600959)
with the following output, likely due to the change in the default
behavior for `skip_infer_table_types`. This PR explicitly sets the
`skip_infer_table_types` param to avoid the failure.

```
=========================== short test summary info ============================
FAILED test_unstructured/partition/test_api.py::test_partition_via_api_with_no_strategy - AssertionError: assert 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' != 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®'
 +  where 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' = <unstructured.documents.elements.Text object at 0x7fb9069fc610>.text
 +  and   'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' = <unstructured.documents.elements.Text object at 0x7fb90648ad90>.text
= 1 failed, 2299 passed, 9 skipped, 2 deselected, 2 xfailed, 9 xpassed, 14 warnings in 1241.64s (0:20:41) =
make: *** [Makefile:302: test] Error 1
```

### Testing

After temporarily removing the "skip if not on `main`" `pytest` mark,
the [unit tests
pass](https://github.com/Unstructured-IO/unstructured/actions/runs/9163268381/job/25192040902?pr=3057O)
on the feature branch.
2024-05-20 19:05:13 -04:00
Christine Straub
76831f154b
refactor: partition_pdf() pass kwargs through fast strategy pipeline (#3040)
This PR passes `kwargs` through the `fast` strategy pipeline, which was
missed in the previous PR,
https://github.com/Unstructured-IO/unstructured/pull/3030.
It also includes some code refactoring, so I recommend reviewing
this PR commit by commit.

### Summary
- pass `kwargs` through `fast` strategy pipeline, which will allow users
to specify additional params like `sort_mode`
- refactor: code reorganization
- cut a release for `0.14.0`
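
A minimal sketch of what passing `kwargs` through a strategy pipeline can look like follows. The function names `partition` and `_partition_fast`, the returned dict, and the `"default"` and `"basic"` values for `sort_mode` are hypothetical placeholders, not the actual `unstructured` implementation.

```python
from typing import Any, Dict


def _partition_fast(filename: str, sort_mode: str = "default", **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical inner stage that consumes `sort_mode` and keeps the rest."""
    return {"filename": filename, "sort_mode": sort_mode, "extra": kwargs}


def partition(filename: str, strategy: str = "fast", **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical entry point that forwards unknown keyword arguments."""
    if strategy == "fast":
        # Forwarding **kwargs is what lets a caller-supplied `sort_mode`
        # reach the stage that actually uses it.
        return _partition_fast(filename, **kwargs)
    raise NotImplementedError(f"strategy {strategy!r} is not part of this sketch")


result = partition("example.pdf", strategy="fast", sort_mode="basic")
print(result["sort_mode"])  # basic
```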
### Testing
CI should pass
2024-05-17 20:55:11 +00:00