3 Commits

Author SHA1 Message Date
Christine Straub
08fafc564f
Fix: embedded text not getting merged with inferred elements (#2679)
This PR is the second part of fixing "embedded text not getting merged
with inferred elements", the first part is done in
https://github.com/Unstructured-IO/unstructured-inference/pull/331.

### Summary
- replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()`
when removing pdfminer (embedded) elements that were merged with
inferred elements
- use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD`
introduced in the [first
part](https://github.com/Unstructured-IO/unstructured-inference/pull/331)
when removing pdfminer (embedded) elements that were merged with
inferred elements
- bump `unstructured-inference` to 0.7.25

### Testing
PDF:
[pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf)

```
$ pip uninstall unstructured-inference -y
$ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference
$ pip install -e .
```

```
elements = partition_pdf(
    filename="pwc-financial-statements-p114.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_image_block_types=["Image"],
)

table_elements = [el for el in elements if el.category == "Table"]
print(table_elements[0].text)
```

---------

Co-authored-by: Antonio Jose Jimeno Yepes <antonio.jimeno@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-03-23 03:59:23 +00:00
Christine Straub
69d0ee1aea
Refactor: support merging extracted layout with inferred layout (#2158)
### Summary
This PR is the second part of `pdfminer` refactor to move it from
`unstructured-inference` repo to `unstructured` repo, the first part is
done in
https://github.com/Unstructured-IO/unstructured-inference/pull/294. This
PR adds logic to merge the extracted layout with the inferred layout.

The updated workflow for the `hi_res` strategy:
* pass the document (as data/filename) to the `inference` repo to get
`inferred_layout` (DocumentLayout)
* pass the `inferred_layout` returned from the `inference` repo and the
document (as data/filename) to the `pdfminer_processing` module, which
first opens the document (create temp file/dir as needed), and splits
the document by pages
* if is_image is `True`, return the passed
inferred_layout(DocumentLayout)
  * if is_image is `False`:
* get extracted_layout (TextRegions) from the passed
document(data/filename) by pdfminer
* merge `extracted_layout` (TextRegions) with the passed
`inferred_layout` (DocumentLayout)
* return the `inferred_layout `(DocumentLayout) with updated elements
(all merged LayoutElements) as merged_layout (DocumentLayout)
* pass merged_layout and the document (as data/filename) to the `OCR`
module, which first opens the document (create temp file/dir as needed),
and splits the document by pages (convert PDF pages to image pages for
PDF file)

### Note
This PR also fixes issue #2164 by using functionality similar to the one
implemented in the `fast` strategy workflow when extracting elements by
`pdfminer`.

### TODO
* image extraction refactor to move it from `unstructured-inference`
repo to `unstructured` repo
* improving natural reading order by applying the current default
`xycut` sorting to the elements extracted by `pdfminer`
2023-12-01 20:56:31 +00:00
Benjamin Torres
05c3cd1be2
feat: clean pdfminer elements inside tables (#1808)
This PR introduces `clean_pdfminer_inner_elements` , which deletes
pdfminer elements inside other detection origins such as YoloX or
detectron.
This function returns the clean document.

Also, the ingest-test fixtures were updated to reflect the new standard
output.

The best way to check that this function is working properly is check
the new test `test_clean_pdfminer_inner_elements` in
`test_unstructured/partition/utils/test_processing_elements.py`

---------

Co-authored-by: Roman Isecke <roman@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2023-10-30 07:10:51 +00:00