Christine Straub 08fafc564f
Fix: embedded text not getting merged with inferred elements (#2679)
This PR is the second part of the fix for "embedded text not getting merged
with inferred elements". The first part was done in
https://github.com/Unstructured-IO/unstructured-inference/pull/331.

### Summary
- replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()`
when removing pdfminer (embedded) elements that were merged with
inferred elements
- use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD`
introduced in the [first
part](https://github.com/Unstructured-IO/unstructured-inference/pull/331)
when removing pdfminer (embedded) elements that were merged with
inferred elements
- bump `unstructured-inference` to 0.7.25
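
To illustrate the difference between the two checks named above, here is a minimal sketch. It assumes the "almost subregion" test compares the overlap area against the embedded element's own area, and that the threshold is read from an environment variable of the same name as the config entry. The `Box` class, the `subregion_threshold` helper, and the `0.99` fallback are hypothetical and are not taken from `unstructured-inference`; the library's actual implementation may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned rectangle; NOT the library's Rectangle class."""

    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)

    def intersection_area(self, other: "Box") -> float:
        width = min(self.x2, other.x2) - max(self.x1, other.x1)
        height = min(self.y2, other.y2) - max(self.y1, other.y1)
        return max(0.0, width) * max(0.0, height)

    def is_in(self, other: "Box") -> bool:
        # Strict containment: every edge must lie inside `other`.
        return (
            self.x1 >= other.x1
            and self.y1 >= other.y1
            and self.x2 <= other.x2
            and self.y2 <= other.y2
        )

    def is_almost_subregion_of(self, other: "Box", threshold: float) -> bool:
        # Assumed rule: the share of this box covered by `other` exceeds the threshold.
        if self.area == 0:
            return False
        return self.intersection_area(other) / self.area > threshold


def subregion_threshold(default: float = 0.99) -> float:
    # Hypothetical helper; the fallback value is a placeholder, not the library default.
    raw = os.environ.get("EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD")
    return float(raw) if raw is not None else default


inferred = Box(0, 0, 100, 100)
embedded = Box(-0.5, 10, 50, 20)  # sticks out of the inferred box by half a unit

print(embedded.is_in(inferred))
print(embedded.is_almost_subregion_of(inferred, threshold=0.9))
```

Under these assumptions, an embedded element that overshoots the inferred region by a small margin fails the strict check but passes the tolerant one, which is the situation in which a merged pdfminer element would previously have been left behind.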

### Testing
PDF:
[pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf)

```
$ pip uninstall unstructured-inference -y
$ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference
$ pip install -e .
```

```
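from unstructured.partition.pdf import partition_pdf

# Partition the test PDF with the hi_res strategy and table structure inference.
elements = partition_pdf(
    filename="pwc-financial-statements-p114.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_image_block_types=["Image"],
)

# Print the text of the first table to check that embedded text was merged.
table_elements = [el for el in elements if el.category == "Table"]
print(table_elements[0].text)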
```

---------

Co-authored-by: Antonio Jose Jimeno Yepes <antonio.jimeno@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-03-23 03:59:23 +00:00


| filename | doctype | connector | cct-accuracy | cct-%missing |
|---|---|---|---|---|
| fake-text.txt | txt | Sharepoint | 1.0 | 0.0 |
| ideas-page.html | html | Sharepoint | 0.93 | 0.033 |
| stanley-cups.xlsx | xlsx | Sharepoint | 0.778 | 0.0 |
| Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf | pdf | azure | 0.981 | 0.005 |
| IRS-form-1987.pdf | pdf | azure | 0.794 | 0.135 |
| spring-weather.html | html | azure | 0.0 | 0.018 |
| example-10k.html | html | local | 0.727 | 0.037 |
| fake-html-cp1252.html | html | local | 0.659 | 0.0 |
| ideas-page.html | html | local | 0.93 | 0.033 |
| UDHR_first_article_all.txt | txt | local-single-file | 0.995 | 0.0 |
| handbook-1p.docx | docx | local-single-file-basic-chunking | 0.858 | 0.029 |
| fake-html-cp1252.html | html | local-single-file-with-encoding | 0.659 | 0.0 |
| layout-parser-paper-with-table.jpg | jpg | local-single-file-with-pdf-infer-table-structure | 0.716 | 0.032 |
| layout-parser-paper.pdf | pdf | local-single-file-with-pdf-infer-table-structure | 0.95 | 0.029 |
| 2023-Jan-economic-outlook.pdf | pdf | s3 | 0.84 | 0.044 |
| page-with-formula.pdf | pdf | s3 | 0.971 | 0.021 |
| recalibrating-risk-report.pdf | pdf | s3 | 0.968 | 0.008 |