Christine Straub 08fafc564f
Fix: embedded text not getting merged with inferred elements (#2679)
This PR is the second part of the fix for "embedded text not getting merged
with inferred elements". The first part was done in
https://github.com/Unstructured-IO/unstructured-inference/pull/331.

### Summary
- replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()`
when removing pdfminer (embedded) elements that were merged with
inferred elements
- use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD`
introduced in the [first
part](https://github.com/Unstructured-IO/unstructured-inference/pull/331)
when removing pdfminer (embedded) elements that were merged with
inferred elements
- bump `unstructured-inference` to 0.7.25
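
To illustrate why a thresholded containment test behaves differently from strict containment, here is a minimal sketch. It does not use the library's `Rectangle` class: `Box`, `intersection_area`, and `threshold_from_env` are hypothetical names, the overlap-ratio rule is an assumption about how an "almost subregion" check can work, and the `0.99` fallback is an assumed default rather than a value taken from this PR.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box; a stand-in for the library's Rectangle."""

    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)

    def intersection_area(self, other: "Box") -> float:
        # Overlap width/height; negative values mean the boxes do not overlap.
        w = min(self.x2, other.x2) - max(self.x1, other.x1)
        h = min(self.y2, other.y2) - max(self.y1, other.y1)
        return max(0.0, w) * max(0.0, h)

    def is_in(self, other: "Box") -> bool:
        # Strict containment: every edge must lie inside `other`.
        return (
            self.x1 >= other.x1
            and self.y1 >= other.y1
            and self.x2 <= other.x2
            and self.y2 <= other.y2
        )

    def is_almost_subregion_of(self, other: "Box", threshold: float) -> bool:
        # Assumed rule: the share of this box covered by `other` exceeds `threshold`.
        if self.area == 0:
            return False
        return self.intersection_area(other) / self.area > threshold


def threshold_from_env(default: float = 0.99) -> float:
    # The variable name comes from the PR; the default here is an assumption.
    return float(
        os.environ.get("EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD", default)
    )


inferred = Box(0, 0, 100, 50)
embedded = Box(10, 10, 100.5, 40)  # pokes half a unit past the right edge

print(embedded.is_in(inferred))                        # strict test rejects it
print(embedded.is_almost_subregion_of(inferred, 0.9))  # thresholded test accepts it
```

Under these assumptions, an embedded element whose box slightly overshoots the inferred element's box is still treated as merged and removed, where the strict test would have left it behind as a duplicate.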

### Testing
PDF:
[pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf)

```
$ pip uninstall unstructured-inference -y
$ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference
$ pip install -e .
```

```
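from unstructured.partition.pdf import partition_pdf

# Partition with the hi_res strategy so layout inference and table structure run.
elements = partition_pdf(
    filename="pwc-financial-statements-p114.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_image_block_types=["Image"],
)

# Print the first detected table's text to check that the embedded text was merged.
table_elements = [el for el in elements if el.category == "Table"]
print(table_elements[0].text)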
```

---------

Co-authored-by: Antonio Jose Jimeno Yepes <antonio.jimeno@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-03-23 03:59:23 +00:00


| filename | doctype | connector | cct-accuracy | cct-%missing |
| --- | --- | --- | --- | --- |
| fake-text.txt | txt | Sharepoint | 1.0 | 0.0 |
| ideas-page.html | html | Sharepoint | 0.93 | 0.033 |
| stanley-cups.xlsx | xlsx | Sharepoint | 0.778 | 0.0 |
| Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf | pdf | azure | 0.981 | 0.005 |
| IRS-form-1987.pdf | pdf | azure | 0.794 | 0.135 |
| spring-weather.html | html | azure | 0.0 | 0.018 |
| example-10k.html | html | local | 0.727 | 0.037 |
| fake-html-cp1252.html | html | local | 0.659 | 0.0 |
| ideas-page.html | html | local | 0.93 | 0.033 |
| UDHR_first_article_all.txt | txt | local-single-file | 0.995 | 0.0 |
| handbook-1p.docx | docx | local-single-file-basic-chunking | 0.858 | 0.029 |
| fake-html-cp1252.html | html | local-single-file-with-encoding | 0.659 | 0.0 |
| layout-parser-paper-with-table.jpg | jpg | local-single-file-with-pdf-infer-table-structure | 0.716 | 0.032 |
| layout-parser-paper.pdf | pdf | local-single-file-with-pdf-infer-table-structure | 0.95 | 0.029 |
| 2023-Jan-economic-outlook.pdf | pdf | s3 | 0.84 | 0.044 |
| page-with-formula.pdf | pdf | s3 | 0.971 | 0.021 |
| recalibrating-risk-report.pdf | pdf | s3 | 0.968 | 0.008 |