unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-12-19 03:10:21 +00:00

Christine Straub 3fe480799a

Fix: missing characters at the beginning of sentences on table ingest output after table OCR refactor (#1961)

Closes #1875.

### Summary
- add functionality to do a second OCR on cropped table images
- use `IMAGE_CROP_PAD` env for `individual_blocks` mode
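
A minimal sketch of the padded-crop idea behind the two points above, not the code from this PR. It assumes `IMAGE_CROP_PAD` holds an integer number of pixels and that a missing or invalid value means no padding; `read_crop_pad` and `padded_crop_box` are hypothetical helper names used only for illustration. A likely motivation is that a crop cut tightly around a table can clip the leading glyphs of a line, so the box is widened before the second OCR pass.

```python
import os
from typing import Tuple

# (left, top, right, bottom) in pixels
Box = Tuple[int, int, int, int]


def read_crop_pad(env_name: str = "IMAGE_CROP_PAD", default: int = 0) -> int:
    """Read a non-negative pixel padding from the environment.

    Hypothetical helper: the default of 0 and the integer format are
    assumptions, not taken from the library.
    """
    raw = os.environ.get(env_name, "")
    try:
        return max(0, int(raw))
    except ValueError:
        # Unset or non-numeric value: fall back to the assumed default.
        return default


def padded_crop_box(bbox: Box, image_size: Tuple[int, int], pad: int) -> Box:
    """Grow bbox by pad pixels on every side, clamped to the image bounds."""
    left, top, right, bottom = bbox
    width, height = image_size
    return (
        max(0, left - pad),
        max(0, top - pad),
        min(width, right + pad),
        min(height, bottom + pad),
    )
```

With Pillow, the resulting box could be passed to `Image.crop` to produce the cropped table image for the second OCR pass.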
### Testing
The test function
[`test_partition_pdf_hi_res_ocr_mode_with_table_extraction()`](https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured/partition/pdf_image/test_pdf.py#L425)
in `test_pdf.py` should pass.

### NOTE:
I experimented with different values for the scaling ENVs in the
following PRs, but found that changing them affects the OCR output for
the entire page (an OCR regression), so I switched to running a second
OCR pass on tables instead.
- https://github.com/Unstructured-IO/unstructured/pull/1998/files 
- https://github.com/Unstructured-IO/unstructured/pull/2004/files
- https://github.com/Unstructured-IO/unstructured/pull/2016/files
- https://github.com/Unstructured-IO/unstructured/pull/2029/files

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>

2023-11-09 18:29:55 +00:00

layout-parser-paper-with-table.jpg.json

Fix: missing characters at the beginning of sentences on table ingest output after table OCR refactor (#1961)

2023-11-09 18:29:55 +00:00

layout-parser-paper.pdf.json

Fix: missing characters at the beginning of sentences on table ingest output after table OCR refactor (#1961)

2023-11-09 18:29:55 +00:00