mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-19 15:06:21 +00:00

Closes #1875. ### Summary - add functionality to do a second OCR on cropped table images - use `IMAGE_CROP_PAD` env for `individual_blocks` mode ### Testing The test function [`test_partition_pdf_hi_res_ocr_mode_with_table_extraction()`](https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured/partition/pdf_image/test_pdf.py#L425) in `test_pdf.py` should pass. ### NOTE: I've tried to experiment with values for scaling ENVs on the following PRs but found that changes to the values for scaling ENVs affect the entire page OCR output(OCR regression) so switched to doing a second OCR for tables. - https://github.com/Unstructured-IO/unstructured/pull/1998/files - https://github.com/Unstructured-IO/unstructured/pull/2004/files - https://github.com/Unstructured-IO/unstructured/pull/2016/files - https://github.com/Unstructured-IO/unstructured/pull/2029/files --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>