mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-04 23:52:23 +00:00

This PR addresses [CORE-2969](https://unstructured-ai.atlassian.net/browse/CORE-2969) - pdfminer sometimes fail to decode text in an pdf file and returns cid codes as text - now those text will be considered invalid and be replaced with ocr results in `hi_res` mode ## test This PR adds unit test for the utility functions. In addition the file below would return elements with text in cid code on main but proper ascii text with this PR: [005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf](https://github.com/Unstructured-IO/unstructured/files/13662984/005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf) This change improves both cct accuracy and %missing scores: **before:** ``` metric average sample_sd population_sd count -------------------------------------------------- cct-accuracy 0.681 0.267 0.266 105 cct-%missing 0.086 0.159 0.159 105 ``` **after:** ``` metric average sample_sd population_sd count -------------------------------------------------- cct-accuracy 0.697 0.251 0.250 105 cct-%missing 0.071 0.123 0.122 105 ``` [CORE-2969]: https://unstructured-ai.atlassian.net/browse/CORE-2969?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com> Co-authored-by: christinestraub <christinemstraub@gmail.com>