unstructured/test_unstructured
Pluto 8685905bd1
Character confidence threshold (#3860)
This change adds the ability to filter out characters predicted by
Tesseract with low confidence scores.

Some notes:
- I intentionally disabled it by default, though I think a threshold of
around 0.9-0.95 could be a safe choice for Tesseract
- I wanted to use character bboxes and combine them into a word bbox
later. However, in some specific scenarios a bug in Tesseract returns
incorrect character bboxes (unit tests caught it 🥳). See the comment in
the code for more details
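
For illustration only, the filtering idea described above can be sketched as follows. This is not the code from this change: `CharPrediction`, `filter_low_confidence`, and the `threshold` parameter are hypothetical names, and the sketch assumes confidences are already normalized to the range 0 to 1, with a threshold of 0.0 meaning the filter is disabled.

```python
from dataclasses import dataclass


@dataclass
class CharPrediction:
    """Hypothetical container for one OCR character and its confidence."""
    text: str
    confidence: float  # assumed to be normalized to [0, 1]


def filter_low_confidence(chars, threshold=0.0):
    """Join the characters whose confidence is at least `threshold`.

    With the default threshold of 0.0 every character passes,
    which mirrors a "disabled by default" setting.
    """
    return "".join(c.text for c in chars if c.confidence >= threshold)


# Made-up predictions for a single word, with one noisy character.
word = [
    CharPrediction("c", 0.99),
    CharPrediction("a", 0.97),
    CharPrediction("~", 0.41),
    CharPrediction("t", 0.98),
]

unfiltered = filter_low_confidence(word)      # filter disabled
filtered = filter_low_confidence(word, 0.95)  # drops the low-confidence "~"
print(unfiltered, filtered)
```

A per-character filter like this leaves the word bbox untouched, which fits the note above about not relying on character bboxes.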
2025-01-13 13:12:46 +00:00