mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-27 10:56:47 +00:00

Currently, CCT eval takes a long time for any of the test_metrics CI runs. Documents in an eval set are evaluated sequentially, and it appears that at most 1 CPU core is utilized. This implies there could be a large speedup from running eval across multiple docs concurrently (probably with multiprocessing).

Things done in this PR:

- [x] Use `concurrent.futures.ProcessPoolExecutor` instead of a sequential for-loop.
- [x] Refactor and reorganize redundant pieces of code without changing the inner logic too much. Without this we'd have 3 places where documents are being processed. Take a look at the `BaseMetricsCalculator` class and the classes that inherit from it.
- [x] Rework string path manipulation so that it relies on `pathlib.Path()`.
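
For reviewers who want the shape of the change without reading the diff, here is a minimal sketch of swapping a sequential loop for a process pool. It is not the code in this PR: `score_document`, `evaluate_all`, and the `executor_cls` parameter are hypothetical names, and the "metric" is a placeholder standing in for the real CCT scoring.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def score_document(doc_path: Path) -> tuple[str, int]:
    """Hypothetical per-document metric (placeholder for the real CCT scoring)."""
    # Must be a module-level function so worker processes can pickle it.
    # Placeholder work: return the stem and the length of the file name.
    return doc_path.stem, len(doc_path.name)


def evaluate_all(doc_paths, max_workers=None, executor_cls=ProcessPoolExecutor):
    """Score every document, spreading the work over a pool of workers.

    executor_cls is an assumption of this sketch: it lets a caller swap in
    another concurrent.futures executor, for example in unit tests.
    """
    # Normalize inputs with pathlib instead of manipulating path strings.
    paths = [Path(p) for p in doc_paths]
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(score_document, paths))
```

When calling `evaluate_all([...])` with the default process pool from a script, the call should sit under an `if __name__ == "__main__":` guard, since start methods that re-import the main module would otherwise re-run it in every worker. With `max_workers=None`, `ProcessPoolExecutor` sizes the pool from the number of available processors.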