mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-28 03:20:57 +00:00

Closes #2896.

This PR aims to fix `partition_pdf()` so that it keeps spaces in text. When inferred and embedded elements are merged, the control character `\t` is now replaced with a space instead of being removed.

### Testing

PDF: [rok_20230930_1-1.pdf](https://github.com/Unstructured-IO/unstructured/files/15001636/rok_20230930_1-1.pdf)

```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="rok_20230930_1-1.pdf",
    strategy="hi_res",
)
print(str(elements[20]))
```

**Results:**

- PR
```
Name of each exchange on which registered New York Stock Exchange
```
- main branch
```
Nameofeachexchangeonwhichregistered NewYorkStockExchange
```
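
For reviewers who want to see the effect of the tab handling in isolation, here is a minimal standalone sketch. It is not the code changed in this PR: the function names `strip_control_characters` and `replace_tabs_then_strip` are hypothetical, and the sample string is made up. It only assumes that the embedded PDF text uses `\t` as a word separator, as the results above suggest.

```python
import unicodedata


def strip_control_characters(text: str) -> str:
    """Sketch of the old behavior: drop every character in a Unicode "C" category.

    Tabs are category "Cc", so they vanish along with other control characters.
    """
    return "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")


def replace_tabs_then_strip(text: str) -> str:
    """Sketch of the new behavior: turn tabs into spaces, then drop the rest."""
    return strip_control_characters(text.replace("\t", " "))


# Hypothetical sample: tab-separated words plus a stray NUL byte.
sample = "New\tYork\tStock\tExchange\x00"

print(strip_control_characters(sample))  # NewYorkStockExchange
print(replace_tabs_then_strip(sample))   # New York Stock Exchange
```

Replacing the tab before stripping keeps word boundaries intact while still discarding control characters that carry no spacing information.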