mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-06-27 02:30:08 +00:00

Page breaks can, and often do, occur within a paragraph. Currently the full text of such a paragraph is attributed to the page on which the paragraph starts. Improve page-break fidelity so that a paragraph containing a page break is split into two elements: one containing the text before the page break and the other the text after it. Emit the `PageBreak` element between the two and assign the correct page number (n and n+1 respectively) to the two text elements.

This functionality is largely provided upstream by the new `python-docx` v1.0.0 release (the version jumps from 0.8.11 to 1.0.0 because it drops Python 2 support). That version also makes obsolete the "include hyperlink text in `Paragraph.text`" monkey-patch that we had maintained until now. Remove that monkey-patch.
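The intended splitting behavior can be illustrated with a minimal, library-independent sketch. It is not the actual implementation in this change and does not use the `python-docx` API. The `Element` class, the `split_on_page_breaks` function, and the use of a form-feed character as a stand-in for a page break inside paragraph text are all assumptions made for illustration, as is leaving the page number of the `PageBreak` element unset.

```python
from dataclasses import dataclass
from typing import Iterator, Optional

# Hypothetical stand-in for a page break occurring inside paragraph text.
PAGE_BREAK = "\f"


@dataclass
class Element:
    """Simplified, hypothetical model of an emitted document element."""

    kind: str  # "Text" or "PageBreak"
    text: str
    page_number: Optional[int]  # left unset for PageBreak in this sketch


def split_on_page_breaks(paragraph_text: str, start_page: int) -> Iterator[Element]:
    """Yield text fragments and PageBreak elements with per-fragment page numbers."""
    page = start_page
    for i, fragment in enumerate(paragraph_text.split(PAGE_BREAK)):
        if i > 0:
            # A break separates this fragment from the previous one.
            yield Element("PageBreak", "", None)
            page += 1
        text = fragment.strip()
        if text:  # skip empty fragments, e.g. a break at the very start
            yield Element("Text", text, page)


elements = list(
    split_on_page_breaks("Text before the break.\fText after the break.", start_page=1)
)
for element in elements:
    print(element)
```

With a paragraph that starts on page 1 and contains one break, the sketch yields a text element on page 1, a `PageBreak`, and a text element on page 2, which matches the n and n+1 assignment described above.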