mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-06-27 02:30:08 +00:00

## Summary

**Improve title detection in pptx documents**

The default title textboxes on a pptx slide are now categorized as titles.

**Improve hierarchy detection in pptx documents**

List items and other slide text are now properly nested under the slide title. This will enable better chunking of pptx documents.

Hierarchy detection is improved by determining category depth as follows:

- Check whether the paragraph item has a level parameter via the python-pptx paragraph. If so, use the paragraph level as the `category_depth` level.
- If the shape being checked is a title shape and the item is not a bullet or email, the element is set as a Title with a depth corresponding to the enumerated paragraph increment (e.g. the first line of the title shape is depth 0, the second is depth 1, etc.).
- If the shape is not a title shape but the paragraph is a title, the depth is set to the level + 1, so that all paragraph titles have a depth of at least 1 and sit below the slide title element.
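
The depth rules described above can be illustrated with a minimal sketch. This is not the code from this change: the function name `resolve_category`, its parameters, and the category label strings are hypothetical stand-ins chosen for illustration, and the sketch works on plain values so it runs without python-pptx installed. In python-pptx, the paragraph level would come from `paragraph.level`.

```python
from typing import Optional, Tuple


def resolve_category(
    detected: str,            # hypothetical label, e.g. "Title", "ListItem", "EmailAddress"
    level: Optional[int],     # paragraph level (python-pptx: paragraph.level), None if absent
    index_in_shape: int,      # 0-based position of the paragraph within its shape
    in_title_shape: bool,     # True if the paragraph belongs to the slide's title shape
) -> Tuple[str, int]:
    """Return (category, category_depth) following the three rules in the summary."""
    # Rule 1: default depth is the paragraph level when one is available.
    base_depth = level if level is not None else 0

    # Rule 2: text in a title shape becomes a Title unless it is a bullet or email;
    # its depth is the paragraph's position within the title shape.
    if in_title_shape and detected not in ("ListItem", "EmailAddress"):
        return "Title", index_in_shape

    # Rule 3: a title outside the title shape is pushed one level down,
    # so it always sits below the slide title (depth >= 1).
    if detected == "Title":
        return "Title", base_depth + 1

    return detected, base_depth


# A slide title, a heading inside a body shape, and a nested bullet.
print(resolve_category("NarrativeText", 0, 0, True))
print(resolve_category("Title", 0, 0, False))
print(resolve_category("ListItem", 2, 3, False))
```

Under these assumptions, the slide title lands at depth 0, body-shape headings start at depth 1, and bullets keep their own paragraph level, which is what lets a chunker group slide text under its title.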