unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-08-07 00:10:05 +00:00

Author	SHA1	Message	Date
Steve Canny	f2e67539b1	rfctr: clean MSG partitioner and tests as prep (#3107). Summary: Fix type errors and generally prepare `partition_msg()` and its tests for refactoring to use the `python-oxmsg` library instead of the problematic `msg_parser` library for partitioning Outlook MSG files.	2024-05-29 21:36:05 +00:00
Steve Canny	4dc8327149	rfctr(pptx): make PptxPartitionerOptions public (#2901). Summary: A few additional small, mechanical odds and ends required for PPTX image extraction. The big one is removing the leading underscore from `PptxPartitionerOptions` because client code that implements a custom Picture-shape sub-partitioner will now need to reference this class.	2024-04-19 04:50:06 +00:00
Steve Canny	3e643c4cb3	feat(pptx): add pluggable PPTX Picture sub-partitioner (#2880). Summary: Delegate partitioning of PPTX Picture (image, to a first approximation) shapes to a distinct sub-partitioner and allow the default picture sub-partitioner to be replaced at run-time by one of the user's choosing.	2024-04-12 06:00:01 +00:00
Steve Canny	2c7e0289aa	rfctr(pptx): extract _PptxPartitionerOptions (#2853). Reviewers: Likely quicker to review commit-by-commit. Summary: In preparation for adding a PPTX `Picture` shape _sub-partitioner_, extract management of PPTX partitioning-run options to a separate `_PptxPartitionerOptions` object similar to those used in chunking and XLSX partitioning. This provides several benefits: (1) Extract code dealing with applying defaults and computing derived values from the main partitioning code, leaving it less cluttered and focused on the partitioning algorithm itself. (2) Allow the options set to be passed to helper objects, prominently including sub-partitioners, without requiring a long list of parameters or requiring the caller to couple itself to the particular option values the helper object requires. (3) Allow options behaviors to be thoroughly and efficiently tested in isolation.	2024-04-08 19:01:03 +00:00
Steve Canny	b59e4b69ce	rfctr: prepare for fix to raises on file-like-object with name not a path to a file (#2617). Summary: Improve typing and other mechanical refactoring in preparation for fix to issue 2308.	2024-03-06 23:46:54 +00:00
qued	007fc45739	chore: new black changes (#2473). Update `black` and apply changes to affected files. I separated this PR so we can have a look at the changes and decide whether we want to: (1) go forward with the new formatting; (2) change the black config to make the old formatting valid; (3) get rid of black entirely and just use `ruff`; (4) do something I haven't thought of.	2024-01-30 17:12:35 +00:00
Newel H	e34396b2c9	Feat: Native hierarchies for elements from pptx documents (#1616). Summary: Improve title detection in pptx documents: the default title textboxes on a pptx slide are now categorized as titles. Improve hierarchy detection in pptx documents: list items and other slide text are properly nested under the slide title. This will enable better chunking of pptx documents. Hierarchy detection is improved by determining category depth via the following: (1) Check whether the paragraph item has a level parameter via the python-pptx paragraph. If so, use the paragraph level as the category_depth level. (2) If the shape being checked is a title shape and the item is not a bullet or email, the element will be set as a Title with a depth corresponding to the enumerated paragraph increment (e.g. the 1st line of a title shape is depth 0, the second is depth 1, etc.). (3) If the shape is not a title shape but the paragraph is a title, the depth will be the level + 1, so that all paragraph titles have a depth of at least 1, placing them below the slide title element.	2023-10-05 12:55:45 -04:00
Steve Canny	ab29de8dbd	Rfctr: Refactor PPTX partitioning to more closely align with how pptx documents are structured. This refactor solves a problem or two, the big one being recursing into group-shapes to get all shapes on the slide, but mostly lays the groundwork to allow us to refine further aspects such as list-item detection, off-slide shape detection, and image-capture going forward.	2023-09-26 15:43:55 -04:00
Steve Canny	b54994ae95	rfctr: docx partitioning (#1422). Reviewers: I recommend reviewing commit-by-commit or just looking at the final version of `partition/docx.py` as View File. This refactor solves a few problems but mostly lays the groundwork to allow us to refine further aspects such as page-break detection, list-item detection, and moving python-docx internals upstream to that library so our work doesn't depend on that domain-knowledge.	2023-09-19 15:32:46 -07:00

9 Commits
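
The category-depth rules listed in commit e34396b2c9 can be read as a small decision function. The following is only a sketch of that reading, not the code from the commit: the function name, its parameters, and the order in which the rules are checked are all assumptions made for illustration, and the caller is assumed to have already worked out the shape and paragraph properties.

```python
from typing import Optional


def sketch_category_depth(
    level: Optional[int],
    in_title_shape: bool,
    is_title_text: bool,
    is_bullet_or_email: bool,
    paragraph_index: int,
) -> Optional[int]:
    """Hypothetical helper mirroring the three rules in the commit message.

    level: the paragraph level, or None when the paragraph has none.
    in_title_shape: True when the paragraph sits in the slide's title shape.
    is_title_text: True when the paragraph text itself is classed as a title.
    is_bullet_or_email: True when the item is a bullet or an email address.
    paragraph_index: zero-based position of the paragraph within its shape.
    """
    # Title-shape rule: depth follows the line's position in the shape,
    # so the first line is depth 0, the second is depth 1, and so on.
    if in_title_shape and not is_bullet_or_email:
        return paragraph_index

    # Paragraph titles outside the title shape sit at least one step
    # below the slide title. Treating a missing level as 0 is an assumption.
    if not in_title_shape and is_title_text:
        return (level or 0) + 1

    # Otherwise fall back to the paragraph level when there is one.
    return level
```

Under these assumptions, a second line in a title shape gets depth 1, a paragraph title at level 2 in an ordinary text box gets depth 3, and a plain bullet keeps its own level.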