9 Commits

Author SHA1 Message Date
Steve Canny
f2e67539b1
rfctr: clean MSG partitioner and tests as prep (#3107)
**Summary**
Fix type errors and generally prepare `partition_msg()` and its tests
for refactoring to use `python-oxmsg` library instead of the problematic
`msg_parser` library for partitioning Outlook MSG files.
2024-05-29 21:36:05 +00:00
Steve Canny
4dc8327149
rfctr(pptx): make PptxPartitionerOptions public (#2901)
**Summary**
A few additional small, mechanical odds and ends required for PPTX image
extraction.

The big one is removing the leading underscore from
`PptxPartitionerOptions` because now client code that implements a
custom Picture-shape sub-partitioner will need to reference this class.
2024-04-19 04:50:06 +00:00
Steve Canny
3e643c4cb3
feat(pptx): add pluggable PPTX Picture sub-partitioner (#2880)
**Summary**
Delegate partitioning of PPTX Picture (image, to a first approximation)
shapes to a distinct sub-partitioner and allow the default picture
sub-partitioner to be replaced at run-time by one of the user's
choosing.
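
For illustration, the delegation described above could look roughly like the sketch below. Every name here (`PicturePartitioner`, `NullPicturePartitioner`, `AltTextPicturePartitioner`, `PartitionerRegistry`) is hypothetical and is not taken from this PR. A plain `dict` stands in for a Picture shape and strings stand in for elements. The do-nothing default is also an assumption.

```python
from __future__ import annotations

from typing import Iterator, Protocol


class PicturePartitioner(Protocol):
    """Hypothetical interface a Picture sub-partitioner would satisfy."""

    def iter_elements(self, picture: dict) -> Iterator[str]: ...


class NullPicturePartitioner:
    """Assumed default: generates no elements for a picture shape."""

    def iter_elements(self, picture: dict) -> Iterator[str]:
        return iter(())


class AltTextPicturePartitioner:
    """Example replacement: emits the picture's alt-text when present."""

    def iter_elements(self, picture: dict) -> Iterator[str]:
        alt_text = picture.get("alt_text", "")
        if alt_text:
            yield alt_text


class PartitionerRegistry:
    """Holds the picture sub-partitioner currently in effect (hypothetical)."""

    _picture_partitioner: PicturePartitioner = NullPicturePartitioner()

    @classmethod
    def register_picture_partitioner(cls, partitioner: PicturePartitioner) -> None:
        # Swap the sub-partitioner at run-time; later calls use the new one.
        cls._picture_partitioner = partitioner

    @classmethod
    def partition_picture(cls, picture: dict) -> list[str]:
        # The main partitioner delegates here instead of handling pictures itself.
        return list(cls._picture_partitioner.iter_elements(picture))
```

Client code would call `PartitionerRegistry.register_picture_partitioner(AltTextPicturePartitioner())` once, before partitioning, and the main partitioner would not need to change.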
2024-04-12 06:00:01 +00:00
Steve Canny
2c7e0289aa
rfctr(pptx): extract _PptxPartitionerOptions (#2853)
**Reviewers:** Likely quicker to review commit-by-commit.

**Summary**

In preparation for adding a PPTX `Picture` shape _sub-partitioner_,
extract management of PPTX partitioning-run options to a separate
`_PptxPartitionerOptions` object similar to those used in chunking and
XLSX partitioning. This provides several benefits:
- Extract code dealing with applying defaults and computing derived
values from the main partitioning code, leaving it less cluttered and
focused on the partitioning algorithm itself.
- Allow the options set to be passed to helper objects, prominently
including sub-partitioners, without requiring a long list of parameters
or requiring the caller to couple itself to the particular option values
the helper object requires.
- Allow options behaviors to be thoroughly and efficiently tested in
isolation.
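
A minimal sketch of the options-object pattern these benefits describe. This is not the actual `_PptxPartitionerOptions` code: the class name, field names, and defaults below are all hypothetical and chosen only to show defaults, a derived value, and a helper that takes the whole options object.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExamplePartitionerOptions:
    """Hypothetical options object; fields are illustrative only."""

    file_path: Optional[str] = None
    metadata_file_path: Optional[str] = None
    include_page_breaks: Optional[bool] = None

    @property
    def page_breaks_enabled(self) -> bool:
        """Apply the (assumed) default of True when the caller gave no value."""
        return True if self.include_page_breaks is None else self.include_page_breaks

    @property
    def metadata_filename(self) -> Optional[str]:
        """Derived value: an explicit override wins, else fall back to file_path."""
        return self.metadata_file_path or self.file_path


def describe_slide_break(
    opts: ExamplePartitionerOptions, slide_number: int
) -> Optional[str]:
    """Helper that receives the options object rather than individual flags."""
    if slide_number > 1 and opts.page_breaks_enabled:
        return f"<page break before slide {slide_number}>"
    return None
```

Because the defaulting and derivation live on the options object, they can be unit-tested without opening a PPTX file, and the helper's signature stays the same if it later needs another option.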
2024-04-08 19:01:03 +00:00
Steve Canny
b59e4b69ce
rfctr: prepare for fix to raises on file-like-object with name not a path to a file (#2617)
**Summary**
Improve typing and other mechanical refactoring in preparation for fix
to issue 2308.
2024-03-06 23:46:54 +00:00
qued
007fc45739
chore: new black changes (#2473)
Update `black` and apply changes to affected files. I separated this PR
so we can have a look at the changes and decide whether we want to:
1. Go forward with the new formatting
2. Change the black config to make the old formatting valid
3. Get rid of black entirely and just use `ruff`
4. Do something I haven't thought of
2024-01-30 17:12:35 +00:00
Newel H
e34396b2c9
Feat: Native hierarchies for elements from pptx documents (#1616)
## Summary
**Improve title detection in pptx documents** The default title
textboxes on a pptx slide are now categorized as titles.
**Improve hierarchy detection in pptx documents** List items and other
slide text are now properly nested under the slide title. This will enable
better chunking of pptx documents.

Hierarchy detection is improved by determining category depth via the
following:
- Check whether the paragraph has a level via the python-pptx
paragraph object. If so, use the paragraph level as the `category_depth`.
- If the shape being checked is a title shape and the item is not a
bullet or email, the element is set as a Title with a depth
corresponding to the paragraph's position in the shape (e.g. the first
line of the title shape is depth 0, the second is depth 1, etc.).
- If the shape is not a title shape but the paragraph is a title, the
depth is set to level + 1, so that all paragraph titles have a depth of
at least 1, placing them below the slide title element.
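
The three rules above can be sketched as a single function. This is an illustration, not the code in this PR: the function name, its parameters, and in particular the order in which the rules are checked are assumptions.

```python
def category_depth(
    *,
    paragraph_level: int,
    paragraph_index: int,
    in_title_shape: bool,
    is_title_text: bool,
    is_bullet_or_email: bool = False,
) -> int:
    """Return an illustrative category depth for one paragraph.

    paragraph_level: indent level reported for the paragraph (0-based).
    paragraph_index: position of the paragraph within its shape (0-based).
    """
    # Paragraphs in the slide's title shape: depth follows line position,
    # so the first line is 0, the second is 1, and so on.
    if in_title_shape and not is_bullet_or_email:
        return paragraph_index
    # Title-like text outside the title shape sits at least one level
    # below the slide title.
    if is_title_text:
        return paragraph_level + 1
    # Everything else uses the paragraph's own level.
    return paragraph_level
```

For example, a title-like paragraph at level 0 in a body placeholder gets depth 1, which nests it under the depth-0 slide title.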
2023-10-05 12:55:45 -04:00
Steve Canny
ab29de8dbd
Rfctr: Refactor PPTX partitioning to more closely align with how pptx documents are structured
This refactor solves a problem or two, the big one being recursing into
group-shapes to get all shapes on the slide, but mostly lays the
groundwork to allow us to refine further aspects such as list-item
detection, off-slide shape detection, and image-capture going forward.
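
As a rough sketch of what recursing into group-shapes means, the generator below flattens a nested shape tree in document order. The `Shape` class is a hypothetical stand-in so the example runs without python-pptx. In python-pptx itself a group shape exposes its members through a `.shapes` collection, which `children` only mimics here.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class Shape:
    """Stand-in for a slide shape; a group holds other shapes in `children`."""

    name: str
    is_group: bool = False
    children: List["Shape"] = field(default_factory=list)


def iter_leaf_shapes(shapes: Iterable[Shape]) -> Iterator[Shape]:
    """Yield every non-group shape, descending into groups depth-first."""
    for shape in shapes:
        if shape.is_group:
            # A group contributes its members, not itself.
            yield from iter_leaf_shapes(shape.children)
        else:
            yield shape
```

Iterating only the top-level shape collection would miss any text box placed inside a group, which is the problem this refactor addresses.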
2023-09-26 15:43:55 -04:00
Steve Canny
b54994ae95
rfctr: docx partitioning (#1422)
Reviewers: I recommend reviewing commit-by-commit or just looking at the
final version of `partition/docx.py` as View File.

This refactor solves a few problems but mostly lays the groundwork to
allow us to refine further aspects such as page-break detection,
list-item detection, and moving python-docx internals upstream to that
library so our work doesn't depend on that domain-knowledge.
2023-09-19 15:32:46 -07:00