20 Commits

Author SHA1 Message Date
Steve Canny
f1cab248ce
rfctr(msg): remove temporary new_msg.py (#3157)
**Summary**
Remove temporary `new_msg.py` module.

**Additional Context**
The rewrite of `partition_msg()` was placed in a separate file
`new_msg.py` to avoid a messy diff for code-review. This PR makes that
`new_msg.py` the new `msg.py`.

No code changes were made in the process.
2024-06-06 08:31:56 +00:00
Steve Canny
f2e67539b1
rfctr: clean MSG partitioner and tests as prep (#3107)
**Summary**
Fix type errors and generally prepare `partition_msg()` and its tests
for refactoring to use `python-oxmsg` library instead of the problematic
`msg_parser` library for partitioning Outlook MSG files.
2024-05-29 21:36:05 +00:00
Steve Canny
8644a3b09a
fix(odt): fix disk-space leak in partition_odt() (#3037)
Remedy disk-space leak where `partition_odt()` would leave an on-disk
copy of each `.odt` file passed as a file-like object.

`partition_odt()` creates a temporary file in which it writes each
source-document provided as a file-like object. This file is not deleted
and disk consumption grows without bound.

The `convert_and_partition_docx()` function used to convert ODT->DOCX
uses `pandoc` (a command-line program) to do the conversion. Because
this command-line program operates in a different memory space, the
source file cannot be passed as an in-memory object and needs to be on
the filesystem. When the ODT source-document is passed as a file-like
object, it is written to disk so the conversion program has access to
it. It is not deleted afterward.

Fix this by writing the temporary source ODT file in a
`TemporaryDirectory` and also use that location to write the
conversion-target DOCX file. That directory is automatically removed
when `partition_odt()` completes.

While we're in there, improve the factoring of `partition_odt()`.

- Extract `convert_and_partition_docx()` from `partition.docx` (used
only by `partition_odt()`) to `_convert_odt_to_docx()` in
`partition.odt` where it is used. Decouple file conversion from calling
`partition_docx()` with the converted file as the `partition_docx()`
call is `partition_odt()`'s natural responsibility.
- Improve docstrings, typing, and comments.
- All tests pass both before and after.
2024-05-16 20:04:10 +00:00
Steve Canny
601594d373
fix(docx): fix short-row DOCX table (#2943)
**Summary**
The DOCX format allows a table row to start late and/or end early,
meaning cells at the beginning or end of a row can be omitted. While
there are legitimate uses for this capability, using it in practice is
relatively rare. However, it can happen unintentionally when adjusting
cell borders with the mouse. Accommodate this case and generate accurate
`.text` and `.metadata.text_as_html` for these tables.
2024-05-02 00:45:52 +00:00
Steve Canny
4dc8327149
rfctr(pptx): make PptxPartitionerOptions public (#2901)
**Summary**
A few additional small, mechanical odds and ends required for PPTX image
extraction.

The big one is removing the leading underscore from
`PptxPartitionerOptions` because now client code that implements a
custom Picture-shape sub-partitioner will need to reference this class.
2024-04-19 04:50:06 +00:00
Steve Canny
3e643c4cb3
feat(pptx): add pluggable PPTX Picture sub-partitioner (#2880)
**Summary**
Delegate partitioning of PPTX Picture (image, to a first approximation)
shapes to a distinct sub-partitioner and allow the default picture
sub-partitioner to be replaced at run-time by one of the user's
choosing.
2024-04-12 06:00:01 +00:00
Steve Canny
2c7e0289aa
rfctr(pptx): extract _PptxPartitionerOptions (#2853)
**Reviewers:** Likely quicker to review commit-by-commit.

**Summary**

In preparation for adding a PPTX `Picture` shape _sub-partitioner_,
extract management of PPTX partitioning-run options to a separate
`_PptxPartitioningOptions` object similar to those used in chunking and
XLSX partitioning. This provides several benefits:
- Extract code dealing with applying defaults and computing derived
values from the main partitioning code, leaving it less cluttered and
focused on the partitioning algorithm itself.
- Allow the options set to be passed to helper objects, prominently
including sub-partitioners, without requiring a long list of parameters
or requiring the caller to couple itself to the particular option values
the helper object requires.
- Allow options behaviors to be thoroughly and efficiently tested in
isolation.
2024-04-08 19:01:03 +00:00
Steve Canny
b59e4b69ce
rfctr: prepare for fix to raises on file-like-object with name not a path to a file (#2617)
**Summary**
Improve typing and other mechanical refactoring in preparation for fix
to issue 2308.
2024-03-06 23:46:54 +00:00
qued
007fc45739
chore: new black changes (#2473)
Update `black` and apply changes to affected files. I separated this PR
so we can have a look at the changes and decide whether we want to:
1. Go forward with the new formatting
2. Change the black config to make the old formatting valid
3. Get rid of black entirely and just use `ruff`
4. Do something I haven't thought of
2024-01-30 17:12:35 +00:00
Steve Canny
e6637592d1
fix(docx): Table.text duplicates merged cell text (#2134)
**Summary.** The `python-docx` table API is designed for _uniform_
tables (no merged cells, no nested tables). Naive processing of DOCX
tables using this API produces duplicate text when the table has merged
cells. Add a more sophisticated parsing method that reads only "root"
cells (those with an actual `<tc>` element) and skip cells spanned by a
merge.

In the process, abandon use of the `tabulate` package for this job
(which is also designed for uniform tables) and remove the whitespace
padding it adds for visual alignment of columns. Separate the text for
each cell with a single newline ("\n").

Since it's little extra trouble, add support for nested tables such that
their text also contributes to the `Table.text` string.

The new `._iter_table_texts()` method will also be used for parsing
tables in headers and footers (where they are frequently used for layout
purposes) in a closely following PR.

Fixes #2106.
2023-11-21 22:22:40 +00:00
Steve Canny
0e2c21e5a2
fix: handle sectionless-docx in the general case (#1829)
A DOCX document that has no sections can still contain one or more
tables. Such files are never created by Word but Word can open them just
fine. These can be and are generated by other applications.

Use the newly-added `Document.iter_inner_content()` method added
upstream in `python-docx` to capture both paragraphs and tables from a
section-less DOCX document.

This generalizes the fix for MS Teams chat-transcripts (an example of
sectionless-docx) implemented in #1825.
2023-11-08 19:05:19 +00:00
Steve Canny
80fe07b89f
fix: #1952 support nested docx tables (#2020)
In DOCX, like HTML, a table cell can itself contain a table. This is not
uncommon and is typically used for formatting purposes.

When a DOCX table is nested, create nested HTML tables to reflect that
structure and create a plain-text table with captures all the text in
nested tables, formatting it as a reasonable facsimile of a table.

This implements the solution described and spiked in PR #1952.

---------

Co-authored-by: Bruno Bornsztein <bruno.bornsztein@gmail.com>
2023-11-08 00:37:21 +00:00
Steve Canny
4e40999070
rfctr: prepare docx partitioner and tests for nested tables PR to follow (#1978)
*Reviewer:* May be quicker to review commit by commit as they are quite
distinct and well-groomed to each focus on a single clean-up task.

Clean up odds-and-ends in the docx partitioner in preparation for adding
nested-tables support in a closely following PR.

1. Remove obsolete TODOs now in GitHub issues, which is probably where
they belong in future anyway.
2. Remove local DOCX "workaround" code that has been implemented
upstream and is now obsolete.
3. "Clean" the docx tests, introducing strict typing, extracting a
fixture or two, and generally tightening things up.
4. Extract docx-local versions of
`unstructured.partition.common.convert_ms_office_table_to_text()` which
will be the base for adding nested-table support. More information on
why this is required in that commit.
2023-11-02 05:22:17 +00:00
Roman Isecke
b265d8874b
refactoring linting (#1739)
### Description
Currently linting only takes place over the base unstructured directory
but we support python files throughout the repo. It makes sense for all
those files to also abide by the same linting rules so the entire repo
was set to be inspected when the linters are run. Along with that
autoflake was added as a linter which has a lot of added benefits such
as removing unused imports for you that would currently break flake and
require manual intervention.

The only real relevant changes in this PR are in the `Makefile`,
`setup.cfg`, and `requirements/test.in`. The rest is the result of
running the linters.
2023-10-17 12:45:12 +00:00
Steve Canny
4b84d596c2
docx: add hyperlink metadata (#1746) 2023-10-13 06:26:14 +00:00
Newel H
e34396b2c9
Feat: Native hierarchies for elements from pptx documents (#1616)
## Summary
**Improve title detection in pptx documents** The default title
textboxes on a pptx slide are now categorized as titles.
**Improve hierarchy detection in pptx documents** List items, and other
slide text are properly nested under the slide title. This will enable
better chunking of pptx documents.

Hierarchy detection is improved by determining category depth via the
following:
- Check if the paragraph item has a level parameter via the python pptx
paragraph. If so, use the paragraph level as the category_depth level.
- If the shape being checked is a title shape and the item is not a
bullet or email, the element will be set as a Title with a depth
corresponding to the enumerated paragraph increment (e.g. 1st line of
title shape is depth 0, second is depth 1 etc.).
- If the shape is not a title shape but the paragraph is a title, the
increment will match the level + 1, so that all paragraph titles are at
least 1 to set them below the slide title element
2023-10-05 12:55:45 -04:00
Newel H
55315cf645
Feat: Native hierarchies for docx element types (#1505)
Improves hierarchy from docx files by leveraging natural hierarchies
built into docx documents. Hierarchy can now be detected from an
indentation level for list bullets/numbers and by style name (e.g.
Heading 1, List Bullet 2, List Number).

Hierarchy detection is improved by determining category depth via the
following:
1. Check if the paragraph item has an indentation level (ilvl) xpath -
these are typically on list bullet/numbers. Return the indentation level
if it exists
2. Check the name of the paragraph style if it contains any category
depth information (e.g. Heading 1 vs Heading 2 or List Bullet vs List
Bullet 2). Return the category depth if found, else default to depth of
0.
3. Check the paragraph ilvl via the paragraph's style name. Outside of
the paragraph's metadata, docx stores default ilvls for various style
names, which requires a complex lookup. This check is yet to be
implemented, as the above methods cover most usecases but the
implementation is stubbed out.
---
Co-authored-by: Steve Canny <stcanny@gmail.com>
2023-09-27 11:32:46 -04:00
Steve Canny
ab29de8dbd
Rfctr: Refactor PPTX partitioning to more closely align with how pptx documents are structured
This refactor solves a problem or two, the big one being recursing into
group-shapes to get all shapes on the slide, but mostly lays the
groundwork to allow us to refine further aspects such as list-item
detection, off-slide shape detection, and image-capture going forward.
2023-09-26 15:43:55 -04:00
rvztz
2f52df180f
Adds data source properties to onedrive, reddit and slack (#1281) 2023-09-20 04:26:36 +00:00
Steve Canny
b54994ae95
rfctr: docx partitioning (#1422)
Reviewers: I recommend reviewing commit-by-commit or just looking at the
final version of `partition/docx.py` as View File.

This refactor solves a few problems but mostly lays the groundwork to
allow us to refine further aspects such as page-break detection,
list-item detection, and moving python-docx internals upstream to that
library so our work doesn't depend on that domain-knowledge.
2023-09-19 15:32:46 -07:00