unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-14 17:37:27 +00:00

Author	SHA1	Message	Date
Steve Canny	f1cab248ce	rfctr(msg): remove temporary new_msg.py (#3157 ) Summary Remove temporary `new_msg.py` module. Additional Context The rewrite of `partition_msg()` was placed in a separate file `new_msg.py` to avoid a messy diff for code-review. This PR makes that `new_msg.py` the new `msg.py`. No code changes were made in the process.	2024-06-06 08:31:56 +00:00
Steve Canny	f2e67539b1	rfctr: clean MSG partitioner and tests as prep (#3107 ) Summary Fix type errors and generally prepare `partition_msg()` and its tests for refactoring to use `python-oxmsg` library instead of the problematic `msg_parser` library for partitioning Outlook MSG files.	2024-05-29 21:36:05 +00:00
Steve Canny	8644a3b09a	fix(odt): fix disk-space leak in partition_odt() (#3037 ) Remedy disk-space leak where `partition_odt()` would leave an on-disk copy of each `.odt` file passed as a file-like object. `partition_odt()` creates a temporary file in which it writes each source-document provided as a file-like object. This file is not deleted and disk consumption grows without bound. The `convert_and_partition_docx()` function used to convert ODT->DOCX uses `pandoc` (a command-line program) to do the conversion. Because this command-line program operates in a different memory space, the source file cannot be passed as an in-memory object and needs to be on the filesystem. When the ODT source-document is passed as a file-like object, it is written to disk so the conversion program has access to it. It is not deleted afterward. Fix this by writing the temporary source ODT file in a `TemporaryDirectory` and also use that location to write the conversion-target DOCX file. That directory is automatically removed when `partition_odt()` completes. While we're in there, improve the factoring of `partition_odt()`. - Extract `convert_and_partition_docx()` from `partition.docx` (used only by `partition_odt()`) to `_convert_odt_to_docx()` in `partition.odt` where it is used. Decouple file conversion from calling `partition_docx()` with the converted file as the `partition_docx()` call is `partition_odt()`'s natural responsibility. - Improve docstrings, typing, and comments. - All tests pass both before and after.	2024-05-16 20:04:10 +00:00
Steve Canny	601594d373	fix(docx): fix short-row DOCX table (#2943 ) Summary The DOCX format allows a table row to start late and/or end early, meaning cells at the beginning or end of a row can be omitted. While there are legitimate uses for this capability, using it in practice is relatively rare. However, it can happen unintentionally when adjusting cell borders with the mouse. Accommodate this case and generate accurate `.text` and `.metadata.text_as_html` for these tables.	2024-05-02 00:45:52 +00:00
Steve Canny	4dc8327149	rfctr(pptx): make PptxPartitionerOptions public (#2901 ) Summary A few additional small, mechanical odds and ends required for PPTX image extraction. The big one is removing the leading underscore from `PptxPartitionerOptions` because now client code that implements a custom Picture-shape sub-partitioner will need to reference this class.	2024-04-19 04:50:06 +00:00
Steve Canny	3e643c4cb3	feat(pptx): add pluggable PPTX Picture sub-partitioner (#2880 ) Summary Delegate partitioning of PPTX Picture (image, to a first approximation) shapes to a distinct sub-partitioner and allow the default picture sub-partitioner to be replaced at run-time by one of the user's choosing.	2024-04-12 06:00:01 +00:00
Steve Canny	2c7e0289aa	rfctr(pptx): extract _PptxPartitionerOptions (#2853 ) Reviewers: Likely quicker to review commit-by-commit. Summary In preparation for adding a PPTX `Picture` shape _sub-partitioner_, extract management of PPTX partitioning-run options to a separate `_PptxPartitionerOptions` object similar to those used in chunking and XLSX partitioning. This provides several benefits: - Extract code dealing with applying defaults and computing derived values from the main partitioning code, leaving it less cluttered and focused on the partitioning algorithm itself. - Allow the options set to be passed to helper objects, prominently including sub-partitioners, without requiring a long list of parameters or requiring the caller to couple itself to the particular option values the helper object requires. - Allow options behaviors to be thoroughly and efficiently tested in isolation.	2024-04-08 19:01:03 +00:00
Steve Canny	b59e4b69ce	rfctr: prepare for fix to raises on file-like-object with name not a path to a file (#2617 ) Summary Improve typing and other mechanical refactoring in preparation for fix to issue 2308.	2024-03-06 23:46:54 +00:00
qued	007fc45739	chore: new black changes (#2473 ) Update `black` and apply changes to affected files. I separated this PR so we can have a look at the changes and decide whether we want to: 1. Go forward with the new formatting 2. Change the black config to make the old formatting valid 3. Get rid of black entirely and just use `ruff` 4. Do something I haven't thought of	2024-01-30 17:12:35 +00:00
Steve Canny	e6637592d1	fix(docx): Table.text duplicates merged cell text (#2134 ) Summary. The `python-docx` table API is designed for _uniform_ tables (no merged cells, no nested tables). Naive processing of DOCX tables using this API produces duplicate text when the table has merged cells. Add a more sophisticated parsing method that reads only "root" cells (those with an actual `<tc>` element) and skip cells spanned by a merge. In the process, abandon use of the `tabulate` package for this job (which is also designed for uniform tables) and remove the whitespace padding it adds for visual alignment of columns. Separate the text for each cell with a single newline ("\n"). Since it's little extra trouble, add support for nested tables such that their text also contributes to the `Table.text` string. The new `._iter_table_texts()` method will also be used for parsing tables in headers and footers (where they are frequently used for layout purposes) in a closely following PR. Fixes #2106.	2023-11-21 22:22:40 +00:00
Steve Canny	0e2c21e5a2	fix: handle sectionless-docx in the general case (#1829 ) A DOCX document that has no sections can still contain one or more tables. Such files are never created by Word but Word can open them just fine. These can be and are generated by other applications. Use the newly-added `Document.iter_inner_content()` method added upstream in `python-docx` to capture both paragraphs and tables from a section-less DOCX document. This generalizes the fix for MS Teams chat-transcripts (an example of sectionless-docx) implemented in #1825.	2023-11-08 19:05:19 +00:00
Steve Canny	80fe07b89f	fix: #1952 support nested docx tables (#2020 ) In DOCX, like HTML, a table cell can itself contain a table. This is not uncommon and is typically used for formatting purposes. When a DOCX table is nested, create nested HTML tables to reflect that structure and create a plain-text table that captures all the text in nested tables, formatting it as a reasonable facsimile of a table. This implements the solution described and spiked in PR #1952. --------- Co-authored-by: Bruno Bornsztein <bruno.bornsztein@gmail.com>	2023-11-08 00:37:21 +00:00
Steve Canny	4e40999070	rfctr: prepare docx partitioner and tests for nested tables PR to follow (#1978 ) Reviewer: May be quicker to review commit by commit as they are quite distinct and well-groomed to each focus on a single clean-up task. Clean up odds-and-ends in the docx partitioner in preparation for adding nested-tables support in a closely following PR. 1. Remove obsolete TODOs now in GitHub issues, which is probably where they belong in future anyway. 2. Remove local DOCX "workaround" code that has been implemented upstream and is now obsolete. 3. "Clean" the docx tests, introducing strict typing, extracting a fixture or two, and generally tightening things up. 4. Extract docx-local versions of `unstructured.partition.common.convert_ms_office_table_to_text()` which will be the base for adding nested-table support. More information on why this is required in that commit.	2023-11-02 05:22:17 +00:00
Roman Isecke	b265d8874b	refactoring linting (#1739 ) ### Description Currently linting only takes place over the base unstructured directory but we support python files throughout the repo. It makes sense for all those files to also abide by the same linting rules so the entire repo was set to be inspected when the linters are run. Along with that autoflake was added as a linter which has a lot of added benefits such as removing unused imports for you that would currently break flake and require manual intervention. The only real relevant changes in this PR are in the `Makefile`, `setup.cfg`, and `requirements/test.in`. The rest is the result of running the linters.	2023-10-17 12:45:12 +00:00
Steve Canny	4b84d596c2	docx: add hyperlink metadata (#1746 )	2023-10-13 06:26:14 +00:00
Newel H	e34396b2c9	Feat: Native hierarchies for elements from pptx documents (#1616 ) ## Summary Improve title detection in pptx documents: the default title textboxes on a pptx slide are now categorized as titles. Improve hierarchy detection in pptx documents: list items and other slide text are properly nested under the slide title. This will enable better chunking of pptx documents. Hierarchy detection is improved by determining category depth via the following: - Check if the paragraph item has a level parameter via the `python-pptx` paragraph. If so, use the paragraph level as the category_depth level. - If the shape being checked is a title shape and the item is not a bullet or email, the element will be set as a Title with a depth corresponding to the enumerated paragraph increment (e.g. 1st line of title shape is depth 0, second is depth 1, etc.). - If the shape is not a title shape but the paragraph is a title, the depth will be the level + 1, so that all paragraph titles have a depth of at least 1, placing them below the slide title element.	2023-10-05 12:55:45 -04:00
Newel H	55315cf645	Feat: Native hierarchies for docx element types (#1505 ) Improves hierarchy from docx files by leveraging natural hierarchies built into docx documents. Hierarchy can now be detected from an indentation level for list bullets/numbers and by style name (e.g. Heading 1, List Bullet 2, List Number). Hierarchy detection is improved by determining category depth via the following: 1. Check if the paragraph item has an indentation level (ilvl) xpath - these are typically on list bullets/numbers. Return the indentation level if it exists. 2. Check whether the name of the paragraph style contains any category depth information (e.g. Heading 1 vs Heading 2 or List Bullet vs List Bullet 2). Return the category depth if found, else default to a depth of 0. 3. Check the paragraph ilvl via the paragraph's style name. Outside of the paragraph's metadata, docx stores default ilvls for various style names, which requires a complex lookup. This check is not yet implemented because the above methods cover most use cases, but it is stubbed out. --- Co-authored-by: Steve Canny <stcanny@gmail.com>	2023-09-27 11:32:46 -04:00
Steve Canny	ab29de8dbd	Rfctr: Refactor PPTX partitioning to more closely align with how pptx documents are structured This refactor solves a problem or two, the big one being recursing into group-shapes to get all shapes on the slide, but mostly lays the groundwork to allow us to refine further aspects such as list-item detection, off-slide shape detection, and image-capture going forward.	2023-09-26 15:43:55 -04:00
rvztz	2f52df180f	Adds data source properties to onedrive, reddit and slack (#1281 )	2023-09-20 04:26:36 +00:00
Steve Canny	b54994ae95	rfctr: docx partitioning (#1422 ) Reviewers: I recommend reviewing commit-by-commit or just looking at the final version of `partition/docx.py` as View File. This refactor solves a few problems but mostly lays the groundwork to allow us to refine further aspects such as page-break detection, list-item detection, and moving python-docx internals upstream to that library so our work doesn't depend on that domain-knowledge.	2023-09-19 15:32:46 -07:00
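
The disk-space fix described in commit 8644a3b09a (#3037) relies on writing both the temporary source file and the conversion target inside a `TemporaryDirectory`, so that everything is removed when the block exits. The following is only a minimal sketch of that pattern using the standard library. The function name `convert_in_temp_dir`, its parameters, and the `convert` callable (a stand-in for an external converter such as `pandoc`) are hypothetical and are not taken from the repository.

```python
import io
import os
import tempfile
from typing import IO, Callable


def convert_in_temp_dir(
    source: IO[bytes],
    convert: Callable[[str, str], None],
    source_name: str = "source.odt",
    target_name: str = "target.docx",
) -> bytes:
    """Write `source` to disk, convert it, and return the converted bytes.

    `convert(src_path, dst_path)` is a hypothetical stand-in for an external
    command-line converter that can only read from and write to the filesystem.
    """
    # Both files live in one temporary directory, which is deleted (with its
    # contents) when the `with` block exits, even if the conversion raises.
    with tempfile.TemporaryDirectory() as tmp_dir:
        src_path = os.path.join(tmp_dir, source_name)
        dst_path = os.path.join(tmp_dir, target_name)

        # The converter cannot see in-memory objects, so materialize the
        # file-like object on disk first.
        with open(src_path, "wb") as f:
            f.write(source.read())

        convert(src_path, dst_path)

        # Read the result into memory before the directory disappears.
        with open(dst_path, "rb") as f:
            return f.read()
```

Reading the converted file into memory before leaving the `with` block is one design choice; the result could equally be partitioned inside the block, as long as nothing refers to the temporary paths afterward.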

20 Commits
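
The first two category-depth checks described in commit 55315cf645 (#1505) can be illustrated roughly as follows. This is a sketch under stated assumptions, not the repository's code: the function names are hypothetical, and the mapping of a trailing style number to a zero-based depth ("Heading 1" to 0, "Heading 2" to 1, "List Bullet" to 0, "List Bullet 2" to 1) is an assumption inferred from the commit message's examples.

```python
import re
from typing import Optional


def depth_from_style_name(style_name: str) -> int:
    """Derive a zero-based depth from a trailing number in a style name.

    Assumption: a trailing number N means depth N - 1, and a style name
    without a trailing number means depth 0.
    """
    match = re.search(r"(\d+)\s*$", style_name)
    if match is None:
        return 0
    return max(int(match.group(1)) - 1, 0)


def category_depth(ilvl: Optional[int], style_name: str) -> int:
    """Prefer an explicit indentation level, else fall back to the style name."""
    # Check 1: an explicit indentation level (typically present on list items).
    if ilvl is not None:
        return ilvl
    # Check 2: depth information embedded in the style name, defaulting to 0.
    return depth_from_style_name(style_name)
```

The third check in the commit message, the lookup of default indentation levels stored per style, is described there as not yet implemented and is left out of this sketch.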