unstructured/table.pyi at e65a44eabbcb1597e7a4b46e4d61f083503632f8

yujunjun/unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-03 04:14:15 +00:00

Steve Canny e6637592d1

fix(docx): Table.text duplicates merged cell text (#2134)

**Summary.** The `python-docx` table API is designed for _uniform_
tables (no merged cells, no nested tables). Naive processing of DOCX
tables using this API produces duplicate text when the table has merged
cells. Add a more sophisticated parsing method that reads only "root"
cells (those with an actual `<tc>` element) and skips cells spanned by a
merge.

In the process, abandon use of the `tabulate` package for this job
(which is also designed for uniform tables) and remove the whitespace
padding it adds for visual alignment of columns. Separate the text for
each cell with a single newline ("\n").
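
The root-cell idea described above can be illustrated with a minimal sketch. It is not the code from this commit: it parses a hand-written WordprocessingML fragment with the standard-library `xml.etree.ElementTree` instead of `python-docx`, and the names `SAMPLE_TBL`, `v_merge`, and `iter_root_cell_texts` are hypothetical. It assumes a `<w:vMerge>` element without a `w:val` attribute marks a continuation cell.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical 3-row table: column 1 of rows 1-2 is vertically merged,
# and row 3 is a single cell spanning both columns (gridSpan).
SAMPLE_TBL = f"""
<w:tbl xmlns:w="{W}">
  <w:tr>
    <w:tc><w:tcPr><w:vMerge w:val="restart"/></w:tcPr><w:p><w:r><w:t>a</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>b</w:t></w:r></w:p></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:tcPr><w:vMerge/></w:tcPr><w:p/></w:tc>
    <w:tc><w:p><w:r><w:t>c</w:t></w:r></w:p></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr><w:p><w:r><w:t>d</w:t></w:r></w:p></w:tc>
  </w:tr>
</w:tbl>
""".strip()


def v_merge(tc):
    """Return "restart", "continue", or None for a <w:tc> element."""
    el = tc.find("w:tcPr/w:vMerge", NS)
    if el is None:
        return None
    # Assumption: a missing w:val attribute means "continue".
    return el.get(f"{{{W}}}val", "continue")


def iter_root_cell_texts(tbl):
    """Yield the text of each physical <w:tc>, skipping merge continuations."""
    for tr in tbl.findall("w:tr", NS):
        for tc in tr.findall("w:tc", NS):
            if v_merge(tc) == "continue":
                continue  # cell is spanned by a vertical merge
            paragraphs = (
                "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
                for p in tc.findall("w:p", NS)
            )
            yield "\n".join(paragraphs)


cell_texts = list(iter_root_cell_texts(ET.fromstring(SAMPLE_TBL)))
table_text = "\n".join(cell_texts)  # one newline between cells, no padding
```

Because a horizontally merged cell is a single `<w:tc>` element, walking the physical elements visits it only once, so only vertical continuations need an explicit skip in this sketch.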

Since it's little extra trouble, add support for nested tables such that
their text also contributes to the `Table.text` string.
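
One way nested-table text could be folded in is to recurse whenever a cell contains a `<w:tbl>` child. The sketch below is an assumption about the general shape, not the commit's `._iter_table_texts()` implementation; it uses the standard-library `xml.etree.ElementTree`, the names `NESTED_TBL` and `iter_nested_table_texts` are hypothetical, and merge handling is omitted to keep it short.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
P_TAG = f"{{{W_NS}}}p"
TBL_TAG = f"{{{W_NS}}}tbl"
TR_TAG = f"{{{W_NS}}}tr"
TC_TAG = f"{{{W_NS}}}tc"
T_TAG = f"{{{W_NS}}}t"

# Hypothetical table whose second cell holds a nested two-cell table.
NESTED_TBL = f"""
<w:tbl xmlns:w="{W_NS}">
  <w:tr>
    <w:tc><w:p><w:r><w:t>outer</w:t></w:r></w:p></w:tc>
    <w:tc>
      <w:tbl>
        <w:tr>
          <w:tc><w:p><w:r><w:t>in1</w:t></w:r></w:p></w:tc>
          <w:tc><w:p><w:r><w:t>in2</w:t></w:r></w:p></w:tc>
        </w:tr>
      </w:tbl>
      <w:p><w:r><w:t>after</w:t></w:r></w:p>
    </w:tc>
  </w:tr>
</w:tbl>
""".strip()


def iter_nested_table_texts(tbl):
    """Yield non-empty paragraph texts in document order, descending into nested tables."""
    for tr in tbl.findall(TR_TAG):
        for tc in tr.findall(TC_TAG):
            for child in tc:  # direct block-level children of the cell
                if child.tag == P_TAG:
                    text = "".join(t.text or "" for t in child.iter(T_TAG))
                    if text:
                        yield text
                elif child.tag == TBL_TAG:
                    yield from iter_nested_table_texts(child)


nested_texts = list(iter_nested_table_texts(ET.fromstring(NESTED_TBL)))
nested_table_text = "\n".join(nested_texts)
```

Walking only the direct children of each cell keeps nested text in document order and avoids counting a nested table's paragraphs twice.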

The new `._iter_table_texts()` method will also be used for parsing
tables in headers and footers (where they are frequently used for layout
purposes) in a closely following PR.

Fixes #2106.

2023-11-21 22:22:40 +00:00

17 lines

325 B

Python

 """Table-related XML element-types."""

 from __future__ import annotations

 from typing import List

 from docx.oxml.xmlchemy import BaseOxmlElement

 class CT_Row(BaseOxmlElement):
     tc_lst: List[CT_Tc]

 class CT_Tc(BaseOxmlElement):
     @property
     def vMerge(self) -> str | None: ...

 class CT_Tbl(BaseOxmlElement): ...