mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-10-03 04:14:15 +00:00

**Summary.** The `python-docx` table API is designed for _uniform_ tables (no merged cells, no nested tables). Naive processing of DOCX tables using this API produces duplicate text when the table has merged cells. Add a more sophisticated parsing method that reads only "root" cells (those with an actual `<tc>` element) and skips cells spanned by a merge. In the process, abandon use of the `tabulate` package for this job (it is also designed for uniform tables) and remove the whitespace padding it adds for visual alignment of columns. Separate the text for each cell with a single newline (`"\n"`). Since it is little extra trouble, add support for nested tables so that their text also contributes to the `Table.text` string. The new `._iter_table_texts()` method will also be used in a closely following PR to parse tables in headers and footers, where they are frequently used for layout purposes. Fixes #2106.
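The root-cell idea in the summary can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's `._iter_table_texts()` implementation: it walks raw WordprocessingML with `xml.etree.ElementTree` instead of `python-docx`, and the names `TABLE_XML`, `v_merge`, and `iter_table_texts` are hypothetical. It assumes that a horizontally merged cell appears as a single `<w:tc>` carrying `w:gridSpan`, and that a cell continuing a vertical merge carries `<w:vMerge/>` whose `w:val` defaults to `"continue"`.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Optional

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical sample: a 2-column table with a horizontal merge (row 1),
# a vertical merge (rows 2-3), and a nested table (row 3, column 2).
TABLE_XML = f"""
<w:tbl xmlns:w="{W}">
  <w:tr>
    <w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr>
      <w:p><w:r><w:t>Header</w:t></w:r></w:p></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:tcPr><w:vMerge w:val="restart"/></w:tcPr>
      <w:p><w:r><w:t>Tall</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>a</w:t></w:r></w:p></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:tcPr><w:vMerge/></w:tcPr><w:p/></w:tc>
    <w:tc>
      <w:tbl><w:tr><w:tc><w:p><w:r><w:t>nested</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
      <w:p/>
    </w:tc>
  </w:tr>
</w:tbl>
"""


def v_merge(tc: ET.Element) -> Optional[str]:
    """Return None when the cell has no w:vMerge, else its value ("continue" by default)."""
    el = tc.find("w:tcPr/w:vMerge", NS)
    if el is None:
        return None
    return el.get(f"{{{W}}}val", "continue")


def iter_table_texts(tbl: ET.Element) -> Iterator[str]:
    """Yield the text of each non-empty paragraph in root cells, descending into nested tables."""
    for tr in tbl.findall("w:tr", NS):
        # Only <w:tc> elements physically present in the row are visited, so a
        # horizontally merged cell is read once rather than once per grid column.
        for tc in tr.findall("w:tc", NS):
            # A vertical-merge continuation belongs to the cell above; skip it.
            if v_merge(tc) == "continue":
                continue
            for child in tc:
                if child.tag == f"{{{W}}}p":
                    text = "".join(t.text or "" for t in child.iter(f"{{{W}}}t"))
                    if text:
                        yield text
                elif child.tag == f"{{{W}}}tbl":
                    yield from iter_table_texts(child)


# Cell texts are joined with a single newline, with no alignment padding.
table_text = "\n".join(iter_table_texts(ET.fromstring(TABLE_XML.strip())))
print(table_text)
```

For this sample the joined text is `Header`, `Tall`, `a`, `nested`, one per line. A grid-based walk that visits every row and column position would instead report `Header` and `Tall` twice each, which is the duplication the summary describes.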
17 lines
325 B
Python
"""Table-related XML element-types."""

from __future__ import annotations

from typing import List

from docx.oxml.xmlchemy import BaseOxmlElement

class CT_Row(BaseOxmlElement):
    tc_lst: List[CT_Tc]

class CT_Tc(BaseOxmlElement):
    @property
    def vMerge(self) -> str | None: ...

class CT_Tbl(BaseOxmlElement): ...