unstructured/example-docs/docx-tables.docx
Steve Canny e6637592d1
fix(docx): Table.text duplicates merged cell text (#2134)
**Summary.** The `python-docx` table API is designed for _uniform_
tables (no merged cells, no nested tables). Naive processing of DOCX
tables using this API produces duplicate text when the table has merged
cells. Add a more sophisticated parsing method that reads only "root"
cells (those with an actual `<tc>` element) and skips cells spanned by a
merge.
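
A minimal sketch of the root-cell idea, using only the standard library on hand-written WordprocessingML rather than the `python-docx` objects this commit works with. The table XML and the name `iter_root_cell_texts` are illustrative assumptions, not the code added here. A horizontally merged cell is a single `<w:tc>` carrying `<w:gridSpan>`, so walking the actual `<w:tc>` children of each row visits it once. A vertical-merge continuation cell is marked by `<w:vMerge>` without `w:val="restart"` and is skipped.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical table: row 1 is one cell spanning two columns; "a" spans rows 2-3.
TBL_XML = f"""
<w:tbl xmlns:w="{W}">
  <w:tr>
    <w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr><w:p><w:r><w:t>Header</w:t></w:r></w:p></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:tcPr><w:vMerge w:val="restart"/></w:tcPr><w:p><w:r><w:t>a</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>b</w:t></w:r></w:p></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:tcPr><w:vMerge/></w:tcPr><w:p/></w:tc>
    <w:tc><w:p><w:r><w:t>c</w:t></w:r></w:p></w:tc>
  </w:tr>
</w:tbl>
"""


def iter_root_cell_texts(tbl):
    """Yield the text of each cell that has its own <w:tc>, once per cell."""
    for tr in tbl.findall("w:tr", NS):
        for tc in tr.findall("w:tc", NS):
            v_merge = tc.find("w:tcPr/w:vMerge", NS)
            # A <w:vMerge> without w:val defaults to "continue": the cell is
            # covered by the merge started in a row above, so skip it.
            if v_merge is not None and v_merge.get(f"{{{W}}}val", "continue") == "continue":
                continue
            # Join the cell's direct paragraphs, ignoring empty ones.
            paragraphs = (
                "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
                for p in tc.findall("w:p", NS)
            )
            text = "\n".join(p for p in paragraphs if p)
            if text:
                yield text


tbl = ET.fromstring(TBL_XML.strip())
# One newline between cells, no column-alignment padding.
table_text = "\n".join(iter_root_cell_texts(tbl))
print(table_text)
```

A grid-based walk would visit the "Header" cell once per column it spans, which is the source of the duplicate text; iterating the `<w:tc>` elements themselves does not.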

In the process, abandon use of the `tabulate` package for this job
(which is also designed for uniform tables) and remove the whitespace
padding it adds for visual alignment of columns. Separate the text for
each cell with a single newline ("\n").

Since it's little extra trouble, add support for nested tables such that
their text also contributes to the `Table.text` string.
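
Nested-table support can be pictured as a recursion over each cell's block-level children, sketched below under the same assumptions: standard library only, hand-written XML, and a hypothetical function name `iter_table_texts` that is not the method in this commit. Merge handling is left out to keep the recursion visible.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
P, TBL, TR, TC, T = (f"{{{W}}}{tag}" for tag in ("p", "tbl", "tr", "tc", "t"))

# Hypothetical outer table whose second cell holds a nested two-cell table.
NESTED_XML = f"""
<w:tbl xmlns:w="{W}">
  <w:tr>
    <w:tc><w:p><w:r><w:t>outer</w:t></w:r></w:p></w:tc>
    <w:tc>
      <w:tbl>
        <w:tr>
          <w:tc><w:p><w:r><w:t>x</w:t></w:r></w:p></w:tc>
          <w:tc><w:p><w:r><w:t>y</w:t></w:r></w:p></w:tc>
        </w:tr>
      </w:tbl>
      <w:p/>
    </w:tc>
  </w:tr>
</w:tbl>
"""


def iter_table_texts(tbl):
    """Yield non-empty paragraph texts in document order, descending into nested tables."""
    for tr in tbl.findall(TR):
        for tc in tr.findall(TC):
            # A cell's children are block items: paragraphs and (nested) tables.
            for child in tc:
                if child.tag == P:
                    text = "".join(t.text or "" for t in child.iter(T))
                    if text:
                        yield text
                elif child.tag == TBL:
                    yield from iter_table_texts(child)


nested_text = "\n".join(iter_table_texts(ET.fromstring(NESTED_XML.strip())))
print(nested_text)
```

Because the nested table is handled at the point where it occurs in the cell, its text lands between the surrounding cell texts in reading order.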

The new `._iter_table_texts()` method will also be used for parsing
tables in headers and footers (where they are frequently used for layout
purposes) in a closely following PR.

Fixes #2106.
2023-11-21 22:22:40 +00:00

12 KiB