mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-10-28 16:40:34 +00:00
**Reviewers:** It may be easier to review each of the two commits
separately. The first adds the new `_SubtableParser` object with its
unit-tests and the second one uses that object to replace the flawed
existing subtable-parsing algorithm.
**Summary**
There are a cluster of bugs in `partition_xlsx()` that all derive from
flaws in the algorithm we use to detect "subtables". These are
encountered when the user wants to get multiple document-elements from
each worksheet, which is the default (argument `find_subtable = True`).
This PR replaces the flawed existing algorithm with a `_SubtableParser`
object that encapsulates all that logic and has thorough unit-tests.
**Additional Context**
This is a summary of the failure cases. There are a few other cases but
they're closely related and this was enough evidence and scope for my
purposes. This PR fixes all these bugs:
```python
#
# -- ✅ CASE 1: There are no leading or trailing single-cell rows.
# -> this subtable functions never get called, subtable is emitted as the only element
#
# a b -> Table(a, b, c, d)
# c d
# -- ✅ CASE 2: There is exactly one leading single-cell row.
# -> Leading single-cell row emitted as `Title` element, core-table properly identified.
#
# a -> [ Title(a),
# b c Table(b, c, d, e) ]
# d e
# -- ❌ CASE 3: There are two-or-more leading single-cell rows.
# -> leading single-cell rows are included in subtable
#
# a -> [ Table(a, b, c, d, e, f) ]
# b
# c d
# e f
# -- ❌ CASE 4: There is exactly one trailing single-cell row.
# -> core table is dropped. trailing single-cell row is emitted as Title
# (this is the behavior in the reported bug)
#
# a b -> [ Title(e) ]
# c d
# e
# -- ❌ CASE 5: There are two-or-more trailing single-cell rows.
# -> core table is dropped. trailing single-cell rows are each emitted as a Title
#
# a b -> [ Title(e),
# c d Title(f) ]
# e
# f
# -- ✅ CASE 6: There are exactly one each leading and trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b c Table(b, c, d, e),
# d e Title(f) ]
# f
# -- ✅ CASE 7: There are two leading and one trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b Title(b),
# c d Table(c, d, e, f),
# e f Title(g) ]
# g
# -- ✅ CASE 8: There are two-or-more leading and trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b Title(b),
# c d Table(c, d, e, f),
# e f Title(g),
# g Title(h) ]
# h
# -- ❌ CASE 9: Single-row subtable, no single-cell rows above or below.
# -> First cell is mistakenly emitted as title, remaining cells are dropped.
#
# a b c -> [ Title(a) ]
# -- ❌ CASE 10: Single-row subtable with one leading single-cell row.
# -> Leading single-row cell is correctly identified as title, core-table is mis-identified
# as a `Title` and truncated.
#
# a -> [ Title(a),
# b c d Title(b) ]
```
9.0 KiB
9.0 KiB