mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-06-27 02:30:08 +00:00

Fixes a recursion limit error that was raised when partitioning Excel documents of a certain size. Previously we used a recursive method to find subtables within an Excel sheet. However, this would run afoul of Python's recursion depth limit when there was a contiguous block of more than 1000 cells within a sheet. This function has been updated to use the NetworkX library, which avoids Python recursion issues.

* Updated `_get_connected_components` to use `networkx` graph methods rather than implementing our own algorithm for finding contiguous groups of cells within a sheet.
* Added a test and example doc that replicates the `RecursionError` prior to the change.
* Added `networkx` to `extra_xlsx` dependencies and `pip-compile`d.

#### Testing:

The following, run from a Python terminal, should raise a `RecursionError` on `main` and succeed on this branch:

```python
import sys
from unstructured.partition.xlsx import partition_xlsx

# Remember the current limit so it can be restored afterwards.
old_recursion_limit = sys.getrecursionlimit()
try:
    # Pin the limit to the documented default of 1000.
    sys.setrecursionlimit(1000)
    filename = "example-docs/more-than-1k-cells.xlsx"
    partition_xlsx(filename=filename)
finally:
    sys.setrecursionlimit(old_recursion_limit)
```

Note: the recursion limit differs between contexts. On my own system, the default in a notebook seems to be 3000, but in a terminal it's 1000. The documented Python default recursion limit is 1000.
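To illustrate the failure mode described above, here is a minimal standard-library sketch. It is not the code from this change: the actual fix uses `networkx`, whereas this sketch swaps in an explicit stack to show why a non-recursive traversal avoids the limit. The function names `recursive_component` and `iterative_component`, and the representation of a sheet as a set of `(row, col)` tuples, are hypothetical and chosen only for illustration.

```python
import sys

def neighbors(cell):
    """Return the four cells that share an edge with `cell`."""
    row, col = cell
    return ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1))

def recursive_component(cells, start, seen=None):
    """Recursive flood fill: uses one Python stack frame per visited cell."""
    if seen is None:
        seen = set()
    seen.add(start)
    for neighbor in neighbors(start):
        if neighbor in cells and neighbor not in seen:
            recursive_component(cells, neighbor, seen)
    return seen

def iterative_component(cells, start):
    """Same traversal with an explicit stack, so call depth stays constant."""
    seen = {start}
    stack = [start]
    while stack:
        cell = stack.pop()
        for neighbor in neighbors(cell):
            if neighbor in cells and neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen

# Hypothetical sheet: a single column of 2000 contiguous non-empty cells.
cells = {(row, 0) for row in range(2000)}

old_limit = sys.getrecursionlimit()
try:
    sys.setrecursionlimit(1000)
    try:
        recursive_component(cells, (0, 0))
        recursive_failed = False
    except RecursionError:
        recursive_failed = True
finally:
    sys.setrecursionlimit(old_limit)

component = iterative_component(cells, (0, 0))
print(recursive_failed, len(component))
```

With the limit set to 1000, the recursive version fails on the 2000-cell block while the iterative version returns every cell in it. A graph library that computes connected components without Python-level recursion sidesteps the problem in the same way.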
6.9 KiB