unstructured/test_unstructured_ingest/expected-structured-output/local-single-file-with-pdf-infer-table-structure
Yao You 32df4ee1c6
fix: disable table_as_cells output by default (#3093)
This PR changes the output of table elements: by default, a table
element's `metadata.table_as_cells` is now `None`. The data is only
populated when the environment variable `EXTRACT_TABLE_AS_CELLS` is set to `true`.

`table_as_cells` was originally designed for evaluating table
extraction performance. The format itself is not as readable as the
`table_as_html` metadata for human or RAG consumption. Therefore this
data is not needed by default.

Since this output is meant for evaluation, this PR uses an
environment variable to control whether it is present in the
partitioned results. This approach avoids adding parameters to the
`partition` function call. Adding a new parameter to the `partition`
interface would increase the complexity of the interface and add
maintenance cost, since the parameter would have to be passed down a
long chain of function calls to where it is needed.
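
A minimal sketch of this design choice follows. The helper names `extract_table_as_cells_enabled` and `build_table_metadata` are hypothetical and are not part of `unstructured`; only the variable name `EXTRACT_TABLE_AS_CELLS` and the value `true` come from this PR. The case-insensitive comparison is an assumption about how the value might be parsed.

```python
import os


def extract_table_as_cells_enabled() -> bool:
    """Hypothetical helper: True only when EXTRACT_TABLE_AS_CELLS is "true"."""
    # Assumption: surrounding whitespace and letter case are ignored.
    value = os.environ.get("EXTRACT_TABLE_AS_CELLS", "false")
    return value.strip().lower() == "true"


def build_table_metadata(html, cells):
    """Hypothetical stand-in for code deep in the call chain that fills metadata."""
    # The flag is read where it is needed, so no caller has to pass it down.
    return {
        "table_as_html": html,
        "table_as_cells": cells if extract_table_as_cells_enabled() else None,
    }
```

Because the flag is read at the point of use, none of the intermediate functions in the call chain need a new parameter.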

## test

Run the following code snippet on main and on this PR:

```python
from unstructured.partition.auto import partition

elements = partition("example-docs/layout-parser-paper-with-table.pdf", strategy="hi_res", skip_infer_table_types=[])
# Collect the cell-level data of every table element (None when absent).
table_cells = [getattr(element.metadata, "table_as_cells", None) for element in elements if element.category == "Table"]
```

On the main branch, `table_cells` contains structured cell data, but on
this branch it is a list of `None` values.

However, if we first set the following in a terminal:

```bash
export EXTRACT_TABLE_AS_CELLS=true
```

and then run the same code again with this PR, `table_cells`
contains actual data, the same as on the main branch.
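
For evaluation scripts, the variable could also be set from Python for a limited scope. The sketch below uses only the standard library; `temporary_env` is a hypothetical helper, not part of `unstructured`. It assumes the library reads `EXTRACT_TABLE_AS_CELLS` when partitioning runs rather than at import time, which this PR description does not state.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name, value):
    """Hypothetical helper: set an environment variable, then restore it."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore the earlier state, removing the variable if it was unset.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


# Usage sketch (requires unstructured and the example document):
# with temporary_env("EXTRACT_TABLE_AS_CELLS", "true"):
#     elements = partition(..., strategy="hi_res", skip_infer_table_types=[])
```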

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2024-05-24 16:41:25 +00:00