mirror of
https://github.com/deepset-ai/haystack.git
synced 2025-12-12 15:27:06 +00:00
proposal: TableCell (#3875)
* Initial commit for TableSpan proposal * Updating the proposal * More updates to the proposal * More changes * Rename of file per Proposal instructions * Update link * Adding drawbacks * Fixing typos * Changed TableSpan to TableCell and updated proposal based on discussions. * Adding discussion on identified bug. * Rename proposal to reflect name change made during discussion. Added point to make it clear that we will be able to return a List of TableCells * Update proposal with discussion about storing table as a list of lists * Adding some additional code change descriptions.
This commit is contained in:
parent
8002cf92d6
commit
96706e9e7b
238
proposals/text/3875-table-cell.md
Normal file
238
proposals/text/3875-table-cell.md
Normal file
@ -0,0 +1,238 @@
|
||||
- Title: TableCell Dataclass
|
||||
- Decision driver: Sebastian Lee
|
||||
- Start Date: 2023-01-17
|
||||
- Proposal PR: https://github.com/deepset-ai/haystack/pull/3875
|
||||
- Github Issue: https://github.com/deepset-ai/haystack/issues/3616
|
||||
|
||||
# Summary
|
||||
|
||||
When returning answers for a TableQA pipeline we would like to return the column and row index as the answer location
|
||||
within the table since the table is either returned as a list of lists in Haystack.
|
||||
This would allow users to easily look up the answer in the returned table to fetch the text directly from the table,
|
||||
identify the row or column labels for that answer, or generally perform operations on the table near or around the
|
||||
answer cell.
|
||||
|
||||
# Basic Example
|
||||
|
||||
When applicable, write a snippet of code showing how the new feature would be used.
|
||||
```python
|
||||
import pandas as pd
|
||||
from haystack.nodes import TableReader
|
||||
from haystack import Document
|
||||
|
||||
data = {
|
||||
"actors": ["brad pitt", "leonardo di caprio", "george clooney"],
|
||||
"age": ["58", "47", "60"],
|
||||
"number of movies": ["87", "53", "69"],
|
||||
"date of birth": ["18 december 1963", "11 november 1974", "6 may 1961"],
|
||||
}
|
||||
table_doc = Document(content=pd.DataFrame(data), content_type="table")
|
||||
reader = TableReader(model_name_or_path="google/tapas-base-finetuned-wtq", max_seq_len=128)
|
||||
prediction = reader.predict(query="Who was in the most number of movies?", documents=[table_doc])
|
||||
answer = prediction["answers"][0]
|
||||
|
||||
# New feature
|
||||
# answer.context -> [["actor", "age", "number of movies"], ["Brad Pitt",...], [...]]
|
||||
# answer.offsets_in_context[0] -> (row=1, col=1)
|
||||
print(answer.context[answer.offsets_in_context[0].row][answer.offsets_in_context[0].col])
|
||||
```
|
||||
|
||||
# Motivation
|
||||
|
||||
## Why do we need this feature?
|
||||
To allow users to easily look up the answer cell in the returned table to fetch the answer text
|
||||
directly from the table, identify the row or column labels for that answer, or generally perform operations on the table
|
||||
near or around the answer cell.
|
||||
|
||||
Currently, we return the location of the answer in the **linearized** version of the table, so we can use the
|
||||
`Span` dataclass. The `Span` dataclass is reproduced below:
|
||||
```python
|
||||
@dataclass
|
||||
class Span:
|
||||
start: int
|
||||
end: int
|
||||
"""
|
||||
Defining a sequence of characters (Text span) or cells (Table span) via start and end index.
|
||||
For extractive QA: Character where answer starts/ends
|
||||
For TableQA: Cell where the answer starts/ends (counted from top left to bottom right of table)
|
||||
|
||||
:param start: Position where the span starts
|
||||
:param end: Position where the span ends
|
||||
"""
|
||||
```
|
||||
This is inconvenient for users because they would need to know how the table is linearized (column major or row major)
|
||||
so they could reconstruct the column and row indices of the answer before they could locate the answer cell in the table.
|
||||
|
||||
## What use cases does it support?
|
||||
Some examples are already stated above but to recap, to easily perform operations on the table near or around the answer cell.
|
||||
|
||||
## What's the expected outcome?
|
||||
The addition of a new dataclass called `TableCell` that would look like
|
||||
```python
|
||||
@dataclass
|
||||
class TableCell:
|
||||
row: int
|
||||
col: int
|
||||
"""
|
||||
Defining a table cell via the row and column index.
|
||||
|
||||
:param row: Row index of the cell
|
||||
:param col: Column index of the cell
|
||||
"""
|
||||
```
|
||||
|
||||
# Detailed design
|
||||
|
||||
**New terminology:** `TableCell`, the new name for the dataclass to
|
||||
store the column and row index of the answer cell.
|
||||
|
||||
**Basic Example:** [Above Basic Example](#basic-example)
|
||||
|
||||
## Code changes
|
||||
- Addition of `TableCell` dataclass to https://github.com/deepset-ai/haystack/blob/main/haystack/schema.py
|
||||
```python
|
||||
@dataclass
|
||||
class TableCell:
|
||||
row: int
|
||||
col: int
|
||||
"""
|
||||
Defining a table cell via the row and column index.
|
||||
|
||||
:param row: Row index of the cell
|
||||
:param col: Column index of the cell
|
||||
"""
|
||||
```
|
||||
|
||||
- Updating code (e.g. schema objects, classes, functions) that use `Span` to also support `TableCell` where appropriate.
|
||||
This includes:
|
||||
- Updating the `Answer` dataclass to support `TableCell` as a valid type for `offsets_in_document` and `offsets_in_context`
|
||||
```python
|
||||
@dataclass
|
||||
class Answer:
|
||||
answer: str
|
||||
type: Literal["generative", "extractive", "other"] = "extractive"
|
||||
score: Optional[float] = None
|
||||
context: Optional[Union[str, List[List]]] = None
|
||||
offsets_in_document: Optional[List[Span], List[TableCell]] = None
|
||||
offsets_in_context: Optional[List[Span], List[TableCell]] = None
|
||||
document_id: Optional[str] = None
|
||||
meta: Optional[Dict[str, Any]] = None
|
||||
```
|
||||
- Similar to how we can return a list of `Span`s, we would allow a list of `TableCell`s to be returned to handle the case
|
||||
when multiple `TableCell`s are returned to form a final answer.
|
||||
- Updating any functions that accept table answers as input to use the new `col` and `row` variables instead of `start` and `end` variables.
|
||||
This type of check for table answers is most likely already done by checking if the `context` is of type `pd.DataFrame`.
|
||||
- `TableReader` and `RCIReader` to return `TableCell` objects instead of `Span`.
|
||||
|
||||
Changes related to the Edge Case/Bug below
|
||||
- Update `Document.content` and `Answer.context` to use `List[List]` instead of `pd.DataFrame`.
|
||||
- Update `TableReader` nodes to convert table from `List[List]` into `pd.DataFrame` before inputting to the model.
|
||||
|
||||
## Edge Case/Bug
|
||||
Internally, Haystack stores a table as a pandas DataFrame in the `Answer` dataclass, which does not treat the column
|
||||
labels as the first row in the table.
|
||||
However, in Haystack's rest-api the table is converted into a list of lists format where the column labels are
|
||||
stored as the first row, which can be seen [here](https://github.com/deepset-ai/haystack/pull/3872), which is consistent
|
||||
with the `Document.to_dict()` method seen [here](https://github.com/deepset-ai/haystack/blob/6af4f14fe0d375a1ae0ced18930a9239401231c7/haystack/schema.py#L164-L165).
|
||||
|
||||
This means that the current `Span` and (new) `TableCell` dataclass point to the wrong location when the table is
|
||||
converted to a list of lists.
|
||||
|
||||
For example, the following code
|
||||
```python
|
||||
import pandas as pd
|
||||
from haystack import Document
|
||||
|
||||
data = {
|
||||
"actors": ["brad pitt", "leonardo di caprio", "george clooney"],
|
||||
"age": ["58", "47", "60"],
|
||||
"number of movies": ["87", "53", "69"],
|
||||
"date of birth": ["18 december 1963", "11 november 1974", "6 may 1961"],
|
||||
}
|
||||
table_doc = Document(content=pd.DataFrame(data), content_type="table")
|
||||
span = (0, 0)
|
||||
print(table_doc.content.iloc[span]) # prints "brad pitt"
|
||||
|
||||
dict_table_doc = table_doc.to_dict()
|
||||
print(dict_table_doc["content"][span[0]][span[1]]) # prints "actors"
|
||||
```
|
||||
|
||||
We have decided to store the table internally as a list of lists to avoid this issue. See discussion starting [here](https://github.com/deepset-ai/haystack/pull/3875#discussion_r1088766318).
|
||||
|
||||
# Drawbacks
|
||||
|
||||
Look at the feature from the other side: what are the reasons why we should _not_ work on it? Consider the following:
|
||||
|
||||
- What's the implementation cost, both in terms of code size and complexity?
|
||||
|
||||
I don't believe this will require too much code change since we already check for Table like answers by checking if the
|
||||
returned context is of type string or pandas Dataframe.
|
||||
|
||||
- Can the solution you're proposing be implemented as a separate package, outside of Haystack?
|
||||
|
||||
Technically yes, but since it affects core classes like `TableReader`, and `RCIReader` it makes sense to implement in
|
||||
Haystack.
|
||||
|
||||
- Does it teach people more about Haystack?
|
||||
|
||||
It would update already existing documentation and tutorials of Haystack.
|
||||
|
||||
- How does this feature integrate with other existing and planned features?
|
||||
|
||||
This feature directly integrates and impacts the TableQA feature of Haystack.
|
||||
|
||||
- What's the cost of migrating existing Haystack pipelines (is it a breaking change?)?
|
||||
|
||||
Yes there are breaking changes that would affect end users.
|
||||
1. The way to access the offsets in returned Answers would be different.
|
||||
Following the deprecation policy we will support both `Span` and `TableCell` (can be toggled between using a boolean flag)
|
||||
for 2 additional versions of Haystack.
|
||||
2. Tables in Haystack Documents and Answers will change from type pandas Dataframe to a list of lists.
|
||||
|
||||
# Alternatives
|
||||
|
||||
## What's the impact of not adding this feature?
|
||||
Requiring users to figure out how to interpret the linearized answer cell coordinates to reconstruct the row and column indices
|
||||
to be able to access the answer cell in the returned tabel.
|
||||
|
||||
## Other designs
|
||||
1. Expand `Span` dataclass to have optional `col` and `row` fields. This would require a similar check as `TableCell`, but instead
|
||||
require checking for which of the elements are populated, which seems unnecessarily complex.
|
||||
```python
|
||||
@dataclass
|
||||
class Span:
|
||||
start: int = None
|
||||
end: int = None
|
||||
col: int = None
|
||||
row: int = None
|
||||
```
|
||||
2. Use the existing `Span` dataclass and put the row index and column index as the `start` and `end` respectively.
|
||||
This may be confusing to users since it is not obvious that `start` should refer to `row` and `end` should refer to `column`.
|
||||
```python
|
||||
answer_cell_offset = Span(start=row_idx, end=col_idx)
|
||||
```
|
||||
3. Provide a convenience function shown [here](https://github.com/deepset-ai/haystack/issues/3616#issuecomment-1361300067)
|
||||
to help users convert the linearized `Span` back to row and column indices. I believe this solution is non-ideal since it would
|
||||
require a user of the rest_api to access a python function to convert the linearized indices back into row and column indices.
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
## How will the existing Haystack users adopt it?
|
||||
Haystack users would immediately experience this change once they update their installation of Haystack if they were using
|
||||
the TableQA reader. This would be a breaking change since it would change the `offsets_in_document` and
|
||||
`offsets_in_context` in the returned `Answer`. I'm not sure if there would be a straightforward way to write a migration
|
||||
script for this change.
|
||||
|
||||
# How we teach this
|
||||
|
||||
Would implementing this feature mean the documentation must be re-organized
|
||||
or updated? Does it change how Haystack is taught to new developers at any level?
|
||||
|
||||
- The API docs for `TableCell` would need to be added.
|
||||
- The documentation page for [Table Question Answering](https://docs.haystack.deepset.ai/docs/table_qa) would need to be updated.
|
||||
- Update the (TableQa tutorial)[https://github.com/deepset-ai/haystack-tutorials/blob/main/tutorials/15_TableQA.ipynb]
|
||||
to reflect the `Span` is no longer linearzied.
|
||||
|
||||
# Unresolved questions
|
||||
|
||||
No more unresolved questions.
|
||||
Loading…
x
Reference in New Issue
Block a user