- Title: TableCell Dataclass
- Decision driver: Sebastian Lee
- Start Date: 2023-01-17
- Proposal PR: https://github.com/deepset-ai/haystack/pull/3875
- GitHub Issue: https://github.com/deepset-ai/haystack/issues/3616
Summary
When returning answers for a TableQA pipeline, we would like to return the column and row index as the answer location within the table, since the table is returned as a list of lists in Haystack. This would allow users to easily look up the answer in the returned table to fetch the text directly from the table, identify the row or column labels for that answer, or generally perform operations on the table near or around the answer cell.
Basic Example
When applicable, write a snippet of code showing how the new feature would be used.
import pandas as pd
from haystack.nodes import TableReader
from haystack import Document
data = {
"actors": ["brad pitt", "leonardo di caprio", "george clooney"],
"age": ["58", "47", "60"],
"number of movies": ["87", "53", "69"],
"date of birth": ["18 december 1963", "11 november 1974", "6 may 1961"],
}
table_doc = Document(content=pd.DataFrame(data), content_type="table")
reader = TableReader(model_name_or_path="google/tapas-base-finetuned-wtq", max_seq_len=128)
prediction = reader.predict(query="Who was in the most number of movies?", documents=[table_doc])
answer = prediction["answers"][0]
# New feature
# answer.context -> [["actors", "age", "number of movies", "date of birth"], ["brad pitt", ...], [...]]
# answer.offsets_in_context[0] -> TableCell(row=1, col=0)
print(answer.context[answer.offsets_in_context[0].row][answer.offsets_in_context[0].col])
Motivation
Why do we need this feature?
To allow users to easily look up the answer cell in the returned table to fetch the answer text directly from the table, identify the row or column labels for that answer, or generally perform operations on the table near or around the answer cell.
Currently, we return the location of the answer in the linearized version of the table, so we can use the Span dataclass. The Span dataclass is reproduced below:
@dataclass
class Span:
    start: int
    end: int
    """
    Defining a sequence of characters (Text span) or cells (Table span) via start and end index.
    For extractive QA: Character where answer starts/ends
    For TableQA: Cell where the answer starts/ends (counted from top left to bottom right of table)
    :param start: Position where the span starts
    :param end: Position where the span ends
    """
This is inconvenient for users because they would need to know how the table is linearized (column major or row major) so they could reconstruct the column and row indices of the answer before they could locate the answer cell in the table.
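To make the burden concrete, here is a minimal sketch of the conversion a user currently has to write themselves. It assumes row-major linearization (cells counted from top left to bottom right, one row at a time, as the Span docstring suggests) and that the user already knows the number of columns. The helper name `linear_to_row_col` is hypothetical and not part of Haystack.

```python
from typing import Tuple


def linear_to_row_col(index: int, num_cols: int) -> Tuple[int, int]:
    """Convert a row-major linear cell index into (row, col).

    Hypothetical helper, not part of Haystack. Assumes cells are counted
    from top left to bottom right, one full row at a time.
    """
    if num_cols <= 0:
        raise ValueError("num_cols must be positive")
    if index < 0:
        raise ValueError("index must be non-negative")
    # Quotient is the row, remainder is the column within that row.
    return divmod(index, num_cols)


# In a table with 4 columns, linear index 6 is the third cell of the second row.
row, col = linear_to_row_col(6, num_cols=4)
print(row, col)  # prints "1 2"
```

If the table were linearized column major instead, the same index would map to a different cell, which is exactly the ambiguity this proposal removes.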
What use cases does it support?
Some examples are already stated above. To recap, it lets users easily perform operations on the table near or around the answer cell.
What's the expected outcome?
The addition of a new dataclass called TableCell that would look like:
@dataclass
class TableCell:
    row: int
    col: int
    """
    Defining a table cell via the row and column index.
    :param row: Row index of the cell
    :param col: Column index of the cell
    """
Detailed design
New terminology: TableCell, the new name for the dataclass to store the column and row index of the answer cell.
Basic Example: see the Basic Example above.
Code changes
- Addition of TableCell dataclass to https://github.com/deepset-ai/haystack/blob/main/haystack/schema.py
@dataclass
class TableCell:
    row: int
    col: int
    """
    Defining a table cell via the row and column index.
    :param row: Row index of the cell
    :param col: Column index of the cell
    """
- Updating code (e.g. schema objects, classes, functions) that use Span to also support TableCell where appropriate. This includes:
  - Updating the Answer dataclass to support TableCell as a valid type for offsets_in_document and offsets_in_context
@dataclass
class Answer:
    answer: str
    type: Literal["generative", "extractive", "other"] = "extractive"
    score: Optional[float] = None
    context: Optional[Union[str, List[List]]] = None
    offsets_in_document: Optional[Union[List[Span], List[TableCell]]] = None
    offsets_in_context: Optional[Union[List[Span], List[TableCell]]] = None
    document_id: Optional[str] = None
    meta: Optional[Dict[str, Any]] = None
  - Similar to how we can return a list of Spans, we would allow a list of TableCells to be returned to handle the case when multiple TableCells are returned to form a final answer.
  - Updating any functions that accept table answers as input to use the new col and row variables instead of start and end variables. This type of check for table answers is most likely already done by checking if the context is of type pd.DataFrame.
  - TableReader and RCIReader to return TableCell objects instead of Span.
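As a rough illustration of the kind of update this implies, the sketch below branches on the offset type itself rather than on the type of the context. `Span` and `TableCell` are redefined locally as stand-ins so the snippet is self-contained, and `describe_offset` is a hypothetical function, not an existing Haystack API.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Span:
    # Local stand-in for haystack.schema.Span
    start: int
    end: int


@dataclass
class TableCell:
    # Local stand-in for the proposed dataclass
    row: int
    col: int


def describe_offset(offset: Union[Span, TableCell]) -> str:
    """Hypothetical consumer that handles both offset types."""
    if isinstance(offset, TableCell):
        return f"cell at row {offset.row}, column {offset.col}"
    return f"characters {offset.start} to {offset.end}"


print(describe_offset(TableCell(row=1, col=0)))  # prints "cell at row 1, column 0"
print(describe_offset(Span(start=10, end=19)))   # prints "characters 10 to 19"
```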
Changes related to the Edge Case/Bug below
- Update Document.content and Answer.context to use List[List] instead of pd.DataFrame.
- Update TableReader nodes to convert table from List[List] into pd.DataFrame before inputting to the model.
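The conversion in the last step could look roughly like the sketch below. It assumes the first inner list holds the column labels, matching the list of lists format described in the next section. The helper name `table_to_dataframe` is hypothetical and not an existing Haystack function.

```python
from typing import Any, List

import pandas as pd


def table_to_dataframe(table: List[List[Any]]) -> pd.DataFrame:
    """Hypothetical helper: build a DataFrame from a list of lists.

    Assumes the first inner list contains the column labels and the
    remaining inner lists contain the data rows.
    """
    if not table:
        return pd.DataFrame()
    header, *rows = table
    return pd.DataFrame(rows, columns=header)


table = [
    ["actors", "age"],
    ["brad pitt", "58"],
    ["george clooney", "60"],
]
df = table_to_dataframe(table)
print(list(df.columns))  # prints "['actors', 'age']"
print(df.iloc[0, 0])     # prints "brad pitt"
```

Note that row 1 of the list of lists becomes row 0 of the DataFrame, so any cell coordinates predicted against the DataFrame would need to be shifted by one row before being reported against the list of lists.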
Edge Case/Bug
Internally, Haystack stores a table as a pandas DataFrame in the Answer dataclass, which does not treat the column labels as the first row in the table.
However, in Haystack's rest-api the table is converted into a list of lists format where the column labels are stored as the first row, which can be seen here, which is consistent with the Document.to_dict() method seen here.
This means that the current Span and (new) TableCell dataclass point to the wrong location when the table is converted to a list of lists.
For example, the following code prints two different cells for the same coordinates:
import pandas as pd
from haystack import Document
data = {
"actors": ["brad pitt", "leonardo di caprio", "george clooney"],
"age": ["58", "47", "60"],
"number of movies": ["87", "53", "69"],
"date of birth": ["18 december 1963", "11 november 1974", "6 may 1961"],
}
table_doc = Document(content=pd.DataFrame(data), content_type="table")
span = (0, 0)
print(table_doc.content.iloc[span]) # prints "brad pitt"
dict_table_doc = table_doc.to_dict()
print(dict_table_doc["content"][span[0]][span[1]]) # prints "actors"
We have decided to store the table internally as a list of lists to avoid this issue. See discussion starting here.
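Under that decision, one indexing convention applies both internally and in the REST API output. The sketch below assumes the column labels occupy row 0 of the list of lists and that TableCell is defined as proposed (it is redefined locally here so the snippet runs on its own).

```python
from dataclasses import dataclass


@dataclass
class TableCell:
    # Local stand-in for the proposed dataclass
    row: int
    col: int


# Table stored as a list of lists, with the column labels as the first row.
table = [
    ["actors", "age", "number of movies", "date of birth"],
    ["brad pitt", "58", "87", "18 december 1963"],
    ["leonardo di caprio", "47", "53", "11 november 1974"],
    ["george clooney", "60", "69", "6 may 1961"],
]

cell = TableCell(row=1, col=0)

answer_text = table[cell.row][cell.col]  # the answer cell itself
column_label = table[0][cell.col]        # label of the answer's column
same_row = table[cell.row]               # all values in the answer's row

print(answer_text)   # prints "brad pitt"
print(column_label)  # prints "actors"
print(same_row[2])   # prints "87"
```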
Drawbacks
Look at the feature from the other side: what are the reasons why we should not work on it? Consider the following:
- What's the implementation cost, both in terms of code size and complexity?
I don't believe this will require much code change, since we already detect table-like answers by checking whether the returned context is of type string or pandas DataFrame.
- Can the solution you're proposing be implemented as a separate package, outside of Haystack?
Technically yes, but since it affects core classes like TableReader and RCIReader it makes sense to implement in Haystack.
- Does it teach people more about Haystack?
It would update already existing documentation and tutorials of Haystack.
- How does this feature integrate with other existing and planned features?
This feature directly integrates and impacts the TableQA feature of Haystack.
- What's the cost of migrating existing Haystack pipelines (is it a breaking change?)?
Yes, there are breaking changes that would affect end users.
  - The way to access the offsets in returned Answers would be different. Following the deprecation policy we will support both Span and TableCell (can be toggled between using a boolean flag) for 2 additional versions of Haystack.
  - Tables in Haystack Documents and Answers will change from type pandas DataFrame to a list of lists.
Alternatives
What's the impact of not adding this feature?
Users would have to figure out how to interpret the linearized answer cell coordinates and reconstruct the row and column indices before they could access the answer cell in the returned table.
Other designs
- Expand Span dataclass to have optional col and row fields. This would require a similar check as TableCell, but instead require checking for which of the elements are populated, which seems unnecessarily complex.
@dataclass
class Span:
    start: Optional[int] = None
    end: Optional[int] = None
    col: Optional[int] = None
    row: Optional[int] = None
- Use the existing Span dataclass and put the row index and column index as the start and end respectively. This may be confusing to users since it is not obvious that start should refer to row and end should refer to column.
answer_cell_offset = Span(start=row_idx, end=col_idx)
- Provide a convenience function shown here to help users convert the linearized Span back to row and column indices. I believe this solution is non-ideal since it would require a user of the rest_api to access a python function to convert the linearized indices back into row and column indices.
Adoption strategy
How will the existing Haystack users adopt it?
Haystack users would immediately experience this change once they update their installation of Haystack if they were using the TableQA reader. This would be a breaking change since it would change the offsets_in_document and offsets_in_context in the returned Answer. I'm not sure if there would be a straightforward way to write a migration script for this change.
How we teach this
Would implementing this feature mean the documentation must be re-organized or updated? Does it change how Haystack is taught to new developers at any level?
- The API docs for TableCell would need to be added.
- The documentation page for Table Question Answering would need to be updated.
- Update the [TableQA tutorial](https://github.com/deepset-ai/haystack-tutorials/blob/main/tutorials/15_TableQA.ipynb) to reflect that the Span is no longer linearized.
Unresolved questions
No more unresolved questions.