mirror of https://github.com/deepset-ai/haystack.git synced 2025-06-26 22:00:13 +00:00

Christian Clauss 6dd52d91b2

ci: Fix typos discovered by codespell (#5778 )

* Fix typos discovered by codespell

* pylint: max-args = 38

2023-09-13 16:14:45 +02:00


Title: TableCell Dataclass
Decision driver: Sebastian Lee
Start Date: 2023-01-17
Proposal PR: https://github.com/deepset-ai/haystack/pull/3875
Github Issue: https://github.com/deepset-ai/haystack/issues/3616

Summary

When returning answers for a TableQA pipeline, we would like to return the column and row index as the answer location within the table, since the table is returned as a list of lists in Haystack. This would allow users to easily look up the answer in the returned table to fetch the text directly from the table, identify the row or column labels for that answer, or generally perform operations on the table near or around the answer cell.

Basic Example

When applicable, write a snippet of code showing how the new feature would be used.

import pandas as pd
from haystack.nodes import TableReader
from haystack import Document

data = {
    "actors": ["brad pitt", "leonardo di caprio", "george clooney"],
    "age": ["58", "47", "60"],
    "number of movies": ["87", "53", "69"],
    "date of birth": ["18 december 1963", "11 november 1974", "6 may 1961"],
}
table_doc = Document(content=pd.DataFrame(data), content_type="table")
reader = TableReader(model_name_or_path="google/tapas-base-finetuned-wtq", max_seq_len=128)
prediction = reader.predict(query="Who was in the most number of movies?", documents=[table_doc])
answer = prediction["answers"][0]

# New feature
# answer.context -> [["actors", "age", "number of movies", "date of birth"], ["brad pitt", ...], [...]]
# answer.offsets_in_context[0] -> TableCell(row=1, col=0), i.e. the "brad pitt" cell below the header row
print(answer.context[answer.offsets_in_context[0].row][answer.offsets_in_context[0].col])

Motivation

Why do we need this feature?

To allow users to easily look up the answer cell in the returned table to fetch the answer text directly from the table, identify the row or column labels for that answer, or generally perform operations on the table near or around the answer cell.

Currently, we return the location of the answer in the linearized version of the table, so we can use the Span dataclass. The Span dataclass is reproduced below:

@dataclass
class Span:
    start: int
    end: int
    """
    Defining a sequence of characters (Text span) or cells (Table span) via start and end index.
    For extractive QA: Character where answer starts/ends
    For TableQA: Cell where the answer starts/ends (counted from top left to bottom right of table)

    :param start: Position where the span starts
    :param end:  Position where the span ends
    """

This is inconvenient for users because they would need to know how the table is linearized (column major or row major) so they could reconstruct the column and row indices of the answer before they could locate the answer cell in the table.
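To make the inconvenience concrete, here is a minimal sketch of the conversion a user currently has to write by hand. It assumes the table is linearized in row-major order (top left to bottom right, as the Span docstring suggests) and that the user already knows the number of columns. The function name `linear_to_row_col` is hypothetical and not part of Haystack.

```python
from typing import Tuple


def linear_to_row_col(index: int, n_cols: int) -> Tuple[int, int]:
    """Convert a row-major linearized cell index into (row, col).

    Assumption: cells are counted row by row, starting at 0 in the top-left cell.
    """
    if n_cols <= 0:
        raise ValueError("n_cols must be positive")
    if index < 0:
        raise ValueError("index must be non-negative")
    # divmod returns (index // n_cols, index % n_cols)
    return divmod(index, n_cols)


# The example table has 4 columns; linearized cell 4 is the first cell of the second row.
print(linear_to_row_col(4, n_cols=4))  # (1, 0)
```

If the table were linearized column by column instead, the same index would map to a different cell, which is exactly the ambiguity the proposal wants to remove.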

What use cases does it support?

Some examples are already stated above. To recap: it lets users easily perform operations on the table near or around the answer cell.

What's the expected outcome?

The addition of a new dataclass called TableCell that would look like this:

@dataclass
class TableCell:
    row: int
    col: int
    """
    Defining a table cell via the row and column index.

    :param row: Row index of the cell
    :param col: Column index of the cell
    """

Detailed design

New terminology: TableCell, the new name for the dataclass to store the column and row index of the answer cell.

Basic example: see the Basic Example section above.

Code changes

Addition of TableCell dataclass to https://github.com/deepset-ai/haystack/blob/main/haystack/schema.py

@dataclass
class TableCell:
    row: int
    col: int
    """
    Defining a table cell via the row and column index.

    :param row: Row index of the cell
    :param col: Column index of the cell
    """

Updating code (e.g. schema objects, classes, functions) that use Span to also support TableCell where appropriate. This includes:
Updating the Answer dataclass to support TableCell as a valid type for offsets_in_document and offsets_in_context

@dataclass
class Answer:
    answer: str
    type: Literal["generative", "extractive", "other"] = "extractive"
    score: Optional[float] = None
    context: Optional[Union[str, List[List]]] = None
    # Optional takes a single type argument, so the two list types are wrapped in a Union
    offsets_in_document: Optional[Union[List[Span], List[TableCell]]] = None
    offsets_in_context: Optional[Union[List[Span], List[TableCell]]] = None
    document_id: Optional[str] = None
    meta: Optional[Dict[str, Any]] = None

Similar to how we can return a list of Spans, we would allow a list of TableCells to be returned to handle the case when multiple TableCells are returned to form a final answer.
Updating any functions that accept table answers as input to use the new col and row variables instead of start and end variables. This type of check for table answers is most likely already done by checking if the context is of type pd.DataFrame.
TableReader and RCIReader to return TableCell objects instead of Span.
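The following is a rough, self-contained sketch of how a table answer with several TableCell offsets could be consumed. The dataclasses are trimmed-down stand-ins for the ones in haystack/schema.py, and the helper `answer_cell_texts` is hypothetical. It assumes the context is already a list of lists with the column labels in the first row.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Union


@dataclass
class Span:
    start: int
    end: int


@dataclass
class TableCell:
    row: int
    col: int


@dataclass
class Answer:
    # Reduced to the fields needed for this illustration
    answer: str
    context: Optional[Union[str, List[List[Any]]]] = None
    offsets_in_context: Optional[Union[List[Span], List[TableCell]]] = None


def answer_cell_texts(answer: Answer) -> List[str]:
    """Return the text of every table cell referenced by a table answer."""
    # Text answers (string context) or answers without offsets have no cells to look up
    if not isinstance(answer.context, list) or not answer.offsets_in_context:
        return []
    return [
        str(answer.context[cell.row][cell.col])
        for cell in answer.offsets_in_context
        if isinstance(cell, TableCell)
    ]


table = [["actors", "age"], ["brad pitt", "58"], ["george clooney", "60"]]
answer = Answer(
    answer="brad pitt, george clooney",
    context=table,
    offsets_in_context=[TableCell(row=1, col=0), TableCell(row=2, col=0)],
)
print(answer_cell_texts(answer))  # ['brad pitt', 'george clooney']
```

Checking the context type before indexing mirrors the existing way table answers are distinguished from text answers.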

Changes related to the Edge Case/Bug below

Update Document.content and Answer.context to use List[List] instead of pd.DataFrame.
Update TableReader nodes to convert table from List[List] into pd.DataFrame before inputting to the model.
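As an illustration of the conversion step, a reader node could do something along these lines before passing the table to the model. This is a sketch under the assumption that the first row of the list of lists holds the column labels. The helper names `table_to_dataframe` and `dataframe_to_table` are hypothetical.

```python
from typing import Any, List

import pandas as pd


def table_to_dataframe(table: List[List[Any]]) -> pd.DataFrame:
    """Build a DataFrame from a list of lists whose first row holds the column labels."""
    if not table:
        return pd.DataFrame()
    header, *rows = table
    return pd.DataFrame(rows, columns=header)


def dataframe_to_table(df: pd.DataFrame) -> List[List[Any]]:
    """Inverse conversion: the column labels become the first row again."""
    return [df.columns.tolist()] + df.to_numpy().tolist()


table = [
    ["actors", "age"],
    ["brad pitt", "58"],
    ["george clooney", "60"],
]
df = table_to_dataframe(table)

# A cell at (row=1, col=0) in the list of lists sits at iloc[0, 0] in the DataFrame,
# because the header row is not a data row there.
row, col = 1, 0
print(table[row][col], df.iloc[row - 1, col])
```

The off-by-one between the two representations is the reason the offsets need a single, agreed-upon table format.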

Edge Case/Bug

Internally, Haystack stores a table as a pandas DataFrame in the Answer dataclass, which does not treat the column labels as the first row of the table. In Haystack's rest-api, however, the table is converted into a list of lists in which the column labels are stored as the first row, which can be seen here. This is consistent with the Document.to_dict() method seen here.

This means that the current Span and the new TableCell dataclasses point to the wrong location once the table is converted to a list of lists.

For example, the following code prints two different cells for the same offset:

import pandas as pd
from haystack import Document

data = {
    "actors": ["brad pitt", "leonardo di caprio", "george clooney"],
    "age": ["58", "47", "60"],
    "number of movies": ["87", "53", "69"],
    "date of birth": ["18 december 1963", "11 november 1974", "6 may 1961"],
}
table_doc = Document(content=pd.DataFrame(data), content_type="table")
span = (0, 0)
print(table_doc.content.iloc[span])  # prints "brad pitt"

dict_table_doc = table_doc.to_dict()
print(dict_table_doc["content"][span[0]][span[1]])  # prints "actors"

We have decided to store the table internally as a list of lists to avoid this issue. See discussion starting here.

Drawbacks

Look at the feature from the other side: what are the reasons why we should not work on it? Consider the following:

What's the implementation cost, both in terms of code size and complexity?

I don't believe this will require much code change, since we already check for table-like answers by checking whether the returned context is of type string or pandas DataFrame.

Can the solution you're proposing be implemented as a separate package, outside of Haystack?

Technically yes, but since it affects core classes like TableReader and RCIReader, it makes sense to implement it in Haystack.

Does it teach people more about Haystack?

It would update already existing documentation and tutorials of Haystack.

How does this feature integrate with other existing and planned features?

This feature directly integrates and impacts the TableQA feature of Haystack.

What's the cost of migrating existing Haystack pipelines (is it a breaking change?)?

Yes, there are breaking changes that would affect end users:

The way to access the offsets in returned Answers would be different. Following the deprecation policy, we will support both Span and TableCell (selectable with a boolean flag) for 2 additional versions of Haystack.
Tables in Haystack Documents and Answers will change from type pandas DataFrame to a list of lists.
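One way the boolean toggle during the deprecation period could look is sketched below. The flag name `return_table_cell`, the function `format_offsets`, and the legacy convention (row-major index with an exclusive end) are assumptions made for illustration, not a description of the actual implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Span:
    start: int
    end: int


@dataclass
class TableCell:
    row: int
    col: int


def format_offsets(
    cells: List[Tuple[int, int]], n_cols: int, return_table_cell: bool
) -> List[Union[Span, TableCell]]:
    """Return answer offsets either as TableCell objects or as legacy linearized Spans."""
    if return_table_cell:
        return [TableCell(row=row, col=col) for row, col in cells]
    # Legacy behaviour (assumed): row-major linearized index, each Span covering one cell
    return [
        Span(start=row * n_cols + col, end=row * n_cols + col + 1)
        for row, col in cells
    ]


print(format_offsets([(1, 0)], n_cols=4, return_table_cell=True))   # [TableCell(row=1, col=0)]
print(format_offsets([(1, 0)], n_cols=4, return_table_cell=False))  # [Span(start=4, end=5)]
```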

Alternatives

What's the impact of not adding this feature?

Users would have to figure out how to interpret the linearized answer cell coordinates and reconstruct the row and column indices themselves before they could access the answer cell in the returned table.

Other designs

Expand the Span dataclass to have optional col and row fields. This would require a similar check as TableCell, but would instead require checking which of the fields are populated, which seems unnecessarily complex.

@dataclass
class Span:
    # All fields become optional, so callers must check which pair is populated
    start: Optional[int] = None
    end: Optional[int] = None
    col: Optional[int] = None
    row: Optional[int] = None

Use the existing Span dataclass and put the row index and column index as the start and end respectively. This may be confusing to users since it is not obvious that start should refer to row and end should refer to column.

answer_cell_offset = Span(start=row_idx, end=col_idx)

Provide a convenience function shown here to help users convert the linearized Span back to row and column indices. I believe this solution is not ideal, since a user of the rest-api would need access to a Python function to convert the linearized indices back into row and column indices.

Adoption strategy

How will the existing Haystack users adopt it?

Haystack users would immediately experience this change once they update their installation of Haystack if they were using the TableQA reader. This would be a breaking change since it would change the offsets_in_document and offsets_in_context in the returned Answer. I'm not sure if there would be a straightforward way to write a migration script for this change.

How we teach this

Would implementing this feature mean the documentation must be re-organized or updated? Does it change how Haystack is taught to new developers at any level?

The API docs for TableCell would need to be added.
The documentation page for Table Question Answering would need to be updated.
Update the TableQA tutorial (https://github.com/deepset-ai/haystack-tutorials/blob/main/tutorials/15_TableQA.ipynb) to reflect that the answer offsets are no longer linearized.

Unresolved questions

No more unresolved questions.
