Replace TableTextRetriever with EmbeddingRetriever in Tutorial 15 (#2479)

* replace TableTextRetriever with EmbeddingRetriever in Tutorial 15

* Update Documentation & Code Style

* fix bug

* Update Documentation & Code Style

* update tutorial 15 outputs

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-20-212.eu-west-1.compute.internal>
MichelBartels 2022-05-05 10:12:44 +02:00 committed by GitHub
parent 5d98810a17
commit c7e39e5225
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 1528 additions and 1361 deletions


@@ -10,7 +10,7 @@ id: "tutorial15md"
 # Open-Domain QA on Tables
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial15_TableQA.ipynb)
-This tutorial shows you how to perform question-answering on tables using the `TableTextRetriever` or `BM25Retriever` as retriever node and the `TableReader` as reader node.
+This tutorial shows you how to perform question-answering on tables using the `EmbeddingRetriever` or `BM25Retriever` as retriever node and the `TableReader` as reader node.
 ### Prepare environment
@@ -79,15 +79,12 @@ es_server = Popen(
 # Connect to Elasticsearch
 from haystack.document_stores import ElasticsearchDocumentStore
-# We want to use a small model producing 512-dimensional embeddings, so we need to set embedding_dim to 512
 document_index = "document"
-document_store = ElasticsearchDocumentStore(
-    host="localhost", username="", password="", index=document_index, embedding_dim=512
-)
+document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index=document_index)
 ```
 ## Add Tables to DocumentStore
-To quickly demonstrate the capabilities of the `TableTextRetriever` and the `TableReader` we use a subset of 1000 tables and text documents from a dataset we have published in [this paper](https://arxiv.org/abs/2108.04049).
+To quickly demonstrate the capabilities of the `EmbeddingRetriever` and the `TableReader` we use a subset of 1000 tables and text documents from a dataset we have published in [this paper](https://arxiv.org/abs/2108.04049).
 Just like text passages, tables are represented as `Document` objects in Haystack. The content field, though, is a pandas DataFrame instead of a string.
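Since a table `Document`'s content is a pandas DataFrame, its shape can be sketched as follows (the rows below are illustrative toy data, not taken from the dataset; the Haystack `Document` wrapping is shown only as a comment so the snippet stays dependency-light):

```python
# Sketch: a table as a pandas DataFrame, the content that Haystack's
# table Documents carry (toy rows, not from the actual dataset).
import pandas as pd

rows = [
    ["Title", "Release date"],
    ["Guilty Gear Xrd: Sign", "2014"],
]
df = pd.DataFrame(rows[1:], columns=rows[0])

# In the tutorial this DataFrame would be wrapped roughly as:
#   Document(content=df, content_type="table")
```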
@@ -140,8 +137,7 @@ print(tables[0].meta)
 Retrievers help narrow down the scope for the Reader to a subset of tables where a given question could be answered.
 They use simple but fast algorithms.
-**Here:** We use the `TableTextRetriever` capable of retrieving relevant content among a database
-of texts and tables using dense embeddings. It is an extension of the `DensePassageRetriever` and consists of three encoders (one query encoder, one text passage encoder and one table encoder) that create embeddings in the same vector space. More details on the `TableTextRetriever` and how it is trained can be found in [this paper](https://arxiv.org/abs/2108.04049).
+**Here:** We specify an embedding model that is finetuned so it can also generate embeddings for tables (instead of just text).
 **Alternatives:**
@@ -150,13 +146,12 @@ of texts and tables using dense embeddings. It is an extension of the `DensePass
 ```python
-from haystack.nodes.retriever import TableTextRetriever
+from haystack.nodes.retriever import EmbeddingRetriever
-retriever = TableTextRetriever(
+retriever = EmbeddingRetriever(
     document_store=document_store,
-    query_embedding_model="deepset/bert-small-mm_retrieval-question_encoder",
-    passage_embedding_model="deepset/bert-small-mm_retrieval-passage_encoder",
-    table_embedding_model="deepset/bert-small-mm_retrieval-table_encoder",
+    embedding_model="deepset/all-mpnet-base-v2-table",
+    model_format="sentence_transformers",
 )
 ```
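Under the hood, a dense retriever like this ranks documents by vector similarity between the query embedding and each document (or table) embedding. A toy, dependency-free sketch of that scoring (the 3-d vectors are made up for illustration; real sentence-transformers embeddings have hundreds of dimensions):

```python
# Toy sketch of dense-retrieval scoring: query and documents live in the
# same vector space and are ranked by cosine similarity.
# The 3-d vectors are invented stand-ins, not real embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "table_about_releases": [0.9, 0.1, 0.8],
    "unrelated_text": [0.0, 1.0, 0.2],
}
# Rank document ids by similarity to the query, best match first
ranked = sorted(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]), reverse=True)
```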
@@ -230,8 +225,8 @@ The Retriever and the Reader can be sticked together to a pipeline in order to f
 from haystack import Pipeline
 table_qa_pipeline = Pipeline()
-table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["TableTextRetriever"])
+table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["EmbeddingRetriever"])
 ```
@@ -266,10 +261,10 @@ document_store.update_embeddings(retriever=retriever, update_existing_embeddings
 ```
 ## Pipeline for QA on Combination of Text and Tables
-We are using one node for retrieving both texts and tables, the `TableTextRetriever`. In order to do question-answering on the Documents coming from the `TableTextRetriever`, we need to route Documents of type `"text"` to a `FARMReader` (or alternatively `TransformersReader`) and Documents of type `"table"` to a `TableReader`.
+We are using one node for retrieving both texts and tables, the `EmbeddingRetriever`. In order to do question-answering on the Documents coming from the `EmbeddingRetriever`, we need to route Documents of type `"text"` to a `FARMReader` (or alternatively `TransformersReader`) and Documents of type `"table"` to a `TableReader`.
 To achieve this, we make use of two additional nodes:
-- `RouteDocuments`: Splits the List of Documents retrieved by the `TableTextRetriever` into two lists containing only Documents of type `"text"` or `"table"`, respectively.
+- `RouteDocuments`: Splits the List of Documents retrieved by the `EmbeddingRetriever` into two lists containing only Documents of type `"text"` or `"table"`, respectively.
 - `JoinAnswers`: Takes Answers coming from two different Readers (in this case `FARMReader` and `TableReader`) and joins them to a single list of Answers.
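Conceptually, the two nodes do something like the following plain-Python sketch (toy dicts stand in for Haystack `Document` and `Answer` objects; this is not the actual Haystack implementation):

```python
# Plain-Python sketch of RouteDocuments and JoinAnswers.
docs = [
    {"content_type": "text", "content": "Some passage about a game."},
    {"content_type": "table", "content": [["Title", "Release"], ["Some game", "2014"]]},
]

# RouteDocuments: split the retrieved Documents by content type
text_docs = [d for d in docs if d["content_type"] == "text"]
table_docs = [d for d in docs if d["content_type"] == "table"]

# Each reader then produces its own (toy) answers from its share of documents
text_answers = [{"answer": "a game", "score": 0.61}]
table_answers = [{"answer": "2014", "score": 0.93}]

# JoinAnswers: merge both lists into a single ranking by score
joined = sorted(text_answers + table_answers, key=lambda a: a["score"], reverse=True)
```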
@@ -288,8 +283,8 @@ join_answers = JoinAnswers()
 ```python
 text_table_qa_pipeline = Pipeline()
-text_table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["TableTextRetriever"])
+text_table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["EmbeddingRetriever"])
 text_table_qa_pipeline.add_node(component=text_reader, name="TextReader", inputs=["RouteDocuments.output_1"])
 text_table_qa_pipeline.add_node(component=table_reader, name="TableReader", inputs=["RouteDocuments.output_2"])
 text_table_qa_pipeline.add_node(component=join_answers, name="JoinAnswers", inputs=["TextReader", "TableReader"])
@@ -376,7 +371,10 @@ It can sometimes be hard to provide your data in form of a pandas DataFrame. For
 ```python
+import time
 !docker run -d -p 3001:3001 axarev/parsr
+time.sleep(30)
 ```
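The fixed `time.sleep(30)` simply gives the Parsr container time to boot. A more robust (hypothetical) alternative is to poll the container's port until it accepts connections, using only the standard library:

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 60.0) -> bool:
    """Poll until a TCP port accepts connections, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Attempt a connection; success means the service is listening
            with socket.create_connection((host, port), timeout=1):
                return True
        except OSError:
            time.sleep(0.5)
    return False

# e.g. wait_for_port("localhost", 3001) before calling ParsrConverter
```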

File diff suppressed because one or more lines are too long


@@ -1,5 +1,6 @@
 import os
 import json
+import time
 import pandas as pd
@@ -7,7 +8,7 @@ from haystack import Label, MultiLabel, Answer
 from haystack.utils import launch_es, fetch_archive_from_http, print_answers
 from haystack.document_stores import ElasticsearchDocumentStore
 from haystack import Document, Pipeline
-from haystack.nodes.retriever import TableTextRetriever
+from haystack.nodes.retriever import EmbeddingRetriever
 from haystack.nodes import TableReader, FARMReader, RouteDocuments, JoinAnswers, ParsrConverter
@@ -17,10 +18,7 @@ def tutorial15_tableqa():
     launch_es()
     ## Connect to Elasticsearch
-    # We want to use a small model producing 512-dimensional embeddings, so we need to set embedding_dim to 512
-    document_store = ElasticsearchDocumentStore(
-        host="localhost", username="", password="", index="document", embedding_dim=512
-    )
+    document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")
     ## Add Tables to DocumentStore
@@ -53,15 +51,13 @@ def tutorial15_tableqa():
     # Retrievers help narrow down the scope for the Reader to a subset of tables where a given question could be answered.
     # They use simple but fast algorithms.
     #
-    # **Here:** We use the TableTextRetriever capable of retrieving relevant content among a database
+    # **Here:** We use the EmbeddingRetriever capable of retrieving relevant content among a database
     # of texts and tables using dense embeddings.
-    retriever = TableTextRetriever(
+    retriever = EmbeddingRetriever(
         document_store=document_store,
-        query_embedding_model="deepset/bert-small-mm_retrieval-question_encoder",
-        passage_embedding_model="deepset/bert-small-mm_retrieval-passage_encoder",
-        table_embedding_model="deepset/bert-small-mm_retrieval-table_encoder",
-        embed_meta_fields=["title", "section_title"],
+        embedding_model="deepset/all-mpnet-base-v2-table",
+        model_format="sentence_transformers",
     )
     # Add table embeddings to the tables in DocumentStore
@@ -104,15 +100,15 @@ def tutorial15_tableqa():
     # for each of the tables, the sorting of the answers might be not helpful.
     table_qa_pipeline = Pipeline()
-    table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-    table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["TableTextRetriever"])
+    table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+    table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["EmbeddingRetriever"])
     prediction = table_qa_pipeline.run("When was Guilty Gear Xrd : Sign released?")
     print_answers(prediction, details="minimum")
     ### Pipeline for QA on Combination of Text and Tables
-    # We are using one node for retrieving both texts and tables, the TableTextRetriever.
-    # In order to do question-answering on the Documents coming from the TableTextRetriever, we need to route
+    # We are using one node for retrieving both texts and tables, the EmbeddingRetriever.
+    # In order to do question-answering on the Documents coming from the EmbeddingRetriever, we need to route
     # Documents of type "text" to a FARMReader (or alternatively TransformersReader) and Documents of type
     # "table" to a TableReader.
@@ -125,8 +121,8 @@ def tutorial15_tableqa():
     join_answers = JoinAnswers()
     text_table_qa_pipeline = Pipeline()
-    text_table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-    text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["TableTextRetriever"])
+    text_table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+    text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["EmbeddingRetriever"])
     text_table_qa_pipeline.add_node(component=text_reader, name="TextReader", inputs=["RouteDocuments.output_1"])
     text_table_qa_pipeline.add_node(component=table_reader, name="TableReader", inputs=["RouteDocuments.output_2"])
     text_table_qa_pipeline.add_node(component=join_answers, name="JoinAnswers", inputs=["TextReader", "TableReader"])
@@ -189,11 +185,12 @@ def tutorial15_tableqa():
     # It can sometimes be hard to provide your data in the form of a pandas DataFrame.
    # For this case, we provide the `ParsrConverter` wrapper that can help you to convert, for example, a PDF file into a document that you can index.
     os.system("docker run -d -p 3001:3001 axarev/parsr")
+    time.sleep(30)
     os.system("wget https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf")
     converter = ParsrConverter()
     docs = converter.convert("table.pdf")
-    tables = [doc for doc in docs if doc["content_type"] == "table"]
+    tables = [doc for doc in docs if doc.content_type == "table"]
     print(tables)