Replace TableTextRetriever with EmbeddingRetriever in Tutorial 15 (#2479)

* replace TableTextRetriever with EmbeddingRetriever in Tutorial 15

* Update Documentation & Code Style

* fix bug

* Update Documentation & Code Style

* update tutorial 15 outputs

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-20-212.eu-west-1.compute.internal>
MichelBartels 2022-05-05 10:12:44 +02:00 committed by GitHub
parent 5d98810a17
commit c7e39e5225
3 changed files with 1528 additions and 1361 deletions


@@ -10,7 +10,7 @@ id: "tutorial15md"
 # Open-Domain QA on Tables
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial15_TableQA.ipynb)
-This tutorial shows you how to perform question-answering on tables using the `TableTextRetriever` or `BM25Retriever` as retriever node and the `TableReader` as reader node.
+This tutorial shows you how to perform question-answering on tables using the `EmbeddingRetriever` or `BM25Retriever` as retriever node and the `TableReader` as reader node.
 ### Prepare environment
@@ -79,15 +79,12 @@ es_server = Popen(
 # Connect to Elasticsearch
 from haystack.document_stores import ElasticsearchDocumentStore
-# We want to use a small model producing 512-dimensional embeddings, so we need to set embedding_dim to 512
 document_index = "document"
-document_store = ElasticsearchDocumentStore(
-    host="localhost", username="", password="", index=document_index, embedding_dim=512
-)
+document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index=document_index)
 ```
 ## Add Tables to DocumentStore
-To quickly demonstrate the capabilities of the `TableTextRetriever` and the `TableReader` we use a subset of 1000 tables and text documents from a dataset we have published in [this paper](https://arxiv.org/abs/2108.04049).
+To quickly demonstrate the capabilities of the `EmbeddingRetriever` and the `TableReader` we use a subset of 1000 tables and text documents from a dataset we have published in [this paper](https://arxiv.org/abs/2108.04049).
 Just as text passages, tables are represented as `Document` objects in Haystack. The content field, though, is a pandas DataFrame instead of a string.
@@ -140,8 +137,7 @@ print(tables[0].meta)
 Retrievers help narrowing down the scope for the Reader to a subset of tables where a given question could be answered.
 They use some simple but fast algorithm.
-**Here:** We use the `TableTextRetriever` capable of retrieving relevant content among a database
-of texts and tables using dense embeddings. It is an extension of the `DensePassageRetriever` and consists of three encoders (one query encoder, one text passage encoder and one table encoder) that create embeddings in the same vector space. More details on the `TableTextRetriever` and how it is trained can be found in [this paper](https://arxiv.org/abs/2108.04049).
+**Here:** We specify an embedding model that is finetuned so it can also generate embeddings for tables (instead of just text).
 **Alternatives:**
@@ -150,13 +146,12 @@ of texts and tables using dense embeddings. It is an extension of the `DensePass
 ```python
-from haystack.nodes.retriever import TableTextRetriever
+from haystack.nodes.retriever import EmbeddingRetriever
-retriever = TableTextRetriever(
+retriever = EmbeddingRetriever(
     document_store=document_store,
-    query_embedding_model="deepset/bert-small-mm_retrieval-question_encoder",
-    passage_embedding_model="deepset/bert-small-mm_retrieval-passage_encoder",
-    table_embedding_model="deepset/bert-small-mm_retrieval-table_encoder",
+    embedding_model="deepset/all-mpnet-base-v2-table",
+    model_format="sentence_transformers",
 )
 ```
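The replacement retriever ranks documents by embedding similarity rather than with separate query/passage/table encoders. As a rough, self-contained illustration of that ranking step (toy hand-written vectors, not embeddings from the `deepset/all-mpnet-base-v2-table` model), queries and documents live in one vector space and documents are sorted by cosine similarity:

```python
# Toy sketch of the ranking a dense retriever performs: both the query
# and the documents (texts or linearized tables) are mapped into one
# vector space, and documents are ranked by cosine similarity.
# The 3-d vectors below are made up purely for illustration.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

doc_embeddings = {
    "table: game release dates": [0.9, 0.1, 0.2],
    "text: plot summary": [0.1, 0.8, 0.3],
}
query_embedding = [0.8, 0.2, 0.1]

ranked = sorted(
    doc_embeddings,
    key=lambda d: cosine(query_embedding, doc_embeddings[d]),
    reverse=True,
)
print(ranked[0])  # the table document is the closest match here
```

In the real pipeline, `document_store.update_embeddings(retriever)` precomputes the document vectors so only the query is embedded at search time.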
@@ -230,8 +225,8 @@ The Retriever and the Reader can be sticked together to a pipeline in order to f
 from haystack import Pipeline
 table_qa_pipeline = Pipeline()
-table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["TableTextRetriever"])
+table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["EmbeddingRetriever"])
 ```
@@ -266,10 +261,10 @@ document_store.update_embeddings(retriever=retriever, update_existing_embeddings
 ```
 ## Pipeline for QA on Combination of Text and Tables
-We are using one node for retrieving both texts and tables, the `TableTextRetriever`. In order to do question-answering on the Documents coming from the `TableTextRetriever`, we need to route Documents of type `"text"` to a `FARMReader` (or alternatively `TransformersReader`) and Documents of type `"table"` to a `TableReader`.
+We are using one node for retrieving both texts and tables, the `EmbeddingRetriever`. In order to do question-answering on the Documents coming from the `EmbeddingRetriever`, we need to route Documents of type `"text"` to a `FARMReader` (or alternatively `TransformersReader`) and Documents of type `"table"` to a `TableReader`.
 To achieve this, we make use of two additional nodes:
-- `RouteDocuments`: Splits the List of Documents retrieved by the `TableTextRetriever` into two lists containing only Documents of type `"text"` or `"table"`, respectively.
+- `RouteDocuments`: Splits the List of Documents retrieved by the `EmbeddingRetriever` into two lists containing only Documents of type `"text"` or `"table"`, respectively.
 - `JoinAnswers`: Takes Answers coming from two different Readers (in this case `FARMReader` and `TableReader`) and joins them to a single list of Answers.
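The two helper nodes are simple in spirit. A minimal plain-Python sketch (stand-in dicts and f-strings, not the actual Haystack `RouteDocuments`/`JoinAnswers` classes) of the split-then-merge flow they implement:

```python
# Sketch of the routing/joining logic: split retrieved documents by
# content_type, hand each split to its own "reader", merge the answers.
docs = [
    {"content_type": "text", "content": "some passage"},
    {"content_type": "table", "content": "some table"},
    {"content_type": "text", "content": "another passage"},
]

# RouteDocuments: one output list per content type
texts = [d for d in docs if d["content_type"] == "text"]
tables = [d for d in docs if d["content_type"] == "table"]

# Stand-ins for FARMReader / TableReader producing answers per split
text_answers = [f"text answer from: {d['content']}" for d in texts]
table_answers = [f"table answer from: {d['content']}" for d in tables]

# JoinAnswers: a single merged answer list for the pipeline output
answers = text_answers + table_answers
print(len(answers))  # 3
```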
@@ -288,8 +283,8 @@ join_answers = JoinAnswers()
 ```python
 text_table_qa_pipeline = Pipeline()
-text_table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["TableTextRetriever"])
+text_table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["EmbeddingRetriever"])
 text_table_qa_pipeline.add_node(component=text_reader, name="TextReader", inputs=["RouteDocuments.output_1"])
 text_table_qa_pipeline.add_node(component=table_reader, name="TableReader", inputs=["RouteDocuments.output_2"])
 text_table_qa_pipeline.add_node(component=join_answers, name="JoinAnswers", inputs=["TextReader", "TableReader"])
@@ -376,7 +371,10 @@ It can sometimes be hard to provide your data in form of a pandas DataFrame. For
 ```python
+import time
 !docker run -d -p 3001:3001 axarev/parsr
+time.sleep(30)
 ```
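The `time.sleep(30)` added above gives the Parsr container time to start before it is queried. A fixed sleep can be too short on a slow machine or needlessly long on a fast one; a common alternative (a sketch, not part of this commit) is to poll the published port until it accepts connections:

```python
# Poll a TCP port until it accepts connections or a deadline passes.
# Could replace a fixed sleep after `docker run -d -p 3001:3001 ...`.
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 60.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1):
                return True
        except OSError:
            time.sleep(0.5)  # container not ready yet; retry shortly
    return False
```

Usage here would be `wait_for_port("localhost", 3001)` right after starting the container.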

File diff suppressed because one or more lines are too long


@@ -1,5 +1,6 @@
 import os
 import json
+import time
 import pandas as pd
@@ -7,7 +8,7 @@ from haystack import Label, MultiLabel, Answer
 from haystack.utils import launch_es, fetch_archive_from_http, print_answers
 from haystack.document_stores import ElasticsearchDocumentStore
 from haystack import Document, Pipeline
-from haystack.nodes.retriever import TableTextRetriever
+from haystack.nodes.retriever import EmbeddingRetriever
 from haystack.nodes import TableReader, FARMReader, RouteDocuments, JoinAnswers, ParsrConverter
@@ -17,10 +18,7 @@ def tutorial15_tableqa():
     launch_es()
     ## Connect to Elasticsearch
-    # We want to use a small model producing 512-dimensional embeddings, so we need to set embedding_dim to 512
-    document_store = ElasticsearchDocumentStore(
-        host="localhost", username="", password="", index="document", embedding_dim=512
-    )
+    document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")
     ## Add Tables to DocumentStore
@@ -53,15 +51,13 @@ def tutorial15_tableqa():
     # Retrievers help narrowing down the scope for the Reader to a subset of tables where a given question could be answered.
     # They use some simple but fast algorithm.
     #
-    # **Here:** We use the TableTextRetriever capable of retrieving relevant content among a database
+    # **Here:** We use the EmbeddingRetriever capable of retrieving relevant content among a database
     # of texts and tables using dense embeddings.
-    retriever = TableTextRetriever(
+    retriever = EmbeddingRetriever(
         document_store=document_store,
-        query_embedding_model="deepset/bert-small-mm_retrieval-question_encoder",
-        passage_embedding_model="deepset/bert-small-mm_retrieval-passage_encoder",
-        table_embedding_model="deepset/bert-small-mm_retrieval-table_encoder",
-        embed_meta_fields=["title", "section_title"],
+        embedding_model="deepset/all-mpnet-base-v2-table",
+        model_format="sentence_transformers",
     )
     # Add table embeddings to the tables in DocumentStore
@@ -104,15 +100,15 @@ def tutorial15_tableqa():
     # for each of the tables, the sorting of the answers might be not helpful.
     table_qa_pipeline = Pipeline()
-    table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-    table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["TableTextRetriever"])
+    table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+    table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["EmbeddingRetriever"])
     prediction = table_qa_pipeline.run("When was Guilty Gear Xrd : Sign released?")
     print_answers(prediction, details="minimum")
     ### Pipeline for QA on Combination of Text and Tables
-    # We are using one node for retrieving both texts and tables, the TableTextRetriever.
-    # In order to do question-answering on the Documents coming from the TableTextRetriever, we need to route
+    # We are using one node for retrieving both texts and tables, the EmbeddingRetriever.
+    # In order to do question-answering on the Documents coming from the EmbeddingRetriever, we need to route
     # Documents of type "text" to a FARMReader ( or alternatively TransformersReader) and Documents of type
     # "table" to a TableReader.
@@ -125,8 +121,8 @@ def tutorial15_tableqa():
     join_answers = JoinAnswers()
     text_table_qa_pipeline = Pipeline()
-    text_table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-    text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["TableTextRetriever"])
+    text_table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+    text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["EmbeddingRetriever"])
     text_table_qa_pipeline.add_node(component=text_reader, name="TextReader", inputs=["RouteDocuments.output_1"])
     text_table_qa_pipeline.add_node(component=table_reader, name="TableReader", inputs=["RouteDocuments.output_2"])
     text_table_qa_pipeline.add_node(component=join_answers, name="JoinAnswers", inputs=["TextReader", "TableReader"])
@@ -189,11 +185,12 @@ def tutorial15_tableqa():
     # It can sometimes be hard to provide your data in form of a pandas DataFrame.
     # For this case, we provide the `ParsrConverter` wrapper that can help you to convert, for example, a PDF file into a document that you can index.
     os.system("docker run -d -p 3001:3001 axarev/parsr")
+    time.sleep(30)
     os.system("wget https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf")
     converter = ParsrConverter()
     docs = converter.convert("table.pdf")
-    tables = [doc for doc in docs if doc["content_type"] == "table"]
+    tables = [doc for doc in docs if doc.content_type == "table"]
     print(tables)
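The last change swaps `doc["content_type"]` for `doc.content_type`: the converter now yields `Document` objects, whose fields are attributes rather than dict keys, so the dict-style lookup raises a `TypeError`. A minimal stand-in (a hypothetical `Doc` dataclass, not `haystack.Document`) showing the difference:

```python
# Attribute access vs. dict-style access: dataclass instances are not
# subscriptable, so doc["content_type"] fails while doc.content_type works.
from dataclasses import dataclass

@dataclass
class Doc:  # hypothetical stand-in for a converter output object
    content: str
    content_type: str

docs = [Doc("a passage", "text"), Doc("a table", "table")]
tables = [d for d in docs if d.content_type == "table"]  # works
print(len(tables))  # 1
# docs[0]["content_type"] would raise: TypeError: 'Doc' object is not subscriptable
```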