Mirror of https://github.com/deepset-ai/haystack.git, synced 2025-11-01 10:19:23 +00:00
Replace TableTextRetriever with EmbeddingRetriever in Tutorial 15 (#2479)

* replace TableTextRetriever with EmbeddingRetriever in Tutorial 15
* Update Documentation & Code Style
* fix bug
* Update Documentation & Code Style
* update tutorial 15 outputs

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-20-212.eu-west-1.compute.internal>

This commit is contained in:
parent 5d98810a17
commit c7e39e5225
@@ -10,7 +10,7 @@ id: "tutorial15md"
 # Open-Domain QA on Tables
 
 [](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial15_TableQA.ipynb)
 
-This tutorial shows you how to perform question-answering on tables using the `TableTextRetriever` or `BM25Retriever` as retriever node and the `TableReader` as reader node.
+This tutorial shows you how to perform question-answering on tables using the `EmbeddingRetriever` or `BM25Retriever` as retriever node and the `TableReader` as reader node.
 
 ### Prepare environment
@@ -79,15 +79,12 @@ es_server = Popen(
 # Connect to Elasticsearch
 from haystack.document_stores import ElasticsearchDocumentStore
 
-# We want to use a small model producing 512-dimensional embeddings, so we need to set embedding_dim to 512
 document_index = "document"
-document_store = ElasticsearchDocumentStore(
-    host="localhost", username="", password="", index=document_index, embedding_dim=512
-)
+document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index=document_index)
 ```
 
 ## Add Tables to DocumentStore
-To quickly demonstrate the capabilities of the `TableTextRetriever` and the `TableReader` we use a subset of 1000 tables and text documents from a dataset we have published in [this paper](https://arxiv.org/abs/2108.04049).
+To quickly demonstrate the capabilities of the `EmbeddingRetriever` and the `TableReader` we use a subset of 1000 tables and text documents from a dataset we have published in [this paper](https://arxiv.org/abs/2108.04049).
 
 Just as text passages, tables are represented as `Document` objects in Haystack. The content field, though, is a pandas DataFrame instead of a string.
 
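As an aside to the hunk above: the point that a table `Document` carries tabular content rather than a string can be sketched without Haystack or pandas installed. The `Document` class below is a simplified stand-in for `haystack.Document`, nested lists stand in for the real pandas DataFrame content, and the sample table is made up for illustration.

```python
# Minimal stand-in for haystack.Document: content may be a plain string
# (content_type="text") or a table (content_type="table").
class Document:
    def __init__(self, content, content_type, meta=None):
        self.content = content
        self.content_type = content_type  # "text" or "table"
        self.meta = meta or {}

# A table represented as rows of cells, header row first
# (the real tutorial uses a pandas DataFrame here instead).
table = [
    ["Game", "Release date"],
    ["Guilty Gear Xrd: Sign", "2014"],
]
doc = Document(content=table, content_type="table", meta={"title": "Guilty Gear releases"})

print(doc.content_type)  # table
print(doc.content[0])    # ['Game', 'Release date']
```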
@@ -140,8 +137,7 @@ print(tables[0].meta)
 Retrievers help narrowing down the scope for the Reader to a subset of tables where a given question could be answered.
 They use some simple but fast algorithm.
 
-**Here:** We use the `TableTextRetriever` capable of retrieving relevant content among a database
-of texts and tables using dense embeddings. It is an extension of the `DensePassageRetriever` and consists of three encoders (one query encoder, one text passage encoder and one table encoder) that create embeddings in the same vector space. More details on the `TableTextRetriever` and how it is trained can be found in [this paper](https://arxiv.org/abs/2108.04049).
+**Here:** We specify an embedding model that is finetuned so it can also generate embeddings for tables (instead of just text).
 
 **Alternatives:**
 
@@ -150,13 +146,12 @@ of texts and tables using dense embeddings. It is an extension of the `DensePass
 
 
 ```python
-from haystack.nodes.retriever import TableTextRetriever
+from haystack.nodes.retriever import EmbeddingRetriever
 
-retriever = TableTextRetriever(
+retriever = EmbeddingRetriever(
     document_store=document_store,
-    query_embedding_model="deepset/bert-small-mm_retrieval-question_encoder",
-    passage_embedding_model="deepset/bert-small-mm_retrieval-passage_encoder",
-    table_embedding_model="deepset/bert-small-mm_retrieval-table_encoder",
+    embedding_model="deepset/all-mpnet-base-v2-table",
+    model_format="sentence_transformers",
 )
 ```
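For readers following the diff, the dense-retrieval step that the retriever performs at query time can be illustrated with a dependency-free toy: embed the query, then rank the stored document embeddings by cosine similarity. The document names and the tiny 3-dimensional vectors below are invented for illustration; the real `EmbeddingRetriever` uses the configured sentence-transformers model to produce the embeddings.

```python
import math

# Cosine similarity between two equal-length vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Precomputed document embeddings (normally written to the DocumentStore
# by update_embeddings); values are made up.
doc_embeddings = {
    "table_about_games": [0.9, 0.1, 0.0],
    "text_about_weather": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]  # embedding of the incoming question

# Rank documents by similarity to the query, most similar first.
ranked = sorted(doc_embeddings, key=lambda d: cosine(query_embedding, doc_embeddings[d]), reverse=True)
print(ranked[0])  # table_about_games
```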
@@ -230,8 +225,8 @@ The Retriever and the Reader can be sticked together to a pipeline in order to f
 from haystack import Pipeline
 
 table_qa_pipeline = Pipeline()
-table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["TableTextRetriever"])
+table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["EmbeddingRetriever"])
 ```
 
 
@@ -266,10 +261,10 @@ document_store.update_embeddings(retriever=retriever, update_existing_embeddings
 ```
 
 ## Pipeline for QA on Combination of Text and Tables
-We are using one node for retrieving both texts and tables, the `TableTextRetriever`. In order to do question-answering on the Documents coming from the `TableTextRetriever`, we need to route Documents of type `"text"` to a `FARMReader` (or alternatively `TransformersReader`) and Documents of type `"table"` to a `TableReader`.
+We are using one node for retrieving both texts and tables, the `EmbeddingRetriever`. In order to do question-answering on the Documents coming from the `EmbeddingRetriever`, we need to route Documents of type `"text"` to a `FARMReader` (or alternatively `TransformersReader`) and Documents of type `"table"` to a `TableReader`.
 
 To achieve this, we make use of two additional nodes:
-- `RouteDocuments`: Splits the List of Documents retrieved by the `TableTextRetriever` into two lists containing only Documents of type `"text"` or `"table"`, respectively.
+- `RouteDocuments`: Splits the List of Documents retrieved by the `EmbeddingRetriever` into two lists containing only Documents of type `"text"` or `"table"`, respectively.
 - `JoinAnswers`: Takes Answers coming from two different Readers (in this case `FARMReader` and `TableReader`) and joins them to a single list of Answers.
 
 
@@ -288,8 +283,8 @@ join_answers = JoinAnswers()
 
 ```python
 text_table_qa_pipeline = Pipeline()
-text_table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["TableTextRetriever"])
+text_table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["EmbeddingRetriever"])
 text_table_qa_pipeline.add_node(component=text_reader, name="TextReader", inputs=["RouteDocuments.output_1"])
 text_table_qa_pipeline.add_node(component=table_reader, name="TableReader", inputs=["RouteDocuments.output_2"])
 text_table_qa_pipeline.add_node(component=join_answers, name="JoinAnswers", inputs=["TextReader", "TableReader"])
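The split-and-merge behavior that `RouteDocuments` and `JoinAnswers` contribute to the pipeline above can be sketched without Haystack installed. Plain dicts stand in for Haystack's `Document` and `Answer` objects, and the documents, answers, and scores are made up for illustration.

```python
# Stand-in for RouteDocuments: split retrieved documents into
# text-only and table-only lists by their content_type.
def route_documents(docs):
    texts = [d for d in docs if d["content_type"] == "text"]
    tables = [d for d in docs if d["content_type"] == "table"]
    return texts, tables

# Stand-in for JoinAnswers: merge answer lists from several readers
# into one list, highest-scoring answer first.
def join_answers(*answer_lists):
    merged = [a for answers in answer_lists for a in answers]
    return sorted(merged, key=lambda a: a["score"], reverse=True)

docs = [
    {"id": "t1", "content_type": "text"},
    {"id": "tab1", "content_type": "table"},
]
texts, tables = route_documents(docs)

answers = join_answers(
    [{"answer": "2014", "score": 0.9}],  # e.g. from the TableReader
    [{"answer": "2015", "score": 0.4}],  # e.g. from the FARMReader
)
print(answers[0]["answer"])  # 2014
```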
@@ -376,7 +371,10 @@ It can sometimes be hard to provide your data in form of a pandas DataFrame. For
 
 
 ```python
+import time
+
 !docker run -d -p 3001:3001 axarev/parsr
+time.sleep(30)
 ```
 
 
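The converter step introduced in this hunk ultimately yields table Documents extracted from the PDF. As a rough, dependency-free illustration of the reshaping involved in going from extracted rows (header row first) to a columnar structure like a pandas DataFrame, consider the snippet below; the sample rows are invented and the dict-of-columns is only a stand-in for the real DataFrame.

```python
# Extracted table rows, header first (made-up sample data).
rows = [
    ["Country", "Capital"],
    ["France", "Paris"],
    ["Japan", "Tokyo"],
]

# Reshape rows into a column-oriented mapping, the shape a
# DataFrame constructor could consume directly.
header, data = rows[0], rows[1:]
columns = {name: [row[i] for row in data] for i, name in enumerate(header)}

print(columns["Capital"])  # ['Paris', 'Tokyo']
```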
File diff suppressed because one or more lines are too long
@@ -1,5 +1,6 @@
 import os
 import json
+import time
 
 import pandas as pd
 
@@ -7,7 +8,7 @@ from haystack import Label, MultiLabel, Answer
 from haystack.utils import launch_es, fetch_archive_from_http, print_answers
 from haystack.document_stores import ElasticsearchDocumentStore
 from haystack import Document, Pipeline
-from haystack.nodes.retriever import TableTextRetriever
+from haystack.nodes.retriever import EmbeddingRetriever
 from haystack.nodes import TableReader, FARMReader, RouteDocuments, JoinAnswers, ParsrConverter
 
 
@@ -17,10 +18,7 @@ def tutorial15_tableqa():
     launch_es()
 
     ## Connect to Elasticsearch
-    # We want to use a small model producing 512-dimensional embeddings, so we need to set embedding_dim to 512
-    document_store = ElasticsearchDocumentStore(
-        host="localhost", username="", password="", index="document", embedding_dim=512
-    )
+    document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")
 
     ## Add Tables to DocumentStore
 
@@ -53,15 +51,13 @@ def tutorial15_tableqa():
     # Retrievers help narrowing down the scope for the Reader to a subset of tables where a given question could be answered.
     # They use some simple but fast algorithm.
     #
-    # **Here:** We use the TableTextRetriever capable of retrieving relevant content among a database
+    # **Here:** We use the EmbeddingRetriever capable of retrieving relevant content among a database
    # of texts and tables using dense embeddings.
 
-    retriever = TableTextRetriever(
+    retriever = EmbeddingRetriever(
         document_store=document_store,
-        query_embedding_model="deepset/bert-small-mm_retrieval-question_encoder",
-        passage_embedding_model="deepset/bert-small-mm_retrieval-passage_encoder",
-        table_embedding_model="deepset/bert-small-mm_retrieval-table_encoder",
-        embed_meta_fields=["title", "section_title"],
+        embedding_model="deepset/all-mpnet-base-v2-table",
+        model_format="sentence_transformers",
     )
 
     # Add table embeddings to the tables in DocumentStore
@@ -104,15 +100,15 @@ def tutorial15_tableqa():
     # for each of the tables, the sorting of the answers might be not helpful.
 
     table_qa_pipeline = Pipeline()
-    table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-    table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["TableTextRetriever"])
+    table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+    table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["EmbeddingRetriever"])
 
     prediction = table_qa_pipeline.run("When was Guilty Gear Xrd : Sign released?")
     print_answers(prediction, details="minimum")
 
     ### Pipeline for QA on Combination of Text and Tables
-    # We are using one node for retrieving both texts and tables, the TableTextRetriever.
-    # In order to do question-answering on the Documents coming from the TableTextRetriever, we need to route
+    # We are using one node for retrieving both texts and tables, the EmbeddingRetriever.
+    # In order to do question-answering on the Documents coming from the EmbeddingRetriever, we need to route
     # Documents of type "text" to a FARMReader ( or alternatively TransformersReader) and Documents of type
     # "table" to a TableReader.
 
@@ -125,8 +121,8 @@ def tutorial15_tableqa():
     join_answers = JoinAnswers()
 
     text_table_qa_pipeline = Pipeline()
-    text_table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-    text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["TableTextRetriever"])
+    text_table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+    text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["EmbeddingRetriever"])
     text_table_qa_pipeline.add_node(component=text_reader, name="TextReader", inputs=["RouteDocuments.output_1"])
     text_table_qa_pipeline.add_node(component=table_reader, name="TableReader", inputs=["RouteDocuments.output_2"])
     text_table_qa_pipeline.add_node(component=join_answers, name="JoinAnswers", inputs=["TextReader", "TableReader"])
@@ -189,11 +185,12 @@ def tutorial15_tableqa():
     # It can sometimes be hard to provide your data in form of a pandas DataFrame.
     # For this case, we provide the `ParsrConverter` wrapper that can help you to convert, for example, a PDF file into a document that you can index.
     os.system("docker run -d -p 3001:3001 axarev/parsr")
+    time.sleep(30)
     os.system("wget https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf")
 
     converter = ParsrConverter()
     docs = converter.convert("table.pdf")
-    tables = [doc for doc in docs if doc["content_type"] == "table"]
+    tables = [doc for doc in docs if doc.content_type == "table"]
 
     print(tables)
 