mirror of https://github.com/deepset-ai/haystack.git (synced 2025-11-08 13:54:31 +00:00)
Replace TableTextRetriever with EmbeddingRetriever in Tutorial 15 (#2479)
* replace TableTextRetriever with EmbeddingRetriever in Tutorial 15
* Update Documentation & Code Style
* fix bug
* Update Documentation & Code Style
* update tutorial 15 outputs

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-20-212.eu-west-1.compute.internal>
parent 5d98810a17
commit c7e39e5225
@@ -10,7 +10,7 @@ id: "tutorial15md"
 # Open-Domain QA on Tables
 [](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial15_TableQA.ipynb)

-This tutorial shows you how to perform question-answering on tables using the `TableTextRetriever` or `BM25Retriever` as retriever node and the `TableReader` as reader node.
+This tutorial shows you how to perform question-answering on tables using the `EmbeddingRetriever` or `BM25Retriever` as retriever node and the `TableReader` as reader node.

 ### Prepare environment

@@ -79,15 +79,12 @@ es_server = Popen(
 # Connect to Elasticsearch
 from haystack.document_stores import ElasticsearchDocumentStore

-# We want to use a small model producing 512-dimensional embeddings, so we need to set embedding_dim to 512
 document_index = "document"
-document_store = ElasticsearchDocumentStore(
-    host="localhost", username="", password="", index=document_index, embedding_dim=512
-)
+document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index=document_index)
 ```

 ## Add Tables to DocumentStore
-To quickly demonstrate the capabilities of the `TableTextRetriever` and the `TableReader` we use a subset of 1000 tables and text documents from a dataset we have published in [this paper](https://arxiv.org/abs/2108.04049).
+To quickly demonstrate the capabilities of the `EmbeddingRetriever` and the `TableReader` we use a subset of 1000 tables and text documents from a dataset we have published in [this paper](https://arxiv.org/abs/2108.04049).

 Just as text passages, tables are represented as `Document` objects in Haystack. The content field, though, is a pandas DataFrame instead of a string.

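Note on the dropped `embedding_dim=512`: the old 512-dimensional bert-small encoders required an explicit dimension, while the new sentence-transformers model presumably emits 768-dimensional embeddings, which matches the `ElasticsearchDocumentStore` default, so the argument can simply be omitted. A minimal sketch of the equivalent explicit initialization; the 768 value is an assumption based on the mpnet base model, not something stated in the diff:

```python
from haystack.document_stores import ElasticsearchDocumentStore

# Sketch only: equivalent explicit initialization after this change.
# embedding_dim=768 is an assumption: all-mpnet-base-v2 based models
# typically produce 768-dimensional vectors, which is also the
# ElasticsearchDocumentStore default.
document_store = ElasticsearchDocumentStore(
    host="localhost", username="", password="", index="document", embedding_dim=768
)
```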
@@ -140,8 +137,7 @@ print(tables[0].meta)
 Retrievers help narrowing down the scope for the Reader to a subset of tables where a given question could be answered.
 They use some simple but fast algorithm.

-**Here:** We use the `TableTextRetriever` capable of retrieving relevant content among a database
-of texts and tables using dense embeddings. It is an extension of the `DensePassageRetriever` and consists of three encoders (one query encoder, one text passage encoder and one table encoder) that create embeddings in the same vector space. More details on the `TableTextRetriever` and how it is trained can be found in [this paper](https://arxiv.org/abs/2108.04049).
+**Here:** We specify an embedding model that is finetuned so it can also generate embeddings for tables (instead of just text).

 **Alternatives:**

@@ -150,13 +146,12 @@ of texts and tables using dense embeddings. It is an extension of the `DensePass
 

 ```python
-from haystack.nodes.retriever import TableTextRetriever
+from haystack.nodes.retriever import EmbeddingRetriever

-retriever = TableTextRetriever(
+retriever = EmbeddingRetriever(
     document_store=document_store,
-    query_embedding_model="deepset/bert-small-mm_retrieval-question_encoder",
-    passage_embedding_model="deepset/bert-small-mm_retrieval-passage_encoder",
-    table_embedding_model="deepset/bert-small-mm_retrieval-table_encoder",
+    embedding_model="deepset/all-mpnet-base-v2-table",
+    model_format="sentence_transformers",
 )
 ```

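A brief usage sketch of the new retriever, assuming the `document_store` and the tables indexed earlier in the tutorial; the query itself is only an example and not taken from the diff:

```python
# Compute embeddings for the indexed tables with the new EmbeddingRetriever,
# then retrieve candidate tables for a question.
document_store.update_embeddings(retriever=retriever)

retrieved_tables = retriever.retrieve("Who won the Super Bowl?", top_k=5)
print(retrieved_tables[0].content)  # for table Documents this is a pandas DataFrame
```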
@@ -230,8 +225,8 @@ The Retriever and the Reader can be sticked together to a pipeline in order to f
 from haystack import Pipeline

 table_qa_pipeline = Pipeline()
-table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["TableTextRetriever"])
+table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["EmbeddingRetriever"])
 ```


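Only the node name changes here; querying the pipeline stays the same, as the script version of this tutorial further down shows:

```python
from haystack.utils import print_answers

# Run the table QA pipeline; only the retriever node name changed.
prediction = table_qa_pipeline.run("When was Guilty Gear Xrd : Sign released?")
print_answers(prediction, details="minimum")
```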
@@ -266,10 +261,10 @@ document_store.update_embeddings(retriever=retriever, update_existing_embeddings
 ```

 ## Pipeline for QA on Combination of Text and Tables
-We are using one node for retrieving both texts and tables, the `TableTextRetriever`. In order to do question-answering on the Documents coming from the `TableTextRetriever`, we need to route Documents of type `"text"` to a `FARMReader` (or alternatively `TransformersReader`) and Documents of type `"table"` to a `TableReader`.
+We are using one node for retrieving both texts and tables, the `EmbeddingRetriever`. In order to do question-answering on the Documents coming from the `EmbeddingRetriever`, we need to route Documents of type `"text"` to a `FARMReader` (or alternatively `TransformersReader`) and Documents of type `"table"` to a `TableReader`.

 To achieve this, we make use of two additional nodes:
-- `RouteDocuments`: Splits the List of Documents retrieved by the `TableTextRetriever` into two lists containing only Documents of type `"text"` or `"table"`, respectively.
+- `RouteDocuments`: Splits the List of Documents retrieved by the `EmbeddingRetriever` into two lists containing only Documents of type `"text"` or `"table"`, respectively.
 - `JoinAnswers`: Takes Answers coming from two different Readers (in this case `FARMReader` and `TableReader`) and joins them to a single list of Answers.


@@ -288,8 +283,8 @@ join_answers = JoinAnswers()

 ```python
 text_table_qa_pipeline = Pipeline()
-text_table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["TableTextRetriever"])
+text_table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["EmbeddingRetriever"])
 text_table_qa_pipeline.add_node(component=text_reader, name="TextReader", inputs=["RouteDocuments.output_1"])
 text_table_qa_pipeline.add_node(component=table_reader, name="TableReader", inputs=["RouteDocuments.output_2"])
 text_table_qa_pipeline.add_node(component=join_answers, name="JoinAnswers", inputs=["TextReader", "TableReader"])
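For completeness, a sketch of querying the combined pipeline once the renamed nodes are wired up; the question is illustrative and not taken from the diff:

```python
# Ask a question that could be answered from either text or tables;
# JoinAnswers merges the answers coming from both readers.
predictions = text_table_qa_pipeline.run(query="Who was Thomas Alva Edison?")
print_answers(predictions, details="minimum")
```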
@@ -376,7 +371,10 @@ It can sometimes be hard to provide your data in form of a pandas DataFrame. For
 

 ```python
+import time
+
 !docker run -d -p 3001:3001 axarev/parsr
+time.sleep(30)
 ```


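The added `time.sleep(30)` gives the Parsr container time to come up before the tutorial sends conversion requests. A more defensive variant could poll the service instead of sleeping a fixed time; a sketch, assuming the `requests` library is available and that Parsr answers HTTP on port 3001 once ready:

```python
import time
import requests

# Hypothetical readiness check instead of a fixed sleep: poll port 3001
# (assumed to be Parsr's HTTP port) for up to roughly 30 seconds.
for _ in range(30):
    try:
        requests.get("http://localhost:3001", timeout=1)
        break
    except requests.exceptions.ConnectionError:
        time.sleep(1)
```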
File diff suppressed because one or more lines are too long
@@ -1,5 +1,6 @@
 import os
 import json
+import time

 import pandas as pd

@@ -7,7 +8,7 @@ from haystack import Label, MultiLabel, Answer
 from haystack.utils import launch_es, fetch_archive_from_http, print_answers
 from haystack.document_stores import ElasticsearchDocumentStore
 from haystack import Document, Pipeline
-from haystack.nodes.retriever import TableTextRetriever
+from haystack.nodes.retriever import EmbeddingRetriever
 from haystack.nodes import TableReader, FARMReader, RouteDocuments, JoinAnswers, ParsrConverter


@@ -17,10 +18,7 @@ def tutorial15_tableqa():
     launch_es()

     ## Connect to Elasticsearch
-    # We want to use a small model producing 512-dimensional embeddings, so we need to set embedding_dim to 512
-    document_store = ElasticsearchDocumentStore(
-        host="localhost", username="", password="", index="document", embedding_dim=512
-    )
+    document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

     ## Add Tables to DocumentStore

@@ -53,15 +51,13 @@ def tutorial15_tableqa():
     # Retrievers help narrowing down the scope for the Reader to a subset of tables where a given question could be answered.
     # They use some simple but fast algorithm.
     #
-    # **Here:** We use the TableTextRetriever capable of retrieving relevant content among a database
+    # **Here:** We use the EmbeddingRetriever capable of retrieving relevant content among a database
     # of texts and tables using dense embeddings.

-    retriever = TableTextRetriever(
+    retriever = EmbeddingRetriever(
         document_store=document_store,
-        query_embedding_model="deepset/bert-small-mm_retrieval-question_encoder",
-        passage_embedding_model="deepset/bert-small-mm_retrieval-passage_encoder",
-        table_embedding_model="deepset/bert-small-mm_retrieval-table_encoder",
-        embed_meta_fields=["title", "section_title"],
+        embedding_model="deepset/all-mpnet-base-v2-table",
+        model_format="sentence_transformers",
     )

     # Add table embeddings to the tables in DocumentStore
@@ -104,15 +100,15 @@ def tutorial15_tableqa():
     # for each of the tables, the sorting of the answers might be not helpful.

     table_qa_pipeline = Pipeline()
-    table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-    table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["TableTextRetriever"])
+    table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+    table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["EmbeddingRetriever"])

     prediction = table_qa_pipeline.run("When was Guilty Gear Xrd : Sign released?")
     print_answers(prediction, details="minimum")

     ### Pipeline for QA on Combination of Text and Tables
-    # We are using one node for retrieving both texts and tables, the TableTextRetriever.
-    # In order to do question-answering on the Documents coming from the TableTextRetriever, we need to route
+    # We are using one node for retrieving both texts and tables, the EmbeddingRetriever.
+    # In order to do question-answering on the Documents coming from the EmbeddingRetriever, we need to route
     # Documents of type "text" to a FARMReader ( or alternatively TransformersReader) and Documents of type
     # "table" to a TableReader.

@@ -125,8 +121,8 @@ def tutorial15_tableqa():
     join_answers = JoinAnswers()

     text_table_qa_pipeline = Pipeline()
-    text_table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-    text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["TableTextRetriever"])
+    text_table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+    text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["EmbeddingRetriever"])
     text_table_qa_pipeline.add_node(component=text_reader, name="TextReader", inputs=["RouteDocuments.output_1"])
     text_table_qa_pipeline.add_node(component=table_reader, name="TableReader", inputs=["RouteDocuments.output_2"])
     text_table_qa_pipeline.add_node(component=join_answers, name="JoinAnswers", inputs=["TextReader", "TableReader"])
@@ -189,11 +185,12 @@ def tutorial15_tableqa():
     # It can sometimes be hard to provide your data in form of a pandas DataFrame.
     # For this case, we provide the `ParsrConverter` wrapper that can help you to convert, for example, a PDF file into a document that you can index.
     os.system("docker run -d -p 3001:3001 axarev/parsr")
+    time.sleep(30)
     os.system("wget https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf")

     converter = ParsrConverter()
     docs = converter.convert("table.pdf")
-    tables = [doc for doc in docs if doc["content_type"] == "table"]
+    tables = [doc for doc in docs if doc.content_type == "table"]

     print(tables)

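A possible follow-up not covered by this diff: since `ParsrConverter` now returns `Document` objects (hence the `doc.content_type` fix), the extracted tables could be written to the store and embedded like the rest of the corpus. A hedged sketch, reusing `document_store` and `retriever` from earlier in the script:

```python
# Sketch only, not part of the commit: index the converted tables and
# refresh embeddings so the EmbeddingRetriever can also find them.
document_store.write_documents(tables, index="document")
document_store.update_embeddings(retriever=retriever, update_existing_embeddings=False)
```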