Replace TableTextRetriever with EmbeddingRetriever in Tutorial 15 (#2479)

* replace TableTextRetriever with EmbeddingRetriever in Tutorial 15

* Update Documentation & Code Style

* fix bug

* Update Documentation & Code Style

* update tutorial 15 outputs

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-20-212.eu-west-1.compute.internal>
MichelBartels 2022-05-05 10:12:44 +02:00 committed by GitHub
parent 5d98810a17
commit c7e39e5225
3 changed files with 1528 additions and 1361 deletions


@@ -10,7 +10,7 @@ id: "tutorial15md"
 # Open-Domain QA on Tables
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial15_TableQA.ipynb)
-This tutorial shows you how to perform question-answering on tables using the `TableTextRetriever` or `BM25Retriever` as retriever node and the `TableReader` as reader node.
+This tutorial shows you how to perform question-answering on tables using the `EmbeddingRetriever` or `BM25Retriever` as retriever node and the `TableReader` as reader node.
 ### Prepare environment
@@ -79,15 +79,12 @@ es_server = Popen(
 # Connect to Elasticsearch
 from haystack.document_stores import ElasticsearchDocumentStore
-# We want to use a small model producing 512-dimensional embeddings, so we need to set embedding_dim to 512
 document_index = "document"
-document_store = ElasticsearchDocumentStore(
-    host="localhost", username="", password="", index=document_index, embedding_dim=512
-)
+document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index=document_index)
 ```
 ## Add Tables to DocumentStore
-To quickly demonstrate the capabilities of the `TableTextRetriever` and the `TableReader` we use a subset of 1000 tables and text documents from a dataset we have published in [this paper](https://arxiv.org/abs/2108.04049).
+To quickly demonstrate the capabilities of the `EmbeddingRetriever` and the `TableReader` we use a subset of 1000 tables and text documents from a dataset we have published in [this paper](https://arxiv.org/abs/2108.04049).
 Just as text passages, tables are represented as `Document` objects in Haystack. The content field, though, is a pandas DataFrame instead of a string.
@@ -140,8 +137,7 @@ print(tables[0].meta)
 Retrievers help narrowing down the scope for the Reader to a subset of tables where a given question could be answered.
 They use some simple but fast algorithm.
-**Here:** We use the `TableTextRetriever` capable of retrieving relevant content among a database
-of texts and tables using dense embeddings. It is an extension of the `DensePassageRetriever` and consists of three encoders (one query encoder, one text passage encoder and one table encoder) that create embeddings in the same vector space. More details on the `TableTextRetriever` and how it is trained can be found in [this paper](https://arxiv.org/abs/2108.04049).
+**Here:** We specify an embedding model that is finetuned so it can also generate embeddings for tables (instead of just text).
 **Alternatives:**
@@ -150,13 +146,12 @@ of texts and tables using dense embeddings. It is an extension of the `DensePass
 ```python
-from haystack.nodes.retriever import TableTextRetriever
+from haystack.nodes.retriever import EmbeddingRetriever
-retriever = TableTextRetriever(
+retriever = EmbeddingRetriever(
     document_store=document_store,
-    query_embedding_model="deepset/bert-small-mm_retrieval-question_encoder",
-    passage_embedding_model="deepset/bert-small-mm_retrieval-passage_encoder",
-    table_embedding_model="deepset/bert-small-mm_retrieval-table_encoder",
+    embedding_model="deepset/all-mpnet-base-v2-table",
+    model_format="sentence_transformers",
 )
 ```
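The replacement retriever ranks documents by embedding similarity rather than with separate query/passage/table encoders. As a rough, self-contained illustration of that ranking step (toy hand-written vectors, not embeddings from the `deepset/all-mpnet-base-v2-table` model), queries and documents live in one vector space and documents are sorted by cosine similarity:

```python
# Toy sketch of the ranking a dense retriever performs: both the query
# and the documents (texts or linearized tables) are mapped into one
# vector space, and documents are ranked by cosine similarity.
# The 3-d vectors below are made up purely for illustration.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

doc_embeddings = {
    "table: game release dates": [0.9, 0.1, 0.2],
    "text: plot summary": [0.1, 0.8, 0.3],
}
query_embedding = [0.8, 0.2, 0.1]

ranked = sorted(
    doc_embeddings,
    key=lambda d: cosine(query_embedding, doc_embeddings[d]),
    reverse=True,
)
print(ranked[0])  # the table document is the closest match here
```

In the real pipeline, `document_store.update_embeddings(retriever)` precomputes the document vectors so only the query is embedded at search time.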
@@ -230,8 +225,8 @@ The Retriever and the Reader can be sticked together to a pipeline in order to f
 from haystack import Pipeline
 table_qa_pipeline = Pipeline()
-table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["TableTextRetriever"])
+table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["EmbeddingRetriever"])
 ```
@@ -266,10 +261,10 @@ document_store.update_embeddings(retriever=retriever, update_existing_embeddings
 ```
 ## Pipeline for QA on Combination of Text and Tables
-We are using one node for retrieving both texts and tables, the `TableTextRetriever`. In order to do question-answering on the Documents coming from the `TableTextRetriever`, we need to route Documents of type `"text"` to a `FARMReader` (or alternatively `TransformersReader`) and Documents of type `"table"` to a `TableReader`.
+We are using one node for retrieving both texts and tables, the `EmbeddingRetriever`. In order to do question-answering on the Documents coming from the `EmbeddingRetriever`, we need to route Documents of type `"text"` to a `FARMReader` (or alternatively `TransformersReader`) and Documents of type `"table"` to a `TableReader`.
 To achieve this, we make use of two additional nodes:
-- `RouteDocuments`: Splits the List of Documents retrieved by the `TableTextRetriever` into two lists containing only Documents of type `"text"` or `"table"`, respectively.
+- `RouteDocuments`: Splits the List of Documents retrieved by the `EmbeddingRetriever` into two lists containing only Documents of type `"text"` or `"table"`, respectively.
 - `JoinAnswers`: Takes Answers coming from two different Readers (in this case `FARMReader` and `TableReader`) and joins them to a single list of Answers.
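The two helper nodes are simple in spirit. A minimal plain-Python sketch (stand-in dicts and f-strings, not the actual Haystack `RouteDocuments`/`JoinAnswers` classes) of the split-then-merge flow they implement:

```python
# Sketch of the routing/joining logic: split retrieved documents by
# content_type, hand each split to its own "reader", merge the answers.
docs = [
    {"content_type": "text", "content": "some passage"},
    {"content_type": "table", "content": "some table"},
    {"content_type": "text", "content": "another passage"},
]

# RouteDocuments: one output list per content type
texts = [d for d in docs if d["content_type"] == "text"]
tables = [d for d in docs if d["content_type"] == "table"]

# Stand-ins for FARMReader / TableReader producing answers per split
text_answers = [f"text answer from: {d['content']}" for d in texts]
table_answers = [f"table answer from: {d['content']}" for d in tables]

# JoinAnswers: a single merged answer list for the pipeline output
answers = text_answers + table_answers
print(len(answers))  # 3
```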
@@ -288,8 +283,8 @@ join_answers = JoinAnswers()
 ```python
 text_table_qa_pipeline = Pipeline()
-text_table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["TableTextRetriever"])
+text_table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["EmbeddingRetriever"])
 text_table_qa_pipeline.add_node(component=text_reader, name="TextReader", inputs=["RouteDocuments.output_1"])
 text_table_qa_pipeline.add_node(component=table_reader, name="TableReader", inputs=["RouteDocuments.output_2"])
 text_table_qa_pipeline.add_node(component=join_answers, name="JoinAnswers", inputs=["TextReader", "TableReader"])
@@ -376,7 +371,10 @@ It can sometimes be hard to provide your data in form of a pandas DataFrame. For
 ```python
+import time
 !docker run -d -p 3001:3001 axarev/parsr
+time.sleep(30)
 ```
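The `time.sleep(30)` added above gives the Parsr container time to start before it is queried. A fixed sleep can be too short on a slow machine or needlessly long on a fast one; a common alternative (a sketch, not part of this commit) is to poll the published port until it accepts connections:

```python
# Poll a TCP port until it accepts connections or a deadline passes.
# Could replace a fixed sleep after `docker run -d -p 3001:3001 ...`.
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 60.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1):
                return True
        except OSError:
            time.sleep(0.5)  # container not ready yet; retry shortly
    return False
```

Usage here would be `wait_for_port("localhost", 3001)` right after starting the container.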

File diff suppressed because one or more lines are too long


@@ -1,5 +1,6 @@
 import os
 import json
+import time
 import pandas as pd
@@ -7,7 +8,7 @@ from haystack import Label, MultiLabel, Answer
 from haystack.utils import launch_es, fetch_archive_from_http, print_answers
 from haystack.document_stores import ElasticsearchDocumentStore
 from haystack import Document, Pipeline
-from haystack.nodes.retriever import TableTextRetriever
+from haystack.nodes.retriever import EmbeddingRetriever
 from haystack.nodes import TableReader, FARMReader, RouteDocuments, JoinAnswers, ParsrConverter
@@ -17,10 +18,7 @@ def tutorial15_tableqa():
     launch_es()
     ## Connect to Elasticsearch
-    # We want to use a small model producing 512-dimensional embeddings, so we need to set embedding_dim to 512
-    document_store = ElasticsearchDocumentStore(
-        host="localhost", username="", password="", index="document", embedding_dim=512
-    )
+    document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")
     ## Add Tables to DocumentStore
@@ -53,15 +51,13 @@ def tutorial15_tableqa():
     # Retrievers help narrowing down the scope for the Reader to a subset of tables where a given question could be answered.
     # They use some simple but fast algorithm.
     #
-    # **Here:** We use the TableTextRetriever capable of retrieving relevant content among a database
+    # **Here:** We use the EmbeddingRetriever capable of retrieving relevant content among a database
     # of texts and tables using dense embeddings.
-    retriever = TableTextRetriever(
+    retriever = EmbeddingRetriever(
         document_store=document_store,
-        query_embedding_model="deepset/bert-small-mm_retrieval-question_encoder",
-        passage_embedding_model="deepset/bert-small-mm_retrieval-passage_encoder",
-        table_embedding_model="deepset/bert-small-mm_retrieval-table_encoder",
-        embed_meta_fields=["title", "section_title"],
+        embedding_model="deepset/all-mpnet-base-v2-table",
+        model_format="sentence_transformers",
     )
     # Add table embeddings to the tables in DocumentStore
@@ -104,15 +100,15 @@ def tutorial15_tableqa():
     # for each of the tables, the sorting of the answers might be not helpful.
     table_qa_pipeline = Pipeline()
-    table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-    table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["TableTextRetriever"])
+    table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+    table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["EmbeddingRetriever"])
     prediction = table_qa_pipeline.run("When was Guilty Gear Xrd : Sign released?")
     print_answers(prediction, details="minimum")
     ### Pipeline for QA on Combination of Text and Tables
-    # We are using one node for retrieving both texts and tables, the TableTextRetriever.
-    # In order to do question-answering on the Documents coming from the TableTextRetriever, we need to route
+    # We are using one node for retrieving both texts and tables, the EmbeddingRetriever.
+    # In order to do question-answering on the Documents coming from the EmbeddingRetriever, we need to route
     # Documents of type "text" to a FARMReader ( or alternatively TransformersReader) and Documents of type
     # "table" to a TableReader.
@@ -125,8 +121,8 @@ def tutorial15_tableqa():
     join_answers = JoinAnswers()
     text_table_qa_pipeline = Pipeline()
-    text_table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-    text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["TableTextRetriever"])
+    text_table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+    text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["EmbeddingRetriever"])
     text_table_qa_pipeline.add_node(component=text_reader, name="TextReader", inputs=["RouteDocuments.output_1"])
     text_table_qa_pipeline.add_node(component=table_reader, name="TableReader", inputs=["RouteDocuments.output_2"])
     text_table_qa_pipeline.add_node(component=join_answers, name="JoinAnswers", inputs=["TextReader", "TableReader"])
@@ -189,11 +185,12 @@ def tutorial15_tableqa():
     # It can sometimes be hard to provide your data in form of a pandas DataFrame.
     # For this case, we provide the `ParsrConverter` wrapper that can help you to convert, for example, a PDF file into a document that you can index.
     os.system("docker run -d -p 3001:3001 axarev/parsr")
+    time.sleep(30)
     os.system("wget https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf")
     converter = ParsrConverter()
     docs = converter.convert("table.pdf")
-    tables = [doc for doc in docs if doc["content_type"] == "table"]
+    tables = [doc for doc in docs if doc.content_type == "table"]
     print(tables)
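The last change swaps `doc["content_type"]` for `doc.content_type`: the converter now yields `Document` objects, whose fields are attributes rather than dict keys, so the dict-style lookup raises a `TypeError`. A minimal stand-in (a hypothetical `Doc` dataclass, not `haystack.Document`) showing the difference:

```python
# Attribute access vs. dict-style access: dataclass instances are not
# subscriptable, so doc["content_type"] fails while doc.content_type works.
from dataclasses import dataclass

@dataclass
class Doc:  # hypothetical stand-in for a converter output object
    content: str
    content_type: str

docs = [Doc("a passage", "text"), Doc("a table", "table")]
tables = [d for d in docs if d.content_type == "table"]  # works
print(len(tables))  # 1
# docs[0]["content_type"] would raise: TypeError: 'Doc' object is not subscriptable
```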