Replace TableTextRetriever with EmbeddingRetriever in Tutorial 15 (#2479)

* replace TableTextRetriever with EmbeddingRetriever in Tutorial 15

* Update Documentation & Code Style

* fix bug

* Update Documentation & Code Style

* update tutorial 15 outputs

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-20-212.eu-west-1.compute.internal>
MichelBartels 2022-05-05 10:12:44 +02:00 committed by GitHub
parent 5d98810a17
commit c7e39e5225
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 1528 additions and 1361 deletions


@@ -10,7 +10,7 @@ id: "tutorial15md"
 # Open-Domain QA on Tables
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial15_TableQA.ipynb)
-This tutorial shows you how to perform question-answering on tables using the `TableTextRetriever` or `BM25Retriever` as retriever node and the `TableReader` as reader node.
+This tutorial shows you how to perform question-answering on tables using the `EmbeddingRetriever` or `BM25Retriever` as retriever node and the `TableReader` as reader node.
 ### Prepare environment
@@ -79,15 +79,12 @@ es_server = Popen(
 # Connect to Elasticsearch
 from haystack.document_stores import ElasticsearchDocumentStore
-# We want to use a small model producing 512-dimensional embeddings, so we need to set embedding_dim to 512
 document_index = "document"
-document_store = ElasticsearchDocumentStore(
-    host="localhost", username="", password="", index=document_index, embedding_dim=512
-)
+document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index=document_index)
 ```
 ## Add Tables to DocumentStore
-To quickly demonstrate the capabilities of the `TableTextRetriever` and the `TableReader` we use a subset of 1000 tables and text documents from a dataset we have published in [this paper](https://arxiv.org/abs/2108.04049).
+To quickly demonstrate the capabilities of the `EmbeddingRetriever` and the `TableReader` we use a subset of 1000 tables and text documents from a dataset we have published in [this paper](https://arxiv.org/abs/2108.04049).
 Just like text passages, tables are represented as `Document` objects in Haystack. The content field, though, is a pandas DataFrame instead of a string.
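Since a table `Document`'s content is a pandas DataFrame, its shape can be sketched as follows (the rows below are illustrative toy data, not taken from the dataset; the Haystack `Document` wrapping is shown only as a comment so the snippet stays dependency-light):

```python
# Sketch: a table as a pandas DataFrame, the content that Haystack's
# table Documents carry (toy rows, not from the actual dataset).
import pandas as pd

rows = [
    ["Title", "Release date"],
    ["Guilty Gear Xrd: Sign", "2014"],
]
df = pd.DataFrame(rows[1:], columns=rows[0])

# In the tutorial this DataFrame would be wrapped roughly as:
#   Document(content=df, content_type="table")
```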
@@ -140,8 +137,7 @@ print(tables[0].meta)
 Retrievers help narrow down the scope for the Reader to a subset of tables where a given question could be answered.
 They use simple but fast algorithms.
-**Here:** We use the `TableTextRetriever` capable of retrieving relevant content among a database
-of texts and tables using dense embeddings. It is an extension of the `DensePassageRetriever` and consists of three encoders (one query encoder, one text passage encoder and one table encoder) that create embeddings in the same vector space. More details on the `TableTextRetriever` and how it is trained can be found in [this paper](https://arxiv.org/abs/2108.04049).
+**Here:** We specify an embedding model that is finetuned so it can also generate embeddings for tables (instead of just text).
 **Alternatives:**
@@ -150,13 +146,12 @@ of texts and tables using dense embeddings. It is an extension of the `DensePass
 ```python
-from haystack.nodes.retriever import TableTextRetriever
+from haystack.nodes.retriever import EmbeddingRetriever
-retriever = TableTextRetriever(
+retriever = EmbeddingRetriever(
     document_store=document_store,
-    query_embedding_model="deepset/bert-small-mm_retrieval-question_encoder",
-    passage_embedding_model="deepset/bert-small-mm_retrieval-passage_encoder",
-    table_embedding_model="deepset/bert-small-mm_retrieval-table_encoder",
+    embedding_model="deepset/all-mpnet-base-v2-table",
+    model_format="sentence_transformers",
 )
 ```
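Under the hood, a dense retriever like this ranks documents by vector similarity between the query embedding and each document (or table) embedding. A toy, dependency-free sketch of that scoring (the 3-d vectors are made up for illustration; real sentence-transformers embeddings have hundreds of dimensions):

```python
# Toy sketch of dense-retrieval scoring: query and documents live in the
# same vector space and are ranked by cosine similarity.
# The 3-d vectors are invented stand-ins, not real embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "table_about_releases": [0.9, 0.1, 0.8],
    "unrelated_text": [0.0, 1.0, 0.2],
}
# Rank document ids by similarity to the query, best match first
ranked = sorted(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]), reverse=True)
```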
@@ -230,8 +225,8 @@ The Retriever and the Reader can be sticked together to a pipeline in order to f
 from haystack import Pipeline
 table_qa_pipeline = Pipeline()
-table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["TableTextRetriever"])
+table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["EmbeddingRetriever"])
 ```
@@ -266,10 +261,10 @@ document_store.update_embeddings(retriever=retriever, update_existing_embeddings
 ```
 ## Pipeline for QA on Combination of Text and Tables
-We are using one node for retrieving both texts and tables, the `TableTextRetriever`. In order to do question-answering on the Documents coming from the `TableTextRetriever`, we need to route Documents of type `"text"` to a `FARMReader` (or alternatively `TransformersReader`) and Documents of type `"table"` to a `TableReader`.
+We are using one node for retrieving both texts and tables, the `EmbeddingRetriever`. In order to do question-answering on the Documents coming from the `EmbeddingRetriever`, we need to route Documents of type `"text"` to a `FARMReader` (or alternatively `TransformersReader`) and Documents of type `"table"` to a `TableReader`.
 To achieve this, we make use of two additional nodes:
-- `RouteDocuments`: Splits the List of Documents retrieved by the `TableTextRetriever` into two lists containing only Documents of type `"text"` or `"table"`, respectively.
+- `RouteDocuments`: Splits the List of Documents retrieved by the `EmbeddingRetriever` into two lists containing only Documents of type `"text"` or `"table"`, respectively.
 - `JoinAnswers`: Takes Answers coming from two different Readers (in this case `FARMReader` and `TableReader`) and joins them to a single list of Answers.
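Conceptually, the two nodes do something like the following plain-Python sketch (toy dicts stand in for Haystack `Document` and `Answer` objects; this is not the actual Haystack implementation):

```python
# Plain-Python sketch of RouteDocuments and JoinAnswers.
docs = [
    {"content_type": "text", "content": "Some passage about a game."},
    {"content_type": "table", "content": [["Title", "Release"], ["Some game", "2014"]]},
]

# RouteDocuments: split the retrieved Documents by content type
text_docs = [d for d in docs if d["content_type"] == "text"]
table_docs = [d for d in docs if d["content_type"] == "table"]

# Each reader then produces its own (toy) answers from its share of documents
text_answers = [{"answer": "a game", "score": 0.61}]
table_answers = [{"answer": "2014", "score": 0.93}]

# JoinAnswers: merge both lists into a single ranking by score
joined = sorted(text_answers + table_answers, key=lambda a: a["score"], reverse=True)
```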
@@ -288,8 +283,8 @@ join_answers = JoinAnswers()
 ```python
 text_table_qa_pipeline = Pipeline()
-text_table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["TableTextRetriever"])
+text_table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["EmbeddingRetriever"])
 text_table_qa_pipeline.add_node(component=text_reader, name="TextReader", inputs=["RouteDocuments.output_1"])
 text_table_qa_pipeline.add_node(component=table_reader, name="TableReader", inputs=["RouteDocuments.output_2"])
 text_table_qa_pipeline.add_node(component=join_answers, name="JoinAnswers", inputs=["TextReader", "TableReader"])
@@ -376,7 +371,10 @@ It can sometimes be hard to provide your data in form of a pandas DataFrame. For
 ```python
+import time
 !docker run -d -p 3001:3001 axarev/parsr
+time.sleep(30)
 ```
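The fixed `time.sleep(30)` simply gives the Parsr container time to boot. A more robust (hypothetical) alternative is to poll the container's port until it accepts connections, using only the standard library:

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 60.0) -> bool:
    """Poll until a TCP port accepts connections, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Attempt a connection; success means the service is listening
            with socket.create_connection((host, port), timeout=1):
                return True
        except OSError:
            time.sleep(0.5)
    return False

# e.g. wait_for_port("localhost", 3001) before calling ParsrConverter
```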

File diff suppressed because one or more lines are too long


@@ -1,5 +1,6 @@
 import os
 import json
+import time
 import pandas as pd
@@ -7,7 +8,7 @@ from haystack import Label, MultiLabel, Answer
 from haystack.utils import launch_es, fetch_archive_from_http, print_answers
 from haystack.document_stores import ElasticsearchDocumentStore
 from haystack import Document, Pipeline
-from haystack.nodes.retriever import TableTextRetriever
+from haystack.nodes.retriever import EmbeddingRetriever
 from haystack.nodes import TableReader, FARMReader, RouteDocuments, JoinAnswers, ParsrConverter
@@ -17,10 +18,7 @@ def tutorial15_tableqa():
     launch_es()
     ## Connect to Elasticsearch
-    # We want to use a small model producing 512-dimensional embeddings, so we need to set embedding_dim to 512
-    document_store = ElasticsearchDocumentStore(
-        host="localhost", username="", password="", index="document", embedding_dim=512
-    )
+    document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")
     ## Add Tables to DocumentStore
@@ -53,15 +51,13 @@ def tutorial15_tableqa():
     # Retrievers help narrow down the scope for the Reader to a subset of tables where a given question could be answered.
     # They use simple but fast algorithms.
     #
-    # **Here:** We use the TableTextRetriever capable of retrieving relevant content among a database
+    # **Here:** We use the EmbeddingRetriever capable of retrieving relevant content among a database
     # of texts and tables using dense embeddings.
-    retriever = TableTextRetriever(
+    retriever = EmbeddingRetriever(
         document_store=document_store,
-        query_embedding_model="deepset/bert-small-mm_retrieval-question_encoder",
-        passage_embedding_model="deepset/bert-small-mm_retrieval-passage_encoder",
-        table_embedding_model="deepset/bert-small-mm_retrieval-table_encoder",
-        embed_meta_fields=["title", "section_title"],
+        embedding_model="deepset/all-mpnet-base-v2-table",
+        model_format="sentence_transformers",
     )
     # Add table embeddings to the tables in DocumentStore
@@ -104,15 +100,15 @@ def tutorial15_tableqa():
     # for each of the tables, the sorting of the answers might be not helpful.
     table_qa_pipeline = Pipeline()
-    table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-    table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["TableTextRetriever"])
+    table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+    table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["EmbeddingRetriever"])
     prediction = table_qa_pipeline.run("When was Guilty Gear Xrd : Sign released?")
     print_answers(prediction, details="minimum")
     ### Pipeline for QA on Combination of Text and Tables
-    # We are using one node for retrieving both texts and tables, the TableTextRetriever.
-    # In order to do question-answering on the Documents coming from the TableTextRetriever, we need to route
+    # We are using one node for retrieving both texts and tables, the EmbeddingRetriever.
+    # In order to do question-answering on the Documents coming from the EmbeddingRetriever, we need to route
     # Documents of type "text" to a FARMReader (or alternatively TransformersReader) and Documents of type
     # "table" to a TableReader.
@@ -125,8 +121,8 @@ def tutorial15_tableqa():
     join_answers = JoinAnswers()
     text_table_qa_pipeline = Pipeline()
-    text_table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
-    text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["TableTextRetriever"])
+    text_table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
+    text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["EmbeddingRetriever"])
     text_table_qa_pipeline.add_node(component=text_reader, name="TextReader", inputs=["RouteDocuments.output_1"])
     text_table_qa_pipeline.add_node(component=table_reader, name="TableReader", inputs=["RouteDocuments.output_2"])
     text_table_qa_pipeline.add_node(component=join_answers, name="JoinAnswers", inputs=["TextReader", "TableReader"])
@@ -189,11 +185,12 @@ def tutorial15_tableqa():
     # It can sometimes be hard to provide your data in the form of a pandas DataFrame.
    # For this case, we provide the `ParsrConverter` wrapper that can help you to convert, for example, a PDF file into a document that you can index.
     os.system("docker run -d -p 3001:3001 axarev/parsr")
+    time.sleep(30)
     os.system("wget https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf")
     converter = ParsrConverter()
     docs = converter.convert("table.pdf")
-    tables = [doc for doc in docs if doc["content_type"] == "table"]
+    tables = [doc for doc in docs if doc.content_type == "table"]
     print(tables)