<!---
title: "Tutorial 15"
metaTitle: "TableQA Tutorial"
metaDescription: ""
slug: "/docs/tutorial15"
date: "2021-10-28"
id: "tutorial15md"
--->
# Open-Domain QA on Tables
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/15_TableQA.ipynb)
This tutorial shows you how to perform question answering on tables using the `TableTextRetriever` or `ElasticsearchRetriever` as the retriever node and the `TableReader` as the reader node.
### Prepare environment
#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**
<img src="https://raw.githubusercontent.com/deepset-ai/haystack/main/docs/img/colab_gpu_runtime.jpg">
```python
# Make sure you have a GPU running
!nvidia-smi
```
```python
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack
# Install the latest main of Haystack
!pip install grpcio-tools==1.34.1
!pip install git+https://github.com/deepset-ai/haystack.git
# The TaPas-based TableReader requires the torch-scatter library
!pip install torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
# If you run this notebook on Google Colab, you might need to
# restart the runtime after installing haystack.
```
### Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g., in Colab notebooks), you can manually download and run Elasticsearch from source.
```python
# Recommended: Start Elasticsearch using Docker via the Haystack utility function
from haystack.utils import launch_es
launch_es()
```
```python
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2
import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
```
```python
# Connect to Elasticsearch
from haystack.document_stores import ElasticsearchDocumentStore
# We want to use a small model producing 512-dimensional embeddings, so we need to set embedding_dim to 512
document_store = ElasticsearchDocumentStore(host="localhost",
                                            username="",
                                            password="",
                                            index="document",
                                            embedding_dim=512)
```
## Add Tables to DocumentStore
To quickly demonstrate the capabilities of the `TableTextRetriever` and the `TableReader`, we use a subset of 1,000 tables from the [Open Table-and-Text Question Answering (OTT-QA) dataset](https://github.com/wenhuchen/OTT-QA).
Just like text passages, tables are represented as `Document` objects in Haystack. Their content field, however, is a pandas DataFrame instead of a string.
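For illustration, here is a minimal sketch of what such a table `Document` looks like (the table data below is made up purely for demonstration and is not part of the dataset):
```python
# Minimal sketch of a table Document (illustrative data only)
import pandas as pd
from haystack import Document

example_df = pd.DataFrame(
    columns=["Building", "City", "Floors"],
    data=[["Tower A", "Berlin", 32], ["Tower B", "Munich", 28]],
)
example_table_doc = Document(content=example_df, content_type="table", meta={"title": "Example buildings"})
print(example_table_doc.content)
```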
```python
# Let's first fetch some tables that we want to query
# Here: 1000 tables from OTT-QA
from haystack.utils import fetch_archive_from_http
doc_dir = "data"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/ottqa_tables_sample.json.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
```
```python
# Add the tables to the DocumentStore
import json
from haystack import Document
import pandas as pd
def read_ottqa_tables(filename):
    processed_tables = []
    with open(filename) as tables:
        tables = json.load(tables)
        for key, table in tables.items():
            current_columns = table["header"]
            current_rows = table["data"]
            current_df = pd.DataFrame(columns=current_columns, data=current_rows)
            current_doc_title = table["title"]
            current_section_title = table["section_title"]
            document = Document(
                content=current_df,
                content_type="table",
                meta={"title": current_doc_title, "section_title": current_section_title},
                id=key
            )
            processed_tables.append(document)
    return processed_tables
tables = read_ottqa_tables("data/ottqa_tables_sample.json")
document_store.write_documents(tables, index="document")
# Showing content field and meta field of one of the Documents of content_type 'table'
print(tables[0].content)
print(tables[0].meta)
```
## Initialize Retriever, Reader & Pipeline
### Retriever
Retrievers help narrow down the scope for the Reader to a subset of tables where a given question could be answered.
They use simple but fast algorithms.
**Here:** We use the `TableTextRetriever`, which is capable of retrieving relevant content from a database of texts and tables using dense embeddings. It is an extension of the `DensePassageRetriever` and consists of three encoders (one query encoder, one text passage encoder, and one table encoder) that create embeddings in the same vector space. More details on the `TableTextRetriever` and how it is trained can be found in [this paper](https://arxiv.org/abs/2108.04049).
**Alternatives:**
- `ElasticsearchRetriever`, which uses the BM25 algorithm
```python
from haystack.nodes.retriever import TableTextRetriever
retriever = TableTextRetriever(
    document_store=document_store,
    query_embedding_model="deepset/bert-small-mm_retrieval-question_encoder",
    passage_embedding_model="deepset/bert-small-mm_retrieval-passage_encoder",
    table_embedding_model="deepset/bert-small-mm_retrieval-table_encoder",
    embed_meta_fields=["title", "section_title"]
)
```
```python
# Add table embeddings to the tables in DocumentStore
document_store.update_embeddings(retriever=retriever)
```
```python
## Alternative: ElasticsearchRetriever
#from haystack.nodes.retriever import ElasticsearchRetriever
#retriever = ElasticsearchRetriever(document_store=document_store)
```
```python
# Try the Retriever
from haystack.utils import print_documents
retrieved_tables = retriever.retrieve("How many twin buildings are under construction?", top_k=5)
# Get highest scored table
print(retrieved_tables[0].content)
```
### Reader
The `TableReader` is based on TaPas, a transformer-based language model capable of grasping the two-dimensional structure of a table. It scans the tables returned by the Retriever and extracts the answer. The available TableReader models can be found [here](https://huggingface.co/models?pipeline_tag=table-question-answering&sort=downloads).
**Notice**: The `TableReader` will return an answer for each table, even if the query cannot be answered by the table. Furthermore, the confidence scores are not useful as of now, given that they will *always* be very high (i.e., 1 or close to 1).
```python
from haystack.nodes import TableReader
reader = TableReader(model_name_or_path="google/tapas-base-finetuned-wtq", max_seq_len=512)
```
```python
# Try the TableReader on one Table (highest-scored retrieved table from previous section)
table_doc = document_store.get_document_by_id("List_of_tallest_twin_buildings_and_structures_in_the_world_1")
print(table_doc.content)
```
```python
from haystack.utils import print_answers
prediction = reader.predict(query="How many twin buildings are under construction?", documents=[table_doc])
print_answers(prediction, details="all")
```
The offsets in the `offsets_in_document` and `offsets_in_context` fields indicate the table cells that the model predicts to be part of the answer. They need to be interpreted on the linearized table, i.e., a flat list containing all of the table cells.
In the `Answer`'s meta field, you can find the aggregation operator used to construct the answer (in this case `COUNT`) and the answer cells as strings.
```python
print(f"Predicted answer: {prediction['answers'][0].answer}")
print(f"Meta field: {prediction['answers'][0].meta}")
```
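If you want to locate the answer cells in the original table, you can map the linearized offsets back to row and column indices. Here is a minimal sketch, assuming each offset's `start` indexes the table cells flattened in row-major order:
```python
# Sketch: map linearized offsets back to (row, column) positions
# Assumption: each offset's `start` indexes the table cells flattened in row-major order
answer = prediction["answers"][0]
n_columns = table_doc.content.shape[1]
for offset in answer.offsets_in_context:
    row, col = divmod(offset.start, n_columns)
    print(f"Cell ({row}, {col}): {table_doc.content.iat[row, col]}")
```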
### Pipeline
The Retriever and the Reader can be combined in a pipeline that first retrieves relevant tables and then extracts the answer.
**Notice**: Given that the `TableReader` does not provide useful confidence scores and returns an answer for each of the tables, the ranking of the answers might not be helpful.
```python
# Initialize pipeline
from haystack import Pipeline
table_qa_pipeline = Pipeline()
table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
table_qa_pipeline.add_node(component=reader, name="TableReader", inputs=["TableTextRetriever"])
```
```python
prediction = table_qa_pipeline.run("How many twin buildings are under construction?")
print_answers(prediction, details="minimum")
```
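Since the confidence scores are not informative, it can help to check which table each answer was extracted from. Here is a minimal sketch, assuming each `Answer` exposes the id of its source table via its `document_id` field:
```python
# Sketch: print each answer together with the title of the table it came from
# Assumption: Answer objects expose the id of their source table via `document_id`
for answer in prediction["answers"]:
    source_table = document_store.get_document_by_id(answer.document_id)
    table_title = source_table.meta.get("title") if source_table else "unknown"
    print(f"{answer.answer!r} (from table: {table_title})")
```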
## About us
This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany.
We bring NLP to the industry via open source!
Our focus: Industry specific language models & large scale QA systems.
Some of our other work:
- [German BERT](https://deepset.ai/german-bert)
- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
- [FARM](https://github.com/deepset-ai/FARM)
Get in touch:
[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)
By the way: [we're hiring!](https://www.deepset.ai/jobs)