Extend TranslationWrapper to work with QA Generation (#1905)

* draft translationwrapper example

* draft translation of generated qa pairs

* Add latest docstring and tutorial changes

* fixed pass by reference by deepcopy

* delete adapted tutorial 13 (test purposes only)

* adapt method signature and doc string

* Add latest docstring and tutorial changes

* add type ignore

* extend tutorial 13 with TranslationWrapper example

* Add latest docstring and tutorial changes

* removed duplicate code

* indent if statement

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ArzelaAscoIi <kristof.herrmann@rwth-aachen.de>
Julian Risch 2022-01-03 13:30:24 +01:00 committed by GitHub
parent a94c274134
commit a846be99d1
GPG Key ID: 4AEE18F83AFDEB23
6 changed files with 200 additions and 67 deletions

View File

@ -15,7 +15,7 @@ Abstract class for a Translator component that translates either a query or a do
```python
| @abstractmethod
| translate(query: Optional[str] = None, documents: Optional[Union[List[Document], List[Answer], List[str], List[Dict[str, Any]]]] = None, dict_key: Optional[str] = None) -> Union[str, List[Document], List[Answer], List[str], List[Dict[str, Any]]]
| translate(results: List[Dict[str, Any]] = None, query: Optional[str] = None, documents: Optional[Union[List[Document], List[Answer], List[str], List[Dict[str, Any]]]] = None, dict_key: Optional[str] = None) -> Union[str, List[Document], List[Answer], List[str], List[Dict[str, Any]]]
```
Translate the passed query or a list of documents from language A to B.
@ -24,7 +24,7 @@ Translate the passed query or a list of documents from language A to B.
#### run
```python
| run(query: Optional[str] = None, documents: Optional[Union[List[Document], List[Answer], List[str], List[Dict[str, Any]]]] = None, answers: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None, dict_key: Optional[str] = None)
| run(results: List[Dict[str, Any]] = None, query: Optional[str] = None, documents: Optional[Union[List[Document], List[Answer], List[str], List[Dict[str, Any]]]] = None, answers: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None, dict_key: Optional[str] = None)
```
Method that gets executed when this class is used as a Node in a Haystack Pipeline
@ -90,13 +90,14 @@ They also have a few multilingual models that support multiple languages at once
#### translate
```python
| translate(query: Optional[str] = None, documents: Optional[Union[List[Document], List[Answer], List[str], List[Dict[str, Any]]]] = None, dict_key: Optional[str] = None) -> Union[str, List[Document], List[Answer], List[str], List[Dict[str, Any]]]
| translate(results: List[Dict[str, Any]] = None, query: Optional[str] = None, documents: Optional[Union[List[Document], List[Answer], List[str], List[Dict[str, Any]]]] = None, dict_key: Optional[str] = None) -> Union[str, List[Document], List[Answer], List[str], List[Dict[str, Any]]]
```
Run the actual translation. You can supply a query or a list of documents. Whatever is supplied will be translated.
**Arguments**:
- `results`: Generated QA pairs to translate
- `query`: The query string to translate
- `documents`: The documents to translate
- `dict_key`: If you pass a dictionary in `documents`, you can specify here the field which shall be translated.
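The `results` path can be read as a batching step: the queries and the first answer of each generated pair are collected into one list, translated together, and written back by index. The following is a minimal sketch of that idea under stated assumptions, not the library's implementation. `FakeAnswer`, `translate_qa_pairs`, and `fake_translate` are hypothetical names introduced here for illustration, and the "translator" only upper-cases text so the example runs without a model.

```python
from copy import deepcopy
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class FakeAnswer:
    """Hypothetical stand-in for an answer object; only `answer` is modelled."""
    answer: str


def translate_qa_pairs(
    results: List[Dict[str, Any]],
    translate_batch: Callable[[List[str]], List[str]],
) -> List[Dict[str, Any]]:
    """Translate the query and the first answer of every QA pair in one batch."""
    translated = deepcopy(results)  # work on a copy so the caller's objects stay untouched
    queries = [r["query"] for r in translated]
    answers = [r["answers"][0].answer for r in translated]
    # One batch: all queries first, then all answers.
    texts = translate_batch(queries + answers)
    n = len(translated)
    for i, r in enumerate(translated):
        r["query"] = texts[i]
        r["answers"][0].answer = texts[n + i]
    return translated


def fake_translate(texts: List[str]) -> List[str]:
    """Hypothetical translator: upper-cases instead of calling a model."""
    return [t.upper() for t in texts]


pairs = [
    {"query": "who created python?", "answers": [FakeAnswer("guido van rossum")]},
    {"query": "when was it released?", "answers": [FakeAnswer("1991")]},
]
translated_pairs = translate_qa_pairs(pairs, fake_translate)
print(translated_pairs[0]["query"])
print(pairs[0]["query"])
```

Copying the input before writing the translations back is what keeps the original pairs in the source language; without the copy, the caller's answer objects would be modified in place.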

View File

@ -136,6 +136,37 @@ for idx, document in enumerate(tqdm(document_store)):
print_questions(result)
```
## Translated Question Answer Generation Pipeline
Trained models for Question Answer Generation are not available in many languages other than English. Haystack
provides a workaround for that issue by machine-translating a pipeline's inputs and outputs with the
TranslationWrapperPipeline. The following example generates German questions and answers on a German text
document - by using an English model for Question Answer Generation.
```python
# Fill the document store with a German document.
text1 = "Python ist eine interpretierte Hochsprachenprogrammiersprache für allgemeine Zwecke. Sie wurde von Guido van Rossum entwickelt und 1991 erstmals veröffentlicht. Die Design-Philosophie von Python legt den Schwerpunkt auf die Lesbarkeit des Codes und die Verwendung von viel Leerraum (Whitespace)."
docs = [{"content": text1}]
document_store.delete_documents()
document_store.write_documents(docs)
# Load machine translation models
from haystack.nodes import TransformersTranslator
in_translator = TransformersTranslator(model_name_or_path="Helsinki-NLP/opus-mt-de-en")
out_translator = TransformersTranslator(model_name_or_path="Helsinki-NLP/opus-mt-en-de")
# Wrap the previously defined QuestionAnswerGenerationPipeline
from haystack.pipelines import TranslationWrapperPipeline
pipeline_with_translation = TranslationWrapperPipeline(input_translator=in_translator,
output_translator=out_translator,
pipeline=qag_pipeline)
for idx, document in enumerate(tqdm(document_store)):
print(f"\n * Generating questions and answers for document {idx}: {document.content[:100]}...\n")
result = pipeline_with_translation.run(documents=[document])
print_questions(result)
```
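Conceptually, the wrapper translates the input into the model's language, runs the inner pipeline, and translates the output back. The toy sketch below illustrates that flow with plain functions and lookup tables. None of the names in it (`wrap_with_translation`, `toy_qa_generator`, `DE_TO_EN`, `EN_TO_DE`) are Haystack APIs; they are assumptions made for illustration only.

```python
from typing import Callable, Dict, List

# Toy lookup tables standing in for real translation models (illustrative only).
DE_TO_EN = {"Hallo Welt": "Hello world"}
EN_TO_DE = {"What is said?": "Was wird gesagt?", "Hello world": "Hallo Welt"}

QAPair = Dict[str, str]


def wrap_with_translation(
    inner: Callable[[List[str]], List[QAPair]],
    translate_in: Callable[[str], str],
    translate_out: Callable[[str], str],
) -> Callable[[List[str]], List[QAPair]]:
    """Return a function that translates inputs, runs `inner`, and translates outputs."""

    def run(documents: List[str]) -> List[QAPair]:
        english_docs = [translate_in(d) for d in documents]
        english_pairs = inner(english_docs)
        return [
            {"query": translate_out(p["query"]), "answer": translate_out(p["answer"])}
            for p in english_pairs
        ]

    return run


def toy_qa_generator(documents: List[str]) -> List[QAPair]:
    """Hypothetical English-only generator: one fixed question per document."""
    return [{"query": "What is said?", "answer": d} for d in documents]


german_pipeline = wrap_with_translation(
    toy_qa_generator,
    lambda text: DE_TO_EN.get(text, text),
    lambda text: EN_TO_DE.get(text, text),
)
german_pairs = german_pipeline(["Hallo Welt"])
print(german_pairs)
```

The inner generator only ever sees English text, which is why an English-only model can serve documents in another language.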
## About us
This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany

View File

@ -1,5 +1,5 @@
from typing import Any, Dict, List, Mapping, Optional, Union
from copy import deepcopy
from abc import abstractmethod
from haystack.nodes.base import BaseComponent
@ -15,6 +15,7 @@ class BaseTranslator(BaseComponent):
@abstractmethod
def translate(
self,
results: List[Dict[str, Any]] = None,
query: Optional[str] = None,
documents: Optional[Union[List[Document], List[Answer], List[str], List[Dict[str, Any]]]] = None,
dict_key: Optional[str] = None,
@ -26,30 +27,38 @@ class BaseTranslator(BaseComponent):
def run( # type: ignore
self,
results: List[Dict[str, Any]] = None,
query: Optional[str] = None,
documents: Optional[Union[List[Document], List[Answer], List[str], List[Dict[str, Any]]]] = None,
answers: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
dict_key: Optional[str] = None,
):
"""Method that gets executed when this class is used as a Node in a Haystack Pipeline"""
results = {}
translation_results = {}
if results is not None:
translation_results = {"results":deepcopy(results)}
translated_queries_answers = self.translate(results=translation_results["results"])
for i, result in enumerate(translation_results["results"]):
result["query"] = translated_queries_answers[i]
result["answers"][0].answer = translated_queries_answers[len(translation_results["results"])+i]
return translation_results, "output_1"
# This will cover input query stage
if query:
results["query"] = self.translate(query=query)
translation_results["query"] = self.translate(query=query) # type: ignore
# This will cover retriever and summarizer
if documents:
_dict_key = dict_key or "text"
results["documents"] = self.translate(documents=documents, dict_key=_dict_key)
translation_results["documents"] = self.translate(documents=documents, dict_key=_dict_key) # type: ignore
if answers:
_dict_key = dict_key or "answer"
if isinstance(answers, Mapping):
# This will cover reader
results["answers"] = self.translate(documents=answers["answers"], dict_key=_dict_key)
translation_results["answers"] = self.translate(documents=answers["answers"], dict_key=_dict_key) # type: ignore
else:
# This will cover generator
results["answers"] = self.translate(documents=answers, dict_key=_dict_key)
translation_results["answers"] = self.translate(documents=answers, dict_key=_dict_key) # type: ignore
return results, "output_1"
return translation_results, "output_1"

View File

@ -78,17 +78,24 @@ class TransformersTranslator(BaseTranslator):
def translate(
self,
results: List[Dict[str, Any]] = None,
query: Optional[str] = None,
documents: Optional[Union[List[Document], List[Answer], List[str], List[Dict[str, Any]]]] = None,
dict_key: Optional[str] = None,
) -> Union[str, List[Document], List[Answer], List[str], List[Dict[str, Any]]]:
"""
Run the actual translation. You can supply a query or a list of documents. Whatever is supplied will be translated.
:param results: Generated QA pairs to translate
:param query: The query string to translate
:param documents: The documents to translate
:param dict_key: If you pass a dictionary in `documents`, you can specify here the field which shall be translated.
"""
if not query and not documents:
queries_for_translator = None
answers_for_translator = None
if results is not None:
queries_for_translator = [result["query"] for result in results]
answers_for_translator = [result["answers"][0].answer for result in results]
if not query and not documents and results is None:
raise AttributeError("Translator needs a query, documents, or results to perform translation")
if query and documents:
@ -100,7 +107,10 @@ class TransformersTranslator(BaseTranslator):
dict_key = dict_key or "content"
if isinstance(documents, list):
if queries_for_translator is not None and answers_for_translator is not None:
text_for_translator = queries_for_translator + answers_for_translator
elif isinstance(documents, list):
if isinstance(documents[0], Document):
text_for_translator = [doc.content for doc in documents] # type: ignore
elif isinstance(documents[0], Answer):
@ -126,7 +136,9 @@ class TransformersTranslator(BaseTranslator):
clean_up_tokenization_spaces=self.clean_up_tokenization_spaces
)
if query:
if queries_for_translator is not None and answers_for_translator is not None:
return translated_texts
elif query:
return translated_texts[0]
elif documents:
if isinstance(documents, list) and isinstance(documents[0], str):

View File

@ -262,7 +262,56 @@
" result = qag_pipeline.run(documents=[document])\n",
" print_questions(result)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": true
}
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Translated Question Answer Generation Pipeline\n",
"Trained models for Question Answer Generation are not available in many languages other than English. Haystack\n",
"provides a workaround for that issue by machine-translating a pipeline's inputs and outputs with the\n",
"TranslationWrapperPipeline. The following example generates German questions and answers on a German text\n",
"document - by using an English model for Question Answer Generation."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Fill the document store with a German document.\n",
"text1 = \"Python ist eine interpretierte Hochsprachenprogrammiersprache für allgemeine Zwecke. Sie wurde von Guido van Rossum entwickelt und 1991 erstmals veröffentlicht. Die Design-Philosophie von Python legt den Schwerpunkt auf die Lesbarkeit des Codes und die Verwendung von viel Leerraum (Whitespace).\"\n",
"docs = [{\"content\": text1}]\n",
"document_store.delete_documents()\n",
"document_store.write_documents(docs)\n",
"\n",
"# Load machine translation models\n",
"from haystack.nodes import TransformersTranslator\n",
"in_translator = TransformersTranslator(model_name_or_path=\"Helsinki-NLP/opus-mt-de-en\")\n",
"out_translator = TransformersTranslator(model_name_or_path=\"Helsinki-NLP/opus-mt-en-de\")\n",
"\n",
"# Wrap the previously defined QuestionAnswerGenerationPipeline\n",
"from haystack.pipelines import TranslationWrapperPipeline\n",
"pipeline_with_translation = TranslationWrapperPipeline(input_translator=in_translator,\n",
" output_translator=out_translator,\n",
" pipeline=qag_pipeline)\n",
"\n",
"for idx, document in enumerate(tqdm(document_store)):\n",
" print(f\"\\n * Generating questions and answers for document {idx}: {document.content[:100]}...\\n\")\n",
" result = pipeline_with_translation.run(documents=[document])\n",
" print_questions(result)"
],
"metadata": {
"collapsed": false,
"pycharm": {

View File

@ -1,8 +1,7 @@
from tqdm import tqdm
from pprint import pprint
from haystack.nodes import QuestionGenerator, ElasticsearchRetriever, FARMReader
from haystack.nodes import QuestionGenerator, ElasticsearchRetriever, FARMReader, TransformersTranslator
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.pipelines import QuestionGenerationPipeline, RetrieverQuestionGenerationPipeline, QuestionAnswerGenerationPipeline
from haystack.pipelines import QuestionGenerationPipeline, RetrieverQuestionGenerationPipeline, QuestionAnswerGenerationPipeline, TranslationWrapperPipeline
from haystack.utils import launch_es, print_questions
"""
@ -10,74 +9,106 @@ This is a bare bones tutorial showing what is possible with the QuestionGenerato
questions which the model thinks can be answered by a given document.
"""
# Start Elasticsearch service via Docker
launch_es()
text1 = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace."
text2 = "Princess Arya Stark is the third child and second daughter of Lord Eddard Stark and his wife, Lady Catelyn Stark. She is the sister of the incumbent Westerosi monarchs, Sansa, Queen in the North, and Brandon, King of the Andals and the First Men. After narrowly escaping the persecution of House Stark by House Lannister, Arya is trained as a Faceless Man at the House of Black and White in Braavos, using her abilities to avenge her family. Upon her return to Westeros, she exacts retribution for the Red Wedding by exterminating the Frey male line."
text3 = "Dry Cleaning are an English post-punk band who formed in South London in 2018.[3] The band is composed of vocalist Florence Shaw, guitarist Tom Dowse, bassist Lewis Maynard and drummer Nick Buxton. They are noted for their use of spoken word primarily in lieu of sung vocals, as well as their unconventional lyrics. Their musical stylings have been compared to Wire, Magazine and Joy Division.[4] The band released their debut single, 'Magic of Meghan' in 2019. Shaw wrote the song after going through a break-up and moving out of her former partner's apartment the same day that Meghan Markle and Prince Harry announced they were engaged.[5] This was followed by the release of two EPs that year: Sweet Princess in August and Boundary Road Snacks and Drinks in October. The band were included as part of the NME 100 of 2020,[6] as well as DIY magazine's Class of 2020.[7] The band signed to 4AD in late 2020 and shared a new single, 'Scratchcard Lanyard'.[8] In February 2021, the band shared details of their debut studio album, New Long Leg. They also shared the single 'Strong Feelings'.[9] The album, which was produced by John Parish, was released on 2 April 2021.[10]"
def tutorial13_question_generation():
# Start Elasticsearch service via Docker
launch_es()
docs = [{"content": text1},
{"content": text2},
{"content": text3}]
text1 = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace."
text2 = "Princess Arya Stark is the third child and second daughter of Lord Eddard Stark and his wife, Lady Catelyn Stark. She is the sister of the incumbent Westerosi monarchs, Sansa, Queen in the North, and Brandon, King of the Andals and the First Men. After narrowly escaping the persecution of House Stark by House Lannister, Arya is trained as a Faceless Man at the House of Black and White in Braavos, using her abilities to avenge her family. Upon her return to Westeros, she exacts retribution for the Red Wedding by exterminating the Frey male line."
text3 = "Dry Cleaning are an English post-punk band who formed in South London in 2018.[3] The band is composed of vocalist Florence Shaw, guitarist Tom Dowse, bassist Lewis Maynard and drummer Nick Buxton. They are noted for their use of spoken word primarily in lieu of sung vocals, as well as their unconventional lyrics. Their musical stylings have been compared to Wire, Magazine and Joy Division.[4] The band released their debut single, 'Magic of Meghan' in 2019. Shaw wrote the song after going through a break-up and moving out of her former partner's apartment the same day that Meghan Markle and Prince Harry announced they were engaged.[5] This was followed by the release of two EPs that year: Sweet Princess in August and Boundary Road Snacks and Drinks in October. The band were included as part of the NME 100 of 2020,[6] as well as DIY magazine's Class of 2020.[7] The band signed to 4AD in late 2020 and shared a new single, 'Scratchcard Lanyard'.[8] In February 2021, the band shared details of their debut studio album, New Long Leg. They also shared the single 'Strong Feelings'.[9] The album, which was produced by John Parish, was released on 2 April 2021.[10]"
# Initialize document store and write in the documents
document_store = ElasticsearchDocumentStore()
document_store.write_documents(docs)
docs = [{"content": text1},
{"content": text2},
{"content": text3}]
# Initialize Question Generator
question_generator = QuestionGenerator()
# Initialize document store and write in the documents
document_store = ElasticsearchDocumentStore()
document_store.write_documents(docs)
"""
The most basic version of a question generator pipeline takes a document as input and outputs generated questions
which the document can answer.
"""
# Initialize Question Generator
question_generator = QuestionGenerator()
# QuestionGenerationPipeline
print("\nQuestionGenerationPipeline")
print("==========================")
"""
The most basic version of a question generator pipeline takes a document as input and outputs generated questions
which the document can answer.
"""
question_generation_pipeline = QuestionGenerationPipeline(question_generator)
for idx, document in enumerate(document_store):
print(f"\n * Generating questions for document {idx}: {document.content[:100]}...\n")
result = question_generation_pipeline.run(documents=[document])
# QuestionGenerationPipeline
print("\nQuestionGenerationPipeline")
print("==========================")
question_generation_pipeline = QuestionGenerationPipeline(question_generator)
for idx, document in enumerate(document_store):
print(f"\n * Generating questions for document {idx}: {document.content[:100]}...\n")
result = question_generation_pipeline.run(documents=[document])
print_questions(result)
"""
This pipeline takes a query as input. It retrieves relevant documents and then generates questions based on them.
"""
# RetrieverQuestionGenerationPipeline
print("\nRetrieverQuestionGenerationPipeline")
print("==================================")
retriever = ElasticsearchRetriever(document_store=document_store)
rqg_pipeline = RetrieverQuestionGenerationPipeline(retriever, question_generator)
print(f"\n * Generating questions for documents matching the query 'Arya Stark'\n")
result = rqg_pipeline.run(query="Arya Stark")
print_questions(result)
"""
This pipeline takes a query as input. It retrieves relevant documents and then generates questions based on them.
"""
# RetrieverQuestionGenerationPipeline
print("\nRetrieverQuestionGenerationPipeline")
print("==================================")
"""
This pipeline takes a document as input, generates questions on it, and attempts to answer these questions using
a Reader model
"""
retriever = ElasticsearchRetriever(document_store=document_store)
rqg_pipeline = RetrieverQuestionGenerationPipeline(retriever, question_generator)
# QuestionAnswerGenerationPipeline
print("\nQuestionAnswerGenerationPipeline")
print("===============================")
print(f"\n * Generating questions for documents matching the query 'Arya Stark'\n")
result = rqg_pipeline.run(query="Arya Stark")
print_questions(result)
reader = FARMReader("deepset/roberta-base-squad2")
qag_pipeline = QuestionAnswerGenerationPipeline(question_generator, reader)
for idx, document in enumerate(tqdm(document_store)):
print(f"\n * Generating questions and answers for document {idx}: {document.content[:100]}...\n")
result = qag_pipeline.run(documents=[document])
print_questions(result)
"""
This pipeline takes a document as input, generates questions on it, and attempts to answer these questions using
a Reader model
"""
"""
Trained models for Question Answer Generation are not available in many languages other than English.
Haystack provides a workaround for that issue by machine-translating a pipeline's inputs and outputs with the TranslationWrapperPipeline.
The following example generates German questions and answers on a German text document - by using an English model for Question Answer Generation.
"""
# QuestionAnswerGenerationPipeline
print("\nQuestionAnswerGenerationPipeline")
print("===============================")
# Fill the document store with a German document.
text1 = "Python ist eine interpretierte Hochsprachenprogrammiersprache für allgemeine Zwecke. Sie wurde von Guido van Rossum entwickelt und 1991 erstmals veröffentlicht. Die Design-Philosophie von Python legt den Schwerpunkt auf die Lesbarkeit des Codes und die Verwendung von viel Leerraum (Whitespace)."
docs = [{"content": text1}]
document_store.delete_documents()
document_store.write_documents(docs)
reader = FARMReader("deepset/roberta-base-squad2")
qag_pipeline = QuestionAnswerGenerationPipeline(question_generator, reader)
for idx, document in enumerate(tqdm(document_store)):
# Load machine translation models
in_translator = TransformersTranslator(model_name_or_path="Helsinki-NLP/opus-mt-de-en")
out_translator = TransformersTranslator(model_name_or_path="Helsinki-NLP/opus-mt-en-de")
print(f"\n * Generating questions and answers for document {idx}: {document.content[:100]}...\n")
result = qag_pipeline.run(documents=[document])
print_questions(result)
# Wrap the previously defined QuestionAnswerGenerationPipeline
pipeline_with_translation = TranslationWrapperPipeline(input_translator=in_translator,
output_translator=out_translator,
pipeline=qag_pipeline)
for idx, document in enumerate(tqdm(document_store)):
print(f"\n * Generating questions and answers for document {idx}: {document.content[:100]}...\n")
result = pipeline_with_translation.run(documents=[document])
print_questions(result)
if __name__ == "__main__":
tutorial13_question_generation()
# This Haystack script was made with love by deepset in Berlin, Germany
# Haystack: https://github.com/deepset-ai/haystack
# deepset: https://deepset.ai/