mirror of
https://github.com/deepset-ai/haystack.git
synced 2025-07-23 08:52:16 +00:00

394 lines
11 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# Generative QA with \"Retrieval-Augmented Generation\"\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial7_RAG_Generator.ipynb)\n",
    "\n",
    "While extractive QA highlights the span of text that answers a query,\n",
    "generative QA can return a novel text answer that it has composed.\n",
    "In this tutorial, you will learn how to set up a generative system using the\n",
    "[RAG model](https://arxiv.org/abs/2005.11401), which conditions the\n",
    "answer generator on a set of retrieved documents."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Prepare environment\n",
    "\n",
    "#### Colab: Enable the GPU runtime\n",
    "Make sure you enable the GPU runtime to experience decent speed in this tutorial.\n",
    "**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**\n",
    "\n",
    "<img src=\"https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg\">"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Make sure you have a GPU running\n",
    "!nvidia-smi"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Here are the packages and imports that we'll need:"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "!pip install grpcio-tools==1.34.1\n",
    "!pip install git+https://github.com/deepset-ai/haystack.git\n",
    "\n",
    "# If you run this notebook on Google Colab, you might need to\n",
    "# restart the runtime after installing haystack."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "from typing import List\n",
    "import requests\n",
    "import pandas as pd\n",
    "from haystack import Document\n",
    "from haystack.document_store.faiss import FAISSDocumentStore\n",
    "from haystack.generator.transformers import RAGenerator\n",
    "from haystack.retriever.dense import DensePassageRetriever"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Let's download a CSV file containing some sample text and preprocess the data.\n"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Download sample\n",
    "temp = requests.get(\"https://raw.githubusercontent.com/deepset-ai/haystack/master/tutorials/small_generator_dataset.csv\")\n",
    "with open(\"small_generator_dataset.csv\", \"wb\") as f:\n",
    "    f.write(temp.content)\n",
    "\n",
    "# Create dataframe with columns \"title\" and \"text\"\n",
    "df = pd.read_csv(\"small_generator_dataset.csv\", sep=\",\")\n",
    "# Minimal cleaning\n",
    "df.fillna(value=\"\", inplace=True)\n",
    "\n",
    "print(df.head())"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "We can cast our data into Haystack `Document` objects.\n",
    "Alternatively, we can also just use dictionaries with \"content\" and \"meta\" fields."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Use data to initialize Document objects\n",
    "titles = list(df[\"title\"].values)\n",
    "texts = list(df[\"text\"].values)\n",
    "documents: List[Document] = []\n",
    "for title, text in zip(titles, texts):\n",
    "    documents.append(\n",
    "        Document(\n",
    "            content=text,\n",
    "            meta={\n",
    "                \"name\": title or \"\"\n",
    "            }\n",
    "        )\n",
    "    )"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
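  {
   "cell_type": "markdown",
   "source": [
    "As a minimal sketch of the dictionary alternative mentioned above (assuming dicts use the same \"content\" and \"meta\" keys that `Document` objects use), the same data can be prepared without the `Document` class:"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Alternative sketch: plain dictionaries instead of Document objects.\n",
    "# Assumes \"content\" and \"meta\" keys mirror the Document fields used above.\n",
    "dicts = [\n",
    "    {\"content\": text, \"meta\": {\"name\": title or \"\"}}\n",
    "    for title, text in zip(titles, texts)\n",
    "]\n",
    "print(len(dicts))"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },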
  {
   "cell_type": "markdown",
   "source": [
    "Here we initialize the FAISSDocumentStore, DensePassageRetriever and RAGenerator.\n",
    "FAISS is chosen here since it is optimized for vector storage and similarity search."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Initialize FAISS document store.\n",
    "# Set `return_embedding` to `True` so the generator doesn't have to re-embed the retrieved documents\n",
    "document_store = FAISSDocumentStore(\n",
    "    faiss_index_factory_str=\"Flat\",\n",
    "    return_embedding=True\n",
    ")\n",
    "\n",
    "# Initialize DPR Retriever to encode documents, encode question and query documents\n",
    "retriever = DensePassageRetriever(\n",
    "    document_store=document_store,\n",
    "    query_embedding_model=\"facebook/dpr-question_encoder-single-nq-base\",\n",
    "    passage_embedding_model=\"facebook/dpr-ctx_encoder-single-nq-base\",\n",
    "    use_gpu=True,\n",
    "    embed_title=True,\n",
    ")\n",
    "\n",
    "# Initialize RAG Generator\n",
    "generator = RAGenerator(\n",
    "    model_name_or_path=\"facebook/rag-token-nq\",\n",
    "    use_gpu=True,\n",
    "    top_k=1,\n",
    "    max_length=200,\n",
    "    min_length=2,\n",
    "    embed_title=True,\n",
    "    num_beams=2,\n",
    ")"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "We write documents to the DocumentStore: first we delete any existing documents, then we call `write_documents()`.\n",
    "The `update_embeddings()` method uses the retriever to create an embedding for each document.\n"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Delete existing documents in document store\n",
    "document_store.delete_documents()\n",
    "\n",
    "# Write documents to document store\n",
    "document_store.write_documents(documents)\n",
    "\n",
    "# Add document embeddings to index\n",
    "document_store.update_embeddings(\n",
    "    retriever=retriever\n",
    ")"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Here are our questions:"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "QUESTIONS = [\n",
    "    \"who got the first nobel prize in physics\",\n",
    "    \"when is the next deadpool movie being released\",\n",
    "    \"which mode is used for short wave broadcast service\",\n",
    "    \"who is the owner of reading football club\",\n",
    "    \"when is the next scandal episode coming out\",\n",
    "    \"when is the last time the philadelphia won the superbowl\",\n",
    "    \"what is the most current adobe flash player version\",\n",
    "    \"how many episodes are there in dragon ball z\",\n",
    "    \"what is the first step in the evolution of the eye\",\n",
    "    \"where is gall bladder situated in human body\",\n",
    "    \"what is the main mineral in lithium batteries\",\n",
    "    \"who is the president of usa right now\",\n",
    "    \"where do the greasers live in the outsiders\",\n",
    "    \"panda is a national animal of which country\",\n",
    "    \"what is the name of manchester united stadium\",\n",
    "]"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Now let's run our system!\n",
    "The retriever will pick out a small subset of documents that it finds relevant.\n",
    "These are used to condition the generator as it generates the answer.\n",
    "It should then return novel text that forms an answer to your question!"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Now generate an answer for each question\n",
    "for question in QUESTIONS:\n",
    "    # Retrieve related documents from retriever\n",
    "    retriever_results = retriever.retrieve(\n",
    "        query=question\n",
    "    )\n",
    "\n",
    "    # Now generate answer from question and retrieved documents\n",
    "    predicted_result = generator.predict(\n",
    "        query=question,\n",
    "        documents=retriever_results,\n",
    "        top_k=1\n",
    "    )\n",
    "\n",
    "    # Print your answer\n",
    "    answers = predicted_result[\"answers\"]\n",
    "    print(f\"Generated answer is '{answers[0].answer}' for the question = '{question}'\")"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Or alternatively use the Pipeline class\n",
    "from haystack.pipeline import GenerativeQAPipeline\n",
    "\n",
    "pipe = GenerativeQAPipeline(generator=generator, retriever=retriever)\n",
    "for question in QUESTIONS:\n",
    "    res = pipe.run(query=question, params={\"Generator\": {\"top_k\": 1}, \"Retriever\": {\"top_k\": 5}})\n",
    "    print(res)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
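  {
   "cell_type": "markdown",
   "source": [
    "Optionally, the raw pipeline output can be pretty-printed. This is a sketch, not part of the original tutorial: it assumes the `print_answers` helper in `haystack.utils` is available in your installed version."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Optional sketch: assumes haystack.utils.print_answers exists in your version\n",
    "from haystack.utils import print_answers\n",
    "\n",
    "res = pipe.run(query=QUESTIONS[0], params={\"Generator\": {\"top_k\": 1}, \"Retriever\": {\"top_k\": 5}})\n",
    "print_answers(res, details=\"minimum\")"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },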
  {
   "cell_type": "markdown",
   "source": [
    "## About us\n",
    "\n",
    "This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany\n",
    "\n",
    "We bring NLP to the industry via open source!  \n",
    "Our focus: Industry specific language models & large scale QA systems.  \n",
    "  \n",
    "Some of our other work:  \n",
    "- [German BERT](https://deepset.ai/german-bert)\n",
    "- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)\n",
    "- [FARM](https://github.com/deepset-ai/FARM)\n",
    "\n",
    "Get in touch:\n",
    "[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)\n",
    "\n",
    "By the way: [we're hiring!](https://apply.workable.com/deepset/)"
   ],
   "metadata": {
    "collapsed": false
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}