haystack/tutorials/Tutorial12_LFQA.ipynb
bogdankostic 834f8c4902
Change return types of indexing pipeline nodes (#2342)
* Change return types of file converters

* Change return types of preprocessor

* Change return types of crawler

* Adapt utils to functions to new return types

* Adapt __init__.py to new method names

* Prevent circular imports

* Update Documentation & Code Style

* Let DocStores' run method accept Documents

* Adapt tests to new return types

* Update Documentation & Code Style

* Put "# type: ignore" to right place

* Remove id_hash_keys property from Document primitive

* Update Documentation & Code Style

* Adapt tests to new return types and missing id_hash_keys property

* Fix mypy

* Fix mypy

* Adapt PDFToTextOCRConverter

* Remove id_hash_keys from RestAPI tests

* Update Documentation & Code Style

* Rename tests

* Remove redundant setting of content_type="text"

* Add DeprecationWarning

* Add id_hash_keys to elasticsearch_index_to_document_store

* Change document type from dict to Document in PreProcessor test

* Fix file path in Tutorial 5

* Remove added output in Tutorial 5

* Update Documentation & Code Style

* Fix file_paths in Tutorial 9 + fix gz files in fetch_archive_from_http

* Adapt tutorials to new return types

* Adapt tutorial 14 to new return types

* Update Documentation & Code Style

* Change assertions to HaystackErrors

* Import HaystackError correctly

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-29 13:53:35 +02:00


{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "bEH-CRbeA6NU"
},
"source": [
"# Long-Form Question Answering\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial12_LFQA.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3K27Y5FbA6NV"
},
"source": [
"### Prepare environment\n",
"\n",
"#### Colab: Enable the GPU runtime\n",
"Make sure you enable the GPU runtime to experience decent speed in this tutorial. \n",
"**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg\">"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "JlZgP8q1A6NW"
},
"outputs": [],
"source": [
"# Make sure you have a GPU running\n",
"!nvidia-smi"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "NM36kbRFA6Nc"
},
"outputs": [],
"source": [
"# Install the latest release of Haystack in your own environment\n",
"#! pip install farm-haystack\n",
"\n",
"# Install the latest master of Haystack\n",
"!pip install --upgrade pip\n",
"!pip install -q git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,faiss]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xmRuhTQ7A6Nh"
},
"outputs": [],
"source": [
"from haystack.utils import convert_files_to_docs, fetch_archive_from_http, clean_wiki_text\n",
"from haystack.nodes import Seq2SeqGenerator"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "q3dSo7ZtA6Nl"
},
"source": [
"### Document Store\n",
"\n",
"FAISS is a library for efficient similarity search and clustering of dense vectors.\n",
"The `FAISSDocumentStore` uses a SQL database (in-memory SQLite by default) under the hood\n",
"to store the document text and other metadata. The vector embeddings of the text are\n",
"indexed in a FAISS index that is later queried when searching for answers.\n",
"The default flavour of `FAISSDocumentStore` is \"Flat\", but it can also be set to \"HNSW\" for\n",
"faster search at the expense of some accuracy. Just set the `faiss_index_factory_str` argument in the constructor.\n",
"For more info on which index type suits your use case: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1cYgDJmrA6Nv",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"from haystack.document_stores import FAISSDocumentStore\n",
"\n",
"document_store = FAISSDocumentStore(embedding_dim=128, faiss_index_factory_str=\"Flat\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "06LatTJBA6N0",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Cleaning & indexing documents\n",
"\n",
"As in the previous tutorials, we download, convert and index some Game of Thrones articles into our DocumentStore."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "iqKnu6wxA6N1",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Let's first get some files that we want to use\n",
"doc_dir = \"data/tutorial12\"\n",
"s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt12.zip\"\n",
"fetch_archive_from_http(url=s3_url, output_dir=doc_dir)\n",
"\n",
"# Convert files to Document objects\n",
"docs = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)\n",
"\n",
"# Now, let's write the Documents to our DB.\n",
"document_store.write_documents(docs)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wgjedxx_A6N6"
},
"source": [
"### Initialize Retriever and Reader/Generator\n",
"\n",
"#### Retriever\n",
"\n",
"We use a `DensePassageRetriever` and invoke `update_embeddings` to index the embeddings of the documents in the `FAISSDocumentStore`.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "kFwiPP60A6N7",
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"from haystack.nodes import DensePassageRetriever\n",
"\n",
"retriever = DensePassageRetriever(\n",
" document_store=document_store,\n",
" query_embedding_model=\"vblagoje/dpr-question_encoder-single-lfqa-wiki\",\n",
" passage_embedding_model=\"vblagoje/dpr-ctx_encoder-single-lfqa-wiki\",\n",
")\n",
"\n",
"document_store.update_embeddings(retriever)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sMlVEnJ2NkZZ"
},
"source": [
"Before we rely on the `DensePassageRetriever`, let's test it empirically to make sure a simple search indeed finds the relevant documents."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qpu-t9rndgpe"
},
"outputs": [],
"source": [
"from haystack.utils import print_documents\n",
"from haystack.pipelines import DocumentSearchPipeline\n",
"\n",
"p_retrieval = DocumentSearchPipeline(retriever)\n",
"res = p_retrieval.run(query=\"Tell me something about Arya Stark?\", params={\"Retriever\": {\"top_k\": 10}})\n",
"print_documents(res, max_text_len=512)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rnVR28OXA6OA"
},
"source": [
"#### Reader/Generator\n",
"\n",
"As in the previous tutorials, we now initialize our reader/generator.\n",
"\n",
"Here we use a `Seq2SeqGenerator` with the *vblagoje/bart_lfqa* model (see: https://huggingface.co/vblagoje/bart_lfqa).\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "fyIuWVwhA6OB"
},
"outputs": [],
"source": [
"generator = Seq2SeqGenerator(model_name_or_path=\"vblagoje/bart_lfqa\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "unhLD18yA6OF"
},
"source": [
"### Pipeline\n",
"\n",
"With a Haystack `Pipeline` you can combine your building blocks into a search pipeline.\n",
"Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.\n",
"To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `GenerativeQAPipeline` that combines a retriever and a reader/generator to answer our questions.\n",
"You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "TssPQyzWA6OG"
},
"outputs": [],
"source": [
"from haystack.pipelines import GenerativeQAPipeline\n",
"\n",
"pipe = GenerativeQAPipeline(generator, retriever)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bXlBBxKXA6OL"
},
"source": [
"## Voilà! Ask a question!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Zi97Hif2A6OM"
},
"outputs": [],
"source": [
"pipe.run(\n",
" query=\"How did Arya Stark's character get portrayed in a television adaptation?\", params={\"Retriever\": {\"top_k\": 3}}\n",
")"
]
},
{
"cell_type": "code",
"source": [
"pipe.run(query=\"Why is Arya Stark an unusual character?\", params={\"Retriever\": {\"top_k\": 3}})"
],
"metadata": {
"id": "IfTP9BfFGOo6"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"id": "i88KdOc2wUXQ"
},
"source": [
"## About us\n",
"\n",
"This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany.\n",
"\n",
"We bring NLP to the industry via open source!\n",
"Our focus: industry-specific language models & large-scale QA systems.\n",
"\n",
"Some of our other work:\n",
"- [German BERT](https://deepset.ai/german-bert)\n",
"- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)\n",
"- [FARM](https://github.com/deepset-ai/FARM)\n",
"\n",
"Get in touch:\n",
"[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)\n",
"\n",
"By the way: [we're hiring!](https://www.deepset.ai/jobs)"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"collapsed_sections": [],
"name": "LFQA_via_Haystack.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}