mirror of
https://github.com/deepset-ai/haystack.git
synced 2025-07-22 08:21:24 +00:00

* Files moved, imports all broken * Fix most imports and docstrings into * Fix the paths to the modules in the API docs * Add latest docstring and tutorial changes * Add a few pipelines that were lost in the inports * Fix a bunch of mypy warnings * Add latest docstring and tutorial changes * Create a file_classifier module * Add docs for file_classifier * Fixed most circular imports, now the REST API can start * Add latest docstring and tutorial changes * Tackling more mypy issues * Reintroduce from FARM and fix last mypy issues hopefully * Re-enable old-style imports * Fix some more import from the top-level package in an attempt to sort out circular imports * Fix some imports in tests to new-style to prevent failed class equalities from breaking tests * Change document_store into document_stores * Update imports in tutorials * Add latest docstring and tutorial changes * Probably fixes summarizer tests * Improve the old-style import allowing module imports (should work) * Try to fix the docs * Remove dedicated KnowledgeGraph page from autodocs * Remove dedicated GraphRetriever page from autodocs * Fix generate_docstrings.sh with an updated list of yaml files to look for * Fix some more modules in the docs * Fix the document stores docs too * Fix a small issue on Tutorial14 * Add latest docstring and tutorial changes * Add deprecation warning to old-style imports * Remove stray folder and import Dict into dense.py * Change import path for MLFlowLogger * Add old loggers path to the import path aliases * Fix debug output of convert_ipynb.py * Fix circular import on BaseRetriever * Missed one merge block * re-run tutorial 5 * Fix imports in tutorial 5 * Re-enable squad_to_dpr CLI from the root package and move get_batches_from_generator into document_stores.base * Add latest docstring and tutorial changes * Fix typo in utils __init__ * Fix a few more imports * Fix benchmarks too * New-style imports in test_knowledge_graph * Rollback setup.py * Rollback squad_to_dpr 
too Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
346 lines
9.9 KiB
Plaintext
{
 "nbformat": 4,
 "nbformat_minor": 2,
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "name": "LFQA_via_Haystack.ipynb",
   "provenance": [],
   "collapsed_sections": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# Long-Form Question Answering\n",
    "\n",
    "[Open in Colab](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial12_LFQA.ipynb)"
   ],
   "metadata": {
    "id": "bEH-CRbeA6NU"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Prepare environment\n",
    "\n",
    "#### Colab: Enable the GPU runtime\n",
    "Make sure you enable the GPU runtime to experience decent speed in this tutorial. \n",
    "**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**\n",
    "\n",
    "<img src=\"https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg\">"
   ],
   "metadata": {
    "id": "3K27Y5FbA6NV"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Make sure you have a GPU running\n",
    "!nvidia-smi"
   ],
   "outputs": [],
   "metadata": {
    "id": "JlZgP8q1A6NW"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Install the latest master of Haystack\n",
    "!pip install git+https://github.com/deepset-ai/haystack.git\n",
    "\n",
    "# If you run this notebook on Google Colab, you might need to\n",
    "# restart the runtime after installing haystack."
   ],
   "outputs": [],
   "metadata": {
    "id": "NM36kbRFA6Nc"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "from haystack.utils import convert_files_to_dicts, fetch_archive_from_http, clean_wiki_text\n",
    "from haystack.nodes import Seq2SeqGenerator"
   ],
   "outputs": [],
   "metadata": {
    "id": "xmRuhTQ7A6Nh"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Document Store\n",
    "\n",
    "FAISS is a library for efficient similarity search and clustering of dense vectors.\n",
    "The `FAISSDocumentStore` uses a SQL database (SQLite in-memory by default) under the hood\n",
    "to store the document text and other metadata. The vector embeddings of the text are\n",
    "indexed in a FAISS index that is later queried when searching for answers.\n",
    "The default flavour of `FAISSDocumentStore` is \"Flat\", but it can also be set to \"HNSW\" for\n",
    "faster search at the expense of some accuracy. Just set the `faiss_index_factory_str` argument in the constructor.\n",
    "For more info on which index suits your use case, see: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index"
   ],
   "metadata": {
    "id": "q3dSo7ZtA6Nl"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "from haystack.document_stores import FAISSDocumentStore\n",
    "\n",
    "document_store = FAISSDocumentStore(vector_dim=128, faiss_index_factory_str=\"Flat\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "1cYgDJmrA6Nv",
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
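  {
   "cell_type": "markdown",
   "source": [
    "The \"HNSW\" flavour mentioned in the Document Store section could be selected roughly as sketched in the next cell.\n",
    "This is only an illustration, not a recommendation: it assumes the same `vector_dim=128` still applies because the embedding model is unchanged.\n",
    "The line is left commented out so that running the notebook from top to bottom keeps the \"Flat\" store created above; use it in place of that line if you want to try the approximate index."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Sketch only: approximate HNSW index instead of the exact \"Flat\" index.\n",
    "# Assumption: vector_dim stays at 128 because the embedding model does not change.\n",
    "# Use this INSTEAD of the \"Flat\" store above, not in addition to it.\n",
    "# document_store = FAISSDocumentStore(vector_dim=128, faiss_index_factory_str=\"HNSW\")"
   ],
   "outputs": [],
   "metadata": {}
  },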
  {
   "cell_type": "markdown",
   "source": [
    "### Cleaning & indexing documents\n",
    "\n",
    "As in the previous tutorials, we download, convert, and index some Game of Thrones articles in our DocumentStore."
   ],
   "metadata": {
    "id": "06LatTJBA6N0",
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Let's first get some files that we want to use\n",
    "doc_dir = \"data/article_txt_got\"\n",
    "s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip\"\n",
    "fetch_archive_from_http(url=s3_url, output_dir=doc_dir)\n",
    "\n",
    "# Convert files to dicts\n",
    "dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)\n",
    "\n",
    "# Now, let's write the dicts containing documents to our DB.\n",
    "document_store.write_documents(dicts)"
   ],
   "outputs": [],
   "metadata": {
    "id": "iqKnu6wxA6N1",
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Initialize Retriever and Reader/Generator\n",
    "\n",
    "#### Retriever\n",
    "\n",
    "**Here:** We use an `EmbeddingRetriever` with the RetriBERT model (`yjernite/retribert-base-uncased`) and call `update_embeddings` on the document store to index the document embeddings in the `FAISSDocumentStore`.\n",
    "\n"
   ],
   "metadata": {
    "id": "wgjedxx_A6N6"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "from haystack.nodes import EmbeddingRetriever\n",
    "\n",
    "retriever = EmbeddingRetriever(document_store=document_store,\n",
    "                               embedding_model=\"yjernite/retribert-base-uncased\",\n",
    "                               model_format=\"retribert\")\n",
    "\n",
    "document_store.update_embeddings(retriever)"
   ],
   "outputs": [],
   "metadata": {
    "id": "kFwiPP60A6N7",
    "pycharm": {
     "is_executing": true
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Before we blindly use the retriever, let's test it empirically to make sure a simple search indeed finds the relevant documents."
   ],
   "metadata": {
    "id": "sMlVEnJ2NkZZ"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "from haystack.utils import print_documents\n",
    "from haystack.pipelines import DocumentSearchPipeline\n",
    "\n",
    "p_retrieval = DocumentSearchPipeline(retriever)\n",
    "res = p_retrieval.run(\n",
    "    query=\"Tell me something about Arya Stark?\",\n",
    "    params={\"Retriever\": {\"top_k\": 10}}\n",
    ")\n",
    "print_documents(res, max_text_len=512)\n"
   ],
   "outputs": [],
   "metadata": {
    "id": "qpu-t9rndgpe"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "#### Reader/Generator\n",
    "\n",
    "As in the previous tutorials, we now initialize our reader/generator.\n",
    "\n",
    "Here we use a `Seq2SeqGenerator` with the *yjernite/bart_eli5* model (see: https://huggingface.co/yjernite/bart_eli5)\n",
    "\n"
   ],
   "metadata": {
    "id": "rnVR28OXA6OA"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "generator = Seq2SeqGenerator(model_name_or_path=\"yjernite/bart_eli5\")"
   ],
   "outputs": [],
   "metadata": {
    "id": "fyIuWVwhA6OB"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Pipeline\n",
    "\n",
    "With a Haystack `Pipeline`, you can combine your building blocks into a search pipeline.\n",
    "Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.\n",
    "To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `GenerativeQAPipeline`, which combines a retriever and a reader/generator to answer our questions.\n",
    "You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd)."
   ],
   "metadata": {
    "id": "unhLD18yA6OF"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "from haystack.pipelines import GenerativeQAPipeline\n",
    "pipe = GenerativeQAPipeline(generator, retriever)"
   ],
   "outputs": [],
   "metadata": {
    "id": "TssPQyzWA6OG"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Voilà! Ask a question!"
   ],
   "metadata": {
    "id": "bXlBBxKXA6OL"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "pipe.run(\n",
    "    query=\"Why did Arya Stark's character get portrayed in a television adaptation?\",\n",
    "    params={\"Retriever\": {\"top_k\": 1}}\n",
    ")"
   ],
   "outputs": [],
   "metadata": {
    "id": "Zi97Hif2A6OM"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "pipe.run(query=\"What kind of character does Arya Stark play?\", params={\"Retriever\": {\"top_k\": 1}})"
   ],
   "outputs": [],
   "metadata": {
    "id": "zvHb8SvMblw9"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## About us\n",
    "\n",
    "This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany.\n",
    "\n",
    "We bring NLP to the industry via open source!\n",
    "Our focus: industry-specific language models and large-scale QA systems.\n",
    "\n",
    "Some of our other work:\n",
    "- [German BERT](https://deepset.ai/german-bert)\n",
    "- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)\n",
    "- [FARM](https://github.com/deepset-ai/FARM)\n",
    "\n",
    "Get in touch:\n",
    "[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)\n",
    "\n",
    "By the way: [we're hiring!](https://www.deepset.ai/jobs)"
   ],
   "metadata": {
    "collapsed": false
   }
  }
 ]
}