mirror of
https://github.com/deepset-ai/haystack.git
synced 2025-07-23 17:00:41 +00:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# Question Answering on a Knowledge Graph\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial10_Knowledge_Graph.ipynb)\n",
"\n",
"Haystack allows storing and querying knowledge graphs with the help of pre-trained models that translate text queries to SPARQL queries.\n",
"This tutorial demonstrates how to load an existing knowledge graph into Haystack, load a pre-trained retriever, and execute text queries on the knowledge graph.\n",
"Training models that translate text queries into SPARQL queries is currently not supported."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Install the latest release of Haystack in your own environment\n",
"#! pip install farm-haystack\n",
"\n",
"# Install the latest master of Haystack\n",
"!pip install --upgrade pip\n",
"!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,inmemorygraph]"
]
},
{
"cell_type": "markdown",
"source": [
"## Logging\n",
"\n",
"We configure how logging messages should be displayed and which log level should be used before importing Haystack.\n",
"Example log message:\n",
"INFO - haystack.utils.preprocessing - Converting data/tutorial1/218_Olenna_Tyrell.txt\n",
"The default log level in basicConfig is WARNING, so the explicit level parameter is not strictly necessary, but it can be changed easily:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"import logging\n",
"\n",
"logging.basicConfig(format=\"%(levelname)s - %(name)s - %(message)s\", level=logging.WARNING)\n",
"logging.getLogger(\"haystack\").setLevel(logging.INFO)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Here are some imports that we'll need\n",
"\n",
"import subprocess\n",
"import time\n",
"from pathlib import Path\n",
"\n",
"from haystack.nodes import Text2SparqlRetriever\n",
"from haystack.document_stores import InMemoryKnowledgeGraph\n",
"from haystack.utils import fetch_archive_from_http"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Downloading Knowledge Graph and Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Let's first fetch some triples that we want to store in our knowledge graph\n",
"# Here: exemplary triples from the wizarding world\n",
"graph_dir = \"data/tutorial10\"\n",
"s3_url = \"https://fandom-qa.s3-eu-west-1.amazonaws.com/triples_and_config.zip\"\n",
"fetch_archive_from_http(url=s3_url, output_dir=graph_dir)\n",
"\n",
"# Fetch a pre-trained BART model that translates text queries to SPARQL queries\n",
"model_dir = \"../saved_models/tutorial10_knowledge_graph/\"\n",
"s3_url = \"https://fandom-qa.s3-eu-west-1.amazonaws.com/saved_models/hp_v3.4.zip\"\n",
"fetch_archive_from_http(url=s3_url, output_dir=model_dir)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize a knowledge graph and load data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Currently, Haystack supports two alternative implementations for knowledge graphs:\n",
"* simple InMemoryKnowledgeGraph (based on RDFLib in-memory store)\n",
"* GraphDBKnowledgeGraph, which runs on GraphDB."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### InMemoryKnowledgeGraph"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The last triple stored in the knowledge graph is: {'s': {'type': 'uri', 'value': 'https://deepset.ai/harry_potter/Harry_potter'}, 'p': {'type': 'uri', 'value': 'https://deepset.ai/harry_potter/family'}, 'o': {'type': 'uri', 'value': 'https://deepset.ai/harry_potter/Dudley_dursleys_children'}}\n",
"There are 118543 triples stored in the knowledge graph.\n"
]
}
],
"source": [
"# Initialize an in-memory knowledge graph and use \"tutorial_10_index\" as the name of the index\n",
"kg = InMemoryKnowledgeGraph(index=\"tutorial_10_index\")\n",
"\n",
"# Delete the index as it might have been already created in previous runs\n",
"kg.delete_index()\n",
"\n",
"# Create the index\n",
"kg.create_index()\n",
"\n",
"# Import triples of subject, predicate, and object statements from a ttl file\n",
"kg.import_from_ttl_file(index=\"tutorial_10_index\", path=Path(graph_dir) / \"triples.ttl\")\n",
"print(f\"The last triple stored in the knowledge graph is: {kg.get_all_triples()[-1]}\")\n",
"print(f\"There are {len(kg.get_all_triples())} triples stored in the knowledge graph.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"jp-MarkdownHeadingCollapsed": true,
"tags": []
},
"source": [
"### GraphDBKnowledgeGraph (alternative)"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"#### Launching a GraphDB instance"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# # Unfortunately, there seems to be no good way to run GraphDB in colab environments\n",
"# # In your local environment, you could start a GraphDB server with docker\n",
"# # Feel free to check GraphDB's website for the free version https://www.ontotext.com/products/graphdb/graphdb-free/\n",
"# import os\n",
"\n",
"# LAUNCH_GRAPHDB = os.environ.get(\"LAUNCH_GRAPHDB\", False)\n",
"\n",
"# if LAUNCH_GRAPHDB:\n",
"#     print(\"Starting GraphDB ...\")\n",
"#     status = subprocess.run(\n",
"#         [\n",
"#             \"docker run -d -p 7200:7200 --name graphdb-instance-tutorial docker-registry.ontotext.com/graphdb-free:9.4.1-adoptopenjdk11\"\n",
"#         ],\n",
"#         shell=True,\n",
"#     )\n",
"#     if status.returncode:\n",
"#         raise Exception(\n",
"#             \"Failed to launch GraphDB. Maybe it is already running or you already have a container with that name that you could start?\"\n",
"#         )\n",
"#     time.sleep(5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"#### Creating a new GraphDB repository (also known as an index in Haystack's document stores)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# from haystack.document_stores import GraphDBKnowledgeGraph\n",
"\n",
"# # Initialize a knowledge graph connected to GraphDB and use \"tutorial_10_index\" as the name of the index\n",
"# kg = GraphDBKnowledgeGraph(index=\"tutorial_10_index\")\n",
"\n",
"# # Delete the index as it might have been already created in previous runs\n",
"# kg.delete_index()\n",
"\n",
"# # Create the index based on a configuration file\n",
"# kg.create_index(config_path=Path(graph_dir) / \"repo-config.ttl\")\n",
"\n",
"# # Import triples of subject, predicate, and object statements from a ttl file\n",
"# kg.import_from_ttl_file(index=\"tutorial_10_index\", path=Path(graph_dir) / \"triples.ttl\")\n",
"# print(f\"The last triple stored in the knowledge graph is: {kg.get_all_triples()[-1]}\")\n",
"# print(f\"There are {len(kg.get_all_triples())} triples stored in the knowledge graph.\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# # Define prefixes for names of resources so that we can use shorter resource names in queries\n",
"# prefixes = \"\"\"PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n",
"# PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n",
"# PREFIX hp: <https://deepset.ai/harry_potter/>\n",
"# \"\"\"\n",
"# kg.prefixes = prefixes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the pre-trained retriever"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Load a pre-trained model that translates text queries to SPARQL queries\n",
"kgqa_retriever = Text2SparqlRetriever(knowledge_graph=kg, model_name_or_path=Path(model_dir) / \"hp_v3.4\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Query Execution\n",
"\n",
"We can now ask questions that will be answered by our knowledge graph!\n",
"One limitation though: our pre-trained model can only translate questions about resources it has seen during training.\n",
"Otherwise, it cannot translate the name of a resource to the identifier used in the knowledge graph.\n",
"E.g. \"Harry\" -> \"hp:Harry_potter\""
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Translating the text query \"In which house is Harry Potter?\" to a SPARQL query and executing it on the knowledge graph...\n",
"[{'answer': ['https://deepset.ai/harry_potter/Gryffindor'], 'prediction_meta': {'model': 'Text2SparqlRetriever', 'sparql_query': 'select ?a { hp:Harry_potter hp:house ?a . }'}}]\n",
"Executing a SPARQL query with prefixed names of resources...\n",
"(['https://deepset.ai/harry_potter/Rubeus_hagrid', 'https://deepset.ai/harry_potter/Ogg'], 'select distinct ?sbj where { ?sbj hp:job hp:Keeper_of_keys_and_grounds . }')\n",
"Executing a SPARQL query with full names of resources...\n",
"(['https://deepset.ai/harry_potter/Otter'], 'select distinct ?obj where { <https://deepset.ai/harry_potter/Hermione_granger> <https://deepset.ai/harry_potter/patronus> ?obj . }')\n"
]
}
],
"source": [
"query = \"In which house is Harry Potter?\"\n",
"print(f'Translating the text query \"{query}\" to a SPARQL query and executing it on the knowledge graph...')\n",
"result = kgqa_retriever.retrieve(query=query)\n",
"print(result)\n",
"# Correct SPARQL query: select ?a { hp:Harry_potter hp:house ?a . }\n",
"# Correct answer: Gryffindor\n",
"\n",
"print(\"Executing a SPARQL query with prefixed names of resources...\")\n",
"result = kgqa_retriever._query_kg(\n",
"    sparql_query=\"select distinct ?sbj where { ?sbj hp:job hp:Keeper_of_keys_and_grounds . }\"\n",
")\n",
"print(result)\n",
"# Paraphrased question: Who is the keeper of keys and grounds?\n",
"# Correct answer: Rubeus Hagrid\n",
"\n",
"print(\"Executing a SPARQL query with full names of resources...\")\n",
"result = kgqa_retriever._query_kg(\n",
"    sparql_query=\"select distinct ?obj where { <https://deepset.ai/harry_potter/Hermione_granger> <https://deepset.ai/harry_potter/patronus> ?obj . }\"\n",
")\n",
"print(result)\n",
"# Paraphrased question: What is the patronus of Hermione?\n",
"# Correct answer: Otter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## About us\n",
"\n",
"This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany\n",
"\n",
"We bring NLP to the industry via open source! \n",
"Our focus: Industry specific language models & large scale QA systems. \n",
" \n",
"Some of our other work: \n",
"- [German BERT](https://deepset.ai/german-bert)\n",
"- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)\n",
"- [FARM](https://github.com/deepset-ai/FARM)\n",
"\n",
"Get in touch:\n",
"[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)\n",
"\n",
"By the way: [we're hiring!](https://www.deepset.ai/jobs)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
},
"vscode": {
"interpreter": {
"hash": "d6fc774dec8e6d4d8b6a5562b41269a570ea5456d1c03f28da35966a9134f033"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}