haystack/tutorials/Tutorial10_Knowledge_Graph.ipynb
Sara Zan 735ffa635b
[CI refactoring] Tutorials on CI (#2547)
2022-06-15 09:53:36 +02:00


{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# Question Answering on a Knowledge Graph\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial10_Knowledge_Graph.ipynb)\n",
"\n",
"Haystack allows storing and querying knowledge graphs with the help of pre-trained models that translate text queries to SPARQL queries.\n",
"This tutorial demonstrates how to load an existing knowledge graph into Haystack, load a pre-trained retriever, and execute text queries on the knowledge graph.\n",
"The training of models that translate text queries into SPARQL queries is currently not supported."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Install the latest release of Haystack in your own environment\n",
"#! pip install farm-haystack\n",
"\n",
"# Install the latest master of Haystack\n",
"!pip install --upgrade pip\n",
"!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,graphdb]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Here are some imports that we'll need\n",
"\n",
"import subprocess\n",
"import time\n",
"from pathlib import Path\n",
"\n",
"from haystack.nodes import Text2SparqlRetriever\n",
"from haystack.document_stores import GraphDBKnowledgeGraph\n",
"from haystack.utils import fetch_archive_from_http"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Downloading Knowledge Graph and Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Let's first fetch some triples that we want to store in our knowledge graph\n",
"# Here: exemplary triples from the wizarding world\n",
"graph_dir = \"data/tutorial10\"\n",
"s3_url = \"https://fandom-qa.s3-eu-west-1.amazonaws.com/triples_and_config.zip\"\n",
"fetch_archive_from_http(url=s3_url, output_dir=graph_dir)\n",
"\n",
"# Fetch a pre-trained BART model that translates text queries to SPARQL queries\n",
"model_dir = \"../saved_models/tutorial10_knowledge_graph/\"\n",
"s3_url = \"https://fandom-qa.s3-eu-west-1.amazonaws.com/saved_models/hp_v3.4.zip\"\n",
"fetch_archive_from_http(url=s3_url, output_dir=model_dir)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Launching a GraphDB instance"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Unfortunately, there seems to be no good way to run GraphDB in colab environments\n",
"# In your local environment, you could start a GraphDB server with docker\n",
"# Feel free to check GraphDB's website for the free version https://www.ontotext.com/products/graphdb/graphdb-free/\n",
"import os\n",
"\n",
"# Note that environment variables are strings, so a plain os.environ.get() would treat the value \"False\" as truthy\n",
"LAUNCH_GRAPHDB = os.environ.get(\"LAUNCH_GRAPHDB\", \"\").lower() in (\"1\", \"true\", \"yes\")\n",
"\n",
"if LAUNCH_GRAPHDB:\n",
" print(\"Starting GraphDB ...\")\n",
" status = subprocess.run(\n",
" [\n",
" \"docker run -d -p 7200:7200 --name graphdb-instance-tutorial docker-registry.ontotext.com/graphdb-free:9.4.1-adoptopenjdk11\"\n",
" ],\n",
" shell=True,\n",
" )\n",
" if status.returncode:\n",
" raise Exception(\n",
" \"Failed to launch GraphDB. It may already be running, or a container with that name may already exist (in which case you could start it instead).\"\n",
" )\n",
" time.sleep(5)"
]
},
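{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"As an optional sanity check (not part of the original tutorial), we can verify that the GraphDB instance is reachable before continuing. This is a minimal sketch that assumes GraphDB is running locally on its default port 7200 and exposes its standard REST API; it uses the `requests` library, which is available in Colab environments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Optional sanity check: ping GraphDB's REST API (assumes the default port 7200)\n",
"import requests\n",
"\n",
"if LAUNCH_GRAPHDB:\n",
"    try:\n",
"        response = requests.get(\"http://localhost:7200/rest/repositories\", timeout=5)\n",
"        print(f\"GraphDB is reachable, status code: {response.status_code}\")\n",
"    except requests.exceptions.ConnectionError:\n",
"        print(\"GraphDB is not reachable yet - it may still be starting up\")"
]
},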
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Creating a new GraphDB repository (also known as index in Haystack's document stores)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Initialize a knowledge graph connected to GraphDB and use \"tutorial_10_index\" as the name of the index\n",
"kg = GraphDBKnowledgeGraph(index=\"tutorial_10_index\")\n",
"\n",
"# Delete the index as it might have been already created in previous runs\n",
"kg.delete_index()\n",
"\n",
"# Create the index based on a configuration file\n",
"kg.create_index(config_path=Path(graph_dir) / \"repo-config.ttl\")\n",
"\n",
"# Import triples of subject, predicate, and object statements from a ttl file\n",
"kg.import_from_ttl_file(index=\"tutorial_10_index\", path=Path(graph_dir) / \"triples.ttl\")\n",
"print(f\"The last triple stored in the knowledge graph is: {kg.get_all_triples()[-1]}\")\n",
"print(f\"There are {len(kg.get_all_triples())} triples stored in the knowledge graph.\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Define prefixes for names of resources so that we can use shorter resource names in queries\n",
"prefixes = \"\"\"PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n",
"PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n",
"PREFIX hp: <https://deepset.ai/harry_potter/>\n",
"\"\"\"\n",
"kg.prefixes = prefixes\n",
"\n",
"# Load a pre-trained model that translates text queries to SPARQL queries\n",
"kgqa_retriever = Text2SparqlRetriever(knowledge_graph=kg, model_name_or_path=Path(model_dir) / \"hp_v3.4\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Query Execution\n",
"\n",
"We can now ask questions that will be answered by our knowledge graph!\n",
"One limitation though: our pre-trained model can only generate queries about resources it has seen during training.\n",
"For any other resource, it cannot translate the resource's name to the identifier used in the knowledge graph,\n",
"e.g. \"Harry\" -> \"hp:Harry_potter\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"query = \"In which house is Harry Potter?\"\n",
"print(f'Translating the text query \"{query}\" to a SPARQL query and executing it on the knowledge graph...')\n",
"result = kgqa_retriever.retrieve(query=query)\n",
"print(result)\n",
"# Correct SPARQL query: select ?a { hp:Harry_potter hp:house ?a . }\n",
"# Correct answer: Gryffindor\n",
"\n",
"print(\"Executing a SPARQL query with prefixed names of resources...\")\n",
"result = kgqa_retriever._query_kg(\n",
" sparql_query=\"select distinct ?sbj where { ?sbj hp:job hp:Keeper_of_keys_and_grounds . }\"\n",
")\n",
"print(result)\n",
"# Paraphrased question: Who is the keeper of keys and grounds?\n",
"# Correct answer: Rubeus Hagrid\n",
"\n",
"print(\"Executing a SPARQL query with full names of resources...\")\n",
"result = kgqa_retriever._query_kg(\n",
" sparql_query=\"select distinct ?obj where { <https://deepset.ai/harry_potter/Hermione_granger> <https://deepset.ai/harry_potter/patronus> ?obj . }\"\n",
")\n",
"print(result)\n",
"# Paraphrased question: What is the patronus of Hermione?\n",
"# Correct answer: Otter"
]
},
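{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"As a small extension (not part of the original tutorial), we can run the questions paraphrased above through the retriever in a single loop. All three questions reuse facts already shown in this tutorial:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Ask several text queries in a row; these reuse facts shown earlier in this tutorial\n",
"questions = [\n",
"    \"In which house is Harry Potter?\",\n",
"    \"Who is the keeper of keys and grounds?\",\n",
"    \"What is the patronus of Hermione?\",\n",
"]\n",
"for question in questions:\n",
"    print(f\"Question: {question}\")\n",
"    print(f\"Answer: {kgqa_retriever.retrieve(query=question)}\")"
]
},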
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## About us\n",
"\n",
"This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany\n",
"\n",
"We bring NLP to the industry via open source! \n",
"Our focus: Industry specific language models & large scale QA systems. \n",
" \n",
"Some of our other work: \n",
"- [German BERT](https://deepset.ai/german-bert)\n",
"- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)\n",
"- [FARM](https://github.com/deepset-ai/FARM)\n",
"\n",
"Get in touch:\n",
"[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)\n",
"\n",
"By the way: [we're hiring!](https://www.deepset.ai/jobs)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}