{
"cells": [
{
"cell_type": "markdown",
"id": "0",
"metadata": {},
"source": [
"# GraphRAG API Demo\n",
"\n",
"This notebook is written as an advanced tutorial/demonstration on how to use the GraphRAG solution accelerator API. It builds on top of the concepts covered in the `1-Quickstart` notebook."
]
},
{
"cell_type": "markdown",
"id": "1",
"metadata": {},
"source": [
"## Existing APIs\n",
"\n",
"| HTTP Method | Endpoint |\n",
"|-------------|------------------|\n",
"| GET | /data |\n",
"| POST | /data |\n",
"| DELETE | /data/{storage_name} |\n",
"| GET | /index |\n",
"| POST | /index |\n",
"| DELETE | /index/{index_name} |\n",
"| GET | /index/status/{index_name} |\n",
"| POST | /query/global |\n",
"| POST | /query/local |\n",
"| GET | /index/config/prompts |\n",
"| GET | /source/report/{index_name}/{report_id} |\n",
"| GET | /source/text/{index_name}/{text_unit_id} |\n",
"| GET | /source/entity/{index_name}/{entity_id} |\n",
"| GET | /source/claim/{index_name}/{claim_id} |\n",
"| GET | /source/relationship/{index_name}/{relationship_id} |\n",
"| GET | /graph/graphml/{index_name} |"
]
},
{
"cell_type": "markdown",
"id": "2",
"metadata": {},
"source": [
"## Prerequisites\n",
"Install third-party packages that are not part of the Python Standard Library."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3",
"metadata": {},
"outputs": [],
"source": [
"! pip install devtools pandas requests tqdm"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4",
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import json\n",
"import sys\n",
"import time\n",
"from pathlib import Path\n",
"\n",
"import pandas as pd\n",
"import requests\n",
"from devtools import pprint\n",
"from tqdm import tqdm"
]
},
{
"cell_type": "markdown",
"id": "5",
"metadata": {},
"source": [
"## (REQUIRED) User Configuration\n",
"Set the API subscription key, API base endpoint, and some file directory names that will be referenced later in this notebook."
]
},
{
"cell_type": "markdown",
"id": "6",
"metadata": {},
"source": [
"#### API subscription key\n",
"\n",
"APIM supports multiple forms of authentication and access control (e.g. managed identity). For this notebook demonstration, we will use a **[subscription key](https://learn.microsoft.com/en-us/azure/api-management/api-management-subscriptions)**. To locate this key, visit the Azure Portal. The subscription key can be found under `<my_resource_group> --> <API Management service> --> <APIs> --> <Subscriptions> --> <Built-in all-access subscription> Primary Key`. For multiple API users, individual subscription keys can be generated."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ocp_apim_subscription_key = getpass.getpass(\n",
"    \"Enter the subscription key to the GraphRag APIM:\"\n",
")\n",
"\n",
"\"\"\"\n",
"\"Ocp-Apim-Subscription-Key\":\n",
"    This is a custom HTTP header used by Azure API Management service (APIM) to\n",
"    authenticate API requests. The value for this key should be set to the subscription\n",
"    key provided by the Azure APIM instance in your GraphRAG resource group.\n",
"\"\"\"\n",
"headers = {\"Ocp-Apim-Subscription-Key\": ocp_apim_subscription_key}"
]
},
{
"cell_type": "markdown",
"id": "8",
"metadata": {},
"source": [
"#### Setup directories and API endpoint\n",
"\n",
"For demonstration purposes, please use the provided `get-wiki-articles.py` script to download a small set of Wikipedia articles, or provide your own data (GraphRAG requires .txt files to be UTF-8 encoded)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"\"\"\"\n",
"These parameters must be defined by the notebook user:\n",
"\n",
"- file_directory: a local directory of text files. The file structure should be flat,\n",
"  with no nested directories. (i.e. file_directory/file1.txt, file_directory/file2.txt, etc.)\n",
"- storage_name: a unique name to identify a blob storage container in Azure where files\n",
"  from `file_directory` will be uploaded.\n",
"- index_name: a unique name to identify a single graphrag knowledge graph index.\n",
"  Note: Multiple indexes may be created from the same `storage_name` blob storage container.\n",
"- endpoint: the base/endpoint URL for the GraphRAG API (this is the Gateway URL found in the APIM resource).\n",
"\"\"\"\n",
"\n",
"file_directory = \"\"\n",
"storage_name = \"\"\n",
"index_name = \"\"\n",
"endpoint = \"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "10",
"metadata": {},
"outputs": [],
"source": [
"assert (\n",
"    file_directory != \"\" and storage_name != \"\" and index_name != \"\" and endpoint != \"\"\n",
")"
]
},
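{
"cell_type": "markdown",
"id": "55",
"metadata": {},
"source": [
"#### (Optional) Verify connectivity\n",
"\n",
"Before continuing, it can be helpful to sanity-check the subscription key and endpoint. The cell below is a minimal sketch that calls the `GET /data` endpoint from the API table above: a `200` response means the configuration works, while a `401` or `404` usually indicates a misconfigured key or gateway URL."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "56",
"metadata": {},
"outputs": [],
"source": [
"# minimal connectivity check against the GET /data endpoint\n",
"response = requests.get(url=endpoint + \"/data\", headers=headers)\n",
"print(f\"Status: {response.status_code} ({response.reason})\")\n",
"if not response.ok:\n",
"    print(response.text)"
]
},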
{
"cell_type": "markdown",
"id": "11",
"metadata": {},
"source": [
"### Helper Functions\n",
"\n",
"For convenience, the helper functions below encapsulate the HTTP requests, making interaction with each API endpoint more intuitive."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "12",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def upload_files(\n",
"    file_directory: str,\n",
"    container_name: str,\n",
"    batch_size: int = 100,\n",
"    overwrite: bool = True,\n",
"    max_retries: int = 5,\n",
") -> requests.Response | list[Path]:\n",
"    \"\"\"\n",
"    Upload files to a blob storage container.\n",
"\n",
"    Args:\n",
"        file_directory - a local directory of .txt files to upload. All files must be in utf-8 encoding.\n",
"        container_name - a unique name for the Azure storage container.\n",
"        batch_size - the number of files to upload in a single batch.\n",
"        overwrite - whether or not to overwrite files if they already exist in the storage container.\n",
"        max_retries - the maximum number of times to retry uploading a batch of files if the API is busy.\n",
"\n",
"    NOTE: Uploading files may sometimes fail if the blob container was recently deleted\n",
"    (i.e. a few seconds beforehand). In practice, the solution is to sleep a few seconds and try again.\n",
"    \"\"\"\n",
"    url = endpoint + \"/data\"\n",
"\n",
"    def upload_batch(\n",
"        files: list, container_name: str, overwrite: bool, max_retries: int\n",
"    ) -> requests.Response:\n",
"        for _ in range(max_retries):\n",
"            response = requests.post(\n",
"                url=url,\n",
"                files=files,\n",
"                params={\"container_name\": container_name, \"overwrite\": overwrite},\n",
"                headers=headers,\n",
"            )\n",
"            # API may be busy, retry\n",
"            if response.status_code == 500:\n",
"                print(\"API busy. Sleeping and will try again.\")\n",
"                time.sleep(10)\n",
"                continue\n",
"            return response\n",
"        return response\n",
"\n",
"    batch_files = []\n",
"    filepaths = list(Path(file_directory).iterdir())\n",
"    for file in tqdm(filepaths):\n",
"        # skip anything that is not a regular file\n",
"        if not file.is_file():\n",
"            print(f\"Skipping invalid file: {file}\")\n",
"            continue\n",
"        batch_files.append(\n",
"            (\"files\", open(file=file, mode=\"rb\"))\n",
"        )\n",
"        # upload batch of files\n",
"        if len(batch_files) == batch_size:\n",
"            response = upload_batch(batch_files, container_name, overwrite, max_retries)\n",
"            # if response is not ok, return early\n",
"            if not response.ok:\n",
"                return response\n",
"            batch_files.clear()\n",
"    # upload last batch of remaining files\n",
"    if len(batch_files) > 0:\n",
"        response = upload_batch(batch_files, container_name, overwrite, max_retries)\n",
"    return response\n",
"\n",
"\n",
"def delete_files(container_name: str) -> requests.Response:\n",
"    \"\"\"Delete an azure storage container that holds raw data.\"\"\"\n",
"    url = endpoint + f\"/data/{container_name}\"\n",
"    return requests.delete(url=url, headers=headers)\n",
"\n",
"\n",
"def list_files() -> requests.Response:\n",
"    \"\"\"Get a list of all azure storage containers that hold raw data.\"\"\"\n",
"    url = endpoint + \"/data\"\n",
"    return requests.get(url=url, headers=headers)\n",
"\n",
"\n",
"def build_index(\n",
"    storage_name: str,\n",
"    index_name: str,\n",
"    entity_extraction_prompt: str = None,\n",
"    entity_summarization_prompt: str = None,\n",
"    community_summarization_prompt: str = None,\n",
") -> requests.Response:\n",
"    \"\"\"Build a graphrag index.\n",
"    This function submits a job that builds a graphrag index (i.e. a knowledge graph) from data files located in a blob storage container.\n",
"    \"\"\"\n",
"    url = endpoint + \"/index\"\n",
"    prompts = dict()\n",
"    if entity_extraction_prompt:\n",
"        prompts[\"entity_extraction_prompt\"] = entity_extraction_prompt\n",
"    if entity_summarization_prompt:\n",
"        prompts[\"summarize_descriptions_prompt\"] = entity_summarization_prompt\n",
"    if community_summarization_prompt:\n",
"        prompts[\"community_report_prompt\"] = community_summarization_prompt\n",
"    return requests.post(\n",
"        url,\n",
"        files=prompts if len(prompts) > 0 else None,\n",
"        params={\n",
"            \"index_container_name\": index_name,\n",
"            \"storage_container_name\": storage_name,\n",
"        },\n",
"        headers=headers,\n",
"    )\n",
"\n",
"\n",
"def delete_index(container_name: str) -> requests.Response:\n",
"    \"\"\"Delete an azure storage container that holds a search index.\"\"\"\n",
"    url = endpoint + f\"/index/{container_name}\"\n",
"    return requests.delete(url, headers=headers)\n",
"\n",
"\n",
"def list_indexes() -> list | requests.Response:\n",
"    \"\"\"Get a list of all azure storage containers that hold search indexes.\"\"\"\n",
"    url = endpoint + \"/index\"\n",
"    response = requests.get(url, headers=headers)\n",
"    try:\n",
"        indexes = json.loads(response.text)\n",
"        return indexes[\"index_name\"]\n",
"    except json.JSONDecodeError:\n",
"        print(response.text)\n",
"        return response\n",
"\n",
"\n",
"def index_status(container_name: str) -> requests.Response:\n",
"    \"\"\"Get the status of a specific index.\"\"\"\n",
"    url = endpoint + f\"/index/status/{container_name}\"\n",
"    return requests.get(url, headers=headers)\n",
"\n",
"\n",
"def global_search(\n",
"    index_name: str | list[str], query: str, community_level: int\n",
") -> requests.Response:\n",
"    \"\"\"Run a global query over the knowledge graph(s) associated with one or more indexes.\"\"\"\n",
"    url = endpoint + \"/query/global\"\n",
"    # optional parameter: community level to query the graph at (default for global query = 1)\n",
"    request = {\n",
"        \"index_name\": index_name,\n",
"        \"query\": query,\n",
"        \"community_level\": community_level,\n",
"    }\n",
"    return requests.post(url, json=request, headers=headers)\n",
"\n",
"\n",
"def global_search_streaming(\n",
"    index_name: str | list[str], query: str, community_level: int\n",
") -> requests.Response:\n",
"    raise NotImplementedError(\"this functionality has been temporarily removed\")\n",
"    \"\"\"Run a global query across one or more indexes and stream back the response.\"\"\"\n",
"    url = endpoint + \"/query/streaming/global\"\n",
"    # optional parameter: community level to query the graph at (default for global query = 1)\n",
"    request = {\n",
"        \"index_name\": index_name,\n",
"        \"query\": query,\n",
"        \"community_level\": community_level,\n",
"    }\n",
"    context_list = []\n",
"    with requests.post(url, json=request, headers=headers, stream=True) as r:\n",
"        r.raise_for_status()\n",
"        for chunk in r.iter_lines(chunk_size=256 * 1024, decode_unicode=True):\n",
"            try:\n",
"                payload = json.loads(chunk)\n",
"                token = payload[\"token\"]\n",
"                context = payload[\"context\"]\n",
"                if token != \"<EOM>\":\n",
"                    print(token, end=\"\")\n",
"                elif (token == \"<EOM>\") and not context:\n",
"                    print(\"\\n\")  # transition from output message to context\n",
"                else:\n",
"                    context_list.append(context)\n",
"            except json.JSONDecodeError:\n",
"                print(type(chunk), len(chunk), sys.getsizeof(chunk), chunk, end=\"\\n\")\n",
"    display(pd.DataFrame.from_dict(context_list[0][\"reports\"]).head(10))\n",
"\n",
"\n",
"def local_search(\n",
"    index_name: str | list[str], query: str, community_level: int\n",
") -> requests.Response:\n",
"    \"\"\"Run a local query over the knowledge graph(s) associated with one or more indexes.\"\"\"\n",
"    url = endpoint + \"/query/local\"\n",
"    # optional parameter: community level to query the graph at (default for local query = 2)\n",
"    request = {\n",
"        \"index_name\": index_name,\n",
"        \"query\": query,\n",
"        \"community_level\": community_level,\n",
"    }\n",
"    return requests.post(url, json=request, headers=headers)\n",
"\n",
"\n",
"def local_search_streaming(\n",
"    index_name: str | list[str], query: str, community_level: int\n",
") -> requests.Response:\n",
"    raise NotImplementedError(\"this functionality has been temporarily removed\")\n",
"    \"\"\"Run a local query across one or more indexes and stream back the response.\"\"\"\n",
"    url = endpoint + \"/query/streaming/local\"\n",
"    # optional parameter: community level to query the graph at (default for local query = 2)\n",
"    request = {\n",
"        \"index_name\": index_name,\n",
"        \"query\": query,\n",
"        \"community_level\": community_level,\n",
"    }\n",
"    context_list = []\n",
"    with requests.post(url, json=request, headers=headers, stream=True) as r:\n",
"        r.raise_for_status()\n",
"        for chunk in r.iter_lines(chunk_size=256 * 1024, decode_unicode=True):\n",
"            try:\n",
"                payload = json.loads(chunk)\n",
"                token = payload[\"token\"]\n",
"                context = payload[\"context\"]\n",
"                if token != \"<EOM>\":\n",
"                    print(token, end=\"\")\n",
"                elif (token == \"<EOM>\") and not context:\n",
"                    print(\"\\n\")  # transition from output message to context\n",
"                else:\n",
"                    context_list.append(context)\n",
"            except json.JSONDecodeError:\n",
"                print(type(chunk), len(chunk), sys.getsizeof(chunk), chunk, end=\"\\n\")\n",
"    for key in context_list[0].keys():\n",
"        display(pd.DataFrame.from_dict(context_list[0][key]).head(10))\n",
"\n",
"\n",
"def save_graphml_file(index_name: str, graphml_file_name: str) -> None:\n",
"    \"\"\"Retrieve and save a graphml file that represents the knowledge graph.\n",
"    The file is downloaded in chunks and saved to the local file system.\n",
"    \"\"\"\n",
"    url = endpoint + f\"/graph/graphml/{index_name}\"\n",
"    if Path(graphml_file_name).suffix != \".graphml\":\n",
"        raise ValueError(f\"{graphml_file_name} must have a .graphml file extension\")\n",
"    with requests.get(url, headers=headers, stream=True) as r:\n",
"        r.raise_for_status()\n",
"        with open(graphml_file_name, \"wb\") as f:\n",
"            for chunk in r.iter_content(chunk_size=1024):\n",
"                f.write(chunk)\n",
"\n",
"\n",
"def get_report(index_name: str, report_id: str) -> requests.Response:\n",
"    \"\"\"Retrieve a report generated by GraphRAG for a specific index.\"\"\"\n",
"    url = endpoint + f\"/source/report/{index_name}/{report_id}\"\n",
"    return requests.get(url, headers=headers)\n",
"\n",
"\n",
"def get_entity(index_name: str, entity_id: str) -> requests.Response:\n",
"    \"\"\"Retrieve an entity generated by GraphRAG for a specific index.\"\"\"\n",
"    url = endpoint + f\"/source/entity/{index_name}/{entity_id}\"\n",
"    return requests.get(url, headers=headers)\n",
"\n",
"\n",
"def get_relationship(index_name: str, relationship_id: str) -> requests.Response:\n",
"    \"\"\"Retrieve a relationship generated by GraphRAG for a specific index.\"\"\"\n",
"    url = endpoint + f\"/source/relationship/{index_name}/{relationship_id}\"\n",
"    return requests.get(url, headers=headers)\n",
"\n",
"\n",
"def get_claim(index_name: str, claim_id: str) -> requests.Response:\n",
"    \"\"\"Retrieve a claim/covariate generated by GraphRAG for a specific index.\"\"\"\n",
"    url = endpoint + f\"/source/claim/{index_name}/{claim_id}\"\n",
"    return requests.get(url, headers=headers)\n",
"\n",
"\n",
"def get_text_unit(index_name: str, text_unit_id: str) -> requests.Response:\n",
"    \"\"\"Retrieve a text unit generated by GraphRAG for a specific index.\"\"\"\n",
"    url = endpoint + f\"/source/text/{index_name}/{text_unit_id}\"\n",
"    return requests.get(url, headers=headers)\n",
"\n",
"\n",
"def parse_query_response(\n",
"    response: requests.Response, return_context_data: bool = False\n",
") -> requests.Response | dict[str, list[dict]]:\n",
"    \"\"\"\n",
"    Print the response['result'] value and optionally\n",
"    return the associated context data.\n",
"    \"\"\"\n",
"    if response.ok:\n",
"        print(json.loads(response.text)[\"result\"])\n",
"        if return_context_data:\n",
"            return json.loads(response.text)[\"context_data\"]\n",
"        return response\n",
"    else:\n",
"        print(response.reason)\n",
"        print(response.content)\n",
"        return response\n",
"\n",
"\n",
"def generate_prompts(container_name: str, limit: int = 1) -> requests.Response:\n",
"    \"\"\"Generate graphrag prompts using data provided in a specific storage container.\"\"\"\n",
"    url = endpoint + \"/index/config/prompts\"\n",
"    params = {\"container_name\": container_name, \"limit\": limit}\n",
"    return requests.get(url, params=params, headers=headers)"
]
},
{
"cell_type": "markdown",
"id": "13",
"metadata": {},
"source": [
"## Upload files\n",
"\n",
"Use the API to upload a collection of files. **Multiple filetypes are now supported via the [MarkItDown](https://github.com/microsoft/markitdown) library.** The API will create a new blob storage container to host these files. For a set of large files, consider reducing the batch upload size to avoid overwhelming the API endpoint and to prevent out-of-memory problems."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "14",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"response = upload_files(\n",
"    file_directory=file_directory,\n",
"    container_name=storage_name,\n",
"    batch_size=100,\n",
"    overwrite=True,\n",
")\n",
"if not response.ok:\n",
"    print(response.text)\n",
"else:\n",
"    print(response)"
]
},
{
"cell_type": "markdown",
"id": "15",
"metadata": {},
"source": [
"#### To list all existing data storage containers:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "16",
"metadata": {},
"outputs": [],
"source": [
"response = list_files()\n",
"print(response)\n",
"if response.ok:\n",
"    pprint(response.json())\n",
"else:\n",
"    pprint(response.text)"
]
},
{
"cell_type": "markdown",
"id": "17",
"metadata": {},
"source": [
"#### To remove files from the GraphRAG service:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "18",
"metadata": {},
"outputs": [],
"source": [
"# # uncomment this cell to delete the data container\n",
"# response = delete_files(storage_name)\n",
"# print(response)\n",
"# pprint(response.text)"
]
},
{
"cell_type": "markdown",
"id": "19",
"metadata": {},
"source": [
"## Auto-Template Generation (Optional)\n",
"\n",
"GraphRAG constructs a knowledge graph from data based on the ability to identify entities and the relationships between them. To improve the quality of the knowledge graph constructed by GraphRAG over private data, we provide a feature called \"Automatic Templating\". This capability takes user-provided data samples and generates custom-tailored prompts based on characteristics of that data. These custom prompts contain few-shot examples of entities and relationships, which can then be used to build a graphrag index."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20",
"metadata": {},
"outputs": [],
"source": [
"auto_template_response = generate_prompts(container_name=storage_name, limit=1)\n",
"if auto_template_response.ok:\n",
"    prompts = auto_template_response.json()\n",
"else:\n",
"    print(auto_template_response.text)"
]
},
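{
"cell_type": "markdown",
"id": "57",
"metadata": {},
"source": [
"The generated prompts are returned in the JSON response body. The sketch below writes them to a local `prompts` directory, using the response keys consumed later in this notebook (`entity_extraction_prompt`, `entity_summarization_prompt`, `community_summarization_prompt`); adjust the key-to-filename mapping if your API version returns different keys."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "58",
"metadata": {},
"outputs": [],
"source": [
"# save the generated prompts to a local 'prompts' directory\n",
"# (the key-to-filename mapping below is an assumption based on the keys used later in this notebook)\n",
"if auto_template_response.ok:\n",
"    prompt_dir = Path(\"prompts\")\n",
"    prompt_dir.mkdir(exist_ok=True)\n",
"    filenames = {\n",
"        \"entity_extraction_prompt\": \"entity_extraction.txt\",\n",
"        \"entity_summarization_prompt\": \"summarize_descriptions.txt\",\n",
"        \"community_summarization_prompt\": \"community_report.txt\",\n",
"    }\n",
"    for key, filename in filenames.items():\n",
"        if key in prompts:\n",
"            (prompt_dir / filename).write_text(prompts[key], encoding=\"utf-8\")\n",
"            print(f\"Saved {prompt_dir / filename}\")"
]
},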
{
"cell_type": "markdown",
"id": "21",
"metadata": {},
"source": [
"After running the previous cell, a new local directory (`prompts`) will be created. Please look at the prompts (`prompts/entity_extraction.txt`, `prompts/community_report.txt`, and `prompts/summarize_descriptions.txt`) that were generated from the user-provided data. Users are encouraged to spend some time inspecting and modifying these prompts, taking into account characteristics of their data and the kind of knowledge they wish to extract and model with graphrag."
]
},
{
"cell_type": "markdown",
"id": "22",
"metadata": {},
"source": [
"## Build an Index\n",
"\n",
"After data files have been uploaded and (optionally) custom prompts have been generated, it is time to construct a knowledge graph by building an index. If custom prompts are not provided (as demonstrated in the `1-Quickstart` notebook), built-in default prompts are used, which we find generally work well."
]
},
{
"cell_type": "markdown",
"id": "23",
"metadata": {},
"source": [
"#### Start a new indexing job"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "24",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# check if custom prompts were generated\n",
"if \"auto_template_response\" in locals() and auto_template_response.ok:\n",
"    entity_extraction_prompt = prompts[\"entity_extraction_prompt\"]\n",
"    community_summarization_prompt = prompts[\"community_summarization_prompt\"]\n",
"    summarize_description_prompt = prompts[\"entity_summarization_prompt\"]\n",
"else:\n",
"    entity_extraction_prompt = community_summarization_prompt = summarize_description_prompt = None\n",
"\n",
"response = build_index(\n",
"    storage_name=storage_name,\n",
"    index_name=index_name,\n",
"    entity_extraction_prompt=entity_extraction_prompt,\n",
"    community_summarization_prompt=community_summarization_prompt,\n",
"    entity_summarization_prompt=summarize_description_prompt,\n",
")\n",
"if response.ok:\n",
"    pprint(response.json())\n",
"else:\n",
"    print(f\"Failed to submit job.\\nStatus: {response.text}\")"
]
},
{
"cell_type": "markdown",
"id": "25",
"metadata": {},
"source": [
"Note: indexing jobs are submitted to a queue to run. A cronjob checks every 5 minutes to schedule new jobs if possible.\n",
"\n",
"An indexing job can sometimes fail due to insufficient TPM quota of the Azure OpenAI model. In this situation, an indexing job can be restarted by re-running the cell above with the same parameters. `graphrag` caches previous indexing results as a cost-savings measure, so a restarted indexing job will effectively \"pick up\" where it left off."
]
},
{
"cell_type": "markdown",
"id": "26",
"metadata": {},
"source": [
"#### Check index job status\n",
"\n",
"Please wait for the index to reach 100 percent completion before continuing on to the next section (running queries). You may rerun the next cell multiple times to monitor status. Note: the indexing speed of graphrag is directly correlated to the TPM quota of the Azure OpenAI model you are using."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "27",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"response = index_status(index_name)\n",
"print(response)\n",
"if response.ok:\n",
"    pprint(response.json())\n",
"else:\n",
"    print(response.text)"
]
},
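{
"cell_type": "markdown",
"id": "59",
"metadata": {},
"source": [
"Alternatively, instead of rerunning the cell above by hand, the status endpoint can be polled on a timer. The sketch below assumes the status response JSON carries `status` and `percent_complete` fields (visible in the output of the previous cell) and stops once the job completes or fails."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "60",
"metadata": {},
"outputs": [],
"source": [
"# poll the status endpoint until the indexing job finishes\n",
"# NOTE: assumes the response JSON contains 'status' and 'percent_complete' fields\n",
"while True:\n",
"    response = index_status(index_name)\n",
"    if not response.ok:\n",
"        print(response.text)\n",
"        break\n",
"    status = response.json()\n",
"    print(f\"status: {status.get('status')} - {status.get('percent_complete')}% complete\")\n",
"    if status.get(\"status\") in (\"complete\", \"failed\"):\n",
"        break\n",
"    time.sleep(30)  # jobs are scheduled from a queue every few minutes, so poll slowly"
]
},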
{
"cell_type": "markdown",
"id": "28",
"metadata": {},
"source": [
"#### List indexes\n",
"To view a list of all indexes that exist in the GraphRAG service:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "29",
"metadata": {},
"outputs": [],
"source": [
"all_indexes = list_indexes()\n",
"pprint(all_indexes)"
]
},
{
"cell_type": "markdown",
"id": "30",
"metadata": {},
"source": [
"#### Delete an indexing job\n",
"If an index is no longer needed, remove it from the GraphRAG service."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "31",
"metadata": {},
"outputs": [],
"source": [
"# # uncomment this cell to delete an index\n",
"# response = delete_index(index_name)\n",
"# print(response)\n",
"# pprint(response.json())"
]
},
{
"cell_type": "markdown",
"id": "32",
"metadata": {},
"source": [
"## Query\n",
"\n",
"Once an indexing job is complete, the knowledge graph is ready to query. Two types of queries (global and local) are currently supported. We encourage you to try both and experience the difference in responses. Note that query response time is also correlated to the TPM quota of the Azure OpenAI model you are using."
]
},
{
"cell_type": "markdown",
"id": "33",
"metadata": {},
"source": [
"#### Global Search\n",
"\n",
"Global search queries are resource-intensive, but give good responses to questions that require an understanding of the dataset as a whole."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "34",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# pass in a single index name as a string, or query across multiple indexes by setting index_name=[myindex1, myindex2]\n",
"global_response = global_search(\n",
"    index_name=index_name,\n",
"    query=\"Summarize the qualifications to being a delivery data scientist\",\n",
"    community_level=2,\n",
")\n",
"# print the result and save context data in a variable\n",
"global_response_data = parse_query_response(global_response, return_context_data=True)\n",
"global_response_data"
]
},
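{
"cell_type": "markdown",
"id": "61",
"metadata": {},
"source": [
"The returned context data maps each source type to a list of records. For easier inspection, the sketch below loads each table into a pandas DataFrame (for global search this typically includes a `reports` table, mirroring the streaming helper defined earlier)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "62",
"metadata": {},
"outputs": [],
"source": [
"# view each table in the global search context data as a DataFrame\n",
"if isinstance(global_response_data, dict):\n",
"    for key, records in global_response_data.items():\n",
"        print(f\"--- {key} ---\")\n",
"        display(pd.DataFrame.from_dict(records).head(5))"
]
},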
{
"cell_type": "markdown",
"id": "37",
"metadata": {},
"source": [
"#### Local Search\n",
"\n",
"Local search queries are best suited for narrowly-focused questions that require an understanding of specific entities mentioned in the documents (e.g. What are the healing properties of chamomile?)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "38",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# pass in a single index name as a string, or query across multiple indexes by setting index_name=[myindex1, myindex2]\n",
"local_response = local_search(\n",
"    index_name=index_name,\n",
"    query=\"Who are the primary actors in these communities?\",\n",
"    community_level=2,\n",
")\n",
"# print the result and save context data in a variable\n",
"local_response_data = parse_query_response(local_response, return_context_data=True)\n",
"local_response_data"
]
},
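{
"cell_type": "markdown",
"id": "63",
"metadata": {},
"source": [
"Ids that appear in the local context data can be fed directly into the source endpoints covered in the next section. A minimal sketch, assuming the context data includes an `entities` table with an `id` column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "64",
"metadata": {},
"outputs": [],
"source": [
"# look up the first entity cited in the local search context via the sources API\n",
"# (assumes the context data contains an 'entities' table with an 'id' column)\n",
"if isinstance(local_response_data, dict) and \"entities\" in local_response_data:\n",
"    entities = pd.DataFrame.from_dict(local_response_data[\"entities\"])\n",
"    first_id = entities[\"id\"].iloc[0]\n",
"    print(get_entity(index_name, first_id).json())"
]
},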
{
"cell_type": "markdown",
"id": "41",
"metadata": {},
"source": [
"## Sources\n",
"\n",
"A query response will often contain citations that support GraphRAG's answer. API endpoints are provided to enable retrieval of the sourced documents, entities, relationships, etc.\n",
"\n",
"Multiple types of sources may be referenced in a query: Reports, Entities, Relationships, Claims, and Text Units. The API provides various endpoints to retrieve these sources for data provenance."
]
},
{
"cell_type": "markdown",
"id": "42",
"metadata": {},
"source": [
"#### Get a Report"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "43",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"report_response = get_report(index_name, 0)\n",
"if report_response.ok:\n",
"    print(report_response.json()[\"text\"])\n",
"else:\n",
"    print(report_response.reason)\n",
"    print(report_response.content)"
]
},
{
"cell_type": "markdown",
"id": "44",
"metadata": {},
"source": [
"#### Get an Entity"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "45",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"entity_response = get_entity(index_name, 0)\n",
"entity_response.json() if entity_response.ok else (\n",
"    entity_response.reason,\n",
"    entity_response.content,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "46",
"metadata": {
"tags": []
},
"source": [
"#### Get a Relationship"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "47",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"relationship_response = get_relationship(index_name, 1)\n",
"relationship_response.json() if relationship_response.ok else (\n",
"    relationship_response.reason,\n",
"    relationship_response.content,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "48",
"metadata": {},
"source": [
"#### Get a Claim"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "49",
"metadata": {},
"outputs": [],
"source": [
"claim_response = get_claim(index_name, 1)\n",
"if claim_response.ok:\n",
"    pprint(claim_response.json())\n",
"else:\n",
"    print(claim_response)\n",
"    print(claim_response.text)"
]
},
{
"cell_type": "markdown",
"id": "50",
"metadata": {
"tags": []
},
"source": [
"#### Get a Text Unit"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "51",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# get a text unit id from one of the previous Source endpoint results (look for 'text_units' in the response)\n",
"text_unit_id = \"\"\n",
"if not text_unit_id:\n",
"    raise ValueError(\n",
"        \"Must provide a text_unit_id from previous source results. Look for 'text_units' in the response.\"\n",
"    )\n",
"text_unit_response = get_text_unit(index_name, text_unit_id)\n",
"if text_unit_response.ok:\n",
"    print(text_unit_response.json()[\"text\"])\n",
"else:\n",
"    print(text_unit_response.reason)\n",
"    print(text_unit_response.content)"
]
},
{
"cell_type": "markdown",
"id": "52",
"metadata": {},
"source": [
"## Exploring the GraphRAG knowledge graph\n",
"To better understand the knowledge graph that was constructed during the indexing process, the API provides an option to download a graphml file, which can be imported into other open source visualization software (we recommend [Gephi](https://gephi.org/)) for deeper exploration."
]
},
{
"cell_type": "markdown",
"id": "53",
"metadata": {},
"source": [
"#### Get a GraphML file"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "54",
"metadata": {},
"outputs": [],
"source": [
"# saves the graphml file to the current local directory\n",
"save_graphml_file(index_name, \"knowledge_graph.graphml\")"
]
},
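{
"cell_type": "markdown",
"id": "65",
"metadata": {},
"source": [
"As a quick sanity check before importing the file into Gephi, the graph can also be parsed with [networkx](https://networkx.org/). A minimal sketch (networkx is not installed by this notebook; run `pip install networkx` first):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "66",
"metadata": {},
"outputs": [],
"source": [
"# quick structural summary of the exported graph (requires: pip install networkx)\n",
"import networkx as nx\n",
"\n",
"graph = nx.read_graphml(\"knowledge_graph.graphml\")\n",
"print(f\"nodes: {graph.number_of_nodes()}, edges: {graph.number_of_edges()}\")"
]
}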
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}