diff --git a/config/json_yaml/index.html b/config/json_yaml/index.html index 4853e98e..3d4fd0d0 100644 --- a/config/json_yaml/index.html +++ b/config/json_yaml/index.html @@ -1402,7 +1402,7 @@
- api_version str - The API version.
- organization str - The client organization.
- proxy str - The proxy URL to use.
- cognitive_services_endpoint str - The URL endpoint for cognitive services.
- audience str - (Azure OpenAI only) The URI of the target Azure resource/service for which a managed identity token is requested. Used if api_key is not defined. Default=https://cognitiveservices.azure.com/.default
- deployment_name str - The deployment name to use (Azure).
- model_supports_json bool - Whether the model supports JSON-mode output.
- tokens_per_minute int - Set a leaky-bucket throttle on tokens-per-minute.
- parallelization (see Parallelization top-level config)
- async_mode (see Async Mode top-level config)
- batch_size int - The maximum batch size to use.
- batch_max_tokens int - The maximum batch # of tokens.
- target required|all - Determines which set of embeddings to emit.
- skip list[str] - Which embeddings to skip.
- vector_store dict - The vector store to use. Configured for lancedb by default.
- type str - lancedb or azure_ai_search. Default=lancedb
- db_uri str (only for lancedb) - The database uri. Default=storage.base_dir/lancedb
- url str (only for AI Search) - AI Search endpoint
- api_key str (optional - only for AI Search) - The AI Search API key to use.
- audience str (only for AI Search) - Audience for managed identity token if managed identity authentication is used.
- overwrite bool (only used at index creation time) - Overwrite collection if it exists. Default=True
- collection_name str - The name of a vector collection. Default=entity_description_embeddings
- strategy dict - Fully override the text-embedding strategy.
- top_level_nodes bool - Emit top-level-node snapshots.
- str - The text encoding model to use. Default is cl100k_base.
str - The text encoding model to use. Default=cl100k_base.
list[str] - Which workflow names to skip.
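For orientation, here is a minimal, hedged sketch of how the vector store fields above might look in settings.yaml, assuming the vector_store block is nested under the embeddings section; the endpoint and key values are placeholders, not real resources:

```yaml
embeddings:
  vector_store:
    type: lancedb                 # or azure_ai_search
    db_uri: output/lancedb        # lancedb only; defaults to storage.base_dir/lancedb
    collection_name: entity_description_embeddings
    overwrite: true               # only used at index creation time
    # Azure AI Search variant (placeholder values):
    # type: azure_ai_search
    # url: https://<search-resource>.search.windows.net
    # api_key: ${AI_SEARCH_API_KEY}
    # audience: <audience-for-managed-identity>
```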
diff --git a/examples_notebooks/global_search/index.html b/examples_notebooks/global_search/index.html index ec6851e9..e04af54c 100644 --- a/examples_notebooks/global_search/index.html +++ b/examples_notebooks/global_search/index.html @@ -2531,15 +2531,15 @@ print(result.response)### Major Conflict -The central conflict in the story revolves around the Paranormal Military Squad's mission to establish contact with extraterrestrial intelligence. This mission involves deciphering alien signals and managing the potential implications of first contact. The conflict is marked by the secrecy and high stakes associated with the mission, as well as the challenges posed by the unknown nature of the extraterrestrial entities [Data: Reports (4, 5, 2, 3, 0)]. +The central conflict in the story revolves around the Paranormal Military Squad's mission to establish contact with extraterrestrial intelligence. This involves deciphering alien signals and managing the potential implications of first contact. The mission is characterized by its secrecy and high stakes, as well as the challenges posed by the unknown nature of the extraterrestrial entities. The team must navigate these uncertainties and potential threats as they work towards their goal [Data: Reports (4, 5, 2, 3, 0)]. ### Protagonists -The protagonists are the members of the Paranormal Military Squad, which includes key figures such as Taylor Cruz, Dr. Jordan Hayes, Alex Mercer, and Sam Rivera. These individuals play crucial roles in the mission, contributing their expertise in leadership, signal decryption, and communication with extraterrestrial beings [Data: Reports (4, 5, 2, 3, 0)]. +The protagonists are the key members of the Paranormal Military Squad, including Taylor Cruz, Dr. Jordan Hayes, Alex Mercer, and Sam Rivera. Each of these individuals plays a crucial role in the mission, bringing their expertise in leadership, signal decryption, diplomatic engagement, and technical innovation to the forefront. Their combined efforts are essential in tackling the challenges posed by the mission [Data: Reports (4, 5, 2, 3, 0)]. ### Antagonist -The antagonist in the story is not a single entity or character. Instead, it may be considered the unknown and potentially threatening nature of the extraterrestrial signals and the challenges they present to the Paranormal Military Squad's mission [Data: Reports (4, 5, 2, 3, 0)]. +There is no clear antagonist in the traditional sense. Instead, the conflict primarily involves the challenges and uncertainties associated with extraterrestrial communication and the potential risks it poses. The antagonist could be interpreted as the unknown and potentially threatening nature of the extraterrestrial entities or the obstacles faced by the team in achieving their mission [Data: Reports (4, 5, 2, 3, 0)].
LLM calls: 2. LLM tokens: 5274 +LLM calls: 2. LLM tokens: 5283
[2024-10-21T18:05:01Z WARN lance::dataset] No existing dataset at /home/runner/work/graphrag/graphrag/docs/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance, it will be created +[2024-10-21T20:58:06Z WARN lance::dataset] No existing dataset at /home/runner/work/graphrag/graphrag/docs/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance, it will be created
### Overview of Agent Alex Mercer -Agent Alex Mercer is a central figure within the Paranormal Military Squad Team at Dulce Base, where he plays a pivotal role in the team's operations and mission objectives. His responsibilities are multifaceted, encompassing leadership, strategic oversight, and direct involvement in the analysis and interpretation of extraterrestrial signals. Mercer's military background and experience are crucial to his role, as he guides the team through complex scenarios involving potential first contact with alien intelligence [Data: Entities (0, 209); Relationships (5, 8, 6)]. +Agent Alex Mercer is a central figure within the Paranormal Military Squad Team at Dulce Base, where he plays a pivotal role in the team's operations and mission objectives. His responsibilities are multifaceted, encompassing leadership, strategic oversight, and direct involvement in the analysis and interpretation of extraterrestrial signals. Mercer's military background and experience are crucial to his role, as he guides the team through complex scenarios involving potential first contact with alien intelligence [Data: Entities (0, 209); Relationships (5, 8, 65)]. ### Leadership and Responsibilities -As a leader, Alex Mercer is instrumental in directing the Paranormal Military Squad's efforts to engage with extraterrestrial intelligence. His leadership style is characterized by a mix of caution and anticipation, reflecting the gravity of the mission at hand. Mercer is responsible for ensuring a cautious approach to interspecies communication, unraveling galactic mysteries, and engaging with alien signals. His role involves not only overseeing the team but also participating in the decryption and analysis of alien messages, which are critical to understanding extraterrestrial societies [Data: Entities (0); Claims (73, 82, 67)]. +As a leader, Alex Mercer is instrumental in overseeing the Paranormal Military Squad's efforts to engage with extraterrestrial intelligence. His leadership style is characterized by a mix of caution and anticipation, reflecting the gravity of the mission at hand. Mercer is responsible for ensuring a cautious approach to interspecies communication and unraveling galactic mysteries, which involves both strategic planning and hands-on participation in decryption efforts [Data: Entities (0); Relationships (5, 6, 8)]. ### Collaboration and Team Dynamics -Agent Mercer works closely with other key members of the team, such as Dr. Jordan Hayes, with whom he shares a mutual respect and understanding of the mission's significance. Their collaboration focuses on decrypting and communicating with extraterrestrial intelligence, highlighting the importance of teamwork in achieving their objectives. Mercer also interacts with other team members like Sam Rivera and Taylor Cruz, fostering a collaborative environment that is essential for the success of their mission [Data: Relationships (1, 4, 26, 67); Reports (0)]. +Agent Mercer works closely with other key members of the team, such as Dr. Jordan Hayes, with whom he collaborates on deciphering alien signals and managing interspecies communication. Their partnership is built on mutual respect and recognition of each other's analytical skills, which is essential for the success of their mission. Additionally, Mercer interacts with other team members like Sam Rivera and Taylor Cruz, highlighting his role in fostering teamwork and collaboration within the squad [Data: Reports (0); Relationships (1, 4, 26, 67)]. 
-### Strategic Approach and Challenges +### Involvement in Extraterrestrial Communication -Mercer's strategic approach to the mission involves balancing the need for caution with the potential for groundbreaking discoveries. He is aware of the complexities and potential risks associated with extraterrestrial contact, and he emphasizes the importance of understanding and engaging with alien signals. This strategic foresight is crucial as the team navigates the challenges of deciphering alien code and preparing for first contact scenarios [Data: Claims (50, 57, 60); Reports (0)]. +Mercer's involvement in the decryption and analysis of alien signals is a significant aspect of his role. He is seen as a key figure in establishing contact and dialogue with alien intelligence, acting as a representative of humanity. This responsibility underscores the importance of his work in the broader context of interstellar communication and exploration [Data: Claims (73, 82); Reports (0)]. -In summary, Agent Alex Mercer is a key leader within the Paranormal Military Squad Team, guiding the team through the intricacies of extraterrestrial communication and ensuring that their mission is conducted with strategic foresight and caution. His collaboration with team members and his leadership in the face of potential first contact highlight his critical role in the team's efforts at Dulce Base. +In summary, Agent Alex Mercer is a vital member of the Paranormal Military Squad Team, whose leadership and expertise are crucial to the team's mission of engaging with extraterrestrial intelligence. His role involves a delicate balance of strategic oversight, collaboration, and direct involvement in the analysis of alien signals, making him a central figure in the unfolding narrative of human-alien relations.
## Overview of Dr. Jordan Hayes -Dr. Jordan Hayes is a prominent member of the Paranormal Military Squad Team stationed at Dulce Base. Their primary role involves deciphering alien code and interpreting extraterrestrial patterns, which are crucial for the team's mission of understanding and interacting with extraterrestrial entities [Data: Entities (104, 2); Reports (0)]. Dr. Hayes is known for their analytical mindset and skepticism, which they balance with a willingness to explore new possibilities during their mission [Data: Entities (124); Claims (12)]. +Dr. Jordan Hayes is a prominent member of the Paranormal Military Squad, a specialized team stationed at Dulce Base. This team is dedicated to the analysis and interpretation of extraterrestrial signals and patterns, with Dr. Hayes playing a crucial role in deciphering alien code and facilitating interspecies communication [Data: Entities (104, 2); Reports (0)]. -## Role and Contributions +## Role and Expertise -Dr. Hayes plays a pivotal role in the Paranormal Military Squad's efforts to communicate with alien intelligence. This involves isolating signal harmonics, decrypting alien messages, and interpreting alien signals for further analysis [Data: Entities (2, 192, 180)]. Their expertise in decryption algorithms and signal analysis is vital to the team's mission, as they work on deciphering extraterrestrial signals and engaging in interstellar communication [Data: Entities (166, 180); Claims (68, 79)]. +Dr. Hayes is known for their analytical mindset and expertise in decryption algorithms, which are essential for interpreting alien signals. Their work involves isolating signal harmonics and decrypting alien messages, which are critical components of the team's mission to understand and interact with extraterrestrial entities [Data: Entities (2, 180, 192, 166); Claims (61, 68, 79)]. Dr. Hayes's focus on empirical evidence and adaptability is a key aspect of their approach to the complex challenges posed by extraterrestrial communication [Data: Entities (2); Claims (12)]. -## Collaboration and Relationships +## Contributions to the Paranormal Military Squad -Dr. Hayes collaborates closely with other team members, including Agent Alex Mercer, Sam Rivera, and Taylor Cruz. Their partnership with Alex Mercer is particularly significant, as they work together on managing interspecies communication and interpreting signals crucial to the team's operations [Data: Relationships (1, 4, 26, 67); Reports (0)]. Despite occasional differing views with Taylor Cruz, their collaboration is essential for the success of the mission [Data: Relationships (9, 15)]. +Within the Paranormal Military Squad, Dr. Hayes collaborates closely with other team members, including Agent Alex Mercer, to manage interspecies communication and analyze extraterrestrial patterns. Their partnership is marked by mutual respect and a shared commitment to the mission's objectives [Data: Relationships (1, 4, 26, 67); Reports (0)]. Dr. Hayes's work is pivotal in the team's efforts to prepare for potential first contact scenarios and to decipher alien signals that could represent both threats and opportunities for untapped wisdom [Data: Reports (0); Claims (84)]. ## Scientific Breakthroughs and Challenges -Dr. Hayes is on the verge of a scientific breakthrough, as they analyze evolving alien signals and consider the implications of a tandem evolution with extraterrestrial intelligence [Data: Claims (49, 74)]. 
They have discovered hidden technology and deciphered extraterrestrial patterns that could represent potential threats or untapped wisdom, highlighting the complex nature of their mission [Data: Claims (18, 84)]. +Dr. Hayes is on the verge of significant scientific breakthroughs, as they analyze evolving alien signals and consider the implications of these patterns. Their work suggests a potential tandem evolution with extraterrestrial intelligence, highlighting the profound impact of their research on the understanding of alien thought patterns and communication [Data: Claims (49, 74)]. Despite the challenges, Dr. Hayes remains focused on the mission, contemplating the skepticism and the need to accept other possibilities as they navigate the unknown [Data: Claims (12, 54)]. -In summary, Dr. Jordan Hayes is a central figure in the Paranormal Military Squad's mission at Dulce Base, contributing significantly to the understanding and communication with extraterrestrial entities. Their analytical skills, collaboration with team members, and potential scientific breakthroughs underscore their importance in the team's efforts to navigate the complexities of interstellar communication. +In summary, Dr. Jordan Hayes is a central figure in the Paranormal Military Squad's mission to engage with extraterrestrial intelligence. Their expertise in signal analysis and decryption, combined with their analytical approach, makes them an invaluable asset to the team as they explore the frontiers of interstellar communication and diplomacy.
['- What is the role of Alex Mercer in Operation: Dulce?', '- How does the Paranormal Military Squad interact with extraterrestrial intelligence at Dulce Base?', '- What are the main objectives of Operation: Dulce?', '- How does the environment of Dulce Military Base affect the team members?', '- What challenges does the Paranormal Military Squad face during their mission at Dulce Base?'] +['- What is the role of Alex Mercer in Operation: Dulce?', '- How does the Paranormal Military Squad interact with extraterrestrial intelligence at Dulce Base?', '- What are the main objectives of Operation: Dulce?', '- How does the environment of the Dulce Military Base affect the team members?', '- What is the significance of New Mexico in the context of Operation: Dulce?']
👉 Microsoft Research Blog Post 👉 GraphRAG Accelerator 👉 GraphRAG Arxiv
Figure 1: An LLM-generated knowledge graph built using GPT-4 Turbo.
GraphRAG is a structured, hierarchical approach to Retrieval Augmented Generation (RAG), as opposed to naive semantic-search approaches using plain text snippets. The GraphRAG process involves extracting a knowledge graph out of raw text, building a community hierarchy, generating summaries for these communities, and then leveraging these structures when performing RAG-based tasks.
To learn more about GraphRAG and how it can be used to enhance your LLM's ability to reason about your private data, please visit the Microsoft Research Blog Post.
"}, {"location": "#solution-accelerator", "title": "Solution Accelerator \ud83d\ude80", "text": "To quickstart the GraphRAG system we recommend trying the Solution Accelerator package. This provides a user-friendly end-to-end experience with Azure resources.
"}, {"location": "#get-started-with-graphrag", "title": "Get Started with GraphRAG \ud83d\ude80", "text": "To start using GraphRAG, check out the Get Started guide. For a deeper dive into the main sub-systems, please visit the docpages for the Indexer and Query packages.
"}, {"location": "#graphrag-vs-baseline-rag", "title": "GraphRAG vs Baseline RAG \ud83d\udd0d", "text": "Retrieval-Augmented Generation (RAG) is a technique to improve LLM outputs using real-world information. This technique is an important part of most LLM-based tools and the majority of RAG approaches use vector similarity as the search technique, which we call Baseline RAG. GraphRAG uses knowledge graphs to provide substantial improvements in question-and-answer performance when reasoning about complex information. RAG techniques have shown promise in helping LLMs to reason about private datasets - data that the LLM is not trained on and has never seen before, such as an enterprise\u2019s proprietary research, business documents, or communications. Baseline RAG was created to help solve this problem, but we observe situations where baseline RAG performs very poorly. For example:
To address this, the tech community is working to develop methods that extend and enhance RAG. Microsoft Research's new approach, GraphRAG, uses LLMs to create a knowledge graph based on an input corpus. This graph, along with community summaries and graph machine learning outputs, is used to augment prompts at query time. GraphRAG shows substantial improvement in answering the two classes of questions described above, demonstrating intelligence or mastery that outperforms other approaches previously applied to private datasets.
"}, {"location": "#the-graphrag-process", "title": "The GraphRAG Process \ud83e\udd16", "text": "GraphRAG builds upon our prior research and tooling using graph machine learning. The basic steps of the GraphRAG process are as follows:
"}, {"location": "#index", "title": "Index", "text": "At query time, these structures are used to provide materials for the LLM context window when answering a question. The primary query modes are:
Using GraphRAG with your data out of the box may not yield the best possible results. We strongly recommend fine-tuning your prompts by following the Prompt Tuning Guide in our documentation.
"}, {"location": "blog_posts/", "title": "Microsoft Research Blog", "text": "GraphRAG: Unlocking LLM discovery on narrative private data
Published February 13, 2024
By Jonathan Larson, Senior Principal Data Architect; Steven Truitt, Principal Program Manager
GraphRAG: New tool for complex data discovery now on GitHub
Published July 2, 2024
By Darren Edge, Senior Director; Ha Trinh, Senior Data Scientist; Steven Truitt, Principal Program Manager; Jonathan Larson, Senior Principal Data Architect
GraphRAG auto-tuning provides rapid adaptation to new domains
Published September 9, 2024
By Alonso Guevara Fernández, Sr. Software Engineer; Katy Smith, Data Scientist II; Joshua Bradley, Senior Data Scientist; Darren Edge, Senior Director; Ha Trinh, Senior Data Scientist; Sarah Smith, Senior Program Manager; Ben Cutler, Senior Director; Steven Truitt, Principal Program Manager; Jonathan Larson, Senior Principal Data Architect
"}, {"location": "developing/", "title": "Development Guide", "text": ""}, {"location": "developing/#requirements", "title": "Requirements", "text": "Name Installation Purpose Python 3.10-3.12 Download The library is Python-based. Poetry Instructions Poetry is used for package management and virtualenv management in Python codebases"}, {"location": "developing/#getting-started", "title": "Getting Started", "text": ""}, {"location": "developing/#install-dependencies", "title": "Install Dependencies", "text": "# Install Python dependencies.\npoetry install\n"}, {"location": "developing/#execute-the-indexing-engine", "title": "Execute the Indexing Engine", "text": "poetry run poe index <...args>\n"}, {"location": "developing/#executing-queries", "title": "Executing Queries", "text": "poetry run poe query <...args>\n"}, {"location": "developing/#azurite", "title": "Azurite", "text": "Some unit and smoke tests use Azurite to emulate Azure resources. This can be started by running:
./scripts/start-azurite.sh\n or by simply running azurite in the terminal if already installed globally. See the Azurite documentation for more information about how to install and use Azurite.
Our Python package utilizes Poetry to manage dependencies and poethepoet to manage build scripts.
Available scripts are:
- poetry run poe index - Run the Indexing CLI
- poetry run poe query - Run the Query CLI
- poetry build - This invokes poetry build, which will build a wheel file and other distributable artifacts.
- poetry run poe test - This will execute all tests.
- poetry run poe test_unit - This will execute unit tests.
- poetry run poe test_integration - This will execute integration tests.
- poetry run poe test_smoke - This will execute smoke tests.
- poetry run poe check - This will perform a suite of static checks across the package.
- poetry run poe fix - This will apply any available auto-fixes to the package. Usually this is just formatting fixes.
- poetry run poe fix_unsafe - This will apply any available auto-fixes to the package, including those that may be unsafe.
- poetry run poe format - Explicitly run the formatter across the package.

Make sure llvm-9 and llvm-9-dev are installed:
sudo apt-get install llvm-9 llvm-9-dev
and then in your bashrc, add
export LLVM_CONFIG=/usr/bin/llvm-config-9
Make sure you have python3.10-dev installed or more generally python<version>-dev
sudo apt-get install python3.10-dev
GRAPHRAG_LLM_THREAD_COUNT and GRAPHRAG_EMBEDDING_THREAD_COUNT are both set to 50 by default. You can modify these values to reduce concurrency. Please refer to the Configuration Documents.
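If you configure via settings.yaml instead of environment variables, a rough, hedged sketch of reducing concurrency might look like the following; it assumes the thread-count environment variables correspond to the parallelization settings described in the JSON/YAML configuration reference, and the numbers are arbitrary examples:

```yaml
llm:
  concurrent_requests: 10   # example: lower than the documented default of 25
parallelization:
  num_threads: 10           # example: lower than the default of 50
  stagger: 0.3              # seconds to wait between starting each thread
```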
Python 3.10-3.12
To get started with the GraphRAG system, you have a few options:
👉 Use the GraphRAG Accelerator solution 👉 Install from PyPI 👉 Use it from source
"}, {"location": "get_started/#quickstart", "title": "Quickstart", "text": "To get started with the GraphRAG system we recommend trying the Solution Accelerator package. This provides a user-friendly end-to-end experience with Azure resources.
"}, {"location": "get_started/#top-level-modules", "title": "Top-Level Modules", "text": "The following is a simple end-to-end example for using the GraphRAG system. It shows how to use the system to index some text, and then use the indexed data to answer questions about the documents.
"}, {"location": "get_started/#install-graphrag", "title": "Install GraphRAG", "text": "pip install graphrag\n"}, {"location": "get_started/#running-the-indexer", "title": "Running the Indexer", "text": "Now we need to set up a data project and some initial configuration. Let's set that up. We're using the default configuration mode, which you can customize as needed using a config file, which we recommend, or environment variables.
First let's get a sample dataset ready:
mkdir -p ./ragtest/input\n Now let's get a copy of A Christmas Carol by Charles Dickens from a trusted source:
curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt -o ./ragtest/input/book.txt\n Next we'll inject some required config variables:
"}, {"location": "get_started/#set-up-your-workspace-variables", "title": "Set Up Your Workspace Variables", "text": "First let's make sure to setup the required environment variables. For details on these environment variables, and what environment variables are available, see the variables documentation.
To initialize your workspace, let's first run the graphrag.index --init command. Since we have already configured a directory named ./ragtest in the previous step, we can run the following command:
python -m graphrag.index --init --root ./ragtest\n This will create two files: .env and settings.yaml in the ./ragtest directory.
.env contains the environment variables required to run the GraphRAG pipeline. If you inspect the file, you'll see a single environment variable defined, GRAPHRAG_API_KEY=<API_KEY>. This is the API key for the OpenAI API or Azure OpenAI endpoint. You can replace this with your own API key.settings.yaml contains the settings for the pipeline. You can modify this file to change the settings for the pipeline. To run in OpenAI mode, just make sure to update the value of GRAPHRAG_API_KEY in the .env file with your OpenAI API key.
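As a minimal sketch (not the full generated file), the settings.yaml produced by --init typically references the .env value through environment-variable substitution, along the lines of:

```yaml
llm:
  api_key: ${GRAPHRAG_API_KEY}   # resolved from the .env file
  type: openai_chat              # or azure_openai_chat for Azure OpenAI
  model: gpt-4-turbo-preview     # the documented default model, not a requirement
```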
In addition, Azure OpenAI users should set the following variables in the settings.yaml file. To find the appropriate sections, just search for the llm: configuration; you should see two sections, one for the chat endpoint and one for the embeddings endpoint. Here is an example of how to configure the chat endpoint:
type: azure_openai_chat # Or azure_openai_embedding for embeddings\napi_base: https://<instance>.openai.azure.com\napi_version: 2024-02-15-preview # You can customize this for other versions\ndeployment_name: <azure_model_deployment_name>\n Finally we'll run the pipeline!
python -m graphrag.index --root ./ragtest\n This process will take some time to run; how long depends on the size of your input data, the model you're using, and the text chunk size (all of which can be configured in your settings.yaml file). Once the pipeline is complete, you should see a new folder called ./ragtest/output/<timestamp>/artifacts with a series of parquet files.
Now let's ask some questions using this dataset.
Here is an example using Global search to ask a high-level question:
python -m graphrag.query \\\n--root ./ragtest \\\n--method global \\\n\"What are the top themes in this story?\"\n Here is an example using Local search to ask a more specific question about a particular character:
python -m graphrag.query \\\n--root ./ragtest \\\n--method local \\\n\"Who is Scrooge, and what are his main relationships?\"\n Please refer to Query Engine docs for detailed information about how to leverage our Local and Global search mechanisms for extracting meaningful insights from data after the Indexer has wrapped up execution.
"}, {"location": "config/custom/", "title": "Fully Custom Config", "text": "The primary configuration sections for Indexing Engine pipelines are described below. Each configuration section can be expressed in Python (for use in Python API mode) as well as YAML, but YAML is show here for brevity.
Using custom configuration is an advanced use-case. Most users will want to use the Default Configuration instead.
"}, {"location": "config/custom/#indexing-engine-examples", "title": "Indexing Engine Examples", "text": "The examples directory contains several examples of how to use the indexing engine with custom configuration.
Most examples include two different forms of running the pipeline, both of which are contained in the example's run.py.
To run an example:
- poetry shell to activate a virtual environment with the required dependencies.
- PYTHONPATH=\"$(pwd)\" python examples/path_to_example/run.py from the root directory.

For example, to run the single_verb example, you would run the following commands:
poetry shell\n PYTHONPATH=\"$(pwd)\" python examples/single_verb/run.py\n"}, {"location": "config/custom/#configuration-sections", "title": "Configuration Sections", "text": ""}, {"location": "config/custom/#extends", "title": "> extends", "text": "This configuration allows you to extend a base configuration file or files.
# single base\nextends: ../base_config.yml\n # multiple bases\nextends:\n - ../base_config.yml\n - ../base_config2.yml\n"}, {"location": "config/custom/#root_dir", "title": "> root_dir", "text": "This configuration allows you to set the root directory for the pipeline. All data inputs and outputs are assumed to be relative to this path.
root_dir: /workspace/data_project\n"}, {"location": "config/custom/#storage", "title": "> storage", "text": "This configuration allows you to define the output strategy for the pipeline.
- type: The type of storage to use. Options are file, memory, and blob.
- base_dir (type: file only): The base directory to store the data in. This is relative to the config root.
- connection_string (type: blob only): The connection string to use for blob storage.
- container_name (type: blob only): The container to use for blob storage.

This configuration allows you to define the cache strategy for the pipeline.

- type: The type of cache to use. Options are file, memory, and blob.
- base_dir (type: file only): The base directory to store the cache in. This is relative to the config root.
- connection_string (type: blob only): The connection string to use for blob storage.
- container_name (type: blob only): The container to use for blob storage.

This configuration allows you to define the reporting strategy for the pipeline. Report files are generated artifacts that summarize the performance metrics of the pipeline and emit any error messages.

- type: The type of reporting to use. Options are file, memory, and blob.
- base_dir (type: file only): The base directory to store the reports in. This is relative to the config root.
- connection_string (type: blob only): The connection string to use for blob storage.
- container_name (type: blob only): The container to use for blob storage.

This configuration section defines the workflow DAG for the pipeline. Here we define an array of workflows and express their inter-dependencies in steps:
name: The name of the workflow. This is used to reference the workflow in other parts of the config.steps: The DataShaper steps that this workflow comprises. If a step defines an input in the form of workflow:<workflow_name>, then it is assumed to have a dependency on the output of that workflow.workflows:\n - name: workflow1\n steps:\n - verb: derive\n args:\n column1: \"col1\"\n column2: \"col2\"\n - name: workflow2\n steps:\n - verb: derive\n args:\n column1: \"col1\"\n column2: \"col2\"\n input:\n # dependency established here\n source: workflow:workflow1\n"}, {"location": "config/custom/#input", "title": "> input", "text": "type: The type of input to use. Options are file or blob.file_type: The file type field discriminates between the different input types. Options are csv and text.base_dir: The base directory to read the input files from. This is relative to the config file.file_pattern: A regex to match the input files. The regex must have named groups for each of the fields in the file_filter.post_process: A DataShaper workflow definition to apply to the input before executing the primary workflow.source_column (type: csv only): The column containing the source/author of the datatext_column (type: csv only): The column containing the text of the datatimestamp_column (type: csv only): The column containing the timestamp of the datatimestamp_format (type: csv only): The format of the timestampinput:\n type: file\n file_type: csv\n base_dir: ../data/csv # the directory containing the CSV files, this is relative to the config file\n file_pattern: '.*[\\/](?P<source>[^\\/]+)[\\/](?P<year>\\d{4})-(?P<month>\\d{2})-(?P<day>\\d{2})_(?P<author>[^_]+)_\\d+\\.csv$' # a regex to match the CSV files\n # An additional file filter which uses the named groups from the file_pattern to further filter the files\n # file_filter:\n # # source: (source_filter)\n # year: (2023)\n # month: (06)\n # # day: (22)\n source_column: \"author\" # the column containing the source/author of the data\n text_column: \"message\" # the column containing the text of the data\n timestamp_column: \"date(yyyyMMddHHmmss)\" # optional, the column containing the timestamp of the data\n timestamp_format: \"%Y%m%d%H%M%S\" # optional, the format of the timestamp\n post_process: # Optional, set of steps to process the data before going into the workflow\n - verb: filter\n args:\n column: \"title\",\n value: \"My document\"\n input:\n type: file\n file_type: csv\n base_dir: ../data/csv # the directory containing the CSV files, this is relative to the config file\n file_pattern: '.*[\\/](?P<source>[^\\/]+)[\\/](?P<year>\\d{4})-(?P<month>\\d{2})-(?P<day>\\d{2})_(?P<author>[^_]+)_\\d+\\.csv$' # a regex to match the CSV files\n # An additional file filter which uses the named groups from the file_pattern to further filter the files\n # file_filter:\n # # source: (source_filter)\n # year: (2023)\n # month: (06)\n # # day: (22)\n post_process: # Optional, set of steps to process the data before going into the workflow\n - verb: filter\n args:\n column: \"title\",\n value: \"My document\"\n"}, {"location": "config/env_vars/", "title": "Default Configuration Mode (using Env Vars)", "text": ""}, {"location": "config/env_vars/#text-embeddings-customization", "title": "Text-Embeddings Customization", "text": "By default, the GraphRAG indexer will only emit embeddings required for our query methods. 
However, the model has embeddings defined for all plaintext fields, and these can be generated by setting the GRAPHRAG_EMBEDDING_TARGET environment variable to all.
If the embedding target is all, and you want to only embed a subset of these fields, you may specify which embeddings to skip using the GRAPHRAG_EMBEDDING_SKIP argument described below.
- text_unit.text
- document.raw_content
- entity.name
- entity.description
- relationship.description
- community.title
- community.summary
- community.full_content

Our pipeline can ingest .csv or .txt data from an input folder. These files can be nested within subfolders. To configure how input data is handled, what fields are mapped over, and how timestamps are parsed, look for configuration values starting with GRAPHRAG_INPUT_ below. In general, CSV-based data provides the most customizability. Each CSV should at least contain a text field (which can be mapped with environment variables), but it's helpful if they also have title, timestamp, and source fields. Additional fields can be included as well, which will land as extra fields on the Document table.
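For comparison, the same input options can be expressed in settings.yaml; here is a hedged sketch for a CSV input using the field names from the JSON/YAML configuration reference, with hypothetical column names:

```yaml
input:
  type: file                    # or blob
  file_type: csv                # or text
  base_dir: input               # relative to the root
  file_encoding: utf-8
  file_pattern: ".*\\.csv$"
  text_column: message          # hypothetical column names below
  title_column: title
  source_column: author
  timestamp_column: date
  timestamp_format: "%Y%m%d%H%M%S"
```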
These are the primary settings for configuring LLM connectivity.
| Parameter | Required? | Description | Type | Default Value |
|---|---|---|---|---|
| GRAPHRAG_API_KEY | Yes for OpenAI. Optional for AOAI | The API key. (Note: OPENAI_API_KEY is also used as a fallback). If not defined when using AOAI, managed identity will be used. | str | None |
| GRAPHRAG_API_BASE | For AOAI | The API Base URL | str | None |
| GRAPHRAG_API_VERSION | For AOAI | The AOAI API version. | str | None |
| GRAPHRAG_API_ORGANIZATION | | The AOAI organization. | str | None |
| GRAPHRAG_API_PROXY | | The AOAI proxy. | str | None |
"}, {"location": "config/env_vars/#text-generation-settings", "title": "Text Generation Settings", "text": "These settings control the text generation model used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.
Parameter Required? Description Type Default ValueGRAPHRAG_LLM_TYPE For AOAI The LLM operation type. Either openai_chat or azure_openai_chat str openai_chat GRAPHRAG_LLM_DEPLOYMENT_NAME For AOAI The AOAI model deployment name. str None GRAPHRAG_LLM_API_KEY Yes (uses fallback) The API key. If not defined when using AOAI, managed identity will be used. str None GRAPHRAG_LLM_API_BASE For AOAI (uses fallback) The API Base URL str None GRAPHRAG_LLM_API_VERSION For AOAI (uses fallback) The AOAI API version. str None GRAPHRAG_LLM_API_ORGANIZATION For AOAI (uses fallback) The AOAI organization. str None GRAPHRAG_LLM_API_PROXY The AOAI proxy. str None GRAPHRAG_LLM_MODEL The LLM model. str gpt-4-turbo-preview GRAPHRAG_LLM_MAX_TOKENS The maximum number of tokens. int 4000 GRAPHRAG_LLM_REQUEST_TIMEOUT The maximum number of seconds to wait for a response from the chat client. int 180 GRAPHRAG_LLM_MODEL_SUPPORTS_JSON Indicates whether the given model supports JSON output mode. True to enable. str None GRAPHRAG_LLM_THREAD_COUNT The number of threads to use for LLM parallelization. int 50 GRAPHRAG_LLM_THREAD_STAGGER The time to wait (in seconds) between starting each thread. float 0.3 GRAPHRAG_LLM_CONCURRENT_REQUESTS The number of concurrent requests to allow for the embedding client. int 25 GRAPHRAG_LLM_TOKENS_PER_MINUTE The number of tokens per minute to allow for the LLM client. 0 = Bypass int 0 GRAPHRAG_LLM_REQUESTS_PER_MINUTE The number of requests per minute to allow for the LLM client. 0 = Bypass int 0 GRAPHRAG_LLM_MAX_RETRIES The maximum number of retries to attempt when a request fails. int 10 GRAPHRAG_LLM_MAX_RETRY_WAIT The maximum number of seconds to wait between retries. int 10 GRAPHRAG_LLM_SLEEP_ON_RATE_LIMIT_RECOMMENDATION Whether to sleep on rate limit recommendation. (Azure Only) bool True GRAPHRAG_LLM_TEMPERATURE The temperature to use generation. float 0 GRAPHRAG_LLM_TOP_P The top_p to use for sampling. float 1 GRAPHRAG_LLM_N The number of responses to generate. int 1"}, {"location": "config/env_vars/#text-embedding-settings", "title": "Text Embedding Settings", "text": "These settings control the text embedding model used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.
Parameter Required ? Description Type DefaultGRAPHRAG_EMBEDDING_TYPE For AOAI The embedding client to use. Either openai_embedding or azure_openai_embedding str openai_embedding GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME For AOAI The AOAI deployment name. str None GRAPHRAG_EMBEDDING_API_KEY Yes (uses fallback) The API key to use for the embedding client. If not defined when using AOAI, managed identity will be used. str None GRAPHRAG_EMBEDDING_API_BASE For AOAI (uses fallback) The API base URL. str None GRAPHRAG_EMBEDDING_API_VERSION For AOAI (uses fallback) The AOAI API version to use for the embedding client. str None GRAPHRAG_EMBEDDING_API_ORGANIZATION For AOAI (uses fallback) The AOAI organization to use for the embedding client. str None GRAPHRAG_EMBEDDING_API_PROXY The AOAI proxy to use for the embedding client. str None GRAPHRAG_EMBEDDING_MODEL The model to use for the embedding client. str text-embedding-3-small GRAPHRAG_EMBEDDING_BATCH_SIZE The number of texts to embed at once. (Azure limit is 16) int 16 GRAPHRAG_EMBEDDING_BATCH_MAX_TOKENS The maximum tokens per batch (Azure limit is 8191) int 8191 GRAPHRAG_EMBEDDING_TARGET The target fields to embed. Either required or all. str required GRAPHRAG_EMBEDDING_SKIP A comma-separated list of fields to skip embeddings for . (e.g. 'relationship.description') str None GRAPHRAG_EMBEDDING_THREAD_COUNT The number of threads to use for parallelization for embeddings. int GRAPHRAG_EMBEDDING_THREAD_STAGGER The time to wait (in seconds) between starting each thread for embeddings. float 50 GRAPHRAG_EMBEDDING_CONCURRENT_REQUESTS The number of concurrent requests to allow for the embedding client. int 25 GRAPHRAG_EMBEDDING_TOKENS_PER_MINUTE The number of tokens per minute to allow for the embedding client. 0 = Bypass int 0 GRAPHRAG_EMBEDDING_REQUESTS_PER_MINUTE The number of requests per minute to allow for the embedding client. 0 = Bypass int 0 GRAPHRAG_EMBEDDING_MAX_RETRIES The maximum number of retries to attempt when a request fails. int 10 GRAPHRAG_EMBEDDING_MAX_RETRY_WAIT The maximum number of seconds to wait between retries. int 10 GRAPHRAG_EMBEDDING_SLEEP_ON_RATE_LIMIT_RECOMMENDATION Whether to sleep on rate limit recommendation. (Azure Only) bool True"}, {"location": "config/env_vars/#input-settings", "title": "Input Settings", "text": "These settings control the data input used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.
"}, {"location": "config/env_vars/#plaintext-input-data-graphrag_input_file_typetext", "title": "Plaintext Input Data (GRAPHRAG_INPUT_FILE_TYPE=text)", "text": "Parameter Description Type Required or Optional Default GRAPHRAG_INPUT_FILE_PATTERN The file pattern regexp to use when reading input files from the input directory. str optional .*\\.txt$"}, {"location": "config/env_vars/#csv-input-data-graphrag_input_file_typecsv", "title": "CSV Input Data (GRAPHRAG_INPUT_FILE_TYPE=csv)", "text": "Parameter Description Type Required or Optional Default GRAPHRAG_INPUT_TYPE The input storage type to use when reading files. (file or blob) str optional file GRAPHRAG_INPUT_FILE_PATTERN The file pattern regexp to use when reading input files from the input directory. str optional .*\\.txt$ GRAPHRAG_INPUT_SOURCE_COLUMN The 'source' column to use when reading CSV input files. str optional source GRAPHRAG_INPUT_TIMESTAMP_COLUMN The 'timestamp' column to use when reading CSV input files. str optional None GRAPHRAG_INPUT_TIMESTAMP_FORMAT The timestamp format to use when parsing timestamps in the timestamp column. str optional None GRAPHRAG_INPUT_TEXT_COLUMN The 'text' column to use when reading CSV input files. str optional text GRAPHRAG_INPUT_DOCUMENT_ATTRIBUTE_COLUMNS A list of CSV columns, comma-separated, to incorporate as document fields. str optional id GRAPHRAG_INPUT_TITLE_COLUMN The 'title' column to use when reading CSV input files. str optional title GRAPHRAG_INPUT_STORAGE_ACCOUNT_BLOB_URL The Azure Storage blob endpoint to use when in blob mode and using managed identity. Will have the format https://<storage_account_name>.blob.core.windows.net str optional None GRAPHRAG_INPUT_CONNECTION_STRING The connection string to use when reading CSV input files from Azure Blob Storage. str optional None GRAPHRAG_INPUT_CONTAINER_NAME The container name to use when reading CSV input files from Azure Blob Storage. str optional None GRAPHRAG_INPUT_BASE_DIR The base directory to read input files from. str optional None"}, {"location": "config/env_vars/#data-mapping-settings", "title": "Data Mapping Settings", "text": "Parameter Description Type Required or Optional Default GRAPHRAG_INPUT_FILE_TYPE The type of input data, csv or text str optional text GRAPHRAG_INPUT_ENCODING The encoding to apply when reading CSV/text input files. str optional utf-8"}, {"location": "config/env_vars/#data-chunking", "title": "Data Chunking", "text": "Parameter Description Type Required or Optional Default GRAPHRAG_CHUNK_SIZE The chunk size in tokens for text-chunk analysis windows. str optional 1200 GRAPHRAG_CHUNK_OVERLAP The chunk overlap in tokens for text-chunk analysis windows. str optional 100 GRAPHRAG_CHUNK_BY_COLUMNS A comma-separated list of document attributes to groupby when performing TextUnit chunking. str optional id GRAPHRAG_CHUNK_ENCODING_MODEL The encoding model to use for chunking. str optional The top-level encoding model."}, {"location": "config/env_vars/#prompting-overrides", "title": "Prompting Overrides", "text": "Parameter Description Type Required or Optional Default GRAPHRAG_ENTITY_EXTRACTION_PROMPT_FILE The path (relative to the root) of an entity extraction prompt template text file. str optional None GRAPHRAG_ENTITY_EXTRACTION_MAX_GLEANINGS The maximum number of redrives (gleanings) to invoke when extracting entities in a loop. int optional 1 GRAPHRAG_ENTITY_EXTRACTION_ENTITY_TYPES A comma-separated list of entity types to extract. 
str optional organization,person,event,geo GRAPHRAG_ENTITY_EXTRACTION_ENCODING_MODEL The encoding model to use for entity extraction. str optional The top-level encoding model. GRAPHRAG_SUMMARIZE_DESCRIPTIONS_PROMPT_FILE The path (relative to the root) of an description summarization prompt template text file. str optional None GRAPHRAG_SUMMARIZE_DESCRIPTIONS_MAX_LENGTH The maximum number of tokens to generate per description summarization. int optional 500 GRAPHRAG_CLAIM_EXTRACTION_ENABLED Whether claim extraction is enabled for this pipeline. bool optional False GRAPHRAG_CLAIM_EXTRACTION_DESCRIPTION The claim_description prompting argument to utilize. string optional \"Any claims or facts that could be relevant to threat analysis.\" GRAPHRAG_CLAIM_EXTRACTION_PROMPT_FILE The claim extraction prompt to utilize. string optional None GRAPHRAG_CLAIM_EXTRACTION_MAX_GLEANINGS The maximum number of redrives (gleanings) to invoke when extracting claims in a loop. int optional 1 GRAPHRAG_CLAIM_EXTRACTION_ENCODING_MODEL The encoding model to use for claim extraction. str optional The top-level encoding model GRAPHRAG_COMMUNITY_REPORTS_PROMPT_FILE The community reports extraction prompt to utilize. string optional None GRAPHRAG_COMMUNITY_REPORTS_MAX_LENGTH The maximum number of tokens to generate per community reports. int optional 1500"}, {"location": "config/env_vars/#storage", "title": "Storage", "text": "This section controls the storage mechanism used by the pipeline used for emitting output tables.
| Parameter | Description | Type | Required or Optional | Default |
|---|---|---|---|---|
| GRAPHRAG_STORAGE_TYPE | The type of storage to use. Options are file, memory, or blob | str | optional | file |
| GRAPHRAG_STORAGE_STORAGE_ACCOUNT_BLOB_URL | The Azure Storage blob endpoint to use when in blob mode and using managed identity. Will have the format https://<storage_account_name>.blob.core.windows.net | str | optional | None |
| GRAPHRAG_STORAGE_CONNECTION_STRING | The Azure Storage connection string to use when in blob mode. | str | optional | None |
| GRAPHRAG_STORAGE_CONTAINER_NAME | The Azure Storage container name to use when in blob mode. | str | optional | None |
| GRAPHRAG_STORAGE_BASE_DIR | The base path to data outputs. | str | optional | None |
"}, {"location": "config/env_vars/#cache", "title": "Cache", "text": "This section controls the cache mechanism used by the pipeline. This is used to cache LLM invocation results.
| Parameter | Description | Type | Required or Optional | Default |
|---|---|---|---|---|
| GRAPHRAG_CACHE_TYPE | The type of cache to use. Options are file, memory, none or blob | str | optional | file |
| GRAPHRAG_CACHE_STORAGE_ACCOUNT_BLOB_URL | The Azure Storage blob endpoint to use when in blob mode and using managed identity. Will have the format https://<storage_account_name>.blob.core.windows.net | str | optional | None |
| GRAPHRAG_CACHE_CONNECTION_STRING | The Azure Storage connection string to use when in blob mode. | str | optional | None |
| GRAPHRAG_CACHE_CONTAINER_NAME | The Azure Storage container name to use when in blob mode. | str | optional | None |
| GRAPHRAG_CACHE_BASE_DIR | The base path to the cache files. | str | optional | None |
"}, {"location": "config/env_vars/#reporting", "title": "Reporting", "text": "This section controls the reporting mechanism used by the pipeline, for common events and error messages. The default is to write reports to a file in the output directory. However, you can also choose to write reports to the console or to an Azure Blob Storage container.
Parameter Description Type Required or Optional DefaultGRAPHRAG_REPORTING_TYPE The type of reporter to use. Options are file, console, or blob str optional file GRAPHRAG_REPORTING_STORAGE_ACCOUNT_BLOB_URL The Azure Storage blob endpoint to use when in blob mode and using managed identity. Will have the format https://<storage_account_name>.blob.core.windows.net str optional None GRAPHRAG_REPORTING_CONNECTION_STRING The Azure Storage connection string to use when in blob mode. str optional None GRAPHRAG_REPORTING_CONTAINER_NAME The Azure Storage container name to use when in blob mode. str optional None GRAPHRAG_REPORTING_BASE_DIR The base path to the reporting outputs. str optional None"}, {"location": "config/env_vars/#node2vec-parameters", "title": "Node2Vec Parameters", "text": "Parameter Description Type Required or Optional Default GRAPHRAG_NODE2VEC_ENABLED Whether to enable Node2Vec bool optional False GRAPHRAG_NODE2VEC_NUM_WALKS The Node2Vec number of walks to perform int optional 10 GRAPHRAG_NODE2VEC_WALK_LENGTH The Node2Vec walk length int optional 40 GRAPHRAG_NODE2VEC_WINDOW_SIZE The Node2Vec window size int optional 2 GRAPHRAG_NODE2VEC_ITERATIONS The number of iterations to run node2vec int optional 3 GRAPHRAG_NODE2VEC_RANDOM_SEED The random seed to use for node2vec int optional 597832"}, {"location": "config/env_vars/#data-snapshotting", "title": "Data Snapshotting", "text": "Parameter Description Type Required or Optional Default GRAPHRAG_SNAPSHOT_GRAPHML Whether to enable GraphML snapshots. bool optional False GRAPHRAG_SNAPSHOT_RAW_ENTITIES Whether to enable raw entity snapshots. bool optional False GRAPHRAG_SNAPSHOT_TOP_LEVEL_NODES Whether to enable top-level node snapshots. bool optional False"}, {"location": "config/env_vars/#miscellaneous-settings", "title": "Miscellaneous Settings", "text": "Parameter Description Type Required or Optional Default GRAPHRAG_ASYNC_MODE Which async mode to use. Either asyncio or threaded. str optional asyncio GRAPHRAG_ENCODING_MODEL The text encoding model, used in tiktoken, to encode text. str optional cl100k_base GRAPHRAG_MAX_CLUSTER_SIZE The maximum number of entities to include in a single Leiden cluster. int optional 10 GRAPHRAG_SKIP_WORKFLOWS A comma-separated list of workflow names to skip. str optional None GRAPHRAG_UMAP_ENABLED Whether to enable UMAP layouts bool optional False"}, {"location": "config/init/", "title": "Configuring GraphRAG Indexing", "text": "To start using GraphRAG, you need to configure the system. The init command is the easiest way to get started. It will create a .env and settings.yaml files in the specified directory with the necessary configuration settings. It will also output the default LLM prompts used by GraphRAG.
python -m graphrag.index [--init] [--root PATH]\n"}, {"location": "config/init/#options", "title": "Options", "text": "--init - Initialize the directory with the necessary configuration files.--root PATH - The root directory to initialize. Default is the current directory.python -m graphrag.index --init --root ./ragtest\n"}, {"location": "config/init/#output", "title": "Output", "text": "The init command will create the following files in the specified directory:
- settings.yaml - The configuration settings file. This file contains the configuration settings for GraphRAG.
- .env - The environment variables file. These are referenced in the settings.yaml file.
- prompts/ - The LLM prompts folder. This contains the default prompts used by GraphRAG. You can modify them or run the Auto Prompt Tuning command to generate new prompts adapted to your data.

After initializing your workspace, you can either run the Prompt Tuning command to adapt the prompts to your data or start running the Indexing Pipeline to index your data. For more information on configuring GraphRAG, see the Configuration documentation.
"}, {"location": "config/json_yaml/", "title": "Default Configuration Mode (using JSON/YAML)", "text": "The default configuration mode may be configured by using a settings.json or settings.yml file in the data project root. If a .env file is present along with this config file, then it will be loaded, and the environment variables defined therein will be available for token replacements in your configuration document using ${ENV_VAR} syntax.
For example:
# .env\nAPI_KEY=some_api_key\n\n# settings.json\n{\n \"llm\": {\n \"api_key\": \"${API_KEY}\"\n }\n}\n"}, {"location": "config/json_yaml/#config-sections", "title": "Config Sections", "text": ""}, {"location": "config/json_yaml/#input", "title": "input", "text": ""}, {"location": "config/json_yaml/#fields", "title": "Fields", "text": "type file|blob - The input type to use. Default=filefile_type text|csv - The type of input data to load. Either text or csv. Default is textfile_encoding str - The encoding of the input file. Default is utf-8file_pattern str - A regex to match input files. Default is .*\\.csv$ if in csv mode and .*\\.txt$ if in text mode.source_column str - (CSV Mode Only) The source column name.timestamp_column str - (CSV Mode Only) The timestamp column name.timestamp_format str - (CSV Mode Only) The source format.text_column str - (CSV Mode Only) The text column name.title_column str - (CSV Mode Only) The title column name.document_attribute_columns list[str] - (CSV Mode Only) The additional document attributes to include.connection_string str - (blob only) The Azure Storage connection string.container_name str - (blob only) The Azure Storage container name.base_dir str - The base directory to read input from, relative to the root.storage_account_blob_url str - The storage account blob URL to use.This is the base LLM configuration section. Other steps may override this configuration with their own LLM configuration.
"}, {"location": "config/json_yaml/#fields_1", "title": "Fields", "text": "api_key str - The OpenAI API key to use.type openai_chat|azure_openai_chat|openai_embedding|azure_openai_embedding - The type of LLM to use.model str - The model name.max_tokens int - The maximum number of output tokens.request_timeout float - The per-request timeout.api_base str - The API base url to use.api_version str - The API versionorganization str - The client organization.proxy str - The proxy URL to use.cognitive_services_endpoint str - The url endpoint for cognitive services.deployment_name str - The deployment name to use (Azure).model_supports_json bool - Whether the model supports JSON-mode output.tokens_per_minute int - Set a leaky-bucket throttle on tokens-per-minute.requests_per_minute int - Set a leaky-bucket throttle on requests-per-minute.max_retries int - The maximum number of retries to use.max_retry_wait float - The maximum backoff time.sleep_on_rate_limit_recommendation bool - Whether to adhere to sleep recommendations (Azure).concurrent_requests int The number of open requests to allow at once.temperature float - The temperature to use.top_p float - The top-p value to use.n int - The number of completions to generate.stagger float - The threading stagger value.num_threads int - The maximum number of work threads.asyncio|threaded The async mode to use. Either asyncio or `threaded.
llm (see LLM top-level config)parallelization (see Parallelization top-level config)async_mode (see Async Mode top-level config)batch_size int - The maximum batch size to use.batch_max_tokens int - The maximum batch #-tokens.target required|all - Determines which set of embeddings to emit.skip list[str] - Which embeddings to skip.strategy dict - Fully override the text-embedding strategy.size int - The max chunk size in tokens.overlap int - The chunk overlap in tokens.group_by_columns list[str] - group documents by fields before chunking.encoding_model str - The text encoding model to use. Default is to use the top-level encoding model.strategy dict - Fully override the chunking strategy.type file|memory|none|blob - The cache type to use. Default=fileconnection_string str - (blob only) The Azure Storage connection string.container_name str - (blob only) The Azure Storage container name.base_dir str - The base directory to write cache to, relative to the root.storage_account_blob_url str - The storage account blob URL to use.type file|memory|blob - The storage type to use. Default=fileconnection_string str - (blob only) The Azure Storage connection string.container_name str - (blob only) The Azure Storage container name.base_dir str - The base directory to write reports to, relative to the root.storage_account_blob_url str - The storage account blob URL to use.type file|console|blob - The reporting type to use. Default=fileconnection_string str - (blob only) The Azure Storage connection string.container_name str - (blob only) The Azure Storage container name.base_dir str - The base directory to write reports to, relative to the root.storage_account_blob_url str - The storage account blob URL to use.llm (see LLM top-level config)parallelization (see Parallelization top-level config)async_mode (see Async Mode top-level config)prompt str - The prompt file to use.entity_types list[str] - The entity types to identify.max_gleanings int - The maximum number of gleaning cycles to use.encoding_model str - The text encoding model to use. By default, this will use the top-level encoding model.strategy dict - Fully override the entity extraction strategy.llm (see LLM top-level config)parallelization (see Parallelization top-level config)async_mode (see Async Mode top-level config)prompt str - The prompt file to use.max_length int - The maximum number of output tokens per summarization.strategy dict - Fully override the summarize description strategy.enabled bool - Whether to enable claim extraction. default=Falsellm (see LLM top-level config)parallelization (see Parallelization top-level config)async_mode (see Async Mode top-level config)prompt str - The prompt file to use.description str - Describes the types of claims we want to extract.max_gleanings int - The maximum number of gleaning cycles to use.encoding_model str - The text encoding model to use. 
By default, this will use the top-level encoding model.strategy dict - Fully override the claim extraction strategy.llm (see LLM top-level config)parallelization (see Parallelization top-level config)async_mode (see Async Mode top-level config)prompt str - The prompt file to use.max_length int - The maximum number of output tokens per report.max_input_length int - The maximum number of input tokens to use when generating reports.strategy dict - Fully override the community reports strategy.max_cluster_size int - The maximum cluster size to emit.strategy dict - Fully override the cluster_graph strategy.enabled bool - Whether to enable graph embeddings.num_walks int - The node2vec number of walks.walk_length int - The node2vec walk length.window_size int - The node2vec window size.iterations int - The node2vec number of iterations.random_seed int - The node2vec random seed.strategy dict - Fully override the embed graph strategy.enabled bool - Whether to enable UMAP layouts.graphml bool - Emit graphml snapshots.raw_entities bool - Emit raw entity snapshots.top_level_nodes bool - Emit top-level-node snapshots.str - The text encoding model to use. Default=cl100k_base.
list[str] - Which workflow names to skip.
"}, {"location": "config/overview/", "title": "Configuring GraphRAG Indexing", "text": "The GraphRAG system is highly configurable. This page provides an overview of the configuration options available for the GraphRAG indexing engine.
"}, {"location": "config/overview/#default-configuration-mode", "title": "Default Configuration Mode", "text": "The default configuration mode is the simplest way to get started with the GraphRAG system. It is designed to work out-of-the-box with minimal configuration. The primary configuration sections for the Indexing Engine pipelines are described below. The main ways to set up GraphRAG in Default Configuration mode are via:
Custom configuration mode is an advanced use-case. Most users will want to use the Default Configuration instead. The primary configuration sections for Indexing Engine pipelines are described below. Details about how to use custom configuration are available in the Custom Configuration Mode documentation.
"}, {"location": "config/template/", "title": "Configuration Template", "text": "The following template can be used and stored as a .env in the the directory where you're are pointing the --root parameter on your Indexing Pipeline execution.
For details about how to run the Indexing Pipeline, refer to the Index CLI documentation.
"}, {"location": "config/template/#env-file-template", "title": ".env File Template", "text": "Required variables are uncommented. All the optional configuration can be turned on or off as needed.
"}, {"location": "config/template/#minimal-configuration", "title": "Minimal Configuration", "text": "# Base LLM Settings\nGRAPHRAG_API_KEY=\"your_api_key\"\nGRAPHRAG_API_BASE=\"http://<domain>.openai.azure.com\" # For Azure OpenAI Users\nGRAPHRAG_API_VERSION=\"api_version\" # For Azure OpenAI Users\n\n# Text Generation Settings\nGRAPHRAG_LLM_TYPE=\"azure_openai_chat\" # or openai_chat\nGRAPHRAG_LLM_DEPLOYMENT_NAME=\"gpt-4-turbo-preview\"\nGRAPHRAG_LLM_MODEL_SUPPORTS_JSON=True\n\n# Text Embedding Settings\nGRAPHRAG_EMBEDDING_TYPE=\"azure_openai_embedding\" # or openai_embedding\nGRAPHRAG_LLM_DEPLOYMENT_NAME=\"text-embedding-3-small\"\n\n# Data Mapping Settings\nGRAPHRAG_INPUT_TYPE=\"text\"\n"}, {"location": "config/template/#full-configuration", "title": "Full Configuration", "text": "# Required LLM Config\n\n# Input Data Configuration\nGRAPHRAG_INPUT_TYPE=\"file\"\n\n# Plaintext Input Data Configuration\n# GRAPHRAG_INPUT_FILE_PATTERN=.*\\.txt\n\n# Text Input Data Configuration\nGRAPHRAG_INPUT_FILE_TYPE=\"text\"\nGRAPHRAG_INPUT_FILE_PATTERN=\".*\\.txt$\"\nGRAPHRAG_INPUT_SOURCE_COLUMN=source\n# GRAPHRAG_INPUT_TIMESTAMP_COLUMN=None\n# GRAPHRAG_INPUT_TIMESTAMP_FORMAT=None\n# GRAPHRAG_INPUT_TEXT_COLUMN=\"text\"\n# GRAPHRAG_INPUT_ATTRIBUTE_COLUMNS=id\n# GRAPHRAG_INPUT_TITLE_COLUMN=\"title\"\n# GRAPHRAG_INPUT_TYPE=\"file\"\n# GRAPHRAG_INPUT_CONNECTION_STRING=None\n# GRAPHRAG_INPUT_CONTAINER_NAME=None\n# GRAPHRAG_INPUT_BASE_DIR=None\n\n# Base LLM Settings\nGRAPHRAG_API_KEY=\"your_api_key\"\nGRAPHRAG_API_BASE=\"http://<domain>.openai.azure.com\" # For Azure OpenAI Users\nGRAPHRAG_API_VERSION=\"api_version\" # For Azure OpenAI Users\n# GRAPHRAG_API_ORGANIZATION=None\n# GRAPHRAG_API_PROXY=None\n\n# Text Generation Settings\n# GRAPHRAG_LLM_TYPE=openai_chat\nGRAPHRAG_LLM_API_KEY=\"your_api_key\" # If GRAPHRAG_API_KEY is not set\nGRAPHRAG_LLM_API_BASE=\"http://<domain>.openai.azure.com\" # For Azure OpenAI Users and if GRAPHRAG_API_BASE is not set\nGRAPHRAG_LLM_API_VERSION=\"api_version\" # For Azure OpenAI Users and if GRAPHRAG_API_VERSION is not set\nGRAPHRAG_LLM_MODEL_SUPPORTS_JSON=True # Suggested by default\n# GRAPHRAG_LLM_API_ORGANIZATION=None\n# GRAPHRAG_LLM_API_PROXY=None\n# GRAPHRAG_LLM_DEPLOYMENT_NAME=None\n# GRAPHRAG_LLM_MODEL=gpt-4-turbo-preview\n# GRAPHRAG_LLM_MAX_TOKENS=4000\n# GRAPHRAG_LLM_REQUEST_TIMEOUT=180\n# GRAPHRAG_LLM_THREAD_COUNT=50\n# GRAPHRAG_LLM_THREAD_STAGGER=0.3\n# GRAPHRAG_LLM_CONCURRENT_REQUESTS=25\n# GRAPHRAG_LLM_TPM=0\n# GRAPHRAG_LLM_RPM=0\n# GRAPHRAG_LLM_MAX_RETRIES=10\n# GRAPHRAG_LLM_MAX_RETRY_WAIT=10\n# GRAPHRAG_LLM_SLEEP_ON_RATE_LIMIT_RECOMMENDATION=True\n\n# Text Embedding Settings\n# GRAPHRAG_EMBEDDING_TYPE=openai_embedding\nGRAPHRAG_EMBEDDING_API_KEY=\"your_api_key\" # If GRAPHRAG_API_KEY is not set\nGRAPHRAG_EMBEDDING_API_BASE=\"http://<domain>.openai.azure.com\" # For Azure OpenAI Users and if GRAPHRAG_API_BASE is not set\nGRAPHRAG_EMBEDDING_API_VERSION=\"api_version\" # For Azure OpenAI Users and if GRAPHRAG_API_VERSION is not set\n# GRAPHRAG_EMBEDDING_API_ORGANIZATION=None\n# GRAPHRAG_EMBEDDING_API_PROXY=None\n# GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME=None\n# GRAPHRAG_EMBEDDING_MODEL=text-embedding-3-small\n# GRAPHRAG_EMBEDDING_BATCH_SIZE=16\n# GRAPHRAG_EMBEDDING_BATCH_MAX_TOKENS=8191\n# GRAPHRAG_EMBEDDING_TARGET=required\n# GRAPHRAG_EMBEDDING_SKIP=None\n# GRAPHRAG_EMBEDDING_THREAD_COUNT=None\n# GRAPHRAG_EMBEDDING_THREAD_STAGGER=50\n# GRAPHRAG_EMBEDDING_CONCURRENT_REQUESTS=25\n# GRAPHRAG_EMBEDDING_TPM=0\n# GRAPHRAG_EMBEDDING_RPM=0\n# 
GRAPHRAG_EMBEDDING_MAX_RETRIES=10\n# GRAPHRAG_EMBEDDING_MAX_RETRY_WAIT=10\n# GRAPHRAG_EMBEDDING_SLEEP_ON_RATE_LIMIT_RECOMMENDATION=True\n\n# Data Mapping Settings\n# GRAPHRAG_INPUT_ENCODING=utf-8\n\n# Data Chunking\n# GRAPHRAG_CHUNK_SIZE=1200\n# GRAPHRAG_CHUNK_OVERLAP=100\n# GRAPHRAG_CHUNK_BY_COLUMNS=id\n\n# Prompting Overrides\n# GRAPHRAG_ENTITY_EXTRACTION_PROMPT_FILE=None\n# GRAPHRAG_ENTITY_EXTRACTION_MAX_GLEANINGS=1\n# GRAPHRAG_ENTITY_EXTRACTION_ENTITY_TYPES=organization,person,event,geo\n# GRAPHRAG_SUMMARIZE_DESCRIPTIONS_PROMPT_FILE=None\n# GRAPHRAG_SUMMARIZE_DESCRIPTIONS_MAX_LENGTH=500\n# GRAPHRAG_CLAIM_EXTRACTION_DESCRIPTION=\"Any claims or facts that could be relevant to threat analysis.\"\n# GRAPHRAG_CLAIM_EXTRACTION_PROMPT_FILE=None\n# GRAPHRAG_CLAIM_EXTRACTION_MAX_GLEANINGS=1\n# GRAPHRAG_COMMUNITY_REPORT_PROMPT_FILE=None\n# GRAPHRAG_COMMUNITY_REPORT_MAX_LENGTH=1500\n\n# Storage\n# GRAPHRAG_STORAGE_TYPE=file\n# GRAPHRAG_STORAGE_CONNECTION_STRING=None\n# GRAPHRAG_STORAGE_CONTAINER_NAME=None\n# GRAPHRAG_STORAGE_BASE_DIR=None\n\n# Cache\n# GRAPHRAG_CACHE_TYPE=file\n# GRAPHRAG_CACHE_CONNECTION_STRING=None\n# GRAPHRAG_CACHE_CONTAINER_NAME=None\n# GRAPHRAG_CACHE_BASE_DIR=None\n\n# Reporting\n# GRAPHRAG_REPORTING_TYPE=file\n# GRAPHRAG_REPORTING_CONNECTION_STRING=None\n# GRAPHRAG_REPORTING_CONTAINER_NAME=None\n# GRAPHRAG_REPORTING_BASE_DIR=None\n\n# Node2Vec Parameters\n# GRAPHRAG_NODE2VEC_ENABLED=False\n# GRAPHRAG_NODE2VEC_NUM_WALKS=10\n# GRAPHRAG_NODE2VEC_WALK_LENGTH=40\n# GRAPHRAG_NODE2VEC_WINDOW_SIZE=2\n# GRAPHRAG_NODE2VEC_ITERATIONS=3\n# GRAPHRAG_NODE2VEC_RANDOM_SEED=597832\n\n# Data Snapshotting\n# GRAPHRAG_SNAPSHOT_GRAPHML=False\n# GRAPHRAG_SNAPSHOT_RAW_ENTITIES=False\n# GRAPHRAG_SNAPSHOT_TOP_LEVEL_NODES=False\n\n# Miscellaneous Settings\n# GRAPHRAG_ASYNC_MODE=asyncio\n# GRAPHRAG_ENCODING_MODEL=cl100k_base\n# GRAPHRAG_MAX_CLUSTER_SIZE=10\n# GRAPHRAG_SKIP_WORKFLOWS=None\n# GRAPHRAG_UMAP_ENABLED=False\n"}, {"location": "data/operation_dulce/ABOUT/", "title": "About", "text": "This document (Operation Dulce) is an AI-generated science fiction novella, included here for the purposes of integration testing.
"}, {"location": "index/architecture/", "title": "Indexing Architecture", "text": ""}, {"location": "index/architecture/#key-concepts", "title": "Key Concepts", "text": ""}, {"location": "index/architecture/#knowledge-model", "title": "Knowledge Model", "text": "In order to support the GraphRAG system, the outputs of the indexing engine (in the Default Configuration Mode) are aligned to a knowledge model we call the GraphRAG Knowledge Model. This model is designed to be an abstraction over the underlying data storage technology, and to provide a common interface for the GraphRAG system to interact with. In normal use-cases the outputs of the GraphRAG Indexer would be loaded into a database system, and the GraphRAG's Query Engine would interact with the database using the knowledge model data-store types.
"}, {"location": "index/architecture/#datashaper-workflows", "title": "DataShaper Workflows", "text": "GraphRAG's Indexing Pipeline is built on top of our open-source library, DataShaper. DataShaper is a data processing library that allows users to declaratively express data pipelines, schemas, and related assets using well-defined schemas. DataShaper has implementations in JavaScript and Python, and is designed to be extensible to other languages.
One of the core resource types within DataShaper is a Workflow. Workflows are expressed as sequences of steps, which we call verbs. Each step has a verb name and a configuration object. In DataShaper, these verbs model relational concepts such as SELECT, DROP, JOIN, etc. Each verb transforms an input data table, and that table is passed down the pipeline.
---\ntitle: Sample Workflow\n---\nflowchart LR\n input[Input Table] --> select[SELECT] --> join[JOIN] --> binarize[BINARIZE] --> output[Output Table]"}, {"location": "index/architecture/#llm-based-workflow-steps", "title": "LLM-based Workflow Steps", "text": "GraphRAG's Indexing Pipeline implements a handful of custom verbs on top of the standard, relational verbs that our DataShaper library provides. These verbs give us the ability to augment text documents with rich, structured data using the power of LLMs such as GPT-4. We utilize these verbs in our standard workflow to extract entities, relationships, claims, community structures, and community reports and summaries. This behavior is customizable and can be extended to support many kinds of AI-based data enrichment and extraction tasks.
"}, {"location": "index/architecture/#workflow-graphs", "title": "Workflow Graphs", "text": "Because of the complexity of our data indexing tasks, we needed to be able to express our data pipeline as series of multiple, interdependent workflows. In the GraphRAG Indexing Pipeline, each workflow may define dependencies on other workflows, effectively forming a directed acyclic graph (DAG) of workflows, which is then used to schedule processing.
---\ntitle: Sample Workflow DAG\n---\nstateDiagram-v2\n [*] --> Prepare\n Prepare --> Chunk\n Chunk --> ExtractGraph\n Chunk --> EmbedDocuments\n ExtractGraph --> GenerateReports\n ExtractGraph --> EmbedEntities\n ExtractGraph --> EmbedGraph"}, {"location": "index/architecture/#dataframe-message-format", "title": "Dataframe Message Format", "text": "The primary unit of communication between workflows, and between workflow steps is an instance of pandas.DataFrame. Although side-effects are possible, our goal is to be data-centric and table-centric in our approach to data processing. This allows us to easily reason about our data, and to leverage the power of dataframe-based ecosystems. Our underlying dataframe technology may change over time, but our primary goal is to support the DataShaper workflow schema while retaining single-machine ease of use and developer ergonomics.
The GraphRAG library was designed with LLM interactions in mind, and a common setback when working with LLM APIs is the variety of errors that can arise from network latency, throttling, etc. Because of these potential error cases, we've added a cache layer around LLM interactions. When completion requests are made using the same input set (prompt and tuning parameters), we return a cached result if one exists. This allows our indexer to be more resilient to network issues, to act idempotently, and to provide a more efficient end-user experience.
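To make the idea concrete, here is a minimal sketch of a completion cache keyed on the full input set (prompt plus tuning parameters); the class and in-memory storage choice are illustrative assumptions, not GraphRAG's actual cache implementation.

import hashlib
import json


class CompletionCache:
    """Toy cache keyed on the full input set (prompt + tuning parameters)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def _key(self, prompt: str, params: dict) -> str:
        # An identical prompt + parameter set always hashes to the same key,
        # so a repeated request can be answered without calling the LLM again.
        payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, prompt: str, params: dict) -> str | None:
        return self._store.get(self._key(prompt, params))

    def set(self, prompt: str, params: dict, response: str) -> None:
        self._store[self._key(prompt, params)] = response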
"}, {"location": "index/cli/", "title": "Indexer CLI", "text": "The GraphRAG indexer CLI allows for no-code usage of the GraphRAG Indexer.
python -m graphrag.index --verbose --root </workspace/project/root> \\\n--config <custom_config.yml> --resume <timestamp> \\\n--reporter <rich|print|none> --emit json,csv,parquet \\\n--nocache\n"}, {"location": "index/cli/#cli-arguments", "title": "CLI Arguments", "text": "--verbose - Adds extra logging information during the run.--root <data-project-dir> - the data root directory. This should contain an input directory with the input data, and an .env file with environment variables. These are described below.--init - This will initialize the data project directory at the specified root with bootstrap configuration and prompt-overrides.--resume <output-timestamp> - if specified, the pipeline will attempt to resume a prior run. The parquet files from the prior run will be loaded into the system as inputs, and the workflows that generated those files will be skipped. The input value should be the timestamped output folder, e.g. \"20240105-143721\".--config <config_file.yml> - This will opt-out of the Default Configuration mode and execute a custom configuration. If this is used, then none of the environment-variables below will apply.--reporter <reporter> - This will specify the progress reporter to use. The default is rich. Valid values are rich, print, and none.--emit <types> - This specifies the table output formats the pipeline should emit. The default is parquet. Valid values are parquet, csv, and json, comma-separated.--nocache - This will disable the caching mechanism. This is useful for debugging and development, but should not be used in production.--output <directory> - Specify the output directory for pipeline artifacts.--reports <directory> - Specify the output directory for reporting.The knowledge model is a specification for data outputs that conform to our data-model definition. You can find these definitions in the python/graphrag/graphrag/model folder within the GraphRAG repository. The following entity types are provided. The fields here represent the fields that are text-embedded by default.
Document - An input document into the system. These either represent individual rows in a CSV or individual .txt file.TextUnit - A chunk of text to analyze. The size of these chunks, their overlap, and whether they adhere to any data boundaries may be configured below. A common use case is to set CHUNK_BY_COLUMNS to id so that there is a 1-to-many relationship between documents and TextUnits instead of a many-to-many.Entity - An entity extracted from a TextUnit. These represent people, places, events, or some other entity-model that you provide.Relationship - A relationship between two entities. These are generated from the covariates.Covariate - Extracted claim information, which contains statements about entities which may be time-bound.Community Report - Once entities are generated, we perform hierarchical community detection on them and generate reports for each community in this hierarchy.Node - This table contains layout information for rendered graph-views of the Entities and Documents which have been embedded and clustered.Let's take a look at how the default-configuration workflow transforms text documents into the GraphRAG Knowledge Model. This page gives a general overview of the major steps in this process. To fully configure this workflow, check out the configuration documentation.
---\ntitle: Dataflow Overview\n---\nflowchart TB\n subgraph phase1[Phase 1: Compose TextUnits]\n documents[Documents] --> chunk[Chunk]\n chunk --> embed[Embed] --> textUnits[Text Units]\n end\n subgraph phase2[Phase 2: Graph Extraction]\n textUnits --> graph_extract[Entity & Relationship Extraction]\n graph_extract --> graph_summarize[Entity & Relationship Summarization]\n graph_summarize --> claim_extraction[Claim Extraction]\n claim_extraction --> graph_outputs[Graph Tables]\n end\n subgraph phase3[Phase 3: Graph Augmentation]\n graph_outputs --> community_detect[Community Detection]\n community_detect --> graph_embed[Graph Embedding]\n graph_embed --> augmented_graph[Augmented Graph Tables]\n end\n subgraph phase4[Phase 4: Community Summarization]\n augmented_graph --> summarized_communities[Community Summarization]\n summarized_communities --> embed_communities[Community Embedding]\n embed_communities --> community_outputs[Community Tables]\n end\n subgraph phase5[Phase 5: Document Processing]\n documents --> link_to_text_units[Link to TextUnits]\n textUnits --> link_to_text_units\n link_to_text_units --> embed_documents[Document Embedding]\n embed_documents --> document_graph[Document Graph Creation]\n document_graph --> document_outputs[Document Tables]\n end\n subgraph phase6[Phase 6: Network Visualization]\n document_outputs --> umap_docs[Umap Documents]\n augmented_graph --> umap_entities[Umap Entities]\n umap_docs --> combine_nodes[Nodes Table]\n umap_entities --> combine_nodes\n end"}, {"location": "index/default_dataflow/#phase-1-compose-textunits", "title": "Phase 1: Compose TextUnits", "text": "The first phase of the default-configuration workflow is to transform input documents into TextUnits. A TextUnit is a chunk of text that is used for our graph extraction techniques. They are also used as source-references by extracted knowledge items in order to provide breadcrumbs and provenance, linking concepts back to their original source text.
The chunk size (counted in tokens) is user-configurable. By default, this is set to 300 tokens, although we've had positive experience with 1200-token chunks using a single \"glean\" step. (A \"glean\" step is a follow-on extraction.) Larger chunks result in lower-fidelity output and less meaningful reference texts; however, using larger chunks can result in much faster processing time.
The group-by configuration is also user-configurable. By default, we align our chunks to document boundaries, meaning that there is a strict 1-to-many relationship between Documents and TextUnits. In rare cases, this can be turned into a many-to-many relationship. This is useful when the documents are very short and we need several of them to compose a meaningful analysis unit (e.g. Tweets or a chat log).
Each of these text-units is text-embedded and passed into the next phase of the pipeline.
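As a rough illustration of this chunking step, the sketch below splits text into overlapping token windows using tiktoken; the tokenizer choice and the 300/100 size and overlap values are placeholders drawn from the defaults discussed above, not the pipeline's actual implementation.

import tiktoken


def chunk_text(text: str, chunk_size: int = 300, overlap: int = 100) -> list[str]:
    """Split text into token-based chunks with a fixed token overlap."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumed default encoding model
    tokens = enc.encode(text)
    chunks: list[str] = []
    step = chunk_size - overlap  # assumes chunk_size > overlap
    for start in range(0, len(tokens), step):
        window = tokens[start : start + chunk_size]
        if not window:
            break
        chunks.append(enc.decode(window))
    return chunks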
---\ntitle: Documents into Text Chunks\n---\nflowchart LR\n doc1[Document 1] --> tu1[TextUnit 1]\n doc1 --> tu2[TextUnit 2]\n doc2[Document 2] --> tu3[TextUnit 3]\n doc2 --> tu4[TextUnit 4]\n"}, {"location": "index/default_dataflow/#phase-2-graph-extraction", "title": "Phase 2: Graph Extraction", "text": "In this phase, we analyze each text unit and extract our graph primitives: Entities, Relationships, and Claims. Entities and Relationships are extracted at once in our entity_extract verb, and claims are extracted in our claim_extract verb. Results are then combined and passed into the following phases of the pipeline.
---\ntitle: Graph Extraction\n---\nflowchart LR\n tu[TextUnit] --> ge[Graph Extraction] --> gs[Graph Summarization]\n tu --> ce[Claim Extraction]"}, {"location": "index/default_dataflow/#entity-relationship-extraction", "title": "Entity & Relationship Extraction", "text": "In this first step of graph extraction, we process each text-unit in order to extract entities and relationships out of the raw text using the LLM. The output of this step is a subgraph-per-TextUnit containing a list of entities with a name, type, and description, and a list of relationships with a source, target, and description.
These subgraphs are merged together - any entities with the same name and type are merged by creating an array of their descriptions. Similarly, any relationships with the same source and target are merged by creating an array of their descriptions.
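A simplified sketch of this merge logic might look like the following, where per-TextUnit records are merged on (name, type) and (source, target) and their descriptions are collected into lists; the dictionary field names are illustrative rather than the pipeline's actual schema.

def merge_entities(subgraph_entities: list[dict]) -> list[dict]:
    """Merge entities that share the same (name, type), collecting their descriptions."""
    merged: dict[tuple[str, str], dict] = {}
    for ent in subgraph_entities:
        key = (ent["name"], ent["type"])
        if key not in merged:
            merged[key] = {"name": ent["name"], "type": ent["type"], "descriptions": []}
        merged[key]["descriptions"].append(ent["description"])
    return list(merged.values())


def merge_relationships(subgraph_rels: list[dict]) -> list[dict]:
    """Merge relationships that share the same (source, target), collecting their descriptions."""
    merged: dict[tuple[str, str], dict] = {}
    for rel in subgraph_rels:
        key = (rel["source"], rel["target"])
        if key not in merged:
            merged[key] = {"source": rel["source"], "target": rel["target"], "descriptions": []}
        merged[key]["descriptions"].append(rel["description"])
    return list(merged.values())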
"}, {"location": "index/default_dataflow/#entity-relationship-summarization", "title": "Entity & Relationship Summarization", "text": "Now that we have a graph of entities and relationships, each with a list of descriptions, we can summarize these lists into a single description per entity and relationship. This is done by asking the LLM for a short summary that captures all of the distinct information from each description. This allows all of our entities and relationships to have a single concise description.
"}, {"location": "index/default_dataflow/#claim-extraction-emission", "title": "Claim Extraction & Emission", "text": "Finally, as an independent workflow, we extract claims from the source TextUnits. These claims represent positive factual statements with an evaluated status and time-bounds. These are emitted as a primary artifact called Covariates.
Note: claim extraction is optional and turned off by default. This is because claim extraction generally needs prompt tuning to be useful.
"}, {"location": "index/default_dataflow/#phase-3-graph-augmentation", "title": "Phase 3: Graph Augmentation", "text": "Now that we have a usable graph of entities and relationships, we want to understand their community structure and augment the graph with additional information. This is done in two steps: Community Detection and Graph Embedding. These give us explicit (communities) and implicit (embeddings) ways of understanding the topological structure of our graph.
---\ntitle: Graph Augmentation\n---\nflowchart LR\n cd[Leiden Hierarchical Community Detection] --> ge[Node2Vec Graph Embedding] --> ag[Graph Table Emission]"}, {"location": "index/default_dataflow/#community-detection", "title": "Community Detection", "text": "In this step, we generate a hierarchy of entity communities using the Hierarchical Leiden Algorithm. This method will apply a recursive community-clustering to our graph until we reach a community-size threshold. This will allow us to understand the community structure of our graph and provide a way to navigate and summarize the graph at different levels of granularity.
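The recursive clustering described here can be sketched as follows; note that this example uses networkx's Louvain implementation purely as a readily available stand-in for Leiden, so treat it as an illustration of the recursion rather than GraphRAG's actual algorithm or parameters.

import networkx as nx
from networkx.algorithms.community import louvain_communities


def hierarchical_communities(graph: nx.Graph, max_cluster_size: int, level: int = 0) -> list[dict]:
    """Recursively re-cluster communities until every community is at or below max_cluster_size.

    Louvain is used here only as a stand-in for the Hierarchical Leiden Algorithm.
    """
    results: list[dict] = []
    for community in louvain_communities(graph, seed=42):
        nodes = list(community)
        results.append({"level": level, "nodes": nodes})
        # Oversized communities are re-clustered on their induced subgraph,
        # producing the next level of the hierarchy.
        if len(nodes) > max_cluster_size and len(nodes) < graph.number_of_nodes():
            sub = graph.subgraph(nodes).copy()
            results.extend(hierarchical_communities(sub, max_cluster_size, level + 1))
    return results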
"}, {"location": "index/default_dataflow/#graph-embedding", "title": "Graph Embedding", "text": "In this step, we generate a vector representation of our graph using the Node2Vec algorithm. This will allow us to understand the implicit structure of our graph and provide an additional vector-space in which to search for related concepts during our query phase.
"}, {"location": "index/default_dataflow/#graph-tables-emission", "title": "Graph Tables Emission", "text": "Once our graph augmentation steps are complete, the final Entities and Relationships tables are emitted after their text fields are text-embedded.
"}, {"location": "index/default_dataflow/#phase-4-community-summarization", "title": "Phase 4: Community Summarization", "text": "---\ntitle: Community Summarization\n---\nflowchart LR\n sc[Generate Community Reports] --> ss[Summarize Community Reports] --> ce[Community Embedding] --> co[Community Tables Emission] At this point, we have a functional graph of entities and relationships, a hierarchy of communities for the entities, as well as node2vec embeddings.
Now we want to build on the communities data and generate reports for each community. This gives us a high-level understanding of the graph at several points of graph granularity. For example, if community A is the top-level community, we'll get a report about the entire graph. If the community is lower-level, we'll get a report about a local cluster.
"}, {"location": "index/default_dataflow/#generate-community-reports", "title": "Generate Community Reports", "text": "In this step, we generate a summary of each community using the LLM. This will allow us to understand the distinct information contained within each community and provide a scoped understanding of the graph, from either a high-level or a low-level perspective. These reports contain an executive overview and reference the key entities, relationships, and claims within the community sub-structure.
"}, {"location": "index/default_dataflow/#summarize-community-reports", "title": "Summarize Community Reports", "text": "In this step, each community report is then summarized via the LLM for shorthand use.
"}, {"location": "index/default_dataflow/#community-embedding", "title": "Community Embedding", "text": "In this step, we generate a vector representation of our communities by generating text embeddings of the community report, the community report summary, and the title of the community report.
"}, {"location": "index/default_dataflow/#community-tables-emission", "title": "Community Tables Emission", "text": "At this point, some bookkeeping work is performed and we emit the Communities and CommunityReports tables.
"}, {"location": "index/default_dataflow/#phase-5-document-processing", "title": "Phase 5: Document Processing", "text": "In this phase of the workflow, we create the Documents table for the knowledge model.
---\ntitle: Document Processing\n---\nflowchart LR\n aug[Augment] --> dp[Link to TextUnits] --> de[Avg. Embedding] --> dg[Document Table Emission]"}, {"location": "index/default_dataflow/#augment-with-columns-csv-only", "title": "Augment with Columns (CSV Only)", "text": "If the workflow is operating on CSV data, you may configure your workflow to add additional fields to Documents output. These fields should exist on the incoming CSV tables. Details about configuring this can be found in the configuration documentation.
"}, {"location": "index/default_dataflow/#link-to-textunits", "title": "Link to TextUnits", "text": "In this step, we link each document to the text-units that were created in the first phase. This allows us to understand which documents are related to which text-units and vice-versa.
"}, {"location": "index/default_dataflow/#document-embedding", "title": "Document Embedding", "text": "In this step, we generate a vector representation of our documents using an average embedding of document slices. We re-chunk documents without overlapping chunks, and then generate an embedding for each chunk. We create an average of these chunks weighted by token-count and use this as the document embedding. This will allow us to understand the implicit relationship between documents, and will help us generate a network representation of our documents.
"}, {"location": "index/default_dataflow/#documents-table-emission", "title": "Documents Table Emission", "text": "At this point, we can emit the Documents table into the knowledge Model.
"}, {"location": "index/default_dataflow/#phase-6-network-visualization", "title": "Phase 6: Network Visualization", "text": "In this phase of the workflow, we perform some steps to support network visualization of our high-dimensional vector spaces within our existing graphs. At this point there are two logical graphs at play: the Entity-Relationship graph and the Document graph.
---\ntitle: Network Visualization Workflows\n---\nflowchart LR\n nv[Umap Documents] --> ne[Umap Entities] --> ng[Nodes Table Emission] For each of the logical graphs, we perform a UMAP dimensionality reduction to generate a 2D representation of the graph. This will allow us to visualize the graph in a 2D space and understand the relationships between the nodes in the graph. The UMAP embeddings are then emitted as a table of Nodes. The rows of this table include a discriminator indicating whether the node is a document or an entity, and the UMAP coordinates.
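As an illustration of this reduction step, the sketch below projects a matrix of high-dimensional embeddings down to two dimensions with the umap-learn package; the library choice, parameters, and random placeholder data are assumptions for the example.

import numpy as np
import umap

# Placeholder for entity or document embeddings (n_items x embedding_dim).
embeddings = np.random.default_rng(42).normal(size=(500, 64))

reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(embeddings)  # shape (500, 2): x/y coordinates for the Nodes table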
"}, {"location": "index/overview/", "title": "GraphRAG Indexing \ud83e\udd16", "text": "The GraphRAG indexing package is a data pipeline and transformation suite that is designed to extract meaningful, structured data from unstructured text using LLMs.
Indexing Pipelines are configurable. They are composed of workflows, standard and custom steps, prompt templates, and input/output adapters. Our standard pipeline is designed to:
The outputs of the pipeline can be stored in a variety of formats, including JSON and Parquet - or they can be handled manually via the Python API.
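If you handle the outputs manually, the emitted parquet tables can be read back with pandas, as in the sketch below; the timestamped folder mirrors the example used by the CLI's --resume flag, and the artifact file names are assumptions that may differ from what your run actually produces.

from pathlib import Path

import pandas as pd

# Hypothetical timestamped run folder; inspect your own output directory for the real layout.
output_dir = Path("./output/20240105-143721/artifacts")

# File names are illustrative; list the artifacts directory to see the exact set emitted.
entities = pd.read_parquet(output_dir / "create_final_entities.parquet")
relationships = pd.read_parquet(output_dir / "create_final_relationships.parquet")
reports = pd.read_parquet(output_dir / "create_final_community_reports.parquet")

print(entities.head())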
"}, {"location": "index/overview/#getting-started", "title": "Getting Started", "text": ""}, {"location": "index/overview/#requirements", "title": "Requirements", "text": "See the requirements section in Get Started for details on setting up a development environment.
The Indexing Engine can be used in either a default configuration mode or with a custom pipeline. To configure GraphRAG, see the configuration documentation. After you have a config file you can run the pipeline using the CLI or the Python API.
"}, {"location": "index/overview/#usage", "title": "Usage", "text": ""}, {"location": "index/overview/#cli", "title": "CLI", "text": "# Via Poetry\npoetry run poe cli --root <data_root> # default config mode\npoetry run poe cli --config your_pipeline.yml # custom config mode\n\n# Via Node\nyarn run:index --root <data_root> # default config mode\nyarn run:index --config your_pipeline.yml # custom config mode\n"}, {"location": "index/overview/#python-api", "title": "Python API", "text": "from graphrag.index import run_pipeline\nfrom graphrag.index.config import PipelineWorkflowReference\n\nworkflows: list[PipelineWorkflowReference] = [\n PipelineWorkflowReference(\n steps=[\n {\n # built-in verb\n \"verb\": \"derive\", # https://github.com/microsoft/datashaper/blob/main/python/datashaper/datashaper/verbs/derive.py\n \"args\": {\n \"column1\": \"col1\", # from above\n \"column2\": \"col2\", # from above\n \"to\": \"col_multiplied\", # new column name\n \"operator\": \"*\", # multiply the two columns\n },\n # Since we're trying to act on the default input, we don't need explicitly to specify an input\n }\n ]\n ),\n]\n\ndataset = pd.DataFrame([{\"col1\": 2, \"col2\": 4}, {\"col1\": 5, \"col2\": 10}])\noutputs = []\nasync for output in await run_pipeline(dataset=dataset, workflows=workflows):\n outputs.append(output)\npipeline_result = outputs[-1]\nprint(pipeline_result)\n"}, {"location": "index/overview/#further-reading", "title": "Further Reading", "text": "GraphRAG provides the ability to create domain adapted prompts for the generation of the knowledge graph. This step is optional, though it is highly encouraged to run it as it will yield better results when executing an Index Run.
These are generated by loading the inputs, splitting them into chunks (text units), and then running a series of LLM invocations and template substitutions to generate the final prompts. We suggest using the default values provided by the script, but on this page you'll find the details of each in case you want to further explore and tweak the prompt tuning algorithm.
Figure 1: Auto Tuning Conceptual Diagram.
"}, {"location": "prompt_tuning/auto_prompt_tuning/#prerequisites", "title": "Prerequisites", "text": "Before running auto tuning make sure you have already initialized your workspace with the graphrag.index --init command. This will create the necessary configuration files and the default prompts. Refer to the Init Documentation for more information about the initialization process.
You can run the main script from the command line with various options:
python -m graphrag.prompt_tune [--root ROOT] [--domain DOMAIN] [--method METHOD] [--limit LIMIT] [--language LANGUAGE] \\\n[--max-tokens MAX_TOKENS] [--chunk-size CHUNK_SIZE] [--n-subset-max N_SUBSET_MAX] [--k K] \\\n[--min-examples-required MIN_EXAMPLES_REQUIRED] [--no-entity-types] [--output OUTPUT]\n"}, {"location": "prompt_tuning/auto_prompt_tuning/#command-line-options", "title": "Command-Line Options", "text": "--config (required): The path to the configuration file. This is required to load the data and model settings.
--root (optional): The data project root directory, including the config files (YML, JSON, or .env). Defaults to the current directory.
--domain (optional): The domain related to your input data, such as 'space science', 'microbiology', or 'environmental news'. If left empty, the domain will be inferred from the input data.
--method (optional): The method to select documents. Options are all, random, auto or top. Default is random.
--limit (optional): The limit of text units to load when using random or top selection. Default is 15.
--language (optional): The language to use for input processing. If it is different from the inputs' language, the LLM will translate. Default is \"\", meaning it will be automatically detected from the inputs.
--max-tokens (optional): Maximum token count for prompt generation. Default is 2000.
--chunk-size (optional): The size in tokens to use for generating text units from input documents. Default is 200.
--n-subset-max (optional): The number of text chunks to embed when using auto selection method. Default is 300.
--k (optional): The number of documents to select when using auto selection method. Default is 15.
--min-examples-required (optional): The minimum number of examples required for entity extraction prompts. Default is 2.
--no-entity-types (optional): Use untyped entity extraction generation. We recommend using this when your data covers a lot of topics or it is highly randomized.
--output (optional): The folder to save the generated prompts. Default is \"prompts\".
python -m graphrag.prompt_tune --root /path/to/project --config /path/to/settings.yaml --domain \"environmental news\" \\\n--method random --limit 10 --language English --max-tokens 2048 --chunk-size 256 --min-examples-required 3 \\\n--no-entity-types --output /path/to/output\n or, with minimal configuration (suggested):
python -m graphrag.prompt_tune --root /path/to/project --config /path/to/settings.yaml --no-entity-types\n"}, {"location": "prompt_tuning/auto_prompt_tuning/#document-selection-methods", "title": "Document Selection Methods", "text": "The auto tuning feature ingests the input data and then divides it into text units the size of the chunk size parameter. After that, it uses one of the following selection methods to pick a sample to work with for prompt generation:
random: Select text units randomly. This is the default and recommended option.top: Select the head n text units.all: Use all text units for the generation. Use only with small datasets; this option is not usually recommended.auto: Embed text units in a lower-dimensional space and select the k nearest neighbors to the centroid. This is useful when you have a large dataset and want to select a representative sample.After running auto tuning, you should modify the following environment variables (or config variables) to pick up the new prompts on your index run. Note: Please make sure to update the correct path to the generated prompts. In this example, we are using the default \"prompts\" path.
GRAPHRAG_ENTITY_EXTRACTION_PROMPT_FILE = \"prompts/entity_extraction.txt\"
GRAPHRAG_COMMUNITY_REPORT_PROMPT_FILE = \"prompts/community_report.txt\"
GRAPHRAG_SUMMARIZE_DESCRIPTIONS_PROMPT_FILE = \"prompts/summarize_descriptions.txt\"
or in your yaml config file:
entity_extraction:\n prompt: \"prompts/entity_extraction.txt\"\n\nsummarize_descriptions:\n prompt: \"prompts/summarize_descriptions.txt\"\n\ncommunity_reports:\n prompt: \"prompts/community_report.txt\"\n"}, {"location": "prompt_tuning/manual_prompt_tuning/", "title": "Manual Prompt Tuning \u2699\ufe0f", "text": "The GraphRAG indexer, by default, will run with a handful of prompts that are designed to work well in the broad context of knowledge discovery. However, it is quite common to want to tune the prompts to better suit your specific use case. We provide a means for you to do this by allowing you to specify custom prompt files, each of which uses a series of token replacements internally.
Each of these prompts may be overridden by writing a custom prompt file in plaintext. We use token-replacements in the form of {token_name}, and the descriptions for the available tokens can be found below.
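Mechanically, the replacement amounts to simple string substitution over the prompt file, as in the sketch below; the token names and substitution values shown are illustrative examples rather than the full documented set, so consult the token tables for each prompt.

from pathlib import Path

# Token names here are illustrative; see the token tables below for the real set per prompt.
replacements = {
    "{input_text}": "<text unit to analyze goes here>",
    "{entity_types}": "organization,person,event,geo",
}

template = Path("prompts/entity_extraction.txt").read_text(encoding="utf-8")
prompt = template
for token, value in replacements.items():
    prompt = prompt.replace(token, value)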
Prompt Source
"}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens-values-provided-by-extractor", "title": "Tokens (values provided by extractor)", "text": "Prompt Source
"}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens-values-provided-by-extractor_1", "title": "Tokens (values provided by extractor)", "text": "Prompt Source
"}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens-values-provided-by-extractor_2", "title": "Tokens (values provided by extractor)", "text": "Note: there is additional parameter for the Claim Description that is used in claim extraction. The default value is
\"Any claims or facts that could be relevant to information discovery.\"
See the configuration documentation for details on how to change this.
"}, {"location": "prompt_tuning/manual_prompt_tuning/#generate-community-reports", "title": "Generate Community Reports", "text": "Prompt Source
"}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens-values-provided-by-extractor_3", "title": "Tokens (values provided by extractor)", "text": "This page provides an overview of the prompt tuning options available for the GraphRAG indexing engine.
"}, {"location": "prompt_tuning/overview/#default-prompts", "title": "Default Prompts", "text": "The default prompts are the simplest way to get started with the GraphRAG system. It is designed to work out-of-the-box with minimal configuration. You can find more detail about these prompts in the following links:
Auto Tuning leverages your input data and LLM interactions to create domain adapted prompts for the generation of the knowledge graph. It is highly encouraged to run it as it will yield better results when executing an Index Run. For more details about how to use it, please refer to the Auto Tuning documentation.
"}, {"location": "prompt_tuning/overview/#manual-tuning", "title": "Manual Tuning", "text": "Manual tuning is an advanced use-case. Most users will want to use the Auto Tuning feature instead. Details about how to use manual configuration are available in the Manual Tuning documentation.
"}, {"location": "query/cli/", "title": "Query CLI", "text": "The GraphRAG query CLI allows for no-code usage of the GraphRAG Query engine.
python -m graphrag.query --config <config_file.yml> --data <path-to-data> --community_level <community-level> --response_type <response-type> --method <\"local\"|\"global\"> <query>\n"}, {"location": "query/cli/#cli-arguments", "title": "CLI Arguments", "text": "--config <config_file.yml> - The configuration yaml file to use when running the query. If this is used, then none of the environment-variables below will apply.--data <path-to-data> - Folder containing the .parquet output files from running the Indexer.--community_level <community-level> - Community level in the Leiden community hierarchy from which we will load the community reports. A higher value means we use reports on smaller communities. Default: 2--response_type <response-type> - Free form text describing the response type and format, can be anything, e.g. Multiple Paragraphs, Single Paragraph, Single Sentence, List of 3-7 Points, Single Page, Multi-Page Report. Default: Multiple Paragraphs.--method <\"local\"|\"global\"> - Method to use to answer the query, one of local or global. For more information, check Overview--streaming - Stream back the LLM responseRequired environment variables to execute: - GRAPHRAG_API_KEY - API Key for executing the model, will fall back to OPENAI_API_KEY if one is not provided. - GRAPHRAG_LLM_MODEL - Model to use for Chat Completions. - GRAPHRAG_EMBEDDING_MODEL - Model to use for Embeddings.
You can further customize the execution by providing these environment variables:
GRAPHRAG_LLM_API_BASE - The API Base URL. Default: NoneGRAPHRAG_LLM_TYPE - The LLM operation type. Either openai_chat or azure_openai_chat. Default: openai_chatGRAPHRAG_LLM_MAX_RETRIES - The maximum number of retries to attempt when a request fails. Default: 20GRAPHRAG_EMBEDDING_API_BASE - The API Base URL. Default: NoneGRAPHRAG_EMBEDDING_TYPE - The embedding client to use. Either openai_embedding or azure_openai_embedding. Default: openai_embeddingGRAPHRAG_EMBEDDING_MAX_RETRIES - The maximum number of retries to attempt when a request fails. Default: 20GRAPHRAG_LOCAL_SEARCH_TEXT_UNIT_PROP - Proportion of context window dedicated to related text units. Default: 0.5GRAPHRAG_LOCAL_SEARCH_COMMUNITY_PROP - Proportion of context window dedicated to community reports. Default: 0.1GRAPHRAG_LOCAL_SEARCH_CONVERSATION_HISTORY_MAX_TURNS - Maximum number of turns to include in the conversation history. Default: 5GRAPHRAG_LOCAL_SEARCH_TOP_K_ENTITIES - Number of related entities to retrieve from the entity description embedding store. Default: 10GRAPHRAG_LOCAL_SEARCH_TOP_K_RELATIONSHIPS - Control the number of out-of-network relationships to pull into the context window. Default: 10GRAPHRAG_LOCAL_SEARCH_MAX_TOKENS - Change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000). Default: 12000GRAPHRAG_LOCAL_SEARCH_LLM_MAX_TOKENS - Change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000-1500). Default: 2000GRAPHRAG_GLOBAL_SEARCH_MAX_TOKENS - Change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000). Default: 12000GRAPHRAG_GLOBAL_SEARCH_DATA_MAX_TOKENS - Change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000). Default: 12000GRAPHRAG_GLOBAL_SEARCH_MAP_MAX_TOKENS - Default: 500GRAPHRAG_GLOBAL_SEARCH_REDUCE_MAX_TOKENS - Change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000-1500). Default: 2000GRAPHRAG_GLOBAL_SEARCH_CONCURRENCY - Default: 32Baseline RAG struggles with queries that require aggregation of information across the dataset to compose an answer. Queries such as \u201cWhat are the top 5 themes in the data?\u201d perform terribly because baseline RAG relies on a vector search of semantically similar text content within the dataset. There is nothing in the query to direct it to the correct information.
However, with GraphRAG we can answer such questions, because the structure of the LLM-generated knowledge graph tells us about the structure (and thus themes) of the dataset as a whole. This allows the private dataset to be organized into meaningful semantic clusters that are pre-summarized. Using our global search method, the LLM uses these clusters to summarize these themes when responding to a user query.
"}, {"location": "query/global_search/#methodology", "title": "Methodology", "text": "---\ntitle: Global Search Dataflow\n---\n%%{ init: { 'flowchart': { 'curve': 'step' } } }%%\nflowchart LR\n\n uq[User Query] --- .1\n ch1[Conversation History] --- .1\n\n subgraph RIR\n direction TB\n ri1[Rated Intermediate<br/>Response 1]~~~ri2[Rated Intermediate<br/>Response 2] -.\"{1..N}\".-rin[Rated Intermediate<br/>Response N]\n end\n\n .1--Shuffled Community<br/>Report Batch 1-->RIR\n .1--Shuffled Community<br/>Report Batch 2-->RIR---.2\n .1--Shuffled Community<br/>Report Batch N-->RIR\n\n .2--Ranking +<br/>Filtering-->agr[Aggregated Intermediate<br/>Responses]-->res[Response]\n\n\n\n classDef green fill:#26B653,stroke:#333,stroke-width:2px,color:#fff;\n classDef turquoise fill:#19CCD3,stroke:#333,stroke-width:2px,color:#fff;\n classDef rose fill:#DD8694,stroke:#333,stroke-width:2px,color:#fff;\n classDef orange fill:#F19914,stroke:#333,stroke-width:2px,color:#fff;\n classDef purple fill:#B356CD,stroke:#333,stroke-width:2px,color:#fff;\n classDef invisible fill:#fff,stroke:#fff,stroke-width:0px,color:#fff, width:0px;\n class uq,ch1 turquoise;\n class ri1,ri2,rin rose;\n class agr orange;\n class res purple;\n class .1,.2 invisible;\n Given a user query and, optionally, the conversation history, the global search method uses a collection of LLM-generated community reports from a specified level of the graph's community hierarchy as context data to generate response in a map-reduce manner. At the map step, community reports are segmented into text chunks of pre-defined size. Each text chunk is then used to produce an intermediate response containing a list of point, each of which is accompanied by a numerical rating indicating the importance of the point. At the reduce step, a filtered set of the most important points from the intermediate responses are aggregated and used as the context to generate the final response.
The quality of the global search\u2019s response can be heavily influenced by the level of the community hierarchy chosen for sourcing community reports. Lower hierarchy levels, with their detailed reports, tend to yield more thorough responses, but may also increase the time and LLM resources needed to generate the final response due to the volume of reports.
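Conceptually, the map-reduce flow can be sketched as below; this is not the library's API, and the map and reduce LLM calls are passed in as hypothetical callables supplied by the caller.

from typing import Callable


def global_search(
    query: str,
    community_reports: list[str],
    map_llm: Callable[[str, list[str]], list[tuple[float, str]]],  # hypothetical: returns (rating, point) pairs
    reduce_llm: Callable[[str, list[str]], str],  # hypothetical: final answer from a point list
    batch_size: int = 10,
    top_n_points: int = 20,
) -> str:
    """Conceptual sketch of map-reduce over community reports (not the library's actual API)."""
    # Map: each batch of community reports yields importance-rated intermediate points.
    rated_points: list[tuple[float, str]] = []
    for i in range(0, len(community_reports), batch_size):
        rated_points.extend(map_llm(query, community_reports[i : i + batch_size]))

    # Reduce: the highest-rated points become the context for the final response.
    rated_points.sort(key=lambda p: p[0], reverse=True)
    context = [text for _, text in rated_points[:top_n_points]]
    return reduce_llm(query, context)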
"}, {"location": "query/global_search/#configuration", "title": "Configuration", "text": "Below are the key parameters of the GlobalSearch class:
llm: OpenAI model object to be used for response generationcontext_builder: context builder object to be used for preparing context data from community reportsmap_system_prompt: prompt template used in the map stage. Default template can be found at map_system_promptreduce_system_prompt: prompt template used in the reduce stage, default template can be found at reduce_system_promptresponse_type: free-form text describing the desired response type and format (e.g., Multiple Paragraphs, Multi-Page Report)allow_general_knowledge: setting this to True will include additional instructions to the reduce_system_prompt to prompt the LLM to incorporate relevant real-world knowledge outside of the dataset. Note that this may increase hallucinations, but can be useful for certain scenarios. Default is False.general_knowledge_inclusion_prompt: instruction to add to the reduce_system_prompt if allow_general_knowledge is enabled. Default instruction can be found at general_knowledge_instructionmax_data_tokens: token budget for the context datamap_llm_params: a dictionary of additional parameters (e.g., temperature, max_tokens) to be passed to the LLM call at the map stagereduce_llm_params: a dictionary of additional parameters (e.g., temperature, max_tokens) to be passed to the LLM call at the reduce stagecontext_builder_params: a dictionary of additional parameters to be passed to the context_builder object when building the context window for the map stage.concurrent_coroutines: controls the degree of parallelism in the map stage.callbacks: optional callback functions, can be used to provide custom event handlers for LLM's completion streaming eventsAn example of a global search scenario can be found in the following notebook.
"}, {"location": "query/local_search/", "title": "Local Search \ud83d\udd0e", "text": ""}, {"location": "query/local_search/#entity-based-reasoning", "title": "Entity-based Reasoning", "text": "The local search method combines structured data from the knowledge graph with unstructured data from the input documents to augment the LLM context with relevant entity information at query time. It is well-suited for answering questions that require an understanding of specific entities mentioned in the input documents (e.g., \u201cWhat are the healing properties of chamomile?\u201d).
"}, {"location": "query/local_search/#methodology", "title": "Methodology", "text": "---\ntitle: Local Search Dataflow\n---\n%%{ init: { 'flowchart': { 'curve': 'step' } } }%%\nflowchart LR\n\n uq[User Query] ---.1\n ch1[Conversation<br/>History]---.1\n\n .1--Entity<br/>Description<br/>Embedding--> ee[Extracted Entities]\n\n ee[Extracted Entities] ---.2--Entity-Text<br/>Unit Mapping--> ctu[Candidate<br/>Text Units]--Ranking + <br/>Filtering -->ptu[Prioritized<br/>Text Units]---.3\n .2--Entity-Report<br/>Mapping--> ccr[Candidate<br/>Community Reports]--Ranking + <br/>Filtering -->pcr[Prioritized<br/>Community Reports]---.3\n .2--Entity-Entity<br/>Relationships--> ce[Candidate<br/>Entities]--Ranking + <br/>Filtering -->pe[Prioritized<br/>Entities]---.3\n .2--Entity-Entity<br/>Relationships--> cr[Candidate<br/>Relationships]--Ranking + <br/>Filtering -->pr[Prioritized<br/>Relationships]---.3\n .2--Entity-Covariate<br/>Mappings--> cc[Candidate<br/>Covariates]--Ranking + <br/>Filtering -->pc[Prioritized<br/>Covariates]---.3\n ch1 -->ch2[Conversation History]---.3\n .3-->res[Response]\n\n classDef green fill:#26B653,stroke:#333,stroke-width:2px,color:#fff;\n classDef turquoise fill:#19CCD3,stroke:#333,stroke-width:2px,color:#fff;\n classDef rose fill:#DD8694,stroke:#333,stroke-width:2px,color:#fff;\n classDef orange fill:#F19914,stroke:#333,stroke-width:2px,color:#fff;\n classDef purple fill:#B356CD,stroke:#333,stroke-width:2px,color:#fff;\n classDef invisible fill:#fff,stroke:#fff,stroke-width:0px,color:#fff, width:0px;\n class uq,ch1 turquoise\n class ee green\n class ctu,ccr,ce,cr,cc rose\n class ptu,pcr,pe,pr,pc,ch2 orange\n class res purple\n class .1,.2,.3 invisible\n\n Given a user query and, optionally, the conversation history, the local search method identifies a set of entities from the knowledge graph that are semantically-related to the user input. These entities serve as access points into the knowledge graph, enabling the extraction of further relevant details such as connected entities, relationships, entity covariates, and community reports. Additionally, it also extracts relevant text chunks from the raw input documents that are associated with the identified entities. These candidate data sources are then prioritized and filtered to fit within a single context window of pre-defined size, which is used to generate a response to the user query.
"}, {"location": "query/local_search/#configuration", "title": "Configuration", "text": "Below are the key parameters of the LocalSearch class:
llm: OpenAI model object to be used for response generationcontext_builder: context builder object to be used for preparing context data from collections of knowledge model objectssystem_prompt: prompt template used to generate the search response. Default template can be found at system_promptresponse_type: free-form text describing the desired response type and format (e.g., Multiple Paragraphs, Multi-Page Report)llm_params: a dictionary of additional parameters (e.g., temperature, max_tokens) to be passed to the LLM callcontext_builder_params: a dictionary of additional parameters to be passed to the context_builder object when building context for the search promptcallbacks: optional callback functions, can be used to provide custom event handlers for LLM's completion streaming eventsAn example of a local search scenario can be found in the following notebook.
"}, {"location": "query/overview/", "title": "Query Engine \ud83d\udd0e", "text": "The Query Engine is the retrieval module of the Graph RAG Library. It is one of the two main components of the Graph RAG library, the other being the Indexing Pipeline (see Indexing Pipeline). It is responsible for the following tasks:
The local search method generates answers by combining relevant data from the AI-extracted knowledge graph with text chunks of the raw documents. This method is suitable for questions that require an understanding of specific entities mentioned in the documents (e.g. What are the healing properties of chamomile?).
For more details about how Local Search works please refer to the Local Search documentation.
"}, {"location": "query/overview/#global-search", "title": "Global Search", "text": "Global search method generates answers by searching over all AI-generated community reports in a map-reduce fashion. This is a resource-intensive method, but often gives good responses for questions that require an understanding of the dataset as a whole (e.g. What are the most significant values of the herbs mentioned in this notebook?).
For more details, refer to the Global Search documentation.
"}, {"location": "query/overview/#question-generation", "title": "Question Generation", "text": "This functionality takes a list of user queries and generates the next candidate questions. This is useful for generating follow-up questions in a conversation or for generating a list of questions for the investigator to dive deeper into the dataset.
Information about how question generation works can be found at the Question Generation documentation page.
"}, {"location": "query/question_generation/", "title": "Question Generation \u2754", "text": ""}, {"location": "query/question_generation/#entity-based-question-generation", "title": "Entity-based Question Generation", "text": "The question generation method combines structured data from the knowledge graph with unstructured data from the input documents to generate candidate questions related to specific entities.
"}, {"location": "query/question_generation/#methodology", "title": "Methodology", "text": "Given a list of prior user questions, the question generation method uses the same context-building approach employed in local search to extract and prioritize relevant structured and unstructured data, including entities, relationships, covariates, community reports and raw text chunks. These data records are then fitted into a single LLM prompt to generate candidate follow-up questions that represent the most important or urgent information content or themes in the data.
"}, {"location": "query/question_generation/#configuration", "title": "Configuration", "text": "Below are the key parameters of the Question Generation class:
llm: OpenAI model object to be used for response generationcontext_builder: context builder object to be used for preparing context data from collections of knowledge model objects, using the same context builder class as in local searchsystem_prompt: prompt template used to generate candidate questions. Default template can be found at system_promptllm_params: a dictionary of additional parameters (e.g., temperature, max_tokens) to be passed to the LLM callcontext_builder_params: a dictionary of additional parameters to be passed to the context_builder object when building context for the question generation promptcallbacks: optional callback functions, can be used to provide custom event handlers for LLM's completion streaming eventsAn example of the question generation function can be found in the following notebook.
"}, {"location": "query/notebooks/overview/", "title": "Query Engine Notebooks", "text": "For examples about running Query please refer to the following notebooks:
The test dataset for these notebooks can be found in dataset.zip.
"}]} \ No newline at end of file +{"config": {"lang": ["en"], "separator": "[\\s\\-]+", "pipeline": ["stopWordFilter"]}, "docs": [{"location": "", "title": "Welcome to GraphRAG", "text": "\ud83d\udc49 Microsoft Research Blog Post \ud83d\udc49 GraphRAG Accelerator \ud83d\udc49 GraphRAG Arxiv
Figure 1: An LLM-generated knowledge graph built using GPT-4 Turbo.
GraphRAG is a structured, hierarchical approach to Retrieval Augmented Generation (RAG), as opposed to naive semantic-search approaches using plain text snippets. The GraphRAG process involves extracting a knowledge graph out of raw text, building a community hierarchy, generating summaries for these communities, and then leveraging these structures when performing RAG-based tasks.
To learn more about GraphRAG and how it can be used to enhance your LLM's ability to reason about your private data, please visit the Microsoft Research Blog Post.
"}, {"location": "#solution-accelerator", "title": "Solution Accelerator \ud83d\ude80", "text": "To quickstart the GraphRAG system we recommend trying the Solution Accelerator package. This provides a user-friendly end-to-end experience with Azure resources.
"}, {"location": "#get-started-with-graphrag", "title": "Get Started with GraphRAG \ud83d\ude80", "text": "To start using GraphRAG, check out the Get Started guide. For a deeper dive into the main sub-systems, please visit the docpages for the Indexer and Query packages.
"}, {"location": "#graphrag-vs-baseline-rag", "title": "GraphRAG vs Baseline RAG \ud83d\udd0d", "text": "Retrieval-Augmented Generation (RAG) is a technique to improve LLM outputs using real-world information. This technique is an important part of most LLM-based tools and the majority of RAG approaches use vector similarity as the search technique, which we call Baseline RAG. GraphRAG uses knowledge graphs to provide substantial improvements in question-and-answer performance when reasoning about complex information. RAG techniques have shown promise in helping LLMs to reason about private datasets - data that the LLM is not trained on and has never seen before, such as an enterprise\u2019s proprietary research, business documents, or communications. Baseline RAG was created to help solve this problem, but we observe situations where baseline RAG performs very poorly. For example:
To address this, the tech community is working to develop methods that extend and enhance RAG. Microsoft Research\u2019s new approach, GraphRAG, uses LLMs to create a knowledge graph based on an input corpus. This graph, along with community summaries and graph machine learning outputs, is used to augment prompts at query time. GraphRAG shows substantial improvement in answering the two classes of questions described above, demonstrating intelligence or mastery that outperforms other approaches previously applied to private datasets.
"}, {"location": "#the-graphrag-process", "title": "The GraphRAG Process \ud83e\udd16", "text": "GraphRAG builds upon our prior research and tooling using graph machine learning. The basic steps of the GraphRAG process are as follows:
"}, {"location": "#index", "title": "Index", "text": "At query time, these structures are used to provide materials for the LLM context window when answering a question. The primary query modes are:
Using GraphRAG with your data out of the box may not yield the best possible results. We strongly recommend fine-tuning your prompts by following the Prompt Tuning Guide in our documentation.
"}, {"location": "blog_posts/", "title": "Microsoft Research Blog", "text": "GraphRAG: Unlocking LLM discovery on narrative private data
Published February 13, 2024
By Jonathan Larson, Senior Principal Data Architect; Steven Truitt, Principal Program Manager
GraphRAG: New tool for complex data discovery now on GitHub
Published July 2, 2024
By Darren Edge, Senior Director; Ha Trinh, Senior Data Scientist; Steven Truitt, Principal Program Manager; Jonathan Larson, Senior Principal Data Architect
GraphRAG auto-tuning provides rapid adaptation to new domains
Published September 9, 2024
By Alonso Guevara Fern\u00e1ndez, Sr. Software Engineer; Katy Smith, Data Scientist II; Joshua Bradley, Senior Data Scientist; Darren Edge, Senior Director; Ha Trinh, Senior Data Scientist; Sarah Smith, Senior Program Manager; Ben Cutler, Senior Director; Steven Truitt, Principal Program Manager; Jonathan Larson, Senior Principal Data Architect
"}, {"location": "developing/", "title": "Development Guide", "text": ""}, {"location": "developing/#requirements", "title": "Requirements", "text": "Name Installation Purpose Python 3.10-3.12 Download The library is Python-based. Poetry Instructions Poetry is used for package management and virtualenv management in Python codebases"}, {"location": "developing/#getting-started", "title": "Getting Started", "text": ""}, {"location": "developing/#install-dependencies", "title": "Install Dependencies", "text": "# Install Python dependencies.\npoetry install\n"}, {"location": "developing/#execute-the-indexing-engine", "title": "Execute the Indexing Engine", "text": "poetry run poe index <...args>\n"}, {"location": "developing/#executing-queries", "title": "Executing Queries", "text": "poetry run poe query <...args>\n"}, {"location": "developing/#azurite", "title": "Azurite", "text": "Some unit and smoke tests use Azurite to emulate Azure resources. This can be started by running:
./scripts/start-azurite.sh\n or by simply running azurite in the terminal if already installed globally. See the Azurite documentation for more information about how to install and use Azurite.
Our Python package utilizes Poetry to manage dependencies and poethepoet to manage build scripts.
Available scripts are:
poetry run poe index - Run the Indexing CLIpoetry run poe query - Run the Query CLIpoetry build - This invokes poetry build, which will build a wheel file and other distributable artifacts.poetry run poe test - This will execute all tests.poetry run poe test_unit - This will execute unit tests.poetry run poe test_integration - This will execute integration tests.poetry run poe test_smoke - This will execute smoke tests.poetry run poe check - This will perform a suite of static checks across the package, including:poetry run poe fix - This will apply any available auto-fixes to the package. Usually this is just formatting fixes.poetry run poe fix_unsafe - This will apply any available auto-fixes to the package, including those that may be unsafe.poetry run poe format - Explicitly run the formatter across the package.Make sure llvm-9 and llvm-9-dev are installed:
sudo apt-get install llvm-9 llvm-9-dev
and then in your bashrc, add
export LLVM_CONFIG=/usr/bin/llvm-config-9
Make sure you have python3.10-dev installed or more generally python<version>-dev
sudo apt-get install python3.10-dev
GRAPHRAG_LLM_THREAD_COUNT and GRAPHRAG_EMBEDDING_THREAD_COUNT are both set to 50 by default. You can modify these values to reduce concurrency. Please refer to the Configuration documentation.
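For example, a more conservative .env might look like the following (illustrative values; lower them further if you keep hitting rate limits):
# .env - reduce indexing concurrency (example values)
GRAPHRAG_LLM_THREAD_COUNT=8
GRAPHRAG_EMBEDDING_THREAD_COUNT=8
GRAPHRAG_LLM_CONCURRENT_REQUESTS=4
GRAPHRAG_EMBEDDING_CONCURRENT_REQUESTS=4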
Python 3.10-3.12
To get started with the GraphRAG system, you have a few options:
\ud83d\udc49 Use the GraphRAG Accelerator solution \ud83d\udc49 Install from pypi. \ud83d\udc49 Use it from source
"}, {"location": "get_started/#quickstart", "title": "Quickstart", "text": "To get started with the GraphRAG system we recommend trying the Solution Accelerator package. This provides a user-friendly end-to-end experience with Azure resources.
"}, {"location": "get_started/#top-level-modules", "title": "Top-Level Modules", "text": "The following is a simple end-to-end example for using the GraphRAG system. It shows how to use the system to index some text, and then use the indexed data to answer questions about the documents.
"}, {"location": "get_started/#install-graphrag", "title": "Install GraphRAG", "text": "pip install graphrag\n"}, {"location": "get_started/#running-the-indexer", "title": "Running the Indexer", "text": "Now we need to set up a data project and some initial configuration. Let's set that up. We're using the default configuration mode, which you can customize as needed using a config file, which we recommend, or environment variables.
First let's get a sample dataset ready:
mkdir -p ./ragtest/input\n Now let's get a copy of A Christmas Carol by Charles Dickens from a trusted source:
curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt -o ./ragtest/input/book.txt\n Next we'll inject some required config variables:
"}, {"location": "get_started/#set-up-your-workspace-variables", "title": "Set Up Your Workspace Variables", "text": "First let's make sure to setup the required environment variables. For details on these environment variables, and what environment variables are available, see the variables documentation.
To initialize your workspace, let's first run the graphrag.index --init command. Since we have already configured a directory named ./ragtest in the previous step, we can run the following command:
python -m graphrag.index --init --root ./ragtest\n This will create two files: .env and settings.yaml in the ./ragtest directory.
.env contains the environment variables required to run the GraphRAG pipeline. If you inspect the file, you'll see a single environment variable defined, GRAPHRAG_API_KEY=<API_KEY>. This is the API key for the OpenAI API or Azure OpenAI endpoint. You can replace this with your own API key.settings.yaml contains the settings for the pipeline. You can modify this file to change the settings for the pipeline. To run in OpenAI mode, just make sure to update the value of GRAPHRAG_API_KEY in the .env file with your OpenAI API key.
In addition, Azure OpenAI users should set the following variables in the settings.yaml file. To find the appropriate sections, just search for the llm: configuration; you should see two sections, one for the chat endpoint and one for the embeddings endpoint. Here is an example of how to configure the chat endpoint:
type: azure_openai_chat # Or azure_openai_embedding for embeddings\napi_base: https://<instance>.openai.azure.com\napi_version: 2024-02-15-preview # You can customize this for other versions\ndeployment_name: <azure_model_deployment_name>\n Finally we'll run the pipeline!
python -m graphrag.index --root ./ragtest\n This process will take some time to run, depending on the size of your input data, the model you're using, and the text chunk size (these can be configured in your settings.yaml file). Once the pipeline is complete, you should see a new folder called ./ragtest/output/<timestamp>/artifacts with a series of parquet files.
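If you want a quick look at what was produced before querying, the parquet artifacts can be inspected with pandas. A small sketch (the glob pattern follows the output folder layout described above; exact table names vary by version):
from pathlib import Path
import pandas as pd

# Print the name, row count, and columns of every parquet table emitted by the run.
for path in sorted(Path("./ragtest/output").glob("*/artifacts/*.parquet")):
    df = pd.read_parquet(path)
    print(f"{path.name}: {len(df)} rows, columns: {list(df.columns)}")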
Now let's ask some questions using this dataset.
Here is an example using Global search to ask a high-level question:
python -m graphrag.query \\\n--root ./ragtest \\\n--method global \\\n\"What are the top themes in this story?\"\n Here is an example using Local search to ask a more specific question about a particular character:
python -m graphrag.query \\\n--root ./ragtest \\\n--method local \\\n\"Who is Scrooge, and what are his main relationships?\"\n Please refer to Query Engine docs for detailed information about how to leverage our Local and Global search mechanisms for extracting meaningful insights from data after the Indexer has wrapped up execution.
"}, {"location": "config/custom/", "title": "Fully Custom Config", "text": "The primary configuration sections for Indexing Engine pipelines are described below. Each configuration section can be expressed in Python (for use in Python API mode) as well as YAML, but YAML is show here for brevity.
Using custom configuration is an advanced use-case. Most users will want to use the Default Configuration instead.
"}, {"location": "config/custom/#indexing-engine-examples", "title": "Indexing Engine Examples", "text": "The examples directory contains several examples of how to use the indexing engine with custom configuration.
Most examples include two different forms of running the pipeline, both of which are contained in each example's run.py
To run an example:
poetry shell to activate a virtual environment with the required dependencies.PYTHONPATH=\"$(pwd)\" python examples/path_to_example/run.py from the root directory.For example to run the single_verb example, you would run the following commands:
poetry shell\n PYTHONPATH=\"$(pwd)\" python examples/single_verb/run.py\n"}, {"location": "config/custom/#configuration-sections", "title": "Configuration Sections", "text": ""}, {"location": "config/custom/#extends", "title": "> extends", "text": "This configuration allows you to extend a base configuration file or files.
# single base\nextends: ../base_config.yml\n # multiple bases\nextends:\n - ../base_config.yml\n - ../base_config2.yml\n"}, {"location": "config/custom/#root_dir", "title": "> root_dir", "text": "This configuration allows you to set the root directory for the pipeline. All data inputs and outputs are assumed to be relative to this path.
root_dir: /workspace/data_project\n"}, {"location": "config/custom/#storage", "title": "> storage", "text": "This configuration allows you to define the output strategy for the pipeline.
type: The type of storage to use. Options are file, memory, and blobbase_dir (type: file only): The base directory to store the data in. This is relative to the config root.connection_string (type: blob only): The connection string to use for blob storage.container_name (type: blob only): The container to use for blob storage.This configuration allows you to define the cache strategy for the pipeline.
type: The type of cache to use. Options are file, memory, and blob.base_dir (type: file only): The base directory to store the cache in. This is relative to the config root.connection_string (type: blob only): The connection string to use for blob storage.container_name (type: blob only): The container to use for blob storage.This configuration allows you to define the reporting strategy for the pipeline. Report files are generated artifacts that summarize the performance metrics of the pipeline and emit any error messages.
type: The type of reporting to use. Options are file, memory, and blobbase_dir (type: file only): The base directory to store the reports in. This is relative to the config root.connection_string (type: blob only): The connection string to use for blob storage.container_name (type: blob only): The container to use for blob storage.This configuration section defines the workflow DAG for the pipeline. Here we define an array of workflows and express their inter-dependencies in steps:
name: The name of the workflow. This is used to reference the workflow in other parts of the config.steps: The DataShaper steps that this workflow comprises. If a step defines an input in the form of workflow:<workflow_name>, then it is assumed to have a dependency on the output of that workflow.workflows:\n - name: workflow1\n steps:\n - verb: derive\n args:\n column1: \"col1\"\n column2: \"col2\"\n - name: workflow2\n steps:\n - verb: derive\n args:\n column1: \"col1\"\n column2: \"col2\"\n input:\n # dependency established here\n source: workflow:workflow1\n"}, {"location": "config/custom/#input", "title": "> input", "text": "type: The type of input to use. Options are file or blob.file_type: The file type field discriminates between the different input types. Options are csv and text.base_dir: The base directory to read the input files from. This is relative to the config file.file_pattern: A regex to match the input files. The regex must have named groups for each of the fields in the file_filter.post_process: A DataShaper workflow definition to apply to the input before executing the primary workflow.source_column (type: csv only): The column containing the source/author of the datatext_column (type: csv only): The column containing the text of the datatimestamp_column (type: csv only): The column containing the timestamp of the datatimestamp_format (type: csv only): The format of the timestampinput:\n type: file\n file_type: csv\n base_dir: ../data/csv # the directory containing the CSV files, this is relative to the config file\n file_pattern: '.*[\\/](?P<source>[^\\/]+)[\\/](?P<year>\\d{4})-(?P<month>\\d{2})-(?P<day>\\d{2})_(?P<author>[^_]+)_\\d+\\.csv$' # a regex to match the CSV files\n # An additional file filter which uses the named groups from the file_pattern to further filter the files\n # file_filter:\n # # source: (source_filter)\n # year: (2023)\n # month: (06)\n # # day: (22)\n source_column: \"author\" # the column containing the source/author of the data\n text_column: \"message\" # the column containing the text of the data\n timestamp_column: \"date(yyyyMMddHHmmss)\" # optional, the column containing the timestamp of the data\n timestamp_format: \"%Y%m%d%H%M%S\" # optional, the format of the timestamp\n post_process: # Optional, set of steps to process the data before going into the workflow\n - verb: filter\n args:\n column: \"title\",\n value: \"My document\"\n input:\n type: file\n file_type: csv\n base_dir: ../data/csv # the directory containing the CSV files, this is relative to the config file\n file_pattern: '.*[\\/](?P<source>[^\\/]+)[\\/](?P<year>\\d{4})-(?P<month>\\d{2})-(?P<day>\\d{2})_(?P<author>[^_]+)_\\d+\\.csv$' # a regex to match the CSV files\n # An additional file filter which uses the named groups from the file_pattern to further filter the files\n # file_filter:\n # # source: (source_filter)\n # year: (2023)\n # month: (06)\n # # day: (22)\n post_process: # Optional, set of steps to process the data before going into the workflow\n - verb: filter\n args:\n column: \"title\",\n value: \"My document\"\n"}, {"location": "config/env_vars/", "title": "Default Configuration Mode (using Env Vars)", "text": ""}, {"location": "config/env_vars/#text-embeddings-customization", "title": "Text-Embeddings Customization", "text": "By default, the GraphRAG indexer will only emit embeddings required for our query methods. 
However, the model has embeddings defined for all plaintext fields, and these can be generated by setting the GRAPHRAG_EMBEDDING_TARGET environment variable to all.
If the embedding target is all, and you want to only embed a subset of these fields, you may specify which embeddings to skip using the GRAPHRAG_EMBEDDING_SKIP argument described below.
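For example, to embed every plaintext field except a couple of the larger ones, you might set the following in your .env (field names come from the list below; the skip selection is illustrative):
GRAPHRAG_EMBEDDING_TARGET=all
GRAPHRAG_EMBEDDING_SKIP="document.raw_content,community.full_content"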
text_unit.text, document.raw_content, entity.name, entity.description, relationship.description, community.title, community.summary, community.full_content. Our pipeline can ingest .csv or .txt data from an input folder. These files can be nested within subfolders. To configure how input data is handled, what fields are mapped over, and how timestamps are parsed, look for configuration values starting with GRAPHRAG_INPUT_ below. In general, CSV-based data provides the most customizability. Each CSV should at least contain a text field (which can be mapped with environment variables), but it's helpful if they also have title, timestamp, and source fields. Additional fields can be included as well, which will land as extra fields on the Document table.
These are the primary settings for configuring LLM connectivity.
Parameter Required? Description Type Default ValueGRAPHRAG_API_KEY Yes for OpenAI. Optional for AOAI The API key. (Note: OPENAI_API_KEY is also used as a fallback). If not defined when using AOAI, managed identity will be used. str None GRAPHRAG_API_BASE For AOAI The API Base URL str None GRAPHRAG_API_VERSION For AOAI The AOAI API version. str None GRAPHRAG_API_ORGANIZATION The AOAI organization. str None GRAPHRAG_API_PROXY The AOAI proxy. str None"}, {"location": "config/env_vars/#text-generation-settings", "title": "Text Generation Settings", "text": "These settings control the text generation model used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.
Parameter Required? Description Type Default ValueGRAPHRAG_LLM_TYPE For AOAI The LLM operation type. Either openai_chat or azure_openai_chat str openai_chat GRAPHRAG_LLM_DEPLOYMENT_NAME For AOAI The AOAI model deployment name. str None GRAPHRAG_LLM_API_KEY Yes (uses fallback) The API key. If not defined when using AOAI, managed identity will be used. str None GRAPHRAG_LLM_API_BASE For AOAI (uses fallback) The API Base URL str None GRAPHRAG_LLM_API_VERSION For AOAI (uses fallback) The AOAI API version. str None GRAPHRAG_LLM_API_ORGANIZATION For AOAI (uses fallback) The AOAI organization. str None GRAPHRAG_LLM_API_PROXY The AOAI proxy. str None GRAPHRAG_LLM_MODEL The LLM model. str gpt-4-turbo-preview GRAPHRAG_LLM_MAX_TOKENS The maximum number of tokens. int 4000 GRAPHRAG_LLM_REQUEST_TIMEOUT The maximum number of seconds to wait for a response from the chat client. int 180 GRAPHRAG_LLM_MODEL_SUPPORTS_JSON Indicates whether the given model supports JSON output mode. True to enable. str None GRAPHRAG_LLM_THREAD_COUNT The number of threads to use for LLM parallelization. int 50 GRAPHRAG_LLM_THREAD_STAGGER The time to wait (in seconds) between starting each thread. float 0.3 GRAPHRAG_LLM_CONCURRENT_REQUESTS The number of concurrent requests to allow for the embedding client. int 25 GRAPHRAG_LLM_TOKENS_PER_MINUTE The number of tokens per minute to allow for the LLM client. 0 = Bypass int 0 GRAPHRAG_LLM_REQUESTS_PER_MINUTE The number of requests per minute to allow for the LLM client. 0 = Bypass int 0 GRAPHRAG_LLM_MAX_RETRIES The maximum number of retries to attempt when a request fails. int 10 GRAPHRAG_LLM_MAX_RETRY_WAIT The maximum number of seconds to wait between retries. int 10 GRAPHRAG_LLM_SLEEP_ON_RATE_LIMIT_RECOMMENDATION Whether to sleep on rate limit recommendation. (Azure Only) bool True GRAPHRAG_LLM_TEMPERATURE The temperature to use generation. float 0 GRAPHRAG_LLM_TOP_P The top_p to use for sampling. float 1 GRAPHRAG_LLM_N The number of responses to generate. int 1"}, {"location": "config/env_vars/#text-embedding-settings", "title": "Text Embedding Settings", "text": "These settings control the text embedding model used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.
Parameter Required ? Description Type DefaultGRAPHRAG_EMBEDDING_TYPE For AOAI The embedding client to use. Either openai_embedding or azure_openai_embedding str openai_embedding GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME For AOAI The AOAI deployment name. str None GRAPHRAG_EMBEDDING_API_KEY Yes (uses fallback) The API key to use for the embedding client. If not defined when using AOAI, managed identity will be used. str None GRAPHRAG_EMBEDDING_API_BASE For AOAI (uses fallback) The API base URL. str None GRAPHRAG_EMBEDDING_API_VERSION For AOAI (uses fallback) The AOAI API version to use for the embedding client. str None GRAPHRAG_EMBEDDING_API_ORGANIZATION For AOAI (uses fallback) The AOAI organization to use for the embedding client. str None GRAPHRAG_EMBEDDING_API_PROXY The AOAI proxy to use for the embedding client. str None GRAPHRAG_EMBEDDING_MODEL The model to use for the embedding client. str text-embedding-3-small GRAPHRAG_EMBEDDING_BATCH_SIZE The number of texts to embed at once. (Azure limit is 16) int 16 GRAPHRAG_EMBEDDING_BATCH_MAX_TOKENS The maximum tokens per batch (Azure limit is 8191) int 8191 GRAPHRAG_EMBEDDING_TARGET The target fields to embed. Either required or all. str required GRAPHRAG_EMBEDDING_SKIP A comma-separated list of fields to skip embeddings for . (e.g. 'relationship.description') str None GRAPHRAG_EMBEDDING_THREAD_COUNT The number of threads to use for parallelization for embeddings. int GRAPHRAG_EMBEDDING_THREAD_STAGGER The time to wait (in seconds) between starting each thread for embeddings. float 50 GRAPHRAG_EMBEDDING_CONCURRENT_REQUESTS The number of concurrent requests to allow for the embedding client. int 25 GRAPHRAG_EMBEDDING_TOKENS_PER_MINUTE The number of tokens per minute to allow for the embedding client. 0 = Bypass int 0 GRAPHRAG_EMBEDDING_REQUESTS_PER_MINUTE The number of requests per minute to allow for the embedding client. 0 = Bypass int 0 GRAPHRAG_EMBEDDING_MAX_RETRIES The maximum number of retries to attempt when a request fails. int 10 GRAPHRAG_EMBEDDING_MAX_RETRY_WAIT The maximum number of seconds to wait between retries. int 10 GRAPHRAG_EMBEDDING_SLEEP_ON_RATE_LIMIT_RECOMMENDATION Whether to sleep on rate limit recommendation. (Azure Only) bool True"}, {"location": "config/env_vars/#input-settings", "title": "Input Settings", "text": "These settings control the data input used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.
"}, {"location": "config/env_vars/#plaintext-input-data-graphrag_input_file_typetext", "title": "Plaintext Input Data (GRAPHRAG_INPUT_FILE_TYPE=text)", "text": "Parameter Description Type Required or Optional Default GRAPHRAG_INPUT_FILE_PATTERN The file pattern regexp to use when reading input files from the input directory. str optional .*\\.txt$"}, {"location": "config/env_vars/#csv-input-data-graphrag_input_file_typecsv", "title": "CSV Input Data (GRAPHRAG_INPUT_FILE_TYPE=csv)", "text": "Parameter Description Type Required or Optional Default GRAPHRAG_INPUT_TYPE The input storage type to use when reading files. (file or blob) str optional file GRAPHRAG_INPUT_FILE_PATTERN The file pattern regexp to use when reading input files from the input directory. str optional .*\\.txt$ GRAPHRAG_INPUT_SOURCE_COLUMN The 'source' column to use when reading CSV input files. str optional source GRAPHRAG_INPUT_TIMESTAMP_COLUMN The 'timestamp' column to use when reading CSV input files. str optional None GRAPHRAG_INPUT_TIMESTAMP_FORMAT The timestamp format to use when parsing timestamps in the timestamp column. str optional None GRAPHRAG_INPUT_TEXT_COLUMN The 'text' column to use when reading CSV input files. str optional text GRAPHRAG_INPUT_DOCUMENT_ATTRIBUTE_COLUMNS A list of CSV columns, comma-separated, to incorporate as document fields. str optional id GRAPHRAG_INPUT_TITLE_COLUMN The 'title' column to use when reading CSV input files. str optional title GRAPHRAG_INPUT_STORAGE_ACCOUNT_BLOB_URL The Azure Storage blob endpoint to use when in blob mode and using managed identity. Will have the format https://<storage_account_name>.blob.core.windows.net str optional None GRAPHRAG_INPUT_CONNECTION_STRING The connection string to use when reading CSV input files from Azure Blob Storage. str optional None GRAPHRAG_INPUT_CONTAINER_NAME The container name to use when reading CSV input files from Azure Blob Storage. str optional None GRAPHRAG_INPUT_BASE_DIR The base directory to read input files from. str optional None"}, {"location": "config/env_vars/#data-mapping-settings", "title": "Data Mapping Settings", "text": "Parameter Description Type Required or Optional Default GRAPHRAG_INPUT_FILE_TYPE The type of input data, csv or text str optional text GRAPHRAG_INPUT_ENCODING The encoding to apply when reading CSV/text input files. str optional utf-8"}, {"location": "config/env_vars/#data-chunking", "title": "Data Chunking", "text": "Parameter Description Type Required or Optional Default GRAPHRAG_CHUNK_SIZE The chunk size in tokens for text-chunk analysis windows. str optional 1200 GRAPHRAG_CHUNK_OVERLAP The chunk overlap in tokens for text-chunk analysis windows. str optional 100 GRAPHRAG_CHUNK_BY_COLUMNS A comma-separated list of document attributes to groupby when performing TextUnit chunking. str optional id GRAPHRAG_CHUNK_ENCODING_MODEL The encoding model to use for chunking. str optional The top-level encoding model."}, {"location": "config/env_vars/#prompting-overrides", "title": "Prompting Overrides", "text": "Parameter Description Type Required or Optional Default GRAPHRAG_ENTITY_EXTRACTION_PROMPT_FILE The path (relative to the root) of an entity extraction prompt template text file. str optional None GRAPHRAG_ENTITY_EXTRACTION_MAX_GLEANINGS The maximum number of redrives (gleanings) to invoke when extracting entities in a loop. int optional 1 GRAPHRAG_ENTITY_EXTRACTION_ENTITY_TYPES A comma-separated list of entity types to extract. 
str optional organization,person,event,geo GRAPHRAG_ENTITY_EXTRACTION_ENCODING_MODEL The encoding model to use for entity extraction. str optional The top-level encoding model. GRAPHRAG_SUMMARIZE_DESCRIPTIONS_PROMPT_FILE The path (relative to the root) of an description summarization prompt template text file. str optional None GRAPHRAG_SUMMARIZE_DESCRIPTIONS_MAX_LENGTH The maximum number of tokens to generate per description summarization. int optional 500 GRAPHRAG_CLAIM_EXTRACTION_ENABLED Whether claim extraction is enabled for this pipeline. bool optional False GRAPHRAG_CLAIM_EXTRACTION_DESCRIPTION The claim_description prompting argument to utilize. string optional \"Any claims or facts that could be relevant to threat analysis.\" GRAPHRAG_CLAIM_EXTRACTION_PROMPT_FILE The claim extraction prompt to utilize. string optional None GRAPHRAG_CLAIM_EXTRACTION_MAX_GLEANINGS The maximum number of redrives (gleanings) to invoke when extracting claims in a loop. int optional 1 GRAPHRAG_CLAIM_EXTRACTION_ENCODING_MODEL The encoding model to use for claim extraction. str optional The top-level encoding model GRAPHRAG_COMMUNITY_REPORTS_PROMPT_FILE The community reports extraction prompt to utilize. string optional None GRAPHRAG_COMMUNITY_REPORTS_MAX_LENGTH The maximum number of tokens to generate per community reports. int optional 1500"}, {"location": "config/env_vars/#storage", "title": "Storage", "text": "This section controls the storage mechanism used by the pipeline used for emitting output tables.
Parameter Description Type Required or Optional DefaultGRAPHRAG_STORAGE_TYPE The type of storage to use. Options are file, memory, or blob str optional file GRAPHRAG_STORAGE_STORAGE_ACCOUNT_BLOB_URL The Azure Storage blob endpoint to use when in blob mode and using managed identity. Will have the format https://<storage_account_name>.blob.core.windows.net str optional None GRAPHRAG_STORAGE_CONNECTION_STRING The Azure Storage connection string to use when in blob mode. str optional None GRAPHRAG_STORAGE_CONTAINER_NAME The Azure Storage container name to use when in blob mode. str optional None GRAPHRAG_STORAGE_BASE_DIR The base path to data outputs. str optional None"}, {"location": "config/env_vars/#cache", "title": "Cache", "text": "This section controls the cache mechanism used by the pipeline. This is used to cache LLM invocation results.
Parameter Description Type Required or Optional DefaultGRAPHRAG_CACHE_TYPE The type of cache to use. Options are file, memory, none or blob str optional file GRAPHRAG_CACHE_STORAGE_ACCOUNT_BLOB_URL The Azure Storage blob endpoint to use when in blob mode and using managed identity. Will have the format https://<storage_account_name>.blob.core.windows.net str optional None GRAPHRAG_CACHE_CONNECTION_STRING The Azure Storage connection string to use when in blob mode. str optional None GRAPHRAG_CACHE_CONTAINER_NAME The Azure Storage container name to use when in blob mode. str optional None GRAPHRAG_CACHE_BASE_DIR The base path to the cache files. str optional None"}, {"location": "config/env_vars/#reporting", "title": "Reporting", "text": "This section controls the reporting mechanism used by the pipeline, for common events and error messages. The default is to write reports to a file in the output directory. However, you can also choose to write reports to the console or to an Azure Blob Storage container.
Parameter Description Type Required or Optional DefaultGRAPHRAG_REPORTING_TYPE The type of reporter to use. Options are file, console, or blob str optional file GRAPHRAG_REPORTING_STORAGE_ACCOUNT_BLOB_URL The Azure Storage blob endpoint to use when in blob mode and using managed identity. Will have the format https://<storage_account_name>.blob.core.windows.net str optional None GRAPHRAG_REPORTING_CONNECTION_STRING The Azure Storage connection string to use when in blob mode. str optional None GRAPHRAG_REPORTING_CONTAINER_NAME The Azure Storage container name to use when in blob mode. str optional None GRAPHRAG_REPORTING_BASE_DIR The base path to the reporting outputs. str optional None"}, {"location": "config/env_vars/#node2vec-parameters", "title": "Node2Vec Parameters", "text": "Parameter Description Type Required or Optional Default GRAPHRAG_NODE2VEC_ENABLED Whether to enable Node2Vec bool optional False GRAPHRAG_NODE2VEC_NUM_WALKS The Node2Vec number of walks to perform int optional 10 GRAPHRAG_NODE2VEC_WALK_LENGTH The Node2Vec walk length int optional 40 GRAPHRAG_NODE2VEC_WINDOW_SIZE The Node2Vec window size int optional 2 GRAPHRAG_NODE2VEC_ITERATIONS The number of iterations to run node2vec int optional 3 GRAPHRAG_NODE2VEC_RANDOM_SEED The random seed to use for node2vec int optional 597832"}, {"location": "config/env_vars/#data-snapshotting", "title": "Data Snapshotting", "text": "Parameter Description Type Required or Optional Default GRAPHRAG_SNAPSHOT_GRAPHML Whether to enable GraphML snapshots. bool optional False GRAPHRAG_SNAPSHOT_RAW_ENTITIES Whether to enable raw entity snapshots. bool optional False GRAPHRAG_SNAPSHOT_TOP_LEVEL_NODES Whether to enable top-level node snapshots. bool optional False"}, {"location": "config/env_vars/#miscellaneous-settings", "title": "Miscellaneous Settings", "text": "Parameter Description Type Required or Optional Default GRAPHRAG_ASYNC_MODE Which async mode to use. Either asyncio or threaded. str optional asyncio GRAPHRAG_ENCODING_MODEL The text encoding model, used in tiktoken, to encode text. str optional cl100k_base GRAPHRAG_MAX_CLUSTER_SIZE The maximum number of entities to include in a single Leiden cluster. int optional 10 GRAPHRAG_SKIP_WORKFLOWS A comma-separated list of workflow names to skip. str optional None GRAPHRAG_UMAP_ENABLED Whether to enable UMAP layouts bool optional False"}, {"location": "config/init/", "title": "Configuring GraphRAG Indexing", "text": "To start using GraphRAG, you need to configure the system. The init command is the easiest way to get started. It will create a .env and settings.yaml files in the specified directory with the necessary configuration settings. It will also output the default LLM prompts used by GraphRAG.
python -m graphrag.index [--init] [--root PATH]\n"}, {"location": "config/init/#options", "title": "Options", "text": "--init - Initialize the directory with the necessary configuration files.--root PATH - The root directory to initialize. Default is the current directory.python -m graphrag.index --init --root ./ragtest\n"}, {"location": "config/init/#output", "title": "Output", "text": "The init command will create the following files in the specified directory:
settings.yaml - The configuration settings file. This file contains the configuration settings for GraphRAG..env - The environment variables file. These are referenced in the settings.yaml file.prompts/ - The LLM prompts folder. This contains the default prompts used by GraphRAG; you can modify them or run the Auto Prompt Tuning command to generate new prompts adapted to your data.After initializing your workspace, you can either run the Prompt Tuning command to adapt the prompts to your data or even start running the Indexing Pipeline to index your data. For more information on configuring GraphRAG, see the Configuration documentation.
"}, {"location": "config/json_yaml/", "title": "Default Configuration Mode (using JSON/YAML)", "text": "The default configuration mode may be configured by using a settings.json or settings.yml file in the data project root. If a .env file is present along with this config file, then it will be loaded, and the environment variables defined therein will be available for token replacements in your configuration document using ${ENV_VAR} syntax.
For example:
# .env\nAPI_KEY=some_api_key\n\n# settings.json\n{\n \"llm\": {\n \"api_key\": \"${API_KEY}\"\n }\n}\n"}, {"location": "config/json_yaml/#config-sections", "title": "Config Sections", "text": ""}, {"location": "config/json_yaml/#input", "title": "input", "text": ""}, {"location": "config/json_yaml/#fields", "title": "Fields", "text": "type file|blob - The input type to use. Default=filefile_type text|csv - The type of input data to load. Either text or csv. Default is textfile_encoding str - The encoding of the input file. Default is utf-8file_pattern str - A regex to match input files. Default is .*\\.csv$ if in csv mode and .*\\.txt$ if in text mode.source_column str - (CSV Mode Only) The source column name.timestamp_column str - (CSV Mode Only) The timestamp column name.timestamp_format str - (CSV Mode Only) The source format.text_column str - (CSV Mode Only) The text column name.title_column str - (CSV Mode Only) The title column name.document_attribute_columns list[str] - (CSV Mode Only) The additional document attributes to include.connection_string str - (blob only) The Azure Storage connection string.container_name str - (blob only) The Azure Storage container name.base_dir str - The base directory to read input from, relative to the root.storage_account_blob_url str - The storage account blob URL to use.This is the base LLM configuration section. Other steps may override this configuration with their own LLM configuration.
"}, {"location": "config/json_yaml/#fields_1", "title": "Fields", "text": "api_key str - The OpenAI API key to use.type openai_chat|azure_openai_chat|openai_embedding|azure_openai_embedding - The type of LLM to use.model str - The model name.max_tokens int - The maximum number of output tokens.request_timeout float - The per-request timeout.api_base str - The API base url to use.api_version str - The API versionorganization str - The client organization.proxy str - The proxy URL to use.audience str - (Azure OpenAI only) The URI of the target Azure resource/service for which a managed identity token is requested. Used if api_key is not defined. Default=https://cognitiveservices.azure.com/.defaultdeployment_name str - The deployment name to use (Azure).model_supports_json bool - Whether the model supports JSON-mode output.tokens_per_minute int - Set a leaky-bucket throttle on tokens-per-minute.requests_per_minute int - Set a leaky-bucket throttle on requests-per-minute.max_retries int - The maximum number of retries to use.max_retry_wait float - The maximum backoff time.sleep_on_rate_limit_recommendation bool - Whether to adhere to sleep recommendations (Azure).concurrent_requests int The number of open requests to allow at once.temperature float - The temperature to use.top_p float - The top-p value to use.n int - The number of completions to generate.stagger float - The threading stagger value.num_threads int - The maximum number of work threads.asyncio|threaded The async mode to use. Either asyncio or `threaded.
llm (see LLM top-level config)parallelization (see Parallelization top-level config)async_mode (see Async Mode top-level config)batch_size int - The maximum batch size to use.batch_max_tokens int - The maximum batch # of tokens.target required|all - Determines which set of embeddings to emit.skip list[str] - Which embeddings to skip.vector_store dict - The vector store to use. Configured for lancedb by default.type str - lancedb or azure_ai_search. Default=lancedbdb_uri str (only for lancedb) - The database uri. Default=storage.base_dir/lancedburl str (only for AI Search) - AI Search endpointapi_key str (optional - only for AI Search) - The AI Search api key to use.audience str (only for AI Search) - Audience for managed identity token if managed identity authentication is used.overwrite bool (only used at index creation time) - Overwrite collection if it exist. Default=Truecollection_name str - The name of a vector collection. Default=entity_description_embeddingsstrategy dict - Fully override the text-embedding strategy.size int - The max chunk size in tokens.overlap int - The chunk overlap in tokens.group_by_columns list[str] - group documents by fields before chunking.encoding_model str - The text encoding model to use. Default is to use the top-level encoding model.strategy dict - Fully override the chunking strategy.type file|memory|none|blob - The cache type to use. Default=fileconnection_string str - (blob only) The Azure Storage connection string.container_name str - (blob only) The Azure Storage container name.base_dir str - The base directory to write cache to, relative to the root.storage_account_blob_url str - The storage account blob URL to use.type file|memory|blob - The storage type to use. Default=fileconnection_string str - (blob only) The Azure Storage connection string.container_name str - (blob only) The Azure Storage container name.base_dir str - The base directory to write reports to, relative to the root.storage_account_blob_url str - The storage account blob URL to use.type file|console|blob - The reporting type to use. Default=fileconnection_string str - (blob only) The Azure Storage connection string.container_name str - (blob only) The Azure Storage container name.base_dir str - The base directory to write reports to, relative to the root.storage_account_blob_url str - The storage account blob URL to use.llm (see LLM top-level config)parallelization (see Parallelization top-level config)async_mode (see Async Mode top-level config)prompt str - The prompt file to use.entity_types list[str] - The entity types to identify.max_gleanings int - The maximum number of gleaning cycles to use.encoding_model str - The text encoding model to use. By default, this will use the top-level encoding model.strategy dict - Fully override the entity extraction strategy.llm (see LLM top-level config)parallelization (see Parallelization top-level config)async_mode (see Async Mode top-level config)prompt str - The prompt file to use.max_length int - The maximum number of output tokens per summarization.strategy dict - Fully override the summarize description strategy.enabled bool - Whether to enable claim extraction. default=Falsellm (see LLM top-level config)parallelization (see Parallelization top-level config)async_mode (see Async Mode top-level config)prompt str - The prompt file to use.description str - Describes the types of claims we want to extract.max_gleanings int - The maximum number of gleaning cycles to use.encoding_model str - The text encoding model to use. 
By default, this will use the top-level encoding model.strategy dict - Fully override the claim extraction strategy.llm (see LLM top-level config)parallelization (see Parallelization top-level config)async_mode (see Async Mode top-level config)prompt str - The prompt file to use.max_length int - The maximum number of output tokens per report.max_input_length int - The maximum number of input tokens to use when generating reports.strategy dict - Fully override the community reports strategy.max_cluster_size int - The maximum cluster size to emit.strategy dict - Fully override the cluster_graph strategy.enabled bool - Whether to enable graph embeddings.num_walks int - The node2vec number of walks.walk_length int - The node2vec walk length.window_size int - The node2vec window size.iterations int - The node2vec number of iterations.random_seed int - The node2vec random seed.strategy dict - Fully override the embed graph strategy.enabled bool - Whether to enable UMAP layouts.graphml bool - Emit graphml snapshots.raw_entities bool - Emit raw entity snapshots.top_level_nodes bool - Emit top-level-node snapshots.str - The text encoding model to use. Default=cl100k_base.
list[str] - Which workflow names to skip.
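The last two entries correspond to the top-level encoding_model and skip_workflows keys in settings.yaml; for example (a sketch mirroring the generated defaults):
encoding_model: cl100k_base   # tiktoken encoding used for token counting
skip_workflows: []            # add workflow names here to skip them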
"}, {"location": "config/overview/", "title": "Configuring GraphRAG Indexing", "text": "The GraphRAG system is highly configurable. This page provides an overview of the configuration options available for the GraphRAG indexing engine.
"}, {"location": "config/overview/#default-configuration-mode", "title": "Default Configuration Mode", "text": "The default configuration mode is the simplest way to get started with the GraphRAG system. It is designed to work out-of-the-box with minimal configuration. The primary configuration sections for the Indexing Engine pipelines are described below. The main ways to set up GraphRAG in Default Configuration mode are via:
Custom configuration mode is an advanced use-case. Most users will want to use the Default Configuration instead. The primary configuration sections for Indexing Engine pipelines are described below. Details about how to use custom configuration are available in the Custom Configuration Mode documentation.
"}, {"location": "config/template/", "title": "Configuration Template", "text": "The following template can be used and stored as a .env in the the directory where you're are pointing the --root parameter on your Indexing Pipeline execution.
For details about how to run the Indexing Pipeline, refer to the Index CLI documentation.
"}, {"location": "config/template/#env-file-template", "title": ".env File Template", "text": "Required variables are uncommented. All the optional configuration can be turned on or off as needed.
"}, {"location": "config/template/#minimal-configuration", "title": "Minimal Configuration", "text": "# Base LLM Settings\nGRAPHRAG_API_KEY=\"your_api_key\"\nGRAPHRAG_API_BASE=\"http://<domain>.openai.azure.com\" # For Azure OpenAI Users\nGRAPHRAG_API_VERSION=\"api_version\" # For Azure OpenAI Users\n\n# Text Generation Settings\nGRAPHRAG_LLM_TYPE=\"azure_openai_chat\" # or openai_chat\nGRAPHRAG_LLM_DEPLOYMENT_NAME=\"gpt-4-turbo-preview\"\nGRAPHRAG_LLM_MODEL_SUPPORTS_JSON=True\n\n# Text Embedding Settings\nGRAPHRAG_EMBEDDING_TYPE=\"azure_openai_embedding\" # or openai_embedding\nGRAPHRAG_LLM_DEPLOYMENT_NAME=\"text-embedding-3-small\"\n\n# Data Mapping Settings\nGRAPHRAG_INPUT_TYPE=\"text\"\n"}, {"location": "config/template/#full-configuration", "title": "Full Configuration", "text": "# Required LLM Config\n\n# Input Data Configuration\nGRAPHRAG_INPUT_TYPE=\"file\"\n\n# Plaintext Input Data Configuration\n# GRAPHRAG_INPUT_FILE_PATTERN=.*\\.txt\n\n# Text Input Data Configuration\nGRAPHRAG_INPUT_FILE_TYPE=\"text\"\nGRAPHRAG_INPUT_FILE_PATTERN=\".*\\.txt$\"\nGRAPHRAG_INPUT_SOURCE_COLUMN=source\n# GRAPHRAG_INPUT_TIMESTAMP_COLUMN=None\n# GRAPHRAG_INPUT_TIMESTAMP_FORMAT=None\n# GRAPHRAG_INPUT_TEXT_COLUMN=\"text\"\n# GRAPHRAG_INPUT_ATTRIBUTE_COLUMNS=id\n# GRAPHRAG_INPUT_TITLE_COLUMN=\"title\"\n# GRAPHRAG_INPUT_TYPE=\"file\"\n# GRAPHRAG_INPUT_CONNECTION_STRING=None\n# GRAPHRAG_INPUT_CONTAINER_NAME=None\n# GRAPHRAG_INPUT_BASE_DIR=None\n\n# Base LLM Settings\nGRAPHRAG_API_KEY=\"your_api_key\"\nGRAPHRAG_API_BASE=\"http://<domain>.openai.azure.com\" # For Azure OpenAI Users\nGRAPHRAG_API_VERSION=\"api_version\" # For Azure OpenAI Users\n# GRAPHRAG_API_ORGANIZATION=None\n# GRAPHRAG_API_PROXY=None\n\n# Text Generation Settings\n# GRAPHRAG_LLM_TYPE=openai_chat\nGRAPHRAG_LLM_API_KEY=\"your_api_key\" # If GRAPHRAG_API_KEY is not set\nGRAPHRAG_LLM_API_BASE=\"http://<domain>.openai.azure.com\" # For Azure OpenAI Users and if GRAPHRAG_API_BASE is not set\nGRAPHRAG_LLM_API_VERSION=\"api_version\" # For Azure OpenAI Users and if GRAPHRAG_API_VERSION is not set\nGRAPHRAG_LLM_MODEL_SUPPORTS_JSON=True # Suggested by default\n# GRAPHRAG_LLM_API_ORGANIZATION=None\n# GRAPHRAG_LLM_API_PROXY=None\n# GRAPHRAG_LLM_DEPLOYMENT_NAME=None\n# GRAPHRAG_LLM_MODEL=gpt-4-turbo-preview\n# GRAPHRAG_LLM_MAX_TOKENS=4000\n# GRAPHRAG_LLM_REQUEST_TIMEOUT=180\n# GRAPHRAG_LLM_THREAD_COUNT=50\n# GRAPHRAG_LLM_THREAD_STAGGER=0.3\n# GRAPHRAG_LLM_CONCURRENT_REQUESTS=25\n# GRAPHRAG_LLM_TPM=0\n# GRAPHRAG_LLM_RPM=0\n# GRAPHRAG_LLM_MAX_RETRIES=10\n# GRAPHRAG_LLM_MAX_RETRY_WAIT=10\n# GRAPHRAG_LLM_SLEEP_ON_RATE_LIMIT_RECOMMENDATION=True\n\n# Text Embedding Settings\n# GRAPHRAG_EMBEDDING_TYPE=openai_embedding\nGRAPHRAG_EMBEDDING_API_KEY=\"your_api_key\" # If GRAPHRAG_API_KEY is not set\nGRAPHRAG_EMBEDDING_API_BASE=\"http://<domain>.openai.azure.com\" # For Azure OpenAI Users and if GRAPHRAG_API_BASE is not set\nGRAPHRAG_EMBEDDING_API_VERSION=\"api_version\" # For Azure OpenAI Users and if GRAPHRAG_API_VERSION is not set\n# GRAPHRAG_EMBEDDING_API_ORGANIZATION=None\n# GRAPHRAG_EMBEDDING_API_PROXY=None\n# GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME=None\n# GRAPHRAG_EMBEDDING_MODEL=text-embedding-3-small\n# GRAPHRAG_EMBEDDING_BATCH_SIZE=16\n# GRAPHRAG_EMBEDDING_BATCH_MAX_TOKENS=8191\n# GRAPHRAG_EMBEDDING_TARGET=required\n# GRAPHRAG_EMBEDDING_SKIP=None\n# GRAPHRAG_EMBEDDING_THREAD_COUNT=None\n# GRAPHRAG_EMBEDDING_THREAD_STAGGER=50\n# GRAPHRAG_EMBEDDING_CONCURRENT_REQUESTS=25\n# GRAPHRAG_EMBEDDING_TPM=0\n# GRAPHRAG_EMBEDDING_RPM=0\n# 
GRAPHRAG_EMBEDDING_MAX_RETRIES=10\n# GRAPHRAG_EMBEDDING_MAX_RETRY_WAIT=10\n# GRAPHRAG_EMBEDDING_SLEEP_ON_RATE_LIMIT_RECOMMENDATION=True\n\n# Data Mapping Settings\n# GRAPHRAG_INPUT_ENCODING=utf-8\n\n# Data Chunking\n# GRAPHRAG_CHUNK_SIZE=1200\n# GRAPHRAG_CHUNK_OVERLAP=100\n# GRAPHRAG_CHUNK_BY_COLUMNS=id\n\n# Prompting Overrides\n# GRAPHRAG_ENTITY_EXTRACTION_PROMPT_FILE=None\n# GRAPHRAG_ENTITY_EXTRACTION_MAX_GLEANINGS=1\n# GRAPHRAG_ENTITY_EXTRACTION_ENTITY_TYPES=organization,person,event,geo\n# GRAPHRAG_SUMMARIZE_DESCRIPTIONS_PROMPT_FILE=None\n# GRAPHRAG_SUMMARIZE_DESCRIPTIONS_MAX_LENGTH=500\n# GRAPHRAG_CLAIM_EXTRACTION_DESCRIPTION=\"Any claims or facts that could be relevant to threat analysis.\"\n# GRAPHRAG_CLAIM_EXTRACTION_PROMPT_FILE=None\n# GRAPHRAG_CLAIM_EXTRACTION_MAX_GLEANINGS=1\n# GRAPHRAG_COMMUNITY_REPORT_PROMPT_FILE=None\n# GRAPHRAG_COMMUNITY_REPORT_MAX_LENGTH=1500\n\n# Storage\n# GRAPHRAG_STORAGE_TYPE=file\n# GRAPHRAG_STORAGE_CONNECTION_STRING=None\n# GRAPHRAG_STORAGE_CONTAINER_NAME=None\n# GRAPHRAG_STORAGE_BASE_DIR=None\n\n# Cache\n# GRAPHRAG_CACHE_TYPE=file\n# GRAPHRAG_CACHE_CONNECTION_STRING=None\n# GRAPHRAG_CACHE_CONTAINER_NAME=None\n# GRAPHRAG_CACHE_BASE_DIR=None\n\n# Reporting\n# GRAPHRAG_REPORTING_TYPE=file\n# GRAPHRAG_REPORTING_CONNECTION_STRING=None\n# GRAPHRAG_REPORTING_CONTAINER_NAME=None\n# GRAPHRAG_REPORTING_BASE_DIR=None\n\n# Node2Vec Parameters\n# GRAPHRAG_NODE2VEC_ENABLED=False\n# GRAPHRAG_NODE2VEC_NUM_WALKS=10\n# GRAPHRAG_NODE2VEC_WALK_LENGTH=40\n# GRAPHRAG_NODE2VEC_WINDOW_SIZE=2\n# GRAPHRAG_NODE2VEC_ITERATIONS=3\n# GRAPHRAG_NODE2VEC_RANDOM_SEED=597832\n\n# Data Snapshotting\n# GRAPHRAG_SNAPSHOT_GRAPHML=False\n# GRAPHRAG_SNAPSHOT_RAW_ENTITIES=False\n# GRAPHRAG_SNAPSHOT_TOP_LEVEL_NODES=False\n\n# Miscellaneous Settings\n# GRAPHRAG_ASYNC_MODE=asyncio\n# GRAPHRAG_ENCODING_MODEL=cl100k_base\n# GRAPHRAG_MAX_CLUSTER_SIZE=10\n# GRAPHRAG_SKIP_WORKFLOWS=None\n# GRAPHRAG_UMAP_ENABLED=False\n"}, {"location": "data/operation_dulce/ABOUT/", "title": "About", "text": "This document (Operation Dulce) is an AI-generated science fiction novella, included here for the purposes of integration testing.
"}, {"location": "index/architecture/", "title": "Indexing Architecture", "text": ""}, {"location": "index/architecture/#key-concepts", "title": "Key Concepts", "text": ""}, {"location": "index/architecture/#knowledge-model", "title": "Knowledge Model", "text": "In order to support the GraphRAG system, the outputs of the indexing engine (in the Default Configuration Mode) are aligned to a knowledge model we call the GraphRAG Knowledge Model. This model is designed to be an abstraction over the underlying data storage technology, and to provide a common interface for the GraphRAG system to interact with. In normal use-cases the outputs of the GraphRAG Indexer would be loaded into a database system, and the GraphRAG's Query Engine would interact with the database using the knowledge model data-store types.
"}, {"location": "index/architecture/#datashaper-workflows", "title": "DataShaper Workflows", "text": "GraphRAG's Indexing Pipeline is built on top of our open-source library, DataShaper. DataShaper is a data processing library that allows users to declaratively express data pipelines, schemas, and related assets using well-defined schemas. DataShaper has implementations in JavaScript and Python, and is designed to be extensible to other languages.
One of the core resource types within DataShaper is a Workflow. Workflows are expressed as sequences of steps, which we call verbs. Each step has a verb name and a configuration object. In DataShaper, these verbs model relational concepts such as SELECT, DROP, JOIN, etc. Each verb transforms an input data table, and that table is passed down the pipeline.
---\ntitle: Sample Workflow\n---\nflowchart LR\n input[Input Table] --> select[SELECT] --> join[JOIN] --> binarize[BINARIZE] --> output[Output Table]"}, {"location": "index/architecture/#llm-based-workflow-steps", "title": "LLM-based Workflow Steps", "text": "GraphRAG's Indexing Pipeline implements a handful of custom verbs on top of the standard, relational verbs that our DataShaper library provides. These verbs give us the ability to augment text documents with rich, structured data using the power of LLMs such as GPT-4. We utilize these verbs in our standard workflow to extract entities, relationships, claims, community structures, and community reports and summaries. This behavior is customizable and can be extended to support many kinds of AI-based data enrichment and extraction tasks.
"}, {"location": "index/architecture/#workflow-graphs", "title": "Workflow Graphs", "text": "Because of the complexity of our data indexing tasks, we needed to be able to express our data pipeline as series of multiple, interdependent workflows. In the GraphRAG Indexing Pipeline, each workflow may define dependencies on other workflows, effectively forming a directed acyclic graph (DAG) of workflows, which is then used to schedule processing.
---\ntitle: Sample Workflow DAG\n---\nstateDiagram-v2\n [*] --> Prepare\n Prepare --> Chunk\n Chunk --> ExtractGraph\n Chunk --> EmbedDocuments\n ExtractGraph --> GenerateReports\n ExtractGraph --> EmbedEntities\n ExtractGraph --> EmbedGraph"}, {"location": "index/architecture/#dataframe-message-format", "title": "Dataframe Message Format", "text": "The primary unit of communication between workflows, and between workflow steps is an instance of pandas.DataFrame. Although side-effects are possible, our goal is to be data-centric and table-centric in our approach to data processing. This allows us to easily reason about our data, and to leverage the power of dataframe-based ecosystems. Our underlying dataframe technology may change over time, but our primary goal is to support the DataShaper workflow schema while retaining single-machine ease of use and developer ergonomics.
The GraphRAG library was designed with LLM interactions in mind, and a common setback when working with LLM APIs is the variety of errors caused by network latency, throttling, etc. Because of these potential error cases, we've added a cache layer around LLM interactions. When completion requests are made using the same input set (prompt and tuning parameters), we return a cached result if one exists. This allows our indexer to be more resilient to network issues, to act idempotently, and to provide a more efficient end-user experience.
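In the default configuration this cache is file-based; it can be redirected or disabled through the cache settings (see the configuration documentation) or with the --nocache CLI flag described below. A sketch of the settings.yaml form, with illustrative values:
cache:
  type: file        # or memory, blob, none
  base_dir: cache   # relative to the project root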
"}, {"location": "index/cli/", "title": "Indexer CLI", "text": "The GraphRAG indexer CLI allows for no-code usage of the GraphRAG Indexer.
python -m graphrag.index --verbose --root </workspace/project/root> \\\n--config <custom_config.yml> --resume <timestamp> \\\n--reporter <rich|print|none> --emit json,csv,parquet \\\n--nocache\n"}, {"location": "index/cli/#cli-arguments", "title": "CLI Arguments", "text": "--verbose - Adds extra logging information during the run.--root <data-project-dir> - the data root directory. This should contain an input directory with the input data, and an .env file with environment variables. These are described below.--init - This will initialize the data project directory at the specified root with bootstrap configuration and prompt-overrides.--resume <output-timestamp> - if specified, the pipeline will attempt to resume a prior run. The parquet files from the prior run will be loaded into the system as inputs, and the workflows that generated those files will be skipped. The input value should be the timestamped output folder, e.g. \"20240105-143721\".--config <config_file.yml> - This will opt-out of the Default Configuration mode and execute a custom configuration. If this is used, then none of the environment-variables below will apply.--reporter <reporter> - This will specify the progress reporter to use. The default is rich. Valid values are rich, print, and none.--emit <types> - This specifies the table output formats the pipeline should emit. The default is parquet. Valid values are parquet, csv, and json, comma-separated.--nocache - This will disable the caching mechanism. This is useful for debugging and development, but should not be used in production.--output <directory> - Specify the output directory for pipeline artifacts.--reports <directory> - Specify the output directory for reporting.The knowledge model is a specification for data outputs that conform to our data-model definition. You can find these definitions in the python/graphrag/graphrag/model folder within the GraphRAG repository. The following entity types are provided. The fields here represent the fields that are text-embedded by default.
Document - An input document into the system. These either represent individual rows in a CSV or individual .txt file.TextUnit - A chunk of text to analyze. The size of these chunks, their overlap, and whether they adhere to any data boundaries may be configured below. A common use case is to set CHUNK_BY_COLUMNS to id so that there is a 1-to-many relationship between documents and TextUnits instead of a many-to-many.Entity - An entity extracted from a TextUnit. These represent people, places, events, or some other entity-model that you provide.Relationship - A relationship between two entities. These are generated from the covariates.Covariate - Extracted claim information, which contains statements about entities which may be time-bound.Community Report - Once entities are generated, we perform hierarchical community detection on them and generate reports for each community in this hierarchy.Node - This table contains layout information for rendered graph-views of the Entities and Documents which have been embedded and clustered.Let's take a look at how the default-configuration workflow transforms text documents into the GraphRAG Knowledge Model. This page gives a general overview of the major steps in this process. To fully configure this workflow, check out the configuration documentation.
---\ntitle: Dataflow Overview\n---\nflowchart TB\n subgraph phase1[Phase 1: Compose TextUnits]\n documents[Documents] --> chunk[Chunk]\n chunk --> embed[Embed] --> textUnits[Text Units]\n end\n subgraph phase2[Phase 2: Graph Extraction]\n textUnits --> graph_extract[Entity & Relationship Extraction]\n graph_extract --> graph_summarize[Entity & Relationship Summarization]\n graph_summarize --> claim_extraction[Claim Extraction]\n claim_extraction --> graph_outputs[Graph Tables]\n end\n subgraph phase3[Phase 3: Graph Augmentation]\n graph_outputs --> community_detect[Community Detection]\n community_detect --> graph_embed[Graph Embedding]\n graph_embed --> augmented_graph[Augmented Graph Tables]\n end\n subgraph phase4[Phase 4: Community Summarization]\n augmented_graph --> summarized_communities[Community Summarization]\n summarized_communities --> embed_communities[Community Embedding]\n embed_communities --> community_outputs[Community Tables]\n end\n subgraph phase5[Phase 5: Document Processing]\n documents --> link_to_text_units[Link to TextUnits]\n textUnits --> link_to_text_units\n link_to_text_units --> embed_documents[Document Embedding]\n embed_documents --> document_graph[Document Graph Creation]\n document_graph --> document_outputs[Document Tables]\n end\n subgraph phase6[Phase 6: Network Visualization]\n document_outputs --> umap_docs[Umap Documents]\n augmented_graph --> umap_entities[Umap Entities]\n umap_docs --> combine_nodes[Nodes Table]\n umap_entities --> combine_nodes\n end"}, {"location": "index/default_dataflow/#phase-1-compose-textunits", "title": "Phase 1: Compose TextUnits", "text": "The first phase of the default-configuration workflow is to transform input documents into TextUnits. A TextUnit is a chunk of text that is used for our graph extraction techniques. They are also used as source-references by extracted knowledge items in order to empower breadcrumbs and provenance by concepts back to their original source tex.
The chunk size (counted in tokens) is user-configurable. By default this is set to 300 tokens, although we've had positive experience with 1200-token chunks using a single \"glean\" step (a follow-on extraction pass). Larger chunks result in lower-fidelity output and less meaningful reference texts; however, they can also result in much faster processing time.
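To make the idea concrete, here is a minimal sketch of token-based chunking using the tiktoken tokenizer and the cl100k_base encoding mentioned elsewhere in this documentation. This is an illustration only, not GraphRAG's actual chunker, and the chunk size and overlap values are examples.

```python
# Illustrative token-based chunking; not GraphRAG's actual implementation.
# Assumes the tiktoken package; chunk_size and overlap values are examples only.
import tiktoken


def chunk_text(text: str, chunk_size: int = 300, overlap: int = 100) -> list[str]:
    """Split text into ~chunk_size-token windows with `overlap` tokens of overlap."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(encoding.decode(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks


text_units = chunk_text("an example document body " * 400)
print(f"{len(text_units)} text units")
```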
The group-by configuration is also user-configurable. By default, we align our chunks to document boundaries, meaning that there is a strict 1-to-many relationship between Documents and TextUnits. In rare cases, this can be turned into a many-to-many relationship. This is useful when the documents are very short and we need several of them to compose a meaningful analysis unit (e.g. tweets or a chat log).
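As a hedged illustration of that grouping idea, short records can be concatenated by a grouping column before chunking so that each group yields a meaningful analysis unit. The column names below are made up for the example and are not GraphRAG configuration keys.

```python
# Hypothetical illustration of grouping short records before chunking.
# The "author" and "text" column names are invented for this example.
import pandas as pd

rows = pd.DataFrame(
    {
        "author": ["a", "a", "b", "b", "b"],
        "text": ["short post 1", "short post 2", "short post 3", "short post 4", "short post 5"],
    }
)

# Concatenate each author's posts into a single analysis unit before chunking.
grouped = rows.groupby("author")["text"].apply("\n".join).reset_index(name="combined_text")
print(grouped)
```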
Each of these text-units is text-embedded and passed into the next phase of the pipeline.
---\ntitle: Documents into Text Chunks\n---\nflowchart LR\n doc1[Document 1] --> tu1[TextUnit 1]\n doc1 --> tu2[TextUnit 2]\n doc2[Document 2] --> tu3[TextUnit 3]\n doc2 --> tu4[TextUnit 4]\n"}, {"location": "index/default_dataflow/#phase-2-graph-extraction", "title": "Phase 2: Graph Extraction", "text": "In this phase, we analyze each text unit and extract our graph primitives: Entities, Relationships, and Claims. Entities and Relationships are extracted at once in our entity_extract verb, and claims are extracted in our claim_extract verb. Results are then combined and passed into the following phases of the pipeline.
---\ntitle: Graph Extraction\n---\nflowchart LR\n tu[TextUnit] --> ge[Graph Extraction] --> gs[Graph Summarization]\n tu --> ce[Claim Extraction]"}, {"location": "index/default_dataflow/#entity-relationship-extraction", "title": "Entity & Relationship Extraction", "text": "In this first step of graph extraction, we process each text-unit in order to extract entities and relationships out of the raw text using the LLM. The output of this step is a subgraph-per-TextUnit containing a list of entities with a name, type, and description, and a list of relationships with a source, target, and description.
These subgraphs are merged together - any entities with the same name and type are merged by creating an array of their descriptions. Similarly, any relationships with the same source and target are merged by creating an array of their descriptions.
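A minimal sketch of that merge step in plain Python follows; the dictionary shapes are assumptions for illustration, not GraphRAG's internal data structures.

```python
# Illustrative merge of per-TextUnit subgraphs; data shapes are hypothetical.
from collections import defaultdict

subgraphs = [
    {
        "entities": [{"name": "ENTITY A", "type": "PERSON", "description": "Team lead"}],
        "relationships": [{"source": "ENTITY A", "target": "ENTITY B", "description": "Works with"}],
    },
    {
        "entities": [{"name": "ENTITY A", "type": "PERSON", "description": "Handles communication"}],
        "relationships": [{"source": "ENTITY A", "target": "ENTITY B", "description": "Mentors"}],
    },
]

merged_entities: dict[tuple[str, str], list[str]] = defaultdict(list)
merged_relationships: dict[tuple[str, str], list[str]] = defaultdict(list)

for sg in subgraphs:
    for ent in sg["entities"]:
        # Entities with the same (name, type) accumulate an array of descriptions.
        merged_entities[(ent["name"], ent["type"])].append(ent["description"])
    for rel in sg["relationships"]:
        # Relationships with the same (source, target) accumulate an array of descriptions.
        merged_relationships[(rel["source"], rel["target"])].append(rel["description"])

print(dict(merged_entities))
print(dict(merged_relationships))
```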
"}, {"location": "index/default_dataflow/#entity-relationship-summarization", "title": "Entity & Relationship Summarization", "text": "Now that we have a graph of entities and relationships, each with a list of descriptions, we can summarize these lists into a single description per entity and relationship. This is done by asking the LLM for a short summary that captures all of the distinct information from each description. This allows all of our entities and relationships to have a single concise description.
"}, {"location": "index/default_dataflow/#claim-extraction-emission", "title": "Claim Extraction & Emission", "text": "Finally, as an independent workflow, we extract claims from the source TextUnits. These claims represent positive factual statements with an evaluated status and time-bounds. These are emitted as a primary artifact called Covariates.
Note: claim extraction is optional and turned off by default. This is because claim extraction generally needs prompt tuning to be useful.
"}, {"location": "index/default_dataflow/#phase-3-graph-augmentation", "title": "Phase 3: Graph Augmentation", "text": "Now that we have a usable graph of entities and relationships, we want to understand their community structure and augment the graph with additional information. This is done in two steps: Community Detection and Graph Embedding. These give us explicit (communities) and implicit (embeddings) ways of understanding the topological structure of our graph.
---\ntitle: Graph Augmentation\n---\nflowchart LR\n cd[Leiden Hierarchical Community Detection] --> ge[Node2Vec Graph Embedding] --> ag[Graph Table Emission]"}, {"location": "index/default_dataflow/#community-detection", "title": "Community Detection", "text": "In this step, we generate a hierarchy of entity communities using the Hierarchical Leiden Algorithm. This method will apply a recursive community-clustering to our graph until we reach a community-size threshold. This will allow us to understand the community structure of our graph and provide a way to navigate and summarize the graph at different levels of granularity.
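The sketch below assumes the graspologic package, which provides a hierarchical_leiden implementation of this algorithm; the exact parameters and result fields may vary by version, and the karate-club graph stands in for the extracted entity graph.

```python
# Sketch of hierarchical community detection. Assumes graspologic's
# hierarchical_leiden; result field names (node, cluster, level) may vary by version.
import networkx as nx
from graspologic.partition import hierarchical_leiden

graph = nx.karate_club_graph()  # stand-in for the extracted entity graph

# Recursively cluster until communities fall under the size threshold.
partitions = hierarchical_leiden(graph, max_cluster_size=10)

for p in list(partitions)[:5]:
    print(f"node={p.node} cluster={p.cluster} level={p.level}")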
"}, {"location": "index/default_dataflow/#graph-embedding", "title": "Graph Embedding", "text": "In this step, we generate a vector representation of our graph using the Node2Vec algorithm. This will allow us to understand the implicit structure of our graph and provide an additional vector-space in which to search for related concepts during our query phase.
"}, {"location": "index/default_dataflow/#graph-tables-emission", "title": "Graph Tables Emission", "text": "Once our graph augmentation steps are complete, the final Entities and Relationships tables are emitted after their text fields are text-embedded.
"}, {"location": "index/default_dataflow/#phase-4-community-summarization", "title": "Phase 4: Community Summarization", "text": "---\ntitle: Community Summarization\n---\nflowchart LR\n sc[Generate Community Reports] --> ss[Summarize Community Reports] --> ce[Community Embedding] --> co[Community Tables Emission] At this point, we have a functional graph of entities and relationships, a hierarchy of communities for the entities, as well as node2vec embeddings.
Now we want to build on the community data and generate reports for each community. This gives us a high-level understanding of the graph at several levels of granularity. For example, if community A is the top-level community, we'll get a report about the entire graph. If the community is lower-level, we'll get a report about a local cluster.
"}, {"location": "index/default_dataflow/#generate-community-reports", "title": "Generate Community Reports", "text": "In this step, we generate a summary of each community using the LLM. This will allow us to understand the distinct information contained within each community and provide a scoped understanding of the graph, from either a high-level or a low-level perspective. These reports contain an executive overview and reference the key entities, relationships, and claims within the community sub-structure.
"}, {"location": "index/default_dataflow/#summarize-community-reports", "title": "Summarize Community Reports", "text": "In this step, each community report is then summarized via the LLM for shorthand use.
"}, {"location": "index/default_dataflow/#community-embedding", "title": "Community Embedding", "text": "In this step, we generate a vector representation of our communities by generating text embeddings of the community report, the community report summary, and the title of the community report.
"}, {"location": "index/default_dataflow/#community-tables-emission", "title": "Community Tables Emission", "text": "At this point, some bookkeeping work is performed and we emit the Communities and CommunityReports tables.
"}, {"location": "index/default_dataflow/#phase-5-document-processing", "title": "Phase 5: Document Processing", "text": "In this phase of the workflow, we create the Documents table for the knowledge model.
---\ntitle: Document Processing\n---\nflowchart LR\n aug[Augment] --> dp[Link to TextUnits] --> de[Avg. Embedding] --> dg[Document Table Emission]"}, {"location": "index/default_dataflow/#augment-with-columns-csv-only", "title": "Augment with Columns (CSV Only)", "text": "If the workflow is operating on CSV data, you may configure your workflow to add additional fields to Documents output. These fields should exist on the incoming CSV tables. Details about configuring this can be found in the configuration documentation.
"}, {"location": "index/default_dataflow/#link-to-textunits", "title": "Link to TextUnits", "text": "In this step, we link each document to the text-units that were created in the first phase. This allows us to understand which documents are related to which text-units and vice-versa.
"}, {"location": "index/default_dataflow/#document-embedding", "title": "Document Embedding", "text": "In this step, we generate a vector representation of our documents using an average embedding of document slices. We re-chunk documents without overlapping chunks, and then generate an embedding for each chunk. We create an average of these chunks weighted by token-count and use this as the document embedding. This will allow us to understand the implicit relationship between documents, and will help us generate a network representation of our documents.
"}, {"location": "index/default_dataflow/#documents-table-emission", "title": "Documents Table Emission", "text": "At this point, we can emit the Documents table into the knowledge Model.
"}, {"location": "index/default_dataflow/#phase-6-network-visualization", "title": "Phase 6: Network Visualization", "text": "In this phase of the workflow, we perform some steps to support network visualization of our high-dimensional vector spaces within our existing graphs. At this point there are two logical graphs at play: the Entity-Relationship graph and the Document graph.
---\ntitle: Network Visualization Workflows\n---\nflowchart LR\n nv[Umap Documents] --> ne[Umap Entities] --> ng[Nodes Table Emission] For each of the logical graphs, we perform a UMAP dimensionality reduction to generate a 2D representation of the graph. This will allow us to visualize the graph in a 2D space and understand the relationships between the nodes in the graph. The UMAP embeddings are then emitted as a table of Nodes. The rows of this table include a discriminator indicating whether the node is a document or an entity, and the UMAP coordinates.
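A hedged sketch of that reduction using the umap-learn package follows; the input vectors are random stand-ins for the entity and document embeddings described above, and the table layout is illustrative.

```python
# Illustrative 2D reduction with the umap-learn package; the input vectors are
# random stand-ins for the entity and document embeddings described above.
import numpy as np
import pandas as pd
import umap

rng = np.random.default_rng(42)
entity_vectors = rng.normal(size=(60, 64))
document_vectors = rng.normal(size=(30, 64))


def to_2d(vectors: np.ndarray) -> np.ndarray:
    return umap.UMAP(n_components=2, random_state=42).fit_transform(vectors)


entity_xy = to_2d(entity_vectors)
document_xy = to_2d(document_vectors)

# Nodes table: a discriminator column plus UMAP coordinates.
nodes = pd.concat(
    [
        pd.DataFrame({"type": "entity", "x": entity_xy[:, 0], "y": entity_xy[:, 1]}),
        pd.DataFrame({"type": "document", "x": document_xy[:, 0], "y": document_xy[:, 1]}),
    ],
    ignore_index=True,
)
print(nodes.head())
```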
"}, {"location": "index/overview/", "title": "GraphRAG Indexing \ud83e\udd16", "text": "The GraphRAG indexing package is a data pipeline and transformation suite that is designed to extract meaningful, structured data from unstructured text using LLMs.
Indexing Pipelines are configurable. They are composed of workflows, standard and custom steps, prompt templates, and input/output adapters. Our standard pipeline is designed to extract entities, relationships, and claims from raw text, perform community detection over the extracted entities, generate community summaries and reports at multiple levels of granularity, and embed entities, text units, and documents into vector spaces.
The outputs of the pipeline can be stored in a variety of formats, including JSON and Parquet - or they can be handled manually via the Python API.
"}, {"location": "index/overview/#getting-started", "title": "Getting Started", "text": ""}, {"location": "index/overview/#requirements", "title": "Requirements", "text": "See the requirements section in Get Started for details on setting up a development environment.
The Indexing Engine can be used in either a default configuration mode or with a custom pipeline. To configure GraphRAG, see the configuration documentation. After you have a config file you can run the pipeline using the CLI or the Python API.
"}, {"location": "index/overview/#usage", "title": "Usage", "text": ""}, {"location": "index/overview/#cli", "title": "CLI", "text": "# Via Poetry\npoetry run poe cli --root <data_root> # default config mode\npoetry run poe cli --config your_pipeline.yml # custom config mode\n\n# Via Node\nyarn run:index --root <data_root> # default config mode\nyarn run:index --config your_pipeline.yml # custom config mode\n"}, {"location": "index/overview/#python-api", "title": "Python API", "text": "from graphrag.index import run_pipeline\nfrom graphrag.index.config import PipelineWorkflowReference\n\nworkflows: list[PipelineWorkflowReference] = [\n PipelineWorkflowReference(\n steps=[\n {\n # built-in verb\n \"verb\": \"derive\", # https://github.com/microsoft/datashaper/blob/main/python/datashaper/datashaper/verbs/derive.py\n \"args\": {\n \"column1\": \"col1\", # from above\n \"column2\": \"col2\", # from above\n \"to\": \"col_multiplied\", # new column name\n \"operator\": \"*\", # multiply the two columns\n },\n # Since we're trying to act on the default input, we don't need explicitly to specify an input\n }\n ]\n ),\n]\n\ndataset = pd.DataFrame([{\"col1\": 2, \"col2\": 4}, {\"col1\": 5, \"col2\": 10}])\noutputs = []\nasync for output in await run_pipeline(dataset=dataset, workflows=workflows):\n outputs.append(output)\npipeline_result = outputs[-1]\nprint(pipeline_result)\n"}, {"location": "index/overview/#further-reading", "title": "Further Reading", "text": "GraphRAG provides the ability to create domain adapted prompts for the generation of the knowledge graph. This step is optional, though it is highly encouraged to run it as it will yield better results when executing an Index Run.
These are generated by loading the inputs, splitting them into chunks (text units), and then running a series of LLM invocations and template substitutions to generate the final prompts. We suggest using the default values provided by the script, but on this page you'll find details of each option in case you want to explore further and tweak the prompt tuning algorithm.
Figure 1: Auto Tuning Conceptual Diagram.
"}, {"location": "prompt_tuning/auto_prompt_tuning/#prerequisites", "title": "Prerequisites", "text": "Before running auto tuning make sure you have already initialized your workspace with the graphrag.index --init command. This will create the necessary configuration files and the default prompts. Refer to the Init Documentation for more information about the initialization process.
You can run the main script from the command line with various options:
python -m graphrag.prompt_tune [--root ROOT] [--domain DOMAIN] [--method METHOD] [--limit LIMIT] [--language LANGUAGE] \\\n[--max-tokens MAX_TOKENS] [--chunk-size CHUNK_SIZE] [--n-subset-max N_SUBSET_MAX] [--k K] \\\n[--min-examples-required MIN_EXAMPLES_REQUIRED] [--no-entity-types] [--output OUTPUT]\n"}, {"location": "prompt_tuning/auto_prompt_tuning/#command-line-options", "title": "Command-Line Options", "text": "--config (required): The path to the configuration file. This is required to load the data and model settings.
--root (optional): The data project root directory, including the config files (YML, JSON, or .env). Defaults to the current directory.
--domain (optional): The domain related to your input data, such as 'space science', 'microbiology', or 'environmental news'. If left empty, the domain will be inferred from the input data.
--method (optional): The method to select documents. Options are all, random, auto or top. Default is random.
--limit (optional): The limit of text units to load when using random or top selection. Default is 15.
--language (optional): The language to use for input processing. If it is different from the inputs' language, the LLM will translate. Default is \"\" meaning it will be automatically detected from the inputs.
--max-tokens (optional): Maximum token count for prompt generation. Default is 2000.
--chunk-size (optional): The size in tokens to use for generating text units from input documents. Default is 200.
--n-subset-max (optional): The number of text chunks to embed when using auto selection method. Default is 300.
--k (optional): The number of documents to select when using auto selection method. Default is 15.
--min-examples-required (optional): The minimum number of examples required for entity extraction prompts. Default is 2.
--no-entity-types (optional): Use untyped entity extraction generation. We recommend using this when your data covers a lot of topics or it is highly randomized.
--output (optional): The folder to save the generated prompts. Default is \"prompts\".
python -m graphrag.prompt_tune --root /path/to/project --config /path/to/settings.yaml --domain \"environmental news\" \\\n--method random --limit 10 --language English --max-tokens 2048 --chunk-size 256 --min-examples-required 3 \\\n--no-entity-types --output /path/to/output\n or, with minimal configuration (suggested):
python -m graphrag.prompt_tune --root /path/to/project --config /path/to/settings.yaml --no-entity-types\n"}, {"location": "prompt_tuning/auto_prompt_tuning/#document-selection-methods", "title": "Document Selection Methods", "text": "The auto tuning feature ingests the input data and then divides it into text units the size of the chunk size parameter. After that, it uses one of the following selection methods to pick a sample to work with for prompt generation:
random: Select text units randomly. This is the default and recommended option.top: Select the first n text units.all: Use all text units for the generation. Use only with small datasets; this option is not usually recommended.auto: Embed text units in a lower-dimensional space and select the k nearest neighbors to the centroid. This is useful when you have a large dataset and want to select a representative sample.After running auto tuning, you should modify the following environment variables (or config variables) to pick up the new prompts on your index run. Note: Please make sure to use the correct path to the generated prompts; in this example we are using the default \"prompts\" path.
GRAPHRAG_ENTITY_EXTRACTION_PROMPT_FILE = \"prompts/entity_extraction.txt\"
GRAPHRAG_COMMUNITY_REPORT_PROMPT_FILE = \"prompts/community_report.txt\"
GRAPHRAG_SUMMARIZE_DESCRIPTIONS_PROMPT_FILE = \"prompts/summarize_descriptions.txt\"
or in your yaml config file:
entity_extraction:\n prompt: \"prompts/entity_extraction.txt\"\n\nsummarize_descriptions:\n prompt: \"prompts/summarize_descriptions.txt\"\n\ncommunity_reports:\n prompt: \"prompts/community_report.txt\"\n"}, {"location": "prompt_tuning/manual_prompt_tuning/", "title": "Manual Prompt Tuning \u2699\ufe0f", "text": "The GraphRAG indexer, by default, will run with a handful of prompts that are designed to work well in the broad context of knowledge discovery. However, it is quite common to want to tune the prompts to better suit your specific use case. We provide a means for you to do this by allowing you to specify custom prompt files, each of which uses a series of token-replacements internally.
Each of these prompts may be overridden by writing a custom prompt file in plaintext. We use token-replacements in the form of {token_name}, and the descriptions for the available tokens can be found below.
Prompt Source
"}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens-values-provided-by-extractor", "title": "Tokens (values provided by extractor)", "text": "Prompt Source
"}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens-values-provided-by-extractor_1", "title": "Tokens (values provided by extractor)", "text": "Prompt Source
"}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens-values-provided-by-extractor_2", "title": "Tokens (values provided by extractor)", "text": "Note: there is additional parameter for the Claim Description that is used in claim extraction. The default value is
\"Any claims or facts that could be relevant to information discovery.\"
See the configuration documentation for details on how to change this.
"}, {"location": "prompt_tuning/manual_prompt_tuning/#generate-community-reports", "title": "Generate Community Reports", "text": "Prompt Source
"}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens-values-provided-by-extractor_3", "title": "Tokens (values provided by extractor)", "text": "This page provides an overview of the prompt tuning options available for the GraphRAG indexing engine.
"}, {"location": "prompt_tuning/overview/#default-prompts", "title": "Default Prompts", "text": "The default prompts are the simplest way to get started with the GraphRAG system. It is designed to work out-of-the-box with minimal configuration. You can find more detail about these prompts in the following links:
Auto Tuning leverages your input data and LLM interactions to create domain adapted prompts for the generation of the knowledge graph. It is highly encouraged to run it as it will yield better results when executing an Index Run. For more details about how to use it, please refer to the Auto Tuning documentation.
"}, {"location": "prompt_tuning/overview/#manual-tuning", "title": "Manual Tuning", "text": "Manual tuning is an advanced use-case. Most users will want to use the Auto Tuning feature instead. Details about how to use manual configuration are available in the Manual Tuning documentation.
"}, {"location": "query/cli/", "title": "Query CLI", "text": "The GraphRAG query CLI allows for no-code usage of the GraphRAG Query engine.
python -m graphrag.query --config <config_file.yml> --data <path-to-data> --community_level <community-level> --response_type <response-type> --method <\"local\"|\"global\"> <query>\n"}, {"location": "query/cli/#cli-arguments", "title": "CLI Arguments", "text": "--config <config_file.yml> - The configuration yaml file to use when running the query. If this is used, then none of the environment-variables below will apply.--data <path-to-data> - Folder containing the .parquet output files from running the Indexer.--community_level <community-level> - Community level in the Leiden community hierarchy from which we will load the community reports; a higher value means we use reports on smaller communities. Default: 2--response_type <response-type> - Free-form text describing the response type and format; can be anything, e.g. Multiple Paragraphs, Single Paragraph, Single Sentence, List of 3-7 Points, Single Page, Multi-Page Report. Default: Multiple Paragraphs.--method <\"local\"|\"global\"> - Method to use to answer the query, one of local or global. For more information, check the Overview.--streaming - Stream back the LLM responseRequired environment variables to execute: - GRAPHRAG_API_KEY - API Key for executing the model, will fall back to OPENAI_API_KEY if one is not provided. - GRAPHRAG_LLM_MODEL - Model to use for Chat Completions. - GRAPHRAG_EMBEDDING_MODEL - Model to use for Embeddings.
You can further customize the execution by providing these environment variables:
GRAPHRAG_LLM_API_BASE - The API Base URL. Default: NoneGRAPHRAG_LLM_TYPE - The LLM operation type. Either openai_chat or azure_openai_chat. Default: openai_chatGRAPHRAG_LLM_MAX_RETRIES - The maximum number of retries to attempt when a request fails. Default: 20GRAPHRAG_EMBEDDING_API_BASE - The API Base URL. Default: NoneGRAPHRAG_EMBEDDING_TYPE - The embedding client to use. Either openai_embedding or azure_openai_embedding. Default: openai_embeddingGRAPHRAG_EMBEDDING_MAX_RETRIES - The maximum number of retries to attempt when a request fails. Default: 20GRAPHRAG_LOCAL_SEARCH_TEXT_UNIT_PROP - Proportion of context window dedicated to related text units. Default: 0.5GRAPHRAG_LOCAL_SEARCH_COMMUNITY_PROP - Proportion of context window dedicated to community reports. Default: 0.1GRAPHRAG_LOCAL_SEARCH_CONVERSATION_HISTORY_MAX_TURNS - Maximum number of turns to include in the conversation history. Default: 5GRAPHRAG_LOCAL_SEARCH_TOP_K_ENTITIES - Number of related entities to retrieve from the entity description embedding store. Default: 10GRAPHRAG_LOCAL_SEARCH_TOP_K_RELATIONSHIPS - Control the number of out-of-network relationships to pull into the context window. Default: 10GRAPHRAG_LOCAL_SEARCH_MAX_TOKENS - Change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000). Default: 12000GRAPHRAG_LOCAL_SEARCH_LLM_MAX_TOKENS - Change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000-1500). Default: 2000GRAPHRAG_GLOBAL_SEARCH_MAX_TOKENS - Change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000). Default: 12000GRAPHRAG_GLOBAL_SEARCH_DATA_MAX_TOKENS - Change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000). Default: 12000GRAPHRAG_GLOBAL_SEARCH_MAP_MAX_TOKENS - Default: 500GRAPHRAG_GLOBAL_SEARCH_REDUCE_MAX_TOKENS - Change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000-1500). Default: 2000GRAPHRAG_GLOBAL_SEARCH_CONCURRENCY - Default: 32Baseline RAG struggles with queries that require aggregation of information across the dataset to compose an answer. Queries such as \u201cWhat are the top 5 themes in the data?\u201d perform terribly because baseline RAG relies on a vector search of semantically similar text content within the dataset. There is nothing in the query to direct it to the correct information.
However, with GraphRAG we can answer such questions, because the structure of the LLM-generated knowledge graph tells us about the structure (and thus themes) of the dataset as a whole. This allows the private dataset to be organized into meaningful semantic clusters that are pre-summarized. Using our global search method, the LLM uses these clusters to summarize these themes when responding to a user query.
"}, {"location": "query/global_search/#methodology", "title": "Methodology", "text": "---\ntitle: Global Search Dataflow\n---\n%%{ init: { 'flowchart': { 'curve': 'step' } } }%%\nflowchart LR\n\n uq[User Query] --- .1\n ch1[Conversation History] --- .1\n\n subgraph RIR\n direction TB\n ri1[Rated Intermediate<br/>Response 1]~~~ri2[Rated Intermediate<br/>Response 2] -.\"{1..N}\".-rin[Rated Intermediate<br/>Response N]\n end\n\n .1--Shuffled Community<br/>Report Batch 1-->RIR\n .1--Shuffled Community<br/>Report Batch 2-->RIR---.2\n .1--Shuffled Community<br/>Report Batch N-->RIR\n\n .2--Ranking +<br/>Filtering-->agr[Aggregated Intermediate<br/>Responses]-->res[Response]\n\n\n\n classDef green fill:#26B653,stroke:#333,stroke-width:2px,color:#fff;\n classDef turquoise fill:#19CCD3,stroke:#333,stroke-width:2px,color:#fff;\n classDef rose fill:#DD8694,stroke:#333,stroke-width:2px,color:#fff;\n classDef orange fill:#F19914,stroke:#333,stroke-width:2px,color:#fff;\n classDef purple fill:#B356CD,stroke:#333,stroke-width:2px,color:#fff;\n classDef invisible fill:#fff,stroke:#fff,stroke-width:0px,color:#fff, width:0px;\n class uq,ch1 turquoise;\n class ri1,ri2,rin rose;\n class agr orange;\n class res purple;\n class .1,.2 invisible;\n Given a user query and, optionally, the conversation history, the global search method uses a collection of LLM-generated community reports from a specified level of the graph's community hierarchy as context data to generate response in a map-reduce manner. At the map step, community reports are segmented into text chunks of pre-defined size. Each text chunk is then used to produce an intermediate response containing a list of point, each of which is accompanied by a numerical rating indicating the importance of the point. At the reduce step, a filtered set of the most important points from the intermediate responses are aggregated and used as the context to generate the final response.
The quality of the global search\u2019s response can be heavily influenced by the level of the community hierarchy chosen for sourcing community reports. Lower hierarchy levels, with their detailed reports, tend to yield more thorough responses, but may also increase the time and LLM resources needed to generate the final response due to the volume of reports.
"}, {"location": "query/global_search/#configuration", "title": "Configuration", "text": "Below are the key parameters of the GlobalSearch class:
llm: OpenAI model object to be used for response generationcontext_builder: context builder object to be used for preparing context data from community reportsmap_system_prompt: prompt template used in the map stage. Default template can be found at map_system_promptreduce_system_prompt: prompt template used in the reduce stage. Default template can be found at reduce_system_promptresponse_type: free-form text describing the desired response type and format (e.g., Multiple Paragraphs, Multi-Page Report)allow_general_knowledge: setting this to True will add additional instructions to the reduce_system_prompt to prompt the LLM to incorporate relevant real-world knowledge outside of the dataset. Note that this may increase hallucinations, but can be useful for certain scenarios. Default is Falsegeneral_knowledge_inclusion_prompt: instruction to add to the reduce_system_prompt if allow_general_knowledge is enabled. Default instruction can be found at general_knowledge_instructionmax_data_tokens: token budget for the context datamap_llm_params: a dictionary of additional parameters (e.g., temperature, max_tokens) to be passed to the LLM call at the map stagereduce_llm_params: a dictionary of additional parameters (e.g., temperature, max_tokens) to be passed to the LLM call at the reduce stagecontext_builder_params: a dictionary of additional parameters to be passed to the context_builder object when building the context window for the map stage.concurrent_coroutines: controls the degree of parallelism in the map stage.callbacks: optional callback functions, can be used to provide custom event handlers for LLM's completion streaming eventsAn example of a global search scenario can be found in the following notebook.
"}, {"location": "query/local_search/", "title": "Local Search \ud83d\udd0e", "text": ""}, {"location": "query/local_search/#entity-based-reasoning", "title": "Entity-based Reasoning", "text": "The local search method combines structured data from the knowledge graph with unstructured data from the input documents to augment the LLM context with relevant entity information at query time. It is well-suited for answering questions that require an understanding of specific entities mentioned in the input documents (e.g., \u201cWhat are the healing properties of chamomile?\u201d).
"}, {"location": "query/local_search/#methodology", "title": "Methodology", "text": "---\ntitle: Local Search Dataflow\n---\n%%{ init: { 'flowchart': { 'curve': 'step' } } }%%\nflowchart LR\n\n uq[User Query] ---.1\n ch1[Conversation<br/>History]---.1\n\n .1--Entity<br/>Description<br/>Embedding--> ee[Extracted Entities]\n\n ee[Extracted Entities] ---.2--Entity-Text<br/>Unit Mapping--> ctu[Candidate<br/>Text Units]--Ranking + <br/>Filtering -->ptu[Prioritized<br/>Text Units]---.3\n .2--Entity-Report<br/>Mapping--> ccr[Candidate<br/>Community Reports]--Ranking + <br/>Filtering -->pcr[Prioritized<br/>Community Reports]---.3\n .2--Entity-Entity<br/>Relationships--> ce[Candidate<br/>Entities]--Ranking + <br/>Filtering -->pe[Prioritized<br/>Entities]---.3\n .2--Entity-Entity<br/>Relationships--> cr[Candidate<br/>Relationships]--Ranking + <br/>Filtering -->pr[Prioritized<br/>Relationships]---.3\n .2--Entity-Covariate<br/>Mappings--> cc[Candidate<br/>Covariates]--Ranking + <br/>Filtering -->pc[Prioritized<br/>Covariates]---.3\n ch1 -->ch2[Conversation History]---.3\n .3-->res[Response]\n\n classDef green fill:#26B653,stroke:#333,stroke-width:2px,color:#fff;\n classDef turquoise fill:#19CCD3,stroke:#333,stroke-width:2px,color:#fff;\n classDef rose fill:#DD8694,stroke:#333,stroke-width:2px,color:#fff;\n classDef orange fill:#F19914,stroke:#333,stroke-width:2px,color:#fff;\n classDef purple fill:#B356CD,stroke:#333,stroke-width:2px,color:#fff;\n classDef invisible fill:#fff,stroke:#fff,stroke-width:0px,color:#fff, width:0px;\n class uq,ch1 turquoise\n class ee green\n class ctu,ccr,ce,cr,cc rose\n class ptu,pcr,pe,pr,pc,ch2 orange\n class res purple\n class .1,.2,.3 invisible\n\n Given a user query and, optionally, the conversation history, the local search method identifies a set of entities from the knowledge graph that are semantically-related to the user input. These entities serve as access points into the knowledge graph, enabling the extraction of further relevant details such as connected entities, relationships, entity covariates, and community reports. Additionally, it also extracts relevant text chunks from the raw input documents that are associated with the identified entities. These candidate data sources are then prioritized and filtered to fit within a single context window of pre-defined size, which is used to generate a response to the user query.
"}, {"location": "query/local_search/#configuration", "title": "Configuration", "text": "Below are the key parameters of the LocalSearch class:
llm: OpenAI model object to be used for response generationcontext_builder: context builder object to be used for preparing context data from collections of knowledge model objectssystem_prompt: prompt template used to generate the search response. Default template can be found at system_promptresponse_type: free-form text describing the desired response type and format (e.g., Multiple Paragraphs, Multi-Page Report)llm_params: a dictionary of additional parameters (e.g., temperature, max_tokens) to be passed to the LLM callcontext_builder_params: a dictionary of additional parameters to be passed to the context_builder object when building context for the search promptcallbacks: optional callback functions, can be used to provide custom event handlers for LLM's completion streaming eventsAn example of a local search scenario can be found in the following notebook.
"}, {"location": "query/overview/", "title": "Query Engine \ud83d\udd0e", "text": "The Query Engine is the retrieval module of the Graph RAG Library. It is one of the two main components of the Graph RAG library, the other being the Indexing Pipeline (see Indexing Pipeline). It is responsible for the following tasks:
The local search method generates answers by combining relevant data from the AI-extracted knowledge graph with text chunks from the raw documents. This method is suitable for questions that require an understanding of specific entities mentioned in the documents (e.g. What are the healing properties of chamomile?).
For more details about how Local Search works, please refer to the Local Search documentation.
"}, {"location": "query/overview/#global-search", "title": "Global Search", "text": "Global search method generates answers by searching over all AI-generated community reports in a map-reduce fashion. This is a resource-intensive method, but often gives good responses for questions that require an understanding of the dataset as a whole (e.g. What are the most significant values of the herbs mentioned in this notebook?).
More details can be found in the Global Search documentation.
"}, {"location": "query/overview/#question-generation", "title": "Question Generation", "text": "This functionality takes a list of user queries and generates the next candidate questions. This is useful for generating follow-up questions in a conversation or for generating a list of questions for the investigator to dive deeper into the dataset.
Information about how question generation works can be found at the Question Generation documentation page.
"}, {"location": "query/question_generation/", "title": "Question Generation \u2754", "text": ""}, {"location": "query/question_generation/#entity-based-question-generation", "title": "Entity-based Question Generation", "text": "The question generation method combines structured data from the knowledge graph with unstructured data from the input documents to generate candidate questions related to specific entities.
"}, {"location": "query/question_generation/#methodology", "title": "Methodology", "text": "Given a list of prior user questions, the question generation method uses the same context-building approach employed in local search to extract and prioritize relevant structured and unstructured data, including entities, relationships, covariates, community reports and raw text chunks. These data records are then fitted into a single LLM prompt to generate candidate follow-up questions that represent the most important or urgent information content or themes in the data.
"}, {"location": "query/question_generation/#configuration", "title": "Configuration", "text": "Below are the key parameters of the Question Generation class:
llm: OpenAI model object to be used for response generationcontext_builder: context builder object to be used for preparing context data from collections of knowledge model objects, using the same context builder class as in local searchsystem_prompt: prompt template used to generate candidate questions. Default template can be found at system_promptllm_params: a dictionary of additional parameters (e.g., temperature, max_tokens) to be passed to the LLM callcontext_builder_params: a dictionary of additional parameters to be passed to the context_builder object when building context for the question generation promptcallbacks: optional callback functions, can be used to provide custom event handlers for LLM's completion streaming eventsAn example of the question generation function can be found in the following notebook.
"}, {"location": "query/notebooks/overview/", "title": "Query Engine Notebooks", "text": "For examples about running Query please refer to the following notebooks:
The test dataset for these notebooks can be found in dataset.zip.
"}]} \ No newline at end of file