mirror of
https://github.com/microsoft/graphrag.git
synced 2025-12-26 14:38:52 +00:00
Docs and notebooks update (#1451)
* Fix local question gen and example notebook
* Update global search notebook
* Add lazy blog post
* Update breaking changes doc for migration notes
* Simplify Getting Started page
* Semver
* Spellcheck
* Fix types
* Add comments on cache-free migration
* Update wording
* Spelling

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
This commit is contained in:
parent
2b7d28944d
commit
0b2120ca45
@@ -0,0 +1,4 @@
+{
+    "type": "patch",
+    "description": "Fix question gen."
+}
@@ -24,6 +24,7 @@ getcwd
fillna
noqa
dtypes
ints

# Azure
abfs
@@ -167,6 +168,7 @@ FIRUZABAD
Krohaara
KROHAARA
POKRALLY
René
Tazbah
TIRUZIA
Tiruzia
@@ -38,4 +38,10 @@

 By Bryan Li, Research Intern; [Ha Trinh](https://www.microsoft.com/en-us/research/people/trinhha/), Senior Data Scientist; [Darren Edge](https://www.microsoft.com/en-us/research/people/daedge/), Senior Director; [Jonathan Larson](https://www.microsoft.com/en-us/research/people/jolarso/), Senior Principal Data Architect</h6>

+- [:octicons-arrow-right-24: __LazyGraphRAG: Setting a new standard for quality and cost__](https://www.microsoft.com/en-us/research/blog/lazygraphrag-setting-a-new-standard-for-quality-and-cost/)
+
+---
+
+<h6>Published November 25, 2024
+
+By [Darren Edge](https://www.microsoft.com/en-us/research/people/daedge/), Senior Director; [Ha Trinh](https://www.microsoft.com/en-us/research/people/trinhha/), Senior Data Scientist; [Jonathan Larson](https://www.microsoft.com/en-us/research/people/jolarso/), Senior Principal Data Architect</h6>
 </div>
@@ -75,9 +75,9 @@
 "source": [
 "### Load community reports as context for global search\n",
 "\n",
-"- Load all community reports in the `create_final_community_reports` table from the ire-indexing engine, to be used as context data for global search.\n",
-"- Load entities from the `create_final_nodes` and `create_final_entities` tables from the ire-indexing engine, to be used for calculating community weights for context ranking. Note that this is optional (if no entities are provided, we will not calculate community weights and only use the rank attribute in the community reports table for context ranking)\n",
-"- Load all communities in the `create_final_communites` table from the ire-indexing engine, to be used to reconstruct the community graph hierarchy for dynamic community selection."
+"- Load all community reports in the `create_final_community_reports` table from the GraphRAG, to be used as context data for global search.\n",
+"- Load entities from the `create_final_nodes` and `create_final_entities` tables from the GraphRAG, to be used for calculating community weights for context ranking. Note that this is optional (if no entities are provided, we will not calculate community weights and only use the rank attribute in the community reports table for context ranking)\n",
+"- Load all communities in the `create_final_communites` table from the GraphRAG, to be used to reconstruct the community graph hierarchy for dynamic community selection."
 ]
 },
 {
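The "community weights for context ranking" mentioned in the hunk above can be illustrated with a toy example. This is a simplified sketch, not GraphRAG's exact implementation: here a community's weight is taken to be the number of distinct text units attached to its entities, and the column names are illustrative.

```python
import pandas as pd

# Toy stand-ins for the entity/community tables; columns are illustrative,
# not the exact GraphRAG parquet schema.
entities = pd.DataFrame({
    "title": ["ALEX MERCER", "TAYLOR CRUZ", "COSMIC VOCALIZATION"],
    "community": [0, 0, 1],
    "text_unit_ids": [["t1", "t2"], ["t2", "t3"], ["t4"]],
})

# One plausible weighting scheme: count distinct text units per community.
weights = (
    entities.explode("text_unit_ids")
    .groupby("community")["text_unit_ids"]
    .nunique()
    .rename("occurrence_weight")
    .reset_index()
)
print(weights)
```

If no entities table is supplied, this step is skipped and only the `rank` attribute on the community reports is used for ranking, as the notebook text notes.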
@@ -379,21 +379,23 @@
 "text": [
 "### Overview of Cosmic Vocalization\n",
 "\n",
-"Cosmic Vocalization is a phenomenon that has garnered significant attention from various individuals and groups. It is perceived as a cosmic event with potential implications for security and interstellar communication. The Paranormal Military Squad is actively engaged with Cosmic Vocalization, indicating its strategic importance in security measures [Data: Reports (6)].\n",
+"Cosmic Vocalization is a phenomenon that has garnered significant attention within the community, involving various individuals and groups. It is perceived as an interstellar event with potential implications for both communication and security.\n",
 "\n",
-"### Key Perspectives and Concerns\n",
+"### Key Perspectives\n",
 "\n",
-"1. **Strategic Engagement**: The Paranormal Military Squad's involvement suggests that Cosmic Vocalization is not only a subject of interest but also a matter of strategic importance. This engagement highlights the potential security implications of these cosmic phenomena [Data: Reports (6)].\n",
+"**Alex Mercer's Viewpoint** \n",
+"Alex Mercer perceives Cosmic Vocalization as part of an interstellar duet, suggesting that it may be a responsive or communicative event. This perspective highlights the potential for Cosmic Vocalization to be part of a larger cosmic interaction or dialogue [Data: Reports (6)].\n",
 "\n",
-"2. **Community Interest**: Within the community, Cosmic Vocalization is a focal point of interest. Alex Mercer, for instance, perceives it as part of an interstellar duet, which suggests a responsive and perhaps communicative approach to these cosmic events [Data: Reports (6)].\n",
+"**Taylor Cruz's Concerns** \n",
+"Taylor Cruz raises concerns about the nature of Cosmic Vocalization, fearing it might be a homing tune. This adds a layer of urgency and potential threat, as it suggests that the vocalization could be attracting attention from unknown entities or forces [Data: Reports (6)].\n",
 "\n",
-"3. **Potential Threats**: Concerns have been raised by individuals like Taylor Cruz, who fears that Cosmic Vocalization might be a homing tune. This perspective adds a layer of urgency and suggests that there may be potential threats associated with these cosmic sounds [Data: Reports (6)].\n",
+"### Involvement of the Paranormal Military Squad\n",
 "\n",
-"### Metaphorical Interpretation\n",
+"The Paranormal Military Squad is actively engaged with Cosmic Vocalization, indicating its significance in security measures. Their involvement suggests that the phenomenon is not only of scientific interest but also of strategic importance, potentially impacting national or global security [Data: Reports (6)].\n",
 "\n",
-"The Universe is metaphorically treated as a concert hall by the Paranormal Military Squad, which suggests a broader perspective on how cosmic events are interpreted and responded to by human entities. This metaphorical view may influence how strategies and responses are formulated in relation to Cosmic Vocalization [Data: Reports (6)].\n",
+"### Conclusion\n",
 "\n",
-"In summary, Cosmic Vocalization is a complex phenomenon involving strategic, communicative, and potentially threatening elements. The involvement of the Paranormal Military Squad and the concerns raised by community members underscore its significance and the need for careful consideration of its implications.\n"
+"Cosmic Vocalization is a complex and multifaceted phenomenon that involves various stakeholders, each with their own perspectives and concerns. The involvement of both individuals like Alex Mercer and Taylor Cruz, as well as organized groups like the Paranormal Military Squad, underscores its importance and the need for further investigation and understanding.\n"
 ]
 }
 ],
@@ -638,7 +640,7 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"LLM calls: 2. Prompt tokens: 11292. Output tokens: 606.\n"
+"LLM calls: 2. Prompt tokens: 11237. Output tokens: 483.\n"
 ]
 }
 ],
@@ -652,7 +654,7 @@
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "graphrag",
+"display_name": ".venv",
 "language": "python",
 "name": "python3"
 },
@@ -666,7 +668,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.12.5"
+"version": "3.11.9"
 }
 },
 "nbformat": 4,
File diff suppressed because it is too large
@@ -14,11 +14,6 @@ To get started with the GraphRAG system, you have a few options:

-To get started with the GraphRAG system we recommend trying the [Solution Accelerator](https://github.com/Azure-Samples/graphrag-accelerator) package. This provides a user-friendly end-to-end experience with Azure resources.
-
-# Top-Level Modules
-
-* [Indexing Pipeline Overview](index/overview.md)
-* [Query Engine Overview](query/overview.md)

 # Overview

 The following is a simple end-to-end example for using the GraphRAG system.
@@ -34,26 +29,20 @@ The graphrag library includes a CLI for a no-code approach to getting started. P

 # Running the Indexer

-Now we need to set up a data project and some initial configuration. Let's set that up. We're using the [default configuration mode](config/overview.md), which you can customize as needed using a [config file](config/yaml.md), which we recommend, or [environment variables](config/env_vars.md).
-
-First let's get a sample dataset ready:
+We need to set up a data project and some initial configuration. First let's get a sample dataset ready:

 ```sh
 mkdir -p ./ragtest/input
 ```

-Now let's get a copy of A Christmas Carol by Charles Dickens from a trusted source
+Get a copy of A Christmas Carol by Charles Dickens from a trusted source:

 ```sh
 curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt -o ./ragtest/input/book.txt
 ```

-Next we'll inject some required config variables:
-
-## Set Up Your Workspace Variables
-
-First let's make sure to setup the required environment variables. For details on these environment variables, and what environment variables are available, see the [variables documentation](config/overview.md).

 To initialize your workspace, first run the `graphrag init` command.
+Since we have already configured a directory named `./ragtest` in the previous step, run the following command:
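Taken together, the simplified Getting Started steps in this hunk amount to a four-command flow. The sketch below assumes the `graphrag` package is installed and network access is available; the `|| true` guards merely keep the sketch runnable where those assumptions do not hold.

```sh
# create the workspace and input folder
mkdir -p ./ragtest/input

# fetch the sample corpus (requires network access)
curl -fsSL --max-time 60 https://www.gutenberg.org/cache/epub/24022/pg24022.txt \
  -o ./ragtest/input/book.txt || true

# generate a default settings.yml, then build the index
# (requires the graphrag CLI to be installed)
graphrag init --root ./ragtest || true
graphrag index --root ./ragtest || true
```

After a successful index run, the output parquet tables land under the configured output directory and can be queried with the global/local search notebooks.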
@@ -5,12 +5,15 @@

 import logging
 import time
-from typing import Any
+from typing import Any, cast

 import tiktoken

 from graphrag.prompts.query.question_gen_system_prompt import QUESTION_SYSTEM_PROMPT
-from graphrag.query.context_builder.builders import LocalContextBuilder
+from graphrag.query.context_builder.builders import (
+    ContextBuilderResult,
+    LocalContextBuilder,
+)
 from graphrag.query.context_builder.conversation_history import (
     ConversationHistory,
 )
@@ -71,12 +74,17 @@ class LocalQuestionGen(BaseQuestionGen):

         if context_data is None:
             # generate context data based on the question history
-            context_data, context_records = self.context_builder.build_context(
-                query=question_text,
-                conversation_history=conversation_history,
-                **kwargs,
-                **self.context_builder_params,
-            )  # type: ignore
+            result = cast(
+                ContextBuilderResult,
+                self.context_builder.build_context(
+                    query=question_text,
+                    conversation_history=conversation_history,
+                    **kwargs,
+                    **self.context_builder_params,
+                ),
+            )
+            context_data = cast(str, result.context_chunks)
+            context_records = result.context_records
         else:
             context_records = {"context_data": context_data}
         log.info("GENERATE QUESTION: %s. LAST QUESTION: %s", start_time, question_text)
@@ -144,12 +152,17 @@ class LocalQuestionGen(BaseQuestionGen):

         if context_data is None:
             # generate context data based on the question history
-            context_data, context_records = self.context_builder.build_context(
-                query=question_text,
-                conversation_history=conversation_history,
-                **kwargs,
-                **self.context_builder_params,
-            )  # type: ignore
+            result = cast(
+                ContextBuilderResult,
+                self.context_builder.build_context(
+                    query=question_text,
+                    conversation_history=conversation_history,
+                    **kwargs,
+                    **self.context_builder_params,
+                ),
+            )
+            context_data = cast(str, result.context_chunks)
+            context_records = result.context_records
         else:
             context_records = {"context_data": context_data}
         log.info(
@@ -1,4 +1,50 @@
-# Config Breaking Changes
+# GraphRAG Data Model and Config Breaking Changes
+
+As we worked toward a cleaner codebase, data model, and configuration for the v1 release, we made a few changes that can break older indexes. During the development process we left shims in place to account for these changes, so that all old indexes will work up until v1.0. However, with the release of 1.0 we are removing these shims to allow the codebase to move forward without the legacy code elements. This should be a fairly painless process for most users: because we aggressively use a cache for LLM calls, re-running an index over the top of a previous one should be very low (or no) cost. Therefore, our standard migration recommendation is as follows:
+
+1. Rename or move your settings.yml file to back it up.
+2. Re-run `graphrag init` to generate a new default settings.yml.
+3. Open your old settings.yml and copy any critical settings that you changed. For most people this is likely only the LLM and embedding config.
+4. Re-run `graphrag index`. This will re-execute the standard pipeline, using the cache for any LLM calls that it can. The output parquet tables will be in the latest format.
+
+Note that one of the new requirements is that we write embeddings to a vector store during indexing. By default, this uses a local lancedb instance. When you re-generate the default config, a block will be added to reflect this. If you need to write to Azure AI Search instead, we recommend updating these settings before you index, so you don't need to do a separate vector ingest.
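For orientation, the generated vector store block has roughly the following shape. The field names and values here are assumptions for illustration only; treat the settings.yml actually produced by `graphrag init` as the source of truth.

```yaml
# Illustrative shape only -- confirm against the file generated by `graphrag init`.
embeddings:
  vector_store:
    type: lancedb          # default local store
    db_uri: output/lancedb
    container_name: default
    overwrite: true
    # For Azure AI Search, a block of this general form is used instead:
    # type: azure_ai_search
    # url: <your-search-endpoint>
    # api_key: <your-search-key>
```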
+All of the breaking changes listed below are accounted for in the four steps above.
+
+## What if I don't have a cache available?
+
+If you no longer have your original GraphRAG cache, you can manually update your index. The most important aspect is ensuring that you have the required embeddings stored. If you already have your embeddings in a vector store, much of this can be avoided.
+
+Parquet changes:
+- The `create_final_entities.name` field has been renamed to `create_final_entities.title` for consistency with the other tables. Use your parquet editor of choice to fix this.
+- The `create_final_communities.id` field has been renamed to `create_final_communities.community` so that `id` can be repurposed for a UUID like the other tables. Use your parquet editor of choice to copy and rename this. You can copy it to leave the `id` field in place, or use a tool such as pandas to give each community a new UUID in the `id` field. (We join on the `community` field internally, so `id` can be effectively ignored).
+
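For the cache-free path, the two parquet edits above can be scripted rather than done in a parquet editor. A hedged sketch using pandas follows; the file paths are omitted (in practice you would `pd.read_parquet(...)` your own output tables) and the toy frames only stand in for the real schema.

```python
import uuid

import pandas as pd

# Stand-ins for the real tables; replace with
# pd.read_parquet("<output>/create_final_entities.parquet") etc.
entities = pd.DataFrame({"id": ["e1"], "name": ["ALEX MERCER"]})
communities = pd.DataFrame({"id": [0], "title": ["Community 0"]})

# create_final_entities: rename `name` -> `title`
entities = entities.rename(columns={"name": "title"})

# create_final_communities: copy `id` -> `community`, then give each
# row a fresh UUID in `id` (internal joins use `community`)
communities["community"] = communities["id"]
communities["id"] = [str(uuid.uuid4()) for _ in range(len(communities))]

# entities.to_parquet(...) / communities.to_parquet(...) would persist the fix
print(entities.columns.tolist(), communities.columns.tolist())
```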
+Embeddings changes:
+- For Local Search, you need to have the entity.description embeddings in a vector store
+- For DRIFT Search, you need the community.full_content embeddings in a vector store
+- If you are only using Global Search, you do not need any embeddings
+
+The easiest way to get both of those is to run the pipeline with all workflows skipped except for `generate_embeddings`, which will embed those fields and write them to a vector store directly. Using a newer config file that has the embeddings.vector_store block:
+
+- Set the `skip_workflows` value to [create_base_entity_graph, create_base_text_units, create_final_text_units, create_final_community_reports, create_final_nodes, create_final_relationships, create_final_documents, create_final_covariates, create_final_entities, create_final_communities]
+- Re-run `graphrag index`
+
+What this does is run the pipeline, but skip over all of the usual artifact generation - the only workflow that is not skipped is the one that generates all default (or otherwise configured) embeddings.
+
+## Updated data model
+
+- We have streamlined the data model of the index in a few small ways to align tables more consistently and remove redundant content. Notably:
+  - Consistent use of `id` and `human_readable_id` across all tables; this also ensures all int IDs are actually saved as ints and never strings
+  - Alignment of fields from `create_final_entities` (such as name -> title) with `create_final_nodes`, and removal of redundant content across these tables
+  - Rename of `document.raw_content` to `document.text`
+  - Rename of `entity.name` to `entity.title`
+  - Rename of `rank` to `combined_degree` in `create_final_relationships`, and removal of the `source_degree` and `target_degree` fields
+  - Fixed community tables to use a proper UUID for the `id` field, and retain `community` and `human_readable_id` for the short IDs
+  - Removal of all embeddings columns from parquet files in favor of direct vector store writes
+
+### Migration
+
+- Run a new index, leveraging the existing cache.
+
+## New required Embeddings