
# GraphRAG Data Model and Config Breaking Changes

As we worked toward a cleaner codebase, data model, and configuration for the v1 release, we made a few changes that can break older indexes. During the development process we left shims in place to account for these changes, so that all old indexes will work up until v1.0. However, with the release of 1.0 we are removing these shims to allow the codebase to move forward without the legacy code elements. We are providing a migration notebook so this process should be fairly painless for most users:

1. Rename or move your settings.yaml file to back it up.
2. Re-run `graphrag init` to generate a new default settings.yaml.
3. Open your old settings.yaml and copy over any critical settings that you changed. For most people this is likely only the LLM and embedding config.
4. Run the migration notebook: [./docs/examples_notebooks/index_migration.ipynb](./docs/examples_notebooks/index_migration.ipynb)
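Step 1 can be scripted if you migrate several projects. The following is a minimal sketch, not part of GraphRAG: the helper name `backup_settings` and the `.bak` suffix are hypothetical, and the default file name assumes the `settings.yaml` spelling used in the configuration examples in this document.

```python
from pathlib import Path
from typing import Optional


def backup_settings(root: Path, name: str = "settings.yaml") -> Optional[Path]:
    """Move the settings file in `root` aside so a fresh default can be generated.

    Returns the backup path, or None when no settings file exists.
    `name` and the ".bak" suffix are illustrative choices, not GraphRAG conventions.
    """
    source = root / name
    if not source.is_file():
        return None
    backup = source.with_name(source.name + ".bak")
    # Rename rather than copy: the old file must be out of the way
    # before a new default settings file is generated (step 2).
    source.rename(backup)
    return backup
```

After the helper returns, re-run `graphrag init` and copy your changed settings from the backup into the new file by hand.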

Note that one of the new requirements is that we write embeddings to a vector store during indexing. By default, this uses a local LanceDB instance. When you re-generate the default config, a block will be added to reflect this. If you need to write to Azure AI Search instead, we recommend updating these settings before you index, so you don't need to do a separate vector ingest.

All of the breaking changes listed below are accounted for in the four steps above.

## Updated data model

- We have streamlined the data model of the index in a few small ways to align tables more consistently and remove redundant content. Notably:
    - Consistent use of `id` and `human_readable_id` across all tables; this also ensures all int IDs are actually saved as ints and never strings
    - Alignment of fields from `create_final_entities` (such as name -> title) with `create_final_nodes`, and removal of redundant content across these tables
    - Rename of `document.raw_content` to `document.text`
    - Rename of `entity.name` to `entity.title`
    - Rename of `rank` to `combined_degree` in `create_final_relationships` and removal of the `source_degree` and `target_degree` fields
    - Fixed community tables to use a proper UUID for the `id` field, and retain `community` and `human_readable_id` for the short IDs
    - Removal of all embeddings columns from parquet files in favor of direct vector store writes
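To make the column changes concrete, here is a small sketch that applies the renames and removals listed above to a single row represented as a plain dictionary. It is only an illustration under stated assumptions, not the migration notebook's implementation: the function `migrate_row` and the table labels `document`, `entity`, and `relationship` are hypothetical names chosen for this example.

```python
from typing import Any, Dict, Set

# Column renames from the list above, keyed by an illustrative table label.
RENAMES: Dict[str, Dict[str, str]] = {
    "document": {"raw_content": "text"},
    "entity": {"name": "title"},
    "relationship": {"rank": "combined_degree"},
}

# Columns removed outright in the new data model.
DROPPED: Dict[str, Set[str]] = {
    "relationship": {"source_degree", "target_degree"},
}


def migrate_row(table: str, row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `row` using the v1 column names for `table`."""
    renames = RENAMES.get(table, {})
    dropped = DROPPED.get(table, set())
    migrated: Dict[str, Any] = {}
    for key, value in row.items():
        if key in dropped:
            continue  # column no longer exists in v1
        migrated[renames.get(key, key)] = value
    # Short IDs are stored as ints, never as strings.
    if "human_readable_id" in migrated:
        migrated["human_readable_id"] = int(migrated["human_readable_id"])
    return migrated
```

In practice the recommended path is still to re-index or run the migration notebook; a sketch like this is mainly useful for checking what downstream code should expect from the new column names.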

### Migration

- Run a new index, leveraging existing cache.

## New required Embeddings

### Change

- Added new required embeddings for `DRIFTSearch` and base RAG capabilities.

### Migration

- Run a new index, leveraging existing cache.

## Vector Store required by default

### Change

- Vector store is now required by default for all search methods.

### Migration

- Run the `graphrag init` command to generate a new settings.yaml file with the vector store configuration.
- Run a new index, leveraging existing cache.

## Deprecate timestamp paths

### Change

- Remove support for timestamp paths, i.e., output paths that use `${timestamp}` directory nesting.
- Use the same directory for storage output and reporting output.

### Migration

- Ensure output directories no longer use `${timestamp}` directory nesting.

**Using Environment Variables**

- Ensure `GRAPHRAG_STORAGE_BASE_DIR` is set to a static directory, e.g., `output` instead of `output/${timestamp}/artifacts`.
- Ensure `GRAPHRAG_REPORTING_BASE_DIR` is set to a static directory, e.g., `output` instead of `output/${timestamp}/reports`.

[Full docs on using environment variables for configuration](https://microsoft.github.io/graphrag/config/env_vars/).
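A quick way to confirm the two environment variables above are clean is to scan them for the `${timestamp}` token. This is a minimal sketch; the function name `find_timestamp_paths` is hypothetical and nothing here is part of the GraphRAG CLI.

```python
import os
from typing import Dict, Mapping

TIMESTAMP_TOKEN = "${timestamp}"
BASE_DIR_VARS = ("GRAPHRAG_STORAGE_BASE_DIR", "GRAPHRAG_REPORTING_BASE_DIR")


def find_timestamp_paths(env: Mapping[str, str]) -> Dict[str, str]:
    """Return the base-dir variables in `env` that still contain `${timestamp}`."""
    return {
        name: env[name]
        for name in BASE_DIR_VARS
        if TIMESTAMP_TOKEN in env.get(name, "")
    }


if __name__ == "__main__":
    # Report any offending variables in the current process environment.
    for name, value in find_timestamp_paths(os.environ).items():
        print(f"{name} still uses timestamp nesting: {value}")
```

The function takes any mapping, so it can be run against a parsed `.env` file as well as `os.environ`.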

**Using Configuration File**

```yaml
# rest of settings.yaml file
# ...

storage:
  type: file
  base_dir: "output" # changed from "output/${timestamp}/artifacts"

reporting:
  type: file
  base_dir: "output" # changed from "output/${timestamp}/reports"
```

[Full docs on using YAML files for configuration](https://microsoft.github.io/graphrag/config/yaml/).