graphrag/docs/config/json_yaml.md

# Default Configuration Mode (using JSON/YAML)

The default configuration mode may be configured by using a `settings.json` or `settings.yml` file in the data project root. If a `.env` file is present along with this config file, then it will be loaded, and the environment variables defined therein will be available for token replacements in your configuration document using `${ENV_VAR}` syntax.

For example:

```
# .env
API_KEY=some_api_key

# settings.json
{
    "llm": {
        "api_key": "${API_KEY}"
    }
}
```

# Config Sections

## input

### Fields

- `type` **file|blob** - The input type to use. Default=`file`
- `file_type` **text|csv** - The type of input data to load. Either `text` or `csv`. Default is `text`
- `file_encoding` **str** - The encoding of the input file. Default is `utf-8`
- `file_pattern` **str** - A regex to match input files. Default is `.*\.csv$` if in csv mode and `.*\.txt$` if in text mode.
- `source_column` **str** - (CSV Mode Only) The source column name.
- `timestamp_column` **str** - (CSV Mode Only) The timestamp column name.
- `timestamp_format` **str** - (CSV Mode Only) The source format.
- `text_column` **str** - (CSV Mode Only) The text column name.
- `title_column` **str** - (CSV Mode Only) The title column name.
- `document_attribute_columns` **list[str]** - (CSV Mode Only) The additional document attributes to include.
- `connection_string` **str** - (blob only) The Azure Storage connection string.
- `container_name` **str** - (blob only) The Azure Storage container name.
- `base_dir` **str** - The base directory to read input from, relative to the root.
- `storage_account_blob_url` **str** - The storage account blob URL to use.

## llm

This is the base LLM configuration section. Other steps may override this configuration with their own LLM configuration.

### Fields

- `api_key` **str** - The OpenAI API key to use.
- `type` **openai_chat|azure_openai_chat|openai_embedding|azure_openai_embedding** - The type of LLM to use.
- `model` **str** - The model name.
- `max_tokens` **int** - The maximum number of output tokens.
- `request_timeout` **float** - The per-request timeout.
- `api_base` **str** - The API base url to use.
- `api_version` **str** - The API version
- `organization` **str** - The client organization.
- `proxy` **str** - The proxy URL to use.
- `audience` **str** - (Azure OpenAI only) The URI of the target Azure resource/service for which a managed identity token is requested. Used if `api_key` is not defined. Default=`https://cognitiveservices.azure.com/.default`
- `deployment_name` **str** - The deployment name to use (Azure).
- `model_supports_json` **bool** - Whether the model supports JSON-mode output.
- `tokens_per_minute` **int** - Set a leaky-bucket throttle on tokens-per-minute.
- `requests_per_minute` **int** - Set a leaky-bucket throttle on requests-per-minute.
- `max_retries` **int** - The maximum number of retries to use.
- `max_retry_wait` **float** - The maximum backoff time.
- `sleep_on_rate_limit_recommendation` **bool** - Whether to adhere to sleep recommendations (Azure).
- `concurrent_requests` **int** The number of open requests to allow at once.
- `temperature` **float** - The temperature to use.
- `top_p` **float** - The top-p value to use.
- `n` **int** - The number of completions to generate.

## parallelization

### Fields

- `stagger` **float** - The threading stagger value.
- `num_threads` **int** - The maximum number of work threads.

## async_mode

**asyncio|threaded** The async mode to use. Either `asyncio` or `threaded.

## embeddings

### Fields

- `llm` (see LLM top-level config)
- `parallelization` (see Parallelization top-level config)
- `async_mode` (see Async Mode top-level config)
- `batch_size` **int** - The maximum batch size to use.
- `batch_max_tokens` **int** - The maximum batch # of tokens.
- `target` **required|all|none** - Determines which set of embeddings to emit.
- `skip` **list[str]** - Which embeddings to skip. Only useful if target=all to customize the list.
- `vector_store` **dict** - The vector store to use. Configured for lancedb by default.
  - `type` **str** - `lancedb` or `azure_ai_search`. Default=`lancedb`
  - `db_uri` **str** (only for lancedb) - The database uri. Default=`storage.base_dir/lancedb`
  - `url` **str** (only for AI Search) - AI Search endpoint
  - `api_key` **str** (optional - only for AI Search) - The AI Search api key to use.
  - `audience` **str** (only for AI Search) - Audience for managed identity token if managed identity authentication is used.
  - `overwrite` **bool** (only used at index creation time) - Overwrite collection if it exist. Default=`True`
  - `container_name` **str** - The name of a vector container. This stores all indexes (tables) for a given dataset ingest. Default=`default`
- `strategy` **dict** - Fully override the text-embedding strategy.

## chunks

### Fields

- `size` **int** - The max chunk size in tokens.
- `overlap` **int** - The chunk overlap in tokens.
- `group_by_columns` **list[str]** - group documents by fields before chunking.
- `encoding_model` **str** - The text encoding model to use. Default is to use the top-level encoding model.
- `strategy` **dict** - Fully override the chunking strategy.

## cache

### Fields

- `type` **file|memory|none|blob** - The cache type to use. Default=`file`
- `connection_string` **str** - (blob only) The Azure Storage connection string.
- `container_name` **str** - (blob only) The Azure Storage container name.
- `base_dir` **str** - The base directory to write cache to, relative to the root.
- `storage_account_blob_url` **str** - The storage account blob URL to use.

## storage

### Fields

- `type` **file|memory|blob** - The storage type to use. Default=`file`
- `connection_string` **str** - (blob only) The Azure Storage connection string.
- `container_name` **str** - (blob only) The Azure Storage container name.
- `base_dir` **str** - The base directory to write reports to, relative to the root.
- `storage_account_blob_url` **str** - The storage account blob URL to use.

## reporting

### Fields

- `type` **file|console|blob** - The reporting type to use. Default=`file`
- `connection_string` **str** - (blob only) The Azure Storage connection string.
- `container_name` **str** - (blob only) The Azure Storage container name.
- `base_dir` **str** - The base directory to write reports to, relative to the root.
- `storage_account_blob_url` **str** - The storage account blob URL to use.

## entity_extraction

### Fields

- `llm` (see LLM top-level config)
- `parallelization` (see Parallelization top-level config)
- `async_mode` (see Async Mode top-level config)
- `prompt` **str** - The prompt file to use.
- `entity_types` **list[str]** - The entity types to identify.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.
- `encoding_model` **str** - The text encoding model to use. By default, this will use the top-level encoding model.
- `strategy` **dict** - Fully override the entity extraction strategy.

## summarize_descriptions

### Fields

- `llm` (see LLM top-level config)
- `parallelization` (see Parallelization top-level config)
- `async_mode` (see Async Mode top-level config)
- `prompt` **str** - The prompt file to use.
- `max_length` **int** - The maximum number of output tokens per summarization.
- `strategy` **dict** - Fully override the summarize description strategy.

## claim_extraction

### Fields

- `enabled` **bool** - Whether to enable claim extraction. default=False
- `llm` (see LLM top-level config)
- `parallelization` (see Parallelization top-level config)
- `async_mode` (see Async Mode top-level config)
- `prompt` **str** - The prompt file to use.
- `description` **str** - Describes the types of claims we want to extract.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.
- `encoding_model` **str** - The text encoding model to use. By default, this will use the top-level encoding model.
- `strategy` **dict** - Fully override the claim extraction strategy.

## community_reports

### Fields

- `llm` (see LLM top-level config)
- `parallelization` (see Parallelization top-level config)
- `async_mode` (see Async Mode top-level config)
- `prompt` **str** - The prompt file to use.
- `max_length` **int** - The maximum number of output tokens per report.
- `max_input_length` **int** - The maximum number of input tokens to use when generating reports.
- `strategy` **dict** - Fully override the community reports strategy.

## cluster_graph

### Fields

- `max_cluster_size` **int** - The maximum cluster size to emit.
- `strategy` **dict** - Fully override the cluster_graph strategy.

## embed_graph

### Fields

- `enabled` **bool** - Whether to enable graph embeddings.
- `num_walks` **int** - The node2vec number of walks.
- `walk_length` **int** - The node2vec walk length.
- `window_size` **int** - The node2vec window size.
- `iterations` **int** - The node2vec number of iterations.
- `random_seed` **int** - The node2vec random seed.
- `strategy` **dict** - Fully override the embed graph strategy.

## umap

### Fields

- `enabled` **bool** - Whether to enable UMAP layouts.

## snapshots

### Fields

- `embeddings` **bool** - Emit embeddings snapshots to parquet.
- `graphml` **bool** - Emit graph snapshots to GraphML.
- `raw_entities` **bool** - Emit raw entity snapshots to JSON.
- `top_level_nodes` **bool** - Emit top-level-node snapshots to JSON.
- `transient` **bool** - Emit transient workflow tables snapshots to parquet.

## encoding_model

**str** - The text encoding model to use. Default=`cl100k_base`.

## skip_workflows

**list[str]** - Which workflow names to skip.
Replace current docs by mkdocs (#1263) * Replace docs by mkdocs-material * Fix markdown * Fix verions in gh-pages workflow * remove whitespaces * add semver * Add build docs check on python-ci * Fix command in index cli * Spellcheck * Spellcheck * remove docsite paths * clear outputs from notebook * remove dependabot npm for docsite * remove more docsite left overs * execute notebooks * Update notebooks * update poetry lock * Remove notebook build from ci * Revert dep update * Navigation tabs * Fix stylesheet * add kwds to dictionary * Turn on notebook execution * Update gitignore * Add MSR Blog posts * spellcheck * Accessibility Changes --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com> 2024-10-11 13:39:03 -06:00			`# Default Configuration Mode (using JSON/YAML)`
Initial Release 2024-07-01 15:25:30 -06:00
Change config.json references to settings.json in the configuration document. (#1221) Updated the configuration documentation to reflect the default filename for configuration file. Default config files are `["settings.yaml", "settings.yml", "settings.json"]` https://github.com/microsoft/graphrag/blob/ce71bcf7fbe9811058f6bbc1eb725c4a1d960e7e/graphrag/config/config_file_loader.py#L15 Co-authored-by: Alonso Guevara <alonsog@microsoft.com> 2024-10-10 06:20:18 +09:00			The default configuration mode may be configured by using a `settings.json` or `settings.yml` file in the data project root. If a `.env` file is present along with this config file, then it will be loaded, and the environment variables defined therein will be available for token replacements in your configuration document using `${ENV_VAR}` syntax.
Initial Release 2024-07-01 15:25:30 -06:00
			`For example:`

			```
			`# .env`
			`API_KEY=some_api_key`

Change config.json references to settings.json in the configuration document. (#1221) Updated the configuration documentation to reflect the default filename for configuration file. Default config files are `["settings.yaml", "settings.yml", "settings.json"]` https://github.com/microsoft/graphrag/blob/ce71bcf7fbe9811058f6bbc1eb725c4a1d960e7e/graphrag/config/config_file_loader.py#L15 Co-authored-by: Alonso Guevara <alonsog@microsoft.com> 2024-10-10 06:20:18 +09:00			`# settings.json`
Initial Release 2024-07-01 15:25:30 -06:00			`{`
			`"llm": {`
			`"api_key": "${API_KEY}"`
			`}`
			`}`
			```

			`# Config Sections`

			`## input`

			`### Fields`

			- `type` file\|blob - The input type to use. Default=`file`
			- `file_type` text\|csv - The type of input data to load. Either `text` or `csv`. Default is `text`
			- `file_encoding` str - The encoding of the input file. Default is `utf-8`
			- `file_pattern` str - A regex to match input files. Default is `.\.csv$` if in csv mode and `.\.txt$` if in text mode.
			- `source_column` str - (CSV Mode Only) The source column name.
			- `timestamp_column` str - (CSV Mode Only) The timestamp column name.
			- `timestamp_format` str - (CSV Mode Only) The source format.
			- `text_column` str - (CSV Mode Only) The text column name.
			- `title_column` str - (CSV Mode Only) The title column name.
			- `document_attribute_columns` list[str] - (CSV Mode Only) The additional document attributes to include.
			- `connection_string` str - (blob only) The Azure Storage connection string.
			- `container_name` str - (blob only) The Azure Storage container name.
			- `base_dir` str - The base directory to read input from, relative to the root.
			- `storage_account_blob_url` str - The storage account blob URL to use.

			`## llm`

			`This is the base LLM configuration section. Other steps may override this configuration with their own LLM configuration.`

			`### Fields`

			- `api_key` str - The OpenAI API key to use.
			- `type` openai_chat\|azure_openai_chat\|openai_embedding\|azure_openai_embedding - The type of LLM to use.
			- `model` str - The model name.
			- `max_tokens` int - The maximum number of output tokens.
			- `request_timeout` float - The per-request timeout.
			- `api_base` str - The API base url to use.
			- `api_version` str - The API version
			- `organization` str - The client organization.
			- `proxy` str - The proxy URL to use.
Fix vector store logic and refactor audience parameter (#1259) 2024-10-21 16:56:56 -04:00			- `audience` str - (Azure OpenAI only) The URI of the target Azure resource/service for which a managed identity token is requested. Used if `api_key` is not defined. Default=`https://cognitiveservices.azure.com/.default`
Initial Release 2024-07-01 15:25:30 -06:00			- `deployment_name` str - The deployment name to use (Azure).
			- `model_supports_json` bool - Whether the model supports JSON-mode output.
			- `tokens_per_minute` int - Set a leaky-bucket throttle on tokens-per-minute.
			- `requests_per_minute` int - Set a leaky-bucket throttle on requests-per-minute.
			- `max_retries` int - The maximum number of retries to use.
			- `max_retry_wait` float - The maximum backoff time.
			- `sleep_on_rate_limit_recommendation` bool - Whether to adhere to sleep recommendations (Azure).
			- `concurrent_requests` int The number of open requests to allow at once.
Add N parameter support (#390) * Add N parameter support * Fix unit tests * Add new env vars to param testing 2024-07-08 14:04:49 -06:00			- `temperature` float - The temperature to use.
			- `top_p` float - The top-p value to use.
			- `n` int - The number of completions to generate.
Initial Release 2024-07-01 15:25:30 -06:00
			`## parallelization`

			`### Fields`

			- `stagger` float - The threading stagger value.
			- `num_threads` int - The maximum number of work threads.

			`## async_mode`

			asyncio\|threaded The async mode to use. Either `asyncio` or `threaded.

			`## embeddings`

			`### Fields`

			- `llm` (see LLM top-level config)
			- `parallelization` (see Parallelization top-level config)
			- `async_mode` (see Async Mode top-level config)
			- `batch_size` int - The maximum batch size to use.
Fix vector store logic and refactor audience parameter (#1259) 2024-10-21 16:56:56 -04:00			- `batch_max_tokens` int - The maximum batch # of tokens.
New workflow to generate embeddings in a single workflow (#1296) * New workflow to generate embeddings in a single workflow * New workflow to generate embeddings in a single workflow * version change * clean tests without any embeddings references * clean tests without any embeddings references * remove code * feedback implemented * changes in logic * feedback implemented * store in table bug fixed * smoke test for generate_text_embeddings workflow * smoke test fix * add generate_text_embeddings to the list of transient workflows * smoke tests * fix * ruff formatting updates * fix * smoke test fixed * smoke test fixed * fix lancedb import * smoke test fix * ignore sorting * smoke test fixed * smoke test fixed * check smoke test * smoke test fixed * change config for vector store * format fix * vector store changes * revert debug profile back to empty filepath * merge conflict solved * merge conflict solved * format fixed * format fixed * fix return dataframe * snapshot fix * format fix * embeddings param implemented * validation fixes * fix map * fix map * fix properties * config updates * smoke test fixed * settings change * Update collection config and rework back-compat * Repalce . with - for embedding store --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: Josh Bradley <joshbradley@microsoft.com> Co-authored-by: Nathan Evans <github@talkswithnumbers.com> 2024-11-01 16:01:35 -06:00			- `target` required\|all\|none - Determines which set of embeddings to emit.
			- `skip` list[str] - Which embeddings to skip. Only useful if target=all to customize the list.
Fix vector store logic and refactor audience parameter (#1259) 2024-10-21 16:56:56 -04:00			- `vector_store` dict - The vector store to use. Configured for lancedb by default.
			- `type` str - `lancedb` or `azure_ai_search`. Default=`lancedb`
			- `db_uri` str (only for lancedb) - The database uri. Default=`storage.base_dir/lancedb`
			- `url` str (only for AI Search) - AI Search endpoint
			- `api_key` str (optional - only for AI Search) - The AI Search api key to use.
			- `audience` str (only for AI Search) - Audience for managed identity token if managed identity authentication is used.
			- `overwrite` bool (only used at index creation time) - Overwrite collection if it exist. Default=`True`
New workflow to generate embeddings in a single workflow (#1296) * New workflow to generate embeddings in a single workflow * New workflow to generate embeddings in a single workflow * version change * clean tests without any embeddings references * clean tests without any embeddings references * remove code * feedback implemented * changes in logic * feedback implemented * store in table bug fixed * smoke test for generate_text_embeddings workflow * smoke test fix * add generate_text_embeddings to the list of transient workflows * smoke tests * fix * ruff formatting updates * fix * smoke test fixed * smoke test fixed * fix lancedb import * smoke test fix * ignore sorting * smoke test fixed * smoke test fixed * check smoke test * smoke test fixed * change config for vector store * format fix * vector store changes * revert debug profile back to empty filepath * merge conflict solved * merge conflict solved * format fixed * format fixed * fix return dataframe * snapshot fix * format fix * embeddings param implemented * validation fixes * fix map * fix map * fix properties * config updates * smoke test fixed * settings change * Update collection config and rework back-compat * Repalce . with - for embedding store --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: Josh Bradley <joshbradley@microsoft.com> Co-authored-by: Nathan Evans <github@talkswithnumbers.com> 2024-11-01 16:01:35 -06:00			- `container_name` str - The name of a vector container. This stores all indexes (tables) for a given dataset ingest. Default=`default`
Initial Release 2024-07-01 15:25:30 -06:00			- `strategy` dict - Fully override the text-embedding strategy.

			`## chunks`

			`### Fields`

			- `size` int - The max chunk size in tokens.
			- `overlap` int - The chunk overlap in tokens.
			- `group_by_columns` list[str] - group documents by fields before chunking.
add encoding model to text-chunking config (#743) * add encoding model to text-chunking config * revert groupby fix, handled in other pr * revert environment reader update for other pr 2024-07-26 14:15:17 -07:00			- `encoding_model` str - The text encoding model to use. Default is to use the top-level encoding model.
Initial Release 2024-07-01 15:25:30 -06:00			- `strategy` dict - Fully override the chunking strategy.

			`## cache`

			`### Fields`

			- `type` file\|memory\|none\|blob - The cache type to use. Default=`file`
			- `connection_string` str - (blob only) The Azure Storage connection string.
			- `container_name` str - (blob only) The Azure Storage container name.
			- `base_dir` str - The base directory to write cache to, relative to the root.
			- `storage_account_blob_url` str - The storage account blob URL to use.

			`## storage`

			`### Fields`

			- `type` file\|memory\|blob - The storage type to use. Default=`file`
			- `connection_string` str - (blob only) The Azure Storage connection string.
			- `container_name` str - (blob only) The Azure Storage container name.
			- `base_dir` str - The base directory to write reports to, relative to the root.
			- `storage_account_blob_url` str - The storage account blob URL to use.

			`## reporting`

			`### Fields`

			- `type` file\|console\|blob - The reporting type to use. Default=`file`
			- `connection_string` str - (blob only) The Azure Storage connection string.
			- `container_name` str - (blob only) The Azure Storage container name.
			- `base_dir` str - The base directory to write reports to, relative to the root.
			- `storage_account_blob_url` str - The storage account blob URL to use.

			`## entity_extraction`

			`### Fields`

			- `llm` (see LLM top-level config)
			- `parallelization` (see Parallelization top-level config)
			- `async_mode` (see Async Mode top-level config)
			- `prompt` str - The prompt file to use.
			- `entity_types` list[str] - The entity types to identify.
			- `max_gleanings` int - The maximum number of gleaning cycles to use.
Add encoding model to entity/claim extraction config sections (#740) * Add encoding-model configuration to entity & claim extraction * add change note * pr updates * test fix * disable GH-based smoke tests 2024-07-26 15:05:08 -07:00			- `encoding_model` str - The text encoding model to use. By default, this will use the top-level encoding model.
Initial Release 2024-07-01 15:25:30 -06:00			- `strategy` dict - Fully override the entity extraction strategy.

			`## summarize_descriptions`

			`### Fields`

			- `llm` (see LLM top-level config)
			- `parallelization` (see Parallelization top-level config)
			- `async_mode` (see Async Mode top-level config)
			- `prompt` str - The prompt file to use.
			- `max_length` int - The maximum number of output tokens per summarization.
			- `strategy` dict - Fully override the summarize description strategy.

			`## claim_extraction`

			`### Fields`

			- `enabled` bool - Whether to enable claim extraction. default=False
			- `llm` (see LLM top-level config)
			- `parallelization` (see Parallelization top-level config)
			- `async_mode` (see Async Mode top-level config)
			- `prompt` str - The prompt file to use.
			- `description` str - Describes the types of claims we want to extract.
			- `max_gleanings` int - The maximum number of gleaning cycles to use.
Add encoding model to entity/claim extraction config sections (#740) * Add encoding-model configuration to entity & claim extraction * add change note * pr updates * test fix * disable GH-based smoke tests 2024-07-26 15:05:08 -07:00			- `encoding_model` str - The text encoding model to use. By default, this will use the top-level encoding model.
Initial Release 2024-07-01 15:25:30 -06:00			- `strategy` dict - Fully override the claim extraction strategy.

			`## community_reports`

			`### Fields`

			- `llm` (see LLM top-level config)
			- `parallelization` (see Parallelization top-level config)
			- `async_mode` (see Async Mode top-level config)
			- `prompt` str - The prompt file to use.
			- `max_length` int - The maximum number of output tokens per report.
			- `max_input_length` int - The maximum number of input tokens to use when generating reports.
			- `strategy` dict - Fully override the community reports strategy.

			`## cluster_graph`

			`### Fields`

			- `max_cluster_size` int - The maximum cluster size to emit.
			- `strategy` dict - Fully override the cluster_graph strategy.

			`## embed_graph`

			`### Fields`

			- `enabled` bool - Whether to enable graph embeddings.
			- `num_walks` int - The node2vec number of walks.
			- `walk_length` int - The node2vec walk length.
			- `window_size` int - The node2vec window size.
			- `iterations` int - The node2vec number of iterations.
			- `random_seed` int - The node2vec random seed.
			- `strategy` dict - Fully override the embed graph strategy.

			`## umap`

			`### Fields`

			- `enabled` bool - Whether to enable UMAP layouts.

			`## snapshots`

			`### Fields`

Transient entity graph (#1349) * Make base_entity_graph transient * Add transient snapshots * Semver * Fix unit test * Fix smoke tests 2024-11-04 17:23:29 -08:00			- `embeddings` bool - Emit embeddings snapshots to parquet.
			- `graphml` bool - Emit graph snapshots to GraphML.
			- `raw_entities` bool - Emit raw entity snapshots to JSON.
			- `top_level_nodes` bool - Emit top-level-node snapshots to JSON.
			- `transient` bool - Emit transient workflow tables snapshots to parquet.
Initial Release 2024-07-01 15:25:30 -06:00
			`## encoding_model`

Fix vector store logic and refactor audience parameter (#1259) 2024-10-21 16:56:56 -04:00			str - The text encoding model to use. Default=`cl100k_base`.
Initial Release 2024-07-01 15:25:30 -06:00
			`## skip_workflows`

			`list[str] - Which workflow names to skip.`