The default configuration mode may be configured with a `config.json` or `config.yml` file in the data project root. If a `.env` file is present alongside this config file, it will be loaded, and the environment variables it defines will be available for token replacement in your configuration document using `${ENV_VAR}` syntax.
For example:
```
# .env
API_KEY=some_api_key

# config.json
{
  "llm": {
    "api_key": "${API_KEY}"
  }
}
```
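
Since `config.yml` is also supported, the same settings can be expressed in YAML. A minimal sketch of the equivalent document, mirroring the JSON example above:

```
# config.yml
llm:
  api_key: ${API_KEY}
```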
# Config Sections
## input
### Fields
- `type` **file|blob** - The input type to use. Default is `file`.
- `file_type` **text|csv** - The type of input data to load. Either `text` or `csv`. Default is `text`.
- `file_encoding` **str** - The encoding of the input file. Default is `utf-8`.
- `file_pattern` **str** - A regex to match input files. Default is `.*\.csv$` if in csv mode and `.*\.txt$` if in text mode.
- `source_column` **str** - (CSV Mode Only) The source column name.
- `timestamp_column` **str** - (CSV Mode Only) The timestamp column name.
- `timestamp_format` **str** - (CSV Mode Only) The timestamp format.
- `text_column` **str** - (CSV Mode Only) The text column name.
- `title_column` **str** - (CSV Mode Only) The title column name.
- `document_attribute_columns` **list[str]** - (CSV Mode Only) The additional document attributes to include.
- `connection_string` **str** - (blob only) The Azure Storage connection string.
- `container_name` **str** - (blob only) The Azure Storage container name.
- `base_dir` **str** - The base directory to read input from, relative to the root.
- `storage_account_blob_url` **str** - The storage account blob URL to use.
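
Putting these fields together, an `input` section for CSV mode might look like the sketch below. The column names and `base_dir` value are illustrative placeholders, not defaults:

```
# config.json (illustrative values)
{
  "input": {
    "type": "file",
    "file_type": "csv",
    "base_dir": "input",
    "source_column": "source",
    "text_column": "text",
    "title_column": "title"
  }
}
```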
## llm
This is the base LLM configuration section. Other steps may override this configuration with their own LLM configuration.
### Fields
- `api_key` **str** - The OpenAI API key to use.
- `type` **openai_chat|azure_openai_chat|openai_embedding|azure_openai_embedding** - The type of LLM to use.
- `model` **str** - The model name.
- `max_tokens` **int** - The maximum number of output tokens.
- `request_timeout` **float** - The per-request timeout.
- `api_base` **str** - The API base URL to use.
- `api_version` **str** - The API version.
- `organization` **str** - The client organization.
- `proxy` **str** - The proxy URL to use.
- `cognitive_services_endpoint` **str** - The URL endpoint for cognitive services.
- `deployment_name` **str** - The deployment name to use (Azure).
- `model_supports_json` **bool** - Whether the model supports JSON-mode output.
- `tokens_per_minute` **int** - Set a leaky-bucket throttle on tokens per minute.
- `requests_per_minute` **int** - Set a leaky-bucket throttle on requests per minute.
- `max_retries` **int** - The maximum number of retries to use.
- `max_retry_wait` **float** - The maximum backoff time.
- `sleep_on_rate_limit_recommendation` **bool** - Whether to adhere to sleep recommendations (Azure).
- `concurrent_requests` **int** - The number of open requests to allow at once.
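
As a sketch, a base `llm` section targeting an Azure OpenAI chat deployment might look like the following. The model name, deployment name, endpoint, and API version are illustrative assumptions; substitute your own resource details:

```
# config.json (illustrative values)
{
  "llm": {
    "api_key": "${API_KEY}",
    "type": "azure_openai_chat",
    "model": "gpt-4",
    "deployment_name": "gpt-4",
    "api_base": "https://<your-instance>.openai.azure.com",
    "api_version": "2024-02-15-preview",
    "model_supports_json": true,
    "max_retries": 10,
    "concurrent_requests": 25
  }
}
```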