The default configuration mode is driven by a `settings.yml` or `settings.json` file in the data project root. If a `.env` file is present alongside this config file, it will be loaded, and the environment variables defined therein will be available for token replacement in your configuration document using `${ENV_VAR}` syntax. `graphrag init` generates a YAML file by default, but you may use the equivalent JSON form if preferred.
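For example, a minimal sketch of the token-replacement mechanism, assuming your `.env` defines `GRAPHRAG_API_KEY` (the variable `graphrag init` scaffolds by default):

```yaml
# settings.yml -- ${GRAPHRAG_API_KEY} is replaced at load time with the
# value from the adjacent .env file (e.g., a line GRAPHRAG_API_KEY=<api-key>)
models:
  default_chat_model:
    api_key: ${GRAPHRAG_API_KEY}
```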
Many of these config values have defaults. Rather than replicate them here, please refer to the [constants in the code](https://github.com/microsoft/graphrag/blob/main/graphrag/config/defaults.py) directly.
### models

This is a dict of model configurations. The dict key is used to reference this configuration elsewhere when a model instance is required. In this way, you can specify as many different models as you need and reference them individually in the workflow steps. A sketch follows the field list below.

#### Fields
- `encoding_model` **str** - The text encoding model to use. Default is to use the encoding model aligned with the language model (i.e., it is retrieved from tiktoken if unset).
- `audience` **str** - (Azure OpenAI only) The URI of the target Azure resource/service for which a managed identity token is requested. Used if `api_key` is not defined. Default=`https://cognitiveservices.azure.com/.default`
- `retry_strategy` **str** - Retry strategy to use. "native" is the default and uses the strategy built into the OpenAI SDK. Other allowed values include "exponential_backoff", "random_wait", and "incremental_wait".
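A minimal sketch of a `models` block, assuming OpenAI-hosted models (the keys `default_chat_model` and `default_embedding_model` are names you choose; the `type` and `model` values are illustrative):

```yaml
models:
  default_chat_model:
    type: openai_chat            # or azure_openai_chat
    model: gpt-4o                # illustrative model name
    api_key: ${GRAPHRAG_API_KEY}
    retry_strategy: native       # default; see alternatives above
  default_embedding_model:
    type: openai_embedding
    model: text-embedding-3-small
    api_key: ${GRAPHRAG_API_KEY}
```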
### input

#### Fields

- `encoding` **str** - The encoding of the input file. Default=`utf-8`
- `file_pattern` **str** - A regex to match input files. Default is `.*\.csv$`, `.*\.txt$`, or `.*\.json$` depending on the specified `file_type`, but you can customize it if needed.
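For instance, a sketch that restricts ingestion to `.txt` files (the `file_type` and `base_dir` fields are assumed from the standard GraphRAG input section; the `base_dir` value is illustrative):

```yaml
input:
  file_type: text             # csv | text | json
  base_dir: input             # illustrative; relative to the project root
  encoding: utf-8
  file_pattern: ".*\\.txt$"   # note the escaped backslash inside double quotes
```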
### chunks

These settings configure how we parse documents into text chunks. This is necessary because very large documents may not fit into a single context window, and chunk granularity lets you tune graph extraction accuracy. Also note the `metadata` setting in the input document config, which will replicate document metadata into each chunk.
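A sketch of a `chunks` block (field names follow the standard GraphRAG config; the values are illustrative, not recommendations):

```yaml
chunks:
  size: 1200              # maximum tokens per chunk
  overlap: 100            # tokens shared between adjacent chunks
  group_by_columns: [id]  # keep chunks within document boundaries
```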
### cache

This section controls the cache mechanism used by the pipeline. This is used to cache LLM invocation results for faster performance when re-running the indexing process.
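For example, a minimal sketch using the file-based cache (the `base_dir` value is illustrative):

```yaml
cache:
  type: file       # file | memory | blob | cosmosdb | none
  base_dir: cache  # illustrative; relative to the project root
```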
### reporting

This section controls the reporting mechanism used by the pipeline for common events and error messages. The default is to write reports to a file in the output directory. However, you can also choose to write reports to the console or to an Azure Blob Storage container.
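A sketch that switches reporting to the console (`file` and `blob` are the other options described above):

```yaml
reporting:
  type: console   # file | console | blob
```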
### vector_store

Where to store all vectors for the system. Configured for `lancedb` by default. This is a dict, with the key used to identify individual store parameters (e.g., for text embedding). A sketch follows the field list below.
#### Fields
- `type` **lancedb|azure_ai_search|cosmosdb** - Type of vector store. Default=`lancedb`
- `db_uri` **str** (lancedb only) - The database URI. Default=`storage.base_dir/lancedb`
- `url` **str** (AI Search only) - AI Search endpoint.
- `api_key` **str** (optional; AI Search only) - The AI Search API key to use.
- `audience` **str** (AI Search only) - Audience for the managed identity token if managed identity authentication is used.
- `container_name` **str** - The name of a vector container. This stores all indexes (tables) for a given dataset ingest. Default=`default`
- `database_name` **str** - (cosmosdb only) Name of the database.
- `overwrite` **bool** (only used at index creation time) - Overwrite the collection if it exists. Default=`True`
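A minimal sketch using the LanceDB default (the `default_vector_store` key is the conventional default id, but any key name works; the `db_uri` value is illustrative):

```yaml
vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output/lancedb   # illustrative path
    container_name: default
    overwrite: true
```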
## Workflow Configurations
These settings control each individual workflow as it executes.
### workflows
**list[str]** - This is a list of workflow names to run, in order. GraphRAG has built-in pipelines to configure this, but you can run exactly and only what you want by specifying the list here. Useful if you have done part of the processing yourself.
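For example, a sketch that runs only the early pipeline stages (the workflow names here are illustrative; consult your GraphRAG version for the exact built-in names):

```yaml
workflows:
  - create_base_text_units
  - extract_graph
```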
### embed_text
By default, the GraphRAG indexer will only export embeddings required for our query methods. However, the model has embeddings defined for all plaintext fields, and these can be customized by setting the `target` and `names` fields; see the sketch after the field list below.

Supported embedding names are:
- `text_unit.text`
- `document.text`
- `entity.title`
- `entity.description`
- `relationship.description`
- `community.title`
- `community.summary`
- `community.full_content`
#### Fields
- `model_id` **str** - Name of the model definition to use for text embedding.
- `vector_store_id` **str** - Name of the vector store definition to write to.
- `batch_size` **int** - The maximum batch size to use.
- `batch_max_tokens` **int** - The maximum number of tokens per batch.
- `target` **required|all|selected|none** - Determines which set of embeddings to export.
- `names` **list[str]** - If target=selected, this should be an explicit list of the supported embedding names above.
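A sketch that exports only two of the supported embeddings (the `model_id` and `vector_store_id` values refer to the illustrative keys used in the earlier sketches):

```yaml
embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store
  target: selected
  names:
    - entity.description
    - text_unit.text
```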
### extract_graph

#### Fields

- `encoding_model` **str** - The text encoding model to use. Default is to use the encoding model aligned with the language model (i.e., it is retrieved from tiktoken if unset). This is only used for gleanings during the logit_bias check.
### cluster_graph

#### Fields

- `use_lcc` **bool** - Whether to only use the largest connected component.
- `seed` **int** - A randomization seed to provide if consistent run-to-run results are desired. We do provide a default in order to guarantee clustering stability.
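For example, a minimal sketch (the seed value is illustrative, not the built-in default):

```yaml
cluster_graph:
  use_lcc: true   # cluster only the largest connected component
  seed: 42        # illustrative; omit to use the built-in default
```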
### extract_claims

#### Fields

- `enabled` **bool** - Whether to enable claim extraction. Off by default, because claim prompts really need user tuning.
- `model_id` **str** - Name of the model definition to use for API calls.
- `prompt` **str** - The prompt file to use.
- `description` **str** - Describes the types of claims we want to extract.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.
- `encoding_model` **str** - The text encoding model to use. Default is to use the encoding model aligned with the language model (i.e., it is retrieved from tiktoken if unset). This is only used for gleanings during the logit_bias check.
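A sketch that switches claim extraction on (the prompt path and description are illustrative; tune both for your data):

```yaml
extract_claims:
  enabled: true
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"  # illustrative path
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1
```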
### community_reports
#### Fields
- `model_id` **str** - Name of the model definition to use for API calls.
- `prompt` **str** - The prompt file to use.
- `max_length` **int** - The maximum number of output tokens per report.
- `max_input_length` **int** - The maximum number of input tokens to use when generating reports.
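A sketch of a `community_reports` block (the prompt path and token limits are illustrative):

```yaml
community_reports:
  model_id: default_chat_model
  prompt: "prompts/community_report.txt"  # illustrative path
  max_length: 2000        # output tokens per report
  max_input_length: 8000  # input tokens when generating a report
```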
### umap

Indicates whether we should run UMAP dimensionality reduction. This is used to provide an x/y coordinate to each graph node, suitable for visualization. If this is not enabled, nodes will receive a 0/0 x/y coordinate. If this is enabled, you *must* enable graph embedding as well.
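A minimal sketch pairing `umap` with the graph embedding it requires (assuming the standard `embed_graph` section with an `enabled` flag):

```yaml
embed_graph:
  enabled: true  # required for meaningful UMAP coordinates
umap:
  enabled: true
```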