Default Configuration Mode (using YAML/JSON)
The default configuration mode may be configured by using a `settings.yml` or `settings.json` file in the data project root. If a `.env` file is present alongside this config file, it will be loaded, and the environment variables defined therein will be available for token replacement in your configuration document using `${ENV_VAR}` syntax. `graphrag init` initializes with YAML by default, but you may use the equivalent JSON form if preferred.
Many of these config values have defaults. Rather than replicate them here, please refer to the constants in the code directly.
For example:
```bash
# .env
GRAPHRAG_API_KEY=some_api_key
```

```yaml
# settings.yml
llm:
  api_key: ${GRAPHRAG_API_KEY}
```
Config Sections
Indexing
llm
This is the base LLM configuration section. Other steps may override this configuration with their own LLM configuration.
Fields
- `api_key` str - The OpenAI API key to use.
- `type` openai_chat|azure_openai_chat|openai_embedding|azure_openai_embedding - The type of LLM to use.
- `model` str - The model name.
- `max_tokens` int - The maximum number of output tokens.
- `request_timeout` float - The per-request timeout.
- `api_base` str - The API base URL to use.
- `api_version` str - The API version.
- `organization` str - The client organization.
- `proxy` str - The proxy URL to use.
- `audience` str - (Azure OpenAI only) The URI of the target Azure resource/service for which a managed identity token is requested. Used if `api_key` is not defined. Default=`https://cognitiveservices.azure.com/.default`
- `deployment_name` str - The deployment name to use (Azure).
- `model_supports_json` bool - Whether the model supports JSON-mode output.
- `tokens_per_minute` int - Set a leaky-bucket throttle on tokens per minute.
- `requests_per_minute` int - Set a leaky-bucket throttle on requests per minute.
- `max_retries` int - The maximum number of retries to use.
- `max_retry_wait` float - The maximum backoff time.
- `sleep_on_rate_limit_recommendation` bool - Whether to adhere to sleep recommendations (Azure).
- `concurrent_requests` int - The number of open requests to allow at once.
- `temperature` float - The temperature to use.
- `top_p` float - The top-p value to use.
- `n` int - The number of completions to generate.
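As a concrete illustration, an `llm` block for Azure OpenAI chat might look like the sketch below. The endpoint, API version, deployment, and model names are placeholders, not defaults:

```yaml
llm:
  type: azure_openai_chat
  api_key: ${GRAPHRAG_API_KEY}                    # resolved from .env
  api_base: https://my-instance.openai.azure.com  # placeholder endpoint
  api_version: "2024-02-15-preview"               # placeholder version
  deployment_name: my-gpt4-deployment             # placeholder deployment name
  model: gpt-4
  model_supports_json: true
  max_retries: 10
  concurrent_requests: 25
```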
parallelization
Fields
- `stagger` float - The threading stagger value.
- `num_threads` int - The maximum number of work threads.
async_mode
asyncio|threaded - The async mode to use. Either `asyncio` or `threaded`.
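Taken together, the two sections above might be tuned like this sketch (the values are illustrative, not defaults):

```yaml
parallelization:
  stagger: 0.3     # illustrative stagger between worker starts
  num_threads: 50  # illustrative worker-thread cap
async_mode: threaded
```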
embeddings
Fields
- `llm` (see LLM top-level config)
- `parallelization` (see Parallelization top-level config)
- `async_mode` (see Async Mode top-level config)
- `batch_size` int - The maximum batch size to use.
- `batch_max_tokens` int - The maximum number of tokens per batch.
- `target` required|all|none - Determines which set of embeddings to export.
- `skip` list[str] - Which embeddings to skip. Only useful if `target=all`, to customize the list.
- `vector_store` dict - The vector store to use. Configured for lancedb by default.
  - `type` str - `lancedb` or `azure_ai_search`. Default=`lancedb`
  - `db_uri` str (only for lancedb) - The database URI. Default=`storage.base_dir/lancedb`
  - `url` str (only for AI Search) - The AI Search endpoint.
  - `api_key` str (optional - only for AI Search) - The AI Search API key to use.
  - `audience` str (only for AI Search) - Audience for the managed identity token if managed identity authentication is used.
  - `overwrite` bool (only used at index creation time) - Overwrite the collection if it exists. Default=True
  - `container_name` str - The name of a vector container. This stores all indexes (tables) for a given dataset ingest. Default=`default`
- `strategy` dict - Fully override the text-embedding strategy.
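For example, an `embeddings` block that keeps the default lancedb store could be sketched as follows (the `db_uri` shown is a placeholder path):

```yaml
embeddings:
  batch_size: 16
  target: required
  vector_store:
    type: lancedb
    db_uri: output/lancedb  # placeholder; defaults to storage.base_dir/lancedb
    container_name: default
    overwrite: true
```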
input
Fields
- `type` file|blob - The input type to use. Default=`file`
- `file_type` text|csv - The type of input data to load. Either `text` or `csv`. Default=`text`
- `base_dir` str - The base directory to read input from, relative to the root.
- `connection_string` str - (blob only) The Azure Storage connection string.
- `storage_account_blob_url` str - The storage account blob URL to use.
- `container_name` str - (blob only) The Azure Storage container name.
- `file_encoding` str - The encoding of the input file. Default=`utf-8`
- `file_pattern` str - A regex to match input files. Default is `.*\.csv$` if in csv mode and `.*\.txt$` if in text mode.
- `file_filter` dict - Key/value pairs to filter. Default is None.
- `source_column` str - (CSV mode only) The source column name.
- `timestamp_column` str - (CSV mode only) The timestamp column name.
- `timestamp_format` str - (CSV mode only) The timestamp format.
- `text_column` str - (CSV mode only) The text column name.
- `title_column` str - (CSV mode only) The title column name.
- `document_attribute_columns` list[str] - (CSV mode only) The additional document attributes to include.
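A CSV-mode `input` block using these fields might look like the following sketch (the column names are placeholders for your own data):

```yaml
input:
  type: file
  file_type: csv
  base_dir: input
  file_encoding: utf-8
  file_pattern: ".*\\.csv$"
  text_column: text   # placeholder column names
  title_column: title
```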
chunks
Fields
- `size` int - The max chunk size in tokens.
- `overlap` int - The chunk overlap in tokens.
- `group_by_columns` list[str] - Group documents by fields before chunking.
- `encoding_model` str - The text encoding model to use. Default is to use the top-level encoding model.
- `strategy` dict - Fully override the chunking strategy.
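For instance, a `chunks` block with moderate overlap could be sketched as below (the token counts and grouping column are illustrative, not recommendations):

```yaml
chunks:
  size: 1200    # illustrative max chunk size in tokens
  overlap: 100  # illustrative overlap in tokens
  group_by_columns: [id]  # placeholder grouping field
```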
cache
Fields
- `type` file|memory|none|blob - The cache type to use. Default=`file`
- `connection_string` str - (blob only) The Azure Storage connection string.
- `container_name` str - (blob only) The Azure Storage container name.
- `base_dir` str - The base directory to write cache to, relative to the root.
- `storage_account_blob_url` str - The storage account blob URL to use.
storage
Fields
- `type` file|memory|blob - The storage type to use. Default=`file`
- `connection_string` str - (blob only) The Azure Storage connection string.
- `container_name` str - (blob only) The Azure Storage container name.
- `base_dir` str - The base directory to write output artifacts to, relative to the root.
- `storage_account_blob_url` str - The storage account blob URL to use.
update_index_storage
Fields
- `type` file|memory|blob - The storage type to use. Default=`file`
- `connection_string` str - (blob only) The Azure Storage connection string.
- `container_name` str - (blob only) The Azure Storage container name.
- `base_dir` str - The base directory to write output artifacts to, relative to the root.
- `storage_account_blob_url` str - The storage account blob URL to use.
reporting
Fields
- `type` file|console|blob - The reporting type to use. Default=`file`
- `connection_string` str - (blob only) The Azure Storage connection string.
- `container_name` str - (blob only) The Azure Storage container name.
- `base_dir` str - The base directory to write reports to, relative to the root.
- `storage_account_blob_url` str - The storage account blob URL to use.
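The `cache`, `storage`, `update_index_storage`, and `reporting` sections all share the same shape, so a blob-backed variant of any of them might be sketched like this (the account and container names are placeholders):

```yaml
storage:
  type: blob
  storage_account_blob_url: https://myaccount.blob.core.windows.net  # placeholder account
  container_name: my-graphrag-output                                 # placeholder container
  base_dir: output
```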
entity_extraction
Fields
- `llm` (see LLM top-level config)
- `parallelization` (see Parallelization top-level config)
- `async_mode` (see Async Mode top-level config)
- `prompt` str - The prompt file to use.
- `entity_types` list[str] - The entity types to identify.
- `max_gleanings` int - The maximum number of gleaning cycles to use.
- `encoding_model` str - The text encoding model to use. By default, this will use the top-level encoding model.
- `strategy` dict - Fully override the entity extraction strategy.
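A sketch of an `entity_extraction` block follows; the prompt path and entity types are placeholders, and the `summarize_descriptions` and `community_reports` sections below follow the same prompt/limit pattern:

```yaml
entity_extraction:
  prompt: prompts/entity_extraction.txt             # placeholder prompt path
  entity_types: [organization, person, geo, event]  # placeholder types
  max_gleanings: 1
```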
summarize_descriptions
Fields
- `llm` (see LLM top-level config)
- `parallelization` (see Parallelization top-level config)
- `async_mode` (see Async Mode top-level config)
- `prompt` str - The prompt file to use.
- `max_length` int - The maximum number of output tokens per summarization.
- `strategy` dict - Fully override the summarize descriptions strategy.
claim_extraction
Fields
- `enabled` bool - Whether to enable claim extraction. Off by default, because claim prompts really need user tuning.
- `llm` (see LLM top-level config)
- `parallelization` (see Parallelization top-level config)
- `async_mode` (see Async Mode top-level config)
- `prompt` str - The prompt file to use.
- `description` str - Describes the types of claims we want to extract.
- `max_gleanings` int - The maximum number of gleaning cycles to use.
- `encoding_model` str - The text encoding model to use. By default, this will use the top-level encoding model.
- `strategy` dict - Fully override the claim extraction strategy.
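Because claim extraction is off by default, enabling it requires an explicit block such as this sketch (the prompt path and description text are placeholders):

```yaml
claim_extraction:
  enabled: true
  prompt: prompts/claim_extraction.txt  # placeholder prompt path
  description: "Any claims or facts that could be relevant to information discovery."  # placeholder
  max_gleanings: 1
```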
community_reports
Fields
- `llm` (see LLM top-level config)
- `parallelization` (see Parallelization top-level config)
- `async_mode` (see Async Mode top-level config)
- `prompt` str - The prompt file to use.
- `max_length` int - The maximum number of output tokens per report.
- `max_input_length` int - The maximum number of input tokens to use when generating reports.
- `strategy` dict - Fully override the community reports strategy.
cluster_graph
Fields
- `max_cluster_size` int - The maximum cluster size to export.
- `strategy` dict - Fully override the cluster_graph strategy.
embed_graph
Fields
- `enabled` bool - Whether to enable graph embeddings.
- `num_walks` int - The node2vec number of walks.
- `walk_length` int - The node2vec walk length.
- `window_size` int - The node2vec window size.
- `iterations` int - The node2vec number of iterations.
- `random_seed` int - The node2vec random seed.
- `strategy` dict - Fully override the embed graph strategy.
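If enabled, the node2vec parameters might be set like the sketch below (these values are illustrative, not tuned recommendations):

```yaml
embed_graph:
  enabled: true
  num_walks: 10
  walk_length: 40
  window_size: 2
  iterations: 3
  random_seed: 597832  # illustrative seed for reproducibility
```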
umap
Fields
- `enabled` bool - Whether to enable UMAP layouts.
snapshots
Fields
- `embeddings` bool - Export embeddings snapshots to parquet.
- `graphml` bool - Export graph snapshots to GraphML.
- `transient` bool - Export transient workflow table snapshots to parquet.
encoding_model
str - The text encoding model to use. Default=`cl100k_base`.
skip_workflows
list[str] - Which workflow names to skip.
Query
local_search
Fields
- `prompt` str - The prompt file to use.
- `text_unit_prop` float - The text unit proportion.
- `community_prop` float - The community proportion.
- `conversation_history_max_turns` int - The maximum number of conversation history turns.
- `top_k_entities` int - The top k mapped entities.
- `top_k_relationships` int - The top k mapped relationships.
- `temperature` float | None - The temperature to use for token generation.
- `top_p` float | None - The top-p value to use for token generation.
- `n` int | None - The number of completions to generate.
- `max_tokens` int - The maximum context size in tokens.
- `llm_max_tokens` int - The LLM maximum tokens.
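As an illustration, a `local_search` block might weight text units and communities like this (the proportions and limits are illustrative, not defaults):

```yaml
local_search:
  text_unit_prop: 0.5
  community_prop: 0.1
  top_k_entities: 10
  top_k_relationships: 10
  max_tokens: 12000
```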
global_search
Fields
- `map_prompt` str | None - The global search mapper prompt file to use.
- `reduce_prompt` str | None - The global search reducer prompt file to use.
- `knowledge_prompt` str | None - The global search general knowledge prompt file to use.
- `temperature` float | None - The temperature to use for token generation.
- `top_p` float | None - The top-p value to use for token generation.
- `n` int | None - The number of completions to generate.
- `max_tokens` int - The maximum context size in tokens.
- `data_max_tokens` int - The data LLM maximum tokens.
- `map_max_tokens` int - The map LLM maximum tokens.
- `reduce_max_tokens` int - The reduce LLM maximum tokens.
- `concurrency` int - The number of concurrent requests.
- `dynamic_search_llm` str - The LLM model to use for dynamic community selection.
- `dynamic_search_threshold` int - The rating threshold to include a community report.
- `dynamic_search_keep_parent` bool - Keep the parent community if any of its child communities are relevant.
- `dynamic_search_num_repeats` int - The number of times to rate the same community report.
- `dynamic_search_use_summary` bool - Use the community summary instead of the full context.
- `dynamic_search_concurrent_coroutines` int - The number of concurrent coroutines used to rate community reports.
- `dynamic_search_max_level` int - The maximum level of the community hierarchy to consider if none of the processed communities are relevant.
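A `global_search` block enabling dynamic community selection might be sketched as follows (the rating model name and limits are placeholders):

```yaml
global_search:
  max_tokens: 12000
  concurrency: 32
  dynamic_search_llm: gpt-4o-mini  # placeholder rating model
  dynamic_search_threshold: 1
  dynamic_search_keep_parent: false
  dynamic_search_num_repeats: 1
```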
drift_search
Fields
- `prompt` str - The prompt file to use.
- `temperature` float - The temperature to use for token generation.
- `top_p` float - The top-p value to use for token generation.
- `n` int - The number of completions to generate.
- `max_tokens` int - The maximum context size in tokens.
- `data_max_tokens` int - The data LLM maximum tokens.
- `concurrency` int - The number of concurrent requests.
- `drift_k_followups` int - The number of top global results to retrieve.
- `primer_folds` int - The number of folds for search priming.
- `primer_llm_max_tokens` int - The maximum number of tokens for the primer LLM.
- `n_depth` int - The number of drift search steps to take.
- `local_search_text_unit_prop` float - The proportion of search dedicated to text units.
- `local_search_community_prop` float - The proportion of search dedicated to community properties.
- `local_search_top_k_mapped_entities` int - The number of top-K entities to map during local search.
- `local_search_top_k_relationships` int - The number of top-K relationships to map during local search.
- `local_search_max_data_tokens` int - The maximum context size in tokens for local search.
- `local_search_temperature` float - The temperature to use for token generation in local search.
- `local_search_top_p` float - The top-p value to use for token generation in local search.
- `local_search_n` int - The number of completions to generate in local search.
- `local_search_llm_max_gen_tokens` int - The maximum number of generated tokens for the LLM in local search.
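Finally, a `drift_search` block might be sketched like this (the depth, follow-up counts, and proportions are illustrative, not defaults):

```yaml
drift_search:
  n_depth: 3
  drift_k_followups: 20
  primer_folds: 5
  local_search_text_unit_prop: 0.9
  local_search_community_prop: 0.1
```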