graphrag/docs/config/env_vars.md
2025-07-09 18:29:03 -06:00

Default Configuration Mode (using Env Vars)

As of version 1.3, GraphRAG no longer supports a full complement of pre-built environment variables. Instead, we support variable replacement within the settings.yml file so you can specify any environment variables you like.

The only standard environment variable we expect, and include in the default settings.yml, is GRAPHRAG_API_KEY. If you are already using a number of the previous GRAPHRAG_* environment variables, you can insert them with template syntax into settings.yml and they will be adopted.

The environment variables below are documented as an aid for migration, but they WILL NOT be read unless you use template syntax in your settings.yml. We also WILL NOT be updating this page as the main config object changes.
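The variable replacement described above can be pictured with a small standalone sketch. It assumes placeholders are written as `${NAME}` and uses Python's `string.Template` to stand in for whatever substitution GraphRAG performs internally; the YAML keys shown are illustrative and not the actual settings schema.

```python
import os
from string import Template

# Hypothetical excerpt of a settings.yml file; the key names are illustrative only.
SETTINGS_TEMPLATE = """\
llm:
  api_key: ${GRAPHRAG_API_KEY}
  model: ${GRAPHRAG_LLM_MODEL}
"""


def render_settings(template_text, env):
    """Replace ${NAME} placeholders with values taken from env.

    Raises KeyError if the template references a variable that env lacks.
    """
    return Template(template_text).substitute(env)


if __name__ == "__main__":
    env = dict(os.environ)
    # Placeholder values so the sketch runs even when the variables are unset.
    env.setdefault("GRAPHRAG_API_KEY", "<your-api-key>")
    env.setdefault("GRAPHRAG_LLM_MODEL", "gpt-4-turbo-preview")
    print(render_settings(SETTINGS_TEMPLATE, env))
```

A variable that is never referenced in the template has no effect, which mirrors the note that the legacy variables are not read unless they appear in settings.yml.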


Text-Embeddings Customization

By default, the GraphRAG indexer will only export embeddings required for our query methods. However, the model has embeddings defined for all plaintext fields, and these can be generated by setting the GRAPHRAG_EMBEDDING_TARGET environment variable to all.

Embedded Fields

  • text_unit.text
  • document.text
  • entity.title
  • entity.description
  • relationship.description
  • community.title
  • community.summary
  • community.full_content

Input Data

Our pipeline can ingest .csv or .txt data from an input folder. These files can be nested within subfolders. To configure how input data is handled, which fields are mapped, and how timestamps are parsed, look for configuration values starting with GRAPHRAG_INPUT_ below. In general, CSV-based data provides the most customizability. Each CSV should contain at least a text field (which can be mapped with environment variables), but it helps if it also has title, timestamp, and source fields. Additional fields can be included as well; they will land as extra fields on the Document table.

Base LLM Settings

These are the primary settings for configuring LLM connectivity.

| Parameter | Required? | Description | Type | Default Value |
| --- | --- | --- | --- | --- |
| GRAPHRAG_API_KEY | Yes for OpenAI. Optional for AOAI | The API key. (Note: OPENAI_API_KEY is also used as a fallback). If not defined when using AOAI, managed identity will be used. | str | None |
| GRAPHRAG_API_BASE | For AOAI | The API Base URL | str | None |
| GRAPHRAG_API_VERSION | For AOAI | The AOAI API version. | str | None |
| GRAPHRAG_API_ORGANIZATION | | The AOAI organization. | str | None |
| GRAPHRAG_API_PROXY | | The AOAI proxy. | str | None |

Text Generation Settings

These settings control the text generation model used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.

| Parameter | Required? | Description | Type | Default Value |
| --- | --- | --- | --- | --- |
| GRAPHRAG_LLM_TYPE | For AOAI | The LLM operation type. Either openai_chat or azure_openai_chat | str | openai_chat |
| GRAPHRAG_LLM_DEPLOYMENT_NAME | For AOAI | The AOAI model deployment name. | str | None |
| GRAPHRAG_LLM_API_KEY | Yes (uses fallback) | The API key. If not defined when using AOAI, managed identity will be used. | str | None |
| GRAPHRAG_LLM_API_BASE | For AOAI (uses fallback) | The API Base URL | str | None |
| GRAPHRAG_LLM_API_VERSION | For AOAI (uses fallback) | The AOAI API version. | str | None |
| GRAPHRAG_LLM_API_ORGANIZATION | For AOAI (uses fallback) | The AOAI organization. | str | None |
| GRAPHRAG_LLM_API_PROXY | | The AOAI proxy. | str | None |
| GRAPHRAG_LLM_MODEL | | The LLM model. | str | gpt-4-turbo-preview |
| GRAPHRAG_LLM_MAX_TOKENS | | The maximum number of tokens. | int | 4000 |
| GRAPHRAG_LLM_REQUEST_TIMEOUT | | The maximum number of seconds to wait for a response from the chat client. | int | 180 |
| GRAPHRAG_LLM_MODEL_SUPPORTS_JSON | | Indicates whether the given model supports JSON output mode. True to enable. | str | None |
| GRAPHRAG_LLM_THREAD_COUNT | | The number of threads to use for LLM parallelization. | int | 50 |
| GRAPHRAG_LLM_THREAD_STAGGER | | The time to wait (in seconds) between starting each thread. | float | 0.3 |
| GRAPHRAG_LLM_CONCURRENT_REQUESTS | | The number of concurrent requests to allow for the LLM client. | int | 25 |
| GRAPHRAG_LLM_TOKENS_PER_MINUTE | | The number of tokens per minute to allow for the LLM client. 0 = Bypass | int | 0 |
| GRAPHRAG_LLM_REQUESTS_PER_MINUTE | | The number of requests per minute to allow for the LLM client. 0 = Bypass | int | 0 |
| GRAPHRAG_LLM_MAX_RETRIES | | The maximum number of retries to attempt when a request fails. | int | 10 |
| GRAPHRAG_LLM_MAX_RETRY_WAIT | | The maximum number of seconds to wait between retries. | int | 10 |
| GRAPHRAG_LLM_SLEEP_ON_RATE_LIMIT_RECOMMENDATION | | Whether to sleep on rate limit recommendation. (Azure Only) | bool | True |
| GRAPHRAG_LLM_TEMPERATURE | | The temperature to use for generation. | float | 0 |
| GRAPHRAG_LLM_TOP_P | | The top_p to use for sampling. | float | 1 |
| GRAPHRAG_LLM_N | | The number of responses to generate. | int | 1 |
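The "uses fallback" behaviour for the API key can be illustrated with a short lookup sketch. The order shown (LLM-specific key, then the base key, then OPENAI_API_KEY) is inferred from the descriptions on this page; the function name `resolve_api_key` is hypothetical and not part of GraphRAG.

```python
from typing import Mapping, Optional

# Lookup order inferred from this page: the LLM-specific key wins, then the
# base key, then OPENAI_API_KEY as the last resort.
API_KEY_FALLBACK_ORDER = (
    "GRAPHRAG_LLM_API_KEY",
    "GRAPHRAG_API_KEY",
    "OPENAI_API_KEY",
)


def resolve_api_key(env: Mapping[str, str]) -> Optional[str]:
    """Return the first non-empty API key found, or None if none is set."""
    for name in API_KEY_FALLBACK_ORDER:
        value = env.get(name)
        if value:
            return value
    # With AOAI, a missing key means managed identity would be used instead.
    return None
```

For example, `resolve_api_key({"GRAPHRAG_API_KEY": "base"})` returns the base key because no LLM-specific key is defined.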

Text Embedding Settings

These settings control the text embedding model used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.

| Parameter | Required? | Description | Type | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_EMBEDDING_TYPE | For AOAI | The embedding client to use. Either openai_embedding or azure_openai_embedding | str | openai_embedding |
| GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME | For AOAI | The AOAI deployment name. | str | None |
| GRAPHRAG_EMBEDDING_API_KEY | Yes (uses fallback) | The API key to use for the embedding client. If not defined when using AOAI, managed identity will be used. | str | None |
| GRAPHRAG_EMBEDDING_API_BASE | For AOAI (uses fallback) | The API base URL. | str | None |
| GRAPHRAG_EMBEDDING_API_VERSION | For AOAI (uses fallback) | The AOAI API version to use for the embedding client. | str | None |
| GRAPHRAG_EMBEDDING_API_ORGANIZATION | For AOAI (uses fallback) | The AOAI organization to use for the embedding client. | str | None |
| GRAPHRAG_EMBEDDING_API_PROXY | | The AOAI proxy to use for the embedding client. | str | None |
| GRAPHRAG_EMBEDDING_MODEL | | The model to use for the embedding client. | str | text-embedding-3-small |
| GRAPHRAG_EMBEDDING_BATCH_SIZE | | The number of texts to embed at once. (Azure limit is 16) | int | 16 |
| GRAPHRAG_EMBEDDING_BATCH_MAX_TOKENS | | The maximum tokens per batch (Azure limit is 8191) | int | 8191 |
| GRAPHRAG_EMBEDDING_TARGET | | The target fields to embed. Either required or all. | str | required |
| GRAPHRAG_EMBEDDING_THREAD_COUNT | | The number of threads to use for parallelization for embeddings. | int | |
| GRAPHRAG_EMBEDDING_THREAD_STAGGER | | The time to wait (in seconds) between starting each thread for embeddings. | float | 50 |
| GRAPHRAG_EMBEDDING_CONCURRENT_REQUESTS | | The number of concurrent requests to allow for the embedding client. | int | 25 |
| GRAPHRAG_EMBEDDING_TOKENS_PER_MINUTE | | The number of tokens per minute to allow for the embedding client. 0 = Bypass | int | 0 |
| GRAPHRAG_EMBEDDING_REQUESTS_PER_MINUTE | | The number of requests per minute to allow for the embedding client. 0 = Bypass | int | 0 |
| GRAPHRAG_EMBEDDING_MAX_RETRIES | | The maximum number of retries to attempt when a request fails. | int | 10 |
| GRAPHRAG_EMBEDDING_MAX_RETRY_WAIT | | The maximum number of seconds to wait between retries. | int | 10 |
| GRAPHRAG_EMBEDDING_SLEEP_ON_RATE_LIMIT_RECOMMENDATION | | Whether to sleep on rate limit recommendation. (Azure Only) | bool | True |
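To see how the batch size and batch token limits interact, here is a rough sketch of a greedy batcher. It is not GraphRAG's implementation: `batch_texts` is a hypothetical helper, and the default whitespace-based token counter is an assumption standing in for a real tokenizer.

```python
def batch_texts(texts, batch_size=16, batch_max_tokens=8191, count_tokens=None):
    """Group texts so each batch respects both the item and the token limit.

    count_tokens is a stand-in tokenizer; by default it counts whitespace-
    separated words, which only approximates real token counts.
    """
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())

    batches = []
    current = []
    current_tokens = 0
    for text in texts:
        n_tokens = count_tokens(text)
        # Start a new batch when adding this text would break either limit.
        too_many_items = len(current) >= batch_size
        too_many_tokens = current_tokens + n_tokens > batch_max_tokens
        if current and (too_many_items or too_many_tokens):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches
```

Note that in this sketch a single text longer than `batch_max_tokens` still gets a batch of its own rather than being split or rejected.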

Input Settings

These settings control the data input used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.

Plaintext Input Data (GRAPHRAG_INPUT_FILE_TYPE=text)

| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_INPUT_FILE_PATTERN | The file pattern regexp to use when reading input files from the input directory. | str | optional | .*\.txt$ |
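As a quick illustration of how a file pattern selects inputs, the sketch below applies the default text pattern to a list of file names with Python's `re` module. The helper `select_input_files` is hypothetical, and whether GraphRAG matches against the bare file name or the full path is an assumption here.

```python
import re


def select_input_files(filenames, file_pattern=r".*\.txt$"):
    """Return the names that match file_pattern, keeping the input order."""
    pattern = re.compile(file_pattern)
    return [name for name in filenames if pattern.match(name)]


if __name__ == "__main__":
    candidates = ["a.txt", "sub/b.txt", "c.csv", "d.txt.bak"]
    # Only names ending in ".txt" pass the default pattern.
    print(select_input_files(candidates))
```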

CSV Input Data (GRAPHRAG_INPUT_FILE_TYPE=csv)

| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_INPUT_TYPE | The input storage type to use when reading files. (file or blob) | str | optional | file |
| GRAPHRAG_INPUT_FILE_PATTERN | The file pattern regexp to use when reading input files from the input directory. | str | optional | .*\.csv$ |
| GRAPHRAG_INPUT_TEXT_COLUMN | The 'text' column to use when reading CSV input files. | str | optional | text |
| GRAPHRAG_INPUT_METADATA | A list of CSV columns, comma-separated, to incorporate as JSON in a metadata column. | str | optional | None |
| GRAPHRAG_INPUT_TITLE_COLUMN | The 'title' column to use when reading CSV input files. | str | optional | title |
| GRAPHRAG_INPUT_STORAGE_ACCOUNT_BLOB_URL | The Azure Storage blob endpoint to use when in blob mode and using managed identity. Will have the format https://<storage_account_name>.blob.core.windows.net | str | optional | None |
| GRAPHRAG_INPUT_CONNECTION_STRING | The connection string to use when reading CSV input files from Azure Blob Storage. | str | optional | None |
| GRAPHRAG_INPUT_CONTAINER_NAME | The container name to use when reading CSV input files from Azure Blob Storage. | str | optional | None |
| GRAPHRAG_INPUT_BASE_DIR | The base directory to read input files from. | str | optional | None |
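The column-mapping settings above can be pictured with a small reader built on the standard `csv` module. This is only a sketch of the idea: `load_csv_documents` and its output shape are hypothetical, not GraphRAG's loader.

```python
import csv
import io
import json


def load_csv_documents(csv_text, text_column="text", title_column="title",
                       metadata_columns=()):
    """Map CSV rows to simple document dicts.

    text_column and title_column play the role of GRAPHRAG_INPUT_TEXT_COLUMN
    and GRAPHRAG_INPUT_TITLE_COLUMN; metadata_columns mirrors the
    comma-separated GRAPHRAG_INPUT_METADATA list after splitting.
    """
    documents = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = {"text": row[text_column], "title": row.get(title_column)}
        if metadata_columns:
            # Selected columns are serialized as JSON into one metadata field.
            doc["metadata"] = json.dumps({col: row[col] for col in metadata_columns})
        documents.append(doc)
    return documents


if __name__ == "__main__":
    sample = "title,body,source\nDoc A,hello world,web\n"
    print(load_csv_documents(sample, text_column="body", metadata_columns=["source"]))
```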

Data Mapping Settings

| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_INPUT_FILE_TYPE | The type of input data, csv or text | str | optional | text |
| GRAPHRAG_INPUT_ENCODING | The encoding to apply when reading CSV/text input files. | str | optional | utf-8 |

Data Chunking

| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_CHUNK_SIZE | The chunk size in tokens for text-chunk analysis windows. | str | optional | 1200 |
| GRAPHRAG_CHUNK_OVERLAP | The chunk overlap in tokens for text-chunk analysis windows. | str | optional | 100 |
| GRAPHRAG_CHUNK_BY_COLUMNS | A comma-separated list of document attributes to groupby when performing TextUnit chunking. | str | optional | id |
| GRAPHRAG_CHUNK_ENCODING_MODEL | The encoding model to use for chunking. | str | optional | The top-level encoding model. |
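The relationship between chunk size and chunk overlap can be sketched as a sliding window over a token sequence. This is a minimal illustration, assuming the window advances by size minus overlap tokens; `chunk_tokens` is a hypothetical helper, and real chunking would first encode text with the configured encoding model.

```python
def chunk_tokens(tokens, chunk_size=1200, chunk_overlap=100):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the sequence.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    tokens = list(range(2500))  # stand-in for real token ids
    print([len(chunk) for chunk in chunk_tokens(tokens)])
```

With the defaults, each window starts 1100 tokens after the previous one, so the last 100 tokens of a chunk reappear at the start of the next.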

Prompting Overrides

| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_ENTITY_EXTRACTION_PROMPT_FILE | The path (relative to the root) of an entity extraction prompt template text file. | str | optional | None |
| GRAPHRAG_ENTITY_EXTRACTION_MAX_GLEANINGS | The maximum number of redrives (gleanings) to invoke when extracting entities in a loop. | int | optional | 1 |
| GRAPHRAG_ENTITY_EXTRACTION_ENTITY_TYPES | A comma-separated list of entity types to extract. | str | optional | organization,person,event,geo |
| GRAPHRAG_ENTITY_EXTRACTION_ENCODING_MODEL | The encoding model to use for entity extraction. | str | optional | The top-level encoding model. |
| GRAPHRAG_SUMMARIZE_DESCRIPTIONS_PROMPT_FILE | The path (relative to the root) of a description summarization prompt template text file. | str | optional | None |
| GRAPHRAG_SUMMARIZE_DESCRIPTIONS_MAX_LENGTH | The maximum number of tokens to generate per description summarization. | int | optional | 500 |
| GRAPHRAG_CLAIM_EXTRACTION_ENABLED | Whether claim extraction is enabled for this pipeline. | bool | optional | False |
| GRAPHRAG_CLAIM_EXTRACTION_DESCRIPTION | The claim_description prompting argument to utilize. | str | optional | "Any claims or facts that could be relevant to threat analysis." |
| GRAPHRAG_CLAIM_EXTRACTION_PROMPT_FILE | The claim extraction prompt to utilize. | str | optional | None |
| GRAPHRAG_CLAIM_EXTRACTION_MAX_GLEANINGS | The maximum number of redrives (gleanings) to invoke when extracting claims in a loop. | int | optional | 1 |
| GRAPHRAG_CLAIM_EXTRACTION_ENCODING_MODEL | The encoding model to use for claim extraction. | str | optional | The top-level encoding model. |
| GRAPHRAG_COMMUNITY_REPORTS_PROMPT_FILE | The community reports extraction prompt to utilize. | str | optional | None |
| GRAPHRAG_COMMUNITY_REPORTS_MAX_LENGTH | The maximum number of tokens to generate per community report. | int | optional | 1500 |

Storage

This section controls the storage mechanism the pipeline uses to export output tables.

| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_STORAGE_TYPE | The type of storage to use. Options are file, memory, or blob | str | optional | file |
| GRAPHRAG_STORAGE_STORAGE_ACCOUNT_BLOB_URL | The Azure Storage blob endpoint to use when in blob mode and using managed identity. Will have the format https://<storage_account_name>.blob.core.windows.net | str | optional | None |
| GRAPHRAG_STORAGE_CONNECTION_STRING | The Azure Storage connection string to use when in blob mode. | str | optional | None |
| GRAPHRAG_STORAGE_CONTAINER_NAME | The Azure Storage container name to use when in blob mode. | str | optional | None |
| GRAPHRAG_STORAGE_BASE_DIR | The base path to data outputs. | str | optional | None |

Cache

This section controls the cache mechanism used by the pipeline. This is used to cache LLM invocation results.

| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_CACHE_TYPE | The type of cache to use. Options are file, memory, none or blob | str | optional | file |
| GRAPHRAG_CACHE_STORAGE_ACCOUNT_BLOB_URL | The Azure Storage blob endpoint to use when in blob mode and using managed identity. Will have the format https://<storage_account_name>.blob.core.windows.net | str | optional | None |
| GRAPHRAG_CACHE_CONNECTION_STRING | The Azure Storage connection string to use when in blob mode. | str | optional | None |
| GRAPHRAG_CACHE_CONTAINER_NAME | The Azure Storage container name to use when in blob mode. | str | optional | None |
| GRAPHRAG_CACHE_BASE_DIR | The base path to the cache files. | str | optional | None |

Reporting

This section controls the reporting mechanism used by the pipeline, for common events and error messages. The default is to write reports to a file in the output directory. However, you can also choose to write reports to an Azure Blob Storage container.

| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_REPORTING_TYPE | The type of reporter to use. Options are file or blob | str | optional | file |
| GRAPHRAG_REPORTING_STORAGE_ACCOUNT_BLOB_URL | The Azure Storage blob endpoint to use when in blob mode and using managed identity. Will have the format https://<storage_account_name>.blob.core.windows.net | str | optional | None |
| GRAPHRAG_REPORTING_CONNECTION_STRING | The Azure Storage connection string to use when in blob mode. | str | optional | None |
| GRAPHRAG_REPORTING_CONTAINER_NAME | The Azure Storage container name to use when in blob mode. | str | optional | None |
| GRAPHRAG_REPORTING_BASE_DIR | The base path to the reporting outputs. | str | optional | None |

Node2Vec Parameters

| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_NODE2VEC_ENABLED | Whether to enable Node2Vec | bool | optional | False |
| GRAPHRAG_NODE2VEC_NUM_WALKS | The Node2Vec number of walks to perform | int | optional | 10 |
| GRAPHRAG_NODE2VEC_WALK_LENGTH | The Node2Vec walk length | int | optional | 40 |
| GRAPHRAG_NODE2VEC_WINDOW_SIZE | The Node2Vec window size | int | optional | 2 |
| GRAPHRAG_NODE2VEC_ITERATIONS | The number of iterations to run node2vec | int | optional | 3 |
| GRAPHRAG_NODE2VEC_RANDOM_SEED | The random seed to use for node2vec | int | optional | 597832 |

Data Snapshotting

| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_SNAPSHOT_EMBEDDINGS | Whether to enable embeddings snapshots. | bool | optional | False |
| GRAPHRAG_SNAPSHOT_GRAPHML | Whether to enable GraphML snapshots. | bool | optional | False |
| GRAPHRAG_SNAPSHOT_RAW_ENTITIES | Whether to enable raw entity snapshots. | bool | optional | False |
| GRAPHRAG_SNAPSHOT_TOP_LEVEL_NODES | Whether to enable top-level node snapshots. | bool | optional | False |
| GRAPHRAG_SNAPSHOT_TRANSIENT | Whether to enable transient table snapshots. | bool | optional | False |

Miscellaneous Settings

| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_ASYNC_MODE | Which async mode to use. Either asyncio or threaded. | str | optional | asyncio |
| GRAPHRAG_ENCODING_MODEL | The text encoding model, used in tiktoken, to encode text. | str | optional | cl100k_base |
| GRAPHRAG_MAX_CLUSTER_SIZE | The maximum number of entities to include in a single Leiden cluster. | int | optional | 10 |
| GRAPHRAG_UMAP_ENABLED | Whether to enable UMAP layouts | bool | optional | False |