mirror of https://github.com/microsoft/graphrag.git synced 2025-12-10 22:41:58 +00:00

Default Configuration Mode (using Env Vars)

As of version 1.3, GraphRAG no longer supports a full complement of pre-built environment variables. Instead, we support variable replacement within the settings.yml file so you can specify any environment variables you like.

The only standard environment variable we expect, and include in the default settings.yml, is GRAPHRAG_API_KEY. If you are already using a number of the previous GRAPHRAG_* environment variables, you can insert them with template syntax into settings.yml and they will be adopted.

The environment variables below are documented as an aid for migration, but they WILL NOT be read unless you use template syntax in your settings.yml. We also WILL NOT be updating this page as the main config object changes.
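As a rough illustration of what variable replacement means here, the sketch below expands `${NAME}` placeholders in a settings fragment using Python's standard `string.Template`. The YAML keys (`llm`, `api_key`, `api_base`) and the `expand_env` helper are assumptions made for this example; GraphRAG's own loader may be implemented differently.

```python
import os
from string import Template
from typing import Mapping, Optional

# Hypothetical settings.yml fragment; the key names are illustrative only.
SETTINGS_TEXT = """\
llm:
  api_key: ${GRAPHRAG_API_KEY}
  api_base: ${GRAPHRAG_API_BASE}
"""


def expand_env(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace ${NAME} placeholders with values from env (os.environ by default)."""
    values = os.environ if env is None else env
    # substitute() raises KeyError for an undefined variable, so typos surface early.
    return Template(text).substitute(values)


if __name__ == "__main__":
    demo_env = {
        "GRAPHRAG_API_KEY": "example-key",
        "GRAPHRAG_API_BASE": "https://example.invalid",
    }
    print(expand_env(SETTINGS_TEXT, demo_env))
```

Under this model, a variable from the tables below has no effect unless a matching placeholder appears somewhere in the settings file.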

Text-Embeddings Customization

By default, the GraphRAG indexer will only export embeddings required for our query methods. However, the model has embeddings defined for all plaintext fields, and these can be generated by setting the GRAPHRAG_EMBEDDING_TARGET environment variable to all.

Embedded Fields

text_unit.text
document.text
entity.title
entity.description
relationship.description
community.title
community.summary
community.full_content

Input Data

Our pipeline can ingest .csv or .txt data from an input folder. These files can be nested within subfolders. To configure how input data is handled, which fields are mapped, and how timestamps are parsed, look for configuration values starting with GRAPHRAG_INPUT_ below. In general, CSV-based data provides the most customizability. Each CSV should contain at least a text field (which can be mapped with environment variables), but it's helpful if it also has title, timestamp, and source fields. Additional fields can be included as well; they will land as extra fields on the Document table.

Base LLM Settings

These are the primary settings for configuring LLM connectivity.

Parameter	Required?	Description	Type	Default Value
`GRAPHRAG_API_KEY`	Yes for OpenAI. Optional for AOAI	The API key. (Note: `OPENAI_API_KEY` is also used as a fallback). If not defined when using AOAI, managed identity will be used.	`str`	`None`
`GRAPHRAG_API_BASE`	For AOAI	The API Base URL	`str`	`None`
`GRAPHRAG_API_VERSION`	For AOAI	The AOAI API version.	`str`	`None`
`GRAPHRAG_API_ORGANIZATION`		The AOAI organization.	`str`	`None`
`GRAPHRAG_API_PROXY`		The AOAI proxy.	`str`	`None`

Text Generation Settings

These settings control the text generation model used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.

Parameter	Required?	Description	Type	Default Value
`GRAPHRAG_LLM_TYPE`	For AOAI	The LLM operation type. Either `openai_chat` or `azure_openai_chat`	`str`	`openai_chat`
`GRAPHRAG_LLM_DEPLOYMENT_NAME`	For AOAI	The AOAI model deployment name.	`str`	`None`
`GRAPHRAG_LLM_API_KEY`	Yes (uses fallback)	The API key. If not defined when using AOAI, managed identity will be used.	`str`	`None`
`GRAPHRAG_LLM_API_BASE`	For AOAI (uses fallback)	The API Base URL	`str`	`None`
`GRAPHRAG_LLM_API_VERSION`	For AOAI (uses fallback)	The AOAI API version.	`str`	`None`
`GRAPHRAG_LLM_API_ORGANIZATION`	For AOAI (uses fallback)	The AOAI organization.	`str`	`None`
`GRAPHRAG_LLM_API_PROXY`		The AOAI proxy.	`str`	`None`
`GRAPHRAG_LLM_MODEL`		The LLM model.	`str`	`gpt-4-turbo-preview`
`GRAPHRAG_LLM_MAX_TOKENS`		The maximum number of tokens.	`int`	`4000`
`GRAPHRAG_LLM_REQUEST_TIMEOUT`		The maximum number of seconds to wait for a response from the chat client.	`int`	`180`
`GRAPHRAG_LLM_MODEL_SUPPORTS_JSON`		Indicates whether the given model supports JSON output mode. `True` to enable.	`str`	`None`
`GRAPHRAG_LLM_THREAD_COUNT`		The number of threads to use for LLM parallelization.	`int`	50
`GRAPHRAG_LLM_THREAD_STAGGER`		The time to wait (in seconds) between starting each thread.	`float`	0.3
`GRAPHRAG_LLM_CONCURRENT_REQUESTS`		The number of concurrent requests to allow for the LLM client.	`int`	25
`GRAPHRAG_LLM_TOKENS_PER_MINUTE`		The number of tokens per minute to allow for the LLM client. 0 = Bypass	`int`	0
`GRAPHRAG_LLM_REQUESTS_PER_MINUTE`		The number of requests per minute to allow for the LLM client. 0 = Bypass	`int`	0
`GRAPHRAG_LLM_MAX_RETRIES`		The maximum number of retries to attempt when a request fails.	`int`	10
`GRAPHRAG_LLM_MAX_RETRY_WAIT`		The maximum number of seconds to wait between retries.	`int`	10
`GRAPHRAG_LLM_SLEEP_ON_RATE_LIMIT_RECOMMENDATION`		Whether to sleep on rate limit recommendation. (Azure Only)	`bool`	`True`
`GRAPHRAG_LLM_TEMPERATURE`		The temperature to use for generation.	`float`	0
`GRAPHRAG_LLM_TOP_P`		The top_p to use for sampling.	`float`	1
`GRAPHRAG_LLM_N`		The number of responses to generate.	`int`	1
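To make the retry settings concrete, here is a minimal sketch of how a maximum retry count and a maximum retry wait might interact. The exponential backoff with jitter and the `call_with_retries` helper are assumptions for illustration; the table only states the two limits, not the strategy the client uses between them.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_retries: int = 10,        # mirrors GRAPHRAG_LLM_MAX_RETRIES
    max_retry_wait: float = 10,   # mirrors GRAPHRAG_LLM_MAX_RETRY_WAIT (seconds)
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with a capped, jittered backoff."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            # Assumed strategy: exponential backoff with jitter, capped at max_retry_wait.
            wait = min(max_retry_wait, (2 ** attempt) * random.uniform(0.5, 1.0))
            sleep(wait)
            attempt += 1
```

With the defaults, a request is attempted at most 11 times (one initial call plus 10 retries), and no single pause exceeds 10 seconds.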

Text Embedding Settings

These settings control the text embedding model used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.

Parameter	Required?	Description	Type	Default
`GRAPHRAG_EMBEDDING_TYPE`	For AOAI	The embedding client to use. Either `openai_embedding` or `azure_openai_embedding`	`str`	`openai_embedding`
`GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME`	For AOAI	The AOAI deployment name.	`str`	`None`
`GRAPHRAG_EMBEDDING_API_KEY`	Yes (uses fallback)	The API key to use for the embedding client. If not defined when using AOAI, managed identity will be used.	`str`	`None`
`GRAPHRAG_EMBEDDING_API_BASE`	For AOAI (uses fallback)	The API base URL.	`str`	`None`
`GRAPHRAG_EMBEDDING_API_VERSION`	For AOAI (uses fallback)	The AOAI API version to use for the embedding client.	`str`	`None`
`GRAPHRAG_EMBEDDING_API_ORGANIZATION`	For AOAI (uses fallback)	The AOAI organization to use for the embedding client.	`str`	`None`
`GRAPHRAG_EMBEDDING_API_PROXY`		The AOAI proxy to use for the embedding client.	`str`	`None`
`GRAPHRAG_EMBEDDING_MODEL`		The model to use for the embedding client.	`str`	`text-embedding-3-small`
`GRAPHRAG_EMBEDDING_BATCH_SIZE`		The number of texts to embed at once. (Azure limit is 16)	`int`	16
`GRAPHRAG_EMBEDDING_BATCH_MAX_TOKENS`		The maximum tokens per batch (Azure limit is 8191)	`int`	8191
`GRAPHRAG_EMBEDDING_TARGET`		The target fields to embed. Either `required` or `all`.	`str`	`required`
`GRAPHRAG_EMBEDDING_THREAD_COUNT`		The number of threads to use for parallelization for embeddings.	`int`
`GRAPHRAG_EMBEDDING_THREAD_STAGGER`		The time to wait (in seconds) between starting each thread for embeddings.	`float`	50
`GRAPHRAG_EMBEDDING_CONCURRENT_REQUESTS`		The number of concurrent requests to allow for the embedding client.	`int`	25
`GRAPHRAG_EMBEDDING_TOKENS_PER_MINUTE`		The number of tokens per minute to allow for the embedding client. 0 = Bypass	`int`	0
`GRAPHRAG_EMBEDDING_REQUESTS_PER_MINUTE`		The number of requests per minute to allow for the embedding client. 0 = Bypass	`int`	0
`GRAPHRAG_EMBEDDING_MAX_RETRIES`		The maximum number of retries to attempt when a request fails.	`int`	10
`GRAPHRAG_EMBEDDING_MAX_RETRY_WAIT`		The maximum number of seconds to wait between retries.	`int`	10
`GRAPHRAG_EMBEDDING_SLEEP_ON_RATE_LIMIT_RECOMMENDATION`		Whether to sleep on rate limit recommendation. (Azure Only)	`bool`	`True`
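The two batch limits can be read as joint constraints on each embedding request. The sketch below groups texts so that no batch exceeds either limit. The `batch_texts` helper and the whitespace-based token counter are assumptions for illustration; a real pipeline would count tokens with the configured encoding model.

```python
from typing import Callable, Iterable, Iterator


def batch_texts(
    texts: Iterable[str],
    batch_size: int = 16,          # mirrors GRAPHRAG_EMBEDDING_BATCH_SIZE
    batch_max_tokens: int = 8191,  # mirrors GRAPHRAG_EMBEDDING_BATCH_MAX_TOKENS
    count_tokens: Callable[[str], int] = lambda t: len(t.split()),  # stand-in tokenizer
) -> Iterator[list[str]]:
    """Yield batches that respect both the item limit and the token limit."""
    batch: list[str] = []
    tokens_in_batch = 0
    for text in texts:
        n = count_tokens(text)
        # Close the current batch if adding this text would break either limit.
        if batch and (len(batch) >= batch_size or tokens_in_batch + n > batch_max_tokens):
            yield batch
            batch, tokens_in_batch = [], 0
        # Note: a single oversized text is still emitted alone, not truncated.
        batch.append(text)
        tokens_in_batch += n
    if batch:
        yield batch
```

For example, 40 short texts with the default limits produce batches of 16, 16, and 8 items.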

Input Settings

These settings control the data input used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.

Plaintext Input Data (`GRAPHRAG_INPUT_FILE_TYPE`=text)

Parameter	Description	Type	Required or Optional	Default
`GRAPHRAG_INPUT_FILE_PATTERN`	The file pattern regexp to use when reading input files from the input directory.	`str`	optional	`.*\.txt$`
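The file pattern is a regular expression, not a shell glob. The sketch below shows how the default `.*\.txt$` would select files from a directory listing. The `select_input_files` helper and the use of `re.search` are assumptions for illustration, not a description of the loader's internals.

```python
import re


def select_input_files(names: list[str], pattern: str = r".*\.txt$") -> list[str]:
    """Return the names that match the file pattern regexp."""
    regex = re.compile(pattern)
    # The trailing "$" anchors the match, so "notes.txt.bak" is rejected.
    return [name for name in names if regex.search(name)]


if __name__ == "__main__":
    listing = ["a.txt", "docs/b.txt", "c.csv", "d.txt.bak"]
    print(select_input_files(listing))
```

Because the pattern is matched against the whole path, files nested in subfolders such as `docs/b.txt` are selected too.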

CSV Input Data (`GRAPHRAG_INPUT_FILE_TYPE`=csv)

Parameter	Description	Type	Required or Optional	Default
`GRAPHRAG_INPUT_TYPE`	The input storage type to use when reading files. (`file` or `blob`)	`str`	optional	`file`
`GRAPHRAG_INPUT_FILE_PATTERN`	The file pattern regexp to use when reading input files from the input directory.	`str`	optional	`.*\.csv$`
`GRAPHRAG_INPUT_TEXT_COLUMN`	The 'text' column to use when reading CSV input files.	`str`	optional	`text`
`GRAPHRAG_INPUT_METADATA`	A list of CSV columns, comma-separated, to incorporate as JSON in a metadata column.	`str`	optional	`None`
`GRAPHRAG_INPUT_TITLE_COLUMN`	The 'title' column to use when reading CSV input files.	`str`	optional	`title`
`GRAPHRAG_INPUT_STORAGE_ACCOUNT_BLOB_URL`	The Azure Storage blob endpoint to use when in `blob` mode and using managed identity. Will have the format `https://<storage_account_name>.blob.core.windows.net`	`str`	optional	`None`
`GRAPHRAG_INPUT_CONNECTION_STRING`	The connection string to use when reading CSV input files from Azure Blob Storage.	`str`	optional	`None`
`GRAPHRAG_INPUT_CONTAINER_NAME`	The container name to use when reading CSV input files from Azure Blob Storage.	`str`	optional	`None`
`GRAPHRAG_INPUT_BASE_DIR`	The base directory to read input files from.	`str`	optional	`None`
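As a rough picture of how the column settings above relate to a CSV file, the sketch below maps each row to a document with `text`, `title`, and an optional JSON `metadata` field. The `rows_to_documents` helper, the output field names, and the sample data are assumptions for illustration only.

```python
import csv
import io
import json


def rows_to_documents(
    csv_text: str,
    text_column: str = "text",    # mirrors GRAPHRAG_INPUT_TEXT_COLUMN
    title_column: str = "title",  # mirrors GRAPHRAG_INPUT_TITLE_COLUMN
    metadata: str = "",           # mirrors GRAPHRAG_INPUT_METADATA (comma-separated)
) -> list[dict[str, str]]:
    """Map CSV rows to simple document dicts."""
    metadata_columns = [c.strip() for c in metadata.split(",") if c.strip()]
    documents = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = {"text": row[text_column], "title": row.get(title_column, "")}
        if metadata_columns:
            # Selected columns are serialized as JSON into one metadata field.
            doc["metadata"] = json.dumps({c: row[c] for c in metadata_columns})
        documents.append(doc)
    return documents


if __name__ == "__main__":
    sample = "title,text,source\nIntro,Hello world,wiki\n"
    print(rows_to_documents(sample, metadata="source"))
```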

Data Mapping Settings

Parameter	Description	Type	Required or Optional	Default
`GRAPHRAG_INPUT_FILE_TYPE`	The type of input data, `csv` or `text`	`str`	optional	`text`
`GRAPHRAG_INPUT_ENCODING`	The encoding to apply when reading CSV/text input files.	`str`	optional	`utf-8`

Data Chunking

Parameter	Description	Type	Required or Optional	Default
`GRAPHRAG_CHUNK_SIZE`	The chunk size in tokens for text-chunk analysis windows.	`str`	optional	1200
`GRAPHRAG_CHUNK_OVERLAP`	The chunk overlap in tokens for text-chunk analysis windows.	`str`	optional	100
`GRAPHRAG_CHUNK_BY_COLUMNS`	A comma-separated list of document attributes to group by when performing TextUnit chunking.	`str`	optional	`id`
`GRAPHRAG_CHUNK_ENCODING_MODEL`	The encoding model to use for chunking.	`str`	optional	The top-level encoding model.
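Chunk size and overlap together define a sliding window over the token sequence. The sketch below shows one common reading of those two values, where each window starts `chunk_size - chunk_overlap` tokens after the previous one. The `chunk_tokens` helper is an assumption for illustration, and it takes already-encoded token ids so that no tokenizer dependency is needed.

```python
def chunk_tokens(
    tokens: list[int],
    chunk_size: int = 1200,    # mirrors GRAPHRAG_CHUNK_SIZE
    chunk_overlap: int = 100,  # mirrors GRAPHRAG_CHUNK_OVERLAP
) -> list[list[int]]:
    """Split a token sequence into overlapping windows."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: list[list[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the sequence.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    lengths = [len(c) for c in chunk_tokens(list(range(2500)))]
    print(lengths)
```

With the defaults, a 2,500-token document yields three chunks, and each chunk shares its first 100 tokens with the end of the previous one.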

Prompting Overrides

Parameter	Description	Type	Required or Optional	Default
`GRAPHRAG_ENTITY_EXTRACTION_PROMPT_FILE`	The path (relative to the root) of an entity extraction prompt template text file.	`str`	optional	`None`
`GRAPHRAG_ENTITY_EXTRACTION_MAX_GLEANINGS`	The maximum number of redrives (gleanings) to invoke when extracting entities in a loop.	`int`	optional	1
`GRAPHRAG_ENTITY_EXTRACTION_ENTITY_TYPES`	A comma-separated list of entity types to extract.	`str`	optional	`organization,person,event,geo`
`GRAPHRAG_ENTITY_EXTRACTION_ENCODING_MODEL`	The encoding model to use for entity extraction.	`str`	optional	The top-level encoding model.
`GRAPHRAG_SUMMARIZE_DESCRIPTIONS_PROMPT_FILE`	The path (relative to the root) of a description summarization prompt template text file.	`str`	optional	`None`
`GRAPHRAG_SUMMARIZE_DESCRIPTIONS_MAX_LENGTH`	The maximum number of tokens to generate per description summarization.	`int`	optional	500
`GRAPHRAG_CLAIM_EXTRACTION_ENABLED`	Whether claim extraction is enabled for this pipeline.	`bool`	optional	`False`
`GRAPHRAG_CLAIM_EXTRACTION_DESCRIPTION`	The claim_description prompting argument to utilize.	`str`	optional	"Any claims or facts that could be relevant to threat analysis."
`GRAPHRAG_CLAIM_EXTRACTION_PROMPT_FILE`	The claim extraction prompt to utilize.	`str`	optional	`None`
`GRAPHRAG_CLAIM_EXTRACTION_MAX_GLEANINGS`	The maximum number of redrives (gleanings) to invoke when extracting claims in a loop.	`int`	optional	1
`GRAPHRAG_CLAIM_EXTRACTION_ENCODING_MODEL`	The encoding model to use for claim extraction.	`str`	optional	The top-level encoding model.
`GRAPHRAG_COMMUNITY_REPORTS_PROMPT_FILE`	The community reports extraction prompt to utilize.	`str`	optional	`None`
`GRAPHRAG_COMMUNITY_REPORTS_MAX_LENGTH`	The maximum number of tokens to generate per community report.	`int`	optional	1500

Storage

This section controls the storage mechanism the pipeline uses for exporting output tables.

Parameter	Description	Type	Required or Optional	Default
`GRAPHRAG_STORAGE_TYPE`	The type of storage to use. Options are `file`, `memory`, or `blob`	`str`	optional	`file`
`GRAPHRAG_STORAGE_STORAGE_ACCOUNT_BLOB_URL`	The Azure Storage blob endpoint to use when in `blob` mode and using managed identity. Will have the format `https://<storage_account_name>.blob.core.windows.net`	`str`	optional	None
`GRAPHRAG_STORAGE_CONNECTION_STRING`	The Azure Storage connection string to use when in `blob` mode.	`str`	optional	None
`GRAPHRAG_STORAGE_CONTAINER_NAME`	The Azure Storage container name to use when in `blob` mode.	`str`	optional	None
`GRAPHRAG_STORAGE_BASE_DIR`	The base path to data outputs.	`str`	optional	None

Cache

This section controls the cache mechanism used by the pipeline. This is used to cache LLM invocation results.

Parameter	Description	Type	Required or Optional	Default
`GRAPHRAG_CACHE_TYPE`	The type of cache to use. Options are `file`, `memory`, `none` or `blob`	`str`	optional	`file`
`GRAPHRAG_CACHE_STORAGE_ACCOUNT_BLOB_URL`	The Azure Storage blob endpoint to use when in `blob` mode and using managed identity. Will have the format `https://<storage_account_name>.blob.core.windows.net`	`str`	optional	None
`GRAPHRAG_CACHE_CONNECTION_STRING`	The Azure Storage connection string to use when in `blob` mode.	`str`	optional	None
`GRAPHRAG_CACHE_CONTAINER_NAME`	The Azure Storage container name to use when in `blob` mode.	`str`	optional	None
`GRAPHRAG_CACHE_BASE_DIR`	The base path to the cache files.	`str`	optional	None

Reporting

This section controls the reporting mechanism used by the pipeline, for common events and error messages. The default is to write reports to a file in the output directory. However, you can also choose to write reports to an Azure Blob Storage container.

Parameter	Description	Type	Required or Optional	Default
`GRAPHRAG_REPORTING_TYPE`	The type of reporter to use. Options are `file` or `blob`	`str`	optional	`file`
`GRAPHRAG_REPORTING_STORAGE_ACCOUNT_BLOB_URL`	The Azure Storage blob endpoint to use when in `blob` mode and using managed identity. Will have the format `https://<storage_account_name>.blob.core.windows.net`	`str`	optional	None
`GRAPHRAG_REPORTING_CONNECTION_STRING`	The Azure Storage connection string to use when in `blob` mode.	`str`	optional	None
`GRAPHRAG_REPORTING_CONTAINER_NAME`	The Azure Storage container name to use when in `blob` mode.	`str`	optional	None
`GRAPHRAG_REPORTING_BASE_DIR`	The base path to the reporting outputs.	`str`	optional	None

Node2Vec Parameters

Parameter	Description	Type	Required or Optional	Default
`GRAPHRAG_NODE2VEC_ENABLED`	Whether to enable Node2Vec	`bool`	optional	False
`GRAPHRAG_NODE2VEC_NUM_WALKS`	The Node2Vec number of walks to perform	`int`	optional	10
`GRAPHRAG_NODE2VEC_WALK_LENGTH`	The Node2Vec walk length	`int`	optional	40
`GRAPHRAG_NODE2VEC_WINDOW_SIZE`	The Node2Vec window size	`int`	optional	2
`GRAPHRAG_NODE2VEC_ITERATIONS`	The number of iterations to run node2vec	`int`	optional	3
`GRAPHRAG_NODE2VEC_RANDOM_SEED`	The random seed to use for node2vec	`int`	optional	597832

Data Snapshotting

Parameter	Description	Type	Required or Optional	Default
`GRAPHRAG_SNAPSHOT_EMBEDDINGS`	Whether to enable embeddings snapshots.	`bool`	optional	False
`GRAPHRAG_SNAPSHOT_GRAPHML`	Whether to enable GraphML snapshots.	`bool`	optional	False
`GRAPHRAG_SNAPSHOT_RAW_ENTITIES`	Whether to enable raw entity snapshots.	`bool`	optional	False
`GRAPHRAG_SNAPSHOT_TOP_LEVEL_NODES`	Whether to enable top-level node snapshots.	`bool`	optional	False
`GRAPHRAG_SNAPSHOT_TRANSIENT`	Whether to enable transient table snapshots.	`bool`	optional	False

Miscellaneous Settings

Parameter	Description	Type	Required or Optional	Default
`GRAPHRAG_ASYNC_MODE`	Which async mode to use. Either `asyncio` or `threaded`.	`str`	optional	`asyncio`
`GRAPHRAG_ENCODING_MODEL`	The text encoding model, used in tiktoken, to encode text.	`str`	optional	`cl100k_base`
`GRAPHRAG_MAX_CLUSTER_SIZE`	The maximum number of entities to include in a single Leiden cluster.	`int`	optional	10
`GRAPHRAG_UMAP_ENABLED`	Whether to enable UMAP layouts	`bool`	optional	False
