# Default Configuration Mode (using Env Vars)
As of version 1.3, GraphRAG no longer supports a full complement of pre-built environment variables. Instead, we support variable replacement within the [settings.yml file](yaml.md), so you can specify any environment variables you like.

The only standard environment variable we expect, and include in the default settings.yml, is `GRAPHRAG_API_KEY`. If you are already using a number of the previous `GRAPHRAG_*` environment variables, you can insert them into settings.yml with template syntax and they will be adopted.

> **The environment variables below are documented as an aid for migration, but they WILL NOT be read unless you use template syntax in your settings.yml.**

---
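
As a rough illustration of the variable replacement described above, the sketch below substitutes `${NAME}` placeholders in a settings fragment using Python's standard `string.Template`. The `${...}` placeholder form, the `llm`/`api_key`/`model` keys, and the `render_settings` helper are assumptions made for this example; GraphRAG's own loader may behave differently.

```python
from string import Template
from typing import Mapping

# Hypothetical settings.yml fragment; the key names are illustrative only.
SETTINGS_TEMPLATE = """\
llm:
  api_key: ${GRAPHRAG_API_KEY}
  model: ${GRAPHRAG_LLM_MODEL}
"""


def render_settings(template_text: str, env: Mapping[str, str]) -> str:
    """Replace ${NAME} placeholders with values taken from env.

    Template.substitute raises KeyError when a referenced variable is
    missing, so a typo fails loudly instead of yielding an empty value.
    """
    return Template(template_text).substitute(env)
```

In real use you would likely pass `os.environ` as `env`. Any variable that is not referenced from the template is simply ignored, which matches the note above that legacy variables are not read on their own.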
### Text-Embeddings Customization
By default, the GraphRAG indexer will only export embeddings required for our query methods. However, the model has embeddings defined for all plaintext fields, and these can be generated by setting the `GRAPHRAG_EMBEDDING_TARGET` environment variable to `all`.
#### Embedded Fields
- `text_unit.text`
- `document.text`
- `entity.title`
- `entity.description`
- `relationship.description`
- `community.title`
- `community.summary`
- `community.full_content`
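
A minimal sketch of how the embedding target might be read and validated follows. The list of fields comes from the section above; the `embedding_target` and `fields_to_embed` helpers are hypothetical, and because this page does not say which fields count as "required", the caller supplies that subset.

```python
import os
from typing import Iterable, List, Mapping, Optional

# All plaintext fields listed in this section.
ALL_EMBEDDED_FIELDS = [
    "text_unit.text",
    "document.text",
    "entity.title",
    "entity.description",
    "relationship.description",
    "community.title",
    "community.summary",
    "community.full_content",
]

VALID_TARGETS = {"required", "all"}


def embedding_target(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured target, defaulting to 'required'."""
    env = os.environ if env is None else env
    value = env.get("GRAPHRAG_EMBEDDING_TARGET", "required").strip()
    if value not in VALID_TARGETS:
        raise ValueError(f"GRAPHRAG_EMBEDDING_TARGET must be one of {sorted(VALID_TARGETS)}")
    return value


def fields_to_embed(target: str, required_fields: Iterable[str]) -> List[str]:
    """Pick every field for 'all', otherwise only the caller-supplied subset."""
    if target == "all":
        return list(ALL_EMBEDDED_FIELDS)
    wanted = set(required_fields)
    return [field for field in ALL_EMBEDDED_FIELDS if field in wanted]
```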
### Input Data
Our pipeline can ingest .csv or .txt data from an input folder. These files can be nested within subfolders. To configure how input data is handled, what fields are mapped over, and how timestamps are parsed, look for configuration values starting with `GRAPHRAG_INPUT_` below. In general, CSV-based data provides the most customizability. Each CSV should at least contain a `text` field (which can be mapped with environment variables), but it's helpful if they also have `title`, `timestamp`, and `source` fields. Additional fields can be included as well, which will land as extra fields on the `Document` table.
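
To make the column mapping concrete, here is a small sketch that reads CSV text with the standard `csv` module, maps configurable text and title columns, and carries every other column along as an extra field. The `load_documents` helper and its output shape are assumptions for illustration, not the pipeline's actual loader.

```python
import csv
import io
from typing import Dict, List


def load_documents(csv_text: str, text_column: str = "text",
                   title_column: str = "title") -> List[Dict[str, str]]:
    """Parse CSV text into document dicts with a mandatory 'text' field."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if text_column not in (reader.fieldnames or []):
        raise ValueError(f"CSV input is missing the '{text_column}' column")
    documents = []
    for row in reader:
        doc = {"text": row[text_column]}
        if title_column in row:
            doc["title"] = row[title_column]
        # Remaining columns (timestamp, source, ...) ride along as extras.
        for name, value in row.items():
            if name not in (text_column, title_column):
                doc.setdefault(name, value)
        documents.append(doc)
    return documents
```

For example, a file whose body lives in a `content` column could be loaded with `load_documents(data, text_column="content")`.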
### Base LLM Settings
These are the primary settings for configuring LLM connectivity.
| Parameter | Required? | Description | Type | Default Value |
| --------------------------- | ------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- | ----- | ------------- |
| `GRAPHRAG_API_KEY` | **Yes for OpenAI. Optional for AOAI** | The API key. (Note: `OPENAI_API_KEY` is also used as a fallback). If not defined when using AOAI, managed identity will be used. | `str` | `None` |
| `GRAPHRAG_API_BASE` | **For AOAI** | The API Base URL | `str` | `None` |
| `GRAPHRAG_API_VERSION` | **For AOAI** | The AOAI API version. | `str` | `None` |
| `GRAPHRAG_API_ORGANIZATION` | | The AOAI organization. | `str` | `None` |
| `GRAPHRAG_API_PROXY` | | The AOAI proxy. | `str` | `None` |
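
The key fallback noted in the table can be sketched as a simple ordered lookup. The `resolve_api_key` helper is hypothetical; it only illustrates the order in which the two variables would be consulted.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return GRAPHRAG_API_KEY if set, else OPENAI_API_KEY, else None."""
    env = os.environ if env is None else env
    for name in ("GRAPHRAG_API_KEY", "OPENAI_API_KEY"):
        value = env.get(name)
        if value:
            return value
    # No key found: per the table, AOAI would use managed identity here.
    return None
```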
### Text Generation Settings
These settings control the text generation model used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.
| Parameter | Required? | Description | Type | Default Value |
| ------------------------------------------------- | ------------------------ | ------------------------------------------------------------------------------ | ------- | --------------------- |
| `GRAPHRAG_LLM_TYPE` | **For AOAI** | The LLM operation type. Either `openai_chat` or `azure_openai_chat` | `str` | `openai_chat` |
| `GRAPHRAG_LLM_DEPLOYMENT_NAME` | **For AOAI** | The AOAI model deployment name. | `str` | `None` |
| `GRAPHRAG_LLM_API_KEY` | Yes (uses fallback) | The API key. If not defined when using AOAI, managed identity will be used. | `str` | `None` |
| `GRAPHRAG_LLM_API_BASE` | For AOAI (uses fallback) | The API Base URL | `str` | `None` |
| `GRAPHRAG_LLM_API_VERSION` | For AOAI (uses fallback) | The AOAI API version. | `str` | `None` |
| `GRAPHRAG_LLM_API_ORGANIZATION` | For AOAI (uses fallback) | The AOAI organization. | `str` | `None` |
| `GRAPHRAG_LLM_API_PROXY` | | The AOAI proxy. | `str` | `None` |
| `GRAPHRAG_LLM_MODEL` | | The LLM model. | `str` | `gpt-4-turbo-preview` |
| `GRAPHRAG_LLM_MAX_TOKENS` | | The maximum number of tokens. | `int` | `4000` |
| `GRAPHRAG_LLM_REQUEST_TIMEOUT` | | The maximum number of seconds to wait for a response from the chat client. | `int` | `180` |
| `GRAPHRAG_LLM_MODEL_SUPPORTS_JSON` | | Indicates whether the given model supports JSON output mode. `True` to enable. | `str` | `None` |
| `GRAPHRAG_LLM_THREAD_COUNT` | | The number of threads to use for LLM parallelization. | `int` | 50 |
| `GRAPHRAG_LLM_THREAD_STAGGER` | | The time to wait (in seconds) between starting each thread. | `float` | 0.3 |
| `GRAPHRAG_LLM_CONCURRENT_REQUESTS` | | The number of concurrent requests to allow for the LLM client. | `int` | 25 |
| `GRAPHRAG_LLM_TOKENS_PER_MINUTE` | | The number of tokens per minute to allow for the LLM client. 0 = Bypass | `int` | 0 |
| `GRAPHRAG_LLM_REQUESTS_PER_MINUTE` | | The number of requests per minute to allow for the LLM client. 0 = Bypass | `int` | 0 |
| `GRAPHRAG_LLM_MAX_RETRIES` | | The maximum number of retries to attempt when a request fails. | `int` | 10 |
| `GRAPHRAG_LLM_MAX_RETRY_WAIT` | | The maximum number of seconds to wait between retries. | `int` | 10 |
| `GRAPHRAG_LLM_SLEEP_ON_RATE_LIMIT_RECOMMENDATION` | | Whether to sleep on rate limit recommendation. (Azure Only) | `bool` | `True` |
| `GRAPHRAG_LLM_TEMPERATURE` | | The temperature to use for generation. | `float` | 0 |
| `GRAPHRAG_LLM_TOP_P` | | The top_p to use for sampling. | `float` | 1 |
| `GRAPHRAG_LLM_N` | | The number of responses to generate. | `int` | 1 |
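
Since environment variables are always strings, values such as `GRAPHRAG_LLM_MAX_TOKENS` need casting, and settings marked "uses fallback" need a second lookup against the base LLM variable. The sketch below shows one way to do both; `get_setting`, `parse_bool`, and `llm_settings` are hypothetical helpers, and the accepted boolean spellings are an assumption.

```python
from typing import Any, Callable, Mapping, Optional


def parse_bool(raw: str) -> bool:
    # Accepted spellings are an assumption, not a documented rule.
    return raw.strip().lower() in {"1", "true", "yes"}


def get_setting(env: Mapping[str, str], name: str, fallback: Optional[str] = None,
                cast: Callable[[str], Any] = str, default: Any = None) -> Any:
    """Read name, then fallback, casting the first non-empty value found."""
    for key in (name, fallback):
        if key is not None and env.get(key):
            return cast(env[key])
    return default


def llm_settings(env: Mapping[str, str]) -> dict:
    """Collect a few text generation settings with the defaults from the table."""
    return {
        "api_key": get_setting(env, "GRAPHRAG_LLM_API_KEY", fallback="GRAPHRAG_API_KEY"),
        "model": get_setting(env, "GRAPHRAG_LLM_MODEL", default="gpt-4-turbo-preview"),
        "max_tokens": get_setting(env, "GRAPHRAG_LLM_MAX_TOKENS", cast=int, default=4000),
        "temperature": get_setting(env, "GRAPHRAG_LLM_TEMPERATURE", cast=float, default=0.0),
        "sleep_on_rate_limit": get_setting(
            env, "GRAPHRAG_LLM_SLEEP_ON_RATE_LIMIT_RECOMMENDATION",
            cast=parse_bool, default=True),
    }
```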
### Text Embedding Settings
These settings control the text embedding model used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.
| Parameter | Required? | Description | Type | Default Value |
| ------------------------------------------------------- | ------------------------ | -------------------------------------------------------------------------------------------------------------------------- | ------- | ------------------------ |
| `GRAPHRAG_EMBEDDING_TYPE` | **For AOAI** | The embedding client to use. Either `openai_embedding` or `azure_openai_embedding` | `str` | `openai_embedding` |
| `GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME` | **For AOAI** | The AOAI deployment name. | `str` | `None` |
| `GRAPHRAG_EMBEDDING_API_KEY` | Yes (uses fallback) | The API key to use for the embedding client. If not defined when using AOAI, managed identity will be used. | `str` | `None` |
| `GRAPHRAG_EMBEDDING_API_BASE` | For AOAI (uses fallback) | The API base URL. | `str` | `None` |
| `GRAPHRAG_EMBEDDING_API_VERSION` | For AOAI (uses fallback) | The AOAI API version to use for the embedding client. | `str` | `None` |
| `GRAPHRAG_EMBEDDING_API_ORGANIZATION` | For AOAI (uses fallback) | The AOAI organization to use for the embedding client. | `str` | `None` |
| `GRAPHRAG_EMBEDDING_API_PROXY` | | The AOAI proxy to use for the embedding client. | `str` | `None` |
| `GRAPHRAG_EMBEDDING_MODEL` | | The model to use for the embedding client. | `str` | `text-embedding-3-small` |
| `GRAPHRAG_EMBEDDING_BATCH_SIZE` | | The number of texts to embed at once. [(Azure limit is 16)](https://learn.microsoft.com/en-us/azure/ai-ce) | `int` | 16 |
| `GRAPHRAG_EMBEDDING_BATCH_MAX_TOKENS` | | The maximum tokens per batch [(Azure limit is 8191)](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference) | `int` | 8191 |
| `GRAPHRAG_EMBEDDING_TARGET` | | The target fields to embed. Either `required` or `all`. | `str` | `required` |
| `GRAPHRAG_EMBEDDING_THREAD_COUNT` | | The number of threads to use for parallelization for embeddings. | `int` | 50 |
| `GRAPHRAG_EMBEDDING_THREAD_STAGGER` | | The time to wait (in seconds) between starting each thread for embeddings. | `float` | 0.3 |
| `GRAPHRAG_EMBEDDING_CONCURRENT_REQUESTS` | | The number of concurrent requests to allow for the embedding client. | `int` | 25 |
| `GRAPHRAG_EMBEDDING_TOKENS_PER_MINUTE` | | The number of tokens per minute to allow for the embedding client. 0 = Bypass | `int` | 0 |
| `GRAPHRAG_EMBEDDING_REQUESTS_PER_MINUTE` | | The number of requests per minute to allow for the embedding client. 0 = Bypass | `int` | 0 |
| `GRAPHRAG_EMBEDDING_MAX_RETRIES` | | The maximum number of retries to attempt when a request fails. | `int` | 10 |
| `GRAPHRAG_EMBEDDING_MAX_RETRY_WAIT` | | The maximum number of seconds to wait between retries. | `int` | 10 |
| `GRAPHRAG_EMBEDDING_SLEEP_ON_RATE_LIMIT_RECOMMENDATION` | | Whether to sleep on rate limit recommendation. (Azure Only) | `bool` | `True` |
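
The two batch limits interact: a batch closes when it reaches `GRAPHRAG_EMBEDDING_BATCH_SIZE` texts or when adding the next text would exceed `GRAPHRAG_EMBEDDING_BATCH_MAX_TOKENS`. The sketch below shows that logic under stated assumptions: `batch_texts` is a hypothetical helper, and the whitespace word count is only a stand-in for a real tokenizer.

```python
from typing import Callable, Iterable, Iterator, List


def batch_texts(texts: Iterable[str], batch_size: int = 16,
                batch_max_tokens: int = 8191,
                count_tokens: Callable[[str], int] = lambda t: len(t.split()),
                ) -> Iterator[List[str]]:
    """Yield batches that respect both the item limit and the token limit."""
    batch: List[str] = []
    batch_tokens = 0
    for text in texts:
        n = count_tokens(text)
        # Close the current batch if this text would break either limit.
        if batch and (len(batch) >= batch_size or batch_tokens + n > batch_max_tokens):
            yield batch
            batch, batch_tokens = [], 0
        batch.append(text)
        batch_tokens += n
    if batch:
        yield batch
```

Note that a single text larger than the token limit still ends up in a batch of its own here; this sketch does not truncate or split it.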
### Input Settings
These settings control the data input used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.
#### Plaintext Input Data (`GRAPHRAG_INPUT_FILE_TYPE`=text)
| Parameter | Description | Type | Required or Optional | Default |
| ----------------------------- | --------------------------------------------------------------------------------- | ----- | -------------------- | ---------- |
| `GRAPHRAG_INPUT_FILE_PATTERN` | The file pattern regexp to use when reading input files from the input directory. | `str` | optional | `.*\.txt$` |
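
As a quick check of what the default pattern selects, the sketch below filters a list of relative paths with Python's `re` module. The `matching_files` helper is hypothetical, and matching from the start of the path with `re.match` is an assumption about how the pattern is applied.

```python
import re
from typing import Iterable, List


def matching_files(names: Iterable[str], pattern: str = r".*\.txt$") -> List[str]:
    """Return the names that match the file pattern regexp."""
    regex = re.compile(pattern)
    return [name for name in names if regex.match(name)]
```

With the default pattern, `notes.txt` and `sub/report.txt` are selected, while `data.csv` and `notes.txt.bak` are skipped.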
#### CSV Input Data (`GRAPHRAG_INPUT_FILE_TYPE`=csv)
| Parameter | Description | Type | Required or Optional | Default |
| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----- | -------------------- | ---------- |
| `GRAPHRAG_INPUT_TYPE` | The input storage type to use when reading files. (`file` or `blob`) | `str` | optional | `file` |
| `GRAPHRAG_INPUT_FILE_PATTERN` | The file pattern regexp to use when reading input files from the input directory. | `str` | optional | `.*\.csv$` |
| `GRAPHRAG_INPUT_TEXT_COLUMN` | The 'text' column to use when reading CSV input files. | `str` | optional | `text` |
| `GRAPHRAG_INPUT_METADATA` | A list of CSV columns, comma-separated, to incorporate as JSON in a metadata column. | `str` | optional | `None` |
| `GRAPHRAG_INPUT_TITLE_COLUMN` | The 'title' column to use when reading CSV input files. | `str` | optional | `title` |
| `GRAPHRAG_INPUT_STORAGE_ACCOUNT_BLOB_URL` | The Azure Storage blob endpoint to use when in `blob` mode and using managed identity. Will have the format `https://<storage_account_name>.blob.core.windows.net` | `str` | optional | `None` |
| `GRAPHRAG_INPUT_CONNECTION_STRING` | The connection string to use when reading CSV input files from Azure Blob Storage. | `str` | optional | `None` |
| `GRAPHRAG_INPUT_CONTAINER_NAME` | The container name to use when reading CSV input files from Azure Blob Storage. | `str` | optional | `None` |
| `GRAPHRAG_INPUT_BASE_DIR` | The base directory to read input files from. | `str` | optional | `None` |
### Data Mapping Settings
| Parameter | Description | Type | Required or Optional | Default |
| -------------------------- | -------------------------------------------------------- | ----- | -------------------- | ------- |
| `GRAPHRAG_INPUT_FILE_TYPE` | The type of input data, `csv` or `text` | `str` | optional | `text` |
| `GRAPHRAG_INPUT_ENCODING` | The encoding to apply when reading CSV/text input files. | `str` | optional | `utf-8` |
### Data Chunking
| Parameter | Description | Type | Required or Optional | Default |
| ------------------------------- | ------------------------------------------------------------------------------------------- | ----- | -------------------- | ----------------------------- |
| `GRAPHRAG_CHUNK_SIZE` | The chunk size in tokens for text-chunk analysis windows. | `int` | optional | 1200 |
| `GRAPHRAG_CHUNK_OVERLAP` | The chunk overlap in tokens for text-chunk analysis windows. | `int` | optional | 100 |
| `GRAPHRAG_CHUNK_BY_COLUMNS` | A comma-separated list of document attributes to group by when performing TextUnit chunking. | `str` | optional | `id` |
| `GRAPHRAG_CHUNK_ENCODING_MODEL` | The encoding model to use for chunking. | `str` | optional | The top-level encoding model. |
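
To show how chunk size and overlap relate, here is a sketch of a sliding window over a token list: each window holds up to `chunk_size` tokens and starts `chunk_size - overlap` tokens after the previous one. The `chunk_tokens` helper is hypothetical and assumes this sliding-window reading of the two settings; the pipeline's actual chunker may differ.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: Sequence[T], chunk_size: int = 1200,
                 overlap: int = 100) -> List[Sequence[T]]:
    """Split tokens into overlapping windows of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the defaults, a 2,500-token document yields three windows starting at tokens 0, 1100, and 2200.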
### Prompting Overrides
| Parameter | Description | Type | Required or Optional | Default |
| --------------------------------------------- | ------------------------------------------------------------------------------------------ | -------- | -------------------- | ---------------------------------------------------------------- |
| `GRAPHRAG_ENTITY_EXTRACTION_PROMPT_FILE` | The path (relative to the root) of an entity extraction prompt template text file. | `str` | optional | `None` |
| `GRAPHRAG_ENTITY_EXTRACTION_MAX_GLEANINGS` | The maximum number of redrives (gleanings) to invoke when extracting entities in a loop. | `int` | optional | 1 |
| `GRAPHRAG_ENTITY_EXTRACTION_ENTITY_TYPES` | A comma-separated list of entity types to extract. | `str` | optional | `organization,person,event,geo` |
| `GRAPHRAG_ENTITY_EXTRACTION_ENCODING_MODEL` | The encoding model to use for entity extraction. | `str` | optional | The top-level encoding model. |
| `GRAPHRAG_SUMMARIZE_DESCRIPTIONS_PROMPT_FILE` | The path (relative to the root) of a description summarization prompt template text file. | `str` | optional | `None` |
| `GRAPHRAG_SUMMARIZE_DESCRIPTIONS_MAX_LENGTH` | The maximum number of tokens to generate per description summarization. | `int` | optional | 500 |
| `GRAPHRAG_CLAIM_EXTRACTION_ENABLED` | Whether claim extraction is enabled for this pipeline. | `bool` | optional | `False` |
| `GRAPHRAG_CLAIM_EXTRACTION_DESCRIPTION` | The claim_description prompting argument to utilize. | `str` | optional | "Any claims or facts that could be relevant to threat analysis." |
| `GRAPHRAG_CLAIM_EXTRACTION_PROMPT_FILE` | The claim extraction prompt to utilize. | `str` | optional | `None` |
| `GRAPHRAG_CLAIM_EXTRACTION_MAX_GLEANINGS` | The maximum number of redrives (gleanings) to invoke when extracting claims in a loop. | `int` | optional | 1 |
| `GRAPHRAG_CLAIM_EXTRACTION_ENCODING_MODEL` | The encoding model to use for claim extraction. | `str` | optional | The top-level encoding model. |
| `GRAPHRAG_COMMUNITY_REPORTS_PROMPT_FILE` | The community reports extraction prompt to utilize. | `str` | optional | `None` |
| `GRAPHRAG_COMMUNITY_REPORTS_MAX_LENGTH` | The maximum number of tokens to generate per community reports. | `int` | optional | 1500 |
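
Several values above, such as `GRAPHRAG_ENTITY_EXTRACTION_ENTITY_TYPES`, are comma-separated strings that have to be split into a list before use. A minimal sketch of that step follows; the `parse_entity_types` helper and its whitespace trimming are assumptions for illustration.

```python
from typing import List, Optional, Sequence

DEFAULT_ENTITY_TYPES = ("organization", "person", "event", "geo")


def parse_entity_types(raw: Optional[str],
                       default: Sequence[str] = DEFAULT_ENTITY_TYPES) -> List[str]:
    """Split a comma-separated value, trimming blanks; fall back to the default."""
    if raw is None or not raw.strip():
        return list(default)
    return [item.strip() for item in raw.split(",") if item.strip()]
```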
### Storage
This section controls the storage mechanism the pipeline uses to export output tables.
| Parameter | Description | Type | Required or Optional | Default |
| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----- | -------------------- | ------- |
| `GRAPHRAG_STORAGE_TYPE` | The type of storage to use. Options are `file`, `memory`, or `blob` | `str` | optional | `file` |
| `GRAPHRAG_STORAGE_STORAGE_ACCOUNT_BLOB_URL` | The Azure Storage blob endpoint to use when in `blob` mode and using managed identity. Will have the format `https://<storage_account_name>.blob.core.windows.net` | `str` | optional | None |
| `GRAPHRAG_STORAGE_CONNECTION_STRING` | The Azure Storage connection string to use when in `blob` mode. | `str` | optional | None |
| `GRAPHRAG_STORAGE_CONTAINER_NAME` | The Azure Storage container name to use when in `blob` mode. | `str` | optional | None |
| `GRAPHRAG_STORAGE_BASE_DIR` | The base path to data outputs. | `str` | optional | None |
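
In `blob` mode the table offers two ways to reach Azure Storage: a connection string, or a blob endpoint URL combined with managed identity. The sketch below gathers the storage variables and checks that combination. The `storage_config` helper is hypothetical, and treating a container name plus one of the two credentials as mandatory is an assumption, not documented behavior.

```python
from typing import Mapping, Optional


def storage_config(env: Mapping[str, str], prefix: str = "GRAPHRAG_STORAGE") -> dict:
    """Collect <prefix>_* variables and sanity-check blob mode."""
    kind = env.get(f"{prefix}_TYPE", "file")
    config: dict = {"type": kind, "base_dir": env.get(f"{prefix}_BASE_DIR")}
    if kind == "blob":
        connection_string: Optional[str] = env.get(f"{prefix}_CONNECTION_STRING")
        blob_url: Optional[str] = env.get(f"{prefix}_STORAGE_ACCOUNT_BLOB_URL")
        container: Optional[str] = env.get(f"{prefix}_CONTAINER_NAME")
        if not container:
            raise ValueError(f"{prefix}_CONTAINER_NAME is needed in blob mode")
        if not (connection_string or blob_url):
            raise ValueError(
                f"set {prefix}_CONNECTION_STRING or {prefix}_STORAGE_ACCOUNT_BLOB_URL")
        config.update(connection_string=connection_string,
                      storage_account_blob_url=blob_url,
                      container_name=container)
    return config
```

Because the variable names share one shape, passing `prefix="GRAPHRAG_CACHE"` or `prefix="GRAPHRAG_REPORTING"` applies the same check to those groups.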
### Cache
This section controls the cache mechanism used by the pipeline. This is used to cache LLM invocation results.
| Parameter | Description | Type | Required or Optional | Default |
| ----------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----- | -------------------- | ------- |
| `GRAPHRAG_CACHE_TYPE` | The type of cache to use. Options are `file`, `memory`, `none` or `blob` | `str` | optional | `file` |
| `GRAPHRAG_CACHE_STORAGE_ACCOUNT_BLOB_URL` | The Azure Storage blob endpoint to use when in `blob` mode and using managed identity. Will have the format `https://<storage_account_name>.blob.core.windows.net` | `str` | optional | None |
| `GRAPHRAG_CACHE_CONNECTION_STRING` | The Azure Storage connection string to use when in `blob` mode. | `str` | optional | None |
| `GRAPHRAG_CACHE_CONTAINER_NAME` | The Azure Storage container name to use when in `blob` mode. | `str` | optional | None |
| `GRAPHRAG_CACHE_BASE_DIR` | The base path to the cache files. | `str` | optional | None |
### Reporting
This section controls the reporting mechanism used by the pipeline, for common events and error messages. The default is to write reports to a file in the output directory. However, you can also choose to write reports to the console or to an Azure Blob Storage container.
| Parameter | Description | Type | Required or Optional | Default |
| --------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----- | -------------------- | ------- |
| `GRAPHRAG_REPORTING_TYPE` | The type of reporter to use. Options are `file`, `console`, or `blob` | `str` | optional | `file` |
| `GRAPHRAG_REPORTING_STORAGE_ACCOUNT_BLOB_URL` | The Azure Storage blob endpoint to use when in `blob` mode and using managed identity. Will have the format `https://<storage_account_name>.blob.core.windows.net` | `str` | optional | None |
| `GRAPHRAG_REPORTING_CONNECTION_STRING` | The Azure Storage connection string to use when in `blob` mode. | `str` | optional | None |
| `GRAPHRAG_REPORTING_CONTAINER_NAME` | The Azure Storage container name to use when in `blob` mode. | `str` | optional | None |
| `GRAPHRAG_REPORTING_BASE_DIR` | The base path to the reporting outputs. | `str` | optional | None |
### Node2Vec Parameters
| Parameter | Description | Type | Required or Optional | Default |
| ------------------------------- | ---------------------------------------- | ------ | -------------------- | ------- |
| `GRAPHRAG_NODE2VEC_ENABLED` | Whether to enable Node2Vec | `bool` | optional | False |
| `GRAPHRAG_NODE2VEC_NUM_WALKS` | The Node2Vec number of walks to perform | `int` | optional | 10 |
| `GRAPHRAG_NODE2VEC_WALK_LENGTH` | The Node2Vec walk length | `int` | optional | 40 |
| `GRAPHRAG_NODE2VEC_WINDOW_SIZE` | The Node2Vec window size | `int` | optional | 2 |
| `GRAPHRAG_NODE2VEC_ITERATIONS` | The number of iterations to run node2vec | `int` | optional | 3 |
| `GRAPHRAG_NODE2VEC_RANDOM_SEED` | The random seed to use for node2vec | `int` | optional | 597832 |
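
To give a feel for what the walk parameters control, the sketch below generates seeded random walks over a small adjacency map: `num_walks` walks start from every node, each at most `walk_length` nodes long. This is a simplified illustration with uniform neighbor choice; it omits Node2Vec's biased transition probabilities and the embedding training step, and the `random_walks` helper is hypothetical.

```python
import random
from typing import Dict, List, Sequence


def random_walks(adjacency: Dict[str, Sequence[str]], num_walks: int = 10,
                 walk_length: int = 40, seed: int = 597832) -> List[List[str]]:
    """Generate num_walks uniform random walks from each node."""
    rng = random.Random(seed)  # a fixed seed makes the walks reproducible
    walks = []
    for _ in range(num_walks):
        for start in sorted(adjacency):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency.get(walk[-1], [])
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks
```

The window size and iteration count would only come into play later, when the walks are fed to a skip-gram style model.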
### Data Snapshotting
| Parameter | Description | Type | Required or Optional | Default |
| -------------------------------------- | ----------------------------------------------- | ------ | -------------------- | ------- |
| `GRAPHRAG_SNAPSHOT_EMBEDDINGS` | Whether to enable embeddings snapshots. | `bool` | optional | False |
| `GRAPHRAG_SNAPSHOT_GRAPHML` | Whether to enable GraphML snapshots. | `bool` | optional | False |
| `GRAPHRAG_SNAPSHOT_RAW_ENTITIES` | Whether to enable raw entity snapshots. | `bool` | optional | False |
| `GRAPHRAG_SNAPSHOT_TOP_LEVEL_NODES` | Whether to enable top-level node snapshots. | `bool` | optional | False |
| `GRAPHRAG_SNAPSHOT_TRANSIENT` | Whether to enable transient table snapshots. | `bool` | optional | False |
### Miscellaneous Settings
| Parameter | Description | Type | Required or Optional | Default |
| --------------------------- | --------------------------------------------------------------------- | ------ | -------------------- | ------------- |
| `GRAPHRAG_ASYNC_MODE` | Which async mode to use. Either `asyncio` or `threaded`. | `str` | optional | `asyncio` |
| `GRAPHRAG_ENCODING_MODEL` | The text encoding model, used in tiktoken, to encode text. | `str` | optional | `cl100k_base` |
| `GRAPHRAG_MAX_CLUSTER_SIZE` | The maximum number of entities to include in a single Leiden cluster. | `int` | optional | 10 |
| `GRAPHRAG_UMAP_ENABLED` | Whether to enable UMAP layouts | `bool` | optional | False |