<p>Default configuration mode is driven by a <code>config.json</code> or <code>config.yml</code> file in the data project root. If a <code>.env</code> file is present alongside this config file, it is loaded, and the environment variables it defines are available for token replacement in your configuration document using <code>${ENV_VAR}</code> syntax.</p>
<p>For example:</p>
<pre><code># .env
API_KEY=some_api_key

# config.json
{
  "llm": {
    "api_key": "${API_KEY}"
  }
}
</code></pre>
<h1>Config Sections</h1>
<h2>input</h2>
<h3>Fields</h3>
<ul>
<li><code>type</code><strong>text|csv</strong> - The type of input data to load. Either <code>text</code> or <code>csv</code>. Default is <code>csv</code>.</li>
<li><code>file_encoding</code><strong>str</strong> - The encoding of the input file. Default is <code>utf-8</code></li>
<li><code>file_pattern</code><strong>str</strong> - A glob pattern to match input files. Default is <code>**/*.csv</code> if in csv mode and <code>**/*.txt</code> if in text mode.</li>
<li><code>source_column</code><strong>str</strong> - (CSV Mode Only) The source column name.</li>
<li><code>timestamp_column</code><strong>str</strong> - (CSV Mode Only) The timestamp column name.</li>
<li><code>timestamp_format</code><strong>str</strong> - (CSV Mode Only) The timestamp format.</li>
<li><code>text_column</code><strong>str</strong> - (CSV Mode Only) The text column name.</li>
<li><code>title_column</code><strong>str</strong> - (CSV Mode Only) The title column name.</li>
<li><code>document_attribute_columns</code><strong>list[str]</strong> - (CSV Mode Only) The additional document attributes to include.</li>
<li><code>storage_type</code><strong>file|blob</strong> - The input storage type to use. Default=<code>file</code></li>
<li><code>connection_string</code><strong>str</strong> - (blob only) The Azure Storage connection string.</li>
<li><code>container_name</code><strong>str</strong> - (blob only) The Azure Storage container name.</li>
<li><code>base_dir</code><strong>str</strong> - The base directory to read input from, relative to the root.</li>
</ul>
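<p>As a rough sketch, a CSV input section might look like the following. The column names and the <code>input</code> directory are hypothetical placeholders, not defaults; substitute the headers and paths of your own data.</p>
<pre><code># config.json (illustrative values)
{
  "input": {
    "type": "csv",
    "file_encoding": "utf-8",
    "file_pattern": "**/*.csv",
    "storage_type": "file",
    "base_dir": "input",
    "text_column": "body",
    "title_column": "headline",
    "document_attribute_columns": ["author", "category"]
  }
}
</code></pre>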
<h2>llm</h2>
<p>This is the base LLM configuration section. Other steps may override this configuration with their own LLM configuration.</p>
<h3>Fields</h3>
<ul>
<li><code>api_key</code><strong>str</strong> - The OpenAI API key to use.</li>
<li><code>type</code><strong>openai_chat|azure_openai_chat|openai_embedding|azure_openai_embedding</strong> - The type of LLM to use.</li>
<li><code>model</code><strong>str</strong> - The model name.</li>
<li><code>max_tokens</code><strong>int</strong> - The maximum number of output tokens.</li>
<li><code>request_timeout</code><strong>float</strong> - The per-request timeout.</li>
<li><code>api_base</code><strong>str</strong> - The API base url to use.</li>
<li><code>api_version</code><strong>str</strong> - The API version.</li>
<li><code>organization</code><strong>str</strong> - The client organization.</li>
<li><code>proxy</code><strong>str</strong> - The proxy URL to use.</li>
<li><code>deployment_name</code><strong>str</strong> - The deployment name to use (Azure).</li>
<li><code>model_supports_json</code><strong>bool</strong> - Whether the model supports JSON-mode output.</li>
<li><code>tokens_per_minute</code><strong>int</strong> - Set a leaky-bucket throttle on tokens-per-minute.</li>
<li><code>requests_per_minute</code><strong>int</strong> - Set a leaky-bucket throttle on requests-per-minute.</li>
<li><code>max_retries</code><strong>int</strong> - The maximum number of retries to use.</li>
<li><code>max_retry_wait</code><strong>float</strong> - The maximum backoff time.</li>
<li><code>sleep_on_rate_limit_recommendation</code><strong>bool</strong> - Whether to adhere to sleep recommendations (Azure).</li>
<li><code>concurrent_requests</code><strong>int</strong> - The number of open requests to allow at once.</li>
</ul>
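<p>For illustration, an Azure-flavored base LLM section could combine these fields with the <code>${ENV_VAR}</code> token replacement described earlier. Every environment variable name other than <code>API_KEY</code> is a hypothetical choice, and the numeric values are placeholders rather than recommended settings.</p>
<pre><code># .env (hypothetical variable names)
API_KEY=some_api_key
API_BASE=https://example.openai.azure.com
API_VERSION=some_api_version
DEPLOYMENT_NAME=some_deployment

# config.json (illustrative values)
{
  "llm": {
    "type": "azure_openai_chat",
    "api_key": "${API_KEY}",
    "api_base": "${API_BASE}",
    "api_version": "${API_VERSION}",
    "deployment_name": "${DEPLOYMENT_NAME}",
    "model": "gpt-4",
    "model_supports_json": true,
    "max_tokens": 2000,
    "tokens_per_minute": 50000,
    "requests_per_minute": 500,
    "max_retries": 5,
    "sleep_on_rate_limit_recommendation": true,
    "concurrent_requests": 10
  }
}
</code></pre>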
<h2>parallelization</h2>
<h3>Fields</h3>
<ul>
<li><code>stagger</code><strong>float</strong> - The threading stagger value.</li>
<li><code>num_threads</code><strong>int</strong> - The maximum number of work threads.</li>
</ul>
<h2>async_mode</h2>
<p><strong>asyncio|threaded</strong> - The async mode to use. Either <code>asyncio</code> or <code>threaded</code>.</p>
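<p>A minimal sketch of the two top-level concurrency settings side by side. The numbers are placeholders chosen for the example, not documented defaults.</p>
<pre><code># config.json (illustrative values)
{
  "parallelization": {
    "stagger": 0.3,
    "num_threads": 8
  },
  "async_mode": "threaded"
}
</code></pre>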
<h2>embeddings</h2>
<h3>Fields</h3>
<ul>
<li><code>llm</code> (see LLM top-level config)</li>
<li><code>parallelization</code> (see Parallelization top-level config)</li>
<li><code>async_mode</code> (see Async Mode top-level config)</li>
<li><code>batch_size</code><strong>int</strong> - The maximum batch size to use.</li>
<li><code>batch_max_tokens</code><strong>int</strong> - The maximum number of tokens per batch.</li>
<li><code>target</code><strong>required|all</strong> - Determines which set of embeddings to emit.</li>
<li><code>skip</code><strong>list[str]</strong> - Which embeddings to skip.</li>
<li><code>strategy</code><strong>dict</strong> - Fully override the text-embedding strategy.</li>
</ul>
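<p>The sketch below shows one way an embeddings section might override the base LLM with an embedding model. It assumes the nested <code>llm</code> block accepts the same fields as the top-level <code>llm</code> section, as the cross-reference above suggests; the model name and batch limits are illustrative.</p>
<pre><code># config.json (illustrative values)
{
  "embeddings": {
    "llm": {
      "type": "openai_embedding",
      "api_key": "${API_KEY}",
      "model": "text-embedding-3-small"
    },
    "async_mode": "threaded",
    "batch_size": 16,
    "batch_max_tokens": 8000,
    "target": "required"
  }
}
</code></pre>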
<h2>chunks</h2>
<h3>Fields</h3>
<ul>
<li><code>size</code><strong>int</strong> - The max chunk size in tokens.</li>
<li><code>overlap</code><strong>int</strong> - The chunk overlap in tokens.</li>
<li><code>group_by_columns</code><strong>list[str]</strong> - Group documents by these fields before chunking.</li>
<li><code>strategy</code><strong>dict</strong> - Fully override the chunking strategy.</li>
</ul>
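<p>A possible chunking section is sketched below. The sizes are placeholders and the <code>id</code> column is a hypothetical field name. With these values, each chunk would hold at most 300 tokens, and consecutive chunks would share up to 100 tokens.</p>
<pre><code># config.json (illustrative values)
{
  "chunks": {
    "size": 300,
    "overlap": 100,
    "group_by_columns": ["id"]
  }
}
</code></pre>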
<h2>cache</h2>
<h3>Fields</h3>
<ul>
<li><code>type</code><strong>file|memory|none|blob</strong> - The cache type to use. Default=<code>file</code></li>
<li><code>connection_string</code><strong>str</strong> - (blob only) The Azure Storage connection string.</li>
<li><code>container_name</code><strong>str</strong> - (blob only) The Azure Storage container name.</li>
<li><code>base_dir</code><strong>str</strong> - The base directory to write cache to, relative to the root.</li>
</ul>
<h2>storage</h2>
<h3>Fields</h3>
<ul>
<li><code>type</code><strong>file|memory|blob</strong> - The storage type to use. Default=<code>file</code></li>
<li><code>connection_string</code><strong>str</strong> - (blob only) The Azure Storage connection string.</li>
<li><code>container_name</code><strong>str</strong> - (blob only) The Azure Storage container name.</li>
<li><code>base_dir</code><strong>str</strong> - The base directory to write output artifacts to, relative to the root.</li>
</ul>
<h2>reporting</h2>
<h3>Fields</h3>
<ul>
<li><code>type</code><strong>file|console|blob</strong> - The reporting type to use. Default=<code>file</code></li>
<li><code>connection_string</code><strong>str</strong> - (blob only) The Azure Storage connection string.</li>
<li><code>container_name</code><strong>str</strong> - (blob only) The Azure Storage container name.</li>
<li><code>base_dir</code><strong>str</strong> - The base directory to write reports to, relative to the root.</li>
</ul>
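<p>The <code>cache</code>, <code>storage</code>, and <code>reporting</code> sections share the same shape, so one sketch can cover all three. Here the cache lives in Azure Blob Storage, output goes to local files, and reporting goes to the console. The <code>CACHE_CONNECTION_STRING</code> variable, the container name, and the directory name are hypothetical.</p>
<pre><code># config.json (illustrative values)
{
  "cache": {
    "type": "blob",
    "connection_string": "${CACHE_CONNECTION_STRING}",
    "container_name": "my-cache-container",
    "base_dir": "cache"
  },
  "storage": {
    "type": "file",
    "base_dir": "output"
  },
  "reporting": {
    "type": "console"
  }
}
</code></pre>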
<h2>entity_extraction</h2>
<h3>Fields</h3>
<ul>
<li><code>llm</code> (see LLM top-level config)</li>
<li><code>parallelization</code> (see Parallelization top-level config)</li>
<li><code>async_mode</code> (see Async Mode top-level config)</li>
<li><code>prompt</code><strong>str</strong> - The prompt file to use.</li>
<li><code>entity_types</code><strong>list[str]</strong> - The entity types to identify.</li>
<li><code>max_gleanings</code><strong>int</strong> - The maximum number of gleaning cycles to use.</li>
<li><code>strategy</code><strong>dict</strong> - Fully override the entity extraction strategy.</li>
</ul>
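<p>As an illustration of a per-step override, the entity extraction section below swaps in its own LLM settings while leaving the base <code>llm</code> section untouched for other steps. The prompt path and entity types are hypothetical examples, and the nested <code>llm</code> block is assumed to take the same fields as the top-level one.</p>
<pre><code># config.json (illustrative values)
{
  "entity_extraction": {
    "llm": {
      "type": "openai_chat",
      "api_key": "${API_KEY}",
      "model": "gpt-4"
    },
    "prompt": "prompts/entity_extraction.txt",
    "entity_types": ["organization", "person", "location", "event"],
    "max_gleanings": 1
  }
}
</code></pre>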
<h2>summarize_descriptions</h2>
<h3>Fields</h3>
<ul>
<li><code>llm</code> (see LLM top-level config)</li>
<li><code>parallelization</code> (see Parallelization top-level config)</li>
<li><code>async_mode</code> (see Async Mode top-level config)</li>
<li><code>prompt</code><strong>str</strong> - The prompt file to use.</li>
<li><code>max_length</code><strong>int</strong> - The maximum number of output tokens per summarization.</li>
<li><code>strategy</code><strong>dict</strong> - Fully override the summarize description strategy.</li>
</ul>
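<p>A short sketch of a description summarization section, relying on the base LLM configuration. The prompt path is hypothetical and the length is a placeholder.</p>
<pre><code># config.json (illustrative values)
{
  "summarize_descriptions": {
    "prompt": "prompts/summarize_descriptions.txt",
    "max_length": 500
  }
}
</code></pre>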
<h2>claim_extraction</h2>
<h3>Fields</h3>
<ul>
<li><code>llm</code> (see LLM top-level config)</li>
<li><code>parallelization</code> (see Parallelization top-level config)</li>
<li><code>async_mode</code> (see Async Mode top-level config)</li>
<li><code>prompt</code><strong>str</strong> - The prompt file to use.</li>
<li><code>description</code><strong>str</strong> - Describes the types of claims we want to extract.</li>
<li><code>max_gleanings</code><strong>int</strong> - The maximum number of gleaning cycles to use.</li>
<li><code>strategy</code><strong>dict</strong> - Fully override the claim extraction strategy.</li>
</ul>
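<p>For claim extraction, the <code>description</code> field is free text that steers which claims are pulled out. The wording and the prompt path below are invented for the example.</p>
<pre><code># config.json (illustrative values)
{
  "claim_extraction": {
    "prompt": "prompts/claim_extraction.txt",
    "description": "Any claims or facts that could be relevant to a compliance review.",
    "max_gleanings": 1
  }
}
</code></pre>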
<h2>community_reports</h2>
<h3>Fields</h3>
<ul>
<li><code>llm</code> (see LLM top-level config)</li>
<li><code>parallelization</code> (see Parallelization top-level config)</li>
<li><code>async_mode</code> (see Async Mode top-level config)</li>
<li><code>prompt</code><strong>str</strong> - The prompt file to use.</li>
<li><code>max_length</code><strong>int</strong> - The maximum number of output tokens per report.</li>
<li><code>max_input_length</code><strong>int</strong> - The maximum number of input tokens to use when generating reports.</li>
<li><code>strategy</code><strong>dict</strong> - Fully override the community reports strategy.</li>
</ul>
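<p>A community reports section might cap both sides of the LLM call, as sketched here. The token limits are placeholders and the prompt path is hypothetical.</p>
<pre><code># config.json (illustrative values)
{
  "community_reports": {
    "prompt": "prompts/community_report.txt",
    "max_length": 2000,
    "max_input_length": 8000
  }
}
</code></pre>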
<h2>cluster_graph</h2>
<h3>Fields</h3>
<ul>
<li><code>max_cluster_size</code><strong>int</strong> - The maximum cluster size to emit.</li>
<li><code>strategy</code><strong>dict</strong> - Fully override the cluster_graph strategy.</li>
</ul>
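<p>A minimal clustering section, with a placeholder size limit:</p>
<pre><code># config.json (illustrative value)
{
  "cluster_graph": {
    "max_cluster_size": 10
  }
}
</code></pre>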
<h2>embed_graph</h2>
<h3>Fields</h3>
<ul>
<li><code>is_enabled</code><strong>bool</strong> - Whether to enable graph embeddings.</li>
<li><code>num_walks</code><strong>int</strong> - The node2vec number of walks.</li>
<li><code>walk_length</code><strong>int</strong> - The node2vec walk length.</li>
<li><code>window_size</code><strong>int</strong> - The node2vec window size.</li>
<li><code>iterations</code><strong>int</strong> - The node2vec number of iterations.</li>
<li><code>random_seed</code><strong>int</strong> - The node2vec random seed.</li>
<li><code>strategy</code><strong>dict</strong> - Fully override the embed graph strategy.</li>
</ul>
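<p>The sketch below turns on graph embeddings and sets each node2vec parameter explicitly. All numbers are illustrative rather than documented defaults. Fixing <code>random_seed</code> is a common way to make runs repeatable.</p>
<pre><code># config.json (illustrative values)
{
  "embed_graph": {
    "is_enabled": true,
    "num_walks": 10,
    "walk_length": 40,
    "window_size": 2,
    "iterations": 3,
    "random_seed": 42
  }
}
</code></pre>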
<h2>umap</h2>
<h3>Fields</h3>
<ul>
<li><code>enabled</code><strong>bool</strong> - Whether to enable UMAP layouts.</li>