datahub/docs/deploy/environment-vars.md
david-leifker 3f4ab44a91
feat(kafka): increase kafka message size and enable compression (#9038)
Co-authored-by: Pedro Silva <pedro@acryl.io>
2023-10-29 16:26:05 -05:00

18 KiB

title
Deployment Environment Variables

Environment Variables

The following is a summary of a few important environment variables which expose various levers which control how DataHub works.

Feature Flags

Variable Default Unit/Type Components Description
UI_INGESTION_ENABLED true boolean [GMS, MCE Consumer] Enable UI based ingestion.
DATAHUB_ANALYTICS_ENABLED true boolean [Frontend, GMS] Collect DataHub usage to populate the analytics dashboard.
BOOTSTRAP_SYSTEM_UPDATE_WAIT_FOR_SYSTEM_UPDATE true boolean [GMS, MCE Consumer, MAE Consumer] Do not wait for the system-update to complete before starting. This should typically only be disabled during development.

Ingestion

Variable Default Unit/Type Components Description
ASYNC_INGEST_DEFAULT false boolean [GMS] Asynchronously process ingestProposals by writing the ingestion MCP to Kafka. Typically enabled with standalone consumers.
MCP_CONSUMER_ENABLED true boolean [GMS, MCE Consumer] When running in standalone mode, disabled on GMS and enabled on separate MCE Consumer.
MCL_CONSUMER_ENABLED true boolean [GMS, MAE Consumer] When running in standalone mode, disabled on GMS and enabled on separate MAE Consumer.
PE_CONSUMER_ENABLED true boolean [GMS, MAE Consumer] When running in standalone mode, disabled on GMS and enabled on separate MAE Consumer.
ES_BULK_REQUESTS_LIMIT 1000 docs [GMS, MAE Consumer] Number of bulk documents to index. MAE Consumer if standalone.
ES_BULK_FLUSH_PERIOD 1 seconds [GMS, MAE Consumer] How frequently indexed documents are made available for query.
ALWAYS_EMIT_CHANGE_LOG false boolean [GMS] Enables always emitting a MCL even when no changes are detected. Used for Time Based Lineage when no changes occur.
GRAPH_SERVICE_DIFF_MODE_ENABLED true boolean [GMS] Enables diff mode for graph writes, uses a different code path that produces a diff from previous to next to write relationships instead of wholesale deleting edges and reading.

Caching

Variable Default Unit/Type Components Description
SEARCH_SERVICE_ENABLE_CACHE false boolean [GMS] Enable caching of search results.
SEARCH_SERVICE_CACHE_IMPLEMENTATION caffeine string [GMS] Set to hazelcast if the number of GMS replicas > 1 for enabling distributed cache.
CACHE_TTL_SECONDS 600 seconds [GMS] Default cache time to live.
CACHE_MAX_SIZE 10000 objects [GMS] Maximum number of items to cache.
LINEAGE_SEARCH_CACHE_ENABLED true boolean [GMS] Enables in-memory cache for searchAcrossLineage query.
CACHE_ENTITY_COUNTS_TTL_SECONDS 600 seconds [GMS] Homepage entity count time to live.
CACHE_SEARCH_LINEAGE_TTL_SECONDS 86400 seconds [GMS] Search lineage cache time to live.
CACHE_SEARCH_LINEAGE_LIGHTNING_THRESHOLD 300 objects [GMS] Lineage graphs exceeding this limit will use a local cache.
Variable Default Unit/Type Components Description
INDEX_PREFIX `` string [GMS, MAE Consumer, Elasticsearch Setup, System Update] Prefix Elasticsearch indices with the given string.
ELASTICSEARCH_NUM_SHARDS_PER_INDEX 1 integer [System Update] Default number of shards per Elasticsearch index.
ELASTICSEARCH_NUM_REPLICAS_PER_INDEX 1 integer [System Update] Default number of replica per Elasticsearch index.
ELASTICSEARCH_BUILD_INDICES_RETENTION_VALUE 60 integer [System Update] Number of units for the retention of Elasticsearch clone/backup indices.
ELASTICSEARCH_BUILD_INDICES_RETENTION_UNIT DAYS string [System Update] Unit for the retention of Elasticsearch clone/backup indices.
ELASTICSEARCH_QUERY_EXACT_MATCH_EXCLUSIVE false boolean [GMS] Only return exact matches when using quotes.
ELASTICSEARCH_QUERY_EXACT_MATCH_WITH_PREFIX true boolean [GMS] Include prefix match in exact match results.
ELASTICSEARCH_QUERY_EXACT_MATCH_FACTOR 10.0 float [GMS] Multiply by this number on true exact match.
ELASTICSEARCH_QUERY_EXACT_MATCH_PREFIX_FACTOR 1.6 float [GMS] Multiply by this number when prefix match.
ELASTICSEARCH_QUERY_EXACT_MATCH_CASE_FACTOR 0.7 float [GMS] Multiply by this number when case insensitive match.
ELASTICSEARCH_QUERY_EXACT_MATCH_ENABLE_STRUCTURED true boolean [GMS] When using structured query, also include exact matches.
ELASTICSEARCH_QUERY_PARTIAL_URN_FACTOR 0.5 float [GMS] Multiply by this number when partial token match on URN)
ELASTICSEARCH_QUERY_PARTIAL_FACTOR 0.4 float [GMS] Multiply by this number when partial token match on non-URN field.
ELASTICSEARCH_QUERY_CUSTOM_CONFIG_ENABLED false boolean [GMS] Enable search query and ranking customization configuration.
ELASTICSEARCH_QUERY_CUSTOM_CONFIG_FILE search_config.yml string [GMS] The location of the search customization configuration.

Kafka

In general, there are lots of Kafka configuration environment variables for both the producer and consumers defined in the official Spring Kafka documentation here. These environment variables follow the standard Spring representation of properties as environment variables. Simply replace the dot, ., with an underscore, _, and convert to uppercase.

Variable Default Unit/Type Components Description
KAFKA_LISTENER_CONCURRENCY 1 integer [GMS, MCE Consumer, MAE Consumer] Number of Kafka consumer threads. Optimize throughput by matching to topic partitions.
SPRING_KAFKA_PRODUCER_PROPERTIES_MAX_REQUEST_SIZE 1048576 bytes [GMS, MCE Consumer, MAE Consumer] Max produced message size. Note that the topic configuration is not controlled by this variable.
SCHEMA_REGISTRY_TYPE INTERNAL string [GMS, MCE Consumer, MAE Consumer] Schema registry implementation. One of INTERNAL or KAFKA or AWS_GLUE
KAFKA_SCHEMAREGISTRY_URL http://localhost:8080/schema-registry/api/ string [GMS, MCE Consumer, MAE Consumer] Schema registry url. Used for INTERNAL and KAFKA. The default value is for the GMS component. The MCE Consumer and MAE Consumer should be the GMS hostname and port.
AWS_GLUE_SCHEMA_REGISTRY_REGION us-east-1 string [GMS, MCE Consumer, MAE Consumer] If using AWS_GLUE in the SCHEMA_REGISTRY_TYPE variable for the schema registry implementation.
AWS_GLUE_SCHEMA_REGISTRY_NAME `` string [GMS, MCE Consumer, MAE Consumer] If using AWS_GLUE in the SCHEMA_REGISTRY_TYPE variable for the schema registry.
USE_CONFLUENT_SCHEMA_REGISTRY true boolean [kafka-setup] Enable Confluent schema registry configuration.
KAFKA_PRODUCER_MAX_REQUEST_SIZE 5242880 integer [Frontend, GMS, MCE Consumer, MAE Consumer] Max produced message size. Note that the topic configuration is not controlled by this variable.
KAFKA_CONSUMER_MAX_PARTITION_FETCH_BYTES 5242880 integer [GMS, MCE Consumer, MAE Consumer] The maximum amount of data per-partition the server will return. Records are fetched in batches by the consumer. If the first record batch in the first non-empty partition of the fetch is larger than this limit, the batch will still be returned to ensure that the consumer can make progress.
MAX_MESSAGE_BYTES 5242880 integer [kafka-setup] Sets the max message size on the kakfa topics.
KAFKA_PRODUCER_COMPRESSION_TYPE snappy string [Frontend, GMS, MCE Consumer, MAE Consumer] The compression used by the producer.

Frontend

Variable Default Unit/Type Components Description
AUTH_VERBOSE_LOGGING false boolean [Frontend] Enable verbose authentication logging. Enabling this will leak sensisitve information in the logs. Disable when finished debugging.
AUTH_OIDC_GROUPS_CLAIM groups string [Frontend] Claim to use as the user's group.
AUTH_OIDC_EXTRACT_GROUPS_ENABLED false boolean [Frontend] Auto-provision the group from the user's group claim.
AUTH_SESSION_TTL_HOURS 24 string [Frontend] The number of hours a user session is valid. After this many hours the actor cookie will be expired by the browser and the user will be prompted to login again.
MAX_SESSION_TOKEN_AGE 24h string [Frontend] The maximum age of the session token. User session tokens are stateless and will become invalid after this time requiring a user to login again.