
DataHub Upgrade Docker Image

This container is used to automatically apply upgrades from one version of DataHub to another.

Supported Upgrades

As of today, there are 4 supported upgrades:

  1. NoCodeDataMigration: Performs a series of pre-flight qualification checks and then migrates data from the metadata_aspect table to the metadata_aspect_v2 table. Arguments:

    • batchSize (Optional): The number of rows to migrate at a time. Defaults to 1000.
    • batchDelayMs (Optional): The number of milliseconds of delay between migrated batches. Used for rate limiting. Defaults to 250.
    • dbType (Optional): The target DB type. Valid values are MYSQL, MARIA, POSTGRES. Defaults to MYSQL.
  2. NoCodeDataMigrationCleanup: Cleanses the graph index, search index, and key-value store of legacy DataHub data (the metadata_aspect table) once the No Code Data Migration has completed successfully. No arguments.

  3. RestoreIndices: Restores indices by fetching the latest version of each aspect and producing MAEs (Metadata Audit Events).

  4. RestoreBackup: Restores the storage stack from a backup of the local database.
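To make the batchSize and batchDelayMs semantics concrete, here is a minimal Python sketch of the batching behavior described above. This is an illustration only, not DataHub's actual migration code; migrate_in_batches is a hypothetical helper.

```python
import time

# Illustrative sketch only (NOT DataHub's implementation): migrate rows in
# chunks of batch_size, sleeping batch_delay_ms between chunks to rate-limit.
def migrate_in_batches(rows, batch_size=1000, batch_delay_ms=250):
    batches = []
    for start in range(0, len(rows), batch_size):
        batches.append(rows[start:start + batch_size])  # stand-in for the v2 table write
        time.sleep(batch_delay_ms / 1000.0)             # delay between batches
    return batches

# 2500 fake rows with batchSize=1000 migrate as batches of 1000, 1000, 500:
sizes = [len(b) for b in migrate_in_batches(list(range(2500)), batch_size=1000, batch_delay_ms=0)]
print(sizes)
```

Lowering batchSize and raising batchDelayMs trades total migration time for lighter load on the source database.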

Environment Variables

To run the datahub-upgrade container, some environment variables must be provided in order to tell the upgrade CLI where the running DataHub containers reside.

Below details the required configurations. By default, these configs are provided for local docker-compose deployments of DataHub within docker/datahub-upgrade/env/docker.env. They assume that there is a Docker network called datahub_network where the DataHub containers can be found.

These are also the variables used when the provided datahub-upgrade.sh script is executed. To run the upgrade CLI for non-local deployments, follow these steps:

  1. Define a new ".env" file to hold your environment variables.

The following variables may be provided:

# Required Environment Variables
EBEAN_DATASOURCE_USERNAME=datahub
EBEAN_DATASOURCE_PASSWORD=datahub
EBEAN_DATASOURCE_HOST=<your-ebean-host>:3306
EBEAN_DATASOURCE_URL=jdbc:mysql://<your-ebean-host>:3306/datahub?verifyServerCertificate=false&useSSL=true&useUnicode=yes&characterEncoding=UTF-8
EBEAN_DATASOURCE_DRIVER=com.mysql.jdbc.Driver

KAFKA_BOOTSTRAP_SERVER=<your-kafka-host>:29092
KAFKA_SCHEMAREGISTRY_URL=http://<your-kafka-host>:8081

ELASTICSEARCH_HOST=<your-elastic-host>
ELASTICSEARCH_PORT=9200

NEO4J_HOST=http://<your-neo-host>:7474
NEO4J_URI=bolt://<your-neo-host>
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=datahub

DATAHUB_GMS_HOST=<your-gms-host>
DATAHUB_GMS_PORT=8080

DATAHUB_MAE_CONSUMER_HOST=<your-mae-consumer-host>
DATAHUB_MAE_CONSUMER_PORT=9091

# Optional Arguments

# Uncomment and set these to support SSL connection to Elasticsearch
# ELASTICSEARCH_USE_SSL=
# ELASTICSEARCH_SSL_PROTOCOL=
# ELASTICSEARCH_SSL_SECURE_RANDOM_IMPL=
# ELASTICSEARCH_SSL_TRUSTSTORE_FILE=
# ELASTICSEARCH_SSL_TRUSTSTORE_TYPE=
# ELASTICSEARCH_SSL_TRUSTSTORE_PASSWORD=
# ELASTICSEARCH_SSL_KEYSTORE_FILE=
# ELASTICSEARCH_SSL_KEYSTORE_TYPE=
# ELASTICSEARCH_SSL_KEYSTORE_PASSWORD=
  2. Pull (or build) & execute the datahub-upgrade container:
docker pull acryldata/datahub-upgrade:head && docker run --env-file *path-to-custom-env-file.env* acryldata/datahub-upgrade:head -u NoCodeDataMigration
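Before launching the container, it can be useful to verify that the required variables from the list above are actually set. A minimal sketch of such a pre-flight check (missing_vars is a hypothetical helper, not part of DataHub):

```python
# Hypothetical pre-flight check (not shipped with DataHub): report which of
# the required datahub-upgrade variables are missing from an environment map.
REQUIRED_VARS = [
    "EBEAN_DATASOURCE_USERNAME", "EBEAN_DATASOURCE_PASSWORD",
    "EBEAN_DATASOURCE_HOST", "EBEAN_DATASOURCE_URL", "EBEAN_DATASOURCE_DRIVER",
    "KAFKA_BOOTSTRAP_SERVER", "KAFKA_SCHEMAREGISTRY_URL",
    "ELASTICSEARCH_HOST", "ELASTICSEARCH_PORT",
    "DATAHUB_GMS_HOST", "DATAHUB_GMS_PORT",
]

def missing_vars(env):
    return [name for name in REQUIRED_VARS if not env.get(name)]

# With only one variable set, the remaining ten are reported as missing:
print(len(missing_vars({"EBEAN_DATASOURCE_USERNAME": "datahub"})))
```

In practice you would pass os.environ (or a parsed .env file) as the env argument and abort if the returned list is non-empty.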

Arguments

The primary argument required by the datahub-upgrade container is the name of the upgrade to perform. This argument can be specified using the -u flag when running the datahub-upgrade container.

For example, to run the migration named "NoCodeDataMigration", you would execute the following:

./datahub-upgrade.sh -u NoCodeDataMigration

OR

docker pull acryldata/datahub-upgrade:head && docker run --env-file env/docker.env acryldata/datahub-upgrade:head -u NoCodeDataMigration

In addition to the required -u argument, each upgrade may require specific arguments. You can provide arguments to individual upgrades using multiple -a arguments.

For example, the NoCodeDataMigration upgrade accepts the 2 optional arguments detailed above: batchSize and batchDelayMs. To specify these, pass multiple -a arguments, each of the form argumentName=argumentValue, as follows:

./datahub-upgrade.sh -u NoCodeDataMigration -a batchSize=500 -a batchDelayMs=1000 # Small batches with a 1 second delay.

OR

docker pull acryldata/datahub-upgrade:head && docker run --env-file env/docker.env acryldata/datahub-upgrade:head -u NoCodeDataMigration -a batchSize=500 -a batchDelayMs=1000
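The -u / -a convention shown above can be sketched in a few lines of Python. This illustrates only the argumentName=argumentValue format; parse_upgrade_args is hypothetical and not the CLI's real parser.

```python
# Hypothetical sketch (not the actual datahub-upgrade parser): split a
# command line of the form "-u <upgradeId> -a key=value -a key=value ..."
# into the upgrade name and a dict of its arguments.
def parse_upgrade_args(argv):
    upgrade_id, upgrade_args = None, {}
    tokens = iter(argv)
    for token in tokens:
        if token == "-u":
            upgrade_id = next(tokens)
        elif token == "-a":
            key, _, value = next(tokens).partition("=")
            upgrade_args[key] = value
    return upgrade_id, upgrade_args

print(parse_upgrade_args(
    ["-u", "NoCodeDataMigration", "-a", "batchSize=500", "-a", "batchDelayMs=1000"]))
```

Repeating -a simply adds another key to the argument map, which is why each upgrade-specific argument gets its own -a flag.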