# DataHub Upgrade Docker Image
This container is used to automatically apply upgrades from one version of DataHub to another.
## Supported Upgrades
As of today, there are 4 supported upgrades:
1. **NoCodeDataMigration**: Performs a series of pre-flight qualification checks and then migrates metadata_aspect table data
   to the metadata_aspect_v2 table. Arguments:
   - _batchSize_ (Optional): The number of rows to migrate at a time. Defaults to 1000.
   - _batchDelayMs_ (Optional): The number of milliseconds of delay between migrated batches. Used for rate limiting. Defaults to 250.
   - _dbType_ (Optional): The target DB type. Valid values are `MYSQL`, `MARIA`, `POSTGRES`. Defaults to `MYSQL`.
2. **NoCodeDataMigrationCleanup**: Cleanses graph index, search index, and key-value store of legacy DataHub data (metadata_aspect table) once
   the No Code Data Migration has completed successfully. No arguments.
3. **RestoreIndices**: Restores indices by fetching the latest version of each aspect and producing MAE. Arguments:
   - _batchSize_ (Optional): The number of rows to migrate at a time. Defaults to 1000.
   - _batchDelayMs_ (Optional): The number of milliseconds of delay between migrated batches. Used for rate limiting. Defaults to 250.
   - _numThreads_ (Optional): The number of threads to use. Defaults to 1. Not used if `urnBasedPagination` is true.
   - _aspectName_ (Optional): The aspect name for producing events.
   - _urn_ (Optional): The urn for producing events.
   - _urnLike_ (Optional): The urn pattern for producing events, using `%` as a wildcard.
   - _urnBasedPagination_ (Optional): Paginate the SQL results using the urn + aspect string instead of `OFFSET`. Defaults to false,
     though enabling it should improve performance for large amounts of data.
4. **RestoreBackup**: Restores the storage stack from a backup of the local database.
## Environment Variables
To run the `datahub-upgrade` container, some environment variables must be provided to tell the upgrade CLI
where the running DataHub containers reside.

The required configuration is detailed below. By default, these settings are provided for local docker-compose deployments of
DataHub in `docker/datahub-upgrade/env/docker.env`. They assume that there is a Docker network called `datahub_network`
where the DataHub containers can be found.

These are also the variables used when the provided `datahub-upgrade.sh` script is executed. To run the upgrade CLI for non-local deployments,
follow these steps:
1. Define a new `.env` file to hold your environment variables.
   The following variables may be provided:
```shell
# Required Environment Variables
EBEAN_DATASOURCE_USERNAME=datahub
EBEAN_DATASOURCE_PASSWORD=datahub
EBEAN_DATASOURCE_HOST=<your-ebean-host>:3306
EBEAN_DATASOURCE_URL=jdbc:mysql://<your-ebean-host>:3306/datahub?verifyServerCertificate=false&useSSL=true&useUnicode=yes&characterEncoding=UTF-8
EBEAN_DATASOURCE_DRIVER=com.mysql.jdbc.Driver

KAFKA_BOOTSTRAP_SERVER=<your-kafka-host>:29092
KAFKA_SCHEMAREGISTRY_URL=http://<your-kafka-host>:8081

ELASTICSEARCH_HOST=<your-elastic-host>
ELASTICSEARCH_PORT=9200

NEO4J_HOST=http://<your-neo-host>:7474
NEO4J_URI=bolt://<your-neo-host>
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=datahub

DATAHUB_GMS_HOST=<your-gms-host>
DATAHUB_GMS_PORT=8080
# Datahub protocol (default http)
# DATAHUB_GMS_PROTOCOL=http

DATAHUB_MAE_CONSUMER_HOST=<your-mae-consumer-host>
DATAHUB_MAE_CONSUMER_PORT=9091

# Optional Arguments

# Uncomment and set these to support SSL connection to Elasticsearch
# ELASTICSEARCH_USE_SSL=
# ELASTICSEARCH_SSL_PROTOCOL=
# ELASTICSEARCH_SSL_SECURE_RANDOM_IMPL=
# ELASTICSEARCH_SSL_TRUSTSTORE_FILE=
# ELASTICSEARCH_SSL_TRUSTSTORE_TYPE=
# ELASTICSEARCH_SSL_TRUSTSTORE_PASSWORD=
# ELASTICSEARCH_SSL_KEYSTORE_FILE=
# ELASTICSEARCH_SSL_KEYSTORE_TYPE=
# ELASTICSEARCH_SSL_KEYSTORE_PASSWORD=
```
2. Pull (or build) & execute the `datahub-upgrade` container:
```shell
docker pull acryldata/datahub-upgrade:head && docker run --env-file <path-to-custom-env-file.env> acryldata/datahub-upgrade:head -u NoCodeDataMigration
```
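Before running step 2 against a non-local deployment, it can help to confirm that the custom env file defines every variable listed as required in step 1. The following is a minimal sketch of such a check. `check_env_file` and `REQUIRED_VARS` are hypothetical names introduced here for illustration and are not part of `datahub-upgrade`; the variable names themselves are copied from the list above.

```shell
#!/usr/bin/env bash
# Sketch only: check_env_file is a hypothetical helper, not shipped with DataHub.

# Names copied from the "Required Environment Variables" list in step 1.
REQUIRED_VARS="EBEAN_DATASOURCE_USERNAME EBEAN_DATASOURCE_PASSWORD EBEAN_DATASOURCE_HOST
EBEAN_DATASOURCE_URL EBEAN_DATASOURCE_DRIVER KAFKA_BOOTSTRAP_SERVER KAFKA_SCHEMAREGISTRY_URL
ELASTICSEARCH_HOST ELASTICSEARCH_PORT NEO4J_HOST NEO4J_URI NEO4J_USERNAME NEO4J_PASSWORD
DATAHUB_GMS_HOST DATAHUB_GMS_PORT DATAHUB_MAE_CONSUMER_HOST DATAHUB_MAE_CONSUMER_PORT"

# Prints "missing: NAME" for each required variable that has no uncommented
# "NAME=" line in the given file. Returns 0 if nothing is missing, 1 otherwise.
check_env_file() {
  env_file="$1"
  missing=0
  for name in $REQUIRED_VARS; do
    if ! grep -q "^${name}=" "$env_file"; then
      echo "missing: ${name}"
      missing=1
    fi
  done
  return "$missing"
}
```

One possible use is to gate the container run on the check, for example `check_env_file my.env && docker run --env-file my.env acryldata/datahub-upgrade:head -u NoCodeDataMigration`, where `my.env` stands in for your own file. The check only looks for the presence of each name; it does not validate the values.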
## Arguments
The primary argument required by the `datahub-upgrade` container is the name of the upgrade to perform. This argument
can be specified using the `-u` flag when running the `datahub-upgrade` container.

For example, to run the migration named "NoCodeDataMigration", you would execute the following:
```shell
./datahub-upgrade.sh -u NoCodeDataMigration
```
OR
```shell
docker pull acryldata/datahub-upgrade:head && docker run --env-file env/docker.env acryldata/datahub-upgrade:head -u NoCodeDataMigration
```
In addition to the required `-u` argument, each upgrade may require specific arguments. You can provide arguments to individual
upgrades using multiple `-a` arguments.

For example, the NoCodeDataMigration upgrade accepts the optional arguments detailed above, including _batchSize_ and _batchDelayMs_.
To specify these, pass one `-a` argument per setting, each of the form _argumentName=argumentValue_, as follows:
```shell
# Small batches with a 1 second delay.
./datahub-upgrade.sh -u NoCodeDataMigration -a batchSize=500 -a batchDelayMs=1000
```
OR
```shell
docker pull acryldata/datahub-upgrade:head && docker run --env-file env/docker.env acryldata/datahub-upgrade:head -u NoCodeDataMigration -a batchSize=500 -a batchDelayMs=1000
```
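The same `-u` / `-a` pattern applies to the other upgrades, such as RestoreIndices with its _batchSize_ and _urnLike_ arguments. The sketch below only assembles and prints the flags, so it can be inspected before anything is run. `upgrade_flags` is a hypothetical helper introduced for illustration, and the `urn:li:dataset:%` pattern is an assumed example value.

```shell
#!/usr/bin/env bash
# Sketch only: upgrade_flags is a hypothetical helper, not shipped with DataHub.

# Prints "-u <upgradeName>" followed by one "-a name=value" per extra parameter.
upgrade_flags() {
  upgrade_name="$1"
  shift
  flags="-u ${upgrade_name}"
  # Every remaining parameter is an argumentName=argumentValue pair.
  for pair in "$@"; do
    flags="${flags} -a ${pair}"
  done
  printf '%s\n' "$flags"
}

# Restore indices for urns matching an assumed dataset pattern, 500 rows per batch.
upgrade_flags RestoreIndices batchSize=500 'urnLike=urn:li:dataset:%'
# prints: -u RestoreIndices -a batchSize=500 -a urnLike=urn:li:dataset:%
```

The printed flags can then be appended to either invocation shown above, for example `./datahub-upgrade.sh -u RestoreIndices -a batchSize=500 -a 'urnLike=urn:li:dataset:%'`. Quoting the `urnLike` value keeps the shell from interpreting any special characters in the pattern.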