# Bootstrap MetadataChangeProposals (MCPs)

Bootstrap MCPs are templated MCPs which are loaded when the `system-update` job runs. This allows adding
entities and aspects to DataHub at install time with the ability to customize them via environment variable
overrides.

The built-in bootstrap MCP process can also be extended with custom MCPs. This can streamline deployment
scenarios where a set of standard ingestion recipes, data platforms, user groups, or other configuration
can be applied without the need for developing custom scripts.

## Process Overview

When DataHub is installed or upgraded, a job called `system-update` runs. This job is responsible for data
migration (particularly Elasticsearch indices) and for ensuring the data is prepared for the next version of
DataHub. This is also the job which applies the bootstrap MCPs.

The `system-update` job, depending on configuration, can be split into two sequences of steps. If they are
not split, then all steps are blocking.

1. An initial blocking sequence which runs before the new version of GMS and other components start.
2. A second sequence of steps during which GMS and other components are allowed to run while additional data
   migration steps continue in the background.

When applying bootstrap MCPs, `system-update` performs the following steps:

1. The `bootstrap_mcps.yaml` file is read, either from a default classpath location, `bootstrap_mcps.yaml`, or from a
   filesystem location provided by an environment variable, `SYSTEM_UPDATE_BOOTSTRAP_MCP_CONFIG`.
2. Depending on the mode (blocking or non-blocking), each entry in the configuration file is executed in sequence.
3. The template MCP file is loaded, either from the classpath or from a filesystem location, and the template values are applied.
4. The rendered template MCPs are executed with the options specified in the `bootstrap_mcps.yaml`.

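The lookup order in step 1 can be sketched in Python as follows. This is only an illustration of the documented behavior, not DataHub's implementation; the helper name `resolve_config_location` is hypothetical, and only the environment variable name and the default file name come from the text above.

```python
import os

DEFAULT_CONFIG = "bootstrap_mcps.yaml"  # default classpath location named above
CONFIG_ENV_VAR = "SYSTEM_UPDATE_BOOTSTRAP_MCP_CONFIG"


def resolve_config_location(env=None):
    """Return the filesystem override if one is set, else the default location.

    Hypothetical helper: it mirrors the documented lookup order only.
    """
    env = os.environ if env is None else env
    override = env.get(CONFIG_ENV_VAR, "").strip()
    return override or DEFAULT_CONFIG


if __name__ == "__main__":
    print(resolve_config_location())
```

Passing a plain dictionary instead of `os.environ` keeps the sketch easy to exercise without changing the real environment.
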
## `bootstrap_mcps.yaml` Configuration

The `bootstrap_mcps.yaml` file has the following format.

```yaml
bootstrap:
  templates:
    - name: <name>
      version: <version>
      force: false
      blocking: false
      async: true
      optional: false
      mcps_location: <classpath or file location>
      values_env: <environment variable>
```

Each entry in the list of templates points to a single YAML file which can contain one or more MCP objects. The
execution of the template MCPs is tracked by name and version to prevent re-execution. For each `name`/`version`
combination, the MCP objects are executed once unless `force=true`.

See the following table of options for descriptions of each field in the template configuration.

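The tracking rule described above can be sketched as a small decision function. This is a simplified model under stated assumptions: the `history` set of `(name, version)` pairs and the function name `should_execute` are hypothetical and say nothing about how DataHub actually stores its run history.

```python
def should_execute(name, version, force, history):
    """Decide whether a template collection should run.

    `history` is a hypothetical set of (name, version) pairs that were
    already applied by a previous run.
    """
    if force:
        # force ignores the previous run history entirely
        return True
    return (name, version) not in history


if __name__ == "__main__":
    applied = {("test-group", "v1")}
    print(should_execute("test-group", "v1", False, applied))
    print(should_execute("test-group", "v2", False, applied))
```

Under this model, bumping `version` is the usual way to re-apply a changed template, while `force` re-applies it on every run.
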
| Field         | Default | Required | Description                                                                                                |
| ------------- | ------- | -------- | ---------------------------------------------------------------------------------------------------------- |
| name          |         | `true`   | The name for the collection of template MCPs.                                                              |
| version       |         | `true`   | A string version for the collection of template MCPs.                                                      |
| force         | `false` | `false`  | Ignores the previous run history; execution is not skipped even if the template ran previously.            |
| blocking      | `false` | `false`  | Run before GMS and other components during upgrade/install if running in split blocking/non-blocking mode. |
| async         | `true`  | `false`  | Controls whether the MCPs are executed for sync or async ingestion.                                        |
| optional      | `false` | `false`  | Whether to ignore a failure or fail the entire `system-update` job.                                        |
| mcps_location |         | `true`   | The location of the file which contains the template MCPs.                                                 |
| values_env    |         | `false`  | The environment variable which contains override template values.                                          |

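As a rough illustration of the table, the sketch below fills in the documented defaults and rejects entries that lack a required field. The function name `normalize_template` and the choice to raise `ValueError` are assumptions made for this example; the field names and default values are taken from the table.

```python
# Defaults and required fields as listed in the table above.
DEFAULTS = {"force": False, "blocking": False, "async": True, "optional": False}
REQUIRED = ("name", "version", "mcps_location")


def normalize_template(entry):
    """Return a copy of `entry` with the documented defaults applied.

    Hypothetical helper: raises ValueError if a required field is missing.
    """
    missing = [field for field in REQUIRED if not entry.get(field)]
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))
    # Values given in the entry take precedence over the defaults.
    return {**DEFAULTS, **entry}


if __name__ == "__main__":
    print(normalize_template({
        "name": "test-group",
        "version": "v1",
        "mcps_location": "bootstrap_mcps/test-group.yaml",
    }))
```
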
## Template MCPs

Template MCPs are stored in a YAML file which uses the Mustache templating library to populate values from an optional environment
variable. Defaults can be provided inline, so an override is only necessary when providing install/upgrade time configuration.

In general, the file contains a list of MCPs which follow the schema definition for MCPs exactly. Any valid field for an MCP
is accepted, including optional fields such as `headers`.

### Example: Native Group

An example template MCP collection, configuration, and values environment variable which would create a native group are shown below.

```yaml
- entityUrn: urn:li:corpGroup:{{group.id}}
  entityType: corpGroup
  aspectName: corpGroupInfo
  changeType: UPSERT
  aspect:
    description: {{group.description}}{{^group.description}}Default description{{/group.description}}
    displayName: {{group.displayName}}
    created: {{&auditStamp}}
    members: [] # required as part of the aspect's schema definition
    groups: [] # required as part of the aspect's schema definition
    admins: [] # required as part of the aspect's schema definition
- entityUrn: urn:li:corpGroup:{{group.id}}
  entityType: corpGroup
  aspectName: origin
  changeType: UPSERT
  aspect:
    type: NATIVE
```

The following entry in the `bootstrap_mcps.yaml` populates the values from the environment variable `DATAHUB_TEST_GROUP_VALUES`.

```yaml
- name: test-group
  version: v1
  mcps_location: "bootstrap_mcps/test-group.yaml"
  values_env: "DATAHUB_TEST_GROUP_VALUES"
```

Example JSON values loaded from the environment variable `DATAHUB_TEST_GROUP_VALUES` might look like the following.

```json
{
  "group": {
    "id": "mygroup",
    "displayName": "My Group",
    "description": "Description of the group"
  }
}
```

Using standard Mustache template semantics, the values in the environment variable are inserted into the YAML structure
and ingested when the `system-update` job runs.

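Since the values travel in a single environment variable, it can be convenient to generate the JSON string rather than write it by hand. The sketch below is one possible way to do that with Python's standard library; how the resulting string is exported (shell, Helm values, container spec) depends on the deployment and is not shown.

```python
import json

# Values from the example above.
values = {
    "group": {
        "id": "mygroup",
        "displayName": "My Group",
        "description": "Description of the group",
    }
}

# Compact, single-line JSON suitable for use as an environment variable value.
encoded = json.dumps(values, separators=(",", ":"))

if __name__ == "__main__":
    print(encoded)
```
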
#### Default values

In the example above, the group's `description` defaults to `Default description` if it is not specified
in the values contained in the environment variable override, following standard Mustache template semantics.

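The effect of the `{{name}}{{^name}}default{{/name}}` pattern can be modeled with a few lines of Python. This is not a Mustache implementation: `lookup` and `render_with_default` are hypothetical helpers that only model the case where a value is absent, and the handling of other falsy values is not modeled here.

```python
def lookup(values, dotted_name):
    """Resolve a dotted name such as 'group.description'; None if absent."""
    current = values
    for key in dotted_name.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def render_with_default(values, dotted_name, default):
    """Model `{{name}}{{^name}}default{{/name}}` for a missing value."""
    value = lookup(values, dotted_name)
    # When the value is absent, the variable tag renders nothing and the
    # inverted section supplies the default text instead.
    return default if value is None else str(value)


if __name__ == "__main__":
    print(render_with_default({"group": {"id": "mygroup"}},
                              "group.description", "Default description"))
```
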
#### AuditStamp

A special template reference, `{{&auditStamp}}`, can be used to inject an `auditStamp` into the aspect. This can be used to
populate required fields of type `auditStamp`, calculated from the time at which the MCP is applied. It inserts an inline JSON
representation of the `auditStamp` at that location. The `&` character is the standard Mustache syntax for skipping HTML
escaping, which would otherwise corrupt the inserted JSON.

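To see why the unescaped form matters, the sketch below builds an inline JSON audit stamp and compares it with an HTML-escaped copy. Several things here are assumptions for illustration: the helper `make_audit_stamp`, the actor URN, and the use of Python's `html.escape` as a stand-in for Mustache's HTML escaping. The `time` and `actor` field names follow DataHub's `AuditStamp` record.

```python
import html
import json
import time

SYSTEM_ACTOR = "urn:li:corpuser:__datahub_system"  # assumed actor, illustration only


def make_audit_stamp(actor=SYSTEM_ACTOR, now_ms=None):
    """Return an inline JSON audit stamp (hypothetical helper)."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # epoch milliseconds
    return json.dumps({"time": now_ms, "actor": actor})


stamp = make_audit_stamp(now_ms=1717632000000)
# Stand-in for what HTML escaping would do to the same string.
escaped = html.escape(stamp)

if __name__ == "__main__":
    print(stamp)    # valid JSON
    print(escaped)  # quotes replaced by &quot;, no longer valid JSON
```
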
### Ingestion Template MCPs

Ingestion template MCPs are slightly more complicated since the ingestion `recipe` is stored as a JSON string within the aspect.
For ingestion recipes, special handling was added so that they can be described naturally in YAML instead of as the normally encoded JSON string.

This means that in the example below, the structure beneath the `aspect.config.recipe` path is automatically converted
to the required JSON structure and stored as a string.

```yaml
- entityType: dataHubIngestionSource
  entityUrn: urn:li:dataHubIngestionSource:demo-data
  aspectName: dataHubIngestionSourceInfo
  changeType: UPSERT
  aspect:
    type: "demo-data"
    name: "demo-data"
    config:
      recipe:
        source:
          type: "datahub-gc"
          config: {}
      executorId: default
```

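The conversion described above amounts to serializing the nested `recipe` structure into a JSON string while leaving the rest of the aspect untouched. The following Python sketch shows the idea on the example aspect; `encode_recipe` is a hypothetical helper, not the code DataHub runs.

```python
import copy
import json

# The aspect from the example above, as it looks after YAML parsing.
aspect = {
    "type": "demo-data",
    "name": "demo-data",
    "config": {
        "recipe": {"source": {"type": "datahub-gc", "config": {}}},
        "executorId": "default",
    },
}


def encode_recipe(aspect):
    """Return a copy of the aspect with config.recipe stored as a JSON string."""
    encoded = copy.deepcopy(aspect)  # leave the caller's structure unchanged
    encoded["config"]["recipe"] = json.dumps(aspect["config"]["recipe"])
    return encoded


if __name__ == "__main__":
    print(encode_recipe(aspect)["config"]["recipe"])
```
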
## `bootstrap_mcps.yaml` Override

Additionally, the `bootstrap_mcps.yaml` can be overridden.
This might be useful for applying changes to the version when using Helm-defined template values.

```yaml
bootstrap:
  templates:
    - name: myMCPTemplate
      version: v1
      mcps_location: <classpath or file location>
      values_env: <value environment variable>
      revision_env: REVISION_ENV
```

In the above example, we've added a `revision_env`, which names an environment variable whose contents override the MCP bootstrap definition itself (excluding `revision_env`).

In this example, we could configure `REVISION_ENV` to contain a timestamp or hash: `{"version":"2024060600"}`.
This value can be changed or incremented each time the Helm-supplied template values change. This ensures the MCP is updated
with the latest values during deployment.

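One way to picture the override is as a shallow merge of the JSON found in the revision variable over the template entry. The sketch below models that reading; `apply_revision` is a hypothetical helper, and the shallow-merge behavior is an assumption based on the description above rather than a statement of DataHub's exact logic.

```python
import json


def apply_revision(entry, env):
    """Overlay the JSON stored in the entry's revision variable onto the entry."""
    env_name = entry.get("revision_env")
    raw = env.get(env_name) if env_name else None
    if not raw:
        return dict(entry)  # nothing to override
    overrides = json.loads(raw)
    # revision_env itself is excluded from the override.
    overrides.pop("revision_env", None)
    return {**entry, **overrides}


if __name__ == "__main__":
    template = {"name": "myMCPTemplate", "version": "v1", "revision_env": "REVISION_ENV"}
    print(apply_revision(template, {"REVISION_ENV": '{"version":"2024060600"}'}))
```

Because execution is tracked by `name` and `version`, a changed `version` in this model would cause the template to be applied again on the next `system-update` run.
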
## Known Limitations

- Supported change types:
  - UPSERT
  - CREATE
  - CREATE_ENTITY