Mirror of https://github.com/datahub-project/datahub.git, synced 2025-10-31 02:37:05 +00:00.

# Bootstrap MetadataChangeProposals (MCPs)

Bootstrap MCPs are templated MCPs that are loaded when the `system-update` job runs. This allows adding
entities and aspects to DataHub at install time, with the ability to customize them via environment variable
overrides.

The built-in bootstrap MCP process can also be extended with custom MCPs. This can streamline deployment
scenarios where a set of standard ingestion recipes, data platforms, user groups, or other configuration
can be applied without the need to develop custom scripts.

## Process Overview

When DataHub is installed or upgraded, a job called `system-update` runs. This job is responsible for data
migration (particularly Elasticsearch indices) and for ensuring the data is prepared for the next version of
DataHub. It is also the job that applies the bootstrap MCPs.

The `system-update` job, depending on configuration, can be split into two sequences of steps. If they are
not split, then all steps are blocking.

1. An initial blocking sequence, which runs before the new version of GMS and other components.
2. A second sequence of steps, during which GMS and other components are allowed to run while additional data migration steps
   continue in the background.

When applying bootstrap MCPs, `system-update` performs the following steps:

1. The `bootstrap_mcps.yaml` file is read, either from the default classpath location, `bootstrap_mcps.yaml`, or from a filesystem location
   provided by the environment variable `SYSTEM_UPDATE_BOOTSTRAP_MCP_CONFIG`.
2. Depending on the mode (blocking or non-blocking), each entry in the configuration file is executed in sequence.
3. The template MCP file is loaded from either the classpath or a filesystem location, and the template values are applied.
4. The rendered template MCPs are executed with the options specified in `bootstrap_mcps.yaml`.

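The first two steps can be sketched in a few lines of Python. This is an illustration only, not DataHub's implementation: the function names `resolve_config_location` and `templates_for_phase` are hypothetical, and the sketch assumes split mode, where entries marked `blocking: true` run in the first sequence and the rest in the second.

```python
import os

# Default classpath location named in the steps above.
DEFAULT_CONFIG = "bootstrap_mcps.yaml"


def resolve_config_location(env=os.environ):
    """Return the filesystem override if set, otherwise the default location."""
    return env.get("SYSTEM_UPDATE_BOOTSTRAP_MCP_CONFIG") or DEFAULT_CONFIG


def templates_for_phase(templates, blocking):
    """Select the entries for one phase, preserving their order in the file.

    `templates` is the parsed `bootstrap.templates` list (a list of dicts).
    A missing `blocking` key is treated as false.
    """
    return [t for t in templates if bool(t.get("blocking", False)) == blocking]


templates = [
    {"name": "a", "version": "v1", "blocking": True},
    {"name": "b", "version": "v1"},
    {"name": "c", "version": "v1", "blocking": True},
]
blocking_phase = templates_for_phase(templates, blocking=True)       # entries a, c
non_blocking_phase = templates_for_phase(templates, blocking=False)  # entry b
```

When the job is not split, every step is blocking, so the two phases collapse into a single ordered pass over all entries.
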
## `bootstrap_mcps.yaml` Configuration

The `bootstrap_mcps.yaml` file has the following format.

```yaml
bootstrap:
  templates:
    - name: <name>
      version: <version>
      force: false
      blocking: false
      async: true
      optional: false
      mcps_location: <classpath or file location>
      values_env: <environment variable>
```

Each entry in the list of templates points to a single YAML file, which can contain one or more MCP objects. The
execution of the template MCPs is tracked by name and version to prevent re-execution. For each `name`/`version`
combination, the MCP objects are executed once unless `force=true`.

The following table describes each field in the template configuration.

| Field         | Default  | Required  | Description                                                                                                |
|---------------|----------|-----------|------------------------------------------------------------------------------------------------------------|
| name          |          | `true`    | The name for the collection of template MCPs.                                                              |
| version       |          | `true`    | A string version for the collection of template MCPs.                                                      |
| force         | `false`  | `false`   | Ignores the previous run history; execution is not skipped even if the template ran previously.            |
| blocking      | `false`  | `false`   | Run before GMS and other components during upgrade/install if running in split blocking/non-blocking mode. |
| async         | `true`   | `false`   | Controls whether the MCPs are executed with sync or async ingestion.                                       |
| optional      | `false`  | `false`   | If `true`, a failure is ignored; otherwise a failure fails the entire `system-update` job.                 |
| mcps_location |          | `true`    | The location of the file which contains the template MCPs.                                                 |
| values_env    |          | `false`   | The environment variable which contains override template values.                                          |

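The defaults in the table and the run-once rule can be modelled as follows. This is a minimal sketch under stated assumptions: `TemplateConfig`, `should_run`, and the in-memory `history` set are hypothetical names chosen for illustration, and how DataHub actually stores its run history is not covered here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TemplateConfig:
    # Required fields (no default in the table).
    name: str
    version: str
    mcps_location: str
    # Optional fields with the defaults listed in the table.
    force: bool = False
    blocking: bool = False
    async_: bool = True  # `async` is a reserved word in Python
    optional: bool = False
    values_env: Optional[str] = None


def should_run(template, history):
    """Run when forced, or when this name/version pair has not run before.

    `history` is a set of (name, version) tuples (hypothetical storage).
    """
    return template.force or (template.name, template.version) not in history


history = {("test-group", "v1")}
already_applied = TemplateConfig("test-group", "v1", "bootstrap_mcps/test-group.yaml")
new_version = TemplateConfig("test-group", "v2", "bootstrap_mcps/test-group.yaml")
forced = TemplateConfig("test-group", "v1", "bootstrap_mcps/test-group.yaml", force=True)
```

Under this model, bumping `version` is the usual way to re-apply a changed template, while `force` re-applies it on every run.
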
## Template MCPs

Template MCPs are stored in a YAML file which uses the mustache templating library to populate values from an optional environment
variable. Defaults can be provided inline, so an override is only necessary when providing install-time or upgrade-time configuration.

In general, the file contains a list of MCPs which follow the schema definition for MCPs exactly. Any valid field for an MCP
is accepted, including optional fields such as `headers`.

### Example: Native Group

An example template MCP collection, configuration, and values environment variable are shown below. Together they create a native group.

```yaml
- entityUrn: urn:li:corpGroup:{{group.id}}
  entityType: corpGroup
  aspectName: corpGroupInfo
  changeType: UPSERT
  aspect:
    description: {{group.description}}{{^group.description}}Default description{{/group.description}}
    displayName: {{group.displayName}}
    created: {{&auditStamp}}
    members: [] # required as part of the aspect's schema definition
    groups: [] # required as part of the aspect's schema definition
    admins: [] # required as part of the aspect's schema definition
- entityUrn: urn:li:corpGroup:{{group.id}}
  entityType: corpGroup
  aspectName: origin
  changeType: UPSERT
  aspect:
    type: NATIVE
```

Create an entry in `bootstrap_mcps.yaml` to populate the values from the environment variable `DATAHUB_TEST_GROUP_VALUES`:

```yaml
    - name: test-group
      version: v1
      mcps_location: "bootstrap_mcps/test-group.yaml"
      values_env: "DATAHUB_TEST_GROUP_VALUES"
```

The JSON values loaded from the `DATAHUB_TEST_GROUP_VALUES` environment variable might look like the following.

```json
{"group":{"id":"mygroup", "displayName":"My Group", "description":"Description of the group"}}
```

Using standard mustache template semantics, the values in the environment variable are inserted into the YAML structure
and ingested when `system-update` runs.

#### Default values

In the example above, the group's `description` defaults to `Default description` when it is not specified
in the values contained in the environment variable override. This follows standard mustache template semantics: the inverted
section `{{^group.description}}...{{/group.description}}` is rendered only when `group.description` is missing or empty.

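The default-value behaviour can be reproduced with a very small renderer. The sketch below is not the mustache library DataHub uses; it is a hypothetical, deliberately limited stand-in that handles only dotted variables and inverted sections, and it skips the HTML escaping that real mustache applies to `{{name}}` references.

```python
import json
import re


def lookup(values, dotted):
    """Resolve a dotted name such as 'group.id'; return None if any part is missing."""
    current = values
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def render(template, values):
    """Render a tiny subset of mustache: {{a.b}}, {{&a.b}} and {{^a.b}}...{{/a.b}}."""

    def inverted(match):
        # Inverted section: keep the body only when the value is missing or falsy.
        return "" if lookup(values, match.group(1)) else match.group(2)

    out = re.sub(r"\{\{\^([\w.]+)\}\}(.*?)\{\{/\1\}\}", inverted, template, flags=re.S)

    def variable(match):
        # Plain variable: a missing value renders as an empty string.
        value = lookup(values, match.group(1))
        return "" if value is None else str(value)

    return re.sub(r"\{\{&?([\w.]+)\}\}", variable, out)


line = "description: {{group.description}}{{^group.description}}Default description{{/group.description}}"

# Values as they might arrive from the environment variable, without a description.
values = json.loads('{"group":{"id":"mygroup", "displayName":"My Group"}}')
rendered = render(line, values)
```

With the values above, `rendered` falls back to the inline default; supplying `group.description` in the JSON replaces it.
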
#### AuditStamp

A special template reference, `{{&auditStamp}}`, can be used to inject an `auditStamp` into the aspect. This can be used to
populate required fields of type `auditStamp`, calculated from the time the MCP is applied. It inserts an inline JSON representation
of the `auditStamp` at that location. The `&` character tells mustache not to escape HTML characters, per standard mustache template semantics.

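To see why the unescaped form matters, consider what HTML escaping does to inline JSON. The sketch below is illustrative only: `make_audit_stamp` is a hypothetical helper, the actor URN is an assumed placeholder, and `html.escape` only approximates mustache's escaping rules.

```python
import html
import json
import time


def make_audit_stamp(actor="urn:li:corpuser:__datahub_system"):
    """Build an inline JSON audit stamp (the actor URN is an assumption)."""
    return json.dumps({"time": int(time.time() * 1000), "actor": actor})


stamp = make_audit_stamp()

# What an unescaped reference such as {{&auditStamp}} is meant to produce:
unescaped_line = "created: " + stamp

# Roughly what an escaped reference would produce: quotes become entities,
# so the value is no longer valid inline JSON.
escaped_line = "created: " + html.escape(stamp)
```

Because JSON relies on double quotes, escaping them as HTML entities would leave the `created` field unparseable, which is what the `&` prefix avoids.
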
### Ingestion Template MCPs

Ingestion template MCPs are slightly more complicated, since the ingestion `recipe` is stored as a JSON string within the aspect.
For ingestion recipes, special handling was added so that they can be described naturally in YAML instead of as the normally encoded JSON string.

This means that in the example below, the structure beneath the `aspect.config.recipe` path is automatically converted
to the required JSON structure and stored as a string.

```yaml
- entityType: dataHubIngestionSource
  entityUrn: urn:li:dataHubIngestionSource:demo-data
  aspectName: dataHubIngestionSourceInfo
  changeType: UPSERT
  aspect:
    type: 'demo-data'
    name: 'demo-data'
    config:
      recipe:
        source:
          type: 'datahub-gc'
          config: {}
      executorId: default
```

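The effect of that conversion can be sketched as follows. This is an approximation for illustration, not DataHub's code: `stringify_recipe` is a hypothetical name, and the exact JSON formatting DataHub produces may differ.

```python
import copy
import json


def stringify_recipe(aspect):
    """Return a copy of the aspect with `config.recipe` encoded as a JSON string."""
    converted = copy.deepcopy(aspect)
    converted["config"]["recipe"] = json.dumps(aspect["config"]["recipe"])
    return converted


# The aspect from the YAML example above, as it would look after YAML parsing.
aspect = {
    "type": "demo-data",
    "name": "demo-data",
    "config": {
        "recipe": {"source": {"type": "datahub-gc", "config": {}}},
        "executorId": "default",
    },
}

stored = stringify_recipe(aspect)
# stored["config"]["recipe"] is now a string holding the recipe as JSON,
# while the other fields are unchanged.
```

Writing the recipe as nested YAML avoids hand-escaping a JSON string inside the template, which is easy to get wrong.
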
## Known Limitations

* Supported change types:
  * UPSERT
  * CREATE
  * CREATE_ENTITY