# Metadata Events

DataHub makes use of a few important Kafka events for operation. The most notable of these include

1. Metadata Change Proposal
2. Metadata Change Log (Versioned + Timeseries)
3. Platform Event

Each event is originally authored using [PDL](https://linkedin.github.io/rest.li/pdl_schema), a modeling language developed by LinkedIn, and then converted into its Avro equivalent, which is used when writing and reading the events to Kafka.

In this document, we'll describe each of these events in detail, including notes about their structure & semantics.

## Metadata Change Proposal (MCP)

A Metadata Change Proposal represents a request to change a specific [aspect](aspect.md) on an enterprise's Metadata Graph. Each MCP provides a new value for a given aspect. For example, a single MCP can be emitted to change the ownership, documentation, domains, or deprecation status of a data asset.

### Emission

MCPs may be emitted by clients of DataHub's low-level ingestion APIs (e.g. ingestion sources) during the process of metadata ingestion. The DataHub Python API exposes an interface for easily sending MCPs into DataHub.

The default Kafka topic name for MCPs is `MetadataChangeProposal_v1`.

### Consumption

DataHub's storage layer actively listens for new Metadata Change Proposals and attempts to apply the requested change to the Metadata Graph.

### Schema

| Name               | Type   | Description                                                                                                                                                                                                                                    | Optional |
| ------------------ | ------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------- |
| entityUrn          | String | The unique identifier for the Entity being changed. For example, a Dataset's urn.                                                                                                                                                              | False    |
| entityType         | String | The type of the entity the new aspect is associated with. This corresponds to the entity name in the DataHub Entity Registry, for example 'dataset'.                                                                                           | False    |
| entityKeyAspect    | Object | The key struct of the entity that was changed. Only present if the Metadata Change Proposal contained the raw key struct.                                                                                                                      | True     |
| changeType         | String | The change type. CREATE, UPSERT and DELETE are currently supported.                                                                                                                                                                            | False    |
| aspectName         | String | The entity aspect which was changed.                                                                                                                                                                                                           | False    |
| aspect             | Object | The new aspect value. Null if the aspect was deleted.                                                                                                                                                                                          | True     |
| aspect.contentType | String | The serialization type of the aspect itself. The only supported value is `application/json`.                                                                                                                                                   | False    |
| aspect.value       | String | The serialized aspect. This is a JSON-serialized representation of the aspect document originally defined in PDL. See https://github.com/datahub-project/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin for more.           | False    |
| systemMetadata     | Object | The new system metadata. This includes the ingestion run-id, model registry and more. For the full structure, see https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/SystemMetadata.pdl   | True     |

The PDL schema can be found [here](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/MetadataChangeProposal.pdl).

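To make the schema concrete, here is a minimal sketch (plain Python, illustrative only; the helper name is ours) that assembles an MCP dict for an 'ownership' change, serializing the aspect payload into the string-valued `aspect.value` field as the schema requires:

```python
import json
import time


def make_ownership_mcp(entity_urn: str, owner_urn: str) -> dict:
    """Build a minimal MCP dict for an 'ownership' upsert.

    The aspect itself is JSON-serialized into the string-valued
    "aspect.value" field, per the MCP schema above.
    """
    ownership_aspect = {
        "owners": [{"type": "DATAOWNER", "owner": owner_urn}],
        "lastModified": {"actor": owner_urn, "time": int(time.time() * 1000)},
    }
    return {
        "entityType": "dataset",
        "entityUrn": entity_urn,
        "changeType": "UPSERT",
        "aspectName": "ownership",
        "aspect": {
            "value": json.dumps(ownership_aspect),
            "contentType": "application/json",
        },
    }


mcp = make_ownership_mcp(
    "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)",
    "urn:li:corpuser:datahub",
)
```

In practice you would emit this through the DataHub Python API or a Kafka producer writing to `MetadataChangeProposal_v1`, rather than hand-assembling dicts.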
### Examples

An MCP representing a request to update the 'ownership' aspect for a particular Dataset:

```json
{
  "entityType": "dataset",
  "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)",
  "changeType": "UPSERT",
  "aspectName": "ownership",
  "aspect": {
    "value": "{\"owners\":[{\"type\":\"DATAOWNER\",\"owner\":\"urn:li:corpuser:datahub\"}],\"lastModified\":{\"actor\":\"urn:li:corpuser:datahub\",\"time\":1651516640488}}",
    "contentType": "application/json"
  },
  "systemMetadata": {
    "lastObserved": 1651516640493,
    "runId": "no-run-id-provided",
    "registryName": "unknownRegistry",
    "registryVersion": "0.0.0.0-dev",
    "properties": null
  }
}
```

Note how the aspect payload is serialized as JSON inside the "value" field. The exact structure of the aspect is determined by its PDL schema. (For example, the [ownership](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl) schema)

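Going the other direction, a consumer has to deserialize that nested string before it can inspect the aspect. A small sketch (plain Python, stdlib only; the `mcp` dict below simply mirrors the example payload):

```python
import json

# An MCP as it might arrive off the Kafka topic.
mcp = {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)",
    "changeType": "UPSERT",
    "aspectName": "ownership",
    "aspect": {
        "value": '{"owners":[{"type":"DATAOWNER","owner":"urn:li:corpuser:datahub"}],"lastModified":{"actor":"urn:li:corpuser:datahub","time":1651516640488}}',
        "contentType": "application/json",
    },
}

# The aspect is double-encoded: JSON text stored inside a JSON field.
assert mcp["aspect"]["contentType"] == "application/json"
ownership = json.loads(mcp["aspect"]["value"])
owners = [o["owner"] for o in ownership["owners"]]
```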
## Metadata Change Log (MCL)

A Metadata Change Log represents _any_ change which has been made to the Metadata Graph. Metadata Change Log events are emitted to Kafka immediately after writing the change to durable storage.

There are 2 flavors of Metadata Change Log: _versioned_ and _timeseries_. These correspond to the type of aspects which were updated for a given change. **Versioned** aspects are those which represent the "latest" state of some attributes, for example the most recent owners of an asset or its documentation. **Timeseries** aspects are those which represent events related to an asset that occurred at a particular time, for example profiling of a Dataset.

### Emission

MCLs are emitted when _any_ change is made to an entity on the DataHub Metadata Graph; this includes writing to any aspect of an entity.

Two distinct topics are maintained for Metadata Change Log. The default Kafka topic name for **versioned** aspects is `MetadataChangeLog_Versioned_v1` and for **timeseries** aspects is `MetadataChangeLog_Timeseries_v1`.

### Consumption

DataHub ships with a Kafka Consumer Job (mae-consumer-job) which listens for MCLs and uses them to update DataHub's search and graph indices, as well as to generate derived Platform Events (described below).

In addition, the [Actions Framework](../actions/README.md) consumes Metadata Change Logs to power its [Metadata Change Log](../actions/events/metadata-change-log-event.md) event API.

### Schema

| Name                            | Type   | Description                                                                                                                                                                                                                                         | Optional |
| ------------------------------- | ------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------- |
| entityUrn                       | String | The unique identifier for the Entity being changed. For example, a Dataset's urn.                                                                                                                                                                   | False    |
| entityType                      | String | The type of the entity the new aspect is associated with. This corresponds to the entity name in the DataHub Entity Registry, for example 'dataset'.                                                                                                | False    |
| entityKeyAspect                 | Object | The key struct of the entity that was changed. Only present if the Metadata Change Proposal contained the raw key struct.                                                                                                                           | True     |
| changeType                      | String | The change type. CREATE, UPSERT and DELETE are currently supported.                                                                                                                                                                                 | False    |
| aspectName                      | String | The entity aspect which was changed.                                                                                                                                                                                                                | False    |
| aspect                          | Object | The new aspect value. Null if the aspect was deleted.                                                                                                                                                                                               | True     |
| aspect.contentType              | String | The serialization type of the aspect itself. The only supported value is `application/json`.                                                                                                                                                        | False    |
| aspect.value                    | String | The serialized aspect. This is a JSON-serialized representation of the aspect document originally defined in PDL. See https://github.com/datahub-project/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin for more.                | False    |
| previousAspectValue             | Object | The previous aspect value. Null if the aspect did not exist previously.                                                                                                                                                                             | True     |
| previousAspectValue.contentType | String | The serialization type of the aspect itself. The only supported value is `application/json`.                                                                                                                                                        | False    |
| previousAspectValue.value       | String | The serialized aspect. This is a JSON-serialized representation of the aspect document originally defined in PDL. See https://github.com/datahub-project/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin for more.                | False    |
| systemMetadata                  | Object | The new system metadata. This includes the ingestion run-id, model registry and more. For the full structure, see https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/SystemMetadata.pdl        | True     |
| previousSystemMetadata          | Object | The previous system metadata. This includes the ingestion run-id, model registry and more. For the full structure, see https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/SystemMetadata.pdl   | True     |
| created                         | Object | Audit stamp about who triggered the Metadata Change and when.                                                                                                                                                                                       | False    |
| created.time                    | Number | The timestamp in milliseconds when the aspect change occurred.                                                                                                                                                                                      | False    |
| created.actor                   | String | The URN of the actor (e.g. corpuser) that triggered the change.                                                                                                                                                                                     | False    |

The PDL schema can be found [here](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/MetadataChangeLog.pdl).

### Examples

An MCL corresponding to a change in the 'ownership' aspect for a particular Dataset:

```json
{
  "entityType": "dataset",
  "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)",
  "changeType": "UPSERT",
  "aspectName": "ownership",
  "aspect": {
    "value": "{\"owners\":[{\"type\":\"DATAOWNER\",\"owner\":\"urn:li:corpuser:datahub\"}],\"lastModified\":{\"actor\":\"urn:li:corpuser:datahub\",\"time\":1651516640488}}",
    "contentType": "application/json"
  },
  "previousAspectValue": {
    "value": "{\"owners\":[{\"owner\":\"urn:li:corpuser:jdoe\",\"type\":\"DATAOWNER\"},{\"owner\":\"urn:li:corpuser:datahub\",\"type\":\"DATAOWNER\"}],\"lastModified\":{\"actor\":\"urn:li:corpuser:jdoe\",\"time\":1581407189000}}",
    "contentType": "application/json"
  },
  "systemMetadata": {
    "lastObserved": 1651516640493,
    "runId": "no-run-id-provided",
    "registryName": "unknownRegistry",
    "registryVersion": "0.0.0.0-dev",
    "properties": null
  },
  "previousSystemMetadata": {
    "lastObserved": 1651516415088,
    "runId": "file-2022_05_02-11_33_35",
    "registryName": null,
    "registryVersion": null,
    "properties": null
  },
  "created": {
    "time": 1651516640490,
    "actor": "urn:li:corpuser:datahub",
    "impersonator": null
  }
}
```

Note how the aspect payload is serialized as JSON inside the "value" field. The exact structure of the aspect is determined by its PDL schema. (For example, the [ownership](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl) schema)

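Because an MCL carries both the new and previous aspect values, a consumer can compute a semantic diff of what actually changed. A sketch (plain Python, stdlib only; the serialized payloads mirror the example above):

```python
import json

# New and previous 'ownership' aspect values, as found in an MCL's
# "aspect.value" and "previousAspectValue.value" fields.
new_value = '{"owners":[{"type":"DATAOWNER","owner":"urn:li:corpuser:datahub"}],"lastModified":{"actor":"urn:li:corpuser:datahub","time":1651516640488}}'
prev_value = '{"owners":[{"owner":"urn:li:corpuser:jdoe","type":"DATAOWNER"},{"owner":"urn:li:corpuser:datahub","type":"DATAOWNER"}],"lastModified":{"actor":"urn:li:corpuser:jdoe","time":1581407189000}}'


def owner_urns(serialized_aspect: str) -> set:
    """Extract the set of owner URNs from a serialized ownership aspect."""
    return {o["owner"] for o in json.loads(serialized_aspect)["owners"]}


added = owner_urns(new_value) - owner_urns(prev_value)
removed = owner_urns(prev_value) - owner_urns(new_value)
```

This kind of before/after comparison is what allows derived events like the Entity Change Event (described below) to report semantic additions and removals.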
## Platform Event (PE)

A Platform Event represents an arbitrary business-logic event emitted by DataHub. Each Platform Event has a `name` which determines its contents.

### Types

- **Entity Change Event** (entityChangeEvent): The most important Platform Event is named **Entity Change Event**, and represents a log of semantic changes (tag addition, removal, deprecation change, etc) that have occurred on DataHub. It is used as an important component of the DataHub Actions Framework.

All registered Platform Event types are declared inside the DataHub Entity Registry (`entity-registry.yml`).

### Emission

All Platform Events are generated by DataHub itself during normal operation.

PEs are extremely dynamic - they can contain arbitrary payloads depending on the `name`. Thus, they can be emitted in a variety of circumstances.

The default Kafka topic name for all Platform Events is `PlatformEvent_v1`.

### Consumption

The [Actions Framework](../actions/README.md) consumes Platform Events to power its [Entity Change Event](../actions/events/entity-change-event.md) API.

### Schema

| Name                   | Type   | Description                                                                                                                                                                                                                           | Optional |
| ---------------------- | ------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------- |
| header                 | Object | Header fields                                                                                                                                                                                                                         | False    |
| header.timestampMillis | Long   | The time at which the event was generated.                                                                                                                                                                                            | False    |
| name                   | String | The name / type of the event.                                                                                                                                                                                                         | False    |
| payload                | Object | The event itself.                                                                                                                                                                                                                     | False    |
| payload.contentType    | String | The serialization type of the event payload. The only supported value is `application/json`.                                                                                                                                          | False    |
| payload.value          | String | The serialized payload. This is a JSON-serialized representation of the payload document originally defined in PDL. See https://github.com/datahub-project/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin for more. | False    |

The full PDL schema can be found [here](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/PlatformEvent.pdl).

### Examples

An example of an 'Entity Change Event' Platform Event that is emitted when a new owner is added to a Dataset:

```json
{
  "header": {
    "timestampMillis": 1655390732551
  },
  "name": "entityChangeEvent",
  "payload": {
    "value": "{\"entityUrn\":\"urn:li:dataset:abc\",\"entityType\":\"dataset\",\"category\":\"OWNER\",\"operation\":\"ADD\",\"modifier\":\"urn:li:corpuser:jdoe\",\"parameters\":{\"ownerUrn\":\"urn:li:corpuser:jdoe\",\"ownerType\":\"BUSINESS_OWNER\"},\"auditStamp\":{\"actor\":\"urn:li:corpuser:jdoe\",\"time\":1649953100653}}",
    "contentType": "application/json"
  }
}
```

Note how the actual payload for the event is serialized as JSON inside the 'payload' field. The exact structure of the Platform Event is determined by its PDL schema. (For example, the [Entity Change Event](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/platform/event/v1/EntityChangeEvent.pdl) schema)

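As with MCPs, the payload is double-encoded, so a consumer typically dispatches on `name` and then deserializes `payload.value`. A sketch (plain Python, stdlib only; the event dict mirrors a trimmed version of the example above):

```python
import json

# A Platform Event as it might arrive off PlatformEvent_v1.
event = {
    "header": {"timestampMillis": 1655390732551},
    "name": "entityChangeEvent",
    "payload": {
        "value": json.dumps({
            "entityUrn": "urn:li:dataset:abc",
            "entityType": "dataset",
            "category": "OWNER",
            "operation": "ADD",
            "modifier": "urn:li:corpuser:jdoe",
        }),
        "contentType": "application/json",
    },
}

# Dispatch on the event name, then decode the nested payload.
if event["name"] == "entityChangeEvent":
    change = json.loads(event["payload"]["value"])
    summary = f'{change["operation"]} {change["category"]} on {change["entityUrn"]}'
```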
## Failed Metadata Change Proposal (FMCP)

When a Metadata Change Proposal cannot be processed successfully, the event is written to a [dead letter queue](https://en.wikipedia.org/wiki/Dead_letter_queue) in an event called Failed Metadata Change Proposal (FMCP).

The event simply wraps the original Metadata Change Proposal and an error message, which contains the reason for rejection. This event can be used for debugging any potential ingestion issues, as well as for re-playing any previously rejected proposal if necessary.

### Emission

FMCPs are emitted when MCPs cannot be successfully committed to DataHub's storage layer.

The default Kafka topic name for FMCPs is `FailedMetadataChangeProposal_v1`.

### Consumption

No active consumers.

### Schema

The PDL schema can be found [here](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/FailedMetadataChangeProposal.pdl).

# Deprecated Events

DataHub ships with a set of deprecated events, which were historically used for proposing and logging changes to the Metadata Graph.

Each event in this category was deprecated due to its inflexibility - namely, the fact that the schemas had to be updated when a new aspect was introduced. These events have since been replaced by the more flexible events described above (Metadata Change Proposal, Metadata Change Log).

It is not recommended to build dependencies on deprecated events.

## Metadata Change Event (MCE)

A Metadata Change Event represents a request to change multiple aspects for the same entity. It leverages the deprecated concept of a `Snapshot`, which is a strongly-typed list of aspects for the same entity.

An MCE is a "proposal" for a set of metadata changes, as opposed to an [MAE](#metadata-audit-event-mae), which conveys a committed change. Consequently, only successfully accepted and processed MCEs will lead to the emission of a corresponding MAE / MCLs.

### Emission

MCEs may be emitted by clients of DataHub's low-level ingestion APIs (e.g. ingestion sources) during the process of metadata ingestion.

The default Kafka topic name for MCEs is `MetadataChangeEvent_v4`.

### Consumption

DataHub's storage layer actively listens for new Metadata Change Events and attempts to apply the requested changes to the Metadata Graph.

### Schema

The PDL schema can be found [here](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/MetadataChangeEvent.pdl).

### Examples

An example of an MCE emitted to change the 'ownership' aspect for an Entity:

```json
{
  "proposedSnapshot": {
    "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)",
      "aspects": [
        {
          "com.linkedin.pegasus2avro.common.Ownership": {
            "owners": [
              {
                "owner": "urn:li:corpuser:jdoe",
                "type": "DATAOWNER",
                "source": null
              },
              {
                "owner": "urn:li:corpuser:datahub",
                "type": "DATAOWNER",
                "source": null
              }
            ],
            "lastModified": {
              "time": 1581407189000,
              "actor": "urn:li:corpuser:jdoe",
              "impersonator": null
            }
          }
        }
      ]
    }
  }
}
```

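In a snapshot, each aspect entry is keyed by its record's fully-qualified type name, so reading aspects back out means walking this nested structure. A sketch (plain Python, stdlib only; the `mce` dict mirrors a trimmed version of the example above):

```python
# Extract owner URNs from an MCE's proposed snapshot. Note how
# fully-qualified record names act as keys at each level of nesting.
mce = {
    "proposedSnapshot": {
        "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
            "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)",
            "aspects": [
                {
                    "com.linkedin.pegasus2avro.common.Ownership": {
                        "owners": [
                            {"owner": "urn:li:corpuser:jdoe", "type": "DATAOWNER"},
                            {"owner": "urn:li:corpuser:datahub", "type": "DATAOWNER"},
                        ]
                    }
                }
            ],
        }
    }
}

snapshot = mce["proposedSnapshot"][
    "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot"
]
owners = []
for aspect in snapshot["aspects"]:
    # Each aspect entry is keyed by its record's fully-qualified name.
    ownership = aspect.get("com.linkedin.pegasus2avro.common.Ownership")
    if ownership:
        owners = [o["owner"] for o in ownership["owners"]]
```

This per-entity, strongly-typed key structure is exactly the inflexibility that motivated the move to MCP/MCL, where the aspect name is just a string field.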
## Metadata Audit Event (MAE)

A Metadata Audit Event captures changes made to one or multiple metadata [aspects](aspect.md) associated with a particular [entity](entity.md), in the form of a metadata [snapshot](snapshot.md) (deprecated) before the change, and a metadata snapshot after the change.

Every source-of-truth for a particular metadata aspect is expected to emit an MAE whenever a change is committed to that aspect. By ensuring that, any listener of MAEs will be able to construct a complete view of the latest state for all aspects. Furthermore, because each MAE contains the "after image", any mistake made in emitting an MAE can be easily mitigated by emitting a follow-up MAE with the correction. By the same token, the initial bootstrap problem for any newly added entity can also be solved by emitting an MAE containing all the latest metadata aspects associated with that entity.

### Emission

> Note: In recent versions of DataHub (mid 2022), MAEs are no longer actively emitted, and will soon be completely removed from DataHub.
> Use Metadata Change Log instead.

MAEs are emitted once any metadata change has been successfully committed into DataHub's storage layer.

The default Kafka topic name for MAEs is `MetadataAuditEvent_v4`.

### Consumption

No active consumers.

### Schema

The PDL schema can be found [here](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/MetadataAuditEvent.pdl).

### Examples

An example of an MAE emitted for a change made to the 'ownership' aspect of an Entity (owner removed):

```json
{
  "oldSnapshot": {
    "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)",
      "aspects": [
        {
          "com.linkedin.pegasus2avro.common.Ownership": {
            "owners": [
              {
                "owner": "urn:li:corpuser:jdoe",
                "type": "DATAOWNER",
                "source": null
              },
              {
                "owner": "urn:li:corpuser:datahub",
                "type": "DATAOWNER",
                "source": null
              }
            ],
            "lastModified": {
              "time": 1581407189000,
              "actor": "urn:li:corpuser:jdoe",
              "impersonator": null
            }
          }
        }
      ]
    }
  },
  "newSnapshot": {
    "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)",
      "aspects": [
        {
          "com.linkedin.pegasus2avro.common.Ownership": {
            "owners": [
              {
                "owner": "urn:li:corpuser:datahub",
                "type": "DATAOWNER",
                "source": null
              }
            ],
            "lastModified": {
              "time": 1581407189000,
              "actor": "urn:li:corpuser:jdoe",
              "impersonator": null
            }
          }
        }
      ]
    }
  }
}
```

## Failed Metadata Change Event (FMCE)

When a Metadata Change Event cannot be processed successfully, the event is written to a [dead letter queue](https://en.wikipedia.org/wiki/Dead_letter_queue) in an event called Failed Metadata Change Event (FMCE).

The event simply wraps the original Metadata Change Event and an error message, which contains the reason for rejection. This event can be used for debugging any potential ingestion issues, as well as for re-playing any previously rejected proposal if necessary.

### Emission

FMCEs are emitted when MCEs cannot be successfully committed to DataHub's storage layer.

### Consumption

No active consumers.

### Schema

The PDL schema can be found [here](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/mxe/FailedMetadataChangeEvent.pdl).

The default Kafka topic name for FMCEs is `FailedMetadataChangeEvent_v4`.