mirror of
https://github.com/datahub-project/datahub.git
synced 2025-12-27 18:07:57 +00:00
docs(metadata-model): Adding to Metadata model docs (#3998)
* Adding better metadata model docs * Final changes
This commit is contained in:
parent
9a9f448ebf
commit
ca2d625529
@ -10,10 +10,18 @@ DataHub takes a schema-first approach to modeling metadata. We use the open-sour
|
||||
|
||||
Conceptually, metadata is modeled using the following abstractions
|
||||
|
||||
- **Entities**: An entity is the primary node in the metadata graph. For example, an instance of a Dataset or a CorpUser is an Entity. An entity is made up of a unique identifier (a primary key) and groups of metadata attributes which we call aspects.
|
||||
- **Entities**: An entity is the primary node in the metadata graph. For example, an instance of a Dataset or a CorpUser is an Entity. An entity is made up of a type, e.g. 'dataset', a unique identifier (e.g. an 'urn') and groups of metadata attributes (e.g. documents) which we call aspects.
|
||||
|
||||
|
||||
- **Aspects**: An aspect is a collection of attributes that describes a particular facet of an entity. They are the smallest atomic unit of write in DataHub. That is, multiple aspects associated with the same Entity can be updated independently. For example, DatasetProperties contains a collection of attributes that describes a Dataset. Aspects can be shared across entities, for example "Ownership" is an aspect that is re-used across all the Entities that have owners.
|
||||
- **Aspects**: An aspect is a collection of attributes that describes a particular facet of an entity. They are the smallest atomic unit of write in DataHub. That is, multiple aspects associated with the same Entity can be updated independently. For example, DatasetProperties contains a collection of attributes that describes a Dataset. Aspects can be shared across entities, for example "Ownership" is an aspect that is re-used across all the Entities that have owners. Common aspects include
|
||||
|
||||
- [ownership](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl): Captures the users and groups who own an Entity.
|
||||
- [globalTags](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/GlobalTags.pdl): Captures references to the Tags associated with an Entity.
|
||||
- [glossaryTerms](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/GlossaryTerms.pdl): Captures references to the Glossary Terms associated with an Entity.
|
||||
- [institutionalMemory](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/InstitutionalMemory.pdl): Captures internal company Documents associated with an Entity (e.g. links!)
|
||||
- [status](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Status.pdl): Captures the "deletion" status of an Entity, i.e. whether it should be soft-deleted.
|
||||
- [subTypes](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/SubTypes.pdl): Captures one or more "sub types" of a more generic Entity type. An example can be a "Looker Explore" Dataset, a "View" Dataset. Specific sub types can imply that certain additional aspects are present for a given Entity.
|
||||
|
||||
|
||||
- **Relationships**: A relationship represents a named edge between 2 entities. They are declared via foreign key attributes within Aspects along with a custom annotation (@Relationship). Relationships permit edges to be traversed bi-directionally. For example, a Chart may refer to a CorpUser as its owner via a relationship named "OwnedBy". This edge would be walkable starting from the Chart *or* the CorpUser instance.
|
||||
|
||||
@ -24,6 +32,43 @@ Here is an example graph consisting of 3 types of entity (CorpUser, Chart, Dashb
|
||||
|
||||

|
||||
|
||||
## The Core Entities
|
||||
|
||||
DataHub's "core" Entity types model the Data Assets that comprise the Modern Data Stack. They include
|
||||
|
||||
1. **Data Platform**: A type of Data "Platform". That is, an external system that is involved in processing, storing, or visualizing Data Assets. Examples include MySQL, Snowflake, Redshift, and S3.
|
||||
2. **Dataset**: A collection of data. Tables, Views, Streams, Document Collections, and Files are all modeled as "Datasets" on DataHub. Datasets can have tags, owners, links, glossary terms, and descriptions attached to them. They can also have specific sub-types, such as "View", "Collection", "Stream", "Explore", and more. Examples include Postgres Tables, MongoDB Collections, or S3 files.
|
||||
3. **Chart**: A single data vizualization derived from a Dataset. A single Chart can be a part of multiple Dashboards. Charts can have tags, owners, links, glossary terms, and descriptions attached to them. Examples include a Superset or Looker Chart.
|
||||
4. **Dashboard**: A collection of Charts for visualization. Dashboards can have tags, owners, links, glossary terms, and descriptions attached to them. Examples include a Superset or Mode Dashboard.
|
||||
5. **Data Job** (Task): An executable job that processes data assets, where "processing" implies consuming data, producing data, or both. Data Jobs can have tags, owners, links, glossary terms, and descriptions attached to them. They must belong to a single Data Flow. Examples include an Airflow Task.
|
||||
6. **Data Flow** (Pipeline): An executable collection of Data Jobs with dependencies among them, or a DAG. Data Jobs can have tags, owners, links, glossary terms, and descriptions attached to them. Examples include an Airflow DAG.
|
||||
|
||||
## The Entity Registry
|
||||
|
||||
Where are Entities and their aspects defined in DataHub? Where doe the Metadata Model "live"? The Metadata Model is "stitched together" by means
|
||||
of an **Entity Registry**, a catalog of Entities that comprise the Metadata Graph along with the aspects associated with each. Put
|
||||
simply, this is where the "schema" of the model is represented.
|
||||
|
||||
Traditionally, the Entity Registry was constructed using [Snapshot](https://github.com/linkedin/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot) models, which Data Models (schemas) that explicitly tied
|
||||
an Entity to the Aspects associated with it. An example is [DatasetSnapshot](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/DatasetSnapshot.pdl), which declares the core `Dataset` Entity.
|
||||
The relationship particular aspects of a Dataset was captured via a "union" where the list of possible aspects was modeled. An example is
|
||||
[DatasetAspect](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/metadata/aspect/DatasetAspect.pdl).
|
||||
This file associates dataset-specific aspects, such as [DatasetProperties](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/dataset/DatasetProperties.pdl), as well as common aspects like [Ownership](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl),
|
||||
[InstitutionalMemory](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/InstitutionalMemory.pdl),
|
||||
and [Status](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Status.pdl),
|
||||
to the Dataset Entity. Similar Snapshots
|
||||
|
||||
As of January 2022, DataHub has deprecated support for Snapshot models as a means of adding new entities. Instead,
|
||||
the Entity Registry is defined inside a YAML configuration file called [entity-registry.yml](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/resources/entity-registry.yml),
|
||||
which is provided to DataHub's Metadata Service at start up. This file declares Entities and Aspects by referring to their [names](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl#L7).
|
||||
At boot time, DataHub validates the structure of the registry file and ensures that it can find PDL schemas associated with
|
||||
each aspect name provided by configuration (via the [@Aspect](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl#L6) annotation).
|
||||
|
||||
By moving to this format, evolving the Metadata Model becomes much easier. Adding Entities & Aspects becomes a matter of adding a
|
||||
to the YAML configuration, instead of creating new Snapshot / Aspect files.
|
||||
|
||||
> New to [PDL](https://linkedin.github.io/rest.li/pdl_schema) files? Don't fret. They are just a way to define a JSON document "schema" for Aspects in DataHub. All Data ingested to DataHub's Metadata Service is validated against a PDL schema, with each @Aspect corresponding to a single schema. Structurally, PDL is quite similar to [Protobuf](https://developers.google.com/protocol-buffers) and conveniently maps to JSON.
|
||||
|
||||
## Exploring DataHub's Metadata Model
|
||||
|
||||
To explore the current DataHub metadata model, you can inspect this high-level picture that shows the different entities and edges between them showing the relationships between them.
|
||||
@ -205,11 +250,10 @@ chart.
|
||||
|
||||
### Special Aspects
|
||||
|
||||
There are 3 "special" aspects worth mentioning:
|
||||
There are a few special aspects worth mentioning:
|
||||
|
||||
1. Key aspects
|
||||
2. BrowsePaths aspect
|
||||
3. Timeseries aspects
|
||||
1. Key aspects: Contain the properties that uniquely identify an Entity.
|
||||
2. Browse Paths aspect: Represents a hierarchical path associated with an Entity.
|
||||
|
||||
#### Key aspects
|
||||
|
||||
@ -231,7 +275,7 @@ urn:li:<entity-name>:key-field-1
|
||||
urn:li:<entity-name>:(key-field-1, key-field-2, ... key-field-n)
|
||||
```
|
||||
|
||||
By convention, key aspects are defined under `metadata-models/src/main/pegasus/com/linkedin/metadata/key`.
|
||||
By convention, key aspects are defined under [metadata-models/src/main/pegasus/com/linkedin/metadata/key](https://github.com/linkedin/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin/metadata/key).
|
||||
|
||||
##### Example
|
||||
|
||||
@ -305,19 +349,17 @@ urn:li:corpuser:johnsmith
|
||||
The BrowsePaths aspect allows you to define a custom "browse path" for an Entity. A browse path is a way to hierarchically organize
|
||||
entities. They manifest within the "Explore" features on the UI, allowing users to navigate through trees of related entities of a given type.
|
||||
|
||||
To support browsing a particular entity, simply include the "BrowsePaths" aspect in its aspect union:
|
||||
To support browsing a particular entity, add the "browsePaths" aspect to the entity in your `entity-registry.yml` file.
|
||||
|
||||
```aidl
|
||||
// DatasetAspect.pdl
|
||||
|
||||
/**
|
||||
* A union of all supported metadata aspects for a Dataset
|
||||
*/
|
||||
typeref DatasetAspect = union[
|
||||
DatasetKey,
|
||||
...
|
||||
BrowsePaths
|
||||
]
|
||||
/// entity-registry.yml
|
||||
entities:
|
||||
- name: dataset
|
||||
doc: Datasets represent logical or physical data assets stored or represented in various data platforms. Tables, Views, Streams are all instances of datasets.
|
||||
keyAspect: datasetKey
|
||||
aspects:
|
||||
...
|
||||
- browsePaths
|
||||
```
|
||||
|
||||
By declaring this aspect, you can produce custom browse paths as well as query for browse paths manually using a CURL like the following:
|
||||
@ -338,7 +380,27 @@ Please note you must provide:
|
||||
- The "/"-delimited root path for which to fetch results.
|
||||
- An entity "type" using its common name ("dataset" in the example above).
|
||||
|
||||
#### Timeseries aspects
|
||||
### Types of Aspect
|
||||
|
||||
There are 2 "types" of Metadata Aspects. Both are modeled using PDL schemas, and both can be ingested in the same way.
|
||||
However, they differ in what they represent and how they are handled by DataHub's Metadata Service.
|
||||
|
||||
#### 1. Versioned Aspects
|
||||
|
||||
Versioned Aspects each have a **numeric version** associated with them. When a field in an aspect changes, a new
|
||||
version is automatically created and stored within DataHub's backend. In practice, all versioned aspects are stored inside a relational database
|
||||
that can be backed up and restored. Versioned aspects power much of the UI experience you're used to, including Ownership, Descriptions,
|
||||
Tags, Glossary Terms, and more. Examples include Ownership, Global Tags, and Glossary Terms.
|
||||
|
||||
#### 2. Timeseries Aspects
|
||||
|
||||
Timeseries Aspects each have a **timestamp** associated with them. They are useful for representing
|
||||
time-ordered events about an Entity. For example, the results of profiling a Dataset, or a set of Data Quality checks that
|
||||
run every day. It is important to note that Timeseries aspects are NOT persisted inside the relational store, and are instead
|
||||
persisted only in the search index (e.g. elasticsearch) and the message queue (Kafka). This makes restoring timeseries aspects
|
||||
in a disaster scenario a bit more challenge. Timeseries aspects can be queried by time range, which is what makes them most different from Versioned Aspects.
|
||||
A timeseries aspect can be identified by the "timeseries" [type](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/dataset/DatasetProfile.pdl#L10) in its [@Aspect](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/dataset/DatasetProfile.pdl#L8) annotation.
|
||||
Examples include [DatasetProfile](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/dataset/DatasetProfile.pdl) & [DatasetUsageStatistics](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/dataset/DatasetUsageStatistics.pdl).
|
||||
|
||||
Timeseries aspects are aspects that have a timestampMillis field, and are meant for aspects that continuously change on a
|
||||
timely basis e.g. data profiles, usage statistics, etc.
|
||||
@ -359,12 +421,12 @@ instead of being stored in local DB).
|
||||
|
||||
You can retrieve timeseries aspects using the "aspects?action=getTimeseriesAspectValues" end point.
|
||||
|
||||
#### Aggregatable Timeseries aspects
|
||||
##### Aggregatable Timeseries aspects
|
||||
Being able to perform SQL like *group by + aggregate* operations on the timeseries aspects is a very natural use-case for
|
||||
this kind of data (dataset profiles, usage statistics etc.). This section describes how to define, ingest and perform an
|
||||
aggregation query against a timeseries aspect.
|
||||
|
||||
##### Defining a new aggregatable Timeseries aspect.
|
||||
###### Defining a new aggregatable Timeseries aspect.
|
||||
|
||||
The *@TimeseriesField* and the *@TimeseriesFieldCollection* are two new annotations that can be attached to a field of
|
||||
a *Timeseries aspect* that allows it to be part of an aggregatable query. The kinds of aggregations allowed on these
|
||||
@ -392,7 +454,7 @@ entities:
|
||||
- datasetUsageStatistics
|
||||
```
|
||||
|
||||
##### Ingesting a Timeseries aspect
|
||||
###### Ingesting a Timeseries aspect
|
||||
The timeseries aspects can be ingested via the GMS REST endpoint `/aspects?action=ingestProposal` or via the python API.
|
||||
|
||||
Example1: Via GMS REST API using curl.
|
||||
@ -449,7 +511,7 @@ my_emitter = DatahubKafkaEmitter("""<config>""")
|
||||
my_emitter.emit(mcpw)
|
||||
```
|
||||
|
||||
##### Performing an aggregation on a Timeseries aspect.
|
||||
###### Performing an aggregation on a Timeseries aspect.
|
||||
|
||||
Aggreations on timeseries aspects can be performed by the GMS REST API for `/analytics?action=getTimeseriesStats` which
|
||||
accepts the following params.
|
||||
|
||||
Loading…
x
Reference in New Issue
Block a user