Documentation update part-1

This commit is contained in:
Kerem Sahin 2019-12-18 18:57:18 -08:00
parent 1b46145b3a
commit 165d4aef95
39 changed files with 610 additions and 89 deletions


@ -2,7 +2,7 @@
[![Build Status](https://travis-ci.org/linkedin/WhereHows.svg?branch=datahub)](https://travis-ci.org/linkedin/WhereHows) [![Build Status](https://travis-ci.org/linkedin/WhereHows.svg?branch=datahub)](https://travis-ci.org/linkedin/WhereHows)
[![Gitter](https://img.shields.io/gitter/room/nwjs/nw.js.svg)](https://gitter.im/linkedin/datahub) [![Gitter](https://img.shields.io/gitter/room/nwjs/nw.js.svg)](https://gitter.im/linkedin/datahub)
![Data Hub](docs/imgs/datahublogo.png) ![DataHub](docs/imgs/datahub-logo.png)
## Introduction ## Introduction
DataHub is Linkedin's generalized metadata search & discovery tool. To learn more about DataHub, check out our DataHub is Linkedin's generalized metadata search & discovery tool. To learn more about DataHub, check out our


@ -0,0 +1,11 @@
# DataHub Architecture
![datahub-architecture](../imgs/datahub-architecture.png)
## Metadata Serving
Refer to [metadata-serving](metadata-serving.md).
## Metadata Ingestion
Refer to [metadata-ingestion](metadata-ingestion.md).
## What is Generalized Metadata Architecture (GMA)?
Refer to [GMA](../what/gma.md).


@ -0,0 +1,8 @@
# How to model metadata for GMA?
GMA uses [rest.li](https://rest.li), which is LinkedIn's open source REST framework.
All metadata in GMA needs to be modelled using [Pegasus schema (PDSC)](https://linkedin.github.io/rest.li/DATA-Data-Schema-and-Templates), which is the data schema for [rest.li](https://rest.li).
Conceptually, we're modelling metadata as a hybrid graph of nodes ([entities](../what/entity.md)) and edges ([relationships](../what/relationship.md)), with additional documents ([metadata aspects](../what/aspect.md)) attached to each node.
Below is an example graph consisting of 3 types of entities (User, Group, Dataset), 3 types of relationships (OwnedBy, HasAdmin, HasMember), and 3 types of metadata aspects (Ownership, Profile, and Membership).
![metadata-modeling](../imgs/metadata-modeling.png)
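The hybrid graph can also be pictured as plain data structures. The Python sketch below is illustrative only: the URNs (including the simplified dataset URN), the aspect contents, and the in-memory layout are assumptions made for the example, not GMA's storage format.

```python
# Illustrative only: URNs, aspect contents, and layout are assumptions for this example.

# Nodes: entity instances keyed by URN, each with metadata aspects attached.
nodes = {
    "urn:li:corpuser:alice": {
        "type": "User",
        "aspects": {"Profile": {"title": "Software Engineer"}},
    },
    "urn:li:corpgroup:datahub-dev": {
        "type": "Group",
        "aspects": {
            "Membership": {
                "admin": "urn:li:corpuser:alice",
                "members": ["urn:li:corpuser:alice"],
            }
        },
    },
    "urn:li:dataset:PageViewEvent": {
        "type": "Dataset",
        "aspects": {"Ownership": {"owners": ["urn:li:corpgroup:datahub-dev"]}},
    },
}

# Edges: (source URN, relationship name, destination URN).
edges = [
    ("urn:li:dataset:PageViewEvent", "OwnedBy", "urn:li:corpgroup:datahub-dev"),
    ("urn:li:corpgroup:datahub-dev", "HasAdmin", "urn:li:corpuser:alice"),
    ("urn:li:corpgroup:datahub-dev", "HasMember", "urn:li:corpuser:alice"),
]


def neighbors(source, relationship):
    """Return the destination URNs reachable from `source` via `relationship`."""
    return [dst for src, rel, dst in edges if src == source and rel == relationship]
```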


51
docs/what/aspect.md Normal file

@ -0,0 +1,51 @@
# What is a GMA aspect?
A metadata aspect is a structured document, or more precisely a `record` in [PDSC](https://linkedin.github.io/rest.li/DATA-Data-Schema-and-Templates),
that represents a specific kind of metadata (e.g. ownership, schema, statistics, upstreams).
A metadata aspect on its own has no meaning (e.g. ownership for what?) and must be associated with a particular entity (e.g. ownership for PageViewEvent).
We purposely do not impose any model requirements on metadata aspects, as each aspect is expected to differ significantly.
Metadata aspects are immutable by design, i.e. every change to a particular aspect results in a new version being created.
An optional retention policy can be applied so that only the X most recent versions are retained after each update.
Setting X to 1 effectively means the metadata aspect is non-versioned.
It is also possible to apply retention based on time, e.g. keeping only the metadata changes from the past 30 days.
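A minimal sketch of versioning with count-based retention, in Python. `VersionedAspectStore` and its methods are hypothetical names for illustration, not a GMA API, and time-based retention is left out.

```python
from collections import defaultdict


class VersionedAspectStore:
    """Toy in-memory store (hypothetical, not a GMA API)."""

    def __init__(self, max_versions=None):
        if max_versions is not None and max_versions < 1:
            raise ValueError("max_versions must be at least 1")
        self.max_versions = max_versions  # None means keep every version
        self._versions = defaultdict(list)  # (urn, aspect name) -> versions, oldest first

    def update(self, urn, aspect_name, value):
        """Append a new version, then drop versions beyond the retention limit."""
        history = self._versions[(urn, aspect_name)]
        history.append(dict(value))  # copy so later caller mutations cannot alter history
        if self.max_versions is not None:
            del history[:-self.max_versions]

    def latest(self, urn, aspect_name):
        return self._versions[(urn, aspect_name)][-1]

    def history(self, urn, aspect_name):
        return list(self._versions[(urn, aspect_name)])
```

With `max_versions=1` the store keeps only the latest value, which corresponds to the non-versioned case described above.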
While a metadata aspect can be an arbitrarily complex document with multiple levels of nesting, it is sometimes desirable to break a monolithic aspect into smaller independent aspects.
Doing so provides the following benefits:
1. **Faster read/write**: As metadata aspects are immutable, every `update` writes the entire aspect back to the underlying data store.
Likewise, readers need to retrieve the entire aspect even if they're only interested in a small part of it.
2. **Ability to independently version different aspects**: For example, one may want to get the change history of all the `ownership metadata` independent of the changes made to `schema metadata` for a dataset.
3. **Help with rest.li endpoint modeling**: While it's not required to have a 1:1 mapping between rest.li endpoints and metadata aspects,
it'd follow this pattern naturally, which means one will end up with smaller, more modular endpoints instead of giant ones.
Here's an example metadata aspect. Note that the `admin` and `members` fields implicitly convey a relationship between the `Group` entity and the `User` entity.
It's very natural to save such relationships as URNs in a metadata aspect.
The [relationship](relationship.md) section explains how this relationship can be explicitly extracted and modelled.
```json
{
  "type": "record",
  "name": "Membership",
  "namespace": "com.linkedin.group",
  "doc": "The membership metadata for a group",
  "fields": [
    {
      "name": "auditStamp",
      "type": "com.linkedin.common.AuditStamp",
      "doc": "Audit stamp for the last change"
    },
    {
      "name": "admin",
      "type": "com.linkedin.common.CorpuserUrn",
      "doc": "Admin of the group"
    },
    {
      "name": "members",
      "type": {
        "type": "array",
        "items": "com.linkedin.common.CorpuserUrn"
      },
      "doc": "Members of the group, ordered in descending importance"
    }
  ]
}
```

77
docs/what/delta.md Normal file

@ -0,0 +1,77 @@
# What is Delta in GMA?
Rest.li supports [partial update](https://linkedin.github.io/rest.li/user_guide/restli_server#partial_update) natively without needing explicitly defined models.
However, the granularity of update is always limited to each field in a PDSC model.
There are cases where the update needs to happen at an even finer grain, e.g. adding or removing items from an array.
To this end, we're proposing the following entity-specific metadata delta model that allows atomic partial updates at any desired granularity.
Note that:
1. Just like metadata [aspects](aspect.md), we're not imposing any limit on the partial update model, as long as it's a valid PDSC record.
This is because the rest.li endpoint will have the logic that performs the corresponding partial update based on the information in the model.
That said, it's common to have fields that denote the list of items to be added or removed (e.g. `membersToAdd` & `membersToRemove` from below).
2. Similar to metadata [snapshots](snapshot.md), an entity that supports metadata delta will add an entity-specific metadata delta
(e.g. `GroupDelta` from below) that unions all supported partial update models.
3. The entity-specific metadata delta is then added to the global `Delta` typeref, which is added as part of [Metadata Change Event](mxe.md#metadata-change-event-mce) and used during [Metadata Ingestion](../architecture/metadata-ingestion.md).
```json
{
  "type": "record",
  "name": "MembershipPartialUpdate",
  "namespace": "com.linkedin.group",
  "doc": "A partial update model for the membership of a group.",
  "fields": [
    {
      "name": "membersToAdd",
      "doc": "List of members to be added to the group.",
      "type": {
        "type": "array",
        "items": "com.linkedin.common.CorpuserUrn"
      }
    },
    {
      "name": "membersToRemove",
      "doc": "List of members to be removed from the group.",
      "type": {
        "type": "array",
        "items": "com.linkedin.common.CorpuserUrn"
      }
    }
  ]
}
```
```json
{
  "type": "record",
  "name": "GroupDelta",
  "namespace": "com.linkedin.metadata.delta",
  "doc": "A metadata delta for a specific group entity.",
  "fields": [
    {
      "name": "urn",
      "type": "com.linkedin.common.CorpGroupUrn",
      "doc": "URN for the entity the metadata delta is associated with."
    },
    {
      "name": "delta",
      "doc": "The specific type of metadata delta to apply.",
      "type": [
        "com.linkedin.group.MembershipPartialUpdate"
      ]
    }
  ]
}
```
```json
{
  "type": "typeref",
  "name": "Delta",
  "namespace": "com.linkedin.metadata.delta",
  "doc": "A union of all supported metadata delta types.",
  "ref": [
    "DatasetDelta",
    "GroupDelta"
  ]
}
```
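The text leaves the partial update logic to the rest.li endpoint. As a rough illustration, an endpoint could apply a `MembershipPartialUpdate` to the current `Membership` aspect along these lines. The function name and the exact semantics (removals first, then additions, duplicates skipped) are assumptions made for this sketch, not behavior specified by GMA.

```python
def apply_membership_partial_update(membership, partial_update):
    """Return a new Membership dict with members removed and added.

    Assumed semantics: removals are applied first, additions are appended
    in the order given, and members already present are not duplicated.
    """
    to_remove = set(partial_update.get("membersToRemove", []))
    members = [m for m in membership["members"] if m not in to_remove]
    for member in partial_update.get("membersToAdd", []):
        if member not in members:
            members.append(member)
    # Aspects are immutable, so build a new version instead of mutating the input.
    return {**membership, "members": members}
```

Because both lists arrive in a single record, the add and remove operations can be applied as one atomic change to the aspect.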

100
docs/what/entity.md Normal file

@ -0,0 +1,100 @@
# What is a GMA entity?
An entity is very similar to the concept of a [resource](https://linkedin.github.io/rest.li/user_guide/restli_server#writing-resources) in [rest.li](http://rest.li/).
Generally speaking, an entity should have a defined [URN](urn.md) and a corresponding
[CRUD](https://en.wikipedia.org/wiki/Create,_read,_update_and_delete) API for the metadata associated with a particular instance of the entity.
A particular instance of an entity is essentially a node in the [metadata graph](graph.md).
![metadata-modeling](../imgs/metadata-modeling.png)
In the above example graph, `Dataset`, `User`, and `Group` are entities.
A specific dataset, e.g. `/data/tracking/PageViewEvent`, is an instance of the `Dataset` entity,
much like how the LDAP group `datahub-dev` is an instance of the `Group` entity.
Unlike rest.li, there's no concept of a sub-entity ([sub-resource](https://github.com/linkedin/rest.li/wiki/Rest.li-User-Guide#sub-resources) in rest.li).
In other words, entities are always top-level and non-nesting. Instead, nestedness is modeled using relationships,
e.g. `Contains`, `IsPartOf`, `HasA`, which are covered in the [Relationship](relationship.md) section.
Entities may also contain attributes, which are in the form of key-value pairs.
Each attribute is indexed to support fast attribute-based querying,
e.g. find all the `User`s that have the job title `Software Engineer`.
There may be a size limitation on the value imposed by the underlying indexing system,
but it suffices to assume that the values should be kept relatively small, say less than 1 KB.
The value of each attribute is expected to be derived from either the entity's URN or
from the metadata associated with the entity. Another way to understand the attributes of an entity is to treat them as a complex virtual view over the URN
and metadata with indexing support on each column of the view.
Just like a virtual view where one is not supposed to store data in the view directly,
but to derive it from the underlying tables, the value for the attributes should also be derived.
How the actual derivation happens is covered in the [Metadata Serving](../architecture/architecture.md#metadata-serving) section.
There's no need to explicitly create or destroy entity instances.
An entity instance will be automatically created in the graph whenever a new relationship involving the instance is formed,
or when a new metadata aspect is attached to the instance.
Each entity has a special boolean attribute `removed`, which is used to mark the entity as `soft deleted`,
without destroying existing relationships and attached metadata.
This is useful for quickly reviving an incorrectly deleted entity instance without losing valuable metadata,
e.g. human-authored content.
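The implicit creation and soft-delete behavior described above can be sketched as follows. `MetadataGraph` and its methods are hypothetical names used only for this illustration; they are not part of GMA.

```python
class MetadataGraph:
    """Toy in-memory graph (hypothetical, not a GMA API)."""

    def __init__(self):
        self.entities = {}  # urn -> {"removed": bool, "aspects": {name: value}}
        self.edges = []     # (source urn, relationship name, destination urn)

    def _ensure(self, urn):
        # Entity instances come into existence the first time they are referenced.
        return self.entities.setdefault(urn, {"removed": False, "aspects": {}})

    def attach_aspect(self, urn, name, value):
        self._ensure(urn)["aspects"][name] = value

    def add_relationship(self, source, name, destination):
        self._ensure(source)
        self._ensure(destination)
        self.edges.append((source, name, destination))

    def set_removed(self, urn, removed):
        # Soft delete: only the flag changes; aspects and edges stay in place.
        self._ensure(urn)["removed"] = removed
```

Reviving an entity is then a matter of calling `set_removed(urn, False)`; nothing else needs to be restored.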
An example schema for the `Dataset` entity is shown below. Note that:
1. Each entity is expected to have a `urn` field with an entity-specific URN type.
2. The optional `removed` field is captured in `BaseEntity`, which is expected to be included by all entities.
3. All other fields are expected to be of primitive types or enum only.
While it may be possible to support other complex types, namely array, union, map, and record,
this mostly depends on the underlying indexing system. For simplicity, we only allow numeric or string-like values for now.
4. The `urn` field is non-optional, while all other fields must be optional.
This is to support `partial update` when only a subset of attributes needs to be altered.
```json
{
  "type": "record",
  "name": "BaseEntity",
  "namespace": "com.linkedin.metadata.entity",
  "doc": "Common fields that apply to all entities",
  "fields": [
    {
      "name": "removed",
      "type": "boolean",
      "doc": "Whether the entity has been removed or not",
      "optional": true,
      "default": false
    }
  ]
}
```
```json
{
  "type": "record",
  "name": "DatasetEntity",
  "namespace": "com.linkedin.metadata.entity",
  "doc": "Data model for a dataset entity",
  "include": [
    "BaseEntity"
  ],
  "fields": [
    {
      "name": "urn",
      "type": "com.linkedin.common.DatasetUrn",
      "doc": "Urn of the dataset"
    },
    {
      "name": "name",
      "type": "string",
      "doc": "Dataset native name",
      "optional": true
    },
    {
      "name": "platform",
      "type": "com.linkedin.common.DataPlatformUrn",
      "doc": "Platform urn for the dataset.",
      "optional": true
    },
    {
      "name": "fabric",
      "type": "com.linkedin.common.FabricType",
      "doc": "Fabric type where dataset belongs to.",
      "optional": true
    }
  ]
}
```
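To see why every field except `urn` is optional, consider a partial update that carries only the attributes being changed. The sketch below is a hypothetical helper, not a GMA or rest.li API, and the URN strings used with it are simplified placeholders.

```python
def partial_update(entity, changes):
    """Merge `changes` into `entity`; attributes absent from `changes` are left untouched.

    `urn` is the only required field: it identifies which entity instance to update.
    """
    if changes.get("urn") != entity["urn"]:
        raise ValueError("urn is required and must identify the entity being updated")
    return {**entity, **changes}
```

For example, `partial_update(entity, {"urn": entity["urn"], "name": "NewName"})` changes only `name` and leaves `platform` and `fabric` as they were.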

0
docs/what/gma.md Normal file

0
docs/what/gms.md Normal file

0
docs/what/graph.md Normal file

93
docs/what/mxe.md Normal file

@ -0,0 +1,93 @@
# What is MXE (Metadata Events)?
The models defined in [snapshot](snapshot.md) and [delta](delta.md) are used to build the schema for two metadata Kafka events.
As these events have the prefix `Metadata` and suffix `Event`, they're collectively referred to as MXE.
We also model MXEs using [PDSC](https://linkedin.github.io/rest.li/DATA-Data-Schema-and-Templates) and rely on the [pegasus gradle plugin](https://linkedin.github.io/rest.li/setup/gradle#generateavroschema) to convert them into [AVSC](https://avro.apache.org/docs/current/spec.html).
However, we also need to rename all the namespaces of the generated AVSC to avoid namespace clashes for projects that depend on both the PDSC models and MXEs.
As the AVSC and PDSC model schemas are 100% compatible, it's very easy to convert the in-memory representation from one to the other using [Pegasus DataTranslator](https://linkedin.github.io/rest.li/DATA-Data-Schema-and-Templates#translating-data-to-and-from-avro).
## Metadata Change Event (MCE)
An MCE is a `proposal` for a metadata change, as opposed to an [MAE](#metadata-audit-event-mae), which conveys a committed change.
Consequently, only successfully accepted and processed MCEs will lead to the emission of a corresponding MAE.
A single MCE can contain both snapshot-oriented and delta-oriented metadata change proposals.
The use case of this event is explained in [Metadata Ingestion](../architecture/metadata-ingestion.md).
```json
{
  "type": "record",
  "name": "MetadataChangeEvent",
  "namespace": "com.linkedin.mxe",
  "doc": "Kafka event for proposing a metadata change to an entity.",
  "fields": [
    {
      "name": "proposedSnapshot",
      "doc": "Snapshot of the proposed metadata change. Include only the aspects affected by the change in the snapshot.",
      "type": "com.linkedin.metadata.snapshot.Snapshot",
      "optional": true
    },
    {
      "name": "proposedDelta",
      "doc": "Delta of the proposed metadata partial update.",
      "type": "com.linkedin.metadata.delta.Delta",
      "optional": true
    }
  ]
}
```
We'll also generate a [dead letter queue](https://en.wikipedia.org/wiki/Dead_letter_queue) event, Failed Metadata Change Event (FMCE), for any rejected MCE.
The event simply wraps the original MCE and an error message, which contains the reason for rejection. This event can be used for debugging any potential ingestion issues, as well as for replaying any previously rejected proposal if ever needed.
```json
{
  "type": "record",
  "name": "FailedMetadataChangeEvent",
  "namespace": "com.linkedin.mxe",
  "doc": "Kafka event for capturing a failure to process a specific MCE.",
  "fields": [
    {
      "name": "metadataChangeEvent",
      "doc": "The event that failed to be processed.",
      "type": "MetadataChangeEvent"
    },
    {
      "name": "error",
      "type": "string",
      "doc": "The error message or the stacktrace for the failure."
    }
  ]
}
```
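The accept-or-reject flow for MCEs can be sketched roughly as below. This is a simplified illustration under several assumptions: events are plain dicts rather than Kafka messages, aspects are represented as a name-to-value mapping instead of the union array used in the schemas, only `proposedSnapshot` is handled, and `process_mce` and `validate` are hypothetical names.

```python
def process_mce(mce, current_aspects, validate):
    """Process one MCE dict and return ("MAE", event) or ("FMCE", event).

    `current_aspects` maps aspect name -> value for the target entity and is
    updated in place on success. `validate` is a caller-supplied function that
    raises ValueError to reject a proposal.
    """
    try:
        validate(mce)
        proposed = mce["proposedSnapshot"]
        # Capture the previous values of only the aspects touched by this change.
        old = {n: current_aspects[n] for n in proposed["aspects"] if n in current_aspects}
        current_aspects.update(proposed["aspects"])
        mae = {
            "oldSnapshot": {"urn": proposed["urn"], "aspects": old} if old else None,
            "newSnapshot": {"urn": proposed["urn"], "aspects": dict(proposed["aspects"])},
        }
        return "MAE", mae
    except (ValueError, KeyError) as exc:
        # A rejected proposal is wrapped together with the reason for rejection.
        return "FMCE", {"metadataChangeEvent": mce, "error": repr(exc)}
```

Keeping the original MCE inside the failure event is what makes replaying a rejected proposal possible later.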
## Metadata Audit Event (MAE)
A Metadata Audit Event captures the change made to one or multiple metadata [aspects](aspect.md) associated with a particular [entity](entity.md), in the form of a metadata [snapshot](snapshot.md) before the change, and a metadata snapshot after the change.
Every source of truth for a particular metadata aspect is expected to emit an MAE whenever a change is committed to that aspect.
This ensures that any listener of MAEs will be able to construct a complete view of the latest state for all aspects.
Furthermore, because each MAE contains the after image, any mistake made in emitting the MAE can be easily mitigated by emitting a follow-up MAE with the correction.
By the same token, the initial bootstrap problem for any newly added entity can also be solved by emitting an MAE containing all the latest metadata aspects associated with that entity.
```json
{
  "type": "record",
  "name": "MetadataAuditEvent",
  "namespace": "com.linkedin.mxe",
  "doc": "Kafka event for capturing update made to an entity's metadata.",
  "fields": [
    {
      "name": "oldSnapshot",
      "doc": "Snapshot of the metadata before the update. Set to null for newly created metadata. Only the metadata aspects affected by the update are included in the snapshot.",
      "type": "com.linkedin.metadata.snapshot.Snapshot",
      "optional": true
    },
    {
      "name": "newSnapshot",
      "doc": "Snapshot of the metadata after the update. Only the metadata aspects affected by the update are included in the snapshot.",
      "type": "com.linkedin.metadata.snapshot.Snapshot"
    }
  ]
}
```
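A listener can rebuild the latest state simply by folding the after image of each MAE into its own view. The sketch below assumes events arrive as plain dicts in commit order and, for brevity, represents aspects as a name-to-value mapping; `apply_mae` is a hypothetical helper.

```python
def apply_mae(view, mae):
    """Fold one MAE into `view`, a dict of urn -> {aspect name -> latest value}."""
    snapshot = mae["newSnapshot"]
    # Only the aspects present in the after image are overwritten.
    view.setdefault(snapshot["urn"], {}).update(snapshot["aspects"])
    return view
```

Since only `newSnapshot` is consulted, a corrective follow-up MAE overwrites an earlier mistaken one, and a bootstrap MAE carrying every aspect of an entity populates the view in one step.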

122
docs/what/relationship.md Normal file

@ -0,0 +1,122 @@
# What is a GMA relationship?
A relationship is a named association between exactly two entities, a source and a destination.
![metadata-modeling](../imgs/metadata-modeling.png)
From the above graph, a `Group` entity can be linked to a `User` entity via a `HasMember` relationship.
Note that the name of the relationship reflects the direction, i.e. pointing from `Group` to `User`.
This is due to the fact that the actual metadata aspect holding this information is associated with `Group`, rather than `User`.
Had the direction been reversed, the relationship would be called `IsMemberOf` instead.
See [Direction of Relationships](#direction-of-relationships) for more discussion of relationship directionality.
A specific instance of a relationship, e.g. `urn:li:corpgroup:metadata-dev` has a member `urn:li:corpuser:malan`,
corresponds to an edge in the metadata graph.
Similar to an entity, a relationship can also be associated with optional attributes that are derived from metadata.
For example, from the `Membership` metadata aspect shown below, we're able to derive the `HasMember` relationship that links a specific `Group` to a specific `User`.
We can also add attributes to the relationship, e.g. `importance`, which corresponds to the position of the specific member in the original membership array.
This allows complex graph queries that traverse only relationships matching certain criteria, e.g. `return only the top-5 most important members of this group`.
Once again, attributes should only be added based on query patterns.
```json
{
  "type": "record",
  "name": "Membership",
  "namespace": "com.linkedin.group",
  "doc": "The membership metadata for a group",
  "fields": [
    {
      "name": "auditStamp",
      "type": "com.linkedin.common.AuditStamp",
      "doc": "Audit stamp for the last change"
    },
    {
      "name": "admin",
      "type": "com.linkedin.common.CorpuserUrn",
      "doc": "Admin of the group"
    },
    {
      "name": "members",
      "type": {
        "type": "array",
        "items": "com.linkedin.common.CorpuserUrn"
      },
      "doc": "Members of the group, ordered in descending importance"
    }
  ]
}
```
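Deriving `HasMember` relationships from a `Membership` aspect might look like the sketch below. The function names are hypothetical, and treating `importance` as the zero-based array position (so 0 is the most important member) is an assumption consistent with the "descending importance" ordering in the schema.

```python
def derive_has_member(group_urn, membership):
    """Derive HasMember relationship instances from a Membership aspect.

    `importance` is assumed to be the member's zero-based position in the array.
    """
    return [
        {"source": group_urn, "destination": member, "importance": position}
        for position, member in enumerate(membership["members"])
    ]


def top_members(relationships, n):
    """Return the destinations of the n most important HasMember relationships."""
    ranked = sorted(relationships, key=lambda rel: rel["importance"])
    return [rel["destination"] for rel in ranked[:n]]
```

A query such as "the top-5 most important members" then becomes `top_members(derive_has_member(group_urn, membership), 5)`.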
Relationships are meant to be `entity-neutral`. In other words, one would expect to use the same `OwnedBy` relationship to link a `Dataset` to a `User` and to link a `Dashboard` to a `User`.
As Pegasus doesn't allow typing a field using multiple URNs (because they're all essentially strings), we resort to using a generic URN type for the source and destination.
We also introduce a non-standard property, `pairings`, to limit the allowed source and destination URN types.
While it's possible to model relationships in rest.li as [association resources](https://linkedin.github.io/rest.li/modeling/modeling#association),
which often get stored as mapping tables, it is far more common to model them as `foreign key` fields in a metadata aspect.
For instance, the `Ownership` aspect is likely to contain an array of the owners' corpuser URNs.
Below is an example of how a relationship is modeled in PDSC. Note that:
1. As the `source` and `destination` are of generic URN type, we're able to factor them out to a common `BaseRelationship` model.
2. Each model is expected to have a `pairings` property that is an array of all allowed source-destination URN pairs.
3. Unlike entities, there's no requirement to make all attributes optional, since relationships do not support partial updates.
```json
{
  "type": "record",
  "name": "BaseRelationship",
  "namespace": "com.linkedin.metadata.relationship",
  "doc": "Common fields that apply to all relationships",
  "fields": [
    {
      "name": "source",
      "type": "com.linkedin.common.Urn",
      "doc": "Urn for the source of the relationship"
    },
    {
      "name": "destination",
      "type": "com.linkedin.common.Urn",
      "doc": "Urn for the destination of the relationship"
    }
  ]
}
```
```json
{
  "type": "record",
  "name": "HasMember",
  "namespace": "com.linkedin.metadata.relationship",
  "doc": "Data model for a has-member relationship",
  "include": [
    "BaseRelationship"
  ],
  "pairings": [
    {
      "source": "com.linkedin.common.urn.CorpGroupUrn",
      "destination": "com.linkedin.common.urn.CorpUserUrn"
    }
  ],
  "fields": [
    {
      "name": "importance",
      "type": "int",
      "doc": "The importance of the membership"
    }
  ]
}
```
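Because `source` and `destination` are generic URNs, something has to enforce the `pairings` constraint at runtime. The sketch below shows one way this could be checked. It assumes URNs of the form `urn:li:<type>:<id>` (as in `urn:li:corpgroup:metadata-dev`) and expresses the pairings as entity-type strings rather than URN class names; both the helper names and that mapping are assumptions for illustration.

```python
# Assumed entity-type form of the HasMember pairings declared in the schema.
HAS_MEMBER_PAIRINGS = [("corpgroup", "corpuser")]


def entity_type(urn):
    """Extract the entity type from a URN of the assumed form urn:li:<type>:<id>."""
    parts = urn.split(":", 3)
    if len(parts) != 4 or parts[0] != "urn" or parts[1] != "li":
        raise ValueError(f"not a recognized URN: {urn!r}")
    return parts[2]


def is_allowed(source, destination, pairings):
    """Check whether the (source, destination) entity types form an allowed pairing."""
    return (entity_type(source), entity_type(destination)) in pairings
```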
## Direction of Relationships
As relationships are modeled as directed edges between nodes, it's natural to ask which way an edge should point,
or whether there should be edges going both ways. The answer is, `it kind of doesn't matter.` It's more an aesthetic choice than a technical one.
For one, the actual direction doesn't really matter when it comes to constructing graph queries.
Most graph DBs are fully capable of traversing edges in the reverse direction efficiently.
That being said, there's generally a more `natural way` to specify the direction of a relationship, which is closely related to how the metadata is stored.
For example, the membership information for an LDAP group is generally stored as a list in the group's metadata.
As a result, it's more natural to model a `HasMember` relationship that points from a group to a member, instead of an `IsMemberOf` relationship pointing from member to group.
Since all relationships are explicitly declared, it's fairly easy for a user to discover what relationships are available and their directionality by inspecting
the [relationships package](../../metadata-models/src/main/pegasus/com/linkedin/metadata/relationship).
It's also possible to provide a UI for the catalog of entities and relationships for analysts who are interested in building complex graph queries to gain insights into metadata.
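To illustrate why direction matters little for querying, the sketch below indexes each edge in both directions, so a single stored `HasMember` edge answers both "who are the members of this group?" and "which groups is this user a member of?". `EdgeIndex` is a hypothetical toy structure, not how any particular graph DB is implemented.

```python
from collections import defaultdict


class EdgeIndex:
    """Toy edge index that supports forward and reverse traversal."""

    def __init__(self):
        self._out = defaultdict(list)  # (source, relationship) -> destinations
        self._in = defaultdict(list)   # (destination, relationship) -> sources

    def add(self, source, relationship, destination):
        self._out[(source, relationship)].append(destination)
        self._in[(destination, relationship)].append(source)

    def outgoing(self, source, relationship):
        """Follow edges in their declared direction."""
        return list(self._out.get((source, relationship), []))

    def incoming(self, destination, relationship):
        """Follow edges against their declared direction."""
        return list(self._in.get((destination, relationship), []))
```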


59
docs/what/snapshot.md Normal file

@ -0,0 +1,59 @@
# What is a snapshot in GMA?
A metadata snapshot models the current state of one or multiple metadata [aspects](aspect.md) associated with a particular [entity](entity.md).
Each entity type is expected to have:
1. An entity-specific aspect (e.g. `GroupAspect` from below), which is a `typeref` containing a union of all possible metadata aspects for the entity.
2. An entity-specific snapshot (e.g. `GroupSnapshot` from below), which contains an array (`aspects`) of entity-specific aspects.
```json
{
  "type": "typeref",
  "name": "GroupAspect",
  "namespace": "com.linkedin.metadata.aspect",
  "doc": "A specific metadata aspect for a group",
  "ref": [
    "com.linkedin.group.Membership",
    "com.linkedin.group.SomeOtherMetadata"
  ]
}
```
```json
{
  "type": "record",
  "name": "GroupSnapshot",
  "namespace": "com.linkedin.metadata.snapshot",
  "doc": "A metadata snapshot for a specific group entity.",
  "fields": [
    {
      "name": "urn",
      "type": "com.linkedin.common.CorpGroupUrn",
      "doc": "URN for the entity the metadata snapshot is associated with."
    },
    {
      "name": "aspects",
      "doc": "The list of metadata aspects associated with the group.",
      "type": {
        "type": "array",
        "items": "com.linkedin.metadata.aspect.GroupAspect"
      }
    }
  ]
}
```
The generic `Snapshot` typeref contains a union of all entity-specific snapshots and can therefore be used to represent the state of any metadata aspect for all supported entity types.
```json
{
  "type": "typeref",
  "name": "Snapshot",
  "namespace": "com.linkedin.metadata.snapshot",
  "doc": "A union of all supported metadata snapshot types.",
  "ref": [
    "DatasetSnapshot",
    "GroupSnapshot",
    "UserSnapshot"
  ]
}
```
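When consuming a snapshot, a reader typically needs to pick out one aspect type from the `aspects` array. The sketch below assumes each union member is serialized as a single-key object keyed by the member's fully qualified type name; that encoding and the `get_aspect` helper are assumptions made for this illustration.

```python
def get_aspect(snapshot, aspect_type):
    """Return the first aspect of `aspect_type` in the snapshot, or None if absent.

    Each element of `aspects` is assumed to be a single-key dict whose key is
    the fully qualified name of the union member.
    """
    for member in snapshot["aspects"]:
        if aspect_type in member:
            return member[aspect_type]
    return None
```

Returning `None` for a missing aspect fits the fact that a snapshot may carry only a subset of the entity's aspects.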

0
docs/what/urn.md Normal file