mirror of https://github.com/datahub-project/datahub.git (synced 2025-08-14 04:06:45 +00:00)

Documentation update part-1

This commit is contained in: parent 1b46145b3a, commit 165d4aef95

14  README.md
@@ -1,12 +1,12 @@
# Data Hub
# DataHub
[](https://travis-ci.org/linkedin/WhereHows)
[](https://gitter.im/linkedin/datahub)




## Introduction
Data Hub is LinkedIn's generalized metadata search & discovery tool. To learn more about Data Hub, check out our
[LinkedIn blog post](https://engineering.linkedin.com/blog/2019/data-hub) and [Strata presentation](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019). This repository contains the complete source code to build Data Hub's frontend & backend services.
DataHub is LinkedIn's generalized metadata search & discovery tool. To learn more about DataHub, check out our
[LinkedIn blog post](https://engineering.linkedin.com/blog/2019/data-hub) and [Strata presentation](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019). This repository contains the complete source code to build DataHub's frontend & backend services.

## Quickstart
1. Install [docker](https://docs.docker.com/install/) and [docker-compose](https://docs.docker.com/compose/install/).
@@ -15,13 +15,13 @@ Data Hub is Linkedin's generalized metadata search & discovery tool. To learn mo
```
cd docker/quickstart && docker-compose pull && docker-compose up --build
```
4. After you have all Docker containers running on your machine, run the command below to ingest the provided sample data into Data Hub:
4. After you have all Docker containers running on your machine, run the command below to ingest the provided sample data into DataHub:
```
./gradlew :metadata-events:mxe-schemas:build && cd metadata-ingestion/mce-cli && pip install --user -r requirements.txt && python mce_cli.py produce -d bootstrap_mce.dat
```
Note: Make sure that you're using Java 8; the build has a strict dependency on Java 8.

5. Finally, you can open `Data Hub` by navigating to `http://localhost:9001` in your browser. You can sign in with `datahub`
5. Finally, you can open `DataHub` by navigating to `http://localhost:9001` in your browser. You can sign in with `datahub`
as both username and password.

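Before moving on, it can help to confirm that everything actually came up. Below is a minimal sanity-check sketch using plain Docker and curl; container names vary by machine, so treat the output of `docker ps` as the source of truth:
```
# List running containers; the DataHub services should all show an "Up" status.
docker ps

# The frontend should answer on port 9001 once it has finished starting.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9001
```
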
## Quicklinks
@@ -33,4 +33,4 @@ as username and password.

## Roadmap
1. Add user profile page
2. Deploy Data Hub to [Azure Cloud](https://azure.microsoft.com/en-us/)
2. Deploy DataHub to [Azure Cloud](https://azure.microsoft.com/en-us/)
@@ -1,25 +1,25 @@
# Data Hub Frontend
Data Hub frontend is a [Play](https://www.playframework.com/) service written in Java. It serves as a mid-tier
between [Data Hub GMS](../gms), which is the backend service, and [Data Hub UI](../datahub-web).
# DataHub Frontend
DataHub frontend is a [Play](https://www.playframework.com/) service written in Java. It serves as a mid-tier
between [DataHub GMS](../gms), which is the backend service, and [DataHub UI](../datahub-web).

## Pre-requisites
* You need to have [JDK8](https://www.oracle.com/java/technologies/jdk8-downloads.html)
installed on your machine to build `Data Hub Frontend`.
installed on your machine to build `DataHub Frontend`.
* You need to have the [Chrome](https://www.google.com/chrome/) web browser
installed in order to build, because the UI tests depend on `Google Chrome`.

## Build
`Data Hub Frontend` is already built as part of the top-level build:
`DataHub Frontend` is already built as part of the top-level build:
```
./gradlew build
```
However, if you only want to build `Data Hub Frontend` specifically:
However, if you only want to build `DataHub Frontend` specifically:
```
./gradlew :datahub-frontend:build
```

## Dependencies
Before starting `Data Hub Frontend`, you need to make sure that [Data Hub GMS](../gms) and
Before starting `DataHub Frontend`, you need to make sure that [DataHub GMS](../gms) and
all its dependencies are already up and running.

Also, user information should already be registered in the DB,
@@ -42,7 +42,7 @@ python metadata-ingestion/mce_cli.py produce
This will create a default user with username `datahub`. You can sign in to the app using `datahub` as your username.

## Start via Docker image
The quickest way to try out `Data Hub Frontend` is to run the [Docker image](../docker/frontend).
The quickest way to try out `DataHub Frontend` is to run the [Docker image](../docker/frontend).

## Start via command line
If you modify things and want to try them out quickly without building the Docker image, you can also run
@@ -51,7 +51,7 @@ the application directly from command line after a successful [build](#build):
cd datahub-frontend/run && ./run-local-frontend
```

## Checking out Data Hub UI
## Checking out DataHub UI
After starting your application in one of the two ways mentioned above, you can connect to it by typing the address below
into your favorite web browser:
```
@@ -1,5 +1,5 @@
# Docker Images
The easiest way to bring up and test Data Hub is using Data Hub [Docker](https://www.docker.com) images,
The easiest way to bring up and test DataHub is using DataHub [Docker](https://www.docker.com) images,
which are continuously deployed to [Docker Hub](https://hub.docker.com/u/keremsahin) with every commit to the repository.

* [**datahub-gms**](gms): [](https://cloud.docker.com/repository/docker/keremsahin/datahub-gms/)
@@ -7,9 +7,9 @@ which are continuously deployed to [Docker Hub](https://hub.docker.com/u/keremsa
* [**datahub-mce-consumer**](mce-consumer): [](https://cloud.docker.com/repository/docker/keremsahin/datahub-mce-consumer/)
* [**datahub-mae-consumer**](mae-consumer): [](https://cloud.docker.com/repository/docker/keremsahin/datahub-mae-consumer/)

The above Docker images are created specifically for Data Hub. You can check the subdirectories to see how those images are
The above Docker images are created specifically for DataHub. You can check the subdirectories to see how those images are
generated via [Docker build](https://docs.docker.com/engine/reference/commandline/build/) files or
how to start each container using [Docker Compose](https://docs.docker.com/compose/). Other than these, Data Hub depends
how to start each container using [Docker Compose](https://docs.docker.com/compose/). Other than these, DataHub depends
on the Docker images below to run:
* [**Kafka and Schema Registry**](kafka)
* [**Elasticsearch**](elasticsearch)
@@ -23,5 +23,5 @@ The pipeline depends on all the above images composing up.
You need to install [docker](https://docs.docker.com/install/) and [docker-compose](https://docs.docker.com/compose/install/).

## Quickstart
If you want to quickly try and evaluate Data Hub by running all necessary Docker containers, you can check the
If you want to quickly try and evaluate DataHub by running all necessary Docker containers, you can check the
[Quickstart Guide](quickstart).
@@ -1,11 +1,11 @@
# Elasticsearch & Kibana

Data Hub uses Elasticsearch as a search engine. Elasticsearch powers the search, typeahead and browse functions for Data Hub.
DataHub uses Elasticsearch as a search engine. Elasticsearch powers the search, typeahead and browse functions for DataHub.
The [official Elasticsearch Docker image](https://hub.docker.com/_/elasticsearch) from Docker Hub is used without
any modification.

## Run Docker container
The command below will start the Elasticsearch and Kibana containers. `Data Hub` uses Elasticsearch release `5.6.8`. Newer
The command below will start the Elasticsearch and Kibana containers. `DataHub` uses Elasticsearch release `5.6.8`. Newer
versions of Elasticsearch are not tested, and you might experience compatibility issues.
```
cd docker/elasticsearch && docker-compose pull && docker-compose up --build
```
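Once the containers are up, a quick way to confirm Elasticsearch is healthy is to hit its REST API. This is only a sanity-check sketch and assumes the default `9200` port mapping from the compose file:
```
# Cluster health should report "green" or "yellow" once Elasticsearch has started.
curl http://localhost:9200/_cluster/health?pretty
```
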
@@ -26,7 +26,7 @@ ports:
```

### Docker Network
All Docker containers for Data Hub are expected to be on the same Docker network, which is `datahub_network`.
All Docker containers for DataHub are expected to be on the same Docker network, which is `datahub_network`.
If you change this, you will need to change it for all other Docker containers as well.
```
networks:
@@ -1,8 +1,8 @@
# Data Hub Frontend Docker Image
# DataHub Frontend Docker Image
[](https://cloud.docker.com/repository/docker/keremsahin/datahub-frontend/)

Refer to [Data Hub Frontend Service](../../datahub-frontend) for a quick overview of the architecture and
responsibility of this service within Data Hub.
Refer to [DataHub Frontend Service](../../datahub-frontend) for a quick overview of the architecture and
responsibility of this service within DataHub.

## Build
```
@@ -28,7 +28,7 @@ ports:
```

#### Docker Network
All Docker containers for Data Hub are expected to be on the same Docker network, which is `datahub_network`.
All Docker containers for DataHub are expected to be on the same Docker network, which is `datahub_network`.
If you change this, you will need to change it for all other Docker containers as well.
```
networks:
@@ -47,7 +47,7 @@ environment:
```
The value of the `DATAHUB_GMS_HOST` variable should be set to the hostname of the `datahub-gms` container within the Docker network.

## Checking out Data Hub UI
## Checking out DataHub UI
After starting your Docker container, you can connect to it by typing the address below into your favorite web browser:
```
http://localhost:9001
@@ -1,8 +1,8 @@
# Data Hub Generalized Metadata Store (GMS) Docker Image
# DataHub Generalized Metadata Store (GMS) Docker Image
[](https://cloud.docker.com/repository/docker/keremsahin/datahub-gms/)

Refer to [Data Hub GMS Service](../../gms) for a quick overview of the architecture and
responsibility of this service within Data Hub.
Refer to [DataHub GMS Service](../../gms) for a quick overview of the architecture and
responsibility of this service within DataHub.

## Build
```
@@ -28,7 +28,7 @@ ports:
```

#### Docker Network
All Docker containers for Data Hub are expected to be on the same Docker network, which is `datahub_network`.
All Docker containers for DataHub are expected to be on the same Docker network, which is `datahub_network`.
If you change this, you will need to change it for all other Docker containers as well.
```
networks:
@@ -1,7 +1,7 @@
# Data Hub MetadataChangeEvent (MCE) Ingestion Docker Image
# DataHub MetadataChangeEvent (MCE) Ingestion Docker Image

Refer to [Data Hub Metadata Ingestion](../../metadata-ingestion/mce-cli) for a quick overview of the architecture and
responsibility of this service within Data Hub.
Refer to [DataHub Metadata Ingestion](../../metadata-ingestion/mce-cli) for a quick overview of the architecture and
responsibility of this service within DataHub.

## Build
```
@@ -18,5 +18,5 @@ for the container otherwise it will build the image from local repository and th

### Container configuration

#### Kafka and Data Hub GMS Containers
#### Kafka and DataHub GMS Containers
Before starting the `ingestion` container, the `datahub-gms`, `kafka` and `datahub-mce-consumer` containers should already be up and running.
@@ -1,6 +1,6 @@
# Kafka, Zookeeper and Schema Registry

Data Hub uses Kafka as the pub-sub message queue in the backend.
DataHub uses Kafka as the pub-sub message queue in the backend.
The [official Confluent Kafka Docker images](https://hub.docker.com/u/confluentinc) from Docker Hub are used without
any modification.

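Once the broker is up, you can confirm that the DataHub topics (`MetadataChangeEvent`, `MetadataAuditEvent`) exist with the `kafka-topics` tool shipped in the Confluent image. The container and Zookeeper host names below are assumptions about this compose setup, so adjust them to whatever `docker ps` reports:
```
# List topics from inside the Kafka broker container (names are illustrative).
docker exec -it kafka kafka-topics --zookeeper zookeeper:2181 --list
```
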
@@ -29,7 +29,7 @@ ports:
```

### Docker Network
All Docker containers for Data Hub are expected to be on the same Docker network, which is `datahub_network`.
All Docker containers for DataHub are expected to be on the same Docker network, which is `datahub_network`.
If you change this, you will need to change it for all other Docker containers as well.
```
networks:
@@ -1,8 +1,8 @@
# Data Hub MetadataAuditEvent (MAE) Consumer Docker Image
# DataHub MetadataAuditEvent (MAE) Consumer Docker Image
[](https://cloud.docker.com/repository/docker/keremsahin/datahub-mae-consumer/)

Refer to [Data Hub MAE Consumer Job](../../metadata-jobs/mae-consumer-job) for a quick overview of the architecture and
responsibility of this service within Data Hub.
Refer to [DataHub MAE Consumer Job](../../metadata-jobs/mae-consumer-job) for a quick overview of the architecture and
responsibility of this service within DataHub.

## Build
```
@@ -20,7 +20,7 @@ for the container otherwise it will download the `latest` image from Docker Hub
### Container configuration

#### Docker Network
All Docker containers for Data Hub are expected to be on the same Docker network, which is `datahub_network`.
All Docker containers for DataHub are expected to be on the same Docker network, which is `datahub_network`.
If you change this, you will need to change it for all other Docker containers as well.
```
networks:
@@ -1,8 +1,8 @@
# Data Hub MetadataChangeEvent (MCE) Consumer Docker Image
# DataHub MetadataChangeEvent (MCE) Consumer Docker Image
[](https://cloud.docker.com/repository/docker/keremsahin/datahub-mce-consumer/)

Refer to [Data Hub MCE Consumer Job](../../metadata-jobs/mce-consumer-job) for a quick overview of the architecture and
responsibility of this service within Data Hub.
Refer to [DataHub MCE Consumer Job](../../metadata-jobs/mce-consumer-job) for a quick overview of the architecture and
responsibility of this service within DataHub.

## Build
```
@@ -20,7 +20,7 @@ for the container otherwise it will download the `latest` image from Docker Hub
### Container configuration

#### Docker Network
All Docker containers for Data Hub are expected to be on the same Docker network, which is `datahub_network`.
All Docker containers for DataHub are expected to be on the same Docker network, which is `datahub_network`.
If you change this, you will need to change it for all other Docker containers as well.
```
networks:
@@ -28,7 +28,7 @@ networks:
name: datahub_network
```

#### Kafka and Data Hub GMS Containers
#### Kafka and DataHub GMS Containers
Before starting the `datahub-mce-consumer` container, the `datahub-gms` and `kafka` containers should already be up and running.
These connections are configured via environment variables in `docker-compose.yml`:
```
@@ -1,6 +1,6 @@
# MySQL

Data Hub GMS uses MySQL as the storage infrastructure.
DataHub GMS uses MySQL as the storage infrastructure.
The [official MySQL Docker image](https://hub.docker.com/_/mysql) from Docker Hub is used without
any modification.

@@ -11,7 +11,7 @@ cd docker/mysql && docker-compose pull && docker-compose up
```

An initialization script, [init.sql](init.sql), is provided to the container. This script initializes the `metadata-aspect` table,
which is basically the key-value store of Data Hub GMS.
which is basically the key-value store of DataHub GMS.

To connect to the MySQL container, you can run the command below:
```
@@ -29,7 +29,7 @@ ports:
```

### Docker Network
All Docker containers for Data Hub are expected to be on the same Docker network, which is `datahub_network`.
All Docker containers for DataHub are expected to be on the same Docker network, which is `datahub_network`.
If you change this, you will need to change it for all other Docker containers as well.
```
networks:
@@ -1,6 +1,6 @@
# Neo4j

Data Hub uses Neo4j as the graph DB in the backend to serve graph queries.
DataHub uses Neo4j as the graph DB in the backend to serve graph queries.
The [official Neo4j image](https://hub.docker.com/_/neo4j) from Docker Hub is used without
any modification.

@@ -22,7 +22,7 @@ ports:
```

### Docker Network
All Docker containers for Data Hub are expected to be on the same Docker network, which is `datahub_network`.
All Docker containers for DataHub are expected to be on the same Docker network, which is `datahub_network`.
If you change this, you will need to change it for all other Docker containers as well.
```
networks:
@@ -1,20 +1,20 @@
# Data Hub Quickstart
# DataHub Quickstart
To start all Docker containers at once, please run the command below:
```
cd docker/quickstart && docker-compose pull && docker-compose up --build
```
At this point, all containers are ready and Data Hub can be considered up and running. Check the specific container guides
At this point, all containers are ready and DataHub can be considered up and running. Check the specific container guides
for details:
* [Elasticsearch & Kibana](../elasticsearch)
* [Data Hub Frontend](../frontend)
* [Data Hub GMS](../gms)
* [DataHub Frontend](../frontend)
* [DataHub GMS](../gms)
* [Kafka, Schema Registry & Zookeeper](../kafka)
* [Data Hub MAE Consumer](../mae-consumer)
* [Data Hub MCE Consumer](../mce-consumer)
* [DataHub MAE Consumer](../mae-consumer)
* [DataHub MCE Consumer](../mce-consumer)
* [MySQL](../mysql)

From this point on, if you want to be able to sign in to Data Hub and see some sample data, please see the
[Metadata Ingestion Guide](../../metadata-ingestion) for `bootstrapping Data Hub`.
From this point on, if you want to be able to sign in to DataHub and see some sample data, please see the
[Metadata Ingestion Guide](../../metadata-ingestion) for `bootstrapping DataHub`.

## Debugging Containers
If you want to debug containers, you can check the container logs:
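The rest of this section is cut off in this view; as a minimal sketch, plain Docker commands are enough for a first look at the logs (replace the name with whatever `docker ps` shows on your machine):
```
# Find the container name or ID, then stream its logs.
docker ps
docker logs --follow <container_name_or_id>
```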
11  docs/architecture/architecture.md (new file)
@@ -0,0 +1,11 @@
# DataHub Architecture


## Metadata Serving
Refer to [metadata-serving](metadata-serving.md).

## Metadata Ingestion
Refer to [metadata-ingestion](metadata-ingestion.md).

## What is Generalized Metadata Architecture (GMA)?
Refer to [GMA](../what/gma.md).
0  docs/architecture/metadata-ingestion.md (new file)
0  docs/architecture/metadata-serving.md (new file)
0  docs/how/entity-onboarding.md (new file)
0  docs/how/graph-onboarding.md (new file)
8  docs/how/metadata-modelling.md (new file)
@@ -0,0 +1,8 @@
# How to model metadata for GMA?
GMA uses [rest.li](https://rest.li), which is LinkedIn's open source REST framework.
All metadata in GMA needs to be modelled using [Pegasus schemas (PDSC)](https://linkedin.github.io/rest.li/DATA-Data-Schema-and-Templates), which is the data schema format for [rest.li](https://rest.li).

Conceptually, we’re modelling metadata as a hybrid graph of nodes ([entities](../what/entity.md)) and edges ([relationships](../what/relationship.md)), with additional documents ([metadata aspects](../what/aspect.md)) attached to each node.
Below is an example graph consisting of 3 types of entities (User, Group, Dataset), 3 types of relationships (OwnedBy, HasAdmin, HasMember), and 3 types of metadata aspects (Ownership, Profile, and Membership).


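To make this concrete, the sketch below shows what one of the aspects in the example graph, `Ownership`, might look like as a PDSC record. The namespace, field names and URN types here are illustrative assumptions rather than the actual DataHub models.

```json
{
  "type": "record",
  "name": "Ownership",
  "namespace": "com.linkedin.dataset",
  "doc": "Illustrative sketch only: ownership metadata for a dataset. Field names and URN types are assumptions.",
  "fields": [
    {
      "name": "auditStamp",
      "type": "com.linkedin.common.AuditStamp",
      "doc": "Audit stamp for the last change"
    },
    {
      "name": "owners",
      "type": {
        "type": "array",
        "items": "com.linkedin.common.CorpuserUrn"
      },
      "doc": "Corp users that own the dataset, implicitly conveying the OwnedBy relationship"
    }
  ]
}
```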
0  docs/how/search-onboarding.md (new file)
BIN  docs/imgs/datahub-architecture.png (new binary file, not shown; 43 KiB)
BIN  docs/imgs/metadata-modeling.png (new binary file, not shown; 210 KiB)
51  docs/what/aspect.md (new file)
@@ -0,0 +1,51 @@
# What is a GMA aspect?
A metadata aspect is a structured document, or more precisely a `record` in [PDSC](https://linkedin.github.io/rest.li/DATA-Data-Schema-and-Templates),
that represents a specific kind of metadata (e.g. ownership, schema, statistics, upstreams).
A metadata aspect on its own has no meaning (e.g. ownership for what?) and must be associated with a particular entity (e.g. ownership for PageViewEvent).
We purposely do not impose any model requirements on metadata aspects, as each aspect is expected to differ significantly.

Metadata aspects are immutable by design, i.e. every change to a particular aspect results in a new version being created.
An optional retention policy can be applied such that only the X most recent versions are retained after each update.
Setting X to 1 effectively means the metadata aspect is non-versioned.
It is also possible to apply retention based on time, e.g. only keep the metadata changes from the past 30 days.

While a metadata aspect can be an arbitrarily complex document with multiple levels of nesting, it is sometimes desirable to break a monolithic aspect into smaller independent aspects.
This provides the following benefits:
1. **Faster read/write**: As metadata aspects are immutable, every `update` will lead to writing the entire large aspect back to the underlying data store.
Likewise, readers will need to retrieve the entire aspect even if they’re only interested in a small part of it.
2. **Ability to independently version different aspects**: For example, one may want to get the change history of all the `ownership metadata` independently of the changes made to the `schema metadata` for a dataset.
3. **Help with rest.li endpoint modeling**: While it’s not required to have a 1:1 mapping between rest.li endpoints and metadata aspects,
it’d follow this pattern naturally, which means one will end up with smaller, more modular endpoints instead of giant ones.

Here’s an example metadata aspect. Note that the `admin` and `members` fields implicitly convey a relationship between the `Group` entity & the `User` entity.
It’s very natural to save such relationships as URNs in a metadata aspect.
The [relationship](relationship.md) section explains how this relationship can be explicitly extracted and modelled.

```json
{
  "type": "record",
  "name": "Membership",
  "namespace": "com.linkedin.group",
  "doc": "The membership metadata for a group",
  "fields": [
    {
      "name": "auditStamp",
      "type": "com.linkedin.common.AuditStamp",
      "doc": "Audit stamp for the last change"
    },
    {
      "name": "admin",
      "type": "com.linkedin.common.CorpuserUrn",
      "doc": "Admin of the group"
    },
    {
      "name": "members",
      "type": {
        "type": "array",
        "items": "com.linkedin.common.CorpuserUrn"
      },
      "doc": "Members of the group, ordered in descending importance"
    }
  ]
}
```
77  docs/what/delta.md (new file)
@@ -0,0 +1,77 @@
# What is Delta in GMA?

Rest.li supports [partial update](https://linkedin.github.io/rest.li/user_guide/restli_server#partial_update) natively without needing explicitly defined models.
However, the granularity of the update is always limited to individual fields in a PDSC model.
There are cases where the update needs to happen at an even finer grain, e.g. adding or removing items from an array.

To this end, we’re proposing the following entity-specific metadata delta model that allows atomic partial updates at any desired granularity.
Note that:
1. Just like metadata [aspects](aspect.md), we’re not imposing any limit on the partial update model, as long as it’s a valid PDSC record.
This is because the rest.li endpoint will have the logic that performs the corresponding partial update based on the information in the model.
That said, it’s common to have fields that denote the list of items to be added or removed (e.g. `membersToAdd` & `membersToRemove` below).
2. Similar to metadata [snapshots](snapshot.md), an entity that supports metadata delta will add an entity-specific metadata delta
(e.g. `GroupDelta` below) that unions all supported partial update models.
3. The entity-specific metadata delta is then added to the global `Delta` typeref, which is added as part of [Metadata Change Event](mxe.md#metadata-change-event-mce) and used during [Metadata Ingestion](../architecture/metadata-ingestion.md).

```json
{
  "type": "record",
  "name": "MembershipPartialUpdate",
  "namespace": "com.linkedin.group",
  "doc": "A metadata delta for a specific group entity.",
  "fields": [
    {
      "name": "membersToAdd",
      "doc": "List of members to be added to the group.",
      "type": {
        "type": "array",
        "items": "com.linkedin.common.CorpuserUrn"
      }
    },
    {
      "name": "membersToRemove",
      "doc": "List of members to be removed from the group.",
      "type": {
        "type": "array",
        "items": "com.linkedin.common.CorpuserUrn"
      }
    }
  ]
}
```

```json
{
  "type": "record",
  "name": "GroupDelta",
  "namespace": "com.linkedin.metadata.delta",
  "doc": "A metadata delta for a specific group entity.",
  "fields": [
    {
      "name": "urn",
      "type": "com.linkedin.common.CorpGroupUrn",
      "doc": "URN for the entity the metadata delta is associated with."
    },
    {
      "name": "delta",
      "doc": "The specific type of metadata delta to apply.",
      "type": [
        "com.linkedin.group.MembershipPartialUpdate"
      ]
    }
  ]
}
```

```json
{
  "type": "typeref",
  "name": "Delta",
  "namespace": "com.linkedin.metadata.delta",
  "doc": "A union of all supported metadata delta types.",
  "ref": [
    "DatasetDelta",
    "GroupDelta"
  ]
}
```
100  docs/what/entity.md (new file)
@@ -0,0 +1,100 @@
# What is a GMA entity?
An entity is very similar to the concept of a [resource](https://linkedin.github.io/rest.li/user_guide/restli_server#writing-resources) in [rest.li](http://rest.li/).
Generally speaking, an entity should have a defined [URN](urn.md) and a corresponding
[CRUD](https://en.wikipedia.org/wiki/Create,_read,_update_and_delete) API for the metadata associated with a particular instance of the entity.
A particular instance of an entity is essentially a node in the [metadata graph](graph.md).



In the above example graph, `Dataset`, `User`, and `Group` are entities.
A specific dataset, e.g. `/data/tracking/PageViewEvent`, is an instance of the `Dataset` entity,
much like how the LDAP group `datahub-dev` is an instance of the `Group` entity.

Unlike rest.li, there’s no concept of sub-entity ([sub-resource](https://github.com/linkedin/rest.li/wiki/Rest.li-User-Guide#sub-resources) in rest.li).
In other words, entities are always top-level and non-nesting. Instead, nestedness is modeled using relationships,
e.g. `Contains`, `IsPartOf`, `HasA`, which are covered in the [Relationship](relationship.md) section.

Entities may also contain attributes, which are in the form of key-value pairs.
Each attribute is indexed to support fast attribute-based querying,
e.g. find all the `User`s that have the job title `Software Engineer`.
There may be a size limitation on the value imposed by the underlying indexing system,
but it suffices to assume that the values should be kept relatively small in size, say less than 1KB.

The value of each attribute is expected to be derived either from the entity’s URN or
from the metadata associated with the entity. Another way to understand the attributes of an entity is to treat them as a complex virtual view over the URN
and metadata, with indexing support on each column of the view.
Just like a virtual view, where one is not supposed to store data in the view directly
but to derive it from the underlying tables, the values for the attributes should also be derived.
How the actual derivation happens is covered in the [Metadata Serving](../architecture/architecture.md#metadata-serving) section.

There’s no need to explicitly create or destroy entity instances.
An entity instance will be automatically created in the graph whenever a new relationship involving the instance is formed,
or when a new metadata aspect is attached to the instance.
Each entity has a special boolean attribute `removed`, which is used to mark the entity as `soft deleted`,
without destroying existing relationships and attached metadata.
This is useful for quickly reviving an incorrectly deleted entity instance without losing valuable metadata,
e.g. human-authored content.

An example schema for the `Dataset` entity is shown below. Note that:
1. Each entity is expected to have a `urn` field with an entity-specific URN type.
2. The optional `removed` field is captured in `BaseEntity`, which is expected to be included by all entities.
3. All other fields are expected to be of primitive or enum types only.
While it may be possible to support other complex types, namely array, union, map, and record,
this mostly depends on the underlying indexing system. For simplicity, we only allow numeric or string-like values for now.
4. The `urn` field is non-optional, while all other fields must be optional.
This is to support `partial update` when only a select set of attributes needs to be altered.

```json
{
  "type": "record",
  "name": "BaseEntity",
  "namespace": "com.linkedin.metadata.entity",
  "doc": "Common fields that apply to all entities",
  "fields": [
    {
      "name": "removed",
      "type": "boolean",
      "doc": "Whether the entity has been removed or not",
      "optional": true,
      "default": false
    }
  ]
}
```

```json
{
  "type": "record",
  "name": "DatasetEntity",
  "namespace": "com.linkedin.metadata.entity",
  "doc": "Data model for a dataset entity",
  "include": [
    "BaseEntity"
  ],
  "fields": [
    {
      "name": "urn",
      "type": "com.linkedin.common.DatasetUrn",
      "doc": "Urn of the dataset"
    },
    {
      "name": "name",
      "type": "string",
      "doc": "Dataset native name",
      "optional": true
    },
    {
      "name": "platform",
      "type": "com.linkedin.common.DataPlatformUrn",
      "doc": "Platform urn for the dataset.",
      "optional": true
    },
    {
      "name": "fabric",
      "type": "com.linkedin.common.FabricType",
      "doc": "Fabric type where dataset belongs to.",
      "optional": true
    }
  ]
}
```
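
As another illustration, a hypothetical `UserEntity` with an indexed `jobTitle` attribute (which would serve the `Software Engineer` query mentioned earlier) could follow the same pattern. The field set shown here is an assumption for illustration, not the actual model.

```json
{
  "type": "record",
  "name": "UserEntity",
  "namespace": "com.linkedin.metadata.entity",
  "doc": "Illustrative sketch only: a user entity with an indexed jobTitle attribute. Field names are assumptions.",
  "include": [
    "BaseEntity"
  ],
  "fields": [
    {
      "name": "urn",
      "type": "com.linkedin.common.CorpuserUrn",
      "doc": "Urn of the corp user"
    },
    {
      "name": "jobTitle",
      "type": "string",
      "doc": "Job title of the user, e.g. Software Engineer",
      "optional": true
    }
  ]
}
```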
0  docs/what/gma.md (new file)
0  docs/what/gms.md (new file)
0  docs/what/graph.md (new file)
93  docs/what/mxe.md (new file)
@@ -0,0 +1,93 @@
# What is MXE (Metadata Events)?

The models defined in [snapshot](snapshot.md) and [delta](delta.md) are used to build the schemas for two metadata Kafka events.
As these events have the prefix `Metadata` and the suffix `Event`, they’re collectively referred to as MXE.

We also model MXEs using [PDSC](https://linkedin.github.io/rest.li/DATA-Data-Schema-and-Templates) and rely on the [pegasus gradle plugin](https://linkedin.github.io/rest.li/setup/gradle#generateavroschema) to convert them into [AVSC](https://avro.apache.org/docs/current/spec.html).
However, we also need to rename all the namespaces of the generated AVSC to avoid namespace clashes for projects that depend on both the PDSC models and MXEs.
As the AVSC and PDSC model schemas are 100% compatible, it’d be very easy to convert the in-memory representation from one to the other using [Pegasus’ DataTranslator](https://linkedin.github.io/rest.li/DATA-Data-Schema-and-Templates#translating-data-to-and-from-avro).

## Metadata Change Event (MCE)

MCE is a `proposal` for a metadata change, as opposed to [MAE](#metadata-audit-event-mae), which conveys a committed change.
Consequently, only successfully accepted and processed MCEs will lead to the emission of a corresponding MAE.
A single MCE can contain both snapshot-oriented and delta-oriented metadata change proposals.
The use case of this event is explained in [Metadata Ingestion](../architecture/metadata-ingestion.md).

```json
{
  "type": "record",
  "name": "MetadataChangeEvent",
  "namespace": "com.linkedin.mxe",
  "doc": "Kafka event for proposing a metadata change to an entity.",
  "fields": [
    {
      "name": "proposedSnapshot",
      "doc": "Snapshot of the proposed metadata change. Include only the aspects affected by the change in the snapshot.",
      "type": "com.linkedin.metadata.snapshot.Snapshot",
      "optional": true
    },
    {
      "name": "proposedDelta",
      "doc": "Delta of the proposed metadata partial update.",
      "type": "com.linkedin.metadata.delta.Delta",
      "optional": true
    }
  ]
}
```

We’ll also generate a [dead letter queue](https://en.wikipedia.org/wiki/Dead_letter_queue) event, Failed Metadata Change Event (FMCE), for any rejected MCE.
The event simply wraps the original MCE and an error message, which contains the reason for rejection. This event can be used for debugging any potential ingestion issues, as well as for re-playing any previously rejected proposal if ever needed.

```json
{
  "type": "record",
  "name": "FailedMetadataChangeEvent",
  "namespace": "com.linkedin.mxe",
  "doc": "Kafka event for capturing a failure to process a specific MCE.",
  "fields": [
    {
      "name": "metadataChangeEvent",
      "doc": "The event that failed to be processed.",
      "type": "MetadataChangeEvent"
    },
    {
      "name": "error",
      "type": "string",
      "doc": "The error message or the stacktrace for the failure."
    }
  ]
}
```

## Metadata Audit Event (MAE)

A Metadata Audit Event captures the change made to one or multiple metadata [aspects](aspect.md) associated with a particular [entity](entity.md), in the form of a metadata [snapshot](snapshot.md) before the change and a metadata snapshot after the change.

Every source-of-truth for a particular metadata aspect is expected to emit an MAE whenever a change is committed to that aspect.
By ensuring this, any listener of MAEs will be able to construct a complete view of the latest state of all aspects.
Furthermore, because each MAE contains the after image, any mistake made in emitting an MAE can be easily mitigated by emitting a follow-up MAE with the correction.
By the same token, the initial bootstrap problem for any newly added entity can also be solved by emitting an MAE containing all the latest metadata aspects associated with that entity.

```json
{
  "type": "record",
  "name": "MetadataAuditEvent",
  "namespace": "com.linkedin.mxe",
  "doc": "Kafka event for capturing update made to an entity's metadata.",
  "fields": [
    {
      "name": "oldSnapshot",
      "doc": "Snapshot of the metadata before the update. Set to null for newly created metadata. Only the metadata aspects affected by the update are included in the snapshot.",
      "type": "com.linkedin.metadata.snapshot.Snapshot",
      "optional": true
    },
    {
      "name": "newSnapshot",
      "doc": "Snapshot of the metadata after the update. Only the metadata aspects affected by the update are included in the snapshot.",
      "type": "com.linkedin.metadata.snapshot.Snapshot"
    }
  ]
}
```
122  docs/what/relationship.md (new file)
@@ -0,0 +1,122 @@
# What is a GMA relationship?

A relationship is a named association between exactly two entities, a source and a destination.



From the above graph, a `Group` entity can be linked to a `User` entity via a `HasMember` relationship.
Note that the name of the relationship reflects the direction, i.e. pointing from `Group` to `User`.
This is due to the fact that the actual metadata aspect holding this information is associated with `Group`, rather than `User`.
Had the direction been reversed, the relationship would be called `IsMemberOf` instead.
See [Direction of Relationships](#direction-of-relationships) for more discussion of relationship directionality.
A specific instance of a relationship, e.g. `urn:li:corpgroup:metadata-dev` has a member `urn:li:corpuser:malan`,
corresponds to an edge in the metadata graph.

Similar to an entity, a relationship can also be associated with optional attributes that are derived from the metadata.
For example, from the `Membership` metadata aspect shown below, we’re able to derive the `HasMember` relationship that links a specific `Group` to a specific `User`.
We can also include additional attributes on the relationship, e.g. importance, which corresponds to the position of the specific member in the original membership array.
This allows complex graph queries that traverse only the relationships matching certain criteria, e.g. `return only the top-5 most important members of this group.`
Once again, attributes should only be added based on query patterns.

```json
{
  "type": "record",
  "name": "Membership",
  "namespace": "com.linkedin.group",
  "doc": "The membership metadata for a group",
  "fields": [
    {
      "name": "auditStamp",
      "type": "com.linkedin.common.AuditStamp",
      "doc": "Audit stamp for the last change"
    },
    {
      "name": "admin",
      "type": "com.linkedin.common.CorpuserUrn",
      "doc": "Admin of the group"
    },
    {
      "name": "members",
      "type": {
        "type": "array",
        "items": "com.linkedin.common.CorpuserUrn"
      },
      "doc": "Members of the group, ordered in descending importance"
    }
  ]
}
```

Relationships are meant to be `entity-neutral`. In other words, one would expect to use the same `OwnedBy` relationship to link a `Dataset` to a `User` and to link a `Dashboard` to a `User`.
As Pegasus doesn’t allow typing a field using multiple URNs (because they’re all essentially strings), we resort to using the generic URN type for the source and destination.
We also introduce a non-standard property, `pairings`, to limit the allowed source and destination URN types.

While it’s possible to model relationships in rest.li as [association resources](https://linkedin.github.io/rest.li/modeling/modeling#association),
which often get stored as mapping tables, it is far more common to model them as a `foreign keys` field in a metadata aspect.
For instance, the `Ownership` aspect is likely to contain an array of the owners’ corpuser URNs.

Below is an example of how a relationship is modeled in PDSC. Note that:
1. As the `source` and `destination` are of the generic URN type, we’re able to factor them out into a common `BaseRelationship` model.
2. Each model is expected to have a `pairings` property that is an array of all allowed source-destination URN pairs.
3. Unlike entities, there’s no requirement on making all attributes optional, since relationships do not support partial updates.

```json
{
  "type": "record",
  "name": "BaseRelationship",
  "namespace": "com.linkedin.metadata.relationship",
  "doc": "Common fields that apply to all relationships",
  "fields": [
    {
      "name": "source",
      "type": "com.linkedin.common.Urn",
      "doc": "Urn for the source of the relationship"
    },
    {
      "name": "destination",
      "type": "com.linkedin.common.Urn",
      "doc": "Urn for the destination of the relationship"
    }
  ]
}
```

```json
{
  "type": "record",
  "name": "HasMember",
  "namespace": "com.linkedin.metadata.relationship",
  "doc": "Data model for a has-member relationship",
  "include": [
    "BaseRelationship"
  ],
  "pairings": [
    {
      "source": "com.linkedin.common.urn.CorpGroupUrn",
      "destination": "com.linkedin.common.urn.CorpUserUrn"
    }
  ],
  "fields": [
    {
      "name": "importance",
      "type": "int",
      "doc": "The importance of the membership"
    }
  ]
}
```

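As a second illustration of the `pairings` property, here is a sketch of the entity-neutral `OwnedBy` relationship mentioned above, allowing both `Dataset`-to-`User` and `Dashboard`-to-`User` edges. The URN class names (in particular `DashboardUrn`) are assumptions made for illustration.

```json
{
  "type": "record",
  "name": "OwnedBy",
  "namespace": "com.linkedin.metadata.relationship",
  "doc": "Illustrative sketch only: an owned-by relationship. URN class names in pairings are assumptions.",
  "include": [
    "BaseRelationship"
  ],
  "pairings": [
    {
      "source": "com.linkedin.common.urn.DatasetUrn",
      "destination": "com.linkedin.common.urn.CorpUserUrn"
    },
    {
      "source": "com.linkedin.common.urn.DashboardUrn",
      "destination": "com.linkedin.common.urn.CorpUserUrn"
    }
  ],
  "fields": []
}
```
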
## Direction of Relationships

As relationships are modeled as directed edges between nodes, it’s natural to ask which way they should point,
or whether there should be edges going both ways. The answer is: `it kind of doesn’t matter.` It’s more an aesthetic choice than a technical one.
For one, the actual direction doesn’t really matter when it comes to constructing graph queries.
Most graph DBs are fully capable of traversing edges in the reverse direction efficiently.

That being said, there’s generally a more `natural way` to specify the direction of a relationship, which is closely related to how the metadata is stored.
For example, the membership information for an LDAP group is generally stored as a list in the group’s metadata.
As a result, it’s more natural to model a `HasMember` relationship that points from a group to a member, instead of an `IsMemberOf` relationship pointing from a member to a group.

Since all relationships are explicitly declared, it’s fairly easy for a user to discover what relationships are available and their directionality by inspecting
the [relationships package](../../metadata-models/src/main/pegasus/com/linkedin/metadata/relationship).
It’s also possible to provide a UI for the catalog of entities and relationships for analysts who are interested in building complex graph queries to gain insights into metadata.
0  docs/what/search-document.md (new file)
59  docs/what/snapshot.md (new file)
@@ -0,0 +1,59 @@
# What is a snapshot in GMA?

A metadata snapshot models the current state of one or multiple metadata [aspects](aspect.md) associated with a particular [entity](entity.md).
Each entity type is expected to have:
1. An entity-specific aspect (e.g. `GroupAspect` below), which is a `typeref` containing a union of all possible metadata aspects for the entity.
2. An entity-specific snapshot (e.g. `GroupSnapshot` below), which contains an array (`aspects`) of entity-specific aspects.

```json
{
  "type": "typeref",
  "name": "GroupAspect",
  "namespace": "com.linkedin.metadata.aspect",
  "doc": "A specific metadata aspect for a group",
  "ref": [
    "com.linkedin.group.Membership",
    "com.linkedin.group.SomeOtherMetadata"
  ]
}
```

```json
{
  "type": "record",
  "name": "GroupSnapshot",
  "namespace": "com.linkedin.metadata.snapshot",
  "doc": "A metadata snapshot for a specific group entity.",
  "fields": [
    {
      "name": "urn",
      "type": "com.linkedin.common.CorpGroupUrn",
      "doc": "URN for the entity the metadata snapshot is associated with."
    },
    {
      "name": "aspects",
      "doc": "The list of metadata aspects associated with the group.",
      "type": {
        "type": "array",
        "items": "com.linkedin.metadata.aspect.GroupAspect"
      }
    }
  ]
}
```

The generic `Snapshot` typeref contains a union of all entity-specific snapshots and can therefore be used to represent the state of any metadata aspect for all supported entity types.

```json
{
  "type": "typeref",
  "name": "Snapshot",
  "namespace": "com.linkedin.metadata.snapshot",
  "doc": "A union of all supported metadata snapshot types.",
  "ref": [
    "DatasetSnapshot",
    "GroupSnapshot",
    "UserSnapshot"
  ]
}
```
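
For intuition, a hypothetical value conforming to the `GroupSnapshot` schema above might look like the sketch below; the URNs and timestamp are made up, and the exact wire representation of the aspect union may differ from this simplified form.

```json
{
  "urn": "urn:li:corpgroup:datahub-dev",
  "aspects": [
    {
      "com.linkedin.group.Membership": {
        "auditStamp": {
          "time": 1568000000000,
          "actor": "urn:li:corpuser:datahub"
        },
        "admin": "urn:li:corpuser:datahub",
        "members": [
          "urn:li:corpuser:datahub"
        ]
      }
    }
  ]
}
```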
0  docs/what/urn.md (new file)
@@ -1,27 +1,27 @@
# Data Hub Generalized Metadata Store (GMS)
Data Hub GMS is a [Rest.li](https://linkedin.github.io/rest.li/) service written in Java. It follows common
# DataHub Generalized Metadata Store (GMS)
DataHub GMS is a [Rest.li](https://linkedin.github.io/rest.li/) service written in Java. It follows common
Rest.li server development practices, and all data models are Pegasus (.pdsc) models.

## Pre-requisites
* You need to have [JDK8](https://www.oracle.com/java/technologies/jdk8-downloads.html)
installed on your machine to build `Data Hub GMS`.
installed on your machine to build `DataHub GMS`.

## Build
`Data Hub GMS` is already built as part of the top-level build:
`DataHub GMS` is already built as part of the top-level build:
```
./gradlew build
```
However, if you only want to build `Data Hub GMS` specifically:
However, if you only want to build `DataHub GMS` specifically:
```
./gradlew :gms:war:build
```

## Dependencies
Before starting `Data Hub GMS`, you need to make sure that the [Kafka, Schema Registry & Zookeeper](../docker/kafka),
Before starting `DataHub GMS`, you need to make sure that the [Kafka, Schema Registry & Zookeeper](../docker/kafka),
[Elasticsearch](../docker/elasticsearch) and [MySQL](../docker/mysql) Docker containers are up and running.

## Start via Docker image
The quickest way to try out `Data Hub GMS` is to run the [Docker image](../docker/gms).
The quickest way to try out `DataHub GMS` is to run the [Docker image](../docker/gms).

## Start via command line
If you modify things and want to try them out quickly without building the Docker image, you can also run
@@ -1,7 +1,7 @@
# Metadata Ingestion

## Prerequisites
1. Before running any metadata ingestion job, you should make sure that the Data Hub backend services are all running. The easiest
1. Before running any metadata ingestion job, you should make sure that the DataHub backend services are all running. The easiest
way to do that is through [Docker images](../docker).
2. You also need to build the `mxe-schemas` module as below.
```
@@ -37,8 +37,8 @@ optional arguments:
-d DATA_FILE MCE data file; required if running 'producer' mode
```

## Bootstrapping Data Hub
To leverage the mce-cli to quickly ingest lots of sample data and test Data Hub in action, you can run the command below:
## Bootstrapping DataHub
To leverage the mce-cli to quickly ingest lots of sample data and test DataHub in action, you can run the command below:
```
➜ python mce_cli.py produce -d bootstrap_mce.dat
Producing MetadataChangeEvent records to topic MetadataChangeEvent. ^c to exit.
@@ -46,9 +46,9 @@ Producing MetadataChangeEvent records to topic MetadataChangeEvent. ^c to exit.
MCE2: {"auditHeader": None, "proposedSnapshot": ("com.linkedin.metadata.snapshot.CorpUserSnapshot", {"urn": "urn:li:corpuser:bar", "aspects": [{"active": False,"email": "bar@linkedin.com"}]}), "proposedDelta": None}
Flushing records...
```
This will bootstrap Data Hub with sample datasets and sample users.
This will bootstrap DataHub with sample datasets and sample users.

## Ingest metadata from LDAP server to Data Hub
## Ingest metadata from LDAP server to DataHub
The ldap_etl script provides an ETL channel to communicate with your LDAP server.
```
➜ Configure your LDAP server environment variables in the file.
@@ -68,9 +68,9 @@ The ldap_etl provides you ETL channel to communicate with your LDAP server.

➜ python ldap_etl.py
```
This will bootstrap Data Hub with the metadata in your LDAP server as user entities.
This will bootstrap DataHub with the metadata in your LDAP server as user entities.

## Ingest metadata from Hive store to Data Hub
## Ingest metadata from Hive store to DataHub
The hive_etl script provides an ETL channel to communicate with your Hive store.
```
➜ Configure your Hive store environment variables in the file.
@@ -84,9 +84,9 @@ The hive_etl provides you ETL channel to communicate with your hive store.

➜ python hive_etl.py
```
This will bootstrap Data Hub with the metadata in your Hive store as dataset entities.
This will bootstrap DataHub with the metadata in your Hive store as dataset entities.

## Ingest metadata from Kafka, Zookeeper and Avro schema registry to Data Hub
## Ingest metadata from Kafka, Zookeeper and Avro schema registry to DataHub
The kafka_etl script provides an ETL channel to communicate with your Kafka cluster.
```
➜ Configure your Kafka environment variables in the file.
@@ -100,9 +100,9 @@ The kafka_etl provides you ETL channel to communicate with your kafka.

➜ python kafka_etl.py
```
This will bootstrap Data Hub with the metadata in Kafka as dataset entities.
This will bootstrap DataHub with the metadata in Kafka as dataset entities.

## Ingest metadata from MySQL to Data Hub
## Ingest metadata from MySQL to DataHub
The mysql_etl script provides an ETL channel to communicate with your MySQL instance.
```
➜ Configure your MySQL environment variables in the file.
@@ -119,9 +119,9 @@ The mysql_etl provides you ETL channel to communicate with your MySQL.

➜ python mysql_etl.py
```
This will bootstrap Data Hub with the metadata in MySQL as dataset entities.
This will bootstrap DataHub with the metadata in MySQL as dataset entities.

## Ingest metadata from RDBMS to Data Hub
## Ingest metadata from RDBMS to DataHub
The rdbms_etl script provides an ETL channel to communicate with your RDBMS.
- Currently supports IBM DB2, Firebird, MSSQL Server, MySQL, Oracle, PostgreSQL, SQLite and ODBC connections.
- Some platform-specific logic is modularized and needs to be implemented for your ad-hoc usage.
@@ -141,4 +141,4 @@ The rdbms_etl provides you ETL channel to communicate with your RDBMS.

➜ python rdbms_etl.py
```
This will bootstrap Data Hub with the metadata in the RDBMS as dataset entities.
This will bootstrap DataHub with the metadata in the RDBMS as dataset entities.
@@ -1,10 +1,10 @@
# MXE Consumer Jobs
Data Hub uses Kafka as the pub-sub message queue in the backend. There are two Kafka topics used by Data Hub, which are
DataHub uses Kafka as the pub-sub message queue in the backend. There are two Kafka topics used by DataHub, which are
`MetadataChangeEvent` and `MetadataAuditEvent`.
* `MetadataChangeEvent:` This message is emitted by any data platform or crawler when there is a change in the metadata.
* `MetadataAuditEvent:` This message is emitted by [Data Hub GMS](../gms) to notify that a metadata change has been registered.
* `MetadataAuditEvent:` This message is emitted by [DataHub GMS](../gms) to notify that a metadata change has been registered.

To be able to consume from these two topics, there are two [Kafka Streams](https://kafka.apache.org/documentation/streams/)
jobs Data Hub uses:
* [MCE Consumer Job](mce-consumer-job): Writes to [Data Hub GMS](../gms)
jobs DataHub uses:
* [MCE Consumer Job](mce-consumer-job): Writes to [DataHub GMS](../gms)
* [MAE Consumer Job](mae-consumer-job): Writes to [Elasticsearch](../docker/elasticsearch) & [Neo4j](../docker/neo4j)
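To watch these events flowing while the jobs run, one option is the Avro console consumer shipped with the Confluent Schema Registry image. The container names, broker address and registry URL below are assumptions about the Docker setup described above, so adjust them to your environment:
```
# Tail committed metadata changes (MAEs); names and URLs are illustrative assumptions.
docker exec -it schema-registry kafka-avro-console-consumer \
  --bootstrap-server kafka:9092 \
  --topic MetadataAuditEvent \
  --property schema.registry.url=http://schema-registry:8081
```
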
@@ -6,7 +6,7 @@ So, this job is providing us a near-realtime search index update.

## Pre-requisites
* You need to have [JDK8](https://www.oracle.com/java/technologies/jdk8-downloads.html)
installed on your machine to build `Data Hub GMS`.
installed on your machine to build `DataHub GMS`.

## Build
`MAE Consumer Job` is already built as part of the top-level build:
@@ -1,12 +1,12 @@
# MetadataChangeEvent (MCE) Consumer Job
MCE Consumer is a [Kafka Streams](https://kafka.apache.org/documentation/streams/) job. Its main function is to listen
to the `MetadataChangeEvent` Kafka topic, process those messages and write the new metadata to `Data Hub GMS`.
to the `MetadataChangeEvent` Kafka topic, process those messages and write the new metadata to `DataHub GMS`.
After every successful update of metadata, GMS fires a `MetadataAuditEvent`, and this is consumed by the
[MAE Consumer Job](../mae-consumer-job).

## Pre-requisites
* You need to have [JDK8](https://www.oracle.com/java/technologies/jdk8-downloads.html)
installed on your machine to build `Data Hub GMS`.
installed on your machine to build `DataHub GMS`.

## Build
`MCE Consumer Job` is already built as part of the top-level build:
@@ -20,7 +20,7 @@ However, if you only want to build `MCE Consumer Job` specifically:

## Dependencies
Before starting `MCE Consumer Job`, you need to make sure that the [Kafka, Schema Registry & Zookeeper](../../docker/kafka) and
[Data Hub GMS](../../docker/gms) Docker containers are up and running.
[DataHub GMS](../../docker/gms) Docker containers are up and running.

## Start via Docker image
The quickest way to try out `MCE Consumer Job` is to run the [Docker image](../../docker/mce-consumer).