# Metadata Ingestion

## Prerequisites

1. Before running any metadata ingestion job, make sure that all Data Hub backend services are running. The easiest way to do that is through [Docker images](../docker).
2. You also need to build the `mxe-schemas` module as shown below.
```
./gradlew :metadata-events:mxe-schemas:build
```
This is needed to generate `MetadataChangeEvent.avsc`, which is the schema for the `MetadataChangeEvent` Kafka topic. This stream is the entry point for any metadata CRUD operation in Data Hub.
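
Because an `.avsc` file is plain JSON, one quick way to confirm that the build produced a usable schema is to load it and list its top-level fields. The sketch below is only an illustration: the helper names are hypothetical, the location of the generated file depends on your build and is not stated here, and it assumes the top-level schema is an Avro record.

```python
import json
from pathlib import Path


def load_avro_schema(path):
    """Parse an .avsc file (plain JSON) and return it as a dict.

    Assumption: the top-level schema is an Avro record.
    """
    with Path(path).open(encoding="utf-8") as fp:
        schema = json.load(fp)
    if not isinstance(schema, dict) or schema.get("type") != "record":
        raise ValueError(f"{path} does not contain a top-level Avro record")
    return schema


def top_level_fields(schema):
    """Return the names of the record's top-level fields, in declared order."""
    return [field["name"] for field in schema.get("fields", [])]
```

Call `load_avro_schema` with the path to the generated `MetadataChangeEvent.avsc` and print the result of `top_level_fields` to see which fields an MCE is expected to carry.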

## MCE Producer/Consumer CLI

The `mce_cli.py` script provides a convenient way to produce a list of MCEs (`MetadataChangeEvent` records) from a data file. Each MCE in the data file must be on a single line. The script also supports consuming from the `MetadataChangeEvent` topic.

```
➜  python mce_cli.py --help
usage: mce_cli.py [-h] [-b BOOTSTRAP_SERVERS] [-s SCHEMA_REGISTRY]
                  [-d DATA_FILE]
                  {produce,consume}

Client for producing/consuming MetadataChangeEvent

positional arguments:
  {produce,consume}     Execution mode (produce | consume)

optional arguments:
  -h, --help            show this help message and exit
  -b BOOTSTRAP_SERVERS  Kafka broker(s) (localhost[:port])
  -s SCHEMA_REGISTRY    Schema Registry (http(s)://localhost[:port]
  -d DATA_FILE          MCE data file; required if running 'producer' mode
```
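
To make the interface above easier to reason about, here is a minimal `argparse` sketch that accepts the same flags and positional mode. It is a reconstruction from the help text only, not the script's actual source: the function names and `dest` names are hypothetical, no default values are assumed, and the "data file required when producing" check is inferred from the `-d` description.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the flags shown in the help output."""
    parser = argparse.ArgumentParser(
        prog="mce_cli.py",
        description="Client for producing/consuming MetadataChangeEvent",
    )
    parser.add_argument("-b", dest="bootstrap_servers",
                        help="Kafka broker(s) (localhost[:port])")
    parser.add_argument("-s", dest="schema_registry",
                        help="Schema Registry (http(s)://localhost[:port])")
    parser.add_argument("-d", dest="data_file",
                        help="MCE data file; required when producing")
    parser.add_argument("mode", choices=["produce", "consume"],
                        help="Execution mode (produce | consume)")
    return parser


def parse_cli(argv):
    """Parse argv and enforce that produce mode comes with a data file."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.mode == "produce" and not args.data_file:
        # Inferred rule: -d is only mandatory in produce mode.
        parser.error("-d DATA_FILE is required in produce mode")
    return args
```

With this sketch, `parse_cli(["consume"])` succeeds without a data file, while `parse_cli(["produce"])` exits with a usage error.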

## Bootstrapping Data Hub

If you want to quickly ingest lots of sample data and see Data Hub in action, you can run the command below:
```
➜  python mce_cli.py produce -d bootstrap_mce.dat
Producing MetadataChangeEvent records to topic MetadataChangeEvent. ^c to exit.
MCE1: {"auditHeader": None, "proposedSnapshot": ("com.linkedin.metadata.snapshot.CorpUserSnapshot", {"urn": "urn:li:corpuser:foo", "aspects": [{"active": True,"email": "foo@linkedin.com"}]}), "proposedDelta": None}
MCE2: {"auditHeader": None, "proposedSnapshot": ("com.linkedin.metadata.snapshot.CorpUserSnapshot", {"urn": "urn:li:corpuser:bar", "aspects": [{"active": False,"email": "bar@linkedin.com"}]}), "proposedDelta": None}
Flushing records...
```
This will bootstrap Data Hub with sample datasets and sample users.
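
Before producing from your own data file, it can help to check that every line holds exactly one well-formed MCE. The sketch below assumes that each line is a Python literal, as the `None`, `True`, and tuple values in the sample output suggest, and that the three keys shown in that output are expected on every record. Both points are assumptions rather than documented requirements, and `check_mce_file` is a hypothetical helper, not part of `mce_cli.py`.

```python
import ast

# Assumption: every MCE carries the three keys seen in the sample output.
EXPECTED_KEYS = {"auditHeader", "proposedSnapshot", "proposedDelta"}


def check_mce_file(path):
    """Return (line_number, problem) pairs for lines that do not look like an MCE."""
    problems = []
    with open(path, encoding="utf-8") as fp:
        for line_number, line in enumerate(fp, start=1):
            text = line.strip()
            if not text:
                continue  # ignore blank lines
            try:
                record = ast.literal_eval(text)
            except (ValueError, SyntaxError, TypeError) as exc:
                problems.append((line_number, f"not a Python literal: {exc}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "not a dict"))
                continue
            missing = EXPECTED_KEYS - record.keys()
            if missing:
                problems.append(
                    (line_number, "missing keys: " + ", ".join(sorted(missing)))
                )
    return problems
```

An empty list means every non-blank line parsed as a dict with the expected keys. An MCE that has been split across several lines will usually show up as one or more "not a Python literal" entries.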