
# Metadata Ingestion
## Prerequisites
1. Before running any metadata ingestion job, make sure that all DataHub backend services are up and running. The easiest
way to do that is through [Docker images](../docker).
2. You also need to build the `mxe-schemas` module as below.
```
./gradlew :metadata-events:mxe-schemas:build
```
This is needed to generate `MetadataChangeEvent.avsc`, the schema for the `MetadataChangeEvent` Kafka topic.
3. All the scripts are written in Python 3 and will not work with Python 2.x interpreters.
You can verify your Python version with the following command.
```
python --version
```
We recommend using [pyenv](https://github.com/pyenv/pyenv) to install and manage your Python environment.
4. Before launching each ETL ingestion pipeline, you can install/verify the library versions as below.
```
pip install --user -r requirements.txt
```
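The Python 3 requirement from step 3 can also be enforced inside a script itself; a minimal sketch using only the standard library:

```python
import sys

# Abort early if the script is accidentally run under Python 2.x.
if sys.version_info < (3, 0):
    raise RuntimeError("Python 3 is required; found %s" % sys.version)

print("Python %d.%d.%d OK" % sys.version_info[:3])
```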
## MCE Producer/Consumer CLI
The `mce_cli.py` script provides a convenient way to produce a list of MCEs from a data file.
Each MCE in the data file should be on a single line. The script also supports consuming from the
`MetadataChangeEvent` topic.
Tested & confirmed platforms:
* Red Hat Enterprise Linux Workstation release 7.6 (Maipo) w/Python 3.6.8
* MacOS 10.15.5 (19F101) Darwin 19.5.0 w/Python 3.7.3
```
➜ python mce_cli.py --help
usage: mce_cli.py [-h] [-b BOOTSTRAP_SERVERS] [-s SCHEMA_REGISTRY]
                  [-d DATA_FILE] [-l SCHEMA_RECORD]
                  {produce,consume}

Client for producing/consuming MetadataChangeEvent

positional arguments:
  {produce,consume}     Execution mode (produce | consume)

optional arguments:
  -h, --help            show this help message and exit
  -b BOOTSTRAP_SERVERS  Kafka broker(s) (localhost[:port])
  -s SCHEMA_REGISTRY    Schema Registry (http(s)://localhost[:port])
  -l SCHEMA_RECORD      Avro schema record; required if running 'producer' mode
  -d DATA_FILE          MCE data file; required if running 'producer' mode
```
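The sample MCE records in the bootstrapping section below use Python literal syntax (`None`, `True`, tuple union branches), so a data-file line can plausibly be parsed with `ast.literal_eval`; a sketch (the record mirrors the bootstrap samples, and the parsing approach is an assumption about the file format):

```python
import ast

# One MCE per line, written as a Python literal (mirrors the sample
# records in the bootstrapping section; the format is an assumption).
sample_line = (
    '{"auditHeader": None, "proposedSnapshot": '
    '("com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot", '
    '{"urn": "urn:li:corpuser:foo", "aspects": '
    '[{"active": True, "email": "foo@linkedin.com"}]}), '
    '"proposedDelta": None}'
)

mce = ast.literal_eval(sample_line)
# The proposedSnapshot uses tuple notation: (avro_branch_name, value).
snapshot_type, snapshot = mce["proposedSnapshot"]
print(snapshot_type)    # ...CorpUserSnapshot
print(snapshot["urn"])  # urn:li:corpuser:foo
```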
## Bootstrapping DataHub
* Complete steps 1 & 2 from the prerequisites.
* [Optional] Open a new terminal to consume the events:
```
➜ python3 metadata-ingestion/mce-cli/mce_cli.py consume -l metadata-events/mxe-schemas/src/renamed/avro/com/linkedin/mxe/MetadataChangeEvent.avsc
```
* To quickly ingest a batch of sample data and see DataHub in action, run the mce-cli as below:
```
➜ python3 metadata-ingestion/mce-cli/mce_cli.py produce -l metadata-events/mxe-schemas/src/renamed/avro/com/linkedin/mxe/MetadataChangeEvent.avsc -d metadata-ingestion/mce-cli/bootstrap_mce.dat
Producing MetadataChangeEvent records to topic MetadataChangeEvent. ^c to exit.
MCE1: {"auditHeader": None, "proposedSnapshot": ("com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot", {"urn": "urn:li:corpuser:foo", "aspects": [{"active": True,"email": "foo@linkedin.com"}]}), "proposedDelta": None}
MCE2: {"auditHeader": None, "proposedSnapshot": ("com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot", {"urn": "urn:li:corpuser:bar", "aspects": [{"active": False,"email": "bar@linkedin.com"}]}), "proposedDelta": None}
Flushing records...
```
This will bootstrap DataHub with sample datasets and sample users.
> ***Note***
> There is a [known issue](https://github.com/fastavro/fastavro/issues/292) with the Python Avro serialization library
> that can lead to unexpected results when it comes to unions of types.
> Always [use the tuple notation](https://fastavro.readthedocs.io/en/latest/writer.html#using-the-tuple-notation-to-specify-which-branch-of-a-union-to-take) to avoid encountering these difficult-to-debug issues.
## Ingest metadata from LDAP to DataHub
The `ldap_etl.py` script provides an ETL channel to communicate with your LDAP server.
```
➜ Configure your LDAP server environment variables in the file.
LDAPSERVER # Your server host.
BASEDN # Base dn as a container location.
LDAPUSER # Your credential.
LDAPPASSWORD # Your password.
PAGESIZE # Pagination size.
ATTRLIST # Return attributes related to your model.
SEARCHFILTER # Filter to build the search query.
➜ Configure your Kafka broker environment variables in the file.
AVROLOADPATH # Your model event in avro format.
KAFKATOPIC # Your event topic.
BOOTSTRAP # Kafka bootstrap server.
SCHEMAREGISTRY # Kafka schema registry host.
➜ python ldap_etl.py
```
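Since the variables above come from the environment, the script's configuration step can be sketched as follows (variable names are from the listing above; the defaults and the credential check are illustrative assumptions, not the script's real behavior):

```python
import os

# Names follow the listing above; defaults are placeholders, not the
# script's real defaults.
config = {
    "LDAPSERVER": os.environ.get("LDAPSERVER", "ldap://localhost"),
    "BASEDN": os.environ.get("BASEDN", "dc=example,dc=com"),
    "LDAPUSER": os.environ.get("LDAPUSER", ""),
    "LDAPPASSWORD": os.environ.get("LDAPPASSWORD", ""),
    "PAGESIZE": int(os.environ.get("PAGESIZE", "500")),
    "ATTRLIST": os.environ.get("ATTRLIST", "cn,mail").split(","),
    "SEARCHFILTER": os.environ.get("SEARCHFILTER", "(objectClass=person)"),
}

# Warn early about missing credentials instead of failing mid-run.
missing = [k for k in ("LDAPUSER", "LDAPPASSWORD") if not config[k]]
if missing:
    print("Missing credentials: %s" % ", ".join(missing))
```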
This will ingest the metadata from your LDAP server into DataHub as user entities.
## Ingest metadata from SQL-based data systems to DataHub
See [sql-etl](sql-etl/) for more details.