
# Metadata Ingestion
## Prerequisites
1. Before running any metadata ingestion job, make sure that all DataHub backend services are up and running. The easiest
way to do that is through [Docker images](../docker).
2. You also need to build the `mxe-schemas` module as below.
```
./gradlew :metadata-events:mxe-schemas:build
```
This is needed to generate `MetadataChangeEvent.avsc`, the schema for the `MetadataChangeEvent` Kafka topic.
3. All the scripts are written in Python 3 and will not work with Python 2.x interpreters.
You can verify your Python version with the following command.
```
python --version
```
We recommend using [pyenv](https://github.com/pyenv/pyenv) to install and manage your Python environment.
4. Before launching each ETL ingestion pipeline, you can install/verify the library versions as below.
```
pip install --user -r requirements.txt
```
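The Python 3 requirement from step 3 can also be enforced inside a script itself; a minimal sketch using only the standard library:

```python
import sys

# Abort early if the script is accidentally run under Python 2.x.
if sys.version_info < (3, 0):
    raise RuntimeError("Python 3 is required; found %s" % sys.version)

print("Python %d.%d.%d OK" % sys.version_info[:3])
```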
## MCE Producer/Consumer CLI
The `mce_cli.py` script provides a convenient way to produce a list of MCEs from a data file.
Each MCE in the data file should be on a single line. The script also supports consuming from the
`MetadataChangeEvent` topic.
Tested & confirmed platforms:
* Red Hat Enterprise Linux Workstation release 7.6 (Maipo) w/Python 3.6.8
* MacOS 10.15.5 (19F101) Darwin 19.5.0 w/Python 3.7.3
```
➜ python mce_cli.py --help
usage: mce_cli.py [-h] [-b BOOTSTRAP_SERVERS] [-s SCHEMA_REGISTRY]
                  [-d DATA_FILE] [-l SCHEMA_RECORD]
                  {produce,consume}

Client for producing/consuming MetadataChangeEvent

positional arguments:
  {produce,consume}     Execution mode (produce | consume)

optional arguments:
  -h, --help            show this help message and exit
  -b BOOTSTRAP_SERVERS  Kafka broker(s) (localhost[:port])
  -s SCHEMA_REGISTRY    Schema Registry (http(s)://localhost[:port])
  -l SCHEMA_RECORD      Avro schema record; required if running 'producer' mode
  -d DATA_FILE          MCE data file; required if running 'producer' mode
```
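The sample MCE records in the bootstrapping section below use Python literal syntax (`None`, `True`, tuple union branches), so a data-file line can plausibly be parsed with `ast.literal_eval`; a sketch (the record mirrors the bootstrap samples, and the parsing approach is an assumption about the file format):

```python
import ast

# One MCE per line, written as a Python literal (mirrors the sample
# records in the bootstrapping section; the format is an assumption).
sample_line = (
    '{"auditHeader": None, "proposedSnapshot": '
    '("com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot", '
    '{"urn": "urn:li:corpuser:foo", "aspects": '
    '[{"active": True, "email": "foo@linkedin.com"}]}), '
    '"proposedDelta": None}'
)

mce = ast.literal_eval(sample_line)
# The proposedSnapshot uses tuple notation: (avro_branch_name, value).
snapshot_type, snapshot = mce["proposedSnapshot"]
print(snapshot_type)    # ...CorpUserSnapshot
print(snapshot["urn"])  # urn:li:corpuser:foo
```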
## Bootstrapping DataHub
* Complete steps 1 & 2 from the prerequisites.
* [Optional] Open a new terminal to consume the events:
```
➜ python3 metadata-ingestion/mce-cli/mce_cli.py consume -l metadata-events/mxe-schemas/src/renamed/avro/com/linkedin/mxe/MetadataChangeEvent.avsc
```
* To quickly ingest a batch of sample data and see DataHub in action, run the mce-cli as below:
```
➜ python3 metadata-ingestion/mce-cli/mce_cli.py produce -l metadata-events/mxe-schemas/src/renamed/avro/com/linkedin/mxe/MetadataChangeEvent.avsc -d metadata-ingestion/mce-cli/bootstrap_mce.dat
Producing MetadataChangeEvent records to topic MetadataChangeEvent. ^c to exit.
MCE1: {"auditHeader": None, "proposedSnapshot": ("com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot", {"urn": "urn:li:corpuser:foo", "aspects": [{"active": True,"email": "foo@linkedin.com"}]}), "proposedDelta": None}
MCE2: {"auditHeader": None, "proposedSnapshot": ("com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot", {"urn": "urn:li:corpuser:bar", "aspects": [{"active": False,"email": "bar@linkedin.com"}]}), "proposedDelta": None}
Flushing records...
```
This will bootstrap DataHub with sample datasets and sample users.
> ***Note***
> There is a [known issue](https://github.com/fastavro/fastavro/issues/292) with the Python Avro serialization library
> that can lead to unexpected results when it comes to unions of types.
> Always [use the tuple notation](https://fastavro.readthedocs.io/en/latest/writer.html#using-the-tuple-notation-to-specify-which-branch-of-a-union-to-take) to avoid encountering these difficult-to-debug issues.
## Ingest metadata from LDAP to DataHub
The `ldap_etl.py` script provides an ETL channel to communicate with your LDAP server.
```
➜ Configure your LDAP server environment variables in the file.
LDAPSERVER # Your server host.
BASEDN # Base dn as a container location.
LDAPUSER # Your credential.
LDAPPASSWORD # Your password.
PAGESIZE # Pagination size.
ATTRLIST # Return attributes related to your model.
SEARCHFILTER # Filter to build the search query.
➜ Configure your Kafka broker environment variables in the file.
AVROLOADPATH # Your model event in avro format.
KAFKATOPIC # Your event topic.
BOOTSTRAP # Kafka bootstrap server.
SCHEMAREGISTRY # Kafka schema registry host.
➜ python ldap_etl.py
```
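Since the variables above come from the environment, the script's configuration step can be sketched as follows (variable names are from the listing above; the defaults and the credential check are illustrative assumptions, not the script's real behavior):

```python
import os

# Names follow the listing above; defaults are placeholders, not the
# script's real defaults.
config = {
    "LDAPSERVER": os.environ.get("LDAPSERVER", "ldap://localhost"),
    "BASEDN": os.environ.get("BASEDN", "dc=example,dc=com"),
    "LDAPUSER": os.environ.get("LDAPUSER", ""),
    "LDAPPASSWORD": os.environ.get("LDAPPASSWORD", ""),
    "PAGESIZE": int(os.environ.get("PAGESIZE", "500")),
    "ATTRLIST": os.environ.get("ATTRLIST", "cn,mail").split(","),
    "SEARCHFILTER": os.environ.get("SEARCHFILTER", "(objectClass=person)"),
}

# Warn early about missing credentials instead of failing mid-run.
missing = [k for k in ("LDAPUSER", "LDAPPASSWORD") if not config[k]]
if missing:
    print("Missing credentials: %s" % ", ".join(missing))
```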
This will ingest the metadata from your LDAP server into DataHub as user entities.
## Ingest metadata from SQL-based data systems to DataHub
See [sql-etl](sql-etl/) for more details.