Set Up New Metadata ETL Jobs

This doc is for older versions (v0.2.1 and before) of WhereHows. Please refer to the current documentation for the latest version.

Because metadata ETL jobs depend heavily on the source systems' data models, and the differences between those models are sometimes significant, most ETL components have to be rewritten for each new source system. Our design tries to minimize the work needed to integrate a new system.

This document provides step-by-step instructions for integrating a new source system into WhereHows. There are several built-in ETL types that only require you to configure connection and runtime environment information before you can get started. You can also push data through the API, or even create new ETL job types (refer to the Integration Guide) as needed.

Built-In Metadata ETL Types

Currently, we support metadata collection for a few widely used systems. Here is an overview of the built-in metadata ETL types:

| Type | System | WhereHows ETL Job Name | Comments |
|------|--------|------------------------|----------|
| Dataset | Hadoop | HADOOP_DATASET_METADATA_ETL | supports a whitelist of datasets |
| Dataset | Teradata | TERADATA_DATASET_METADATA_ETL | all tables |
| Dataset | Oracle | ORACLE_DATASET_METADATA_ETL | all tables |
| Execution | Azkaban | AZKABAN_EXECUTION_METADATA_ETL | multi-instance |
| Execution | Oozie | OOZIE_EXECUTION_METADATA_ETL | multi-instance |
| Lineage | Azkaban | AZKABAN_LINEAGE_METADATA_ETL | multi-instance |

Step-by-Step Instructions

1. Add the Application/Database.

Register the application/database using the corresponding APIs. One application represents either a scheduler or an execution system, such as Azkaban or Oozie. One database represents a storage system, such as HDFS or Teradata. We first need to add this information to the dictionary tables.
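For illustration, here is a minimal sketch of registering one application and one database over HTTP in Python. The endpoint paths (`/wh_application`, `/wh_database`), field names, and ids are assumptions, not the documented WhereHows API; substitute the actual backend routes and payloads for your version.

```python
# Minimal sketch: register an application and a database in WhereHows.
# NOTE: the endpoint paths and payload fields below are assumptions for
# illustration; check the backend API docs for your WhereHows version.
import requests

BACKEND = "http://localhost:19001"  # assumed backend service host/port

# Register a scheduler/execution system (e.g. Azkaban) as an application.
app = {
    "app_id": 31,                # hypothetical numeric id
    "app_code": "AZKABAN-PROD",  # hypothetical short name
    "description": "Production Azkaban scheduler",
}
requests.post(BACKEND + "/wh_application", json=app).raise_for_status()

# Register a storage system (e.g. Teradata) as a database.
db = {
    "db_id": 2,                  # hypothetical numeric id
    "db_code": "TERADATA-DW",    # hypothetical short name
    "description": "Teradata data warehouse",
}
requests.post(BACKEND + "/wh_database", json=db).raise_for_status()
```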

2. Fill in all the configurations needed for the job type.

  • Make sure the basic configuration for all WhereHows jobs is already in place (in the wh_property table); a sketch of seeding these rows follows after this list:

| property name | description |
|---------------|-------------|
| wherehows.app_folder | the folder used to store temporary files |
| wherehows.db.driver | driver class, e.g. com.mysql.jdbc.Driver |
| wherehows.db.jdbc.url | URL used to connect to the database, e.g. jdbc:mysql://host_name/wherehows |
| wherehows.db.password | password |
| wherehows.db.username | username |
| wherehows.encrypt.master.key.loc | the key file used for password encryption |
| wherehows.ui.tree.dataset.file | dataset tree JSON file |
| wherehows.ui.tree.flow.file | flow tree JSON file |
  • Add the job configurations through the ETL job property API, as described on each type's page in the Metadata ETL Types section.
  • Some ETL jobs need extra configuration settings. For example, the lineage ETL requires adding content-related configurations through additional APIs: Add a filename, Add a dataset partition pattern, and Add a log lineage pattern, as described on the Lineage page.
  • Some of the properties need encryption. You need to place a file containing your master key at a location of your choice and configure that location through wherehows.encrypt.master.key.loc in the wh_property table. The default location is ~/.wherehows/master_key. This encryption key file is not checked in to GitHub or included in the Play distribution zip file, so please maintain it manually and keep its file system permission as rw------- (0600). Then you can use the etl-job-property API to insert encrypted properties.
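As a rough sketch of the basic configuration in the first bullet, the following seeds the wh_property table and locks down the master key file. The two-column (property_name, property_value) schema, the paths, and the connection details are assumptions for illustration; adapt them to your actual table definition and credentials.

```python
# Minimal sketch: seed the basic WhereHows job properties and protect the
# master key file. NOTE: the (property_name, property_value) schema and the
# connection details are assumptions; adjust to your actual wh_property table.
import os
import pymysql

props = {
    "wherehows.app_folder": "/var/tmp/wherehows",
    "wherehows.db.driver": "com.mysql.jdbc.Driver",
    "wherehows.db.jdbc.url": "jdbc:mysql://host_name/wherehows",
    "wherehows.db.username": "wherehows",
    "wherehows.db.password": "secret",  # placeholder only
    "wherehows.encrypt.master.key.loc": os.path.expanduser("~/.wherehows/master_key"),
    "wherehows.ui.tree.dataset.file": "/var/tmp/wherehows/dataset.json",
    "wherehows.ui.tree.flow.file": "/var/tmp/wherehows/flow.json",
}

conn = pymysql.connect(host="localhost", user="wherehows",
                       password="secret", database="wherehows")
try:
    with conn.cursor() as cur:
        for name, value in props.items():
            # REPLACE keeps the script idempotent if a property already exists.
            cur.execute(
                "REPLACE INTO wh_property (property_name, property_value) "
                "VALUES (%s, %s)", (name, value))
    conn.commit()
finally:
    conn.close()

# The master key file must stay private (rw------- / 0600), as noted above.
os.chmod(os.path.expanduser("~/.wherehows/master_key"), 0o600)
```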

3. Schedule the Metadata ETL Job.

Submit a scheduled metadata collection ETL job using the ETL job API; a sketch follows below.
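For illustration, here is a minimal sketch of submitting one scheduled job. The /etl_job endpoint, the field names, and the cron format are assumptions, not the documented API; substitute the actual ETL job API for your WhereHows version.

```python
# Minimal sketch: submit a scheduled Teradata dataset metadata ETL job.
# NOTE: the /etl_job endpoint and payload fields are assumptions for
# illustration; check the backend API docs for your WhereHows version.
import requests

BACKEND = "http://localhost:19001"  # assumed backend service host/port

job = {
    "wh_etl_job_name": "TERADATA_DATASET_METADATA_ETL",  # from the table above
    "ref_id": 2,               # the db_id/app_id registered in step 1
    "cron_expr": "0 2 * * *",  # hypothetical schedule: daily at 02:00
    "comments": "Collect Teradata table metadata nightly",
}
requests.post(BACKEND + "/etl_job", json=job).raise_for_status()
```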

We try to parameterize most parts of the ETL rules. Some parts, however, are not abstracted out into table configuration, either because their patterns are not general enough, or because they are too numerous and too trivial to be worth customizing. Instead, you need to change the Jython scripts to fit your requirements.
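To make that last point concrete, here is a purely hypothetical example of the kind of hard-coded rule that lives in a Jython script rather than in table configuration. The log pattern and function name are invented for illustration; the real scripts in the WhereHows codebase differ.

```python
# Hypothetical illustration only: a log-line pattern that is too specific to
# abstract into table configuration, so it lives directly in a Jython script.
import re

# Assumed pattern for extracting an output path from a job's execution log.
OUTPUT_PATH_RE = re.compile(r"final resting place\s+(\S+)")

def extract_output_path(log_line):
    """Return the dataset path a job wrote to, or None if the line doesn't match."""
    m = OUTPUT_PATH_RE.search(log_line)
    return m.group(1) if m else None
```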