HDFS Dataset

This doc is for older versions (v0.2.1 and before) of WhereHows. Please refer to this for the latest version.

Collect dataset metadata from HDFS.

Configuration

List of properties in the wh_etl_job_property table that are required for the Hadoop dataset ETL process:

| configuration key | description |
| --- | --- |
| hdfs.cluster | cluster name |
| hdfs.remote.machine | remote Hadoop gateway machine name (can be localhost) |
| hdfs.private_key_location | private key used to log in to the remote machine |
| hdfs.remote.jar | JAR file location on the remote machine |
| hdfs.remote.user | user login on the remote machine |
| hdfs.remote.raw_metadata | metadata JSON file location on the remote machine |
| hdfs.remote.sample | sample data CSV file location on the remote machine |
| hdfs.local.field_metadata | local location to store the field metadata file |
| hdfs.local.metadata | local location to store the metadata CSV file |
| hdfs.local.raw_metadata | local location to store the metadata JSON file |
| hdfs.local.sample | local location to store the sample data file |
| hdfs.white_list | whitelist of folders to collect metadata from |
| hdfs.num_of_thread | optional; number of threads used to scan HDFS |
| hdfs.file_path_regex_source_map | map of file path regex to dataset source, e.g. [{"/data/tracking.":"Kafka"},{"/data/retail.":"Teradata"}] |
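
For illustration, the job properties for a small deployment might look like the sketch below. Every host name, path, and value is a hypothetical example, not a shipped default; in WhereHows these are stored as rows in the wh_etl_job_property table.

```python
# Hypothetical property values for the HDFS dataset ETL job.
# All host names and paths below are examples only.
hdfs_etl_properties = {
    "hdfs.cluster": "prod-hadoop",
    "hdfs.remote.machine": "gateway01.example.com",
    "hdfs.private_key_location": "/home/wherehows/.ssh/id_rsa",
    "hdfs.remote.jar": "/tmp/wherehows/hadoop-dataset-extractor-standalone.jar",
    "hdfs.remote.user": "wherehows",
    "hdfs.remote.raw_metadata": "/tmp/wherehows/hdfs_metadata.json",
    "hdfs.remote.sample": "/tmp/wherehows/hdfs_sample.csv",
    "hdfs.local.field_metadata": "/var/tmp/wherehows/hdfs_field_metadata.csv",
    "hdfs.local.metadata": "/var/tmp/wherehows/hdfs_metadata.csv",
    "hdfs.local.raw_metadata": "/var/tmp/wherehows/hdfs_metadata.json",
    "hdfs.local.sample": "/var/tmp/wherehows/hdfs_sample.csv",
    "hdfs.white_list": "/data,/projects",
    "hdfs.num_of_thread": "4",
    "hdfs.file_path_regex_source_map": '[{"/data/tracking.":"Kafka"},{"/data/retail.":"Teradata"}]',
}
```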

Extract

Major related module: hadoop-dataset-extractor-standalone

This standalone module is responsible for the extract process.

In a real production environment, the machine that runs this ETL job is usually not the Hadoop gateway machine, so we need to copy the runnable JAR file to the remote gateway, execute it there, and copy the results back.

At compile time, the hadoop-dataset-extractor-standalone module is packaged into a standalone JAR file. At runtime, a Jython script copies the JAR to the Hadoop gateway, runs it there, and copies the results back.
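
The copy/run/copy-back sequence can be sketched roughly as follows. The host, key, and paths are hypothetical (they correspond to the example properties above), and the extractor invocation is a placeholder; the actual Jython launcher in WhereHows builds the real command line from the job properties.

```python
import subprocess

# Hypothetical values; in WhereHows these come from the wh_etl_job_property rows.
key = "/home/wherehows/.ssh/id_rsa"
user_host = "wherehows@gateway01.example.com"
remote_jar = "/tmp/wherehows/hadoop-dataset-extractor-standalone.jar"
remote_raw = "/tmp/wherehows/hdfs_metadata.json"
remote_sample = "/tmp/wherehows/hdfs_sample.csv"
local_raw = "/var/tmp/wherehows/hdfs_metadata.json"
local_sample = "/var/tmp/wherehows/hdfs_sample.csv"

def run(cmd):
    # Fail fast if any remote step breaks.
    subprocess.check_call(cmd)

# 1. Copy the standalone JAR to the Hadoop gateway.
run(["scp", "-i", key, "hadoop-dataset-extractor-standalone.jar",
     user_host + ":" + remote_jar])

# 2. Run the extractor remotely; it scans HDFS and writes the JSON/CSV result files.
#    (Placeholder invocation -- the real extractor's arguments are taken from the
#    job properties and are not shown here.)
run(["ssh", "-i", key, user_host, "java -jar " + remote_jar])

# 3. Copy the result files back for the Transform step.
run(["scp", "-i", key, user_host + ":" + remote_raw, local_raw])
run(["scp", "-i", key, user_host + ":" + remote_sample, local_sample])
```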

Inside the module, we use a whitelist of folders (configured through the properties above) as the starting point to scan folders and files. After abstracting files up to the dataset level, we extract the schema, sample data, and related metadata for each dataset. The final step stores the results in two files: a metadata file and a sample data file.
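
The dataset-source mapping driven by hdfs.file_path_regex_source_map can be illustrated with a small sketch. The real logic lives inside the Java extractor; the example paths below are hypothetical.

```python
import json
import re

# Parse a hdfs.file_path_regex_source_map value (format from the table above).
source_map_json = '[{"/data/tracking.":"Kafka"},{"/data/retail.":"Teradata"}]'
source_map = [(re.compile(list(d.keys())[0]), list(d.values())[0])
              for d in json.loads(source_map_json)]

def dataset_source(path):
    """Return the dataset source for an HDFS path, or None if nothing matches."""
    for pattern, source in source_map:
        if pattern.match(path):
            return source
    return None

print(dataset_source("/data/tracking/PageViewEvent"))  # -> Kafka
print(dataset_source("/data/retail/orders"))           # -> Teradata
```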

Transform

Major related file: HdfsTransform.py

Transform the extractor's JSON output into CSV format for easy loading.
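
A minimal sketch of this kind of flattening is shown below, assuming one JSON record per line and illustrative field names (name, schema, source, fields); the real HdfsTransform.py follows the full WhereHows schema.

```python
import csv
import json

def transform(raw_metadata_json, metadata_csv, field_metadata_csv):
    # Assumed input layout: one JSON object per dataset per line.
    with open(raw_metadata_json) as f:
        datasets = [json.loads(line) for line in f if line.strip()]

    # Dataset-level metadata CSV (illustrative columns).
    with open(metadata_csv, "w") as out:
        writer = csv.writer(out)
        for ds in datasets:
            writer.writerow([ds.get("name"), ds.get("schema"), ds.get("source")])

    # Field-level metadata CSV (illustrative columns).
    with open(field_metadata_csv, "w") as out:
        writer = csv.writer(out)
        for ds in datasets:
            for field in ds.get("fields", []):
                writer.writerow([ds.get("name"), field.get("name"), field.get("type")])

transform("/var/tmp/wherehows/hdfs_metadata.json",
          "/var/tmp/wherehows/hdfs_metadata.csv",
          "/var/tmp/wherehows/hdfs_field_metadata.csv")
```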

Load

Major related file: HdfsLoad.py

Load the transformed CSV files into the MySQL database. Related tables: dict_dataset, dict_dataset_sample, dict_field_detail.
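
A simplified sketch of the bulk-load step is shown below, assuming the MySQLdb driver and a hypothetical staging table name; the real HdfsLoad.py uses the actual WhereHows staging tables and column lists before merging into the tables listed above.

```python
import MySQLdb  # any Python DB-API driver for MySQL would work similarly

# Connection parameters are examples; local_infile must be enabled to use
# LOAD DATA LOCAL INFILE.
conn = MySQLdb.connect(host="localhost", user="wherehows",
                       passwd="wherehows", db="wherehows", local_infile=1)
cursor = conn.cursor()

# Bulk-load the transformed dataset metadata CSV into a staging table
# (stg_dict_dataset is a hypothetical name here), from which the real script
# merges rows into dict_dataset and the other tables above.
cursor.execute("""
    LOAD DATA LOCAL INFILE '/var/tmp/wherehows/hdfs_metadata.csv'
    INTO TABLE stg_dict_dataset
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
""")
conn.commit()
cursor.close()
conn.close()
```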