HDFS Dataset

This doc is for older versions (v0.2.1 and before) of WhereHows. Please refer to this for the latest version.

Collect dataset metadata from HDFS.

Configuration

List of properties in the wh_etl_job_property table that are required for the Hadoop dataset ETL process:

| configuration key | description |
| --- | --- |
| hdfs.cluster | cluster name |
| hdfs.remote.machine | remote Hadoop gateway machine name (can be localhost) |
| hdfs.private_key_location | private key used to log in to the remote machine |
| hdfs.remote.jar | JAR file location on the remote machine |
| hdfs.remote.user | user login on the remote machine |
| hdfs.remote.raw_metadata | metadata JSON file location on the remote machine |
| hdfs.remote.sample | sample data CSV file location on the remote machine |
| hdfs.local.field_metadata | local location to store the field metadata file |
| hdfs.local.metadata | local location to store the metadata CSV file |
| hdfs.local.raw_metadata | local location to store the metadata JSON file |
| hdfs.local.sample | local location to store the sample data file |
| hdfs.white_list | whitelist of folders to collect metadata from |
| hdfs.num_of_thread | optional; number of threads used to scan HDFS |
| hdfs.file_path_regex_source_map | map of file path regex to dataset source, e.g. [{"/data/tracking.":"Kafka"},{"/data/retail.":"Teradata"}] |
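
For illustration, the job properties for a small deployment might look like the sketch below. Every host name, path, and value is a hypothetical example, not a shipped default; in WhereHows these are stored as rows in the wh_etl_job_property table.

```python
# Hypothetical property values for the HDFS dataset ETL job.
# All host names and paths below are examples only.
hdfs_etl_properties = {
    "hdfs.cluster": "prod-hadoop",
    "hdfs.remote.machine": "gateway01.example.com",
    "hdfs.private_key_location": "/home/wherehows/.ssh/id_rsa",
    "hdfs.remote.jar": "/tmp/wherehows/hadoop-dataset-extractor-standalone.jar",
    "hdfs.remote.user": "wherehows",
    "hdfs.remote.raw_metadata": "/tmp/wherehows/hdfs_metadata.json",
    "hdfs.remote.sample": "/tmp/wherehows/hdfs_sample.csv",
    "hdfs.local.field_metadata": "/var/tmp/wherehows/hdfs_field_metadata.csv",
    "hdfs.local.metadata": "/var/tmp/wherehows/hdfs_metadata.csv",
    "hdfs.local.raw_metadata": "/var/tmp/wherehows/hdfs_metadata.json",
    "hdfs.local.sample": "/var/tmp/wherehows/hdfs_sample.csv",
    "hdfs.white_list": "/data,/projects",
    "hdfs.num_of_thread": "4",
    "hdfs.file_path_regex_source_map": '[{"/data/tracking.":"Kafka"},{"/data/retail.":"Teradata"}]',
}
```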

Extract

Major related module: hadoop-dataset-extractor-standalone

This standalone module is responsible for the extract process.

In a real production environment, the machine that runs this ETL job is usually not the Hadoop gateway machine, so we need to copy the runnable JAR file to the remote gateway, execute it there, and copy the results back.

At compile time, the hadoop-dataset-extractor-standalone module is packaged into a standalone JAR file. At runtime, a Jython script copies the JAR to the Hadoop gateway, runs it there, and copies the results back.
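
The copy/run/copy-back sequence can be sketched roughly as follows. The host, key, and paths are hypothetical (they correspond to the example properties above), and the extractor invocation is a placeholder; the actual Jython launcher in WhereHows builds the real command line from the job properties.

```python
import subprocess

# Hypothetical values; in WhereHows these come from the wh_etl_job_property rows.
key = "/home/wherehows/.ssh/id_rsa"
user_host = "wherehows@gateway01.example.com"
remote_jar = "/tmp/wherehows/hadoop-dataset-extractor-standalone.jar"
remote_raw = "/tmp/wherehows/hdfs_metadata.json"
remote_sample = "/tmp/wherehows/hdfs_sample.csv"
local_raw = "/var/tmp/wherehows/hdfs_metadata.json"
local_sample = "/var/tmp/wherehows/hdfs_sample.csv"

def run(cmd):
    # Fail fast if any remote step breaks.
    subprocess.check_call(cmd)

# 1. Copy the standalone JAR to the Hadoop gateway.
run(["scp", "-i", key, "hadoop-dataset-extractor-standalone.jar",
     user_host + ":" + remote_jar])

# 2. Run the extractor remotely; it scans HDFS and writes the JSON/CSV result files.
#    (Placeholder invocation -- the real extractor's arguments are taken from the
#    job properties and are not shown here.)
run(["ssh", "-i", key, user_host, "java -jar " + remote_jar])

# 3. Copy the result files back for the Transform step.
run(["scp", "-i", key, user_host + ":" + remote_raw, local_raw])
run(["scp", "-i", key, user_host + ":" + remote_sample, local_sample])
```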

Inside the module, we use a whitelist of folders (configured through the properties above) as the starting point to scan folders and files. After abstracting files up to the dataset level, we extract the schema, sample data, and related metadata for each dataset. The final step stores the results in two files: a metadata file and a sample data file.
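
The dataset-source mapping driven by hdfs.file_path_regex_source_map can be illustrated with a small sketch. The real logic lives inside the Java extractor; the example paths below are hypothetical.

```python
import json
import re

# Parse a hdfs.file_path_regex_source_map value (format from the table above).
source_map_json = '[{"/data/tracking.":"Kafka"},{"/data/retail.":"Teradata"}]'
source_map = [(re.compile(list(d.keys())[0]), list(d.values())[0])
              for d in json.loads(source_map_json)]

def dataset_source(path):
    """Return the dataset source for an HDFS path, or None if nothing matches."""
    for pattern, source in source_map:
        if pattern.match(path):
            return source
    return None

print(dataset_source("/data/tracking/PageViewEvent"))  # -> Kafka
print(dataset_source("/data/retail/orders"))           # -> Teradata
```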

Transform

Major related file: HdfsTransform.py

Transform the extractor's JSON output into CSV format for easy loading.
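
A minimal sketch of this kind of flattening is shown below, assuming one JSON record per line and illustrative field names (name, schema, source, fields); the real HdfsTransform.py follows the full WhereHows schema.

```python
import csv
import json

def transform(raw_metadata_json, metadata_csv, field_metadata_csv):
    # Assumed input layout: one JSON object per dataset per line.
    with open(raw_metadata_json) as f:
        datasets = [json.loads(line) for line in f if line.strip()]

    # Dataset-level metadata CSV (illustrative columns).
    with open(metadata_csv, "w") as out:
        writer = csv.writer(out)
        for ds in datasets:
            writer.writerow([ds.get("name"), ds.get("schema"), ds.get("source")])

    # Field-level metadata CSV (illustrative columns).
    with open(field_metadata_csv, "w") as out:
        writer = csv.writer(out)
        for ds in datasets:
            for field in ds.get("fields", []):
                writer.writerow([ds.get("name"), field.get("name"), field.get("type")])

transform("/var/tmp/wherehows/hdfs_metadata.json",
          "/var/tmp/wherehows/hdfs_metadata.csv",
          "/var/tmp/wherehows/hdfs_field_metadata.csv")
```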

Load

Major related file: HdfsLoad.py

Load the transformed CSV files into the MySQL database. Related tables: dict_dataset, dict_dataset_sample, dict_field_detail.
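
A simplified sketch of the bulk-load step is shown below, assuming the MySQLdb driver and a hypothetical staging table name; the real HdfsLoad.py uses the actual WhereHows staging tables and column lists before merging into the tables listed above.

```python
import MySQLdb  # any Python DB-API driver for MySQL would work similarly

# Connection parameters are examples; local_infile must be enabled to use
# LOAD DATA LOCAL INFILE.
conn = MySQLdb.connect(host="localhost", user="wherehows",
                       passwd="wherehows", db="wherehows", local_infile=1)
cursor = conn.cursor()

# Bulk-load the transformed dataset metadata CSV into a staging table
# (stg_dict_dataset is a hypothetical name here), from which the real script
# merges rows into dict_dataset and the other tables above.
cursor.execute("""
    LOAD DATA LOCAL INFILE '/var/tmp/wherehows/hdfs_metadata.csv'
    INTO TABLE stg_dict_dataset
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
""")
conn.commit()
cursor.close()
conn.close()
```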