Mirror of https://github.com/datahub-project/datahub.git, synced 2025-06-27 05:03:31 +00:00
Hive Dataset
Mars Lan edited this page 2017-07-26 16:39:58 -07:00
This doc is for older versions (v0.2.1 and before) of WhereHows. Please refer to this for the latest version.
Collect dataset metadata from Hive.
Configuration
The following properties in the wh_etl_job_property table are required for the Hive dataset ETL process:
configuration key | description
---|---
hive.metastore.jdbc.url | JDBC URL of the Hive Metastore backing database
hive.metastore.jdbc.driver | JDBC driver class used to connect to the Hive Metastore
hive.metastore.username | Hive Metastore database user name
hive.metastore.password | Hive Metastore database password
hive.schema_json_file | local file path where the extracted schema JSON is stored
hive.schema_csv_file | local file path where the transformed schema CSV is stored
hive.field_metadata | local file path where the field metadata CSV is stored
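To make the table concrete, here is a minimal sketch of what a set of values for these properties might look like. All hostnames, paths, and credentials below are placeholder assumptions, not real defaults shipped with WhereHows.

```python
# Hypothetical example values for the Hive dataset ETL job properties.
# Every value here is a placeholder; substitute your own environment's
# metastore URL, credentials, and working directories.
hive_etl_properties = {
    "hive.metastore.jdbc.url": "jdbc:mysql://metastore-host:3306/hive",
    "hive.metastore.jdbc.driver": "com.mysql.jdbc.Driver",
    "hive.metastore.username": "wherehows",
    "hive.metastore.password": "********",
    "hive.schema_json_file": "/var/tmp/wherehows/hive_schema.json",
    "hive.schema_csv_file": "/var/tmp/wherehows/hive_schema.csv",
    "hive.field_metadata": "/var/tmp/wherehows/hive_field_metadata.csv",
}
```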
Extract
Major related file: HiveExtract.py
Connects to the Hive Metastore's backing database, fetches Hive table/view information, and stores it in a local JSON file.
Major source tables: COLUMNS_V2, SERDE_PARAMS
Transform
Major related file: HiveTransform.py
Transforms the extracted JSON output into CSV format so it can be bulk-loaded in the next step.
Load
Major related file: HiveLoad.py
Loads the generated CSV files into the WhereHows MySQL database.
Related tables: dict_dataset