Mirror of https://github.com/datahub-project/datahub.git, synced 2025-06-27 05:03:31 +00:00
Hive Dataset
Mars Lan edited this page 2017-07-26 16:39:58 -07:00
This doc is for older versions (v0.2.1 and before) of WhereHows. Please refer to this for the latest version.
Collect dataset metadata from Hive.
Configuration
The following properties in the wh_etl_job_property table are required for the Hive dataset ETL process:
configuration key | description
---|---
hive.metastore.jdbc.url | JDBC URL of the Hive Metastore backing database
hive.metastore.jdbc.driver | JDBC driver class used to connect to the Hive Metastore
hive.metastore.username | Hive Metastore database user name
hive.metastore.password | Hive Metastore database password
hive.schema_json_file | local file path where the extracted schema JSON is stored
hive.schema_csv_file | local file path where the transformed schema CSV is stored
hive.field_metadata | local file path where the field metadata CSV is stored
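To make the table concrete, here is a minimal sketch of what a set of values for these properties might look like. All hostnames, paths, and credentials below are placeholder assumptions, not real defaults shipped with WhereHows.

```python
# Hypothetical example values for the Hive dataset ETL job properties.
# Every value here is a placeholder; substitute your own environment's
# metastore URL, credentials, and working directories.
hive_etl_properties = {
    "hive.metastore.jdbc.url": "jdbc:mysql://metastore-host:3306/hive",
    "hive.metastore.jdbc.driver": "com.mysql.jdbc.Driver",
    "hive.metastore.username": "wherehows",
    "hive.metastore.password": "********",
    "hive.schema_json_file": "/var/tmp/wherehows/hive_schema.json",
    "hive.schema_csv_file": "/var/tmp/wherehows/hive_schema.csv",
    "hive.field_metadata": "/var/tmp/wherehows/hive_field_metadata.csv",
}
```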
Extract
Major related file: HiveExtract.py
Connects to the Hive Metastore's backing database, fetches Hive table/view information, and stores it in a local JSON file.
Major source tables: COLUMNS_V2, SERDE_PARAMS
Transform
Major related file: HiveTransform.py
Transforms the extracted JSON output into CSV format so it can be bulk-loaded in the next step.
Load
Major related file: HiveLoad.py
Loads the generated CSV files into the WhereHows MySQL database.
Related tables: dict_dataset