2020-03-18 22:29:47 +08:00

# DataHub Ingestion Tool

## Introduction

A set of tools for ingesting [jdbc-database-schema] and [etl-lineage] metadata into DataHub.

The ingestion procedure is split into two parts: a [datahub-producer] and several different [metadata-generator] tools.
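
The split can be pictured as a Unix pipeline: a generator writes metadata records to standard output, and the producer reads them from standard input. The sketch below is a hypothetical, minimal generator in Python, not one of the bundled tools. It assumes one JSON document per line, and the record fields (`dataset`, `platform`) are placeholders rather than the schema the producer actually expects.

```python
import json
import sys
from typing import Iterable, Iterator, TextIO


def generate_records(tables: Iterable[str], platform: str) -> Iterator[dict]:
    """Yield one hypothetical metadata record per table name."""
    for table in tables:
        # Placeholder fields only; a real generator would follow the producer's schema.
        yield {"dataset": table, "platform": platform}


def emit(records: Iterable[dict], out: TextIO = sys.stdout) -> int:
    """Write each record as one JSON line and return how many were written."""
    count = 0
    for record in records:
        out.write(json.dumps(record, sort_keys=True) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    emit(generate_records(["db.orders", "db.customers"], platform="mysql"))
```

Keeping the generator unaware of where its output goes is what lets each metadata source get its own small tool while sharing one producer.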
## Roadmap

- [X] datahub-producer loads JSON Avro data.
- [X] add lineage-hive generator
- [X] add dataset-jdbc generator (includes mysql, mssql, postgresql, and oracle drivers)
- [X] add dataset-hive generator
- [ ] *> add lineage-oracle generator
- [ ] enhance lineage-jdbc generator to use a lazy iterator mode.
- [ ] enhance avro parser to show error information
## Quickstart

1. Install Nix and add the nixpkgs channel.

```
# create /nix owned by the current user, then run the Nix installer
sudo install -d -m755 -o $(id -u) -g $(id -g) /nix
curl -L https://nixos.org/nix/install | sh

# register the nixos-20.03 channel under the name nixpkgs and fetch it
nix-channel --add https://nixos.org/channels/nixos-20.03 nixpkgs
nix-channel --update nixpkgs
```

2. [optional] Download the dependencies of a specific tool in advance; otherwise they are downloaded automatically at run time.

```
nix-shell bin/[datahub-producer].hs.nix
nix-shell bin/[datahub-producer].py.nix
...
```

3. Load JSON data into DataHub.

```
cat sample/mce.json.dat | bin/datahub-producer.hs config
```
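
Before piping a file into the producer, it can help to check that the input is well-formed. The following is a minimal sketch, assuming the input holds one JSON document per line (the layout of `sample/mce.json.dat` is not stated here, so treat that as an assumption). The function name `find_invalid_lines` is hypothetical and not part of this repository.

```python
import json
from typing import Iterable, List, Tuple


def find_invalid_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, error_message) for each non-empty line that is not valid JSON."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored rather than reported
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, err.msg))
    return problems


# Small in-memory demonstration: the second line is truncated JSON.
sample = ['{"a": 1}', '{"a": ', ""]
for number, message in find_invalid_lines(sample):
    print(f"line {number}: {message}")
```

Because the function accepts any iterable of strings, an open file object can be passed directly, for example `find_invalid_lines(open("sample/mce.json.dat"))`.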

4. Parse Hive SQL and load the resulting lineage into DataHub.

```
ls sample/hive_*.sql | bin/lineage_hive_generator.hs | bin/datahub-producer.hs config
```
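
To illustrate what "lineage" means in this step, here is a toy sketch that maps each `INSERT` target table to the tables the statement reads from. It deliberately uses crude regular expressions instead of a real SQL parser, so it is not how the bundled generator works and it will miss subqueries, CTEs, comments, and quoted names. The function name `extract_lineage` and the example tables are hypothetical.

```python
import re
from typing import Dict, List

# Naive patterns: a stand-in for a real SQL parser, for illustration only.
TARGET_RE = re.compile(r"\bINSERT\s+(?:OVERWRITE|INTO)\s+TABLE\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(sql: str) -> Dict[str, List[str]]:
    """Map each INSERT target table to the sorted list of tables it reads from."""
    lineage = {}
    for statement in sql.split(";"):
        target = TARGET_RE.search(statement)
        if target is None:
            continue  # statements without an INSERT target carry no lineage here
        sources = sorted(set(SOURCE_RE.findall(statement)) - {target.group(1)})
        lineage[target.group(1)] = sources
    return lineage


example = """
INSERT OVERWRITE TABLE mart.daily_sales
SELECT o.day, SUM(o.amount)
FROM raw.orders o JOIN raw.customers c ON o.customer_id = c.id
GROUP BY o.day;
"""
print(extract_lineage(example))
```

In the pipeline above the generator appears to receive file names on standard input; the equivalent step in this sketch would be to read each named file and pass its text to `extract_lineage`.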

5. Load JDBC schemas (mysql, mssql, postgresql, oracle) into DataHub.

```
bin/dataset-jdbc-generator.hs | bin/datahub-producer.hs config
```

6. Load Hive schemas into DataHub.

```
bin/dataset-hive-generator.py | bin/datahub-producer.hs config
```
## Reference

- hive/presto/vertica SQL Parser
  uber/queryparser [https://github.com/uber/queryparser.git]

- oracle procedure syntax
  https://docs.oracle.com/cd/E11882_01/server.112/e41085/sqlqr01001.htm#SQLQR110

- postgresql procedure parser
  SQream/hssqlppp [https://github.com/JakeWheat/hssqlppp.git]