# Data Hub [Build Status](https://travis-ci.org/linkedin/WhereHows) [Gitter](https://gitter.im/linkedin/datahub)
## Pre-requisites
Be sure to have JDK 8 installed on your machine. For example, on a yum-based distribution:
```
sudo yum install java-1.8.0-openjdk-devel
```
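To confirm the JDK is available on your `PATH`, you can check the version (exact output varies by distribution):
```
java -version
```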
Install Docker and Docker Compose. See https://www.docker.com/get-started for instructions on how to install docker-ce.
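You can verify both are installed with:
```
docker --version
docker-compose --version
```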
Install the Chrome web browser: https://www.google.com/chrome/
## Quickstart
To start all Docker containers at once, run the commands below.
```
cd docker/quickstart
docker-compose up
# once the containers are up (for example, in a second terminal):
cd ../elasticsearch && bash init.sh
```
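To confirm everything came up, you can list the running containers (the exact container names depend on the compose files):
```
docker ps
```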
## Starting Kafka
Kafka, ZooKeeper, and Schema Registry run in individual Docker containers. We use Confluent images with their default configurations.
```
cd docker/kafka
docker-compose up
```
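Since `docker-compose up` runs in the foreground, you can check the state of the three services from the `docker/kafka` directory in a second terminal:
```
docker-compose ps
```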
## Starting MySQL
MySQL Server runs in its own Docker container. Run the commands below to start the MySQL container.
```
cd docker/mysql
docker-compose up
```
To connect to the MySQL server, use the command below:
```
docker exec -it mysql mysql -u datahub -pdatahub datahub
```
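Once the container is up, you can also run a one-off query to confirm the database schema is in place (table names may vary by version):
```
docker exec mysql mysql -u datahub -pdatahub datahub -e 'SHOW TABLES;'
```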
## Starting ElasticSearch and Kibana
ElasticSearch and Kibana run in their own Docker containers. Run the commands below to start the ElasticSearch and Kibana containers.
```
cd docker/elasticsearch
docker-compose up
```
After the containers are initialized, create the search index by running the command below:
```
bash init.sh
```
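You can confirm the index was created by listing the indices on the cluster (Elasticsearch listens on its default port 9200):
```
curl 'http://localhost:9200/_cat/indices?v'
```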
You can open Kibana in your web browser at the link below:
```
http://localhost:5601
```
## Starting GMS
GMS (Generalized Metadata Service) serves the Rest.li API on port 8080. Run below to build and start it:
```
./gradlew build
./gradlew :gms:war:JettyRunWar
```
### Example GMS Curl Calls
#### Create
```
curl 'http://localhost:8080/corpUsers/($params:(),name:fbar)/snapshot' -X POST -H 'X-RestLi-Method: create' -H 'X-RestLi-Protocol-Version:2.0.0' --data '{"aspects": [{"com.linkedin.identity.CorpUserInfo":{"active": true, "fullName": "Foo Bar", "email": "fbar@linkedin.com"}}, {"com.linkedin.identity.CorpUserEditableInfo":{}}], "urn": "urn:li:corpuser:fbar"}' -v
curl 'http://localhost:8080/datasets/($params:(),name:x.y,origin:PROD,platform:urn%3Ali%3AdataPlatform%3Afoo)/snapshot' -X POST -H 'X-RestLi-Method: create' -H 'X-RestLi-Protocol-Version:2.0.0' --data '{"aspects":[{"com.linkedin.common.Ownership":{"owners":[{"owner":"urn:li:corpuser:ksahin","type":"DATAOWNER"}],"lastModified":{"time":0,"actor":"urn:li:corpuser:ksahin"}}},{"com.linkedin.dataset.UpstreamLineage":{"upstreams":[{"auditStamp":{"time":0,"actor":"urn:li:corpuser:ksahin"},"dataset":"urn:li:dataset:(urn:li:dataPlatform:foo,barUp,PROD)","type":"TRANSFORMED"}]}},{"com.linkedin.common.InstitutionalMemory":{"elements":[{"url":"https://www.linkedin.com","description":"Sample doc","createStamp":{"time":0,"actor":"urn:li:corpuser:ksahin"}}]}},{"com.linkedin.schema.SchemaMetadata":{"schemaName":"FooEvent","platform":"urn:li:dataPlatform:foo","version":0,"created":{"time":0,"actor":"urn:li:corpuser:ksahin"},"lastModified":{"time":0,"actor":"urn:li:corpuser:ksahin"},"hash":"","platformSchema":{"com.linkedin.schema.KafkaSchema":{"documentSchema":"{\"type\":\"record\",\"name\":\"MetadataChangeEvent\",\"namespace\":\"com.linkedin.mxe\",\"doc\":\"Kafka event for proposing a metadata change for an entity.\",\"fields\":[{\"name\":\"auditHeader\",\"type\":{\"type\":\"record\",\"name\":\"KafkaAuditHeader\",\"namespace\":\"com.linkedin.avro2pegasus.events\",\"doc\":\"Header\"}}]}"}},"fields":[{"fieldPath":"foo","description":"Bar","nativeDataType":"string","type":{"type":{"com.linkedin.schema.StringType":{}}}}]}}],"urn":"urn:li:dataset:(urn:li:dataPlatform:foo,bar,PROD)"}' -v
```
#### Get
```
curl -H 'X-RestLi-Protocol-Version:2.0.0' -H 'X-RestLi-Method: get' 'http://localhost:8080/corpUsers/($params:(),name:fbar)/snapshot/($params:(),aspectVersions:List((aspect:com.linkedin.identity.CorpUserInfo,version:0)))' | jq
curl -H 'X-RestLi-Protocol-Version:2.0.0' -H 'X-RestLi-Method: get' 'http://localhost:8080/datasets/($params:(),name:x.y,origin:PROD,platform:urn%3Ali%3AdataPlatform%3Afoo)/snapshot/($params:(),aspectVersions:List((aspect:com.linkedin.common.Ownership,version:0)))' | jq
```
#### Get all
```
curl -H 'X-RestLi-Protocol-Version:2.0.0' -H 'X-RestLi-Method: get_all' 'http://localhost:8080/corpUsers' | jq
```
#### Browse
```
curl "http://localhost:8080/datasets?action=browse" -d '{"path": "", "start": 0, "limit": 10}' -X POST -H 'X-RestLi-Protocol-Version: 2.0.0' | jq
```
#### Search
```
curl "http://localhost:8080/corpUsers?q=search& input=foo& " -X GET -H 'X-RestLi-Protocol-Version: 2.0.0' -H 'X-RestLi-Method: finder' | jq
curl "http://localhost:8080/datasets?q=search& input=foo& " -X GET -H 'X-RestLi-Protocol-Version: 2.0.0' -H 'X-RestLi-Method: finder' | jq
```
#### Autocomplete
```
curl "http://localhost:8080/datasets?action=autocomplete" -d '{"query": "foo", "field": "name", "limit": 10}' -X POST -H 'X-RestLi-Protocol-Version: 2.0.0' | jq
```
#### Ownership
```
curl -H 'X-RestLi-Protocol-Version:2.0.0' -H 'X-RestLi-Method: get' 'http://localhost:8080/datasets/($params:(),name:x.y,origin:PROD,platform:urn%3Ali%3AdataPlatform%3Afoo)/rawOwnership/0' | jq
```
#### Schema
```
curl -H 'X-RestLi-Protocol-Version:2.0.0' -H 'X-RestLi-Method: get' 'http://localhost:8080/datasets/($params:(),name:x.y,origin:PROD,platform:urn%3Ali%3AdataPlatform%3Afoo)/schema/0' | jq
```
## Debugging Kafka
GMS fires a MetadataAuditEvent (MAE) after a new record is created through the snapshot endpoint. We can check whether this message is fired correctly using kafkacat.
Install kafkacat by following the instructions at https://github.com/edenhill/kafkacat.
To consume messages on the MetadataAuditEvent topic, run the command below. kafkacat doesn't support Avro deserialization just yet, but there is ongoing [work](https://github.com/edenhill/kafkacat/pull/151) to add it.
```
kafkacat -b localhost:9092 -t MetadataAuditEvent
```
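kafkacat can also dump the broker metadata, which is a quick way to confirm the broker is reachable and the topic exists:
```
kafkacat -b localhost:9092 -L
```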
## Starting Elasticsearch Indexing Job
Run the command below to start the Elasticsearch indexing job.
```
./gradlew :metadata-jobs:elasticsearch-index-job:run
```
To test the job, you should have already started Kafka, GMS, MySQL, and ElasticSearch/Kibana.
After starting all the services, you can create a record in GMS through the snapshot endpoint as below.
```
curl 'http://localhost:8080/metrics/($params:(),name:a.b.c01,type:UMP)/snapshot' -X POST -H 'X-RestLi-Method: create' -H 'X-RestLi-Protocol-Version:2.0.0' --data '{"aspects": [{"com.linkedin.common.Ownership":{"owners":[{"owner":"urn:li:corpuser:ksahin","type":"DATAOWNER"}]}}], "urn": "urn:li:metric:(UMP,a.b.c01)"}' -v
```
This will fire an MAE, and the indexing job will update the search index after reading the MAE from Kafka.
Then, you can check whether the document is populated in the ES index with the command below.
```
curl localhost:9200/metricdocument/_search -d '{"query":{"match":{"urn":"urn:li:metric:(UMP,a.b.c01)"}}}' | jq
```
## Starting MetadataChangeEvent Consuming Job
Run the command below to start the MCE consuming job.
```
./gradlew :metadata-jobs:mce-consumer-job:run
```
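If you want to watch the events the consumer job reads, you can tail the topic with kafkacat (see Debugging Kafka above); the topic name is assumed to match the event name:
```
kafkacat -b localhost:9092 -t MetadataChangeEvent
```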
Create your own MCEs in sample_MCE.dat, aligned with the models.
Tip: write one MCE per line, using Python syntax.
Then you can produce MCEs to feed your GMS.
```
cd metadata-ingestion/src
python avro_cli.py produce
```
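After producing MCEs, you can verify they were ingested by querying GMS, for example with the get-all call shown earlier:
```
curl -H 'X-RestLi-Protocol-Version:2.0.0' -H 'X-RestLi-Method: get_all' 'http://localhost:8080/corpUsers' | jq
```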
## Starting Datahub Frontend
Run the commands below to start the datahub-frontend Play server.
```
cd datahub-frontend/run
./run-local-frontend
```
Then you can open Datahub in your web browser at the link below:
```
http://localhost:9001
```