# DataHub CLI
DataHub comes with a friendly cli called `datahub` that allows you to perform a lot of common operations using just the command line.
## Using pip
We recommend Python virtual environments (venvs) to namespace pip modules. Here's an example setup:
```shell
python3 -m venv datahub-env             # create the environment
source datahub-env/bin/activate         # activate the environment
```
**_NOTE:_** If you install `datahub` in a virtual environment, that same virtual environment must be re-activated each time a shell window or session is created.

Once inside the virtual environment, install `datahub` using the following commands:
```console
# Requires Python 3.6+
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
datahub version
# If you see "command not found", try running this instead: python3 -m datahub version
```
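Most ingestion sources are packaged as optional plugins, so you typically install the extra for the source or sink you plan to use as well. A minimal sketch, using the `datahub-rest` sink plugin as an example:

```shell
python3 -m pip install --upgrade 'acryl-datahub[datahub-rest]'
```

See the main [ingestion page](../metadata-ingestion/README.md) for the available plugins.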
If you run into an error, try checking the [_common setup issues_](../metadata-ingestion/developing.md#Common-setup-issues).
## Using docker
You can use the `datahub-ingestion` docker image as explained in [Docker Images](../docker/README.md). If you are using Kubernetes, you can start a pod with the `datahub-ingestion` docker image, open a shell on the pod, and you will have access to the datahub CLI in your Kubernetes cluster.
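For example, to get an interactive shell with the CLI available, you can run something like the following. This is a minimal sketch; the image name, tag, and pod name are assumptions, so substitute whichever `datahub-ingestion` image you actually deploy:

```shell
# Start a throwaway pod from the ingestion image and open a shell in it
kubectl run datahub-cli --rm -it --restart=Never \
  --image=linkedin/datahub-ingestion:latest --command -- /bin/bash

# Once the shell opens inside the pod, the CLI is available
datahub version
```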
## User Guide
The `datahub` cli allows you to do many things, such as quickstarting a DataHub Docker instance locally, ingesting metadata from your sources, and retrieving and modifying metadata.
As with most command-line tools, `--help` is your best friend. Use it to discover the capabilities of the cli and the different commands and sub-commands that are supported.
```console
Usage: datahub [OPTIONS] COMMAND [ARGS]...

Options:
  --debug / --no-debug
  --version             Show the version and exit.
  --help                Show this message and exit.

Commands:
  check      Helper commands for checking various aspects of DataHub.
  delete     Delete metadata from datahub using a single urn or a combination of filters
  docker     Helper commands for setting up and interacting with a local DataHub instance using Docker.
  get        Get metadata for an entity with an optional list of aspects to project
  ingest     Ingest metadata into DataHub.
  init       Configure which datahub instance to connect to
  put        Update a single aspect of an entity
  telemetry  Toggle telemetry.
  version    Print version number and exit.
```
The top-level commands listed below are here mainly to give you a high-level picture of the kinds of things you can accomplish with the cli.
We've ordered them roughly in the order we expect you to interact with them as you get deeper into the `datahub`-verse.
### docker
The `docker` command allows you to start up a local DataHub instance using `datahub docker quickstart`. You can also check if the docker cluster is healthy using `datahub docker check`.
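For example:

```shell
datahub docker quickstart   # pull and start a local DataHub instance
datahub docker check        # verify that the local containers are healthy
```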
### ingest
The `ingest` command allows you to ingest metadata from your sources using ingestion configuration files, which we call recipes. The main [ingestion page](../metadata-ingestion/README.md) contains detailed instructions about how you can use the ingest command and perform advanced operations like rolling back previously ingested metadata through the `rollback` sub-command.
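A typical invocation points the command at a recipe file. A minimal sketch, where `recipe.yaml` and `<run-id>` are illustrative placeholders:

```shell
datahub ingest -c ./recipe.yaml            # run an ingestion recipe
datahub ingest list-runs                   # list recent ingestion runs
datahub ingest rollback --run-id <run-id>  # roll back a previously ingested run
```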
### check
The datahub package is composed of different plugins that allow you to connect to different metadata sources and ingest metadata from them.
The `check` command allows you to check if all plugins are loaded correctly, as well as validate an individual MCE file.
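For example, a minimal sketch where the MCE file path is an illustrative placeholder:

```shell
datahub check plugins                      # show which plugins loaded correctly
datahub check mce-file ./sample_mces.json  # validate a local MCE file
```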
### init
The `init` command is used to tell `datahub` where your DataHub instance is located. The CLI points to a DataHub instance on localhost by default.
Running `datahub init` will allow you to customize the datahub instance you are communicating with.

**_Note_**: Provide your GMS instance's host when the prompt asks you for the DataHub host.

Alternatively, you can set the following env variables if you don't want to use a config file:
```shell
DATAHUB_SKIP_CONFIG=True
DATAHUB_GMS_HOST=http://localhost:8080
DATAHUB_GMS_TOKEN= # Used for communicating with DataHub Cloud
```

The env variables take precedence over what is in the config.
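For example, in a shell session you might export them before invoking the CLI. A minimal sketch, where the host and token values are placeholders for your own deployment:

```shell
export DATAHUB_SKIP_CONFIG=True
export DATAHUB_GMS_HOST=http://localhost:8080
export DATAHUB_GMS_TOKEN=<your-access-token>  # only needed if your instance requires a token
# Subsequent datahub commands in this session will use these settings
```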
### telemetry
To help us understand how people are using DataHub, we collect anonymous usage statistics on actions such as command invocations via Google Analytics.
We do not collect private information such as IP addresses, contents of ingestions, or credentials.
The code responsible for collecting and broadcasting these events is open-source and can be found [within our GitHub](https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/telemetry/telemetry.py).

Telemetry is enabled by default, and the `telemetry` command lets you toggle the sending of these statistics via `telemetry enable/disable`.
You can also disable telemetry by setting `DATAHUB_TELEMETRY_ENABLED` to `false`.
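For example:

```shell
datahub telemetry disable   # stop sending anonymous usage statistics
datahub telemetry enable    # turn them back on
```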
### delete
The `delete` command allows you to delete metadata from DataHub. Read this [guide](./how/delete-metadata.md) to understand how you can delete metadata from DataHub.
```console
datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)" --soft
```
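The `--soft` flag in the example above marks the dataset as removed rather than purging it. A hard delete is also supported; a minimal sketch (see the linked guide for the full set of options and caveats):

```shell
datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)" --hard
```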
### get
The `get` command allows you to easily retrieve metadata from DataHub using the REST API.
For example, the following command gets the ownership aspect of the dataset `urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)`:
```console
datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)" --aspect ownership
{
  "value": {
    "com.linkedin.metadata.snapshot.DatasetSnapshot": {
      "aspects": [
        {
          "com.linkedin.metadata.key.DatasetKey": {
            "name": "SampleHiveDataset",
            "origin": "PROD",
            "platform": "urn:li:dataPlatform:hive"
          }
        },
        {
          "com.linkedin.common.Ownership": {
            "lastModified": {
              "actor": "urn:li:corpuser:jdoe",
              "time": 1581407189000
            },
            "owners": [
              {
                "owner": "urn:li:corpuser:jdoe",
                "type": "DATAOWNER"
              },
              {
                "owner": "urn:li:corpuser:datahub",
                "type": "DATAOWNER"
              }
            ]
          }
        }
      ],
      "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)"
    }
  }
}
```
### put
The `put` command allows you to write metadata into DataHub. This is a flexible way for you to issue edits to metadata from the command line.
For example, the following command instructs `datahub` to set the `ownership` aspect of the dataset `urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)` to the value in the file `ownership.json`.
The JSON in the `ownership.json` file needs to conform to the [`Ownership`](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl) Aspect model as shown below.
```json
{
  "owners": [
    {
      "owner": "urn:li:corpuser:jdoe",
      "type": "DEVELOPER"
    },
    {
      "owner": "urn:li:corpuser:jdub",
      "type": "DATAOWNER"
    }
  ]
}
```
```console
datahub --debug put --urn "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)" --aspect ownership -d ownership.json

[DATE_TIMESTAMP] DEBUG {datahub.cli.cli_utils:340} - Attempting to emit to DataHub GMS; using curl equivalent to:
curl -X POST -H 'User-Agent: python-requests/2.26.0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept: */*' -H 'Connection: keep-alive' -H 'X-RestLi-Protocol-Version: 2.0.0' -H 'Content-Type: application/json' --data '{"proposal": {"entityType": "dataset", "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)", "aspectName": "ownership", "changeType": "UPSERT", "aspect": {"contentType": "application/json", "value": "{\"owners\": [{\"owner\": \"urn:li:corpuser:jdoe\", \"type\": \"DEVELOPER\"}, {\"owner\": \"urn:li:corpuser:jdub\", \"type\": \"DATAOWNER\"}]}"}}}' 'http://localhost:8080/aspects/?action=ingestProposal'
Update succeeded with status 200
```