Mirror of https://github.com/datahub-project/datahub.git (synced 2025-12-26 09:26:22 +00:00)
docs: merge cli guide (#10464)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
This commit is contained in:
parent 6307eecb96, commit 49d1233403
@@ -219,11 +219,6 @@ module.exports = {
      id: "docs/managed-datahub/approval-workflows",
      className: "saasOnly",
    },
    {
      "Metadata Ingestion With Acryl": [
        "docs/managed-datahub/metadata-ingestion-with-acryl/ingestion",
      ],
    },
    {
      "DataHub API": [
        {
@@ -38,7 +38,7 @@ either Kafka or using the Metadata Store Rest APIs directly. DataHub supports an
a host of capabilities including schema extraction, table & column profiling, usage information extraction, and more.

Getting started with the Ingestion Framework is simple: just define a YAML file and execute the `datahub ingest` command.
Learn more by heading over to the [Metadata Ingestion](https://datahubproject.io/docs/metadata-ingestion/) guide.

## GraphQL API
@@ -1,116 +0,0 @@

# Ingestion

Acryl Metadata Ingestion functions similarly to ingestion in open source DataHub. Sources are configured via [UI Ingestion](docs/ui-ingestion.md) or via a [Recipe](metadata-ingestion/README.md#recipes); ingestion recipes can be scheduled using your system of choice, and metadata can be pushed from anywhere.
This document describes the steps required to ingest metadata from your data sources.

## Batch Ingestion

Batch ingestion involves extracting metadata from a source system in bulk. Typically, this happens on a predefined schedule using the [Metadata Ingestion](metadata-ingestion/README.md#install-from-pypi) framework.
The metadata that is extracted includes point-in-time instances of dataset, chart, dashboard, pipeline, user, group, usage, and task metadata.

### Step 1: Install DataHub CLI

Regardless of how you ingest metadata, you'll need your account subdomain and API key handy.

#### Install from Gemfury Private Repository

**Installing from the command line with pip**

Determine the version you would like to install, obtain a read access token by requesting a one-time secret from the Acryl team, and then run the following command:

`python3 -m pip install acryl-datahub==<VERSION> --index-url https://<TOKEN>:@pypi.fury.io/acryl-data/`

#### Install from PyPI for OSS Release

Run the following commands in your terminal:

```shell
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
python3 -m datahub version
```

_Note: Requires Python 3.6+_

Your command line should return the proper version of DataHub upon executing these commands successfully.

### Step 2: Install Connector Plugins

Our CLI follows a plugin architecture. You must install connectors for different data sources individually. For a list of all supported data sources, see [the open source docs](metadata-ingestion/README.md#installing-plugins).
Once you've found the connectors you care about, simply install them using `pip install`.
For example, to install the `mysql` connector, you can run

```shell
pip install --upgrade 'acryl-datahub[mysql]'
```

### Step 3: Write a Recipe

[Recipes](metadata-ingestion/README.md#recipes) are YAML configuration files that serve as input to the Metadata Ingestion framework. Each recipe file defines a single source to read from and a single destination to push the metadata to.
The two most important pieces of the file are the _source_ and _sink_ configuration blocks.
The _source_ configuration block defines where to extract metadata from. This can be an OLTP database system, a data warehouse, or something as simple as a file. Each source has custom configuration depending on what is required to access metadata from the source. To see configurations required for each supported source, refer to the [Sources](metadata-ingestion/README.md#sources) documentation.
The _sink_ configuration block defines where to push metadata into. Each sink type requires specific configurations, which are detailed in the [Sinks](metadata-ingestion/README.md#sinks) documentation.
In Acryl DataHub deployments, you _must_ use a sink of type `datahub-rest`, which simply means that metadata will be pushed to the REST endpoints exposed by your DataHub instance. The required configurations for this sink are:

1. **server**: the location of the REST API exposed by your instance of DataHub
2. **token**: a unique API key used to authenticate requests to your instance's REST API

The token can be retrieved by logging in as admin. Go to the Settings page and generate a Personal Access Token with your desired expiration date.

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/saas/home-(1).png"/>
</p>

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/saas/settings.png"/>
</p>

To configure your instance of DataHub as the destination for ingestion, set the "server" field of your recipe to point to your Acryl instance's domain suffixed by the path `/gms`, as shown below.
A complete example of a DataHub recipe file, which reads from MySQL and writes into a DataHub instance:

```yaml
# example-recipe.yml

# MySQL source configuration
source:
  type: mysql
  config:
    username: root
    password: password
    host_port: localhost:3306

# Recipe sink configuration.
sink:
  type: "datahub-rest"
  config:
    server: "https://<your domain name>.acryl.io/gms"
    token: <Your API key>
```

:::info
Your API key is a signed JSON Web Token that is valid for 6 months from the date of issuance. Please keep this key secure & avoid sharing it.
If your key is compromised for any reason, please reach out to the Acryl team at support@acryl.io.
:::

### Step 4: Running Ingestion

The final step requires invoking the DataHub CLI to ingest metadata based on your recipe configuration file.
To do so, simply run `datahub ingest` with a pointer to your YAML recipe file:

```shell
datahub ingest -c ./example-recipe.yml
```

### Step 5: Scheduling Ingestion

Ingestion can either be run in an ad-hoc manner by a system administrator or scheduled for repeated executions. Most commonly, ingestion will be run on a daily cadence.
To schedule your ingestion job, we recommend using a job scheduler like [Apache Airflow](https://airflow.apache.org/). In cases of simpler deployments, a CRON job scheduled on an always-up machine can also work.
Note that each source system will require a separate recipe file. This allows you to schedule ingestion from different sources independently or together.

_Looking for information on real-time ingestion? Click_ [_here_](docs/lineage/airflow.md)_._

_Note: Real-time ingestion setup is not recommended for an initial POC as it generally takes longer to configure and is prone to inevitable system errors._
@@ -49,7 +49,7 @@ Acryl DataHub employs a push-based metadata ingestion model. In practice, this m

This approach comes with another benefit: security. By managing your own instance of the agent, you can keep the secrets and credentials within your walled garden. Skip uploading secrets & keys into a third-party cloud tool.

To push metadata into DataHub, Acryl provides an ingestion framework written in Python. Typically, push jobs are run on a schedule at an interval of your choosing. For our step-by-step guide on ingestion, click [here](../../metadata-ingestion/cli-ingestion.md).

### Discovering Metadata
@@ -1,56 +1,108 @@

# CLI Ingestion

Batch ingestion involves extracting metadata from a source system in bulk. Typically, this happens on a predefined schedule using the [Metadata Ingestion](../docs/components.md#ingestion-framework) framework.
The metadata that is extracted includes point-in-time instances of dataset, chart, dashboard, pipeline, user, group, usage, and task metadata.

## Installing DataHub CLI

:::note Required Python Version
Installing DataHub CLI requires Python 3.8+.
:::

Run the following commands in your terminal:

```shell
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
# validate that the install was successful
datahub version
# If you see "command not found", try running this instead:
python3 -m datahub version
```

Your command line should return the proper version of DataHub upon executing these commands successfully.

Check out the [CLI Installation Guide](../docs/cli.md#installation) for more installation options and troubleshooting tips.

## Installing Connector Plugins

Our CLI follows a plugin architecture. You must install connectors for different data sources individually.
For a list of all supported data sources, see [the open source docs](../docs/cli.md#sources).
Once you've found the connectors you care about, simply install them using `pip install`.
For example, to install the required `datahub-rest` sink plugin and the `mysql` connector, you can run

```shell
pip install 'acryl-datahub[datahub-rest]'    # install the required sink plugin
pip install --upgrade 'acryl-datahub[mysql]' # install the mysql source connector
```

Check out the [alternative installation options](../docs/cli.md#alternate-installation-options) for more reference.

## Configuring a Recipe

Create a [Recipe](recipe_overview.md) YAML file that defines the source and sink for metadata, as shown below.

```yaml
# recipe.yml
source:
  type: <source_name>
  config:
    option_1: <value>
    ...

sink:
  type: <sink_type_name>
  config:
    ...
```

The **source** configuration block defines where to extract metadata from. This can be an OLTP database system, a data warehouse, or something as simple as a file. Each source has custom configuration depending on what is required to access metadata from the source. To see configurations required for each supported source, refer to the [Sources](source_overview.md) documentation.

The **sink** configuration block defines where to push metadata into. Each sink type requires specific configurations, which are detailed in the [Sinks](sink_overview.md) documentation.

To configure your instance of DataHub as the destination for ingestion, set the "server" field of your recipe to point to your Acryl instance's domain suffixed by the path `/gms`, as shown below.
A complete example of a DataHub recipe file, which reads from MySQL and writes into a DataHub instance:

```yaml
# example-recipe.yml

# MySQL source configuration
source:
  type: mysql
  config:
    username: root
    password: password
    host_port: localhost:3306

# Recipe sink configuration.
sink:
  type: "datahub-rest"
  config:
    server: "https://<your domain name>.acryl.io/gms"
    token: <Your API key>
```

For more information and examples on configuring recipes, please refer to [Recipes](recipe_overview.md).

### Using Recipes with Authentication

In Acryl DataHub deployments, only the `datahub-rest` sink is supported, which simply means that metadata will be pushed to the REST endpoints exposed by your DataHub instance. The required configurations for this sink are:

1. **server**: the location of the REST API exposed by your instance of DataHub
2. **token**: a unique API key used to authenticate requests to your instance's REST API

The token can be retrieved by logging in as admin. Go to the Settings page and generate a Personal Access Token with your desired expiration date.

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/saas/home-(1).png"/>
</p>

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/saas/settings.png"/>
</p>

:::info Secure Your API Key
Please keep your API key secure & avoid sharing it.
If you are on Acryl Cloud and your key is compromised for any reason, please reach out to the Acryl team at support@acryl.io.
:::
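
To keep the raw token out of recipe files, recipes support environment-variable expansion. Below is a minimal sketch of the sink block; the variable name `DATAHUB_GMS_TOKEN` is illustrative, and you export it yourself before running ingestion:

```yaml
# Before running ingestion, export the secret in your shell, e.g.:
#   export DATAHUB_GMS_TOKEN='<Your API key>'
sink:
  type: "datahub-rest"
  config:
    server: "https://<your domain name>.acryl.io/gms"
    token: ${DATAHUB_GMS_TOKEN} # expanded from the environment at run time
```

Alternatively, recent CLI versions let you run `datahub init` once to store the server and token in `~/.datahubenv`, in which case the sink block can be omitted from recipes.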

## Ingesting Metadata

The final step is invoking the DataHub CLI to ingest metadata based on your recipe configuration file.
To do so, simply run `datahub ingest` with a pointer to your YAML recipe file:

```shell
datahub ingest -c <path/to/recipe.yml>
```
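
Before wiring a new recipe into a schedule, it can help to validate it without writing anything to the sink. Here is a sketch using two flags of `datahub ingest`; confirm they are available on your CLI version via `datahub ingest --help`:

```shell
# run the full pipeline but skip writing to the sink
datahub ingest -c <path/to/recipe.yml> --dry-run

# ingest only a small sample of workunits for a quick spot-check
datahub ingest -c <path/to/recipe.yml> --preview
```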

## Scheduling Ingestion

Ingestion can either be run in an ad-hoc manner by a system administrator or scheduled for repeated executions. Most commonly, ingestion will be run on a daily cadence.
To schedule your ingestion job, we recommend using a job scheduler like [Apache Airflow](https://airflow.apache.org/). In cases of simpler deployments, a CRON job scheduled on an always-up machine can also work.
Note that each source system will require a separate recipe file. This allows you to schedule ingestion from different sources independently or together.
Learn more about scheduling ingestion in the [Scheduling Ingestion Guide](/metadata-ingestion/schedule_docs/intro.md).
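
For the simple CRON route, here is a sketch of a crontab entry that runs one recipe daily at 5 AM; the paths are illustrative, and `datahub` must be on the crontab user's `PATH`:

```shell
# m h dom mon dow  command
0 5 * * * datahub ingest -c /home/ubuntu/recipes/mysql-recipe.yml >> /var/log/datahub-ingest.log 2>&1
```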

## Reference

Please refer to the following pages for advanced guides on CLI ingestion.

@@ -59,8 +111,10 @@ Please refer to the following pages for advanced guides on CLI ingestion.

- [UI Ingestion Guide](../docs/ui-ingestion.md)

:::tip Compatibility
DataHub server uses a 3-digit versioning scheme, while the CLI uses a 4-digit scheme. For example, if you're using DataHub server version 0.10.0, you should use CLI version 0.10.0.x, where x is a patch version.
We do this because we do CLI releases at a much higher frequency than server releases, usually every few days vs twice a month.

For ingestion sources, any breaking changes will be highlighted in the [release notes](../docs/how/updating-datahub.md). When fields are deprecated or otherwise changed, we will try to maintain backwards compatibility for two server releases, which is about 4-6 weeks. The CLI will also print warnings whenever deprecated options are used.
:::
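
As a sketch of what that pinning looks like in practice (version numbers illustrative), check your server version in the DataHub UI or release notes, then install a matching 4-digit CLI release:

```shell
# server is on 0.10.0 in this example, so pin the CLI to a 0.10.0.x patch release
python3 -m pip install --upgrade 'acryl-datahub==0.10.0.1'
datahub version # confirm the pinned CLI version
```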