Mirror of https://github.com/datahub-project/datahub.git (synced 2025-12-26 09:26:22 +00:00)
docs: merge cli guide (#10464)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
This commit is contained in:
parent 6307eecb96, commit 49d1233403
@@ -219,11 +219,6 @@ module.exports = {
      id: "docs/managed-datahub/approval-workflows",
      className: "saasOnly",
    },
    {
      "Metadata Ingestion With Acryl": [
        "docs/managed-datahub/metadata-ingestion-with-acryl/ingestion",
      ],
    },
    {
      "DataHub API": [
        {
@@ -38,7 +38,7 @@ either Kafka or using the Metadata Store Rest APIs directly. DataHub supports an
a host of capabilities including schema extraction, table & column profiling, usage information extraction, and more.

Getting started with the Ingestion Framework is simple: just define a YAML file and execute the `datahub ingest` command.
Learn more by heading over to the [Metadata Ingestion](https://datahubproject.io/docs/metadata-ingestion/) guide.

## GraphQL API
@@ -1,116 +0,0 @@

# Ingestion

Acryl Metadata Ingestion functions similarly to ingestion in open source DataHub. Sources are configured via [UI Ingestion](docs/ui-ingestion.md) or via a [Recipe](metadata-ingestion/README.md#recipes); ingestion recipes can be scheduled using your system of choice, and metadata can be pushed from anywhere.
This document describes the steps required to ingest metadata from your data sources.

## Batch Ingestion

Batch ingestion involves extracting metadata from a source system in bulk. Typically, this happens on a predefined schedule using the [Metadata Ingestion](metadata-ingestion/README.md#install-from-pypi) framework.
The metadata that is extracted includes point-in-time instances of dataset, chart, dashboard, pipeline, user, group, usage, and task metadata.

### Step 1: Install DataHub CLI

Regardless of how you ingest metadata, you'll need your account subdomain and API key handy.

#### Install from Gemfury Private Repository

**Installing from the command line with pip**

Determine the version you would like to install, obtain a read access token by requesting a one-time secret from the Acryl team, and then run the following command:

`python3 -m pip install acryl-datahub==<VERSION> --index-url https://<TOKEN>:@pypi.fury.io/acryl-data/`

#### Install from PyPI for OSS Release

Run the following commands in your terminal:

```shell
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
python3 -m datahub version
```

_Note: Requires Python 3.6+_

Your command line should return the proper version of DataHub upon executing these commands successfully.

### Step 2: Install Connector Plugins

Our CLI follows a plugin architecture. You must install connectors for different data sources individually. For a list of all supported data sources, see [the open source docs](metadata-ingestion/README.md#installing-plugins).
Once you've found the connectors you care about, simply install them using `pip install`.
For example, to install the `mysql` connector, you can run

```shell
pip install --upgrade 'acryl-datahub[mysql]'
```

### Step 3: Write a Recipe

[Recipes](metadata-ingestion/README.md#recipes) are YAML configuration files that serve as input to the Metadata Ingestion framework. Each recipe file defines a single source to read from and a single destination to push the metadata to.
The two most important pieces of the file are the _source_ and _sink_ configuration blocks.
The _source_ configuration block defines where to extract metadata from. This can be an OLTP database system, a data warehouse, or something as simple as a file. Each source has custom configuration depending on what is required to access metadata from the source. To see configurations required for each supported source, refer to the [Sources](metadata-ingestion/README.md#sources) documentation.
The _sink_ configuration block defines where to push metadata into. Each sink type requires specific configurations, which are detailed in the [Sinks](metadata-ingestion/README.md#sinks) documentation.
In Acryl DataHub deployments, you _must_ use a sink of type `datahub-rest`, which simply means that metadata will be pushed to the REST endpoints exposed by your DataHub instance. The required configurations for this sink are:

1. **server**: the location of the REST API exposed by your instance of DataHub
2. **token**: a unique API key used to authenticate requests to your instance's REST API

The token can be retrieved by logging in as admin. Go to the Settings page and generate a Personal Access Token with your desired expiration date.

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/saas/home-(1).png"/>
</p>

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/saas/settings.png"/>
</p>

To configure your instance of DataHub as the destination for ingestion, set the "server" field of your recipe to point to your Acryl instance's domain suffixed by the path `/gms`, as shown below.
A complete example of a DataHub recipe file, which reads from MySQL and writes into a DataHub instance:

```yaml
# example-recipe.yml

# MySQL source configuration
source:
  type: mysql
  config:
    username: root
    password: password
    host_port: localhost:3306

# Recipe sink configuration.
sink:
  type: "datahub-rest"
  config:
    server: "https://<your domain name>.acryl.io/gms"
    token: <Your API key>
```

:::info
Your API key is a signed JSON Web Token that is valid for 6 months from the date of issuance. Please keep this key secure & avoid sharing it.
If your key is compromised for any reason, please reach out to the Acryl team at support@acryl.io.
:::

### Step 4: Running Ingestion

The final step requires invoking the DataHub CLI to ingest metadata based on your recipe configuration file.
To do so, simply run `datahub ingest` with a pointer to your YAML recipe file:

```shell
datahub ingest -c ./example-recipe.yml
```

### Step 5: Scheduling Ingestion

Ingestion can either be run in an ad-hoc manner by a system administrator or scheduled for repeated executions. Most commonly, ingestion will be run on a daily cadence.
To schedule your ingestion job, we recommend using a job scheduler like [Apache Airflow](https://airflow.apache.org/). In cases of simpler deployments, a CRON job scheduled on an always-up machine can also work.
Note that each source system will require a separate recipe file. This allows you to schedule ingestion from different sources independently or together.

_Looking for information on real-time ingestion? Click_ [_here_](docs/lineage/airflow.md)_._

_Note: Real-time ingestion setup is not recommended for an initial POC as it generally takes longer to configure and is prone to inevitable system errors._
@@ -49,7 +49,7 @@ Acryl DataHub employs a push-based metadata ingestion model. In practice, this m

This approach comes with another benefit: security. By managing your own instance of the agent, you can keep the secrets and credentials within your walled garden. Skip uploading secrets & keys into a third-party cloud tool.

To push metadata into DataHub, Acryl provides an ingestion framework written in Python. Typically, push jobs are run on a schedule at an interval of your choosing. For our step-by-step guide on ingestion, click [here](../../metadata-ingestion/cli-ingestion.md).

### Discovering Metadata
@@ -1,56 +1,108 @@

# CLI Ingestion

Batch ingestion involves extracting metadata from a source system in bulk. Typically, this happens on a predefined schedule using the [Metadata Ingestion](../docs/components.md#ingestion-framework) framework.
The metadata that is extracted includes point-in-time instances of dataset, chart, dashboard, pipeline, user, group, usage, and task metadata.

## Installing DataHub CLI

:::note Required Python Version
Installing DataHub CLI requires Python 3.8+.
:::

Run the following commands in your terminal:

```shell
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
# validate that the install was successful
datahub version
# If you see "command not found", try running this instead:
python3 -m datahub version
```

Your command line should return the proper version of DataHub upon executing these commands successfully.

Check out the [CLI Installation Guide](../docs/cli.md#installation) for more installation options and troubleshooting tips.

## Installing Connector Plugins

Our CLI follows a plugin architecture. You must install connectors for different data sources individually.
For a list of all supported data sources, see [the open source docs](../docs/cli.md#sources).
Once you've found the connectors you care about, simply install them using `pip install`.
For example, to install the required `datahub-rest` sink plugin and the `mysql` connector, you can run

```shell
pip install 'acryl-datahub[datahub-rest]'    # install the required sink plugin
pip install --upgrade 'acryl-datahub[mysql]' # install the mysql source connector
```

Check out the [alternative installation options](../docs/cli.md#alternate-installation-options) for more reference.

## Configuring a Recipe

Create a [Recipe](recipe_overview.md) YAML file that defines the source and sink for metadata, as shown below.

```yaml
# recipe.yml
source:
  type: <source_name>
  config:
    option_1: <value>
    ...

sink:
  type: <sink_type_name>
  config:
    ...
```

The **source** configuration block defines where to extract metadata from. This can be an OLTP database system, a data warehouse, or something as simple as a file. Each source has custom configuration depending on what is required to access metadata from the source. To see configurations required for each supported source, refer to the [Sources](source_overview.md) documentation.

The **sink** configuration block defines where to push metadata into. Each sink type requires specific configurations, which are detailed in the [Sinks](sink_overview.md) documentation.

To configure your instance of DataHub as the destination for ingestion, set the "server" field of your recipe to point to your Acryl instance's domain suffixed by the path `/gms`, as shown below.
A complete example of a DataHub recipe file, which reads from MySQL and writes into a DataHub instance:

```yaml
# example-recipe.yml

# MySQL source configuration
source:
  type: mysql
  config:
    username: root
    password: password
    host_port: localhost:3306

# Recipe sink configuration.
sink:
  type: "datahub-rest"
  config:
    server: "https://<your domain name>.acryl.io/gms"
    token: <Your API key>
```

For more information and examples on configuring recipes, please refer to [Recipes](recipe_overview.md).

### Using Recipes with Authentication

In Acryl DataHub deployments, only the `datahub-rest` sink is supported, which simply means that metadata will be pushed to the REST endpoints exposed by your DataHub instance. The required configurations for this sink are:

1. **server**: the location of the REST API exposed by your instance of DataHub
2. **token**: a unique API key used to authenticate requests to your instance's REST API

The token can be retrieved by logging in as admin. Go to the Settings page and generate a Personal Access Token with your desired expiration date.

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/saas/home-(1).png"/>
</p>

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/saas/settings.png"/>
</p>

:::info Secure Your API Key
Please keep your API key secure & avoid sharing it.
If you are on Acryl Cloud and your key is compromised for any reason, please reach out to the Acryl team at support@acryl.io.
:::
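
To keep the raw token out of recipe files, recipes support environment-variable expansion. Below is a minimal sketch of the sink block; the variable name `DATAHUB_GMS_TOKEN` is illustrative, and you export it yourself before running ingestion:

```yaml
# Before running ingestion, export the secret in your shell, e.g.:
#   export DATAHUB_GMS_TOKEN='<Your API key>'
sink:
  type: "datahub-rest"
  config:
    server: "https://<your domain name>.acryl.io/gms"
    token: ${DATAHUB_GMS_TOKEN} # expanded from the environment at run time
```

Alternatively, recent CLI versions let you run `datahub init` once to store the server and token in `~/.datahubenv`, in which case the sink block can be omitted from recipes.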

## Ingesting Metadata

The final step is invoking the DataHub CLI to ingest metadata based on your recipe configuration file.
To do so, simply run `datahub ingest` with a pointer to your YAML recipe file:

```shell
datahub ingest -c <path/to/recipe.yml>
```
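
Before wiring a new recipe into a schedule, it can help to validate it without writing anything to the sink. Here is a sketch using two flags of `datahub ingest`; confirm they are available on your CLI version via `datahub ingest --help`:

```shell
# run the full pipeline but skip writing to the sink
datahub ingest -c <path/to/recipe.yml> --dry-run

# ingest only a small sample of workunits for a quick spot-check
datahub ingest -c <path/to/recipe.yml> --preview
```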

## Scheduling Ingestion

Ingestion can either be run in an ad-hoc manner by a system administrator or scheduled for repeated executions. Most commonly, ingestion will be run on a daily cadence.
To schedule your ingestion job, we recommend using a job scheduler like [Apache Airflow](https://airflow.apache.org/). In cases of simpler deployments, a CRON job scheduled on an always-up machine can also work.
Note that each source system will require a separate recipe file. This allows you to schedule ingestion from different sources independently or together.
Learn more about scheduling ingestion in the [Scheduling Ingestion Guide](/metadata-ingestion/schedule_docs/intro.md).
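
For the simple CRON route, here is a sketch of a crontab entry that runs one recipe daily at 5 AM; the paths are illustrative, and `datahub` must be on the crontab user's `PATH`:

```shell
# m h dom mon dow  command
0 5 * * * datahub ingest -c /home/ubuntu/recipes/mysql-recipe.yml >> /var/log/datahub-ingest.log 2>&1
```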

## Reference

Please refer to the following pages for advanced guides on CLI ingestion.

@@ -59,8 +111,10 @@ Please refer to the following pages for advanced guides on CLI ingestion.

- [UI Ingestion Guide](../docs/ui-ingestion.md)

:::tip Compatibility
DataHub server uses a 3-digit versioning scheme, while the CLI uses a 4-digit scheme. For example, if you're using DataHub server version 0.10.0, you should use CLI version 0.10.0.x, where x is a patch version.
We do this because we do CLI releases at a much higher frequency than server releases, usually every few days vs twice a month.

For ingestion sources, any breaking changes will be highlighted in the [release notes](../docs/how/updating-datahub.md). When fields are deprecated or otherwise changed, we will try to maintain backwards compatibility for two server releases, which is about 4-6 weeks. The CLI will also print warnings whenever deprecated options are used.
:::
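
As a sketch of what that pinning looks like in practice (version numbers illustrative), check your server version in the DataHub UI or release notes, then install a matching 4-digit CLI release:

```shell
# server is on 0.10.0 in this example, so pin the CLI to a 0.10.0.x patch release
python3 -m pip install --upgrade 'acryl-datahub==0.10.0.1'
datahub version # confirm the pinned CLI version
```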