Docs Lineage, DBT and Usage (#6101)

* doc migration for lineage dbt and usage

* small fixes

Co-authored-by: Onkar Ravgan <onkarravgan@Onkars-MacBook-Pro.local>
Onkar Ravgan 2022-07-18 10:13:23 +05:30 committed by GitHub
parent 7e0a7dcc5e
commit 5af56a88a9
28 changed files with 423 additions and 22 deletions

View File

@@ -346,13 +346,21 @@ site_menu:
- category: OpenMetadata / Ingestion / Workflows/ Metadata / DBT
url: /openmetadata/ingestion/workflows/metadata/dbt
- category: OpenMetadata / Ingestion / Workflows/ Metadata / DBT / Ingest DBT UI
url: /openmetadata/ingestion/workflows/metadata/dbt/ingest-dbt-ui
- category: OpenMetadata / Ingestion / Workflows/ Metadata / DBT / Ingest DBT CLI
url: /openmetadata/ingestion/workflows/metadata/dbt/ingest-dbt-cli
- category: OpenMetadata / Ingestion / Workflows / Usage
url: /openmetadata/ingestion/workflows/usage
- category: OpenMetadata / Ingestion / Workflows / Usage / Usage Workflow Through Query Logs
url: /openmetadata/ingestion/workflows/usage/usage-workflow-query-logs
- category: OpenMetadata / Ingestion / Workflows / Profiler
url: /openmetadata/ingestion/workflows/profiler
- category: OpenMetadata / Ingestion / Lineage
url: /openmetadata/ingestion/lineage
- category: OpenMetadata / Ingestion / Lineage / Edit Data Lineage Manually
url: /openmetadata/ingestion/lineage/edit-lineage-manually
- category: OpenMetadata / Ingestion / Versioning
url: /openmetadata/ingestion/versioning
- category: OpenMetadata / Ingestion / Versioning / Change Feeds

View File

@@ -1,10 +0,0 @@
---
title: Entity Lineage
slug: /openmetadata/ingestion/lineage
---
# Entity Lineage
- Automated lineage (Usage workflow + views)
- Manual Lineage
- Tools we use

View File

@@ -0,0 +1,10 @@
---
title: Edit Data Lineage Manually
slug: /openmetadata/ingestion/lineage/edit-lineage-manually
---
# Edit Data Lineage Manually
Edit lineage to provide a richer understanding of the provenance of data. The OpenMetadata no-code editor provides a drag-and-drop interface. Drop tables, pipelines, and dashboards onto the lineage graph. You may add new edges or delete existing edges to better represent data lineage.
![gif](/images/openmetadata/ingestion/lineage/edit-lineage-manually.gif)
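Lineage edges can also be managed programmatically. The sketch below is a rough, hedged example; the endpoint path, payload shape, and authentication shown here are assumptions and should be checked against the OpenMetadata API reference for your version.

```bash
# Hypothetical sketch: add a lineage edge between two tables by entity ID.
# Verify the endpoint, auth headers, and payload against the OpenMetadata API docs.
curl -X PUT "http://localhost:8585/api/v1/lineage" \
  -H "Content-Type: application/json" \
  -d '{
        "edge": {
          "fromEntity": { "id": "<source-table-uuid>", "type": "table" },
          "toEntity":   { "id": "<target-table-uuid>", "type": "table" }
        }
      }'
```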

View File

@@ -0,0 +1,12 @@
---
title: Lineage Ingestion
slug: /openmetadata/ingestion/lineage
---
# Lineage Ingestion
A large subset of the connectors distributed with OpenMetadata includes support for lineage ingestion. Lineage ingestion processes queries to determine upstream and downstream entities for data assets. Lineage is published to the OpenMetadata catalog when metadata is ingested.
Using the OpenMetadata user interface and API, you may trace the path of data across tables, pipelines, and dashboards.
![gif](/images/openmetadata/ingestion/lineage/lineage-ingestion.gif)
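Once lineage has been published, it can be read back over the REST API as well as in the UI. The snippet below is a minimal sketch; the path and the `upstreamDepth`/`downstreamDepth` parameters are assumptions to be verified in the OpenMetadata API reference.

```bash
# Hypothetical sketch: fetch lineage for a table by its fully qualified name.
curl "http://localhost:8585/api/v1/lineage/table/name/<service.database.schema.table>?upstreamDepth=1&downstreamDepth=1"
```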

View File

@@ -1,6 +0,0 @@
---
title: DBT Integration
slug: /openmetadata/ingestion/workflows/metadata/dbt
---
# DBT Integration

View File

@@ -0,0 +1,26 @@
---
title: DBT Integration
slug: /openmetadata/ingestion/workflows/metadata/dbt
---
# DBT Integration
### What is DBT?
A DBT model provides transformation logic that creates a table from raw data.
DBT (data build tool) enables analytics engineers to transform data in their warehouses by simply writing select statements. DBT handles turning these select statements into [tables](https://docs.getdbt.com/terms/table) and [views](https://docs.getdbt.com/terms/view).
DBT does the T in [ELT](https://docs.getdbt.com/terms/elt) (Extract, Load, Transform) processes: it doesn't extract or load data, but it is extremely good at transforming data that is already loaded into your warehouse.
For information regarding setting up a DBT project and creating models, please refer to the official DBT documentation [here](https://docs.getdbt.com/docs/introduction).
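For illustration only, a DBT model is simply a SQL file containing a select statement; the model, schema, and column names below are made up:

```sql
-- models/customers.sql (illustrative example)
-- DBT materializes this select statement as a "customers" table or view
-- in the target warehouse schema.
select
    id as customer_id,
    first_name,
    last_name
from raw_jaffle_shop.customers
```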
### DBT Integration in OpenMetadata
OpenMetadata includes an integration for DBT that enables you to see what models are being used to generate tables.
OpenMetadata parses the [manifest](https://docs.getdbt.com/reference/artifacts/manifest-json) and [catalog](https://docs.getdbt.com/reference/artifacts/catalog-json) JSON files and shows the queries from which the models are generated.
Metadata regarding the tables and views generated via DBT is also ingested and can be viewed in OpenMetadata.
![gif](/images/openmetadata/ingestion/workflows/metadata/dbt-integration.gif)

View File

@@ -0,0 +1,154 @@
---
title: DBT Ingestion CLI
slug: /openmetadata/ingestion/workflows/metadata/dbt/ingest-dbt-cli
---
# Add DBT while ingesting from CLI
Provide and configure the DBT manifest and catalog file source locations.
## Requirements
Refer to the documentation [here](https://docs.getdbt.com/docs/introduction) to set up a DBT project, generate the DBT models, and store them in the catalog and manifest files.
Please make sure the necessary permissions are enabled so that the files can be read from their respective sources.
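For reference, in a typical DBT project these artifacts are written to the project's `target/` directory when the project is run and documented, for example:

```bash
# Run from the root of the DBT project.
# `dbt run` (or `dbt compile`) writes target/manifest.json;
# `dbt docs generate` writes target/catalog.json (and manifest.json).
dbt run
dbt docs generate
```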
## Setting up a Redshift source connector with DBT
DBT can be ingested with source connectors like Redshift, Snowflake, BigQuery, and other connectors that support DBT.
For a detailed list of connectors that support DBT [click here](https://docs.getdbt.com/docs/available-adapters).
The example below shows ingesting DBT with a Redshift service.
### Add a Redshift service in OpenMetadata
Below is a sample YAML config for a Redshift service. Add the DBT source for the manifest.json and catalog.json files to it.
```yaml
source:
  type: redshift
  serviceName: aws_redshift
  serviceConnection:
    config:
      hostPort: cluster.name.region.redshift.amazonaws.com:5439
      username: username
      password: strong_password
      database: dev
      type: Redshift
  sourceConfig:
    config:
      schemaFilterPattern:
        excludes:
          - information_schema.*
          - '[\w]*event_vw.*'
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: no-auth
```
Modify the `sourceConfig` section of the YAML config as shown below, according to the preferred source for the DBT manifest.json and catalog.json files.
### Add DBT Source
DBT sources for the manifest and catalog files can be configured as shown in the YAML configs below.
#### AWS S3 Buckets
OpenMetadata connects to the AWS S3 bucket via the credentials provided and scans the S3 buckets for `manifest.json` and `catalog.json` files.
The name of the S3 bucket and the prefix path to the folder in which the `manifest.json` and `catalog.json` files are stored can be provided. If these parameters are not provided, all the buckets are scanned for the files.
```yaml
sourceConfig:
  config:
    dbtConfigSource:
      dbtSecurityConfig:
        awsAccessKeyId: <AWS Access Key Id>
        awsSecretAccessKey: <AWS Secret Access Key>
        awsRegion: <AWS Region>
      dbtPrefixConfig:
        dbtBucketName: <Bucket Name>
        dbtObjectPrefix: <Path of the folder in which dbt files are stored>
```
#### Google Cloud Storage Buckets
OpenMetadata connects to the GCS bucket via the credentials provided and scans the GCS buckets for `manifest.json` and `catalog.json` files.
The name of the GCS bucket and the prefix path to the folder in which the `manifest.json` and `catalog.json` files are stored can be provided. If these parameters are not provided, all the buckets are scanned for the files.
GCS credentials can be stored in two ways:
1. Entering the credentials directly into the config
```yaml
sourceConfig:
  config:
    dbtConfigSource:
      dbtSecurityConfig:
        gcsConfig:
          type: <service_account>
          projectId: <projectId>
          privateKeyId: <privateKeyId>
          privateKey: <privateKey>
          clientEmail: <clientEmail>
          clientId: <clientId>
          authUri: <authUri>
          tokenUri: <tokenUri>
          authProviderX509CertUrl: <authProviderX509CertUrl>
          clientX509CertUrl: <clientX509CertUrl>
      dbtPrefixConfig:
        dbtBucketName: <Bucket Name>
        dbtObjectPrefix: <Path of the folder in which dbt files are stored>
```
2. Entering the path of the file in which the GCS credentials are stored.
```yaml
sourceConfig:
  config:
    dbtConfigSource:
      dbtSecurityConfig:
        gcsConfig: <path of gcs credentials file>
      dbtPrefixConfig:
        dbtBucketName: <Bucket Name>
        dbtObjectPrefix: <Path of the folder in which dbt files are stored>
```
#### Local Storage
The path of the `manifest.json` and `catalog.json` files stored on the local system, or in the container in which the OpenMetadata server is running, can be provided directly.
```yaml
sourceConfig:
  config:
    dbtConfigSource:
      dbtCatalogFilePath: <catalog.json file path>
      dbtManifestFilePath: <manifest.json file path>
```
#### File Server
The file server path of the `manifest.json` and `catalog.json` files stored on a file server can be provided directly.
```yaml
sourceConfig:
  config:
    dbtConfigSource:
      dbtCatalogHttpPath: <catalog.json file path>
      dbtManifestHttpPath: <manifest.json file path>
```
#### DBT Cloud
Follow the link [here](https://docs.getdbt.com/guides/getting-started) to get started with a DBT Cloud account, if you have not set one up already.
OpenMetadata uses the DBT Cloud APIs to fetch the run artifacts (`manifest.json` and `catalog.json`) from the most recent DBT run.
The APIs need to be authenticated using an authentication token. Follow the link [here](https://docs.getdbt.com/dbt-cloud/api-v2#section/Authentication) to generate an authentication token for your DBT Cloud account.
```yaml
sourceConfig:
  config:
    dbtConfigSource:
      dbtCloudAuthToken: dbt_cloud_auth_token
      dbtCloudAccountId: dbt_cloud_account_id
```

View File

@@ -0,0 +1,80 @@
---
title: DBT Ingestion UI
slug: /openmetadata/ingestion/workflows/metadata/dbt/ingest-dbt-ui
---
# Add DBT while ingesting from UI
Provide and configure the DBT manifest and catalog file source locations.
## Requirements
Refer to the documentation [here](https://docs.getdbt.com/docs/introduction) to set up a DBT project, generate the DBT models, and store them in the catalog and manifest files.
Please make sure the necessary permissions are enabled so that the files can be read from their respective sources.
## Setting up a Redshift source connector with DBT
DBT can be ingested with source connectors like Redshift, Snowflake, BigQuery, and other connectors that support DBT.
For a detailed list of connectors that support DBT [click here](https://docs.getdbt.com/docs/available-adapters).
The example below shows ingesting DBT with a Redshift service.
### Add a Redshift service in OpenMetadata
<Image src="/images/openmetadata/ingestion/workflows/metadata/ingest_dbt_ui/add-service.png" alt="select-service" caption="Select Service"/>
<Image src="/images/openmetadata/ingestion/workflows/metadata/ingest_dbt_ui/enter-service-name.png" alt="enter-service-name" caption="Enter name of the service"/>
<Image src="/images/openmetadata/ingestion/workflows/metadata/ingest_dbt_ui/configure-service.png" alt="add-new-service" caption="Configure the service"/>
<Image src="/images/openmetadata/ingestion/workflows/metadata/ingest_dbt_ui/add-ingestion.png" alt="add-ingestion" caption="Add Ingestion"/>
### Add DBT Source
DBT sources for the manifest and catalog files can be configured in the UI as shown below. The DBT files need to be stored in one of these sources.
#### AWS S3 Buckets
OpenMetadata connects to the AWS S3 bucket via the credentials provided and scans the S3 buckets for `manifest.json` and `catalog.json` files.
The name of the S3 bucket and the prefix path to the folder in which the `manifest.json` and `catalog.json` files are stored can be provided. If these parameters are not provided, all the buckets are scanned for the files.
<Image src="/images/openmetadata/ingestion/workflows/metadata/ingest_dbt_ui/s3-bucket.png" alt="aws-s3-bucket" caption="S3 Bucket Config"/>
#### Google Cloud Storage Buckets
OpenMetadata connects to the GCS bucket via the credentials provided and scans the GCS buckets for `manifest.json` and `catalog.json` files.
The name of the GCS bucket and the prefix path to the folder in which the `manifest.json` and `catalog.json` files are stored can be provided. If these parameters are not provided, all the buckets are scanned for the files.
GCS credentials can be stored in two ways:
1. Entering the credentials directly into the form
<Image src="/images/openmetadata/ingestion/workflows/metadata/ingest_dbt_ui/gcs-bucket-form.png" alt="gcs-storage-bucket-form" caption="GCS Bucket config"/>
2. Entering the path of the file in which the GCS credentials are stored.
<Image src="/images/openmetadata/ingestion/workflows/metadata/ingest_dbt_ui/gcs-bucket-path.png" alt="gcs-storage-bucket-path" caption="GCS Bucket Path Config"/>
For more information on Google Cloud Storage authentication click [here](https://cloud.google.com/docs/authentication/getting-started#create-service-account-console).
#### Local Storage
The path of the manifest.json and catalog.json files stored on the local system, or in the container in which the OpenMetadata server is running, can be provided directly.
<Image src="/images/openmetadata/ingestion/workflows/metadata/ingest_dbt_ui/local-storage.png" alt="local-storage" caption="Local Storage Config"/>
#### File Server
The file server path of the manifest.json and catalog.json files stored on a file server can be provided directly.
<Image src="/images/openmetadata/ingestion/workflows/metadata/ingest-dbt-ui/file_server.png" alt="file-server" caption="File Server Config"/>
#### DBT Cloud
Follow the link [here](https://docs.getdbt.com/guides/getting-started) to get started with a DBT Cloud account, if you have not set one up already.
OpenMetadata uses the DBT Cloud APIs to fetch the run artifacts (`manifest.json` and `catalog.json`) from the most recent DBT run.
The APIs need to be authenticated using an authentication token. Follow the link [here](https://docs.getdbt.com/dbt-cloud/api-v2#section/Authentication) to generate an authentication token for your DBT Cloud account.
<Image src="/images/openmetadata/ingestion/workflows/metadata/ingest_dbt_ui/dbt-cloud.png" alt="dbt-cloud" caption="DBT Cloud config"/>

View File

@@ -1,6 +0,0 @@
---
title: Usage Workflow
slug: /openmetadata/ingestion/workflows/usage
---
# Usage Workflow

View File

@@ -0,0 +1,56 @@
---
title: Usage Workflow
slug: /openmetadata/ingestion/workflows/usage
---
# Usage Workflow
Learn how to configure the Usage workflow from the UI to ingest Query history and Lineage data from your data sources.
This workflow is available ONLY for the following connectors:
- [BigQuery](/openmetadata/connectors/database/bigquery)
- [Snowflake](/openmetadata/connectors/database/snowflake)
- [MSSQL](/openmetadata/connectors/database/mssql)
- [Redshift](/openmetadata/connectors/database/redshift)
- [Clickhouse](/openmetadata/connectors/database/clickhouse)
## UI Configuration
Once the metadata ingestion runs correctly and we are able to explore the service Entities, we can add Query Usage and Entity Lineage information.
This will populate the Queries and Lineage tabs of the Table Entity Page.
<Image src="/images/openmetadata/ingestion/workflows/usage/table-entity-page.png" alt="table-entity-page" caption="Table Entity Page"/>
We can create a workflow that will obtain the query log and table creation information from the underlying database and feed it to OpenMetadata. The Usage Ingestion will be in charge of obtaining this data.
### 1. Add a Usage Ingestion
From the Service Page, go to the Ingestions tab to add a new ingestion and click on Add Usage Ingestion.
<Image src="/images/openmetadata/ingestion/workflows/usage/add-ingestion.png" alt="add-ingestion" caption="Add Ingestion"/>
### 2. Configure the Usage Ingestion
Here you can enter the Usage Ingestion details:
<Image src="/images/openmetadata/ingestion/workflows/usage/configure-usage-ingestion.png" alt="configure-usage-ingestion" caption="Configure the Usage Ingestion"/>
<Collapse title="Usage Options">
**Query Log Duration**
Specify the duration in days for which the profiler should capture usage data from the query logs. For example, if you specify 2 as the value for the duration, the data profiler will capture usage information for 48 hours prior to when the ingestion workflow is run.
**Stage File Location**
Specify the absolute file path of the temporary file used to store the query logs before processing.
**Result Limit**
Set the limit for the number of query log results fetched and processed at a time (see the YAML sketch below).
</Collapse>
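For reference, when the same workflow is configured through YAML rather than the UI, these options map to fields in the usage `sourceConfig`. The snippet below is an illustrative sketch; the connector type name and the field names (`queryLogDuration`, `stageFileLocation`, `resultLimit`) are assumptions based on the usage pipeline schema and should be verified against it.

```yaml
source:
  type: redshift-usage          # usage variant of the connector (assumed naming)
  serviceName: aws_redshift
  sourceConfig:
    config:
      queryLogDuration: 2                 # days of query log history to capture
      stageFileLocation: /tmp/query_log   # temporary file for staged queries
      resultLimit: 1000                   # limit on query log results fetched at a time
```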
### 3. Schedule and Deploy
After clicking Next, you will be redirected to the Scheduling form. This will be the same as the Metadata Ingestion. Select your desired schedule and click on Deploy to find the usage pipeline being added to the Service Ingestions.
<Image src="/images/openmetadata/ingestion/workflows/usage/scheule-and-deploy.png" alt="schedule-and-deploy" caption="View Service Ingestion pipelines"/>

View File

@@ -0,0 +1,77 @@
---
title: Usage Workflow Through Query Logs
slug: /openmetadata/ingestion/workflows/usage/usage-workflow-query-logs
---
# Usage Workflow Through Query Logs
The following database connectors support the usage workflow in OpenMetadata:
- [BigQuery](/openmetadata/connectors/database/bigquery)
- [Snowflake](/openmetadata/connectors/database/snowflake)
- [MSSQL](/openmetadata/connectors/database/mssql)
- [Redshift](/openmetadata/connectors/database/redshift)
- [Clickhouse](/openmetadata/connectors/database/clickhouse)
If you are using any other database connector, direct execution of the usage workflow is not possible. This is mainly because these database connectors do not maintain query execution logs, which are required for the usage workflow. This documentation will help you learn how to execute the usage workflow using a query log file for any database connector.
## Query Log File
A query log file is a CSV file that contains the following columns.
- **query:** This field contains the literal query that has been executed in the database.
- **user_name (optional):** Enter the database user name which has executed this query.
- **start_time (optional):** Enter the query execution start time in YYYY-MM-DD HH:MM:SS format.
- **end_time (optional):** Enter the query execution end time in YYYY-MM-DD HH:MM:SS format.
- **aborted (optional):** This field accepts true or false and indicates whether the query was aborted during execution.
- **database_name (optional):** Enter the database name on which the query was executed.
- **schema_name (optional):** Enter the schema name to which the query is associated.
Check out a sample query log file [here](https://github.com/open-metadata/OpenMetadata/blob/main/ingestion/examples/sample_data/glue/query_log.csv).
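For illustration, a minimal query log file with the required `query` column and a few of the optional columns might look like the following (all values are made up):

```csv
query,user_name,start_time,end_time
select * from sales.orders,analyst_user,2022-07-01 10:00:00,2022-07-01 10:00:05
insert into sales.daily_totals select order_total from sales.orders,etl_user,2022-07-01 11:00:00,2022-07-01 11:00:12
```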
## Usage Workflow
In order to run a Usage Workflow, we need to make sure that the Metadata Ingestion Workflow for the corresponding service has already been executed. We will follow these steps to create a YAML configuration that reads the query log file and executes the usage workflow.
### 1. Create a configuration file using template YAML
Create a new file called `query_log_usage.yaml` in the current directory. Note that the current directory should be the openmetadata directory.
Copy and paste the configuration template below into the `query_log_usage.yaml` file you created.
```yaml
source:
  type: query-log-usage
  serviceName: local_mysql
  serviceConnection:
    config:
      type: Mysql
      username: openmetadata_user
      password: openmetadata_password
      hostPort: localhost:3306
      connectionOptions: {}
      connectionArguments: {}
  sourceConfig:
    config:
      queryLogFilePath: <path to query log file>
processor:
  type: query-parser
  config:
    filter: ''
stage:
  type: table-usage
  config:
    filename: /tmp/query_log_usage
bulkSink:
  type: metadata-usage
  config:
    filename: /tmp/query_log_usage
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: no-auth
```
The `serviceName` and `serviceConnection` used in the above config have to be the same as those used during Metadata Ingestion.
The `sourceConfig` is defined [here](https://github.com/open-metadata/OpenMetadata/blob/main/catalog-rest-service/src/main/resources/json/schema/metadataIngestion/databaseServiceQueryUsagePipeline.json).
- **queryLogFilePath:** Enter the file path of the query log CSV file.
### 2. Run with the CLI
First, we will need to save the YAML file. Afterward, and with all requirements installed, we can run:
```bash
metadata ingest -c <path-to-yaml>
```
Note that from connector to connector, this recipe will always be the same. By updating the YAML configuration, you will be able to extract metadata from different sources.

17 binary image files added (not shown); sizes range from 66 KiB to 13 MiB.