FIX: Add postgres usage and lineage details (#8456)

clueless-bot 2022-11-01 00:18:05 +05:30 committed by GitHub
parent ddc66c8392
commit 68c3e0b8fe
5 changed files with 178 additions and 0 deletions

View File

@@ -10,6 +10,7 @@ In this section, we provide guides and references to use the Postgres connector.
Configure and schedule Postgres metadata and profiler workflows from the OpenMetadata UI:
- [Requirements](#requirements)
- [Metadata Ingestion](#metadata-ingestion)
- [Query Usage and Lineage Ingestion](#query-usage-and-lineage-ingestion)
- [Data Profiler](#data-profiler)
- [DBT Integration](#dbt-integration)
@@ -378,6 +379,85 @@ with DAG(
Note that from connector to connector, this recipe will always be the same.
By updating the YAML configuration, you will be able to extract metadata from different sources.
## Query Usage and Lineage Ingestion
To ingest the Query Usage and Lineage information, the `serviceConnection` configuration will remain the same.
However, the `sourceConfig` is now modeled after the usage pipeline [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceQueryUsagePipeline.json).
### 1. Define the YAML Config
This is a sample config for Postgres Usage:
```yaml
source:
  type: postgres
  serviceName: local_postgres
  serviceConnection:
    config:
      type: Postgres
      username: username
      password: password
      hostPort: localhost:5432
      # database: database
  sourceConfig:
    config:
      # Number of days to look back
      queryLogDuration: 7
      # This is a directory that will be DELETED after the usage runs
      stageFileLocation: <path to store the stage file>
      # resultLimit: 1000
      # If instead of getting the query logs from the database we want to pass a file with the queries
      # queryLogFilePath: path-to-file
processor:
  type: query-parser
  config: {}
stage:
  type: table-usage
  config:
    filename: /tmp/postgres_usage
bulkSink:
  type: metadata-usage
  config:
    filename: /tmp/postgres_usage
workflowConfig:
  # loggerLevel: DEBUG  # DEBUG, INFO, WARN or ERROR
  openMetadataServerConfig:
    hostPort: <OpenMetadata host and port>
    authProvider: <OpenMetadata auth provider>
```
#### Source Configuration - Service Connection
You can find all the definitions and types for the `serviceConnection` [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/entity/services/connections/database/postgresConnection.json).
They are the same as for the metadata ingestion workflow.
#### Source Configuration - Source Config
The `sourceConfig` is defined [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceQueryUsagePipeline.json).
- `queryLogDuration`: Configuration to tune how far we want to look back in query logs to process usage data.
- `resultLimit`: Configuration to set the limit on the number of query logs fetched.
#### Processor, Stage and Bulk Sink
These configurations specify where the staging files will be located.
Note that the location is a directory that will be cleaned at the end of the ingestion.
#### Workflow Configuration
The same as the metadata ingestion.
### 2. Run with the CLI
There is an extra requirement to run the Usage pipelines. You will need to install:
```bash
pip3 install --upgrade 'openmetadata-ingestion[postgres]'
```
For the usage workflow creation, the Airflow file will look the same as for the metadata ingestion. Updating the YAML configuration will be enough.
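For reference, here is a minimal sketch of such a DAG. It assumes the `Workflow` runner shipped with `openmetadata-ingestion`; the DAG id, schedule, and retry settings are placeholders to adapt to your environment:
```python
"""Minimal sketch: run the Postgres usage recipe from Airflow."""
from datetime import timedelta

import yaml
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

from metadata.ingestion.api.workflow import Workflow

# Paste the usage YAML recipe defined above
config = """
<your usage YAML recipe>
"""


def metadata_ingestion_workflow():
    # Build and run the usage workflow from the YAML recipe
    workflow_config = yaml.safe_load(config)
    workflow = Workflow.create(workflow_config)
    workflow.execute()
    workflow.raise_from_status()
    workflow.print_status()
    workflow.stop()


with DAG(
    "postgres_usage_workflow",  # placeholder DAG id
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
    description="OpenMetadata usage ingestion workflow",
    start_date=days_ago(1),
    is_paused_upon_creation=False,
    schedule_interval="0 0 * * *",  # e.g. once a day
    catchup=False,
) as dag:
    PythonOperator(
        task_id="ingest_usage",
        python_callable=metadata_ingestion_workflow,
    )
```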
## Data Profiler
The Data Profiler workflow will be using the `orm-profiler` processor.

View File

@@ -10,6 +10,7 @@ In this section, we provide guides and references to use the Postgres connector.
Configure and schedule Postgres metadata and profiler workflows from the OpenMetadata UI:
- [Requirements](#requirements)
- [Metadata Ingestion](#metadata-ingestion)
- [Query Usage and Lineage Ingestion](#query-usage-and-lineage-ingestion)
- [Data Profiler](#data-profiler)
- [DBT Integration](#dbt-integration)
@@ -331,6 +332,90 @@ metadata ingest -c <path-to-yaml>
Note that from connector to connector, this recipe will always be the same. By updating the YAML configuration,
you will be able to extract metadata from different sources.
## Query Usage and Lineage Ingestion
To ingest the Query Usage and Lineage information, the `serviceConnection` configuration will remain the same.
However, the `sourceConfig` is now modeled after the usage pipeline [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceQueryUsagePipeline.json).
### 1. Define the YAML Config
This is a sample config for Postgres Usage:
```yaml
source:
  type: postgres
  serviceName: local_postgres
  serviceConnection:
    config:
      type: Postgres
      username: username
      password: password
      hostPort: localhost:5432
      # database: database
  sourceConfig:
    config:
      # Number of days to look back
      queryLogDuration: 7
      # This is a directory that will be DELETED after the usage runs
      stageFileLocation: <path to store the stage file>
      # resultLimit: 1000
      # If instead of getting the query logs from the database we want to pass a file with the queries
      # queryLogFilePath: path-to-file
processor:
  type: query-parser
  config: {}
stage:
  type: table-usage
  config:
    filename: /tmp/postgres_usage
bulkSink:
  type: metadata-usage
  config:
    filename: /tmp/postgres_usage
workflowConfig:
  # loggerLevel: DEBUG  # DEBUG, INFO, WARN or ERROR
  openMetadataServerConfig:
    hostPort: <OpenMetadata host and port>
    authProvider: <OpenMetadata auth provider>
```
#### Source Configuration - Service Connection
You can find all the definitions and types for the `serviceConnection` [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/entity/services/connections/database/postgresConnection.json).
They are the same as for the metadata ingestion workflow.
#### Source Configuration - Source Config
The `sourceConfig` is defined [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceQueryUsagePipeline.json).
- `queryLogDuration`: Configuration to tune how far we want to look back in query logs to process usage data.
- `resultLimit`: Configuration to set the limit on the number of query logs fetched.
#### Processor, Stage and Bulk Sink
These configurations specify where the staging files will be located.
Note that the location is a directory that will be cleaned at the end of the ingestion.
#### Workflow Configuration
The same as the metadata ingestion.
### 2. Run with the CLI
There is an extra requirement to run the Usage pipelines. You will need to install:
```bash
pip3 install --upgrade 'openmetadata-ingestion[postgres]'
```
After saving the YAML config, we will run the command the same way we did for the metadata ingestion:
```bash
metadata ingest -c <path-to-yaml>
```
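If you prefer to trigger the same recipe from Python rather than through the CLI, here is a minimal sketch, assuming the `Workflow` class from `openmetadata-ingestion` (the file name `postgres_usage.yaml` is a placeholder):
```python
import yaml

from metadata.ingestion.api.workflow import Workflow

# Load the same YAML recipe passed to `metadata ingest -c`
with open("postgres_usage.yaml") as f:
    workflow_config = yaml.safe_load(f)

workflow = Workflow.create(workflow_config)
workflow.execute()
workflow.raise_from_status()  # fail loudly if any step errored
workflow.print_status()
workflow.stop()
```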
## Data Profiler
The Data Profiler workflow will be using the `orm-profiler` processor.

View File

@@ -10,6 +10,7 @@ In this section, we provide guides and references to use the PostgreSQL connector.
Configure and schedule PostgreSQL metadata and profiler workflows from the OpenMetadata UI:
- [Requirements](#requirements)
- [Metadata Ingestion](#metadata-ingestion)
- [Query Usage and Lineage Ingestion](#query-usage-and-lineage-ingestion)
- [Data Profiler](#data-profiler)
- [DBT Integration](#dbt-integration)
@@ -228,6 +229,15 @@ caption="Edit and Deploy the Ingestion Pipeline"
From the Connection tab, you can also Edit the Service if needed.
## Query Usage and Lineage Ingestion
<Tile
icon="manage_accounts"
title="Usage Workflow"
text="Learn more about how to configure the Usage Workflow to ingest Query and Lineage information from the UI."
link="/connectors/ingestion/workflows/usage"
/>
## Data Profiler
<Tile

View File

@@ -39,6 +39,8 @@ OpenMetadata can extract metadata from the following list of 55 connectors:
- [MySQL](/connectors/database/mysql)
- [Oracle](/connectors/database/oracle)
- [Postgres](/connectors/database/postgres)
  - Postgres Metadata
  - Postgres Usage
- [Presto](/connectors/database/presto)
- [Redshift](/connectors/database/redshift)
  - Redshift Metadata

View File

@@ -12,6 +12,7 @@ The following database connectors support the usage workflow in OpenMetadata:
- [Redshift](/connectors/database/redshift)
- [Clickhouse](/connectors/database/clickhouse)
- [Databricks](/connectors/database/databricks)
- [Postgres](/connectors/database/postgres)
If you are using any other database connector, direct execution of the usage workflow is not possible, mainly because those connectors do not maintain the query execution logs required for the usage workflow. This documentation will show you how to execute the usage workflow using a query log file for any database connector, as sketched below.
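As an illustration, here is a minimal sketch of running the usage workflow against a query log file instead of the database's own query history, assuming the `Workflow` runner from `openmetadata-ingestion` (paths and credentials are placeholders):
```python
import yaml

from metadata.ingestion.api.workflow import Workflow

# The recipe is the same as for direct usage ingestion, except that
# sourceConfig points at a query log file via queryLogFilePath.
CONFIG = """
source:
  type: postgres
  serviceName: local_postgres
  serviceConnection:
    config:
      type: Postgres
      username: username
      password: password
      hostPort: localhost:5432
  sourceConfig:
    config:
      # Parse queries from a file instead of the database's query logs
      queryLogFilePath: /path/to/query_log.csv
processor:
  type: query-parser
  config: {}
stage:
  type: table-usage
  config:
    filename: /tmp/postgres_usage
bulkSink:
  type: metadata-usage
  config:
    filename: /tmp/postgres_usage
workflowConfig:
  openMetadataServerConfig:
    hostPort: <OpenMetadata host and port>
    authProvider: <OpenMetadata auth provider>
"""

workflow = Workflow.create(yaml.safe_load(CONFIG))
workflow.execute()
workflow.raise_from_status()
workflow.print_status()
workflow.stop()
```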