---
title: Lineage Ingestion
slug: /connectors/ingestion/lineage
---

# Lineage Ingestion

A large subset of the connectors distributed with OpenMetadata includes support for lineage ingestion. Lineage ingestion processes
queries to determine upstream and downstream entities for data assets. Lineage is published to the OpenMetadata catalog when metadata is ingested.

Using the OpenMetadata user interface and API, you may trace the path of data across Tables, Pipelines, and Dashboards.



Lineage ingestion is specific to the type of the Entity that we are processing. We are going to explain
the ingestion process for the supported services.

The team is continuously working to increase the lineage coverage of the available services. Do not hesitate
to [reach out](https://slack.open-metadata.org/) if you have any questions, issues, or requests!

## Database Services

Here we have three lineage sources, divided into different workflows, but mostly built around a **Query Parser**.

### View Lineage

During the Metadata Ingestion workflow we identify whether a Table is a View. For those sources where we can
obtain the query that generates the View (e.g., Snowflake allows us to pick up the View query from the DDL), we also store that defining query.
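
For illustration, this is how a View definition can be pulled out of Snowflake (a minimal sketch; the view name is a placeholder, and the connector may retrieve the DDL differently under the hood):

```sql
-- Returns the DDL statement that defines the view,
-- e.g. CREATE OR REPLACE VIEW schema.my_view AS SELECT ...
SELECT GET_DDL('VIEW', 'schema.my_view');
```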

After all Tables have been ingested in the workflow, it's time to [parse](https://sqllineage.readthedocs.io/en/latest/)
all the queries generating Views. During the query parsing, we will obtain the source and target tables, check whether those
Tables exist in OpenMetadata, and finally create the lineage relationship between the involved Entities.

Let's go over this process with an example. Suppose we have the following DDL:

```sql
CREATE OR REPLACE VIEW schema.my_view
AS SELECT ... FROM schema.table_a JOIN another_schema.table_b;
```

From this query we will extract the following information:

1. There are two `source` tables, represented by the strings `schema.table_a` and `another_schema.table_b`.
2. There is a `target` table, `schema.my_view`.
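
As a sketch of how this extraction can look with the [sqllineage](https://sqllineage.readthedocs.io/en/latest/) parser linked above (OpenMetadata's internal parsing adds more steps, so treat this as illustrative only):

```python
from sqllineage.runner import LineageRunner

# The SELECT body is spelled out here because the parser needs valid SQL.
sql = """
CREATE OR REPLACE VIEW schema.my_view AS
SELECT *
FROM schema.table_a a
JOIN another_schema.table_b b ON a.id = b.id
"""

runner = LineageRunner(sql)
print(runner.source_tables())  # source tables: schema.table_a, another_schema.table_b
print(runner.target_tables())  # target table: schema.my_view
```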

In this case we suppose that the database connection requires us to write the table names as `<schema>.<table>`. However,
there are other possible options. Sometimes we can find just `<table>` in a query, or even `<database>.<schema>.<table>`.

The point here is that we have limited information with which to identify the Table Entity that represents the
table written down in SQL. To close this gap, we run a query against ElasticSearch using the Table FQN (Fully Qualified Name).
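
The lookup can be pictured as a wildcard search against the table search index. Here is a minimal sketch using the OpenMetadata search API (the host, token, and FQN pieces are placeholders; the ingestion framework uses its own internal helpers for this):

```python
import requests

# Parts of the FQN that the query did not give us (service, database)
# are wildcarded so ElasticSearch can still match the table.
fqn_search = "*.*.schema.table_a"

resp = requests.get(
    "http://localhost:8585/api/v1/search/query",
    params={"q": fqn_search, "index": "table_search_index"},
    headers={"Authorization": "Bearer <token>"},  # placeholder token
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["fullyQualifiedName"])
```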

Once we have identified all the ingredients in OpenMetadata as Entities, we can call the Lineage API to add the
relationship between the nodes.
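
Expressed directly against the REST API, adding one edge looks roughly like this (a sketch; the UUIDs and token are placeholders for the Entities resolved in the previous step):

```python
import requests

edge = {
    "edge": {
        # Entity IDs obtained from the ElasticSearch lookup above
        "fromEntity": {"id": "<table_a-uuid>", "type": "table"},
        "toEntity": {"id": "<my_view-uuid>", "type": "table"},
    }
}

requests.put(
    "http://localhost:8585/api/v1/lineage",
    json=edge,
    headers={"Authorization": "Bearer <token>"},  # placeholder token
)
```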



What we just described is the core process of identifying and ingesting lineage, and it will be reused (or partially reused)
for the rest of the options as well.

### dbt

When configuring an Ingestion Workflow with dbt, we can parse the nodes in the Manifest JSON to get the data model
lineage. Here we don't need to parse a query to obtain the source and target elements, but we still rely on querying ElasticSearch
to identify the graph nodes as OpenMetadata Entities.

Note that if a Model is not materialized, its data won't be ingested.
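
As a minimal sketch of what walking those nodes looks like (the manifest path and node names are illustrative; the real ingestion also handles sources, seeds, and materialization checks):

```python
import json

# Default location of the dbt artifacts after a run
with open("target/manifest.json") as f:
    manifest = json.load(f)

for node_id, node in manifest["nodes"].items():
    if node["resource_type"] != "model":
        continue
    # Upstream nodes this model depends on, e.g.
    # model.jaffle_shop.orders <- ["model.jaffle_shop.stg_orders", ...]
    print(node_id, "<-", node["depends_on"]["nodes"])
```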

### Query Log

<Note>

Up until 0.11, Query Log analysis for lineage happened during the Usage Workflow.

From 0.12 onwards, there is a separate Lineage Workflow that takes care of this process.

</Note>

#### How to run?

The main difference here is between those sources that provide internal access to query logs and those that do not. For
services such as:

- [BigQuery](/connectors/database/bigquery)
- [Snowflake](/connectors/database/snowflake)
- [MSSQL](/connectors/database/mssql)
- [Redshift](/connectors/database/redshift)
- [Clickhouse](/connectors/database/clickhouse)
- [Postgres](/connectors/database/postgres)
- [Databricks](/connectors/database/databricks)

For these services, there are specific workflows (Usage & Lineage) that will use the query log information. An alternative for sources not
listed here is to export the query logs yourself and then run
the [workflow](/connectors/ingestion/workflows/usage/usage-workflow-query-logs) against them.
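
For one of the supported services, the Lineage Workflow is driven by a YAML configuration along these lines (a hedged sketch for Snowflake; connection and auth details are omitted, and exact fields can vary between releases):

```yaml
source:
  type: snowflake-lineage
  serviceName: my_snowflake  # assumed service name
  sourceConfig:
    config:
      type: DatabaseLineage
      queryLogDuration: 1   # days of query history to process
      resultLimit: 1000     # max query log entries fetched per run
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata  # auth credentials omitted
```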

#### Process

That being said, this process is the same as the one shown in View Lineage above. By obtaining a set of queries to
parse, we will obtain the `source` and `target` information, use ElasticSearch to identify the Entities in OpenMetadata,
and then send the lineage to the API.

<Note>

When running any query from within OpenMetadata, we add an informational comment to the query text:

```
{"app": "OpenMetadata", "version": <openmetadata-ingestion version>}
```

Note that queries with this text, as well as the ones containing headers from dbt (which follow a similar structure),
will be filtered out when building the query log internally.

</Note>

#### Troubleshooting

Make sure that the tables that you are trying to add lineage for are present in OpenMetadata (and their upstream/downstream
tables as well). You might also need to validate that the query logs are available in the tables of each service.
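
A quick sanity check is to look the table up by its fully qualified name (a sketch; the FQN, host, and token are placeholders):

```python
import requests

fqn = "my_service.my_database.my_schema.table_a"  # placeholder FQN
resp = requests.get(
    f"http://localhost:8585/api/v1/tables/name/{fqn}",
    headers={"Authorization": "Bearer <token>"},  # placeholder token
)
# 200 means the table is known to OpenMetadata; 404 means it was never ingested
print(resp.status_code)
```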

You can check the queries being used here:

- [BigQuery](https://github.com/open-metadata/OpenMetadata/blob/main/ingestion/src/metadata/utils/sql_queries.py#L428)
- [Snowflake](https://github.com/open-metadata/OpenMetadata/blob/main/ingestion/src/metadata/utils/sql_queries.py#L197)
- [MSSQL](https://github.com/open-metadata/OpenMetadata/blob/main/ingestion/src/metadata/utils/sql_queries.py#L350)
- [Redshift](https://github.com/open-metadata/OpenMetadata/blob/main/ingestion/src/metadata/utils/sql_queries.py#L18)
- [Clickhouse](https://github.com/open-metadata/OpenMetadata/blob/main/ingestion/src/metadata/utils/sql_queries.py#L376)
- [Postgres](https://github.com/open-metadata/OpenMetadata/blob/main/ingestion/src/metadata/utils/sql_queries.py#L467)

By default, we apply a result limit of 1000 records. You might need to increase that for databases with big volumes
of queries.
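
Assuming the workflow YAML sketched earlier, that limit is controlled by the `resultLimit` field:

```yaml
sourceConfig:
  config:
    type: DatabaseLineage
    resultLimit: 5000  # raise the default of 1000 for query-heavy databases
```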

## Dashboard Services

When configuring the Ingestion Workflow for Dashboard Services, you can select which Database Services are hosting
the data feeding the Dashboards and Charts.

When ingesting the Dashboards metadata, the workflow will pick up the origin tables (or database, in the case of
PowerBI) and prepare the lineage information.

<Image src="/images/openmetadata/ingestion/lineage/dashboard-ingestion-lineage.png" alt="Dashboard Lineage"/>

## Pipeline Services

The supported services here are [Airflow](/connectors/pipeline/airflow),
[Fivetran](/connectors/pipeline/fivetran), [Dagster](/connectors/pipeline/dagster),
and [Airbyte](/connectors/pipeline/airbyte).

All of them ingest the lineage information out of the box. The only special case is Airflow, where one needs to
set up `inlets` and `outlets`. You can find more information about it
[here](https://docs.open-metadata.org/connectors/pipeline/airflow/lineage-backend#adding-lineage).
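
As a brief sketch of what that setup looks like in a DAG (following the pattern from the linked lineage backend docs; the table FQNs, DAG name, and command are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("lineage_example", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    # inlets/outlets point at Table FQNs that already exist in OpenMetadata,
    # so the lineage backend can link this task to those Entities.
    transform = BashOperator(
        task_id="transform",
        bash_command="echo 'running transformation'",
        inlets={"tables": ["my_service.my_database.my_schema.table_a"]},
        outlets={"tables": ["my_service.my_database.my_schema.table_b"]},
    )
```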

## Manual Lineage

Sometimes there is information that is shared among people but not present in the sources. To enable capturing all
the possible knowledge, you can also add lineage manually with our UI editor.

<InlineCalloutContainer>
  <InlineCallout
    color="violet-70"
    icon="celebration"
    bold="Manual Lineage"
    href="/connectors/ingestion/lineage/edit-lineage-manually"
  >
    Capture Lineage knowledge with the UI editor.
  </InlineCallout>
</InlineCalloutContainer>