mirror of https://github.com/open-metadata/OpenMetadata.git (synced 2025-10-11 16:58:38 +00:00)

Docs - Lineage (#6883)

* Update lineage
* Update roadmap
* Add Mode to the list
* Add lineage information

parent 3a46917cb2
commit b2cf8993a2
@@ -16,7 +16,7 @@
 - [Features](#features)
 - [Try our Sandbox](#try-our-sandbox)
 - [Install & Run](#install-and-run-openmetadata)
-- [Roadmap](roadmap.md)
+- [Roadmap](https://docs.open-metadata.org/overview/roadmap)
 - [Documentation and support](#documentation-and-support)
 - [Contributors](#contributors)
 - [License](#license)
@@ -7,6 +7,7 @@ slug: /openmetadata/connectors/dashboard

 - [Looker](/openmetadata/connectors/dashboard/looker)
 - [Metabase](/openmetadata/connectors/dashboard/metabase)
+- [Mode](/openmetadata/connectors/dashboard/mode)
 - [PowerBI](/openmetadata/connectors/dashboard/powerbi)
 - [Redash](/openmetadata/connectors/dashboard/redash)
 - [Superset](/openmetadata/connectors/dashboard/superset)
@@ -38,6 +38,7 @@ OpenMetadata can extract metadata from the following list of connectors:

 - [Looker](/openmetadata/connectors/dashboard/looker)
 - [Metabase](/openmetadata/connectors/dashboard/metabase)
+- [Mode](/openmetadata/connectors/dashboard/mode)
 - [PowerBI](/openmetadata/connectors/dashboard/powerbi)
 - [Redash](/openmetadata/connectors/dashboard/redash)
 - [Superset](/openmetadata/connectors/dashboard/superset)
@@ -192,3 +192,11 @@ and downstream for outlets) between the Pipeline and Table Entities.

It is important to get the naming right, as we will fetch the Table Entity by its FQN. If no lineage information
is specified, we will just ingest the Pipeline Entity without adding anything further.

<Note>

While we are showing here how to parse the lineage using the Lineage Backend, the setup of `inlets` and `outlets`
is also supported through external metadata ingestion from Airflow, be it via the UI, the CLI, or by directly running
an extraction DAG from Airflow itself.

</Note>
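To make the naming requirement concrete, here is a minimal sketch of building Table FQNs for `inlets` and `outlets`. The `table_fqn` helper and the `{"tables": [...]}` dictionary shape are illustrative assumptions, not the backend's actual schema; check the Lineage Backend documentation for the exact entity format your version expects.

```python
def table_fqn(service: str, database: str, schema: str, table: str) -> str:
    # OpenMetadata names Tables as service.database.schema.table; the
    # Lineage Backend fetches each Table Entity by this FQN, so every
    # level has to match what was ingested.
    return ".".join((service, database, schema, table))

# Illustrative task-level lineage declaration: Airflow operators accept
# `inlets`/`outlets` kwargs that a lineage backend can read after the
# task runs. The {"tables": [...]} shape is an assumption here.
lineage_args = {
    "inlets": {"tables": [table_fqn("snowflake_prod", "analytics", "public", "table_a")]},
    "outlets": {"tables": [table_fqn("snowflake_prod", "analytics", "public", "my_view")]},
}
```

If any FQN level differs from what the metadata ingestion created, the lookup fails silently and the lineage edge is simply not added, which is why getting the naming right matters.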
@@ -5,8 +5,158 @@ slug: /openmetadata/ingestion/lineage

# Lineage Ingestion

A large subset of connectors distributed with OpenMetadata includes support for lineage ingestion. Lineage ingestion processes
queries to determine upstream and downstream entities for data assets. Lineage is published to the OpenMetadata catalog when metadata is ingested.

Using the OpenMetadata user interface and API, you may trace the path of data across Tables, Pipelines, and Dashboards.

![gif](/images/openmetadata/ingestion/lineage/lineage-ingestion.gif)

Lineage ingestion is specific to the type of Entity being processed. Below we explain the ingestion process
for the supported services.

The team is continuously working to increase the lineage coverage of the available services. Do not hesitate
to [reach out](https://slack.open-metadata.org/) if you have any questions, issues, or requests!
## Database Services

Here we have three lineage sources, divided into different workflows, but mostly built around a **Query Parser**.

### View Lineage

During the Metadata Ingestion workflow we identify whether a Table is a View. For those sources where we can
obtain the query that generates the View (e.g., Snowflake allows us to pick up the View query from the DDL),
we keep that query for later processing.

After all Tables have been ingested in the workflow, it's time to [parse](https://sqllineage.readthedocs.io/en/latest/)
all the queries generating Views. During the query parsing, we obtain the source and target tables, search whether the
Tables exist in OpenMetadata, and finally create the lineage relationship between the involved Entities.

Let's go over this process with an example. Suppose we have the following DDL:
```sql
CREATE OR REPLACE VIEW schema.my_view
AS SELECT ... FROM schema.table_a JOIN another_schema.table_b;
```
From this query we will extract the following information:

1. There are two `source` tables, represented by the strings `schema.table_a` and `another_schema.table_b`.
2. There is a `target` table, `schema.my_view`.

In this case we suppose that the database connection requires us to write the table names as `<schema>.<table>`. However,
there are other possible options. Sometimes we can find just `<table>` in a query, or even `<database>.<schema>.<table>`.

The point here is that we have limited information with which to identify the Table Entity that represents the
table written down in SQL. To close this gap, we run a query against ElasticSearch using the Table FQN.

Once we have identified all the ingredients in OpenMetadata as Entities, we can call the Lineage API to add the
relationship between the nodes.

![view](/images/openmetadata/ingestion/lineage/view-lineage.png)

What we just described is the core process of identifying and ingesting lineage, and it will be reused (fully or partially)
for the rest of the options as well.
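The extraction step can be sketched in isolation. OpenMetadata relies on the [sqllineage](https://sqllineage.readthedocs.io/en/latest/) library for the real work; the naive regex below only illustrates the source/target idea and would break on non-trivial SQL.

```python
import re

def naive_view_lineage(ddl: str):
    # Toy extraction of the target view and source tables from a
    # CREATE VIEW statement; purely illustrative, not the parser
    # OpenMetadata actually uses.
    target = re.search(r"CREATE(?:\s+OR\s+REPLACE)?\s+VIEW\s+([\w.]+)", ddl, re.I)
    sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", ddl, re.I)
    return (target.group(1) if target else None), sources

ddl = (
    "CREATE OR REPLACE VIEW schema.my_view "
    "AS SELECT a.id FROM schema.table_a a "
    "JOIN another_schema.table_b b ON a.id = b.id"
)
target, sources = naive_view_lineage(ddl)
# target  -> 'schema.my_view'
# sources -> ['schema.table_a', 'another_schema.table_b']
```

Each string returned here is exactly the "limited information" mentioned above: it still has to be matched against a Table Entity via an ElasticSearch FQN lookup before any lineage edge can be written.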
### DBT

When configuring an Ingestion Workflow with DBT information, we can parse the nodes in the Manifest JSON to get the data model
lineage. Here we don't need to parse a query to obtain the source and target elements, but we still rely on querying ElasticSearch
to identify the graph nodes as OpenMetadata Entities.

Note that if a Model is not materialized, its data won't be ingested.
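As a rough sketch of the manifest parsing idea: each node in a DBT `manifest.json` lists its parents under `depends_on.nodes`. The excerpt below is a trimmed, hypothetical manifest (real manifests carry many more fields per node), and the ephemeral check stands in for the "not materialized" rule above.

```python
import json

# Trimmed, hypothetical excerpt of a dbt manifest.json.
manifest = json.loads("""
{
  "nodes": {
    "model.jaffle_shop.stg_orders": {
      "depends_on": {"nodes": []},
      "config": {"materialized": "ephemeral"}
    },
    "model.jaffle_shop.orders": {
      "depends_on": {"nodes": ["model.jaffle_shop.stg_orders"]},
      "config": {"materialized": "table"}
    }
  }
}
""")

def lineage_edges(manifest: dict):
    # Yield (upstream, downstream) pairs, skipping non-materialized
    # models as the note above describes.
    for name, node in manifest["nodes"].items():
        if node["config"]["materialized"] == "ephemeral":
            continue
        for parent in node["depends_on"]["nodes"]:
            yield parent, name

edges = list(lineage_edges(manifest))
# edges -> [('model.jaffle_shop.stg_orders', 'model.jaffle_shop.orders')]
```

Each edge would then go through the same ElasticSearch lookup to resolve both endpoints to OpenMetadata Entities before calling the Lineage API.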
### Query Log

<Note>

Up until 0.11, Query Log analysis for lineage happens during the Usage Workflow.

From 0.12 onwards, there is a separate Lineage Workflow that takes care of this process.

</Note>

#### How to run?

The main difference here is between those sources that provide internal access to query logs and those that do not. For
services such as:
- [BigQuery](/openmetadata/connectors/database/bigquery)
- [Snowflake](/openmetadata/connectors/database/snowflake)
- [MSSQL](/openmetadata/connectors/database/mssql)
- [Redshift](/openmetadata/connectors/database/redshift)
- [Clickhouse](/openmetadata/connectors/database/clickhouse)

there are specific workflows (Usage & Lineage) that will use the query log information. An alternative for sources not
listed here is to export the query logs yourself and then run
the [workflow](/openmetadata/ingestion/workflows/usage/usage-workflow-query-logs) on top of them.
#### Process

That being said, the process is the same as the one shown for View Lineage above. From the set of queries to
parse, we obtain the `source` and `target` information, use ElasticSearch to identify the Entities in OpenMetadata,
and then send the lineage to the API.
<Note>

When running any query from within OpenMetadata, we add an informational comment to the query text:

```
{"app": "OpenMetadata", "version": <openmetadata-ingestion version>}
```

Note that queries with this text, as well as the ones containing headers from DBT (which follow a similar structure),
will be filtered out when building the query log internally.

</Note>
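The filtering described in the note can be sketched with a simple substring check. This is an illustration only: the marker constant and helper name are assumptions, and the real implementation also recognizes DBT-style headers.

```python
# Marker prefix of the informational comment OpenMetadata injects.
OM_MARKER = '{"app": "OpenMetadata"'

def is_tool_query(query_text: str) -> bool:
    # Queries carrying the OpenMetadata comment were issued by the tool
    # itself and should not feed usage or lineage analysis.
    return OM_MARKER in query_text

queries = [
    '/* {"app": "OpenMetadata", "version": "0.12.0"} */ SELECT * FROM t',
    "SELECT id FROM schema.table_a",
]
user_queries = [q for q in queries if not is_tool_query(q)]
# user_queries keeps only the second query
```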

#### Troubleshooting

Make sure that the tables you are trying to add lineage for are present in OpenMetadata (as well as their upstream and
downstream tables). You might also need to validate that the query logs are available in the tables of each service.

You can check the queries being used here:

- [BigQuery](https://github.com/open-metadata/OpenMetadata/blob/main/ingestion/src/metadata/utils/sql_queries.py#L428)
- [Snowflake](https://github.com/open-metadata/OpenMetadata/blob/main/ingestion/src/metadata/utils/sql_queries.py#L197)
- [MSSQL](https://github.com/open-metadata/OpenMetadata/blob/main/ingestion/src/metadata/utils/sql_queries.py#L350)
- [Redshift](https://github.com/open-metadata/OpenMetadata/blob/main/ingestion/src/metadata/utils/sql_queries.py#L18)
- [Clickhouse](https://github.com/open-metadata/OpenMetadata/blob/main/ingestion/src/metadata/utils/sql_queries.py#L376)

By default, we apply a result limit of 1000 records. You might need to increase it for databases with large volumes
of queries.
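For reference, a hedged sketch of where such a limit could live in a lineage workflow configuration. The `resultLimit` and `queryLogDuration` keys and their placement are assumptions based on the `DatabaseLineage` source config; verify against your version's connector documentation.

```yaml
source:
  type: snowflake-lineage        # hypothetical service name
  serviceName: snowflake_prod
  sourceConfig:
    config:
      type: DatabaseLineage
      queryLogDuration: 1        # days of query history to scan
      resultLimit: 5000          # raise the default limit of 1000 records
```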

## Dashboard Services

When configuring the Ingestion Workflow for Dashboard Services, you can select which Database Services host
the data feeding the Dashboards and Charts.

When ingesting the Dashboard metadata, the workflow will pick up the origin tables (or database, in the case of
PowerBI) and prepare the lineage information.

<Image src="/images/openmetadata/ingestion/lineage/dashboard-ingestion-lineage.png" alt="Dashboard Lineage"/>
## Pipeline Services

The supported services here are [Airflow](/openmetadata/connectors/pipeline/airflow),
[Fivetran](/openmetadata/connectors/pipeline/fivetran), [Dagster](/openmetadata/connectors/pipeline/dagster)
and [Airbyte](/openmetadata/connectors/pipeline/airbyte).

All of them ingest lineage information out of the box. The only special case is Airflow, where one needs to
set up `inlets` and `outlets`. You can find more information about it
[here](https://docs.open-metadata.org/openmetadata/connectors/pipeline/airflow/lineage-backend#adding-lineage).

## Manual Lineage

Sometimes there is knowledge shared among people that is not present in the sources. To enable capturing all
the possible knowledge, you can also add lineage manually with our UI editor.
<InlineCalloutContainer>
  <InlineCallout
    color="violet-70"
    icon="celebration"
    bold="Manual Lineage"
    href="/openmetadata/ingestion/lineage/edit-lineage-manually"
  >
    Capture lineage knowledge with the UI editor.
  </InlineCallout>
</InlineCalloutContainer>
@@ -85,16 +85,9 @@ You can check the latest release [here](/overview/releases).
    bordercolor="blue-70"
  >
    <li>Fivetran</li>
    <li>Mode</li>
    <li>Redpanda</li>
    <li>Dagster</li>
  </Tile>
  <Tile
    title="ML Features"
    text="With the addition of the SageMaker connector"
    background="blue-70"
    bordercolor="blue-70"
  />
</TileContainer>

## 0.13.0 Release - Oct 12th, 2022
(Two binary image files not shown: 180 KiB and 43 KiB.)

roadmap.md (82 lines deleted)

@@ -1,82 +0,0 @@
# OpenMetadata Roadmap

Here is the OpenMetadata Roadmap for the next 3 releases.

We are doing a monthly release, and we are going to evolve fast and adapt to community needs.
The roadmap below is subject to change based on community needs and feedback.

If you would like to prioritize any feature or would like to add a new feature that's not in
our roadmap yet, please file an issue on [GitHub](https://github.com/open-metadata/OpenMetadata/issues) or ping us on [Slack](https://slack.open-metadata.org/).
## 0.4 Release - Sep 20th, 2021

#### Theme: Topics, Dashboards, and Data Profiler

### Support for Kafka (and Pulsar WIP)
* Support for Message Service and Topic entities in schemas, APIs, and UI
* Kafka connector and ingestion support for Confluent Schema Registry

### Support for Dashboards
* Support for Dashboard services, Dashboards, and Charts entities in schemas, APIs, and UI
* Looker, Superset, and Tableau connectors and ingestion support

### User Interface
* Sort search results based on Usage, Relevance, and Last updated time
* Search string highlighted in search results
* Support for Kafka and for Dashboards from Looker, Superset, and Tableau

### Other features
* Pluggable SSO integration - Auth0 support
* Support for Presto

### Work in progress
* Salesforce CRM connector
* Data profiler to profile tables in the ingestion framework and show results on the table details page
## 0.5 Release - Oct 19th, 2021

#### Theme: Data Quality and Lineage

### Support for Lineage
* Lineage-related schemas and APIs
* Lineage metadata integration from Airflow for tables
* Lineage metadata integration from Looker and Superset for Dashboards
* Extra lineage from queries for BigQuery, Hive, Redshift, and Snowflake
* UI changes to show lineage information to the users

### Eventing & Notification framework
* Design for an eventing framework for both internal and external applications
* Schema change event
* Schema change notification

### Other features
* Data quality - Data profiler integration work in progress
* Schema versioning
* Support for Trino
## 0.6 Release - Nov 17th, 2021

#### Theme: User collaboration features

### Support for User Collaboration
* Allow users to ask questions, suggest changes, and request new features for data assets
* Activity feeds for Users and Data assets
* Tracking activity feeds as tasks

### Lineage new features
* Allow users to add lineage information manually at the table and column levels
* Tier propagation to upstream datasets using lineage
* Propagating column-level tags and descriptions using lineage (work in progress)

### Other features
* Metadata Change Event integration into Slack, and a framework for integration into other services such as Kafka or other notification frameworks
* Data Health Report
* Support for Delta Lake, Databricks, and Iceberg