import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Data Lineage

## Why Would You Use Lineage?

Data lineage is used to capture data dependencies within an organization. It allows you to track the inputs from which a data asset is derived, along with the data assets that depend on it downstream.

For more information about data lineage, refer to [About DataHub Lineage](/docs/generated/lineage/lineage-feature-guide.md).

### Goal Of This Guide

This guide will show you how to:

- Add lineage between datasets.
- Add column-level lineage between datasets.

## Prerequisites

For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
For detailed steps, please refer to [DataHub Quickstart Guide](/docs/quickstart.md).

:::note
Before adding lineage, make sure the target datasets already exist in your DataHub instance.
If you attempt to manipulate entities that do not exist, the operation will fail.
In this guide, we will be using data from sample ingestion.
:::

## Add Lineage

<Tabs>
<TabItem value="graphql" label="GraphQL" default>

```graphql
mutation updateLineage {
  updateLineage(
    input: {
      edgesToAdd: [
        {
          downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
          upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
        }
      ]
      edgesToRemove: []
    }
  )
}
```

Note that `edgesToAdd` accepts a list of edges. For example, to assign multiple upstream entities to a single downstream entity, you can do the following:

```graphql
mutation updateLineage {
  updateLineage(
    input: {
      edgesToAdd: [
        {
          downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
          upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
        }
        {
          downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
          upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"
        }
      ]
      edgesToRemove: []
    }
  )
}
```

For more information about the `updateLineage` mutation, please refer to [updateLineage](https://datahubproject.io/docs/graphql/mutations/#updatelineage).

If you see the following response, the operation was successful:

```json
{
  "data": {
    "updateLineage": true
  },
  "extensions": {}
}
```

</TabItem>
<TabItem value="curl" label="Curl">

```shell
curl --location --request POST 'http://localhost:8080/api/graphql' \
--header 'Authorization: Bearer <my-access-token>' \
--header 'Content-Type: application/json' \
--data-raw '{ "query": "mutation updateLineage { updateLineage( input: { edgesToAdd: [ { downstreamUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)\", upstreamUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)\" } ], edgesToRemove: [] } ) }", "variables": {} }'
```

Expected Response:

```json
{ "data": { "updateLineage": true }, "extensions": {} }
```

</TabItem>
<TabItem value="python" label="Python">

```python
{{ inline /metadata-ingestion/examples/library/lineage_emitter_rest.py show_path_as_comment }}
```

</TabItem>
</Tabs>
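
When several upstream datasets feed one downstream dataset, it can be easier to build the edge list programmatically than to write it by hand. The following is a minimal sketch that only constructs the GraphQL request body and does not send it. The helper names `dataset_urn` and `build_update_lineage_payload` are hypothetical, and the variable type name `UpdateLineageInput` is an assumption about the GraphQL schema that should be checked against the `updateLineage` reference.

```python
import json

# Assumption: the mutation's input type is named UpdateLineageInput.
UPDATE_LINEAGE_MUTATION = """
mutation updateLineage($input: UpdateLineageInput!) {
  updateLineage(input: $input)
}
"""


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Format a dataset URN in the shape used throughout this guide."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def build_update_lineage_payload(downstream_urn: str, upstream_urns: list) -> dict:
    """Return a GraphQL request body that adds one edge per upstream URN."""
    edges = [
        {"downstreamUrn": downstream_urn, "upstreamUrn": upstream}
        for upstream in upstream_urns
    ]
    return {
        "query": UPDATE_LINEAGE_MUTATION,
        "variables": {"input": {"edgesToAdd": edges, "edgesToRemove": []}},
    }


payload = build_update_lineage_payload(
    dataset_urn("hive", "logging_events"),
    [
        dataset_urn("hive", "fct_users_deleted"),
        dataset_urn("hive", "fct_users_created"),
    ],
)

# Serialize the variables to inspect what would be sent as the request body.
print(json.dumps(payload["variables"], indent=2))
```

The resulting dictionary could be serialized with `json.dumps` and sent as the body of the same POST request shown in the Curl tab.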
### Expected Outcomes of Adding Lineage

You can now see the lineage between `fct_users_deleted` and `logging_events`.

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/lineage-added.png"/>
</p>

## Add Column-level Lineage

<Tabs>
<TabItem value="python" label="Python">

```python
{{ inline /metadata-ingestion/examples/library/lineage_emitter_dataset_finegrained_sample.py show_path_as_comment }}
```

</TabItem>
</Tabs>

### Expected Outcome of Adding Column-level Lineage

You can now see the column-level lineage between datasets. Note that you have to enable `Show Columns` to see the column-level lineage.

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/column-level-lineage-added.png"/>
</p>

## Read Table Lineage

<Tabs>
<TabItem value="graphql" label="GraphQL" default>

```graphql
query searchAcrossLineage {
  searchAcrossLineage(
    input: {
      query: "*"
      urn: "urn:li:dataset:(urn:li:dataPlatform:dbt,long_tail_companions.adoption.human_profiles,PROD)"
      start: 0
      count: 10
      direction: DOWNSTREAM
      orFilters: [
        {
          and: [
            {
              condition: EQUAL
              negated: false
              field: "degree"
              values: ["1", "2", "3+"]
            }
          ]
        }
      ]
    }
  ) {
    searchResults {
      degree
      entity {
        urn
        type
      }
    }
  }
}
```
This example uses lineage degrees as a filter, but additional search filters can be included here as well.

</TabItem>
<TabItem value="curl" label="Curl">

```shell
curl --location --request POST 'http://localhost:8080/api/graphql' \
--header 'Authorization: Bearer <my-access-token>' \
--header 'Content-Type: application/json' \
--data-raw '{ "query": "query searchAcrossLineage { searchAcrossLineage( input: { query: \"*\" urn: \"urn:li:dataset:(urn:li:dataPlatform:dbt,long_tail_companions.adoption.human_profiles,PROD)\" start: 0 count: 10 direction: DOWNSTREAM orFilters: [ { and: [ { condition: EQUAL negated: false field: \"degree\" values: [\"1\", \"2\", \"3+\"] } ] } ] } ) { searchResults { degree entity { urn type } } } }" }'
```

</TabItem>
<TabItem value="python" label="Python">

```python
{{ inline /metadata-ingestion/examples/library/read_lineage_rest.py show_path_as_comment }}
```

</TabItem>
</Tabs>

This performs a multi-hop lineage search starting from the specified URN. For more information about the `searchAcrossLineage` query, please refer to [searchAcrossLineage](https://datahubproject.io/docs/graphql/queries/#searchacrosslineage).
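
Because a multi-hop search returns entities at several distances from the starting URN, it is often useful to group the results by `degree`. The sketch below assumes the response mirrors the selection set of the query (`searchResults` with `degree` and `entity { urn type }`). The `group_by_degree` helper and the sample response, including its URNs, are hypothetical and only illustrate the shape.

```python
from collections import defaultdict


def group_by_degree(response: dict) -> dict:
    """Map each lineage degree to the URNs found at that distance."""
    results = response["data"]["searchAcrossLineage"]["searchResults"]
    grouped = defaultdict(list)
    for result in results:
        grouped[result["degree"]].append(result["entity"]["urn"])
    return dict(grouped)


# Hypothetical response, shaped like the selection set of the query in this section.
sample_response = {
    "data": {
        "searchAcrossLineage": {
            "searchResults": [
                {
                    "degree": 1,
                    "entity": {
                        "urn": "urn:li:dataset:(urn:li:dataPlatform:dbt,example.downstream_a,PROD)",
                        "type": "DATASET",
                    },
                },
                {
                    "degree": 2,
                    "entity": {
                        "urn": "urn:li:dataset:(urn:li:dataPlatform:dbt,example.downstream_b,PROD)",
                        "type": "DATASET",
                    },
                },
            ]
        }
    }
}

for degree, urns in sorted(group_by_degree(sample_response).items()):
    print(degree, urns)
```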
## Read Column Lineage
<Tabs>
<TabItem value="graphql" label="GraphQL" default>

```graphql
query searchAcrossLineage {
  searchAcrossLineage(
    input: {
      query: "*"
      urn: "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:dbt,long_tail_companions.adoption.human_profiles,PROD),profile_id)"
      start: 0
      count: 10
      direction: DOWNSTREAM
      orFilters: [
        {
          and: [
            {
              condition: EQUAL
              negated: false
              field: "degree"
              values: ["1", "2", "3+"]
            }
          ]
        }
      ]
    }
  ) {
    searchResults {
      degree
      entity {
        urn
        type
      }
    }
  }
}
```
This example uses lineage degrees as a filter, but additional search filters can be included here as well.

</TabItem>
<TabItem value="curl" label="Curl">

```shell
curl --location --request POST 'http://localhost:8080/api/graphql' \
--header 'Authorization: Bearer <my-access-token>' \
--header 'Content-Type: application/json' \
--data-raw '{ "query": "query searchAcrossLineage { searchAcrossLineage( input: { query: \"*\" urn: \"urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:dbt,long_tail_companions.adoption.human_profiles,PROD),profile_id)\" start: 0 count: 10 direction: DOWNSTREAM orFilters: [ { and: [ { condition: EQUAL negated: false field: \"degree\" values: [\"1\", \"2\", \"3+\"] } ] } ] } ) { searchResults { degree entity { urn type } } } }" }'
```

</TabItem>
</Tabs>

This performs a multi-hop lineage search starting from the specified URN. Note that schemaField URNs are made up of two parts: the URN of the table the column belongs to, followed by the path of the column. For more information about the `searchAcrossLineage` query, please refer to [searchAcrossLineage](https://datahubproject.io/docs/graphql/queries/#searchacrosslineage).
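
The two-part structure of a schemaField URN can be illustrated with plain string handling. This is a minimal sketch, assuming the form `urn:li:schemaField:(<dataset URN>,<field path>)` and a dataset name that does not itself contain the sequence `),`. Both helper names are hypothetical and exist only for this example.

```python
SCHEMA_FIELD_PREFIX = "urn:li:schemaField:("


def make_schema_field_urn(dataset_urn: str, field_path: str) -> str:
    """Combine a dataset URN and a column path into a schemaField URN."""
    return f"{SCHEMA_FIELD_PREFIX}{dataset_urn},{field_path})"


def split_schema_field_urn(urn: str) -> tuple:
    """Split a schemaField URN back into (dataset URN, field path)."""
    if not (urn.startswith(SCHEMA_FIELD_PREFIX) and urn.endswith(")")):
        raise ValueError(f"not a schemaField URN: {urn}")
    body = urn[len(SCHEMA_FIELD_PREFIX):-1]
    # The dataset URN ends with its own ")"; the field path follows the next comma.
    dataset_part, separator, field_path = body.partition("),")
    if not separator:
        raise ValueError(f"could not find a field path in: {urn}")
    return dataset_part + ")", field_path


table_urn = "urn:li:dataset:(urn:li:dataPlatform:dbt,long_tail_companions.adoption.human_profiles,PROD)"
column_urn = make_schema_field_urn(table_urn, "profile_id")
print(column_urn)
print(split_schema_field_urn(column_urn))
```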