# Adding Lineage
## Why Would You Add Lineage?
Lineage is used to capture data dependencies within an organization. It allows you to track the inputs from which a data asset is derived, along with the data assets that depend on it downstream.
For more information about lineage, refer to [About DataHub Lineage](/docs/lineage/lineage-feature-guide.md).
### Goal Of This Guide
This guide will show you how to add lineage between two Hive datasets named `fct_users_deleted` and `logging_events`.
## Prerequisites
For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
For detailed steps, please refer to [DataHub Quickstart Guide](/docs/quickstart.md).
:::note
Before adding lineage, make sure the target datasets are already present in your DataHub instance.
If you attempt to manipulate entities that do not exist, your operation will fail.
In this guide, we will be using data from sample ingestion.
:::
## Add Lineage With GraphQL
:::note
Please note that there are two available endpoints (`:8080`, `:9002`) to access `graphql`.
For more information about the differences between these endpoints, please refer to [DataHub Metadata Service](../../../metadata-service/README.md#graphql-api).
:::
### GraphQL Explorer
GraphQL Explorer is the fastest way to experiment with `graphql` without any dependencies.
Navigate to GraphQL Explorer (`http://localhost:9002/api/graphiql`) and run the following query.
```graphql
mutation updateLineage {
  updateLineage(
    input: {
      edgesToAdd: [
        {
          downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
          upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
        }
      ]
      edgesToRemove: []
    }
  )
}
```
Note that you can create a list of edges. For example, if you want to assign multiple upstream entities to a downstream entity, you can do the following.
```graphql
mutation updateLineage {
  updateLineage(
    input: {
      edgesToAdd: [
        {
          downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
          upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
        }
        {
          downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
          upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"
        }
      ]
      edgesToRemove: []
    }
  )
}
```
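When the list of upstream datasets is produced by a script, the same mutation text can be assembled programmatically. The following is a minimal sketch, not part of the DataHub SDK: `dataset_urn` and `build_update_lineage_mutation` are hypothetical helper names, and the only things taken from this guide are the URN format and the shape of the `updateLineage` input shown above.

```python
from typing import Iterable


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN in the format used by the examples in this guide.

    Hypothetical local helper, defined here only for illustration.
    """
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def build_update_lineage_mutation(downstream: str, upstreams: Iterable[str]) -> str:
    """Render an updateLineage mutation with one edge per upstream URN."""
    edges = "\n".join(
        f'        {{ downstreamUrn: "{downstream}", upstreamUrn: "{upstream}" }}'
        for upstream in upstreams
    )
    return (
        "mutation updateLineage {\n"
        "  updateLineage(\n"
        "    input: {\n"
        "      edgesToAdd: [\n"
        f"{edges}\n"
        "      ]\n"
        "      edgesToRemove: []\n"
        "    }\n"
        "  )\n"
        "}"
    )


# Same two edges as the GraphQL example above.
mutation = build_update_lineage_mutation(
    downstream=dataset_urn("hive", "logging_events"),
    upstreams=[
        dataset_urn("hive", "fct_users_deleted"),
        dataset_urn("hive", "fct_users_created"),
    ],
)
print(mutation)
```

The printed text can be pasted into GraphQL Explorer as-is. Building the string by hand keeps the sketch dependency-free; for anything beyond a handful of edges, passing the edges as GraphQL variables would likely be the more robust choice.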
For more information about the `updateLineage` mutation, please refer to [updateLineage](https://datahubproject.io/docs/graphql/mutations/#updatelineage).
If you see the following response, the operation was successful:
```json
{
  "data": {
    "updateLineage": true
  },
  "extensions": {}
}
```
### CURL
With CURL, you need to provide an access token. To generate a token, please refer to [Access Token Management](/docs/api/graphql/token-management.md).
With `accessToken`, you can run the following command.
```shell
curl --location --request POST 'http://localhost:8080/api/graphql' \
--header 'Authorization: Bearer <my-access-token>' \
--header 'Content-Type: application/json' --data-raw '{ "query": "mutation updateLineage { updateLineage( input:{ edgesToAdd : { downstreamUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)\", upstreamUrn : \"urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)\"}, edgesToRemove :{downstreamUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)\",upstreamUrn : \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)\" } })}", "variables":{}}'
```
Expected Response:
```json
{ "data": { "updateLineage": true }, "extensions": {} }
```
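If you would rather send the request from a script than from the shell, the same call can be sketched with Python's standard library. This is an illustrative sketch, not DataHub tooling: `build_request`, `lineage_updated`, and `update_lineage` are hypothetical names, the endpoint is the one used in the CURL command, and the mutation string reuses the single edge from the GraphQL Explorer example.

```python
import json
import urllib.request

# Same endpoint as the CURL command above.
GRAPHQL_URL = "http://localhost:8080/api/graphql"

# Mutation text reused from the GraphQL Explorer example.
MUTATION = """
mutation updateLineage {
  updateLineage(
    input: {
      edgesToAdd: [
        {
          downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
          upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
        }
      ]
      edgesToRemove: []
    }
  )
}
"""


def build_request(query: str, token: str, url: str = GRAPHQL_URL) -> urllib.request.Request:
    """Create a POST request with the same headers the CURL command sends."""
    body = json.dumps({"query": query, "variables": {}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def lineage_updated(response: dict) -> bool:
    """Return True only when the response reports updateLineage as true."""
    data = response.get("data") or {}
    return data.get("updateLineage") is True


def update_lineage(query: str, token: str) -> bool:
    """Send the mutation to a running DataHub instance and check the result."""
    with urllib.request.urlopen(build_request(query, token)) as resp:
        return lineage_updated(json.load(resp))
```

Calling `update_lineage(MUTATION, "<my-access-token>")` against a running Quickstart instance should return `True` when the response matches the expected response shown above. Nothing is sent until `update_lineage` is called, so the helpers can be imported and inspected without a live server.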
## Add Lineage With Python SDK
You can refer to the related code in [lineage_emitter_rest.py](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_emitter_rest.py).
```python
{{ inline /metadata-ingestion/examples/library/lineage_emitter_rest.py show_path_as_comment }}
```
We're using the `MetadataChangeEvent` emitter to change entities in this example.
For more information about the `MetadataChangeEvent`, please refer to [Metadata Change Event (MCE)](/docs/what/mxe.md#metadata-change-event-mce).
## Expected Outcomes
You can now see the lineage between `fct_users_deleted` and `logging_events` .
