# Lineage
DataHub’s Python SDK allows you to programmatically define and retrieve lineage between metadata entities. With the DataHub Lineage SDK, you can:
- Add **table-level and column-level lineage** across datasets, data jobs, dashboards, and charts
- Automatically **infer lineage from SQL queries**
- **Read lineage** (upstream or downstream) for a given entity or column
- **Filter lineage results** using structured filters
## Getting Started
To use the DataHub SDK, you'll need to install [`acryl-datahub`](https://pypi.org/project/acryl-datahub/) and set up a connection to your DataHub instance. Follow the [installation guide](https://docs.datahub.com/docs/metadata-ingestion/cli-ingestion#installing-datahub-cli) to get started.
Connect to your DataHub instance:
```python
from datahub.sdk import DataHubClient
client = DataHubClient(server="<your_server>", token="<your_token>")
```
- **server**: The URL of your DataHub GMS server
- local: `http://localhost:8080`
- hosted: `https://<your_datahub_url>/gms`
- **token**: You'll need to [generate a Personal Access Token](https://docs.datahub.com/docs/authentication/personal-access-tokens) from your DataHub instance.
## Add Lineage
The `add_lineage()` method allows you to define lineage between two entities.
### Add Entity Lineage
You can create lineage between two datasets, data jobs, dashboards, or charts. The `upstream` and `downstream` parameters should be the URNs of the entities you want to link.
#### Add Entity Lineage Between Datasets
```python
{{ inline /metadata-ingestion/examples/library/add_lineage_dataset_to_dataset.py show_path_as_comment }}
```
#### Add Entity Lineage Between Datajobs
```python
{{ inline /metadata-ingestion/examples/library/lineage_datajob_to_datajob.py show_path_as_comment }}
```
:::note Lineage Combinations
For supported lineage combinations, see [Supported Lineage Combinations](#supported-lineage-combinations).
:::
### Add Column Lineage
You can add column-level lineage by using the `column_lineage` parameter when linking datasets.
#### Add Column Lineage with Fuzzy Matching
```python
{{ inline /metadata-ingestion/examples/library/lineage_dataset_column.py show_path_as_comment }}
```
When `column_lineage` is set to **True**, DataHub will automatically map columns based on their names, allowing for fuzzy matching. This is useful when upstream and downstream datasets have similar but not identical column names (e.g., `customer_id` in the upstream and `CustomerId` in the downstream).
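Conceptually, fuzzy matching can be pictured as comparing normalized column names. The sketch below is illustrative only (it is not DataHub's actual matching algorithm) and uses hypothetical column names:

```python
# Illustrative sketch of name-based fuzzy matching -- NOT DataHub's
# actual algorithm. Normalization lowercases names and strips
# underscores, so "customer_id" and "CustomerId" compare equal.

def normalize(column_name: str) -> str:
    """Lowercase and drop underscores for comparison."""
    return column_name.replace("_", "").lower()

upstream_cols = ["customer_id", "order_total"]
downstream_cols = ["CustomerId", "OrderTotal", "load_ts"]

# Map each downstream column to the upstream columns whose
# normalized names match.
mapping = {
    down: [up for up in upstream_cols if normalize(up) == normalize(down)]
    for down in downstream_cols
}
print(mapping)
# {'CustomerId': ['customer_id'], 'OrderTotal': ['order_total'], 'load_ts': []}
```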
#### Add Column Lineage with Strict Matching
```python
{{ inline /metadata-ingestion/examples/library/lineage_dataset_column_auto_strict.py show_path_as_comment }}
```
This will create column-level lineage with strict matching, meaning the column names must match exactly between upstream and downstream datasets.
#### Add Column Lineage with Custom Mapping
For custom mapping, you can use a dictionary where keys are downstream column names and values represent lists of upstream column names. This allows you to specify complex relationships.
```python
{{ inline /metadata-ingestion/examples/library/lineage_dataset_column_custom_mapping.py show_path_as_comment }}
```
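The mapping itself is a plain dictionary: keys are downstream column names, values are lists of upstream column names. A minimal sketch of the shape, using hypothetical column names:

```python
# Hypothetical custom column mapping: keys are downstream column names,
# values are lists of the upstream columns they are derived from.
column_lineage = {
    "full_name": ["first_name", "last_name"],  # many upstream -> one downstream
    "customer_id": ["customer_id"],            # straight passthrough
}

# Every value is a list, even when there is a single upstream column.
assert all(isinstance(v, list) for v in column_lineage.values())
```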
### Infer Lineage from SQL
You can infer lineage directly from a SQL query using `infer_lineage_from_sql()`. This will parse the query, determine upstream and downstream datasets, and automatically add lineage (including column-level lineage when possible).
```python
{{ inline /metadata-ingestion/examples/library/lineage_dataset_from_sql.py show_path_as_comment }}
```
:::note DataHub SQL Parser
Check out more information on how we handle SQL parsing below.
- [The DataHub SQL Parser Documentation](../../lineage/sql_parsing.md)
- [Blog Post: Extracting Column-Level Lineage from SQL](https://medium.com/datahub-project/extracting-column-level-lineage-from-sql-779b8ce17567)
:::
### Add Query Node with Lineage
If you provide a `transformation_text` to `add_lineage()`, DataHub will create a query node that represents the transformation logic. This is useful for tracking how data is transformed between datasets.
```python
{{ inline /metadata-ingestion/examples/library/add_lineage_dataset_to_dataset_with_query_node.py show_path_as_comment }}
```
Transformation text can be any transformation logic: Python scripts, Airflow DAG code, or any other code that describes how the upstream dataset is transformed into the downstream dataset.
<p align="center">
  <img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/lineage/query-node.png" />
</p>
:::note
Providing `transformation_text` will NOT create column lineage. You need to specify the `column_lineage` parameter to enable column-level lineage.
If you have a SQL query that describes the transformation, you can use [infer_lineage_from_sql](#infer-lineage-from-sql) to automatically parse the query and add column-level lineage.
:::
## Get Lineage
The `get_lineage()` method allows you to retrieve lineage for a given entity.
### Get Entity Lineage
#### Get Upstream Lineage for a Dataset
This will return the direct upstream entities that the dataset depends on. By default, it retrieves only the immediate upstream entities (1 hop).
```python
{{ inline /metadata-ingestion/examples/library/get_lineage_basic.py show_path_as_comment }}
```
#### Get Downstream Lineage for a Dataset Across Multiple Hops
To get upstream/downstream entities that are more than one hop away, you can use the `max_hops` parameter. This allows you to traverse the lineage graph up to a specified number of hops.
```python
{{ inline /metadata-ingestion/examples/library/get_lineage_with_hops.py show_path_as_comment }}
```
:::note USING MAX_HOPS
If you provide `max_hops` greater than 2, the query will traverse the full lineage graph and limit the results by `count`.
:::
#### Return Type
`get_lineage()` returns a list of `LineageResult` objects.
```python
results = [
LineageResult(
urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_2,PROD)",
type="DATASET",
hops=1,
direction="downstream",
platform="snowflake",
name="table_2", # name of the entity
paths=[] # Only populated for column-level lineage
)
]
```
### Get Column-Level Lineage
#### Get Downstream Lineage for a Dataset Column
You can retrieve column-level lineage by specifying the `source_column` parameter. This will return lineage paths that include the specified column.
```python
{{ inline /metadata-ingestion/examples/library/get_column_lineage.py show_path_as_comment }}
```
You can also pass a `SchemaFieldUrn` as the `source_urn` to get column-level lineage.
```python
{{ inline /metadata-ingestion/examples/library/get_column_lineage_from_schemafield.py show_path_as_comment }}
```
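A `SchemaFieldUrn` string embeds the dataset URN together with the field path. The small helper below is purely illustrative (it is not part of the SDK) and shows how the two pieces fit together:

```python
# Illustrative helper (NOT part of the DataHub SDK): split a schemaField
# URN string into its dataset URN and field path.
def split_schema_field_urn(urn: str) -> tuple[str, str]:
    # Strip the outer "urn:li:schemaField:( ... )" wrapper, then split
    # on the last comma, which separates the dataset URN from the field.
    inner = urn.removeprefix("urn:li:schemaField:(").removesuffix(")")
    dataset_urn, _, field_path = inner.rpartition(",")
    return dataset_urn, field_path

urn = "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,table_1,PROD),col1)"
print(split_schema_field_urn(urn))
# ('urn:li:dataset:(urn:li:dataPlatform:snowflake,table_1,PROD)', 'col1')
```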
#### Return Type
The return type is the same as for entity lineage, but with an additional `paths` field that contains column lineage paths.
```python
results = [
LineageResult(
urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_2,PROD)",
type="DATASET",
hops=1,
direction="downstream",
platform="snowflake",
name="table_2", # name of the entity
paths=[
LineagePath(
urn="urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,table_1,PROD),col1)",
column_name="col1", # name of the column
entity_name="table_1", # name of the entity that contains the column
),
LineagePath(
urn="urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,table_2,PROD),col4)",
column_name="col4", # name of the column
entity_name="table_2", # name of the entity that contains the column
)
] # Only populated for column-level lineage
)
]
```
For more details on how to interpret the results, see [Interpreting Column Lineage Results](#interpreting-column-lineage-results).
### Filter Lineage Results
You can filter by platform, type, domain, environment, and more.
```python
{{ inline /metadata-ingestion/examples/library/get_lineage_with_filter.py show_path_as_comment }}
```
You can check more details about the available filters in the [Search SDK documentation](./sdk/search_client.md#filter-based-search).
## Lineage SDK Reference
### Supported Lineage Combinations
The Lineage APIs support the following entity combinations:
| Upstream Entity | Downstream Entity |
| --------------- | ----------------- |
| Dataset | Dataset |
| Dataset | DataJob |
| DataJob | DataJob |
| Dataset | Dashboard |
| Chart | Dashboard |
| Dashboard | Dashboard |
| Dataset | Chart |
> ℹ️ Column-level lineage and creating a query node with transformation text are **only supported** for `Dataset → Dataset` lineage.
### Column Lineage Options
For dataset-to-dataset lineage, you can specify the `column_lineage` parameter in `add_lineage()` in several ways:
| Value | Description |
| --------------- | --------------------------------------------------------------------------------- |
| `False` | Disable column-level lineage (default) |
| `True` | Enable column-level lineage with automatic mapping (same as "auto_fuzzy") |
| `"auto_fuzzy"` | Enable column-level lineage with fuzzy matching (useful for similar column names) |
| `"auto_strict"` | Enable column-level lineage with strict matching (exact column names required) |
| Column Mapping | A dictionary mapping downstream column names to lists of upstream column names |
:::note Auto_Fuzzy vs Auto_Strict
- **Auto_Fuzzy**: Automatically matches columns based on similar names, allowing for some flexibility in naming conventions. For example, these two columns would be considered a match:
  - `user_id` → `userId`
  - `customer_id` → `CustomerId`
- **Auto_Strict**: Requires exact column name matches between upstream and downstream datasets. For example, `customer_id` in the upstream dataset must match `customer_id` in the downstream dataset exactly.
:::
### Interpreting Column Lineage Results
When retrieving column-level lineage, the results include `paths` that show how columns are related across datasets. Each path is a list of column URNs that represent the lineage from the source column to the target column.
For example, let's say we have the following lineage across three tables:
<p align="center">
  <img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/lineage/column-lineage.png" />
</p>
#### Example with `max_hops=1`
```python
>>> client.lineage.get_lineage(
source_urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_1,PROD)",
source_column="col1",
direction="downstream",
max_hops=1
)
```
**Returns:**
```python
[
{
"urn": "...table_2...",
"hops": 1,
"paths": [
["...table_1.col1", "...table_2.col4"],
["...table_1.col1", "...table_2.col5"]
]
}
]
```
#### Example with `max_hops=2`
```python
>>> client.lineage.get_lineage(
source_urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_1,PROD)",
source_column="col1",
direction="downstream",
max_hops=2
)
```
**Returns:**
```python
[
{
"urn": "...table_2...",
"hops": 1,
"paths": [
["...table_1.col1", "...table_2.col4"],
["...table_1.col1", "...table_2.col5"]
]
},
{
"urn": "...table_3...",
"hops": 2,
"paths": [
["...table_1.col1", "...table_2.col4", "...table_3.col7"]
]
}
]
```
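To make the traversal concrete, here is a small sketch that post-processes results shaped like the (abbreviated) output above. The dictionaries are simplified stand-ins for the real `LineageResult` objects, with shortened column references:

```python
# Stand-ins for the abbreviated results above: each path is an ordered
# list of column references from the source column to a reached column.
results = [
    {"urn": "...table_2...", "hops": 1,
     "paths": [["table_1.col1", "table_2.col4"],
               ["table_1.col1", "table_2.col5"]]},
    {"urn": "...table_3...", "hops": 2,
     "paths": [["table_1.col1", "table_2.col4", "table_3.col7"]]},
]

# Group the terminal column of every path by how many hops away it is.
terminals_by_hops = {
    r["hops"]: [path[-1] for path in r["paths"]] for r in results
}
print(terminals_by_hops)
# {1: ['table_2.col4', 'table_2.col5'], 2: ['table_3.col7']}
```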
### Lineage GraphQL Examples
You can also use the GraphQL API to add and retrieve lineage.
#### Add Lineage Between Datasets with GraphQL
```graphql
mutation updateLineage {
updateLineage(
input: {
edgesToAdd: [
{
downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
}
]
edgesToRemove: []
}
)
}
```
#### Get Downstream Lineage with GraphQL
```graphql
query scrollAcrossLineage {
scrollAcrossLineage(
input: {
query: "*"
urn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
count: 10
direction: DOWNSTREAM
orFilters: [
{
and: [
{
condition: EQUAL
negated: false
field: "degree"
values: ["1", "2", "3+"]
}
]
}
]
}
) {
searchResults {
degree
entity {
urn
type
}
}
}
}
```
## FAQ
**Can I get lineage at the column level?**
Yes — for dataset-to-dataset lineage, both `add_lineage()` and `get_lineage()` support column-level lineage.
**Can I pass a SQL query and get lineage automatically?**
Yes — use `infer_lineage_from_sql()` to parse a query and extract table and column lineage.
**Can I use filters when retrieving lineage?**
Yes — `get_lineage()` accepts structured filters via `FilterDsl`, just like in the Search SDK.