# Lineage

DataHub’s Python SDK allows you to programmatically define and retrieve lineage between metadata entities. With the DataHub Lineage SDK, you can:

- Add **table-level and column-level lineage** across datasets, data jobs, dashboards, and charts
- Automatically **infer lineage from SQL queries**
- **Read lineage** (upstream or downstream) for a given entity or column
- **Filter lineage results** using structured filters

## Getting Started

To use the DataHub SDK, you'll need to install [`acryl-datahub`](https://pypi.org/project/acryl-datahub/) and set up a connection to your DataHub instance. Follow the [installation guide](https://docs.datahub.com/docs/metadata-ingestion/cli-ingestion#installing-datahub-cli) to get started.

Connect to your DataHub instance:

```python
from datahub.sdk import DataHubClient

client = DataHubClient(server="<your_server>", token="<your_token>")
```

- **server**: The URL of your DataHub GMS server
  - local: `http://localhost:8080`
  - hosted: `https://<your_datahub_url>/gms`
- **token**: You'll need to [generate a Personal Access Token](https://docs.datahub.com/docs/authentication/personal-access-tokens) from your DataHub instance.

## Add Lineage

The `add_lineage()` method allows you to define lineage between two entities.

### Add Entity Lineage

You can create lineage between two datasets, data jobs, dashboards, or charts. The `upstream` and `downstream` parameters should be the URNs of the entities you want to link.
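
Dataset URNs on this page follow the pattern `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)`. As a minimal sketch, the hypothetical helper below (`make_dataset_urn_str` is not part of the SDK) assembles such a string from its parts. In real code, the SDK's own URN utilities are preferable.

```python
def make_dataset_urn_str(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN string in the format shown in this tutorial.

    Hypothetical stand-alone helper for illustration only.
    """
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Hypothetical table names used as upstream and downstream.
upstream = make_dataset_urn_str("snowflake", "table_1")
downstream = make_dataset_urn_str("snowflake", "table_2")

print(upstream)
print(downstream)
```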

#### Add Entity Lineage Between Datasets

```python
{{ inline /metadata-ingestion/examples/library/add_lineage_dataset_to_dataset.py show_path_as_comment }}
```

#### Add Entity Lineage Between Datajobs

```python
{{ inline /metadata-ingestion/examples/library/lineage_datajob_to_datajob.py show_path_as_comment }}
```

:::note Lineage Combinations
For supported lineage combinations, see [Supported Lineage Combinations](#supported-lineage-combinations).
:::

### Add Column Lineage

You can add column-level lineage by using the `column_lineage` parameter when linking datasets.

#### Add Column Lineage with Fuzzy Matching

```python
{{ inline /metadata-ingestion/examples/library/lineage_dataset_column.py show_path_as_comment }}
```

When `column_lineage` is set to **True**, DataHub will automatically map columns based on their names, allowing for fuzzy matching. This is useful when upstream and downstream datasets have similar but not identical column names (e.g. `customer_id` in upstream and `CustomerId` in downstream). See [Column Lineage Options](#column-lineage-options) for more details.

#### Add Column Lineage with Strict Matching

```python
{{ inline /metadata-ingestion/examples/library/lineage_dataset_column_auto_strict.py show_path_as_comment }}
```

This will create column-level lineage with strict matching, meaning the column names must match exactly between upstream and downstream datasets.

#### Add Column Lineage with Custom Mapping

For custom mapping, you can use a dictionary where keys are downstream column names and values represent lists of upstream column names. This allows you to specify complex relationships.

```python
{{ inline /metadata-ingestion/examples/library/lineage_dataset_column_custom_mapping.py show_path_as_comment }}
```
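
To make the shape of a custom mapping concrete, here is a small stand-alone sketch. The column names and the `mapping_to_edges` helper are hypothetical and only illustrate how a downstream-keyed dictionary expands into individual upstream-to-downstream column edges. It does not call DataHub.

```python
from typing import Dict, List, Tuple

# Keys are downstream column names; values are lists of upstream column names.
# All column names here are made up for illustration.
column_mapping: Dict[str, List[str]] = {
    "customer_id": ["id"],
    "full_name": ["first_name", "last_name"],
}


def mapping_to_edges(mapping: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Expand the mapping into (upstream_column, downstream_column) pairs."""
    return [(up, down) for down, ups in mapping.items() for up in ups]


edges = mapping_to_edges(column_mapping)
for up, down in edges:
    print(f"{up} -> {down}")
```

A downstream column with several upstream columns (such as `full_name` here) yields one edge per upstream column.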

### Infer Lineage from SQL

You can infer lineage directly from a SQL query using `infer_lineage_from_sql()`. This will parse the query, determine upstream and downstream datasets, and automatically add lineage (including column-level lineage when possible) and a query node showing the SQL transformation logic.

```python
{{ inline /metadata-ingestion/examples/library/lineage_dataset_from_sql.py show_path_as_comment }}
```

:::note DataHub SQL Parser

Check out more information on how we handle SQL parsing below.

- [The DataHub SQL Parser Documentation](../../lineage/sql_parsing.md)
- [Blog Post: Extracting Column-Level Lineage from SQL](https://medium.com/datahub-project/extracting-column-level-lineage-from-sql-779b8ce17567)

:::

### Add Query Node with Lineage

If you provide a `transformation_text` to `add_lineage`, DataHub will create a query node that represents the transformation logic. This is useful for tracking how data is transformed between datasets.

```python
{{ inline /metadata-ingestion/examples/library/add_lineage_dataset_to_dataset_with_query_node.py show_path_as_comment }}
```

Transformation text can be any code that describes how the upstream dataset is transformed into the downstream dataset, such as a SQL statement, a Python script, or Airflow DAG code.

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/lineage/query-node.png"/>
</p>

:::note
Providing `transformation_text` will NOT create column lineage. You need to specify the `column_lineage` parameter to enable column-level lineage.

If you have a SQL query that describes the transformation, you can use [infer_lineage_from_sql](#infer-lineage-from-sql) to automatically parse the query and add column-level lineage.
:::

## Get Lineage

The `get_lineage()` method allows you to retrieve lineage for a given entity.

### Get Entity Lineage

#### Get Upstream Lineage for a Dataset

This will return the direct upstream entities that the dataset depends on. By default, only the immediate upstream entities (1 hop) are retrieved.

```python
{{ inline /metadata-ingestion/examples/library/get_lineage_basic.py show_path_as_comment }}
```

#### Get Downstream Lineage for a Dataset Across Multiple Hops

To get upstream/downstream entities that are more than one hop away, you can use the `max_hops` parameter. This allows you to traverse the lineage graph up to a specified number of hops.

```python
{{ inline /metadata-ingestion/examples/library/get_lineage_with_hops.py show_path_as_comment }}
```

:::note Using `max_hops`
If you provide a `max_hops` value greater than 2, the query traverses the full lineage graph and limits the results by `count`.
:::
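
Conceptually, a hop limit behaves like a depth-bounded breadth-first traversal. The sketch below runs on a toy in-memory graph with hypothetical table names. It only illustrates what "hops" means; it is not how DataHub implements lineage search, and it does not model the `count` limit described in the note above.

```python
from collections import deque
from typing import Dict, List

# Toy downstream graph: each key feeds the tables in its list (hypothetical names).
DOWNSTREAMS: Dict[str, List[str]] = {
    "table_1": ["table_2"],
    "table_2": ["table_3"],
    "table_3": ["table_4"],
}


def downstream_within(graph: Dict[str, List[str]], source: str, max_hops: int) -> Dict[str, int]:
    """Return {entity: hops} for every entity reachable within max_hops."""
    hops: Dict[str, int] = {}
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                hops[nxt] = dist + 1
                queue.append((nxt, dist + 1))
    return hops


one_hop = downstream_within(DOWNSTREAMS, "table_1", max_hops=1)
two_hops = downstream_within(DOWNSTREAMS, "table_1", max_hops=2)
print(one_hop)
print(two_hops)
```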

#### Return Type

`get_lineage()` returns a list of `LineageResult` objects.

```python
results = [
  LineageResult(
    urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_2,PROD)",
    type="DATASET",
    hops=1,
    direction="downstream",
    platform="snowflake",
    name="table_2", # name of the entity
    paths=[] # Only populated for column-level lineage
  )
]
```
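
The following sketch shows one way to post-process such a list. `LineageResultStub` is a hypothetical stand-in that mirrors only the fields shown above so that the snippet runs without a DataHub connection; with the real SDK you would iterate over the objects returned by `get_lineage()` instead.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LineageResultStub:
    """Stand-in with the fields shown above; not the SDK class itself."""

    urn: str
    type: str
    hops: int
    direction: str
    platform: Optional[str] = None
    name: Optional[str] = None
    paths: List = field(default_factory=list)


# Hypothetical results, as if returned for a two-hop downstream query.
results = [
    LineageResultStub(
        urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_2,PROD)",
        type="DATASET", hops=1, direction="downstream",
        platform="snowflake", name="table_2",
    ),
    LineageResultStub(
        urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_3,PROD)",
        type="DATASET", hops=2, direction="downstream",
        platform="snowflake", name="table_3",
    ),
]

# Keep only the immediate neighbours.
direct = [r.name for r in results if r.hops == 1]

# Group entity names by distance from the source.
names_by_hops: Dict[int, List[Optional[str]]] = {}
for r in results:
    names_by_hops.setdefault(r.hops, []).append(r.name)

print(direct)
print(names_by_hops)
```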

### Get Column-Level Lineage

#### Get Downstream Lineage for a Dataset Column

You can retrieve column-level lineage by specifying the `source_column` parameter. This will return lineage paths that include the specified column.

```python
{{ inline /metadata-ingestion/examples/library/get_column_lineage.py show_path_as_comment }}
```

You can also pass a `SchemaFieldUrn` as the `source_urn` to get column-level lineage.

```python
{{ inline /metadata-ingestion/examples/library/get_column_lineage_from_schemafield.py show_path_as_comment }}
```

#### Return Type

The return type is the same as for entity lineage, but with an additional `paths` field that contains column lineage paths.

```python
results = [
  LineageResult(
    urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_2,PROD)",
    type="DATASET",
    hops=1,
    direction="downstream",
    platform="snowflake",
    name="table_2", # name of the entity
    paths=[
      LineagePath(
        urn="urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,table_1,PROD),col1)",
        column_name="col1", # name of the column
        entity_name="table_1", # name of the entity that contains the column
      ),
      LineagePath(
        urn="urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,table_2,PROD),col4)",
        column_name="col4", # name of the column
        entity_name="table_2", # name of the entity that contains the column
      )
    ] # Only populated for column-level lineage
  )
]
```

For more details on how to interpret the results, see [Interpreting Column Lineage Results](#interpreting-column-lineage-results).

### Filter Lineage Results

You can filter by platform, type, domain, environment, and more.

```python
{{ inline /metadata-ingestion/examples/library/get_lineage_with_filter.py show_path_as_comment }}
```

You can check more details about the available filters in the [Search SDK documentation](./sdk/search_client.md#filter-based-search).

## Lineage SDK Reference

For a full reference, see the [lineage SDK reference](../../../python-sdk/sdk-v2/lineage-client.mdx).

### Supported Lineage Combinations

The Lineage APIs support the following entity combinations:

| Upstream Entity | Downstream Entity |
| --------------- | ----------------- |
| Dataset         | Dataset           |
| Dataset         | DataJob           |
| DataJob         | DataJob           |
| Dataset         | Dashboard         |
| Chart           | Dashboard         |
| Dashboard       | Dashboard         |
| Dataset         | Chart             |

> ℹ️ Column-level lineage and creating a query node with transformation text are **only supported** for `Dataset → Dataset` lineage.

### Column Lineage Options

For dataset-to-dataset lineage, you can specify the `column_lineage` parameter in `add_lineage()` in several ways:

| Value           | Description                                                                       |
| --------------- | --------------------------------------------------------------------------------- |
| `False`         | Disable column-level lineage (default)                                            |
| `True`          | Enable column-level lineage with automatic mapping (same as "auto_fuzzy")         |
| `"auto_fuzzy"`  | Enable column-level lineage with fuzzy matching (useful for similar column names) |
| `"auto_strict"` | Enable column-level lineage with strict matching (exact column names required)    |
| Column Mapping  | A dictionary mapping downstream column names to lists of upstream column names    |

:::note `auto_fuzzy` vs `auto_strict`

- **`auto_fuzzy`**: Automatically matches columns based on similar names, allowing for some flexibility in naming conventions. For example, each of these column pairs would be considered a match:
  - user_id → userId
  - customer_id → CustomerId
- **`auto_strict`**: Requires exact column name matches between upstream and downstream datasets. For example, `customer_id` in the upstream dataset must match `customer_id` in the downstream dataset exactly.

:::
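
The difference between the two modes can be sketched in plain Python. The normalization rule below (drop underscores, ignore case) is an assumption chosen to reproduce the examples above; DataHub's actual matching logic may differ. `normalize` and `match_columns` are hypothetical helpers, not SDK functions.

```python
from typing import Dict, List


def normalize(name: str) -> str:
    """Assumed fuzzy rule: ignore underscores and letter case."""
    return name.replace("_", "").lower()


def match_columns(upstream: List[str], downstream: List[str], strict: bool) -> Dict[str, List[str]]:
    """Return {downstream_column: [matching upstream columns]}."""
    mapping: Dict[str, List[str]] = {}
    for down in downstream:
        if strict:
            ups = [up for up in upstream if up == down]
        else:
            ups = [up for up in upstream if normalize(up) == normalize(down)]
        if ups:
            mapping[down] = ups
    return mapping


upstream_cols = ["user_id", "customer_id", "email"]
downstream_cols = ["userId", "CustomerId", "email"]

fuzzy = match_columns(upstream_cols, downstream_cols, strict=False)
exact = match_columns(upstream_cols, downstream_cols, strict=True)
print(fuzzy)
print(exact)
```

The output uses the same downstream-keyed shape as a custom column mapping, which makes it easy to compare what each mode would link: the fuzzy pass pairs all three columns, while the strict pass pairs only `email`.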

### Interpreting Column Lineage Results

When retrieving column-level lineage, the results include `paths` that show how columns are related across datasets. Each path is a list of column URNs that represent the lineage from the source column to the target column.

For example, let's say we have the following lineage across three tables:

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/lineage/column-lineage.png"/>
</p>

#### Example with `max_hops=1`

```python
>>> client.lineage.get_lineage(
        source_urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_1,PROD)",
        source_column="col1",
        direction="downstream",
        max_hops=1
    )
```

**Returns:**

```python
[
    {
        "urn": "...table_2...",
        "hops": 1,
        "paths": [
            ["...table_1.col1", "...table_2.col4"],
            ["...table_1.col1", "...table_2.col5"]
        ]
    }
]
```

#### Example with `max_hops=2`

```python
>>> client.lineage.get_lineage(
        source_urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_1,PROD)",
        source_column="col1",
        direction="downstream",
        max_hops=2
    )
```

**Returns:**

```python
[
    {
        "urn": "...table_2...",
        "hops": 1,
        "paths": [
            ["...table_1.col1", "...table_2.col4"],
            ["...table_1.col1", "...table_2.col5"]
        ]
    },
    {
        "urn": "...table_3...",
        "hops": 2,
        "paths": [
            ["...table_1.col1", "...table_2.col4", "...table_3.col7"]
        ]
    }
]
```
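
For display or logging, it can help to turn a path into a readable chain. The sketch below assumes each path element is available as a full `schemaField` URN string in the format shown in the Return Type example (the results above abbreviate them with `...`). `field_label` and `render_path` are hypothetical helpers.

```python
from typing import List

PREFIX = "urn:li:schemaField:("


def field_label(schema_field_urn: str) -> str:
    """Turn a schemaField URN into a short 'table.column' label."""
    if not (schema_field_urn.startswith(PREFIX) and schema_field_urn.endswith(")")):
        raise ValueError(f"not a schemaField URN: {schema_field_urn}")
    inner = schema_field_urn[len(PREFIX):-1]
    # inner looks like 'urn:li:dataset:(<platform urn>,<table>,<env>),<column>'
    dataset_part, _, column = inner.partition("),")
    pieces = dataset_part.split(",")
    table = ",".join(pieces[1:-1])  # everything between platform and env
    return f"{table}.{column}"


def render_path(path: List[str]) -> str:
    """Join the labels of a path from source column to target column."""
    return " -> ".join(field_label(urn) for urn in path)


path = [
    "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,table_1,PROD),col1)",
    "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,table_2,PROD),col4)",
    "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,table_3,PROD),col7)",
]
print(render_path(path))  # table_1.col1 -> table_2.col4 -> table_3.col7
```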

## Alternative: Lineage GraphQL API

While we generally recommend using the Python SDK for lineage, you can also use the GraphQL API to add and retrieve lineage.

#### Add Lineage Between Datasets with GraphQL

```graphql
mutation updateLineage {
  updateLineage(
    input: {
      edgesToAdd: [
        {
          downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
          upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
        }
      ]
      edgesToRemove: []
    }
  )
}
```
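
If you want to send this mutation from a script, one option is a plain HTTP POST. The sketch below only builds the request with the Python standard library. The endpoint path `/api/graphql` on the GMS server and the bearer-token header are assumptions to verify against your deployment, and `build_graphql_request` is a hypothetical helper.

```python
import json
import urllib.request

# The mutation from the example above, kept as a plain string.
UPDATE_LINEAGE = """
mutation updateLineage {
  updateLineage(
    input: {
      edgesToAdd: [
        {
          downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
          upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
        }
      ]
      edgesToRemove: []
    }
  )
}
"""


def build_graphql_request(graphql_url: str, token: str, query: str) -> urllib.request.Request:
    """Wrap a GraphQL document in a JSON POST request (nothing is sent here)."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        graphql_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="POST",
    )


# Assumed endpoint; replace the URL and token with your own values.
request = build_graphql_request("http://localhost:8080/api/graphql", "<your_token>", UPDATE_LINEAGE)
# To actually send it: urllib.request.urlopen(request)
```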

#### Get Downstream Lineage with GraphQL

```graphql
query scrollAcrossLineage {
  scrollAcrossLineage(
    input: {
      query: "*"
      urn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
      count: 10
      direction: DOWNSTREAM
      orFilters: [
        {
          and: [
            {
              condition: EQUAL
              negated: false
              field: "degree"
              values: ["1", "2", "3+"]
            }
          ]
        }
      ]
    }
  ) {
    searchResults {
      degree
      entity {
        urn
        type
      }
    }
  }
}
```

## FAQ

**Can I get lineage at the column level?**
Yes — for dataset-to-dataset lineage, both `add_lineage()` and `get_lineage()` support column-level lineage.

**Can I pass a SQL query and get lineage automatically?**
Yes — use `infer_lineage_from_sql()` to parse a query and extract table and column lineage.

**Can I use filters when retrieving lineage?**
Yes — `get_lineage()` accepts structured filters via `FilterDsl`, just like in the Search SDK.