mirror of https://github.com/datahub-project/datahub.git synced 2025-07-04 15:50:14 +00:00

fix(sdk): fix typos + improve links in sdk docs (#13777 )

2025-06-16 01:38:46 -04:00

13 KiB

Raw Permalink Blame History

Lineage

DataHub’s Python SDK allows you to programmatically define and retrieve lineage between metadata entities. With the DataHub Lineage SDK, you can:

Add table-level and column-level lineage across datasets, data jobs, dashboards, and charts
Automatically infer lineage from SQL queries
Read lineage (upstream or downstream) for a given entity or column
Filter lineage results using structured filters

Getting Started

To use DataHub SDK, you'll need to install acryl-datahub and set up a connection to your DataHub instance. Follow the installation guide to get started.

Connect to your DataHub instance:

from datahub.sdk import DataHubClient

client = DataHubClient(server="<your_server>", token="<your_token>")

server: The URL of your DataHub GMS server
- local: http://localhost:8080
- hosted: https://<your_datahub_url>/gms
token: You'll need to generate a Personal Access Token from your DataHub instance.

Add Lineage

The add_lineage() method allows you to define lineage between two entities.

Add Entity Lineage

You can create lineage between two datasets, data jobs, dashboards, or charts. The upstream and downstream parameters should be the URNs of the entities you want to link.

Add Entity Lineage Between Datasets

{{ inline /metadata-ingestion/examples/library/add_lineage_dataset_to_dataset.py show_path_as_comment }}

Add Entity Lineage Between Datajobs

{{ inline /metadata-ingestion/examples/library/lineage_datajob_to_datajob.py show_path_as_comment }}

:::note Lineage Combinations For supported lineage combinations, see Supported Lineage Combinations. :::

Add Column Lineage

You can add column-level lineage by using column_lineage parameter when linking datasets.

Add Column Lineage with Fuzzy Matching

{{ inline /metadata-ingestion/examples/library/lineage_dataset_column.py show_path_as_comment }}

When column_lineage is set to True, DataHub will automatically map columns based on their names, allowing for fuzzy matching. This is useful when upstream and downstream datasets have similar but not identical column names. (e.g. customer_id in upstream and CustomerId in downstream). See Column Lineage Options for more details.

Add Column Lineage with Strict Matching

{{ inline /metadata-ingestion/examples/library/lineage_dataset_column_auto_strict.py show_path_as_comment }}

This will create column-level lineage with strict matching, meaning the column names must match exactly between upstream and downstream datasets.

Add Column Lineage with Custom Mapping

For custom mapping, you can use a dictionary where keys are downstream column names and values represent lists of upstream column names. This allows you to specify complex relationships.

{{ inline /metadata-ingestion/examples/library/lineage_dataset_column_custom_mapping.py show_path_as_comment }}

Infer Lineage from SQL

You can infer lineage directly from a SQL query using infer_lineage_from_sql(). This will parse the query, determine upstream and downstream datasets, and automatically add lineage (including column-level lineage when possible) and a query node showing the SQL transformation logic.

{{ inline /metadata-ingestion/examples/library/lineage_dataset_from_sql.py show_path_as_comment }}

:::note DataHub SQL Parser

Check out more information on how we handle SQL parsing below.

:::

Add Query Node with Lineage

If you provide a transformation_text to add_lineage, DataHub will create a query node that represents the transformation logic. This is useful for tracking how data is transformed between datasets.

{{ inline /metadata-ingestion/examples/library/add_lineage_dataset_to_dataset_with_query_node.py show_path_as_comment }}

Transformation text can be any transformation logic, Python scripts, Airflow DAG code, or any other code that describes how the upstream dataset is transformed into the downstream dataset.

:::note Providing transformation_text will NOT create column lineage. You need to specify column_lineage parameter to enable column-level lineage.

If you have a SQL query that describes the transformation, you can use infer_lineage_from_sql to automatically parse the query and add column level lineage. :::

Get Lineage

The get_lineage() method allows you to retrieve lineage for a given entity.

Get Entity Lineage

Get Upstream Lineage for a Dataset

This will return the direct upstream entity that the dataset depends on. By default, it retrieves only the immediate upstream entities (1 hop).

{{ inline /metadata-ingestion/examples/library/get_lineage_basic.py show_path_as_comment }}

Get Downstream Lineage for a Dataset Across Multiple Hops

To get upstream/downstream entities that are more than one hop away, you can use the max_hops parameter. This allows you to traverse the lineage graph up to a specified number of hops.

{{ inline /metadata-ingestion/examples/library/get_lineage_with_hops.py show_path_as_comment }}

:::note USING MAX_HOPS if you provide max_hops greater than 2, it will traverse the full lineage graph and limit the results by count. :::

Return Type

get_lineage() returns a list of LineageResult objects.

results = [
  LineageResult(
    urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_2,PROD)",
    type="DATASET",
    hops=1,
    direction="downstream",
    platform="snowflake",
    name="table_2", # name of the entity
    paths=[] # Only populated for column-level lineage
  )
]

Get Column-Level Lineage

Get Downstream Lineage for a Dataset Column

You can retrieve column-level lineage by specifying the source_column parameter. This will return lineage paths that include the specified column.

{{ inline /metadata-ingestion/examples/library/get_column_lineage.py show_path_as_comment }}

You can also pass SchemaFieldUrn as the source_urn to get column-level lineage.

{{ inline /metadata-ingestion/examples/library/get_column_lineage_from_schemafield.py show_path_as_comment }}

Return type

The return type is the same as for entity lineage, but with additional paths field that contains column lineage paths.

results = [
  LineageResult(
    urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_2,PROD)",
    type="DATASET",
    hops=1,
    direction="downstream",
    platform="snowflake",
    name="table_2", # name of the entity
    paths=[
      LineagePath(
        urn="urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,table_1,PROD),col1)",
        column_name="col1", # name of the column
        entity_name="table_1", # name of the entity that contains the column
      ),
      LineagePath(
        urn="urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,table_2,PROD),col4)",
        column_name="col4", # name of the column
        entity_name="table_2", # name of the entity that contains the column
      )
    ] # Only populated for column-level lineage
  )
]

For more details on how to interpret the results, see Interpreting Lineage Results.

Filter Lineage Results

You can filter by platform, type, domain, environment, and more.

{{ inline /metadata-ingestion/examples/library/get_lineage_with_filter.py show_path_as_comment }}

You can check more details about the available filters in the Search SDK documentation.

Lineage SDK Reference

For a full reference, see the lineage SDK reference.

Supported Lineage Combinations

The Lineage APIs support the following entity combinations:

Upstream Entity	Downstream Entity
Dataset	Dataset
Dataset	DataJob
DataJob	DataJob
Dataset	Dashboard
Chart	Dashboard
Dashboard	Dashboard
Dataset	Chart

ℹ️ Column-level lineage and creating query node with transformation text are only supported for Dataset → Dataset lineage.

Column Lineage Options

For dataset-to-dataset lineage, you can specify column_lineage parameter in add_lineage() in several ways:

Value	Description
`False`	Disable column-level lineage (default)
`True`	Enable column-level lineage with automatic mapping (same as "auto_fuzzy")
`"auto_fuzzy"`	Enable column-level lineage with fuzzy matching (useful for similar column names)
`"auto_strict"`	Enable column-level lineage with strict matching (exact column names required)
Column Mapping	A dictionary mapping downstream column names to lists of upstream column names

:::note auto_fuzzy vs auto_strict

auto_fuzzy: Automatically matches columns based on similar names, allowing for some flexibility in naming conventions. For example, these two columns would be considered a match:
- user_id → userId
- customer_id → CustomerId
auto_strict: Requires exact column name matches between upstream and downstream datasets. For example, customer_id in the upstream dataset must match customer_id in the downstream dataset exactly.

:::

Interpreting Column Lineage Results

When retrieving column-level lineage, the results include paths that show how columns are related across datasets. Each path is a list of column URNs that represent the lineage from the source column to the target column.

For example, let's say we have the following lineage across three tables:

Example with `max_hops=1`

>>> client.lineage.get_lineage(
        source_urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_1,PROD)",
        source_column="col1",
        direction="downstream",
        max_hops=1
    )

Returns:

[
    {
        "urn": "...table_2...",
        "hops": 1,
        "paths": [
            ["...table_1.col1", "...table_2.col4"],
            ["...table_1.col1", "...table_2.col5"]
        ]
    }
]

Example with `max_hops=2`

>>> client.lineage.get_lineage(
        source_urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_1,PROD)",
        source_column="col1",
        direction="downstream",
        max_hops=2
    )

Returns:

[
    {
        "urn": "...table_2...",
        "hops": 1,
        "paths": [
            ["...table_1.col1", "...table_2.col4"],
            ["...table_1.col1", "...table_2.col5"]
        ]
    },
    {
        "urn": "...table_3...",
        "hops": 2,
        "paths": [
            ["...table_1.col1", "...table_2.col4", "...table_3.col7"]
        ]
    }
]

Alternative: Lineage GraphQL API

While we generally recommend using the Python SDK for lineage, you can also use the GraphQL API to add and retrieve lineage.

Add Lineage Between Datasets with GraphQL

mutation updateLineage {
  updateLineage(
    input: {
      edgesToAdd: [
        {
          downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
          upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
        }
      ]
      edgesToRemove: []
    }
  )
}

Get Downstream Lineage with GraphQL

query scrollAcrossLineage {
  scrollAcrossLineage(
    input: {
      query: "*"
      urn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
      count: 10
      direction: DOWNSTREAM
      orFilters: [
        {
          and: [
            {
              condition: EQUAL
              negated: false
              field: "degree"
              values: ["1", "2", "3+"]
            }
          ]
        }
      ]
    }
  ) {
    searchResults {
      degree
      entity {
        urn
        type
      }
    }
  }
}

FAQ

Can I get lineage at the column level? Yes — for dataset-to-dataset lineage, both add_lineage() and get_lineage() support column-level lineage.

Can I pass a SQL query and get lineage automatically? Yes — use infer_lineage_from_sql() to parse a query and extract table and column lineage.

Can I use filters when retrieving lineage? Yes — get_lineage() accepts structured filters via FilterDsl, just like in the Search SDK.

13 KiB Raw Permalink Blame History Unescape Escape

Lineage

Getting Started

Add Lineage

Add Entity Lineage

Add Entity Lineage Between Datasets

Add Entity Lineage Between Datajobs

Add Column Lineage

Add Column Lineage with Fuzzy Matching

Add Column Lineage with Strict Matching

Add Column Lineage with Custom Mapping

Infer Lineage from SQL

Add Query Node with Lineage

Get Lineage

Get Entity Lineage

Get Upstream Lineage for a Dataset

Get Downstream Lineage for a Dataset Across Multiple Hops

Return Type

Get Column-Level Lineage

Get Downstream Lineage for a Dataset Column

Return type

Filter Lineage Results

Lineage SDK Reference

Supported Lineage Combinations

Column Lineage Options

Interpreting Column Lineage Results

Example with max_hops=1

Example with max_hops=2

Alternative: Lineage GraphQL API

Add Lineage Between Datasets with GraphQL

Get Downstream Lineage with GraphQL

FAQ

13 KiB

Raw Permalink Blame History

Example with `max_hops=1`

Example with `max_hops=2`