Mirror of https://github.com/datahub-project/datahub.git, synced 2025-10-30 18:26:58 +00:00

# Lineage

DataHub’s Python SDK allows you to programmatically define and retrieve lineage between metadata entities. With the DataHub Lineage SDK, you can:

- Add **table-level and column-level lineage** across datasets, data jobs, dashboards, and charts
- Automatically **infer lineage from SQL queries**
- **Read lineage** (upstream or downstream) for a given entity or column
- **Filter lineage results** using structured filters

## Getting Started

To use the DataHub SDK, you'll need to install [`acryl-datahub`](https://pypi.org/project/acryl-datahub/) and set up a connection to your DataHub instance. Follow the [installation guide](https://docs.datahub.com/docs/metadata-ingestion/cli-ingestion#installing-datahub-cli) to get started.

Connect to your DataHub instance:

```python
from datahub.sdk import DataHubClient

client = DataHubClient(server="<your_server>", token="<your_token>")
```

- **server**: The URL of your DataHub GMS server
  - local: `http://localhost:8080`
  - hosted: `https://<your_datahub_url>/gms`
- **token**: You'll need to [generate a Personal Access Token](https://docs.datahub.com/docs/authentication/personal-access-tokens) from your DataHub instance.
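
Hard-coding a token in source code makes it easy to leak. As a minimal sketch of one alternative, the helper below reads the connection settings from environment variables and returns keyword arguments matching the `DataHubClient(server=..., token=...)` call above. The variable names `DATAHUB_SERVER` and `DATAHUB_TOKEN` and the helper `connection_kwargs` are assumptions made for this illustration, not names defined by this guide.

```python
import os


def connection_kwargs(env=None):
    """Build keyword arguments for DataHubClient from environment variables.

    DATAHUB_SERVER and DATAHUB_TOKEN are names assumed for this sketch.
    """
    env = os.environ if env is None else env
    # Fall back to the local GMS URL mentioned above when no server is set.
    server = env.get("DATAHUB_SERVER", "http://localhost:8080")
    token = env.get("DATAHUB_TOKEN")
    if not token:
        # Failing early is a choice of this sketch, not an SDK requirement.
        raise ValueError("DATAHUB_TOKEN is not set")
    return {"server": server.rstrip("/"), "token": token}


example_env = {"DATAHUB_SERVER": "http://localhost:8080/", "DATAHUB_TOKEN": "<your_token>"}
print(connection_kwargs(example_env))
```

With the SDK installed, the result could then be passed along as `DataHubClient(**connection_kwargs())`.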

## Add Lineage

The `add_lineage()` method allows you to define lineage between two entities.

### Add Entity Lineage

You can create lineage between two datasets, data jobs, dashboards, or charts. The `upstream` and `downstream` parameters should be the URNs of the entities you want to link.
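
Dataset URNs in this guide follow the shape `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)`. The sketch below formats such strings by hand so the structure is visible. The helper `dataset_urn` is hypothetical and only does plain string formatting; it rejects components containing commas or parentheses rather than attempting any escaping.

```python
def dataset_urn(platform, name, env="PROD"):
    """Format a dataset URN string in the shape used throughout this guide."""
    for part in (platform, name, env):
        # This sketch does no escaping, so reject characters that would
        # make the resulting URN ambiguous.
        if not part or any(ch in part for ch in ",()"):
            raise ValueError(f"unsupported URN component: {part!r}")
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


upstream = dataset_urn("snowflake", "table_1")
downstream = dataset_urn("snowflake", "table_2")
print(upstream)
print(downstream)
```

The two strings can then serve as the `upstream` and `downstream` values described above.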

#### Add Entity Lineage Between Datasets

```python
{{ inline /metadata-ingestion/examples/library/add_lineage_dataset_to_dataset.py show_path_as_comment }}
```

#### Add Entity Lineage Between Data Jobs

```python
{{ inline /metadata-ingestion/examples/library/lineage_datajob_to_datajob.py show_path_as_comment }}
```

:::note Lineage Combinations
For supported lineage combinations, see [Supported Lineage Combinations](#supported-lineage-combinations).
:::

### Add Column Lineage

You can add column-level lineage by using the `column_lineage` parameter when linking datasets.

#### Add Column Lineage with Fuzzy Matching

```python
{{ inline /metadata-ingestion/examples/library/lineage_dataset_column.py show_path_as_comment }}
```

When `column_lineage` is set to `True`, DataHub automatically maps columns based on their names, allowing for fuzzy matching. This is useful when upstream and downstream datasets have similar but not identical column names (e.g. `customer_id` upstream and `CustomerId` downstream). See [Column Lineage Options](#column-lineage-options) for more details.

#### Add Column Lineage with Strict Matching

```python
{{ inline /metadata-ingestion/examples/library/lineage_dataset_column_auto_strict.py show_path_as_comment }}
```

This creates column-level lineage with strict matching, meaning the column names must match exactly between the upstream and downstream datasets.

#### Add Column Lineage with Custom Mapping

For custom mapping, pass a dictionary whose keys are downstream column names and whose values are lists of upstream column names. This lets you specify relationships that name matching cannot express, such as one downstream column derived from several upstream columns.

```python
{{ inline /metadata-ingestion/examples/library/lineage_dataset_column_custom_mapping.py show_path_as_comment }}
```
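
Because the mapping is keyed by downstream column, it is easy to write it the wrong way round. As a small sanity check, the sketch below inverts a mapping of that shape so you can read it from the upstream side. The column names and the helper `invert_column_mapping` are made up for illustration and are not part of the SDK.

```python
def invert_column_mapping(mapping):
    """Turn {downstream: [upstream, ...]} into {upstream: [downstream, ...]}."""
    inverted = {}
    for downstream_col, upstream_cols in mapping.items():
        if isinstance(upstream_cols, str):
            # A bare string would be iterated character by character.
            raise TypeError(f"{downstream_col!r} must map to a list of upstream column names")
        for upstream_col in upstream_cols:
            inverted.setdefault(upstream_col, []).append(downstream_col)
    return inverted


# Hypothetical mapping: keys are downstream columns, values are upstream columns.
column_mapping = {
    "full_name": ["first_name", "last_name"],
    "customer_id": ["id"],
}
print(invert_column_mapping(column_mapping))
```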

### Infer Lineage from SQL

You can infer lineage directly from a SQL query using `infer_lineage_from_sql()`. This parses the query, determines the upstream and downstream datasets, and automatically adds lineage (including column-level lineage when possible) along with a query node showing the SQL transformation logic.

```python
{{ inline /metadata-ingestion/examples/library/lineage_dataset_from_sql.py show_path_as_comment }}
```

:::note DataHub SQL Parser

For more information on how DataHub handles SQL parsing, see:

- [The DataHub SQL Parser Documentation](../../lineage/sql_parsing.md)
- [Blog Post: Extracting Column-Level Lineage from SQL](https://medium.com/datahub-project/extracting-column-level-lineage-from-sql-779b8ce17567)

:::

### Add Query Node with Lineage

If you provide a `transformation_text` to `add_lineage`, DataHub will create a query node that represents the transformation logic. This is useful for tracking how data is transformed between datasets.

```python
{{ inline /metadata-ingestion/examples/library/add_lineage_dataset_to_dataset_with_query_node.py show_path_as_comment }}
```

The transformation text can be any transformation logic: Python scripts, Airflow DAG code, or any other code that describes how the upstream dataset is transformed into the downstream dataset.

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/lineage/query-node.png"/>
</p>

:::note
Providing `transformation_text` will NOT create column lineage. You need to specify the `column_lineage` parameter to enable column-level lineage.

If you have a SQL query that describes the transformation, you can use [infer_lineage_from_sql](#infer-lineage-from-sql) to automatically parse the query and add column-level lineage.
:::

## Get Lineage

The `get_lineage()` method allows you to retrieve lineage for a given entity.

### Get Entity Lineage

#### Get Upstream Lineage for a Dataset

This returns the direct upstream entities that the dataset depends on. By default, only the immediate upstream entities (1 hop) are retrieved.

```python
{{ inline /metadata-ingestion/examples/library/get_lineage_basic.py show_path_as_comment }}
```

#### Get Downstream Lineage for a Dataset Across Multiple Hops

To get upstream or downstream entities that are more than one hop away, use the `max_hops` parameter. This lets you traverse the lineage graph up to a specified number of hops.

```python
{{ inline /metadata-ingestion/examples/library/get_lineage_with_hops.py show_path_as_comment }}
```

:::note Using `max_hops`
If you set `max_hops` to a value greater than 2, the full lineage graph is traversed and the results are limited by `count`.
:::
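
To make the notion of a "hop" concrete, here is a conceptual sketch of a hop-limited breadth-first walk over a small in-memory graph. It illustrates the idea only and makes no claim about how DataHub traverses lineage internally; the graph, the table names, and the function `downstream_within` are all hypothetical.

```python
from collections import deque


def downstream_within(graph, source, max_hops=1):
    """Breadth-first walk returning {node: hops} for nodes within max_hops of source."""
    hops = {}
    queue = deque([(source, 0)])
    while queue:
        node, distance = queue.popleft()
        if distance == max_hops:
            continue  # do not expand nodes that already sit at the hop limit
        for neighbor in graph.get(node, []):
            if neighbor != source and neighbor not in hops:
                hops[neighbor] = distance + 1
                queue.append((neighbor, distance + 1))
    return hops


# Hypothetical lineage: table_1 -> table_2 -> table_3 -> table_4
graph = {"table_1": ["table_2"], "table_2": ["table_3"], "table_3": ["table_4"]}
print(downstream_within(graph, "table_1", max_hops=1))
print(downstream_within(graph, "table_1", max_hops=2))
```

Each entity is reported with the smallest number of hops at which it is reachable, which mirrors the `hops` field in the results described in this guide.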

#### Return Type

`get_lineage()` returns a list of `LineageResult` objects.

```python
results = [
  LineageResult(
    urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_2,PROD)",
    type="DATASET",
    hops=1,
    direction="downstream",
    platform="snowflake",
    name="table_2", # name of the entity
    paths=[] # Only populated for column-level lineage
  )
]
```
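
A common follow-up is to summarize the results by distance from the source. The sketch below groups entity names by their `hops` value. So that it runs without a DataHub instance, it uses a stand-in dataclass, `ResultStub`, that carries the fields shown above; it is not the SDK's `LineageResult` class, and attribute-style access on the real objects is an assumption based on the representation shown.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ResultStub:
    """Stand-in carrying the fields shown above; not the SDK's LineageResult."""
    urn: str
    type: str
    hops: int
    direction: str
    platform: str
    name: str
    paths: list = field(default_factory=list)


def names_by_hops(results):
    """Group entity names by hop count, closest entities first."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result.hops].append(result.name)
    return {hops: sorted(names) for hops, names in sorted(grouped.items())}


# Hypothetical results for illustration only.
results = [
    ResultStub("urn:li:dataset:(urn:li:dataPlatform:snowflake,table_3,PROD)",
               "DATASET", 2, "downstream", "snowflake", "table_3"),
    ResultStub("urn:li:dataset:(urn:li:dataPlatform:snowflake,table_2,PROD)",
               "DATASET", 1, "downstream", "snowflake", "table_2"),
]
print(names_by_hops(results))
```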

### Get Column-Level Lineage

#### Get Downstream Lineage for a Dataset Column

You can retrieve column-level lineage by specifying the `source_column` parameter. This returns lineage paths that include the specified column.

```python
{{ inline /metadata-ingestion/examples/library/get_column_lineage.py show_path_as_comment }}
```

You can also pass a `SchemaFieldUrn` as the `source_urn` to get column-level lineage.

```python
{{ inline /metadata-ingestion/examples/library/get_column_lineage_from_schemafield.py show_path_as_comment }}
```

#### Return Type

The return type is the same as for entity lineage, except that the `paths` field is populated with the column lineage paths.

```python
results = [
  LineageResult(
    urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_2,PROD)",
    type="DATASET",
    hops=1,
    direction="downstream",
    platform="snowflake",
    name="table_2", # name of the entity
    paths=[
      LineagePath(
        urn="urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,table_1,PROD),col1)",
        column_name="col1", # name of the column
        entity_name="table_1", # name of the entity that contains the column
      ),
      LineagePath(
        urn="urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,table_2,PROD),col4)",
        column_name="col4", # name of the column
        entity_name="table_2", # name of the entity that contains the column
      )
    ] # Only populated for column-level lineage
  )
]
```

For more details on how to interpret the results, see [Interpreting Column Lineage Results](#interpreting-column-lineage-results).

### Filter Lineage Results

You can filter by platform, type, domain, environment, and more.

```python
{{ inline /metadata-ingestion/examples/library/get_lineage_with_filter.py show_path_as_comment }}
```

For more details about the available filters, see the [Search SDK documentation](./sdk/search_client.md#filter-based-search).

## Lineage SDK Reference

For a full reference, see the [lineage SDK reference](../../../python-sdk/sdk-v2/lineage-client.mdx).

### Supported Lineage Combinations

The Lineage APIs support the following entity combinations:

| Upstream Entity | Downstream Entity |
| --------------- | ----------------- |
| Dataset         | Dataset           |
| Dataset         | DataJob           |
| DataJob         | DataJob           |
| Dataset         | Dashboard         |
| Chart           | Dashboard         |
| Dashboard       | Dashboard         |
| Dataset         | Chart             |

> ℹ️ Column-level lineage and query nodes created from transformation text are **only supported** for `Dataset → Dataset` lineage.

### Column Lineage Options

For dataset-to-dataset lineage, you can specify the `column_lineage` parameter in `add_lineage()` in several ways:

| Value           | Description                                                                       |
| --------------- | --------------------------------------------------------------------------------- |
| `False`         | Disable column-level lineage (default)                                            |
| `True`          | Enable column-level lineage with automatic mapping (same as "auto_fuzzy")         |
| `"auto_fuzzy"`  | Enable column-level lineage with fuzzy matching (useful for similar column names) |
| `"auto_strict"` | Enable column-level lineage with strict matching (exact column names required)    |
| Column Mapping  | A dictionary mapping downstream column names to lists of upstream column names    |

:::note `auto_fuzzy` vs `auto_strict`

- **`auto_fuzzy`**: Automatically matches columns based on similar names, allowing for some flexibility in naming conventions. For example, each of these pairs would be considered a match:
  - user_id → userId
  - customer_id → CustomerId
- **`auto_strict`**: Requires exact column name matches between upstream and downstream datasets. For example, `customer_id` in the upstream dataset must match `customer_id` in the downstream dataset exactly.

:::
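
To build intuition for the difference, the sketch below compares the two modes on the example names above. The fuzzy rule used here (lowercase the name and drop every non-alphanumeric character) is an assumption chosen for illustration; DataHub's actual matching rules may differ, and the helpers `normalize` and `match_columns` are not part of the SDK.

```python
import re


def normalize(name):
    """Lowercase and drop non-alphanumerics so user_id and userId compare equal."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def match_columns(upstream_cols, downstream_cols, strict=False):
    """Return {downstream: [upstream]} for columns whose names match."""
    key = (lambda name: name) if strict else normalize
    upstream_by_key = {key(col): col for col in upstream_cols}
    return {
        col: [upstream_by_key[key(col)]]
        for col in downstream_cols
        if key(col) in upstream_by_key
    }


upstream = ["user_id", "customer_id", "created_at"]
downstream = ["userId", "CustomerId", "created_at"]
print(match_columns(upstream, downstream))               # fuzzy: all three columns pair up
print(match_columns(upstream, downstream, strict=True))  # strict: only created_at pairs up
```

The output has the same downstream-to-upstream shape as the custom column mapping, which makes it a convenient way to preview what an automatic mode might link before relying on it.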

### Interpreting Column Lineage Results

When retrieving column-level lineage, the results include `paths` that show how columns are related across datasets. Each path is a list of column URNs that represent the lineage from the source column to the target column.

For example, let's say we have the following lineage across three tables:

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/lineage/column-lineage.png"/>
</p>

#### Example with `max_hops=1`

```python
>>> client.lineage.get_lineage(
        source_urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_1,PROD)",
        source_column="col1",
        direction="downstream",
        max_hops=1
    )
```

**Returns:**

```python
[
    {
        "urn": "...table_2...",
        "hops": 1,
        "paths": [
            ["...table_1.col1", "...table_2.col4"],
            ["...table_1.col1", "...table_2.col5"]
        ]
    }
]
```

#### Example with `max_hops=2`

```python
>>> client.lineage.get_lineage(
        source_urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_1,PROD)",
        source_column="col1",
        direction="downstream",
        max_hops=2
    )
```

**Returns:**

```python
[
    {
        "urn": "...table_2...",
        "hops": 1,
        "paths": [
            ["...table_1.col1", "...table_2.col4"],
            ["...table_1.col1", "...table_2.col5"]
        ]
    },
    {
        "urn": "...table_3...",
        "hops": 2,
        "paths": [
            ["...table_1.col1", "...table_2.col4", "...table_3.col7"]
        ]
    }
]
```
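
Since each path runs from the source column to a target column, the last element of a path is the column reached in that result's entity. The sketch below works on the abbreviated dictionary form used in the examples above (not on real SDK objects) and collects the reached columns per entity; the helper `reached_columns` is hypothetical.

```python
def reached_columns(results):
    """Map each result URN to the sorted set of final columns on its paths."""
    reached = {}
    for result in results:
        # The last entry of a path is the column reached in the target entity.
        last_columns = {path[-1] for path in result["paths"] if path}
        reached[result["urn"]] = sorted(last_columns)
    return reached


# Abbreviated two-hop result, copied from the example above.
results = [
    {
        "urn": "...table_2...",
        "hops": 1,
        "paths": [
            ["...table_1.col1", "...table_2.col4"],
            ["...table_1.col1", "...table_2.col5"],
        ],
    },
    {
        "urn": "...table_3...",
        "hops": 2,
        "paths": [["...table_1.col1", "...table_2.col4", "...table_3.col7"]],
    },
]
print(reached_columns(results))
```

Here `col1` reaches `col4` and `col5` in `table_2`, and only the `col4` branch continues on to `col7` in `table_3`.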

## Alternative: Lineage GraphQL API

While we generally recommend using the Python SDK for lineage, you can also use the GraphQL API to add and retrieve lineage.

#### Add Lineage Between Datasets with GraphQL

```graphql
mutation updateLineage {
  updateLineage(
    input: {
      edgesToAdd: [
        {
          downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
          upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
        }
      ]
      edgesToRemove: []
    }
  )
}
```
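
If you want to send this mutation from Python without the SDK, one option is a plain HTTP POST. The sketch below only prepares the request using the standard library and does not send it. The `/api/graphql` path and the `Authorization: Bearer` header are assumptions about a typical DataHub deployment, so check them against your instance; `build_graphql_request` is a hypothetical helper.

```python
import json
import urllib.request

# The mutation from the example above, unchanged.
MUTATION = """
mutation updateLineage {
  updateLineage(
    input: {
      edgesToAdd: [
        {
          downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
          upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
        }
      ]
      edgesToRemove: []
    }
  )
}
"""


def build_graphql_request(server, token, query):
    """Prepare (but do not send) a POST request carrying a GraphQL document."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{server.rstrip('/')}/api/graphql",  # assumed endpoint path
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="POST",
    )


request = build_graphql_request("http://localhost:8080", "<your_token>", MUTATION)
print(request.full_url)
# To actually send it: urllib.request.urlopen(request)
```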

#### Get Downstream Lineage with GraphQL

```graphql
query scrollAcrossLineage {
  scrollAcrossLineage(
    input: {
      query: "*"
      urn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
      count: 10
      direction: DOWNSTREAM
      orFilters: [
        {
          and: [
            {
              condition: EQUAL
              negated: false
              field: "degree"
              values: ["1", "2", "3+"]
            }
          ]
        }
      ]
    }
  ) {
    searchResults {
      degree
      entity {
        urn
        type
      }
    }
  }
}
```
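
The query selects `degree` and the entity's `urn` and `type` for each search result. Assuming the usual GraphQL response envelope (a top-level `data` object keyed by the query field name), the sketch below groups the returned URNs by degree. The sample response is made up for illustration and `urns_by_degree` is a hypothetical helper.

```python
from collections import defaultdict


def urns_by_degree(response):
    """Group entity URNs by degree from a scrollAcrossLineage response body."""
    search_results = response["data"]["scrollAcrossLineage"]["searchResults"]
    grouped = defaultdict(list)
    for item in search_results:
        grouped[item["degree"]].append(item["entity"]["urn"])
    return dict(grouped)


# Hypothetical response body matching the selection set of the query above.
sample_response = {
    "data": {
        "scrollAcrossLineage": {
            "searchResults": [
                {"degree": 1, "entity": {"urn": "urn:li:dataset:(urn:li:dataPlatform:hive,table_a,PROD)", "type": "DATASET"}},
                {"degree": 2, "entity": {"urn": "urn:li:dataset:(urn:li:dataPlatform:hive,table_b,PROD)", "type": "DATASET"}},
                {"degree": 1, "entity": {"urn": "urn:li:dataset:(urn:li:dataPlatform:hive,table_c,PROD)", "type": "DATASET"}},
            ]
        }
    }
}
print(urns_by_degree(sample_response))
```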

## FAQ

**Can I get lineage at the column level?**
Yes. For dataset-to-dataset lineage, both `add_lineage()` and `get_lineage()` support column-level lineage.

**Can I pass a SQL query and get lineage automatically?**
Yes. Use `infer_lineage_from_sql()` to parse a query and extract table and column lineage.

**Can I use filters when retrieving lineage?**
Yes. `get_lineage()` accepts structured filters via `FilterDsl`, just like in the Search SDK.
