mirror of
https://github.com/datahub-project/datahub.git
synced 2025-06-27 05:03:31 +00:00
docs: lineage client SDK guide (#13700)
This commit is contained in:
parent
a53e62f701
commit
c879836ea6
Here's an overview of what each API can do.

> Last Updated : Feb 16 2024

| Feature | GraphQL | Python SDK | OpenAPI |
| ------- | ------- | ---------- | ------- |
| Create a Dataset | 🚫 | ✅ [[Guide]](/docs/api/tutorials/datasets.md) | ✅ |
| Delete a Dataset (Soft Delete) | ✅ [[Guide]](/docs/api/tutorials/datasets.md#delete-dataset) | ✅ [[Guide]](/docs/api/tutorials/datasets.md#delete-dataset) | ✅ |
| Delete a Dataset (Hard Delete) | 🚫 | ✅ [[Guide]](/docs/api/tutorials/datasets.md#delete-dataset) | ✅ |
| Search a Dataset | ✅ [[Guide]](/docs/how/search.md#graphql) | ✅ | ✅ |
| Read a Dataset Deprecation | ✅ | ✅ | ✅ |
| Read Dataset Entities (V2) | ✅ | ✅ | ✅ |
| Create a Tag | ✅ [[Guide]](/docs/api/tutorials/tags.md#create-tags) | ✅ [[Guide]](/docs/api/tutorials/tags.md#create-tags) | ✅ |
| Read a Tag | ✅ [[Guide]](/docs/api/tutorials/tags.md#read-tags) | ✅ [[Guide]](/docs/api/tutorials/tags.md#read-tags) | ✅ |
| Add Tags to a Dataset | ✅ [[Guide]](/docs/api/tutorials/tags.md#add-tags-to-a-dataset) | ✅ [[Guide]](/docs/api/tutorials/tags.md#add-tags-to-a-dataset) | ✅ |
| Add Tags to a Column of a Dataset | ✅ [[Guide]](/docs/api/tutorials/tags.md#add-tags-to-a-column-of-a-dataset) | ✅ [[Guide]](/docs/api/tutorials/tags.md#add-tags-to-a-column-of-a-dataset) | ✅ |
| Remove Tags from a Dataset | ✅ [[Guide]](/docs/api/tutorials/tags.md#remove-tags) | ✅ [[Guide]](/docs/api/tutorials/tags.md#remove-tags) | ✅ |
| Create Glossary Terms | ✅ [[Guide]](/docs/api/tutorials/terms.md#create-terms) | ✅ [[Guide]](/docs/api/tutorials/terms.md#create-terms) | ✅ |
| Read Terms from a Dataset | ✅ [[Guide]](/docs/api/tutorials/terms.md#read-terms) | ✅ [[Guide]](/docs/api/tutorials/terms.md#read-terms) | ✅ |
| Add Terms to a Column of a Dataset | ✅ [[Guide]](/docs/api/tutorials/terms.md#add-terms-to-a-column-of-a-dataset) | ✅ [[Guide]](/docs/api/tutorials/terms.md#add-terms-to-a-column-of-a-dataset) | ✅ |
| Add Terms to a Dataset | ✅ [[Guide]](/docs/api/tutorials/terms.md#add-terms-to-a-dataset) | ✅ [[Guide]](/docs/api/tutorials/terms.md#add-terms-to-a-dataset) | ✅ |
| Create Domains | ✅ [[Guide]](/docs/api/tutorials/domains.md#create-domain) | ✅ [[Guide]](/docs/api/tutorials/domains.md#create-domain) | ✅ |
| Read Domains | ✅ [[Guide]](/docs/api/tutorials/domains.md#read-domains) | ✅ [[Guide]](/docs/api/tutorials/domains.md#read-domains) | ✅ |
| Add Domains to a Dataset | ✅ [[Guide]](/docs/api/tutorials/domains.md#add-domains) | ✅ [[Guide]](/docs/api/tutorials/domains.md#add-domains) | ✅ |
| Remove Domains from a Dataset | ✅ [[Guide]](/docs/api/tutorials/domains.md#remove-domains) | ✅ [[Guide]](/docs/api/tutorials/domains.md#remove-domains) | ✅ |
| Create / Upsert Users | ✅ [[Guide]](/docs/api/tutorials/owners.md#upsert-users) | ✅ [[Guide]](/docs/api/tutorials/owners.md#upsert-users) | ✅ |
| Create / Upsert Group | ✅ [[Guide]](/docs/api/tutorials/owners.md#upsert-group) | ✅ [[Guide]](/docs/api/tutorials/owners.md#upsert-group) | ✅ |
| Read Owners of a Dataset | ✅ [[Guide]](/docs/api/tutorials/owners.md#read-owners) | ✅ [[Guide]](/docs/api/tutorials/owners.md#read-owners) | ✅ |
| Add Owner to a Dataset | ✅ [[Guide]](/docs/api/tutorials/owners.md#add-owners) | ✅ [[Guide]](/docs/api/tutorials/owners.md#add-owners) | ✅ |
| Remove Owner from a Dataset | ✅ [[Guide]](/docs/api/tutorials/owners.md#remove-owners) | ✅ [[Guide]](/docs/api/tutorials/owners.md#remove-owners) | ✅ |
| Add Lineage | ✅ [[Guide]](/docs/api/tutorials/lineage.md) | ✅ [[Guide]](/docs/api/tutorials/lineage.md#add-lineage) | ✅ |
| Add Column Level (Fine Grained) Lineage | 🚫 | ✅ [[Guide]](/docs/api/tutorials/lineage.md#add-column-level-lineage) | ✅ |
| Add Documentation (Description) to a Column of a Dataset | ✅ [[Guide]](/docs/api/tutorials/descriptions.md#add-description-on-column) | ✅ [[Guide]](/docs/api/tutorials/descriptions.md#add-description-on-column) | ✅ |
| Add Documentation (Description) to a Dataset | ✅ [[Guide]](/docs/api/tutorials/descriptions.md#add-description-on-dataset) | ✅ [[Guide]](/docs/api/tutorials/descriptions.md#add-description-on-dataset) | ✅ |
| Add / Remove / Replace Custom Properties on a Dataset | 🚫 | ✅ [[Guide]](/docs/api/tutorials/custom-properties.md) | ✅ |
| Add ML Feature to ML Feature Table | 🚫 | ✅ [[Guide]](/docs/api/tutorials/ml.md#add-mlfeature-to-mlfeaturetable) | ✅ |
| Add ML Feature to MLModel | 🚫 | ✅ [[Guide]](/docs/api/tutorials/ml.md#add-mlfeature-to-mlmodel) | ✅ |
| Add ML Group to MLFeatureTable | 🚫 | ✅ [[Guide]](/docs/api/tutorials/ml.md#add-mlgroup-to-mlfeaturetable) | ✅ |
| Create MLFeature | 🚫 | ✅ [[Guide]](/docs/api/tutorials/ml.md#create-mlfeature) | ✅ |
| Create MLFeatureTable | 🚫 | ✅ [[Guide]](/docs/api/tutorials/ml.md#create-mlfeaturetable) | ✅ |
| Create MLModel | 🚫 | ✅ [[Guide]](/docs/api/tutorials/ml.md#create-mlmodel) | ✅ |
| Create MLModelGroup | 🚫 | ✅ [[Guide]](/docs/api/tutorials/ml.md#create-mlmodelgroup) | ✅ |
| Create MLPrimaryKey | 🚫 | ✅ [[Guide]](/docs/api/tutorials/ml.md#create-mlprimarykey) | ✅ |
| Read MLFeature | ✅ [[Guide]](/docs/api/tutorials/ml.md#read-mlfeature) | ✅ [[Guide]](/docs/api/tutorials/ml.md#read-mlfeature) | ✅ |
| Read MLFeatureTable | ✅ [[Guide]](/docs/api/tutorials/ml.md#read-mlfeaturetable) | ✅ [[Guide]](/docs/api/tutorials/ml.md#read-mlfeaturetable) | ✅ |
| Read MLModel | ✅ [[Guide]](/docs/api/tutorials/ml.md#read-mlmodel) | ✅ [[Guide]](/docs/api/tutorials/ml.md#read-mlmodel) | ✅ |
| Read MLModelGroup | ✅ [[Guide]](/docs/api/tutorials/ml.md#read-mlmodelgroup) | ✅ [[Guide]](/docs/api/tutorials/ml.md#read-mlmodelgroup) | ✅ |
| Read MLPrimaryKey | ✅ [[Guide]](/docs/api/tutorials/ml.md#read-mlprimarykey) | ✅ [[Guide]](/docs/api/tutorials/ml.md#read-mlprimarykey) | ✅ |
| Create Data Product | 🚫 | ✅ [[Code]](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/create_dataproduct.py) | ✅ |
| Create Lineage Between Chart and Dashboard | 🚫 | ✅ [[Code]](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_chart_dashboard.py) | ✅ |
| Create Lineage Between Dataset and Chart | 🚫 | ✅ [[Code]](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_dataset_chart.py) | ✅ |
| Create Lineage Between Dataset and DataJob | 🚫 | ✅ [[Code]](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_dataset_job_dataset.py) | ✅ |
| Create Finegrained Lineage as DataJob for Dataset | 🚫 | ✅ [[Code]](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_emitter_datajob_finegrained.py) | ✅ |
| Create Finegrained Lineage for Dataset | 🚫 | ✅ [[Code]](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_emitter_dataset_finegrained.py) | ✅ |
| Create Dataset Lineage with Kafka | 🚫 | ✅ [[Code]](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_emitter_kafka.py) | ✅ |
| Create Dataset Lineage with MCPW & Rest Emitter | 🚫 | ✅ [[Code]](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_emitter_mcpw_rest.py) | ✅ |
| Create Dataset Lineage with Rest Emitter | 🚫 | ✅ [[Code]](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_emitter_rest.py) | ✅ |
| Create DataJob with Dataflow | 🚫 | ✅ [[Code]](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_job_dataflow.py) [[Simple]](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_job_dataflow_new_api_simple.py) [[Verbose]](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_job_dataflow_new_api_verbose.py) | ✅ |
| Create Programmatic Pipeline | 🚫 | ✅ [[Code]](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/programatic_pipeline.py) | ✅ |
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Data Lineage

DataHub's Python SDK allows you to programmatically define and retrieve lineage between metadata entities. With the DataHub Lineage SDK, you can:

- Add **table-level and column-level lineage** across datasets, data jobs, dashboards, and charts
- Automatically **infer lineage from SQL queries**
- **Read lineage** (upstream or downstream) for a given entity or column
- **Filter lineage results** using structured filters

Data lineage captures data dependencies within an organization. It allows you to track the inputs from which a data asset is derived, along with the data assets that depend on it downstream. For more information about data lineage, refer to [About DataHub Lineage](/docs/generated/lineage/lineage-feature-guide.md).

## Getting Started

To use the DataHub SDK, you'll need to install [`acryl-datahub`](https://pypi.org/project/acryl-datahub/) and set up a connection to your DataHub instance. Follow the [installation guide](https://docs.datahub.com/docs/metadata-ingestion/cli-ingestion#installing-datahub-cli) to get started.

Connect to your DataHub instance:

```python
from datahub.sdk import DataHubClient

client = DataHubClient(server="<your_server>", token="<your_token>")
```

- **server**: The URL of your DataHub GMS server
  - local: `http://localhost:8080`
  - hosted: `https://<your_datahub_url>/gms`
- **token**: You'll need to [generate a Personal Access Token](https://docs.datahub.com/docs/authentication/personal-access-tokens) from your DataHub instance.

:::note
Before adding lineage, make sure the target entities already exist in your DataHub instance. If you attempt to manipulate entities that do not exist, your operation will fail.
:::

## Add Lineage

The `add_lineage()` method allows you to define lineage between two entities.
### Add Entity Lineage

You can create lineage between two datasets, data jobs, dashboards, or charts. The `upstream` and `downstream` parameters should be the URNs of the entities you want to link.

#### Add Entity Lineage Between Datasets

```python
{{ inline /metadata-ingestion/examples/library/add_lineage_dataset_to_dataset.py show_path_as_comment }}
```

#### Add Entity Lineage Between DataJobs

```python
{{ inline /metadata-ingestion/examples/library/lineage_datajob_to_datajob.py show_path_as_comment }}
```

:::note Lineage Combinations
For supported lineage combinations, see [Supported Lineage Combinations](#supported-lineage-combinations).
:::
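The inline examples above are rendered from the DataHub repository. As a rough sketch of the shape of such a call: the `dataset_urn` helper and table names below are illustrative stand-ins (not the SDK's own builders), and the client call is commented out because it needs a live server.

```python
# Illustrative sketch: build dataset URNs and link them with table-level lineage.
# ASSUMPTION: `dataset_urn` is a local stand-in mirroring DataHub's documented
# URN format; the table names are made up for the example.

def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN in DataHub's documented format."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

upstream = dataset_urn("snowflake", "sales_db.raw_orders")
downstream = dataset_urn("snowflake", "sales_db.clean_orders")

# With a connected client (see Getting Started), the call would look like:
# client.lineage.add_lineage(upstream=upstream, downstream=downstream)
print(upstream)
```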
### Add Column Lineage

You can add column-level lineage by using the `column_lineage` parameter when linking datasets.

#### Add Column Lineage with Fuzzy Matching

```python
{{ inline /metadata-ingestion/examples/library/lineage_dataset_column.py show_path_as_comment }}
```

When `column_lineage` is set to **True**, DataHub will automatically map columns based on their names, allowing for fuzzy matching. This is useful when upstream and downstream datasets have similar but not identical column names (e.g. `customer_id` in the upstream and `CustomerId` in the downstream).

#### Add Column Lineage with Strict Matching

```python
{{ inline /metadata-ingestion/examples/library/lineage_dataset_column_auto_strict.py show_path_as_comment }}
```

This creates column-level lineage with strict matching, meaning the column names must match exactly between the upstream and downstream datasets.

#### Add Column Lineage with Custom Mapping

For custom mapping, you can use a dictionary where keys are downstream column names and values are lists of upstream column names. This allows you to specify complex relationships.

```python
{{ inline /metadata-ingestion/examples/library/lineage_dataset_column_custom_mapping.py show_path_as_comment }}
```
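The custom-mapping shape can be illustrated with a plain dictionary (the column names here are hypothetical, chosen only to show the structure):

```python
# Custom column mapping: downstream column -> list of upstream columns it derives from.
# Column names are hypothetical, chosen only to show the shape of the mapping.
column_lineage = {
    "full_name": ["first_name", "last_name"],  # two upstream columns feed one downstream column
    "order_total": ["amount"],                 # simple one-to-one rename
}

# With a connected client, this dictionary would be passed via the
# `column_lineage` parameter of `add_lineage()` alongside the dataset URNs.
downstream_columns = sorted(column_lineage)
print(downstream_columns)  # -> ['full_name', 'order_total']
```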
### Infer Lineage from SQL

You can infer lineage directly from a SQL query using `infer_lineage_from_sql()`. This will parse the query, determine upstream and downstream datasets, and automatically add lineage (including column-level lineage when possible).

```python
{{ inline /metadata-ingestion/examples/library/lineage_dataset_from_sql.py show_path_as_comment }}
```

:::note DataHub SQL Parser

Check out more information on how we handle SQL parsing below.

- [The DataHub SQL Parser Documentation](../../lineage/sql_parsing.md)
- [Blog Post: Extracting Column-Level Lineage from SQL](https://medium.com/datahub-project/extracting-column-level-lineage-from-sql-779b8ce17567)

:::
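For instance, a CTAS statement like the one below is the kind of input the SQL parser is designed to handle. The query text and table names are illustrative, and the commented client call is a sketch with assumed parameter names, not confirmed API:

```python
# A CTAS statement from which both table-level and column-level lineage can be
# derived: sales_raw -> sales_summary, and amount -> total_amount.
query = """
CREATE TABLE sales_summary AS
SELECT region, SUM(amount) AS total_amount
FROM sales_raw
GROUP BY region
"""

# With a connected client, something like the following would parse the query
# and register the lineage automatically (parameter names are an assumption):
# client.lineage.infer_lineage_from_sql(query_text=query, platform="snowflake")
```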
### Add Query Node with Lineage

If you provide a `transformation_text` to `add_lineage`, DataHub will create a query node that represents the transformation logic. This is useful for tracking how data is transformed between datasets.

```python
{{ inline /metadata-ingestion/examples/library/add_lineage_dataset_to_dataset_with_query_node.py show_path_as_comment }}
```

The transformation text can be any transformation logic: Python scripts, Airflow DAG code, or any other code that describes how the upstream dataset is transformed into the downstream dataset.

<p align="center">
  <img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/lineage/query-node.png"/>
</p>

:::note
Providing `transformation_text` will NOT create column lineage. You need to specify the `column_lineage` parameter to enable column-level lineage.

If you have a SQL query that describes the transformation, you can use [infer_lineage_from_sql](#infer-lineage-from-sql) to automatically parse the query and add column-level lineage.
:::
## Get Lineage

The `get_lineage()` method allows you to retrieve lineage for a given entity.

### Get Entity Lineage

#### Get Upstream Lineage for a Dataset

This will return the direct upstream entities that the dataset depends on. By default, it retrieves only the immediate upstream entities (1 hop).

```python
{{ inline /metadata-ingestion/examples/library/get_lineage_basic.py show_path_as_comment }}
```

#### Get Downstream Lineage for a Dataset Across Multiple Hops

To get upstream or downstream entities that are more than one hop away, you can use the `max_hops` parameter. This allows you to traverse the lineage graph up to a specified number of hops.

```python
{{ inline /metadata-ingestion/examples/library/get_lineage_with_hops.py show_path_as_comment }}
```

:::note USING MAX_HOPS
If you provide `max_hops` greater than 2, the query will traverse the full lineage graph and limit the results by `count`.
:::

#### Return Type

`get_lineage()` returns a list of `LineageResult` objects.
```python
results = [
    LineageResult(
        urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_2,PROD)",
        type="DATASET",
        hops=1,
        direction="downstream",
        platform="snowflake",
        name="table_2",  # name of the entity
        paths=[],  # only populated for column-level lineage
    )
]
```
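Consumers typically iterate over these results, e.g. to separate direct dependencies from transitive ones. A self-contained sketch using a local stand-in dataclass with the same fields (the class here is a mock for illustration, not the SDK's own `LineageResult`):

```python
from dataclasses import dataclass, field

@dataclass
class LineageResult:  # local stand-in mirroring the fields shown above
    urn: str
    type: str
    hops: int
    direction: str
    platform: str
    name: str
    paths: list = field(default_factory=list)

results = [
    LineageResult(urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_2,PROD)",
                  type="DATASET", hops=1, direction="downstream",
                  platform="snowflake", name="table_2"),
    LineageResult(urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_3,PROD)",
                  type="DATASET", hops=2, direction="downstream",
                  platform="snowflake", name="table_3"),
]

# Keep only the direct (1-hop) downstream datasets.
direct = [r.name for r in results if r.hops == 1]
print(direct)  # -> ['table_2']
```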
### Get Column-Level Lineage

#### Get Downstream Lineage for a Dataset Column

You can retrieve column-level lineage by specifying the `source_column` parameter. This will return lineage paths that include the specified column.

```python
{{ inline /metadata-ingestion/examples/library/get_column_lineage.py show_path_as_comment }}
```

You can also pass a `SchemaFieldUrn` as the `source_urn` to get column-level lineage.

```python
{{ inline /metadata-ingestion/examples/library/get_column_lineage_from_schemafield.py show_path_as_comment }}
```

#### Return Type

The return type is the same as for entity lineage, but with an additional `paths` field that contains column lineage paths.
```python
results = [
    LineageResult(
        urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_2,PROD)",
        type="DATASET",
        hops=1,
        direction="downstream",
        platform="snowflake",
        name="table_2",  # name of the entity
        paths=[
            LineagePath(
                urn="urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,table_1,PROD),col1)",
                column_name="col1",  # name of the column
                entity_name="table_1",  # name of the entity that contains the column
            ),
            LineagePath(
                urn="urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,table_2,PROD),col4)",
                column_name="col4",  # name of the column
                entity_name="table_2",  # name of the entity that contains the column
            ),
        ],  # only populated for column-level lineage
    )
]
```

For more details on how to interpret the results, see [Interpreting Column Lineage Results](#interpreting-column-lineage-results).
### Filter Lineage Results

You can filter lineage results by platform, type, domain, environment, and more.

```python
{{ inline /metadata-ingestion/examples/library/get_lineage_with_filter.py show_path_as_comment }}
```

You can find more details about the available filters in the [Search SDK documentation](./sdk/search_client.md#filter-based-search).
## Lineage SDK Reference

### Supported Lineage Combinations

The Lineage APIs support the following entity combinations:

| Upstream Entity | Downstream Entity |
| --------------- | ----------------- |
| Dataset | Dataset |
| Dataset | DataJob |
| DataJob | DataJob |
| Dataset | Dashboard |
| Chart | Dashboard |
| Dashboard | Dashboard |
| Dataset | Chart |

> ℹ️ Column-level lineage and creating a query node with transformation text are **only supported** for `Dataset → Dataset` lineage.
### Column Lineage Options

For dataset-to-dataset lineage, you can specify the `column_lineage` parameter in `add_lineage()` in several ways:

| Value | Description |
| --------------- | ----------- |
| `False` | Disable column-level lineage (default) |
| `True` | Enable column-level lineage with automatic mapping (same as `"auto_fuzzy"`) |
| `"auto_fuzzy"` | Enable column-level lineage with fuzzy matching (useful for similar column names) |
| `"auto_strict"` | Enable column-level lineage with strict matching (exact column names required) |
| Column Mapping | A dictionary mapping downstream column names to lists of upstream column names |

:::note Auto_Fuzzy vs Auto_Strict

- **Auto_Fuzzy**: Automatically matches columns based on similar names, allowing for some flexibility in naming conventions. For example, these two columns would be considered a match:
  - user_id → userId
  - customer_id → CustomerId
- **Auto_Strict**: Requires exact column name matches between upstream and downstream datasets. For example, `customer_id` in the upstream dataset must match `customer_id` in the downstream dataset exactly.

:::
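The difference between the two modes can be sketched with a toy matcher. This illustrates the idea only; it is not DataHub's actual matching algorithm:

```python
import re

def normalize(col: str) -> str:
    """Lower-case and drop separators, so 'CustomerId' and 'customer_id' both become 'customerid'."""
    return re.sub(r"[_\-\s]", "", col).lower()

def fuzzy_match(a: str, b: str) -> bool:
    # Roughly what auto_fuzzy allows: equal after normalization.
    return normalize(a) == normalize(b)

def strict_match(a: str, b: str) -> bool:
    # auto_strict: names must be identical.
    return a == b

print(fuzzy_match("customer_id", "CustomerId"))   # -> True
print(strict_match("customer_id", "CustomerId"))  # -> False
```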
### Interpreting Column Lineage Results

When retrieving column-level lineage, the results include `paths` that show how columns are related across datasets. Each path is a list of column URNs that represents the lineage from the source column to the target column.

For example, let's say we have the following lineage across three tables:

<p align="center">
  <img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/lineage/column-lineage.png"/>
</p>
#### Example with `max_hops=1`

```python
>>> client.lineage.get_lineage(
    source_urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_1,PROD)",
    source_column="col1",
    direction="downstream",
    max_hops=1
)
```

**Returns:**

```python
[
    {
        "urn": "...table_2...",
        "hops": 1,
        "paths": [
            ["...table_1.col1", "...table_2.col4"],
            ["...table_1.col1", "...table_2.col5"]
        ]
    }
]
```
#### Example with `max_hops=2`

```python
>>> client.lineage.get_lineage(
    source_urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,table_1,PROD)",
    source_column="col1",
    direction="downstream",
    max_hops=2
)
```

**Returns:**

```python
[
    {
        "urn": "...table_2...",
        "hops": 1,
        "paths": [
            ["...table_1.col1", "...table_2.col4"],
            ["...table_1.col1", "...table_2.col5"]
        ]
    },
    {
        "urn": "...table_3...",
        "hops": 2,
        "paths": [
            ["...table_1.col1", "...table_2.col4", "...table_3.col7"]
        ]
    }
]
```
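Each path reads left to right from the source column toward the downstream column. To recover just the affected downstream columns from results shaped like the examples above (the truncated `...` URNs are kept as illustrative placeholders):

```python
# Result dictionaries shaped like the max_hops examples above.
results = [
    {"urn": "...table_2...", "hops": 1,
     "paths": [["...table_1.col1", "...table_2.col4"],
               ["...table_1.col1", "...table_2.col5"]]},
    {"urn": "...table_3...", "hops": 2,
     "paths": [["...table_1.col1", "...table_2.col4", "...table_3.col7"]]},
]

# The last element of each path is the affected column in that result's entity.
affected = [path[-1] for r in results for path in r["paths"]]
print(affected)  # -> ['...table_2.col4', '...table_2.col5', '...table_3.col7']
```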
### Lineage GraphQL Examples

You can also use the GraphQL API to add and retrieve lineage.

#### Add Lineage Between Datasets with GraphQL

```graphql
mutation updateLineage {
  updateLineage(
    input: {
      edgesToAdd: [
        {
          downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
          upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
        }
      ]
      edgesToRemove: []
    }
  )
}
```
Note that you can create a list of edges. For example, if you want to assign multiple upstream entities to a downstream entity, you can do the following.

```graphql
mutation updateLineage {
  updateLineage(
    input: {
      edgesToAdd: [
        {
          downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
          upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
        }
        {
          downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
          upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"
        }
      ]
      edgesToRemove: []
    }
  )
}
```
For more information about the `updateLineage` mutation, please refer to [updateLineage](https://docs.datahub.com/docs/graphql/mutations/#updatelineage).

If you see the following response, the operation was successful:

```json
{
  "data": {
    "updateLineage": true
  },
  "extensions": {}
}
```
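Outside of a GraphQL client, the same mutation can be sent as a plain HTTP request. This is a minimal sketch of building the request body that the curl tab sends; actually POSTing it (with `urllib` or `requests`) assumes a running DataHub GMS at `http://localhost:8080/api/graphql` and a bearer token:

```python
import json

# The same updateLineage mutation shown above, as a query string.
mutation = """
mutation updateLineage {
  updateLineage(
    input: {
      edgesToAdd: [
        {
          downstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
          upstreamUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
        }
      ]
      edgesToRemove: []
    }
  )
}
"""

# JSON body in the shape the GraphQL endpoint expects: {"query": ..., "variables": ...}.
body = json.dumps({"query": mutation, "variables": {}})
print(json.loads(body)["variables"])  # {}
```

Send `body` in a POST with `Content-Type: application/json` and an `Authorization: Bearer <my-access-token>` header, exactly as in the curl tab.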
</TabItem>
<TabItem value="curl" label="Curl">

```shell
curl --location --request POST 'http://localhost:8080/api/graphql' \
--header 'Authorization: Bearer <my-access-token>' \
--header 'Content-Type: application/json' --data-raw '{ "query": "mutation updateLineage { updateLineage( input:{ edgesToAdd : { downstreamUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)\", upstreamUrn : \"urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)\"}, edgesToRemove :{downstreamUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)\",upstreamUrn : \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)\" } })}", "variables":{}}'
```

Expected Response:

```json
{ "data": { "updateLineage": true }, "extensions": {} }
```

</TabItem>
<TabItem value="python" label="Python">

```python
{{ inline /metadata-ingestion/examples/library/lineage_emitter_rest.py show_path_as_comment }}
```

</TabItem>
</Tabs>

### Expected Outcome

You can now see the lineage between `fct_users_deleted` and `logging_events`.

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/lineage-added.png"/>
</p>
## Add Column-level Lineage

<Tabs>
<TabItem value="python" label="Python">

```python
{{ inline /metadata-ingestion/examples/library/lineage_emitter_dataset_finegrained_sample.py show_path_as_comment }}
```

</TabItem>
</Tabs>

### Expected Outcome

You can now see the column-level lineage between datasets. Note that you have to enable `Show Columns` to be able to see the column-level lineage.

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/column-level-lineage-added.png"/>
</p>
## Add Lineage to Non-Dataset Entities

You can also add lineage to non-dataset entities, such as DataJobs, Charts, and Dashboards.
Please refer to the following examples.

| Connection          | Examples                                                                                                      | A.K.A            |
| ------------------- | ------------------------------------------------------------------------------------------------------------- | ---------------- |
| DataJob to DataFlow | [lineage_job_dataflow.py](../../../metadata-ingestion/examples/library/lineage_job_dataflow.py)               |                  |
| DataJob to Dataset  | [lineage_dataset_job_dataset.py](../../../metadata-ingestion/examples/library/lineage_dataset_job_dataset.py) | Pipeline Lineage |
| Chart to Dashboard  | [lineage_chart_dashboard.py](../../../metadata-ingestion/examples/library/lineage_chart_dashboard.py)         |                  |
| Chart to Dataset    | [lineage_dataset_chart.py](../../../metadata-ingestion/examples/library/lineage_dataset_chart.py)             |                  |
## Read Lineage (Lineage Impact Analysis)

<Tabs>
<TabItem value="graphql" label="GraphQL" default>

#### Get Downstream Lineage with GraphQL

```graphql
query scrollAcrossLineage {
  scrollAcrossLineage(
    input: {
      query: "*"
      urn: "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
      count: 10
      direction: DOWNSTREAM
      orFilters: [
        {
          and: [
            {
              condition: EQUAL
              negated: false
              field: "degree"
              values: ["1", "2", "3+"]
            }
          ]
        }
      ]
    }
  ) {
    searchResults {
      degree
      entity {
        urn
        type
      }
    }
  }
}
```

:::info Degree
Note that `degree` means the number of hops in the lineage. For example, `degree: 1` means the immediate downstream entities, `degree: 2` means the entities that are two hops away, and so on.
:::
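The `searchResults` can also be grouped by `degree` locally once retrieved. A small plain-Python sketch over a response shaped like the Expected Outcome payload shown later in this section (urns copied from that sample):

```python
# searchResults shaped like the scrollAcrossLineage response in this guide.
search_results = [
    {"degree": 1, "entity": {"urn": "urn:li:dataJob:(urn:li:dataFlow:(airflow,dag_abc,PROD),task_123)", "type": "DATA_JOB"}},
    {"degree": 2, "entity": {"urn": "urn:li:mlPrimaryKey:(user_analytics,user_name)", "type": "MLPRIMARY_KEY"}},
]

# Group entity urns by how many hops away they are.
by_degree = {}
for result in search_results:
    by_degree.setdefault(result["degree"], []).append(result["entity"]["urn"])

print(by_degree[1])  # ['urn:li:dataJob:(urn:li:dataFlow:(airflow,dag_abc,PROD),task_123)']
```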
The GraphQL example uses lineage degrees as a filter, but additional search filters can be included here as well.
This performs a multi-hop lineage search on the specified urn. For more information about the `scrollAcrossLineage` query, please refer to [scrollAcrossLineage](https://docs.datahub.com/docs/graphql/queries/#scrollacrosslineage).

</TabItem>
<TabItem value="curl" label="Curl">

```shell
curl --location --request POST 'http://localhost:8080/api/graphql' \
--header 'Authorization: Bearer <my-access-token>' \
--header 'Content-Type: application/json' --data-raw '{ "query": "query scrollAcrossLineage { scrollAcrossLineage( input: { query: \"*\" urn: \"urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)\" count: 10 direction: DOWNSTREAM orFilters: [ { and: [ { condition: EQUAL negated: false field: \"degree\" values: [\"1\", \"2\", \"3+\"] } ] } ] } ) { searchResults { degree entity { urn type } } }}", "variables": {}}'
```

</TabItem>
<TabItem value="python" label="Python">

```python
{{ inline /metadata-ingestion/examples/library/read_lineage_execute_graphql.py show_path_as_comment }}
```

The Python SDK example shows how to read lineage of a dataset. Please note that the `aspect_type` parameter can vary depending on the entity type.
Below are a few examples of `aspect_type` for different entities.

| Entity    | Aspect_type               | Reference                                                                |
| --------- | ------------------------- | ------------------------------------------------------------------------ |
| Dataset   | `UpstreamLineageClass`    | [Link](/docs/generated/metamodel/entities/dataset.md#upstreamlineage)    |
| Datajob   | `DataJobInputOutputClass` | [Link](/docs/generated/metamodel/entities/dataJob.md#datajobinputoutput) |
| Dashboard | `DashboardInfoClass`      | [Link](/docs/generated/metamodel/entities/dashboard.md#dashboardinfo)    |
| DataFlow  | `DataFlowInfoClass`       | [Link](/docs/generated/metamodel/entities/dataFlow.md#dataflowinfo)      |

Learn more about lineages of different entities in the [Add Lineage to Non-Dataset Entities](#add-lineage-to-non-dataset-entities) section.

</TabItem>
</Tabs>

### Expected Outcome

As an outcome, you should see the downstream entities of `logging_events`.

```json
{
  "data": {
    "scrollAcrossLineage": {
      "searchResults": [
        {
          "degree": 1,
          "entity": {
            "urn": "urn:li:dataJob:(urn:li:dataFlow:(airflow,dag_abc,PROD),task_123)",
            "type": "DATA_JOB"
          }
        },
        ...
        {
          "degree": 2,
          "entity": {
            "urn": "urn:li:mlPrimaryKey:(user_analytics,user_name)",
            "type": "MLPRIMARY_KEY"
          }
        }
      ]
    }
  },
  "extensions": {}
}
```

## Read Column-level Lineage

You can also read column-level lineage via the Python SDK.

<Tabs>
<TabItem value="python" label="Python">

```python
{{ inline /metadata-ingestion/examples/library/read_lineage_dataset_rest.py show_path_as_comment }}
```

</TabItem>
</Tabs>

### Expected Outcome

As a response, you will get the full lineage information like this:

```json
{
  "UpstreamLineageClass": {
    "upstreams": [
      {
        "UpstreamClass": {
          "auditStamp": {
            "AuditStampClass": {
              "time": 0,
              "actor": "urn:li:corpuser:unknown",
              "impersonator": null,
              "message": null
            }
          },
          "created": null,
          "dataset": "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)",
          "type": "TRANSFORMED",
          "properties": null,
          "query": null
        }
      }
    ],
    "fineGrainedLineages": [
      {
        "FineGrainedLineageClass": {
          "upstreamType": "FIELD_SET",
          "upstreams": [
            "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD),browser_id)",
            "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD),user_id)"
          ],
          "downstreamType": "FIELD",
          "downstreams": [
            "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD),browser)"
          ],
          "transformOperation": null,
          "confidenceScore": 1.0,
          "query": null
        }
      }
    ]
  }
}
```

## FAQ

**Can I get lineage at the column level?**

Yes. For dataset-to-dataset lineage, both `add_lineage()` and `get_lineage()` support column-level lineage.

**Can I pass a SQL query and get lineage automatically?**

Yes. Use `infer_lineage_from_sql()` to parse a query and extract table and column lineage.

**Can I use filters when retrieving lineage?**

Yes. `get_lineage()` accepts structured filters via `FilterDsl`, just like in the Search SDK.
```python
from datahub.metadata.urns import DatasetUrn
from datahub.sdk.main_client import DataHubClient

client = DataHubClient.from_env()

upstream_urn = DatasetUrn(platform="snowflake", name="sales_raw")
downstream_urn = DatasetUrn(platform="snowflake", name="sales_cleaned")
client.lineage.add_lineage(upstream=upstream_urn, downstream=downstream_urn)

# column_lineage=True maps columns between upstream and downstream by name.
# You can also pass a dictionary in the form
# {downstream_column_name: [upstream_column_name1, upstream_column_name2]},
# e.g. column_lineage={"id": ["id", "customer_id"]}
client.lineage.add_lineage(
    upstream=upstream_urn, downstream=downstream_urn, column_lineage=True
)
```
```python
from datahub.metadata.urns import DatasetUrn
from datahub.sdk.main_client import DataHubClient
from datahub.sdk.search_filters import FilterDsl as F

client = DataHubClient.from_env()

# Get column lineage for a specific column by passing source_urn and source_column.
# Alternatively, you can pass a schemaField urn to source_urn,
# e.g. source_urn="urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,downstream_table),id)"
downstream_column_lineage = client.lineage.get_lineage(
    source_urn=DatasetUrn(platform="snowflake", name="sales_summary"),
    source_column="id",
    direction="downstream",
    max_hops=1,
    filter=F.and_(
        F.platform("snowflake"),
        F.entity_type("dataset"),
    ),
)

print(downstream_column_lineage)
```
```python
from datahub.sdk.main_client import DataHubClient

client = DataHubClient.from_env()

# Get column lineage for the entire flow by passing a schemaField urn.
results = client.lineage.get_lineage(
    source_urn="urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,sales_summary,PROD),id)",
    direction="downstream",
)

print(list(results))
```
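The string-form schemaField urn used above nests the dataset urn inside it; as a sketch, composing one by hand is plain string formatting (the SDK's urn classes can also build these for you):

```python
# Compose a schemaField urn from a dataset urn plus a field path.
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,sales_summary,PROD)"
field_urn = f"urn:li:schemaField:({dataset_urn},id)"
print(field_urn)
# urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:snowflake,sales_summary,PROD),id)
```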
```python
from datahub.metadata.urns import DatasetUrn
from datahub.sdk.main_client import DataHubClient
from datahub.sdk.search_filters import FilterDsl as F

client = DataHubClient.from_env()

downstream_lineage = client.lineage.get_lineage(
    source_urn=DatasetUrn(platform="snowflake", name="sales_summary"),
    direction="downstream",
    max_hops=2,
    filter=F.and_(
        F.platform("airflow"),
        F.entity_type("dataJob"),
    ),
)

print(downstream_lineage)
```
```python
from datahub.sdk.main_client import DataHubClient
from datahub.sdk.search_filters import FilterDsl as F

client = DataHubClient.from_env()

# Get upstream Snowflake production datasets.
results = client.lineage.get_lineage(
    source_urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,sales_agg,PROD)",
    direction="upstream",
    filter=F.and_(F.platform("snowflake"), F.entity_type("dataset"), F.env("PROD")),
)

print(results)
```
`metadata-ingestion/examples/library/get_lineage_with_hops.py`:

```python
from datahub.metadata.urns import DatasetUrn
from datahub.sdk.main_client import DataHubClient

client = DataHubClient.from_env()

downstream_lineage = client.lineage.get_lineage(
    source_urn=DatasetUrn(platform="snowflake", name="sales_summary"),
    direction="downstream",
    max_hops=2,
)

print(downstream_lineage)
```
```python
from datahub.metadata.urns import DatasetUrn
from datahub.sdk import DataHubClient

client = DataHubClient.from_env()

client.lineage.add_lineage(
    upstream=DatasetUrn(platform="snowflake", name="sales_raw"),
    downstream=DatasetUrn(platform="snowflake", name="sales_cleaned"),
    column_lineage=True,  # same as "auto_fuzzy", which maps columns based on name similarity
)
```
```python
from datahub.metadata.urns import DatasetUrn
from datahub.sdk import DataHubClient

client = DataHubClient.from_env()

client.lineage.add_lineage(
    upstream=DatasetUrn(platform="snowflake", name="sales_raw"),
    downstream=DatasetUrn(platform="snowflake", name="sales_cleaned"),
    column_lineage="auto_strict",
)
```
```python
from datahub.metadata.urns import DatasetUrn
from datahub.sdk import DataHubClient

client = DataHubClient.from_env()

client.lineage.add_lineage(
    upstream=DatasetUrn(platform="snowflake", name="sales_raw"),
    downstream=DatasetUrn(platform="snowflake", name="sales_cleaned"),
    # { downstream_column -> [upstream_columns] }
    column_lineage={
        "id": ["id"],
        "region": ["region", "region_id"],
        "total_revenue": ["revenue"],
    },
)
```
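The explicit `column_lineage` mapping above goes downstream-to-upstream. A quick plain-Python sketch of inverting it answers the opposite question, "which downstream columns does a given upstream column feed?":

```python
# Same mapping as the example above: downstream column -> upstream columns.
column_lineage = {
    "id": ["id"],
    "region": ["region", "region_id"],
    "total_revenue": ["revenue"],
}

# Invert to upstream column -> downstream columns it feeds.
fed_by = {}
for downstream, upstreams in column_lineage.items():
    for upstream in upstreams:
        fed_by.setdefault(upstream, []).append(downstream)

print(fed_by["region_id"])  # ['region']
```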
Types of lineage connections supported in DataHub, with example code, are as follows:

* [Dataset to Dataset](../../../metadata-ingestion/examples/library/add_lineage_dataset_to_dataset.py)
* [DataJob to DataFlow](../../../metadata-ingestion/examples/library/lineage_job_dataflow.py)
* [DataJob to Dataset](../../../metadata-ingestion/examples/library/lineage_dataset_job_dataset.py)
* [Chart to Dashboard](../../../metadata-ingestion/examples/library/lineage_chart_dashboard.py)

### Automatic Lineage Extraction Support

This is a summary of automatic lineage extraction support in our data sources. Please refer to the **Important Capabilities** table in the source documentation. Note that even if the source does not support automatic extraction, you can still add lineage manually using our API & SDKs.