docs: add sdk entity guides (#13870)

Hyejin Yoon 2025-06-27 15:28:20 +09:00 committed by GitHub
parent a3688f78e7
commit 827a2308cd
11 changed files with 393 additions and 0 deletions

@@ -862,6 +862,10 @@ module.exports = {
"docs/api/tutorials/domains",
"docs/api/tutorials/forms",
"docs/api/tutorials/lineage",
"docs/api/tutorials/container",
"docs/api/tutorials/dashboard-chart",
"docs/api/tutorials/dataflow-datajob",
"docs/api/tutorials/mlmodel-mlmodelgroup",
{
type: "doc",
id: "docs/api/tutorials/ml",

@@ -0,0 +1,39 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Container
## Why Would You Use Containers?
The Container entity represents a logical grouping of entities, such as datasets, data processing instances, or even other containers. It helps users organize and manage metadata in a hierarchical structure, making it easier to navigate and understand relationships between different entities.
### How Is a Container Related to Other Entities?
1. **Parent-Child Relationship**: Containers can contain other entities such as datasets, charts, dashboards, data jobs, ML models, and even other containers (nested containers). For example, a dataset can have a container aspect that links it to the schema or folder (container) it belongs to.
2. **Hierarchical Organization**: Containers can be nested, forming a hierarchy (e.g., a database container contains schema containers, which contain table containers). This enables a folder-like browsing experience in the DataHub UI.
3. **Relationships in the Metadata Model**: Many entities (datasets, data jobs, ML models, etc.) have a container aspect that links them to their parent container. Containers themselves can have a parentContainer aspect for nesting.
### Goal Of This Guide
This guide will show you how to:
- Create a container.
## Prerequisites
For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
For detailed steps, please refer to the [DataHub Quickstart Guide](/docs/quickstart.md).
## Create Container
<Tabs>
<TabItem value="python" label="Python" default>
```python
{{ inline /metadata-ingestion/examples/library/create_container.py show_path_as_comment }}
```
</TabItem>
</Tabs>
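Once the container is written, you can read it back with `client.entities.get`, following the same read pattern used in the other SDK guides. A minimal sketch, reusing the `client` and `container` objects from the create example above:
```python
# assumes `client` and `container` from the create example above
container_entity = client.entities.get(container.urn)

print("Container URN:", container_entity.urn)
```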

@@ -0,0 +1,77 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Dashboard & Chart
## Why Would You Use Dashboards and Charts?
The dashboard and chart entities are used to represent visualizations of data, typically in the context of business intelligence or analytics platforms. They allow users to create, manage, and share visual representations of data insights.
### Goal Of This Guide
This guide will show you how to:
- Create a dashboard and a chart.
- Link the dashboard to the chart or another dashboard.
- Read dashboard and chart entities.
## Prerequisites
For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
For detailed steps, please refer to the [DataHub Quickstart Guide](/docs/quickstart.md).
## Create Chart
```python
{{ inline /metadata-ingestion/examples/library/create_chart.py show_path_as_comment }}
```
### Link Chart with Datasets
You can associate datasets with the chart by providing dataset URNs in the `input_datasets` parameter. This creates lineage between the chart and those datasets, so you can track the data sources the chart uses.
```python
{{ inline /metadata-ingestion/examples/library/create_chart_complex.py show_path_as_comment }}
```
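If you only need the lineage wiring, here is a minimal sketch of the `input_datasets` parameter on its own. It follows the constructor pattern from the basic example above; the Snowflake dataset URN is a placeholder:
```python
from datahub.metadata.urns import DatasetUrn
from datahub.sdk import Chart, DataHubClient

client = DataHubClient.from_env()

chart = Chart(
    name="example_chart",
    platform="looker",
    # dataset URNs passed here become upstream lineage for the chart
    input_datasets=[
        DatasetUrn(platform="snowflake", name="sales_summary", env="PROD"),
    ],
)

client.entities.upsert(chart)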
## Create Dashboard
```python
{{ inline /metadata-ingestion/examples/library/create_dashboard.py show_path_as_comment }}
```
### Link Dashboard with Charts, Dashboards, and Datasets
You can associate charts, dashboards, and datasets with the dashboard by providing their URNs in the `charts`, `dashboards`, and `input_datasets` parameters, respectively. This will create lineage between the dashboard and the associated entities.
```python
{{ inline /metadata-ingestion/examples/library/create_dashboard_complex.py show_path_as_comment }}
```
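For orientation, a hedged sketch that wires all three parameters together; the chart, dashboard, and dataset URNs below are placeholders:
```python
from datahub.metadata.urns import ChartUrn, DashboardUrn, DatasetUrn
from datahub.sdk import Dashboard, DataHubClient

client = DataHubClient.from_env()

dashboard = Dashboard(
    name="example_dashboard",
    platform="looker",
    # charts rendered on this dashboard
    charts=[ChartUrn.from_string("urn:li:chart:(looker,example_chart)")],
    # other dashboards embedded in this dashboard
    dashboards=[DashboardUrn.from_string("urn:li:dashboard:(looker,another_dashboard)")],
    # upstream datasets backing this dashboard
    input_datasets=[
        DatasetUrn(platform="snowflake", name="sales_summary", env="PROD"),
    ],
)

client.entities.upsert(dashboard)
```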
## Read Chart
```python
{{ inline /metadata-ingestion/examples/library/read_chart.py show_path_as_comment }}
```
#### Expected Output
```python
>> Chart name: example_chart
>> Chart platform: urn:li:dataPlatform:looker
>> Chart description: looker chart for production
```
## Read Dashboard
```python
{{ inline /metadata-ingestion/examples/library/read_dashboard.py show_path_as_comment }}
```
#### Expected Output
```python
>> Dashboard name: example_dashboard
>> Dashboard platform: urn:li:dataPlatform:looker
>> Dashboard description: looker dashboard for production
```

@@ -0,0 +1,78 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# DataFlow & DataJob
## Why Would You Use DataFlow and DataJob?
The DataFlow and DataJob entities are used to represent data processing pipelines and jobs within a data ecosystem. They allow users to define, manage, and monitor the flow of data through various stages of processing, from ingestion to transformation and storage.
### Goal Of This Guide
This guide will show you how to:
- Create a DataFlow.
- Create a DataJob attached to a DataFlow.
## Prerequisites
For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
For detailed steps, please refer to the [DataHub Quickstart Guide](/docs/quickstart.md).
## Create DataFlow
<Tabs>
<TabItem value="python" label="Python" default>
```python
{{ inline /metadata-ingestion/examples/library/create_dataflow.py show_path_as_comment }}
```
</TabItem>
</Tabs>
## Create DataJob
A DataJob must be associated with a DataFlow. You can create a DataJob by providing either the DataFlow object or the DataFlow URN together with its platform instance.
<Tabs>
<TabItem value="datajob-flow-object" label="Create DataJob with a DataFlow Object" default>
```python
{{ inline /metadata-ingestion/examples/library/create_datajob.py show_path_as_comment }}
```
</TabItem>
<TabItem value="datajob-flow-urn" label="Create DataJob with DataFlow URN">
```python
{{ inline /metadata-ingestion/examples/library/create_datajob_with_flow_urn.py show_path_as_comment }}
```
</TabItem>
</Tabs>
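If you already have the flow URN as a string (for example, copied from the DataHub UI), you can parse it with `DataFlowUrn.from_string` instead of building it field by field. A minimal sketch, using the URN format shown in the read example below:
```python
from datahub.metadata.urns import DataFlowUrn
from datahub.sdk import DataHubClient, DataJob

client = DataHubClient.from_env()

flow_urn = DataFlowUrn.from_string("urn:li:dataFlow:(airflow,PROD.example_dag,PROD)")

datajob = DataJob(
    name="example_datajob",
    flow_urn=flow_urn,
    # when passing a flow URN, supply the platform instance explicitly
    platform_instance="PROD",
)

client.entities.upsert(datajob)
```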
## Read DataFlow
```python
{{ inline /metadata-ingestion/examples/library/read_dataflow.py show_path_as_comment }}
```
#### Example Output
```python
>> DataFlow name: example_dataflow
>> DataFlow platform: urn:li:dataPlatform:airflow
>> DataFlow description: airflow pipeline for production
```
## Read DataJob
```python
{{ inline /metadata-ingestion/examples/library/read_datajob.py show_path_as_comment }}
```
#### Example Output
```python
>> DataJob name: example_datajob
>> DataJob Flow URN: urn:li:dataFlow:(airflow,PROD.example_dag,PROD)
>> DataJob description: example datajob
```

@@ -0,0 +1,78 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# MLModel & MLModelGroup
## Why Would You Use MLModel and MLModelGroup?
MLModel and MLModelGroup entities are used to represent machine learning models and their associated groups within a metadata ecosystem. They allow users to define, manage, and monitor machine learning models, including their versions, configurations, and performance metrics.
### Goal Of This Guide
This guide will show you how to:
- Create an MLModel or MLModelGroup.
- Associate an MLModel with an MLModelGroup.
- Read MLModel and MLModelGroup entities.
## Prerequisites
For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
For detailed steps, please refer to the [DataHub Quickstart Guide](/docs/quickstart.md).
## Create MLModelGroup
You can create an MLModelGroup by providing attributes such as its name and platform, along with any other metadata.
```python
{{ inline /metadata-ingestion/examples/library/create_mlmodel_group.py show_path_as_comment }}
```
## Create MLModel
You can create an MLModel by providing attributes such as its name and platform, along with any other metadata.
```python
{{ inline /metadata-ingestion/examples/library/create_mlmodel.py show_path_as_comment }}
```
Note that you can associate an MLModel with an MLModelGroup by providing the group URN when creating the MLModel.
You can also set the MLModelGroup later by updating the MLModel entity, as shown below.
```python
{{ inline /metadata-ingestion/examples/library/add_mlgroup_to_mlmodel.py show_path_as_comment }}
```
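For reference, associating the group at creation time might look like the sketch below. The `model_group` parameter name and the exact `MLModel` constructor signature are assumptions here, so check the SDK reference before relying on them; the group URN is the one from the read example further down.
```python
from datahub.metadata.urns import MlModelGroupUrn
from datahub.sdk import DataHubClient, MLModel

client = DataHubClient.from_env()

model = MLModel(
    id="my-recommendations-model",
    platform="mlflow",
    name="My Recommendations Model",
    # assumed parameter: links this model to an existing model group
    model_group=MlModelGroupUrn.from_string(
        "urn:li:mlModelGroup:(urn:li:dataPlatform:mlflow,my-recommendations-model,PROD)"
    ),
)

client.entities.upsert(model)
```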
## Read MLModelGroup
You can read an MLModelGroup by providing the group URN.
```python
{{ inline /metadata-ingestion/examples/library/read_mlmodel_group.py show_path_as_comment }}
```
#### Expected Output
```python
>> Model Group Name: My Recommendations Model Group
>> Model Group Description: A group for recommendations models
>> Model Group Custom Properties: {'owner': 'John Doe', 'team': 'recommendations', 'domain': 'marketing'}
```
## Read MLModel
You can read an MLModel by providing the model URN.
```python
{{ inline /metadata-ingestion/examples/library/read_mlmodel.py show_path_as_comment }}
```
#### Expected Output
```python
>> Model Name: My Recommendations Model
>> Model Description: A model for recommending products to users
>> Model Group: urn:li:mlModelGroup:(urn:li:dataPlatform:mlflow,my-recommendations-model,PROD)
>> Model Hyper Parameters: [MLHyperParamClass({'name': 'learning_rate', 'description': None, 'value': '0.01', 'createdAt': None}), MLHyperParamClass({'name': 'num_epochs', 'description': None, 'value': '100', 'createdAt': None}), MLHyperParamClass({'name': 'batch_size', 'description': None, 'value': '32', 'createdAt': None})]
```

@@ -0,0 +1,13 @@
from datahub.emitter.mcp_builder import ContainerKey
from datahub.sdk import Container, DataHubClient
client = DataHubClient.from_env()
# the container key identifies the container by platform and name
container = Container(
    container_key=ContainerKey(platform="mlflow", name="airline_forecast_experiment"),
    display_name="Airline Forecast Experiment",
)
client.entities.upsert(container)

@@ -0,0 +1,24 @@
from datahub.metadata.urns import DataFlowUrn, DatasetUrn
from datahub.sdk import DataHubClient, DataJob
client = DataHubClient.from_env()
# when referencing the flow by URN, supply the platform instance explicitly
datajob = DataJob(
    name="example_datajob",
    flow_urn=DataFlowUrn(
        orchestrator="airflow",
        flow_id="example_dag",
        cluster="PROD",
    ),
    platform_instance="PROD",
    # inlets/outlets record dataset-level lineage for this job
    inlets=[
        DatasetUrn(platform="hdfs", name="dataset1", env="PROD"),
    ],
    outlets=[
        DatasetUrn(platform="hdfs", name="dataset2", env="PROD"),
    ],
)
client.entities.upsert(datajob)

@@ -0,0 +1,19 @@
from datahub.metadata.urns import TagUrn
from datahub.sdk import Chart, DataHubClient
client = DataHubClient.from_env()
chart = Chart(
    name="example_chart",
    platform="looker",
    description="looker chart for production",
    tags=[TagUrn(name="production"), TagUrn(name="data_engineering")],
)
client.entities.upsert(chart)
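# read the chart back and print a few of its fields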
chart_entity = client.entities.get(chart.urn)
print("Chart name:", chart_entity.name)
print("Chart platform:", chart_entity.platform)
print("Chart description:", chart_entity.description)

@@ -0,0 +1,18 @@
from datahub.metadata.urns import TagUrn
from datahub.sdk import Dashboard, DataHubClient
client = DataHubClient.from_env()
dashboard = Dashboard(
    name="example_dashboard",
    platform="looker",
    description="looker dashboard for production",
    tags=[TagUrn(name="production"), TagUrn(name="data_engineering")],
)
client.entities.upsert(dashboard)
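# read the dashboard back and print a few of its fields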
dashboard_entity = client.entities.get(dashboard.urn)
print("Dashboard name:", dashboard_entity.name)
print("Dashboard platform:", dashboard_entity.platform)
print("Dashboard description:", dashboard_entity.description)

@@ -0,0 +1,18 @@
from datahub.metadata.urns import TagUrn
from datahub.sdk import DataFlow, DataHubClient
client = DataHubClient.from_env()
dataflow = DataFlow(
    name="example_dataflow",
    platform="airflow",
    description="airflow pipeline for production",
    tags=[TagUrn(name="production"), TagUrn(name="data_engineering")],
)
client.entities.upsert(dataflow)
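# read the dataflow back and print a few of its fields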
dataflow_entity = client.entities.get(dataflow.urn)
print("DataFlow name:", dataflow_entity.name)
print("DataFlow platform:", dataflow_entity.platform)
print("DataFlow description:", dataflow_entity.description)

@@ -0,0 +1,25 @@
from datahub.sdk import DataFlow, DataHubClient, DataJob
client = DataHubClient.from_env()
dataflow = DataFlow(
    platform="airflow",
    name="example_dag",
    platform_instance="PROD",
)
# datajob will inherit the platform and platform instance from the flow
datajob = DataJob(
    name="example_datajob",
    description="example datajob",
    flow=dataflow,
)
client.entities.upsert(dataflow)
client.entities.upsert(datajob)
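# read the datajob back and print a few of its fields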
datajob_entity = client.entities.get(datajob.urn)
print("DataJob name:", datajob_entity.name)
print("DataJob Flow URN:", datajob_entity.flow_urn)
print("DataJob description:", datajob_entity.description)