docs: add sdk entity guides (#13870)

Hyejin Yoon 2025-06-27 15:28:20 +09:00 committed by GitHub
parent a3688f78e7
commit 827a2308cd
11 changed files with 393 additions and 0 deletions

@@ -862,6 +862,10 @@ module.exports = {
"docs/api/tutorials/domains",
"docs/api/tutorials/forms",
"docs/api/tutorials/lineage",
"docs/api/tutorials/container",
"docs/api/tutorials/dashboard-chart",
"docs/api/tutorials/dataflow-datajob",
"docs/api/tutorials/mlmodel-mlmodelgroup",
{
type: "doc",
id: "docs/api/tutorials/ml",

@@ -0,0 +1,39 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Container
## Why Would You Use Containers?
The Container entity represents a logical grouping of entities, such as datasets, data processing instances, or even other containers. It helps users organize and manage metadata in a hierarchical structure, making it easier to navigate and understand relationships between different entities.
### How Is a Container Related to Other Entities?
1. **Parent-Child Relationship**: Containers can contain other entities such as datasets, charts, dashboards, data jobs, ML models, and even other containers (nested containers). For example, a dataset can have a container aspect that links it to the schema or folder (container) it belongs to.
2. **Hierarchical Organization**: Containers can be nested, forming a hierarchy (e.g., a database container contains schema containers, which contain table containers). This enables a folder-like browsing experience in the DataHub UI.
3. **Relationships in the Metadata Model**: Many entities (datasets, data jobs, ML models, etc.) have a container aspect that links them to their parent container. Containers themselves can have a parentContainer aspect for nesting.
### Goal Of This Guide
This guide will show you how to:
- Create a container.
## Prerequisites
For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
For detailed steps, please refer to the [DataHub Quickstart Guide](/docs/quickstart.md).
## Create Container
<Tabs>
<TabItem value="python" label="Python" default>
```python
{{ inline /metadata-ingestion/examples/library/create_container.py show_path_as_comment }}
```
</TabItem>
</Tabs>
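Once the container is written, you can read it back with `client.entities.get`, following the same read pattern used in the other SDK guides. A minimal sketch, reusing the `client` and `container` objects from the create example above:
```python
# assumes `client` and `container` from the create example above
container_entity = client.entities.get(container.urn)

print("Container URN:", container_entity.urn)
```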

@@ -0,0 +1,77 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Dashboard & Chart
## Why Would You Use Dashboards and Charts?
The dashboard and chart entities are used to represent visualizations of data, typically in the context of business intelligence or analytics platforms. They allow users to create, manage, and share visual representations of data insights.
### Goal Of This Guide
This guide will show you how to:
- Create a dashboard and a chart.
- Link the dashboard to the chart or another dashboard.
- Read dashboard and chart entities.
## Prerequisites
For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
For detailed steps, please refer to the [DataHub Quickstart Guide](/docs/quickstart.md).
## Create Chart
```python
{{ inline /metadata-ingestion/examples/library/create_chart.py show_path_as_comment }}
```
### Link Chart with Datasets
You can associate datasets with the chart by providing dataset URNs in the `input_datasets` parameter. This creates lineage between the chart and those datasets, so you can track the data sources the chart uses.
```python
{{ inline /metadata-ingestion/examples/library/create_chart_complex.py show_path_as_comment }}
```
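If you only need the lineage wiring, here is a minimal sketch of the `input_datasets` parameter on its own. It follows the constructor pattern from the basic example above; the Snowflake dataset URN is a placeholder:
```python
from datahub.metadata.urns import DatasetUrn
from datahub.sdk import Chart, DataHubClient

client = DataHubClient.from_env()

chart = Chart(
    name="example_chart",
    platform="looker",
    # dataset URNs passed here become upstream lineage for the chart
    input_datasets=[
        DatasetUrn(platform="snowflake", name="sales_summary", env="PROD"),
    ],
)

client.entities.upsert(chart)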
## Create Dashboard
```python
{{ inline /metadata-ingestion/examples/library/create_dashboard.py show_path_as_comment }}
```
### Link Dashboard with Charts, Dashboards, and Datasets
You can associate charts, dashboards, and datasets with the dashboard by providing their URNs in the `charts`, `dashboards`, and `input_datasets` parameters, respectively. This will create lineage between the dashboard and the associated entities.
```python
{{ inline /metadata-ingestion/examples/library/create_dashboard_complex.py show_path_as_comment }}
```
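For orientation, a hedged sketch that wires all three parameters together; the chart, dashboard, and dataset URNs below are placeholders:
```python
from datahub.metadata.urns import ChartUrn, DashboardUrn, DatasetUrn
from datahub.sdk import Dashboard, DataHubClient

client = DataHubClient.from_env()

dashboard = Dashboard(
    name="example_dashboard",
    platform="looker",
    # charts rendered on this dashboard
    charts=[ChartUrn.from_string("urn:li:chart:(looker,example_chart)")],
    # other dashboards embedded in this dashboard
    dashboards=[DashboardUrn.from_string("urn:li:dashboard:(looker,another_dashboard)")],
    # upstream datasets backing this dashboard
    input_datasets=[
        DatasetUrn(platform="snowflake", name="sales_summary", env="PROD"),
    ],
)

client.entities.upsert(dashboard)
```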
## Read Chart
```python
{{ inline /metadata-ingestion/examples/library/read_chart.py show_path_as_comment }}
```
#### Expected Output
```python
>> Chart name: example_chart
>> Chart platform: urn:li:dataPlatform:looker
>> Chart description: looker chart for production
```
## Read Dashboard
```python
{{ inline /metadata-ingestion/examples/library/read_dashboard.py show_path_as_comment }}
```
#### Expected Output
```python
>> Dashboard name: example_dashboard
>> Dashboard platform: urn:li:dataPlatform:looker
>> Dashboard description: looker dashboard for production
```

@@ -0,0 +1,78 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# DataFlow & DataJob
## Why Would You Use DataFlow and DataJob?
The DataFlow and DataJob entities are used to represent data processing pipelines and jobs within a data ecosystem. They allow users to define, manage, and monitor the flow of data through various stages of processing, from ingestion to transformation and storage.
### Goal Of This Guide
This guide will show you how to:
- Create a DataFlow.
- Create a DataJob attached to a DataFlow.
## Prerequisites
For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
For detailed steps, please refer to the [DataHub Quickstart Guide](/docs/quickstart.md).
## Create DataFlow
<Tabs>
<TabItem value="python" label="Python" default>
```python
{{ inline /metadata-ingestion/examples/library/create_dataflow.py show_path_as_comment }}
```
</TabItem>
</Tabs>
## Create DataJob
A DataJob must be associated with a DataFlow. You can create a DataJob by providing either the DataFlow object or the DataFlow URN together with its platform instance.
<Tabs>
<TabItem value="datajob-flow-object" label="Create DataJob with a DataFlow Object" default>
```python
{{ inline /metadata-ingestion/examples/library/create_datajob.py show_path_as_comment }}
```
</TabItem>
<TabItem value="datajob-flow-urn" label="Create DataJob with DataFlow URN">
```python
{{ inline /metadata-ingestion/examples/library/create_datajob_with_flow_urn.py show_path_as_comment }}
```
</TabItem>
</Tabs>
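If you already have the flow URN as a string (for example, copied from the DataHub UI), you can parse it with `DataFlowUrn.from_string` instead of building it field by field. A minimal sketch, using the URN format shown in the read example below:
```python
from datahub.metadata.urns import DataFlowUrn
from datahub.sdk import DataHubClient, DataJob

client = DataHubClient.from_env()

flow_urn = DataFlowUrn.from_string("urn:li:dataFlow:(airflow,PROD.example_dag,PROD)")

datajob = DataJob(
    name="example_datajob",
    flow_urn=flow_urn,
    # when passing a flow URN, supply the platform instance explicitly
    platform_instance="PROD",
)

client.entities.upsert(datajob)
```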
## Read DataFlow
```python
{{ inline /metadata-ingestion/examples/library/read_dataflow.py show_path_as_comment }}
```
#### Example Output
```python
>> DataFlow name: example_dataflow
>> DataFlow platform: urn:li:dataPlatform:airflow
>> DataFlow description: airflow pipeline for production
```
## Read DataJob
```python
{{ inline /metadata-ingestion/examples/library/read_datajob.py show_path_as_comment }}
```
#### Example Output
```python
>> DataJob name: example_datajob
>> DataJob Flow URN: urn:li:dataFlow:(airflow,PROD.example_dag,PROD)
>> DataJob description: example datajob
```

@@ -0,0 +1,78 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# MLModel & MLModelGroup
## Why Would You Use MLModel and MLModelGroup?
MLModel and MLModelGroup entities are used to represent machine learning models and their associated groups within a metadata ecosystem. They allow users to define, manage, and monitor machine learning models, including their versions, configurations, and performance metrics.
### Goal Of This Guide
This guide will show you how to:
- Create an MLModel or MLModelGroup.
- Associate an MLModel with an MLModelGroup.
- Read MLModel and MLModelGroup entities.
## Prerequisites
For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
For detailed steps, please refer to the [DataHub Quickstart Guide](/docs/quickstart.md).
## Create MLModelGroup
You can create an MLModelGroup by providing attributes such as its name and platform, along with any other metadata.
```python
{{ inline /metadata-ingestion/examples/library/create_mlmodel_group.py show_path_as_comment }}
```
## Create MLModel
You can create an MLModel by providing attributes such as its name and platform, along with any other metadata.
```python
{{ inline /metadata-ingestion/examples/library/create_mlmodel.py show_path_as_comment }}
```
Note that you can associate an MLModel with an MLModelGroup by providing the group URN when creating the MLModel.
You can also set the MLModelGroup later by updating the MLModel entity, as shown below.
```python
{{ inline /metadata-ingestion/examples/library/add_mlgroup_to_mlmodel.py show_path_as_comment }}
```
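For reference, associating the group at creation time might look like the sketch below. The `model_group` parameter name and the exact `MLModel` constructor signature are assumptions here, so check the SDK reference before relying on them; the group URN is the one from the read example further down.
```python
from datahub.metadata.urns import MlModelGroupUrn
from datahub.sdk import DataHubClient, MLModel

client = DataHubClient.from_env()

model = MLModel(
    id="my-recommendations-model",
    platform="mlflow",
    name="My Recommendations Model",
    # assumed parameter: links this model to an existing model group
    model_group=MlModelGroupUrn.from_string(
        "urn:li:mlModelGroup:(urn:li:dataPlatform:mlflow,my-recommendations-model,PROD)"
    ),
)

client.entities.upsert(model)
```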
## Read MLModelGroup
You can read an MLModelGroup by providing the group URN.
```python
{{ inline /metadata-ingestion/examples/library/read_mlmodel_group.py show_path_as_comment }}
```
#### Expected Output
```python
>> Model Group Name: My Recommendations Model Group
>> Model Group Description: A group for recommendations models
>> Model Group Custom Properties: {'owner': 'John Doe', 'team': 'recommendations', 'domain': 'marketing'}
```
## Read MLModel
You can read an MLModel by providing the model URN.
```python
{{ inline /metadata-ingestion/examples/library/read_mlmodel.py show_path_as_comment }}
```
#### Expected Output
```python
>> Model Name: My Recommendations Model
>> Model Description: A model for recommending products to users
>> Model Group: urn:li:mlModelGroup:(urn:li:dataPlatform:mlflow,my-recommendations-model,PROD)
>> Model Hyper Parameters: [MLHyperParamClass({'name': 'learning_rate', 'description': None, 'value': '0.01', 'createdAt': None}), MLHyperParamClass({'name': 'num_epochs', 'description': None, 'value': '100', 'createdAt': None}), MLHyperParamClass({'name': 'batch_size', 'description': None, 'value': '32', 'createdAt': None})]
```

@@ -0,0 +1,13 @@
from datahub.emitter.mcp_builder import ContainerKey
from datahub.sdk import Container, DataHubClient
client = DataHubClient.from_env()
# the container key identifies the container by platform and name
container = Container(
    container_key=ContainerKey(platform="mlflow", name="airline_forecast_experiment"),
    display_name="Airline Forecast Experiment",
)
client.entities.upsert(container)

@@ -0,0 +1,24 @@
from datahub.metadata.urns import DataFlowUrn, DatasetUrn
from datahub.sdk import DataHubClient, DataJob
client = DataHubClient.from_env()
# when referencing the flow by URN, supply the platform instance explicitly
datajob = DataJob(
    name="example_datajob",
    flow_urn=DataFlowUrn(
        orchestrator="airflow",
        flow_id="example_dag",
        cluster="PROD",
    ),
    platform_instance="PROD",
    # inlets/outlets record dataset-level lineage for this job
    inlets=[
        DatasetUrn(platform="hdfs", name="dataset1", env="PROD"),
    ],
    outlets=[
        DatasetUrn(platform="hdfs", name="dataset2", env="PROD"),
    ],
)
client.entities.upsert(datajob)

@@ -0,0 +1,19 @@
from datahub.metadata.urns import TagUrn
from datahub.sdk import Chart, DataHubClient
client = DataHubClient.from_env()
chart = Chart(
    name="example_chart",
    platform="looker",
    description="looker chart for production",
    tags=[TagUrn(name="production"), TagUrn(name="data_engineering")],
)
client.entities.upsert(chart)
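# read the chart back and print a few of its fields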
chart_entity = client.entities.get(chart.urn)
print("Chart name:", chart_entity.name)
print("Chart platform:", chart_entity.platform)
print("Chart description:", chart_entity.description)

@@ -0,0 +1,18 @@
from datahub.metadata.urns import TagUrn
from datahub.sdk import Dashboard, DataHubClient
client = DataHubClient.from_env()
dashboard = Dashboard(
    name="example_dashboard",
    platform="looker",
    description="looker dashboard for production",
    tags=[TagUrn(name="production"), TagUrn(name="data_engineering")],
)
client.entities.upsert(dashboard)
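# read the dashboard back and print a few of its fields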
dashboard_entity = client.entities.get(dashboard.urn)
print("Dashboard name:", dashboard_entity.name)
print("Dashboard platform:", dashboard_entity.platform)
print("Dashboard description:", dashboard_entity.description)

@@ -0,0 +1,18 @@
from datahub.metadata.urns import TagUrn
from datahub.sdk import DataFlow, DataHubClient
client = DataHubClient.from_env()
dataflow = DataFlow(
    name="example_dataflow",
    platform="airflow",
    description="airflow pipeline for production",
    tags=[TagUrn(name="production"), TagUrn(name="data_engineering")],
)
client.entities.upsert(dataflow)
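# read the dataflow back and print a few of its fields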
dataflow_entity = client.entities.get(dataflow.urn)
print("DataFlow name:", dataflow_entity.name)
print("DataFlow platform:", dataflow_entity.platform)
print("DataFlow description:", dataflow_entity.description)

@@ -0,0 +1,25 @@
from datahub.sdk import DataFlow, DataHubClient, DataJob
client = DataHubClient.from_env()
dataflow = DataFlow(
    platform="airflow",
    name="example_dag",
    platform_instance="PROD",
)
# datajob will inherit the platform and platform instance from the flow
datajob = DataJob(
    name="example_datajob",
    description="example datajob",
    flow=dataflow,
)
client.entities.upsert(dataflow)
client.entities.upsert(datajob)
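# read the datajob back and print a few of its fields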
datajob_entity = client.entities.get(datajob.urn)
print("DataJob name:", datajob_entity.name)
print("DataJob Flow URN:", datajob_entity.flow_urn)
print("DataJob description:", datajob_entity.description)