import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# AI/ML Framework Integration with DataHub

## Why Integrate Your AI/ML System with DataHub?

As a data practitioner, keeping track of your AI experiments, models, and their relationships can be challenging.
DataHub makes this easier by providing a central place to organize and track your AI assets.

This guide will show you how to integrate your AI workflows with DataHub.
With integrations for popular ML platforms like MLflow and Amazon SageMaker, DataHub enables you to easily find and share AI models across your organization, track how models evolve over time, and understand how training data connects to each model.
Most importantly, it enables seamless collaboration on AI projects by making everything discoverable and connected.

## Goals Of This Guide

In this guide, you'll learn how to:

- Create your basic AI components (models, experiments, runs)
- Connect these components to build a complete AI system
- Track relationships between models, data, and experiments

## Core AI Concepts

Here's what you need to know about the key components in DataHub:

- **Experiments** are collections of training runs for the same project, like all attempts to build a churn predictor
- **Training Runs** are attempts to train a model within an experiment, capturing parameters and results
- **Model Groups** organize related models together, like all versions of your churn predictor
- **Models** are successful training runs registered for production use

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/ml/concept-diagram-dh-term.png" />
</p>

The hierarchy works like this:

1. Every run belongs to an experiment
2. Successful runs can be registered as models
3. Models belong to a model group
4. Not every run becomes a model
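
Each of these assets is identified by a URN in DataHub. For reference, here are the URN shapes used by the example assets in this guide (the same identifiers appear in the GraphQL verification queries below):

```python
# URN shapes for the example assets used throughout this guide (reference only).
experiment_urn = "urn:li:container:airline_forecast_experiment"  # ML Experiment
run_urn = "urn:li:dataProcessInstance:simple_training_run"  # ML Training Run
model_group_urn = "urn:li:mlModelGroup:(urn:li:dataPlatform:mlflow,airline_forecast_models_group,PROD)"  # ML Model Group
model_urn = "urn:li:mlModel:(urn:li:dataPlatform:mlflow,arima_model,PROD)"  # ML Model
```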

:::note Terminology Mapping
Different AI platforms (MLflow, Amazon SageMaker) have their own terminology.
To keep things consistent, we'll use DataHub's terms throughout this guide.
Here's how DataHub's terminology maps to these platforms:

| DataHub | Description | MLflow | SageMaker |
| --------------- | ----------------------------------- | ------------- | ------------- |
| ML Model Group | Collection of related models | Model | Model Group |
| ML Model | Versioned artifact in a model group | Model Version | Model Version |
| ML Training Run | Single training attempt | Run | Training Job |
| ML Experiment | Collection of training runs | Experiment | Experiment |
:::

For platform-specific details, see our integration guides for [MLflow](/docs/generated/ingestion/sources/mlflow.md) and [Amazon SageMaker](/docs/generated/ingestion/sources/sagemaker.md).

## Basic Setup

To follow this tutorial, you'll need DataHub Quickstart deployed locally.
For detailed steps, see the [DataHub Quickstart Guide](/docs/quickstart.md).

Next, set up the Python client for DataHub using the `DatahubAIClient` class defined [here](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/ai/dh_ai_client.py).
Create a personal access token in the DataHub UI and replace `<your_token>` with it:

```python
from dh_ai_client import DatahubAIClient

client = DatahubAIClient(token="<your_token>", server_url="http://localhost:9002")
```
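
The later snippets also reference a handful of supporting classes. A sketch of the imports they assume is below; the entity classes (`MLModelGroup`, `MLModel`, `Container`, `Dataset`) and helpers such as `MLAssetSubTypes` and `RunResultType` come from the DataHub Python SDK and the linked example code, so check the sample scripts for the exact import paths in your version:

```python
# Supporting imports assumed by the snippets in this guide. The SDK entity classes
# (MLModelGroup, MLModel, Container, Dataset) and ML helpers are imported from the
# DataHub Python SDK; see the linked example scripts for the exact paths in your version.
from datetime import datetime

from datahub.emitter.mcp_builder import ContainerKey
from datahub.metadata.schema_classes import (
    AuditStampClass,
    DataProcessInstancePropertiesClass,
    MLHyperParamClass,
    MLMetricClass,
    MLTrainingRunPropertiesClass,
)
from datahub.metadata.urns import (
    CorpUserUrn,
    DataProcessInstanceUrn,
    GlossaryTermUrn,
    TagUrn,
)
```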

:::note Verifying via GraphQL
Throughout this guide, we'll show how to verify changes using GraphQL queries.
You can run these queries in the DataHub UI at `http://localhost:9002/api/graphiql`.
:::
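
For example, a quick way to confirm GraphiQL is reachable is a minimal query like the following, which should return the user you're logged in as (any query from the sections below can be pasted in the same way):

```graphql
query {
  me {
    corpUser {
      username
    }
  }
}
```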

## Create AI Assets

Let's create the basic building blocks of your ML system. These components will help you organize your AI work and make it discoverable by your team.

### Create a Model Group

A model group contains different versions of the same model. For example, all versions of your "Customer Churn Predictor" would go in one group.

<Tabs>
<TabItem value="basic" label="Basic">

Create a basic model group with just an identifier:

```python
model_group = MLModelGroup(
    id="airline_forecast_models_group",
    platform="mlflow",
)

client._emit_mcps(model_group.as_mcps())
```

</TabItem>
<TabItem value="advanced" label="Advanced">

Add rich metadata like descriptions, creation timestamps, and team information:

```python
model_group = MLModelGroup(
    id="airline_forecast_models_group",
    platform="mlflow",
    name="Airline Forecast Models Group",
    description="Group of models for airline passenger forecasting",
    created=datetime.now(),
    last_modified=datetime.now(),
    owners=[CorpUserUrn("urn:li:corpuser:datahub")],
    external_url="https://www.linkedin.com/in/datahub",
    tags=["urn:li:tag:forecasting", "urn:li:tag:arima"],
    terms=["urn:li:glossaryTerm:forecasting"],
    custom_properties={"team": "forecasting"},
)

client._emit_mcps(model_group.as_mcps())
```

</TabItem>
</Tabs>

Let's verify that our model group was created:

<Tabs>
<TabItem value="UI" label="UI">

See your new model group in the DataHub UI:

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/ml/model-group-empty.png" />
</p>

</TabItem>
<TabItem value="graphql" label="GraphQL">

Query your model group to check its properties:

```graphql
query {
  mlModelGroup(
    urn: "urn:li:mlModelGroup:(urn:li:dataPlatform:mlflow,airline_forecast_models_group,PROD)"
  ) {
    properties {
      name
      description
      created {
        time
      }
    }
  }
}
```

The response will show your model group's details:

```json
{
  "data": {
    "mlModelGroup": {
      "properties": {
        "name": "Airline Forecast Models Group",
        "description": "Group of models for airline passenger forecasting",
        "created": {
          "time": 1744356062485
        }
      }
    }
  },
  "extensions": {}
}
```

</TabItem>
</Tabs>

### Create a Model

Next, let's create a specific model version that represents a trained model ready for deployment.

<Tabs>
<TabItem value="basic" label="Basic">

Create a model with just an identifier:

```python
model = MLModel(
    id="arima_model",
    platform="mlflow",
)

client._emit_mcps(model.as_mcps())
```

</TabItem>
<TabItem value="advanced" label="Advanced">

Include metrics, parameters, and metadata for production use:

```python
model = MLModel(
    id="arima_model",
    platform="mlflow",
    name="ARIMA Model",
    description="ARIMA model for airline passenger forecasting",
    created=datetime.now(),
    last_modified=datetime.now(),
    owners=[CorpUserUrn("urn:li:corpuser:datahub")],
    external_url="https://www.linkedin.com/in/datahub",
    tags=["urn:li:tag:forecasting", "urn:li:tag:arima"],
    terms=["urn:li:glossaryTerm:forecasting"],
    custom_properties={"team": "forecasting"},
    version="1",
    aliases=["champion"],
    hyper_params={"learning_rate": "0.01"},
    training_metrics={"accuracy": "0.9"},
)

client._emit_mcps(model.as_mcps())
```

</TabItem>
</Tabs>

Let's verify our model:

<Tabs>
<TabItem value="UI" label="UI">

Check your model's details in the DataHub UI:

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/ml/model-empty.png" />
</p>

</TabItem>
<TabItem value="graphql" label="GraphQL">

Query your model's information:

```graphql
query {
  mlModel(urn: "urn:li:mlModel:(urn:li:dataPlatform:mlflow,arima_model,PROD)") {
    properties {
      name
      description
    }
    versionProperties {
      version {
        versionTag
      }
    }
  }
}
```

The response will show your model's details:

```json
{
  "data": {
    "mlModel": {
      "properties": {
        "name": "ARIMA Model",
        "description": "ARIMA model for airline passenger forecasting"
      },
      "versionProperties": {
        "version": {
          "versionTag": "1"
        }
      }
    }
  },
  "extensions": {}
}
```

</TabItem>
</Tabs>

### Create an Experiment

An experiment helps organize multiple training runs for a specific project.

<Tabs>
<TabItem value="basic" label="Basic">

Create a basic experiment:

```python
experiment = Container(
    container_key=ContainerKey(
        platform="mlflow",
        name="airline_forecast_experiment",
    ),
    display_name="Airline Forecast Experiment",
)

client._emit_mcps(experiment.as_mcps())
```

</TabItem>
<TabItem value="advanced" label="Advanced">

Add context and metadata:

```python
experiment = Container(
    container_key=ContainerKey(
        platform="mlflow",
        name="airline_forecast_experiment",
    ),
    display_name="Airline Forecast Experiment",
    description="Experiment to forecast airline passenger numbers",
    extra_properties={"team": "forecasting"},
    created=datetime(2025, 4, 9, 22, 30),
    last_modified=datetime(2025, 4, 9, 22, 30),
    subtype=MLAssetSubTypes.MLFLOW_EXPERIMENT,
)

client._emit_mcps(experiment.as_mcps())
```

</TabItem>
</Tabs>

Verify your experiment:

<Tabs>
<TabItem value="UI" label="UI">

See your experiment's details in the UI:

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/ml/experiment-empty.png" />
</p>

</TabItem>
<TabItem value="graphql" label="GraphQL">

Query your experiment's information:

```graphql
query {
  container(urn: "urn:li:container:airline_forecast_experiment") {
    properties {
      name
      description
    }
  }
}
```

Check the response:

```json
{
  "data": {
    "container": {
      "properties": {
        "name": "Airline Forecast Experiment",
        "description": "Experiment to forecast airline passenger numbers"
      }
    }
  },
  "extensions": {}
}
```

</TabItem>
</Tabs>

### Create a Training Run

A training run captures all details about a specific model training attempt.

<Tabs>
<TabItem value="basic" label="Basic">

Create a basic training run:

```python
client.create_training_run(
    run_id="simple_training_run",
)
```

</TabItem>
<TabItem value="advanced" label="Advanced">

Include metrics, parameters, and other important metadata:

```python
client.create_training_run(
    run_id="simple_training_run",
    properties=DataProcessInstancePropertiesClass(
        name="Simple Training Run",
        created=AuditStampClass(
            time=1628580000000, actor="urn:li:corpuser:datahub"
        ),
        customProperties={"team": "forecasting"},
    ),
    training_run_properties=MLTrainingRunPropertiesClass(
        id="simple_training_run",
        outputUrls=["s3://my-bucket/output"],
        trainingMetrics=[MLMetricClass(name="accuracy", value="0.9")],
        hyperParams=[MLHyperParamClass(name="learning_rate", value="0.01")],
        externalUrl="https://localhost:5000",
    ),
    run_result=RunResultType.FAILURE,
    start_timestamp=1628580000000,
    end_timestamp=1628580001000,
)
```

</TabItem>
</Tabs>

Verify your training run:

<Tabs>
<TabItem value="UI" label="UI">

View the run details in the UI:

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/ml/run-empty.png" />
</p>

</TabItem>
<TabItem value="graphql" label="GraphQL">

Query your training run:

```graphql
query {
  dataProcessInstance(urn: "urn:li:dataProcessInstance:simple_training_run") {
    name
    created {
      time
    }
    properties {
      customProperties
    }
  }
}
```

Check the response:

```json
{
  "data": {
    "dataProcessInstance": {
      "name": "Simple Training Run",
      "created": {
        "time": 1628580000000
      },
      "properties": {
        "customProperties": {
          "team": "forecasting"
        }
      }
    }
  }
}
```

</TabItem>
</Tabs>

### Create a Dataset

Datasets are crucial components in your ML system, serving as inputs and outputs for your training runs. Creating a dataset in DataHub allows you to track data lineage and understand how data flows through your ML pipeline.

<Tabs>
<TabItem value="basic" label="Basic">

Create a basic dataset with minimal information:

```python
input_dataset = Dataset(
platform="snowflake",
name="iris_input",
)
client._emit_mcps(input_dataset.as_mcps())
```

</TabItem>
<TabItem value="advanced" label="Advanced">

Create a dataset with more detailed information:
```python
input_dataset = Dataset(
platform="snowflake",
name="iris_input",
description="Raw Iris dataset used for training ML models",
schema=[("id", "number"), ("name", "string"), ("species", "string")],
display_name="Iris Training Input Data",
tags=["urn:li:tag:ml_data", "urn:li:tag:iris"],
terms=["urn:li:glossaryTerm:raw_data"],
owners=[CorpUserUrn("urn:li:corpuser:datahub")],
custom_properties={
"data_source": "UCI Repository",
"records": "150",
"features": "4",
},
)
client._emit_mcps(input_dataset.as_mcps())
output_dataset = Dataset(
platform="snowflake",
name="iris_output",
description="Processed Iris dataset with model predictions",
schema=[("id", "number"), ("name", "string"), ("species", "string")],
display_name="Iris Model Output Data",
tags=["urn:li:tag:ml_data", "urn:li:tag:predictions"],
terms=["urn:li:glossaryTerm:model_output"],
owners=[CorpUserUrn("urn:li:corpuser:datahub")],
custom_properties={
"model_version": "1.0",
"records": "150",
"accuracy": "0.95",
},
)
client._emit_mcps(output_dataset.as_mcps())
```

</TabItem>
</Tabs>

Verify your datasets:

<Tabs>
<TabItem value="UI" label="UI">

View dataset details in the DataHub UI:

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/ml/dataset.png" />
</p>

</TabItem>
<TabItem value="graphql" label="GraphQL">

Query your dataset's information:

```graphql
query {
  dataset(
    urn: "urn:li:dataset:(urn:li:dataPlatform:snowflake,iris_input,PROD)"
  ) {
    name
    properties {
      customProperties
    }
  }
}
```

Check the response:

```json
{
"data": {
"dataset": {
"name": "iris_input",
"properties": {
"customProperties": {
"data_source": "UCI Repository",
"records": "150",
"features": "4"
}
}
}
}
}
```

</TabItem>
</Tabs>

Datasets in DataHub can also include schema information, data quality metrics, and lineage details, which are particularly valuable for ML workflows where understanding data characteristics is crucial for model performance.

## Define Relationships

Now let's connect these components to create a comprehensive ML system. These connections enable you to track model lineage, monitor model evolution, understand dependencies, and search effectively across your ML assets.

### Add Model To Model Group

Connect your model to its group:

```python
model.add_group(model_group.urn)
client._emit_mcps(model.as_mcps())
```

<Tabs>
<TabItem value="UI" label="UI">

View model versions in the **Model Group** under the **Models** section:

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/ml/model-group-with-model.png" />
</p>

Find group information in the **Model** page under the **Group** tab:

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/ml/model-with-model-group.png" />
</p>

</TabItem>
<TabItem value="graphql" label="GraphQL">

Query the model-group relationship:

```graphql
query {
  mlModel(urn: "urn:li:mlModel:(urn:li:dataPlatform:mlflow,arima_model,PROD)") {
    name
    properties {
      groups {
        urn
        properties {
          name
        }
      }
    }
  }
}
```

Check the response:

```json
{
  "data": {
    "mlModel": {
      "name": "ARIMA Model",
      "properties": {
        "groups": [
          {
            "urn": "urn:li:mlModelGroup:(urn:li:dataPlatform:mlflow,airline_forecast_models_group,PROD)",
            "properties": {
              "name": "Airline Forecast Models Group"
            }
          }
        ]
      }
    }
  }
}
```

</TabItem>
</Tabs>

### Add Run To Experiment

Connect a training run to its experiment:

```python
client.add_run_to_experiment(run_urn=run_urn, experiment_urn=experiment_urn)
```
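
The `run_urn` and `experiment_urn` arguments refer to the run and experiment created above. As a minimal sketch (using the example identifiers from this guide; adapt to your own assets), they can simply be written out as URN strings:

```python
# Illustrative values matching the examples in this guide; the same URNs appear
# in the GraphQL verification queries. run_id is reused by the add_training_job
# calls in the next sections.
run_id = "simple_training_run"
run_urn = "urn:li:dataProcessInstance:simple_training_run"
experiment_urn = "urn:li:container:airline_forecast_experiment"
```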

<Tabs>
<TabItem value="UI" label="UI">

Find your runs in the **Experiment** page under the **Entities** tab:

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/ml/experiment-with-run.png" />
</p>

See the experiment details in the **Run** page:

<p align="center">
  <img width="40%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/ml/run-with-experiment.png" />
</p>

</TabItem>
<TabItem value="graphql" label="GraphQL">

Query the run-experiment relationship:

```graphql
query {
  dataProcessInstance(urn: "urn:li:dataProcessInstance:simple_training_run") {
    name
    parentContainers {
      containers {
        urn
        properties {
          name
        }
      }
    }
  }
}
```

View the relationship details:

```json
{
  "data": {
    "dataProcessInstance": {
      "name": "Simple Training Run",
      "parentContainers": {
        "containers": [
          {
            "urn": "urn:li:container:airline_forecast_experiment",
            "properties": {
              "name": "Airline Forecast Experiment"
            }
          }
        ]
      }
    }
  }
}
```

</TabItem>
</Tabs>

### Add Run To Model

Connect a training run to its resulting model:

```python
model.add_training_job(DataProcessInstanceUrn(run_id))
client._emit_mcps(model.as_mcps())
```

This relationship enables you to:

- Track which runs produced each model
- Understand model provenance
- Debug model issues
- Monitor model evolution

<Tabs>
<TabItem value="UI" label="UI">

Find the source run in the **Model** page under the **Summary** tab:

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/ml/model-with-source-run.png" />
</p>

See related models in the **Run** page under the **Lineage** tab:

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/ml/run-lineage-model.png" />
</p>

<p align="center">
  <img width="50%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/ml/run-lineage-model-graph.png" />
</p>

</TabItem>
<TabItem value="graphql" label="GraphQL">

Query the model's training jobs:

```graphql
query {
  mlModel(urn: "urn:li:mlModel:(urn:li:dataPlatform:mlflow,arima_model,PROD)") {
    name
    properties {
      mlModelLineageInfo {
        trainingJobs
      }
    }
  }
}
```

View the relationship:

```json
{
  "data": {
    "mlModel": {
      "name": "ARIMA Model",
      "properties": {
        "mlModelLineageInfo": {
          "trainingJobs": ["urn:li:dataProcessInstance:simple_training_run"]
        }
      }
    }
  }
}
```

</TabItem>
</Tabs>

### Add Run To Model Group

Create a direct connection between a run and a model group:

```python
model_group.add_training_job(DataProcessInstanceUrn(run_id))
client._emit_mcps(model_group.as_mcps())
```

This connection lets you:

- View model groups in the run's lineage
- Query training jobs at the group level
- Track training history for model families

<Tabs>
<TabItem value="UI" label="UI">

See model groups in the **Run** page under the **Lineage** tab:

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/ml/run-lineage-model-group.png" />
</p>

<p align="center">
  <img width="50%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/ml/run-lineage-model-group-graph.png" />
</p>

</TabItem>
<TabItem value="graphql" label="GraphQL">

Query the model group's training jobs:

```graphql
query {
  mlModelGroup(
    urn: "urn:li:mlModelGroup:(urn:li:dataPlatform:mlflow,airline_forecast_models_group,PROD)"
  ) {
    name
    properties {
      mlModelLineageInfo {
        trainingJobs
      }
    }
  }
}
```

Check the relationship:

```json
{
  "data": {
    "mlModelGroup": {
      "name": "Airline Forecast Models Group",
      "properties": {
        "mlModelLineageInfo": {
          "trainingJobs": ["urn:li:dataProcessInstance:simple_training_run"]
        }
      }
    }
  }
}
```

</TabItem>
</Tabs>

### Add Dataset To Run

Track input and output datasets for your training runs:

```python
client.add_input_datasets_to_run(
    run_urn=run_urn,
    dataset_urns=[str(input_dataset_urn)],
)

client.add_output_datasets_to_run(
    run_urn=run_urn,
    dataset_urns=[str(output_dataset_urn)],
)
```
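
Here, `input_dataset_urn` and `output_dataset_urn` refer to the datasets created earlier. A small sketch, assuming the `Dataset` objects expose their URN the same way `model_group.urn` does above (otherwise, write the dataset URN strings out directly):

```python
# Illustrative: derive the URNs from the Dataset objects created in the
# "Create a Dataset" section. Writing the URN string directly, e.g.
# "urn:li:dataset:(urn:li:dataPlatform:snowflake,iris_input,PROD)", also works.
input_dataset_urn = input_dataset.urn
output_dataset_urn = output_dataset.urn
```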

These connections help you:

- Track data lineage
- Understand data dependencies
- Ensure reproducibility
- Monitor data quality impacts

Find dataset relationships in the **Lineage** tab of either the **Dataset** or **Run** page:

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/ml/run-lineage-dataset-graph.png" />
</p>

## Update Properties

### Update Model Group Properties

Model groups can be updated with additional information:

```python
# Update description
model_group.set_description("Updated description for airline forecast models")
# Add tags and terms
model_group.add_tag(TagUrn("production"))
model_group.add_term(GlossaryTermUrn("time-series"))
# Update custom properties
model_group.set_custom_properties({
"team": "forecasting",
"business_unit": "operations",
"status": "active"
})
# Save the changes
client._emit_mcps(model_group.as_mcps())
```

These updates allow you to:

- Improve documentation with detailed descriptions
- Apply consistent business context with tags and terms
- Track organizational ownership and status

### Update Model Properties

Models can be updated with additional information as they evolve:

```python
# Update model version
model.set_version("2")
# Add tags and terms
model.add_tag(TagUrn("marketing"))
model.add_term(GlossaryTermUrn("marketing"))
# Add version alias
model.add_version_alias("challenger")
# Save the changes
client._emit_mcps(model.as_mcps())
```

These updates allow you to:

- Track model iterations through versioning
- Apply business context with tags and terms
- Manage deployment aliases like "champion" and "challenger"

### Update Experiment Properties

Experiments can be updated with additional metadata as your project evolves:

```python
# Create a container object for the existing experiment
existing_experiment = Container(
container_key=ContainerKey(
platform="mlflow",
name="airline_forecast_experiment"
),
)
# Update properties
existing_experiment.set_description("Updated experiment for forecasting passenger numbers")
existing_experiment.add_tag(TagUrn("time-series"))
existing_experiment.add_term(GlossaryTermUrn("forecasting"))
existing_experiment.set_custom_properties({
"team": "forecasting",
"priority": "high",
"status": "active"
})
# Save the changes
client._emit_mcps(existing_experiment.as_mcps())
```

These updates help you:

- Document evolving experiment objectives
- Categorize experiments with consistent tags
- Track experiment status and priority

## Full Overview

Here's your complete ML system with all components connected:

<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/ml/lineage-full.png" />
</p>

You now have a complete lineage view of your ML assets, from training data through runs to production models!
You can check out the full code for this tutorial [here](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/ai/dh_ai_client_sample.py).
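
For quick reference, the core calls from this guide chain together roughly like this (a condensed sketch of the snippets above, omitting the optional metadata; the linked sample script contains the complete, runnable version):

```python
# Condensed recap of this guide, using the same example identifiers as above.
model_group = MLModelGroup(id="airline_forecast_models_group", platform="mlflow")
model = MLModel(id="arima_model", platform="mlflow")
experiment = Container(
    container_key=ContainerKey(platform="mlflow", name="airline_forecast_experiment")
)
client._emit_mcps(model_group.as_mcps())
client._emit_mcps(experiment.as_mcps())
client.create_training_run(run_id="simple_training_run")

# Connect the pieces: model -> group, run -> experiment, run -> model/group, data -> run.
model.add_group(model_group.urn)
model.add_training_job(DataProcessInstanceUrn("simple_training_run"))
model_group.add_training_job(DataProcessInstanceUrn("simple_training_run"))
client._emit_mcps(model.as_mcps())
client._emit_mcps(model_group.as_mcps())
client.add_run_to_experiment(run_urn=run_urn, experiment_urn=experiment_urn)
client.add_input_datasets_to_run(run_urn=run_urn, dataset_urns=[str(input_dataset_urn)])
client.add_output_datasets_to_run(run_urn=run_urn, dataset_urns=[str(output_dataset_urn)])
```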

## What's Next?

To see these integrations in action:

- Watch our [Townhall demo](https://youtu.be/_WUoVqkF2Zo?feature=shared&t=1932) showcasing the MLflow integration
- Explore our detailed documentation:
  - [MLflow Integration Guide](/docs/generated/ingestion/sources/mlflow.md)
  - [Amazon SageMaker Integration Guide](/docs/generated/ingestion/sources/sagemaker.md)