docs(ml-models): enhancing ml model documentation (#8848)

2025-12-27 18:07:57 +00:00 · 2023-09-19 09:02:24 -07:00 · 2023-09-19 09:02:24 -07:00 · 67af68284f
commit 67af68284f
parent 85fa5a1c4f
7 changed files with 201 additions and 37 deletions
--- a/docs/api/tutorials/ml.md
+++ b/docs/api/tutorials/ml.md
@ -7,11 +7,12 @@ import TabItem from '@theme/TabItem';

 Machine learning systems have become a crucial feature in modern data stacks.
 However, the relationships between the different components of a machine learning system, such as features, models, and feature tables, can be complex.
-Thus, it is essential for these systems to be discoverable to facilitate easy access and utilization by other members of the organization.
+DataHub makes these relationships discoverable and facilitates their use by other members of the organization.

-For more information on ML entities, please refer to the following docs:
+For technical details on ML entities, please refer to the following docs:

 - [MlFeature](/docs/generated/metamodel/entities/mlFeature.md)
+- [MlPrimaryKey](/docs/generated/metamodel/entities/mlPrimaryKey.md)
 - [MlFeatureTable](/docs/generated/metamodel/entities/mlFeatureTable.md)
 - [MlModel](/docs/generated/metamodel/entities/mlModel.md)
 - [MlModelGroup](/docs/generated/metamodel/entities/mlModelGroup.md)
@ -20,9 +21,11 @@ For more information on ML entities, please refer to the following docs:

 This guide will show you how to

- Create ML entities: MlFeature, MlFeatureTable, MlModel, MlModelGroup
- Read ML entities: MlFeature, MlFeatureTable, MlModel, MlModelGroup
- Attach MlFeatureTable or MlModel to MlFeature
+- Create ML entities: MlFeature, MlFeatureTable, MlModel, MlModelGroup, MlPrimaryKey
+- Read ML entities: MlFeature, MlFeatureTable, MlModel, MlModelGroup, MlPrimaryKey
+- Attach MlModel to MlFeature
+- Attach MlFeatures to MlFeatureTable
+- Attach MlFeatures to the upstream Datasets that power them

 ## Prerequisites

@ -33,6 +36,8 @@ For detailed steps, please refer to [Datahub Quickstart Guide](/docs/quickstart.

 ### Create MlFeature

+An ML Feature represents an instance of a feature that can be used across different machine learning models. Features are organized into Feature Tables to be consumed by machine learning models. For example, if we were modeling features for a Users Feature Table, the Features would be `age`, `sign_up_date`, `active_in_past_30_days`, and so forth. Using Features in DataHub allows users to see the sources a feature was generated from and how a feature is used to train models.
+
 <Tabs>
 <TabItem value="python" label="Python" default>

@ -40,13 +45,31 @@ For detailed steps, please refer to [Datahub Quickstart Guide](/docs/quickstart.
 {{ inline /metadata-ingestion/examples/library/create_mlfeature.py show_path_as_comment }}
 ```

-Note that when creating a feature, you can access a list of data sources using `sources`.
+Note that when creating a feature, you create upstream lineage to the data warehouse using `sources`.
+
+</TabItem>
+</Tabs>
+
+### Create MlPrimaryKey
+
+An ML Primary Key represents a specific element of a Feature Table that indicates what group the other features belong to. For example, if a Feature Table contained features for Users, the ML Primary Key would likely be `user_id` or some similar unique identifier for a user. Using ML Primary Keys in DataHub allows users to indicate how ML Feature Tables are structured.
+
+<Tabs>
+<TabItem value="python" label="Python" default>
+
+```python
+{{ inline /metadata-ingestion/examples/library/create_mlprimarykey.py show_path_as_comment }}
+```
+
+Note that when creating a primary key, you create upstream lineage to the data warehouse using `sources`.

 </TabItem>
 </Tabs>

 ### Create MlFeatureTable

+A feature table represents a group of similar Features that can all be used together to train a model. For example, if there was a Users Feature Table, it would contain documentation around how to use the Users collection of Features and references to each Feature and ML Primary Key contained within it.
+
 <Tabs>
 <TabItem value="python" label="Python" default>

@ -54,14 +77,14 @@ Note that when creating a feature, you can access a list of data sources using `
 {{ inline /metadata-ingestion/examples/library/create_mlfeature_table.py show_path_as_comment }}
 ```

-Note that when creating a feature table, you can access a list of features using `mlFeatures`.
+Note that when creating a feature table, you connect the table to its features and primary key using `mlFeatures` and `mlPrimaryKeys`.

 </TabItem>
 </Tabs>

 ### Create MlModel

-Please note that an MlModel represents the outcome of a single training run for a model, not the collective results of all model runs.
+An ML Model in Acryl represents an individual version of a trained Machine Learning Model. Another way to think about the ML Model entity is as an instance of a training run. An ML Model entity tracks the exact ML Features used in that instance of training, along with the training results. This entity does not represent all versions of an ML Model. For example, if we train a model for homepage customization on a certain day, that would be an ML Model in DataHub. If you re-train the model the next day on new data or with different parameters, that would produce a second ML Model entity.

 <Tabs>
 <TabItem value="python" label="Python" default>
@ -70,15 +93,15 @@ Please note that an MlModel represents the outcome of a single training run for
 {{ inline /metadata-ingestion/examples/library/create_mlmodel.py show_path_as_comment }}
 ```

-Note that when creating a model, you can access a list of features using `mlFeatures`.
-Additionally, you can access the relationship to model groups with `groups`.
+Note that when creating a model, you link it to a list of features using `mlFeatures`. This indicates how the individual instance of the model was trained.
+Additionally, you can access the relationship to model groups with `groups`. An ML Model is connected to the warehouse tables it depends on via its dependency on the ML Features it reads from.

 </TabItem>
 </Tabs>

 ### Create MlModelGroup

-Please note that an MlModelGroup serves as a container for all the runs of a single ML model.
+An ML Model Group represents the grouping of all training runs of a single Machine Learning model category. It will store documentation about the group of ML Models, along with references to each individual ML Model instance.

 <Tabs>
 <TabItem value="python" label="Python" default>
@ -94,18 +117,14 @@ Please note that an MlModelGroup serves as a container for all the runs of a sin

 You can search the entities in DataHub UI.

-
 <p align="center">
  <img width="70%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/feature-table-created.png"/>
 </p>

-
-
 <p align="center">
  <img width="70%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/model-group-created.png"/>
 </p>

-
 ## Read ML Entities

 ### Read MLFeature
@ -192,6 +211,93 @@ Expected response:
 </TabItem>
 </Tabs>

+### Read MlPrimaryKey
+
+<Tabs>
+<TabItem value="graphql" label="GraphQL" default>
+
+```graphql
+query {
+  mlPrimaryKey(urn: "urn:li:mlPrimaryKey:(user_features,user_id)"){
+    name
+    featureNamespace
+    description
+    dataType
+    properties {
+      description
+      dataType
+      version {
+        versionTag
+      }
+    }
+  }
+}
+```
+
+Expected response:
+
+```json
+{
+  "data": {
+    "mlPrimaryKey": {
+      "name": "user_id",
+      "featureNamespace": "user_features",
+      "description": "User's internal ID",
+      "dataType": "ORDINAL",
+      "properties": {
+        "description": "User's internal ID",
+        "dataType": "ORDINAL",
+        "version": null
+      }
+    }
+  },
+  "extensions": {}
+}
+```
+
+</TabItem>
+<TabItem value="curl" label="Curl" default>
+
+```shell
+curl --location --request POST 'http://localhost:8080/api/graphql' \
+--header 'Authorization: Bearer <my-access-token>' \
+--header 'Content-Type: application/json' \
+--data-raw '{
+    "query": "query {  mlPrimaryKey(urn: \"urn:li:mlPrimaryKey:(user_features,user_id)\"){    name    featureNamespace    description    dataType    properties {      description      dataType      version {        versionTag      }    }  }}"
+}'
+```
+
+Expected response:
+
+```json
+{
+  "data": {
+    "mlPrimaryKey": {
+      "name": "user_id",
+      "featureNamespace": "user_features",
+      "description": "User's internal ID",
+      "dataType": "ORDINAL",
+      "properties": {
+        "description": "User's internal ID",
+        "dataType": "ORDINAL",
+        "version": null
+      }
+    }
+  },
+  "extensions": {}
+}
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+{{ inline /metadata-ingestion/examples/library/read_mlprimarykey.py show_path_as_comment }}
+```
+
+</TabItem>
+</Tabs>
+
 ### Read MLFeatureTable

 <Tabs>
@ -232,8 +338,7 @@ Expected Response:
          {
            "name": "test_BOOL_LIST_feature"
          },
-          ...
-          {
+          ...{
            "name": "test_STRING_feature"
          }
        ]
@ -273,8 +378,7 @@ Expected Response:
          {
            "name": "test_BOOL_LIST_feature"
          },
-          ...
-          {
+          ...{
            "name": "test_STRING_feature"
          }
        ]
@ -507,14 +611,10 @@ Expected Response: (Note that this entity does not exist in the sample ingestion

 You can access to `Features` or `Group` Tab of each entity to view the added entities.

-
 <p align="center">
  <img width="70%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/feature-added-to-model.png"/>
 </p>

-
-
 <p align="center">
  <img width="70%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/apis/tutorials/model-group-added-to-model.png"/>
 </p>
-
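
The `mlPrimaryKey` response documented in the `ml.md` changes above nests the same fields at the top level and under `properties`, and returns `version` as `null` when no version is set. A minimal sketch of consuming that payload with the standard library only is shown here; the helper name `summarize_primary_key` and the flattened key names are hypothetical, and the payload is copied from the "Expected response" block rather than fetched from a live server.

```python
import json

# Sample payload copied from the "Expected response" block in the tutorial.
RESPONSE = """
{
  "data": {
    "mlPrimaryKey": {
      "name": "user_id",
      "featureNamespace": "user_features",
      "description": "User's internal ID",
      "dataType": "ORDINAL",
      "properties": {
        "description": "User's internal ID",
        "dataType": "ORDINAL",
        "version": null
      }
    }
  },
  "extensions": {}
}
"""


def summarize_primary_key(payload: dict) -> dict:
    """Flatten the mlPrimaryKey part of a GraphQL response (hypothetical helper)."""
    key = (payload.get("data") or {}).get("mlPrimaryKey")
    if key is None:
        raise LookupError("mlPrimaryKey not found in response")
    # "properties" and "version" may be null, so fall back to empty dicts.
    properties = key.get("properties") or {}
    version = properties.get("version") or {}
    return {
        "name": key["name"],
        "namespace": key["featureNamespace"],
        "data_type": key["dataType"],
        "version_tag": version.get("versionTag"),
    }


summary = summarize_primary_key(json.loads(RESPONSE))
print(summary)
```

Guarding against a `null` `version` avoids a `TypeError` when reading `versionTag` from an entity that has no version aspect.
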
--- a/metadata-ingestion/examples/library/create_mlfeature.py
+++ b/metadata-ingestion/examples/library/create_mlfeature.py
@ -7,11 +7,11 @@ from datahub.emitter.rest_emitter import DatahubRestEmitter
 emitter = DatahubRestEmitter(gms_server="http://localhost:8080", extra_headers={})

 dataset_urn = builder.make_dataset_urn(
-    name="fct_users_deleted", platform="hive", env="PROD"
+    name="fct_users_created", platform="hive", env="PROD"
 )
 feature_urn = builder.make_ml_feature_urn(
-    feature_table_name="my-feature-table",
-    feature_name="my-feature",
+    feature_table_name="users_feature_table",
+    feature_name="user_signup_date",
 )

 #  Create feature
@ -21,7 +21,12 @@ metadata_change_proposal = MetadataChangeProposalWrapper(
    entityUrn=feature_urn,
    aspectName="mlFeatureProperties",
    aspect=models.MLFeaturePropertiesClass(
-        description="my feature", sources=[dataset_urn], dataType="TEXT"
+        description="Represents the date the user created their account",
+        # attaching a source to a feature creates lineage between the feature
+        # and the upstream dataset. This is how lineage between your data warehouse
+        # and machine learning ecosystem is established.
+        sources=[dataset_urn],
+        dataType="TIME",
    ),
 )

--- a/metadata-ingestion/examples/library/create_mlfeature_table.py
+++ b/metadata-ingestion/examples/library/create_mlfeature_table.py
@ -7,18 +7,31 @@ from datahub.emitter.rest_emitter import DatahubRestEmitter
 emitter = DatahubRestEmitter(gms_server="http://localhost:8080", extra_headers={})

 feature_table_urn = builder.make_ml_feature_table_urn(
-    feature_table_name="my-feature-table", platform="feast"
+    feature_table_name="users_feature_table", platform="feast"
 )
+
 feature_urns = [
    builder.make_ml_feature_urn(
-        feature_name="my-feature", feature_table_name="my-feature-table"
+        feature_name="user_signup_date", feature_table_name="users_feature_table"
    ),
    builder.make_ml_feature_urn(
-        feature_name="my-feature2", feature_table_name="my-feature-table"
+        feature_name="user_last_active_date", feature_table_name="users_feature_table"
    ),
 ]
+
+primary_key_urns = [
+    builder.make_ml_primary_key_urn(
+        feature_table_name="users_feature_table",
+        primary_key_name="user_id",
+    )
+]
+
 feature_table_properties = models.MLFeatureTablePropertiesClass(
-    description="Test description", mlFeatures=feature_urns
+    description="Test description",
+    # link your features to a feature table
+    mlFeatures=feature_urns,
+    # link your primary keys to the feature table
+    mlPrimaryKeys=primary_key_urns,
 )

 # MCP creation
--- a/metadata-ingestion/examples/library/create_mlmodel.py
+++ b/metadata-ingestion/examples/library/create_mlmodel.py
@ -6,19 +6,19 @@ from datahub.emitter.rest_emitter import DatahubRestEmitter
 # Create an emitter to DataHub over REST
 emitter = DatahubRestEmitter(gms_server="http://localhost:8080", extra_headers={})
 model_urn = builder.make_ml_model_urn(
-    model_name="my-test-model", platform="science", env="PROD"
+    model_name="my-recommendations-model-run-1", platform="science", env="PROD"
 )
 model_group_urns = [
    builder.make_ml_model_group_urn(
-        group_name="my-model-group", platform="science", env="PROD"
+        group_name="my-recommendations-model-group", platform="science", env="PROD"
    )
 ]
 feature_urns = [
    builder.make_ml_feature_urn(
-        feature_name="my-feature", feature_table_name="my-feature-table"
+        feature_name="user_signup_date", feature_table_name="users_feature_table"
    ),
    builder.make_ml_feature_urn(
-        feature_name="my-feature2", feature_table_name="my-feature-table"
+        feature_name="user_last_active_date", feature_table_name="users_feature_table"
    ),
 ]

--- a/metadata-ingestion/examples/library/create_mlmodel_group.py
+++ b/metadata-ingestion/examples/library/create_mlmodel_group.py
@ -6,7 +6,7 @@ from datahub.emitter.rest_emitter import DatahubRestEmitter
 # Create an emitter to DataHub over REST
 emitter = DatahubRestEmitter(gms_server="http://localhost:8080", extra_headers={})
 model_group_urn = builder.make_ml_model_group_urn(
-    group_name="my-model-group", platform="science", env="PROD"
+    group_name="my-recommendations-model-group", platform="science", env="PROD"
 )


@ -16,7 +16,7 @@ metadata_change_proposal = MetadataChangeProposalWrapper(
    entityUrn=model_group_urn,
    aspectName="mlModelGroupProperties",
    aspect=models.MLModelGroupPropertiesClass(
-        description="my model group",
+        description="Grouping of ml model training runs related to home page recommendations.",
    ),
 )

--- a/metadata-ingestion/examples/library/create_mlprimarykey.py
+++ b/metadata-ingestion/examples/library/create_mlprimarykey.py
@ -0,0 +1,34 @@
+import datahub.emitter.mce_builder as builder
+import datahub.metadata.schema_classes as models
+from datahub.emitter.mcp import MetadataChangeProposalWrapper
+from datahub.emitter.rest_emitter import DatahubRestEmitter
+
+# Create an emitter to DataHub over REST
+emitter = DatahubRestEmitter(gms_server="http://localhost:8080", extra_headers={})
+
+dataset_urn = builder.make_dataset_urn(
+    name="fct_users_created", platform="hive", env="PROD"
+)
+primary_key_urn = builder.make_ml_primary_key_urn(
+    feature_table_name="users_feature_table",
+    primary_key_name="user_id",
+)
+
+# Create primary key
+metadata_change_proposal = MetadataChangeProposalWrapper(
+    entityType="mlPrimaryKey",
+    changeType=models.ChangeTypeClass.UPSERT,
+    entityUrn=primary_key_urn,
+    aspectName="mlPrimaryKeyProperties",
+    aspect=models.MLPrimaryKeyPropertiesClass(
+        description="Represents the id of the user the other features relate to.",
+        # attaching a source to an ml primary key creates lineage between the primary key
+        # and the upstream dataset. This is how lineage between your data warehouse
+        # and machine learning ecosystem is established.
+        sources=[dataset_urn],
+        dataType="TEXT",
+    ),
+)
+
+# Emit metadata!
+emitter.emit(metadata_change_proposal)
--- a/metadata-ingestion/examples/library/read_mlprimarykey.py
+++ b/metadata-ingestion/examples/library/read_mlprimarykey.py
@ -0,0 +1,12 @@
+from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
+
+# Imports for metadata model classes
+from datahub.metadata.schema_classes import MLPrimaryKeyPropertiesClass
+
+# Connect to the DataHub GMS endpoint
+gms_endpoint = "http://localhost:8080"
+graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint))
+
+urn = "urn:li:mlPrimaryKey:(user_features,user_id)"
+result = graph.get_aspect(entity_urn=urn, aspect_type=MLPrimaryKeyPropertiesClass)
+print(result)
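
One thing worth checking when running the create and read examples together: the create script builds its key from `users_feature_table` and `user_id`, while the read script looks up `urn:li:mlPrimaryKey:(user_features,user_id)`. The sketch below uses a hypothetical stand-in, `ml_primary_key_urn`, that formats a URN in the shape shown in the read example. It assumes `builder.make_ml_primary_key_urn` produces the same shape, which is not verified here.

```python
def ml_primary_key_urn(feature_table_name: str, primary_key_name: str) -> str:
    """Hypothetical stand-in: format a URN in the shape used by the read example."""
    return f"urn:li:mlPrimaryKey:({feature_table_name},{primary_key_name})"


# Names taken from create_mlprimarykey.py.
created_urn = ml_primary_key_urn("users_feature_table", "user_id")

# URN hard-coded in read_mlprimarykey.py.
read_urn = "urn:li:mlPrimaryKey:(user_features,user_id)"

print(created_urn)
print(created_urn == read_urn)
```

If the assumption holds, the two URNs differ, so the read example would return the `user_features` entity rather than the key emitted by the create example. To read back the newly created key, pass the URN built from the create script's names.
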