Hyejin Yoon f986315582
doc: Acryl to DataHub, datahubproject.io to datahub.com (#13252)
Co-authored-by: Jay <159848059+jayacryl@users.noreply.github.com>
2025-04-28 10:34:33 -04:00


Ingesting metadata from Vertex AI requires the Vertex AI module.

Prerequisites

Please refer to the Vertex AI documentation for basic information on Vertex AI.

Credentials to access GCP

Please read the GCP docs to understand how to set up Application Default Credentials.

Permissions
  • Grant the following permissions to the service account on every project from which you would like to extract metadata.

The default GCP role that contains these permissions is roles/aiplatform.viewer.

Permission | Description
aiplatform.models.list | Allows a user to view and list all ML models in a project
aiplatform.models.get | Allows a user to view details of a specific ML model
aiplatform.endpoints.list | Allows a user to view and list all prediction endpoints in a project
aiplatform.endpoints.get | Allows a user to view details of a specific prediction endpoint
aiplatform.trainingPipelines.list | Allows a user to view and list all training pipelines in a project
aiplatform.trainingPipelines.get | Allows a user to view details of a specific training pipeline
aiplatform.customJobs.list | Allows a user to view and list all custom jobs in a project
aiplatform.customJobs.get | Allows a user to view details of a specific custom job
aiplatform.experiments.list | Allows a user to view and list all experiments in a project
aiplatform.experiments.get | Allows a user to view details of a specific experiment in a project
aiplatform.metadataStores.list | Allows a user to view and list all metadata stores in a project
aiplatform.metadataStores.get | Allows a user to view details of a specific metadata store
aiplatform.executions.list | Allows a user to view and list all executions in a project
aiplatform.executions.get | Allows a user to view details of a specific execution
aiplatform.datasets.list | Allows a user to view and list all datasets in a project
aiplatform.datasets.get | Allows a user to view details of a specific dataset
aiplatform.pipelineJobs.list | Allows a user to view and list all pipeline jobs in a project
aiplatform.pipelineJobs.get | Allows a user to view details of a specific pipeline job
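As a quick sanity check before running ingestion, the permission list above can be compared against the permissions a service account is known to hold. The sketch below is plain Python and makes no GCP calls. `missing_permissions` is a hypothetical helper, and the `granted` set is assumed to come from your own IAM tooling.

```python
# Permissions listed in the table above: list and get for each resource type.
REQUIRED_PERMISSIONS = frozenset(
    f"aiplatform.{resource}.{verb}"
    for resource in (
        "models", "endpoints", "trainingPipelines", "customJobs",
        "experiments", "metadataStores", "executions", "datasets",
        "pipelineJobs",
    )
    for verb in ("list", "get")
)


def missing_permissions(granted):
    """Return the required permissions absent from `granted`, sorted."""
    return sorted(REQUIRED_PERMISSIONS - set(granted))


if __name__ == "__main__":
    # Hypothetical input: a service account that only has model access.
    granted = {"aiplatform.models.list", "aiplatform.models.get"}
    for permission in missing_permissions(granted):
        print("missing:", permission)
```

An empty result means every permission in the table is covered, which should be the case for an account holding roles/aiplatform.viewer.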

Create a service account and assign roles

  1. Set up a service account as per the GCP docs and assign it a role that contains the permissions above (for example, roles/aiplatform.viewer).

  2. Download a service account JSON keyfile.

    • Example credential file:
    {
      "type": "service_account",
      "project_id": "project-id-1234567",
      "private_key_id": "d0121d0000882411234e11166c6aaa23ed5d74e0",
      "private_key": "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----",
      "client_email": "test@suppproject-id-1234567.iam.gserviceaccount.com",
      "client_id": "113545814931671546333",
      "auth_uri": "https://accounts.google.com/o/oauth2/auth",
      "token_uri": "https://oauth2.googleapis.com/token",
      "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
      "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/test%40suppproject-id-1234567.iam.gserviceaccount.com"
    }
    
  3. To provide credentials to the source, you can either:

  • Set an environment variable:

    $ export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json"
    

    or

  • Set the credential config in your source based on the credential JSON file. For example:

    credential:
      private_key_id: "d0121d0000882411234e11166c6aaa23ed5d74e0"
      private_key: "-----BEGIN PRIVATE KEY-----\nMIIyourkey\n-----END PRIVATE KEY-----\n"
      client_email: "test@suppproject-id-1234567.iam.gserviceaccount.com"
      client_id: "123456678890"
    
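The credential config above reuses four fields from the service account keyfile. A minimal Python sketch of that copy step follows, assuming the keyfile has the same layout as the example credential file. `credential_from_keyfile` is a hypothetical helper name and is not part of DataHub.

```python
import json

# Fields the credential config example takes from the service account keyfile.
CREDENTIAL_FIELDS = ("private_key_id", "private_key", "client_email", "client_id")


def credential_from_keyfile(path):
    """Read a service account JSON keyfile and return only the credential fields."""
    with open(path, encoding="utf-8") as handle:
        keyfile = json.load(handle)
    missing = [name for name in CREDENTIAL_FIELDS if name not in keyfile]
    if missing:
        raise KeyError(f"keyfile is missing fields: {', '.join(missing)}")
    return {name: keyfile[name] for name in CREDENTIAL_FIELDS}
```

The returned dictionary has the same keys as the credential example. Since it contains the private key, avoid printing or logging it.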

Integration Details

The ingestion job extracts Models, Datasets, Training Jobs, Endpoints, Experiments, and Experiment Runs from a given project and region on Vertex AI.
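Because ingestion is scoped to one project and one region, a recipe for this source could be assembled in Python roughly as below. This is a sketch under assumptions: the source type name `vertexai`, the `project_id` and `region` config keys, the `datahub-rest` sink, and the server URL are not stated in this section and should be checked against the source's config reference. `build_recipe` is a hypothetical helper.

```python
import json


def build_recipe(project_id, region, credential=None):
    """Build an ingestion recipe dict for one project and region."""
    config = {"project_id": project_id, "region": region}  # assumed key names
    if credential is not None:
        # Same shape as the credential config shown earlier.
        config["credential"] = credential
    return {
        "source": {"type": "vertexai", "config": config},  # assumed type name
        "sink": {
            "type": "datahub-rest",  # assumed sink
            "config": {"server": "http://localhost:8080"},  # placeholder URL
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_recipe("project-id-1234567", "us-central1"), indent=2))
```

When `credential` is omitted, the sketch assumes credentials are supplied through the GOOGLE_APPLICATION_CREDENTIALS environment variable instead.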

Concept Mapping

This ingestion source maps the following Vertex AI Concepts to DataHub Concepts:

Source Concept | DataHub Concept | Notes
Model | MlModelGroup | The name of a Model Group is the same as the Model's name. Models serve as containers for multiple versions of the same model in Vertex AI.
Model Version | MlModel | The name of a Model is {model_name}_{model_version} (e.g. my_vertexai_model_1) for a model registered to Model Registry or deployed to an Endpoint. Each Model Version represents a specific iteration of a model with its own metadata.
Dataset | Dataset | A Managed Dataset resource in Vertex AI is mapped to a Dataset in DataHub. Supported dataset types are Text, Tabular, Image, Video, and TimeSeries.
Training Job | DataProcessInstance | A Training Job is mapped to a DataProcessInstance in DataHub. Supported training job types are AutoMLTextTrainingJob, AutoMLTabularTrainingJob, AutoMLImageTrainingJob, AutoMLVideoTrainingJob, AutoMLForecastingTrainingJob, Custom Job, Custom TrainingJob, Custom Container TrainingJob, and Custom Python Packaging Job.
Experiment | Container | Experiments organize related runs and serve as logical groupings for model development iterations. Each Experiment is mapped to a Container in DataHub.
Experiment Run | DataProcessInstance | An Experiment Run represents a single execution of an ML workflow. An Experiment Run tracks ML parameters, metrics, artifacts, and metadata.
Execution | DataProcessInstance | Metadata Execution resource for Vertex AI. A Metadata Execution is started in an experiment run and captures input and output artifacts.
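The mapping above, together with the {model_name}_{model_version} naming pattern, can be written down as a small Python lookup. This is an illustrative sketch only: `CONCEPT_MAPPING` and `model_version_name` are hypothetical names, not identifiers from the ingestion source.

```python
# Vertex AI concept -> DataHub concept, as listed in the mapping above.
CONCEPT_MAPPING = {
    "Model": "MlModelGroup",
    "Model Version": "MlModel",
    "Dataset": "Dataset",
    "Training Job": "DataProcessInstance",
    "Experiment": "Container",
    "Experiment Run": "DataProcessInstance",
    "Execution": "DataProcessInstance",
}


def model_version_name(model_name, model_version):
    """Apply the {model_name}_{model_version} naming pattern."""
    return f"{model_name}_{model_version}"


if __name__ == "__main__":
    print(model_version_name("my_vertexai_model", 1))  # my_vertexai_model_1
```

Note that three different Vertex AI concepts map to DataProcessInstance, so the DataHub concept alone does not identify the original Vertex AI resource type.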

Vertex AI Concept Diagram:

Lineage

Lineage is emitted using the Vertex AI API to capture the following relationships:

  • A training job and a model (which training job produced a model)
  • A dataset and a training job (which dataset was consumed by a training job to train a model)
  • Experiment runs and an experiment
  • Metadata execution and an experiment run
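To make the first two relationships concrete, the sketch below models lineage as directed edges and walks them backwards from a model to the dataset it was trained on. It is a toy illustration in plain Python, not how the source emits lineage. All node names are made up, and `upstreams` is a hypothetical helper.

```python
from collections import defaultdict

# Each edge reads "upstream feeds downstream"; all names are made up.
EDGES = [
    ("dataset:churn_table", "training_job:automl_churn"),
    ("training_job:automl_churn", "model:churn_model_1"),
]


def upstreams(node, edges):
    """Return every node reachable by walking edges backwards from `node`."""
    parents = defaultdict(set)
    for source, target in edges:
        parents[target].add(source)
    seen, stack = set(), [node]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

Calling `upstreams("model:churn_model_1", EDGES)` returns both the training job and the dataset, which is the question lineage is meant to answer: what a given model was built from.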