
Dataset Entity

The Dataset entity represents a collection of data with a common schema (tables, views, files, topics, etc.). This guide covers creating, describing, and updating datasets with SDK V2.

Creating a Dataset

Minimal Dataset

Only platform and name are required:

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_database.my_schema.my_table")
    .build();

With Environment

Specify environment (PROD, DEV, STAGING, etc.):

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .env("PROD")
    .build();
// URN: urn:li:dataset:(urn:li:dataPlatform:snowflake,my_table,PROD)

With Metadata

Add description and display name at construction:

Dataset dataset = Dataset.builder()
    .platform("bigquery")
    .name("project.dataset.table")
    .env("PROD")
    .description("User transactions table")
    .displayName("User Transactions")
    .build();

With Custom Properties

Pass custom properties to the builder:

Map<String, String> props = new HashMap<>();
props.put("team", "data-engineering");
props.put("retention", "90_days");

Dataset dataset = Dataset.builder()
    .platform("postgres")
    .name("public.users")
    .customProperties(props)
    .build();

With Platform Instance

For multi-instance platforms:

Dataset dataset = Dataset.builder()
    .platform("kafka")
    .name("user-events")
    .platformInstance("kafka-prod-cluster")
    .build();

URN Construction

Dataset URNs follow the pattern:

urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})

Automatic URN creation:

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("analytics.public.events")
    .env("PROD")
    .build();

DatasetUrn urn = dataset.getDatasetUrn();
// urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.events,PROD)
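
The URN pattern above can be reproduced with plain string handling, which is sometimes handy for logging or for checking URNs in tests. The following is a minimal sketch, not the SDK's implementation: DatasetUrnSketch and its format and parse methods are hypothetical names introduced here for illustration only.

```java
// Illustrative only: mirrors the documented URN pattern with plain strings.
// DatasetUrnSketch is a hypothetical helper, not an SDK class.
final class DatasetUrnSketch {

    private static final String PREFIX = "urn:li:dataset:(urn:li:dataPlatform:";

    // Builds urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env}).
    static String format(String platform, String name, String env) {
        return PREFIX + platform + "," + name + "," + env + ")";
    }

    // Splits a dataset URN back into { platform, name, env }.
    static String[] parse(String urn) {
        if (!urn.startsWith(PREFIX) || !urn.endsWith(")")) {
            throw new IllegalArgumentException("Not a dataset URN: " + urn);
        }
        String body = urn.substring(PREFIX.length(), urn.length() - 1);
        // The platform ends at the first comma and the env starts after the last one.
        int firstComma = body.indexOf(',');
        int lastComma = body.lastIndexOf(',');
        if (firstComma < 0 || firstComma == lastComma) {
            throw new IllegalArgumentException("Expected platform, name and env: " + urn);
        }
        return new String[] {
            body.substring(0, firstComma),
            body.substring(firstComma + 1, lastComma),
            body.substring(lastComma + 1)
        };
    }

    public static void main(String[] args) {
        String urn = format("snowflake", "analytics.public.events", "PROD");
        System.out.println(urn);
        System.out.println(String.join(" | ", parse(urn)));
    }
}
```

In application code, prefer the URN returned by getDatasetUrn(); the sketch only shows how the three components map onto the string form.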

Description Operations

Mode-Aware Description

The setDescription() method routes to different aspects based on mode:

// In SDK mode (the default), this call writes to editableDatasetProperties
dataset.setDescription("User-provided description");

// In INGESTION mode, the same call writes to datasetProperties
dataset.setDescription("Ingested from Snowflake");
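
The routing rule can be pictured as a small lookup from mode to aspect name. This is a conceptual sketch only: the Mode enum and targetAspect method below are hypothetical and do not claim to match how the SDK represents or selects its mode.

```java
// Conceptual model of mode-aware routing. The names here are hypothetical
// and are not taken from the SDK.
final class ModeRoutingSketch {

    enum Mode { SDK, INGESTION }

    // Returns the aspect that a mode-aware setter would write to.
    static String targetAspect(Mode mode) {
        switch (mode) {
            case INGESTION:
                return "datasetProperties";
            case SDK:
            default:
                return "editableDatasetProperties";
        }
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> " + targetAspect(mode));
        }
    }
}
```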

Explicit Aspect Targeting

To write to a specific aspect regardless of mode, use the explicit setters:

// System description (datasetProperties)
dataset.setSystemDescription("Generated by ETL pipeline");

// Editable description (editableDatasetProperties)
dataset.setEditableDescription("User override description");

Reading Description

Get description (prefers editable over system):

String description = dataset.getDescription();
// Returns editableDatasetProperties.description if set
// Otherwise returns datasetProperties.description
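
The read preference described above amounts to a simple fallback. A minimal sketch, assuming both values are nullable strings; the class and method names are hypothetical and the SDK's actual lookup may differ in detail:

```java
// Sketch of the "prefer editable over system" read rule.
// DescriptionResolutionSketch is a hypothetical helper, not an SDK class.
final class DescriptionResolutionSketch {

    // Returns the editable description when set, otherwise the system one.
    // The result is null when neither aspect carries a description.
    static String resolve(String editableDescription, String systemDescription) {
        return editableDescription != null ? editableDescription : systemDescription;
    }

    public static void main(String[] args) {
        System.out.println(resolve("User override description", "Generated by ETL pipeline"));
        System.out.println(resolve(null, "Generated by ETL pipeline"));
    }
}
```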

Display Name Operations

Like descriptions, display names are mode-aware:

// Mode-aware (SDK → editable, INGESTION → system)
dataset.setDisplayName("User Events");

// Explicit aspect targeting
dataset.setSystemDisplayName("user_events_table");
dataset.setEditableDisplayName("User Events Table");

// Read display name (prefers editable)
String name = dataset.getDisplayName();

Tags

Adding Tags

// Simple tag name (auto-prefixed)
dataset.addTag("pii");
// Creates: urn:li:tag:pii

// Full tag URN
dataset.addTag("urn:li:tag:analytics");
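
The auto-prefixing behaviour can be sketched as a one-line normalization: bare names gain the urn:li:tag: prefix, and full URNs pass through unchanged. The helper below is hypothetical and only illustrates the rule; it is not the SDK's own code.

```java
// Sketch of tag-name normalization. TagUrnSketch is a hypothetical helper.
final class TagUrnSketch {

    private static final String TAG_PREFIX = "urn:li:tag:";

    // Accepts either a bare tag name or a full tag URN and returns a tag URN.
    static String toTagUrn(String tagOrUrn) {
        return tagOrUrn.startsWith(TAG_PREFIX) ? tagOrUrn : TAG_PREFIX + tagOrUrn;
    }

    public static void main(String[] args) {
        System.out.println(toTagUrn("pii"));
        System.out.println(toTagUrn("urn:li:tag:analytics"));
    }
}
```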

Removing Tags

dataset.removeTag("pii");
dataset.removeTag("urn:li:tag:analytics");

Tag Chaining

dataset.addTag("pii")
       .addTag("sensitive")
       .addTag("gdpr");

Owners

Adding Owners

import com.linkedin.common.OwnershipType;

// Technical owner
dataset.addOwner(
    "urn:li:corpuser:john_doe",
    OwnershipType.TECHNICAL_OWNER
);

// Data steward
dataset.addOwner(
    "urn:li:corpuser:jane_smith",
    OwnershipType.DATA_STEWARD
);

// Business owner
dataset.addOwner(
    "urn:li:corpuser:alice",
    OwnershipType.BUSINESS_OWNER
);

Removing Owners

dataset.removeOwner("urn:li:corpuser:john_doe");

Owner Types

Available ownership types:

  • TECHNICAL_OWNER - Maintains the technical implementation
  • BUSINESS_OWNER - Business stakeholder
  • DATA_STEWARD - Manages data quality and compliance
  • DATAOWNER - Generic data owner
  • DEVELOPER - Software developer
  • PRODUCER - Data producer
  • CONSUMER - Data consumer
  • STAKEHOLDER - Other stakeholder

Glossary Terms

Adding Terms

dataset.addTerm("urn:li:glossaryTerm:CustomerData");
dataset.addTerm("urn:li:glossaryTerm:Classification.Confidential");

Removing Terms

dataset.removeTerm("urn:li:glossaryTerm:CustomerData");

Term Chaining

dataset.addTerm("urn:li:glossaryTerm:CustomerData")
       .addTerm("urn:li:glossaryTerm:PII")
       .addTerm("urn:li:glossaryTerm:GDPR");

Domain

Setting Domain

dataset.setDomain("urn:li:domain:Marketing");

Removing Domain

// Remove a specific domain
dataset.removeDomain("urn:li:domain:Marketing");

// Or clear all domains
dataset.clearDomains();

Custom Properties

Adding Individual Properties

dataset.addCustomProperty("team", "data-engineering");
dataset.addCustomProperty("retention_days", "90");
dataset.addCustomProperty("cost_center", "12345");

Setting All Properties

Replace all custom properties:

Map<String, String> properties = new HashMap<>();
properties.put("team", "data-engineering");
properties.put("retention", "90_days");
properties.put("classification", "internal");

dataset.setCustomProperties(properties);
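
The difference between adding and replacing is easiest to see on a plain map: adding keeps the existing keys, while replacing discards any key that is not in the new map. The class below is a hypothetical stand-in written for illustration; it models the semantics described in this section and is not the SDK's Dataset class.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in that models add vs. replace semantics on a map.
final class CustomPropertiesSketch {

    private Map<String, String> properties = new HashMap<>();

    // Adds or overwrites a single key, leaving other keys untouched.
    CustomPropertiesSketch add(String key, String value) {
        properties.put(key, value);
        return this;
    }

    // Replaces the whole map; keys absent from the replacement are dropped.
    CustomPropertiesSketch setAll(Map<String, String> replacement) {
        properties = new HashMap<>(replacement);
        return this;
    }

    // Returns a read-only copy of the current properties.
    Map<String, String> snapshot() {
        return Collections.unmodifiableMap(new HashMap<>(properties));
    }

    public static void main(String[] args) {
        CustomPropertiesSketch props = new CustomPropertiesSketch()
            .add("team", "data-engineering")
            .add("cost_center", "12345");

        Map<String, String> replacement = new HashMap<>();
        replacement.put("classification", "internal");
        props.setAll(replacement);

        // "team" and "cost_center" are gone after the replacement.
        System.out.println(props.snapshot());
    }
}
```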

Removing Properties

dataset.removeCustomProperty("retention_days");

Schema

Setting Schema Metadata

import com.linkedin.schema.*;

SchemaMetadata schema = new SchemaMetadata();
// Configure schema...
dataset.setSchema(schema);

Setting Schema Fields

import com.linkedin.schema.*;
import java.util.ArrayList;
import java.util.List;

List<SchemaField> fields = new ArrayList<>();

// String field
SchemaField userIdField = new SchemaField();
userIdField.setFieldPath("user_id");
userIdField.setNativeDataType("VARCHAR(255)");
userIdField.setType(
    new SchemaFieldDataType().setType(SchemaFieldDataType.Type.create(new StringType())));
fields.add(userIdField);

// Numeric field
SchemaField amountField = new SchemaField();
amountField.setFieldPath("amount");
amountField.setNativeDataType("DECIMAL(10,2)");
amountField.setType(
    new SchemaFieldDataType().setType(SchemaFieldDataType.Type.create(new NumberType())));
fields.add(amountField);

dataset.setSchemaFields(fields);

Complete Example

import datahub.client.v2.DataHubClientV2;
import datahub.client.v2.entity.Dataset;
import com.linkedin.common.OwnershipType;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

public class DatasetExample {
    public static void main(String[] args) {
        // Create client
        DataHubClientV2 client = DataHubClientV2.builder()
            .server("http://localhost:8080")
            .build();

        try {
            // Build dataset with all metadata
            Dataset dataset = Dataset.builder()
                .platform("snowflake")
                .name("analytics.public.user_events")
                .env("PROD")
                .description("User interaction events from web and mobile")
                .displayName("User Events")
                .build();

            // Add tags
            dataset.addTag("pii")
                   .addTag("analytics")
                   .addTag("gdpr");

            // Add owners
            dataset.addOwner("urn:li:corpuser:data_team", OwnershipType.TECHNICAL_OWNER)
                   .addOwner("urn:li:corpuser:product_team", OwnershipType.BUSINESS_OWNER);

            // Add glossary terms
            dataset.addTerm("urn:li:glossaryTerm:CustomerData")
                   .addTerm("urn:li:glossaryTerm:EventData");

            // Set domain
            dataset.setDomain("urn:li:domain:Analytics");

            // Add custom properties
            dataset.addCustomProperty("team", "data-engineering")
                   .addCustomProperty("retention_days", "365")
                   .addCustomProperty("refresh_schedule", "daily");

            // Upsert to DataHub
            client.entities().upsert(dataset);

            System.out.println("Successfully created dataset: " + dataset.getUrn());

        } catch (IOException | ExecutionException | InterruptedException e) {
            e.printStackTrace();
        } finally {
            try {
                client.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

Updating Existing Datasets

Load and Modify

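import com.linkedin.common.FabricType;
import com.linkedin.common.urn.DataPlatformUrn;
import com.linkedin.common.urn.DatasetUrn;
// Load existing dataset
DatasetUrn urn = new DatasetUrn(new DataPlatformUrn("snowflake"), "my_table", FabricType.PROD);
Dataset dataset = client.entities().get(urn);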

// Add new metadata (creates patches)
dataset.addTag("new-tag")
       .addOwner("urn:li:corpuser:new_owner", OwnershipType.TECHNICAL_OWNER);

// Apply patches
client.entities().update(dataset);

Incremental Updates

// Just add what you need
dataset.addTag("sensitive");
client.entities().update(dataset);

// Later, add more
dataset.addCustomProperty("updated_at", String.valueOf(System.currentTimeMillis()));
client.entities().update(dataset);

Builder Options Reference

Method                    Required  Description
platform(String)          Yes       Data platform (e.g., "snowflake", "bigquery")
name(String)              Yes       Fully qualified dataset name
env(String)               No        Environment (PROD, DEV, etc.). Default: PROD
platformInstance(String)  No        Platform instance identifier
description(String)       No        Dataset description
displayName(String)       No        Display name shown in UI
customProperties(Map)     No        Map of custom key-value properties

Mode-Aware vs Explicit Methods

Operation     Mode-Aware Method  SDK Mode Aspect            INGESTION Mode Aspect
Description   setDescription()   editableDatasetProperties  datasetProperties
Display Name  setDisplayName()   editableDatasetProperties  datasetProperties

Explicit methods (always available):

  • setSystemDescription() / setEditableDescription()
  • setSystemDisplayName() / setEditableDisplayName()

Common Patterns

Creating Multiple Datasets

for (String tableName : tableNames) {
    Dataset dataset = Dataset.builder()
        .platform("postgres")
        .name("public." + tableName)
        .env("PROD")
        .build();

    dataset.addTag("auto-generated")
           .addCustomProperty("created_by", "sync_job");

    client.entities().upsert(dataset);
}

Batch Metadata Addition

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

List<String> tags = Arrays.asList("pii", "sensitive", "gdpr");
tags.forEach(dataset::addTag);

client.entities().upsert(dataset);  // Emits all tags in one call

Conditional Metadata

if (isPII(dataset)) {
    dataset.addTag("pii")
           .addTerm("urn:li:glossaryTerm:PersonalData");
}

if (requiresGovernance(dataset)) {
    dataset.addOwner("urn:li:corpuser:governance_team", OwnershipType.DATA_STEWARD);
}
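
The snippet above leaves isPII and requiresGovernance undefined; they stand for whatever policy checks your project applies. As one possible shape for such a check, the sketch below flags a dataset when any field path contains a marker substring. The class name, the marker list, and the choice to inspect field paths are all assumptions made for illustration, not SDK behaviour.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical policy check: flags likely PII based on field-path names.
final class PiiHeuristicSketch {

    // Assumed marker list; adjust to your own naming conventions.
    private static final List<String> MARKERS =
        Arrays.asList("email", "phone", "ssn", "address");

    // Returns true when any field path contains one of the markers.
    static boolean looksLikePii(List<String> fieldPaths) {
        for (String path : fieldPaths) {
            String lower = path.toLowerCase(Locale.ROOT);
            for (String marker : MARKERS) {
                if (lower.contains(marker)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksLikePii(Arrays.asList("user_id", "Email_Address")));
        System.out.println(looksLikePii(Arrays.asList("amount", "created_at")));
    }
}
```

A name-based heuristic like this produces both false positives and false negatives, so treat its result as a prompt for review rather than a final classification.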

Examples

Basic Dataset Creation

{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetCreateExample.java show_path_as_comment }}

Dataset Patch Operations

{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetPatchExample.java show_path_as_comment }}

Comprehensive Dataset Example

{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetFullExample.java show_path_as_comment }}