Dataset Entity
The Dataset entity represents a collection of data with a common schema, such as a table, view, file, or topic. This guide covers dataset operations in SDK V2.
Creating a Dataset
Minimal Dataset
Only platform and name are required:
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_database.my_schema.my_table")
.build();
With Environment
Specify the environment (PROD, DEV, STG, etc.):
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_table")
.env("PROD")
.build();
// URN: urn:li:dataset:(urn:li:dataPlatform:snowflake,my_table,PROD)
With Metadata
Add a description and display name at construction:
Dataset dataset = Dataset.builder()
.platform("bigquery")
.name("project.dataset.table")
.env("PROD")
.description("User transactions table")
.displayName("User Transactions")
.build();
With Custom Properties
Include custom properties in the builder:
Map<String, String> props = new HashMap<>();
props.put("team", "data-engineering");
props.put("retention", "90_days");
Dataset dataset = Dataset.builder()
.platform("postgres")
.name("public.users")
.customProperties(props)
.build();
With Platform Instance
For multi-instance platforms:
Dataset dataset = Dataset.builder()
.platform("kafka")
.name("user-events")
.platformInstance("kafka-prod-cluster")
.build();
URN Construction
Dataset URNs follow the pattern:
urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})
Automatic URN creation:
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("analytics.public.events")
.env("PROD")
.build();
DatasetUrn urn = dataset.getDatasetUrn();
// urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.events,PROD)
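The URN pattern above can be reproduced with plain string formatting, which is handy when checking what a given platform, name, and environment should produce. This is a minimal sketch, not the SDK's implementation: `DatasetUrnSketch` and `format` are hypothetical names introduced only for illustration.

```java
/** Hypothetical helper illustrating the dataset URN pattern; not part of the SDK. */
class DatasetUrnSketch {

    /** Fills the {platform}, {name}, and {env} slots of the dataset URN pattern. */
    static String format(String platform, String name, String env) {
        return String.format(
            "urn:li:dataset:(urn:li:dataPlatform:%s,%s,%s)", platform, name, env);
    }

    public static void main(String[] args) {
        System.out.println(format("snowflake", "analytics.public.events", "PROD"));
    }
}
```

In real code, prefer `dataset.getDatasetUrn()` so the SDK remains the single source of truth for URN construction.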
Description Operations
Mode-Aware Description
The setDescription() method routes to different aspects based on mode:
// SDK mode (default) - writes to editableDatasetProperties
dataset.setDescription("User-provided description");
// INGESTION mode - writes to datasetProperties
dataset.setDescription("Ingested from Snowflake");
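The routing rule can be pictured as a lookup from mode to target aspect. The sketch below only models that rule; the `Mode` enum and `targetAspect` method are hypothetical names, and how the mode is actually configured on the client is not shown here.

```java
/** Hypothetical model of mode-aware routing; names are illustrative, not SDK API. */
class ModeRoutingSketch {

    enum Mode { SDK, INGESTION }

    /** Returns the aspect a mode-aware setter would write to in the given mode. */
    static String targetAspect(Mode mode) {
        // INGESTION writes the system aspect; SDK (the default) writes the editable one.
        return mode == Mode.INGESTION
            ? "datasetProperties"
            : "editableDatasetProperties";
    }
}
```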
Explicit Aspect Targeting
Control which aspect to write:
// System description (datasetProperties)
dataset.setSystemDescription("Generated by ETL pipeline");
// Editable description (editableDatasetProperties)
dataset.setEditableDescription("User override description");
Reading Description
Get description (prefers editable over system):
String description = dataset.getDescription();
// Returns editableDatasetProperties.description if set
// Otherwise returns datasetProperties.description
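The read precedence amounts to a simple fallback. A minimal sketch, assuming an unset description is represented as `null`; `DescriptionFallbackSketch` and `resolve` are hypothetical names, not SDK methods.

```java
/** Hypothetical illustration of the editable-over-system read precedence. */
class DescriptionFallbackSketch {

    /** Returns the editable description when present, otherwise the system one. */
    static String resolve(String editableDescription, String systemDescription) {
        // Assumption: "not set" is modeled as null.
        return editableDescription != null ? editableDescription : systemDescription;
    }
}
```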
Display Name Operations
Similar to description, display names are mode-aware:
// Mode-aware (SDK → editable, INGESTION → system)
dataset.setDisplayName("User Events");
// Explicit aspect targeting
dataset.setSystemDisplayName("user_events_table");
dataset.setEditableDisplayName("User Events Table");
// Read display name (prefers editable)
String name = dataset.getDisplayName();
Tags
Adding Tags
// Simple tag name (auto-prefixed)
dataset.addTag("pii");
// Creates: urn:li:tag:pii
// Full tag URN
dataset.addTag("urn:li:tag:analytics");
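The auto-prefixing shown above can be sketched as a small normalization step: a bare name gets the `urn:li:tag:` prefix, while a full URN passes through unchanged. `TagUrnSketch` and `toTagUrn` are hypothetical names used for illustration; the SDK's own handling may differ in edge cases.

```java
/** Hypothetical illustration of tag-name normalization; not SDK code. */
class TagUrnSketch {

    private static final String TAG_PREFIX = "urn:li:tag:";

    /** Prefixes bare tag names and leaves full tag URNs untouched. */
    static String toTagUrn(String tag) {
        return tag.startsWith(TAG_PREFIX) ? tag : TAG_PREFIX + tag;
    }

    public static void main(String[] args) {
        System.out.println(toTagUrn("pii"));
        System.out.println(toTagUrn("urn:li:tag:analytics"));
    }
}
```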
Removing Tags
dataset.removeTag("pii");
dataset.removeTag("urn:li:tag:analytics");
Tag Chaining
dataset.addTag("pii")
.addTag("sensitive")
.addTag("gdpr");
Owners
Adding Owners
import com.linkedin.common.OwnershipType;
// Technical owner
dataset.addOwner(
"urn:li:corpuser:john_doe",
OwnershipType.TECHNICAL_OWNER
);
// Data steward
dataset.addOwner(
"urn:li:corpuser:jane_smith",
OwnershipType.DATA_STEWARD
);
// Business owner
dataset.addOwner(
"urn:li:corpuser:alice",
OwnershipType.BUSINESS_OWNER
);
Removing Owners
dataset.removeOwner("urn:li:corpuser:john_doe");
Owner Types
Available ownership types:
- TECHNICAL_OWNER - Maintains the technical implementation
- BUSINESS_OWNER - Business stakeholder
- DATA_STEWARD - Manages data quality and compliance
- DATAOWNER - Generic data owner
- DEVELOPER - Software developer
- PRODUCER - Data producer
- CONSUMER - Data consumer
- STAKEHOLDER - Other stakeholder
Glossary Terms
Adding Terms
dataset.addTerm("urn:li:glossaryTerm:CustomerData");
dataset.addTerm("urn:li:glossaryTerm:Classification.Confidential");
Removing Terms
dataset.removeTerm("urn:li:glossaryTerm:CustomerData");
Term Chaining
dataset.addTerm("urn:li:glossaryTerm:CustomerData")
.addTerm("urn:li:glossaryTerm:PII")
.addTerm("urn:li:glossaryTerm:GDPR");
Domain
Setting Domain
dataset.setDomain("urn:li:domain:Marketing");
Removing Domain
// Remove a specific domain
dataset.removeDomain("urn:li:domain:Marketing");
// Or clear all domains
dataset.clearDomains();
Custom Properties
Adding Individual Properties
dataset.addCustomProperty("team", "data-engineering");
dataset.addCustomProperty("retention_days", "90");
dataset.addCustomProperty("cost_center", "12345");
Setting All Properties
Replace all custom properties:
Map<String, String> properties = new HashMap<>();
properties.put("team", "data-engineering");
properties.put("retention", "90_days");
properties.put("classification", "internal");
dataset.setCustomProperties(properties);
Removing Properties
dataset.removeCustomProperty("retention_days");
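The difference between adding, replacing, and removing properties is easiest to see on a plain map. The sketch below models only those semantics; `CustomPropertiesSketch` and its methods are hypothetical stand-ins, not the SDK's classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of add / replace-all / remove semantics for custom properties. */
class CustomPropertiesSketch {

    private final Map<String, String> props = new LinkedHashMap<>();

    /** Adds or overwrites a single property, keeping the others. */
    CustomPropertiesSketch add(String key, String value) {
        props.put(key, value);
        return this;
    }

    /** Replaces the whole set: keys missing from the new map are dropped. */
    CustomPropertiesSketch setAll(Map<String, String> replacement) {
        props.clear();
        props.putAll(replacement);
        return this;
    }

    /** Removes a single property if present. */
    CustomPropertiesSketch remove(String key) {
        props.remove(key);
        return this;
    }

    String get(String key) {
        return props.get(key);
    }

    int size() {
        return props.size();
    }
}
```

The practical consequence is that replacing all properties discards anything added earlier, so use the individual add call when other writers may also set properties.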
Schema
Setting Schema Metadata
import com.linkedin.schema.*;
SchemaMetadata schema = new SchemaMetadata();
// Configure schema...
dataset.setSchema(schema);
Setting Schema Fields
import com.linkedin.schema.*;
import java.util.ArrayList;
import java.util.List;
List<SchemaField> fields = new ArrayList<>();
// String field
SchemaField userIdField = new SchemaField();
userIdField.setFieldPath("user_id");
userIdField.setNativeDataType("VARCHAR(255)");
userIdField.setType(
new SchemaFieldDataType().setType(SchemaFieldDataType.Type.create(new StringType())));
fields.add(userIdField);
// Numeric field
SchemaField amountField = new SchemaField();
amountField.setFieldPath("amount");
amountField.setNativeDataType("DECIMAL(10,2)");
amountField.setType(
new SchemaFieldDataType().setType(SchemaFieldDataType.Type.create(new NumberType())));
fields.add(amountField);
dataset.setSchemaFields(fields);
Complete Example
import datahub.client.v2.DataHubClientV2;
import datahub.client.v2.entity.Dataset;
import com.linkedin.common.OwnershipType;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
public class DatasetExample {
    public static void main(String[] args) {
        // Create client
        DataHubClientV2 client = DataHubClientV2.builder()
            .server("http://localhost:8080")
            .build();
        try {
            // Build dataset with all metadata
            Dataset dataset = Dataset.builder()
                .platform("snowflake")
                .name("analytics.public.user_events")
                .env("PROD")
                .description("User interaction events from web and mobile")
                .displayName("User Events")
                .build();
            // Add tags
            dataset.addTag("pii")
                .addTag("analytics")
                .addTag("gdpr");
            // Add owners
            dataset.addOwner("urn:li:corpuser:data_team", OwnershipType.TECHNICAL_OWNER)
                .addOwner("urn:li:corpuser:product_team", OwnershipType.BUSINESS_OWNER);
            // Add glossary terms
            dataset.addTerm("urn:li:glossaryTerm:CustomerData")
                .addTerm("urn:li:glossaryTerm:EventData");
            // Set domain
            dataset.setDomain("urn:li:domain:Analytics");
            // Add custom properties
            dataset.addCustomProperty("team", "data-engineering")
                .addCustomProperty("retention_days", "365")
                .addCustomProperty("refresh_schedule", "daily");
            // Upsert to DataHub
            client.entities().upsert(dataset);
            System.out.println("Successfully created dataset: " + dataset.getUrn());
        } catch (IOException | ExecutionException | InterruptedException e) {
            e.printStackTrace();
        } finally {
            try {
                client.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
Updating Existing Datasets
Load and Modify
// Load existing dataset
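// DatasetUrn takes a DataPlatformUrn and a FabricType rather than plain strings
// (com.linkedin.common.urn.DataPlatformUrn, com.linkedin.common.FabricType)
DatasetUrn urn = new DatasetUrn(new DataPlatformUrn("snowflake"), "my_table", FabricType.PROD);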
Dataset dataset = client.entities().get(urn);
// Add new metadata (creates patches)
dataset.addTag("new-tag")
.addOwner("urn:li:corpuser:new_owner", OwnershipType.TECHNICAL_OWNER);
// Apply patches
client.entities().update(dataset);
Incremental Updates
// Just add what you need
dataset.addTag("sensitive");
client.entities().update(dataset);
// Later, add more
dataset.addCustomProperty("updated_at", String.valueOf(System.currentTimeMillis()));
client.entities().update(dataset);
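One way to picture the load-and-modify flow is as a queue of pending changes that each update call sends and then clears. The sketch below is a mental model under that assumption, not a description of SDK internals: `PendingPatchSketch`, `pendingCount`, and `flush` are hypothetical names, and `flush` merely stands in for `client.entities().update(dataset)`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of patch accumulation; illustrative only, not SDK internals. */
class PendingPatchSketch {

    private final List<String> pending = new ArrayList<>();

    /** Records a tag addition as a pending patch and returns this for chaining. */
    PendingPatchSketch addTag(String tag) {
        pending.add("ADD tag " + tag);
        return this;
    }

    /** Records a custom-property addition as a pending patch. */
    PendingPatchSketch addCustomProperty(String key, String value) {
        pending.add("ADD property " + key + "=" + value);
        return this;
    }

    int pendingCount() {
        return pending.size();
    }

    /** Stand-in for an update call: returns the queued patches and empties the queue. */
    List<String> flush() {
        List<String> sent = new ArrayList<>(pending);
        pending.clear();
        return sent;
    }
}
```

Under this model, each update carries only the changes made since the previous one, which is why small incremental calls stay cheap.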
Builder Options Reference
| Method | Required | Description |
|---|---|---|
| platform(String) | ✅ Yes | Data platform (e.g., "snowflake", "bigquery") |
| name(String) | ✅ Yes | Fully qualified dataset name |
| env(String) | No | Environment (PROD, DEV, etc.). Default: PROD |
| platformInstance(String) | No | Platform instance identifier |
| description(String) | No | Dataset description |
| displayName(String) | No | Display name shown in UI |
| customProperties(Map) | No | Map of custom key-value properties |
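The required and optional rules in this table can be sketched with a small builder that rejects a missing platform or name and falls back to PROD when no environment is given. `DatasetSpecSketch` is a hypothetical class written for illustration; how `Dataset.builder()` itself reports a missing field is not specified here.

```java
import java.util.Objects;

/** Hypothetical builder mirroring the options table; not the SDK's Dataset.builder(). */
class DatasetSpecSketch {

    final String platform;
    final String name;
    final String env;

    private DatasetSpecSketch(Builder b) {
        // platform and name are the only required options.
        this.platform = Objects.requireNonNull(b.platform, "platform is required");
        this.name = Objects.requireNonNull(b.name, "name is required");
        // env is optional and defaults to PROD.
        this.env = (b.env == null) ? "PROD" : b.env;
    }

    static Builder builder() {
        return new Builder();
    }

    static class Builder {
        private String platform;
        private String name;
        private String env;

        Builder platform(String platform) {
            this.platform = platform;
            return this;
        }

        Builder name(String name) {
            this.name = name;
            return this;
        }

        Builder env(String env) {
            this.env = env;
            return this;
        }

        DatasetSpecSketch build() {
            return new DatasetSpecSketch(this);
        }
    }
}
```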
Mode-Aware vs Explicit Methods
| Operation | Mode-Aware Method | SDK Mode Aspect | INGESTION Mode Aspect |
|---|---|---|---|
| Description | setDescription() | editableDatasetProperties | datasetProperties |
| Display Name | setDisplayName() | editableDatasetProperties | datasetProperties |
Explicit methods (always available):
- setSystemDescription() / setEditableDescription()
- setSystemDisplayName() / setEditableDisplayName()
Common Patterns
Creating Multiple Datasets
for (String tableName : tableNames) {
Dataset dataset = Dataset.builder()
.platform("postgres")
.name("public." + tableName)
.env("PROD")
.build();
dataset.addTag("auto-generated")
.addCustomProperty("created_by", "sync_job");
client.entities().upsert(dataset);
}
Batch Metadata Addition
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_table")
.build();
List<String> tags = Arrays.asList("pii", "sensitive", "gdpr");
tags.forEach(dataset::addTag);
client.entities().upsert(dataset); // Emits all tags in one call
Conditional Metadata
if (isPII(dataset)) {
dataset.addTag("pii")
.addTerm("urn:li:glossaryTerm:PersonalData");
}
if (requiresGovernance(dataset)) {
dataset.addOwner("urn:li:corpuser:governance_team", OwnershipType.DATA_STEWARD);
}
Next Steps
- Chart Entity - Working with chart entities
- Patch Operations - Deep dive into patches
- Migration Guide - Upgrading from V1
Examples
Basic Dataset Creation
{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetCreateExample.java show_path_as_comment }}
Dataset Patch Operations
{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetPatchExample.java show_path_as_comment }}
Comprehensive Dataset Example
{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetFullExample.java show_path_as_comment }}