
# Dataset Entity
The Dataset entity represents collections of data with a common schema (tables, views, files, topics, etc.). This guide covers the dataset operations available in SDK V2.
## Creating a Dataset
### Minimal Dataset
Only platform and name are required:
```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_database.my_schema.my_table")
    .build();
```
### With Environment
Specify environment (PROD, DEV, STAGING, etc.):
```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .env("PROD")
    .build();
// URN: urn:li:dataset:(urn:li:dataPlatform:snowflake,my_table,PROD)
```
### With Metadata
Add description and display name at construction:
```java
Dataset dataset = Dataset.builder()
    .platform("bigquery")
    .name("project.dataset.table")
    .env("PROD")
    .description("User transactions table")
    .displayName("User Transactions")
    .build();
```
### With Custom Properties
Pass custom properties to the builder as a map:
```java
import java.util.HashMap;
import java.util.Map;

Map<String, String> props = new HashMap<>();
props.put("team", "data-engineering");
props.put("retention", "90_days");

Dataset dataset = Dataset.builder()
    .platform("postgres")
    .name("public.users")
    .customProperties(props)
    .build();
```
### With Platform Instance
For multi-instance platforms:
```java
Dataset dataset = Dataset.builder()
    .platform("kafka")
    .name("user-events")
    .platformInstance("kafka-prod-cluster")
    .build();
```
## URN Construction
Dataset URNs follow the pattern:
```
urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})
```
**Automatic URN creation:**
```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("analytics.public.events")
    .env("PROD")
    .build();

DatasetUrn urn = dataset.getDatasetUrn();
// urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.events,PROD)
```
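To make the pattern concrete, the following is a minimal sketch that fills the three placeholders by hand. `DatasetUrnSketch` and its `format` method are illustrative names, not part of the SDK; in practice the builder produces the URN for you.

```java
// Illustrative helper only; not part of the DataHub SDK.
final class DatasetUrnSketch {
    private DatasetUrnSketch() {}

    /** Fills the {platform}, {name} and {env} placeholders of the dataset URN pattern. */
    static String format(String platform, String name, String env) {
        return String.format(
            "urn:li:dataset:(urn:li:dataPlatform:%s,%s,%s)", platform, name, env);
    }
}
```

Calling `DatasetUrnSketch.format("snowflake", "analytics.public.events", "PROD")` yields the same string shown in the comment above.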
## Description Operations
### Mode-Aware Description
The `setDescription()` method writes to a different aspect depending on the active mode:
```java
// In SDK mode (the default), this call writes to editableDatasetProperties
dataset.setDescription("User-provided description");

// In INGESTION mode, the same call writes to datasetProperties
dataset.setDescription("Ingested from Snowflake");
```
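The routing rule can be pictured as a simple mapping from mode to aspect name. This is a sketch under stated assumptions: the `Mode` enum and `targetAspect` method are hypothetical stand-ins, and only the two aspect names come from this guide.

```java
// Illustrative model of the routing rule; not the SDK's implementation.
final class DescriptionRoutingSketch {
    /** Hypothetical stand-in for the SDK's operation mode. */
    enum Mode { SDK, INGESTION }

    private DescriptionRoutingSketch() {}

    /** Returns the name of the aspect a mode-aware setter would write to. */
    static String targetAspect(Mode mode) {
        return mode == Mode.INGESTION
            ? "datasetProperties"          // system-owned aspect
            : "editableDatasetProperties"; // user-editable aspect (SDK mode)
    }
}
```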
### Explicit Aspect Targeting
To choose the target aspect explicitly, regardless of mode, use the dedicated setters:
```java
// System description (datasetProperties)
dataset.setSystemDescription("Generated by ETL pipeline");
// Editable description (editableDatasetProperties)
dataset.setEditableDescription("User override description");
```
### Reading Description
`getDescription()` prefers the editable description over the system one:
```java
String description = dataset.getDescription();
// Returns editableDatasetProperties.description if set
// Otherwise returns datasetProperties.description
```
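The read precedence amounts to a null-coalescing fallback. The sketch below models it with plain strings; `DescriptionReadSketch.resolve` is an illustrative name, and treating "set" as "non-null" is an assumption.

```java
// Illustrative model of the read precedence; not the SDK's implementation.
final class DescriptionReadSketch {
    private DescriptionReadSketch() {}

    /** Returns the editable value when present, otherwise the system value (which may be null). */
    static String resolve(String editableDescription, String systemDescription) {
        return editableDescription != null ? editableDescription : systemDescription;
    }
}
```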
## Display Name Operations
Like descriptions, display names are mode-aware:
```java
// Mode-aware (SDK → editable, INGESTION → system)
dataset.setDisplayName("User Events");
// Explicit aspect targeting
dataset.setSystemDisplayName("user_events_table");
dataset.setEditableDisplayName("User Events Table");
// Read display name (prefers editable)
String name = dataset.getDisplayName();
```
## Tags
### Adding Tags
```java
// Simple tag name (auto-prefixed)
dataset.addTag("pii");
// Creates: urn:li:tag:pii
// Full tag URN
dataset.addTag("urn:li:tag:analytics");
```
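The auto-prefixing described in the comments can be sketched as a small normalization step. `TagUrnSketch.toTagUrn` is a hypothetical helper written for illustration; the SDK's actual handling may differ in edge cases.

```java
// Illustrative helper only; not part of the DataHub SDK.
final class TagUrnSketch {
    private static final String PREFIX = "urn:li:tag:";

    private TagUrnSketch() {}

    /** Leaves full tag URNs untouched and prefixes bare tag names. */
    static String toTagUrn(String tag) {
        return tag.startsWith(PREFIX) ? tag : PREFIX + tag;
    }
}
```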
### Removing Tags
```java
dataset.removeTag("pii");
dataset.removeTag("urn:li:tag:analytics");
```
### Tag Chaining
```java
dataset.addTag("pii")
    .addTag("sensitive")
    .addTag("gdpr");
```
## Owners
### Adding Owners
```java
import com.linkedin.common.OwnershipType;

// Technical owner
dataset.addOwner(
    "urn:li:corpuser:john_doe",
    OwnershipType.TECHNICAL_OWNER
);

// Data steward
dataset.addOwner(
    "urn:li:corpuser:jane_smith",
    OwnershipType.DATA_STEWARD
);

// Business owner
dataset.addOwner(
    "urn:li:corpuser:alice",
    OwnershipType.BUSINESS_OWNER
);
```
### Removing Owners
```java
dataset.removeOwner("urn:li:corpuser:john_doe");
```
### Owner Types
Available ownership types:
- `TECHNICAL_OWNER` - Maintains the technical implementation
- `BUSINESS_OWNER` - Business stakeholder
- `DATA_STEWARD` - Manages data quality and compliance
- `DATAOWNER` - Generic data owner
- `DEVELOPER` - Software developer
- `PRODUCER` - Data producer
- `CONSUMER` - Data consumer
- `STAKEHOLDER` - Other stakeholder
## Glossary Terms
### Adding Terms
```java
dataset.addTerm("urn:li:glossaryTerm:CustomerData");
dataset.addTerm("urn:li:glossaryTerm:Classification.Confidential");
```
### Removing Terms
```java
dataset.removeTerm("urn:li:glossaryTerm:CustomerData");
```
### Term Chaining
```java
dataset.addTerm("urn:li:glossaryTerm:CustomerData")
    .addTerm("urn:li:glossaryTerm:PII")
    .addTerm("urn:li:glossaryTerm:GDPR");
```
## Domain
### Setting Domain
```java
dataset.setDomain("urn:li:domain:Marketing");
```
### Removing Domain
```java
// Remove a specific domain
dataset.removeDomain("urn:li:domain:Marketing");
// Or clear all domains
dataset.clearDomains();
```
## Custom Properties
### Adding Individual Properties
```java
dataset.addCustomProperty("team", "data-engineering");
dataset.addCustomProperty("retention_days", "90");
dataset.addCustomProperty("cost_center", "12345");
```
### Setting All Properties
Replace all custom properties:
```java
Map<String, String> properties = new HashMap<>();
properties.put("team", "data-engineering");
properties.put("retention", "90_days");
properties.put("classification", "internal");
dataset.setCustomProperties(properties);
```
### Removing Properties
```java
dataset.removeCustomProperty("retention_days");
```
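The difference between adding, replacing, and removing properties can be modeled with an ordinary map. This is a sketch under the assumption that adding merges into the existing set while setting discards it; `CustomPropertiesSketch` is an illustrative class, not an SDK type.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model of the three operations; not the SDK's implementation.
final class CustomPropertiesSketch {
    private Map<String, String> properties = new HashMap<>();

    /** Adds or overwrites a single key, keeping all other keys. */
    CustomPropertiesSketch add(String key, String value) {
        properties.put(key, value);
        return this;
    }

    /** Replaces the whole map, dropping any key not in the replacement. */
    CustomPropertiesSketch setAll(Map<String, String> replacement) {
        properties = new HashMap<>(replacement);
        return this;
    }

    /** Removes a single key if present. */
    CustomPropertiesSketch remove(String key) {
        properties.remove(key);
        return this;
    }

    /** Returns a copy so callers cannot mutate internal state. */
    Map<String, String> snapshot() {
        return new HashMap<>(properties);
    }
}
```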
## Schema
### Setting Schema Metadata
```java
import com.linkedin.schema.*;
SchemaMetadata schema = new SchemaMetadata();
// Configure schema...
dataset.setSchema(schema);
```
### Setting Schema Fields
```java
import com.linkedin.schema.*;
import java.util.ArrayList;
import java.util.List;

List<SchemaField> fields = new ArrayList<>();

// String field
SchemaField userIdField = new SchemaField();
userIdField.setFieldPath("user_id");
userIdField.setNativeDataType("VARCHAR(255)");
userIdField.setType(
    new SchemaFieldDataType().setType(SchemaFieldDataType.Type.create(new StringType())));
fields.add(userIdField);

// Numeric field
SchemaField amountField = new SchemaField();
amountField.setFieldPath("amount");
amountField.setNativeDataType("DECIMAL(10,2)");
amountField.setType(
    new SchemaFieldDataType().setType(SchemaFieldDataType.Type.create(new NumberType())));
fields.add(amountField);

dataset.setSchemaFields(fields);
```
## Complete Example
```java
import datahub.client.v2.DataHubClientV2;
import datahub.client.v2.entity.Dataset;
import com.linkedin.common.OwnershipType;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

public class DatasetExample {
    public static void main(String[] args) {
        // Create client
        DataHubClientV2 client = DataHubClientV2.builder()
            .server("http://localhost:8080")
            .build();

        try {
            // Build dataset with all metadata
            Dataset dataset = Dataset.builder()
                .platform("snowflake")
                .name("analytics.public.user_events")
                .env("PROD")
                .description("User interaction events from web and mobile")
                .displayName("User Events")
                .build();

            // Add tags
            dataset.addTag("pii")
                .addTag("analytics")
                .addTag("gdpr");

            // Add owners
            dataset.addOwner("urn:li:corpuser:data_team", OwnershipType.TECHNICAL_OWNER)
                .addOwner("urn:li:corpuser:product_team", OwnershipType.BUSINESS_OWNER);

            // Add glossary terms
            dataset.addTerm("urn:li:glossaryTerm:CustomerData")
                .addTerm("urn:li:glossaryTerm:EventData");

            // Set domain
            dataset.setDomain("urn:li:domain:Analytics");

            // Add custom properties
            dataset.addCustomProperty("team", "data-engineering")
                .addCustomProperty("retention_days", "365")
                .addCustomProperty("refresh_schedule", "daily");

            // Upsert to DataHub
            client.entities().upsert(dataset);
            System.out.println("Successfully created dataset: " + dataset.getUrn());
        } catch (IOException | ExecutionException | InterruptedException e) {
            e.printStackTrace();
        } finally {
            try {
                client.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
```
## Updating Existing Datasets
### Load and Modify
```java
// Load existing dataset
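// Assuming DatasetUrn is com.linkedin.common.urn.DatasetUrn, its constructor takes a
// DataPlatformUrn (com.linkedin.common.urn) and a FabricType (com.linkedin.common), not raw strings
DatasetUrn urn = new DatasetUrn(new DataPlatformUrn("snowflake"), "my_table", FabricType.PROD);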
Dataset dataset = client.entities().get(urn);
// Add new metadata (creates patches)
dataset.addTag("new-tag")
    .addOwner("urn:li:corpuser:new_owner", OwnershipType.TECHNICAL_OWNER);
// Apply patches
client.entities().update(dataset);
```
### Incremental Updates
```java
// Just add what you need
dataset.addTag("sensitive");
client.entities().update(dataset);
// Later, add more
dataset.addCustomProperty("updated_at", String.valueOf(System.currentTimeMillis()));
client.entities().update(dataset);
```
## Builder Options Reference
| Method | Required | Description |
| -------------------------- | -------- | --------------------------------------------- |
| `platform(String)` | ✅ Yes | Data platform (e.g., "snowflake", "bigquery") |
| `name(String)` | ✅ Yes | Fully qualified dataset name |
| `env(String)` | No | Environment (PROD, DEV, etc.). Default: PROD |
| `platformInstance(String)` | No | Platform instance identifier |
| `description(String)` | No | Dataset description |
| `displayName(String)` | No | Display name shown in UI |
| `customProperties(Map)` | No | Map of custom key-value properties |
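
The required and default columns of the table can be illustrated with a toy builder. This is a minimal sketch, not the SDK's `Dataset.builder()`: the class name, the `buildUrn` method, and the choice of `IllegalStateException` for missing fields are all assumptions made for illustration.

```java
// Toy builder showing required fields and the PROD default; not the SDK's builder.
final class DatasetBuilderSketch {
    private String platform;      // required
    private String name;          // required
    private String env = "PROD";  // optional, defaults to PROD as listed in the table

    DatasetBuilderSketch platform(String platform) {
        this.platform = platform;
        return this;
    }

    DatasetBuilderSketch name(String name) {
        this.name = name;
        return this;
    }

    DatasetBuilderSketch env(String env) {
        this.env = env;
        return this;
    }

    /** Validates required fields, then renders the URN. The exception type is an assumption. */
    String buildUrn() {
        if (platform == null || name == null) {
            throw new IllegalStateException("platform and name are required");
        }
        return "urn:li:dataset:(urn:li:dataPlatform:" + platform + "," + name + "," + env + ")";
    }
}
```

With only `platform("snowflake")` and `name("my_table")` set, `buildUrn()` falls back to the `PROD` environment.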
## Mode-Aware vs Explicit Methods
| Operation | Mode-Aware Method | SDK Mode Aspect | INGESTION Mode Aspect |
| ------------ | ------------------ | --------------------------- | --------------------- |
| Description | `setDescription()` | `editableDatasetProperties` | `datasetProperties` |
| Display Name | `setDisplayName()` | `editableDatasetProperties` | `datasetProperties` |
**Explicit methods** (always available):
- `setSystemDescription()` / `setEditableDescription()`
- `setSystemDisplayName()` / `setEditableDisplayName()`
## Common Patterns
### Creating Multiple Datasets
```java
for (String tableName : tableNames) {
    Dataset dataset = Dataset.builder()
        .platform("postgres")
        .name("public." + tableName)
        .env("PROD")
        .build();

    dataset.addTag("auto-generated")
        .addCustomProperty("created_by", "sync_job");

    client.entities().upsert(dataset);
}
```
### Batch Metadata Addition
```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

List<String> tags = Arrays.asList("pii", "sensitive", "gdpr");
tags.forEach(dataset::addTag);
client.entities().upsert(dataset); // Emits all tags in one call
```
### Conditional Metadata
```java
if (isPII(dataset)) {
    dataset.addTag("pii")
        .addTerm("urn:li:glossaryTerm:PersonalData");
}

if (requiresGovernance(dataset)) {
    dataset.addOwner("urn:li:corpuser:governance_team", OwnershipType.DATA_STEWARD);
}
```
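The `isPII` and `requiresGovernance` helpers above are not defined in this guide; they stand for whatever classification logic your pipeline uses. As one hypothetical possibility, a name-based check over field paths might look like the sketch below. The hint list and the `List<String>` signature are assumptions, and a real classifier would likely need more than substring matching.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical heuristic for illustration; not part of the DataHub SDK.
final class PiiHeuristicSketch {
    // Substrings that suggest a column holds personal data (assumed list).
    private static final List<String> HINTS = Arrays.asList("email", "phone", "ssn");

    private PiiHeuristicSketch() {}

    /** Returns true if any field path contains one of the hints, ignoring case. */
    static boolean isPII(List<String> fieldPaths) {
        for (String path : fieldPaths) {
            String lower = path.toLowerCase(Locale.ROOT);
            for (String hint : HINTS) {
                if (lower.contains(hint)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```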
## Next Steps
- **[Chart Entity](./chart-entity.md)** - Working with chart entities
- **[Patch Operations](./patch-operations.md)** - Deep dive into patches
- **[Migration Guide](./migration-from-v1.md)** - Upgrading from V1
## Examples
### Basic Dataset Creation
```java
{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetCreateExample.java show_path_as_comment }}
```
### Dataset Patch Operations
```java
{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetPatchExample.java show_path_as_comment }}
```
### Comprehensive Dataset Example
```java
{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetFullExample.java show_path_as_comment }}
```