# Dataset Entity

The Dataset entity represents a collection of data with a common schema (tables, views, files, topics, etc.). This guide covers creating, describing, annotating, and updating datasets in SDK V2.

## Creating a Dataset

### Minimal Dataset

Only platform and name are required:

```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_database.my_schema.my_table")
    .build();
```

### With Environment

Specify environment (PROD, DEV, STAGING, etc.):

```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .env("PROD")
    .build();
// URN: urn:li:dataset:(urn:li:dataPlatform:snowflake,my_table,PROD)
```

### With Metadata

Add description and display name at construction:

```java
Dataset dataset = Dataset.builder()
    .platform("bigquery")
    .name("project.dataset.table")
    .env("PROD")
    .description("User transactions table")
    .displayName("User Transactions")
    .build();
```

### With Custom Properties

Pass custom properties to the builder:

```java
import java.util.HashMap;
import java.util.Map;

Map<String, String> props = new HashMap<>();
props.put("team", "data-engineering");
props.put("retention", "90_days");

Dataset dataset = Dataset.builder()
    .platform("postgres")
    .name("public.users")
    .customProperties(props)
    .build();
```

### With Platform Instance

For multi-instance platforms:

```java
Dataset dataset = Dataset.builder()
    .platform("kafka")
    .name("user-events")
    .platformInstance("kafka-prod-cluster")
    .build();
```

## URN Construction

Dataset URNs follow the pattern:

```
urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})
```

**Automatic URN creation:**

```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("analytics.public.events")
    .env("PROD")
    .build();

DatasetUrn urn = dataset.getDatasetUrn();
// urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.events,PROD)
```
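
To make the URN pattern concrete, here is a minimal plain-Java sketch that fills in the three placeholders. `DatasetUrnSketch` and `format` are hypothetical names used only for illustration; they are not part of the SDK, which builds the URN for you.

```java
// Illustrative only: assembles a dataset URN string from the documented pattern.
// DatasetUrnSketch is a hypothetical helper, not an SDK class.
final class DatasetUrnSketch {
    private DatasetUrnSketch() {}

    /** Substitutes platform, name, and env into the dataset URN pattern. */
    static String format(String platform, String name, String env) {
        return String.format(
            "urn:li:dataset:(urn:li:dataPlatform:%s,%s,%s)", platform, name, env);
    }

    public static void main(String[] args) {
        // Prints: urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.events,PROD)
        System.out.println(format("snowflake", "analytics.public.events", "PROD"));
    }
}
```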

## Description Operations

### Mode-Aware Description

The `setDescription()` method writes to a different aspect depending on the active mode:

```java
// In SDK mode (the default), this writes to editableDatasetProperties
dataset.setDescription("User-provided description");

// In INGESTION mode, the same call writes to datasetProperties
dataset.setDescription("Ingested from Snowflake");
```
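
The routing rule can be pictured as a simple lookup from mode to aspect name. The sketch below is a conceptual model only: the `Mode` enum and `targetAspect` method are hypothetical stand-ins, and it does not show how the SDK itself selects or stores the mode.

```java
// Conceptual model of mode-aware routing; not the SDK's implementation.
final class DescriptionRouting {
    private DescriptionRouting() {}

    /** Hypothetical stand-in for the two modes described in this guide. */
    enum Mode { SDK, INGESTION }

    /** Returns the name of the aspect a mode-aware setter would target. */
    static String targetAspect(Mode mode) {
        switch (mode) {
            case INGESTION:
                return "datasetProperties";
            case SDK:
            default:
                return "editableDatasetProperties";
        }
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> " + targetAspect(mode));
        }
    }
}
```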

### Explicit Aspect Targeting

Control which aspect to write:

```java
// System description (datasetProperties)
dataset.setSystemDescription("Generated by ETL pipeline");

// Editable description (editableDatasetProperties)
dataset.setEditableDescription("User override description");
```

### Reading Description

Get description (prefers editable over system):

```java
String description = dataset.getDescription();
// Returns editableDatasetProperties.description if set
// Otherwise returns datasetProperties.description
```
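
The read preference amounts to a fallback between two values. A minimal sketch, assuming "set" means non-null (how the SDK treats empty strings is not stated here), with `DescriptionResolver` as a hypothetical helper:

```java
// Illustrative fallback logic: editable value wins, system value is the fallback.
// Assumes "set" means non-null; DescriptionResolver is not an SDK class.
final class DescriptionResolver {
    private DescriptionResolver() {}

    /** Returns the editable description if present, otherwise the system one (may be null). */
    static String resolve(String editable, String system) {
        return editable != null ? editable : system;
    }

    public static void main(String[] args) {
        System.out.println(resolve("User override description", "Generated by ETL pipeline"));
        System.out.println(resolve(null, "Generated by ETL pipeline"));
    }
}
```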

## Display Name Operations

Similar to description, display names are mode-aware:

```java
// Mode-aware (SDK → editable, INGESTION → system)
dataset.setDisplayName("User Events");

// Explicit aspect targeting
dataset.setSystemDisplayName("user_events_table");
dataset.setEditableDisplayName("User Events Table");

// Read display name (prefers editable)
String name = dataset.getDisplayName();
```

## Tags

### Adding Tags

```java
// Simple tag name (auto-prefixed)
dataset.addTag("pii");
// Creates: urn:li:tag:pii

// Full tag URN
dataset.addTag("urn:li:tag:analytics");
```
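
The auto-prefixing behaviour can be sketched as a small normalization step: bare names gain the `urn:li:tag:` prefix, and full URNs pass through unchanged. `TagUrns.normalize` is a hypothetical helper written for illustration; the SDK's own handling may differ in edge cases.

```java
// Illustrative normalization of tag identifiers; TagUrns is not an SDK class.
final class TagUrns {
    private static final String PREFIX = "urn:li:tag:";

    private TagUrns() {}

    /** Prefixes bare tag names and leaves full tag URNs untouched. */
    static String normalize(String tag) {
        return tag.startsWith(PREFIX) ? tag : PREFIX + tag;
    }

    public static void main(String[] args) {
        System.out.println(normalize("pii"));                   // urn:li:tag:pii
        System.out.println(normalize("urn:li:tag:analytics"));  // urn:li:tag:analytics
    }
}
```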

### Removing Tags

```java
dataset.removeTag("pii");
dataset.removeTag("urn:li:tag:analytics");
```

### Tag Chaining

```java
dataset.addTag("pii")
    .addTag("sensitive")
    .addTag("gdpr");
```
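
Chaining works when each mutator returns the object it was called on. The following sketch shows that fluent pattern in isolation; `FluentTags` is a hypothetical class, not the SDK's `Dataset`.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative fluent-interface pattern: each call returns `this` so calls can be chained.
final class FluentTags {
    // LinkedHashSet keeps insertion order and ignores duplicate tags.
    private final Set<String> tags = new LinkedHashSet<>();

    FluentTags addTag(String tag) {
        tags.add(tag);
        return this;
    }

    Set<String> tags() {
        return tags;
    }

    public static void main(String[] args) {
        FluentTags t = new FluentTags().addTag("pii").addTag("sensitive").addTag("gdpr");
        System.out.println(t.tags()); // [pii, sensitive, gdpr]
    }
}
```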

## Owners

### Adding Owners

```java
import com.linkedin.common.OwnershipType;

// Technical owner
dataset.addOwner(
    "urn:li:corpuser:john_doe",
    OwnershipType.TECHNICAL_OWNER
);

// Data steward
dataset.addOwner(
    "urn:li:corpuser:jane_smith",
    OwnershipType.DATA_STEWARD
);

// Business owner
dataset.addOwner(
    "urn:li:corpuser:alice",
    OwnershipType.BUSINESS_OWNER
);
```

### Removing Owners

```java
dataset.removeOwner("urn:li:corpuser:john_doe");
```

### Owner Types

Available ownership types:

- `TECHNICAL_OWNER` - Maintains the technical implementation
- `BUSINESS_OWNER` - Business stakeholder
- `DATA_STEWARD` - Manages data quality and compliance
- `DATAOWNER` - Generic data owner
- `DEVELOPER` - Software developer
- `PRODUCER` - Data producer
- `CONSUMER` - Data consumer
- `STAKEHOLDER` - Other stakeholder

## Glossary Terms

### Adding Terms

```java
dataset.addTerm("urn:li:glossaryTerm:CustomerData");
dataset.addTerm("urn:li:glossaryTerm:Classification.Confidential");
```

### Removing Terms

```java
dataset.removeTerm("urn:li:glossaryTerm:CustomerData");
```

### Term Chaining

```java
dataset.addTerm("urn:li:glossaryTerm:CustomerData")
    .addTerm("urn:li:glossaryTerm:PII")
    .addTerm("urn:li:glossaryTerm:GDPR");
```

## Domain

### Setting Domain

```java
dataset.setDomain("urn:li:domain:Marketing");
```

### Removing Domain

```java
// Remove a specific domain
dataset.removeDomain("urn:li:domain:Marketing");

// Or clear all domains
dataset.clearDomains();
```

## Custom Properties

### Adding Individual Properties

```java
dataset.addCustomProperty("team", "data-engineering");
dataset.addCustomProperty("retention_days", "90");
dataset.addCustomProperty("cost_center", "12345");
```

### Setting All Properties

Replace all custom properties:

```java
Map<String, String> properties = new HashMap<>();
properties.put("team", "data-engineering");
properties.put("retention", "90_days");
properties.put("classification", "internal");

dataset.setCustomProperties(properties);
```
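
The difference between adding a property and setting all properties is the difference between merging into a map and swapping the map out. A minimal sketch of those semantics, using a hypothetical `CustomPropertiesSketch` class and plain `java.util` collections rather than the SDK:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative add-vs-replace semantics; CustomPropertiesSketch is not an SDK class.
final class CustomPropertiesSketch {
    private Map<String, String> props = new HashMap<>();

    /** Adds or overwrites a single key, keeping all other keys. */
    void add(String key, String value) {
        props.put(key, value);
    }

    /** Discards every existing key and keeps only the given entries. */
    void setAll(Map<String, String> replacement) {
        props = new HashMap<>(replacement);
    }

    /** Returns a copy of the current properties. */
    Map<String, String> snapshot() {
        return new HashMap<>(props);
    }

    public static void main(String[] args) {
        CustomPropertiesSketch p = new CustomPropertiesSketch();
        p.add("team", "data-engineering");
        p.add("cost_center", "12345");

        Map<String, String> replacement = new HashMap<>();
        replacement.put("retention", "90_days");
        p.setAll(replacement);

        System.out.println(p.snapshot()); // {retention=90_days}
    }
}
```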

### Removing Properties

```java
dataset.removeCustomProperty("retention_days");
```

## Schema

### Setting Schema Metadata

```java
import com.linkedin.schema.*;

SchemaMetadata schema = new SchemaMetadata();
// Configure schema...
dataset.setSchema(schema);
```

### Setting Schema Fields

```java
import com.linkedin.schema.*;
import java.util.ArrayList;
import java.util.List;

List<SchemaField> fields = new ArrayList<>();

// String field
SchemaField userIdField = new SchemaField();
userIdField.setFieldPath("user_id");
userIdField.setNativeDataType("VARCHAR(255)");
userIdField.setType(
    new SchemaFieldDataType().setType(SchemaFieldDataType.Type.create(new StringType())));
fields.add(userIdField);

// Numeric field
SchemaField amountField = new SchemaField();
amountField.setFieldPath("amount");
amountField.setNativeDataType("DECIMAL(10,2)");
amountField.setType(
    new SchemaFieldDataType().setType(SchemaFieldDataType.Type.create(new NumberType())));
fields.add(amountField);

dataset.setSchemaFields(fields);
```
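
When fields come from a database catalog, something has to decide which type to pair with each native data type. The sketch below shows one way such a mapping could look. It is an assumption for illustration: `NativeTypeMapper`, its `Category` enum, and the list of recognized type names are hypothetical, and the result would still need to be turned into `StringType`, `NumberType`, and so on by the caller.

```java
import java.util.Locale;

// Illustrative mapping from native SQL type names to coarse categories.
// Hypothetical helper; the recognized names are examples, not an exhaustive list.
final class NativeTypeMapper {
    enum Category { STRING, NUMBER, UNKNOWN }

    private NativeTypeMapper() {}

    static Category categorize(String nativeType) {
        // Drop any "(...)" length or precision suffix, then normalize case.
        int paren = nativeType.indexOf('(');
        String base = (paren >= 0 ? nativeType.substring(0, paren) : nativeType)
            .trim()
            .toUpperCase(Locale.ROOT);
        switch (base) {
            case "VARCHAR":
            case "CHAR":
            case "TEXT":
                return Category.STRING;
            case "DECIMAL":
            case "NUMERIC":
            case "INT":
            case "INTEGER":
            case "BIGINT":
                return Category.NUMBER;
            default:
                return Category.UNKNOWN;
        }
    }

    public static void main(String[] args) {
        System.out.println(categorize("VARCHAR(255)"));  // STRING
        System.out.println(categorize("DECIMAL(10,2)")); // NUMBER
    }
}
```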

## Complete Example

```java
import datahub.client.v2.DataHubClientV2;
import datahub.client.v2.entity.Dataset;
import com.linkedin.common.OwnershipType;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

public class DatasetExample {
    public static void main(String[] args) {
        // Create client
        DataHubClientV2 client = DataHubClientV2.builder()
            .server("http://localhost:8080")
            .build();

        try {
            // Build dataset with all metadata
            Dataset dataset = Dataset.builder()
                .platform("snowflake")
                .name("analytics.public.user_events")
                .env("PROD")
                .description("User interaction events from web and mobile")
                .displayName("User Events")
                .build();

            // Add tags
            dataset.addTag("pii")
                .addTag("analytics")
                .addTag("gdpr");

            // Add owners
            dataset.addOwner("urn:li:corpuser:data_team", OwnershipType.TECHNICAL_OWNER)
                .addOwner("urn:li:corpuser:product_team", OwnershipType.BUSINESS_OWNER);

            // Add glossary terms
            dataset.addTerm("urn:li:glossaryTerm:CustomerData")
                .addTerm("urn:li:glossaryTerm:EventData");

            // Set domain
            dataset.setDomain("urn:li:domain:Analytics");

            // Add custom properties
            dataset.addCustomProperty("team", "data-engineering")
                .addCustomProperty("retention_days", "365")
                .addCustomProperty("refresh_schedule", "daily");

            // Upsert to DataHub
            client.entities().upsert(dataset);

            System.out.println("Successfully created dataset: " + dataset.getUrn());

        } catch (IOException | ExecutionException | InterruptedException e) {
            e.printStackTrace();
        } finally {
            try {
                client.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
```

## Updating Existing Datasets

### Load and Modify

```java
// Load existing dataset
DatasetUrn urn = new DatasetUrn("snowflake", "my_table", "PROD");
Dataset dataset = client.entities().get(urn);

// Add new metadata (creates patches)
dataset.addTag("new-tag")
    .addOwner("urn:li:corpuser:new_owner", OwnershipType.TECHNICAL_OWNER);

// Apply patches
client.entities().update(dataset);
```

### Incremental Updates

```java
// Just add what you need
dataset.addTag("sensitive");
client.entities().update(dataset);

// Later, add more
dataset.addCustomProperty("updated_at", String.valueOf(System.currentTimeMillis()));
client.entities().update(dataset);
```
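
One way to picture the "creates patches, then applies them" flow is a buffer of pending changes that is sent and emptied on each update. The sketch below is a conceptual model under that assumption. `PendingPatchBuffer` is hypothetical, changes are represented as plain strings, and whether the SDK clears its pending patches after `update()` is not confirmed by this guide.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual model of incremental updates: record changes, then flush them in one step.
// Not the SDK's implementation; changes are plain strings for illustration.
final class PendingPatchBuffer {
    private final List<String> pending = new ArrayList<>();
    private final List<String> sent = new ArrayList<>();

    /** Records a change locally without sending it. */
    PendingPatchBuffer record(String change) {
        pending.add(change);
        return this;
    }

    /** Moves all pending changes to the sent list and returns how many were moved. */
    int flush() {
        int count = pending.size();
        sent.addAll(pending);
        pending.clear();
        return count;
    }

    int pendingCount() {
        return pending.size();
    }

    List<String> sent() {
        return sent;
    }

    public static void main(String[] args) {
        PendingPatchBuffer buffer = new PendingPatchBuffer();
        buffer.record("addTag:sensitive");
        System.out.println("first flush sent " + buffer.flush());
        buffer.record("addCustomProperty:updated_at");
        System.out.println("second flush sent " + buffer.flush());
    }
}
```

In this model, each flush carries only the changes recorded since the previous one, which mirrors the "just add what you need" usage shown above.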

## Builder Options Reference

| Method                     | Required | Description                                   |
| -------------------------- | -------- | --------------------------------------------- |
| `platform(String)`         | ✅ Yes   | Data platform (e.g., "snowflake", "bigquery") |
| `name(String)`             | ✅ Yes   | Fully qualified dataset name                  |
| `env(String)`              | No       | Environment (PROD, DEV, etc.). Default: PROD  |
| `platformInstance(String)` | No       | Platform instance identifier                  |
| `description(String)`      | No       | Dataset description                           |
| `displayName(String)`      | No       | Display name shown in UI                      |
| `customProperties(Map)`    | No       | Map of custom key-value properties            |
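
The required and default columns of the table can be modelled with a small builder that validates in `build()`. This is a sketch of the pattern only: `DatasetSpec` is a hypothetical class, and the exception type the real `Dataset.builder()` throws for a missing field is an assumption.

```java
// Illustrative builder with two required fields and a defaulted environment.
// DatasetSpec is hypothetical; the SDK's actual validation may differ.
final class DatasetSpec {
    final String platform;
    final String name;
    final String env;

    private DatasetSpec(String platform, String name, String env) {
        this.platform = platform;
        this.name = name;
        this.env = env;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        private String platform;
        private String name;
        private String env = "PROD"; // default listed in the table

        Builder platform(String platform) {
            this.platform = platform;
            return this;
        }

        Builder name(String name) {
            this.name = name;
            return this;
        }

        Builder env(String env) {
            this.env = env;
            return this;
        }

        DatasetSpec build() {
            if (platform == null || platform.isEmpty()) {
                throw new IllegalStateException("platform is required");
            }
            if (name == null || name.isEmpty()) {
                throw new IllegalStateException("name is required");
            }
            return new DatasetSpec(platform, name, env);
        }
    }

    public static void main(String[] args) {
        DatasetSpec spec = DatasetSpec.builder().platform("snowflake").name("my_table").build();
        System.out.println(spec.env); // PROD
    }
}
```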

## Mode-Aware vs Explicit Methods

| Operation    | Mode-Aware Method  | SDK Mode Aspect             | INGESTION Mode Aspect |
| ------------ | ------------------ | --------------------------- | --------------------- |
| Description  | `setDescription()` | `editableDatasetProperties` | `datasetProperties`   |
| Display Name | `setDisplayName()` | `editableDatasetProperties` | `datasetProperties`   |

**Explicit methods** (always available):

- `setSystemDescription()` / `setEditableDescription()`
- `setSystemDisplayName()` / `setEditableDisplayName()`

## Common Patterns

### Creating Multiple Datasets

```java
for (String tableName : tableNames) {
    Dataset dataset = Dataset.builder()
        .platform("postgres")
        .name("public." + tableName)
        .env("PROD")
        .build();

    dataset.addTag("auto-generated")
        .addCustomProperty("created_by", "sync_job");

    client.entities().upsert(dataset);
}
```

### Batch Metadata Addition

```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

List<String> tags = Arrays.asList("pii", "sensitive", "gdpr");
tags.forEach(dataset::addTag);

client.entities().upsert(dataset); // Emits all tags in one call
```

### Conditional Metadata

```java
if (isPII(dataset)) {
    dataset.addTag("pii")
        .addTerm("urn:li:glossaryTerm:PersonalData");
}

if (requiresGovernance(dataset)) {
    dataset.addOwner("urn:li:corpuser:governance_team", OwnershipType.DATA_STEWARD);
}
```
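
The `isPII` and `requiresGovernance` checks in the snippet above are application-specific and not provided by the SDK. As one possible shape for such a predicate, the sketch below flags a dataset as PII when any field path contains a known hint. Everything here is hypothetical: the class name, the hint list, and the choice to inspect field paths rather than the dataset object.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical PII heuristic based on column names; the hint list is an example only.
final class GovernanceRules {
    private static final List<String> PII_HINTS = Arrays.asList("email", "ssn", "phone");

    private GovernanceRules() {}

    /** Returns true if any field path contains one of the PII hints, ignoring case. */
    static boolean isPii(List<String> fieldPaths) {
        for (String path : fieldPaths) {
            String lower = path.toLowerCase(Locale.ROOT);
            for (String hint : PII_HINTS) {
                if (lower.contains(hint)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isPii(Arrays.asList("user_id", "Email_Address"))); // true
        System.out.println(isPii(Arrays.asList("amount")));                   // false
    }
}
```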

## Next Steps

- **[Chart Entity](./chart-entity.md)** - Working with chart entities
- **[Patch Operations](./patch-operations.md)** - Deep dive into patches
- **[Migration Guide](./migration-from-v1.md)** - Upgrading from V1

## Examples

### Basic Dataset Creation

```java
{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetCreateExample.java show_path_as_comment }}
```

### Dataset Patch Operations

```java
{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetPatchExample.java show_path_as_comment }}
```

### Comprehensive Dataset Example

```java
{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetFullExample.java show_path_as_comment }}
```