# Dataset Entity

The Dataset entity represents collections of data with a common schema (tables, views, files, topics, etc.). This guide covers dataset operations in SDK V2.

## Creating a Dataset

### Minimal Dataset

Only platform and name are required:

```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_database.my_schema.my_table")
    .build();
```

### With Environment

Specify the environment (PROD, DEV, STAGING, etc.):

```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .env("PROD")
    .build();
// URN: urn:li:dataset:(urn:li:dataPlatform:snowflake,my_table,PROD)
```

### With Metadata

Add a description and display name at construction:

```java
Dataset dataset = Dataset.builder()
    .platform("bigquery")
    .name("project.dataset.table")
    .env("PROD")
    .description("User transactions table")
    .displayName("User Transactions")
    .build();
```

### With Custom Properties

Include custom properties in the builder:

```java
Map<String, String> props = new HashMap<>();
props.put("team", "data-engineering");
props.put("retention", "90_days");

Dataset dataset = Dataset.builder()
    .platform("postgres")
    .name("public.users")
    .customProperties(props)
    .build();
```

### With Platform Instance

For multi-instance platforms:

```java
Dataset dataset = Dataset.builder()
    .platform("kafka")
    .name("user-events")
    .platformInstance("kafka-prod-cluster")
    .build();
```

## URN Construction

Dataset URNs follow the pattern:

```
urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})
```

**Automatic URN creation:**

```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("analytics.public.events")
    .env("PROD")
    .build();

DatasetUrn urn = dataset.getDatasetUrn();
// urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.events,PROD)
```

## Description Operations

### Mode-Aware Description

The `setDescription()` method routes to a different aspect depending on the mode:

```java
// SDK mode (default) - writes to editableDatasetProperties
dataset.setDescription("User-provided description");

// INGESTION mode - writes to datasetProperties
dataset.setDescription("Ingested from Snowflake");
```

### Explicit Aspect Targeting

Control which aspect is written:

```java
// System description (datasetProperties)
dataset.setSystemDescription("Generated by ETL pipeline");

// Editable description (editableDatasetProperties)
dataset.setEditableDescription("User override description");
```

### Reading Description

Get the description (prefers editable over system):

```java
String description = dataset.getDescription();
// Returns editableDatasetProperties.description if set
// Otherwise returns datasetProperties.description
```

## Display Name Operations

Like descriptions, display names are mode-aware:

```java
// Mode-aware (SDK → editable, INGESTION → system)
dataset.setDisplayName("User Events");

// Explicit aspect targeting
dataset.setSystemDisplayName("user_events_table");
dataset.setEditableDisplayName("User Events Table");

// Read display name (prefers editable)
String name = dataset.getDisplayName();
```

## Tags

### Adding Tags

```java
// Simple tag name (auto-prefixed)
dataset.addTag("pii");
// Creates: urn:li:tag:pii

// Full tag URN
dataset.addTag("urn:li:tag:analytics");
```

### Removing Tags

```java
dataset.removeTag("pii");
dataset.removeTag("urn:li:tag:analytics");
```

### Tag Chaining

```java
dataset.addTag("pii")
       .addTag("sensitive")
       .addTag("gdpr");
```

## Owners

### Adding Owners

```java
import com.linkedin.common.OwnershipType;

// Technical owner
dataset.addOwner(
    "urn:li:corpuser:john_doe",
    OwnershipType.TECHNICAL_OWNER
);

// Data steward
dataset.addOwner(
    "urn:li:corpuser:jane_smith",
    OwnershipType.DATA_STEWARD
);

// Business owner
dataset.addOwner(
    "urn:li:corpuser:alice",
    OwnershipType.BUSINESS_OWNER
);
```

### Removing Owners

```java
dataset.removeOwner("urn:li:corpuser:john_doe");
```

### Owner Types

Available ownership types:

- `TECHNICAL_OWNER` - Maintains the technical implementation
- `BUSINESS_OWNER` - Business stakeholder
- `DATA_STEWARD` - Manages data quality and compliance
- `DATAOWNER` - Generic data owner
- `DEVELOPER` - Software developer
- `PRODUCER` - Data producer
- `CONSUMER` - Data consumer
- `STAKEHOLDER` - Other stakeholder

## Glossary Terms

### Adding Terms

```java
dataset.addTerm("urn:li:glossaryTerm:CustomerData");
dataset.addTerm("urn:li:glossaryTerm:Classification.Confidential");
```

### Removing Terms

```java
dataset.removeTerm("urn:li:glossaryTerm:CustomerData");
```

### Term Chaining

```java
dataset.addTerm("urn:li:glossaryTerm:CustomerData")
       .addTerm("urn:li:glossaryTerm:PII")
       .addTerm("urn:li:glossaryTerm:GDPR");
```

## Domain

### Setting Domain

```java
dataset.setDomain("urn:li:domain:Marketing");
```

### Removing Domain

```java
// Remove a specific domain
dataset.removeDomain("urn:li:domain:Marketing");

// Or clear all domains
dataset.clearDomains();
```

## Custom Properties

### Adding Individual Properties

```java
dataset.addCustomProperty("team", "data-engineering");
dataset.addCustomProperty("retention_days", "90");
dataset.addCustomProperty("cost_center", "12345");
```

### Setting All Properties

Replace all custom properties:

```java
Map<String, String> properties = new HashMap<>();
properties.put("team", "data-engineering");
properties.put("retention", "90_days");
properties.put("classification", "internal");

dataset.setCustomProperties(properties);
```

### Removing Properties

```java
dataset.removeCustomProperty("retention_days");
```

## Schema

### Setting Schema Metadata

```java
import com.linkedin.schema.*;

SchemaMetadata schema = new SchemaMetadata();
// Configure schema...
dataset.setSchema(schema);
```

### Setting Schema Fields

```java
import com.linkedin.schema.*;
import java.util.ArrayList;
import java.util.List;

List<SchemaField> fields = new ArrayList<>();

// String field
SchemaField userIdField = new SchemaField();
userIdField.setFieldPath("user_id");
userIdField.setNativeDataType("VARCHAR(255)");
userIdField.setType(
    new SchemaFieldDataType().setType(SchemaFieldDataType.Type.create(new StringType())));
fields.add(userIdField);

// Numeric field
SchemaField amountField = new SchemaField();
amountField.setFieldPath("amount");
amountField.setNativeDataType("DECIMAL(10,2)");
amountField.setType(
    new SchemaFieldDataType().setType(SchemaFieldDataType.Type.create(new NumberType())));
fields.add(amountField);

dataset.setSchemaFields(fields);
```

## Complete Example

```java
import datahub.client.v2.DataHubClientV2;
import datahub.client.v2.entity.Dataset;
import com.linkedin.common.OwnershipType;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

public class DatasetExample {
    public static void main(String[] args) {
        // Create client
        DataHubClientV2 client = DataHubClientV2.builder()
            .server("http://localhost:8080")
            .build();

        try {
            // Build dataset with all metadata
            Dataset dataset = Dataset.builder()
                .platform("snowflake")
                .name("analytics.public.user_events")
                .env("PROD")
                .description("User interaction events from web and mobile")
                .displayName("User Events")
                .build();

            // Add tags
            dataset.addTag("pii")
                   .addTag("analytics")
                   .addTag("gdpr");

            // Add owners
            dataset.addOwner("urn:li:corpuser:data_team", OwnershipType.TECHNICAL_OWNER)
                   .addOwner("urn:li:corpuser:product_team", OwnershipType.BUSINESS_OWNER);

            // Add glossary terms
            dataset.addTerm("urn:li:glossaryTerm:CustomerData")
                   .addTerm("urn:li:glossaryTerm:EventData");

            // Set domain
            dataset.setDomain("urn:li:domain:Analytics");

            // Add custom properties
            dataset.addCustomProperty("team", "data-engineering")
                   .addCustomProperty("retention_days", "365")
                   .addCustomProperty("refresh_schedule", "daily");

            // Upsert to DataHub
            client.entities().upsert(dataset);

            System.out.println("Successfully created dataset: " + dataset.getUrn());

        } catch (IOException | ExecutionException | InterruptedException e) {
            e.printStackTrace();
        } finally {
            try {
                client.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
```

## Updating Existing Datasets

### Load and Modify

```java
// Load existing dataset
DatasetUrn urn = new DatasetUrn("snowflake", "my_table", "PROD");
Dataset dataset = client.entities().get(urn);

// Add new metadata (creates patches)
dataset.addTag("new-tag")
       .addOwner("urn:li:corpuser:new_owner", OwnershipType.TECHNICAL_OWNER);

// Apply patches
client.entities().update(dataset);
```

### Incremental Updates

```java
// Just add what you need
dataset.addTag("sensitive");
client.entities().update(dataset);

// Later, add more
dataset.addCustomProperty("updated_at", String.valueOf(System.currentTimeMillis()));
client.entities().update(dataset);
```

## Builder Options Reference

| Method                     | Required | Description                                   |
| -------------------------- | -------- | --------------------------------------------- |
| `platform(String)`         | ✅ Yes   | Data platform (e.g., "snowflake", "bigquery") |
| `name(String)`             | ✅ Yes   | Fully qualified dataset name                  |
| `env(String)`              | No       | Environment (PROD, DEV, etc.). Default: PROD  |
| `platformInstance(String)` | No       | Platform instance identifier                  |
| `description(String)`      | No       | Dataset description                           |
| `displayName(String)`      | No       | Display name shown in UI                      |
| `customProperties(Map)`    | No       | Map of custom key-value properties            |

## Mode-Aware vs Explicit Methods

| Operation    | Mode-Aware Method  | SDK Mode Aspect             | INGESTION Mode Aspect |
| ------------ | ------------------ | --------------------------- | --------------------- |
| Description  | `setDescription()` | `editableDatasetProperties` | `datasetProperties`   |
| Display Name | `setDisplayName()` | `editableDatasetProperties` | `datasetProperties`   |

**Explicit methods** (always available):

- `setSystemDescription()` / `setEditableDescription()`
- `setSystemDisplayName()` / `setEditableDisplayName()`

## Common Patterns

### Creating Multiple Datasets

```java
for (String tableName : tableNames) {
    Dataset dataset = Dataset.builder()
        .platform("postgres")
        .name("public." + tableName)
        .env("PROD")
        .build();

    dataset.addTag("auto-generated")
           .addCustomProperty("created_by", "sync_job");

    client.entities().upsert(dataset);
}
```

### Batch Metadata Addition

```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

List<String> tags = Arrays.asList("pii", "sensitive", "gdpr");
tags.forEach(dataset::addTag);

client.entities().upsert(dataset); // Emits all tags in one call
```

### Conditional Metadata

```java
if (isPII(dataset)) {
    dataset.addTag("pii")
           .addTerm("urn:li:glossaryTerm:PersonalData");
}

if (requiresGovernance(dataset)) {
    dataset.addOwner("urn:li:corpuser:governance_team", OwnershipType.DATA_STEWARD);
}
```

## Next Steps

- **[Chart Entity](./chart-entity.md)** - Working with chart entities
- **[Patch Operations](./patch-operations.md)** - Deep dive into patches
- **[Migration Guide](./migration-from-v1.md)** - Upgrading from V1

## Examples

### Basic Dataset Creation

```java
{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetCreateExample.java show_path_as_comment }}
```

### Dataset Patch Operations

```java
{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetPatchExample.java show_path_as_comment }}
```

### Comprehensive Dataset Example

```java
{{ inline /metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetFullExample.java show_path_as_comment }}
```
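
### Naming Conventions Sketch (Plain Java)

Three conventions in this guide are easier to reason about when written out: the dataset URN pattern with its `PROD` default, the auto-prefixing of simple tag names, and the editable-over-system preference when reading a description. The sketch below restates them in plain Java with no SDK dependency. The class `DatasetConventions` and its methods are hypothetical helpers invented for illustration; they are not part of the DataHub SDK, and the SDK's internal handling may differ (for example, in validation or escaping).

```java
// Illustrative sketch only: hypothetical helpers that mirror the conventions
// described in this guide. None of these names exist in the DataHub SDK.
final class DatasetConventions {

    private static final String TAG_PREFIX = "urn:li:tag:";

    private DatasetConventions() {}

    /** Builds a dataset URN string; env falls back to PROD when null or blank. */
    static String datasetUrn(String platform, String name, String env) {
        String effectiveEnv = (env == null || env.trim().isEmpty()) ? "PROD" : env;
        return "urn:li:dataset:(urn:li:dataPlatform:" + platform + ","
            + name + "," + effectiveEnv + ")";
    }

    /** Returns the tag unchanged if it is already a full URN, otherwise prefixes it. */
    static String tagUrn(String tag) {
        return tag.startsWith(TAG_PREFIX) ? tag : TAG_PREFIX + tag;
    }

    /** Prefers the editable description and falls back to the system one. */
    static String effectiveDescription(String editable, String system) {
        return editable != null ? editable : system;
    }

    public static void main(String[] args) {
        // prints urn:li:dataset:(urn:li:dataPlatform:snowflake,my_table,PROD)
        System.out.println(datasetUrn("snowflake", "my_table", null));
        // prints urn:li:tag:pii
        System.out.println(tagUrn("pii"));
        // prints urn:li:tag:analytics
        System.out.println(tagUrn("urn:li:tag:analytics"));
        // prints Ingested from Snowflake
        System.out.println(effectiveDescription(null, "Ingested from Snowflake"));
    }
}
```

In real code, prefer `Dataset.builder()`, `addTag()`, and `getDescription()` as shown earlier in this guide; the helpers above only make the documented string shapes and fallback order explicit.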