datahub/docs/developers/java-sdk-v2-design.md


# DataHub Java SDK V2 Design Document
## Executive Summary
This document describes the design of DataHub Java SDK V2, a modern, user-friendly Java client library that brings the core entity-management capabilities of the Python SDK V2 to Java. The new SDK addresses feedback from enterprise Java customers who need a first-class SDK experience comparable to the one Python developers already have.
This document is organized into two main sections:
- **Part 1 - User-Facing API Design**: The public API, patterns, and behaviors visible to SDK users
- **Part 2 - Developer-Facing Implementation**: Internal architecture and implementation details for contributors
> **Why Hand-Crafted?** For a deep dive into why we chose to hand-craft this SDK instead of using OpenAPI code generation, see [Java SDK V2 Philosophy](java-sdk-v2-philosophy.md).
## Background
### Problem Statement
Currently, DataHub's Java SDK (`datahub-client`) provides only low-level emission capabilities:
- Manual MCP (Metadata Change Proposal) construction required
- No high-level entity builders for Dataset, Chart, Dashboard, etc.
- No client for CRUD operations (read, update, delete)
- No patch capabilities for granular updates
- Significantly inferior developer experience compared to Python SDK V2
This gap has created issues with enterprise customers, particularly Java shops who feel like "second-class citizens" when compared to Python developers.
### Goals
1. **Feature Parity**: Match Python SDK V2 capabilities for core entity management (full parity is listed under Non-Goals)
2. **Backward Compatibility**: Maintain 100% compatibility with existing Java SDK
3. **Namespace Separation**: Use `datahub.client.v2.*` namespace for new APIs
4. **Builder Pattern**: Fluent, type-safe API for entity construction
5. **Patch Support**: Granular updates without full entity replacement
6. **CRUD Operations**: Support create, read, update, upsert operations (delete/exists deferred)
7. **Comprehensive Testing**: Unit and integration tests validating all functionality
### Non-Goals
- Rewriting existing emitter infrastructure (leverage existing)
- 100% feature parity with Python SDK (focus on core entities first)
- GraphQL client implementation (focus on REST/OpenAPI)
- Search client (future enhancement)
- Lineage client (future enhancement)
---
# Part 1: User-Facing API Design
This section describes the public API that SDK users interact with - the patterns, behaviors, and interfaces that define the developer experience.
## Design Principles
### 1. Fluent Builder Pattern
Intuitive entity construction through method chaining:
```java
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_table")
.env("PROD")
.description("My dataset")
.build();
// Fluent metadata operations with type-safe method chaining
dataset.addTag("pii")
.addOwner("urn:li:corpuser:jdoe", OwnershipType.TECHNICAL_OWNER)
.setDomain("urn:li:domain:Analytics")
.setStructuredProperty("io.acryl.dataQuality.qualityScore", 95.5);
client.entities().upsert(dataset);
```
### 2. Type Safety and Compile-Time Checking
Leverage Java's strong typing:
- Strongly-typed URNs (`DatasetUrn`, `ChartUrn`, etc.)
- Generic types for entity operations
- CRTP (Curiously Recurring Template Pattern) for type-safe mixin interfaces
- Builder validation at construction time
### 3. Mode-Aware Behavior
**SDK Mode vs INGESTION Mode** for proper separation of concerns:
- **SDK Mode (default)**: User edits → `editableDatasetProperties`
- **INGESTION Mode**: Pipeline writes → `datasetProperties`
- Getters read the editable aspect first and fall back to the system aspect when no editable value exists
```java
// SDK mode - user edits go to editable aspects
DataHubClientV2 client = DataHubClientV2.builder()
.server("http://localhost:8080")
.mode(OperationMode.SDK) // Default
.build();
// INGESTION mode - pipeline writes go to system aspects
DataHubClientV2 ingestionClient = DataHubClientV2.builder()
.server("http://localhost:8080")
.mode(OperationMode.INGESTION)
.build();
```
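The routing rule above can be sketched in a few lines. This is a minimal, self-contained illustration rather than the SDK's code: `ModeRoutingSketch`, its `Mode` enum, and the hard-coded aspect-name table are hypothetical stand-ins for `OperationMode` and whatever lookup the SDK actually performs.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of mode-aware aspect routing; not the SDK's actual classes. */
final class ModeRoutingSketch {
    /** Stand-in for the SDK's OperationMode. */
    enum Mode { SDK, INGESTION }

    // Assumed mapping from a system aspect to its editable counterpart.
    private static final Map<String, String> EDITABLE_VARIANTS =
        Map.of("datasetProperties", "editableDatasetProperties");

    /** Writes: SDK mode targets the editable aspect when one exists. */
    static String writeTarget(String systemAspect, Mode mode) {
        if (mode == Mode.SDK) {
            return EDITABLE_VARIANTS.getOrDefault(systemAspect, systemAspect);
        }
        return systemAspect;
    }

    /** Reads: prefer the editable value, fall back to the system value. */
    static Optional<String> readDescription(Map<String, String> storedAspects) {
        String editable = storedAspects.get("editableDatasetProperties");
        if (editable != null) {
            return Optional.of(editable);
        }
        return Optional.ofNullable(storedAspects.get("datasetProperties"));
    }
}
```

In this sketch an aspect with no editable variant is written to the same place in both modes, which is one plausible reading of the rule above.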
### 4. Patch-First Philosophy
**Design Decision: Prioritize patches over full aspect replacement**
The SDK V2 is designed around patch-based operations because they represent the most common and intuitive way to make metadata changes:
```java
Dataset dataset = client.entities().get(datasetUrn, Dataset.class);
Dataset mutable = dataset.mutable(); // Get mutable copy
// These create patches internally - no server calls yet
mutable.addTag("pii")
.addTag("sensitive")
.addOwner(ownerUrn, OwnershipType.TECHNICAL_OWNER);
// Single call emits all accumulated patches atomically
client.entities().update(mutable);
```
**Why patches?**
- **Simplicity**: Users think "add a tag" not "fetch all tags, add one, PUT entire tag aspect back"
- **Safety**: Patches don't overwrite concurrent changes from other users
- **Efficiency**: Only changed fields are transmitted and processed
- **Common use case**: Most metadata operations are incremental additions/removals
**When to use low-level SDK:**
If you need to completely replace an aspect (full PUT/upsert semantics), use the V1 SDK's `RestEmitter` directly with `MetadataChangeProposalWrapper`. The V2 SDK focuses on making common operations simple, not exposing every low-level primitive.
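To make the safety argument concrete, the following sketch contrasts read-modify-write replacement with patching when two writers touch the same tag aspect. `TagStoreSketch` is a hypothetical in-memory stand-in for the server; it does not model DataHub's real storage or patch format.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical in-memory stand-in for a server-side tag aspect. */
final class TagStoreSketch {
    private final Set<String> tags = new LinkedHashSet<>();

    /** Full replacement: the caller's snapshot overwrites whatever is stored. */
    void replaceAll(Set<String> snapshot) {
        tags.clear();
        tags.addAll(snapshot);
    }

    /** Patch: only the named tag is added; other tags are left untouched. */
    void patchAdd(String tag) {
        tags.add(tag);
    }

    Set<String> read() {
        return new LinkedHashSet<>(tags);
    }

    /** Two writers using read-modify-write: the second write drops the first one's tag. */
    static Set<String> lostUpdateWithReplace() {
        TagStoreSketch store = new TagStoreSketch();
        Set<String> writerA = store.read();   // both writers read the same (empty) state
        Set<String> writerB = store.read();
        writerA.add("pii");
        writerB.add("sensitive");
        store.replaceAll(writerA);
        store.replaceAll(writerB);            // overwrites writer A's change
        return store.read();
    }

    /** The same two writers using patches: both tags survive. */
    static Set<String> bothKeptWithPatch() {
        TagStoreSketch store = new TagStoreSketch();
        store.patchAdd("pii");
        store.patchAdd("sensitive");
        return store.read();
    }
}
```

With replacement, only `sensitive` remains because writer B's snapshot never saw `pii`; with patches, both tags are present.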
### 5. Composition Through Mixin Interfaces
Shared metadata operations via type-safe mixin interfaces:
- `HasTags<T>` - Add, remove, set tags
- `HasOwners<T>` - Manage ownership
- `HasGlossaryTerms<T>` - Associate glossary terms
- `HasDomains<T>` - Domain assignment
- `HasContainer<T>` - Parent-child hierarchies
- `HasStructuredProperties<T>` - Custom typed properties
All mixins use the CRTP so that chained calls return the concrete entity type.
## Architecture
### Package Structure (Actual Implementation)
```
datahub-client/
├── src/main/java/
│ ├── datahub/client/ # Existing v1 (unchanged)
│ │ ├── Emitter.java
│ │ ├── rest/RestEmitter.java
│ │ └── ...
│ │
│ └── datahub/client/v2/ # New v2 namespace
│ ├── DataHubClientV2.java # Main client entry point
│ │
│ ├── entity/ # Entity classes
│ │ ├── Entity.java # Base entity class (490 lines)
│ │ ├── AspectCache.java # Unified cache with dirty tracking (184 lines)
│ │ ├── CachedAspect.java # Aspect wrapper with metadata (68 lines)
│ │ ├── AspectSource.java # SERVER vs LOCAL enum (23 lines)
│ │ ├── ReadMode.java # ALLOW_DIRTY vs SERVER_ONLY (28 lines)
│ │ ├── Dataset.java # Dataset entity (564 lines)
│ │ ├── Chart.java # Chart entity (587 lines)
│ │ ├── Dashboard.java # Dashboard entity (671 lines)
│ │ ├── DataJob.java # DataJob entity (597 lines)
│ │ ├── DataFlow.java # DataFlow entity (467 lines)
│ │ ├── Container.java # Container entity (500 lines)
│ │ ├── MLModel.java # ML Model entity NEW
│ │ ├── MLModelGroup.java # ML Model Group entity NEW
│ │ ├── HasTags.java # Tag operations mixin
│ │ ├── HasOwners.java # Ownership operations mixin
│ │ ├── HasGlossaryTerms.java # Terms operations mixin
│ │ ├── HasDomains.java # Domain operations mixin
│ │ ├── HasContainer.java # Container hierarchy mixin
│ │ └── HasStructuredProperties.java # Structured properties mixin
│ │
│ ├── operations/ # CRUD operation clients
│ │ └── EntityClient.java # Entity CRUD operations (570 lines)
│ │
│ └── config/ # Configuration
│ └── DataHubClientConfigV2.java # Config with mode support
└── src/test/java/ # Tests mirror structure
└── datahub/client/v2/
├── DataHubClientV2Test.java # Client tests
├── entity/ # 378 unit tests
│ ├── AspectCacheTest.java # 30 tests (cache infrastructure)
│ ├── CachedAspectTest.java # 13 tests (cache infrastructure)
│ ├── DatasetTest.java # 37 tests
│ ├── ChartTest.java # 43 tests
│ ├── DashboardTest.java # 52 tests
│ ├── DataJobTest.java # 45 tests
│ ├── DataFlowTest.java # 40 tests
│ ├── ContainerTest.java # 40 tests
│ ├── MLModelTest.java # 44 tests
│ └── MLModelGroupTest.java # 38 tests
└── integration/ # 79 integration tests
├── DatasetIntegrationTest.java
├── ChartIntegrationTest.java
├── DashboardIntegrationTest.java
├── DataJobIntegrationTest.java
├── DataFlowIntegrationTest.java
├── ContainerIntegrationTest.java
├── MLModelIntegrationTest.java
└── MLModelGroupIntegrationTest.java
```
**Key Design Decisions:**
- No separate `patch/` package - patches accumulate internally within entities
- Mixin interfaces in `entity/` package using CRTP pattern for type safety
- Support for 8 entity types including ML entities (MLModel, MLModelGroup)
- Mode-aware configuration for SDK vs INGESTION behavior
### Core Classes
#### 1. DataHubClientV2 (Main Entry Point)
**File**: `metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/DataHubClientV2.java` (266 lines)
```java
package datahub.client.v2;
/**
* Main entry point for DataHub Java SDK V2.
* Provides high-level operations for entity management with mode-aware behavior.
*
* <p>Example usage:
* <pre>
* DataHubClientV2 client = DataHubClientV2.builder()
* .server("http://localhost:8080")
* .token("my-token")
* .mode(OperationMode.SDK) // SDK or INGESTION mode
* .build();
*
* Dataset dataset = Dataset.builder()
* .platform("snowflake")
* .name("my_table")
* .env("PROD")
* .description("My dataset")
* .build();
*
* client.entities().upsert(dataset);
* </pre>
*/
public class DataHubClientV2 implements AutoCloseable {
private final RestEmitter emitter;
private final DataHubClientConfigV2 config;
private final EntityClient entityClient;
// Builder for client configuration
public static Builder builder() { ... }
// Entity operations
public EntityClient entities() { return entityClient; }
// Low-level emitter access (for advanced users)
public RestEmitter emitter() { return emitter; }
// Configuration access
public DataHubClientConfigV2 config() { return config; }
@Override
public void close() throws IOException { ... }
public static class Builder {
public Builder server(String serverUrl) { ... }
public Builder token(String token) { ... }
public Builder timeout(int timeoutMs) { ... }
public Builder mode(OperationMode mode) { ... } // NEW
public Builder config(DataHubClientConfigV2 config) { ... }
public DataHubClientV2 build() { ... }
}
}
```
**Design Features:**
- Mode-aware behavior (SDK vs INGESTION) for proper aspect routing
- Environment variable support for configuration
- Builder pattern with sensible defaults
- AutoCloseable interface for resource management
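A minimal sketch of how these features can fit together is shown below. It is an illustration under assumptions, not `DataHubClientV2` itself: the class `ClientSketch`, the 10-second default timeout, and the use of `DATAHUB_SERVER` / `DATAHUB_TOKEN` as fallback variable names are all hypothetical. The environment lookup is injected as a function so the fallback can be exercised without touching the real process environment.

```java
import java.util.function.Function;

/** Hypothetical sketch of a builder with environment fallbacks; not DataHubClientV2 itself. */
final class ClientSketch implements AutoCloseable {
    final String server;
    final String token;      // may be null when no token is configured
    final int timeoutMs;
    private boolean closed;

    private ClientSketch(String server, String token, int timeoutMs) {
        this.server = server;
        this.token = token;
        this.timeoutMs = timeoutMs;
    }

    boolean isClosed() { return closed; }

    @Override
    public void close() { closed = true; }   // a real client would release HTTP resources here

    static Builder builder(Function<String, String> env) { return new Builder(env); }

    static final class Builder {
        private final Function<String, String> env;
        private String server;
        private String token;
        private int timeoutMs = 10_000;       // assumed default, for illustration only

        private Builder(Function<String, String> env) { this.env = env; }

        Builder server(String value) { this.server = value; return this; }
        Builder token(String value) { this.token = value; return this; }
        Builder timeout(int value) { this.timeoutMs = value; return this; }

        ClientSketch build() {
            // Explicit settings win; otherwise fall back to the (assumed) variable names.
            String resolvedServer = server != null ? server : env.apply("DATAHUB_SERVER");
            String resolvedToken = token != null ? token : env.apply("DATAHUB_TOKEN");
            if (resolvedServer == null || resolvedServer.isEmpty()) {
                throw new IllegalStateException("server URL is required");
            }
            return new ClientSketch(resolvedServer, resolvedToken, timeoutMs);
        }
    }
}
```

Production code would pass `System::getenv` as the lookup and wrap the client in try-with-resources so `close()` always runs; tests can pass `someMap::get` instead.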
#### 2. Entity (Base Class) - User-Facing API
**File**: `metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Entity.java` (490 lines)
The Entity base class provides a unified interface for all DataHub entities. From a user perspective, all entities support:
**Public API Methods:**
```java
// URN access
public Urn getUrn()
public abstract String getEntityType()
// Convert to MCPs for emission (primarily internal)
public List<MetadataChangeProposalWrapper> toMCPs()
```
**Entity Construction:**
Entities are constructed via fluent builders:
```java
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_table")
.env("PROD")
.description("My dataset")
.build();
```
**Fluent Metadata Operations:**
All entities support method chaining for metadata operations (via mixin interfaces):
```java
dataset.addTag("pii")
.addOwner(ownerUrn, OwnershipType.TECHNICAL_OWNER)
.setDomain(domainUrn)
.addTerm(termUrn);
```
**Lazy Loading:**
Entities loaded from the server fetch aspects on-demand:
```java
Dataset dataset = client.entities().get(datasetUrn, Dataset.class); // Only URN loaded
String description = dataset.getDescription(); // Aspect fetched now
List<String> tags = dataset.getTags(); // Another aspect fetch
```
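The on-demand behavior can be modeled with a small memoizing wrapper. This is a hedged sketch: `LazyEntitySketch` and its string-valued aspects are hypothetical, the `fetcher` function stands in for a server round trip, and the cache expiry described in Part 2 is deliberately left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of on-demand aspect loading; the fetcher stands in for a server call. */
final class LazyEntitySketch {
    private final Function<String, String> fetcher;
    private final Map<String, String> loaded = new HashMap<>();
    private int fetchCount;

    LazyEntitySketch(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    /** Fetches the aspect on first access and serves later reads from memory. */
    String getAspect(String aspectName) {
        if (!loaded.containsKey(aspectName)) {
            fetchCount++;
            loaded.put(aspectName, fetcher.apply(aspectName));
        }
        return loaded.get(aspectName);
    }

    int fetchCount() { return fetchCount; }
}
```

Reading the same aspect twice triggers one fetch in this model, while reading a second aspect triggers another, which mirrors the two fetches in the example above.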
**Patch Accumulation:**
Metadata operations create patches that accumulate until save:
```java
Dataset dataset = client.entities().get(datasetUrn, Dataset.class);
Dataset mutable = dataset.mutable(); // Get mutable copy
mutable.addTag("pii"); // Creates patch (not sent yet)
mutable.addTag("sensitive"); // Another patch (not sent yet)
client.entities().update(mutable); // Emits all patches atomically
```
**Immutability-by-Default:**
Entities fetched from the server are read-only to prevent accidental mutations:
```java
Dataset dataset = client.entities().get(datasetUrn, Dataset.class);
dataset.isReadOnly(); // true
dataset.isMutable(); // false
// Attempting mutation throws ReadOnlyEntityException
// dataset.addTag("pii"); // ERROR!
// Get mutable copy for updates
Dataset mutable = dataset.mutable();
mutable.isMutable(); // true
mutable.addTag("pii"); // Works
client.entities().upsert(mutable);
```
**Entity Lifecycle:**
1. **Builder-created entities** - Mutable from creation
```java
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_table")
.build();
dataset.isMutable(); // true - can mutate immediately
```
2. **Server-fetched entities** - Immutable by default
```java
Dataset dataset = client.entities().get(urn, Dataset.class);
dataset.isReadOnly(); // true - must call .mutable()
```
3. **Mutable copies** - Created via `.mutable()`
```java
Dataset mutable = dataset.mutable();
mutable.isMutable(); // true - can mutate
```
**The .mutable() method:**
- Creates a shallow copy with independent mutability flags
- Shares aspect cache with original (read-your-own-writes semantics)
- Idempotent - returns self if already mutable
- Original entity remains read-only after creating mutable copy
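The four properties listed above can be captured in a short model. This is a sketch under assumptions: `ReadOnlySketch` is hypothetical, a shared `Set` of tags stands in for the shared aspect cache, and `IllegalStateException` is used where the SDK is described as throwing `ReadOnlyEntityException`.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of read-only entities with mutable copies; not the SDK's Entity class. */
final class ReadOnlySketch {
    private final Set<String> sharedTags;   // stands in for the shared aspect cache
    private final boolean readOnly;

    private ReadOnlySketch(Set<String> sharedTags, boolean readOnly) {
        this.sharedTags = sharedTags;
        this.readOnly = readOnly;
    }

    /** Mimics an entity returned by a server fetch: read-only. */
    static ReadOnlySketch fetched() {
        return new ReadOnlySketch(new HashSet<>(), true);
    }

    boolean isReadOnly() { return readOnly; }
    boolean isMutable() { return !readOnly; }

    /** Returns this instance if already mutable, else a mutable view over the same state. */
    ReadOnlySketch mutable() {
        return readOnly ? new ReadOnlySketch(sharedTags, false) : this;
    }

    ReadOnlySketch addTag(String tag) {
        if (readOnly) {
            // The SDK is described as throwing ReadOnlyEntityException here.
            throw new IllegalStateException("entity is read-only; call mutable() first");
        }
        sharedTags.add(tag);
        return this;
    }

    boolean hasTag(String tag) { return sharedTags.contains(tag); }
}
```

Because the copy shares state with the original, a tag added through the mutable copy is visible when reading through the original, even though the original still rejects writes.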
**Why immutability-by-default?**
- Makes mutations explicit and intentional
- Prevents accidental modification when passing entities between functions
- Clear separation between read and write workflows
- Enables safe entity sharing across threads
- Common pattern in modern APIs (Rust, Python, Java immutable collections)
See "Part 2: Developer-Facing Implementation Design" below for internal architecture details.
#### 3. Supported Entities
The SDK V2 implements 8 entity types with full metadata support:
**Data Entities:**
- **Dataset** - Tables, views, files with schema support
- **Container** - Databases, schemas, folders (hierarchical structures)
**Pipeline Entities:**
- **DataFlow** - Pipelines, workflows (Airflow DAGs, Spark jobs, dbt projects)
- **DataJob** - Individual tasks with inlet/outlet lineage
**Visualization Entities:**
- **Chart** - Visualizations with input dataset lineage
- **Dashboard** - Dashboards with chart relationships and input datasets
**ML Entities:**
- **MLModel** - Machine learning models with metrics, hyperparameters, training jobs
- **MLModelGroup** - Model families with version management
**Common Entity Operations:**
All entities support these fluent operations (via mixin interfaces):
```java
// Tags
entity.addTag("pii")
      .removeTag("deprecated")
      .setTags(Arrays.asList("tag1", "tag2"))
      .clearTags();
// Owners
entity.addOwner(ownerUrn, OwnershipType.TECHNICAL_OWNER)
      .removeOwner(ownerUrn)
      .setOwners(ownerList)
      .clearOwners();
// Glossary Terms
entity.addTerm(termUrn)
      .removeTerm(termUrn)
      .setTerms(termList)
      .clearTerms();
// Domains
entity.setDomain(domainUrn)
      .removeDomain(domainUrn)
      .clearDomains();
// Container (for hierarchical entities)
entity.setContainer(containerUrn)
      .clearContainer();
// Structured Properties (custom typed metadata)
entity.setStructuredProperty("io.acryl.dataManagement.replicationSLA", "24h")
      .setStructuredProperty("io.acryl.dataQuality.qualityScore", 95.5)
      .setStructuredProperty("io.acryl.dataManagement.certifications",
          Arrays.asList("SOC2", "HIPAA", "GDPR"))
      .setStructuredProperty("io.acryl.privacy.retentionDays", 90, 180, 365)
      .removeStructuredProperty("io.acryl.dataManagement.deprecated");
```
**Entity-Specific Documentation:**
See comprehensive guides in `metadata-integration/java/docs/sdk-v2/`:
- `dataset-entity.md` - Dataset with schema support
- `chart-entity.md` - Chart with lineage
- `dashboard-entity.md` - Dashboard with chart relationships
- `container-entity.md` - Container hierarchies
- `dataflow-entity.md` - DataFlow pipelines
- `datajob-entity.md` - DataJob with inlet/outlet lineage
- `mlmodel-entity.md` - MLModel with metrics
- `mlmodelgroup-entity.md` - MLModelGroup with versions
#### 4. EntityClient (CRUD Operations)
**File**: `metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/operations/EntityClient.java` (570 lines)
```java
package datahub.client.v2.operations;
/**
* Client for entity CRUD operations.
* Provides create, read, update, and upsert operations.
*/
public class EntityClient {
private final RestEmitter emitter;
private final DataHubClientConfigV2 config;
/**
* Create a new entity (convenience method - same as upsert).
*/
public <T extends Entity> void create(T entity) throws IOException, ExecutionException, InterruptedException {
upsert(entity);
}
/**
* Upsert an entity (create or update).
* Emits all aspects and accumulated patches.
*/
public <T extends Entity> void upsert(T entity) throws IOException, ExecutionException, InterruptedException {
List<MetadataChangeProposalWrapper> mcps = entity.toMCPs();
// Emit all MCPs asynchronously and wait for completion
// ...
}
/**
* Update an existing entity.
* Emits only accumulated patches (not full aspects).
*/
public <T extends Entity> void update(T entity) throws IOException, ExecutionException, InterruptedException {
// Emit only pending patches
// ...
}
/**
* Get an entity by URN.
* Returns entity with lazy-loaded aspects.
*/
public <T extends Entity> T get(Urn urn, Class<T> entityClass) throws IOException {
// Fetch entity aspects from server
// Construct entity with lazy loading support
// ...
}
// Note: delete(Urn) and exists(Urn) operations deferred to future releases
}
```
**Supported Operations:**
- `create()` - Create new entities (wrapper for upsert)
- `upsert()` - Create or update entities (emits all aspects + patches)
- `update()` - Update existing entities (emits only patches)
- `get()` - Retrieve entities with lazy loading
- `delete()` and `exists()` - Deferred to future releases
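The `upsert()` comment in the listing above describes emitting MCPs asynchronously and then waiting for completion. One way to layer a blocking call over an asynchronous emitter is sketched below; `BlockingEmitSketch`, the string-valued "MCPs", and the `Boolean` acknowledgement are hypothetical simplifications, not the actual `EntityClient` internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch of a blocking call layered over an asynchronous emitter. */
final class BlockingEmitSketch {
    /** Submits every item first, then waits for each result; returns the number acknowledged. */
    static int emitAll(List<String> mcps, Function<String, Future<Boolean>> asyncEmit)
            throws ExecutionException, InterruptedException {
        List<Future<Boolean>> inFlight = new ArrayList<>();
        for (String mcp : mcps) {
            inFlight.add(asyncEmit.apply(mcp));   // does not block
        }
        int acknowledged = 0;
        for (Future<Boolean> future : inFlight) {
            if (future.get()) {                   // blocks until this emission completes
                acknowledged++;
            }
        }
        return acknowledged;
    }
}
```

Submitting everything before the first `get()` lets the emissions overlap, while the caller still sees a simple synchronous method that surfaces failures as exceptions.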
**Patch Behavior:**
Patches are accumulated **inside entities** during metadata operations and emitted automatically during `upsert()`/`update()`:
```java
Dataset dataset = client.entities().get(datasetUrn, Dataset.class);
Dataset mutable = dataset.mutable(); // Get mutable copy
mutable.addTag("pii"); // Creates internal patch
mutable.addTag("sensitive"); // Creates another internal patch
client.entities().update(mutable); // Emits both patches atomically
```
There is **no separate `patch()` method** - patches are managed internally by entities.
#### 5. Mixin Interfaces (CRTP Pattern)
**Files**: `metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Has*.java`
Mixin interfaces provide reusable metadata operations across entities using the **Curiously Recurring Template Pattern (CRTP)** for type-safe method chaining:
```java
/**
* Interface for entities that support tags.
* Uses CRTP for type-safe method chaining.
*/
public interface HasTags<T extends Entity & HasTags<T>> {
/**
* Add a tag to this entity.
* Creates a patch that will be emitted on save.
*/
default T addTag(@Nonnull String tagUrn) {
// Implementation creates patch internally
return (T) this;
}
default T removeTag(@Nonnull String tagUrn) { ... }
default T setTags(@Nonnull List<String> tagUrns) { ... }
default T clearTags() { ... }
// Getter methods
default List<String> getTags() { ... }
}
```
**Available Mixin Interfaces:**
1. **`HasTags<T>`** - Tag operations (`addTag`, `removeTag`, `setTags`, `clearTags`)
2. **`HasOwners<T>`** - Ownership operations (`addOwner`, `removeOwner`, `setOwners`, `clearOwners`)
3. **`HasGlossaryTerms<T>`** - Glossary term operations (`addTerm`, `removeTerm`, `setTerms`, `clearTerms`)
4. **`HasDomains<T>`** - Domain operations (`setDomain`, `removeDomain`, `clearDomains`)
5. **`HasContainer<T>`** - Container hierarchy (`setContainer`, `clearContainer`)
6. **`HasStructuredProperties<T>`** - Structured properties operations (`setStructuredProperty`, `removeStructuredProperty`)
**Why CRTP?**
The CRTP pattern enables type-safe method chaining that returns the concrete entity type:
```java
// Without CRTP: chained calls return the base Entity type
Entity entity = dataset.addTag("pii"); // Loses Dataset type!
// With CRTP: chained calls return Dataset
Dataset sameDataset = dataset.addTag("pii")
    .addOwner(ownerUrn, type) // Still Dataset type!
    .setDomain(domainUrn);    // Still Dataset type!
```
**Entity Implementations:**
Entities implement mixin interfaces by declaring them in the class signature:
```java
public class Dataset extends Entity
implements HasTags<Dataset>,
HasOwners<Dataset>,
HasGlossaryTerms<Dataset>,
HasDomains<Dataset>,
HasContainer<Dataset>,
HasStructuredProperties<Dataset> {
// Mixin methods provided by default implementations
}
```
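For readers who want to see the pattern compile end to end, here is a self-contained miniature. All names (`BaseSketch`, `TaggableSketch`, `OwnableSketch`, `DatasetSketch`) are hypothetical; the real mixins create patches rather than appending to lists.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical base type; stands in for the SDK's Entity. */
abstract class BaseSketch {
    final List<String> tags = new ArrayList<>();
    final List<String> owners = new ArrayList<>();
}

/** Self-bounded mixin: T is the concrete entity type that implements the interface. */
interface TaggableSketch<T extends BaseSketch & TaggableSketch<T>> {
    @SuppressWarnings("unchecked")
    default T addTag(String tag) {
        T self = (T) this;        // safe as long as implementors pass themselves as T
        self.tags.add(tag);
        return self;
    }
}

/** A second mixin with the same shape, to show that chains can cross interfaces. */
interface OwnableSketch<T extends BaseSketch & OwnableSketch<T>> {
    @SuppressWarnings("unchecked")
    default T addOwner(String owner) {
        T self = (T) this;
        self.owners.add(owner);
        return self;
    }
}

final class DatasetSketch extends BaseSketch
        implements TaggableSketch<DatasetSketch>, OwnableSketch<DatasetSketch> {
    /** A method that exists only on the concrete type. */
    String name() { return "my_table"; }
}
```

A chain such as `new DatasetSketch().addTag("pii").addOwner("urn:li:corpuser:jdoe").name()` compiles because every mixin call returns `DatasetSketch`. The unchecked cast is the usual cost of this idiom in Java: the compiler cannot prove that an implementor passed its own type as `T`.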
---
# Part 2: Developer-Facing Implementation Design
This section describes the internal architecture and implementation details for developers contributing to the SDK.
## Internal Architecture
### Entity Base Class - Internal Implementation
**File**: `metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Entity.java` (490 lines)
The Entity base class implements three core subsystems:
#### 1. AspectCache System with Read-Your-Own-Writes
**Unified Cache Architecture**: The SDK uses a unified `AspectCache` that provides read-your-own-writes semantics with proper dirty tracking. This architecture fixes bugs where fetched aspects would override patches.
**Core Implementation Files:**
- `AspectCache.java` (184 lines) - Main cache with dirty tracking
- `CachedAspect.java` (68 lines) - Aspect wrapper with metadata
- `AspectSource.java` (23 lines) - Enum for SERVER vs LOCAL aspects
- `ReadMode.java` (28 lines) - Enum for ALLOW_DIRTY vs SERVER_ONLY reads
**Key Architectural Features:**
1. **AspectSource Tracking**: Distinguishes between SERVER-fetched aspects (subject to TTL) and LOCAL-created aspects (no expiration)
2. **Dirty Tracking**: Explicit marking of aspects that need write-back to server via `markDirty()` method
3. **Read-Your-Own-Writes**: Default `ReadMode.ALLOW_DIRTY` returns local modifications immediately, `SERVER_ONLY` mode skips dirty aspects
4. **TTL Management**: 60-second TTL enforced only for SERVER-sourced aspects, LOCAL aspects never expire
5. **Thread Safety**: Uses `ConcurrentHashMap` for safe concurrent access
**Internal State (Entity.java):**
```java
protected final AspectCache cache; // Unified cache with dirty tracking
protected final Map<String, List<MetadataChangeProposal>> pendingPatches;
private DataHubClientV2 boundClient = null;
```
**Cache Operations:**
- `getAspectLazy()` - Lazy loads from server, stores as clean SERVER-sourced aspect
- `getOrCreateAspect()` - Gets from cache or creates new LOCAL-sourced aspect (marked dirty)
- `markAspectDirty()` - Marks aspect dirty after in-place modification (used by domain operations)
- `toMCPs()` - Returns **only dirty aspects** for emission (excludes clean fetched aspects)
**Why This Architecture?**
The unified cache solves a critical bug: when entities are fetched from the server and then patch operations are applied (e.g., `removeTerm()`), the cached aspect would be included in `toMCPs()` and override the patches. With dirty tracking, `toMCPs()` only returns modified aspects, allowing patches to work correctly.
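The interaction between source tracking, dirty tracking, read modes, and the TTL can be modeled compactly. The sketch below is a simplified, hypothetical model (`AspectCacheSketch` with string values and an injected clock), not the SDK's `AspectCache`. One assumption worth flagging: dirty entries are never expired here, regardless of source, because expiring them would discard unsent changes.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical, simplified model of a dirty-tracking cache; not the SDK's AspectCache. */
final class AspectCacheSketch {
    enum Source { SERVER, LOCAL }
    enum ReadMode { ALLOW_DIRTY, SERVER_ONLY }

    private static final class Entry {
        final String value;
        final Source source;
        final long storedAtMs;
        volatile boolean dirty;

        Entry(String value, Source source, long storedAtMs, boolean dirty) {
            this.value = value;
            this.source = source;
            this.storedAtMs = storedAtMs;
            this.dirty = dirty;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlMs;
    private final LongSupplier clock; // injected so expiry can be tested without sleeping

    AspectCacheSketch(long ttlMs, LongSupplier clock) {
        this.ttlMs = ttlMs;
        this.clock = clock;
    }

    /** Stores a fetched aspect as clean and subject to the TTL. */
    void putFromServer(String name, String value) {
        entries.put(name, new Entry(value, Source.SERVER, clock.getAsLong(), false));
    }

    /** Stores a locally created aspect as dirty; it never expires. */
    void putLocal(String name, String value) {
        entries.put(name, new Entry(value, Source.LOCAL, clock.getAsLong(), true));
    }

    void markDirty(String name) {
        Entry entry = entries.get(name);
        if (entry != null) {
            entry.dirty = true;
        }
    }

    /** Returns null on a miss, an expired clean SERVER entry, or SERVER_ONLY meeting a dirty entry. */
    String get(String name, ReadMode mode) {
        Entry entry = entries.get(name);
        if (entry == null) {
            return null;
        }
        if (entry.dirty) {
            return mode == ReadMode.ALLOW_DIRTY ? entry.value : null;
        }
        boolean expired = entry.source == Source.SERVER
            && clock.getAsLong() - entry.storedAtMs > ttlMs;
        return expired ? null : entry.value;
    }

    /** Only dirty entries are candidates for write-back. */
    Map<String, String> dirtyAspects() {
        Map<String, String> dirty = new TreeMap<>();
        entries.forEach((name, entry) -> {
            if (entry.dirty) {
                dirty.put(name, entry.value);
            }
        });
        return dirty;
    }
}
```

In this model a fetched aspect that is never modified stays out of `dirtyAspects()`, which is the property that keeps it from overriding patches at emission time.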
#### 2. Patch Accumulation and MCP Generation
Metadata operations create patches that accumulate until emission. The system supports two types of operations:
**Patch-Based Operations** (incremental updates):
- Tags, owners, glossary terms use `PatchBuilder` classes
- Patches accumulate in `pendingPatches` map (aspect name → list of patches)
- Multiple operations on same aspect create multiple patches
**Cache-Based Operations** (full aspect replacement):
- Domains, custom properties modify aspects in cache
- Aspects marked dirty via `markAspectDirty()` after modification
- Dirty aspects included in `toMCPs()` output
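The patch-based half of this split can be pictured as a map from aspect name to an ordered list of pending operations. The sketch below is illustrative only: `PendingPatchesSketch` is hypothetical and represents each patch as a plain string rather than a `MetadataChangeProposal`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of per-aspect patch accumulation; patches are plain strings here. */
final class PendingPatchesSketch {
    private final Map<String, List<String>> pendingPatches = new LinkedHashMap<>();

    /** Records one operation against an aspect without sending anything. */
    void record(String aspectName, String patchOp) {
        pendingPatches.computeIfAbsent(aspectName, k -> new ArrayList<>()).add(patchOp);
    }

    int pendingCount() {
        return pendingPatches.values().stream().mapToInt(List::size).sum();
    }

    /** Returns all patches grouped by aspect, in recording order, and clears the buffer. */
    List<String> drain() {
        List<String> out = new ArrayList<>();
        pendingPatches.forEach((aspect, ops) -> ops.forEach(op -> out.add(aspect + ": " + op)));
        pendingPatches.clear();
        return out;
    }
}
```

Two `addTag` calls therefore produce two entries under the same aspect key, and nothing leaves the entity until the buffer is drained at emission time.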
**MCP Generation:**
The `toMCPs()` method returns **only dirty aspects** and accumulated patches:
```java
// Simplified view: field declarations and type conversions are omitted
public List<MetadataChangeProposalWrapper> toMCPs() {
    List<MetadataChangeProposalWrapper> mcps = new ArrayList<>();
    // 1. Add dirty aspects from cache (excludes clean fetched aspects)
    for (Map.Entry<String, RecordTemplate> entry : cache.getDirtyAspects().entrySet()) {
        mcps.add(createMCP(entry.getKey(), entry.getValue()));
    }
    // 2. Add accumulated patches
    for (PatchBuilder builder : patchBuilders.values()) {
        mcps.add(builder.build());
    }
    // 3. Add pending MCPs
    mcps.addAll(pendingMCPs);
    return mcps;
}
```
**Critical Design Point**: `toMCPs()` uses `cache.getDirtyAspects()` instead of all cached aspects. This ensures that fetched aspects don't override patches - only locally modified aspects are emitted.
#### 3. Mode-Aware Aspect Routing
SDK mode vs INGESTION mode for proper aspect selection:
```java
/**
 * Get aspect name based on operation mode.
 * SDK mode: prefer editable aspects
 * INGESTION mode: use system aspects
 */
protected String getAspectName(Class<? extends RecordTemplate> aspectClass, OperationMode mode) {
    if (mode == OperationMode.SDK) {
        // Check if editable variant exists
        String editableAspectName = getEditableAspectName(aspectClass);
        if (editableAspectName != null) {
            return editableAspectName;
        }
    }
    return aspectClass.getSimpleName();
}

/**
 * Get getter preference order: editable aspects first, then system aspects.
 */
protected <T extends RecordTemplate> T getAspectWithPreference(
    Class<T> editableClass,
    Class<T> systemClass
) {
    // Try editable aspect first
    T editable = getAspectLazy(editableClass);
    if (editable != null) {
        return editable;
    }
    // Fall back to system aspect
    return getAspectLazy(systemClass);
}
```
## Implementation Phases
### Phase 1: Core Framework
Base functionality for all entities:
- Base `Entity` class with aspect management, lazy loading, and patch accumulation
- `DataHubClientV2` main client class with mode-aware behavior
- `EntityClient` with create, read, update, upsert operations
- Configuration classes with environment variable support
- Mixin interfaces using CRTP pattern for type safety
### Phase 2: Dataset Entity
Reference implementation demonstrating all patterns:
- `Dataset` entity with fluent builder
- Dataset-specific aspects (properties, schema, lineage)
- Mixin interface implementations
- Comprehensive unit tests
### Phase 3: Additional Entities
Seven additional entity types:
- `Chart` - Visualizations with lineage
- `Dashboard` - Dashboards with chart relationships
- `Container` - Hierarchical data structures
- `DataJob` - Pipeline tasks with inlet/outlet lineage
- `DataFlow` - Pipeline workflows
- `MLModel` - Machine learning models
- `MLModelGroup` - ML model families
### Phase 4: Patch Capabilities
Patch-based updates for efficient metadata changes:
- Internal patch accumulation within entities (not separate patch builders)
- Automatic patch emission on `update()` and `upsert()`
- Leverages existing `PatchBuilder` classes from entity-registry module
- Patches tested via entity unit tests
### Phase 5: Testing & Documentation
Comprehensive validation and user guides:
- Integration tests with live DataHub server
- API documentation (Javadoc) and 13 comprehensive Markdown guides
- 19 working example files demonstrating real-world usage
- Migration guide from V1
- Design principles document
- Patch operations deep-dive
- Entity-specific guides for all 8 entities
## Testing Strategy
### Unit Tests
Each entity and component has comprehensive unit tests:
- Builder validation (required fields, optional fields, validation logic)
- Aspect management (getters, setters, mode-aware routing)
- MCP generation (full aspects + patches)
- Patch operations (accumulation, emission)
- Fluent API chaining (type safety via CRTP)
- Mixin operations (tags, owners, terms, domains)
**Test Coverage by Entity:**
- Dataset: 37 tests
- Chart: 43 tests
- Dashboard: 52 tests
- DataJob: 45 tests
- DataFlow: 40 tests
- Container: 40 tests
- MLModel: 44 tests
- MLModelGroup: 38 tests
### Integration Tests
Full end-to-end tests against a real DataHub instance:
```java
@Test
public void testDatasetCreateAndRead() throws Exception {
// Create client
DataHubClientV2 client = DataHubClientV2.builder()
.server(TEST_SERVER)
.token(TEST_TOKEN)
.build();
// Create dataset
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("db.schema.test_table_" + System.currentTimeMillis())
.env("PROD")
.description("Test dataset created by Java SDK V2")
.build();
dataset.addTag("test-tag")
.addOwner("urn:li:corpuser:datahub", OwnershipType.TECHNICAL_OWNER);
// Upsert
client.entities().upsert(dataset);
// Read back
Dataset retrieved = client.entities().get(dataset.getUrn(), Dataset.class);
assertNotNull(retrieved);
assertEquals("Test dataset created by Java SDK V2", retrieved.getDescription());
}
@Test
public void testDatasetPatchOperations() throws Exception {
DataHubClientV2 client = DataHubClientV2.builder()
.server(TEST_SERVER)
.token(TEST_TOKEN)
.build();
// Create dataset first
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("db.schema.test_table_patch_" + System.currentTimeMillis())
.env("PROD")
.build();
client.entities().upsert(dataset);
// Retrieve and apply patches
Dataset retrieved = client.entities().get(dataset.getUrn(), Dataset.class);
Dataset mutable = retrieved.mutable(); // Get mutable copy
mutable.addTag("pii") // Creates patch
.addTag("sensitive") // Another patch
.addTerm("urn:li:glossaryTerm:CustomerData"); // Another patch
// All patches emitted atomically
client.entities().update(mutable);
// Verify patches were applied
Dataset verified = client.entities().get(dataset.getUrn(), Dataset.class);
assertTrue(verified.getTags().contains("urn:li:tag:pii"));
}
````
**Integration Test Coverage:**
- Entity creation and retrieval
- Tag, owner, term, domain operations
- Lineage relationships (charts → datasets, jobs → datasets)
- Custom properties
- Full metadata workflows
- Batch operations
- Patch accumulation and emission
**Running Integration Tests:**
```bash
export DATAHUB_SERVER=http://localhost:8080
export DATAHUB_TOKEN=your_token
./gradlew :metadata-integration:java:datahub-client:test --tests "*Integration*"
```
### Test Coverage Results
- Unit test coverage: **>80%** for new code (378 unit tests + 79 integration tests = 457 total)
- All public APIs covered
- Edge cases tested (null values, invalid inputs, mode switching)
- Async operations tested with proper synchronization
- Cache infrastructure thoroughly tested (43 tests for AspectCache + CachedAspect)
- Full end-to-end integration tests (79 tests)
## API Documentation
All public classes and methods have comprehensive Javadoc plus extensive Markdown documentation:
**Javadoc Coverage:**
- Class-level documentation explaining purpose and usage
- Method-level documentation with parameters, returns, exceptions
- Code examples for common use cases
- Links to related classes and methods
**Markdown Documentation (13 files):**
Located in `metadata-integration/java/docs/sdk-v2/`:
1. **getting-started.md** - Quick start guide for new users
2. **design-principles.md** - Architecture and design decisions
3. **dataset-entity.md** - Dataset operations and schema support
4. **chart-entity.md** - Chart operations and lineage
5. **dashboard-entity.md** - Dashboard operations and relationships
6. **container-entity.md** - Container hierarchies
7. **dataflow-entity.md** - DataFlow pipeline operations
8. **datajob-entity.md** - DataJob inlet/outlet lineage
9. **mlmodel-entity.md** - MLModel metrics and hyperparameters
10. **mlmodelgroup-entity.md** - MLModelGroup version management
11. **patch-operations.md** - Deep dive into patch-based updates
12. **migration-from-v1.md** - Migration guide from V1 SDK
13. **java-sdk-v2-design.md** - This comprehensive design document
**Working Examples (19 files):**
Located in `metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/`:
- Dataset examples: DatasetCreateExample, DatasetFullExample, DatasetPatchExample
- Chart examples: ChartCreateExample, ChartFullExample, ChartLineageExample
- Dashboard examples: DashboardCreateExample, DashboardFullExample, DashboardLineageExample
- DataFlow examples: DataFlowCreateExample, DataFlowFullExample
- DataJob examples: DataJobCreateExample, DataJobFullExample, DataJobLineageExample
- Container examples: ContainerCreateExample, ContainerFullExample, ContainerHierarchyExample
- MLModel examples: MLModelCreateExample, MLModelFullExample
- MLModelGroup examples: MLModelGroupCreateExample, MLModelGroupFullExample
## Migration Guide
For users of the existing Java SDK:
### Before (V1):
```java
RestEmitter emitter = RestEmitter.create(b -> b.server("http://localhost:8080"));

DatasetUrn urn = new DatasetUrn(
    new DataPlatformUrn("postgres"),
    "my_table",
    FabricType.PROD
);

DatasetProperties props = new DatasetProperties();
props.setDescription("My dataset");

MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
    .entityType("dataset")
    .entityUrn(urn)
    .upsert()
    .aspect(props)
    .build();

emitter.emit(mcpw).get();
```
### After (V2):
```java
DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .build();

Dataset dataset = Dataset.builder()
    .platform("postgres")
    .name("my_table")
    .description("My dataset")
    .build();

client.entities().upsert(dataset);
```
## Decision Log
### 1. Use Pegasus Models vs OpenAPI Models
**Decision**: Use Pegasus models (`com.linkedin.*`) for aspect classes.
**Rationale**:
- Pegasus models are the canonical representation in DataHub
- Already used by v1 SDK, maintains consistency
- Generated from PDL schemas, always in sync with backend
- OpenAPI models are less mature and have fewer utilities
**Result**: Proven correct - seamless integration with existing infrastructure.
### 2. Namespace Separation
**Decision**: Use `datahub.client.v2.*` namespace.
**Rationale**:
- Clear separation from v1 API
- Allows side-by-side usage
- Follows semantic versioning principles
- Easy to deprecate v1 in future
**Result**: 100% backward compatibility achieved - v1 code unchanged.
### 3. Builder Pattern
**Decision**: Use nested static Builder classes.
**Rationale**:
- Idiomatic Java pattern
- Type-safe construction
- Optional parameters handled cleanly
- Better than telescoping constructors
**Result**: Excellent developer experience with fluent API.
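
As a rough illustration of the nested static builder idiom described above, here is a minimal sketch. `SketchDataset`, its fields, and the `"PROD"` default are hypothetical stand-ins chosen for this example; they are not the SDK's actual `Dataset` class.

```java
import java.util.Objects;

// Hypothetical stand-in for an SDK entity; not the real datahub.client.v2 Dataset.
final class SketchDataset {
    private final String platform;
    private final String name;
    private final String env;          // optional, defaults to "PROD" (assumed default)
    private final String description;  // optional, may be null

    private SketchDataset(Builder b) {
        this.platform = b.platform;
        this.name = b.name;
        this.env = b.env;
        this.description = b.description;
    }

    static Builder builder() { return new Builder(); }

    String getPlatform() { return platform; }
    String getName() { return name; }
    String getEnv() { return env; }
    String getDescription() { return description; }

    // Nested static builder: each setter returns the builder so calls can be chained.
    static final class Builder {
        private String platform;
        private String name;
        private String env = "PROD";
        private String description;

        Builder platform(String platform) { this.platform = platform; return this; }
        Builder name(String name) { this.name = name; return this; }
        Builder env(String env) { this.env = env; return this; }
        Builder description(String description) { this.description = description; return this; }

        SketchDataset build() {
            // Required fields are validated once, at build time.
            Objects.requireNonNull(platform, "platform is required");
            Objects.requireNonNull(name, "name is required");
            return new SketchDataset(this);
        }
    }
}
```

Optional parameters simply keep their defaults when the caller omits them, which is what avoids the telescoping-constructor problem.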
### 4. Synchronous vs Async
**Decision**: Provide synchronous API that wraps async operations.
**Rationale**:
- Simpler for most users
- Matches Python SDK V2 API
- Can expose async API later for advanced users
- RestEmitter already provides async primitives
**Result**: Simplified API widely adopted in examples and tests.
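
One way a synchronous call can wrap an asynchronous primitive is sketched below. `AsyncSender` and `SyncFacade` are hypothetical names invented for this example, and translating failures into `IOException` is an assumption made here, not a description of how `EntityClient` is implemented.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical async primitive, standing in for an emitter that returns a Future.
interface AsyncSender {
    Future<String> send(String payload);
}

// Hypothetical synchronous facade: blocks on the Future and reports failures as IOException.
final class SyncFacade {
    private final AsyncSender sender;

    SyncFacade(AsyncSender sender) { this.sender = sender; }

    String sendAndWait(String payload) throws IOException {
        try {
            return sender.send(payload).get(); // block until the async call completes
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IOException("Interrupted while waiting for response", e);
        } catch (ExecutionException e) {
            throw new IOException("Async send failed", e.getCause());
        }
    }
}

class SyncFacadeDemo {
    public static void main(String[] args) throws IOException {
        SyncFacade facade = new SyncFacade(p -> CompletableFuture.completedFuture("ack:" + p));
        System.out.println(facade.sendAndWait("hello"));
    }
}
```

Callers see a plain blocking method, while the underlying `Future` remains available for a later async API.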
### 5. Error Handling
**Decision**: Throw checked exceptions for I/O operations.
**Rationale**:
- Forces callers to handle errors
- Consistent with Java conventions
- Clear distinction between programmer errors and runtime failures
**Result**: Clear error handling patterns in all code.
**Exception Hierarchy:**
The SDK introduces custom exceptions for common error conditions:
**ReadOnlyEntityException** - Thrown when attempting to mutate a read-only entity:
```java
// Declared outside the try block so the catch block can still reference it
Dataset dataset = client.entities().get(urn);
try {
    dataset.addTag("pii"); // Throws ReadOnlyEntityException
} catch (ReadOnlyEntityException e) {
    // Exception message explains the issue and provides fix
    System.err.println(e.getMessage());

    // Fix: Get mutable copy first
    Dataset mutable = dataset.mutable();
    mutable.addTag("pii");
    client.entities().upsert(mutable);
}
```
**PendingMutationsException** - Thrown when reading from entity with pending mutations:
```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

dataset.setDescription("New description");
// dataset.getDescription(); // Throws PendingMutationsException!

// Fix: Save first, then read
client.entities().upsert(dataset); // Clears dirty flag
String desc = dataset.getDescription(); // Now works
```
**Why these restrictions?**
- **ReadOnlyEntityException**: Makes mutations explicit, prevents accidental changes when passing entities between functions
- **PendingMutationsException**: Prevents reading stale cached data, enforces explicit save-then-fetch workflow
Both restrictions enforce clear separation between read and write workflows. These may be relaxed in future versions as the API matures and usage patterns emerge.
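
The two guards can be pictured with a small sketch. `GuardedEntity` is a hypothetical class written for this example, and it throws `IllegalStateException` rather than the SDK's own exception types; the real `Entity` class may track read-only and dirty state differently.

```java
// Hypothetical sketch of read-only and dirty-state guards; not the SDK's Entity class.
final class GuardedEntity {
    private final boolean readOnly;
    private boolean dirty;
    private String description;

    private GuardedEntity(boolean readOnly, String description) {
        this.readOnly = readOnly;
        this.description = description;
    }

    // Entities fetched from the server start out read-only.
    static GuardedEntity fetched(String description) { return new GuardedEntity(true, description); }

    // Entities built locally are mutable from the start.
    static GuardedEntity fresh() { return new GuardedEntity(false, null); }

    // Returns a writable copy, leaving the original untouched.
    GuardedEntity mutable() { return new GuardedEntity(false, description); }

    void setDescription(String newDescription) {
        if (readOnly) {
            throw new IllegalStateException("Entity is read-only; call mutable() first");
        }
        description = newDescription;
        dirty = true;
    }

    String getDescription() {
        if (dirty) {
            throw new IllegalStateException("Pending mutations; save before reading");
        }
        return description;
    }

    // Stand-in for a successful save, which clears the dirty flag.
    void markSaved() { dirty = false; }
}
```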
### 6. Patch-First over Full Aspect Replacement
**Decision**: Prioritize patch-based operations as the primary API, defer full aspect replacement to V1 SDK.
**Rationale**:
- **User mental model**: "Add a tag" is more natural than "fetch all tags, modify list, PUT entire aspect"
- **Safety**: Patches don't clobber concurrent changes from other users/systems
- **Simplicity**: Most metadata operations are incremental (add owner, remove tag, etc.)
- **Efficiency**: Only changed fields transmitted and processed by server
- **Escape hatch exists**: Users needing full PUT semantics can use V1 SDK's `RestEmitter` directly
**Why not both?**
V2 SDK focuses on making common operations simple, not exposing every low-level primitive. This keeps the API focused and prevents confusion about when to use patches vs full replacement.
**Result**: Clean, intuitive API for 95% of use cases. Power users can drop to V1 SDK for remaining 5%.
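
The safety argument can be made concrete with a toy simulation. `TagStore` and `ClobberDemo` are hypothetical in-memory stand-ins invented for this sketch; they model two writers updating the same tags aspect and involve no DataHub server calls.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical in-memory "server" holding a single tags aspect.
final class TagStore {
    private final Set<String> tags = new LinkedHashSet<>();

    // Full replacement: the caller's snapshot overwrites whatever is stored.
    void replaceAll(Set<String> newTags) {
        tags.clear();
        tags.addAll(newTags);
    }

    // Patch: only the single change is applied to the current state.
    void patchAdd(String tag) { tags.add(tag); }

    Set<String> snapshot() { return new LinkedHashSet<>(tags); }
}

final class ClobberDemo {
    // Two writers read the same (empty) state, then each writes back a full aspect.
    static Set<String> withFullReplacement() {
        TagStore store = new TagStore();
        Set<String> writerA = store.snapshot();
        Set<String> writerB = store.snapshot();
        writerA.add("pii");
        writerB.add("finance");
        store.replaceAll(writerA);
        store.replaceAll(writerB); // overwrites writer A's change
        return store.snapshot();
    }

    // The same two writers each send only their own change.
    static Set<String> withPatches() {
        TagStore store = new TagStore();
        store.patchAdd("pii");
        store.patchAdd("finance");
        return store.snapshot();
    }

    public static void main(String[] args) {
        System.out.println("full replacement: " + withFullReplacement());
        System.out.println("patches:          " + withPatches());
    }
}
```

With full replacement the second write discards the first writer's tag, while the patch path keeps both.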
### 7. Internal Patch Accumulation vs External Patch Builders
**Decision**: Accumulate patches **inside entities** rather than separate patch builder classes.
**Rationale**:
- More intuitive API - metadata operations just work
- Patches automatically emitted on save
- Reduces API surface area
- Simplifies user code
**Original Design**: Separate `DatasetPatch`, `ChartPatch` builder classes
**Actual Implementation**: Patches accumulate in `Entity.pendingPatches` and emit via `toMCPs()`
**Result**: Superior developer experience - no need to learn separate patch API.
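
A simplified picture of internal patch accumulation follows. `SketchPatch`, `SketchEntity`, `SketchChart`, and `drainPatches()` are hypothetical names for this example. The real SDK emits MetadataChangeProposals via `toMCPs()`, and whether that call clears the pending list is an assumption made here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified patch record; the SDK produces MetadataChangeProposals instead.
final class SketchPatch {
    final String aspectName;
    final String op;
    final String path;

    SketchPatch(String aspectName, String op, String path) {
        this.aspectName = aspectName;
        this.op = op;
        this.path = path;
    }
}

// Hypothetical base class: mutations are recorded as patches rather than applied to a local copy.
abstract class SketchEntity {
    private final List<SketchPatch> pendingPatches = new ArrayList<>();

    protected void addPatch(SketchPatch patch) { pendingPatches.add(patch); }

    // Returns the accumulated patches and clears the list, as a save operation might (assumed).
    List<SketchPatch> drainPatches() {
        List<SketchPatch> out = new ArrayList<>(pendingPatches);
        pendingPatches.clear();
        return out;
    }
}

final class SketchChart extends SketchEntity {
    // The user calls a plain metadata method; the patch bookkeeping stays internal.
    SketchChart addTag(String tagUrn) {
        addPatch(new SketchPatch("globalTags", "add", "/tags/" + tagUrn)); // illustrative path format
        return this;
    }
}
```

User code then reads as `new SketchChart().addTag("urn:li:tag:pii")`, with no separate patch builder to learn.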
### 8. CRTP Pattern for Mixin Interfaces
**Decision**: Use Curiously Recurring Template Pattern for type-safe mixin interfaces.
**Rationale**:
- Type-safe method chaining returns concrete entity type
- Compile-time type checking
- No casting required in user code
- Idiomatic Java generics pattern
**Original Design**: Simple interfaces returning `Entity`
**Actual Implementation**:
```java
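public interface HasTags<T extends Entity & HasTags<T>> {
    // Simplified: the cast lets chained calls return the concrete entity type.
    @SuppressWarnings("unchecked")
    default T addTag(String tagUrn) { return (T) this; }
}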
```
**Result**: Excellent type safety and developer experience.
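
To show why the self-referencing type parameter matters, here is a self-contained sketch. `BaseEntity`, `Taggable`, and `DemoDataset` are hypothetical stand-ins for `Entity`, `HasTags`, and `Dataset`. The chained call in `main` compiles only because `addTag` returns `DemoDataset` rather than the base type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class standing in for Entity.
abstract class BaseEntity {
    final List<String> tags = new ArrayList<>();
}

// Hypothetical mixin standing in for HasTags: T is the concrete entity type.
interface Taggable<T extends BaseEntity & Taggable<T>> {
    @SuppressWarnings("unchecked")
    default T addTag(String tagUrn) {
        T self = (T) this;      // safe as long as implementors pass their own type as T
        self.tags.add(tagUrn);
        return self;
    }
}

// Hypothetical concrete entity: passes itself as the type argument.
final class DemoDataset extends BaseEntity implements Taggable<DemoDataset> {
    String description;

    DemoDataset setDescription(String description) {
        this.description = description;
        return this;
    }
}

class CrtpDemo {
    public static void main(String[] args) {
        // No cast needed: addTag returns DemoDataset, so setDescription is visible.
        DemoDataset d = new DemoDataset()
            .addTag("urn:li:tag:pii")
            .setDescription("Customer table");
        System.out.println(d.tags + " / " + d.description);
    }
}
```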
### 9. Mode-Aware Behavior (SDK vs INGESTION)
**Decision**: Support SDK mode and INGESTION mode for aspect routing.
**Rationale**:
- Proper separation of user edits vs pipeline writes
- SDK mode → editable aspects (user overrides)
- INGESTION mode → system aspects (pipeline data)
- Getters prefer editable over system
**Original Design**: Not specified
**Actual Implementation**: `OperationMode` enum with aspect routing logic
**Result**: Clear separation of concerns, aligns with DataHub's aspect model.
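
The routing rule can be sketched as follows. `SketchMode` and `DescriptionRouter` are hypothetical names for this example. The aspect names follow DataHub's dataset model, but the routing shown is a simplified assumption, not the SDK's `OperationMode` logic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the SDK's OperationMode enum.
enum SketchMode { SDK, INGESTION }

// Hypothetical router for a dataset description; simplified to one field.
final class DescriptionRouter {
    static final String EDITABLE = "editableDatasetProperties"; // user overrides
    static final String SYSTEM = "datasetProperties";           // pipeline data

    private final Map<String, String> descriptionsByAspect = new HashMap<>();

    // SDK mode writes to the editable aspect; INGESTION mode writes to the system aspect.
    static String targetAspect(SketchMode mode) {
        return mode == SketchMode.SDK ? EDITABLE : SYSTEM;
    }

    void setDescription(SketchMode mode, String description) {
        descriptionsByAspect.put(targetAspect(mode), description);
    }

    // Getters prefer the editable value and fall back to the system value.
    String getDescription() {
        String editable = descriptionsByAspect.get(EDITABLE);
        return editable != null ? editable : descriptionsByAspect.get(SYSTEM);
    }
}
```

A user edit therefore shadows the pipeline value on read without overwriting it.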
### 10. Lazy Loading for GET Operations
**Decision**: Implement lazy loading for aspects when entities are retrieved.
**Rationale**:
- Performance - only fetch aspects when accessed
- Client binding enables on-demand fetching
- Cache management with timestamps
**Original Design**: Not specified (GET deferred)
**Actual Implementation**: Full lazy loading with `getAspectLazy()` and client binding
**Result**: Efficient entity retrieval with on-demand aspect fetching.
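
A minimal sketch of on-demand fetching with timestamped cache entries is shown below. `LazyAspects`, the injected fetcher and clock, and the time-to-live policy are assumptions made for this example; the SDK's `getAspectLazy()` and `AspectCache` may behave differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical lazy aspect holder: nothing is fetched until an aspect is first read.
final class LazyAspects {
    private static final class Entry {
        final String value;
        final long fetchedAtMillis;

        Entry(String value, long fetchedAtMillis) {
            this.value = value;
            this.fetchedAtMillis = fetchedAtMillis;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();
    private final Function<String, String> fetcher; // stands in for a bound client call
    private final LongSupplier clock;               // injectable for testing
    private final long ttlMillis;                   // assumed expiry policy

    LazyAspects(Function<String, String> fetcher, LongSupplier clock, long ttlMillis) {
        this.fetcher = fetcher;
        this.clock = clock;
        this.ttlMillis = ttlMillis;
    }

    String get(String aspectName) {
        long now = clock.getAsLong();
        Entry entry = cache.get(aspectName);
        // Fetch on first access, or again once the cached entry is older than the TTL.
        if (entry == null || now - entry.fetchedAtMillis >= ttlMillis) {
            entry = new Entry(fetcher.apply(aspectName), now);
            cache.put(aspectName, entry);
        }
        return entry.value;
    }
}
```

In production code the clock would typically be `System::currentTimeMillis`; injecting it keeps the expiry logic testable.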
## Design Questions and Resolutions
1. **GET operation implementation**: Should we implement REST client for reading entities, or defer to future?
   - **Resolution**: Implemented with lazy loading support
2. **Search client**: Should we include search functionality in V2?
   - **Resolution**: Deferred to future (out of scope for V2)
3. **Lineage client**: Should we include lineage management?
   - **Resolution**: Basic lineage on Dataset, Chart, Dashboard, DataJob entities
4. **Schema field builders**: Should we provide fluent builders for schema fields?
   - **Resolution**: Yes, schema field support in Dataset entity
## References
- [Python SDK V2 Implementation](https://github.com/datahub-project/datahub/tree/master/metadata-ingestion/src/datahub/sdk)
- [Existing Java SDK](https://github.com/datahub-project/datahub/tree/master/metadata-integration/java/datahub-client)
- [DataHub Metadata Model](https://github.com/datahub-project/datahub/tree/master/metadata-models)
## Quick Links for Reviewers
**Start Here:**
1. `metadata-integration/java/docs/sdk-v2/getting-started.md` - Quick start guide
2. `metadata-integration/java/docs/sdk-v2/design-principles.md` - Architecture overview
**Core Implementation:**

3. `metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Entity.java` (490 lines) - Base entity class
4. `metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/operations/EntityClient.java` (570 lines) - CRUD operations
5. `metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/DataHubClientV2.java` (266 lines) - Main client

**Sample Entities:**

6. `metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/Dataset.java` (564 lines) - Reference implementation
7. `metadata-integration/java/datahub-client/src/main/java/datahub/client/v2/entity/HasTags.java` (145 lines) - CRTP mixin example

**Examples:**

8. `metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/DatasetFullExample.java` - Complete workflow
9. `metadata-integration/java/examples/src/main/java/io/datahubproject/examples/v2/ChartLineageExample.java` - Lineage relationships

**Tests:**

10. `metadata-integration/java/datahub-client/src/test/java/datahub/client/v2/entity/DatasetTest.java` (37 unit tests)
11. `metadata-integration/java/datahub-client/src/test/java/datahub/client/v2/integration/DatasetIntegrationTest.java` - End-to-end validation

---
**Document Status**: Design document reflecting implemented architecture (includes AspectCache refactoring)
**Author**: DataHub OSS Team
**Last Updated**: 2025-01-06