17 KiB
Design Principles of Java SDK V2
This document provides an architectural overview of DataHub Java SDK V2, exploring the engineering principles and design patterns that enable its type-safe, efficient metadata management capabilities.
Architectural Philosophy
SDK V2 is built on a foundation of pragmatic reuse, intelligent caching, and layered abstractions. Rather than reinventing infrastructure, it composes proven components into a coherent, intuitive API while introducing new patterns for efficient metadata operations.
Core Tenets
- Leverage Existing Infrastructure - Build atop battle-tested components
- Type Safety as a First-Class Concern - Exploit Java's type system for compile-time correctness
- Separation of Concerns - Clear boundaries between entity, operations, and transport layers
- Efficiency Through Patches - Surgical updates over full replacements
- Intelligent Resource Management - Lazy loading, caching, and batching
Layer Architecture
SDK V2 employs a three-layer architecture with clear separation of responsibilities:
┌─────────────────────────────────────────────────────────────┐
│ Entity Layer │
│ (Dataset, Chart, Dashboard - Business Logic) │
│ - Fluent builders for entity construction │
│ - Patch accumulation and aspect management │
│ - Mode-aware behavior (SDK vs INGESTION) │
└──────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────┴──────────────────────────────────────┐
│ Operations Layer │
│ (EntityClient - CRUD Operations) │
│ - Entity lifecycle management │
│ - Patch vs full aspect emission logic │
│ - Lazy loading coordination │
└──────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────┴──────────────────────────────────────┐
│ Transport Layer │
│ (RestEmitter, Patch Builders) │
│ - HTTP communication with DataHub │
│ - MCP serialization and emission │
│ - Patch builder integration │
└─────────────────────────────────────────────────────────────┘
Design Patterns
1. Fluent Builder Pattern
Entity construction follows a fluent builder pattern that guides developers through required fields and provides IDE autocomplete support:
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("analytics.public.events")
.env("PROD")
.description("User events")
.build();
Engineering Benefits:
- Compile-time validation - Missing required fields (platform, name) fail at compilation
- Immutable construction - Builder accumulates state;
build()creates immutable entity - Discoverability - IDE autocomplete reveals available methods
- Extensibility - New optional parameters added without breaking existing code
2. Patch Accumulation Pattern
Rather than modifying aspects directly, mutations create patch MCPs that accumulate in a pending list:
dataset.addTag("pii") // Creates patch MCP
.addOwner("user", TECHNICAL_OWNER) // Creates patch MCP
.addCustomProperty("retention", "90"); // Creates patch MCP
client.entities().upsert(dataset); // Emits all patches atomically
Engineering Benefits:
- Deferred execution - Batches multiple changes into a single network round-trip
- Atomic updates - All patches applied together or none
- Efficient transmission - Only changed fields sent over wire
- Reuse of proven infrastructure - Leverages existing
datahub.client.patchbuilders
Implementation Detail: Entity base class maintains multiple change tracking mechanisms:
// From Entity.java
protected final Map<String, RecordTemplate> aspectCache; // Cached aspects from builder
protected final List<MetadataChangeProposalWrapper> pendingMCPs; // Full aspect replacements
protected final List<MetadataChangeProposal> pendingPatches; // Incremental patches
Each mutation (addTag, addOwner) creates a patch using existing builders:
// From Dataset.java
public Dataset addTag(@Nonnull String tagUrn) {
GlobalTagsPatchBuilder patch = new GlobalTagsPatchBuilder()
.urn(getUrn())
.addTag(tag, null);
addPatchMcp(patch.build()); // Adds to pendingPatches list
return this;
}
When EntityClient.upsert() is called, it emits everything accumulated on the entity in order:
// From EntityClient.upsert()
// Step 1: Emit cached aspects (from builder)
if (!entity.toMCPs().isEmpty()) {
for (MetadataChangeProposalWrapper mcp : entity.toMCPs()) {
emitter.emit(mcp);
}
}
// Step 2: Emit pending full aspect MCPs (from set*() methods)
if (entity.hasPendingMCPs()) {
for (MetadataChangeProposalWrapper mcp : entity.getPendingMCPs()) {
emitter.emit(mcp);
}
entity.clearPendingMCPs();
}
// Step 3: Emit all pending patches (from add*/remove* methods)
if (entity.hasPendingPatches()) {
for (MetadataChangeProposal patchMcp : entity.getPendingPatches()) {
emitter.emit(patchMcp, null);
}
entity.clearPendingPatches();
}
Key insight: upsert() is not an either/or operation - it emits all accumulated changes. What gets sent depends on what you've accumulated on the entity, not which method you call.
3. Lazy Loading with TTL-Based Caching
Entities support lazy aspect loading to minimize network calls while ensuring data freshness:
// Entity maintains aspect cache with timestamps
protected final Map<String, RecordTemplate> aspectCache;
protected final Map<String, Long> aspectTimestamps;
protected long cacheTtlMs = 60000; // 60-second default TTL
Loading Strategy:
- Cache-only access (
getAspectCached) - Returns cached aspect or null - Lazy loading (
getAspectLazy) - Checks cache freshness, fetches from server if stale - Get-or-create (
getOrCreateAspect) - Returns cached or creates new empty aspect locally
Implementation:
protected <T extends RecordTemplate> T getAspectLazy(@Nonnull Class<T> aspectClass) {
String aspectName = getAspectName(aspectClass);
// Check cache freshness
if (aspectCache.containsKey(aspectName)) {
Long timestamp = aspectTimestamps.get(aspectName);
if (timestamp != null && System.currentTimeMillis() - timestamp < cacheTtlMs) {
return aspectClass.cast(aspectCache.get(aspectName));
}
}
// Fetch from server if client is bound
if (client != null) {
T aspect = client.getAspect(urn, aspectClass);
if (aspect != null) {
aspectCache.put(aspectName, aspect);
aspectTimestamps.put(aspectName, System.currentTimeMillis());
}
return aspect;
}
return null;
}
Engineering Benefits:
- Network efficiency - Reduces redundant server calls
- Freshness guarantee - Configurable TTL ensures data isn't stale
- Transparent to caller - Complexity hidden behind simple getter
- Client binding - Entities bound to EntityClient enable lazy loading
4. Mode-Aware Aspect Selection
SDK V2 distinguishes between user-initiated edits (SDK mode) and system/pipeline writes (INGESTION mode):
public enum OperationMode {
SDK, // Interactive use - writes to editable aspects
INGESTION // ETL pipelines - writes to system aspects
}
Aspect Routing:
- SDK Mode →
editableDatasetProperties,editableSchemaMetadata - INGESTION Mode →
datasetProperties,schemaMetadata
Implementation:
public Dataset setDescription(@Nonnull String description) {
if (isIngestionMode()) {
return setSystemDescription(description); // datasetProperties
} else {
return setEditableDescription(description); // editableDatasetProperties
}
}
Engineering Benefits:
- Clear provenance - Distinguishes human vs machine edits
- UI consistency - DataHub UI shows editable aspects as user overrides
- Non-destructive - System data preserved even when users add documentation
- Lineage preservation - Ingestion pipelines can refresh system data without clobbering user edits
5. Two Entity Lifecycle Patterns
Entities can be instantiated in two ways, each with distinct semantics:
Pattern 1: Builder Construction (New Entities)
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_table")
.build();
// aspectCache populated with builder-provided aspects
// aspectTimestamps empty - indicates new entity
Use case: Creating new entities from scratch
Pattern 2: Server Loading (Existing Entities)
Dataset dataset = client.entities().get(urn);
// aspectCache populated with server aspects
// aspectTimestamps records fetch time for each aspect
// Entity automatically bound to client for lazy loading
Use case: Modifying existing entities with current server state. When you access aspects not already cached, the entity will automatically fetch them from the server (lazy loading).
6. Client Binding for Lazy Loading
Entities are automatically bound to an EntityClient when loaded from the server or during upsert() to enable lazy aspect fetching:
public void bindToClient(@Nonnull EntityClient client,
@Nonnull OperationMode mode) {
if (this.client == null) {
this.client = client;
}
if (this.mode == null) {
this.mode = mode;
}
}
Binding occurs automatically during upsert():
// From EntityClient.upsert()
entity.bindToClient(this, config.getMode());
Engineering Benefits:
- Transparent lazy loading - Aspects fetched on first access if not cached
- Automatic binding - Entities bound to client during
get()orupsert()operations - Mode propagation - Client mode automatically applied to entity
Type Safety & Generic Design
Strongly-Typed Aspect Handling
SDK V2 leverages Java generics to provide compile-time type safety for aspects:
// Type-safe aspect retrieval
protected <T extends RecordTemplate> T getAspectLazy(@Nonnull Class<T> aspectClass) {
String aspectName = getAspectName(aspectClass);
RecordTemplate aspect = aspectCache.get(aspectName);
return aspectClass.cast(aspect);
}
// Usage - compiler enforces type correctness
DatasetProperties props = dataset.getAspectLazy(DatasetProperties.class);
Engineering Benefits:
- Compile-time checking - Type mismatches caught before runtime
- Refactoring safety - IDE can trace aspect usages across codebase
- Autocomplete support - IDE suggests available aspects
- Runtime safety -
ClassCastExceptionimpossible with correct usage
URN Type Safety
Entity-specific URN types prevent incorrect URN usage:
public class Dataset extends Entity {
public DatasetUrn getDatasetUrn() {
return (DatasetUrn) urn;
}
}
// Compile-time enforcement
DatasetUrn urn = dataset.getDatasetUrn(); // Type-safe
Urn genericUrn = dataset.getUrn(); // Also available
Integration with Existing Infrastructure
Reuse of Patch Builders
SDK V2 reuses existing patch builders from datahub.client.patch rather than creating new implementations:
OwnershipPatchBuilder- Owner additions/removalsGlobalTagsPatchBuilder- Tag managementGlossaryTermsPatchBuilder- Term associationsDomainsPatchBuilder- Domain assignmentDatasetPropertiesPatchBuilder- Property updatesEditableDatasetPropertiesPatchBuilder- Editable property updates
Engineering Benefits:
- Battle-tested logic - Patch builders used in production by Python SDK
- Consistency - Same patch semantics across language SDKs
- Maintainability - Single implementation to maintain
- Correctness - Complex JSON Patch logic already validated
Example Integration:
public Dataset addOwner(@Nonnull String ownerUrn, @Nonnull OwnershipType type) {
Urn owner = Urn.createFromString(ownerUrn);
OwnershipPatchBuilder patch = new OwnershipPatchBuilder()
.urn(getUrn())
.addOwner(owner, type);
addPatchMcp(patch.build()); // Stores patch MCP
return this;
}
Leverage RestEmitter
Transport layer reuses RestEmitter for HTTP communication:
- Non-blocking emission with futures
- Configurable retries and timeouts
- Token-based authentication
- Async HTTP client pooling
No changes to RestEmitter - SDK V2 is purely additive.
Resource Management & Efficiency
Batched Emission
Multiple patches accumulated and emitted atomically:
dataset.addTag("tag1").addTag("tag2").addOwner("user1", OWNER);
client.entities().upsert(dataset); // Single network call, 3 patches
Connection Pooling
RestEmitter uses CloseableHttpAsyncClient with connection pooling for efficient HTTP reuse.
Graceful Degradation
Lazy loading failures logged but don't crash:
catch (Exception e) {
log.warn("Failed to lazy-load aspect {}: {}", aspectName, e.getMessage());
return null; // Graceful degradation
}
Comparison: V1 vs V2 Architecture
| Aspect | V1 (RestEmitter) | V2 (DataHubClientV2) |
|---|---|---|
| Abstraction Level | Low - MCPs | High - Entities |
| URN Construction | Manual strings | Automatic from builder |
| Aspect Wiring | Manual MCP building | Hidden in entity methods |
| Updates | Full aspect replacement | Patch-based incremental |
| Type Safety | Minimal - generic MCPs | Strong - typed entities |
| Lazy Loading | Not supported | TTL-based caching |
| Mode Awareness | Not supported | SDK vs INGESTION modes |
| Learning Curve | Steep - requires MCP knowledge | Gentle - intuitive builders |
Performance Characteristics
Network Efficiency
- Patch-based updates: O(changed_fields) vs O(all_fields)
- Lazy loading: Aspects fetched only when accessed
- Batch emission: Multiple patches sent in single flush
- Connection reuse: HTTP client pooling
Memory Efficiency
- Aspect caching: Only fetched aspects stored
- TTL expiration: Stale aspects eligible for GC
- Lazy instantiation: Aspects created on-demand
Time Complexity
- Entity creation: O(1) - builder accumulation
- Patch addition: O(1) - append to list
- Upsert operation: O(n) where n = pending patches or cached aspects
- Lazy fetch: O(1) cache lookup + O(1) network if miss
Extension Points
SDK V2 designed for extensibility:
- New entity types - Extend
Entitybase class - Custom aspects - Use
getAspectLazy/getOrCreateAspect - New patch types - Leverage existing patch builders
- Custom caching - Override
cacheTtlMs - Transport customization - Customize RestEmitter via builder
Summary
Java SDK V2 achieves its goals through principled design:
- Reuse over reinvention - Leverages existing patch builders and RestEmitter
- Patches over replacements - Efficient incremental updates
- Lazy over eager - Aspects fetched on-demand with caching
- Type safety over convenience - Strong typing throughout
- Layers over monoliths - Clear separation of entity, operations, transport
- Pragmatism over purity - Mode-aware behavior matches real-world usage
The result is an SDK that feels natural to Java developers while providing the efficiency and correctness required for production metadata management at scale.