Design Principles of Java SDK V2

This document provides an architectural overview of DataHub Java SDK V2, exploring the engineering principles and design patterns that enable its type-safe, efficient metadata management capabilities.

Architectural Philosophy

SDK V2 is built on a foundation of pragmatic reuse, intelligent caching, and layered abstractions. Rather than reinventing infrastructure, it composes proven components into a coherent, intuitive API while introducing new patterns for efficient metadata operations.

Core Tenets

  1. Leverage Existing Infrastructure - Build atop battle-tested components
  2. Type Safety as a First-Class Concern - Exploit Java's type system for compile-time correctness
  3. Separation of Concerns - Clear boundaries between entity, operations, and transport layers
  4. Efficiency Through Patches - Surgical updates over full replacements
  5. Intelligent Resource Management - Lazy loading, caching, and batching

Layer Architecture

SDK V2 employs a three-layer architecture with clear separation of responsibilities:

┌─────────────────────────────────────────────────────────────┐
│                    Entity Layer                              │
│  (Dataset, Chart, Dashboard - Business Logic)                │
│  - Fluent builders for entity construction                   │
│  - Patch accumulation and aspect management                  │
│  - Mode-aware behavior (SDK vs INGESTION)                    │
└──────────────────────┬──────────────────────────────────────┘
                       │
┌──────────────────────┴──────────────────────────────────────┐
│                 Operations Layer                             │
│  (EntityClient - CRUD Operations)                            │
│  - Entity lifecycle management                               │
│  - Patch vs full aspect emission logic                       │
│  - Lazy loading coordination                                 │
└──────────────────────┬──────────────────────────────────────┘
                       │
┌──────────────────────┴──────────────────────────────────────┐
│                  Transport Layer                             │
│  (RestEmitter, Patch Builders)                               │
│  - HTTP communication with DataHub                           │
│  - MCP serialization and emission                            │
│  - Patch builder integration                                 │
└─────────────────────────────────────────────────────────────┘

Design Patterns

1. Fluent Builder Pattern

Entity construction follows a fluent builder pattern that guides developers through required fields and provides IDE autocomplete support:

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("analytics.public.events")
    .env("PROD")
    .description("User events")
    .build();

Engineering Benefits:

  • Early validation - Missing required fields (platform, name) are rejected when build() is called, before anything is emitted
  • Controlled construction - The builder accumulates state; build() produces a fully initialized entity
  • Discoverability - IDE autocomplete reveals available methods
  • Extensibility - New optional parameters added without breaking existing code
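
As a rough illustration of the builder contract, here is a minimal, dependency-free sketch. SketchDataset and its Builder are hypothetical stand-ins, not the SDK's Dataset class, and the "PROD" default for env is an assumption made for this example. The required-field check runs when build() is called.

```java
// Hypothetical stand-in for an SDK entity; not the real Dataset class.
final class SketchDataset {
    private final String platform;
    private final String name;
    private final String env;
    private final String description; // optional, may be null

    private SketchDataset(Builder b) {
        this.platform = b.platform;
        this.name = b.name;
        this.env = b.env;
        this.description = b.description;
    }

    static Builder builder() {
        return new Builder();
    }

    // URN layout follows the dataset URN convention used by DataHub.
    String urn() {
        return "urn:li:dataset:(urn:li:dataPlatform:" + platform + "," + name + "," + env + ")";
    }

    String description() {
        return description;
    }

    static final class Builder {
        private String platform;
        private String name;
        private String env = "PROD"; // assumed default for this sketch
        private String description;

        Builder platform(String platform) { this.platform = platform; return this; }
        Builder name(String name) { this.name = name; return this; }
        Builder env(String env) { this.env = env; return this; }
        Builder description(String description) { this.description = description; return this; }

        // Required fields are checked here, when build() runs.
        SketchDataset build() {
            if (platform == null || name == null) {
                throw new IllegalStateException("platform and name are required");
            }
            return new SketchDataset(this);
        }
    }
}
```

Because every setter returns the builder, adding a new optional setter later does not affect existing call chains.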

2. Patch Accumulation Pattern

Rather than modifying aspects directly, mutations create patch MCPs that accumulate in a pending list:

dataset.addTag("pii")                          // Creates patch MCP
       .addOwner("user", TECHNICAL_OWNER)      // Creates patch MCP
       .addCustomProperty("retention", "90");  // Creates patch MCP

client.entities().upsert(dataset);  // Emits all accumulated patches in one call

Engineering Benefits:

  • Deferred execution - Changes are queued locally and sent together when upsert() is called
  • Single flush point - One upsert() call emits every pending patch, in the order it was added
  • Efficient transmission - Only changed fields sent over wire
  • Reuse of proven infrastructure - Leverages existing datahub.client.patch builders

Implementation Detail: The Entity base class maintains three change-tracking structures:

// From Entity.java
protected final Map<String, RecordTemplate> aspectCache;        // Cached aspects from builder
protected final List<MetadataChangeProposalWrapper> pendingMCPs; // Full aspect replacements
protected final List<MetadataChangeProposal> pendingPatches;     // Incremental patches

Each mutation (addTag, addOwner) creates a patch using existing builders:

// From Dataset.java
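public Dataset addTag(@Nonnull String tagUrn) {
    // Assumes tagUrn is a bare tag name such as "pii"; a full URN string would need parsing instead
    TagUrn tag = new TagUrn(tagUrn);
    GlobalTagsPatchBuilder patch = new GlobalTagsPatchBuilder()
        .urn(getUrn())
        .addTag(tag, null);
    addPatchMcp(patch.build());  // Adds to pendingPatches list
    return this;
}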

When EntityClient.upsert() is called, it emits everything accumulated on the entity in order:

// From EntityClient.upsert()

// Step 1: Emit cached aspects (from builder)
if (!entity.toMCPs().isEmpty()) {
    for (MetadataChangeProposalWrapper mcp : entity.toMCPs()) {
        emitter.emit(mcp);
    }
}

// Step 2: Emit pending full aspect MCPs (from set*() methods)
if (entity.hasPendingMCPs()) {
    for (MetadataChangeProposalWrapper mcp : entity.getPendingMCPs()) {
        emitter.emit(mcp);
    }
    entity.clearPendingMCPs();
}

// Step 3: Emit all pending patches (from add*/remove* methods)
if (entity.hasPendingPatches()) {
    for (MetadataChangeProposal patchMcp : entity.getPendingPatches()) {
        emitter.emit(patchMcp, null);
    }
    entity.clearPendingPatches();
}

Key insight: upsert() is not an either/or operation - it emits all accumulated changes. What gets sent depends on what you've accumulated on the entity, not which method you call.
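
The three-step ordering can be checked in isolation with a small sketch. The classes below are hypothetical stand-ins that record strings instead of building MetadataChangeProposal objects, and RecordingClient replaces the real emitter with an in-memory list; none of these names come from the SDK.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an entity that accumulates changes locally.
final class PendingChangesEntity {
    private final List<String> cachedAspects = new ArrayList<>();   // from the builder
    private final List<String> pendingMcps = new ArrayList<>();     // from set*() calls
    private final List<String> pendingPatches = new ArrayList<>();  // from add*/remove* calls

    PendingChangesEntity withCachedAspect(String aspect) {
        cachedAspects.add("aspect:" + aspect);
        return this;
    }

    PendingChangesEntity setDescription(String text) {
        pendingMcps.add("mcp:description=" + text);
        return this;
    }

    PendingChangesEntity addTag(String tag) {
        pendingPatches.add("patch:tag+" + tag);
        return this;
    }

    List<String> cachedAspects() { return cachedAspects; }
    List<String> pendingMcps() { return pendingMcps; }
    List<String> pendingPatches() { return pendingPatches; }
}

// Hypothetical stand-in for EntityClient; records what would be emitted.
final class RecordingClient {
    final List<String> emitted = new ArrayList<>();

    void upsert(PendingChangesEntity entity) {
        // Step 1: cached aspects from the builder
        emitted.addAll(entity.cachedAspects());
        // Step 2: pending full-aspect MCPs, then clear them
        emitted.addAll(entity.pendingMcps());
        entity.pendingMcps().clear();
        // Step 3: pending patches, then clear them
        emitted.addAll(entity.pendingPatches());
        entity.pendingPatches().clear();
    }
}
```

In this sketch, calling addTag before setDescription does not change the emission order: cached aspects come first, then full-aspect MCPs, then patches.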

3. Lazy Loading with TTL-Based Caching

Entities support lazy aspect loading to minimize network calls while ensuring data freshness:

// Entity maintains aspect cache with timestamps
protected final Map<String, RecordTemplate> aspectCache;
protected final Map<String, Long> aspectTimestamps;
protected long cacheTtlMs = 60000;  // 60-second default TTL

Loading Strategy:

  1. Cache-only access (getAspectCached) - Returns cached aspect or null
  2. Lazy loading (getAspectLazy) - Checks cache freshness, fetches from server if stale
  3. Get-or-create (getOrCreateAspect) - Returns cached or creates new empty aspect locally

Implementation:

protected <T extends RecordTemplate> T getAspectLazy(@Nonnull Class<T> aspectClass) {
    String aspectName = getAspectName(aspectClass);

    // Check cache freshness
    if (aspectCache.containsKey(aspectName)) {
        Long timestamp = aspectTimestamps.get(aspectName);
        if (timestamp != null && System.currentTimeMillis() - timestamp < cacheTtlMs) {
            return aspectClass.cast(aspectCache.get(aspectName));
        }
    }

    // Fetch from server if client is bound
    if (client != null) {
        T aspect = client.getAspect(urn, aspectClass);
        if (aspect != null) {
            aspectCache.put(aspectName, aspect);
            aspectTimestamps.put(aspectName, System.currentTimeMillis());
        }
        return aspect;
    }

    return null;
}

Engineering Benefits:

  • Network efficiency - Reduces redundant server calls
  • Bounded staleness - Cached aspects older than the configurable TTL are refetched on the next lazy access
  • Transparent to caller - Complexity hidden behind simple getter
  • Client binding - Entities bound to EntityClient enable lazy loading
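
The freshness check is easier to reason about (and to test) when the clock is injectable. The following is a minimal sketch under that assumption; TtlAspectCache, its loader callback, and the LongSupplier clock are hypothetical and not part of the SDK, which reads System.currentTimeMillis() directly in the code shown above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical helper: a TTL-checked cache with a pluggable clock and loader.
final class TtlAspectCache<V> {
    private final Map<String, V> values = new HashMap<>();
    private final Map<String, Long> timestamps = new HashMap<>();
    private final long ttlMs;
    private final LongSupplier clock;

    TtlAspectCache(long ttlMs, LongSupplier clock) {
        this.ttlMs = ttlMs;
        this.clock = clock;
    }

    // Returns the cached value while it is fresh; otherwise calls the loader
    // and caches a non-null result together with the current time.
    V get(String aspectName, Function<String, V> loader) {
        Long ts = timestamps.get(aspectName);
        if (ts != null && clock.getAsLong() - ts < ttlMs) {
            return values.get(aspectName);
        }
        V loaded = loader.apply(aspectName);
        if (loaded != null) {
            values.put(aspectName, loaded);
            timestamps.put(aspectName, clock.getAsLong());
        }
        return loaded;
    }
}
```

With a 60000 ms TTL, an entry stored at time 0 is served from the cache up to 59999 ms and reloaded at 60000 ms, since the comparison is a strict less-than.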

4. Mode-Aware Aspect Selection

SDK V2 distinguishes between user-initiated edits (SDK mode) and system/pipeline writes (INGESTION mode):

public enum OperationMode {
    SDK,        // Interactive use - writes to editable aspects
    INGESTION   // ETL pipelines - writes to system aspects
}

Aspect Routing:

  • SDK Mode - editableDatasetProperties, editableSchemaMetadata
  • INGESTION Mode - datasetProperties, schemaMetadata

Implementation:

public Dataset setDescription(@Nonnull String description) {
    if (isIngestionMode()) {
        return setSystemDescription(description);  // datasetProperties
    } else {
        return setEditableDescription(description); // editableDatasetProperties
    }
}

Engineering Benefits:

  • Clear provenance - Distinguishes human vs machine edits
  • UI consistency - DataHub UI shows editable aspects as user overrides
  • Non-destructive - System data preserved even when users add documentation
  • Lineage preservation - Ingestion pipelines can refresh system data without clobbering user edits
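
To make the routing concrete, here is a small sketch that only records which aspect name a description write would target. SketchMode and ModeAwareDataset are hypothetical names for this example; the aspect names are the ones listed under Aspect Routing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for OperationMode.
enum SketchMode { SDK, INGESTION }

// Hypothetical stand-in: records the target aspect instead of building it.
final class ModeAwareDataset {
    private final SketchMode mode;
    private final Map<String, String> writes = new LinkedHashMap<>();

    ModeAwareDataset(SketchMode mode) {
        this.mode = mode;
    }

    // Same public call in both modes; only the target aspect differs.
    ModeAwareDataset setDescription(String description) {
        String aspect = (mode == SketchMode.INGESTION)
            ? "datasetProperties"
            : "editableDatasetProperties";
        writes.put(aspect, description);
        return this;
    }

    Map<String, String> writes() {
        return writes;
    }
}
```

Because the two modes write to different aspects, a pipeline refresh in INGESTION mode and a user edit in SDK mode never overwrite each other in this sketch.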

5. Two Entity Lifecycle Patterns

Entities can be instantiated in two ways, each with distinct semantics:

Pattern 1: Builder Construction (New Entities)

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();
// aspectCache populated with builder-provided aspects
// aspectTimestamps empty - indicates new entity

Use case: Creating new entities from scratch

Pattern 2: Server Loading (Existing Entities)

Dataset dataset = client.entities().get(urn);
// aspectCache populated with server aspects
// aspectTimestamps records fetch time for each aspect
// Entity automatically bound to client for lazy loading

Use case: Modifying existing entities with current server state. When you access aspects not already cached, the entity will automatically fetch them from the server (lazy loading).

6. Client Binding for Lazy Loading

Entities are automatically bound to an EntityClient when loaded from the server or during upsert() to enable lazy aspect fetching:

public void bindToClient(@Nonnull EntityClient client,
                        @Nonnull OperationMode mode) {
    if (this.client == null) {
        this.client = client;
    }
    if (this.mode == null) {
        this.mode = mode;
    }
}

Binding occurs automatically during upsert():

// From EntityClient.upsert()
entity.bindToClient(this, config.getMode());

Engineering Benefits:

  • Transparent lazy loading - Aspects fetched on first access if not cached
  • Automatic binding - Entities bound to client during get() or upsert() operations
  • Mode propagation - Client mode automatically applied to entity
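
One consequence of the null checks in bindToClient is that the first binding wins. The sketch below demonstrates that behavior with hypothetical stand-ins (SketchClient, BindMode, BindableEntity) in place of EntityClient, OperationMode, and Entity.

```java
// Hypothetical stand-in for EntityClient.
final class SketchClient {
    final String name;

    SketchClient(String name) {
        this.name = name;
    }
}

// Hypothetical stand-in for OperationMode.
enum BindMode { SDK, INGESTION }

final class BindableEntity {
    private SketchClient client;
    private BindMode mode;

    // Same null checks as bindToClient above: later calls are no-ops.
    void bindToClient(SketchClient client, BindMode mode) {
        if (this.client == null) {
            this.client = client;
        }
        if (this.mode == null) {
            this.mode = mode;
        }
    }

    SketchClient client() { return client; }
    BindMode mode() { return mode; }
}
```

Under this logic, an entity loaded through one client keeps that client and mode even if it is later passed to upsert() on a differently configured client.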

Type Safety & Generic Design

Strongly-Typed Aspect Handling

SDK V2 leverages Java generics to provide compile-time type safety for aspects:

// Type-safe aspect retrieval (simplified: freshness check and server fetch omitted)
protected <T extends RecordTemplate> T getAspectLazy(@Nonnull Class<T> aspectClass) {
    String aspectName = getAspectName(aspectClass);
    RecordTemplate aspect = aspectCache.get(aspectName);
    return aspectClass.cast(aspect);
}

// Usage - compiler enforces type correctness
DatasetProperties props = dataset.getAspectLazy(DatasetProperties.class);

Engineering Benefits:

  • Compile-time checking - Type mismatches caught before runtime
  • Refactoring safety - IDE can trace aspect usages across codebase
  • Autocomplete support - IDE suggests available aspects
  • Runtime safety - ClassCastException impossible with correct usage
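
The Class<T> parameter works as a type token. A minimal sketch of the idea, using hypothetical aspect classes (SketchAspect, SketchProperties, SketchOwnership) instead of RecordTemplate subclasses, and the simple class name as the cache key (an assumption for this example; the SDK derives aspect names through getAspectName):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical aspect types standing in for RecordTemplate subclasses.
class SketchAspect {}

final class SketchProperties extends SketchAspect {
    final String description;

    SketchProperties(String description) {
        this.description = description;
    }
}

final class SketchOwnership extends SketchAspect {}

final class TypedAspectStore {
    private final Map<String, SketchAspect> byName = new HashMap<>();

    <T extends SketchAspect> void put(T aspect) {
        byName.put(aspect.getClass().getSimpleName(), aspect);
    }

    // Class<T> acts as a type token, so callers get T back without an
    // unchecked cast. Class.cast returns null when the value is null.
    <T extends SketchAspect> T get(Class<T> aspectClass) {
        return aspectClass.cast(byName.get(aspectClass.getSimpleName()));
    }
}
```

A call such as `SketchOwnership o = store.get(SketchProperties.class);` does not compile, which is the kind of mismatch the generic signature is meant to catch.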

URN Type Safety

Entity-specific URN types prevent incorrect URN usage:

public class Dataset extends Entity {
    public DatasetUrn getDatasetUrn() {
        return (DatasetUrn) urn;
    }
}

// Compile-time enforcement
DatasetUrn urn = dataset.getDatasetUrn();  // Type-safe
Urn genericUrn = dataset.getUrn();         // Also available
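
The downcast in getDatasetUrn() is only safe if a Dataset can never hold another URN type. One way to guarantee that, sketched here with hypothetical classes (SketchUrn, SketchDatasetUrn, UrnBackedEntity, UrnTypedDataset), is to accept only the specific URN type in the constructor. Whether the SDK enforces it the same way is an assumption.

```java
// Hypothetical URN types standing in for Urn and DatasetUrn.
class SketchUrn {
    private final String value;

    SketchUrn(String value) {
        this.value = value;
    }

    @Override
    public String toString() {
        return value;
    }
}

final class SketchDatasetUrn extends SketchUrn {
    private final String platform;

    SketchDatasetUrn(String platform, String name, String env) {
        super("urn:li:dataset:(urn:li:dataPlatform:" + platform + "," + name + "," + env + ")");
        this.platform = platform;
    }

    String getPlatform() {
        return platform;
    }
}

abstract class UrnBackedEntity {
    protected final SketchUrn urn;

    protected UrnBackedEntity(SketchUrn urn) {
        this.urn = urn;
    }

    SketchUrn getUrn() {
        return urn;
    }
}

final class UrnTypedDataset extends UrnBackedEntity {
    // Only a dataset URN can be passed in, which makes the downcast below safe.
    UrnTypedDataset(SketchDatasetUrn urn) {
        super(urn);
    }

    SketchDatasetUrn getDatasetUrn() {
        return (SketchDatasetUrn) urn;
    }
}
```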

Integration with Existing Infrastructure

Reuse of Patch Builders

SDK V2 reuses existing patch builders from datahub.client.patch rather than creating new implementations:

  • OwnershipPatchBuilder - Owner additions/removals
  • GlobalTagsPatchBuilder - Tag management
  • GlossaryTermsPatchBuilder - Term associations
  • DomainsPatchBuilder - Domain assignment
  • DatasetPropertiesPatchBuilder - Property updates
  • EditableDatasetPropertiesPatchBuilder - Editable property updates

Engineering Benefits:

  • Battle-tested logic - Patch builders used in production by Python SDK
  • Consistency - Same patch semantics across language SDKs
  • Maintainability - Single implementation to maintain
  • Correctness - Complex JSON Patch logic already validated

Example Integration:

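public Dataset addOwner(@Nonnull String ownerUrn, @Nonnull OwnershipType type) {
    Urn owner;
    try {
        owner = Urn.createFromString(ownerUrn);  // Throws the checked URISyntaxException
    } catch (URISyntaxException e) {
        throw new IllegalArgumentException("Invalid owner URN: " + ownerUrn, e);
    }
    OwnershipPatchBuilder patch = new OwnershipPatchBuilder()
        .urn(getUrn())
        .addOwner(owner, type);
    addPatchMcp(patch.build());  // Stores patch MCP
    return this;
}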

Leverage RestEmitter

The transport layer reuses RestEmitter for HTTP communication:

  • Non-blocking emission with futures
  • Configurable retries and timeouts
  • Token-based authentication
  • Async HTTP client pooling

No changes to RestEmitter are required - SDK V2 is purely additive.

Resource Management & Efficiency

Batched Emission

Multiple patches are accumulated locally and emitted by a single upsert() call:

dataset.addTag("tag1").addTag("tag2").addOwner("user1", TECHNICAL_OWNER);
client.entities().upsert(dataset);  // One upsert() call emits all 3 patches

Connection Pooling

RestEmitter uses CloseableHttpAsyncClient with connection pooling for efficient HTTP reuse.

Graceful Degradation

Lazy-loading failures are logged rather than propagated to the caller:

try {
    return client.getAspect(urn, aspectClass);
} catch (Exception e) {
    log.warn("Failed to lazy-load aspect {}: {}", aspectName, e.getMessage());
    return null;  // Graceful degradation
}
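
The same log-and-return-null pattern can be factored into a small helper. This is a sketch only: SafeLoader and loadOrNull are hypothetical names, and java.util.logging is used here to keep the example dependency-free (the snippet above uses an SLF4J-style logger).

```java
import java.util.function.Supplier;
import java.util.logging.Logger;

// Hypothetical helper that turns a failed fetch into a null result.
final class SafeLoader {
    private static final Logger LOG = Logger.getLogger(SafeLoader.class.getName());

    private SafeLoader() {}

    static <T> T loadOrNull(String aspectName, Supplier<T> fetch) {
        try {
            return fetch.get();
        } catch (RuntimeException e) {
            // Log and degrade instead of propagating the failure.
            LOG.warning("Failed to lazy-load aspect " + aspectName + ": " + e.getMessage());
            return null;
        }
    }
}
```

The trade-off is that callers cannot tell "aspect does not exist" from "fetch failed" without reading the log, so code that needs that distinction should not rely on the null result alone.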

Comparison: V1 vs V2 Architecture

Aspect             V1 (RestEmitter)                 V2 (DataHubClientV2)
-----------------  -------------------------------  ---------------------------
Abstraction Level  Low - MCPs                       High - Entities
URN Construction   Manual strings                   Automatic from builder
Aspect Wiring      Manual MCP building              Hidden in entity methods
Updates            Full aspect replacement          Patch-based incremental
Type Safety        Minimal - generic MCPs           Strong - typed entities
Lazy Loading       Not supported                    TTL-based caching
Mode Awareness     Not supported                    SDK vs INGESTION modes
Learning Curve     Steep - requires MCP knowledge   Gentle - intuitive builders

Performance Characteristics

Network Efficiency

  • Patch-based updates: O(changed_fields) vs O(all_fields)
  • Lazy loading: Aspects fetched only when accessed
  • Batch emission: Multiple patches sent in single flush
  • Connection reuse: HTTP client pooling

Memory Efficiency

  • Aspect caching: Only fetched aspects stored
  • TTL expiration: Stale aspects are replaced on the next lazy fetch, leaving the old copy eligible for GC
  • Lazy instantiation: Aspects created on-demand

Time Complexity

  • Entity creation: O(1) - builder accumulation
  • Patch addition: O(1) - append to list
  • Upsert operation: O(n) where n = cached aspects + pending MCPs + pending patches
  • Lazy fetch: O(1) cache lookup, plus one network round-trip on a miss or stale entry

Extension Points

SDK V2 is designed for extensibility:

  1. New entity types - Extend Entity base class
  2. Custom aspects - Use getAspectLazy / getOrCreateAspect
  3. New patch types - Leverage existing patch builders
  4. Custom caching - Override cacheTtlMs
  5. Transport customization - Customize RestEmitter via builder

Summary

Java SDK V2 achieves its goals through principled design:

  • Reuse over reinvention - Leverages existing patch builders and RestEmitter
  • Patches over replacements - Efficient incremental updates
  • Lazy over eager - Aspects fetched on-demand with caching
  • Type safety over convenience - Strong typing throughout
  • Layers over monoliths - Clear separation of entity, operations, transport
  • Pragmatism over purity - Mode-aware behavior matches real-world usage

The result is an SDK that feels natural to Java developers while providing the efficiency and correctness required for production metadata management at scale.