mirror of https://github.com/datahub-project/datahub.git synced 2025-12-19 14:08:38 +00:00

feat(java-sdk): Add Java SDK V2 with fluent builder API and entity support (#15307)

2025-12-03 05:54:02 -08:00

Design Principles of Java SDK V2

This document provides an architectural overview of DataHub Java SDK V2, exploring the engineering principles and design patterns that enable its type-safe, efficient metadata management capabilities.

Architectural Philosophy

SDK V2 is built on a foundation of pragmatic reuse, intelligent caching, and layered abstractions. Rather than reinventing infrastructure, it composes proven components into a coherent, intuitive API while introducing new patterns for efficient metadata operations.

Core Tenets

Leverage Existing Infrastructure - Build atop battle-tested components
Type Safety as a First-Class Concern - Exploit Java's type system for compile-time correctness
Separation of Concerns - Clear boundaries between entity, operations, and transport layers
Efficiency Through Patches - Surgical updates over full replacements
Intelligent Resource Management - Lazy loading, caching, and batching

Layer Architecture

SDK V2 employs a three-layer architecture with clear separation of responsibilities:

┌─────────────────────────────────────────────────────────────┐
│                    Entity Layer                              │
│  (Dataset, Chart, Dashboard - Business Logic)                │
│  - Fluent builders for entity construction                   │
│  - Patch accumulation and aspect management                  │
│  - Mode-aware behavior (SDK vs INGESTION)                    │
└──────────────────────┬──────────────────────────────────────┘
                       │
┌──────────────────────┴──────────────────────────────────────┐
│                 Operations Layer                             │
│  (EntityClient - CRUD Operations)                            │
│  - Entity lifecycle management                               │
│  - Patch vs full aspect emission logic                       │
│  - Lazy loading coordination                                 │
└──────────────────────┬──────────────────────────────────────┘
                       │
┌──────────────────────┴──────────────────────────────────────┐
│                  Transport Layer                             │
│  (RestEmitter, Patch Builders)                               │
│  - HTTP communication with DataHub                           │
│  - MCP serialization and emission                            │
│  - Patch builder integration                                 │
└─────────────────────────────────────────────────────────────┘

Design Patterns

1. Fluent Builder Pattern

Entity construction follows a fluent builder pattern that guides developers through required fields and provides IDE autocomplete support:

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("analytics.public.events")
    .env("PROD")
    .description("User events")
    .build();

Engineering Benefits:

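Early validation - Missing required fields (platform, name) can be rejected in build(), before anything is sent to the server (a plain fluent builder cannot enforce this at compile time)
Controlled construction - Builder accumulates state; build() creates a fully initialized entity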
Discoverability - IDE autocomplete reveals available methods
Extensibility - New optional parameters added without breaking existing code
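
As an illustration of the pattern, here is a minimal, self-contained sketch of a fluent builder that checks its required fields in build(). DatasetSketch, its "PROD" default, and the exception type are assumptions made for this example; the real Dataset builder may validate differently.

```java
// Hypothetical stand-in for an SDK entity; not the real Dataset class.
final class DatasetSketch {
    private final String platform;
    private final String name;
    private final String env;
    private final String description; // optional, may be null

    private DatasetSketch(Builder b) {
        this.platform = b.platform;
        this.name = b.name;
        this.env = b.env;
        this.description = b.description;
    }

    static Builder builder() {
        return new Builder();
    }

    // Dataset URN layout: urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
    String urn() {
        return "urn:li:dataset:(urn:li:dataPlatform:" + platform + "," + name + "," + env + ")";
    }

    String description() {
        return description;
    }

    static final class Builder {
        private String platform;
        private String name;
        private String env = "PROD"; // assumed default for this sketch
        private String description;

        Builder platform(String v) { this.platform = v; return this; }
        Builder name(String v) { this.name = v; return this; }
        Builder env(String v) { this.env = v; return this; }
        Builder description(String v) { this.description = v; return this; }

        // Required fields are checked here, at run time, when the entity is built.
        DatasetSketch build() {
            if (platform == null || name == null) {
                throw new IllegalStateException("platform and name are required");
            }
            return new DatasetSketch(this);
        }
    }
}
```

Because every setter returns the builder, optional fields such as description can be added later without changing existing call sites.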

2. Patch Accumulation Pattern

Rather than modifying aspects directly, mutations create patch MCPs that accumulate in a pending list:

dataset.addTag("pii")                          // Creates patch MCP
       .addOwner("user", TECHNICAL_OWNER)      // Creates patch MCP
       .addCustomProperty("retention", "90");  // Creates patch MCP

client.entities().upsert(dataset);  // Emits all accumulated patches in one call

Engineering Benefits:

Deferred execution - Changes accumulate locally and are sent only when upsert() is called
Grouped updates - All pending patches are emitted together by one upsert() call, in the order they were added
Efficient transmission - Only changed fields sent over wire
Reuse of proven infrastructure - Leverages existing datahub.client.patch builders

Implementation Detail: Entity base class maintains multiple change tracking mechanisms:

// From Entity.java
protected final Map<String, RecordTemplate> aspectCache;        // Cached aspects from builder
protected final List<MetadataChangeProposalWrapper> pendingMCPs; // Full aspect replacements
protected final List<MetadataChangeProposal> pendingPatches;     // Incremental patches

Each mutation (addTag, addOwner) creates a patch using existing builders:

// From Dataset.java
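public Dataset addTag(@Nonnull String tagUrn) {
    // Parse the string into a typed URN (checked-exception handling omitted for brevity)
    TagUrn tag = TagUrn.createFromString(tagUrn);
    GlobalTagsPatchBuilder patch = new GlobalTagsPatchBuilder()
        .urn(getUrn())
        .addTag(tag, null);  // null = no tag context
    addPatchMcp(patch.build());  // Adds to pendingPatches list
    return this;
}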

When EntityClient.upsert() is called, it emits everything accumulated on the entity in order:

// From EntityClient.upsert()

// Step 1: Emit cached aspects (from builder)
if (!entity.toMCPs().isEmpty()) {
    for (MetadataChangeProposalWrapper mcp : entity.toMCPs()) {
        emitter.emit(mcp);
    }
}

// Step 2: Emit pending full aspect MCPs (from set*() methods)
if (entity.hasPendingMCPs()) {
    for (MetadataChangeProposalWrapper mcp : entity.getPendingMCPs()) {
        emitter.emit(mcp);
    }
    entity.clearPendingMCPs();
}

// Step 3: Emit all pending patches (from add*/remove* methods)
if (entity.hasPendingPatches()) {
    for (MetadataChangeProposal patchMcp : entity.getPendingPatches()) {
        emitter.emit(patchMcp, null);
    }
    entity.clearPendingPatches();
}

Key insight: upsert() is not an either/or operation - it emits all accumulated changes. What gets sent depends on what you've accumulated on the entity, not which method you call.
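
The emission order described above can be modeled without any DataHub dependencies. The following is a simplified sketch in which plain strings stand in for MCPs and patches; PendingChangesSketch and its method names are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of an entity's change tracking.
final class PendingChangesSketch {
    private final List<String> cachedAspects = new ArrayList<>();  // from the builder
    private final List<String> pendingMcps = new ArrayList<>();    // from set*() calls
    private final List<String> pendingPatches = new ArrayList<>(); // from add*/remove* calls

    PendingChangesSketch withBuilderAspect(String aspect) {
        cachedAspects.add("aspect:" + aspect);
        return this;
    }

    PendingChangesSketch set(String aspect) {
        pendingMcps.add("full:" + aspect);
        return this;
    }

    PendingChangesSketch add(String change) {
        pendingPatches.add("patch:" + change);
        return this;
    }

    // Mirrors the three steps: cached aspects, then full MCPs, then patches.
    // As in the snippet above, only the two pending lists are cleared.
    List<String> upsert() {
        List<String> emitted = new ArrayList<>(cachedAspects);
        emitted.addAll(pendingMcps);
        emitted.addAll(pendingPatches);
        pendingMcps.clear();
        pendingPatches.clear();
        return emitted;
    }
}
```

Note that the order of emission depends on which list a change landed in, not on the order in which the calls were made: a set() issued after an add() is still emitted before it.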

3. Lazy Loading with TTL-Based Caching

Entities support lazy aspect loading to minimize network calls while ensuring data freshness:

// Entity maintains aspect cache with timestamps
protected final Map<String, RecordTemplate> aspectCache;
protected final Map<String, Long> aspectTimestamps;
protected long cacheTtlMs = 60000;  // 60-second default TTL

Loading Strategy:

Cache-only access (getAspectCached) - Returns cached aspect or null
Lazy loading (getAspectLazy) - Checks cache freshness, fetches from server if stale
Get-or-create (getOrCreateAspect) - Returns cached or creates new empty aspect locally

Implementation:

protected <T extends RecordTemplate> T getAspectLazy(@Nonnull Class<T> aspectClass) {
    String aspectName = getAspectName(aspectClass);

    // Check cache freshness
    if (aspectCache.containsKey(aspectName)) {
        Long timestamp = aspectTimestamps.get(aspectName);
        if (timestamp != null && System.currentTimeMillis() - timestamp < cacheTtlMs) {
            return aspectClass.cast(aspectCache.get(aspectName));
        }
    }

    // Fetch from server if client is bound
    if (client != null) {
        T aspect = client.getAspect(urn, aspectClass);
        if (aspect != null) {
            aspectCache.put(aspectName, aspect);
            aspectTimestamps.put(aspectName, System.currentTimeMillis());
        }
        return aspect;
    }

    return null;
}

Engineering Benefits:

Network efficiency - Reduces redundant server calls
Bounded staleness - Configurable TTL limits how long a cached aspect is served before it is refetched
Transparent to caller - Complexity hidden behind simple getter
Client binding - Entities bound to EntityClient enable lazy loading
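
To make the TTL check concrete, here is a self-contained sketch of the same freshness logic with an injectable clock, so the expiry boundary can be exercised deterministically. TtlAspectCacheSketch, the String-valued aspects, and the fetcher function are assumptions for this example and are not part of the SDK.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical stand-in for the entity's aspect cache.
final class TtlAspectCacheSketch {
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, Long> timestamps = new HashMap<>();
    private final long ttlMs;
    private final LongSupplier clock;               // injectable time source
    private final Function<String, String> fetcher; // stands in for the server call
    int fetchCount = 0;                             // exposed for illustration only

    TtlAspectCacheSketch(long ttlMs, LongSupplier clock, Function<String, String> fetcher) {
        this.ttlMs = ttlMs;
        this.clock = clock;
        this.fetcher = fetcher;
    }

    String getLazy(String aspectName) {
        long now = clock.getAsLong();
        Long fetchedAt = timestamps.get(aspectName);
        // Serve from cache only while the entry is younger than the TTL.
        if (fetchedAt != null && now - fetchedAt < ttlMs) {
            return values.get(aspectName);
        }
        // Missing or stale: fetch, then record the value and its fetch time.
        fetchCount++;
        String fetched = fetcher.apply(aspectName);
        if (fetched != null) {
            values.put(aspectName, fetched);
            timestamps.put(aspectName, now);
        }
        return fetched;
    }
}
```

With a 60000 ms TTL, an entry fetched at time 0 is still served from cache at 59999 ms and is refetched at 60000 ms, because the comparison is strict.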

4. Mode-Aware Aspect Selection

SDK V2 distinguishes between user-initiated edits (SDK mode) and system/pipeline writes (INGESTION mode):

public enum OperationMode {
    SDK,        // Interactive use - writes to editable aspects
    INGESTION   // ETL pipelines - writes to system aspects
}

Aspect Routing:

SDK Mode → editableDatasetProperties, editableSchemaMetadata
INGESTION Mode → datasetProperties, schemaMetadata

Implementation:

public Dataset setDescription(@Nonnull String description) {
    if (isIngestionMode()) {
        return setSystemDescription(description);  // datasetProperties
    } else {
        return setEditableDescription(description); // editableDatasetProperties
    }
}

Engineering Benefits:

Clear provenance - Distinguishes human vs machine edits
UI consistency - DataHub UI shows editable aspects as user overrides
Non-destructive - System data preserved even when users add documentation
Edit preservation - Ingestion pipelines can refresh system data without clobbering user edits
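
The non-destructive behavior can be sketched with a small in-memory model. ModeSketch and DescriptionStoreSketch are hypothetical names, and the rule that an editable value overrides the system value on display is an assumption drawn from the UI behavior described above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical mirror of the SDK's operation modes.
enum ModeSketch { SDK, INGESTION }

// Hypothetical in-memory store keyed by aspect name.
final class DescriptionStoreSketch {
    private final Map<String, String> aspects = new HashMap<>();

    // Routes the write to the aspect that matches the mode.
    void setDescription(ModeSketch mode, String text) {
        String aspect = (mode == ModeSketch.INGESTION)
            ? "datasetProperties"
            : "editableDatasetProperties";
        aspects.put(aspect, text);
    }

    String systemDescription() {
        return aspects.get("datasetProperties");
    }

    // Assumed display rule: the editable value wins when present.
    String displayedDescription() {
        String editable = aspects.get("editableDatasetProperties");
        return editable != null ? editable : aspects.get("datasetProperties");
    }
}
```

Because each mode writes to its own aspect, a later ingestion run updates the system description while the user's text remains what is displayed.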

5. Two Entity Lifecycle Patterns

Entities can be instantiated in two ways, each with distinct semantics:

Pattern 1: Builder Construction (New Entities)

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();
// aspectCache populated with builder-provided aspects
// aspectTimestamps empty - indicates new entity

Use case: Creating new entities from scratch

Pattern 2: Server Loading (Existing Entities)

Dataset dataset = client.entities().get(urn);
// aspectCache populated with server aspects
// aspectTimestamps records fetch time for each aspect
// Entity automatically bound to client for lazy loading

Use case: Modifying existing entities with current server state. When you access aspects not already cached, the entity will automatically fetch them from the server (lazy loading).

6. Client Binding for Lazy Loading

Entities are automatically bound to an EntityClient when loaded from the server or during upsert() to enable lazy aspect fetching:

public void bindToClient(@Nonnull EntityClient client,
                        @Nonnull OperationMode mode) {
    if (this.client == null) {
        this.client = client;
    }
    if (this.mode == null) {
        this.mode = mode;
    }
}

Binding occurs automatically during upsert():

// From EntityClient.upsert()
entity.bindToClient(this, config.getMode());

Engineering Benefits:

Transparent lazy loading - Aspects fetched on first access if not cached
Automatic binding - Entities bound to client during get() or upsert() operations
Mode propagation - Client mode automatically applied to entity
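
One consequence of the null checks in bindToClient() is that the first binding wins and later calls are no-ops. The sketch below shows this with stand-in types; BindableEntitySketch is hypothetical, with Object and String replacing EntityClient and OperationMode.

```java
// Hypothetical stand-in for an entity that can be bound to a client.
final class BindableEntitySketch {
    private Object client; // stands in for EntityClient
    private String mode;   // stands in for OperationMode

    // Same shape as bindToClient(): each field is set only if still unset.
    void bindToClient(Object client, String mode) {
        if (this.client == null) {
            this.client = client;
        }
        if (this.mode == null) {
            this.mode = mode;
        }
    }

    boolean isBound() {
        return client != null;
    }

    Object client() {
        return client;
    }

    String mode() {
        return mode;
    }
}
```

Under this model, an entity loaded through one client and later passed to another client's upsert() keeps its original client and mode.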

Type Safety & Generic Design

Strongly-Typed Aspect Handling

SDK V2 leverages Java generics to provide compile-time type safety for aspects:

// Type-safe aspect retrieval
protected <T extends RecordTemplate> T getAspectLazy(@Nonnull Class<T> aspectClass) {
    String aspectName = getAspectName(aspectClass);
    RecordTemplate aspect = aspectCache.get(aspectName);
    return aspectClass.cast(aspect);
}

// Usage - compiler enforces type correctness
DatasetProperties props = dataset.getAspectLazy(DatasetProperties.class);

Engineering Benefits:

Compile-time checking - Type mismatches caught before runtime
Refactoring safety - IDE can trace aspect usages across codebase
Autocomplete support - IDE suggests available aspects
Runtime safety - Class.cast() replaces unchecked casts, so a type mismatch fails at the cast site rather than later
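
The Class<T>-token technique can be tried in isolation. In this sketch, TypedAspectCacheSketch is a hypothetical class, and keying the map by the class's simple name is an assumption standing in for getAspectName().

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache that uses a Class<T> token for typed retrieval.
final class TypedAspectCacheSketch {
    private final Map<String, Object> cache = new HashMap<>();

    // The generic signature ties the value's type to the token at compile time.
    <T> void put(Class<T> type, T value) {
        cache.put(type.getSimpleName(), value);
    }

    // Class.cast() performs a checked cast; it returns null for a null input.
    <T> T get(Class<T> type) {
        return type.cast(cache.get(type.getSimpleName()));
    }
}
```

Callers receive the requested type directly, with no cast at the call site; asking for a type that was never stored returns null.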

URN Type Safety

Entity-specific URN types prevent incorrect URN usage:

public class Dataset extends Entity {
    public DatasetUrn getDatasetUrn() {
        return (DatasetUrn) urn;
    }
}

// Compile-time enforcement
DatasetUrn urn = dataset.getDatasetUrn();  // Type-safe
Urn genericUrn = dataset.getUrn();         // Also available

Integration with Existing Infrastructure

Reuse of Patch Builders

SDK V2 reuses existing patch builders from datahub.client.patch rather than creating new implementations:

OwnershipPatchBuilder - Owner additions/removals
GlobalTagsPatchBuilder - Tag management
GlossaryTermsPatchBuilder - Term associations
DomainsPatchBuilder - Domain assignment
DatasetPropertiesPatchBuilder - Property updates
EditableDatasetPropertiesPatchBuilder - Editable property updates

Engineering Benefits:

Battle-tested logic - Patch builders are already used in production
Consistency - Same patch semantics across language SDKs
Maintainability - Single implementation to maintain
Correctness - Complex JSON Patch logic already validated

Example Integration:

public Dataset addOwner(@Nonnull String ownerUrn, @Nonnull OwnershipType type) {
    Urn owner = Urn.createFromString(ownerUrn);
    OwnershipPatchBuilder patch = new OwnershipPatchBuilder()
        .urn(getUrn())
        .addOwner(owner, type);
    addPatchMcp(patch.build());  // Stores patch MCP
    return this;
}

Leverage RestEmitter

Transport layer reuses RestEmitter for HTTP communication:

Non-blocking emission with futures
Configurable retries and timeouts
Token-based authentication
Async HTTP client pooling

No changes to RestEmitter - SDK V2 is purely additive.

Resource Management & Efficiency

Batched Emission

Multiple patches are accumulated and emitted together by a single upsert() call:

dataset.addTag("tag1").addTag("tag2").addOwner("user1", TECHNICAL_OWNER);
client.entities().upsert(dataset);  // One upsert() call emits all 3 patches

Connection Pooling

RestEmitter uses CloseableHttpAsyncClient with connection pooling for efficient HTTP reuse.

Graceful Degradation

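Lazy-loading failures are logged rather than propagated, so a failed fetch yields null instead of an exception:

try {
    return client.getAspect(urn, aspectClass);
} catch (Exception e) {
    log.warn("Failed to lazy-load aspect {}: {}", aspectName, e.getMessage());
    return null;  // Graceful degradation
}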

Comparison: V1 vs V2 Architecture

Aspect	V1 (RestEmitter)	V2 (DataHubClientV2)
Abstraction Level	Low - MCPs	High - Entities
URN Construction	Manual strings	Automatic from builder
Aspect Wiring	Manual MCP building	Hidden in entity methods
Updates	Full aspect replacement	Patch-based incremental
Type Safety	Minimal - generic MCPs	Strong - typed entities
Lazy Loading	Not supported	TTL-based caching
Mode Awareness	Not supported	SDK vs INGESTION modes
Learning Curve	Steep - requires MCP knowledge	Gentle - intuitive builders

Performance Characteristics

Network Efficiency

Patch-based updates: O(changed_fields) vs O(all_fields)
Lazy loading: Aspects fetched only when accessed
Batch emission: Multiple patches emitted by a single upsert() call
Connection reuse: HTTP client pooling

Memory Efficiency

Aspect caching: Only fetched aspects stored
TTL expiration: Stale aspects are refetched and overwritten on next access
Lazy instantiation: Aspects created on-demand

Time Complexity

Entity creation: O(1) - builder accumulation
Patch addition: O(1) - append to list
Upsert operation: O(n) where n = cached aspects + pending MCPs + pending patches
Lazy fetch: O(1) cache lookup, plus one network round-trip on a miss or stale entry

Extension Points

SDK V2 is designed for extensibility:

New entity types - Extend Entity base class
Custom aspects - Use getAspectLazy / getOrCreateAspect
New patch types - Leverage existing patch builders
Custom caching - Adjust cacheTtlMs
Transport customization - Customize RestEmitter via builder

Summary

Java SDK V2 achieves its goals through principled design:

Reuse over reinvention - Leverages existing patch builders and RestEmitter
Patches over replacements - Efficient incremental updates
Lazy over eager - Aspects fetched on-demand with caching
Type safety over convenience - Strong typing throughout
Layers over monoliths - Clear separation of entity, operations, transport
Pragmatism over purity - Mode-aware behavior matches real-world usage

The result is an SDK that feels natural to Java developers while providing the efficiency and correctness required for production metadata management at scale.
