# Design Principles of Java SDK V2

This document provides an architectural overview of DataHub Java SDK V2, exploring the engineering principles and design patterns that enable its type-safe, efficient metadata management capabilities.

## Architectural Philosophy

SDK V2 is built on a foundation of **pragmatic reuse, intelligent caching, and layered abstractions**. Rather than reinventing infrastructure, it composes proven components into a coherent, intuitive API while introducing new patterns for efficient metadata operations.

### Core Tenets

1. **Leverage Existing Infrastructure** - Build atop battle-tested components
2. **Type Safety as a First-Class Concern** - Exploit Java's type system for compile-time correctness
3. **Separation of Concerns** - Clear boundaries between entity, operations, and transport layers
4. **Efficiency Through Patches** - Surgical updates over full replacements
5. **Intelligent Resource Management** - Lazy loading, caching, and batching

## Layer Architecture

SDK V2 employs a three-layer architecture with clear separation of responsibilities:

```
┌─────────────────────────────────────────────────────────────┐
│                        Entity Layer                         │
│        (Dataset, Chart, Dashboard - Business Logic)         │
│  - Fluent builders for entity construction                  │
│  - Patch accumulation and aspect management                 │
│  - Mode-aware behavior (SDK vs INGESTION)                   │
└──────────────────────┬──────────────────────────────────────┘
                       │
┌──────────────────────┴──────────────────────────────────────┐
│                      Operations Layer                       │
│              (EntityClient - CRUD Operations)               │
│  - Entity lifecycle management                              │
│  - Patch vs full aspect emission logic                      │
│  - Lazy loading coordination                                │
└──────────────────────┬──────────────────────────────────────┘
                       │
┌──────────────────────┴──────────────────────────────────────┐
│                       Transport Layer                       │
│                (RestEmitter, Patch Builders)                │
│  - HTTP communication with DataHub                          │
│  - MCP serialization and emission                           │
│  - Patch builder integration                                │
└─────────────────────────────────────────────────────────────┘
```

## Design Patterns

### 1. Fluent Builder Pattern

Entity construction follows a **fluent builder pattern** that guides developers through required fields and provides IDE autocomplete support:

```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("analytics.public.events")
    .env("PROD")
    .description("User events")
    .build();
```

**Engineering Benefits:**

- **Compile-time validation** - Missing required fields (platform, name) fail at compilation
- **Immutable construction** - Builder accumulates state; `build()` creates immutable entity
- **Discoverability** - IDE autocomplete reveals available methods
- **Extensibility** - New optional parameters added without breaking existing code

### 2. Patch Accumulation Pattern

Rather than modifying aspects directly, mutations create **patch MCPs** that accumulate in a pending list:

```java
dataset.addTag("pii")                       // Creates patch MCP
    .addOwner("user", TECHNICAL_OWNER)      // Creates patch MCP
    .addCustomProperty("retention", "90");  // Creates patch MCP

client.entities().upsert(dataset);          // Emits all patches atomically
```

**Engineering Benefits:**

- **Deferred execution** - Batches multiple changes into a single network round-trip
- **Atomic updates** - All patches applied together or none
- **Efficient transmission** - Only changed fields sent over wire
- **Reuse of proven infrastructure** - Leverages existing `datahub.client.patch` builders

**Implementation Detail:**

The `Entity` base class maintains multiple change-tracking mechanisms:

```java
// From Entity.java
protected final Map<String, RecordTemplate> aspectCache;          // Cached aspects from builder
protected final List<MetadataChangeProposalWrapper> pendingMCPs;  // Full aspect replacements
protected final List<MetadataChangeProposal> pendingPatches;      // Incremental patches
```

Each mutation (`addTag`, `addOwner`) creates a patch using existing builders:

```java
// From Dataset.java
public Dataset addTag(@Nonnull String tagUrn) {
    // `tag` is the TagUrn form of tagUrn; the conversion is omitted from this excerpt
    GlobalTagsPatchBuilder patch = new
        GlobalTagsPatchBuilder()
            .urn(getUrn())
            .addTag(tag, null);
    addPatchMcp(patch.build());  // Adds to pendingPatches list
    return this;
}
```

When `EntityClient.upsert()` is called, it emits **everything** accumulated on the entity in order:

```java
// From EntityClient.upsert()

// Step 1: Emit cached aspects (from builder)
if (!entity.toMCPs().isEmpty()) {
    for (MetadataChangeProposalWrapper mcp : entity.toMCPs()) {
        emitter.emit(mcp);
    }
}

// Step 2: Emit pending full aspect MCPs (from set*() methods)
if (entity.hasPendingMCPs()) {
    for (MetadataChangeProposalWrapper mcp : entity.getPendingMCPs()) {
        emitter.emit(mcp);
    }
    entity.clearPendingMCPs();
}

// Step 3: Emit all pending patches (from add*/remove* methods)
if (entity.hasPendingPatches()) {
    for (MetadataChangeProposal patchMcp : entity.getPendingPatches()) {
        emitter.emit(patchMcp, null);
    }
    entity.clearPendingPatches();
}
```

**Key insight:** `upsert()` is not an either/or operation - it emits **all** accumulated changes. What gets sent depends on what you've accumulated on the entity, not which method you call.

### 3. Lazy Loading with TTL-Based Caching

Entities support **lazy aspect loading** to minimize network calls while ensuring data freshness:

```java
// Entity maintains aspect cache with timestamps
protected final Map<String, RecordTemplate> aspectCache;
protected final Map<String, Long> aspectTimestamps;
protected long cacheTtlMs = 60000;  // 60-second default TTL
```

**Loading Strategy:**

1. **Cache-only access** (`getAspectCached`) - Returns cached aspect or null
2. **Lazy loading** (`getAspectLazy`) - Checks cache freshness, fetches from server if stale
3.
   **Get-or-create** (`getOrCreateAspect`) - Returns cached or creates new empty aspect locally

**Implementation:**

```java
protected <T extends RecordTemplate> T getAspectLazy(@Nonnull Class<T> aspectClass) {
    String aspectName = getAspectName(aspectClass);

    // Check cache freshness
    if (aspectCache.containsKey(aspectName)) {
        Long timestamp = aspectTimestamps.get(aspectName);
        if (timestamp != null && System.currentTimeMillis() - timestamp < cacheTtlMs) {
            return aspectClass.cast(aspectCache.get(aspectName));
        }
    }

    // Fetch from server if client is bound
    if (client != null) {
        T aspect = client.getAspect(urn, aspectClass);
        if (aspect != null) {
            aspectCache.put(aspectName, aspect);
            aspectTimestamps.put(aspectName, System.currentTimeMillis());
        }
        return aspect;
    }

    return null;
}
```

**Engineering Benefits:**

- **Network efficiency** - Reduces redundant server calls
- **Freshness guarantee** - Configurable TTL ensures data isn't stale
- **Transparent to caller** - Complexity hidden behind simple getter
- **Client binding** - Entities bound to EntityClient enable lazy loading

### 4. Mode-Aware Aspect Selection

SDK V2 distinguishes between **user-initiated edits** (SDK mode) and **system/pipeline writes** (INGESTION mode):

```java
public enum OperationMode {
    SDK,        // Interactive use - writes to editable aspects
    INGESTION   // ETL pipelines - writes to system aspects
}
```

**Aspect Routing:**

- **SDK Mode** → `editableDatasetProperties`, `editableSchemaMetadata`
- **INGESTION Mode** → `datasetProperties`, `schemaMetadata`

**Implementation:**

```java
public Dataset setDescription(@Nonnull String description) {
    if (isIngestionMode()) {
        return setSystemDescription(description);    // datasetProperties
    } else {
        return setEditableDescription(description);  // editableDatasetProperties
    }
}
```

**Engineering Benefits:**

- **Clear provenance** - Distinguishes human vs machine edits
- **UI consistency** - DataHub UI shows editable aspects as user overrides
- **Non-destructive** - System data preserved even when users add documentation
- **Lineage preservation** - Ingestion pipelines can refresh system data without clobbering user edits

### 5. Two Entity Lifecycle Patterns

Entities can be instantiated in two ways, each with distinct semantics:

#### **Pattern 1: Builder Construction (New Entities)**

```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

// aspectCache populated with builder-provided aspects
// aspectTimestamps empty - indicates new entity
```

**Use case:** Creating new entities from scratch

#### **Pattern 2: Server Loading (Existing Entities)**

```java
Dataset dataset = client.entities().get(urn);

// aspectCache populated with server aspects
// aspectTimestamps records fetch time for each aspect
// Entity automatically bound to client for lazy loading
```

**Use case:** Modifying existing entities with current server state. When you access aspects not already cached, the entity will automatically fetch them from the server (lazy loading).

### 6. Client Binding for Lazy Loading

Entities are **automatically bound to an EntityClient** when loaded from the server or during `upsert()` to enable lazy aspect fetching:

```java
public void bindToClient(@Nonnull EntityClient client, @Nonnull OperationMode mode) {
    if (this.client == null) {
        this.client = client;
    }
    if (this.mode == null) {
        this.mode = mode;
    }
}
```

**Binding occurs automatically** during `upsert()`:

```java
// From EntityClient.upsert()
entity.bindToClient(this, config.getMode());
```

**Engineering Benefits:**

- **Transparent lazy loading** - Aspects fetched on first access if not cached
- **Automatic binding** - Entities bound to client during `get()` or `upsert()` operations
- **Mode propagation** - Client mode automatically applied to entity

## Type Safety & Generic Design

### Strongly-Typed Aspect Handling

SDK V2 leverages Java generics to provide compile-time type safety for aspects:

```java
// Type-safe aspect retrieval
protected <T extends RecordTemplate> T getAspectLazy(@Nonnull Class<T> aspectClass) {
    String aspectName = getAspectName(aspectClass);
    RecordTemplate aspect = aspectCache.get(aspectName);
    return aspectClass.cast(aspect);
}

// Usage - compiler enforces type correctness
DatasetProperties props = dataset.getAspectLazy(DatasetProperties.class);
```

**Engineering Benefits:**

- **Compile-time checking** - Type mismatches caught before runtime
- **Refactoring safety** - IDE can trace aspect usages across codebase
- **Autocomplete support** - IDE suggests available aspects
- **Runtime safety** - `ClassCastException` impossible with correct usage

### URN Type Safety

Entity-specific URN types prevent incorrect URN usage:

```java
public class Dataset extends Entity {
    public DatasetUrn getDatasetUrn() {
        return (DatasetUrn) urn;
    }
}

// Compile-time enforcement
DatasetUrn urn = dataset.getDatasetUrn();  // Type-safe
Urn genericUrn = dataset.getUrn();         // Also available
```

## Integration with Existing Infrastructure

### Reuse of Patch Builders

SDK V2 **reuses existing patch
builders** from `datahub.client.patch` rather than creating new implementations:

- `OwnershipPatchBuilder` - Owner additions/removals
- `GlobalTagsPatchBuilder` - Tag management
- `GlossaryTermsPatchBuilder` - Term associations
- `DomainsPatchBuilder` - Domain assignment
- `DatasetPropertiesPatchBuilder` - Property updates
- `EditableDatasetPropertiesPatchBuilder` - Editable property updates

**Engineering Benefits:**

- **Battle-tested logic** - Patch builders used in production by Python SDK
- **Consistency** - Same patch semantics across language SDKs
- **Maintainability** - Single implementation to maintain
- **Correctness** - Complex JSON Patch logic already validated

**Example Integration:**

```java
public Dataset addOwner(@Nonnull String ownerUrn, @Nonnull OwnershipType type) {
    Urn owner = Urn.createFromString(ownerUrn);
    OwnershipPatchBuilder patch = new OwnershipPatchBuilder()
        .urn(getUrn())
        .addOwner(owner, type);
    addPatchMcp(patch.build());  // Stores patch MCP
    return this;
}
```

### Leverage RestEmitter

The transport layer reuses `RestEmitter` for HTTP communication:

- Non-blocking emission with futures
- Configurable retries and timeouts
- Token-based authentication
- Async HTTP client pooling

**No changes to RestEmitter** - SDK V2 is purely additive.

## Resource Management & Efficiency

### Batched Emission

Multiple patches are accumulated and emitted atomically:

```java
dataset.addTag("tag1").addTag("tag2").addOwner("user1", OWNER);
client.entities().upsert(dataset);  // Single network call, 3 patches
```

### Connection Pooling

RestEmitter uses `CloseableHttpAsyncClient` with connection pooling for efficient HTTP reuse.
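The accumulate-then-flush flow described under Batched Emission can be sketched without any DataHub dependencies. The example below is a minimal illustration under stated assumptions, not the SDK's implementation: `PatchAccumulationSketch`, `TrackedEntity`, `PendingPatch`, and `flush` are hypothetical names, and the `Consumer` sink merely stands in for an emitter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative only: these types are hypothetical stand-ins, not DataHub SDK classes.
class PatchAccumulationSketch {

    /** One pending change: the entity URN, target aspect, operation, and value. */
    static final class PendingPatch {
        final String urn;
        final String aspectName;
        final String op;
        final String value;

        PendingPatch(String urn, String aspectName, String op, String value) {
            this.urn = urn;
            this.aspectName = aspectName;
            this.op = op;
            this.value = value;
        }
    }

    /** Records mutations locally; nothing leaves the entity until flush(). */
    static final class TrackedEntity {
        private final String urn;
        private final List<PendingPatch> pendingPatches = new ArrayList<>();

        TrackedEntity(String urn) {
            this.urn = urn;
        }

        TrackedEntity addTag(String tag) {
            pendingPatches.add(new PendingPatch(urn, "globalTags", "add", tag));
            return this; // returning this enables fluent chaining
        }

        TrackedEntity addOwner(String owner) {
            pendingPatches.add(new PendingPatch(urn, "ownership", "add", owner));
            return this;
        }

        int pendingCount() {
            return pendingPatches.size();
        }

        /** Passes each pending patch to the sink in insertion order, then clears the list. */
        int flush(Consumer<PendingPatch> sink) {
            int emitted = 0;
            for (PendingPatch patch : pendingPatches) {
                sink.accept(patch);
                emitted++;
            }
            pendingPatches.clear();
            return emitted;
        }
    }

    public static void main(String[] args) {
        List<PendingPatch> wire = new ArrayList<>(); // stands in for the emitter
        TrackedEntity dataset = new TrackedEntity("urn:li:dataset:example")
            .addTag("tag1")
            .addTag("tag2")
            .addOwner("user1");

        int emitted = dataset.flush(wire::add);
        System.out.println(emitted + " emitted, " + dataset.pendingCount() + " pending");
        // prints "3 emitted, 0 pending"
    }
}
```

Clearing the list inside `flush` mirrors the `clearPendingPatches()` step shown earlier: a second flush on the same entity emits nothing unless new mutations were recorded in between.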
### Graceful Degradation

Lazy loading failures are logged but do not crash the caller:

```java
try {
    // Fetch the aspect from the server (body omitted in this excerpt)
} catch (Exception e) {
    log.warn("Failed to lazy-load aspect {}: {}", aspectName, e.getMessage());
    return null;  // Graceful degradation
}
```

## Comparison: V1 vs V2 Architecture

| Aspect                | V1 (RestEmitter)               | V2 (DataHubClientV2)        |
| --------------------- | ------------------------------ | --------------------------- |
| **Abstraction Level** | Low - MCPs                     | High - Entities             |
| **URN Construction**  | Manual strings                 | Automatic from builder      |
| **Aspect Wiring**     | Manual MCP building            | Hidden in entity methods    |
| **Updates**           | Full aspect replacement        | Patch-based incremental     |
| **Type Safety**       | Minimal - generic MCPs         | Strong - typed entities     |
| **Lazy Loading**      | Not supported                  | TTL-based caching           |
| **Mode Awareness**    | Not supported                  | SDK vs INGESTION modes      |
| **Learning Curve**    | Steep - requires MCP knowledge | Gentle - intuitive builders |

## Performance Characteristics

### Network Efficiency

- **Patch-based updates**: O(changed_fields) vs O(all_fields)
- **Lazy loading**: Aspects fetched only when accessed
- **Batch emission**: Multiple patches sent in single flush
- **Connection reuse**: HTTP client pooling

### Memory Efficiency

- **Aspect caching**: Only fetched aspects stored
- **TTL expiration**: Stale aspects eligible for GC
- **Lazy instantiation**: Aspects created on-demand

### Time Complexity

- **Entity creation**: O(1) - builder accumulation
- **Patch addition**: O(1) - append to list
- **Upsert operation**: O(n) where n = pending patches or cached aspects
- **Lazy fetch**: O(1) cache lookup, plus one network round-trip on a miss

## Extension Points

SDK V2 is designed for extensibility:

1. **New entity types** - Extend `Entity` base class
2. **Custom aspects** - Use `getAspectLazy` / `getOrCreateAspect`
3. **New patch types** - Leverage existing patch builders
4. **Custom caching** - Override `cacheTtlMs`
5.
   **Transport customization** - Customize RestEmitter via builder

## Summary

Java SDK V2 achieves its goals through principled design:

- **Reuse over reinvention** - Leverages existing patch builders and RestEmitter
- **Patches over replacements** - Efficient incremental updates
- **Lazy over eager** - Aspects fetched on-demand with caching
- **Type safety over convenience** - Strong typing throughout
- **Layers over monoliths** - Clear separation of entity, operations, transport
- **Pragmatism over purity** - Mode-aware behavior matches real-world usage

The result is an SDK that feels natural to Java developers while providing the efficiency and correctness required for production metadata management at scale.
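As a closing illustration of the "lazy over eager" principle, the TTL check and the fall-back-to-null behavior described in this document can be reproduced in a few lines of plain Java. This is a sketch under stated assumptions rather than the SDK's `Entity` code: `TtlAspectCacheSketch`, its injected clock, and the `fetcher` function (standing in for a server call) are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Illustrative only: a hypothetical cache, not the SDK's Entity class.
class TtlAspectCacheSketch {
    private final Map<String, Object> values = new HashMap<>();
    private final Map<String, Long> timestamps = new HashMap<>();
    private final long ttlMs;
    private final LongSupplier clock;               // injectable so tests control time
    private final Function<String, Object> fetcher; // stands in for a server call
    int fetchCount = 0;                             // number of fetch attempts

    TtlAspectCacheSketch(long ttlMs, LongSupplier clock, Function<String, Object> fetcher) {
        this.ttlMs = ttlMs;
        this.clock = clock;
        this.fetcher = fetcher;
    }

    Object get(String aspectName) {
        Long fetchedAt = timestamps.get(aspectName);
        if (fetchedAt != null && clock.getAsLong() - fetchedAt < ttlMs) {
            return values.get(aspectName); // fresh entry: no fetch needed
        }
        fetchCount++;
        try {
            Object fetched = fetcher.apply(aspectName);
            if (fetched != null) {
                values.put(aspectName, fetched);
                timestamps.put(aspectName, clock.getAsLong());
            }
            return fetched;
        } catch (RuntimeException e) {
            return null; // graceful degradation: report "unavailable" instead of throwing
        }
    }

    public static void main(String[] args) {
        long[] now = {0L};
        TtlAspectCacheSketch cache =
            new TtlAspectCacheSketch(60_000L, () -> now[0], name -> "value-of-" + name);

        cache.get("datasetProperties"); // miss: fetch
        cache.get("datasetProperties"); // fresh: served from cache
        now[0] = 60_000L;               // TTL has elapsed
        cache.get("datasetProperties"); // stale: fetch again
        System.out.println("fetches: " + cache.fetchCount); // prints "fetches: 2"
    }
}
```

Injecting the clock instead of calling `System.currentTimeMillis()` directly is a choice made here only to keep the sketch deterministic; the excerpt shown earlier reads the system clock.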