datahub/metadata-integration/java/docs/sdk-v2/design-principles.md

# Design Principles of Java SDK V2

This document provides an architectural overview of DataHub Java SDK V2, exploring the engineering principles and design patterns that enable its type-safe, efficient metadata management capabilities.

## Architectural Philosophy

SDK V2 is built on a foundation of **pragmatic reuse, intelligent caching, and layered abstractions**. Rather than reinventing infrastructure, it composes proven components into a coherent, intuitive API while introducing new patterns for efficient metadata operations.

### Core Tenets

1. **Leverage Existing Infrastructure** - Build atop battle-tested components
2. **Type Safety as a First-Class Concern** - Exploit Java's type system for compile-time correctness
3. **Separation of Concerns** - Clear boundaries between entity, operations, and transport layers
4. **Efficiency Through Patches** - Surgical updates over full replacements
5. **Intelligent Resource Management** - Lazy loading, caching, and batching

## Layer Architecture

SDK V2 employs a three-layer architecture with clear separation of responsibilities:

```
┌─────────────────────────────────────────────────────────────┐
│                    Entity Layer                              │
│  (Dataset, Chart, Dashboard - Business Logic)                │
│  - Fluent builders for entity construction                   │
│  - Patch accumulation and aspect management                  │
│  - Mode-aware behavior (SDK vs INGESTION)                    │
└──────────────────────┬──────────────────────────────────────┘
                       │
┌──────────────────────┴──────────────────────────────────────┐
│                 Operations Layer                             │
│  (EntityClient - CRUD Operations)                            │
│  - Entity lifecycle management                               │
│  - Patch vs full aspect emission logic                       │
│  - Lazy loading coordination                                 │
└──────────────────────┬──────────────────────────────────────┘
                       │
┌──────────────────────┴──────────────────────────────────────┐
│                  Transport Layer                             │
│  (RestEmitter, Patch Builders)                               │
│  - HTTP communication with DataHub                           │
│  - MCP serialization and emission                            │
│  - Patch builder integration                                 │
└─────────────────────────────────────────────────────────────┘
```

## Design Patterns

### 1. Fluent Builder Pattern

Entity construction follows a **fluent builder pattern** that guides developers through required fields and provides IDE autocomplete support:

```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("analytics.public.events")
    .env("PROD")
    .description("User events")
    .build();
```

**Engineering Benefits:**

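- **Early validation** - Missing required fields (platform, name) are rejected when `build()` is called, before anything is sent to the server
- **Complete construction** - Builder accumulates state; `build()` returns a fully initialized entity whose later changes are tracked as patches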
- **Discoverability** - IDE autocomplete reveals available methods
- **Extensibility** - New optional parameters added without breaking existing code
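
As a rough illustration of the builder shape described above, the sketch below uses a hypothetical `SketchDataset` class (not the SDK's `Dataset`). It assumes that required fields are checked when `build()` is called and that `env` defaults to `PROD`; both are assumptions made for this example. The URN layout follows the standard DataHub dataset URN format.

```java
// Hypothetical stand-in for an SDK entity; illustrates the builder shape only.
final class SketchDataset {
    private final String urn;
    private final String description; // optional, may be null

    private SketchDataset(String urn, String description) {
        this.urn = urn;
        this.description = description;
    }

    static Builder builder() { return new Builder(); }

    String getUrn() { return urn; }
    String getDescription() { return description; }

    static final class Builder {
        private String platform;
        private String name;
        private String env = "PROD"; // assumed default for this sketch
        private String description;

        Builder platform(String platform) { this.platform = platform; return this; }
        Builder name(String name) { this.name = name; return this; }
        Builder env(String env) { this.env = env; return this; }
        Builder description(String description) { this.description = description; return this; }

        SketchDataset build() {
            // Required fields are validated here, before any entity exists.
            if (platform == null || name == null) {
                throw new IllegalArgumentException("platform and name are required");
            }
            String urn = String.format(
                "urn:li:dataset:(urn:li:dataPlatform:%s,%s,%s)", platform, name, env);
            return new SketchDataset(urn, description);
        }
    }
}
```

Because every setter returns the builder, adding a new optional field later only adds one method and leaves existing call chains untouched.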

### 2. Patch Accumulation Pattern

Rather than modifying aspects directly, mutations create **patch MCPs** that accumulate in a pending list:

```java
dataset.addTag("pii")                          // Creates patch MCP
       .addOwner("user", TECHNICAL_OWNER)      // Creates patch MCP
       .addCustomProperty("retention", "90");  // Creates patch MCP

client.entities().upsert(dataset);  // Emits all accumulated patches
```

**Engineering Benefits:**

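- **Deferred execution** - Nothing is sent until `upsert()` is called, so multiple changes are flushed together
- **Single flush point** - All pending changes go out in one `upsert()` call, in a fixed order; in the `upsert()` code shown below each MCP is emitted individually, so the update is not transactional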
- **Efficient transmission** - Only changed fields sent over wire
- **Reuse of proven infrastructure** - Leverages existing `datahub.client.patch` builders

**Implementation Detail:**
The `Entity` base class maintains three change-tracking structures:

```java
// From Entity.java
protected final Map<String, RecordTemplate> aspectCache;        // Cached aspects from builder
protected final List<MetadataChangeProposalWrapper> pendingMCPs; // Full aspect replacements
protected final List<MetadataChangeProposal> pendingPatches;     // Incremental patches
```

Each mutation (`addTag`, `addOwner`) creates a patch using existing builders:

```java
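// From Dataset.java (simplified; URN normalization and error handling omitted)
public Dataset addTag(@Nonnull String tagUrn) {
    TagUrn tag = TagUrn.createFromString(tagUrn);  // Parse the string into a typed URN
    GlobalTagsPatchBuilder patch = new GlobalTagsPatchBuilder()
        .urn(getUrn())
        .addTag(tag, null);  // null = no tag context
    addPatchMcp(patch.build());  // Adds to pendingPatches list
    return this;
}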
```

When `EntityClient.upsert()` is called, it emits **everything** accumulated on the entity in order:

```java
// From EntityClient.upsert()

// Step 1: Emit cached aspects (from builder)
if (!entity.toMCPs().isEmpty()) {
    for (MetadataChangeProposalWrapper mcp : entity.toMCPs()) {
        emitter.emit(mcp);
    }
}

// Step 2: Emit pending full aspect MCPs (from set*() methods)
if (entity.hasPendingMCPs()) {
    for (MetadataChangeProposalWrapper mcp : entity.getPendingMCPs()) {
        emitter.emit(mcp);
    }
    entity.clearPendingMCPs();
}

// Step 3: Emit all pending patches (from add*/remove* methods)
if (entity.hasPendingPatches()) {
    for (MetadataChangeProposal patchMcp : entity.getPendingPatches()) {
        emitter.emit(patchMcp, null);
    }
    entity.clearPendingPatches();
}
```

**Key insight:** `upsert()` is not an either/or operation - it emits **all** accumulated changes. What gets sent depends on what you've accumulated on the entity, not which method you call.
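
The ordering above can be modelled in a few lines. In this sketch, `UpsertOrderSketch` and its string-valued lists are hypothetical stand-ins for `Entity`, `EntityClient`, and real MCPs. It only demonstrates that one `upsert()` emits cached aspects, then pending full-aspect MCPs, then pending patches, and that the two pending lists are cleared afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical model of the emission order; strings stand in for MCPs.
final class UpsertOrderSketch {
    final List<String> cachedAspects = new ArrayList<>();  // from the builder
    final List<String> pendingMcps = new ArrayList<>();    // from set*() calls
    final List<String> pendingPatches = new ArrayList<>(); // from add*/remove* calls

    /** Emits everything accumulated, in the three-step order, and returns the count. */
    int upsert(Consumer<String> emitter) {
        int emitted = 0;
        for (String aspect : cachedAspects) {   // Step 1: cached aspects
            emitter.accept("aspect:" + aspect);
            emitted++;
        }
        for (String mcp : pendingMcps) {        // Step 2: full-aspect MCPs
            emitter.accept("mcp:" + mcp);
            emitted++;
        }
        pendingMcps.clear();
        for (String patch : pendingPatches) {   // Step 3: patches
            emitter.accept("patch:" + patch);
            emitted++;
        }
        pendingPatches.clear();
        return emitted;
    }
}
```

Each change is handed to the emitter individually, so the order is deterministic, but the sketch makes no claim about atomicity across changes.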

### 3. Lazy Loading with TTL-Based Caching

Entities support **lazy aspect loading** to minimize network calls while ensuring data freshness:

```java
// Entity maintains aspect cache with timestamps
protected final Map<String, RecordTemplate> aspectCache;
protected final Map<String, Long> aspectTimestamps;
protected long cacheTtlMs = 60000;  // 60-second default TTL
```

**Loading Strategy:**

1. **Cache-only access** (`getAspectCached`) - Returns cached aspect or null
2. **Lazy loading** (`getAspectLazy`) - Checks cache freshness, fetches from server if stale
3. **Get-or-create** (`getOrCreateAspect`) - Returns cached or creates new empty aspect locally

**Implementation:**

```java
protected <T extends RecordTemplate> T getAspectLazy(@Nonnull Class<T> aspectClass) {
    String aspectName = getAspectName(aspectClass);

    // Check cache freshness
    if (aspectCache.containsKey(aspectName)) {
        Long timestamp = aspectTimestamps.get(aspectName);
        if (timestamp != null && System.currentTimeMillis() - timestamp < cacheTtlMs) {
            return aspectClass.cast(aspectCache.get(aspectName));
        }
    }

    // Fetch from server if client is bound
    if (client != null) {
        T aspect = client.getAspect(urn, aspectClass);
        if (aspect != null) {
            aspectCache.put(aspectName, aspect);
            aspectTimestamps.put(aspectName, System.currentTimeMillis());
        }
        return aspect;
    }

    return null;
}
```

**Engineering Benefits:**

- **Network efficiency** - Reduces redundant server calls
- **Bounded staleness** - Configurable TTL limits how long a cached aspect is served before it is refetched
- **Transparent to caller** - Complexity hidden behind simple getter
- **Client binding** - Entities bound to EntityClient enable lazy loading
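
To make the freshness check concrete, here is a small self-contained model of a TTL-guarded cache. `TtlAspectCache`, the `LongSupplier` clock, and the `Function` fetcher are assumptions introduced for this sketch so that time and the server can be faked; the SDK code shown above reads `System.currentTimeMillis()` and calls the bound client instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical model of TTL-based lazy loading; strings stand in for aspects.
final class TtlAspectCache {
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, Long> timestamps = new HashMap<>();
    private final long ttlMs;
    private final LongSupplier clock;               // injectable for testing
    private final Function<String, String> fetcher; // stands in for the server call

    TtlAspectCache(long ttlMs, LongSupplier clock, Function<String, String> fetcher) {
        this.ttlMs = ttlMs;
        this.clock = clock;
        this.fetcher = fetcher;
    }

    String getLazy(String aspectName) {
        Long fetchedAt = timestamps.get(aspectName);
        if (fetchedAt != null && clock.getAsLong() - fetchedAt < ttlMs) {
            return values.get(aspectName); // fresh entry: no fetch
        }
        String fetched = fetcher.apply(aspectName); // missing or stale: refetch
        if (fetched != null) {
            values.put(aspectName, fetched);
            timestamps.put(aspectName, clock.getAsLong());
        }
        return fetched;
    }
}
```

Injecting the clock is a testing convenience of this sketch; it lets a test step time past the TTL without sleeping.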

### 4. Mode-Aware Aspect Selection

SDK V2 distinguishes between **user-initiated edits** (SDK mode) and **system/pipeline writes** (INGESTION mode):

```java
public enum OperationMode {
    SDK,        // Interactive use - writes to editable aspects
    INGESTION   // ETL pipelines - writes to system aspects
}
```

**Aspect Routing:**

- **SDK Mode** → `editableDatasetProperties`, `editableSchemaMetadata`
- **INGESTION Mode** → `datasetProperties`, `schemaMetadata`

**Implementation:**

```java
public Dataset setDescription(@Nonnull String description) {
    if (isIngestionMode()) {
        return setSystemDescription(description);  // datasetProperties
    } else {
        return setEditableDescription(description); // editableDatasetProperties
    }
}
```

**Engineering Benefits:**

- **Clear provenance** - Distinguishes human vs machine edits
- **UI consistency** - DataHub UI shows editable aspects as user overrides
- **Non-destructive** - System data preserved even when users add documentation
- **Edit preservation** - Ingestion pipelines can refresh system data without clobbering user edits
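
The routing rule can also be shown in isolation. The sketch below uses a hypothetical `ModeRoutingSketch` class with its own `Mode` enum (mirroring `OperationMode`) and records which aspect name a description write lands in. The aspect names are the ones listed above; everything else is illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of mode-aware routing; not the SDK's Dataset class.
final class ModeRoutingSketch {
    enum Mode { SDK, INGESTION }

    private final Mode mode;
    private final Map<String, String> aspects = new LinkedHashMap<>();

    ModeRoutingSketch(Mode mode) { this.mode = mode; }

    /** Routes the write to the system or editable aspect depending on mode. */
    ModeRoutingSketch setDescription(String description) {
        String aspectName = (mode == Mode.INGESTION)
            ? "datasetProperties"          // system aspect, owned by pipelines
            : "editableDatasetProperties"; // user-facing override
        aspects.put(aspectName, description);
        return this;
    }

    Map<String, String> aspects() { return aspects; }
}
```

The caller invokes the same `setDescription` method in both modes; only the target aspect differs.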

### 5. Two Entity Lifecycle Patterns

Entities can be instantiated in two ways, each with distinct semantics:

#### **Pattern 1: Builder Construction (New Entities)**

```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();
// aspectCache populated with builder-provided aspects
// aspectTimestamps empty - indicates new entity
```

**Use case:** Creating new entities from scratch

#### **Pattern 2: Server Loading (Existing Entities)**

```java
Dataset dataset = client.entities().get(urn);
// aspectCache populated with server aspects
// aspectTimestamps records fetch time for each aspect
// Entity automatically bound to client for lazy loading
```

**Use case:** Modifying existing entities with current server state. When you access aspects not already cached, the entity will automatically fetch them from the server (lazy loading).

### 6. Client Binding for Lazy Loading

Entities are **automatically bound to an EntityClient** when loaded from the server or during `upsert()` to enable lazy aspect fetching:

```java
public void bindToClient(@Nonnull EntityClient client,
                        @Nonnull OperationMode mode) {
    if (this.client == null) {
        this.client = client;
    }
    if (this.mode == null) {
        this.mode = mode;
    }
}
```

**Binding occurs automatically** during `upsert()`:

```java
// From EntityClient.upsert()
entity.bindToClient(this, config.getMode());
```

**Engineering Benefits:**

- **Transparent lazy loading** - Aspects fetched on first access if not cached
- **Automatic binding** - Entities bound to client during `get()` or `upsert()` operations
- **Mode propagation** - Client mode automatically applied to entity
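
One consequence of the null checks in `bindToClient` is that the first binding wins. The sketch below models that behavior with a hypothetical `BindableSketch` class whose fields are plain strings standing in for `EntityClient` and `OperationMode`; it is not SDK code.

```java
// Hypothetical model of first-binding-wins semantics.
final class BindableSketch {
    private String client; // stands in for EntityClient
    private String mode;   // stands in for OperationMode

    void bindToClient(String client, String mode) {
        if (this.client == null) {
            this.client = client; // only the first client is kept
        }
        if (this.mode == null) {
            this.mode = mode;     // only the first mode is kept
        }
    }

    String client() { return client; }
    String mode() { return mode; }
}
```

In this model, passing an already bound entity to a second client leaves the original client and mode in place.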

## Type Safety & Generic Design

### Strongly-Typed Aspect Handling

SDK V2 leverages Java generics to provide compile-time type safety for aspects:

```java
// Type-safe aspect retrieval (simplified to the cache lookup; freshness check and fetch omitted)
protected <T extends RecordTemplate> T getAspectLazy(@Nonnull Class<T> aspectClass) {
    String aspectName = getAspectName(aspectClass);
    RecordTemplate aspect = aspectCache.get(aspectName);
    return aspectClass.cast(aspect);
}

// Usage - compiler enforces type correctness
DatasetProperties props = dataset.getAspectLazy(DatasetProperties.class);
```

**Engineering Benefits:**

- **Compile-time checking** - Type mismatches caught before runtime
- **Refactoring safety** - IDE can trace aspect usages across codebase
- **Autocomplete support** - IDE suggests available aspects
- **Runtime safety** - No unchecked casts; `Class.cast()` verifies the type, so a `ClassCastException` can only occur if the cache holds the wrong type under an aspect name
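
The `Class<T>` token idiom used above can be tried outside the SDK. This sketch uses a hypothetical `TypedAspectStore` keyed by simple class name (an assumed naming scheme, not necessarily how the SDK derives aspect names) and plain Java classes in place of `RecordTemplate` subclasses.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical typed store; illustrates the Class<T> token idiom only.
final class TypedAspectStore {
    static final class Ownership {
        final String owner;
        Ownership(String owner) { this.owner = owner; }
    }

    static final class Tags {
        final String tag;
        Tags(String tag) { this.tag = tag; }
    }

    private final Map<String, Object> cache = new HashMap<>();

    <T> void put(Class<T> type, T value) {
        cache.put(type.getSimpleName(), value); // assumed naming scheme
    }

    /** Returns the cached value as T, or null when absent; the cast is checked. */
    <T> T get(Class<T> type) {
        return type.cast(cache.get(type.getSimpleName()));
    }
}
```

With this signature, assigning `store.get(TypedAspectStore.Ownership.class)` to a `Tags` variable is rejected by the compiler, which is the property the section describes.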

### URN Type Safety

Entity-specific URN types prevent incorrect URN usage:

```java
public class Dataset extends Entity {
    public DatasetUrn getDatasetUrn() {
        return (DatasetUrn) urn;
    }
}

// Compile-time enforcement
DatasetUrn urn = dataset.getDatasetUrn();  // Type-safe
Urn genericUrn = dataset.getUrn();         // Also available
```
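
A small model shows why the typed accessor matters to callers. All names here (`UrnTypeSketch`, `SketchUrn`, `SketchDatasetUrn`, `SketchDatasetEntity`, `platformOf`) are hypothetical; the real URN classes come from the DataHub URN library and carry more structure.

```java
// Hypothetical URN hierarchy used only to illustrate typed accessors.
final class UrnTypeSketch {
    static class SketchUrn {
        final String value;
        SketchUrn(String value) { this.value = value; }
    }

    static final class SketchDatasetUrn extends SketchUrn {
        final String platform;
        SketchDatasetUrn(String platform, String name, String env) {
            super("urn:li:dataset:(urn:li:dataPlatform:" + platform + "," + name + "," + env + ")");
            this.platform = platform;
        }
    }

    static final class SketchDatasetEntity {
        private final SketchUrn urn;
        SketchDatasetEntity(SketchDatasetUrn urn) { this.urn = urn; }

        SketchUrn getUrn() { return urn; }

        // Safe cast: the constructor only accepts a SketchDatasetUrn.
        SketchDatasetUrn getDatasetUrn() { return (SketchDatasetUrn) urn; }
    }

    /** Accepts only dataset URNs; passing a plain SketchUrn does not compile. */
    static String platformOf(SketchDatasetUrn urn) { return urn.platform; }
}
```

The cast inside `getDatasetUrn()` is safe in this sketch because the constructor restricts what can be stored; callers never cast themselves.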

## Integration with Existing Infrastructure

### Reuse of Patch Builders

SDK V2 **reuses existing patch builders** from `datahub.client.patch` rather than creating new implementations:

- `OwnershipPatchBuilder` - Owner additions/removals
- `GlobalTagsPatchBuilder` - Tag management
- `GlossaryTermsPatchBuilder` - Term associations
- `DomainsPatchBuilder` - Domain assignment
- `DatasetPropertiesPatchBuilder` - Property updates
- `EditableDatasetPropertiesPatchBuilder` - Editable property updates

**Engineering Benefits:**

- **Battle-tested logic** - The `datahub.client.patch` builders already ship in the existing Java client
- **Consistency** - Same patch semantics across language SDKs
- **Maintainability** - Single implementation to maintain
- **Correctness** - Complex JSON Patch logic already validated

**Example Integration:**

```java
public Dataset addOwner(@Nonnull String ownerUrn, @Nonnull OwnershipType type) {
    Urn owner = Urn.createFromString(ownerUrn);
    OwnershipPatchBuilder patch = new OwnershipPatchBuilder()
        .urn(getUrn())
        .addOwner(owner, type);
    addPatchMcp(patch.build());  // Stores patch MCP
    return this;
}
```
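
For readers unfamiliar with JSON Patch, the sketch below renders one RFC 6902 `add` operation as text, including the RFC 6901 escaping of path tokens. `JsonPatchSketch` is a hypothetical helper, and the `/tags/<urn>` path layout and `{"tag": ...}` value shape are assumptions made for illustration; the real patch builders decide the actual document structure.

```java
// Hypothetical helper that renders one RFC 6902 "add" operation as JSON text.
final class JsonPatchSketch {
    /** Escapes a JSON Pointer reference token per RFC 6901 ("~" -> "~0", "/" -> "~1"). */
    static String escapeToken(String token) {
        // "~" must be escaped first so the "~1" produced for "/" is not re-escaped.
        return token.replace("~", "~0").replace("/", "~1");
    }

    /**
     * Builds an add operation for a tag. The path layout is an assumption, and
     * the input is assumed to contain no quotes or backslashes.
     */
    static String addTagOp(String tagUrn) {
        String path = "/tags/" + escapeToken(tagUrn);
        return "[{\"op\":\"add\",\"path\":\"" + path
            + "\",\"value\":{\"tag\":\"" + tagUrn + "\"}}]";
    }
}
```

The point of the illustration is size: the document names only the element being added, not the full list of tags already on the entity.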

### Leverage RestEmitter

The transport layer reuses `RestEmitter` for HTTP communication:

- Non-blocking emission with futures
- Configurable retries and timeouts
- Token-based authentication
- Async HTTP client pooling

**No changes to RestEmitter** - SDK V2 is purely additive.

## Resource Management & Efficiency

### Batched Emission

Multiple patches are accumulated locally and emitted together by one `upsert()` call:

```java
dataset.addTag("tag1").addTag("tag2").addOwner("user1", OWNER);
client.entities().upsert(dataset);  // One upsert() call emits all 3 patches
```

### Connection Pooling

RestEmitter uses `CloseableHttpAsyncClient` with connection pooling for efficient HTTP reuse.

### Graceful Degradation

Lazy-loading failures are logged rather than propagated to the caller:

```java
catch (Exception e) {
    log.warn("Failed to lazy-load aspect {}: {}", aspectName, e.getMessage());
    return null;  // Graceful degradation
}
```
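
The same log-and-return-null pattern can be expressed as a small wrapper. `LazyLoadGuard` and its `failures` counter are hypothetical; the counter stands in for the `log.warn` call so the sketch has no logging dependency.

```java
import java.util.function.Supplier;

// Hypothetical wrapper showing the log-and-return-null pattern.
final class LazyLoadGuard {
    static int failures = 0; // stands in for a logger call in this sketch

    /** Runs the fetch and converts any runtime failure into a null result. */
    static <T> T loadOrNull(Supplier<T> fetch) {
        try {
            return fetch.get();
        } catch (RuntimeException e) {
            failures++; // real code would log a warning here
            return null;
        }
    }
}
```

A trade-off worth noting: with this pattern the caller cannot distinguish an aspect that does not exist from one that failed to load, so the log is the only record of the failure.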

## Comparison: V1 vs V2 Architecture

| Aspect                | V1 (RestEmitter)               | V2 (DataHubClientV2)        |
| --------------------- | ------------------------------ | --------------------------- |
| **Abstraction Level** | Low - MCPs                     | High - Entities             |
| **URN Construction**  | Manual strings                 | Automatic from builder      |
| **Aspect Wiring**     | Manual MCP building            | Hidden in entity methods    |
| **Updates**           | Full aspect replacement        | Patch-based incremental     |
| **Type Safety**       | Minimal - generic MCPs         | Strong - typed entities     |
| **Lazy Loading**      | Not supported                  | TTL-based caching           |
| **Mode Awareness**    | Not supported                  | SDK vs INGESTION modes      |
| **Learning Curve**    | Steep - requires MCP knowledge | Gentle - intuitive builders |

## Performance Characteristics

### Network Efficiency

- **Patch-based updates**: O(changed_fields) vs O(all_fields)
- **Lazy loading**: Aspects fetched only when accessed
- **Batch emission**: Multiple pending changes flushed by a single `upsert()` call
- **Connection reuse**: HTTP client pooling

### Memory Efficiency

- **Aspect caching**: Only fetched aspects stored
- **TTL expiration**: Stale aspects are replaced on the next successful lazy fetch, leaving the old value eligible for GC
- **Lazy instantiation**: Aspects created on-demand

### Time Complexity

- **Entity creation**: O(1) - builder accumulation
- **Patch addition**: O(1) - append to list
- **Upsert operation**: O(n) where n = cached aspects + pending MCPs + pending patches
- **Lazy fetch**: O(1) cache lookup, plus one network round-trip on a miss or stale entry

## Extension Points

SDK V2 is designed for extensibility:

1. **New entity types** - Extend `Entity` base class
2. **Custom aspects** - Use `getAspectLazy` / `getOrCreateAspect`
3. **New patch types** - Leverage existing patch builders
4. **Custom caching** - Adjust `cacheTtlMs`
5. **Transport customization** - Customize RestEmitter via builder

## Summary

Java SDK V2 achieves its goals through principled design:

- **Reuse over reinvention** - Leverages existing patch builders and RestEmitter
- **Patches over replacements** - Efficient incremental updates
- **Lazy over eager** - Aspects fetched on-demand with caching
- **Type safety over convenience** - Strong typing throughout
- **Layers over monoliths** - Clear separation of entity, operations, transport
- **Pragmatism over purity** - Mode-aware behavior matches real-world usage

The result is an SDK that feels natural to Java developers while providing the efficiency and correctness required for production metadata management at scale.