# Why We Hand-Crafted the Java SDK V2 (Instead of Generating It)

## The Question

When building DataHub's Java SDK V2, we faced a choice that every API platform eventually confronts: should we generate our SDK from OpenAPI specs, or hand-craft it?

OpenAPI code generation is seductive. Tools like OpenAPI Generator promise instant SDKs in dozens of languages. Run a command, get a client—complete with type-safe models, proper serialization, and comprehensive endpoint coverage. Why would anyone choose to write thousands of lines of code by hand?

We chose to hand-craft. This document explains why.

## When Code Generation Works Beautifully

Let's be clear: code generation isn't wrong. It's incredibly effective when your abstraction boundary aligns with your wire protocol.

**CRUD APIs**: If your API exposes resources like `GET /users/{id}`, `POST /users`, `DELETE /users/{id}`, a generated client is perfect:

```java
User user = client.getUser(123);
client.createUser(newUser);
client.deleteUser(456);
```

The user's mental model—"I want to fetch/create/delete a user"—maps directly to HTTP operations. There's no translation needed.

**Protocol Buffers**: Google's protobuf generators are exemplary because the `.proto` file **is** the contract:

```protobuf
service UserService {
  rpc GetUser(UserId) returns (User);
  rpc ListUsers(ListRequest) returns (UserList);
}
```

The service definition becomes the client API with perfect fidelity. What you define is what users get.

**The Pattern**: Code generation excels when **the API's conceptual model matches user mental models**, and the wire protocol fully captures domain semantics.

## The Semantic Gap: Why DataHub Is Different

DataHub doesn't fit this mold. Our metadata platform has a semantic gap between what users want to do and what the HTTP API exposes.

### The Aspect-Based Model

DataHub stores metadata as discrete "aspects"—properties, tags, ownership, schemas. But users don't think in aspects. They think:

- "I want to add a 'PII' tag to this dataset"
- "I need to assign ownership to John"
- "This table should be in the Finance domain"

An OpenAPI-generated client would expose:

```java
// What the API provides
client.updateGlobalTags(entityUrn, globalTagsPayload);
client.updateOwnership(entityUrn, ownershipPayload);
```

But to use this, you need to know:

- What is `GlobalTags`? How do I construct it?
- Should I use PUT (full replacement) or PATCH (incremental update)?
- How do I avoid race conditions when multiple systems update tags?
- Where do tags even live—in system aspects or editable aspects?

This is expert-level knowledge pushed onto every user.

### The Patch Complexity

DataHub supports both full aspect replacement (PUT) and JSON Patch (incremental updates). The generated client would expose both:

```java
// Full replacement
void putGlobalTags(Urn entityUrn, GlobalTags tags);

// JSON Patch
void patchGlobalTags(Urn entityUrn, JsonPatch patch);
```

Now users must decide when to use each. Patches are safer under concurrent writers, because they touch only the fields they name instead of overwriting the whole aspect. But how do you construct a `JsonPatch`? Do you use a PatchBuilder? Hand-write JSON?

Every user solves this problem independently, reinventing best practices.
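To make that burden concrete, here is a minimal sketch of what hand-writing a single "add tag" patch might involve, using only the JDK. The `ManualTagPatch` helper is hypothetical, and the `/tags/<urn>` path layout and value shape are assumptions for illustration rather than the documented DataHub patch format. The pointer escaping follows RFC 6901.

```java
// Hypothetical helper for illustration: NOT part of any DataHub SDK.
final class ManualTagPatch {

    // RFC 6901: inside a pointer token, "~" becomes "~0" and "/" becomes "~1".
    // The "~" replacement must run first so the "~1" sequences are not re-escaped.
    static String escapePointerToken(String token) {
        return token.replace("~", "~0").replace("/", "~1");
    }

    // Builds a one-operation RFC 6902 patch document as a JSON string.
    // Assumed layout: tags are keyed by URN under "/tags".
    // The sketch assumes the URN has no characters that need JSON string escaping.
    static String addTagPatch(String tagUrn) {
        String path = "/tags/" + escapePointerToken(tagUrn);
        return "[{\"op\":\"add\",\"path\":\"" + path + "\","
                + "\"value\":{\"tag\":\"" + tagUrn + "\"}}]";
    }
}
```

Even in this toy form, the caller has to know the pointer escaping rules, the path layout, and the value shape before a single tag is added.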

### The Mode Problem

DataHub has dual aspects: **system aspects** (written by ingestion pipelines) and **editable aspects** (written by humans via UI/SDK). Users editing metadata should write to editable aspects, but pipelines should write to system aspects.

A generated client doesn't understand this distinction. It just exposes endpoints. Users must learn DataHub's aspect model to route correctly.
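The routing decision itself is small, which is exactly why it is easy to get wrong when every caller makes it by hand. A rough sketch, assuming a hypothetical `Mode` enum and using the dataset properties pair purely as an example of a system/editable split:

```java
// Hypothetical sketch: the class, enum, and method names are illustrative only.
final class AspectRouting {

    enum Mode { SDK, INGESTION }

    // Picks which aspect a properties write should target for the given mode.
    // Human-driven writes go to the editable aspect, pipeline writes to the system aspect.
    static String propertiesAspectFor(Mode mode) {
        switch (mode) {
            case SDK:
                return "editableDatasetProperties";
            case INGESTION:
                return "datasetProperties";
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }
}
```

With a generated client, this choice would sit in user code at every call site.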

## Six Principles of Hand-Crafted SDKs

Our hand-crafted SDK addresses these gaps through six design principles.

### 1. Semantic Layers Translate Domain Concepts

The SDK provides operations that match how users think:

```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("fact_revenue")
    .build();

// Think "add a tag", not "construct and PUT a GlobalTags aspect"
dataset.addTag("pii");

// Think "assign ownership", not "build an Ownership aspect"
dataset.addOwner("urn:li:corpuser:jdoe", OwnershipType.TECHNICAL_OWNER);

client.entities().upsert(dataset);
```

The SDK translates `addTag()` into the correct:

- Aspect type (GlobalTags)
- Operation type (JSON Patch for safety)
- Aspect variant (editable, in SDK mode)
- JSON path (into the aspect structure)

This is **semantic translation**—mapping domain intent to wire protocol. Generators can't do this because the semantics live in institutional knowledge, not OpenAPI specs.
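As a rough illustration of that translation step, the toy class below turns a domain call into a pending patch record. It is not the SDK's actual implementation: `SketchDataset`, `PendingPatch`, the `/tags/<urn>` path, and the hard-coded `PROD` environment are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model, not the real SDK classes: every name here is hypothetical.
final class SketchDataset {

    // One pending change: which aspect, which JSON Patch op, which path.
    static final class PendingPatch {
        final String aspectName;
        final String op;
        final String path;

        PendingPatch(String aspectName, String op, String path) {
            this.aspectName = aspectName;
            this.op = op;
            this.path = path;
        }
    }

    private final String urn;
    private final List<PendingPatch> pending = new ArrayList<>();

    SketchDataset(String platform, String name) {
        // URN layout and the PROD environment are assumed for illustration.
        this.urn = "urn:li:dataset:(urn:li:dataPlatform:" + platform + "," + name + ",PROD)";
    }

    // Domain-level call: the caller says "add a tag" and nothing else.
    SketchDataset addTag(String tagName) {
        String tagUrn = tagName.startsWith("urn:li:tag:") ? tagName : "urn:li:tag:" + tagName;
        pending.add(new PendingPatch("globalTags", "add", "/tags/" + tagUrn));
        return this;
    }

    String urn() {
        return urn;
    }

    // Read-only view of the changes accumulated so far.
    List<PendingPatch> pendingPatches() {
        return Collections.unmodifiableList(pending);
    }
}
```

The caller writes `new SketchDataset("snowflake", "fact_revenue").addTag("pii")`; the aspect name, operation, and path are decided inside the class.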

### 2. Opinionated APIs: The 95/5 Rule

We optimized for the 95% case and provided escape hatches for the 5%.

**The 95% case**: Incremental metadata changes—add a tag, update ownership, set a domain.

```java
dataset.addTag("sensitive")
       .addOwner(ownerUrn, type)
       .setDomain(domainUrn);

client.entities().update(dataset);
```

Users never think about PUT vs PATCH, aspect construction, or batch strategies. It just works.

**The 5% case**: Complete aspect replacement, custom MCPs, or operations V2 doesn't support.

```java
// Drop to V1 SDK for full control
RestEmitter emitter = client.emitter();
MetadataChangeProposalWrapper mcpw = /* custom logic */;
emitter.emit(mcpw).get();
```

This philosophy—**make simple things trivial, complex things possible**—requires intentional API design. Generators produce flat API surfaces where every operation has equal weight.

### 3. Encoding Expert Knowledge

Every platform accumulates tribal knowledge:

- "Always use patches for concurrent-safe updates"
- "Editable aspects override system aspects in SDK mode"
- "Batch operations to avoid Kafka load spikes"
- "Schema field names don't always match aspect names"

A generated client leaves this knowledge in Slack threads and documentation. Users discover best practices through painful trial and error.

The hand-crafted SDK **encodes** this knowledge:

```java
// Users call addTag(), SDK internally:
// - Creates a JSON Patch (not full replacement)
// - Targets the editable aspect in SDK mode
// - Accumulates patches for atomic emission
// - Uses the correct field paths
```

The SDK becomes **executable documentation** of best practices. This scales better than tribal knowledge.
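One piece of that tribal knowledge, batching, is simple enough to sketch. The helper below is a generic illustration of splitting pending changes into bounded batches; `BatchSketch` is a hypothetical name, and nothing here reflects how the SDK actually sizes or emits its batches.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a generic illustration, not the SDK's batching code.
final class BatchSketch {

    // Splits items into consecutive batches of at most batchSize elements,
    // preserving the original order.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }
}
```

When logic like this lives in the SDK, a caller never has to pick a batch size or remember to batch at all.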

### Why Not an ORM Approach?

Tools like Hibernate, SQLAlchemy, and Pydantic+ORM excel at managing complex object graphs in transactional applications. Why didn't we use this pattern?

**Metadata operations follow different patterns than OLTP workloads:**

1. **Bulk mutations** - "Tag 50 datasets as PII" requires only URNs and the operation, not loading full object graphs
2. **Point lookups** - "Get this dataset's schema before querying" is a direct fetch, no relationship navigation needed
3. **Read-modify-write** - "Infer quality scores from schema statistics" involves fetching an aspect, transforming it, and patching it back

ORMs optimize for relationship traversal (`dataset.container.database.catalog`), session lifecycle management, and automatic dirty tracking. But:

- **Relationship traversal** is handled by DataHub's search and graph query APIs, not in-memory navigation
- **Explicit patches** are central to our design—we want `addTag()` visible in code, not hidden behind session flush
- **Session complexity** adds cognitive overhead without benefit for metadata's bulk/point/patch patterns

The result: a simpler, more explicit API that matches how developers actually work with metadata.
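The bulk-mutation pattern is worth a small sketch, because it shows why no object graph is needed: the inputs are a list of URNs and one operation. The `BulkTagSketch` and `TagPatch` names and the `/tags/<urn>` path are hypothetical, chosen only to illustrate the shape of the work.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and path layout are assumptions for illustration.
final class BulkTagSketch {

    // A patch description addressed to one entity.
    static final class TagPatch {
        final String entityUrn;
        final String path;

        TagPatch(String entityUrn, String path) {
            this.entityUrn = entityUrn;
            this.path = path;
        }
    }

    // Produces one "add tag" patch per URN without fetching any entity first.
    static List<TagPatch> tagAll(List<String> datasetUrns, String tagUrn) {
        List<TagPatch> patches = new ArrayList<>(datasetUrns.size());
        for (String urn : datasetUrns) {
            patches.add(new TagPatch(urn, "/tags/" + tagUrn));
        }
        return patches;
    }
}
```

There is no session to open and nothing to load: tagging 50 datasets means building 50 small patch descriptions.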

### 4. Centralized Maintenance vs Distributed Pain

Generated clients push maintenance costs onto users. When we improve DataHub:

- **Add a new endpoint**: Users regenerate their client. Breaking change? Every team upgrades simultaneously.
- **Change error handling**: Regenerate. Update all call sites.
- **Optimize batch operations**: Can't—that logic lives in user code, reinvented by every team.

Hand-crafted SDKs centralize expertise:

- **Add convenience methods**: Users pull the SDK update. No code changes required.
- **Improve retry logic**: Fixed once in the SDK. All users benefit immediately.
- **Optimize batching**: Built into the SDK. Users get better performance automatically.

The total maintenance cost is **lower** because we fix problems once instead of every team solving them independently.
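Retry logic is a good example of something worth fixing once. The sketch below shows a generic retry with exponential backoff; `RetrySketch` and its parameters are hypothetical, and the SDK's real retry policy may differ in every detail.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: a generic retry-with-backoff sketch, not the SDK's policy.
final class RetrySketch {

    // Runs the action up to maxAttempts times, doubling the wait after each failure.
    // Rethrows the last failure if every attempt fails.
    static <T> T withRetry(Callable<T> action, int maxAttempts, long initialBackoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long backoff = initialBackoffMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                // Only wait if another attempt is coming.
                if (attempt < maxAttempts) {
                    Thread.sleep(backoff);
                    backoff *= 2;
                }
            }
        }
        throw last;
    }
}
```

If this lived in every team's codebase, improving the backoff strategy would require each team to notice and patch their own copy.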

### 5. Progressive Disclosure

Generated clients are flat—every endpoint is equally visible. Hand-crafted SDKs enable **progressive disclosure**: simple tasks are simple, complexity is opt-in.

**Day 1 user**: Create and tag a dataset

```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

dataset.addTag("pii");
client.entities().upsert(dataset);
```

No need to understand aspects, patches, or modes.

**Week 1 user**: Manage governance

```java
dataset.addOwner(ownerUrn, type)
       .setDomain(domainUrn)
       .addTerm(termUrn);
```

Still pure domain operations.

**Month 1 user**: Understand update vs upsert

```java
// update() emits only patches (for existing entities)
Dataset existing = client.entities().get(urn);
Dataset mutable = existing.mutable();  // Get writable copy
mutable.addTag("sensitive");
client.entities().update(mutable);

// upsert() emits full aspects + patches
Dataset newEntity = Dataset.builder()...;
client.entities().upsert(newEntity);
```

Complexity revealed **when needed**, not upfront.

### 6. Immutability by Default

Entities fetched from the server are **read-only by default**, enforcing explicit mutation intent.

**The Problem:**

Traditional SDKs allow silent mutation of fetched objects:

```java
Dataset dataset = client.entities().get(urn);
// Pass to function - might it mutate dataset? Who knows!
processDataset(dataset);
// Is dataset still the same? Must read all code to know
```

**The Solution:**

Immutable-by-default makes mutation intent explicit:

```java
Dataset dataset = client.entities().get(urn);
// dataset is read-only - safe to pass anywhere
processDataset(dataset);

// Want to mutate? Make it explicit
Dataset mutable = dataset.mutable();
mutable.addTag("updated");
client.entities().upsert(mutable);
```

**Benefits:**

- **Safety:** Can't accidentally mutate shared references
- **Clarity:** `.mutable()` call signals write intent
- **Debugging:** Easier to track where mutations happen
- **Concurrency:** Safe to share read-only entities across threads

**Design Inspiration:**

This pattern is common in modern APIs because immutability scales better than defensive copying:

- **Rust's ownership model** - shared (`&`) vs. mutable (`&mut`) borrows
- **Python's frozen dataclasses** - `@dataclass(frozen=True)`
- **Java's unmodifiable collections** - `List.of()`, `Collections.unmodifiableList()`
- **Functional programming principles** - immutable data structures

When you see `.mutable()` in our SDK, you're seeing battle-tested patterns from languages designed for safety and concurrency.
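For readers curious how such a guard can work, here is a minimal sketch of the read-only/writable-copy pattern. `ReadOnlySketch` and its `fetched` factory are hypothetical stand-ins; the SDK's entity classes may enforce immutability quite differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for an entity class: illustrates the pattern only.
final class ReadOnlySketch {

    private final List<String> tags;
    private final boolean readOnly;

    private ReadOnlySketch(List<String> tags, boolean readOnly) {
        // Defensive copy so instances never share the underlying list.
        this.tags = new ArrayList<>(tags);
        this.readOnly = readOnly;
    }

    // Simulates an entity as it would come back from a fetch: read-only.
    static ReadOnlySketch fetched(List<String> tags) {
        return new ReadOnlySketch(tags, true);
    }

    // Returns an independent writable copy; the original stays untouched.
    ReadOnlySketch mutable() {
        return new ReadOnlySketch(tags, false);
    }

    // Mutators fail fast on read-only instances.
    ReadOnlySketch addTag(String tag) {
        if (readOnly) {
            throw new IllegalStateException("Entity is read-only; call mutable() first");
        }
        tags.add(tag);
        return this;
    }

    List<String> tags() {
        return Collections.unmodifiableList(tags);
    }
}
```

Calling `addTag` on a fetched instance throws, while `fetched(...).mutable().addTag(...)` succeeds and leaves the fetched instance unchanged.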

## What This Costs (And Why It's Worth It)

Hand-crafting isn't free:

- **3,000+ lines of code** across entity classes, caching, and operations
- **457 tests** validating workflows, not just HTTP mechanics
- **13 documentation guides** teaching patterns, not just parameters
- **Ongoing maintenance** as DataHub evolves

But this investment compounds. Every hour we spend on the SDK saves hundreds of hours across our user community. The SDK makes metadata management **effortless** instead of just **possible**.

Compare total cost of ownership:

| Approach         | Initial Dev | User Onboarding | Ongoing Support | Innovation Speed |
| ---------------- | ----------- | --------------- | --------------- | ---------------- |
| Generated Client | Hours       | High (steep)    | High (repeated) | Slow (coupled)   |
| Hand-Crafted SDK | Weeks       | Low (gradual)   | Low (central)   | Fast (buffered)  |

After 6-12 months, the hand-crafted SDK becomes cheaper because centralized expertise scales better than distributed tribal knowledge.

## The Philosophy: What SDKs Should Be

This isn't about generated vs hand-crafted code. It's about what we believe SDKs **should be**.

**SDKs are not just API wrappers.** They are:

- **Semantic layers** that translate domain concepts to wire protocols
- **Knowledge repositories** that encode best practices
- **Usability interfaces** that optimize for human cognition
- **Evolution buffers** that allow internals to improve while APIs remain stable

Code generation is perfect when **the API is the abstraction**. But for domain-rich platforms where users think in terms of datasets, lineage, and governance—not HTTP verbs and JSON payloads—hand-crafted SDKs aren't just better. They're necessary.

## When Should You Generate? When Should You Craft?

**Generate when**:

- Your API's conceptual model matches user mental models
- The wire protocol fully captures domain semantics
- Operations are mostly stateless CRUD
- You prioritize API coverage over workflow optimization

**Hand-craft when**:

- Domain concepts require translation to wire protocol
- Users need guidance on best practices
- Stateful workflows matter (accumulate changes, emit atomically)
- You prioritize usability over feature completeness

DataHub falls firmly in the second category. Our users don't want to learn aspect models, patch formats, or mode routing. They want to **add a tag to a dataset** and have it just work.

That's what the hand-crafted SDK delivers.

## Conclusion: Empathy at Scale

In an era of automation, there's pressure to generate everything. But some problems demand craftsmanship.

The hand-crafted SDK is an act of **empathy at scale**. It says: "We understand your problems. We've encoded the solutions. You shouldn't have to become a DataHub expert to use DataHub."

A generated client says: "Here's our API. Figure it out."

A hand-crafted SDK says: "Here's how to solve your problems."

That difference is why we invested in hand-crafting. And it's why our users can focus on their data, not our API internals.

---

- **Document Status**: Design Philosophy
- **Author**: DataHub OSS Team
- **Last Updated**: 2025-01-06