# Why We Hand-Crafted the Java SDK V2 (Instead of Generating It)
## The Question
When building DataHub's Java SDK V2, we faced a choice that every API platform eventually confronts: should we generate our SDK from OpenAPI specs, or hand-craft it?
OpenAPI code generation is seductive. Tools like OpenAPI Generator promise instant SDKs in dozens of languages. Run a command, get a client—complete with type-safe models, proper serialization, and comprehensive endpoint coverage. Why would anyone choose to write thousands of lines of code by hand?
We chose to hand-craft. This document explains why.
## When Code Generation Works Beautifully
Let's be clear: code generation isn't wrong. It's incredibly effective when your abstraction boundary aligns with your wire protocol.
**CRUD APIs**: If your API exposes resources like `GET /users/{id}`, `POST /users`, `DELETE /users/{id}`, a generated client is perfect:
```java
User user = client.getUser(123);
client.createUser(newUser);
client.deleteUser(456);
```
The user's mental model—"I want to fetch/create/delete a user"—maps directly to HTTP operations. There's no translation needed.
**Protocol Buffers and gRPC**: Google's protobuf and gRPC generators are exemplary because the `.proto` file **is** the contract:
```protobuf
service UserService {
  rpc GetUser(UserId) returns (User);
  rpc ListUsers(ListRequest) returns (UserList);
}
```
The service definition becomes the client API with perfect fidelity. What you define is what users get.
**The Pattern**: Code generation excels when **the API's conceptual model matches user mental models**, and the wire protocol fully captures domain semantics.
## The Semantic Gap: Why DataHub Is Different
DataHub doesn't fit this mold. Our metadata platform has a semantic gap between what users want to do and what the HTTP API exposes.
### The Aspect-Based Model
DataHub stores metadata as discrete "aspects"—properties, tags, ownership, schemas. But users don't think in aspects. They think:
- "I want to add a 'PII' tag to this dataset"
- "I need to assign ownership to John"
- "This table should be in the Finance domain"
An OpenAPI-generated client would expose:
```java
// What the API provides
client.updateGlobalTags(entityUrn, globalTagsPayload);
client.updateOwnership(entityUrn, ownershipPayload);
```
But to use this, you need to know:
- What is `GlobalTags`? How do I construct it?
- Should I use PUT (full replacement) or PATCH (incremental update)?
- How do I avoid race conditions when multiple systems update tags?
- Where do tags even live—in system aspects or editable aspects?
This is expert-level knowledge pushed onto every user.
### The Patch Complexity
DataHub supports both full aspect replacement (PUT) and JSON Patch (incremental updates). The generated client would expose both:
```java
// Full replacement
void putGlobalTags(Urn entityUrn, GlobalTags tags);
// JSON Patch
void patchGlobalTags(Urn entityUrn, JsonPatch patch);
```
Now users must decide when to use each. Patches are safer under concurrent writers because they do not overwrite changes the caller never saw, but how do you construct a `JsonPatch`? Do you use a PatchBuilder? Hand-write JSON?
Every user solves this problem independently, reinventing best practices.
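To make the race concrete, here is a minimal, self-contained sketch. It is not DataHub code: `TagAspectStore`, `putTags`, and `patchAddTag` are hypothetical names standing in for a server-side tag aspect, and the interleaving of the two writers is simulated sequentially.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical in-memory stand-in for one entity's tag aspect on the server.
class TagAspectStore {
    private final Set<String> tags = new LinkedHashSet<>();

    // Full replacement (PUT-style): the caller's snapshot overwrites everything.
    synchronized void putTags(Set<String> replacement) {
        tags.clear();
        tags.addAll(replacement);
    }

    // Incremental update (patch-style): only the named tag is touched.
    synchronized void patchAddTag(String tag) {
        tags.add(tag);
    }

    // Returns a copy, like a client reading the current aspect.
    synchronized Set<String> getTags() {
        return new LinkedHashSet<>(tags);
    }
}

class LostUpdateDemo {
    // Two writers read the same snapshot, then each PUTs its own modified copy.
    static Set<String> withFullReplacement() {
        TagAspectStore store = new TagAspectStore();
        Set<String> snapshotA = store.getTags();
        Set<String> snapshotB = store.getTags();
        snapshotA.add("pii");
        snapshotB.add("finance");
        store.putTags(snapshotA);
        store.putTags(snapshotB); // overwrites writer A's change
        return store.getTags();
    }

    // The same two writers each send an incremental add instead.
    static Set<String> withPatches() {
        TagAspectStore store = new TagAspectStore();
        store.patchAddTag("pii");
        store.patchAddTag("finance");
        return store.getTags();
    }

    public static void main(String[] args) {
        System.out.println("PUT:   " + withFullReplacement());
        System.out.println("PATCH: " + withPatches());
    }
}
```

Running `LostUpdateDemo` prints `PUT:   [finance]` and `PATCH: [pii, finance]`: the full-replacement path silently drops the first writer's tag, while the patch path keeps both.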
### The Mode Problem
DataHub has dual aspects: **system aspects** (written by ingestion pipelines) and **editable aspects** (written by humans via UI/SDK). Users editing metadata should write to editable aspects, but pipelines should write to system aspects.
A generated client doesn't understand this distinction. It just exposes endpoints. Users must learn DataHub's aspect model to route correctly.
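The routing decision can be sketched in a few lines. This is an illustration under assumptions, not the SDK's real API: `WriteMode` and `AspectRouter` are hypothetical names, and the two aspect names are used only to show a system/editable pair for a dataset description.

```java
// Hypothetical sketch: names are illustrative, not the SDK's actual API.
enum WriteMode { SDK, INGESTION }

class AspectRouter {
    // Choose which aspect variant a description write should target.
    static String descriptionAspectFor(WriteMode mode) {
        switch (mode) {
            case SDK:
                // Human or SDK edits go to the editable variant.
                return "editableDatasetProperties";
            case INGESTION:
                // Pipeline writes go to the system variant.
                return "datasetProperties";
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        for (WriteMode mode : WriteMode.values()) {
            System.out.println(mode + " -> " + descriptionAspectFor(mode));
        }
    }
}
```

With a generated client, every caller has to carry this mapping in their own code; a hand-crafted SDK can apply it once, behind the domain method.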
## Six Principles of Hand-Crafted SDKs
Our hand-crafted SDK addresses these gaps through six design principles.
### 1. Semantic Layers Translate Domain Concepts
The SDK provides operations that match how users think:
```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("fact_revenue")
    .build();
// Think "add a tag", not "construct and PUT a GlobalTags aspect"
dataset.addTag("pii");
// Think "assign ownership", not "build an Ownership aspect"
dataset.addOwner("urn:li:corpuser:jdoe", OwnershipType.TECHNICAL_OWNER);
client.entities().upsert(dataset);
```
The SDK translates `addTag()` into the correct:
- Aspect type (GlobalTags)
- Operation type (JSON Patch for safety)
- Aspect variant (editable, in SDK mode)
- JSON path (into the aspect structure)
This is **semantic translation**—mapping domain intent to wire protocol. Generators can't do this because the semantics live in institutional knowledge, not OpenAPI specs.
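As a rough illustration of that translation step, the sketch below turns a short tag name into a JSON Patch `add` operation. `TagPatchTranslator` is a hypothetical class, and the path layout (`/tags/<urn>`) and value shape are assumptions made for the example, not a statement of what the SDK emits.

```java
// Hypothetical sketch of semantic translation; not the SDK's real classes.
class TagPatchTranslator {
    private static final String TAG_URN_PREFIX = "urn:li:tag:";

    // Expand a short tag name into a tag URN if it is not one already.
    static String toTagUrn(String tag) {
        return tag.startsWith(TAG_URN_PREFIX) ? tag : TAG_URN_PREFIX + tag;
    }

    // Render the JSON Patch "add" operation that a call like addTag(tag) could produce.
    // Assumption: the path layout is /tags/<urn>. No JSON escaping is done here,
    // so this only handles tag names without quotes, backslashes, '/' or '~'.
    static String addTagPatch(String tag) {
        String urn = toTagUrn(tag);
        return "{\"op\":\"add\",\"path\":\"/tags/" + urn
                + "\",\"value\":{\"tag\":\"" + urn + "\"}}";
    }

    public static void main(String[] args) {
        System.out.println(addTagPatch("pii"));
    }
}
```

The caller writes `"pii"`; the URN expansion, operation type, and path are decisions the translator makes on their behalf.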
### 2. Opinionated APIs: The 95/5 Rule
We optimized for the 95% case and provided escape hatches for the 5%.
**The 95% case**: Incremental metadata changes—add a tag, update ownership, set a domain.
```java
dataset.addTag("sensitive")
    .addOwner(ownerUrn, type)
    .setDomain(domainUrn);
client.entities().update(dataset);
```
Users never think about PUT vs PATCH, aspect construction, or batch strategies. It just works.
**The 5% case**: Complete aspect replacement, custom MCPs, or operations V2 doesn't support.
```java
// Drop to V1 SDK for full control
RestEmitter emitter = client.emitter();
MetadataChangeProposalWrapper mcpw = /* custom logic */;
emitter.emit(mcpw).get();
```
This philosophy—**make simple things trivial, complex things possible**—requires intentional API design. Generators produce flat API surfaces where every operation has equal weight.
### 3. Encoding Expert Knowledge
Every platform accumulates tribal knowledge:
- "Always use patches for concurrent-safe updates"
- "Editable aspects override system aspects in SDK mode"
- "Batch operations to avoid Kafka load spikes"
- "Schema field names don't always match aspect names"
A generated client leaves this knowledge in Slack threads and documentation. Users discover best practices through painful trial and error.
The hand-crafted SDK **encodes** this knowledge:
```java
// Users call addTag(), SDK internally:
// - Creates a JSON Patch (not full replacement)
// - Targets the editable aspect in SDK mode
// - Accumulates patches for atomic emission
// - Uses the correct field paths
```
The SDK becomes **executable documentation** of best practices. This scales better than tribal knowledge.
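One way to picture "accumulate patches, then emit together" is the sketch below. `PendingPatchEntity` and `drainPatches` are hypothetical names, and the patch strings are simplified placeholders rather than real wire payloads.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: an entity that records intent as pending patches.
class PendingPatchEntity {
    private final String urn;
    private final List<String> pending = new ArrayList<>();

    PendingPatchEntity(String urn) {
        this.urn = urn;
    }

    // Each domain call appends a patch description instead of hitting the network.
    PendingPatchEntity addTag(String tag) {
        pending.add("add /tags/urn:li:tag:" + tag);
        return this;
    }

    PendingPatchEntity setDomain(String domainUrn) {
        pending.add("add /domains/" + domainUrn);
        return this;
    }

    // Hand over everything accumulated so far and reset,
    // so a single emit call can carry all changes together.
    List<String> drainPatches() {
        List<String> batch = new ArrayList<>(pending);
        pending.clear();
        return Collections.unmodifiableList(batch);
    }

    String urn() {
        return urn;
    }

    public static void main(String[] args) {
        PendingPatchEntity entity = new PendingPatchEntity("urn:example:dataset");
        entity.addTag("pii").setDomain("urn:li:domain:finance");
        System.out.println(entity.urn() + " -> " + entity.drainPatches());
    }
}
```

The point of the design is that callers never see the batch: they express intent through fluent calls, and the client decides when and how the accumulated patches are sent.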
### Why Not an ORM Approach?
Tools like Hibernate, SQLAlchemy, and Pydantic+ORM excel at managing complex object graphs in transactional applications. Why didn't we use this pattern?
**Metadata operations follow different patterns than OLTP workloads:**
1. **Bulk mutations** - "Tag 50 datasets as PII" requires only URNs and the operation, not loading full object graphs
2. **Point lookups** - "Get this dataset's schema before querying" is a direct fetch, no relationship navigation needed
3. **Read-modify-write** - "Infer quality scores from schema statistics" involves fetching an aspect, transforming it, and patching it back
ORMs optimize for relationship traversal (`dataset.container.database.catalog`), session lifecycle management, and automatic dirty tracking. But:
- **Relationship traversal** is handled by DataHub's search and graph query APIs, not in-memory navigation
- **Explicit patches** are central to our design—we want `addTag()` visible in code, not hidden behind session flush
- **Session complexity** adds cognitive overhead without benefit for metadata's bulk/point/patch patterns
The result: a simpler, more explicit API that matches how developers actually work with metadata.
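The bulk-mutation pattern from the list above can be sketched without any session or object graph. `BulkTagPlanner` and `PlannedPatch` are hypothetical names, and the patch layout is an assumption for illustration; the only inputs are URNs and the operation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: plans a bulk tagging job from URNs alone, with no entity loading.
class BulkTagPlanner {
    // One planned change: the target entity and the patch to send for it.
    static final class PlannedPatch {
        final String entityUrn;
        final String patchJson;

        PlannedPatch(String entityUrn, String patchJson) {
            this.entityUrn = entityUrn;
            this.patchJson = patchJson;
        }
    }

    // Build one "add tag" patch per dataset URN. The patch layout is an assumption.
    static List<PlannedPatch> planTagging(List<String> datasetUrns, String tagUrn) {
        String patch = "{\"op\":\"add\",\"path\":\"/tags/" + tagUrn
                + "\",\"value\":{\"tag\":\"" + tagUrn + "\"}}";
        List<PlannedPatch> plan = new ArrayList<>();
        for (String urn : datasetUrns) {
            plan.add(new PlannedPatch(urn, patch));
        }
        return plan;
    }

    public static void main(String[] args) {
        List<String> urns = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            urns.add("urn:li:dataset:(urn:li:dataPlatform:snowflake,db.table_" + i + ",PROD)");
        }
        List<PlannedPatch> plan = planTagging(urns, "urn:li:tag:pii");
        System.out.println(plan.size() + " patches planned; first targets " + plan.get(0).entityUrn);
    }
}
```

Nothing here reads existing state, which is why dirty tracking and session lifecycles add little for this kind of workload.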
### 4. Centralized Maintenance vs Distributed Pain
Generated clients push maintenance costs onto users. When we improve DataHub:
- **Add a new endpoint**: Users regenerate their client. If the change is breaking, every team has to upgrade at the same time.
- **Change error handling**: Regenerate. Update all call sites.
- **Optimize batch operations**: Can't—that logic lives in user code, reinvented by every team.
Hand-crafted SDKs centralize expertise:
- **Add convenience methods**: Users pull the SDK update. No code changes required.
- **Improve retry logic**: Fixed once in the SDK. All users benefit immediately.
- **Optimize batching**: Built into the SDK. Users get better performance automatically.
The total maintenance cost is **lower** because we fix problems once instead of every team solving them independently.
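Retry handling is a good example of logic that is cheaper to own centrally. The helper below is a minimal sketch of what such shared logic might look like; `RetryingCaller` is a hypothetical name, and retrying on every `RuntimeException` is a simplification (a real client would likely distinguish retryable from permanent failures).

```java
import java.util.function.Supplier;

// Hypothetical sketch of retry logic living inside an SDK rather than in each caller.
class RetryingCaller {
    // Run the action, retrying on RuntimeException with exponential backoff.
    static <T> T callWithRetry(Supplier<T> action, int maxAttempts, long baseDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    // Delay doubles each time: 1x, 2x, 4x, ... the base delay.
                    sleep(baseDelayMillis << (attempt - 1));
                }
            }
        }
        throw lastFailure;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while backing off", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

If the backoff policy needs to change, it changes in one place and every caller picks it up on the next SDK upgrade.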
### 5. Progressive Disclosure
Generated clients are flat—every endpoint is equally visible. Hand-crafted SDKs enable **progressive disclosure**: simple tasks are simple, complexity is opt-in.
**Day 1 user**: Create and tag a dataset
```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();
dataset.addTag("pii");
client.entities().upsert(dataset);
```
No need to understand aspects, patches, or modes.
**Week 1 user**: Manage governance
```java
dataset.addOwner(ownerUrn, type)
    .setDomain(domainUrn)
    .addTerm(termUrn);
```
Still pure domain operations.
**Month 1 user**: Understand update vs upsert
```java
// update() emits only patches (for existing entities)
Dataset existing = client.entities().get(urn);
Dataset mutable = existing.mutable(); // Get writable copy
mutable.addTag("sensitive");
client.entities().update(mutable);
// upsert() emits full aspects + patches
Dataset newEntity = Dataset.builder()...;
client.entities().upsert(newEntity);
```
Complexity revealed **when needed**, not upfront.
### 6. Immutability by Default
Entities fetched from the server are **read-only by default**, enforcing explicit mutation intent.
**The Problem:**
Traditional SDKs allow silent mutation of fetched objects:
```java
Dataset dataset = client.get(urn);
// Pass to function - might it mutate dataset? Who knows!
processDataset(dataset);
// Is dataset still the same? Must read all code to know
```
**The Solution:**
Immutable-by-default makes mutation intent explicit:
```java
Dataset dataset = client.entities().get(urn);
// dataset is read-only - safe to pass anywhere
processDataset(dataset);
// Want to mutate? Make it explicit
Dataset mutable = dataset.mutable();
mutable.addTag("updated");
client.entities().upsert(mutable);
```
**Benefits:**
- **Safety:** Can't accidentally mutate shared references
- **Clarity:** `.mutable()` call signals write intent
- **Debugging:** Easier to track where mutations happen
- **Concurrency:** Safe to share read-only entities across threads
**Design Inspiration:**
This pattern is common in modern APIs because immutability scales better than defensive copying:
- **Rust's ownership model** - shared (`&`) vs. mutable (`&mut`) borrows
- **Python's frozen dataclasses** - `@dataclass(frozen=True)`
- **Java's unmodifiable collections** - `Collections.unmodifiableList()`, `List.of()`
- **Functional programming principles** - immutable data structures
When you see `.mutable()` in our SDK, you're seeing battle-tested patterns from languages designed for safety and concurrency.
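For readers who want to see the mechanics, here is a minimal sketch of a read-only-by-default entity in plain Java. `SketchDataset` is a hypothetical class written for this example; it is not the SDK's `Dataset`, and the real implementation may enforce read-only state differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of read-only-by-default entities; not the SDK's real Dataset class.
class SketchDataset {
    private final List<String> tags;
    private final boolean readOnly;

    private SketchDataset(List<String> tags, boolean readOnly) {
        this.tags = new ArrayList<>(tags); // defensive copy, so instances never share state
        this.readOnly = readOnly;
    }

    // Simulates an entity as returned by a fetch: read-only.
    static SketchDataset fetched(List<String> tags) {
        return new SketchDataset(tags, true);
    }

    // Explicit opt-in to mutation: returns an independent writable copy.
    SketchDataset mutable() {
        return new SketchDataset(tags, false);
    }

    SketchDataset addTag(String tag) {
        if (readOnly) {
            throw new IllegalStateException("Entity is read-only; call mutable() first");
        }
        tags.add(tag);
        return this;
    }

    // Callers get an unmodifiable view, so the list cannot be changed behind the entity's back.
    List<String> tags() {
        return Collections.unmodifiableList(tags);
    }

    public static void main(String[] args) {
        SketchDataset fetched = SketchDataset.fetched(Collections.singletonList("pii"));
        SketchDataset writable = fetched.mutable().addTag("updated");
        System.out.println("fetched:  " + fetched.tags());
        System.out.println("writable: " + writable.tags());
    }
}
```

Because `mutable()` copies the state, the fetched instance stays unchanged no matter what happens to the writable copy, which is what makes it safe to pass around.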
## What This Costs (And Why It's Worth It)
Hand-crafting isn't free:
- **3,000+ lines of code** across entity classes, caching, and operations
- **457 tests** validating workflows, not just HTTP mechanics
- **13 documentation guides** teaching patterns, not just parameters
- **Ongoing maintenance** as DataHub evolves
But this investment compounds. Every hour we spend on the SDK saves hundreds of hours across our user community. The SDK makes metadata management **effortless** instead of just **possible**.
Compare total cost of ownership:
| Approach | Initial Dev | User Onboarding | Ongoing Support | Innovation Speed |
| ---------------- | ----------- | --------------- | --------------- | ---------------- |
| Generated Client | Hours | High (steep) | High (repeated) | Slow (coupled) |
| Hand-Crafted SDK | Weeks | Low (gradual) | Low (central) | Fast (buffered) |
By our estimate, the hand-crafted SDK becomes cheaper after 6 to 12 months, because centralized expertise scales better than distributed tribal knowledge.
## The Philosophy: What SDKs Should Be
This isn't about generated vs hand-crafted code. It's about what we believe SDKs **should be**.
**SDKs are not just API wrappers.** They are:
- **Semantic layers** that translate domain concepts to wire protocols
- **Knowledge repositories** that encode best practices
- **Usability interfaces** that optimize for human cognition
- **Evolution buffers** that allow internals to improve while APIs remain stable
Code generation is perfect when **the API is the abstraction**. But for domain-rich platforms where users think in terms of datasets, lineage, and governance—not HTTP verbs and JSON payloads—hand-crafted SDKs aren't just better. They're necessary.
## When Should You Generate? When Should You Craft?
**Generate when**:
- Your API's conceptual model matches user mental models
- The wire protocol fully captures domain semantics
- Operations are mostly stateless CRUD
- You prioritize API coverage over workflow optimization
**Hand-craft when**:
- Domain concepts require translation to wire protocol
- Users need guidance on best practices
- Stateful workflows matter (accumulate changes, emit atomically)
- You prioritize usability over feature completeness
DataHub falls firmly in the second category. Our users don't want to learn aspect models, patch formats, or mode routing. They want to **add a tag to a dataset** and have it just work.
That's what the hand-crafted SDK delivers.
## Conclusion: Empathy at Scale
In an era of automation, there's pressure to generate everything. But some problems demand craftsmanship.
The hand-crafted SDK is an act of **empathy at scale**. It says: "We understand your problems. We've encoded the solutions. You shouldn't have to become a DataHub expert to use DataHub."
A generated client says: "Here's our API. Figure it out."
A hand-crafted SDK says: "Here's how to solve your problems."
That difference is why we invested in hand-crafting. And it's why our users can focus on their data, not our API internals.
---
**Document Status**: Design Philosophy
**Author**: DataHub OSS Team
**Last Updated**: 2025-01-06