# Why We Hand-Crafted the Java SDK V2 (Instead of Generating It)

## The Question

When building DataHub's Java SDK V2, we faced a choice that every API platform eventually confronts: should we generate our SDK from OpenAPI specs, or hand-craft it?

OpenAPI code generation is seductive. Tools like OpenAPI Generator promise instant SDKs in dozens of languages. Run a command, get a client—complete with type-safe models, proper serialization, and comprehensive endpoint coverage. Why would anyone choose to write thousands of lines of code by hand?

We chose to hand-craft. This document explains why.

## When Code Generation Works Beautifully

Let's be clear: code generation isn't wrong. It's incredibly effective when your abstraction boundary aligns with your wire protocol.

**CRUD APIs**: If your API exposes resources like `GET /users/{id}`, `POST /users`, `DELETE /users/{id}`, a generated client is perfect:
```java
User user = client.getUser(123);
client.createUser(newUser);
client.deleteUser(456);
```

The user's mental model—"I want to fetch/create/delete a user"—maps directly to HTTP operations. There's no translation needed.

**Protocol Buffers**: Google's protobuf generators are exemplary because the `.proto` file **is** the contract:

```protobuf
service UserService {
  rpc GetUser(UserId) returns (User);
  rpc ListUsers(ListRequest) returns (UserList);
}
```

The service definition becomes the client API with perfect fidelity. What you define is what users get.

**The Pattern**: Code generation excels when **the API's conceptual model matches user mental models**, and the wire protocol fully captures domain semantics.
## The Semantic Gap: Why DataHub Is Different

DataHub doesn't fit this mold. Our metadata platform has a semantic gap between what users want to do and what the HTTP API exposes.

### The Aspect-Based Model

DataHub stores metadata as discrete "aspects"—properties, tags, ownership, schemas. But users don't think in aspects. They think:

- "I want to add a 'PII' tag to this dataset"
- "I need to assign ownership to John"
- "This table should be in the Finance domain"

An OpenAPI-generated client would expose:
```java
// What the API provides
client.updateGlobalTags(entityUrn, globalTagsPayload);
client.updateOwnership(entityUrn, ownershipPayload);
```

But to use this, you need to know:

- What is `GlobalTags`? How do I construct it?
- Should I use PUT (full replacement) or PATCH (incremental update)?
- How do I avoid race conditions when multiple systems update tags?
- Where do tags even live—in system aspects or editable aspects?

This is expert-level knowledge pushed onto every user.
### The Patch Complexity

DataHub supports both full aspect replacement (PUT) and JSON Patch (incremental updates). The generated client would expose both:

```java
// Full replacement
void putGlobalTags(Urn entityUrn, GlobalTags tags);

// JSON Patch
void patchGlobalTags(Urn entityUrn, JsonPatch patch);
```

Now users must decide when to use each. Patches are safer (no race conditions), but how do you construct a JsonPatch? Do you use a PatchBuilder? Hand-write JSON?

Every user solves this problem independently, reinventing best practices.

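For concreteness, here is a minimal sketch of what "hand-write JSON" looks like in practice: an RFC 6902 patch built as a raw string. The `/tags/-` path and value shape are illustrative, not the exact DataHub aspect layout.

```java
// Sketch: hand-building an RFC 6902 JSON Patch for a tag add.
// The "/tags/-" path and value shape are illustrative only.
public class TagPatchSketch {
    static String addTagPatch(String tagUrn) {
        // "add" with an array-append path ("/-") appends one element
        // instead of replacing the whole list, which is what makes
        // patches safe under concurrent writers.
        return "[{\"op\":\"add\",\"path\":\"/tags/-\","
                + "\"value\":{\"tag\":\"" + tagUrn + "\"}}]";
    }

    public static void main(String[] args) {
        System.out.println(addTagPatch("urn:li:tag:pii"));
    }
}
```

Even this tiny example forces a user to know the patch format, its escaping rules, and the aspect's internal paths. That is exactly the knowledge an SDK should own.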
### The Mode Problem

DataHub has dual aspects: **system aspects** (written by ingestion pipelines) and **editable aspects** (written by humans via UI/SDK). Users editing metadata should write to editable aspects, but pipelines should write to system aspects.

A generated client doesn't understand this distinction. It just exposes endpoints. Users must learn DataHub's aspect model to route correctly.

## Six Principles of Hand-Crafted SDKs

Our hand-crafted SDK addresses these gaps through six design principles.
### 1. Semantic Layers Translate Domain Concepts

The SDK provides operations that match how users think:

```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("fact_revenue")
    .build();

// Think "add a tag", not "construct and PUT a GlobalTags aspect"
dataset.addTag("pii");

// Think "assign ownership", not "build an Ownership aspect"
dataset.addOwner("urn:li:corpuser:jdoe", OwnershipType.TECHNICAL_OWNER);

client.entities().upsert(dataset);
```

The SDK translates `addTag()` into the correct:

- Aspect type (GlobalTags)
- Operation type (JSON Patch for safety)
- Aspect variant (editable, in SDK mode)
- JSON path (into the aspect structure)

This is **semantic translation**—mapping domain intent to wire protocol. Generators can't do this because the semantics live in institutional knowledge, not OpenAPI specs.
### 2. Opinionated APIs: The 95/5 Rule

We optimized for the 95% case and provided escape hatches for the 5%.

**The 95% case**: Incremental metadata changes—add a tag, update ownership, set a domain.

```java
dataset.addTag("sensitive")
    .addOwner(ownerUrn, type)
    .setDomain(domainUrn);

client.entities().update(dataset);
```

Users never think about PUT vs PATCH, aspect construction, or batch strategies. It just works.

**The 5% case**: Complete aspect replacement, custom MCPs, or operations V2 doesn't support.

```java
// Drop to V1 SDK for full control
RestEmitter emitter = client.emitter();
MetadataChangeProposalWrapper mcpw = /* custom logic */;
emitter.emit(mcpw).get();
```

This philosophy—**make simple things trivial, complex things possible**—requires intentional API design. Generators produce flat API surfaces where every operation has equal weight.
### 3. Encoding Expert Knowledge

Every platform accumulates tribal knowledge:

- "Always use patches for concurrent-safe updates"
- "Editable aspects override system aspects in SDK mode"
- "Batch operations to avoid Kafka load spikes"
- "Schema field names don't always match aspect names"

A generated client leaves this knowledge in Slack threads and documentation. Users discover best practices through painful trial and error.

The hand-crafted SDK **encodes** this knowledge:

```java
// Users call addTag(), SDK internally:
// - Creates a JSON Patch (not full replacement)
// - Targets the editable aspect in SDK mode
// - Accumulates patches for atomic emission
// - Uses the correct field paths
```

The SDK becomes **executable documentation** of best practices. This scales better than tribal knowledge.

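One way to picture this encoding is a small facade that turns `addTag()` into a recorded patch operation. This is a hypothetical sketch; the class and method names are illustrative, not the real SDK internals.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not the real SDK internals): addTag() records a
// JSON-Patch-style operation rather than mutating server state, so all
// pending changes can be emitted together in one atomic batch.
public class PatchAccumulator {
    public record PatchOp(String op, String path, String value) {}

    private final List<PatchOp> pending = new ArrayList<>();

    public PatchAccumulator addTag(String tagUrn) {
        // Encodes the rule "always patch, never replace" once, for everyone.
        pending.add(new PatchOp("add", "/tags/-", tagUrn));
        return this; // fluent, matching the chained style used elsewhere
    }

    public List<PatchOp> pendingOps() {
        return List.copyOf(pending); // callers can inspect but not mutate
    }
}
```

The point is not this particular shape but where the decision lives: the "patch, don't replace" rule is written once in the SDK instead of rediscovered by every caller.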
### Why Not an ORM Approach?

Tools like Hibernate, SQLAlchemy, and Pydantic+ORM excel at managing complex object graphs in transactional applications. Why didn't we use this pattern?

**Metadata operations follow different patterns than OLTP workloads:**

1. **Bulk mutations** - "Tag 50 datasets as PII" requires only URNs and the operation, not loading full object graphs
2. **Point lookups** - "Get this dataset's schema before querying" is a direct fetch, no relationship navigation needed
3. **Read-modify-write** - "Infer quality scores from schema statistics" involves fetching an aspect, transforming it, and patching it back

ORMs optimize for relationship traversal (`dataset.container.database.catalog`), session lifecycle management, and automatic dirty tracking. But:

- **Relationship traversal** is handled by DataHub's search and graph query APIs, not in-memory navigation
- **Explicit patches** are central to our design—we want `addTag()` visible in code, not hidden behind session flush
- **Session complexity** adds cognitive overhead without benefit for metadata's bulk/point/patch patterns

The result: a simpler, more explicit API that matches how developers actually work with metadata.
### 4. Centralized Maintenance vs Distributed Pain

Generated clients push maintenance costs onto users. When we improve DataHub:

- **Add a new endpoint**: Users regenerate their client. Breaking change? Every team upgrades simultaneously.
- **Change error handling**: Regenerate. Update all call sites.
- **Optimize batch operations**: Can't—that logic lives in user code, reinvented by every team.

Hand-crafted SDKs centralize expertise:

- **Add convenience methods**: Users pull the SDK update. No code changes required.
- **Improve retry logic**: Fixed once in the SDK. All users benefit immediately.
- **Optimize batching**: Built into the SDK. Users get better performance automatically.

The total maintenance cost is **lower** because we fix problems once instead of every team solving them independently.
### 5. Progressive Disclosure

Generated clients are flat—every endpoint is equally visible. Hand-crafted SDKs enable **progressive disclosure**: simple tasks are simple, complexity is opt-in.

**Day 1 user**: Create and tag a dataset

```java
Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

dataset.addTag("pii");
client.entities().upsert(dataset);
```

No need to understand aspects, patches, or modes.

**Week 1 user**: Manage governance

```java
dataset.addOwner(ownerUrn, type)
    .setDomain(domainUrn)
    .addTerm(termUrn);
```

Still pure domain operations.

**Month 1 user**: Understand update vs upsert

```java
// update() emits only patches (for existing entities)
Dataset existing = client.entities().get(urn);
Dataset mutable = existing.mutable(); // Get writable copy
mutable.addTag("sensitive");
client.entities().update(mutable);

// upsert() emits full aspects + patches
Dataset newEntity = Dataset.builder()...;
client.entities().upsert(newEntity);
```

Complexity revealed **when needed**, not upfront.
### 6. Immutability by Default

Entities fetched from the server are **read-only by default**, enforcing explicit mutation intent.

**The Problem:**

Traditional SDKs allow silent mutation of fetched objects:

```java
Dataset dataset = client.get(urn);
// Pass to function - might it mutate dataset? Who knows!
processDataset(dataset);
// Is dataset still the same? Must read all code to know
```

**The Solution:**

Immutable-by-default makes mutation intent explicit:

```java
Dataset dataset = client.get(urn);
// dataset is read-only - safe to pass anywhere
processDataset(dataset);

// Want to mutate? Make it explicit
Dataset mutable = dataset.mutable();
mutable.addTag("updated");
client.entities().upsert(mutable);
```

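The pattern itself is plain Java. Here is a self-contained sketch, with illustrative class names rather than the real SDK types:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of immutable-by-default (not the real SDK types):
// reads hand out a frozen snapshot; writing requires an explicit copy.
public class FrozenDataset {
    private final List<String> tags;

    public FrozenDataset(List<String> tags) {
        this.tags = List.copyOf(tags); // List.copyOf is unmodifiable
    }

    public List<String> tags() {
        return tags; // safe to share: callers cannot mutate it
    }

    public Mutable mutable() {
        return new Mutable(new ArrayList<>(tags)); // defensive copy
    }

    public static class Mutable {
        private final List<String> tags;

        Mutable(List<String> tags) {
            this.tags = tags;
        }

        public Mutable addTag(String tag) {
            tags.add(tag);
            return this;
        }

        public FrozenDataset freeze() {
            return new FrozenDataset(tags);
        }
    }
}
```

Note the asymmetry: reading is free, while writing requires the explicit `mutable()` copy plus an explicit emit. That friction is what makes mutation visible in code review.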
**Benefits:**

- **Safety:** Can't accidentally mutate shared references
- **Clarity:** `.mutable()` call signals write intent
- **Debugging:** Easier to track where mutations happen
- **Concurrency:** Safe to share read-only entities across threads

**Design Inspiration:**

This pattern is common in modern APIs because immutability scales better than defensive copying:

- **Rust's ownership model** - mutable (`&mut`) vs shared (`&`) borrows
- **Python's frozen dataclasses** - `@dataclass(frozen=True)`
- **Java's unmodifiable collections** - `List.copyOf()`, `Collections.unmodifiableList()`
- **Functional programming principles** - immutable data structures

When you see `.mutable()` in our SDK, you're seeing battle-tested patterns from languages designed for safety and concurrency.
## What This Costs (And Why It's Worth It)

Hand-crafting isn't free:

- **3,000+ lines of code** across entity classes, caching, and operations
- **457 tests** validating workflows, not just HTTP mechanics
- **13 documentation guides** teaching patterns, not just parameters
- **Ongoing maintenance** as DataHub evolves

But this investment compounds. Every hour we spend on the SDK saves hundreds of hours across our user community. The SDK makes metadata management **effortless** instead of just **possible**.

Compare total cost of ownership:

| Approach         | Initial Dev | User Onboarding | Ongoing Support | Innovation Speed |
| ---------------- | ----------- | --------------- | --------------- | ---------------- |
| Generated Client | Hours       | High (steep)    | High (repeated) | Slow (coupled)   |
| Hand-Crafted SDK | Weeks       | Low (gradual)   | Low (central)   | Fast (buffered)  |

After 6-12 months, the hand-crafted SDK becomes cheaper because centralized expertise scales better than distributed tribal knowledge.
## The Philosophy: What SDKs Should Be

This isn't about generated vs hand-crafted code. It's about what we believe SDKs **should be**.

**SDKs are not just API wrappers.** They are:

- **Semantic layers** that translate domain concepts to wire protocols
- **Knowledge repositories** that encode best practices
- **Usability interfaces** that optimize for human cognition
- **Evolution buffers** that allow internals to improve while APIs remain stable

Code generation is perfect when **the API is the abstraction**. But for domain-rich platforms where users think in terms of datasets, lineage, and governance—not HTTP verbs and JSON payloads—hand-crafted SDKs aren't just better. They're necessary.
## When Should You Generate? When Should You Craft?

**Generate when**:

- Your API's conceptual model matches user mental models
- The wire protocol fully captures domain semantics
- Operations are mostly stateless CRUD
- You prioritize API coverage over workflow optimization

**Hand-craft when**:

- Domain concepts require translation to wire protocol
- Users need guidance on best practices
- Stateful workflows matter (accumulate changes, emit atomically)
- You prioritize usability over feature completeness

DataHub falls firmly in the second category. Our users don't want to learn aspect models, patch formats, or mode routing. They want to **add a tag to a dataset** and have it just work.

That's what the hand-crafted SDK delivers.
## Conclusion: Empathy at Scale

In an era of automation, there's pressure to generate everything. But some problems demand craftsmanship.

The hand-crafted SDK is an act of **empathy at scale**. It says: "We understand your problems. We've encoded the solutions. You shouldn't have to become a DataHub expert to use DataHub."

A generated client says: "Here's our API. Figure it out."

A hand-crafted SDK says: "Here's how to solve your problems."

That difference is why we invested in hand-crafting. And it's why our users can focus on their data, not our API internals.
---

**Document Status**: Design Philosophy

**Author**: DataHub OSS Team

**Last Updated**: 2025-01-06
|