DataHubClientV2 Configuration

The DataHubClientV2 is the primary entry point for interacting with DataHub using SDK V2. This guide covers client configuration, connection management, and operation modes.

Creating a Client

Basic Configuration

The minimal configuration requires only a server URL:

import datahub.client.v2.DataHubClientV2;

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .build();

With Authentication

For DataHub Cloud or secured instances, provide a personal access token:

DataHubClientV2 client = DataHubClientV2.builder()
    .server("https://your-instance.acryl.io")
    .token("your-personal-access-token")
    .build();

Getting a Token: In DataHub UI → Settings → Access Tokens → Generate Personal Access Token

From Environment Variables

Configure the client using environment variables:

export DATAHUB_SERVER=http://localhost:8080
export DATAHUB_TOKEN=your-token-here

DataHubClientConfigV2 config = DataHubClientConfigV2.fromEnv();
DataHubClientV2 client = new DataHubClientV2(config);

Supported environment variables (see the sketch after this list):

  • DATAHUB_SERVER or DATAHUB_GMS_URL - Server URL (required)
  • DATAHUB_TOKEN or DATAHUB_GMS_TOKEN - Authentication token (optional)
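
For illustration, here is a minimal sketch of the lookup these variables imply. It assumes DATAHUB_SERVER and DATAHUB_TOKEN take precedence over the GMS-prefixed aliases; verify against the SDK source before relying on that ordering.

// Hypothetical sketch of fromEnv() resolution; not the actual SDK source
String server = System.getenv("DATAHUB_SERVER");
if (server == null) {
    server = System.getenv("DATAHUB_GMS_URL");  // alias (assumed lower precedence)
}
if (server == null) {
    throw new IllegalStateException("DATAHUB_SERVER or DATAHUB_GMS_URL must be set");
}

String token = System.getenv("DATAHUB_TOKEN");
if (token == null) {
    token = System.getenv("DATAHUB_GMS_TOKEN");  // optional
}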

Configuration Options

Timeouts

Configure request timeouts to handle slow networks:

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .timeoutMs(30000)  // 30 seconds
    .build();

Default: 10 seconds (10000ms)

Retries

Configure automatic retries for failed requests:

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .maxRetries(5)  // Retry up to 5 times
    .build();

Default: 3 retries

SSL Certificate Verification

For testing environments, you can disable SSL verification:

DataHubClientV2 client = DataHubClientV2.builder()
    .server("https://localhost:8443")
    .disableSslVerification(true)  // WARNING: Only for testing!
    .build();

Warning: Never disable SSL verification in production! This makes your connection vulnerable to man-in-the-middle attacks.

Operation Modes

SDK V2 supports two distinct operation modes that control how metadata is written to DataHub:

SDK Mode (Default)

Use for: Interactive applications, user-initiated metadata edits, real-time UI updates

Behavior:

  • Writes to editable aspects (e.g., editableDatasetProperties)
  • Uses synchronous DB writes for immediate consistency
  • Returns only after metadata is committed to the database

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .operationMode(DataHubClientConfigV2.OperationMode.SDK)  // Default
    .build();

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

dataset.setDescription("User-provided description");
client.entities().upsert(dataset);
// Writes to editableDatasetProperties synchronously
// Metadata immediately visible after return

INGESTION Mode

Use for: ETL pipelines, data ingestion jobs, automated metadata collection, batch processing

Behavior:

  • Writes to system aspects (e.g., datasetProperties)
  • Uses asynchronous Kafka writes for high throughput
  • Returns immediately after the message is queued

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .operationMode(DataHubClientConfigV2.OperationMode.INGESTION)
    .build();

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

dataset.setDescription("Ingested from Snowflake");
client.entities().upsert(dataset);
// Writes to datasetProperties asynchronously via Kafka
// High throughput for batch ingestion

Mode Comparison

| Aspect          | SDK Mode                   | INGESTION Mode                   |
|-----------------|----------------------------|----------------------------------|
| Target Aspects  | Editable aspects           | System aspects                   |
| Write Path      | Synchronous (direct to DB) | Asynchronous (via Kafka)         |
| Consistency     | Immediate (linearizable)   | Eventual (async processing)      |
| Throughput      | Lower (waits for DB)       | Higher (queued)                  |
| Use Case        | User edits via UI/API      | Pipeline metadata extraction     |
| Precedence      | Higher (overrides system)  | Lower (overridden by user edits) |
| Example Aspects | editableDatasetProperties  | datasetProperties                |
| Latency         | ~100-500 ms                | ~10-50 ms (queueing only)        |
| Error Handling  | Immediate feedback         | Eventual (check logs)            |

Why two modes? (A short sketch of the interplay follows this list.)

  • Clear provenance: Distinguish human edits from machine-generated metadata
  • Non-destructive updates: Ingestion can refresh without clobbering user documentation
  • UI consistency: DataHub UI shows editable aspects as user overrides
  • Performance optimization: Async ingestion for high-volume batch writes, sync for interactive edits
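
To make the interplay concrete, here is a sketch of an ingestion refresh alongside a user edit on the same dataset. The code mirrors the examples above; the comments describe the expected aspect routing rather than verified output.

DataHubClientV2 ingestClient = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .operationMode(DataHubClientConfigV2.OperationMode.INGESTION)
    .build();

DataHubClientV2 sdkClient = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .operationMode(DataHubClientConfigV2.OperationMode.SDK)
    .build();

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

// Ingestion run: lands in datasetProperties (system aspect)
dataset.setDescription("Raw description from the source system");
ingestClient.entities().upsert(dataset);

// User edit: lands in editableDatasetProperties, shown by the UI as the override
dataset.setDescription("Curated description from a data steward");
sdkClient.entities().upsert(dataset);

// A later ingestion refresh touches only datasetProperties;
// the steward's edit is untouched
dataset.setDescription("Refreshed description from the source system");
ingestClient.entities().upsert(dataset);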

Async Mode Control (The Escape Hatch)

By default, the async mode is automatically inferred from your operation mode:

  • SDK mode → synchronous writes (immediate consistency)
  • INGESTION mode → asynchronous writes (high throughput)

However, you can explicitly override this behavior using the asyncIngest parameter when you need full control:

Force Synchronous in INGESTION Mode

For pipelines that need immediate consistency guarantees:

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .operationMode(DataHubClientConfigV2.OperationMode.INGESTION)
    .asyncIngest(false)  // Override: force synchronous despite INGESTION mode
    .build();

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

dataset.setDescription("Ingested description");
client.entities().upsert(dataset);
// Writes to datasetProperties synchronously, waits for DB commit
// Use when you need guaranteed consistency before proceeding

Use cases (see the sketch after this list):

  • Critical ingestion jobs where you must verify writes succeeded
  • Sequential processing where each step depends on previous writes
  • Testing scenarios requiring deterministic behavior
  • Compliance workflows requiring audit trail confirmation
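
For example, a sequential migration step might verify each synchronous write before continuing. This sketch assumes entities().get() returns the committed state (see CRUD Operations below) and that Dataset exposes a getUrn() accessor:

DataHubClientV2 client = DataHubClientV2.builder()
    .server(System.getenv("DATAHUB_SERVER"))
    .token(System.getenv("DATAHUB_TOKEN"))
    .operationMode(DataHubClientConfigV2.OperationMode.INGESTION)
    .asyncIngest(false)  // upsert returns only after the DB commit
    .build();

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();
dataset.setDescription("Migrated from legacy catalog");

client.entities().upsert(dataset);

// Synchronous write: a read-after-write sees the committed state
Dataset committed = client.entities().get(dataset.getUrn());  // getUrn() assumed
if (committed == null) {
    throw new IllegalStateException("Migration write not visible; aborting");
}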

Force Asynchronous in SDK Mode

For high-volume SDK operations that can tolerate eventual consistency:

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .operationMode(DataHubClientConfigV2.OperationMode.SDK)
    .asyncIngest(true)  // Override: force async despite SDK mode
    .build();

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

dataset.setDescription("User-provided description");
client.entities().upsert(dataset);
// Writes to editableDatasetProperties via Kafka for higher throughput
// Trade immediate consistency for performance

Use cases (see the sketch after this list):

  • Bulk metadata updates from admin tools
  • Migration scripts moving large volumes of data
  • Performance-critical batch operations
  • Load testing and benchmarking
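
For instance, a bulk update might loop over many datasets and let each upsert return as soon as it is queued. The table names below are placeholders:

import java.util.List;

DataHubClientV2 client = DataHubClientV2.builder()
    .server(System.getenv("DATAHUB_SERVER"))
    .token(System.getenv("DATAHUB_TOKEN"))
    .operationMode(DataHubClientConfigV2.OperationMode.SDK)
    .asyncIngest(true)  // each upsert returns as soon as it is queued
    .build();

List<String> tableNames = List.of("orders", "customers", "payments");  // placeholders
for (String name : tableNames) {
    Dataset dataset = Dataset.builder()
        .platform("snowflake")
        .name(name)
        .build();
    dataset.setDescription("Bulk-updated description");
    client.entities().upsert(dataset);  // queued; does not wait for the DB commit
}

client.close();  // release connections; queued messages are processed eventually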

Decision Guide

| Scenario                | Operation Mode | asyncIngest | Result                           |
|-------------------------|----------------|-------------|----------------------------------|
| User edits in web UI    | SDK            | (default)   | Sync writes to editable aspects  |
| ETL pipeline ingestion  | INGESTION      | (default)   | Async writes to system aspects   |
| Critical data migration | INGESTION      | false       | Sync writes to system aspects    |
| Bulk admin updates      | SDK            | true        | Async writes to editable aspects |

The mode-based defaults fit the large majority of use cases. Set asyncIngest explicitly only when you have specific performance or consistency requirements.

Testing the Connection

Verify connectivity before performing operations:

try {
    boolean connected = client.testConnection();
    if (connected) {
        System.out.println("Connected to DataHub!");
    } else {
        System.err.println("Failed to connect");
    }
} catch (Exception e) {
    System.err.println("Connection error: " + e.getMessage());
}

The testConnection() method performs a GET request to the /config endpoint to verify that the server is reachable.
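
In services that start alongside DataHub, you may want to poll testConnection() before doing real work. A minimal sketch with a fixed pause between attempts:

static void awaitDataHub(DataHubClientV2 client) throws InterruptedException {
    for (int attempt = 1; attempt <= 6; attempt++) {
        try {
            if (client.testConnection()) {
                return;  // server reachable
            }
        } catch (Exception e) {
            // server not reachable yet; fall through to the pause below
        }
        Thread.sleep(5_000);  // fixed 5-second pause between attempts
    }
    throw new IllegalStateException("DataHub not reachable after ~30 seconds");
}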

Client Lifecycle

Resource Management

The client implements AutoCloseable for automatic resource management:

try (DataHubClientV2 client = DataHubClientV2.builder()
        .server("http://localhost:8080")
        .build()) {

    // Use client
    client.entities().upsert(dataset);

} // Client automatically closed

Manual Closing

If not using try-with-resources, explicitly close the client:

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .build();

try {
    // Use client
} finally {
    client.close();  // Release HTTP connections
}

Why close? Closing the client releases the underlying HTTP connection pool.
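
For long-lived services where try-with-resources does not fit, one option is to tie close() to JVM shutdown. A sketch:

DataHubClientV2 client = DataHubClientV2.builder()
    .server(System.getenv("DATAHUB_SERVER"))
    .token(System.getenv("DATAHUB_TOKEN"))
    .build();

Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    try {
        client.close();  // release the HTTP connection pool on shutdown
    } catch (Exception e) {
        // log and ignore; the JVM is exiting anyway
    }
}));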

Advanced Configuration

Complete Configuration Example

import datahub.client.v2.DataHubClientV2;
import datahub.client.v2.config.DataHubClientConfigV2;

DataHubClientV2 client = DataHubClientV2.builder()
    // Server configuration
    .server("https://your-instance.acryl.io")
    .token("your-personal-access-token")

    // Timeout configuration
    .timeoutMs(30000)  // 30 seconds

    // Retry configuration
    .maxRetries(5)

    // Operation mode (SDK or INGESTION)
    .operationMode(DataHubClientConfigV2.OperationMode.SDK)

    // Async mode control (optional - overrides mode-based default)
    // .asyncIngest(false)  // Explicit control: true=async, false=sync

    // SSL configuration (testing only!)
    .disableSslVerification(false)

    .build();

Accessing the Underlying RestEmitter

For advanced use cases, access the low-level REST emitter:

RestEmitter emitter = client.getEmitter();
// Direct access to emission methods

Note: Most users should use the high-level client.entities() API instead.
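
Assuming getEmitter() returns the standard datahub-client RestEmitter (worth verifying for your SDK version), direct emission looks roughly like this:

import com.linkedin.dataset.DatasetProperties;
import datahub.client.MetadataWriteResponse;
import datahub.client.rest.RestEmitter;
import datahub.event.MetadataChangeProposalWrapper;
import java.util.concurrent.Future;

RestEmitter emitter = client.getEmitter();

MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
    .entityType("dataset")
    .entityUrn("urn:li:dataset:(urn:li:dataPlatform:snowflake,my_table,PROD)")
    .upsert()
    .aspect(new DatasetProperties().setDescription("Emitted via RestEmitter"))
    .build();

Future<MetadataWriteResponse> future = emitter.emit(mcpw, null);
future.get();  // blocks until the emission completes; throws on failure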

Entity Operations

Once configured, use the client to perform entity operations:

CRUD Operations

// Create/Update (upsert)
client.entities().upsert(dataset);

// Update with patches
client.entities().update(dataset);

// Read (datasetUrn is the URN of a dataset written earlier)
Dataset loaded = client.entities().get(datasetUrn);

See the Getting Started Guide for comprehensive examples.

Configuration Best Practices

Production Deployment

DataHubClientV2 client = DataHubClientV2.builder()
    .server(System.getenv("DATAHUB_SERVER"))
    .token(System.getenv("DATAHUB_TOKEN"))
    .timeoutMs(30000)       // Higher timeout for production
    .maxRetries(5)          // More retries for reliability
    .operationMode(DataHubClientConfigV2.OperationMode.SDK)
    .disableSslVerification(false)  // Always verify SSL!
    .build();

ETL Pipeline

DataHubClientV2 client = DataHubClientV2.builder()
    .server(System.getenv("DATAHUB_SERVER"))
    .token(System.getenv("DATAHUB_TOKEN"))
    .timeoutMs(60000)       // Higher timeout for batch jobs
    .maxRetries(3)
    .operationMode(DataHubClientConfigV2.OperationMode.INGESTION)  // Async by default
    .build();

Critical Data Migration

For migrations where you need confirmation before proceeding:

DataHubClientV2 client = DataHubClientV2.builder()
    .server(System.getenv("DATAHUB_SERVER"))
    .token(System.getenv("DATAHUB_TOKEN"))
    .timeoutMs(60000)
    .maxRetries(5)
    .operationMode(DataHubClientConfigV2.OperationMode.INGESTION)
    .asyncIngest(false)     // Force sync for guaranteed consistency
    .build();

Local Development

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    // No token needed for local quickstart
    .timeoutMs(10000)
    .build();

Troubleshooting

Connection Refused

Error: java.net.ConnectException: Connection refused

Solutions:

  • Verify DataHub server is running
  • Check server URL is correct
  • Ensure port is accessible (firewall rules)

Authentication Failed

Error: 401 Unauthorized

Solutions:

  • Verify token is valid and not expired
  • Check token has correct permissions
  • Ensure token matches the server environment

Timeout

Error: java.util.concurrent.TimeoutException

Solutions:

  • Increase timeoutMs configuration
  • Check network latency to DataHub server
  • Verify server is not overloaded

SSL Certificate Error

Error: javax.net.ssl.SSLHandshakeException

Solutions:

  • Ensure server SSL certificate is valid
  • Add CA certificate to Java truststore (see the keytool sketch below)
  • For testing only: use disableSslVerification(true)
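
For the truststore route, importing the CA certificate with keytool looks like this. The alias and file path are placeholders, and changeit is the JDK's default cacerts password:

keytool -importcert \
  -alias datahub-ca \
  -file /path/to/ca-certificate.pem \
  -cacerts \
  -storepass changeit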

Next Steps

API Reference

For complete API documentation, see: