DataHubClientV2 Configuration

The DataHubClientV2 is the primary entry point for interacting with DataHub using SDK V2. This guide covers client configuration, connection management, and operation modes.

Creating a Client

Basic Configuration

The minimal configuration requires only a server URL:

import datahub.client.v2.DataHubClientV2;

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .build();

With Authentication

For DataHub Cloud or secured instances, provide a personal access token:

DataHubClientV2 client = DataHubClientV2.builder()
    .server("https://your-instance.acryl.io")
    .token("your-personal-access-token")
    .build();

Getting a Token: In DataHub UI → Settings → Access Tokens → Generate Personal Access Token

From Environment Variables

Configure the client using environment variables:

export DATAHUB_SERVER=http://localhost:8080
export DATAHUB_TOKEN=your-token-here

DataHubClientConfigV2 config = DataHubClientConfigV2.fromEnv();
DataHubClientV2 client = new DataHubClientV2(config);

Supported environment variables (see the sketch after this list):

  • DATAHUB_SERVER or DATAHUB_GMS_URL - Server URL (required)
  • DATAHUB_TOKEN or DATAHUB_GMS_TOKEN - Authentication token (optional)
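
For illustration, here is a minimal sketch of the lookup these variables imply. It assumes DATAHUB_SERVER and DATAHUB_TOKEN take precedence over the GMS-prefixed aliases; verify against the SDK source before relying on that ordering.

// Hypothetical sketch of fromEnv() resolution; not the actual SDK source
String server = System.getenv("DATAHUB_SERVER");
if (server == null) {
    server = System.getenv("DATAHUB_GMS_URL");  // alias (assumed lower precedence)
}
if (server == null) {
    throw new IllegalStateException("DATAHUB_SERVER or DATAHUB_GMS_URL must be set");
}

String token = System.getenv("DATAHUB_TOKEN");
if (token == null) {
    token = System.getenv("DATAHUB_GMS_TOKEN");  // optional
}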

Configuration Options

Timeouts

Configure request timeouts to handle slow networks:

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .timeoutMs(30000)  // 30 seconds
    .build();

Default: 10 seconds (10000ms)

Retries

Configure automatic retries for failed requests:

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .maxRetries(5)  // Retry up to 5 times
    .build();

Default: 3 retries

SSL Certificate Verification

For testing environments, you can disable SSL verification:

DataHubClientV2 client = DataHubClientV2.builder()
    .server("https://localhost:8443")
    .disableSslVerification(true)  // WARNING: Only for testing!
    .build();

Warning: Never disable SSL verification in production! This makes your connection vulnerable to man-in-the-middle attacks.

Operation Modes

SDK V2 supports two distinct operation modes that control how metadata is written to DataHub:

SDK Mode (Default)

Use for: Interactive applications, user-initiated metadata edits, real-time UI updates

Behavior:

  • Writes to editable aspects (e.g., editableDatasetProperties)
  • Uses synchronous DB writes for immediate consistency
  • Returns only after metadata is committed to the database

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .operationMode(DataHubClientConfigV2.OperationMode.SDK)  // Default
    .build();

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

dataset.setDescription("User-provided description");
client.entities().upsert(dataset);
// Writes to editableDatasetProperties synchronously
// Metadata immediately visible after return

INGESTION Mode

Use for: ETL pipelines, data ingestion jobs, automated metadata collection, batch processing

Behavior:

  • Writes to system aspects (e.g., datasetProperties)
  • Uses asynchronous Kafka writes for high throughput
  • Returns immediately after the message is queued

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .operationMode(DataHubClientConfigV2.OperationMode.INGESTION)
    .build();

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

dataset.setDescription("Ingested from Snowflake");
client.entities().upsert(dataset);
// Writes to datasetProperties asynchronously via Kafka
// High throughput for batch ingestion

Mode Comparison

| Aspect          | SDK Mode                   | INGESTION Mode                   |
|-----------------|----------------------------|----------------------------------|
| Target Aspects  | Editable aspects           | System aspects                   |
| Write Path      | Synchronous (direct to DB) | Asynchronous (via Kafka)         |
| Consistency     | Immediate (linearizable)   | Eventual (async processing)      |
| Throughput      | Lower (waits for DB)       | Higher (queued)                  |
| Use Case        | User edits via UI/API      | Pipeline metadata extraction     |
| Precedence      | Higher (overrides system)  | Lower (overridden by user edits) |
| Example Aspects | editableDatasetProperties  | datasetProperties                |
| Latency         | ~100-500 ms                | ~10-50 ms (queueing only)        |
| Error Handling  | Immediate feedback         | Eventual (check logs)            |

Why two modes? (A short sketch of the interplay follows this list.)

  • Clear provenance: Distinguish human edits from machine-generated metadata
  • Non-destructive updates: Ingestion can refresh without clobbering user documentation
  • UI consistency: DataHub UI shows editable aspects as user overrides
  • Performance optimization: Async ingestion for high-volume batch writes, sync for interactive edits
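
To make the interplay concrete, here is a sketch of an ingestion refresh alongside a user edit on the same dataset. The code mirrors the examples above; the comments describe the expected aspect routing rather than verified output.

DataHubClientV2 ingestClient = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .operationMode(DataHubClientConfigV2.OperationMode.INGESTION)
    .build();

DataHubClientV2 sdkClient = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .operationMode(DataHubClientConfigV2.OperationMode.SDK)
    .build();

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

// Ingestion run: lands in datasetProperties (system aspect)
dataset.setDescription("Raw description from the source system");
ingestClient.entities().upsert(dataset);

// User edit: lands in editableDatasetProperties, shown by the UI as the override
dataset.setDescription("Curated description from a data steward");
sdkClient.entities().upsert(dataset);

// A later ingestion refresh touches only datasetProperties;
// the steward's edit is untouched
dataset.setDescription("Refreshed description from the source system");
ingestClient.entities().upsert(dataset);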

Async Mode Control (The Escape Hatch)

By default, the async mode is automatically inferred from your operation mode:

  • SDK mode → synchronous writes (immediate consistency)
  • INGESTION mode → asynchronous writes (high throughput)

However, you can explicitly override this behavior using the asyncIngest parameter when you need full control:

Force Synchronous in INGESTION Mode

For pipelines that need immediate consistency guarantees:

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .operationMode(DataHubClientConfigV2.OperationMode.INGESTION)
    .asyncIngest(false)  // Override: force synchronous despite INGESTION mode
    .build();

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

dataset.setDescription("Ingested description");
client.entities().upsert(dataset);
// Writes to datasetProperties synchronously, waits for DB commit
// Use when you need guaranteed consistency before proceeding

Use cases (see the sketch after this list):

  • Critical ingestion jobs where you must verify writes succeeded
  • Sequential processing where each step depends on previous writes
  • Testing scenarios requiring deterministic behavior
  • Compliance workflows requiring audit trail confirmation
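
For example, a sequential migration step might verify each synchronous write before continuing. This sketch assumes entities().get() returns the committed state (see CRUD Operations below) and that Dataset exposes a getUrn() accessor:

DataHubClientV2 client = DataHubClientV2.builder()
    .server(System.getenv("DATAHUB_SERVER"))
    .token(System.getenv("DATAHUB_TOKEN"))
    .operationMode(DataHubClientConfigV2.OperationMode.INGESTION)
    .asyncIngest(false)  // upsert returns only after the DB commit
    .build();

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();
dataset.setDescription("Migrated from legacy catalog");

client.entities().upsert(dataset);

// Synchronous write: a read-after-write sees the committed state
Dataset committed = client.entities().get(dataset.getUrn());  // getUrn() assumed
if (committed == null) {
    throw new IllegalStateException("Migration write not visible; aborting");
}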

Force Asynchronous in SDK Mode

For high-volume SDK operations that can tolerate eventual consistency:

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .operationMode(DataHubClientConfigV2.OperationMode.SDK)
    .asyncIngest(true)  // Override: force async despite SDK mode
    .build();

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

dataset.setDescription("User-provided description");
client.entities().upsert(dataset);
// Writes to editableDatasetProperties via Kafka for higher throughput
// Trade immediate consistency for performance

Use cases (see the sketch after this list):

  • Bulk metadata updates from admin tools
  • Migration scripts moving large volumes of data
  • Performance-critical batch operations
  • Load testing and benchmarking
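
For instance, a bulk update might loop over many datasets and let each upsert return as soon as it is queued. The table names below are placeholders:

import java.util.List;

DataHubClientV2 client = DataHubClientV2.builder()
    .server(System.getenv("DATAHUB_SERVER"))
    .token(System.getenv("DATAHUB_TOKEN"))
    .operationMode(DataHubClientConfigV2.OperationMode.SDK)
    .asyncIngest(true)  // each upsert returns as soon as it is queued
    .build();

List<String> tableNames = List.of("orders", "customers", "payments");  // placeholders
for (String name : tableNames) {
    Dataset dataset = Dataset.builder()
        .platform("snowflake")
        .name(name)
        .build();
    dataset.setDescription("Bulk-updated description");
    client.entities().upsert(dataset);  // queued; does not wait for the DB commit
}

client.close();  // release connections; queued messages are processed eventually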

Decision Guide

| Scenario                | Operation Mode | asyncIngest | Result                           |
|-------------------------|----------------|-------------|----------------------------------|
| User edits in web UI    | SDK            | (default)   | Sync writes to editable aspects  |
| ETL pipeline ingestion  | INGESTION      | (default)   | Async writes to system aspects   |
| Critical data migration | INGESTION      | false       | Sync writes to system aspects    |
| Bulk admin updates      | SDK            | true        | Async writes to editable aspects |

The mode-based defaults fit the large majority of use cases. Set asyncIngest explicitly only when you have specific performance or consistency requirements.

Testing the Connection

Verify connectivity before performing operations:

try {
    boolean connected = client.testConnection();
    if (connected) {
        System.out.println("Connected to DataHub!");
    } else {
        System.err.println("Failed to connect");
    }
} catch (Exception e) {
    System.err.println("Connection error: " + e.getMessage());
}

The testConnection() method performs a GET request to the /config endpoint to verify that the server is reachable.
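
In services that start alongside DataHub, you may want to poll testConnection() before doing real work. A minimal sketch with a fixed pause between attempts:

static void awaitDataHub(DataHubClientV2 client) throws InterruptedException {
    for (int attempt = 1; attempt <= 6; attempt++) {
        try {
            if (client.testConnection()) {
                return;  // server reachable
            }
        } catch (Exception e) {
            // server not reachable yet; fall through to the pause below
        }
        Thread.sleep(5_000);  // fixed 5-second pause between attempts
    }
    throw new IllegalStateException("DataHub not reachable after ~30 seconds");
}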

Client Lifecycle

Resource Management

The client implements AutoCloseable for automatic resource management:

try (DataHubClientV2 client = DataHubClientV2.builder()
        .server("http://localhost:8080")
        .build()) {

    // Use client
    client.entities().upsert(dataset);

} // Client automatically closed

Manual Closing

If not using try-with-resources, explicitly close the client:

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .build();

try {
    // Use client
} finally {
    client.close();  // Release HTTP connections
}

Why close? Closing the client releases the underlying HTTP connection pool.
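
For long-lived services where try-with-resources does not fit, one option is to tie close() to JVM shutdown. A sketch:

DataHubClientV2 client = DataHubClientV2.builder()
    .server(System.getenv("DATAHUB_SERVER"))
    .token(System.getenv("DATAHUB_TOKEN"))
    .build();

Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    try {
        client.close();  // release the HTTP connection pool on shutdown
    } catch (Exception e) {
        // log and ignore; the JVM is exiting anyway
    }
}));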

Advanced Configuration

Complete Configuration Example

import datahub.client.v2.DataHubClientV2;
import datahub.client.v2.config.DataHubClientConfigV2;

DataHubClientV2 client = DataHubClientV2.builder()
    // Server configuration
    .server("https://your-instance.acryl.io")
    .token("your-personal-access-token")

    // Timeout configuration
    .timeoutMs(30000)  // 30 seconds

    // Retry configuration
    .maxRetries(5)

    // Operation mode (SDK or INGESTION)
    .operationMode(DataHubClientConfigV2.OperationMode.SDK)

    // Async mode control (optional - overrides mode-based default)
    // .asyncIngest(false)  // Explicit control: true=async, false=sync

    // SSL configuration (testing only!)
    .disableSslVerification(false)

    .build();

Accessing the Underlying RestEmitter

For advanced use cases, access the low-level REST emitter:

RestEmitter emitter = client.getEmitter();
// Direct access to emission methods

Note: Most users should use the high-level client.entities() API instead.
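
Assuming getEmitter() returns the standard datahub-client RestEmitter (worth verifying for your SDK version), direct emission looks roughly like this:

import com.linkedin.dataset.DatasetProperties;
import datahub.client.MetadataWriteResponse;
import datahub.client.rest.RestEmitter;
import datahub.event.MetadataChangeProposalWrapper;
import java.util.concurrent.Future;

RestEmitter emitter = client.getEmitter();

MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
    .entityType("dataset")
    .entityUrn("urn:li:dataset:(urn:li:dataPlatform:snowflake,my_table,PROD)")
    .upsert()
    .aspect(new DatasetProperties().setDescription("Emitted via RestEmitter"))
    .build();

Future<MetadataWriteResponse> future = emitter.emit(mcpw, null);
future.get();  // blocks until the emission completes; throws on failure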

Entity Operations

Once configured, use the client to perform entity operations:

CRUD Operations

// Create/Update (upsert)
client.entities().upsert(dataset);

// Update with patches
client.entities().update(dataset);

// Read (datasetUrn is the URN of a dataset written earlier)
Dataset loaded = client.entities().get(datasetUrn);

See the Getting Started Guide for comprehensive examples.

Configuration Best Practices

Production Deployment

DataHubClientV2 client = DataHubClientV2.builder()
    .server(System.getenv("DATAHUB_SERVER"))
    .token(System.getenv("DATAHUB_TOKEN"))
    .timeoutMs(30000)       // Higher timeout for production
    .maxRetries(5)          // More retries for reliability
    .operationMode(DataHubClientConfigV2.OperationMode.SDK)
    .disableSslVerification(false)  // Always verify SSL!
    .build();

ETL Pipeline

DataHubClientV2 client = DataHubClientV2.builder()
    .server(System.getenv("DATAHUB_SERVER"))
    .token(System.getenv("DATAHUB_TOKEN"))
    .timeoutMs(60000)       // Higher timeout for batch jobs
    .maxRetries(3)
    .operationMode(DataHubClientConfigV2.OperationMode.INGESTION)  // Async by default
    .build();

Critical Data Migration

For migrations where you need confirmation before proceeding:

DataHubClientV2 client = DataHubClientV2.builder()
    .server(System.getenv("DATAHUB_SERVER"))
    .token(System.getenv("DATAHUB_TOKEN"))
    .timeoutMs(60000)
    .maxRetries(5)
    .operationMode(DataHubClientConfigV2.OperationMode.INGESTION)
    .asyncIngest(false)     // Force sync for guaranteed consistency
    .build();

Local Development

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    // No token needed for local quickstart
    .timeoutMs(10000)
    .build();

Troubleshooting

Connection Refused

Error: java.net.ConnectException: Connection refused

Solutions:

  • Verify DataHub server is running
  • Check server URL is correct
  • Ensure port is accessible (firewall rules)

Authentication Failed

Error: 401 Unauthorized

Solutions:

  • Verify token is valid and not expired
  • Check token has correct permissions
  • Ensure token matches the server environment

Timeout

Error: java.util.concurrent.TimeoutException

Solutions:

  • Increase timeoutMs configuration
  • Check network latency to DataHub server
  • Verify server is not overloaded

SSL Certificate Error

Error: javax.net.ssl.SSLHandshakeException

Solutions:

  • Ensure server SSL certificate is valid
  • Add CA certificate to Java truststore (see the keytool sketch below)
  • For testing only: use disableSslVerification(true)
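
For the truststore route, importing the CA certificate with keytool looks like this. The alias and file path are placeholders, and changeit is the JDK's default cacerts password:

keytool -importcert \
  -alias datahub-ca \
  -file /path/to/ca-certificate.pem \
  -cacerts \
  -storepass changeit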

Next Steps

API Reference

For complete API documentation, see: