# DataHubClientV2 Configuration
The DataHubClientV2 is the primary entry point for interacting with DataHub using SDK V2. This guide covers client configuration, connection management, and operation modes.
## Creating a Client

### Basic Configuration

The minimal configuration requires only a server URL:

```java
import datahub.client.v2.DataHubClientV2;

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .build();
```
### With Authentication

For DataHub Cloud or secured instances, provide a personal access token:

```java
DataHubClientV2 client = DataHubClientV2.builder()
    .server("https://your-instance.acryl.io")
    .token("your-personal-access-token")
    .build();
```

Getting a Token: In DataHub UI → Settings → Access Tokens → Generate Personal Access Token
### From Environment Variables

Configure the client using environment variables:

```bash
export DATAHUB_SERVER=http://localhost:8080
export DATAHUB_TOKEN=your-token-here
```

```java
DataHubClientConfigV2 config = DataHubClientConfigV2.fromEnv();
DataHubClientV2 client = new DataHubClientV2(config);
```

Supported environment variables:

- `DATAHUB_SERVER` or `DATAHUB_GMS_URL` - Server URL (required)
- `DATAHUB_TOKEN` or `DATAHUB_GMS_TOKEN` - Authentication token (optional)
## Configuration Options

### Timeouts

Configure request timeouts to handle slow networks:

```java
DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .timeoutMs(30000) // 30 seconds
    .build();
```

Default: 10 seconds (10000 ms)
### Retries

Configure automatic retries for failed requests:

```java
DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .maxRetries(5) // Retry up to 5 times
    .build();
```

Default: 3 retries
### SSL Certificate Verification

For testing environments, you can disable SSL verification:

```java
DataHubClientV2 client = DataHubClientV2.builder()
    .server("https://localhost:8443")
    .disableSslVerification(true) // WARNING: Only for testing!
    .build();
```

Warning: Never disable SSL verification in production! This makes your connection vulnerable to man-in-the-middle attacks.
## Operation Modes

SDK V2 supports two distinct operation modes that control how metadata is written to DataHub:

### SDK Mode (Default)

Use for: Interactive applications, user-initiated metadata edits, real-time UI updates

Behavior:

- Writes to editable aspects (e.g., `editableDatasetProperties`)
- Uses synchronous DB writes for immediate consistency
- Returns only after metadata is committed to the database
```java
DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .operationMode(DataHubClientConfigV2.OperationMode.SDK) // Default
    .build();

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

dataset.setDescription("User-provided description");
client.entities().upsert(dataset);
// Writes to editableDatasetProperties synchronously
// Metadata immediately visible after return
```
### INGESTION Mode

Use for: ETL pipelines, data ingestion jobs, automated metadata collection, batch processing

Behavior:

- Writes to system aspects (e.g., `datasetProperties`)
- Uses asynchronous Kafka writes for high throughput
- Returns immediately after the message is queued
```java
DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .operationMode(DataHubClientConfigV2.OperationMode.INGESTION)
    .build();

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

dataset.setDescription("Ingested from Snowflake");
client.entities().upsert(dataset);
// Writes to datasetProperties asynchronously via Kafka
// High throughput for batch ingestion
```
### Mode Comparison

| Aspect | SDK Mode | INGESTION Mode |
|---|---|---|
| Target Aspects | Editable aspects | System aspects |
| Write Path | Synchronous (direct to DB) | Asynchronous (via Kafka) |
| Consistency | Immediate (linearizable) | Eventual (async processing) |
| Throughput | Lower (waits for DB) | Higher (queued) |
| Use Case | User edits via UI/API | Pipeline metadata extraction |
| Precedence | Higher (overrides system) | Lower (overridden by user edits) |
| Example Aspects | editableDatasetProperties | datasetProperties |
| Latency | ~100-500ms | ~10-50ms (queueing only) |
| Error Handling | Immediate feedback | Eventual (check logs) |
Why two modes?
- Clear provenance: Distinguish human edits from machine-generated metadata
- Non-destructive updates: Ingestion can refresh without clobbering user documentation (see the sketch below)
- UI consistency: DataHub UI shows editable aspects as user overrides
- Performance optimization: Async ingestion for high-volume batch writes, sync for interactive edits
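
To make the non-destructive property concrete, here is a minimal sketch using only the builder calls shown earlier: an ingestion refresh writes the system aspect while a user edit writes the editable aspect, so the two never overwrite each other.

```java
// Ingestion refresh: targets the system aspect (datasetProperties).
try (DataHubClientV2 ingestClient = DataHubClientV2.builder()
        .server("http://localhost:8080")
        .operationMode(DataHubClientConfigV2.OperationMode.INGESTION)
        .build()) {
    Dataset fromPipeline = Dataset.builder()
        .platform("snowflake")
        .name("my_table")
        .build();
    fromPipeline.setDescription("Refreshed by the nightly Snowflake sync");
    ingestClient.entities().upsert(fromPipeline);
}

// User edit: targets the editable aspect (editableDatasetProperties).
// The UI shows this description as an override, so the nightly refresh
// above never clobbers it.
try (DataHubClientV2 sdkClient = DataHubClientV2.builder()
        .server("http://localhost:8080")
        .operationMode(DataHubClientConfigV2.OperationMode.SDK)
        .build()) {
    Dataset fromUser = Dataset.builder()
        .platform("snowflake")
        .name("my_table")
        .build();
    fromUser.setDescription("Curated description from the data team");
    sdkClient.entities().upsert(fromUser);
}
```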
## Async Mode Control (The Escape Hatch)

By default, the async mode is automatically inferred from your operation mode:

- SDK mode → synchronous writes (immediate consistency)
- INGESTION mode → asynchronous writes (high throughput)

However, you can explicitly override this behavior using the `asyncIngest` parameter when you need full control:
### Force Synchronous in INGESTION Mode

For pipelines that need immediate consistency guarantees:

```java
DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .operationMode(DataHubClientConfigV2.OperationMode.INGESTION)
    .asyncIngest(false) // Override: force synchronous despite INGESTION mode
    .build();

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

dataset.setDescription("Ingested description");
client.entities().upsert(dataset);
// Writes to datasetProperties synchronously, waits for DB commit
// Use when you need guaranteed consistency before proceeding
```
Use cases:
- Critical ingestion jobs where you must verify writes succeeded
- Sequential processing where each step depends on previous writes (see the sketch below)
- Testing scenarios requiring deterministic behavior
- Compliance workflows requiring audit trail confirmation
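
For example, a sequential step that must observe its own write before continuing might look like the following sketch. The `urn()` accessor and `getDescription()` getter are assumptions here; adjust to the accessors your build of the SDK provides.

```java
// Stage 1: register the raw dataset. With asyncIngest(false) this call
// blocks until the DB commit succeeds.
Dataset raw = Dataset.builder()
    .platform("snowflake")
    .name("raw_events")
    .build();
raw.setDescription("Stage 1: raw ingest complete");
client.entities().upsert(raw);

// Stage 2: because the write above was synchronous, an immediate read-back
// is guaranteed to see it. (urn() is a hypothetical accessor.)
Dataset confirmed = client.entities().get(raw.urn());
System.out.println("Committed description: " + confirmed.getDescription());
```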
### Force Asynchronous in SDK Mode

For high-volume SDK operations that can tolerate eventual consistency:

```java
DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .operationMode(DataHubClientConfigV2.OperationMode.SDK)
    .asyncIngest(true) // Override: force async despite SDK mode
    .build();

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

dataset.setDescription("User-provided description");
client.entities().upsert(dataset);
// Writes to editableDatasetProperties via Kafka for higher throughput
// Trade immediate consistency for performance
```
Use cases:
- Bulk metadata updates from admin tools (see the sketch below)
- Migration scripts moving large volumes of data
- Performance-critical batch operations
- Load testing and benchmarking
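
As an illustration, a bulk update loop with this configuration queues each write instead of waiting on a per-entity DB commit. The table list below is hypothetical, and the client is the async-configured one from above.

```java
import java.util.List;

// Hypothetical bulk update: each upsert returns as soon as it is queued.
List<String> tableNames = List.of("orders", "customers", "payments"); // ...and thousands more
for (String table : tableNames) {
    Dataset dataset = Dataset.builder()
        .platform("snowflake")
        .name(table)
        .build();
    dataset.setDescription("Bulk-updated by admin tooling");
    client.entities().upsert(dataset);
}
// With async writes there is no per-call confirmation; verify the results
// eventually (UI, logs, or a follow-up read) rather than assuming visibility.
```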
### Decision Guide
| Scenario | Operation Mode | asyncIngest | Result |
|---|---|---|---|
| User edits in web UI | SDK | (default) | Sync writes to editable aspects |
| ETL pipeline ingestion | INGESTION | (default) | Async writes to system aspects |
| Critical data migration | INGESTION | false | Sync writes to system aspects |
| Bulk admin updates | SDK | true | Async writes to editable aspects |
The default behavior is best for 95% of use cases. Set `asyncIngest` explicitly only when you have specific performance or consistency requirements.
## Testing the Connection

Verify connectivity before performing operations:

```java
try {
    boolean connected = client.testConnection();
    if (connected) {
        System.out.println("Connected to DataHub!");
    } else {
        System.err.println("Failed to connect");
    }
} catch (Exception e) {
    System.err.println("Connection error: " + e.getMessage());
}
```
The `testConnection()` method performs a GET request to the `/config` endpoint to verify that the server is reachable.
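
If you need a reachability check without constructing a client at all (for example, in a startup health probe), you can query the same `/config` endpoint directly. A minimal sketch using the JDK's built-in HTTP client (JDK 11+); the URL is a placeholder:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DataHubHealthProbe {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8080/config")) // same endpoint testConnection() uses
            .GET()
            .build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() == 200
            ? "DataHub reachable"
            : "Unexpected status: " + response.statusCode());
    }
}
```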
## Client Lifecycle

### Resource Management

The client implements `AutoCloseable` for automatic resource management:

```java
try (DataHubClientV2 client = DataHubClientV2.builder()
        .server("http://localhost:8080")
        .build()) {
    // Use client
    client.entities().upsert(dataset);
} // Client automatically closed
```
### Manual Closing

If not using try-with-resources, explicitly close the client:

```java
DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .build();

try {
    // Use client
} finally {
    client.close(); // Release HTTP connections
}
```

Why close? Closing the client releases the underlying HTTP connection pool.
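
Because each client owns a connection pool, long-running applications typically build one client, share it across operations, and close it once at shutdown. A minimal sketch of that pattern; the `MetadataService` wrapper is hypothetical:

```java
// Hypothetical application-scoped holder: one shared client, closed at shutdown.
public final class MetadataService implements AutoCloseable {
    private final DataHubClientV2 client;

    public MetadataService(String server, String token) {
        this.client = DataHubClientV2.builder()
            .server(server)
            .token(token)
            .build();
    }

    public void describe(Dataset dataset, String description) {
        dataset.setDescription(description);
        client.entities().upsert(dataset); // Reuses pooled connections
    }

    @Override
    public void close() {
        client.close(); // Release the pool once, at application shutdown
    }
}
```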
## Advanced Configuration

### Complete Configuration Example

```java
import datahub.client.v2.DataHubClientV2;
import datahub.client.v2.config.DataHubClientConfigV2;

DataHubClientV2 client = DataHubClientV2.builder()
    // Server configuration
    .server("https://your-instance.acryl.io")
    .token("your-personal-access-token")

    // Timeout configuration
    .timeoutMs(30000) // 30 seconds

    // Retry configuration
    .maxRetries(5)

    // Operation mode (SDK or INGESTION)
    .operationMode(DataHubClientConfigV2.OperationMode.SDK)

    // Async mode control (optional - overrides mode-based default)
    // .asyncIngest(false) // Explicit control: true=async, false=sync

    // SSL configuration (testing only!)
    .disableSslVerification(false)
    .build();
```
### Accessing the Underlying RestEmitter

For advanced use cases, access the low-level REST emitter:

```java
RestEmitter emitter = client.getEmitter();
// Direct access to emission methods
```

Note: Most users should use the high-level `client.entities()` API instead.
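
For reference, the existing Java emitter library sends raw Metadata Change Proposals via `emit`. A sketch of direct emission, assuming `getEmitter()` returns that library's `RestEmitter`:

```java
import com.linkedin.dataset.DatasetProperties;
import datahub.client.MetadataWriteResponse;
import datahub.event.MetadataChangeProposalWrapper;
import java.util.concurrent.Future;

// Build a raw Metadata Change Proposal for a single aspect.
MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
    .entityType("dataset")
    .entityUrn("urn:li:dataset:(urn:li:dataPlatform:snowflake,my_table,PROD)")
    .upsert()
    .aspect(new DatasetProperties().setDescription("Emitted directly"))
    .build();

// Emit and wait for the write acknowledgement.
Future<MetadataWriteResponse> future = emitter.emit(mcpw, null);
MetadataWriteResponse response = future.get();
System.out.println("Success: " + response.isSuccess());
```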
## Entity Operations

Once configured, use the client to perform entity operations:

### CRUD Operations

```java
// Create/Update (upsert)
client.entities().upsert(dataset);

// Update with patches
client.entities().update(dataset);

// Read
Dataset loaded = client.entities().get(datasetUrn);
```
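
The `get` call above takes an entity URN. Assuming your build can use the URN helpers from the existing Java client (`com.linkedin.common.urn`) and that `get` accepts them, one way to construct it:

```java
import com.linkedin.common.urn.DatasetUrn;

// Format: urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
// createFromString throws URISyntaxException on malformed input.
DatasetUrn datasetUrn = DatasetUrn.createFromString(
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,my_table,PROD)");
Dataset loaded = client.entities().get(datasetUrn);
```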
See the Getting Started Guide for comprehensive examples.
## Configuration Best Practices

### Production Deployment

```java
DataHubClientV2 client = DataHubClientV2.builder()
    .server(System.getenv("DATAHUB_SERVER"))
    .token(System.getenv("DATAHUB_TOKEN"))
    .timeoutMs(30000) // Higher timeout for production
    .maxRetries(5) // More retries for reliability
    .operationMode(DataHubClientConfigV2.OperationMode.SDK)
    .disableSslVerification(false) // Always verify SSL!
    .build();
```
### ETL Pipeline

```java
DataHubClientV2 client = DataHubClientV2.builder()
    .server(System.getenv("DATAHUB_SERVER"))
    .token(System.getenv("DATAHUB_TOKEN"))
    .timeoutMs(60000) // Higher timeout for batch jobs
    .maxRetries(3)
    .operationMode(DataHubClientConfigV2.OperationMode.INGESTION) // Async by default
    .build();
```
### Critical Data Migration

For migrations where you need confirmation before proceeding:

```java
DataHubClientV2 client = DataHubClientV2.builder()
    .server(System.getenv("DATAHUB_SERVER"))
    .token(System.getenv("DATAHUB_TOKEN"))
    .timeoutMs(60000)
    .maxRetries(5)
    .operationMode(DataHubClientConfigV2.OperationMode.INGESTION)
    .asyncIngest(false) // Force sync for guaranteed consistency
    .build();
```
### Local Development

```java
DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    // No token needed for local quickstart
    .timeoutMs(10000)
    .build();
```
## Troubleshooting

### Connection Refused

Error: `java.net.ConnectException: Connection refused`

Solutions:

- Verify DataHub server is running
- Check server URL is correct
- Ensure port is accessible (firewall rules)
### Authentication Failed

Error: `401 Unauthorized`

Solutions:

- Verify token is valid and not expired
- Check token has correct permissions
- Ensure token matches the server environment
### Timeout

Error: `java.util.concurrent.TimeoutException`

Solutions:

- Increase the `timeoutMs` configuration
- Check network latency to DataHub server
- Verify server is not overloaded
### SSL Certificate Error

Error: `javax.net.ssl.SSLHandshakeException`

Solutions:

- Ensure server SSL certificate is valid
- Add the CA certificate to the Java truststore (see the sketch below)
- For testing only: use `disableSslVerification(true)`
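
If the server presents a certificate from a private CA, a common fix is to point the JVM at a truststore containing that CA. A minimal sketch using standard JVM system properties, assuming the client uses the JVM's default SSL context; the path and password are placeholders:

```java
// Must run before the first HTTPS connection is made; equivalent to
// passing -Djavax.net.ssl.trustStore=... on the command line.
System.setProperty("javax.net.ssl.trustStore", "/path/to/truststore.jks");
System.setProperty("javax.net.ssl.trustStorePassword", "changeit");

DataHubClientV2 client = DataHubClientV2.builder()
    .server("https://your-instance.acryl.io")
    .build();
```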
## Next Steps
- Entities Overview - Working with different entity types
- Dataset Entity Guide - Comprehensive dataset operations
- Patch Operations - Efficient incremental updates
- Getting Started Guide - Complete walkthrough
## API Reference
For complete API documentation, see: