
# DataHubClientV2 Configuration
The `DataHubClientV2` is the primary entry point for interacting with DataHub using SDK V2. This guide covers client configuration, connection management, and operation modes.
## Creating a Client
### Basic Configuration
The minimal configuration requires only a server URL:
```java
import datahub.client.v2.DataHubClientV2;

DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .build();
```
### With Authentication
For DataHub Cloud or secured instances, provide a personal access token:
```java
DataHubClientV2 client = DataHubClientV2.builder()
.server("https://your-instance.acryl.io")
.token("your-personal-access-token")
.build();
```
> **Getting a Token:** In the DataHub UI, go to **Settings → Access Tokens → Generate Personal Access Token**.
### From Environment Variables
Configure the client using environment variables:
```bash
export DATAHUB_SERVER=http://localhost:8080
export DATAHUB_TOKEN=your-token-here
```
```java
DataHubClientConfigV2 config = DataHubClientConfigV2.fromEnv();
DataHubClientV2 client = new DataHubClientV2(config);
```
**Supported environment variables:**
- `DATAHUB_SERVER` or `DATAHUB_GMS_URL` - Server URL (required)
- `DATAHUB_TOKEN` or `DATAHUB_GMS_TOKEN` - Authentication token (optional)
## Configuration Options
### Timeouts
Configure request timeouts to handle slow networks:
```java
DataHubClientV2 client = DataHubClientV2.builder()
.server("http://localhost:8080")
.timeoutMs(30000) // 30 seconds
.build();
```
**Default:** 10 seconds (10000ms)
### Retries
Configure automatic retries for failed requests:
```java
DataHubClientV2 client = DataHubClientV2.builder()
.server("http://localhost:8080")
.maxRetries(5) // Retry up to 5 times
.build();
```
**Default:** 3 retries
### SSL Certificate Verification
For testing environments, you can disable SSL verification:
```java
DataHubClientV2 client = DataHubClientV2.builder()
.server("https://localhost:8443")
.disableSslVerification(true) // WARNING: Only for testing!
.build();
```
> **Warning:** Never disable SSL verification in production! This makes your connection vulnerable to man-in-the-middle attacks.
## Operation Modes
SDK V2 supports two distinct operation modes that control how metadata is written to DataHub:
### SDK Mode (Default)
**Use for:** Interactive applications, user-initiated metadata edits, real-time UI updates
**Behavior:**
- Writes to **editable aspects** (e.g., `editableDatasetProperties`)
- Uses **synchronous DB writes** for immediate consistency
- Returns only after metadata is committed to database
```java
DataHubClientV2 client = DataHubClientV2.builder()
.server("http://localhost:8080")
.operationMode(DataHubClientConfigV2.OperationMode.SDK) // Default
.build();
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_table")
.build();
dataset.setDescription("User-provided description");
client.entities().upsert(dataset);
// Writes to editableDatasetProperties synchronously
// Metadata immediately visible after return
```
### INGESTION Mode
**Use for:** ETL pipelines, data ingestion jobs, automated metadata collection, batch processing
**Behavior:**
- Writes to **system aspects** (e.g., `datasetProperties`)
- Uses **asynchronous Kafka writes** for high throughput
- Returns immediately after message is queued
```java
DataHubClientV2 client = DataHubClientV2.builder()
.server("http://localhost:8080")
.operationMode(DataHubClientConfigV2.OperationMode.INGESTION)
.build();
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_table")
.build();
dataset.setDescription("Ingested from Snowflake");
client.entities().upsert(dataset);
// Writes to datasetProperties asynchronously via Kafka
// High throughput for batch ingestion
```
### Mode Comparison
| Aspect | SDK Mode | INGESTION Mode |
| ------------------- | --------------------------- | -------------------------------- |
| **Target Aspects** | Editable aspects | System aspects |
| **Write Path** | Synchronous (direct to DB) | Asynchronous (via Kafka) |
| **Consistency** | Immediate (linearizable) | Eventual (async processing) |
| **Throughput** | Lower (waits for DB) | Higher (queued) |
| **Use Case** | User edits via UI/API | Pipeline metadata extraction |
| **Precedence** | Higher (overrides system) | Lower (overridden by user edits) |
| **Example Aspects** | `editableDatasetProperties` | `datasetProperties` |
| **Latency** | ~100-500ms | ~10-50ms (queueing only) |
| **Error Handling** | Immediate feedback | Eventual (check logs) |
**Why two modes?**
- **Clear provenance**: Distinguish human edits from machine-generated metadata
- **Non-destructive updates**: Ingestion can refresh metadata without clobbering user documentation (see the sketch after this list)
- **UI consistency**: DataHub UI shows editable aspects as user overrides
- **Performance optimization**: Async ingestion for high-volume batch writes, sync for interactive edits
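To make the precedence point concrete, here is a minimal sketch using only the builder API shown above: an ingestion pipeline and an interactive edit write a description for the same dataset, and the user edit lands in the editable aspect, which the UI shows in preference to the system value:
```java
// INGESTION mode: writes the system aspect (datasetProperties)
DataHubClientV2 pipelineClient = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .operationMode(DataHubClientConfigV2.OperationMode.INGESTION)
    .build();

// SDK mode: writes the editable aspect (editableDatasetProperties)
DataHubClientV2 uiClient = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .operationMode(DataHubClientConfigV2.OperationMode.SDK)
    .build();

Dataset dataset = Dataset.builder()
    .platform("snowflake")
    .name("my_table")
    .build();

// Pipeline refresh: touches only the system aspect
dataset.setDescription("Ingested from Snowflake");
pipelineClient.entities().upsert(dataset);

// User edit: stored separately, so the next pipeline run won't clobber it
dataset.setDescription("Curated by the data team");
uiClient.entities().upsert(dataset);
```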
## Async Mode Control (The Escape Hatch)
By default, the async mode is automatically inferred from your operation mode:
- SDK mode → synchronous writes (immediate consistency)
- INGESTION mode → asynchronous writes (high throughput)
However, you can explicitly override this behavior using the `asyncIngest` parameter when you need full control:
### Force Synchronous in INGESTION Mode
For pipelines that need immediate consistency guarantees:
```java
DataHubClientV2 client = DataHubClientV2.builder()
.server("http://localhost:8080")
.operationMode(DataHubClientConfigV2.OperationMode.INGESTION)
.asyncIngest(false) // Override: force synchronous despite INGESTION mode
.build();
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_table")
.build();
dataset.setDescription("Ingested description");
client.entities().upsert(dataset);
// Writes to datasetProperties synchronously, waits for DB commit
// Use when you need guaranteed consistency before proceeding
```
**Use cases:**
- Critical ingestion jobs where you must verify writes succeeded
- Sequential processing where each step depends on previous writes
- Testing scenarios requiring deterministic behavior
- Compliance workflows requiring audit trail confirmation
### Force Asynchronous in SDK Mode
For high-volume SDK operations that can tolerate eventual consistency:
```java
DataHubClientV2 client = DataHubClientV2.builder()
.server("http://localhost:8080")
.operationMode(DataHubClientConfigV2.OperationMode.SDK)
.asyncIngest(true) // Override: force async despite SDK mode
.build();
Dataset dataset = Dataset.builder()
.platform("snowflake")
.name("my_table")
.build();
dataset.setDescription("User-provided description");
client.entities().upsert(dataset);
// Writes to editableDatasetProperties via Kafka for higher throughput
// Trade immediate consistency for performance
```
**Use cases:**
- Bulk metadata updates from admin tools
- Migration scripts moving large volumes of data
- Performance-critical batch operations
- Load testing and benchmarking
### Decision Guide
| Scenario | Operation Mode | asyncIngest | Result |
| ----------------------- | -------------- | ----------- | -------------------------------- |
| User edits in web UI | SDK | (default) | Sync writes to editable aspects |
| ETL pipeline ingestion | INGESTION | (default) | Async writes to system aspects |
| Critical data migration | INGESTION | false | Sync writes to system aspects |
| Bulk admin updates | SDK | true | Async writes to editable aspects |
**Default behavior is best for 95% of use cases.** Only use explicit `asyncIngest` when you have specific performance or consistency requirements.
## Testing the Connection
Verify connectivity before performing operations:
```java
try {
    boolean connected = client.testConnection();
    if (connected) {
        System.out.println("Connected to DataHub!");
    } else {
        System.err.println("Failed to connect");
    }
} catch (Exception e) {
    System.err.println("Connection error: " + e.getMessage());
}
```
The `testConnection()` method performs a GET request to the `/config` endpoint to verify that the server is reachable.
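If you need to verify reachability outside the JVM (for example, while debugging network or proxy issues), you can hit the same endpoint directly; this assumes the local quickstart URL:
```bash
curl http://localhost:8080/config
```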
## Client Lifecycle
### Resource Management
The client implements `AutoCloseable` for automatic resource management:
```java
try (DataHubClientV2 client = DataHubClientV2.builder()
.server("http://localhost:8080")
.build()) {
// Use client
client.entities().upsert(dataset);
} // Client automatically closed
```
### Manual Closing
If not using try-with-resources, explicitly close the client:
```java
DataHubClientV2 client = DataHubClientV2.builder()
.server("http://localhost:8080")
.build();
try {
// Use client
} finally {
client.close(); // Release HTTP connections
}
```
**Why close?** Closing the client releases the underlying HTTP connection pool.
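For a long-lived client that is not scoped to a single try block (for example, one shared for the application's lifetime), a JVM shutdown hook is one way to guarantee cleanup. This is a general Java pattern, not an SDK feature:
```java
DataHubClientV2 client = DataHubClientV2.builder()
    .server("http://localhost:8080")
    .build();

// Best-effort release of the HTTP connection pool at JVM exit
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    try {
        client.close();
    } catch (Exception e) {
        // Too late in shutdown to do anything useful
    }
}));
```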
## Advanced Configuration
### Complete Configuration Example
```java
import datahub.client.v2.DataHubClientV2;
import datahub.client.v2.config.DataHubClientConfigV2;

DataHubClientV2 client = DataHubClientV2.builder()
    // Server configuration
    .server("https://your-instance.acryl.io")
    .token("your-personal-access-token")

    // Timeout configuration
    .timeoutMs(30000) // 30 seconds

    // Retry configuration
    .maxRetries(5)

    // Operation mode (SDK or INGESTION)
    .operationMode(DataHubClientConfigV2.OperationMode.SDK)

    // Async mode control (optional - overrides mode-based default)
    // .asyncIngest(false) // Explicit control: true=async, false=sync

    // SSL configuration (testing only!)
    .disableSslVerification(false)
    .build();
```
### Accessing the Underlying RestEmitter
For advanced use cases, access the low-level REST emitter:
```java
RestEmitter emitter = client.getEmitter();
// Direct access to emission methods
```
> **Note:** Most users should use the high-level `client.entities()` API instead.
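As a sketch of what direct emission might look like — assuming this emitter exposes the same `emit()` API as the standard DataHub Java emitter library, which this guide does not confirm — you could send a raw metadata change proposal:
```java
// Assumed imports from the standard DataHub Java emitter library:
// import datahub.event.MetadataChangeProposalWrapper;
// import com.linkedin.dataset.DatasetProperties;

MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
    .entityType("dataset")
    .entityUrn("urn:li:dataset:(urn:li:dataPlatform:snowflake,my_table,PROD)")
    .upsert()
    .aspect(new DatasetProperties().setDescription("Emitted directly"))
    .build();

emitter.emit(mcpw, null).get(); // Block until the server acknowledges the write
```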
## Entity Operations
Once configured, use the client to perform entity operations:
### CRUD Operations
```java
// Create/Update (upsert)
client.entities().upsert(dataset);
// Update with patches
client.entities().update(dataset);
// Read
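// datasetUrn identifies the dataset, e.g. "urn:li:dataset:(urn:li:dataPlatform:snowflake,my_table,PROD)"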
Dataset loaded = client.entities().get(datasetUrn);
```
See the [Getting Started Guide](./getting-started.md) for comprehensive examples.
## Configuration Best Practices
### Production Deployment
```java
DataHubClientV2 client = DataHubClientV2.builder()
.server(System.getenv("DATAHUB_SERVER"))
.token(System.getenv("DATAHUB_TOKEN"))
.timeoutMs(30000) // Higher timeout for production
.maxRetries(5) // More retries for reliability
.operationMode(DataHubClientConfigV2.OperationMode.SDK)
.disableSslVerification(false) // Always verify SSL!
.build();
```
### ETL Pipeline
```java
DataHubClientV2 client = DataHubClientV2.builder()
    .server(System.getenv("DATAHUB_SERVER"))
    .token(System.getenv("DATAHUB_TOKEN"))
    .timeoutMs(60000) // Higher timeout for batch jobs
    .maxRetries(3)
    .operationMode(DataHubClientConfigV2.OperationMode.INGESTION) // Async by default
    .build();
```
### Critical Data Migration
For migrations where you need confirmation before proceeding:
```java
DataHubClientV2 client = DataHubClientV2.builder()
    .server(System.getenv("DATAHUB_SERVER"))
    .token(System.getenv("DATAHUB_TOKEN"))
    .timeoutMs(60000)
    .maxRetries(5)
    .operationMode(DataHubClientConfigV2.OperationMode.INGESTION)
    .asyncIngest(false) // Force sync for guaranteed consistency
    .build();
```
### Local Development
```java
DataHubClientV2 client = DataHubClientV2.builder()
.server("http://localhost:8080")
// No token needed for local quickstart
.timeoutMs(10000)
.build();
```
## Troubleshooting
### Connection Refused
**Error:** `java.net.ConnectException: Connection refused`
**Solutions:**
- Verify DataHub server is running
- Check server URL is correct
- Ensure port is accessible (firewall rules)
### Authentication Failed
**Error:** `401 Unauthorized`
**Solutions:**
- Verify the token is valid and not expired (see the curl check below)
- Check token has correct permissions
- Ensure token matches the server environment
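To rule out SDK configuration issues, you can present the token directly to the same `/config` endpoint that `testConnection()` calls, passing it as a bearer header:
```bash
curl -H "Authorization: Bearer $DATAHUB_TOKEN" https://your-instance.acryl.io/config
```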
### Timeout
**Error:** `java.util.concurrent.TimeoutException`
**Solutions:**
- Increase `timeoutMs` configuration
- Check network latency to DataHub server
- Verify server is not overloaded
### SSL Certificate Error
**Error:** `javax.net.ssl.SSLHandshakeException`
**Solutions:**
- Ensure server SSL certificate is valid
- Add the CA certificate to the Java truststore (see the keytool example below)
- For testing only: use `disableSslVerification(true)`
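A common way to trust a private CA is the JDK's `keytool`. The truststore path and the default `changeit` password vary by installation, and `ca.crt` is a placeholder for your CA certificate file:
```bash
keytool -importcert \
  -alias datahub-ca \
  -file ca.crt \
  -keystore "$JAVA_HOME/lib/security/cacerts" \
  -storepass changeit
```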
## Next Steps
- **[Entities Overview](./entities-overview.md)** - Working with different entity types
- **[Dataset Entity Guide](./dataset-entity.md)** - Comprehensive dataset operations
- **[Patch Operations](./patch-operations.md)** - Efficient incremental updates
- **[Getting Started Guide](./getting-started.md)** - Complete walkthrough
## API Reference
For complete API documentation, see:
- [DataHubClientV2.java](../../datahub-client/src/main/java/datahub/client/v2/DataHubClientV2.java)
- [DataHubClientConfigV2.java](../../datahub-client/src/main/java/datahub/client/v2/config/DataHubClientConfigV2.java)