28 KiB
Monitoring DataHub
Overview
Monitoring DataHub's system components is essential for maintaining operational excellence, troubleshooting performance issues, and ensuring system reliability. This comprehensive guide covers how to implement observability in DataHub through tracing and metrics, and how to extract valuable insights from your running instances.
Why Monitor DataHub?
Effective monitoring enables you to:
- Identify Performance Bottlenecks: Pinpoint slow queries or API endpoints
- Debug Issues Faster: Trace requests across distributed components to locate failures
- Meet SLAs: Track and alert on key performance indicators
Observability Components
DataHub's observability strategy consists of two complementary approaches:
-
Metrics Collection
Purpose: Aggregate statistical data about system behavior over time Technology: Transitioning from DropWizard/JMX to Micrometer
Current State: DropWizard metrics exposed via JMX, collected by Prometheus Future Direction: Native Micrometer integration for Spring-based metrics Compatibility: Prometheus-compatible format with support for other metrics backends
Key Metrics Categories:
- Performance Metrics: Request latency, throughput, error rates
- Resource Metrics: CPU, memory utilization
- Application Metrics: Cache hit rates, queue depths, processing times
- Business Metrics: Entity counts, ingestion rates, search performance
-
Distributed Tracing
Purpose: Track individual requests as they flow through multiple services and components Technology: OpenTelemetry-based instrumentation
- Provides end-to-end visibility of request lifecycles
- Automatically instruments popular libraries (Kafka, JDBC, Elasticsearch)
- Supports multiple backend systems (Jaeger, Zipkin, etc.)
- Enables custom span creation with minimal code changes
Key Benefits:
- Visualize request flow across microservices
- Identify latency hotspots
- Understand service dependencies
- Debug complex distributed transactions
GraphQL Instrumentation (Micrometer)
Overview
DataHub provides comprehensive instrumentation for its GraphQL API through Micrometer metrics, enabling detailed performance monitoring and debugging capabilities. The instrumentation system offers flexible configuration options to balance between observability depth and performance overhead.
Why Path-Level GraphQL Instrumentation Matters
Traditional GraphQL monitoring only tells you "the search query is slow" but not why. Without path-level instrumentation, you're blind to which specific fields are causing performance bottlenecks in complex nested queries.
Real-World Example
Consider this GraphQL query:
query getSearchResults {
search(input: { query: "sales data" }) {
searchResults {
entity {
... on Dataset {
name
owner {
# Path: /search/searchResults/entity/owner
corpUser {
displayName
}
}
lineage {
# Path: /search/searchResults/entity/lineage
upstreamCount
downstreamCount
upstreamEntities {
urn
name
}
}
schemaMetadata {
# Path: /search/searchResults/entity/schemaMetadata
fields {
fieldPath
description
}
}
}
}
}
}
}
What Path-Level Instrumentation Reveals
With path-level metrics, you discover:
/search/searchResults/entity/owner
- 50ms (fast, well-cached)/search/searchResults/entity/lineage
- 2500ms (SLOW! hitting graph database)/search/searchResults/entity/schemaMetadata
- 150ms (acceptable)
Without path metrics: "Search query takes 3 seconds"
With path metrics: "Lineage resolution is the bottleneck"
Key Benefits
1. Surgical Optimization
Instead of guessing, you know exactly which resolver needs optimization. Maybe lineage needs better caching or pagination.
2. Smart Query Patterns
Identify expensive patterns like:
# These paths consistently slow:
/*/lineage/upstreamEntities/*
/*/siblings/*/platform
# Action: Add field-level caching or lazy loading
3. Client-Specific Debugging
Different clients request different fields. Path instrumentation shows:
- Web UI requests are slow (requesting everything)
- API integrations timeout (requesting deep lineage)
4. N+1 Query Detection
Spot resolver patterns that indicate N+1 problems:
/users/0/permissions - 10ms
/users/1/permissions - 10ms
/users/2/permissions - 10ms
... (100 more times)
Configuration Strategy
Start targeted to minimize overhead:
# Focus on known slow operations
fieldLevelOperations: "searchAcrossEntities,getDataset"
# Target expensive resolver paths
fieldLevelPaths: "/**/lineage/**,/**/relationships/**,/**/privileges"
Architecture
The GraphQL instrumentation is implemented through GraphQLTimingInstrumentation
, which extends GraphQL Java's instrumentation framework. It provides:
- Request-level metrics: Overall query performance and error tracking
- Field-level metrics: Detailed timing for individual field resolvers
- Smart filtering: Configurable targeting of specific operations or field paths
- Low overhead: Minimal performance impact through efficient instrumentation
Metrics Collected
Request-Level Metrics
Metric: graphql.request.duration
- Type: Timer with percentiles (p50, p95, p99)
- Tags:
operation
: Operation name (e.g., "getSearchResultsForMultiple")operation.type
: Query, mutation, or subscriptionsuccess
: true/false based on error presencefield.filtering
: Filtering mode applied (DISABLED, ALL_FIELDS, BY_OPERATION, BY_PATH, BY_BOTH)
- Use Case: Monitor overall GraphQL performance, identify slow operations
Metric: graphql.request.errors
- Type: Counter
- Tags:
operation
: Operation nameoperation.type
: Query, mutation, or subscription
- Use Case: Track error rates by operation
Field-Level Metrics
Metric: graphql.field.duration
- Type: Timer with percentiles (p50, p95, p99)
- Tags:
parent.type
: GraphQL parent type (e.g., "Dataset", "User")field
: Field name being resolvedoperation
: Operation name contextsuccess
: true/falsepath
: Field path (optional, controlled byfieldLevelPathEnabled
)
- Use Case: Identify slow field resolvers, optimize data fetching
Metric: graphql.field.errors
- Type: Counter
- Tags: Same as field duration (minus success tag)
- Use Case: Track field-specific error patterns
Metric: graphql.fields.instrumented
- Type: Counter
- Tags:
operation
: Operation namefiltering.mode
: Active filtering mode
- Use Case: Monitor instrumentation coverage and overhead
Configuration Guide
Master Controls
graphQL:
metrics:
# Master switch for all GraphQL metrics
enabled: ${GRAPHQL_METRICS_ENABLED:true}
# Enable field-level resolver metrics
fieldLevelEnabled: ${GRAPHQL_METRICS_FIELD_LEVEL_ENABLED:false}
Selective Field Instrumentation
Field-level metrics can add significant overhead for complex queries. DataHub provides multiple strategies to control which fields are instrumented:
1. Operation-Based Filtering
Target specific GraphQL operations known to be slow or critical:
fieldLevelOperations: "getSearchResultsForMultiple,searchAcrossLineageStructure"
2. Path-Based Filtering
Use path patterns to instrument specific parts of your schema:
fieldLevelPaths: "/search/results/**,/user/*/permissions,/**/lineage/*"
Path Pattern Syntax:
/user
- Exact match for the user field/user/*
- Direct children of user (e.g.,/user/name
,/user/email
)/user/**
- User field and all descendants at any depth/*/comments/*
- Comments field under any parent
3. Combined Filtering
When both operation and path filters are configured, only fields matching BOTH criteria are instrumented:
# Only instrument search results within specific operations
fieldLevelOperations: "searchAcrossEntities"
fieldLevelPaths: "/searchResults/**"
Advanced Options
# Include field paths as metric tags (WARNING: high cardinality risk)
fieldLevelPathEnabled: false
# Include metrics for trivial property access
trivialDataFetchersEnabled: false
Filtering Modes Explained
The instrumentation automatically determines the most efficient filtering mode:
- DISABLED: Field-level metrics completely disabled
- ALL_FIELDS: No filtering, all fields instrumented (highest overhead)
- BY_OPERATION: Only instrument fields within specified operations
- BY_PATH: Only instrument fields matching path patterns
- BY_BOTH: Most restrictive - both operation and path must match
Performance Considerations
Impact Assessment
Field-level instrumentation overhead varies by:
- Query complexity: More fields = more overhead
- Resolver performance: Fast resolvers have higher relative overhead
- Filtering effectiveness: Better targeting = less overhead
Best Practices
-
Start Conservative: Begin with field-level metrics disabled
fieldLevelEnabled: false
-
Target Known Issues: Enable selectively for problematic operations
fieldLevelEnabled: true fieldLevelOperations: "slowSearchQuery,complexLineageQuery"
-
Use Path Patterns Wisely: Focus on expensive resolver paths
fieldLevelPaths: "/search/**,/**/lineage/**"
-
Avoid Path Tags in Production: High cardinality risk
fieldLevelPathEnabled: false # Keep this false
-
Monitor Instrumentation Overhead: Track the
graphql.fields.instrumented
metric
Example Configurations
Development Environment (Full Visibility)
graphQL:
metrics:
enabled: true
fieldLevelEnabled: true
fieldLevelOperations: "" # All operations
fieldLevelPathEnabled: true # Include paths for debugging
trivialDataFetchersEnabled: true
Production - Targeted Monitoring
graphQL:
metrics:
enabled: true
fieldLevelEnabled: true
fieldLevelOperations: "getSearchResultsForMultiple,searchAcrossLineage"
fieldLevelPaths: "/search/results/*,/lineage/upstream/**,/lineage/downstream/**"
fieldLevelPathEnabled: false
trivialDataFetchersEnabled: false
Production - Minimal Overhead
graphQL:
metrics:
enabled: true
fieldLevelEnabled: false # Only request-level metrics
Debugging Slow Queries
When investigating GraphQL performance issues:
- Enable request-level metrics first to identify slow operations
- Temporarily enable field-level metrics for the slow operation:
fieldLevelOperations: "problematicQuery"
- Analyze field duration metrics to find bottlenecks
- Optionally enable path tags (briefly) for precise identification:
fieldLevelPathEnabled: true # Temporary only!
- Optimize identified resolvers and disable detailed instrumentation
Integration with Monitoring Stack
The GraphQL metrics integrate seamlessly with DataHub's monitoring infrastructure:
- Prometheus: Metrics exposed at
/actuator/prometheus
- Grafana: Create dashboards showing:
- Request rates and latencies by operation
- Error rates and types
- Field resolver performance heatmaps
- Top slow operations and fields
Example Prometheus queries:
# Average request duration by operation
rate(graphql_request_duration_seconds_sum[5m])
/ rate(graphql_request_duration_seconds_count[5m])
# Field resolver p99 latency
histogram_quantile(0.99,
rate(graphql_field_duration_seconds_bucket[5m])
)
# Error rate by operation
rate(graphql_request_errors_total[5m])
Cache Monitoring (Micrometer)
Overview
Micrometer provides automatic instrumentation for cache implementations, offering deep insights into cache performance and efficiency. This instrumentation is crucial for DataHub, where caching significantly impacts query performance and system load.
Automatic Cache Metrics
When caches are registered with Micrometer, comprehensive metrics are automatically collected without code changes:
Core Metrics
cache.size
(Gauge) - Current number of entries in the cachecache.gets
(Counter) - Cache access attempts, tagged with:result=hit
- Successful cache hitsresult=miss
- Cache misses requiring backend fetch
cache.puts
(Counter) - Number of entries added to cachecache.evictions
(Counter) - Number of entries evictedcache.eviction.weight
(Counter) - Total weight of evicted entries (for size-based eviction)
Derived Metrics
Calculate key performance indicators using Prometheus queries:
# Cache hit rate (should be >80% for hot caches)
sum(rate(cache_gets_total{result="hit"}[5m])) by (cache) /
sum(rate(cache_gets_total[5m])) by (cache)
# Cache miss rate
1 - (cache_hit_rate)
# Eviction rate (indicates cache pressure)
rate(cache_evictions_total[5m])
DataHub Cache Configuration
DataHub uses multiple cache layers, each automatically instrumented:
1. Entity Client Cache
cache.client.entityClient:
enabled: true
maxBytes: 104857600 # 100MB
entityAspectTTLSeconds:
corpuser:
corpUserInfo: 20 # Short TTL for frequently changing data
corpUserKey: 300 # Longer TTL for stable data
structuredProperty:
propertyDefinition: 300
structuredPropertyKey: 86400 # 1 day for very stable data
2. Usage Statistics Cache
cache.client.usageClient:
enabled: true
maxBytes: 52428800 # 50MB
defaultTTLSeconds: 86400 # 1 day
# Caches expensive usage calculations
3. Search & Lineage Cache
cache.search.lineage:
ttlSeconds: 86400 # 1 day
Monitoring Best Practices
Key Indicators to Watch
-
Hit Rate by Cache Type
# Alert if hit rate drops below 70% cache_hit_rate < 0.7
-
Memory Pressure
# High eviction rate relative to puts rate(cache_evictions_total[5m]) / rate(cache_puts_total[5m]) > 0.1
Thread Pool Executor Monitoring (Micrometer)
Overview
Micrometer automatically instruments Java ThreadPoolExecutor
instances, providing crucial visibility into concurrency
bottlenecks and resource utilization. For DataHub's concurrent operations, this monitoring is essential for maintaining
performance under load.
Automatic Executor Metrics
Pool State Metrics
executor.pool.size
(Gauge) - Current number of threads in poolexecutor.pool.core
(Gauge) - Core (minimum) pool sizeexecutor.pool.max
(Gauge) - Maximum allowed pool sizeexecutor.active
(Gauge) - Threads actively executing tasks
Queue Metrics
executor.queued
(Gauge) - Tasks waiting in queueexecutor.queue.remaining
(Gauge) - Available queue capacity
Performance Metrics
executor.completed
(Counter) - Total completed tasksexecutor.seconds
(Timer) - Task execution time distributionexecutor.rejected
(Counter) - Tasks rejected due to saturation
DataHub Executor Configurations
1. GraphQL Query Executor
graphQL.concurrency:
separateThreadPool: true
corePoolSize: 20 # Base threads
maxPoolSize: 200 # Scale under load
keepAlive: 60 # Seconds before idle thread removal
# Handles complex GraphQL query resolution
2. Batch Processing Executors
entityClient.restli:
get:
batchConcurrency: 2 # Parallel batch processors
batchQueueSize: 500 # Task buffer
batchThreadKeepAlive: 60
ingest:
batchConcurrency: 2
batchQueueSize: 500
3. Search & Analytics Executors
timeseriesAspectService.query:
concurrency: 10 # Parallel query threads
queueSize: 500 # Buffered queries
Critical Monitoring Patterns
Saturation Detection
# Thread pool utilization (>0.8 indicates pressure)
executor_active / executor_pool_size > 0.8
# Queue filling up (>0.7 indicates backpressure)
executor_queued / (executor_queued + executor_queue_remaining) > 0.7
Rejection & Starvation
# Task rejections (should be zero)
rate(executor_rejected_total[1m]) > 0
# Thread starvation (all threads busy for extended period)
avg_over_time(executor_active[5m]) >= executor_pool_core
Performance Analysis
# Average task execution time
rate(executor_seconds_sum[5m]) / rate(executor_seconds_count[5m])
# Task throughput by executor
rate(executor_completed_total[5m])
Tuning Guidelines
Symptoms & Solutions
Symptom | Metric Pattern | Solution |
---|---|---|
High latency | executor_queued rising |
Increase pool size |
Rejections | executor_rejected > 0 |
Increase queue size or pool max |
Memory pressure | Many idle threads | Reduce keepAlive time |
CPU waste | Low executor_active |
Reduce core pool size |
Capacity Planning
- Measure baseline: Monitor under normal load
- Stress test: Identify saturation points
- Set alerts:
- Warning at 70% utilization
- Critical at 90% utilization
- Auto-scale: Consider dynamic pool sizing based on queue depth
Distributed Tracing
Traces let us track the life of a request across multiple components. Each trace is consisted of multiple spans, which are units of work, containing various context about the work being done as well as time taken to finish the work. By looking at the trace, we can more easily identify performance bottlenecks.
We enable tracing by using the OpenTelemetry java instrumentation library. This project provides a Java agent JAR that is attached to java applications. The agent injects bytecode to capture telemetry from popular libraries.
Using the agent we are able to
- Plug and play different tracing tools based on the user's setup: Jaeger, Zipkin, or other tools
- Get traces for Kafka, JDBC, and Elasticsearch without any additional code
- Track traces of any function with a simple
@WithSpan
annotation
You can enable the agent by setting env variable ENABLE_OTEL
to true
for GMS and MAE/MCE consumers. In our
example docker-compose, we export metrics to a local Jaeger
instance by setting env variable OTEL_TRACES_EXPORTER
to jaeger
and OTEL_EXPORTER_JAEGER_ENDPOINT
to http://jaeger-all-in-one:14250
, but you can easily change this behavior by
setting the correct env variables. Refer to
this doc for
all configs.
Once the above is set up, you should be able to see a detailed trace as a request is sent to GMS. We added
the @WithSpan
annotation in various places to make the trace more readable. You should start to see traces in the
tracing collector of choice. Our example docker-compose deploys
an instance of Jaeger with port 16686. The traces should be available at http://localhost:16686.
Configuration Note
We recommend using either grpc
or http/protobuf
, configured using OTEL_EXPORTER_OTLP_PROTOCOL
. Avoid using http
will not work as expected due to the size of
the generated spans.
Micrometer
DataHub is transitioning to Micrometer as its primary metrics framework, representing a significant upgrade in observability capabilities. Micrometer is a vendor-neutral application metrics facade that provides a simple, consistent API for the most popular monitoring systems, allowing you to instrument your JVM-based application code without vendor lock-in.
Why Micrometer?
-
Native Spring Integration
As DataHub uses Spring Boot, Micrometer provides seamless integration with:
- Auto-configuration of common metrics
- Built-in metrics for HTTP requests, JVM, caches, and more
- Spring Boot Actuator endpoints for metrics exposure
- Automatic instrumentation of Spring components
-
Multi-Backend Support
Unlike the legacy DropWizard approach that primarily targets JMX, Micrometer natively supports:
- Prometheus (recommended for cloud-native deployments)
- JMX (for backward compatibility)
- StatsD
- CloudWatch
- Datadog
- New Relic
- And many more...
-
Dimensional Metrics
Micrometer embraces modern dimensional metrics with labels/tags, enabling:
- Rich querying and aggregation capabilities
- Better cardinality control
- More flexible dashboards and alerts
- Natural integration with cloud-native monitoring systems
Micrometer Transition Plan
DataHub is undertaking a strategic transition from DropWizard metrics (exposed via JMX) to Micrometer, a modern vendor-neutral metrics facade. This transition aims to provide better cloud-native monitoring capabilities while maintaining backward compatibility for existing monitoring infrastructure.
Current State
What We Have Now:
- Primary System: DropWizard metrics exposed through JMX
- Collection Method: Prometheus-JMX exporter scrapes JMX metrics
- Dashboards: Grafana dashboards consuming JMX-sourced metrics
- Code Pattern: MetricUtils class for creating counters and timers
- Integration: Basic Spring integration with manual metric creation
Limitations:
- JMX-centric approach limits monitoring backend options
- No unified observability (separate instrumentation for metrics and traces)
- No support for dimensional metrics and tags
- Manual instrumentation required for most components
- Legacy naming conventions without proper tagging
Transition State
What We're Building:
- Primary System: Micrometer with native Prometheus support
- Collection Method: Direct Prometheus scraping via /actuator/prometheus
- Unified Telemetry: Single instrumentation point for both metrics and traces
- Modern Patterns: Dimensional metrics with rich tagging
- Multi-Backend: Support for Prometheus, StatsD, CloudWatch, Datadog, etc.
- Auto-Instrumentation: Automatic metrics for Spring components
Key Decisions and Rationale:
-
Dual Registry Approach
Decision: Run both systems in parallel with tag-based routing
Rationale:
- Zero downtime or disruption
- Gradual migration at component level
- Easy rollback if issues arise
-
Prometheus as Primary Target
Decision: Focus on Prometheus for new metrics
Rationale:
- Industry standard for cloud-native applications
- Rich query language and ecosystem
- Better suited for dimensional metrics
-
Observation API Adoption
Decision: Promote Observation API for new instrumentation
Rationale:
- Single instrumentation for metrics + traces
- Reduced code complexity
- Consistent naming across telemetry types
Future State
Once fully adopted, Micrometer will transform DataHub's observability from a collection of separate tools into a unified platform. This means developers can focus on building features while getting comprehensive telemetry "for free."
Intelligent and Adaptive Monitoring
- Dynamic Instrumentation: Enable detailed metrics for specific entities or operations on-demand without code changes
- Environment-Aware Metrics: Automatically route metrics to Prometheus in Kubernetes, CloudWatch in AWS, or Azure Monitor in Azure
- Built-in SLO Tracking: Define Service Level Objectives declaratively and automatically track error budgets
Developer and Operator Experience
- Adding @Observed to a method automatically generates latency percentiles, error rates, and distributed trace spans
- Every service exposes golden signals (latency, traffic, errors, saturation) out-of-the-box
- Business metrics (entity ingestion rates, search performance) seamlessly correlate with system metrics
- Self-documenting telemetry where metrics, traces, and logs tell a coherent operational story
DropWizard & JMX
We originally decided to use Dropwizard Metrics to export custom metrics to JMX,
and then use Prometheus-JMX exporter to export all JMX metrics to
Prometheus. This allows our code base to be independent of the metrics collection tool, making it easy for people to use
their tool of choice. You can enable the agent by setting env variable ENABLE_PROMETHEUS
to true
for GMS and MAE/MCE
consumers. Refer to this example docker-compose for setting the
variables.
In our example docker-compose, we have configured prometheus to scrape from 4318 ports of each container used by the JMX exporter to export metrics. We also configured grafana to listen to prometheus and create useful dashboards. By default, we provide two dashboards: JVM dashboard and DataHub dashboard.
In the JVM dashboard, you can find detailed charts based on JVM metrics like CPU/memory/disk usage. In the DataHub dashboard, you can find charts to monitor each endpoint and the kafka topics. Using the example implementation, go to http://localhost:3001 to find the grafana dashboards! (Username: admin, PW: admin)
To make it easy to track various metrics within the code base, we created MetricUtils class. This util class creates a central metric registry, sets up the JMX reporter, and provides convenient functions for setting up counters and timers. You can run the following to create a counter and increment.
metricUtils.counter(this.getClass(),"metricName").increment();
You can run the following to time a block of code.
try(Timer.Context ignored=metricUtils.timer(this.getClass(),"timerName").timer()){
...block of code
}
Enable monitoring through docker-compose
We provide some example configuration for enabling monitoring in this directory. Take a look at the docker-compose files, which adds necessary env variables to existing containers, and spawns new containers (Jaeger, Prometheus, Grafana).
You can add in the above docker-compose using the -f <<path-to-compose-file>>
when running docker-compose commands.
For instance,
docker-compose \
-f quickstart/docker-compose.quickstart.yml \
-f monitoring/docker-compose.monitoring.yml \
pull && \
docker-compose -p datahub \
-f quickstart/docker-compose.quickstart.yml \
-f monitoring/docker-compose.monitoring.yml \
up
We set up quickstart.sh, dev.sh, and dev-without-neo4j.sh to add the above docker-compose when MONITORING=true. For
instance MONITORING=true ./docker/quickstart.sh
will add the correct env variables to start collecting traces and
metrics, and also deploy Jaeger, Prometheus, and Grafana. We will soon support this as a flag during quickstart.
Health check endpoint
For monitoring healthiness of your DataHub service, /admin
endpoint can be used.