# Monitoring DataHub

## Overview

Monitoring DataHub's system components is essential for maintaining operational excellence, troubleshooting performance issues, and ensuring system reliability. This comprehensive guide covers how to implement observability in DataHub through tracing and metrics, and how to extract valuable insights from your running instances.

## Why Monitor DataHub?

Effective monitoring enables you to:

- Identify Performance Bottlenecks: Pinpoint slow queries or API endpoints
- Debug Issues Faster: Trace requests across distributed components to locate failures
- Meet SLAs: Track and alert on key performance indicators

## Observability Components

DataHub's observability strategy consists of two complementary approaches:

### 1. Metrics Collection

- **Purpose:** Aggregate statistical data about system behavior over time
- **Technology:** Transitioning from DropWizard/JMX to Micrometer
- **Current State:** DropWizard metrics exposed via JMX, collected by Prometheus
- **Future Direction:** Native Micrometer integration for Spring-based metrics
- **Compatibility:** Prometheus-compatible format with support for other metrics backends

Key Metrics Categories:

- Performance Metrics: Request latency, throughput, error rates
- Resource Metrics: CPU, memory utilization
- Application Metrics: Cache hit rates, queue depths, processing times
- Business Metrics: Entity counts, ingestion rates, search performance

### 2. Distributed Tracing

- **Purpose:** Track individual requests as they flow through multiple services and components
- **Technology:** OpenTelemetry-based instrumentation
  - Provides end-to-end visibility of request lifecycles
  - Automatically instruments popular libraries (Kafka, JDBC, Elasticsearch)
  - Supports multiple backend systems (Jaeger, Zipkin, etc.)
  - Enables custom span creation with minimal code changes

Key Benefits:

- Visualize request flow across microservices
- Identify latency hotspots
- Understand service dependencies
- Debug complex distributed transactions

## GraphQL Instrumentation (Micrometer)

### Overview

DataHub provides comprehensive instrumentation for its GraphQL API through Micrometer metrics, enabling detailed performance monitoring and debugging capabilities. The instrumentation system offers flexible configuration options to balance between observability depth and performance overhead.

### Why Path-Level GraphQL Instrumentation Matters

Traditional GraphQL monitoring only tells you "the search query is slow" but not **why**. Without path-level instrumentation, you're blind to which specific fields are causing performance bottlenecks in complex nested queries.

### Real-World Example

Consider this GraphQL query:

```graphql
query getSearchResults {
  search(input: { query: "sales data" }) {
    searchResults {
      entity {
        ... on Dataset {
          name
          owner {
            # Path: /search/searchResults/entity/owner
            corpUser {
              displayName
            }
          }
          lineage {
            # Path: /search/searchResults/entity/lineage
            upstreamCount
            downstreamCount
            upstreamEntities {
              urn
              name
            }
          }
          schemaMetadata {
            # Path: /search/searchResults/entity/schemaMetadata
            fields {
              fieldPath
              description
            }
          }
        }
      }
    }
  }
}
```

### What Path-Level Instrumentation Reveals

With path-level metrics, you discover:

- `/search/searchResults/entity/owner` - 50ms (fast, well-cached)
- `/search/searchResults/entity/lineage` - 2500ms (SLOW! hitting graph database)
- `/search/searchResults/entity/schemaMetadata` - 150ms (acceptable)

**Without path metrics**: "Search query takes 3 seconds"
**With path metrics**: "Lineage resolution is the bottleneck"

### Key Benefits

#### 1. **Surgical Optimization**

Instead of guessing, you know exactly which resolver needs optimization. Maybe lineage needs better caching or pagination.

#### 2. **Smart Query Patterns**

Identify expensive patterns like:

```yaml
# These paths are consistently slow:
/*/lineage/upstreamEntities/*
/*/siblings/*/platform

# Action: Add field-level caching or lazy loading
```

#### 3. **Client-Specific Debugging**

Different clients request different fields. Path instrumentation shows:

- Web UI requests are slow (requesting everything)
- API integrations time out (requesting deep lineage)

#### 4. **N+1 Query Detection**

Spot resolver patterns that indicate N+1 problems:

```
/users/0/permissions - 10ms
/users/1/permissions - 10ms
/users/2/permissions - 10ms
... (100 more times)
```

### Configuration Strategy

Start targeted to minimize overhead:

```yaml
# Focus on known slow operations
fieldLevelOperations: "searchAcrossEntities,getDataset"

# Target expensive resolver paths
fieldLevelPaths: "/**/lineage/**,/**/relationships/**,/**/privileges"
```

### Architecture

The GraphQL instrumentation is implemented through `GraphQLTimingInstrumentation`, which extends GraphQL Java's instrumentation framework. It provides:

- **Request-level metrics**: Overall query performance and error tracking
- **Field-level metrics**: Detailed timing for individual field resolvers
- **Smart filtering**: Configurable targeting of specific operations or field paths
- **Low overhead**: Minimal performance impact through efficient instrumentation

### Metrics Collected

#### Request-Level Metrics

**Metric: `graphql.request.duration`**

- **Type**: Timer with percentiles (p50, p95, p99)
- **Tags**:
  - `operation`: Operation name (e.g., "getSearchResultsForMultiple")
  - `operation.type`: Query, mutation, or subscription
  - `success`: true/false based on error presence
  - `field.filtering`: Filtering mode applied (DISABLED, ALL_FIELDS, BY_OPERATION, BY_PATH, BY_BOTH)
- **Use Case**: Monitor overall GraphQL performance, identify slow operations

**Metric: `graphql.request.errors`**

- **Type**: Counter
- **Tags**:
  - `operation`: Operation name
  - `operation.type`: Query, mutation, or subscription
- **Use Case**: Track error rates by operation

#### Field-Level Metrics

**Metric: `graphql.field.duration`**

- **Type**: Timer with percentiles (p50, p95, p99)
- **Tags**:
  - `parent.type`: GraphQL parent type (e.g., "Dataset", "User")
  - `field`: Field name being resolved
  - `operation`: Operation name context
  - `success`: true/false
  - `path`: Field path (optional, controlled by `fieldLevelPathEnabled`)
- **Use Case**: Identify slow field resolvers, optimize data fetching

**Metric: `graphql.field.errors`**

- **Type**: Counter
- **Tags**: Same as field duration (minus success tag)
- **Use Case**: Track field-specific error patterns

**Metric: `graphql.fields.instrumented`**

- **Type**: Counter
- **Tags**:
  - `operation`: Operation name
  - `filtering.mode`: Active filtering mode
- **Use Case**: Monitor instrumentation coverage and overhead
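As a concrete reference for how a percentile-publishing timer with tags like these is expressed in Micrometer, here is a minimal, self-contained sketch. The metric and tag names mirror the table above, but the registration code itself is illustrative and not DataHub's actual implementation.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.time.Duration;

public class GraphQLRequestTimerExample {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // Timer named and tagged like the documented request metric.
        Timer requestTimer = Timer.builder("graphql.request.duration")
                .description("GraphQL request latency")
                .tag("operation", "getSearchResultsForMultiple")
                .tag("operation.type", "query")
                .tag("success", "true")
                .tag("field.filtering", "BY_OPERATION")
                .publishPercentiles(0.5, 0.95, 0.99) // p50, p95, p99
                .register(registry);

        // Record a simulated request duration.
        requestTimer.record(Duration.ofMillis(42));

        System.out.println(requestTimer.takeSnapshot());
    }
}
```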
### Configuration Guide

#### Master Controls

```yaml
graphQL:
  metrics:
    # Master switch for all GraphQL metrics
    enabled: ${GRAPHQL_METRICS_ENABLED:true}

    # Enable field-level resolver metrics
    fieldLevelEnabled: ${GRAPHQL_METRICS_FIELD_LEVEL_ENABLED:false}
```

#### Selective Field Instrumentation

Field-level metrics can add significant overhead for complex queries. DataHub provides multiple strategies to control which fields are instrumented:

##### 1. **Operation-Based Filtering**

Target specific GraphQL operations known to be slow or critical:

```yaml
fieldLevelOperations: "getSearchResultsForMultiple,searchAcrossLineageStructure"
```

##### 2. **Path-Based Filtering**

Use path patterns to instrument specific parts of your schema:

```yaml
fieldLevelPaths: "/search/results/**,/user/*/permissions,/**/lineage/*"
```

**Path Pattern Syntax** (illustrated by the sketch after this subsection):

- `/user` - Exact match for the user field
- `/user/*` - Direct children of user (e.g., `/user/name`, `/user/email`)
- `/user/**` - User field and all descendants at any depth
- `/*/comments/*` - Comments field under any parent

##### 3. **Combined Filtering**

When both operation and path filters are configured, only fields matching BOTH criteria are instrumented:

```yaml
# Only instrument search results within specific operations
fieldLevelOperations: "searchAcrossEntities"
fieldLevelPaths: "/searchResults/**"
```
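The `*` / `**` semantics above follow the familiar Ant-style path matching convention. Purely as an illustration (DataHub's own matcher implementation is not shown here), the same patterns can be exercised with Spring's `AntPathMatcher`:

```java
import org.springframework.util.AntPathMatcher;

public class FieldPathPatternExample {
    public static void main(String[] args) {
        AntPathMatcher matcher = new AntPathMatcher(); // "/" separator by default

        // "*" matches exactly one path segment, "**" matches any depth.
        System.out.println(matcher.match("/user/*", "/user/name"));             // true
        System.out.println(matcher.match("/user/*", "/user/settings/theme"));   // false
        System.out.println(matcher.match("/user/**", "/user/settings/theme"));  // true
        System.out.println(matcher.match("/*/comments/*", "/post/comments/0")); // true
        System.out.println(matcher.match("/**/lineage/**",
                "/search/searchResults/entity/lineage/upstreamEntities"));      // true
    }
}
```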
#### Advanced Options

```yaml
# Include field paths as metric tags (WARNING: high cardinality risk)
fieldLevelPathEnabled: false

# Include metrics for trivial property access
trivialDataFetchersEnabled: false
```

### Filtering Modes Explained

The instrumentation automatically determines the most efficient filtering mode:

1. **DISABLED**: Field-level metrics completely disabled
2. **ALL_FIELDS**: No filtering, all fields instrumented (highest overhead)
3. **BY_OPERATION**: Only instrument fields within specified operations
4. **BY_PATH**: Only instrument fields matching path patterns
5. **BY_BOTH**: Most restrictive - both operation and path must match

### Performance Considerations

#### Impact Assessment

Field-level instrumentation overhead varies by:

- **Query complexity**: More fields = more overhead
- **Resolver performance**: Fast resolvers have higher relative overhead
- **Filtering effectiveness**: Better targeting = less overhead

#### Best Practices

1. **Start Conservative**: Begin with field-level metrics disabled

   ```yaml
   fieldLevelEnabled: false
   ```

2. **Target Known Issues**: Enable selectively for problematic operations

   ```yaml
   fieldLevelEnabled: true
   fieldLevelOperations: "slowSearchQuery,complexLineageQuery"
   ```

3. **Use Path Patterns Wisely**: Focus on expensive resolver paths

   ```yaml
   fieldLevelPaths: "/search/**,/**/lineage/**"
   ```

4. **Avoid Path Tags in Production**: High cardinality risk

   ```yaml
   fieldLevelPathEnabled: false # Keep this false
   ```

5. **Monitor Instrumentation Overhead**: Track the `graphql.fields.instrumented` metric

### Example Configurations

#### Development Environment (Full Visibility)

```yaml
graphQL:
  metrics:
    enabled: true
    fieldLevelEnabled: true
    fieldLevelOperations: "" # All operations
    fieldLevelPathEnabled: true # Include paths for debugging
    trivialDataFetchersEnabled: true
```

#### Production - Targeted Monitoring

```yaml
graphQL:
  metrics:
    enabled: true
    fieldLevelEnabled: true
    fieldLevelOperations: "getSearchResultsForMultiple,searchAcrossLineage"
    fieldLevelPaths: "/search/results/*,/lineage/upstream/**,/lineage/downstream/**"
    fieldLevelPathEnabled: false
    trivialDataFetchersEnabled: false
```

#### Production - Minimal Overhead

```yaml
graphQL:
  metrics:
    enabled: true
    fieldLevelEnabled: false # Only request-level metrics
```

### Debugging Slow Queries

When investigating GraphQL performance issues:

1. **Enable request-level metrics first** to identify slow operations
2. **Temporarily enable field-level metrics** for the slow operation:

   ```yaml
   fieldLevelOperations: "problematicQuery"
   ```

3. **Analyze field duration metrics** to find bottlenecks
4. **Optionally enable path tags** (briefly) for precise identification:

   ```yaml
   fieldLevelPathEnabled: true # Temporary only!
   ```

5. **Optimize identified resolvers** and disable detailed instrumentation

### Integration with Monitoring Stack

The GraphQL metrics integrate seamlessly with DataHub's monitoring infrastructure:

- **Prometheus**: Metrics exposed at `/actuator/prometheus`
- **Grafana**: Create dashboards showing:
  - Request rates and latencies by operation
  - Error rates and types
  - Field resolver performance heatmaps
  - Top slow operations and fields

Example Prometheus queries:

```promql
# Average request duration by operation
rate(graphql_request_duration_seconds_sum[5m]) / rate(graphql_request_duration_seconds_count[5m])

# Field resolver p99 latency
histogram_quantile(0.99,
  rate(graphql_field_duration_seconds_bucket[5m])
)

# Error rate by operation
rate(graphql_request_errors_total[5m])
```

## Cache Monitoring (Micrometer)

### Overview

Micrometer provides automatic instrumentation for cache implementations, offering deep insights into cache performance and efficiency. This instrumentation is crucial for DataHub, where caching significantly impacts query performance and system load.

### Automatic Cache Metrics

When caches are registered with Micrometer, comprehensive metrics are automatically collected without code changes:

#### Core Metrics

- **`cache.size`** (Gauge) - Current number of entries in the cache
- **`cache.gets`** (Counter) - Cache access attempts, tagged with:
  - `result=hit` - Successful cache hits
  - `result=miss` - Cache misses requiring backend fetch
- **`cache.puts`** (Counter) - Number of entries added to cache
- **`cache.evictions`** (Counter) - Number of entries evicted
- **`cache.eviction.weight`** (Counter) - Total weight of evicted entries (for size-based eviction)

#### Derived Metrics

Calculate key performance indicators using Prometheus queries:

```promql
# Cache hit rate (should be >80% for hot caches)
sum(rate(cache_gets_total{result="hit"}[5m])) by (cache) / sum(rate(cache_gets_total[5m])) by (cache)

# Cache miss rate
1 - (cache_hit_rate)

# Eviction rate (indicates cache pressure)
rate(cache_evictions_total[5m])
```

### DataHub Cache Configuration

DataHub uses multiple cache layers, each automatically instrumented:

#### 1. Entity Client Cache

```yaml
cache.client.entityClient:
  enabled: true
  maxBytes: 104857600 # 100MB
  entityAspectTTLSeconds:
    corpuser:
      corpUserInfo: 20 # Short TTL for frequently changing data
      corpUserKey: 300 # Longer TTL for stable data
    structuredProperty:
      propertyDefinition: 300
      structuredPropertyKey: 86400 # 1 day for very stable data
```

#### 2. Usage Statistics Cache

```yaml
cache.client.usageClient:
  enabled: true
  maxBytes: 52428800 # 50MB
  defaultTTLSeconds: 86400 # 1 day
  # Caches expensive usage calculations
```

#### 3. Search & Lineage Cache

```yaml
cache.search.lineage:
  ttlSeconds: 86400 # 1 day
```

### Monitoring Best Practices

#### Key Indicators to Watch

1. **Hit Rate by Cache Type**

   ```promql
   # Alert if hit rate drops below 70%
   cache_hit_rate < 0.7
   ```

2. **Memory Pressure**

   ```promql
   # High eviction rate relative to puts
   rate(cache_evictions_total[5m]) / rate(cache_puts_total[5m]) > 0.1
   ```
## Thread Pool Executor Monitoring (Micrometer)

### Overview

Micrometer automatically instruments Java `ThreadPoolExecutor` instances, providing crucial visibility into concurrency bottlenecks and resource utilization. For DataHub's concurrent operations, this monitoring is essential for maintaining performance under load.

### Automatic Executor Metrics

#### Pool State Metrics

- **`executor.pool.size`** (Gauge) - Current number of threads in pool
- **`executor.pool.core`** (Gauge) - Core (minimum) pool size
- **`executor.pool.max`** (Gauge) - Maximum allowed pool size
- **`executor.active`** (Gauge) - Threads actively executing tasks

#### Queue Metrics

- **`executor.queued`** (Gauge) - Tasks waiting in queue
- **`executor.queue.remaining`** (Gauge) - Available queue capacity

#### Performance Metrics

- **`executor.completed`** (Counter) - Total completed tasks
- **`executor.seconds`** (Timer) - Task execution time distribution
- **`executor.rejected`** (Counter) - Tasks rejected due to saturation

### DataHub Executor Configurations

#### 1. GraphQL Query Executor

```yaml
graphQL.concurrency:
  separateThreadPool: true
  corePoolSize: 20 # Base threads
  maxPoolSize: 200 # Scale under load
  keepAlive: 60 # Seconds before idle thread removal
  # Handles complex GraphQL query resolution
```

#### 2. Batch Processing Executors

```yaml
entityClient.restli:
  get:
    batchConcurrency: 2 # Parallel batch processors
    batchQueueSize: 500 # Task buffer
    batchThreadKeepAlive: 60
  ingest:
    batchConcurrency: 2
    batchQueueSize: 500
```

#### 3. Search & Analytics Executors

```yaml
timeseriesAspectService.query:
  concurrency: 10 # Parallel query threads
  queueSize: 500 # Buffered queries
```

### Critical Monitoring Patterns

#### Saturation Detection

```promql
# Thread pool utilization (>0.8 indicates pressure)
executor_active / executor_pool_size > 0.8

# Queue filling up (>0.7 indicates backpressure)
executor_queued / (executor_queued + executor_queue_remaining) > 0.7
```

#### Rejection & Starvation

```promql
# Task rejections (should be zero)
rate(executor_rejected_total[1m]) > 0

# Thread starvation (all threads busy for extended period)
avg_over_time(executor_active[5m]) >= executor_pool_core
```

#### Performance Analysis

```promql
# Average task execution time
rate(executor_seconds_sum[5m]) / rate(executor_seconds_count[5m])

# Task throughput by executor
rate(executor_completed_total[5m])
```

### Tuning Guidelines

#### Symptoms & Solutions

| Symptom         | Metric Pattern           | Solution                        |
| --------------- | ------------------------ | ------------------------------- |
| High latency    | `executor_queued` rising | Increase pool size              |
| Rejections      | `executor_rejected` > 0  | Increase queue size or pool max |
| Memory pressure | Many idle threads        | Reduce `keepAlive` time         |
| CPU waste       | Low `executor_active`    | Reduce core pool size           |

#### Capacity Planning

1. **Measure baseline**: Monitor under normal load
2. **Stress test**: Identify saturation points
3. **Set alerts**:
   - Warning at 70% utilization
   - Critical at 90% utilization
4. **Auto-scale**: Consider dynamic pool sizing based on queue depth
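To make the automatic executor instrumentation above concrete, the sketch below wraps a thread pool shaped roughly like the GraphQL executor configuration with Micrometer's `ExecutorServiceMetrics` binder. The executor name and registry setup are illustrative, not DataHub's actual wiring.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.jvm.ExecutorServiceMetrics;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ExecutorMetricsExample {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // A pool roughly matching the GraphQL executor settings above:
        // 20 core threads, 200 max, 60s keep-alive, bounded queue of 500.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                20, 200, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(500));

        // Wrapping the pool publishes executor.pool.size, executor.active,
        // executor.queued, executor.completed, etc. under the given name.
        ExecutorService monitored =
                ExecutorServiceMetrics.monitor(registry, pool, "graphql-query-executor");

        monitored.submit(() -> System.out.println("work item"));
        monitored.shutdown();
    }
}
```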
## Distributed Tracing

Traces let us track the life of a request across multiple components. Each trace consists of multiple spans, which are units of work containing context about the work being done as well as the time taken to finish it. By looking at the trace, we can more easily identify performance bottlenecks.

We enable tracing by using the [OpenTelemetry java instrumentation library](https://github.com/open-telemetry/opentelemetry-java-instrumentation). This project provides a Java agent JAR that is attached to Java applications. The agent injects bytecode to capture telemetry from popular libraries. Using the agent we are able to:

1. Plug and play different tracing tools based on the user's setup: Jaeger, Zipkin, or other tools
2. Get traces for Kafka, JDBC, and Elasticsearch without any additional code
3. Track traces of any function with a simple `@WithSpan` annotation

You can enable the agent by setting the env variable `ENABLE_OTEL` to `true` for GMS and MAE/MCE consumers. In our example [docker-compose](../../docker/monitoring/docker-compose.monitoring.yml), we export traces to a local Jaeger instance by setting the env variable `OTEL_TRACES_EXPORTER` to `jaeger` and `OTEL_EXPORTER_JAEGER_ENDPOINT` to `http://jaeger-all-in-one:14250`, but you can easily change this behavior by setting the correct env variables. Refer to this [doc](https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk-extensions/autoconfigure/README.md) for all configs.

Once the above is set up, you should be able to see a detailed trace as a request is sent to GMS. We added the `@WithSpan` annotation in various places to make the trace more readable. You should start to see traces in the tracing collector of your choice. Our example [docker-compose](../../docker/monitoring/docker-compose.monitoring.yml) deploys an instance of Jaeger with port 16686 exposed. The traces should be available at http://localhost:16686.

### Configuration Note

We recommend using either `grpc` or `http/protobuf`, configured using `OTEL_EXPORTER_OTLP_PROTOCOL`. Avoid using `http`, which will not work as expected due to the size of the generated spans.
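For reference, custom spans with `@WithSpan` look roughly like the sketch below. The class and method here are hypothetical, and the annotation's package varies by OpenTelemetry version (`io.opentelemetry.instrumentation.annotations` in newer releases, `io.opentelemetry.extension.annotations` in older ones).

```java
import io.opentelemetry.instrumentation.annotations.WithSpan;

public class LineageService {

    // With the OpenTelemetry Java agent attached (ENABLE_OTEL=true), this
    // method appears as its own span in the trace with no further code changes.
    @WithSpan
    public int countUpstreamEntities(String datasetUrn) {
        // ... expensive graph traversal would happen here ...
        return 0;
    }
}
```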
## Micrometer

DataHub is transitioning to Micrometer as its primary metrics framework, representing a significant upgrade in observability capabilities. Micrometer is a vendor-neutral application metrics facade that provides a simple, consistent API for the most popular monitoring systems, allowing you to instrument your JVM-based application code without vendor lock-in.

### Why Micrometer?

1. Native Spring Integration

   As DataHub uses Spring Boot, Micrometer provides seamless integration with:

   - Auto-configuration of common metrics
   - Built-in metrics for HTTP requests, JVM, caches, and more
   - Spring Boot Actuator endpoints for metrics exposure
   - Automatic instrumentation of Spring components

2. Multi-Backend Support

   Unlike the legacy DropWizard approach that primarily targets JMX, Micrometer natively supports:

   - Prometheus (recommended for cloud-native deployments)
   - JMX (for backward compatibility)
   - StatsD
   - CloudWatch
   - Datadog
   - New Relic
   - And many more...

3. Dimensional Metrics

   Micrometer embraces modern dimensional metrics with **labels/tags**, enabling:

   - Rich querying and aggregation capabilities
   - Better cardinality control
   - More flexible dashboards and alerts
   - Natural integration with cloud-native monitoring systems

## Micrometer Transition Plan

DataHub is undertaking a strategic transition from DropWizard metrics (exposed via JMX) to Micrometer, a modern vendor-neutral metrics facade. This transition aims to provide better cloud-native monitoring capabilities while maintaining backward compatibility for existing monitoring infrastructure.

### Current State

What We Have Now:

- Primary System: DropWizard metrics exposed through JMX
- Collection Method: Prometheus-JMX exporter scrapes JMX metrics
- Dashboards: Grafana dashboards consuming JMX-sourced metrics
- Code Pattern: MetricUtils class for creating counters and timers
- Integration: Basic Spring integration with manual metric creation
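For contrast with the manual, flat-named metric creation described above, the following sketch shows the dimensional, tag-based style that the Micrometer transition enables. The metric and tag names are purely illustrative and do not correspond to actual DataHub metrics.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class DimensionalMetricExample {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // One logical metric, sliced by tags, instead of one flat JMX name
        // per entity-type/outcome combination. (Hypothetical metric name.)
        Counter ingested = Counter.builder("datahub.entities.ingested")
                .tag("entityType", "dataset")
                .tag("outcome", "success")
                .register(registry);

        ingested.increment();
        System.out.println(ingested.count()); // 1.0
    }
}
```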