| # Monitoring DataHub
 | |
| 
 | |
| ## Overview
 | |
| 
 | |
| Monitoring DataHub's system components is essential for maintaining operational excellence, troubleshooting performance issues,
 | |
| and ensuring system reliability. This comprehensive guide covers how to implement observability in DataHub through tracing and metrics,
 | |
| and how to extract valuable insights from your running instances.
 | |
| 
 | |
| ## Why Monitor DataHub?
 | |
| 
 | |
| Effective monitoring enables you to:
 | |
| 
 | |
| - Identify Performance Bottlenecks: Pinpoint slow queries or API endpoints
 | |
| - Debug Issues Faster: Trace requests across distributed components to locate failures
 | |
| - Meet SLAs: Track and alert on key performance indicators
 | |
| 
 | |
| ## Observability Components
 | |
| 
 | |
| DataHub's observability strategy consists of two complementary approaches:
 | |
| 
 | |
| 1. Metrics Collection
 | |
| 
 | |
|    **Purpose:** Aggregate statistical data about system behavior over time
 | |
|    **Technology:** Transitioning from DropWizard/JMX to Micrometer
 | |
| 
 | |
|    **Current State:** DropWizard metrics exposed via JMX, collected by Prometheus
 | |
|    **Future Direction:** Native Micrometer integration for Spring-based metrics
 | |
|    **Compatibility:** Prometheus-compatible format with support for other metrics backends
 | |
| 
 | |
|    Key Metrics Categories:
 | |
| 
 | |
|    - Performance Metrics: Request latency, throughput, error rates
 | |
|    - Resource Metrics: CPU, memory utilization
 | |
|    - Application Metrics: Cache hit rates, queue depths, processing times
 | |
|    - Business Metrics: Entity counts, ingestion rates, search performance
 | |
| 
 | |
| 2. Distributed Tracing
 | |
| 
 | |
|    **Purpose:** Track individual requests as they flow through multiple services and components
 | |
|    **Technology:** OpenTelemetry-based instrumentation
 | |
| 
 | |
|    - Provides end-to-end visibility of request lifecycles
 | |
|    - Automatically instruments popular libraries (Kafka, JDBC, Elasticsearch)
 | |
|    - Supports multiple backend systems (Jaeger, Zipkin, etc.)
 | |
|    - Enables custom span creation with minimal code changes
 | |
| 
 | |
|    Key Benefits:
 | |
| 
 | |
|    - Visualize request flow across microservices
 | |
|    - Identify latency hotspots
 | |
|    - Understand service dependencies
 | |
|    - Debug complex distributed transactions
 | |
| 
 | |
| ## GraphQL Instrumentation (Micrometer)
 | |
| 
 | |
| ### Overview
 | |
| 
 | |
| DataHub provides comprehensive instrumentation for its GraphQL API through Micrometer metrics, enabling detailed performance
 | |
| monitoring and debugging capabilities. The instrumentation system offers flexible configuration options to balance between
 | |
| observability depth and performance overhead.
 | |
| 
 | |
| ### Why Path-Level GraphQL Instrumentation Matters
 | |
| 
 | |
| Traditional GraphQL monitoring only tells you "the search query is slow" but not **why**. Without path-level instrumentation,
 | |
| you're blind to which specific fields are causing performance bottlenecks in complex nested queries.
 | |
| 
 | |
| ### Real-World Example
 | |
| 
 | |
| Consider this GraphQL query:
 | |
| 
 | |
| ```graphql
 | |
| query getSearchResults {
 | |
|   search(input: { query: "sales data" }) {
 | |
|     searchResults {
 | |
|       entity {
 | |
|         ... on Dataset {
 | |
|           name
 | |
|           owner {
 | |
|             # Path: /search/searchResults/entity/owner
 | |
|             corpUser {
 | |
|               displayName
 | |
|             }
 | |
|           }
 | |
|           lineage {
 | |
|             # Path: /search/searchResults/entity/lineage
 | |
|             upstreamCount
 | |
|             downstreamCount
 | |
|             upstreamEntities {
 | |
|               urn
 | |
|               name
 | |
|             }
 | |
|           }
 | |
|           schemaMetadata {
 | |
|             # Path: /search/searchResults/entity/schemaMetadata
 | |
|             fields {
 | |
|               fieldPath
 | |
|               description
 | |
|             }
 | |
|           }
 | |
|         }
 | |
|       }
 | |
|     }
 | |
|   }
 | |
| }
 | |
| ```
 | |
| 
 | |
| ### What Path-Level Instrumentation Reveals
 | |
| 
 | |
| With path-level metrics, you discover:
 | |
| 
 | |
| - `/search/searchResults/entity/owner` - 50ms (fast, well-cached)
 | |
| - `/search/searchResults/entity/lineage` - 2500ms (SLOW! hitting graph database)
 | |
| - `/search/searchResults/entity/schemaMetadata` - 150ms (acceptable)
 | |
| 
 | |
| **Without path metrics**: "Search query takes 3 seconds"  
 | |
| **With path metrics**: "Lineage resolution is the bottleneck"
 | |
| 
 | |
| ### Key Benefits
 | |
| 
 | |
| #### 1. **Surgical Optimization**
 | |
| 
 | |
| Instead of guessing, you know exactly which resolver needs optimization. Maybe lineage needs better caching or pagination.
 | |
| 
 | |
| #### 2. **Smart Query Patterns**
 | |
| 
 | |
| Identify expensive patterns like:
 | |
| 
 | |
| ```yaml
 | |
# These paths are consistently slow:
 | |
| /*/lineage/upstreamEntities/*
 | |
| /*/siblings/*/platform
 | |
| # Action: Add field-level caching or lazy loading
 | |
| ```
 | |
| 
 | |
| #### 3. **Client-Specific Debugging**
 | |
| 
 | |
| Different clients request different fields. Path instrumentation shows:
 | |
| 
 | |
| - Web UI requests are slow (requesting everything)
 | |
- API integrations time out (requesting deep lineage)
 | |
| 
 | |
| #### 4. **N+1 Query Detection**
 | |
| 
 | |
| Spot resolver patterns that indicate N+1 problems:
 | |
| 
 | |
| ```
 | |
| /users/0/permissions - 10ms
 | |
| /users/1/permissions - 10ms
 | |
| /users/2/permissions - 10ms
 | |
| ... (100 more times)
 | |
| ```
 | |
| 
 | |
| ### Configuration Strategy
 | |
| 
 | |
| Start targeted to minimize overhead:
 | |
| 
 | |
| ```yaml
 | |
| # Focus on known slow operations
 | |
| fieldLevelOperations: "searchAcrossEntities,getDataset"
 | |
| 
 | |
| # Target expensive resolver paths
 | |
| fieldLevelPaths: "/**/lineage/**,/**/relationships/**,/**/privileges"
 | |
| ```
 | |
| 
 | |
| ### Architecture
 | |
| 
 | |
| The GraphQL instrumentation is implemented through `GraphQLTimingInstrumentation`, which extends GraphQL Java's instrumentation framework. It provides:
 | |
| 
 | |
| - **Request-level metrics**: Overall query performance and error tracking
 | |
| - **Field-level metrics**: Detailed timing for individual field resolvers
 | |
| - **Smart filtering**: Configurable targeting of specific operations or field paths
 | |
| - **Low overhead**: Minimal performance impact through efficient instrumentation
 | |
| 
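To make the above concrete, here is a minimal sketch of how request-level timing can be hooked into graphql-java with Micrometer. It is illustrative only (it assumes graphql-java 20+ and an injected `MeterRegistry`); DataHub's actual `GraphQLTimingInstrumentation` additionally handles field-level metrics and the filtering modes described below.

```java
import graphql.ExecutionResult;
import graphql.execution.instrumentation.InstrumentationContext;
import graphql.execution.instrumentation.InstrumentationState;
import graphql.execution.instrumentation.SimpleInstrumentationContext;
import graphql.execution.instrumentation.SimplePerformantInstrumentation;
import graphql.execution.instrumentation.parameters.InstrumentationExecutionParameters;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

/** Illustrative request-level timing; not DataHub's actual implementation. */
public class RequestTimingInstrumentation extends SimplePerformantInstrumentation {

  private final MeterRegistry registry;

  public RequestTimingInstrumentation(MeterRegistry registry) {
    this.registry = registry;
  }

  @Override
  public InstrumentationContext<ExecutionResult> beginExecution(
      InstrumentationExecutionParameters parameters, InstrumentationState state) {
    Timer.Sample sample = Timer.start(registry); // start the clock for this request
    return SimpleInstrumentationContext.whenCompleted(
        (result, throwable) -> {
          boolean success = throwable == null && (result == null || result.getErrors().isEmpty());
          sample.stop(
              Timer.builder("graphql.request.duration")
                  .tag("operation", String.valueOf(parameters.getOperation()))
                  .tag("success", String.valueOf(success))
                  .publishPercentiles(0.5, 0.95, 0.99) // p50, p95, p99
                  .register(registry));
        });
  }
}
```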
 | |
| ### Metrics Collected
 | |
| 
 | |
| #### Request-Level Metrics
 | |
| 
 | |
| **Metric: `graphql.request.duration`**
 | |
| 
 | |
| - **Type**: Timer with percentiles (p50, p95, p99)
 | |
| - **Tags**:
 | |
|   - `operation`: Operation name (e.g., "getSearchResultsForMultiple")
 | |
|   - `operation.type`: Query, mutation, or subscription
 | |
|   - `success`: true/false based on error presence
 | |
|   - `field.filtering`: Filtering mode applied (DISABLED, ALL_FIELDS, BY_OPERATION, BY_PATH, BY_BOTH)
 | |
| - **Use Case**: Monitor overall GraphQL performance, identify slow operations
 | |
| 
 | |
| **Metric: `graphql.request.errors`**
 | |
| 
 | |
| - **Type**: Counter
 | |
| - **Tags**:
 | |
|   - `operation`: Operation name
 | |
|   - `operation.type`: Query, mutation, or subscription
 | |
| - **Use Case**: Track error rates by operation
 | |
| 
 | |
| #### Field-Level Metrics
 | |
| 
 | |
| **Metric: `graphql.field.duration`**
 | |
| 
 | |
| - **Type**: Timer with percentiles (p50, p95, p99)
 | |
| - **Tags**:
 | |
|   - `parent.type`: GraphQL parent type (e.g., "Dataset", "User")
 | |
|   - `field`: Field name being resolved
 | |
|   - `operation`: Operation name context
 | |
|   - `success`: true/false
 | |
|   - `path`: Field path (optional, controlled by `fieldLevelPathEnabled`)
 | |
| - **Use Case**: Identify slow field resolvers, optimize data fetching
 | |
| 
 | |
| **Metric: `graphql.field.errors`**
 | |
| 
 | |
| - **Type**: Counter
 | |
| - **Tags**: Same as field duration (minus success tag)
 | |
| - **Use Case**: Track field-specific error patterns
 | |
| 
 | |
| **Metric: `graphql.fields.instrumented`**
 | |
| 
 | |
| - **Type**: Counter
 | |
| - **Tags**:
 | |
|   - `operation`: Operation name
 | |
|   - `filtering.mode`: Active filtering mode
 | |
| - **Use Case**: Monitor instrumentation coverage and overhead
 | |
| 
 | |
| ### Configuration Guide
 | |
| 
 | |
| #### Master Controls
 | |
| 
 | |
| ```yaml
 | |
| graphQL:
 | |
|   metrics:
 | |
|     # Master switch for all GraphQL metrics
 | |
|     enabled: ${GRAPHQL_METRICS_ENABLED:true}
 | |
| 
 | |
|     # Enable field-level resolver metrics
 | |
|     fieldLevelEnabled: ${GRAPHQL_METRICS_FIELD_LEVEL_ENABLED:false}
 | |
| ```
 | |
| 
 | |
| #### Selective Field Instrumentation
 | |
| 
 | |
| Field-level metrics can add significant overhead for complex queries. DataHub provides multiple strategies to control which fields are instrumented:
 | |
| 
 | |
| ##### 1. **Operation-Based Filtering**
 | |
| 
 | |
| Target specific GraphQL operations known to be slow or critical:
 | |
| 
 | |
| ```yaml
 | |
| fieldLevelOperations: "getSearchResultsForMultiple,searchAcrossLineageStructure"
 | |
| ```
 | |
| 
 | |
| ##### 2. **Path-Based Filtering**
 | |
| 
 | |
| Use path patterns to instrument specific parts of your schema:
 | |
| 
 | |
| ```yaml
 | |
| fieldLevelPaths: "/search/results/**,/user/*/permissions,/**/lineage/*"
 | |
| ```
 | |
| 
 | |
| **Path Pattern Syntax**:
 | |
| 
 | |
| - `/user` - Exact match for the user field
 | |
| - `/user/*` - Direct children of user (e.g., `/user/name`, `/user/email`)
 | |
| - `/user/**` - User field and all descendants at any depth
 | |
| - `/*/comments/*` - Comments field under any parent
 | |
| 
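To illustrate how these glob patterns behave, the snippet below evaluates them with Spring's `AntPathMatcher`. This is only an analogy for the matching semantics, not necessarily the matcher DataHub uses internally.

```java
import org.springframework.util.AntPathMatcher;

public class PathPatternDemo {
  public static void main(String[] args) {
    AntPathMatcher matcher = new AntPathMatcher("/");

    System.out.println(matcher.match("/user/*", "/user/name"));             // true  (direct child)
    System.out.println(matcher.match("/user/*", "/user/roles/admin"));      // false (too deep)
    System.out.println(matcher.match("/user/**", "/user/roles/admin"));     // true  (any depth)
    System.out.println(matcher.match("/*/comments/*", "/post/comments/0")); // true  (any parent)
  }
}
```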
 | |
| ##### 3. **Combined Filtering**
 | |
| 
 | |
| When both operation and path filters are configured, only fields matching BOTH criteria are instrumented:
 | |
| 
 | |
| ```yaml
 | |
| # Only instrument search results within specific operations
 | |
| fieldLevelOperations: "searchAcrossEntities"
 | |
| fieldLevelPaths: "/searchResults/**"
 | |
| ```
 | |
| 
 | |
| #### Advanced Options
 | |
| 
 | |
| ```yaml
 | |
| # Include field paths as metric tags (WARNING: high cardinality risk)
 | |
| fieldLevelPathEnabled: false
 | |
| 
 | |
| # Include metrics for trivial property access
 | |
| trivialDataFetchersEnabled: false
 | |
| ```
 | |
| 
 | |
| ### Filtering Modes Explained
 | |
| 
 | |
| The instrumentation automatically determines the most efficient filtering mode:
 | |
| 
 | |
| 1. **DISABLED**: Field-level metrics completely disabled
 | |
| 2. **ALL_FIELDS**: No filtering, all fields instrumented (highest overhead)
 | |
| 3. **BY_OPERATION**: Only instrument fields within specified operations
 | |
| 4. **BY_PATH**: Only instrument fields matching path patterns
 | |
| 5. **BY_BOTH**: Most restrictive - both operation and path must match
 | |
| 
 | |
| ### Performance Considerations
 | |
| 
 | |
| #### Impact Assessment
 | |
| 
 | |
| Field-level instrumentation overhead varies by:
 | |
| 
 | |
| - **Query complexity**: More fields = more overhead
 | |
| - **Resolver performance**: Fast resolvers have higher relative overhead
 | |
| - **Filtering effectiveness**: Better targeting = less overhead
 | |
| 
 | |
| #### Best Practices
 | |
| 
 | |
| 1. **Start Conservative**: Begin with field-level metrics disabled
 | |
| 
 | |
|    ```yaml
 | |
|    fieldLevelEnabled: false
 | |
|    ```
 | |
| 
 | |
| 2. **Target Known Issues**: Enable selectively for problematic operations
 | |
| 
 | |
|    ```yaml
 | |
|    fieldLevelEnabled: true
 | |
|    fieldLevelOperations: "slowSearchQuery,complexLineageQuery"
 | |
|    ```
 | |
| 
 | |
| 3. **Use Path Patterns Wisely**: Focus on expensive resolver paths
 | |
| 
 | |
|    ```yaml
 | |
|    fieldLevelPaths: "/search/**,/**/lineage/**"
 | |
|    ```
 | |
| 
 | |
| 4. **Avoid Path Tags in Production**: High cardinality risk
 | |
| 
 | |
|    ```yaml
 | |
|    fieldLevelPathEnabled: false # Keep this false
 | |
|    ```
 | |
| 
 | |
| 5. **Monitor Instrumentation Overhead**: Track the `graphql.fields.instrumented` metric
 | |
| 
 | |
| ### Example Configurations
 | |
| 
 | |
| #### Development Environment (Full Visibility)
 | |
| 
 | |
| ```yaml
 | |
| graphQL:
 | |
|   metrics:
 | |
|     enabled: true
 | |
|     fieldLevelEnabled: true
 | |
|     fieldLevelOperations: "" # All operations
 | |
|     fieldLevelPathEnabled: true # Include paths for debugging
 | |
|     trivialDataFetchersEnabled: true
 | |
| ```
 | |
| 
 | |
| #### Production - Targeted Monitoring
 | |
| 
 | |
| ```yaml
 | |
| graphQL:
 | |
|   metrics:
 | |
|     enabled: true
 | |
|     fieldLevelEnabled: true
 | |
|     fieldLevelOperations: "getSearchResultsForMultiple,searchAcrossLineage"
 | |
|     fieldLevelPaths: "/search/results/*,/lineage/upstream/**,/lineage/downstream/**"
 | |
|     fieldLevelPathEnabled: false
 | |
|     trivialDataFetchersEnabled: false
 | |
| ```
 | |
| 
 | |
| #### Production - Minimal Overhead
 | |
| 
 | |
| ```yaml
 | |
| graphQL:
 | |
|   metrics:
 | |
|     enabled: true
 | |
|     fieldLevelEnabled: false # Only request-level metrics
 | |
| ```
 | |
| 
 | |
| ### Debugging Slow Queries
 | |
| 
 | |
| When investigating GraphQL performance issues:
 | |
| 
 | |
| 1. **Enable request-level metrics first** to identify slow operations
 | |
| 2. **Temporarily enable field-level metrics** for the slow operation:
 | |
|    ```yaml
 | |
|    fieldLevelOperations: "problematicQuery"
 | |
|    ```
 | |
| 3. **Analyze field duration metrics** to find bottlenecks
 | |
| 4. **Optionally enable path tags** (briefly) for precise identification:
 | |
|    ```yaml
 | |
|    fieldLevelPathEnabled: true # Temporary only!
 | |
|    ```
 | |
| 5. **Optimize identified resolvers** and disable detailed instrumentation
 | |
| 
 | |
| ### Integration with Monitoring Stack
 | |
| 
 | |
| The GraphQL metrics integrate seamlessly with DataHub's monitoring infrastructure:
 | |
| 
 | |
| - **Prometheus**: Metrics exposed at `/actuator/prometheus`
 | |
| - **Grafana**: Create dashboards showing:
 | |
|   - Request rates and latencies by operation
 | |
|   - Error rates and types
 | |
|   - Field resolver performance heatmaps
 | |
|   - Top slow operations and fields
 | |
| 
 | |
| Example Prometheus queries:
 | |
| 
 | |
| ```promql
 | |
| # Average request duration by operation
 | |
| rate(graphql_request_duration_seconds_sum[5m])
 | |
| / rate(graphql_request_duration_seconds_count[5m])
 | |
| 
 | |
| # Field resolver p99 latency
 | |
| histogram_quantile(0.99,
 | |
|   rate(graphql_field_duration_seconds_bucket[5m])
 | |
| )
 | |
| 
 | |
| # Error rate by operation
 | |
| rate(graphql_request_errors_total[5m])
 | |
| ```
 | |
| 
 | |
| ## Kafka Consumer Instrumentation (Micrometer)
 | |
| 
 | |
| ### Overview
 | |
| 
 | |
| DataHub provides comprehensive instrumentation for Kafka message consumption through Micrometer metrics, enabling
 | |
| real-time monitoring of message queue latency and consumer performance. This instrumentation is critical for
 | |
| maintaining data freshness SLAs and identifying processing bottlenecks across DataHub's event-driven architecture.
 | |
| 
 | |
| ### Why Kafka Queue Time Monitoring Matters
 | |
| 
 | |
Traditional Kafka lag monitoring only tells you "we're behind by 10,000 messages."
 | |
| Without queue time metrics, you can't answer critical questions like "are we meeting our 5-minute data freshness SLA?"
 | |
| or "which consumer groups are experiencing delays?"
 | |
| 
 | |
| #### Real-World Impact
 | |
| 
 | |
| Consider these scenarios:
 | |
| 
 | |
| Variable Production Rate:
 | |
| 
 | |
| - Morning: 100 messages/second → 1000 message lag = 10 seconds old
 | |
| - Evening: 10 messages/second → 1000 message lag = 100 seconds old
 | |
| - Same lag count, vastly different business impact!
 | |
| 
 | |
| Burst Traffic Patterns:
 | |
| 
 | |
| - Bulk ingestion creates 1M message backlog
 | |
| - Are these messages from the last hour (recoverable) or last 24 hours (SLA breach)?
 | |
| 
 | |
| Consumer Group Performance:
 | |
| 
 | |
| - Real-time processors need < 1 minute latency
 | |
| - Analytics consumers can tolerate 1 hour latency
 | |
| - Different groups require different monitoring thresholds
 | |
| 
 | |
| ### Architecture
 | |
| 
 | |
| Kafka queue time instrumentation is implemented across all DataHub consumers:
 | |
| 
 | |
| - MetadataChangeProposals (MCP) Processor - SQL entity updates
 | |
|   - BatchMetadataChangeProposals (MCP) Processor - Bulk SQL entity updates
 | |
| - MetadataChangeLog (MCL) Processor & Hooks - Elasticsearch & downstream aspect operations
 | |
| - DataHubUsageEventsProcessor - Usage analytics events
 | |
| - PlatformEventProcessor - Platform operations & external consumers
 | |
| 
 | |
| Each consumer automatically records queue time metrics using the message's embedded timestamp.
 | |
| 
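Conceptually, each consumer compares the record's embedded timestamp with the current time and records the difference. A minimal sketch of that step (assuming a Micrometer `MeterRegistry`; the real consumer code differs in structure):

```java
import io.micrometer.core.instrument.MeterRegistry;
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class QueueTimeRecorder {

  private final MeterRegistry registry;
  private final String consumerGroupId;

  public QueueTimeRecorder(MeterRegistry registry, String consumerGroupId) {
    this.registry = registry;
    this.consumerGroupId = consumerGroupId;
  }

  /** Record how long the message waited in Kafka before this consumer picked it up. */
  public void recordQueueTime(ConsumerRecord<?, ?> record) {
    long queueTimeMs = System.currentTimeMillis() - record.timestamp();
    registry
        .timer("kafka.message.queue.time",
            "topic", record.topic(),
            "consumer.group", consumerGroupId)
        .record(Duration.ofMillis(Math.max(queueTimeMs, 0)));
  }
}
```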
 | |
| ### Metrics Collected
 | |
| 
 | |
| #### Core Metric
 | |
| 
 | |
| Metric: `kafka.message.queue.time`
 | |
| 
 | |
| - Type: Timer with configurable percentiles and SLO buckets
 | |
| - Unit: Milliseconds
 | |
| - Tags:
 | |
|   - topic: Kafka topic name (e.g., "MetadataChangeProposal_v1")
 | |
|   - consumer.group: Consumer group ID (e.g., "generic-mce-consumer")
 | |
| - Use Case: Monitor end-to-end latency from message production to SQL transaction
 | |
| 
 | |
| #### Statistical Distribution
 | |
| 
 | |
| The timer automatically tracks:
 | |
| 
 | |
| - Count: Total messages processed
 | |
| - Sum: Cumulative queue time
 | |
| - Max: Highest queue time observed
 | |
| - Percentiles: p50, p95, p99, p99.9 (configurable)
 | |
| - SLO Buckets: Percentage of messages meeting latency targets
 | |
| 
 | |
| #### Configuration Guide
 | |
| 
 | |
| Default Configuration:
 | |
| 
 | |
| ```yaml
 | |
| kafka:
 | |
|   consumer:
 | |
|     metrics:
 | |
|       # Percentiles to calculate
 | |
|       percentiles: "0.5,0.95,0.99,0.999"
 | |
| 
 | |
|       # Service Level Objective buckets (seconds)
 | |
|       slo: "300,1800,3600,10800,21600,43200" # 5m,30m,1h,3h,6h,12h
 | |
| 
 | |
|       # Maximum expected queue time
 | |
|       maxExpectedValue: 86400 # 24 hours (seconds)
 | |
| ```
 | |
| 
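For reference, this is roughly how the settings above map onto Micrometer's `Timer` builder. The builder calls are standard Micrometer APIs; treat the snippet as a sketch rather than DataHub's exact wiring.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.time.Duration;

public class QueueTimeTimerFactory {

  /** Build a queue-time timer with percentiles, SLO buckets, and a maximum expected value. */
  public static Timer queueTimeTimer(MeterRegistry registry, String topic, String consumerGroup) {
    return Timer.builder("kafka.message.queue.time")
        .tag("topic", topic)
        .tag("consumer.group", consumerGroup)
        .publishPercentiles(0.5, 0.95, 0.99, 0.999) // percentiles
        .serviceLevelObjectives(                    // slo buckets
            Duration.ofMinutes(5),
            Duration.ofMinutes(30),
            Duration.ofHours(1),
            Duration.ofHours(3),
            Duration.ofHours(6),
            Duration.ofHours(12))
        .maximumExpectedValue(Duration.ofHours(24)) // maxExpectedValue
        .register(registry);
  }
}
```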
 | |
| #### Key Monitoring Patterns
 | |
| 
 | |
| SLA Compliance Monitoring:
 | |
| 
 | |
| ```promql
 | |
| # Percentage of messages processed within 5-minute SLA
 | |
| sum(rate(kafka_message_queue_time_seconds_bucket{le="300"}[5m])) by (topic)
 | |
| / sum(rate(kafka_message_queue_time_seconds_count[5m])) by (topic) * 100
 | |
| ```
 | |
| 
 | |
| Consumer Group Comparison:
 | |
| 
 | |
| ```promql
 | |
| # P99 queue time by consumer group
 | |
| histogram_quantile(0.99,
 | |
|   sum by (consumer_group, le) (
 | |
|     rate(kafka_message_queue_time_seconds_bucket[5m])
 | |
|   )
 | |
| )
 | |
| ```
 | |
| 
 | |
| #### Performance Considerations
 | |
| 
 | |
| Metric Cardinality:
 | |
| 
 | |
| The instrumentation is designed for low cardinality:
 | |
| 
 | |
| - Only two tags: `topic` and `consumer.group`
 | |
| - No partition-level tags (avoiding explosion with high partition counts)
 | |
| - No message-specific tags
 | |
| 
 | |
| Overhead Assessment:
 | |
| 
 | |
| - CPU Impact: Minimal - single timestamp calculation per message
 | |
| - Memory Impact: ~5KB per topic/consumer-group combination
 | |
| - Network Impact: Negligible - metrics aggregated before export
 | |
| 
 | |
| #### Migration from Legacy Metrics
 | |
| 
 | |
| The new Micrometer-based queue time metrics coexist with the legacy DropWizard `kafkaLag` histogram:
 | |
| 
 | |
| - Legacy: `kafkaLag` histogram via JMX
 | |
| - New: `kafka.message.queue.time` timer via Micrometer
 | |
| - Migration: Both metrics collected during transition period
 | |
| - Future: Legacy metrics will be deprecated in favor of Micrometer
 | |
| 
 | |
| The new metrics provide:
 | |
| 
 | |
| - Better percentile accuracy
 | |
| - SLO bucket tracking
 | |
| - Multi-backend support
 | |
| - Dimensional tagging
 | |
| 
 | |
| ## DataHub Request Hook Latency Instrumentation (Micrometer)
 | |
| 
 | |
| ### Overview
 | |
| 
 | |
| DataHub provides comprehensive instrumentation for measuring the latency from initial request submission to post-MCL
 | |
| (Metadata Change Log) hook execution. This metric is crucial for understanding the end-to-end processing time of metadata
 | |
| changes, including both the time spent in Kafka queues and the time taken to process through the system to the final hooks.
 | |
| 
 | |
| ### Why Hook Latency Monitoring Matters
 | |
| 
 | |
| Traditional metrics only show individual component performance. Request hook latency provides the complete picture of how long
 | |
| it takes for a metadata change to be fully processed through DataHub's pipeline:
 | |
| 
 | |
| - Request Submission: When a metadata change request is initially submitted
 | |
| - Queue Time: Time spent in Kafka topics waiting to be consumed
 | |
| - Processing Time: Time for the change to be persisted and processed
 | |
| - Hook Execution: Final execution of MCL hooks
 | |
| 
 | |
| This end-to-end view is essential for:
 | |
| 
 | |
| - Meeting data freshness SLAs
 | |
| - Identifying bottlenecks in the metadata pipeline
 | |
| - Understanding the impact of system load on processing times
 | |
| - Ensuring timely updates to downstream systems
 | |
| 
 | |
| ### Configuration
 | |
| 
 | |
| Hook latency metrics are configured separately from Kafka consumer metrics to allow fine-tuning based on your specific requirements:
 | |
| 
 | |
```yaml
datahub:
  metrics:
    # Measures the time from request to post-MCL hook execution
    hookLatency:
      # Percentiles to calculate for latency distribution
      percentiles: "0.5,0.95,0.99,0.999"

      # Service Level Objective buckets (seconds)
      # These define the latency targets you want to track
      slo: "300,1800,3600,10800,21600,43200" # 5m, 30m, 1h, 3h, 6h, 12h

      # Maximum expected latency (seconds)
      # Values above this are considered outliers
      maxExpectedValue: 86400 # 24 hours
```
 | |
| 
 | |
| ### Metrics Collected
 | |
| 
 | |
| #### Core Metric
 | |
| 
 | |
| Metric: `datahub.request.hook.queue.time`
 | |
| 
 | |
| - Type: Timer with configurable percentiles and SLO buckets
 | |
| - Unit: Milliseconds
 | |
| - Tags:
 | |
|   - `hook`: Name of the MCL hook being executed (e.g., "IngestionSchedulerHook", "SiblingsHook")
 | |
- Use Case: Monitor the complete latency from request submission to hook execution
 | |
| 
 | |
| #### Key Monitoring Patterns
 | |
| 
 | |
| SLA Compliance by Hook:
 | |
| 
 | |
| Monitor which hooks are meeting their latency SLAs:
 | |
| 
 | |
| ```promql
 | |
| # Percentage of requests processed within 5-minute SLA per hook
 | |
| sum(rate(datahub_request_hook_queue_time_seconds_bucket{le="300"}[5m])) by (hook)
 | |
| / sum(rate(datahub_request_hook_queue_time_seconds_count[5m])) by (hook) * 100
 | |
| ```
 | |
| 
 | |
| Hook Performance Comparison:
 | |
| 
 | |
| Identify which hooks have the highest latency:
 | |
| 
 | |
| ```promql
 | |
| # P99 latency by hook
 | |
| histogram_quantile(0.99,
 | |
|   sum by (hook, le) (
 | |
|     rate(datahub_request_hook_queue_time_seconds_bucket[5m])
 | |
|   )
 | |
| )
 | |
| ```
 | |
| 
 | |
| Latency Trends:
 | |
| 
 | |
| Track how hook latency changes over time:
 | |
| 
 | |
| ```promql
 | |
| # Average hook latency trend
 | |
| avg by (hook) (
 | |
|   rate(datahub_request_hook_queue_time_seconds_sum[5m])
 | |
|   / rate(datahub_request_hook_queue_time_seconds_count[5m])
 | |
| )
 | |
| ```
 | |
| 
 | |
| #### Implementation Details
 | |
| 
 | |
| The hook latency metric leverages the trace ID embedded in the system metadata of each request:
 | |
| 
 | |
1. Trace ID Generation: Each request generates a unique trace ID with an embedded timestamp
2. Propagation: The trace ID flows through the entire processing pipeline via system metadata
3. Measurement: When an MCL hook executes, the metric calculates the time difference between the current time and the trace ID timestamp
4. Recording: The latency is recorded as a timer metric with the hook name as a tag (see the sketch below)
 | |
| 
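A simplified sketch of the measurement and recording steps is shown below. The `extractEpochMillis` helper is hypothetical; it stands in for DataHub's internal trace-ID parsing.

```java
import io.micrometer.core.instrument.MeterRegistry;
import java.time.Duration;

public class HookLatencyRecorder {

  private final MeterRegistry registry;

  public HookLatencyRecorder(MeterRegistry registry) {
    this.registry = registry;
  }

  /** Record request-to-hook latency for a single MCL hook execution. */
  public void record(String hookName, String traceId) {
    long requestEpochMillis = extractEpochMillis(traceId); // hypothetical helper
    long latencyMs = System.currentTimeMillis() - requestEpochMillis;
    registry
        .timer("datahub.request.hook.queue.time", "hook", hookName)
        .record(Duration.ofMillis(Math.max(latencyMs, 0)));
  }

  private long extractEpochMillis(String traceId) {
    // Placeholder: DataHub embeds a creation timestamp in its trace IDs; the parsing is internal.
    throw new UnsupportedOperationException("illustrative only");
  }
}
```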
 | |
| #### Performance Considerations
 | |
| 
 | |
| - Overhead: Minimal - only requires trace ID extraction and time calculation per hook execution
 | |
| - Cardinality: Low - only one tag (hook name) with typically < 20 unique values
 | |
| - Accuracy: High - measures actual wall-clock time from request to hook execution
 | |
| 
 | |
| #### Relationship to Kafka Queue Time Metrics
 | |
| 
 | |
| While Kafka queue time metrics (`kafka.message.queue.time`) measure the time messages spend in Kafka topics, request hook
 | |
| latency metrics provide the complete picture:
 | |
| 
 | |
| - Kafka Queue Time: Time from message production to consumption
 | |
| - Hook Latency: Time from initial request to final hook execution
 | |
| 
 | |
| Together, these metrics help identify where delays occur:
 | |
| 
 | |
| - High Kafka queue time but low hook latency: Bottleneck in Kafka consumption
 | |
| - Low Kafka queue time but high hook latency: Bottleneck in processing or persistence
 | |
| - Both high: System-wide performance issues
 | |
| 
 | |
| ## Cache Monitoring (Micrometer)
 | |
| 
 | |
| ### Overview
 | |
| 
 | |
| Micrometer provides automatic instrumentation for cache implementations, offering deep insights into cache performance
 | |
| and efficiency. This instrumentation is crucial for DataHub, where caching significantly impacts query performance and system load.
 | |
| 
 | |
| ### Automatic Cache Metrics
 | |
| 
 | |
| When caches are registered with Micrometer, comprehensive metrics are automatically collected without code changes:
 | |
| 
 | |
| #### Core Metrics
 | |
| 
 | |
| - **`cache.size`** (Gauge) - Current number of entries in the cache
 | |
| - **`cache.gets`** (Counter) - Cache access attempts, tagged with:
 | |
|   - `result=hit` - Successful cache hits
 | |
|   - `result=miss` - Cache misses requiring backend fetch
 | |
| - **`cache.puts`** (Counter) - Number of entries added to cache
 | |
| - **`cache.evictions`** (Counter) - Number of entries evicted
 | |
| - **`cache.eviction.weight`** (Counter) - Total weight of evicted entries (for size-based eviction)
 | |
| 
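For illustration, a Caffeine cache only needs to be bound to the meter registry once for these metrics to appear. The snippet uses Micrometer's standard Caffeine binder; DataHub's caches are wired up through Spring, but the resulting metrics are the same.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.cache.CaffeineCacheMetrics;

public class CacheMetricsExample {

  public static Cache<String, String> instrumentedCache(MeterRegistry registry) {
    Cache<String, String> cache =
        Caffeine.newBuilder()
            .maximumSize(10_000)
            .recordStats() // required for hit/miss/eviction statistics
            .build();
    // Registers cache.size, cache.gets{result=hit|miss}, cache.puts, cache.evictions, ...
    return CaffeineCacheMetrics.monitor(registry, cache, "entityClient");
  }
}
```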
 | |
| #### Derived Metrics
 | |
| 
 | |
| Calculate key performance indicators using Prometheus queries:
 | |
| 
 | |
| ```promql
 | |
| # Cache hit rate (should be >80% for hot caches)
 | |
| sum(rate(cache_gets_total{result="hit"}[5m])) by (cache) /
 | |
| sum(rate(cache_gets_total[5m])) by (cache)
 | |
| 
 | |
| # Cache miss rate
 | |
| 1 - (cache_hit_rate)
 | |
| 
 | |
| # Eviction rate (indicates cache pressure)
 | |
| rate(cache_evictions_total[5m])
 | |
| ```
 | |
| 
 | |
| ### DataHub Cache Configuration
 | |
| 
 | |
| DataHub uses multiple cache layers, each automatically instrumented:
 | |
| 
 | |
| #### 1. Entity Client Cache
 | |
| 
 | |
| ```yaml
 | |
| cache.client.entityClient:
 | |
|   enabled: true
 | |
|   maxBytes: 104857600 # 100MB
 | |
|   entityAspectTTLSeconds:
 | |
|     corpuser:
 | |
|       corpUserInfo: 20 # Short TTL for frequently changing data
 | |
|       corpUserKey: 300 # Longer TTL for stable data
 | |
|     structuredProperty:
 | |
|       propertyDefinition: 300
 | |
|       structuredPropertyKey: 86400 # 1 day for very stable data
 | |
| ```
 | |
| 
 | |
| #### 2. Usage Statistics Cache
 | |
| 
 | |
| ```yaml
 | |
| cache.client.usageClient:
 | |
|   enabled: true
 | |
|   maxBytes: 52428800 # 50MB
 | |
|   defaultTTLSeconds: 86400 # 1 day
 | |
|   # Caches expensive usage calculations
 | |
| ```
 | |
| 
 | |
| #### 3. Search & Lineage Cache
 | |
| 
 | |
| ```yaml
 | |
| cache.search.lineage:
 | |
|   ttlSeconds: 86400 # 1 day
 | |
| ```
 | |
| 
 | |
| ### Monitoring Best Practices
 | |
| 
 | |
| #### Key Indicators to Watch
 | |
| 
 | |
| 1. **Hit Rate by Cache Type**
 | |
| 
 | |
|    ```promql
 | |
|    # Alert if hit rate drops below 70%
 | |
|    cache_hit_rate < 0.7
 | |
|    ```
 | |
| 
 | |
| 2. **Memory Pressure**
 | |
|    ```promql
 | |
|    # High eviction rate relative to puts
 | |
|    rate(cache_evictions_total[5m]) / rate(cache_puts_total[5m]) > 0.1
 | |
|    ```
 | |
| 
 | |
| ## Thread Pool Executor Monitoring (Micrometer)
 | |
| 
 | |
| ### Overview
 | |
| 
 | |
| Micrometer automatically instruments Java `ThreadPoolExecutor` instances, providing crucial visibility into concurrency
 | |
| bottlenecks and resource utilization. For DataHub's concurrent operations, this monitoring is essential for maintaining
 | |
| performance under load.
 | |
| 
 | |
| ### Automatic Executor Metrics
 | |
| 
 | |
| #### Pool State Metrics
 | |
| 
 | |
| - **`executor.pool.size`** (Gauge) - Current number of threads in pool
 | |
| - **`executor.pool.core`** (Gauge) - Core (minimum) pool size
 | |
| - **`executor.pool.max`** (Gauge) - Maximum allowed pool size
 | |
| - **`executor.active`** (Gauge) - Threads actively executing tasks
 | |
| 
 | |
| #### Queue Metrics
 | |
| 
 | |
| - **`executor.queued`** (Gauge) - Tasks waiting in queue
 | |
| - **`executor.queue.remaining`** (Gauge) - Available queue capacity
 | |
| 
 | |
| #### Performance Metrics
 | |
| 
 | |
| - **`executor.completed`** (Counter) - Total completed tasks
 | |
| - **`executor.seconds`** (Timer) - Task execution time distribution
 | |
| - **`executor.rejected`** (Counter) - Tasks rejected due to saturation
 | |
| 
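As an illustration, Micrometer can wrap any `ExecutorService` so that these metrics are emitted automatically. The snippet below uses standard Micrometer APIs; the pool sizes simply mirror the GraphQL executor configuration shown in the next section.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
import io.micrometer.core.instrument.binder.jvm.ExecutorServiceMetrics;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ExecutorMetricsExample {

  public static ExecutorService monitoredPool(MeterRegistry registry) {
    ThreadPoolExecutor pool =
        new ThreadPoolExecutor(20, 200, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(500));
    // Returns a wrapped executor; executor.* gauges, counters, and timers are registered automatically.
    return ExecutorServiceMetrics.monitor(registry, pool, "graphql-query-pool", Tags.empty());
  }
}
```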
 | |
| ### DataHub Executor Configurations
 | |
| 
 | |
| #### 1. GraphQL Query Executor
 | |
| 
 | |
| ```yaml
 | |
| graphQL.concurrency:
 | |
|   separateThreadPool: true
 | |
|   corePoolSize: 20 # Base threads
 | |
|   maxPoolSize: 200 # Scale under load
 | |
|   keepAlive: 60 # Seconds before idle thread removal
 | |
|   # Handles complex GraphQL query resolution
 | |
| ```
 | |
| 
 | |
| #### 2. Batch Processing Executors
 | |
| 
 | |
| ```yaml
 | |
| entityClient.restli:
 | |
|   get:
 | |
|     batchConcurrency: 2 # Parallel batch processors
 | |
|     batchQueueSize: 500 # Task buffer
 | |
|     batchThreadKeepAlive: 60
 | |
|   ingest:
 | |
|     batchConcurrency: 2
 | |
|     batchQueueSize: 500
 | |
| ```
 | |
| 
 | |
| #### 3. Search & Analytics Executors
 | |
| 
 | |
| ```yaml
 | |
| timeseriesAspectService.query:
 | |
|   concurrency: 10 # Parallel query threads
 | |
|   queueSize: 500 # Buffered queries
 | |
| ```
 | |
| 
 | |
| ### Critical Monitoring Patterns
 | |
| 
 | |
| #### Saturation Detection
 | |
| 
 | |
| ```promql
 | |
| # Thread pool utilization (>0.8 indicates pressure)
 | |
| executor_active / executor_pool_size > 0.8
 | |
| 
 | |
| # Queue filling up (>0.7 indicates backpressure)
 | |
| executor_queued / (executor_queued + executor_queue_remaining) > 0.7
 | |
| ```
 | |
| 
 | |
| #### Rejection & Starvation
 | |
| 
 | |
| ```promql
 | |
| # Task rejections (should be zero)
 | |
| rate(executor_rejected_total[1m]) > 0
 | |
| 
 | |
| # Thread starvation (all threads busy for extended period)
 | |
| avg_over_time(executor_active[5m]) >= executor_pool_core
 | |
| ```
 | |
| 
 | |
| #### Performance Analysis
 | |
| 
 | |
| ```promql
 | |
| # Average task execution time
 | |
| rate(executor_seconds_sum[5m]) / rate(executor_seconds_count[5m])
 | |
| 
 | |
| # Task throughput by executor
 | |
| rate(executor_completed_total[5m])
 | |
| ```
 | |
| 
 | |
| ### Tuning Guidelines
 | |
| 
 | |
| #### Symptoms & Solutions
 | |
| 
 | |
| | Symptom         | Metric Pattern           | Solution                        |
 | |
| | --------------- | ------------------------ | ------------------------------- |
 | |
| | High latency    | `executor_queued` rising | Increase pool size              |
 | |
| | Rejections      | `executor_rejected` > 0  | Increase queue size or pool max |
 | |
| | Memory pressure | Many idle threads        | Reduce `keepAlive` time         |
 | |
| | CPU waste       | Low `executor_active`    | Reduce core pool size           |
 | |
| 
 | |
| #### Capacity Planning
 | |
| 
 | |
| 1. **Measure baseline**: Monitor under normal load
 | |
| 2. **Stress test**: Identify saturation points
 | |
| 3. **Set alerts**:
 | |
|    - Warning at 70% utilization
 | |
|    - Critical at 90% utilization
 | |
| 4. **Auto-scale**: Consider dynamic pool sizing based on queue depth
 | |
| 
 | |
| ## Distributed Tracing
 | |
| 
 | |
Traces let us track the life of a request across multiple components. Each trace consists of multiple spans, which
are units of work that carry context about the work being done as well as the time taken to finish it. By
looking at a trace, we can more easily identify performance bottlenecks.
 | |
| 
 | |
| We enable tracing by using the [OpenTelemetry java instrumentation library](https://github.com/open-telemetry/opentelemetry-java-instrumentation).
 | |
This project provides a Java agent JAR that is attached to Java applications. The agent injects bytecode to capture
 | |
| telemetry from popular libraries.
 | |
| 
 | |
Using the agent, we are able to:

1. Plug and play different tracing tools based on the user's setup: Jaeger, Zipkin, or other tools
2. Get traces for Kafka, JDBC, and Elasticsearch without any additional code
3. Track traces of any function with a simple `@WithSpan` annotation (see the example below)
 | |
| 
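For example, a custom span can be added to any method like this. The annotation lives in `io.opentelemetry.instrumentation.annotations` in current OpenTelemetry releases (older agent versions used `io.opentelemetry.extension.annotations`); the class and method names below are illustrative.

```java
import io.opentelemetry.instrumentation.annotations.SpanAttribute;
import io.opentelemetry.instrumentation.annotations.WithSpan;

public class ExampleService {

  // When the OpenTelemetry agent is attached, each call produces a span named
  // "ExampleService.processRequest", including its duration and any thrown exception.
  @WithSpan
  public void processRequest(@SpanAttribute("urn") String urn) {
    // ... business logic ...
  }
}
```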
 | |
| You can enable the agent by setting env variable `ENABLE_OTEL` to `true` for GMS and MAE/MCE consumers. In our
 | |
example [docker-compose](../../docker/monitoring/docker-compose.monitoring.yml), we export traces to a local Jaeger
 | |
| instance by setting env variable `OTEL_TRACES_EXPORTER` to `jaeger`
 | |
| and `OTEL_EXPORTER_JAEGER_ENDPOINT` to `http://jaeger-all-in-one:14250`, but you can easily change this behavior by
 | |
| setting the correct env variables. Refer to
 | |
| this [doc](https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk-extensions/autoconfigure/README.md) for
 | |
| all configs.
 | |
| 
 | |
| Once the above is set up, you should be able to see a detailed trace as a request is sent to GMS. We added
 | |
| the `@WithSpan` annotation in various places to make the trace more readable. You should start to see traces in the
 | |
tracing collector of your choice. Our example [docker-compose](../../docker/monitoring/docker-compose.monitoring.yml) deploys
an instance of Jaeger exposing port 16686. The traces should be available at http://localhost:16686.
 | |
| 
 | |
| ### Configuration Note
 | |
| 
 | |
We recommend using either `grpc` or `http/protobuf`, configured using `OTEL_EXPORTER_OTLP_PROTOCOL`. Avoid `http`, as it will not work as expected due to the size of
the generated spans.
 | |
| 
 | |
| ## Micrometer
 | |
| 
 | |
| DataHub is transitioning to Micrometer as its primary metrics framework, representing a significant upgrade in observability
 | |
| capabilities. Micrometer is a vendor-neutral application metrics facade that provides a simple, consistent API for the most
 | |
| popular monitoring systems, allowing you to instrument your JVM-based application code without vendor lock-in.
 | |
| 
 | |
| ### Why Micrometer?
 | |
| 
 | |
| 1. Native Spring Integration
 | |
| 
 | |
|    As DataHub uses Spring Boot, Micrometer provides seamless integration with:
 | |
| 
 | |
|    - Auto-configuration of common metrics
 | |
|    - Built-in metrics for HTTP requests, JVM, caches, and more
 | |
|    - Spring Boot Actuator endpoints for metrics exposure
 | |
|    - Automatic instrumentation of Spring components
 | |
| 
 | |
| 2. Multi-Backend Support
 | |
| 
 | |
|    Unlike the legacy DropWizard approach that primarily targets JMX, Micrometer natively supports:
 | |
| 
 | |
|    - Prometheus (recommended for cloud-native deployments)
 | |
|    - JMX (for backward compatibility)
 | |
|    - StatsD
 | |
|    - CloudWatch
 | |
|    - Datadog
 | |
|    - New Relic
 | |
|    - And many more...
 | |
| 
 | |
| 3. Dimensional Metrics
 | |
| 
 | |
|    Micrometer embraces modern dimensional metrics with **labels/tags**, enabling:
 | |
| 
 | |
|    - Rich querying and aggregation capabilities
 | |
|    - Better cardinality control
 | |
|    - More flexible dashboards and alerts
 | |
|    - Natural integration with cloud-native monitoring systems
 | |
| 
 | |
| ## Micrometer Transition Plan
 | |
| 
 | |
| DataHub is undertaking a strategic transition from DropWizard metrics (exposed via JMX) to Micrometer, a modern vendor-neutral metrics facade.
 | |
| This transition aims to provide better cloud-native monitoring capabilities while maintaining backward compatibility for existing
 | |
| monitoring infrastructure.
 | |
| 
 | |
| ### Current State
 | |
| 
 | |
| What We Have Now:
 | |
| 
 | |
| - Primary System: DropWizard metrics exposed through JMX
 | |
| - Collection Method: Prometheus-JMX exporter scrapes JMX metrics
 | |
| - Dashboards: Grafana dashboards consuming JMX-sourced metrics
 | |
| - Code Pattern: MetricUtils class for creating counters and timers
 | |
| - Integration: Basic Spring integration with manual metric creation
 | |
| 
 | |
| <p align="center">
 | |
|   <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/0f6ae5ae889ee4e780504ca566670867acf975ff/imgs/advanced/monitoring/monitoring_current.svg"/>
 | |
| </p>
 | |
| 
 | |
| Limitations:
 | |
| 
 | |
| - JMX-centric approach limits monitoring backend options
 | |
| - No unified observability (separate instrumentation for metrics and traces)
 | |
| - No support for dimensional metrics and tags
 | |
| - Manual instrumentation required for most components
 | |
| - Legacy naming conventions without proper tagging
 | |
| 
 | |
| ### Transition State
 | |
| 
 | |
| What We're Building:
 | |
| 
 | |
| - Primary System: Micrometer with native Prometheus support
 | |
| - Collection Method: Direct Prometheus scraping via /actuator/prometheus
 | |
| - Unified Telemetry: Single instrumentation point for both metrics and traces
 | |
| - Modern Patterns: Dimensional metrics with rich tagging
 | |
| - Multi-Backend: Support for Prometheus, StatsD, CloudWatch, Datadog, etc.
 | |
| - Auto-Instrumentation: Automatic metrics for Spring components
 | |
| 
 | |
| <p align="center">
 | |
|   <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/0f6ae5ae889ee4e780504ca566670867acf975ff/imgs/advanced/monitoring/monitoring_transition.svg"/>
 | |
| </p>
 | |
| 
 | |
| Key Decisions and Rationale:
 | |
| 
 | |
| 1. Dual Registry Approach
 | |
| 
 | |
|    **Decision:** Run both systems in parallel with tag-based routing
 | |
| 
 | |
|    **Rationale:**
 | |
| 
 | |
|    - Zero downtime or disruption
 | |
|    - Gradual migration at component level
 | |
|    - Easy rollback if issues arise
 | |
| 
 | |
| 2. Prometheus as Primary Target
 | |
| 
 | |
|    **Decision:** Focus on Prometheus for new metrics
 | |
| 
 | |
|    **Rationale:**
 | |
| 
 | |
|    - Industry standard for cloud-native applications
 | |
|    - Rich query language and ecosystem
 | |
|    - Better suited for dimensional metrics
 | |
| 
 | |
| 3. Observation API Adoption
 | |
| 
 | |
|    **Decision:** Promote Observation API for new instrumentation
 | |
| 
 | |
|    **Rationale:**
 | |
| 
 | |
|    - Single instrumentation for metrics + traces
 | |
|    - Reduced code complexity
 | |
|    - Consistent naming across telemetry types
 | |
| 
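As a sketch of what this looks like in practice (standard Micrometer Observation API; the observation name and key value are illustrative):

```java
import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;

public class ObservationExample {

  public static void ingestBatch(ObservationRegistry observations, Runnable work) {
    // A single instrumentation point: yields timing metrics via the metrics handler
    // and a trace span when a tracing handler is registered on the ObservationRegistry.
    Observation.createNotStarted("datahub.ingest.batch", observations)
        .lowCardinalityKeyValue("mode", "batch")
        .observe(work);
  }
}
```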
 | |
| ### Future State
 | |
| 
 | |
| <p align="center">
 | |
|   <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/0f6ae5ae889ee4e780504ca566670867acf975ff/imgs/advanced/monitoring/monitoring_future.svg"/>
 | |
| </p>
 | |
| 
 | |
| Once fully adopted, Micrometer will transform DataHub's observability from a collection of separate tools into a unified platform.
 | |
| This means developers can focus on building features while getting comprehensive telemetry "for free."
 | |
| 
 | |
Intelligent and Adaptive Monitoring:
 | |
| 
 | |
| - Dynamic Instrumentation: Enable detailed metrics for specific entities or operations on-demand without code changes
 | |
| - Environment-Aware Metrics: Automatically route metrics to Prometheus in Kubernetes, CloudWatch in AWS, or Azure Monitor in Azure
 | |
| - Built-in SLO Tracking: Define Service Level Objectives declaratively and automatically track error budgets
 | |
| 
 | |
Developer and Operator Experience:
 | |
| 
 | |
- Adding @Observed to a method automatically generates latency percentiles, error rates, and distributed trace spans (see the sketch after this list)
 | |
| - Every service exposes golden signals (latency, traffic, errors, saturation) out-of-the-box
 | |
| - Business metrics (entity ingestion rates, search performance) seamlessly correlate with system metrics
 | |
| - Self-documenting telemetry where metrics, traces, and logs tell a coherent operational story
 | |
| 
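For instance, the envisioned `@Observed` usage might look like the sketch below. It relies on Micrometer's standard Observation support and an `ObservedAspect` bean in Spring; the service and metric names are illustrative assumptions, not existing DataHub code.

```java
import io.micrometer.observation.ObservationRegistry;
import io.micrometer.observation.annotation.Observed;
import io.micrometer.observation.aop.ObservedAspect;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.stereotype.Service;

@Configuration
class ObservabilityConfig {
  @Bean
  ObservedAspect observedAspect(ObservationRegistry registry) {
    return new ObservedAspect(registry); // enables @Observed on Spring beans
  }
}

@Service
class ExampleSearchService {
  // Latency percentiles, error counts, and a trace span are generated from one annotation.
  @Observed(name = "datahub.search", contextualName = "search-across-entities")
  public void searchAcrossEntities(String query) {
    // ... search logic ...
  }
}
```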
 | |
| ## DropWizard & JMX
 | |
| 
 | |
| We originally decided to use [Dropwizard Metrics](https://metrics.dropwizard.io/4.2.0/) to export custom metrics to JMX,
 | |
| and then use [Prometheus-JMX exporter](https://github.com/prometheus/jmx_exporter) to export all JMX metrics to
 | |
| Prometheus. This allows our code base to be independent of the metrics collection tool, making it easy for people to use
 | |
| their tool of choice. You can enable the agent by setting env variable `ENABLE_PROMETHEUS` to `true` for GMS and MAE/MCE
 | |
| consumers. Refer to this example [docker-compose](../../docker/monitoring/docker-compose.monitoring.yml) for setting the
 | |
| variables.
 | |
| 
 | |
In our example [docker-compose](../../docker/monitoring/docker-compose.monitoring.yml), we have configured Prometheus to
scrape port 4318 of each container, which is used by the JMX exporter to export metrics. We also configured Grafana to
use Prometheus as a data source and provide useful dashboards. By default, we provide two
dashboards: the [JVM dashboard](https://grafana.com/grafana/dashboards/14845) and the DataHub dashboard.
 | |
| 
 | |
In the JVM dashboard, you can find detailed charts based on JVM metrics like CPU/memory/disk usage. In the DataHub
dashboard, you can find charts to monitor each endpoint and the Kafka topics. Using the example implementation, go
to http://localhost:3001 to find the Grafana dashboards! (Username: admin, PW: admin)
 | |
| 
 | |
To make it easy to track various metrics within the code base, we created the MetricUtils class. This util class creates a
central metric registry, sets up the JMX reporter, and provides convenient functions for setting up counters and timers.
You can run the following to create a counter and increment it.
 | |
| 
 | |
| ```java
 | |
| metricUtils.counter(this.getClass(),"metricName").increment();
 | |
| ```
 | |
| 
 | |
| You can run the following to time a block of code.
 | |
| 
 | |
```java
try (Timer.Context ignored = metricUtils.timer(this.getClass(), "timerName").time()) {
    // ... block of code to be timed
}
```
 | |
| 
 | |
| #### Enable monitoring through docker-compose
 | |
| 
 | |
We provide some example configuration for enabling monitoring in
this [directory](https://github.com/datahub-project/datahub/tree/master/docker/monitoring). Take a look at the docker-compose
files, which add the necessary env variables to existing containers and spawn new containers (Jaeger, Prometheus,
Grafana).
 | |
| 
 | |
You can include the above docker-compose file by passing `-f <<path-to-compose-file>>` when running docker-compose commands.
 | |
| For instance,
 | |
| 
 | |
| ```shell
 | |
| docker-compose \
 | |
|   -f quickstart/docker-compose.quickstart.yml \
 | |
|   -f monitoring/docker-compose.monitoring.yml \
 | |
|   pull && \
 | |
| docker-compose -p datahub \
 | |
|   -f quickstart/docker-compose.quickstart.yml \
 | |
|   -f monitoring/docker-compose.monitoring.yml \
 | |
|   up
 | |
| ```
 | |
| 
 | |
We set up quickstart.sh, dev.sh, and dev-without-neo4j.sh to include the above docker-compose file when MONITORING=true. For
instance, `MONITORING=true ./docker/quickstart.sh` will set the correct env variables to start collecting traces and
metrics, and also deploy Jaeger, Prometheus, and Grafana. We will soon support this as a flag during quickstart.
 | |
| 
 | |
| ## Health check endpoint
 | |
| 
 | |
To monitor the health of your DataHub service, the `/admin` endpoint can be used.
 | 
