# Load Indices: High-Performance Bulk Index Loading
LoadIndices is a high-performance upgrade task designed for bulk loading metadata aspects directly from the database into Elasticsearch/OpenSearch indices. Unlike RestoreIndices, which focuses on correctness and consistency, LoadIndices is optimized for speed and throughput during initial deployments or large-scale data migrations.
## Overview
LoadIndices bypasses the standard event-driven processing pipeline to directly stream data from the `metadata_aspect_v2` table into search indices using optimized bulk operations. This approach provides significant performance improvements for large installations while making specific architectural trade-offs that prioritize speed over consistency.
🚨 **CRITICAL WARNING**: LoadIndices is designed for specific use cases only and should **NEVER** be used in production environments with active concurrent writes, MCL-dependent systems, or real-time consistency requirements. See [Performance Trade-offs & Implications](#performance-trade-offs--implications) for complete details.
### Key Features
- **🚀 High Performance**: Direct streaming from database with optimized bulk operations
- **⚡ Fast Bulk Loading**: Optimized for speed over consistency during initial loads
- **🔧 Refresh Management**: Automatically disables refresh intervals during loading for optimal performance
- **📊 Comprehensive Monitoring**: Real-time progress reporting and performance metrics
- **⚙️ Configurable Isolation**: Utilizes READ_UNCOMMITTED transactions for faster scanning
---
## Performance Trade-offs & Implications
⚠️ **Critical Understanding**: LoadIndices prioritizes **performance over consistency** by making several architectural trade-offs. Understanding these implications is crucial before using LoadIndices in production environments.
### 🚨 Key Trade-offs Made
#### **1. Bypasses the Kafka/MCL Event Pipeline**
- **What**: LoadIndices completely bypasses Kafka MCL (Metadata Change Log) topics that normally propagate all metadata changes
- **Architecture**: `Database → LoadIndices → Elasticsearch` **vs** normal flow of `Database → Kafka MCL → Multiple Consumers → Elasticsearch/Graph/etc`
- **Impact**: **No MCL events published** - downstream systems lose visibility into metadata changes
- **Critical Implications**:
  - **MCL-Dependent Analytics**: Won't have an audit trail of metadata changes
  - **Integrations**: External systems won't be notified of changes
  - **Custom MCL Consumers**: Any custom consumers will miss these events entirely
  - **✅ Graph Service**: WILL be updated (UpdateIndicesService handles graph indices) **⚠️ but only when Elasticsearch is used for graph storage**
#### **2. Breaks the DataHub Event Architecture**
- **What**: Violates DataHub's core design principle that "all metadata changes flow through Kafka MCL"
- **Normal Flow**: `Metadata Change → MCL Event → Kafka → Multiple Consumers → Various Stores`
- **LoadIndices Flow**: `Metadata Change → LoadIndices → Direct ES Write` (**Skips Kafka entirely**)
#### **3. READ_UNCOMMITTED Isolation**
- **What**: Uses `TxIsolation.READ_UNCOMMITTED` for faster database scanning
- **Impact**: May read **uncommitted changes** or **dirty reads** from concurrent transactions
- **Implication**: Data consistency not guaranteed during active writes to database
#### **4. Refresh Interval Manipulation**
- **What**: Automatically disables refresh intervals during bulk operations
- **Impact**: **Recent updates may not be immediately searchable**
- **Implication**: Users won't see real-time updates in search until refresh intervals are restored
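If a run is interrupted before cleanup, the refresh interval can be checked and restored manually with the standard index settings API. The `datahub*` index pattern below is an assumption; adjust it to your index naming convention:

```bash
# Inspect the current refresh interval (shows -1 while refresh is disabled)
curl -s "localhost:9200/datahub*/_settings/index.refresh_interval?pretty"
# Manually restore a 1s refresh interval if LoadIndices exited without cleanup
curl -X PUT "localhost:9200/datahub*/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "1s"}}'
```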
#### **5. No Write Concurrency Controls**
- **What**: No coordination with concurrent Elasticsearch writes from live ingestion
- **Impact**: **Potential conflicts** with active ingestion pipelines
- **Implication**: Concurrent writes may cause data inconsistency or operation failures
### ⚠️ When NOT to Use LoadIndices
**❌ DO NOT use LoadIndices if you have:**
- **Active ingestion pipelines** writing to Elasticsearch simultaneously
- **MCL-dependent systems** that need event notifications
- **Neo4j-based graph storage** (graph updates will be missing)
- **Real-time search requirements** during the loading process
- **Production traffic** that requires immediate search consistency
### ✅ When LoadIndices is Appropriate
**✅ Safe to use LoadIndices when:**
- **Fresh deployment** with empty Elasticsearch cluster
- **Offline migration** with no concurrent users
- **Standalone indexing** without DataHub services running
- **Read-only replica environments** with no active writes
- **Development/testing** environments
- **Disaster recovery** scenarios where faster restoration is prioritized
- **Independent cluster setup** where you need to populate indices before services start
- **Elasticsearch-based graph storage** (graph gets updated automatically)
### 🔒 Safety Requirements
Before using LoadIndices in any environment:
1. **Verify Minimal Infrastructure**:
- **Database**: MySQL/PostgreSQL with `metadata_aspect_v2` table accessible (via Ebean ORM)
- **Elasticsearch**: Running cluster accessible via HTTP/HTTPS
- **DataHub Services**: ✅ **NOT required** - LoadIndices can run independently
- **⚠️ Check Graph Storage**: Verify if using Elasticsearch-based graph storage
- **⚠️ Check Database Type**: Confirm NOT using Cassandra (not supported)
2. **Stop All Ingestion** (if DataHub is running):
```bash
# Disable all Kafka consumers
kubectl scale deployment --replicas=0 datahub-mae-consumer
kubectl scale deployment --replicas=0 datahub-mce-consumer
kubectl scale deployment --replicas=0 datahub-gms
```
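Before proceeding, you can confirm the scale-down took effect (same deployment names as above):

```bash
# All three deployments should report 0/0 ready replicas
kubectl get deployment datahub-mae-consumer datahub-mce-consumer datahub-gms
```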
3. **Check Database Configuration**:
```bash
# Check if using Cassandra (LoadIndices NOT supported)
grep -i cassandra /path/to/datahub/docker/docker-compose.yml
# Verify MySQL/PostgreSQL database is configured
grep -E "mysql|postgres" /path/to/datahub/docker/docker-compose.yml
# ⚠️ If Cassandra detected, LoadIndices is NOT available
# Must use RestoreIndices instead
```
4. **Check Graph Storage Configuration**:
```bash
# Check if using Neo4j (graph updates will be MISSING)
grep -r "neo4j" /path/to/datahub/docker/docker-compose.yml
# Check DataHub configuration for graph service selection
grep -i "graph.*elasticsearch\|neo4j" /path/to/datahub/conf/application.yml
# ⚠️ If Neo4j is detected, LoadIndices will NOT update graph
```
5. **Verify No Concurrent Writes**:
```bash
# Check for active Elasticsearch indexing
curl -s "localhost:9200/_nodes/stats" | grep -o '"index_current":[0-9]*'
# Every value should read "index_current":0
```
6. **Index Clean State**:
```bash
# Inspect refresh activity across nodes (the stats response is single-line JSON, hence grep -o)
curl -s "localhost:9200/_nodes/stats" | grep -o '"refresh":{[^}]*'
```
7. **Coordinate with Operations**:
- **Maintenance window** scheduling
- **User notification** of search unavailability
- **Monitoring** of downstream system dependencies
### 📊 Consistency Guarantees
| Level | LoadIndices | RestoreIndices |
| --------------------------------------- | ------------- | ------------------- |
| **URN-level Ordering** | ✅ Guaranteed | ✅ Guaranteed |
| **Real-time Searchability** | ❌ Delayed | ✅ Immediate |
| **Graph Service Updates (ES-based)** | ✅ Updated | ✅ Updated |
| **Graph Service Updates (Neo4j-based)** | ❌ Missing | ✅ Updated |
| **MCL Event Propagation** | ❌ Bypassed | ✅ Full propagation |
| **Concurrent Write Safety** | ❌ Not safe | ✅ Safe |
### 🔄 Post-Load Operations
#### **1. Restore Normal Operations**
- **Re-enable ingestion** pipelines gradually
- **Monitor Elasticsearch** for conflicts
- **Validate downstream systems** are synchronized
#### **2. Emergency Rollback Plan**
```bash
# If issues arise, prepare rollback:
# 1. Stop LoadIndices immediately
# 2. Restore from backup indices
# 3. Re-run with RestoreIndices for correctness
```
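As a concrete sketch of step 2, assuming you snapshotted the indices beforehand via the standard Elasticsearch snapshot API (the repository and snapshot names below are hypothetical):

```bash
# Close or delete the affected indices first, then restore the pre-load snapshot
curl -X POST "localhost:9200/_snapshot/backup_repo/pre_loadindices/_restore" \
  -H 'Content-Type: application/json' \
  -d '{"indices": "datahub*", "include_global_state": false}'
# Then re-run RestoreIndices for a full-consistency replay
./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices
```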
---
## How LoadIndices Works
LoadIndices operates as an upgrade task that can run **independently** without requiring DataHub services to be running. It consists of two main steps:
1. **BuildIndicesStep**: Creates and configures Elasticsearch indices (creates indices if they don't exist)
2. **LoadIndicesStep**: Streams aspects from database and bulk loads them into indices
### 🔧 Independent Operation Mode
**Key Advantage**: LoadIndices only requires:
- ✅ **MySQL/PostgreSQL** source database (via Ebean ORM)
- ✅ **Elasticsearch/OpenSearch** destination cluster
- ❌ **No DataHub services** (GMS, frontend, etc.) required
- ❌ **Cassandra**: ⚠️ **NOT supported** (Ebean doesn't support Cassandra)
This enables **offline bulk operations** during maintenance windows or initial deployments where DataHub infrastructure is being set up incrementally.
**Index Creation**: The BuildIndicesStep automatically creates all required Elasticsearch indices based on `IndexConvention` patterns, so empty Elasticsearch clusters are fully supported.
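To confirm what BuildIndicesStep created on an otherwise empty cluster, a quick look at the cat API works (exact index names depend on your configured `IndexConvention` prefix):

```bash
# List indices with document counts, sorted by name
curl -s "localhost:9200/_cat/indices?v&s=index"
```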
### Architecture Flow
```mermaid
graph TD
A[LoadIndices Upgrade] --> B[BuildIndicesStep]
B --> C[Create/Configure Indices]
C --> D[LoadIndicesStep]
D --> E[Disable Refresh Intervals]
E --> F[Stream Aspects from DB]
F --> G[Batch Processing]
G --> H[Convert to MCL Events]
H --> I[Bulk Write to ES]
I --> J[Restore Refresh Intervals]
```
### Key Differences from RestoreIndices
| Aspect | RestoreIndices | LoadIndices |
| ---------------------- | ----------------------------- | -------------------------- |
| **Purpose** | Correctness & consistency | Speed & throughput |
| **Processing** | Event-driven via MCL events | Direct bulk operations |
| **Isolation** | READ_COMMITTED | READ_UNCOMMITTED |
| **Refresh Management** | Static configuration | Dynamic disable/restore |
| **Performance Focus** | Accurate replay | Maximal speed |
| **Use Case** | Recovery from inconsistencies | Initial loads & migrations |
---
## Deployment & Execution
### 🚀 Standalone Deployment Advantage
**Key Benefit**: LoadIndices can run with **minimal infrastructure** without requiring DataHub services to be running:
```bash
# Minimal requirements
✅ MySQL/PostgreSQL database (with metadata_aspect_v2 table)
✅ Elasticsearch/OpenSearch cluster
❌ DataHub GMS and consumer services - NOT needed
❌ Kafka cluster - NOT needed
❌ Frontend services - NOT needed
```
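A minimal preflight check for these two dependencies might look like the following (hostnames and credentials are placeholders for your environment):

```bash
# Source database is reachable and the aspect table is populated
mysql -h db-host -u datahub -p -e "SELECT COUNT(*) FROM metadata_aspect_v2;" datahub
# Destination cluster is healthy (green/yellow)
curl -s "http://es-host:9200/_cluster/health?pretty"
```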
### 🔧 Execution Methods
LoadIndices can be executed via:
1. **Gradle Task** (Recommended)
```bash
# From datahub-upgrade directory
./gradlew runLoadIndices
# With custom thread count
./gradlew runLoadIndices -PesThreadCount=6
```
2. **IDE Execution**: Run `UpgradeTask.main()` with LoadIndices arguments
3. **Standalone JAR**: Build and run datahub-upgrade JAR independently
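For option 3, a standalone JAR invocation might look like the sketch below; the jar path and connection environment variables are assumptions, so match them to your build output and deployment:

```bash
# Build the upgrade jar, then run the LoadIndices upgrade directly
./gradlew :datahub-upgrade:build
EBEAN_DATASOURCE_URL="jdbc:mysql://db-host:3306/datahub" \
ELASTICSEARCH_HOST=es-host \
ELASTICSEARCH_PORT=9200 \
java -jar datahub-upgrade/build/libs/datahub-upgrade.jar -u LoadIndices -a batchSize=10000
```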
---
## LoadIndices Configuration Options
### 🔄 Performance & Throttling
| Argument | Description | Default | Example |
| ----------- | ------------------------------------------ | ------------------------------ | ------------------- |
| `batchSize` | Number of aspects per batch for processing | `10000` | `-a batchSize=5000` |
| `limit` | Maximum number of aspects to process | `Integer.MAX_VALUE` (no limit) | `-a limit=50000` |
### 📅 Time Filtering
| Argument | Description | Example |
| -------------- | --------------------------------------------------------------------- | ------------------------------- |
| `gePitEpochMs` | Only process aspects created **after** this timestamp (milliseconds) | `-a gePitEpochMs=1609459200000` |
| `lePitEpochMs` | Only process aspects created **before** this timestamp (milliseconds) | `-a lePitEpochMs=1640995200000` |
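To derive these values, a date can be converted to epoch milliseconds with GNU `date` (BSD/macOS `date` uses different flags):

```bash
# 2021-01-01 00:00:00 UTC in epoch milliseconds
echo $(( $(date -u -d "2021-01-01 00:00:00" +%s) * 1000 ))
# => 1609459200000
```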
### 🔍 Content Filtering
| Argument | Description | Example |
| ------------- | ----------------------------------------------- | ----------------------------------------- |
| `urnLike` | SQL LIKE pattern to filter URNs | `-a urnLike=urn:li:dataset:%` |
| `aspectNames` | Comma-separated list of aspect names to process | `-a aspectNames=ownership,schemaMetadata` |
| `lastUrn` | Resume processing from this URN (inclusive) | `-a lastUrn=urn:li:dataset:my-dataset` |
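These filters compose. For example, a sketch that reindexes only dataset ownership aspects written during 2021, using the same upgrade script shown in the Docker Compose section below:

```bash
./docker/datahub-upgrade/datahub-upgrade.sh -u LoadIndices \
  -a urnLike=urn:li:dataset:% \
  -a aspectNames=ownership \
  -a gePitEpochMs=1609459200000 \
  -a lePitEpochMs=1640995200000
```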
### ⚙️ System Configuration
| Environment Variable | Description | Default | Example |
| ---------------------------- | --------------------------------------- | ----------------------------------- | ------------------------------ |
| `ELASTICSEARCH_THREAD_COUNT` | Number of I/O threads for BulkProcessor | `2` (app config), `4` (Gradle task) | `ELASTICSEARCH_THREAD_COUNT=4` |
| `ES_BULK_ASYNC` | Enable asynchronous bulk operations | `true` | `ES_BULK_ASYNC=true` |
| `ES_BULK_REQUESTS_LIMIT` | Maximum bulk requests per buffer | `10000` | `ES_BULK_REQUESTS_LIMIT=15000` |
| `ES_BULK_FLUSH_PERIOD` | Bulk flush interval in seconds | `300` (5 minutes) | `ES_BULK_FLUSH_PERIOD=300` |
---
## Running LoadIndices
### 🐳 Docker Compose
If you're using Docker Compose with the DataHub source repository:
```bash
# Basic LoadIndices execution
./docker/datahub-upgrade/datahub-upgrade.sh -u LoadIndices
# LoadIndices with performance tuning
./docker/datahub-upgrade/datahub-upgrade.sh -u LoadIndices \
-a batchSize=15000 \
-a limit=100000
```
### 🎯 Gradle Task (Development)
For development and testing environments:
```bash
# Run LoadIndices with default settings
./gradlew :datahub-upgrade:runLoadIndices
# Run with custom thread count and batch size
./gradlew :datahub-upgrade:runLoadIndices \
-PesThreadCount=4 \
-PbatchSize=15000 \
-Plimit=50000
```
The Gradle task supports these parameters:
- `esThreadCount`: Set `ELASTICSEARCH_THREAD_COUNT` (default: `4`)
- `batchSize`: Override batch size (default: `10000`)
- `limit`: Set processing limit
- `urnLike`: Filter by URN pattern
- `aspectNames`: Filter by aspect names
- `lePitEpochMs`: Process records created before this timestamp
- `gePitEpochMs`: Process records created after this timestamp
- `lastUrn`: Resume processing from this URN (inclusive)
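Assuming the remaining parameters follow the same `-P` property pattern, a filtered run might look like:

```bash
./gradlew :datahub-upgrade:runLoadIndices \
  -PurnLike=urn:li:dataset:% \
  -PaspectNames=ownership,status \
  -PlastUrn=urn:li:dataset:my-dataset
```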
### 🐳 Docker Environment Variables
Configure LoadIndices through Docker environment:
```bash
# Target specific entity types
docker run --rm datahub-upgrade \
-u LoadIndices \
-a urnLike=urn:li:dataset:% \
-a batchSize=20000
# Process specific aspects only
docker run --rm datahub-upgrade \
-u LoadIndices \
-a aspectNames=ownership,status,schemaMetadata \
-a batchSize=15000
# Time-based filtering
docker run --rm datahub-upgrade \
-u LoadIndices \
-a gePitEpochMs=1640995200000 \
-a limit=50000
# Resume from a specific URN
docker run --rm datahub-upgrade \
-u LoadIndices \
-a lastUrn=urn:li:dataset:my-dataset \
-a batchSize=10000
```
### 🔄 Resume Functionality
LoadIndices supports resuming from a specific URN when processing is interrupted:
#### **Resume from Last Processed URN**
When LoadIndices runs, it logs the last URN processed in each batch:
```
Batch completed - Last URN processed: urn:li:dataset:my-dataset
Processed 10000 aspects - 150.2 aspects/sec - Last URN: urn:li:dataset:my-dataset
```
To resume from where you left off:
```bash
# Resume from the last URN that was successfully processed
./gradlew :datahub-upgrade:runLoadIndices \
-a lastUrn=urn:li:dataset:my-dataset \
-a batchSize=10000
```
#### **Resume Best Practices**
- **Use the exact URN**: Copy the URN exactly as logged (including any URL encoding)
- **Inclusive processing**: The `lastUrn` parameter processes from the specified URN onwards (inclusive)
- **Monitor progress**: Watch the logs for the "Last URN processed" messages to track progress
- **Batch boundaries**: Resume works at the URN level, not batch level - some aspects may be reprocessed
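If you capture the run's output to a file, the most recent resume point can be recovered with a quick grep (the log file name here is just a placeholder):

```bash
# Capture output, then recover the most recent resume point
./gradlew :datahub-upgrade:runLoadIndices -a batchSize=5000 | tee loadindices.log
grep "Last URN processed" loadindices.log | tail -1
```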
#### **Example Resume Workflow**
```bash
# 1. Start initial processing
./gradlew :datahub-upgrade:runLoadIndices -a batchSize=5000
# 2. If interrupted, check logs for last URN:
# "Batch completed - Last URN processed: urn:li:dataset:my-dataset"
# 3. Resume from that URN
./gradlew :datahub-upgrade:runLoadIndices \
-a lastUrn=urn:li:dataset:my-dataset \
-a batchSize=5000
```
---
## Performance Optimization
### 🚀 Elasticsearch/OpenSearch Configuration
#### Bulk Processing Tuning
```bash
# Optimize bulk settings for LoadIndices
export ES_BULK_REQUESTS_LIMIT=15000
export ES_BULK_FLUSH_PERIOD=10
export ES_BULK_ASYNC=true
export ELASTICSEARCH_THREAD_COUNT=4
```
#### Connection Pool Optimization
LoadIndices automatically configures connection pooling based on thread count:
```groovy
// datahub-upgrade/build.gradle configuration
environment "ELASTICSEARCH_THREAD_COUNT", "4" // Auto-adjusts maxConnectionsPerRoute
```
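When running the upgrade as a standalone JAR instead of via Gradle, the same knob can be set through the environment (jar path as in the earlier sketch; an assumption, not a fixed location):

```bash
# Raise I/O thread count for a larger cluster; connection pool scales with it
ELASTICSEARCH_THREAD_COUNT=8 java -jar datahub-upgrade/build/libs/datahub-upgrade.jar -u LoadIndices
```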
## Comparison with RestoreIndices
Understanding when to use LoadIndices vs RestoreIndices is crucial for optimal performance and data consistency.
### 🎯 Purpose & Design Philosophy
| Aspect | RestoreIndices | LoadIndices |
| --------------------- | ------------------------------ | ------------------------------- |
| **Primary Purpose** | Data consistency & correctness | Speed & throughput |
| **Design Philosophy** | Event-driven precision | Performance optimization |
| **Consistency Model** | Full consistency guarantee | Speed-optimized trade-offs |
| **Use Case** | Production recovery | Bulk migrations & initial loads |
### 📊 Technical Comparison
| Feature | RestoreIndices | LoadIndices |
| --------------------------------- | ---------------------------- | ------------------------ |
| **Database Isolation** | READ_COMMITTED | READ_UNCOMMITTED |
| **MCL Events** | ✅ Full MCL pipeline | ❌ Bypasses MCL entirely |
| **Graph Updates (Elasticsearch)** | ✅ Updated | ✅ Updated |
| **Graph Updates (Neo4j)** | ✅ Updated | ❌ Missing |
| **Database Support** | MySQL, PostgreSQL, Cassandra | MySQL, PostgreSQL only |
| **Performance** | Slower, safer | Faster, optimized |
| **Real-time Consistency** | ✅ Immediate | ❌ Delayed until refresh |
| **Concurrency Safety** | ✅ Safe | ❌ Not safe |
### 🚀 When to Use Each Tool
#### ✅ **Use RestoreIndices For:**
- **Production recovery** from inconsistencies
- **Neo4j-based graph storage** deployments
- **Cassandra-based** metadata storage
- **Active ingestion** pipelines running
- **MCL-dependent systems** requiring event notifications
- **Precise event replay** scenarios
#### ✅ **Use LoadIndices For:**
- **Fresh deployments** with empty clusters
- **Bulk migrations** during maintenance windows
- **MySQL/PostgreSQL + Elasticsearch** configurations
- **Offline scenarios** with no concurrent writes
- **Development/testing** environments
- **Performance-critical** initial data loads