# Load Indices: High-Performance Bulk Index Loading
LoadIndices is a high-performance upgrade task designed for bulk loading metadata aspects directly from the database into Elasticsearch/OpenSearch indices. Unlike RestoreIndices, which focuses on correctness and consistency, LoadIndices is optimized for speed and throughput during initial deployments or large-scale data migrations.
## Overview
LoadIndices bypasses the standard event-driven processing pipeline to directly stream data from the `metadata_aspect_v2` table into search indices using optimized bulk operations. This approach provides significant performance improvements for large installations while making specific architectural trade-offs that prioritize speed over consistency.
🚨 **CRITICAL WARNING**: LoadIndices is designed for specific use cases only and should **NEVER** be used in production environments with active concurrent writes, MCL-dependent systems, or real-time consistency requirements. See [Performance Trade-offs & Implications](#performance-trade-offs--implications) for complete details.
### Key Features
- **🚀 High Performance**: Direct streaming from database with optimized bulk operations
- **⚡ Fast Bulk Loading**: Optimized for speed over consistency during initial loads
- **🔧 Refresh Management**: Automatically disables refresh intervals during loading for optimal performance
- **📊 Comprehensive Monitoring**: Real-time progress reporting and performance metrics
- **⚙️ Configurable Isolation**: Utilizes READ_UNCOMMITTED transactions for faster scanning
---
## Performance Trade-offs & Implications
⚠️ **Critical Understanding**: LoadIndices prioritizes **performance over consistency** by making several architectural trade-offs. Understanding these implications is crucial before using LoadIndices in production environments.
### 🚨 Key Trade-offs Made
#### **1. Bypasses the Kafka/MCL Event Pipeline**
- **What**: LoadIndices completely bypasses Kafka MCL (Metadata Change Log) topics that normally propagate all metadata changes
- **Architecture**: `Database → LoadIndices → Elasticsearch` **vs** normal flow of `Database → Kafka MCL → Multiple Consumers → Elasticsearch/Graph/etc`
- **Impact**: **No MCL events published** - downstream systems lose visibility into metadata changes
- **Critical Implications**:
  - **MCL-Dependent Analytics**: Won't have an audit trail of metadata changes
  - **Integrations**: External systems won't be notified of changes
  - **Custom MCL Consumers**: Any custom consumers will miss these events entirely
  - **✅ Graph Service**: WILL be updated (UpdateIndicesService handles graph indices) **⚠️ but only when Elasticsearch is used for graph storage**
#### **2. Breaks the DataHub Event Architecture**
- **What**: Violates DataHub's core design principle that "all metadata changes flow through Kafka MCL"
- **Normal Flow**: `Metadata Change → MCL Event → Kafka → Multiple Consumers → Various Stores`
- **LoadIndices Flow**: `Metadata Change → LoadIndices → Direct ES Write` (**Skips Kafka entirely**)
#### **3. READ_UNCOMMITTED Isolation**
- **What**: Uses `TxIsolation.READ_UNCOMMITTED` for faster database scanning
- **Impact**: May read **uncommitted changes** or **dirty reads** from concurrent transactions
- **Implication**: Data consistency not guaranteed during active writes to database
#### **4. Refresh Interval Manipulation**
- **What**: Automatically disables refresh intervals during bulk operations
- **Impact**: **Recent updates may not be immediately searchable**
- **Implication**: Users won't see real-time updates in search until refresh intervals are restored
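If a run is interrupted before cleanup, the refresh interval can be checked and restored manually with the standard index settings API. The `datahub*` index pattern below is an assumption; adjust it to your index naming convention:

```bash
# Inspect the current refresh interval (shows -1 while refresh is disabled)
curl -s "localhost:9200/datahub*/_settings/index.refresh_interval?pretty"
# Manually restore a 1s refresh interval if LoadIndices exited without cleanup
curl -X PUT "localhost:9200/datahub*/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "1s"}}'
```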
#### **5. No Write Concurrency Controls**
- **What**: No coordination with concurrent Elasticsearch writes from live ingestion
- **Impact**: **Potential conflicts** with active ingestion pipelines
- **Implication**: Concurrent writes may cause data inconsistency or operation failures
### ⚠️ When NOT to Use LoadIndices
**❌ DO NOT use LoadIndices if you have:**
- **Active ingestion pipelines** writing to Elasticsearch simultaneously
- **MCL-dependent systems** that need event notifications
- **Neo4j-based graph storage** (graph updates will be missing)
- **Real-time search requirements** during the loading process
- **Production traffic** that requires immediate search consistency
### ✅ When LoadIndices is Appropriate
**✅ Safe to use LoadIndices when:**
- **Fresh deployment** with empty Elasticsearch cluster
- **Offline migration** with no concurrent users
- **Standalone indexing** without DataHub services running
- **Read-only replica environments** with no active writes
- **Development/testing** environments
- **Disaster recovery** scenarios where faster restoration is prioritized
- **Independent cluster setup** where you need to populate indices before services start
- **Elasticsearch-based graph storage** (graph gets updated automatically)
### 🔒 Safety Requirements
Before using LoadIndices in any environment:
1. **Verify Minimal Infrastructure**:
- **Database**: MySQL/PostgreSQL with `metadata_aspect_v2` table accessible (via Ebean ORM)
- **Elasticsearch**: Running cluster accessible via HTTP/HTTPS
- **DataHub Services**: ✅ **NOT required** - LoadIndices can run independently
- **⚠️ Check Graph Storage**: Verify if using Elasticsearch-based graph storage
- **⚠️ Check Database Type**: Confirm NOT using Cassandra (not supported)
2. **Stop All Ingestion** (if DataHub is running):
```bash
# Disable all Kafka consumers
kubectl scale deployment --replicas=0 datahub-mae-consumer
kubectl scale deployment --replicas=0 datahub-mce-consumer
kubectl scale deployment --replicas=0 datahub-gms
```
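Before proceeding, you can confirm the scale-down took effect (same deployment names as above):

```bash
# All three deployments should report 0/0 ready replicas
kubectl get deployment datahub-mae-consumer datahub-mce-consumer datahub-gms
```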
3. **Check Database Configuration**:
```bash
# Check if using Cassandra (LoadIndices NOT supported)
grep -i cassandra /path/to/datahub/docker/docker-compose.yml
# Verify MySQL/PostgreSQL database is configured
grep -E "mysql|postgres" /path/to/datahub/docker/docker-compose.yml
# ⚠️ If Cassandra detected, LoadIndices is NOT available
# Must use RestoreIndices instead
```
4. **Check Graph Storage Configuration**:
```bash
# Check if using Neo4j (graph updates will be MISSING)
grep -r "neo4j" /path/to/datahub/docker/docker-compose.yml
# Check DataHub configuration for graph service selection
grep -i "graph.*elasticsearch\|neo4j" /path/to/datahub/conf/application.yml
# ⚠️ If Neo4j is detected, LoadIndices will NOT update graph
```
5. **Verify No Concurrent Writes**:
```bash
# Check for active Elasticsearch indexing
curl -s "localhost:9200/_nodes/stats" | grep -o '"index_current":[0-9]*'
# Every value should read "index_current":0
```
6. **Index Clean State**:
```bash
# Inspect refresh activity across nodes (the stats response is single-line JSON, hence grep -o)
curl -s "localhost:9200/_nodes/stats" | grep -o '"refresh":{[^}]*'
```
7. **Coordinate with Operations**:
- **Maintenance window** scheduling
- **User notification** of search unavailability
- **Monitoring** of downstream system dependencies
### 📊 Consistency Guarantees
| Level | LoadIndices | RestoreIndices |
| --------------------------------------- | ------------- | ------------------- |
| **URN-level Ordering** | ✅ Guaranteed | ✅ Guaranteed |
| **Real-time Searchability** | ❌ Delayed | ✅ Immediate |
| **Graph Service Updates (ES-based)** | ✅ Updated | ✅ Updated |
| **Graph Service Updates (Neo4j-based)** | ❌ Missing | ✅ Updated |
| **MCL Event Propagation** | ❌ Bypassed | ✅ Full propagation |
| **Concurrent Write Safety** | ❌ Not safe | ✅ Safe |
### 🔄 Post-Load Operations
#### **1. Restore Normal Operations**
- **Re-enable ingestion** pipelines gradually
- **Monitor Elasticsearch** for conflicts
- **Validate downstream systems** are synchronized
#### **2. Emergency Rollback Plan**
```bash
# If issues arise, prepare rollback:
# 1. Stop LoadIndices immediately
# 2. Restore from backup indices
# 3. Re-run with RestoreIndices for correctness
```
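As a concrete sketch of step 2, assuming you snapshotted the indices beforehand via the standard Elasticsearch snapshot API (the repository and snapshot names below are hypothetical):

```bash
# Close or delete the affected indices first, then restore the pre-load snapshot
curl -X POST "localhost:9200/_snapshot/backup_repo/pre_loadindices/_restore" \
  -H 'Content-Type: application/json' \
  -d '{"indices": "datahub*", "include_global_state": false}'
# Then re-run RestoreIndices for a full-consistency replay
./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices
```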
---
## How LoadIndices Works
LoadIndices operates as an upgrade task that can run **independently** without requiring DataHub services to be running. It consists of two main steps:
1. **BuildIndicesStep**: Creates and configures Elasticsearch indices (creates indices if they don't exist)
2. **LoadIndicesStep**: Streams aspects from database and bulk loads them into indices
### 🔧 Independent Operation Mode
**Key Advantage**: LoadIndices only requires:
- ✅ **MySQL/PostgreSQL** source database (via Ebean ORM)
- ✅ **Elasticsearch/OpenSearch** destination cluster
- ❌ **No DataHub services** (GMS, frontend, etc.) required
- ❌ **Cassandra**: ⚠️ **NOT supported** (Ebean doesn't support Cassandra)
This enables **offline bulk operations** during maintenance windows or initial deployments where DataHub infrastructure is being set up incrementally.
**Index Creation**: The BuildIndicesStep automatically creates all required Elasticsearch indices based on `IndexConvention` patterns, so empty Elasticsearch clusters are fully supported.
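To confirm what BuildIndicesStep created on an otherwise empty cluster, a quick look at the cat API works (exact index names depend on your configured `IndexConvention` prefix):

```bash
# List indices with document counts, sorted by name
curl -s "localhost:9200/_cat/indices?v&s=index"
```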
### Architecture Flow
```mermaid
graph TD
A[LoadIndices Upgrade] --> B[BuildIndicesStep]
B --> C[Create/Configure Indices]
C --> D[LoadIndicesStep]
D --> E[Disable Refresh Intervals]
E --> F[Stream Aspects from DB]
F --> G[Batch Processing]
G --> H[Convert to MCL Events]
H --> I[Bulk Write to ES]
I --> J[Restore Refresh Intervals]
```
### Key Differences from RestoreIndices
| Aspect | RestoreIndices | LoadIndices |
| ---------------------- | ----------------------------- | -------------------------- |
| **Purpose** | Correctness & consistency | Speed & throughput |
| **Processing** | Event-driven via MCL events | Direct bulk operations |
| **Isolation** | READ_COMMITTED | READ_UNCOMMITTED |
| **Refresh Management** | Static configuration | Dynamic disable/restore |
| **Performance Focus** | Accurate replay | Maximal speed |
| **Use Case** | Recovery from inconsistencies | Initial loads & migrations |
---
## Deployment & Execution
### 🚀 Standalone Deployment Advantage
**Key Benefit**: LoadIndices can run with **minimal infrastructure** without requiring DataHub services to be running:
```bash
# Minimal requirements
✅ MySQL/PostgreSQL database (with metadata_aspect_v2 table)
✅ Elasticsearch/OpenSearch cluster
❌ DataHub GMS and consumer services - NOT needed
❌ Kafka cluster - NOT needed
❌ Frontend services - NOT needed
```
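A minimal preflight check for these two dependencies might look like the following (hostnames and credentials are placeholders for your environment):

```bash
# Source database is reachable and the aspect table is populated
mysql -h db-host -u datahub -p -e "SELECT COUNT(*) FROM metadata_aspect_v2;" datahub
# Destination cluster is healthy (green/yellow)
curl -s "http://es-host:9200/_cluster/health?pretty"
```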
### 🔧 Execution Methods
LoadIndices can be executed via:
1. **Gradle Task** (Recommended)
```bash
# From datahub-upgrade directory
./gradlew runLoadIndices
# With custom thread count
./gradlew runLoadIndices -PesThreadCount=6
```
2. **IDE Execution**: Run `UpgradeTask.main()` with LoadIndices arguments
3. **Standalone JAR**: Build and run datahub-upgrade JAR independently
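For option 3, a standalone JAR invocation might look like the sketch below; the jar path and connection environment variables are assumptions, so match them to your build output and deployment:

```bash
# Build the upgrade jar, then run the LoadIndices upgrade directly
./gradlew :datahub-upgrade:build
EBEAN_DATASOURCE_URL="jdbc:mysql://db-host:3306/datahub" \
ELASTICSEARCH_HOST=es-host \
ELASTICSEARCH_PORT=9200 \
java -jar datahub-upgrade/build/libs/datahub-upgrade.jar -u LoadIndices -a batchSize=10000
```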
---
## LoadIndices Configuration Options
### 🔄 Performance & Throttling
| Argument | Description | Default | Example |
| ----------- | ------------------------------------------ | ------------------------------ | ------------------- |
| `batchSize` | Number of aspects per batch for processing | `10000` | `-a batchSize=5000` |
| `limit` | Maximum number of aspects to process | `Integer.MAX_VALUE` (no limit) | `-a limit=50000` |
### 📅 Time Filtering
| Argument | Description | Example |
| -------------- | --------------------------------------------------------------------- | ------------------------------- |
| `gePitEpochMs` | Only process aspects created **after** this timestamp (milliseconds) | `-a gePitEpochMs=1609459200000` |
| `lePitEpochMs` | Only process aspects created **before** this timestamp (milliseconds) | `-a lePitEpochMs=1640995200000` |
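To derive these values, a date can be converted to epoch milliseconds with GNU `date` (BSD/macOS `date` uses different flags):

```bash
# 2021-01-01 00:00:00 UTC in epoch milliseconds
echo $(( $(date -u -d "2021-01-01 00:00:00" +%s) * 1000 ))
# => 1609459200000
```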
### 🔍 Content Filtering
| Argument | Description | Example |
| ------------- | ----------------------------------------------- | ----------------------------------------- |
| `urnLike` | SQL LIKE pattern to filter URNs | `-a urnLike=urn:li:dataset:%` |
| `aspectNames` | Comma-separated list of aspect names to process | `-a aspectNames=ownership,schemaMetadata` |
| `lastUrn` | Resume processing from this URN (inclusive) | `-a lastUrn=urn:li:dataset:my-dataset` |
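These filters compose. For example, a sketch that reindexes only dataset ownership aspects written during 2021, using the same upgrade script shown in the Docker Compose section below:

```bash
./docker/datahub-upgrade/datahub-upgrade.sh -u LoadIndices \
  -a urnLike=urn:li:dataset:% \
  -a aspectNames=ownership \
  -a gePitEpochMs=1609459200000 \
  -a lePitEpochMs=1640995200000
```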
### ⚙️ System Configuration
| Environment Variable | Description | Default | Example |
| ---------------------------- | --------------------------------------- | ----------------------------------- | ------------------------------ |
| `ELASTICSEARCH_THREAD_COUNT` | Number of I/O threads for BulkProcessor | `2` (app config), `4` (Gradle task) | `ELASTICSEARCH_THREAD_COUNT=4` |
| `ES_BULK_ASYNC` | Enable asynchronous bulk operations | `true` | `ES_BULK_ASYNC=true` |
| `ES_BULK_REQUESTS_LIMIT` | Maximum bulk requests per buffer | `10000` | `ES_BULK_REQUESTS_LIMIT=15000` |
| `ES_BULK_FLUSH_PERIOD` | Bulk flush interval in seconds | `300` (5 minutes) | `ES_BULK_FLUSH_PERIOD=300` |
---
## Running LoadIndices
### 🐳 Docker Compose
If you're using Docker Compose with the DataHub source repository:
```bash
# Basic LoadIndices execution
./docker/datahub-upgrade/datahub-upgrade.sh -u LoadIndices
# LoadIndices with performance tuning
./docker/datahub-upgrade/datahub-upgrade.sh -u LoadIndices \
-a batchSize=15000 \
-a limit=100000
```
### 🎯 Gradle Task (Development)
For development and testing environments:
```bash
# Run LoadIndices with default settings
./gradlew :datahub-upgrade:runLoadIndices
# Run with custom thread count and batch size
./gradlew :datahub-upgrade:runLoadIndices \
-PesThreadCount=4 \
-PbatchSize=15000 \
-Plimit=50000
```
The Gradle task supports these parameters:
- `esThreadCount`: Set `ELASTICSEARCH_THREAD_COUNT` (default: `4`)
- `batchSize`: Override batch size (default: `10000`)
- `limit`: Set processing limit
- `urnLike`: Filter by URN pattern
- `aspectNames`: Filter by aspect names
- `lePitEpochMs`: Process records created before this timestamp
- `gePitEpochMs`: Process records created after this timestamp
- `lastUrn`: Resume processing from this URN (inclusive)
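Assuming the remaining parameters follow the same `-P` property pattern, a filtered run might look like:

```bash
./gradlew :datahub-upgrade:runLoadIndices \
  -PurnLike=urn:li:dataset:% \
  -PaspectNames=ownership,status \
  -PlastUrn=urn:li:dataset:my-dataset
```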
### 🐳 Docker Environment Variables
Configure LoadIndices through Docker environment:
```bash
# Target specific entity types
docker run --rm datahub-upgrade \
-u LoadIndices \
-a urnLike=urn:li:dataset:% \
-a batchSize=20000
# Process specific aspects only
docker run --rm datahub-upgrade \
-u LoadIndices \
-a aspectNames=ownership,status,schemaMetadata \
-a batchSize=15000
# Time-based filtering
docker run --rm datahub-upgrade \
-u LoadIndices \
-a gePitEpochMs=1640995200000 \
-a limit=50000
# Resume from a specific URN
docker run --rm datahub-upgrade \
-u LoadIndices \
-a lastUrn=urn:li:dataset:my-dataset \
-a batchSize=10000
```
### 🔄 Resume Functionality
LoadIndices supports resuming from a specific URN when processing is interrupted:
#### **Resume from Last Processed URN**
When LoadIndices runs, it logs the last URN processed in each batch:
```
Batch completed - Last URN processed: urn:li:dataset:my-dataset
Processed 10000 aspects - 150.2 aspects/sec - Last URN: urn:li:dataset:my-dataset
```
To resume from where you left off:
```bash
# Resume from the last URN that was successfully processed
./gradlew :datahub-upgrade:runLoadIndices \
-a lastUrn=urn:li:dataset:my-dataset \
-a batchSize=10000
```
#### **Resume Best Practices**
- **Use the exact URN**: Copy the URN exactly as logged (including any URL encoding)
- **Inclusive processing**: The `lastUrn` parameter processes from the specified URN onwards (inclusive)
- **Monitor progress**: Watch the logs for the "Last URN processed" messages to track progress
- **Batch boundaries**: Resume works at the URN level, not batch level - some aspects may be reprocessed
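If you capture the run's output to a file, the most recent resume point can be recovered with a quick grep (the log file name here is just a placeholder):

```bash
# Capture output, then recover the most recent resume point
./gradlew :datahub-upgrade:runLoadIndices -a batchSize=5000 | tee loadindices.log
grep "Last URN processed" loadindices.log | tail -1
```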
#### **Example Resume Workflow**
```bash
# 1. Start initial processing
./gradlew :datahub-upgrade:runLoadIndices -a batchSize=5000
# 2. If interrupted, check logs for last URN:
# "Batch completed - Last URN processed: urn:li:dataset:my-dataset"
# 3. Resume from that URN
./gradlew :datahub-upgrade:runLoadIndices \
-a lastUrn=urn:li:dataset:my-dataset \
-a batchSize=5000
```
---
## Performance Optimization
### 🚀 Elasticsearch/OpenSearch Configuration
#### Bulk Processing Tuning
```bash
# Optimize bulk settings for LoadIndices
export ES_BULK_REQUESTS_LIMIT=15000
export ES_BULK_FLUSH_PERIOD=10
export ES_BULK_ASYNC=true
export ELASTICSEARCH_THREAD_COUNT=4
```
#### Connection Pool Optimization
LoadIndices automatically configures connection pooling based on thread count:
```groovy
// datahub-upgrade/build.gradle configuration
environment "ELASTICSEARCH_THREAD_COUNT", "4" // Auto-adjusts maxConnectionsPerRoute
```
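When running the upgrade as a standalone JAR instead of via Gradle, the same knob can be set through the environment (jar path as in the earlier sketch; an assumption, not a fixed location):

```bash
# Raise I/O thread count for a larger cluster; connection pool scales with it
ELASTICSEARCH_THREAD_COUNT=8 java -jar datahub-upgrade/build/libs/datahub-upgrade.jar -u LoadIndices
```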
## Comparison with RestoreIndices
Understanding when to use LoadIndices vs RestoreIndices is crucial for optimal performance and data consistency.
### 🎯 Purpose & Design Philosophy
| Aspect | RestoreIndices | LoadIndices |
| --------------------- | ------------------------------ | ------------------------------- |
| **Primary Purpose** | Data consistency & correctness | Speed & throughput |
| **Design Philosophy** | Event-driven precision | Performance optimization |
| **Consistency Model** | Full consistency guarantee | Speed-optimized trade-offs |
| **Use Case** | Production recovery | Bulk migrations & initial loads |
### 📊 Technical Comparison
| Feature | RestoreIndices | LoadIndices |
| --------------------------------- | ---------------------------- | ------------------------ |
| **Database Isolation** | READ_COMMITTED | READ_UNCOMMITTED |
| **MCL Events** | ✅ Full MCL pipeline | ❌ Bypasses MCL entirely |
| **Graph Updates (Elasticsearch)** | ✅ Updated | ✅ Updated |
| **Graph Updates (Neo4j)** | ✅ Updated | ❌ Missing |
| **Database Support** | MySQL, PostgreSQL, Cassandra | MySQL, PostgreSQL only |
| **Performance** | Slower, safer | Faster, optimized |
| **Real-time Consistency** | ✅ Immediate | ❌ Delayed until refresh |
| **Concurrency Safety** | ✅ Safe | ❌ Not safe |
### 🚀 When to Use Each Tool
#### ✅ **Use RestoreIndices For:**
- **Production recovery** from inconsistencies
- **Neo4j-based graph storage** deployments
- **Cassandra-based** metadata storage
- **Active ingestion** pipelines running
- **MCL-dependent systems** requiring event notifications
- **Precise event replay** scenarios
#### ✅ **Use LoadIndices For:**
- **Fresh deployments** with empty clusters
- **Bulk migrations** during maintenance windows
- **MySQL/PostgreSQL + Elasticsearch** configurations
- **Offline scenarios** with no concurrent writes
- **Development/testing** environments
- **Performance-critical** initial data loads