# Search and Graph Reindexing
If your search infrastructure (Elasticsearch/OpenSearch) or graph services (Elasticsearch/OpenSearch/Neo4j) become inconsistent or out-of-sync with your primary metadata store, you can **rebuild them from the source of truth**: the `metadata_aspect_v2` table in your relational database (MySQL/Postgres).
This process works by fetching the latest version of each aspect from the database and replaying them as Metadata Change Log (MCL) events. These events will regenerate your search and graph indexes, effectively restoring a consistent view.
> ⚠️ **Note**: By default, this process does **not remove** stale documents from the index that no longer exist in the database. To ensure full consistency, we recommend reindexing into a clean instance, or using the `-a clean` option to wipe existing index contents before replay.
---
## How it Works
Reindexing is powered by the `datahub-upgrade` utility (packaged as the `datahub-upgrade` container in Docker/Kubernetes). It supports a special upgrade task called `RestoreIndices`, which replays aspects from the database back into search and graph stores.
You can run this utility in three main environments:
- Quickstart (via CLI)
- Docker Compose (via shell script)
- Kubernetes (via Helm + CronJob)
---
## Reindexing Configuration Options
When running the `RestoreIndices` job, you can pass additional arguments to customize the behavior:
### 🔄 Pagination & Performance
| Argument | Description |
| -------------------- | --------------------------------------------------------------------------- |
| `urnBasedPagination` | Use URN-based pagination instead of offset. Recommended for large datasets. |
| `startingOffset` | Starting offset for offset-based pagination. |
| `lastUrn` | Resume from this URN (used with URN pagination). |
| `lastAspect` | Resume from this aspect name (used with `lastUrn`). |
| `numThreads` | Number of concurrent threads for reindexing. |
| `batchSize` | Number of records per batch. |
| `batchDelayMs` | Delay in milliseconds between each batch (throttling). |
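
Arguments are passed to the job as `-a key=value` pairs, the same pattern used in the Docker Compose and Helm examples later on this page. Below is a sketch of a throttled, URN-paginated run via the upgrade script; the specific values (and the `urnBasedPagination=true` form) are illustrative assumptions, not tuned recommendations:

```bash
# Illustrative: URN-based pagination with modest batches and a short delay
# between batches to limit load on the backing stores.
./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices \
  -a urnBasedPagination=true \
  -a batchSize=500 \
  -a numThreads=2 \
  -a batchDelayMs=100
```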
### 📅 Time Filtering
| Argument | Description |
| -------------- | --------------------------------------------------------------- |
| `gePitEpochMs` | Only restore aspects created **after** this timestamp (in ms). |
| `lePitEpochMs` | Only restore aspects created **before** this timestamp (in ms). |
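
Both values are epoch timestamps in milliseconds. As a sketch, you could compute one in the shell (GNU `date` is assumed for the `-d` flag) and pass it through an `-a` argument:

```bash
# Restore only aspects written after 2024-03-01 UTC (GNU date assumed).
GE_MS=$(( $(date -d '2024-03-01T00:00:00Z' +%s) * 1000 ))
./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices \
  -a gePitEpochMs=$GE_MS
```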
### 🔍 Content Filtering
| Argument | Description |
| ------------- | ---------------------------------------------------------------------- |
| `aspectNames` | Comma-separated list of aspects to restore (e.g., `ownership,status`). |
| `urnLike`     | SQL LIKE pattern to filter URNs (e.g., `urn:li:dataset%`).              |
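
For instance, to repair only ownership and status documents for datasets, both filters can be combined in one run (a sketch; quoting the `%` wildcard is defensive, so it reaches the job unmodified):

```bash
# Restore only the ownership and status aspects of dataset entities.
./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices \
  -a aspectNames=ownership,status \
  -a 'urnLike=urn:li:dataset%'
```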
### 🧱 Other Options
| Argument | Description |
| ---------------------- | ----------------------------------------------------------------------------------------------------------- |
| `createDefaultAspects` | Whether to create default aspects in SQL & index if missing. **Disable** this if using a read-only replica. |
| `clean` | **Deletes existing index documents before restoring.** Use with caution. |
---
## Running the Restore Job
### 🧪 Quickstart CLI
If you're using DataHub's quickstart image, you can restore indices using a single CLI command:
```bash
datahub docker quickstart --restore-indices
```
:::info
This command automatically clears the search and graph indices before restoring them.
:::
More details in the [Quickstart Docs](../quickstart.md#restore-datahub).

---
### 🐳 Docker Compose
If you're using Docker Compose and have cloned the [DataHub source repo](https://github.com/datahub-project/datahub), run:
```bash
./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices
```
To clear existing index contents before restore (recommended if you suspect inconsistencies), add `-a clean` :
```bash
./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices -a clean
```
:::info
Without the `-a clean` flag, old documents may remain in your search/graph index, even if they no longer exist in your SQL database.
:::
Refer to the [Upgrade Script Docs](../../docker/datahub-upgrade/README.md#environment-variables) for more info on environment configuration.

---
### ☸️ Kubernetes (Helm)
1. **Check if the Job Template Exists**

   Run:

   ```bash
   kubectl get cronjobs
   ```

   You should see a result like:

   ```bash
   datahub-datahub-restore-indices-job-template
   ```

   If not, make sure you're using the latest Helm chart version that includes the restore job.
2. **Trigger the Restore Job**

   Run:

   ```bash
   kubectl create job --from=cronjob/datahub-datahub-restore-indices-job-template datahub-restore-indices-adhoc
   ```

   This will create and run a one-off job to restore indices from your SQL database.
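
   To follow the job's progress, standard `kubectl` commands work against the one-off job created above (a sketch; the job name matches the command above):

   ```bash
   # Watch the one-off restore job and stream its logs.
   kubectl get jobs datahub-restore-indices-adhoc
   kubectl logs -f job/datahub-restore-indices-adhoc
   ```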
3. **To Enable Clean Reindexing**

   Edit your `values.yaml` to include the `-a clean` argument:

   ```yaml
   datahubUpgrade:
     restoreIndices:
       image:
         args:
           - "-u"
           - "RestoreIndices"
           - "-a"
           - "batchSize=1000"
           - "-a"
           - "batchDelayMs=100"
           - "-a"
           - "clean"
   ```
:::info
The default job does **not** delete existing documents before restoring. Add `-a clean` to ensure full sync.
:::
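
To apply the change, upgrade your Helm release and re-trigger the job. A sketch, assuming a release named `datahub` installed from the `datahub/datahub` chart; adjust the release name and values file path to your setup:

```bash
# Roll out the updated job args, then launch a one-off clean restore.
helm upgrade datahub datahub/datahub --values values.yaml
kubectl create job --from=cronjob/datahub-datahub-restore-indices-job-template \
  datahub-restore-indices-clean
```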
### Through APIs
See also the [Best Practices](#best-practices) section below. Note that the APIs can handle only a few thousand aspects; in this mode, one of the GMS instances performs the required actions and is subject to request timeouts. Use one of the approaches above for longer-running restores.
#### OpenAPI
There are two primary APIs: one exposes the common parameters for `restoreIndices`, and another accepts a list of URNs for which all aspects should be restored.
Full configuration:
<p align="center">
  <img width="80%" src="https://github.com/datahub-project/static-assets/blob/main/imgs/how/restore-indices/openapi-restore-indices.png?raw=true" />
</p>
All Aspects:
<p align="center">
  <img width="80%" src="https://github.com/datahub-project/static-assets/blob/main/imgs/how/restore-indices/openapi-restore-indices-urns.png?raw=true" />
</p>
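
Below is a rough sketch of how an invocation might look with `curl`. The endpoint path, parameter names, and token handling here are assumptions; verify them against your instance's OpenAPI UI (shown above) before use:

```bash
# Illustrative only: the route and query parameters below are assumptions;
# confirm the exact path and schema in your instance's OpenAPI UI.
curl -X POST \
  "http://localhost:8080/openapi/operations/elasticSearch/indices/restore?aspectNames=ownership&urnLike=urn%3Ali%3Adataset%25&batchSize=500" \
  -H "Authorization: Bearer <your-access-token>"
```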
#### Rest.li
For Rest.li, see [Restore Indices API](../api/restli/restore-indices.md).
## Best Practices
In general, this process does not need to be run unless there has been a disruption of storage services or infrastructure, such as Elasticsearch/OpenSearch cluster failures, data corruption events, or significant version upgrade inconsistencies that have caused the search and graph indices to fall out of sync with the primary database.
Some pointers to keep in mind when running this process:
- Always test reindexing in a **staging environment** first.
- Consider taking a backup of your Elasticsearch/OpenSearch index before a `clean` restore.
- For very large deployments, use `urnBasedPagination` and limit `batchSize` to avoid overloading your backend.
- Monitor Elasticsearch/OpenSearch logs during the restore for throttling or memory issues.
### K8s Job vs. API
#### When to Use Kubernetes Jobs
For operations affecting 2,000 or more aspects, it's strongly recommended to use the Kubernetes job approach. This job is designed for long-running processes and provides several advantages:
- Won't time out like API calls
- Can be monitored through Kubernetes logging
- Won't consume resources from your primary GMS instances
- Can be scheduled during off-peak hours to minimize system impact
#### When to Use APIs
The RestoreIndices APIs (available through both Rest.li and OpenAPI) are best suited for:
- Targeted restores affecting fewer than 2,000 aspects
- Emergencies where you need to quickly restore critical metadata
- Testing or validating the restore process before running a full-scale job
- Scenarios where you don't have direct access to the Kubernetes cluster
Remember that API-based restoration runs within one of your GMS instances and is subject to timeouts, which could lead to incomplete restorations for larger installations.
### Targeted Restoration Strategies
Being selective about what you restore is crucial for efficiency. Combining these filtering strategies can dramatically reduce the restoration scope, saving resources and time.
#### Entity Type Filtering
Use the `urnLike` parameter to target specific entity types:
- For datasets: `urnLike=urn:li:dataset:%`
- For users: `urnLike=urn:li:corpuser:%`
- For dashboards: `urnLike=urn:li:dashboard:%`
#### Single Entity
When only one entity is affected, provide the specific URN to minimize processing overhead.

#### Aspect-Based Filtering

Use `aspectNames` to target only the specific aspects that need restoration:
- For ownership inconsistencies: `aspectNames=ownership`
- For tag issues: `aspectNames=globalTags`
#### Time-Based
If you know when the inconsistency began, use time-based filtering:
- `gePitEpochMs={timestamp}` to process only records created after the incident
- `lePitEpochMs={timestamp}` to limit processing to records before a certain time
### Parallel Processing Strategies
To optimize restoration speed while managing system load:
#### Multiple Parallel Jobs
Run several restoreIndices processes simultaneously by:

- Working on non-overlapping sets of aspects or entities
- Dividing work by entity type (one job for datasets, another for users, etc.)
- Splitting aspects among different jobs (one for ownership aspects, another for lineage, etc.)
- Partitioning large entity types by URN prefix or time range
- Staggering start times to prevent initial resource contention

Monitor system metrics closely during concurrent restoration to ensure you're not overloading your infrastructure.
:::caution
Avoid conflicts: ensure that concurrent jobs never specify the `clean` argument. A clean wipe from one job deletes index documents that other concurrent jobs may have already restored.
:::
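
One way to set up partitioned concurrent jobs on Kubernetes is to derive one-off jobs from the cron template, then edit each manifest's container args before applying. A sketch; the template name matches the earlier Helm example, the job names are illustrative, and the exact args structure depends on your chart:

```bash
# Generate two job manifests from the cronjob template, one per partition.
kubectl create job --from=cronjob/datahub-datahub-restore-indices-job-template \
  restore-datasets --dry-run=client -o yaml > restore-datasets.yaml
kubectl create job --from=cronjob/datahub-datahub-restore-indices-job-template \
  restore-users --dry-run=client -o yaml > restore-users.yaml
# Edit each manifest's container args to add a non-overlapping filter, e.g.
#   -a urnLike=urn:li:dataset:%   vs.   -a urnLike=urn:li:corpuser:%
kubectl apply -f restore-datasets.yaml
kubectl apply -f restore-users.yaml
```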
### Temporary Workload Reduction
- Pause scheduled ingestion jobs during restoration
- Temporarily disable or reduce frequency of the datahub-gc job to prevent conflicting deletes
- Consider pausing automated workflows or integrations that generate metadata events
### Infrastructure Tuning
Implementing these expanded best practices should help ensure a smoother, more efficient restoration process while minimizing impact on your DataHub environment.

This operation can be I/O intensive on the read side (SQL) and on the Elasticsearch write side. If you're using provisioned I/O or throughput, monitor your infrastructure for possible throttling.
#### Elasticsearch/OpenSearch Optimization
To improve write performance during restoration:
##### Refresh Interval Adjustment:
Temporarily increase the `refresh_interval` setting from the default (typically 1s) to something like 30s or 60s. Run the system update job with the following environment variable: `ELASTICSEARCH_INDEX_BUILDER_REFRESH_INTERVAL_SECONDS=60`
:::caution
Remember to reset this after restoration completes!
:::
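
To confirm the new interval is active, you can query the index settings directly. A sketch, assuming Elasticsearch is reachable at `localhost:9200`:

```bash
# Show the effective refresh_interval across indices (host is an assumption).
curl -s "http://localhost:9200/_settings?filter_path=*.settings.index.refresh_interval"
```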
##### Bulk Processing Improvements:
- Adjust the Elasticsearch batching parameters to optimize bulk request size (try values between 2,000 and 5,000)
- Run your GMS or `mae-consumer` with these environment variables:
  - `ES_BULK_REQUESTS_LIMIT=3000`
  - `ES_BULK_FLUSH_PERIOD=60`
- Configure `batchDelayMs` on restoreIndices to add breathing room between batches if your cluster is struggling
##### Shard Management:
- Ensure your indices have an appropriate number of shards for your cluster size.
- Consider temporarily adding nodes to your search cluster during massive restorations.
#### SQL/Primary Storage
Consider using a read replica as the source of the job's data. If you configure a read-only replica, you must also provide the parameter `createDefaultAspects=false`.
#### Kafka & Consumers
##### Partition Strategy:
- Verify that the Kafka Metadata Change Log (MCL) topic has enough partitions to allow for parallel processing.
- Recommended: at least 10-20 partitions for the MCL topic in production environments.
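
You can check the current partition count with standard Kafka tooling. A sketch; the topic name `MetadataChangeLog_Versioned_v1` and the broker address are assumptions, so adjust both to your deployment:

```bash
# Show partition count and placement for the versioned MCL topic
# (topic name and broker address are assumptions; adjust to your setup).
kafka-topics.sh --describe \
  --topic MetadataChangeLog_Versioned_v1 \
  --bootstrap-server localhost:9092
```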
##### Consumer Scaling:
- Temporarily increase the number of `mae-consumer` pods to process the higher event volume.
- Scale GMS instances if they're handling consumer duties.
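
On Kubernetes, scaling the consumer up for the duration of the restore might look like the sketch below; the deployment name is an assumption from a standalone-consumers setup, so check `kubectl get deployments` first:

```bash
# Temporarily scale up the MAE consumer (deployment name is an assumption).
kubectl scale deployment datahub-datahub-mae-consumer --replicas=4
# Scale back down once restoration completes.
kubectl scale deployment datahub-datahub-mae-consumer --replicas=1
```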
##### Monitoring:
- Watch consumer lag metrics closely during restoration.
- Be prepared to adjust scaling or batch parameters if consumers fall behind.
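
One way to watch lag is Kafka's consumer-group tooling. A sketch; the group id shown is an assumption (it varies by deployment), so list the groups first if unsure:

```bash
# List consumer groups, then inspect lag for the MAE consumer group
# (group id and broker address are assumptions; adjust to your setup).
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group generic-mae-consumer-job-client
```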