From 0817041232245db5d016dfda81111e522a9924de Mon Sep 17 00:00:00 2001 From: John Joyce Date: Thu, 15 May 2025 13:48:47 -0700 Subject: [PATCH] fix(docs): Improve backup and restore doc (#13466) Co-authored-by: John Joyce Co-authored-by: John Joyce --- docker/datahub-upgrade/README.md | 61 +++++++-- docs/advanced/monitoring.md | 3 +- docs/how/backup-datahub.md | 213 ++++++++++++++++++++++++++++++- docs/how/restore-indices.md | 164 ++++++++++++++---- 4 files changed, 353 insertions(+), 88 deletions(-) diff --git a/docker/datahub-upgrade/README.md b/docker/datahub-upgrade/README.md index 81306132b1..8465fdbf0b 100644 --- a/docker/datahub-upgrade/README.md +++ b/docker/datahub-upgrade/README.md @@ -1,17 +1,27 @@ # DataHub Upgrade Docker Image

-This container is used to automatically apply upgrades from one version of DataHub to another.
+This container is used to automatically apply upgrades from one version of DataHub to another. It contains
+a set of executable jobs, which can be used to perform various types of system maintenance on-demand.
+
+It also supports recurring upgrade tasks that need to occur between versions of DataHub and should be run
+each time DataHub is deployed. More on this below.

## Supported Upgrades

-As of today, there are 2 supported upgrades:
+The following jobs are supported:

-1. **NoCodeDataMigration**: Performs a series of pre-flight qualification checks and then migrates metadata*aspect table data - to metadata_aspect_v2 table. Arguments: - \_batchSize* (Optional): The number of rows to migrate at a time. Defaults to 1000. - _batchDelayMs_ (Optional): The number of milliseconds of delay between migrated batches. Used for rate limiting. Defaults to 250. - _dbType_ (optional): The target DB type. Valid values are `MYSQL`, `MARIA`, `POSTGRES`. Defaults to `MYSQL`. -2. **NoCodeDataMigrationCleanup**: Cleanses graph index, search index, and key-value store of legacy DataHub data (metadata_aspect table) once - the No Code Data Migration has completed successfully. No arguments. +1. **SystemUpdate**: Performs any tasks required to update to a new version of DataHub. For example, applying new configurations to the search & graph indexes, ingesting default settings, and more. Once completed, it emits a message to the DataHub Upgrade History Kafka topic (`DataHubUpgradeHistory_v1`), which signals to other pods that DataHub is ready to start.
   Note that this _must_ be executed any time the DataHub version is incremented, before starting or restarting other system containers. Dependent services will wait until the Kafka message corresponding to the code they are running has been emitted.
   A unique "version id" is generated based on a combination of a) the embedded git tag corresponding to the version of DataHub running and b) an optional revision number, provided via the `DATAHUB_REVISION` environment variable. Helm uses
   the latter to ensure that the system upgrade job is executed every single time a deployment of DataHub is performed, even if the container version has not changed.
+   Important: This job runs as a pre-install hook via the DataHub Helm Charts, i.e. before new version tags are deployed for each container.
+
+2. **SystemUpdateBlocking**: Performs any _blocking_ tasks required to update to a new version of DataHub, as a subset of **SystemUpdate**.
+
+3. **SystemUpdateNonBlocking**: Performs any _nonblocking_ tasks required to update to a new version of DataHub, as a subset of **SystemUpdate**.
+
+4. **RestoreIndices**: Restores indices by fetching the latest version of each aspect and restating MetadataChangeLog events for each latest aspect. Arguments include: -3. **RestoreIndices**: Restores indices by fetching the latest version of each aspect and producing MAE. Arguments: - _batchSize_ (Optional): The number of rows to migrate at a time. Defaults to 1000. - _batchDelayMs_ (Optional): The number of milliseconds of delay between migrated batches. Used for rate limiting. Defaults to 250. - _numThreads_ (Optional): The number of threads to use, defaults to 1. Note that this is not used if `urnBasedPagination` is true. @@ -20,7 +30,32 @@ As of today, there are 2 supported upgrades: - _urnLike_ (Optional): The urn pattern for producing events, using `%` as a wild card - _urnBasedPagination_ (Optional): Paginate the SQL results using the urn + aspect string instead of `OFFSET`. Defaults to false, though should improve performance for large amounts of data. -4. **RestoreBackup**: Restores the storage stack from a backup of the local database
+
+5. **RestoreBackup**: Restores the primary storage (the SQL document DB) from an available backup of the local database. Requires that the backup reader and the backup file are provided. Note that this does not also restore the secondary indexes (the graph and search storage). To do so, you should run the **RestoreIndices** upgrade job. An example invocation is shown below this list.
+   Arguments include:
+
+   - _BACKUP_READER_ (Required): The backup reader to use to read and restore the DB. The only backup reader currently supported is `LOCAL_PARQUET`, which requires a parquet-formatted backup file path to be specified via the `BACKUP_FILE_PATH` argument.
+   - _BACKUP_FILE_PATH_ (Required): The path of the backup file. If you are running in a container, this needs to be the location where the backup file has been mounted into the container.
+
+6. **EvaluateTests**: Executes all Metadata Tests in batches. Running this job can slow down DataHub, and in some cases it requires full scans of the document DB. Generally, it's recommended to configure this to run one time per day (which is the Helm CronJob default).
+   Arguments include:
+
+   - _batchSize_ (Optional): The number of assets to test at a time. Defaults to 1000.
+   - _batchDelayMs_ (Optional): The number of milliseconds of delay between evaluated asset batches. Used for rate limiting. Defaults to 250.
+
+7. (Legacy) **NoCodeDataMigration**: Performs a series of pre-flight qualification checks and then migrates `metadata_aspect` table data
+   to the `metadata_aspect_v2` table. Arguments include:
+
+   - _batchSize_ (Optional): The number of rows to migrate at a time. Defaults to 1000.
+   - _batchDelayMs_ (Optional): The number of milliseconds of delay between migrated batches. Used for rate limiting. Defaults to 250.
+   - _dbType_ (Optional): The target DB type. Valid values are `MYSQL`, `MARIA`, `POSTGRES`. Defaults to `MYSQL`.
+
+   If you are using newer versions of DataHub (v1.0.0 or above), this upgrade job will not be relevant.
+
+8. (Legacy) **NoCodeDataMigrationCleanup**: Cleanses graph index, search index, and key-value store of legacy DataHub data (metadata_aspect table) once
+   the No Code Data Migration has completed successfully. No arguments.
+
+   If you are using newer versions of DataHub (v1.0.0 or above), this upgrade job will not be relevant.
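To make the `-u` / `-a` syntax concrete, here is a minimal sketch of invoking one of the jobs above via Docker. The env file name and the backup mount path are placeholders you would substitute for your own environment:

```shell
# Sketch: run the RestoreBackup job against a parquet backup mounted into the container.
# ./datahub-upgrade.env and /backups/metadata_aspect_v2.parquet are placeholder paths, not shipped defaults.
docker pull acryldata/datahub-upgrade:head && \
  docker run --env-file ./datahub-upgrade.env \
    -v /backups:/backups \
    acryldata/datahub-upgrade:head \
    -u RestoreBackup \
    -a BACKUP_READER=LOCAL_PARQUET \
    -a BACKUP_FILE_PATH=/backups/metadata_aspect_v2.parquet
```

The same pattern (`-u <job-name>` plus zero or more `-a <arg>=<value>` flags) applies to every job listed above.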
## Environment Variables @@ -80,13 +115,17 @@ DATAHUB_MAE_CONSUMER_PORT=9091 # ELASTICSEARCH_SSL_KEYSTORE_PASSWORD= ```
+These variables tell the upgrade job how to connect to critical storage systems like Kafka, MySQL/Postgres, and Elasticsearch or OpenSearch.
+
2. Pull (or build) & execute the `datahub-upgrade` container:

```aidl
-docker pull acryldata/datahub-upgrade:head && docker run --env-file *path-to-custom-env-file.env* acryldata/datahub-upgrade:head -u NoCodeDataMigration
+docker pull acryldata/datahub-upgrade:head && docker run --env-file *path-to-custom-env-file.env* acryldata/datahub-upgrade:head -u <upgrade-job-name> -a <arg-name>=<arg-value>
```

-## Arguments
+## Command-Line Arguments
+
+### Selecting the Upgrade to Run

The primary argument required by the datahub-upgrade container is the name of the upgrade to perform. This argument can be specified using the `-u` flag when running the `datahub-upgrade` container. @@ -103,6 +142,8 @@ OR

```
docker pull acryldata/datahub-upgrade:head && docker run --env-file env/docker.env acryldata/datahub-upgrade:head -u NoCodeDataMigration
```

+### Providing Arguments for a Given Upgrade Job
+
In addition to the required `-u` argument, each upgrade may require specific arguments. You can provide arguments to individual upgrades using multiple `-a` arguments.

diff --git a/docs/advanced/monitoring.md b/docs/advanced/monitoring.md index 276a69c5c1..739c761f60 100644 --- a/docs/advanced/monitoring.md +++ b/docs/advanced/monitoring.md @@ -9,8 +9,7 @@ Traces let us track the life of a request across multiple components. Each trace are units of work, containing various context about the work being done as well as time taken to finish the work. By looking at the trace, we can more easily identify performance bottlenecks.

-We enable tracing by using
-the [OpenTelemetry java instrumentation library](https://github.com/open-telemetry/opentelemetry-java-instrumentation).
+We enable tracing by using the [OpenTelemetry java instrumentation library](https://github.com/open-telemetry/opentelemetry-java-instrumentation).
This project provides a Java agent JAR that is attached to java applications. The agent injects bytecode to capture telemetry from popular libraries.

diff --git a/docs/how/backup-datahub.md b/docs/how/backup-datahub.md index 3ae0c4e3ae..b0e5f5bb3e 100644 --- a/docs/how/backup-datahub.md +++ b/docs/how/backup-datahub.md @@ -1,11 +1,212 @@
-# Taking backup of DataHub
+# DataHub Backup & Restore

-## Production
+DataHub stores metadata in two key storage systems that require separate backup approaches:

-The recommended backup strategy is to periodically dump the database `datahub.metadata_aspect_v2` so it can be recreated from the dump which most managed DB services will support (e.g. AWS RDS). Then run [restore indices](./restore-indices.md) to recreate the indices.
+1. **Versioned Aspects**: Stored in a relational database (MySQL/PostgreSQL) in the `metadata_aspect_v2` table
+2. **Time Series Aspects, Search Indexes, & Graph Relationships**: Stored in Elasticsearch/OpenSearch indexes

-In order to back up Time Series Aspects (which power usage and dataset profiles), you'd have to do a backup of Elasticsearch, which is possible via AWS OpenSearch. Otherwise, you'd have to reingest dataset profiles from your sources in the event of a disaster scenario!
+This guide outlines how to properly back up both components to ensure complete recoverability of your DataHub instance.
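Before relying on any backup, it helps to record a baseline of what the primary store contains so you can compare it against the restored copy later. A minimal sketch, assuming MySQL and the default `datahub` database name (substitute your own host and user):

```shell
# Count latest-version aspect rows and distinct entities in the primary store.
# Re-run the same query after a restore and compare against this baseline.
mysql -h <host> -u <username> -p -D datahub -e \
  "SELECT COUNT(*) AS latest_aspect_rows, COUNT(DISTINCT urn) AS distinct_entities \
   FROM metadata_aspect_v2 WHERE version = 0;"
```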
-## Quickstart
+## Production Environment Backups

-To take a backup of your quickstart, take a look at this [document](../quickstart.md#backing-up-your-datahub-quickstart-experimental) on how to accomplish it.
+### Backing Up Document Store (Versioned Metadata)
+
+The recommended backup strategy is to periodically dump the `metadata_aspect_v2` table from the `datahub` database. This table contains all versioned aspects and can be restored in case of database failure. Most managed database services (e.g., AWS RDS) provide automated backup capabilities.
+
+#### AWS Managed RDS
+
+**Option 1: Automated RDS Snapshots**
+
+1. Go to **AWS Console > RDS > Databases**
+2. Select your DataHub RDS instance
+3. Click **Actions > Take Snapshot**
+4. Name the snapshot (e.g., `datahub-backup-YYYY-MM-DD`)
+5. Configure automated snapshots in RDS with appropriate retention periods (recommended: 14-30 days)
+
+**Option 2: SQL Dump (MySQL)**
+
+For a targeted backup of only the essential metadata:
+
+`mysqldump -h <host> -u <username> -p datahub metadata_aspect_v2 > metadata_aspect_v2_backup.sql`
+
+To compress the backup:
+
+`mysqldump -h <host> -u <username> -p datahub metadata_aspect_v2 | gzip > metadata_aspect_v2_backup.sql.gz`
+
+#### Self-Hosted MySQL
+
+`mysqldump -u <username> -p datahub metadata_aspect_v2 > metadata_aspect_v2_backup.sql`
+
+Compressed version:
+
+`mysqldump -u <username> -p datahub metadata_aspect_v2 | gzip > metadata_aspect_v2_backup.sql.gz`
+
+### Backing Up Time Series Aspects (Elasticsearch/OpenSearch)
+
+Time Series Aspects power important features like usage statistics, dataset profiles, and assertion runs. These are stored in Elasticsearch/OpenSearch and require a separate backup strategy.
+
+#### AWS OpenSearch Service
+
+1. **Create an IAM Role for Snapshots**
+
+   Create an IAM role with permissions to write to an S3 bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": ["s3:ListBucket"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::your-backup-bucket"]
    },
    {
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::your-backup-bucket/*"]
    }
  ]
}
```

Ensure the trust relationship allows OpenSearch to assume this role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "es.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

2. **Register a Snapshot Repository**

```
 PUT _snapshot/datahub_s3_backup
 {
   "type": "s3",
   "settings": {
     "bucket": "your-backup-bucket",
     "region": "us-east-1",
     "role_arn": "arn:aws:iam::<account-id>:role/<snapshot-role-name>"
   }
 }
```

> ⚠️ **Important**: The S3 bucket must be in the same AWS region as your OpenSearch domain.

3. **Create a Regular Snapshot Schedule**

   Set up an automated schedule using OpenSearch Snapshot Management:

```
 PUT _plugins/_sm/policies/datahub_backup_policy
 {
   "schedule": {
     "cron": {
       "expression": "0 0 * * *",
       "timezone": "UTC"
     }
   },
   "name": "<snapshot-name-prefix>",
   "repository": "datahub_s3_backup",
   "config": {
     "partial": false
   },
   "retention": {
     "expire_after": "15d",
     "min_count": 5,
     "max_count": 30
   }
 }
```

This configures daily snapshots with a 15-day retention period.

4. **Take a Manual Snapshot** (if needed)

   `PUT _snapshot/datahub_s3_backup/snapshot_YYYY_MM_DD?wait_for_completion=true`

5. **Verify Snapshot Status**

   `GET _snapshot/datahub_s3_backup/snapshot_YYYY_MM_DD`

#### Self-Hosted Elasticsearch
1. **Create a Local Repository**

   First, add the `path.repo` setting to `elasticsearch.yml` on all nodes:

   path.repo: ["/mnt/es-backups"]

   Ensure `/mnt/es-backups` is a shared or mounted path on all Elasticsearch nodes.

2. **Register the Repository**

```
 PUT _snapshot/datahub_fs_backup
 {
   "type": "fs",
   "settings": {
     "location": "/mnt/es-backups",
     "compress": true
   }
 }
```

3. **Create a Snapshot**

   `PUT _snapshot/datahub_fs_backup/snapshot_YYYY_MM_DD?wait_for_completion=true`

4. **Check Snapshot Status**

   `GET _snapshot/datahub_fs_backup/snapshot_YYYY_MM_DD`

## Restoring DataHub from Backups

### Restoring the MySQL Database

1. **Restore from an RDS Snapshot** (if using AWS RDS)

   In the AWS Console, go to **RDS > Snapshots**, select your snapshot, and choose "Restore Snapshot".

2. **Restore from SQL Dump**

   `mysql -h <host> -u <username> -p datahub < metadata_aspect_v2_backup.sql`

### Restoring Elasticsearch/OpenSearch Indices

After restoring the database, you need to restore the search and graph indices using your snapshots.

Note that you can also rebuild the index from scratch after restoring the MySQL / Postgres Document Store,
as outlined [here](./restore-indices.md).

#### Restoring from Snapshots

To restore search indexes from a snapshot:

```
POST _snapshot/datahub_s3_backup/snapshot_YYYY_MM_DD/_restore
{
  "indices": "datastream*,metadataindex*",
  "include_global_state": false
}
```

## Testing Your Backup Strategy

Regularly test your backup and restore procedures to ensure they work when needed:

1. Create a test environment
2. Restore your production backups to this environment
3. Verify that all functionality works correctly
4. Document any issues encountered and update your backup/restore procedures

A good practice is to test restore procedures quarterly or after significant infrastructure changes.
diff --git a/docs/how/restore-indices.md b/docs/how/restore-indices.md index cdf1e25fc7..955c7c9841 100644 --- a/docs/how/restore-indices.md +++ b/docs/how/restore-indices.md @@ -1,122 +1,135 @@
-# Restoring Search and Graph Indices from Local Database
+# Search and Graph Reindexing

-If search infrastructure (Elasticsearch/Opensearch) or graph services (Elasticsearch/Opensearch/Neo4j) become inconsistent,
-you can restore them from the aspects stored in the local database.
+If your search infrastructure (Elasticsearch/OpenSearch) or graph services (Elasticsearch/OpenSearch/Neo4j) become inconsistent or out-of-sync with your primary metadata store, you can **rebuild them from the source of truth**: the `metadata_aspect_v2` table in your relational database (MySQL/Postgres).

-When a new version of the aspect gets ingested, GMS initiates an MCL event for the aspect which is consumed to update
-the search and graph indices. As such, we can fetch the latest version of each aspect in the local database and produce
-MCL events corresponding to the aspects to restore the search and graph indices.
+This process works by fetching the latest version of each aspect from the database and replaying them as Metadata Change Log (MCL) events. These events will regenerate your search and graph indexes, effectively restoring a consistent view.

-By default, restoring the indices from the local database will not remove any existing documents in
-the search and graph indices that no longer exist in the local database, potentially leading to inconsistencies
-between the search and graph indices and the local database.
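One quick way to tell whether the indexes have drifted from the database is to compare entity counts on both sides. The sketch below assumes MySQL, an Elasticsearch endpoint on `localhost:9200`, and a dataset search index named `datasetindex_v2` (verify the index name against your own cluster):

```shell
# Latest-version dataset entities recorded in the primary store.
mysql -h <host> -u <username> -p -D datahub -e \
  "SELECT COUNT(DISTINCT urn) AS datasets_in_sql FROM metadata_aspect_v2 \
   WHERE version = 0 AND urn LIKE 'urn:li:dataset%';"

# Dataset documents currently present in the search index.
curl -s "http://localhost:9200/datasetindex_v2/_count"
```

A large gap between the two counts is a signal that a reindex (or a `clean` reindex) is warranted.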
+> ⚠️ **Note**: By default, this process does **not remove** stale documents from the index that no longer exist in the database. To ensure full consistency, we recommend reindexing into a clean instance, or using the `-a clean` option to wipe existing index contents before replay.

-## Configuration

+---

-The upgrade jobs take arguments as command line args to the job itself rather than environment variables for job specific
-configuration. The RestoreIndices job is specified through the `-u RestoreIndices` upgrade ID parameter and then additional
-parameters are specified like `-a batchSize=1000`.

+## How it Works

-The following configurations are available:

+Reindexing is powered by the `datahub-upgrade` utility (packaged as the `datahub-upgrade` container in Docker/Kubernetes). It supports a special upgrade task called `RestoreIndices`, which replays aspects from the database back into search and graph stores.

-### Time-Based Filtering

+You can run this utility in three main environments:

-- `lePitEpochMs`: Restore records created before this timestamp (in milliseconds)
-- `gePitEpochMs`: Restore records created after this timestamp (in milliseconds)

+- Quickstart (via CLI)
+- Docker Compose (via shell script)
+- Kubernetes (via Helm + CronJob)

-### Pagination and Performance Options

+---

-- `urnBasedPagination`: Enable key-based pagination instead of offset-based pagination. Recommended for large datasets as it's typically more efficient.
-- `startingOffset`: When using default pagination, start from this offset
-- `lastUrn`: Resume from a specific URN when using URN-based pagination
-- `lastAspect`: Used with lastUrn to resume from a specific aspect, preventing reprocessing
-- `numThreads`: Number of concurrent threads for processing restoration, only used with default offset based paging
-- `batchSize`: Configures the size of each batch as the job pages through rows
-- `batchDelayMs`: Adds a delay in between each batch to avoid overloading backend systems

+## Reindexing Configuration Options

-### Content Filtering

+When running the `RestoreIndices` job, you can pass additional arguments to customize the behavior:

-- `aspectNames`: Comma-separated list of aspects to restore (e.g., "ownership,status")
-- `urnLike`: SQL LIKE pattern to filter URNs (e.g., "urn:li:dataset%")

+### 🔄 Pagination & Performance

-### Default Aspects

+| Argument             | Description                                                                  |
+| -------------------- | ---------------------------------------------------------------------------- |
+| `urnBasedPagination` | Use URN-based pagination instead of offset. Recommended for large datasets.  |
+| `startingOffset`     | Starting offset for offset-based pagination.                                 |
+| `lastUrn`            | Resume from this URN (used with URN pagination).                             |
+| `lastAspect`         | Resume from this aspect name (used with `lastUrn`).                          |
+| `numThreads`         | Number of concurrent threads for reindexing.                                 |
+| `batchSize`          | Number of records per batch.                                                 |
+| `batchDelayMs`       | Delay in milliseconds between each batch (throttling).                       |

-- `createDefaultAspects`: Create default aspects in both SQL and ES if missing.

+### 📅 Time Filtering

-During the restore indices process, it will create default aspects in SQL. While this may be
-desired in some situations, disabling this feature is required when using a read-only SQL replica.

+| Argument       | Description                                                      |
+| -------------- | ---------------------------------------------------------------- |
+| `gePitEpochMs` | Only restore aspects created **after** this timestamp (in ms).   |
+| `lePitEpochMs` | Only restore aspects created **before** this timestamp (in ms).  |

-### Nuclear option

### 🔍 Content Filtering

-- `clean`: This option wipes out the current indices by running deletes of all the documents to guarantee a consistent state with SQL. This is generally not recommended unless there is significant data corruption on the instance.

+| Argument      | Description                                                             |
+| ------------- | ----------------------------------------------------------------------- |
+| `aspectNames` | Comma-separated list of aspects to restore (e.g., `ownership,status`).  |
+| `urnLike`     | SQL LIKE pattern to filter URNs (e.g., `urn:li:dataset%`).              |

-### Helm

### 🧱 Other Options

-These are available in the helm charts as configurations for Kubernetes deployments under the `datahubUpgrade.restoreIndices.args` path which will set them up as args to the pod command.

+| Argument               | Description                                                                                                   |
+| ---------------------- | -------------------------------------------------------------------------------------------------------------- |
+| `createDefaultAspects` | Whether to create default aspects in SQL & index if missing. **Disable** this if using a read-only replica.   |
+| `clean`                | **Deletes existing index documents before restoring.** Use with caution.                                       |

-## Execution Methods

+---

-### Quickstart

+## Running the Restore Job

-If you're using the quickstart images, you can use the `datahub` cli to restore the indices.

### 🧪 Quickstart CLI

-```shell
+If you're using DataHub's quickstart image, you can restore indices using a single CLI command:
+
+```bash
 datahub docker quickstart --restore-indices
 ```

 :::info

-Using the `datahub` CLI to restore the indices when using the quickstart images will also clear the search and graph indices before restoring.
+This command automatically clears the search and graph indices before restoring them.
 :::

-See [this section](../quickstart.md#restore-datahub) for more information.
+More details in the [Quickstart Docs](../quickstart.md#restore-datahub).

-### Docker-compose
+---

-If you are on a custom docker-compose deployment, run the following command (you need to checkout [the source repository](https://github.com/datahub-project/datahub)) from the root of the repo to send MAE for each aspect in the local database.

### 🐳 Docker Compose

-```shell
+If you're using Docker Compose and have cloned the [DataHub source repo](https://github.com/datahub-project/datahub), run:
+
+```bash
 ./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices
 ```

-:::info
-By default this command will not clear the search and graph indices before restoring, thous potentially leading to inconsistencies between the local database and the indices, in case aspects were previously deleted in the local database but were not removed from the correponding index.
-:::
+To clear existing index contents before restore (recommended if you suspect inconsistencies), add `-a clean`:

-If you need to clear the search and graph indices before restoring, add `-a clean` to the end of the command. Please take note that the search and graph services might not be fully functional during reindexing when the indices are cleared.
-
-```shell
+```bash
 ./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices -a clean
 ```

-Refer to this [doc](../../docker/datahub-upgrade/README.md#environment-variables) on how to set environment variables
-for your environment.
+:::info
+Without the `-a clean` flag, old documents may remain in your search/graph index, even if they no longer exist in your SQL database.
+::: -### Kubernetes +Refer to the [Upgrade Script Docs](../../docker/datahub-upgrade/README.md#environment-variables) for more info on environment configuration. -Run `kubectl get cronjobs` to see if the restoration job template has been deployed. If you see results like below, you -are good to go. +--- -``` -NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE -datahub-datahub-cleanup-job-template * * * * * True 0 2d3h -datahub-datahub-restore-indices-job-template * * * * * True 0 2d3h +### ☸️ Kubernetes (Helm) + +1. **Check if the Job Template Exists** + +Run: + +```bash +kubectl get cronjobs ``` -If not, deploy latest helm charts to use this functionality. +You should see a result like: -Once restore indices job template has been deployed, run the following command to start a job that restores indices. +```bash +datahub-datahub-restore-indices-job-template +``` -```shell +If not, make sure you're using the latest Helm chart version that includes the restore job. + +2. **Trigger the Restore Job** + +Run: + +```bash kubectl create job --from=cronjob/datahub-datahub-restore-indices-job-template datahub-restore-indices-adhoc ``` -Once the job completes, your indices will have been restored. +This will create and run a one-off job to restore indices from your SQL database. -:::info -By default the restore indices job template will not clear the search and graph indices before restoring, thous potentially leading to inconsistencies between the local database and the indices, in case aspects were previously deleted in the local database but were not removed from the correponding index. -::: +3. **To Enable Clean Reindexing** -If you need to clear the search and graph indices before restoring, modify the `values.yaml` for your deployment and overwrite the default arguments of the restore indices job template to include the `-a clean` argument. Please take note that the search and graph services might not be fully functional during reindexing when the indices are cleared. +Edit your `values.yaml` to include the `-a clean` argument: ```yaml datahubUpgrade: @@ -126,13 +139,17 @@ datahubUpgrade: - "-u" - "RestoreIndices" - "-a" - - "batchSize=1000" # default value of datahubUpgrade.batchSize + - "batchSize=1000" - "-a" - - "batchDelayMs=100" # default value of datahubUpgrade.batchDelayMs + - "batchDelayMs=100" - "-a" - "clean" ``` +:::info +The default job does **not** delete existing documents before restoring. Add `-a clean` to ensure full sync. +::: + ### Through APIs See also the [Best Practices](#best-practices) section below, however note that the APIs are able to handle a few thousand @@ -166,6 +183,13 @@ In general, this process is not required to run unless there has been a disrupti such as Elasticsearch/Opensearch cluster failures, data corruption events, or significant version upgrade inconsistencies that have caused the search and graph indices to become out of sync with the local database. +Some pointers to keep in mind when running this process: + +- Always test reindexing in a **staging environment** first. +- Consider taking a backup of your Elasticsearch/OpenSearch index before a `clean` restore. +- For very large deployments, use `urnBasedPagination` and limit `batchSize` to avoid overloading your backend. +- Monitor Elasticsearch/OpenSearch logs during the restore for throttling or memory issues. + ### K8 Job vs. API #### When to Use Kubernetes Jobs