# Restoring Search and Graph Indices from Local Database

If search infrastructure (Elasticsearch/Opensearch) or graph services (Elasticsearch/Opensearch/Neo4j) become inconsistent,
you can restore them from the aspects stored in the local database.

When a new version of the aspect gets ingested, GMS initiates an MCL event for the aspect which is consumed to update
the search and graph indices. As such, we can fetch the latest version of each aspect in the local database and produce
MCL events corresponding to the aspects to restore the search and graph indices.

By default, restoring the indices from the local database will not remove any existing documents in
the search and graph indices that no longer exist in the local database, potentially leading to inconsistencies
between the search and graph indices and the local database.

## Configuration

The upgrade jobs take arguments as command line args to the job itself rather than environment variables for job specific
configuration. The RestoreIndices job is specified through the `-u RestoreIndices` upgrade ID parameter and then additional
parameters are specified like `-a batchSize=1000`.

The following configurations are available:
### Time-Based Filtering

These are available in the helm charts as configurations for Kubernetes deployments under the `datahubUpgrade.restoreIndices.args` path which will set them up as args to the pod command.

## Execution Methods

### Quickstart
If you're using the quickstart images, you can use the `datahub` cli to restore the indices.
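
For example, a minimal sketch, assuming the `--restore-indices` flag described in the quickstart guide linked below; verify the flag against your CLI version:

```shell
# Restore the search and graph indices from the local database on a quickstart deployment.
datahub docker quickstart --restore-indices
```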
See [this section](../quickstart.md#restore-datahub) for more information.
### Docker-compose

If you are on a custom docker-compose deployment, run the following command (you need to check out [the source repository](https://github.com/datahub-project/datahub)) from the root of the repo to send an MCL event for each aspect in the local database.
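
A minimal sketch of this command, assuming the `datahub-upgrade.sh` wrapper described in the datahub-upgrade README linked below; verify the script path and arguments against your checkout:

```shell
# From the root of the repository: run the RestoreIndices upgrade job.
# batchSize and batchDelayMs are optional tuning arguments described in the configuration tables above.
./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices -a batchSize=1000 -a batchDelayMs=100
```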

If you need to clear the search and graph indices before restoring, add `-a clean` to the command.

Refer to this [doc](../../docker/datahub-upgrade/README.md#environment-variables) on how to set environment variables
for your environment.
### Kubernetes
Run `kubectl get cronjobs` to see if the restoration job template has been deployed. If you see results like below, you
are good to go.

The job's arguments are set under the `datahubUpgrade.restoreIndices.args` path in the helm values; adding `"clean"` to that list clears the existing search and graph indices before restoring.

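A minimal sketch of the corresponding helm values, assuming the `datahubUpgrade.restoreIndices.args` layout mentioned in the Configuration section; only the `"clean"` entry is taken from this document and the other arguments are illustrative:

```yaml
datahubUpgrade:
  restoreIndices:
    args:
      - "-u"
      - "RestoreIndices"
      - "-a"
      - "batchSize=1000" # illustrative tuning argument
      - "-a"
      - "clean" # wipe the existing documents before restoring
```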
### Through APIs
See [Restore Indices API](../api/restli/restore-indices.md).

See also the [Best Practices](#best-practices) section below. Note that the APIs are only suited to restoring up to a few
thousand aspects: in this mode one of the GMS instances performs the required actions and is subject to timeouts. Use one
of the approaches above for longer-running restoreIndices operations.

#### OpenAPI

There are two primary APIs, one which exposes the common parameters for restoreIndices and another one designed
to accept a list of URNs where all aspects are to be restored.

Full configuration:

<p align="center">
  <img width="80%" src="https://github.com/datahub-project/static-assets/blob/main/imgs/how/restore-indices/openapi-restore-indices.png?raw=true"/>
</p>

All Aspects:

<p align="center">
  <img width="80%" src="https://github.com/datahub-project/static-assets/blob/main/imgs/how/restore-indices/openapi-restore-indices-urns.png?raw=true"/>
</p>
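
As a rough illustration, a `curl` sketch against the OpenAPI endpoint; the path, parameter names, and token below are placeholders, take the real ones from the Swagger UI shown above:

```shell
# Illustrative only: restore ownership aspects for dataset entities.
# Replace the path, parameters, and token with the values shown in your OpenAPI UI.
curl -X POST "http://localhost:8080/openapi/operations/elasticSearch/indices/restore?aspectNames=ownership&urnLike=urn%3Ali%3Adataset%3A%25&batchSize=500" \
  -H "Authorization: Bearer <access-token>"
```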

#### Rest.li

For Rest.li, see [Restore Indices API](../api/restli/restore-indices.md).

## Best Practices

In general, this process does not need to be run unless there has been a disruption of storage services or infrastructure,
such as Elasticsearch/Opensearch cluster failures, data corruption events, or significant version upgrade inconsistencies
that have caused the search and graph indices to become out of sync with the local database.

### Kubernetes Job vs. API

#### When to Use Kubernetes Jobs

For operations affecting 2,000 or more aspects, it's strongly recommended to use the Kubernetes job approach. This job is
designed for long-running processes and provides several advantages:

* Won't time out like API calls
* Can be monitored through Kubernetes logging
* Won't consume resources from your primary GMS instances
* Can be scheduled during off-peak hours to minimize system impact

#### When to Use APIs

The RestoreIndices APIs (available through both Rest.li and OpenAPI) are best suited for:

* Targeted restores affecting fewer than 2,000 aspects
* Emergencies where you need to quickly restore critical metadata
* Testing or validating the restore process before running a full-scale job
* Scenarios where you don't have direct access to the Kubernetes cluster

Remember that API-based restoration runs within one of your GMS instances and is subject to timeouts, which could lead to
incomplete restorations for larger installations.

### Targeted Restoration Strategies

Being selective about what you restore is crucial for efficiency. Combining these filtering strategies can dramatically
reduce the restoration scope, saving resources and time.

#### Entity Type Filtering

Use the `urnLike` parameter to target specific entity types, as shown in the sketch after this list:

* For datasets: `urnLike=urn:li:dataset:%`
* For users: `urnLike=urn:li:corpuser:%`
* For dashboards: `urnLike=urn:li:dashboard:%`
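
For example, a sketch using the job arguments described above; the wrapper script path is illustrative, pass the same `-a` arguments however you run the job:

```shell
# Restore only dataset entities by filtering on the URN prefix.
# The % wildcard follows SQL LIKE syntax; quote it so the shell does not expand it.
./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices -a urnLike='urn:li:dataset:%'
```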

#### Single Entity and Aspect-Based Filtering

Single Entity Restoration: When only one entity is affected, provide the specific URN to minimize processing overhead.

Aspect-Based Filtering: Use `aspectNames` to target only the specific aspects that need restoration (see the sketch after this list):

* For ownership inconsistencies: `aspectNames=ownership`
* For tag issues: `aspectNames=globalTags`
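
A sketch combining a single URN with aspect filtering; the `urn` argument name and the example URN are assumptions, check the parameter names against the configuration tables or the OpenAPI UI:

```shell
# Restore only the ownership aspect of a single, illustrative dataset URN.
./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices \
  -a 'urn=urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)' \
  -a aspectNames=ownership
```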

#### Time-Based Recovery

If you know when the inconsistency began, use time-based filtering, as shown in the sketch after this list:

* `gePitEpochMs={timestamp}` to process only records created after the incident
* `lePitEpochMs={timestamp}` to limit processing to records before a certain time
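
A sketch of time-based filtering; the timestamp is illustrative and the `date` invocation assumes GNU coreutils:

```shell
# Restore only aspects written since an incident that began at 2024-01-15 00:00 UTC.
# gePitEpochMs expects milliseconds since the epoch.
START_MS=$(date -u -d '2024-01-15T00:00:00Z' +%s)000
./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices -a gePitEpochMs=${START_MS}
```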

### Parallel Processing Strategies

To optimize restoration speed while managing system load:

#### Multiple Parallel Jobs

Run several restoreIndices processes simultaneously (see the sketch below) by:

* Working on non-overlapping sets of aspects or entities
* Dividing work by entity type (one job for datasets, another for users, etc.)
* Splitting aspects among different jobs (one for ownership aspects, another for lineage, etc.)
* Partitioning large entity types by prefix or time range
* Staggering start times to prevent initial resource contention

Monitor system metrics closely during concurrent restoration to ensure you're not overloading your infrastructure.

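A sketch of two non-overlapping jobs split by entity type; run each as a separate Kubernetes job or shell invocation if that fits your environment better than backgrounded processes:

```shell
# Job 1: restore dataset entities only.
./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices -a urnLike='urn:li:dataset:%' -a batchSize=1000 &

# Job 2: restore user entities only; no overlap with job 1 and no clean argument.
./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices -a urnLike='urn:li:corpuser:%' -a batchSize=1000 &

wait
```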

:::caution
Avoid Conflicts: Ensure that concurrent jobs work on non-overlapping sets of aspects or entities, and never specify the `--clean` argument in concurrent jobs.
:::

### Temporary Workload Reduction
* Pause scheduled ingestion jobs during restoration
* Temporarily disable or reduce frequency of the datahub-gc job to prevent conflicting deletes
* Consider pausing automated workflows or integrations that generate metadata events

### Infrastructure Tuning

Implementing these expanded best practices should help ensure a smoother, more efficient restoration process while
minimizing impact on your DataHub environment.

This operation can be I/O intensive on the SQL read side and on the Elasticsearch write side. If you're able to leverage
provisioned I/O or throughput, you might want to monitor your infrastructure for possible bottlenecks or throttling.

#### Elasticsearch/Opensearch Optimization

To improve write performance during restoration:

##### Refresh Interval Adjustment:

Temporarily increase the `refresh_interval` setting from the default (typically 1s) to something like 30s or 60s.
Run the system update job with the following environment variable: `ELASTICSEARCH_INDEX_BUILDER_REFRESH_INTERVAL_SECONDS=60`

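A sketch for docker-compose deployments; the service name and override mechanism are illustrative, set the variable however the system update job is configured in your environment:

```yaml
# docker-compose override, illustrative service name for the system update job.
services:
  datahub-upgrade:
    environment:
      - ELASTICSEARCH_INDEX_BUILDER_REFRESH_INTERVAL_SECONDS=60
```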

:::caution
Remember to reset this after restoration completes!
:::

##### Bulk Processing Improvements:

* Adjust the Elasticsearch batching parameters to optimize bulk request size (try values between 2,000 and 5,000)
* Run your GMS or `mae-consumer` with environment variables (see the sketch below):
  * `ES_BULK_REQUESTS_LIMIT=3000`
  * `ES_BULK_FLUSH_PERIOD=60`
* Configure `batchDelayMs` on restoreIndices to add breathing room between batches if your cluster is struggling
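
A docker-compose style sketch; the service name is illustrative, set the same variables however your GMS or `mae-consumer` is deployed:

```yaml
# docker-compose override, illustrative service name.
services:
  datahub-gms:
    environment:
      - ES_BULK_REQUESTS_LIMIT=3000
      - ES_BULK_FLUSH_PERIOD=60
```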

##### Shard Management:

* Ensure your indices have an appropriate number of shards for your cluster size.
* Consider temporarily adding nodes to your search cluster during massive restorations.

#### SQL/Primary Storage

Consider using a read replica as the source of the job's data.

#### Kafka & Consumers

##### Partition Strategy:

* Verify that the Kafka Metadata Change Log (MCL) topic has enough partitions to allow for parallel processing (see the sketch below).
* Recommended: At least 10-20 partitions for the MCL topic in production environments.
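
A sketch using the standard Kafka CLI; `MetadataChangeLog_Versioned_v1` is the default MCL topic name and may differ if you have overridden topic names:

```shell
# Check the partition count of the MCL topic (default name shown; adjust if overridden).
kafka-topics.sh --bootstrap-server <broker:9092> --describe --topic MetadataChangeLog_Versioned_v1
```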

##### Consumer Scaling:

* Temporarily increase the number of `mae-consumer` pods to process the higher event volume (see the sketch below).
* Scale GMS instances if they're handling consumer duties.
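
A sketch for Kubernetes deployments; the deployment name depends on your helm release and is illustrative:

```shell
# Scale the standalone MCL consumer up for the duration of the restore (illustrative deployment name).
kubectl scale deployment datahub-datahub-mae-consumer --replicas=4
```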

##### Monitoring:

* Watch consumer lag metrics closely during restoration.
* Be prepared to adjust scaling or batch parameters if consumers fall behind.