# DataHub Backup & Restore

DataHub stores metadata in two key storage systems that require separate backup approaches:

1. **Versioned Aspects**: Stored in a relational database (MySQL/PostgreSQL) in the `metadata_aspect_v2` table
2. **Time Series Aspects, Search Indexes, & Graph Relationships**: Stored in Elasticsearch/OpenSearch indexes

This guide outlines how to properly back up both components to ensure complete recoverability of your DataHub instance.

## Production Environment Backups

### Backing Up Document Store (Versioned Metadata)

The recommended backup strategy is to periodically dump the `metadata_aspect_v2` table from the `datahub` database. This table contains all versioned aspects and can be restored in case of database failure. Most managed database services (e.g., AWS RDS) provide automated backup capabilities.

#### AWS Managed RDS

**Option 1: Automated RDS Snapshots**

1. Go to **AWS Console > RDS > Databases**
2. Select your DataHub RDS instance
3. Click **Actions > Take Snapshot**
4. Name the snapshot (e.g., `datahub-backup-YYYY-MM-DD`)
5. Configure automated snapshots in RDS with appropriate retention periods (recommended: 14-30 days)

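
The console steps above can also be scripted. The following is a minimal sketch using the AWS CLI, assuming the CLI is installed and configured with credentials that may manage the instance. The instance identifier `datahub-db` and the function names are hypothetical placeholders, not names used by DataHub.

```shell
#!/bin/sh
# Sketch only: DB_INSTANCE is a hypothetical identifier; replace it with your own.
DB_INSTANCE="${DB_INSTANCE:-datahub-db}"

# Build a snapshot identifier such as datahub-backup-2025-06-27.
# An explicit date can be passed as $1; otherwise today's UTC date is used.
snapshot_id() {
  printf 'datahub-backup-%s\n' "${1:-$(date -u +%Y-%m-%d)}"
}

# Take a manual snapshot of the instance.
take_snapshot() {
  aws rds create-db-snapshot \
    --db-instance-identifier "$DB_INSTANCE" \
    --db-snapshot-identifier "$(snapshot_id)"
}

# Keep automated backups for the given number of days (default 14).
set_retention() {
  aws rds modify-db-instance \
    --db-instance-identifier "$DB_INSTANCE" \
    --backup-retention-period "${1:-14}" \
    --apply-immediately
}
```

For example, `set_retention 30` could be run once, while `take_snapshot` could be called before risky changes such as upgrades.
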
**Option 2: SQL Dump (MySQL)**

For a targeted backup of only the essential metadata:

`mysqldump -h <rds-endpoint> -u <username> -p datahub metadata_aspect_v2 > metadata_aspect_v2_backup.sql`

To compress the backup:

`mysqldump -h <rds-endpoint> -u <username> -p datahub metadata_aspect_v2 | gzip > metadata_aspect_v2_backup.sql.gz`

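
This guide lists PostgreSQL as a supported database but shows only MySQL commands. A rough PostgreSQL equivalent using `pg_dump` might look like the sketch below. It assumes the database is also named `datahub`; the function name and the default output file name are illustrative only. `pg_dump` prompts for a password unless one is supplied through `PGPASSWORD` or `~/.pgpass`.

```shell
#!/bin/sh
# Sketch only: host and user are placeholders, and the database name `datahub`
# is assumed to match the MySQL examples in this guide.

# Dump only the metadata_aspect_v2 table, compressed with gzip.
# Usage: pg_backup <host> <username> [output-file]
pg_backup() {
  out="${3:-metadata_aspect_v2_backup.sql.gz}"
  pg_dump -h "$1" -U "$2" -d datahub -t metadata_aspect_v2 | gzip > "$out"
}
```
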
#### Self-Hosted MySQL

`mysqldump -u <username> -p datahub metadata_aspect_v2 > metadata_aspect_v2_backup.sql`

Compressed version:

`mysqldump -u <username> -p datahub metadata_aspect_v2 | gzip > metadata_aspect_v2_backup.sql.gz`

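
Because the dump should run periodically, it helps to wrap the command in a small script that writes dated files, checks each archive, and prunes old ones. The sketch below is one way to do that. The backup directory, the 14-day retention, and the function names are assumptions, as is the use of a credentials file (for example `~/.my.cnf`) in place of the interactive `-p` prompt. `--single-transaction` assumes the table uses InnoDB.

```shell
#!/bin/sh
# Sketch only: BACKUP_DIR and the 14-day retention are assumptions.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/datahub}"

# Build a dated file name such as metadata_aspect_v2_2025-06-27.sql.gz.
backup_file() {
  printf '%s/metadata_aspect_v2_%s.sql.gz\n' "$BACKUP_DIR" "${1:-$(date -u +%Y-%m-%d)}"
}

# Dump, compress, verify the archive, then prune old dumps.
# Usage: run_backup <username>   (password is read from a credentials file)
run_backup() {
  out="$(backup_file)"
  mkdir -p "$BACKUP_DIR" || return 1
  # --single-transaction reads a consistent InnoDB snapshot without locking the table.
  mysqldump --single-transaction -u "$1" datahub metadata_aspect_v2 | gzip > "$out" || return 1
  # Fail if the archive is corrupt.
  gzip -t "$out" || return 1
  # mysqldump ends its output with a "Dump completed" comment; a missing
  # trailer suggests the dump was cut short.
  gzip -dc < "$out" | tail -n 1 | grep -q 'Dump completed' || return 1
  # Remove dumps older than 14 days.
  find "$BACKUP_DIR" -name 'metadata_aspect_v2_*.sql.gz' -mtime +14 -exec rm -f {} +
}
```

The trailer check is there because the exit status of a pipeline reflects only `gzip`, so a failed `mysqldump` would otherwise go unnoticed.
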
### Backing Up Time Series Aspects (Elasticsearch/OpenSearch)

Time Series Aspects power important features like usage statistics, dataset profiles, and assertion runs. These are stored in Elasticsearch/OpenSearch and require a separate backup strategy.

#### AWS OpenSearch Service

1. **Create an IAM Role for Snapshots**

Create an IAM role with permissions to write to an S3 bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": ["s3:ListBucket"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::your-backup-bucket"]
    },
    {
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::your-backup-bucket/*"]
    }
  ]
}
```

Ensure the trust relationship allows OpenSearch to assume this role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "es.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

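
If you prefer the command line to the IAM console, the role can be created with the AWS CLI. This sketch assumes the permissions policy above is saved as `permissions.json` and the trust policy as `trust.json` in the current directory. The file names, the role name, and the policy name are all hypothetical.

```shell
#!/bin/sh
# Sketch only: the role name, policy name, and both file names are hypothetical.
# permissions.json holds the S3 permissions policy; trust.json holds the trust policy.

# Usage: create_snapshot_role [role-name]
create_snapshot_role() {
  role="${1:-datahub-snapshot-role}"
  # Create the role with the trust policy that lets OpenSearch assume it.
  aws iam create-role \
    --role-name "$role" \
    --assume-role-policy-document file://trust.json || return 1
  # Attach the S3 permissions as an inline policy on the role.
  aws iam put-role-policy \
    --role-name "$role" \
    --policy-name datahub-snapshot-s3-access \
    --policy-document file://permissions.json
}
```

The ARN that `create-role` prints is the value to use for `role_arn` when registering the snapshot repository.
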
2. **Register a Snapshot Repository**

```
PUT _snapshot/datahub_s3_backup
{
  "type": "s3",
  "settings": {
    "bucket": "your-backup-bucket",
    "region": "us-east-1",
    "role_arn": "arn:aws:iam::<account-id>:role/<snapshot-role>"
  }
}
```

> ⚠️ **Important**: The S3 bucket must be in the same AWS region as your OpenSearch domain.

3. **Create a Regular Snapshot Schedule**

Set up an automated schedule using OpenSearch Snapshot Management:

```
POST _plugins/_sm/policies/datahub_backup_policy
{
  "creation": {
    "schedule": {
      "cron": {
        "expression": "0 0 * * *",
        "timezone": "UTC"
      }
    }
  },
  "deletion": {
    "condition": {
      "max_age": "15d",
      "min_count": 5,
      "max_count": 30
    }
  },
  "snapshot_config": {
    "repository": "datahub_s3_backup",
    "partial": false
  }
}
```

This configures daily snapshots at midnight UTC with a 15-day retention period, keeping at least 5 and at most 30 snapshots.

4. **Take a Manual Snapshot** (if needed)

`PUT _snapshot/datahub_s3_backup/snapshot_YYYY_MM_DD?wait_for_completion=true`

5. **Verify Snapshot Status**

`GET _snapshot/datahub_s3_backup/snapshot_YYYY_MM_DD`

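
The status check can be automated so that a scheduled job fails loudly when a snapshot did not complete. The sketch below reads the `state` field of the snapshot response, which is `SUCCESS` for a completed snapshot. `ES_URL` is a placeholder, authentication is omitted (on AWS the request may need to be signed or sent with a user name and password), and the `grep`-based parsing is a shortcut that assumes a single snapshot in the response.

```shell
#!/bin/sh
# Sketch only: ES_URL is a placeholder and authentication options are omitted.
ES_URL="${ES_URL:-https://<opensearch-endpoint>}"

# Read a snapshot GET response on stdin and print its "state" value.
snapshot_state() {
  grep -o '"state" *: *"[A-Z_]*"' | head -n 1 | sed 's/.*"\([A-Z_]*\)"$/\1/'
}

# Return non-zero unless the named snapshot finished successfully.
# Usage: check_snapshot snapshot_YYYY_MM_DD
check_snapshot() {
  state="$(curl -s "$ES_URL/_snapshot/datahub_s3_backup/$1" | snapshot_state)"
  echo "snapshot $1: ${state:-UNKNOWN}"
  [ "$state" = "SUCCESS" ]
}
```

Any other state, such as `IN_PROGRESS`, `PARTIAL`, or `FAILED`, makes `check_snapshot` return a non-zero status.
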
#### Self-Hosted Elasticsearch

1. **Create a Local Repository**

First, add the `path.repo` setting to `elasticsearch.yml` on all nodes, then restart each node so the setting takes effect:

`path.repo: ["/mnt/es-backups"]`

Ensure `/mnt/es-backups` is a shared or mounted path on all Elasticsearch nodes.

2. **Register the Repository**

```
PUT _snapshot/datahub_fs_backup
{
  "type": "fs",
  "settings": {
    "location": "/mnt/es-backups",
    "compress": true
  }
}
```

3. **Create a Snapshot**

`PUT _snapshot/datahub_fs_backup/snapshot_YYYY_MM_DD?wait_for_completion=true`

4. **Check Snapshot Status**

`GET _snapshot/datahub_fs_backup/snapshot_YYYY_MM_DD`

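
For a self-hosted cluster, the same requests can be sent with `curl` from a scheduled job. This is a minimal sketch that assumes an unauthenticated node reachable at `http://localhost:9200`; the URL and the function names are placeholders. It also calls the repository verification endpoint, which checks that the nodes can access the repository.

```shell
#!/bin/sh
# Sketch only: ES_URL assumes an unauthenticated node on localhost:9200.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build a snapshot name such as snapshot_2025_06_27.
# An explicit date can be passed as $1; otherwise today's UTC date is used.
snapshot_name() {
  printf 'snapshot_%s\n' "${1:-$(date -u +%Y_%m_%d)}"
}

# Check that the nodes can access the registered repository.
verify_repository() {
  curl -s -X POST "$ES_URL/_snapshot/datahub_fs_backup/_verify"
}

# Create today's snapshot and wait for it to finish.
create_snapshot() {
  curl -s -X PUT \
    "$ES_URL/_snapshot/datahub_fs_backup/$(snapshot_name)?wait_for_completion=true"
}
```
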
## Restoring DataHub from Backups

### Restoring the MySQL Database

1. **Restore from an RDS Snapshot** (if using AWS RDS)

In the AWS Console, go to **RDS > Snapshots**, select your snapshot, and choose "Restore Snapshot". RDS restores a snapshot into a new DB instance, so point DataHub at the new instance's endpoint afterwards.

2. **Restore from SQL Dump**

`mysql -h <host> -u <user> -p datahub < metadata_aspect_v2_backup.sql`

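
The backup commands earlier in this guide can produce either a plain or a gzip-compressed dump, so a restore helper may need to handle both. The sketch below picks the decompression step from the file extension and adds a row count that can be compared with the source database. Host, user, and function names are placeholders.

```shell
#!/bin/sh
# Sketch only: host, user, and file names are placeholders.

# Stream a dump into mysql, decompressing first when the file ends in .gz.
# Usage: restore_dump <host> <user> <dump-file>
restore_dump() {
  case "$3" in
    *.gz) gzip -dc "$3" ;;
    *)    cat "$3" ;;
  esac | mysql -h "$1" -u "$2" -p datahub
}

# Print the restored row count so it can be compared with the source database.
# Usage: count_aspects <host> <user>
count_aspects() {
  mysql -h "$1" -u "$2" -p -N -e 'SELECT COUNT(*) FROM metadata_aspect_v2' datahub
}
```

With default `mysqldump` options the dump contains a `DROP TABLE IF EXISTS` statement, so restoring it replaces any existing `metadata_aspect_v2` table in the target database.
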
### Restoring Elasticsearch/OpenSearch Indices

After restoring the database, restore the search and graph indices from your snapshots.

Alternatively, you can rebuild the search and graph indices from scratch after restoring the MySQL/PostgreSQL document store, as outlined [here](./restore-indices.md). Time series aspects are stored only in Elasticsearch/OpenSearch, so they can be recovered only from a snapshot.

#### Restoring from Snapshots

To restore search indexes from a snapshot:

```
POST _snapshot/datahub_s3_backup/snapshot_YYYY_MM_DD/_restore
{
  "indices": "datastream*,metadataindex*",
  "include_global_state": false
}
```

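
The restore request can also be sent with `curl`, followed by a look at shard recovery progress. This sketch assumes a reachable endpoint in `ES_URL` and omits authentication; the function names are illustrative. Note that a restore fails for any index that already exists and is open in the target cluster, so such indices need to be closed or deleted first.

```shell
#!/bin/sh
# Sketch only: ES_URL is a placeholder; authentication options are omitted.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the restore request body for a comma-separated index pattern list.
restore_body() {
  printf '{"indices": "%s", "include_global_state": false}\n' "$1"
}

# Restore the given snapshot, then list active shard recoveries.
# Usage: restore_snapshot <repository> <snapshot> [indices]
restore_snapshot() {
  curl -s -X POST "$ES_URL/_snapshot/$1/$2/_restore" \
    -H 'Content-Type: application/json' \
    -d "$(restore_body "${3:-datastream*,metadataindex*}")" || return 1
  echo
  curl -s "$ES_URL/_cat/recovery?v&active_only=true"
}
```

For example, `restore_snapshot datahub_s3_backup snapshot_YYYY_MM_DD` uses the same index patterns as the request above.
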
## Testing Your Backup Strategy

Regularly test your backup and restore procedures to ensure they work when needed:

1. Create a test environment
2. Restore your production backups to this environment
3. Verify that all functionality works correctly
4. Document any issues encountered and update your backup/restore procedures

A good practice is to test restore procedures quarterly or after significant infrastructure changes.