mirror of
https://github.com/datahub-project/datahub.git
synced 2025-12-27 18:07:57 +00:00
docs: improve documentation for metadata retention (#3802)
This commit is contained in:
parent
101275230e
commit
255462c63c
@ -2,56 +2,55 @@
|
||||
|
||||
## Goal
|
||||
|
||||
DataHub uses a database (or key-value store) to store different versions of the aspects as they get ingested. Storing
|
||||
multiple versions of the aspects allows us to look at the history of how the aspect changed and to rollback to previous
|
||||
version when incorrect metadata gets ingested. However, each version takes up space in the database, while bringing less
|
||||
value to the system. We need to be able to impose **retention** on these records to keep the size of the DB in check.
|
||||
DataHub stores different versions of [metadata aspects](https://datahubproject.io/docs/what/aspect) as they are ingested
|
||||
using a database (or key-value store). These multiple versions allow us to look at an aspect's historical changes and
|
||||
rollback to a previous version if incorrect metadata is ingested. However, every stored version takes additional storage
|
||||
space, while possibly bringing less value to the system. We need to be able to impose a **retention** policy on these
|
||||
records to keep the size of the DB in check.
|
||||
|
||||
Goal of the retention system is to be able to **configure and enforce retention policies** on documents in various
|
||||
levels (
|
||||
global, entity-level, aspect-level)
|
||||
Goal of the retention system is to be able to **configure and enforce retention policies** on documents at each of these
|
||||
various levels:
|
||||
- global
|
||||
- entity-level
|
||||
- aspect-level
|
||||
|
||||
## What type of retention policies are supported?
|
||||
|
||||
We support 3 types of retention policies.
|
||||
We support 3 types of retention policies for aspects:
|
||||
Policy | Versions Kept |
|
||||
--- | --- |
|
||||
Indefinite | All versions
|
||||
Version-based | Latest *N* versions
|
||||
Time-based | Versions ingested in last *N* seconds
|
||||
|
||||
1. Indefinite retention: Keep all versions of aspects
|
||||
2. Version-based retention: Keep the latest N versions
|
||||
3. Time-based retention: Keep versions that have been ingested in the last N seconds
|
||||
|
||||
Note, the latest version of each aspect (version 0) is never deleted. This is to ensure that we do not impact the core
|
||||
functionality of DataHub while applying retention.
|
||||
**Note:** The latest version (version 0) is never deleted. This ensures core functionality of DataHub is not impacted while applying retention.
|
||||
|
||||
## When is the retention policy applied?
|
||||
|
||||
As of now, retention policies are applied in two places
|
||||
As of now, retention policies are applied in two places:
|
||||
|
||||
1. **GMS boot-up**: On boot, it runs a bootstrap step to ingest the predefined set of retention policies. If there were
|
||||
no existing policies or the existing policies got updated, it will trigger an asynchronous call to apply retention
|
||||
to **
|
||||
all** records in the database.
|
||||
2. **Ingest**: On every ingest, if an existing aspect got updated, it applies retention to the urn, aspect pair being
|
||||
ingested.
|
||||
1. **GMS boot-up**: A bootstrap step ingests the predefined set of retention policies. If no policy existed before or the existing policy
|
||||
was updated, an asynchronous call will be triggered. It will apply the retention policy (or policies) to **all** records in the database.
|
||||
2. **Ingest**: On every ingest, if an existing aspect got updated, it applies the retention policy to the urn-aspect pair being ingested.
|
||||
|
||||
We are planning to support a cron-based application of retention in the near future to ensure that the time-based
|
||||
retention is applied correctly.
|
||||
We are planning to support a cron-based application of retention in the near future to ensure that the time-based retention is applied correctly.
|
||||
|
||||
## How to configure?
|
||||
|
||||
For the initial iteration, we have made this feature opt-in. Please set **ENTITY_SERVICE_ENABLE_RETENTION=true** when
|
||||
creating the datahub-gms container/k8s pod.
|
||||
|
||||
On GMS start up, it fetches the list of retention policies to ingest from two sources. First is the default we provide,
|
||||
which adds a version-based retention to keep 20 latest aspects for all entity-aspect pairs. Second, we read YAML files
|
||||
from the `/etc/datahub/plugins/retention` directory and overlay them on the default set of policies we provide.
|
||||
On GMS start up, retention policies are initialized with:
|
||||
1. First, the default provided **version-based** retention to keep **20 latest aspects** for all entity-aspect pairs.
|
||||
2. Second, we read YAML files from the `/etc/datahub/plugins/retention` directory and overlay them on the default set of policies we provide.
|
||||
|
||||
For docker, we set docker-compose to mount `${HOME}/.datahub/plugins` directory to `/etc/datahub/plugins` directory
|
||||
within the containers, so you can customize the initial set of retention policies by creating
|
||||
a `${HOME}/.datahub/plugins/retention/retention.yaml` file.
|
||||
|
||||
We will support a standardized way to do this in kubernetes setup in the near future.
|
||||
We will support a standardized way to do this in Kubernetes setup in the near future.
|
||||
|
||||
The format for the YAML file is as follows.
|
||||
The format for the YAML file is as follows:
|
||||
|
||||
```yaml
|
||||
- entity: "*" # denotes that policy will be applied to all entities
|
||||
@ -70,10 +69,10 @@ The format for the YAML file is as follows.
|
||||
maxAgeInSeconds: 2592000 # 30 days
|
||||
```
|
||||
|
||||
Note, it searches for the policies corresponding to the entity, aspect pair in the following order
|
||||
Note, it searches for the policies corresponding to the entity, aspect pair in the following order:
|
||||
1. entity, aspect
|
||||
2. *, aspect
|
||||
3. entity, *
|
||||
4. *, *
|
||||
|
||||
By restarting datahub-gms after creating the plugin yaml file, the new set of retention policies will be applied.
|
||||
By restarting datahub-gms after creating the plugin yaml file, the new set of retention policies will be applied.
|
||||
|
||||
Loading…
x
Reference in New Issue
Block a user