docs: improve documentation for metadata retention (#3802)

This commit is contained in:
Danny L 2022-01-04 16:00:39 -05:00 committed by GitHub
parent 101275230e
commit 255462c63c
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

View File

@ -2,56 +2,55 @@
## Goal
DataHub uses a database (or key-value store) to store different versions of the aspects as they get ingested. Storing
multiple versions of the aspects allows us to look at the history of how the aspect changed and to rollback to previous
version when incorrect metadata gets ingested. However, each version takes up space in the database, while bringing less
value to the system. We need to be able to impose **retention** on these records to keep the size of the DB in check.
DataHub stores different versions of [metadata aspects](https://datahubproject.io/docs/what/aspect) as they are ingested
using a database (or key-value store). These multiple versions allow us to look at an aspect's historical changes and
rollback to a previous version if incorrect metadata is ingested. However, every stored version takes additional storage
space, while possibly bringing less value to the system. We need to be able to impose a **retention** policy on these
records to keep the size of the DB in check.
Goal of the retention system is to be able to **configure and enforce retention policies** on documents in various
levels (
global, entity-level, aspect-level)
Goal of the retention system is to be able to **configure and enforce retention policies** on documents at each of these
various levels:
- global
- entity-level
- aspect-level
## What type of retention policies are supported?
We support 3 types of retention policies.
We support 3 types of retention policies for aspects:
Policy | Versions Kept |
--- | --- |
Indefinite | All versions
Version-based | Latest *N* versions
Time-based | Versions ingested in last *N* seconds
1. Indefinite retention: Keep all versions of aspects
2. Version-based retention: Keep the latest N versions
3. Time-based retention: Keep versions that have been ingested in the last N seconds
Note, the latest version of each aspect (version 0) is never deleted. This is to ensure that we do not impact the core
functionality of DataHub while applying retention.
**Note:** The latest version (version 0) is never deleted. This ensures core functionality of DataHub is not impacted while applying retention.
## When is the retention policy applied?
As of now, retention policies are applied in two places
As of now, retention policies are applied in two places:
1. **GMS boot-up**: On boot, it runs a bootstrap step to ingest the predefined set of retention policies. If there were
no existing policies or the existing policies got updated, it will trigger an asynchronous call to apply retention
to **
all** records in the database.
2. **Ingest**: On every ingest, if an existing aspect got updated, it applies retention to the urn, aspect pair being
ingested.
1. **GMS boot-up**: A bootstrap step ingests the predefined set of retention policies. If no policy existed before or the existing policy
was updated, an asynchronous call will be triggered. It will apply the retention policy (or policies) to **all** records in the database.
2. **Ingest**: On every ingest, if an existing aspect got updated, it applies the retention policy to the urn-aspect pair being ingested.
We are planning to support a cron-based application of retention in the near future to ensure that the time-based
retention is applied correctly.
We are planning to support a cron-based application of retention in the near future to ensure that the time-based retention is applied correctly.
## How to configure?
For the initial iteration, we have made this feature opt-in. Please set **ENTITY_SERVICE_ENABLE_RETENTION=true** when
creating the datahub-gms container/k8s pod.
On GMS start up, it fetches the list of retention policies to ingest from two sources. First is the default we provide,
which adds a version-based retention to keep 20 latest aspects for all entity-aspect pairs. Second, we read YAML files
from the `/etc/datahub/plugins/retention` directory and overlay them on the default set of policies we provide.
On GMS start up, retention policies are initialized with:
1. First, the default provided **version-based** retention to keep **20 latest aspects** for all entity-aspect pairs.
2. Second, we read YAML files from the `/etc/datahub/plugins/retention` directory and overlay them on the default set of policies we provide.
For docker, we set docker-compose to mount `${HOME}/.datahub/plugins` directory to `/etc/datahub/plugins` directory
within the containers, so you can customize the initial set of retention policies by creating
a `${HOME}/.datahub/plugins/retention/retention.yaml` file.
We will support a standardized way to do this in kubernetes setup in the near future.
We will support a standardized way to do this in Kubernetes setup in the near future.
The format for the YAML file is as follows.
The format for the YAML file is as follows:
```yaml
- entity: "*" # denotes that policy will be applied to all entities
@ -70,10 +69,10 @@ The format for the YAML file is as follows.
maxAgeInSeconds: 2592000 # 30 days
```
Note, it searches for the policies corresponding to the entity, aspect pair in the following order
Note, it searches for the policies corresponding to the entity, aspect pair in the following order:
1. entity, aspect
2. *, aspect
3. entity, *
4. *, *
By restarting datahub-gms after creating the plugin yaml file, the new set of retention policies will be applied.
By restarting datahub-gms after creating the plugin yaml file, the new set of retention policies will be applied.