mirror of
				https://github.com/datahub-project/datahub.git
				synced 2025-10-31 02:37:05 +00:00 
			
		
		
		
	
		
			
				
	
	
		
			83 lines
		
	
	
		
			3.4 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			83 lines
		
	
	
		
			3.4 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
| # Configuring Database Retention
 | |
| 
 | |
| ## Goal
 | |
| 
 | |
| DataHub stores different versions of [metadata aspects](https://docs.datahub.com/docs/what/aspect) as they are ingested
 | |
| using a database (or key-value store). These multiple versions allow us to look at an aspect's historical changes and
 | |
| rollback to a previous version if incorrect metadata is ingested. However, every stored version takes additional storage
 | |
| space, while possibly bringing less value to the system. We need to be able to impose a **retention** policy on these
 | |
| records to keep the size of the DB in check.
 | |
| 
 | |
| Goal of the retention system is to be able to **configure and enforce retention policies** on documents at each of these
 | |
| various levels:
 | |
| 
 | |
| - global
 | |
| - entity-level
 | |
| - aspect-level
 | |
| 
 | |
| ## What type of retention policies are supported?
 | |
| 
 | |
| We support 3 types of retention policies for aspects:
 | |
| 
 | |
| |    Policy     |             Versions Kept             |
 | |
| | :-----------: | :-----------------------------------: |
 | |
| |  Indefinite   |             All versions              |
 | |
| | Version-based |          Latest _N_ versions          |
 | |
| |  Time-based   | Versions ingested in last _N_ seconds |
 | |
| 
 | |
| **Note:** The latest version (version 0) is never deleted. This ensures core functionality of DataHub is not impacted while applying retention.
 | |
| 
 | |
| ## When is the retention policy applied?
 | |
| 
 | |
| As of now, retention policies are applied in two places:
 | |
| 
 | |
| 1. **GMS boot-up**: A bootstrap step ingests the predefined set of retention policies. If no policy existed before or the existing policy
 | |
|    was updated, an asynchronous call will be triggered. It will apply the retention policy (or policies) to **all** records in the database.
 | |
| 2. **Ingest**: On every ingest, if an existing aspect got updated, it applies the retention policy to the urn-aspect pair being ingested.
 | |
| 
 | |
| We are planning to support a cron-based application of retention in the near future to ensure that the time-based retention is applied correctly.
 | |
| 
 | |
| ## How to configure?
 | |
| 
 | |
| We have enabled with feature by default. Please set **ENTITY_SERVICE_ENABLE_RETENTION=false** when
 | |
| creating the datahub-gms container/k8s pod to prevent the retention policies from taking effect.
 | |
| 
 | |
| On GMS start up, retention policies are initialized with:
 | |
| 
 | |
| 1. First, the default provided **version-based** retention to keep **20 latest aspects** for all entity-aspect pairs.
 | |
| 2. Second, we read YAML files from the `/etc/datahub/plugins/retention` directory and overlay them on the default set of policies we provide.
 | |
| 
 | |
| For docker, we set docker-compose to mount `${HOME}/.datahub` directory to `/etc/datahub` directory
 | |
| within the containers, so you can customize the initial set of retention policies by creating
 | |
| a `${HOME}/.datahub/plugins/retention/retention.yaml` file.
 | |
| 
 | |
| We will support a standardized way to do this in Kubernetes setup in the near future.
 | |
| 
 | |
| The format for the YAML file is as follows:
 | |
| 
 | |
| ```yaml
 | |
| - entity: "*" # denotes that policy will be applied to all entities
 | |
|   aspect: "*" # denotes that policy will be applied to all aspects
 | |
|   config:
 | |
|     retention:
 | |
|       version:
 | |
|         maxVersions: 20
 | |
| - entity: "dataset"
 | |
|   aspect: "datasetProperties"
 | |
|   config:
 | |
|     retention:
 | |
|       version:
 | |
|         maxVersions: 20
 | |
|       time:
 | |
|         maxAgeInSeconds: 2592000 # 30 days
 | |
| ```
 | |
| 
 | |
| Note, it searches for the policies corresponding to the entity, aspect pair in the following order:
 | |
| 
 | |
| 1. entity, aspect
 | |
| 2. \*, aspect
 | |
| 3. entity, \*
 | |
| 4. _, _
 | |
| 
 | |
| By restarting datahub-gms after creating the plugin yaml file, the new set of retention policies will be applied.
 | 
