mirror of
				https://github.com/datahub-project/datahub.git
				synced 2025-10-31 02:37:05 +00:00 
			
		
		
		
	feat(tags): RFC for tags (#2112)
		
							parent
							
								
									5c4e671ac5
								
							
						
					
					
						commit
						973768fc5a
					
				| @ -115,6 +115,7 @@ module.exports = { | ||||
|           "docs/rfc/active/business_glossary/README", | ||||
|           "docs/rfc/active/graph_ql_frontend/queries", | ||||
|           "docs/rfc/active/react-app/README", | ||||
|           "docs/rfc/active/tags/README", | ||||
|         ], | ||||
|       }, | ||||
|     ], | ||||
|  | ||||
							
								
								
									
										190
									
								
								docs/rfc/active/tags/README.md
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
							| @ -0,0 +1,190 @@ | ||||
| - Start Date: 2021-02-17 | ||||
| - RFC PR: https://github.com/linkedin/datahub/pull/2112 | ||||
| - Discussion Issue: (GitHub issue this was discussed in before the RFC, if any) | ||||
| - Implementation PR(s): (leave this empty) | ||||
| 
 | ||||
| # Tags | ||||
| 
 | ||||
| ## Summary | ||||
| 
 | ||||
| We suggest a generic, global tagging solution for Datahub. As the solution is quite generic and flexible, it can also | ||||
| hopefully serve as a stepping stone for new, cool features in the future. | ||||
| 
 | ||||
| ## Motivation | ||||
| 
 | ||||
| Currently some entities, such as Datasets, can be tagged using strings, but unfortunately this solution is quite | ||||
| limited. | ||||
| 
 | ||||
| A general tag implementation will allow us to define and attach a new and simple type of metadata to all types of | ||||
| entities. As the tags would be defined globally, tagging multiple objects with the same tag will give us the possibility | ||||
| to define and search based on a new kind of relationship, for example, which datasets and ML models are tagged as | ||||
| containing PII data. This allows for describing relationships between objects that would otherwise not have a direct | ||||
| lineage relationship. Moreover, tags would lower the bar to add simple metadata to any object in the Datahub instance | ||||
| and open the door to crowd-sourcing metadata. Since tags themselves are entities, it would also be possible to tag | ||||
| tags, enabling a hierarchy of sorts. | ||||
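
As a rough illustration of what "a hierarchy of sorts" could mean, here is a minimal Python sketch. It is not part of the proposal: the function name `effective_tags`, the simplified URNs, and the idea of resolving tags transitively are all assumptions made for this example.

```python
from typing import Dict, Set


def effective_tags(entity: str, direct_tags: Dict[str, Set[str]]) -> Set[str]:
    """Collect the tags on an entity plus the tags attached to those tags.

    `direct_tags` maps any URN (entity or tag) to the tag URNs attached to it.
    The `seen` set also guards against cycles such as two tags tagging each other.
    """
    seen: Set[str] = set()
    stack = list(direct_tags.get(entity, ()))
    while stack:
        tag = stack.pop()
        if tag in seen:
            continue
        seen.add(tag)
        stack.extend(direct_tags.get(tag, ()))
    return seen


# Hypothetical, simplified URNs used only for illustration.
direct_tags = {
    "urn:li:dataset:users": {"urn:li:tag:email"},
    "urn:li:tag:email": {"urn:li:tag:pii"},
    "urn:li:tag:pii": {"urn:li:tag:email"},  # deliberate cycle
}
resolved = effective_tags("urn:li:dataset:users", direct_tags)
```

Here the dataset is tagged only with `email`, but because `email` is itself tagged with `pii`, a transitive lookup would report both.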
| 
 | ||||
| The solution is meant to be quite generic and flexible, and we're not trying to be too opinionated about how a user | ||||
| should use the feature. We hope that this initial generic solution can serve as a stepping stone for cool features in | ||||
| the future. | ||||
| 
 | ||||
| ## Requirements | ||||
| 
 | ||||
| - Ability to associate tags with any type of entity, even other tags! | ||||
| - Ability to tag the same entity with multiple tags. | ||||
| - Ability to tag multiple objects with the same tag instance. | ||||
| - To the point above, ability to make easy tag-based searches later on. | ||||
| - Metadata on tags is TBD | ||||
| 
 | ||||
| ### Extensibility | ||||
| 
 | ||||
| The normal new-entity-onboarding work is obviously required. | ||||
| 
 | ||||
| Hopefully this can serve as a stepping stone to work on special cases such as the tag-based privacy tagging mentioned in | ||||
| the roadmap. | ||||
| 
 | ||||
| ## Non-Requirements | ||||
| 
 | ||||
| Let's leave the UI work required for this to another time. | ||||
| 
 | ||||
| ## Detailed design | ||||
| 
 | ||||
| We want to introduce some new models under `datahub/metadata-models/src/main/pegasus/com/linkedin/common/`. | ||||
| 
 | ||||
| ### `Tag` entity | ||||
| 
 | ||||
| First we create a `TagMetadata` entity, which defines the actual tag-object. | ||||
| 
 | ||||
| The `edit` property (on `TagAttachment`, below) defines the edit rights of the tag, as some tags (like sensitivity | ||||
| tags) should be read-only for a majority of users. | ||||
| 
 | ||||
| ``` | ||||
| /** | ||||
|  * Tag information | ||||
|  */ | ||||
| record TagMetadata { | ||||
|    /** | ||||
|    * Tag URN, e.g. urn:li:tag:<name> | ||||
|    */ | ||||
|    urn: TagUrn | ||||
| 
 | ||||
|    /** | ||||
|    * Tag value. | ||||
|    */ | ||||
|    value: string | ||||
| 
 | ||||
|    /** | ||||
|    * Optional tag description | ||||
|    */ | ||||
|    description: optional string | ||||
| 
 | ||||
|    /** | ||||
|    * Audit stamp associated with creation of this tag | ||||
|    */ | ||||
|    createStamp: AuditStamp | ||||
| } | ||||
| ``` | ||||
| 
 | ||||
| ### `TagAttachment` | ||||
| 
 | ||||
| We define a `TagAttachment` model, which describes the application of a tag to an entity. | ||||
| 
 | ||||
| ``` | ||||
| /** | ||||
|  * Tag attachment information | ||||
|  */ | ||||
| record TagAttachment { | ||||
| 
 | ||||
|   /** | ||||
|    * Tag in question | ||||
|    */ | ||||
|   tag: TagUrn | ||||
| 
 | ||||
|   /** | ||||
|    * Who has edit rights to this attachment. | ||||
|    * WIP, pending access-control support in Datahub. | ||||
|    * Relevant for privacy tags at least. | ||||
|    * We might also want to add view rights? | ||||
|    */ | ||||
|   edit: union[None, any, role-urn] | ||||
| 
 | ||||
|    /** | ||||
|    * Audit stamp associated with the attachment of this tag to this entity | ||||
|    */ | ||||
|    attachmentStamp: AuditStamp | ||||
| } | ||||
| ``` | ||||
| 
 | ||||
| ### `Tags` container | ||||
| 
 | ||||
| Then we define a `Tags` aspect, which is used as a container for tag attachments. | ||||
| 
 | ||||
| ``` | ||||
| namespace com.linkedin.common | ||||
| 
 | ||||
| /** | ||||
|  * Tags information | ||||
|  */ | ||||
| record Tags { | ||||
| 
 | ||||
|    /** | ||||
|    * List of tag attachments | ||||
|    */ | ||||
|    elements: array[TagAttachment] = [ ] | ||||
| } | ||||
| ``` | ||||
| 
 | ||||
| This can easily be taken into use with all entities that we want to be able to tag, e.g. `Datasets`. As we see a | ||||
| lot of potential in tagging individual dataset fields as well, we can either add a reference to a `Tags` object in the | ||||
| `SchemaField` object, or alternatively create a new `DatasetFieldTags`, similar to `DatasetFieldMapping`. | ||||
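
To make the shape of these models concrete, here is a minimal Python sketch. It is not DataHub code: the dataclasses only mirror `TagAttachment` and `Tags` from above (the WIP `edit` field is omitted and the audit stamp is reduced to an actor and a timestamp), and `build_tag_index` and the simplified URNs are hypothetical names introduced for illustration. It shows how attaching the same tag URN to several entities could support a simple tag-based lookup.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class TagAttachment:
    """Simplified stand-in for the TagAttachment record (no `edit` field)."""
    tag: str           # tag URN, e.g. "urn:li:tag:pii"
    attached_by: str   # actor part of the audit stamp (simplified)
    attached_at: int   # epoch-millis part of the audit stamp (simplified)


@dataclass
class Tags:
    """Simplified stand-in for the Tags aspect: a container of attachments."""
    elements: List[TagAttachment] = field(default_factory=list)


def build_tag_index(aspects: Dict[str, Tags]) -> Dict[str, Set[str]]:
    """Invert entity URN -> Tags into tag URN -> set of entity URNs."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for entity_urn, tags in aspects.items():
        for attachment in tags.elements:
            index[attachment.tag].add(entity_urn)
    return dict(index)


# Hypothetical entities: a dataset and an ML model share the same tag.
PII = "urn:li:tag:pii"
aspects = {
    "urn:li:dataset:users": Tags(
        [TagAttachment(PII, "urn:li:corpuser:alice", 1613520000000)]
    ),
    "urn:li:mlModel:churn": Tags(
        [
            TagAttachment(PII, "urn:li:corpuser:bob", 1613520000001),
            TagAttachment("urn:li:tag:experimental", "urn:li:corpuser:bob", 1613520000002),
        ]
    ),
    "urn:li:dataset:weather": Tags(),  # no tags attached
}
tag_index = build_tag_index(aspects)
```

Looking up `tag_index[PII]` then yields both the dataset and the ML model, which is the kind of cross-entity relationship the motivation section describes. A real implementation would presumably rely on DataHub's search and relationship infrastructure rather than an in-memory dictionary.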
| 
 | ||||
| ## How we teach this | ||||
| 
 | ||||
| We should create/update user guides to educate users on: | ||||
| 
 | ||||
| - Suggestions on how to use tags: low threshold metadata-addition, and the possibility of doing new types of searches | ||||
| 
 | ||||
| ## Drawbacks | ||||
| 
 | ||||
| This is definitely more complex than just adding strings to an array. | ||||
| 
 | ||||
| ## Alternatives | ||||
| 
 | ||||
| An array of strings is a simpler solution but does not allow for the same functionality as suggested here. | ||||
| 
 | ||||
| Another alternative would be to simplify the models by removing some of the metadata in the `TagMetadata` and | ||||
| `TagAttachment` entities, such as the edit/view permission field, the audit stamps, and the descriptions. | ||||
| 
 | ||||
| Apache Atlas uses a similar approach. It requires you to create a Tag instance before it can be associated with an | ||||
| "asset", and the attachment is done using a dropdown list. The tags can also have attributes and a description. See | ||||
| [here](https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.5.3/bk_data-governance/content/ch_working_with_atlas_tags.html) | ||||
| for an example. The tags are a central piece in the UI and readily searchable, as easily as datasets. | ||||
| 
 | ||||
| Atlas also has a concept very closely related to tags, called _classification_. Classifications are similar to tags in | ||||
| that they need to be created separately, can have attributes (but no description?) and are attached to assets | ||||
| using a dropdown list. Classifications have the added functionality of propagation, which means that they are | ||||
| automatically applied to downstream assets, unless specifically set to not do so. Any change to a classification (say an | ||||
| attribute change) also flows downstream, and in downstream assets you're able to see where the classification | ||||
| propagated from. See | ||||
| [here](https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/using-atlas/content/propagate_classifications_to_derived_entities.html) | ||||
| for an example. | ||||
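
The propagation behaviour described above can be sketched roughly as follows. This is a simplified illustration in Python, not Atlas's implementation or API: the function `propagate_classification`, the `no_propagate` opt-out set, and the asset names are all assumptions made for this example.

```python
from collections import deque
from typing import Dict, FrozenSet, List


def propagate_classification(
    downstream: Dict[str, List[str]],
    source: str,
    no_propagate: FrozenSet[str] = frozenset(),
) -> Dict[str, str]:
    """Breadth-first walk of the lineage graph starting at `source`.

    Returns a mapping from every asset that receives the classification to the
    asset it originally propagated from. Assets listed in `no_propagate` still
    receive the classification but do not pass it further downstream.
    """
    origin = {source: source}
    queue = deque([source])
    while queue:
        asset = queue.popleft()
        if asset in no_propagate:
            continue  # classified itself, but opted out of propagating
        for child in downstream.get(asset, []):
            if child not in origin:
                origin[child] = source
                queue.append(child)
    return origin


# Hypothetical lineage: each key lists its direct downstream assets.
lineage = {
    "raw_users": ["clean_users"],
    "clean_users": ["user_report", "user_features"],
    "user_features": ["churn_model"],
}
result = propagate_classification(
    lineage, "raw_users", no_propagate=frozenset({"user_features"})
)
```

In this example `churn_model` does not receive the classification, because `user_features` was set not to propagate it.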
| 
 | ||||
| ## Rollout / Adoption Strategy | ||||
| 
 | ||||
| Using the functionality is optional and does not break other functionality as is. The solution is generic enough that | ||||
| users can easily adopt it. It can be taken into use like any other entity and aspect. | ||||
| 
 | ||||
| ## Future Work | ||||
| 
 | ||||
| - Add `Tags` to aspects for entities. | ||||
| - Implement relationship builders as needed. | ||||
| - The implementation of and need for access control to tags is an open question. | ||||
| - As this is first and foremost a tool for discovery, the UI work is extensive: | ||||
|   - Creating tags in a way that makes duplication and spelling mistakes difficult. | ||||
|   - Attaching tags to entities: autocomplete, dropdown, etc. | ||||
|   - Visualizing existing tags, and which are most popular? | ||||
| - Explore the idea of a special "classification" type that propagates downstream, as in Atlas. | ||||
| 
 | ||||
| ## Unresolved questions | ||||
| 
 | ||||
| - How do we want to map dataset fields to tags? | ||||
| - Do we want to implement edit/view rights? | ||||
	 Fredrik Sannholm