import FeatureAvailability from '@site/src/components/FeatureAvailability';

# Metadata Tests

<FeatureAvailability saasOnly />

DataHub includes a highly configurable, no-code framework that allows you to define broad-spanning monitors & continuous actions
for the data assets - datasets, dashboards, charts, pipelines - that make up your enterprise Metadata Graph.
At the center of this framework is the concept of a Metadata Test.

There are two powerful use cases that are uniquely enabled by the Metadata Tests framework:

1. Automated Asset Classification
2. Automated Metadata Completion Monitoring

### Automated Asset Classification

Metadata Tests allow you to define conditions for selecting a subset of data assets (e.g. datasets, dashboards, etc.),
along with a set of actions to take for entities that are selected. After the test is defined, the actions
will be applied continuously over time, as the selection set evolves & changes with your data ecosystem.

When defining selection criteria, you'll be able to choose from a range of useful technical signals (e.g. usage, size) that are automatically
extracted by DataHub (these vary by integration). This makes it easy to automatically classify the "important" assets in your organization, which
is in turn critical for running effective Data Governance initiatives.

For example, we can define a Metadata Test that selects all Snowflake Tables in the top 10% of "most queried"
for the past 30 days, and then assigns those Tables to a special "Tier 1" group using DataHub Tags, Glossary Terms, or Domains.

### Automated Data Governance Monitoring

Metadata Tests allow you to define & monitor a set of rules that apply to assets in your data ecosystem (e.g. datasets, dashboards, etc.). This is particularly useful when attempting to govern
your data, as it allows for the (1) definition and (2) measurement of centralized metadata standards, which are key for both bootstrapping
and maintaining a well-governed data ecosystem.

For example, we can define a Metadata Test which requires that all "Tier 1" data assets (e.g. those marked with a special Tag or Glossary Term)
must have the following metadata:

1. At least 1 explicit owner _and_
2. High-level, human-authored documentation _and_
3. At least 1 Glossary Term from the "Classification" Term Group

Then, we can closely monitor which assets are passing and failing these rules as we work to improve things over time.
We can easily identify assets that are _in_ and _out of_ compliance with a set of centrally-defined standards.

By applying automation, Metadata Tests
can enable the full lifecycle of complex Data Governance initiatives - from scoping to execution to monitoring.

## Metadata Tests Setup, Prerequisites, and Permissions

What you need to manage Metadata Tests on DataHub:

- **Manage Tests** Privilege

This Platform Privilege allows users to create, edit, and remove all Metadata Tests on DataHub. Therefore, it should only be
given to those users who will be serving as metadata Admins of the platform. The default `Admin` role has this Privilege.

> Note that the Metadata Tests feature currently supports only the following DataHub Asset Types:
>
> - Dataset
> - Dashboard
> - Chart
> - Data Flow (e.g. Pipeline)
> - Data Job (e.g. Task)
> - Container (Database, Schema, Project)
>
> If you'd like to see Metadata Tests for other asset types, please let your Acryl Customer Success partner know!

## Using Metadata Tests

Metadata Tests can be created by first navigating to **Govern > Tests**.

To begin building a new Metadata Test, click **Create new Test**.

<p align="center">
  <img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/manage-tests.png"/>
</p>

### Creating a Metadata Test

Inside the Metadata Test builder, we'll need to construct the 3 parts of a Metadata Test:

1. **Selection Criteria** - Select assets that are in the scope of the test
2. **Rules** - Define rules that selected assets can either pass or fail
3. **Actions (Optional)** - Define automated actions to be taken on assets that are passing
   or failing the test

<p align="center">
  <img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-create.png"/>
</p>
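
The same 3 parts (plus the name, category, and description added in Step 4) are also what the [createTest](../../graphql/mutations.md#createtest) GraphQL mutation accepts when creating a Test programmatically. The sketch below is illustrative only: the exact input fields and the JSON encoding of the test definition are assumptions here, so consult the createTest reference linked in the GraphQL section below before relying on it.

```graphql
# Hypothetical sketch only - field names and the definition payload are
# assumptions; see the createTest GraphQL reference for the exact schema.
mutation createTier1GovernanceTest {
  createTest(
    input: {
      name: "Tier 1 Governance Standards"
      category: "Governance"
      description: "Tier 1 assets must have an owner, documentation, and a Classification term."
      # The definition carries an encoding of the Selection Criteria, Rules,
      # and Actions built in Steps 1-3 of the visual builder.
      definition: { json: "<JSON-encoded test definition>" }
    }
  )
}
```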

#### Step 1: Defining Selection Criteria (Scope)

In the first step, we define a set of conditions that are used to select a subset of the assets in our Metadata Graph
that will be "in the scope" of the new test. Assets that **match** the selection conditions will be considered in scope, while those which do not are simply not applicable for the test.
Once the test is created, it will be evaluated on a continuous basis for any assets that fall in scope (when an asset changes on DataHub,
or once every day).

##### Selecting Asset Types

You must select at least one asset _type_ from a set that includes Datasets, Dashboards, Charts, Data Flows (Pipelines), Data Jobs (Tasks),
and Containers.

<p align="center">
  <img width="50%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-select-type.png"/>
</p>

Entities with the selected types will be considered in scope, while those of other types will be considered out of scope and
thus omitted from evaluation of the test.

##### Building Conditions

**Property** conditions are the basic unit of comparison used for selecting data assets. Each **Property** condition consists of a target _property_,
an _operator_, and an optional _value_.

A _property_ is an attribute of a data asset. It can either be a technical signal (e.g. a **metric** such as usage or storage size) or a
metadata signal (e.g. owners, domain, glossary terms, tags, and more), depending on the asset type and applicability of the signal.
The full set of supported _properties_ can be found in the table below.

An _operator_ is the type of predicate that will be applied to the selected _property_ when evaluating the test for an asset. The types
of operators that are applicable depend on the selected property. Some examples of operators include `Equals`, `Exists`, `Matches Regex`,
and `Contains`.

A _value_ defines the right-hand side of the condition, or a pre-configured value to evaluate the property and operator against. The type of the value
is dependent on the selected _property_ and _operator_. For example, if the selected _operator_ is `Matches Regex`, the type of the
value would be a string.

By selecting a property, operator, and value, we can create a single condition (or predicate) used for
selecting a data asset to be tested. For example, we can build property conditions that match:

- All datasets in the top 25% of query usage in the past 30 days
- All assets that have the "Tier 1" Glossary Term attached
- All assets in the "Marketing" Domain
- All assets without owners
- All assets without a description

To create a **Property** condition, simply click **Add Condition** then select **Property** condition.

<p align="center">
  <img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-create-property-condition.png"/>
</p>

We can combine **Property** conditions using boolean operators including `AND`, `OR`, and `NOT`, by
creating **Logical** conditions. To create a **Logical** condition, simply click **Add Condition** then select an
**And**, **Or**, or **Not** condition.

<p align="center">
  <img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-create-logical-condition.png"/>
</p>

Logical conditions allow us to accommodate complex real-world selection requirements:

- All Snowflake Tables that are in the Top 25% of most queried AND do not have a Domain
- All Looker Dashboards that do not have a description authored in Looker OR in DataHub

#### Step 2: Defining Rules

In the second step, we can define a set of conditions that selected assets must match in order to be considered "passing" the test.
To do so, we can construct another set of **Property** conditions (as described above).

> **Pro-Tip**: If no rules are supplied, then all assets that are selected by the criteria defined in Step 1 will be considered "passing".
> If you need to apply an automated Action to the selected assets, you can leave the Rules blank and continue to the next step.

<p align="center">
  <img width="50%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-create-rules.png"/>
</p>

When combined with the selection criteria, Rules allow us to define complex, highly custom **Data Governance** policies such as:

- All datasets in the top 25% of query usage in the past 30 days **must have an owner**
- All assets in the "Marketing" Domain **must have a description**
- All Snowflake Tables that are in the Top 25% of most queried AND do not have a Domain **must have
  a Glossary Term from the Classification Term Group**

##### Validating Test Conditions

During Step 2, we can quickly verify that the Selection Criteria & Rules we've authored
match our expectations by testing them against some existing assets indexed by DataHub.

To verify your Test conditions, simply click **Try it out**, find an asset to test against by searching & filtering down your assets,
and finally click **Run Test** to see whether the asset passes or fails the provided conditions.

<p align="center">
  <img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-validate-conditions.png"/>
</p>

#### Step 3: Defining Actions (Optional)

> If you don't wish to take any actions for assets that pass or fail the test, simply click 'Skip'.

In the third step, we can define a set of Actions that will be automatically applied to each selected asset which passes or fails the Rules conditions.

For example, we may wish to mark **passing** assets with a special DataHub Tag or Glossary Term (e.g. "Tier 1"), or remove those special markings from assets which are failing.
This allows us to automatically control classifications of data assets as they move in and out of compliance with the Rules defined in Step 2.

A few of the supported Action types include:

- Adding or removing specific Tags
- Adding or removing specific Glossary Terms
- Adding or removing specific Owners
- Adding to or removing from a specific Domain

<p align="center">
  <img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-define-actions.png"/>
</p>

#### Step 4: Name, Category, Description

In the final step, we can add a freeform name, category, and description for our new Metadata Test.

### Viewing Test Results

Metadata Test results can be viewed in 2 places:

1. On an asset profile page (e.g. Dataset profile page), inside the **Validation** tab.
2. On the Metadata Tests management page. To view all assets passing or failing a particular test,
   simply click on the labels showing the number of passing or failing assets.

<p align="center">
  <img width="50%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-view-results.png"/>
</p>
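
Test results can also be retrieved programmatically through the GraphQL API, which is what powers the **Validation** tab. The query below is a minimal sketch, assuming the entity exposes a `testResults` field with `passing` and `failing` lists; treat the exact field names as assumptions and verify them against the DataHub GraphQL reference.

```graphql
# Hypothetical sketch - assumes a testResults field on the Dataset type;
# verify field names against the DataHub GraphQL reference before use.
query datasetTestResults {
  dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)") {
    testResults {
      passing {
        test {
          name
        }
      }
      failing {
        test {
          name
        }
      }
    }
  }
}
```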

### Updating an Existing Test

To update an existing Test, simply click **Edit** on the test you wish to change.

Then, make the changes required and click **Save**. When you save a Test, it may take up to 2 minutes for changes
to be reflected across DataHub.

### Removing a Test

To remove a Test, simply click on the trashcan icon located on the Tests list. This will remove the Test and
deactivate it so that it is no longer evaluated.

When you delete a Test, it may take up to 2 minutes for changes to be reflected.

### GraphQL

- [listTests](../../graphql/queries.md#listtests)
- [createTest](../../graphql/mutations.md#createtest)
- [deleteTest](../../graphql/mutations.md#deletetest)
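
As a quick illustration, a minimal query and mutation pair might look like the following. The input fields shown are assumptions, so check the linked reference pages for the exact shapes.

```graphql
# Hypothetical sketch - input fields are assumptions; see the linked
# listTests and deleteTest reference pages for the exact signatures.
query listAllTests {
  listTests(input: { start: 0, count: 20 }) {
    total
    tests {
      urn
      name
      category
    }
  }
}

mutation removeTest {
  # Deletes the Test with the given URN; changes may take up to 2 minutes to be reflected.
  deleteTest(urn: "urn:li:test:<test-id>")
}
```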

## FAQ and Troubleshooting

**When are Metadata Tests evaluated?**

Metadata Tests are evaluated in 2 scenarios:

1. When an individual asset is changed in DataHub, all tests that include it in scope are evaluated
2. On a recurring cadence (usually every 24 hours) by a dedicated Metadata Test evaluator, which evaluates all tests against the Metadata Graph

**Can I configure a custom evaluation schedule for my Metadata Test?**

No, you cannot. Currently, the internal evaluator will ensure that tests are run continuously for
each asset, regardless of whether it is being changed on DataHub.

**How is a Metadata Test different from an Assertion?**

An Assertion is a specific test, similar to a unit test, that is defined for a single data asset. Typically,
it will include domain-specific knowledge about the asset and test against physical attributes of it. For example, an Assertion
may verify that the number of rows for a specific table in Snowflake falls into a well-defined range.

A Metadata Test is a broad-spanning predicate which applies to a subset of the Metadata Graph (e.g. across multiple
data assets). Typically, it is defined against _metadata_ attributes, as opposed to the physical data itself. For example,
a Metadata Test may verify that ALL tables in Snowflake have at least 1 assigned owner and a human-authored description.
Metadata Tests allow you to manage broad, metadata-driven policies across your entire data ecosystem, for example to
augment a larger-scale Data Governance initiative.