mirror of
https://github.com/datahub-project/datahub.git
synced 2025-10-23 23:13:57 +00:00
267 lines
13 KiB
Markdown
267 lines
13 KiB
Markdown
import FeatureAvailability from '@site/src/components/FeatureAvailability';
|
|
|
|
# Metadata Tests
|
|
|
|
<FeatureAvailability saasOnly />
|
|
|
|
DataHub includes a highly configurable, no-code framework that allows you to configure broad-spanning monitors & continuous actions
|
|
for the data assets - datasets, dashboards, charts, pipelines - that make up your enterprise Metadata Graph.
|
|
At the center of this framework is the concept of a Metadata Test.
|
|
|
|
There are two powerful use cases that are uniquely enabled by the Metadata Tests framework:
|
|
|
|
1. Automated Asset Classification
|
|
2. Automated Metadata Completion Monitoring
|
|
|
|
### Automated Asset Classification
|
|
|
|
Metadata Tests allows you to define conditions for selecting a subset of data assets (e.g. datasets, dashboards, etc),
|
|
along with a set of actions to take for entities that are selected. After the test is defined, the actions
|
|
will be applied continuously over time, as the selection set evolves & changes with your data ecosystem.
|
|
|
|
When defining selection criteria, you'll be able to choose from a range of useful technical signals (e.g. usage, size) that are automatically
|
|
extracted by DataHub (which vary by integration). This makes automatically classifying the "important" assets in your organization quite easy, which
|
|
is in turn critical for running effective Data Governance initiatives within your organization.
|
|
|
|
For example, we can define a Metadata Test which selects all Snowflake Tables which are in the top 10% of "most queried"
|
|
for the past 30 days, and then assign those Tables to a special "Tier 1" group using DataHub Tags, Glossary Terms, or Domains.
|
|
|
|
### Automated Data Governance Monitoring
|
|
|
|
Metadata Tests allow you to define & monitor a set of rules that apply to assets in your data ecosystem (e.g. datasets, dashboards, etc). This is particularly useful when attempting to govern
|
|
your data, as it allows for the (1) definition and (2) measurement of centralized metadata standards, which are key for both bootstrapping
|
|
and maintaining a well-governed data ecosystem.
|
|
|
|
For example, we can define a Metadata Test which requires that all "Tier 1" data assets (e.g. those marked with a special Tag or Glossary Term),
|
|
must have the following metadata:
|
|
|
|
1. At least 1 explicit owner _and_
|
|
2. High-level, human-authored documentation _and_
|
|
3. At least 1 Glossary Term from the "Classification" Term Group
|
|
|
|
Then, we can closely monitor which assets are passing and failing these rules as we work to improve things over time.
|
|
We can easily identify assets that are _in_ and _out of_ compliance with a set of centrally-defined standards.
|
|
|
|
By applying automation, Metadata Tests
|
|
can enable the full lifecycle of complex Data Governance initiatives - from scoping to execution to monitoring.
|
|
|
|
## Metadata Tests Setup, Prerequisites, and Permissions
|
|
|
|
What you need to manage Metadata Tests on DataHub:
|
|
|
|
- **Manage Tests** Privilege
|
|
|
|
This Platform Privilege allows users to create, edit, and remove all Metadata Tests on DataHub. Therefore, it should only be
|
|
given to those users who will be serving as metadata Admins of the platform. The default `Admin` role has this Privilege.
|
|
|
|
> Note that the Metadata Tests feature is currently limited in support for the following DataHub Asset Types:
|
|
>
|
|
> - Dataset
|
|
> - Dashboard
|
|
> - Chart
|
|
> - Data Flow (e.g. Pipeline)
|
|
> - Data Job (e.g. Task)
|
|
> - Container (Database, Schema, Project)
|
|
>
|
|
> If you'd like to see Metadata Tests for other asset types, please let your DataHub Cloud CustomerSuccess partner know!
|
|
|
|
## Using Metadata Tests
|
|
|
|
Metadata Tests can be created by first navigating to **Govern > Tests**.
|
|
|
|
To begin building a new Metadata, click **Create new Test**.
|
|
|
|
<p align="center">
|
|
<img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/manage-tests.png"/>
|
|
</p>
|
|
|
|
### Creating a Metadata Test
|
|
|
|
Inside the Metadata Test builder, we'll need to construct the 3 parts of a Metadata Test:
|
|
|
|
1. **Selection Criteria** - Select assets that are in the scope of the test
|
|
2. **Rules** - Define rules that selected assets can either pass or fail
|
|
3. **Actions (Optional)** - Define automated actions to be taken assets that are passing
|
|
or failing the test
|
|
|
|
<p align="center">
|
|
<img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-create.png"/>
|
|
</p>
|
|
|
|
#### Step 1. Defining Selection Criteria (Scope)
|
|
|
|
In the first step, we define a set of conditions that are used to select a subset of the assets in our Metadata Graph
|
|
that will be "in the scope" of the new test. Assets that **match** the selection conditions will be considered in scope, while those which do not are simply not applicable for the test.
|
|
Once the test is created, the test will be evaluated for any assets which fall in scope on a continuous basis (when an asset changes on DataHub
|
|
or once every day).
|
|
|
|
##### Selecting Asset Types
|
|
|
|
You must select at least one asset _type_ from a set that includes Datasets, Dashboards, Charts, Data Flows (Pipelines), Data Jobs (Tasks),
|
|
and Containers.
|
|
|
|
<p align="center">
|
|
<img width="50%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-select-type.png"/>
|
|
</p>
|
|
|
|
Entities will the selected types will be considered in scope, while those of other types will be considered out of scope and
|
|
thus omitted from evaluation of the test.
|
|
|
|
##### Building Conditions
|
|
|
|
**Property** conditions are the basic unit of comparison used for selecting data assets. Each **Property** condition consists of a target _property_,
|
|
an _operator_, and an optional _value_.
|
|
|
|
A _property_ is an attribute of a data asset. It can either be a technical signal (e.g. **metric** such as usage, storage size) or a
|
|
metadata signal (e.g. owners, domain, glossary terms, tags, and more), depending on the asset type and applicability of the signal.
|
|
The full set of supported _properties_ can be found in the table below.
|
|
|
|
An _operator_ is the type of predicate that will be applied to the selected _property_ when evaluating the test for an asset. The types
|
|
of operators that are applicable depend on the selected property. Some examples of operators include `Equals`, `Exists`, `Matches Regex`,
|
|
and `Contains`.
|
|
|
|
A _value_ defines the right-hand side of the condition, or a pre-configured value to evaluate the property and operator against. The type of the value
|
|
is dependent on the selected _property_ and *operator. For example, if the selected *operator\* is `Matches Regex`, the type of the
|
|
value would be a string.
|
|
|
|
By selecting a property, operator, and value, we can create a single condition (or predicate) used for
|
|
selecting a data asset to be tested. For example, we can build property conditions that match:
|
|
|
|
- All datasets in the top 25% of query usage in the past 30 days
|
|
- All assets that have the "Tier 1" Glossary Term attached
|
|
- All assets in the "Marketing" Domain
|
|
- All assets without owners
|
|
- All assets without a description
|
|
|
|
To create a **Property** condition, simply click **Add Condition** then select **Property** condition.
|
|
|
|
<p align="center">
|
|
<img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-create-property-condition.png"/>
|
|
</p>
|
|
|
|
We can combine **Property** conditions using boolean operators including `AND`, `OR`, and `NOT`, by
|
|
creating **Logical** conditions. To create a **Logical** condition, simply click **Add Condition** then select an
|
|
**And**, **Or**, or **Not** condition.
|
|
|
|
<p align="center">
|
|
<img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-create-logical-condition.png"/>
|
|
</p>
|
|
|
|
Logical conditions allow us to accommodate complex real-world selection requirements:
|
|
|
|
- All Snowflake Tables that are in the Top 25% of most queried AND do not have a Domain
|
|
- All Looker Dashboards that do not have a description authored in Looker OR in DataHub
|
|
|
|
#### Step 2: Defining Rules
|
|
|
|
In the second step, we can define a set of conditions that selected assets must match in order to be "passing" the test.
|
|
To do so, we can construct another set of **Property** conditions (as described above).
|
|
|
|
> **Pro-Tip**: If no rules are supplied, then all assets that are selected by the criteria defined in Step 1 will be considered "passing".
|
|
> If you need to apply an automated Action to the selected assets, you can leave the Rules blank and continue to the next step.
|
|
|
|
<p align="center">
|
|
<img width="50%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-create-rules.png"/>
|
|
</p>
|
|
|
|
When combined with the selection criteria, Rules allow us to define complex, highly custom **Data Governance** policies such as:
|
|
|
|
- All datasets in the top 25% of query usage in the past 30 days **must have an owner**.
|
|
- All assets in the "Marketing" Domain **must have a description**
|
|
- All Snowflake Tables that are in the Top 25% of most queried AND do not have a Domain **must have
|
|
a Glossary Term from the Classification Term Group**
|
|
|
|
##### Validating Test Conditions
|
|
|
|
During Step 2, we can quickly verify that the Selection Criteria & Rules we've authored
|
|
match our expectations by testing them against some existing assets indexed by DataHub.
|
|
|
|
To verify your Test conditions, simply click **Try it out**, find an asset to test against by searching & filtering down your assets,
|
|
and finally click **Run Test** to see whether the asset is passes or fails the provided conditions.
|
|
|
|
<p align="center">
|
|
<img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-validate-conditions.png"/>
|
|
</p>
|
|
|
|
#### Step 3: Defining Actions (Optional)
|
|
|
|
> If you don't wish to take any actions for assets that pass or fail the test, simply click 'Skip'.
|
|
|
|
In the third step, we can define a set of Actions that will be automatically applied to each selected asset which passes or fails the Rules conditions.
|
|
|
|
For example, we may wish to mark **passing** assets with a special DataHub Tag or Glossary Term (e.g. "Tier 1"), or remove these special marking for those which are failing.
|
|
This allows us to automatically control classifications of data assets as they move in and out of compliance with the Rules defined in Step 2.
|
|
|
|
A few of the supported Action types include:
|
|
|
|
- Adding or removing specific Tags
|
|
- Adding or removing specific Glossary Terms
|
|
- Adding or removing specific Owners
|
|
- Adding or removing to a specific Domain
|
|
|
|
<p align="center">
|
|
<img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-define-actions.png"/>
|
|
</p>
|
|
|
|
#### Step 4: Name, Category, Description
|
|
|
|
In the final step, we can add a freeform name, category, and description for our new Metadata Test.
|
|
|
|
### Viewing Test Results
|
|
|
|
Metadata Test results can be viewed in 2 places:
|
|
|
|
1. On an asset profile page (e.g. Dataset profile page), inside the **Validation** tab.
|
|
2. On the Metadata Tests management page. To view all assets passing or failing a particular test,
|
|
simply click on the labels which showing the number of passing or failing assets.
|
|
|
|
<p align="center">
|
|
<img width="50%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-view-results.png"/>
|
|
</p>
|
|
|
|
### Updating an Existing Test
|
|
|
|
To update an existing Test, simply click **Edit** on the test you wish to change.
|
|
|
|
Then, make the changes required and click **Save**. When you save a Test, it may take up to 2 minutes for changes
|
|
to be reflected across DataHub.
|
|
|
|
### Removing a Test
|
|
|
|
To remove a Test, simply click on the trashcan icon located on the Tests list. This will remove the Test and
|
|
deactivate it so that it no is evaluated.
|
|
|
|
When you delete a Test, it may take up to 2 minutes for changes to be reflected.
|
|
|
|
### GraphQL
|
|
|
|
- [listTests](../../graphql/queries.md#listtests)
|
|
- [createTest](../../graphql/mutations.md#createtest)
|
|
- [deleteTest](../../graphql/mutations.md#deletetest)
|
|
|
|
## FAQ and Troubleshooting
|
|
|
|
**When are Metadata Tests evaluated?**
|
|
|
|
Metadata Tests are evaluated in 2 scenarios:
|
|
|
|
1. When an individual asset is changed in DataHub, all tests that include it in scope are evaluated
|
|
2. On a recurring cadence (usually every 24 hours) by a dedicated Metadata Test evaluator, which evaluates all tests against the Metadata Graph
|
|
|
|
**Can I configure a custom evaluation schedule for my Metadata Test?**
|
|
|
|
No, you cannot. Currently, the internal evaluator will ensure that tests are run continuously for
|
|
each asset, regardless of whether it is being changed on DataHub.
|
|
|
|
**How is a Metadata Test different from an Assertion?**
|
|
|
|
An Assertion is a specific test, similar to a unit test, that is defined for a single data asset. Typically,
|
|
it will include domain-specific knowledge about the asset and test against physical attributes of it. For example, an Assertion
|
|
may verify that the number of rows for a specific table in Snowflake falls into a well-defined range.
|
|
|
|
A Metadata Test is a broad spanning predicate which applies to a subset of the Metadata Graph (e.g. across multiple
|
|
data assets). Typically, it is defined against _metadata_ attributes, as opposed to the physical data itself. For example,
|
|
a Metadata Test may verify that ALL tables in Snowflake have at least 1 assigned owner, and a human-authored description.
|
|
Metadata Tests allow you to manage broad policies across your entire data ecosystem driven by metadata, for example to
|
|
augment a larger scale Data Governance initiative.
|