docs(): Announcing DataHub Open Assertions Specification (#10609)
Co-authored-by: John Joyce <john@ip-192-168-1-200.us-west-2.compute.internal>
Co-authored-by: John Joyce <john@Johns-MBP-465.lan>
Co-authored-by: John Joyce <john@Johns-MBP-466.lan>
Co-authored-by: John Joyce <john@Johns-MBP-478.lan>
Co-authored-by: John Joyce <john@Johns-MBP-499.lan>
Co-authored-by: John Joyce <john@Johns-MBP-514.lan>
This commit is contained in: parent 75f65dd88b, commit ea7b27b0e5

@@ -79,6 +79,18 @@ module.exports = {
          id: "docs/managed-datahub/observe/volume-assertions",
          className: "saasOnly",
        },
        {
          label: "Open Assertions Specification",
          type: "category",
          link: { type: "doc", id: "docs/assertions/open-assertions-spec" },
          items: [
            {
              label: "Snowflake",
              type: "doc",
              id: "docs/assertions/snowflake/snowflake_dmfs",
            },
          ],
        },
      ],
    },
    {

486 docs/assertions/open-assertions-spec.md (new file)
@@ -0,0 +1,486 @@

# DataHub Open Data Quality Assertions Specification

DataHub is developing an open-source Data Quality Assertions Specification & Compiler that will allow you to declare data quality checks / expectations / assertions using a simple, universal
YAML-based format, and then compile these into artifacts that can be registered or directly executed by 3rd-party Data Quality tools like [Snowflake DMFs](https://docs.snowflake.com/en/user-guide/data-quality-intro),
dbt tests, Great Expectations, or Acryl Cloud natively.

Ultimately, our goal is to provide a framework-agnostic, highly-portable format for defining Data Quality checks, making it seamless to swap out the underlying
assertion engine without service disruption for end consumers of the results of these data quality checks in cataloging tools like DataHub.

## Integrations

Currently, the DataHub Open Assertions Specification supports the following integrations:

- [Snowflake DMF Assertions](snowflake/snowflake_dmfs.md)

And is looking for contributions to build out support for the following integrations:

- [Looking for Contributions] dbt tests
- [Looking for Contributions] Great Expectations checks

Below, we'll look at how to define assertions in YAML, and then provide a usage overview for each supported integration.

## The Specification: Declaring Data Quality Assertions in YAML

The following assertion types are currently supported by the DataHub YAML Assertion spec:

- [Freshness](/docs/managed-datahub/observe/freshness-assertions.md)
- [Volume](/docs/managed-datahub/observe/volume-assertions.md)
- [Column](/docs/managed-datahub/observe/column-assertions.md)
- [Custom SQL](/docs/managed-datahub/observe/custom-sql-assertions.md)
- [Schema](/docs/managed-datahub/observe/schema-assertions.md)

Each assertion type aims to validate a different aspect of a structured table (e.g. a table in a data warehouse or data lake), from
structure to size to column integrity to custom metrics.

In this section, we'll go over examples of defining each.
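
Every assertion, regardless of type, shares the same top-level file structure. As a minimal sketch (using the same illustrative dataset URN as the examples below), a single YAML file can declare multiple assertions of different types:

```yaml
version: 1
assertions:
  # Each list entry declares one assertion against a dataset URN.
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: freshness
    # ... type-specific fields, as shown in the sections below ...
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: volume
    # ... type-specific fields, as shown in the sections below ...
```
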

### Freshness Assertions

Freshness Assertions allow you to verify that your data was updated within the expected timeframe.
Below you'll find examples of defining different types of freshness assertions via YAML.

#### Validating that Table is Updated Every 6 Hours

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: freshness
    lookback_interval: '6 hours'
    last_modified_field: updated_at
    schedule:
      type: interval
      interval: '6 hours' # Run every 6 hours
```

This assertion checks that the `purchase_events` table in the `test_db.public` schema was updated within the last 6 hours,
by issuing a query to the table which determines whether an update was made in the past 6 hours using the `updated_at` column.
To use this check, we must specify the field that contains the last modified timestamp of a given row.

The `lookback_interval` field is used to specify the "lookback window" for the assertion, whereas the `schedule` field is used to specify how often the assertion should be run.
This allows you to schedule the assertion to run at a different frequency than the lookback window, for example
to detect stale data as soon as it becomes "stale" by inspecting it more frequently.
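
For example, here is a sketch (using only fields shown above) of the same check evaluated every hour, so staleness is caught within roughly an hour of occurring while still using a 6-hour lookback window:

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: freshness
    lookback_interval: '6 hours' # Data must have been updated within the last 6 hours...
    last_modified_field: updated_at
    schedule:
      type: interval
      interval: '1 hour' # ...checked once per hour.
```
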

#### Supported Source Types

Currently, the only supported `sourceType` for Freshness Assertions is `LAST_MODIFIED_FIELD`. In the future,
we may support additional source types, such as `HIGH_WATERMARK`, along with data source-specific types such as
`AUDIT_LOG` and `INFORMATION_SCHEMA`.

### Volume Assertions

Volume Assertions allow you to verify that the number of records in your dataset meets your expectations.
Below you'll find examples of defining different types of volume assertions via YAML.

#### Validating that Table Row Count is in Expected Range

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: volume
    metric: 'row_count'
    condition:
      type: between
      min: 1000
      max: 10000
    # filters: "event_type = 'purchase'" Optionally add filters.
    schedule:
      type: on_table_change # Run when new data is added to the table.
```

This assertion checks that the `purchase_events` table in the `test_db.public` schema has between 1000 and 10000 records.
Using the `condition` field, you can specify the type of comparison to be made, and the `min` and `max` fields to specify the range of values to compare against.
Using the `filters` field, you can optionally specify a SQL WHERE clause to filter the records being counted.
Using the `schedule` field you can specify when the assertion should be run, either on a fixed schedule or when new data is added to the table.
The only metric currently supported is `row_count`.
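
As a sketch, here is the same range check with the `filters` option (shown commented out above) enabled, so that only purchase events are counted:

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: volume
    metric: 'row_count'
    condition:
      type: between
      min: 1000
      max: 10000
    filters: "event_type = 'purchase'" # Count only rows matching this WHERE clause.
    schedule:
      type: on_table_change
```
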

#### Validating that Table Row Count is Less Than Value

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: volume
    metric: 'row_count'
    condition:
      type: less_than_or_equal_to
      value: 1000
    # filters: "event_type = 'purchase'" Optionally add filters.
    schedule:
      type: on_table_change # Run when new data is added to the table.
```

#### Validating that Table Row Count is Greater Than Value

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: volume
    metric: 'row_count'
    condition:
      type: greater_than_or_equal_to
      value: 1000
    # filters: "event_type = 'purchase'" Optionally add filters.
    schedule:
      type: on_table_change # Run when new data is added to the table.
```

#### Supported Conditions

The full set of supported volume assertion conditions includes:

- `equal_to`
- `not_equal_to`
- `greater_than`
- `greater_than_or_equal_to`
- `less_than`
- `less_than_or_equal_to`
- `between`

### Column Assertions

Column Assertions allow you to verify that the values in a column meet your expectations.
Below you'll find examples of defining different types of column assertions via YAML.

The specification currently supports 2 types of Column Assertions:

- **Field Value**: Asserts that the values in a column meet a specific condition.
- **Field Metric**: Asserts that a specific metric aggregated across the values in a column meets a specific condition.

We'll go over examples of each below.

#### Field Values Assertion: Validating that All Column Values are In Expected Range

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: amount
    condition:
      type: between
      min: 0
      max: 10
    exclude_nulls: True
    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
    # failure_threshold:
    #   type: count
    #   value: 10
    schedule:
      type: on_table_change
```

This assertion checks that all values for the `amount` column in the `purchase_events` table in the `test_db.public` schema are between 0 and 10.
Using the `field` field, you can specify the column to be asserted on; using the `condition` field, you can specify the type of comparison to be made,
and the `min` and `max` fields to specify the range of values to compare against.
Using the `schedule` field you can specify when the assertion should be run, either on a fixed schedule or when new data is added to the table.
Using the `filters` field, you can optionally specify a SQL WHERE clause to filter the records being checked.
Using the `exclude_nulls` field, you can specify whether to exclude NULL values from the assertion, meaning that
NULL will simply be ignored if encountered, as opposed to failing the check.
Using the `failure_threshold`, we can set a threshold for the number of rows that can fail the assertion before the assertion is considered failed.
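
As a concrete sketch, here is the same assertion with the commented-out `failure_threshold` options from the example above enabled, tolerating up to 10 violating rows before the assertion fails:

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: amount
    condition:
      type: between
      min: 0
      max: 10
    exclude_nulls: True
    failure_threshold:
      type: count
      value: 10 # Up to 10 rows may violate the condition before the assertion fails.
    schedule:
      type: on_table_change
```
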

#### Field Values Assertion: Validating that All Column Values are In Expected Set

To validate a VARCHAR / STRING column that should contain one of a set of values:

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: product_id
    condition:
      type: in
      value:
        - 'product_1'
        - 'product_2'
        - 'product_3'
    exclude_nulls: False
    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
    # failure_threshold:
    #   type: count
    #   value: 10
    schedule:
      type: on_table_change
```

#### Field Values Assertion: Validating that All Column Values are Email Addresses

To validate that a string column contains valid email addresses:

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: email_address
    condition:
      type: matches_regex
      value: '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
    exclude_nulls: False
    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
    # failure_threshold:
    #   type: count
    #   value: 10
    schedule:
      type: on_table_change
```

#### Field Values Assertion: Supported Conditions

The full set of supported field value conditions is listed below; a sketch of a string-length check follows the list:

- `in`
- `not_in`
- `is_null`
- `is_not_null`
- `equal_to`
- `not_equal_to`
- `greater_than` # Numeric Only
- `greater_than_or_equal_to` # Numeric Only
- `less_than` # Numeric Only
- `less_than_or_equal_to` # Numeric Only
- `between` # Numeric Only
- `matches_regex` # String Only
- `not_empty` # String Only
- `length_greater_than` # String Only
- `length_less_than` # String Only
- `length_between` # String Only
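
None of the examples above exercise the string-length conditions. Here is a sketch of a `length_between` check; the `min` / `max` fields are an assumption, mirroring the numeric `between` condition:

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: product_id
    condition:
      type: length_between # String Only
      min: 1               # Assumed to mirror the numeric `between` fields.
      max: 64
    exclude_nulls: True
    schedule:
      type: on_table_change
```
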

#### Field Metric Assertion: Validating No Missing Values in Column

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: col_date
    metric: null_count
    condition:
      type: equal_to
      value: 0
    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
    schedule:
      type: on_table_change
```

This assertion ensures that the `col_date` column in the `purchase_events` table in the `test_db.public` schema has no NULL values.

#### Field Metric Assertion: Validating No Duplicates in Column

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: id
    metric: unique_percentage
    condition:
      type: equal_to
      value: 100
    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
    schedule:
      type: on_table_change
```

This assertion ensures that the `id` column in the `purchase_events` table in the `test_db.public` schema
has no duplicates, by checking that the unique percentage is 100%.

#### Field Metric Assertion: Validating String Column is Never Empty String

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: name
    metric: empty_percentage
    condition:
      type: equal_to
      value: 0
    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
    schedule:
      type: on_table_change
```

This assertion ensures that the `name` column in the `purchase_events` table in the `test_db.public` schema is never empty, by checking that the empty percentage is 0%.

#### Field Metric Assertion: Supported Metrics

The full set of supported field metrics includes:

- `null_count`
- `null_percentage`
- `unique_count`
- `unique_percentage`
- `empty_count`
- `empty_percentage`
- `min`
- `max`
- `mean`
- `median`
- `stddev`
- `negative_count`
- `negative_percentage`
- `zero_count`
- `zero_percentage`

#### Field Metric Assertion: Supported Conditions

The full set of supported field metric conditions is listed below; a sketch using the `between` condition follows the list:

- `equal_to`
- `not_equal_to`
- `greater_than`
- `greater_than_or_equal_to`
- `less_than`
- `less_than_or_equal_to`
- `between`
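
For example, here is a sketch combining a metric from the list above (`mean`) with the `between` condition, asserting that the average `amount` stays within an expected band (the thresholds are illustrative):

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: amount
    metric: mean
    condition:
      type: between
      min: 0
      max: 100
    schedule:
      type: on_table_change
```
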

### Custom SQL Assertions

Custom SQL Assertions allow you to define custom SQL queries to verify your data meets your expectations.
The only requirement is that the SQL query must return a single value, which will be compared against the expected value.
Below you'll find examples of defining different types of custom SQL assertions via YAML.

SQL Assertions are useful for more complex data quality checks that can't be easily expressed using the other assertion types,
and can be used to assert on custom metrics, complex aggregations, cross-table integrity checks (JOINS) or any other SQL-based data quality check.
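
For example, here is a sketch of a custom-metric check, asserting that the average purchase amount stays within an expected band (the column and thresholds are illustrative):

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: sql
    statement: |
      SELECT AVG(amount) FROM test_db.public.purchase_events
    condition:
      type: between
      min: 0
      max: 100
    schedule:
      type: interval
      interval: '6 hours'
```
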

#### Validating Foreign Key Integrity

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: sql
    statement: |
      SELECT COUNT(*)
      FROM test_db.public.purchase_events AS pe
      LEFT JOIN test_db.public.products AS p
        ON pe.product_id = p.id
      WHERE p.id IS NULL
    condition:
      type: equal_to
      value: 0
    schedule:
      type: interval
      interval: '6 hours' # Run every 6 hours
```

This assertion checks that the `purchase_events` table in the `test_db.public` schema has no rows where the `product_id` column does not have a corresponding `id` in the `products` table.

#### Comparing Row Counts Across Multiple Tables

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: sql
    statement: |
      SELECT (SELECT COUNT(*) FROM test_db.public.purchase_events)
        - (SELECT COUNT(*) FROM test_db.public.purchase_events_raw) AS row_count_difference
    condition:
      type: equal_to
      value: 0
    schedule:
      type: interval
      interval: '6 hours' # Run every 6 hours
```

This assertion checks that the number of rows in the `purchase_events` table exactly matches the number of rows in an upstream `purchase_events_raw` table,
by subtracting the row count of the raw table from the row count of the processed table.

#### Supported Conditions

The full set of supported custom SQL assertion conditions includes:

- `equal_to`
- `not_equal_to`
- `greater_than`
- `greater_than_or_equal_to`
- `less_than`
- `less_than_or_equal_to`
- `between`

### Schema Assertions (Coming Soon)

Schema Assertions allow you to verify that the schema of a table - its column names and their data types - meets your expectations.
Below you'll find examples of defining different types of schema assertions via YAML.

The specification currently supports 2 types of Schema Assertions:

- **Exact Match**: Asserts that the schema of a table - column names and their data types - exactly matches an expected schema.
- **Contains Match** (Subset): Asserts that the schema of a table - column names and their data types - contains the expected schema as a subset.

#### Validating Actual Schema Exactly Equals Expected Schema

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: schema
    condition:
      type: exact_match
      columns:
        - name: id
          type: INTEGER
        - name: product_id
          type: STRING
        - name: amount
          type: DECIMAL
        - name: updated_at
          type: TIMESTAMP
    schedule:
      type: interval
      interval: '6 hours' # Run every 6 hours
```

This assertion checks that the `purchase_events` table in the `test_db.public` schema has exactly the schema specified, with the exact column names and data types.

#### Validating Actual Schema Contains All of Expected Schema

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: schema
    condition:
      type: contains
      columns:
        - name: id
          type: integer
        - name: product_id
          type: string
        - name: amount
          type: number
    schedule:
      type: interval
      interval: '6 hours' # Run every 6 hours
```

This assertion checks that the `purchase_events` table in the `test_db.public` schema contains all of the columns specified in the expected schema, with the exact column names and data types.
The actual schema can also contain additional columns not specified in the expected schema.

#### Supported Data Types

The following high-level data types are currently supported by the Schema Assertion spec:

- string
- number
- boolean
- date
- timestamp
- struct
- array
- map
- union
- bytes
- enum

224 docs/assertions/snowflake/snowflake_dmfs.md (new file)
@@ -0,0 +1,224 @@

# Snowflake DMF Assertions [BETA]

The DataHub Open Assertion Compiler allows you to define your Data Quality assertions in a simple YAML format, and then compile them to be executed by Snowflake Data Metric Functions.
Once compiled, you'll be able to register the compiled DMFs in your Snowflake environment, and extract their results as part of your normal ingestion process for DataHub.
Results of Snowflake DMF assertions will be reported as normal Assertion Results, viewable on a historical timeline in the context
of the table with which they are associated.

## Prerequisites

- You must have a Snowflake Enterprise account, where the DMFs feature is enabled.
- You must have the necessary permissions to provision DMFs in your Snowflake environment (see below).
- You must have the necessary permissions to query the DMF results in your Snowflake environment (see below).
- You must have a DataHub instance with Snowflake metadata ingested. If you do not have existing Snowflake ingestion, refer to the [Snowflake Quickstart Guide](https://datahubproject.io/docs/quick-ingestion-guides/snowflake/overview) to get started.
- You must have the DataHub CLI installed and have run [`datahub init`](https://datahubproject.io/docs/cli/#init).

### Permissions

*Permissions required for registering DMFs*

According to the latest Snowflake docs, the service account performing the
DMF registration and ingestion must have the following privileges:

| Privilege                    | Object           | Notes                                                                                                                       |
|------------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------|
| USAGE                        | Database, schema | Database and schema where Snowflake DMFs will be created. This is configured in the compile command described below.        |
| CREATE FUNCTION              | Schema           | This privilege enables creating new DMFs in the schema configured in the compile command.                                    |
| EXECUTE DATA METRIC FUNCTION | Account          | This privilege enables you to control which roles have access to server-agnostic compute resources to call the system DMF.   |
| USAGE                        | Database, schema | These objects are the database and schema that contain the referenced table in the query.                                    |
| OWNERSHIP                    | Table            | This privilege enables you to associate a DMF with a referenced table.                                                       |
| USAGE                        | DMF              | This privilege enables calling the DMF in the schema configured in the compile command.                                      |

and the roles that must be granted:

| Role                       | Notes              |
|----------------------------|--------------------|
| SNOWFLAKE.DATA_METRIC_USER | To use System DMFs |

*Permissions required for running DMFs (scheduled DMFs run with the table owner's role)*

Because scheduled DMFs run with the role of the table owner, the table owner must have the following privileges:

| Privilege                    | Object           | Notes                                                                                                                       |
|------------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------|
| USAGE                        | Database, schema | Database and schema where Snowflake DMFs will be created. This is configured in the compile command described below.        |
| USAGE                        | DMF              | This privilege enables calling the DMF in the schema configured in the compile command.                                      |
| EXECUTE DATA METRIC FUNCTION | Account          | This privilege enables you to control which roles have access to server-agnostic compute resources to call the system DMF.   |

and the roles that must be granted:

| Role                       | Notes              |
|----------------------------|--------------------|
| SNOWFLAKE.DATA_METRIC_USER | To use System DMFs |

*Permissions required for querying DMF results*

In addition, the service account that will be executing DataHub ingestion, and querying the DMF results, must have been granted the following system application role:

| Role                            | Notes                       |
|---------------------------------|------------------------------|
| DATA_QUALITY_MONITORING_VIEWER  | Query the DMF results table |

To learn more about Snowflake DMFs and the privileges required to provision and query them, see the [Snowflake documentation](https://docs.snowflake.com/en/user-guide/data-quality-intro).

*Example: Granting Permissions*

```sql
-- set up permissions for <assertion-service-role> to create DMFs and associate DMFs with tables
grant usage on database "<dmf-database>" to role "<assertion-service-role>";
grant usage on schema "<dmf-database>.<dmf-schema>" to role "<assertion-service-role>";
grant create function on schema "<dmf-database>.<dmf-schema>" to role "<assertion-service-role>";
-- grant ownership + the rest of the permissions to <assertion-service-role>
grant role "<table-owner-role>" to role "<assertion-service-role>";

-- set up permissions for <table-owner-role> to run DMFs on a schedule
grant usage on database "<dmf-database>" to role "<table-owner-role>";
grant usage on schema "<dmf-database>.<dmf-schema>" to role "<table-owner-role>";
grant usage on all functions in schema "<dmf-database>.<dmf-schema>" to role "<table-owner-role>";
grant usage on future functions in schema "<dmf-database>.<dmf-schema>" to role "<table-owner-role>";
grant database role SNOWFLAKE.DATA_METRIC_USER to role "<table-owner-role>";
grant execute data metric function on account to role "<table-owner-role>";

-- set up permissions for <datahub-role> to query DMF results
grant application role SNOWFLAKE.DATA_QUALITY_MONITORING_VIEWER to role "<datahub-role>";
```

## Supported Assertion Types

The following assertion types are currently supported by the DataHub Snowflake DMF Assertion Compiler:

- [Freshness](/docs/managed-datahub/observe/freshness-assertions.md)
- [Volume](/docs/managed-datahub/observe/volume-assertions.md)
- [Column](/docs/managed-datahub/observe/column-assertions.md)
- [Custom SQL](/docs/managed-datahub/observe/custom-sql-assertions.md)

Note that Schema Assertions are not currently supported.

## Creating Snowflake DMF Assertions

The process for declaring and running assertions backed by Snowflake DMFs consists of a few steps, which are outlined
in the following sections.

### Step 1. Define your Data Quality assertions using Assertion YAML files

See the section **Declaring Assertions in YAML** below for examples of how to define assertions in YAML.
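
For instance, a minimal `examples/library/assertions_configuration.yml` might look like the following sketch (the dataset URN and threshold are illustrative):

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: volume
    metric: 'row_count'
    condition:
      type: less_than_or_equal_to
      value: 1000
    schedule:
      type: on_table_change
```
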

### Step 2. Register your assertions with DataHub

Use the DataHub CLI to register your assertions with DataHub, so they become visible in the DataHub UI:

```bash
datahub assertions upsert -f examples/library/assertions_configuration.yml
```

### Step 3. Compile the assertions into Snowflake DMFs using the DataHub CLI

Next, we'll use the `assertions compile` command to generate the SQL code for the Snowflake DMFs,
which can then be registered in Snowflake.

```bash
datahub assertions compile -f examples/library/assertions_configuration.yml -p snowflake -x DMF_SCHEMA=<db>.<schema-where-DMF-should-live>
```

Two files will be generated as output of running this command:

- `dmf_definitions.sql`: This file contains the SQL code for the DMFs that will be registered in Snowflake.
- `dmf_associations.sql`: This file contains the SQL code for associating the DMFs with the target tables in Snowflake.

By default, these are written to a folder called `target`. You can use the config option `-o <output_folder>` on the `compile` command to write these compiled artifacts to another folder.

Each of these artifacts will be important for the next steps in the process.

_dmf_definitions.sql_

This file stores the SQL code for the DMFs that will be registered in Snowflake, generated
from your YAML assertion definitions during the compile step.

```sql
-- Example dmf_definitions.sql

-- Start of Assertion 5c32eef47bd763fece7d21c7cbf6c659

CREATE or REPLACE DATA METRIC FUNCTION
test_db.datahub_dmfs.datahub__5c32eef47bd763fece7d21c7cbf6c659 (ARGT TABLE(col_date DATE))
RETURNS NUMBER
COMMENT = 'Created via DataHub for assertion urn:li:assertion:5c32eef47bd763fece7d21c7cbf6c659 of type volume'
AS
$$
select case when metric <= 1000 then 1 else 0 end from (select count(*) as metric from TEST_DB.PUBLIC.TEST_ASSERTIONS_ALL_TIMES)
$$;

-- End of Assertion 5c32eef47bd763fece7d21c7cbf6c659
....
```

_dmf_associations.sql_

This file stores the SQL code for associating the generated DMFs with their target tables,
along with scheduling them to run at particular times.

```sql
-- Example dmf_associations.sql

-- Start of Assertion 5c32eef47bd763fece7d21c7cbf6c659

ALTER TABLE TEST_DB.PUBLIC.TEST_ASSERTIONS_ALL_TIMES SET DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';
ALTER TABLE TEST_DB.PUBLIC.TEST_ASSERTIONS_ALL_TIMES ADD DATA METRIC FUNCTION test_db.datahub_dmfs.datahub__5c32eef47bd763fece7d21c7cbf6c659 ON (col_date);

-- End of Assertion 5c32eef47bd763fece7d21c7cbf6c659
....
```

### Step 4. Register the compiled DMFs in your Snowflake environment

Next, you'll need to run the generated SQL from the files output in Step 3 in Snowflake.

You can achieve this either by running the SQL files directly in the Snowflake UI, or by using the SnowSQL CLI tool:

```bash
snowsql -f dmf_definitions.sql
snowsql -f dmf_associations.sql
```

:::note
Scheduling Data Metric Functions on a table incurs Serverless Credit Usage in Snowflake. Refer to [Billing and Pricing](https://docs.snowflake.com/en/user-guide/data-quality-intro#billing-and-pricing) for more details.
Please ensure you DROP the Data Metric Functions created via `dmf_associations.sql` if the assertions are no longer in use.
:::

### Step 5. Run ingestion to report the results back into DataHub

Once you've registered the DMFs, they will be automatically executed, either when the target table is updated or on a fixed
schedule.

To report the results of the generated Data Quality assertions back into DataHub, you'll need to run the DataHub ingestion process with a special configuration
flag, `include_assertion_results: true`:

```yaml
# Your DataHub Snowflake Recipe
source:
  type: snowflake
  config:
    # ...
    include_assertion_results: True
    # ...
```

During ingestion, we will query for the latest DMF results stored in Snowflake, convert them into DataHub Assertion Results, and report them back into DataHub,
where they will be visible as normal assertions. Ingestion can be run either via the CLI or the UI:

`datahub ingest -c snowflake.yml`

## Caveats

- Snowflake currently supports at most 1000 DMF-to-table associations, so you cannot define more than 1000 Snowflake assertions.
- Snowflake does not currently allow JOIN queries or non-deterministic functions in DMF definitions, so you cannot use these in the SQL for a SQL assertion or in the `filters` section.
- All DMFs scheduled on a given table must currently follow the same exact schedule, so you cannot set assertions on the same table to run on different schedules.
- DMFs are currently only supported for regular tables, and not dynamic or external tables.

## FAQ

Coming soon!

@@ -485,4 +485,4 @@ metadataChangeProposal:
      maxAttempts: ${MCP_TIMESERIES_MAX_ATTEMPTS:1000}
      initialIntervalMs: ${MCP_TIMESERIES_INITIAL_INTERVAL_MS:100}
      multiplier: ${MCP_TIMESERIES_MULTIPLIER:10}
      maxIntervalMs: ${MCP_TIMESERIES_MAX_INTERVAL_MS:30000}