mirror of
https://github.com/datahub-project/datahub.git
synced 2025-10-23 23:13:57 +00:00
# DataHub Open Data Quality Assertions Specification

DataHub is developing an open-source Data Quality Assertions Specification & Compiler that will allow you to declare data quality checks / expectations / assertions using a simple, universal
YAML-based format, and then compile this into artifacts that can be registered or directly executed by 3rd party Data Quality tools like [Snowflake DMFs](https://docs.snowflake.com/en/user-guide/data-quality-intro),
dbt tests, Great Expectations or DataHub Cloud natively.

Ultimately, our goal is to provide a framework-agnostic, highly-portable format for defining Data Quality checks, making it seamless to swap out the underlying
assertion engine without service disruption for end consumers of the results of these data quality checks in cataloging tools like DataHub.

## Integrations

Currently, the DataHub Open Assertions Specification supports the following integrations:

- [Snowflake DMF Assertions](snowflake/snowflake_dmfs.md)

We are looking for contributions to build out support for the following integrations:

- [Looking for Contributions] dbt tests
- [Looking for Contributions] Great Expectations checks

Below, we'll look at how to define assertions in YAML, and then provide a usage overview for each supported integration.

## The Specification: Declaring Data Quality Assertions in YAML

The following assertion types are currently supported by the DataHub YAML Assertion spec:

- [Freshness](/docs/managed-datahub/observe/freshness-assertions.md)
- [Volume](/docs/managed-datahub/observe/volume-assertions.md)
- [Column](/docs/managed-datahub/observe/column-assertions.md)
- [Custom SQL](/docs/managed-datahub/observe/custom-sql-assertions.md)
- [Schema](/docs/managed-datahub/observe/schema-assertions.md)

Each assertion type aims to validate a different aspect of a structured table (e.g. on a data warehouse or data lake), from
structure to size to column integrity to custom metrics.

In this section, we'll go over examples of defining each.

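Because `assertions` is a YAML list, a single file can presumably declare several assertions at once. The sketch below assumes the compiler accepts more than one list entry; it only combines fields that appear in the individual examples in this section.

```yaml
version: 1
assertions:
  # Freshness check on the table.
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: freshness
    lookback_interval: "6 hours"
    last_modified_field: updated_at
    schedule:
      type: interval
      interval: "6 hours"
  # Volume check on the same table (assumed to be allowed in the same file).
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: volume
    metric: "row_count"
    condition:
      type: greater_than_or_equal_to
      value: 1000
    schedule:
      type: on_table_change
```
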
### Freshness Assertions

Freshness Assertions allow you to verify that your data was updated within the expected timeframe.
Below you'll find examples of defining different types of freshness assertions via YAML.

#### Validating that Table is Updated Every 6 Hours

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: freshness
    lookback_interval: "6 hours"
    last_modified_field: updated_at
    schedule:
      type: interval
      interval: "6 hours" # Run every 6 hours
```

This assertion checks that the `purchase_events` table in the `test_db.public` schema was updated within the last 6 hours
by issuing a query to the table that uses the `updated_at` column to determine whether an update was made in the past 6 hours.
To use this check, we must specify the field that contains the last modified timestamp of a given row.

The `lookback_interval` field is used to specify the "lookback window" for the assertion, whereas the `schedule` field is used to specify how often the assertion should be run.
This allows you to schedule the assertion to run at a different frequency than the lookback window, for example
to detect stale data as soon as it becomes "stale" by inspecting it more frequently.

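To illustrate the difference between the lookback window and the schedule, here is a sketch that keeps the 6-hour lookback window but runs the check every hour. The `"1 hour"` interval string is an assumption made by analogy with `"6 hours"`; the accepted duration formats are not listed here.

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: freshness
    lookback_interval: "6 hours" # Data must have been updated within the last 6 hours
    last_modified_field: updated_at
    schedule:
      type: interval
      interval: "1 hour" # Assumed format: evaluate hourly to catch staleness sooner
```

With this configuration, a table that stops receiving updates would be flagged at most about an hour after it crosses the 6-hour threshold, rather than up to 6 hours later.
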
#### Supported Source Types

Currently, the only supported `sourceType` for Freshness Assertions is `LAST_MODIFIED_FIELD`. In the future,
we may support additional source types, such as `HIGH_WATERMARK`, along with data source-specific types such as
`AUDIT_LOG` and `INFORMATION_SCHEMA`.

### Volume Assertions

Volume Assertions allow you to verify that the number of records in your dataset meets your expectations.
Below you'll find examples of defining different types of volume assertions via YAML.

#### Validating that Table Row Count is in Expected Range

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: volume
    metric: "row_count"
    condition:
      type: between
      min: 1000
      max: 10000
    # filters: "event_type = 'purchase'" Optionally add filters.
    schedule:
      type: on_table_change # Run when new data is added to the table.
```

This assertion checks that the `purchase_events` table in the `test_db.public` schema has between 1000 and 10000 records.
Using the `condition` field, you can specify the type of comparison to be made, and using its `min` and `max` fields, the range of values to compare against.
Using the `filters` field, you can optionally specify a SQL WHERE clause to filter the records being counted.
Using the `schedule` field, you can specify when the assertion should be run, either on a fixed schedule or when new data is added to the table.
The only metric currently supported is `row_count`.

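The `filters` option only appears commented out in the example above. A minimal sketch with the filter enabled, assuming `filters` sits at the assertion level alongside `condition` and `schedule`:

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: volume
    metric: "row_count"
    condition:
      type: between
      min: 1000
      max: 10000
    filters: "event_type = 'purchase'" # Count only rows matching this WHERE clause
    schedule:
      type: on_table_change
```
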
#### Validating that Table Row Count is Less Than or Equal to Value

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: volume
    metric: "row_count"
    condition:
      type: less_than_or_equal_to
      value: 1000
    # filters: "event_type = 'purchase'" Optionally add filters.
    schedule:
      type: on_table_change # Run when new data is added to the table.
```

#### Validating that Table Row Count is Greater Than or Equal to Value

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: volume
    metric: "row_count"
    condition:
      type: greater_than_or_equal_to
      value: 1000
    # filters: "event_type = 'purchase'" Optionally add filters.
    schedule:
      type: on_table_change # Run when new data is added to the table.
```

#### Supported Conditions

The full set of supported volume assertion conditions includes:

- `equal_to`
- `not_equal_to`
- `greater_than`
- `greater_than_or_equal_to`
- `less_than`
- `less_than_or_equal_to`
- `between`

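As one more illustration of these conditions, the following sketch checks that the table is never empty by using the strict `greater_than` condition. It assumes `greater_than` takes a single `value`, in the same way as the `greater_than_or_equal_to` example above.

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: volume
    metric: "row_count"
    condition:
      type: greater_than
      value: 0 # Fail if the table has no rows at all
    schedule:
      type: interval
      interval: "6 hours"
```
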
### Column Assertions

Column Assertions allow you to verify that the values in a column meet your expectations.
Below you'll find examples of defining different types of column assertions via YAML.

The specification currently supports 2 types of Column Assertions:

- **Field Value**: Asserts that the values in a column meet a specific condition.
- **Field Metric**: Asserts that a specific metric aggregated across the values in a column meets a specific condition.

We'll go over examples of each below.

#### Field Values Assertion: Validating that All Column Values are In Expected Range

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: amount
    condition:
      type: between
      min: 0
      max: 10
    exclude_nulls: True
    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
    # failure_threshold:
    #   type: count
    #   value: 10
    schedule:
      type: on_table_change
```

This assertion checks that all values in the `amount` column of the `purchase_events` table in the `test_db.public` schema are between 0 and 10.
Using the `field` field, you can specify the column to be asserted on, and using the `condition` field, you can specify the type of comparison to be made,
with its `min` and `max` fields giving the range of values to compare against.
Using the `schedule` field, you can specify when the assertion should be run, either on a fixed schedule or when new data is added to the table.
Using the `filters` field, you can optionally specify a SQL WHERE clause to filter the records being checked.
Using the `exclude_nulls` field, you can specify whether to exclude NULL values from the assertion, meaning that
NULL will simply be ignored if encountered, as opposed to failing the check.
Using the `failure_threshold` field, you can set a threshold for the number of rows that can fail the assertion before the assertion is considered failed.

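The `failure_threshold` option is only shown commented out above. A sketch with it enabled, assuming it sits at the assertion level and that `type: count` with `value: 10` tolerates up to 10 offending rows (the exact boundary behaviour is not stated here):

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: amount
    condition:
      type: between
      min: 0
      max: 10
    exclude_nulls: True
    failure_threshold:
      type: count
      value: 10 # Tolerate a small number of out-of-range rows
    schedule:
      type: on_table_change
```
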
#### Field Values Assertion: Validating that All Column Values are In Expected Set

To validate a VARCHAR / STRING column that should contain one of a set of values:

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: product_id
    condition:
      type: in
      value:
        - "product_1"
        - "product_2"
        - "product_3"
    exclude_nulls: False
    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
    # failure_threshold:
    #   type: count
    #   value: 10
    schedule:
      type: on_table_change
```

#### Field Values Assertion: Validating that All Column Values are Email Addresses

To validate that a string column contains valid email addresses:

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: email_address
    condition:
      type: matches_regex
      # Single quotes keep the backslash literal; "\." is not a valid escape in a double-quoted YAML string.
      value: '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
    exclude_nulls: False
    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
    # failure_threshold:
    #   type: count
    #   value: 10
    schedule:
      type: on_table_change
```

#### Field Values Assertion: Supported Conditions

The full set of supported field value conditions includes:

- `in`
- `not_in`
- `is_null`
- `is_not_null`
- `equal_to`
- `not_equal_to`
- `greater_than` (numeric only)
- `greater_than_or_equal_to` (numeric only)
- `less_than` (numeric only)
- `less_than_or_equal_to` (numeric only)
- `between` (numeric only)
- `matches_regex` (string only)
- `not_empty` (string only)
- `length_greater_than` (string only)
- `length_less_than` (string only)
- `length_between` (string only)

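As an example of one of the conditions not shown above, this sketch uses `not_in` to reject placeholder values. It assumes `not_in` takes a list under `value`, mirroring the `in` example; the placeholder strings themselves are hypothetical.

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: product_id
    condition:
      type: not_in
      value: # Hypothetical placeholder values that should never appear
        - "unknown"
        - "test_product"
    exclude_nulls: False
    schedule:
      type: on_table_change
```
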
#### Field Metric Assertion: Validating No Missing Values in Column

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: col_date
    metric: null_count
    condition:
      type: equal_to
      value: 0
    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
    schedule:
      type: on_table_change
```

This assertion ensures that the `col_date` column in the `purchase_events` table in the `test_db.public` schema has no NULL values.

#### Field Metric Assertion: Validating No Duplicates in Column

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: id
    metric: unique_percentage
    condition:
      type: equal_to
      value: 100
    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
    schedule:
      type: on_table_change
```

This assertion ensures that the `id` column in the `purchase_events` table in the `test_db.public` schema
has no duplicates, by checking that the unique percentage is 100%.

#### Field Metric Assertion: Validating String Column is Never Empty String

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: name
    metric: empty_percentage
    condition:
      type: equal_to
      value: 0
    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
    schedule:
      type: on_table_change
```

This assertion ensures that the `name` column in the `purchase_events` table in the `test_db.public` schema is never empty, by checking that the empty percentage is 0%.

#### Field Metric Assertion: Supported Metrics

The full set of supported field metrics includes:

- `null_count`
- `null_percentage`
- `unique_count`
- `unique_percentage`
- `empty_count`
- `empty_percentage`
- `min`
- `max`
- `mean`
- `median`
- `stddev`
- `negative_count`
- `negative_percentage`
- `zero_count`
- `zero_percentage`

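The numeric metrics follow the same pattern as the examples above. As a sketch, the following uses `negative_count` to require that the `amount` column never contains a negative value; choosing `amount` as the target column is an illustrative assumption.

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: amount
    metric: negative_count
    condition:
      type: equal_to
      value: 0 # No negative amounts allowed
    schedule:
      type: on_table_change
```
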
#### Field Metric Assertion: Supported Conditions

The full set of supported field metric conditions includes:

- `equal_to`
- `not_equal_to`
- `greater_than`
- `greater_than_or_equal_to`
- `less_than`
- `less_than_or_equal_to`
- `between`

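Conditions other than `equal_to` allow some tolerance in a metric. A sketch, assuming percentages are expressed on a 0 to 100 scale (as the `unique_percentage` example with `value: 100` suggests), that permits at most 5% NULL values in `email_address`:

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: email_address
    metric: null_percentage
    condition:
      type: less_than_or_equal_to
      value: 5 # Assumed to mean 5%, not 0.05
    schedule:
      type: on_table_change
```
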
### Custom SQL Assertions

Custom SQL Assertions allow you to define custom SQL queries to verify your data meets your expectations.
The only requirement is that the SQL query must return a single value, which will be compared against the expected value.
Below you'll find examples of defining different types of custom SQL assertions via YAML.

SQL Assertions are useful for more complex data quality checks that can't be easily expressed using the other assertion types,
and can be used to assert on custom metrics, complex aggregations, cross-table integrity checks (JOINS) or any other SQL-based data quality check.

#### Validating Foreign Key Integrity

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: sql
    statement: |
      SELECT COUNT(*)
      FROM test_db.public.purchase_events AS pe
      LEFT JOIN test_db.public.products AS p
      ON pe.product_id = p.id
      WHERE p.id IS NULL
    condition:
      type: equal_to
      value: 0
    schedule:
      type: interval
      interval: "6 hours" # Run every 6 hours
```

This assertion checks that the `purchase_events` table in the `test_db.public` schema has no rows where the `product_id` column does not have a corresponding `id` in the `products` table.

#### Comparing Row Counts Across Multiple Tables

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: sql
    statement: |
      SELECT
        (SELECT COUNT(*) FROM test_db.public.purchase_events)
        - (SELECT COUNT(*) FROM test_db.public.purchase_events_raw) AS row_count_difference
    condition:
      type: equal_to
      value: 0
    schedule:
      type: interval
      interval: "6 hours" # Run every 6 hours
```

This assertion checks that the number of rows in the `purchase_events` table exactly matches the number of rows in an upstream `purchase_events_raw` table
by subtracting the row count of the raw table from the row count of the processed table.

#### Supported Conditions

The full set of supported custom SQL assertion conditions includes:

- `equal_to`
- `not_equal_to`
- `greater_than`
- `greater_than_or_equal_to`
- `less_than`
- `less_than_or_equal_to`
- `between`

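A range condition can be applied to any single value returned by the query. The sketch below asserts on a custom metric, the average purchase amount, and assumes that `between` takes `min` and `max` here just as it does for volume and column assertions; the bounds are illustrative.

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: sql
    statement: |
      SELECT AVG(amount)
      FROM test_db.public.purchase_events
    condition:
      type: between
      min: 1 # Illustrative bounds for the average amount
      max: 10
    schedule:
      type: interval
      interval: "6 hours"
```
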
### Schema Assertions (Coming Soon)

Schema Assertions allow you to verify that the schema of a table (its column names and their data types) matches your expectations.
Below you'll find examples of defining different types of schema assertions via YAML.

The specification currently supports 2 types of Schema Assertions:

- **Exact Match**: Asserts that the schema of a table - column names and their data types - exactly matches an expected schema
- **Contains Match** (Subset): Asserts that the expected schema - column names and their data types - is a subset of the schema of a table

#### Validating Actual Schema Exactly Equals Expected Schema

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: schema
    condition:
      type: exact_match
      columns:
        - name: id
          type: INTEGER
        - name: product_id
          type: STRING
        - name: amount
          type: DECIMAL
        - name: updated_at
          type: TIMESTAMP
    schedule:
      type: interval
      interval: "6 hours" # Run every 6 hours
```

This assertion checks that the `purchase_events` table in the `test_db.public` schema has the exact schema as specified, with the exact column names and data types.

#### Validating Actual Schema Contains All of Expected Schema

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: schema
    condition:
      type: contains
      columns:
        - name: id
          type: integer
        - name: product_id
          type: string
        - name: amount
          type: number
    schedule:
      type: interval
      interval: "6 hours" # Run every 6 hours
```

This assertion checks that the `purchase_events` table in the `test_db.public` schema contains all of the columns specified in the expected schema, with the exact column names and data types.
The actual schema can also contain additional columns not specified in the expected schema.

#### Supported Data Types

The following high-level data types are currently supported by the Schema Assertion spec:

- string
- number
- boolean
- date
- timestamp
- struct
- array
- map
- union
- bytes
- enum

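As a sketch that uses only the high-level type names listed above, the following `contains` assertion expects three columns. The mapping of each column to a type is an illustrative assumption, not a statement about the real `purchase_events` table.

```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: schema
    condition:
      type: contains
      columns:
        - name: id
          type: number # Assumed mapping for a numeric key
        - name: product_id
          type: string
        - name: updated_at
          type: timestamp
    schedule:
      type: interval
      interval: "6 hours"
```
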