# DataHub Open Data Quality Assertions Specification
DataHub is developing an open-source Data Quality Assertions Specification & Compiler that will allow you to declare data quality checks / expectations / assertions using a simple, universal
YAML-based format, and then compile this into artifacts that can be registered or directly executed by third-party Data Quality tools such as [Snowflake DMFs](https://docs.snowflake.com/en/user-guide/data-quality-intro).
Ultimately, our goal is to provide a framework-agnostic, highly portable format for defining Data Quality checks, making it seamless to swap out the underlying
assertion engine without service disruption for end consumers of the results of these data quality checks in cataloging tools like DataHub.
## Integrations
Currently, the DataHub Open Assertions Specification supports the following integrations:
This assertion checks that the `purchase_events` table in the `test_db.public` schema was updated within the last 6 hours
by issuing a query against the table that uses the `updated_at` column to determine whether an update was made in the past 6 hours.
To use this check, we must specify the field that contains the last modified timestamp of a given row.
The `lookback_interval` field is used to specify the "lookback window" for the assertion, whereas the `schedule` field is used to specify how often the assertion should be run.
This allows you to schedule the assertion to run at a different frequency than the lookback window, for example
to detect stale data as soon as it becomes "stale" by inspecting it more frequently.
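
As an illustration, the freshness check described above might be declared along the following lines. This is a minimal sketch rather than a verified configuration: `lookback_interval` and `schedule` come from the text, while the top-level `version` and `assertions` keys, the `entity` URN (which assumes a Snowflake dataset), the `type: freshness` value, the `last_modified_field` key, and the `interval` schedule syntax are assumptions that should be checked against the current spec.

```yaml
version: 1
assertions:
  # Assumed URN for test_db.public.purchase_events on Snowflake.
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: freshness
    lookback_interval: '6 hours'      # The table must have changed within this window.
    last_modified_field: updated_at   # Column holding each row's last-modified timestamp.
    schedule:
      type: interval
      interval: '1 hour'              # Evaluate more often than the lookback window.
```

With an hourly schedule and a 6-hour lookback, staleness would be reported within about an hour of the table crossing the 6-hour mark, rather than up to 6 hours later.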
#### Supported Source Types
Currently, the only supported `sourceType` for Freshness Assertions is `LAST_MODIFIED_FIELD`. In the future,
we may support additional source types, such as `HIGH_WATERMARK`, along with data source-specific types such as
`AUDIT_LOG` and `INFORMATION_SCHEMA`.
### Volume Assertions
Volume Assertions allow you to verify that the number of records in your dataset meets your expectations.
Below you'll find examples of defining different types of volume assertions via YAML.
#### Validating that Table Row Count is in Expected Range
```yaml
# ...
schedule:
  type: on_table_change # Run when new data is added to the table.
```
This assertion checks that the `purchase_events` table in the `test_db.public` schema has between 1000 and 10000 records.
Using the `condition` field, you can specify the type of comparison to be made, and the `min` and `max` fields to specify the range of values to compare against.
Using the `filters` field, you can optionally specify a SQL WHERE clause to filter the records being counted.
Using the `schedule` field you can specify when the assertion should be run, either on a fixed schedule or when new data is added to the table.
The only metric currently supported is `row_count`.
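
Putting these fields together, a complete row-count range check might look roughly like the sketch below. The `condition`, `min`, `max`, `filters`, and `schedule` fields, the `row_count` metric, and the `on_table_change` schedule type are taken from the text above; the `version` and `assertions` wrapper, the `entity` URN (assuming Snowflake), the `type: volume` value, the nesting of `min` and `max` under `condition`, and the `event_type` column in the commented-out filter are assumptions for illustration.

```yaml
version: 1
assertions:
  # Assumed URN for test_db.public.purchase_events on Snowflake.
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: volume
    metric: row_count          # The only metric currently supported.
    condition:
      type: between
      min: 1000
      max: 10000
    # Optional SQL WHERE clause; event_type is a hypothetical column.
    # filters: "event_type = 'purchase'"
    schedule:
      type: on_table_change    # Run when new data is added to the table.
```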
#### Validating that Table Row Count is Less Than Value
This assertion checks that all values in the `amount` column of the `purchase_events` table in the `test_db.public` schema are between 0 and 10.
Using the `field` field, you can specify the column to be asserted on, and using the `condition` field, you can specify the type of comparison to be made,
and the `min` and `max` fields to specify the range of values to compare against.
Using the `schedule` field you can specify when the assertion should be run, either on a fixed schedule or when new data is added to the table.
Using the `filters` field, you can optionally specify a SQL WHERE clause to filter the records being evaluated.
Using the `exclude_nulls` field, you can specify whether to exclude NULL values from the assertion, meaning that
NULL will simply be ignored if encountered, as opposed to failing the check.
Using the `failure_threshold` field, you can set a threshold for the number of rows that may fail the check before the assertion itself is considered failed.
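
The column check described above might be written as in the following sketch. The `field`, `condition`, `min`, `max`, `schedule`, `filters`, `exclude_nulls`, and `failure_threshold` names come from the text; the `version` and `assertions` wrapper, the `entity` URN (assuming Snowflake), the `type: field` value, and the internal shape of `failure_threshold` (`type: count` with a `value`) are assumptions, and the threshold of 10 rows is purely illustrative.

```yaml
version: 1
assertions:
  # Assumed URN for test_db.public.purchase_events on Snowflake.
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: amount
    condition:
      type: between
      min: 0
      max: 10
    exclude_nulls: true        # NULL values are ignored rather than failing the check.
    failure_threshold:         # Assumed structure; verify against the spec.
      type: count
      value: 10                # Tolerate up to 10 failing rows (illustrative).
    schedule:
      type: on_table_change
```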
#### Field Values Assertion: Validating that All Column Values are In Expected Set
To validate a VARCHAR / STRING column that should contain one of a set of values:
This assertion ensures that the `name` column in the `purchase_events` table in the `test_db.public` schema is never empty, by checking that the empty percentage is 0%.
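
A metric-based check of this kind might be declared roughly as follows. The `name` column, the `empty_percentage` metric, and the `equal_to` condition appear in this document; the `version` and `assertions` wrapper, the `entity` URN (assuming Snowflake), the `type: field` value, the `metric` key, the `value` key, and the schedule are assumptions for illustration.

```yaml
version: 1
assertions:
  # Assumed URN for test_db.public.purchase_events on Snowflake.
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: name
    metric: empty_percentage   # Computed over the column, not evaluated row by row.
    condition:
      type: equal_to
      value: 0                 # 0% empty means the column is never empty.
    schedule:
      type: on_table_change
```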
#### Field Metric Assertion: Supported Metrics
The full set of supported field metrics includes:
- `null_count`
- `null_percentage`
- `unique_count`
- `unique_percentage`
- `empty_count`
- `empty_percentage`
- `min`
- `max`
- `mean`
- `median`
- `stddev`
- `negative_count`
- `negative_percentage`
- `zero_count`
- `zero_percentage`
#### Field Metric Assertion: Supported Conditions
The full set of supported field metric conditions includes:
- `equal_to`
- `not_equal_to`
- `greater_than`
- `greater_than_or_equal_to`
- `less_than`
- `less_than_or_equal_to`
- `between`
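
Any supported metric can in principle be paired with any supported condition. The sketch below declares two such pairings on the `amount` column in a single file. The metric and condition names are taken from the lists in this document; the surrounding keys (`version`, `assertions`, `entity`, `type: field`, `metric`, `value`, and the `min` / `max` nesting), the Snowflake URN, and the 5% bound are assumptions chosen for illustration.

```yaml
version: 1
assertions:
  # No negative amounts are allowed.
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: amount
    metric: negative_count
    condition:
      type: equal_to
      value: 0
    schedule:
      type: on_table_change
  # At most 5% of amounts may be NULL (illustrative bound).
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: amount
    metric: null_percentage
    condition:
      type: between
      min: 0
      max: 5
    schedule:
      type: on_table_change
```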
### Custom SQL Assertions
Custom SQL Assertions allow you to define custom SQL queries to verify your data meets your expectations.
The only requirement is that the SQL query must return a single value, which will be compared against the expected value.
Below you'll find examples of defining different types of custom SQL assertions via YAML.
SQL Assertions are useful for more complex data quality checks that can't be easily expressed using the other assertion types,
and can be used to assert on custom metrics, complex aggregations, cross-table integrity checks (JOINs), or any other SQL-based data quality check.
This assertion checks that the `purchase_events` table in the `test_db.public` schema has no rows where the `product_id` column does not have a corresponding `id` in the `products` table.
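
A referential-integrity check like this one could be sketched as shown below. The query returns a single value (a count of orphaned rows), which is then compared against 0. The table and column names come from the description above; the `version` and `assertions` wrapper, the `entity` URN (assuming Snowflake), the `type: sql` value, the `statement` key, the `value` key, the schedule, and the assumption that `products` also lives in `test_db.public` are not confirmed by this document.

```yaml
version: 1
assertions:
  # Assumed URN for test_db.public.purchase_events on Snowflake.
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: sql
    # Counts purchase events with no matching product.
    # Rows whose product_id is NULL are also counted, since they match nothing.
    statement: |
      SELECT COUNT(*)
      FROM test_db.public.purchase_events AS pe
      LEFT JOIN test_db.public.products AS p
        ON pe.product_id = p.id
      WHERE p.id IS NULL
    condition:
      type: equal_to
      value: 0
    schedule:
      type: interval
      interval: '6 hours'
```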
This assertion checks that the `purchase_events` table in the `test_db.public` schema has the exact schema as specified, with the exact column names and data types.
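
An exact-match schema check might be declared along these lines. The column names are ones mentioned elsewhere in this document, but the expected schema itself, the type names, and every key in the sketch (`type: schema`, `exact_match`, `columns`, `name`, `type`) are assumptions for illustration, as are the Snowflake URN and the schedule.

```yaml
version: 1
assertions:
  # Assumed URN for test_db.public.purchase_events on Snowflake.
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: schema
    condition:
      type: exact_match        # Assumed name: no missing and no extra columns.
      columns:                 # Hypothetical expected schema with placeholder types.
        - name: product_id
          type: STRING
        - name: amount
          type: DECIMAL
        - name: updated_at
          type: TIMESTAMP
    schedule:
      type: interval
      interval: '6 hours'
```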
#### Validating that Actual Schema Contains All of Expected Schema
This assertion checks that the `purchase_events` table in the `test_db.public` schema contains all of the columns specified in the expected schema, with the exact column names and data types.
The actual schema can also contain additional columns not specified in the expected schema.
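
A containment check of this kind might be sketched as follows. Only the listed columns are required to be present with the stated types, so a table with additional columns would still pass. The `contains` condition name, the other keys, the placeholder type names, and the Snowflake URN are assumptions rather than confirmed parts of the spec.

```yaml
version: 1
assertions:
  # Assumed URN for test_db.public.purchase_events on Snowflake.
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: schema
    condition:
      type: contains           # Assumed name: extra columns in the table are allowed.
      columns:                 # Hypothetical required subset with placeholder types.
        - name: product_id
          type: STRING
        - name: amount
          type: DECIMAL
    schedule:
      type: interval
      interval: '6 hours'
```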
#### Supported Data Types
The following high-level data types are currently supported by the Schema Assertion spec: