docs(): Announcing DataHub Open Assertions Specification (#10609)

Co-authored-by: John Joyce <john@ip-192-168-1-200.us-west-2.compute.internal>
Co-authored-by: John Joyce <john@Johns-MBP-465.lan>
Co-authored-by: John Joyce <john@Johns-MBP-466.lan>
Co-authored-by: John Joyce <john@Johns-MBP-478.lan>
Co-authored-by: John Joyce <john@Johns-MBP-499.lan>
Co-authored-by: John Joyce <john@Johns-MBP-514.lan>
John Joyce 2024-06-12 10:52:22 -07:00 committed by GitHub
parent 75f65dd88b
commit ea7b27b0e5
4 changed files with 723 additions and 1 deletion

View File

@@ -79,6 +79,18 @@ module.exports = {
id: "docs/managed-datahub/observe/volume-assertions",
className: "saasOnly",
},
{
label: "Open Assertions Specification",
type: "category",
link: { type: "doc", id: "docs/assertions/open-assertions-spec" },
items: [
{
label: "Snowflake",
type: "doc",
id: "docs/assertions/snowflake/snowflake_dmfs",
},
],
},
],
},
{

View File

@@ -0,0 +1,486 @@
# DataHub Open Data Quality Assertions Specification
DataHub is developing an open-source Data Quality Assertions Specification & Compiler that will allow you to declare data quality checks / expectations / assertions using a simple, universal
YAML-based format, and then compile this into artifacts that can be registered or directly executed by 3rd party Data Quality tools like [Snowflake DMFs](https://docs.snowflake.com/en/user-guide/data-quality-intro),
dbt tests, Great Expectations or Acryl Cloud natively.
Ultimately, our goal is to provide a framework-agnostic, highly portable format for defining Data Quality checks, making it seamless to swap out the underlying
assertion engine without service disruption for end consumers of the results of these data quality checks in cataloging tools like DataHub.
## Integrations
Currently, the DataHub Open Assertions Specification supports the following integrations:
- [Snowflake DMF Assertions](snowflake/snowflake_dmfs.md)
We are looking for contributions to build out support for the following integrations:
- [Looking for Contributions] dbt tests
- [Looking for Contributions] Great Expectations checks
Below, we'll look at how to define assertions in YAML, and then provide a usage overview for each supported integration.
## The Specification: Declaring Data Quality Assertions in YAML
The following assertion types are currently supported by the DataHub YAML Assertion spec:
- [Freshness](/docs/managed-datahub/observe/freshness-assertions.md)
- [Volume](/docs/managed-datahub/observe/volume-assertions.md)
- [Column](/docs/managed-datahub/observe/column-assertions.md)
- [Custom SQL](/docs/managed-datahub/observe/custom-sql-assertions.md)
- [Schema](/docs/managed-datahub/observe/schema-assertions.md)
Each assertion type aims to validate a different aspect of a structured table (e.g., a table in a data warehouse or data lake), from
structure to size to column integrity to custom metrics.
In this section, we'll go over examples of defining each.
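Assertions are declared in versioned YAML files, and multiple assertions (of any mix of types) can live in a single file. As a rough sketch of the overall file shape, combining the freshness and volume examples detailed below:
```yaml
version: 1
assertions:
  # Freshness: the table should be updated at least every 6 hours.
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: freshness
    lookback_interval: '6 hours'
    last_modified_field: updated_at
    schedule:
      type: interval
      interval: '6 hours'
  # Volume: the table should contain between 1000 and 10000 rows.
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: volume
    metric: 'row_count'
    condition:
      type: between
      min: 1000
      max: 10000
    schedule:
      type: on_table_change
```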
### Freshness Assertions
Freshness Assertions allow you to verify that your data was updated within the expected timeframe.
Below you'll find examples of defining different types of freshness assertions via YAML.
#### Validating that Table is Updated Every 6 Hours
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: freshness
    lookback_interval: '6 hours'
    last_modified_field: updated_at
    schedule:
      type: interval
      interval: '6 hours' # Run every 6 hours
```
This assertion checks that the `purchase_events` table in the `test_db.public` schema was updated within the last 6 hours
by issuing a query to the table which determines whether an update was made, based on the `updated_at` column, in the past 6 hours.
To use this check, we must specify the field that contains the last modified timestamp of a given row.
The `lookback_interval` field is used to specify the "lookback window" for the assertion, whereas the `schedule` field is used to specify how often the assertion should be run.
This allows you to schedule the assertion to run at a different frequency than the lookback window, for example
to detect stale data as soon as it becomes "stale" by inspecting it more frequently.
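For example, here's a sketch of the same assertion evaluated every hour, so that staleness is caught shortly after the 6-hour window is breached rather than up to 6 hours later:
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: freshness
    lookback_interval: '6 hours' # Data must have been updated in the last 6 hours...
    last_modified_field: updated_at
    schedule:
      type: interval
      interval: '1 hour' # ...but check this once per hour.
```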
#### Supported Source Types
Currently, the only supported `sourceType` for Freshness Assertions is `LAST_MODIFIED_FIELD`. In the future,
we may support additional source types, such as `HIGH_WATERMARK`, along with data source-specific types such as
`AUDIT_LOG` and `INFORMATION_SCHEMA`.
### Volume Assertions
Volume Assertions allow you to verify that the number of records in your dataset meets your expectations.
Below you'll find examples of defining different types of volume assertions via YAML.
#### Validating that Table Row Count is in Expected Range
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: volume
    metric: 'row_count'
    condition:
      type: between
      min: 1000
      max: 10000
    # filters: "event_type = 'purchase'" Optionally add filters.
    schedule:
      type: on_table_change # Run when new data is added to the table.
```
This assertion checks that the `purchase_events` table in the `test_db.public` schema has between 1000 and 10000 records.
Using the `condition` field, you can specify the type of comparison to be made, and the `min` and `max` fields to specify the range of values to compare against.
Using the `filters` field, you can optionally specify a SQL WHERE clause to filter the records being counted.
Using the `schedule` field you can specify when the assertion should be run, either on a fixed schedule or when new data is added to the table.
The only metric currently supported is `row_count`.
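For instance, here's a sketch of the same row-count check restricted to purchase rows via a `filters` clause, and evaluated on a fixed interval instead of on table change:
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: volume
    metric: 'row_count'
    condition:
      type: between
      min: 1000
      max: 10000
    filters: "event_type = 'purchase'" # Only count purchase rows.
    schedule:
      type: interval
      interval: '6 hours' # Run every 6 hours.
```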
#### Validating that Table Row Count is Less Than Value
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: volume
    metric: 'row_count'
    condition:
      type: less_than_or_equal_to
      value: 1000
    # filters: "event_type = 'purchase'" Optionally add filters.
    schedule:
      type: on_table_change # Run when new data is added to the table.
```
#### Validating that Table Row Count is Greater Than Value
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: volume
    metric: 'row_count'
    condition:
      type: greater_than_or_equal_to
      value: 1000
    # filters: "event_type = 'purchase'" Optionally add filters.
    schedule:
      type: on_table_change # Run when new data is added to the table.
```
#### Supported Conditions
The full set of supported volume assertion conditions include:
- `equal_to`
- `not_equal_to`
- `greater_than`
- `greater_than_or_equal_to`
- `less_than`
- `less_than_or_equal_to`
- `between`
### Column Assertions
Column Assertions allow you to verify that the values in a column meet your expectations.
Below you'll find examples of defining different types of column assertions via YAML.
The specification currently supports 2 types of Column Assertions:
- **Field Value**: Asserts that the values in a column meet a specific condition.
- **Field Metric**: Asserts that a specific metric, aggregated across the values in a column, meets a specific condition.
We'll go over examples of each below.
#### Field Values Assertion: Validating that All Column Values are In Expected Range
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: amount
    condition:
      type: between
      min: 0
      max: 10
    exclude_nulls: True
    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
    # failure_threshold:
    #   type: count
    #   value: 10
    schedule:
      type: on_table_change
```
This assertion checks that all values for the `amount` column in the `purchase_events` table in the `test_db.public` schema fall between 0 and 10.
Using the `field` field, you can specify the column to be asserted on; using the `condition` field, you can specify the type of comparison to be made,
with the `min` and `max` fields specifying the range of values to compare against.
Using the `schedule` field you can specify when the assertion should be run, either on a fixed schedule or when new data is added to the table.
Using the `filters` field, you can optionally specify a SQL WHERE clause to filter the records being counted.
Using the `exclude_nulls` field, you can specify whether to exclude NULL values from the assertion, meaning that
NULLs will simply be ignored if encountered, as opposed to failing the check.
Using the `failure_threshold` field, we can set a threshold for the number of rows that can fail the assertion before the assertion as a whole is considered failed.
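For example, here's a sketch of the same range check that tolerates up to 10 failing rows before the assertion as a whole fails, using the `count` threshold type shown commented out above:
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: amount
    condition:
      type: between
      min: 0
      max: 10
    exclude_nulls: True
    failure_threshold:
      type: count
      value: 10 # Fail only once more than 10 rows violate the condition.
    schedule:
      type: on_table_change
```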
#### Field Values Assertion: Validating that All Column Values are In Expected Set
To validate a VARCHAR / STRING column that should contain one of a set of values:
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: product_id
    condition:
      type: in
      value:
        - 'product_1'
        - 'product_2'
        - 'product_3'
    exclude_nulls: False
    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
    # failure_threshold:
    #   type: count
    #   value: 10
    schedule:
      type: on_table_change
```
#### Field Values Assertion: Validating that All Column Values are Email Addresses
To validate that a string column contains valid email addresses:
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: email_address
    condition:
      type: matches_regex
      value: '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
    exclude_nulls: False
    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
    # failure_threshold:
    #   type: count
    #   value: 10
    schedule:
      type: on_table_change
```
#### Field Values Assertion: Supported Conditions
The full set of supported field value conditions include:
- `in`
- `not_in`
- `is_null`
- `is_not_null`
- `equal_to`
- `not_equal_to`
- `greater_than` # Numeric Only
- `greater_than_or_equal_to` # Numeric Only
- `less_than` # Numeric Only
- `less_than_or_equal_to` # Numeric Only
- `between` # Numeric Only
- `matches_regex` # String Only
- `not_empty` # String Only
- `length_greater_than` # String Only
- `length_less_than` # String Only
- `length_between` # String Only
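To illustrate the string-only conditions, here's a sketch asserting that the `name` column always contains between 1 and 50 characters; this assumes `length_between` takes the same `min` / `max` shape as `between`:
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: name
    condition:
      type: length_between
      min: 1
      max: 50
    exclude_nulls: True
    schedule:
      type: on_table_change
```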
#### Field Metric Assertion: Validating No Missing Values in Column
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: col_date
    metric: null_count
    condition:
      type: equal_to
      value: 0
    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
    schedule:
      type: on_table_change
```
This assertion ensures that the `col_date` column in the `purchase_events` table in the `test_db.public` schema has no NULL values.
#### Field Metric Assertion: Validating No Duplicates in Column
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: id
    metric: unique_percentage
    condition:
      type: equal_to
      value: 100
    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
    schedule:
      type: on_table_change
```
This assertion ensures that the `id` column in the `purchase_events` table in the `test_db.public` schema
has no duplicates, by checking that the unique percentage is 100%.
#### Field Metric Assertion: Validating String Column is Never Empty String
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: name
    metric: empty_percentage
    condition:
      type: equal_to
      value: 0
    # filters: "event_type = 'purchase'" Optionally add filters for Column Assertion.
    schedule:
      type: on_table_change
```
This assertion ensures that the `name` column in the `purchase_events` table in the `test_db.public` schema is never empty, by checking that the empty percentage is 0%.
#### Field Metric Assertion: Supported Metrics
The full set of supported field metrics include:
- `null_count`
- `null_percentage`
- `unique_count`
- `unique_percentage`
- `empty_count`
- `empty_percentage`
- `min`
- `max`
- `mean`
- `median`
- `stddev`
- `negative_count`
- `negative_percentage`
- `zero_count`
- `zero_percentage`
#### Field Metric Assertion: Supported Conditions
The full set of supported field metric conditions include:
- `equal_to`
- `not_equal_to`
- `greater_than`
- `greater_than_or_equal_to`
- `less_than`
- `less_than_or_equal_to`
- `between`
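For instance, here's a sketch pairing a metric with the `between` condition to assert that the mean of the `amount` column stays within an expected band:
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: amount
    metric: mean
    condition:
      type: between
      min: 0
      max: 10
    schedule:
      type: on_table_change
```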
### Custom SQL Assertions
Custom SQL Assertions allow you to define custom SQL queries to verify your data meets your expectations.
The only requirement is that the SQL query returns a single value, which will be compared against the expected value.
Below you'll find examples of defining different types of custom SQL assertions via YAML.
SQL Assertions are useful for more complex data quality checks that can't be easily expressed using the other assertion types,
and can be used to assert on custom metrics, complex aggregations, cross-table integrity checks (JOINS) or any other SQL-based data quality check.
#### Validating Foreign Key Integrity
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: sql
    statement: |
      SELECT COUNT(*)
      FROM test_db.public.purchase_events AS pe
      LEFT JOIN test_db.public.products AS p
        ON pe.product_id = p.id
      WHERE p.id IS NULL
    condition:
      type: equal_to
      value: 0
    schedule:
      type: interval
      interval: '6 hours' # Run every 6 hours
```
This assertion checks that the `purchase_events` table in the `test_db.public` schema has no rows where the `product_id` column does not have a corresponding `id` in the `products` table.
#### Comparing Row Counts Across Multiple Tables
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: sql
    statement: |
      SELECT (SELECT COUNT(*) FROM test_db.public.purchase_events)
           - (SELECT COUNT(*) FROM test_db.public.purchase_events_raw) AS row_count_difference
    condition:
      type: equal_to
      value: 0
    schedule:
      type: interval
      interval: '6 hours' # Run every 6 hours
```
This assertion checks that the number of rows in the `purchase_events` table exactly matches the number of rows in the upstream `purchase_events_raw` table
by subtracting the row count of the raw table from the row count of the processed table.
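If an exact match is too strict, for example when the raw table receives late-arriving data, here's a sketch of the same check relaxed with the `between` condition to tolerate a difference of up to 100 rows in either direction:
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: sql
    statement: |
      SELECT (SELECT COUNT(*) FROM test_db.public.purchase_events)
           - (SELECT COUNT(*) FROM test_db.public.purchase_events_raw) AS row_count_difference
    condition:
      type: between
      min: -100
      max: 100
    schedule:
      type: interval
      interval: '6 hours' # Run every 6 hours
```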
#### Supported Conditions
The full set of supported custom SQL assertion conditions include:
- `equal_to`
- `not_equal_to`
- `greater_than`
- `greater_than_or_equal_to`
- `less_than`
- `less_than_or_equal_to`
- `between`
### Schema Assertions (Coming Soon)
Schema Assertions allow you to verify that the schema of a table - its column names and their data types - meets your expectations.
Below you'll find examples of defining different types of schema assertions via YAML.
The specification currently supports 2 types of Schema Assertions:
- **Exact Match**: Asserts that the schema of a table - column names and their data types - exactly matches an expected schema
- **Contains Match** (Subset): Asserts that the schema of a table - column names and their data types - is a subset of an expected schema
#### Validating Actual Schema Exactly Equals Expected Schema
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: schema
    condition:
      type: exact_match
      columns:
        - name: id
          type: INTEGER
        - name: product_id
          type: STRING
        - name: amount
          type: DECIMAL
        - name: updated_at
          type: TIMESTAMP
    schedule:
      type: interval
      interval: '6 hours' # Run every 6 hours
```
This assertion checks that the `purchase_events` table in the `test_db.public` schema has the exact schema as specified, with the exact column names and data types.
#### Validating Actual Schema Contains All of Expected Schema
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: schema
    condition:
      type: contains
      columns:
        - name: id
          type: integer
        - name: product_id
          type: string
        - name: amount
          type: number
    schedule:
      type: interval
      interval: '6 hours' # Run every 6 hours
```
This assertion checks that the `purchase_events` table in the `test_db.public` schema contains all of the columns specified in the expected schema, with the exact column names and data types.
The actual schema can also contain additional columns not specified in the expected schema.
#### Supported Data Types
The following high-level data types are currently supported by the Schema Assertion spec:
- string
- number
- boolean
- date
- timestamp
- struct
- array
- map
- union
- bytes
- enum
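As a sketch of how the complex types might appear in an expected schema (the column names here are hypothetical, and we assume complex types are declared by name just like the primitive types above):
```yaml
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: schema
    condition:
      type: contains
      columns:
        - name: id
          type: string
        - name: purchased_at
          type: timestamp
        - name: line_items
          type: array
        - name: shipping_address
          type: struct
    schedule:
      type: interval
      interval: '6 hours'
```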

View File

@@ -0,0 +1,224 @@
# Snowflake DMF Assertions [BETA]
The DataHub Open Assertion Compiler allows you to define your Data Quality assertions in a simple YAML format, and then compile them to be executed by Snowflake Data Metric Functions.
Once compiled, you'll be able to register the compiled DMFs in your Snowflake environment, and extract their results as part of your normal ingestion process for DataHub.
Results of Snowflake DMF assertions will be reported as normal Assertion Results, viewable on a historical timeline in the context
of the table with which they are associated.
## Prerequisites
- You must have a Snowflake Enterprise account, where the DMFs feature is enabled.
- You must have the necessary permissions to provision DMFs in your Snowflake environment (see below)
- You must have the necessary permissions to query the DMF results in your Snowflake environment (see below)
- You must have a DataHub instance with Snowflake metadata ingested. If you do not have existing Snowflake ingestion, refer to the [Snowflake Quickstart Guide](https://datahubproject.io/docs/quick-ingestion-guides/snowflake/overview) to get started.
- You must have the DataHub CLI installed and have run [`datahub init`](https://datahubproject.io/docs/cli/#init).
### Permissions
*Permissions required for registering DMFs*
According to the latest Snowflake docs, here are the permissions the service account performing the
DMF registration and ingestion must have:
| Privilege | Object | Notes |
|------------------------------|---------------------|---------------------------------------------------------------------------------------------|
| USAGE | Database, schema | The database and schema where the Snowflake DMFs will be created. This is configured in the compile command described below. |
| CREATE FUNCTION | Schema | This privilege enables creating new DMFs in the schema configured in the compile command. |
| EXECUTE DATA METRIC FUNCTION | Account | This privilege enables you to control which roles have access to serverless compute resources to call the system DMF. |
| USAGE | Database, schema | These objects are the database and schema that contain the referenced table in the query. |
| OWNERSHIP | Table | This privilege enables you to associate a DMF with a referenced table. |
| USAGE | DMF | This privilege enables calling the DMF in the schema configured in the compile command. |
and the roles that must be granted:
| Role | Notes |
|--------------------------|-------------------------|
| SNOWFLAKE.DATA_METRIC_USER | To use System DMFs |
*Permissions required for running DMFs (scheduled DMFs run with table owner's role)*
Because scheduled DMFs run with the role of the table owner, the table owner must have the following privileges:
| Privilege | Object | Notes |
|------------------------------|------------------|---------------------------------------------------------------------------------------------|
| USAGE | Database, schema | The database and schema where the Snowflake DMFs will be created. This is configured in the compile command described below. |
| USAGE | DMF | This privilege enables calling the DMF in the schema configured in the compile command. |
| EXECUTE DATA METRIC FUNCTION | Account | This privilege enables you to control which roles have access to serverless compute resources to call the system DMF. |
and the roles that must be granted:
| Role | Notes |
|--------------------------|-------------------------|
| SNOWFLAKE.DATA_METRIC_USER | To use System DMFs |
*Permissions required for querying DMF results*
In addition, the service account that will be executing DataHub Ingestion, and querying the DMF results, must have been granted the following system application roles:
| Role | Notes |
|--------------------------------|-----------------------------|
| DATA_QUALITY_MONITORING_VIEWER | Query the DMF results table |
To learn more about Snowflake DMFs and the privileges required to provision and query them, see the [Snowflake documentation](https://docs.snowflake.com/en/user-guide/data-quality-intro).
*Example: Granting Permissions*
```sql
-- set up permissions for <assertion-service-role> to create DMFs and associate DMFs with tables
grant usage on database "<dmf-database>" to role "<assertion-service-role>";
grant usage on schema "<dmf-database>.<dmf-schema>" to role "<assertion-service-role>";
grant create function on schema "<dmf-database>.<dmf-schema>" to role "<assertion-service-role>";

-- grant ownership + the rest of the permissions to <assertion-service-role>
grant role "<table-owner-role>" to role "<assertion-service-role>";

-- set up permissions for <table-owner-role> to run DMFs on a schedule
grant usage on database "<dmf-database>" to role "<table-owner-role>";
grant usage on schema "<dmf-database>.<dmf-schema>" to role "<table-owner-role>";
grant usage on all functions in schema "<dmf-database>.<dmf-schema>" to role "<table-owner-role>";
grant usage on future functions in schema "<dmf-database>.<dmf-schema>" to role "<table-owner-role>";
grant database role SNOWFLAKE.DATA_METRIC_USER to role "<table-owner-role>";
grant execute data metric function on account to role "<table-owner-role>";

-- set up permissions for <datahub-role> to query DMF results
grant application role SNOWFLAKE.DATA_QUALITY_MONITORING_VIEWER to role "<datahub-role>";
```
## Supported Assertion Types
The following assertion types are currently supported by the DataHub Snowflake DMF Assertion Compiler:
- [Freshness](/docs/managed-datahub/observe/freshness-assertions.md)
- [Volume](/docs/managed-datahub/observe/volume-assertions.md)
- [Column](/docs/managed-datahub/observe/column-assertions.md)
- [Custom SQL](/docs/managed-datahub/observe/custom-sql-assertions.md)
Note that Schema Assertions are not currently supported.
## Creating Snowflake DMF Assertions
The process for declaring and running assertions backed by Snowflake DMFs consists of a few steps, which are outlined
in the following sections.
### Step 1. Define your Data Quality assertions using Assertion YAML files
See **The Specification: Declaring Data Quality Assertions in YAML** in the [Open Assertions Specification](../open-assertions-spec.md) for examples of how to define assertions in YAML.
### Step 2. Register your assertions with DataHub
Use the DataHub CLI to register your assertions with DataHub, so they become visible in the DataHub UI:
```bash
datahub assertions upsert -f examples/library/assertions_configuration.yml
```
### Step 3. Compile the assertions into Snowflake DMFs using the DataHub CLI
Next, we'll use the `assertions compile` command to generate the SQL code for the Snowflake DMFs,
which can then be registered in Snowflake.
```bash
datahub assertions compile -f examples/library/assertions_configuration.yml -p snowflake -x DMF_SCHEMA=<db>.<schema-where-DMF-should-live>
```
Two files will be generated as output of running this command:
- `dmf_definitions.sql`: This file contains the SQL code for the DMFs that will be registered in Snowflake.
- `dmf_associations.sql`: This file contains the SQL code for associating the DMFs with the target tables in Snowflake.
By default, these files are written to a folder called `target`. You can use the `-o <output_folder>` option of the `compile` command to write these compiled artifacts to another folder.
Each of these artifacts will be important for the next steps in the process.
_dmf_definitions.sql_
This file stores the SQL code for the DMFs that will be registered in Snowflake, generated
from your YAML assertion definitions during the compile step.
```sql
-- Example dmf_definitions.sql
-- Start of Assertion 5c32eef47bd763fece7d21c7cbf6c659
CREATE or REPLACE DATA METRIC FUNCTION
test_db.datahub_dmfs.datahub__5c32eef47bd763fece7d21c7cbf6c659 (ARGT TABLE(col_date DATE))
RETURNS NUMBER
COMMENT = 'Created via DataHub for assertion urn:li:assertion:5c32eef47bd763fece7d21c7cbf6c659 of type volume'
AS
$$
select case when metric <= 1000 then 1 else 0 end from (select count(*) as metric from TEST_DB.PUBLIC.TEST_ASSERTIONS_ALL_TIMES )
$$;
-- End of Assertion 5c32eef47bd763fece7d21c7cbf6c659
....
```
_dmf_associations.sql_
This file stores the SQL code for associating the generated DMFs with their target tables,
along with scheduling the generated DMFs to run at particular times.
```sql
-- Example dmf_associations.sql
-- Start of Assertion 5c32eef47bd763fece7d21c7cbf6c659
ALTER TABLE TEST_DB.PUBLIC.TEST_ASSERTIONS_ALL_TIMES SET DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';
ALTER TABLE TEST_DB.PUBLIC.TEST_ASSERTIONS_ALL_TIMES ADD DATA METRIC FUNCTION test_db.datahub_dmfs.datahub__5c32eef47bd763fece7d21c7cbf6c659 ON (col_date);
-- End of Assertion 5c32eef47bd763fece7d21c7cbf6c659
....
```
### Step 4. Register the compiled DMFs in your Snowflake environment
Next, you'll need to run the generated SQL from the files output in Step 3 in Snowflake.
You can achieve this either by running the SQL files directly in the Snowflake UI, or by using the SnowSQL CLI tool:
```bash
snowsql -f dmf_definitions.sql
snowsql -f dmf_associations.sql
```
:::note
Scheduling a Data Metric Function on a table incurs serverless credit usage in Snowflake. Refer to [Billing and Pricing](https://docs.snowflake.com/en/user-guide/data-quality-intro#billing-and-pricing) for more details.
Please ensure you DROP any Data Metric Functions created via `dmf_associations.sql` if the assertion is no longer in use.
:::
### Step 5. Run ingestion to report the results back into DataHub
Once you've registered the DMFs, they will be automatically executed, either when the target table is updated or on a fixed
schedule.
To report the results of the generated Data Quality assertions back into DataHub, you'll need to run the DataHub ingestion process with a special configuration
flag: `include_assertion_results: true`:
```yaml
# Your DataHub Snowflake Recipe
source:
  type: snowflake
  config:
    # ...
    include_assertion_results: True
    # ...
```
During ingestion, we will query for the latest DMF results stored in Snowflake, convert them into DataHub Assertion Results, and report them back into DataHub,
where they will be visible as normal assertions. Ingestion can be run either via the CLI or the UI:
`datahub ingest -c snowflake.yml`
## Caveats
- Snowflake currently supports at most 1000 DMF-table associations, so you cannot define more than 1000 assertions for Snowflake.
- Snowflake does not currently allow JOIN queries or non-deterministic functions in DMF definitions, so you cannot use these in the SQL for SQL assertions or in the `filters` section.
- All DMFs scheduled on a given table must follow the same exact schedule, so you cannot set assertions on the same table to run on different schedules.
- Currently, DMFs are only supported for regular tables, not dynamic or external tables.
## FAQ
Coming soon!

View File

@@ -485,4 +485,4 @@ metadataChangeProposal:
maxAttempts: ${MCP_TIMESERIES_MAX_ATTEMPTS:1000}
initialIntervalMs: ${MCP_TIMESERIES_INITIAL_INTERVAL_MS:100}
multiplier: ${MCP_TIMESERIES_MULTIPLIER:10}
maxIntervalMs: ${MCP_TIMESERIES_MAX_INTERVAL_MS:30000}
maxIntervalMs: ${MCP_TIMESERIES_MAX_INTERVAL_MS:30000}