Can you remember a time when the meaning of a Data Warehouse Table that you depended on fundamentally changed, with little or no notice?
If the answer is yes, how did you find out? We'll take a guess - someone looking at an internal reporting dashboard or, worse, a user of your product sounded an alarm when
a number looked a bit out of the ordinary. Perhaps your table initially tracked purchases made on your company's e-commerce web store, but suddenly began to include purchases made
There are many reasons why an important Table on Snowflake, Redshift, BigQuery, or Databricks may change in its meaning - application code bugs, new feature rollouts,
changes to key metric definitions, etc. Often, these changes break important assumptions made about the data used in building key downstream data products
like reporting dashboards or data-driven product features.
What if you could reduce the time to detect these incidents, so that the people responsible for the data were made aware of data
and then monitor those expectations over time as the table grows and changes.
In this article, we'll cover the basics of monitoring Volume Assertions - what they are, how to configure them, and more - so that you and your team can
start building trust in your most important data assets.
A **Volume Assertion** is a configurable Data Quality rule used to monitor a Data Warehouse Table
for unexpected or sudden changes in "volume", or row count. Volume Assertions can be particularly useful when you have frequently-changing
Tables which have a relatively stable pattern of growth or decline.
For example, imagine that we work for a company with a Snowflake Table that stores user clicks collected from our e-commerce website.
This table is updated with new data on a specific cadence: once per hour (In practice, daily or even weekly are also common).
In turn, there is a downstream Business Analytics Dashboard in Looker that shows important metrics like
the number of people clicking our "Daily Sale" banners, and this dashboard is generated from data stored in our "clicks" table.
It is important that our clicks Table is updated with the correct number of rows each hour; otherwise, our downstream
metrics dashboard could become incorrect. The risk is obvious: our organization
may make bad decisions based on incomplete information.
In such cases, we can use a **Volume Assertion** that checks whether the Snowflake "clicks" Table is growing in an expected
way, and that there are no sudden increases or sudden decreases in the rows being added or removed from the table.
If too many rows are added or removed within an hour, we can notify key stakeholders and begin investigating the root cause before the problem impacts consumers of the data.
### Anatomy of a Volume Assertion
At the most basic level, **Volume Assertions** consist of a few important parts:
1. An **Evaluation Schedule**
2. A **Volume Condition**
3. A **Volume Source**
In this section, we'll give an overview of each.
#### 1. Evaluation Schedule
The **Evaluation Schedule**: This defines how often to check a given warehouse Table for its volume. This should usually
be configured to match the expected change frequency of the Table, although it can also be less frequent depending
on the requirements. You can also specify particular days of the week, hours in the day, or even
minutes in an hour.
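
As a rough illustration of how a schedule restricted to particular days, hours, and minutes might be interpreted, here is a minimal Python sketch. The function `is_check_due` and its parameters are hypothetical and are not part of the product; the actual schedule format may differ.

```python
from datetime import datetime
from typing import Optional, Set


def is_check_due(
    now: datetime,
    days_of_week: Optional[Set[int]] = None,  # 0 = Monday ... 6 = Sunday
    hours: Optional[Set[int]] = None,         # 0-23
    minutes: Optional[Set[int]] = None,       # 0-59
) -> bool:
    """Return True if `now` matches every restriction that was provided.

    A restriction left as None matches any value (assumed semantics).
    """
    if days_of_week is not None and now.weekday() not in days_of_week:
        return False
    if hours is not None and now.hour not in hours:
        return False
    if minutes is not None and now.minute not in minutes:
        return False
    return True


# Check at minute 0 of every hour, weekdays only, for an hourly-updated table.
weekday_hourly = dict(days_of_week={0, 1, 2, 3, 4}, minutes={0})
monday_9am = datetime(2024, 1, 1, 9, 0)    # a Monday
saturday_9am = datetime(2024, 1, 6, 9, 0)  # a Saturday
print(is_check_due(monday_9am, **weekday_hourly))
print(is_check_due(saturday_9am, **weekday_hourly))
```

Matching the schedule to the table's hourly update cadence keeps each check aligned with one expected batch of new rows.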
#### 2. Volume Condition
The **Volume Condition**: This defines the type of condition that we'd like to monitor, or when the Assertion
should result in failure.
There are 2 categories of conditions: **Total** Volume and **Change** Volume.
_Total_ volume conditions are those which are defined against the point-in-time total row count for a table. They allow you to specify conditions like:
1. **Table has too many rows**: The table should always have fewer than 1000 rows.
2. **Table has too few rows**: The table should always have more than 1000 rows.
3. **Table row count is outside a range**: The table should always have between 1000 and 2000 rows.
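
The three total volume conditions above can be sketched as simple comparisons against a point-in-time row count. This is an illustrative Python sketch only; the function names are hypothetical, and treating the range bounds as inclusive is an assumption.

```python
def has_fewer_than(row_count: int, limit: int) -> bool:
    """Passes when the table has fewer than `limit` rows."""
    return row_count < limit


def has_more_than(row_count: int, limit: int) -> bool:
    """Passes when the table has more than `limit` rows."""
    return row_count > limit


def is_within_range(row_count: int, low: int, high: int) -> bool:
    """Passes when the row count is between `low` and `high` (inclusive, by assumption)."""
    return low <= row_count <= high


current_rows = 1500  # illustrative point-in-time row count
print(has_fewer_than(current_rows, 1000))         # too many rows, so this fails
print(has_more_than(current_rows, 1000))          # enough rows, so this passes
print(is_within_range(current_rows, 1000, 2000))  # inside the range, so this passes
```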
_Change_ volume conditions are those which are defined against the growth or decline rate of a table, measured between subsequent checks
of the table volume. They allow you to specify conditions like:
1. **Table growth is too fast**: When the table volume is checked, it should have < 1000 more rows than it had during the previous check.
2. **Table growth is too slow**: When the table volume is checked, it should have > 1000 more rows than it had during the previous check.
3. **Table growth is outside a range**: When the table volume is checked, it should have between 1000 and 2000 more rows than it had during the previous check.
For change volume conditions, both _absolute_ row count deltas and _relative_ percentage deltas are supported for identifying
tables that are following an abnormal pattern of growth.
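
To make the difference between absolute and percentage deltas concrete, here is a minimal Python sketch that compares two consecutive row counts. The function names and the sample numbers are hypothetical, and the handling of an empty previous table is an assumption rather than documented behavior.

```python
def absolute_delta(previous_rows: int, current_rows: int) -> int:
    """Rows gained (positive) or lost (negative) since the previous check."""
    return current_rows - previous_rows


def percentage_delta(previous_rows: int, current_rows: int) -> float:
    """Growth relative to the previous row count, as a percentage."""
    if previous_rows == 0:
        # Assumption: a percentage change from zero rows is treated as undefined.
        raise ValueError("percentage delta is undefined when the previous count is 0")
    return (current_rows - previous_rows) * 100.0 / previous_rows


previous_rows, current_rows = 10_000, 11_500  # illustrative consecutive checks

growth = absolute_delta(previous_rows, current_rows)
growth_pct = percentage_delta(previous_rows, current_rows)

# "Table growth is outside a range": expect between 1000 and 2000 new rows.
absolute_ok = 1000 <= growth <= 2000
# The same idea expressed relatively: expect between 10% and 20% growth.
relative_ok = 10.0 <= growth_pct <= 20.0

print(growth, growth_pct, absolute_ok, relative_ok)
```

A percentage delta scales with the table, so the same threshold can remain meaningful as the table grows, whereas an absolute threshold may need to be revisited over time.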
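#### 3. Volume Source
The **Volume Source**: This defines the mechanism used to determine the table's row count. Supported
source types vary by platform, but generally fall into these categories: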
- **Information Schema**: A system Table that is exposed by the Data Warehouse which contains live information about the Databases
and Tables stored inside the Data Warehouse, including their row count. It is usually efficient to check, but can in some cases be slightly delayed to update
once a change has been made to a table.
- **Query**: A `COUNT(*)` query is used to retrieve the latest row count for a table, with optional SQL filters applied (depending on platform).
This can be less efficient to check depending on the size of the table. This approach is more portable: because it does not involve
system warehouse tables, it can be used across Data Warehouse and Data Lake providers.
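
The **Query** approach can be illustrated with a short Python sketch. It uses an in-memory SQLite database purely as a stand-in for a warehouse connection; the `count_rows` helper, the `clicks` table layout, and the filter are hypothetical and only show the shape of the `COUNT(*)` query.

```python
import sqlite3
from typing import Optional


def count_rows(conn: sqlite3.Connection, table: str, where: Optional[str] = None) -> int:
    """Return the row count for `table`, optionally restricted by a SQL filter.

    The table name and filter are interpolated directly into the statement,
    so they must come from trusted configuration, never from user input.
    """
    sql = f"SELECT COUNT(*) FROM {table}"
    if where:
        sql += f" WHERE {where}"
    return conn.execute(sql).fetchone()[0]


# Stand-in for a warehouse table that stores click events.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (id INTEGER PRIMARY KEY, banner TEXT)")
conn.executemany(
    "INSERT INTO clicks (banner) VALUES (?)",
    [("daily_sale",), ("daily_sale",), ("footer",)],
)

print(count_rows(conn, "clicks"))                           # all rows
print(count_rows(conn, "clicks", "banner = 'daily_sale'"))  # filtered rows
```

Because the query scans the table, its cost grows with table size, which is the trade-off against reading a precomputed count from the information schema.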
2. (Optional) **Data Platform Connection**: In order to create a Volume Assertion that queries the source data platform directly (instead of DataHub metadata), you'll need to have an **Ingestion Source** configured for your
Data Platform (Snowflake, BigQuery, or Redshift) under the **Integrations** tab.
5. Configure the evaluation **schedule**. This is the frequency at which the assertion will be evaluated to produce a pass or fail result, and the times
when the table volume will be checked.
6. Configure the evaluation **condition type**. This determines the cases in which the new assertion will fail when it is evaluated.
The supported volume assertion types are `ROW_COUNT_TOTAL` and `ROW_COUNT_CHANGE`. Other (e.g. incrementing segment) types are not yet supported.
The supported operator types are `GREATER_THAN`, `GREATER_THAN_OR_EQUAL_TO`, `LESS_THAN`, `LESS_THAN_OR_EQUAL_TO`, and `BETWEEN` (requires minValue, maxValue).
The only supported parameter type is `NUMBER`.
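
The constraints listed above can be sketched as a small client-side validation step in Python. The dictionary layout and the `validate_volume_assertion` helper are hypothetical and do not represent the actual API input shape; in particular, the `value` key used for single-threshold operators is an assumption. Only the type names, operator names, `minValue`, `maxValue`, and `NUMBER` come from the text.

```python
from typing import Any, Dict

SUPPORTED_TYPES = {"ROW_COUNT_TOTAL", "ROW_COUNT_CHANGE"}
SUPPORTED_OPERATORS = {
    "GREATER_THAN",
    "GREATER_THAN_OR_EQUAL_TO",
    "LESS_THAN",
    "LESS_THAN_OR_EQUAL_TO",
    "BETWEEN",
}


def validate_volume_assertion(spec: Dict[str, Any]) -> None:
    """Raise ValueError if `spec` violates the documented constraints."""
    if spec.get("type") not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported assertion type: {spec.get('type')!r}")
    operator = spec.get("operator")
    if operator not in SUPPORTED_OPERATORS:
        raise ValueError(f"unsupported operator: {operator!r}")

    # BETWEEN needs both bounds; other operators take one threshold
    # (the key name "value" is an assumption made for this sketch).
    required = ("minValue", "maxValue") if operator == "BETWEEN" else ("value",)
    parameters = spec.get("parameters", {})
    for name in required:
        param = parameters.get(name)
        if param is None:
            raise ValueError(f"operator {operator} requires parameter {name!r}")
        if param.get("type") != "NUMBER":
            raise ValueError(f"parameter {name!r} must have type NUMBER")


example_spec = {
    "type": "ROW_COUNT_TOTAL",
    "operator": "BETWEEN",
    "parameters": {
        "minValue": {"type": "NUMBER", "value": "1000"},
        "maxValue": {"type": "NUMBER", "value": "2000"},
    },
}
validate_volume_assertion(example_spec)  # passes silently when the spec is valid
```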
You can use the same endpoint with the assertion urn as input to update an existing Volume Assertion and its corresponding Monitor: