“Quality over Quantity” is a phrase we hear regularly in life, but when it comes to the world of data, we often fail to adhere to this rule.
Data Quality Monitoring implements quality checks in operational data processes to ensure that the data meets pre-defined standards and business rules.
Most data warehouses hold terabytes of data, so data quality monitoring is often challenging and cost-intensive due to dependencies on multiple tools, and it is eventually ignored. Over time, this erodes data credibility and consistency, leading businesses to mistrust their data pipelines and processes.
To address this, Snowflake has introduced a feature that allows data quality monitoring to be implemented natively in the Snowflake AI Data Cloud without using any additional tools.
In this blog, we will cover:
What the Data Quality Monitoring with Data Metric Functions feature is in Snowflake
How to implement the DQM using Data Metric Functions
Challenges with the Data Quality Monitoring Feature
Best Practices for implementing the DQM
What is Snowflake's Data Quality Monitoring Tool?
Snowflake has introduced a feature called data metric functions (DMFs). These include Snowflake-provided system DMFs and user-defined DMFs that can regularly check the integrity and state of data in a table or view. Currently, the functions support SQL syntax, accept one or more arguments, and must always return a NUMBER value.
Each type of DMF can be assigned to the below objects.
Standard, Temporary, Transient Tables
View
Materialized view
Dynamic table
Event table
External table
Apache Iceberg table
The following objects do not support DMFs for now.
Stream Objects
After assigning the DMF, we can schedule how often the DMF is executed on the object. The DMFs will run as per the schedule and populate the results in a system view called SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS.
Below is the process flow for this implementation.
Based on these results, the data teams can learn the state and quality of the data and take necessary actions to fix the outliers preemptively. Additionally, alerts can be configured based on the results so that the engineers and other stakeholders know when to look for quality discrepancies.
Snowflake runs these quality checks natively in the background using serverless compute resources, i.e., it creates and manages its own virtual warehouses as per the DMF definition.
The process’s credits are listed under the “Data Quality Monitoring” category on the monthly bill, and we can query the SNOWFLAKE.ACCOUNT_USAGE.DATA_QUALITY_MONITORING_USAGE_HISTORY view to track the credit consumption related to using DMFs in our account.
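For example, a minimal sketch of such a tracking query could look like the following; the column names START_TIME and CREDITS_USED are assumptions based on other ACCOUNT_USAGE views.
-- Sketch: track serverless credits consumed by DMF executions over the last 7 days.
-- Column names (START_TIME, CREDITS_USED) are assumed to match other ACCOUNT_USAGE views.
SELECT DATE_TRUNC('day', start_time) AS usage_date,
       SUM(credits_used)             AS dmf_credits
FROM SNOWFLAKE.ACCOUNT_USAGE.DATA_QUALITY_MONITORING_USAGE_HISTORY
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY usage_date
ORDER BY usage_date;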
Benefits of Using the Data Quality Monitoring Tool
We can observe the below benefits once we implement the Data Quality Monitoring Tool.
Credibility – Makes data trustworthy, which enables reliable data-driven decision-making.
Adhere to Service Level Agreements – Precise data metrics, like freshness, help ensure compliance with SLAs between data producers and data consumers.
Regulatory Compliance – Knowledge of the condition of your data facilitates compliance with regulatory standards, reducing risk and strengthening your data governance framework.
Automation – No manual intervention is required to check the quality of your data after the Data Monitoring function is scheduled.
Use Case Specific needs – Custom Data Metric Functions (DMFs) allow data engineers to create precise metrics for evaluating data, resulting in more accurate optimizations for specific data applications.
How to Use the Data Quality Monitoring Feature in Snowflake
Pre-requisites
Before we can use DMFs in Snowflake for implementing Data Quality Monitoring, the following platform-level setup is required.
Account – This feature is available from Enterprise Edition onwards and does not work on trial accounts or certain US GOV region accounts.
Access Control Setup – We will need the following access to be established before we can start the implementation.
Role – We can create a custom role or use a role with USAGE access on the database and schema containing the table(s)/view(s) where we intend to implement the DMFs. For simplicity in this blog, we have created a custom role as shown below.
USE ROLE ACCOUNTADMIN;
CREATE ROLE IF NOT EXISTS dqm_rw_role;
Grants
Usage on the database and schema and SELECT access on the tables.
GRANT USAGE ON DATABASE demo TO ROLE dqm_rw_role;
GRANT USAGE ON SCHEMA demo.public TO ROLE dqm_rw_role;
GRANT SELECT ON ALL TABLES IN SCHEMA demo.public TO ROLE dqm_rw_role;
GRANT SELECT ON FUTURE TABLES IN SCHEMA demo.public TO ROLE dqm_rw_role;
Permission to execute DMFs.
GRANT EXECUTE DATA METRIC FUNCTION ON ACCOUNT TO ROLE dqm_rw_role;
Permission to view the results of the DMF execution.
GRANT APPLICATION ROLE SNOWFLAKE.DATA_QUALITY_MONITORING_VIEWER TO ROLE dqm_rw_role;
GRANT DATABASE ROLE SNOWFLAKE.USAGE_VIEWER TO ROLE dqm_rw_role;
GRANT DATABASE ROLE SNOWFLAKE.DATA_METRIC_USER TO ROLE dqm_rw_role;
Permission to use a warehouse for running a query to view the results of the DMF execution.
GRANT USAGE ON WAREHOUSE COMPUTE_WH TO ROLE dqm_rw_role;
Establish a role hierarchy and grant the role to a user who can complete this tutorial (replace the <your username> value with the actual user ID).
GRANT ROLE dqm_rw_role TO ROLE SYSADMIN;
GRANT ROLE dqm_rw_role TO USER <your username>;
Types of DMF
There are two types of DMFs Snowflake provides.
System-Defined DMFs – These are built-in functions provided by Snowflake in the SNOWFLAKE.CORE schema. They cover common data quality checks, such as null count, duplicate count, volume, and minimum and maximum values, for individual fields of a table/view. We will demo one of them, duplicate count, in our use cases below.
User-Defined DMFs – We can create our own data metric function using the CREATE DATA METRIC FUNCTION command. It contains an argument-driven SQL query, which we define as per our business requirement, and the output of the SQL must be a NUMBER value. For example, suppose we want to check the number of invalid emails in a table containing customer email information. In that case, we can specify a count query to do so, and the output of that count query becomes the result of the DMF. In our use cases below, we will demo two of them – Check for Late Arriving Dimensions and Check for Invalid Address.
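As an illustration of the syntax, below is a minimal sketch of such a user-defined DMF for the invalid email example; the function name, argument names, and regex pattern are assumptions made for this illustration, not part of the demo that follows.
-- Sketch: user-defined DMF that counts rows whose email value does not match a simple pattern.
-- Function name, argument names, and the regex are illustrative assumptions.
CREATE DATA METRIC FUNCTION IF NOT EXISTS demo.public.invalid_email_count(
  arg_t TABLE(arg_c1 STRING)
)
RETURNS NUMBER
AS
'SELECT COUNT_IF(NOT REGEXP_LIKE(arg_c1, ''^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+[.][A-Za-z]{2,}$'')) FROM arg_t';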
Use Cases
We will use two datasets – CUSTOMERS and SUBSCRIPTIONS – to demo the use cases.
Duplicate Count
Duplicate Count helps determine redundant data in a table based on a key column. The presence of duplicates might lead to exploding joins downstream, disrupting data integrity and causing performance issues.
In Snowflake, the system defined DMF for a duplicate check is – SNOWFLAKE.CORE.DUPLICATE_COUNT(<query>).
Let’s say below is the initial state of the table for ACCOUNT_NUMBER = 10000000.
As shown above, the table has no duplicates based on the ACCOUNT_NUMBER. We can use the DMF mentioned above to validate this. It will return 0 to signify that there are no duplicates, as shown below.
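A manual call of the system DMF would look like the following sketch; the table and column names follow the demo dataset used in this blog.
-- Call the system DMF manually to validate the check before scheduling it.
SELECT SNOWFLAKE.CORE.DUPLICATE_COUNT(
  SELECT account_number FROM demo.public.customers
) AS duplicate_count;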
Now, we will assign the DMF to the table so that if there are duplicate records in future data pipeline runs, we can see them in the results view SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS.
To assign the DMF to the table, we must first add a data metric schedule to the CUSTOMERS table. For now, we will set the schedule to be triggered by any DML (INSERT/UPDATE/DELETE) changes on the table, like below.
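A sketch of the corresponding statement, using the object names from the demo setup:
-- Run all DMFs attached to the table whenever DML changes its content.
ALTER TABLE demo.public.customers
  SET DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';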
Remember that we can add only one data metric schedule per table/view, and all the metrics set on the table are executed together on that schedule. Hence, the scheduling frequency must first be evaluated with an understanding of the end-to-end process timings and then set on the table. Once this is done, we will add the DMF to the table, as shown below.
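A sketch of attaching the system DMF to the key column; the column name ACCOUNT_NUMBER comes from the demo dataset.
-- Attach the system duplicate-count DMF to the ACCOUNT_NUMBER column.
ALTER TABLE demo.public.customers
  ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.DUPLICATE_COUNT ON (account_number);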
To check the status of the DMF addition to the table, we can use the DATA_METRIC_FUNCTION_REFERENCES table function, as below.
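For example, the table function can be queried like this; the object names follow the demo setup.
-- List the DMFs attached to the CUSTOMERS table along with their schedule and status.
SELECT *
FROM TABLE(
  INFORMATION_SCHEMA.DATA_METRIC_FUNCTION_REFERENCES(
    REF_ENTITY_NAME   => 'DEMO.PUBLIC.CUSTOMERS',
    REF_ENTITY_DOMAIN => 'TABLE'
  )
);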
As you can see, the schedule status is "STARTED", which means the DMF is successfully attached and active on the table. Now, if there is a duplicate entry in this column in the table, we will automatically be able to see it. Say we have a duplicate record like the one below on the next run.
In this scenario, Snowflake will evaluate the DMF in the background as per the schedule, and we will be able to see the results after a ~5-minute lag from the scheduled run like below.
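A minimal sketch of querying the results view is below; the filter and output column names (MEASUREMENT_TIME, METRIC_NAME, TABLE_NAME, VALUE) are assumed to match the view's schema.
-- Inspect the latest DMF results for the CUSTOMERS table.
-- Column names are assumptions based on the view's documented schema.
SELECT measurement_time, metric_name, table_name, value
FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE table_name = 'CUSTOMERS'
ORDER BY measurement_time DESC;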
The value 1 means there is one duplicate, per the DMF set on the CUSTOMERS table.
Since the schedule defined here is "TRIGGER_ON_CHANGES", the SCHEDULED_TIME is null in the above output. If it were a cron schedule, the field would have displayed the timestamp of the scheduled run.
This result and this view can also be used to build an alerting system to notify the necessary teams of duplicates in CUSTOMERS based on ACCOUNT_NUMBER.
By following these steps, we can set a Data Quality check to detect duplicates in any column in a table or view in Snowflake.
Late Arriving Dimensions
Late-arriving dimensions refer to a data discrepancy scenario in which a dimension key is present in new fact data, but the corresponding record does not yet exist in the dimension table. Like the “glass half full or glass half empty” scenario, these records are also called early-arriving facts. This can result in incomplete reporting and can corrupt dependent downstream applications.
Let’s say, in the below dataset, we have SUBSCRIPTION_ID as the primary key in the dimension table – the SUBSCRIPTIONS table – which in turn is the foreign key in the CUSTOMERS table. So, every record in the CUSTOMERS table must have a SUBSCRIPTION_ID value that is also present in the SUBSCRIPTIONS table.
Since Snowflake doesn’t enforce primary key and foreign key constraints for traditional tables, there can be a scenario where updated data has not flowed into the SUBSCRIPTIONS table but has arrived in the CUSTOMERS table, and there is no logic in the data pipeline to check for this. This will result in an incomplete data load, which might impact the downstream data.
To detect such scenarios, we can create a Data Metric Function to check the foreign key dependency like below.
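A minimal sketch of such a user-defined DMF is shown below; the function name, the argument's NUMBER data type, and the hardcoded reference to the SUBSCRIPTIONS table are assumptions made for this sketch.
-- Sketch: user-defined DMF that counts CUSTOMERS rows whose SUBSCRIPTION_ID
-- is not yet present in the SUBSCRIPTIONS dimension table.
-- The SUBSCRIPTIONS table is hardcoded because a DMF accepts only one table argument.
CREATE DATA METRIC FUNCTION IF NOT EXISTS demo.public.late_arriving_dimension_count(
  arg_t TABLE(arg_subscription_id NUMBER)
)
RETURNS NUMBER
AS
'SELECT COUNT(*)
 FROM arg_t
 WHERE arg_subscription_id NOT IN (SELECT subscription_id FROM demo.public.subscriptions)';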
In this DDL statement, we have to specify the arguments correctly. The arguments can reference only one table (the one to which our DMF will be added) and columns, with their data types, of that table only. Columns of other tables cannot be passed as arguments and must be referenced in the query definition instead, as shown in the sketch above.
Next, we will add the DMF to the table, as shown below.
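A sketch of attaching this custom DMF to the CUSTOMERS table, using the demo object names:
-- Attach the custom DMF to the foreign key column of the CUSTOMERS table.
ALTER TABLE demo.public.customers
  ADD DATA METRIC FUNCTION demo.public.late_arriving_dimension_count ON (subscription_id);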
Now, as soon as there is a record that satisfies the criteria of a late-arriving dimension in the CUSTOMERS table, the DMF will capture it in the same results view, and we can see it in a similar way, as shown below.
The above signifies that one such record satisfies the criteria of late-arriving dimensions. Remember that this will also be visible after a lag of ~ 5 minutes.
Based on this, data engineers can tweak their pipelines to handle such scenarios or inform the upstream teams of any delays at their end.
Invalid Address
In any dataset with user demographic information, the address details are always challenging to validate. They have multiple parts, such as Street, House, State, and Zip/postal codes, each with a unique, valid pattern.
Hence, it’s common to have records that do not adhere to the correct format, resulting in incorrect reporting based on location elements like state and Zip/postal codes. In this use case, we will see how to determine whether the Zip/postal codes are valid.
ZIP codes in the USA have five digits and no letters, and some have five digits, a hyphen, and then four more digits, in this format: ##### or #####-#### (five or nine digits).
So, first, we will create a DMF to get the count of records whose zip field in the CUSTOMERS table does not conform to the regex pattern for valid US ZIP codes – ^\d{5}(-\d{4})?$ – like below.
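A minimal sketch of such a DMF is below; the function and argument names are assumptions, and [0-9] is used in place of \d so the pattern can sit inside the single-quoted function body without backslash escaping.
-- Sketch: user-defined DMF that counts rows whose ZIP code does not match ##### or #####-####.
CREATE DATA METRIC FUNCTION IF NOT EXISTS demo.public.invalid_zip_count(
  arg_t TABLE(arg_c1 STRING)
)
RETURNS NUMBER
AS
'SELECT COUNT_IF(NOT REGEXP_LIKE(arg_c1, ''^[0-9]{5}(-[0-9]{4})?$'')) FROM arg_t';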
Once done, we will add the DMF to the table as in the previous two use cases.
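For completeness, a sketch of the corresponding statement; the column name ZIP is taken from the demo dataset.
-- Attach the invalid ZIP code DMF to the ZIP column of the CUSTOMERS table.
ALTER TABLE demo.public.customers
  ADD DATA METRIC FUNCTION demo.public.invalid_zip_count ON (zip);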
Once that is done, as soon as a new run loads records with invalid zip codes, the DMF will capture them in the same results view, and we can see them in a similar way, as shown below.
This signifies one record in the CUSTOMERS table with an invalid zip code. The data engineers can then take action to resolve the issue in the data pipeline itself.
Scheduling
We have covered how to create and assign DMFs to tables and how to see the results of the data quality monitoring for three different use cases. However, for all of them, the schedule was set to "TRIGGER_ON_CHANGES", meaning the DMFs will run only when a DML operation occurs on the table.
There is another way of scheduling using a CRON schedule like the one below –
ALTER TABLE DEMO.PUBLIC.customers SET
DATA_METRIC_SCHEDULE = 'USING CRON 00 02 * * * America/Chicago';
The above will set the schedule of all the DMFs assigned to a table to run daily at 02:00 AM US Central time.
After assigning the schedule, below will be the initial state of the SCHEDULE_STATUS –
After some time, its status will change to "STARTED", as shown below.
We can also use an interval-based schedule (in minutes) like the code below.
ALTER TABLE DEMO.PUBLIC.customers SET DATA_METRIC_SCHEDULE = '5 MINUTE';
Note that only specific interval values, such as 5, 15, and 30 minutes, are accepted for an interval-based schedule; arbitrary intervals are not supported.
Challenges with the Data Quality Monitoring Feature
Although the Data Quality Monitoring feature in Snowflake is a huge boost to its rich ecosystem of services and features, it’s still a relatively new and less explored feature and comes with its shortcomings. Apart from the limitations listed in the Snowflake documentation of this feature, below are some of the challenges and limitations we have seen with the Data Quality Monitoring feature.
Every table/view can have only 1 Data Metric Schedule, i.e., all the DMFs associated with it start running almost simultaneously. In scenarios when one table/view is being updated from multiple pipelines, this limitation might result in false negatives and redundant quality checks. Ideally, there should be an option to schedule each DMF as required rather than having the schedule at the table/view level.
The schedules might run with some lag from the scheduled time. Since this feature uses serverless compute, and Snowflake starts and stops those resources on its own, there is no way to monitor the execution of a DMF in real time. Hence, we must account for this if we build alerts on top of the DMF results view.
When defining a user-defined DMF, multiple tables and their columns cannot be used as arguments for the function. This results in using hardcoded values for such scenarios, requiring additional handling at the CI/CD level to make them environment-independent.
Best Practices
We would like to recommend the below standards while implementing this feature.
Create a custom role and grant it ownership, usage, and SELECT access on the relevant objects so it can create, add, and schedule the DMFs.
After creating a DMF, always call it manually using SELECT <data_metric_function>(<query>) to validate whether the logic defined in the DMF works properly.
If the logic for the DMF does not need to be public, please secure the DMF. Below is an example of the same.
ALTER FUNCTION demo.public.invalid_zip_count(TABLE(STRING)) SET SECURE;
If a DMF is no longer required after some time, please drop the DMF association with the table so that it does not execute in subsequent runs and save the serverless credits. This can be done like below.
ALTER TABLE demo.public.customers DROP DATA METRIC FUNCTION
DEMO.PUBLIC.LATE_ARRIVING_DIMENSION_COUNT ON (SUBSCRIPTION_ID);
Closing
Now that we have covered how data metric functions are used to implement data quality monitoring in Snowflake, you can start exploring the different types of built-in DMFs available in Snowflake and build your own DMFs to add data quality checks to your data pipelines. This feature will allow you to meet service-level agreements for your data and make your data credible and consistent for businesses to consume.
If you have questions or need assistance setting up Data Quality Monitoring or leveraging DMFs in Snowflake, please contact the experts at phData.
FAQs
Which tool can be used for data governance with Snowflake?
Snowflake has a diverse partner ecosystem, and partners like Sigma, Talend, Collibra, and Informatica enable strong data governance implementations. Along with the Data Quality Monitoring feature, Snowflake also has other native data governance features like Data Lineage, Data Clean Rooms, Object Insights, and Trust Center. These are all collectively housed under Snowflake Horizon.
Does Data Quality Monitoring cost extra?
Yes, there are some costs, which are incurred only when a data metric function (DMF) is executed on an object. However, these functions are integrated into Snowflake, and there are no additional licensing costs. The cost per execution (i.e., a query that runs the DMF) is much lower when billed by Snowflake compared to a third-party provider, which would typically charge a premium.