October 25, 2024

What is Snowflake’s Data Quality Monitoring Feature and How is it Used?

By Pratik Datta

“Quality over Quantity” is a phrase we hear regularly in life, but when it comes to the world of data, we often fail to adhere to this rule.  

Data Quality Monitoring implements quality checks in operational data processes to ensure that the data meets pre-defined standards and business rules.

Most data warehouses hold terabytes of data, so data quality monitoring is often challenging and cost-intensive due to dependencies on multiple tools, and it is eventually ignored. Over time, this erodes credibility and data consistency, leading businesses to mistrust their data pipelines and processes.

To address this, Snowflake has introduced a feature that lets you implement data quality monitoring natively in the Snowflake AI Data Cloud without any additional tools.

In this blog, we will cover: 

  • What the Data Quality Monitoring with Data Metric Functions feature is in Snowflake

  • How to implement the DQM using Data Metric Functions

  • Challenges with the Data Quality Monitoring Feature

  • Best Practices for implementing the DQM

What is Snowflake's Data Quality Monitoring Tool?

Snowflake has introduced a feature called data metric functions (DMFs). These include Snowflake-provided system DMFs and user-defined DMFs that can regularly check the integrity and state of data in a table or view. Currently, the functions support SQL syntax, accept one or more arguments, and must always return a NUMBER value.
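
For example, a system DMF can be invoked directly in a query; a minimal sketch (the table and column names here are hypothetical):

-- Count NULL values in a column; the DMF returns a NUMBER
SELECT SNOWFLAKE.CORE.NULL_COUNT(
  SELECT email FROM demo.public.customers
);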

Each type of DMF can be assigned to the below objects.

  • Standard, Temporary, Transient Tables

  • View

  • Materialized view

  • Dynamic table

  • Event table

  • External table

  • Apache Iceberg table

Objects other than these do not support DMFs for now.

After assigning the DMF, we can schedule how often the DMF is executed on the object. The DMFs will run as per the schedule and populate the results in a system view called SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS. 

Below is the process flow for this implementation.

Based on these results, the data teams can learn the state and quality of the data and take necessary actions to fix the outliers preemptively. Additionally, alerts can be configured based on the results so that the engineers and other stakeholders know when to look for quality discrepancies.

Snowflake runs these quality checks natively in the background using serverless compute resources, i.e., it creates and manages its own compute as needed, per the DMF definition.

The process’s credits are listed in the “Data Quality Monitoring” category on the monthly bill, and we can query the SNOWFLAKE.ACCOUNT_USAGE.DATA_QUALITY_MONITORING_USAGE_HISTORY view to track the credit consumption related to using DMFs in our account.
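
For instance, a quick way to review recent DMF spend might look like the following sketch (the exact columns available in this view may vary):

-- Credits consumed by data quality monitoring over the last 30 days
SELECT *
FROM SNOWFLAKE.ACCOUNT_USAGE.DATA_QUALITY_MONITORING_USAGE_HISTORY
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP());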

Benefits of Using the Data Quality Monitoring Tool

We can observe the below benefits once we implement the Data Quality Monitoring Tool.

  • Credibility – Makes data trustworthy, which enables reliable data-driven decision-making.

  • Adhere to Service Level Agreements – Precise data metrics, like freshness, help ensure compliance with SLAs between data producers and data consumers.

  • Regulatory Compliance – Knowledge of the condition of your data facilitates compliance with regulatory standards, reducing risk and strengthening your data governance framework.

  • Automation – No manual intervention is required to check the quality of your data after the Data Monitoring function is scheduled.

  • Use-Case-Specific Needs – Custom Data Metric Functions (DMFs) allow data engineers to create precise metrics for evaluating data, resulting in more accurate optimizations for specific data applications.

How to Use the Data Quality Monitoring Feature in Snowflake

Pre-requisites

Before we can use DMFs in Snowflake to implement Data Quality Monitoring, the following platform-level setup must be in place.

  • Account – This feature is available from the Enterprise edition onwards and does not work on trial accounts and certain US government region accounts.

  • Access Control Setup – We will need the following access to be established before we can start the implementation.

  • Role – We can create a custom role or use a role with USAGE access on the database and schema containing the table(s)/view(s) where we intend to implement the DMFs.

				
USE ROLE ACCOUNTADMIN;
CREATE ROLE IF NOT EXISTS dqm_rw_role;
				
			

For simplicity in this blog, we have created the custom role dqm_rw_role, as shown above.

  • Grants

    • Usage on the database and schema, and select access on the tables.

				
GRANT USAGE ON DATABASE demo TO ROLE dqm_rw_role;
GRANT USAGE ON SCHEMA demo.public TO ROLE dqm_rw_role;
GRANT SELECT ON ALL TABLES IN SCHEMA demo.public TO ROLE dqm_rw_role;
GRANT SELECT ON FUTURE TABLES IN SCHEMA demo.public TO ROLE dqm_rw_role;
				
			
    • Permission to execute DMFs.

				
GRANT EXECUTE DATA METRIC FUNCTION ON ACCOUNT TO ROLE dqm_rw_role;
				
			
    • Permission to view the results of the DMF execution.

				
GRANT APPLICATION ROLE SNOWFLAKE.DATA_QUALITY_MONITORING_VIEWER TO ROLE dqm_rw_role;
GRANT DATABASE ROLE SNOWFLAKE.USAGE_VIEWER TO ROLE dqm_rw_role;
GRANT DATABASE ROLE SNOWFLAKE.DATA_METRIC_USER TO ROLE dqm_rw_role;
				
			
    • Permission to use a warehouse for running a query to view the results of the DMF execution.

				
GRANT USAGE ON WAREHOUSE COMPUTE_WH TO ROLE dqm_rw_role;
				
			
    • Establish a role hierarchy and grant the role to a user who can complete this tutorial (replace the <your username> value with the actual user ID).

				
GRANT ROLE dqm_rw_role TO ROLE SYSADMIN;
GRANT ROLE dqm_rw_role TO USER <your username>;
				
			

Types of DMF

Snowflake provides two types of DMFs.

  • System-Defined DMFs – These are built-in functions provided by Snowflake in the SNOWFLAKE.CORE schema. They cover common data quality checks, such as null count, duplicate count, volume, and minimum and maximum values, for individual fields of a table/view. We will demo one of them, duplicate count, in our use cases below.

  • User-Defined DMFs – We can create our own data metric function using the CREATE DATA METRIC FUNCTION command. A user-defined DMF contains an argument-driven SQL query, defined per our business requirement, whose output must be a NUMBER value. For example, suppose we want to check the number of invalid emails in a table containing customer email information. In that case, we can specify a count query to do exactly that, and the output of that count query becomes the result of the DMF. In our use cases below, we will demo two of them – Check for Late Arriving Dimensions and Check for Invalid Address.

Use Cases

We will use two datasets – CUSTOMERS and SUBSCRIPTIONS – to demo the use cases.

Duplicate Count

Duplicate Count helps determine redundant data in a table based on a key column. The presence of duplicates might lead to exploding joins downstream, disrupting data integrity and causing performance issues.

In Snowflake, the system-defined DMF for a duplicate check is – SNOWFLAKE.CORE.DUPLICATE_COUNT(<query>).

Let’s say below is the initial state of the table for ACCOUNT_NUMBER = 10000000.

As shown above, the table has no duplicates based on the ACCOUNT_NUMBER. We can use the DMF mentioned above to validate this. It will return 0 to signify that there are no duplicates, as shown below.
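
A sketch of that manual check (assuming the table lives in DEMO.PUBLIC):

-- Manually invoke the system DMF; returns 0 when ACCOUNT_NUMBER has no duplicates
SELECT SNOWFLAKE.CORE.DUPLICATE_COUNT(
  SELECT account_number FROM demo.public.customers
);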

Now, we will assign the DMF to the table so that if there are duplicate records in future data pipeline runs, we can see them in the results view SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS.

To assign the DMF to the table, we must first add a data metric schedule to the table CUSTOMERS. For now, we will set the schedule to be triggered on any DML (INSERT/UPDATE/DELETE) changes on the table like below.
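
The statement would look roughly like the following (the DEMO.PUBLIC location of the table is assumed):

-- Run all DMFs on this table whenever its content changes via DML
ALTER TABLE demo.public.customers
  SET DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';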

Remember that a table/view can have only one data metric schedule, which executes all the metrics set on the table simultaneously. Hence, the scheduling frequency must first be evaluated with an understanding of the end-to-end process timings and then added to the table. Once this is done, we will add the DMF to the table, as shown below.
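
Attaching the system DMF to the key column might look like this:

-- Attach the duplicate-count check to the ACCOUNT_NUMBER column
ALTER TABLE demo.public.customers
  ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.DUPLICATE_COUNT ON (account_number);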

To check the status of the DMF addition to the table, we can use the DATA_METRIC_FUNCTION_REFERENCES table function as below.
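
A sketch of that status check (using the Information Schema table function; adjust the fully qualified table name as needed):

-- List the DMFs attached to the CUSTOMERS table and their schedule status
SELECT *
FROM TABLE(
  INFORMATION_SCHEMA.DATA_METRIC_FUNCTION_REFERENCES(
    REF_ENTITY_NAME   => 'demo.public.customers',
    REF_ENTITY_DOMAIN => 'table'
  )
);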

As you can see, the schedule status is "STARTED", which means the DMF is successfully attached and active on the table. Now, if we have a scenario where there is a duplicate entry of this column in the table, we will automatically be able to see them. Say we have a duplicate record like the one below on the next run.

In this scenario, Snowflake will evaluate the DMF in the background as per the schedule, and we will be able to see the results after a ~5-minute lag from the scheduled run like below.
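
One way to inspect those results is a query along these lines (the column names reflect our understanding of the view and may differ slightly):

-- Latest data quality measurements for the CUSTOMERS table
SELECT scheduled_time, measurement_time, table_name, metric_name, value
FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE table_name = 'CUSTOMERS'
ORDER BY measurement_time DESC;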

The value 1 means one duplicate was found, per the DMF set on the table CUSTOMERS.

Since the schedule defined here is "TRIGGER_ON_CHANGES", the SCHEDULED_TIME is null in the above output. If it were a cron schedule, the field would have displayed the timestamp of the scheduled run.

This result and this view can also be used to build an alerting system to notify the necessary teams of duplicates in CUSTOMERS based on ACCOUNT_NUMBER.

By following these steps, we can set a Data Quality check to detect duplicates in any column in a table or view in Snowflake.

Late Arriving Dimensions

Late-arriving dimensions refer to a data discrepancy scenario in which a dimension key is present in new fact data, but the corresponding record does not yet exist in the dimension table. Like the “glass half full or glass half empty” framing, these records are also called early-arriving facts. This can result in incomplete reporting and can corrupt dependent downstream applications.

Let’s say, in the dataset below, SUBSCRIPTION_ID is the primary key of the dimension table (the SUBSCRIPTIONS table), which in turn is the foreign key in the CUSTOMERS table. So, every record in the CUSTOMERS table must have a SUBSCRIPTION_ID value that is also present in the SUBSCRIPTIONS table.

Since Snowflake doesn’t enforce primary key and foreign key constraints for traditional tables, there can be a scenario where updated data has not flowed into the SUBSCRIPTIONS table but has arrived in the CUSTOMERS table, and there is no logic in the data pipeline to check for this. This results in an incomplete data load, which might impact downstream data.

To detect such scenarios, we can create a Data Metric Function to check the foreign key dependency like below.
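
A minimal sketch of such a user-defined DMF (the name matches the DMF dropped later in this post; the NUMBER datatype for SUBSCRIPTION_ID is an assumption):

CREATE OR REPLACE DATA METRIC FUNCTION demo.public.late_arriving_dimension_count(
  arg_t TABLE(arg_subscription_id NUMBER)  -- assumed datatype for SUBSCRIPTION_ID
)
RETURNS NUMBER
AS
$$
  -- Count fact rows whose subscription key is missing from the dimension table
  SELECT COUNT(*)
  FROM arg_t
  WHERE arg_subscription_id IS NOT NULL
    AND arg_subscription_id NOT IN (
      SELECT subscription_id FROM demo.public.subscriptions
    )
$$;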

In this DDL statement, we have to specify the arguments correctly. The arguments accept only one table (the one our DMF will be added to), along with columns and datatypes of that table. Columns of other tables cannot be specified as arguments and must be referenced directly in the query definition, as shown in the snippet above.

Next, we will add the DMF to the table, as shown below.
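
That assignment might look like this:

-- Attach the late-arriving-dimension check to the foreign key column
ALTER TABLE demo.public.customers
  ADD DATA METRIC FUNCTION demo.public.late_arriving_dimension_count ON (subscription_id);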

Now, as soon as a record in the CUSTOMERS table satisfies the late-arriving-dimension criteria, the DMF will capture it in the same results view, and we can see it in a similar way, as shown below.

The above signifies that one such record satisfies the criteria of late-arriving dimensions. Remember that this will also be visible after a lag of ~5 minutes.

Based on this, data engineers can tweak the pipelines to handle such scenarios or inform upstream teams of any delays at their end.

Invalid Address

In any dataset with user demographic information, the address details are always challenging to validate. They have multiple parts, such as Street, House, State, and Zip/postal codes, each with a unique, valid pattern.

Hence, it’s common to have records that do not adhere to the correct format, resulting in incorrect reporting based on location elements like state and Zip/postal codes. In this use case, we will see how to determine whether the Zip/postal codes are valid. 

ZIP codes in the USA contain only digits: either five digits, or five digits followed by a hyphen and four more digits, in this format: ##### or #####-#### (five or nine digits).

So, first, we will create a DMF to get the count of records that don’t conform to the regex pattern for valid US ZIP codes – ^\d{5}(-\d{4})?$ – on the ZIP field in the CUSTOMERS table, like below.
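
A sketch of that DMF (the single STRING argument matches the ALTER FUNCTION statement later in this post; [0-9] is used in place of \d to sidestep string-escaping concerns):

CREATE OR REPLACE DATA METRIC FUNCTION demo.public.invalid_zip_count(
  arg_t TABLE(arg_zip STRING)
)
RETURNS NUMBER
AS
$$
  -- Count rows whose ZIP code does not match ##### or #####-####
  SELECT COUNT_IF(NOT REGEXP_LIKE(arg_zip, '^[0-9]{5}(-[0-9]{4})?$'))
  FROM arg_t
$$;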

Once done, we will add the DMF to the table as in the previous two use cases.
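
Assuming the column is named ZIP, the assignment might look like this:

-- Attach the invalid ZIP code check to the ZIP column
ALTER TABLE demo.public.customers
  ADD DATA METRIC FUNCTION demo.public.invalid_zip_count ON (zip);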

Once that is done, as soon as a new run loads records with invalid ZIP codes, the DMF will capture them in the same results view, and we can see them in a similar way, as shown below.

This signifies one record in the CUSTOMERS table with an invalid ZIP code. The data engineers can then take action to resolve the issue in the data pipeline itself.

Scheduling

We have covered how to create and assign DMFs to tables and see the results of the data quality monitoring for three different use cases. However, for all of them, the schedule was set to "TRIGGER_ON_CHANGES", meaning the DMFs run only when DML changes occur on the table.

There is another way of scheduling using a CRON schedule like the one below –

				
ALTER TABLE DEMO.PUBLIC.customers SET
DATA_METRIC_SCHEDULE = 'USING CRON 00 02 * * * America/Chicago';
				
			

The above will set the schedule of all the DMFs assigned to a table to run daily at 02:00 AM US Central time.

After assigning the schedule, the initial state of the SCHEDULE_STATUS will be as shown below.

After some time, its status will change to "STARTED" as shown below.

We can also use an interval-based schedule like the code below.

				
ALTER TABLE DEMO.PUBLIC.customers SET DATA_METRIC_SCHEDULE = '5 MINUTE';
				
			

The interval-based schedule accepts only specific values, such as 5, 15, and 30 minutes.

Challenges with the Data Quality Monitoring Feature

Although the Data Quality Monitoring feature in Snowflake is a huge boost to its rich ecosystem of services and features, it is still relatively new and less explored, and it comes with its own shortcomings. Apart from the limitations listed in the Snowflake documentation for this feature, below are some of the challenges and limitations we have seen with Data Quality Monitoring.

  • Every table/view can have only one data metric schedule, i.e., all the DMFs associated with it start running almost simultaneously. In scenarios where one table/view is being updated from multiple pipelines, this limitation might result in false negatives and redundant quality checks. Ideally, there should be an option to schedule each DMF individually rather than having the schedule at the table/view level.

  • The schedules might run with some lag from the scheduled time. Since this feature uses serverless compute, with Snowflake starting and stopping the service on its own, there is no way to monitor the execution of the DMF in real time. Hence, we must account for this lag if we build alerts on the DMF results view.

  • When defining a user-defined DMF, multiple tables and their columns cannot be used as arguments for the function. This results in using hardcoded values for such scenarios, requiring additional handling at the CI/CD level to make them environment-independent.

Best Practices

We would like to recommend the below standards while implementing this feature.

  • Create a custom role and grant it ownership, usage, and full select access on the object so that it can create, add, and schedule the DMFs.

  • After creating the DMF, always call it manually using SELECT <data_metric_function>(<query>) to validate whether the logic defined in the DMF is working properly or not.
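
For example, with the invalid ZIP code DMF from this post (the ZIP column name is an assumption), a quick validation could look like this:

-- Should return the current count of invalid ZIP codes in CUSTOMERS
SELECT demo.public.invalid_zip_count(
  SELECT zip FROM demo.public.customers
);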

  • If the logic for the DMF does not need to be public, please secure the DMF. Below is an example of the same.

				
ALTER FUNCTION demo.public.invalid_zip_count(TABLE(STRING)) SET SECURE;
				
			
  • If a DMF is no longer required after some time, please drop the DMF association with the table so that it does not execute in subsequent runs, saving serverless credits. This can be done like below.

				
ALTER TABLE demo.public.customers DROP DATA METRIC FUNCTION
DEMO.PUBLIC.LATE_ARRIVING_DIMENSION_COUNT ON (SUBSCRIPTION_ID);
				
			

Closing

Now that we have covered how data metric functions implement data quality monitoring in Snowflake, you can start exploring the different built-in DMFs available in Snowflake and build your own DMFs to add data quality checks to your data pipelines. This feature will allow you to meet service-level agreements for your data and make your data credible and consistent for the business to consume.

If you have questions or need assistance setting up Data Quality Monitoring or leveraging DMFs in Snowflake, please contact the experts at phData.

FAQs

Does Snowflake offer other data governance capabilities beyond Data Quality Monitoring?

Snowflake has a diverse partner ecosystem, and partners like Sigma, Talend, Collibra, and Informatica enable strong data governance implementations. Along with Data Quality Monitoring, Snowflake also has other native data governance features like Data Lineage, Data Clean Rooms, Object Insights, and Trust Center. These are all collectively housed under Snowflake Horizon.

Is there an additional cost for using Data Metric Functions?

Yes, some costs are incurred, but only when a data metric function (DMF) is executed on an object. These functions are integrated into Snowflake, so there are no additional licensing costs. The cost per execution (i.e., a query that runs the DMF) is much lower when billed by Snowflake compared to a third-party provider, which would typically charge a premium.
