When you bring raw data into your Snowflake Data Cloud from different sources within your organization, it very likely won’t be formatted in a way that’s easily consumable for your reporting and machine learning needs.
That’s because some source systems may provide semi-structured data, while others are columnar. Some may pad numbers with zeros, while others may not. A field may be called “employee_id” in one system, while another system may call it “EID,” yet those fields represent the same data in Snowflake.
What is Data Curation?
Data curation is the act of making field names consistent with each other, formatting values for use, and applying the correct data types. This makes it easier for users to consume data by providing a consistent experience and helps join data from different sources even if they have been maintained differently.
Beyond making your data and fields easier to consume, you will likely have a set of tables or views to publish data. These views or tables aggregate, normalize, or denormalize tables from the previous curation step to make them faster and easier to use and reuse for a specific use case.
In this blog, we’ll take raw data that you have already ingested into Snowflake and give you two approaches to standardize that data, so it’s ready to use. One method uses views, and the other uses tasks.
What are the most common use cases for data curation?
The primary use cases for curating data include: aligning data types from various source systems, converting strings to numbers or the other way around, removing erroneous rows, and assigning default values to empty columns.
Cloud Services vs. User-Defined Virtual Warehouses
Cloud Services
Any operation that you do requires compute power. Cloud Services is the computing power that processes built-in Snowflake features. While this includes handling SQL statements that create database objects, like defining a table or a schema, Cloud Services is also used to maintain materialized views in real-time. We’ll talk about materialized views later.
What’s important to know about Cloud Services is that you cannot put a resource monitor on it to limit spending, and you cannot control how it scales.
Be cautious about ongoing operations that use Cloud Services because of the lack of control over how it scales up and how much you spend.
User-Defined Virtual Warehouses
A warehouse in Snowflake is computing power that you allocate to your account. You can specify how big of a cluster to use or even to use a multi-cluster. Most importantly, you can place a resource monitor on it to limit the amount of money you spend over a specified timeframe.
I won’t go into details on optimizing warehouses in this post, but I will say that this level of control can help you not only maximize performance but reduce cost.
How to Transform Data with Views
Views perform simple transformations efficiently.
For example, they let you rename columns, flatten semi-structured data to look structured, or omit columns that aren’t useful downstream.
To transform data, use a non-secure view. Secure views are used for managing row-level security and not something you should be doing at the same time as transforming your data. A materialized view acts similar to a table that’s updated automatically by Snowflake. It is populated with data specified by the query of the table you used to create the view.
You cannot use materialized views to transform more than a single source table.
If your data changes frequently, it can be expensive to maintain a materialized view because Snowflake will trigger a new transformation the moment it detects the backing data changes, meaning if the data is frequently changing, the warehouse doing the work would be consuming credits most of the day.
Data Transformation Example Using a Standard View
We rename columns from an employee table to be consistent with other systems’ columns and remove the leading and trailing spaces from the employee’s first name.
CREATE VIEW EMPLOYEE_VIEW AS (
SELECT
EID AS EMPLOYEE_ID,
TRIM(EMPLOYEE_FNAME) AS FIRST_NAME
FROM EMPLOYEE
);
A view can be much more complex where it performs aggregations or otherwise manipulates the data.
It is important to note that a caller’s defined warehouse will do the work when you use non-materialized views. If you materialized the view, Snowflake re-runs the view’s query in the background in a predefined warehouse, and the cost will be part of your cloud services bill.
How to Transform Data with Tasks
Tasks run at a set frequency to transform and update a curated representation of the data.
This approach can offer the same performance benefits of a materialized view without the unpredictable cost of updating with every change to the source table.
Unlike materialized views, you can also combine data from multiple source tables rather than a single source.
Tasks have other advantages over materialized views:
- They allow you to specify a warehouse, which can help you chargeback transformation costs to the correct business unit.
- Unlike cloud services, warehouses also support resource monitors to prevent budget overages.
Data Transformation Example Using Tasks
Creating a task is relatively easy. Here is a simple example that runs once per hour:
CREATE OR REPLACE TASK curate_employee_data
WAREHOUSE = 'my_project_wh'
SCHEDULE = '60 MINUTE'
USER_TASK_TIMEOUT_MS = 60000
COMMENT = 'Update the curated employee table from the raw employee table.'
AS
INSERT OVERWRITE INTO curated_employee
SELECT
EID as EMPLOYEE_ID,
TRIM(EMPLOYEE_FNAME) as FIRST_NAME
FROM raw_employee;
The query looks a little complicated, but it isn’t too bad if we break it down.
- We define a task named curate_employee_data, which uses a specific warehouse and runs once per hour.
- Then specify what the task does: run an “insert overwrite” to copy the data from the raw_employee table to the curated_employee table with the proper column names.
For the purposes of this post, we chose a straightforward example because it demonstrates the core idea with minimal extraneous transformations. You can create more complex tasks as needed to manage your specific use cases.
Which Method is Right for You?
We’ve covered two ways of curating your tables within Snowflake without having to use external tools.
Views are simple to implement and useful for simple transformations in real-time. Tasks can be used for more complicated non-real-time situations while managing cost and boosting performance.
There is no right answer as to the best method of transforming your data. There’s only the method that best suits your budget and data requirements. Only you can balance those conflicting goals.
If you still have questions on which method is best for your use case, our team is here to help! Get in touch, and we’d be more than happy to answer your data curation or Snowflake questions.
Common Questions About Transforming Data
The most common challenge with data transformation is determining how real-time the data needs to be. Everybody wants data to be up-to-the-second fresh until cost and performance challenges get in the way. Views offer the most real-time solution but can be slow if the query supporting the view is complicated. Materialized views can help with performance, but then the cost is unpredictable and potentially high.
Yes, any great tool will help you ingest data into Snowflake and do the transformation before or after loading the data.
However, we wouldn’t recommend it. phData recommends loading the data in its most raw form, especially if you plan to use Snowflake as a Data Lake. This approach allows for multiple different types of transformations to happen on the same data for each use case.
While you can use Snowflake’s built-in features to do data transformation, there are plenty of reasons you may choose other tools to do the transformation from the raw set of tables in Snowflake to a curated collection of tables within Snowflake. Customers who do this are often looking to leverage their existing processes and expertise.