The Snowflake Data Cloud is a market leader among database systems built for the cloud, supporting an unlimited number of warehouses. Snowflake delivers blistering fast query performance, but it does not have indexes.
So the question arises: how do you boost Snowflake's performance? For smaller amounts of data, increasing the warehouse size does work, but once you are in the multi-terabyte range, it doesn't always help.
As Abraham Maslow said, “If the only tool you have is a hammer, you tend to see every problem as a nail.”
This is where you might think about data clustering to increase throughput and decrease latency for your queries. In this blog, we will explore the option of data clustering.
What is Clustering Data in Snowflake?
Snowflake writes data into separate micro-partitions in the order it is loaded into a table. This is how Snowflake naturally clusters data, and it works great for small to medium tables and for tables where recent data is queried often. As the data grows, however, queries become less efficient, and scanning every partition of the table becomes a costly affair.
If you recluster the table on a specific key that is often used in queries, you can improve query efficiency and save resources by only looking at partitions where the target data should be and ignoring, or “pruning,” the other partitions (we will talk about scanning and pruning later).
Snowflake recommends defining cluster keys in the following scenarios:
- When query performance is not ideal and a specific key is being used to query data often.
- When the data size is in the multi-terabyte range, and each query is scanning a large amount of data across many micro-partitions (see the example below).
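If you land in one of these scenarios, defining a cluster key is a single statement. Here is a minimal sketch using a hypothetical sales table clustered on its order_date column (the table and column names are placeholders):

-- add a clustering key to an existing table on the column most often used in filters
alter table sales cluster by (order_date);

-- or define the key up front when creating the table
create table sales_clustered (
    order_id   number,
    order_date date,
    amount     number(12,2)
) cluster by (order_date);

Once a key is defined, Snowflake's Automatic Clustering service keeps the table reasonably well sorted in the background, which is also where the extra credits discussed later come from.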
What Are Micro-Partitions?
Snowflake automatically divides tables into multiple smaller partitions called micro-partitions. Each micro-partition holds around 50 to 500 MB of uncompressed data, and groups of rows from the same table are mapped into individual micro-partitions.
Snowflake is a columnar store, which means that within each micro-partition the data is organized and compressed by column rather than by row. This allows Snowflake to scan the values of a single column without touching the rest of the data in each row. Snowflake also stores metadata about all the rows in a micro-partition, such as:
- The value range for each column in that micro-partition.
- The distinct value count.
- Additional data used for increasing query performance and optimization.
One thing to remember about micro-partitions is that they are immutable; when the data in a partition changes, Snowflake writes a new partition containing the old data plus the changes and then drops the old one. This makes write operations costlier but keeps query performance high.
Why Do We Need Micro-Partitions?
Almost all traditional data warehouses rely on static partitioning of huge tables to improve performance and provide better scaling options. In these systems, each partition is a unit of management that is manipulated independently with dedicated DDL syntax.
Still, static partitioning has limitations such as maintenance overhead and data skew, which leads to unevenly sized partitions. Snowflake solves this problem with micro-partitions, which deliver the advantages of static partitioning without its drawbacks. Micro-partitions are not very performant for individual row inserts, but bulk loading works around that; just remember that Snowflake should not be used as a transactional database.
How Do Micro-Partitions Help Us Make Snowflake Faster?
Snowflake internally tracks micro-partitions and the associated metadata and then uses this metadata to efficiently execute a query. A simple example would be to cluster on a date or timestamp column. That would dramatically decrease the search time as Snowflake would scan fewer micro-partitions to get the target result.
Clustering does come with extra compute costs, so you should only do it when your queries are taking too long or the data is enormous. The ideal case is one where the clustering costs are outweighed by the query savings, which you can really only determine by measuring it.
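One way to measure what clustering actually costs is the AUTOMATIC_CLUSTERING_HISTORY view. A rough sketch, assuming the table already has a cluster key and that your role can read the ACCOUNT_USAGE share:

-- credits spent by Automatic Clustering per table over the last 30 days
select table_name,
       sum(credits_used)         as clustering_credits,
       sum(num_rows_reclustered) as rows_reclustered
from snowflake.account_usage.automatic_clustering_history
where start_time > dateadd(day, -30, current_timestamp())
group by table_name
order by clustering_credits desc;

Compare those credits against the warehouse time your queries save; if the savings don't cover the reclustering spend, clustering probably isn't the right tool for that table.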
Let’s look at one example of clustering.
If one of the columns is something like “published_date” and you search for the value “12/29/1999,” Snowflake will know from the metadata which partitions could have that data. If all of your data is sorted by “published_date,” very few partitions could have the rows you are looking for. If your data isn't sorted, the rows could be in many partitions.
You will see the term “prune,” which refers to the process of disqualifying partitions that could not possibly have the results you are looking for. You will also see the term “scan,” which refers to checking a potential partition to see if it has relevant rows.
Finally, recall that because Snowflake is a columnar store, it can scan the values of a single column in a micro-partition without touching the rest of the data in each row.
Why Do Partitions Need To Be Managed?
In most situations, you do not have to manage partitions or even know they exist. If you insert data with copy-into statements or Snowpipe, as you should, the newest partitions will contain the newest data.
Since most analytics is done on recent data and uses query criteria that filter the results down, you will rarely touch all of the partitions in a large table. Unfortunately, some analytics queries will look at data without being restricted to only what is most recent.
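For reference, a bulk load with COPY INTO looks like the sketch below; the stage path and file format are hypothetical placeholders. Snowpipe loads files from a stage in much the same way as they arrive.

-- bulk load staged files; each load writes new micro-partitions in load order
copy into my_table
from @my_stage/daily/
file_format = (type = csv skip_header = 1);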
Better Pruning and Scanning
We touched on pruning and scanning, but let’s show that visually.
Here’s an example of 4 partitions where the data is not sorted, and the range of values is spread out. When we search for a value of 9000, we’ll scan all partitions:
Partition | Criteria column X range | Search for X = 9000 |
Partition 1 | 1 – 10000 | might have X equal to 9000 |
Partition 2 | 1 – 10000 | might have X equal to 9000 |
Partition 3 | 1 – 10000 | might have X equal to 9000 |
Partition 4 | 1 – 10000 | might have X equal to 9000 |
Now, let's sort (cluster) it on a specific key:
Partition | Criteria column X range | Search for X = 9000 |
Partition 1 | 1 – 2000 | pruned |
Partition 2 | 2001 – 4000 | pruned |
Partition 3 | 4001 – 8000 | pruned |
Partition 4 | 8001 – 10000 | might have X equal to 9000 |
The middle column is a visual representation of the metadata Snowflake stores about each partition, and that metadata is how Snowflake knows which partitions to prune.
Now, Snowflake only has to look at the values in a single partition. Scanning is the name of the process of walking through the values in that partition, which is relatively slow, so we want to avoid looking in places where the value doesn’t exist.
Let's walk through a Snowflake example of how clustering works in query processing. We will create a table with 600M records from the Snowflake sample data (snowflake_sample_data.tpch_sf100.lineitem).
Step 1: Create a base table
create table lineitem as select * from snowflake_sample_data.tpch_sf100.lineitem;
Step 2: Clone the base table and enable clustering on it
create table lineitem_clustered clone lineitem;
alter table lineitem_clustered cluster by (l_shipdate);
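Note that as soon as a cluster key is set, Automatic Clustering starts consuming credits in the background to sort the table. If you want to control when that happens, for example during a large backfill, you can pause and resume it; this sketch is optional and not required for the demo:

-- pause background reclustering while doing heavy bulk loads
alter table lineitem_clustered suspend recluster;

-- resume it once the bulk work is done
alter table lineitem_clustered resume recluster;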
Step 3: Verify that Clustering is complete
select system$clustering_information('lineitem', '(l_shipdate)');
select system$clustering_information('lineitem_clustered', '(l_shipdate)');
Output:
Unclustered table:
{
  "cluster_by_keys" : "LINEAR(l_shipdate)",
  "total_partition_count" : 911,
  "total_constant_partition_count" : 1,
  "average_overlaps" : 272.4083,
  "average_depth" : 165.9956,
  "partition_depth_histogram" : {
    "00000" : 0, "00001" : 0, "00002" : 0, "00003" : 0,
    "00004" : 0, "00005" : 0, "00006" : 0, "00007" : 0,
    "00008" : 0, "00009" : 0, "00010" : 0, "00011" : 0,
    "00012" : 0, "00013" : 0, "00014" : 0, "00015" : 0,
    "00016" : 0, "00032" : 3, "00064" : 45, "00128" : 219,
    "00256" : 644
  }
}
Clustered table:
{
  "cluster_by_keys" : "LINEAR(l_shipdate)",
  "total_partition_count" : 913,
  "total_constant_partition_count" : 2,
  "average_overlaps" : 6.3834,
  "average_depth" : 4.5005,
  "partition_depth_histogram" : {
    "00000" : 0, "00001" : 1, "00002" : 28, "00003" : 188,
    "00004" : 272, "00005" : 241, "00006" : 123, "00007" : 42,
    "00008" : 6, "00009" : 5, "00010" : 7, "00011" : 0,
    "00012" : 0, "00013" : 0, "00014" : 0, "00015" : 0,
    "00016" : 0
  }
}
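The key numbers to watch are average_overlaps and average_depth: the lower they are, the better the table is clustered on that key. If you only want those figures rather than the full JSON, you can extract them with PARSE_JSON; a small sketch:

-- pull the headline clustering metrics out of the JSON result
select parse_json(info):average_depth::float    as avg_depth,
       parse_json(info):average_overlaps::float as avg_overlaps
from (select system$clustering_information('lineitem_clustered', '(l_shipdate)') as info);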
Step 4: Run the same query on the normal and clustered tables
Normal base query:
select * from lineitem where l_orderkey = '506754340';
Query Profile Output:
Note: We need to carefully consider which column we use for clustering. We clustered on “l_shipdate,” but if we write a query that filters only on “l_orderkey,” the query executes as if the table were not clustered at all.
Query on another column:
select * from lineitem_clustered where l_orderkey = '506754340';
Query Profile Output:
Note: Now we will execute the same query as above but add a filter on the column we clustered by, and Snowflake will use pruning to search and execute efficiently.
Query with clustering:
select * from lineitem_clustered where l_shipdate between '1992-05-01' and '1992-07-15' and l_orderkey = '506754340';
Query Profile Output:
As you can see, significantly fewer partitions are scanned because Snowflake knew where to look and pruned the rest. The execution time also dropped from 7.7 seconds to 1.6 seconds, almost a five-times improvement in query performance.
The image below shows the normal and clustered runs side by side for a better comparison.
This shows the power of clustering, but you must also remember that it will bring extra computing costs and overhead. Now we will look into optimizing partitions for improved performance.
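If you want to confirm the pruning numbers outside of the Query Profile UI, the same statistics are exposed in the QUERY_HISTORY view; a rough sketch (the filter on the query text is only illustrative):

-- compare partitions scanned vs. total for the recent test queries
-- note: account_usage views can lag behind by up to ~45 minutes
select query_text,
       partitions_scanned,
       partitions_total,
       total_elapsed_time / 1000 as elapsed_seconds
from snowflake.account_usage.query_history
where query_text ilike '%lineitem%'
  and start_time > dateadd(hour, -1, current_timestamp())
order by start_time desc;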
Optimizing Partitions for Improved Performance
So, if Clustering and Search Optimization won’t work for you, either because of cost or other criteria, you can optimize your partitions in different ways.
Optimize on Load
First, when you load historical data into Snowflake, if you sort the data before you load it, you can improve the performance of all queries using that sorted data. Sometimes, you’ll have queries with competing needs, and I’ve found that you can sort by up to about three columns before seeing diminishing returns.
Your mileage may vary. This works best when you have large amounts of data spanning many partitions and a well-known pattern to which columns are used for filtering results.
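A common pattern for this is to land the raw files in a staging table first and then write the final table with an ORDER BY, so the micro-partitions are created in sorted order. A sketch with hypothetical table, stage, and column names:

-- 1. land the raw files as-is
copy into sales_staging
from @my_stage/historical/
file_format = (type = csv skip_header = 1);

-- 2. write the final table in sorted order so partitions get tight value ranges
insert into sales
select *
from sales_staging
order by order_date, region, customer_id;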
Here is a query to understand which queries are slow and could be improved:
select query_id
     , round(bytes_scanned / 1024 / 1024) as mb_scanned
     , total_elapsed_time / 1000 as elapsed_seconds
     , (partitions_scanned / nullif(partitions_total, 0)) * 100 as pct_table_scan
     , percentage_scanned_from_cache * 100 as pct_from_cache
     , bytes_spilled_to_local_storage as spill_to_local
     , bytes_spilled_to_remote_storage as spill_to_remote
from snowflake.account_usage.query_history
where (bytes_spilled_to_local_storage > 1024 * 1024 or
       bytes_spilled_to_remote_storage > 1024 * 1024 or
       percentage_scanned_from_cache < 0.1)
  and elapsed_seconds > 120
  and bytes_scanned > 1024 * 1024
order by elapsed_seconds desc;
Reoptimizing As Needed
If you have a large table that is poorly ordered and new data coming in is organized fine, you can do a one-time update by using one of two methods:
- // insert-overwrite approach
insert overwrite into my_table
select * from my_table order by my_criteria;
- // create-swap approach
create or replace table y
copy grants
as select * from x order by my_criteria;
alter table x swap with y;
Note: The big issue with the swap approach is preserving ownership and other grants on the new table.
The insert-overwrite approach is simple and easy and happens in a single transaction, but if you have a very large table, it can run out of transaction memory and fail.
The CTAS-swap approach is more complicated and happens in multiple transactions (which makes it impossible to use in some situations where inserts could be happening while this is running); however, it can handle any size table.
If you have streams on a table, be aware of the consequences of modifying the table with either approach.
Scheduling
If the data needs to be re-sorted regularly for performance, but the volume of inserts would make clustering cost prohibitive, you can create a task that runs one of the previous approaches on a schedule. This is easiest with the insert-overwrite approach.
Here’s an example that runs once per day at midnight UTC.
create task my_table_optimization
warehouse=compute_wh
schedule='USING CRON 0 0 * * * UTC'
as insert overwrite into my_table
select * from my_table order by my_criteria;
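One detail worth remembering: tasks are created in a suspended state, so the schedule will not fire until you resume the task explicitly.

-- tasks start out suspended; turn the schedule on
alter task my_table_optimization resume;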
Wrapping Up
Most of the time, you can use built-in tools like clustering and search optimization to help performance. However, understanding what’s happening under the covers and how to optimize for cost is another tool in your belt.
You can read more about using Clustering and the Search Optimization Service in our other blog, which compares and highlights these two features in more detail.
If you’re looking for more tips and methods to boost your Snowflake performance, contact the Snowflake experts at phData! As the 2022 Snowflake Partner of the Year, phData thrives at helping businesses of all shapes and sizes succeed with Snowflake.
Frequently Asked Questions
Does Snowflake have a built-in feature for optimizing partitions?
The built-in feature Snowflake provides to optimize partitions is called clustering keys, which you can add manually or have maintained via Automatic Clustering. Adding a clustering key to a table may improve the performance of queries against that table. However, this isn't recommended for small tables, and there are costs associated with doing this.
When should I use clustering keys?
I recommend only using clustering keys if needed and only after you test and find that they improve performance. In cases where clustering does not help and you still have performance issues, you may need to consider other tools like the Search Optimization Service.