Any database be it a transactional or a data warehouse will always have records that need to be deleted, either to clean up the complete table or specifically to delete a few records.
The Snowflake Data Cloud, like other databases, has SQL commands to delete specific records or complete tables. In addition, the deleted records can also be recovered quickly using the Time Travel feature.
The syntax of deleting records in Snowflake is very similar to ANSI SQL, however, delete commands in Snowflake can be combined with other features thereby providing greater control for the developers.
In this blog, we’ll look at the various ways to delete records and also to recover them in Snowflake. This post will summarize the different options for deleting records in Snowflake including:
- Delete Records Using Subqueries
- Delete Data Using Filters
- Delete Records Using Correlated Subquery
- How to Delete Duplicate Records
- Delete Semi-Structured/Variant Data Type Data
- Comparison of In-Clause/Using-Clause/Exist-clause Delete types
- Delete Records as a Transaction
- Soft Deletes
Why Delete Records in a Data Warehouse
There are several reasons why an organization would need to delete records in a data warehouse. One of the basic use cases is the regular ETL job to delete and reload historical data to insert new incremental data or completely wipe out older data to insert new data. But there are specific business use cases where an organization will need to delete records in a data warehouse.
One of the major use cases is compliance. Any organization might have to delete very old historical records and retain only the last few years of data to satisfy compliance requirements. Some organizations will have a scheduled process that will scan historical data on a regular basis and delete those records over a specific threshold.
In a more specific use case, EU GDPR allows any individual to force an organization to delete personal records under the “Right to be forgotten/Right to erasure”. If an organization receives such a request, they need to comply by deleting any or all records of specific individuals from all databases.
How to Delete Records in Snowflake
First, let’s delete some records in Snowflake! For this blog, let’s consider Snowflake’s sample data SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.CUSTOMER. This is cloned into a new schema with the same name CUSTOMER with the following columns:
Sample data in this table will look like this:
The basic syntax to delete records from Snowflake is below:
Delete from
E.g. delete from customer;
Truncate table
E.g. truncate table customer;
Both these commands will completely wipe out all the records in the table. Even though they are used for the same purpose, there is a slight difference. While Delete and Truncate maintain the deleted data (micro partitions) for recovery purposes (using Time Travel), Truncate removes load metadata as well.
Snowflake usually maintains load metadata (copy/Snowpipe) for each of the tables which includes file name, eTag, size, etc. and the Truncate command removes these load metadata. If this metadata is required, then it is better to use the Delete command.
Truncate is usually faster and it is a metadata-only operation. While Truncate and Delete seem to perform the same type of deletion, Delete statements will bring up a virtual warehouse whereas Truncate does not require a virtual warehouse to be on.
In the screenshot below, we can see Truncate did not use a warehouse (warehouse_size is null), whereas Delete used the warehouse for the operation. We can also see that Truncate is 50% faster (total_elapsed_time) than Delete operation for the same sized table (15M).
How to Delete Records Using Filters
Specific records can be deleted from tables using simple filter conditions or from another table or using complex subqueries. Simple delete using a filter condition will just include where condition in the delete statement.
The syntax looks like this:
delete from where condition;
E.g. delete from customer where c_custkey = 123
Now let’s consider some special cases for deletion.
How to Delete Records Based on Another Table
While most of the simple use cases have delete statements with filters, there will always be cases where we need to delete records based on another table.
Let’s create another table with some customer keys for deletion.
create table invalid_customers (c_custkey string, reason string);
insert into invalid_customers values
(11761864,'No longer a customer'),
(11761910,'No longer a customer'),
(11761864,'Invalid entry');
The table will look like this:
Now let’s write a query to delete the records from the customer table based on this invalid_customers tables. There are different ways to write queries in this case.
Using “In” Clause
delete from customer where c_custkey in (select c_custkey from invalid_customers);
Using “Using” (inner join) Clause
delete from customer as cust using invalid_customers as inactiv_c where cust.c_custkey = inactiv_c.c_custkey;
Using “Exists” Clause
delete from customer as cust where exists(select * from invalid_customers as inactiv_c where cust.c_custkey = inactiv_c.c_custkey);
Performance Considerations:
The performance of these three queries are very similar, but looking at the query profile, “in” and “exists” clause adds additional aggregate where “using” clause doesn’t generate any aggregate.
While looking at this query profile, we may think that the performance of delete queries may depend on how the query is written. While it is partially correct, the actual performance of the delete query depends on the number of micro partitions where deleted records resize.
As you may know, Snowflake organizes data in the form of micro partitions. For high-volume tables, the number of micro partitions may span up to 1000s or even millions (TB/PB sized tables).
Snowflake usually maintains metadata information about each of the micro partitions. Micro partitions are immutable, so Snowflake needs to rewrite the micro partitions into newer micro partitions where the delete action is performed.
In a simple case of deleting a couple of million records in a 100-200 million records table may result in rewriting 90 percent or almost all micro partitions. In some scenarios, clustering may be helpful as it will bring down the number of micro partitions to scan and rewrite. But this may not always be feasible when delete is performed on unique keys/ids.
In a high-volume table spanning multi TB size, it is recommended to cluster tables based on low cardinality keys and redesign the ETL process to delete and reload based on clustering keys. There will still be overhead but it may not be as high as rewriting all micro partitions.
How to Delete Records with Semi-Structured Data
Snowflake has special data types like Variant, Object, and Array to store semi-structured data and map/array type data. While the basic syntax of delete is similar, the filters will require specifying fields with dot/bracket notation.
Let’s create a table with some semi-structured data in a variant column.
create or replace table users(record variant) as
select to_variant(parse_json($1)) from values
('{"name":"John", "age":30, "car":"Honda"}'),
('{"name":"Alex", "age":40, "car":"Toyota"}'),
('{"name":"Tim", "age":20, "car":"Lexus"}');
The data in the table will look like this:
Let’s write a delete query to delete all users where age > 20. The query will look like this:
delete from users where record:age::number > 20;
In this query, the field age is represented as a record:age with an additional typecast specification number using :: notation. Running the above query will delete two records from the table.
The following section has queries to handle some use cases like deleting duplicate records, top 10 records, etc.
Deleting Exact Duplicates From Table
Let’s consider a table customer which has exact duplicates inserted in all the fields.
To remove these exact duplicates, the query will look like the table below. Even though this query does not have “delete” or “truncate”, the “overwrite” clause will implicitly run “truncate” and then distinct records. This will be visible in the query profile. Snowflake runs this as a two-step process with step 1 running truncate and step 2 being the insert.
Insert overwrite into customer select distinct * from customer;
After deleting records, the table will look like this:
How Do We Ensure We Delete Records as a Transaction
Snowflake has auto-commit enabled, so any delete will be committed automatically. But there might be cases where we want to ensure that deletes from multiple tables run as a transaction. Consider the same example where we used a child table (invalid_customer) to delete records from the customer table.
Now if we need to delete records from both tables, then enclose them within a transaction. This will ensure that the delete will be committed only if both deletes succeed.
begin transaction;
delete from customer where c_custkey in (select c_custkey from invalid_customers);
delete from invalid_customers;
Commit;
What About Soft Deletes?
When deleting records from a data warehouse, there is always a case for soft deletes. Soft delete means we do not actually delete records, but we add additional columns (say delete_ind) in the table so that we can update the column which will mark the record as deleted rather than physically deleting it from the table.
We use soft delete records for different reasons. The main reason is to retain the history of data without actually losing them. In Snowflake, we can always use Time Travel to recover deleted records but if a business use case requires deleted records for some sort of historical analysis, then it is better to soft delete them rather than actually delete records.
For example, in the customer table we used above, we can add additional columns so that the table will look like this:
alter table customer add column
delete_ind varchar(1),
delete_ts timestamp ;
Once the column is available and when a record needs to be deleted, we will instead update and mark the record as deleted. For example,
update customer set delete_ind=’Y’, delete_ts=current_timestamp() where c_custkey=238129;
An important point to consider is that Snowflake is designed for data warehouses and it will not perform well for record by record updates. Even if records are updated in batches, it still needs to rewrite micro partitions (remember micro partitions are immutable), so the performance of delete and update will be very similar.
Closing
There are different ways to delete records in Snowflake using queries. Most of the select statement style filters can be applied to delete as well (multi-column filter condition). The type of delete may depend on the use case and requirement.
While there is more than one way to achieve the same results as seen above, choosing the right solution will require a good understanding of Snowflake architecture and business use case.
At phData, we have experts who excel in both technology and business that would be happy to help out. We’ve also been awarded the 2022 Snowflake Partner of the Year. Reach out to us today with your toughest Snowflake questions.
Join our team
Partners
Subscribe to our newsletter
Data Coach is our premium analytics training program with one-on-one coaching from renowned experts.
Accelerate and automate your data projects with the phData Toolkit