In this blog, we will explore Slowly Changing Dimensions (SCDs) and the magic of dbt snapshots. Specifically, we’ll show how to create dbt snapshots and how they simplify the setup and maintenance of SCD Type-2 tables, offering a significant advantage for data management. Let’s begin!
What is an SCD?
SCD stands for Slowly Changing Dimension: a dimension whose data changes slowly over time. There are several approaches to handling changes to dimension data, and these approaches are categorized into different SCD types.
The three most common types are:
SCD Type 1
SCD Type 2
SCD Type 3
Here is the sample data we will use to demonstrate the different SCD types. Consider that the product (dimension) table below holds product information in fields called dimensions.
product_id | product_name | category | price | update_date |
---|---|---|---|---|
1 | MacBook Air (15-inch, M2, 2023) | Computers & Accessories | $999 | June 13, 2023 |
In the product table, we have a product, MacBook Air (15-inch, M2, 2023), and the price is $999 as of June 13, 2023.
SCD Type 1
In this type, changes overwrite the existing data, so the dimension only contains the most current information. For instance, if the business decides to increase the price of the MacBook Air (15-inch, M2, 2023) in the table above to $1500, they would update the price field to $1500 and also update the update_date to capture the date of the price modification. As a result, the table is transformed as follows.
product_id | product_name | category | price | update_date |
---|---|---|---|---|
1 | MacBook Air (15-inch, M2, 2023) | Computers & Accessories | $1500 | Aug 18, 2023 |
With this approach, it becomes challenging to answer a question such as what the MacBook Air (15-inch, M2, 2023) cost back in June, because the old price has been overwritten. Any table that simply overwrites values like this is a representation of SCD Type 1.
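To make the mechanics concrete, a Type-1 change is just an in-place update. A minimal SQL sketch, assuming a table named product with the columns shown above (table and column names are illustrative):

-- SCD Type 1: overwrite the current values in place; no history is kept
update product
set price = 1500,
    update_date = '2023-08-18'
where product_id = 1;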
SCD Type 2
This type involves creating a new record in the dimension table whenever the data changes. The new record carries a start date for when it becomes effective and an end date that is either a high date or null, depending on the design; the old record is closed by setting its end date. This way, historical changes are preserved, and you can track the history of the data over time.
In the example below, the price of the MacBook Air (15-inch, M2, 2023) was updated to $1500 on Oct 31, 2023. We therefore closed the previous record by populating its end_date with Oct 31, 2023, and created a new record whose start_date is the previous record’s end_date and whose end_date is populated with a high date.
product_id | product_name | category | price | start_date | end_date |
---|---|---|---|---|---|
1 | MacBook Air (15-inch, M2, 2023) | Computers & Accessories | $999 | June 13, 2023 | Aug 18, 2023 |
1 | MacBook Air (15-inch, M2, 2023) | Computers & Accessories | $1100 | Aug 18, 2023 | Oct 31, 2023 |
1 | MacBook Air (15-inch, M2, 2023) | Computers & Accessories | $1500 | Oct 31, 2023 | Dec 31, 9999 |
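As a sketch of the mechanics, a Type-2 change is typically an update that closes the active record, followed by an insert of the new version. The SQL below assumes a product table with the start_date/end_date columns shown above (names are illustrative):

-- SCD Type 2, step 1: close the currently active record for the product
update product
set end_date = '2023-10-31'
where product_id = 1
  and end_date = '9999-12-31';

-- SCD Type 2, step 2: insert the new version, effective from the old record's end date
insert into product (product_id, product_name, category, price, start_date, end_date)
values (1, 'MacBook Air (15-inch, M2, 2023)', 'Computers & Accessories', 1500, '2023-10-31', '9999-12-31');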
SCD Type 3
In SCD Type 1, we keep only the latest information; in SCD Type 2, we keep all of the historical information; in SCD Type 3, we maintain only the current and previous versions of the data, typically in separate columns for the current and previous values.
For instance, we keep two price columns in the product table to maintain the previous version of the price dimension: the price field stores the latest value, and the previous_price field stores the previous value (its initial value is null). With this method, you have only limited access to historical data.
product_id | product_name | category | price | previous_price | updated_date |
---|---|---|---|---|---|
1 | MacBook Air (15-inch, M2, 2023) | Computers & Accessories | $1100 | $999 | Aug 18, 2023 |
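Mechanically, a Type-3 change shifts the current value into the "previous" column before overwriting it. A minimal sketch, again assuming the illustrative product table above:

-- SCD Type 3: keep only the current and immediately previous values
update product
set previous_price = price,
    price = 1100,
    updated_date = '2023-08-18'
where product_id = 1;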
SCD types are very important in data warehousing for effectively managing changes to dimension data over time. These types enable businesses to make informed decisions by understanding data changes and trends over different periods.
SCD Type 2 can result in a large volume of records, particularly when changes occur frequently, including changes caused by human error, such as repeated corrections to addresses or phone numbers.
SCD Type 3 is great if there are only a couple of values you care about and you only need to compare the current value with the previous one. It reduces storage but limits the depth of historical data available.
What are dbt Snapshots?
dbt snapshots are like time machines for data. They create Type-2 Slowly Changing Dimension tables that help us track how things change over time, just like looking at old photos to see how people have changed. They also save different versions of information so we can compare them, like comparing two drawings you made at different times.
Why are dbt Snapshots Important?
dbt snapshots create SCD Type-2 tables and are important for tracking changes in data.
They simplify the creation of SCD Type-2 tables: we don’t need to write complex logic by hand; a snapshot is just a select query with a set of configurations in a .sql file.
How do dbt Snapshots Work?
Snapshots are basic select queries that are materialized as tables in your data warehouse. When you execute the dbt snapshot command, it compares the current state of the dataset with its previous state. If there are changes in the records, it updates the snapshot table accordingly, creating new entries for modified or new records.
In the background, dbt manages the necessary actions, like inserting, updating, or merging records in the snapshot table. Additionally, if there are new columns in the source query, dbt automatically includes these columns in the snapshot table.
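Conceptually, each snapshot run behaves roughly like the merge below. This is only a simplified illustration with placeholder names (staged_source, snapshots.orders_snapshot), not the exact SQL dbt generates for your warehouse:

-- Simplified view of a snapshot run: close changed rows, insert new versions
merge into snapshots.orders_snapshot as snap
using staged_source as src
  on snap.dbt_scd_id = src.dbt_scd_id
when matched and src.dbt_change_type = 'update'
  -- the existing version is invalidated by stamping its dbt_valid_to
  then update set dbt_valid_to = src.dbt_updated_at
when not matched
  -- brand-new rows and new versions of changed rows are inserted as current
  then insert (order_id, status, dbt_valid_from, dbt_valid_to, dbt_scd_id, dbt_updated_at)
       values (src.order_id, src.status, src.dbt_valid_from, null, src.dbt_scd_id, src.dbt_updated_at);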
How to Snapshot Your Data
In dbt, snapshots are essentially SQL queries written within a snapshot block located in a .sql file. Additionally, you must set up your snapshot configuration to instruct dbt on how to identify record changes.
Follow the steps below to create snapshots in dbt:
Determine your source table
Create a snapshot model
Configure your snapshot
Execute your snapshot
Identify your snapshot table
Please note that the following examples assume a basic understanding of model-building steps. Let’s get started.
Determine Your Source Table
Initially, you should identify the source table you want to use as the foundation for creating a snapshot. This table typically contains slowly changing dimensions that you wish to monitor over time. For instance, these could be user dimension tables, lead status tables, or records related to purchases.
Creating a Snapshot Model
After identifying the source, create a file with a .sql extension in your snapshots directory. Inside this file, use a snapshot block to define the start and end of a snapshot, as shown below. Write a select statement within the snapshot block to select the records you want to capture from the chosen source. For more information about the source function, please refer to this doc.
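Here is a minimal skeleton (the file name orders_snapshot.sql and the jaffle_shop source are illustrative; the configuration is covered in the next step):

-- snapshots/orders_snapshot.sql
{% snapshot orders_snapshot %}

{# the config block described in the next step goes here #}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}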
Configure Your Snapshot
After identifying what data from the selected source needs to be snapshotted, we now have to tell dbt how to identify changes by configuring the unique_key, strategy, updated_at, and check_cols fields using the config block in the snapshot SQL file.
The unique_key field is a required configuration for snapshots in dbt, so ensure that the source table has a unique key; it helps dbt match records between a new result set and an existing snapshot. It’s also recommended to create snapshots in a schema other than the default schema, so use the target_schema config to build the snapshot in a different schema.
Go through the dbt doc to learn more about snapshot properties and configurations.
Note: Snapshots are built into the same target_schema for every developer, no matter who is executing them.
Snapshot strategies determine how dbt identifies changes in rows. dbt includes two built-in strategies: timestamp and check.
Timestamp Strategy
The timestamp strategy relies on an updated_at column to detect row changes. If the updated_at value for a row is more recent than it was at the previous snapshot run, dbt will close out the old record and insert the updated one.
If the timestamps are unchanged, dbt won’t take any action. If your data source includes a timestamp column that is updated whenever a row is modified, it’s recommended that you use the timestamp strategy.
Example:
{% snapshot orders_snapshot_timestamp %}

{{
    config(
      target_schema='snapshots',
      strategy='timestamp',
      unique_key='order_id',
      updated_at='updated_at',
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}
Check Strategy
If the table you’re snapshotting doesn’t have a column that records when a row was last updated, the check strategy can help. This strategy compares the current values of the columns you choose (using check_cols) with their values from the previous snapshot run.
If any of these chosen columns have changed, dbt will update the record. If everything in these columns is still the same, dbt won’t do anything.
Example:
{% snapshot orders_snapshot_check %}

{{
    config(
      target_schema='snapshots',
      strategy='check',
      unique_key='order_id',
      check_cols=['status', 'is_cancelled'],
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}
Managing Hard Deletes
Apart from updates, hard deletes might also occur in the source data. By default, dbt ignores hard deletes, but if you wish to track them, set the invalidate_hard_deletes configuration to True. This allows dbt to invalidate rows that have been removed from the source. However, note that this configuration is only applicable when using the timestamp strategy.
Example:
{% snapshot orders_snapshot_hard_delete %}

{{
    config(
      target_schema='snapshots',
      strategy='timestamp',
      unique_key='order_id',
      updated_at='updated_at',
      invalidate_hard_deletes=True,
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}
Execute Your Snapshot
In the steps above, we created and configured a snapshot model; now it’s time to execute it. Use the dbt snapshot command to execute all snapshot models in your dbt project.
Note: The dbt build command includes the snapshot command.
To execute specific snapshots, provide the snapshot model names separated by spaces after -s on the dbt snapshot command, as shown below. This will execute only the given snapshots and create SCD Type-2 tables in your target data warehouse.
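For example (the snapshot names here are the ones defined in the earlier examples):

# Run every snapshot in the project
dbt snapshot

# Run only the listed snapshots
dbt snapshot -s orders_snapshot_timestamp orders_snapshot_check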
The initial snapshot table is generated when you execute the dbt snapshot command for the first time. This table contains the columns specified in the snapshot select query along with the additional meta fields described below.
Field | Meaning | Usage |
---|---|---|
dbt_valid_from | The timestamp when this snapshot row was first inserted | This column can be used to order a record’s different “versions”. |
dbt_valid_to | The timestamp when this row became invalidated | The most recent snapshot record will have dbt_valid_to set to null. |
dbt_scd_id | A unique key is generated for each snapshot record. | This is used internally by dbt |
dbt_updated_at | The updated_at timestamp of the source record when this snapshot row was inserted. | This is used internally by dbt |
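Because the most recent version of each record has dbt_valid_to set to null, a common pattern for reading only the current rows out of a snapshot is to filter on that field. A small sketch, assuming the snapshot from the earlier examples was built into a snapshots schema:

-- Return only the current version of each order from the snapshot table
select *
from snapshots.orders_snapshot_timestamp
where dbt_valid_to is null;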
Identifying Your Snapshot Table
When you run the dbt snapshot command, it generates a snapshot table in the specified target schema. If no alias configuration is provided, the table name defaults to the snapshot file name.
On the First Run
After executing dbt snapshot -s orders_snapshot for the first time, assume the orders table has only one order; the history for that order is maintained as shown in the image below. Notice that the DBT_VALID_TO field is set to null, i.e., the order is still in the shipped state as of the updated_at value.
On subsequent runs
Consider a scenario where the order status goes from shipped to completed. As a result, that particular record will appear as shown in the image below: the previous version is closed out by populating its DBT_VALID_TO, and a new record is created for the completed status.
Note: The above results were generated using the timestamp strategy, where the meta fields are populated based on the updated_at field. If the check strategy were used, the current timestamp would be used to populate the snapshot meta fields instead.
Snapshot Best Practices
We have learned how to create and execute snapshot models, and now it’s time to know the best practices.
Keep your snapshots simple: don’t add complex transformation logic, and avoid using multiple joins. If the business logic used in a snapshot changes in the future, it will be difficult to apply that change to your snapshots.
It’s good to create snapshots on source tables because it’s essential to understand how your source data is changing; use downstream models to clean up the data.
Snapshotting at the source simplifies the creation of SCD-1 or SCD-2 models. Snapshotting a final model, on the other hand, bakes any coding errors in that model into the snapshot history; such errors are easier to fix if you snapshot the source and build the model from the snapshot.
Include all available columns in the selection: use SELECT * even if a specific column seems irrelevant at the moment, since capturing it in the snapshot could prove beneficial in the future.
Schedule snapshots regularly: snapshots, executed through the dbt snapshot command, capture table changes and should be run on a schedule. Aim for snapshot runs between hourly and daily intervals; needing more frequent snapshots may signal a need to explore alternative methods for tracking source data changes.
Use different schemas for snapshots: this practice simplifies the differentiation between snapshot tables and other table types within your data warehouse.
Conclusion
dbt snapshot is an important feature that allows you to record changes within your data as it evolves. It simplifies the process of creating and maintaining SCD Type-2 tables.
Interested in leveraging dbt snapshots for your data management needs?
Reach out to phData today and discover how our expertise can streamline your data pipelines, making them more efficient and effective.
FAQs
After creating a snapshot table on a source model, if the datatype of a column changes, dbt will not change the type of that column in the existing snapshot table beyond expanding the size of varchar columns.
Yes, we can create a customized snapshot strategy. Kindly refer to the dbt documentation for more details.
If the snapshot job doesn't run daily as intended, the snapshots will become outdated. This might impact analytical queries or reporting that rely on up-to-date information. It can lead to incorrect insights or decisions based on stale data.
This can create a significant backlog that needs to be addressed when the snapshot job resumes. It may need to process a larger volume of data than usual. This could potentially strain the system, leading to longer processing times or resource constraints.