In this blog, we will explore Slowly Changing Dimensions (SCDs) and the magic of dbt snapshots. Specifically, we’ll show how to create dbt snapshots and how they simplify the setup and maintenance of SCD Type-2 tables, offering a significant advantage for data management. Let’s begin!
What is an SCD?
SCD stands for Slowly Changing Dimension: a dimension whose data changes slowly over time. There are several approaches to handling changes to dimension data, and these approaches are categorized into different SCD types.
The three most common types are:
SCD Type 1
SCD Type 2
SCD Type 3
Here is the sample data we will use to demonstrate the different SCD types. Consider that the product (dimension) table below holds product information in fields called dimensions.
product_id | product_name | category | price | update_date |
---|---|---|---|---|
1 | MacBook Air (15-inch, M2, 2023) | Computers & Accessories | $999 | June 13, 2023 |
In the product table, we have a product, MacBook Air (15-inch, M2, 2023), and the price is $999 as of June 13, 2023.
SCD Type 1
In this type, changes overwrite the existing data, so the dimension only contains the most current information. For instance, if the business decides to increase the price of the MacBook Air (15-inch, M2, 2023) in the table above to $1500, they would update the price field to $1500 and also update the update_date to capture the date of the price modification. As a result, the table is transformed as follows.
product_id | product_name | category | price | update_date |
---|---|---|---|---|
1 | MacBook Air (15-inch, M2, 2023) | Computers & Accessories | $1500 | Aug 18, 2023 |
With this approach, it becomes challenging to answer a question such as what the MacBook Air (15-inch, M2, 2023) cost back in June, because the old price has been overwritten. Any table that simply overwrites values like this is a representation of SCD Type 1.
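To make the mechanics concrete, a Type-1 change is just an in-place update. A minimal SQL sketch, assuming a table named product with the columns shown above (table and column names are illustrative):

-- SCD Type 1: overwrite the current values in place; no history is kept
update product
set price = 1500,
    update_date = '2023-08-18'
where product_id = 1;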
SCD Type 2
This type involves creating a new record in the dimension table whenever the data changes. The new record carries a start date for when it becomes effective and an end date that is either a high date or null, depending on the design; the old record is closed by setting its end date. This way, historical changes are preserved, and you can track the history of the data over time.
In the example below, the price of the MacBook Air (15-inch, M2, 2023) was updated to $1500 on Oct 31, 2023. We therefore closed the previous record by populating its end_date with Oct 31, 2023, and created a new record whose start_date is the previous record’s end_date and whose end_date is populated with a high date.
product_id | product_name | category | price | start_date | end_date |
---|---|---|---|---|---|
1 | MacBook Air (15-inch, M2, 2023) | Computers & Accessories | $999 | June 13, 2023 | Aug 18, 2023 |
1 | MacBook Air (15-inch, M2, 2023) | Computers & Accessories | $1100 | Aug 18, 2023 | Oct 31, 2023 |
1 | MacBook Air (15-inch, M2, 2023) | Computers & Accessories | $1500 | Oct 31, 2023 | Dec 31, 9999 |
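As a sketch of the mechanics, a Type-2 change is typically an update that closes the active record, followed by an insert of the new version. The SQL below assumes a product table with the start_date/end_date columns shown above (names are illustrative):

-- SCD Type 2, step 1: close the currently active record for the product
update product
set end_date = '2023-10-31'
where product_id = 1
  and end_date = '9999-12-31';

-- SCD Type 2, step 2: insert the new version, effective from the old record's end date
insert into product (product_id, product_name, category, price, start_date, end_date)
values (1, 'MacBook Air (15-inch, M2, 2023)', 'Computers & Accessories', 1500, '2023-10-31', '9999-12-31');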
SCD Type 3
In SCD Type 1, we keep only the latest information; in SCD Type 2, we keep all of the historical information; in SCD Type 3, we maintain only the current and previous versions of the data, typically in separate columns for the current and previous values.
For instance, we keep two price columns in the product table to maintain the previous version of the price dimension: the price field stores the latest value, and the previous_price field stores the previous value (its initial value is null). With this method, you have only limited access to historical data.
product_id | product_name | category | price | previous_price | updated_date |
---|---|---|---|---|---|
1 | MacBook Air (15-inch, M2, 2023) | Computers & Accessories | $1100 | $999 | Aug 18, 2023 |
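Mechanically, a Type-3 change shifts the current value into the "previous" column before overwriting it. A minimal sketch, again assuming the illustrative product table above:

-- SCD Type 3: keep only the current and immediately previous values
update product
set previous_price = price,
    price = 1100,
    updated_date = '2023-08-18'
where product_id = 1;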
SCD types are very important in data warehousing for effectively managing changes to dimension data over time. These types enable businesses to make informed decisions by understanding data changes and trends over different periods.
SCD Type 2 can result in a large volume of records, particularly when changes occur frequently, including changes caused by human error, such as repeated corrections to addresses or phone numbers.
SCD Type 3 is great if there are only a couple of values you care about and you only need to compare the current value with the previous one. It reduces storage but limits the depth of historical data available.
What are dbt Snapshots?
dbt snapshots are like time machines for data. They create Type-2 Slowly Changing Dimension tables that help us track how things change over time, just like looking at old photos to see how people have changed. They also save different versions of information so we can compare them, like comparing two drawings you made at different times.
Why are dbt Snapshots Important?
dbt snapshots create SCD Type-2 tables and are important for tracking changes in data.
They simplify the creation of SCD Type-2 tables: we don’t need to write complex logic by hand; a snapshot is just a select query with a set of configurations in a .sql file.
How do dbt Snapshots Work?
Snapshots are basic select queries that are materialized as tables in your data warehouse. When you execute the dbt snapshot command, it compares the current state of the dataset with its previous state. If there are changes in the records, it updates the snapshot table accordingly, creating new entries for modified or new records.
In the background, dbt manages the necessary actions, like inserting, updating, or merging records in the snapshot table. Additionally, if there are new columns in the source query, dbt automatically includes these columns in the snapshot table.
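Conceptually, each snapshot run behaves roughly like the merge below. This is only a simplified illustration with placeholder names (staged_source, snapshots.orders_snapshot), not the exact SQL dbt generates for your warehouse:

-- Simplified view of a snapshot run: close changed rows, insert new versions
merge into snapshots.orders_snapshot as snap
using staged_source as src
  on snap.dbt_scd_id = src.dbt_scd_id
when matched and src.dbt_change_type = 'update'
  -- the existing version is invalidated by stamping its dbt_valid_to
  then update set dbt_valid_to = src.dbt_updated_at
when not matched
  -- brand-new rows and new versions of changed rows are inserted as current
  then insert (order_id, status, dbt_valid_from, dbt_valid_to, dbt_scd_id, dbt_updated_at)
       values (src.order_id, src.status, src.dbt_valid_from, null, src.dbt_scd_id, src.dbt_updated_at);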
How to Snapshot Your Data
In dbt, snapshots are essentially SQL queries written within a snapshot block located in a .sql file. Additionally, you must set up your snapshot configuration to instruct dbt on how to identify record changes.
Follow the steps below to create snapshots in dbt:
Determine your source table
Create a snapshot model
Configure your snapshot
Execute your snapshot
Identify your snapshot table
Please note that the following examples assume a basic understanding of model-building steps. Let’s get started.
Determine Your Source Table
Initially, you should identify the source table you want to use as the foundation for creating a snapshot. This table typically contains slowly changing dimensions that you wish to monitor over time. For instance, these could be user dimension tables, lead status tables, or records related to purchases.
Creating a Snapshot Model
After identifying the source, create a file with a .sql extension in your snapshots directory. Inside this file, use a snapshot block to define the start and end of a snapshot, as shown below. Write a select statement within the snapshot block to select the records you want to capture from the chosen source. For more information about the source function, please refer to this doc.
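Here is a minimal skeleton (the file name orders_snapshot.sql and the jaffle_shop source are illustrative; the configuration is covered in the next step):

-- snapshots/orders_snapshot.sql
{% snapshot orders_snapshot %}

{# the config block described in the next step goes here #}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}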
Configure Your Snapshot
After identifying what data from the selected source needs to be snapshotted, we now have to tell dbt how to identify changes by configuring the unique_key, strategy, updated_at, and check_cols fields using the config block in the snapshot SQL file.
The unique_key field is a required configuration for snapshots in dbt, so ensure that the source table has a unique key; it helps dbt match records between a new result set and an existing snapshot. It’s also recommended to create snapshots in a schema other than the default schema, so use the target_schema config to build the snapshot in a different schema.
Go through the dbt doc to learn more about snapshot properties and configurations.
Note: Snapshots are built into the same target_schema for every developer, no matter who is executing them.
Snapshot strategies determine how dbt identifies changes in rows. dbt includes two built-in strategies: timestamp and check.
Timestamp Strategy
The timestamp strategy relies on an updated_at column to detect row changes. If the updated_at value for a row is more recent than it was at the previous snapshot run, dbt will close out the old record and insert the updated one.
If the timestamps are unchanged, dbt won’t take any action. If your data source includes a timestamp column that is updated whenever a row is modified, it’s recommended that you use the timestamp strategy.
Example:
{% snapshot orders_snapshot_timestamp %}

{{
    config(
      target_schema='snapshots',
      strategy='timestamp',
      unique_key='order_id',
      updated_at='updated_at',
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}
Check Strategy
If the table you’re snapshotting doesn’t have a column that records when a row was last updated, the check strategy can help. This strategy compares the current values of the columns you choose (using check_cols) with their values from the previous snapshot run.
If any of these chosen columns have changed, dbt will update the record. If everything in these columns is still the same, dbt won’t do anything.
Example:
{% snapshot orders_snapshot_check %}

{{
    config(
      target_schema='snapshots',
      strategy='check',
      unique_key='order_id',
      check_cols=['status', 'is_cancelled'],
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}
Managing Hard Deletes
Apart from updates, hard deletes might also occur in the source data. By default, dbt ignores hard deletes, but if you wish to track them, set the invalidate_hard_deletes configuration to True. This allows dbt to invalidate rows that have been removed from the source. However, note that this configuration is only applicable when using the timestamp strategy.
Example:
{% snapshot orders_snapshot_hard_delete %}

{{
    config(
      target_schema='snapshots',
      strategy='timestamp',
      unique_key='order_id',
      updated_at='updated_at',
      invalidate_hard_deletes=True,
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}
Execute Your Snapshot
In the steps above, we created and configured a snapshot model; now it’s time to execute it. Use the dbt snapshot command to execute all snapshot models in your dbt project.
Note: The dbt build command includes the snapshot command.
To execute specific snapshots, provide the snapshot model names separated by spaces after -s on the dbt snapshot command, as shown below. This will execute only the given snapshots and create SCD Type-2 tables in your target data warehouse.
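For example (the snapshot names here are the ones defined in the earlier examples):

# Run every snapshot in the project
dbt snapshot

# Run only the listed snapshots
dbt snapshot -s orders_snapshot_timestamp orders_snapshot_check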
The initial snapshot table is generated when you execute the dbt snapshot command for the first time. This table contains the columns specified in the snapshot select query along with the additional meta fields described below.
Field | Meaning | Usage |
---|---|---|
dbt_valid_from | The timestamp when this snapshot row was first inserted | This column can be used to order a record’s different “versions”. |
dbt_valid_to | The timestamp when this row became invalidated | The most recent snapshot record will have dbt_valid_to set to null. |
dbt_scd_id | A unique key is generated for each snapshot record. | This is used internally by dbt |
dbt_updated_at | The updated_at timestamp of the source record when this snapshot row was inserted. | This is used internally by dbt |
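Because the most recent version of each record has dbt_valid_to set to null, a common pattern for reading only the current rows out of a snapshot is to filter on that field. A small sketch, assuming the snapshot from the earlier examples was built into a snapshots schema:

-- Return only the current version of each order from the snapshot table
select *
from snapshots.orders_snapshot_timestamp
where dbt_valid_to is null;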
Identifying Your Snapshot Table
When you run the dbt snapshot command, it generates a snapshot table in the specified target schema. If no alias configuration is provided, the table name defaults to the snapshot file name.
On the First Run
After executing dbt snapshot -s orders_snapshot for the first time, assume the orders table has only one order; the history for that order is maintained as shown in the image below. Notice that the DBT_VALID_TO field is set to null, i.e., the order is still in the shipped state as of the updated_at value.
On subsequent runs
Consider a scenario where the order status goes from shipped to completed. As a result, that particular record will appear as shown in the image below: the previous version is closed out by populating its DBT_VALID_TO, and a new record is created for the completed status.
Note: The above results were generated using the timestamp strategy, where the meta fields are populated based on the updated_at field. If the check strategy were used, the current timestamp would be used to populate the snapshot meta fields instead.
Snapshot Best Practices
We have learned how to create and execute snapshot models, and now it’s time to know the best practices.
Keep your snapshots simple: don’t add complex transformation logic, and avoid using multiple joins. If the business logic used in a snapshot changes in the future, it will be difficult to apply that change to your snapshots.
It’s good to create snapshots on source tables because it’s essential to understand how your source data is changing; use downstream models to clean up the data.
Snapshotting at the source simplifies the creation of SCD-1 or SCD-2 models. Snapshotting a final model, on the other hand, bakes any coding errors in that model into the snapshot history; such errors are easier to fix if you snapshot the source and build the model from the snapshot.
Include all available columns in the selection: use SELECT * even if a specific column seems irrelevant at the moment, since capturing it in the snapshot could prove beneficial in the future.
Schedule snapshots regularly: snapshots, executed through the dbt snapshot command, capture table changes and should be run on a schedule. Aim for snapshot runs between hourly and daily intervals; needing more frequent snapshots may signal a need to explore alternative methods for tracking source data changes.
Use different schemas for snapshots: this practice simplifies the differentiation between snapshot tables and other table types within your data warehouse.
Conclusion
dbt snapshot is an important feature that allows you to record changes within your data as it evolves. It simplifies the process of creating and maintaining SCD Type-2 tables.
Interested in leveraging dbt snapshots for your data management needs?
Reach out to phData today and discover how our expertise can streamline your data pipelines, making them more efficient and effective.
FAQs
After creating a snapshot table on a source model, if the datatype of a column changes, dbt will not change the type of that column in the existing snapshot table beyond expanding the size of varchar columns.
Yes, we can create a customized snapshot strategy. Kindly refer to the dbt documentation for more details.
If the snapshot job doesn't run daily as intended, the snapshots will become outdated. This might impact analytical queries or reporting that rely on up-to-date information. It can lead to incorrect insights or decisions based on stale data.
This can create a significant backlog that needs to be addressed when the snapshot job resumes. It may need to process a larger volume of data than usual. This could potentially strain the system, leading to longer processing times or resource constraints.