Migrating from SQL Server to Snowflake

A Step-by-Step Guide for a Successful Migration

This document is intended to serve as a plan for migrating on-premises and/or cloud IaaS SQL Server databases and tables to the Snowflake Data Cloud.

SQL Server is a relational database management system (RDBMS) that often comes with high licensing fees to host an enterprise’s most important data. SQL Server also requires a significant amount of work to set up and maintain, both for the hardware (network, storage, OS patching, configuration) and the software (configuration, updates). If there is a need to use the data on these systems for analytical purposes, there are often additional licensing fees for the OLAP counterparts, or the added risk of analytical workloads interfering with the OLTP systems that serve end users. All of this quickly adds up to a costly product that is taxing to both the budget and the IT workforce.

There are reasons beyond licensing costs for wanting to migrate, some of which are called out above. It should be noted that the migration plan laid out here will work for far more systems than just SQL Server, but this document focuses on SQL Server migrations.

Why Snowflake Over SQL Server?

The Snowflake Data Cloud was designed for the cloud from the start, and allows its users to interface with the software without having to worry about the infrastructure it runs on or how to install it. Between the reduction in operational complexity, the pay-for-what-you-use pricing model, and the ability to isolate compute workloads, there are numerous ways to reduce the costs associated with performing analytical tasks. Some other benefits and capabilities include:

  • Data sharing: Easily share data securely within your organization or externally with your customers.
  • Zero copy cloning: Create multiple ‘copies’ of tables, schemas, or databases without actually copying the data. This saves the time needed to copy and reduces data storage costs (see the sketch after this list).
  • Separate compute and storage: Scale your compute and storage independent of one another, and isolate compute power for jobs that need their own dedicated warehouse.
  • No hardware provisioning: No hardware to provision, just a t-shirt sized warehouse available as needed within seconds.
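
As a quick illustration of two of these capabilities, the following Snowflake SQL sketch (with hypothetical object names) creates a zero-copy clone and a t-shirt sized warehouse:

-- Hypothetical object names, for illustration only.
-- Zero-copy clone: a metadata-only 'copy' of an existing table.
CREATE TABLE orders_dev CLONE orders;

-- A t-shirt sized warehouse, available within seconds and suspended when idle.
CREATE WAREHOUSE analytics_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;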

phData, Your Premier Partner

phData is a Premier Service Partner and Snowflake’s Emerging Partner of the Year in 2020. If you’re looking to migrate, phData has the people, experience, and best practices to get it done right. We’ve completed nearly 1,000 data engineering and machine learning projects for our customers. So whether you are looking for architecture, strategy, tooling, automation recommendations, or execution, we’re here to help!

Migration Goals and Requirements

The primary goals of this migration are to reduce the costs and operational burden of running licensed OLAP database systems on-premises, in favor of a cloud native, pay-for-what-you-use SaaS counterpart. It seems only natural that the migration process itself would have similar goals. This will be a foundational concept applied throughout the migration plan.

Further, it would be difficult to build a solution without some sort of requirements. There is a lot of value in knowing requirements up front, as they will help drive the design of a system and can often be used to create SLAs against which the solution can be evaluated. This solution must:

  • Migrate databases, schemas, and tables from the listed SQL Server sources
  • Migrate to an equivalent database / schema / table structure in Snowflake
  • Migrate both retrospective (bulk load) and prospective (ongoing change) data
  • Deliver prospective data to Snowflake with latency under five minutes

Migration Plan

The data flow for migrating any source system to Snowflake can be distilled into the following three box architecture:

Snowflake Migration three-box Architecture

This picture is pretty straightforward, but it immediately introduces something that is not obvious: a small pit stop for the source data in an ‘external stage’. This is because Snowflake has no native tools for pulling data directly from a database. Instead, it uses the concept of a Stage, which can be either internal or external, as a point to load data from and unload data to.

Snowflake Stages are effectively pointers to public cloud object storage locations, plus metadata about the files in those locations. Object storage is ideal for staging data because it scales nearly infinitely, and large enterprises often build their data lakes on this same storage, making it convenient both to get data into Snowflake and to contribute data from Snowflake back to the lake.

The two arrows represent distinct processing steps of the migration that will be outlined in the coming sections. Per the migration goals, each of these steps should attempt to minimize costs by utilizing cloud infrastructure (preferring SaaS solutions to PaaS, PaaS to IaaS, and IaaS to on-premises) and allowing cost to scale with the migration.

Migration - Source to Stage

Snowflake has many technology partners that assist with moving data from a source database to an object storage location that can be used as an external stage. These include Fivetran, Attunity, Informatica, HVR, and StreamSets, all of which phData has experience with, but this migration will instead focus on a cloud native solution from AWS: the Database Migration Service (DMS).

AWS DMS supports highly configurable data migrations and replications between a variety of sources and targets. SQL Server is included as an available source system, making it the perfect tool to handle data collection from our source SQL Server database. Targets include Apache Kafka and multiple AWS services such as S3, DynamoDB, and Kinesis, but for the purposes of this guide we will focus on S3 for Snowflake consumption.

AWS DMS Replication Instance

AWS DMS appears to meet the migration goals defined above. It is a SaaS offering from AWS that is set up primarily through configuration. Costs scale with the number and size of the replication instances needed for a migration. Its replication tasks have settings for bulk load migration to support retrospective data, as well as change data capture (CDC) for prospective data. It also supports S3 as a write target, which will be used as the stage for Snowflake. The replication tasks achieve near-real-time latency for CDC by using SQL Server’s native API to read transaction logs as they are written.

Building this step of the migration is primarily a matter of configuring AWS DMS to produce the desired results, but that isn’t the only work to be done. Ensuring that the DMS infrastructure is authorized to access both the SQL Server database and the S3 target is important, as is setting up the source database to produce the data needed for the migration.

DMS Infrastructure

Authentication and Authorization

The replication task must authenticate against both the SQL Server database and the S3 bucket to perform the first step of the migration. Authentication into both systems is specified via the respective endpoints. For SQL Server, the database username and password provided would be for a user specifically created for use by AWS DMS. For the target, the IAM Role is the identity, and must have an attached trust policy that specifies AWS DMS as a valid principal.

Authorization for the SQL Server database is specified via a series of GRANT statements that allow access to the database transaction logs. For the target, the IAM role must include a policy that allows it to write to the target S3 bucket, and the target S3 bucket’s policy must allow the IAM role to write to it.

DMS Replication Instance

See the appendix for further details on authentication and authorization to these systems.

Source Database Preparation

Setting up a source database to be properly utilized by AWS DMS requires a bit more configuration than just providing it a user. In general, the following needs to be set up:

  • For a one-time batch migration, it is typically sufficient to allow the DMS database user to query the tables that need to be migrated.
  • Including ongoing replication requires database-specific actions to enable change data capture and to give the DMS database user access to that data.

For SQL Server, the ongoing replication would be handled by reading transaction logs, persisting them for a reasonable amount of time, and allowing the DMS database user to access them. How this is achieved differs from vendor to vendor, and from self-managed to hosted setups.
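
As a rough illustration, a minimal T-SQL sketch for a self-managed SQL Server using native change data capture might look like the following. The database, table, and user names are hypothetical, and the exact requirements depend on the SQL Server edition and hosting model, so follow the AWS DMS SQL Server source documentation for your setup:

-- Enable change data capture on the source database (requires elevated privileges).
USE SourceDb;
EXEC sys.sp_cdc_enable_db;

-- Enable CDC for each table that will be replicated (hypothetical table).
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',
    @role_name     = NULL;

-- Keep captured changes around long enough for DMS to read them (retention is in minutes).
EXEC sys.sp_cdc_change_job @job_type = 'cleanup', @retention = 4320;

-- Allow the DMS database user to read the tables being migrated.
GRANT SELECT ON SCHEMA::dbo TO dms_user;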

See the appendix for details on setting up a SQL Server database to work with AWS DMS.

Replication Output

Replication tasks will output to S3 with a default data format of comma-separated values (.csv), but the format is configurable, with Parquet being recommended. This can be updated via the extra connection attributes provided during Target Endpoint creation.
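
For example, the target endpoint’s S3 settings might look something like the following, shown here as the settings JSON accepted when creating the endpoint (the same options can also be expressed as extra connection attributes; the placeholders must be replaced):

{
  "ServiceAccessRoleArn": "<role_arn>",
  "BucketName": "<bucket_name>",
  "BucketFolder": "<bucket_folder>",
  "DataFormat": "parquet",
  "ParquetVersion": "parquet-2-0",
  "TimestampColumnName": "timestamp",
  "IncludeOpForFullLoad": true
}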

When CDC is involved, it is important to note a couple of points about the data that is output.

  1. The schema for a given table will mirror that table’s schema, but will also include an additional ‘timestamp’ field and ‘op’ field reflecting the point in time a change happened and the operation the change represents, either I[nsert], U[pdate], or D[elete].
  2. The data is an append-only log of ongoing changes written after the bulk load of data. This matters because the data will need to be reassembled in Snowflake for the tables to reflect the data in the source tables.

Migration - Stage to Snowflake

Once data is in the external stage, the remainder of the migration can be achieved using Snowflake resources and capabilities.

Even though Snowflake is the primary requirement for this solution, it is still worth evaluating it against the migration goals. Snowflake is a SaaS solution in which data warehouse systems are built using SQL commands. With Snowflake, costs accrue for storage and compute on a per-second basis. Snowflake’s pipe and task objects support building low latency data pipelines. Finally, the data landing in S3 can be treated the same way throughout the Snowflake pipeline, whether retrospective or prospective.

Building this step of the migration involves configuring a couple of components to enable authorized access to the data in S3 and to ensure timely delivery to the Snowflake pipe. The remainder of this step is built by executing DDL to create the tables, pipe, stream, and task.

Snowflake and AWS Infrastructure

Authentication and Authorization

Snowflake has the concept of a Storage Integration that it uses to provide credential-less access to AWS S3 external stages. A storage integration is set up using an IAM Role that has access to the S3 location. This role is then given a trust policy that allows a Snowflake-managed IAM role to assume it, effectively delegating its abilities to the Snowflake role.

The process of properly creating a storage integration involves a few back-and-forth steps, which essentially establish a long-term authentication to Snowflake via the trust policy. The policies attached to the IAM role and the S3 bucket define the authorization.
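
A minimal sketch of the Snowflake side, assuming hypothetical object names and placeholder ARNs and bucket paths, might look like this:

-- Storage integration delegating access to the external stage location.
CREATE STORAGE INTEGRATION s3_migration_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<account_id>:role/<role_name>'
  STORAGE_ALLOWED_LOCATIONS = ('s3://<bucket_name>/<bucket_folder>/');

-- Returns the Snowflake IAM user and external ID to place in the role's trust policy.
DESC STORAGE INTEGRATION s3_migration_int;

-- External stage pointing at the AWS DMS output location.
CREATE STAGE dms_stage
  URL = 's3://<bucket_name>/<bucket_folder>/'
  STORAGE_INTEGRATION = s3_migration_int;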

Migration Pipeline

This step of the migration involves two Snowflake tables. The first is a representation of the data that mirrors what is in S3: an append-only table containing the operations performed against the original source database, the ‘change table’. The second reflects the (mostly) current state of the original source database: the ‘reporting table’, which is the final target for the data in the migration.

The pipe, often referred to as Snowpipe in Snowflake marketing materials, is used to keep the change table up to date with the latest data from S3. A pipe is effectively a COPY INTO statement that listens for new data in a stage. As new CDC data lands in S3 from the first step of the migration, S3 object notifications signal the pipe. Upon receiving a signal, the pipe queues its COPY INTO command for a Snowflake-managed warehouse to copy the data into the change table. Charges for this compute are billed per second, with a small overhead charged per 1,000 files.
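
A minimal sketch of a pipe for a single hypothetical table (a ‘customer’ table with id, name, and updated_at columns) could look like the following; the change table’s columns mirror the source table plus the ‘op’ and ‘timestamp’ fields from AWS DMS:

-- Change table mirroring the source columns plus the DMS 'op' and 'timestamp' fields.
CREATE TABLE customer_changes (
  op         STRING,
  ts         TIMESTAMP_NTZ,
  id         NUMBER,
  name       STRING,
  updated_at TIMESTAMP_NTZ
);

-- Pipe that copies new Parquet files from the stage into the change table.
CREATE PIPE customer_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO customer_changes
  FROM (
    SELECT $1:op::STRING, $1:"timestamp"::TIMESTAMP_NTZ, $1:id::NUMBER,
           $1:name::STRING, $1:updated_at::TIMESTAMP_NTZ
    FROM @dms_stage/dbo/customer/
  )
  FILE_FORMAT = (TYPE = PARQUET);

With AUTO_INGEST enabled, the S3 bucket’s event notifications must be pointed at the pipe’s SQS queue, which is reported in the pipe’s notification_channel column (visible via SHOW PIPES).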

Snowpipe Detailed

Table Streams keep track of DML changes to a table, allowing action to be taken against the delta. Streams can be queried just like a table, and a stream’s contents are cleared once they are consumed by a DML statement (such as the task’s MERGE described below). The stream on the change table keeps track of the changes to the change table, which for this migration will only be insert operations, since the data coming from S3 is either the bulk load data or the CDC data from the source database.

Tasks are scheduled executions of a specified SQL statement. Creating a task requires providing a warehouse with which the query will execute. This step of the migration uses a task that starts up periodically, checks whether the stream has any new changes, and executes a MERGE INTO statement against the reporting table.
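
Continuing the hypothetical customer example, a minimal sketch of the stream, reporting table, and task might look like this (the warehouse and object names are assumptions):

-- Append-only stream tracking new rows landing in the change table.
CREATE STREAM customer_changes_stream ON TABLE customer_changes APPEND_ONLY = TRUE;

-- Reporting table reflecting the (mostly) current state of the source table.
CREATE TABLE customer (
  id         NUMBER,
  name       STRING,
  updated_at TIMESTAMP_NTZ
);

-- Task that periodically merges pending changes into the reporting table.
CREATE TASK customer_merge_task
  WAREHOUSE = migration_wh
  SCHEDULE = '1 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('customer_changes_stream')
AS
  MERGE INTO customer AS tgt
  USING (
    -- Keep only the most recent change per key.
    SELECT *
    FROM customer_changes_stream
    QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) = 1
  ) AS src
  ON tgt.id = src.id
  WHEN MATCHED AND src.op = 'D' THEN DELETE
  WHEN MATCHED THEN UPDATE SET tgt.name = src.name, tgt.updated_at = src.updated_at
  WHEN NOT MATCHED AND src.op <> 'D' THEN
    INSERT (id, name, updated_at) VALUES (src.id, src.name, src.updated_at);

-- Tasks are created suspended and must be resumed.
ALTER TASK customer_merge_task RESUME;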

Stream/Task Detailed

Together, these Snowflake resources take the CDC data as it lands in S3 and reassemble it into a representation that mimics the source, with some degree of latency.

Scaling the Migration

The migration process put forth so far has been focused on moving a single table from a SQL Server database to Snowflake. While that is the fundamental unit of a migration, it is worth evaluating how to apply it at scale.

Scaling a bulk-only migration is going to be different from scaling a migration that includes ongoing replication. The former can be scaled in multiple ways, with one logical approach being table-at-a-time. Only the latter scenario will be covered here, as it is a bit more complex.

This section will cover scaling the migration of multiple databases each with multiple tables. A given migration might only require a single database, which should still be able to utilize the advice in this section.

Scaling AWS DMS

The basic unit of migration in DMS is a Replication Task, and a task is where the configuration for a replication is defined. The configuration that defines what tables and views to replicate is the table mapping. A task’s table mapping contains a list of rules to apply of which there are multiple types, including selection rules. Selection rules specify which tables and views in a database schema to replicate. The table and view names can use wildcards to make selection simpler.

Replicating multiple databases will require multiple replication tasks, at least one per database, because a replication task only connects to a single source database. It is acceptable to have multiple tasks map to a single database, which might be necessary if different schemas or tables require different configurations. Within a database, multiple schemas can be replicated in a single task, using a selection rule for each schema in the table mapping. Within a schema, multiple tables can be replicated using a single selection rule.
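
For example, a task’s table mapping that selects every table in a hypothetical ‘dbo’ schema could look like this:

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-dbo-tables",
      "object-locator": {
        "schema-name": "dbo",
        "table-name": "%"
      },
      "rule-action": "include"
    }
  ]
}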

Replication tasks need compute power to run on, and that comes in the form of replication instances. Scaling replication instances is done in one of two ways: either add more instances or use larger instance sizes. The latter approach has limits, as there is a maximum instance size that AWS offers.

Scaling AWS DMS components

It is difficult to give guidelines on choosing the appropriate size and number of instances for a migration, due to the variability that replication tasks introduce based on their configuration. The same goes for mapping tasks to instances when multiple instances are necessary. Generally, it is best to over-provision replication instances, monitor their resource consumption, and adjust as necessary, especially in a production environment. This is one area where increased scale does come with some additional operational overhead in the form of managing this mapping, but it is still significantly less than the additional burden of managing the hardware and software.

Scaling Snowflake

Scaling the Snowflake components is significantly more straightforward, as the pipeline pieces are stamped out once for each table that is being migrated. The authorization and stage reference components are largely going to be reusable across the various pipelines.

The biggest headache of scaling the Snowflake step will probably be naming the pipeline components and organizing the tables logically. Templating the creation of the DDL statements is an ideal approach to applying standard naming conventions to Snowflake objects. All Snowflake tables, pipes, tasks, and streams belong to a Snowflake schema, and all Snowflake schemas belong to a Snowflake database, so it makes sense to mirror the database/schema layout of the source databases when choosing where to place the Snowflake components.
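
One possible convention, purely illustrative, is to mirror the source database and schema names and suffix each stamped-out object by its role:

-- Mirror the source database and schema names (hypothetical names).
CREATE DATABASE sales;
CREATE SCHEMA sales.dbo;

-- Per-table objects stamped out from a DDL template:
--   sales.dbo.customer_changes   (change table)
--   sales.dbo.customer           (reporting table)
--   sales.dbo.customer_pipe      (pipe)
--   sales.dbo.customer_stream    (stream)
--   sales.dbo.customer_task      (task)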

Evaluating the Migration

With a plan laid out, it would be ideal to take one final look at the requirements to ensure all of them were fulfilled for the migration.

Migrate Databases / Schemas / Tables From Listed Sources

AWS DMS supports specifying multiple databases by using multiple replication tasks. Each task can replicate one or more schemas and tables via the task’s table mapping configuration. AWS DMS also supports SQL Server as a source database, either on-premises or on Amazon RDS.

Migrate to Equivalent Database / Schema / Table in Snowflake

AWS DMS allows for specifying a bucketFolder configuration parameter for a given replication task that uses an S3 target. When migrating data, AWS DMS will write data to bucketFolder/schemaName/tableName in the specified S3 bucket. Snowflake can then use this folder structure in S3 to recreate the database, schema, and table layout as it was in SQL Server.

Migrate Retrospective and Prospective Data

AWS DMS supports three different migration types: full-load, cdc, and full-load-and-cdc. Migrations of the full-load type will land data in S3 using the same schema as the source database, with modifications only if they are specified in the task settings. Migrations of the cdc type will land data in S3 with the same schema, but will include two additional columns. The ‘op’ column will contain the DML operation, either ‘I’ for insert, ‘U’ for update, or ‘D’ for delete, and the ‘timestamp’ column will contain the timestamp at which the operation occurred.

The full-load data is a subset of the cdc data, and the Snowflake components handle this with little effort. Pipes move the data as-is from S3 into the change tables, and the tasks’ MERGE INTO statements support the various values of the ‘op’ column with WHEN MATCHED clauses.

Latency of Prospective Data is Under Five Minutes

The AWS DMS replication tasks have configurable change processing settings. Specifically, there are values for BatchApplyTimeoutMin and BatchApplyTimeoutMax, which specify acceptable timeout boundaries (in seconds) for creating and sending micro-batches to the target. By default, these values are 1 second and 30 seconds respectively.
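
As an illustration, an excerpt of a replication task’s settings JSON touching the relevant keys (batch apply enabled, with the timeout boundaries discussed above) might look like this:

{
  "TargetMetadata": {
    "BatchApplyEnabled": true
  },
  "ChangeProcessingTuning": {
    "BatchApplyTimeoutMin": 1,
    "BatchApplyTimeoutMax": 30
  }
}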

Once data lands in S3, an event notification is delivered to one of Snowflake’s internal SQS queues, and shortly thereafter triggers the pipe to run. This process is wholly event-based and should occur in rapid succession, placing the data into the change table in Snowflake. The Snowflake task can be configured to run as frequently as every minute to merge the data from the change table into the reporting table.

Summing these discrete timings, the total time to migrate a change could be as little as just over one minute. This is far below the five-minute requirement, which leaves flexibility to optimize file sizes from AWS DMS (using a larger max timeout) or to minimize warehouse credit usage (using a longer task interval).

Next Steps

We have covered the objectives that we set out to solve by walking through what a cloud native migration looks like and how it helps drive down overall cost, specifically from licensing and hardware. We have also touched on the potential scale and velocity that ingesting data from these systems can reach. On top of this, the problems your source systems solve are unique to your business and require attention to detail to get right.

But this is only the first step. To ensure your migration from SQL Server to Snowflake is a success, there is a general set of tasks that should be undertaken before a migration can be considered complete; these steps are described in the following image:

At the end of this process you will have an efficient set of ingest pipelines hydrating your Snowflake environment, taking advantage of cloud native benefits along the way. You will also find that the limitations and expenses of the legacy SQL Server database are a thing of the past, thanks to Snowflake’s ability to separate and scale compute resources as your business or end users require.

Appendix

Set Up S3 as a Target for AWS DMS

The following guidance is around setting up IAM Roles and Policies to allow AWS DMS to use S3 as a target for a migration or replication. Refer to the AWS DMS S3 target documentation for more details and options for configuring an S3 target.

Prerequisites

  • The S3 bucket that is the target must be in the same region as the AWS DMS Replication Instance.

Create IAM Role for DMS service

A role must be specified for AWS DMS to assume when interfacing with S3. The role consists of a Trust Policy, which defines which principals can assume the role and how, and a set of Permissions Policies, which define what actions the role can perform.

Trust Policy

{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "dms.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Permission Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:PutObjectTagging",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket_name>/*",
                "arn:aws:s3:::<bucket_name>"
            ]
        }
    ]
}

Make sure to replace the bucket_name placeholders in the Permission Policy.

Create Bucket Policy

If the S3 Bucket is in a different AWS account than that where the AWS DMS Replication Instance is running, then it is likely that a Bucket Policy will also need to be applied to the S3 Bucket to allow AWS DMS to use it as a target.

Bucket Policy

{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Sid": "DmsInfraTargetBucketPolicy",
            "Effect": "Allow",
            "Principal": {
                "AWS": "<role_arn>"
            },
            "Action": [
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:PutObjectTagging",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket_name>/*",
                "arn:aws:s3:::<bucket_name>"
            ]
        }
    ]
}

Make sure to replace the bucket_name and role_arn placeholders in the Bucket Policy. The role_arn is the Amazon Resource Name for the IAM Role that the AWS DMS Replication Instance is assuming. 
