In this blog, we’re going to walk you through how to load data from an AWS RDS snapshot into the Snowflake Data Cloud. We will be:
- Preparing our environments
- Backing up our RDS data
- Exporting our RDS backup to an S3 bucket
- Creating a Snowflake stage
- Querying data & creating tables
Why is Loading RDS Data Into Snowflake Important?
Relational databases are at the heart of your most critical applications. More often than not, we find ourselves building architectures with many different types of databases, holding data that becomes far more meaningful when aggregated in one place for analytics.
Throughout this blog, we will walk you through how to load data from Amazon Web Services (AWS) Relational Database Service (RDS) into Snowflake using an RDS snapshot. The snapshot method is great for one-time loads to explore the data available and determine its overall value before setting up ongoing incremental ingestion mechanisms.
This solution uses AWS S3 to store the exported data and AWS IAM to manage access to that data between services.
While this post uses an RDS instance running the PostgreSQL engine, these steps should work with any database engine supported by Amazon RDS (MySQL, PostgreSQL, MariaDB, Oracle, or SQL Server).
Now let’s begin.
Step 1: Set Up Environment
- Set Up Snowflake
- Before you start, you will need a Snowflake virtual warehouse up and running, with a user that can connect and run queries against it. The Snowflake documentation can help you get started.
- Set Up RDS
- The AWS documentation has resources for setting up an RDS instance
- Will need the DBInstanceIdentifier (the instance name) for Step 3
- Set Up Sample Data
- Feel free to use your own data, but for convenience when following along with this post you can use the code snippet below (a sketch for loading it into RDS follows the snippet).
Sample Table Used in This Demo
CREATE TABLE public.airline (
    code VARCHAR,
    airline VARCHAR
);
INSERT INTO public.airline (code,airline) VALUES
('UA','United Air Lines Inc.'),
('AA','American Airlines Inc.'),
('US','US Airways Inc.'),
('F9','Frontier Airlines Inc.'),
('B6','JetBlue Airways'),
('OO','Skywest Airlines Inc.'),
('AS','Alaska Airlines Inc.'),
('NK','Spirit Air Lines'),
('WN','Southwest Airlines Co.'),
('DL','Delta Air Lines Inc.'),
('EV','Atlantic Southeast Airlines'),
('HA','Hawaiian Airlines Inc.'),
('MQ','American Eagle Airlines Inc.'),
('VX','Virgin America')
;
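If you want to load this sample into your RDS instance from a shell, a minimal sketch might look like the following. The instance identifier mirrors this post, while the sample_airlines.sql file name and the postgres user/database are assumptions; substitute your own:

#!/bin/bash
# Hypothetical values -- substitute your own instance, database, and user
db_name="phdata-rds-snowflake"
sql_file="sample_airlines.sql"   # the snippet above, saved to a file

# Look up the instance endpoint from its DBInstanceIdentifier
endpoint=$(aws rds describe-db-instances \
    --db-instance-identifier ${db_name} \
    --query 'DBInstances[0].Endpoint.Address' \
    --output text)

# Run the sample DDL/DML against the Postgres instance (prompts for a password)
psql -h ${endpoint} -U postgres -d postgres -f ${sql_file}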
- Set Up S3
- Will need the S3 URL for Step 4
- s3://bucket-rds-s3-phdata
- Create an IAM role for RDS-to-S3 export (a CLI sketch for these setup items follows this list)
- Create the rds_exporter role
- This is used to export an RDS backup to an S3 bucket
- You will need the iam_rds_exporter_arn for Step 3
- Create the “rds_snowflake_policy” policy
- This is used to allow Snowflake to read from the S3 bucket
- Create User with Programmatic Access
- Attach the rds_snowflake_policy policy
- aws_key_id will be used in Step 4
- aws_secret_key will be used in Step 4
- Create AWS KMS Key
- This is used for exporting the RDS snapshot to an S3 bucket
- kms_key_id will be used in Step 3
- Install the AWS CLI
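For reference, here is a minimal AWS CLI sketch of the setup items above. The bucket, role, and policy names mirror this post, the snowflake_user name is an assumption, and the S3 write permissions that must also be attached to the rds_exporter role are elided:

#!/bin/bash
# Create the S3 bucket that will receive the export
aws s3 mb s3://bucket-rds-s3-phdata

# Trust policy allowing the RDS export service to assume the role
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "export.rds.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

# Create the rds_exporter role; the "Arn" in the output is the
# iam_rds_exporter_arn needed for Step 3
aws iam create-role \
    --role-name rds_exporter \
    --assume-role-policy-document file://trust-policy.json

# Create a KMS key for encrypting the export; the "KeyId" in the
# output is the kms_key_id needed for Step 3
aws kms create-key --description "rds-snapshot-export-key"

# After creating the user and attaching rds_snowflake_policy, generate
# access keys; the output contains the aws_key_id and aws_secret_key
# used in Step 4 (the user name here is an assumption)
aws iam create-access-key --user-name snowflake_user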
Step 2: Take Snapshot of RDS
- Will need the “DBSnapshotArn” from the command output for Step 3 (see the status check after the script below)
#!/bin/bash
timestamp=$(date +%Y-%m-%d-%H-%M-%S)
db_name="phdata-rds-snowflake"
db_backup="${db_name}-backup-${timestamp}"

# Create a manual snapshot of the RDS instance
aws rds create-db-snapshot \
    --db-instance-identifier ${db_name} \
    --db-snapshot-identifier ${db_backup}
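Snapshot creation is asynchronous, so the command returns before the snapshot is ready. One way to wait for it and pull the DBSnapshotArn needed for Step 3 (reusing the db_backup variable from the script above):

# Block until the snapshot is available, then print its ARN
aws rds wait db-snapshot-available \
    --db-snapshot-identifier ${db_backup}
aws rds describe-db-snapshots \
    --db-snapshot-identifier ${db_backup} \
    --query 'DBSnapshots[0].DBSnapshotArn' \
    --output text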
Step 3: Export Snapshot to S3
#!/bin/bash
export_task_id="my-export"
db_snapshot_arn="********"       # DBSnapshotArn from Step 2
target_bucket_name="********"    # S3 bucket name from Step 1
iam_rds_exporter_arn="********"  # ARN of the rds_exporter role
kms_key_id="********"            # KMS key ID from Step 1

# Export the snapshot to S3 as Parquet
aws rds start-export-task \
    --export-task-identifier ${export_task_id} \
    --source-arn ${db_snapshot_arn} \
    --s3-bucket-name ${target_bucket_name} \
    --iam-role-arn ${iam_rds_exporter_arn} \
    --kms-key-id ${kms_key_id}
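Exports can take a while on larger databases. You can poll the task until its status reads COMPLETE, for example:

# Check on the export task created above
aws rds describe-export-tasks \
    --export-task-identifier ${export_task_id} \
    --query 'ExportTasks[0].Status' \
    --output text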
Step 4: Create a STAGE for Querying the AWS S3 Bucket
CREATE OR REPLACE FILE FORMAT parquet_format TYPE = 'parquet';

CREATE OR REPLACE STAGE rds_s3_stage
  FILE_FORMAT = parquet_format
  CREDENTIALS = (
    aws_key_id='********'
    aws_secret_key='********'
  )
  URL = 's3://bucket-rds-s3-phdata/my-export/phDataRDS';
LIST @rds_s3_stage;
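If you prefer the command line over the Snowflake UI, a minimal SnowSQL invocation to verify the stage might look like this (account, user, and warehouse are placeholders, and SnowSQL must be installed and configured):

# Verify the stage contents from a shell via SnowSQL
snowsql -a <account_identifier> -u <user> -w <warehouse> \
    -q "LIST @rds_s3_stage;"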
Step 5: Confirm Access to Data in S3 Bucket by Querying Parquet
SELECT * FROM
@rds_s3_stage/public.airline/part-00000-0f4cd80c-8d8d-46e5-a617-ce11368ebf33-c000.gz.parquet;
Step 6: Create Table and Insert S3 Data
CREATE OR REPLACE TABLE staging.airlines AS
SELECT
    $1:airline::varchar AS "airline",
    $1:code::varchar AS "code"
FROM
    @rds_s3_stage/public.airline/part-00000-0f4cd80c-8d8d-46e5-a617-ce11368ebf33-c000.gz.parquet;
Confirm the table was created and the data inserted:
SELECT * FROM staging.airlines;
Conclusion
In this post, we’ve demonstrated how to load data from an AWS RDS snapshot into Snowflake. This gives you the flexibility to further develop your analytics infrastructure by quickly loading data from your Amazon RDS environment into an advanced analytics platform for initial exploration and for determining its overall value.
This is a one-time load of the RDS database and does not set up ongoing ingestion. For ongoing CDC (change data capture) ingestion from RDS to Snowflake, check out solutions from our partner Fivetran/HVR.
If your team is interested in learning more about how to extract more value from your Snowflake account, please reach out! We’d love to help kick off your project!
If you’re looking for more helpful Snowflake resources, be sure to check out our free guide to getting started with Snowflake.