September 27, 2024

How Does Snowflake Ensure High Availability and Disaster Recovery for Data?

By Justin Delisi

Using cloud data services can be nerve-wracking for some companies. Yes, it’s cheaper, faster, and more efficient than keeping your data on-premises, but you’re at the provider’s mercy when it comes to your data’s availability. That’s why it’s essential to choose a cloud data provider with a track record of features that promote high data availability and allow you to recover from any disaster.

In this blog, we’ll explain why the Snowflake AI Data Cloud should be considered a best-in-class cloud data service for availability and disaster recovery, covering the features behind each along with some best practices.

Snowflake’s Architecture for High Availability

Data Availability

Snowflake has built-in fault tolerance across all editions, automatically replicating stored data across at least three availability zones of the cloud provider the account is hosted on. Availability zones are physically separated data centers with independent power and networking.

If one availability zone goes down, Snowflake simply points to another, seamlessly keeping your data available. Snowflake stores the data in the cloud provider’s blob storage, all of which tout at least 99.99% availability.

Query Resiliency

Snowflake uses virtual warehouses for compute execution in one availability zone. This is fantastic for performance, but what if the availability zone goes down or a compute instance fails? 

To ensure your queries aren’t lost in such an event, Snowflake automatically restarts the query in another availability zone, or spins up another compute instance if one fails, without the user having to resubmit anything. The queries may take longer than usual, but they will finish without any other impact on the user.

Cloud Services Availability

The third layer of Snowflake’s architecture is its cloud services layer, which must also be highly available. This layer is responsible for query management, optimization, transactions, security and governance, metadata, and sharing and collaboration. 

The metadata needed for all these features is likewise synchronized and replicated across multiple availability zones to keep its uptime above 99.9%. Even if one or more compute instances dedicated to the services layer fail, or two entire data centers go down, the cloud services layer can continue without disruption or loss of data.

Snowflake Disaster Recovery Features

99.9% is very good in terms of availability, but the remaining 0.1% can mean a huge financial loss for companies that need their data available all the time. Snowflake has created additional features to push your data’s availability closer to 100% by preparing for when disaster strikes.

Data Replication and Failover

Replication across availability zones provides a lot of protection for your data, but what if an entire region, or even the whole cloud provider, were to go down? Snowflake’s Data Replication (DR) feature, available to Enterprise Edition or higher accounts, covers exactly this kind of large-scale disaster.

DR allows you to replicate some or all of your data objects to a separate read-only Snowflake account set up with the same cloud provider in a different region or even with a completely different cloud provider. It does require some setup to choose which objects to replicate. 

Going one step further, with the Business Critical edition or higher, the replication can be sent to a read-write account, which can then be promoted so that the replicated account is your primary account and business can continue as usual.

A failover group must be created so that Snowflake knows which objects to replicate, which account is the primary, and which is the replica. For example, this failover group replicates the phData_db database to account2 in the phData organization, syncing every 10 minutes:

CREATE FAILOVER GROUP PHDATA_FAILOVER_GROUP
   OBJECT_TYPES = DATABASES
   ALLOWED_DATABASES = phData_db
   ALLOWED_ACCOUNTS = phData.account2
   REPLICATION_SCHEDULE = '10 MINUTE';
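For completeness, a matching secondary failover group also has to be created on the target account as a replica of the primary group. A minimal sketch, assuming the primary account is named account1 (a hypothetical name, since the original account isn’t named above):

```sql
-- Run on the target account (phData.account2):
-- create the secondary group as a replica of the primary group.
-- "account1" is a hypothetical name for the primary account.
CREATE FAILOVER GROUP PHDATA_FAILOVER_GROUP
  AS REPLICA OF phData.account1.PHDATA_FAILOVER_GROUP;

-- The secondary can also be refreshed manually, outside the schedule:
ALTER FAILOVER GROUP PHDATA_FAILOVER_GROUP REFRESH;
```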

From there, if disaster strikes the primary account, a single command run from the replica promotes it to be your primary account:

ALTER FAILOVER GROUP PHDATA_FAILOVER_GROUP PRIMARY;

Client Redirect

While having a failover account is a fantastic feature, it can be a lot of work to get everything that points at your Snowflake account up and running again. All your third-party clients will still point at the original account, meaning your ETL jobs, monitoring apps, and data visualization applications would have to be re-pointed to the replicated account, which could be hours of work.

Luckily, Snowflake also thought of that and created the Client Redirect feature, which is available to Business Critical edition accounts or higher.

The way it works is that there is only one URL for your account that never changes; the URL just points to whatever account is your primary at the time. 

For instance, a company named ACME may have https://acme.snowflakecomputing.com as the URL for its account, but this is just a mask for the actual URL, which is something more like https://h73512.awsuseast2.snowflakecomputing.com. With Client Redirect enabled, Snowflake automatically changes the mask URL to point at the replica account, and your third-party clients continue working as if nothing happened.
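Under the hood, Client Redirect is driven by a connection object that provides the stable URL. A sketch of the typical setup, assuming the hypothetical ACME accounts above (the account and organization names here are illustrative):

```sql
-- On the current primary account: create the connection,
-- which provides the stable URL clients connect to.
CREATE CONNECTION acme;

-- On the replica account: link a secondary connection to it.
-- "acme_org" and "account1" are hypothetical organization/account names.
CREATE CONNECTION acme AS REPLICA OF acme_org.account1.acme;

-- During a failover, promote the secondary connection so the
-- stable URL redirects clients to this account.
ALTER CONNECTION acme PRIMARY;
```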

Time Travel and Fail-safe

Not all disasters affecting your data’s availability come from outside your account. Things can also go wrong locally: a new employee accidentally drops a production database, or an unmonitored streaming pipeline hasn’t been loading data for a week. Incidents like these happen all the time and can take many hours and significant expense to put right. If they’re caught quickly, however, Snowflake has options to fix them easily.

Time Travel is a feature that lets you view data as it was at a certain point in time. Depending on your edition and settings, Snowflake keeps a historical record of your data for anywhere from 1 to 90 days, letting you retrieve it and perform actions such as undropping a table or restoring deleted records. These records incur storage costs, so the longer the retention period, the higher the cost.
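As an illustration, typical Time Travel operations against a hypothetical orders table look like this:

```sql
-- View the table as it was one hour (3600 seconds) ago
SELECT * FROM orders AT(OFFSET => -3600);

-- Restore a table that was accidentally dropped
UNDROP TABLE orders;

-- Extend the retention period (up to 90 days on Enterprise Edition or higher)
ALTER TABLE orders SET DATA_RETENTION_TIME_IN_DAYS = 90;
```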

Fail-safe is a similar data-recovery feature; however, it is not meant for accessing historical data after the Time Travel retention period has ended. It exists solely for Snowflake to recover data that may have been lost or damaged due to extreme operational failures. Snowflake calls it a “best effort” feature: they’ll do their best to retrieve your data, but it may be overwritten and unrecoverable. Recovery through Fail-safe is performed entirely by Snowflake and can take anywhere from hours to several days to complete.

DR Best Practices and Limitations

Best Practices for DR

  • Test your Failover setup and perform drills

    • Test your setup to confirm that all the objects you need correctly switch over to the replica account, so you’re prepared if disaster does strike.

    • Your account is constantly changing, so run regular “drills” as well: pretend something happened to the primary account, then check that your failover is configured correctly and that your team knows what to do.

  • Create a detailed recovery plan

    • Document everything when you perform your testing and drills, so that every team member knows what to do when something goes wrong
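As part of those drills, replication health can be checked from the target account. A sketch, using the hypothetical failover group from earlier:

```sql
-- Inspect recent refresh operations for the group
-- (run on the target account holding the secondary group)
SELECT *
FROM TABLE(INFORMATION_SCHEMA.REPLICATION_GROUP_REFRESH_HISTORY('PHDATA_FAILOVER_GROUP'));
```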

DR Limitations

  • Not all objects can be replicated to another account. Namely:

    • Temporary tables/stages

    • Hybrid tables

    • External tables/stages

    • Iceberg tables

    • Event tables

    • Databases created from shares

  • Database and share replication is available on all accounts, but replication of all other objects requires Business Critical Edition or higher

  • Data protected by Tri-Secret Secure can only be replicated to accounts on Business Critical Edition or higher

  • Refresh operations fail if the primary database includes a stream with an unsupported source object. The operation also fails if the source object for any stream has been dropped

  • Append-only streams are not supported on replicated source objects

Closing

With all its features to ensure high availability and disaster recovery, Snowflake stands out from the crowd as a top cloud data provider. Having your data in the cloud is useless if it’s unavailable, so it is paramount to pick a cloud provider that does it right. Furthermore, having robust disaster recovery and failover features for your data can save you time and money when something goes wrong.

If you have any questions or need further insights on how Snowflake ensures high availability and disaster recovery, please contact phData. Our team of experts can guide you through the best practices and strategies to fully leverage Snowflake’s capabilities. Whether you want to optimize your data architecture or ensure maximum uptime, phData is here to help.
