This article was co-written by Arnab Mondal & Ayush Kumar Singh.
HVR, a Fivetran product, is one of the best tools for data movement. Still, any tool is only as good as the way you use it.
In this blog, we will dive into how to get the most out of your HVR setup and improve your whole ecosystem. When we think about HVR best practices for replicating data to the Snowflake Data Cloud, we can look at them from two perspectives:
Keeping the latency low.
Keeping the cost low.
We will discuss each of these in turn.
1. Keeping The Latency Low
We start by explaining how replication works and exploring the delays that can occur regardless of the system you are using.
Data replication is the process of copying data from one database to another. It is used to enhance data accessibility, efficiency, and disaster readiness.
There are many methods for implementing replication, and log-based change data capture (CDC) is the predominant approach. CDC captures modifications to data in one database (the source) and then transfers them to other databases (the targets).
When employing CDC, the source database generally handles many transactions concurrently, and every transaction gets a distinct commit sequence number. CDC technology captures committed transactions and can reproduce them on the target system in the exact sequence in which they happened on the source.
One of the most common methods uses a single integration process on the target, which executes the transactions in commit order. This helps ensure that the target database stays consistent with the source system.
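As a rough illustration of this idea (a minimal sketch with hypothetical helper functions, not HVR's actual implementation), a single-process apply loop that replays changes in commit order could look like this:

```python
# Minimal sketch of single-process, commit-order apply. This is illustrative
# only; fetch_captured_changes() and apply_transaction() are hypothetical
# placeholders for a CDC reader and a target-database writer.

def replicate_in_commit_order(fetch_captured_changes, apply_transaction):
    # Sort captured transactions by their commit sequence number so the target
    # sees them in exactly the order they committed on the source.
    for txn in sorted(fetch_captured_changes(), key=lambda t: t["commit_seq"]):
        # Apply (and commit) the whole transaction before moving to the next one,
        # keeping the target transactionally consistent with the source.
        apply_transaction(txn)
```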
Still, there are situations where the target system cannot keep up with the demand. This can happen when the source database is under heavy load or when the target database lacks sufficient capacity.
In such cases, the replication process may slow down or stall, because it prioritizes keeping the source and target databases consistent over raw speed.
There are several ways to avoid replication delays. One is to implement a multi-threaded integration process on the target, which applies transactions at a higher throughput.
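For illustration only (a simplified sketch, not how HVR's parallel integrate is implemented), one way to parallelize the apply step without breaking per-table ordering is to route each table's changes to a single worker:

```python
# Simplified sketch of multi-threaded apply that preserves ordering per table.
# apply_table_changes(table, changes) is a hypothetical target-writer callback.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def parallel_apply(changes, apply_table_changes, workers=4):
    # Group changes by table; within a table they stay in commit order,
    # while ordering across tables is relaxed to gain throughput.
    by_table = defaultdict(list)
    for change in changes:  # `changes` is assumed to be in commit order already
        by_table[change["table"]].append(change)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(apply_table_changes, table, table_changes)
                   for table, table_changes in by_table.items()]
        for future in futures:
            future.result()  # re-raise any apply errors
```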
Using a load balancer on the source database can also minimize replication delays by distributing the load more evenly and preventing database overload.
Ultimately, selecting a robust target database capable of managing the replication workload is crucial. Snowflake is an ideal option here, as it provides several benefits.
By following these steps, you can make sure your replication process is robust and efficient.
While parallelizing data integration can increase the speed of your replication process, it is important to be aware of the associated risks. Parallelizing data integration essentially splits transactions across different processes, which can compromise data integrity. Therefore, it is essential to ensure that your application can handle parallel data integration.
Before you decide to parallelize your data integration, consider the following questions:
Why is my Replication Slow?
There can be multiple reasons for this.
Are The Statistics up to Date?
The cost-based optimizer needs up-to-date statistics to make informed decisions about how to execute queries.
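As a hedged example (assuming a PostgreSQL database and placeholder connection details; other databases use different mechanisms, such as DBMS_STATS on Oracle), you could check when planner statistics were last gathered and refresh them:

```python
# Hedged example for a PostgreSQL database: list when optimizer statistics were
# last gathered and refresh them for a table whose statistics look stale.
# The DSN and table name are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=source_db user=replication")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT relname, last_analyze, last_autoanalyze
        FROM pg_stat_user_tables
        ORDER BY COALESCE(last_analyze, last_autoanalyze) NULLS FIRST
    """)
    for table, last_analyze, last_autoanalyze in cur.fetchall():
        print(table, last_analyze, last_autoanalyze)

    # Refresh statistics for an illustrative table with stale statistics.
    cur.execute("ANALYZE orders;")
```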
Do The Indexes on The Primary Key Columns Exist, And Are They Fragmented?
Single-row updates and deletes typically benefit from an index on the primary key columns. If the indexes are missing or fragmented, this can slow down the replication process.
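For instance (a hedged sketch assuming a SQL Server database reached over pyodbc; the DSN, schema/table name, and 30% threshold are all illustrative), you could check whether the primary-key index is fragmented and rebuild it:

```python
# Hedged example for a SQL Server database: report fragmentation of primary-key
# indexes on one table and rebuild any that are heavily fragmented.
# The DSN, schema/table name, and 30% threshold are illustrative.
import pyodbc

conn = pyodbc.connect("DSN=replica_target")  # placeholder DSN
cur = conn.cursor()
cur.execute("""
    SELECT i.name, s.avg_fragmentation_in_percent
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.orders'),
                                         NULL, NULL, 'LIMITED') AS s
    JOIN sys.indexes AS i
      ON i.object_id = s.object_id AND i.index_id = s.index_id
    WHERE i.is_primary_key = 1
""")
for index_name, fragmentation in cur.fetchall():
    if fragmentation and fragmentation > 30:
        cur.execute(f"ALTER INDEX [{index_name}] ON dbo.orders REBUILD")
conn.commit()
```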
Does The Target Database Handle Large Transactions Well?
MPP databases are designed for OLAP workloads, not OLTP workloads. If you are replicating OLTP data to an MPP database, you may need to use a different replication approach, such as micro-batches.
Micro-batches are a way of replicating data in small batches. Applying changes in small, set-based batches can be more efficient than copying data in one large batch, especially for OLTP workloads.
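A rough sketch of the idea (placeholder helper functions, not HVR's implementation): buffer captured changes until a size or time limit is reached, then apply the whole batch with one set-based statement:

```python
# Illustrative micro-batch loop (placeholder helpers, not HVR's implementation):
# buffer captured changes until a row-count or time limit is reached, then hand
# the batch to a set-based apply step (for example, one MERGE on the target).
import time

def micro_batch_apply(stream_changes, apply_batch, max_rows=10_000, max_seconds=60):
    batch, started = [], time.monotonic()
    for change in stream_changes():  # hypothetical generator yielding CDC changes
        batch.append(change)
        if len(batch) >= max_rows or time.monotonic() - started >= max_seconds:
            apply_batch(batch)       # one set-based statement instead of many rows
            batch, started = [], time.monotonic()
    if batch:
        apply_batch(batch)           # flush the final partial batch
```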
Can I Speed up The Replication Without Parallelizing?
Analyze your workload, your source, and your target, and implement changes based on that analysis. Some possible changes include (illustrative DDL follows the list):
Adding or rebuilding indexes: Indexes help to speed up queries, so you may need to add or rebuild indexes if they are missing or fragmented.
Implementing partitioning: Partitioning can help improve query performance by dividing the data into smaller, more manageable chunks.
Changing the distribution of tables: In an MPP system, the distribution of tables can affect the performance of queries. You may need to change the distribution of tables to improve performance.
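The kind of DDL these changes translate into might look as follows. These statements are purely illustrative: the table and column names are made up, the partitioning example uses PostgreSQL syntax, and the distribution example uses Amazon Redshift syntax, so adapt them to your own source or target.

```python
# Illustrative DDL for the three kinds of changes above. Table and column names
# are made up; run the statements through your database's own SQL client.
illustrative_statements = [
    # Add (or rebuild) an index on the primary-key column:
    "CREATE INDEX idx_orders_pk ON orders (order_id)",

    # Partition a large table by date range (PostgreSQL declarative partitioning,
    # assuming `orders` was created with PARTITION BY RANGE (order_date)):
    """CREATE TABLE orders_2024 PARTITION OF orders
       FOR VALUES FROM ('2024-01-01') TO ('2025-01-01')""",

    # Change how rows are distributed across an MPP cluster (Amazon Redshift):
    "ALTER TABLE orders ALTER DISTKEY order_id",
]

for statement in illustrative_statements:
    print(statement)  # replace with cursor.execute(statement) on a real connection
```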
What Are The Risks of Breaking Data Integrity?
The usual approach for maintaining data integrity involves a single integration process that applies changes on the target in the same order as on the source. This approach works well, but data integrity is at risk as soon as multiple parallel sessions integrate changes from source to target. Fortunately, HVR can deal with this and maintain the integrity of the data.
If you have tried everything else, consider parallelizing your replication. For this, you can leverage HVR.
HVR can deal with huge volumes of data and change data: it reads the source database logs and applies change data to the target as soon as it is detected.
It is not unheard of to get a data transfer speed of a couple of Gb/min when replicating data to Snowflake with a simple HVR installation, and this can then be optimized for even better performance.
2. Keeping The Cost Low
If the source table is truncated and reloaded periodically, if you don't need to maintain all the historical change data in the target (Snowflake), or if you only have small tables that need to be replicated, it is usually cheaper to go with periodic snapshots by scheduling a bulk refresh job.
In such a case, you don't need to enable the channel. Since HVR charges are based on MAR (monthly active rows), with each unique row counted only once, a periodic snapshot is cheaper in terms of both HVR cost and Snowflake warehouse uptime. This is an excellent way to save costs, but it is not always feasible due to business or audit requirements.
Another way to minimize Snowflake compute cost (warehouse uptime) is to follow these steps:
It is always recommended to select Burst integration. Integrating with Burst can enhance performance and reduce costs by coalescing changes, which minimizes the number of records to be integrated, and it explicitly filters boundary values for set-based operations. Applying all DML statements in a single merge statement also lowers cloud services costs. Burst is the default mode for Snowflake as a target in HVR.
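To see why this helps (a conceptual sketch, not HVR's actual code), burst-style integration coalesces multiple captured changes per key into one net change and then applies the whole set with a single MERGE. The staging table, column names, and connection below are assumptions.

```python
# Conceptual sketch of burst-style integration (not HVR's actual implementation):
# coalesce captured changes down to one net change per key, stage them, and apply
# everything with a single MERGE. Table, column, and connection names are assumptions.
import snowflake.connector

def coalesce(changes):
    # Keep only the latest change per primary key; `changes` arrive in commit order.
    latest = {}
    for change in changes:
        latest[change["pk"]] = change
    return list(latest.values())

MERGE_SQL = """
MERGE INTO analytics.public.orders AS t
USING analytics.staging.orders_burst AS s
  ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.amount = s.amount, t.status = s.status
WHEN NOT MATCHED AND s.op <> 'D' THEN
  INSERT (order_id, amount, status) VALUES (s.order_id, s.amount, s.status)
"""

def apply_burst_cycle(conn, changes):
    net_changes = coalesce(changes)
    # ... load net_changes into the staging table (e.g. PUT + COPY INTO) ...
    conn.cursor().execute(MERGE_SQL)  # one DML statement for the whole cycle
```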
Another way to save warehouse costs is to set BurstCommitFrequency to CYCLE (the default value in HVR for a Snowflake target). This saves cost by applying all changes for an integrate job cycle in a single transaction.
Increase the CycleByteLimit: This parameter controls the maximum amount of data processed in a single integration cycle. Increasing this value can reduce the number of integration cycles, saving money. However, a higher value will also increase the memory used by the integration process.
Increase the $HVR_SORT_BYTE_LIMIT: This parameter controls the maximum amount of memory used for sorting during the integration process. Increasing this value can help prevent data from being spilled to disk, saving money. However, a higher value will also increase the memory used by the integration process.
Schedule the integration less frequently: If near real-time replication is not required, you can schedule the integration less often. This reduces the number of integration cycles, which can save money.
This can be scheduled using IntegrateStartTimes in the Scheduling action, for example with a CRON expression that runs every 6 hours, adjusted according to your needs.
When scheduling data integration, the latency of the integrated data is determined by the time between integration cycles plus the total time required for a complete integration cycle. To reduce this time, it is possible to use multiple integration locations; this shortens the integration cycle and lets you meet your SLA even with a lower scheduling frequency, although it may increase your Snowflake credit usage. Alternatively, replicating data from the source to Snowflake in parallel can reduce integration time, but it may increase cloud services costs because more integration cycles run simultaneously even though the volume of data remains the same. The optimum configuration can be found by trial and error in an isolated environment; the back-of-envelope sketch below illustrates the tradeoff.
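All numbers below are illustrative assumptions, not measurements; the sketch simply shows how fewer, larger cycles cut warehouse uptime at the price of higher worst-case latency.

```python
# Back-of-envelope illustration of the latency-versus-cost tradeoff when
# scheduling integration. All numbers are illustrative assumptions.
schedule_interval_min = 6 * 60   # IntegrateStartTimes runs every 6 hours
cycle_duration_min = 20          # assumed time for one full integrate cycle

# Worst case: a change arrives just after a cycle starts, then waits a full
# interval and must also ride through the next cycle before it lands.
worst_case_latency_min = schedule_interval_min + cycle_duration_min

# Warehouse uptime per day is roughly cycles-per-day times cycle duration
# (ignoring auto-suspend lag), which is what drives Snowflake credit usage.
cycles_per_day = (24 * 60) / schedule_interval_min
warehouse_uptime_min_per_day = cycles_per_day * cycle_duration_min

print(f"Worst-case latency: ~{worst_case_latency_min} minutes")
print(f"Warehouse uptime:  ~{warehouse_uptime_min_per_day:.0f} minutes/day")
```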
Another way of saving Snowflake warehouse credits is to use a smaller multi-cluster Snowflake warehouse (for example, an S or M warehouse with a minimum cluster count of 1 and a maximum of 3) with the economy scaling policy, instead of a single large warehouse (for example, an XL warehouse).
We have noticed that when HVR finishes reading a slice or chunk of data, it writes that data to Snowflake and only starts reading a new slice once the previous one is complete, especially during refreshes. Likewise, it does not write to Snowflake in parallel while it reads from the source. As a result, the Snowflake warehouse is idle or underutilized most of the time, which is why we recommend a smaller multi-cluster warehouse over a larger one to save credits.
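For reference (the warehouse name, AUTO_SUSPEND value, and connection placeholders below are assumptions), a smaller multi-cluster warehouse with the economy scaling policy can be created with standard Snowflake DDL:

```python
# Hedged example: create a smaller multi-cluster Snowflake warehouse with the
# ECONOMY scaling policy for the integrate connection. The warehouse name,
# AUTO_SUSPEND value, and connection parameters are illustrative assumptions.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password"  # placeholders
)
conn.cursor().execute("""
    CREATE WAREHOUSE IF NOT EXISTS hvr_integrate_wh
      WAREHOUSE_SIZE = 'MEDIUM'
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 3
      SCALING_POLICY = 'ECONOMY'
      AUTO_SUSPEND = 60      -- suspend quickly while HVR is idle between cycles
      AUTO_RESUME = TRUE
""")
```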
Conclusion
The above two methods have been tested with our clients, who have seen up to a 50% increase in efficiency along with reduced Snowflake usage costs, saving them thousands of dollars.
Contact us for expert insights and further steps to elevate your HVR efficiency. Let’s maximize your ROI and accelerate your data movement strategy!
FAQs
You need to analyze your workload and target table and ensure they can handle your data types, and make sure you are storing transactional data in an OLTP database.
You will be billed according to MAR (monthly active rows); hence, it is cheaper to go with periodic snapshots via bulk refresh jobs than to keep the channel always active.