In today’s world, data-driven applications demand more flexibility, scalability, and auditability than traditional data warehouses and modeling approaches can provide. This is where the Snowflake Data Cloud and data vault modeling come in handy.
This blog explores why data vault modeling is widely adopted and what makes Snowflake an ideal platform for implementing it. We will also touch on the challenges and best practices of data vault modeling with Snowflake and, finally, look at several ways to implement a data vault effectively for your organization.
What is Snowflake?
Snowflake is a cloud-native data warehousing platform that separates compute from storage, enabling automatic scaling and pay-per-use pricing. Unlike traditional data warehousing solutions, Snowflake brings critical features like Data Sharing, Snowpipe, Streams, and Time Travel to the enterprise data architecture space.
What is Data Vault Modeling?
Data vault modeling is a hybrid approach that combines traditional relational data warehouse models with newer big data architectures to build a data warehouse for enterprise-scale analytics. It is agile and scalable, requires little up-front modeling, and is well suited to fluid designs.
The data vault architecture is composed of three distinct layers that work together. Let’s take a look at each in detail and understand how they work.
Raw Vault
The raw vault is the foundation of the architecture and is designed to store raw data from source systems without any business rules or transformations applied. Preserving the data in its original form guarantees data lineage and auditability. The raw vault comprises three types of entities: hubs, links, and satellites.
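To make this concrete, here is a minimal sketch of the three raw vault entity types in Snowflake SQL, assuming a hypothetical customer and order source. The table and column names are illustrative, not a prescribed standard.

```sql
-- Hub: one row per unique business key (e.g., customer number)
CREATE TABLE hub_customer (
    hub_customer_hk   BINARY(20)    NOT NULL,  -- hash of the business key
    customer_id       VARCHAR       NOT NULL,  -- business key from the source
    load_dts          TIMESTAMP_NTZ NOT NULL,
    record_source     VARCHAR       NOT NULL
);

-- Link: relationship between two (or more) hubs
CREATE TABLE link_customer_order (
    link_customer_order_hk BINARY(20)    NOT NULL,
    hub_customer_hk        BINARY(20)    NOT NULL,
    hub_order_hk           BINARY(20)    NOT NULL,
    load_dts               TIMESTAMP_NTZ NOT NULL,
    record_source          VARCHAR       NOT NULL
);

-- Satellite: descriptive attributes and their history for a hub
CREATE TABLE sat_customer_details (
    hub_customer_hk   BINARY(20)    NOT NULL,
    load_dts          TIMESTAMP_NTZ NOT NULL,
    hash_diff         BINARY(20)    NOT NULL,  -- change-detection hash of the attributes
    customer_name     VARCHAR,
    customer_email    VARCHAR,
    record_source     VARCHAR       NOT NULL
);
```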
Business Vault
The business vault extends the raw vault by applying rules or functions that most business users will find useful (e.g., net profit, total compensation), rather than reimplementing them repeatedly across multiple marts. As of Data Vault 2.0, the business vault also includes specialty tables that help build more efficient queries against the raw vault: Point-in-Time (PIT) tables and Bridge tables.
Building a business vault is not mandatory, as a compliant data vault can be created without one. Whether to include it depends on the specific business and reporting requirements that need to be met.
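As a rough sketch, a PIT table for the hypothetical customer hub above might pre-compute, for each snapshot date, which satellite rows were current at that point in time, so downstream queries can use simple equi-joins instead of date-range logic. The names are illustrative assumptions.

```sql
-- Point-in-Time table: pre-computed satellite pointers per snapshot date
CREATE TABLE pit_customer (
    hub_customer_hk          BINARY(20)    NOT NULL,
    snapshot_dts             TIMESTAMP_NTZ NOT NULL,
    sat_customer_details_dts TIMESTAMP_NTZ,  -- load_dts of the satellite row valid at snapshot_dts
    sat_customer_address_dts TIMESTAMP_NTZ   -- same idea for a second (hypothetical) satellite
);
```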
Information Mart
The information mart is the final stage, where data is optimized for analysis and reporting. Soft business rules may be applied here to further refine the data and ensure it meets the specific needs of business users; an information mart typically comprises dimensions and facts.
Now that we have a foundational understanding of what data vault modeling is, let’s take a closer look at some of the key characteristics and benefits that make it an attractive option for enterprise-scale analytics:
Key Benefits of Data Vault Modeling
- Data is insert-only, inherently enabling auditing and full history.
- Data loading is faster because the hash-key strategy allows multiple tables to be loaded concurrently.
- Integrates disparate source systems.
- Adapts to constantly changing relationships.
Why is Snowflake the Ideal Platform for Data Vault Modeling?
Snowflake is an ideal platform for data vault modeling due to its cloud-based architecture, which allows for elastic scaling of computing and storage resources on demand. This is crucial for data vault modeling, which requires flexibility and adaptability to accommodate changing business requirements and data sources.
Snowflake’s architecture also makes it easy to integrate data from multiple sources using data sharing, helping the data vault model perform well at scale.
Additionally, Snowflake’s unique features, such as Snowpipe and Streams, make it easier to capture incremental data and continuously ingest it into the raw data vault.
Snowflake’s native support for semi-structured data simplifies data vault design further by allowing direct ingestion of data into satellite tables, and its virtual warehouses accelerate parallel loads into hubs and satellites. The agile design approach of data vault modeling also aligns well with Snowflake’s ability to easily tap new datasets through the Snowflake Data Marketplace.
Finally, Snowflake’s robust security features, such as its Role-Based Access Control (RBAC) model, Dynamic Data Masking, and Data Tagging capabilities, make it a secure and user-friendly platform for implementing a data vault model.
Overall, Snowflake’s cloud-based architecture, scalability, and unique features make it a perfect match for organizations looking to implement a data vault model for enterprise-scale analytics.
Benefits of Data Vault Modeling on Snowflake
Parallel Data Loading with Virtual Warehouses
Data vault enables parallel loading into hubs, links, and satellites through hashed keys. Because the hash key is computed deterministically from the business key, it serves as the identifier for loading each hub and satellite independently.
This approach allows for maximum parallelization of the loading process, resulting in zero contention for continuous loading scenarios. To set up this approach, a multi-cluster warehouse is recommended for stage loads, and separate multi-cluster warehouses can be used to run all loads in parallel.
Collecting statistics can help determine the optimal size for the virtual warehouses; smaller warehouses are typically sufficient for hubs and links, with larger ones for satellites.
Hashed primary keys were introduced to eliminate the bottleneck created by database sequence generators, making this data vault pattern ideal for customers who prioritize data loading performance and use data warehouse automation tools.
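Here is a minimal sketch of a hash-keyed hub load from a staging table, assuming the hypothetical stg_customer staging table and the hub_customer table sketched earlier:

```sql
-- Load new business keys into the hub; the SHA1 hash of the business key is
-- computed deterministically, so hubs, links, and satellites can be loaded in
-- parallel without coordinating surrogate-key sequences.
INSERT INTO hub_customer (hub_customer_hk, customer_id, load_dts, record_source)
SELECT DISTINCT
    SHA1_BINARY(UPPER(TRIM(s.customer_id))),
    s.customer_id,
    CURRENT_TIMESTAMP(),
    'CRM_SYSTEM'                              -- example source system label
FROM stg_customer s
WHERE NOT EXISTS (
    SELECT 1
    FROM hub_customer h
    WHERE h.hub_customer_hk = SHA1_BINARY(UPPER(TRIM(s.customer_id)))
);
```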
Snowflake Views for Information Marts
Combining data vault modeling with Snowflake’s cloud data warehousing platform yields several benefits, including better query performance and more efficient use of resources.
Within information marts, views are an effective way to optimize query performance in the data vault. By creating views on top of PIT (Point-in-Time) and Bridge tables, you can significantly reduce the number of joins required and improve query performance.
In Snowflake, views are especially beneficial because queries against them can be served from the result cache when the same query has already been executed and the underlying table data has not changed, reducing query time.
Additionally, Snowflake automatically optimizes queries that use views, and you can elastically scale a virtual warehouse to get even better query performance at an optimal cost, all without any admin dependency.
Overall, by combining Snowflake’s architecture with data vault modeling and views, you can achieve excellent query performance from the same virtual warehouse.
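As an illustration, an information mart view built on top of the hypothetical PIT, hub, and satellite tables sketched earlier might look like this (object names are assumptions, not a standard):

```sql
-- Dimension-style view: the PIT table supplies equi-join keys, so consumers
-- never have to resolve satellite history at query time.
CREATE OR REPLACE VIEW dim_customer AS
SELECT
    p.hub_customer_hk AS customer_key,
    p.snapshot_dts,
    h.customer_id,
    d.customer_name,
    d.customer_email
FROM pit_customer p
JOIN hub_customer h
  ON h.hub_customer_hk = p.hub_customer_hk
LEFT JOIN sat_customer_details d
  ON d.hub_customer_hk = p.hub_customer_hk
 AND d.load_dts        = p.sat_customer_details_dts;
```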
Using Variant to Load JSON Directly into a Data Vault
In the context of a data vault, a Snowflake variant column is a column in a satellite table that stores semi-structured data in a JSON-like format. Variant columns can be used to store data that doesn’t fit neatly into traditional columns, such as nested data structures, arrays, or key-value pairs.
Using variant columns in data vault satellites in Snowflake provides several benefits. It allows for more flexible data modeling, because data can be stored in its natural format and interpreted with a schema-on-read approach. You can also land all of your data in the lake in JSON exactly as it is received from the source, then unpack the variant data as needed and model it into the data vault.
Overall, using variant columns in data vault satellites with Snowflake can help to create a more agile and adaptable data warehouse that can easily accommodate changing business requirements and new data sources.
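For example, a satellite can carry the raw JSON payload in a VARIANT column and expose individual attributes on read; the table name and JSON paths below are assumed for illustration.

```sql
-- Satellite storing the raw source payload as semi-structured data
CREATE TABLE sat_customer_raw (
    hub_customer_hk BINARY(20)    NOT NULL,
    load_dts        TIMESTAMP_NTZ NOT NULL,
    record_source   VARCHAR       NOT NULL,
    payload         VARIANT                 -- raw JSON as received from the source
);

-- Schema-on-read: unpack attributes only when they are needed downstream
SELECT
    hub_customer_hk,
    payload:customer:name::STRING          AS customer_name,
    payload:customer:address:city::STRING  AS city
FROM sat_customer_raw;
```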
Near Real-Time Vaults with Snowflake
A Snowflake Stream automatically tracks all data changes on a table without consuming storage space or impacting performance; the stream exposes the ‘delta’ that needs processing.
Snowflake Tasks, the platform’s managed data processing units, wake up at a defined interval and check for data in the associated stream.
If data is present, the task runs SQL to push it into the raw data vault objects. Tasks can be arranged in a dependency graph, with child tasks executing once their predecessor has finished its part.
A multi-table insert (MTI) inside the task populates multiple raw data vault objects with a single DML command. Each task is assigned a virtual warehouse, providing enough compute to handle workloads of any size, and the multi-cluster warehouse option automatically scales out and load-balances the tasks as more hubs, links, and satellites are introduced.
Streams can also propagate changes to business vault objects such as derived satellites, PITs, or Bridges in the next layer. This setup can be repeated to move data efficiently through all the layers in small increments until it is ready to be accessed by data consumers.
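Below is a minimal sketch of this pattern, reusing the hypothetical stg_customer staging table and the hub and satellite tables from the earlier examples; the warehouse name and schedule are placeholders, and deduplication/change-detection logic is omitted for brevity.

```sql
-- Stream exposes only the rows added to staging since the last consumption
CREATE OR REPLACE STREAM stg_customer_stream ON TABLE stg_customer;

-- Task wakes up on a schedule, checks the stream, and loads the raw vault
CREATE OR REPLACE TASK load_customer_vault
  WAREHOUSE = dv_load_wh          -- placeholder warehouse name
  SCHEDULE  = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('STG_CUSTOMER_STREAM')
AS
  -- Multi-table insert: one DML statement populates both hub and satellite
  INSERT ALL
    INTO hub_customer (hub_customer_hk, customer_id, load_dts, record_source)
      VALUES (customer_hk, customer_id, load_dts, record_source)
    INTO sat_customer_details (hub_customer_hk, load_dts, hash_diff,
                               customer_name, customer_email, record_source)
      VALUES (customer_hk, load_dts, hash_diff,
              customer_name, customer_email, record_source)
  SELECT
      SHA1_BINARY(UPPER(TRIM(customer_id)))                         AS customer_hk,
      customer_id,
      CURRENT_TIMESTAMP()                                           AS load_dts,
      SHA1_BINARY(CONCAT_WS('||', customer_name, customer_email))   AS hash_diff,
      customer_name,
      customer_email,
      'CRM_SYSTEM'                                                  AS record_source
  FROM stg_customer_stream;

-- Tasks are created in a suspended state; resume to start the schedule
ALTER TASK load_customer_vault RESUME;
```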
Governance and Access Policy
Given the massive volumes of data produced by diverse sources and the need to comply with international regulations, compliance increasingly has to be baked into the architecture itself. This is where Snowflake shines with its Dynamic Data Masking and Object Tagging features.
Dynamic Data Masking in Snowflake controls access to sensitive data by masking the data on the fly. Object Tagging is another useful feature that allows users to tag database objects like tables, views, and schemas with descriptive labels.
By leveraging the Dynamic Data Masking and Object Tagging capabilities in Snowflake, organizations can establish a secure and streamlined business data vault architecture. Object Tagging enables the classification of business vault objects, and DDM helps to define data governance policies based on the tags assigned to those objects.
This setup facilitates tracking of sensitive data usage and reduces access control management overhead. The resulting data governance approach improves compliance and supports effective data analysis and reporting at a much lower administrative cost.
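Here is a brief sketch of how these features fit together; the policy, tag, role, and column names are illustrative assumptions.

```sql
-- Masking policy: unmask email only for an authorized role
CREATE OR REPLACE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('DATA_GOVERNANCE_ADMIN') THEN val
    ELSE '*** MASKED ***'
  END;

-- Tag for classifying data vault objects
CREATE OR REPLACE TAG data_classification;

-- Classify the satellite and protect its sensitive column
ALTER TABLE sat_customer_details SET TAG data_classification = 'PII';
ALTER TABLE sat_customer_details
  MODIFY COLUMN customer_email SET MASKING POLICY mask_email;
```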
Best Practices for Implementing Data Vault Modeling in Snowflake
Implement a Robust Data Lineage Process: Data lineage is the process of tracking the flow of data through the data warehouse, from its source to its destination. dbt is one of the best tools for capturing it.
Establish Naming Conventions and Standards: Establish naming conventions and standards for Data Vault modeling objects, such as hubs, links, and satellites.
Use Automation Tools: Data vault modeling is a repeatable pattern, which makes it a perfect candidate for automation through metadata-driven design. Again, the data vault package for dbt automates a major portion of this work.
Implement a Solid Data Governance Strategy: Data vault modeling can result in large, complex data structures, so a solid governance model must be in place from inception.
Test Your Implementation: Implementing data vault modeling in Snowflake can be complex, so it’s essential to test your implementation thoroughly and set up an automated reconciliation framework to monitor data integrity and quality.
Simplify and Enhance Querying a Data Vault Model: You should consider building Point-in-Time (PIT) and Bridge tables.
Caution When Creating Views: When designing information marts, be cautious about creating views on top of views; as volume and complexity grow, performance may degrade. Consider a mixed strategy of physicalization (materialized tables) and virtualization (views).
Closing
Snowflake’s architectural capabilities, simplicity, and scalable performance let teams concentrate on core business matters with far less administration. Teams can connect data from various sources and access new datasets via the Snowflake Data Marketplace with minimal ETL effort.
By combining the Snowflake Data Cloud with a Data Vault 2.0 approach, teams can democratize access to data assets of any scale beyond organizational boundaries.
This approach enables businesses to derive more value from their data assets and gain insights that propel them to become truly data-driven.
A near real-time data vault in Snowflake provides a robust security layer and brings data closer to the business, enabling faster access and analysis.
In today’s business environment, delivering usable data faster is no longer optional. The Snowflake platform and the Data Vault 2.0 architecture make it possible to achieve this and take businesses to the next level of data-driven success.
Interested in leveraging these Snowflake features? As the Snowflake 2022 Partner of the Year, phData has the experience and expertise to help your organization get the most out of your Snowflake investment.
Contact phData today for any questions, advice, best practices, or data strategy services.