Building a successful data management solution requires making the right choices about technology, architecture, and design. Modern cloud-based data platforms like the Snowflake Data Cloud can address most of your technological needs, but you still need to ensure that the design and structure of the data complement the technology you have chosen and address all the business requirements effectively.
Data modeling is an important part of the design process that ensures the effectiveness of your data platform, and if you’re in the process of selecting a data model for your business, you’ve come to the right place.
In this post, we’ll examine what a data model is and why it’s so important to choose the right model for your data management initiatives. From pros and cons to a comprehensive ratings chart, this blog will ultimately give you more confidence to select and implement the best data model for your business.
What is Data Modeling?
At a high level, data modeling is the process of visualizing and representing data for storage, to facilitate efficient loading, retrieval, and analysis. The model is a conceptual representation of the data, the relationships between data elements, and the rules that govern them. The modeling itself can include diagrams, symbols, or text to represent data and the way that it interrelates.
Because of the structure it imposes, data modeling increases consistency in naming, rules, semantics, and security, while also improving data analytics.
Why is Data Modeling Important?
Any modern, data-driven business generates a vast amount of data across its various operations. Because business functions, systems, and processes differ, this data must be consolidated properly, cleansed to remove noise, and transformed to enable meaningful analytics. A Data Modeling exercise is therefore necessary to structure the data consistently and store it in a format that is usable for a variety of purposes.
In addition, an effective data model provides a solid foundation for any Data Warehouse, allowing it to sustain growing data volumes and adjust easily to the addition or removal of data entities.
Changing Needs of Data Lifecycle
The Data Modeling process varies across the different stages of the Data Lifecycle. The requirements for managing data differ based on how data is generated, stored, and retrieved, and a good Data Model should address the considerations of the specific stage of the Lifecycle it is being implemented for. At a high level, a typical Data Lifecycle moves through four stages: Creation, Warehousing, Analytics, and Archival.
What is the Most Important Consideration in Data Modeling?
There are several factors to consider when choosing the right data model, and their importance varies based on the stage of the Data Lifecycle being designed for:
- Speed and Frequency of Data Creation and Modification - Small amounts of data should be written quickly while maintaining consistency
- Speed of Data Retrieval - Small or large amounts of data retrieval for reporting and analysis
- ACID Properties - Atomicity, Consistency, Isolation, and Durability of transactions
- Business Scope - One or several departments or business functions
- Access to Lowest Grain of Data - Different use-cases for data may require access to the lowest level of detail or various levels of aggregation.
There can be other factors too, but those mentioned above greatly influence the decision-making for choosing the right data model.
Rating the Importance of Different Factors at Each Stage of the Data Lifecycle:
| | Creation | Warehousing | Analytics | Archival |
|---|---|---|---|---|
| Speed and Frequency of Data Creation and Modification | Requires fast writing speeds to ensure faster execution of transactions. Data is generated by end-users or automated systems. | Requires moderately fast writing speeds. Larger volumes of data need to be stored consistently. | Requires moderately fast writing speeds. Aggregations may be needed for efficient reporting. | Slower speeds are acceptable. Successful execution of the archival process is more important than speed. |
| Speed of Data Retrieval | Data may be retrieved as soon as it is written. Granularity of data retrieval may be the same as data inserted, with minimal transformations. | Data retrieval is needed to generate smaller datasets periodically. Retrieval may require transformations and aggregations. | Data retrieval is needed for reporting or dashboarding purposes. To ensure a better end-user experience, retrievals should be fast, and repetitive transformations and aggregations should be part of the data storage/data model. | Data retrieval is needed only in exceptional circumstances (audits, disaster recovery, etc.). Consistency with live datasets is more important than retrieval speed. |
| ACID Properties | Data creation is part of transactions involving multiple steps. Compliance with ACID properties is critical to ensure transactional consistency. | Compliance with all ACID properties is not critical, but the consistency of the dataset before and after a batch / micro-batch execution is expected. | Compliance with all ACID properties is not critical, but the consistency of the dataset before and after a batch / micro-batch execution is expected. | Compliance with all ACID properties is not critical, but the consistency of the dataset before and after a batch / micro-batch execution is expected. |
| Business Scope | Focused on a specific business activity | Should cover multiple business functions or an entire enterprise | Focused on specific reporting or AI/ML requirements | Scope depends on the Data Lifecycle stage being archived (Creation, Warehousing, Analytics) |
| Access to Lowest Grain of Data | In most cases, if a system generates data at its lowest granularity, then access to that same level of detail is also required. | Historical data may need to be stored at different granularities. Access to the lowest level of data is important. | Historical data may need to be stored at different granularities. Access to the lowest level of data may not be important. | Historical data may need to be stored at different granularities. Access to the lowest level of data may not be important. |
Breaking these down, the Creation stage is generally modeled in 3rd Normal Form (3NF), which is optimized for transaction speed, while the archival layer is highly dependent on the use case and storage medium. A platform like Snowflake does not focus on these two stages yet, so for the remainder of this blog we will focus on the Warehousing and Analytics stages, as these use cases greatly influence how data is utilized by modern analytics-oriented applications and AI/ML algorithms.
Depending on the factors discussed above, the following data modeling approaches are suitable, and widely used as best practices in the industry:
- Data Vault - Suitable for the Warehousing stage, as it provides mechanisms for linking data entities across multiple business functions, auditing and tracing historical data, and scalability.
- Star Schema (Dimensional Data Modeling) - Suitable for the Analytics stage to build data marts focused on one or a few business functions. Its simpler, denormalized design enables straightforward joins and provides the ability to store data at various levels of granularity.
What is a Data Vault?
Data Vault is a database modeling method designed to provide long-term historical storage of data coming in from multiple operational systems. It is also a way of looking at historical data that addresses issues such as auditing, tracing of data, loading speed, and resilience to change, while emphasizing the need to trace where all the data in the database came from. This means that every row in a Data Vault must be accompanied by record source and load date attributes, enabling an auditor to trace values back to the source. The method was developed by Daniel (Dan) Linstedt in 2000, and he describes the Data Model as follows:
“The Data Vault Model is detail-oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between the 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise.”
Components of Data Vault:
- Hubs - Hubs contain a list of unique business keys with a low propensity to change. Hubs also contain a surrogate key for each Hub item and metadata describing the origin of the business key.
- Links - Associations or transactions between business keys (relating for instance the hubs for customer and product with each other through the purchase transaction) are modeled using link tables. These tables are basically many-to-many join tables, with some metadata.
- Satellites - Satellite tables hold the descriptive attributes of a parent Hub or Link, along with metadata linking them to that parent, metadata describing the origin of the association, and a timeline with start and end dates for each attribute version.
- Reference Tables - Reference tables hold any information needed to resolve descriptions from codes or to translate keys consistently. Many of these fields are "descriptive" in nature and describe a specific state of the other, more important information.
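To make these components concrete, here is a minimal Snowflake DDL sketch of a Hub, a Satellite, and a Link for a hypothetical customer/product purchase domain; all table and column names are illustrative rather than a prescribed standard.

```sql
-- Hub: unique business keys plus load metadata (illustrative names)
CREATE TABLE hub_customer (
    hub_customer_key  VARCHAR(32)   NOT NULL,  -- surrogate key, e.g. MD5 hash of the business key
    customer_bk       VARCHAR(100)  NOT NULL,  -- business key from the source system
    load_date         TIMESTAMP_NTZ NOT NULL,
    record_source     VARCHAR(50)   NOT NULL,
    CONSTRAINT pk_hub_customer PRIMARY KEY (hub_customer_key)
);

-- Satellite: descriptive attributes and their history, attached to the parent Hub
CREATE TABLE sat_customer_details (
    hub_customer_key  VARCHAR(32)   NOT NULL REFERENCES hub_customer (hub_customer_key),
    load_date         TIMESTAMP_NTZ NOT NULL,
    load_end_date     TIMESTAMP_NTZ,           -- end of the validity timeline for this version
    record_source     VARCHAR(50)   NOT NULL,
    customer_name     VARCHAR(200),
    customer_segment  VARCHAR(50),
    CONSTRAINT pk_sat_customer_details PRIMARY KEY (hub_customer_key, load_date)
);

-- Link: many-to-many association between Hubs (e.g. a purchase relating customer and product)
CREATE TABLE link_purchase (
    link_purchase_key VARCHAR(32)   NOT NULL,
    hub_customer_key  VARCHAR(32)   NOT NULL,
    hub_product_key   VARCHAR(32)   NOT NULL,  -- assumes a hub_product table defined the same way
    load_date         TIMESTAMP_NTZ NOT NULL,
    record_source     VARCHAR(50)   NOT NULL,
    CONSTRAINT pk_link_purchase PRIMARY KEY (link_purchase_key)
);
```

Note that Snowflake records primary and foreign key constraints for documentation but does not enforce them, so the loading logic still needs to guarantee uniqueness.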
Data Vault Pros and Cons:
- The ETL is quite straightforward and lends itself to easy automation or templating.
- It is easier to create dimensional (Star Schema) data marts from a Data Vault: Hubs and their Satellites can be modeled into dimensions, while Links (with their Satellites) typically become Fact tables (see the view sketched after this list).
- Data Vault enables consistent storage of historical data. It is optimized for storage, but not for the fast retrieval expected by Analytics tools.
- An effective design and implementation of a Data Vault can have a longer lead time, but taking an iterative approach using templates created for earlier versions can help speed up the process.
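To illustrate the point above about deriving dimensional marts, the sketch below builds a customer dimension as a view over the hypothetical hub_customer and sat_customer_details tables from the earlier sketch; it assumes the current satellite record is the one with a NULL load_end_date.

```sql
-- Illustrative: derive a customer dimension from the Data Vault layer
CREATE OR REPLACE VIEW dim_customer AS
SELECT
    h.hub_customer_key AS customer_key,
    h.customer_bk      AS customer_id,
    s.customer_name,
    s.customer_segment
FROM hub_customer h
JOIN sat_customer_details s
  ON s.hub_customer_key = h.hub_customer_key
WHERE s.load_end_date IS NULL;  -- keep only the current version of each customer's attributes
```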
What is a Star Schema?
The Star Schema data model is widely used to build data warehouses and dimensional data marts. It includes one or more fact tables referencing any number of dimension tables. The name "Star Schema" comes from how its diagrammatic representation looks, with dimension tables distributed around a central fact table like the points of a star.
Components of Star Schema:
- Fact Tables - Fact Tables consist of facts/metrics recorded in a business, as business transactions happen. For example, a Fact Table for a Sales department would track metrics like Revenues, Marketing Costs, etc.
- Dimension Tables - Dimension Tables consist of descriptive attributes that define the facts being measured in a Fact Table. Dimension Tables uniquely identify attribute values using primary keys (either Business Keys or Surrogate Keys), which are used as foreign keys in the Fact Table. For example, a DimSalesExec dimension would list all Sales Executives and help describe how much revenue each Sales Executive generated. Dimension Tables also help define the granularity of the Fact Table.
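The following is a minimal sketch of these two components in Snowflake SQL, using the Sales example from above; the grain, column names, and data types are illustrative assumptions.

```sql
-- Dimension: descriptive attributes keyed by a surrogate key
CREATE TABLE dim_sales_exec (
    sales_exec_key   INTEGER      NOT NULL,   -- surrogate key
    sales_exec_id    VARCHAR(50)  NOT NULL,   -- business key
    sales_exec_name  VARCHAR(200),
    region           VARCHAR(100),
    CONSTRAINT pk_dim_sales_exec PRIMARY KEY (sales_exec_key)
);

-- Fact: measures recorded at a chosen grain, with foreign keys to dimensions
CREATE TABLE fact_sales (
    date_key         INTEGER      NOT NULL,
    sales_exec_key   INTEGER      NOT NULL REFERENCES dim_sales_exec (sales_exec_key),
    revenue          NUMBER(18,2),
    marketing_cost   NUMBER(18,2)
);
```

Queries against this structure simply join the fact table to the dimensions it references, which is what keeps the join logic simple.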
Star Schema Pros and Cons
- Simpler Queries - Star schema join logic is much simpler than the joins needed to fetch data from a highly normalized transactional schema.
- Simplified Business Reporting Logic - Compared to a highly normalized transactional schema, the star schema simplifies common business reporting logic, such as as-of reporting and period-over-period comparisons.
- Feeding Cubes - Star schemas are widely used by OLAP systems to design OLAP cubes efficiently. In fact, major OLAP systems offer a ROLAP mode of operation that can use a star schema as a source without designing a cube structure.
- Data integrity is not enforced as well, since the schema is highly denormalized.
- Star schemas don't easily support many-to-many relationships between business entities.
- Building separate Data Marts can create data silos, which may not merge consistently with each other.
How to Model Data in Modern, Cloud-Based Database Platforms Like Snowflake
The Snowflake platform was built from the ground up to address the scalability and performance issues of traditional database systems. Snowflake customers do not have to worry about guessing capacity requirements or procuring and maintaining expensive hardware. This is mainly accomplished by separating compute from storage, which lets a modern warehouse scale for performance (compute) and for capacity (storage) independently.
In addition, Snowflake provides data types like VARIANT to store semi-structured data. For modern data sources like IoT devices and big data storage (Data Lakes), VARIANT can hold data in its native format, such as JSON. These data types can speed up data ingestion into either Data Model, Data Vault or Star Schema. For example, a JSON document can be stored in a dimension table of a Star Schema model as a VARIANT column. This column can later evolve into multiple columns or separate dimension tables, depending on the relationships between the JSON attributes.
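As a small sketch of the pattern just described (with hypothetical table, column, and attribute names), a JSON document can be landed in a VARIANT column and its attributes queried directly:

```sql
-- Dimension table keeping the raw semi-structured document alongside a key (illustrative)
CREATE TABLE dim_device (
    device_key  INTEGER NOT NULL,
    raw_payload VARIANT            -- JSON stored in its native, semi-structured form
);

-- Load a JSON document directly into the VARIANT column
INSERT INTO dim_device
SELECT 101, PARSE_JSON('{"model": "X100", "firmware": {"version": "2.3"}, "tags": ["iot", "edge"]}');

-- Query individual attributes; these can later be promoted to regular columns or separate dimensions
SELECT
    raw_payload:model::STRING            AS model,
    raw_payload:firmware.version::STRING AS firmware_version
FROM dim_device;
```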
Along with the inherent benefits of the Snowflake platform, it is important to implement the right Data Model which can complement the Snowflake benefits by providing effective ETL / ELT, reducing unwanted redundancies, and optimally creating Snowflake artifacts like Compute Warehouses and databases.
Both Data Vault and Star Schema Data Models can be implemented in Snowflake, depending on the scope and objectives of the exercise. A Data Vault model is suitable for enterprise-wide or multi-department initiatives, where the objective is to build a Warehouse to store historical data and ensure consistent linkage between data entities in different departments. A Star Schema model is suitable for smaller initiatives focused on specific Analytical needs, which may require data to be structured at different granular levels.
| Data Vault in Snowflake | Star Schema in Snowflake |
|---|---|
| Should be used to build a larger, enterprise-wide Data Warehouse | Should be used to build smaller, focused Data Marts for one or a few departments |
| Departments can define ownership of Hubs and Satellites, and Compute Warehouses can be designed according to those definitions | Star Schemas can utilize conformed dimensions, which are defined once and used across multiple Data Marts. This helps impart consistency in business definitions across departments and prevents the formation of Data Silos. (e.g., a Sales Data Mart and a Finance Data Mart can both share a Customer dimension, leading to a uniform definition of the Customer entity across the two business functions.) |
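As a sketch of the conformed-dimension pattern mentioned in the table above, the statements below define one shared customer dimension and reference it from fact tables in hypothetical Sales and Finance data marts; the schema and table names are assumptions for illustration.

```sql
CREATE SCHEMA IF NOT EXISTS shared;
CREATE SCHEMA IF NOT EXISTS sales_mart;
CREATE SCHEMA IF NOT EXISTS finance_mart;

-- One conformed customer dimension, defined once and shared by all data marts
CREATE TABLE shared.dim_customer (
    customer_key  INTEGER      NOT NULL,
    customer_id   VARCHAR(50)  NOT NULL,
    customer_name VARCHAR(200),
    CONSTRAINT pk_shared_dim_customer PRIMARY KEY (customer_key)
);

-- Sales data mart fact referencing the shared dimension
CREATE TABLE sales_mart.fact_orders (
    order_date_key INTEGER NOT NULL,
    customer_key   INTEGER NOT NULL REFERENCES shared.dim_customer (customer_key),
    order_amount   NUMBER(18,2)
);

-- Finance data mart fact referencing the same dimension, keeping the Customer definition uniform
CREATE TABLE finance_mart.fact_invoices (
    invoice_date_key INTEGER NOT NULL,
    customer_key     INTEGER NOT NULL REFERENCES shared.dim_customer (customer_key),
    invoice_amount   NUMBER(18,2)
);
```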
To implement either of these Data Models in Snowflake, follow these next steps.
Step 1: Identify the scope and purpose of the data modeling exercise. A larger, enterprise-wide scope aimed at storing and consolidating data across source systems calls for a Data Vault model, while a data model for reporting or AI/ML applications calls for a Star Schema model.
Step 2: If a traditional Data Warehouse already exists (on-premises or on another cloud platform), the existing Data Model can be redesigned into a Data Vault or Star Schema, based on that scope and purpose.
Step 3: To migrate the existing Data Warehouse to a Data Vault model in Snowflake, you can either use a Lift and Shift approach, or an Iterative Migration approach.
In a Lift and Shift approach, a traditional Data Warehouse can be re-platformed on Snowflake without any changes. Once the Data Warehouse is live, then we can begin mapping existing data entities to components of Data Vault (Hubs, Satellites, Links, Reference Tables). Snowflake can allow consistent access to both existing and new Data Warehouses at the same time. Alternatively, in an Iterative Migration approach, you can start defining a Data Vault model in Snowflake, but keep the traditional data warehouse functioning in its current state. Snowflake will host only the Data Vault model, which will continue to evolve as more data entities are migrated.
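As a hedged sketch of how one entity might be migrated iteratively, the statements below populate the hub_customer and sat_customer_details tables from the earlier Data Vault sketch, sourcing from a hypothetical legacy dimension (legacy_dw.dim_customer) that has been staged or re-platformed in Snowflake; the source table and its columns are assumptions.

```sql
-- Seed the Hub with business keys from the legacy dimension (only keys not already present)
INSERT INTO hub_customer (hub_customer_key, customer_bk, load_date, record_source)
SELECT
    MD5(l.customer_id),
    l.customer_id,
    CURRENT_TIMESTAMP(),
    'LEGACY_DW'
FROM legacy_dw.dim_customer l
WHERE NOT EXISTS (
    SELECT 1 FROM hub_customer h WHERE h.customer_bk = l.customer_id
);

-- Carry the descriptive attributes into the Satellite as the current version
INSERT INTO sat_customer_details
    (hub_customer_key, load_date, load_end_date, record_source, customer_name, customer_segment)
SELECT
    MD5(l.customer_id),
    CURRENT_TIMESTAMP(),
    NULL,
    'LEGACY_DW',
    l.customer_name,
    l.customer_segment
FROM legacy_dw.dim_customer l;
```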
Step 4: Migration of Star Schema Models can also use similar approaches as Data Vault. In addition to designing the data entities specific to a Data Mart, you also need to consider how to link different Data Marts together. This will ensure data consolidation across different Analytical use-cases. To best achieve this, the Star Schema models should be derived from the underlying Data Vault model.
Step 5: It is also possible that an organization has a Data Lake storing vast amounts of unstructured or semi-structured data using Big Data platforms like Hadoop. Migration from these data sources into Snowflake also follows a similar process as traditional Data Warehouses, but additional effort is needed to identify relationships between different data entities, data integrity requirements, and data ownership.
Step 6: You can also build both Data Vault and Star Schema models in Snowflake from the ground up. In such cases, an iterative approach for the Data Vault can be most effective. A Data Vault is not directly usable by end-users, so you can build independent Star Schema models in parallel and re-align them to the Data Vault once it is live.
In Closing
While this blog is not exhaustive, it hopefully will shed some light on your path to choosing the right data model for your enterprise. If you have any lingering questions or need additional help designing, building, and operationalizing your data product, we’d love to chat!