This guide was co-written by a team of data experts, including Dakota Kelley, Ahmad Aburia, Sam Hall, Katrina Menne, and Sunny Yan.
Imagine a world where all of your data is organized, easily accessible, and routinely leveraged to drive impactful outcomes.
In this ideal world, your company no longer has to rely on gut decisions; important business questions are accurately answered with a simple glance at a dashboard.
This dream world is what most businesses strive for, but few are able to achieve.
Yet those that do achieve this level of maturity in their data stack are able to unlock breakthrough successes while leaving competitors years behind in innovation.
Perhaps the largest roadblock to this data-driven utopia is the continued reliance on a patchwork of legacy, on-premise technologies such as Teradata, Netezza, and Oracle that just can’t keep up with future data demands as data usage and storage skyrocket.
The good news is that there’s a concept called the Modern Data Stack that, when utilized properly, consistently empowers organizations to harness the full potential of their data.
As an AI and data analytics consulting company, phData is on a mission to become the world leader in delivering data services and products on a modern data platform. Throughout this journey, we’ve helped hundreds of clients achieve eye-opening results by moving to the Modern Data Stack. While rarely a straightforward process, we’ve hit nearly every bump along the way and have learned so much, and we’re excited to share those insights with you!
In this approachable guide, we’ll uncover:
- The importance of moving to the Modern Data Stack
- A few essential technologies that make up the Modern Data Stack
- Effective strategies on how to seamlessly migrate to the Modern Data Stack
- Best practices to build and be successful with a Modern Data Stack
- What is the Modern Data Stack?
- Why Migrate to a Modern Data Stack?
- Central Source of Truth for Analytics
- Data Replication
- Data Transformation
- Data Applications and ML/AI
- Governance
- Pre-Migration Considerations
- Migration Approaches
- Modern Data Stack Migration Plan
- Real World Case Study
- Changing your Delivery Mindset
- Team Structure
- Automation
- phData Toolkit
- Free Data Migration Assessment
Part 1: Modern Data Stack 101
What is the Modern Data Stack?
With the birth of cloud data warehouses, data applications, and generative AI, processing large volumes of data faster and cheaper is more approachable and desired than ever. This has led to the development of many different tools and platforms that provide scalable and economical solutions to work with these systems.
The combination of these technologies has become collectively known as the Modern Data Stack. These tools often have low barriers to entry and offer strong SaaS or open-core support while solving core data challenges like:
- Building a central source of truth for analytics
- Data replication
- Data transformation
- Data applications and ML/AI
- Data governance
The Modern Data Stack serves as an integrative force, facilitating the efficient flow of data from its source to its point of application. However, merely knowing what it consists of isn’t enough. To truly understand its potential, we need to explore the benefits it brings, particularly when transitioning from traditional data management structures.
Why Migrate to a Modern Data Stack?
In an era where data is becoming the lifeblood of organizations, legacy on-premise data and analytics ecosystems often struggle to keep pace with ever-expanding data demands.
To sustain a competitive edge, it’s imperative for organizations to consolidate data from diverse sources into a unified repository. This single source of truth must facilitate real-time insights yet be flexible enough to evolve with changing business needs.
Thus, migration to a Modern Data Stack emerges as a viable and strategic response to these requirements. However, before exploring the benefits and potential of the Modern Data Stack, it’s crucial to identify and understand the major challenges associated with legacy data systems.
Common Limitations from Legacy Data Systems
- Long Turn-Around Time to Set up Infrastructure: Large on-premise infrastructure is deeply interconnected and often requires an army of engineers to maintain.
- Slow Response to New Information: Legacy data systems often lack the compute power necessary to run efficiently and can be cost-inefficient to scale. This typically results in long-running ETL pipelines and decisions made on stale data.
- Expensive and Often Manual Journey Towards Insights: Reporting is extremely manual and error-prone. Legacy tools force users to hand-build processes that the Modern Data Stack can automate. Because of these inefficiencies, data teams on legacy systems usually spend more time firefighting data issues than building scalable solutions.
PRO TIP: By migrating onto the Modern Data Stack, you can mitigate these issues and begin to work towards a scalable data system that allows any modern organization to be truly data-driven.
Common Advantages of a Modern Data Stack
- Infrastructure is Managed in the Cloud: No longer does your organization need a squad of engineers to manage the on-premise system. Storage, compute, hardware, and platform maintenance (and all its associated costs) are handled for you.
- Business-Focused Operation Model: Teams can shed countless hours of managing long-running and complex ETL pipelines that do not scale. Data teams can focus on delivering higher-value data tasks with better organizational visibility.
- Move Beyond One-off Analytics: The Modern Data Stack empowers you to elevate your data for advanced analytics and integration of AI/ML, enabling faster generation of actionable business insights.
- Transparent Pricing Model: Say goodbye to tedious cost adjustments for hardware, software, platform maintenance, upgrade costs, etc. Only pay for what you need, when you need it.
Part 2: The Technologies & Concepts of the Modern Data Stack
We mentioned earlier that the Modern Data Stack is a collection of platforms that all work in unison to solve a core data challenge. In this section, we’ll explore each of those challenges in more detail as well as the tools/platforms that help solve these challenges.
First up, let’s dive into the foundation of every Modern Data Stack, a cloud-based data warehouse.
Central Source of Truth for Analytics
A Cloud Data Warehouse (CDW) is a type of database that provides analytical data processing and storage capabilities within a cloud-based infrastructure. CDWs are designed for running large and complex queries across vast amounts of data, making them ideal for centralizing an organization’s analytical data for the purpose of business intelligence and data analytics applications.
Demands on the cloud data warehouse are also evolving to require it to become more of an all-in-one platform for an organization’s analytics needs. Workloads like unstructured or semi-structured data processing, ML/generative AI operations, and custom data applications are all becoming expected features of today’s cloud data warehouse platforms.
Enter Snowflake
The Snowflake Data Cloud is one of the most popular and powerful CDW providers. Snowflake allows its users to interface with the software without worrying about the infrastructure it runs on or how to install it. Snowflake is built on public cloud infrastructure and can be deployed to Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
Between the reduction in operational complexity, the pay-as-you-go pricing model, and the ability to isolate compute workloads, there are numerous ways to reduce costs associated with performing analytical tasks.
Key Benefits and Features of Using Snowflake
- Data Sharing: Easily share data securely within your organization or externally with your customers and partners.
- Zero Copy Cloning: Create multiple ‘copies’ of tables, schemas, or databases without actually copying the data. This noticeably saves time on copying and drastically reduces data storage costs.
- Separate Compute and Storage: Scale your compute and storage independently of one another and isolate compute power for jobs that need their own dedicated warehouse (a brief SQL sketch of this and zero-copy cloning follows this list).
- No Hardware Provisioning: No hardware to provision, just a t-shirt-sized warehouse available as needed within seconds.
- Advanced Analytics: Snowflake’s platform is purposefully engineered to cater to the demands of machine learning and AI-driven data science applications in a cost-effective manner. Enterprises can effortlessly prepare data and construct ML models without the burden of complex integrations while maintaining the highest level of security.
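To make a couple of these features concrete, here is a minimal SQL sketch of zero-copy cloning and independent compute scaling; the database and warehouse names are placeholders for illustration.

```sql
-- Clone a production database for development without duplicating its storage
-- (analytics and analytics_dev are placeholder names for illustration)
CREATE DATABASE analytics_dev CLONE analytics;

-- Spin up an isolated, t-shirt-sized warehouse for a specific workload
CREATE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND   = 60     -- suspend after 60 seconds of inactivity to save credits
  AUTO_RESUME    = TRUE;

-- Later, scale compute independently of storage as the workload grows
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';
```

Because a clone shares underlying storage with its source, the development copy is available in seconds, and only data that subsequently changes consumes additional storage.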
Data Replication
Transferring data from a source system to a data warehouse (often known as data replication or data ingestion) can present numerous challenges for organizations of all sizes. Generally, organizations need to integrate a wide variety of source systems when building their analytics platform, each with its own specific data extraction requirements.
This complexity often necessitates the involvement of numerous experts who specialize in these individual systems to effectively extract the data.
Enter Fivetran
Fivetran automates the data integration process, helping reduce the overall effort required to manage data movement from different sources into your data warehouse. This can save your organization significant time and money compared to manual data integration methods.
Often this automation minimizes the points of failure your team will be required to manage, allowing your team to spend more time using and refining data.
Key Benefits and Features of Using Fivetran
- Automated Data Integration: Minimizing points of failure, Fivetran allows your team to focus more on refining and utilizing data rather than managing its movement from various sources to your data warehouse.
- Streamlined Change Management: Fewer moving parts mean simpler change management for your organization. This enables teams to tighten delivery cycles, accelerate bug fixes, and implement new features more frequently, thus enhancing business agility.
- Data Privacy Compliance: Fivetran ensures compliance with key data privacy regulations such as GDPR, CCPA, and HIPAA. Its built-in features like data encryption and masking protect sensitive data throughout its lifecycle.
- Support for Numerous Data Sources: Fivetran supports over 200 data sources, including popular databases, applications, and cloud platforms like Salesforce, Google Analytics, SQL Server, Snowflake, and many more. Additionally, unsupported data sources can be integrated using Fivetran’s cloud function connectors.
- Low Maintenance Connectors and Cloud-Based Infrastructure: Fivetran ensures fast and reliable data loading into your data warehouse, making it easy for organizations to integrate data from multiple sources. Fivetran scales with your organization automatically, allowing your data teams to focus on producing value instead of maintaining infrastructure.
Data Transformation
Once the various data sources are available in the data warehouse, the data must often be modeled and transformed to make it usable for the business. Transformation tools of old often lacked easy orchestration, were difficult to test and verify, required specialized knowledge of the tool, and left the documentation of your transformations dependent on the willingness of the developer to write it.
Enter dbt
dbt provides SQL-centric data modeling and transformation, which is efficient for scrubbing and shaping your data while relying on a skill set that is easy to hire for and develop within your teams.
Aside from being SQL-centric, dbt focuses on modular, bite-sized SQL models that can be tied together to generate a directed acyclic graph (DAG), which provides an easy way to show how data flows from source to target (a minimal sketch follows the list below).
Key Benefits and Features of Using dbt
- Version Control With Git: SQL code is stored within git and version-controlled. This enables an automated continuous integration/continuous deployment system (CI/CD).
- Testing and Repeatability: Rigorous testing protocols and the principle of DRY (Don’t Repeat Yourself) are applied to your transformation pipeline, ensuring consistency and reliability.
- Transparent Model Lineage: The creation of a directed acyclic graph makes model lineage extremely clear, allowing for a better understanding of the data transformation process.
- Easy Orchestration of Transformations: Graph operations make the orchestration of transformations almost trivial, leading to self-documenting transformations.
- Better Transparency: There’s more clarity about where data is coming from, where it’s going, why it’s being transformed, and how it’s being used.
- Improved Data Governance: This level of transparency can also enhance data governance and control mechanisms in the new data system.
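As a hypothetical illustration of this modular, DAG-based approach, the sketch below shows two small dbt models: a staging model that lightly cleans a raw source and a mart model that builds on it via ref(). The source and model names are assumptions for the example, and the erp source would need to be declared in a dbt sources YAML file.

```sql
-- models/staging/stg_orders.sql (hypothetical staging model)
-- Lightly cleans the raw source; dbt resolves {{ source() }} from a sources YAML file
select
    order_id,
    customer_id,
    cast(order_date as date) as order_date,
    amount
from {{ source('erp', 'raw_orders') }}

-- models/marts/fct_daily_revenue.sql (hypothetical mart model)
-- {{ ref() }} creates the dependency edge dbt uses to build and document the DAG
select
    order_date,
    sum(amount) as daily_revenue
from {{ ref('stg_orders') }}
group by order_date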
Data Applications and ML/AI
After the data has been successfully replicated, transformed and validated, the next step is to leverage it to extract meaningful insights. This often takes the form of creating BI (Business Intelligence) reports, data visualizations and advanced data science models that take advantage of the capabilities of generative AI and machine learning.
Creating effective data visualization dashboards requires finding a tool that is capable of creating interactive visualizations, building sophisticated data models, and running complex queries and reports. It should also enable easy sharing of insights across the organization.
Enter Sigma Computing
Sigma Computing is a cloud-native business intelligence platform that allows users to quickly analyze, visualize, and explore their data without needing a deep technical background. Its spreadsheet-based interface makes high-level summaries and granular analyses easily accessible to all members of the business.
Key Benefits and Features of Using Sigma Computing
- Live Connection to Snowflake: Sigma’s optimized live connection to Snowflake eliminates the need for data extracts, ensuring reports are always updated without sacrificing speed or performance.
- Model and Interact with Data: Sigma’s data modeling features can either complement existing models, for example, with their dbt integration, or empower users to create new models through a transparent modeling interface. If data needs to be supplemented or adjusted, the Input Table feature allows users to augment and update data directly, writing modifications back to the data warehouse. This centralization streamlines data management.
- Collaborative Data Decisions: Sigma fosters collaboration by enabling multiple users to work within the same workbook simultaneously, annotate on analysis, and save different workbook versions. These capabilities enable organizations to save time and money by following development life cycle best practices and empower users to overcome departmental silos.
- Snowpark Integration: Sigma’s Snowpark integration enables models created by developers and data scientists to be used like every other formula in Sigma. This enhances the value of machine learning investments by making the models available to a wider audience who can explore hypothetical situations freely.
- Efficient Sharing: Sigma provides several easy but secure ways to share data and analytics internally and externally. Users can schedule reports or create alerts to be sent within or outside the organization. Additionally, whole or partial workbooks can be embedded into existing applications or portals, enhancing collaboration between teams and organizations or creating a new potential revenue stream.
The Future of Data Applications and ML/AI
As the analytics industry continues to evolve with the recent introduction of generative AI and the rise of building applications close to the data warehouse with technologies like Streamlit, organizations must decide whether these technologies will help them further transform and unlock value for their business.
Without a clear business driver toward profitability, these technologies will simply be a drain on the organization’s resources.
Regardless of this, the fundamentals of a data platform will still remain more critical than ever – data reliability, data availability, and how actionable the data is in whatever form it takes. Businesses that win at these fundamentals will be set up for success no matter how much they are using the latest LLM prompt engineering or fine-tuning techniques.
The Modern Data Stack is one of the best ways to ensure your data products are built on these fundamentals, so you can focus on the best ways to use your data to grow your business.
Data Governance
Governance is often one of those terms thrown around when modernizing a data platform, but it can easily be misunderstood or misapplied. Modern data governance is expected to help the business derive maximum value from its data assets while proactively managing risk.
However, the focus is not on creating rigid, brittle controls, but on enabling the agile delivery of business value from data. Additionally, and perhaps most importantly, it enables continuous collaboration between technical and business teams to iteratively improve the data, as well as the controls and processes around it.
Legacy data systems are at a disadvantage here. Not only is there limited tooling and resources for implementing good controls and monitoring, but often the change processes associated with these legacy systems are time-consuming and resource intensive due to the effort and risk associated with making changes to these systems.
This is only exacerbated by the fact that legacy systems have a shrinking talent pool for operations and support. Even things like data access reviews are typically done manually without automation. These things limit the ability of these systems to keep up with the requirements of today’s data-driven business culture. Modern data businesses need modern data governance.
Modern Data Stacks embrace governance as a core design principle. Each component, from ingestion to reporting, is built with governance in mind.
Key Advantages of Governance
- Simplified Change Management: The complexity of the underlying systems is abstracted away from the user, allowing them to simply and declaratively build and change data pipelines. This reduces risk, enables automation, and allows less technical users to assist in the development process. Systems are also decoupled from each other, reducing the risk that a change in one area will affect something in another.
- Embracing Automation: As already mentioned, the abstraction that the Modern Data Stack provides means that most of the infrastructure maintenance that would typically be required to maintain an enterprise data platform is automated away.
- Security, Compliance, and Data Quality as First-Class Citizens: The tools that handle your data use end-to-end encryption and are certified to handle all of it.
- Enabling Collaboration Between Business and Technical Teams: Modern Data Stack tooling is built to democratize the building of data assets through self-service access and fast, agile feature development.
Building on the Modern Data Stack not only facilitates rapid iteration of data products by both engineering and business teams but also enables the maintenance of security, quality, and compliance at scale through automation and abstraction.
For an in-depth guide around modernizing governance for the Modern Data Stack, check out our comprehensive guide!
Part 3: Modern Data Stack Migration Strategy
Now that we have a firm understanding of what the Modern Data Stack is and the technologies it encompasses, it’s time to look at the migration component, especially the strategy behind it.
In our experience, taking the time to develop a sound strategy will help you and your organization mitigate risks and ensure a successful migration of your critical data workloads. This requires some pre-migration preparation, as well as identifying the strategy for the actual migration.
Pre-Migration Considerations
- Assessing: Take inventory of and understand your existing data stack, not just what is done in the data warehouse but where various logic exists and why.
- Identification: Identifying a high-value first target to be migrated helps with the overall success of your migration, especially when driving buy-in from across the organization.
- Up-Skilling: Data teams will need to expand their skills to work with their new and modern tooling.
Once the pre-migration steps are completed, it is time to begin planning for the actual migration. This comes in two primary flavors: lift and shift or modernization.
Migration Approaches
Lift and Shift
With a “Lift and Shift” migration, an existing data warehouse (or a portion of it) is moved from one environment to another with minimal or no changes to the underlying architecture or functionality.
Advantages of Lift & Shift
- Simplicity: Fewer changes will happen at once, allowing those that are done to be done more quickly and efficiently. Validating the outcome of a Lift and Shift is simpler as well, as organizations can generally compare datasets 1:1 with the legacy source systems.
- Faster Time to Value: A Lift and Shift prioritizes showing value to the business sooner than would be possible if the migration also involved activities like optimizing or improving the existing data model. This can be critical to maintaining stakeholder buy-in and continued funding.
- Enable Future Optimizations (but don’t deliver…yet): Moving to a data cloud like Snowflake unlocks immense potential when it comes to your data stack. However, a Lift and Shift approach is simply putting your data into a position where it can take advantage of that, but not focusing on those optimizations…yet.
Disadvantages of Lift & Shift
- Deferred Optimization: One of the main issues with a Lift and Shift is the decision not to immediately optimize datasets and workloads to use the full potential of Snowflake’s features and capabilities. Features like scalable data warehouses, zero-copy cloning, and the ability to handle semi-structured data can significantly improve your data operations, but they generally take time to implement and require refactoring of legacy workloads and architecture.
- Cost Inefficiency: The Lift and Shift approach can lead to architectural decisions that do not take full advantage of the cost savings that an optimized cloud architecture could provide. For instance, you might end up storing redundant or unnecessary data or executing data workloads that could be optimized through scaling or right-sizing.
- Compatibility and Maintenance: By merely lifting and shifting data workloads, you run into instances when legacy code or column data types are incompatible with the new Snowflake environment, thus requiring refactoring or workarounds. Though these generally do not significantly affect the project, they do take additional time to resolve.
phData’s Thoughts
Through adopting a “Lift and Shift” approach, an organization can quickly transition its data warehousing infrastructure onto Snowflake without the need for extensive redesign or redevelopment. This allows the organization to move fast and start retrieving data from its new infrastructure, which could be critical for business buy-in.
However, this will often mean that certain ETL/ELT workloads and data products after the Lift and Shift are not modernized or improved. That is the tradeoff of this strategy, and it is very important to clearly define what success looks like with your stakeholders before committing to it.
Modernization and Optimization
With this migration strategy, an organization will make significant modifications to its existing data warehouse architecture, design, and code during the migration process.
The goal of these modifications is to better leverage the capabilities and benefits of the new environment while optimizing the new data warehouse for improved performance, scalability, and cost-effectiveness.
Advantages of Modernization/Optimization
- Future Proof: Optimizing your data architecture for the cloud ensures that it aligns better with best practices, modern SaaS tooling, and the ever-growing cloud ecosystem. As your business scales and new capabilities become available, your modernized data platform will be able to take advantage of them.
- Cost Optimization: Modernization ensures that your data workloads take advantage of the cloud-native scalability and cost efficiency that come with a cloud data warehouse like Snowflake.
- Performance, Security, and Reliability: By optimizing your architecture to leverage the full potential of modern data technologies, you enable your organization to harness the performance, security, and reliability capabilities offered by these tools.
Disadvantages of Modernization/Optimization
- Complexity: Refactoring or rewriting a significant portion of your legacy code often requires substantial time, resources, and expertise. Identifying areas for improvement and implementing changes can be complex, potentially disrupting your current data structures and processes.
- Changes to Business Logic: Modernization involves altering not only the platform but also the business logic of applications. Ensuring the new logic is consistent, accurate, and compatible with existing data sources and processes can be a daunting task for many organizations.
- Expertise: Optimizing/refactoring legacy workloads and data products for the cloud requires resources that have deep cloud and Modern Data Stack experience and expertise.
phData’s Thoughts
Through adopting a “Refactor” approach, an organization can modernize and optimize its data warehouse architecture to fully leverage the benefits of the new environment. Modernization improves scalability, performance, and cost efficiency while addressing the limitations and challenges that the previous system had.
However, spending the time to refactor can take significantly longer, depending on the complexity of the current system. Once again, the tradeoffs of this strategy should be evaluated, and success must be clearly defined with your stakeholders before committing to it.
phData’s Recommended Migration Approach
Both the “Lift and Shift” and “Modernize and Optimize” migration strategies have significant strengths and weaknesses. Both hinge on the tradeoff between optimization and time to value for your migration.
At phData, we recommend using a combination of the two.
Often, organizations need to perform a Lift and Shift operation to reduce costs associated with legacy systems. At phData, we prioritize automation to make this process extremely efficient and streamlined, providing immediate cost savings.
Following the Lift and Shift phase, we focus on Modernization and Optimization. This second phase entails leveraging modern best practices to optimize the current platform, further reducing the overall cost of your Modern Data Stack.
Our approach combines these steps in a highly iterative manner. This enables swift data migration and seamless workflow transitions, while simultaneously optimizing the new system. This continuous learning process allows us to apply gained insights to enhance future product delivery.
The agility of this methodology encourages incremental changes, significantly transforming our data product development practices. It helps us adapt quickly, making each phase of the project more effective than the last.
To ensure a smooth transition, phData recommends creating a detailed roadmap. This roadmap should outline each phase, provide an expected timeline, and define success criteria for measurement.
By following this comprehensive strategy, we can help your organization successfully transition to a modern, optimized data stack.
For a deeper look into selecting the migration approach that works best for your organization, check out our in-depth blog.
Modern Data Stack Migration Plan
Now that you have your migration approach in sight, it’s time to get into the nitty gritty of actually migrating to the Modern Data Stack.
In our experience helping hundreds of customers migrate to a Modern Data Stack, we’ve optimized the planning process. The scope of these activities may depend on the strategy you have chosen; however, implementing our strategy can be distilled into three core steps: Assess, Build, and Optimize/Scale.
Within these three core areas, there are several steps to take for your migration. For the remainder of this section, we’ll explore each of these steps at a high level.
Assess
- Identification: Identifying a high-value first target to be migrated helps with the overall success of your migration, while also determining the backlog order of the products to be migrated after it.
- Evaluation of Target Platform: Next, your organization should evaluate and select the target platforms for the data warehouse migration. This often includes selecting a new data warehouse like Snowflake, replication/ingestion tools, transformation tools, data activation tools, and governance tools.
- Assessment of Existing Data Warehouse: The existing data warehouse is assessed to understand its architecture, design, components, and dependencies. This evaluation helps identify any incompatibilities or challenges that may arise during the migration process.
- Mapping and Translation: Data structures, schema definitions, ETL/ELT processes, and other components of the existing data warehouse need to be mapped to the corresponding components and tools in the target platform. This mapping ensures that there is a smooth transition and compatibility between the two environments.
Build
- Data Migration: Data from the existing data warehouse is extracted and reshaped to align with the schema and structure of the new target platform. This often involves data conversion, data cleansing, and other data transformation activities to help ensure data integrity and quality during the migration.
- Code and Configuration Migration: Code, transformation pipelines, and orchestrations are migrated and rewritten to work with the new platform following best practices. This step involves a mix of adapting the current code to the syntax and requirements of the new target platform while preserving the existing logic, and refactoring the code to match best practices.
- Testing and Validation: Rigorous testing is performed to ensure that the migrated workload functions correctly on the target platform. This includes validating data accuracy, checking performance, and testing the ELT process to ensure the expected results are produced (a sketch of one such reconciliation check follows this list).
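As one example of the validation step above, a simple reconciliation query can compare row counts and a column checksum between the legacy data (staged in Snowflake) and the migrated table. The schema, table, and column names below are placeholders, and the columns fed to the checksum would be chosen per table.

```sql
-- Hypothetical reconciliation check run after migrating an orders table
select 'legacy' as system,
       count(*) as row_count,
       sum(hash(order_id, amount)) as checksum
from legacy_stage.public.orders
union all
select 'migrated',
       count(*),
       sum(hash(order_id, amount))
from analytics.core.orders;
```

The two result rows should match; a mismatch points to rows that were dropped, duplicated, or altered in flight.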
Optimize/Scale
- Performance Tuning and Optimization: Once the code is migrated, performance tuning and optimization activities are carried out to ensure the new pipelines take advantage of the specific features and capabilities of the target platform. This can involve fine-tuning query performance, optimizing data partitions, or implementing workload management techniques (a sample starting-point query follows below).
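As a sketch of where tuning might start, Snowflake’s ACCOUNT_USAGE.QUERY_HISTORY view can surface the longest-running queries and any spilling to local storage, both common candidates for warehouse right-sizing or query refactoring. The seven-day window and row limit below are arbitrary choices.

```sql
-- Find last week's slowest queries as candidates for tuning
select query_text,
       warehouse_name,
       total_elapsed_time / 1000 as elapsed_seconds,
       bytes_spilled_to_local_storage
from snowflake.account_usage.query_history
where start_time >= dateadd(day, -7, current_timestamp())
order by total_elapsed_time desc
limit 20;
```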
Once all of these activities are completed, the system is ready to be cut over for a particular data product. By capturing lessons learned along the way, the organization can improve delivery of the next data product to be implemented.
This allows the organization to upgrade its current data warehouse infrastructure to a more advanced and robust technology stack without completely redesigning its entire architecture.
Next up, let’s take a closer look at how phData helped a leading private mortgage insurance company move to the Modern Data Stack using the strategies outlined earlier.
While most of our engagements go extremely smoothly with minimal business disruption, we thought it would be more beneficial to share a case study that had a number of bumps.
We’ll explore what worked really well, what didn’t go so hot, and ultimately, what we learned.
Real World Case Study
The Challenge
This case study focuses on an organization that embarked on a data migration project to address various challenges in their existing data infrastructure. The organization operated a legacy on-premises data warehouse system that presented performance limitations, restricted access, data sprawl, and a lack of governance and transparency.
Additionally, their analytical datasets, which were essential for critical analysis such as risk modeling, were known to have various data issues. Recognizing the need for improvement and agility, the organization aimed to migrate to a new cloud-based data warehouse to enhance data quality and establish robust governance practices.
The organization set forth several goals for the project, which would initially span 16 weeks:
- Migrate 70 attributes from the legacy data warehouse to the new cloud-based solution to enable simple and efficient user access.
- Enhance data quality by rebuilding and documenting data transformations starting from the operational data sources.
- Address data ownership disputes by leveraging domain experts and implementing new governance practices.
The Solution
phData recognized the imperative to enhance both the technical and people aspects of the data platform. To accomplish this, we deployed a team including both business and data architecture professionals who undertook the crucial task of guiding our client through the planning and implementation of the new data platform. Because of this, our team discerned early on that the initial proposed scope lacked the refinement necessary to meet the desired business timeframe.
In response, we invested considerable effort in collaborating with subject matter experts (SMEs) and stakeholders within the organization. This collaborative effort aimed to redefine and architect a solution that aligned with the business requirements and remained technically feasible.
As part of this process, we conducted a comprehensive field-level analysis of the requested data attributes to determine the prioritization and sequencing of their delivery. This process included:
- A focus on data attributes from only a single domain to start, delivering a prioritized subset of the 70 requested attributes.
- The use of the domain’s operational source data as a starting point to populate the attributes.
- A focus on landing and building transformational jobs for this data while simultaneously continuing to work on data architecture and modeling for the additional attributes.
Once a strategic direction was finalized, the team successfully developed and deployed a data pipeline leveraging the existing tools employed by our client, as well as pioneering patterns for transforming and governing the resultant data products.
Throughout the project, we also maintained a steadfast commitment to involving and seeking frequent feedback from leadership to ensure our shared goals were being met.
Outcomes
In order to reconcile the conflicting demands of data quality and timeline, the project scope underwent multiple adjustments, as mentioned earlier. Consequently, the initial scope gradually underwent reductions, resulting in a scaled-down version of the proposed solution.
Over a span of 24 weeks, our team dedicated extensive efforts to successfully migrate 16 out of the 70 attributes. This accomplishment not only laid a solid technical foundation but also set the stage for the accelerated implementation of future attributes.
Throughout the implementation process, we encountered and overcame the many technical challenges associated with moving to a new cloud data platform. Challenges such as defining cloud data processing best practices, integrating data quality measures, and collecting metadata for the organization’s data cataloging tool were effectively addressed.
Additionally, we established robust processes for defining and maintaining data architecture, ensuring a solid framework for future endeavors. This achievement culminated in the development of a repeatable architecture that will streamline and simplify future implementations.
Although this project phase did not culminate in the comprehensive solution initially envisioned by the organization, it yielded valuable insights and paved the way for future endeavors.
The lessons learned, and the technical groundwork established during this phase will inform and optimize future implementations, enabling us to achieve the desired outcomes more efficiently and effectively.
Takeaways
Although we could effectively mitigate many of the challenges that were faced during this engagement, there were certainly things that we learned and could continue to improve on as we continue to work with this and other clients.
Some of the most significant for this project were:
A Clear and Transparent Strategy is Key
A well-defined strategy, supported by effective communication, is vital for project success. A lot of time was spent in this engagement trying to understand the priorities and objectives of the business. Having a strategy up front and understood by both the leadership and delivery teams would have accelerated this project significantly.
Goals Must be Realistic
Striking a balance between ambition and feasibility is crucial. This often means that the business will not get everything all at once, but setting achievable goals helps maintain quality standards and ensures that the value proposition is met.
Focus on Delivering Value One Iteration at a Time
Initiating the project with smaller, achievable milestones that provide tangible business value generates momentum and increases the likelihood of overall success. Thinking that all problems must be solved all at once significantly increases the risk of team failure.
Allowing for continuous evaluation and course correction based on real-time feedback will help the business remain agile when it comes to technology solutions.
Goals Should be Consistently Assessed and Re-Aligned
Ongoing collaboration and alignment between the leadership and delivery teams is essential. During our engagement, there were times when the delivery team would present a solution one way, while leadership had been expecting it to be interpreted in a different way.
Misunderstandings like these were due to the fact that the leadership and delivery teams were misaligned on the overall goals – which affected which tasks the team prioritized. Regular checkpoints with leadership that primarily focus on the goals of the project – not just the details of the current tasks that are being completed – will ensure that the project stays on track and deviations are promptly addressed.
By embracing these insights and applying them to future data migration projects, organizations can optimize their strategies, set attainable goals, prioritize value delivery, and foster effective collaboration, leading to successful outcomes.
Part 4: Building on the Modern Data Stack
Building on the Modern Data Stack will require some changes to the overall data culture. When transitioning from an on-premise or legacy data stack, there are three primary areas of change the organization should be prepared for: the delivery mindset, the overall team structure, and the approach to automation.
Changing your Delivery Mindset
Legacy data stacks required legacy delivery models. When transitioning to a Modern Data Stack, there should be a re-assessment of the key points that drive your delivery strategy and mindset, as well as an adaptation of how you work with your new tooling.
In this section, we’ll take a closer look at what this looks like.
Best Practices
Delivery best practices in a Modern Data Stack environment include but are not limited to:
- CI/CD: Making the latest versions widely available and deploying the latest stable code to production continuously.
- Testing: Data engineering should be treated as a form of software engineering. This means implementing unit, integration, and smoke test frameworks to ensure stable code and models are well established prior to deploying to a production environment (a minimal example follows this list).
- Paradigm Shifts: With new tools, there can often be changes in how or the way you solve problems. Additionally, the new tools bring an unprecedented level of agility and flexibility that any team and organization needs to be ready for.
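As a small example of treating data engineering like software engineering, the following singular dbt test (a hypothetical file under tests/) fails the build whenever its query returns rows, catching bad data before it reaches production. The model name carries over from the earlier dbt sketch and is an assumption.

```sql
-- tests/assert_no_negative_revenue.sql (hypothetical singular dbt test)
-- dbt treats any returned rows as a test failure
select order_date,
       daily_revenue
from {{ ref('fct_daily_revenue') }}
where daily_revenue < 0
```

Wired into CI/CD, `dbt build` runs models and tests together, so a failing assertion blocks the deployment.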
Incremental Development
- Deploy: It’s critical to deliver fast, learn fast, and improve fast. With a Modern Data Stack, you hold the power to scale up, down, in, and out. It’s typically a best practice to deploy your product to a smaller environment and monitor performance.
- Monitor/Learn: Use a short window to monitor access patterns, usage, and resource consumption. Document actual performance against SLAs and expected performance.
- Modify: If a case can be made to scale the platform (in any direction), modify the allocated resources and repeat the previous steps.
Building Data Products
When transitioning to a Modern Data Stack, you’re given an opportunity to focus on the development of data products that provide value to your internal and external customers. There is generally a prescribed, tried-and-true approach, much like the one software development teams have followed for many years.
- Identify Opportunities: Bring the business and data teams together to identify both the needs of the business and what can be resolved with the data available, all while prioritizing for the future.
- Build the Products: Focus on staged execution and validation that the product resolves a business need. This helps de-risk the creation of new products that end up going in the wrong direction from the needs of the organization.
- Evaluate & Iterate: Prioritize the speed of iteration while keeping an eye on the future potential of your new data products. The new product might be wrong, and rapid iteration could resolve the issues; however, it’s also possible that a fresh data product isn’t very usable initially since it requires more historical data to be usable.
Team Structure
When considering how to organize your company’s people resources for a migration, there are a few approaches to take: Centralized, Federated, or a Hybrid of the two. In this section, we’ll explore each team structure approach in more detail.
Centralized
In a centralized team structure, a single team takes charge of developing and maintaining the organization’s analytics products and underlying infrastructure. Having a single team responsible for the entire process is a simple model that allows them to have a comprehensive understanding of the entire workflow.
This enables streamlined decision-making and coordination, and a clear end-to-end ownership for any issues or concerns.
However, as the platform and the organization’s demands on it grow, reliance on a single team can become a potential bottleneck. The team may face challenges in scaling their resources and capabilities to meet the increasing needs.
Therefore, this centralized model is particularly effective when building a new data platform from scratch. It allows for one team to focus on iterating on the design without having to be slowed down by dependencies on external groups/teams.
Federated
In a federated team structure, the organization divides the ownership of different components of the data platform among multiple teams. Each team takes responsibility for specific data products or capabilities within the platform. The extent of each team’s ownership can vary based on the organization’s size and maturity.
Typically, this structure involves one or more platform teams focusing on the technical enablement of the data migration capabilities – data replication, transformation, validation, data apps, and governance.
This will involve implementing tools like Fivetran and potentially building abstractions on top of them to further accelerate data product development. Then domain teams leverage those tools and abstractions to develop data products and applications that adhere to the organization’s governance policies and contracts.
The federated approach focuses on enabling multiple specialized data teams to work with increased speed and agility. The tradeoff is that without proper management, the sprawl of responsibilities can end up making it difficult to track how and where data products are being created.
It is crucial to establish non-negotiable governance standards from the start. These standards should be integrated and automated so that teams cannot help but use them, striking a balance between delivering value quickly and maintaining the right level of quality.
A federated team structure is particularly suitable for larger organizations with an established data practice. The separation of duties not only helps enable a reduction in time to value but also allows teams to specialize in a particular area or capability.
It also aligns well with the concept of a “data mesh” where clear interfaces and common governance practices are enforced across the organization – while allowing teams the flexibility to develop their own process for creating data products.
Hybrid
The hybrid team structure combines elements of both the centralized and federated approaches to optimize the benefits of each. In this model, organizations centralize the common elements of the data platform while federating those that are specific to different business domains or projects.
For instance, actions that would likely produce results shared among all teams, like source data replication and core data model creation, could be placed under a single team in a centralized model. This prevents duplication of effort and simplifies the management of that portion of the data platform.
On the other hand, domain-specific models or data curated for specific use cases could be distributed among specialized teams in a federated model. This would bring the business experts to the data product creation process, which would enhance the value and quality of the output.
It would also enable faster delivery when compared to a centralized model, as teams would be able to work in parallel on their respective areas of expertise. The Modern Data Stack simplifies all of the approaches by removing the need for teams to build basic data processing infrastructure.
Automation
Automation and the Modern Data Stack go hand-in-hand. During the different phases of your data journey, you’ll find many opportunities to implement automation to optimize your productivity.
During the migration process, there may be value in automating the comparison of source and target data to ensure all source data has been successfully migrated.
Throughout the lifecycle of the data project/product, automating the development, testing, and deployment of data pipelines will increase productivity and reduce labor overhead while increasing data reliability and integrity.
At phData, we specialize in integrating automation into your data infrastructure, enabling you to devote your attention to maximizing business value.
How phData Can Help
Our mission is to help businesses like yours succeed with the Modern Data Stack.
phData is a team composed of highly skilled data scientists, consultants, engineers, and architects. Our focus lies in providing expertise in the Modern Data Stack, spanning strategy, enablement, and artificial intelligence. We assist in maximizing the value derived from your data, whether it involves simplifying connections or engaging in more comprehensive projects.
Our team, with its extensive experience working across diverse industries, is always ready to tackle new challenges. Whether you need advice, best practices, or guidance, we love leading businesses toward data modernization.
To get started, we recommend you explore a few of our most requested Modern Data Stack resources.
phData Toolkit
phData is an engineering-first organization, and like any good engineer would, we like to automate repetitive tasks. Over time, we’ve identified repeated activities across migrations, which led us to roll up our sleeves and build tools to make our lives and our customers’ lives easier. The result is the phData Toolkit, a set of tools created from our countless combined industry experiences to provide automation, agility, and flexibility.
- Provision Tool: Manage Snowflake resource lifecycles — creating, updating, and destroying — with a templated approach to provide a well-structured information architecture.
- Data Source Tool: A multipurpose tool that collects, compares, analyzes, and acts on data source metadata and profile metrics, enabling data diff analysis and code generation.
- SQL Translation: Instantly translate SQL from one dialect to another, eliminating a usually time-consuming, error-prone, and highly manual process.
- Advisor Tool: Quickly and easily identify opportunities to improve performance, security, and cost-efficiency of your Snowflake environment.
Here at phData, we like to let our tools and skills speak for themselves. With that, you can go download and run our Toolkit to verify it meets your needs and works as we say. Over on the phData Toolkit homepage, you’ll find guides, instructions, and examples to experiment with.
Conclusion
Envisioning and actualizing a streamlined, efficient, data-driven organization is no longer an insurmountable challenge. It’s a reality that businesses can reach through the Modern Data Stack.
This guide has hopefully highlighted the importance of transitioning to this new paradigm, showcased the critical technologies involved, and outlined strategies for successful migration and implementation. By leaving behind the limitations of legacy on-premise technologies and embracing the future-ready capabilities of the Modern Data Stack, your business can unlock a data-driven future and gain a competitive edge.
Moving to the Modern Data Stack often involves navigating complexities and challenges, but with phData at your side, you don’t have to face them alone. As an AI and data analytics consulting company, we have enabled hundreds of clients to harness the full potential of their data through the Modern Data Stack, learning from each unique journey.
We’re passionate about sharing these insights, and we’re committed to helping your organization achieve its data-driven aspirations. With our expertise and best practices, we can assist your business in not only transitioning to the Modern Data Stack but also thriving with it. The power of data is immense, and together, we can leverage it to drive your business towards unprecedented success.