It’s easy to overlook the amount of data that’s being generated every day — from your smartphone and your Zoom calls to your Wi-Fi-connected dishwasher.
It is estimated that the world will have created and stored 200 zettabytes of data by the year 2025. While storing this data is a challenge in itself, it’s significantly more complex to derive value from that amount of data.
From 2020 to 2022, total enterprise data volume is projected to grow from approximately one petabyte (PB) to 2.02 PB, an average annual growth rate of 42.2% over those two years.
You’re likely familiar with the term “Big Data” — and the scale of this market is continuously growing. The big data analytics market is set to reach $103 billion by 2023, while poor data quality costs the US economy up to $3.1 trillion yearly. Fortune 1000 companies can gain more than $65 million in additional net income simply by increasing their data accessibility by 10%.
This means it’s business-critical that companies can derive value from their data to better inform business decisions, protect their enterprise and their customers, and grow their business. In order to do this, businesses have to employ people with specific skill sets tailored to data governance and strategy, such as data engineers, data scientists, and machine learning engineers.
This comprehensive guide will cover all of the basics of data engineering including common roles, functions, and responsibilities. You’ll also walk away with a better understanding of the importance of data engineering and learn how to get started deriving more value from your data in 2022.
When it comes to adding value to data, there are many things you have to take into account — both inside and outside your company.
Your company likely generates data from internal systems or products, integrates with third-party applications and vendors, and has to provide data in a particular format for different users (internal and external) and use cases.
The data that is generated and collected from your business likely has compliance requirements such as SOC 2 or includes Personally Identifiable Information (PII) that you’re legally required to protect. When this is the case, security becomes the top priority around your data, which introduces additional technical challenges for data in transit and at rest. We continue to hear about big data breaches in the news, which can cripple your business and its reputation if it happens to you.
Not only must your data be secure, it must also be available to your end-users, performant to your business requirements, and have integrity (accuracy and consistency). If your data is secure but unusable, it cannot add value to your company. There are many aspects to a data governance strategy that require specialized skills.
This is where data engineering comes into play.
A data engineer is like a Swiss Army knife in the data space; data engineers take on many roles and responsibilities, typically reflecting one or more of the critical pieces of data engineering described above.
The role of a data engineer is going to vary depending on the particular needs of your organization.
It’s the role of a data engineer to store, extract, transform, load, aggregate, and validate data. This involves:
For example, an enterprise might use Amazon Web Services (AWS) as its cloud provider and need to store and query data from various systems. The best option will vary depending on whether your data is structured, semi-structured, or unstructured, normalized or denormalized, and whether you need it in a row-based or columnar format.
Is your data key/value-based? Are there complex relationships between the data? Does the data need to be processed or joined with other data sets?
All of these decisions impact how a data engineer will ingest, process, curate, and store data.
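To make that concrete, here is a minimal PySpark sketch (the dataset, bucket names, and columns are hypothetical) that writes the same data in a row-oriented format (JSON) and a columnar format (Parquet). For analytical queries that read a few columns across many rows, the columnar copy is usually the better fit.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-comparison").getOrCreate()

# Hypothetical semi-structured source landed in the lake.
events = spark.read.json("s3://example-bucket/raw/events/")

# Row-oriented output: simple to produce and to read record by record.
events.write.mode("overwrite").json("s3://example-bucket/curated/events_json/")

# Columnar output: better compression and column pruning for analytics.
events.write.mode("overwrite").parquet("s3://example-bucket/curated/events_parquet/")

# A query that touches only a couple of columns benefits from the columnar layout,
# because Spark can skip the columns it doesn't need.
(spark.read.parquet("s3://example-bucket/curated/events_parquet/")
      .groupBy("event_type")
      .count()
      .show())
```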
Instead of an abstract description, here’s a scenario: the CEO wants to know how much money your business could save by purchasing materials in bulk and distributing them to your various locations.
You need to be able to determine how to charge back any unused materials to different business units.
This likely requires you to aggregate data from your ERP system, your supply chain system, potentially third-party vendors, and data around your internal business structure. In years past, some companies may have tried to create this report within Excel, having multiple business analysts and engineers contribute to data extraction and manipulation.
Data engineers allow an organization to efficiently and effectively collect data from various sources, generally storing that data in a data lake or in several Kafka topics. Once the data has been collected from each system, a data engineer can determine how to optimally join the data sets.
With that in place, data engineers can build data pipelines to allow data to flow out of the source systems. The result of this data pipeline is then stored in a separate location — generally in a highly available format for various business intelligence tools to query.
Data engineers are also responsible for ensuring that these data pipelines have correct inputs and outputs. This frequently involves data reconciliation or additional data pipelines to validate against the source systems. Data engineers also have to ensure that data pipelines flow continuously and keep information up to date, utilizing various monitoring tools and site reliability engineering (SRE) practices.
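Here is a minimal sketch of that kind of batch pipeline in PySpark, under assumed inputs: ERP and supply chain extracts already landed in a data lake, joined on a hypothetical material_id key, with the curated result written back out where BI tools can query it.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("materials-spend").getOrCreate()

# Hypothetical source extracts already collected into the data lake.
erp = spark.read.parquet("s3://example-lake/raw/erp_purchases/")
supply = spark.read.parquet("s3://example-lake/raw/supply_chain_orders/")

# Join the datasets on an assumed shared key and aggregate spend per business unit.
spend_by_unit = (
    erp.join(supply, on="material_id", how="inner")
       .groupBy("business_unit")
       .agg(
           F.sum("purchase_amount").alias("total_spend"),
           F.sum("unused_quantity").alias("unused_units"),
       )
)

# Store the curated result where business intelligence tools can query it.
spend_by_unit.write.mode("overwrite").parquet("s3://example-lake/curated/spend_by_unit/")
```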
In a phrase, data engineers add value as they automate and optimize complex systems, transforming data into an accessible and usable business asset.
Data pipelines come in different flavors, and it’s the role of the data engineer to know which strategy to use and why.
The two most common strategies center on the same three operations, extraction, loading, and transformation of data (ELT or ETL), applied in different orders. Data always has to be extracted from a source first, but what should happen next is not as simple.
The ELT use case is commonly seen within data lake architectures or systems that need raw extracted data from multiple sources. This allows for various processes and systems to process data from the same extraction. If you are joining data from a variety of systems and sources, it’s beneficial to co-locate that data and store it in one place before performing transformations to the data.
PRO TIP: Generally speaking, an ELT-type workflow really is an ELT-L process, where the transformed data is then loaded into another location for consumption such as Snowflake, AWS Redshift, or Hadoop.
In contrast, an ETL (extract, transform, load) process puts the heavy compute involved with transformation before loading the result into a file system, database, or data warehouse. This style often isn’t as performant as an ELT process, because data for each batch or stream is often required from dependent or related systems. This means that on each execution, you would have to re-query data from the necessary systems, adding extra load to those systems and additional time waiting for the data to be available.
However, in cases where simple transformations are being applied to a single source of data, ETL can be more appropriate as it reduces the complexity of your system, potentially at the cost of data enablement.
The general recommendation is to use ELT processes when possible to increase data performance, availability, and enablement.
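The difference is mostly about where the transformation happens. The toy sketch below, with Python lists standing in for real source systems, lakes, and warehouses, contrasts the two orderings; none of the names correspond to a real tool.

```python
# Toy, self-contained contrast of ETL vs. ELT ordering.

def extract(source):
    # Stand-in for pulling records from a source system.
    return list(source)

def transform(records):
    # Stand-in for the heavy compute step (filtering, joining, aggregating).
    return [r for r in records if r["amount"] > 0]

def etl(source, warehouse):
    """ETL: transform in flight; only the finished result is persisted."""
    warehouse.extend(transform(extract(source)))

def elt(source, data_lake, warehouse):
    """ELT (really ELT-L): land the raw extract first, transform later,
    then load the result somewhere consumable."""
    data_lake.extend(extract(source))       # raw copy many consumers can reuse
    warehouse.extend(transform(data_lake))  # transformations run off the lake copy

source = [{"amount": 10}, {"amount": -3}]
lake, etl_wh, elt_wh = [], [], []
etl(source, etl_wh)
elt(source, lake, elt_wh)
print(etl_wh == elt_wh)  # True: same curated rows; ELT also keeps the raw copy in `lake`
```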
For a data engineer, it’s not enough for data to be correct and available. Data must also be performant. When processing gigabytes, terabytes, or even petabytes of data, processes and checks must be put in place to ensure that data meets service level agreements (SLAs) and adds value to the business as quickly as possible.
It’s also important to define what performance means with regard to your data. Data engineers need to take into account how frequently they’re receiving new data, how long their transformations take to run, and how long it takes to update the target destination of their data. Business units frequently want up-to-date information as soon as possible, and there are moving pieces and stops along the data’s journey that data engineers have to account for.
Imagine if your company was an airline, and you wanted to price tickets based on inputs from a variety of different systems. If your price is too high, customers will book with other airlines. If your price is too low, your profit margins take a hit.
Suddenly, there’s a blockage in the Suez Canal, and freighters hauling oil cannot make it out of Saudi Arabia, disrupting the global supply chain and driving the price of oil and gas up. Commercial airplanes use a lot of fuel, to the tune of almost 20 billion gallons a year. This is going to dramatically affect the cost to operate your business and should be reflected as fast as possible in your pricing.
In order for this to happen, data engineers have to design and implement data pipelines that are efficient and performant.
Code is never a “set it and forget it” type solution. Data governance requirements, tooling, best practices, security procedures, and business requirements are always quickly changing and adapting; your production environment should be as well.
This means that deployments need to be automated and verifiable. Older styles of software deployment frequently resulted in running a build, copying and pasting the result onto your production server, and performing a manual “smoke test” to see if the application was working as expected.
This does not scale and introduces risk to your business.
If you’re live testing on a production environment, any bugs or issues that you may have missed in testing (or any environment-specific influences on your code), will result in a poor customer experience since these bugs or errors will be presented to the end user. The best practice for promotion of code is to put automated processes in place that verify code works as expected in different scenarios. This is frequently done with unit and integration tests.
Unit tests verify that individual pieces of code, given a set of inputs, produce the expected outputs, independently of the other code that uses them. These tests add value by verifying complex logic within an individual piece of code and by providing proof that the code executes as expected.
Another level up from that is integration testing. This ensures that pieces of code work together and produce expected output(s) for a given set of inputs. This is often the more critical layer of testing, as it ensures that systems integrate as expected with each other.
By combining unit tests and integration tests with modern deployment strategies such as blue-green deployments, the probability that new code impacts your customers and your business is significantly reduced. Everything is validated against the established tests before changes are promoted to an environment.
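As a small illustration, here is what a unit test for a single transformation might look like, assuming pytest and a hypothetical dedupe_orders function. The integration-level equivalent would exercise the same logic against real (or containerized) source and target systems rather than in-memory inputs.

```python
# test_transformations.py -- a minimal, hypothetical unit test (run with pytest).

def dedupe_orders(orders):
    """Keep only the latest record per order_id (the 'unit' under test)."""
    latest = {}
    for order in orders:
        existing = latest.get(order["order_id"])
        if existing is None or order["updated_at"] > existing["updated_at"]:
            latest[order["order_id"]] = order
    return list(latest.values())

def test_dedupe_orders_keeps_latest_record():
    orders = [
        {"order_id": 1, "updated_at": "2022-01-01", "status": "created"},
        {"order_id": 1, "updated_at": "2022-01-02", "status": "shipped"},
        {"order_id": 2, "updated_at": "2022-01-01", "status": "created"},
    ]
    result = {o["order_id"]: o for o in dedupe_orders(orders)}
    assert result[1]["status"] == "shipped"  # latest version of the order wins
    assert len(result) == 2                  # nothing lost, nothing duplicated
```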
Many businesses focus on providing as much value to their customers as quickly as possible, but it’s also critical to ensure that you have a plan in the event of a system failure. While many companies rely heavily on cloud providers to minimize downtime and guarantee SLAs, failure will inevitably happen. This means that systems must be designed to tolerate a critical system failure.
Disaster recovery in data engineering generally centers on two metrics: the Recovery Point Objective (RPO), which measures how much data loss is acceptable, and the Recovery Time Objective (RTO), which measures how long systems can be unavailable before they must be restored.
In the event of a disaster recovery scenario, businesses need to have standards in place to understand the impact to their customers and how long their systems will be unavailable. Data engineers are responsible for putting processes in place to ensure that data pipelines, databases, and data warehouses meet these metrics.
Imagine if your company was an airline and you needed to provide customers with the ability to book flights, but suddenly, your data center explodes. Your business has established a data sync process to replicate data to another data center, but that process was interrupted and data loss has occurred. You need to re-establish your primary database in your application suite from the replicated database. The RPO represents how much data is lost in the cut-over, and the RTO represents how long customers are unable to book flights.
Data engineers frequently have to evaluate, design, and implement systems to minimize impact to customers in the event of failure.
A data governance strategy is essential to the success of your organization and its data. This is a very complex topic we’ve covered elsewhere, but at a high level, data governance is structured around the following:
In order for your data to provide value to your enterprise while minimizing risk and cost, you’ll need to define and enforce the answers to quite a few questions:
These are very complex questions that generally have complex answers, and require knowledge from different business and technology areas:
Data governance is focused more on data administration, whereas data engineering is focused on data execution. While data engineers are part of the overall data governance strategy, data governance encompasses much more than data collection and curation. It’s unlikely that your organization will have an effective data governance practice without data engineers to implement it.
For example, let’s take a look at some of our questions above, keeping data engineers and how they accomplish each task in mind.
In a data governance practice, rules and regulations define who should have access to particular pieces of information within your organization.
If you’re a shipping company, you may need to separate the data that suppliers and customers can see at any given time, or ensure that different suppliers can’t see information about other suppliers. This requires data classification, tagging, and access constraints.
If you’re gathering data from various systems, a data engineer is responsible for applying the classification and tagging rules upon collection. This might include adding additional data points to the collected data or storing data separately on disk. Then, when the data is aggregated or transformed, the end result must include this same information. When setting up access constraints to the data, the data engineer also has to enforce the required policies.
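A minimal sketch of what applying those rules on ingestion might look like in PySpark (the paths, column names, and classification labels are hypothetical): each record is tagged with a classification, and the output is physically separated by supplier so downstream access policies have something concrete to enforce against.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("classify-on-ingest").getOrCreate()

# Hypothetical shipment feed containing supplier-specific information.
shipments = spark.read.parquet("s3://example-lake/raw/shipments/")

tagged = (
    shipments
    # Tag each row so access policies can filter on classification downstream.
    .withColumn(
        "data_classification",
        F.when(F.col("contains_pii"), F.lit("restricted")).otherwise(F.lit("internal")),
    )
    .withColumn("ingested_at", F.current_timestamp())
)

# Physically separating data by supplier keeps per-supplier access rules simple.
(tagged.write.mode("overwrite")
       .partitionBy("supplier_id")
       .parquet("s3://example-lake/classified/shipments/"))
```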
To be considered compliant with the many regulations required of businesses, you must have the ability to track who has access to your data and the changes to that access. This also includes informing consumers of your data about changes to the data. If you’re a consumer of a set of data and it changes without your knowledge, systems are likely to break. This means it’s critical to be able to track who is and who should be consuming data.
While data governance practices determine what those rules should be, it’s the responsibility of data engineers to put those rules into place. This could mean setting up IAM rules in AWS or Microsoft Azure to ensure that certain roles are only able to read data from various sources and systems. It’s then the responsibility of the security team to validate that users only have access to the appropriate roles.
Data engineers are responsible for storing collected and transformed data in various locations depending on the business requirements. Each set of tooling and location will have different ways by which data is stored and accessed, and the data engineer must take into account the limitations, benefits, and use cases for each location and set of data.
Let’s say your business is ingesting a million records a day from a particular data source. If you’re storing this on disk, you can’t simply append to a single file (it would be like looking for a needle in the world’s largest haystack!). If you were trying to build a report or provide end users with a particular piece of information, you would never be able to find it.
Data engineers would:
Data governance and the rules around it might determine data access to those partitions and could have performance metrics required of that data. However, members of the data governance team wouldn’t have the skill set to establish those access roles or pull those metrics.
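In practice, those partitions often come from how the ingestion job writes the data in the first place. Here is a minimal PySpark sketch for the million-records-a-day scenario above (paths and dates are hypothetical): stamp each batch with its load date and write date-based partitions instead of appending to one ever-growing file.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-ingest").getOrCreate()

# Hypothetical daily feed of roughly a million records.
records = spark.read.json("s3://example-lake/landing/source_a/2022-06-01/")

(records
    .withColumn("load_date", F.lit("2022-06-01"))
    .write.mode("append")
    .partitionBy("load_date")          # one directory of files per day
    .parquet("s3://example-lake/raw/source_a/"))

# Reading a single day now touches only that partition's files.
one_day = (spark.read.parquet("s3://example-lake/raw/source_a/")
                .filter(F.col("load_date") == "2022-06-01"))
```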
If you were trying to find value from various data sets, where would you start?
For example: if you have data around customers and their orders, you might try to figure out what additional items you could sell to them based on other customer orders. If you could manage to correlate customers and their purchases, you’re likely able to upsell on future orders.
This might be simple if you have a small set of customers and orders. You could employ business analysts who are experts in your company and have worked with your customers for years to possibly infer what customers want.
But what if you had millions of customers and millions of transactions? What if you had external vendors providing you with additional information about your customers? What if your data is unstructured, and can’t be easily joined together with your other datasets? How do you know that particular pieces of information are actually correlated and make decisions off of data rather than gut feelings?
This is where data science comes into the picture. Data scientists are tasked with using scientific methods, processes, algorithms, and systems to extract valuable business insights from structured and unstructured data.
To understand what the results of a data scientist’s work looks like, we have to understand what a data model is.
Data modeling is the process by which data is defined, analyzed, and structured to produce a meaningful output. This generally means ingesting data from a variety of sources, structuring it into various entities and relationships, performing calculations against the data, and validating the output.
The goal of data modeling is to illustrate or calculate the connections between data points and structures.
Going back to our example of customers and transactions, the data model would show us how different customers and transactions relate to each other, so we can start to perform some statistical analysis on just how closely related they are. One potential output of this data model could be that customers who bought diapers are 80% more likely to also purchase hand sanitizer than customers who didn’t.
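Here is a toy version of the kind of calculation behind that statement, in pandas with made-up transactions: compare the hand-sanitizer purchase rate among customers who bought diapers against the rate among those who didn’t.

```python
import pandas as pd

# Made-up transactions: one row per (customer, product) purchase.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 4, 4, 5],
    "product": ["diapers", "hand sanitizer", "diapers", "hand sanitizer",
                "coffee", "diapers", "coffee", "hand sanitizer"],
})

# Collapse to one basket (set of products) per customer.
baskets = purchases.groupby("customer_id")["product"].apply(set)

diaper_buyers = baskets[baskets.apply(lambda b: "diapers" in b)]
others = baskets[baskets.apply(lambda b: "diapers" not in b)]

rate_with = diaper_buyers.apply(lambda b: "hand sanitizer" in b).mean()
rate_without = others.apply(lambda b: "hand sanitizer" in b).mean()

print(f"sanitizer rate, diaper buyers:     {rate_with:.0%}")
print(f"sanitizer rate, non-diaper buyers: {rate_without:.0%}")
print(f"relative lift: {rate_with / rate_without:.2f}x")
```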
There are also different types of data models:
Data scientists generally have strong mathematics, statistics, and programming backgrounds.
When working with Big Data, it’s impossible to try to determine value manually. Remember the needle in the haystack? Instead, data scientists have to work programmatically with data in order to validate theories and statistical models.
In our data model example, we were able to determine that customers who bought diapers are 80% more likely to also purchase hand sanitizer than customers who didn’t. While this is a simple and logical conclusion, oftentimes organizations have more complex relationships between their data and business value. It’s also likely that your organization has so much data that you don’t even know where to start.
Fortune 1000 companies can gain more than $65 million additional net income by increasing their data accessibility by 10%. This is why it’s critical for companies to have data scientists creating data models and performing analysis on data — making it accessible to business units. It’s very realistic that your enterprise could be cross-selling or up-selling services to customers more effectively, or that your enterprise could be saving money by using data models to predict the usage of resources.
While cross-selling and up-selling of services is a normal concept for most businesses who sell a product or service, predictive analysis adds a layer of business value that’s harder to conceptualize.
Let’s say you’re a shipping company, and you have been tasked by the CEO to maximize profits and minimize operational costs. This is the goal of every business, right?
You’d probably try to identify shipping lanes that are frequently used, and make sure you have trucks regularly making deliveries back and forth without sitting idle between shipments for too long. However, how do you determine how weather is going to affect driving conditions? How do you optimize routes in the event of a bridge collapsing? How do you know the ideal time to drive through each city without hitting large amounts of traffic?
This is another great example of where a data model and data scientists add tons of value. The data scientist is responsible for modeling out each data point that could affect the shipping lane, programmatically calculating the risks and effects of each, and calculating conclusions to inform the business on how to operate. With predictive analysis, your business is likely to find correlations between data that previously was thought to be useless or unlikely to affect different scenarios.
For data scientists to be able to model data effectively, data governance practices must be in place to ensure data quality and accuracy.
Data engineers are then responsible for enacting these policies and monitoring data quality and performance. Data engineers also feed the data sources that data scientists use for creating data models.
While data engineers can perform large-scale transformations and aggregations on data, there has to be an analysis to determine how data should be processed. The data engineer has to know how data is related and how it should be manipulated to create the desired result. In basic examples, a data engineer might be able to partner with the business to map this out, but in more complex systems further analysis is required from a data scientist.
In some cases, the data model may require a more complex algorithm and transformation process than a generalized data engineer might be able to handle. There may be complex mathematical equations and statistical analysis that must be taken from a prototype or small-scale example and productionized.
This is when you need to employ a machine learning engineer.
Machine learning engineers are at the intersection of data engineering and data science. These engineers often have a stronger mathematical background than a typical data engineer, but not to the degree that a data scientist does. These engineers can leverage data engineering tooling and frameworks in a big data ecosystem, apply data models created by data scientists to that data, and productionize the process of deploying these models. This is not a simple task.
Machine learning engineers need to be well versed in data structures and algorithms, both from a mathematical and computational perspective. For a data model to be productionized, data must be ingested into the model and computations run in a high-performance environment. This means potentially handling terabytes of real-time data to drive business decisions.
When data scientists work with data to prove out models, the work is typically done in languages such as Python or R, inside an analytical notebook such as Jupyter. The notebook usually runs against a cluster that translates queries for a big data engine like Spark.
While this approach minimizes the development effort and time required to derive value, it requires additional work to productionize. This includes:
While some of these skills overlap with that of a data engineer (ingestion of data, data quality checks, etc.), the responsibilities and skills required are significantly focused on a few areas of data engineering.
There’s not a simple answer to this question — but let’s go through some of the basics.
Data can be stored in many different file formats within file systems, and in different ways within databases and data warehouses. Each of these different formats is optimized for a particular use case, and data engineers are responsible for understanding the right tool for the job.
As an example, if you were storing data on disk in a data lake, there are a few common options for data formats:
These data formats are usually driven by a metastore that tracks where data is located in order to query the data. Depending on what tooling you’re using, the query syntax, access patterns, performance, and capabilities will be different. Common examples include:
Data can also be stored within streaming-based platforms that allow for highly distributed systems. This is often a pub/sub architecture that allows multiple consumers of data to receive updates from a publisher of data. Common examples include:
Once data has been stored, it will generally need to be processed to reach the desired state. This could involve pulling data from a variety of sources, joining that data together, performing aggregations on it, and then putting the result into a final location. There are a variety of compute options commonly used in data pipelines:
The output of these data pipelines will then generally be put back into a data lake, using the data formats and metastores mentioned above. In some instances, customers want to put this data into a database or data warehouse such as Snowflake or AWS Redshift. These tools allow for further performance tuning, data enablement, and integration with third-party tooling.
Many companies have on-premises systems and are migrating to cloud-based solutions such as Amazon Web Services (AWS) and Microsoft Azure. This requires a different set of skills, and engineers must be able to understand the differences in how these systems work.
When working with on-premises workloads, engineers generally aren’t as focused on execution time and memory usage until a process becomes a bad neighbor to other processes on the same server or cluster. Since the company pays for the hardware rather than on a consumption-based model, it’s easier to let processes run a little longer than to spend lots of time optimizing performance.
However, when working on a cloud-based platform, many solutions run on a consumption-based model that is tied to things like memory usage, execution time, and storage requirements. This can lead to significant costs when directly porting on-premises workloads to the cloud.
Data engineers need to have the ability to understand different pricing models and tailor solutions to fit. This means a basic understanding of vendor pricing strategies, the charges a company will incur, and how to implement solutions in both ecosystems.
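As a back-of-the-envelope illustration, the sketch below uses entirely made-up rates (not any vendor’s actual pricing) to show how run time and memory footprint drive cost under a consumption-based model.

```python
# Hypothetical consumption-based cost estimate; the rate and job profiles are made up.
RATE_PER_GB_SECOND = 0.00002  # assumed price per GB of memory per second of runtime

def monthly_compute_cost(runs_per_day, seconds_per_run, memory_gb):
    gb_seconds = runs_per_day * 30 * seconds_per_run * memory_gb
    return gb_seconds * RATE_PER_GB_SECOND

# A lift-and-shift of an on-premises job vs. a tuned version of the same job.
print(monthly_compute_cost(runs_per_day=24, seconds_per_run=3600, memory_gb=64))  # ~3318 per month
print(monthly_compute_cost(runs_per_day=24, seconds_per_run=900, memory_gb=16))   # ~207 per month
```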
For many data engineers, the process of transforming data into data marts and curated datasets isn’t as simple as joining a couple of datasets. In many instances, aggregations need to be performed against the source data to calculate statistical values such as mean, standard deviation, and variance.
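A minimal PySpark sketch of that kind of aggregation (the dataset path and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("summary-stats").getOrCreate()

orders = spark.read.parquet("s3://example-lake/curated/orders/")  # hypothetical dataset

# Push the statistics down into the engine instead of computing them by hand.
stats = orders.groupBy("business_unit").agg(
    F.mean("order_amount").alias("mean_amount"),
    F.stddev("order_amount").alias("stddev_amount"),
    F.variance("order_amount").alias("variance_amount"),
)
stats.show()
```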
Mathematics is also important when considering various data structures to store data or algorithms to process data. It’s critical to have an understanding of the performance implications of storing data in a particular structure or performing certain algorithms against a given dataset.
Say your data is stored and partitioned by the date it was loaded, but you need to join that data on a business key. For a data engineer, this should raise a big red flag.
By having an understanding of data structures and algorithms, the engineer would understand that they will have to do a full table scan on the data, reading every single partition and file just to perform that action. This may be okay for small datasets, but certainly isn’t feasible when you’re in the Big Data ecosystem.
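One quick way to see the problem is to compare the query plans for a partition-aligned filter and a business-key filter. A hedged PySpark sketch, reusing the hypothetical date-partitioned layout from the earlier example:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prune-vs-scan").getOrCreate()

# Data laid out as .../load_date=YYYY-MM-DD/ partitions.
events = spark.read.parquet("s3://example-lake/raw/source_a/")

# Filtering on the partition column lets Spark prune down to a handful of files.
events.filter(F.col("load_date") == "2022-06-01").explain()

# Filtering (or joining) on a business key the layout knows nothing about
# forces a read of every partition and file.
events.filter(F.col("customer_id") == "C-12345").explain()
```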
Even if your data ingestion and curation is 100% optimized and highly performant, it won’t matter if the data is incorrect. A data engineer has to be able to understand what the end result should be, and the practices and tooling that enables validation of data.
Data engineers can utilize tools such as Deequ and Great Expectations, which provide frameworks and tooling for data quality validation and anomaly detection. Tests must be written against the data to ensure it is as expected, and the data must be monitored for unexpected variance.
A skilled data engineer is able to profile, monitor, and alert when data falls outside of acceptable ranges and parameters.
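Frameworks like the two above formalize this, but the underlying idea can be sketched in a few lines of plain Python (the dataset, thresholds, and column names are made up):

```python
import pandas as pd

# Hypothetical curated dataset and acceptance criteria.
orders = pd.read_parquet("orders.parquet")

checks = {
    "no_null_order_ids": orders["order_id"].notna().all(),
    "amounts_in_range": orders["order_amount"].between(0, 100_000).all(),
    "row_count_sane": len(orders) > 10_000,  # made-up daily minimum
}

failures = [name for name, passed in checks.items() if not passed]
if failures:
    # In production this would alert on-call staff or halt the downstream pipeline.
    raise ValueError(f"Data quality checks failed: {failures}")
```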
Knowledge is power — and it couldn’t be truer in today’s society. Large companies are creating, ingesting, and processing more data than ever before.
Data is a critical component of knowledge, and as we have demonstrated through various examples, the process of turning data into knowledge can be very complex. There are different levels of data processing and analysis, and there may be cases in your organization where experience in a specific business practice or field gives an individual a level of knowledge that the data can back up. However, the amount of knowledge that Big Data can generate about your business, and its impact on your business, is often overlooked (and overwhelming).
Throughout this article, we have talked about data engineers, data scientists, and machine learning engineers, and how each of these has a specific place within the big data ecosystem. These experts are usually very expensive and experienced resources for an organization to employ, creating a barrier to entry that can be hard to cross.
However, there has never been a more critical time to invest in these resources.
Let’s take a look at some examples of what these practices have allowed companies to do.
Large retailers such as Amazon and airlines frequently use dynamic pricing for their goods. This allows for the most up-to-date pricing based on data models that are created by data scientists, implemented by machine learning engineers, and fed by data engineers.
You’ve likely checked airline prices regularly to try and snag a good deal or checked Amazon to see if a particular item you’re interested in is on sale or at a better price than competitors. What you probably didn’t know is that Amazon updates its prices up to 2,500,000 times a day. This is fed by a data model built by Amazon to maximize profits and stay competitive within a massive e-commerce marketplace. This is how the company earns 35 percent of its annual sales.
Another example of dynamic pricing is Marriott hotels. As one of the biggest hotel chains in the world, they have over 6,500 hotels globally, and rates are impacted by a huge variety of factors. To competitively price their hotel rooms, they would have to employ hundreds to thousands of analysts to check things such as the local and global economic situation, weather, availability and reservation behavior, cancellations, etc. This isn’t feasible at scale. Instead, they use dynamic pricing built off of data models, resulting in a five percent revenue increase per room.
In a global economy, it’s important to understand that marketing is not a one-size-fits-all dynamic. Successful marketing and advertising campaigns are going to look different in the United States when compared to China. Even within a particular country, there may be areas of the country that have different beliefs, weather patterns, and preferences.
To boost sales, it’s common in marketing to run a campaign that is targeted to a specific audience. A great example of this is Airbnb, which in 2014 wanted to tailor the search experience demographically and geographically. The company noticed that users in certain Asian countries typically had a high bounce rate when visiting the homepage. Analyzing the data further, they discovered that users would click the “Neighborhood” link, start browsing photos, and then never come back to book a place.
To resolve this, the company created a redesigned version for users in those countries, replacing neighborhood links with top travel destinations. This led to a 10% increase in conversion.
Another great example is Coca-Cola, which in 2017 revealed that the flavor Cherry Sprite was inspired by data collected from self-service drink fountains where customers mix their own drinks. These machines were set up to track what flavors customers were mixing in different areas of the world. The company then aggregated the variations of drink combinations and turned the result into a purchasable item.
One last example.
It’s very common for modern hardware to contain all kinds of sensors and mechanisms to track how something is functioning. One phData customer ingests streaming data from heavy equipment outfitted with a variety of sensors. The organization wanted to track these machines, not only to predict when they would need maintenance, but also to help its customers operate their products more efficiently.
The organization quickly realized that they were spending a significant amount of time administering and managing the system, rather than performing analytics on the ingested data.
Enter the data engineer.
A group of phData data engineers quickly determined that the customer needed to migrate to cloud-native infrastructure to off-load maintenance responsibilities. The engineers also re-architected the customer’s existing system for high scalability and high availability, ultimately making the data available in a data warehouse (Snowflake) that Business Intelligence tools could easily query.
This saved the customer millions of dollars a year between administrative/management costs and the additional value they were enabled to bring to their customers.
Our sincere hope is that you’ll walk away from this guide with a much better understanding of what a data engineer does and how they can help your organization make better decisions with data.
If you’re interested in getting more value from your data, consider working with the data engineering team at phData. Our team of seasoned data experts can help you design, build, and operationalize your modern data product.