Machine learning (ML) is only possible because of all the data we collect. However, with data coming from so many different sources, it doesn’t always come in a format that’s easy for ML models to understand. Before you can take advantage of everything ML offers, much prep work is involved.
In this blog, we’ll explain why you should prepare your data before use in machine learning, how to clean and preprocess the data, and a few tips and tricks about data preparation.
Why Prepare Data for Machine Learning Models?
As the saying goes: “Garbage in, garbage out.” If you’re not cleaning and preparing data for an ML model, you might as well not have a model at all. The quality and structure of the data directly impact the model’s performance and accuracy. With messy, inconsistent data, the model may learn patterns that aren’t there, leading to inaccurate predictions and misleading results.
Careful data preparation is essential to getting value from machine learning predictions, so let’s review the steps that help you get the most out of your model.
How to Prepare Data for Use in Machine Learning Models
Data Collection
The first step is to collect all the data you believe the model will need and ingest it into a centralized location, such as a data warehouse. Centralizing the data will be very helpful with the rest of the process, as it will probably come from many different sources, such as databases, spreadsheets, streams, or web scraping, and will be in various states of cleanliness.
As you collect the data, be mindful of certain biases in the data collection process, such as:
Sampling Bias occurs when the sample you take isn’t representative of the entire population you’re interested in, such as when you survey only people who are active on social media.
Selection Bias occurs when the collection method itself skews who ends up in the data, such as offering an incentive that attracts only certain kinds of survey respondents.
Systemic Bias is the underrepresentation of certain groups of people in the data due to their small percentage of the population or even a deeply rooted discrimination against that group within the organization from which the data is coming.
Data Cleaning
Once we have all our data in one place, it’s time to clean it up and make it fit together. Missing values, typos, and inconsistencies can all throw off your model, so they must be thoroughly addressed. Here’s a breakdown of some of the key tasks in creating a clean dataset for ML use:
Handle Missing Values
Analyze the patterns of missing values, then decide whether to fill them in with a statistical method (such as the mean or median, or a forward/backward fill) or to remove the affected rows altogether.
Address Inconsistent Formatting
Standardize date formats, convert currencies to one unit, and fix typos or capitalization inconsistencies.
Remove Duplicates
Handle Outliers
Outliers are data points significantly different from the rest of the data and can drastically change how your model performs.
To handle outliers, you can either Winsorize them by capping them to a specific range or remove them altogether.
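As a rough illustration of the cleaning tasks above, here is a minimal Python sketch using only the standard library. The function names, the sample distances, and the capping range of 0 to 100 are all hypothetical choices made for this example; in practice the fill strategy and the Winsorizing limits should come from your own analysis of the data.

```python
from statistics import median

def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def winsorize(values, lower, upper):
    """Cap every value to the closed range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical data: one missing distance and one extreme value
distances = [12.0, None, 15.0, 11.0, 950.0]
filled = fill_missing_with_median(distances)
capped = winsorize(filled, lower=0.0, upper=100.0)

# Hypothetical (id, color) rows containing an exact duplicate
rows = drop_duplicates([(1, "red"), (2, "green"), (1, "red")])
```

The median is used for the fill here because it is less sensitive to the extreme value than the mean would be.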
Data Preprocessing
Even though you now have clean, consistent data, it’s still not ready to train your model. We need to format it to be suitable for machine learning algorithms. Failure to perform this step will significantly impact your model’s performance.
Normalization/Standardization
Unless you’re looking at only one type of data (which, when looking at model training data, is probably not the case), you will have different scales for your numerical values. Say you have one feature with a scale of 1-10 and another with a scale of -4000 to 1 million, or one feature is measured in pounds while another is measured in kilometers. This will confuse the algorithm, potentially putting more weight on the feature with a larger range and giving you bad results depending on the type of model you’re using.
One of the more popular methods of alleviating this problem is standardizing your data using a z-score. This is done by subtracting the mean from the original value and dividing by the standard deviation for each feature.
Once done, each feature will have a mean of 0 and a standard deviation of 1, so they will all be on the same scale.
If your data is already in a modern data warehouse such as Snowflake AI Data Cloud, standardization of your features can be done with some simple SQL:
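-- OVER () computes the mean and standard deviation across all rows
-- while still returning one z-score per input row
SELECT
(DISTANCE - AVG(DISTANCE) OVER ()) / STDDEV(DISTANCE) OVER () AS DISTANCE_Z_SCORE
FROM PHDATA_CLEANED_DATA;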
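Normalization (min-max scaling) is another method; it rescales each feature to the range 0 to 1 by subtracting the minimum and dividing by the difference between the maximum and minimum. It is useful when the distribution of the data is not normal or is unknown. However, because the minimum and maximum define the scale, it is more susceptible to outliers than standardization.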
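For readers working outside the warehouse, the two rescaling methods can be sketched in plain Python. This is only an illustration: the function names and the sample values are assumptions, and the sketch uses the sample standard deviation from the standard library’s statistics module.

```python
from statistics import mean, stdev

def standardize(values):
    """Z-score: subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation; zero for a constant feature
    return [(v - mu) / sigma for v in values]

def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature values
distances = [2.0, 4.0, 6.0, 8.0]
z_scores = standardize(distances)
scaled = min_max_normalize(distances)
```

Both functions divide by zero when every value in the feature is identical, so a constant feature needs to be handled (or dropped) before scaling.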
Encoding Categorical Variables
A categorical variable is data that falls into distinct categories such as colors (red, yellow, green, etc.) or types of customers (new, loyal, churned). Again, since ML algorithms work better with numerical data, these can be encoded into numbers for the model to use more easily.
Two common ways of encoding include:
Label Encoding
Simply assign a number to each category. This can be an efficient approach, but it may introduce an ordinal bias when the categories have no natural order, because the model can read the numbers as a ranking (as it would for 1=low priority, 2=medium priority, 3=high priority).
One-Hot Encoding
Create a new binary feature for each category. This avoids any ordinal bias, but it can create a large number of new features, which hurts computational efficiency.
Simple statements in SQL can create One-Hot Encoding features:
SELECT
ID,
CASE WHEN COLOR = 'GREEN' THEN 1 ELSE 0 END AS GREEN,
CASE WHEN COLOR = 'YELLOW' THEN 1 ELSE 0 END AS YELLOW,
CASE WHEN COLOR = 'RED' THEN 1 ELSE 0 END AS RED
...
FROM PHDATA_CLEANED_DATA;
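The two encoding approaches can also be sketched in Python. The function names and the sample color list below are hypothetical, and the choice to number categories in sorted order is just one reasonable convention for this example.

```python
def label_encode(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Return one dict of 0/1 indicator features per input value."""
    categories = sorted(set(values))
    return [{cat: int(v == cat) for cat in categories} for v in values]

# Hypothetical categorical column
colors = ["GREEN", "RED", "YELLOW", "GREEN"]
codes, mapping = label_encode(colors)
one_hot = one_hot_encode(colors)
```

Note that the label-encoded column is a single feature, while the one-hot version adds one feature per distinct category, which is where the efficiency trade-off comes from.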
Feature Creation
The next piece of the puzzle is to create new features from existing ones to enhance the model’s performance. Feature creation allows you to derive features that might be more informative than the originals.
For instance, if you were trying to predict housing prices, you may already have the number of bedrooms and total square footage as features, but you could create average square footage per bedroom as a new feature for the algorithm to use.
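A minimal Python sketch of that housing example follows. The record layout and the field names (bedrooms, square_feet, sqft_per_bedroom) are hypothetical, and treating a home with zero bedrooms as a missing value is an assumption made for this example rather than a rule.

```python
def add_sqft_per_bedroom(house):
    """Return a copy of the record with the derived feature added."""
    bedrooms = house["bedrooms"]
    # Assumption: a home with zero bedrooms (e.g., a studio) gets no ratio
    ratio = house["square_feet"] / bedrooms if bedrooms else None
    return {**house, "sqft_per_bedroom": ratio}

# Hypothetical housing records
houses = [
    {"bedrooms": 3, "square_feet": 1800},
    {"bedrooms": 0, "square_feet": 450},
]
enriched = [add_sqft_per_bedroom(h) for h in houses]
```

Returning a copy instead of modifying the record in place keeps the original features untouched, which makes it easier to compare model performance with and without the new feature.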
Here are some tips for creating effective features:
Begin with basic feature-creation techniques based on domain knowledge
Feature creation is often an iterative process, so try different combinations and evaluate their effectiveness
Keep track of the features you create and the rationale behind them to aid in reproducibility and future improvements
Splitting the Data into Sets
The last step in the process is to split the data into sets: training, testing, and (optionally) validation. The model learns patterns and relationships from the training set. If you use one, the validation set guides hyperparameter tuning during training. Finally, the test set evaluates how well the model generalizes to unseen data, which reflects how it will perform in real-world scenarios.
Ensuring your data is split correctly will allow your model to generalize well to new data. Some considerations when splitting your data:
The most common splitting ratio is 80:10:10 (or 80:20 if not using the validation set), but it can vary depending on the size of your dataset. For instance, with smaller datasets, you may allocate more to training and less to testing.
Before splitting, randomize your data to ensure a good distribution of samples across the sets, unless the data is a time series.
If your data is a time series, do not shuffle it. Split it chronologically so that the training set comes before the validation and test sets, which avoids leaking future information into the training process.
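The splitting considerations above can be sketched in Python as follows. The function names, the fixed seed, and the stand-in data are assumptions for illustration; the 80:10:10 default mirrors the ratio mentioned above, and the chronological variant assumes the rows are already sorted from oldest to newest.

```python
import random

def _cut(rows, train, validation):
    """Slice rows into (train, validation, test) by the given fractions."""
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * validation)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

def split_random(rows, train=0.8, validation=0.1, seed=42):
    """Shuffle a copy of the rows with a fixed seed, then split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return _cut(shuffled, train, validation)

def split_chronological(rows, train=0.8, validation=0.1):
    """Split without shuffling; rows must already be in time order."""
    return _cut(list(rows), train, validation)

# Hypothetical dataset of 100 records, represented by their indices
records = list(range(100))
train_set, val_set, test_set = split_random(records)
ts_train, ts_val, ts_test = split_chronological(records)
```

Fixing the seed makes the random split reproducible, so the same rows land in the same sets each time the preparation is rerun.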
Tips and Tricks for Machine Learning Data Preparation
More Data is Not Always Better
Collecting more data will not automatically improve your ML model. It may hurt it by adding in irrelevant, noisy data. You should always focus on quality over quantity when it comes to your data, and it’ll have the added benefit of saving your company data storage costs.
Validate Your Data Throughout the Process
Constantly validate your data throughout the preparation process. The last thing you want to do is make decisions for your company using an untrustworthy model because you added two features for one category or calculated the z-score incorrectly.
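Checks for the two mistakes mentioned above can be automated. The following Python sketch is one possible approach, not a complete validation suite; the function names, the tolerance, and the sample values are all hypothetical.

```python
from statistics import mean, stdev

def check_standardized(z_scores, tolerance=1e-6):
    """A correctly standardized feature has mean 0 and standard deviation 1."""
    return (abs(mean(z_scores)) < tolerance
            and abs(stdev(z_scores) - 1.0) < tolerance)

def check_one_hot(rows):
    """Every row should have exactly one category flagged with a 1."""
    return all(sum(row.values()) == 1 for row in rows)

# Hypothetical values: one correct and one incorrect case for each check
good_z = [-1.0, 0.0, 1.0]
bad_z = [0.5, 1.5, 2.5]
good_rows = [{"GREEN": 1, "RED": 0}, {"GREEN": 0, "RED": 1}]
bad_rows = [{"GREEN": 1, "RED": 1}]

results = (
    check_standardized(good_z),
    check_standardized(bad_z),
    check_one_hot(good_rows),
    check_one_hot(bad_rows),
)
```

Running checks like these after each preparation step makes it easier to trace a problem back to the step that introduced it.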
Explore Data Visually
Visualization can be a powerful tool for quickly identifying outliers, spotting potential biases, and understanding your data’s distribution. Building histograms, scatter plots, or boxplots with your data can help immensely.
Closing
Preparing your data before using it in ML algorithms can be time-consuming, but it’s worth it. Efficiently creating trustworthy results will help you make informed business decisions to drive profits, increase efficiency, or save money. As long as data continues to be collected from more and more sources, data preparation will be needed to ensure it’s ready for use in new ML models.
phData can make this process easier for you.
Our team of experts can help with data cleaning, transformation, and integration, ensuring you get the most out of your data.