An important part of the data science development cycle is determining which features should be included in a model. While it may be tempting to throw all of your available data into the model and hope your selected algorithm figures it out, there are a number of benefits to using feature selection techniques to whittle down the data going into your model.
- Reduced training time. Decreasing the number of features decreases the time and computational resources needed to train your model. Features that are redundant or non-informative slow down development without improving results.
- Decreased likelihood of overfitting. Giving your model fewer features decreases its complexity, and complex models are prone to overfitting and generalizing poorly.
- Increased model interpretability. In many business use cases, interpretability is a contributing factor in whether or not a model will be accepted and productionized. Fewer features make your models easier to explain.
Pearson’s Correlation
This is the most common form of correlation and the one you probably remember from an intro stats course. The Pearson correlation coefficient between two variables is always a value between -1 and 1, and the further the value is from 0, the stronger the relationship. A value of 1 signifies a perfect positive linear relationship (i.e., if one variable increases by a unit, the other increases by a corresponding known amount, and that amount does not change), -1 represents a perfect negative linear relationship (i.e., when one variable increases, the other decreases), and 0 represents no linear relationship.
Prior to using Pearson's, a number of assumptions should be verified. Both variables should be continuous, sometimes referred to as interval or ratio variables (all ratio variables are interval variables, but only certain interval variables are ratio variables). The two variables you are checking should both be normally distributed; think bell curve when you plot them. The relationship between the two variables should be linear, so if you plot them and see clear curves or a parabola, Pearson's is likely not the best choice. Finally, outliers need to be handled ahead of time, as they can greatly affect the coefficient that is returned. If you can check all the boxes above, you are good to use Pearson's and base your feature selection on it.
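To make this concrete, here is a minimal sketch of checking Pearson's correlation in Python with scipy and pandas. The DataFrame and its column names (sq_footage, price) are invented for the example, not data from this post.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic example: two continuous, roughly normal, linearly related variables.
rng = np.random.default_rng(42)
sq_footage = rng.normal(1500, 300, 500)
price = 200 * sq_footage + rng.normal(0, 20_000, 500)
df = pd.DataFrame({"sq_footage": sq_footage, "price": price})

# scipy returns the coefficient along with a p-value.
r, p_value = stats.pearsonr(df["sq_footage"], df["price"])
print(f"Pearson's r: {r:.3f} (p = {p_value:.3g})")

# pandas can compute the full pairwise correlation matrix in one call.
print(df.corr(method="pearson"))
```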
Kendall’s Rank Correlation
While Pearson's measures a linear relationship between two variables, Kendall's and Spearman's (covered later) both measure a monotonic relationship. A linear relationship means two variables move together at a constant rate, think straight line, while a monotonic relationship means the two variables tend to move in the same direction, but not necessarily at a constant rate. The upward curve formed by y=x^2 (assuming positive values for x) is strictly monotonic increasing because as x increases y also always increases, but the curved line is not linear because the rate at which y changes varies at different values of x. Like Pearson's correlation, Kendall's returns a value between -1 and 1, with -1 being a perfectly negative monotonic relationship (the variables are inversely related), 1 being a perfectly positive monotonic relationship, and 0 representing no relationship.
Kendall's is often used when data doesn't meet one of the requirements of Pearson's correlation. Kendall's is non-parametric, meaning that it does not require the two variables to follow a bell curve. Kendall's also does not require continuous data. Because it is based on the ranked values of each variable, it works with continuous data, but it can also be used with ordinal data. Ordinal data has a ranking, but the intervals between ranks are not necessarily consistent. Examples would be levels of education (high school, college, master's, Ph.D.) or a self-evaluation of your Python skills (beginner, intermediate, expert).
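As a quick illustration, the sketch below computes Kendall's tau on hypothetical ordinal survey data, with education level and self-rated Python skill encoded as ordered integers; the column names and values are made up for the example.

```python
import pandas as pd
from scipy import stats

# Ordinal data encoded as ordered integers:
# education: 0 = high school, 1 = college, 2 = master's, 3 = Ph.D.
# python_skill: 0 = beginner, 1 = intermediate, 2 = expert
df = pd.DataFrame({
    "education":    [0, 1, 1, 2, 2, 3, 3, 0, 1, 2],
    "python_skill": [0, 0, 1, 1, 2, 2, 2, 0, 1, 1],
})

tau, p_value = stats.kendalltau(df["education"], df["python_skill"])
print(f"Kendall's tau: {tau:.3f} (p = {p_value:.3g})")

# pandas applies the same method across every pair of columns.
print(df.corr(method="kendall"))
```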
Spearman’s Rank Correlation
Spearman's is very similar to Kendall's. It is a non-parametric test that measures a monotonic relationship using ranked data. While it can often be used interchangeably with Kendall's, Kendall's is more robust and generally the preferred method of the two. An advantage of Spearman's is that it is easier to calculate by hand, but in a data science context it is unlikely you'll be working anything out by hand, and both methods are computationally light relative to many other tasks you'll be performing.
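Here is a small sketch, on synthetic data, contrasting Spearman's and Pearson's for the monotonic-but-nonlinear y=x^2 relationship mentioned above; the numbers are illustrative only.

```python
import numpy as np
from scipy import stats

# Monotonic but nonlinear relationship: y = x^2 (plus noise) for positive x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = x ** 2 + rng.normal(0, 5, 500)

rho, p_spearman = stats.spearmanr(x, y)
r, p_pearson = stats.pearsonr(x, y)

# Spearman's rho should be close to 1; Pearson's r will typically be a bit
# lower because the relationship is monotonic but not perfectly linear.
print(f"Spearman's rho: {rho:.3f} (p = {p_spearman:.3g})")
print(f"Pearson's r:    {r:.3f} (p = {p_pearson:.3g})")
```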
The three correlation methods covered in this blog all have easy implementations in both Python and R. Consider taking your filter-based feature selection a step further by researching additional metrics like Chi-Squared, Fisher Score, and Mutual Information Score.
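As one possible way to put these coefficients to work, here is a sketch of a simple correlation-based filter. The function name, thresholds, and the assumption of an all-numeric DataFrame with a column named "target" are hypothetical choices for illustration, not a prescribed recipe.

```python
import pandas as pd

def correlation_filter(df: pd.DataFrame, target: str,
                       min_target_corr: float = 0.1,
                       max_feature_corr: float = 0.9,
                       method: str = "spearman") -> list[str]:
    """Keep features with at least a weak correlation to the target, then
    drop one feature from any pair that is highly correlated (redundant)."""
    corr = df.corr(method=method)

    # Step 1: screen out features with little relationship to the target.
    target_corr = corr[target].drop(target).abs()
    candidates = target_corr[target_corr >= min_target_corr]

    # Step 2: walk through candidates from strongest to weakest and keep a
    # feature only if it is not highly correlated with one already kept.
    selected = []
    for feature in candidates.sort_values(ascending=False).index:
        if all(abs(corr.loc[feature, kept]) < max_feature_corr for kept in selected):
            selected.append(feature)
    return selected

# Hypothetical usage: df is a numeric DataFrame containing a "target" column.
# selected_features = correlation_filter(df, target="target")
```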