When we want to fit a Machine Learning (ML) model to a big dataset, it is often recommended to carefully pre-process the input data in order to obtain better results. Although it is widely accepted that more data lead to better results, this is not necessarily true when referred to the number of variables of our data. Some variables may be noisy, redundant and not useful.
The process of removing useless variables is known as Feature Selection (FS), it is a very common task that precedes the application of any data analysis technique. In this first chapter of a three-part series, we will define the FS problem, review existing approaches with an emphasis on wrapper methods, and define the concept of metaheuristic algorithms and their relation with FS. In part 2 (coming soon!) we will present a taxonomy of metaheuristics and deep into genetic algorithms.
More formally, Feature Selection is the process of selecting a subset of the features of a dataset that is optimal according to some criterion. It has two main aims:
It improves the model obtained by a ML algorithm, because it reduces overfitting (less features lead to a simpler model which is less prone to overfitting and easier to understand and interpret), and also reduces the need of many examples when there are many variables (alleviating the curse of dimensionality). Of course, Big Data technologies cannot help if we do not have enough examples, as it is usually the case with microarray data -many genes, few patients- or medical imaging, just to cite a few. But it does help in collecting and processing lots of variables from heterogeneous sources. Then we can use an FS algorithm to decide which of them are actually relevant and discard the rest.
It reduces the size of the dataset. A reduced storage space can enable running a non-distributed algorithm on our data. Maybe the algorithm is hard to parallelize and is readily available off-the-shelf in some non-distributed analytic environment. Furthermore, ML algorithms run faster on smaller datasets, thus reducing training time.
Taxonomy of FS methods
There are lots of FS methods out there, as well as many good books on this topic. Roughly speaking, they can be classified as follows:
Filter methods: they are based on data-related measures, and are independent of the algorithm we want to run later on the reduced dataset. Basically they detect useless features according to mathematical or statistical criteria, or in terms of metrics like the amount of additional information each feature provides. They are generally faster than any other method and have good generalization ability. Some examples: the Chi-squared statistic applied to FS, the information gain metric, the correlation with the target variable and uncorrelation between features…
Embedded methods: feature selection is done at the same time as fitting our ML model. The output of this process is usually the model together with the best feature subset (for that specific model). A typical example are regularization methods, which modify the objective function used to build the model to penalize models with many variables. By acting on the objective function, regularization methods also have a direct influence on the learning output (the model itself). Forward/backwards feature selection, in which successively model comparison is done, is another example. As the name says, embedded methods are highly coupled to the specific learning model we are building. On the other hand, they are generally faster than wrapper methods.
Wrapper methods: they perform a search in the space of feature combinations, guided by some ML model to evaluate candidate subsets of features. The model can be different from the one we may eventually build later on the final reduced dataset.
They are the most time consuming, because many candidate subsets evaluated via an ML algorithm (to make things clearer: many does not mean every possible feature combination). Wrapper methods commonly require an intelligent search (heuristic optimization) procedure to find the best feature combination. Among the battery of metaheuristics available to solve this problem, evolutionary algorithms have proven a very good choice due to how easy to implement they are. The CHC algorithm, which constitutes the focus of this series, is an example of a wrapper method that has been successfully applied to FS. We will dive deeper into the details of CHC and a distributed implementation of it in the next chapter.
Metaheuristics for optimization
Imagine a candidate feature subset, represented as a vector of length N (where N is the number of features of the input dataset). Something like this:
s = 1 1 0 0 1 0
In this binary vector, si= 1 when the i-th feature is selected and 0 otherwise. As can be seen, finding the best (according to some criterion) feature subset is a combinatorial optimization problem. There are 2N distinct feature subsets, hence evaluating (by means of some classification algorithm) all possibilities is computationally unfeasible when N becomes moderately high (beyond 15-20). The problem consists in finding the best feature subset by exploring the space of combinations in an intelligent way that yields the best solution much faster than a brute force exploration.
Metaheuristics constitute a broad class of search algorithms that are able to deal with huge search spaces in optimization problems. They explore the solution space to find good solutions within a reasonable time, although finding the global optimum is not guaranteed. However, they perform very well in practice, and are often the only way to go in situations in which the search space is too large for a classical method, or the function being optimized is non-differentiable, or does not have an analytical expression at all (for instance, the magnitude being optimized is the result of a randomized simulation under a parameter set that constitutes a candidate solution).
In the next chapter, we will present a taxonomy of metaheuristics, and develop those inspired by Darwin’s theory of the evolution of species (widely known as evolutionary computation), which include CHC. Finally we will explain how CHC can be parallelized with Spark.
Thanks, I loved the article and I was left wanting to read the second part. Pablo, when you express that the array of features can be expressed as an array of ones and zeros, what meaning does one have or one zero? that is, the semantic meaning of the feature? how do we know that a one in a certain position of the array means one thing and not another? Another thing, when you say that you can handle an array of 2 exponential 15 or 20, you mean, that a single commodity-type machine can handle an array of 1,048,576 ones or consecutive zeros in the selection array? What kind of machine are we talking about?