The set of features used by a learning algorithm can have a dramatic impact on the performance of the algorithm. Including extraneous features can make the learning problem more difficult by adding useless, noisy dimensions that lead to over-fitting and increased computational complexity. Conversely, excluding useful features can deprive the model of important signals. The problem of feature selection is to find a subset of features that allows the learning algorithm to learn the “best” model in terms of measures such as accuracy or model simplicity.
The problem of feature selection continues to grow in both importance and difficulty as extremely high-dimensional datasets become the standard in real-world machine learning tasks. Scalability can become a problem for even simple approaches. For example, common feature selection approaches that evaluate each new feature by training a new model containing that feature require learning a linear number of models each time they add a new feature. This computational cost can add up quickly when we iteratively add many new features. Even those techniques that use relatively computationally inexpensive tests of a feature's value, such as mutual information, require at least linear time in the number of features being evaluated.
As a simple illustrative example, consider the task of classifying websites. In this case, the dataset could easily contain many millions of examples. Including very basic features such as text unigrams on the page or HTML tags could easily provide many thousands of potential features for the model.