Skip to main content Accessibility help
  • Get access
    Check if you have access via personal or institutional login
  • Cited by 3
  • Print publication year: 2011
  • Online publication date: February 2012

17 - Parallel Large-Scale Feature Selection

from Part Three - Alternative Learning Settings


The set of features used by a learning algorithm can have a dramatic impact on the performance of the algorithm. Including extraneous features can make the learning problem more difficult by adding useless, noisy dimensions that lead to over-fitting and increased computational complexity. Conversely, excluding useful features can deprive the model of important signals. The problem of feature selection is to find a subset of features that allows the learning algorithm to learn the “best” model in terms of measures such as accuracy or model simplicity.

The problem of feature selection continues to grow in both importance and difficulty as extremely high-dimensional datasets become the standard in real-world machine learning tasks. Scalability can become a problem for even simple approaches. For example, common feature selection approaches that evaluate each new feature by training a new model containing that feature require learning a linear number of models each time they add a new feature. This computational cost can add up quickly when we iteratively add many new features. Even those techniques that use relatively computationally inexpensive tests of a feature's value, such as mutual information, require at least linear time in the number of features being evaluated.

As a simple illustrative example, consider the task of classifying websites. In this case, the dataset could easily contain many millions of examples. Including very basic features such as text unigrams on the page or HTML tags could easily provide many thousands of potential features for the model.

Related content

Powered by UNSILO
Abe, S. 2005. Modified backward feature selection by cross validation. In: 13th European Symposium on Artificial Neural Networks.
Asuncion, A., and Newman, D. 2007. UCI Machine Learning Repository.
Battiti, R. 1994. Using Mutual Information for Selecting Features in Supervised Neural Net Learning. IEEE Transactions on Neural Networks, 5, 537–550.
Caruana, R., Karampatziakis, N., and Yessenalina, A. 2008. An Empirical Evaluation of Supervised Learning in High Dimensions. In: Proceedings of the 25th International Conference on Machine Learning (ICML 2008).
Dean, J., and Ghemawat, S. 2004. MapReduce: Simplified Data Processing on Large Clusters. In: OSDI'04: Sixth Symposium on Operating System Design and Implementation.
Della Pietra, S., Della Pietra, V., and Lafferty, J. 1997. Inducing Features of Random Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), 380–393.
Fleuret, F. 2004. Fast Binary Feature Selection with Conditional Mutual Information. Journal of Machine Learning Research, 5, 1531–1555.
Friedman, J., Hastie, T., and Tibshirani, R. 2008. Regularized Paths for Generalized Linear Models via Coordinate Descent. http://www
Garcia, D., Hall, L., Goldgof, D., and Kramer, K. 2006. A parallel feature selection algorithm from random subsets. In: Proceedings of the 17th European Conference on Machine Learning and the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases.
Genkin, A., Lewis, D., and Madigan, D. 2005. Sparse Logistic Regression for Text Categorization.
Guyon, I., and Elisseeff, A. 2003. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3(March), 1157–1182.
Hastie, T., Tibshirani, R., and Friedman, J. 2001. The Elements of Statistical Learning. New York: Springer.
John, G., Kohavi, R., and Pfleger, K. 1994. Irrelevant Features and the Subset Selection Problem. Pages 121–129 of: Proceedings of the Eleventh International Conference on Machine Learning (ICML 1994). San Francisco, CA: Morgan Kauffmann.
Komarek, P., and Moore, A. 2005. Making Logistic Regression a Core Data Mining Tool with TR-IRLS. In: Proceedings of the 5th International Conference on Data Mining Machine Learning.
Krishnapuram, B., Carin, L., and Hartemink, A. 2004. Joint Classifier and Feature Optimization for Comprehensive Cancer Diagnosis Using Gene Expression Data. Journal of Computational Biology, 11(2–3), 227–242.
Lewis, D. 1992. Feature Selection and Feature Extraction for Text Categorization. Pages 212–217 of: Proceedings of the Workshop on Speech and Natural Language.
Lewis, D., Yang, Y., Rose, T., and Li, F. 2004. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5, 361–397.
López, F., Torres, M., Batista, B., Pérez, J., and Moreno-Vega, M. 2006. Solving Feature Subset Selection Problem by a Parallel Scatter Search. European Journal of Operational Research, 169(2), 477–489.
McCallum, A. 2003. Efficiently Inducing Features of Conditional Random Fields. In: Conference on Uncertainty in Artificial Intelligence (UAI).
Perkins, S., Lacker, K., and Theiler, J. 2003. Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space. Journal of Machine Learning Research, 3, 1333–1356.
Singh, S., Kubica, J., Larsen, S., and Sorokina, D. 2009. Parallel Large Scale Feature Selection for Logistic Regression. In: SIAM International Conference on Data Mining (SDM).
Tibshirani, R. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, 58(1), 267–288.
Whitney, A. 1971. A Direct Method of Nonparametric Measurement Selection. IEEE Transactions on Computers, 20(9), 1100–1103.