
17 - Parallel Large-Scale Feature Selection

from Part Three - Alternative Learning Settings

Published online by Cambridge University Press:  05 February 2012

Jeremy Kubica, Google Inc., Pittsburgh, PA, USA
Sameer Singh, University of Massachusetts
Daria Sorokina, Yandex Labs, Palo Alto, CA, USA

Edited by
Ron Bekkerman, LinkedIn Corporation, Mountain View, California
Mikhail Bilenko, Microsoft Research, Redmond, Washington
John Langford, Yahoo! Research, New York


The set of features used by a learning algorithm can have a dramatic impact on the performance of the algorithm. Including extraneous features can make the learning problem more difficult by adding useless, noisy dimensions that lead to over-fitting and increased computational complexity. Conversely, excluding useful features can deprive the model of important signals. The problem of feature selection is to find a subset of features that allows the learning algorithm to learn the “best” model in terms of measures such as accuracy or model simplicity.
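
As a concrete (and deliberately naive) illustration of that search, the sketch below implements greedy forward selection in Python: starting from an empty set, it repeatedly adds the candidate feature whose inclusion most improves model quality. The scikit-learn logistic regression learner and 3-fold cross-validated accuracy as the "best model" criterion are assumptions made for this example, not the chapter's prescribed method.

```python
# A minimal sketch of greedy forward feature selection. The base learner
# and the quality measure are illustrative choices, not prescribed here.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, max_features):
    selected = []                         # features chosen so far
    remaining = list(range(X.shape[1]))   # candidate feature indices
    while remaining and len(selected) < max_features:
        # Evaluate each candidate by training a model that includes it.
        scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                     X[:, selected + [j]], y, cv=3).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```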

The problem of feature selection continues to grow in both importance and difficulty as extremely high-dimensional datasets become the standard in real-world machine learning tasks. Scalability can become a problem for even simple approaches. For example, common feature selection approaches that evaluate each candidate feature by training a new model containing that feature must train a number of models that is linear in the number of candidates every time a single feature is added. This computational cost adds up quickly when we iteratively add many new features. Even techniques that rely on relatively inexpensive tests of a feature's value, such as mutual information, require time at least linear in the number of features being evaluated.
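
For a rough sense of the cheaper, filter-style alternative, the sketch below scores every feature once against the label and keeps the top k. The mutual_info_classif estimator from scikit-learn is an assumed stand-in for the mutual-information test; the chapter does not mandate a particular implementation. Note that even this single scoring pass is linear in the number of features.

```python
# A sketch of a filter-style test: score each feature once by estimated
# mutual information with the label, then keep the k highest-scoring ones.
# mutual_info_classif is an illustrative choice of estimator.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def top_k_by_mutual_info(X, y, k):
    mi = mutual_info_classif(X, y)    # one score per feature
    return np.argsort(mi)[-k:][::-1]  # indices of the k best, best first
```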

As a simple illustrative example, consider the task of classifying websites. In this case, the dataset could easily contain many millions of examples. Including very basic features such as text unigrams on the page or HTML tags could easily provide many thousands of potential features for the model.
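
For a sense of where those features come from, here is a hypothetical sketch of per-page feature generation using only the Python standard library; the "unigram:"/"tag:" feature-naming scheme is invented for this example.

```python
# A hypothetical sketch of generating features for a single web page:
# text unigrams plus the set of HTML tag names appearing in the source.
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collects the set of tag names that open anywhere in the document."""
    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)

def page_features(html_source, visible_text):
    parser = TagCollector()
    parser.feed(html_source)
    features = {f"tag:{t}" for t in parser.tags}
    features |= {f"unigram:{w.lower()}" for w in visible_text.split()}
    return features

# page_features("<html><body><b>Buy now</b></body></html>", "Buy now")
# -> {'tag:html', 'tag:body', 'tag:b', 'unigram:buy', 'unigram:now'}
```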

Scaling up Machine Learning: Parallel and Distributed Approaches, pp. 352-370
Publisher: Cambridge University Press
Print publication year: 2011
