17 - Parallel Large-Scale Feature Selection
from Part Three - Alternative Learning Settings
By Jeremy Kubica, Google Inc., Pittsburgh, PA, USA; Sameer Singh, University of Massachusetts; Daria Sorokina, Yandex Labs, Palo Alto, CA, USA
Edited by Ron Bekkerman, Mikhail Bilenko, John Langford
Book: Scaling up Machine Learning
Published online: 05 February 2012
Print publication: 30 December 2011, pp. 352-370
Summary
The set of features used by a learning algorithm can have a dramatic impact on the performance of the algorithm. Including extraneous features can make the learning problem more difficult by adding useless, noisy dimensions that lead to over-fitting and increased computational complexity. Conversely, excluding useful features can deprive the model of important signals. The problem of feature selection is to find a subset of features that allows the learning algorithm to learn the “best” model in terms of measures such as accuracy or model simplicity.
The problem of feature selection continues to grow in both importance and difficulty as extremely high-dimensional datasets become the standard in real-world machine learning tasks. Scalability can become a problem even for simple approaches. For example, common feature selection approaches that evaluate each candidate feature by training a new model containing that feature must learn a number of models linear in the number of candidates every time they add a feature. This computational cost adds up quickly when many new features are added iteratively. Even techniques that use relatively inexpensive tests of a feature's value, such as mutual information, require time at least linear in the number of features being evaluated.
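To make that cost concrete, the following is a minimal sketch of naive greedy forward selection, an illustration only and not the parallel algorithm described in this chapter. The use of scikit-learn's LogisticRegression and cross_val_score, and the function forward_select itself, are assumptions made for the example; the point is that each pass over the remaining candidates trains one model per candidate, which is where the linear number of model fits per added feature comes from.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, feature_names, num_to_add):
    """Greedily add the feature that most improves cross-validated accuracy.

    Illustrative sketch: adding k features from d candidates costs O(k * d)
    model fits, since every outer iteration refits one model per candidate.
    """
    selected, remaining = [], list(range(len(feature_names)))
    for _ in range(num_to_add):
        best_score, best_feature = -1.0, None
        for f in remaining:                      # linear scan over candidates
            cols = selected + [f]
            model = LogisticRegression(max_iter=1000)
            score = cross_val_score(model, X[:, cols], y, cv=3).mean()
            if score > best_score:
                best_score, best_feature = score, f
        selected.append(best_feature)
        remaining.remove(best_feature)
    return [feature_names[i] for i in selected]
```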
As a simple illustrative example, consider the task of classifying websites. In this case, the dataset could easily contain many millions of examples. Including very basic features such as text unigrams on the page or HTML tags could easily provide many thousands of potential features for the model.
LSST: Comprehensive NEO detection, characterization, and orbits
Željko Ivezić, J. Anthony Tyson, Mario Jurić, Jeremy Kubica, Andrew Connolly, Francesco Pierfederici, Alan W. Harris, Edward Bowell, and the LSST Collaboration
Journal: Proceedings of the International Astronomical Union / Volume 2 / Issue S236 / August 2006
Published online by Cambridge University Press: 01 August 2006, pp. 353-362
Print publication: August 2006
The Large Synoptic Survey Telescope (LSST) is currently by far the most ambitious proposed ground-based optical survey. With initial funding from the National Science Foundation (NSF), Department of Energy (DOE) laboratories, and private sponsors, design and development efforts are well underway at many institutions, including top universities and national laboratories. Solar System mapping is one of the four key scientific design drivers, with emphasis on efficient Near-Earth Object (NEO) and Potentially Hazardous Asteroid (PHA) detection, orbit determination, and characterization. The LSST system will be sited at Cerro Pachon in northern Chile. In a continuous observing campaign using pairs of 15 s exposures of its 3,200-megapixel camera, LSST will cover the entire available sky every three nights in two photometric bands to a depth of V = 25 per visit (two exposures), with exquisitely accurate astrometry and photometry. Over the proposed survey lifetime of 10 years, each sky location would be visited about 1000 times, with a total exposure time of 8 hours distributed over several broad photometric bandpasses. The baseline design satisfies strong constraints on the observing cadence mandated by PHAs, such as closely spaced pairs of observations to link different detections and short exposures to avoid trailing losses. Because of these frequent repeat visits, LSST will effectively provide its own follow-up for deriving orbits of detected moving objects.
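As a rough consistency check on the quoted numbers (an illustrative calculation, not part of the original abstract), 1000 visits of two 15 s exposures each give

\[
1000 \times 2 \times 15\ \mathrm{s} = 30{,}000\ \mathrm{s} \approx 8.3\ \mathrm{h},
\]

in line with the stated total exposure time of roughly 8 hours per sky location.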
Detailed modeling of LSST operations, incorporating real historical weather and seeing data from Cerro Pachon, shows that with its baseline cadence LSST could find 90% of the PHAs with diameters larger than 250 m, and 75% of those larger than 140 m, within ten years. However, the ongoing simulations suggest that, by optimizing sky coverage, the LSST system, with first light in 2013, could reach the Congressional mandate of cataloging 90% of PHAs larger than 140 m by 2020. In addition to detecting, tracking, and determining orbits for these PHAs, LSST will also provide valuable data on their physical and chemical characteristics (accurate color and variability measurements), constraining PHA properties relevant to risk mitigation strategies. Fulfilling the Congressional mandate requires a survey with an etendue of at least several hundred m² deg² and a sophisticated, robust data processing system. It is fortunate that the same hardware, software, and cadence requirements are driven by science unrelated to NEOs: LSST reaches the threshold where different science drivers and different agencies (NSF, DOE, and NASA) can work together to efficiently achieve seemingly disjoint, but deeply connected, goals.