
12 - Large-Scale Machine Learning

Published online by Cambridge University Press: 05 December 2014

Jure Leskovec, Stanford University, California
Anand Rajaraman, Milliways Laboratories, California
Jeffrey David Ullman, Stanford University, California

Summary

Many algorithms are today classified as “machine learning.” These algorithms share, with the other algorithms studied in this book, the goal of extracting information from data. All algorithms for analysis of data are designed to produce a useful summary of the data, from which decisions are made. Among many examples, the frequent-itemset analysis that we did in Chapter 6 produces information like association rules, which can then be used for planning a sales strategy or for many other purposes.

However, algorithms called “machine learning” not only summarize our data; they are perceived as learning a model or classifier from the data, and thus as discovering something about data that will be seen in the future. For instance, the clustering algorithms discussed in Chapter 7 produce clusters that not only tell us something about the data being analyzed (the training set), but also allow us to classify future data into one of the clusters that result from the clustering algorithm. Thus, machine-learning enthusiasts often speak of clustering with the neologism “unsupervised learning”; the term unsupervised refers to the fact that the input data does not tell the clustering algorithm what the clusters should be. In supervised machine learning, which is the subject of this chapter, the available data includes information about the correct way to classify at least some of the data. The data already classified is called the training set.
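To make the supervised/unsupervised distinction concrete, here is a minimal illustration (not from the chapter; the feature values and class names are invented): in the supervised setting each training example carries its correct class, while a clustering algorithm sees only the feature vectors.

# Supervised: the training set pairs each feature vector with its known class.
labeled_training_set = [
    ((5.1, 3.5), "class-A"),
    ((4.9, 3.0), "class-A"),
    ((6.7, 3.1), "class-B"),
]

# Unsupervised (clustering): only feature vectors are available; the algorithm
# itself must decide what the groups should be.
unlabeled_data = [
    (5.0, 3.4),
    (6.9, 3.2),
]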

In this chapter, we do not attempt to cover all the different approaches to machine learning. We concentrate on methods that are suitable for very large data and that have the potential for parallel implementation. We consider the classical “perceptron” approach to learning a data classifier, where a hyperplane that separates two classes is sought. Then, we look at more modern techniques involving support-vector machines. Similar to perceptrons, these methods look for hyperplanes that best divide the classes, so that few, if any, members of the training set lie close to the hyperplane. We end with a discussion of nearest-neighbor techniques, where each data point is classified according to the class(es) of its nearest neighbors in some space.
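As a concrete illustration of the perceptron idea just described, the following is a minimal sketch in Python, not the chapter's own implementation; the names train_perceptron, eta, and epochs, and the toy training set, are assumptions made for this example. The loop nudges a hyperplane w·x + b = 0 toward any misclassified training point until the two classes, labeled +1 and -1, are separated (or the epoch budget runs out).

import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=100):
    # Seek a separating hyperplane w.x + b = 0 for labels y in {+1, -1}.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # misclassified (or on the plane)
                w += eta * yi * xi             # move the hyperplane toward xi
                b += eta * yi
                errors += 1
        if errors == 0:                        # every training point is on the right side
            break
    return w, b

# Toy training set: two linearly separable classes.
X = np.array([[2.0, 3.0], [1.0, 4.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print(w, b)  # a new point x would be classified by the sign of w.x + b

The support-vector-machine methods mentioned above choose among such separating hyperplanes one that keeps the training points as far from it as possible, rather than stopping at the first hyperplane that happens to separate them.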

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2014


