
References

Published online by Cambridge University Press: 05 July 2014

Shai Shalev-Shwartz, Hebrew University of Jerusalem
Shai Ben-David, University of Waterloo, Ontario

Type: Chapter
Information: Understanding Machine Learning: From Theory to Algorithms, pp. 385-394
Publisher: Cambridge University Press
Print publication year: 2014
Chapter DOI: https://doi.org/10.1017/CBO9781107298019.036


Abernethy, J., Bartlett, P. L., Rakhlin, A. & Tewari, A. (2008), “Optimal strategies and minimax lower bounds for online convex games,” in Proceedings of the Nineteenth Annual Conference on Computational Learning Theory.
Ackerman, M. & Ben-David, S. (2008), “Measures of clustering quality: A working set of axioms for clustering,” in Proceedings of Neural Information Processing Systems (NIPS), pp. 121-128.
Agarwal, S. & Roth, D. (2005), “Learnability of bipartite ranking functions,” in Proceedings of the 18th Annual Conference on Learning Theory, pp. 16-31.
Agmon, S. (1954), “The relaxation method for linear inequalities,” Canadian Journal of Mathematics 6(3), 382-392.
Aizerman, M. A., Braverman, E. M. & Rozonoer, L. I. (1964), “Theoretical foundations of the potential function method in pattern recognition learning,” Automation and Remote Control 25, 821-837.
Allwein, E. L., Schapire, R. & Singer, Y. (2000), “Reducing multiclass to binary: A unifying approach for margin classifiers,” Journal of Machine Learning Research 1, 113-141.
Alon, N., Ben-David, S., Cesa-Bianchi, N. & Haussler, D. (1997), “Scale-sensitive dimensions, uniform convergence, and learnability,” Journal of the ACM 44(4), 615-631.
Anthony, M. & Bartlett, P. (1999), Neural Network Learning: Theoretical Foundations, Cambridge University Press.
Baraniuk, R., Davenport, M., DeVore, R. & Wakin, M. (2008), “A simple proof of the restricted isometry property for random matrices,” Constructive Approximation 28(3), 253-263.
Barber, D. (2012), Bayesian Reasoning and Machine Learning, Cambridge University Press.
Bartlett, P., Bousquet, O. & Mendelson, S. (2005), “Local Rademacher complexities,” Annals of Statistics 33(4), 1497-1537.
Bartlett, P. L. & Ben-David, S. (2002), “Hardness results for neural network approximation problems,” Theoretical Computer Science 284(1), 53-66.
Bartlett, P. L., Long, P. M. & Williamson, R. C. (1994), “Fat-shattering and the learnability of real-valued functions,” in Proceedings of the Seventh Annual Conference on Computational Learning Theory, ACM, pp. 299-310.
Bartlett, P. L. & Mendelson, S. (2001), “Rademacher and Gaussian complexities: Risk bounds and structural results,” in 14th Annual Conference on Computational Learning Theory (COLT) 2001, Vol. 2111, Springer, Berlin, pp. 224-240.
Bartlett, P. L. & Mendelson, S. (2002), “Rademacher and Gaussian complexities: Risk bounds and structural results,” Journal of Machine Learning Research 3, 463-482.
Ben-David, S., Cesa-Bianchi, N., Haussler, D. & Long, P. (1995), “Characterizations of learnability for classes of {0,…,n}-valued functions,” Journal of Computer and System Sciences 50, 74-86.
Ben-David, S., Eiron, N. & Long, P. (2003), “On the difficulty of approximately maximizing agreements,” Journal of Computer and System Sciences 66(3), 496-514.
Ben-David, S. & Litman, A. (1998), “Combinatorial variability of Vapnik-Chervonenkis classes with applications to sample compression schemes,” Discrete Applied Mathematics 86(1), 3-25.
Ben-David, S., Pál, D. & Shalev-Shwartz, S. (2009), “Agnostic online learning,” in Conference on Learning Theory (COLT).
Ben-David, S. & Simon, H. (2001), “Efficient learning of linear perceptrons,” in Advances in Neural Information Processing Systems, pp. 189-195.
Bengio, Y. (2009), “Learning deep architectures for AI,” Foundations and Trends in Machine Learning 2(1), 1-127.
Bengio, Y. & LeCun, Y. (2007), “Scaling learning algorithms towards AI,” in Large-Scale Kernel Machines 34.
Bertsekas, D. (1999), Nonlinear Programming, Athena Scientific.
Beygelzimer, A., Langford, J. & Ravikumar, P. (2007), “Multiclass classification with filter trees,” Preprint, June.
Birkhoff, G. (1946), “Three observations on linear algebra,” Rev. Univ. Nac. Tucumán, Ser. A 5, 147-151.
Bishop, C. M. (2006), Pattern Recognition and Machine Learning, Vol. 1, Springer, New York.
Blum, L., Shub, M. & Smale, S. (1989), “On a theory of computation and complexity over the real numbers: NP-completeness, recursive functions and universal machines,” Bulletin of the American Mathematical Society 21(1), 1-46.
Blumer, A., Ehrenfeucht, A., Haussler, D. & Warmuth, M. K. (1987), “Occam's razor,” Information Processing Letters 24(6), 377-380.
Blumer, A., Ehrenfeucht, A., Haussler, D. & Warmuth, M. K. (1989), “Learnability and the Vapnik-Chervonenkis dimension,” Journal of the Association for Computing Machinery 36(4), 929-965.
Borwein, J. & Lewis, A. (2006), Convex Analysis and Nonlinear Optimization, Springer.
Boser, B. E., Guyon, I. M. & Vapnik, V. N. (1992), “A training algorithm for optimal margin classifiers,” in COLT, pp. 144-152.
Bottou, L. & Bousquet, O. (2008), “The tradeoffs of large scale learning,” in NIPS, pp. 161-168.
Boucheron, S., Bousquet, O. & Lugosi, G. (2005), “Theory of classification: A survey of recent advances,” ESAIM: Probability and Statistics 9, 323-375.
Bousquet, O. (2002), Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms, PhD thesis, École Polytechnique.
Bousquet, O. & Elisseeff, A. (2002), “Stability and generalization,” Journal of Machine Learning Research 2, 499-526.
Boyd, S. & Vandenberghe, L. (2004), Convex Optimization, Cambridge University Press.
Breiman, L. (1996), Bias, Variance, and Arcing Classifiers, Technical Report 460, Statistics Department, University of California at Berkeley.
Breiman, L. (2001), “Random forests,” Machine Learning 45(1), 5-32.
Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. (1984), Classification and Regression Trees, Wadsworth & Brooks.
Candès, E. (2008), “The restricted isometry property and its implications for compressed sensing,” Comptes Rendus Mathematique 346(9), 589-592.
Candès, E. J. (2006), “Compressive sampling,” in Proceedings of the International Congress of Mathematicians, Madrid, Spain.
Candès, E. & Tao, T. (2005), “Decoding by linear programming,” IEEE Transactions on Information Theory 51, 4203-4215.
Cesa-Bianchi, N. & Lugosi, G. (2006), Prediction, Learning, and Games, Cambridge University Press.
Chang, H. S., Weiss, Y. & Freeman, W. T. (2009), “Informative sensing,” arXiv preprint arXiv:0901.4275.
Chapelle, O., Le, Q. & Smola, A. (2007), “Large margin optimization of ranking measures,” in NIPS Workshop: Machine Learning for Web Search.
Collins, M. (2000), “Discriminative reranking for natural language parsing,” in Machine Learning.
Collins, M. (2002), “Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms,” in Conference on Empirical Methods in Natural Language Processing.
Collobert, R. & Weston, J. (2008), “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in International Conference on Machine Learning (ICML).
Cortes, C. & Vapnik, V. (1995), “Support-vector networks,” Machine Learning 20(3), 273-297.
Cover, T. (1965), “Behavior of sequential predictors of binary sequences,” in Transactions of the Fourth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, pp. 263-272.
Cover, T. & Hart, P. (1967), “Nearest neighbor pattern classification,” IEEE Transactions on Information Theory 13(1), 21-27.
Crammer, K. & Singer, Y. (2001), “On the algorithmic implementation of multiclass kernel-based vector machines,” Journal of Machine Learning Research 2, 265-292.
Cristianini, N. & Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines, Cambridge University Press.
Daniely, A., Sabato, S., Ben-David, S. & Shalev-Shwartz, S. (2011), “Multiclass learnability and the ERM principle,” in COLT.
Daniely, A., Sabato, S. & Shalev-Shwartz, S. (2012), “Multiclass learning approaches: A theoretical comparison with implications,” in NIPS.
Davis, G., Mallat, S. & Avellaneda, M. (1997), “Greedy adaptive approximation,” Journal of Constructive Approximation 13, 57-98.
Devroye, L. & Gyorfi, L. (1985), Nonparametric Density Estimation: The L1 View, Wiley.
Devroye, L., Gyorfi, L. & Lugosi, G. (1996), A Probabilistic Theory of Pattern Recognition, Springer.
Dietterich, T. G. & Bakiri, G. (1995), “Solving multiclass learning problems via error-correcting output codes,” Journal of Artificial Intelligence Research 2, 263-286.
Donoho, D. L. (2006), “Compressed sensing,” IEEE Transactions on Information Theory 52(4), 1289-1306.
Dudley, R., Gine, E. & Zinn, J. (1991), “Uniform and universal Glivenko-Cantelli classes,” Journal of Theoretical Probability 4(3), 485-510.
Dudley, R. M. (1987), “Universal Donsker classes and metric entropy,” Annals of Probability 15(4), 1306-1326.
Fisher, R. A. (1922), “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society of London, Series A, Containing Papers of a Mathematical or Physical Character 222, 309-368.
Floyd, S. (1989), “Space-bounded learning and the Vapnik-Chervonenkis dimension,” in COLT, pp. 349-364.
Floyd, S. & Warmuth, M. (1995), “Sample compression, learnability, and the Vapnik-Chervonenkis dimension,” Machine Learning 21(3), 269-304.
Frank, M. & Wolfe, P. (1956), “An algorithm for quadratic programming,” Naval Research Logistics Quarterly 3, 95-110.
Freund, Y. & Schapire, R. (1995), “A decision-theoretic generalization of on-line learning and an application to boosting,” in European Conference on Computational Learning Theory (EuroCOLT), Springer-Verlag, pp. 23-37.
Freund, Y. & Schapire, R. E. (1999), “Large margin classification using the perceptron algorithm,” Machine Learning 37(3), 277-296.
Garcia, J. & Koelling, R. (1996), “Relation of cue to consequence in avoidance learning,” Foundations of Animal Behavior: Classic Papers with Commentaries 4, 374.
Gentile, C. (2003), “The robustness of the p-norm algorithms,” Machine Learning 53(3), 265-299.
Georghiades, A., Belhumeur, P. & Kriegman, D. (2001), “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Transactions on Pattern Analysis and Machine Intelligence 23(6), 643-660.
Gordon, G. (1999), “Regret bounds for prediction problems,” in Conference on Learning Theory (COLT).
Gottlieb, L.-A., Kontorovich, L. & Krauthgamer, R. (2010), “Efficient classification for metric data,” in 23rd Conference on Learning Theory, pp. 433-440.
Guyon, I. & Elisseeff, A. (2003), “An introduction to variable and feature selection,” Journal of Machine Learning Research, Special Issue on Variable and Feature Selection 3, 1157-1182.
Hadamard, J. (1902), “Sur les problèmes aux dérivées partielles et leur signification physique” (On problems concerning partial derivatives and their physical significance), Princeton University Bulletin 13, 49-52.
Hastie, T., Tibshirani, R. & Friedman, J. (2001), The Elements of Statistical Learning, Springer.
Haussler, D. (1992), “Decision theoretic generalizations of the PAC model for neural net and other learning applications,” Information and Computation 100(1), 78-150.
Haussler, D. & Long, P. M. (1995), “A generalization of Sauer's lemma,” Journal of Combinatorial Theory, Series A 71(2), 219-240.
Hazan, E., Agarwal, A. & Kale, S. (2007), “Logarithmic regret algorithms for online convex optimization,” Machine Learning 69(2-3), 169-192.
Hinton, G. E., Osindero, S. & Teh, Y.-W. (2006), “A fast learning algorithm for deep belief nets,” Neural Computation 18(7), 1527-1554.
Hiriart-Urruty, J.-B. & Lemaréchal, C. (1993), Convex Analysis and Minimization Algorithms, Springer.
Hsu, C.-W., Chang, C.-C. & Lin, C.-J. (2003), “A practical guide to support vector classification.”
Hyafil, L. & Rivest, R. L. (1976), “Constructing optimal binary decision trees is NP-complete,” Information Processing Letters 5(1), 15-17.
Joachims, T. (2005), “A support vector method for multivariate performance measures,” in Proceedings of the International Conference on Machine Learning (ICML).
Kakade, S., Sridharan, K. & Tewari, A. (2008), “On the complexity of linear prediction: Risk bounds, margin bounds, and regularization,” in NIPS.
Karp, R. M. (1972), Reducibility among Combinatorial Problems, Springer.
Kearns, M. & Mansour, Y. (1996), “On the boosting ability of top-down decision tree learning algorithms,” in ACM Symposium on the Theory of Computing (STOC).
Kearns, M. & Ron, D. (1999), “Algorithmic stability and sanity-check bounds for leave-one-out cross-validation,” Neural Computation 11(6), 1427-1453.
Kearns, M. & Valiant, L. G. (1988), “Learning Boolean formulae or finite automata is as hard as factoring,” Technical Report TR-14-88, Harvard University, Aiken Computation Laboratory.
Kearns, M. & Vazirani, U. (1994), An Introduction to Computational Learning Theory, MIT Press.
Kearns, M. J., Schapire, R. E. & Sellie, L. M. (1994), “Toward efficient agnostic learning,” Machine Learning 17, 115-141.
Kleinberg, J. (2003), “An impossibility theorem for clustering,” in NIPS, pp. 463-470.
Klivans, A. R. & Sherstov, A. A. (2006), “Cryptographic hardness for learning intersections of halfspaces,” in FOCS.
Koller, D. & Friedman, N. (2009), Probabilistic Graphical Models: Principles and Techniques, MIT Press.
Koltchinskii, V. & Panchenko, D. (2000), “Rademacher processes and bounding the risk of function learning,” in High Dimensional Probability II, Springer, pp. 443-457.
Kuhn, H. W. (1955), “The Hungarian method for the assignment problem,” Naval Research Logistics Quarterly 2(1-2), 83-97.
Kutin, S. & Niyogi, P. (2002), “Almost-everywhere algorithmic stability and generalization error,” in Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, pp. 275-282.
Lafferty, J., McCallum, A. & Pereira, F. (2001), “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in International Conference on Machine Learning, pp. 282-289.
Langford, J. (2006), “Tutorial on practical prediction theory for classification,” Journal of Machine Learning Research 6(1), 273.
Langford, J. & Shawe-Taylor, J. (2003), “PAC-Bayes & margins,” in NIPS, pp. 423-430.
Le, Q. V., Ranzato, M.-A., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J. & Ng, A. Y. (2012), “Building high-level features using large scale unsupervised learning,” in ICML.
Le Cun, L. (2004), “Large scale online learning,” in Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference, Vol. 16, MIT Press, p. 217.
LeCun, Y. & Bengio, Y. (1995), “Convolutional networks for images, speech, and time series,” in The Handbook of Brain Theory and Neural Networks, MIT Press.
Lee, H., Grosse, R., Ranganath, R. & Ng, A. (2009), “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in ICML.
Littlestone, N. (1988), “Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm,” Machine Learning 2, 285-318.
Littlestone, N. & Warmuth, M. (1986), “Relating data compression and learnability,” unpublished manuscript.
Littlestone, N. & Warmuth, M. K. (1994), “The weighted majority algorithm,” Information and Computation 108, 212-261.
Livni, R., Shalev-Shwartz, S. & Shamir, O. (2013), “A provably efficient algorithm for training deep networks,” arXiv preprint arXiv:1304.7045.
Livni, R. & Simon, P. (2013), “Honest compressions and their application to compression schemes,” in COLT.
MacKay, D. J. (2003), Information Theory, Inference and Learning Algorithms, Cambridge University Press.
Mallat, S. & Zhang, Z. (1993), “Matching pursuits with time-frequency dictionaries,” IEEE Transactions on Signal Processing 41, 3397-3415.
McAllester, D. A. (1998), “Some PAC-Bayesian theorems,” in COLT.
McAllester, D. A. (1999), “PAC-Bayesian model averaging,” in COLT, pp. 164-170.
McAllester, D. A. (2003), “Simplified PAC-Bayesian margin bounds,” in COLT, pp. 203-215.
Minsky, M. & Papert, S. (1969), Perceptrons: An Introduction to Computational Geometry, MIT Press.
Mukherjee, S., Niyogi, P., Poggio, T. & Rifkin, R. (2006), “Learning theory: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization,” Advances in Computational Mathematics 25(1-3), 161-193.
Murata, N. (1998), “A statistical study of on-line learning,” in Online Learning and Neural Networks, Cambridge University Press.
Murphy, K. P. (2012), Machine Learning: A Probabilistic Perspective, MIT Press.
Natarajan, B. (1995), “Sparse approximate solutions to linear systems,” SIAM Journal on Computing 25(2), 227-234.
Natarajan, B. K. (1989), “On learning sets and functions,” Machine Learning 4, 67-97.
Nemirovski, A., Juditsky, A., Lan, G. & Shapiro, A. (2009), “Robust stochastic approximation approach to stochastic programming,” SIAM Journal on Optimization 19(4), 1574-1609.
Nemirovski, A. & Yudin, D. (1978), Problem Complexity and Method Efficiency in Optimization, Nauka, Moscow.
Nesterov, Y. (2005), Primal-Dual Subgradient Methods for Convex Problems, Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL).
Nesterov, Y. (2004), Introductory Lectures on Convex Optimization: A Basic Course, Vol. 87, Springer, Netherlands.
Novikoff, A. B. J. (1962), “On convergence proofs on perceptrons,” in Proceedings of the Symposium on the Mathematical Theory of Automata, Vol. XII, pp. 615-622.
Parberry, I. (1994), Circuit Complexity and Neural Networks, MIT Press.
Pearson, K. (1901), “On lines and planes of closest fit to systems of points in space,” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2(11), 559-572.
Phillips, D. L. (1962), “A technique for the numerical solution of certain integral equations of the first kind,” Journal of the ACM 9(1), 84-97.
Pisier, G. (1980-1981), “Remarques sur un résultat non publié de B. Maurey” (Remarks on an unpublished result of B. Maurey).
Pitt, L. & Valiant, L. (1988), “Computational limitations on learning from examples,” Journal of the Association for Computing Machinery 35(4), 965-984.
Poon, H. & Domingos, P. (2011), “Sum-product networks: A new deep architecture,” in Conference on Uncertainty in Artificial Intelligence (UAI).
Quinlan, J. R. (1986), “Induction of decision trees,” Machine Learning 1, 81-106.
Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann.
Rabiner, L. & Juang, B. (1986), “An introduction to hidden Markov models,” IEEE ASSP Magazine 3(1), 4-16.
Rakhlin, A., Shamir, O. & Sridharan, K. (2012), “Making gradient descent optimal for strongly convex stochastic optimization,” in ICML.
Rakhlin, A., Sridharan, K. & Tewari, A. (2010), “Online learning: Random averages, combinatorial parameters, and learnability,” in NIPS.
Rakhlin, S., Mukherjee, S. & Poggio, T. (2005), “Stability results in learning theory,” Analysis and Applications 3(4), 397-419.
Ranzato, M., Huang, F., Boureau, Y. & LeCun, Y. (2007), “Unsupervised learning of invariant feature hierarchies with applications to object recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp. 1-8.
Rissanen, J. (1978), “Modeling by shortest data description,” Automatica 14, 465-471.
Rissanen, J. (1983), “A universal prior for integers and estimation by minimum description length,” The Annals of Statistics 11(2), 416-431.
Robbins, H. & Monro, S. (1951), “A stochastic approximation method,” The Annals of Mathematical Statistics, pp. 400-407.
Rogers, W. & Wagner, T. (1978), “A finite sample distribution-free performance bound for local discrimination rules,” The Annals of Statistics 6(3), 506-514.
Rokach, L. (2007), Data Mining with Decision Trees: Theory and Applications, Vol. 69, World Scientific.
Rosenblatt, F. (1958), “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychological Review 65, 386-407. (Reprinted in Neurocomputing, MIT Press, 1988.)
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. (1986), “Learning internal representations by error propagation,” in D. E. Rumelhart & J. L. McClelland, eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, chapter 8, pp. 318-362.
Sankaran, J. K. (1993), “A note on resolving infeasibility in linear programs by constraint relaxation,” Operations Research Letters 13(1), 19-20.
Sauer, N. (1972), “On the density of families of sets,” Journal of Combinatorial Theory, Series A 13, 145-147.
Schapire, R. (1990), “The strength of weak learnability,” Machine Learning 5(2), 197-227.
Schapire, R. E. & Freund, Y. (2012), Boosting: Foundations and Algorithms, MIT Press.
Schölkopf, B. & Smola, A. J. (2002), Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, MIT Press.
Schölkopf, B., Herbrich, R. & Smola, A. (2001), “A generalized representer theorem,” in Computational Learning Theory, pp. 416-426.
Schölkopf, B., Herbrich, R., Smola, A. & Williamson, R. (2000), “A generalized representer theorem,” in NeuroCOLT.
Schölkopf, B., Smola, A. & Müller, K.-R. (1998), “Nonlinear component analysis as a kernel eigenvalue problem,” Neural Computation 10(5), 1299-1319.
Seeger, M. (2003), “PAC-Bayesian generalisation error bounds for Gaussian process classification,” Journal of Machine Learning Research 3, 233-269.
Shakhnarovich, G., Darrell, T. & Indyk, P. (2006), Nearest-Neighbor Methods in Learning and Vision: Theory and Practice, MIT Press.
Shalev-Shwartz, S. (2007), Online Learning: Theory, Algorithms, and Applications, PhD thesis, The Hebrew University.
Shalev-Shwartz, S. (2011), “Online learning and online convex optimization,” Foundations and Trends in Machine Learning 4(2), 107-194.
Shalev-Shwartz, S., Shamir, O., Srebro, N. & Sridharan, K. (2010), “Learnability, stability and uniform convergence,” Journal of Machine Learning Research 11, 2635-2670.
Shalev-Shwartz, S., Shamir, O. & Sridharan, K. (2010), “Learning kernel-based halfspaces with the zero-one loss,” in COLT.
Shalev-Shwartz, S., Shamir, O., Sridharan, K. & Srebro, N. (2009), “Stochastic convex optimization,” in COLT.
Shalev-Shwartz, S. & Singer, Y. (2008), “On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms,” in Proceedings of the Nineteenth Annual Conference on Computational Learning Theory.
Shalev-Shwartz, S., Singer, Y. & Srebro, N. (2007), “Pegasos: Primal Estimated sub-GrAdient SOlver for SVM,” in International Conference on Machine Learning (ICML), pp. 807-814.
Shalev-Shwartz, S. & Srebro, N. (2008), “SVM optimization: Inverse dependence on training set size,” in International Conference on Machine Learning (ICML), pp. 928-935.
Shalev-Shwartz, S., Zhang, T. & Srebro, N. (2010), “Trading accuracy for sparsity in optimization problems with sparsity constraints,” SIAM Journal on Optimization 20, 2807-2832.
Shamir, O. & Zhang, T. (2013), “Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes,” in ICML.
Shapiro, A., Dentcheva, D. & Ruszczyński, A. (2009), Lectures on Stochastic Programming: Modeling and Theory, Vol. 9, Society for Industrial and Applied Mathematics.
Shelah, S. (1972), “A combinatorial problem; stability and order for models and theories in infinitary languages,” Pacific Journal of Mathematics 41, 247-261.
Sipser, M. (2006), Introduction to the Theory of Computation, Thomson Course Technology.
Slud, E. V. (1977), “Distribution inequalities for the binomial law,” The Annals of Probability 5(3), 404-412.
Steinwart, I. & Christmann, A. (2008), Support Vector Machines, Springer-Verlag, New York.
Stone, C. (1977), “Consistent nonparametric regression,” The Annals of Statistics 5(4), 595-620.
Taskar, B., Guestrin, C. & Koller, D. (2003), “Max-margin Markov networks,” in NIPS.
Tibshirani, R. (1996), “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, Series B 58(1), 267-288.
Tikhonov, A. N. (1943), “On the stability of inverse problems,” Dokl. Akad. Nauk SSSR 39(5), 195-198.
Tishby, N., Pereira, F. & Bialek, W. (1999), “The information bottleneck method,” in The 37th Allerton Conference on Communication, Control, and Computing.
Tsochantaridis, I., Hofmann, T., Joachims, T. & Altun, Y. (2004), “Support vector machine learning for interdependent and structured output spaces,” in Proceedings of the Twenty-First International Conference on Machine Learning.
Valiant, L. G. (1984), “A theory of the learnable,” Communications of the ACM 27(11), 1134-1142.
Vapnik, V. (1992), “Principles of risk minimization for learning theory,” in J. E. Moody, S. J. Hanson & R. P. Lippmann, eds., Advances in Neural Information Processing Systems 4, Morgan Kaufmann, pp. 831-838.
Vapnik, V. (1995), The Nature of Statistical Learning Theory, Springer.
Vapnik, V. N. (1982), Estimation of Dependences Based on Empirical Data, Springer-Verlag.
Vapnik, V. N. (1998), Statistical Learning Theory, Wiley.
Vapnik, V. N. & Chervonenkis, A. Y. (1971), “On the uniform convergence of relative frequencies of events to their probabilities,” Theory of Probability and Its Applications 16(2), 264-280.
Vapnik, V. N. & Chervonenkis, A. Y. (1974), Theory of Pattern Recognition, Nauka, Moscow (in Russian).
von Luxburg, U. (2007), “A tutorial on spectral clustering,” Statistics and Computing 17(4), 395-416.
von Neumann, J. (1928), “Zur Theorie der Gesellschaftsspiele” (On the theory of parlor games), Mathematische Annalen 100, 295-320.
von Neumann, J. (1953), “A certain zero-sum two-person game equivalent to the optimal assignment problem,” Contributions to the Theory of Games 2, 5-12.
Vovk, V. G. (1990), “Aggregating strategies,” in COLT, pp. 371-383.
Warmuth, M., Glocer, K. & Vishwanathan, S. (2008), “Entropy regularized LPBoost,” in Algorithmic Learning Theory (ALT).
Warmuth, M., Liao, J. & Rätsch, G. (2006), “Totally corrective boosting algorithms that maximize the margin,” in Proceedings of the 23rd International Conference on Machine Learning.
Weston, J., Chapelle, O., Vapnik, V., Elisseeff, A. & Schölkopf, B. (2002), “Kernel dependency estimation,” in Advances in Neural Information Processing Systems, pp. 873-880.
Weston, J. & Watkins, C. (1999), “Support vector machines for multi-class pattern recognition,” in Proceedings of the Seventh European Symposium on Artificial Neural Networks.
Wolpert, D. H. & Macready, W. G. (1997), “No free lunch theorems for optimization,” IEEE Transactions on Evolutionary Computation 1(1), 67-82.
Zhang, T. (2004), “Solving large scale linear prediction problems using stochastic gradient descent algorithms,” in Proceedings of the Twenty-First International Conference on Machine Learning.
Zhao, P. & Yu, B. (2006), “On model selection consistency of Lasso,” Journal of Machine Learning Research 7, 2541-2567.
Zinkevich, M. (2003), “Online convex programming and generalized infinitesimal gradient ascent,” in International Conference on Machine Learning.
