[1]
Ahlswede, R., Gács, P. and Körner, J., Bounds on conditional probabilities with applications in multi-user communication.
Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete
34 (1976) 157–177. (correction in 39 (1977) 353–354).
[2]
Aizerman, M.A., Braverman, E.M. and Rozonoer, L.I., The method of potential functions for the problem of restoring the characteristic of a function converter from randomly observed points.
Automat. Remote Control
25 (1964) 1546–1556.
[3]
Aizerman, M.A., Braverman, E.M. and Rozonoer, L.I., The probability problem of pattern recognition learning and the method of potential functions.
Automat. Remote Control
25 (1964) 1307–1323.
[4]
Aizerman, M.A., Braverman, E.M. and Rozonoer, L.I., Theoretical foundations of the potential function method in pattern recognition learning.
Automat. Remote Control
25 (1964) 917–936.
[5]
M.A. Aizerman, E.M. Braverman and L.I. Rozonoer, Method of potential functions in the theory of learning machines. Nauka, Moscow (1970).
[6]
Akaike, H., A new look at the statistical model identification.
IEEE Trans. Automat. Control
19 (1974) 716–723.
[7]
Alesker, S., A remark on the Szarek-Talagrand theorem.
Combin. Probab. Comput.
6 (1997) 139–144.
[8]
Alon, N., Ben-David, S., Cesa-Bianchi, N. and Haussler, D., Scale-sensitive dimensions, uniform convergence, and learnability.
J. ACM
44 (1997) 615–631.
[9]
M. Anthony and P.L. Bartlett, Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge (1999).
[10]
M. Anthony and N. Biggs, Computational Learning Theory. Cambridge Tracts in Theoretical Computer Science (30). Cambridge University Press, Cambridge (1992).
[11]
Anthony, M. and Shawe-Taylor, J., A result of Vapnik with applications.
Discrete Appl. Math.
47 (1993) 207–217.
[12]
Antos, A., Devroye, L. and Györfi, L., Lower bounds for Bayes error estimation.
IEEE Trans. Pattern Anal. Machine Intelligence
21 (1999) 643–645.
[13]
Antos, A., Kégl, B., Linder, T. and Lugosi, G., Data-dependent margin-based generalization bounds for classification.
J. Machine Learning Res.
3 (2002) 73–98.
[14]
Antos, A. and Lugosi, G., Strong minimax lower bounds for learning.
Machine Learning
30 (1998) 31–56.
[15]
Assouad, P., Densité et dimension.
Annales de l'Institut Fourier
33 (1983) 233–282.
[16]
J.-Y. Audibert and O. Bousquet, PAC-Bayesian generic chaining, in Advances in Neural Information Processing Systems
16, L. Saul, S. Thrun and B. Schölkopf Eds., Cambridge, Mass., MIT Press (2004).
[17]
J.-Y. Audibert, PAC-Bayesian Statistical Learning Theory. Ph.D. Thesis, Université Paris 6, Pierre et Marie Curie (2004).
[18]
Azuma, K., Weighted sums of certain dependent random variables.
Tohoku Math. J.
19 (1967) 357–367.
[19]
Baraud, Y., Model selection for regression on a fixed design.
Probability Theory and Related Fields
117 (2000) 467–493.
[20]
Barron, A.R., Birgé, L. and Massart, P., Risk bounds for model selection via penalization.
Probab. Theory Related Fields
113 (1999) 301–415.
[21]
A.R. Barron, Logically smooth density estimation. Technical Report TR 56, Department of Statistics, Stanford University (1985).
[22]
A.R. Barron, Complexity regularization with application to artificial neural networks, in Nonparametric Functional Estimation and Related Topics, G. Roussas Ed. NATO ASI Series, Kluwer Academic Publishers, Dordrecht (1991) 561–576.
[23]
Barron, A.R. and Cover, T.M., Minimum complexity density estimation.
IEEE Trans. Inform. Theory
37 (1991) 1034–1054.
[24]
Bartlett, P., Boucheron, S. and Lugosi, G., Model selection and error estimation.
Machine Learning
48 (2001) 85–113.
[25]
Bartlett, P., Bousquet, O. and Mendelson, S., Localized Rademacher complexities.
Ann. Statist.
33 (2005) 1497–1537.
[26]
Bartlett, P.L. and Ben-David, S., Hardness results for neural network approximation problems.
Theoret. Comput. Sci.
284 (2002) 53–66.
[27]
P.L. Bartlett, M.I. Jordan and J.D. McAuliffe, Convexity, classification, and risk bounds. J. Amer. Statist. Assoc., to appear (2005).
[28]
P.L. Bartlett and W. Maass, Vapnik-Chervonenkis dimension of neural nets, in Handbook Brain Theory Neural Networks, M.A. Arbib Ed. MIT Press, second edition. (2003) 1188–1192.
[29]
Bartlett, P.L. and Mendelson, S., Rademacher and Gaussian complexities: risk bounds and structural results.
J. Machine Learning Res.
3 (2002) 463–482.
[30]
P.L. Bartlett, S. Mendelson and P. Philips, Local complexities for empirical risk minimization, in Proc. of the 17th Annual Conference on Learning Theory (COLT), Springer (2004).
[31]
Bashkirov, O., Braverman, E.M. and Muchnik, I.E., Potential function algorithms for pattern recognition learning machines.
Automat. Remote Control
25 (1964) 692–695.
[32]
Ben-David, S., Eiron, N. and Simon, H.-U., Limitations of learning via embeddings in Euclidean half spaces.
J. Machine Learning Res.
3 (2002) 441–461.
[33]
Bennett, G., Probability inequalities for the sum of independent random variables.
J. Amer. Statist. Assoc.
57 (1962) 33–45.
[34]
S.N. Bernstein, The Theory of Probabilities. Gostehizdat Publishing House, Moscow (1946).
[35]
L. Birgé, An alternative point of view on Lepski's method, in State of the art in probability and statistics (Leiden, 1999), Inst. Math. Statist., Beachwood, OH, IMS Lecture Notes Monogr. Ser.
36 (2001) 113–133.
[36]
Birgé, L. and Massart, P., Rates of convergence for minimum contrast estimators.
Probab. Theory Related Fields
97 (1993) 113–150.
[37]
L. Birgé and P. Massart, From model selection to adaptive estimation, in Festschrift for Lucien Le Cam: Research papers in Probability and Statistics, E. Torgersen, D. Pollard and G. Yang Eds., Springer, New York (1997) 55–87.
[38]
Birgé, L. and Massart, P., Minimum contrast estimators on sieves: exponential bounds and rates of convergence.
Bernoulli
4 (1998) 329–375.
[39]
G. Blanchard, O. Bousquet and P. Massart, Statistical performance of support vector machines. Ann. Statist., to appear (2006).
[40]
Blanchard, G., Lugosi, G. and Vayatis, N., On the rates of convergence of regularized boosting classifiers.
J. Machine Learning Res.
4 (2003) 861–894.
[41]
Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M.K., Learnability and the Vapnik-Chervonenkis dimension.
J. ACM
36 (1989) 929–965.
[42]
Bobkov, S. and Ledoux, M., Poincaré's inequalities and Talagrand's concentration phenomenon for the exponential distribution.
Probab. Theory Related Fields
107 (1997) 383–400.
[43]
B. Boser, I. Guyon and V.N. Vapnik, A training algorithm for optimal margin classifiers, in Proc. of the Fifth Annual ACM Workshop on Computational Learning Theory (COLT). Association for Computing Machinery, New York, NY (1992) 144–152.
[44]
Boucheron, S., Bousquet, O., Lugosi, G. and Massart, P., Moment inequalities for functions of independent random variables.
Ann. Probab.
33 (2005) 514–560.
[45]
Boucheron, S., Lugosi, G. and Massart, P., A sharp concentration inequality with applications.
Random Structures Algorithms
16 (2000) 277–292.
[46]
Boucheron, S., Lugosi, G. and Massart, P., Concentration inequalities using the entropy method.
Ann. Probab.
31 (2003) 1583–1614.
[47]
Bousquet, O., A Bennett concentration inequality and its application to suprema of empirical processes.
C. R. Acad. Sci. Paris
334 (2002) 495–500.
[48]
O. Bousquet, Concentration inequalities for sub-additive functions using the entropy method, in Stochastic Inequalities and Applications, C. Houdré, E. Giné and D. Nualart Eds., Birkhäuser (2003).
[49]
Bousquet, O. and Elisseeff, A., Stability and generalization.
J. Machine Learning Res.
2 (2002) 499–526.
[50]
O. Bousquet, V. Koltchinskii and D. Panchenko, Some local measures of complexity of convex hulls and generalization bounds, in Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT), Springer (2002) 59–73.
[51]
Breiman, L., Arcing classifiers.
Ann. Statist.
26 (1998) 801–849.
[52]
Breiman, L., Some infinite theory for predictor ensembles.
Ann. Statist.
32 (2004) 1–11.
[53]
L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone, Classification and Regression Trees. Wadsworth International, Belmont, CA (1984).
[54]
Bühlmann, P. and Yu, B., Boosting with the $L_2$-loss: regression and classification.
J. Amer. Statist. Assoc.
98 (2004) 324–339.
[55]
Cannon, A., Ettinger, J.M., Hush, D. and Scovel, C., Machine learning with data dependent hypothesis classes.
J. Machine Learning Res.
2 (2002) 335–358.
[56]
Castellan, G., Density estimation via exponential model selection.
IEEE Trans. Inform. Theory
49 (2003) 2052–2060.
[57]
O. Catoni, Randomized estimators and empirical complexity for pattern recognition and least square regression. Preprint PMA-677.
[58]
O. Catoni, Statistical learning theory and stochastic optimization. École d'été de Probabilités de Saint-Flour XXXI. Springer-Verlag. Lect. Notes Math.
1851 (2004).
[59]
O. Catoni, Localized empirical complexity bounds and randomized estimators (2003). Preprint.
[60]
Cesa-Bianchi, N. and Haussler, D., A graph-theoretic generalization of the Sauer-Shelah lemma.
Discrete Appl. Math.
86 (1998) 27–35.
[61]
Collins, M., Schapire, R.E. and Singer, Y., Logistic regression, AdaBoost and Bregman distances.
Machine Learning
48 (2002) 253–285.
[62]
Cortes, C. and Vapnik, V.N., Support vector networks.
Machine Learning
20 (1995) 1–25.
[63]
Cover, T.M., Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition.
IEEE Trans. Electronic Comput.
14 (1965) 326–334.
[64]
Craven, P. and Wahba, G., Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation.
Numer. Math.
31 (1979) 377–403.
[65]
N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, Cambridge, UK (2000).
[66]
Csiszár, I., Large-scale typicality of Markov sample paths and consistency of MDL order estimators.
IEEE Trans. Inform. Theory
48 (2002) 1616–1628.
[67]
Csiszár, I. and Shields, P., The consistency of the BIC Markov order estimator.
Ann. Statist.
28 (2000) 1601–1619.
[68]
F. Cucker and S. Smale, On the mathematical foundations of learning. Bull. Amer. Math. Soc. (2002) 1–50.
[69]
Dembo, A., Information inequalities and concentration of measure.
Ann. Probab.
25 (1997) 927–939.
[70]
P.A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach. Prentice-Hall, Englewood Cliffs, NJ (1982).
[71]
Devroye, L., Automatic pattern recognition: A study of the probability of error.
IEEE Trans. Pattern Anal. Machine Intelligence
10 (1988) 530–543.
[72]
L. Devroye, L. Györfi and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York (1996).
[73]
Devroye, L. and Lugosi, G., Lower bounds in pattern recognition and learning.
Pattern Recognition
28 (1995) 1011–1018.
[74]
L. Devroye and T. Wagner, Distribution-free inequalities for the deleted and holdout error estimates. IEEE Trans. Inform. Theory
25(2) (1979) 202–207.
[75]
L. Devroye and T. Wagner, Distribution-free performance bounds for potential function rules. IEEE Trans. Inform. Theory
25(5) (1979) 601–604.
[76]
D.L. Donoho and I.M. Johnstone, Ideal spatial adaptation by wavelet shrinkage. Biometrika
81(3) (1994) 425–455.
[77]
R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis. John Wiley, New York (1973).
[78]
R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification. John Wiley and Sons (2000).
[79]
Dudley, R.M., Central limit theorems for empirical measures.
Ann. Probab.
6 (1978) 899–929.
[80]
R.M. Dudley, Balls in $R^k$ do not cut all subsets of $k + 2$ points. Advances Math.
31(3) (1979) 306–308.
[81]
R.M. Dudley, Empirical processes, in École de Probabilité de St. Flour 1982. Lect. Notes Math.
1097 (1984).
[82]
Dudley, R.M., Universal Donsker classes and metric entropy.
Ann. Probab.
15 (1987) 1306–1326.
[83]
R.M. Dudley, Uniform Central Limit Theorems. Cambridge University Press, Cambridge (1999).
[84]
Dudley, R.M., Giné, E. and Zinn, J., Uniform and universal Glivenko-Cantelli classes.
J. Theoret. Probab.
4 (1991) 485–510.
[85]
Efron, B., Bootstrap methods: another look at the jackknife.
Ann. Statist.
7 (1979) 1–26.
[86]
B. Efron, The jackknife, the bootstrap, and other resampling plans. SIAM, Philadelphia (1982).
[87]
B. Efron and R.J. Tibshirani, An Introduction to the Bootstrap. Chapman and Hall, New York (1994).
[88]
Ehrenfeucht, A., Haussler, D., Kearns, M. and Valiant, L., A general lower bound on the number of examples needed for learning.
Inform. Comput.
82 (1989) 247–261.
[89]
T. Evgeniou, M. Pontil and T. Poggio, Regularization networks and support vector machines, in Advances in Large Margin Classifiers, A.J. Smola, P.L. Bartlett, B. Schölkopf and D. Schuurmans Eds., MIT Press, Cambridge, MA (2000) 171–203.
[90]
P. Frankl, On the trace of finite sets. J. Combin. Theory, Ser. A
34 (1983) 41–45.
[91]
Freund, Y., Boosting a weak learning algorithm by majority.
Inform. Comput.
121 (1995) 256–285.
[92]
Y. Freund, Self bounding learning algorithms, in Proceedings of the 11th Annual Conference on Computational Learning Theory (1998) 127–135.
[93]
Y. Freund, Y. Mansour and R.E. Schapire, Generalization bounds for averaged classifiers (how to be a Bayesian without believing). Ann. Statist. (2004).
[94]
Freund, Y. and Schapire, R., A decision-theoretic generalization of on-line learning and an application to boosting.
J. Comput. Syst. Sci.
55 (1997) 119–139.
[95]
Friedman, J., Hastie, T. and Tibshirani, R., Additive logistic regression: a statistical view of boosting.
Ann. Statist.
28 (2000) 337–374.
[96]
M. Fromont, Some problems related to model selection: adaptive tests and bootstrap calibration of penalties. Thèse de doctorat, Université Paris-Sud (December 2003).
[97]
K. Fukunaga, Introduction to Statistical Pattern Recognition. Academic Press, New York (1972).
[98]
Giné, E., Empirical processes and applications: an overview.
Bernoulli
2 (1996) 1–28.
[99]
Giné, E. and Zinn, J., Some limit theorems for empirical processes.
Ann. Probab.
12 (1984) 929–989.
[100]
Giné, E., Lectures on some aspects of the bootstrap, in Lectures on probability theory and statistics (Saint-Flour, 1996).
Lect. Notes Math.
1665 (1997) 37–151.
[101]
Goldberg, P. and Jerrum, M., Bounding the Vapnik-Chervonenkis dimension of concept classes parametrized by real numbers.
Machine Learning
18 (1995) 131–148.
[102]
U. Grenander, Abstract inference. John Wiley & Sons Inc., New York (1981).
[103]
Hall, P., Large sample optimality of least squares cross-validation in density estimation.
Ann. Statist.
11 (1983) 1156–1174.
[104]
T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning. Springer Series in Statistics. Springer-Verlag, New York (2001).
[105]
Haussler, D., Decision theoretic generalizations of the PAC model for neural nets and other learning applications.
Inform. Comput.
100 (1992) 78–150.
[106]
D. Haussler, Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension. J. Combin. Theory, Ser. A
69 (1995) 217–232.
[107]
D. Haussler, N. Littlestone and M. Warmuth, Predicting {0,1} functions from randomly drawn points, in Proc. of the 29th IEEE Symposium on the Foundations of Computer Science, IEEE Computer Society Press, Los Alamitos, CA (1988) 100–109.
[108]
Herbrich, R. and Williamson, R.C., Algorithmic luckiness.
J. Machine Learning Res.
3 (2003) 175–212.
[109]
Hoeffding, W., Probability inequalities for sums of bounded random variables.
J. Amer. Statist. Assoc.
58 (1963) 13–30.
[110]
P. Huber, The behavior of the maximum likelihood estimates under non-standard conditions, in Proc. Fifth Berkeley Symposium on Probability and Mathematical Statistics, Univ. California Press (1967) 221–233.
[111]
Jiang, W., Process consistency for adaboost.
Ann. Statist.
32 (2004) 13–29.
[112]
Johnson, D.S. and Preparata, F.P., The densest hemisphere problem.
Theoret. Comput. Sci.
6 (1978) 93–107.
[113]
I. Johnstone, Function estimation and gaussian sequence models. Technical Report. Department of Statistics, Stanford University (2002).
[114]
M. Karpinski and A. Macintyre, Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. J. Comput. Syst. Sci.
54 (1997).
[115]
M. Kearns, Y. Mansour, A.Y. Ng and D. Ron, An experimental and theoretical comparison of model selection methods, in Proc. of the Eighth Annual ACM Workshop on Computational Learning Theory, Association for Computing Machinery, New York (1995) 21–30.
[116]
M.J. Kearns and D. Ron, Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Comput.
11(6) (1999) 1427–1453.
[117]
M.J. Kearns and U.V. Vazirani, An Introduction to Computational Learning Theory. MIT Press, Cambridge, Massachusetts (1994).
[118]
A.G. Khovanskii, Fewnomials. Translations of Mathematical Monographs 88, American Mathematical Society (1991).
[119]
Kieffer, J.C., Strongly consistent code-based identification and order estimation for constrained finite-state model classes.
IEEE Trans. Inform. Theory
39 (1993) 893–902.
[120]
Kimeldorf, G.S. and Wahba, G., A correspondence between Bayesian estimation on stochastic processes and smoothing by splines.
Ann. Math. Statist.
41 (1970) 495–502.
[121]
P. Koiran and E.D. Sontag, Neural networks with quadratic VC dimension. J. Comput. Syst. Sci.
54 (1997).
[122]
Kolmogorov, A.N., On the representation of continuous functions of several variables by superposition of continuous functions of one variable and addition.
Dokl. Akad. Nauk SSSR
114 (1957) 953–956.
[123]
A.N. Kolmogorov and V.M. Tikhomirov, ε-entropy and ε-capacity of sets in functional spaces. Amer. Math. Soc. Transl., Ser. 2
17 (1961) 277–364.
[124]
V. Koltchinskii, Rademacher penalties and structural risk minimization.
IEEE Trans. Inform. Theory
47 (2001) 1902–1914.
[125]
V. Koltchinskii, Local Rademacher complexities and oracle inequalities in risk minimization. Manuscript (September 2003).
[126]
V. Koltchinskii and D. Panchenko, Rademacher processes and bounding the risk of function learning, in High Dimensional Probability II, E. Giné, D.M. Mason and J.A. Wellner, Eds. (2000) 443–459.
[127]
V. Koltchinskii and D. Panchenko, Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist.
30 (2002).
[128]
Kulkarni, S., Lugosi, G. and Venkatesh, S., Learning pattern classification – a survey.
IEEE Trans. Inform. Theory
44 (1998) 2178–2206. Information Theory: 1948–1998. Commemorative special issue.
[129]
S. Kutin and P. Niyogi, Almost-everywhere algorithmic stability and generalization error, in UAI-2002: Uncertainty in Artificial Intelligence (2002).
[130]
J. Langford and M. Seeger, Bounds for averaging classifiers. CMU-CS 01-102, Carnegie Mellon University (2001).
[131]
M. Ledoux, Isoperimetry and Gaussian analysis, in Lectures on Probability Theory and Statistics, P. Bernard Ed., École d'Été de Probabilités de St-Flour XXIV-1994 (1996) 165–294.
[132]
Ledoux, M., On Talagrand's deviation inequalities for product measures.
ESAIM: PS
1 (1997) 63–87.
[133]
M. Ledoux and M. Talagrand, Probability in Banach Spaces. Springer-Verlag, New York (1991).
[134]
Lee, W.S., Bartlett, P.L. and Williamson, R.C., The importance of convexity in learning with squared loss.
IEEE Trans. Inform. Theory
44 (1998) 1974–1980.
[135]
Lepskiĭ, O.V., Mammen, E. and Spokoiny, V.G., Optimal spatial adaptation to inhomogeneous smoothness: an approach based on kernel estimates with variable bandwidth selectors.
Ann. Statist.
25 (1997) 929–947.
[136]
Lepskiĭ, O.V., A problem of adaptive estimation in Gaussian white noise.
Teor. Veroyatnost. i Primenen.
35 (1990) 459–470.
[137]
Lepskiĭ, O.V., Asymptotically minimax adaptive estimation. I. Upper bounds. Optimally adaptive estimates.
Teor. Veroyatnost. i Primenen.
36 (1991) 645–659.
[138]
Li, Y., Long, P.M. and Srinivasan, A., Improved bounds on the sample complexity of learning.
J. Comput. Syst. Sci.
62 (2001) 516–527.
[139]
Y. Lin, A note on margin-based loss functions in classification. Technical Report 1029r, Department of Statistics, University of Wisconsin, Madison (1999).
[140]
Y. Lin, Some asymptotic properties of the support vector machine. Technical Report 1044r, Department of Statistics, University of Wisconsin, Madison (1999).
[141]
Lin, Y., Support vector machines and the Bayes rule in classification.
Data Mining and Knowledge Discovery
6 (2002) 259–275.
[142]
F. Lozano, Model selection using Rademacher penalization, in Proceedings of the Second ICSC Symposia on Neural Computation (NC2000). ICSC Academic Press (2000).
[143]
Luczak, M.J. and McDiarmid, C., Concentration for locally acting permutations.
Discrete Math.
265 (2003) 159–171.
[144]
G. Lugosi, Pattern classification and learning theory, in Principles of Nonparametric Learning, L. Györfi Ed., Springer, Wien (2002) 5–62.
[145]
Lugosi, G. and Nobel, A., Adaptive model selection using empirical complexities.
Ann. Statist.
27 (1999) 1830–1864.
[146]
Lugosi, G. and Vayatis, N., On the Bayes-risk consistency of regularized boosting methods.
Ann. Statist.
32 (2004) 30–55.
[147]
Lugosi, G. and Wegkamp, M., Complexity regularization via localized random penalties.
Ann. Statist.
32 (2004) 1679–1697.
[148]
Lugosi, G. and Zeger, K., Concept learning using complexity regularization.
IEEE Trans. Inform. Theory
42 (1996) 48–54.
[149]
A. Macintyre and E.D. Sontag, Finiteness results for sigmoidal “neural” networks, in Proc. of the 25th Annual ACM Symposium on the Theory of Computing, Association of Computing Machinery, New York (1993) 325–334.
[150]
Mallows, C.L., Some comments on $C_p$.
Technometrics
15 (1973) 661–675.
[151]
E. Mammen and A. Tsybakov, Smooth discrimination analysis. Ann. Statist.
27(6) (1999) 1808–1829.
[152]
S. Mannor and R. Meir, Weak learners and improved convergence rate in boosting, in Advances in Neural Information Processing Systems 13: Proc. NIPS'2000 (2001).
[153]
S. Mannor, R. Meir and T. Zhang, The consistency of greedy algorithms for classification, in Proceedings of the 15th Annual Conference on Computational Learning Theory (2002).
[154]
Marton, K., A simple proof of the blowing-up lemma.
IEEE Trans. Inform. Theory
32 (1986) 445–446.
[155]
Marton, K., Bounding $\bar{d}$-distance by informational divergence: a way to prove measure concentration.
Ann. Probab.
24 (1996) 857–866.
[156]
Marton, K., A measure concentration inequality for contracting Markov chains.
Geometric Functional Analysis
6 (1996) 556–571. Erratum: 7 (1997) 609–613.
[157]
L. Mason, J. Baxter, P.L. Bartlett and M. Frean, Functional gradient techniques for combining hypotheses, in Advances in Large Margin Classifiers, A.J. Smola, P.L. Bartlett, B. Schölkopf and D. Schuurmans Eds., MIT Press, Cambridge, MA (1999) 221–247.
[158]
P. Massart, Optimal constants for Hoeffding type inequalities. Technical report, Mathématiques, Université de Paris-Sud, Report 98.86 (1998).
[159]
Massart, P., About the constants in Talagrand's concentration inequalities for empirical processes.
Ann. Probab.
28 (2000) 863–884.
[160]
P. Massart, Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Toulouse
IX (2000) 245–303.
[161]
P. Massart, Concentration inequalities and model selection, in École d'Été de Probabilités de Saint-Flour XXXIII, LNM. Springer-Verlag (2003).
[162]
P. Massart and E. Nédélec, Risk bounds for statistical learning, Ann. Statist., to appear.
[163]
D.A. McAllester, Some PAC-Bayesian theorems, in Proc. of the 11th Annual Conference on Computational Learning Theory, ACM Press (1998) 230–234.
[164]
D.A. McAllester, PAC-Bayesian model averaging, in Proc. of the 12th Annual Conference on Computational Learning Theory. ACM Press (1999).
[165]
McAllester, D.A., PAC-Bayesian stochastic model selection.
Machine Learning
51 (2003) 5–21.
[166]
C. McDiarmid, On the method of bounded differences, in Surveys in Combinatorics 1989, Cambridge University Press, Cambridge (1989) 148–188.
[167]
C. McDiarmid, Concentration, in Probabilistic Methods for Algorithmic Discrete Mathematics, M. Habib, C. McDiarmid, J. Ramirez-Alfonsin and B. Reed Eds., Springer, New York (1998) 195–248.
[168]
McDiarmid, C., Concentration for independent permutations.
Combin. Probab. Comput.
11 (2002) 163–178.
[169]
G.J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition. John Wiley, New York (1992).
[170]
Mendelson, S., Improving the sample complexity using global data.
IEEE Trans. Inform. Theory
48 (2002) 1977–1991.
[171]
S. Mendelson, A few notes on statistical learning theory, in Advanced Lectures in Machine Learning. Lect. Notes Comput. Sci.
2600, S. Mendelson and A. Smola Eds., Springer (2003) 1–40.
[172]
Mendelson, S. and Philips, P., On the importance of “small” coordinate projections.
J. Machine Learning Res.
5 (2004) 219–238.
[173]
Mendelson, S. and Vershynin, R., Entropy and the combinatorial dimension.
Inventiones Mathematicae
152 (2003) 37–55.
[174]
V. Milman and G. Schechtman, Asymptotic theory of finite-dimensional normed spaces. Springer-Verlag, New York (1986).
[175]
B.K. Natarajan, Machine Learning: A Theoretical Approach, Morgan Kaufmann, San Mateo, CA (1991).
[176]
D. Panchenko, A note on Talagrand's concentration inequality. Electron. Comm. Probab.
6 (2001).
[177]
D. Panchenko, Some extensions of an inequality of Vapnik and Chervonenkis. Electron. Comm. Probab.
7 (2002).
[178]
Panchenko, D., Symmetrization approach to concentration inequalities for empirical processes.
Ann. Probab.
31 (2003) 2068–2081.
[179]
Poggio, T., Rifkin, S., Mukherjee, S. and Niyogi, P., General conditions for predictivity in learning theory.
Nature
428 (2004) 419–422.
[180]
D. Pollard, Convergence of Stochastic Processes, Springer-Verlag, New York (1984).
[181]
Pollard, D., Uniform ratio limit theorems for empirical processes.
Scand. J. Statist.
22 (1995) 271–278.
[182]
W. Polonik, Measuring mass concentrations and estimating density contour clusters–an excess mass approach. Ann. Statist.
23(3) (1995) 855–881.
[183]
Rio, E., Inégalités de concentration pour les processus empiriques de classes de parties.
Probab. Theory Related Fields
119 (2001) 163–175.
[184]
E. Rio, Une inégalité de Bennett pour les maxima de processus empiriques, in Colloque en l'honneur de J. Bretagnolle, D. Dacunha-Castelle et I. Ibragimov, Annales de l'Institut Henri Poincaré (2001).
[185]
B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press (1996).
[186]
Rogers, W.H. and Wagner, T.J., A finite sample distribution-free performance bound for local discrimination rules.
Ann. Statist.
6 (1978) 506–514.
[187]
M. Rudelson and R. Vershynin, Combinatorics of random processes and sections of convex bodies. Ann. Math., to appear (2004).
[188]
N. Sauer, On the density of families of sets. J. Combin. Theory, Ser. A
13 (1972) 145–147.
[189]
Schapire, R.E., The strength of weak learnability.
Machine Learning
5 (1990) 197–227.
[190]
Schapire, R.E., Freund, Y., Bartlett, P. and Lee, W.S., Boosting the margin: a new explanation for the effectiveness of voting methods.
Ann. Statist.
26 (1998) 1651–1686.
[191]
B. Schölkopf and A.J. Smola, Learning with Kernels. MIT Press, Cambridge, MA (2002).
[192]
D. Schuurmans, Characterizing rational versus exponential learning curves, in Computational Learning Theory: Second European Conference. EuroCOLT'95, Springer-Verlag (1995) 272–286.
[193]
C. Scovel and I. Steinwart, Fast rates for support vector machines. Los Alamos National Laboratory Technical Report LA-UR 03-9117 (2003).
[194]
Seeger, M., PAC-Bayesian generalisation error bounds for Gaussian process classification.
J. Machine Learning Res.
3 (2002) 233–269.
[195]
Shawe-Taylor, J., Bartlett, P.L., Williamson, R.C. and Anthony, M., Structural risk minimization over data-dependent hierarchies.
IEEE Trans. Inform. Theory
44 (1998) 1926–1940.
[196]
Shelah, S., A combinatorial problem: Stability and order for models and theories in infinitary languages.
Pacific J. Mathematics
41 (1972) 247–261.
[197]
G.R. Shorack and J. Wellner, Empirical Processes with Applications in Statistics. Wiley, New York (1986).
[198]
H.U. Simon, General lower bounds on the number of examples needed for learning probabilistic concepts, in Proc. of the Sixth Annual ACM Conference on Computational Learning Theory, Association for Computing Machinery, New York (1993) 402–412.
[199]
A.J. Smola, P.L. Bartlett, B. Schölkopf and D. Schuurmans Eds, Advances in Large Margin Classifiers. MIT Press, Cambridge, MA (2000).
[200]
Smola, A.J., Schölkopf, B. and Müller, K.-R., The connection between regularization operators and support vector kernels.
Neural Networks
11 (1998) 637–649.
[201]
Specht, D.F., Probabilistic neural networks and the polynomial Adaline as complementary techniques for classification.
IEEE Trans. Neural Networks
1 (1990) 111–121.
[202]
J.M. Steele, Existence of submatrices with all possible columns. J. Combin. Theory, Ser. A
28 (1978) 84–88.
[203]
I. Steinwart, On the influence of the kernel on the consistency of support vector machines. J. Machine Learning Res. (2001) 67–93.
[204]
Steinwart, I., Consistency of support vector machines and other regularized kernel machines.
IEEE Trans. Inform. Theory
51 (2005) 128–142.
[205]
Steinwart, I., Support vector machines are universally consistent.
J. Complexity
18 (2002) 768–791.
[206]
Steinwart, I., On the optimal parameter choice in ν-support vector machines.
IEEE Trans. Pattern Anal. Machine Intelligence
25 (2003) 1274–1284.
[207]
Steinwart, I., Sparseness of support vector machines.
J. Machine Learning Res.
4 (2003) 1071–1105.
[208]
S.J. Szarek and M. Talagrand, On the convexified Sauer-Shelah theorem. J. Combin. Theory, Ser. B
69 (1997) 183–192.
[209]
Talagrand, M., The Glivenko-Cantelli problem.
Ann. Probab.
15 (1987) 837–870.
[210]
Talagrand, M., Sharper bounds for Gaussian and empirical processes.
Ann. Probab.
22 (1994) 28–76.
[211]
Talagrand, M., Concentration of measure and isoperimetric inequalities in product spaces.
Publications Mathématiques de l'I.H.E.S.
81 (1995) 73–205.
[212]
Talagrand, M., The Glivenko-Cantelli problem, ten years later.
J. Theoret. Probab.
9 (1996) 371–384.
[213]
Talagrand, M., Majorizing measures: the generic chaining.
Ann. Probab.
24 (1996) 1049–1103. (Special Invited Paper).
[214]
Talagrand, M., New concentration inequalities in product spaces.
Inventiones Mathematicae
126 (1996) 505–563.
[215]
Talagrand, M., A new look at independence.
Ann. Probab.
24 (1996) 1–34. (Special Invited Paper).
[216]
Talagrand, M., Vapnik-Chervonenkis type conditions and uniform Donsker classes of functions.
Ann. Probab.
31 (2003) 1565–1582.
[217]
M. Talagrand, The generic chaining: upper and lower bounds for stochastic processes. Springer-Verlag, New York (2005).
[218]
A. Tsybakov, On nonparametric estimation of density level sets. Ann. Statist.
25 (1997) 948–969.
[219]
Tsybakov, A.B., Optimal aggregation of classifiers in statistical learning.
Ann. Statist.
32 (2004) 135–166.
[220]
A.B. Tsybakov, Introduction à l'estimation non-paramétrique. Springer (2004).
[221]
A. Tsybakov and S. van de Geer, Square root penalty: adaptation to the margin in classification and in edge estimation. Ann. Statist., to appear (2005).
[222]
Van de Geer, S., A new approach to least-squares estimation, with applications.
Ann. Statist.
15 (1987) 587–602.
[223]
Van de Geer, S., Estimating a regression function.
Ann. Statist.
18 (1990) 907–924.
[224]
S. van de Geer, Empirical Processes in M-Estimation. Cambridge University Press, Cambridge, UK (2000).
[225]
A.W. van der Vaart and J.A. Wellner, Weak convergence and empirical processes. Springer-Verlag, New York (1996).
[226]
Vapnik, V. and Lerner, A., Pattern recognition using generalized portrait method.
Automat. Remote Control
24 (1963) 774–780.
[227]
V.N. Vapnik, Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York (1982).
[228]
V.N. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995).
[229]
V.N. Vapnik, Statistical Learning Theory. John Wiley, New York (1998).
[230]
Vapnik, V.N. and Chervonenkis, A.Ya., On the uniform convergence of relative frequencies of events to their probabilities.
Theory Probab. Appl.
16 (1971) 264–280.
[231]
V.N. Vapnik and A.Ya. Chervonenkis, Theory of Pattern Recognition. Nauka, Moscow (1974). (in Russian); German translation: Theorie der Zeichenerkennung, Akademie Verlag, Berlin (1979).
[232]
Vapnik, V.N. and Chervonenkis, A.Ya., Necessary and sufficient conditions for the uniform convergence of means to their expectations.
Theory Probab. Appl.
26 (1981) 821–832.
[233]
M. Vidyasagar, A Theory of Learning and Generalization. Springer, New York (1997).
[234]
Vu, V., On the infeasibility of training neural networks with small mean squared error.
IEEE Trans. Inform. Theory
44 (1998) 2892–2900.
[235]
M. Wegkamp, Model selection in nonparametric regression. Ann. Statist.
31(1) (2003) 252–273.
[236]
Wenocur, R.S. and Dudley, R.M., Some special Vapnik-Chervonenkis classes.
Discrete Math.
33 (1981) 313–318.
[237]
Y. Yang, Minimax nonparametric classification. I. Rates of convergence. IEEE Trans. Inform. Theory
45(7) (1999) 2271–2284.
[238]
Y. Yang, Minimax nonparametric classification. II. Model selection for adaptation. IEEE Trans. Inform. Theory
45(7) (1999) 2285–2292.
[239]
Yang, Y., Adaptive estimation in pattern recognition by combining different procedures.
Statistica Sinica
10 (2000) 1069–1089.
[240]
Yurinskii, V.V., Exponential bounds for large deviations.
Theory Probab. Appl.
19 (1974) 154–155.
[241]
Yurinskii, V.V., Exponential inequalities for sums of random vectors.
J. Multivariate Anal.
6 (1976) 473–499.
[242]
Zhang, T., Statistical behavior and consistency of classification methods based on convex risk minimization.
Ann. Statist.
32 (2004) 56–85.
[243]
Zhou, D.-X., Capacity of reproducing kernel spaces in learning theory.
IEEE Trans. Inform. Theory
49 (2003) 1743–1752.