
Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation

Published online by Cambridge University Press:  04 August 2021

Mikhail Belkin*
Affiliation: Halıcıoğlu Data Science Institute, University of California San Diego, 10100 Hopkins Drive, La Jolla, CA 92093, USA. E-mail: mbelkin@ucsd.edu

Abstract

In the past decade the mathematical theory of machine learning has lagged far behind the triumphs of deep neural networks on practical challenges. However, the gap between theory and practice is gradually starting to close. In this paper I will attempt to assemble some pieces of the remarkable and still incomplete mathematical mosaic emerging from the efforts to understand the foundations of deep learning. The two key themes will be interpolation and its sibling over-parametrization. Interpolation corresponds to fitting data, even noisy data, exactly. Over-parametrization enables interpolation and provides flexibility to select a suitable interpolating model.

As we will see, just as a physical prism separates colours mixed within a ray of light, the figurative prism of interpolation helps to disentangle generalization and optimization properties within the complex picture of modern machine learning. This article is written in the belief and hope that clearer understanding of these issues will bring us a step closer towards a general theory of deep learning and machine learning.
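The abstract's central notion can be made concrete with a minimal sketch (my own illustration, not code from the paper): a "ridgeless" kernel regressor with as many free coefficients as training points interpolates noisy data, i.e. drives the training error to (numerically) zero, while still producing a smooth function between the samples. The data, kernel, and bandwidth below are arbitrary choices for illustration.

```python
# Minimal sketch of interpolation under over-parametrization:
# an unregularized Gaussian-kernel regressor with n coefficients
# fits n noisy training points exactly.
import numpy as np

rng = np.random.default_rng(0)
n = 20
x = np.linspace(-1.0, 1.0, n)                  # training inputs
y = np.sin(3 * x) + 0.3 * rng.normal(size=n)   # noisy targets

def gaussian_kernel(a, b, bandwidth=0.1):
    """Pairwise Gaussian kernel matrix between 1-D point sets a and b."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bandwidth ** 2))

# Interpolating ("ridgeless") solution: solve K alpha = y with no regularizer.
K = gaussian_kernel(x, x)
alpha = np.linalg.solve(K, y)

train_pred = K @ alpha
print(np.max(np.abs(train_pred - y)))  # training error is numerically zero

# Predictions off the training grid remain finite and smooth,
# even though the noisy labels were fitted exactly.
x_test = np.linspace(-1.0, 1.0, 200)
f_test = gaussian_kernel(x_test, x) @ alpha
```

This is the simplest instance of the phenomenon the article examines: the model has enough parameters (one coefficient per sample) to fit the noise exactly, and the interesting question, pursued in the paper, is why such interpolating solutions can nevertheless generalize.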

Type: Research Article

Copyright: © The Author(s), 2021. Published by Cambridge University Press


Footnotes

In memory of Partha Niyogi, a thinker, a teacher, and a dear friend.
