
EXPLORATION–EXPLOITATION POLICIES WITH ALMOST SURE, ARBITRARILY SLOW GROWING ASYMPTOTIC REGRET

Published online by Cambridge University Press:  26 January 2019

Wesley Cowan
Affiliation:
Department of Computer Science, Rutgers University, Piscataway, NJ 08854, USA. E-mail: cwcowan@math.rutgers.edu
Michael N. Katehakis
Affiliation:
Department of Management Science and Information Systems, Rutgers University, Piscataway, NJ 08854, USA. E-mail: mnk@rutgers.edu
Corresponding author

Abstract

The purpose of this paper is to provide further insight into the structure of the sequential allocation (“stochastic multi-armed bandit”) problem by establishing probability-one finite-horizon bounds and convergence rates for the sample regret associated with two simple classes of allocation policies. For any slowly increasing function g, subject to mild regularity constraints, we construct two policies (the g-Forcing and the g-Inflated Sample Mean) that achieve a measure of regret of order O(g(n)) almost surely as n → ∞, bounded from above and below. Additionally, almost sure upper and lower bounds on the remainder term are established. In the constructions herein, the function g effectively controls the “exploration” side of the classical “exploration/exploitation” tradeoff.
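To make the role of g concrete, below is a minimal Python sketch of a forced-exploration policy in the spirit of the g-Forcing construction: an arm is sampled whenever its play count falls below g(t), and otherwise the empirically best arm is played. The function name g_forcing_policy, the specific forcing trigger counts < g(t), and the Bernoulli arms in the example are illustrative assumptions, not the paper's exact construction or analysis.

import numpy as np

def g_forcing_policy(arms, horizon, g, rng=None):
    """Illustrative forced-exploration bandit policy (hypothetical sketch).

    arms    : list of callables; arms[i]() returns a random reward for arm i.
    horizon : number of rounds n to play.
    g       : slowly increasing function; an arm is forcibly sampled whenever
              its play count falls below g(t) (illustrative trigger only).
    """
    if rng is None:
        rng = np.random.default_rng()
    k = len(arms)
    counts = np.zeros(k, dtype=int)   # number of times each arm was played
    means = np.zeros(k)               # running sample mean of each arm

    history = []
    for t in range(1, horizon + 1):
        under_sampled = np.flatnonzero(counts < g(t))
        if under_sampled.size > 0:
            i = int(rng.choice(under_sampled))   # exploration forced by g
        else:
            i = int(np.argmax(means))            # exploit empirical best arm
        r = arms[i]()
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # incremental sample-mean update
        history.append((i, r))
    return history

# Example usage: two Bernoulli arms and a logarithmic forcing schedule.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    arms = [lambda: rng.binomial(1, 0.4), lambda: rng.binomial(1, 0.6)]
    g_forcing_policy(arms, horizon=10_000, g=lambda t: np.log(t + 1), rng=rng)

With a slowly growing g (for example log t, as above), forced exploration becomes increasingly rare as n grows, which mirrors the way the abstract describes g as controlling the exploration side of the tradeoff.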

Type: Research Article
Copyright: © Cambridge University Press 2019
