DUELING BANDIT PROBLEMS

Erol Peköz; Sheldon M. Ross; Zhengyu Zhang

doi:10.1017/S0269964820000601

Published online by Cambridge University Press: 20 November 2020

Erol Peköz: Boston University, Boston, MA, USA. E-mail: pekoz@bu.edu
Sheldon M. Ross: University of Southern California, Los Angeles, CA, USA. E-mail: smross@usc.edu
Zhengyu Zhang: University of Southern California, Los Angeles, CA, USA. E-mail: zhan892@usc.edu

Abstract

There is a set of n bandits, and at every stage two of the bandits are chosen to play a game, with the result of the game being learned. In the “weak regret problem,” we suppose there is a “best” bandit that wins each game it plays with probability at least p > 1/2, with the value of p being unknown. The objective is to choose bandits to maximize the number of times that one of the competitors is the best bandit. In the “strong regret problem,” we suppose that bandit i has unknown value v_i, i = 1, …, n, and that i beats j with probability v_i/(v_i + v_j). One version of the strong regret problem seeks to maximize the number of times that the contest is between the players with the two largest values. Another version supposes that at any stage, rather than choosing two arms to play a game, the decision maker can declare that a particular arm is the best, with the objective of maximizing the number of stages in which the arm with the largest value is declared to be the best. For the weak regret problem, we propose a policy and obtain an analytic bound on the expected number of stages, over an infinite time frame, in which the best arm is not one of the competitors when this policy is employed. For the strong regret problem, we propose a Thompson sampling type algorithm and empirically compare its performance with others in the literature.
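The setting described in the abstract can be illustrated with a small simulation. The sketch below is not the policy or the Thompson sampling type algorithm proposed in the article; it uses a hypothetical "winner stays" baseline (the previous winner meets a uniformly random challenger), and the names `duel` and `winner_stays` are assumptions introduced only for illustration. It simulates games in which i beats j with probability v_i/(v_i + v_j), assumes the values are distinct, and counts the stages in which the best arm is not a competitor and the stages in which the contest is not between the two largest values.

```python
import random


def duel(i, j, values, rng):
    """Play one game: arm i beats arm j with probability v_i / (v_i + v_j)."""
    p_i = values[i] / (values[i] + values[j])
    return i if rng.random() < p_i else j


def winner_stays(values, stages, seed=0):
    """Hypothetical baseline policy (an assumption, not the article's policy).

    The current champion plays a uniformly random challenger and the winner
    becomes the next champion. Returns a pair:
      (stages where the best arm is not a competitor,
       stages where the pair is not the two largest-valued arms).
    Assumes the entries of `values` are distinct.
    """
    rng = random.Random(seed)
    n = len(values)
    order = sorted(range(n), key=lambda k: values[k], reverse=True)
    best, top_two = order[0], set(order[:2])

    champion = rng.randrange(n)
    weak_misses = 0
    strong_misses = 0
    for _ in range(stages):
        challenger = rng.choice([k for k in range(n) if k != champion])
        if best not in (champion, challenger):
            weak_misses += 1
        if {champion, challenger} != top_two:
            strong_misses += 1
        champion = duel(champion, challenger, values, rng)
    return weak_misses, strong_misses


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the article.
    print(winner_stays([1.0, 2.0, 5.0, 3.0], stages=1000, seed=1))
```

Every stage counted in the first total is also counted in the second, since a contest between the two largest values always includes the best arm. A learning policy would aim to keep both counts small as the number of stages grows.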

Keywords

applied probability; simulation; stochastic modeling

Information

Type: Research Article
Information: Probability in the Engineering and Informational Sciences, Volume 36, Issue 2, April 2022, pp. 264–275

DOI: https://doi.org/10.1017/S0269964820000601
Copyright: © The Author(s), 2020. Published by Cambridge University Press
