DUELING BANDIT PROBLEMS

Published online by Cambridge University Press: 20 November 2020

Erol Peköz
Affiliation:
Boston University, Boston, MA, USA E-mail: pekoz@bu.edu
Sheldon M. Ross
Affiliation:
University of Southern California, Los Angeles, CA, USA E-mail: smross@usc.edu
Zhengyu Zhang
Affiliation:
University of Southern California, Los Angeles, CA, USA E-mail: zhan892@usc.edu

Abstract

There is a set of n bandits, and at every stage two of the bandits are chosen to play a game, with the result of the game being learned. In the “weak regret problem,” we suppose there is a “best” bandit that wins each game it plays with probability at least p > 1/2, with the value of p being unknown. The objective is to choose bandits so as to maximize the number of stages in which one of the competitors is the best bandit. In the “strong regret problem,” we suppose that bandit i has unknown value v_i, i = 1, …, n, and that i beats j with probability v_i/(v_i + v_j). One version of the strong regret problem seeks to maximize the number of stages in which the contest is between the two bandits with the largest values. Another version supposes that at any stage, rather than choosing two arms to play a game, the decision maker can declare that a particular arm is the best, with the objective of maximizing the number of stages in which the arm with the largest value is declared to be the best. For the weak regret problem, we propose a policy and obtain an analytic bound on the expected number of stages, over an infinite time frame, in which the best arm is not one of the competitors when this policy is employed. For the strong regret problem, we propose a Thompson sampling type algorithm and empirically compare its performance with others in the literature.
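
To make the strong regret setting concrete, the following is a minimal simulation sketch of the game model stated in the abstract, in which bandit i beats bandit j with probability v_i/(v_i + v_j). It is not the policy or the Thompson sampling type algorithm proposed in the article. The function names (`win_probability`, `play_duel`, `count_top_two_stages`), the `choose_pair` callback interface, and the example values are all assumptions introduced here for illustration.

```python
import random


def win_probability(v_i, v_j):
    """Probability that a bandit with value v_i beats one with value v_j."""
    return v_i / (v_i + v_j)


def play_duel(values, i, j, rng):
    """Simulate one game between bandits i and j; return the winner's index."""
    return i if rng.random() < win_probability(values[i], values[j]) else j


def count_top_two_stages(values, choose_pair, num_stages, seed=0):
    """Count stages in which the duel is between the two largest-value bandits.

    choose_pair(history, rng) is a hypothetical policy interface: it sees the
    list of past (i, j, winner) triples and returns the next pair to play.
    The values themselves are hidden from the policy.
    """
    rng = random.Random(seed)
    order = sorted(range(len(values)), key=lambda k: values[k], reverse=True)
    top_two = set(order[:2])
    history = []
    hits = 0
    for _ in range(num_stages):
        i, j = choose_pair(history, rng)
        winner = play_duel(values, i, j, rng)
        history.append((i, j, winner))
        if {i, j} == top_two:
            hits += 1
    return hits


def uniform_policy(n):
    """Baseline that ignores the history and picks two distinct bandits at random."""
    def choose_pair(history, rng):
        i, j = rng.sample(range(n), 2)
        return i, j
    return choose_pair


if __name__ == "__main__":
    values = [3.0, 2.0, 1.0, 0.5]  # illustrative values, unknown to the policy
    hits = count_top_two_stages(values, uniform_policy(len(values)), 1000)
    print(f"stages with the top two bandits competing: {hits} of 1000")
```

A learning policy would replace `uniform_policy` with a `choose_pair` function that uses the recorded outcomes in `history`. The uniform baseline is included only to show how the count that this version of the strong regret problem seeks to maximize is computed.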

Information

Type
Research Article
Copyright
Copyright © The Author(s), 2020. Published by Cambridge University Press
