
An asymptotically optimal heuristic for general nonstationary finite-horizon restless multi-armed, multi-action bandits

Published online by Cambridge University Press:  03 September 2019

Gabriel Zayas-Cabán*
Affiliation:
University of Wisconsin-Madison
Stefanus Jasin*
Affiliation:
University of Michigan
Guihua Wang*
Affiliation:
University of Michigan
*Postal address: Mechanical Engineering Building, University of Wisconsin-Madison, 1513 University Avenue, Room 3011 Madison, WI 53706-1691, USA. Email address: zayascaban@wisc.edu
**Postal address: Stephen M. Ross School of Business, University of Michigan, 701 Tappan Street, Ann Arbor, MI 48109, USA.

Abstract

We propose an asymptotically optimal heuristic, which we term randomized assignment control (RAC), for a restless multi-armed bandit problem in discrete time with a finite state space. It is constructed using a linear programming relaxation of the original stochastic control formulation. In contrast to most of the existing literature, we consider a finite-horizon problem with multiple actions and a time-dependent (i.e. nonstationary) upper bound on the number of bandits that can be activated in each time period; indeed, our analysis can also be applied in the setting with a nonstationary transition matrix and a nonstationary cost function. The asymptotic setting is obtained by letting the number of bandits and other related parameters grow to infinity. Our main contribution is that the asymptotic optimality of RAC in this general setting does not require indexability properties or the usual stability conditions of the underlying Markov chain (e.g. unichain) or fluid approximation (e.g. global stable attractor). Moreover, our multi-action setting is not restricted to the usual dominant action concept. Finally, we show that RAC is also asymptotically optimal for a dynamic population, where bandits can arrive at and depart from the system at random.
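
To make the phrase "linear programming relaxation" concrete, the following is a minimal sketch of the kind of relaxation commonly written for finite-horizon restless bandits. It is not taken from the article, and all of its notation is assumed for illustration: a finite state set $S$, actions $a \in \{0, 1, \dots, A\}$ with $a = 0$ the passive action, horizon $T$, time-dependent transition probabilities $P_t(s' \mid s, a)$, costs $c_t(s, a)$, an activation budget $b_t$ expressed as a fraction of the population, and an initial state distribution $\mu$. The variable $x_t(s, a)$ stands for the expected fraction of bandits that are in state $s$ and receive action $a$ at time $t$. The per-period activation constraint, which must hold on every sample path in the original problem, is required here only in expectation.

```latex
% Illustrative LP relaxation under assumed notation (not the article's own formulation).
\begin{align*}
\min_{x \ge 0} \quad
  & \sum_{t=1}^{T} \sum_{s \in S} \sum_{a=0}^{A} c_t(s,a)\, x_t(s,a) \\
\text{s.t.} \quad
  & \sum_{a=0}^{A} x_1(s,a) = \mu(s),
  && s \in S, \\
  & \sum_{a=0}^{A} x_{t+1}(s',a)
    = \sum_{s \in S} \sum_{a=0}^{A} P_t(s' \mid s,a)\, x_t(s,a),
  && s' \in S,\ t = 1, \dots, T-1, \\
  & \sum_{s \in S} \sum_{a=1}^{A} x_t(s,a) \le b_t,
  && t = 1, \dots, T.
\end{align*}
% A randomized rule can be read off an optimal solution x^*:
% in state s at time t, choose action a with probability
\[
  \pi_t(a \mid s) = \frac{x_t^{*}(s,a)}{\sum_{a'=0}^{A} x_t^{*}(s,a')}
  \quad \text{whenever } \sum_{a'=0}^{A} x_t^{*}(s,a') > 0 .
\]
```

Under these assumptions the optimal value of the relaxation is a lower bound on the per-bandit expected cost of any feasible policy, since every feasible policy induces a feasible $x$. How RAC turns such a solution into assignments that respect the activation bound in every period is specified in the full article, not in this sketch.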

Information

Type
Original Article
Copyright
© Applied Probability Trust 2019 

Supplementary material: PDF

Zayas-Cabán et al. supplementary material
