
MULTI-ARMED BANDITS UNDER GENERAL DEPRECIATION AND COMMITMENT

Published online by Cambridge University Press:  10 October 2014

Wesley Cowan
Affiliation:
Department of Mathematics, Rutgers University, 110 Frelinghuysen Road, Piscataway, NJ 08854, USA E-mail: cwcowan@mah.rutgers.edu
Michael N. Katehakis
Affiliation:
Department of Management Science and Information Systems, Rutgers Business School, Newark and New Brunswick, 100 Rockafeller Road, Piscataway, NJ 08854, USA E-mail: mnk@rutgers.edu

Abstract


The multi-armed bandit problem has generally been studied in the following setting: at each time step over an infinite horizon, a controller chooses to activate a single process, or bandit, out of a finite collection of independent processes (statistical experiments, populations, etc.) for a single period, receiving a reward that is a function of the activated process and, in doing so, advancing the chosen process. Classically, rewards are discounted by a constant factor β∈(0, 1) per round.
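As a concrete illustration of the classical constant-discount setting, the sketch below computes the total β-discounted reward accrued along a fixed sequence of arm pulls. The arm rewards and the pull sequence are illustrative assumptions, not taken from the paper.

```python
def discounted_reward(pulls, arm_rewards, beta):
    """Total beta-discounted reward along a fixed pull sequence.

    pulls: list of arm indices chosen at times t = 0, 1, 2, ...
    arm_rewards: per-pull reward of each arm (deterministic here,
        purely for illustration).
    beta: constant discount factor in (0, 1).
    """
    return sum((beta ** t) * arm_rewards[a] for t, a in enumerate(pulls))


# Repeatedly pulling one arm with reward 1 yields sum_t beta^t,
# which approaches 1 / (1 - beta) as the horizon grows.
print(discounted_reward([0] * 50, [1.0, 0.5], 0.5))
```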

In this paper, we present a solution to the problem, with potentially non-Markovian, uncountable state space reward processes, under a framework in which, first, the discount factors may be non-uniform and vary over time, and second, the periods of activation of each bandit may not be fixed or uniform, being subject instead to a possibly stochastic duration of activation before a switch to a different bandit is allowed. The solution is based on generalized restart-in-state indices, and it utilizes a view of the problem not as “decisions over state space” but rather as “decisions over time”.
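For orientation, in the classical Markovian, constant-discount special case, a restart-in-state index can be computed by value iteration on an auxiliary MDP in which every state offers a choice between continuing from the current state and restarting from the index state (the characterization due to Katehakis and Veinott). The sketch below assumes a finite-state chain with transition matrix P and reward vector r; it illustrates only this classical index, not the generalized indices developed in the paper.

```python
import numpy as np

def restart_in_state_index(P, r, beta, i, tol=1e-10, max_iter=100_000):
    """Classical restart-in-state index of state i, via value iteration.

    In the auxiliary MDP, each state s offers two actions:
      continue: reward r(s), transition by row P(s, .)
      restart:  act as if currently in state i
    The index is (1 - beta) times the optimal value at state i.
    P: (n, n) row-stochastic transition matrix; r: length-n reward vector.
    """
    P = np.asarray(P, dtype=float)
    r = np.asarray(r, dtype=float)
    V = np.zeros(len(r))
    for _ in range(max_iter):
        cont = r + beta * P @ V          # value of continuing from each state
        V_new = np.maximum(cont, cont[i])  # restart option = continue from i
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return (1.0 - beta) * V[i]
```

As a sanity check, for an absorbing state with constant per-period reward ρ, the index equals ρ, since both continuing and restarting reproduce the same reward stream.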

Information

Type
Research Article
Copyright
Copyright © Cambridge University Press 2014