
ON THE IDENTIFICATION AND MITIGATION OF WEAKNESSES IN THE KNOWLEDGE GRADIENT POLICY FOR MULTI-ARMED BANDITS

Published online by Cambridge University Press:  13 September 2016

James Edwards
Affiliation:
STOR-i Centre for Doctoral Training, Lancaster University, Lancaster LA1 4YF, UK E-mail: j.edwards4@lancaster.ac.uk
Paul Fearnhead
Affiliation:
Department of Mathematics and Statistics, Lancaster University, Lancaster LA1 4YF, UK E-mail: p.fearnhead@lancaster.ac.uk
Kevin Glazebrook
Affiliation:
Department of Management Science, Lancaster University, Lancaster LA1 4YX, UK E-mail: k.glazebrook@lancaster.ac.uk

Abstract

The knowledge gradient (KG) policy was originally proposed for offline ranking and selection problems but has recently been adapted for use in online decision-making in general and multi-armed bandit problems (MABs) in particular. We study its use in a class of exponential family MABs and identify weaknesses, including a propensity to take actions which are dominated with respect to both exploitation and exploration. We propose variants of KG which avoid such errors. These new policies include an index heuristic, which deploys a KG approach to develop an approximation to the Gittins index. A numerical study shows this policy to perform well over a range of MABs including those for which index policies are not optimal. While KG does not take dominated actions when bandits are Gaussian, it fails to be index consistent and, when arms are correlated, appears not to enjoy a performance advantage over competitor policies sufficient to compensate for its greater computational demands.
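To make the policy under study concrete, the following is a minimal sketch of the online KG score in the style of Ryzhov, Powell, and Frazier (2012), specialized to independent Bernoulli arms with Beta posteriors. The function name `kg_choose` and the tie-breaking rule are illustrative choices, not taken from the paper.

```python
def kg_choose(arms, horizon_remaining):
    """Pick an arm by the online KG rule for Bernoulli bandits.

    arms: list of (alpha, beta) Beta posterior parameters, one per arm.
    horizon_remaining: number of pulls left after the current one.
    """
    means = [a / (a + b) for a, b in arms]
    best = max(means)
    scores = []
    for i, (a, b) in enumerate(arms):
        p = a / (a + b)                 # predictive probability of a success
        up = (a + 1) / (a + b + 1)      # posterior mean of arm i after a success
        down = a / (a + b + 1)          # posterior mean of arm i after a failure
        other = max((m for j, m in enumerate(means) if j != i), default=0.0)
        # KG factor: expected one-step improvement in the best posterior mean
        nu = p * max(up, other) + (1 - p) * max(down, other) - best
        # Online KG score: immediate expected reward plus horizon-weighted
        # value of the information gained by pulling arm i now.
        scores.append(p + horizon_remaining * nu)
    return scores.index(max(scores))
```

Note that for some posterior configurations the KG factor `nu` is exactly zero for every arm, so the rule reduces to pure exploitation; configurations of this kind underlie the dominated actions identified in the paper.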

Type
Research Article
Copyright
Copyright © Cambridge University Press 2016 


References

1. Berry, D.A. & Fristedt, B. (1985). Bandit Problems. London: Chapman and Hall.
2. Brezzi, M. & Lai, T.L. (2002). Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control 27(1): 87–108.
3. Chick, S.E. & Gans, N. (2009). Economic analysis of simulation selection problems. Management Science 55(3): 421–437.
4. Ding, Z. & Ryzhov, I.O. (2016). Optimal learning with non-Gaussian rewards. Advances in Applied Probability 48(1): 112–136.
5. Frazier, P.I., Powell, W.B., & Dayanik, S. (2008). A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization 47(5): 2410–2439.
6. Frazier, P.I., Powell, W.B., & Dayanik, S. (2009). The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing 21(4): 599–613.
7. Gittins, J.C., Glazebrook, K.D., & Weber, R. (2011). Multi-armed Bandit Allocation Indices, 2nd ed. Chichester, UK: John Wiley & Sons.
8. Gupta, S.S. & Miescke, K.J. (1996). Bayesian look ahead one-stage sampling allocations for selection of the best population. Journal of Statistical Planning and Inference 54(2): 229–244.
9. Jones, D.R., Schonlau, M., & Welch, W.J. (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13(4): 455–492.
10. Powell, W.B. & Ryzhov, I.O. (2012). Optimal Learning. Hoboken, NJ: John Wiley & Sons.
11. Russo, D. & Van Roy, B. (2014). Learning to optimize via posterior sampling. Mathematics of Operations Research 39(4): 1221–1243.
12. Ryzhov, I.O., Frazier, P.I., & Powell, W.B. (2010). On the robustness of a one-period look-ahead policy in multi-armed bandit problems. Procedia Computer Science 1(1): 1635–1644.
13. Ryzhov, I.O. & Powell, W.B. (2011). The value of information in multi-armed bandits with exponentially distributed rewards. In Proceedings of the 2011 International Conference on Computational Science, pp. 1363–1372.
14. Ryzhov, I.O., Powell, W.B., & Frazier, P.I. (2012). The knowledge gradient algorithm for a general class of online learning problems. Operations Research 60(1): 180–195.
15. Shaked, M. & Shanthikumar, J.G. (2007). Stochastic Orders. New York: Springer.
16. Weber, R. (1992). On the Gittins index for multiarmed bandits. The Annals of Applied Probability 2(4): 1024–1033.
17. Whittle, P. (1980). Multi-armed bandits and the Gittins index. Journal of the Royal Statistical Society. Series B (Methodological) 42(2): 143–149.
18. Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. Journal of Applied Probability 25: 287–298.
19. Yu, Y. (2011). Structural properties of Bayesian bandits with exponential family distributions. arXiv preprint arXiv:1103.3089v1.