
ASYMPTOTICALLY OPTIMAL MULTI-ARMED BANDIT POLICIES UNDER A COST CONSTRAINT

Published online by Cambridge University Press: 05 October 2016

Apostolos Burnetas
Affiliation: Department of Mathematics, University of Athens, Athens, Greece. E-mail: aburnetas@math.uoa.gr
Odysseas Kanavetas
Affiliation: Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey. E-mail: okanavetas@sabanciuniv.edu
Michael N. Katehakis
Affiliation: Department of Management Science and Information Systems, Rutgers University, NJ, USA. E-mail: mnk@rutgers.edu

Abstract

We consider the multi-armed bandit problem under a cost constraint. Successive samples from each population are i.i.d. with unknown distribution, and each sample incurs a known population-dependent cost. The objective is to design an adaptive sampling policy that maximizes the expected sum of the outcomes of n samples, subject to the constraint that the average sampling cost does not exceed a given bound sample-path-wise. We establish an asymptotic lower bound for the regret of feasible uniformly fast convergent policies and construct a class of policies that achieve this bound. We also provide the explicit form of these policies for Normal distributions with unknown means and known variances.
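For concreteness, the setup described in the abstract can be summarized by the following schematic formulation; the notation $X_{i,t}$, $c_i$, $C_0$, $z^*$, $R_\pi(n)$, and $M(\underline{F})$ is ours, and the paper's exact definitions, regularity conditions, and constant differ in detail:

$$\max_{\pi}\; \mathbb{E}_{\pi}\Big[\sum_{t=1}^{n} X_{\pi_t,\,t}\Big] \quad \text{subject to} \quad \frac{1}{n}\sum_{t=1}^{n} c_{\pi_t} \le C_0 \ \text{ for all } n \ (\text{sample-path wise}),$$

$$R_{\pi}(n) = n\,z^* - \mathbb{E}_{\pi}\Big[\sum_{t=1}^{n} X_{\pi_t,\,t}\Big], \qquad \liminf_{n\to\infty} \frac{R_{\pi}(n)}{\log n} \;\ge\; M(\underline{F}) \ \text{ for feasible uniformly fast convergent } \pi,$$

where $X_{i,t}$ denotes the $t$th observation from population $i$ (with unknown distribution $F_i$), $c_i$ its known sampling cost, $z^*$ the maximal expected reward per period over cost-feasible sampling proportions under full information, and $M(\underline{F})$ a distribution-dependent constant identified in the paper; the constructed class of policies attains this bound.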

Type: Research Article
Copyright: © Cambridge University Press 2016

