Safe option-critic: learning safety in the option-critic architecture

Published online by Cambridge University Press:  07 April 2021

Arushi Jain
Affiliation: School of Computer Science, Mila - McGill University, Montreal, Quebec. E-mail: arushi.jain@mail.mcgill.ca
Khimya Khetarpal
Affiliation: School of Computer Science, Mila - McGill University, Montreal, Quebec. E-mail: khimya.khetarpal@mail.mcgill.ca
Doina Precup
Affiliation: School of Computer Science, Mila - McGill University, Montreal, Quebec. E-mail: dprecup@cs.mcgill.ca

Abstract

Designing hierarchical reinforcement learning algorithms that exhibit safe behaviour is not only vital for practical applications but also facilitates a better understanding of an agent’s decisions. We tackle this problem in the options framework (Sutton, Precup & Singh, 1999), a particular way to specify temporally abstract actions that allow an agent to use sub-policies with start and end conditions. We consider a behaviour to be safe if it avoids regions of the state space with high uncertainty in the outcomes of actions. We propose an optimization objective that learns safe options by encouraging the agent to visit states with higher behavioural consistency. The proposed objective results in a trade-off between maximizing the standard expected return and minimizing the effect of model uncertainty on the return. We propose a policy gradient algorithm to optimize the constrained objective function. We examine the quantitative and qualitative behaviours of the proposed approach in a tabular grid world, a continuous-state puddle world, and three games from the Arcade Learning Environment: Ms. Pacman, Amidar, and Q*Bert. Our approach achieves a reduction in the variance of return, boosts performance in environments with intrinsic variability in the reward structure, and compares favourably both with primitive actions and with risk-neutral options.
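As a rough illustration only (the symbols ψ, G, and U below are illustrative and are not the paper’s notation), the trade-off described in the abstract can be sketched as an expected return penalized by an uncertainty term, with a coefficient controlling the trade-off:

\[
J_{\psi}(\theta) \;=\; \mathbb{E}_{\theta}\!\left[\, G \,\right] \;-\; \psi \, \mathbb{E}_{\theta}\!\left[\, U(S, O) \,\right],
\]

where G is the return, U(s, o) is some measure of uncertainty (behavioural inconsistency) of executing option o in state s, and ψ ≥ 0 sets the trade-off, with ψ = 0 recovering the standard risk-neutral option-critic objective. The precise definition of the uncertainty term and the resulting policy gradient updates are given in the full paper.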

Type
Research Article
Copyright
© The Author(s), 2021. Published by Cambridge University Press

Footnotes

These authors contributed equally to this work.

References

Amodei, D., Olah, C., Steinhardt, J., Christiano, P. F., Schulman, J. & Mané, D. 2016. Concrete problems in AI safety. CoRR.
Bacon, P.-L., Harb, J. & Precup, D. 2017. The option-critic architecture. In AAAI, 1726–1734.
Barreto, A., Borsa, D., Hou, S., Comanici, G., Aygün, E., Hamel, P., Toyama, D., Mourad, S., Silver, D., Precup, D., et al. 2019. The option keyboard: combining skills in reinforcement learning. In Advances in Neural Information Processing Systems, 13052–13062.
Barto, A. G. & Mahadevan, S. 2003. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems 13(4), 341–379.
Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. 2013. The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, 253–279.
Borkar, V. S. & Meyn, S. P. 2002. Risk-sensitive optimal control for Markov decision processes with monotone cost. Mathematics of Operations Research 27(1), 192–209.
Daniel, C., Van Hoof, H., Peters, J. & Neumann, G. 2016. Probabilistic inference for determining options in reinforcement learning. Machine Learning 104(2–3), 337–357.
Dietterich, T. G. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13, 227–303.
Fikes, R. E., Hart, P. E. & Nilsson, N. J. 1972. Learning and executing generalized robot plans. Artificial Intelligence 3, 251–288.
Fikes, R. E., Hart, P. E. & Nilsson, N. J. 1981. Learning and executing generalized robot plans. In Readings in Artificial Intelligence. Elsevier, 231–249.
Future of Life Institute 2017. Asilomar AI principles.
García, J. & Fernández, F. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16(1), 1437–1480.
Gehring, C. & Precup, D. 2013. Smart exploration in reinforcement learning using absolute temporal difference errors. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, AAMAS 2013, 1037–1044.
Geibel, P. & Wysotzki, F. 2005. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research (JAIR) 24, 81–108.
Harb, J., Bacon, P.-L., Klissarov, M. & Precup, D. 2018. When waiting is not an option: learning options with a deliberation cost. In AAAI.
Heger, M. 1994. Consideration of risk in reinforcement learning. In Machine Learning Proceedings 1994. Elsevier, 105–111.
Howard, R. A. & Matheson, J. E. 1972. Risk-sensitive Markov decision processes. Management Science 18(7), 356–369.
Iba, G. A. 1989. A heuristic approach to the discovery of macro-operators. Machine Learning 3(4), 285–317.
Iyengar, G. N. 2005. Robust dynamic programming. Mathematics of Operations Research 30(2), 257–280.
Jain, A., Patil, G., Jain, A., Khetarpal, K. & Precup, D. 2021. Variance penalized on-policy and off-policy actor-critic. arXiv preprint arXiv:2102.01985.
Jain, A. & Precup, D. 2018. Eligibility traces for options. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 1008–1016.
Khetarpal, K., Klissarov, M., Chevalier-Boisvert, M., Bacon, P.-L. & Precup, D. 2020. Options of interest: temporal abstraction with interest functions. In Proceedings of the AAAI Conference on Artificial Intelligence, 34, 4444–4451.
Konidaris, G. & Barto, A. G. 2007. Building portable options: skill transfer in reinforcement learning. In IJCAI, 7, 895–900.
Konidaris, G., Kuindersma, S., Grupen, R. A. & Barto, A. G. 2011. Autonomous skill acquisition on a mobile manipulator. In AAAI.
Korf, R. E. 1983. Learning to Solve Problems by Searching for Macro-operators. PhD thesis, Pittsburgh, PA, USA. AAI8425820.
Kulkarni, T. D., Narasimhan, K., Saeedi, A. & Tenenbaum, J. 2016. Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, 3675–3683.
Law, E. L., Coggan, M., Precup, D. & Ratitch, B. 2005. Risk-directed exploration in reinforcement learning. In Planning and Learning in A Priori Unknown or Dynamic Domains, 97.
Lim, S. H., Xu, H. & Mannor, S. 2013. Reinforcement learning in robust Markov decision processes. Advances in Neural Information Processing Systems 26, 701–709.
Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M. & Bowling, M. 2017. Revisiting the arcade learning environment: evaluation protocols and open problems for general agents. arXiv e-prints.
Mankowitz, D. J., Mann, T. A. & Mannor, S. 2016. Adaptive skills adaptive partitions (ASAP). In Advances in Neural Information Processing Systems, 1588–1596.
McGovern, A. & Barto, A. G. 2001. Automatic discovery of subgoals in reinforcement learning using diverse density. In ICML, 1, 361–368.
Menache, I., Mannor, S. & Shimkin, N. 2002. Q-cut – dynamic discovery of sub-goals in reinforcement learning. In European Conference on Machine Learning. Springer, 295–306.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. & Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937.
Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., Maria, A. D., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K. & Silver, D. 2015. Massively parallel methods for deep reinforcement learning. CoRR.
Nilim, A. & El Ghaoui, L. 2005. Robust control of Markov decision processes with uncertain transition matrices. Operations Research 53(5), 780–798.
Parr, R. & Russell, S. J. 1998. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, 1043–1049.
Precup, D. 2000. Temporal Abstraction in Reinforcement Learning. PhD thesis, University of Massachusetts Amherst.
Riemer, M., Liu, M. & Tesauro, G. 2018. Learning abstract options. In Advances in Neural Information Processing Systems, 10424–10434.
Sherstan, C., Ashley, D. R., Bennett, B., Young, K., White, A., White, M. & Sutton, R. S. 2018. Comparing direct and indirect temporal-difference methods for estimating the variance of the return. In Proceedings of Uncertainty in Artificial Intelligence, 63–72.
Stolle, M. & Precup, D. 2002. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation & Approximation. Springer, 212–223.
Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3(1), 9–44.
Sutton, R. S. & Barto, A. G. 1998. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition.
Sutton, R. S., McAllester, D. A., Singh, S. P. & Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057–1063.
Sutton, R. S., Precup, D. & Singh, S. 1999. Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1–2), 181–211.
Tamar, A., Di Castro, D. & Mannor, S. 2012. Policy gradients with variance related risk criteria. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, 387–396.
Tamar, A., Di Castro, D. & Mannor, S. 2016. Learning the variance of the reward-to-go. Journal of Machine Learning Research 17(13), 1–36.
Tamar, A., Xu, H. & Mannor, S. 2013. Scaling up robust MDPs by reinforcement learning. arXiv preprint arXiv:1306.6189.
Van Hasselt, H., Guez, A. & Silver, D. 2016. Deep reinforcement learning with double Q-learning. In AAAI, 16, 2094–2100.
Vezhnevets, A., Mnih, V., Osindero, S., Graves, A., Vinyals, O., Agapiou, J., et al. 2016. Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems, 3486–3494.
Wang, Z., de Freitas, N. & Lanctot, M. 2015. Dueling network architectures for deep reinforcement learning. CoRR.
White, D. 1994. A mathematical programming approach to a problem in variance penalised Markov decision processes. Operations-Research-Spektrum 15(4), 225–230.