Safe option-critic: learning safety in the option-critic architecture

Published online by Cambridge University Press:  07 April 2021

Arushi Jain
Affiliation: School of Computer Science, Mila - McGill University, Montreal, Quebec. E-mail: arushi.jain@mail.mcgill.ca
Khimya Khetarpal
Affiliation: School of Computer Science, Mila - McGill University, Montreal, Quebec. E-mail: khimya.khetarpal@mail.mcgill.ca
Doina Precup
Affiliation: School of Computer Science, Mila - McGill University, Montreal, Quebec. E-mail: dprecup@cs.mcgill.ca

Abstract

Designing hierarchical reinforcement learning algorithms that exhibit safe behaviour is not only vital for practical applications but also facilitates a better understanding of an agent’s decisions. We tackle this problem in the options framework (Sutton, Precup & Singh, 1999), a particular way to specify temporally abstract actions that allow an agent to use sub-policies with start and end conditions. We consider a behaviour to be safe if it avoids regions of the state space with high uncertainty in the outcomes of actions. We propose an optimization objective that learns safe options by encouraging the agent to visit states with higher behavioural consistency. The proposed objective results in a trade-off between maximizing the standard expected return and minimizing the effect of model uncertainty on the return. We propose a policy gradient algorithm to optimize the constrained objective function. We examine the quantitative and qualitative behaviours of the proposed approach in a tabular grid world, a continuous-state puddle world, and three games from the Arcade Learning Environment: Ms. Pacman, Amidar, and Q*Bert. Our approach achieves a reduction in the variance of return, boosts performance in environments with intrinsic variability in the reward structure, and compares favourably both with primitive actions and with risk-neutral options.
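As a rough illustration only (the symbols ψ, G, and U below are illustrative and are not the paper’s notation), the trade-off described in the abstract can be sketched as an expected return penalized by an uncertainty term, with a coefficient controlling the trade-off:

\[
J_{\psi}(\theta) \;=\; \mathbb{E}_{\theta}\!\left[\, G \,\right] \;-\; \psi \, \mathbb{E}_{\theta}\!\left[\, U(S, O) \,\right],
\]

where G is the return, U(s, o) is some measure of uncertainty (behavioural inconsistency) of executing option o in state s, and ψ ≥ 0 sets the trade-off, with ψ = 0 recovering the standard risk-neutral option-critic objective. The precise definition of the uncertainty term and the resulting policy gradient updates are given in the full paper.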

Type
Research Article
Copyright
© The Author(s), 2021. Published by Cambridge University Press

Footnotes

These authors contributed equally to this work.

References

Amodei, D., Olah, C., Steinhardt, J., Christiano, P. F., Schulman, J. & Mané, D. 2016. Concrete problems in AI safety. CoRR.
Bacon, P.-L., Harb, J. & Precup, D. 2017. The option-critic architecture. In AAAI, 1726–1734.
Barreto, A., Borsa, D., Hou, S., Comanici, G., Aygün, E., Hamel, P., Toyama, D., Mourad, S., Silver, D., Precup, D., et al. 2019. The option keyboard: combining skills in reinforcement learning. In Advances in Neural Information Processing Systems, 13052–13062.
Barto, A. G. & Mahadevan, S. 2003. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems 13(4), 341–379.
Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. 2013. The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, 253–279.
Borkar, V. S. & Meyn, S. P. 2002. Risk-sensitive optimal control for Markov decision processes with monotone cost. Mathematics of Operations Research 27(1), 192–209.
Daniel, C., Van Hoof, H., Peters, J. & Neumann, G. 2016. Probabilistic inference for determining options in reinforcement learning. Machine Learning 104(2–3), 337–357.
Dietterich, T. G. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13, 227–303.
Fikes, R. E., Hart, P. E. & Nilsson, N. J. 1972. Learning and executing generalized robot plans. Artificial Intelligence 3, 251–288.
Fikes, R. E., Hart, P. E. & Nilsson, N. J. 1981. Learning and executing generalized robot plans. In Readings in Artificial Intelligence. Elsevier, 231–249.
Future of Life Institute 2017. Asilomar AI principles.
García, J. & Fernández, F. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16(1), 1437–1480.
Gehring, C. & Precup, D. 2013. Smart exploration in reinforcement learning using absolute temporal difference errors. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, AAMAS 2013, 1037–1044.
Geibel, P. & Wysotzki, F. 2005. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research (JAIR) 24, 81–108.
Harb, J., Bacon, P.-L., Klissarov, M. & Precup, D. 2018. When waiting is not an option: learning options with a deliberation cost. In AAAI.
Heger, M. 1994. Consideration of risk in reinforcement learning. In Machine Learning Proceedings 1994. Elsevier, 105–111.
Howard, R. A. & Matheson, J. E. 1972. Risk-sensitive Markov decision processes. Management Science 18(7), 356–369.
Iba, G. A. 1989. A heuristic approach to the discovery of macro-operators. Machine Learning 3(4), 285–317.
Iyengar, G. N. 2005. Robust dynamic programming. Mathematics of Operations Research 30(2), 257–280.
Jain, A., Patil, G., Jain, A., Khetarpal, K. & Precup, D. 2021. Variance penalized on-policy and off-policy actor-critic. arXiv preprint arXiv:2102.01985.
Jain, A. & Precup, D. 2018. Eligibility traces for options. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 1008–1016.
Khetarpal, K., Klissarov, M., Chevalier-Boisvert, M., Bacon, P.-L. & Precup, D. 2020. Options of interest: temporal abstraction with interest functions. In Proceedings of the AAAI Conference on Artificial Intelligence, 34, 4444–4451.
Konidaris, G. & Barto, A. G. 2007. Building portable options: skill transfer in reinforcement learning. In IJCAI, 7, 895–900.
Konidaris, G., Kuindersma, S., Grupen, R. A. & Barto, A. G. 2011. Autonomous skill acquisition on a mobile manipulator. In AAAI.
Korf, R. E. 1983. Learning to Solve Problems by Searching for Macro-operators. PhD thesis, Pittsburgh, PA, USA. AAI8425820.
Kulkarni, T. D., Narasimhan, K., Saeedi, A. & Tenenbaum, J. 2016. Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, 3675–3683.
Law, E. L., Coggan, M., Precup, D. & Ratitch, B. 2005. Risk-directed exploration in reinforcement learning. In Planning and Learning in A Priori Unknown or Dynamic Domains, 97.
Lim, S. H., Xu, H. & Mannor, S. 2013. Reinforcement learning in robust Markov decision processes. Advances in Neural Information Processing Systems 26, 701–709.
Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M. & Bowling, M. 2017. Revisiting the arcade learning environment: evaluation protocols and open problems for general agents. arXiv e-prints.
Mankowitz, D. J., Mann, T. A. & Mannor, S. 2016. Adaptive skills adaptive partitions (ASAP). In Advances in Neural Information Processing Systems, 1588–1596.
McGovern, A. & Barto, A. G. 2001. Automatic discovery of subgoals in reinforcement learning using diverse density. In ICML, 1, 361–368.
Menache, I., Mannor, S. & Shimkin, N. 2002. Q-cut – dynamic discovery of sub-goals in reinforcement learning. In European Conference on Machine Learning. Springer, 295–306.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. & Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937.
Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., Maria, A. D., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K. & Silver, D. 2015. Massively parallel methods for deep reinforcement learning. CoRR.
Nilim, A. & El Ghaoui, L. 2005. Robust control of Markov decision processes with uncertain transition matrices. Operations Research 53(5), 780–798.
Parr, R. & Russell, S. J. 1998. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, 1043–1049.
Precup, D. 2000. Temporal Abstraction in Reinforcement Learning. PhD thesis, University of Massachusetts Amherst.
Riemer, M., Liu, M. & Tesauro, G. 2018. Learning abstract options. In Advances in Neural Information Processing Systems, 10424–10434.
Sherstan, C., Ashley, D. R., Bennett, B., Young, K., White, A., White, M. & Sutton, R. S. 2018. Comparing direct and indirect temporal-difference methods for estimating the variance of the return. In Proceedings of Uncertainty in Artificial Intelligence, 63–72.
Stolle, M. & Precup, D. 2002. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation & Approximation. Springer, 212–223.
Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3(1), 9–44.
Sutton, R. S. & Barto, A. G. 1998. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition.
Sutton, R. S., McAllester, D. A., Singh, S. P. & Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057–1063.
Sutton, R. S., Precup, D. & Singh, S. 1999. Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1–2), 181–211.
Tamar, A., Di Castro, D. & Mannor, S. 2012. Policy gradients with variance related risk criteria. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, 387–396.
Tamar, A., Di Castro, D. & Mannor, S. 2016. Learning the variance of the reward-to-go. Journal of Machine Learning Research 17(13), 1–36.
Tamar, A., Xu, H. & Mannor, S. 2013. Scaling up robust MDPs by reinforcement learning. arXiv preprint arXiv:1306.6189.
Van Hasselt, H., Guez, A. & Silver, D. 2016. Deep reinforcement learning with double Q-learning. In AAAI, 16, 2094–2100.
Vezhnevets, A., Mnih, V., Osindero, S., Graves, A., Vinyals, O., Agapiou, J., et al. 2016. Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems, 3486–3494.
Wang, Z., de Freitas, N. & Lanctot, M. 2015. Dueling network architectures for deep reinforcement learning. CoRR.
White, D. 1994. A mathematical programming approach to a problem in variance penalised Markov decision processes. Operations-Research-Spektrum 15(4), 225–230.