Pre-training with non-expert human demonstration for deep reinforcement learning

Gabriel V. de la Cruz Jr; Yunshu Du; Matthew E. Taylor

doi:10.1017/S0269888919000055

Part of: Adaptive Learning Agents 2018

Published online by Cambridge University Press: 26 July 2019

Affiliation (all authors): School of Electrical Engineering and Computer Science, Washington State University, Pullman, Washington 99164-2752, USA
E-mails: gabriel.delacruz@wsu.edu, yunshu.du@wsu.edu, matthew.e.taylor@wsu.edu

Abstract

Deep reinforcement learning (deep RL) has achieved superior performance in complex sequential tasks by using deep neural networks as function approximators to learn directly from raw input images. However, learning directly from raw images is data inefficient: the agent must learn a feature representation of complex states in addition to learning a policy. As a result, deep RL typically suffers from slow learning speeds and often requires a prohibitively large amount of training time and data to reach reasonable performance, making it inapplicable to real-world settings where data are expensive. In this work, we improve data efficiency in deep RL by addressing one of the two learning goals, feature learning. We leverage supervised learning to pre-train on a small set of non-expert human demonstrations and empirically evaluate our approach using the asynchronous advantage actor-critic (A3C) algorithm in the Atari domain. Our results show significant improvements in learning speed, even when the provided demonstration is noisy and of low quality.
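The pre-training step described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the paper pre-trains a deep network on Atari frames, whereas this example fits a small linear softmax classifier to synthetic (state, action) pairs using a cross-entropy loss. The function names (`pretrain`, `make_noisy_demos`, `accuracy`), the three-feature toy task, and the label-noise model standing in for a non-expert demonstrator are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_noisy_demos(n, noise, seed=0):
    """Hypothetical toy demonstrations: the intended action is the index of
    the largest feature, and a fraction `noise` of labels is replaced by a
    random action to mimic a non-expert demonstrator."""
    rng = random.Random(seed)
    demos = []
    for _ in range(n):
        state = [rng.random() for _ in range(3)]
        action = state.index(max(state))
        if rng.random() < noise:
            action = rng.randrange(3)
        demos.append((state, action))
    return demos


def pretrain(demos, n_features, n_actions, epochs=50, lr=0.1, seed=0):
    """Supervised pre-training: stochastic gradient descent on the
    cross-entropy between the softmax output and the demonstrated action."""
    rng = random.Random(seed)
    weights = [[0.0] * n_features for _ in range(n_actions)]
    demos = list(demos)
    for _ in range(epochs):
        rng.shuffle(demos)
        for state, action in demos:
            logits = [sum(w * x for w, x in zip(row, state)) for row in weights]
            probs = softmax(logits)
            for a in range(n_actions):
                # Gradient of cross-entropy w.r.t. logit a is p_a - 1[a == action].
                grad = probs[a] - (1.0 if a == action else 0.0)
                for j in range(n_features):
                    weights[a][j] -= lr * grad * state[j]
    return weights


def accuracy(weights, demos):
    """Fraction of pairs where the highest-scoring action matches the label."""
    hits = 0
    for state, action in demos:
        logits = [sum(w * x for w, x in zip(row, state)) for row in weights]
        hits += logits.index(max(logits)) == action
    return hits / len(demos)


if __name__ == "__main__":
    noisy = make_noisy_demos(300, noise=0.2, seed=1)
    clean = make_noisy_demos(200, noise=0.0, seed=2)
    w = pretrain(noisy, n_features=3, n_actions=3)
    print("accuracy on clean held-out pairs:", accuracy(w, clean))
```

In the setting the abstract describes, the parameters learned this way would serve as the starting point for the reinforcement learner instead of a random initialization; the sketch stops at the supervised step and does not model the A3C training that follows.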

Information

Type: Adaptive and Learning Agents
Information: The Knowledge Engineering Review, Volume 34, 2019, e10

DOI: https://doi.org/10.1017/S0269888919000055
Copyright: © Cambridge University Press, 2019
