
Introspective Q-learning and learning from demonstration

  • Mao Li (a1), Tim Brys (a2) and Daniel Kudenko (a1) (a3)


One challenge faced by reinforcement learning (RL) agents is that in many environments the reward signal is sparse, leading to slow improvement of the agent’s performance in early learning episodes. Potential-based reward shaping can mitigate this problem by incorporating an expert’s domain knowledge into learning through a potential function. Past work on reinforcement learning from demonstration (RLfD) directly mapped (sub-optimal) human expert demonstrations to a potential function, which can speed up RL. In this paper we propose an introspective RL agent that speeds up learning significantly further. An introspective RL agent records its state–action decisions and experience during learning in a priority queue. Good-quality decisions, as judged by a Monte Carlo estimate, are kept in the queue, while poorer decisions are rejected. The queue is then used as demonstration data to speed up RL via reward shaping. A human expert’s demonstration can be used to initialize the priority queue before the learning process starts. Experimental validation in the 4-dimensional CartPole domain and the 27-dimensional Super Mario AI domain shows that our approach significantly outperforms non-introspective RL and state-of-the-art RLfD approaches in both domains.
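
The mechanism summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the potential function, the queue capacity, or the update rule. The sketch assumes a bounded min-heap as the priority queue, the discounted return observed in a finished episode as the Monte Carlo quality estimate, and a hypothetical Gaussian similarity potential with width `sigma`. The names `DemonstrationQueue`, `monte_carlo_returns`, `potential` and `shaping_reward` are illustrative only. The shaping term itself uses the standard potential-based form over state–action pairs, F = γΦ(s', a') − Φ(s, a).

```python
import heapq
import itertools
import math


class DemonstrationQueue:
    """Bounded priority queue keeping the highest-return (state, action) pairs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []                # min-heap: poorest kept entry sits at index 0
        self._tie = itertools.count()  # tie-breaker so equal returns never compare states

    def add(self, ret, state, action):
        entry = (ret, next(self._tie), tuple(state), action)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif ret > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)  # evict the poorest stored decision
        # otherwise the new decision is rejected

    def entries(self):
        return [(s, a) for _, _, s, a in self._heap]


def monte_carlo_returns(rewards, gamma):
    """Discounted return observed from each step of one finished episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]


def potential(queue, state, action, sigma=0.5):
    """Assumed potential: similarity to the closest stored state with the same action."""
    best = 0.0
    for s, a in queue.entries():
        if a == action:
            d2 = sum((x - y) ** 2 for x, y in zip(s, state))
            best = max(best, math.exp(-d2 / (2 * sigma ** 2)))
    return best


def shaping_reward(queue, s, a, s_next, a_next, gamma):
    """Potential-based shaping term F = gamma * Phi(s', a') - Phi(s, a)."""
    return gamma * potential(queue, s_next, a_next) - potential(queue, s, a)


# Toy episode of (state, action, reward) steps with a single sparse reward at the end.
gamma = 0.9
episode = [((0.0, 0.0), 1, 0.0), ((0.1, 0.0), 1, 0.0), ((0.2, 0.0), 0, 1.0)]
returns = monte_carlo_returns([r for _, _, r in episode], gamma)

queue = DemonstrationQueue(capacity=2)
for (s, a, _), g in zip(episode, returns):
    queue.add(g, s, a)  # only the two highest-return decisions survive

bonus = shaping_reward(queue, (0.1, 0.0), 1, (0.2, 0.0), 0, gamma)
```

In this sketch, a human demonstration would be used by calling `add` on each demonstrated state–action pair before learning starts. The value returned by `shaping_reward` would then be added to the environment reward inside an otherwise unchanged Q-learning update.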





