Introspective Q-learning and learning from demonstration

Published online by Cambridge University Press: 01 January 2019

Mao Li
Affiliation:
Computer Science Department, University of York, Deramore Ln, Heslington, York YO10 5GH, United Kingdom; e-mail: ml1480@york.ac.uk
Tim Brys
Affiliation:
Computer Science Department, Vrije Universiteit Brussel, Artificial Intelligence Lab, Pleinlaan 9, 3rd floor, 1050 Brussels, Belgium; e-mail: timbrys@vub.ac.be
Daniel Kudenko
Affiliation:
Computer Science Department, University of York, Deramore Ln, Heslington, York YO10 5GH, United Kingdom; JetBrains Research, St Petersburg, Russia; e-mail: daniel.kudenko@york.ac.uk

Abstract

One challenge faced by reinforcement learning (RL) agents is that in many environments the reward signal is sparse, leading to slow improvement of the agent’s performance in early learning episodes. Potential-based reward shaping can help to resolve this issue by incorporating an expert’s domain knowledge into the learning through a potential function. Past work on reinforcement learning from demonstration (RLfD) directly mapped (sub-optimal) human expert demonstrations to a potential function, which can speed up RL. In this paper we propose an introspective RL agent that speeds up learning significantly further. An introspective RL agent records its state–action decisions and experience during learning in a priority queue. Decisions judged good by a Monte Carlo estimate of their return are kept in the queue, while poorer decisions are rejected. The queue is then used as a demonstration to speed up RL via reward shaping. A human expert’s demonstration can be used to initialize the priority queue before learning starts. Experimental validation in the 4-dimensional CartPole domain and the 27-dimensional Super Mario AI domain shows that our approach significantly outperforms non-introspective RL and state-of-the-art RLfD approaches in both domains.
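The full text is behind access control, but the abstract is specific enough to sketch the mechanism it describes. Below is a minimal, illustrative Python sketch assuming a tabular Q-learning agent: at the end of each episode, visited decisions are scored by their Monte Carlo return, the best are retained in a bounded priority queue (optionally seeded with a human demonstration), and the retained decisions define a state–action potential used for reward shaping. All names here (IntrospectiveAgent, QUEUE_CAP, the exact-match potential) are assumptions for illustration, not taken from the paper.

```python
import heapq
import itertools
import random
from collections import defaultdict

GAMMA = 0.99       # discount factor
ALPHA = 0.1        # learning rate
QUEUE_CAP = 100    # maximum number of retained "good" decisions

class IntrospectiveAgent:
    """Illustrative tabular agent that shapes rewards from its own best decisions."""

    def __init__(self, actions, demo=None):
        self.actions = actions
        self.q = defaultdict(float)        # Q(s, a) table
        self._tie = itertools.count()      # tie-breaker so the heap never compares states
        # Min-heap ordered by Monte Carlo return: the worst retained
        # decision sits at the root and is evicted first.
        self.queue = []
        for mc_return, s, a in (demo or []):   # optional human-demonstration seed
            self._push(mc_return, s, a)

    def _push(self, mc_return, s, a):
        entry = (mc_return, next(self._tie), s, a)
        if len(self.queue) < QUEUE_CAP:
            heapq.heappush(self.queue, entry)
        elif mc_return > self.queue[0][0]:     # better than the worst kept decision
            heapq.heapreplace(self.queue, entry)

    def potential(self, s, a):
        # Phi(s, a) = 1 if this decision matches a retained good decision.
        # A real implementation would use a state-similarity measure instead
        # of exact matching, especially in continuous domains like CartPole.
        return 1.0 if any(qs == s and qa == a for _, _, qs, qa in self.queue) else 0.0

    def act(self, s, epsilon=0.1):
        if random.random() < epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(s, a)])

    def update(self, s, a, r, s2, a2):
        # Potential-based shaping over (state, action) pairs, SARSA-style:
        # F = gamma * Phi(s', a') - Phi(s, a), added to the environment reward.
        f = GAMMA * self.potential(s2, a2) - self.potential(s, a)
        td_target = r + f + GAMMA * self.q[(s2, a2)]
        self.q[(s, a)] += ALPHA * (td_target - self.q[(s, a)])

    def end_episode(self, trajectory):
        # Introspection step: score every visited decision by its Monte Carlo
        # return and keep only the best ones in the bounded priority queue.
        g = 0.0
        for s, a, r in reversed(trajectory):   # trajectory: [(s, a, r), ...]
            g = r + GAMMA * g
            self._push(g, s, a)
```

Passing a human demonstration as `demo` before learning starts corresponds to the RLfD initialization the abstract mentions; with `demo=None`, the queue is filled purely by the agent's own introspection.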

Information

Type
Adaptive and Learning Agents
Copyright
© Cambridge University Press, 2019 
