Chapter 47: Q-Learning

pp. 1971-2007

Summary

The temporal learning algorithms TD(0) and TD(λ) of the previous chapter are useful procedures for state value evaluation; i.e., they permit the estimation of the state value function vπ(s) for a given target policy π(a|s) by observing actions and rewards arising from this policy (on-policy learning) or from another behavior policy (off-policy learning). In most situations, however, we are not interested in state values but rather in determining optimal policies, denoted by π⋆(a|s); that is, in selecting the optimal actions an agent should take in a Markov decision process (MDP).
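
To make the shift from value evaluation to policy determination concrete, the following is a minimal sketch of the standard tabular Q-learning update. It is not taken from the chapter. The chain-shaped MDP, the function name `q_learning`, and all parameter values (`alpha`, `gamma`, the number of episodes) are assumptions chosen for illustration. Actions are drawn from a uniformly random behavior policy, so the learning is off-policy, and a greedy policy is read off the learned table afterwards.

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9, seed=0):
    """Tabular Q-learning on a hypothetical deterministic chain MDP.

    States are 0..n_states-1. Action 1 moves right, action 0 moves left
    (staying put at state 0). Entering the last state gives reward 1 and
    ends the episode; every other transition gives reward 0.
    """
    rng = random.Random(seed)
    n_actions = 2
    goal = n_states - 1
    Q = [[0.0] * n_actions for _ in range(n_states)]

    for _ in range(episodes):
        s = 0
        while s != goal:
            # Behavior policy: uniformly random (assumed for simplicity).
            a = rng.randrange(n_actions)
            s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == goal else 0.0
            # Target uses the greedy value of the next state, whatever
            # action the behavior policy takes next (off-policy).
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

def greedy_policy(Q):
    """Pick the action with the largest estimated value in each state."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in Q]

Q = q_learning()
policy = greedy_policy(Q)
```

In this toy problem the greedy policy for the non-terminal states should select action 1 (move right), which is the optimal choice here; the entry for the terminal state is never updated and carries no meaning.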
