The temporal-difference learning algorithms TD(0) and TD(λ) of the previous chapter are useful procedures for state-value evaluation; that is, they permit the estimation of the state value function vπ(s) for a given target policy π(a|s) by observing actions and rewards arising either from this policy (on-policy learning) or from another behavior policy (off-policy learning). In most situations, however, we are not interested in state values as such but rather in determining optimal policies, denoted by π⋆(a|s), i.e., in selecting the optimal actions an agent should follow in a Markov decision process (MDP).
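For concreteness, the sketch below illustrates the kind of state-value evaluation that TD(0) performs, applied to the classic five-state random-walk example with a fixed (random) policy; the variable names, step size, and discount factor are illustrative choices and not the book's notation.

```python
import numpy as np

# Tabular TD(0) prediction on a five-state random walk:
# terminal state 0 on the left, terminal state 6 on the right,
# non-terminal states 1..5, reward 1 only upon reaching the right end.
n_states = 5
alpha, gamma = 0.1, 1.0          # step size and discount factor (assumed values)
v = np.zeros(n_states + 2)       # value estimates, including the two terminal states

rng = np.random.default_rng(0)

for episode in range(1000):
    s = n_states // 2 + 1                         # start in the center state
    while 0 < s < n_states + 1:                   # loop until a terminal state
        s_next = s + rng.choice([-1, 1])          # random-walk behavior policy
        r = 1.0 if s_next == n_states + 1 else 0.0
        # TD(0) update: move v(s) toward the bootstrapped target r + gamma * v(s')
        v[s] += alpha * (r + gamma * v[s_next] - v[s])
        s = s_next

print(np.round(v[1:-1], 2))  # estimates approach the true values 1/6, 2/6, ..., 5/6
```

Such a procedure only evaluates a given policy; the remainder of the chapter is concerned with the complementary problem of improving the policy toward π⋆(a|s).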