from Part III - Partially Observed Markov Decision Processes: Structural Results
Published online by Cambridge University Press: 05 April 2016
Introduction
The previous chapter established conditions under which the value function of a POMDP is monotone with respect to the MLR order. Conditions were also given for the optimal policy of a two-state POMDP to be monotone (threshold). This and the next chapter develop structural results for the optimal policy of multi-state POMDPs. To establish these structural results, we will use submodularity and stochastic dominance on the lattice of belief states to analyze Bellman's dynamic programming equation; such analysis falls under the area of “Lattice Programming” [144]. Lattice programming and “monotone comparative statics”, pioneered by Topkis [322] (see also [15, 26]), provide a general set of sufficient conditions for the existence of monotone strategies. Once a POMDP is shown to have a monotone policy, gradient-based algorithms that exploit this structure can be designed to estimate the policy. This and the next two chapters rely heavily on the structural results for filtering (Chapter 10) and for monotone value functions (Chapter 11). Please see Figure 10.1 on page 220 for the context of this chapter.
Main results
This chapter deals with structural results for the optimal policy of stopping time POMDPs. Stopping time POMDPs have action space U = {1 (stop), 2 (continue)}. They arise in sequential detection problems such as quickest change detection and machine replacement. Establishing structural results for stopping time POMDPs is easier than for general POMDPs (which are considered in the next chapter). The main structural results in this chapter regarding stopping time POMDPs are:
Convexity of stopping region: §12.2 shows that the set of beliefs where it is optimal to apply action 1 (stop) is a convex subset of the belief space. This result unifies several well-known results about the convexity of the stopping set for sequential detection problems.
Monotonicity of the optimal policy: §12.3 gives conditions under which the optimal policy of a stopping time POMDP is monotone with respect to the monotone likelihood ratio (MLR) order. The MLR order is naturally suited for POMDPs since it is preserved under conditional expectations.
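As a rough illustration of the MLR order, the sketch below checks the standard pairwise condition π1(i) π2(j) ≤ π2(i) π1(j) for all i < j, which says that π1 dominates π2. It also shows one simple case of the preservation property: when two beliefs are multiplied by the same observation likelihood and renormalized (a Bayes step with no state transition), their likelihood ratios are unchanged, so the ordering survives. The function names, the tolerance, and the numerical beliefs are assumptions made for illustration; they do not come from the chapter.

```python
from itertools import combinations


def mlr_dominates(p1, p2, tol=1e-12):
    """Return True if p1 dominates p2 in the MLR order.

    Uses the pairwise test p1[i] * p2[j] <= p2[i] * p1[j] for all i < j,
    i.e. the ratio p1 / p2 is nondecreasing in the state index.
    """
    if len(p1) != len(p2):
        raise ValueError("beliefs must have the same dimension")
    return all(
        p1[i] * p2[j] <= p2[i] * p1[j] + tol
        for i, j in combinations(range(len(p1)), 2)
    )


def bayes_update(prior, likelihood):
    """Multiply a belief by an observation likelihood and renormalize."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    if total <= 0.0:
        raise ValueError("observation has zero probability under this prior")
    return [u / total for u in unnormalized]


# Hypothetical three-state beliefs and likelihood, chosen only for illustration.
pi_high = [0.1, 0.3, 0.6]
pi_low = [0.5, 0.3, 0.2]
likelihood = [0.7, 0.2, 0.1]

post_high = bayes_update(pi_high, likelihood)
post_low = bayes_update(pi_low, likelihood)

print(mlr_dominates(pi_high, pi_low))      # True
print(mlr_dominates(post_high, post_low))  # True
```

The MLR order is only a partial order for three or more states: some pairs of beliefs, such as [0.2, 0.6, 0.2] and [0.4, 0.2, 0.4], are not comparable in either direction.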
Figure 12.1 displays these structural results. For X = 2, we will show that the stopping set is the interval [π*, 1] and the optimal policy μ*(π) is a step function (see Figure 12.1(a)). So it is only necessary to compute the threshold state π*.
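For X = 2, the step-function structure means the whole policy is determined by a single number. The following minimal sketch assumes the belief is summarized by one scalar in [0, 1] and that the stopping set is [π*, 1], as stated above. The threshold value 0.7 and the function name are hypothetical; computing the actual π* is a separate problem that this sketch does not address.

```python
STOP, CONTINUE = 1, 2  # action labels matching U = {1 (stop), 2 (continue)}


def threshold_policy(belief, threshold):
    """Step-function policy: stop when the scalar belief is in [threshold, 1]."""
    if not 0.0 <= belief <= 1.0:
        raise ValueError("belief must lie in [0, 1]")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    return STOP if belief >= threshold else CONTINUE


# Hypothetical threshold, for illustration only.
pi_star = 0.7

actions = [threshold_policy(b, pi_star) for b in (0.0, 0.5, 0.7, 0.9, 1.0)]
print(actions)  # [2, 2, 1, 1, 1]
```

Because the policy is parametrized by the scalar π* alone, searching for the optimal policy reduces to a one-dimensional search over the threshold.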