Structural results for stopping time POMDPs

Vikram Krishnamurthy

doi:10.1017/CBO9781316471104.016

12 - Structural results for stopping time POMDPs

from Part III - Partially Observed Markov Decision Processes: Structural Results

Published online by Cambridge University Press: 05 April 2016

Affiliation: Vikram Krishnamurthy, Cornell University/Cornell Tech

Summary

Introduction

The previous chapter established conditions under which the value function of a POMDP is monotone with respect to the MLR order. Conditions were also given for the optimal policy of a two-state POMDP to be monotone (threshold). This and the next chapter develop structural results for the optimal policy of multi-state POMDPs. To establish these results, we will use submodularity and stochastic dominance on the lattice of belief states to analyze Bellman's dynamic programming equation; such analysis falls under the area of “Lattice Programming” [144]. Lattice programming and “monotone comparative statics”, pioneered by Topkis [322] (see also [15, 26]), provide a general set of sufficient conditions for the existence of monotone strategies. Once a POMDP is shown to have a monotone policy, gradient-based algorithms that exploit this structure can be designed to estimate the policy. This and the next two chapters rely heavily on the structural results for filtering (Chapter 10) and the monotone value function (Chapter 11). Please see Figure 10.1 on page 220 for the context of this chapter.

Main results

This chapter deals with structural results for the optimal policy of stopping time POMDPs. Stopping time POMDPs have action space U = {1 (stop), 2 (continue)}. They arise in sequential detection problems such as quickest change detection and machine replacement. Establishing structural results for stopping time POMDPs is easier than for general POMDPs (which are considered in the next chapter). The main structural results in this chapter regarding stopping time POMDPs are:

Convexity of stopping region: §12.2 shows that the set of beliefs where it is optimal to apply action 1 (stop) is a convex subset of the belief space. This result unifies several well-known results about the convexity of the stopping set for sequential detection problems.
Monotonicity of the optimal policy: §12.3 gives conditions under which the optimal policy of a stopping time POMDP is monotone with respect to the monotone likelihood ratio (MLR) order. The MLR order is naturally suited for POMDPs since it is preserved under conditional expectations.
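
The MLR order mentioned above can be illustrated with a short Python sketch. It assumes the standard definition, which this summary does not state: belief p dominates belief q in the MLR order if p(i) q(j) <= q(i) p(j) for all i < j. The function names `mlr_dominates` and `bayes_update`, the example beliefs and the likelihood vector are all hypothetical. The update step is a simplification that applies Bayes' rule with a shared observation likelihood and no state transition, which is enough to show the ordering surviving a conditioning step.

```python
from typing import List, Sequence


def mlr_dominates(p: Sequence[float], q: Sequence[float], tol: float = 1e-12) -> bool:
    """Return True if p dominates q in the MLR order.

    Assumed definition: p(i) * q(j) <= q(i) * p(j) for every pair i < j.
    A small tolerance absorbs floating-point rounding.
    """
    n = len(p)
    if len(q) != n:
        raise ValueError("beliefs must have the same length")
    return all(
        p[i] * q[j] <= q[i] * p[j] + tol
        for i in range(n)
        for j in range(i + 1, n)
    )


def bayes_update(prior: Sequence[float], likelihood: Sequence[float]) -> List[float]:
    """Condition a belief on one observation (no state transition, a simplification)."""
    unnormalized = [b * x for b, x in zip(likelihood, prior)]
    total = sum(unnormalized)
    if total <= 0.0:
        raise ValueError("observation has zero probability under this prior")
    return [u / total for u in unnormalized]


# Hypothetical three-state beliefs: p puts more mass on higher states than q.
p = [0.2, 0.3, 0.5]
q = [0.5, 0.3, 0.2]
likelihood = [0.7, 0.2, 0.1]  # hypothetical P(observation | state i)

p_post = bayes_update(p, likelihood)
q_post = bayes_update(q, likelihood)

print(mlr_dominates(p, q))            # ordering before the update
print(mlr_dominates(p_post, q_post))  # ordering after the update
```

Multiplying both beliefs elementwise by the same likelihood scales both sides of each cross-product inequality by the same positive factor, so the ordering is unchanged by the update in this simplified setting.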

Figure 12.1 displays these structural results. For X = 2, we will show that the stopping set is the interval [π*, 1] and the optimal policy μ*(π) is a step function; see Figure 12.1(a). So it is only necessary to compute the threshold state π*.
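
For the two-state case, a threshold policy of this kind can be sketched in a few lines of Python. This is only an illustration: it assumes the belief is summarized by a single scalar in [0, 1] (the coordinate in which the stopping set is the interval [π*, 1]), and the function name `threshold_policy` and the threshold value 0.6 are hypothetical. Computing π* itself is not shown.

```python
STOP, CONTINUE = 1, 2  # action labels from U = {1 (stop), 2 (continue)}


def threshold_policy(belief: float, threshold: float) -> int:
    """Step-function policy: stop when the scalar belief lies in [threshold, 1]."""
    if not 0.0 <= belief <= 1.0:
        raise ValueError("belief must lie in [0, 1]")
    return STOP if belief >= threshold else CONTINUE


pi_star = 0.6  # hypothetical threshold, not a value from the chapter

print(threshold_policy(0.3, pi_star))  # below the threshold: continue
print(threshold_policy(0.7, pi_star))  # inside the stopping interval: stop
```

Because the whole policy is determined by one number, any search or estimation procedure only has to work over the scalar threshold rather than over an arbitrary function of the belief.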

Information

Type: Chapter
Information: Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing, pp. 255-283

DOI: https://doi.org/10.1017/CBO9781316471104.016

Publisher: Cambridge University Press

Print publication year: 2016
