On the correctness of monadic backward induction

Abstract In control theory, to solve a finite-horizon sequential decision problem (SDP) commonly means to find a list of decision rules that result in an optimal expected total reward (or cost) when taking a given number of decision steps. SDPs are routinely solved using Bellman's backward induction. Textbook authors (e.g. Bertsekas or Puterman) typically give more or less formal proofs to show that the backward induction algorithm is correct as a solution method for deterministic and stochastic SDPs. Botta, Jansson and Ionescu propose a generic framework for finite-horizon, monadic SDPs together with a monadic version of backward induction for solving such SDPs. In monadic SDPs, the monad captures a generic notion of uncertainty, while a generic measure function aggregates rewards. In the present paper, we define a notion of correctness for monadic SDPs and identify three conditions that allow us to prove a correctness result for monadic backward induction that is comparable to textbook correctness proofs for ordinary backward induction. The conditions that we impose are fairly general and can be cast in category-theoretical terms using the notion of Eilenberg–Moore algebra. They hold in familiar settings like those of deterministic or stochastic SDPs, but we also give examples in which they fail. Our results show that backward induction can safely be employed for a broader class of SDPs than usually treated in textbooks. However, they also rule out certain instances that were considered admissible in the context of Botta et al.'s generic framework. Our development is formalised in Idris as an extension of the Botta et al. framework and the sources are available as supplementary material.


Introduction
Backward induction is a method introduced by Bellman (1957) that is routinely used to solve finite-horizon sequential decision problems (SDPs). Such problems lie at the core of many applications in economics, logistics, and computer science (Finus et al., 2003; Helm, 2003; Heitzig, 2012; Gintis, 2007; Botta et al., 2013; De Moor, 1995, 1999). Examples include inventory, scheduling and shortest path problems, but also the search for optimal strategies in games (Bertsekas, 1995; Diederich, 2001). Botta, Jansson and Ionescu (2017a) propose a generic framework for monadic finite-horizon SDPs as a generalisation of the deterministic, non-deterministic and stochastic SDPs treated in control theory textbooks (Bertsekas, 1995; Puterman, 2014). This framework makes it possible to specify such problems and to solve them with a generic version of backward induction that we will refer to as monadic backward induction.
The Botta-Jansson-Ionescu framework, subsequently referred to as the BJI-framework, BJI-theory or simply the framework, already includes a verification of monadic backward induction with respect to a certain underlying value function (see Sec. 3.2). However, in the literature on stochastic SDPs this formulation of the value function is itself part of the backward induction algorithm and needs to be verified against an optimisation criterion, the expected total reward (Puterman, 2014, Ch. 4.2). For stochastic SDPs, semi-formal proofs can be found in textbooks, but monadic SDPs are substantially more general than the stochastic SDPs for which these results are established. This observation raises a number of questions:

• What exactly should "correctness" mean for a solution of monadic SDPs?
• Does monadic backward induction provide correct solutions in this sense for monadic SDPs in their full generality?
• And if not, is there a class of monadic SDPs for which monadic backward induction does provide provably correct solutions?
In the present paper we address these questions and make the following contributions to answering them:

• We put forward a formal specification that monadic backward induction should meet in order to be considered "correct" as a solution method for monadic SDPs. This specification uses an optimisation criterion that is a generic version of the expected total reward of standard control theory textbooks. In analogy, we call this criterion the measured total reward.
• We consider the value function underlying monadic backward induction as "correct" if it computes the measured total reward.
• If the value function is correct, monadic backward induction can be proven to be correct in our sense by extending the result of Botta et al. (2017a). However, we show that this is not necessarily the case, i.e. the value function does not compute the measured total reward for arbitrary monadic SDPs.
• We therefore formulate conditions that identify a class of monadic SDPs for which the value function, and thus monadic backward induction, can be shown to be correct. The conditions are fairly simple and allow for a neat description in category-theoretical terms using the notion of Eilenberg-Moore algebra.
• We give a formalised proof that monadic backward induction fulfils the correctness criterion if the conditions hold. This correctness result can be seen as a generic version of correctness results for standard backward induction like (Bertsekas, 1995, Prop. 1.3.1) and (Puterman, 2014, Th. 4.5.1.c).
Our results rule out the application of backward induction to certain monadic SDPs that were previously considered admissible in the BJI-framework. Thus, they complement the verification result of Botta et al. and provide a necessary clarification. Still, the new conditions are simple enough to be checked for non-standard instantiations of the framework. This makes it possible to broaden the applicability of backward induction to settings which are not commonly discussed in the literature and to obtain a formalised proof of correctness with considerably less effort. It is worth stressing that our conditions can be useful for anyone interested in applying monadic backward induction in non-standard situations, completely independently of the BJI-framework.
Finally, the value function underlying monadic backward induction is also interesting in itself. Provided the conditions hold, it can be used to compute the measured total reward efficiently, using a method reminiscent of a thinning algorithm (Bird & Gibbons, 2020, Ch. 10).
For the reader unfamiliar with SDPs, we provide a brief informal overview and two simple examples in the next section. We recap the BJI-framework and its (partial) verification result for monadic backward induction in Sec. 3. In Sec. 4 we specify correctness for monadic backward induction and the BJI-value function. We also show that in the general monadic case the value function does not necessarily meet the specification. To resolve this problem, we identify conditions under which the value function does meet the specification. These conditions are stated and analysed in Sec. 5. In Sec. 6 we prove that, given the conditions hold, the BJI-value function and monadic backward induction are correct in the sense defined in Sec. 4. We discuss the conditions from a more abstract perspective in Sec. 7 and conclude in Sec. 8.
Throughout the paper we use Idris as our host language (Brady, 2013, 2017). We assume some familiarity with Haskell-like syntax and with notions like functor and monad as used in functional programming. We tacitly consider types as logical statements and programs as proofs, justified by the propositions-as-types correspondence (for an accessible introduction see Wadler, 2015).
Source code. Our development is formalised in Idris as an extension of a lightweight version of the BJI-framework. The proofs are machine-checked and the source code is available as supplementary material attached to this paper. The sources of this document have been written in literal Idris and are available at (Brede & Botta, 2021), together with some example code. All source files can be type-checked with Idris 1.3.2.

Finite-horizon Sequential Decision Problems
In deterministic, non-deterministic and stochastic finite-horizon SDPs, a decision maker seeks to control the evolution of a (dynamical) system at a finite number of decision steps by selecting certain controls in sequence, one after the other. The controls available to the decision maker at a given decision step typically depend on the state of the system at that step.
In deterministic problems, selecting a control in a state at decision step t : N determines a unique next state at decision step t + 1 through a given transition function. In non-deterministic problems, the transition function yields a whole set of possible states at the next decision step. In stochastic problems, the transition function yields a probability distribution on states at the next decision step.
The notion of a monadic problem generalises those of deterministic, non-deterministic and stochastic problems through a transition function that yields an M-structure of next states, where M is a monad. For example, the identity monad can be applied to model deterministic systems. Non-deterministic systems can be represented in terms of transition functions that return lists (or some other representation of sets) of next states. Stochastic systems can be represented in terms of probability distribution monads (Giry, 1981; Erwig & Kollmansberger, 2006; Audebaud & Paulin-Mohring, 2009; Jacobs, 2011). The uncertainty monad, the states, the controls and the next function define what is often called a decision process.
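As a rough first impression, the three special cases correspond to transition functions of the following shapes (State, Control and Dist are placeholder names for this sketch only; the framework's precise dependently typed components are introduced in Sec. 3):

  nextDet    : State → Control → State        -- deterministic, M = Id
  nextNonDet : State → Control → List State   -- non-deterministic, M = List
  nextStoch  : State → Control → Dist State   -- stochastic, M a probability monad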
The idea of sequential decision problems is that each single decision yields a reward and these rewards add up to a total reward over all decision steps. Rewards are often represented by values of a numeric type, and added up using the canonical addition. If the transition function, and thus the evolution of the system, is not deterministic, then the resulting possible total rewards need to be aggregated to yield a single outcome value. In stochastic SDPs, evolving the underlying stochastic system leads to a probability distribution on total rewards which is usually aggregated using the familiar expected value measure. The value thus obtained is called the expected total reward (Puterman, 2014, Ch. 4.1.2) and its role is central: it is the quantity that is to be optimised in an SDP.
In monadic SDPs, the measure is generic, i.e. it is not fixed in advance but has to be given as part of the specification of a concrete problem. Therefore we will generalise the notion of expected total reward to a corresponding notion for monadic SDPs that, in analogy, we call the measured total reward (see Sec. 4).
Solving a stochastic SDP consists in finding a list of rules for selecting controls that maximises the expected total reward for n decision steps when starting at decision step t. Similarly, we define that solving a monadic SDP consists in finding a list of rules for selecting controls that maximises the measured total reward. This means that, when starting from any initial state at decision step t, following the computed list of rules for selecting controls will result in a value that is maximal as the measure of the sum of rewards along all possible trajectories rooted in that initial state.
Equivalently, rewards can instead be considered as costs that need to be minimised. This dual perspective is taken e.g. in (Bertsekas, 1995). In the subsequent sections we will follow the terminology of the BJI-framework and (Puterman, 2014) and speak of "rewards", but our second example below will illustrate the "cost" perspective.
In mathematical theories of optimal control, the rules for selecting controls are called policies. A policy for a decision step is simply a function that maps each possible state to a control. As mentioned above, the controls available in a given state typically depend on that state; thus policies are dependently typed functions. A list of such policies is called a policy sequence.
Fig. 1: Transition graphs for the example SDPs described in Sec. 2. The edge labels denote pairs control | reward for the associated transitions. In the first example, state and control spaces are constant over time; we have therefore omitted the temporal dimension.
The central idea underlying backward induction is to compute a globally optimal solution of a multi-step SDP incrementally, by solving local optimisation problems at each decision step. This is captured by Bellman's principle: extending an optimal policy sequence with an optimal policy yields again an optimal policy sequence. However, as we will see in Sec. 4.2, one has to check carefully whether backward induction is indeed applicable to a given SDP.
Two features are crucial for finite-horizon, monadic SDPs to be solvable with the BJI-framework that we study in this paper: (1) the number of decision steps has to be given explicitly as input to the backward induction and (2) at each decision step, the number of possible next states has to be finite. While (2) is a necessary condition for backward induction to be computable, (1) is a genuine limitation of the BJI-framework: in many SDPs, for example in a game of tic-tac-toe, the number of decision steps is bounded but not known a priori.
Before we discuss the BJI-framework in the next section, we illustrate the notion of sequential decision problem with two simple examples, one in which the purpose is to maximise rewards and one in which the purpose is to minimise costs. Rewards and costs in these examples are just natural numbers and are summed up with ordinary addition. The first example is a non-deterministic SDP. Although it is somewhat oversimplified, it has the advantage of being tractable for computations by hand while still being sufficient as a basis for illustrations in sections 3-5. The second example is a deterministic SDP that stands for the important class of scheduling SDPs. This problem highlights why dependent types are necessary to model state and control spaces accurately. As in these simple examples state and control spaces are finite, the transition functions can be described by directed graphs. These are given in Fig. 1.
Example 1 (A toy climate problem). Our first example is a variant of a stochastic climate science SDP studied in (Botta et al., 2018), stripped down to a simple non-deterministic SDP. At every decision step, the world may be in one of two states, namely Good or Bad, and the controls determine whether a Low or a High amount of greenhouse gases is emitted into the atmosphere. If the world is in the Good state, choosing Low emissions will definitely keep the world in the Good state, but the result of choosing High emissions is non-deterministic: the world may either stay in the Good state or tip to the Bad state. Similarly, in the Bad state, High emissions will definitely keep the world in Bad, while with Low emissions it might either stay in Bad or recover and return to the Good state. The transitions just described define a non-deterministic transition function. The rewards associated with each transition are determined by the respective control and reached state. Now we can formulate an SDP: "Which policy sequence will maximise the worst-case sum of rewards along all possible trajectories when taking n decisions starting at decision step t?". In this simple example the question is not hard to answer: always choose Low emissions, independent of decision step and state. The optimal policy sequence for any n and t would thus consist of n constant Low functions. But in a more realistic example the situation will be more involved: every option will have its benefits and drawbacks encoded in a more complicated reward function, uncertainties might come with different probabilities, there might be intermediate states, different combinations of control options etc. For more along these lines we refer the interested reader to (Botta et al., 2018).
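A minimal sketch of the states, controls and the non-deterministic transition function of this example, ignoring the time index (cf. the full formalisation in Fig. 2):

  data State   = Good | Bad
  data Control = Low | High

  next : State → Control → List State
  next Good Low  = [Good]
  next Good High = [Good, Bad]
  next Bad  Low  = [Bad, Good]
  next Bad  High = [Bad]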
Example 2 (Scheduling). Scheduling problems serve as canonical examples in control theory textbooks. The one we present here is a slightly modified version of (Bertsekas, 1995, Example 1.1.2).
Think of some machine in a factory that can perform different operations, say A, B, C and D. Each of these operations is supposed to be performed once. The machine can only perform one operation at a time, thus an order has to be fixed in which to perform the operations. Setting the machine up for each operation incurs a specific cost that might vary according to which operation has been performed before. Moreover, operation B can only be performed after operation A has already been completed, and operation D only after operation C. It suffices to fix the order in which the first three operations are to be performed, as this uniquely determines which will be the fourth task. The aim is now to choose an order that minimises the total cost of performing the four operations.
This situation can be modelled as a problem with three decision steps: the states at each step are the sequences of operations already performed, with the empty sequence at step 0. The controls at a decision step and in a state are the operations which have not already been performed at previous steps and which are permitted in that state. For example, at decision step 0, only controls A and C are available because of the above constraint on performing B and D. The transition and cost functions of the problem are depicted by the graph in Fig. 1b. As the problem is deterministic, picking a control will result in a unique next state and each sequence of policies will result in a unique trajectory. In this setting, solving the SDP reduces to finding a control sequence that minimises the sum of costs along the single resulting trajectory. In Fig. 1b this is the sequence CAB(D).

The BJI-framework
The BJI-framework is a dependently typed formalisation of optimal control theory for finite-horizon, discrete-time SDPs. It extends mathematical formulations for stochastic SDPs (Bertsekas, 1995; Bertsekas & Shreve, 1996; Puterman, 2014) to the general problem of optimal decision making under monadic uncertainty.
For monadic SDPs, the framework provides a generic implementation of backward induction. It has been applied to study the impact of uncertainties on optimal emission policies (Botta et al., 2018) and is currently used to investigate solar radiation management problems under tipping point uncertainty (TiPES, 2019-2023).
In a nutshell, the framework consists of two sets of components: one for the specification of an SDP and one for its solution with monadic backward induction.

Problem specification components
In the following we discuss the components necessary to specify a monadic SDP. The first one is the monad M: as discussed in the previous section, M accounts for the uncertainties that affect the decision problem. For our first example, we could for instance define M to be List. For the second example it suffices to use M = Id, as the problem is deterministic.
Further, the BJI-framework supports the specification of the states, the controls and the transition function of an SDP through three components

  X    : (t : N) → Type
  Y    : (t : N) → (x : X t) → Type
  next : (t : N) → (x : X t) → (y : Y t x) → M (X (S t))

The interpretation is that X t represents the states at decision step t. In the first example of Sec. 2, there are just two states (Good and Bad), such that X is a constant family of types. But in the second example the possible states depend on the decision step t: at step t = 2, for instance, the states are the four admissible two-operation sequences, which could simply be defined as an enumeration type. Alternatively, we could employ type dependency in a more systematic way to express that in Ex. 2 states are admissible sequences of actions. Recall that actions could require that another action was performed before, that no action was to be carried out twice, and that the problem was limited to 3 steps. These conditions might be captured by a type-valued predicate

  AdmissibleState : {t : N} → Vect t Act → Type

and the type of states might then be expressed as a dependent pair of a vector of actions and a proof that it is admissible:
  X t = (as : Vect t Act ** AdmissibleState as)

Similarly, Y t x represents the controls available at decision step t and in state x, and next t x y represents the states that can be obtained by selecting control y in state x at decision step t. In our first example, the available controls remain constant over time (High or Low), like the states, but for the second example the type dependency is relevant: at step t = 2, for instance, we might simply enumerate the controls available in each state, or, more elegantly, use dependent pairs to define the type of controls, using the observation that an action is an admissible control for a state represented by a vector of actions if adding the action to the vector results again in an admissible state. One possible implementation of such an admissibility predicate is sketched below.
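A possible implementation of the predicate for Ex. 2 (an assumed sketch, not the paper's own code) reflects a Boolean check into a type:

  data Act = A | B | C | D

  Eq Act where
    A == A = True
    B == B = True
    C == C = True
    D == D = True
    _ == _ = False

  -- prerequisite of an operation, if any: B requires A, D requires C
  prereq : Act → Maybe Act
  prereq B = Just A
  prereq D = Just C
  prereq _ = Nothing

  -- a sequence is admissible if it is duplicate-free and every operation's
  -- prerequisite occurs earlier in the sequence
  admissible : Vect n Act → Bool
  admissible as = go (toList as) [ ]
    where
      go : List Act → List Act → Bool
      go [ ]       _    = True
      go (a :: op) done = not (elem a done)
                          && maybe True (λ p ⇒ elem p done) (prereq a)
                          && go op (done ++ [a])

  AdmissibleState : Vect t Act → Type
  AdmissibleState as = (admissible as = True)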

Recall from Sec. 2 that the monad, the states, the controls and the next function together define a decision process. In order to fully specify a decision problem, one also has to define the rewards obtained at each decision step and the operation that is used to add up rewards. In the BJI-framework, this is done in terms of

  Val    : Type
  reward : (t : N) → (x : X t) → (y : Y t x) → X (S t) → Val
  (⊕)    : Val → Val → Val

Here, Val is the type of rewards and reward t x y x′ is the reward obtained by selecting control y in state x if the next state is x′, an element of the state space at step t + 1. Note that for deterministic problems it is unnecessary to parameterise the reward function over the next state, as it is unique and can thus be obtained from the current state and control. But for non-deterministic problems it is useful to be able to assign rewards depending on the (uncertain) outcome of a transition.
A few remarks are in order here.
• In many applications, Val is a numerical type and the controls of the SDP represent resources (fuel, water, etc.) that come at a cost. In these cases, the reward function encodes the costs and perhaps also the benefits associated with a decision step. Often, the latter depend both on the current state x and on the next state x′. The BJI-framework nicely copes with all these situations.

• The operation ⊕ determines how rewards are added up. It could be a simple arithmetic operation, but it could also be defined in terms of problem-specific parameters, e.g. discount factors to give more weight to current rewards as compared to future rewards (a small sketch follows this list of remarks).
• Mapping reward t x y onto next t x y (remember that M is a monad and thus a functor) yields a value of type M Val. These are the possible rewards obtained by selecting control y in state x at decision step t. In mathematical theories of optimal control, the implicit assumption often is that Val is equal to R and that the M-structure is a probability distribution on real numbers which can be evaluated with the expected value measure. However, in many practical applications, measuring uncertainty of rewards in terms of the expected value is inadequate (Mercure et al., 2020). The BJI-framework therefore takes a generic approach and allows the specification of SDPs in terms of problem-specific measures.
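For instance, a discounted addition (a hypothetical instantiation with Val = Double and a fixed discount factor, not a component of the framework itself) could look as follows:

  discount : Double
  discount = 0.9

  -- current reward plus discounted future value
  plusDisc : Double → Double → Double
  plusDisc v w = v + discount * w

Instantiating ⊕ with plusDisc gives more weight to rewards obtained at earlier decision steps.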
As just discussed, in SDPs with uncertainty a measure is required to aggregate multiple possible rewards. The BJI-framework supports the specification of the measure by

  meas : M Val → Val

In our first example we could use the minimum of a list as worst-case measure, while in the second example the measure would just be the identity (as the problem is deterministic). Before we get to the solution components of the BJI-framework, one more ingredient needs to be specified. In the next section we will formalise a notion of optimality for which it is necessary to be able to compare elements of Val. The framework allows users to compare Val-values in terms of a problem-specific comparison operator

  (⊑) : Val → Val → Type

The operator (⊑) is required to define a total preorder on Val. In our two examples, (⊑) is simply a canonical ordering on the natural numbers. Three more ingredients are necessary to fully specify a monadic SDP, but we defer discussing them to when they come up in the next subsection. For illustration, a formalisation of Ex. 1 can be found in Fig. 2. A formalisation of Ex. 2 is included in the supplementary material.

Problem solution components
The second set of components of the BJI-framework is an extension of the mathematical theory of optimal control for stochastic sequential decision problems to monadic problems. Here, we provide a summary of the theory. Motivation and full details can be found in (Botta et al., 2017a,b, 2018).
The theory formalises the notion of policy (decision rule) from Sec. 2 as

  Policy : (t : N) → Type
  Policy t = (x : X t) → Y t x

Policy sequences are then essentially vectors of policies:

  data PolicySeq : (t, n : N) → Type where
    Nil  : PolicySeq t Z
    (::) : Policy t → PolicySeq (S t) n → PolicySeq t (S n)

Fig. 2: Formalisation of Ex. 1 (the toy climate problem): the uncertainty monad (M = List), the measure, states and controls, and rewards.

Notice the role of the step (time) index t and of the length index n in the constructors of policy sequences: for a policy sequence to make sense, policies for taking decisions at step t can only be prepended to policy sequences for taking first decisions at step t + 1, and the prepend operation (::) yields policy sequences for taking first decisions at step t. Thus the time index allows us to ensure a consistency property of policy sequences by construction. As for plain vectors and lists, prepending a policy to a policy sequence of length n yields a policy sequence of length n + 1. Both the time and the length index will be useful below: they allow us to express that the backward induction algorithm computes policy sequences starting at a specific time and having a specific length, depending on its inputs.

Perhaps the most important ingredient of backward induction is a value function that incrementally measures and adds up rewards. For a given decision problem, the value function takes two arguments: a policy sequence ps for making n decision steps starting from decision step t, and an initial state x : X t. It computes the value of taking n decision steps according to the policies ps when starting in x. In the BJI-framework, the value function is defined as

  val : {t, n : N} → PolicySeq t n → X t → Val
  val Nil x = zero
  val (p :: ps) x = let y = p x in
                    let mx′ = next t x y in
                    meas (map (λ x′ ⇒ reward t x y x′ ⊕ val ps x′) mx′)

Notice that, independently of the initial state x, the value of the empty policy sequence is zero. This is a problem-specific reference value zero : Val that has to be provided as part of a problem specification. It is one of the specification components that we have not discussed in Sec. 3.1. The value of a policy sequence consisting of a first policy p and of a tail policy sequence ps is defined inductively as the measure of an M-structure of Val-values. These values are obtained by first computing the control y dictated by the policy p in state x and the M-structure of possible next states mx′ dictated by the transition function next. Then, for all x′ in mx′ the current reward reward t x y x′ is added to the aggregated outcome for the tail policy sequence val ps x′. Finally, the result of this functorial mapping is aggregated with the problem-specific measure meas to obtain a result of type Val as outcome for the policy sequence (p :: ps). The function which is mapped onto mx′ is just a lifted version of ⊕:

  (⊕) : (A → Val) → (A → Val) → (A → Val)
  (f ⊕ g) a = f a ⊕ g a

We will come back to the value function of the BJI-theory in Sec. 4, where we will contrast it with a function val′ that, for a policy sequence ps and an initial state x, computes the measure of the sum of the rewards along all possible trajectories starting at x under ps (the measured total reward that we anticipated in Sec. 2). For the time being, though, we accept the notion of value of a policy sequence as put forward in the BJI-theory and we show how the definition of val can be employed to compute policy sequences that are provably optimal in the sense of

  OptPolicySeq : {t, n : N} → PolicySeq t n → Type
  OptPolicySeq {t} {n} ps = (ps′ : PolicySeq t n) → (x : X t) → val ps′ x ⊑ val ps x

Notice the universal quantification in this definition: a policy sequence ps is defined to be optimal iff val ps′ x ⊑ val ps x for any policy sequence ps′ and for any state x.
The generic implementation of backward induction in the BJI-framework is an application of Bellman's principle of optimality mentioned in Sec. 2. In control theory textbooks, this principle is often referred to as Bellman's equation. It can be suitably formulated in terms of the notion of optimal extension. We say that a policy p : Policy t is an optimal extension of a policy sequence ps : PolicySeq (S t) n if the value of p :: ps is at least as good as the value of p′ :: ps for any policy p′ and for any state x : X t:

  OptExt : PolicySeq (S t) n → Policy t → Type
  OptExt {t} ps p = (p′ : Policy t) → (x : X t) → val (p′ :: ps) x ⊑ val (p :: ps) x

With the notion of optimal extension in place, Bellman's principle can be formulated as

  Bellman : (ps : PolicySeq (S t) n) → OptPolicySeq ps →
            (p : Policy t) → OptExt ps p → OptPolicySeq (p :: ps)

In words: extending an optimal policy sequence with an optimal extension (of that policy sequence) yields again an optimal policy sequence, or, shorter, prefixing with optimal extensions preserves optimality. Proving Bellman's optimality principle is almost straightforward and relies on ⊑ being reflexive and transitive and on two monotonicity properties:

  plusMonSpec : {v1, v2, v3, v4 : Val} → v1 ⊑ v2 → v3 ⊑ v4 → (v1 ⊕ v3) ⊑ (v2 ⊕ v4)
  measMonSpec : {A : Type} → (f, g : A → Val) → ((a : A) → f a ⊑ g a) →
                (ma : M A) → meas (map f ma) ⊑ meas (map g ma)

The second condition is a special case of the measure monotonicity requirement originally formulated by Ionescu (2009) in the context of a theory of vulnerability and monadic dynamical systems. It is a natural property, and the expected value measure, the worst (best) case measure and any sound statistical measure fulfil it. Like the reference value zero discussed above, plusMonSpec and measMonSpec are specification components of the BJI-framework that we have not discussed in Sec. 3.1. We provide a proof of Bellman in Appendix 5. As one would expect, the proof makes essential use of the recursive definition of the function val discussed above. As a consequence, this precise definition of val plays a crucial role for the verification of backward induction in (Botta et al., 2017a).
Apart from the increased level of generality, the definition of val and Bellman are in fact just an Idris formalisation of Bellman's equation as formulated in control theory textbooks. With Bellman, and provided that we can compute optimal extensions of arbitrary policy sequences, it is easy to derive an implementation of monadic backward induction that computes provably optimal policy sequences with respect to val. First, notice that the empty policy sequence is optimal:

  nilOptPolicySeq : OptPolicySeq Nil

This is the base case for constructing optimal policy sequences by backward induction, starting from the empty policy sequence:

  bi : (t, n : N) → PolicySeq t n
  bi t Z     = Nil
  bi t (S n) = let ps = bi (S t) n in optExt ps :: ps

That bi computes optimal policy sequences with respect to val is then proved by induction on n, the input that determines the length of the resulting policy sequence:

  biOptVal : (t, n : N) → OptPolicySeq (bi t n)

This is the verification result for bi of (Botta et al., 2017a).
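For illustration: for Ex. 1 with the worst-case measure, an optimal policy sequence for three decision steps starting at step 0 would be computed as

  ps : PolicySeq 0 3
  ps = bi 0 3

and, by the discussion in Sec. 2, would consist of three constant Low policies.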
BJI-framework wrap-up

The specification and solution components discussed in the last two sections are all we need to formulate precisely the problem of correctness for monadic backward induction in the BJI-framework. This is done in the next section. Before turning to it, two further remarks are necessary:

• The theory proposed in (Botta et al., 2017a) is slightly more general than the one summarised above. Here, policies are just functions from states to controls (see Sec. 3.2). By contrast, in (Botta et al., 2017a), policies are indexed over a number of decision steps n and their domain for n > 0 is restricted to states that are reachable and viable for n steps. This makes it possible to cope with states whose control set is empty and with transition functions that return empty M-structures of next states. (For a discussion of reachability and viability see (Botta et al., 2017a, Sec. 3.7 and 3.8).) This generality, however, comes at a cost: compare e.g. the proof of Bellman's principle from the last subsection with the corresponding proof in (Botta et al., 2017a, Appendix B). The impact of the reachability and viability constraints on other parts of the theory is even more severe. Here, we have decided to trade some generality for better readability and opted for a simplified version of the original theory. Still, for the generic backward induction algorithm we need to make sure that it is possible to define policy sequences of the length required for a specific SDP. This can e.g. be done by postulating controls to be non-empty. We also impose a non-emptiness requirement on the transition function next that will be discussed in Sec. 7.
• In Sec. 3.2, we have not discussed under which conditions one can implement optimal extensions of arbitrary policy sequences. This is an interesting topic that is, however, orthogonal to the purpose of the current paper. For the same reason we have not addressed the question of how to make bi more efficient by tabulation. We briefly discuss the specification and implementation of optimal extensions in the BJI-framework in Appendix 7. We refer the reader interested in tabulation of bi to SequentialDecisionProblems.TabBackwardsInduction of (Botta, 2016-2021).
Correctness for monadic backward induction

In this section we formally specify the notions of correctness for monadic backward induction bi and the value function val of the BJI-framework that we will study in the remainder of this paper. We develop these notions as generic variants of the corresponding notions for stochastic SDPs.

Extension of the BJI-framework
In the previous section, we have seen that a monadic SDP can be specified in terms of ten components: M, X, Y, next, Val, zero, ⊕, ⊑, meas and reward.
Given a policy sequence (optimal or not) and an initial state for an SDP, we can compute the M-structure of possible trajectories starting at that state:

  trj : {t, n : N} → PolicySeq t n → X t → M (StateCtrlSeq t (S n))
  trj Nil x = pure (Last x)
  trj (p :: ps) x = let y = p x in
                    map ((x ** y) ##) (next t x y >>= trj ps)

where we use StateCtrlSeq as the type of trajectories. Essentially, it is a non-empty list of (dependent) state/control pairs, with the exception of the base case, which is a singleton just containing the last state reached:

  data StateCtrlSeq : (t, n : N) → Type where
    Last : X t → StateCtrlSeq t (S Z)
    (##) : (x : X t ** Y t x) → StateCtrlSeq (S t) (S n) → StateCtrlSeq t (S (S n))

Furthermore, we can compute the total reward for a single trajectory, i.e. its sum of rewards:

  sumR : {t, n : N} → StateCtrlSeq t n → Val
  sumR (Last x) = zero
  sumR ((x ** y) ## xys) = reward t x y (head xys) ⊕ sumR xys

where head is the helper function

  head : {t, n : N} → StateCtrlSeq t (S n) → X t
  head (Last x) = x
  head ((x ** y) ## xys) = x

By mapping sumR onto an M-structure of trajectories, we obtain an M-structure containing the individual sums of rewards of the trajectories. Now, using the measure function, we can compute the generic analogue of the expected total reward for a policy sequence ps and an initial state x:

  val′ : {t, n : N} → PolicySeq t n → X t → Val
  val′ ps x = meas (map sumR (trj ps x))

As anticipated in Sec. 2, we call the value computed by val′ the measured total reward. Recall that solving a stochastic SDP commonly means finding a policy sequence that maximises the expected total reward. By analogy, we define that solving a monadic SDP means finding a policy sequence that maximises the measured total reward. I.e. given t and n, the solution of a monadic SDP is a sequence of n policies that maximises the measure of the sum of rewards along all possible trajectories of length n that are rooted in an initial state at step t.
Again by analogy to the stochastic case, we define monadic backward induction to be correct if, for a given SDP, the policy sequence computed by bi is a solution to the SDP. I.e. we consider bi to be correct if it meets the specification

  biOptMeasTotalReward : (t, n : N) → GenOptPolicySeq val′ (bi t n)

where GenOptPolicySeq is a generalised version of the optimality predicate OptPolicySeq from Sec. 3.2. It now takes as an additional parameter the function with respect to which the policy sequence is to be optimal:

  GenOptPolicySeq : ({t, n : N} → PolicySeq t n → X t → Val) → PolicySeq t n → Type
  GenOptPolicySeq v ps = (ps′ : PolicySeq t n) → (x : X t) → v ps′ x ⊑ v ps x

As recapitulated in Sec. 3.2, Botta et al. have already shown that if M is a monad, ⊑ a total preorder, and ⊕ and meas fulfil two monotonicity conditions, then bi t n yields an optimal policy sequence with respect to the value function val in the sense that val ps′ x ⊑ val (bi t n) x for any policy sequence ps′ and initial state x, for arbitrary t, n : N. Or, expressed using the generalised optimality predicate: the type GenOptPolicySeq val (bi t n) is inhabited.

As seen in Sec. 3.2, the function val measures and adds rewards incrementally. But does it always compute the measured total reward like val′? Modulo differences in the presentation, Puterman (2014, Theorem 4.2.1) suggests that for standard stochastic SDPs, val and val′ are extensionally equal, which in turn allows the use of backward induction for solving these SDPs. Generalising, we therefore consider val as correct if it fulfils the specification

  valMeasTotalReward : {t, n : N} → (ps : PolicySeq t n) → (x : X t) → val ps x = val′ ps x

If this equality held for the general monadic SDPs of the BJI-theory, we could prove the correctness of bi as an immediate corollary of valMeasTotalReward and Botta et al.'s result biOptVal. The statement biOptMeasTotalReward can be seen as a generic version of textbook correctness statements for backward induction as a solution method for stochastic SDPs like (Bertsekas, 1995, Prop. 1.3.1) or (Puterman, 2014, Theorem 4.5.1.c). By proving valMeasTotalReward we could therefore extend the verification of (Botta et al., 2017a) and obtain a stronger correctness result for monadic backward induction.
Our main objective in the remainder of the paper is therefore to prove that valMeasTotalReward holds. But there is a problem.

The problem with the BJI-value function
A closer look at val and val′ reveals two quite different computational patterns: applied to a policy sequence ps of length n + 1 and a state x, the function val directly evaluates meas on the M-structure of rewards corresponding to the possible next states after one step. This entails further evaluations of meas for each possible next state. By contrast, val′ ps x entails only one evaluation of meas, independently of the length of ps. The computation, however, builds up an intermediate M-structure of state-control sequences. The elements of this M-structure, the state-control sequences, are then consumed by sumR and finally the M-structure of rewards is reduced by meas.
For illustration, let us revisit Ex. 1 from Sec. 2 as formalised in Fig. 2. To do an example calculation with val and val′ we first need a concrete policy sequence as input. The simplest two policies are the two constant policies:

  constH : (t : N) → Policy t
  constH t x = High

  constL : (t : N) → Policy t
  constL t x = Low

From these, we can define a policy sequence

  ps : PolicySeq 0 3
  ps = constH 0 :: (constL 1 :: (constH 2 :: Nil))

It is instructive to compute val ps Good and val′ ps Good by hand. Recall that in this example we have M = List with (>>=) = concatMap and Val = N with ⊕ = +. The measure meas thus needs to have the type List N → N. Without instantiating meas for the moment, the computations roughly exhibit the following structure: val applies meas anew at every decision step, to the list of values obtained for the possible next states, whereas val′ applies meas exactly once, to the list of total rewards of all possible trajectories. It is not "obviously clear" that val and val′ are extensionally equal without further knowledge about meas.
In the deterministic case, i.e. for M = Id and meas = id, val ps x and val′ ps x are indeed equal for all ps and x, without imposing any further conditions (as we will see in Sec. 6). For the stochastic case, (Puterman, 2014, Theorem 4.2.1) suggests that the equality should hold. But for the monadic case, no such result has been established. And as it turns out, in general the functions val and val′ are not unconditionally equal. Consider the following counter-example: we continue in the setting of Ex. 1 from above, but now instantiate the measure to the plain arithmetic sum

  meas = foldr (+) 0

This measure fulfils the monotonicity condition (measMonSpec, Sec. 3.2) imposed by the BJI-framework. But if we instantiate the above computations with it, then we get val ps Good = 13 and val′ ps Good = 21! We thus see that the equality between val and val′ cannot hold unconditionally in the generic setting of the BJI-framework. In the next section we therefore present conditions under which the equality does hold.
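The phenomenon is easy to reproduce independently of the framework. The following self-contained sketch, with hypothetical rewards (not those of Fig. 2) and only two decision steps, exhibits the same failure for meas = foldr (+) 0:

  -- two states and a non-deterministic one-step transition
  -- (the control is fixed, for brevity)
  data S = G | B

  step : S → List S
  step G = [G, B]
  step B = [B]

  rew : S → N        -- hypothetical rewards for reaching a state
  rew G = 2
  rew B = 0

  sumMeas : List N → N
  sumMeas = foldr (+) 0

  -- val-style: measure locally at each decision step
  valTwo : S → N
  valTwo x = sumMeas (map (λ x′ ⇒ rew x′ + sumMeas (map rew (step x′))) (step x))

  -- val′-style: sum the rewards along each trajectory, measure once at the end
  valTwo′ : S → N
  valTwo′ x = sumMeas [ rew x′ + rew x′′ | x′ ← step x, x′′ ← step x′ ]

Here valTwo G = 4 but valTwo′ G = 6: with the plain sum as measure, the two evaluation orders do not agree.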

Correctness conditions
We now formulate three conditions on combinations of the monad, the measure function and the binary operation ⊕ that imply the extensional equality of val and val′:

  measPureSpec : (v : Val) → meas (pure v) = v
  measJoinSpec : (mmv : M (M Val)) → meas (join mmv) = meas (map meas mmv)
  measPlusSpec : (v : Val) → (mv : M Val) → NotEmpty mv →
                 meas (map (v ⊕) mv) = v ⊕ meas mv

Essentially, these conditions assure that the measure is well-behaved relative to the monad structure and the ⊕-operation. They arise by, again, generalising from the standard example of stochastic SDPs with a probability monad, the expected value as measure and ordinary addition as ⊕. The first two conditions are lifting properties that allow computations to be carried out either in the monad or in the underlying structure with the same result. The third condition is a distributivity law. For the computation of the measured total reward it means that, instead of adding the current reward to the outcome of each trajectory and then measuring, one may as well first measure the outcomes and then add the current reward.
To illustrate the conditions, let us consider a simple representation of discrete probability distributions like in (Erwig & Kollmansberger, 2006), together with the expected value as measure; a small sketch follows this paragraph. As the calculation in the sketch shows, an essential ingredient for the equality to hold is that the mapped occurrences of (v+) are weighted by the probabilities, which add up to 1. Note that in this example we have glossed over problems that might arise from the use of Dist to represent probability distributions. We will briefly address probability monads and the expected value from a more abstract perspective in Subsection 5.2.
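A minimal sketch of such a representation and of the expected value as measure (our own rendering with assumed names, not the library's exact definitions):

  Dist : Type → Type
  Dist a = List (a, Double)          -- value/probability pairs

  mapDist : (a → b) → Dist a → Dist b
  mapDist f d = [ (f v, p) | (v, p) ← d ]

  expected : Dist Double → Double
  expected d = sum [ v * p | (v, p) ← d ]

For meas = expected and ⊕ = +, the third condition amounts to

  expected (mapDist (v +) d)
    = sum [ (v + w) * p | (w, p) ← d ]
    = v * sum [ p | (_, p) ← d ] + sum [ w * p | (w, p) ← d ]
    = v + expected d

where the last step uses that the probabilities add up to 1.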

Examples and counter-examples
Besides the motivating example above, let us now consider some more functions that have the correct type to serve as a measure, and that do or do not fulfil the three conditions.
Simple examples of admissible measures are the minimum (minList as defined in Fig. 2) or the maximum (maxList = foldr maximum 0) of a list, for M = List with N as type of values and ordinary addition as ⊕. It is straightforward to prove that the conditions hold for these two measures, and the proofs for maxList are included in the supplementary material.
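As quick Boolean sanity checks (not the formal proofs, which are in the supplementary material), the three conditions for maxList with ⊕ = + amount to:

  maxList : List N → N
  maxList = foldr maximum 0

  checkPure : N → Bool
  checkPure v = maxList [v] == v

  checkJoin : List (List N) → Bool
  checkJoin mmv = maxList (concat mmv) == maxList (map maxList mmv)

  checkPlus : N → List N → Bool   -- only required for non-empty lists (cf. Sec. 7)
  checkPlus v [ ] = True
  checkPlus v mv  = maxList (map (v +) mv) == v + maxList mv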
The function length is a very simple counter-example: it has the right type for a list measure but fails all three of the conditions. As to other counter-examples, let us revisit the conditions one by one. Throughout, we use M = List with map = listMap, join = concat and ⊕ = + (the canonical addition for the respective type of Val).
Condition 1. We remain in the setting of Ex. 1 with Val = N and just vary the measure. Using a somewhat contrived variation maxListVar of maxList as measure, it suffices to consider meas (pure n) for an arbitrary n : N to see that measPureSpec fails.

Condition 2. The plain arithmetic average (for a suitable numeric Val) fulfils measPureSpec but fails measJoinSpec: the average of a concatenation of lists is in general different from the average of their averages. For example, avg (join [[0], [2, 2]]) = 4/3, whereas avg (map avg [[0], [2, 2]]) = 1.

Condition 3. Let again Val = N to take another look at our counter-example from the last section with meas = sum, the arithmetic sum of a list. It does fulfil measPureSpec and measJoinSpec, the first by definition, the second by structural induction using the associativity of +. But it fails to fulfil measPlusSpec. If the list has the form a :: as, we would have to show the following equality for measPlusSpec to hold:

  sum (map (v+) (a :: as)) = v + sum (a :: as)

i.e. (v + a) + sum (map (v+) as) = v + (a + sum as). Clearly, if v ≠ 0 and as ≠ [ ], this equality cannot hold. This is why in the last section the equality of val and val′ failed for meas = sum. A similar failure would arise if we chose meas = foldr (*) 1 instead, as + does not distribute over *. But if we turned the situation around by setting ⊕ = * and meas = sum, the condition measPlusSpec would hold thanks to the usual arithmetic distributivity law for * over +.
All of the measures considered in this subsection do fulfil the measMonSpec condition imposed by the BJI-theory. This raises the question of how previously admissible measures are impacted by adding the three new conditions to the framework.

Impact on previously admissible measures
As we have seen in Sec. 3.2, the BJI-framework already requires measures to fulfil the monotonicity condition measMonSpec, i.e. that f a ⊑ g a for all a : A implies

  meas (map f ma) ⊑ meas (map g ma)

for all ma : M A. Botta et al. show that the arithmetic average (for M = List), the worst-case measure (for M = List and for a probability monad M = Prob) and the expected value measure (for M = Prob) all fulfil measMonSpec. Thus, a natural question is whether these measures also fulfil the three additional requirements.
Expected value for probability distributions. As already discussed, most applications of backward induction concern stochastic SDPs where possible rewards are aggregated using the expected value measure from probability theory, commonly denoted by E. Essentially, for a numerical type Q, the expected value of a probability distribution pa on Q is

  E pa = sum [ q * prob pa q | q ← supp pa ]

where prob and supp are generic functions that encode the notions of probability and of support associated with a finite probability distribution: for pa and a of suitable types, prob pa a represents the probability of a according to pa. Similarly, supp pa returns a list of those values whose probability is not zero in pa. The probability function prob has to fulfil the axioms of probability theory. In particular,

  sum [ prob pa a | a ← supp pa ] = 1

This condition implies that probability distributions cannot be empty, a precondition of measPlusSpec. Putting forward minimal specifications for prob and supp is not completely trivial, but if the +-operation associated with Q is commutative and associative, if * distributes over + and if the map and join associated with Prob (for f, a, b, pa and ppa of suitable types) fulfil the conservation law

  prob (map f pa) b = sum [ prob pa a | a ← supp pa, f a = b ]

and the total probability law

  prob (join ppa) a = sum [ prob pa a * prob ppa pa | pa ← supp ppa ]

then the expected value fulfils measPureSpec, measJoinSpec and measPlusSpec. This is not surprising since, as stated above, this has been the guiding example for the generalisation to monadic SDPs and the formulation of the three conditions.
Average and arithmetic sum. As can already be concluded from the corresponding counter-examples in the previous subsection, neither the plain arithmetic average nor the arithmetic sum are suited as measure when using the standard monad structure on List to represent non-deterministic uncertainty. We think this is an important observation, as the average seems innocent enough to come to mind as a simple way to represent uniformly distributed outcomes: "The probability of each element can simply be inferred from the length of the list, so why bother to explicitly deal with probabilities?" Although our counter-example shows that this idea is flawed, the intuition behind it can be employed to define an alternative, but less general, monad structure on lists by incorporating the averaging operation into the joining of lists (i.e. by choosing join = map avg). However, this only makes sense for types that are instances of the Num and Fractional type classes, and naturality only holds for a restricted class of functions (namely additive functions). As a consequence, this alternative structure does not seem particularly useful for our current purpose either.
Worst-case measures. In many important applications, in climate impact research but also in portfolio management and sports, decisions are taken so as to minimise the consequences of worst-case outcomes. Depending on how "worse" is defined, the corresponding measures might pick the maximum or the minimum from an M-structure of values. In the previous subsection we considered an example in which the monad was List and the operation ⊕ plain addition, together with either maxList or minList as measure. And indeed we can prove that for both measures the three requirements hold (the proofs for maxList can be found in the supplementary material). This gives us a useful notion of worst-case measure that is admissible for monadic backward induction.
We can thus conclude that the new requirements hold for certain familiar measures, but that they also rule out certain instances that were considered admissible in the BJI-framework. Given that the three conditions measPureSpec, measJoinSpec and measPlusSpec hold, we can prove the extensional equality of the functions val and val′ generically. This is what we will do in the next section.

Correctness proofs
In this section we show that val (Sec. 3.2) and val′ (Sec. 4) are extensionally equal given that the three conditions from the previous section hold. As a corollary we then obtain our correctness result for monadic backward induction.
We can understand the proof of valMeasTotalReward as an optimising program transformation from the less efficient but "obviously correct" implementation val′ to the more efficient implementation val. Therefore the equational reasoning proofs in this section will proceed from val′ to val. In Sec. 5 we have stated sufficient conditions for this transformation to be possible: measPureSpec, measJoinSpec and measPlusSpec. We have also seen the different computational patterns that the two implementations exhibit: while val′ first computes all possible trajectories for the given policy sequence and initial state, then computes their individual sums of rewards and finally applies the measure once, val computes its final result by adding the current reward to an intermediate outcome and applying the measure locally at each decision step. This suggests that a transformation from val′ to val will essentially have to push the application of the measure into the recursive computation of the sum of rewards. The proof will be carried out by induction on the structure of policy sequences.

Deterministic case
To get a first intuition, let us have a look at what the induction step looks like in the deterministic case, i.e. if we fix monad and measure to be identities. With y = p x, x′ = next t x y and r = reward t x y, it can be sketched as

  val′ (p :: ps) x
    = sumR (trj (p :: ps) x)                    -- def. val′ (meas = id, map = id)
    = sumR ((x ** y) ## trj ps x′)              -- def. trj
    = r (head (trj ps x′)) ⊕ sumR (trj ps x′)   -- def. sumR
    = r x′ ⊕ sumR (trj ps x′)                   -- headLemma
    = r x′ ⊕ val′ ps x′                         -- def. val′
    = r x′ ⊕ val ps x′                          -- induction hypothesis
    = val (p :: ps) x                           -- def. val

In the proof sketch, we have first applied the definitions of val′ and sumR. Using the fact that in the deterministic case trj returns exactly one state-control sequence and that the head of any trajectory starting in x′ is just x′ (let us call the latter headLemma), the left-hand side of the ⊕-sum simplifies to r x′. Its right-hand side amounts to val′ ps x′, so that we can apply the induction hypothesis. The rest of the proof only relies on definitional equalities. Thus, in the deterministic case, val and val′ are unconditionally extensionally equal; or rather, the conditions of Sec. 5 are trivially fulfilled.

Lemmas
To prove the general, monadic case, we proceed similarly. This time, however, the situation is complicated by the presence of the abstract monad M. Instead of being able to use the type structure of some concrete monad, we need to leverage the properties of M, meas and ⊕ postulated in Sec. 5. To facilitate the main proof, we first prove three lemmas about the interaction of the measure with the monad structure and the ⊕-operator on Val.
Machine-checked proofs are given in Appendices 2, 3 and 4. The monad laws we use are stated in Appendix 1.2. In the remainder of this section, we discuss semi-formal versions of the proofs.
Monad algebras. The first lemma allows us to lift and eliminate an application of the monad's join operation:

  measAlgLemma : {A : Type} → (f : A → Val) → (mma : M (M A)) →
                 meas (map f (join mma)) = meas (map (meas • map f) mma)

The proof of this lemma hinges on the condition measJoinSpec. It allows us to trade the application of join against an application of map meas. The rest is just standard reasoning with monad and functor laws, i.e. we use that the functorial map for M preserves composition and that join is a natural transformation:

  meas (map f (join mma))
    = meas (join (map (map f) mma))        -- join natural transformation
    = meas (map meas (map (map f) mma))    -- measJoinSpec
    = meas (map (meas • map f) mma)        -- map preserves composition

This lemma is generic in the sense that it holds for arbitrary Eilenberg-Moore algebras of a monad. Here we prove it for the framework's measure meas, but note that in the appendix we prove a generic version that is then appropriately instantiated.
Head/trajectory interaction. The second lemma amounts to a lifted version of headLemma from the deterministic case. Mapping head onto an M-structure of trajectories computed with trj results in an M-structure filled with the initial states of these trajectories; similarly, mapping ((r • head) ⊕ s) onto trj ps x, for functions r and s of appropriate types and with ⊕ denoting the pointwise lifting from Sec. 3.2, is the same as mapping (const (r x) ⊕ s) onto trj ps x (where const is the constant function). We can prove

  headTrjLemma : {t, n : N} → (ps : PolicySeq t n) → (x : X t) →
                 (r : X t → Val) → (s : StateCtrlSeq t (S n) → Val) →
                 map ((r • head) ⊕ s) (trj ps x) = map (const (r x) ⊕ s) (trj ps x)

by doing a case split on ps. In case ps = Nil, the equality holds because the monad's pure is a natural transformation, and in case ps = p :: ps′ because M's functorial map preserves composition.
Measure/sum interaction. The third lemma allows us both to commute the measure into the right summand of an ⊕-sum and to perform the head/trajectory simplification. It lies at the core of the relationship between val and val′.
Recall that our third condition from Sec. 5, measPlusSpec, plays the role of a distributive law and allows us to "factor out" a partially applied sum (v⊕) for arbitrary v : Val. Given that measPlusSpec holds, the lemma

  measSumLemma : {t, n : N} → (r : X t → Val) → (ps : PolicySeq t n) → (x′ : X t) →
                 meas (map ((r • head) ⊕ sumR) (trj ps x′))
                 = r x′ ⊕ meas (map sumR (trj ps x′))

is provable by simple equational reasoning using the above head-trajectory lemma and the fact that map preserves composition:

  meas (map ((r • head) ⊕ sumR) (trj ps x′))
    = meas (map (const (r x′) ⊕ sumR) (trj ps x′))   -- headTrjLemma
    = meas (map ((r x′ ⊕) • sumR) (trj ps x′))       -- def. of the lifted ⊕
    = meas (map (r x′ ⊕) (map sumR (trj ps x′)))     -- map preserves composition
    = r x′ ⊕ meas (map sumR (trj ps x′))             -- measPlusSpec (non-emptiness of trj ps x′)

Notice how measPlusSpec is used to transform an application of meas • map ((r x′)⊕) into an application of ((r x′)⊕) • meas. This is essential to simplify the computation of the measured total reward: instead of first adding the current reward to the intermediate outcome of each individual trajectory and then measuring the outcomes, one can first measure the intermediate outcomes of the trajectories and then add the current reward.

Correctness of the BJI-value function
With the above lemmas in place, we now prove that val is extensionally equal to val′.
Let t, n : N and ps : PolicySeq t n. We prove valMeasTotalReward by induction on ps.
Base case. We need to show that for all x : X t, val′ Nil x = val Nil x. The right-hand side of this equation reduces to zero by definition. The left-hand side can be simplified to meas (pure zero), since pure is a natural transformation. At this point, our first condition, measPureSpec, comes into play: using that meas is a left inverse of pure, we can conclude that the equality holds.
In equational reasoning style: for all x : X t,

  val′ Nil x
    = meas (map sumR (trj Nil x))      -- def. val′
    = meas (map sumR (pure (Last x)))  -- def. trj
    = meas (pure (sumR (Last x)))      -- pure natural transformation
    = meas (pure zero)                 -- def. sumR
    = zero                             -- measPureSpec
    = val Nil x                        -- def. val

Step case. The induction hypothesis (IH) is: for all x : X t, val′ ps x = val ps x. We have to show that IH implies that for all p : Policy t and x : X t, the equality val′ (p :: ps) x = val (p :: ps) x holds.
For brevity (and to economise on brackets), let in the following y = p x, mx′ = next t x y, r = reward t x y, trjps = trj ps, and consxy = ((x ** y) ##).
As in the base case, all that has to be done on the val-side of the equation only depends on definitional equality. However, it is more involved to bring the val′-side into a form in which the induction hypothesis can be applied. This is where we leverage the lemmas proved above.
By definition and because map preserves composition, we know that val′ (p :: ps) x is equal to (meas • map ((r • head) ⊕ sumR)) (mx′ >>= trjps). We use the relation between the monad's bind and join, mx′ >>= trjps = join (map trjps mx′), to eliminate the bind-operator from the term. Now we can apply the first lemma from above, measAlgLemma, to lift and eliminate the join operation.
To commute the measure under the ⊕ and get rid of the application of head, we use our third lemma, measSumLemma. At this point we can apply the induction hypothesis, and the resulting term is equal to val (p :: ps) x by definition.

Technical remarks. The above proof of valMeasTotalReward omits some technical details that may be uninteresting for a pen and paper proof, but turn out to be crucial in the setting of an intensional type theory, like Idris, where function extensionality does not hold in general. In particular, we have to postulate that the functorial map preserves extensional equality (see Appendix 1.2 and (Botta et al., n.d.)) for Idris to accept the proof. In fact, most of the reasoning proceeds by replacing functions that are mapped onto monadic values by other functions that are only extensionally equal. Using that map preserves extensional equality allows us to carry out such proofs generically, without knowledge of the concrete structure of the functor.

Correctness of monadic backward induction
As a corollary, we can now prove the correctness of monadic backward induction, namely that the policy sequences computed by bi are optimal with respect to the measured total reward computed by val′:

  biOptMeasTotalReward : (t, n : N) → GenOptPolicySeq val′ (bi t n)

The proof is an immediate combination of Botta et al.'s result biOptVal with valMeasTotalReward: rewriting val to val′ in the optimality statement yields the claim.

Discussion
In the last two sections we have seen what the three conditions mean for concrete examples and how they are used in the correctness proof. In this section we take a step back and consider them from a more abstract point of view.
Category-theoretical perspective. Readers familiar with the theory of monads might have recognised that the first two conditions ensure that meas is the structure map of a monad algebra for M on Val, and thus that the pair (Val, meas) is an object of the Eilenberg-Moore category associated with the monad M. The third condition requires the map (v⊕) to be an M-algebra homomorphism, i.e. a structure-preserving map, for arbitrary values v.
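Spelled out in pointfree form, with • denoting function composition, the three conditions read:

  meas • pure = id                  -- unit law of an Eilenberg-Moore algebra
  meas • join = meas • map meas     -- multiplication law of an Eilenberg-Moore algebra
  (v⊕) • meas = meas • map (v⊕)     -- (v⊕) is an M-algebra homomorphism

(the third, as before, restricted to non-empty M-structures).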
This perspective allows us to use existing knowledge about monad algebras as a first criterion for choosing measures. For example, the Eilenberg-Moore algebras of the list monad are monoids; this implicitly played a role in the examples we considered above. Jacobs (2011) shows that the algebras of the distribution monad for probability distributions with finite support correspond to convex sets. Interestingly, convex sets play an important role in the theory of optimal control (Bertsekas et al., 2003).
Measures for the list monad. The knowledge that monoids are List-algebras suggests a generic description of admissible measures for M = List: given a monoid (Val, ⊙, b), we can prove that monoid homomorphisms of the form foldr ⊙ b fulfil the three conditions if ⊕ distributes over ⊙ on the left. I.e. for meas = foldr ⊙ b the three conditions can be proven from

  b ⊙ v = v,    v ⊙ b = v,    u ⊙ (v ⊙ w) = (u ⊙ v) ⊙ w,    v ⊕ (w ⊙ u) = (v ⊕ w) ⊙ (v ⊕ u)

Neutrality of b on the right is needed for measPureSpec, while measJoinSpec follows from neutrality on the left and the associativity of ⊙. The algebra morphism condition on (v⊕) is provable from the distributivity of ⊕ over ⊙ and again neutrality of b on the right. If moreover ⊙ is monotone with respect to ⊑, then we can also prove measMonSpec using the transitivity of ⊑. The proofs are simple and can be found in the supplementary material to this paper. This also illustrates how the three abstract conditions follow from more familiar algebraic properties.
Mutual independence. Although it does not seem surprising, it should be noted that the three conditions are mutually independent. This can be concluded from the counter-examples in Sec. 5.1: the sum, the modified list maximum and the arithmetic average each fail exactly one of the three conditions.
Sufficient vs. necessary. The three conditions are sufficient to prove the extensional equality of the functions val and val′. They are justified by their level of generality and by the fact that they hold for standard measures used in control theory. However, we leave open the interesting question of whether these conditions are also necessary for the correctness of monadic backward induction.
Non-emptiness requirement. Note that mv in the premises of measPlusSpec is required to be non-empty. This condition arises from a pragmatic consideration. As an example, let us again use the list monad with Val = N and ⊕ = +. It is not hard to see that for any natural number n greater than 0 the equality meas (map (n+) [ ]) = n + meas [ ] must fail: since map (n+) [ ] = [ ], the left-hand side is meas [ ] while the right-hand side is n + meas [ ], and no choice of meas [ ] can make the two equal. So, if we wish to use the standard list data type instead of defining a custom type of non-empty lists, the only way to prove the base case of measPlusSpec is by contradiction with the non-emptiness premise.
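A minimal sketch of this base case for lists, with meas as in the sketch above (our own NotEmpty instance; the framework's predicate is more general):

    -- Type-valued non-emptiness for lists: uninhabited for [].
    NotEmpty : List a → Type
    NotEmpty []        = Void
    NotEmpty (_ :: _)  = Unit

    -- The base case of measPlusSpec holds vacuously: a proof of
    -- NotEmpty [] is a proof of Void.
    measPlusSpecNil : (v : Nat) → NotEmpty (the (List Nat) []) →
                      meas (map (v+) []) = v + meas []
    measPlusSpecNil v ne = void ne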
However, omitting the premise mv : NotEmpty would not prevent us from generically proving the correctness result of Section 6; it would even simplify matters, as it would spare us reasoning about the preservation of non-emptiness. But it would implicitly restrict the class of monads that can be used to instantiate M. For example, we have seen above that measPlusSpec is not provable for the empty list without the non-emptiness premise, and we would therefore need to resort to a custom type of non-empty lists instead.
The price to pay for including the non-emptiness premise is the additional condition nextNotEmpty on the transition function next that was already stated in Sec. 3.3. Moreover, we have to postulate non-emptiness preservation laws for the monad operations (Appendix 1.2) and to prove an additional lemma about the preservation of non-emptiness (Appendix 4). Conceptually, it might seem cleaner to omit the non-emptiness condition: in this case, the remaining conditions would only concern the interaction between the monad, the measure, the type of values and the binary operation ⊕. However, the non-emptiness preservation laws seem less restrictive with respect to the monad. In particular, they hold for our above example of ordinary lists (the relevant proofs can be found in the supplementary material). Thus we have opted for explicitly restricting the next function instead of implicitly restricting the class of monads for which the result of Sec. 6 holds.
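For lists, with NotEmpty as in the sketch above, preservation under map is immediate; a minimal sketch (names are ours, not the framework's):

    -- map sends non-empty lists to non-empty lists: the cons case is
    -- trivial, the nil case is ruled out by the premise.
    mapPresNotEmpty : (f : a → b) → (xs : List a) →
                      NotEmpty xs → NotEmpty (map f xs)
    mapPresNotEmpty f []        ne = void ne
    mapPresNotEmpty f (x :: xs) ne = ()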

Conclusion
In this paper, we have proposed correctness criteria for monadic backward induction and its underlying value function in the framework for specifying and solving finite-horizon, monadic SDPs proposed in (Botta et al., 2017a). After having shown that these criteria are not necessarily met for arbitrary monadic SDPs, we have formulated three general compatibility conditions. We have given a proof that monadic backward induction and its underlying value function are correct if these conditions are fulfilled.
The main theorem has been proved via the extensional equality of two functions: 1) the value function of Bellman's dynamic programming (Bellman, 1957) and optimal control theory (Bertsekas, 1995; Puterman, 2014) that is also at the core of the generic backward induction algorithm of (Botta et al., 2017a), and 2) the measured total reward function that specifies the objective of decision making in monadic SDPs: the maximisation of a measure of the sum of the rewards along the trajectories rooted at the state associated with the first decision.
Our contribution to verified optimal decision making is twofold: on the one hand, we have implemented a machine-checked generalisation of the semi-formal results for deterministic and stochastic SDPs discussed in (Bertsekas, 1995, Prop. 1.3.1) and (Puterman, 2014, Theorem 4.5.1.c). As a consequence, we now have a provably correct method for solving deterministic and stochastic sequential decision problems with their canonical measure functions. On the other hand, we have identified three general conditions that are sufficient for the equivalence between the two functions, and thus for the correctness result to hold. The first two conditions are natural compatibility conditions between the measure of uncertainty meas and the monadic operations associated with the uncertainty monad M. The third condition is a distributivity principle concerning the relationship between meas, the functorial map associated with M, and the rule for adding rewards ⊕. All three conditions have a straightforward category-theoretical interpretation in terms of Eilenberg-Moore algebras (MacLane, 1978, ch. VI.2). As discussed in Sec. 7, the three conditions are independent and have non-trivial implications for the measure and the addition function that cannot be derived from the monotonicity condition on meas already imposed in (Ionescu, 2009; Botta et al., 2017a).
A consequence of this contribution is more flexibility: we can now compute verified solutions of stochastic sequential decision problems in which the measure of uncertainty differs from the expected value measure. This is important for applications in which the goal of decision making is, for example, to maximise the value of worst-case outcomes. To the best of our knowledge, the formulation of the compatibility conditions and the proof of the equivalence between the two value functions are novel results.
The latter can be employed in a wider context than the one that has motivated our study: in many practical problems in science and engineering, the computation of optimal policies via backward induction (leaving aside brute-force or gradient methods) is simply not feasible. In these problems one often still needs to generate, evaluate and compare different policies, and our result shows under which conditions such evaluation can safely be done via the "fast" value function val of standard control theory.
Finally, our contribution is an application of verified, literate programming to optimal decision making: the sources of this document have been written in literate Idris and are available at (Brede & Botta, 2021), where the reader can also find the bare code and some examples. Although the development has been carried out in Idris, it should be readily reproducible in other implementations of type theory like Agda or Coq.
• Its pure and join operations are natural transformations (see MacLane, 1978, I.4):

    map f • pure .= pure • f
    map f • join .= join • map (map f)

and fulfil the neutrality and associativity axioms:

    join • pure .= id
    join • map pure .= id
    join • map join .= join • join

Notice that the above specification of the monad operations is not minimal: (>>=) is not assumed to be implemented in terms of join and map (nor are map and join assumed to be implemented via (>>=) and pure). Therefore (>>=) (pronounced "bind"), join and map have to fulfil:

    (mv >>= f) = join (map f mv)

The equality in the axioms is extensional equality, not the type theory's definitional equality: as Idris does not have function extensionality, not postulating definitional equalities makes the axioms strictly weaker.
The BJI-framework also requires M to be equipped with type-level membership, with a universal quantifier, and with a type-valued predicate

    NotEmpty : {A : Type} → M A → Type

For our purposes, the monad operations are moreover required to fulfil the following preservation laws:

• The M-structure obtained from lifting an element into the monad is not empty:

    NotEmpty (pure a)

• The monad's map and bind preserve non-emptiness:

    NotEmpty ma → NotEmpty (map f ma)
    NotEmpty ma → ((a : A) → NotEmpty (f a)) → NotEmpty (ma >>= f)

As discussed in Sec. 7, we could have omitted these non-emptiness preservation laws, but this would have implicitly restricted the class of monads for which the correctness result holds.

Preservation of extensional equality
We have stated above that, for our correctness proof, the functor M has to satisfy the monad laws and, moreover, its functorial map has to preserve extensional equality. This is the case, e.g., for M = Identity (no uncertainty), M = List (non-deterministic uncertainty) and for the finite probability distributions (stochastic uncertainty, M = Prob) discussed in (Botta et al., 2017a). For M = List, the proof of mapPresEE is a straightforward structural induction; a sketch is given below. The principle of extensional equality preservation is discussed in more detail in (Botta et al., n.d.).
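A minimal sketch of the List instance (our rendering; implicit arguments in the framework's formulation may differ):

    mapPresEE : (f, g : a → b) → ((x : a) → f x = g x) →
                (xs : List a) → map f xs = map g xs
    mapPresEE f g fEEg []        = Refl
    mapPresEE f g fEEg (x :: xs) =
      -- rewrite f x to g x in the head, then use the induction
      -- hypothesis for the tail
      trans (cong {f = \y => y :: map f xs} (fEEg x))
            (cong {f = \ys => g x :: ys} (mapPresEE f g fEEg xs))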
Correctness of the value function

Lemmas

The proof of valMeasTotalReward relies on a few auxiliary results, which we prove here.
The first is a lemma about the interaction of head and trj interleaved with map.

If we can compute maxima and maximising controls of cval, then we can implement optimal extensions of arbitrary policy sequences. As it turns out, this intuition is correct. The observation that functions cvalmax and cvalargmax that fulfil cvalmaxSpec and cvalargmaxSpec are sufficient to implement an optimal extension optExt that fulfils optExtSpec naturally raises the question of what the necessary and sufficient conditions for cvalmax and cvalargmax are. Answering this question necessarily requires discussing properties of cval and goes well beyond the scope of formulating a theory of SDPs. Here, we limit ourselves to remarking that if Y t x is finite and non-empty, one can implement the functions cvalmax and cvalargmax by linear search (a sketch is given below). A generic implementation of cvalmax and cvalargmax can be found under (Brede & Botta, 2021).
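A minimal sketch of such a linear search (hypothetical, self-contained names; the framework's cvalmax and cvalargmax range over the controls in Y t x):

    -- Linear search over a non-empty candidate list, given a valuation f:
    -- returns a maximising argument together with the maximal value.
    argmaxMax : Ord v => (f : a → v) → (y : a) → (ys : List a) → (a, v)
    argmaxMax f y []        = (y, f y)
    argmaxMax f y (z :: zs) =
      let (y', v') = argmaxMax f z zs in
      if f y >= v' then (y, f y) else (y', v')

cvalmax can then be obtained as the second and cvalargmax as the first component of the result, with the valuation f instantiated to cval.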
For the original BJI-theory, tabulated backward induction and several example applications can be found in the SequentialDecisionProblems folder of (Botta, 2016–2021).

Condition 1. The measure needs to be left-inverse to pure:

    measPureSpec : meas • pure .= id

Condition 2. Applying the measure after join needs to be extensionally equal to applying it after map meas:

    measJoinSpec : meas • join .= meas • map meas

Condition 3. For arbitrary v : Val and non-empty mv : M Val, applying the measure after mapping (v⊕) onto mv needs to be equal to applying (v⊕) after the measure:

    measPlusSpec : (v : Val) → (mv : M Val) → NotEmpty mv →
                   (meas • map (v⊕)) mv = ((v⊕) • meas) mv
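For the finite probability monad with meas the expected value and ⊕ = +, the three conditions amount to familiar facts; a sketch in standard probabilistic notation (not the paper's):

\[
\mathbb{E}[\delta_v] = v,
\qquad
\mathbb{E}[\mathit{join}\,\Xi] = \mathbb{E}_{\mu \sim \Xi}\big[\mathbb{E}[\mu]\big],
\qquad
\mathbb{E}[v + X] = v + \mathbb{E}[X],
\]

i.e. the expectation of a point distribution, the tower property of expectation, and linearity of expectation.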