
Causal temporal reasoning for Markov decision processes

Published online by Cambridge University Press:  10 February 2025

A response to the following question: How to ensure safety of learning-enabled cyber-physical systems?

Milad Kazemi*
Affiliation:
Department of Informatics, King’s College London, London, UK
Jessica Lally
Affiliation:
Department of Informatics, King’s College London, London, UK
Nicola Paoletti
Affiliation:
Department of Informatics, King’s College London, London, UK
Corresponding author: Milad Kazemi; E-mail: milad.kazemi@kcl.ac.uk

Abstract

We present PCFTL (Probabilistic CounterFactual Temporal Logic), a new probabilistic temporal logic for the verification of Markov Decision Processes (MDPs). PCFTL introduces operators for causal inference, allowing us to express interventional and counterfactual queries. Given a path formula ϕ, an interventional property is concerned with the satisfaction probability of ϕ if we apply a particular change I to the MDP (e.g., switching to a different policy); a counterfactual formula allows us to compute, given an observed MDP path τ, what the outcome of ϕ would have been had we applied I in the past, under the same random factors that led to observing τ. Our approach represents a departure from existing probabilistic temporal logics, which do not support such counterfactual reasoning. From a syntactic viewpoint, we introduce a counterfactual operator that subsumes both interventional and counterfactual probabilities, as well as the traditional probabilistic operator; this makes our logic strictly more expressive than PCTL. The semantics of PCFTL rely on a structural causal model translation of the MDP, which provides a representation amenable to counterfactual inference. We evaluate PCFTL in the context of safe reinforcement learning using a benchmark of grid-world models.
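To make the counterfactual query concrete: it follows the standard abduction–action–prediction recipe, i.e., infer exogenous noise consistent with the observed path τ, apply the intervention I (e.g., a new policy), and replay the same noise. The following minimal Python sketch illustrates this pattern on a tabular MDP using a simple inverse-CDF noise encoding; it is our illustration, not the paper's implementation (whose SCM translation uses a Gumbel-max encoding, see Figure 3), and all function names are hypothetical.

```python
import numpy as np

def step(P, s, a, u):
    """SCM-style transition: the next state is a deterministic function
    of (state, action, exogenous noise u ~ Uniform(0, 1)), obtained by
    inverting the CDF of the transition distribution P[s, a]."""
    return int(np.searchsorted(np.cumsum(P[s, a]), u))

def abduct(P, s, a, s_next, rng):
    """Abduction: sample noise u consistent with having observed the
    transition (s, a) -> s_next, i.e., u lies in s_next's CDF interval."""
    cdf = np.concatenate(([0.0], np.cumsum(P[s, a])))
    return rng.uniform(cdf[s_next], cdf[s_next + 1])

def counterfactual_path(P, states, actions, pi_int, rng):
    """Action + prediction: replay the observed path under the
    intervention policy pi_int, holding the abducted noise fixed."""
    s = states[0]
    path = [s]
    for t in range(len(actions)):
        u = abduct(P, states[t], actions[t], states[t + 1], rng)
        s = step(P, s, pi_int(s), u)
        path.append(s)
    return path
```

A counterfactual satisfaction probability for ϕ can then be estimated by averaging ϕ over many such replayed paths; an interventional probability drops the abduction step and samples fresh noise instead.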

Information

Type
Results
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Figure 1. Overview of our approach to PCFTL verification, with section pointers.


Figure 2. Causal diagram for the SCM encoding of an MDP. Black circles represent exogenous variables, while white circles represent endogenous ones.


Figure 3. Light switch MDP (Example 1). X-axis: $\log(P_{\mathcal{P}}(\mathsf{Off} \mid S_t, A_t)) + G_{\mathsf{Off},t}$; Y-axis: $\log(P_{\mathcal{P}}(\mathsf{On} \mid S_t, A_t)) + G_{\mathsf{On},t}$. Plots (a) and (b) are relative to the prior Gumbel $G$ and the observed path $\tau$ (using 1000 realizations for $G$). Plots (c) and (d) are relative to the posterior Gumbel $G_\tau$ and the counterfactual path $\tau'$. Points leading to state $\mathsf{On}$ are in red, while those for $\mathsf{Off}$ are in blue.
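The quantities on the axes come from a Gumbel-max encoding of the categorical transition kernel: the next state is $\arg\max_{s'}(\log P_{\mathcal{P}}(s' \mid S_t, A_t) + G_{s',t})$ with i.i.d. standard Gumbel noise, and abduction reduces to sampling the Gumbels a posteriori given the observed next state. The sketch below shows the standard top-down truncated-Gumbel construction for this posterior; it is our illustration (with hypothetical names and probabilities), not the paper's code, and assumes a normalized transition distribution.

```python
import numpy as np

def sample_posterior_gumbels(log_p, j, rng):
    """Sample G ~ Gumbel(0, 1)^K conditioned on the event
    argmax_k(log_p[k] + G[k]) == j (the abduction step).
    Assumes logsumexp(log_p) == 0, so the max is Gumbel(0, 1)."""
    K = len(log_p)
    top = rng.gumbel()                     # value of the max
    phi = np.empty(K)
    phi[j] = top
    for k in range(K):
        if k != j:
            # Gumbel(log_p[k]) truncated to lie below the max.
            g = rng.gumbel(loc=log_p[k])
            phi[k] = -np.log(np.exp(-top) + np.exp(-g))
    return phi - log_p                     # recover the exogenous noise G

def gumbel_max_step(log_p, G):
    """SCM transition: next state = argmax_k(log_p[k] + G[k])."""
    return int(np.argmax(log_p + G))

# Light switch illustration: suppose P(On | s, a) = 0.8 and we observed Off.
rng = np.random.default_rng(0)
log_p = np.log(np.array([0.2, 0.8]))       # states [Off, On]
G = sample_posterior_gumbels(log_p, j=0, rng=rng)
assert gumbel_max_step(log_p, G) == 0      # same noise replays Off
```

Replaying the same noise $G$ under an intervened transition distribution then yields the counterfactual next state, which is what separates the posterior plots from the prior ones.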


Figure 4. Three scenarios for the evaluation of $I@t.P_p(\phi)$. The observed path $\tau$ is in black. The counterfactual path (induced by the counterfactual MDP $\mathcal{P}' = \mathcal{P}_{\tau[|\tau|-t:]}$ and the intervention policy $\pi'$) is in dark blue (in general we have a distribution of such paths, but here we show only one for simplicity). Path extensions under the nominal policy $\pi$ are in gray, and those under $\pi'$ in light blue. The horizontal axis represents time (or path positions), and the vertical axis the MDP state (continuous and one-dimensional for illustration purposes). While none of the three examples hits the obstacle within the observed/counterfactual path, moving forward, $\pi$ yields a higher probability of this happening.


Figure 5. Counterfactual probabilities under the optimal policy $\pi^o$ (blue) given that we observe MDP paths under the nominal policy $\pi$ (orange). In (a) and (b), paths have length 10 (same as the time bound $T$ in $\phi$). In (c), we observe paths of length $2 < T$, and so applying $\pi^o$ results in paths that are part counterfactual, part post-interventional.


Table 1. PCFTL verification of the MiniGrid benchmark, with 6 × 6 grids. For each environment, we apply the intervention at the start of the path ($t = T - 1$) and 10 steps after the start ($t = T - 11$), where $T = 50$ is the length of the path. The SMC parameters (see Section 5) are $\delta = 0.02$, with $\alpha = 0.05$ and $\beta = 0.2$ for $P$ and $I@t.P$ properties, and $\alpha = 0.01$ and $\beta = 0.2$ for $\Delta^{I,\emptyset}_{@t}.P$ properties. $\top$ and $\bot$ indicate whether the SMC procedure returns true or false for the given PCFTL formulae, and in parentheses are the numbers of realizations required by SMC to reach this verdict.
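For context on the parameters: SMC decides each probability threshold by sequential hypothesis testing, where $\delta$ is the indifference half-width and $\alpha$, $\beta$ bound the type I and type II errors. The following textbook Wald SPRT sketch illustrates how these parameters drive the number of realizations; it is a generic illustration, and the paper's exact SMC procedure and conventions may differ.

```python
import math, random

def sprt(sample, p, delta, alpha, beta, max_n=100_000):
    """Wald's sequential probability ratio test, as commonly used in
    statistical model checking. Tests H0: P(phi) >= p + delta against
    H1: P(phi) <= p - delta, with error bounds alpha (type I) and
    beta (type II). `sample()` draws one Bernoulli realization of phi.
    Returns (verdict, n): verdict True means H0 was accepted; reaching
    max_n means the test was still inconclusive."""
    p0, p1 = p + delta, p - delta
    accept_h0 = math.log(beta / (1 - alpha))      # lower boundary
    accept_h1 = math.log((1 - beta) / alpha)      # upper boundary
    llr, n = 0.0, 0
    while accept_h0 < llr < accept_h1 and n < max_n:
        x = sample()
        n += 1
        # log-likelihood ratio of H1 vs H0 for this observation
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
    return llr <= accept_h0, n

# Hypothetical use with Table 1's parameters for a P property:
verdict, n = sprt(lambda: random.random() < 0.95,
                  p=0.9, delta=0.02, alpha=0.05, beta=0.2)
```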


Decision: Causal temporal reasoning for Markov decision processes — R1/PR4

Comments

The authors have addressed all the issues raised by the reviewers.