
A necessary and sufficient condition for the unique solution of the Bellman equation for LTL surrogate rewards

Published online by Cambridge University Press:  04 November 2025

A response to the following question: How to ensure safety of learning-enabled cyber-physical systems?

Zetong Xuan*
Affiliation:
Department of Mechanical and Aerospace Engineering, University of Florida, Gainesville, FL, USA
Alper Kamil Bozkurt
Affiliation:
Department of Computer Science, University of Maryland, College Park, MD, USA
Miroslav Pajic
Affiliation:
Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA
Yu Wang
Affiliation:
Department of Mechanical and Aerospace Engineering, University of Florida, Gainesville, FL, USA
*
Corresponding author: Zetong Xuan; Email: z.xuan@ufl.edu

Abstract

Linear temporal logic (LTL) offers a formal way of specifying complex objectives for cyber-physical systems (CPS). In the presence of uncertain dynamics, planning for an LTL objective can be carried out by model-free reinforcement learning (RL), which commonly relies on surrogate rewards for the LTL objective. In a widely adopted surrogate reward approach, two discount factors are used to ensure that the expected return approximates the satisfaction probability of the LTL objective. The expected return can then be estimated by methods relying on Bellman updates, such as RL. However, the uniqueness of the solution to the Bellman equation with two discount factors has not been explicitly discussed. We demonstrate that when one of the discount factors is set to one, as allowed in many previous works, the Bellman equation may have multiple solutions, leading to an inaccurate evaluation of the expected return. To address this issue, we propose a condition that ensures the Bellman equation has the expected return as its unique solution: the solutions for states within rejecting bottom strongly connected components (BSCCs) must be zero. We prove that this condition guarantees uniqueness, first for states within BSCCs and then for the remaining transient states.
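The non-uniqueness described in the abstract can be illustrated with a minimal sketch. The Markov chain, transition probabilities, and state labels below are hypothetical (constructed in the spirit of the paper's three-state example, not taken from it): `s0` is transient, `s1` forms an accepting BSCC, and `s2` forms a rejecting BSCC. With the second discount factor set to one, any constant value on the rejecting BSCC yields a fixed point, so value iteration converges to a wrong answer unless the rejecting-BSCC values are pinned to zero as the proposed condition requires.

```python
import numpy as np

# Hypothetical 3-state Markov chain: s0 is transient, s1 is an
# accepting BSCC (self-loop), s2 is a rejecting BSCC (self-loop).
P = np.array([
    [0.0, 0.5, 0.5],   # s0 -> s1 or s2 with probability 1/2 each
    [0.0, 1.0, 0.0],   # s1: accepting, absorbing
    [0.0, 0.0, 1.0],   # s2: rejecting, absorbing
])
accepting = np.array([False, True, False])
gamma_B = 0.99   # discount applied in accepting states
gamma = 1.0      # discount elsewhere, as allowed in prior work

def bellman_iterate(V, clamp_rejecting=False, iters=10_000):
    """Iterate the two-discount Bellman update on a fixed Markov chain."""
    for _ in range(iters):
        EV = P @ V
        # Accepting states collect reward (1 - gamma_B) and discount by
        # gamma_B; all other states discount by gamma (here gamma = 1).
        V = np.where(accepting, (1 - gamma_B) + gamma_B * EV, gamma * EV)
        if clamp_rejecting:
            V[2] = 0.0   # enforce zero on the rejecting BSCC
    return V

# From a zero initialization, iteration converges to the satisfaction
# probabilities: V(s0) ~ 0.5, V(s1) ~ 1, V(s2) = 0.
print(bellman_iterate(np.zeros(3)))

# With gamma = 1, any constant on the rejecting BSCC is also a fixed
# point, so a nonzero initialization there converges to a wrong value:
# V(s2) stays at 0.4 and V(s0) is inflated to ~0.7.
print(bellman_iterate(np.array([0.0, 0.0, 0.4])))

# Pinning the rejecting-BSCC value to zero restores the correct answer.
print(bellman_iterate(np.array([0.0, 0.0, 0.4]), clamp_rejecting=True))
```

This mirrors the paper's observation at a toy scale: the Bellman operator is not a contraction on the rejecting BSCC when its discount factor equals one, and the zero-value condition singles out the fixed point equal to the satisfaction probability.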

Information

Type
Results
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Figure 1. Example of a three-state Markov decision process. The resulting Bellman equation (9) has multiple solutions when $\gamma = 1$ in the surrogate reward (3), which can mislead the agent into suboptimal actions.


Figure 2. Under different discount factors $\gamma_B = 1-10^{-3}, 1-10^{-4}, 1-10^{-5}$, the evolution of $\delta(k) = \Vert U_{(k+1)} - U_{(k)}\Vert_\infty$ (Bellman error) and $e(k) = \Vert U_{(k)} - \bar{V} \Vert_\infty$ (approximation error) during dynamic programming updates with $U_0 = \mathbb{O}$. (a) The vanishing of $\delta(k)$ shows that the approximate value function $U_{(k)}$ converges to a solution of the Bellman equation in all settings. (b) The different values of $e(K)$ show that the accuracy of the approximation is determined by $1-\gamma_B$. The final errors are $e(K)=0.4456,\ 0.0740,\ 0.0079$ for the respective values of $\gamma_B$. When the discount factor $\gamma_B=0.999$ is farther from 1, $e(k)$ grows instead of decreasing, because $U_{(k)}$ is then converging to a value function that deviates from the satisfaction probability.


Figure 3. Under different initial conditions $U_0 = \mathbb{O}$ and $U_0 \neq \mathbb{O}$, the evolution of $\delta(k) = \Vert U_{(k+1)} - U_{(k)}\Vert_\infty$ (Bellman error) and $e(k) = \Vert U_{(k)} - \bar{V} \Vert_\infty$ (approximation error) during dynamic programming updates with $\gamma_B = 0.99999$. (a) The vanishing of $\delta(k)$ confirms that the approximate value function $U_{(k)}$ converges to a solution of the Bellman equation in all settings. (b) The decrease of $e(k)$ shows that the numerical solution is close to the satisfaction probability only under the initial condition $U_0 = \mathbb{O}$. Even when $U_{(k)}$ converges, $e(k) \gt 0$ remains because the true value function is only an approximation of the satisfaction probability.


Figure 4. Given $U_0 \sim [0,1]^{m+n}$, the evolution of the approximate value function $U_{(k)}$ for the four states $s_1,s_2,s_3$ and $s_4$ inside one rejecting BSCC during the updates. Since the Bellman update does not involve discounting within a rejecting BSCC, the approximate value function converges to the same nonzero constant for all states inside the BSCC. This constant differs from the true value function, demonstrating that the solution is incorrect when the condition in Theorem 1 does not hold.


Figure 5. Neural network value approximation under three setups: Subset (train only on $S\backslash \neg B_R$, excluding $\neg B_R$ from inputs/outputs), Baseline (train on all states with standard initialization), and Init0 (train on all states with zero-bias initialization so $V_\theta(s)\approx 0$ at $k{=}0$). (a) Training loss $\mathcal{L}(\theta_k)$, i.e., the mean squared Bellman residual on $S_{{\rm train}}$. (b) Value error ${\rm MSE}_{\mathcal{S}}(\theta_k)$ on the evaluation set. Empirically, for all setups, the training loss drops and, after $k\geq 8\times 10^{4}$, stabilizes below $10^{-5}$, while the value error changes only marginally thereafter. The Subset setup yields substantially smaller value error than Baseline and Init0.


Figure 6. ${\rm MSE}_{\neg B_R}(\theta_k)$ on rejecting BSCCs. The error in both Baseline and Init0 converges to an incorrect, nonzero level, indicating the neural network outputs on $\neg B_R$ violate the uniqueness condition in Theorem 1. Although Init0 starts near zero by construction, its error quickly rises to a nonzero plateau due to generalization during training.

Author Comment: A necessary and sufficient condition for the unique solution of the Bellman equation for LTL surrogate rewards — R0/PR1

Comments

No accompanying comment.

Review: A necessary and sufficient condition for the unique solution of the Bellman equation for LTL surrogate rewards — R0/PR2

Comments

# Summary

This paper presents a necessary and sufficient condition for the Bellman equation used for LTL surrogate rewards to have a unique solution.

In reinforcement learning (RL), it is common practice to transform an LTL objective into a discounted reward objective via surrogate rewards.

The resulting Bellman equation contains two discount factors, one of which is set to 1 in previous work.

However, as this paper shows, when one of these discount factors is 1, the solution to the Bellman equation is no longer unique, meaning an RL policy may be evaluated incorrectly, as the policy evaluation step may converge to a different solution.

The paper studies and presents a necessary and sufficient condition based on the bottom strongly connected components of the induced Markov chain, under which the Bellman equation converges to the correct solution even when one of the discount factors is 1.

A numerical example supports the theoretical results.

# Strengths

The paper identifies a relevant issue, as using surrogate rewards with two discount factors where one of those is set to 1 is a recent and common practice in RL papers.

The exposition of the problem is well-written (up to some minor typos listed below), and includes a simple example to show when the problem arises.

The methodology to fix the problem is conceptually straightforward and primarily relies on identifying, accepting, and rejecting BSCCs, an approach that the broader planning and RL communities should easily understand. The proofs for the correctness of this method rely on basic linear algebra, and I did not spot any obvious errors.

A numerical example highlights that the proposed fix correctly converges, in contrast to other naive approaches.

# Weaknesses

- There are some typos and other presentation issues that need to be fixed.

- The numerical experiments miss some key information. In particular, implementation details are not fully discussed. It would be good to mention in what language and with which datatypes (I assume floats?) numbers are represented. While the difference between methods is large enough not to be due to floating point issues, it would nonetheless be good to discuss this.

## Minor comments, typos, and other suggestions

- page 2, LTL introduction: "conjunction (∧), two temporal operators .." -> conjunction (∧), *and* two temporal operators ..

- page 4, def. 6: "A BSCC is rejecting if all states \notin B." to make the definition more self-contained, recall that B is the set of accepting states of the LDBA (and not the BSCC).

- page 5, below prop. : "Then, one can use {..} invertible to show the solutions {..} is unique .." -> use {} *is* invertible to show the solution U_B *is* unique.

- page 6, lemma 3: \lambda(s) is used without context.

- page 6, above proof of prop 2. "Diectly" -> directly.

- page 7: notation of accepting and rejecting BSCCs B_A and B_R. This notation may be slightly confusing as A is the set of actions, and R is the reward function. Consider using a different font or explicitly mentioning this (slight) overloading of notation.

- page 7: reference to Eq. 24: either ensure the equation is on the same page, mention it is on the next page, and/or ensure the equation number is consistent with the others. Currently, it is confusing that equation 24 is mentioned while only eq. 23 and eq. 25 are above and below that paragraph.

- The presentation of the numerical results could be made clearer, for example, via a small table.

Presentation

Overall score 4 out of 5
Is the article written in clear and proper English? (30%)
4 out of 5
Is the data presented in the most useful manner? (40%)
4 out of 5
Does the paper cite relevant and related articles appropriately? (30%)
5 out of 5

Context

Overall score 4 out of 5
Does the title suitably represent the article? (25%)
4 out of 5
Does the abstract correctly embody the content of the article? (25%)
5 out of 5
Does the introduction give appropriate context and indicate the relevance of the results to the question or hypothesis under consideration? (25%)
5 out of 5
Is the objective of the experiment clearly defined? (25%)
4 out of 5

Review: A necessary and sufficient condition for the unique solution of the Bellman equation for LTL surrogate rewards — R0/PR3

Comments

Overall, the paper is well-written and the ideas are presented clearly and are novel, which is why the original work has been published at L4DC. This work differs from that work by having some additional proofs and the inclusion of a case study section. I understand this extension crosses the threshold necessary to mean it is not self-plagiarising.

Despite this, I still think the work should be either rejected or be revised to include more theory or a substantial set of benchmarking. I do not believe this work has suitably extended the conference version to be a new journal submission. I believe in the current form it would be more suitable for the proofs and case study to be simply included as an extended version on arXiv instead.

I'll focus my review on the new aspects, which I believe are: the conversion of Remark 13 to Lemma 3, attributing to it the proof previously assigned to Proposition 2; a proof for Lemma 4; a new proof for Proposition 2; a proof for Lemma 5; and a new case study section.

Overall, the proofs seem correct and are written clearly. I think it would be helpful to make the earlier discussion of the Gershgorin Circle Theorem into its own lemma, as the result is used multiple times in the proofs to establish invertibility.

The case study is substantial with both an interesting nursing example and a complex LTL planning specification to be solved. I think it would be helpful to provide some high-level insight into the nursing example to give intuition to the reader for using your approach in practice.

The discussion on the solution to the Bellman equation is also well described. Using $6\times 10^6$ iterations to show convergence seems like a lot; would alternative algorithms like interval iteration (Haddad, 2018) work in this setup and perhaps provide the convergence guarantee sooner?

Because there is no high-level description of the case study and its application, when the large error $\Vert U_{(k)} - \bar{V}\Vert_\infty$ is given, it is hard to gauge whether the two numbers are completely useless or merely very conservative but usable.

Can you explain or provide intuition for the approximation error increasing after $10^4$ iterations for $\gamma_B = 1-10^{-3}$?

I would suggest moving the $(\times 10^6)$ from the x-axis ticks into the axis label, and then simply putting 0, 1, 2, 3, 4, ... along the x-axis for readability. Figure 4 could include a legend.

Recommendation: A necessary and sufficient condition for the unique solution of the Bellman equation for LTL surrogate rewards — R0/PR4

Comments

No accompanying comment.

Author Comment: A necessary and sufficient condition for the unique solution of the Bellman equation for LTL surrogate rewards — R1/PR5

Comments

No accompanying comment.

Decision: A necessary and sufficient condition for the unique solution of the Bellman equation for LTL surrogate rewards — R1/PR6

Comments

No accompanying comment.