
An empirical investigation of value-based multi-objective reinforcement learning for stochastic environments

Published online by Cambridge University Press:  15 August 2025

Kewen Ding
Affiliation:
Federation University Australia, Mt Helen, VIC, Australia
Peter Vamplew*
Affiliation:
Federation University Australia, Mt Helen, VIC, Australia
Cameron Foale
Affiliation:
Federation University Australia, Mt Helen, VIC, Australia
Richard Dazeley
Affiliation:
Deakin University, Geelong, VIC, Australia
*
Corresponding author: Peter Vamplew; Email: p.vamplew@federation.edu.au

Abstract

One common approach to solving multi-objective reinforcement learning (MORL) problems is to extend conventional Q-learning by using vector Q-values in combination with a utility function. However, issues can arise with this approach in stochastic environments, particularly when optimising for the scalarised expected reward (SER) criterion. This paper extends prior research, providing a detailed examination of the factors influencing the frequency with which value-based MORL Q-learning algorithms learn the SER-optimal policy for an environment with stochastic state transitions. We empirically examine several variations of the core multi-objective Q-learning algorithm, as well as reward engineering approaches, and demonstrate the limitations of these methods. In particular, we highlight the critical impact of the noisy Q-value estimates issue on the stability and convergence of these algorithms.
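The approach the abstract describes, value-based MORL with vector Q-values ranked by a utility function, can be sketched minimally as follows. This is an illustrative sketch only, not the authors' algorithm or the Space Traders environment: the one-state, two-objective bandit-style task, the reward probabilities, and the linear scalarisation weights are all assumptions chosen for clarity.

```python
import random

random.seed(0)

N_ACTIONS, N_OBJECTIVES = 2, 2
ALPHA, EPSILON = 0.1, 0.1

# Vector Q-values: one estimate per objective for each action (single state).
Q = [[0.0] * N_OBJECTIVES for _ in range(N_ACTIONS)]

def step(action):
    """Stochastic vector reward (illustrative dynamics): action 0 usually
    pays off on objective 0, action 1 always pays off on objective 1."""
    if action == 0:
        return [1.0, 0.0] if random.random() < 0.9 else [0.0, 0.0]
    return [0.0, 1.0]

def utility(q_vec, weights=(0.7, 0.3)):
    """Linear scalarisation: value-based MORL applies a utility function
    like this (or a non-linear one such as TLO) to the vector estimates
    in order to rank actions."""
    return sum(w * q for w, q in zip(weights, q_vec))

for _ in range(5000):
    # epsilon-greedy over the scalarised utility of the vector estimates
    if random.random() < EPSILON:
        a = random.randrange(N_ACTIONS)
    else:
        a = max(range(N_ACTIONS), key=lambda i: utility(Q[i]))
    r = step(a)
    for o in range(N_OBJECTIVES):
        # single-step episodes here, so no bootstrapped term in the update
        Q[a][o] += ALPHA * (r[o] - Q[a][o])

greedy = max(range(N_ACTIONS), key=lambda i: utility(Q[i]))
```

Because each objective's estimate is learned separately from stochastic rewards, the scalarised ranking of actions can fluctuate as the estimates are updated, which is the noisy Q-value estimates effect the paper identifies as a key obstacle under SER.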

Information

Type
Research Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Algorithm 1 Multi-objective Q($\lambda$) using accumulated expected reward as an approach to finding deterministic policies for the SER context (Vamplew et al., 2022a)


Figure 1. The Space Traders MOMDP. Solid black lines show the Direct actions, solid grey lines show the Indirect actions, and dashed lines indicate Teleport actions. Solid black circles indicate terminal (failure) states (Vamplew et al., 2022a)


Table 1. The probability of success and reward values for each state-action pair in the Space Traders MOMDP (Vamplew et al., 2022a)


Table 2. The mean return for the nine available deterministic policies for the Space Traders Environment (Vamplew et al., 2022a)


Table 3. Hyperparameters used for experiments with the Space Traders environment


Table 4. The final greedy policies learned in twenty independent runs of the baseline multi-objective Q-learning algorithm (Algorithm 1) on the Space Traders environment


Table 5. The final greedy policies learned in twenty independent runs of baseline multi-objective Q-learning (Algorithm 1) with constant or decayed learning rates for the Space Traders environment


Figure 2. Policy charts showing the greedy policy produced by the baseline multi-objective Q-learning algorithm (Algorithm 1) on the Space Traders environment. Each chart shows the greedy policy identified by the agent at each episode for one of four different trials, each culminating in a different final policy. The dashed green line represents the threshold used for TLO, to highlight which policies meet this threshold
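The TLO threshold referenced in this caption works by clipping the thresholded objective: once a policy meets the threshold, further gains on that objective are ignored and the comparison falls through to the next objective. A minimal sketch of that ordering; the candidate values, action names, and threshold below are illustrative assumptions, not values from Space Traders:

```python
def tlo_key(q_vec, threshold):
    """Thresholded lexicographic ordering over two objectives: the first
    objective is clipped at the threshold, so any action meeting the
    threshold ties on it and is ranked by the second objective instead."""
    return (min(q_vec[0], threshold), q_vec[1])

# Hypothetical (success probability, time reward) estimates for three actions.
candidates = {"a": (0.95, -12.0), "b": (0.90, -10.0), "c": (0.80, -6.0)}
threshold = 0.88

best = max(candidates, key=lambda k: tlo_key(candidates[k], threshold))
# "a" and "b" both meet the threshold, so the less costly "b" is preferred;
# "c" falls short and ranks below both despite its better second objective.
```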


Figure 3. The policy chart for the baseline method with a decayed learning rate on the Space Traders environment


Table 6. The probability of success and reward values for each state-action pair in Space Traders MR


Table 7. The final greedy policies learned in twenty independent runs of Algorithm 1 with a decayed learning rate for the Space Traders MR environment, compared to the original Space Traders environment


Figure 4. The Space Traders MR environment, which has the same state transition dynamics as the original Space Traders but with a modified reward design. The changed rewards have been highlighted in red


Figure 5. Policy charts for MOQ-learning on the Space Traders MR and 3ST environments. Chart (a) shows convergence to the SER-optimal DI policy on Space Traders MR, whereas (b) shows convergence to the suboptimal ID policy when applying the same algorithm and reward design to the Space Traders 3ST environment


Figure 6. The Space Traders 3-State environment, which adds an additional state C to the Space Traders MOMDP, giving a more complex state structure. All changes are highlighted in red


Table 8. The final greedy policies learned in twenty independent runs of Algorithm 1 for the Space Traders 3-State environment, compared to the Space Traders MR environment. All runs used a decayed learning rate


Algorithm 2 The multi-objective stochastic state Q($\lambda$) algorithm (MOSS Q-learning). Highlighted text identifies the changes and extensions introduced relative to multi-objective Q($\lambda$) as previously described in Algorithm 1.


Algorithm 3 The update-statistics helper algorithm for MOSS Q-learning (Algorithm 2). Given a particular state s, it updates the global variables which store statistics related to s. It then returns an augmented state formed from the concatenation of s with the estimated mean accumulated reward when s is reached, and a utility vector U which estimates the mean vector return over all episodes for each action available in s
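The bookkeeping this caption describes amounts to maintaining running means. A minimal sketch of that idea, with data-structure and function names of our own invention (the authors' actual implementation may differ):

```python
from collections import defaultdict

visits = defaultdict(int)             # times each state has been reached
mean_acc_reward = defaultdict(float)  # running mean accumulated reward on reaching s
episode_counts = defaultdict(int)     # episodes contributing to each (s, action)
mean_return = defaultdict(lambda: [0.0, 0.0])  # running mean vector return per (s, action)

def update_statistics(s, acc_reward, n_actions):
    """Update the global statistics for state s and return the augmented
    state (s concatenated with the mean accumulated reward so far) together
    with U, the per-action mean vector-return estimates for s."""
    visits[s] += 1
    # incremental update of a running mean: m += (x - m) / n
    mean_acc_reward[s] += (acc_reward - mean_acc_reward[s]) / visits[s]
    augmented_state = (s, mean_acc_reward[s])
    U = [mean_return[(s, a)] for a in range(n_actions)]
    return augmented_state, U

def record_return(s, a, vec_return):
    """At episode end, fold the observed vector return into the running
    mean for (s, a)."""
    episode_counts[(s, a)] += 1
    n = episode_counts[(s, a)]
    m = mean_return[(s, a)]
    for o in range(len(m)):
        m[o] += (vec_return[o] - m[o]) / n
```

Augmenting the state with the mean accumulated reward lets the agent condition its Q-values on what has been received so far, which matters under SER because the optimal continuation can depend on the rewards already banked.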


Table 9. The final greedy policies learned in twenty independent runs of the MOSS algorithm with a decayed learning rate for both the Space Traders and Space Traders ID environments. Red text highlights the SER-optimal policy for each environment


Figure 7. The policy chart for the MOSS algorithm with a decayed learning rate on the original Space Traders environment


Figure 8. The Space Traders ID variant environment. All changes compared with the original are highlighted in red. The changed rewards result in ID being the SER-optimal policy for this environment


Table 10. The probability of success and reward values for each state-action pair in the new variant Space Traders ID environment


Table 11. The mean return for the nine available deterministic policies for the Space Traders ID environment


Algorithm 4 Multi-objective Q($\lambda$) with policy options


Table 12. The final greedy policies learned in twenty independent runs of the policy-options MOQ-Learning algorithm for the variants of the Space Traders environment, with a decayed learning rate—red indicates the SER-optimal policy for each variant


Table 13. A comparison of the final greedy policies learned in twenty independent runs of the policy-options MOQ-Learning algorithm for the Space Traders environment variants, with either a decayed or constant learning rate—red indicates the SER-optimal policy for each variant


Figure 9. Policy charts for five sample runs of the policy-options MOQ-Learning algorithm with a constant learning rate on Space Traders


Figure 10. The Noisy Q Value Estimate issue in policy-options MOQ-learning with a constant learning rate. These graphs illustrate agent behaviour for a single run. The top graph shows which option/policy is viewed as optimal after each episode, while the lower graphs show the estimated Q-value for each objective for each option


Table A1. The final greedy policies learned in twenty independent runs of Algorithm 1 for the Space Traders 3-State environment, compared to the Space Traders MR environment


Figure A1. Policy charts for MOQ-learning with a constant learning rate on the Space Traders MR environment—each chart illustrates a sample run culminating in a different final policy


Table A2. The final greedy policies learned in twenty independent runs of the MOSS algorithm with either a constant or decayed learning rate for both the Space Traders and Space Traders ID environments. Red text highlights the SER-optimal policy for each environment


Figure A2. Policy charts for the MOSS algorithm with either a constant or a decayed learning rate on the original Space Traders environment