
An online scalarization multi-objective reinforcement learning algorithm: TOPSIS Q-learning

Published online by Cambridge University Press:  13 June 2022

Mohammad Mirzanejad
Affiliation: Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran; e-mail: mirzanejad@ut.ac.ir

Morteza Ebrahimi
Affiliation: Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran; e-mail: mo.ebrahimi@ut.ac.ir

Peter Vamplew
Affiliation: School of Engineering, Information Technology and Physical Sciences, Federation University Australia, Ballarat, Australia; e-mail: p.vamplew@federation.edu.au

Hadi Veisi
Affiliation: Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran; e-mail: h.veisi@ut.ac.ir

Abstract

Conventional reinforcement learning focuses on problems with a single objective. However, many problems have multiple objectives or criteria that may be independent, related, or contradictory. In such cases, multi-objective reinforcement learning seeks a compromise among solutions that balances the objectives. TOPSIS is a multi-criteria decision method that selects the alternative with the minimum distance from the positive ideal solution and the maximum distance from the negative ideal solution, so it can be used effectively in the decision-making process for selecting the next action. This research presents a single-policy algorithm called TOPSIS Q-learning, with a focus on its performance in online mode. Unlike other single-policy methods, the first version of the algorithm does not require the user to specify weights for the objectives. Because the user's preferences may not be completely definite, all weight preferences are combined as decision criteria, and a solution is generated that considers all of them at once; the user can thereby model uncertainty and weight changes around their stated objective preferences. If the user wants to apply the algorithm only to a specific set of weights, the second version of the algorithm accomplishes that efficiently.
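The TOPSIS selection step the abstract describes can be sketched as follows. This is a minimal, generic TOPSIS ranking, not the paper's algorithm: the per-objective Q-values used as the decision matrix, the equal weights, and the assumption that all objectives are benefit criteria (larger is better) are illustrative choices.

```python
import numpy as np

def topsis_rank(decision_matrix, weights):
    """Score alternatives (rows) against criteria (columns) with TOPSIS.

    Assumes all criteria are benefit criteria (larger is better); a cost
    criterion would need its ideal/anti-ideal roles swapped.
    """
    X = np.asarray(decision_matrix, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                       # normalize weights to sum to 1

    # Vector-normalize each criterion column, then apply the weights.
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0               # guard against all-zero columns
    V = (X / norms) * w

    # Positive ideal = column-wise best; negative ideal = column-wise worst.
    pos_ideal = V.max(axis=0)
    neg_ideal = V.min(axis=0)

    # Euclidean distances from each alternative to the two ideals.
    d_pos = np.linalg.norm(V - pos_ideal, axis=1)
    d_neg = np.linalg.norm(V - neg_ideal, axis=1)

    # Relative closeness: 1 = at the positive ideal, 0 = at the negative.
    return d_neg / (d_pos + d_neg + 1e-12)

# Hypothetical per-objective Q-values for three candidate actions.
q_values = [[0.9, 0.9],   # action 0: strong on both objectives
            [0.1, 0.1],   # action 1: weak on both
            [0.5, 0.8]]   # action 2: mixed
scores = topsis_rank(q_values, weights=[0.5, 0.5])
best_action = int(np.argmax(scores))    # the action closest to the ideal
```

In an action-selection setting, the rows of the decision matrix would be the available actions and the columns their estimated values under each objective; the action with the highest relative closeness is chosen.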

Information

Type
Research Article
Copyright
© The Author(s), 2022. Published by Cambridge University Press
