
Multi-agent credit assignment in stochastic resource management games

Published online by Cambridge University Press: 24 August 2017

Patrick Mannion
Affiliation:
Department of Computer Science & Applied Physics, Galway-Mayo Institute of Technology, Galway, Ireland and Discipline of Information Technology, National University of Ireland Galway, Galway, Ireland e-mail: patrick.mannion@gmit.ie
Sam Devlin
Affiliation:
Department of Computer Science, University of York, Deramore Lane, York, UK e-mail: sam.devlin@york.ac.uk
Jim Duggan
Affiliation:
Discipline of Information Technology, National University of Ireland Galway, Galway, Ireland e-mail: jim.duggan@nuigalway.ie
Enda Howley
Affiliation:
Discipline of Information Technology, National University of Ireland Galway, Galway, Ireland e-mail: enda.howley@nuigalway.ie

Abstract

Multi-agent systems (MASs) are a form of distributed intelligence, where multiple autonomous agents act in a common environment. Numerous complex, real-world systems have been successfully optimized using multi-agent reinforcement learning (MARL) in conjunction with the MAS framework. In MARL, agents learn by maximizing a scalar reward signal from the environment, and thus the design of the reward function directly affects the policies learned. In this work, we address the issue of appropriate multi-agent credit assignment in stochastic resource management games. We propose two new stochastic games to serve as testbeds for MARL research into resource management problems: the tragic commons domain and the shepherd problem domain. Our empirical work evaluates the performance of two commonly used reward shaping techniques: potential-based reward shaping and difference rewards. Experimental results demonstrate that systems using appropriate reward shaping techniques for multi-agent credit assignment can achieve near-optimal performance in stochastic resource management games, outperforming systems that learn using unshaped local or global evaluations. We also present the first empirical investigations into the effect of expressing the same heuristic knowledge in state- or action-based formats, thereby developing insights into the design of multi-agent potential functions that will inform future work.
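The two shaping signals evaluated in the abstract can be summarised briefly: potential-based reward shaping adds the term γΦ(s') − Φ(s) to the environment reward, while a difference reward compares the global evaluation G with a counterfactual evaluation in which the agent's own action is replaced by a default action. The sketch below is illustrative only and is not the authors' implementation; the potential function, global evaluation, and default action are hypothetical placeholders.

```python
# Illustrative sketch of the two shaping signals discussed in the abstract.
# The potential values, global evaluation G, and the default action used for
# the counterfactual are hypothetical placeholders, not the paper's domains.

GAMMA = 0.9  # assumed discount factor


def pbrs_shaped_reward(reward, phi_s, phi_s_next, gamma=GAMMA):
    """Potential-based reward shaping: r' = r + gamma * Phi(s') - Phi(s)."""
    return reward + gamma * phi_s_next - phi_s


def difference_reward(global_eval, joint_action, agent_idx, default_action):
    """Difference reward: D_i = G(z) - G(z with agent i's action replaced
    by a default (counterfactual) action)."""
    counterfactual = list(joint_action)
    counterfactual[agent_idx] = default_action
    return global_eval(joint_action) - global_eval(counterfactual)


if __name__ == "__main__":
    # Toy global evaluation: count how many agents abstain from grazing.
    G = lambda actions: sum(a == "abstain" for a in actions)
    joint = ["graze", "graze", "abstain"]
    print(difference_reward(G, joint, agent_idx=0, default_action="abstain"))
    print(pbrs_shaped_reward(reward=1.0, phi_s=0.2, phi_s_next=0.5))
```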

Information

Type
Adaptive and Learning Agents
Copyright
© Cambridge University Press, 2017 

Figure 1 Occupancy vs. commons value for an entire episode of the tragic commons domain

Figure 2 Single-step tragic commons domain results. (a) Commons values for L, G, and D. (b) Occupancy values for L, G, and D. (c) Commons values for G with heuristics. (d) Commons values for L with heuristics. sPBRS and aPBRS refer to potential-based reward shaping (PBRS) approaches with state-based and action-based potential functions, respectively; CaP=counterfactual as potential

Figure 3 Multi-step tragic commons domain results. (a) Commons values for L, G, and D. (b) Occupancy values for L, G, and D. (c) Commons values for G with heuristics. (d) Commons values for L with heuristics. sPBRS and aPBRS refer to potential-based reward shaping (PBRS) approaches with state-based and action-based potential functions, respectively; CaP=counterfactual as potential

Table 1 Tragic commons domain results (averaged over last 2000 episodes)

Figure 4 Shepherd problem domain (SPD) 3×3 grid topology with resource (state) numbers

Figure 5 Shepherd problem domain results. (a) Basic reward functions. (b) Best performing reward functions. (c) G with various heuristics. (d) L with various heuristics. sPBRS and aPBRS refer to potential-based reward shaping (PBRS) approaches with state-based and action-based potential functions, respectively; CaP=counterfactual as potential

Table 2 Shepherd problem domain results (averaged over last 1000 episodes)