
‘I don’t want to play with you anymore’: dynamic partner judgements in moody reinforcement learners playing the prisoner’s dilemma

Published online by Cambridge University Press: 26 March 2024

Grace Feehan*
Affiliation:
Loughborough University, Epinal Way, Loughborough, LE11 3TU, UK
Shaheen Fatima
Affiliation:
Loughborough University, Epinal Way, Loughborough, LE11 3TU, UK
*Corresponding author: Grace Feehan; Email: g.feehan@lboro.ac.uk

Abstract

Emerging reinforcement learning algorithms that utilize human traits as part of their conceptual architecture have been demonstrated to encourage cooperation in social dilemmas when compared to their unaltered origins. In particular, the addition of a mood mechanism facilitates more cooperative behaviour in multi-agent iterated prisoner’s dilemma (IPD) games, for both static and dynamic network contexts. Mood-altered agents also exhibit humanlike behavioural trends when environmental aspects of the dilemma are altered, such as the structure of the payoff matrix used. It is possible that other environmental effects from both human and agent-based research will interact with moody structures in previously unstudied ways. As the literature on these interactions is currently small, we expand on previous research by introducing two further environmental dimensions: voluntary interaction in dynamic networks, and stability of interaction through varied network restructuring. Starting from an initial Erdős–Rényi random network, we manipulate the structure of a network IPD according to existing methodology in human-based research, to investigate whether its findings can be replicated. We also facilitate strategic selection of opponents by introducing two partner evaluation mechanisms, and test two selection thresholds for each. We find that even minimally strategic play termination in dynamic networks is enough to raise cooperation above the static level, though the thresholds for these strategic decisions are critical to the desired outcomes. More forgiving thresholds lead to better maintenance of cooperation between kinder strategies than stricter ones do, despite overall cooperation levels remaining relatively low. Additionally, moody reinforcement learning combined with certain play termination decision strategies can mimic trends in human cooperation affected by structural changes to the IPD played on dynamic networks, as can kind, simplistic strategies such as Tit-For-Tat. Implications of these results in comparison with human data are discussed, and suggestions for diversifying further testing are made.
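To make the partner-selection setup concrete, the sketch below shows one way an evaluation-and-termination rule of the kind described above could be implemented: agents on an Erdős–Rényi random network track each neighbour’s cooperation rate and sever ties with partners who fall below a selection threshold. This is a minimal illustration under assumed parameters; the network size, the threshold value, and the prune_partners logic are illustrative, not the mechanisms or values used in the paper.

    import networkx as nx

    def initial_network(n_agents=30, edge_prob=0.2, seed=1):
        """Initial Erdos-Renyi random interaction network (sizes assumed)."""
        return nx.erdos_renyi_graph(n_agents, edge_prob, seed=seed)

    def cooperation_rate(history):
        """Fraction of a partner's observed moves that were cooperative."""
        return sum(m == "C" for m in history) / len(history) if history else 1.0

    def prune_partners(graph, histories, threshold=0.4):
        """Terminate play with neighbours whose cooperation rate is too low.

        histories[(i, j)] holds the moves agent j has played against agent i.
        A stricter (higher) threshold drops partners more readily; a more
        forgiving (lower) one tolerates occasional defection.
        """
        for i, j in list(graph.edges()):
            if cooperation_rate(histories.get((i, j), [])) < threshold:
                graph.remove_edge(i, j)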

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Table 1. Traditional payoff matrix for the two-player Prisoner’s Dilemma game (Wooldridge, 2013).
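Although Table 1 itself is rendered as an image on the publisher page, the canonical two-player prisoner’s dilemma payoffs satisfy T > R > P > S with 2R > T + S. The widely used values below (T = 5, R = 3, P = 1, S = 0) are an assumption standard in the literature, not a transcription of Table 1.

    # Payoffs as (row player, column player); actions C = cooperate, D = defect.
    # Canonical values assumed; see Table 1 / Wooldridge (2013) for the originals.
    PAYOFFS = {
        ("C", "C"): (3, 3),  # mutual cooperation: reward R
        ("C", "D"): (0, 5),  # sucker's payoff S vs. temptation T
        ("D", "C"): (5, 0),
        ("D", "D"): (1, 1),  # mutual defection: punishment P
    }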


Algorithm 1: mSARSA pseudocode, taken from Feehan & Fatima (2022) and originally adapted from Collenette et al. (2018b).
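The pseudocode itself appears only as an image on the publisher page. As an orientation, the sketch below is a standard tabular SARSA update extended with a simple mood signal: here mood is a bounded running estimate of recent rewards that scales exploration. That coupling is an illustrative assumption; the precise mood dynamics are those given in Collenette et al. (2018b).

    import random
    from collections import defaultdict

    def moody_sarsa_update(Q, mood, s, a, r, s_next, a_next,
                           alpha=0.1, gamma=0.95, mood_rate=0.05):
        """One tabular SARSA step plus an assumed mood update."""
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
        # Mood drifts towards recent rewards, clamped to [-1, 1] (assumption).
        mood = max(-1.0, min(1.0, mood + mood_rate * (r - mood)))
        return mood

    def choose_action(Q, mood, s, actions=("C", "D"), base_epsilon=0.1):
        """Epsilon-greedy selection; lower mood means more exploration
        (an assumed coupling, not the published mSARSA rule)."""
        epsilon = base_epsilon * (1.5 - 0.5 * mood)
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda act: Q[(s, act)])

    # Usage: Q = defaultdict(float); mood = 0.0; then alternate
    # choose_action(...) and moody_sarsa_update(...) each round.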


Figure 1. An example visualization of the initial random network, and the final network at the end of the experimental period, generated with the PyVis library (Perrone et al., 2020). Colours in the graph to the right indicate differing agent game-playing strategies.
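A figure like this can be reproduced in a few lines with networkx and PyVis; the strategy-to-colour mapping below is a hypothetical one for illustration.

    import networkx as nx
    from pyvis.network import Network

    graph = nx.erdos_renyi_graph(30, 0.2, seed=1)

    # Colour nodes by a hypothetical strategy assignment.
    colours = {"mSARSA": "#1f77b4", "TFT": "#2ca02c", "AD": "#d62728"}
    strategies = list(colours)
    for node in graph.nodes():
        graph.nodes[node]["color"] = colours[strategies[node % len(strategies)]]

    net = Network(notebook=False)
    net.from_nx(graph)            # import the networkx graph into PyVis
    net.show("ipd_network.html")  # writes an interactive HTML visualization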


Table 2. Parameter identifiers, their meanings, and the tested values for the following experiments.


Figure 2. Summary graphs for Mean Payoffs attained by mSARSA agents within the final cycle of gameplay, Single Opponent condition. Data is presented as grand means across all agents within that time period, averaged over five simulative episodes, and demonstrates enhanced payoff earning for mSARSA agents under the RB and SB strategies. Asterisks (*) indicate starting values for that variable in the initial three rounds of the whole simulative period, and the solid black horizontal line indicates the baseline average, taken from simulation with no partner switching.
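The grand means plotted in Figures 2–9 can be read as a two-stage average: values are first averaged across all mSARSA agents within the time period, and those per-episode means are then averaged over the five simulation episodes. A minimal numpy sketch, with array shapes assumed for illustration:

    import numpy as np

    # payoffs[episode, agent]: each mSARSA agent's payoff in the final
    # cycle of one episode (5 episodes x 30 agents; shapes assumed).
    payoffs = np.random.default_rng(0).uniform(1.0, 3.0, size=(5, 30))

    per_episode_mean = payoffs.mean(axis=1)  # average across agents
    grand_mean = per_episode_mean.mean()     # then across episodes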


Figure 3. Summary graphs for Mean Payoffs attained by mSARSA agents within the final cycle of gameplay, Multiple Opponent condition. Data is presented as grand means across all agents within that time period, averaged over five simulative episodes, and demonstrates enhanced payoff earning for mSARSA agents under the RB and SB strategies. Asterisks (*) indicate starting values for that variable in the initial three rounds of the whole simulative period, and the solid black horizontal line indicates the baseline average, taken from simulation with no partner switching.


Figure 4. Summary graphs for Mean Cooperations performed (as a proportion of all actions taken) by mSARSA agents within the final cycle of gameplay, Single Opponent condition. Data is presented as grand means across all agents within that time period, averaged over five simulative episodes, and demonstrates an increase in cooperation for mSARSA agents under the RB and SB strategies. Asterisks (*) indicate starting values for that variable in the initial three rounds of the whole simulative period, and the solid black horizontal line indicates the baseline average, taken from simulation with no partner switching.


Figure 5. Summary graphs for Mean Cooperations performed (as a proportion of all actions taken) by mSARSA agents within the final cycle of gameplay, Multiple Opponent condition. Data is presented as grand means across all agents within that time period, averaged over five simulative episodes, and depicts a sharp decrease in cooperation for mSARSA agents under the RB strategy. Asterisks (*) indicate starting values for that variable in the initial three rounds of the whole simulative period, and the solid black horizontal line indicates the baseline average, taken from simulation with no partner switching.


Figure 6. Summary graphs for Mean Normalised Actor Degree Centrality values for mSARSA agents within the final cycle of gameplay, Single Opponent condition. Data is presented as grand means across all agents within that time period, averaged over five simulative episodes, and displays the much greater normalised centrality of mSARSA agents under the SB condition compared with the alternative strategies. Asterisks (*) indicate starting values for that variable in the initial three rounds of the whole simulative period, and the solid black horizontal line indicates the baseline average, taken from simulation with no partner switching.
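Normalised actor degree centrality is conventionally an agent’s degree divided by the maximum possible degree, deg(v)/(n − 1), which keeps values comparable as the network is rewired; networkx computes this directly (an assumed correspondence with the measure plotted here):

    import networkx as nx

    graph = nx.erdos_renyi_graph(30, 0.2, seed=1)
    centrality = nx.degree_centrality(graph)  # node -> degree / (n - 1)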


Figure 7. Summary graphs for Mean Normalised Actor Degree Centrality values for mSARSA agents within the final cycle of gameplay, Multiple Opponent condition. Data is presented as grand means across all agents within that time period, averaged over five simulative episodes, and displays the much greater normalised centrality of mSARSA agents under the SB condition compared with the alternative strategies. Asterisks (*) indicate starting values for that variable in the initial three rounds of the whole simulative period, and the solid black horizontal line indicates the baseline average, taken from simulation with no partner switching.


Figure 8. Summary graphs for Average Mood levels of mSARSA agents within the final cycle of gameplay, Single Opponent condition. Data is presented as grand means across all agents within that time period, averaged over five simulative episodes, and displays a high mSARSA agent mood value throughout. Asterisks (*) indicate starting values for that variable in the initial three rounds of the whole simulative period, and the solid black horizontal line indicates the baseline average, taken from simulation with no partner switching.


Figure 9. Summary graphs for Average Mood levels of mSARSA agents within the final cycle of gameplay, Multiple Opponent condition. Data is presented as grand means across all agents within that time period, averaged over five simulative episodes, and displays a high mSARSA agent mood value throughout, with the exception of the RB strategy. Asterisks (*) indicate starting values for that variable in the initial three rounds of the whole simulative period, and the solid black horizontal line indicates the baseline average, taken from simulation with no partner switching.