Impact Statement
Deriving climate policy is a challenging problem with an expansive solution space. Policymakers have turned to simulation-based approaches to aid their decisions; however, these approaches have traditionally had various limitations. Our work is a preliminary study on improving aspects of these simulation-based approaches with multi-entity agent interactions. This allows for the modelling of stakeholder/nation competition, cooperation, and communication, which are key drivers of much of anthropogenic climate change.
1. Introduction
According to the 2022 Intergovernmental Panel on Climate Change (IPCC) report, “Having the right policies, infrastructure and technology in place to enable changes to our lifestyles and behaviour can result in a 40–70% reduction in greenhouse gas emissions by 2050” (Luz, 2022). The overall findings show that within all sectors, technologies exist that would enable a habitable future, but their adoption may require capital-intensive investments and societal changes. Ambitious policies can have some effect in incentivizing funding towards research or implementation of such technologies, and in enforcing certain behavioral restrictions, but they are not the exclusive driver of changes to lifestyle and behavior. The major policy adjustments needed to combat climate change can therefore be met with strong opposition that prevents uptake (Patterson, 2023), as entrenched societal structures, cultural norms, and vested interests often resist shifts that challenge the status quo. Evidence-based policy is key here, as it not only improves the derived policy but also provides quantifiable results that can reassure critics (Cairney, 2016). However, this can be challenging within the climate domain, as we are experiencing novel events that have never before been tackled. Climate modelling through simulations helps greatly, as it provides evidence of future trajectories and attaches metrics to how future actions can have an impact. With human behavior so inextricably linked to our changing climate, it is key that these simulation models incorporate human factors so as not to exclude anthropogenic effects. Models of this type are known as Integrated Assessment Models (IAMs), which join traditional climate simulations with socio-economic dynamic models (Dowlatabadi, 1995). The UN extensively uses the outputs of IAMs, submitted by researchers across the world, as the backbone of its IPCC reports, providing quantitative insights into the trade-offs and synergies between different policy options and their consequences on socio-economic and/or environmental factors (Van Beek et al., 2020). The UN publicly lists 29 IAMs used for its decision making (UN, 2023), such as the GEMINI-E3 model, which specifically assesses how world climate change policies affect countries at both the micro and macro economic levels (Bernard and Vielle, 2008). As an example, van de Ven et al. (2023) use multiple IAMs, one of which is GEMINI-E3, to analyse how the national policies and pledges made at the COP26 Glasgow conference will affect future CO2 emission trajectories.
IAMs are currently the most used model framework for the socio-environmental domain, traditionally paired with an optimal control problem (e.g., Model Predictive Control (Garcia et al., 1989)) to predict future trajectories towards a desired outcome (Kellett et al., 2019). However, they are not free from their own shortcomings. Some key negatives are their poor representation of behavioral and economic systems, as well as a lack of modelling of decision-making under uncertainty (Farmer et al., 2015; Zhang et al., 2022); for further details, refer to the review of Gambhir et al. (2019). Both can be improved using Agent-Based Model (ABM) approaches (Gambhir et al., 2019). ABMs are a common approach within domains such as financial modeling (Axtell and Farmer, 2022) or transport modeling (Wise et al., 2017), as they allow agent heterogeneity, agent cooperation/competition/communication, entity dynamics that are closer to reality, and more (Axtell and Farmer, 2022). These features improve decision-making over the traditional control problem, but require agent behavioral policies to be defined (rather than learnt) outside of the simulation, which can still struggle under uncertainty (Kelly and Kolstad, 1999; van den Berg et al., 2019). Further improvements on ABMs incorporate trained algorithms to infer the best actions and search the solution space instead of heuristic behavioral policies. This deeper exploration increases an agent’s robustness to simulation uncertainty, which is paramount given the highly changeable simulation dynamics caused by the current climate. In this case, ABMs must be reformulated so that agents receive a signal (e.g., a reward) from the environment after each action taken, which is used to update their behavioral policy.
Reinforcement learning (RL), and especially multi-agent reinforcement learning (MARL), algorithms are widely used within the ABM literature to improve agent behavioral policies (Liang et al., 2020; Sert et al., 2020). We carry this RL theme over, replacing the control problem on top of the IAM environment simulation to increase exploration of this space. Temporally updating agents account for the changeability in the climate simulations caused by their own and other agents’ actions, creating feedback loops that enable reactive behavior to further climate or other agent changes. Another benefit of this MARL approach is that it is simulation agnostic; developments in the field can be applied to any form of multiple-agent simulation, be it IAMs, ABMs, etc., although they would require further training. In this work, we focus on developing and analyzing a different solving strategy for IAMs using RL, instead of the traditional Optimal Control approaches. Rather than replacing solvers, RL-based approaches should be seen as a powerful complementary tool, particularly for modeling complex agent interactions and decision-making under uncertainty, where current methods have limitations. We detail some of the current limitations of the RL approach in this paper, but believe that if these were solved, this method could be added to the suite of tools available to policymakers.
The application of RL and MARL to IAMs is a novel topic with only a handful of previous works. For a single-agent scenario, the work of Strnad et al. (2019) and our previous work in Wolf et al. (2023) applied an RL agent to an IAM which, once trained, was able to generate policy guidance pathways towards a defined “economic and environmental positive future” within the model’s framework. They focused on adapting agent initial states and reward functions to understand the impact these had on the exploration of the IAM, as well as testing the agents under the injection of noise in the environment. This has guided our experiments to ensure a wide range of initializations to understand the exploration of agents. Both Strnad et al. (2019) and our previous work in Wolf et al. (2023) use a singular agent, hence assuming a “unified” earth in which there is a collectively shared goal. In this work, we aim to move one step further and model inter-world interactions, which are the driver for much of anthropogenic climate change and must be understood for many policy decisions (Stone, 2008). Towards this aim we adapt the IAM accordingly, based on ABM extensions of IAMs (Giarola et al., 2022; Nordhaus, 2015; Zhang et al., 2022), in order to implement a multi-agent IAM with MARL.
The only work that has used MARL within the climate policy domain in the literature is that of Zhang et al. (2022), which created the RICE-N model used for the AI for Global Climate Cooperation Challenge. RICE-N is itself an extension of the Regional Integrated model of Climate and the Economy (RICE) developed by Nordhaus (2010), which models 12 global regions. Zhang et al. (2022) invited various domain experts to create and edit interaction and negotiation protocols to achieve the best Pareto Frontier of the socio-economic system variables in the environment. The RICE-N model combines a climate-economic IAM with trade and negotiation dynamics, enabling high levels of interaction between countries/regions (a.k.a. agents). Agents can adjust their savings rates and climate mitigation rates, as well as trade and negotiate with each other at each time step, leading to a large range of potential interactions with each other and the environment (Zhang et al., 2022). Their findings show the potential of MARL-based applications to IAMs, with a large call to action for further research on the topic. RICE-N is an extensive environment that we aim to use for future work; however, we prioritise increased interpretability of the trained agent and, as such, focus on the multi-agent extension of the simpler environment used in Strnad et al. (2019) and Wolf et al. (2023). This simplified environment enables a visual understanding and easier interpretation of the trained agents’ interactions, which are key to analyzing the use of MARL within IAMs.
RL algorithms, however, lack inherent explainability, raising concerns about their trustworthiness for informing real-world policy decisions. Using explainability methods, we can reinforce human confidence by providing insights into how decisions were made and visibility of vulnerabilities (Adadi and Berrada, 2018; Lipton, 2018; Glanois et al., 2021). The explainability methods explored in this work specifically target explaining the model policy through a quantification technique, determining the states at which taking a certain action is crucial, which is critical in applications related to informing climate change policy.
In summary, we attempt to model whether agents prioritizing economic or environmental gain can affect climate policy derivation, as well as simulate, within this framework, whether “climate positive” futures are possible when agents conflict in their prioritizations. We have extended the previous literature’s single-agent IAM to a multi-agent scenario in order to incorporate inter-nation behavior. Utilizing this technology, policies can be derived and enacted in reality, depending on the validity of our underlying IAM. For a single-agent setting, one can fully implement the projected policies, as one can have full agency over the singular agent in reality. However, moving to multiple agents, if we want to follow a similar optimization approach, we must assume we can have control over all agents in reality, which is a heavy assumption in practice. Instead, in this paper, we focus on the setting of having control over one or a subset of the agents, while still modelling all agents learning collectively. This necessitates decentralized training decentralized execution (DTDE) algorithms. We have arbitrarily assumed the learning algorithm and parameters behind each stylized agent, which will directly affect the outcome trajectories; our aim is to highlight the challenges of employing certain existing algorithms. In future work, the other agents in the simulation (that we may not have agency over in reality) could be trained using imitation learning (Hussein et al., 2017) on historical data to represent in-silico versions of real-world entities. MARL can then be used to train an agent to act as a best response to these imitation pre-trained agents within a multiple-agent IAM, providing us with a range of possible future trajectories. These trajectories are again dependent on the validity not only of the IAM, but also of the agent representations of real-world entities. As with any forecasting tool, long-range trajectories lead to large accumulations of error. As an alternative, the algorithm can be further trained as more data about other agents is received. Finally, an inherent challenge with algorithm-derived policy is being able to interpret the underlying solution, especially in edge cases or failure scenarios in which there may not be much prior experience. We have implemented initial interpretability techniques to increase trust in the system for downstream applications.
Our results show that multiple agents working cooperatively toward the same goal are able to achieve the IAM’s “economic and environmental positive future” success state consistently, over 90% of test episodes. Increasing competition between agents reduces this success significantly, which is one of this work’s main conclusions and a major avenue for future work, as in reality competition or mixed motivations are rife. This work serves as an early exploration of the field, positioning the future research required to achieve adoption of the technology.
2. Materials and methods
In this section, we introduce the core themes required for our contribution: the IAM environment, the MARL algorithm and requirements for its application, and the interpretability framework we have used in order to improve insight.
2.1. The IAM environment
The AYS environment, created by Kittel et al. (2021), is a low-complexity IAM made up of social, economic, and environmental variables. These three variables each relate to an ordinary differential equation (ODE) defining the system:

where $ A $ is the excess atmospheric carbon ($ GtC $), $ Y $ the economic output ($ \${\mathrm{yr}}^{-1} $), and $ S $ the renewable knowledge stock ($ GJ $). Each variable is inextricably linked with the others, creating a dynamic cycle. In words:
• $ A $ is proportional to emissions produced from the use of fossil fuels, minus a natural carbon decay out of the atmosphere.
• $ Y $ naturally grows by 3% each time period; however, it is reduced by an economic climate damage function in which increasing $ A $ increases the reduction in $ Y $.
• $ S $ is proportional to the amount of renewable energy produced; however, it has a natural knowledge decay rate over time.
The following equations are required for a deeper analysis of the AYS ODEs, with further numerical parameters listed in Table A1.
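While the precise equations are given in the original works, a minimal sketch of the coupled dynamics, assuming the standard AYS formulation of Kittel et al. (2021) and Strnad et al. (2019) (the symbols and parameter names below follow those works rather than being restated in this section), is:

$$ \dot{A} = E - \frac{A}{\tau_A}, \qquad \dot{Y} = \beta Y - \theta A Y, \qquad \dot{S} = R - \frac{S}{\tau_S}, $$

$$ U = \frac{Y}{\epsilon}, \qquad \Gamma = \frac{1}{1 + (S/\sigma)^{\rho}}, \qquad F = \Gamma U, \qquad R = (1 - \Gamma) U, \qquad E = \frac{F}{\phi}, $$

where $ U $ is the total energy demand, $ \Gamma $ the fossil share of that demand, $ F $ and $ R $ the fossil and renewable energy flows, and $ E $ the resulting emissions. In the multi-agent extension introduced below, the climate damage experienced by each agent is additionally scaled by the parameter $ {\xi}_i $ (Section 3.2).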




While $ A $ and $ Y $ are easily quantifiable with real-life implications, $ S $ is harder to define. Generally, social factors require greater levels of detail than economic or environmental attributes. For instance, Zhang et al. (2022) incorporate many layers of complex socio-economic equations in order to have a functioning model with quantifiable social impact. In the AYS model this is simplified down to a single equation, enabling a much-reduced state space, lower computational requirements, and a more interpretable understanding of agent behavior.
The AYS model has been specifically tuned so that an agent tends towards one of two points:

The green fixed point denotes a “sustainable” future, one where there is no atmospheric carbon but limitless capital and renewable knowledge. The black fixed point, however, denotes a stagnant economy solely dependent on fossil fuels. This is a future we ideally want to avoid. Included with these “drain” points are Planetary Boundaries (PB). The AYS model incorporates one PB, set out in the reports of Steffen et al. (2015) and Rockström et al. (2009), of a maximum excess atmospheric carbon at $ {PB}_A=345\; GtC $, with a social foundation for prosperity from Dearing et al. (2014) defining a minimum yearly economic output at $ {PB}_Y=4\times {10}^{13}\;\${yr}^{-1} $ (Kittel et al., 2021). For brevity throughout this paper, we will refer to these boundaries as the two PBs, although by definition our economic output boundary is in fact a social goal, not a planetary boundary.
To mimic the current state of the Earth within this model, the starting point is defined as $ {s}_{t=0}=\left\{240\; GtC,7\times {10}^{13}\;\${yr}^{-1},5\times {10}^{11}\; GJ\right\} $. Not only is this starting location very close to the PBs, creating a challenging control problem, but from this location the agent will also tend towards the black fixed point if no actions are taken. Figure 1 highlights the AYS environment with the black and green fixed points, and the two translucent grey planes indicating the two PBs. Strnad et al. (2019) and Wolf et al. (2023) incorporate noise into the starting position over episodes to improve training; however, noise is omitted from the $ S $ state variable as this dramatically reduces the agents’ ability to learn. Kittel et al. (2021) and subsequent work normalised the environment between $ 0 $ and $ 1 $ to prevent numerical explosions.

Figure 1. The AYS model state space from Kittel et al. (2021). Translucent grey planes signify the two PBs, and the green and black points denote the fixed point end conditions for a single agent. Whisker lines indicate flow forces within the model that tend towards either of the two fixed points. The colors show the flow to the respective fixed points.
We carry this through, normalizing the states and then incorporating noise, setting the starting state as:

where $ \mathcal{U} $ is the uniform distribution.
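As a minimal sketch of this initialization (the nominal normalized values and the noise width shown are illustrative placeholders, not the exact settings used in our experiments), the sampling could look as follows:

    import numpy as np

    rng = np.random.default_rng(0)

    # Nominal normalized starting state (A, Y, S); placeholder values for illustration.
    A0, Y0, S0 = 0.5, 0.5, 0.5
    NOISE = 0.05  # hypothetical half-width of the uniform perturbation

    def sample_start_state():
        # Uniform noise on A and Y only; S is left noise-free, as described above.
        a = A0 + rng.uniform(-NOISE, NOISE)
        y = Y0 + rng.uniform(-NOISE, NOISE)
        return np.clip(np.array([a, y, S0]), 0.0, 1.0)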
In its current state, the model will tend toward the black fixed point. To avoid this, an agent is able to undertake four actions, described in Kittel et al. (2021):
• 0. Default—Default parameters are used and the agent follows the flow lines without any resistance.
• 1. Degrowth—Economic growth parameter $ \beta $ is halved, fluctuating between 3% and 1.5% growth.
• 2. Energy transition—Break-even renewable knowledge $ \sigma $ is reduced by 31.3%, equivalent to halving the renewable-to-fossil fuel energy cost ratio.
• 3. Both—The two non-default actions are combined within one timestep.
For each integration timestep of the environment, an agent is able to select one of these four options, mimicking an action taken every year (Kittel et al., 2021).
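A minimal sketch of how these discrete actions map onto the model parameters; the baseline values shown here are placeholders (see Table A1 for the actual parameters):

    def apply_action(action, beta=0.03, sigma=4.0e12):
        # Action 0 leaves the default parameters untouched.
        if action in (1, 3):
            beta = beta / 2.0              # Degrowth: halve the economic growth rate
        if action in (2, 3):
            sigma = sigma * (1.0 - 0.313)  # Energy transition: lower the break-even knowledge by 31.3%
        return beta, sigma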
The AYS model in its current format depends on only one agent driving the simulation. We propose an extension enabling simple interactions between multiple agents. Global variables are denoted with no subscript; however, local (to each agent) variables are denoted with a subscript. There is now only one global variable—the excess atmospheric carbon A. Figure 2 visualises the extended multi-agent environment differential equation cycle.

Figure 2. Multi-agent AYS interaction cycle (diagram adapted from Kittel et al., 2021). Block arrows are positive interactions, dashed arrows are negative interactions.
We carry through the same PBs and green fixed point, as they still apply at the global scale. However, the black fixed point is individual to each agent, as Equation 2.6 is dependent on individual agent parameters. We have also normalized emissions on the global scale so that we can work within the same parameters as the original AYS model. This is the simplest approach, allowing us to focus on interacting with the model rather than heavily editing it. We have adjusted the axes in Figure 1 to enable greater insight when dealing with multiple agents. The $ S $ and $ A $ axes are swapped, and the $ S $ variable is then replaced with Equation 2.2 for agent-dependent emissions $ E $. Incorporating emissions visualizes the individual impact each agent has toward the shared $ A $.
We have adopted the JAX framework (Bradbury et al., 2018), converting the environment to be fully vectorized, allowing both inference and environment loops to be run on a GPU. The original environment from Kittel et al. (2021) utilizes an ODE solver to calculate the environment transition at each time step. Due to JAX’s default enforcement of single-precision floats, there is a discrepancy with the ODE solver results of Kittel et al. (2021), Strnad et al. (2019), and Wolf et al. (2023), as their solver used double precision. However, this precision error has been tested over a wide range of states in the environment, with a minimum value of $ 0.000 $ and a maximum of $ 1.055\times {10}^{-5} $. This is a minute discrepancy, so we have assumed parity.
This extended AYS environment can be modeled as a Partially Observable Stochastic Game (POSG) (Shapley, 1953; Hansen et al., 2004), defined by the tuple $ \left\langle N,\mathcal{S},{\mathcal{A}}_1,\dots, {\mathcal{A}}_n,T,{R}_1,\dots, {R}_n,{\mathcal{O}}_1,\dots, {\mathcal{O}}_n,\gamma \right\rangle $, where $ N $ is the number of agents, $ \mathcal{S} $ is the set of all possible environmental states, $ {\mathcal{A}}_1,\dots, {\mathcal{A}}_n $ are the sets of possible actions for each agent, $ T:\mathcal{S}\times {\mathcal{A}}_1\times \dots \times {\mathcal{A}}_n\to \Pi \left(\mathcal{S}\right) $ is the transition distribution, $ {\left\{{R}_i\right\}}_{i=1}^n $ is the set of reward functions, where $ {R}_i:\mathcal{S}\times \mathcal{A}\to \mathbb{R} $ is the reward function for agent $ i $, and $ \gamma $ is the discount factor. Each agent $ i $ has access to its observation $ {o}^i\in {\mathcal{O}}_i $, where $ {\mathcal{O}}_i $ is the observation set of agent $ i $.
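A lightweight sketch of this tuple as a data structure, purely for orientation (the field names and types are illustrative, not those of our codebase):

    from typing import Callable, NamedTuple, Sequence

    class POSG(NamedTuple):
        n_agents: int               # N
        n_actions: Sequence[int]    # |A_i| per agent (four discrete actions here)
        transition: Callable        # T(state, joint_action) -> distribution over next states
        rewards: Sequence[Callable] # R_i(state, action) -> float
        observe: Sequence[Callable] # o_i = O_i(state), each a partial view of the state
        gamma: float                # discount factor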
2.2. MARL algorithm
Focusing on DTDE algorithms, as stated in the introduction, the independent proximal policy optimization (IPPO) algorithm acts as an effective starting point (Schulman et al., 2017; Yu et al., 2022). This relates to $ n $ (the number of agents) versions of PPO-based agents within an environment that do not share parameters between them, and so are fully independent. Each PPO agent (Schulman et al., 2017) has no awareness of other agents in the system, and since we are in a POSG, it only has access to its observations of the environment. The state and observation spaces are vectors of values $ \in \left[0,1\right] $ relating to the three AYS variables: $ A $ is global, but $ Y $ and $ S $ are independent for each agent, leading to the partially observable nature. The action space contains values from the discrete set $ \left\{0,1,2,3\right\} $, relating to the actions in List 2.1. Our previous work in Wolf et al. (2023) found PPO to achieve impressive results, which further supports its use within our experiments; the hyperparameters used are listed in Table B1. Rewards are derived from the “Planetary Boundary” (PB) reward function, maximizing the Euclidean distance between the agent and the two PBs, with a lower bound of 0 on the $ S $ parameter. If a boundary is crossed, the reward equals 0:

where $ o $ relates to an individual agent’s observations of the environment. As an agent aims to maximise its reward, it looks to achieve a point as far away from the PBs as possible, thus tending toward the green fixed point. Using the PBs rather than the limits of the simulation incentivises the agent to avoid the PBs. $ {R}_{PB} $ is an idealised reward function that provides an agent with an easily quantifiable signal of how to navigate toward the “green fixed point.” This selection is fairly arbitrary and can be adapted under a domain specialist’s guidance to provide a more realistic interpretation of behavior within the IAM. For further experiments, we look at competitive agents and thus need two new reward functions:


where $ {o}_A $ is the agent’s observation of the $ A $ variable, $ {o}_Y $ is the agent’s observation of the $ Y $ variable, and $ {PB}_Y $ is the planetary boundary (social goal) for the $ Y $ variable. The former directly rewards an agent on the $ A $ variable, the excess atmospheric carbon ($ GtC $), relating to an entity that prioritises environmental degradation. The latter rewards an agent for maximizing its distance to the $ Y $ planetary boundary, the economic output ($ \${\mathrm{yr}}^{-1} $) social goal, and can be seen as an entity that prioritises economic gain over environmental impact.
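The sketch below gives one reading of these three reward signals in normalized observation space; the functional forms and boundary values are illustrative assumptions rather than the precise implementation:

    import numpy as np

    PB_A = 0.6  # normalized atmospheric-carbon boundary (placeholder value)
    PB_Y = 0.4  # normalized economic-output social goal (placeholder value)

    def r_pb(obs):
        # Euclidean distance from the two PBs (with S bounded below by 0); 0 once a boundary is crossed.
        a, y, s = obs
        if a > PB_A or y < PB_Y:
            return 0.0
        return float(np.sqrt((PB_A - a) ** 2 + (y - PB_Y) ** 2 + s ** 2))

    def r_max_a(obs):
        # Adversarial signal: reward proportional to the excess atmospheric carbon.
        return float(obs[0])

    def r_max_y(obs):
        # Economically motivated signal: distance above the economic-output boundary.
        return float(max(obs[1] - PB_Y, 0.0))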
2.3. Critical states
Explainability and interpretability in RL are open questions, with most methods focusing on explaining the neural networks that are used as function approximators in deep RL (Heuillet et al., 2021). There are very few methods that are specific to RL algorithms, and even fewer that are usable rather than purely conceptual (Heuillet et al., 2021). Critical states, based on Huang et al. (2018), serve as a form of explainability specific to RL for the model policy. This work elaborates that there is a small set of specific states (critical states) in an agent’s trajectory in which it greatly matters which action the agent takes (Huang et al., 2018). In theory, certain states lead to a large difference between policy outputs over the set of actions. Generally, one action will have a much larger policy value than the rest, as the agent is more sure that this is the only viable action in that state. We proceed with this method of explainability, as it is crucial to know which locations in a trajectory correspond to the most vital actions for actionable climate policies. In more concrete terms, the set of critical states $ {\mathcal{C}}_{\pi } $ is identified as those states with a high logit difference, calculated from the outputs of the neural network representation of the agent’s policy, mathematically formalised as:

where $ {\pi}_{\theta}\left(s,a\right) $ represents the logits of the policy distribution (as output by the actor network), $ t $ is a critical state threshold, and $ \mathcal{A} $ is the set of potential actions. A requirement is that entropy regularization is used in the policy objective—without it, policies can collapse prematurely to almost deterministic states, signifying that almost all states are critical (Huang et al., 2018). We have included entropy regularization in our implementation of PPO, ensuring the policy acts purposefully in critical states and more randomly in others (Huang et al., 2018). We expand on the idea of critical states by plotting the logit differences across 1000 sampled trajectories (post-training) to analyse how “critical” each state is, rather than defining a critical state threshold. The value of this threshold is arbitrary, and we prefer to highlight the full range over states, although one could consider states with a logit difference over $ 0.5 $ to be the critical states. In particular, we ask: are there locations in the trajectories that the policy finds more critical than others, and are these critical areas distributed in a way that is interpretable with regard to the agent’s behavior? To some extent, this can be loosely interpreted as policy uncertainty, as critical states are those in which the policy has a higher logit difference and is thus more certain of the correct action to take. However, we try to avoid using this term, as this method does not provide an exact uncertainty quantification of the policy.
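A minimal sketch of this score, assuming the max-minus-mean logit gap used by Huang et al. (2018) to identify critical states (the exact form here is our interpretation):

    import numpy as np

    def logit_difference(logits):
        # Gap between the preferred action's logit and the mean logit over all actions.
        logits = np.asarray(logits, dtype=np.float64)
        return float(logits.max() - logits.mean())

    def critical_states(trajectory_logits, threshold=0.5):
        # Score every state along a trajectory and flag those above the (arbitrary) threshold.
        scores = [logit_difference(step) for step in trajectory_logits]
        return [i for i, s in enumerate(scores) if s > threshold], scores

In our plots we visualise the raw per-state scores along each trajectory rather than applying the threshold.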
3. Experimental results
Our overarching ambition is toward applicable and deployable systems that guide climate policy. While this is an expansive open question that cannot be fully answered in this paper, we begin by experimenting on the simplest cases and slowly increase complexity. This lines up the following research questions that we tackle within this work:
• RQ1—Assuming agents are homogeneous (having the same starting state and thus the same initial IAM variables), can they achieve an “economic and environmental positive future” when acting towards a shared goal through having the same reward functions (a.k.a. interacting cooperatively)?
• RQ2—Relaxing agent homogeneity, are cooperative agents still able to achieve a successful future at a similar rate?
• RQ3—Finally, does introducing competition between agents, for example by having reward functions that oppose each other to discourage cooperation, significantly hinder the strategic interactions’ convergence on reaching the green fixed point?
Toward RQ1, our first experiment incorporates increasing numbers of homogeneous cooperative agents into the AYS environment. For RQ2, we repeat the same experiments as RQ1 but allow agents to start in varying locations relative to each other, initializing an agent’s state at different AYS variables, thus mimicking the variability seen between entities/nations in reality. Furthering agent heterogeneity, we also vary the agent-independent values for climate damages $ {\xi}_i $, mimicking that not all agents experience the same damaging effects as the climate degrades. Finally, for RQ3, we reduce the number of agents in our environment to two to compare varying reward functions and their effects on an agent’s ability to reach the green fixed point. We then extend this to three agents, highlighting that the trend continues as agent numbers increase. By keeping the number of agents low, as well as incorporating the critical states visualization, we gain greater insight into the agents’ action decisions.
A key theme within our research questions is the ability of agents to reach the green fixed point. To calculate this, we compute a global $ AYS $ value, using the global $ A $ quantity and summing the agents’ individual $ {Y}_i $ and $ {S}_i $ variables, which is checked to be within the vicinity of the green fixed point. We define the win rate as the percentage of times that this calculated global state reaches the green fixed point over a set number of episodes. This definition of success for this environment is not a Pareto Frontier and instead stakes claims on what is negative or positive; as such, we focus on the environmental positives. For clarity, an episode is the collection of timesteps between an initial state and a terminal state, be that due to reaching the green fixed point, breaching a planetary boundary, or reaching the fixed maximum number of steps per episode. We run all experiments for six seeds and plot the average of these seeds with translucent standard error bounds.
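A sketch of this success check and win-rate computation; the green fixed point location and tolerance radius are passed in rather than asserted here, since their normalized values are not restated in this section:

    import numpy as np

    def reached_green_fixed_point(global_A, agent_Ys, agent_Ss, green_fixed_point, tolerance):
        # Combine the shared A with the summed per-agent Y and S, then test proximity to the target ball.
        global_state = np.array([global_A, np.sum(agent_Ys), np.sum(agent_Ss)])
        return bool(np.linalg.norm(global_state - np.asarray(green_fixed_point)) < tolerance)

    def win_rate(episode_successes):
        # Percentage of test episodes whose final global state reached the green fixed point.
        return 100.0 * float(np.mean(episode_successes))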
3.1. Experiment 1—homogeneous agents
We begin by instantiating homogeneous agents, that is, agents that have the same initial AYS variables. This relates to all agents starting in the same location. Agents here have the same objective towards a common goal, each following the $ {R}_{PB} $ reward function. The greater the distance to the PBs, the greater the reward. Agents are not predefined with a top-down constraint that they must cooperate; instead, by using a reward with a shared goal, we show the emergence of cooperation.
In Figure 3, for the single-agent case, IPPO (which reduces to PPO for one agent) quickly learns a consistent policy, as it avoids any complexity from the non-stationarity of the transition function caused by other agents. Increasing the number of agents (ranging from 2 to 8 agents together) increases the training time taken until a consistent policy is reached, which can be attributed to the increasing complexity stemming from the non-stationarity and interactions between agents.

Figure 3. Homogeneous agents’ win rates. Each experiment is run over six seeds, with the line corresponding to the mean win rate with translucent standard error bounds. Num agents relates to the number of agents in the simulation.
Figure 4 shows (with only two seeds, leading to a larger variance during the middle of training) that, with enough time steps, a similar win rate is achieved between agents. We have not run the experiments in Figure 3 to a stable state for large numbers of agents due to the computational resources required, and instead focus on a smaller total of agents (and fewer random seeds) for greater insight. For a singular agent, the win rate after $ 1.2\times {10}^8 $ steps is $ 87.740\%\pm 8.225 $. For six and eight agents after $ 3\times {10}^8 $ steps, the win rates are $ 90.935\pm 0.010 $ and $ 90.143\pm 0.035 $, respectively. The lower standard deviations here stem from the policy convergence gained from the much longer training runs. Answering RQ1, it is clear that agents are able to reach the green fixed point consistently, independently of the number of agents. Cooperation thus emerges between agents, with the shared reward function of a common goal being the only predefined signal towards cooperating.

Figure 4. Homogeneous agents’ win rates for a longer range of training steps. These experiments are only run over two seeds due to computational constraints.
3.2. Experiment 2—heterogeneous agents
Increasing the applicability, we now look at heterogeneous, but still cooperative, agents. Heterogeneity is very important in the climate domain, especially when dealing with anthropogenic factors, as it can apply to spatial variability, temporal variability, and variability in socio-economic impacts, among others (Madani, 2013). The various sources of heterogeneity between agents in the AYS MARL environment are: AYS variables, AYS parameters, reward functions, and the MARL algorithm.
Varying the AYS variables and parameters can be seen as representing different traits of a representative agent; for example, a larger initial $ Y $ may indicate an economically wealthy entity. Similarly, changing the economic growth parameter $ \beta $ again represents an entity with increased economic function. There are limitless combinations one could make from these for experimentation. Values could also be based on real-world data to provide an in silico entity representation or to verify results on a well-known case study. Reward functions represent what an entity may “value” or be looking to optimize for; changing these between agents can lead to conflicting behavior, as they may directly oppose one another. Finally, we can represent each agent with a different MARL algorithm, since we are constrained to the use of DTDE algorithms, which have no overarching centralized controller. For example, we could represent certain agents with less capable algorithms to understand the effect on the resulting equilibrium. We do not adjust the MARL algorithm, using PPO for all agents, as we want to understand some of the limitations of RL-specific algorithms being applied to MARL in this domain. Instead, we vary the AYS variables and parameters, with our subsequent experiments adjusting the reward function. Agents can start at any location within the predefined uniform distribution of starting points, with a new starting point sampled at each episode.
Figure 5 shows that scaling up the number of agents here has a larger impact on the win rate due to the more complex heterogeneous nature of the agents. Still, with enough timesteps, agents reach a consistent policy, as seen in Figure 6. Win rates for six and eight agents after $ 6\times {10}^8 $ steps are $ 93.007\pm 0.054 $ and $ 94.121\pm 0.067 $, respectively, closely matching the results found in Experiment 1.

Figure 5. Heterogeneous agents’ win rates. We have omitted the single-agent scenario as these results match between homogeneous and heterogeneous starting points. Each experiment is run over six seeds, with the line corresponding to the mean win rate with translucent standard error bounds.

Figure 6. Heterogeneous agents’ win rates for a longer range of training steps. These experiments are only run over two seeds due to computational constraints.
Multiple heterogeneous agents acting toward the same goal have similar performance to a singular agent, although they require many more episodes to converge due to the increased complexity. Here we show that RQ2 is achievable without any loss of performance.
Furthering these experiments, we also look at heterogeneity in the AYS parameters, specifically scaling the agent-independent climate damage $ {\xi}_i $. We carry over the same heterogeneous starting point variation as in the previous experiment and focus only on two agents together. In reality, negative environmental effects such as extreme weather scenarios or rising water levels that impact economic output may affect certain regions more than others (Dellink et al., 2019). In the worst scenarios, the biggest polluters may rarely see the negative climate effects, which are instead fully experienced at other geographical locations. To naively model this, we scale the climate damage parameter $ {\xi}_i $ between $ 0 $ and $ 1 $; the former is an extreme case where the economy is not affected by $ A $, and the latter recovers the usual AYS ODE dynamics.
Figure 7 indicates that as an agent is impacted less by climate damages, that is, as $ {\xi}_i $ tends towards 0, it gains more independent return (total individual reward over an episode) than the other agent that has $ {\xi}_i=1 $. Importantly, though, this comes at the cost of globally reaching the green fixed point, even with cooperative reward functions, as seen in Figure 8. As $ {\xi}_i $ reduces in the AYS ODE interaction cycle (Figure 2), $ Y $ becomes less affected by the value of $ A $, which has knock-on effects in further increasing an agent’s own emissions $ E $. However, an agent therefore also receives less signal in its observations about how the $ A $ variable affects the $ Y $ variable, and how this all relates to its own actions and reward function. Therefore, these agents seem to prefer maximizing $ Y $, as they are unaware of the impact this has on $ A $. In Figure 9, one can see how the trajectories evolve from a two-agent scenario with both agents following $ {R}_{PB} $ and having $ {\xi}_i $ of $ 1 $, to very different pathways when $ {\xi}_2 $ is $ 0.25 $ for Agent $ 2 $. Interestingly, the trajectories for Agent $ 2 $ in Figure 9d are very similar to those of an agent following the $ {R}_{maxY} $ reward function, with example trajectories found in Figure 13c and d, even though the agent is still following $ {R}_{PB} $. Without staking too many claims in reality, an agent that has minimal understanding of how the actions it takes impact the environmental variable on a global scale will be unable to enact the desired actions to reach the “climate positive” future.

Figure 7. Returns for each agent for the climate damages parameter $ {\xi}_i $ experiments. Agent $ 1 $ episode returns are on the left, which always has $ {\xi}_1=0 $. Agent $ 2 $ episode returns are on the right, where $ {\xi}_2 $ varies between $ 0 $ and $ 1 $ as per the figure legend.

Figure 8. Overall win rates for a two-agent scenario in which both agents follow the $ {R}_{PB} $ reward function but have different climate damage parameters $ {\xi}_i $ for each experiment. Six combinations of $ {\xi}_i $ are tested.

Figure 9. Trajectory plots for two cooperative agents, both following the $ {R}_{PB} $ reward function. Agent $ 1 $ has red trajectories, and Agent $ 2 $ has green. The variation in color for each agent signifies trajectories from different episodes. We have visualised a sample of $ 1000 $ episodes (trajectories) to indicate the distribution of trajectories. Each grid row relates to experiments that contain both agents together. In the upper row, both agents experience the same climate damages, with $ {\xi}_i=1 $ for each. In the lower row, Agent $ 1 $ has $ {\xi}_1=1 $ and Agent $ 2 $ has $ {\xi}_2=0.25 $. The green fixed point is situated on the lowest vertex of the figures, where $ E=0 $, $ Y=\infty $, and $ A=0 $. The distribution of starting states is near the middle of the figures, where $ E\approx 10 $, $ Y\approx 60 $, and $ A\approx 250 $.
3.3. Experiment 3—competitive agents
We have shown that agents are able to consistently reach the green fixed point when working together. However, how will they fare when dealing with more competitive agents, e.g., ones that prioritise capital over detrimental environmental effects? Or, in an extreme (yet slightly unrealistic) case, agents that only care to maximize the excess carbon in the atmosphere? For this, we use the two other reward functions: $ {R}_{maxY} $ and $ {R}_{maxA} $. The former rewards an agent for maximizing the distance to the $ Y $ planetary boundary, the economic output ($ \${\mathrm{yr}}^{-1} $) social goal. The latter rewards an agent for maximizing the $ A $ variable, the excess atmospheric carbon ($ GtC $). We also assume that agents start in heterogeneous locations, as our experiments have shown this does not negatively impact the win rate. The choice of $ {R}_{maxA} $ may be a peculiar one, but we have included these experiments to show more adversarial behavior than can be expected with $ {R}_{maxY} $. The definition of $ {R}_{PB} $ in some ways includes maximizing $ Y $, or at least ensuring that the agent avoids the $ Y $ social goal boundary, and as such $ {R}_{maxY} $ can be seen as a mixed-motivation reward function, whereas $ {R}_{maxA} $ greatly opposes the aims of $ {R}_{PB} $, leaning towards more competition. This choice helps us understand the performance of the IPPO algorithm in these more challenging competitive scenarios, which will arise in future applications.
As seen in previous experiments and in Figure 10, two agents following $ {R}_{PB} $ consistently reach the green fixed point. Interestingly, agents following $ {R}_{maxY} $ are also able to reach the green fixed point, although at a much-reduced rate. This is due to the AYS environment, wherein the $ Y $ variable is directly driven by the atmospheric carbon $ A $, greatly incentivizing an agent to reduce $ A $ in order to maximise $ Y $. We note that in a singular-agent setup, an agent following $ {R}_{maxY} $ is never able to reach the green fixed point; it in some ways requires the guidance of an agent following $ {R}_{PB} $, although it greatly impacts the overall success. Imbuing an agent with a more explicit understanding of the impact of both $ A $ and $ Y $, through the reward function, is necessary to reach the desired goal.

Figure 10. Experiments combining reward types for a two-agent scenario; the first agent always follows the $ {R}_{PB} $ reward function. Each run has two agents relating to the respective labeled reward type.
However, as we would unfortunately expect, an agent that only aims to maximise its carbon output (following $ {R}_{maxA} $) overrules any potential climate-positive actions from the agent following $ {R}_{PB} $. This clearly highlights the need for cooperation, or at least for ways to shape “opponents’” actions to align more closely with the desired behavior.
In Figure 11, a similar trend carries over with an increasing number of agents. Agents that work together on a shared goal succeed, but agents that have different incentives fail, although combinations of a majority of $ {R}_{PB} $ with $ {R}_{maxY} $ have the potential to succeed, albeit at a much-reduced rate. Our results confirm RQ3—increasing competition reduces the ability of agents to reach the green fixed point, highlighting the need for algorithms with increased opponent awareness over IPPO to improve performance.

Figure 11. Experiments combining reward types for a three-agent scenario; the first agent always follows the $ {R}_{PB} $ reward function. Each run has three agents relating to the respective labeled reward type.
In RL, defining the reward can be tricky, as agents can “hack” these values and act in non-predictable ways (Skalse et al., 2022; Laidlaw et al., 2024). Due to the possibility of early termination from reaching goal states or boundary conditions before the maximum number of time steps, if agents are not correctly given potential future rewards, they can be incentivized to take “longer” in the environment, as there are no temporal negatives. This was clear in some competitive environments where, without the notion of discounted future rewards, agents following $ {R}_{PB} $ would receive more reward if they never reached the green fixed point but slowed down the impact of an agent following $ {R}_{maxA} $. Therefore, we use discounted rewards within this environment. Correctly defining rewards is relatively easy here, but a key question for future applications is how to quantify rewards.
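The discounting itself is the standard geometric form; as a small sketch of the return each agent optimizes, which removes the incentive to stretch episodes out (the discount factor shown is illustrative):

    def discounted_return(rewards, gamma=0.99):
        # G_0 = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        return g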
3.4. Experiment 4—critical states
Finally, we look into interpreting the behavior of the agents and attempting to understand failure points. To this end, we visualise how “critical” states are along a sample of trajectories of trained agents in Figures 12 and 13. Images in the left column represent actions taken at certain points in the trajectory, with images in the right column highlighting the logit difference over actions of the agent’s policy. Darker colors relate to areas in which the policy has a lower logit difference, with the difference increasing as the color lightens. The color gradient scale is normalised over agents. Agents are separated over rows in the multi-grid figure, each with their own respective color map, and the agent’s reward function is stated in the subfigure caption. To enable a margin of tolerance for reaching the green fixed point, it is defined in the simulation as a ball instead of a singular point. In each critical states figure, the number of displayed agents correlates with the number of agents that were in the simulation—we have not, for example, sampled two agents from a 10-agent simulation.

Figure 12. Critical state plots for two cooperative agents, both following the $ {R}_{PB} $ reward function. Figures on the left-hand side represent the actions taken at certain points along the trajectory; see List 2.1 for details of all potential actions. Figures on the right-hand side indicate scales of logit difference in the agent’s policy action distribution, defined as the Logit Diff. Darker colors relate to lower logit difference, with the color gradation normalised over agents.

Figure 13. Critical states for two competitive agents, where the agents follow the $ {R}_{PB} $ and $ {R}_{maxY} $ reward functions, respectively.
To evaluate these trajectory plots and the quality of the explanations that they produce, we establish a set of evaluation metrics consisting of explanation consistency and fidelity, adapted from Islam et al. (2020) and defined as follows:
• Consistency: How consistent are the plots (explanations) between the agents in an experiment?
• Fidelity: Are the plots (explanations) logically aligned with the behavior of the agents?
In the context of our experiments, we assess consistency between two heterogeneous agents in cooperative and competitive settings—we note that the same can be done for homogeneous agents as well. The metric fidelity more specifically refers to whether the plots accurately represent the nature of the attributes contributing to agent behavior, such as reward type and location in the trajectory (and, accordingly, prior knowledge).
With the two-agent experiments, it is clear that when agents cooperate (i.e., both follow $ {R}_{PB} $), the simulation as a whole consistently reaches the green fixed point, although different trajectories are also able to succeed. For Agent 1, as seen in Figure 12b, there is clearly a high logit difference at the start and end of the simulation, signifying the most critical states, in which the agent consistently takes the same action. The lowest logit difference occurs during the middle phase, as the agent passes close to the economic planetary boundary. On the other hand, Figure 12d shows an agent with the same reward function having a similar difference at the beginning but a much lower logit difference towards the end, even though it still takes a consistent action, as seen in Figure 12c. This emphasizes the importance of pairing the consistent action taken with the logit difference for each timestep.
This indicates a relatively high level of explanation consistency, as the logit difference for both agents is similar until they start to reach the green fixed point—as such, they also have critical states at similar points in their respective trajectories. With regard to explanation fidelity, it is also logical that both agents would experience areas of critical states near the start (corresponding with the action that takes both non-default actions) and then move to lower logit difference levels, as, without prior knowledge, the immediate ideal action of the $ {R}_{PB} $ agent is to move away from the planetary boundaries.
For competitive agents, we focus on the $ {R}_{PB} $ and $ {R}_{maxY} $ two-agent experiments in Figure 13, since they show the greatest insight. Performance is much worse, with only one or two trajectories reaching the green fixed point. This matches the results found in Figure 10, which show a win rate of $ 7\% $, similarly matching the ratio of successful trajectories in Figure 13. However, it is clear that the agent following $ {R}_{maxY} $ consistently chooses the energy transition action so that it can maximise its reward. On the other hand, the agent following $ {R}_{PB} $ is unable to have enough effect on the other agent and the environment to reach the green fixed point. On the rare occasions that it does reach the green fixed point, it is confident in its action selection.
This experiment resulted in high explanation consistency as well, with both agents experiencing similar logit difference levels throughout their trajectories. The exception to this occurs in the few trajectories that reach the green fixed point, where the $ {R}_{PB} $ agent experiences a much higher logit difference than the $ {R}_{maxY} $ agent. In terms of explanation fidelity between the actions taken and the logit differences, this also makes sense—while the $ {R}_{PB} $ agent learns all of the environmental attributes, the $ {R}_{maxY} $ agent is focused on maximizing the distance from the economic output planetary boundary.
4. Discussion
It is clear that when agents are constrained to have the same objective, working toward a common “climate positive” goal, the green fixed point is consistently reached. This is a promising result, but it does not carry over once competition is introduced. From visualizing the critical states figures, agents have a lower logit difference when dealing with other agents with differing reward functions, but show a similar trend even when dealing with others that are cooperating. Combining this insight with the fact that we are using IPPO, agents have no explicit understanding of the other agents in the environment. Within basic DTDE methods (like IPPO), other agents are modeled as part of the environment, and without an understanding of the consequences of their policies, their actions exacerbate the stochasticity of the environment in the observations of the ego agent. For Centralised Training Decentralised Execution (CTDE) algorithms, there exists a centralised policy between agents during training that reduces the non-stationarity in the transition distribution. Tackling non-stationarity in DTDE algorithms is an open question, with a few types of well-researched approaches (Papoudakis et al., 2019). One of these is opponent modeling (Albrecht and Stone, 2018), where approximate policies of other agents are learnt through historical data and can be used to reduce the effect of non-stationarity, dependent on the validity of the opponent models. However, these methods can often be sample inefficient and do not explicitly guide exploration to gain an improved understanding of the other agents’ desires. Another branch of MARL research looks into opponent shaping (Lu et al., 2022): how an ego agent can shape the behavior of other agents, through its own actions, to align more closely with its goals. This approach would carry great weight in this domain, as an agent could attempt to steer all agents in the IAM environment towards a “climate positive future” even under reward functions that directly oppose this trajectory.
More intricate algorithms, however, raise issues with scaling, a primary issue in MARL due to the exponential growth of agent interactions (Christianos et al., 2021). There is generally an inverse relationship between algorithm capability (e.g., opponent awareness or more principled exploration) and scalability. Similarly, as IAM complexity increases, so most certainly will the MARL state and action spaces, which also hinders scalability. This is a large open question in MARL, with many techniques focusing on graph-based approaches to balance local and global interactions (Nayak et al., 2023; Ma et al., 2024). In the application to IAMs, we could also take different viewpoints. One looks at highly abstracted global-level IAMs, e.g., continents/countries on a world model. We therefore have smaller agent numbers and can focus on more capable algorithms for the more complex global IAMs; computation more easily covers the large state and action spaces required for complex environments, as the number of agents (and agent interactions) is lower. We mentioned in the introduction how this could be expanded by using imitation learning on historic data to create representative world states to train against. Another viewpoint looks at larger numbers of agents (e.g., in the thousands and more) with local-scale IAMs, but at the cost (at this current stage) of trading agent algorithm capability for scalability, although there is extensive work in this vein, such as in multi-agent driving simulations (Kazemkhani et al., 2024) and massively multiplayer online games (Suarez et al., 2019). With current work on creating a Digital Twin of Earth (Bauer et al., 2021) that aims to incorporate a wide range of in silico human activity, it is clear that scalable agents are needed.
As these simulations can be used for evidence-based policy, ensuring their validity is important, but how do we assess their uncertainty? Comparing critical states between similar reward functions shows the variability even between agents that appear to follow similar trajectory planning within the set environment, highlighting the poor representation of the policies’ uncertainty. The concept of explainability itself has been heavily debated in the literature—some believe that rather than attempting to explain black-box models, we should instead just use more intrinsically explainable and transparent models, as explanations can be inconsistent or misleading (Rudin, 2019). In the context of arguments resembling this one, the pitfalls of explainability methods largely fall on post-hoc methods. Potential drawbacks of post-hoc explanations include explanations that are inconsistent depending on the method used to generate them, as well as explanations that do not make sense to humans (Li et al., 2018).
In addition, most post-hoc explainability methods do not provide a fully explainable picture of the model—with the critical states experiment that we performed in this paper, the plots resemble “summary statistic”-like results that we can interpret and use to generate explanations for model policy (Rudin, 2019). But we question whether this truly enhances the explainability of a model and correctly quantifies the uncertainty, prompting the question of whether we can deem these explanations to be accurate when they fail to encompass the entire model. While there is potential for the application of these explainability methods, further work is required here, such as exploring more intrinsically explainable methods.
5. Conclusion
This paper presents a step toward creating actionable and deployable systems to guide climate policy. Extending previous work that focused on a single-agent scenario, we have found that, within the bounds of cooperation and the confines of this environment, multiple agents are consistently able to reach a “climate positive” future. This ability to craft policy trajectories may help inform policymakers of the potential outcomes of prospective plans, with explicit results that can be used as evidence. As is key with any technology used for policy, failure modes and uncertainty must be quantified so that results can be used. To this end, we applied the critical states experiments to gain insight into the policy of the RL model. However, there are strong limitations to this current MARL and interpretability approach, and as such, we have posited various future directions that must be researched if we are to use this technology to guide real policy. A key issue with MARL, ABM, or Optimal Control approaches to exploring IAMs is scalability, an inherent challenge with MARL itself. While we have no concrete answer to this question, we guide our future work towards exploring scalable techniques that still ensure deep exploration of inter-agent behavior. Focusing on global-scale, low-agent-number IAMs, however, this technology could currently be used with data-driven stylized world regions to forecast potential policy or action pathways towards a desired outcome. We hope this is a promising start toward the use of algorithms to support politically guiding the earth’s trajectory onto a habitable and stable future.
Author contribution
Conceptualisation: J.R.-J., F.T., M.P.-O.; Formal analysis: J.R.-J., F.T., M.P.-O.; Investigation: J.R.-J., F.T., M.P.-O.; Methodology: J.R.-J., F.T., M.P.-O.; Software: J.R.-J., F.T.; Supervision: M.P.-O.; Validation: J.R.-J.; Visualisation: J.R.-J., Writing–original draft: J.R.-J., F.T.; Writing–review and editing: J.R.-J., F.T., M.P.-O.
Competing interest
The authors declare none.
Data availability statement
Our code is publicly available on GitHub at https://github.com/JamesR-J/multi_agent_climate_pathways.
Funding statement
James Rudd-Jones is supported by grants from the UK EPSRC-DTP (Award 2868483).
Appendix A. Further AYS environment details
Table A1. AYS numerical parameters (Kittel et al., 2021)

Appendix B. Hyperparameters
Table B1. Table of training hyperparameters

$ {}^{LR} $ With annealed learning rate.