Impact Statement
Deriving climate policy is a challenging problem with an expansive solution space. Policymakers have turned to simulation-based approaches to aid their decisions; however, these approaches have traditionally had various limitations. Our work is a preliminary study on improving aspects of these simulation-based approaches with multi-entity agent interactions. This allows for the modelling of stakeholder/nation competition, cooperation, and communication, which are key drivers of much of anthropogenic climate change.
1. Introduction
According to the 2022 Intergovernmental Panel on Climate Change (IPCC) report, “Having the right policies, infrastructure and technology in place to enable changes to our lifestyles and behaviour can result in a 40–70% reduction in greenhouse gas emissions by 2050” (Luz, 2022). The overall findings show that within all sectors, technologies exist that would enable a habitable future, but their adoption may require capital-intensive investments and societal changes. Ambitious policies can have some effect in incentivizing funding towards research or implementation of such technologies, and in enforcing certain behavioral restrictions, but they are not the exclusive driver of changes to lifestyle and behavior. The major policy adjustments needed to combat climate change can therefore be met with strong opposition that prevents uptake (Patterson, 2023), as entrenched societal structures, cultural norms, and vested interests often resist shifts that challenge the status quo. Evidence-based policy is key here, as it not only improves the derived policy but also provides quantifiable results that can reassure critics (Cairney, 2016). However, this can be challenging within the climate domain, as we are experiencing novel events that have never before been tackled. Climate modelling through simulations helps greatly, as it provides evidence of future trajectories and attaches metrics to how future actions can have an impact. With human behavior so inextricably linked to our changing climate, it is key that these simulation models incorporate human factors so as not to exclude anthropogenic effects. Models of this type are known as Integrated Assessment Models (IAMs), which join traditional climate simulations with socio-economic dynamic models (Dowlatabadi, 1995). The UN extensively uses the outputs of IAMs, submitted by researchers across the world, as the backbone of its IPCC reports, providing quantitative insights into the trade-offs and synergies between different policy options and their consequences on socio-economic and/or environmental factors (Van Beek et al., 2020). The UN publicly lists 29 IAMs used for its decision making (UN, 2023), such as the GEMINI-E3 model, which specifically assesses how world climate change policies affect countries at both the micro and macro economic levels (Bernard and Vielle, 2008). As an example, van de Ven et al. (2023) use multiple IAMs, one of which is GEMINI-E3, to analyse how the national policies and pledges made at the COP26 Glasgow conference will affect future CO2 emission trajectories.
IAMs are currently the most used model framework for the socio-environmental domain, traditionally paired with an optimal control problem (e.g., Model Predictive Control (Garcia et al., 1989)) to predict future trajectories towards a desired outcome (Kellett et al., 2019). However, they are not free from their own shortcomings. Some key negatives are their poor representation of behavioral and economic systems, as well as a lack of modelling of decision-making under uncertainty (Farmer et al., 2015; Zhang et al., 2022); for further details, refer to the review of Gambhir et al. (2019). Both can be improved using Agent-Based Model (ABM) approaches (Gambhir et al., 2019). ABMs are a common approach within domains such as financial modeling (Axtell and Farmer, 2022) or transport modeling (Wise et al., 2017), as they allow agent heterogeneity, agent cooperation/competition/communication, entity dynamics that are closer to reality, and more (Axtell and Farmer, 2022). These features improve decision-making over the traditional control problem, but require agent behavioral policies to be defined (rather than learnt) outside of the simulation, which can still struggle under uncertainty (Kelly and Kolstad, 1999; van den Berg et al., 2019). Further improvements on ABMs incorporate trained algorithms to infer the best actions and search the solution space instead of heuristic behavioral policies. This deeper exploration increases an agent’s robustness to simulation uncertainty, which is paramount given the highly changeable simulation dynamics caused by the current climate. In this case, ABMs must be reformulated so that agents receive a signal (e.g., a reward) from the environment after each action taken, which is used to update their behavioral policy.
Reinforcement learning (RL), and especially multi-agent reinforcement learning (MARL), algorithms are widely used within the ABM literature to improve agent behavioral policies (Liang et al., 2020; Sert et al., 2020). We carry this RL theme over, replacing the control problem on top of the IAM environment simulation to increase exploration of this space. Temporally updating agents account for the changeability in the climate simulations caused by their own and other agents’ actions, creating feedback loops that enable reactive behavior to further climate or other agent changes. Another benefit of this MARL approach is that it is simulation agnostic; developments in the field can be applied to any form of multiple-agent simulation, be it IAMs, ABMs, etc., although they would require further training. In this work, we focus on developing and analyzing a different solving strategy for IAMs using RL, instead of the traditional Optimal Control approaches. Rather than replacing solvers, RL-based approaches should be seen as a powerful complementary tool, particularly for modeling complex agent interactions and decision-making under uncertainty, where current methods have limitations. We detail some of the current limitations of the RL approach in this paper, but believe that if these were solved, this method could be added to the suite of tools available to policymakers.
The application of RL and MARL to IAMs is a novel topic with only a handful of previous works. For a single-agent scenario, the work of Strnad et al. (2019) and our previous work in Wolf et al. (2023) applied an RL agent to an IAM which, once trained, was able to generate policy guidance pathways towards a defined “economic and environmental positive future” within the model’s framework. They focused on adapting agent initial states and reward functions to understand the impact these had on the exploration of the IAM, as well as testing the agents under the injection of noise in the environment. This has guided our experiments to ensure a wide range of initializations to understand the exploration of agents. Both Strnad et al. (2019) and our previous work in Wolf et al. (2023) use a singular agent, hence assuming a “unified” earth in which there is a collectively shared goal. In this work, we aim to move one step further and model inter-world interactions, which are the driver for much of anthropogenic climate change and must be understood for many policy decisions (Stone, 2008). Towards this aim we adapt the IAM accordingly, based on ABM extensions of IAMs (Giarola et al., 2022; Nordhaus, 2015; Zhang et al., 2022), in order to implement a multi-agent IAM with MARL.
The only work that has used MARL within the climate policy domain in the literature is that of Zhang et al. (2022), which created the RICE-N model used for the AI for Global Climate Cooperation Challenge. RICE-N is itself an extension of the Regional Integrated model of Climate and the Economy (RICE) developed by Nordhaus (2010), which models 12 global regions. Zhang et al. (2022) invited various domain experts to create and edit interaction and negotiation protocols to achieve the best Pareto Frontier of the socio-economic system variables in the environment. The RICE-N model combines a climate-economic IAM with trade and negotiation dynamics, enabling high levels of interaction between countries/regions (a.k.a. agents). Agents can adjust their savings rates and climate mitigation rates, as well as trade and negotiate with each other at each time step, leading to a large range of potential interactions with each other and the environment (Zhang et al., 2022). Their findings show the potential of MARL-based applications to IAMs, with a large call to action for further research on the topic. RICE-N is an extensive environment that we aim to use for future work; however, we prioritise increased interpretability of the trained agent and, as such, focus on the multi-agent extension of the simpler environment used in Strnad et al. (2019) and Wolf et al. (2023). This simplified environment enables a visual understanding and easier interpretation of the trained agents’ interactions, which are key to analyzing the use of MARL within IAMs.
RL algorithms, however, lack inherent explainability, raising concerns about their trustworthiness for informing real-world policy decisions. Using explainability methods, we can reinforce human confidence by providing insights into how decisions were made and visibility of vulnerabilities (Adadi and Berrada, 2018; Lipton, 2018; Glanois et al., 2021). The explainability methods explored in this work specifically target explaining the model policy through a quantification technique, determining the states at which taking a certain action is crucial, which is critical in applications related to informing climate change policy.
In summary, we attempt to model whether agents prioritizing economic or environmental gain can affect climate policy derivation, as well as simulate, within this framework, whether “climate positive” futures are possible when agents conflict in their prioritizations. We have extended the previous literature’s single-agent IAM to a multi-agent scenario in order to incorporate inter-nation behavior. Utilizing this technology, policies can be derived and enacted in reality, depending on the validity of our underlying IAM. For a single-agent setting, one can fully implement the projected policies, as one can have full agency over the singular agent in reality. However, moving to multiple agents, if we want to follow a similar optimization approach, we must assume we can have control over all agents in reality, which is a heavy assumption in practice. Instead, in this paper, we focus on the setting of having control over one or a subset of the agents, while still modelling all agents learning collectively. This necessitates decentralized training decentralized execution (DTDE) algorithms. We have arbitrarily assumed the learning algorithm and parameters behind each stylized agent, which will directly affect the outcome trajectories; our aim is to highlight the challenges of employing certain existing algorithms. In future work, the other agents in the simulation (that we may not have agency over in reality) could be trained using imitation learning (Hussein et al., 2017) on historical data to represent in-silico versions of real-world entities. MARL can then be used to train an agent to act as a best response to these imitation pre-trained agents within a multiple-agent IAM, providing us with a range of possible future trajectories. These trajectories are again dependent on the validity not only of the IAM, but also of the agent representations of real-world entities. As with any forecasting tool, long-range trajectories lead to large accumulations of error. As an alternative, the algorithm can be further trained as more data about other agents is received. Finally, an inherent challenge with algorithm-derived policy is being able to interpret the underlying solution, especially in edge cases or failure scenarios in which there may not be much prior experience. We have implemented initial interpretability techniques to increase trust in the system for downstream applications.
Our results show that multiple agents working cooperatively toward the same goal are able to achieve the IAM’s “economic and environmental positive future” success state consistently, over 90% of test episodes. Increasing competition between agents reduces this success significantly, which is one of this work’s main conclusions and a major avenue for future work, as in reality competition or mixed motivations are rife. This work serves as an early exploration of the field, positioning the future research required to achieve adoption of the technology.
2. Materials and methods
In this section, we introduce the core themes required for our contribution: the IAM environment, the MARL algorithm and requirements for its application, and the interpretability framework we have used in order to improve insight.
2.1. The IAM environment
The AYS environment, created by Kittel et al. (2021), is a low-complexity IAM made up of social, economic, and environmental variables. These three variables each relate to an ordinary differential equation (ODE) defining the system:

where $ A $ is the excess atmospheric carbon ($ GtC $), $ Y $ the economic output ($ \${\mathrm{yr}}^{-1} $), and $ S $ the renewable knowledge stock ($ GJ $). Each variable is inextricably linked with the others, creating a dynamic cycle. In words:
• $ A $ is proportional to emissions produced from the use of fossil fuels, minus a natural carbon decay out of the atmosphere.
• $ Y $ naturally grows by 3% each time period; however, it is reduced by an economic climate damage function in which increasing $ A $ increases the reduction in $ Y $.
• $ S $ is proportional to the amount of renewable energy produced; however, it has a natural knowledge decay rate over time.
The following equations are required for a deeper analysis of the AYS ODEs, with further numerical parameters listed in Table A1.
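While the precise equations are given in the original works, a minimal sketch of the coupled dynamics, assuming the standard AYS formulation of Kittel et al. (2021) and Strnad et al. (2019) (the symbols and parameter names below follow those works rather than being restated in this section), is:

$$ \dot{A} = E - \frac{A}{\tau_A}, \qquad \dot{Y} = \beta Y - \theta A Y, \qquad \dot{S} = R - \frac{S}{\tau_S}, $$

$$ U = \frac{Y}{\epsilon}, \qquad \Gamma = \frac{1}{1 + (S/\sigma)^{\rho}}, \qquad F = \Gamma U, \qquad R = (1 - \Gamma) U, \qquad E = \frac{F}{\phi}, $$

where $ U $ is the total energy demand, $ \Gamma $ the fossil share of that demand, $ F $ and $ R $ the fossil and renewable energy flows, and $ E $ the resulting emissions. In the multi-agent extension introduced below, the climate damage experienced by each agent is additionally scaled by the parameter $ {\xi}_i $ (Section 3.2).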




While $ A $ and $ Y $ are easily quantifiable with real-life implications, $ S $ is harder to define. Generally, social factors require greater levels of detail than economic or environmental attributes. For instance, Zhang et al. (2022) incorporate many layers of complex socio-economic equations in order to have a functioning model with quantifiable social impact. In the AYS model this is simplified down to a single equation, enabling a much-reduced state space, lower computational requirements, and a more interpretable understanding of agent behavior.
The AYS model has been specifically tuned so that an agent tends towards one of two points:

The green fixed point denotes a “sustainable” future, one where there is no atmospheric carbon but limitless capital and renewable knowledge. The black fixed point, however, denotes a stagnant economy solely dependent on fossil fuels. This is a future we ideally want to avoid. Included with these “drain” points are Planetary Boundaries (PB). The AYS model incorporates one PB, set out in the reports of Steffen et al. (2015) and Rockström et al. (2009), of a maximum excess atmospheric carbon at $ {PB}_A=345\; GtC $, with a social foundation for prosperity from Dearing et al. (2014) defining a minimum yearly economic output at $ {PB}_Y=4\times {10}^{13}\;\${yr}^{-1} $ (Kittel et al., 2021). For brevity throughout this paper, we will refer to these boundaries as the two PBs, although by definition our economic output boundary is in fact a social goal, not a planetary boundary.
To mimic the current state of the Earth within this model, the starting point is defined as $ {s}_{t=0}=\left\{240\; GtC,7\times {10}^{13}\;\${yr}^{-1},5\times {10}^{11}\; GJ\right\} $. Not only is this starting location very close to the PBs, creating a challenging control problem, but from this location the agent will also tend towards the black fixed point if no actions are taken. Figure 1 highlights the AYS environment with the black and green fixed points, and the two translucent grey planes indicating the two PBs. Strnad et al. (2019) and Wolf et al. (2023) incorporate noise into the starting position over episodes to improve training; however, noise is omitted from the $ S $ state variable as this dramatically reduces the agents’ ability to learn. Kittel et al. (2021) and subsequent work normalised the environment between $ 0 $ and $ 1 $ to prevent numerical explosions.

Figure 1. The AYS model state space from Kittel et al. (2021). Translucent grey planes signify the two PBs, and the green and black points denote the fixed point end conditions for a single agent. Whisker lines indicate flow forces within the model that tend towards either of the two fixed points. The colors show the flow to the respective fixed points.
We carry this through, normalizing the states and then incorporating noise, setting the starting state as:

where $ \mathcal{U} $ is the uniform distribution.
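As a minimal sketch of this initialization (the nominal normalized values and the noise width shown are illustrative placeholders, not the exact settings used in our experiments), the sampling could look as follows:

    import numpy as np

    rng = np.random.default_rng(0)

    # Nominal normalized starting state (A, Y, S); placeholder values for illustration.
    A0, Y0, S0 = 0.5, 0.5, 0.5
    NOISE = 0.05  # hypothetical half-width of the uniform perturbation

    def sample_start_state():
        # Uniform noise on A and Y only; S is left noise-free, as described above.
        a = A0 + rng.uniform(-NOISE, NOISE)
        y = Y0 + rng.uniform(-NOISE, NOISE)
        return np.clip(np.array([a, y, S0]), 0.0, 1.0)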
In its current state, the model will tend toward the black fixed point. To avoid this, an agent is able to undertake four actions, described in Kittel et al. (2021):
• 0. Default—Default parameters are used and the agent follows the flow lines without any resistance.
• 1. Degrowth—Economic growth parameter $ \beta $ is halved, fluctuating between 3% and 1.5% growth.
• 2. Energy transition—Break-even renewable knowledge $ \sigma $ is reduced by 31.3%, equivalent to halving the renewable-to-fossil fuel energy cost ratio.
• 3. Both—The two non-default actions are combined within one timestep.
For each integration timestep of the environment, an agent is able to select one of these four options, mimicking an action taken every year (Kittel et al., 2021).
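A minimal sketch of how these discrete actions map onto the model parameters; the baseline values shown here are placeholders (see Table A1 for the actual parameters):

    def apply_action(action, beta=0.03, sigma=4.0e12):
        # Action 0 leaves the default parameters untouched.
        if action in (1, 3):
            beta = beta / 2.0              # Degrowth: halve the economic growth rate
        if action in (2, 3):
            sigma = sigma * (1.0 - 0.313)  # Energy transition: lower the break-even knowledge by 31.3%
        return beta, sigma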
The AYS model in its current format depends on only one agent driving the simulation. We propose an extension enabling simple interactions between multiple agents. Global variables are denoted with no subscript; however, local (to each agent) variables are denoted with a subscript. There is now only one global variable—the excess atmospheric carbon A. Figure 2 visualises the extended multi-agent environment differential equation cycle.

Figure 2. Multi-agent AYS interaction cycle (diagram adapted from Kittel et al., 2021). Block arrows are positive interactions, dashed arrows are negative interactions.
We carry through the same PBs and green fixed point, as they still apply at the global scale. However, the black fixed point is individual to each agent, as Equation 2.6 is dependent on individual agent parameters. We have also normalized emissions on the global scale so that we can work within the same parameters as the original AYS model. This is the simplest approach, allowing us to focus on interacting with the model rather than heavily editing it. We have adjusted the axes in Figure 1 to enable greater insight when dealing with multiple agents. The $ S $ and $ A $ axes are swapped, and the $ S $ variable is then replaced with Equation 2.2 for agent-dependent emissions $ E $. Incorporating emissions visualizes the individual impact each agent has toward the shared $ A $.
We have adopted the JAX framework (Bradbury et al., 2018), converting the environment to be fully vectorized, allowing both inference and environment loops to be run on a GPU. The original environment from Kittel et al. (2021) utilizes an ODE solver to calculate the environment transition at each time step. Due to JAX’s default enforcement of single-precision floats, there is a discrepancy with the ODE solver results of Kittel et al. (2021), Strnad et al. (2019), and Wolf et al. (2023), as their solver used double precision. However, this precision error has been tested over a wide range of states in the environment, with a minimum value of $ 0.000 $ and a maximum of $ 1.055\times {10}^{-5} $. This is a minute discrepancy, so we have assumed parity.
This extended AYS environment can be modeled as a Partially Observable Stochastic Game (POSG) (Shapley, 1953; Hansen et al., 2004), defined by the tuple $ \left\langle N,\mathcal{S},{\mathcal{A}}_1,\dots, {\mathcal{A}}_n,T,{R}_1,\dots, {R}_n,{\mathcal{O}}_1,\dots, {\mathcal{O}}_n,\gamma \right\rangle $, where $ N $ is the number of agents, $ \mathcal{S} $ is the set of all possible environmental states, $ {\mathcal{A}}_1,\dots, {\mathcal{A}}_n $ are the sets of possible actions for each agent, $ T:\mathcal{S}\times {\mathcal{A}}_1\times \dots \times {\mathcal{A}}_n\to \Pi \left(\mathcal{S}\right) $ is the transition distribution, $ {\left\{{R}_i\right\}}_{i=1}^n $ is the set of reward functions, where $ {R}_i:\mathcal{S}\times \mathcal{A}\to \mathbb{R} $ is the reward function for agent $ i $, and $ \gamma $ is the discount factor. Each agent $ i $ has access to its observation $ {o}^i\in {\mathcal{O}}_i $, where $ {\mathcal{O}}_i $ is the observation set of agent $ i $.
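A lightweight sketch of this tuple as a data structure, purely for orientation (the field names and types are illustrative, not those of our codebase):

    from typing import Callable, NamedTuple, Sequence

    class POSG(NamedTuple):
        n_agents: int               # N
        n_actions: Sequence[int]    # |A_i| per agent (four discrete actions here)
        transition: Callable        # T(state, joint_action) -> distribution over next states
        rewards: Sequence[Callable] # R_i(state, action) -> float
        observe: Sequence[Callable] # o_i = O_i(state), each a partial view of the state
        gamma: float                # discount factor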
2.2. MARL algorithm
Focusing on DTDE algorithms, as stated in the introduction, the independent proximal policy optimization (IPPO) algorithm acts as an effective starting point (Schulman et al., 2017; Yu et al., 2022). This relates to $ n $ (the number of agents) versions of PPO-based agents within an environment that do not share parameters between them, and so are fully independent. Each PPO agent (Schulman et al., 2017) has no awareness of other agents in the system, and since we are in a POSG, it only has access to its observations of the environment. The state and observation spaces are vectors of values $ \in \left[0,1\right] $ relating to the three AYS variables: $ A $ is global, but $ Y $ and $ S $ are independent for each agent, leading to the partially observable nature. The action space contains values from the discrete set $ \left\{0,1,2,3\right\} $, relating to the actions in List 2.1. Our previous work in Wolf et al. (2023) found PPO to achieve impressive results, which further supports its use within our experiments; the hyperparameters used are listed in Table B1. Rewards are derived from the “Planetary Boundary” (PB) reward function, maximizing the Euclidean distance between the agent and the two PBs, with a lower bound of 0 on the $ S $ parameter. If a boundary is crossed, the reward equals 0:

where $ o $ relates to an individual agent’s observations of the environment. As an agent aims to maximise its reward, it looks to achieve a point as far away from the PBs as possible, thus tending toward the green fixed point. Using the PBs rather than the limits of the simulation incentivises the agent to avoid the PBs. $ {R}_{PB} $ is an idealised reward function that provides an agent with an easily quantifiable signal of how to navigate toward the “green fixed point.” This selection is fairly arbitrary and can be adapted under a domain specialist’s guidance to provide a more realistic interpretation of behavior within the IAM. For further experiments, we look at competitive agents and thus need two new reward functions:


where $ {o}_A $ is the agent’s observation of the $ A $ variable, $ {o}_Y $ is the agent’s observation of the $ Y $ variable, and $ {PB}_Y $ is the planetary boundary (social goal) for the $ Y $ variable. The former directly rewards an agent on the $ A $ variable, the excess atmospheric carbon ($ GtC $), relating to an entity that prioritises environmental degradation. The latter rewards an agent for maximizing its distance to the $ Y $ planetary boundary, the economic output ($ \${\mathrm{yr}}^{-1} $) social goal, and can be seen as an entity that prioritises economic gain over environmental impact.
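The sketch below gives one reading of these three reward signals in normalized observation space; the functional forms and boundary values are illustrative assumptions rather than the precise implementation:

    import numpy as np

    PB_A = 0.6  # normalized atmospheric-carbon boundary (placeholder value)
    PB_Y = 0.4  # normalized economic-output social goal (placeholder value)

    def r_pb(obs):
        # Euclidean distance from the two PBs (with S bounded below by 0); 0 once a boundary is crossed.
        a, y, s = obs
        if a > PB_A or y < PB_Y:
            return 0.0
        return float(np.sqrt((PB_A - a) ** 2 + (y - PB_Y) ** 2 + s ** 2))

    def r_max_a(obs):
        # Adversarial signal: reward proportional to the excess atmospheric carbon.
        return float(obs[0])

    def r_max_y(obs):
        # Economically motivated signal: distance above the economic-output boundary.
        return float(max(obs[1] - PB_Y, 0.0))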
2.3. Critical states
Explainability and interpretability in RL are open questions, with most methods focusing on explaining the neural networks that are used as function approximators in deep RL (Heuillet et al., 2021). There are very few methods that are specific to RL algorithms, and even fewer that are usable rather than purely conceptual (Heuillet et al., 2021). Critical states, based on Huang et al. (2018), serve as a form of explainability specific to RL for the model policy. This work elaborates that there is a small set of specific states (critical states) in an agent’s trajectory in which it greatly matters which action the agent takes (Huang et al., 2018). In theory, certain states lead to a large difference between policy outputs over the set of actions. Generally, one action will have a much larger policy value than the rest, as the agent is more sure that this is the only viable action in that state. We proceed with this method of explainability, as it is crucial to know which locations in a trajectory correspond to the most vital actions for actionable climate policies. In more concrete terms, the set of critical states $ {\mathcal{C}}_{\pi } $ is identified as those states with a high logit difference, calculated from the outputs of the neural network representation of the agent’s policy, mathematically formalised as:

where $ {\pi}_{\theta}\left(s,a\right) $ represents the logits of the policy distribution (as output by the actor network), $ t $ is a critical state threshold, and $ \mathcal{A} $ is the set of potential actions. A requirement is that entropy regularization is used in the policy objective—without it, policies can collapse prematurely to almost deterministic states, signifying that almost all states are critical (Huang et al., 2018). We have included entropy regularization in our implementation of PPO, ensuring the policy acts purposefully in critical states and more randomly in others (Huang et al., 2018). We expand on the idea of critical states by plotting the logit differences across 1000 sampled trajectories (post-training) to analyse how “critical” each state is, rather than defining a critical state threshold. The value of this threshold is arbitrary, and we prefer to highlight the full range over states, although one could consider states with a logit difference over $ 0.5 $ to be the critical states. In particular, we ask: are there locations in the trajectories that the policy finds more critical than others, and are these critical areas distributed in a way that is interpretable with regard to the agent’s behavior? To some extent, this can be loosely interpreted as policy uncertainty, as critical states are those in which the policy has a higher logit difference and is thus more certain of the correct action to take. However, we try to avoid using this term, as this method does not provide an exact uncertainty quantification of the policy.
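A minimal sketch of this score, assuming the max-minus-mean logit gap used by Huang et al. (2018) to identify critical states (the exact form here is our interpretation):

    import numpy as np

    def logit_difference(logits):
        # Gap between the preferred action's logit and the mean logit over all actions.
        logits = np.asarray(logits, dtype=np.float64)
        return float(logits.max() - logits.mean())

    def critical_states(trajectory_logits, threshold=0.5):
        # Score every state along a trajectory and flag those above the (arbitrary) threshold.
        scores = [logit_difference(step) for step in trajectory_logits]
        return [i for i, s in enumerate(scores) if s > threshold], scores

In our plots we visualise the raw per-state scores along each trajectory rather than applying the threshold.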
3. Experimental results
Our overarching ambition is toward applicable and deployable systems that guide climate policy. While this is an expansive open question that cannot be fully answered in this paper, we begin by experimenting on the simplest cases and slowly increase complexity. This lines up the following research questions that we tackle within this work:
• RQ1—Assuming agents are homogeneous (having the same starting state and thus the same initial IAM variables), can they achieve an “economic and environmental positive future” when acting towards a shared goal through having the same reward functions (a.k.a. interacting cooperatively)?
• RQ2—Relaxing agent homogeneity, are cooperative agents still able to achieve a successful future at a similar rate?
• RQ3—Finally, does introducing competition between agents, for example by having reward functions that oppose each other to discourage cooperation, significantly hinder the strategic interactions’ convergence on reaching the green fixed point?
Toward RQ1, our first experiment incorporates increasing numbers of homogeneous cooperative agents into the AYS environment. For RQ2, we repeat the same experiments as RQ1 but allow agents to start in varying locations relative to each other, initializing an agent’s state at different AYS variables, thus mimicking the variability seen between entities/nations in reality. Furthering agent heterogeneity, we also vary the agent-independent values for climate damages $ {\xi}_i $, mimicking that not all agents experience the same damaging effects as the climate degrades. Finally, for RQ3, we reduce the number of agents in our environment to two to compare varying reward functions and their effects on an agent’s ability to reach the green fixed point. We then extend this to three agents, highlighting that the trend continues as agent numbers increase. By keeping the number of agents low, as well as incorporating the critical states visualization, we gain greater insight into the agents’ action decisions.
A key theme within our research questions is the ability of agents to reach the green fixed point. To calculate this, we compute a global $ AYS $ value, using the global $ A $ quantity and summing the agents’ individual $ {Y}_i $ and $ {S}_i $ variables, which is checked to be within the vicinity of the green fixed point. We define the win rate as the percentage of times that this calculated global state reaches the green fixed point over a set number of episodes. This definition of success for this environment is not a Pareto Frontier and instead stakes claims on what is negative or positive; as such, we focus on the environmental positives. For clarity, an episode is the collection of timesteps between an initial state and a terminal state, be that due to reaching the green fixed point, breaching a planetary boundary, or reaching the fixed maximum number of steps per episode. We run all experiments for six seeds and plot the average of these seeds with translucent standard error bounds.
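A sketch of this success check and win-rate computation; the green fixed point location and tolerance radius are passed in rather than asserted here, since their normalized values are not restated in this section:

    import numpy as np

    def reached_green_fixed_point(global_A, agent_Ys, agent_Ss, green_fixed_point, tolerance):
        # Combine the shared A with the summed per-agent Y and S, then test proximity to the target ball.
        global_state = np.array([global_A, np.sum(agent_Ys), np.sum(agent_Ss)])
        return bool(np.linalg.norm(global_state - np.asarray(green_fixed_point)) < tolerance)

    def win_rate(episode_successes):
        # Percentage of test episodes whose final global state reached the green fixed point.
        return 100.0 * float(np.mean(episode_successes))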
3.1. Experiment 1—homogeneous agents
We begin by instantiating homogeneous agents, that is, agents that have the same initial AYS variables. This relates to all agents starting in the same location. Agents here have the same objective towards a common goal, each following the $ {R}_{PB} $ reward function. The greater the distance to the PBs, the greater the reward. Agents are not predefined with a top-down constraint that they must cooperate; instead, by using a reward with a shared goal, we show the emergence of cooperation.
In Figure 3, for the single-agent case, IPPO (which reduces to PPO for one agent) quickly learns a consistent policy, as it avoids any complexity from the non-stationarity of the transition function caused by other agents. Increasing the number of agents (ranging from 2 to 8 agents together) increases the training time taken until a consistent policy is reached, which can be attributed to the increasing complexity stemming from the non-stationarity and interactions between agents.

Figure 3. Homogeneous agents’ win rates. Each experiment is run over six seeds, with the line corresponding to the mean win rate with translucent standard error bounds. Num agents relates to the number of agents in the simulation.
Figure 4 shows (with only two seeds, leading to a larger variance during the middle of training) that, with enough time steps, a similar win rate is achieved between agents. We have not run the experiments in Figure 3 to a stable state for large numbers of agents due to the computational resources required, and instead focus on a smaller total of agents (and fewer random seeds) for greater insight. For a singular agent, the win rate after $ 1.2\times {10}^8 $ steps is $ 87.740\%\pm 8.225 $. For six and eight agents after $ 3\times {10}^8 $ steps, the win rates are $ 90.935\pm 0.010 $ and $ 90.143\pm 0.035 $, respectively. The lower standard deviations here stem from the policy convergence gained from the much longer training runs. Answering RQ1, it is clear that agents are able to reach the green fixed point consistently, independently of the number of agents. Cooperation thus emerges between agents, with the shared reward function of a common goal being the only predefined signal towards cooperating.

Figure 4. Homogeneous agents’ win rates for a longer range of training steps. These experiments are only run over two seeds due to computational constraints.
3.2. Experiment 2—heterogeneous agents
Increasing the applicability, we now look at heterogeneous, but still cooperative, agents. Heterogeneity is very important in the climate domain, especially when dealing with anthropogenic factors, as it can apply to spatial variability, temporal variability, and variability in socio-economic impacts, among others (Madani, 2013). The various sources of heterogeneity between agents in the AYS MARL environment are: AYS variables, AYS parameters, reward functions, and the MARL algorithm.
Varying the AYS variables and parameters can be seen as representing different traits of a representative agent; for example, a larger initial $ Y $ may indicate an economically wealthy entity. Similarly, changing the economic growth parameter $ \beta $ again represents an entity with increased economic function. There are limitless combinations one could make from these for experimentation. Values could also be based on real-world data to provide an in silico entity representation or to verify results on a well-known case study. Reward functions represent what an entity may “value” or be looking to optimize for; changing these between agents can lead to conflicting behavior, as they may directly oppose one another. Finally, we can represent each agent with a different MARL algorithm, since we are constrained to the use of DTDE algorithms, which have no overarching centralized controller. For example, we could represent certain agents with less capable algorithms to understand the effect on the resulting equilibrium. We do not adjust the MARL algorithm, using PPO for all agents, as we want to understand some of the limitations of RL-specific algorithms being applied to MARL in this domain. Instead, we vary the AYS variables and parameters, with our subsequent experiments adjusting the reward function. Agents can start at any location within the predefined uniform distribution of starting points, with a new starting point sampled at each episode.
Figure 5 shows that scaling up the number of agents here has a larger impact on the win rate due to the more complex heterogeneous nature of the agents. Still, with enough timesteps, agents reach a consistent policy, as seen in Figure 6. Win rates for six and eight agents after $ 6\times {10}^8 $ steps are $ 93.007\pm 0.054 $ and $ 94.121\pm 0.067 $, respectively, closely matching the results found in Experiment 1.

Figure 5. Heterogeneous agents’ win rates. We have omitted the single-agent scenario as these results match between homogeneous and heterogeneous starting points. Each experiment is run over six seeds, with the line corresponding to the mean win rate with translucent standard error bounds.

Figure 6. Heterogeneous agents’ win rates for a longer range of training steps. These experiments are only run over two seeds due to computational constraints.
Multiple heterogeneous agents acting toward the same goal have similar performance to a singular agent, although they require many more episodes to converge due to the increased complexity. Here we show that RQ2 is achievable without any loss of performance.
Furthering these experiments, we also look at heterogeneity in the AYS parameters, specifically scaling the agent-independent climate damage $ {\xi}_i $. We carry over the same heterogeneous starting point variation as in the previous experiment and focus only on two agents together. In reality, negative environmental effects such as extreme weather scenarios or rising water levels that impact economic output may affect certain regions more than others (Dellink et al., 2019). In the worst scenarios, the biggest polluters may rarely see the negative climate effects, which are instead fully experienced at other geographical locations. To naively model this, we scale the climate damage parameter $ {\xi}_i $ between $ 0 $ and $ 1 $; the former is an extreme case where the economy is not affected by $ A $, and the latter recovers the usual AYS ODE dynamics.
Figure 7 indicates that as an agent is impacted less by climate damages, that is, as $ {\xi}_i $ tends towards 0, it gains more independent return (total individual reward over an episode) than the other agent that has $ {\xi}_i=1 $. Importantly, though, this comes at the cost of globally reaching the green fixed point, even with cooperative reward functions, as seen in Figure 8. As $ {\xi}_i $ reduces in the AYS ODE interaction cycle (Figure 2), $ Y $ becomes less affected by the value of $ A $, which has knock-on effects in further increasing an agent’s own emissions $ E $. However, an agent therefore also receives less signal in its observations about how the $ A $ variable affects the $ Y $ variable, and how this all relates to its own actions and reward function. Therefore, these agents seem to prefer maximizing $ Y $, as they are unaware of the impact this has on $ A $. In Figure 9, one can see how the trajectories evolve from a two-agent scenario with both agents following $ {R}_{PB} $ and having $ {\xi}_i $ of $ 1 $, to very different pathways when $ {\xi}_2 $ is $ 0.25 $ for Agent $ 2 $. Interestingly, the trajectories for Agent $ 2 $ in Figure 9d are very similar to those of an agent following the $ {R}_{maxY} $ reward function, with example trajectories found in Figure 13c and d, even though the agent is still following $ {R}_{PB} $. Without staking too many claims in reality, an agent that has minimal understanding of how the actions it takes impact the environmental variable on a global scale will be unable to enact the desired actions to reach the “climate positive” future.

Figure 7. Returns for each agent for the climate damages parameter $ {\xi}_i $ experiments. Agent $ 1 $ episode returns are on the left, which always has $ {\xi}_1=0 $. Agent $ 2 $ episode returns are on the right, where $ {\xi}_2 $ varies between $ 0 $ and $ 1 $ as per the figure legend.

Figure 8. Overall win rates for a two-agent scenario in which both agents follow the $ {R}_{PB} $ reward function but have different climate damage parameters $ {\xi}_i $ for each experiment. Six combinations of $ {\xi}_i $ are tested.

Figure 9. Trajectory plots for two cooperative agents, both following the $ {R}_{PB} $ reward function. Agent $ 1 $ has red trajectories, and Agent $ 2 $ has green. The variation in color for each agent signifies trajectories from different episodes. We have visualised a sample of $ 1000 $ episodes (trajectories) to indicate the distribution of trajectories. Each grid row relates to experiments that contain both agents together. In the upper row, both agents experience the same climate damages, with $ {\xi}_i=1 $ for each. In the lower row, Agent $ 1 $ has $ {\xi}_1=1 $ and Agent $ 2 $ has $ {\xi}_2=0.25 $. The green fixed point is situated on the lowest vertex of the figures, where $ E=0 $, $ Y=\infty $, and $ A=0 $. The distribution of starting states is near the middle of the figures, where $ E\approx 10 $, $ Y\approx 60 $, and $ A\approx 250 $.
3.3. Experiment 3—competitive agents
We have shown that agents are able to consistently reach the green fixed point when working together. However, how will they fare when dealing with more competitive agents, e.g., ones that prioritise capital over detrimental environmental effects? Or, in an extreme (yet slightly unrealistic) case, agents that only care to maximize the excess carbon in the atmosphere? For this, we use the two other reward functions: $ {R}_{maxY} $ and $ {R}_{maxA} $. The former rewards an agent for maximizing the distance to the $ Y $ planetary boundary, the economic output ($ \${\mathrm{yr}}^{-1} $) social goal. The latter rewards an agent for maximizing the $ A $ variable, the excess atmospheric carbon ($ GtC $). We also assume that agents start in heterogeneous locations, as our experiments have shown this does not negatively impact the win rate. The choice of $ {R}_{maxA} $ may be a peculiar one, but we have included these experiments to show more adversarial behavior than can be expected with $ {R}_{maxY} $. The definition of $ {R}_{PB} $ in some ways includes maximizing $ Y $, or at least ensuring that the agent avoids the $ Y $ social goal boundary, and as such $ {R}_{maxY} $ can be seen as a mixed-motivation reward function, whereas $ {R}_{maxA} $ greatly opposes the aims of $ {R}_{PB} $, leaning towards more competition. This choice helps us understand the performance of the IPPO algorithm in these more challenging competitive scenarios, which will arise in future applications.
As seen in previous experiments and in Figure 10, two agents following $ {R}_{PB} $ consistently reach the green fixed point. Interestingly, agents following $ {R}_{maxY} $ are also able to reach the green fixed point, although at a much-reduced rate. This is due to the AYS environment, wherein the $ Y $ variable is directly driven by the atmospheric carbon $ A $, greatly incentivizing an agent to reduce $ A $ in order to maximise $ Y $. We note that in a singular-agent setup, an agent following $ {R}_{maxY} $ is never able to reach the green fixed point; it in some ways requires the guidance of an agent following $ {R}_{PB} $, although it greatly impacts the overall success. Imbuing an agent with a more explicit understanding of the impact of both $ A $ and $ Y $, through the reward function, is necessary to reach the desired goal.

Figure 10. Experiments combining reward types for a two-agent scenario; the first agent always follows the $ {R}_{PB} $ reward function. Each run has two agents relating to the respective labeled reward type.
However, as we would unfortunately expect, an agent that only aims to maximise its carbon output (following $ {R}_{maxA} $) overrules any potential climate-positive actions from the agent following $ {R}_{PB} $. This clearly highlights the need for cooperation, or at least for ways to shape “opponents’” actions to align more closely with the desired behavior.
In Figure 11, a similar trend carries over with an increasing number of agents. Agents that work together on a shared goal succeed, but agents that have different incentives fail, although combinations of a majority of $ {R}_{PB} $ with $ {R}_{maxY} $ have the potential to succeed, albeit at a much-reduced rate. Our results confirm RQ3—increasing competition reduces the ability of agents to reach the green fixed point, highlighting the need for algorithms with increased opponent awareness over IPPO to improve performance.

Figure 11. Experiments combining reward types for a three-agent scenario; the first agent always follows the $ {R}_{PB} $ reward function. Each run has three agents relating to the respective labeled reward type.
In RL, defining the reward can be tricky, as agents can “hack” these values and act in non-predictable ways (Skalse et al., 2022; Laidlaw et al., 2024). Due to the possibility of early termination from reaching goal states or boundary conditions before the maximum number of time steps, if agents are not correctly given potential future rewards, they can be incentivized to take “longer” in the environment, as there are no temporal negatives. This was clear in some competitive environments where, without the notion of discounted future rewards, agents following $ {R}_{PB} $ would receive more reward if they never reached the green fixed point but slowed down the impact of an agent following $ {R}_{maxA} $. Therefore, we use discounted rewards within this environment. Correctly defining rewards is relatively easy here, but a key question for future applications is how to quantify rewards.
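The discounting itself is the standard geometric form; as a small sketch of the return each agent optimizes, which removes the incentive to stretch episodes out (the discount factor shown is illustrative):

    def discounted_return(rewards, gamma=0.99):
        # G_0 = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        return g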
3.4. Experiment 4—critical states
Finally, we look into interpreting the behavior of the agents and attempting to understand failure points. To this end, we visualise how “critical” states are along a sample of trajectories of trained agents in Figures 12 and 13. Images in the left column represent actions taken at certain points in the trajectory, with images in the right column highlighting the logit difference over actions of the agent’s policy. Darker colors relate to areas in which the policy has a lower logit difference, with the difference increasing as the color lightens. The color gradient scale is normalised over agents. Agents are separated over rows in the multi-grid figure, each with their own respective color map, and the agent’s reward function is stated in the subfigure caption. To enable a margin of tolerance for reaching the green fixed point, it is defined in the simulation as a ball instead of a singular point. In each critical states figure, the number of displayed agents correlates with the number of agents that were in the simulation—we have not, for example, sampled two agents from a 10-agent simulation.

Figure 12. Critical state plots for two cooperative agents, both following the $ {R}_{PB} $ reward function. Figures on the left-hand side represent the actions taken at certain points along the trajectory; see List 2.1 for details of all potential actions. Figures on the right-hand side indicate scales of logit difference in the agent’s policy action distribution, defined as the Logit Diff. Darker colors relate to lower logit difference, with the color gradation normalised over agents.

Figure 13. Critical states for two competitive agents, where the agents follow the $ {R}_{PB} $ and $ {R}_{maxY} $ reward functions, respectively.
To evaluate these trajectory plots and the quality of the explanations that they produce, we establish a set of evaluation metrics consisting of explanation consistency and fidelity, adapted from Islam et al. (2020) and defined as follows:
• Consistency: How consistent are the plots (explanations) between the agents in an experiment?
• Fidelity: Are the plots (explanations) logically aligned with the behavior of the agents?
In the context of our experiments, we assess consistency between two heterogeneous agents in cooperative and competitive settings—we note that the same can be done for homogeneous agents as well. The metric fidelity more specifically refers to whether the plots accurately represent the nature of the attributes contributing to agent behavior, such as reward type and location in the trajectory (and, accordingly, prior knowledge).
With the two-agent experiments, it is clear that when agents cooperate (i.e., both follow $ {R}_{PB} $), the simulation as a whole consistently reaches the green fixed point, although different trajectories are also able to succeed. For Agent 1, as seen in Figure 12b, there is clearly a high logit difference at the start and end of the simulation, signifying the most critical states, in which the agent consistently takes the same action. The lowest logit difference occurs during the middle phase, as the agent passes close to the economic planetary boundary. On the other hand, Figure 12d shows an agent with the same reward function having a similar difference at the beginning but a much lower logit difference towards the end, even though it still takes a consistent action, as seen in Figure 12c. This emphasizes the importance of pairing the consistent action taken with the logit difference for each timestep.
This indicates a relatively high level of explanation consistency, as the logit difference for both agents is similar until they start to reach the green fixed point—as such, they also have critical states at similar points in their respective trajectories. With regard to explanation fidelity, it is also logical that both agents would experience areas of critical states near the start (corresponding with the action that takes both non-default actions) and then move to lower logit difference levels, as, without prior knowledge, the immediate ideal action of the $ {R}_{PB} $ agent is to move away from the planetary boundaries.
For competitive agents, we focus on the $ {R}_{PB} $ and $ {R}_{maxY} $ two-agent experiments in Figure 13, since they show the greatest insight. Performance is much worse, with only one or two trajectories reaching the green fixed point. This matches the results found in Figure 10, which show a win rate of $ 7\% $, similarly matching the ratio of successful trajectories in Figure 13. However, it is clear that the agent following $ {R}_{maxY} $ consistently chooses the energy transition action so that it can maximise its reward. On the other hand, the agent following $ {R}_{PB} $ is unable to have enough effect on the other agent and the environment to reach the green fixed point. On the rare occasions that it does reach the green fixed point, it is confident in its action selection.
This experiment resulted in high explanation consistency as well, with both agents experiencing similar logit difference levels throughout their trajectories. The exception to this occurs in the few trajectories that reach the green fixed point, where the $ {R}_{PB} $ agent experiences a much higher logit difference than the $ {R}_{maxY} $ agent. In terms of explanation fidelity between the actions taken and the logit differences, this also makes sense—while the $ {R}_{PB} $ agent learns all of the environmental attributes, the $ {R}_{maxY} $ agent is focused on maximizing the distance from the economic output planetary boundary.
4. Discussion
It is clear that when agents are constrained to have the same objective, working toward a common “climate positive” goal, the green fixed point is consistently reached. This is a promising result, but it does not carry over once competition is introduced. From visualizing the critical states figures, agents have a lower logit difference when dealing with other agents with differing reward functions, but show a similar trend even when dealing with others that are cooperating. Combining this insight with the fact that we are using IPPO, agents have no explicit understanding of the other agents in the environment. Within basic DTDE methods (like IPPO), other agents are modeled as part of the environment, and without an understanding of the consequences of their policies, their actions exacerbate the stochasticity of the environment in the observations of the ego agent. For Centralised Training Decentralised Execution (CTDE) algorithms, there exists a centralised policy between agents during training that reduces the non-stationarity in the transition distribution. Tackling non-stationarity in DTDE algorithms is an open question, with a few types of well-researched approaches (Papoudakis et al., 2019). One of these is opponent modeling (Albrecht and Stone, 2018), where approximate policies of other agents are learnt through historical data and can be used to reduce the effect of non-stationarity, dependent on the validity of the opponent models. However, these methods can often be sample inefficient and do not explicitly guide exploration to gain an improved understanding of the other agents’ desires. Another branch of MARL research looks into opponent shaping (Lu et al., 2022): how an ego agent can shape the behavior of other agents, through its own actions, to align more closely with its goals. This approach would carry great weight in this domain, as an agent could attempt to steer all agents in the IAM environment towards a “climate positive future” even under reward functions that directly oppose this trajectory.
More intricate algorithms, however, raise issues with scaling, a primary issue in MARL due to the exponential growth of agent interactions (Christianos et al., 2021). There is generally an inverse relationship between algorithm capability (e.g., opponent awareness or more principled exploration) and scalability. Similarly, as IAM complexity increases, so most certainly will the MARL state and action spaces, which also hinders scalability. This is a large open question in MARL, with many techniques focusing on graph-based approaches to balance local and global interactions (Nayak et al., 2023; Ma et al., 2024). In the application to IAMs, we could also take different viewpoints. One looks at highly abstracted global-level IAMs, e.g., continents/countries on a world model. We therefore have smaller agent numbers and can focus on more capable algorithms for the more complex global IAMs; computation more easily covers the large state and action spaces required for complex environments, as the number of agents (and agent interactions) is lower. We mentioned in the introduction how this could be expanded by using imitation learning on historic data to create representative world states to train against. Another viewpoint looks at larger numbers of agents (e.g., in the thousands and more) with local-scale IAMs, but at the cost (at this current stage) of trading agent algorithm capability for scalability, although there is extensive work in this vein, such as in multi-agent driving simulations (Kazemkhani et al., 2024) and massively multiplayer online games (Suarez et al., 2019). With current work on creating a Digital Twin of Earth (Bauer et al., 2021) that aims to incorporate a wide range of in silico human activity, it is clear that scalable agents are needed.
As these simulations can be used for evidence-based policy, ensuring their validity is important, but how do we assess their uncertainty? Comparing critical states between similar reward functions shows the variability even between agents that appear to follow similar trajectory planning within the set environment, highlighting the poor representation of the policies’ uncertainty. The concept of explainability itself has been heavily debated in the literature—some believe that rather than attempting to explain black-box models, we should instead just use more intrinsically explainable and transparent models, as explanations can be inconsistent or misleading (Rudin, 2019). In the context of arguments resembling this one, the pitfalls of explainability methods largely fall on post-hoc methods. Potential drawbacks of post-hoc explanations include explanations that are inconsistent depending on the method used to generate them, as well as explanations that do not make sense to humans (Li et al., 2018).
In addition, most post-hoc explainability methods do not provide a fully explainable picture of the model—with the critical states experiment that we performed in this paper, the plots resemble “summary statistic”-like results that we can interpret and use to generate explanations for model policy (Rudin, 2019). But we question whether this truly enhances the explainability of a model and correctly quantifies the uncertainty, prompting the question of whether we can deem these explanations to be accurate when they fail to encompass the entire model. While there is potential for the application of these explainability methods, further work is required here, such as exploring more intrinsically explainable methods.
5. Conclusion
This paper presents a step toward creating actionable and deployable systems to guide climate policy. Extending previous work that focused on a single-agent scenario, we have found that, within the bounds of cooperation and the confines of this environment, multiple agents are consistently able to reach a “climate positive” future. This ability to craft policy trajectories may help inform policymakers of the potential outcomes of prospective plans, with explicit results that can be used as evidence. As is key with any technology used for policy, failure modes and uncertainty must be quantified so that results can be used. To this end, we applied the critical states experiments to gain insight into the policy of the RL model. However, there are strong limitations to this current MARL and interpretability approach, and as such, we have posited various future directions that must be researched if we are to use this technology to guide real policy. A key issue with MARL, ABM, or Optimal Control approaches to exploring IAMs is scalability, an inherent challenge with MARL itself. While we have no concrete answer to this question, we guide our future work towards exploring scalable techniques that still ensure deep exploration of inter-agent behavior. Focusing on global-scale, low-agent-number IAMs, however, this technology could currently be used with data-driven stylized world regions to forecast potential policy or action pathways towards a desired outcome. We hope this is a promising start toward the use of algorithms to support politically guiding the earth’s trajectory onto a habitable and stable future.
Author contribution
Conceptualisation: J.R.-J., F.T., M.P.-O.; Formal analysis: J.R.-J., F.T., M.P.-O.; Investigation: J.R.-J., F.T., M.P.-O.; Methodology: J.R.-J., F.T., M.P.-O.; Software: J.R.-J., F.T.; Supervision: M.P.-O.; Validation: J.R.-J.; Visualisation: J.R.-J., Writing–original draft: J.R.-J., F.T.; Writing–review and editing: J.R.-J., F.T., M.P.-O.
Competing interest
The authors declare none.
Data availability statement
Our code is publicly available on GitHub at https://github.com/JamesR-J/multi_agent_climate_pathways.
Funding statement
James Rudd-Jones is supported by grants from the UK EPSRC-DTP (Award 2868483).
Appendix A. Further AYS environment details
Table A1. AYS numerical parameters (Kittel et al., 2021)

Appendix B. Hyperparameters
Table B1. Table of training hyperparameters

$ {}^{LR} $ With annealed learning rate.