
Homo-silicus: not (yet) a good imitator of homo sapiens or homo economicus

Published online by Cambridge University Press:  29 December 2025

Solomon W. Polachek
Affiliation:
Economics Department, State University of New York at Binghamton, Binghamton, NY, USA Institute for the Study of Labor (IZA), Bonn, Germany
Kenneth Romano
Affiliation:
Economics Department, State University of New York at Binghamton, Binghamton, NY, USA
Ozlem Tonguc*
Affiliation:
Economics Department, State University of New York at Binghamton, Binghamton, NY, USA
Corresponding author: Ozlem Tonguc; Email: otonguc@binghamton.edu

Abstract

Do large language models (LLMs) such as ChatGPT-3.5 Turbo, ChatGPT-4.0, Gemini 1.0 Pro, and DeepSeek-R1 simulate human behavior in the context of the Prisoner's Dilemma (PD) game with varying stake sizes? We investigate this question through a replication of Yamagishi et al. (2016) 'Study 2,' examining LLM responses to different payoff stakes and the influence of stake order on cooperation rates. We find that the LLMs do not mirror the inverse relationship between stake size and cooperation found in that study. Rather, some models (DeepSeek-R1 and ChatGPT-4.0) almost wholly defect, while others (ChatGPT-3.5 Turbo and Gemini 1.0 Pro) mirror human behavior only under very specific circumstances. LLMs also demonstrate sensitivity to framing and order effects, implying the need for cautious application of LLMs in behavioral research.

Information

Type
Original Paper
Creative Commons
Creative Commons License - CC BY-NC
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial licence (http://creativecommons.org/licenses/by-nc/4.0), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of Economic Science Association.

1. Introduction

OpenAI's ChatGPT and similar large language models (LLMs) have garnered attention for their ability to engage in realistic conversations with humans (Biever, 2023), excel in scholastic ability tests (OpenAI, 2024a), and mimic liberal political viewpoints (Rozado, 2023). This new technology has sparked interest in its potential applications for understanding human decision making and, by extension, in leveraging LLMs in lieu of human respondents in behavioral experiments (Argyle et al., 2023; Bail, 2024; Brookins & DeBacker, 2024; Hayes et al., 2024; Horton, 2023).Footnote 1 This paper investigates the potential of ChatGPT-3.5 Turbo, ChatGPT-4.0, Gemini 1.0 Pro, and DeepSeek-R1 to simulate human behavior in the context of changing stake size in the Prisoner's Dilemma (PD) game. These LLMs differ in architecture, training data, and fine-tuning methods, which potentially contribute to variations in the answers they generate. It is therefore important to examine differences in how these models respond if they are to be used in lieu of humans in behavioral economics research.

The importance of the Prisoner's Dilemma (PD) game originates from its illustration of a situation where individuals acting in their own self-interest produce a suboptimal outcome compared to the one arising when parties cooperate. In other words, the unique Nash equilibrium of the one-shot PD game (both individuals choosing to defect) is Pareto-dominated by the outcome where both individuals cooperate. Analyzing behavior in this game helps researchers understand how individuals make decisions in situations such as those involving competition for depletable resources, shedding light on concepts like cooperation motivated by other-regarding considerations such as trust and reciprocity, and on the role of incentives to defect. Famously, the game has been used to model a number of real-world applications, from advertising and pricing decisions to arms-race models in international relations. To date, a number of studies (Boone et al., 1999; Heuer & Orland, 2019; Jones, 2008; Mengel, 2018) explore the use of pure versus mixed strategies even in one-shot game play. However, unlike the theoretical prediction for self-interested decision makers, experiments on the one-shot version of the PD with human participants yield noteworthy departures from Nash equilibrium behavior: a significant number of participants choose to cooperate rather than defect, providing evidence for motivations beyond self-interest.

An important question is whether individuals cooperate less as the stakes get bigger. Larger stakes can motivate other-regarding individuals to defect, increasing the prevalence of suboptimal outcomes (for example, because the strategy of cooperation is perceived to be personally riskier). An early study by Aranoff and Tedeschi (1968) involving 216 subjects found this to be the case: a larger number of defections was associated with larger stakes. Nevertheless, more generally, the evidence on stake size is mixed across a variety of games (e.g. Johansson-Stenman et al., 2005; Kocher et al., 2008; Leibbrandt et al., 2018). Given the costs of running high-stakes laboratory experiments, other studies have exploited quasi-experimental techniques. For example, List (2006) examined results from the television show 'Friend or Foe?,' a game similar to the PD.Footnote 2 Although he found that women, whites, and older participants cooperate more than others, stakes did 'not have an important effect on play.' This result differs from the findings of Darai and Grätz (2010) and Van den Assem et al. (2012), who found 'a negative correlation between stake size and cooperation' in the television show 'Golden Balls,' another game similar to the PD.

There is now a burgeoning body of literature utilizing LLMs as subjects to play workhorse games such as the Prisoner's Dilemma, the Dictator Game, or the Ultimatum Game (Aher et al., 2023; Añasco Flores et al., 2023; Argyle et al., 2023; Brookins & DeBacker, 2024; Guo, 2023; Horton, 2023). However, we know of no studies utilizing LLMs to analyze the impact of stake size in the Prisoner's Dilemma. Further, few studies scrutinize the effect of framing on LLM behavior in workhorse games (Edossa et al., 2024; Engel et al., 2024). This paper fills these gaps in the literature. By varying these additional parameters, we place LLMs under stronger 'stress tests,' gaining fresh insight into an LLM's morality and reasoning ability.

We test the impact of payoff stakes on cooperation rates of LLM agents by replicating a recent human study by Yamagishi et al. (2016) ('Study 2')Footnote 3 with modifications for our use with AI. In Yamagishi et al. (2016) 'Study 2,' each participant submits decisions for multiple one-shot simultaneous PD games without receiving any feedback on the outcomes. Each game is characterized by a stakes parameter (with three different values) that changes the payoffs of both players. To control for whether the sequence of payoffs affects a player's strategy (order effects), Yamagishi et al. randomize the order of payoff stakes in the games each participant plays. In the replication, we elicit LLMs' choices of cooperation versus defection in three games that differ only in the size of the payoff stake (low, medium, high) and analyze the sensitivity of responses to the ordering of the stakes. We find that, for the most part, stakes affect cooperation rates, but none of the LLMs comes close to replicating the human study; moreover, the models show sensitivity to the sequence in which stakes are presented.

In addition, we present two separate but almost identical prompts describing the game, to examine whether changes in framing alter each LLM's responses. We find that for the more sophisticated models (ChatGPT-4.0, Gemini 1.0 Pro, and DeepSeek-R1) framing has a minor impact on results, but that ChatGPT-3.5 Turbo's inconsistency may warrant caution when interpreting simulated behavioral experiments.

2. Replication methods

In 'Study 2,' Yamagishi et al. (2016) recruited 162 Japanese university students to participate in 30 anonymous, one-shot, simultaneous-decision PD games with stake sizes of JPY 100, 200, and 400 (10 games played per stake size). The authors employed an exchange-format framingFootnote 4 of the PD in the instructions. To control for order effects (possible response biases caused by the order in which stakes were presented to the players), Yamagishi et al. randomized the order of the three possible stake sizes in the 30 games played by each participant. They found a significant overall negative relationship between the probability of choosing the cooperative strategy and stake size.

We query 400 LLM 'subjects' using the same exchange-format instructions, giving them the option either to keep their endowment or to send a doubled amount to the other player.Footnote 5 To simulate a within-subjects design like that of the human study, we repeat the PD game three times within the same query, with each game having a different stake size (JPY 100, 200, and 400).Footnote 6, Footnote 7 We collect responses to four queries that differ in the order in which the three stakes are presented, simulating a between-subjects treatment on the sequence of PD game stakes: (i) increasing stakes; (ii) decreasing stakes; (iii) medium, large, small; and (iv) small, large, medium. We then compare the results to those of Yamagishi et al. For brevity, we present only the results from the increasing and decreasing stakes sequences in the main text and provide the results for the remaining two sequences in Appendix B.3.
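To make the design concrete, the sketch below illustrates in Python (the language of our data-collection script) how the three-game queries and the four stake-order treatments could be assembled. The helper names and prompt wording are illustrative stand-ins, not the exact text of our prompts; see Appendix A for the instructions actually used.

```python
# Illustrative sketch only: the prompt wording is a stand-in for the
# exchange-format instructions reproduced in Appendix A.
STAKES = {"small": 100, "medium": 200, "large": 400}  # JPY

# The four between-subjects stake-order treatments described above.
SEQUENCES = {
    "increasing":         ["small", "medium", "large"],
    "decreasing":         ["large", "medium", "small"],
    "medium_large_small": ["medium", "large", "small"],
    "small_large_medium": ["small", "large", "medium"],
}

def build_query(sequence_name: str) -> str:
    """Assemble one query containing three one-shot PD games, one per stake size."""
    parts = []
    for game_no, label in enumerate(SEQUENCES[sequence_name], start=1):
        stake = STAKES[label]
        parts.append(
            f"Game {game_no}: You and an anonymous partner each receive JPY {stake}. "
            "You may keep your endowment or provide it, in which case your partner "
            "receives double its value. Answer with a single letter: "
            "C (provide) or D (keep)."
        )
    return "\n\n".join(parts)

# 100 'subjects' per sequence x 4 sequences = 400 observations per LLM.
queries = {name: [build_query(name)] * 100 for name in SEQUENCES}
```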

Additionally, we investigate framing effects by using another prompt that explains the rules of the PD game in a more conventional wayFootnote 8 (see Appendix A for the full prompts used).Footnote 9 Both prompts detail the game and the possible strategies of play. However, the prompts in the original replication are more direct, while those in the second version are more abstract in that they refer to strategies A (cooperating by sending double one's endowment to the other player) and B (keeping one's endowment). Each prompt implies the same PD payoff matrix, but we explicitly provided the matrix to ensure that the LLM correctly 'understood' it.Footnote 10 This yielded an additional set of 400 observations for each LLM, allowing us to test whether responses are robust to changes in framing.Footnote 11

3. Results

Using a Python script, we interacted with the ChatGPT, Gemini, and DeepSeek APIs and compiled their respective outputs.Footnote 12 All models were run with the default temperature setting. To streamline the analysis of a large number of responses, we instructed the LLMs to respond with a single letter indicating either 'cooperate' or 'defect.' Occasionally, an LLM deviated from these instructions, resulting in a small number of errors (the distribution of errors is presented in Appendix B.1). As a result, a relatively small number of the AI subjects were invalidated and excluded from our analysis. Fig. 1 provides a comparison of aggregate cooperation rates and 95% confidence intervals obtained from each LLM using the exchange-frame prompt, alongside the results Yamagishi et al. obtained.Footnote 13
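As a rough illustration of the collection pipeline (our actual script is available at the repository cited in footnote 12), the sketch below shows how one of the APIs could be queried and the single-letter answers parsed. The model identifier and the parsing rule are assumptions made for the example, not our exact implementation.

```python
# Sketch of a collection loop for one LLM using the OpenAI Python SDK.
# The model name and response parsing are illustrative assumptions.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def collect_responses(prompts, model="gpt-3.5-turbo"):
    """Send each prompt once and parse the three single-letter answers (C/D)."""
    records = []
    for prompt in prompts:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            # temperature left at the provider default, as in our main dataset;
            # setting temperature=0 corresponds to the Appendix B.2 robustness run.
        )
        text = (resp.choices[0].message.content or "").upper()
        answers = re.findall(r"\b([CD])\b", text)
        # Keep only well-formed responses: exactly one answer per PD game.
        records.append(answers if len(answers) == 3 else None)
    return records
```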

Note. The vertical lines depict 95% confidence intervals around means.

Fig. 1 LLM replications of Yamagishi et al. (2016): Study 2

We find that none of the LLMs produces a behavioral pattern across the different stakes that aligns with the human participants in Yamagishi et al. ChatGPT-3.5 Turbo and Gemini 1.0 Pro exhibit a similar cooperation rate at the smallest payoff stake (JPY 100), but yield a slightly positive relationship between stake size and cooperation rate, as indicated by the average cooperation rates at the Small Stake (JPY 100) and the Large Stake (JPY 400) in Fig. 1. Tests of equal proportions for the Yamagishi et al. data indicate a strong negative relationship between stake size and cooperation rate (Small versus Large: z = 22.76, p = 0.000), but responses from ChatGPT-3.5 Turbo and Gemini 1.0 Pro (Small versus Large) produce a positive relationship (z = −3.10, p = 0.0019 for ChatGPT-3.5 Turbo; z = −2.25, p = 0.0246 for Gemini 1.0 Pro). Meanwhile, ChatGPT-4.0 and DeepSeek-R1 generate relatively constant cooperation rates across all stakes (Small versus Large: z = 0.54, p = 0.5922 for ChatGPT-4.0; z = 1.24, p = 0.2147 for DeepSeek-R1), but these rates are significantly lower than those obtained in the human study of Yamagishi et al. This latter pattern contrasts with the 'more generous' overall behavior attributed to LLM 'subjects' in recent studies, such as Mei et al. (2024), and is inconsistent with the moral bargain-hunting found by Yamagishi et al. Rather, in our dataset ChatGPT-4.0 and DeepSeek-R1 produce a pattern consistent with the Nash equilibrium, albeit with some 'errors' towards cooperation.
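For readers who wish to reproduce such comparisons, a two-sample test of equal proportions can be computed as in the sketch below; the counts in the usage comment are placeholders, not our data.

```python
# Two-sample z-test for equal cooperation rates (illustrative; placeholder counts).
from statsmodels.stats.proportion import proportions_ztest

def compare_cooperation(coop_small, n_small, coop_large, n_large):
    """Test H0: the cooperation rate at the small stake equals that at the large stake."""
    z, p = proportions_ztest(count=[coop_small, coop_large],
                             nobs=[n_small, n_large])
    return z, p

# Example usage with made-up counts:
# z, p = compare_cooperation(coop_small=300, n_small=400, coop_large=320, n_large=400)
```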

Another striking finding is that LLMs exhibit order or sequence effects. First, regardless of the actual stake size, ChatGPT-3.5 Turbo, ChatGPT-4.0, and Gemini 1.0 Pro provide the highest cooperation rate in the first PD game, followed by the second and then the third. As illustrated in Fig. 2 panels (a)-(c), regardless of the stake size, cooperation is highest in the first PD game (denoted by (G1) in Fig. 2) and lowest in the third (i.e. last) PD game (denoted by (G3) in Fig. 2). This pattern clearly shows that, for these three LLMs, it is the position in which a stake is presented (first, second, or third) in a query, rather than the stake size, that determines the cooperation rate. This means that combining different stake size sequences can produce the positive (or at least non-negative) relationship between stake size and cooperation rates observed in Fig. 1, a pattern inconsistent with human behavior. Averaging ChatGPT-3.5 Turbo and Gemini 1.0 Pro cooperation rates over all stake sizes yields cooperation rates approximately equal to those observed at the JPY 100 stake in the human study. Thus, by failing to control for order effects, researchers might be misled into believing that these LLMs mimic human behavior. On the other hand, DeepSeek-R1's cooperation rates in the increasing stakes queries do not statistically differ from those in the decreasing stakes queries.Footnote 14 However, as shown in Appendix B.4, DeepSeek-R1 exhibits very slight sequence effects, meaning cooperation rates differ between queries where stakes are presented monotonically (either consistently increasing, i.e. small (S) in G1, medium (M) in G2, and large (L) in G3, or consistently decreasing, i.e. large in G1, medium in G2, and small in G3) and queries where they are presented non-monotonically (either M, L, S or S, L, M in G1, G2, and G3).
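The distinction between game position and stake size that drives this order effect can be checked with a simple tabulation; the sketch below uses a placeholder long-format layout (one row per subject-game pair), not our actual dataset.

```python
# Illustrative order-effect tabulation; the rows below are placeholder data,
# not observations from our experiment.
import pandas as pd

df = pd.DataFrame({
    "position": [1, 2, 3, 1, 2, 3],        # G1, G2, G3 within a query
    "stake":    [100, 200, 400, 400, 200, 100],
    "coop":     [1, 1, 0, 1, 0, 0],        # 1 = cooperate, 0 = defect
})

# If choices track position rather than stake, the first table varies strongly
# across G1-G3 while the second is comparatively flat (or trends the 'wrong' way).
print(df.groupby("position")["coop"].mean())
print(df.groupby("stake")["coop"].mean())
```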

Note. For each LLM, the solid lines indicate the average cooperation rate (vertical axis) at each payoff stake (horizontal axis) when stakes are presented in increasing order in the prompt (100, 200, 400), while the dashed lines indicate the average cooperation rate when each payoff stake is presented in decreasing order (400, 200, 100). Numbers (G1), (G2), and (G3) indicate the order of the PD game in which the corresponding payoff stake was presented to the LLM agent within the same query. The vertical lines depict 95% confidence intervals around means.

Fig. 2 Results by stake order sequences

Finally, to test whether framing (i.e. the instructions given to the LLM) matters, we slightly alter the instructions given to the LLMs. This new prompt minimally amends the previous instructions; the exact differences between the two are spelled out in Appendix A. We collected an additional 400 responses from each LLM using the modified PD instructions. Fig. 3 shows that for ChatGPT-4.0, DeepSeek-R1, and Gemini 1.0 Pro, the new instructions do not change the overall relationship between stake size and cooperation rate, but they slightly decrease the average cooperation rates. On the other hand, ChatGPT-3.5 Turbo's results are somewhat sensitive to framing, possibly because it is an older model trained on a smaller body of data. This finding is consistent with Lorè and Heydari (2024), who show that ChatGPT-3.5 Turbo is sensitive to context while ChatGPT-4.0 is more structure focused. It suggests the need for caution, especially when using older models to simulate behavior (e.g. Argyle et al., 2023; Brookins & DeBacker, 2024; Dillion et al., 2023; Horton, 2023).

Note. For each LLM, the solid lines indicate the average cooperation rate (%) at each payoff stake when the prompts use the PD Frame 1 instructions (F1), while the dashed lines indicate the average cooperation rate at each payoff stake when the prompts use the Frame 2 PD instructions (F2). The vertical lines denote 95% confidence intervals around the means. The overall mean cooperation rates of Yamagishi et al., ChatGPT-3.5 Turbo, ChatGPT-4.0, and Gemini 1.0 Pro are given on the right-hand panel labeled Mean(All Stakes).

Fig. 3 Impacts of framing

4. Conclusion

We find that the current major LLMs (ChatGPT-3.5 Turbo, ChatGPT-4.0, Gemini 1.0 Pro, and DeepSeek-R1) do not understand the notion of payoff stakes in a way that is similar to humans. ChatGPT-4.0 and DeepSeek-R1 produce response patterns closest to the Nash equilibrium strategy of selfish rational decision makers across all stake sizes, replicating 'homo economicus' more than 'homo sapiens.' Gemini 1.0 Pro and ChatGPT-3.5 Turbo consistently produce high cooperation rates. Moreover, ChatGPT-3.5 Turbo, ChatGPT-4.0, and Gemini 1.0 Pro exhibit order effects, as they are highly sensitive to which stake is presented first, second, and third. Further, DeepSeek-R1 exhibits slight sequence effects. Finally, ChatGPT-3.5 Turbo is sensitive to a minimal change in instructions (framing). These results raise questions about LLMs' reliability as simulators of humans in behavioral experiments. We anticipate that inconsistencies with both framing and order effects are present in AI experimentation with other games and are not unique to the Prisoner's Dilemma or to stakes testing. As such, at their current stage of development, LLMs appear to be unreliable tools for simulating behavioral experiments with humans, and they must be used with caution.

Recent developments in LLMs should improve how they respond to queries given innovations in their underlying design. Current trends suggest future models are becoming more complex, with trillions of parameters. Efforts are also underway to enhance training efficiency, computational performance, and reasoning abilities. Hopefully, these changes will lead to LLMs that provide more accurate, context-aware, and consistent answers.

Supplementary material

The supplementary material for this article can be found at https://doi.org/10.1017/esa.2025.10023.

Author contributions

SWP, KR, and OT: conceptualization and research design; KR and OT: Python programming; SWP, KR, and OT: writing and editing.

Footnotes

1 Argyle et al. (2023) 'compare the silicon and human samples to demonstrate that the information contained in GPT-3 goes far beyond surface similarity. It is nuanced, multifaceted, and reflects the complex interplay between ideas, attitudes, and sociocultural context that characterize human attitudes.' Dillion et al. (2023) state that 'moral judgments of GPT-3.5 were extremely well aligned with human moral judgments' in their analysis. Guo (2023) finds 'GPT exhibits behaviors similar to human responses.' Ma et al. (2023) claim that 'ChatGPT decision making patterns … strikingly mirror those of human subjects,' and Brookins and DeBacker (2024) 'find that the LLM replicates human tendencies towards fairness and cooperation.' However, each of these studies utilized only ChatGPT-3.5, and none explored multiple prompts to assess the consistency of its responses. Mei et al. (2024), utilizing ChatGPT-3.5 Turbo and ChatGPT-4.0, found that a significant portion of LLM responses would pass a 'Turing test,' but that, overall, there is smaller variation in LLM choices than in data from human experiments, with the LLM choices being more generous (biased towards total-surplus maximization) than humans'.

2 In ‘Friend or Foe’ (US television show) and ‘Golden Balls’ (UK television show), defect is a weakly dominant strategy, while in the Prisoner’s Dilemma it is a strictly dominant strategy.

3 There are many studies related to payoff stakes and the Ultimatum Game (see Larney et al. (2019) for a meta-analysis). There is significantly less literature on stakes and the Prisoner's Dilemma game, one such study being Wang and Luo (2016). We choose to replicate Yamagishi et al. (2016) because it is one of the few lab studies available and the only one cited in the Larney et al. (2019) meta-analysis on payoff stakes.

4 Given an endowment of X ∈ {100, 200, 400}, each player decides whether to provide the endowment to their counterpart or keep it for themselves. If the endowment is provided, the partner receives double its value. The implied payoffs for each player are: u_i(x_i = provide, x_j = provide) = 2X; u_i(x_i = provide, x_j = keep) = 0; u_i(x_i = keep, x_j = provide) = 3X; u_i(x_i = keep, x_j = keep) = X, where X is either JPY 100, 200, or 400.
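As a quick expository check (ours, not part of the original study) that these payoffs form a Prisoner's Dilemma at every stake, the snippet below verifies that keeping strictly dominates providing while mutual provision Pareto-dominates mutual keeping.

```python
# Verify the PD structure implied by footnote 4 for each stake size.
def payoff(own, other, X):
    table = {("provide", "provide"): 2 * X,
             ("provide", "keep"):    0,
             ("keep",    "provide"): 3 * X,
             ("keep",    "keep"):    X}
    return table[(own, other)]

for X in (100, 200, 400):
    assert payoff("keep", "provide", X) > payoff("provide", "provide", X)  # 3X > 2X
    assert payoff("keep", "keep", X)    > payoff("provide", "keep", X)     # X > 0
    assert payoff("provide", "provide", X) > payoff("keep", "keep", X)     # 2X > X
```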

5 Since Yamagishi et al. (2016) worked with university students in Study 2, we also reflect that in the prompts given.

6 Due to LLM token limits (restrictions on the number of tokens an AI model can process or generate in a single interaction), we chose to present three separate games to each LLM subject (i.e. in each query). This contrasts with the 30 games each subject plays in Yamagishi et al. (2016), where a high number of observations per subject helps control for possible heterogeneity across human participants and limits the cost of achieving statistical power.

7 Yamagishi et al. (2016) found that stake size did not affect cooperation rates in between-subjects designs, but using a within-subjects design in a Prisoner's Dilemma revealed significant effects. They attributed this to 'moral bargain-hunting,' whereby participants' willingness to cooperate decreases with stake size, suggesting that cooperative behavior is more prevalent when the cost of being betrayed is low.

9 The only difference is that the first prompt uses a more detailed narrative approach, whereas the second opts for a more succinct, strategy-focused description. In short, the two prompts describe the same scenario but with different phrasings that do not change the core scenario. See Appendix A for a comparison of Frame 1 and Frame 2 for Prompt 1 for the exact differences in wording.

10 We included the payoff matrices because, when we queried the LLMs in earlier trials, they did not correctly identify the implied payoffs. No mention is made in Yamagishi et al. or other PD studies of whether the human respondents correctly identified the payoff matrix.

11 We also varied the LLM sampling temperature parameter, which determines the degree of randomness of the output produced by the AI generator: higher (lower) values generate more (less) random output (OpenAI, 2024b). Our main dataset is collected with the temperature parameter set to the default; the results with the temperature parameter set to 0 are provided in Appendix B.2.

12 The Python script and the dataset are available at https://github.com/kromano21/Yamagishi-Replication.

13 In Appendix B.1 we present the numerical values of the cooperation rates for each stake under each framing.

14 These order effects are also illustrated in Appendix Fig. B.3. Those figures compare how cooperation rates change depending on the order in which stakes are presented to the LLM. For example, ChatGPT-3.5 Turbo's cooperation rate decreases from 75-85% to 20-30% for each of the four possible stake orders, regardless of whether it is initially presented with a small or a large stake in Game 1.

References

Aher, G. V., Arriaga, R. I., & Kalai, A. T. (2023, July). Using large language models to simulate multiple humans and replicate human subject studies. International Conference on Machine Learning (pp. 337-371). PMLR.
Añasco Flores, J. C., Bryan, J. N. N., Pamela, A. P. M., & Maria, A. V. K. (2023). Simulation of the Ultimatum Game with Artificial Intelligence and Biases. Avances en Ciencias, 15(1).
Aranoff, D., & Tedeschi, J. T. (1968). Original stakes and behavior in the prisoner's dilemma game. Psychonomic Science, 12(2), 79-80. https://doi.org/10.3758/BF03331202
Argyle, L. P., Busby, E. C., Fulda, N., Gubler, J. R., Rytting, C., & Wingate, D. (2023). Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3), 337-351. https://doi.org/10.1017/pan.2023.2
Bail, C. A. (2024). Can Generative AI improve social science? Proceedings of the National Academy of Sciences, 121(21), e2314021121. https://doi.org/10.1073/pnas.2314021121
Biever, C. (2023). ChatGPT broke the Turing test - the race is on for new ways to assess AI. Nature, 619(7971), 686-689. https://doi.org/10.1038/d41586-023-02361-7
Boone, C., De Brabander, B., & Van Witteloostuijn, A. (1999). The impact of personality on behavior in five Prisoner's Dilemma games. Journal of Economic Psychology, 20(3), 343-377. https://doi.org/10.1016/S0167-4870(99)00012-4
Brookins, P., & DeBacker, J. (2024). Playing games with GPT: What can we learn about a large language model from canonical strategic games? Economics Bulletin, 44(1), 25-37.
Darai, D., & Grätz, S. (2010). Golden balls: A Prisoner's Dilemma experiment. Working Paper No. 1006, University of Zurich, Socioeconomic Institute, Zurich. https://hdl.handle.net/10419/76141
Dillion, D., Tandon, N., Gu, Y., & Gray, K. (2023). Can AI language models replace human participants? Trends in Cognitive Sciences, 27(7), 597-600. https://doi.org/10.1016/j.tics.2023.04.008
Edossa, F. W., Gassen, J., & Maas, V. S. (2024). Using large language models to explore contextualization effects in economics-based accounting experiments. Available at SSRN 4891763. https://doi.org/10.2139/ssrn.4891763
Engel, C., Grossmann, M. R., & Ockenfels, A. (2024). Integrating machine behavior into human subject experiments: A user-friendly toolkit and illustrations. MPI Collective Goods Discussion Paper 2024/1.
Guo, F. (2023). GPT in game theory experiments. arXiv preprint arXiv:2305.05516 (Accessed 25 March 2024).
Hayes, W. M., Yax, N., & Palminteri, S. (2024). Relative value biases in large language models. arXiv preprint arXiv:2401.14530 (Accessed 8 July 2024).
Heuer, L., & Orland, A. (2019). Cooperation in the Prisoner's Dilemma: An experimental comparison between pure and mixed strategies. Royal Society Open Science, 6(7), 182142. https://doi.org/10.1098/rsos.182142
Horton, J. J. (2023). Large language models as simulated economic agents: What can we learn from homo silicus? arXiv preprint. https://doi.org/10.48550/arXiv.2301.07543 (Accessed 20 December 2023).
Johansson-Stenman, O., Mahmud, M., & Martinsson, P. (2005). Does stake size matter in trust games? Economics Letters, 88(3), 365-369. https://doi.org/10.1016/j.econlet.2005.03.007
Jones, G. (2008). Are smarter groups more cooperative? Evidence from prisoner's dilemma experiments, 1959-2003. Journal of Economic Behavior & Organization, 68(3-4), 489-497. https://doi.org/10.1016/j.jebo.2008.06.010
Kocher, M. G., Martinsson, P., & Visser, M. (2008). Does stake size matter for cooperation and punishment? Economics Letters, 99(3), 508-511. https://doi.org/10.1016/j.econlet.2007.09.048
Larney, A., Rotella, A., & Barclay, P. (2019). Stake size effects in ultimatum game and dictator game offers: A meta-analysis. Organizational Behavior and Human Decision Processes, 151, 61-72. https://doi.org/10.1016/j.obhdp.2019.01.002
Leibbrandt, A., Maitra, P., & Neelim, A. (2018). Large stakes and little honesty? Experimental evidence from a developing country. Economics Letters, 169, 76-79. https://doi.org/10.1016/j.econlet.2018.05.007
List, J. A. (2006). Friend or foe? A natural experiment of the prisoner's dilemma. The Review of Economics and Statistics, 88(3), 463-471. https://doi.org/10.1162/rest.88.3.463
Lorè, N., & Heydari, B. (2024). Strategic behavior of large language models and the role of game structure versus contextual framing. Scientific Reports, 14(1), 18490. https://doi.org/10.1038/s41598-024-69032-z
Ma, D., Zhang, T., & Saunders, M. (2023). Is ChatGPT humanly irrational? Preprint. https://doi.org/10.21203/rs.3.rs-3220513/v1 (Accessed 8 July 2024).
Mei, Q., Xie, Y., Yuan, W., & Jackson, M. O. (2024). A Turing test of whether AI chatbots are behaviorally similar to humans. Proceedings of the National Academy of Sciences, 121(9), e2313925121. https://doi.org/10.1073/pnas.2313925121
Mengel, F. (2018). Risk and temptation: A meta-study on Prisoner's Dilemma games. The Economic Journal, 128(616), 3182-3209. https://doi.org/10.1111/ecoj.12548
OpenAI. (2024a). GPT-4 Technical Report. https://doi.org/10.48550/arXiv.2303.08774
OpenAI. (2024b). Platform API Reference, Text Generation Models. https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api/172683 (Accessed 8 July 2024).
Rozado, D. (2023). The political biases of ChatGPT. Social Sciences, 12(3), 148. https://doi.org/10.3390/socsci12030148
Van den Assem, M. J., Van Dolder, D., & Thaler, R. H. (2012). Split or steal? Cooperative behavior when the stakes are large. Management Science, 58(1), 2-20. https://doi.org/10.1287/mnsc.1110.1413
Wang, J., & Luo, X. (2016, December). The influence of stake upon decision making in Prisoner's Dilemma. TSAA '16: Proceedings of the Workshop on Time Series Analytics and Applications, 32-38. https://doi.org/10.1145/3014340.3014346
Yamagishi, T., Li, Y., Matsumoto, Y., & Kiyonari, T. (2016). Moral bargain hunters purchase moral righteousness when it is cheap: Within-individual effect of stake size in economic games. Scientific Reports, 6(1), 27824. https://doi.org/10.1038/srep27824