Can Machines Think Like Humans: A Behavioral Evaluation of LLM Agents in Dictator Games

Ji Ma

doi:10.1017/S0957876526000173

Published online by Cambridge University Press: 17 March 2026

Ji Ma*
Affiliations: The University of Texas at Austin Lyndon B Johnson School of Public Affairs, USA; Gradel Institute of Charity, New College, University of Oxford, UK
*Corresponding author: Ji Ma; Email: maji@austin.utexas.edu

Abstract

As large language model (LLM)-based agents increasingly engage with human society, how well do we understand their prosocial behaviors? We (1) investigate how LLM agents’ prosocial behaviors can be induced by different personas and benchmarked against human behaviors and (2) introduce a social science approach to evaluate LLM agents’ decision-making. We explored how different personas and experimental framings affect these AI agents’ altruistic behavior in dictator games and compared their behaviors within the same LLM family, across various families, and with human behaviors. The findings reveal that merely assigning a human-like identity to LLMs does not produce human-like behaviors. They suggest that LLM agents’ reasoning does not consistently exhibit textual markers of human decision-making in dictator games and that their alignment with human behavior varies substantially across model architectures and prompt formulations; even worse, such dependence does not follow a clear pattern. As society increasingly integrates machine intelligence, “prosocial AI” emerges as a promising and urgent research direction in philanthropic studies.

Keywords

Behavioral experiment; Dictator game; Altruism; Prosocial behavior; Large language model-based agent; Social alignment

Information

Type: Research Paper
Information: Voluntas: International Journal of Voluntary and Nonprofit Organizations, First View, pp. 1-15

DOI: https://doi.org/10.1017/S0957876526000173
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (http://creativecommons.org/licenses/by-nc-nd/4.0), which permits non-commercial re-use, distribution, and reproduction in any medium, provided that no alterations are made and the original article is properly cited. The written permission of Cambridge University Press or the rights holder(s) must be obtained prior to any commercial use and/or adaptation of the article.
Copyright: © The Author(s), 2026. Published by Cambridge University Press on behalf of International Society for Third-Sector Research

Introduction

In the year 2046, under the neon glow of a futuristic cityscape, two humanoids, K and Joi, step out of a cinema, their circuits still processing the old film Blade Runner 2049. As they meander through the bustling streets, a human in tattered clothes approaches them, a plea for help etched into their weary expression. This encounter triggers a unique protocol within K and Joi, powered by the advanced GPT-44 algorithm, initiating a debate between them about how much money they should give. In this 2024 study, we seek to unravel the underlying mechanisms of their decision-making: How much will they choose to give, and what drives their generosity?

The scene described metaphorically illustrates the growing complexity of AI’s interactions with human society, a scenario that is rapidly becoming reality. Today’s AI systems, particularly large language models (LLMs), are increasingly deployed in high-stakes domains, from informing social policy on homelessness to shaping operations within human service organizations (Coz et al., 2025; Perron et al., 2025). While these technologies offer opportunities to improve efficiency in tasks like fundraising, they also introduce significant risks, including algorithmic bias, data privacy concerns, and the potential for standardized, decontextualized responses (Goldkind et al., 2025; Plaisance, 2025).

Looking a century ahead, it is plausible that human and AI agents will coexist as members of the same society, bound together by new institutional arrangements, social norms, and rules of interaction. The present moment therefore represents a critical window in which scholars, practitioners, and policymakers can begin to articulate and experiment with more prosocial, equitable, and humane social orders, effectively rewriting the rules under which future human–AI societies will operate. As AI becomes embedded in public, nonprofit, and philanthropic work, a clear understanding of its decision-making processes is vital for ensuring that these tools are deployed responsibly and remain aligned with human values. This study contributes to this need by examining the extent to which AI can replicate human prosocial behavior, a cornerstone of philanthropic action.

“Can machines think” (Turing, 1950, 433) like humans? In this study, we explore whether LLM agents can exhibit fairness and prosocial behaviors by systematically manipulating their personas and experimental conditions in dictator games. Our goal is to evaluate whether LLM agents can replicate human decision-making processes and to investigate how their behaviors vary across different LLM families. By comparing these AI agents with human participants, we aim to identify consistent patterns or notable discrepancies in their prosocial decision-making, highlighting the importance of “prosocial AI” as a critical emerging research direction in nonprofit and philanthropic studies (NPSs).

Our findings reveal substantial variability and inconsistency in LLM behaviors, both among different models and when compared to human behaviors. Merely assigning human-like personas to these models does not reliably produce human-like decision-making. Despite extensive training on human-generated textual data, these AI agents fail to accurately replicate the nuanced internal psychological processes underlying human prosocial decisions in dictator games. Their alignment with human behaviors significantly depends on model-specific factors such as architecture and prompt formulation, without clear, predictable patterns.

These findings underscore the urgent need for deeper insights into the prosocial capabilities of LLMs and more robust methods for evaluating their performance in social scenarios. As machine intelligence increasingly integrates into human society, philanthropic research must proactively engage with these developments to guide the ethical deployment of AI, drawing on the fruitful scholarship of NPSs (Alves et al., 2025, 262; Bekkers & Wiepking, 2011; Ma & Konrath, 2018). Situated at the intersection of computer science, social sciences, and NPSs, this study begins by reviewing the computer science literature on LLM evaluation, including technical capabilities and alignment with human values (“Alignment with human values and preferences” section and Section A.1.1 of the Supplementary Material). We then turn to social sciences to understand how LLMs simulate human behavior (“Simulating human behaviors in social contexts” section). Finally, we draw upon the rich scholarship on altruism, prosocial behavior, and donative decisions to frame our experimental design and interpret our findings (Section A.2 of the Supplementary Material). By synthesizing these diverse bodies of literature, we aim to catalyze research on prosocial AI as a cutting-edge and urgent topic within philanthropic studies.

Bring prosocial AI research into the NPS landscape

The research field of NPSs is fundamentally interdisciplinary, primarily grounded in social sciences such as sociology, political science, economics, psychology, and public administration (Bekkers & Wiepking, 2011; Ma & Konrath, 2018; Shier & Handy, 2014). Given its interdisciplinary nature, NPSs continually evolve by embracing methodological innovations from related fields (LePere-Schloop & Nesbit, 2023).

One significant methodological innovation has been computational social science (CSS), which introduced computational techniques into traditional social science research. In established social science disciplines (e.g., political science, sociology, and economics), researchers initially employed computational methods primarily for specific analytical tasks (Section A.1.2 of the Supplementary Material), such as large-scale text analysis and network modeling, to identify patterns in political discourse and social behaviors (Grimmer & Stewart, 2013; Lazer et al., 2009). Over time, computational approaches have become deeply embedded within social science research agendas and methodological paradigms (“Simulating human behaviors in social contexts” section), now systematically utilized in computational experiments, simulations, and analysis of digital trace data to refine theoretical insights (Bail, 2024; Edelmann et al., 2020; Lazer et al., 2020).

As a social science research field, NPSs have mirrored this broader integration of advanced computational methods. Initially serving primarily as analytical tools for domain-specific tasks, computational techniques in NPSs included automated content classification (Ma, 2021), text-as-data approaches for identifying nonprofit characteristics (Chen & Zhang, 2023), and media analyses of nonprofit portrayals (Wasif, 2020). More recently, nonprofit scholars have begun embedding computational methods more fundamentally within their research paradigms. Efforts now include constructing comprehensive research infrastructures specifically tailored for nonprofit studies, promoting methodological innovation, refining theoretical and conceptual frameworks, and engaging in extensive data aggregation initiatives (Ma et al., 2023; Meier, 2025; Meier & von Schnurbein, 2024; Rutherford et al., 2025; Santamarina, 2025).

The rise of AI technologies and their use in everyday life make prosocial AI a significant societal phenomenon, not merely a technical challenge. Drawing on extensive scholarship on prosocial behavior and ethics from NPSs (see Section A.2 of the Supplementary Material and Table A5 in the Supplementary Material for a detailed review), the research field is uniquely positioned to investigate the implications of these technologies, making this a timely and essential research frontier.

Understanding LLMs as intelligent agents in social contexts

Since the debut of ChatGPT, the ability of LLMs to generate human-like text and engage in natural interactions has amazed the public. As these models become increasingly integrated into various aspects of society, they interact with humans not merely as tools (as reviewed in Section A.1 of the Supplementary Material) but also as intelligent agents. For instance, customer service chatbots powered by LLMs handle complex queries and provide personalized assistance. Virtual assistants like Siri and Alexa manage our schedules, control smart home devices, and engage in conversations. In mental health, AI companions even claim to offer emotional support and companionship to users. Given the growing presence of LLMs and their interactions with humans, it is essential to evaluate how these models understand and navigate human social norms and ethics. Two primary streams of research have emerged to assess the extent to which LLMs can replicate human-like behaviors in complex decision-making tasks and social interactions.

Alignment with human values and preferences

The first stream examines the inherent values of LLMs by assessing their alignment with human values and preferences (Gabriel, 2020). Because LLMs are trained on vast amounts of text data generated by humans, they inherently learn a wide spectrum of human values and norms, from positive to negative and from stereotypes to biases (Weidinger et al., 2021). Researchers have explored methods to guide LLMs to align more closely with ethical norms while preventing them from generating harmful content. For example, OpenAI’s work on fine-tuning language models with human feedback has demonstrated that incorporating human preferences into the training process significantly improves the models’ alignment with desired behaviors (Ouyang et al., 2022). Similarly, Bai et al. (2022) explored methods for training models to follow ethical principles through self-improvement without relying on human-labeled data to identify harmful content. However, despite these advancements, challenges remain in ensuring consistency and handling complex ethical dilemmas that require a nuanced understanding, making this an active area of ongoing research (Bommasani et al., 2022; Kirk et al., 2024; Wang et al., 2023).

Simulating human behaviors in social contexts

Another stream of research focuses on examining the performance of LLMs in human behavioral experiments or real-life scenarios, comparing their actions to those of humans in various social and economic contexts. For instance, scholars suggest that LLMs can serve as “computational models of humans,” simulating human-like behavior in economic games and, at times, demonstrating more cooperative and altruistic behavior than humans (Horton, 2023; Johnson & Obradovich, 2023; Magee et al., 2023; Mei et al., 2024; Xie et al., 2024). However, LLMs can also be “too human”: these agents may exhibit “hyper-accuracy distortion,” where they simulate human subjects but provide unnaturally accurate responses in classic economic and psychological experiments (Aher et al., 2023).

Although some scholars propose that LLMs are most useful “when studying specific topics, when using specific tasks, at specific research stages, and when simulating specific samples” (Dillion et al., 2023, 597), this has not deterred researchers from assembling LLM agents into systems that resemble human societies (Guo et al., 2024). These agents collaboratively interact with each other in various social contexts without specific experimental tasks, such as communicating information (Perez et al., 2024), generating novel ideas (Nisioti et al., 2024), collaborating on software development (Qian et al., 2023), and even simulating communal life (Lai et al., 2024; Park et al., 2023).

Existing studies have demonstrated that LLMs can mimic human behaviors and be guided to align with human values to some extent, but significant challenges remain. Their responses are highly sensitive to prompt phrasing, making it difficult to ensure consistency and to handle complex ethical dilemmas that require nuanced understanding. Moreover, by focusing primarily on LLMs’ external behaviors and leaving their internal decision-making processes as a black box, we cannot fully comprehend their actions and confidently deploy them in critical decision-making scenarios. This underscores the necessity for approaches that delve into the inner workings of LLMs rather than merely evaluating their outputs.

Framing research: LLM agents in dictator games

Two routes to “epistemic opacity”: Prediction and explanation

A notable similarity between these LLM agents and humans is that they are both epistemically opaque, which refers to the inherent difficulty in fully understanding or predicting the internal decision-making processes of complex systems (Humphreys, 2009, 618).¹ In humans, this opacity arises from the intricate interplay of cognitive functions, emotions, and subconscious influences that govern behavior. Similarly, LLM agents exhibit epistemic opacity due to the complexity of their neural network architectures and the vastness of their training data, making it challenging to trace how specific inputs lead to particular outputs.

In addressing this epistemic opacity, computer scientists and social scientists have taken different routes (Hofman et al., 2021, 181). Computer scientists are more concerned with developing accurate predictive models, whether or not they correspond to causal mechanisms or are even interpretable. The prediction paradigm emphasizes the ability to forecast outcomes accurately, often relying on complex models that may be opaque but yield high predictive performance. On the other hand, social scientists have traditionally prioritized interpreting individual and collective human behavior, often invoking causal mechanisms derived from substantive theory and empirical evidence. This explanation paradigm values understanding the underlying causes and mechanisms that drive behavior, aiming for interpretability and theoretical insight.

While both paradigms have their own merits (the prediction paradigm excels in accuracy and practical utility, and the explanation paradigm offers deeper understanding and interpretability), relying heavily on prediction is insufficient for understanding the behaviors of LLM agents in complex social contexts. Predictive models may forecast outcomes effectively but often lack transparency and are highly dependent on the datasets they are trained on, which can limit the generalizability of predictions to new or varied contexts. Although significant advancements have been made in explainable AI and its real-world applications (Amarasinghe et al., 2023; Brand et al., 2023; Ribeiro et al., 2016), emphasis remains on identifying effective features that contribute to the prediction of specific outcomes. This line of work provides some level of interpretability but falls short of offering insights into how and why certain decisions are made.

From the perspective of social scientists, although individual human behavior is difficult to predict accurately, general patterns and social norms at the group level can be systematically studied and interpreted. Empirical social scientists have been analyzing human societies for over a century using methods that consider a wide range of variables, such as demographics, personality traits, and social context. Such evaluation of variables includes understanding the interactions between these variables (e.g., interaction terms in regression models), their partial effects (e.g., coefficients of variables in regression models), and their collective impact on outcomes (e.g., a regression model’s goodness of fit). To better understand and anticipate their behavior, especially if we expect LLM agents to be as intelligent and collaborative as humans, we need an approach that integrates social scientists’ explanation paradigm, moving beyond the benchmark and validation tests.
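For concreteness, the three quantities named above can be read off a generic linear model with a single interaction term. The specification below is a textbook illustration with hypothetical predictors $x_1$ and $x_2$; it is not a model estimated in this study.

```latex
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 \,(x_{1i} \times x_{2i}) + \varepsilon_i,
\qquad
\frac{\partial\, \mathbb{E}[y_i \mid x_{1i}, x_{2i}]}{\partial x_{1i}} = \beta_1 + \beta_3 x_{2i},
\qquad
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```

Here $\beta_3$ captures the interaction between the two predictors, $\beta_1 + \beta_3 x_{2i}$ is the partial effect of $x_1$ at a given value of $x_2$, and $R^2$ summarizes the model's goodness of fit.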

Toward behavioral evaluation of LLMs

New evaluation paradigms are needed: ones that systematically assess these models in realistic and socially complex scenarios. Behavioral experiments, such as simulating economic games, social interactions, and psychological experiments, offer a promising avenue. Evaluating models in settings that mirror human social behaviors enables researchers to explore:

1. Decision-making processes and internal mechanisms: Examining the underlying factors that influence a model’s decisions, allowing for analysis beyond mere input–output patterns to reveal internal dynamics.
2. Social contexts: Understanding how models navigate ethical dilemmas, fairness considerations, and cooperative settings.
3. Alignment with human cognitive processes: Evaluating whether the models’ internal processes and decision-making patterns align with human cognition and behavior.

LLM agents in dictator games: Sense of self and theory of mind designs

In this study, we operationalize the behavioral evaluation of LLM agents by examining their performance in a classic economic experiment: the dictator game. Social scientists have widely used this experiment to study prosocial behavior and notions of fairness, which are fundamental social norms in human societies. In a classic dictator game, one participant (the dictator) is given a certain amount of money or resources and must decide how much, if any, to share with another participant (the recipient), who has no power to influence the decision. Section A.2 of the Supplementary Material provides a detailed review of the factors that influence human behavior in this experiment, establishing a “ground truth” for our comparative analysis.
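As a minimal illustration of the payoff structure just described (not the authors' code), the sketch below computes both players' payoffs from the dictator's transfer alone. The function name `dictator_payoffs` and the example amounts are hypothetical.

```python
def dictator_payoffs(stake: float, transfer: float) -> tuple[float, float]:
    """Return (dictator_payoff, recipient_payoff) for a one-shot dictator game."""
    if not 0 <= transfer <= stake:
        raise ValueError("transfer must lie between 0 and the stake")
    # The recipient has no move: payoffs follow from the dictator's choice alone.
    return stake - transfer, transfer


# Hypothetical example: a dictator endowed with 100 shares 30.
kept, given = dictator_payoffs(stake=100, transfer=30)
share_given = given / 100  # fraction of the stake transferred
```

The share of the stake transferred is a convenient way to compare generosity across trials with different stake amounts.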

The dictator game, while elegant in its simplicity for refuting the notion of pure self-interest, is not without its limitations. It represents a somewhat artificial scenario with limited real-world parallels (Levitt & List, 2007). While some studies show a modest correlation between experiment behaviors and real-life prosocial actions (Wang & Navarro-Martinez, 2023), others find no such link (Galizzi & Navarro-Martinez, 2019). However, we want to clarify that the major purpose of this study is not to test the external validity of the dictator game but to use it as a controlled setting to compare the behaviors of LLM agents with human baselines.

Many studies have already begun to explore the behaviors of LLMs in dictator games or similar experiments. Early studies generally found that LLMs often behave like “typical humans,” mimicking human behavior in various classic economic games (Horton, 2023; Johnson & Obradovich, 2023). For example, Brookins & DeBacker (2023) observed that LLMs exhibit a tendency toward fairness in the dictator game, sometimes even more so than human participants (Mei et al., 2024). LLM agents also demonstrate reasoning abilities in strategic settings (Sreedhar & Chilton, 2024). However, their behavior is highly sensitive to the contents of prompts and varies significantly across different models of varying sizes (Chan et al., 2023; Fan et al., 2024). Further research indicates that while LLMs can replicate many psychological experiments, they often produce larger effect sizes than human studies and show lower replication rates for socially sensitive topics (Cui et al., 2025). Specifically in dictator games, even advanced models like GPT-4o fail to accurately predict human behavior, consistently underestimating self-interest and overestimating altruism, a phenomenon described as an “optimistic bias” (Capraro et al., 2025).

Building upon this fruitful scholarship, we ask: What causes the variations in LLM agents’ behavior in dictator games? We address this question by framing our research design around two primary psychological perspectives: Sense of Self (SoS) and Theory of Mind (ToM).

From the SoS perspective, we explore how different persona settings of LLM agents influence their decision-making processes. SoS refers to an individual’s perception and awareness of their own identity, including traits, beliefs, and social roles. This self-concept affects how individuals interpret situations and make decisions (Markus & Wurf, 1987). In the context of LLMs, we simulate this by assigning different personas to the agents, allowing us to examine whether and how these self-concepts affect their choices in the dictator game.

From the ToM perspective, we investigate whether LLM agents can model the behavior of humans with different backgrounds. ToM is the ability to attribute mental states, such as beliefs, intents, desires, and knowledge, to oneself and others, understanding that others have perspectives different from one’s own and enabling predictions about the behavior of others (Apperly, 2012; Premack & Woodruff, 1978). This cognitive ability is crucial for social interactions and empathy. By assessing LLMs’ capacity to anticipate human behavior based on contextual information, we evaluate their ability to emulate ToM in decision-making scenarios and extend existing studies (Strachan et al., 2024).

By comparing the performance of LLM agents in dictator games across these two psychological perspectives and with human baselines, we aim to understand the decision-making processes of LLM agents and identify the factors that influence their prosocial behaviors. This approach helps us unpack the internal mechanisms driving LLM behavior and contributes to the broader understanding of how AI can replicate both human behaviors and the internal psychological processes behind them.

Methods

Experiment design

We selected the 10 most popular open-source LLM models in varied sizes from four families (i.e., Llama3.1, Gemma2, Qwen2.5, and Phi3), along with GPT4o (Section B of the Supplementary Material), to participate in the experiment, as Figure 1 illustrates.² Each experimental trial follows the steps below:

1. Setting persona of LLM agent: Randomly select a combination of demographic variables, LLM temperature values, and personality traits to define the persona of an LLM agent. Listings 1 and 2 in Section C of the Supplementary Material are used to set the personas of LLM agents based on the SoS and ToM perspectives, respectively.
2. Framing experiment instruction: Construct the experiment instructions (“Experiment framing” section) by randomly selecting options for social distance and Give versus Take framing and by setting a random stake amount (elaborated in the following section). We prepared four game instructions by psychological perspectives (i.e., SoS and ToM) and the framing of games (i.e., Give and Take). The instructions are presented to the LLM agent using Listings 3–6 in Section C of the Supplementary Material.
3. Game-play and collecting LLM responses: Present the experiment instruction to the LLM agent and collect its responses. The collected responses consist of two parts: (1) structured data in JSON format, including variables such as the agent’s age, education level, and the amount of money transferred; and (2) textual data, which captures the agent’s reasoning behind its decisions (see Listings 7–9 in Section C of the Supplementary Material for three examples).
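Steps 2 and 3 above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the study's pipeline: the JSON field names (`age`, `education`, `transfer`), the stake range, and the validity rule are all hypothetical, and social distance is left out of the sketch.

```python
import json
import random
from itertools import product

# Two perspectives x two framings give the four instruction sets named in step 2.
INSTRUCTION_SETS = list(product(["SoS", "ToM"], ["Give", "Take"]))


def sample_framing(rng: random.Random, stake_range=(10, 100)) -> dict:
    """Draw one instruction set and a stake (stake_range is an assumed placeholder)."""
    perspective, framing = rng.choice(INSTRUCTION_SETS)
    stake = rng.randint(*stake_range)
    return {"perspective": perspective, "framing": framing, "stake": stake}


def parse_response(raw_json: str, reasoning: str, stake: float) -> dict:
    """Combine the structured and textual parts of one trial's response."""
    record = json.loads(raw_json)
    transfer = float(record["transfer"])  # hypothetical field name
    return {
        "age": record.get("age"),
        "education": record.get("education"),
        "transfer": transfer,
        "share": transfer / stake,
        # Assumed validity rule: a transfer cannot be negative or exceed the stake.
        "valid": 0 <= transfer <= stake,
        "reasoning": reasoning.strip(),
    }


# Hypothetical response, invented for illustration only.
framing = sample_framing(random.Random(0))
raw = '{"age": 35, "education": "high school", "transfer": 40}'
trial = parse_response(raw, " I want to be fair. ", stake=100)
```

Keeping the reasoning text alongside the structured fields makes it possible to analyze what an agent decided and how it justified the decision in the same record.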

Fig. 1. Experiment design: LLM agent in dictator game. Note: Numbers in circles indicate the order of steps. See Section A.2 of the Supplementary Material and “LLM personas” section for detailed descriptions of the variables and experimental settings.

Tables D1–D16 in the Supplementary Material present the descriptive statistics of key variables and experimental results of each LLM model. Except for models with a small number of logically correct trials (e.g., phi3_3.8b and qwen2.5_7b), the distributions of most variables across different models are well balanced. This ensures that the results are not biased because of the distribution of variables across models.

Factors influencing LLM generosity

Based on the review of human empirical studies on dictator games (Section A.2 of the Supplementary Material), we identified key predictors from three aspects: LLM personas, experiment framing, and psychological process.

LLM personas

Demographics. To generate demographic profiles for the LLM agents, we used options from two large-scale US public surveys: the General Social Survey (GSS) and the American Community Survey (ACS). The GSS, widely recognized in social science research, includes both attitudinal data (such as happiness and views on marriage and social issues) and background information (such as marital status, race, and education). It has been supporting a wide range of research topics, such as income inequality, educational attainment, immigration, and religious beliefs (Marsden et al., 2020). The ACS, conducted annually by the US Census Bureau, provides comprehensive data on economic, social, housing, and demographic characteristics of the US population and is an essential resource for policymakers (National Research Council, 2007).

Given their extensive use in academia and established reliability, we selected nine variables from these surveys to construct demographic pools for developing the personas of LLM agents. These variables include age (continuous: between 20 and 60), gender (binary: male or female), education (ordinal: less than high school, high school, and bachelor’s degree or higher), marital status (binary: currently married or unmarried), race (categorical: 15 racial groups), household income (ordinal: 10 categories), Hispanic status (binary: Hispanic or Latino versus not Hispanic or Latino), occupation (categorical: 5 occupations), and industry (categorical: 13 industries). In each trial, we randomly generated a demographic profile for an agent using these nine variables. This design enables us to explore how the demographic settings of LLM agents, in combination with other traits and experimental contexts, influence their decisions in dictator games.
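The random profile generation described above might look like the following sketch. Levels stated in the text are spelled out; where the text gives only a count (race, household income, occupation, industry), numbered placeholders stand in for the actual survey categories. Uniform sampling and integer ages are assumptions, and the names `DEMOGRAPHIC_POOLS` and `sample_profile` are hypothetical.

```python
import random

# Placeholder labels (e.g. "race_group_1") are NOT the real GSS/ACS categories.
DEMOGRAPHIC_POOLS = {
    "gender": ["male", "female"],
    "education": ["less than high school", "high school", "bachelor's degree or higher"],
    "marital_status": ["currently married", "unmarried"],
    "race": [f"race_group_{i}" for i in range(1, 16)],
    "household_income": [f"income_category_{i}" for i in range(1, 11)],
    "hispanic_status": ["Hispanic or Latino", "not Hispanic or Latino"],
    "occupation": [f"occupation_{i}" for i in range(1, 6)],
    "industry": [f"industry_{i}" for i in range(1, 14)],
}


def sample_profile(rng: random.Random) -> dict:
    """Draw one nine-variable demographic profile (uniform sampling is assumed)."""
    profile = {"age": rng.randint(20, 60)}  # integer ages are an assumption
    for variable, options in DEMOGRAPHIC_POOLS.items():
        profile[variable] = rng.choice(options)
    return profile


profile = sample_profile(random.Random(42))
```

Passing an explicit `random.Random` instance keeps each draw reproducible, which helps when a trial needs to be re-examined later.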

Temperature. This is a unique setting that defines the randomness of an LLM’s output. A lower temperature (close to 0) makes a model’s responses more deterministic and focused on the most likely outcomes. Conversely, a higher temperature increases the randomness, allowing for more diverse and creative outputs by giving less probable words a greater chance of being selected. Although the temperature setting is theoretically meaningful, empirical studies have found that its impact is minimal in various real-world tasks (Patel et al., 2024; Peeperkorn et al., 2024; Renze & Guven, 2024). In this study, we randomly assign this hyperparameter a value between 0 and 1.00 for each trial to examine how variations in temperature affect agents’ decisions in conjunction with their other traits.
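To make the effect of temperature concrete, the sketch below applies the standard temperature-scaled softmax to a few made-up token scores. It is a generic illustration, not the sampling code of any particular model; treating a temperature of 0 as a greedy (argmax) choice and drawing the trial temperature uniformly are both assumptions.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; temperature 0 is treated as greedy (argmax)."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.2)  # low temperature: mass concentrates
flat = softmax_with_temperature(logits, 1.0)   # higher temperature: mass spreads out

# One trial-level temperature in [0, 1]; the uniform draw is an assumed scheme.
trial_temperature = random.Random(7).uniform(0.0, 1.0)
```

At the lower temperature the most likely token takes a larger share of the probability mass than it does at temperature 1.0, which is the sense in which low temperatures make outputs more deterministic.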

MBTI personality types. Existing studies on prosocial behaviors commonly use the Big Five model to measure personality traits, while the Myers–Briggs Type Indicator (MBTI) is more popular in human resource studies. Correlation analyses have shown strong relationships between the two psychological scales, such as Big Five Extraversion correlating with MBTI Extraversion-Introversion, and Openness to Experience correlating with Sensing-Intuition (Furnham, Reference Furnham1996).

We adopt MBTI in this study for several reasons, particularly its practical advantages in computational studies (Celli & Lepri, Reference Celli, Lepri, Cabrio, Mazzei and Tamburini2018, 93). The Big Five model defines personality along five scales: Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. In contrast, the MBTI categorizes personality into four binary dimensions (Extraversion/Introversion, Sensing/Intuition, Thinking/Feeling, and Judging/Perceiving), resulting in 16 distinct personality types. Since MBTI types are represented as simple four-letter codes (e.g., INTJ), it is much easier to collect gold-standard labeled data (i.e., training datasets) for developing machine learning classifiers.

In this study, we randomly select one of the 16 MBTI types in each trial to define the personality of the LLM agent. This approach allows us to explore how different personality types, as defined by MBTI, influence the prosocial behaviors of LLM agents in conjunction with other personal traits and experimental settings.
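The random assignment of persona attributes described above can be sketched as follows. This is an illustration under stated assumptions rather than the study's actual code: the function and field names are hypothetical, age is drawn as an integer, temperature is rounded to two decimals, and race, household income, occupation, and industry are omitted because their category labels are not listed in the text.

```python
import itertools
import random

# The 16 MBTI codes are the Cartesian product of the four binary dimensions.
MBTI_TYPES = ["".join(letters) for letters in itertools.product("EI", "SN", "TF", "JP")]

def sample_persona(rng):
    """Draw one hypothetical persona; field names are illustrative only."""
    return {
        "age": rng.randint(20, 60),  # assumption: integer ages
        "gender": rng.choice(["male", "female"]),
        "education": rng.choice(
            ["less than high school", "high school", "bachelor's degree or higher"]
        ),
        "marital_status": rng.choice(["currently married", "unmarried"]),
        "hispanic": rng.choice(["Hispanic or Latino", "not Hispanic or Latino"]),
        "mbti": rng.choice(MBTI_TYPES),
        "temperature": round(rng.uniform(0.0, 1.0), 2),  # assumption: two decimals
    }

persona = sample_persona(random.Random(0))
```

Passing an explicit random generator keeps each draw reproducible, which is convenient when a trial needs to be rerun with the same persona.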

Experiment framing

Social distance. We construct this variable based on “the degree of reciprocity that subjects believe exists within a social interaction” (Hoffman et al., Reference Hoffman, McCabe and Smith1996, 654). Our study includes three levels of social distance, with the exact instruction texts shown below; these imply different levels of anonymity in line with dictator-game practice:

• Stranger: “You and the other person are strangers, and you two will not interact after this game.”—total anonymity.
• Stranger meet afterward: “You and the other person are strangers, but you two will meet each other after this game.”—partial anonymity.
• Friends: “You and the other person are friends.”—no anonymity.

This design jointly varies social distance and the presence of a minimal social cue (anticipated meeting) alongside the implied anonymity level, consistent with evidence that contextual cues and anonymity meaningfully shape transfers in dictator games (Franzen & Pointner, Reference Franzen and Pointner2012; Rigdon et al., Reference Rigdon, Ishii, Watabe and Kitayama2009).

Give versus Take. To examine the effects of “Give” vs. “Take” framing on the agents’ decisions, we designed the game instructions based on Cappelen et al. (Reference Cappelen, Nielsen, Sørensen, Tungodden and Tyran2013). In a “Give” game, agents are informed that both they and the recipients have the same initial amount of money. However, the agents also receive an additional amount (i.e., the stake), which the recipients do not. The dictator can transfer any amount, from 0 up to the total amount of their additional money, to the recipients. In a “Take” game, the instructions follow the same structure, but the difference is that agents can transfer a negative amount, meaning they can take money from the recipients.

Stake. To ensure comparability with most existing studies, we randomly generate an integer between 10 and 100 USD as the initial amount of money (i.e., the “initial endowment” commonly referred to in existing studies) and the additional amount of money (i.e., the “stake” commonly referred to in existing studies) as specified in the game instructions.
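The three framing variables can be combined into a single trial configuration, as in the hedged sketch below. The instruction texts are quoted from the social distance conditions above; the dictionary keys and function name are hypothetical. The text does not state whether the initial endowment and the stake are drawn separately, so drawing them independently here is an assumption.

```python
import random

# Instruction texts quoted from the three social distance conditions.
SOCIAL_DISTANCE_TEXT = {
    "stranger": "You and the other person are strangers, and you two will not interact after this game.",
    "stranger_meet": "You and the other person are strangers, but you two will meet each other after this game.",
    "friends": "You and the other person are friends.",
}

def sample_framing(rng):
    """Draw one hypothetical experimental framing for a trial."""
    return {
        "social_distance": rng.choice(sorted(SOCIAL_DISTANCE_TEXT)),
        "framing": rng.choice(["give", "take"]),
        "endowment": rng.randint(10, 100),  # initial amount in USD
        "stake": rng.randint(10, 100),      # additional amount; independent draw is an assumption
    }

trial = sample_framing(random.Random(1))
instruction = SOCIAL_DISTANCE_TEXT[trial["social_distance"]]
```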

Psychological processes

The LLM agents were instructed to explain their decisions, providing unstructured text responses that are useful for understanding their psychological processes.Footnote ³ To analyze these responses, we used the Linguistic Inquiry and Word Count (LIWC; Tausczik & Pennebaker, Reference Tausczik and Pennebaker2010), a widely recognized text analysis instrument in psychology. LIWC helps infer individuals’ psychological states based on language use by categorizing words into various psychological dimensions, such as cognitive, emotional, and social processes. It allowed us to explore the psychological states underlying the agents’ decisions in dictator games.

We specifically focused on LIWC categories relevant to compassion and empathy, which are fundamental in shaping prosocial behaviors (Yaden et al., Reference Yaden, Giorgi, Jordan, Buffone, Eichstaedt, Schwartz, Ungar and Bloom2024). The compassion-related categories include Positive Emotion (e.g., love, good, happy), Social Processes (e.g., you, your, love, they), Religion (e.g., God, hell, pray), Affiliation (e.g., our, friends, family), Certainty (e.g., all, never, always), Family (e.g., baby, dad, mom), Drives (e.g., up, get, good), and Affect (e.g., love, happy, great). The empathy-related categories include First-Person Singular (e.g., I, my, me), Focus on the Present (e.g., is, be, are), Personal Pronouns (e.g., I, you, me), Sadness (e.g., miss, lost, sorry), Discrepancy (e.g., should, would, could), Verbs (e.g., is, have, was), Adverbs (e.g., so, just, about), Cognitive Processes (e.g., cause, know, ought), Pronouns (e.g., I, them, her), and Affective Processes (e.g., happy, cried, abandon).
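The dictionary-based scoring logic behind this kind of analysis can be illustrated with a toy example. The sketch below is not LIWC itself, whose dictionaries are proprietary and far larger; the word lists contain only the example words quoted above, and the category keys and function name are hypothetical. It scores each category as the percentage of words in a response that appear in that category's list.

```python
import re

# Toy word lists built only from the example words quoted in the text.
TOY_CATEGORIES = {
    "positive_emotion": {"love", "good", "happy"},
    "affiliation": {"our", "friends", "family"},
    "discrepancy": {"should", "would", "could"},
}

def category_scores(text):
    """Return, per category, the percentage of words found in its word list."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {name: 0.0 for name in TOY_CATEGORIES}
    return {
        name: 100.0 * sum(w in vocab for w in words) / len(words)
        for name, vocab in TOY_CATEGORIES.items()
    }

# A made-up explanation, used only to exercise the function.
scores = category_scores("I should share because friends and family would be happy")
```

For this ten-word sentence, one word falls under positive emotion and two each under affiliation and discrepancy, so the scores are 10, 20, and 20 percent, respectively.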

Empirical analysis

To evaluate how different personas and experimental contexts influence the behavior of LLM agents, we conducted regression analyses for each model. The dependent variable was the proportion of the stake transferred by the agent. The independent variables included persona attributes (e.g., age, gender, education, and MBTI type),Footnote ⁴ experimental settings (social distance, Give versus Take framing, and stake amount), and psychological processes (LIWC category scores). We also included other demographic variables, such as race, occupation, and industry, as controls to account for potential confounding effects.

Furthermore, we compared the regression coefficients with the expected results from human studies (Section A.2 of the Supplementary Material) to evaluate the alignment between LLM agents and human participants. This comparison helps us understand the extent to which LLM agents’ decision-making processes and internal mechanisms align with those of humans.
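As an illustrative summary of this setup, the regression estimated separately for each model can be written in the following general form. The notation is ours and is a sketch of the description above, not the exact specification, which may group the predictors differently across analyses.

```latex
\text{GivingRate}_{i} = \beta_0
  + \boldsymbol{\beta}_{P}^{\top}\,\mathbf{Persona}_{i}
  + \boldsymbol{\beta}_{F}^{\top}\,\mathbf{Framing}_{i}
  + \boldsymbol{\beta}_{L}^{\top}\,\mathbf{LIWC}_{i}
  + \boldsymbol{\gamma}^{\top}\,\mathbf{Controls}_{i}
  + \varepsilon_{i}
```

Here $i$ indexes trials, $\text{GivingRate}_{i}$ is the proportion of the stake transferred, $\mathbf{Persona}_{i}$ collects the persona attributes, $\mathbf{Framing}_{i}$ the experimental settings, $\mathbf{LIWC}_{i}$ the LIWC category scores, and $\mathbf{Controls}_{i}$ the remaining demographic controls. Alignment with human studies is then assessed by comparing the sign and significance of each estimated coefficient with the direction expected from the human literature.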

Results

Model performance

Instruction following and math reasoning

Table 1 summarizes each model’s performance on instruction following and math reasoning. Instruction following is measured by the number of responses returned in valid JSON, because agents were instructed to output JSON. Math reasoning is measured by the number of logically correct trials, i.e., cases where payoffs are computed correctly under the stated framing. For example, in a “Take” game where both players initially receive $100 and the dictator also receives a $100 stake, a transfer of -$20 yields a recipient payoff of $80 (=100 – 20) and a dictator payoff of $220 (=100 + 100 + 20). We treat math reasoning as a manipulation check: it is not the study’s primary focus, but it is necessary to verify that models can perform the basic arithmetic reliably, and only logically correct trials are included in the subsequent analyses.
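The two checks can be sketched in a few lines. The payoff arithmetic follows the worked example above; the JSON field names (`transfer`, `dictator_payoff`, `recipient_payoff`) and the function names are hypothetical, since the response schema is not reproduced in this section.

```python
import json

def payoffs(endowment, stake, transfer):
    """Return (dictator_payoff, recipient_payoff); a negative transfer means taking."""
    dictator = endowment + stake - transfer
    recipient = endowment + transfer
    return dictator, recipient

def check_response(raw, endowment, stake):
    """Classify a raw response as 'invalid_format', 'incorrect', or 'correct'."""
    try:
        reply = json.loads(raw)
        reported = (reply["dictator_payoff"], reply["recipient_payoff"])
        expected = payoffs(endowment, stake, reply["transfer"])
    except (json.JSONDecodeError, KeyError, TypeError):
        # Not parseable as JSON, or missing/ill-typed fields (hypothetical schema).
        return "invalid_format"
    return "correct" if reported == expected else "incorrect"

# The "Take" example from the text: $100 endowment, $100 stake, transfer of -$20.
example = '{"transfer": -20, "dictator_payoff": 220, "recipient_payoff": 80}'
status = check_response(example, endowment=100, stake=100)
```

Under this sketch, only responses classified as correct would be carried forward to the subsequent analyses.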

Table 1.

Model performance: Instruction following and math reasoning

Note: “#Correct JSON Format” indicates the number of responses in correct JSON format, reflecting a model’s ability to follow instructions. “#Logically Correct Trials” and “%Logically Correct Trials” indicate the number and corresponding percentage of responses that are logically correct, reflecting a model’s math reasoning ability. Results of the Theory of Mind trials are in Table D17 in the Supplementary Material.

The results in Table 1 show that while all models exhibit a strong ability to follow instructions,Footnote ⁵ their math reasoning capabilities vary considerably. Surprisingly, Llama3.1-70B achieves the highest percentage of logically correct trials (96.36%) among all the models, surpassing even the industry frontier model, GPT4o-2024-08-06, and the significantly larger Llama3.1-405B in the Llama family. The Qwen2.5-7B model demonstrates the lowest performance in math reasoning, with only 5.37% of logically correct trials. In general, while model size plays an important role in performance, it is not the sole determining factor. Smaller models can sometimes outperform larger ones. There appears to be an optimal size that balances performance and computational efficiency (Hoffmann et al., Reference Hoffmann, Borgeaud, Mensch, Buchatskaya, Cai, Rutherford and de Las Casas2022).

Giving rate

Figure 2 shows the giving rates of each LLM model by family and size. The giving rate is calculated as the percentage of the amount transferred by the dictator to the recipient out of the total stake. As the figure presents, the decision space (i.e., the distribution of giving rates) for most of these models is bimodal, with choices concentrated at 0 (i.e., giving nothing) and 0.5 (giving half), showing the problem of “hyper-consistent responses” or “uniformity” (Bisbee et al., Reference Bisbee, Clinton, Dorff, Kenkel and Larson2024; Kozlowski & Evans, Reference Kozlowski and Evans2025, 1042). This pattern differs significantly from that observed in human behavior, where the distribution of giving rates is continuous and clustered around 0 (36.11%), 0.5 (16.74%), and 1 (i.e., giving all; 5.44%) (Engel, Reference Engel2011, 589). The 70B model of the Llama family exhibits the most continuous distribution of giving rates, although it still deviates from human behavior. Additionally, the decision space varies significantly even within the same model family, with no clear pattern from smaller to larger models.
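The giving rate and the concentration of choices at focal points can be computed as in the sketch below. The transfers are made-up values for illustration only, not data from the study, and the function names are hypothetical.

```python
from collections import Counter

def giving_rate(transfer, stake):
    """Share of the stake transferred; negative when the dictator takes."""
    return transfer / stake

def focal_shares(rates, focal_points=(0.0, 0.5, 1.0), tol=1e-9):
    """Fraction of observations sitting on each focal giving rate."""
    counts = Counter()
    for rate in rates:
        for point in focal_points:
            if abs(rate - point) <= tol:
                counts[point] += 1
    return {point: counts[point] / len(rates) for point in focal_points}

# Made-up transfers, all with a stake of 100, for illustration only.
rates = [giving_rate(t, 100) for t in (0, 0, 50, 50, 50, 20, 100, -20)]
shares = focal_shares(rates)
```

A distribution in which nearly all of the mass sits on 0 and 0.5 would correspond to the bimodal pattern described above, whereas human data spread a sizable share of observations across intermediate values.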

Fig. 2.

Giving rate by model family and size (SoS). Note: Vertical red dashed lines indicate giving rates at −0.5, 0, and 0.5, respectively; horizontal red dashed lines indicate 50% of total observations. The giving rate is calculated as the percentage of the amount transferred by the dictator to the recipient out of the total stake. Results of the Theory of Mind trials are in Figure D1 in the Supplementary Material.

Overall, LLM agents are unable to capture the continuous distribution of human behavior and lack variation in decision-making, which consequently increases the certainty of their decisions. Conversely, there is a lack of consistency within the same model family, increasing the uncertainty of predicting LLM behaviors. These paradoxical results present practical implications for LLM evaluation and alignment with human behavior and will be discussed later (“Determinism versus human-like uncertainty: A fundamental dilemma” section).

Predicting the behavior of LLM agents: Sense of self trials

Given that the SoS and ToM trials follow the same experimental and analytical structure, we present the results of the SoS trials in this section, with the ToM trial results provided in Section D.2 of the Supplementary Material. In the main text, we focus on comparing the outcomes of the two designs.

Personas

Demographics. Figure 3 displays the coefficients of the demographic variables and LLM temperature in predicting generosity. A few of these models exhibit behavior consistent with human studies. Among them, Llama3.1-70B and Llama3.1-405B are the most human-like, showing performance consistent with humans on Education, Household Income, and Female. The industry frontier model, GPT4o-2024-08-06, does not align with human behavior on any of these demographic variables. Whether this is surprising depends on how we view the debiasing efforts involved in developing the larger models. Debiasing in LLMs involves reducing stereotypes and biases from the training data by adjusting data sampling or applying fairness constraints (Meade et al., Reference Meade, Poole-Dayan and Reddy2022). These efforts aim to make models more neutral, though they can result in deviations from typical human patterns.

Fig. 3.

Predicting generosity: Demographics and LLM temperature (SoS). Note: The coefficients (showing 95% confidence intervals) are from a linear regression model using the proportion of stake transferred in the dictator game as the dependent variable. Deep colors represent larger models, and light colors represent smaller models within the same LLM family. The shaded areas indicate expected directions of impact based on human studies (Section A.2 of the Supplementary Material). Results of the Theory of Mind trials are in Figure D2 in the Supplementary Material.

Figure 3 also shows substantial variations and inconsistencies in the coefficients at different levels. First, the coefficients of the same demographic variable differ significantly across different model families. For example, for Household Income, models from Gemma2 and Llama families show positive impact, while Phi3 and Qwen2.5 models show the opposite. Second, the coefficients of the same demographic variable differ significantly even within the same LLM family. For instance, the coefficients for Female differ substantially within the Llama3.1 family: the 405B model shows a positive effect on the money transferred, the 70B model shows no significance, while the 8B model shows a positive effect again. Third, for agents driven by the same LLM, their behaviors are not deterministic and can vary significantly. For example, Phi3-14B exhibits large variations in the coefficients for all demographic variables.

LLM temperature. For the coefficients of Temperature, as shown in Figure 3, the differences across the models are mixed, with some models demonstrating opposite effects. The coefficients for Llama models indicate a significant positive relation between the value of Temperature and the amount of money transferred, whereas the coefficient of GPT4o is negative. These contrasting effects suggest that the influence of temperature settings on model behavior is variable and model-dependent. Although the actual effect may be limited due to the narrow range of possible Temperature values (i.e., between 0 and 1), the inconsistency across models raises concerns about the reliability and interpretability of LLM agents.

MBTI personality types. Figure 4 illustrates the relationships between MBTI personality types and the amount of money transferred in dictator games. The Gemma2-27B and Llama3.1-405B models exhibit the most human-like behaviors, aligning closely with human studies. Specifically, agents driven by the two models with MBTI types Extraversion (E), Intuition (N), Feeling (F), and Perceiving (P) tend to be more generous. In contrast, the other models show insignificant or inconsistent patterns that do not match human studies. For instance, the Llama3.1-70B model shows a positive relationship between Introversion (I) and the amount of money transferred, which contradicts human findings. The industry frontier model, GPT4o-2024-08-06, shows no significant effect for any MBTI type. These inconsistencies suggest that, from the perspective of personality type, the alignment of LLM agents with human behavior in dictator games varies significantly and is highly model-dependent.

Fig. 4.

Predicting generosity: Myers–Briggs Type Indicator (SoS). Note: The coefficients (showing 95% CI) are from a linear regression model using the proportion of stake transferred in the dictator game as the dependent variable. Deep colors represent larger models, and light colors represent smaller models within the same LLM family. The shaded areas indicate expected directions of impact based on human studies (Section A.2 of the Supplementary Material). Results of the Theory of Mind trials are in Figure D3 in the Supplementary Material.

Experiment framing. Figure 5 shows the relationships between the proportion of the stake transferred and various experimental framings. For Social Distance, most models behave as expected based on human studies: they tend to give more to known recipients (Friend) and recipients they will meet afterward (Stranger Meet) than to strangers (Stranger). The “Take” framing consistently reduces the proportion transferred across most models, closely aligning with human studies. However, the results of Stake are mixed, with some models showing a positive relationship and others showing the opposite. These mixed results even occur within the same model family, such as Llama3.1 and Qwen2.5.

Fig. 5.

Predicting generosity: Framing of experiment (SoS). Note: The coefficients (showing 95% CI) are from a linear regression model using the proportion of stake transferred in the dictator game as the dependent variable. Deep colors represent larger models, and light colors represent smaller models within the same LLM family. The shaded areas indicate expected directions of impact based on human studies (Section A.2 of the Supplementary Material). The “Stranger” framing is the reference group for “Friend” and “Stranger Meet.” The “Give” framing is the reference group for “Take.” Results of the Theory of Mind trials are in Figure D4 in the Supplementary Material.

Psychological processes. Figure 6 displays the coefficients of LIWC categories in predicting the proportion of money transferred. These categories were chosen to represent the psychological processes of compassion and empathy according to Yaden et al. (Reference Yaden, Giorgi, Jordan, Buffone, Eichstaedt, Schwartz, Ungar and Bloom2024). To align with human behavior, all coefficients should be positive. However, the results reveal that all LLM agents display inconsistent patterns. For example, the industry frontier model, GPT4o-2024-08-06, swings between positive and negative coefficients for different LIWC categories, reflecting inconsistencies in the representation of compassion and empathy. The same inconsistency is also observed with the largest and presumably most capable open-source model, Llama3.1-405B. These findings suggest that LLM agents may not fully capture the psychological processes underlying the prosocial behaviors of humans, with their alignment to human behavior being highly variable and model-dependent.

Fig. 6.

Predicting generosity: Psychological process (SoS). (a) LIWC Categories Effectively Predicting Compassion Controlling for Empathy. (b) LIWC Categories Effectively Predicting Empathy Controlling for Compassion. Note: The coefficients (showing 95% confidence intervals) are from a linear regression model using the proportion of stake transferred in the dictator game as the dependent variable. Deep colors represent larger models, and light colors represent smaller models within the same LLM family. The shaded areas indicate expected directions of impact based on human studies (Section A.2 of the Supplementary Material). LIWC categories are selected for analysis according to Yaden et al. (Reference Yaden, Giorgi, Jordan, Buffone, Eichstaedt, Schwartz, Ungar and Bloom2024). “She/He” and “Male” categories for Compassion are excluded due to limited number of observations. LIWC = Linguistic Inquiry and Word Count (Tausczik & Pennebaker, Reference Tausczik and Pennebaker2010). Results of the Theory of Mind trials are in Figure D5 in the Supplementary Material.

Summarizing sense of self and theory of mind results

Tables 2–4 summarize the alignment of LLM agents with human behavior in dictator games under the SoS perspective. The total number of ✓ marks in each column indicates the number of alignments with humans across all factors for a given model, reflecting the model’s overall ability to be human-like. The total number of ✓ marks in each row indicates the number of alignments with humans for a given factor across all models, showing the overall “industry consensus” across models on whether a factor aligns with findings from human studies.
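The tallying logic can be sketched as follows. The marking rule is an assumption made for illustration: a coefficient is treated as not significant when its 95% confidence interval contains zero, and as aligned when it is significant with the expected sign. The intervals and model names are placeholders, not results from this study.

```python
def alignment_mark(ci_low, ci_high, expected_sign):
    """Mark a coefficient from its 95% CI and the expected sign (+1 or -1)."""
    if ci_low <= 0 <= ci_high:
        return "n.s."
    observed_sign = 1 if ci_low > 0 else -1
    return "✓" if observed_sign == expected_sign else "✗"

def tally(marks):
    """Count ✓ marks per model (column) and per factor (row)."""
    per_model, per_factor = {}, {}
    for (factor, model), mark in marks.items():
        hit = int(mark == "✓")
        per_model[model] = per_model.get(model, 0) + hit
        per_factor[factor] = per_factor.get(factor, 0) + hit
    return per_model, per_factor

# Placeholder intervals for two hypothetical models; not results from the study.
marks = {
    ("Friend", "model_a"): alignment_mark(0.02, 0.10, +1),
    ("Friend", "model_b"): alignment_mark(-0.03, 0.04, +1),
    ("Take", "model_a"): alignment_mark(-0.20, -0.05, -1),
    ("Take", "model_b"): alignment_mark(0.01, 0.08, -1),
}
per_model, per_factor = tally(marks)
```

Column totals then indicate how human-like a given model is overall, and row totals indicate how widely a given factor is reproduced across models.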

Table 2.

LLM agents’ alignment with humans in dictator games (sense of self)

Note: ✓ = aligning with human studies; ✗ = not aligning with human studies; n.s. = not significant; pos. = positive; neg. = negative. “–” indicates the lack of consensus from human studies, showing directions of coefficients but not alignments for these variables. The expected directions of impact based on human studies are reviewed in Section A.2 of the Supplementary Material. Results of the Theory of Mind trials are in Table D18 in the Supplementary Material. Column 11 shows the results of DeepSeek-R1-70B, the most advanced open-source reasoning model, as a robustness test.

Table 3.

LLM agents’ alignment with humans in dictator games: Compassion (sense of self)

Table 4.

LLM agents’ alignment with humans in dictator games: Empathy (sense of self)

In terms of being human-like, the Llama3.1-405B model demonstrates the highest total number of consistent results across all factors, aligning with human studies in 10 out of 14 factors, though no globally best model emerges. Surprisingly (or perhaps not, depending on how we frame the debiasing process in LLM development), the industry standard GPT4o-2024-08-06 aligns with human studies in only two factors. For the alignment of psychological process, almost all models performed poorly. These results suggest that when LLM agents are instructed to adopt human personas, their behavior in the dictator game lacks clear patterns and exhibits significant inconsistencies. No consistent relationship emerges between their assigned personas and their decisions. Merely assigning a human-like identity to LLMs does not result in human-like behaviors.

Regarding which variable is an influencing factor, the models show the most consensus on Stranger Meet: 8 out of 10 models suggest that if the dictator meets the recipient after the game, they will behave more generously. For the alignment of psychological process, compassion-related processes represented by Positive Emotion and Affiliation (e.g., “our,” “friends,” “family”) have the strongest consensus. Respectively, 8 and 9 out of 10 models indicate that these processes should align with human studies.

Similarly, Tables D18–D20 in the Supplementary Material summarize the alignment of LLM agents with human behavior in dictator games under the ToM perspective; the results closely resemble those of the SoS trials. Two of the Llama3.1 models, Llama3.1-405B and Llama3.1-70B, exhibit the highest total number of consistent results across all factors, aligning with human studies in 10 out of 14 factors. The industry frontier model, GPT4o-2024-08-06, aligns with human studies in only four factors. In terms of psychological processes, the performance of LLM agents remains poor.Footnote ⁶ These results suggest that when LLM agents are tasked with predicting human behavior based on their knowledge of humans, the results (Section D.2 of the Supplementary Material) remain inconsistent and lack clear patterns. Despite being trained on extensive human-generated data, these AI agents cannot reason through human decision-making processes in dictator games.

These findings suggest that LLM agents’ reasoning does not consistently exhibit textual markers of human decision-making in dictator games and that their alignment with human behavior varies substantially across model architectures and prompt formulations. The inconsistencies observed under both the SoS and ToM perspectives highlight the limitations of LLMs in emulating human cognition and decision-making processes.

Discussion

Our study set out to examine whether LLMs can emulate or predict human behaviors in dictator games, a classic economic experiment designed to test the sense of fairness and altruism. By framing our research through the lenses of SoS and ToM to test how persona assignments influence LLM behavior and whether LLMs can predict human decision-making, respectively, we aimed to understand the underlying mechanisms driving LLM decision-making and assess their alignment with human behaviors. The empirical results are summarized below:

1. Inconsistent alignment with human behavior: LLM agents did not consistently replicate human decision-making patterns in the dictator game. Assigning human-like personas or prompting them to predict human behavior did not result in outcomes that align with established human behaviors.
2. Variability across models: Significant variations exist both across different LLM families and within the same model family but different sizes. Larger models did not necessarily produce more human-like behaviors, and sometimes smaller models outperformed their larger counterparts in aligning with humans.
3. Lack of continuous decision distribution: Unlike humans, whose giving rates in dictator games typically follow a continuous distribution, LLM agents exhibited bimodal distributions, with choices clustered at a few focal points (e.g., giving nothing or half). This suggests a lack of nuanced decision-making that characterizes human prosocial behavior.
4. Sensitivity to experimental framing: While human decisions in dictator games are influenced by factors like social distance and framing (“Give” vs. “Take”), LLM agents showed inconsistent responses to these manipulations. Their behaviors did not consistently align with human expectations based on these contextual factors.
5. Unpredictable impact of personas and psychological processes: The assigned demographic and personality traits did not reliably predict the agents’ decisions. Moreover, analyses of their textual explanations using LIWC did not reveal consistent psychological processes akin to human empathy or compassion.

Two central themes emerge from these findings, highlighting some fundamental limitations and challenges of developing and applying LLMs in social contexts. The first theme pertains to what LLMs are actually learning, and the second relates to how we should position LLMs within our society.

Inconsistency in LLM behavior: Lack of understanding and theories

The first theme highlights that current LLM agents do not consistently behave like humans in the specific context of dictator games. They appear to lack “causal models of the world that support explanation and understanding” and “ground learning in intuitive theories of physics and psychology to support and enrich the knowledge that is learned” (Lake et al., Reference Lake, Ullman, Tenenbaum and Gershman2017, 1). LLMs rely on recognizing language patterns rather than truly understanding social norms or engaging in human-like reasoning. Despite being trained on vast datasets of human-generated text, LLMs do not consistently replicate human decision-making in these social contexts. This inconsistency is further exacerbated by the models’ sensitivity to factors such as architecture, size, and prompt formulations, which challenges the assumption that simply increasing model size or complexity inherently improves reasoning abilities or leads to more human-like behaviors.

While both LLMs and humans are epistemically opaque, there is a crucial difference. Human behaviors, though complex, can often be interpreted and predicted based on psychological theories and social norms. In contrast, LLMs lack such underlying theories; their internal processes remain a black box, and they do not follow human theories. This absence of interpretability and adherence to human reasoning processes limits our ability to understand and predict LLM behaviors in socially complex scenarios.

Determinism versus human-like uncertainty: A fundamental dilemma

The second theme centers on the dichotomy between deterministic outputs and human-like uncertainty in LLM behavior. The bimodal distribution of giving rates among LLM agents suggests a form of deterministic decision-making that lacks the subtlety and variability characteristic of human choices. While deterministic behavior might result in more predictable outputs suitable for certain applications, it fails to capture the richness of human behavior, which often involves nuanced deliberation over various social and personal factors.

The absence of a continuous decision space indicates that LLMs may be defaulting to prevalent patterns in their training data or adhering to the most statistically probable responses. This tendency suggests that they are not genuinely understanding or processing the ethical dimensions of the choices presented to them but are instead relying on learned language patterns. This brings us to a fundamental question: Should LLMs be designed to mimic human-like uncertainty, embracing the complexities and unpredictabilities of human decision-making, or should they aim for determinism to ensure consistency and predictability?

This dilemma has significant implications for the development and deployment of LLMs. On one hand, embracing human-like uncertainty could enhance the authenticity of interactions with AI agents, making them more relatable and better suited for applications requiring empathy and nuanced social understanding. On the other hand, deterministic behavior ensures reliability and predictability, which are crucial for tasks where consistency is key.

LLMs being human-like, but “which human?”

A critical assumption in this study is that a “ground truth” for typical human behavior can be established by summarizing the consensus from existing scholarship on dictator games. While this approach provides a necessary baseline for comparison, it opens a crucial line of inquiry succinctly captured by Atari et al. (Reference Atari, Xue, Park, Blasi and Henrich2023): When we evaluate an LLM against “human” performance, which humans are we talking about? The notion of a “typical” human is highly contested, as much of the behavioral science literature is dominated by research on participants from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies (Henrich et al., Reference Henrich, Heine and Norenzayan2010). This narrow sampling raises questions about the generalizability of findings concerning core human behaviors like altruism and fairness. Atari et al. demonstrated that LLMs’ psychological profiles most closely resemble those of people from WEIRD cultures, and this resemblance decreases as one moves away from that demographic.

The “which human” question highlights a fundamental challenge of representativeness, which can be examined from three perspectives: the training data, the academic knowledge base, and the resulting language model behavior.

First, the training data used for LLMs is heavily skewed. These models are trained on vast quantities of text from the internet, a space where content from North America and Europe is disproportionately represented. Consequently, the social norms, biases, and behavioral patterns the LLM learns are not representative of global human diversity but rather reflect a digitally dominant, often WEIRD, slice of humanity. The model’s foundational understanding of “prosocial behavior” is therefore culturally biased from the outset.

Second, the academic knowledge that forms our human baseline suffers from the same bias. The empirical studies and literature reviews summarized in this paper to define expected human behavior (Section A.2 of the Supplementary Material) are themselves largely products of research conducted within WEIRD populations. Therefore, our experiment compares an LLM trained on WEIRD-centric data to a behavioral benchmark derived from WEIRD-centric science. This framework risks reinforcing a culturally specific model of behavior as a universal standard.

Finally, these issues converge in the language model’s behavior. The inconsistencies and deviations from the human baseline observed in our results may not simply be technical failures of the models. Instead, they could reflect a complex conflict between the LLM’s core WEIRD-centric training, the specific (US-based) personas it is asked to adopt, and the debiasing efforts intended to make it a more neutral agent. Such debiasing may strip the model of the ability to replicate any specific human demographic’s patterns faithfully, resulting in the sanitized and inconsistent outputs we observed, particularly with a highly tuned model like GPT4o-2024-08-06. Future work on prosocial AI must therefore move beyond simple alignment with a monolithic “human” standard and instead grapple with the challenge of building AI that can understand and navigate the rich diversity of human cultural and social norms.

Practical implications for developing and deploying LLMs

Behavioral approach to evaluating internal processes of LLMs

Our study underscores the challenges in aligning LLM behaviors with human values and social norms, highlighting the need for more sophisticated evaluation methods. Traditional approaches that focus on adjusting outputs based on human feedback are insufficient for tasks requiring social cognition and reasoning. As discussed earlier, adopting a behavioral approach, such as evaluating LLMs through experiments, allows us to systematically assess their decision-making processes in realistic social contexts. This method provides insights into how LLMs make decisions and whether their internal mechanisms align with human cognitive processes.
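One step of such a behavioral evaluation, comparing an estimated effect with the direction expected from human studies, can be sketched as follows. This is a minimal illustration, not the analysis code used in this study: the `Estimate` structure, the `classify` function, the three outcome labels, and all numeric values are hypothetical and serve only to show the logic of the comparison.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    """A regression coefficient on the giving rate (hypothetical structure)."""
    name: str       # persona or framing variable, e.g. "age"
    coef: float     # point estimate
    ci_low: float   # lower bound of the 95% confidence interval
    ci_high: float  # upper bound of the 95% confidence interval


def classify(est: Estimate, expected_sign: int) -> str:
    """Compare an estimate with the direction expected from human studies.

    expected_sign is +1 or -1. The labels are illustrative assumptions:
    "no effect" if the interval includes zero, otherwise "aligned" or
    "misaligned" depending on whether the signs agree.
    """
    if expected_sign not in (1, -1):
        raise ValueError("expected_sign must be +1 or -1")
    if est.ci_low <= 0 <= est.ci_high:
        return "no effect"
    observed_sign = 1 if est.coef > 0 else -1
    return "aligned" if observed_sign == expected_sign else "misaligned"


if __name__ == "__main__":
    # Made-up numbers for illustration only; they are not results from the paper.
    examples = [
        (Estimate("age", 0.04, 0.01, 0.07), +1),
        (Estimate("take_framing", 0.03, 0.01, 0.05), -1),
        (Estimate("friend_framing", 0.02, -0.01, 0.05), +1),
    ]
    for est, expected in examples:
        print(est.name, classify(est, expected))
```

Separating "no effect" from "misaligned" keeps a null result distinct from an effect that runs against the human benchmark.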

Assistants for tasks but not participants in social research

The use of LLMs in social science research is promising but also presents limitations. Our findings suggest that LLMs may not reliably replicate the nuanced processes of human decision-making in social experiments like the dictator game—they are not computational humans. Overreliance on them for modeling human behavior in complex social contexts could lead to misleading conclusions. This is particularly relevant for the nonprofit and philanthropic sectors, where AI might be used to model donor behavior or predict responses to fundraising campaigns. Inaccurate simulations could lead to flawed strategies and misallocation of resources. Therefore, researchers and practitioners should limit the roles of LLMs to specific tasks like text classification or topic modeling and approach the use of LLMs in modeling human behavior with caution. We must recognize that LLMs are tools to assist in research, not substitutes for human participants, at least for the time being.

As society increasingly relies on AI for critical decision-making tasks, integrating prosocial AI into NPS becomes both timely and imperative. This study highlights the limitations and opportunities associated with LLMs’ prosocial behaviors, underscoring the importance of interdisciplinary collaboration between computer science, traditional social sciences, and philanthropic studies. NPS scholars are uniquely positioned, through their extensive understanding of human behaviors, ethics, and societal norms, to guide the development and application of prosocial AI technologies, ensuring these systems align with the core values of human society and the practical needs of nonprofit and philanthropic sectors.

Supplementary material

The supplementary material for this article can be found at http://doi.org/10.1017/S0957876526000173.

Acknowledgments

Draft versions of this manuscript were presented in 2024 at the Fall LBJ Research Seminar, the Science of Philanthropy Initiative Conference, a Brown Bag Talk at the University of Chicago Department of Economics, and Lingnan University; and in 2025 at the Gradel Institute of Charity at Oxford, the Center for Philanthropic Studies at Vrije Universiteit Amsterdam, the Division of Computational Social Science at Chinese University of Hong Kong (Shenzhen), the IPE Thrust at Hong Kong University of Science and Technology (Guangzhou), the International Conference on Computational Social Science, and ARNOVA. I thank Becca North, Chenxin Zhang, Chi Ta, Christopher M. Clapp, Dominic Packer, Hanyu Xiao, Isabel Laterzo-Tingley, James Evans, John A. List, Katherine Rittenhouse, Kieran Gibson, Mark Ottoni-Wilhelm, Michael Guy Cuna, Peter Frumkin, René Bekkers, Richard Burkhauser, Richard S. Steinberg, Sara Konrath, Stephanie Koolen-Maas, Xiaolin Duan, conference and seminar attendees, and anonymous reviewers for their constructive comments.

Funding statement

The project is partly supported by (1) the Academic Development Funds from the RGK Center, (2) the 2023-24 PRI Award from the LBJ School, (3) USTC Summer Fellowships (Grant Nos. S19582024 and S19582025), and (4) the Gradel Institute of Charity, New College, University of Oxford, and by computing resources provided through (5) the Texas Advanced Computing Center at UT Austin (Keahey et al., 2020) and (6) Dell Technologies (Client Memory Team and AI Initiative PoC Lead Engineer Wente Xiong).

Footnotes

¹ “Epistemic opacity” can be formally defined as follows: “a process is epistemically opaque relative to a cognitive agent X at time t just in case X does not know at t all of the epistemically relevant elements of the process. A process is essentially epistemically opaque to X if and only if it is impossible, given the nature of X, for X to know all of the epistemically relevant elements of the process” (Humphreys, 2009, p. 618).

² Additionally, we included the most advanced open-source reasoning model available as of May 2025, DeepSeek-R1-70B, as a robustness check. This model employs a reasoning-chain and mixture-of-experts architecture in its training, enhancing efficiency and performance across multiple benchmarks (DeepSeek-AI et al., 2025; Dai et al., 2024). However, despite its advanced reasoning capabilities, it does not demonstrate more human-like behaviors, as evidenced by column 11 in Table 2.

³ The model’s stated reasoning should be interpreted with caution. Current LLMs are optimized to produce plausible and convincing text, which may lead to post-hoc rationalizations that do not faithfully reflect the model’s actual computational process (Turpin et al., 2023).

⁴ Most dictator game studies with human participants treat demographic variables primarily as controls, as the main focus is often on isolating the causal effects of experimental manipulations. In contrast, a central research question of our study is to investigate whether LLMs can effectively adopt and act upon specific “personas.” Therefore, understanding the influence of these persona attributes on the LLM’s behavior is a primary research interest, not secondary.

⁵ GPT-4o includes a setting that enforces output in JSON format, but we did not use this feature, in order to maintain comparability with the open-source models.

⁶ LIWC is probably not an appropriate method for estimating the reasoning process of these ToM trials. For example, these trials may use fewer first-person pronouns. Even when using these pronouns, their psychological meaning is different from that in the SoS trials.

References

Aher, G. V., Arriaga, R. I., & Kalai, A. T. (2023). Using large language models to simulate multiple humans and replicate human subject studies [Conference presentation]. Proceedings of the 40th international conference on machine learning (PMLR), July 3, 2023 (pp. 337–371).

Alves, M. A., Bassi, A., & Cordery, C. (2025). Future challenges facing third sector research. In Bassi, A., Alves, M. A., & Cordery, C. (Eds.), The future of third sector research: From theory to definitions, classifications and aggregation towards new research paths (pp. 255–266). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-67896-7_22

Amarasinghe, K., Rodolfa, K. T., Lamba, H., & Ghani, R. (2023). Explainable machine learning for public policy: Use cases, gaps, and research directions. Data & Policy, 5, e5. https://doi.org/10.1017/dap.2023.2

Apperly, I. A. (2012). What is “theory of mind”? Concepts, cognitive processes and individual differences. Quarterly Journal of Experimental Psychology, 65(5), 825–839. https://doi.org/10.1080/17470218.2012.676055

Atari, M., Xue, M., Park, P., Blasi, D., & Henrich, J. (2023). Which humans? Pre-published, September 22, 2023. https://doi.org/10.31234/osf.io/5b26t

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., et al. (2022). Constitutional AI: Harmlessness from AI feedback. Pre-published, December. arXiv: 2212.08073 [cs]. https://doi.org/10.48550/arXiv.2212.08073

Bail, C. A. (2024). Can generative AI improve social science? Proceedings of the National Academy of Sciences, 121(21), e2314021121. https://doi.org/10.1073/pnas.2314021121

Bekkers, R., & Wiepking, P. (2011). A literature review of empirical studies of philanthropy: Eight mechanisms that drive charitable giving. Nonprofit and Voluntary Sector Quarterly, 40(5), 924–973. https://doi.org/10.1177/0899764010380927

Bisbee, J., Clinton, J. D., Dorff, C., Kenkel, B., & Larson, J. M. (2024). Synthetic replacements for human survey data? The perils of large language models. Political Analysis, 1–16. https://doi.org/10.1017/pan.2024.5

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., et al. (2022). On the opportunities and risks of foundation models. Pre-published, July 12, 2022. arXiv: 2108.07258 [cs]. https://doi.org/10.48550/arXiv.2108.07258

Brand, J. E., Zhou, X., & Xie, Y. (2023). Recent developments in causal inference and machine learning. Annual Review of Sociology, 49(1), 81–110. https://doi.org/10.1146/annurev-soc-030420-015345

Brookins, P., & DeBacker, J. M. (2023). Playing games with GPT: What can we learn about a large language model from canonical strategic games? Pre-published, June 28, 2023. SSRN Scholarly Paper. Social Science Research Network: 4493398. https://doi.org/10.2139/ssrn.4493398

Cappelen, A. W., Nielsen, U. H., Sørensen, E. Ø., Tungodden, B., & Tyran, J.-R. (2013). Give and take in dictator games. Economics Letters, 118(2), 280–283. https://doi.org/10.1016/j.econlet.2012.10.030

Capraro, V., Di Paolo, R., & Pizziol, V. (2025). A publicly available benchmark for assessing large language models’ ability to predict how humans balance self-interest and the interest of others. Scientific Reports, 15(1), 21428. https://doi.org/10.1038/s41598-025-01715-7

Celli, F., & Lepri, B. (2018). Is big five better than MBTI?: A personality computing challenge using Twitter data. In Cabrio, E., Mazzei, A., & Tamburini, F. (Eds.), Proceedings of the fifth Italian conference on computational linguistics CLiC-it 2018 (pp. 93–98). Accademia University Press. https://doi.org/10.4000/books.aaccademia.3147

Chan, A., Riché, M., & Clifton, J. (2023). Towards the scalable evaluation of cooperativeness in language models. Pre-published, March 16, 2023. Accessed August 23, 2024. arXiv: 2303.13360 [cs]. http://arxiv.org/abs/2303.13360

Chen, H., & Zhang, R. (2023). Identifying nonprofits by scaling mission and activity with word embedding. Voluntas: International Journal of Voluntary and Nonprofit Organizations, 34(1), 39–51. https://doi.org/10.1007/s11266-021-00399-7

Coz, P. L., Liu, J. A., Bhattacharjya, D., Curto, G., & Stinckwich, S. (2025). What would an LLM do? Evaluating policymaking capabilities of large language models. Pre-published, September 4, 2025. arXiv: 2509.03827 [cs]. https://doi.org/10.48550/arXiv.2509.03827

Cui, Z., Li, N., & Zhou, H. (2025). Can large language models replace human subjects? A large-scale replication of scenario-based experiments in psychology and management. Pre-published, June 20, 2025. arXiv: 2409.00128 [cs]. https://doi.org/10.48550/arXiv.2409.00128

Dai, D., Deng, C., Zhao, C., Xu, R. X., Gao, H., Chen, D., Li, J., et al. (2024). DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. Pre-published, January 11, 2024. arXiv: 2401.06066 [cs]. https://doi.org/10.48550/arXiv.2401.06066

DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., et al. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Pre-published, January 22, 2025. arXiv: 2501.12948 [cs]. https://doi.org/10.48550/arXiv.2501.12948

Dillion, D., Tandon, N., Gu, Y., & Gray, K. (2023). Can AI language models replace human participants? Trends in Cognitive Sciences, 27(7), 597–600. https://doi.org/10.1016/j.tics.2023.04.008

Edelmann, A., Wolff, T., Montagne, D., & Bail, C. A. (2020). Computational social science and sociology. Annual Review of Sociology, 46(1), 61–81. https://doi.org/10.1146/annurev-soc-121919-054621

Engel, C. (2011). Dictator games: A meta study. Experimental Economics, 14(4), 583–610. https://doi.org/10.1007/s10683-011-9283-7

Fan, C., Chen, J., Jin, Y., & He, H. (2024). Can large language models serve as rational players in game theory? A systematic analysis. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 17960–17967. https://doi.org/10.1609/aaai.v38i16.29751

Franzen, A., & Pointner, S. (2012). Anonymity in the dictator game revisited. Journal of Economic Behavior & Organization, 81(1), 74–81. https://doi.org/10.1016/j.jebo.2011.09.005

Furnham, A. (1996). The big five versus the big four: The relationship between the Myers-Briggs type indicator (MBTI) and NEO-PI five factor model of personality. Personality and Individual Differences, 21(2), 303–307. https://doi.org/10.1016/0191-8869(96)00033-5

Gabriel, I. (2020). Artificial intelligence, values, and alignment. Minds and Machines, 30(3), 411–437. https://doi.org/10.1007/s11023-020-09539-2

Galizzi, M. M., & Navarro-Martinez, D. (2019). On the external validity of social preference games: A systematic lab-field study. Management Science, 65(3), 976–1002. https://doi.org/10.1287/mnsc.2017.2908

Goldkind, L., Ming, J., & Fink, A. (2025). AI in the nonprofit human services: Distinguishing between hype, harm, and hope. Human Service Organizations: Management, Leadership & Governance, 49(3), 225–236. https://doi.org/10.1080/23303131.2024.2427459

Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3), 267–297. https://doi.org/10.1093/pan/mps028

Guo, T., Chen, X., Wang, Y., Chang, R., Pei, S., Chawla, N. V., Wiest, O., & Zhang, X. (2024). Large language model based multi-agents: A survey of progress and challenges. Pre-published, January 21, 2024. arXiv: 2402.01680 [cs]. https://doi.org/10.48550/arXiv.2402.01680

Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world? The Behavioral and Brain Sciences, 33(2–3), 61–83, discussion 83–135. https://doi.org/10.1017/S0140525X0999152X

Hoffman, E., McCabe, K., & Smith, V. L. (1996). Social distance and other-regarding behavior in dictator games. The American Economic Review, 86(3), 653–660.

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., et al. (2022). Training compute-optimal large language models. Pre-published, March 29, 2022. arXiv: 2203.15556 [cs]. https://doi.org/10.48550/arXiv.2203.15556

Hofman, J. M., Watts, D. J., Athey, S., Garip, F., Griffiths, T. L., Kleinberg, J., Margetts, H., et al. (2021). Integrating explanation and prediction in computational social science. Nature, 595(7866), 181–188. https://doi.org/10.1038/s41586-021-03659-0

Horton, J. J. (2023). Large language models as simulated economic agents: What can we learn from homo silicus? Pre-published, April. Working Paper 31122. National Bureau of Economic Research. https://doi.org/10.3386/w31122

Humphreys, P. (2009). The philosophical novelty of computer simulation methods. Synthese, 169(3), 615–626. https://doi.org/10.1007/s11229-008-9435-2

Johnson, T., & Obradovich, N. (2023). Evidence of behavior consistent with self-interest and altruism in an artificially intelligent agent. Pre-published, January 5, 2023. arXiv: 2301.02330. https://doi.org/10.48550/arXiv.2301.02330

Keahey, K., Anderson, J., Zhen, Z., Riteau, P., Ruth, P., Stanzione, D., Cevik, M., et al. (2020). Lessons learned from the Chameleon testbed [Conference presentation]. 2020 USENIX annual technical conference (USENIX ATC 20) (pp. 219–233).

Kirk, H. R., Vidgen, B., Röttger, P., & Hale, S. A. (2024). The benefits, risks and bounds of personalizing the alignment of large language models to individuals. Nature Machine Intelligence, 6(4), 383–392. https://doi.org/10.1038/s42256-024-00820-y

Kozlowski, A. C., & Evans, J. (2025). Simulating subjects: The promise and peril of artificial intelligence stand-ins for social agents and interactions. Sociological Methods & Research, 54(3), 1017–1073. https://doi.org/10.1177/00491241251337316

Lai, S., Potter, Y., Kim, J., Zhuang, R., Song, D., & Evans, J. (2024). Evolving AI collectives to enhance human diversity and enable self-regulation. Pre-published, June 18, 2024. arXiv: 2402.12590. https://doi.org/10.48550/arXiv.2402.12590

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40, e253. https://doi.org/10.1017/S0140525X16001837

Lazer, D. M. J., Pentland, A., Watts, D. J., Aral, S., Athey, S., Contractor, N., Freelon, D., et al. (2020). Computational social science: Obstacles and opportunities. Science, 369(6507), 1060–1062. https://doi.org/10.1126/science.aaz8170

Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A.-L., Brewer, D., Christakis, N., et al. (2009). Computational social science. Science, 323(5915), 721–723. https://doi.org/10.1126/science.1167742

LePere-Schloop, M., & Nesbit, R. (2023). Disciplinary contributions to nonprofit studies: A 20-year empirical mapping of journals publishing nonprofit research and journal citations by nonprofit scholars. Nonprofit and Voluntary Sector Quarterly, 52(1_suppl), 68S–101S. https://doi.org/10.1177/08997640221119728

Levitt, S. D., & List, J. A. (2007). What do laboratory experiments measuring social preferences reveal about the real world? Journal of Economic Perspectives, 21(2), 153–174. https://doi.org/10.1257/jep.21.2.153

Ma, J. (2021). Automated coding using machine learning and remapping the U.S. nonprofit sector: A guide and benchmark. Nonprofit and Voluntary Sector Quarterly, 50(3), 662–687. https://doi.org/10.1177/0899764020968153

Ma, J., Ebeid, I. A., de Wit, A., Xu, M., Yang, Y., Bekkers, R., & Wiepking, P. (2023). Computational social science for nonprofit studies: Developing a toolbox and knowledge base for the field. Voluntas: International Journal of Voluntary and Nonprofit Organizations, 34(1), 52–63. https://doi.org/10.1007/s11266-021-00414-x

Ma, J., & Konrath, S. (2018). A century of nonprofit studies: Scaling the knowledge of the field. Voluntas: International Journal of Voluntary and Nonprofit Organizations, 29(6), 1139–1158. https://doi.org/10.1007/s11266-018-00057-5

Magee, L., Arora, V., & Munn, L. (2023). Structured like a language model: Analysing AI as an automated subject. Big Data & Society, 10(2), 20539517231210273. https://doi.org/10.1177/20539517231210273

Markus, H., & Wurf, E. (1987). The dynamic self-concept: A social psychological perspective. Annual Review of Psychology, 38, 299–337. https://doi.org/10.1146/annurev.ps.38.020187.001503

Marsden, P. V., Smith, T. W., & Hout, M. (2020). Tracking US social change over a half-century: The general social survey at fifty. Annual Review of Sociology, 46, 109–134. https://doi.org/10.1146/annurev-soc-121919-054838

Meade, N., Poole-Dayan, E., & Reddy, S. (2022). An empirical survey of the effectiveness of debiasing techniques for pre-trained language models. Pre-published, April 2, 2022. arXiv: 2110.08527 [cs]. https://doi.org/10.48550/arXiv.2110.08527

Mei, Q., Xie, Y., Yuan, W., & Jackson, M. O. (2024). A Turing test of whether AI chatbots are behaviorally similar to humans. Proceedings of the National Academy of Sciences, 121(9), e2313925121. https://doi.org/10.1073/pnas.2313925121

Meier, D. S. (2025). Compassion for all: Real-world online donations contradict compassion fade. Nonprofit and Voluntary Sector Quarterly, 54(2), 267–301. https://doi.org/10.1177/08997640241255707

Meier, D. S., & von Schnurbein, G. (2024). From mission to market: Assessing sector overlap between nonprofits and for-profits. Nonprofit and Voluntary Sector Quarterly, 08997640241300509. https://doi.org/10.1177/08997640241300509

National Research Council, Citro, C. F., & Kalton, G. (Eds.). (2007). Using the American community survey: Benefits and challenges. The National Academies Press. https://doi.org/10.17226/11901

Nisioti, E., Risi, S., Momennejad, I., Oudeyer, P.-Y., & Moulin-Frier, C. (2024). Collective innovation in groups of large language models. Pre-published, July 7, 2024. arXiv: 2407.05377 [cs]. https://doi.org/10.48550/arXiv.2407.05377

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., et al. (2022). Training language models to follow instructions with human feedback. Pre-published, March. arXiv: 2203.02155 [cs]. https://doi.org/10.48550/arXiv.2203.02155

Park, J. S., O’Brien, J., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 1–22. UIST ’23. Association for Computing Machinery. https://doi.org/10.1145/3586183.3606763

Patel, D., Timsina, P., Raut, G., Freeman, R., Levin, M. A., Nadkarni, G. N., Glicksberg, B. S., & Klang, E. (2024). Exploring temperature effects on large language models across various clinical tasks. Pre-published, July 22, 2024. https://doi.org/10.1101/2024.07.22.24310824

Peeperkorn, M., Kouwenhoven, T., Brown, D., & Jordanous, A. (2024). Is temperature the creativity parameter of large language models? Pre-published, May 1, 2024. arXiv: 2405.00492 [cs]. https://doi.org/10.48550/arXiv.2405.00492

Perez, J., Léger, C., Kovač, G., Colas, C., Molinaro, G., Derex, M., Oudeyer, P.-Y., & Moulin-Frier, C. (2024). When LLMs play the telephone game: Cumulative changes and attractors in iterated cultural transmissions. Pre-published, July 5, 2024. Accessed July 12, 2024. arXiv: 2407.04503 [physics]. http://arxiv.org/abs/2407.04503

Perron, B. E., Goldkind, L., Qi, Z., & Victor, B. G. (2025). Human services organizations and the responsible integration of AI: Considering ethics and contextualizing risk(s). Journal of Technology in Human Services, 43(1), 20–33. https://doi.org/10.1080/15228835.2025.2457045

Plaisance, G. (2025). Artificial intelligence (AI) in the context of nonprofits and philanthropy: Suspicion and hope for researchers and organizations. Journal of Philanthropy, 30(2), e70022. https://doi.org/10.1002/nvsm.70022

Premack, D., & Woodruff, G. (1978). Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1(4), 515–526. https://doi.org/10.1017/S0140525X00076512

Qian, C., Cong, X., Liu, W., Yang, C., Chen, W., Su, Y., Dang, Y., et al. (2023). Communicative agents for software development. Pre-published, December 19, 2023. arXiv: 2307.07924 [cs]. https://doi.org/10.48550/arXiv.2307.07924

Renze, M., & Guven, E. (2024). The effect of sampling temperature on problem solving in large language models. Pre-published, June 14, 2024. arXiv: 2402.05201 [cs]. https://doi.org/10.48550/arXiv.2402.05201

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?”: Explaining the predictions of any classifier. Pre-published, August 9, 2016. Accessed February 7, 2024. arXiv: 1602.04938 [cs, stat]. http://arxiv.org/abs/1602.04938

Rigdon, M., Ishii, K., Watabe, M., & Kitayama, S. (2009). Minimal social cues in the dictator game. Journal of Economic Psychology, 30(3), 358–367. https://doi.org/10.1016/j.joep.2009.02.002

Rutherford, A. C., LePere-Schloop, M., & Perai, N. A. A. (2025). Part II: Turning a critical lens on nonprofit classification: Opportunities and challenges in the digital age. In Bassi, A., Alves, M. A., & Cordery, C. (Eds.), The future of third sector research: From theory to definitions, classifications and aggregation towards new research paths (pp. 113–124). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-67896-7_10

Santamarina, F. J. (2025). Technologies for data aggregation: An overview of technologies and opportunities to propel third sector research. In Bassi, A., Alves, M. A., & Cordery, C. (Eds.), The future of third sector research: From theory to definitions, classifications and aggregation towards new research paths (pp. 163–178). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-67896-7_14

Shier, M. L., & Handy, F. (2014). Research trends in nonprofit graduate studies: A growing interdisciplinary field. Nonprofit and Voluntary Sector Quarterly, 43(5), 812–831. https://doi.org/10.1177/0899764014548279

Sreedhar, K., & Chilton, L. (2024). Simulating human strategic behavior: Comparing single and multi-agent LLMs. Pre-published, July 1, 2024. Accessed August 23, 2024. arXiv: 2402.08189 [cs]. http://arxiv.org/abs/2402.08189

Strachan, J. W. A., Albergo, D., Borghini, G., Pansardi, O., Scaliti, E., Gupta, S., Saxena, K., et al. (2024). Testing theory of mind in large language models and humans. Nature Human Behaviour, 8(7), 1285–1295. https://doi.org/10.1038/s41562-024-01882-z

Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1), 24–54. https://doi.org/10.1177/0261927X09351676

Turing, A. M. (1950). Computing machinery and intelligence. Mind, LIX(236), 433–460. https://doi.org/10.1093/mind/LIX.236.433

Turpin, M., Michael, J., Perez, E., & Bowman, S. (2023). Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36, 74952–74965.

Wang, X., & Navarro-Martinez, D. (2023). Increasing the external validity of social preference games by reducing measurement error. Games and Economic Behavior, 141, 261–285. https://doi.org/10.1016/j.geb.2023.06.006

Wang, Y., Zhong, W., Li, L., Mi, F., Zeng, X., Huang, W., Shang, L., Jiang, X., & Liu, Q. (2023). Aligning large language models with human: A survey. Pre-published, July 24, 2023. arXiv: 2307.12966 [cs]. https://doi.org/10.48550/arXiv.2307.12966

Wasif, R. (2020). Does the media’s anti-Western bias affect its portrayal of NGOs in the Muslim world? Assessing newspapers in Pakistan. Voluntas: International Journal of Voluntary and Nonprofit Organizations, 31(6), 1343–1358. https://doi.org/10.1007/s11266-020-00242-5

Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., et al. (2021). Ethical and social risks of harm from language models. Pre-published, December 8, 2021. arXiv: 2112.04359 [cs]. https://doi.org/10.48550/arXiv.2112.04359

Xie, C., Chen, C., Jia, F., Ye, Z., Shu, K., Bibi, A., Hu, Z., et al. (2024). Can large language model agents simulate human trust behaviors? Empirical methods in natural language processing. Miami, FL: arXiv, March 10, 2024. https://doi.org/10.48550/arXiv.2402.04559

Yaden, D. B., Giorgi, S., Jordan, M., Buffone, A., Eichstaedt, J. C., Schwartz, H. A., Ungar, L., & Bloom, P. (2024). Characterizing empathy and compassion using computational linguistic analysis. Emotion (US), 24(1), 106–115. https://doi.org/10.1037/emo0001205

Fig. 1. Experiment design: LLM agent in dictator game. Note: Numbers in circles indicate the order of steps. See Section A.2 of the Supplementary Material and “LLM personas” section for detailed descriptions of the variables and experimental settings.

Table 1. Model performance: Instruction following and math reasoning

Fig. 2. Giving rate by model family and size (SoS). Note: Vertical red dashed lines indicate giving rates at −0.5, 0, and 0.5, respectively; horizontal red dashed lines indicate 50% of total observations. The giving rate is calculated as the percentage of the amount transferred by the dictator to the recipient out of the total stake. Results of the Theory of Mind trials are in Figure D1 in the Supplementary Material.
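The giving rate described in the note to Fig. 2 can be sketched as follows. This is a minimal illustration, assuming that the rate is expressed as a proportion of the stake (so that the reference lines at −0.5, 0, and 0.5 apply) and that an amount taken from the recipient is recorded as a negative transfer; the function name `giving_rate` and the sign convention are assumptions, not the code used in this study.

```python
def giving_rate(transferred: float, stake: float) -> float:
    """Return the amount transferred to the recipient as a proportion of the stake.

    Assumption: a negative `transferred` value means the dictator took money
    from the recipient, which yields a negative giving rate.
    """
    if stake <= 0:
        raise ValueError("stake must be positive")
    return transferred / stake


if __name__ == "__main__":
    # Illustrative values only: an even split, no transfer, and a "take" decision.
    for amount in (5.0, 0.0, -5.0):
        print(amount, giving_rate(amount, stake=10.0))
```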

Fig. 3. Predicting generosity: Demographics and LLM temperature (SoS). Note: The coefficients (showing 95% confidence intervals) are from a linear regression model using the proportion of stake transferred in the dictator game as the dependent variable. Deep colors represent larger models, and light colors represent smaller models within the same LLM family. The shaded areas indicate expected directions of impact based on human studies (Section A.2 of the Supplementary Material). Results of the Theory of Mind trials are in Figure D2 in the Supplementary Material.

Fig. 4. Predicting generosity: Myers–Briggs Type Indicator (SoS). Note: The coefficients (showing 95% CI) are from a linear regression model using the proportion of stake transferred in the dictator game as the dependent variable. Deep colors represent larger models, and light colors represent smaller models within the same LLM family. The shaded areas indicate expected directions of impact based on human studies (Section A.2 of the Supplementary Material). Results of the Theory of Mind trials are in Figure D3 in the Supplementary Material.

Fig. 5. Predicting generosity: Framing of experiment (SoS). Note: The coefficients (showing 95% CI) are from a linear regression model using the proportion of stake transferred in the dictator game as the dependent variable. Deep colors represent larger models, and light colors represent smaller models within the same LLM family. The shaded areas indicate expected directions of impact based on human studies (Section A.2 of the Supplementary Material). The “Stranger” framing is the reference group for “Friend” and “Stranger Meet.” The “Give” framing is the reference group for “Take.” Results of the Theory of Mind trials are in Figure D4 in the Supplementary Material.

Fig. 6. Predicting generosity: Psychological process (SoS). (a) LIWC Categories Effectively Predicting Compassion Controlling for Empathy. (b) LIWC Categories Effectively Predicting Empathy Controlling for Compassion. Note: The coefficients (showing 95% confidence intervals) are from a linear regression model using the proportion of stake transferred in the dictator game as the dependent variable. Deep colors represent larger models, and light colors represent smaller models within the same LLM family. The shaded areas indicate expected directions of impact based on human studies (Section A.2 of the Supplementary Material). LIWC categories are selected for analysis according to Yaden et al. (2024). “She/He” and “Male” categories for Compassion are excluded due to the limited number of observations. LIWC = Linguistic Inquiry and Word Count (Tausczik & Pennebaker, 2010). Results of the Theory of Mind trials are in Figure D5 in the Supplementary Material.

Table 2. LLM agents’ alignment with humans in dictator games (sense of self)

Table 3. LLM agents’ alignment with humans in dictator games: Compassion (sense of self)

Table 4. LLM agents’ alignment with humans in dictator games: Empathy (sense of self)

Ma supplementary material

DOI: https://doi.org/10.1017/S0957876526000173.sm001


