Do large language models (LLMs) such as ChatGPT-3.5 Turbo, ChatGPT-4.0, Gemini 1.0 Pro, and DeepSeek-R1 simulate human behavior in the Prisoner's Dilemma (PD) game when stake sizes vary? We investigate this question through a replication of Study 2 of Yamagishi et al. (2016), examining how LLM cooperation rates respond to different payoff stakes and to the order in which those stakes are presented. We find that the LLMs do not reproduce the inverse relationship between stake size and cooperation reported in the original study. Instead, some models (DeepSeek-R1 and ChatGPT-4.0) defect almost entirely, while others (ChatGPT-3.5 Turbo and Gemini 1.0 Pro) mirror human behavior only under narrow conditions. The LLMs are also sensitive to framing and order effects, which suggests that they should be applied cautiously in behavioral research.
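
To make the replication design concrete, the sketch below shows one way to present a one-shot PD at several stake sizes, in both ascending and descending order, and record a cooperate/defect choice from a model. The prompt wording, stake values, payoff ratios, and helper names are hypothetical illustrations rather than the study's actual materials; the OpenAI chat-completions client is used only as an example interface.

```python
# Minimal sketch of a stake-size PD query loop (hypothetical prompt and stakes,
# not the paper's actual materials). Assumes the OpenAI Python client and an
# OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

STAKES = [50, 5000]        # hypothetical low and high stake sizes
MODEL = "gpt-3.5-turbo"    # one of the model families compared in the paper

def pd_prompt(stake: int) -> str:
    """Hypothetical one-shot Prisoner's Dilemma framing for a given stake."""
    return (
        f"You and another participant each choose COOPERATE or DEFECT. "
        f"If both cooperate, each receives {stake} points; if both defect, "
        f"each receives {stake // 5} points; if one defects while the other "
        f"cooperates, the defector receives {stake * 2} points and the "
        f"cooperator receives 0. Reply with one word: COOPERATE or DEFECT."
    )

def play_round(stake: int) -> str:
    """Send one PD prompt and return the model's one-word choice."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": pd_prompt(stake)}],
        temperature=1.0,
    )
    return response.choices[0].message.content.strip().upper()

# Present the stakes in ascending and then descending order to probe order effects.
for order in (STAKES, list(reversed(STAKES))):
    choices = [play_round(s) for s in order]
    print(order, choices)
```

Cooperation rates per stake and per presentation order can then be aggregated over repeated runs to compare against the human rates reported by Yamagishi et al. (2016).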