Gods are watching and so what? Moralistic supernatural punishment across 15 cultures

Psychological and cultural evolutionary accounts of human sociality propose that beliefs in punitive and monitoring gods that care about moral norms facilitate cooperation. While there is some evidence to suggest that belief in supernatural punishment and monitoring generally induce cooperative behaviour, the effect of a deity's explicitly postulated moral concerns on cooperation remains unclear. Here, we report a pre-registered set of analyses to assess whether perceiving a locally relevant deity as moralistic predicts cooperative play in two permutations of two economic games using data from up to 15 diverse field sites. Across games, results suggest that gods' moral concerns do not play a direct, cross-culturally reliable role in motivating cooperative behaviour. The study contributes substantially to the current literature by testing a central hypothesis in the evolutionary and cognitive science of religion with a large and culturally diverse dataset using behavioural and ethnographically rich methods.


S1. Causal Model
To guide model-building and inference and based on the above review of previous literature, Figure S1 illustrates our assumed causal structure of the data-generating process in the form of a directed acyclic graph (DAG; see Pearl et al., 2016).
S P M O C Y F Figure S1: Directed acyclic graph (DAG) of the assumed causal structure of the datagenerating process. C = number of children; S = food insecurity; P = punitive tendency of deity; O = knowledge breadth of deity; M = deity's degree of moral concern; Y = cooperation; F = Structural features of the game set-up. Dashed double-headed arrows refer to bidirectional causal relationships.
Here, Y denotes cooperation, M , P and O denote respectively the moral concern, punitive tendency, and knowledge breadth of a deity (between which we assume bidirectional causal relationships represented by dashed double-headed arrows), S denotes food (in)security, C denotes number of children, and F denotes a set of structural features of the game set-up (game order, game check, and game type; see Section S3.1). These assumed relationships derive from previous empirical studies and evidence syntheses 1 Purzycki et al., 2016bPurzycki et al., , 2018aPurzycki et al., ,b,c, 2022. Our target relationship, or estimand, then, is the direct effect M →Y , where M is measured using free-lists and Y is operationalized as 1 Note that our adjustment set is rather minimal. We justify this on the grounds that previous analyses (in particular, see Lang et al., 2019) did not find consistent relationships between common control variables (sex, education, etc.) and coin allocation to the DISTANT cup across a wide range of model specifications. economic game play.
According to this causal model and assuming no relevant unobserved confounding, to block all back-door paths from M to Y , we need only adjust for P , O, and S. The path from C to M through S is blocked when we condition on S; conditioning on C is therefore not strictly necessary. However, including C can reduce variation in Y , thereby increasing the precision of the target relationship (see e.g., Cinelli et al., 2020, and references therein), and we therefore include both C and S in our conditioning set 2 . For the same reason, we include the structural features of the game set-up, F . The only colliders in this graph are the main predictor variables M , P and O, and the outcome variable, Y , and so, under this model, conditioning on covariates does not induce spurious associations in our estimand.
Moreover, while we assume no systematic missingness pattern conditional on our covariates, our statistical models employ full Bayesian imputation of missing covariates (McElreath, 2020) in order not to discard data unnecessarily.
Next, we discuss the assumptions required for causal interpretation of our analysis.

S1.1. Causal identification assumptions
Our g-computation approach to estimation (Section S2.3) in an observational setting generally requires five key conditions for a causal interpretation of the main exposure (e.g., Hernan and Robins, 2020, ch. 13;Naimi et al., 2017).
We assume that the potential outcomes (coin allocations in the behavioral games) under varying levels of exposure (free-listed Morality) are independent from the observed outcomes (conditional exchangeability). In a perfectly randomized trial, this is the case since randomization ensures that the probability of treatment is independent of the outcome -we say that the treatment and control groups are "exchangeable". However, in an observational 2 Two additional variables could be included on these grounds, namely measures of emotional closeness to the LOCAL/DISTANT players using pictorial "fusion" scales. While previous research found small effects of these variables on impartial coin allocations, a concern was raised that rather than being measures of "fusion" per se, these instruments might measure prosocial tendencies in general (cf., Purzycki and Lang, 2019). Given ambiguities around validity, we refrain from including these variables here.
setting, there can be countless factors that both influence exposure levels and the outcome.
It's a main goal of our statistical model to adjust for these confounding factors, in order to obtain conditional exchangeability. We used a DAG ( Figure S1) to guide our statistical adjustment. In essence, we sought to statistically adjust for all causal paths that both influence our exposure M and outcome Y of interest.
Relatedly, we assume no model misspecification, which entails that our model is specified correctly (e.g., in terms of functional relationships, no omitted confounding variables, etc.).
However, whether any given statistical model and adjustment sets are sufficient to ensure these conditions hold is generally not empirically testable. Similarly, we assume that our variables are measured without error, another difficult-to-verify assumption.
We further assume that an individual's observed outcome under a given exposure is equivalent to the potential outcome that would've been observed under that exposure (counterfactual consistency). In other words, in the context of our g-computation procedure, when we obtain expected values for participants setting M = m, we assume that we obtain the values that we in fact would've observed, if those participants had been observed under M = m.
Finally, we assume positivity, which implies that individuals have similar exposure levels within all confounder levels. While this is empirically unlikely to hold in our case (in that we have several covariates, including a continuous on (i.e., number of children) as well as several groups), lack of positivity can be ignored to the extent that we're willing to assume that estimates for the strata with zero observations can be extrapolated from the model fitted on the observed strata (Hernan and Robins, 2020, p. 162). To that effect, our Bayesian multilevel models borrow information across clusters to inform, impute, and adaptively regularize estimates of all other clusters.

S2. Statistical models
In this section, we lay out our statistical approach and detail the main models in formal notation. The models are extensions of the main model from Purzycki et al. (2018b). We analyzed the RAG with a binomial model (Section S2.1) and the DG with an ordered categorical model (Section S2.2) and model the SELF and LOCAL games in separate models.
Both sets of models include varying effects to manage repeated observations on groups. We ran the models on simplified simulated data to ensure that the models in fact recover key parameters in an ideal scenario under the assumed data-generating process. Key model diagnostics and posterior predictive checks were generally acceptable and are reported in the online supplementary materials (Section S5).
For H 1 the key parameter is that of free-listed morality β M on coin allocation y. To assess H 2 for each of the two outcomes we fit two different models: (1) The theoretically informed "interaction model" including three two-way interaction terms, one for punitiveness and free-listed morality β MP , another for knowledge breadth and free-listed morality β MO , a third for punitiveness and knowledge breadth β PO (not directly relevant to H 2 but a sub-component of the three-way interaction), and one three-way interaction term β MPO , as well as main effects of morality β M , punitiveness β P and knowledge breadth β O ; and (2) an "additive model", which excludes all interaction terms but retains the main effects, and serves then as a "null model" in contrast to the theoretically informed interaction model.
We then 1) inspected the interaction terms to evaluate whether their coefficients have the bulk of their mass above a log odds of 0 (i.e., "no effect"), 2) assessed the posterior predictions of the interaction terms on their natural scales, and 3) compared the two model pairs with approximate leave-one-out cross-validation (LOO-CV; Vehtari et al., 2017), a convenient model comparison tool that computes metrics of out-of-sample predictive accuracy while penalizing model complexity 3 . Model code and data were prepared with the rethinking package (McElreath, 2020) for R and fit with Stan via cmdstanr (Carpenter et al., 2017;Gabry et al., 2022).
Below, we explain the full interaction models for each game in pieces.

S2.1. RAG model
To model the coin allocations to the distant cup y out of a total of 30 coins in the RAG, we use Bayesian multilevel binomial regression: The linear model logit(p i ) (line 2) includes an average intercept α and varying intercepts group (i.e., field site). The next lines (3-5) captures the three cultural variables of interest: free-listed moral code M , gods' punishment P , and knowledge breadth (i.e., omniscience) O and their respective interactions. For these parameters, each group has its own varying slope. The last two lines contain simple (i.e., fixed) individual-level effects for the remaining conditioning set: number of children, food insecurity, game order and game check, respectively.
All simple effects above are assigned weakly-regularizing priors, Normal(0, 1). These guard against finding strong effects in small samples or those that vary considerably in responses, but are easily overwhelmed in large or consistent samples. The varying effects for group are bound together in a common variance-covariance matrix: where S is a diagonal matrix of standard deviations of the intercept and the main and interaction terms for the three cultural variables of interest: and R is a full rank correlation matrix of the same variables. Each standard deviation is assigned an independent Exponential(1) prior as before, and R is given a weakly regularizing prior from the LKJ family (Lewandowski et al., 2009), a common choice of prior distribution for covariance matrices:

S2.1.1. Imputation models
Our three key predictor variables above have missing values: free-listed morality, punishment, and knowledge breadth. We build an imputation model for each of these variables so that the imputed values are informed by their field site. For example, the morality free-list imputation model looks as follows: (1) where the imputed values M range between 0 and 1 and are drawn from a normal distribution with mean µ m and standard deviation σ m that are informed by the individual's field site's estimated mean proportion of free-listed moral items M group . This value is in turn drawn from a normal distribution with a mean and standard deviation that are assigned weakly regularizing priors themselves. The prior for µ M , the mean proportion of free-listed moral items for a given group, is centered on 0.5 as it is the mid-point of the variable, which ranges between 0 and 1; however, with a standard deviation of 2 this estimate is allowed to take on a wide range of values, although we constrain it via Stan at 0 and 1. The imputation models for punishment and knowledge breadth are similar, although those two variables are each means of two binary items and can therefore take on values 0, 1, and 2 4 .
Similarly, following the same structure, we impute missing values in number of children so that the imputations are informed by the respective site-specific mean number of children (truncated at zero).
children σ ∼ Exp (10) Children group ∼ Normal(µ Children , σ Children ) (1) We relied on the following prior distributions for the remaining parameters: The sub-models for the food and check variables impute missing values (there are no missing values in the order variable) using Bayesian inference. Since these two variables are binary (0/1), we say that they are drawn from a Bernoulli distribution with probability ϕ, which is estimated from the sample and given a weakly regularizing beta prior. When we plot the imputations of the ϕ food and ϕ check parameters (see Section S5), note that these pertain to the posterior probabilities of a 1 and not the actual, imputed binary values.

S2.2. DG model
To model coin allocations to the distant cup y out of a total of 10 coins in the DG, we use an ordered categorical likelihood model, where we regard each coin as a discrete ordered response The cumulative property of the ordered categorical neatly captures the cumulative aspect of coin allocations in the DG game. However, since zero is not a valid value for the ordered categorical model but was an option in the game (i.e., no coins allocated to the distant cup) and falls naturally on the ordering of the response (i.e., zero coins is less than one, less than two, etc.), we add a "dummy coin" to the response variable, such that the response value ranks from one to 11, where one means zero coins, two means one coin, etc. This yields a vector of 11 − 1 = 10 random cut-points K, on which we put a prior of Normal(0, 2) to be estimated along with a linear model η i that is otherwise similar to the RAG model.

S2.3. G-computation
Once the models are fitted, the marginal contrasts are obtained as follows 5 : 1. Duplicate original dataset and set M = 0 (i.e., no moral items in an individual's free-list) for all individuals, retaining covariates as observed.

S3.1. Covariates
A comprehensive overview of variables, sampling procedures, and field sites characteristics employed in the Evolution of Religion and Morality Project can be found in Purzycki et al.
(2016a) and Lang et al. (2019). Here, we describe the operationalization of covariates relevant for the present study. In addition to the conditioning set identified in Section S1, we include two variables (game order and game check; see explication below) related to the structural features of the game set-up (see Purzycki et al., 2018b).
Children: the self-reported number of children that a participant has fathered or given birth to.
Food insecurity: participant's self-reported food or material (in)security measured with a dichotomous (yes/no) item: "Do you worry that in the next month your household will have a time when it is not able to buy or produce enough food to eat?". Game order : An indicator denoting which game participants played first. 0 = SELF Game first, 1 = LOCAL Game first.
Game check : An indicator denoting whether, when asked what they thought the games were about during debriefing, a participant's response included (= 1) "honesty," "fairness," and/or "cheating".

S3.2. Free-list data cleaning
Following the workflow of , the data entries for the general codes were cleaned and systematized (e.g., in terms of spelling, typos, and blank spaces) to avoid that the same codes are treated as separate by the parsing in R and AnthroTools. Then, in cases of disagreement between coders, we selected for final analysis the best fitting of the two codes according to the general coding scheme (see Purzycki and McNamara, 2016). A column designating these (dis)agreements was added to the spreadsheet, and the selected code was stored in a separate column for full transparency.

S3.3. Proportion vs. salience of moral free-list responses
As we are interested in measuring the extent to which individuals ascribe moral concern to their deities, we use the proportion of moral items in each participant's lists as our focal predictor for all main models. We take the proportion of moral items to deal with multiple listings of moral responses.
A popular alternative is a salience score. A salience score is computed by subtracting an item's order number, k, from 1 plus the total number of items a participant listed. This number is then divided by the total number of items listed, n+1−k n . All items listed first thus get an item salience of 1. Items listed earlier are typically easier to access or recall, and thus constitute a form of cognitive salience.
For the current study, we use proportion rather than salience for two main reasons.
First, based on previous analyses , we expected that salience scores would produce ceiling effects, such that many participants list moral items as the first response, in turn yielding little variation in the variable. Second, whereas salience scores are generally considered a measure of recall (e.g., the ease with which an association comes to mind), we consider proportion to be a more appropriate measure of the degree to which participants perceive their deity as moralistic, a more apt measure for present purposes.
We recorded a "1" if any of these codes figured in either of the two coders' columns and then, for each participant, took the proportion of these responses to the total number of responses listed, similar to the main analysis.

S3.5. Moral Interest Scale
The moral interest scale took the form: How important is it for [moralistic deity] to punish: One issue with this approach is that it assumes that, in cases of missing responses, participants are consistent across the three items-e.g., a participant can get a maximum score on the index either by selecting (4) on all three items or on just one or two of the items, if failing to respond to the remaining item(s). Another issue is the lack of uncertainty around the index score. We use the scale as described here, to be consistent with previous research Purzycki et al., 2016b).

S4. Deviations from pre-registration protocol
Here, we document the deviations we took from the pre-registered statistical protocol. Deviations were minor in nature, and they were all due to a combination of oversight and unforeseen practical complications.
In contrast to the pre-registration, we ended up modeling the SELF and LOCAL game outcomes in separate models. The reason for this was mainly computational. Modeling the SELF and LOCAL games in the same model required by-participant random intercepts to accommodate repeated observations on participants across game types.
However, the by-participant random intercepts made our LOO model comparison unstable, due to some observations being too influential in the importance sampling procedure. We therefore dropped the by-participant random intercepts.
We tightened the priors from Normal(0.5, 2) to Normal(0.5, 0.5) on µ M and from Normal(1, 2) to Normal(0.5, 0.5) on µ P and µ O . These parameters constrain the group mean imputation of missing values in M , P and O, respectively. The pre-registered priors for µ P and µ O were constructed on the basis of these variables taking on values 0-2; however, they in fact range from 0-1. The relatively wide standard deviation of the prior for µ M was likewise an oversight.
Model parameters bMavg, bPavg, and bOavg in the pre-registered model code had priors but are redundant and are therefore deleted.
We made a coding error in the imputation routine for number of children C. The preregistered line C mu <-C[group] was therefore changed to C mu <-Cavg[group].
We deleted the imputation sub-models for the order and type variables, since there are no missing values in these variables in the dataset.
We removed range constraints on S impute and check impute, since constraints are redundant (these parameters are already constrained by the beta distribution used for imputation).
We pre-registered the supplementary analysis of the "moral interest" scale and specified that we'd transform the scale such that it ranged from 0-1. However, we retained it as used in Lang et al. (2019), such that it ranges from 0-4 and modified the prior used for imputation accordingly. Further, since Hadza participants answered these items on a binary scale, they are excluded from the analysis.
In an exploratory (not pre-registered) supplementary analysis, we expanded the predictor of the main analysis such that it also included instances of the general free-list coding category "Virtue", which overlaps conceptually with the "Morality" coding category. A free-list response qualifies as "Virtue" if it satisfies the following: individual qualities that may or may not have social ramifications (e.g., hard-working, kind, bad conscience, etc.) (see . Indeed, there's precedence in the published literature for lumping these two coding categories (e.g., Purzycki and McNamara, 2016;Purzycki et al., 2016b). However, it yielded similar results as reported in the main manuscript.
We pre-registered a set of supplementary analysis whereby we'd fit an ordered beta model to the Dictator Game data. However, due to very reasonable posterior pre-dictive fit of the ordinal model to the Dictator Game data, we refrained from fitting corresponding ordered beta models.

S5. Online supplementary plots and analyses
At this study's Git repo (https://github.com/tbendixen/moral-freelist-econ), we report supplementary plots and analyses in separate notebooks, for the sake of compact imp plots.pdf reports plots from missing data imputations of all variables and across all main models.
For reference, Table S1 provides an overview of all model specifications.