1. Introduction
In experimental economics, we are fortunate to have healthy data-sharing norms, which make it relatively easy to obtain data from published economic experiments. With all of these data at our fingertips, it is not surprising that researchers are finding it useful to combine decision-level data from experiments in order to learn from them together in meta-analyses. Such analyses typically focus on identifying “determinants” (usually treatment conditions) of observed behavior. In these papers, results are commonly based on estimated pooled models; that is, the combined dataset is analyzed as if it came from one experiment.¹
Using a well-known result from the panel-data literature, this paper demonstrates that when combining data from multiple experiments, the pooled OLS estimator can be written as a weighted sum of the estimated treatment effects within each experiment and a between-experiment estimate of the treatment effect. While typical experimental techniques ensure that the within-experiment estimators are unbiased, this does not guarantee that the between-experiment estimator is also unbiased. Specifically, within an experiment the explanatory variable is plausibly exogenous by design, but between experiments the explanatory variable is chosen by the experimenter, and hence is likely endogenous. The weight placed on this between-experiment estimator is substantial in typical studies, so aggregated effects estimated with these pooled models may be misleading. Furthermore, the weighting does not take into account some aspects of the precision of the within-experiment estimates. I discuss some remedies for these problems and use examples to demonstrate their extent.
The remainder of this paper is organized as follows: Section 2 describes how the pooled estimator weights data and how it can be biased. Section 3 discusses some fixes to the problem and provides advice for those doing meta-analysis. Section 4 presents detailed examples from specific meta-studies, showing how much weight the pooled estimator can place on the between-experiments estimator. Finally, Section 5 concludes. Appendix A derives the weighting, Appendix C derives the bias, and Appendix B provides details of a simulation reported in Section 2.
2. How the pooled estimator weights experiments
Consider a case where it is appropriate to estimate the effect of treatment condition $x_{e,i,t}$ on outcome variable $y_{e,i,t}$ using linear regression. That is, we wish to estimate the parameter $\beta_1$ in the following equation:
\begin{equation}
y_{e,i,t}=\beta_0+\beta_1 x_{e,i,t}+\epsilon_{e,i,t}
\end{equation}
where $e$ indexes the experiment, $i$ the participant, and $t$ the decision, and $\epsilon_{e,i,t}$ is an error term.
Since in each experiment the treatment is assigned randomly, we have an unbiased estimator of the treatment effect $\beta_1$ for every experiment that has some variation in the treatment variable:
\begin{equation*}
\hat\beta_{1}^e=\frac{\sum_{\forall i,t\in e}(y_{e,i,t}-\bar y_e)(x_{e,i,t}-\bar x_e)}{\sum_{\forall i,t\in e}(x_{e,i,t}-\bar x_e)^2}
\end{equation*}
where $\bar y_e$ and $\bar x_e$ are the means of $y$ and $x$ within experiment $e$. $\hat\beta_1^e$ is the within-experiment estimator, because it identifies treatment effects from variation in $x_{e,i,t}$ within an experiment only. Presumably, the realization of $\hat\beta^e_1$ is the estimate of the treatment effect reported in experiment $e$.
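To make the formula concrete, here is a minimal sketch (my own, not from the paper) of the within-experiment estimator in Python; the data-generating process and all names are illustrative assumptions:

```python
import numpy as np

def within_slope(x, y):
    """OLS slope for a single experiment: sum of cross-products of
    deviations from the experiment's own means, over the sum of
    squared deviations of x."""
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return np.sum(y_dev * x_dev) / np.sum(x_dev**2)

# Hypothetical experiment: treatment randomly assigned within the experiment
rng = np.random.default_rng(0)
x = rng.choice([0.0, 1.0], size=100)       # randomly assigned treatment
y = 2.0 + 1.0 * x + rng.normal(size=100)   # true treatment effect is 1
print(within_slope(x, y))                  # approximately 1
```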
Now consider pooling all of the experiments in the meta-study and again estimating $\beta_1$ using OLS. In this case, we can write this pooled estimator as (see Appendix A for a derivation):
\begin{equation}
\hat\beta_1^\text{pooled}=\sum_{\forall e}w_e\hat\beta_1^e+ \left(1-\sum_{\forall e}w_e\right)\tilde\beta_1
\end{equation}
\begin{equation}
\text{where: } w_e=\frac{\sum_{\forall i,t\in e}(x_{e,i,t}-\bar x_e)^2}{\sum_{\forall e}\sum_{\forall i,t\in e}(x_{e,i,t}-\bar x_e)^2+\sum_{\forall e}\sum_{\forall i,t\in e}(\bar x_{e}-\bar x)^2}
\end{equation}
\begin{equation}
\tilde\beta_1=\frac{\sum_{\forall e}\sum_{\forall i,t\in e}(\bar y_e-\bar y)(\bar x_e-\bar x)}{\sum_{\forall e}\sum_{\forall i,t\in e}(\bar x_e-\bar x)^2}
\end{equation}
where $\bar y$ and $\bar x$ are the sample means of $y_{e,i,t}$ and $x_{e,i,t}$ over the entire dataset.² Note that $\hat\beta_1^\text{pooled}$ is a weighted sum of (i) the experiment-specific estimators $\hat\beta^e_1$, and (ii) $\tilde\beta_1$, the pooled OLS estimator after replacing the data with their within-experiment means. We can call $\tilde\beta_1$ the between-experiment estimator, because it identifies the effect of $x$ on $y$ from variation between experiments only; that is, it ignores any variation within an experiment. We can therefore interpret $\hat\beta_1^\text{pooled}$ as an aggregation of (i) what we learn from within-experiment variation in treatment, and (ii) what we learn from between-experiment variation.³
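The decomposition in (2)–(4) can be verified numerically. The following sketch (assuming a simple simulated dataset; none of these numbers come from a real study) computes the pooled OLS slope directly and then reassembles it from the within-experiment slopes, the between-experiments slope, and the weights $w_e$:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Three simulated experiments; the third has no variation in x, so it can
# contribute only through the between-experiments estimator.
frames = []
for e, levels in enumerate([[0.2, 0.4], [0.4, 0.8], [0.6]]):
    x = rng.choice(levels, size=200)
    y = 1.0 * x + rng.normal(size=200)               # true beta_1 = 1
    frames.append(pd.DataFrame({"e": e, "x": x, "y": y}))
df = pd.concat(frames, ignore_index=True)

# Pooled OLS slope on the combined data
xd, yd = df["x"] - df["x"].mean(), df["y"] - df["y"].mean()
beta_pooled = (xd * yd).sum() / (xd**2).sum()

# Within-experiment sums of squares and slopes
xw = df["x"] - df.groupby("e")["x"].transform("mean")
yw = df["y"] - df.groupby("e")["y"].transform("mean")
Sxx_w = (xw**2).groupby(df["e"]).sum()
Sxy_w = (xw * yw).groupby(df["e"]).sum()
beta_within = Sxy_w / Sxx_w          # NaN for the no-variation experiment

# Between-experiments estimator, computed from experiment means
means = df.groupby("e").agg(N=("x", "size"), xbar=("x", "mean"), ybar=("y", "mean"))
xb = means["xbar"] - df["x"].mean()
yb = means["ybar"] - df["y"].mean()
Sxx_b = (means["N"] * xb**2).sum()
beta_between = (means["N"] * xb * yb).sum() / Sxx_b

# Weights: each experiment's within variation over total variation in x
w = Sxx_w / (Sxx_w.sum() + Sxx_b)

# The identity: pooled = weighted within slopes + remaining weight * between
rebuilt = (w * beta_within).fillna(0.0).sum() + (1.0 - w.sum()) * beta_between
print(beta_pooled, rebuilt)   # equal up to floating-point error
print(1.0 - w.sum())          # weight on the between-experiments estimator
```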
It is reasonable to assume that $\mathrm{cov}(x_{e,i,t},\epsilon_{e,i,t}\mid e)=0$ within an experiment, because experiments randomly assign treatment variables. It follows immediately that each $\hat\beta_1^e$ is an unbiased estimator of the treatment effect $\beta_1$. Since the weights add up to one, for $\hat\beta_1^\text{pooled}$ to also be unbiased, we additionally require that $\tilde\beta_1$ is unbiased. In Appendix C, I show that the bias of the between-experiments estimator is:
\begin{equation*}
E\left(\tilde\beta_1-\beta_1\mid x\right)=\frac{\sum_{\forall e}N_e\mathrm{cov}(\bar \epsilon_e,\bar x_e)}{\sum_{\forall e}N_e(\bar x_e-\bar x)^2}
\end{equation*}
where $N_e$ is the total number of observations in experiment $e$. Hence, the bias of the pooled estimator must be equal to the bias of the between-experiments estimator multiplied by the weight this estimator receives:
\begin{equation*}
E\left(\hat\beta_1^\text{pooled}-\beta_1\mid x\right)=
\begin{cases}
\frac{\sum_{\forall e}N_e\mathrm{cov}(\bar \epsilon_e,\bar x_e)}{\sum_{\forall e}\sum_{\forall i,t\in e}(x_{e,i,t}-\bar x_e)^2+\sum_{\forall e}\sum_{\forall i,t\in e}(\bar x_e-\bar x)^2} &\text{if }\sum_{\forall e}\sum_{\forall i,t\in e}(\bar x_e-\bar x)^2 \gt 0\\
0&\text{otherwise}
\end{cases}
\end{equation*}
That is, for the pooled estimator to be unbiased, we need at least one of the following to hold:

(1) The average treatment condition within an experiment, $\bar x_e$, is the same for all experiments. This is equivalent to the pooled estimator placing zero weight on the between-experiment estimator.

(2) $\mathrm{cov}(\bar\epsilon_e,\bar x_e)=0$. That is, there is zero covariance between average errors in an experiment and average treatment conditions.
The first condition is something that we can check with our data. For example, Table 1 documents some studies that estimate pooled models from individual-level data.⁴ The first column lists the study, the second the topic of the study, the third the explanatory variable (i.e. $x_{e,i,t}$), and the rightmost column the weight placed on the between-experiments estimator $\tilde\beta_1$. If this weight is zero, then one does not need to worry about bias. However, none of these weights are zero, and most are closer to one than to zero. Therefore, for typical uses of this technique, it is unlikely that we will be able to justify unbiasedness, or even just a small weight, using this first condition. That the weight on the between-experiments estimator is substantial for most applications should in fact be unsurprising: it is uncommon for experiment conditions to be exactly replicated, and so average treatment conditions between experiments will likely differ. That is, when running a new experiment, an experimenter may replicate some or all of an existing study, but it is unlikely that they will only replicate an existing study; more likely, they will add extensions to the original design. Figure 1 takes a closer look at the data contributing to the 75% weight placed on the between estimator for the effect of the discount factor $\delta$ in the meta-study by Dal Bó and Fréchette (2018). Each dot in this plot represents a unique discount factor studied in each experiment, and the red “X”s mark the within-experiment average treatment conditions (i.e. $\bar x_e$). Clearly, average treatment conditions are not constant across experiments, as seen by the different heights of the red “X”s. Furthermore, this study includes some experiments with no variation in the discount factor. These experiments individually are unable to estimate the effect of the discount factor on (say) cooperation, and so data from these experiments only inform the meta-study through the between-experiments estimator.
Table 1. Weight placed on the between estimator for various studies using pooled models. (D) indicates that the explanatory variable is a dummy variable. Weights are shown for the bivariate OLS estimator
Figure 1. Discount factors used in the fifteen economic experiments on the indefinitely repeated Prisoner’s Dilemma examined in Dal Bó and Fréchette (2018). Red “X”s mark the average treatment conditions
The second condition for unbiasedness, $\mathrm{cov}(\bar \epsilon_e,\bar x_e)=0$, requires that average errors in an experiment ($\bar \epsilon_e$) are uncorrelated with average treatment conditions ($\bar x_e$). This condition requires knowledge of the error terms in (1), and is therefore unverifiable. Since each new experiment is designed around its own research question, this exogeneity between experiments is unlikely to hold in practice: good experimenters will design their new experiments (i.e. choose their $x$s) based on their beliefs about how participants will respond to treatment conditions in their own laboratory. For example, suppose that we were designing a linear public goods game experiment, and we understood that participants in our subject pool were particularly uncooperative. If we need a good baseline level of cooperation, we may choose a marginal per-capita return above the levels used in most past experiments. This would be choosing our $x$s in response to anticipated levels of $\epsilon$. Furthermore, good experimenters will use the results of previous experiments, even if not from the same laboratory, in the design of new experiments. This, too, will induce correlation between average treatment conditions and the error term.⁵

Correcting for this bias is possible using experiment-specific fixed effects. However, doing so places no weight on what can be learned from between-experiment treatment variation, so if we believe that this kind of endogeneity is not present, we may not want to throw out this information.
Next, I use a Monte Carlo simulation to demonstrate the extent of the bias. Here, the true relationship between treatment variable $X$ and outcome variable $Y$ is:
\begin{equation*}
y_{e,i,t}=\beta_0+\beta_1 x_{e,i,t}+\eta_e+\epsilon_{e,i,t}
\end{equation*}
where $\eta_e$ is an experiment-specific effect, and $\epsilon_{e,i,t}$ is an idiosyncratic error term. I vary the way in which experimenters choose their treatment conditions in response to $\eta_e$. That is, $\eta_e$ is known to the experimenter but not to the econometrician. I explore three different kinds of responses to this experiment-specific error term. In the “no correlation” condition, experimenters do not take $\eta_e$ into account. In the “negative correlation” condition, experimenters choose lower $X$ within their experiment as $\eta_e$ increases. Finally, in the “positive correlation” condition, experimenters choose larger $X$ within their experiment as $\eta_e$ increases. The simulation is set up so that approximately half the weight is placed on the between-experiments estimator in all conditions.⁶ Figure 2 summarizes the results of the simulation by showing the kernel-smoothed density of the simulated slope estimates. The top panels present the results for the pooled estimator. When the experimenter does not choose $X$ in response to experiment-specific errors (middle panel), there is no problem with the pooled model: the distribution is centered on the true value of $\beta_1=1$. On the other hand, when there is a negative correlation between experiment-specific errors and $X$ (left panel), the pooled model is biased in the negative direction, and when there is a positive correlation between experiment-specific errors and $X$ (right panel), the pooled model over-estimates the treatment effect. The bottom panels show that including experiment fixed effects eliminates this bias in both cases.
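A compressed version of this kind of simulation is sketched below (under my own assumptions about functional form and parameter values; this is not the paper’s code). The parameter `rho` controls how strongly experimenters shift their treatment levels in response to the experiment-specific effect $\eta_e$:

```python
import numpy as np

rng = np.random.default_rng(42)

def one_meta_dataset(rho, n_exp=20, n_obs=50):
    """Simulate a meta-study of n_exp experiments. Experimenters observe
    eta_e and shift their treatment levels x by rho * eta_e."""
    xs, ys, es = [], [], []
    for e in range(n_exp):
        eta = rng.normal()                             # experiment effect
        x = rng.uniform(size=n_obs) + rho * eta        # endogenous design
        y = 1.0 * x + eta + rng.normal(size=n_obs)     # true beta_1 = 1
        xs.append(x); ys.append(y); es.append(np.full(n_obs, e))
    return np.concatenate(xs), np.concatenate(ys), np.concatenate(es)

def pooled_slope(x, y):
    xd, yd = x - x.mean(), y - y.mean()
    return np.sum(xd * yd) / np.sum(xd**2)

def fe_slope(x, y, e):
    """Fixed-effects slope: demean x and y within each experiment first."""
    xd, yd = x.copy(), y.copy()
    for g in np.unique(e):
        m = e == g
        xd[m] -= xd[m].mean()
        yd[m] -= yd[m].mean()
    return np.sum(xd * yd) / np.sum(xd**2)

for rho in (-0.2, 0.0, 0.2):   # negative, no, and positive correlation
    pooled, fe = [], []
    for _ in range(200):
        x, y, e = one_meta_dataset(rho)
        pooled.append(pooled_slope(x, y))
        fe.append(fe_slope(x, y, e))
    print(f"rho={rho:+.1f}  pooled={np.mean(pooled):.2f}  FE={np.mean(fe):.2f}")
# The pooled estimator is biased downward (upward) when rho is negative
# (positive); the fixed-effects estimator stays centered on 1.
```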
Figure 2. Results of a simulation exploring various experimenter responses to experiment-specific error terms. The vertical red dashed line shows the true value of the estimand
Inspection of (3), the equation for $w_e$, shows that experiment-specific treatment effects receive a weight that is proportional to $\sum_{\forall i,t\in e}(x_{e,i,t}-\bar x_e)^2$. This is part, but not all, of the inverse variance of $\hat\beta_1^e$, assuming homoskedasticity within an experiment. Because of this, all else held equal, an experiment that (i) has more observations, and/or (ii) has more variation in the treatment variable will receive more weight. This is a desirable property of the weights, as experiments with these properties will, all else held equal, have more precise estimates of the treatment effect. On the other hand, absent from the weight is any estimate of the within-experiment variance of $\epsilon_{e,i,t}$. Hence, experiments with identical $x$s and the same number of decisions will receive identical weights, even if the treatment effect is estimated more precisely in one experiment than in the other. This means that $\hat\beta_1^\text{pooled}$ will be inefficient compared to more traditional methods of meta-analysis that take account of this precision. Furthermore, if it is appropriate to cluster standard errors when analyzing data within an experiment, this weighting will not take that into account. For example, suppose that for two experiments it is appropriate to cluster standard errors at the participant level. Experiment $A$ has 10 participants making 100 decisions each, and experiment $B$ has 100 participants making 10 decisions each. $\hat\beta_1^\text{pooled}$ weights these two experiments equally, even though experiment $B$ has more independent observations, and thus will most likely have a more precise estimate of the treatment effect.
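This clustering point can be seen in a short simulation (again a sketch, with hypothetical designs): both experiments below have 1,000 decisions split evenly between treatment and control, so they have identical $\sum_{\forall i,t\in e}(x_{e,i,t}-\bar x_e)^2$ and hence identical pooled weights, yet the design with more participants estimates the treatment effect far more precisely when errors are correlated within participant:

```python
import numpy as np

rng = np.random.default_rng(7)

def experiment(n_subj, n_dec):
    """Each participant is randomized to treatment or control and makes
    n_dec decisions; errors include a participant-level random effect."""
    x = np.repeat(rng.permutation(np.tile([0.0, 1.0], n_subj // 2)), n_dec)
    u = np.repeat(rng.normal(size=n_subj), n_dec)   # participant effect
    y = 1.0 * x + u + rng.normal(size=n_subj * n_dec)
    return x, y

def slope(x, y):
    xd, yd = x - x.mean(), y - y.mean()
    return np.sum(xd * yd) / np.sum(xd**2)

def sxx(x):
    return np.sum((x - x.mean()) ** 2)

# Experiment A: 10 participants x 100 decisions; B: 100 participants x 10
slopes_A = [slope(*experiment(10, 100)) for _ in range(500)]
slopes_B = [slope(*experiment(100, 10)) for _ in range(500)]

xA, _ = experiment(10, 100)
xB, _ = experiment(100, 10)
print(sxx(xA), sxx(xB))                    # identical: equal pooled weight
print(np.std(slopes_A), np.std(slopes_B))  # B is much more precise
```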
3. Some remedies
The previous section shows that within-experiment randomization into treatments is not sufficient for the pooled OLS estimator $\hat\beta_1^\text{pooled}$ to be unbiased. The extent of this bias is a function of (i) the weight placed on the between-experiment estimator $\tilde\beta_1$, and (ii) the bias of $\tilde\beta_1$. While the weight on $\tilde\beta_1$ can be easily computed, its bias is a function of the experiment-average error terms $\{\bar \epsilon_e\}_{\forall e}$, and so one can only hypothesize about it. While drawing conclusions from $\hat\beta_1^\text{pooled}$ may be worthwhile if we can establish that this bias (through reasoning) and/or weight (through computation) is small for our particular application, there will be other applications in which we need to be wary. The remainder of this section first outlines how this bias term can be completely eliminated by ignoring all studies that do not have any variation in $x_{e,i,t}$, and then how the remaining within-experiment estimators can be combined more efficiently.

If we are unconvinced that the between-experiment estimator $\tilde\beta_1$ is unbiased, then constructing an estimator that places no weight on this component will solve the bias problem. Including experiment fixed effects in the regression ensures that weight is only placed on the unbiased within-experiment estimators $\hat\beta^e_1$. Because experiments with no variation in $x_{e,i,t}$ receive zero weight in this case, this is equivalent to simply dropping those experiments from the analysis. Table 2 of Cooper and Dutcher (2011) and Table 2 of Alm and Malézieux (2021) are examples of using fixed effects, and so the estimators used in these tables do not suffer from this source of bias. Importantly, if present, the bias of $\tilde\beta_1$ is not remedied by experiment random effects or other multi-level or hierarchical models: these models assume that the group-specific random effects are uncorrelated with the explanatory variable $x_{e,i,t}$. When experimenters choose their treatment conditions in response to the error term, this assumption is not satisfied, and the slope estimator is biased.
Table 2. Decomposition of the pooled estimator in Fiala and Suetens (2017), estimating the effect of having feedback about each group member’s contribution in a public goods game on average group contributions. Standard errors for weighted estimates are clustered at the experiment level. Standard errors for estimates within experiments are heteroskedasticity-robust
Suppose now that we are analyzing only experiments with some variation in the treatment variable, so any weighting (that adds to one) of the remaining within-experiment estimators will be unbiased. The goal could therefore be to weight the estimators in a way that minimizes the variance:
\begin{equation*}
V\left(\hat\beta_1\right)=\sum_{\forall e}w_e^2 V\left(\hat\beta_1^e\right)
\end{equation*}
Choosing weights proportional to the inverse of the variance of the within-experiment estimator minimizes this variance. That is:
\begin{equation*}
w_e= \frac{V\left(\hat\beta_1^e\right)^{-1}}{\sum_{\forall e}V\left(\hat\beta_1^e\right)^{-1}}
\end{equation*}
In practice, we replace the unknown quantity $V\left(\hat\beta_1^e\right)$ with its estimate, the squared standard error of $\hat\beta_1^e$. Examples of this weighting estimator being used for economic experiments can be found in Ioannidis et al. (2020) (see the discussion around their Table 3) and De Quidt et al. (2017) (see their footnote 15). Note that while clustering standard errors at the experiment level may more accurately express the uncertainty we have in our pooled estimate, it will not affect the weighting, and so will have no bearing on the point estimate.
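A sketch of this inverse-variance aggregation (the numbers are hypothetical, not taken from any of the cited studies):

```python
import numpy as np

def inverse_variance_pool(estimates, standard_errors):
    """Combine experiment-specific estimates with weights proportional to
    the inverse of their estimated variances (squared standard errors)."""
    b = np.asarray(estimates, dtype=float)
    v = np.asarray(standard_errors, dtype=float) ** 2
    w = (1.0 / v) / (1.0 / v).sum()
    pooled = (w * b).sum()
    pooled_se = np.sqrt(1.0 / (1.0 / v).sum())   # SE of the combination
    return pooled, pooled_se, w

# Hypothetical within-experiment estimates and standard errors
b, se, w = inverse_variance_pool([0.8, 1.2, 1.0], [0.10, 0.40, 0.20])
print(b, se, w)   # the most precisely estimated experiment dominates
```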
Table 3. Decomposition of the pooled estimator in Fiala and Suetens (2017), estimating the effect of the marginal per capita return in a public goods game on average group contributions. Standard errors for weighted estimates are clustered at the experiment level. Standard errors for estimates within experiments are heteroskedasticity-robust
In summary, these results motivate the following advice for practitioners performing a meta-analysis with decision-level data:

(1) Be aware of how many experiments in your dataset can individually estimate the desired treatment effect. These are the experiments that vary the treatment variable.

(2) One can always calculate the weight placed on the between-experiments estimator (and the between-experiments estimate itself); a sketch of this calculation follows this list. If this weight is large, it is an indicator that the potentially biased between-experiments estimator is driving the results.

(3) Consider estimators that are robust to this source of bias. If you want to use OLS, at least include experiment fixed effects. Otherwise, consider aggregating experiment-specific estimates using either an inverse variance-weighted estimator or the Rubin (1981) model.

(4) Understand why the experiments were designed the way they were. Be aware of situations where experimenters would want to adjust their treatment conditions in response to expectations about the error term.

(5) If you have only a few experiments in the dataset, consider reporting the pooled estimates alongside the fixed-effects estimates and discussing the limitations of drawing conclusions from the pooled estimates.
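For point (2), the weight on the between-experiments estimator can be computed directly from the data. A sketch, assuming a long-format pandas DataFrame with one row per decision and hypothetical column names `e` (experiment identifier) and `x` (treatment variable):

```python
import pandas as pd

def between_weight(df, x="x", e="e"):
    """Share of the total variation in x that comes from differences in
    experiment means: the weight 1 - sum(w_e) that pooled OLS places on
    the between-experiments estimator."""
    total_ss = ((df[x] - df[x].mean()) ** 2).sum()
    within_ss = ((df[x] - df.groupby(e)[x].transform("mean")) ** 2).sum()
    return (total_ss - within_ss) / total_ss
```

This works because, by the decomposition above, the total variation in $x$ splits into within-experiment and between-experiment sums of squares, so the between weight is one minus the within share.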
4. Examples
4.1. Fiala and Suetens (2017)
Fiala and Suetens (2017) use data from linear public goods and oligopoly game experiments to investigate the influence of different kinds of feedback on participants’ choices. In this example, I analyze their data on the public goods games to answer two research questions. First, I estimate the effect of providing or not providing feedback about each group member’s choice. As shown in Table 1, the pooled OLS estimator places about 92% of its weight on the between-experiments estimator here. Second, I investigate the effect of the marginal per capita return (MPCR) on participants’ choices. Here, Table 1 shows that the pooled OLS estimator places about 66% of its weight on the between-experiments estimator. In order to be comparable to the original analysis of Fiala and Suetens (2017), I treat groups of participants as the level of independent observation.

Table 2 decomposes the pooled estimator used in Table 4, column 1 of Fiala and Suetens (2017) to estimate the effect of feedback about individual group choices on average group contributions. While the dataset includes 65 studies in total, only five of these varied this kind of feedback; the remaining sixty experiments contribute to the pooled estimator only through the between-experiments estimator. As such, only 8% of the weight of the pooled estimator goes to the within-experiment estimates, which are itemized by study in the top panel of Table 2. Importantly, the between-experiment estimate ($\tilde\beta_1$) is larger than all of the within-experiment estimates, and the pooled estimator places a lot of weight on it; if what we learn about feedback from between-experiment variation is biased, so is the pooled estimate. This between-experiment variation largely comes from the remaining 33 studies that never provide this feedback and 27 that always provide it.⁷
Table 4. Decomposition of the pooled OLS estimator for Dal Bó et al. (2021). Each estimate is for the effect of the basin of attraction of Stag on the probability of choosing Stag in a Stag Hunt game. Missing estimates indicate that the within-experiment estimator is undefined for these experiments because there is no variation in the treatment variable within them
Table 3 performs a similar decomposition for the marginal per capita return.⁸ For this explanatory variable, 65% of the weight is placed on the between estimator $\tilde\beta_1$. Perhaps most striking in this decomposition is that the within (fixed effects) estimate (0.629) is almost double the pooled estimate (0.340). This is because the between estimate is much smaller than most of the within-experiment estimates and receives a substantial fraction of the weight. Furthermore, the optimally-weighted estimator (rightmost column) weights the within-experiment estimates very differently from the within estimator (second column from the right): the within estimator places almost half the weight on Nosenzo et al. (2015), while the optimally-weighted estimator places almost 70% of the weight on Eckel et al. (2015). This highlights the difference between the pooled and within weights, which do not fully take into account the precision of the within-experiment estimators, and the optimal weights, which explicitly weight according to this precision.

In summary, for both of these explanatory variables, the potentially biased between-experiments estimator receives the majority of the weight. This is because only a handful of studies actually vary the treatment variable of interest.
4.2. Dal Bó et al. (2021)
In their Table 4, Dal Bó et al. (2021) estimate the effects of various treatment conditions on play in stag hunt games, using decision-level data collected from eight economic experiments. To keep this example as simple as possible, I focus on estimating the effect of the basin of attraction of the action “Stag” on the probability of choosing this action in the first round of the experiments, using a linear probability model.⁹

Table 4 decomposes the pooled OLS estimator into the contributions from the within-experiment estimators (i.e. $\hat\beta_1^e$) and the between-experiment estimator ($\tilde\beta_1$). First, note that for four of the experiments, the within-experiment estimator is undefined (these are shown in the table as blank cells) because there is no variation in the basin of attraction of Stag within these experiments. However, the pooled estimation does incorporate some information from these experiments through the between-experiment estimator $\tilde\beta_1$. The weights placed on each of these components are shown in the rightmost three columns of the table, which report the pooled, within-only (i.e. fixed effects), and optimal weightings. Approximately 58% of the weight in the pooled estimator goes to the plausibly biased between-experiment estimator, with the remainder divided up, unevenly, among the experiments that have some variation in the treatment variable. The “within” weights shown in this table are the weights that apply if experiment fixed effects are used; these are simply the pooled weights excluding the between estimator’s weight, re-scaled so that they sum to one. Note the difference between these and the “optimal”, variance-minimizing weights (rightmost column). While the within weights take account of some of the uncertainty associated with each experiment’s estimate, they do not take all of its sources into account. Hence, the “optimal” weights differ from the “within” (fixed effects) weights.

In this study, half (four of eight) of the experiments do not vary the basin of attraction. As such, the only way for these studies to contribute to the overall estimate is through the between-experiments estimator. Given the large weight placed on this estimator, we can largely interpret the results of this meta-study as an observational comparison between experiments, not an aggregation of causal effects.
5. Conclusion
When combining decision-level data from more than one economic experiment, the pooled OLS estimator can be written as a weighted sum of the within-experiment estimators of the treatment effect and the between-experiment estimator of the treatment effect. An immediate consequence is that all of these estimators must be unbiased for the pooled estimator to be unbiased. Typical experimental designs take care of this for the within-experiment estimators; however, more care may be needed with the between-experiment estimator. For it to be unbiased, we need the average treatment variable in an experiment to be uncorrelated with the average error term in that experiment. This condition may not hold if experimenters choose their average treatment variable in response to their beliefs about choices in the experiment. In the examples shown in the previous section, these between-experiment estimators received the majority of the weight.
Furthermore, conclusions drawn from the pooled estimator may overstate what we have learned from the experiments included in the study. This is because some weight, in practice a substantial weight, is placed on the between-experiment estimator. Ideally, we would like to learn about treatment effects through exogenous variation of the treatment variable, and many studies included in these pooled analyses do not have this variation for some explanatory variables. This calls for more careful consideration of which studies are included, and not included, in a meta-analysis. In particular, if an experiment cannot individually provide an answer to the research question, then its contribution to the meta-study will be solely through between-experiment variation. This means that more experiments estimating the same treatment effect may need to be run before we can meaningfully combine them in a meta-analysis. On the other hand, Meager (2019) meaningfully aggregates the results of just seven experiments, so this number need not be too high.
It should be noted that whether we are using pooled OLS, instrumental variables, or an inverse variance-weighted estimator, all of these estimators are weighted averages of experiment-specific estimates (and, in the case of pooled OLS, the biased between-experiments estimate as well). While the latter two unbiasedly aggregate treatment effect estimates, they are not necessarily estimating an “average treatment effect”. An alternative estimator that aims to estimate this average treatment effect is the Rubin (1981) model, which explicitly models between-experiment variation of treatment effects. See Meager (2019) for an application of this technique to field experiments, and Bland (2025) for an application related to laboratory experiments.
The result that a pooled estimator can be written as a weighted average of within estimators and a between estimator is not new. However, this paper contributes to the experimental economics literature by putting this result into the context of meta-analysis using decision-level data.
This paper also stresses the importance of understanding why there is between-experiment variation in average treatment conditions in a meta-study. If this variation is due to experimenters endogenously choosing their treatment conditions in response to expected behavior, then we should worry about the conclusions drawn from the pooled OLS model, and other methods, such as fixed effects, are preferable. On the other hand, if we understand that differences across experiments are not due to experimenters making such endogenous choices, then we need not worry about this bias.
While the remedies discussed here address the bias associated with experimenters choosing their own treatment conditions, they do not address other, perhaps more worrying, sources of bias potentially present in meta-analyses. For example, publication bias (e.g. Ioannidis et al., 2017) is a form of selection bias. If this source of bias is also present, then dropping all studies that do not individually answer the research question of the meta-study will not fully eliminate bias in the aggregated treatment effect. As such, the fixed effects and optimal weighting approaches discussed here are not magic bullets, and need to be applied with the understanding that publication bias may still be present.
In practice, adding an experiment that cannot individually estimate a treatment effect to a meta-study comes with a trade-off. On the one hand, if the experiment-average errors are correlated with average treatment variables, then we will introduce bias into our pooled estimator. On the other hand, there is objectively less information to learn from if we ignore such a study. Where there are few experiments available with variation in the explanatory variable of interest, reporting both the pooled OLS estimator and the weighted within-experiment estimator is perhaps advisable. In doing this, we report both a (potentially biased) “big picture” view of how the variables in our data are related, and a focused view of what we can learn from the experiments that can individually answer our research question.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/esa.2026.10031.
Acknowledgements
I would like to thank Andreas Ortmann and two anonymous referees for helpful comments that improved this paper.