Meta-analysis with Jeffreys priors: Empirical frequentist properties

Maya B. Mathur

doi:10.1017/rsm.2024.2

Meta-analysis with Jeffreys priors: Empirical frequentist properties

Published online by Cambridge University Press: 12 March 2025

Maya B. Mathur

Show author details

Maya B. Mathur*: Affiliation:
Quantitative Sciences Unit and Department of Pediatrics, Stanford University, Palo Alto, CA, USA
*: Email: mmathur@stanford.edu

Article contents

Abstract
Highlights
Introduction
Existing frequentist methods
Bayesian methods using Jeffreys priors
Simulation study
Applied example
Discussion
Author contributions
Competing interest statement
Data availability statement
Funding statement
Footnotes
References

Rights & Permissions

Abstract

In small meta-analyses (e.g., up to 20 studies), the best-performing frequentist methods can yield very wide confidence intervals for the meta-analytic mean, as well as biased and imprecise estimates of the heterogeneity. We investigate the frequentist performance of alternative Bayesian methods that use the invariant Jeffreys prior. This prior has the usual Bayesian motivation, but also has a purely frequentist motivation: the resulting posterior modes correspond to the established Firth bias correction of the maximum likelihood estimator. We consider two forms of the Jeffreys prior for random-effects meta-analysis: the previously established “Jeffreys1” prior treats the heterogeneity as a nuisance parameter, whereas the “Jeffreys2” prior treats both the mean and the heterogeneity as estimands of interest. In a large simulation study, we assess the performance of both Jeffreys priors, considering different types of Bayesian estimates and intervals. We assess point and interval estimation for both the mean and the heterogeneity parameters, comparing to the best-performing frequentist methods. For small meta-analyses of binary outcomes, the Jeffreys2 prior may offer advantages over standard frequentist methods for point and interval estimation of the mean parameter. In these cases, Jeffreys2 can substantially improve efficiency while more often showing nominal frequentist coverage. However, for small meta-analyses of continuous outcomes, standard frequentist methods seem to remain the best choices. The best-performing method for estimating the heterogeneity varied according to the heterogeneity itself. Röver & Friede’s R package bayesmeta implements both Jeffreys priors. We also generalize the Jeffreys2 prior to the case of meta-regression.

Keywords

meta-analysis bayesian simulation study Firth correction bayesian methods small-sample estimation simulations

Information

Type: Research Article
Information: Research Synthesis Methods , Volume 16 , Issue 1 , January 2025 , pp. 87 - 122

DOI: https://doi.org/10.1017/rsm.2024.2 [Opens in a new window]
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Open Practices: Open data Open materials
Copyright: © The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology

Highlights

• What is already known: The best-performing frequentist methods for random-effects meta-analysis can be highly imprecise in small meta-analyses, and can provide biased estimates of the heterogeneity.
• What is new: We conduct a large simulation study evaluating two forms of the Jeffreys prior for meta-analysis, which correspond to the Firth bias correction to the maximum likelihood estimator.
• Potential impact for RSM readers: For small meta-analyses of binary outcomes, the Jeffreys2 prior may offer advantages over standard frequentist methods for point and interval estimation for the mean parameter.

1 Introduction

Standard random-effects meta-analysis involves estimating the heterogeneity of studies’ population effects (e.g., their standard deviation) and obtaining an inverse-variance-weighted estimate of the meta-analytic mean, in which studies’ weights depend on the estimated heterogeneity.Reference DerSimonian and Laird ¹ Commonly used methods to estimate the heterogeneity include semiparametric method-of-moments estimatorsReference DerSimonian and Laird ¹ ^– Reference Raudenbush and Bryk ⁵ and parametric likelihood-based estimators.Reference DerSimonian and Laird ¹ ^, Reference Harville ⁶ The theoretical justification for these methods relies on asymptotics, yet in some scientific disciplines, the majority of meta-analyses include a relatively small number of studies. Meta-analyses of healthcare interventions in the Cochrane Database for Systematic Reviews include a median of only 3 studies (75th percentile: 6, 90th percentile: 10).Reference Langan, Higgins and Jackson ⁷ In psychology, meta-analyses published in Psychological Bulletin include a median of 12 studies, though some meta-analyses are much larger (75th percentile: 33, 90th percentile: 76).Reference Van Erp, Verhagen and Grasman ⁸ ^, Reference Röver ⁹

On the one hand, previous simulation studies indicate that even in very small meta-analyses (here defined as those with $\le 5$ studies), many existing methods provide nearly unbiased point estimates for the meta-analytic mean, termed $\mu $ .Reference Kontopantelis and Reeves ¹⁰ On the other hand, confidence intervals that are based on asymptotic normality (e.g., Wald intervals) can have less than nominal coverage in small meta-analyses ( $\le 20$ studies), and coverage can decline further in very small meta-analyses.Reference Langan, Higgins and Jackson ⁷ ^, Reference Langan, Higgins and Simmonds ¹¹ ^, Reference IntHout, Ioannidis and Borm ¹² Using Hartung–Knapp–Sidik–Jonkman’s (HKSJ) method to adjust standard errorsReference Sidik and Jonkman ¹³ ^, Reference Knapp and Hartung ¹⁴ can provide better-calibrated intervals in many settings, though existing simulation studies have yielded somewhat mixed findings regarding whether these intervals consistently achieve nominal coverage.Reference Langan, Higgins and Jackson ⁷ ^, Reference Langan, Higgins and Simmonds ¹¹ ^, Reference IntHout, Ioannidis and Borm ¹² ^, Reference Bodnar, Link and Arendacká ¹⁵ ^– Reference Seide, Röver and Friede ¹⁷ Moreover, such intervals can be extremely wide for meta-analyses of typical sample sizes.Reference Bodnar, Link and Arendacká ¹⁵ ^– Reference Michael, Thornton and Xie ¹⁸ For example, even when the true heterogeneity is zero, moments estimators with HKSJ standard errors yielded 95% confidence intervals with average widths of approximately 4–5 in simulated meta-analyses of 5 studies.Reference Michael, Thornton and Xie ¹⁸ This suggests that for a point estimate of 0.5 on the standardized mean difference scale, a typical confidence interval would be approximately $[-1.5, 2.5]$ , which is so wide that it might be considered uninformative. Additionally, standard point estimates for the heterogeneity can be substantially biased and imprecise in small meta-analyses.Reference Langan, Higgins and Jackson ⁷ ^, Reference Langan, Higgins and Simmonds ¹¹ Many existing simulation studies on heterogeneity estimation do not seem to have evaluated the coverage or width of confidence intervals for the heterogeneityReference Langan, Higgins and Simmonds ¹¹ (but see Viechtbauer (2007)Reference Viechtbauer ¹⁹ ).

In this paper, we investigate the frequentist performance of alternative Bayesian methods that use the invariant Jeffreys prior.Reference Jeffreys ²⁰ In general, Bayesian estimation proceeds by specifying a prior on the unknown parameters and obtaining the posterior of those parameters, given the observed data. This essentially involves updating the prior based on the likelihood of the observed data.Reference Gelman, Carlin and Stern ²¹ Various types of point estimates and credible intervals can then be obtained from the posterior. For an arbitrary distribution with unknown parameters $\boldsymbol {\Psi }$ and expected Fisher information $\mathcal {I}(\boldsymbol {\Psi })$ , the Jeffreys prior is proportional to $\sqrt { \text {det} \; \mathcal {I}(\boldsymbol {\Psi })}$ .Reference Jeffreys ²⁰ An original motivation for this prior was its invariance to transforming the parameters,Reference Jeffreys ²⁰ a property that does not hold for all priors.Reference Cencov ²² ^, Reference Datta and Ghosh ²³ ^, Footnote ⁱ For example, letting $\tau $ denote the standard deviation of studies’ population effects, the Jeffreys prior on $(\mu , \tau )$ is the same as the Jeffreys prior on $(\mu , \tau ^2)$ , so the resulting posterior estimates and intervals would not depend on the analyst’s arbitrary choice of parameterization. This desirable property has led some to describe the Jeffreys prior as “noninformative,” though we agree with others’ critiques of this term.Reference Kass and Wasserman ²⁴ ^, Reference Bernardo ²⁵

An interesting, underappreciated property of the Jeffreys prior is that the resulting posterior can alternatively be motivated from a solely frequentist perspective.Reference Firth ²⁶ In particular, it is well-known that the maximum likelihood (ML) estimate has an $O(n^{-1})$ bias, essentially due to the curvature of the score function.Reference Firth ²⁶ Firth (1993)Reference Firth ²⁶ showed that for exponential family distributions, an appropriate penalty on the likelihood to correct this bias coincides with estimation under the Jeffreys prior. This is essentially because the Jeffreys prior introduces a bias in the score function that compensates for the bias due to its curvature.Reference Firth ²⁶ In particular, the posterior mode under this prior can be viewed in frequentist terms as a bias-corrected ML estimate; consequently, the posterior mode under the Jeffreys prior has sometimes been termed the “Firth correction.” The Firth correction has demonstrated success in a number of frequentist estimation problems, and is used fairly often for logistic regression.Reference Firth ²⁶ ^– Reference Sartori ²⁹

Given the Jeffreys prior’s effectiveness as a bias-correction method in small samples, it seems plausible that using this prior in small meta-analyses might improve point and interval estimation. Bodnar et al. (2016, 2017)Reference Bodnar, Link and Arendacká ¹⁵ ^, Reference Bodnar, Link and Elster ³⁰ derived the Jeffreys prior on the heterogeneity $\tau $ alone (i.e., holding the mean $\mu $ constant), an approach that may be optimal if $\tau $ is strictly a nuisance parameter.Reference Bernardo ²⁵ Their simulations suggested that, along with an independent flat prior on $\mu $ , the resulting credible intervals may have better frequentist coverage than existing frequentist methods.Reference Bodnar, Link and Arendacká ¹⁵ We term this prior “Jeffreys1” because it is the prior with respect to a single parameter. Kosmidis et al. (2017)Reference Kosmidis, Guolo and Varin ³¹ independently derived a penalized likelihood correction that is equivalent to the single-parameter Jeffreys prior on $\mu $ alone; that is, treating $\mu $ rather than $\tau $ as a nuisance parameter. This penalization is closely related to the restricted ML (REML) estimator of $\tau $ .Reference Kosmidis, Guolo and Varin ³¹

In this paper, we consider the Jeffreys1 prior along with the two-parameter Jeffreys prior on both $\mu $ and $\tau $ . To the best of our knowledge, the latter has not appeared in the published literature on meta-analysis. We consider this prior, termed “Jeffreys2”, for several reasons. First, while the mean parameter is often of primary interest in meta-analysis, the heterogeneity should generally also be estimated and reported, so it may not be optimal to treat $\tau $ as a nuisance parameter.Reference Schünemann, Higgins, Vist, Higgins, Thomas, Chandler, Cumpston, Li, Page and Welch ³² Second, in other small-sample estimation problems, multiparameter Jeffreys priors that include scale parameters (e.g., the dispersion parameter in exponential-family models) have been proposed and have good empirical properties.Reference Firth ²⁶ ^, Reference Zhou, Giacometti and Fabozzi ²⁸ ^, Reference Alam, Rahman and Bari ³³ (We return to this issue in Section 3.3.) In the context of adjusting for p-hacking in meta-analyses by meta-analyzing only a truncated part of the random-effects distribution, we recently found that a Jeffreys prior on $\mu $ and $\tau $ performed considerably better than ML,Reference Mathur ³⁴ whose performance is remarkably poor for truncated distributions in general.Reference Zhou, Giacometti and Fabozzi ²⁸ ^, Reference Cope ³⁵ Third, as we will discuss, the shape of the Jeffreys2 prior suggests it might provide more precise intervals than the Jeffreys1 prior. Whether Jeffreys2 credible intervals show nominal frequentist coverage, and whether point estimation for $\mu $ and $\tau $ performs well, are open questions.

Previous simulation studies of Jeffreys priors in meta-analysis have provided promising preliminary results, but do have limitations. Those simulations investigated only the Jeffreys1 prior, but not Jeffreys2, and have considered point and interval estimation for $\mu $ , but not $\tau $ .Reference Bodnar, Link and Arendacká ¹⁵ In this paper, we present a simulation study comparing the frequentist properties of point and interval estimation for both $\mu $ and $\tau $ under the Jeffreys1 and Jeffreys2 priors, as well as several of the best-performing frequentist methods. Using a simulation design that closely paralleled a recent, extensive simulation study by Langan et al. (2019)Reference Langan, Higgins and Jackson ⁷ , we substantially expanded on the range of comparison methods and simulation scenarios used in previous simulation studies of the Jeffreys1 prior. Previous simulations regarding the Jeffreys1 prior considered only posterior means for point estimation,Reference Bodnar, Link and Arendacká ¹⁵ whereas the aforementioned bias-correction properties specifically apply to posterior modes. This may be especially relevant for point estimation of $\tau $ , whose posterior is highly asymmetric. We therefore consider three types of Bayesian point estimates (the posterior mode, mean, and median) as well as two types of credible intervals (central and shortest). Our simulations include the best-performing methods in Langan et al.’s (2019)Reference Langan, Higgins and Jackson ⁷ simulation study, along with several other methods whose theoretical properties suggest they might also perform well, such as exact intervalsReference Michael, Thornton and Xie ¹⁸ and intervals based on the profile likelihood.Reference Harville ⁶

This paper is organized as follows. We briefly review existing moments and likelihood-based estimators for random-effects meta-analysis (Section 2), all of which have been covered in more detail elsewhere.Reference Harville ⁶ ^, Reference Michael, Thornton and Xie ¹⁸ ^, Reference Veroniki, Jackson, Viechtbauer, Bender, Bowden, Knapp and Salanti ³⁶ We also briefly review existing simulation results regarding these methods (Section 2.4). We review the established form of the Jeffreys1 priorReference Bodnar, Link and Arendacká ¹⁵ and derive the form of the Jeffreys2 prior; we then discuss posterior estimation under both priors (Section 3). We present the simulation study (Section 4) and a brief applied example (Section 5), and conclude with a general discussion.

2 Existing frequentist methods

2.1 Method-of-moments estimators

Moments estimators for meta-analysis are semiparametric; they involve specifying only the first two moments of the distribution of population effects, namely $\mu $ and $\tau ^2$ . Because these methods do not require specifying the higher moments, they do not requiring assuming that population effects are normal. Specifically, consider k studies whose population effects, $\mu _i$ , have expectation $\mu $ and variance $\tau ^2$ . These two moments are the usual meta-analytic estimands of interest. Let $\widehat {\theta }_i$ and $\sigma _i$ respectively denote the point estimate and standard error of the ith study, such that $\widehat {\theta }_i \sim N \left ( \mu _i, \sigma _i^2 \right )$ holds approximately. The within-study standard errors $\sigma _i$ are generally treated as fixed and known.

For a given estimate of the heterogeneity variance, $\widehat {\tau }^2$ , the estimated marginal variance of $\widehat {\theta }_i$ is $\widehat {\tau }^2 + \sigma _i^2$ . The uniformly minimum variance unbiased estimator (UMVUE) of $\mu $ arises from weighting studies by the inverse of their estimated marginal variances,Reference Harville ⁶ denoted $w_i = 1 / (\widehat {\tau }^2 + \sigma _i^2)$ :

$$ \begin{align*} \widehat{\mu} &= \frac{\sum_{i=1}^k w_i \widehat{\theta}_i}{\sum_{i=1}^k w_i}. \end{align*} $$

The various moments estimators are distinguished by their estimators for $\tau ^2$ , and hence the form of the weights $w_i$ . Detailed reviewsReference Langan, Higgins and Jackson ⁷ ^, Reference Veroniki, Jackson, Viechtbauer, Bender, Bowden, Knapp and Salanti ³⁶ ^, Reference Viechtbauer ³⁷ and original papers on these approaches are available, so here we summarize briefly. Moments estimators for $\tau ^2$ are based on the generalized Q-statistic:

(1)

$$ \begin{align} Q &= \sum_{i=1}^k a_i (\widehat{\theta}_i - \widehat{\mu})^2, \end{align} $$

where the form of the coefficients, $a_i$ , differs across moments estimators. For example, the traditional Dersimonian–Laird estimator (DL)Reference DerSimonian and Laird ¹ sets $a_i = 1 / \sigma _i^2$ . The two-step DL estimator (DL2)Reference DerSimonian and Kacker ² instead sets $a_i = 1 / (\widehat {\tau }^2_{DL} + \sigma _i^2)$ , where $\widehat {\tau }^2_{DL}$ is an initial estimate obtained using the DL estimator. The Paule–Mandel (PM)Reference Paule and Mandel ³ ^, Reference van Aert and Jackson ⁴ estimator can be viewed as a limiting case of DL2, involving iteration over the estimates $\widehat {\mu }$ and $\widehat {\tau }^2$ until convergence. This estimator is also equivalent to the empirical Bayes estimator.Reference Raudenbush and Bryk ⁵ In general terms, empirical Bayes estimation uses the observed data to estimate the parameters of the Bayesian prior, rather than specifying the prior independently of the data.Reference Gelman, Carlin and Stern ²¹ In the context of meta-analysis, the empirical Bayes estimator essentially estimates the distribution of population effects by their posterior means, with the prior determined empirically.Reference Raudenbush and Bryk ⁵

2.2 Likelihood-based estimators

In contrast to moments estimators, commonly used likelihood-based estimators assume that the population effects, $\mu _i$ , arise independently from the distribution $\mu _i \sim N \left ( \mu , \tau ^2 \right )$ . Thus, the marginal distribution of studies’ point estimates, $\widehat {\theta }_i$ , is $\widehat {\theta }_i \sim N \left ( \mu , \tau ^2 + \sigma _i^2 \right )$ . We denote the k-vector of point estimates as $\widehat {\boldsymbol {\theta }}$ . Letting $S_i(\tau ) = \sqrt {\tau ^2 + \sigma _i^2}$ be the true marginal standard deviation of the ith study, the joint likelihood is:

(2)

The standard ML estimator for $\tau $ is obtained as usual by solving $\frac {\partial }{\partial \tau } \; \log p ( \widehat {\boldsymbol {\theta }} \mid \mu , \tau ) = 0$ , whose solution depends on $\mu $ .Reference Harville ⁶ Since this estimator does not take into account the loss in degrees of freedom due to the additional estimation of $\mu $ itself, the resulting estimate is often negatively biased.Reference Harville ⁶ This issue motivates REML estimation, which can improve upon ML estimation by transforming the log-likelihood to remove the parameter $\mu $ .Reference Harville ⁶

2.3 Interval estimation

A simple Wald confidence interval can be obtained by assuming $\widehat {\mu }$ is normally distributed, which holds asymptotically in k by standard ML properties. If the weights $w_i$ are treated as known rather than estimated, we have $\widehat {\text {Var}}(\widehat {\mu }) = 1 / \sum _{i=1}^k w_i$ . A Wald 95% confidence interval is:

$$ \begin{align*} \widehat{\mu} &\pm c \sqrt{\widehat{\text{Var}}(\widehat{\mu})}, \end{align*} $$

where $c = \Phi ^{-1}(0.975) \approx 1.96$ is the critical value of the standard normal distribution. However, Wald intervals exhibit substantial under-coverage for small meta-analyses, both because the normal approximation holds only asymptotically and because the approximation $\widehat {\text {Var}}(\widehat {\mu }) = 1 / \sum _{i=1}^k w_i$ does not account for the estimation of $\tau ^2$ .Reference Langan, Higgins and Jackson ⁷ ^, Reference Langan, Higgins and Simmonds ¹¹ ^, Reference IntHout, Ioannidis and Borm ¹² Wald intervals can also be constructed for $\widehat {\tau }$ , but exhibit similarly poor performance.Reference Viechtbauer ¹⁹ We therefore do not further discuss Wald intervals, focusing instead on the better-performing alternatives discussed below.

Regarding interval estimation for $\mu $ , the alternative HKSJ, sometimes called “Knapp–Hartung,” interval addresses the limitations of the Wald interval.Reference Sidik and Jonkman ¹³ ^, Reference Knapp and Hartung ¹⁴ This method more flexibly assumes that $\widehat {\mu }$ follows a t distribution and additionally rescales $\widehat {\text {Var}}(\widehat {\mu })$ to account for the estimation of $\tau ^2$ in the weights $w_i$ :

$$ \begin{align*} \widehat{\text{Var}}(\widehat{\mu}) = \frac{\sum_{i=1}^k w_i (\widehat{\theta}_i - \widehat{\mu})^2 }{(k-1) \sum_{i=1}^k w_i }. \end{align*} $$

For $\tau $ , improved intervals can be constructed using the chi-square distribution of the Q statistic, per Eq. (1).Reference Viechtbauer ¹⁹ These “Q-profile” intervals substantially outperform Wald intervals.Reference Viechtbauer ¹⁹ For both $\mu $ and $\tau $ , ML profile intervals can also be constructed in the usual way.Reference Harville ⁶

An interesting, relatively new approach provides exact rather than asymptotic intervals and is theoretically guaranteed to provide more than nominal coverage, under the assumption of normal population effects.Reference Michael, Thornton and Xie ¹⁸ This method essentially involves inverting exact tests. Other parametric methods provide finite-sample corrections to the likelihood ratio test statistic; these include Skovgaard’s second-order correction and Bartlett’s correction.Reference Huizenga, Visser and Dolan ³⁸ ^– Reference Guolo and Varin ⁴⁰ These methods can improve upon basic likelihood methods for hypothesis testing,Reference Guolo and Varin ⁴⁰ but Skovgaard’s second-order correction was not designed for interval estimation and can be numerically unstable in this context.Reference Kosmidis, Guolo and Varin ³¹ Interval estimation with Bartlett’s correction is possible,Reference Noma ⁴¹ but is not implemented in existing software (I. Visser, personal communication, 8 July 2024).Reference Hilde and Ingmar ⁴² ^, Reference Veroniki, Jackson and Bender ⁴³ Because our focus is on interval estimation rather than testing, our simulations do not include Skovgaard’s or Bartlett’s corrections. Finally, various parametric or nonparametric resampling methods can be used to obtain bootstrapped confidence intervals.Reference Viechtbauer ¹⁹ ^, Reference Veroniki, Jackson and Bender ⁴³ ^, Reference Van Den Noortgate and Onghena ⁴⁴ Nonparametric resampling can be conducted by resampling rows with replacement, after which one can obtain simple percentile bootstrap intervals or bias-corrected and accelerated (BCa) intervals, among many other types of bootstrap intervals.Reference Efron ⁴⁵ ^, Reference Carpenter and Bithell ⁴⁶ The BCa confidence corrects for bias and skewness in the bootstrapped sampling distribution, which we speculate could be helpful when estimating the sampling distribution of $\tau $ . The BCa bootstrap has performed relatively well for certain meta-analytic estimators that are functions of $\widehat {\tau }$ .Reference Mathur and VanderWeele ⁴⁷ However, bootstrapping is an asymptotic procedure whose finite-sample performance typically must be assessed through simulations.

2.4 Existing simulations comparing these methods

Langan et al. (2017)Reference Langan, Higgins and Simmonds ¹¹ provide an excellent systematic review of simulation studies for different heterogeneity estimators.Reference Langan, Higgins and Jackson ⁷ Briefly, the DL estimator was negatively biased for $\tau $ when heterogeneity was moderate to high, and the PM estimator was typically less biased.Reference Langan, Higgins and Simmonds ¹¹ The reviewed studies do not appear to have assessed interval estimation for $\tau $ . Based on their own, more extensive simulation study, Langan et al. (2019)Reference Langan, Higgins and Jackson ⁷ generally recommend REML, PM, or DL2 for heterogeneity estimation, along with HKSJ confidence intervals for $\mu $ ; however, they recommend caution in interpreting heterogeneity estimates in small meta-analyses.

Langan et al.’s (2019)Reference Langan, Higgins and Jackson ⁷ simulation study did not assess intervals based on the profile likelihood, bootstrapping, or the exact method; the latter was developed only recently. Regarding profile intervals, recommendations in the literature are inconsistent. A prominent paper stated that “the profile likelihood is a good method for computing confidence intervals”.Reference Cornell, Mulrow and Localio ⁴⁸ One simulation study seemed to support this recommendation, finding that when the heterogeneity is greater than zero, profile likelihood intervals showed the closest to nominal coverage.Reference Kontopantelis and Reeves ¹⁰ On the other hand, another simulation study suggested that profile intervals often exhibited under-coverage for meta-analyses of only 5 studies.Reference Guolo ³⁹ The originators of the exact method provide simulations suggesting that the resulting intervals are not substantially wider than those of existing methods, despite the method’s theoretical guarantee of at least nominal coverage.Reference Michael, Thornton and Xie ¹⁸ While our simulation study is primarily motivated by investigating the Jeffreys methods, a secondary contribution is to more extensively evaluate profile, bootstrap, and exact intervals. We now turn to establishing the theory for the Jeffreys1 and Jeffreys2 priors.

3 Bayesian methods using Jeffreys priors

3.1 The Jeffreys priors

Under the assumption of normal population effects, Bodnar et al. (2017)Reference Bodnar, Link and Arendacká ¹⁵ showed that the improper Jeffreys1 prior is:

$$ \begin{align*} p \left( \mu, \tau \right) \propto \tau \sqrt{\sum_{i=1}^k \left(S_i(\tau)\right)^{-4}} \end{align*} $$

where, again, $S_i(\tau ) = \sqrt {\tau ^2 + \sigma _i^2}$ . Since this prior is independent of $\mu $ , it can be expressed as two independent priors on $\mu $ and $\tau $ , where the prior on $\mu $ is uniform:

(3)

$$ \begin{align} p \left( \mu, \tau \right) &\propto p(\mu) \; p(\tau), \text{ where } p(\mu) \propto 1 \text{ and } p(\tau) \propto \tau \sqrt{\sum_{i=1}^k \left(S_i(\tau)\right)^{-4}}. \end{align} $$

If $\mu $ is treated as the only parameter of interest and $\tau $ is considered a nuisance parameter, then the Jeffreys1 prior also coincides with the Berger–Bernardo reference prior.Reference Bodnar, Link and Elster ³⁰ In general, the Berger–Bernardo prior for a given distribution is designed to be maximally “noninformative” in the sense of minimizing the amount of information provided by the prior and maximizing the amount of information provided by the data.Reference Bodnar, Link and Elster ³⁰ ^, Reference Berger and Bernardo ⁴⁹ Specifically, this prior maximizes the Kullback–Liebler divergence between the prior and the posterior.Reference Berger and Bernardo ⁴⁹

Regarding the Jeffreys2 prior, the joint likelihood in Eq. (2) implies that the entries of the expected Fisher information are:

$$ \begin{align*} k_{\mu\mu} := E \Bigg[ \frac{\partial^2 \ell}{\partial\mu^2} \Bigg] = -\sum_{i=1}^k \left(S_i(\tau)\right)^{-2}, \; \; \; k_{\tau \tau} := E \Bigg[ \frac{\partial^2 \ell}{\partial \tau^2} \Bigg] = -2 \tau^2 \sum_{i=1}^k \left(S_i(\tau)\right)^{-4}, \; \; \; k_{\mu \tau} := E \Bigg[ \frac{\partial^2 \ell}{\partial\mu\partial \tau} \Bigg] = 0 \end{align*} $$

where $\ell $ is the likelihood function. Therefore, the Jeffreys2 prior is $p \left ( \mu , \tau \right ) \propto \sqrt { k_{\mu \mu } k_{\tau \tau } }$ . (This result is straightforward to show directly, or alternatively can be viewed as a simple special case of the prior given in Mathur (2024).Reference Mathur ³⁴ ^, Footnote ⁱⁱ This yields the improper two-parameter prior:

$$ \begin{align*} p \left( \mu, \tau \right) &\propto \tau \sqrt{\left( \sum_{i=1}^k \left(S_i(\tau)\right)^{-2} \right) \left( \sum_{i=1}^k \left(S_i(\tau)\right)^{-4} \right)}. \end{align*} $$

Like the Jeffreys1 prior, the Jeffreys2 prior can be expressed as:

(4)

$$ \begin{align} p \left( \mu, \tau \right) &\propto p(\mu) \; p(\tau), \text{ where } p(\mu) \propto 1 \text{ and } p(\tau) \propto \tau \sqrt{\left( \sum_{i=1}^k \left(S_i(\tau)\right)^{-2} \right) \left( \sum_{i=1}^k \left(S_i(\tau)\right)^{-4} \right)}. \end{align} $$

To illustrate, Figure 1 shows both priors on $\tau $ for four meta-analyses of standardized mean differences. The meta-analyses were simulated with studies’ sample sizes, N, arising from four different distributions. Although the magnitude of the priors will of course be affected by the number of studies k, their shape is minimally affected by k, so Figure 1 depicts the prior for meta-analyses with $k=10$ . Note that for each meta-analysis, the Jeffreys2 prior is somewhat narrower than the Jeffreys1 prior, suggesting that the former may provide narrower intervals; this hypothesis will be explored in more depth in the simulation study (Section 4). Both priors lead to proper posteriors if $k> 1$ (see Bodnar (2017)Reference Bodnar, Link and Arendacká ¹⁵ regarding Jeffreys1 and the present Section 1 of the Supplementary Material, regarding Jeffreys2). Additionally, both priors generalize easily to the case of meta-regression: the Jeffreys1 prior would coincide with that of Bodnar et al. (2024) for generalized marginal random effects models,Reference Bodnar and Bodnar ⁵⁰ and we derive the Jeffreys2 prior for meta-regression in Section 1 of the Supplementary Material. We do not further consider meta-regression in the main text.

Figure 1

Priors for four simulated meta-analyses of standardized mean differences ( $k=10$ ), in which the within-study sample sizes (N) were generated from four possible distributions. Studies’ standard errors were estimated using Eq. (5) and, given the data-generation parameters, were approximately equal to $2/\sqrt {N}$ . Points are the maxima. The priors have been scaled to have the same maximum height.

3.2 The posterior under each prior

For either prior, since $p(\mu , \tau ) \propto p(\tau )$ , the marginal posterior on $\tau $ is:Reference Bodnar, Link and Arendacká ¹⁵

In turn, the conditional posterior of $\mu $ , given $\tau $ , is normal:Reference Röver ⁹ ^, Reference Bodnar, Link and Arendacká ¹⁵ ^, Reference Gelman, Carlin and Stern ²¹

$$ \begin{align*} p(\mu \mid \tau, \widehat{\boldsymbol{\theta}}) &\propto \frac{1}{\sqrt{2\pi \; \text{Var}(\mu \mid \tau, \widehat{\boldsymbol{\theta}})} } \cdot \exp \Bigg\{ -\frac{1}{2 \; \text{Var}(\mu \mid \tau, \widehat{\boldsymbol{\theta}})} \left( \mu - E[ \mu \mid \tau, \widehat{\boldsymbol{\theta}}] \right)^2 \Bigg\}, \end{align*} $$

where:

$$ \begin{align*} E[ \mu \mid \tau, \widehat{\boldsymbol{\theta}}] = \frac{\sum_{i=1}^k \widehat{\theta}_i \left(S_i(\tau)\right)^{-2} }{\sum_{i=1}^k \left(S_i(\tau)\right)^{-2}}, \; \; \text{Var}(\mu \mid \tau, \widehat{\boldsymbol{\theta}}) &= \frac{1}{\sum_{i=1}^k \left(S_i(\tau)\right)^{-2}}. \end{align*} $$

Thus, the joint posterior $p(\mu , \tau \mid \widehat {\boldsymbol {\theta }})$ can be decomposed into the two tractable components $p(\tau \mid \widehat {\boldsymbol {\theta }})$ and $p(\mu \mid \tau , \widehat {\boldsymbol {\theta }})$ .Reference Röver ⁹ Given this observation, Röver and othersReference Röver ⁹ ^, Reference Röver and Friede ⁵¹ developed theory and software for a discrete approximation to the joint posterior $p(\mu , \tau \mid \widehat {\boldsymbol {\theta }})$ and the marginal posterior on $\mu $ , given by the mixture distribution:

$$ \begin{align*} p(\mu \mid \widehat{\boldsymbol{\theta}}) &\propto \int_{0}^{\infty} p(\mu \mid \tau, \widehat{\boldsymbol{\theta}}) \; p(\tau \mid \widehat{\boldsymbol{\theta}}) \; d\tau. \end{align*} $$

The discrete approximation approach does not require sampling via mixed-chain Monte Carlo (MCMC) and is implemented in the R package bayesmeta.Reference Röver ⁹ ^, Reference Röver and Friede ⁵¹ We use this package in our simulations and applied example.

With approximations to the joint and marginal posteriors in hand, point estimates can be defined in terms of various measures of central tendency, such as the posterior mode, median, or mean. For either prior, $p(\mu \mid \widehat {\boldsymbol {\theta }})$ appears to be nearly symmetric in many cases (e.g., Figure 4), so the three measures of central tendency will often agree closely. However, this is not the case for $p(\tau \mid \widehat {\boldsymbol {\theta }})$ , which is asymmetric under either prior. Existing work on the Jeffreys1 prior focused primarily on posterior means and medians,Reference Bodnar, Link and Arendacká ¹⁵ but we focus on posterior modes given their aforementioned theoretical advantages.Reference Firth ²⁶ Indeed, as discussed in Section 4.4, our simulations indicated that posterior modes for $\tau $ provided substantially lower bias, root mean square error (RMSE), and mean absolute error (MAE) than did posterior means and medians. As in ML estimation, point estimates can be defined either in terms of the marginal or the joint mode. In the Bayesian context, the marginal mode represents the value of a given parameter (e.g., $\mu $ ) that maximizes the posterior for that parameter alone, marginalizing over the other parameter (e.g., $\tau $ ). In contrast, the joint mode represents the values of both parameters that jointly maximize the joint posterior. We consider marginal modes in this paper to provide a more direct comparison to marginal ML estimation, which is the usual implementation for meta-analysis.

Figure 2

Priors on $\tau $ for the meta-analysis on all-cause death ( $k=3, \{ \sigma _i \} = \{ 1.15, 1.63, 0.19 \}$ ). Points are maxima. The priors have been scaled to have the same maximum height.

Figure 3

Joint posterior under the Jeffreys2 prior for the meta-analysis on all-cause death ( $k=3, \{ \sigma _i \} = \{ 1.15, 1.63, 0.19 \}$ ). Horizontal red line: marginal posterior mode of $\mu $ . Vertical blue line: marginal posterior mode of $\tau $ .

Figure 4

Marginal posteriors under the Jeffreys2 prior for the meta-analysis on all-cause death. Solid vertical lines: marginal posterior modes. Dashed vertical lines: limits of 95% intervals.

Figure 5

Interval limits greater than $RR=10$ are truncated. The exact method does not yield point estimates. CI: credible interval.

Also analogously to ML estimation, symmetric Wald credible intervals are sometimes constructed for Bayesian estimates by approximating the posterior as asymptotically normal around the posterior mode, with a variance–covariance matrix equal to the inverse of the Hessian of the negative log-posterior evaluated at the posterior mode.Reference Gelman, Carlin and Stern ²¹ However, just as Wald intervals around the ML estimate can perform poorly if the likelihood is asymmetric, Wald intervals around the posterior mode can likewise perform poorly if the posterior is asymmetric.Reference Greenland ⁵² To obtain appropriately asymmetric posterior intervals, we consider two approaches. First, a central (also called “equal-tailed”) 95% posterior quantile interval can be obtained by taking the 2.5th and 97.5th quantiles of the estimated posterior distribution. Second, the shortest possible 95% posterior quantile interval can be obtained numerically; this interval is equivalent to a highest posterior density interval for unimodal distributions.Reference Gelman, Carlin and Stern ²¹ In our simulations and applied example, we obtain both types of intervals from the R package bayesmeta.Reference Röver ⁹

3.3 Theoretical and substantive distinctions between the priors

The distinction between the Jeffreys1 and Jeffreys2 priors invokes theoretical and substantive considerations that pertain in general to multiparameter Jeffreys priors. Jeffreys and others have argued that multiparameter Jeffreys priors are appropriate if one wishes to estimate all of the parameters (i.e., both $\mu $ and $\tau $ in meta-analysis), but not if one wishes to estimate only a subset of the parameters (i.e., only $\mu $ ), with the others treated as nuisance parameters.Reference Kass and Wasserman ²⁴ ^, Reference Bernardo ²⁵ ^, Reference Jeffreys ⁵³ As noted in the Introduction, a random-effects meta-analysis should generally involve estimation and reporting of $\tau $ (or related metricsReference Schünemann, Higgins, Vist, Higgins, Thomas, Chandler, Cumpston, Li, Page and Welch ³² ^, Reference Riley, Higgins and Deeks ⁵⁴ ^, Reference Mathur and VanderWeele ⁵⁵ ) in addition to $\mu $ , which suggests consideration of the Jeffreys2 prior. On the other hand, in general location-scale problems, Jeffreys recommended obtaining the prior with respect to only the scale parameters, holding constant the location parameters.Reference Kass and Wasserman ²⁴ ^, Reference Jeffreys ⁵³ This would correspond to the Jeffreys1 prior. Jeffreys’ recommendation was motivated by problems that can arise when the number of location parameters increases with the sample size, similarly to the well-known Neyman–Scott problem in which the ML estimator fails to be consistent.Reference Kass and Wasserman ²⁴ ^, Reference Jeffreys ⁵³ Interestingly, Firth later showed that in a specific, severe version of the Neyman–Scott problem, the multiparameter Jeffreys prior (i.e., the Firth correction) in fact leads to a consistent and exactly unbiased estimator.Reference Firth ²⁶ This was unexpected given that the asymptotic arguments justifying the Firth correction are violated with an increasing number of parameters.Reference Firth ²⁶ Of course, in the present setting of random-effects meta-analysis, the number of parameters is fixed, so this potential issue does not arise in the first place. Our view is that existing substantive and theoretical considerations do not clearly rule out either prior as inappropriate for random-effects meta-analysis, so our simulation study evaluates both.

4 Simulation study

We designed the simulation study to closely parallel that of Langan et al. (2019),Reference Langan, Higgins and Jackson ⁷ which in turn was designed to address many of the limitations of previous simulation studies.Reference Langan, Higgins and Simmonds ¹¹ As detailed below, we considered meta-analyses with binary outcomes (with effect sizes on the log-odds ratio scale) and with continuous outcomes (with effect sizes on the Hedges’ g scaleReference Hedges ⁵⁶ ), with as few as 2 studies, with varying amounts of heterogeneity, with varying means and outcome probabilities (for binary outcomes), and with varying distributions of within-study sample sizes. Because we assessed a variety of parametric, semiparametric, and nonparametric methods, we preliminarily investigated robustness to parametric misspecification by considering exponentially distributed population effects in addition to normally distributed effects.

4.1 Point and interval estimation methods

Table 1 lists the methods assessed in our simulation study. We assessed both Jeffreys priors. For point estimation under each prior, we primarily considered marginal posterior modes but secondarily investigated posterior means and medians (Section 2.2 of the Supplementary Material). Regarding interval estimation for $\mu $ , central and shortest intervals were generally quite similar, so we only show results for shortest intervals. Regarding interval estimation for $\tau $ , we consider both types of intervals for each prior, termed “Jeffreys1-shortest,” “Jeffreys1-central,” “Jeffreys2-shortest,” and “Jeffreys2-central.”

Table 1

Methods assessed in simulation study

$^{a}$ In pilot tests for the scenarios with $k=10$ , the bootstrap methods were not competitive with other methods, so these computationally intensive methods were not run for other sample sizes.

We compared the performance of both Jeffreys priors to that of several existing frequentist methods that were described in Section 2. We selected methods that have performed well in existing, large simulation studies or that have desirable theoretical properties, such as providing appropriately asymmetric intervals for $\tau $ .Reference Harville ⁶ ^, Reference Langan, Higgins and Jackson ⁷ ^, Reference Michael, Thornton and Xie ¹⁸ ^, Reference Guolo ³⁹ ^, Reference Cornell, Mulrow and Localio ⁴⁸ ^, Reference Brockwell and Gordon ⁵⁷ For point estimation, the comparison methods were ML estimation, REML, DL, DL2, and PM. Regarding interval estimation for $\mu $ , we considered HKSJ intervals for each frequentist estimation method, ML profile intervals (ML-profile), exact intervals,Reference Michael, Thornton and Xie ¹⁸ nonparametric BCa bootstrap intervals, and nonparametric percentile bootstrap intervals.Reference Efron ⁴⁵ ^, Reference Carpenter and Bithell ⁴⁶ Regarding interval estimation for $\tau $ , we considered Q-profile intervals for each frequentist estimation method, as well as ML-profile and both bootstrap intervals. We implemented all frequentist methods and intervals using the R package metafor Reference Viechtbauer ⁵⁸ with the following exceptions: we implemented ML-profile using custom R code, the exact method using the R package rma.exact,Reference Michael, Thornton and Xie ¹⁸ and the bootstrap methods using the R package boot.Reference Canty and Boot ⁵⁹

4.2 Data generation

Table 2 summarizes the simulation parameters we manipulated, which were similar to those of Langan et al.’s (2019) simulation study.Reference Langan, Higgins and Jackson ⁷ We considered continuous outcomes with point estimates on the Hedges’ g scaleReference Hedges ⁵⁶ as well as binary outcomes with point estimates on the log-odds ratio scale. We considered both normally distributed and exponentially distributed population effects; in the latter case, the assumptions for all point estimators except the moments estimators were violated. Statistical theory suggests that all methods would perform comparably for very large meta-analyses with normal effects, and accordingly our focus is on point and interval estimation for smaller meta-analyses ( $k \le 20$ ). Our primary simulations reported in the main text are those with $k \in \{ 2, 3, 5, 10, 20\}$ . We additionally ran simulations with $k=100$ to confirm asymptotic behavior (Section 3 of the Supplementary Material). Because the bootstrap intervals required much more computational time than the other methods, we first pilot-tested them in all scenarios with a single sample size ( $k=10$ ) to assess whether these methods were competitive with other methods.

Table 2

Possible values of simulation parameters

$^{a}$ Results for scenarios with $k=100$ appear in the Supplementary Material; these scenarios are excluded from aggregated results in the main text.

Data generation proceeded as follows. For each simulation iterate, we generated a meta-analysis whose underlying population effects ( $\mu _i$ ) were either normal or exponential. Normal population effects were generated as $\mu _i \sim N \left ( \mu , \tau ^2 \right )$ , where we varied $\mu $ and $\tau $ as indicated in Table 2. Exponential population effects were generated from an appropriately scaled and shifted distribution to achieve the desired population moments, $\mu $ and $\tau ^2$ . For each study in the meta-analysis, we generated a total sample size N from one of the four distributions listed in Table 2. We then simulated individual participant data, such that $N/2$ participants were allocated to a treatment group, and the other $N/2$ to a control group. In scenarios with a continuous outcome, we simulated outcomes with a mean of 0 in the control group and $\mu _i$ in the treatment group, and with a standard deviation of 1 within each group. We then estimated the standardized mean difference using the Hedges’ g correction.Reference Hedges ⁵⁶ ^, Reference Viechtbauer ⁵⁸ We used the standard large-sample approximation for studies’ standard errors (Eq. (8) in Hedges (1982)Reference Hedges ⁶⁰ ):

(5)

$$ \begin{align} \widehat{\sigma}_i = \sqrt{ \frac{N^c + N^t}{N^c N^t} + \frac{\widehat{\theta}_i^2}{2(N^c + N^t)} } = \sqrt{ \frac{8 + \widehat{\theta}_i^2}{2N} } \end{align} $$

where $N^c$ and $N^t$ are the within-group sample sizes, which were both equal to $N/2$ in our simulations. The expectation of this estimator is approximately $\sqrt {(8 + \mu ^2)/2N}$ .

In scenarios with a binary outcome, we simulated outcomes from a logistic model such that:

$$ \begin{align*} P(Y = 1 \mid X) = \text{expit} \Big\{ \text{logit}\{ P(Y = 1 \mid X = 0) \} + \mu_i X \Big\} \end{align*} $$

where $P(Y = 1 \mid X = 0)$ was a scenario parameter that we manipulated among the values listed in Table 2. We then estimated the odds ratio; to handle potential zero cell counts when present, we added 0.5 to each table cell when any cells had a count of zero.Reference Viechtbauer ⁵⁸

We expected that for binary outcomes and small within-study sample sizes, certain extreme combinations of scenario parameters (e.g., $N=40$ and $\mu =2.3$ , corresponding to an extreme odds ratio of 10) would result in biased within-study odds ratios.Reference Firth ²⁶ ^, Reference Nemes, Jonasson and Genell ⁶¹ In pilot simulations, we identified combinations of scenario parameters that resulted in within-study absolute bias of greater than 0.05. We excluded these combinations of scenario parameters since our focus is on bias arising from meta-analytic estimation methods rather than from within-study bias. After excluding these combinations of simulation parameters, we ultimately simulated 240 unique scenarios for continuous outcomes and 2267 for binary outcomes.

4.3 Performance metrics

For each scenario, we assessed the point estimators’ performance and variability in terms of their bias, MAE, and RMSE, defined in the usual frequentist sense. That is, for a generic parameter $\omega _r$ that varies across 500 simulation iterates, r:

$$ \begin{align*} \text{Bias} &= \frac{1}{500} \sum_{r=1}^{500} \left( \widehat{\omega}_r - \omega_r \right) \\ \text{MAE} &= \frac{1}{500} \sum_{r=1}^{500} \vert \widehat{\omega}_r - \omega_r \vert \\ \text{RMSE} &= \left( \frac{1}{500} \sum_{r=1}^{500} \left( \widehat{\omega}_r - \omega_r \right)^2 \right)^{1/2}. \end{align*} $$

For each scenario, we assessed interval estimation in terms of the frequentist coverage and width of 95% confidence or credible intervals. Some methods’ intervals exhibited over-coverage in some scenarios but exhibited under-coverage in others. Therefore, when aggregating results across scenarios, we also consider the percentage of scenarios in which each method achieved approximately nominal coverage, defined stringently as having coverage $>$ 94%. In the Discussion, we expand upon our reasons for assessing frequentist properties of Bayesian methods, and the implications of this approach. We did not assess statistical power. Although p-values can certainly be useful when interpreted as continuous measures of evidence, we concur with others’ longstanding concerns about bright-line significance testing,Reference Rothman ⁶² ^, Reference McShane, Gal and Gelman ⁶³ a practice that has contributed to striking misinterpretations of published meta-analysesReference Mathur and VanderWeele ⁵⁵ ^, Reference Mathur and VanderWeele ⁶⁴ and likely also to publication bias.

4.4 Results

Given the large number of scenarios, some aggregation is necessary to display the results compactly. In the main text, we provide line plots that stratify by k, $\tau $ , the distribution of population effects, and the outcome type, and that aggregate over distributions of N and, for binary outcomes, over $\mu $ and $P(Y=1 \mid X=0)$ . Because the direction of a given estimator’s bias could differ across scenarios, we depict each estimator’s bias across scenarios using boxplots instead of line plots to avoid any aggregation across scenarios. For the other performance metrics, we additionally provide a series of tables that consider average performances within subsets of scenarios defined by the outcome type and k (Tables 3–10). Comprehensive simulation results for each individual scenario are publicly available as a dataset (https://osf.io/9qfah).

Table 3

Scenarios with continuous outcomes; $\widehat {\mu }$ point and interval estimation

Note: Methods are sorted from best to worst performance within each column (or alphabetically for ties); coverage is sorted from highest to lowest. MAE: Mean absolute error. RMSE: Root mean square error. CI: 95% confidence or credible interval. Coverage $>$ 0.94: Percent of scenarios for which coverage probability was at least 0.94. “All”: All methods performed equally to two decimal places.

Table 4

Scenarios with continuous outcomes; $\widehat {\tau }$ point and interval estimation

Table 5

Scenarios with binary outcomes; $\widehat {\mu }$ point and interval estimation

Table 6

Scenarios with binary outcomes; $\widehat {\tau }$ point and interval estimation

Table 7

Scenarios with continuous outcomes, $k \le 5$ ; $\widehat {\mu }$ point and interval estimation

Table 8

Scenarios with continuous outcomes, $k \le 5$ ; $\widehat {\tau }$ point and interval estimation

Table 9

Scenarios with binary outcomes, $k \le 5$ ; $\widehat {\mu }$ point and interval estimation

Table 10

Scenarios with binary outcomes, $k \le 5$ ; $\widehat {\tau }$ point and interval estimation

As described above, our focus is on small meta-analyses. Thus, except where otherwise noted, all subsequent results pertain to scenarios with $k \le 20$ , and we refer to these as “all scenarios.” Although tables and figures show results for both normal and exponential population effects, our prose descriptions focus primarily on scenarios with normal effects; in these scenarios, all methods were correctly specified. We do secondarily discuss how results changed for exponentially distributed effects. Note that figures stratify on effect distribution, while tables aggregate over normal and exponential effects due to space constraints.

4.4.1 Convergence metrics

Other than the exact method and the BCa bootstrap, all methods’ algorithms converged (in the sense of yielding point estimates and/or intervals for $\widehat {\mu }$ and $\widehat {\tau }$ ) in $>$ 99% of simulated datasets. The exact method is designed only to provide an interval for $\widehat {\mu }$ , and its algorithm did so in $>$ 98% of simulated datasets. In the subset of scenarios we ran with the bootstrap methods (i.e., scenarios with $k=10$ ), the BCa bootstrap only provided an interval for $\widehat {\mu }$ and $\widehat {\tau }$ in 67% of datasets. When no interval was provided, this was because the estimated bias correction was infinite, which can happen when empirical influence values are close to zero due to outliers or small sample sizes.

4.4.2 Point and interval estimation for $\mu $

Consistent with previously published simulations,Reference Kontopantelis and Reeves ¹⁰ all methods performed very similarly for point estimation of $\mu $ and were approximately unbiased (Figure 6 and Section 2.1 of the Supplementary Material). Across all scenarios, the maximum within-scenario absolute differences between any two methods in bias, RMSE, and MAE respectively were only 0.056, 0.064, and 0.036. Given these relatively minor differences in point estimation for $\mu $ , we primarily discuss interval estimation for this estimand. In pilot tests for $k=10$ scenarios, the bootstrap methods were not competitive with other methods (Sections 3.7 and 3.8 of the Supplementary Material). Therefore, we did not run these computationally intensive methods $=$ for other sample sizes, and the bootstrap methods are omitted from results in the main text.

Figure 7 shows the coverage of 95% intervals. All frequentist methods with HKSJ intervals performed similarly to one another. In scenarios with normal population effects, these methods’ performances were minimally affected by k and $\tau $ , and coverage was $>$ 94% in 80% of scenarios. This is a somewhat pessimistic portrayal because coverages for these methods were also rarely below $\sim 93$ %. ML-profile had coverage $>$ 94% in 71% of scenarios with normal effects, but unlike the HKSJ methods, the coverage of ML-profile varied substantially across scenarios. In particular, this method had close to nominal coverage for intermediate levels of heterogeneity and for $k=20$ , but exhibited under-coverage for higher heterogeneity values (e.g., $\tau \ge 0.20$ ). The exact interval exhibited over-coverage for smaller values of k and close to nominal coverage for $k=20$ . All of these findings are consistent with previous simulation studies.Reference Kontopantelis and Reeves ¹⁰ ^, Reference Michael, Thornton and Xie ¹⁸

Figure 6

Bias of $\widehat {\mu }$ ; all scenarios. Hinges of each boxplot are the 25th, 50th, and 75th percentiles. The upper and lower whiskers extend from the hinge to the minimum or maximum value that is no more than $1.5 \times (\text {interquartile range})$ from the nearest hinge.

Figure 7

Coverage of CI for $\widehat {\mu }$ . Lines are slightly staggered horizontally for visibility. Lines are mean performances across scenarios, conditional on k, $\tau $ , the distribution of population effects, and the outcome type. All HKSJ methods performed very similarly, so their overlapping lines look like a single grey line.

Jeffreys1-shortest and Jeffreys2-shortest intervals had coverage $>$ 94% in 98% and in 88% of scenarios with normal population effects, respectively. This exceeded the 80% and 71% seen for HKSJ intervals and ML-profile intervals, respectively. In individual scenarios, Jeffreys1-shortest and Jeffreys2-shortest intervals both typically exhibited over-coverage or nominal coverage with one exception: Jeffreys2-shortest intervals exhibited mild under-coverage ( $\sim $ 89–93%) for very small meta-analyses ( $k\le~5$ ) that also had a continuous outcome and high heterogeneity ( $\tau = 0.50$ ).

Figure 8 shows the width of 95% intervals. For $k < 10$ , the different intervals’ widths varied, sometimes substantially. In these scenarios, the ML-profile interval was consistently the narrowest, and was often considerably so for very small meta-analyses. The Jeffreys1-shortest interval was typically the widest of all, especially for very small meta-analyses. On the other hand, the Jeffreys2-shortest interval was typically the second-narrowest after ML-profile, and was substantially narrower than all HKSJ intervals for very small meta-analyses. It may be counterintuitive that the Jeffreys2-shortest interval was narrower than the HKSJ intervals while more consistently achieving at least nominal coverage; we explain this finding in Section 4.4.3 below. For $k \ge 10$ and continuous outcomes, all types of intervals had nearly identical widths. For $k \ge 10$ and binary outcomes, both Jeffreys intervals and the exact interval were slightly wider than those of the HKSJ methods, although this should be interpreted in light of the frequentist methods’ slight under-coverage in these scenarios (Figure 7).

Figure 8

Width of CI for $\widehat {\mu }$ . Lines are slightly staggered horizontally for visibility. Lines are mean performances across scenarios, conditional on k, $\tau $ , the distribution of population effects, and the outcome type. Y-axis is on log scale.

In scenarios with exponential population effects, all methods’ relative performances were similar, although coverages declined somewhat when heterogeneity was high ( $\tau = 0.50$ ). This, too, is consistent with previous simulation studies.Reference Kontopantelis and Reeves ¹⁰ Section 3 of the Supplementary Material provides additional results stratified by outcome type. First, results are shown for scenarios with $k=100$ , since these scenarios are excluded from all results in the main text. In those scenarios, as expected from theory, all point estimates performed very similarly regardless of outcome type. For binary outcomes, most methods’ coverage probabilities declined somewhat at $k=100$ . This finding is consistent with previous simulation results involving rare binary outcomes (Langan et al. (2019)Reference Langan, Higgins and Jackson ⁷ ; Appendix Figure 4) and likely reflects known two sources of misspecification when meta-analyzing log-odds ratios. In particular: (1) estimated log-odds ratios are correlated with their estimated standard errors; and (2) the conventional variance estimate is an imperfect approximation especially when there are zero cell counts, a problem that occurs even when adding positive constants to each cell.Reference Tang ⁶⁵ ^, Reference Chang and Hoaglin ⁶⁶ We return to these issues in the Discussion. In these scenarios, Jeffreys methods retained closer to nominal coverage than did the frequentist methods. Additional supplementary tables stratify the results in the main text (i.e., scenarios with $k \le 20$ ) into those with fixed versus varying N across studies. In all of these strata, the relative rankings of methods’ performances were quite similar to those in the aggregate analyses.

4.4.3 Discussion of results regarding $\mu $

For small meta-analyses ( $k \le 20$ ) with binary outcomes, Jeffreys2-shortest may be a useful method since its intervals had at least nominal coverage (with normal effects) and yet were often considerably narrower than all others except ML-profile, whose coverage was inconsistent across scenarios. To illustrate, we provide some numerical comparisons between Jeffreys2-shortest and REML-HKSJ intervals for meta-analyses of binary outcomes. We compare to a single type of frequentist interval for simplicity. In scenarios with binary outcomes and normal population effects, Jeffreys2-shortest had coverage $>$ 94% in 90% of scenarios, whereas REML-HKSJ did so in 80% of scenarios. Accordingly, Jeffreys2-shortest had coverage at least equal to that of REML-HKSJ in 85% of scenarios. At the same time, the Jeffreys2-shortest interval was on average 27% narrower than the REML-HKSJ interval; and in meta-analyses with $k \le 5$ , this efficiency improvement increased to 51%. For binary outcomes, Jeffreys1-shortest did not appear to have clear advantages over Jeffreys2-shortest or the other methods, since the Jeffreys1-shortest interval was wider than even that of the exact method.

For small meta-analyses with continuous outcomes, more caution is warranted when using Jeffreys2-shortest intervals, since they exhibit mild under-coverage ( $\sim $ 89–93%) for very small meta-analyses ( $k \le 5$ ) that also had high heterogeneity. Since Jeffreys2-shortest provided only modest improvements in efficiency for meta-analyses of continuous outcomes once $k>5$ , it may be preferable to conservatively use a frequentist method with an HKSJ interval for continuous outcomes, regardless of k. Although the Jeffreys1-shortest interval did, in general, retain at least nominal coverage for continuous outcomes, this interval was again wider than the exact interval, and was considerably wider than the HKSJ intervals.

As noted above, it may be counterintuitive that the Jeffreys2-shortest interval was typically narrower than the HKSJ intervals while more consistently achieving at least nominal coverage. There are two reasons for this finding. First, whereas HKSJ intervals for $\mu $ are always symmetric on the analyzed effect scale (i.e., Hedges’ g for continuous outcomes and log-odds ratio for binary outcomes), the Jeffreys1-shortest and Jeffreys2-shortest intervals can be symmetric or asymmetric depending on the shape of the posterior (Section 2.3 of the Supplementary Material). Second, within a given scenario, the width of the Jeffreys2-shortest interval was typically much less variable across repeated samples than the HKSJ intervals. Thus, in many scenarios in which the Jeffreys2-shortest interval exhibited over-coverage but comparison methods exhibited nominal or less than nominal coverage, this was because the HKSJ methods often yielded extremely wide intervals under repeated sampling, whereas the Jeffreys2-shortest intervals were bounded within a narrower range (Section 2.3 of the Supplementary Material).

4.4.4 Point and interval estimation for $\tau $

For both continuous and binary outcomes, results for point and interval estimation depended on whether $\tau $ was near the boundary value of zero, especially for the Jeffreys methods. Regarding point estimation, the frequentist methods, especially ML, typically showed a slight negative bias (Figure 9). Point estimates from Jeffreys1 and Jeffreys2 varied more in the sign and magnitude of bias than did the frequentist point estimates (Figure 9). Regarding MAE and RMSE, the frequentist methods DL, DL2, REML, and PM were comparable to one another. In contrast, ML often performed slightly better on these metrics (Figures 10 and 11). Jeffreys1 and Jeffreys2 had comparable MAE and RMSE to one another. Relative to the frequentist methods, Jeffreys1 and Jeffreys2 typically showed comparable MAE and RMSE at midrange values of $\tau $ (e.g., $\tau = 0.10$ ), showed better MAE and RMSE for $\tau> 0.10$ , and showed worse MAE and RMSE for $\tau < 0.10$ . These patterns were more pronounced for binary outcomes, though the relative rankings of methods were similar for both outcome types. The patterns were similar for both normal and exponential population effects.

Figure 9

Bias of $\widehat {\tau }$ ; all scenarios. Hinges of each boxplot are the 25th, 50th, and 75th percentiles. The upper and lower whiskers extend from the hinge to the minimum or maximum value that is no more than $1.5 \times (\text {interquartile range})$ from the nearest hinge.

Figure 10

MAE of $\widehat {\tau }$ . Lines are slightly staggered horizontally for visibility. Lines are mean performances across scenarios, conditional on k, $\tau $ , the distribution of population effects, and the outcome type.

Figure 11

RMSE of $\widehat {\tau }$ . Lines are slightly staggered horizontally for visibility. Lines are mean performances across scenarios, conditional on k, $\tau $ , the distribution of population effects, and the outcome type.

Regarding interval estimation, pilot tests with the bootstrap methods again suggested that these methods performed relatively poorly compared to the other methods (Sections 3.7 and 3.8 of the Supplementary Material), so we again omit the bootstrap methods from the main text. Figure 12 shows the coverage of 95% intervals. With normal population effects, all Q-profile intervals performed comparably to one another and showed close to nominal coverage ( $>$ 94% in 83% of scenarios). ML-profile typically showed nominal coverage or over-coverage in the large majority of scenarios; in scenarios with normal effects, the coverage of these intervals was $>$ 94% in 82% of scenarios, similar to the Q-profile methods. However, ML-profile did show under-coverage for smaller meta-analyses that also had high heterogeneity. This under-coverage was minimal for binary outcomes ( $\sim $ 90% at minimum), but could be substantial for continuous outcomes ( $\sim $ 75% at minimum).

Figure 12

Coverage of CI for $\widehat {\tau }$ . Lines are slightly staggered horizontally for visibility. Lines are mean performances across scenarios, conditional on k, $\tau $ , the distribution of population effects, and the outcome type.

Jeffreys1-shortest showed at least nominal coverage for $\tau> 0.01$ , but showed substantial under-coverage when $\tau = 0.01$ . Jeffreys2-shortest behaved similarly, except additionally showed under-coverage for meta-analyses of continuous outcomes with high heterogeneity ( $\tau = 0.50$ ), especially for $k \le 5$ . The coverage of Jeffreys1-shortest and Jeffreys2-shortest was $>$ 94% in, respectively, 83% and 74% of scenarios. Both Jeffreys1-central and Jeffreys2-central performed considerably worse (i.e., showed more severe under-coverage) than Jeffreys1-shortest and Jeffreys2-shortest for smaller values of $\tau $ : in scenarios with normal population effects, coverage of Jeffreys1-central and Jeffreys2-central was $>$ 94% in, respectively, 54% and 56% of scenarios. This under-coverage reflects overestimation of $\tau $ when it was near the boundary of the parameter space.

Figure 13 shows the width of 95% intervals. We now discuss only the methods that had the highest rates of at least nominal coverage, so exclude discussion of Jeffreys2-shortest, Jeffreys1-central, and Jeffreys2-central. The widths of the various Q-profile intervals the Jeffreys1-shortest intervals were comparable, but the ML-profile intervals were typically considerably narrower, especially for very small meta-analyses.

Figure 13

Width of CI for $\widehat {\tau }$ . Lines are slightly staggered horizontally for visibility. Lines are mean performances across scenarios, conditional on k, $\tau $ , the distribution of population effects, and the outcome type. Y-axis is on log scale.

In scenarios with exponential population effects, all methods’ relative performances for estimation and interval estimation on $\tau $ were similar, although coverages declined for all methods. Additional stratified results (Section 3 of the Supplementary Material) suggest that patterns of performance were also comparable for $k=100$ and for fixed versus varying N across studies.

4.4.5 Discussion of results regarding $\tau $

For point estimation of $\tau $ , no method emerged as clearly optimal, since methods’ performances depended strongly on $\tau $ itself. The low coverage of the Jeffreys methods occurred when $\tau $ was near zero (the boundary of the parameter space). This reflects overestimation of $\tau $ , which is often viewed as conservative in the context of random-effects meta-analysis. Regarding interval estimation for $\tau $ , the frequentist estimators with Q-profile or ML-profile intervals appear preferable to the Jeffreys methods. Of the two Jeffreys priors and two types of intervals, Jeffreys1-shortest is the only one whose coverage was competitive with that of the frequentist methods. However, since Jeffreys1-shortest intervals were somewhat wider than those of the frequentist intervals, this method does not seem to offer an overall advantage over the frequentist intervals. The Q-profile intervals performed slightly more consistently across scenarios than did ML-profile, although average performances were similar. ML-profile intervals were, however, considerably narrower than the Q-profile intervals.

4.5 Overall conclusions

All methods performed similarly for point estimation of $\mu $ . In general, standard frequentist methods with HKSJ intervals for $\mu $ and Q-profile intervals for $\tau $ performed the most consistently across outcome types. Jeffreys2-shortest also showed consistently strong performance for meta-analyses of binary outcomes, and yielded substantially narrower intervals than frequentist methods. However, the Jeffreys2-shortest interval did not perform as consistently for continuous outcomes; this method exhibited mild under-coverage for very small meta-analyses with high heterogeneity. Regarding point estimation of $\tau $ , all methods again performed comparably on average, though the optimal method depended on the value of $\tau $ itself. Regarding interval estimation for $\tau $ , the Q-profile method arguably performed the best and behaved consistently across scenarios.

Overall, for small meta-analyses of continuous outcomes, we would recommend standard frequentist methods with an HKSJ interval for $\mu $ and a Q-profile interval for $\tau $ , consistent with previous recommendations. However, for small meta-analyses of binary outcomes, Jeffreys2 may be preferable over standard frequentist methods if the meta-analyst is primarily interested in point and interval estimation for $\mu $ , potentially along with point estimation of $\tau $ (although, again, the best-performing method for estimating $\tau $ depended on the value of $\tau $ itself). This is because the Jeffreys2-shortest interval more frequently had at least nominal coverage, yet was substantially more precise. If the meta-analyst is also interested in obtaining an interval for $\tau $ , then using a frequentist method with a Q-profile interval for $\tau $ would likely provide closer to nominal coverage for $\tau $ than would Jeffrey2-shortest; however, this would likely sacrifice a substantial amount of precision for $\mu $ .

5 Applied example

Zito et al.Reference Zito, Galli, Biondi-Zoccai, Abbate, Douglas, Princi and Burzotta ⁶⁷ meta-analyzed randomized trials that compared various diagnostic strategies for detecting coronary artery disease (CAD) in patients experiencing CAD-related symptoms. The authors conducted meta-analyses for each pairwise comparison between multiple diagnostic methods; for simplicity, we focus on studies comparing coronary computed tomography angiography (CCTA) to stress single-photon emission computed tomography myocardial perfusion imaging (SPECT-MPI). We replicated the authors’ meta-analyses for each of six outcomes: cardiovascular death and myocardial infarction ( $k=2$ ), all-cause death ( $k=3$ ), myocardial infarction ( $k=2$ ), index invasive coronary angiography (ICA) ( $k=4$ ), index revascularization ( $k=4$ ), and downstream testing ( $k=4$ ). The authors’ meta-analysesReference Zito, Galli, Biondi-Zoccai, Abbate, Douglas, Princi and Burzotta ⁶⁷ used the DL method and used Wald rather than HKSJ confidence intervals.Footnote ⁱⁱⁱ We scraped study-level summary statistics from the published forest plots and re-analyzed the studies assessing each outcome using DL, REML, the exact method, Jeffreys1-shortest, and Jeffreys2-shortest. For DL and REML, we used HKSJ intervals, following established recommendations.Reference Langan, Higgins and Jackson ⁷ ^, Reference IntHout, Ioannidis and Borm ¹² ^– Reference Knapp and Hartung ¹⁴ Since our simulation study suggested relatively minor differences between the various frequentist methods with HKSJ intervals, we focus only on DL and REML for brevity. All code and data required to reproduce the applied example are publicly available and documented (https://osf.io/9qfah).

Figure 2 shows the Jeffreys1 and Jeffreys2 priors for a single outcome (all-cause death), and Figure 3 shows the resulting joint posterior under the Jeffreys2 prior. Figure 5 shows all methods’ point estimates and intervals for $\widehat {\mu }$ for all outcomes; a similar forest plot for heterogeneity estimates appears in Section 4 of the Supplementary Material. As in the simulation study, all point estimates were nearly identical, but the Jeffreys2-shortest interval was often considerably narrower than those from Jeffreys1-shortest, REML, DL, and the exact method. Across all six outcomes, the Jeffreys2-shortest interval on the log-odds scale was on average 45% narrower than the narrowest interval from the other methods. For the meta-analyses of only two studies, this improvement in precision increased to 112%.

6 Discussion

To the best of our knowledge, this paper provides the first empirical assessment of the Jeffreys2 prior in meta-analysis. We compared point estimates and intervals from the Jeffreys2 prior to those of the Jeffreys1 prior and to several of the best-performing parametric, semiparameteric, and nonparametric frequentist methods. Extending previous simulation studies on the Jeffreys1 prior, we additionally considered different types of Bayesian point estimates and intervals, and we considered point and interval estimation for both $\mu $ and $\tau $ . As summarized in Section 4.5, for small meta-analyses of binary outcomes, Jeffreys2 may be preferable over standard frequentist methods for point and interval estimation for $\mu $ , providing improvements in efficiency that can be substantial. However, for small meta-analyses of continuous outcomes, standard frequentist methods with HKSJ intervals for $\mu $ and Q-profile CIs for $\tau $ seem to be the best choices, avoiding the mild under-coverage that Jeffreys2-shortest intervals can sometimes exhibit for very small meta-analyses with high heterogeneity. For both outcome types, the best-performing method for point estimation of $\tau $ varied according to $\tau $ itself. When $\tau $ is very small, the Jeffreys methods performed conservatively in that they typically overestimate $\tau $ . Finally, we showed that the Jeffreys2 prior has a straightforward generalization to the case of meta-regression (Section 1 of the Supplementary Material).

Given our interest in the frequentist properties of Jeffreys priors as the Firth correction to ML estimates, we have treated point and interval estimation from a frequentist perspective. For example, our simulation study considered the coverage of 95% intervals estimated from repeated samples that were generated from fixed values of the parameters. In contrast, in Bayesian inference, the parameters are viewed as random draws from the prior, rather than as fixed quantities. The Bayesian framework does permit empirical assessment of certain analogs to coverage, but doing so involves drawing repeated samples from parameters sampled from the prior, rather than from parameters held constant.Reference Röver ⁹ ^, Reference Cook, Gelman and Rubin ⁶⁸ ^, Reference Röver and Friede ⁶⁹ As an additional complication, performing these Bayesian calibration checks requires a proper prior from which to sample, yet both Jeffreys priors are improper.Reference Cook, Gelman and Rubin ⁶⁸ Cook et al. (2006) argued that this difficulty in assessing calibration with improper priors is a disadvantage of using such priors in the first place.Reference Cook, Gelman and Rubin ⁶⁸ Given our interest in methods’ frequentist motivations and their frequentist empirical properties, we did not consider any of the numerous other Bayesian priors that have been proposed for meta-analysis (e.g., as reviewed by Röver (2020)Reference Röver ⁹ ). It is somewhat difficult to compare the performance of standard frequentist methods to Bayesian methods that lack a frequentist interpretation, which is perhaps why many previous simulation studies have not included any Bayesian methodsReference Langan, Higgins and Jackson ⁷ ^, Reference Langan, Higgins and Simmonds ¹¹ (but with exceptionsReference Bodnar, Link and Arendacká ¹⁵ ^– Reference Seide, Röver and Friede ¹⁷ ).

Our simulation study had other limitations. First, we considered only one form of model misspecification, namely exponentially distributed population effects, and found that methods’ relative rankings were largely unaffected. However, we did not assess any other forms of misspecification, such as more severe departures from normalityReference Kontopantelis and Reeves ¹⁰ or clustered population effects. Second, for meta-analyses of binary outcomes, we considered only standard inverse-variance weighted meta-analysis, but arm-based approaches may have better statistical properties.Reference Chang and Hoaglin ⁶⁶ On the other hand, arm-based methods can introduce bias due to non-exchangeability across trials,Reference White, Turner and Karahalios ⁷⁰ ^, Reference Dias and Ades ⁷¹ and inverse-variance meta-analysis more readily accommodates the possibility that studies adjust for covariates, and may be more feasible when original papers reported only limited summary statistics. Additionally, assessing inverse-variance meta-analysis provides a more direct comparison to previous simulation studies.Reference Langan, Higgins and Simmonds ¹¹ Third, the two within-study estimators we used, namely log-odds ratios and Hedges’ g, both involve approximations which may introduce slight finite-sample biases of their own. Such decisions can nontrivially affect the results of simulation studies,Reference Kulinskaya, Hoaglin and Bakbergenuly ⁷² and we used these estimators to ensure direct comparability with previous simulation studies.Reference Langan, Higgins and Jackson ⁷ Additionally, these two measures are among the most commonly used in meta-analyses.Reference Mathur and VanderWeele ⁷³ Future work could explore relative performances with effect measures that do not require approximations, such as raw mean differences, although these are not frequently used in practice.Reference Mathur and VanderWeele ⁷³ Fourth, we considered only two estimands, $\mu $ and $\tau $ , but these alone provide a limited summary of the random-effects distribution. Additional metrics than can be informative include the percentage of population effects exceeding a chosen threshold for a meaningful effect sizeReference Mathur and VanderWeele ⁴⁷ ^, Reference Mathur and VanderWeele ⁵⁵ ^, Reference Mathur and VanderWeele ⁷⁴ ; the prediction interval for a new population effectReference Riley, Higgins and Deeks ⁵⁴ ^, Reference Wang and Lee ⁷⁵ ; and shrinkage estimates for each study’s population effect.Reference Wang and Lee ⁷⁵ ^, Reference Röver, Rindskopf and Friede ⁷⁶ An advantage of Bayesian estimation is that such metrics can be obtained readily from the posterior; several are implemented in the R package bayesmeta.Reference Röver ⁹ Future simulation studies could consider these estimands and intervals as well. Fifth, we made the usual assumption that any estimation error in the within-study standard errors is negligible. We did not assess the extent to which this approximation compromised interval estimation. A number of approaches have been proposed to accommodate this form of estimation error; perhaps future work could incorporate these developments into the Jeffreys priors.Reference Bellio and Guolo ⁷⁷ ^– Reference Johnson and Huedo-Medina ⁸⁰

Our work remains a preliminary investigation of Jeffreys1 and Jeffreys2 priors. We would particularly encourage future work to consider other generalizations to these priors, besides our generalization to meta-regression. For example, as noted in Introduction, we recently found that a Jeffreys prior on $\mu $ and $\tau $ performed well for an estimation problem involving severe p-hacking, which required estimating the parameters of a truncated distribution.Reference Mathur ³⁴ Certain selection models for publication bias lead to related distributions that involve step functions in the publication probability.Reference Hedges ⁸¹ These models can perform poorly for small meta-analyses, often exhibiting extremely wide intervals for parameters related to publication bias severity.Reference Carter, Schönbrodt and Gervais ⁸² ^, Reference Terrin, Schmid and Lau ⁸³ Might using a Jeffreys prior on $\mu $ , $\tau $ , and the bias parameters also improve these models’ performance for small meta-analyses? Additional extensions could include accommodating clustered population effects. We look forward to future research along these lines.

Acknowledgements

Christian Röver provided helpful discussions and implemented the Jeffreys2 prior in his R package, bayesmeta. Dean Langan shared code from his simulation study. Annamaria Guolo provided advice on the use of her R package, metaLik.

Author contributions

M.B.M. conducted this research.

Competing interest statement

The author declares that no competing interests exist.

Data availability statement

All code and data required to reproduce the simulation study and applied example are publicly available and documented (https://osf.io/9qfah).

Funding statement

This research was supported by National Institutes of Health grants R01 LM013866, UL1TR003142, P30CA124435, and P30DK116074. The funders had no role in the design, conduct, or reporting.

Supplementary material

The supplementary material for this article can be found at https://doi.org/10.1017/rsm.2024.2.

Footnotes

This article was awarded Open Data and Open Materials badges for transparent practices. See the Data availability statement for details.

i. The invariance property does not mean, however, that the posterior mode of $\tau $ under a $\tau $ -parameterization is equal to the square-root of the posterior mode of $\tau ^2$ under a $\tau ^2$ -parameterization, for example. Rather, the usual change-of-variables expression would apply.

ii. Specifically, in the context of bias correction for p-hacking, Mathur (2024)Reference Mathur ³⁴ described meta-analyzing only studies that are nonsignificant or that have negative point estimates. In that paper, we derived the two-parameter Jeffreys prior for a version of the likelihood that is appropriately right-truncated to reflect the inclusion of only a subset of studies. The present Jeffreys2 prior is the special case in which there is no truncation, i.e., the limit as the truncation threshold approaches $\infty $ .

iii. The paper states that HKSJ intervals were used, but our re-analyses replicated their results only when we used Wald intervals. Through correspondence with the authors and journal editors, we confirmed that Wald intervals were indeed used (S. Chang and J.Cornell, 14 March 2024).

References

DerSimonian, R, Laird, N. Meta-analysis in clinical trials. Control. Clin. Trials. 1986;7(3):177–188.CrossRef Google Scholar PubMed

DerSimonian, R, Kacker, R. Random-effects model for meta-analysis of clinical trials: an update. Contemp. Clin. Trials. 2007;28(2): 105–114.CrossRef Google Scholar PubMed

Paule, RC, Mandel, J. Consensus values and weighting factors. J. Res. Nat. Bureau Stand. 1982;87(5):377.CrossRef Google Scholar PubMed

van Aert, RCM, Jackson, D. Multistep estimators of the between-study variance: The relationship with the Paule-Mandel estimator. Stat. Med. 2018;37(17):2616–2629.CrossRef Google Scholar PubMed

Raudenbush, SW, Bryk, AS. Empirical Bayes meta-analysis. J. Educ. Stat. 1985;10(2):75–98.CrossRef Google Scholar

Harville, DA. Maximum likelihood approaches to variance component estimation and to related problems. J. Amer. Stat. Assoc. 1977;72(358):320–338.Google Scholar

Langan, D, Higgins, JPT, Jackson, D, et al. A comparison of heterogeneity variance estimators in simulated random-effects meta-analyses. Res. Synth. Methods. 2019;10(1):83–98.CrossRef Google Scholar PubMed

Van Erp, S, Verhagen, J, Grasman, R, et al.. Estimates of between-study heterogeneity for 705 meta-analyses reported in Psychological Bulletin from 1990–2013. J. Open Psychol. Data. 2017;5(1).CrossRef Google Scholar

Röver, C. Bayesian random-effects meta-analysis using the bayesmeta R package. J. Stat. Softw. 2020;93(6).Google Scholar

Kontopantelis, E, Reeves, D. Performance of statistical methods for meta-analysis when true study effects are non-normally distributed: a simulation study. Stat. Methods Med. Res. 2012;21(4):409–426.CrossRef Google Scholar PubMed

Langan, D, Higgins, JPT, Simmonds, M. Comparative performance of heterogeneity variance estimators in meta-analysis: a review of simulation studies. Res. Synth. Methods. 2017;8(2):181–198.Google Scholar PubMed

IntHout, J, Ioannidis, JPA, Borm, GF. The Hartung-Knapp-Sidik-Jonkman method for random effects meta-analysis is straightforward and considerably outperforms the standard DerSimonian-Laird method. BMC Med. Res. Methodol. 2014;14(1):1.Google Scholar PubMed

Sidik, K, Jonkman, JN. A simple confidence interval for meta-analysis. Stat. Med. 2002;21(21):3153–3159.Google Scholar PubMed

Knapp, G, Hartung, J. Improved tests for a random effects meta-regression with a single covariate. Stat. Med. 2003;22(17):2693–2710.CrossRef Google Scholar PubMed

Bodnar, O, Link, A, Arendacká, B, et al. Bayesian estimation in random effects meta-analysis using a non-informative prior. Stat. Med. 2017;36(2):378–399.Google Scholar PubMed

Friede, T, Röver, C, Wandel, S, et al. Meta-analysis of few small studies in orphan diseases. Res. Synth. Methods. 2017;8(1):79–91.CrossRef Google Scholar PubMed

Seide, SE, Röver, C, Friede, T. Likelihood-based random-effects meta-analysis with few studies: empirical and simulation studies. BMC Med. Res. Methodol. 2019;19:1–14.CrossRef Google Scholar PubMed

Michael, H, Thornton, S, Xie, M, et al. Exact inference on the random-effects model for meta-analyses with few studies. Biom. 2019;75(2):485–493.Google Scholar PubMed

Viechtbauer, W. Confidence intervals for the amount of heterogeneity in meta-analysis. Stat. Med. 2007;26(1):37–52.Google Scholar PubMed

Jeffreys, H. An invariant form for the prior probability in estimation problems. Proc. Royal Soc. London. Series A. Math. Phys. Sci. 1946;186(1007):453–461.Google Scholar PubMed

Gelman, A, Carlin, JB, Stern, HS, et al. Bayesian Data Analysis. 3rd ed. CRC Press; 2014.Google Scholar

Cencov, NN. Statistical Decision Rules and Optimal Inference. Vol 53. American Mathematical Society; 2000.CrossRef Google Scholar

Datta, GS, Ghosh, M. On the invariance of noninformative priors. Ann Stat. 1996;24(1):141–159.CrossRef Google Scholar

Kass, RE, Wasserman, L. Formal rules for selecting prior distributions: A review and annotated bibliography. J. Amer. Stat. Assoc. 1996;435:1343–1370.Google Scholar

Bernardo, JM. Reference posterior distributions for Bayesian inference. J. Royal Stat. Soc. Series B: Stat. Methodol. 1979;41(2):113–128.CrossRef Google Scholar

Firth, D. Bias reduction of maximum likelihood estimates. Biometrika. 1993;80(1):27–38.CrossRef Google Scholar

Magis, D. A note on weighted likelihood and Jeffreys modal estimation of proficiency levels in polytomous item response models. Psychometrika. 2015;80:200–204.CrossRef Google Scholar PubMed

Zhou, X, Giacometti, R, Fabozzi, FJ, et al. Bayesian estimation of truncated data with applications to operational risk measurement. Quant. Finance. 2014;14(5):863–888.CrossRef Google Scholar

Sartori, N. Bias prevention of maximum likelihood estimates for scalar skew normal and skew t distributions. J. Stat. Plan. Inference. 2006;136(12):4259–4275.CrossRef Google Scholar

Bodnar, O, Link, A, Elster, C. Objective Bayesian inference for a generalized marginal random effects model. Bayesian Anal. 2016;11(1):25–45.CrossRef Google Scholar

Kosmidis, I, Guolo, A, Varin, C. Improving the accuracy of likelihood-based inference in meta-analysis and meta-regression. Biometrika. 2017;104(2):489–496.CrossRef Google Scholar

Schünemann, HJ, Higgins, J, Vist, GE, et al. Chapter 14: completing ‘summary of findings’ tables and grading the certainty of the evidence. In: Higgins, JPT, Thomas, J, Chandler, J, Cumpston, M, Li, T, Page, MJ, Welch, VA (editors). Cochrane Handbook for Systematic Reviews of Interventions. 2nd Edition. Chichester: John Wiley & Sons, 2019.Google Scholar

Alam, TF, Rahman, MS, Bari, W. On estimation for accelerated failure time models with small or rare event survival data. BMC Med. Res. Methodol. 2022;22(1):1–15.CrossRef Google Scholar PubMed

Mathur, MB. P-hacking in meta-analyses: A formalization and new meta-analytic methods. Res. Synth. Methods. In press. Early view version available at https://onlinelibrary.wiley.com/doi/full/10.1002/jrsm.1701.Google Scholar

Cope, EW. Penalized likelihood estimators for truncated data. J. Stat. Plan. Inference. 2011;141(1): 345–358.CrossRef Google Scholar

Veroniki, AA, Jackson, D, Viechtbauer, W, Bender, R, Bowden, J, Knapp, G, Salanti, G. Methods to estimate the between-study variance and its uncertainty in meta-analysis. Res. Synth. Methods. 2016;7(1): 55–79.Google Scholar PubMed

Viechtbauer, W. Bias and efficiency of meta-analytic variance estimators in the random-effects model. J. Educ. Behav. Stat. 2005;30(3):261–293.CrossRef Google Scholar

Huizenga, HM, Visser, I, Dolan, CV. Testing overall and moderator effects in random effects meta-regression. Br. J. Math. Stat. Psychol. 2011;64(1):1–19.CrossRef Google Scholar PubMed

Guolo, A. Higher-order likelihood inference in meta-analysis and meta-regression. Stat. Med. 2012;31(4):313–327.Google Scholar PubMed

Guolo, A, Varin, C. Random-effects meta-analysis: the number of studies matters. Stat. Methods Med. Res. 2017;26(3):1500–1518.CrossRef Google Scholar PubMed

Noma, H. Confidence intervals for a random-effects meta-analysis based on Bartlett-type corrections. Stat. Med. 2011;30(28):3304–3312.CrossRef Google Scholar PubMed

Hilde, H, Ingmar, V. metatest: Fit and test metaregression models. R package version 1.0-5. 2018. https://CRAN.R-project.org/package=metatest.Google Scholar

Veroniki, AA, Jackson, D, Bender, R, et al. Methods to calculate uncertainty in the estimated overall effect size from a random-effects meta-analysis. Res. Synth. Methods. 2019;10(1):23–43.Google Scholar PubMed

Van Den Noortgate, W, Onghena, P. Parametric and nonparametric bootstrap methods for meta-analysis. Behav. Res. Methods. 2005;37:11–22.CrossRef Google Scholar PubMed

Efron, B. Better bootstrap confidence intervals. J. Amer. Stat. Assoc. 1987;82(397):171–185.Google Scholar

Carpenter, J, Bithell, J. Bootstrap confidence intervals: When, which, what? A practical guide for medical statisticians. Stat. Med. 2000;19(9):1141–1164.Google Scholar PubMed

Mathur, MB, VanderWeele, TJ. Robust metrics and sensitivity analyses for meta-analyses of heterogeneous effects. Epidemiology. 2020;31(3):356–358.CrossRef Google Scholar PubMed

Cornell, JE, Mulrow, CD, Localio, R, et al. Random-effects meta-analysis of inconsistent effects: a time for change. Ann. Internal Med. 2014;160(4):267–270.Google Scholar PubMed

Berger, JO, Bernardo, JM. Ordered group reference priors with application to the multinomial problem.Biometrika. 1992;79(1):25–37.Google Scholar

Bodnar, O, Bodnar, T. Objective Bayesian meta-analysis based on generalized marginal multivariate random effects model. Bayesian Anal. 2024;19(2):531–564.Google Scholar

Röver, C, Friede, T. Discrete approximation of a mixture distribution via restricted divergence. J. Comput. Graph. Stat. 2017;26(1):217–222.Google Scholar

Greenland, S. Generalized conjugate priors for Bayesian analysis of risk and survival regressions. Biometrics 2003;59(1):92–99.Google Scholar PubMed

Jeffreys, H. Theory of Probability. 3rd ed. Oxford University Press; 1961.Google Scholar

Riley, RD, Higgins, JPT, Deeks, JJ. Interpretation of random effects meta-analyses. BMJ. 2011;342.Google Scholar PubMed

Mathur, MB, VanderWeele, TJ. New metrics for meta-analyses of heterogeneous effects. Stat. Med. 2019;38(8):1336–1342.Google Scholar PubMed

Hedges, LV. Distribution theory for Glass’s estimator of effect size and related estimators. J. Educ. Stat. 1981;6(2):107–128.Google Scholar

Brockwell, SE, Gordon, IR. A comparison of statistical methods for meta-analysis. Stat. Med. 2001;20(6):825–840.Google Scholar PubMed

Viechtbauer, W. Conducting meta-analyses in R with the metafor package. J. Stat. Softw. 2010;36(3):1–48.CrossRef Google Scholar

Canty, A, Boot, RB. Bootstrap functions. R package version 1.3-30. 2024. https://CRAN.R-project.org/package=boot.Google Scholar

Hedges, LV. Estimation of effect size from a series of independent experiments. Psychol. Bull. 1982;92(2):490.CrossRef Google Scholar

Nemes, S, Jonasson, JM, Genell, A, et al. Bias in odds ratios by logistic regression modelling and sample size. BMC Med. Res. Methodol. 2009;9:1–5.CrossRef Google Scholar PubMed

Rothman, KJ. Disengaging from statistical significance. Eur. J. Epidemiol. 2016;31:443–444.Google Scholar PubMed

McShane, BB, Gal, D, Gelman, A, et al. Abandon statistical significance. Amer. Stat. 2019;73(sup1):235–245.Google Scholar

Mathur, MB, VanderWeele, TJ. Finding common ground in meta-analysis “wars” on violent video games. Perspect. Psychol. Sci. 2019;14(4):705–708.CrossRef Google Scholar PubMed

Tang, J-L. Weighting bias in meta-analysis of binary outcomes. J. Clin. Epidemiol. 2000;53(11):1130–1136.Google Scholar PubMed

Chang, B-H, Hoaglin, DC. Meta-analysis of odds ratios: current good practices. Med. Care 2017;55(4):328.Google Scholar PubMed

Zito, A, Galli, M, Biondi-Zoccai, G, Abbate, A, Douglas, PS, Princi, G, Burzotta, F. Diagnostic strategies for the assessment of suspected stable coronary artery disease: a systematic review and meta-analysis. Ann. Internal Med. 2023;176(6):817–826.CrossRef Google Scholar PubMed

Cook, SR, Gelman, A, Rubin, DB. Validation of software for Bayesian models using posterior quantiles. J. Comput. Graph. Stat. 2006;15(3):675–692.Google Scholar

Röver, C, Friede, T. Dynamically borrowing strength from another study through shrinkage estimation. Stat. Methods Med. Res. 2020;29(1):293–308.Google Scholar PubMed

White, IR, Turner, RM, Karahalios, A, et al. A comparison of arm-based and contrast-based models for network meta-analysis. Stat. Med. 2019;38(27):5197–5213.Google Scholar PubMed

Dias, S, Ades, AE. Absolute or relative effects? Arm-based synthesis of trial data. Res. Synth. Methods 2016;7(1):23.Google Scholar PubMed

Kulinskaya, E, Hoaglin, DC, Bakbergenuly, I. Exploring consequences of simulation design for apparent performance of methods of meta-analysis. Stat. Methods Med. Res. 2021;30(7):1667–1690.Google Scholar PubMed

Mathur, MB, VanderWeele, TJ. Estimating publication bias in meta-analyses of peer-reviewed studies: A meta-meta-analysis across disciplines and journal tiers. Res. Synth. Methods. 2021;12(2):176–191.Google Scholar PubMed

Mathur, MB, VanderWeele, TJ. Sensitivity analysis for unmeasured confounding in meta-analyses. J. Amer. Stat. Assoc. 2020;115(529):163–170.Google Scholar PubMed

Wang, C-C, Lee, W-C. A simple method to estimate prediction intervals and predictive distributions: summarizing meta-analyses beyond means and confidence intervals. Res. Synth. Methods 2019;10(2):255–266.Google Scholar PubMed

Röver, C, Rindskopf, D, Friede, T. How trace plots help interpret meta-analysis results. Res. Synth. Methods. 2024;15(3):413–429.CrossRef Google Scholar PubMed

Bellio, R, Guolo, A. Integrated likelihood inference in small sample meta-analysis for continuous outcomes. Scand. J. Stat. 2016;43(1):191–201.Google Scholar

Malzahn, U, Böhning, D, Holling, H. Nonparametric estimation of heterogeneity variance for the standardised difference used in meta-analysis. Biometrika. 2000;87(3):619–632.CrossRef Google Scholar

Böhning, D, Malzahn, U, Dietz, K, et al. Some general points in estimating heterogeneity variance with the DerSimonian–Laird estimator. Biostatistics. 2002(3)4 pp. 445–457.CrossRef Google Scholar PubMed

Johnson, BT, Huedo-Medina, TB. Meta-Analytic Statistical Inferences for Continuous Measure Outcomes as a Function of Effect Size Metric and Other Assumptions: AHRQ Methods for Effective Health Care. Statist. Sci. 1992;7(2):246–255. https://doi.org/10.1214/ss/1177011364.CrossRef Google Scholar

Hedges, LV. Modeling publication selection effects in meta-analysis. Stat. Sci. 1992:246–255.Google Scholar

Carter, EC, Schönbrodt, FD, Gervais, WM, et al. Correcting for bias in psychology: A comparison of meta-analytic methods. Adv. Methods Pract. Psychol. Sci. 2019;2(2):115–144. https://doi.org/10.1177/2515245919847196.CrossRef Google Scholar

Terrin, N, Schmid, CH, Lau, J, et al. Adjusting for publication bias in the presence of heterogeneity. Stat. Med. 2003;22(13):2113–2126. https://doi.org/10.1002/sim.1461.Google Scholar PubMed

Figure 1 Priors for four simulated meta-analyses of standardized mean differences ($k=10$), in which the within-study sample sizes (N) were generated from four possible distributions. Studies’ standard errors were estimated using Eq. (5) and, given the data-generation parameters, were approximately equal to $2/\sqrt {N}$. Points are the maxima. The priors have been scaled to have the same maximum height.

Figure 2 Priors on $\tau $ for the meta-analysis on all-cause death ($k=3, \{ \sigma _i \} = \{ 1.15, 1.63, 0.19 \}$). Points are maxima. The priors have been scaled to have the same maximum height.

Figure 3 Joint posterior under the Jeffreys2 prior for the meta-analysis on all-cause death ($k=3, \{ \sigma _i \} = \{ 1.15, 1.63, 0.19 \}$). Horizontal red line: marginal posterior mode of $\mu $. Vertical blue line: marginal posterior mode of $\tau $.

Figure 4 Marginal posteriors under the Jeffreys2 prior for the meta-analysis on all-cause death. Solid vertical lines: marginal posterior modes. Dashed vertical lines: limits of 95% intervals.

Figure 5 Interval limits greater than $RR=10$ are truncated. The exact method does not yield point estimates. CI: credible interval.

Table 1 Methods assessed in simulation study

Table 2 Possible values of simulation parameters

Table 3 Scenarios with continuous outcomes; $\widehat {\mu }$ point and interval estimation

Table 4 Scenarios with continuous outcomes; $\widehat {\tau }$ point and interval estimation

Table 5 Scenarios with binary outcomes; $\widehat {\mu }$ point and interval estimation

Table 6 Scenarios with binary outcomes; $\widehat {\tau }$ point and interval estimation

Table 7 Scenarios with continuous outcomes, $k \le 5$; $\widehat {\mu }$ point and interval estimation

Table 8 Scenarios with continuous outcomes, $k \le 5$; $\widehat {\tau }$ point and interval estimation

Table 9 Scenarios with binary outcomes, $k \le 5$; $\widehat {\mu }$ point and interval estimation

Table 10 Scenarios with binary outcomes, $k \le 5$; $\widehat {\tau }$ point and interval estimation

Figure 6 Bias of $\widehat {\mu }$; all scenarios. Hinges of each boxplot are the 25th, 50th, and 75th percentiles. The upper and lower whiskers extend from the hinge to the minimum or maximum value that is no more than $1.5 \times (\text {interquartile range})$ from the nearest hinge.

Figure 7 Coverage of CI for $\widehat {\mu }$. Lines are slightly staggered horizontally for visibility. Lines are mean performances across scenarios, conditional on k, $\tau $, the distribution of population effects, and the outcome type. All HKSJ methods performed very similarly, so their overlapping lines look like a single grey line.

Figure 8 Width of CI for $\widehat {\mu }$. Lines are slightly staggered horizontally for visibility. Lines are mean performances across scenarios, conditional on k, $\tau $, the distribution of population effects, and the outcome type. Y-axis is on log scale.

Figure 9 Bias of $\widehat {\tau }$; all scenarios. Hinges of each boxplot are the 25th, 50th, and 75th percentiles. The upper and lower whiskers extend from the hinge to the minimum or maximum value that is no more than $1.5 \times (\text {interquartile range})$ from the nearest hinge.

Figure 10 MAE of $\widehat {\tau }$. Lines are slightly staggered horizontally for visibility. Lines are mean performances across scenarios, conditional on k, $\tau $, the distribution of population effects, and the outcome type.

Figure 11 RMSE of $\widehat {\tau }$. Lines are slightly staggered horizontally for visibility. Lines are mean performances across scenarios, conditional on k, $\tau $, the distribution of population effects, and the outcome type.

Figure 12 Coverage of CI for $\widehat {\tau }$. Lines are slightly staggered horizontally for visibility. Lines are mean performances across scenarios, conditional on k, $\tau $, the distribution of population effects, and the outcome type.

Figure 13 Width of CI for $\widehat {\tau }$. Lines are slightly staggered horizontally for visibility. Lines are mean performances across scenarios, conditional on k, $\tau $, the distribution of population effects, and the outcome type. Y-axis is on log scale.

Mathur supplementary material

File 605 KB

Article contents

Meta-analysis with Jeffreys priors: Empirical frequentist properties

Abstract

Keywords

Information

Highlights

1 Introduction

2 Existing frequentist methods

2.1 Method-of-moments estimators

2.2 Likelihood-based estimators

2.3 Interval estimation

2.4 Existing simulations comparing these methods

3 Bayesian methods using Jeffreys priors

3.1 The Jeffreys priors

3.2 The posterior under each prior

3.3 Theoretical and substantive distinctions between the priors

4 Simulation study

4.1 Point and interval estimation methods

4.2 Data generation

4.3 Performance metrics

4.4 Results

4.4.1 Convergence metrics

4.4.2 Point and interval estimation for $\mu $

4.4.3 Discussion of results regarding $\mu $

4.4.4 Point and interval estimation for $\tau $

4.4.5 Discussion of results regarding $\tau $

4.5 Overall conclusions

5 Applied example

6 Discussion

Acknowledgements

Author contributions

Competing interest statement

Data availability statement

Funding statement

Supplementary material

Footnotes

References

Mathur supplementary material

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests