
A comparison of combined p-value functions for meta-analysis

Published online by Cambridge University Press:  18 June 2025

Leonhard Held*
Affiliation:
Epidemiology, Biostatistics and Prevention Institute (EBPI) and Center for Reproducible Science (CRS), University of Zurich, Zurich, Switzerland
Felix Hofmann
Affiliation:
Epidemiology, Biostatistics and Prevention Institute (EBPI) and Center for Reproducible Science (CRS), University of Zurich, Zurich, Switzerland
Samuel Pawel
Affiliation:
Epidemiology, Biostatistics and Prevention Institute (EBPI) and Center for Reproducible Science (CRS), University of Zurich, Zurich, Switzerland
*
Corresponding author: Leonhard Held; Email: leonhard.held@uzh.ch

Abstract

P-value functions are modern statistical tools that unify effect estimation and hypothesis testing and can provide alternative point and interval estimates compared to standard meta-analysis methods, using any of the many p-value combination procedures available (Xie et al., 2011, JASA). We provide a systematic comparison of different combination procedures, both from a theoretical perspective and through simulation. We show that many prominent p-value combination methods (e.g. Fisher’s method) are not invariant to the orientation of the underlying one-sided p-values. Only Edgington’s method, a lesser-known combination method based on the sum of p-values, is orientation-invariant and still provides confidence intervals not restricted to be symmetric around the point estimate. Adjustments for heterogeneity can also be made and results from a simulation study indicate that Edgington’s method can compete with more standard meta-analytic methods.

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology

Highlights

What is already known

  • P-value functions unify hypothesis testing and parameter estimation, and are therefore particularly useful for quantitative reporting of statistical analyses.

  • P-value combination methods provide a general framework to perform meta-analysis.

What is new

  • P-value functions of different p-value combination methods are compared.

  • Edgington’s method has attractive properties:

    • Results do not depend on the orientation of the underlying one-sided p-values.

    • Confidence intervals are not restricted to be symmetric.

  • A simulation study is performed without and with adjustments for heterogeneity. Comparisons with standard meta-analysis (fixed effect and DerSimonian–Laird random effects) and the Hartung–Knapp–Sidik–Jonkman method are described.

    • The point estimate based on Edgington’s method is essentially unbiased for the mean of a normal study effects distribution.

    • Coverage of Edgington’s method is comparable to or better than standard meta-analysis, with only slightly wider confidence intervals.

    • Confidence intervals based on the Hartung–Knapp–Sidik–Jonkman method have better coverage, but are also substantially wider.

Potential impact for RSM readers

  • Edgington’s combination method based on the sum of p-values may complement standard meta-analysis because of its ability to reflect data asymmetry, its orientation-invariance, and its good operating characteristics.

  • The use of p-value function methods for meta-analysis is facilitated through the development of the R package confMeta.

1 Introduction

A pervasive challenge in all areas of research is the assessment of evidence from multiple studies. Standard meta-analysis aims to synthesize effect estimates from several studies into an overall effect estimate, typically a weighted average of the study-specific effect estimates, combined with an appropriate confidence interval. Inverse variance weights can be motivated as efficient choices under homogeneity or heterogeneity between studies[1] via either exchangeability or random sampling of study effects.[2] The DerSimonian–Laird (DL)[3] approach to random effects meta-analysis incorporates a measure of heterogeneity into the weights, but does not incorporate uncertainty in the variance estimate when making inference on the mean of the random effects distribution. This form of weights gives an estimate that is consistent for the mean of the distribution of study effects,[4,5] a natural target of inference where a symmetric (usually normal) distribution can be assumed.

There has been much progress in proposing alternative confidence intervals for meta-analysis. The Hartung–Knapp[6,7] and Sidik–Jonkman[8] approach takes into account the uncertainty in estimating heterogeneity and tends to produce wider confidence intervals, in particular if the number of studies is small. The approach by Henmi and Copas[9] combines the fixed effect (FE) point estimate with a standard error from the random effects model, in order to obtain confidence intervals less prone to publication bias. However, all these intervals are of a simple additive form with limits

(1) $$ \begin{align} \mbox{ point estimate } \pm \mbox{ additive factor, } \end{align} $$

and hence symmetric around the point estimate. This may be reasonable if the number of studies is large or if there is good reason to assume that the true effect estimates follow a symmetric (normal) distribution around their mean, but confidence intervals not restricted to be symmetric around the point estimate may be more suitable if this is not the case. Non-symmetric confidence intervals can show improved performance in other applications: for example, the “square-and-add” Wilson score interval for the risk difference is non-symmetric and performs better than symmetric confidence intervals.[10,11] The nonparametric confidence interval for the median survival time is also non-symmetric and performs better than alternative parametric and symmetric confidence intervals.[12] Other prominent examples of non-symmetric confidence intervals are nonparametric bootstrap confidence intervals based on the percentile method,[13,14] deterministic bootstrap intervals for odds ratios, risk differences, and relative risks,[11] and confidence intervals for a population median or the difference of two population medians.[15]

In this article we compare meta-analytic methods based on the combined p-value function[16,17] or, equivalently, the confidence curve[18] and confidence distribution.[19,20] Related meta-analytic approaches based on p-value functions have been proposed by Singh et al.[21] They showed that p-value combination based meta-analysis—approaches that combine p-values of individual studies, such as Fisher’s method—can be unified with standard model-based meta-analysis under a common framework using p-value functions. This framework has subsequently been extended[22–24] and different methods have recently been compared for rare event meta-analysis.[25] Yang et al.[26] consider p-value combination methods for rare events based on Fisher’s exact test and note in an application to simulated data that “the p-value function based on the exact test preserves the skewness” of the original data. This statement suggests that a desired property of a meta-analytic confidence interval is to reflect data asymmetry. However, the authors did not investigate this feature any further. A related proposal in the meta-analytic literature is the “drapery plot.”[27] This visualization, an alternative to the standard forest plot, shows the p-value functions of individual studies and of their pooled effect, providing the reader with a wealth of information, as p-values, point estimates, and confidence intervals (at any level) can be easily read off.

We apply these ideas in our article and provide a systematic comparison of different types of p-value combination procedures[28,29] for meta-analysis. Theory and simulation studies are used to compare the different confidence intervals and point estimates in terms of coverage, bias, width, and skewness. The results indicate that four of the five p-value combination methods considered have undesirable properties; only Edgington’s method,[30,31] based on the sum of p-values, can compete with more standard meta-analytic methods.

2 Methodology

Suppose results from k studies are to be synthesized and let $\hat \theta _i$ denote the effect estimate of the true parameter $\theta _i$ from the ith study, $i=1,\ldots , k$ , and $\sigma _i$ the corresponding standard error. As in standard meta-analysis we assume that the $\hat \theta _i$ ’s are independent and follow a normal distribution with unknown mean $\theta _i$ and known variance $\sigma _i^2$ (the squared standard error). Additional adjustments for heterogeneity will be discussed in Section 2.3.

Let

(2) $$ \begin{align} Z_i = \frac{\hat \theta_i - \mu}{\sigma_i} \end{align} $$

denote the z-statistic for the null hypothesis $H_{0i}$ : $\theta _i=\mu $ , $i=1, \ldots , k$ . We can then derive the corresponding one-sided p-values based on the cumulative standard normal distribution function $\Phi (\cdot )$ :

(3) $$ \begin{align} \overset{\rightharpoonup}{p_i} = 1-\Phi(Z_i) \mbox{ and } \overset{\leftharpoonup}{p_i} = \Phi(Z_i) \end{align} $$

for the alternatives $H_{1i}$ : $\theta _i> \mu $ (“greater”) and $H_{1i}$ : $\theta _i < \mu $ (“less”), respectively. Note that $\overset {\rightharpoonup }{p_i}$ is monotonically increasing and $\overset {\leftharpoonup }{p_i}$ is monotonically decreasing in $\mu $ .
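To make this construction concrete, here is a minimal R sketch of the study-specific one-sided p-value functions (2)–(3); the effect estimates theta_hat and standard errors se are hypothetical placeholder values, not data from the article.

```r
## One-sided p-value functions (2)-(3) for hypothetical estimates and standard errors.
theta_hat <- c(-0.3, 0.1, -0.6)   # hypothetical effect estimates
se        <- c(0.20, 0.25, 0.40)  # hypothetical standard errors

p_greater <- function(mu, i) 1 - pnorm((theta_hat[i] - mu) / se[i])  # H1: theta_i > mu
p_less    <- function(mu, i)     pnorm((theta_hat[i] - mu) / se[i])  # H1: theta_i < mu

p_greater(c(-1, 0, 1), i = 1)  # monotonically increasing in mu, as noted above
```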

2.1 P-value combination methods

In what follows we denote with $\overset {\rightharpoonup }{p_\bullet }$ a combined p-value based on study-specific one-sided p-values $\overset {\rightharpoonup }{p_1}, \ldots , \overset {\rightharpoonup }{p_k}$ for the alternative “greater” and likewise with $\overset {\leftharpoonup }{p_{\bullet }}$ for the alternative “less.” The subscript “ $\bullet $ ” is a placeholder for a p-value combination method, abbreviated by the first letter of the last name of the inventor, where we consider the methods listed in Table 1.

The combined p-values $\overset {\rightharpoonup }{p_\bullet }$ and $\overset {\leftharpoonup }{p_\bullet }$ will inherit the monotonicity property from the $\overset {\rightharpoonup }{p_i}$ ’s and $\overset {\leftharpoonup }{p_i}$ ’s, respectively: $\overset {\rightharpoonup }{p_\bullet }$ is monotonically increasing and $\overset {\leftharpoonup }{p_\bullet }$ is monotonically decreasing in $\mu $ for any of the combination methods listed in Table 1.

The p-value from Edgington’s method is based on a transformation of the sum of the p-values, $s$, with the cumulative distribution function of the Irwin–Hall distribution.[36,37] For large k it can be approximated based on a central limit theorem argument[31]:

(4) $$ \begin{align} {p_E} \approx \Phi(\sqrt{12 \, k}({s}/k-1/2)). \end{align} $$

For $k=12$, this approximation is considered already “fairly good.”[38] In fact, the “sum of 12 uniforms” method was once a popular way to generate samples from a normal distribution. In order to mitigate overflow problems of the Irwin–Hall distribution for large k, we therefore use the normal approximation (4) if $k \geq 12$.
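The following minimal R sketch implements Edgington’s combined p-value as described above, using the exact Irwin–Hall cumulative distribution function for small k and the normal approximation (4) for $k \geq 12$; the input p-values in the example call are hypothetical.

```r
## Exact Irwin-Hall CDF (for s in [0, k]) and Edgington's combined p-value.
p_irwin_hall <- function(s, k) {
  j <- 0:floor(s)
  sum((-1)^j * choose(k, j) * (s - j)^k) / factorial(k)
}
p_edgington <- function(p) {
  k <- length(p)
  s <- sum(p)
  if (k >= 12) pnorm(sqrt(12 * k) * (s / k - 0.5)) else p_irwin_hall(s, k)
}
p_edgington(c(0.01, 0.04, 0.20))  # three hypothetical one-sided p-values
```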

Table 1 Some methods for combining one-sided p-values ${p_{1}}, \dots , {p_{k}}$ from k studies into a combined p-value ${p_\bullet }$ (in alphabetic order).

Note: The floor function ${\lfloor {s} \rfloor }$ denotes the greatest integer less than or equal to s. A chi-squared random variable with k degrees of freedom is denoted as $\chi ^{2}_{k}$ .

Tippett’s method is based on the smallest p-value. There is a generalization of Tippett’s method based on the rth smallest p-value.[28,34] For $r=k$ the largest p-value is used, and we obtain the method denoted here as Wilkinson’s method. Another commonly used p-value combination method is Stouffer’s method, based on the sum of inverse-normal transformed p-values.[39] A weighted version exists,[29] which is equivalent to FE and random effects meta-analysis if the weights are suitably chosen.[40] FE and random effects meta-analysis will be included in our example (Section 2.2) and in the simulation study described in Section 3.

Suppose we combine one-sided p-values $\overset {\rightharpoonup }{p_1}, \ldots , \overset {\rightharpoonup }{p_k}$ for the alternative “greater” into a combined p-value function $\overset {\rightharpoonup }{p_\bullet }(\mu )$. The standard point estimate is the median estimate $\hat \mu $,[41] defined as the root of the equation

(5) $$ \begin{align} \overset{\rightharpoonup}{p_\bullet}(\hat \mu) = 0.5. \end{align} $$

The Irwin–Hall distribution has median $k/2$ , therefore the median estimate $\hat \mu _E$ of Edgington’s method is the value of $\mu $ where the mean $s/k$ of the study-specific one-sided p-values is 0.5. There are even closed-form solutions for the median estimate based on Tippett’s and Wilkinson’s method, see Appendix A.

In order to obtain a two-sided $1-\alpha $ confidence interval $[\mu _l, \mu _u]$ for $\mu $ based on a p-value combination method $\overset {\rightharpoonup }{p_\bullet }$ , we have to find the roots $\mu _l$ and $\mu _u$ of the two equations

(6) $$ \begin{align} & &\overset{\rightharpoonup}{p_\bullet}(\mu_l) = \alpha/2& &\text{and}& &\overset{\rightharpoonup}{p_\bullet}(\mu_u) = 1 - \alpha/2.& & \end{align} $$

While in the case of Tippett’s and Wilkinson’s methods there are closed-form solutions for the roots in (5) and (6) as shown in Appendix A, in general they have to be computed using numerical root-finding algorithms. Equivalently we can find the two roots $\mu _l$ and $\mu _u$ of the single equation

(7) $$ \begin{align} 2 \, \min\left\{\overset{\rightharpoonup}{p_\bullet}(\mu), 1- \overset{\rightharpoonup}{p_\bullet}(\mu) \right\} = \alpha \end{align} $$

to obtain the $1-\alpha $ confidence interval, while maximization of the function on the left side of (7) gives the median estimate $\hat \mu $. Note that the confidence interval $[\mu _l, \mu _u]$ is not necessarily symmetric around the median estimate. The left-hand side of (7) has been coined the centrality function[22] and is also known as the confidence curve. Berrar[42] has proposed to use the area under the confidence curve (AUCC) as a summary measure of precision. We note that Schweder and Hjort[43] define the confidence curve as $1$ minus the left-hand side of (7), in which case the AUCC is no longer a useful summary measure.
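As an illustration of (5)–(7), the following R sketch computes the median estimate and a 95% confidence interval by numerical root-finding on Edgington’s combined p-value function for the alternative “greater”; it reuses the hypothetical theta_hat, se, and p_edgington from the sketches above.

```r
## Combined p-value function (alternative "greater") and root-finding for (5)-(6).
p_fun <- function(mu) {
  p_i <- 1 - pnorm((theta_hat - mu) / se)  # study-specific one-sided p-values
  p_edgington(p_i)
}
search <- range(theta_hat) + c(-10, 10) * max(se)  # crude bracketing interval
mu_hat <- uniroot(function(m) p_fun(m) - 0.5,   search)$root  # median estimate (5)
mu_l   <- uniroot(function(m) p_fun(m) - 0.025, search)$root  # lower 95% limit (6)
mu_u   <- uniroot(function(m) p_fun(m) - 0.975, search)$root  # upper 95% limit (6)
c(estimate = mu_hat, lower = mu_l, upper = mu_u)
```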

We may also use $\overset {\leftharpoonup }{p_\bullet }(\mu )$ (based on the one-sided p-values $\overset {\leftharpoonup }{p_{1}}, \dots , \overset {\leftharpoonup }{p_{k}}$ for the alternative “less”) rather than $\overset {\rightharpoonup }{p_\bullet }(\mu )$ to compute a point estimate with two-sided confidence interval. Ideally this should lead to the same results, but this is only the case for Edgington’s method. The other combination methods will generally lead to different results depending on whether the input p-values are oriented “greater” or “less” due to the following relationships between combined p-value and input p-value orientation:

(8) $$ \begin{align} \overset{\rightharpoonup}{p_E} & = 1 - \overset{\leftharpoonup}{p_E}, \end{align} $$
(9) $$ \begin{align} \overset{\rightharpoonup}{p_F} & = 1 - \overset{\leftharpoonup}{p_P}, \end{align} $$
(10) $$ \begin{align} \overset{\rightharpoonup}{p_P} & = 1 - \overset{\leftharpoonup}{p_F}, \end{align} $$
(11) $$ \begin{align} \overset{\rightharpoonup}{p_T} & = 1 - \overset{\leftharpoonup}{p_W}, \end{align} $$
(12) $$ \begin{align} \overset{\rightharpoonup}{p_W} & = 1 - \overset{\leftharpoonup}{p_T}, \end{align} $$

see Appendix B for a proof. For example, equation (9) implies that the combined p-value function based on Fisher’s method and one-sided p-values for the alternative “greater” is 1 minus the combined p-value function based on Pearson’s method and one-sided p-values for the alternative “less.” However, in practice the ultimate goal is to compute a point estimate with a two-sided confidence interval, so the direction of the alternative of the underlying one-sided p-values should not matter. This is only the case for Edgington’s method, due to property (8).

In principle, we may also use two-sided p-values in any of the combination methods listed in Table 1 to circumvent the lack of orientation-invariance of Tippett’s, Wilkinson’s, Fisher’s, and Pearson’s methods, but this comes with new problems. Specifically, the combined p-value functions of Fisher’s, Pearson’s, and Edgington’s methods based on two-sided p-values may then no longer peak at 1, which means that confidence intervals at certain confidence levels may be empty sets. In contrast, the combined p-value functions of Tippett’s and Wilkinson’s methods will peak at 1, but will have several modes at the study-specific point estimates. This may lead to confidence sets consisting of non-overlapping intervals, which are hard to interpret and not useful in applications.

2.2 Example: Association between corticosteroids and mortality in COVID-19 hospitalized patients

We will illustrate the different methods using a meta-analysis combining information from $k = 7$ randomized controlled clinical trials investigating the association between corticosteroids and mortality in hospitalized patients with COVID-19,[44] see Table 2. We will use one-sided p-values for the alternative “less,” as negative log odds ratios indicate treatment benefit; the results for the alternative “greater” follow from (8) to (12).

Table 2 Data from $k=7$ randomized controlled clinical trials investigating the association between corticosteroids and mortality in hospitalized patients with COVID-19.[44]

Figure 1 shows that the distribution of the study effect estimates is right-skewed. Such skewness can be quantified using Fisher’s weighted skewness coefficient[45] of the meta-analyzed effect estimates, defined as

(13) $$ \begin{align} &\gamma = \frac{\left\{\sum_{i=1}^{k} w_{i} (\hat{\theta}_{i} - \bar{\theta})^{3}\right\}\sqrt{\sum_{i=1}^{k} w_{i}}}{\left\{\sum_{i=1}^{k} w_{i}(\hat{\theta}_{i} - \bar{\theta})^{2}\right\}^{3/2}}~\text{ with }~\bar{\theta} = \frac{\sum_{i=1}^{k}\hat{\theta}_{i}w_{i}}{\sum_{i=1}^{k}w_{i}}~\text{ and }~w_{i} = \frac{1}{\sigma^{2}_{i}}. \end{align} $$

In this example, we obtain $\gamma =3.72$ , the positive sign reflecting a right-skewed distribution.
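A minimal R implementation of the weighted skewness coefficient (13) could look as follows; the theta_hat and se values in the example call are hypothetical placeholders, not the data from Table 2.

```r
## Fisher's weighted skewness coefficient (13) with inverse-variance weights.
weighted_skewness <- function(theta_hat, se) {
  w    <- 1 / se^2
  tbar <- sum(w * theta_hat) / sum(w)
  sum(w * (theta_hat - tbar)^3) * sqrt(sum(w)) /
    sum(w * (theta_hat - tbar)^2)^(3 / 2)
}
weighted_skewness(theta_hat = c(-0.3, 0.1, -0.6), se = c(0.20, 0.25, 0.40))
```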

Figure 1 Drapery plot (top) and forest plot (bottom) from several combination methods with 95% confidence intervals for a meta-analysis of $k=7$ randomized controlled clinical trials investigating the association between corticosteroids and mortality in hospitalized patients with COVID-19.[44]

The results from an inverse variance-weighted FE analysis are reproduced in Figure 1 on the log odds ratio scale. This was also the prespecified primary analysis in the protocol registered and made publicly available on the PROSPERO database prior to data analysis or receipt of outcome data. Note that the knot point of the p-value function for Wilkinson’s method shown in the drapery plot in Figure 1 is not an artefact, but caused by its definition based on the maximum of the different p-values, compare Table 1.

Results based on the different methods are shown in Table 3. The point estimates from the different p-value combination methods are all closer to zero than the combined effect estimates from the FE, DL, and Hartung–Knapp–Sidik–Jonkman (HKSJ) random effects methods. The 95% confidence intervals also differ substantially. Wilkinson’s method even gives a positive point estimate and has the widest confidence interval. Wilkinson’s method also gives the largest value of the two-sided p-value ${p}_{\bullet }(0)$ for the null hypothesis of no effect, while Tippett’s method has the smallest estimate and the smallest p-value among all five p-value combination methods.

To assess the skewness of the different confidence intervals, we computed the skewness coefficient[46]

(14) $$ \begin{align} \beta = \frac{\mbox{upper} + \mbox{lower} - 2 \, \mbox{estimate}}{\mbox{upper} - \mbox{lower}} \end{align} $$

based on the (median) estimate and the $\mbox {upper}$ and $\mbox {lower}$ interval limits. Note that $\left \lvert \beta \right \rvert \leq 1$ with positive sign for a right-skewed interval and negative sign for a left-skewed one. The coefficient is zero for symmetric confidence intervals, as here for the FE, DL and HKSJ random effects methods. The penultimate column in Table 3 reveals that three of the different p-value combination methods (Fisher, Tippett, Wilkinson) return a left-skewed confidence interval with negative skewness coefficient $\beta $ , although the study effect estimates are right-skewed. Only Edgington’s and Pearson’s methods preserve the skewness of the data and return a positive coefficient $\beta $ .

We also calculated the AUCC[42] for the different p-value combination methods; see the third-last column in Table 3. Like the confidence interval width, the AUCC is a measure of precision, but it has the advantage that it does not depend on the level of the confidence interval. In this application the AUCC correlates strongly with the width of the 95% CI, with a correlation of 0.999. As a measure of skewness of the confidence curve we propose to compute the AUCC below and above the point estimate $\hat \mu $, so that $\mbox {AUCC}_{\tiny \mbox {below}} + \mbox {AUCC}_{\tiny \mbox {above}} = \mbox {AUCC}$. The proposed measure of skewness is the

$$\begin{align*}\mbox{{AUCC ratio}} = \frac{\mbox{AUCC}_{\tiny \mbox{above}} - \mbox{AUCC}_{\tiny \mbox{below}}}{\mbox{AUCC}}, \end{align*}$$

which is restricted to the interval $[-1, 1]$ , just as the skewness coefficient (14). Table 3 shows that in this application the AUCC ratio has the same sign as the skewness coefficient (14) for all five p-value combination methods, with correlation 0.997.

Two of the studies in this example (study 1, “DEXA-COVID 19,” and study 5, “COVID STEROID”) have wide confidence intervals due to a small number of events, where the normality assumption in (3) may be questionable. To avoid assuming normality, we can also define p-value functions based on exact one-sided p-values from Fisher’s exact test, where we employ the mid-p correction originally proposed by Lancaster.[47] This ensures that the p-values for “greater” and “less” still sum to 1, although the distribution of the test statistic is discrete. Edgington’s method is hence still orientation-invariant, whereas the other methods are not.

Table 3 Comparison of different p-value combination methods (with alternative “less”) investigating the association between corticosteroids and mortality in hospitalized patients with COVID-19.

Note: Shown are estimates of the log odds ratio and their 95% CIs and compared with the fixed effect, DerSimonian–Laird (DL) and Hartung–Knapp–Sidik–Jonkman (HKSJ) random effects approach for meta-analysis of $k=7$ randomized controlled clinical trials. The random effects approach is based on the REML estimate of $\tau ^2$ , which is so close to zero that its results are indistinguishable from the fixed effect method. The p-value shown is two-sided for the standard null hypothesis $\mu =0$ of no effect and calculated based on the left-hand side of (7) (with $\overset {\leftharpoonup }{p_\bullet }$ rather than $\overset {\rightharpoonup }{p_\bullet }$ ) for the different p-value combination methods.

Figure 2 shows the corresponding p-value functions based on the exact p-values (solid black lines) along with the normal approximation p-value functions based on the z-statistics (2) for comparison (dashed green lines). We see that for most combination methods both curves are virtually identical, producing almost identical point estimates, p-values, and confidence intervals. Slightly larger differences can only be seen for Wilkinson’s method. This suggests that the p-value function based on the z-statistics provides a good approximation to the one based on exact p-values, despite small counts for some of the studies.

Figure 2 Combined p-value functions with 95% confidence intervals based on exact p-values with mid-p correction for meta-analysis of $k=7$ randomized controlled clinical trials investigating the association between corticosteroids and mortality in hospitalized patients with COVID-19.[44]

Note: Shown are also the normal approximation p-value functions based on the z-statistics (2) for comparison.

2.3 Accounting for heterogeneity

We consider the case of additive heterogeneity, where heterogeneity is quantified based on Cochran’s Q-statistic[48]

(15) $$ \begin{align} Q = \sum\limits_{i=1}^k w_i (\hat \theta_i - \hat \theta_w)^2 \end{align} $$

with weights $w_i=1/\sigma _i^2$ equal to the inverse squared standard errors $\sigma _i$ and the standard meta-analytical point estimate $\hat \theta _w$ , the weighted average of the study-specific estimates $\hat \theta _i$ with weights  $w_i$ . The Q-statistic (15) depends on $\hat \theta _w$ , but can also be written as a weighted sum of squared paired differences,

$$\begin{align*}Q = \frac{\sum\limits_{i < j} w_i w_j (\hat \theta_i - \hat \theta_j)^2}{\sum\limits_{i=1}^k w_i}, \end{align*}$$

where $\hat \theta _w$ no longer appears. For example, if there are only $k=2$ studies we obtain

$$\begin{align*}Q = \frac{\sigma_1^{-2} \sigma_2^{-2} (\hat \theta_1 - \hat \theta_2)^2}{\sigma_1^{-2} + \sigma_2^{-2}} = \frac{(\hat \theta_1 - \hat \theta_2)^2}{\sigma_1^{2} + \sigma_2^{2}}, \end{align*}$$

the standard test statistic to assess the evidence for conflict between two study-specific effect estimates $\hat \theta _1$ and $\hat \theta _2$ . A popular measure of heterogeneity is Higgins’

$$ \begin{align*}I^2 = \max \left\{Q - (k-1), 0 \right\}/Q,\end{align*} $$

the proportion of the variance of the study-specific effect estimates that is attributable to study heterogeneity. Higgins’ $I^2$ is used in our simulation study to specify the amount of heterogeneity.
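In R, Cochran’s Q (15) and Higgins’ $I^2$ can be computed in a few lines; the sketch below uses hypothetical estimates and standard errors, not data from the article.

```r
## Cochran's Q (15) and Higgins' I^2 for hypothetical estimates and standard errors.
theta_hat <- c(-0.3, 0.1, -0.6)
se        <- c(0.20, 0.25, 0.40)
w         <- 1 / se^2
theta_w   <- sum(w * theta_hat) / sum(w)  # inverse-variance weighted average
Q         <- sum(w * (theta_hat - theta_w)^2)
I2        <- max(Q - (length(theta_hat) - 1), 0) / Q
c(Q = Q, I2 = I2)
```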

The z-statistic (2) can be modified to account for heterogeneity between study effects, represented by the heterogeneity variance $\tau ^2$ . The heterogeneity-adjusted z-statistic is

(16) $$ \begin{align} Z_i = \frac{\hat \theta_i - \mu}{\sqrt{\sigma_i^2 + \hat \tau^2}}, \end{align} $$

where $\hat \tau ^2$ is a suitable estimate of $\tau ^2$. Many estimates of the heterogeneity variance $\tau ^2$ exist; we will use the REML estimate in the following due to its good performance in simulation studies.[49] Note that the transformation (3) is used to compute p-values based on (16), which assumes normality of the random effects distribution. We will comment on possible relaxations of this assumption in the discussion.
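A heterogeneity-adjusted one-sided p-value function based on (16) can be sketched as follows; the value of tau2_hat is a hypothetical placeholder (in practice it would be the REML estimate from a standard meta-analysis package), and the estimates and standard errors are again hypothetical.

```r
## Heterogeneity-adjusted one-sided p-values (alternative "greater") based on (16).
tau2_hat <- 0.04  # hypothetical REML-type estimate of the heterogeneity variance
p_greater_adj <- function(mu, theta_hat, se) {
  1 - pnorm((theta_hat - mu) / sqrt(se^2 + tau2_hat))
}
p_greater_adj(0, theta_hat = c(-0.3, 0.1, -0.6), se = c(0.20, 0.25, 0.40))
```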

3 Simulation study

3.1 Design

We first describe the design of our simulation study, following the structured “ADEMP” approach for reporting of simulation studies.[50]

3.1.1 Aims

The aim of the simulation study was to evaluate the estimation properties of p-value combination methods for meta-analysis and to compare them with classical meta-analysis methods under different numbers of studies with potentially different sample sizes and degrees of heterogeneity.

3.1.2 Data-generating mechanism

Our data-generating mechanism closely follows the simulation study of IntHout et al.[51] Specifically, in each simulation repetition, we simulated $k \in \{3, 5, 10, 20, 50\}$ true study effects (on the standardized mean difference scale) and corresponding effect estimates with standard errors. The mean true study effect was set to $\theta = 0.2$. The true study effect of study i was then simulated from a normal distribution $\theta _{i} \sim \operatorname {\mathrm {N}}(\theta , \tau ^2)$ with mean $\theta $ and heterogeneity variance $\tau ^2$.

Based on a true effect $\theta _{i}$ , the effect estimate $\hat {\theta }_{i}$ of study i was simulated from a normal distribution

$$ \begin{align*} \hat{\theta}_{i} \sim \operatorname{\mathrm{N}}(\theta_{i}, 2/n_{i}), \end{align*} $$

where $n_{i}$ is the sample size per group and set to $n_{i} = 50$ (small studies) or $n_{i} = 500$ (large studies). We considered scenarios with either $0$ , $1$ , or $2$ large studies (and the rest as small studies). The variance of the outcome variable of the studies is assumed to be 1. The squared standard error of $\hat {\theta }_{i}$ was simulated from a scaled chi-squared distribution

$$ \begin{align*} \sigma_i^{2} \sim \frac{1}{(n_i-1)n_i} \chi^2_{2(n_{i}-1)}, \end{align*} $$

which was then transformed to a standard error $\sigma _i$ by taking the square root. The heterogeneity variance $\tau ^{2}$ of the study effect distribution was specified via Higgins’ $I^{2}$. Specifically, we first computed the within-study variance using

(17) $$ \begin{align} \epsilon^2 = \frac{1}{k} \sum\limits_{i=1}^k \frac{2}{n_i}, \end{align} $$

from which we then computed the between-study heterogeneity variance

(18) $$ \begin{align} \tau^2 = \epsilon^2 \frac{I^2}{1-I^2}, \end{align} $$

where $I^{2} = \tau ^{2}/(\epsilon ^{2} + \tau ^{2})$ is Higgins’ relative heterogeneity.[52] We considered scenarios with $I^{2} = 0$, $0.3$, $0.6$, or $0.9$, representing a range from no heterogeneity up to high relative heterogeneity. All manipulated factors are listed in Table 4 and were varied in a fully factorial manner, resulting in 5 (number of studies) $\times $ 3 (number of large studies) $\times $ 4 (relative heterogeneity) $= 60$ simulation scenarios. Additional simulation results based on a skew normal study effect distribution are described in the Supplementary Material.
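The data-generating mechanism described above can be sketched in a few lines of R; the function name simulate_meta and its defaults are ours for illustration and are not taken from the authors’ simulation code.

```r
## One repetition of the data-generating mechanism: k studies on the SMD scale,
## n_large of them with n = 500 per group (the rest n = 50), heterogeneity via I2.
simulate_meta <- function(k, n_large = 0, I2 = 0.3, theta = 0.2) {
  n    <- c(rep(500, n_large), rep(50, k - n_large))   # per-group sample sizes
  eps2 <- mean(2 / n)                                  # within-study variance (17)
  tau2 <- eps2 * I2 / (1 - I2)                         # between-study variance (18)
  theta_i   <- rnorm(k, mean = theta, sd = sqrt(tau2)) # true study effects
  theta_hat <- rnorm(k, mean = theta_i, sd = sqrt(2 / n))
  se        <- sqrt(rchisq(k, df = 2 * (n - 1)) / ((n - 1) * n))
  data.frame(theta_hat = theta_hat, se = se)
}
set.seed(1)
simulate_meta(k = 5, n_large = 1, I2 = 0.6)
```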

Table 4 Factors considered in simulation study (varied in fully-factorial way).

3.1.3 Targets of analysis

In meta-analysis, the parameter of interest is typically the mean true study effect, which coincides with the median for a normal study effects distribution. The mean $\theta = 0.2$ is therefore used to evaluate coverage and bias of the point estimates of the different methods. In the Supplementary Material, we also use the median when the study effect distribution is assumed to be skew normal.

3.1.4 Methods

Each set of simulated effect estimates and standard errors was analyzed using different methods, each producing a point estimate and a 95% confidence interval for the true effect, namely:

  1. Standard fixed effect and DL random effects meta-analysis[3,53]

  2. HKSJ random effects meta-analysis[7,8]

and the five p-value combination methods listed in Table 1. All input p-values were one-sided and oriented in the positive effect direction (alternative “greater”). This setup exhausts all possible method/orientation combinations, as Edgington’s method is orientation-invariant whereas Fisher/Pearson and Wilkinson/Tippett are orientation-mirrored, as described in Section 2.

Section 3.3.1 gives results without adjustments for heterogeneity (using p-values derived from (2)), compared with FE meta-analysis. In Section 3.3.2, adjustments are made for potential between-study heterogeneity as described in Section 2.3, and the comparison is with random effects meta-analysis (DL and HKSJ). The restricted maximum likelihood (REML) estimate of the heterogeneity variance $\tau ^{2}$ was used, which is usually recommended as a default choice.[49] The FE and random effects meta-analysis methods were computed with the metagen function from the R package meta,[54] while the p-value combination methods were computed with the confMeta R package.[55]

3.1.5 Performance measures

Our primary performance measure was coverage of the 95% confidence interval which we estimated by

$$ \begin{align*} \widehat{\mathrm{Cov}} = \frac{\#\ 95\%\ \text{CI includes the true value }\theta=0.2}{n_{\mathrm{sim}}} \end{align*} $$

with Monte Carlo standard error (MCSE)

$$ \begin{align*} \mathrm{MCSE}_{\widehat{\mathrm{Cov}}} = \sqrt{\frac{\widehat{\mathrm{Cov}} (1 - \widehat{\mathrm{Cov}})}{n_{\mathrm{sim}}}}. \end{align*} $$

We conducted $n_{\mathrm {sim}} = 20,000$ simulation repetitions. This ensures a maximum MCSE of $0.35\%$ (attained when the estimated coverage is 50%), which we consider sufficiently small to detect relevant differences. Our secondary performance measures were bias and 95% confidence interval width; see Table 3 in Siepe et al.[56] for definitions and MCSE formulas. To assess the skewness properties of the different methods, we computed the skewness coefficient (14) for each 95% confidence interval. We then evaluated the distribution (mean, median, minimum, maximum) of the skewness coefficients for a given method and simulation scenario. To assess the relationship between confidence interval skewness and data skewness, we also computed the Pearson correlation between the 95% confidence interval skewness $\beta $ and Fisher’s weighted skewness coefficient (13) of the meta-analyzed effect estimates, with weights $w_i = 1/\sigma ^{2}_{i}$ and $w_{i} = 1/({\sigma ^{2}_{i} + \hat {\tau }^{2}})$ without and with heterogeneity adjustment, respectively. Finally, to assess agreement between confidence interval skewness and data skewness, we also computed Cohen’s $\kappa $ of the sign of $\gamma $ and the sign of $\beta $, using the function cohen.kappa from the psych R package.[57]
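For completeness, estimating coverage and its MCSE from the simulation output amounts to a one-line computation; the vector covered below is a hypothetical placeholder for the per-repetition indicator of whether the 95% CI contained $\theta = 0.2$.

```r
## Empirical coverage and its Monte Carlo standard error.
covered <- c(TRUE, TRUE, FALSE, TRUE)  # hypothetical indicator per repetition
cov_hat <- mean(covered)
mcse    <- sqrt(cov_hat * (1 - cov_hat) / length(covered))
c(coverage = cov_hat, MCSE = mcse)
```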

3.2 Computational aspects

The simulation study was performed using R version 4.4.1 (2024-06-14) on a server running Debian GNU/Linux. More information on the computational environment and code to reproduce the simulation study are available at https://github.com/felix-hof/confMeta_simulation.

3.3 Results

We will now describe the results of the simulation study, first without adjustments for heterogeneity (Section 3.3.1) and then with (Section 3.3.2). Without heterogeneity adjustments there were a few cases where Pearson and Wilkinson CIs did not converge (lowest convergence rate $99.37\%$ for Pearson when $I^2 = 0.9$ and $k=50$ studies with 2 large studies, see Table S2 in the Supplementary Material for details), and method performance was then estimated based on the convergent repetitions only (case-wise deletion). With heterogeneity adjustments, no non-convergent confidence intervals or point estimates occurred.

3.3.1 Without adjustments for heterogeneity

Coverage. Figure 3 shows the empirical coverage of the different methods without adjustments for heterogeneity. The coverage of all methods is at the nominal 95% level when data are generated under effect homogeneity ($I^2=0$), as expected from theory. The coverage drops below the nominal level for all methods under simulation with heterogeneity ($I^2>0$). However, the drop is the smallest for Edgington’s method, which always remains above 75%, even for conditions with high heterogeneity ($I^2=0.9$). In contrast, the coverage of all other methods (including FE meta-analysis) can drop to values below 25%.

Figure 3 Empirical coverage of the 95% confidence intervals based on 20,000 simulation repetitions.

Bias. Figure 4 shows the performance of the compared methods in terms of bias. We see that FE meta-analysis and all p-value combination methods are essentially unbiased under effect homogeneity ($I^2=0$). However, when there is heterogeneity ($I^2>0$), only FE meta-analysis and Edgington’s method remain unbiased, while the remaining p-value combination methods show increasing bias with increasing relative heterogeneity $I^2$. The bias patterns of Fisher/Pearson and Wilkinson/Tippett are mirrored around zero, because these methods are mirrored with respect to p-value orientation.

Figure 4 Empirical bias of the point estimates for the true effect based on 20,000 simulation repetitions.

Confidence interval width. Figure 5 shows the average width of the methods’ 95% confidence intervals. We see that the width of FE meta-analysis, Wilkinson’s, and Tippett’s methods remains constant across different levels of relative heterogeneity $I^2$. In contrast, Edgington’s, Fisher’s, and Pearson’s confidence interval widths increase as $I^2$ increases, adapting to greater heterogeneity. This adaptation is most pronounced for Edgington’s method. A very similar pattern can be observed for the AUCC, shown in Figure S1 in the Supplementary Material. Figure S4 in the Supplementary Material shows the confidence interval width relative to the FE meta-analysis method. Fisher’s method tends to have smaller width than all the other methods including FE meta-analysis, most pronounced for large $I^2$.

Figure 5 Mean width of 95% confidence intervals based on 20,000 simulation repetitions.

Confidence interval skewness. Figure S5 in the Supplementary Material displays the median and the range (min–max) of the CI skewness of the different methods. Both FE meta-analysis and Edgington’s method have a median skewness of zero, which is desirable, as we simulate from a non-skewed normal distribution. Whereas the skewness of FE meta-analysis is always zero, Edgington’s method shows considerable symmetric variation of the skewness coefficient around zero. This variation increases with $I^2$ and decreases with the number of studies. The other methods often have a median skewness different from zero and also show non-symmetry of the range around the median.

To investigate how well the skewness of the confidence interval captures the skewness of the data, Figure S6 in the Supplementary Material displays the correlation between the skewness of the data and the skewness of the confidence interval. There is always a positive correlation for Edgington’s method, while this is not the case for the other methods. Fisher’s and Pearson’s methods exhibit a negative correlation only sometimes (for simulations with $I^2=0.9$), whereas Wilkinson’s and Tippett’s methods have negative correlations most of the time.

Figure 6 shows the Cohen’s $\kappa $ agreement between the sign of the skewness of the confidence intervals and the sign of the skewness of the data. FE meta-analysis is not shown because its confidence intervals are always symmetric and thus always produce a skewness coefficient of zero. We can see that Edgington’s method shows consistently high agreement with a decreasing trend with increasing $I^2$. Agreement also decreases with increasing number of studies. This is to be expected, as we simulate from a normal study effect distribution with zero skewness; the distribution of the skewness of the data therefore concentrates more and more around zero as the number of studies increases. The other methods have surprisingly poor performance, sometimes not better than what would be expected by chance ($\kappa \approx 0$, Wilkinson and Tippett for no large study) and sometimes even worse ($\kappa < 0$, Wilkinson and Tippett with one or two large studies). This illustrates that only Edgington’s confidence intervals are capable of reflecting the skewness of the data. A very similar picture can be observed for the agreement of the AUCC ratio with data skewness, as shown in Figure S3 in the Supplementary Material, which suggests that the AUCC ratio, as a measure of the skewness of a confidence curve, is a useful generalization of the confidence interval skewness at the 95% level.

Figure 6 Cohen’s $\kappa $ sign agreement between 95% confidence interval skewness and data skewness based on 20,000 simulation repetitions.

3.3.2 With adjustments for heterogeneity

Coverage. Figure 7 shows the empirical coverage of the different methods with heterogeneity adjustment, which is, as expected, considerably better than without adjustments for heterogeneity (Figure 3). We see that the DL random effects meta-analysis (leftmost panels) has either too high (for $I^{2} = 0$) or too low (for $I^{2}> 0$) coverage for small numbers of studies, but seems to stabilize at the nominal 95% level as the number of studies increases. In contrast, for scenarios with no large studies (top panels), the HKSJ method shows almost perfect nominal coverage over all numbers of studies, while for scenarios with one or two large studies (middle and bottom panels), the coverage is slightly too low for small numbers of studies, consistent with the results of IntHout et al.[51] Focusing now on the p-value combination methods, we can see that Edgington’s method shows a qualitatively similar behavior to DL random effects meta-analysis, but with somewhat better coverage in most conditions. The coverage is in general not as good as with HKSJ, but close to the nominal level for a large number of studies. In contrast, the coverage of the Fisher, Pearson, Wilkinson, and Tippett methods does not stabilize at the nominal 95% level but increases above it as the number of studies increases.

Figure 7 Empirical coverage of the 95% confidence intervals based on 20,000 simulation repetitions.

Bias. Figure 8 shows the performance of the methods in terms of bias. We see that the DL and HKSJ random effects meta-analysis as well as Edgington’s method are essentially unbiased, while Fisher’s, Pearson’s, Wilkinson’s, and Tippett’s methods have substantial bias in most conditions. The bias patterns of Fisher/Pearson and Wilkinson/Tippett are mirrored around zero, because these methods are mirrored with respect to p-value orientation.

Figure 8 Empirical bias of the point estimates for the true effect based on 20,000 simulation repetitions.

Figure 9 Mean width of 95% confidence intervals based on 20,000 simulation repetitions.

Figure 10 Cohen’s  $\kappa $ sign agreement between 95% confidence interval skewness and data skewness based on 20,000 simulation repetitions.

Confidence interval width. Figure 9 shows the average width of the methods’ 95% confidence intervals, which is now considerably larger than without adjustments for heterogeneity (Figure 5). Edgington’s, Fisher’s, and Pearson’s methods all have somewhat wider confidence intervals than DL random effects meta-analysis, while HKSJ has substantially wider intervals in conditions with a small number of studies, as also noted by Weber et al.[58] All of these widths shrink as the number of studies increases. However, Wilkinson’s and Tippett’s methods seem to shrink much more slowly and remain relatively wide even with larger numbers of studies. This is seen even more clearly in Figure S10 in the Supplementary Material, which shows confidence interval width relative to the DL random effects meta-analysis method. We can see that the relative width of Wilkinson’s and Tippett’s methods increases, while the HKSJ and, to a lesser extent, Edgington’s, Fisher’s, and Pearson’s methods remain constant or decrease with increasing number of studies. Interestingly, the figure also shows that the HKSJ method can have narrower confidence intervals than the DL random effects meta-analysis, which has been described as a potential shortcoming in the literature.[59] In rare cases this may also happen with Edgington’s method, but only if there are no large studies and small amounts of relative heterogeneity.

Confidence interval skewness. Figure 10 shows the Cohen’s $\kappa $ agreement between the sign of the skewness of the confidence intervals and the sign of the skewness of the data. The DL random effects meta-analysis and HKSJ methods are not shown because their confidence intervals are always symmetric and thus always produce a skewness coefficient of zero. We can see that Edgington’s method shows consistently high agreement with a decreasing trend as the number of studies increases, while the agreement of the other methods is not better than what would be expected by chance ($\kappa \approx 0$, Fisher and Pearson) and sometimes even worse ($\kappa < 0$, Wilkinson and Tippett). Figure S11 in the Supplementary Material shows the median skewness of the confidence intervals and the corresponding min–max range, illustrating why all but Edgington’s method often show exactly zero agreement: as the number of studies increases, their confidence intervals tend to be skewed in only one direction. For ten or more studies, Pearson’s method produced only confidence intervals with negative skewness, while confidence intervals based on Fisher’s method were all positively skewed. Thus, the confidence interval cannot represent the skewness type of the data, even though the confidence interval skewness tends to be correlated with the data skewness (see Figure S12 in the Supplementary Material).

3.4 Summary of simulation results

Edgington’s method produced unbiased point estimates, its confidence intervals had comparable or better coverage, and were only slightly wider than the confidence intervals from an FE and DL random effects meta-analysis, respectively. In addition, it was the only method that could accurately represent data skewness. The remaining p-value combination methods, Fisher/Pearson and, to a greater extent, Wilkinson/Tippett, could not achieve satisfactory performance. Their point estimates were more biased, their coverage for a large number of studies was too high after adjustments for heterogeneity, and their confidence intervals could not reliably represent the skewness of the data.

The HKSJ method leads to known improvements in coverage compared to DL random effects meta-analysis, although nominal coverage is still not guaranteed when a meta-analysis includes a few studies that are much larger than the remaining ones.[51] The improved coverage of the HKSJ method also comes at the cost of substantially wider confidence intervals on average, in particular if the number of studies is small (see Figure S10 in the Supplementary Material).

4 Discussion and extensions

We have compared different p-value combination methods for meta-analysis, both theoretically and through simulation. The p-value function approach based on Edgington’s combination method constitutes a promising avenue for further research and applications. Its ability to reflect the skewness of the data will be attractive to applied meta-analysts. The good simulation performance (compared to standard FE and DL random effects meta-analysis, respectively) for a small number of studies suggests possible applicability in health technology assessment, where a Bayesian approach has recently been proposed as an alternative to the “overly conservative” HKSJ method.[60] A possible extension of the proposed method is cumulative meta-analysis, where studies are added one at a time in a specific order, usually time of publication.[61] It would be interesting to compare the width of the resulting confidence interval with the width of the standard random effects confidence interval as studies accumulate.

Adjustments for heterogeneity have been made based on the standard additive approach. Alternatively, multiplicative heterogeneity can be incorporated, where the squared standard errors $\sigma _i^2$ are multiplied by a factor $\phi> 0$. Then we would use the adjusted z-statistic

$$\begin{align*}Z_i = \frac{\hat \theta_i - \mu}{\sqrt{\hat \phi} \, \sigma_i}, \end{align*}$$

where $\hat \phi = \max \{Q/(k-1), 1\}$ is the appropriate overdispersion estimate.[62,63] However, both additive and multiplicative adjustments are based on a plug-in approach, which ignores the uncertainty of the heterogeneity estimates $\hat \tau ^2$ and $\hat \phi $, respectively. An interesting alternative would be to profile out the heterogeneity parameter.[24,25]

Jackson and White[64] raise concerns about the usual between-study normality assumption in meta-analysis, but also note that this issue has not received sufficient attention. Baker and Jackson[65,66] propose long-tailed, but still symmetric, random effects distributions, while Beath[67] considers a mixture of two normals to accommodate outliers. The two components have different variances but the same mean, so the resulting mixture distribution is still symmetric. Finally, Kontopantelis and Reeves[68] and Weber et al.[58] compare existing (symmetric) meta-analytic interval estimates in a simulation study with normal and skew normal random effects distributions. A possible area for future research is therefore to assume a non-normal distribution of the random effects, for example a skew normal distribution. The heterogeneity variance could then be estimated with a moment-based estimator (not assuming normality), together with the skewness parameter. The resulting distribution of the z-value (16) is then no longer normal; it needs to be calculated through numerical integration and can then be used to convert z-values to p-values.

It would also be interesting to extend the approach to compute prediction intervals for future study effects.[2,69,70] This would involve numerical integration of a $\operatorname {\mathrm {N}}(\mu , \tau ^2)$ (or skew normal) distribution with respect to the confidence density for $\mu $, which can be obtained from any (monotonically increasing) one-sided p-value function (for the alternative “greater”) through differentiation. For example, differentiation of the underlying exact one-sided p-value function from Edgington’s method in Figure 2 gives the confidence density shown in Figure 11. The confidence density is clearly skewed, which would then also be the case for the corresponding prediction interval. We plan to consider this in future work.
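The differentiation step can be illustrated with a small R sketch that numerically differentiates a monotonically increasing combined p-value function p_fun (for the alternative “greater”, e.g. Edgington’s method from the earlier sketches) to obtain a confidence density; the step size h and the grid limits are arbitrary choices for the hypothetical data used above.

```r
## Confidence density as the numerical derivative of a one-sided combined
## p-value function p_fun(mu) (alternative "greater").
conf_density <- function(mu, h = 1e-4) {
  sapply(mu, function(m) (p_fun(m + h) - p_fun(m - h)) / (2 * h))
}
mu_grid <- seq(-1.5, 1, length.out = 200)
dens    <- conf_density(mu_grid)  # integrates to 1 since p_fun ranges from 0 to 1
```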

Figure 11 Confidence density based on Edgington’s combined p-value function and exact p-values with mid-p correction for meta-analysis of $k=7$ randomized controlled clinical trials investigating the association between corticosteroids and mortality in hospitalized patients with COVID-19.[44]

Acknowledgements

We appreciate helpful comments by two anonymous referees and are grateful for additional suggestions by Ralf Bender and Frank Weber on an earlier version of this manuscript. We also appreciate inputs from Florian Gerber and Lisa Hofer in earlier stages of the project.

Author contributions

L.H.: Conceptualization, investigation, funding acquisition, writing-original draft, methodology, visualization, writing—review and editing, formal analysis, supervision. F.H.: Software, investigation, visualization, methodology, writing—review and editing. S.P.: Writing—review and editing, writing—original draft, methodology, software, investigation, visualization, formal analysis.

Competing interest statement

The authors declare that no competing interests exist.

Data availability statement

Meta-analysis with p-value combination methods is implemented in the R package confMeta available on GitHub (https://github.com/felix-hof/confMeta). The package will soon be submitted to CRAN. R code to reproduce our results is available at https://doi.org/10.17605/OSF.IO/JE8XB. R code to reproduce the simulation study is available at https://github.com/felix-hof/confMeta_simulation.

Funding statement

Partial support by the Swiss National Science Foundation (Project # 189295) in the initial phase of this project is gratefully acknowledged.

Supplementary material

The supplementary material for this article can be found at https://doi.org/10.1017/rsm.2025.26.

Appendix

A Closed-form solution of Wilkinson’s and Tippett’s methods

Assuming p-values under normality of the form (3), fixing the combined p-values $\overset {\rightharpoonup }{p_W}$ and $\overset {\leftharpoonup }{p_W}$ , respectively, of Wilkinson’s method from Table 1 to $\alpha $ and solving for $\mu $ , we obtain the following closed-form solutions

$$ \begin{align*} \hat{\mu}_W(\alpha) = \begin{cases} \min\limits_{i=1, \ldots, k}\left\{ \hat{\theta}_i + \sigma_i \, z_{\alpha^{1/k}} \right\} & \text{for~alternative = "greater"} \\ \max\limits_{i=1, \ldots, k}\left\{ \hat{\theta}_i - \sigma_i \, z_{\alpha^{1/k}} \right\} & \text{for~alternative = "less"} \\ \end{cases} \end{align*} $$

with $z_q$ the $q \times 100\%$ quantile of the standard normal distribution. Plugging in $\alpha = 0.5$ produces the median estimate, while $\alpha = 0.025$ and $\alpha = 0.975$ give the limits of a 95% confidence interval. A similar approach applied to the combined p-value from Tippett’s method leads to

$$ \begin{align*} \hat{\mu}_T(\alpha) = \begin{cases} \max\limits_{i=1, \ldots, k}\left\{\hat{\theta}_i - \sigma_i \, z_{(1 - \alpha)^{1/k}} \right\} & \text{for~alternative = "greater"} \\ \min\limits_{i=1, \ldots, k}\left\{\hat{\theta}_i + \sigma_i \, z_{(1 - \alpha)^{1/k}} \right\} & \text{for~alternative = "less."} \\ \end{cases} \end{align*} $$
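These closed-form solutions translate directly into R; the sketch below uses hypothetical estimates and standard errors and the alternative “greater”.

```r
## Closed-form median estimate / CI limits for Wilkinson's and Tippett's methods
## (alternative "greater"), cf. Appendix A.
mu_wilkinson <- function(alpha, theta_hat, se) {
  k <- length(theta_hat)
  min(theta_hat + se * qnorm(alpha^(1 / k)))
}
mu_tippett <- function(alpha, theta_hat, se) {
  k <- length(theta_hat)
  max(theta_hat - se * qnorm((1 - alpha)^(1 / k)))
}
theta_hat <- c(-0.3, 0.1, -0.6)   # hypothetical effect estimates
se        <- c(0.20, 0.25, 0.40)  # hypothetical standard errors
## Median estimate and 95% CI limits for Wilkinson's method:
sapply(c(0.5, 0.025, 0.975), mu_wilkinson, theta_hat = theta_hat, se = se)
```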

B Relationships of p-value combination methods

B.1 Edgington’s method

Define

$$\begin{align*}\overset{\leftharpoonup}{s} = \sum_{i=1}^{k} \overset{\leftharpoonup}{p_{i}} = \sum_{i=1}^{k} (1 - \overset{\rightharpoonup}{p_{i}})= k - \sum_{i=1}^{k} \overset{\rightharpoonup}{p_{i}} = k - \overset{\rightharpoonup}{s}, \end{align*}$$

so $\overset {\rightharpoonup }{p_{E}} = \Pr (\mathrm {I}_{k} \leq \overset {\rightharpoonup }{s}) = 1 - \Pr (\mathrm {I}_{k}> \overset {\rightharpoonup }{s})$ where $\mathrm {I}_{k} \in [0, k]$ is an Irwin–Hall random variable with parameter k. Since the Irwin–Hall distribution is symmetric around its mean $k/2$ , it holds that $\Pr (\mathrm {I}_{k} \leq {s}) = \Pr (\mathrm {I}_{k}> k - {s})$ for any $s \in [0,k]$ . Hence, it follows that

$$ \begin{align*} \overset{\rightharpoonup}{p_{E}} & = \Pr(\mathrm{I}_{k} \leq \overset{\rightharpoonup}{s}) \\ & = \Pr(\mathrm{I}_{k}> k - \overset{\rightharpoonup}{s}) \\ & = \Pr(\mathrm{I}_{k} > \overset{\leftharpoonup}{s}) \\ & = 1- \Pr(\mathrm{I}_{k} \leq \overset{\leftharpoonup}{s}) \\ & = 1 - \overset{\leftharpoonup}{p_{E}}. \end{align*} $$

B.2 Fisher’s and Pearson’s methods

Define

$$\begin{align*}\overset{\rightharpoonup}{f} = -2 \sum_{i=1}^{k} \log(\overset{\rightharpoonup}{p_{i}}) = -2 \sum_{i=1}^{k} \log(1 - \overset{\leftharpoonup}{p_{i}}) = \overset{\leftharpoonup}{g}. \end{align*}$$

It follows that

$$ \begin{align*} \overset{\rightharpoonup}{p_{F}} & = \Pr(\chi_{2k}^{2}> \overset{\rightharpoonup}{f}) \\ & = \Pr(\chi_{2k}^{2} > \overset{\leftharpoonup}{g}) \\ & = 1 - \Pr(\chi_{2k}^{2} \leq \overset{\leftharpoonup}{g}) \\ & = 1 - \overset{\leftharpoonup}{p_{P}}. \end{align*} $$

Similarly, we obtain $\overset{\rightharpoonup}{p_{P}} = 1 - \overset{\leftharpoonup}{p_{F}}$.
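A direct numerical check of this duality, again with illustrative p-values and function names of our own rather than the confMeta interface:

```r
# Fisher's method: combined p-value from one-sided p-values
p_fisher <- function(p) {
  k <- length(p)
  pchisq(-2 * sum(log(p)), df = 2 * k, lower.tail = FALSE)
}

# Pearson's method: based on -2 * sum(log(1 - p)) and the lower chi-squared tail
p_pearson <- function(p) {
  k <- length(p)
  pchisq(-2 * sum(log(1 - p)), df = 2 * k, lower.tail = TRUE)
}

# Numerical check of the duality p_F("greater") = 1 - p_P("less")
p_greater <- c(0.03, 0.20, 0.48)                 # illustrative one-sided p-values
p_fisher(p_greater) + p_pearson(1 - p_greater)   # equals 1
```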

B.3 Tippett’s and Wilkinson’s methods

We have

$$ \begin{align*} \overset{\rightharpoonup}{p_{T}} & = 1 - (1 - \min\{\overset{\rightharpoonup}{p_{1}}, \dots, \overset{\rightharpoonup}{p_{k}}\})^{k} \\ & = 1 - (1 - \min\{1-\overset{\leftharpoonup}{p_{1}}, \dots, 1-\overset{\leftharpoonup}{p_{k}}\})^{k} \\ & = 1 - \max\{\overset{\leftharpoonup}{p_{1}}, \dots, \overset{\leftharpoonup}{p_{k}}\}^{k} \\ & = 1 - \overset{\leftharpoonup}{p_{W}} \end{align*} $$

and similarly $\overset{\rightharpoonup}{p_{W}} = 1 - \overset{\leftharpoonup}{p_{T}}$.
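The same kind of numerical check applies here, using the maximum-based variant of Wilkinson's method from the derivation above; function names and p-values are again illustrative.

```r
# Tippett's method: based on the minimum p-value
p_tippett <- function(p) 1 - (1 - min(p))^length(p)

# Wilkinson's method (maximum-based variant, as used above)
p_wilkinson <- function(p) max(p)^length(p)

# Numerical check of the duality p_T("greater") = 1 - p_W("less")
p_greater <- c(0.03, 0.20, 0.48)                    # illustrative one-sided p-values
p_tippett(p_greater) + p_wilkinson(1 - p_greater)   # equals 1
```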

Footnotes

This article was awarded Open Data and Open Materials badges for transparent practices. See the Data availability statement for details.

References

Rice, K, Higgins, JPT, Lumley, T. A re-evaluation of fixed effect(s) meta-analysis. J. Royal Stat. Soc. Ser. A (Stat. Soc.) 2018;181(1):205–227. https://doi.org/10.1111/rssa.12275.
Higgins, JPT, Thompson, SG, Spiegelhalter, DJ. A re-evaluation of random-effects meta-analysis. J. Royal Stat. Soc. Ser. A (Stat. Soc.) 2009;172(1):137–159. https://doi.org/10.1111/j.1467-985X.2008.00552.x.
DerSimonian, R, Laird, N. Meta-analysis in clinical trials. Control. Clin. Trials. 1986;7(3):177–188. https://doi.org/10.1016/0197-2456(86)90046-2.
Normand, SLT. Meta-analysis: Formulating, evaluating, combining, and reporting. Stat. Med. 1999;18(3):321–359. https://doi.org/10.1002/(SICI)1097-0258(19990215)18:3<321::AID-SIM28>3.0.CO;2-P.
Jackson, D, Bowden, J, Baker, R. How does the DerSimonian and Laird procedure for random effects meta-analysis compare with its more efficient but harder to compute counterparts? J. Stat. Plan. Inference. 2010;140(4):961–970. https://doi.org/10.1016/j.jspi.2009.09.017.
Hartung, J, Knapp, G. On tests of the overall treatment effect in meta-analysis with normally distributed responses. Stat. Med. 2001;20(12):1771–1782. https://doi.org/10.1002/sim.791.
Hartung, J, Knapp, G. A refined method for the meta-analysis of controlled clinical trials with binary outcome. Stat. Med. 2001;20(24):3875–3889. https://doi.org/10.1002/sim.1009.
Sidik, K, Jonkman, JN. A simple confidence interval for meta-analysis. Stat. Med. 2002;21(21):3153–3159. https://doi.org/10.1002/sim.1262.
Henmi, M, Copas, JB. Confidence intervals for random effects meta-analysis and robustness to publication bias. Stat. Med. 2010;29:2969–2983. https://doi.org/10.1002/sim.4029.
Newcombe, RG. Interval estimation for the difference between independent proportions: Comparison of eleven methods. Stat. Med. 1998;17(8):873–890. https://doi.org/10.1002/(SICI)1097-0258(19980430)17:8<873::AID-SIM779>3.0.CO;2-I.
Newcombe, RG. Confidence Intervals for Proportions and Related Measures of Effect Size. Chapman & Hall/CRC Press; 2013.
Brookmeyer, R, Crowley, J. A confidence interval for the median survival time. Biometrics. 1982;38(1):29. https://doi.org/10.2307/2530286.
Efron, B, Tibshirani, RJ. An Introduction to the Bootstrap. Chapman & Hall/CRC Monographs on Statistics and Applied Probability. Chapman & Hall; 1993. https://doi.org/10.1007/978-1-4899-4541-9.
Chihara, L, Hesterberg, T. Mathematical Statistics with Resampling and R. 3rd ed. Wiley; 2022.
Campbell, MJ, Gardner, MJ. Medians and their differences. In: Altman, DG, Machin, D, Bryant, TN, Gardner, MJ, eds. Statistics with Confidence. 2nd ed. BMJ Books; 2000:36–44.
Fraser, DAS. The $p$-value function and statistical inference. Amer. Stat. 2019;73(sup1):135–147. https://doi.org/10.1080/00031305.2018.1556735.
Infanger, D, Schmidt-Trucksäss, A. P value functions: An underused method to present research results and to promote quantitative reasoning. Stat. Med. 2019;38(21):4189–4197. https://doi.org/10.1002/sim.8293.
Bender, R, Berg, G, Zeeb, H. Tutorial: Using confidence curves in medical research. Biom. J. 2005;47(2):237–247. https://doi.org/10.1002/bimj.200410104.
Marschner, IC. Confidence distributions for treatment effects in clinical trials: Posteriors without priors. Stat. Med. 2024;43(6):1271–1289. https://doi.org/10.1002/sim.10000.
Melilli, E, Veronese, P. Confidence distributions and hypothesis testing. Stat. Papers. 2024;65(6):3789–3820. https://doi.org/10.1007/s00362-024-01542-4.
Singh, K, Xie, M, Strawderman, WE. Combining information from independent sources through confidence distributions. Ann. Stat. 2005;33(1):159–183. https://doi.org/10.1214/009053604000001084.
Xie, M, Singh, K, Strawderman, WE. Confidence distributions and a unifying framework for meta-analysis. J. Amer. Stat. Assoc. 2011;106(493):320–333. https://doi.org/10.1198/jasa.2011.tm09803.
Liu, D, Liu, RY, Xie, M. Exact meta-analysis approach for discrete data and its application to $2\times 2$ tables with rare events. J. Amer. Stat. Assoc. 2014;109(508):1450–1465. https://doi.org/10.1080/01621459.2014.946318.
Cunen, C, Hjort, NL. Combining information across diverse sources: The II-CC-FF paradigm. Scand. J. Stat. 2021;49(2):625–656. https://doi.org/10.1111/sjos.12530.
Zabriskie, BN, Corcoran, C, Senchaudhuri, P. A comparison of confidence distribution approaches for rare event meta-analysis. Stat. Med. 2021;40(24):5276–5297. https://doi.org/10.1002/sim.9125.
Yang, G, Liu, D, Wang, J, Xie, MG. Meta-analysis framework for exact inferences with application to the analysis of rare events. Biometrics. 2016;72(4):1378–1386. https://doi.org/10.1111/biom.12497.
Rücker, G, Schwarzer, G. Beyond the forest plot: The drapery plot. Res. Synth. Methods. 2020;12(1):13–19. https://doi.org/10.1002/jrsm.1410.
Hedges, LV, Olkin, I. Statistical Methods for Meta-Analysis. Elsevier; 1985.
Cousins, RD. Annotated bibliography of some papers on combining significances or p-values. 2007. arXiv preprint. https://doi.org/10.48550/arXiv.0705.2209.
Edgington, ES. An additive method for combining probability values from independent experiments. J. Psychol. 1972;80(2):351–363. https://doi.org/10.1080/00223980.1972.9924813.
Edgington, ES. A normal curve method for combining probability values from independent experiments. J. Psychol. 1972;82(1):85–89. https://doi.org/10.1080/00223980.1972.9916971.
Fisher, RA. Statistical Methods for Research Workers. 4th ed. Oliver & Boyd; 1932.
Pearson, K. On a method of determining whether a sample of size n supposed to have been drawn from a parent population having a known probability integral has probably been drawn at random. Biometrika. 1933;25:379–410. https://doi.org/10.1093/biomet/25.3-4.379.
Wilkinson, B. A statistical consideration in psychological research. Psychol. Bull. 1951;48:156–158. https://doi.org/10.1037/h0059111.
Tippett, LHC. Methods of Statistics. Williams & Norgate; 1931.
Irwin, JO. On the frequency distribution of the means of samples from a population having any law of frequency with finite moments, with special reference to Pearson's type II. Biometrika. 1927;19(3–4):225–239. https://doi.org/10.1093/biomet/19.3-4.225.
Hall, P. The distribution of means for samples of size $n$ drawn from a population in which the variate takes values between 0 and 1, all such values being equally probable. Biometrika. 1927;19(3–4):240–244. https://doi.org/10.1093/biomet/19.3-4.240.
Ripley, BD. Stochastic Simulation. John Wiley & Sons; 1987. https://doi.org/10.1002/9780470316726.
Stouffer, SA, Suchman, EA, Devinney, LC, Star, SA, Williams, J. The American Soldier: Adjustment During Army Life. Studies in Social Psychology in World War II. Princeton University Press; 1949.
Senn, S. Statistical Issues in Drug Development. Wiley; 2021. https://doi.org/10.1002/9781119238614.
Fraser, DAS. $p$-Values: The insight to modern statistical inference. Annu. Rev. Stat. Appl. 2017;4:1–14. https://doi.org/10.1146/annurev-statistics-060116-054139.
Berrar, D. Confidence curves: An alternative to null hypothesis significance testing for the comparison of classifiers. Mach. Learn. 2017;106(6):911–949. https://doi.org/10.1007/s10994-016-5612-6.
Schweder, T, Hjort, NL. Confidence, Likelihood, Probability. Cambridge University Press; 2016. https://doi.org/10.1017/CBO9781139046671.
WHO REACT Working Group. Association between administration of systemic corticosteroids and mortality among critically ill patients with COVID-19: A meta-analysis. JAMA. 2020;324(13):1330–1341. https://doi.org/10.1001/jama.2020.17023.
Ferschl, F. Deskriptive Statistik. 2nd ed. Physica-Verlag; 1980. https://doi.org/10.1007/978-3-662-21775-7.
Groeneveld, RA, Meeden, G. Measuring skewness and kurtosis. Statistician. 1984;33(4):391. https://doi.org/10.2307/2987742.
Lancaster, HO. Significance tests in discrete distributions. J. Amer. Stat. Assoc. 1961;56(294):223–234. https://doi.org/10.1080/01621459.1961.10482105.
Cochran, W. Problems arising in the analysis of a series of similar experiments. J. Royal Stat. Soc. 1937;4(1):102–118. https://doi.org/10.2307/2984123.
Langan, D, Higgins, JP, Jackson, D, et al. A comparison of heterogeneity variance estimators in simulated random-effects meta-analyses. Res. Synth. Methods. 2019;10(1):83–98. https://doi.org/10.1002/jrsm.1316.
Morris, TP, White, IR, Crowther, MJ. Using simulation studies to evaluate statistical methods. Stat. Med. 2019;38(11):2074–2102. https://doi.org/10.1002/sim.8086.
IntHout, J, Ioannidis, JP, Borm, GF. The Hartung-Knapp-Sidik-Jonkman method for random effects meta-analysis is straightforward and considerably outperforms the standard DerSimonian-Laird method. BMC Med. Res. Methodol. 2014;14. https://doi.org/10.1186/1471-2288-14-25.
Higgins, JPT, Thompson, SG. Quantifying heterogeneity in a meta-analysis. Stat. Med. 2002;21(11):1539–1558. https://doi.org/10.1002/sim.1186.
Borenstein, M, Hedges, LV, Higgins, JP, Rothstein, HR. A basic introduction to fixed-effect and random-effects models for meta-analysis. Res. Synth. Methods. 2010;1(2):97–111. https://doi.org/10.1002/jrsm.12.
Balduzzi, S, Rücker, G, Schwarzer, G. How to perform a meta-analysis with R: A practical tutorial. Evid.-Based Mental Health. 2019;22:153–160. https://doi.org/10.1136/ebmental-2019-300117.
Hofmann, F, Held, L. confMeta: Confidence Curves and P-Value Functions for Meta-Analysis. University of Zurich; 2024. R package version 0.3.1.
Siepe, BS, Bartoš, F, Morris, TP, Boulesteix, AL, Heck, DW, Pawel, S. Simulation studies for methodological research in psychology: A standardized template for planning, preregistration, and reporting. Psychol. Methods. 2024. https://doi.org/10.1037/met0000695.
Revelle, W. psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University; 2024. R package version 2.4.3.
Weber, F, Knapp, G, Glass, A, Kundt, G, Ickstadt, K. Interval estimation of the overall treatment effect in random-effects meta-analyses: Recommendations from a simulation study comparing frequentist, Bayesian, and bootstrap methods. Res. Synth. Methods. 2021;12(3):291–315. https://doi.org/10.1002/jrsm.1471.
Jackson, D, Law, M, Rücker, G, Schwarzer, G. The Hartung-Knapp modification for random-effects meta-analysis: A useful refinement but are there any residual concerns? Stat. Med. 2017;36(25):3923–3934. https://doi.org/10.1002/sim.7411.
Lilienthal, J, Sturtz, S, Schürmann, C, et al. Bayesian random-effects meta-analysis with empirical heterogeneity priors for application in health technology assessment with very few studies. Res. Synth. Methods. 2024;15(2):275–287. https://doi.org/10.1002/jrsm.1685.
Egger, M, Higgins, JPT, Smith, GD. Systematic Reviews in Health Research—Meta-Analysis in Context. 3rd ed. John Wiley & Sons; 2022. https://doi.org/10.1002/9781119099369.
Stanley, TD, Doucouliagos, H. Neither fixed nor random: Weighted least squares meta-analysis. Stat. Med. 2015;34(13):2116–2127. https://doi.org/10.1002/sim.6481.
Mawdsley, D, Higgins, JPT, Sutton, AJ, Abrams, KR. Accounting for heterogeneity in meta-analysis using a multiplicative model—An empirical study. Res. Synth. Methods. 2017;8(1):43–52. https://doi.org/10.1002/jrsm.1216.
Jackson, D, White, IR. When should meta-analysis avoid making hidden normality assumptions? Biom. J. 2018;60(6):1040–1058. https://doi.org/10.1002/bimj.201800071.
Baker, R, Jackson, D. A new approach to outliers in meta-analysis. Health Care Manag. Sci. 2008;11(2):121–131. https://doi.org/10.1007/s10729-007-9041-8.
Baker, R, Jackson, D. New models for describing outliers in meta-analysis. Res. Synth. Methods. 2016;7(3):314–328. https://doi.org/10.1002/jrsm.1191.
Beath, KJ. A finite mixture method for outlier detection and robustness in meta-analysis. Res. Synth. Methods. 2014;5(4):285–293. https://doi.org/10.1002/jrsm.1114.
Kontopantelis, E, Reeves, D. Performance of statistical methods for meta-analysis when true study effects are non-normally distributed: A simulation study. Stat. Methods Med. Res. 2012;21(4):409–426. https://doi.org/10.1177/0962280210392008.
Hamaguchi, Y, Noma, H, Nagashima, K, Yamada, T, Furukawa, TA. Frequentist performances of Bayesian prediction intervals for random-effects meta-analysis. Biom. J. 2021;63(2):394–405. https://doi.org/10.1002/bimj.201900351.
Nagashima, K, Noma, H, Furukawa, TA. Prediction intervals for random-effects meta-analysis: A confidence distribution approach. Stat. Methods Med. Res. 2019;28(6):1689–1702. https://doi.org/10.1177/0962280218773520.
Figures and tables

Table 1 Some methods for combining one-sided p-values $p_{1}, \dots, p_{k}$ from $k$ studies into a combined p-value $p_{\bullet}$ (in alphabetic order).

Table 2 Data from $k=7$ randomized controlled clinical trials investigating the association between corticosteroids and mortality in hospitalized patients with COVID-19.44

Figure 1 Drapery plot (top) and forest plot (bottom) from several combination methods with 95% confidence intervals for a meta-analysis of $k=7$ randomized controlled clinical trials investigating the association between corticosteroids and mortality in hospitalized patients with COVID-19.44

Table 3 Comparison of different p-value combination methods (with alternative “less”) investigating the association between corticosteroids and mortality in hospitalized patients with COVID-19.

Figure 2 Combined p-value functions with 95% confidence intervals based on exact p-values with mid-p correction for meta-analysis of $k=7$ randomized controlled clinical trials investigating the association between corticosteroids and mortality in hospitalized patients with COVID-19.44 Note: Shown are also the normal approximation p-value functions based on the z-statistics (2) for comparison.

Table 4 Factors considered in simulation study (varied in fully-factorial way).

Figure 3 Empirical coverage of the 95% confidence intervals based on 20,000 simulation repetitions.

Figure 4 Empirical bias of the point estimates for the true effect based on 20,000 simulation repetitions.

Figure 5 Mean width of 95% confidence intervals based on 20,000 simulation repetitions.

Figure 6 Cohen's $\kappa$ sign agreement between 95% confidence interval skewness and data skewness based on 20,000 simulation repetitions.

Figure 7 Empirical coverage of the 95% confidence intervals based on 20,000 simulation repetitions.

Figure 8 Empirical bias of the point estimates for the true effect based on 20,000 simulation repetitions.

Figure 9 Mean width of 95% confidence intervals based on 20,000 simulation repetitions.

Figure 10 Cohen's $\kappa$ sign agreement between 95% confidence interval skewness and data skewness based on 20,000 simulation repetitions.

Figure 11 Confidence density based on Edgington's combined p-value function and exact p-values with mid-p correction for meta-analysis of $k=7$ randomized controlled clinical trials investigating the association between corticosteroids and mortality in hospitalized patients with COVID-19.44