
Producing treatment hierarchies in network meta-analysis using probabilistic models and treatment-choice criteria

Published online by Cambridge University Press:  20 February 2026

Theodoros Evrenoglou*
Affiliation:
Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center-University of Freiburg, Freiburg im Breisgau, Germany; Center of Research in Epidemiology and Statistics (CRESS-U1153), Université Paris Cité, INSERM, Paris, France
Adriani Nikolakopoulou
Affiliation:
Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center-University of Freiburg, Freiburg im Breisgau, Germany; Department of Hygiene, Social-Preventive Medicine and Medical Statistics, School of Medicine, Aristotle University of Thessaloniki, Thessaloniki, Greece
Guido Schwarzer
Affiliation:
Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center-University of Freiburg, Freiburg im Breisgau, Germany
Gerta Rücker
Affiliation:
Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center-University of Freiburg, Freiburg im Breisgau, Germany
Anna Chaimani
Affiliation:
Center of Research in Epidemiology and Statistics (CRESS-U1153), Université Paris Cité, INSERM, Paris, France; Oslo Center for Biostatistics and Epidemiology, Department of Biostatistics, University of Oslo, Oslo, Norway
*Corresponding author: Dr. Theodoros Evrenoglou; Email: theodoros.evrenoglou@uniklinik-freiburg.de

Abstract

A key output of network meta-analysis (NMA) is the relative ranking of treatments; nevertheless, it has attracted substantial criticism. Existing ranking methods often lack clear interpretability and fail to adequately account for uncertainty, overemphasizing small differences in treatment effects. We propose a novel framework to estimate treatment hierarchies in NMA using a probabilistic model, focusing on a clinically relevant treatment-choice criterion (TCC). Initially, we define a TCC based on smallest worthwhile differences (SWD), converting NMA relative treatment effects into treatment preference format. These data are then synthesized using a probabilistic ranking model, assigning each treatment a latent “ability” parameter, representing its propensity to yield clinically important and beneficial true treatment effects relative to the rest of the treatments in the network. Parameter estimation relies on the maximum likelihood theory, with standard errors derived asymptotically from the Hessian matrix. To facilitate the use of our methods, we launched the R package mtrank. We applied our method to two clinical datasets: one comparing 18 antidepressants for major depression and another comparing 6 antihypertensives for the incidence of diabetes. Our approach provided robust, interpretable treatment hierarchies that account for a concrete TCC. We further examined the agreement between the proposed method and existing ranking metrics in 153 published networks, concluding that the degree of agreement depends on the precision of the NMA estimates. Our framework offers a valuable alternative for NMA treatment ranking, mitigating overinterpretation of minor differences. This enables more reliable and clinically meaningful treatment hierarchies.

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology

Highlights

What is already known?

The relative ranking of different competing treatments is a key output of a network meta-analysis (NMA). However, ranking has been widely criticized for being prone to overinterpretation and for overemphasizing small differences in treatment effects. In addition, current ranking metrics lack a straightforward way to measure uncertainty in treatment rankings. This creates challenges in interpretation, particularly when treatments are ranked adjacently.

What is new?

We introduce a new framework for estimating treatment hierarchies in network meta-analysis (NMA) using a probabilistic model, focusing on a clinically relevant treatment-choice criterion (TCC). This TCC is mathematically defined using the smallest worthwhile difference (SWD), which represents the smallest beneficial effect of a treatment that justifies a preference for it over another. We apply this TCC to all NMA estimates to determine either a treatment preference or a tie. The treatment preferences are synthesized using a probabilistic ranking model, where each treatment is assigned a latent “ability” parameter that reflects its propensity to yield clinically important and beneficial treatment effects, according to the TCC, when compared to other treatments in the network. In this way, treatments with higher estimated ability occupy a higher position in the final ranking list. The ability-based ranking metric is estimated via maximum likelihood, and standard errors are derived asymptotically from the Hessian matrix. To support the practical application of our method, we have developed and released the R package mtrank.

Potential impact for RSM readers

Our framework provides a new way to estimate treatment hierarchies in NMA, offering an alternative to existing ranking methods. Recent studies stress the importance of clearly defining treatment hierarchy questions before estimating rankings. To our knowledge, this is the first method to explicitly and quantitatively address this question using a predefined TCC. Researchers can use this method as a primary ranking tool or as a sensitivity analysis alongside traditional metrics, especially when clinically relevant TCCs are known.

1 Introduction

Interpretation of network meta-analysis (NMA) outputs can be challenging as it usually comprises consideration of multiple treatment effects with different levels of uncertainty and credibility across comparisons in the network.Reference Salanti, Ades and Ioannidis1, Reference Chaimani, Higgins, Mavridis, Spyridonos and Salanti2 For example, in the relatively simple case of a network with 6 treatments the output of NMA consists of 15 treatment effect estimates. In such a context, treatment ranking can be a reliable way to summarize the evidence provided by complex treatment networks.Reference Salanti, Ades and Ioannidis1, Reference Chaimani, Caldwell, Li, Higgins and Salanti3, Reference Rücker and Schwarzer4 This may explain the fact that treatment hierarchies are frequently presented in published NMAs with 43% of them reporting at least one ranking metric.Reference Chiocchia, Nikolakopoulou, Papakonstantinou, Egger and Salanti5

Probably the most commonly used ranking metric, until recently, was the probability of a treatment to have the best value,Reference Chiocchia, Nikolakopoulou, Papakonstantinou, Egger and Salanti5 usually denoted as ${p}_{BV}$ . This is primarily a Bayesian metric, but it can also be calculated within the frequentist framework using resampling, thereby mimicking a Bayesian framework with flat priors. It represents the probability that a treatment in the network will have the best true treatment effect.Reference Salanti, Nikolakopoulou, Efthimiou, Mavridis, Egger and White6 Although ${p}_{BV}$ has been widely used in published NMAs, more recently it has been criticized for not properly accounting for the uncertainty of the NMA estimates.Reference Rücker and Schwarzer4, Reference Nikolakopoulou, Mavridis, Chiocchia, Papakonstantinou, Furukawa and Salanti7, Reference Veroniki, Straus, Rücker and Tricco8

Other common ranking metrics are P-scores,Reference Rücker and Schwarzer4 which are obtained analytically through the cumulative distribution function of the standard normal distribution, or their Bayesian equivalent SUCRA,Reference Salanti, Ades and Ioannidis1 which represents the surface under the cumulative ranking curve for each treatment. The main limitation of these metrics is that they often lead to attributing distinct ranks to treatments even when there are only small differences between their SUCRA values or P-scores. Nikolakopoulou et al.Reference Nikolakopoulou, Mavridis, Chiocchia, Papakonstantinou, Furukawa and Salanti7 employed the “deviation from the means” approach for the construction of the design matrix in the NMA model and introduced a new ranking metric, called the probability of a treatment being preferable to a fictional treatment of average performance (PReTA). This metric potentially accounts better for the uncertainty in the relative effects than P-scores or SUCRAs, particularly when there is substantial variability in the precision of the NMA estimates. This is an important advantage since an empirical study revealed high agreement across all ranking metrics when NMA estimates had similar variance estimates, but large sensitivity to the choice of metric for networks with large discrepancies in the variance of the NMA estimates.Reference Chiocchia, Nikolakopoulou, Papakonstantinou, Egger and Salanti5 More recently, new ranking metrics and approaches have been developed to address more complex ranking questions. Mavridis et al.Reference Mavridis, Porcher, Nikolakopoulou, Salanti and Ravaud9 extended P-scores to incorporate clinically important values, while Curteis et al. proposed a similar extension in terms of the SUCRA ranking method.Reference Curteis, Wigle, Michaels and Nikolakopoulou10 Chaimani et al.Reference Chaimani, Porcher, Sbidian and Mavridis11 suggested that treatment rankings should consider not only the summary relative effects but also other information, such as study or treatment characteristics. They introduced a new metric, called the probability of selecting a treatment to recommend (POST-R), that incorporates additional characteristics into the treatment hierarchy (e.g., risk of bias or treatment cost). Papakonstantinou et al.Reference Papakonstantinou, Salanti, Mavridis, Rücker, Schwarzer and Nikolakopoulou12 developed a resampling approach for estimating the probability that a specific treatment hierarchy occurs or that a predefined criterion is met.

Despite its usefulness when properly reported and interpreted, treatment ranking in NMA has been met with considerable skepticism.Reference Mills, Kanters, Thorlund, Chaimani, Veroniki and Ioannidis13 Reference Kibret, Richer and Beyene16 Common arguments against treatment ranking include that it can be biased, it is difficult to interpret, it is not accompanied by uncertainty measures, and it may overemphasize unimportant differences in the treatment effect estimates.Reference Trinquart, Attiche, Bafeta, Porcher and Ravaud14, Reference Cipriani, Higgins, Geddes and Salanti15 For example, Kibret et al. performed a simulation study and found that ranking can be biased when there is an unequal number of studies per comparison in the network, with the rank probability for the treatment included in the fewest studies tending to suffer from upward bias.Reference Kibret, Richer and Beyene16 However, Salanti et al.Reference Salanti, Nikolakopoulou, Efthimiou, Mavridis, Egger and White6 argued that these criticisms should not refer to the ranking metrics per se but to the way they are used and interpreted. This is because different metrics target different types of hierarchy questions and researchers should clearly define what they mean by “best treatment” in a given setting. Hence, setting a well-defined treatment hierarchy question should always precede the estimation of treatment ranking and drive the choice of the ranking metric.Reference Salanti, Nikolakopoulou, Efthimiou, Mavridis, Egger and White6

In this article, we introduce a novel approach for estimating treatment hierarchies in NMA based on a treatment-choice criterion (TCC) constructed to ensure clinically important treatment effects. This TCC splits the NMA estimates into those that fulfil the criterion, indicating a treatment effect that justifies a treatment preference, and those that do not indicate a clear treatment preference. We then use a probabilistic model that yields the final treatment hierarchy by synthesizing the treatment preferences obtained from the TCC. Our manuscript is organized as follows. First, we define the TCC based on clinically important values. We then apply the criterion to the NMA treatment effects, taking into account their confidence intervals to get either a treatment preference or a tie. Our synthesis model estimates the treatment hierarchy through a latent parameter assigned to each treatment in the network that represents its “ability” to yield clinically important and beneficial treatment effects in the context of the defined TCC. In this way, treatments with higher estimated abilities are positioned more prominently in the final ranking. This modeling approach has been previously used to produce rankings in fields outside of medicine, such as sports science,Reference Cattelan, Varin and Firth17 animal behavior,Reference Stuart-Fox, Firth, Moussalli and Whiting18 and risk analysis.Reference Merrick, van Dorp, Mazzuchi, Harrald, Spahn and Grabowski19 To illustrate our method and compare it with existing alternatives, we use two published NMAs: one comparing different antidepressantsReference Cipriani, Furukawa and Salanti20 for major depression and a second evaluating different antihypertensivesReference Elliott and Meyer21 for the incidence of diabetes. Finally, we investigate the agreement between the new and existing ranking metrics through an empirical study where we reanalyze 153 published networks.Reference Petropoulou, Nikolakopoulou and Veroniki22, Reference Nikolakopoulou, Chaimani, Veroniki, Vasiliadis, Schmid and Salanti23

2 Methods

2.1 Defining treatment-choice criteria based on NMA estimates

Suppose a network of $N$ studies comparing $T$ treatments. Let $\widehat{\boldsymbol{\theta}}={\left[{\widehat{\theta}}_{XY}\right]}_{X\ne Y},\mathrm{where}\;X,Y\in \left\{1,2,\dots, T\right\}$ , denote the $\left(\genfrac{}{}{0pt}{}{T}{2}\right)$ -vector containing all treatment effect estimates obtained from the NMA. Let also $\boldsymbol{l}={\left[{l}_{XY}\right]}_{X\ne Y}$ and $\boldsymbol{u}={\left[{u}_{XY}\right]}_{X\ne Y},X,Y\in \left\{1,2,\dots, T\right\}$ represent the corresponding vectors containing the lower and upper bounds of the confidence intervals for each ${\widehat{\theta}}_{XY}$ . We start building our modeling approach by defining concrete criteria for choosing one treatment over another or considering two treatments as equivalent. These criteria have the form of a decision rule and may depend on several factors, such as the clinical setting, the outcome(s) under investigation, or even the type of patients under consideration (e.g., chronic patients vs. treatment-naïve individuals). Here, we suggest a generic approach that can be easily adapted to different settings based on the so-called range of equivalence (ROE). The ROE has been previously introduced as a way to infer on the clinical importance of a treatment effect in the context of appraising NMA estimates; relative effects lying within this range are considered lacking a treatment preference.Reference Nikolakopoulou, Higgins and Papakonstantinou24

Following Nikolakopoulou et al.,Reference Nikolakopoulou, Higgins and Papakonstantinou24 we construct the ROE using the smallest worthwhile difference (SWD), representing the smallest beneficial effect of a treatment that justifies a preference for it over another treatment,Reference Sahker, Furukawa and Luo25 and its reciprocal (or opposite) value. For example, for a given SWD equal to 1.25 on the odds ratio (OR) scale, the ROE would range from $\frac{1}{1.25}$ to $1.25$. Based on the ROE, the TCC distinguishes between treatment preferences or ties as follows: a comparison between treatments $X$ and $Y$ will indicate no clear treatment preference when either ${\widehat{\theta}}_{XY}$ lies within the ROE or its confidence interval bounds ${l}_{XY}$ and ${u}_{XY}$ extend in opposite directions beyond the ROE. In such cases, the TCC is not satisfied, as there is insufficient evidence to support a clear treatment preference, and therefore treatments $X$ and $Y$ are considered as equivalent (i.e., $X=Y$). In all other cases, the TCC is fulfilled and a treatment preference is defined (i.e., either $X>Y$ or $X<Y$) based on the direction of the effect ${\widehat{\theta}}_{XY}$. Figure 1 illustrates the above TCC for the case of a beneficial outcome (i.e., larger treatment effect values are desirable) in a fictional example. The mathematical representation of this rule is available in our Supplementary Material.

Figure 1 A graphical representation of the TCC for a fictional example showing the NMA estimates for the comparison of eight treatments versus a common reference treatment $Y$ in terms of a beneficial outcome.
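The decision rule of Section 2.1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the mtrank package: the function name `apply_tcc`, its arguments, the treatment of the ROE boundaries as inclusive, and all numerical inputs are hypothetical choices made here. Effects are taken on the OR scale, so the ROE runs from 1/SWD to SWD.

```python
def apply_tcc(theta, lower, upper, swd, larger_is_better=True):
    """Classify one NMA estimate for X versus Y as 'X>Y', 'X<Y' or 'X=Y'.

    theta, lower, upper: point estimate and confidence bounds (OR scale).
    swd: smallest worthwhile difference on the OR scale (assumed > 1).
    larger_is_better: True for a beneficial outcome (hypothetical flag).
    """
    roe_low, roe_high = 1.0 / swd, swd  # range of equivalence (ROE)

    # Tie, case 1: the point estimate lies within the ROE
    # (boundaries counted as inside; this is an assumption).
    inside_roe = roe_low <= theta <= roe_high
    # Tie, case 2: the interval extends beyond the ROE in both directions.
    spans_roe = lower < roe_low and upper > roe_high
    if inside_roe or spans_roe:
        return "X=Y"

    # Otherwise the TCC is fulfilled; the direction of the effect decides.
    x_has_larger_effect = theta > 1.0
    return "X>Y" if x_has_larger_effect == larger_is_better else "X<Y"


# Hypothetical estimates, SWD = 1.25 as in the example of Section 2.1.
clear_benefit = apply_tcc(1.50, 1.10, 2.05, swd=1.25)   # preference for X
within_roe = apply_tcc(1.10, 0.95, 1.27, swd=1.25)      # tie, case 1
too_imprecise = apply_tcc(1.50, 0.70, 3.21, swd=1.25)   # tie, case 2
clear_harm = apply_tcc(0.60, 0.40, 0.90, swd=1.25)      # preference for Y

# The same estimate can change status when the SWD changes.
small_swd = apply_tcc(1.30, 1.10, 1.54, swd=1.10)
large_swd = apply_tcc(1.30, 1.10, 1.54, swd=1.50)
```

The last two calls show why the SWD has to be fixed before ranking: an OR of 1.30 justifies a preference when the SWD is 1.10 but counts as a tie when the SWD is 1.50.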

All NMA estimates are translated into a treatment preference format by applying the TCC, indicating either a treatment preference or a tie. Throughout the remainder of this manuscript, treatment effects that fulfil the TCC and therefore justify a treatment preference are also referred to as clinically important effects. Note that the TCC, as defined here, serves as a generic decision rule. Although not formally grounded in mathematical principles, the described TCC has been widely adopted by other established frameworks, such as CINeMAReference Nikolakopoulou, Higgins and Papakonstantinou24, Reference Papakonstantinou, Nikolakopoulou, Higgins, Egger and Salanti26 and GRADE,Reference Guyatt, Oxman and Kunz27 Reference Brignardello-Petersen, Florez and Izcovich29 which aim to put NMA estimates into a decision-analytic framework. In practical scenarios, investigators may also adjust this TCC based on their context specific needs. A crucial step in defining the TCC in our setting involves determining the SWD which is always context specific. To this end, both statistical and elicitation-based approaches for defining a SWD have been suggested elsewhere and therefore this task is beyond the scope of this article.Reference Sahker, Furukawa and Luo25, Reference Copay, Subach, Glassman, Polly and Schuler30 Reference McNamara, Elkins, Ferreira, Spencer and Herbert33

2.2 Estimating treatment hierarchies based on treatment-choice criteria using probabilistic models

To synthesize the treatment preferences obtained from the TCC, we adapt the so-called “Bradley–Terry model”Reference McNamara, Elkins, Ferreira, Spencer and Herbert33 Reference Turner, van Etten, Firth and Kosmidis36 to the context of NMA. This is a probabilistic model, suitable for modeling preference data and originally suggested to estimate ranking outside NMA (e.g., sports tournaments) but, to the best of our knowledge, was never adapted to estimate treatment hierarchies in NMA.

We parameterize the model using an unobserved latent parameter ${\psi}_X\ge 0$ that represents the “ability” each treatment $X$ has to outperform the other treatments in the network conditional on the TCC. In this context, the term “ability to outperform” refers to the propensity of a treatment $X$ to produce clinically important and beneficial true treatment effects, as determined from the TCC, relative to the remaining treatments in the network. Throughout the manuscript, the term “outperform” will be used according to this definition. Given all these considerations, the treatment hierarchy question addressed here is “Based on the predefined TCC, what is the overall propensity of each treatment to yield clinically important and beneficial true treatment effects when compared to the rest of the treatments in the network?” Consequently, the true ability of each treatment is the parameter of interest here, modeled through a Bradley–Terry model.Reference McNamara, Elkins, Ferreira, Spencer and Herbert33 Reference Turner, van Etten, Firth and Kosmidis36

Establishing a direct one-to-one mathematical relationship between the true treatment effects and the treatment abilities is challenging as the former are fixed unknown parameters, while the latter are unobserved treatment characteristics that depend jointly on (i) the magnitude of the true treatment effects and (ii) the TCC of interest, which defines clinically important effects. Nevertheless, it is expected that treatments associated with estimated treatment effects fulfilling the TCC will yield larger ability estimates. In this way, treatments with higher ability estimates will occupy higher positions in the final ranking list.Reference Bradley and Terry34 Let also $\boldsymbol{\psi}$ denote the $T$-length vector that contains the ability of each treatment in the network.

The idea behind this model stems from Luce’s axiomReference Bradley and Terry34, Reference Davidson37, Reference Luce38 of choice which states that the probability that a treatment $X$ has the largest ability among all $T$ treatments, with respect to the TCC, is equal to $\frac{\psi_X}{\sum \nolimits_{i=1}^T{\psi}_i}$ . Luce’s axiom of choice is valid under the assumption of independence of irrelevant alternatives. In the NMA setting, this assumption states that whether treatment $X$ is ranked higher than treatment $Y$ does not depend on other treatment options. This assumption is expected to hold whenever the underlying NMA assumption of transitivity holds, as in such cases the true treatment effect ${\theta}_{XY}$ is consistent across both direct and indirect comparisons.

2.2.1 Synthesizing treatment preferences obtained from the treatment-choice criterion

Following the earlier axiom, for each pairwise comparison in the network, the probability that treatment $X$ will outperform treatment $Y$ ($X\ne Y$; $X,Y=1,2,\dots, T$) is

$$\Pr \left(X>Y\right)=\frac{\psi_X}{\psi_X+{\psi}_Y} \tag{1}$$

with ${\psi}_X\ge 0\ \forall X\in \left\{1,2,\dots, T\right\}$ and $\sum \nolimits_{i=1}^T{\psi}_i=1$ . Based on Equation (1), a logit-linear Bradley–Terry model can be parametrized as

$$\operatorname{logit}\left(\Pr \left(X>Y\right)\right)=\log \left({\psi}_X\right)-\log \left({\psi}_Y\right) \tag{2}$$
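As a quick numerical check of Equations (1) and (2), the following Python sketch computes the preference probability for two hypothetical ability values and confirms that its logit equals the difference of the log-abilities. The function names and the ability values are illustrative assumptions, not part of the original analysis.

```python
import math


def bt_prob(psi_x, psi_y):
    """Equation (1): probability that X outperforms Y (no ties allowed)."""
    return psi_x / (psi_x + psi_y)


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


# Hypothetical abilities for a two-treatment network (they sum to 1).
psi_x, psi_y = 0.75, 0.25

p_xy = bt_prob(psi_x, psi_y)
lhs = logit(p_xy)                           # left-hand side of Equation (2)
rhs = math.log(psi_x) - math.log(psi_y)     # right-hand side of Equation (2)
print(p_xy, lhs, rhs)
```

Only the ratio of the two abilities enters the probability, which is why a constraint such as the abilities summing to 1 is needed to pin down their scale.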

Equation (1) requires that one treatment is always preferred over another for any pairwise comparison in the network. However, this can violate the TCC defined in Section 2.1, under which a comparison may also fail to justify a treatment preference. To accommodate ties (i.e., $\Pr \left(X=Y\right)$), following Davidson,Reference Davidson37 we assume that, within each comparison, the probability of a tie between two treatments $X$ and $Y$ is proportional to $\nu \sqrt{\psi_X{\psi}_Y}$. The quantity $\sqrt{\psi_X{\psi}_Y}$ is the geometric mean of ${\psi}_X$ and ${\psi}_Y$, while $\nu$ is a scalar nuisance parameter that describes the prevalence of ties in the network. Hence, the probability that $X$ outperforms $Y$ now becomes

$$\Pr \left(X>Y\right)=\frac{\psi_X}{\psi_X+{\psi}_Y+\nu \sqrt{\psi_X{\psi}_Y}} \tag{3}$$

and the probability that the two treatments are tied is

$$\Pr \left(X=Y\right)=\frac{\nu \sqrt{\psi_X{\psi}_Y}}{\psi_X+{\psi}_Y+\nu \sqrt{\psi_X{\psi}_Y}} \tag{4}$$

with ${\psi}_X\ge 0,\forall X\in \left\{1,2,\dots, T\right\},\nu >0$ and $\sum \nolimits_{i=1}^T{\psi}_i=1$ . Note that parametrizing the probability of a tie using Equation (4) offers the mathematical convenience that, for a fixed value of $\nu$ , the probability of a tie is maximized when ${\psi}_X={\psi}_Y$ . In other words, the probability of a tie depends only on the ratio of ${\psi}_X$ and ${\psi}_Y$ and is maximized between treatments with equal abilities. The mathematical proof of this is provided in the Supplementary Material. In this manuscript, the estimation process for the earlier model refers to the frequentist framework and relies on maximum likelihood theory.Reference McNamara, Elkins, Ferreira, Spencer and Herbert33, Reference Luce39 Fitting the model in the Bayesian setting is also possible and has been discussed elsewhere.Reference Hunter40, Reference Caron and Doucet41
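The behavior of Equations (3) and (4) can be explored numerically. The sketch below is illustrative only: the function name `davidson_probs` and all parameter values are assumptions introduced here. It returns the three outcome probabilities for one comparison and checks two properties stated above: the tie probability depends only on the ratio of the abilities, and for fixed $\nu$ it is largest when the abilities are equal, where it reduces to $\nu/(2+\nu)$.

```python
import math


def davidson_probs(psi_x, psi_y, nu):
    """Return (Pr(X>Y), Pr(X<Y), Pr(X=Y)) following Equations (3) and (4)."""
    tie_term = nu * math.sqrt(psi_x * psi_y)   # nu times the geometric mean
    denom = psi_x + psi_y + tie_term
    return psi_x / denom, psi_y / denom, tie_term / denom


nu = 1.0  # hypothetical value of the tie parameter

p_win, p_loss, p_tie = davidson_probs(0.7, 0.3, nu)
total = p_win + p_loss + p_tie                  # the three outcomes sum to 1

# Tie probability for increasingly unequal abilities (same nu).
tie_equal = davidson_probs(0.5, 0.5, nu)[2]     # equals nu / (2 + nu)
tie_moderate = davidson_probs(0.7, 0.3, nu)[2]
tie_extreme = davidson_probs(0.9, 0.1, nu)[2]

# Rescaling both abilities by the same factor leaves the result unchanged.
tie_rescaled = davidson_probs(7.0, 3.0, nu)[2]
```

Dividing the numerator and denominator of Equation (4) by $\sqrt{\psi_X\psi_Y}$ makes the dependence on the ratio explicit, which is what the rescaling check above reflects.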

Let ${r}_{XY}$ denote a variable that takes the value 1 if, based on the TCC, treatment $X$ is preferred over treatment $Y$ and 0 otherwise. Let also ${w}_{XY}$ be the tie variable that takes the value 1 if the TCC indicates that $X=Y$ ; otherwise it is equal to 0. Then, the log-likelihood function for the model described in Equations (3) and (4) can be written as

$$\begin{aligned}\mathrm{L}(\boldsymbol{\psi},\nu)=\sum\sum_{X \neq Y}\Bigg[&r_{XY}\log\!\left(\frac{\psi_X}{\psi_X + \psi_Y + \nu\sqrt{\psi_X\psi_Y}}\right)\\&+r_{YX}\log\!\left(\frac{\psi_Y}{\psi_X + \psi_Y + \nu\sqrt{\psi_X\psi_Y}}\right)\\&+w_{XY}\log\!\left(\frac{\nu\sqrt{\psi_X\psi_Y}}{\psi_X + \psi_Y + \nu\sqrt{\psi_X\psi_Y}}\right)\Bigg]\end{aligned} \tag{5}$$

with $\sum \nolimits_{i=1}^T{\psi}_i=1$ and $\nu>0$. Maximizing the multinomial log-likelihood in Equation (5) yields the MLEs of the ability parameters $\boldsymbol{\psi}$. The asymptotic distribution of $\widehat{\boldsymbol{\psi}}$ is a multivariate normal distribution with mean $\boldsymbol{\psi}$ and variance–covariance matrix $\boldsymbol{\Sigma}^{-1}$, obtained as the inverse of the negative Hessian matrix $\boldsymbol{\Sigma}$ (the observed information matrix). The elements of $\boldsymbol{\Sigma}$ are the negatives of the second partial derivatives of the log-likelihood in Equation (5). Finally, note that Equation (5) can yield ability estimates $\widehat{\boldsymbol{\psi}}$ only when treatment preferences are identified from the TCC. In other words, the proposed methodology cannot estimate any treatment hierarchy if only ties, and thus no clinically important NMA estimates, are obtained from the TCC.
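To make the estimation step concrete, the sketch below maximizes the log-likelihood of Equation (5) for a small hypothetical network of four treatments. It is a minimal illustration under several assumptions and is not the algorithm implemented in mtrank: plain gradient ascent is swapped in for the optimizer, the model is reparameterized as $\lambda_X=\log \psi_X$ and $\eta=\log \nu$ so that positivity holds automatically, each pair of treatments is counted once, and standard errors are not computed. The function name `fit_davidson` and the preference data are invented for the example.

```python
import math


def fit_davidson(treatments, outcomes, n_iter=5000, step=0.1):
    """Fit the Bradley-Terry model with ties by gradient ascent.

    outcomes: one (x, y, result) triple per pair of treatments, with
    result in {'X>Y', 'X<Y', 'X=Y'} as produced by a TCC.
    Returns (abilities normalized to sum to 1, estimate of nu).
    """
    lam = {t: 0.0 for t in treatments}  # lam[t] = log(psi_t)
    eta = 0.0                           # eta = log(nu)
    for _ in range(n_iter):
        grad = {t: 0.0 for t in treatments}
        grad_eta = 0.0
        for x, y, result in outcomes:
            a, b = math.exp(lam[x]), math.exp(lam[y])
            tie = math.exp(eta) * math.sqrt(a * b)
            denom = a + b + tie
            p_x, p_y, p_tie = a / denom, b / denom, tie / denom
            obs_x = 1.0 if result == "X>Y" else 0.0
            obs_y = 1.0 if result == "X<Y" else 0.0
            obs_tie = 1.0 if result == "X=Y" else 0.0
            # Score equations: observed minus expected "points",
            # where a tie counts as half a preference for each side.
            grad[x] += (obs_x + 0.5 * obs_tie) - (p_x + 0.5 * p_tie)
            grad[y] += (obs_y + 0.5 * obs_tie) - (p_y + 0.5 * p_tie)
            grad_eta += obs_tie - p_tie
        for t in treatments:
            lam[t] += step * grad[t]
        eta += step * grad_eta
    total = sum(math.exp(v) for v in lam.values())
    psi = {t: math.exp(v) / total for t, v in lam.items()}
    return psi, math.exp(eta)


# Hypothetical TCC output for all six comparisons among four treatments.
treatments = ["A", "B", "C", "D"]
outcomes = [
    ("A", "B", "X=Y"), ("A", "C", "X>Y"), ("A", "D", "X>Y"),
    ("B", "C", "X=Y"), ("B", "D", "X>Y"), ("C", "D", "X=Y"),
]
psi_hat, nu_hat = fit_davidson(treatments, outcomes)
ranking = sorted(treatments, key=psi_hat.get, reverse=True)
print(ranking, psi_hat, nu_hat)
```

The example data contain both preferences and ties. If every comparison were a tie, the abilities would all be equal and $\nu$ would grow without bound, which mirrors the statement above that no hierarchy can be estimated in that case.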

2.2.2 Absolute and relative treatment abilities

Maximizing Equation (5) in terms of $\boldsymbol{\psi}$ refers to an optimization problem constrained to the region $\left\{{\psi}_X\ge 0,\sum \nolimits_{i=1}^T{\psi}_i=1\right\}$. This constraint prevents negative estimates of the ability parameters and guarantees that the optimization problem remains identifiable. The resulting ${\widehat{\psi}}_X$ then represent the estimated absolute ability of each treatment in the network. However, as also noted elsewhere,Reference Bradley and Terry34 the scale of the absolute ability estimates is immaterial; what matters here is the relative comparison between abilities. To address this issue, we construct an artificial reference treatment groupReference Firth, Kosmidis and Turner35 $T+1$, with ability equal to the average of the absolute ability estimates across all the $T$ treatments. That is, we set the ability of treatment $T+1$ to ${\psi}_{T+1}=\frac{\sum \nolimits_{i=1}^T{\widehat{\psi}}_i}{T}$. Then, the ranking results are presented in terms of the ability ratios $\frac{{\widehat{\psi}}_X}{\psi_{T+1}}\ \forall X\in \left\{1,2,\dots, T\right\}$.

The final estimates ${\widehat{\psi}}_X$ do not necessarily satisfy $\sum \nolimits_{i=1}^T{\psi}_i=1$ , as the renormalization of the vector $\boldsymbol{\psi}$ is not needed after each iteration of the iterative process.Reference Luce39 However, based on Luce’s axiom of choice,Reference Davidson37, Reference Luce38 we can renormalize the absolute ability estimates as ${\widehat{\pi}}_X=\frac{{\widehat{\psi}}_X\;}{\sum \nolimits_{i=1}^T{\widehat{\psi}}_i\;}$ . This allows interpreting ${\widehat{\pi}}_X$ as the probability that each treatment $X\in \left\{1,2,\dots, T\right\}$ has the largest true ability to yield clinically important and beneficial treatment effects, with respect to the TCC, among all the $T$ treatments in the network. This additional probabilistic ranking metric, ${\widehat{\pi}}_X$ , offers a straightforward interpretation, but it does not account for the uncertainty of the ability estimates ${\widehat{\psi}}_X.$ Therefore, we propose ${\widehat{\pi}}_X$ be presented alongside the ability estimates ${\widehat{\psi}}_X$ , particularly when these estimates are derived with similar levels of uncertainty in the top positions of the ranking list.
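The two presentation scales described in this section follow directly from the absolute ability estimates. The short Python sketch below uses hypothetical ability values and hypothetical function names to compute the ability ratios relative to the artificial average treatment $T+1$ and the normalized abilities ${\widehat{\pi}}_X$.

```python
def relative_abilities(psi_hat):
    """Ability of each treatment divided by the average ability (treatment T+1)."""
    mean_ability = sum(psi_hat.values()) / len(psi_hat)
    return {t: v / mean_ability for t, v in psi_hat.items()}


def normalized_abilities(psi_hat):
    """Abilities rescaled to sum to 1, interpreted via Luce's axiom of choice."""
    total = sum(psi_hat.values())
    return {t: v / total for t, v in psi_hat.items()}


# Hypothetical absolute ability estimates (not from the article's datasets).
psi_hat = {"A": 2.0, "B": 1.0, "C": 0.5, "D": 0.5}

ratios = relative_abilities(psi_hat)     # values above 1: better than average
pi_hat = normalized_abilities(psi_hat)   # sums to 1 across treatments
print(ratios)
print(pi_hat)
```

Both quantities are rescalings of the same estimates (each ratio equals $T$ times the corresponding ${\widehat{\pi}}_X$), so they always produce the same ordering of treatments; they differ only in interpretation.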

3 A qualitative comparison between the new ranking metric and other existing approaches

Table 1 summarizes the principal similarities and differences between the newly proposed and existing common ranking metrics, each addressing a different treatment hierarchy question. The interpretation of results obtained from our new method does not directly allow for probabilistic statements, in contrast to the existing ranking metrics that yield inherently probabilistic quantities and thus permit a more straightforward interpretation. Within the framework of the proposed approach, the scale of the ability estimates is irrelevant; the interpretation of ${\widehat{\psi}}_X$ arises solely from its relative comparison with the corresponding ability estimates of other treatments in the network.

Table 1 A summary of the characteristics across different ranking methods

A probabilistic interpretation can, however, be derived through the normalized abilities ${\widehat{\pi}}_X$, which estimate the probability that each treatment has the largest true ability under the prespecified TCC. Since ${\widehat{\pi}}_X$ does not incorporate the uncertainty associated with ${\widehat{\psi}}_X$, we recommend it as a supplementary metric only; the primary metric of the proposed method should always be ${\widehat{\psi}}_X$. It is further noted that the interpretation of ${\widehat{\pi}}_X$ is analogous, though not equivalent, to that of ${p}_{BV}$: while ${\widehat{\pi}}_X$ pertains to the probability that treatment $X$ exhibits the greatest true ability under the defined TCC, ${p}_{BV}$ relates to the underlying true treatment effects without incorporating a TCC. For example, a treatment $X$ yielding large yet clinically unimportant effects may attain a high ${p}_{BV}$ but a low ${\widehat{\pi}}_X$.

An additional distinction between the proposed method and other ranking approaches lies in whether, and how, they account for the clinical importance of the NMA estimates and the extent to which this consideration influences the resulting treatment hierarchy. The proposed method explicitly distinguishes clinically important from negligible NMA treatment effects by applying a predefined TCC to the NMA estimates. Subsequently it estimates ${\widehat{\psi}}_X$ as a measure of the overall propensity of treatment $X$ to fulfil the TCC when compared to the rest of the treatments in the network. In the context of the TCC described in Section 2.1, the choice of the SWD can affect the final treatment hierarchy, as the propensity of each treatment to fulfil the TCC may vary depending on the selected SWD. For instance, a treatment with modest efficacy may yield treatment effects that satisfy the TCC for small SWD values, but fail to do so if larger thresholds are deemed clinically meaningful in the decision-making process.

Clinical importance is treated differently within the P-scores (SWD) framework, where the SWD is not used to define a TCC but is instead applied directly to the NMA estimates. This approach is based on the probabilities that the NMA estimates exceed the SWD, and consequently, variations in the SWD influence these probabilities only numerically, without altering the resulting treatment hierarchy. The same holds when comparing the hierarchy obtained from the standard P-scores approach, which does not incorporate clinical importance, with that derived from the P-scores (SWD) method as both of these methods are expected to always yield the same treatment hierarchy.

The final principal distinction among the different methods concerns the availability of the ranking list. The newly proposed method does not produce a treatment hierarchy when only ties are identified based on the TCC. This feature relates only to the new approach and indicates that a treatment ranking cannot be established in the absence of clinically important NMA estimates. In contrast, the remaining ranking metrics always yield a treatment hierarchy. Although these approaches do not explicitly incorporate a TCC in their computation, we recommend that meta-analysts interpret ranking with caution when clinically important NMA estimates are lacking.

4 Applications

We illustrate the use of our treatment ranking method and compare it with existing ranking approaches using two published networks. The first compares the efficacy of several antidepressants for major depressionReference Cipriani, Furukawa and Salanti20 and the second compares different antihypertensive treatment classes and placebo for the incidence of diabetes.Reference Elliott and Meyer21 We compared five ranking approaches: (a) P-scores,Reference Rücker and Schwarzer4 (b) P-scores “adjusted” for the SWD,Reference Mavridis, Porcher, Nikolakopoulou, Salanti and Ravaud9 (c) the PReTA-ranking,Reference Nikolakopoulou, Mavridis, Chiocchia, Papakonstantinou, Furukawa and Salanti7 (d) the ranking according to ${p}_{BV}$ in the frequentist setting, and (e) the estimated treatment abilities from our ranking approach. All ranking metrics were calculated based on a random-effects NMA model. To conduct the analysis, we used R version 4.4.1 (2024-06-14) and the R package netmetaReference Davidson and Solomon42 to fit the NMA models. To facilitate the use of our proposed approach, we have created the R package mtrank,Reference Balduzzi, Rücker and Nikolakopoulou43 which is available on CRAN.

4.1 Antidepressants for major depression

This network comprises 179 trials comparing 18 antidepressant drugs (Figure 2a). The primary outcome is response to treatment defined as a 50% or greater reduction in a depression symptom scale between baseline and 8 weeks of follow-up. The outcome is measured as odds ratios (OR).

Figure 2 Network plots for the two clinical examples. (a) The network of antidepressants and (b) the network of antihypertensive treatments. ACE, angiotensin-converting enzyme inhibitors; ARB, angiotensin receptor blockers; CCB, calcium channel blocker; BBlocker, beta blocker.

The results for methods (a) to (d) are presented in Table 2, alongside the respective NMA estimates of all treatments versus Trazodone. In this network, large treatment effect values indicate beneficial effects. A consensus is observed in terms of the best treatment for the P-scores and the ${p}_{BV}$ which rank Vortioxetine first, while using the PReTA-ranking Escitalopram is placed at the first position and Vortioxetine second. Results in terms of median ranks are available in Table 1 in the Supplementary Material. These results show that Vortioxetine, Escitalopram, and Bupropion occupy the top three positions, though there is considerable uncertainty. The median ranks and 95% CIs were 1 [1, 15], 3 [1, 10], and 3 [1, 15], respectively.

Table 2 Ranking metrics for the network of antidepressants. Treatments with the top three values for each respective metric are shown in bold. The “Treatment” column is ordered according to P-scores

The NMA treatment effect estimates for the comparison of each treatment versus Trazodone are also shown in Figure 3a. Overall, all NMA treatment effect estimates favor the other treatments over Trazodone. Vortioxetine has the largest treatment effect and ranks first, but it also has the largest standard error. When using ${p}_{BV}$ , the ranking does not fully account for the uncertainty in treatment effect estimates. This explains why Vortioxetine appears to be clearly the best treatment according to  ${p}_{BV}$ .

Figure 3 Forest plots with results for the network of antidepressants. (a) The summary odds ratios obtained assuming Trazodone as the reference treatment group. (b) The ranking results obtained using the proposed methodology.

Following the original publication,Reference Cipriani, Furukawa and Salanti20 we assume an SWD equal to 1.20. Using SWD-adjusted P-scores, Vortioxetine was ranked at the top position and clearly higher than Bupropion, which is at the second position. The differences between unadjusted and SWD-adjusted P-scores can be attributed to the increased emphasis that the latter approach puts on the magnitude of the NMA estimates. Note that the adjusted P-scores approach affects only the numerical values of the unadjusted P-scores and is generally not expected to alter the treatment hierarchy. Overall, the differences across the different hierarchies may be explained by the substantial variation of the standard errors across the NMA estimates, which range from 0.07 to 0.33. The full distribution of the standard errors across all NMA estimates is depicted in Figure 1 in the Supplementary Material.

Setting again an SWD of 1.20, we obtain the respective ROE that ranges from 0.83 to 1.20. Then, we applied the TCC of Section 2.1 to transform the 153 NMA estimates into treatment preferences. A high prevalence of ties was observed in the network, as only 32% of all comparisons yielded clinically important NMA estimates according to the defined TCC. The log-ability estimates (i.e., $\log \left({\widehat{\psi}}_X\right)$) are shown in Figure 3b, while the normalized ability estimates ${\widehat{\pi}}_X$ are shown in Table 2. Overall, within the context of the predefined TCC, Escitalopram demonstrated the highest ability to fulfil the TCC and yield beneficial treatment effect estimates, followed by Vortioxetine and Bupropion which are tied at the second position. In addition to these three treatments, Amitriptyline, Mirtazapine, and Agomelatine were also found to have significantly greater abilities to yield clinically important effects than the average treatment in the network. Finally, we conducted a sensitivity analysis regarding the definition of the SWD, progressively increasing it by 0.10 increments from the value of 1.10 up to 1.50. The results are shown in Figure 4, where to improve visibility we presented the results only in terms of the normalized abilities ${\widehat{\pi}}_X$ associated with the first six treatments, as per the primary analysis. This sensitivity analysis indicated that if smaller treatment effects are of interest (i.e., SWD ≤1.20), then Escitalopram outperforms the other treatments. However, as the SWD increases, meaning that larger treatment effects are of interest, Vortioxetine demonstrates a greater ability to yield clinically important treatment effect estimates compared to all other treatments in the network.
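The arithmetic behind these numbers can be checked with a few lines. The sketch below derives the ROE from the SWD and its reciprocal, counts the pairwise comparisons among 18 treatments, and builds the SWD grid of the sensitivity analysis. The helper name `roe` is hypothetical.

```python
from math import comb


def roe(swd):
    """Range of equivalence on the odds ratio scale: [1/SWD, SWD]."""
    return (1.0 / swd, swd)


# Bounds of the ROE for an SWD of 1.20 (about 0.83 to 1.20).
lower, upper = roe(1.20)

# Number of pairwise NMA estimates among 18 treatments.
n_comparisons = comb(18, 2)

# SWD values for the sensitivity analysis: 1.10 to 1.50 in steps of 0.10.
swd_grid = [round(1.10 + 0.10 * k, 2) for k in range(5)]
roe_grid = [roe(swd) for swd in swd_grid]
```

As the SWD grows, the ROE widens on both sides, so fewer NMA estimates can fall outside it.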

Figure 4 Sensitivity analysis for the network of antidepressants. The y-axis represents the probability of each treatment having the highest true ability and the x-axis the different SWD values.

Overall, in this example, some disagreements were observed among the treatment hierarchies obtained from the different methods. In principle, none of these hierarchies is invalid. In such cases, meta-analysts should carefully consider their research question and choose the method that best aligns with it.Reference Salanti, Nikolakopoulou, Efthimiou, Mavridis, Egger and White6 The final decision on which method to use depends on whether clinical importance and uncertainty should be incorporated into the treatment ranking. If accounting for clinical importance is a priority, then either the proposed method or the P-scores (SWD) should be prioritized, as the other available metrics do not adjust their results accordingly. Conversely, if clinical importance is not a primary concern, any of the remaining ranking metrics could be used. The final choice should then depend on whether the uncertainty of the NMA estimates should be incorporated when producing the treatment hierarchy.

4.2 Antihypertensive treatments and the incidence of diabetes

This network consists of 22 trials comparing five classes of antihypertensive treatments and placebo for the incidence of diabetes.Reference Elliott and Meyer21 This is a very well-connected network with 14 of the 15 possible direct comparisons being observed (Figure 2b). The primary outcome is the proportion of patients who developed diabetes and the NMA estimates using placebo as reference can be found in Figure 5a. The outcome is again measured as odds ratios (OR).

Figure 5 Forest plots with results for the network of antihypertensive treatments. (a) The summary odds ratios obtained assuming placebo as the reference treatment group. (b) The ranking results obtained using the proposed methodology. ACE, angiotensin-converting enzyme inhibitors; ARB, angiotensin receptor blockers; CCB, calcium channel blocker; BBlocker, beta blocker.

We consider again an SWD equal to 1.20Reference Papakonstantinou, Nikolakopoulou, Higgins, Egger and Salanti26 and the respective ROE ranging from 0.83 to 1.20. The ranking results obtained from the approaches (a) to (d) can be found in Table 3, along with the NMA estimates of all treatments versus placebo, while the respective results in terms of median ranks are available in Table 2 in the Supplementary Material. In this network, small treatment effect values indicate beneficial effects. The results in terms of the estimated treatment abilities are depicted in Figure 5b, while the normalized ability estimates ${\widehat{\pi}}_X$ are shown in Table 3.

Table 3 Ranking metrics for the network of antihypertensive drugs. Treatments with the top three values for each respective metric are shown in bold. The “Treatment” column is ordered according to P-scores

Based on the NMA estimates, ARB showed the most beneficial treatment effect, closely followed by ACE, which had a similar estimate in both magnitude and precision. Regarding the other ranking metrics, there is complete agreement across all five approaches, with ARB consistently ranked first. Notably, the TCC in this network indicated that 63% of all NMA estimates yielded a treatment preference. This perfect agreement among ranking methods can likely be attributed to the low uncertainty in the treatment effect estimates. Specifically, the standard errors of the NMA estimates range from 0.07 to 0.10 (Figure 3 in the Supplementary Material). Finally, to assess the robustness of the estimated rankings with respect to the definition of the TCC, we performed a sensitivity analysis, progressively increasing the SWD in 0.10 increments from the recommended value of 1.20 up to 1.50. The results are shown in Figure 6. Overall, this sensitivity analysis showed that ARB and ACE remained the top two treatments across the different SWD values.

Figure 6 Sensitivity analysis for the network of the antihypertensive drugs. The y-axis represents the probability of each treatment having the highest true ability and the x-axis the different SWD values.

5 Empirical investigation across 153 published networks

5.1 Database

We studied the agreement across different ranking metrics by reanalyzing networks from a database of published NMAs between 1999 and 2015, which included at least four treatments. To access these data, we used the R package nmadb.Reference Evrenoglou and Schwarzer44 More details about this database can be found in the original publications.Reference Petropoulou, Nikolakopoulou and Veroniki22, Reference Nikolakopoulou, Chaimani, Veroniki, Vasiliadis, Schmid and Salanti23 In this database, 267 datasets were identified with available data. Given that there was no information regarding the SWD across these 267 networks, we used the recommendations from previous publications, which suggested that a common choice for the SWD in the case of the risk ratio (RR) would be a value of 1.25.Reference Guyatt, Oxman and Kunz27, Reference Papakonstantinou45 We therefore further restricted the database to include only networks with a binary outcome of interest. This yielded a database of 186 networks. After reanalyzing these 186 networks, we obtained results from 174 networks, as 12 networks from nmadbReference Evrenoglou and Schwarzer44 had incompatible data that did not allow an NMA model to be fitted. Finally, applying the proposed ranking method to the set of 174 networks further restricted the networks with results to 153, as in the remaining 21 networks only ties were identified by the TCC. The NMA estimates and network geometries of these 21 networks are available in Figures 3–23 in the Supplementary Material.

5.2 Evaluated methods and performance metrics

We evaluated the agreement of the five methods presented in Section 4 in the context of a random-effects NMA model. This resulted in a total of 10 pairwise agreement comparisons between the different ranking metrics. Agreement was measured using Pearson’s correlation coefficient, indicating the agreement in the ranking values obtained by each of the different ranking metrics. In other words, we investigated whether larger values in one ranking metric also corresponded to larger values in the other ranking metrics. This approach slightly deviates from previous works,Reference Chiocchia, Nikolakopoulou, Papakonstantinou, Egger and Salanti5, Reference Nikolakopoulou, Mavridis, Chiocchia, Papakonstantinou, Furukawa and Salanti7 which studied agreement between different ranking methods by investigating the agreement in the treatment order of the ranking list. This was not straightforward in our case, as the five methods of interest present the final treatment order in different ways (i.e., allowing for tied positions or always yielding an explicit order). Finally, we further investigated how the precision of the NMA estimates, as a measure of the total amount of information in the network, impacts the agreement between the proposed ability-based metric and the other ranking metrics. To this end, following Chiocchia et al.,Reference Chiocchia, Nikolakopoulou, Papakonstantinou, Egger and Salanti5 we contrasted the correlation coefficients from each of the 153 networks with the following measures:

  i. the average variance across the $\binom{T}{2}$ NMA estimates $\widehat{\theta}$,

  ii. the relative range of variances, defined as $\frac{\max \left\{\mathrm{var}\left(\widehat{\theta}\right)\right\}-\min \left\{\mathrm{var}\left(\widehat{\theta}\right)\right\}}{\max \left\{\mathrm{var}\left(\widehat{\theta}\right)\right\}}$.
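The agreement measure and the two precision measures above can be sketched as follows. This is a minimal illustration rather than the code used for the empirical study; all function names are hypothetical, and the inputs would be the ranking values and the variances of the NMA estimates from a single network.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson's correlation coefficient between two lists of ranking values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_variance(variances):
    """Measure (i): mean variance across all pairwise NMA estimates."""
    return sum(variances) / len(variances)


def relative_range(variances):
    """Measure (ii): (max - min) / max of the variances."""
    return (max(variances) - min(variances)) / max(variances)


# Five ranking metrics give 10 pairwise agreement comparisons.
methods = ["P-scores", "P-scores (SWD)", "PReTA", "p_BV", "ability"]
pairs = list(combinations(methods, 2))
```

A relative range close to zero indicates that all NMA estimates in a network have similar precision, whereas a value close to one indicates that at least one estimate is much more precise than another.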

5.3 Results

The results regarding the median correlation and the interquartile range (IQR) of correlations across the 153 networks are presented in Table 4. Overall, the proposed ability-based ranking metric was found to be strongly correlated with most other ranking metrics, as the median correlation coefficient was typically above 0.90. A similarly high level of agreement was observed among most of the alternative ranking methods. It is worth noting that the strong agreement between the P-scores, P-scores (SWD), and PReTA metrics was expected, given that P-scores (SWD) and PReTA are essentially variations of the standard P-scores approach. Finally, the agreement between ${p}_{BV}$ and the proposed method was generally moderate. The same applies to the agreement between ${p}_{BV}$ and the rest of the evaluated methods, with the correlation becoming stronger primarily when ${p}_{BV}$ was compared to the P-scores adjusted for SWD.

Table 4 Pairwise agreement between the different ranking metrics, measured by the median Pearson’s correlation coefficient and the interquartile range of values obtained across 153 published NMAs

Figure 7 shows the results regarding the impact of uncertainty in the NMA estimates on the agreement between the ability-based metric and the other ranking metrics. Overall, the results indicate that this agreement depends on the level of uncertainty in the NMA estimates, with greater agreement observed in networks where estimates have higher precision and similar levels of uncertainty. In panel (a), the different correlation coefficients were plotted against the average variance of the NMA estimates, which were log-transformed to enhance visibility. The overall trend suggests that as the average variance of the NMA estimates increases, the correlation between the ability-based metric and the other ranking metrics decreases. In panel (b), the correlation coefficients were plotted against the relative range of variances. Following previous studies,Reference Chiocchia, Nikolakopoulou, Papakonstantinou, Egger and Salanti5 the x-axis values were transformed using the double logarithm of the inverse relative range, so that values on the left-hand side indicate a larger variance range. These results showed that as the range of variances across the NMA estimates decreases, the agreement between the ability-based metric and the other ranking approaches increases. In other words, greater agreement is achieved in networks where the NMA treatment effects are estimated with similar levels of uncertainty. This is in line with previous empirical results that evaluated the rest of the approaches in terms of the same metrics.Reference Chiocchia, Nikolakopoulou, Papakonstantinou, Egger and Salanti5, Reference Nikolakopoulou, Mavridis, Chiocchia, Papakonstantinou, Furukawa and Salanti7
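One plausible reading of the x-axis transformation in panel (b) is $\log \left(\log \left(1/r\right)\right)$, where $r$ is the relative range of variances. The sketch below implements this reading as an assumption; the function name is hypothetical and the exact transformation used for Figure 7 may differ.

```python
import math


def double_log_inverse(rel_range):
    """Assumed transformation: log(log(1 / relative range)).

    Defined only for 0 < rel_range < 1, so that log(1 / rel_range) > 0.
    """
    if not 0.0 < rel_range < 1.0:
        raise ValueError("relative range must lie strictly between 0 and 1")
    return math.log(math.log(1.0 / rel_range))


# A wide variance range maps to a smaller (more leftward) value
# than a narrow variance range.
wide = double_log_inverse(0.9)
narrow = double_log_inverse(0.1)
```

Under this reading, networks with a larger variance range fall on the left-hand side of the axis, in line with the description of panel (b).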

Figure 7 Scatter plots contrasting the correlation between the ability-based metric and the other ranking metrics across 153 networks. (a) The correlations plotted against the average variance of the NMA estimates; values on the left-hand side of the graph indicate greater precision. (b) The correlations plotted against the relative range of variances of the NMA estimates; values on the left-hand side of the graph indicate a larger variance range. In all scatter plots, the purple line represents a cubic smoothing spline with five degrees of freedom.

6 Discussion

In this article, we introduce a novel framework for producing treatment hierarchies in NMA through a probabilistic ranking model that accounts for a predefined TCC. The rationale behind the proposed ranking method differs from existing approaches, as it combines the NMA estimates with a concrete TCC or, in other words, a decision rule into a treatment hierarchy, whereas existing methods translate NMA estimates directly into rankings.

Our approach follows the principles of a typical decision-making process where a concrete decision rule is applied to the available evidence to translate the numerical results into practice.Reference Nikolakopoulou, Higgins and Papakonstantinou24, Reference Brignardello-Petersen, Murad and Walter28, Reference Brignardello-Petersen, Florez and Izcovich29, Reference Ades, Davies and Phillippo46, Reference Zeng, Brignardello-Petersen and Hultcrantz47 We start by applying the predefined TCC to the NMA relative treatment effects, transforming them into treatment preference data. We propose as a clinically relevant TCC the ROE between two treatments that represents the area within which their relative effect lacks indication of a treatment preference.Reference Nikolakopoulou, Higgins and Papakonstantinou24, Reference Papakonstantinou, Nikolakopoulou, Higgins, Egger and Salanti26 Following previous work, we define the ROE using the SWD and its reciprocal (or opposite) value.Reference Nikolakopoulou, Higgins and Papakonstantinou24, Reference Papakonstantinou, Nikolakopoulou, Higgins, Egger and Salanti26 Here, we propose a simple way for defining an ROE-based TCC based on the magnitude of the NMA treatment effect and its uncertainty. However, any TCC considered appropriate and clinically relevant can be used by investigators to produce preference data.

We parameterize our model to estimate the ability of each treatment to outperform the other treatments in the networkReference McNamara, Elkins, Ferreira, Spencer and Herbert33, Reference Bradley and Terry34, Reference Turner, van Etten, Firth and Kosmidis36; that is, a latent characteristic referring to the propensity of each treatment in the network to yield clinically important and beneficial true treatment effects in the context of the defined TCC. Consequently, treatments with larger ability estimates correspond to higher positions in the final ranking. Confidence intervals can also be placed next to the ability estimates to represent the uncertainty around the ranking metric. This should not be confused with other metrics proposed to evaluate the uncertainty of the treatment hierarchy.Reference Phillippo, Dias, Welton, Caldwell, Taske and Ades48 Furthermore, the interpretation of the ability estimates also stems from their transformation into probabilities using Luce’s axiom of choice.Reference Davidson37, Reference Luce38 Model diagnostics were recently developed and can also be investigated in cases where no ties are identified from the TCC.Reference Wigle, Béliveau and Salanti49 However, these have not yet been extended to allow for ties. Overall, our method aims to produce clinically relevant treatment hierarchies accompanied by uncertainty measures. Of course, the proposed ranking method, like all existing ranking metrics, is not a substitute for the NMA relative effects; instead, it can be used to assist decision making and treatment recommendations.
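To illustrate how abilities translate into preference probabilities, the sketch below uses the standard Davidson extension of the Bradley–Terry model, in which a single parameter governs the prevalence of ties. Treating this as the form of the model is an assumption based on the works cited above; the function name, the symbol `nu`, and the input values are hypothetical.

```python
import math


def davidson_probabilities(psi_x, psi_y, nu):
    """Preference probabilities for X vs. Y under a Davidson-type model.

    psi_x, psi_y: positive ability parameters of the two treatments.
    nu: non-negative tie parameter (nu = 0 gives the Bradley-Terry model).
    """
    tie_term = nu * math.sqrt(psi_x * psi_y)
    denom = psi_x + psi_y + tie_term
    return {
        "X": psi_x / denom,      # X is preferred over Y
        "Y": psi_y / denom,      # Y is preferred over X
        "tie": tie_term / denom  # no preference
    }


# Hypothetical abilities: X has twice the ability of Y.
with_ties = davidson_probabilities(2.0, 1.0, nu=1.0)
no_ties = davidson_probabilities(2.0, 1.0, nu=0.0)
```

With `nu = 0` the tie probability vanishes and the probability that X is preferred reduces to the ratio of its ability to the sum of both abilities.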

Establishing a direct one-to-one mathematical relationship between the true ability of each treatment and the true treatment effects is challenging, as the former is a latent characteristic dependent on the TCC, while the latter is a fixed unknown parameter. This complicates the design of simulation studies, which typically begin by defining true treatment effects. However, this challenge is not unique to our method but applies broadly to treatment ranking in NMA, as the scope of the existing ranking metrics is to summarize evidence based on NMA estimates and they cannot be calculated directly from true treatment effect values.

We used two published networks to assess the properties of our method and compare it with existing approaches. The network of antidepressantsReference Cipriani, Furukawa and Salanti20 represents an extreme case as the treatment ranked highest in terms of effect size (Vortioxetine) yielded the least precise NMA estimates. Using a TCC defined according to the SWD reported in the original publication,Reference Cipriani, Furukawa and Salanti20 our method produced more conservative results than the other methods, particularly regarding Vortioxetine’s position in the ranking. In a sensitivity analysis where we progressively increased the SWD, Vortioxetine moved to the top of the treatment hierarchy, reflecting its larger NMA estimate relative to other treatments. In the second network of antihypertensive treatments,Reference Elliott and Meyer21 we found a perfect agreement in the final ranking across all approaches. This agreement can be partly attributed to the high precision and narrow variance range of the NMA estimates.

We further explored the performance of the proposed framework and other common ranking metrics through a reanalysis of 153 published networks obtained from a published database,Reference Petropoulou, Nikolakopoulou and Veroniki22, Reference Nikolakopoulou, Chaimani, Veroniki, Vasiliadis, Schmid and Salanti23 accessed via the R package nmadb.Reference Evrenoglou and Schwarzer44 This empirical study showed strong agreement among most of the evaluated ranking metrics, except for ${p}_{BV}$ , which exhibited only moderate agreement with the others. We also investigated how the total amount of information in a network, expressed as the uncertainty in NMA estimates, affects the agreement between the proposed ability-based metric and the other methods. The results indicated that agreement depends on the level of uncertainty: greater agreement was observed in networks where NMA estimates had higher precision and similar levels of uncertainty across treatments.

We see several advantages of our proposed treatment ranking approach. First, the requirement of a priori defining a concrete TCC enables researchers to consider early on what constitutes a preferred treatment. In our approach, we estimate the treatment ability using maximum likelihood theory, thereby allowing us to obtain the standard error of the estimated abilities and to make inferences about the uncertainty of ranking positions using standard statistical measures. In addition, the proposed model does not provide treatment ability estimates when all the NMA treatment effect estimates indicate ties, because the model then fails to converge. Although this might be considered a drawback of the model, we also see it as a way of preventing researchers from making ranking statements in the absence of sufficient evidence that the NMA estimates fulfil the TCC. This is in line with previous NMA recommendations for avoiding the presentation of ranking results in the presence of large uncertainty in the relative effects.Reference Chiocchia, Nikolakopoulou, Papakonstantinou, Egger and Salanti5

Overall, in the presence of unstable conditions (e.g., sparse networks, rare events, etc.) NMA estimates are often biased and imprecise,Reference Wu, Niezink and Junker50–Reference Evrenoglou, Metelli and Thomas52 thereby undermining the validity of any ranking method. When estimates are highly uncertain, the TCC defined in Section 2.1 is likely to fail to identify clinically meaningful effects, serving as a safeguard against presenting treatment hierarchies based on weak or unreliable evidence. This property was demonstrated in Section 5, where the TCC identified only ties across 21 sparse networks. Ultimately, the validity of any ranking method depends critically on the robustness of the underlying NMA estimates; hence, when the estimation of treatment effects is unstable, we recommend that researchers be particularly cautious when presenting and interpreting ranking results.

Despite these advantages, our approach is not free of limitations. Probably the most important limitation relates to the definition of the SWD and of the respective ROE, which involves some subjectivity.Reference Copay, Subach, Glassman, Polly and Schuler30 On the other hand, though, the use of different ROEs allows researchers to estimate the treatment hierarchy under different settings (e.g., for different patient profiles). Ways to mitigate this inherent subjectivity have been suggested in the literature through fully statistical approachesReference Sahker, Furukawa and Luo25 or by incorporating information from patients.Reference Sahker, Furukawa and Luo25 Moreover, investigators conducting NMAs may choose to define another TCC not based on the ROE. To avoid data-driven decisions, we recommend that meta-analysts using our ranking method define and justify the TCC they plan to use in their protocol and investigate the robustness of the estimated hierarchy under different SWD values.

Our proposed framework offers a novel alternative to existing ranking metrics for estimating treatment hierarchies in NMA. The importance of a well-defined treatment hierarchy question prior to estimating treatment ranking has been highlighted recently.Reference Salanti, Nikolakopoulou, Efthimiou, Mavridis, Egger and White6 To our knowledge, this is the first approach that explicitly and quantitatively incorporates considerations on the treatment hierarchy question through the predefined TCC. Future extensions of the proposed approach could include adapting the model to account for treatment-level characteristics (e.g., treatment cost) and multiple outcomes. The former is currently possible only in cases where no ties are allowed from the TCC.Reference Fienberg and Larntz54 Overall, investigators can use the proposed approach either as their primary ranking tool or as a sensitivity analysis alongside conventional ranking metrics, particularly for networks with increased uncertainty in their relative effects and when a clinically relevant TCC is known.

Acknowledgments

We thank Prof. Georgia Salanti and the two anonymous reviewers for their valuable comments on earlier versions of the manuscript.

Author contributions

Analysis: T.E., G.S.; Conceptualization: T.E., A.C.; Software: T.E., G.S.; Writing—original draft: T.E., A.C.; Writing—review and editing: T.E., A.N., G.R., G.S., A.C.

Competing Interest Statement

The authors declare none.

Data availability statement

The full code and data to reproduce the results of the two illustrative examples and the empirical study are freely available on Zenodo using the following link: https://doi.org/10.5281/zenodo.18171269.

Funding statement

T.E. received funding from the French National Research Agency under the project ANR-22-CE36-0013-01 and from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under the Project ID-554095932. A.N. was supported by DFG—grant number NI 2226/1-1 and Project-ID 499552394 – SFB 1597.

Supplementary material

To view supplementary material for this article, please visit http://doi.org/10.1017/rsm.2026.10071.

Footnotes

This article was awarded Open Data and Open Materials badges for transparent practices. See the Data availability statement for details.

References

Salanti, G, Ades, AE, Ioannidis, JPA. Graphical methods and numerical summaries for presenting results from multiple-treatment meta-analysis: an overview and tutorial. J Clin Epidemiol. 2011;64(2):163171. https://doi.org/10.1016/j.jclinepi.2010.03.016.CrossRefGoogle ScholarPubMed
Chaimani, A, Higgins, JP, Mavridis, D, Spyridonos, P, Salanti, G. Graphical tools for network meta-analysis in STATA. PLoS One. 2013;8(10):e76654.10.1371/journal.pone.0076654CrossRefGoogle Scholar
Chaimani, A, Caldwell, DM, Li, T, Higgins, JP, Salanti, G. Undertaking network meta-analyses. In: Cochrane Handbook for Systematic Reviews of Interventions. John Wiley & Sons, Ltd; 2019:285320. https://doi.org/10.1002/9781119536604.ch11.CrossRefGoogle Scholar
Rücker, G, Schwarzer, G. Ranking treatments in frequentist network meta-analysis works without resampling methods. BMC Med Res Methodol. 2015;15(1):58. https://doi.org/10.1186/s12874-015-0060-8.
Chiocchia, V, Nikolakopoulou, A, Papakonstantinou, T, Egger, M, Salanti, G. Agreement between ranking metrics in network meta-analysis: an empirical study. BMJ Open. 2020;10(8):e037744. https://doi.org/10.1136/bmjopen-2020-037744.
Salanti, G, Nikolakopoulou, A, Efthimiou, O, Mavridis, D, Egger, M, White, IR. Introducing the treatment hierarchy question in network meta-analysis. Am J Epidemiol. 2022;191(5):930–938. https://doi.org/10.1093/aje/kwab278.
Nikolakopoulou, A, Mavridis, D, Chiocchia, V, Papakonstantinou, T, Furukawa, TA, Salanti, G. Network meta-analysis results against a fictional treatment of average performance: treatment effects and ranking metric. Res Synth Methods. 2021;12(2):161–175. https://doi.org/10.1002/jrsm.1463.
Veroniki, AA, Straus, SE, Rücker, G, Tricco, AC. Is providing uncertainty intervals in treatment ranking helpful in a network meta-analysis? J Clin Epidemiol. 2018;100:122–129. https://doi.org/10.1016/j.jclinepi.2018.02.009.
Mavridis, D, Porcher, R, Nikolakopoulou, A, Salanti, G, Ravaud, P. Extensions of the probabilistic ranking metrics of competing treatments in network meta-analysis to reflect clinically important relative differences on many outcomes. Biom J. 2020;62(2):375–385. https://doi.org/10.1002/bimj.201900026.
Curteis, T, Wigle, A, Michaels, CJ, Nikolakopoulou, A. Ranking of treatments in network meta-analysis: incorporating minimally important differences. BMC Med Res Methodol. 2025;25(1):67. https://doi.org/10.1186/s12874-025-02499-0.
Chaimani, A, Porcher, R, Sbidian, É, Mavridis, D. A Markov chain approach for ranking treatments in network meta-analysis. Stat Med. 2021;40(2):451–464. https://doi.org/10.1002/sim.8784.
Papakonstantinou, T, Salanti, G, Mavridis, D, Rücker, G, Schwarzer, G, Nikolakopoulou, A. Answering complex hierarchy questions in network meta-analysis. BMC Med Res Methodol. 2022;22(1):47. https://doi.org/10.1186/s12874-021-01488-3.
Mills, EJ, Kanters, S, Thorlund, K, Chaimani, A, Veroniki, AA, Ioannidis, JPA. The effects of excluding treatments from network meta-analyses: survey. BMJ. 2013;347. https://doi.org/10.1136/bmj.f5195.
Trinquart, L, Attiche, N, Bafeta, A, Porcher, R, Ravaud, P. Uncertainty in treatment rankings: reanalysis of network meta-analyses of randomized trials. Ann Intern Med. 2016;164(10):666–673. https://doi.org/10.7326/M15-2521.
Cipriani, A, Higgins, JPT, Geddes, JR, Salanti, G. Conceptual and technical challenges in network meta-analysis. Ann Intern Med. 2013;159(2):130–137. https://doi.org/10.7326/0003-4819-159-2-201307160-00008.
Kibret, T, Richer, D, Beyene, J. Bias in identification of the best treatment in a Bayesian network meta-analysis for binary outcome: a simulation study. Clin Epidemiol. 2014;6:451–460. https://doi.org/10.2147/CLEP.S69660.
Cattelan, M, Varin, C, Firth, D. Dynamic Bradley–Terry modelling of sports tournaments. Appl Statist. 2013;62(1):135–150. https://doi.org/10.1111/j.1467-9876.2012.01046.x.
Stuart-Fox, DM, Firth, D, Moussalli, A, Whiting, MJ. Multiple signals in chameleon contests: designing and analysing animal contests as a tournament. Anim Behav. 2006;71(6):1263–1271. https://doi.org/10.1016/j.anbehav.2005.07.028.
Merrick, JRW, van Dorp, JR, Mazzuchi, T, Harrald, JR, Spahn, JE, Grabowski, M. The Prince William Sound risk assessment. Interfaces. 2002;32(6):25–40. https://doi.org/10.1287/inte.32.6.25.6474.
Cipriani, A, Furukawa, TA, Salanti, G, et al. Comparative efficacy and acceptability of 21 antidepressant drugs for the acute treatment of adults with major depressive disorder: a systematic review and network meta-analysis. Lancet. 2018;391(10128):1357–1366. https://doi.org/10.1016/S0140-6736(17)32802-7.
Elliott, WJ, Meyer, PM. Incident diabetes in clinical trials of antihypertensive drugs: a network meta-analysis. Lancet. 2007;369(9557):201–207. https://doi.org/10.1016/S0140-6736(07)60108-1.
Petropoulou, M, Nikolakopoulou, A, Veroniki, AA, et al. Bibliographic study showed improving statistical methodology of network meta-analyses published between 1999 and 2015. J Clin Epidemiol. 2017;82:20–28. https://doi.org/10.1016/j.jclinepi.2016.11.002.
Nikolakopoulou, A, Chaimani, A, Veroniki, AA, Vasiliadis, HS, Schmid, CH, Salanti, G. Characteristics of networks of interventions: a description of a database of 186 published networks. PLoS One. 2014;9(1):e86754. https://doi.org/10.1371/journal.pone.0086754.
Nikolakopoulou, A, Higgins, JPT, Papakonstantinou, T, et al. CINeMA: an approach for assessing confidence in the results of a network meta-analysis. PLoS Med. 2020;17(4):e1003082. https://doi.org/10.1371/journal.pmed.1003082.
Sahker, E, Furukawa, TA, Luo, Y, et al. Estimating the smallest worthwhile difference of antidepressants: a cross-sectional survey. BMJ Ment Health. 2024;27(1). https://doi.org/10.1136/bmjment-2023-300919.
Papakonstantinou, T, Nikolakopoulou, A, Higgins, JPT, Egger, M, Salanti, G. CINeMA: software for semiautomated assessment of the confidence in the results of network meta-analysis. Campbell Syst Rev. 2020;16(1):e1080. https://doi.org/10.1002/cl2.1080.
Guyatt, GH, Oxman, AD, Kunz, R, et al. GRADE guidelines 6. Rating the quality of evidence--imprecision. J Clin Epidemiol. 2011;64(12):1283–1293. https://doi.org/10.1016/j.jclinepi.2011.01.012.
Brignardello-Petersen, R, Murad, MH, Walter, SD, et al. GRADE approach to rate the certainty from a network meta-analysis: avoiding spurious judgments of imprecision in sparse networks. J Clin Epidemiol. 2019;105:60–67. https://doi.org/10.1016/j.jclinepi.2018.08.022.
Brignardello-Petersen, R, Florez, ID, Izcovich, A, et al. GRADE approach to drawing conclusions from a network meta-analysis using a minimally contextualised framework. BMJ. 2020;371:m3900. https://doi.org/10.1136/bmj.m3900.
Copay, AG, Subach, BR, Glassman, SD, Polly, DWJ, Schuler, TC. Understanding the minimum clinically important difference: a review of concepts and methods. Spine J. 2007;7(5):541–546. https://doi.org/10.1016/j.spinee.2007.01.008.
McGlothlin, AE, Lewis, RJ. Minimal clinically important difference: defining what really matters to patients. JAMA. 2014;312(13):1342–1343. https://doi.org/10.1001/jama.2014.13128.
Sahker, E, Luo, Y, Omae, K, et al. Estimating the smallest worthwhile difference of recommended psychotherapies for depression: observational study. Br J Psychiatry. 2025;10:18. https://doi.org/10.1192/bjp.2025.10453.
McNamara, RJ, Elkins, MR, Ferreira, ML, Spencer, LM, Herbert, RD. Smallest worthwhile effect of land-based and water-based pulmonary rehabilitation for COPD. ERJ Open Res. 2015;1(1):00007-2015. https://doi.org/10.1183/23120541.00007-2015.
Bradley, RA, Terry, ME. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika. 1952;39(3/4):324–345.
Firth, D, Kosmidis, I, Turner, H. Davidson-Luce model for multi-item choice with ties. arXiv preprint arXiv:1909.07123. Published online 2019.
Turner, HL, van Etten, J, Firth, D, Kosmidis, I. Modelling rankings in R: the PlackettLuce package. Comput Stat. 2020;35(3):1027–1057. https://doi.org/10.1007/s00180-020-00959-3.
Davidson, RR. On extending the Bradley-Terry model to accommodate ties in paired comparison experiments. J Am Stat Assoc. 1970;65(329):317–328. https://doi.org/10.1080/01621459.1970.10481082.
Luce, RD. Individual Choice Behavior: A Theoretical Analysis. Wiley; 1959.
Luce, RD. The choice axiom after twenty years. J Math Psychol. 1977;15(3):215–233. https://doi.org/10.1016/0022-2496(77)90032-3.
Hunter, DR. MM algorithms for generalized Bradley-Terry models. Ann Stat. 2004;32(1):384–406. https://doi.org/10.1214/aos/1079120141.
Caron, F, Doucet, A. Efficient Bayesian inference for generalized Bradley–Terry models. J Comput Graph Stat. 2012;21(1):174–196. https://doi.org/10.1080/10618600.2012.638220.
Davidson, RR, Solomon, DL. A Bayesian approach to paired comparison experimentation. Biometrika. 1973;60(3):477–487. https://doi.org/10.1093/biomet/60.3.477.
Balduzzi, S, Rücker, G, Nikolakopoulou, A, et al. Netmeta: an R package for network meta-analysis using Frequentist methods. J Stat Softw. 2023;106(2):1–40. https://doi.org/10.18637/jss.v106.i02.
Evrenoglou, T, Schwarzer, G. mtrank: Ranking using Probabilistic Models and Treatment Choice Criteria. https://cran.r-project.org/web/packages/mtrank/mtrank.pdf.
Papakonstantinou, T. nmadb: Network Meta-Analysis Database API. https://github.com/cran/nmadb.
Ades, AE, Davies, AL, Phillippo, DM, et al. Treatment recommendations based on network meta-analysis: rules for risk-averse decision-makers. Res Synth Methods. 2025;119. https://doi.org/10.1017/rsm.2025.17.
Zeng, L, Brignardello-Petersen, R, Hultcrantz, M, et al. GRADE guidance 34: update on rating imprecision using a minimally contextualized approach. J Clin Epidemiol. 2022;150:216–224. https://doi.org/10.1016/j.jclinepi.2022.07.014.
Phillippo, DM, Dias, S, Welton, NJ, Caldwell, DM, Taske, N, Ades, AE. Threshold analysis as an alternative to GRADE for assessing confidence in guideline recommendations based on network meta-analyses. Ann Intern Med. 2019;170(8):538–546. https://doi.org/10.7326/M18-3542.
Wigle, A, Béliveau, A, Salanti, G, et al. Precision of treatment hierarchy: a metric for quantifying certainty in treatment hierarchies from network meta-analysis. Stat Med. 2025;44(13-14):e70176. https://doi.org/10.1002/sim.70176.
Wu, W, Niezink, N, Junker, B. A diagnostic framework for the Bradley–Terry model. Journal of the Royal Statistical Society Series A: Statistics in Society. 2022;185(Supplement_2):S461–S484. https://doi.org/10.1111/rssa.12959.
Evrenoglou, T, White, IR, Afach, S, Mavridis, D, Chaimani, A. Network meta-analysis of rare events using penalized likelihood regression. Stat Med. 2022;41(26):5203–5219. https://doi.org/10.1002/sim.9562.
Evrenoglou, T, Metelli, S, Thomas, JS, et al. Sharing information across patient subgroups to draw conclusions from sparse treatment networks. Biom J. 2024;66(3):2200316. https://doi.org/10.1002/bimj.202200316.
Efthimiou, O, Rücker, G, Schwarzer, G, Higgins, JP, Egger, M, Salanti, G. Network meta-analysis of rare events using the Mantel-Haenszel method. Stat Med. 2019;38(16):2992–3012. https://doi.org/10.1002/sim.8158.
Fienberg, SE, Larntz, K. Log linear representation for paired and multiple comparisons models. Biometrika. 1976;63(2):245–254. https://doi.org/10.1093/biomet/63.2.245.
Figure 1 A graphical representation of the TCC for a fictional example showing the NMA estimates for the comparison of eight treatments versus a common reference treatment $Y$ in terms of a beneficial outcome.

Table 1 A summary of the characteristics across different ranking methods

Figure 2 Network plots for the two clinical examples. (a) The network of antidepressants and (b) the network of antihypertensive treatments. ACE, angiotensin-converting enzyme inhibitors; ARB, angiotensin receptor blockers; CCB, calcium channel blocker; BBlocker, beta blocker.

Table 2 Ranking metrics for the network of antidepressants. Treatments with the top three values for each respective metric are shown in bold. The “Treatment” column is ordered according to P-scores

Figure 3 Forest plots with results for the network of antidepressants. (a) The summary odds ratios obtained assuming Trazodone as the reference treatment group. (b) The ranking results obtained using the proposed methodology.

Figure 4 Sensitivity analysis for the network of antidepressants. The y-axis represents the probability of each treatment having the highest true ability and the x-axis the different SWD values.

Figure 5 Forest plots with results for the network of antihypertensive treatments. (a) The summary odds ratios obtained assuming placebo as the reference treatment group. (b) The ranking results obtained using the proposed methodology. ACE, angiotensin-converting enzyme inhibitors; ARB, angiotensin receptor blockers; CCB, calcium channel blocker; BBlocker, beta blocker.

Table 3 Ranking metrics for the network of antihypertensive drugs. Treatments with the top three values for each respective metric are shown in bold. The “Treatment” column is ordered according to P-scores

Figure 6 Sensitivity analysis for the network of the antihypertensive drugs. The y-axis represents the probability of each treatment having the highest true ability and the x-axis the different SWD values.

Table 4 Pairwise agreement between the different ranking metrics, measured by the median Pearson’s correlation coefficient and the interquartile range of values obtained across 153 published NMAs

Figure 7 Scatter plots contrasting the correlation between the ability-based metric and the other ranking metrics across 153 networks. (a) The correlations plotted against the average variance of the NMA estimates; values on the left-hand side of the graph indicate greater precision. (b) The correlations plotted against the relative range of variances of the NMA estimates; values on the left-hand side of the graph indicate a larger variance range. In all scatter plots, the purple line represents a cubic smoothing spline with five degrees of freedom.

Supplementary material: File

Evrenoglou et al. supplementary material
