The evaluation of the role of face masks in preventing respiratory infections is a paradigm case of synthesising complex evidence (i.e. evidence that is extensive, diverse, technically specialised, and marked by multilevel chains of causality). Primary studies have assessed different mask types, diseases, populations, and settings using a range of research designs. Numerous review teams have attempted to synthesise this literature, in which observational (case–control, cohort, cross-sectional) and ecological studies predominate. Their findings and conclusions vary widely.
This article critically examines how 66 systematic reviews dealt with mask efficacy studies. Risk-of-bias tools produced unreliable assessments when, as was often the case, review teams lacked methodological expertise or topic-specific understanding. This was especially true when datasets were large and heterogeneous, with multiple biases playing out in different ways and requiring nuanced adjustments. In such circumstances, tools were sometimes applied crudely and reductively rather than used to support close reading of primary studies and to guide expert judgments. Various moves by reviewers served to obscure important aspects of heterogeneity: excluding observational evidence altogether, assessing the risk but not the direction of biases, omitting distinguishing details of primary studies, and producing meta-analyses that combined studies of different designs or included studies at critical risk of bias. The result was bland and unhelpful summary statements.
We draw on philosophy to question the formulaic use of generic risk-of-bias tools, especially when the primary evidence demands expert understanding and the tailoring of study-quality questions to the topic. We call for more rigorous training and oversight of reviewers of complex evidence, and for new review methods designed specifically for such evidence.