
The Perks and Perils of Replicability

Published online by Cambridge University Press:  02 February 2026

Krist Vaesen*
Affiliation:
Industrial Engineering & Innovation Sciences, Eindhoven University of Technology, Eindhoven, The Netherlands; Human Origins Group, Faculty of Archaeology, Leiden University, Leiden, The Netherlands; Department for Prehistoric Archaeology, University of Cologne, Cologne, Germany
Shumon T. Hussain
Affiliation:
Department for Prehistoric Archaeology, University of Cologne, Cologne, Germany; Multidisciplinary Environmental Studies in the Humanities (MESH), University of Cologne, Cologne, Germany
Corresponding author: Krist Vaesen; Email: k.vaesen@tue.nl

Abstract

Following a trend across the sciences, recent studies in lithic analysis have embraced the ideal of replicability. Recent large-scale studies have demonstrated that high replicability is achievable under controlled conditions and have proposed strategies to improve it in lithic data recording. Although this focus has yielded important methodological advances, we argue that an overemphasis on replicability risks narrowing the scope of archaeological inquiry. More specifically, we show (1) that replicability alone does not guarantee reliability, interpretive value, or cost effectiveness, and (2) that archaeological data often involve unavoidable ambiguity due to preservation, analyst background, and the nature of lithic variability itself. Instead of allowing replicability to dictate research priorities, we advocate for a problem-driven, pluralistic approach that tailors methods to research questions and balances replicable measures with interpretive depth. This has practical implications for training, publishing, and funding policy. We conclude that Paleolithic archaeology must engage with the replicability movement on its own terms—preserving methodological diversity while maintaining scientific credibility.

Resumen

Siguiendo una tendencia generalizada en las ciencias, los estudios recientes sobre análisis lítico han adoptado el ideal de la replicabilidad. Estos estudios a gran escala han demostrado que se puede conseguir una alta replicabilidad en condiciones controladas además de proponer estrategias para mejorarla en el registro de datos líticos. Si bien este enfoque ha producido importantes avances metodológicos, nosotros argumentamos que un énfasis excesivo en la replicabilidad pone en riesgo la reducción del alcance de la investigación arqueológica. Más concretamente, mostramos como la replicabilidad por sí sola no garantiza fiabilidad, valor interpretativo o rentabilidad, y que los datos arqueológicos a menudo implican una ambigüedad inevitable debido a la preservación, los antecedentes del analista y la propia naturaleza de la variabilidad lítica. En lugar de permitir que la replicabilidad dicte las prioridades de investigación, abogamos por un enfoque pluralista y orientado a la resolución de problemas que adapte los métodos a las preguntas de investigación y a su vez equilibre las medidas replicables con la exhaustividad interpretativa. Esto tiene implicaciones prácticas en las políticas de formación, publicación y financiación. Concluimos que la arqueología paleolítica debe comprometerse con el movimiento de replicabilidad en sus propios términos, preservando así, tanto la diversidad metodológica como la credibilidad científica.

Information

Type
Report
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press on behalf of the Society for American Archaeology.

The Push toward Replicability

The credibility of a scientific claim rests not only on its initial demonstration but also on whether the same outcome recurs when the study is repeated—demonstrating the replicability of the supporting evidence (Nosek et al. 2022; Romero 2019). Yet, a growing concern about replicability has led to a sense of crisis in science. It has become abundantly clear that researchers are often unable to reproduce each other’s results even when following similar or identical methods. The purported crisis spans nearly every scientific discipline—from psychology to biomedicine, from economics to engineering, and from sociology to physics (Baker 2016)—and has recently spilled over into archaeological debate, especially in relation to the study of stone artifacts (Farahani 2024; Marwick 2017, 2025; Marwick et al. 2020).

In lithic analysis, this concern with replicability has translated into a heightened interest in inter-observer agreement and methodological standardization. Recent prominent studies concerning artifact-level inter-observer agreement—such as those by Timbrell and colleagues (2022), Pargeter and colleagues (2023), and Kot and colleagues (2025)—aim to identify and reduce inconsistencies in the description and measurement of knapped stone artifacts and the tools manufactured from them. Although earlier archaeological work has also addressed replicability (see, for example, Dibble and Bernard 1980; Fish 1978; Gnaden and Holdaway 2000; Lyman and VanPool 2009; Proffitt and de la Torre 2014; Wilmsen and Roberts 1978),¹ the more recent studies are notable for their larger scale and broader scope. For example, Pargeter and colleagues (2023) and Kot and colleagues (2025) involved significantly more lithic analysts (11 and 32, respectively) than the average of six in previous studies; Pargeter and colleagues (2023) furthermore reassessed no fewer than 100 artifacts and 38 attributes. Timbrell and colleagues (2022) included relatively few analysts (6) but implemented a resource-intensive collaborative framework, distributing identical 3D-printed copies of lithic artifacts worldwide for evaluation according to an elaborate standardized protocol. Furthermore, each of these studies was coauthored by an average of 10 researchers, rather than two.

Another common thread among the three studies, following a recent push for open science in archaeology (Marwick et al. 2017), is their shared conviction that replicability is important and should be optimized; each study concludes with a set of recommendations on how to achieve this goal. Timbrell and colleagues (2022) commend their collaborative standardized data-collection protocol (see below), Pargeter and colleagues (2023) suggest prioritizing high-agreement attributes, increasing sample sizes, and using photogrammetry and morphometric methods for low-agreement attributes, and Kot and colleagues (2025) propose improving education in lithic analysis.

These studies offer invaluable insights into the current state of lithic analysis. They are timely, methodologically rigorous, and reflect a commendable effort to improve the reliability of archaeological data. By assembling large, diverse teams, implementing standardized protocols, and leveraging new technologies, they have demonstrated that high levels of replicability are indeed achievable—even across analysts from different training backgrounds. These contributions are not only practical but also strengthen the broader credibility of archaeological science, particularly in cross-disciplinary research.

However, before further dedicating resources to optimizing replicability, we believe it is important to pause briefly and critically examine the assumptions fueling the replicability movement in lithic analysis. Obviously, replicability is an important touchstone of science. Still, it is not always the most important goal, and in some cases, it may even be misleading to foreground it—or so we will argue.

When Replicability and Reliability Come Apart

For the sake of comparison, it seems essential that analysts achieve high agreement and low error rates when measuring attributes and comparing observations on lithic artifacts. To ensure credibility, it appears to be a basic requirement that different researchers studying the same artifacts using the same methods should obtain consistent results. In fact, inter-analyst replicability is a key concern shared by all empirical sciences. So, what could possibly be wrong with prioritizing high-agreement attributes?

A first thing to note is that inter-analyst replicability studies are generally challenging to interpret, making it difficult to distinguish between acceptable and unacceptable replicability scores. For example, Pargeter and colleagues (2023) follow Cohen (1960) in their interpretation of the inter-analyst scores they calculated: values ≤0 indicate no agreement, 0.01–0.20 indicate none to slight, 0.21–0.40 indicate fair, 0.41–0.60 indicate moderate, 0.61–0.80 indicate substantial, and 0.81–1.00 indicate strong agreement.² Cohen himself, however, acknowledges that there is no universal benchmark for a satisfactory score—that is, a score above which agreement is acceptable. Additionally, researchers have shown that so-called prevalence imbalance can produce deceptively low replicability scores even when overall agreement is high (Feinstein and Cicchetti 1990). In such cases, skewed category distributions—not poor analyst performance—may drive the low scores. Imagine that two analysts are classifying 100 stone tools into two types, Type A and Type B, and suppose that 90 of them are Type A and only 10 are Type B. If both analysts agree on most of the Type A tools but strongly disagree on the few Type B tools, their overall agreement might still be high: perhaps on 85 out of 100 tools. However, the replicability score (e.g., Cohen’s kappa) will be low, precisely because the skewed distribution inflates the chance agreement against which kappa corrects, so even near-perfect raw agreement translates into a weak score. Accordingly, discarding attributes solely on the basis of agreement metrics risks excluding data that are, in fact, reliably recorded.
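To make the arithmetic behind this paradox concrete, the following sketch computes Cohen’s kappa for the hypothetical two-analyst scenario just described; the specific cell counts are our own illustrative assumption, not data from any of the cited studies.

```python
# Minimal sketch of the prevalence paradox (hypothetical counts).
# Two analysts classify 100 tools as Type A or Type B; 90 tools are Type A.
# Keys are (analyst 1's call, analyst 2's call).
counts = {
    ("A", "A"): 84,  # both record Type A
    ("A", "B"): 6,
    ("B", "A"): 9,
    ("B", "B"): 1,   # both record Type B
}
n = sum(counts.values())

# Observed agreement: proportion of tools on which the analysts agree.
p_observed = (counts[("A", "A")] + counts[("B", "B")]) / n

# Chance-expected agreement from each analyst's marginal distribution.
p1_a = (counts[("A", "A")] + counts[("A", "B")]) / n  # analyst 1 records A
p2_a = (counts[("A", "A")] + counts[("B", "A")]) / n  # analyst 2 records A
p_expected = p1_a * p2_a + (1 - p1_a) * (1 - p2_a)

kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"observed agreement = {p_observed:.2f}")  # 0.85
print(f"Cohen's kappa      = {kappa:.2f}")       # ~0.04, "none to slight"
```

With 85% raw agreement, kappa lands at roughly 0.04, "none to slight" on the scale above, simply because chance-expected agreement is already about 0.84 under such a skewed distribution.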

The interpretation of replicability scores is further complicated by features specific to lithic analysis. In most fields, inter-analyst error is judged relative to the variability within the underlying sample. For example, if we measure the heights of students, an average error of 1 mm might be tolerable in a highly variable group, but it would be problematic in a uniform group, in which most students are similar in height. In lithic analysis, however, we cannot access the original variability of the assemblage directly, because the archaeological record is affected by preservation and time averaging. Imagine two attributes: for the first, the average inter-analyst error is just 1 mm, but poor preservation (e.g., postdepositional damage) introduces up to 5 mm of distortion. For the second, the average inter-analyst error is 2 mm, but preservation has little effect. Although the first attribute shows higher inter-analyst agreement, the second can be assessed more reliably—that is, we can more reliably infer its original state or condition. This illustrates the danger of judging attributes solely by analyst agreement scores. So, again, prioritizing attributes based on high inter-analyst agreement may lead us to favor measurements that appear robust under ideal conditions but that are less trustworthy in actually encountered archaeological contexts. This risk is particularly acute if such prioritization relies on the three aforementioned studies, all of which used experimentally knapped—and therefore pristine—artifacts.
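One way to see why the second attribute can nonetheless be the more trustworthy one is to combine the two error sources into a single uncertainty estimate. The sketch below does so under two simplifying assumptions of ours (not drawn from the cited studies): that analyst error and preservation-induced distortion are independent, and that each can be summarized by the millimeter figures used above.

```python
import math

# Combine independent error sources in quadrature (root sum of squares).
def total_uncertainty(analyst_error_mm: float, preservation_error_mm: float) -> float:
    return math.sqrt(analyst_error_mm**2 + preservation_error_mm**2)

# Attribute 1: excellent inter-analyst agreement (1 mm), but up to 5 mm of
# postdepositional distortion.
print(round(total_uncertainty(1.0, 5.0), 2))  # ~5.1 mm

# Attribute 2: slightly worse inter-analyst agreement (2 mm), negligible
# preservation effects (assumed here to be 0.5 mm).
print(round(total_uncertainty(2.0, 0.5), 2))  # ~2.06 mm
```

On these assumptions, the attribute with the better agreement score is the less reliable guide to the artifact’s original state, which is precisely the point of the example.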

A deeper issue concerns the assumption that lower error automatically corresponds to better analysis. This overlooks the fact that some analytical methods—especially interpretive ones, such as chaîne opératoire—may produce greater disagreement at the level of individual observations but still lead to robust and meaningful insights at the interpretive level. The challenge here lies not in measurement accuracy per se but in the kinds of questions different methods are designed to address. Descriptive approaches tend to rely on tightly defined variables and low tolerance for variation, whereas interpretive methods often aim to reconstruct processes, decisions, and behaviors, which are goals that depend more on contextual judgment than on precise replication. As a result, different methods entail different relationships to error and insight. For instance, analysts might disagree on the exact sequencing of flake scars on a core—a task that resists strict standardization—but still agree on the reduction strategy it reflects. In such cases, variation in low-level observations does not undermine, and may even support, higher-level interpretive convergence. Disagreement, then, is not necessarily a weakness of the method. It can indicate a method’s capacity to engage with the complexity of human behavior.

When Replicability and Meaningfulness Come Apart

Another concern is that an overemphasis on replicability may lead to the premature dismissal of attributes that are behaviorally meaningful. Not all attributes are equally informative about the aspects of past human behavior that lithic analysis seeks to elucidate. For instance, Pargeter and colleagues (2023) found that flake weight had the highest replicability score. Yet weight, although easy to measure consistently, will often offer more limited insight into knapping behavior or cultural practices than other attributes, such as platform morphology—one of the attributes that fell below Pargeter and colleagues’ (2023) threshold for acceptable replicability. Platform morphology refers to the shape and preparation of the striking platform, which is the area of a core or flake struck to detach a flake. According to Pargeter and colleagues (2023), it includes forms such as punctiform, plain, linear, faceted, crushed, cortical, dihedral, and chapeau de gendarme, among others. Because it reflects aspects of knapping technique, skill, potentially cultural input, and—from a technological perspective most importantly—how stone volumes were treated and exploited, excluding it on the basis of a low replicability score risks discarding precisely the kinds of data most relevant to archaeological inference.

Here, too, preservation complicates matters. In actual assemblages, fractures, patination, and postdepositional damage often obscure the very features analysts are meant to record, including use-wear traces, scar patterns, and cortical surfaces. Disagreement in these cases may reflect not an error but a genuine ambiguity in the artifact’s condition. For example, cortex percentage is often used as a proxy for reduction intensity, but heavily weathered or stained surfaces can make cortex identification uncertain, leading analysts to record substantially different values. Yet, cortex percentage remains a behaviorally meaningful attribute, and excluding it on the grounds of low replicability would be unnecessarily restrictive. A more reasonable recommendation would be to cultivate awareness of these issues and to work on the types of errors made (which are often not arbitrary; see Kot et al. 2025).

The same applies to attributes that are marked by high behavioral or morphological ambiguity: meaningful but hard to record consistently. Indeed, some attributes are inherently more ambiguous because they reflect complex or continuous aspects of artifact or tool production—aspects that do not neatly fit into discrete categories. For instance, the variability of actual flake forms is continuous, yet analysts are required to classify them into simple, nominal categories (e.g., flat, concave, very concave, bulbar, twisted). In Pargeter and colleagues’ (2023) study, attributes related to flake form scored poorly on replicability, a result the authors themselves attribute to the “complex decisions about flake shape and specific flake locations” analysts need to make. Similarly, Kot and colleagues (2025) found that certain flake scar relationships are intrinsically difficult to resolve; about 5% remained ambiguous regardless of who analyzed them. Such ambiguities stem from the complexity of knapping processes, which sometimes produce overlapping or equivocal features. In these cases, a low replicability score does not necessarily indicate a useless attribute or a poorly trained analyst. Instead, it encapsulates the genuine ambiguity of the lithic record.

Differences in training background may introduce yet another layer of disagreement. Pargeter and colleagues (2023) found that years of experience did not correlate with higher inter-analyst agreement, but the type of training did. Analysts with more exposure to quantitative, attribute-focused methods tended to record certain measurements more consistently, whereas those steeped in more qualitative, descriptive lithic traditions tended to diverge on those same measures—and vice versa for other attributes. Both approaches to lithic analysis yield valuable information: dismissing attributes associated with one tradition simply because they generate disagreement rooted in training differences risks discarding data that are crucial for meaningful inference. Given that Pargeter and colleagues (2023) included only a single author (i.e., Katja Douze) trained in qualitative chaîne opératoire inquiry (as characteristic of the French tradition in lithic research sensu stricto; see Hussain 2019), it is likely that such training diversity and its impact on replicability remain underestimated.

In response to training-related disagreement—and inter-analyst disagreement more broadly—all three studies advocate for greater investment in improving replicability: enhanced collaboration and standardization (Timbrell et al. 2022), expanded sample sizes and the use of photogrammetry (Pargeter et al. 2023), and more targeted training (Kot et al. 2025; Pargeter et al. 2023). At first glance, this seems entirely reasonable. What could possibly be wrong with striving for greater replicability when it comes to low-agreement attributes? Striving for greater replicability may come at too great a cost—as will become apparent in the next section.

When Replicability and Cost-Effectiveness Come Apart

Timbrell and colleagues (2022) developed a collaborative framework to assess inter-observer error in lithic shape analysis by distributing 3D-printed replicas of stone tools to expert analysts across multiple institutions. Rather than requiring all analysts to converge physically, the study enabled remote participation by providing identical physical models and a standardized protocol for photographing and measuring the artifacts. Analysts applied both traditional linear metrics and outline-based geometric morphometric (GMM) techniques to capture lithic shape variability. The study emphasizes the importance of clear definitions and consistent imaging procedures, demonstrating that when these are in place, inter-observer variability can indeed be minimized. Although the approach proved effective, the authors also readily acknowledged its logistical and financial demands, highlighting that such a method—though valuable—requires substantial coordination, infrastructure, and time investment.

The study by Pargeter and colleagues (2023), too, intimates that achieving high replicability in lithic analysis comes at a considerable cost—in terms of both logistics and collaborative effort. The study involved 11 expert analysts from diverse backgrounds, who independently analyzed 100 experimentally produced flakes across 38 attributes over a two-year period. Despite using a shared set of physical artifacts and standardized definitions, the team encountered significant challenges in harmonizing data collection, particularly due to variation in recording tools, which led to substantial time spent on data cleaning. The attribute definitions themselves were the product of hundreds of hours of collaborative discussion and refinement. Although Pargeter and colleagues (2023) recommend photogrammetry and morphometric methods to improve replicability—especially for shape-related attributes—they also acknowledge that such techniques are costly and impractical for large assemblages.

For our purposes, the key issue is not merely whether the approaches of Timbrell and colleagues (2022) and Pargeter and colleagues (2023) improve replicability but whether they do so in a cost-effective manner that genuinely advances archaeological understanding. The proposals by Timbrell and colleagues (2022) and by Pargeter and colleagues (2023) raise a concern that is, from our point of view, rarely considered but legitimate: that an overemphasis on replicability may skew funding and attention toward easily standardized attributes, potentially at the expense of qualitative characteristics that capture critical behavioral insights. If the goal is to understand lithic production and use, then cost effectiveness must also account for the interpretive depth offered by expert judgment and contextual analysis. The latter may yield results that are less tidy or more uncertain, but such ambiguity is intrinsic to the Paleolithic record—something to be engaged with, not standardized away.

That said, it is important to distinguish between the costs of testing replicability and the costs of producing replicable data. The former, such as time spent harmonizing definitions or coordinating analysts, are largely front-loaded and may diminish over time. The latter, however, often entail sustained logistical and financial burdens, particularly when replicable studies rely on resource-intensive methods such as 2D and 3D morphometric data capture, photogrammetry, or specialized imaging equipment (resources that also include analyst training and the coding and software literacies required to use these methods at a high level). These approaches, although powerful, may be impractical for large lithic assemblages or for teams with limited access to such infrastructure. Moreover, although interpretive methods such as techno-functional analysis also involve significant time investment, their value lies in the depth of behavioral insight they offer—an interpretive richness often unachievable through more standardized, replicable metrics alone. Consequently, cost effectiveness should be evaluated in terms of not only replicability but also the archaeological significance and explanatory power of the methods employed. Replication, in other words, must be evaluated in light of the broader epistemic goals of the field. Rather than positioning different research practices in opposition, we should aim to coordinate them in ways that foster methodological synergy and deepen archaeological understanding.

Although we emphasize the cost and labor intensity of the approaches proposed by Timbrell and colleagues (2022) and Pargeter and colleagues (2023), our critique is not intended to diminish their analytical value. These methods offer powerful tools for documenting and quantifying lithic variability. However, they should be seen as complementary to, rather than replacements for, interpretive approaches. Our concern is simply that an overemphasis on replicability may lead to the prioritization of some methods not because they are best suited to the research questions at hand, but because they produce what are perceived as highly replicable data and results. In doing so, there is a risk of privileging methodological neatness over interpretive richness—an imbalance that could ultimately constrain our understanding of the complexities of past human behavior. This is particularly problematic because what qualifies as “neat” or “rigorous” often cannot be determined on neutral grounds; such judgments are frequently shaped by researchers’ epistemic positions and disciplinary perspectives.

Conclusion: A Pluralistic and Problem-Driven Approach to Lithic Analysis

The push to improve replicability in lithic analysis ultimately aims to enhance the scientific credibility of our interpretations of the past. The studies by Timbrell and colleagues (2022), Pargeter and colleagues (2023), and Kot and colleagues (2025) exemplify this effort, and they highlight not only areas where analysts achieve strong agreement but also areas where challenges persist in recording what remains, materially, the same archaeological reality. From their work, we learn that high replicability is indeed attainable. Pargeter and colleagues (2023), for instance, demonstrated that even an international team of experts from diverse training backgrounds can, with a shared protocol, record many attributes with substantial consistency, and Kot and colleagues (2025) exemplify how such work can lead to the identification of particular sources and types of errors that better training and sensibility can help to reduce and perhaps even overcome. Additionally, the studies show encouraging developments in methodology: extensive collaborative calibration, creative use of new technology for standardization, and rigorous testing with experimental controls.

At the same time, we caution against organizing lithic analysis around what can be easily agreed upon. Replicability tests should be seen as valuable tools, not as gatekeepers that define what is worth studying and what should be funded. Although essential for identifying sources of disagreement and refining our methods, they should not limit the scope of inquiry. Instead of letting replicability dictate the agenda, we should let research questions guide methodological choices. For example, if our aim is to understand how hominins in different regions learned tool-making skills, we must examine how they experimented with various technical solutions for working stone, and how these choices shaped the tools they produced and used. This often requires close attention to subtle differences in artifact-level features and broader constellations of such features. We should not avoid such work simply because some of the relevant attributes are difficult to code reliably. Rather, we should design our studies to enhance consistency—perhaps through joint training sessions or by using digital tools to support expert measurement and observation—and to cross-validate findings using complementary techniques such as chemical sourcing or supervised statistical methods. Conversely, if our research question is broader—such as exploring how reduction intensity correlates with raw material availability—then we might rely more heavily on highly replicable measures such as core mass or dorsal scar count, where inter-analyst consistency is well established. In both cases, methodology should be tailored to the problem at hand, striking a balance between the desire for reliable data and the need to capture meaningful archaeological variation.

We recognize that, in some cases, research questions are tightly bound to specific methodological frameworks—such as chaîne opératoire or use-wear analysis—and that switching to alternative methods or combining approaches may not be viable without compromising the integrity of the inquiry. In such scenarios, we argue that lower replicability should be accepted as a necessary trade-off rather than a disqualifying flaw. Ambiguity in the archaeological record is often irreducible, and interpretive methods are designed precisely to engage with such complexity. Rather than abandoning the respective questions, we recommend that researchers explicitly document sources of ambiguity, develop standards for reporting uncertainty (e.g., confidence intervals, expert consensus), and evaluate interpretive robustness at higher levels of inference (including but not limited to the descriptive levels of lithic technical systems). Replicability, in these contexts, should be understood not as strict agreement on every observation but as coherence and explanatory strength across interpretive claims. This approach preserves methodological integrity while acknowledging the epistemic realities of archaeological practice.

The broader challenge, of course, is not only to recognize genuine ambiguity and manage uncertainty. Practitioners must also remain critically aware that the very act of framing research questions can commit them implicitly to particular methodological frameworks. In such cases, assessing the robustness of results across methods may not be feasible. It is therefore essential to cultivate an awareness of epistemological pluralism: to examine the trade-offs inherent in different research frameworks, reflect on the kinds of questions they privilege, and recognize that interrogating the questions we pose about the lithic record is as important as the answers we derive from it. Ultimately, this may require developing differentiated standards of replicability tailored to the specific nature of the research problem at hand.

In Paleolithic archaeology, resolving research questions often requires a multistranded approach—one that integrates metrics, morphometrics, chaîne opératoire, use-wear analysis, and experimentation in meaningful ways. These different methods are valuable not because they produce the same results but because each addresses a different dimension of the archaeological record. The interpretive claims they generate should therefore be evaluated not only for internal consistency but also for their coherence, explanatory strength, potential for new hypothesis generation, and alignment with other lines of evidence. This set of standards is broader and more demanding than replicability alone, but it would also be a more fitting one for a discipline dedicated to studying past human behavior and its long-term evolution.

In practice, many archaeologists already endorse the problem-driven pluralism we advocate. To the extent that they do, this article has reaffirmed its value and emphasized its importance, even as other fields react to replication crises with calls for stricter standardization. The idea that replicability should serve as the sole—or even primary—universal criterion for scientific validity has also been critically examined by philosophers of science (Guttinger 2020). We should be cautious not to conflate replicability with “good science,” especially given that, at scale, low replicability can still contribute to robust and efficient scientific progress (Lewandowsky and Oberauer 2020).

Our argument also carries several practical implications for how lithic analysis—and archaeological science more broadly—is taught, published, and evaluated. Training programs should (continue to) expose students to a range of analytical traditions, from typological and technological approaches to digital morphometrics, use-wear analysis, and experimental replication. This diversity of training not only equips future researchers with a broader methodological tool kit but also fosters an appreciation for the strengths and limitations of different approaches.

With respect to journals, editorial policies should (continue to) welcome diverse approaches and encourage authors to report interpretive disagreements alongside points of consensus. Publishing contrasting interpretations, when grounded in shared data, can help move the field forward by clarifying where and why perspectives diverge. Likewise, peer review of articles should (continue to) reward careful interpretation, even when results are hard to standardize.

Finally, science funders should (continue to) resist the temptation to direct funding exclusively to highly replicable approaches that may appear more scientifically legitimate by virtue of their quantification and standardization. Put differently, they must ensure that support is evenly distributed across a range of methodologies, recognizing that diverse approaches contribute in different but equally valuable ways to our understanding of the past.

As the scientific community increasingly prioritizes replicability, Paleolithic archaeology must forge its own path—one that values reliability without sacrificing its rich interpretive depth. The strength of the discipline lies not in methodological uniformity but in its capacity to integrate diverse lines of evidence in response to complex questions about the human past.

Acknowledgments

We are grateful to the two anonymous reviewers for their critical engagement and valuable suggestions, which helped improve the original manuscript, on which this commentary is based. We acknowledge the use of Microsoft Copilot for the purposes of checking English grammar, refining sentence structure and wording, and proofreading. However, the responsibility for the content of this publication lies with the authors.

Funding Statement

Shumon T. Hussain was supported by the HESCOR project (Cultural Evolution in Changing Climate: Human and Earth System Coupled Research), which has received funding through the “Profilbildung 2022” initiative of the Ministry of Culture and Science of the State of North Rhine-Westphalia (ID: HESCOR PB22-081).

Data Availability Statement

No original data were used.

Competing Interests

The authors declare none.

Footnotes

1. As noted, replicability refers to the ability to achieve consistent results when a study is repeated—whether by the original researcher or others—using the same methods but applied to a new dataset. In contrast, the studies by Timbrell and colleagues (2022), Pargeter and colleagues (2023), and Kot and colleagues (2025) assess the ability to achieve consistent results when the same methods are applied to the same study objects (i.e., stone artifacts). Despite this subtle difference, we follow Pargeter and colleagues (2023) in referring to this latter form of consistency as replicability.

2. Scores were calculated as the variance among group means (i.e., each flake measured by the 11 analysts) relative to the sum of group-level and data-level variance.
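Read this way (our gloss, not the original authors' notation), the score described in footnote 2 has the form of an intraclass correlation:

$$\text{score} = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{within}}},$$

where $\sigma^2_{\text{between}}$ is the variance among the per-flake means (each flake rated by the 11 analysts) and $\sigma^2_{\text{within}}$ is the residual, data-level variance; the score approaches 1 as analysts converge on the same values for the same flake.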

References Cited

Baker, Monya. 2016. 1,500 Scientists Lift the Lid on Reproducibility. Nature 533(7604):452–454. https://doi.org/10.1038/533452a.
Cohen, Jacob. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement 20(1):37–46. https://doi.org/10.1177/001316446002000104.
Dibble, Harold L., and Bernard, Mary C. 1980. A Comparative Study of Basic Edge Angle Measurement Techniques. American Antiquity 45(4):857–865. https://doi.org/10.2307/280156.
Farahani, Alan. 2024. Reproducibility and Archaeological Practice in the Journal of Field Archaeology. Journal of Field Archaeology 49(6):391–394. https://doi.org/10.1080/00934690.2024.2391623.
Feinstein, Alvan R., and Cicchetti, Domenic V. 1990. High Agreement but Low Kappa: I. The Problems of Two Paradoxes. Journal of Clinical Epidemiology 43(6):543–549. https://doi.org/10.1016/0895-4356(90)90158-L.
Fish, Paul R. 1978. Consistency in Archaeological Measurement and Classification: A Pilot Study. American Antiquity 43(1):86–89. https://doi.org/10.2307/279635.
Gnaden, Denis, and Holdaway, Simon. 2000. Understanding Observer Variation When Recording Stone Artifacts. American Antiquity 65(4):739–747. https://doi.org/10.2307/2694425.
Guttinger, Stephan. 2020. The Limits of Replicability. European Journal for Philosophy of Science 10(2):10. https://doi.org/10.1007/s13194-019-0269-1.
Hussain, Shumon T. 2019. The French-Anglophone Divide in Lithic Research: A Plea for Pluralism in Palaeolithic Archaeology. PhD dissertation, Faculty of Archaeology, Leiden University, Leiden, Netherlands.
Kot, Małgorzata, Tyszkiewicz, Jerzy, Leloch, Michał, Gryczewska, Natalia, and Miller, Sebastian. 2025. Reliability and Validity in Determining the Relative Chronology between Neighbouring Scars on Flint Artefacts. Journal of Archaeological Science 175:106156. https://doi.org/10.1016/j.jas.2025.106156.
Lewandowsky, Stephan, and Oberauer, Klaus. 2020. Low Replicability Can Support Robust and Efficient Science. Nature Communications 11:358. https://doi.org/10.1038/s41467-019-14203-0.
Lyman, R. Lee, and VanPool, Todd L. 2009. Metric Data in Archaeology: A Study of Intra-Analyst and Inter-Analyst Variation. American Antiquity 74(3):485–504. https://doi.org/10.1017/S0002731600048721.
Marwick, Ben. 2017. Computational Reproducibility in Archaeological Research: Basic Principles and a Case Study of Their Implementation. Journal of Archaeological Method and Theory 24(2):424–450. https://doi.org/10.1007/s10816-015-9272-9.
Marwick, Ben. 2025. Is Archaeology a Science? Insights and Imperatives from 10,000 Articles and a Year of Reproducibility Reviews. Journal of Archaeological Science 180:106281. https://doi.org/10.1016/j.jas.2025.106281.
Marwick, Ben, d’Alpoim Guedes, Jade, Barton, C. Michael, Bates, Lynsey A., Baxter, Michael, Bevan, Andrew, Bollwerk, Elizabeth A., et al. 2017. Open Science in Archaeology. SAA Archaeological Record 17(4):8–14.
Marwick, Ben, Wang, Li-Ying, Robinson, Ryan, and Loiselle, Hope. 2020. How to Use Replication Assignments for Teaching Integrity in Empirical Archaeology. Advances in Archaeological Practice 8(1):78–86. https://doi.org/10.1017/aap.2019.38.
Nosek, Brian A., Hardwicke, Tom E., Moshontz, Hannah, Allard, Aurélien, Corker, Katherine S., Dreber, Anna, Fidler, Fiona, et al. 2022. Replicability, Robustness, and Reproducibility in Psychological Science. Annual Review of Psychology 73:719–748. https://doi.org/10.1146/annurev-psych-020821-114157.
Pargeter, Justin, Brooks, Alison, Douze, Katja, Eren, Metin, Groucutt, Huw S., McNeil, Jessica, Mackay, Alex, et al. 2023. Replicability in Lithic Analysis. American Antiquity 88(2):163–186. https://doi.org/10.1017/aaq.2023.4.
Proffitt, Tomos, and de la Torre, Ignacio. 2014. The Effect of Raw Material on Inter-Analyst Variation and Analyst Accuracy for Lithic Analysis: A Case Study from Olduvai Gorge. Journal of Archaeological Science 45:270–283. https://doi.org/10.1016/j.jas.2014.02.028.
Romero, Felipe. 2019. Philosophy of Science and the Replicability Crisis. Philosophy Compass 14(11):e12633. https://doi.org/10.1111/phc3.12633.
Timbrell, Lucy, Scott, Christopher, Habte, Behailu, Tefera, Yosef, Monod, Hélène, Qazzih, Mouna, Marais, Benjamin, et al. 2022. Testing Inter-Observer Error under a Collaborative Research Framework for Studying Lithic Shape Variability. Archaeological and Anthropological Sciences 14(10):209. https://doi.org/10.1007/s12520-022-01676-2.
Wilmsen, Edwin N., and Roberts, Frank H. H., Jr. 1978. Lindenmeier, 1934–1974: Concluding Report on Investigations. Contributions to Anthropology No. 24. Smithsonian Institution, Washington, DC. https://doi.org/10.5479/si.00810223.24.1.