Beyond Chronology, Using Bayesian Inference to Evaluate Hypotheses in Archaeology

ABSTRACT Archaeologists frequently use probability distributions and null hypothesis significance testing (NHST) to assess how well survey, excavation, or experimental data align with their hypotheses about the past. Bayesian inference is increasingly used as an alternative to NHST and, in archaeology, is most commonly applied to radiocarbon date estimation and chronology building. This article demonstrates that Bayesian statistics has broader applications. It begins by contrasting NHST and Bayesian statistical frameworks, before introducing and applying Bayes's theorem. In order to guide the reader through an elementary step-by-step Bayesian analysis, this article uses a fictional archaeological faunal assemblage from a single site. The fictional example is then expanded to demonstrate how Bayesian analyses can be applied to data with a range of properties, formally incorporating expert prior knowledge into the hypothesis evaluation process.

Palabras clave: estadística bayesiana, modelos de probabilidad frecuentista, prueba de hipótesis, arqueoestadística Many archaeologists are familiar with Bayesian statistics in the context of radiocarbon date calibration and chronology building. However, the Bayesian framework has broader applications beyond dating and chronology that are worthy of consideration by archaeologists. For example, many researchers in the natural and social sciences are using Bayesian statistics to evaluate how well observational or experimental data align with their hypotheses. For the most part, this use of Bayesian inference has not been applied to archaeology. Using a fictional zooarchaeological example, this article provides a straightforward explanation of Bayesian inference and compares it to the more conventional null hypothesis significance testing (NHST). Although some have previously described and reviewed the application of these concepts elsewhere (e.g., Buck and Meson 2015;Buck et al. 1996;Otárola-Castillo and Torquato 2018;Otárola-Castillo et al. 2022;Wolfhagen 2019Wolfhagen , 2020, this work is focused on presenting replicable step-by-step examples of the Bayesian framework for evaluating and discerning among competing hypotheses. R Markdown code to reproduce all materials presented here is available in an OpenScience Framework repository: https://osf.io/ 54f62/. In addition, a Spanish translation of this manuscript is available with this article's supplemental materials (Supplemental Text 1). Likewise, Spanish readers may find reproducible code at https://osf.io/23bdt/. standard deviation, and standard error, which rely on probability theory to describe this uncertainty. In their field and lab work, archaeologists regularly use equipment that relies on probabilistic descriptions of uncertainty. For example, the manufacturer of total stations, widely used to map archaeological sites, has stated accuracies of 2 mm plus an additional 2 mm per km, usually at the 1σ standard deviation level (e.g., Leica TS16). This is an example of a probability concept used to measure "random" uncertainty. In this case, assuming a "normal" probability distribution for the measurement error (although the manufacturer does not specify this), archaeologists should expect that 68% of the locations of artifacts mapped by this instrument will have an error up to ±2 mm, plus error related to increasing distance (and error due to atmospheric conditions, instrument stability, etc.; Walker and Awange 2020). Similarly, the manufacturer's specification sheet for a typical Ohaus (Scout STX2202) portable digital scale claims to measure up to 2,200 g, with an error of ±0.02 g (1σ). Like total stations, if we assume a normal error model, this means that the manufacturer certifies that 68% of all readings will be within ±0.02 g of the true reading under ideal circumstances.
Similarly, after careful data collection and analyses, archaeologists also apply the concept of probability to test their hypotheses. These are formal statements that offer plausible explanations of the observed patterns of people or their environment in the past. Like the statements about field and laboratory instrument measurements, these hypotheses and their predictions also possess some degree of uncertainty due to incomplete observation or knowledge. To formally quantify uncertainty about data and hypotheses, archaeologists frequently rely on specific probability models or probability functions (i.e., equations). The inputs of a probability function are observed or hypothesized values, and the outcomes are their probabilities ranging from 0 to 1-that is, from least to most probable. Archaeologists use this probabilistic system to test their hypotheses and describe the degree of uncertainty with which their hypotheses account for current and likely future observations. Using a probabilistic approach gives archaeologists a powerful and systematic tool that makes it possible to interpret data and evaluate hypotheses.
Below, we provide an overview of the central concepts of the two major probability paradigms to evaluate hypotheses: NHST and Bayesian inference. Whereas most scientists widely use NHST, the Bayesian approach is considered a modern data-driven learning system that has enjoyed increasing application to archaeology (Buck and Meson 2015;Buck et al. 1996;Howson and Urbach 2006;Jaynes 2003;Otárola-Castillo and Torquato 2018;Otárola-Castillo et al. 2022).

NULL HYPOTHESIS SIGNIFICANCE TESTING
As the prevailing statistical framework in most sciences, NHST enables practitioners to use their data to evaluate hypotheses. This approach is rooted in the early twentieth-century development of goodness of fit tests (Fisher 1922;Pearson 1900), experimental design, p-values (Fisher 1925(Fisher , 1935, confidence intervals (CIs), and hypothesis testing (Neyman and Pearson 1933:294). This methodology was introduced to archaeology in the mid-twentieth century (e.g., Binford 1964;Clarke 1968;Myers 1950;Spaulding 1953;Vescelius 1960). Applications of NHST in archaeology continue today, supported by new archaeology-specific statistical textbooks (e.g., Banning 2020;Baxter 2003;Carlson 2017;Drennan 2009;Fletcher and Lock 2005;McCall 2018;Shennan 1997;Thomas 1986). These textbooks provide detailed treatment of NHST and its procedures in the context of archaeology (for a multidisciplinary introductory textbook to NHST, see, for example, Diez et al. 2019).
In general, however, the NHST paradigm revolves around the concept of theoretically repeated sampling over the long term and the Central Limit Theorem (CLT; Diez et al. 2019:172). The CLT informs NHST's approach to hypothesis description and evaluation. The theorem shows that given a large enough sample, in many cases, the summary statistics (e.g., mean or standard deviation) will follow a normal distribution. For instance, after sampling the same population multiple times, the means of individual samples will be normally distributed. This distribution is known as the sampling or "null" distribution of the statistic. Because this phenomenon occurs often, even if the original variable was not normally distributed, this concept applies to many situations and data. The CLT further links sample statistics to their null distributions, such as the mean, through its "standard error." According to the CLT, the standard error of a sample's mean estimates the standard deviation of the mean's null distribution. One may compute this quantity by dividing the sample's standard deviation by the sample's size.
The CLT is helpful to archaeologists who often sample from a target population-a group of individuals, artifacts, events, measurements, or other phenomena that they wish to study. The aim is to use the sample to test a priori hypotheses about quantifiable characteristics of the sampled population. Statisticians refer to these characteristics as the population parameters. For example, a population's mean and standard deviation parameters represent its central tendency and variability, respectively. Sample statistics function as estimates of the population parameters and are therefore also known as the "parameter estimates." These statistics are used to test hypotheses about their respective population parameters. NHST requires archaeologists to state only two hypotheses: a "null" and an "alternative" hypothesis to evaluate. Null hypotheses are quantitative statements of "no difference" (difference = 0) between a hypothesized parameter value and its sample statistic, or between a sample statistic and its counterpart from another sample. Archaeologists often set up such null hypotheses to evaluate whether a sample statistic resulted from a population having the hypothesized parameter value (i.e., a onesample test). Alternatively, they may wish to know if the statistics from two independent samples were drawn from the same population (i.e., a two-sample test).
Alternative hypotheses are ordinarily simple statements negating the null hypothesis. Once archaeologists state the null and alternative hypotheses, they then sample the population, or "collect data," and calculate the sample statistics. We should point out that the NHST framework proceeds by assuming that the null hypothesis is true and then using the sample data, summarized by a statistic, to test that assumption. To do so, archaeologists use the sample statistic to define a "test statistic" (frequently the z-, t-, F-ratios, and χ 2 values; e.g., Diez et al. 2019;Drennan 2009;Thomas 1986) and calculate the probability that a value equal to or more extreme than the test statistic can occur under the assumption of the null hypothesis.
The probability of the test statistic, or p-value, is often calculated with the help of probability distribution models, like the normal distribution. These probability models are also known as "likelihood functions." The "likelihood" is a statistical function that describes the probability of the test statistic dependent on the hypothesized parameter values-for example, those assumed by the null hypothesis. For instance, as we show in the fictitious example below, the normal likelihood function is used to compute the p-value of a z-ratio test statistic, assuming the null hypothesis is true. Using similar probability models, archaeologists conduct NHST and calculate quantities such as p-values and confidence intervals (CIs) to evaluate whether the test statistic rejects or fails to reject the null hypothesis.
CIs are grounded in the CLT's null distribution concept. Archaeologists often compute CIs in two contexts: (1) to conduct NHST, they calculate the CIs of a test statistic; and (2) to estimate the precision of a parameter estimate, they compute the CIs of a sample statistic. Generally, the CIs of either the test or sample statistic are centered on their mean, represent their respective null distribution, and are derived using their sample's standard error. Recall that the standard error of either statistic is the standard deviation of its null distribution. For the sample statistic, this distribution represents the range of plausible values within which one may find the true value of the population parameters.
In the context of the test statistic, however, the CI is the range of possible values within which the true difference, assumed by the null hypothesis, will be found. In other words, due to the CLT, approximately 68% of the test statistic's null distribution will capture the true value of the difference, assumed to be 0 by the null hypothesis. Likewise, in the case of a sample statistic, 68% of its null distribution will contain the true value of the population parameter. Alternatively, one may wish less uncertainty than 68% for the sample or test statistic. In this case, one may compute ranges similar to the standard error that capture the true parameter or difference values 95%-99% of the time-again, after theoretically repeated sampling. These ranges are the CIs, and we refer to them in terms of their percentage-for example, as 95% or 99% CIs. In the context of NHST, archaeologists use the CIs of the test statistic to reject or fail to reject a null hypothesis. If the value of no difference, 0, is within the test statistic's CI, then the null hypothesis fails to be rejected. However, if 0 is not within the test statistic's CI range, the null hypothesis is not supported by the data and is rejected in favor of the alternative. We offer one last note about the mechanics of CIs. It may seem tempting to interpret the 95% CI as indicating that the true population parameter or difference has a 0.95 probability of being in the CI. Although somewhat confusing, however, the correct interpretation of the CI is that, based on repeated sampling over the long term, 95% of the CIs will contain the true population parameter or difference.
In addition to CIs, NHST uses p-values as an empirical signal of the plausibility of the test statistic, assuming the null hypothesis is true. Archaeologists compute p-values by calculating the proportion of values in the null distribution equal to and more extreme than the sample's test statistic. Typically, test statistic values with a p-value less than or equal to a proportion of 0.05 (1 out 20, or 5%) are considered extreme. Archaeologists commonly judge whether to reject or fail to reject the null hypothesis using a p-value of 0.05 as a cutoff for rejection: the more extreme the data, the smaller the p-value.
The broader scientific community has become increasingly critical of NHST (e.g., Gelman 2006Gelman , 2018Vidgen and Yasseri 2016). Statisticians and practitioners have strongly pointed out the arbitrariness of the 0.05 p-value threshold for statistical significance (Cowgill 1977:352;Valeggia and Fernández-Duque 2022;Wasserstein et al. 2019). Some argue that inadequate statistical training may lead researchers to misunderstand p-values (Hubbard 2011(Hubbard :2624McShane and Gal 2015). One consequence of not fully understanding the concept of p-values, for instance, is that some researchers confuse practical significance, or relevance, with statistical significance. In particular, it is possible for effects that are practically negligible, irrelevant, or uninteresting to result in small p-values (e.g., Aarts et al. 2012;Johnson 1999;Kramer et al. 2016;McCall 2018:90-93;Wolverton et al. 2016). In one case, while investigating the effects of sibling competition on the growth patterns of Maya children, Kramer and colleagues (2016) found that the effects of family size on child growth were statistically significant but "of little consequence to early childhood health or fitness." Here, interpreting the 0.05 p-value cutoff as demographically important would have led to incorrect conclusions.
In other cases, researchers have confused p-values for the Type-I error rate, α. The p-value is the probability that the test statistic may occur under the null hypothesis; α is the probability of rejecting the null hypothesis when it is true (Hubbard 2011). Historically, these two statistical quantities belong to competing NHST philosophies (Fisher 1925;Neyman and Pearson 1933). Neyman and Pearson developed the concept of Type-I error in the context of designing infinitely repeatable experiments, wherein α defines the probability that an analysis will fail to find a difference between two hypotheses when there is a genuine difference. Fisher's p-value, by contrast, empirically estimates if a specific set of observations fit a specified null hypothesis. These two quantities have completely different theoretical underpinnings and relationships to actual observations. For example, α is unrelated to observations, and the p-value is not influenced by the alternative hypotheses under consideration. Typical NHST practice, unfortunately, can lead researchers to directly associate the two concepts, complicating efforts to provide reasonable definitions and interpretations (Hubbard and Bayarri 2003). The misuse of p-values and statistical significance, due to either misunderstanding (e.g., Thiese et al. 2015) or intention (Chuard et al. 2019;Head et al. 2015), can lead to the so-called scientific replication crisis (Ioannidis 2005), which is beginning to reach archaeological science (Bayliss and Marshall 2019;Marwick 2017;McPherron et al. 2021).
Even accounting for these nuances, the interpretation of NHST concepts such as p-values, statistical significance, hypothesis testing, and CIs is not entirely straightforward. Statements about sample statistics-standard errors and CIs-are based on hypothetical repeated sampling, which is difficult to conceive of in nonexperimental situations or-as in archaeology-where true replication is hard or even impossible to achieve. In terms of evaluation, although most researchers might generally understand how to interpret a significant p-value in the context of rejecting a null hypothesis, the meaning of a nonsignificant p-value may cause confusion. This confusion might be exacerbated by the fact that there is no mechanism for "accepting" or "verifying" a null hypothesis. This critical misunderstanding of NHST may lead some to interpret a nonsignificant p-value as accepting their null hypothesis rather than failing to reject it (Greenland et al. 2016). However, knowledge production in the NHST paradigm is centered on rejecting null hypotheses rather than accepting the null or alternative hypotheses. To be fair, the NHST language is confusing. For example, stating that a null hypothesis failed to be rejected is a triple negative, meaning that "the hypothesis of no difference was not not accepted." Such convoluted language embedded in NHST obfuscates the relationship between the p-value, the null hypothesis, and the alternative hypotheses.
Moreover, the role of the alternative hypothesis and its connection to the p-value is also unclear and often incorrectly interpreted (Benjamin and Berger 2019;Cohen 1994). As a result, inference using traditional NHST statistics can be difficult, especially when a study wishes to discern among multiple working hypotheses (e.g., Chamberlin 1965Chamberlin [1890; Gelman et al. 2012), for example, when two or more hypotheses fail to be rejected. In theory, such hypotheses are consistent with the data. However, ranking multiple unrejected null hypotheses is difficult, if not impossible. One way to rank them may be to use the hypotheses' p-values. After all, the p-value is a continuous metric mediating hypothesis rejection and failure to reject. However, statisticians discourage this procedure (Hubbard and Lindsay 2008;McShane et al. 2019) because the magnitude of the p-value does not reflect the weight of evidence of one hypothesis over another. Consequently, traditional NHST does not offer a straightforward procedure for further comparing "unrejected" null hypotheses.

BAYESIAN STATISTICS
Bayesian inference offers an alternative approach with several advantages over NHST. First, Bayesian statistics enables scientists to use data to assign probabilities to their parameter estimates and hypotheses, facilitating a more straightforward comparison of competing hypotheses. Second, whereas NHST uses only new data to make inferences, a Bayesian framework allows both new data and existing information to be combined. As we detail below, this characteristic more closely resembles scientists' decisionmaking processes and is likely one of the key reasons that scientists, including anthropologists and archaeologists, are increasingly adopting Bayesian inference to evaluate their hypotheses.
Bayes's theorem derives its name from the Reverend Thomas Bayes (1763), an English Presbyterian minister and mathematician who researched problems in probability that involved conditional and prior probabilities (defined below). However, it was not until the late 1900s that the Bayesian approach to statistical inference was popularized in science (Bellhouse 2004). Although archaeologists notably began adopting Bayesian statistics to assess hypotheses in the 1990s (e.g., Buck et al. 1996;Cowgill 1993), earlier applications can be found scattered throughout the archaeological literature beginning in the 1970s (Doran et al. 1975;Fisher 1987;Freeman 1976;Salmon 1982:51-55;Thomas 1986). Today, scientists, including anthropologists and archaeologists who find this approach advantageous, are increasingly applying Bayesian statistics to evaluate their hypotheses with data (Gelman et al. 2020;McElreath 2020;Naylor and Smith 1988;Otárola-Castillo and Torquato 2018).
One advantage of Bayesian inference is that it enables expert, or "prior," information about hypotheses to be incorporated into statistical analyses. As we show in our example below, the prior knowledge of an archaeologist or collection of archaeologists and other experts can be very valuable because "we depend very much on prior information to help us in evaluating the degree of plausibility in a new problem" (Jaynes 2003:6). Formally including previous experience or expert information into statistical analyses to "update" one's state of knowledge is a natural learning process and improves the inferences made by NHST (Cowgill 2001). To accomplish this, practitioners of Bayesian inference convert prior knowledge into "prior probabilities" and use them and their distributions as part of statistical analyses. Once analysts determine their prior probability distributions, as with NHST, they can observe new data to test their hypothesis (or hypotheses). In this context, the likelihood of the data is combined with (or weighted by) the prior to give the Bayesian "posterior probability." The posterior is the probability of the hypothesis given the observed data's likelihood and prior knowledge (Buck et al. 1996). As we discuss in more detail below, the Bayesian process is particularly helpful in situations where only small amounts of data are obtained, as is often the case in archaeology.
In simple cases, determining the posterior and its distribution is relatively straightforward. However, the calculus underlying more complex cases is impossible to solve without the application of novel simulation methods. In particular, the Markov Chain Monte Carlo (MCMC) algorithms have facilitated progress in Bayesian analyses. MCMC simulation is a combination of Monte Carlo sampling and Markov Chains. Monte Carlo sampling is used to estimate difficult-to-compute quantities from the unknown distribution of an observed random variable. Markov Chains are a stochastic series of events associated with one another, where the probability of a new event is dependent only on the state of the last event. Together, these characteristics of Monte Carlo sampling and Markov Chains are essential to find the posterior probability distribution of complex problems. Today, variations on the original MCMC algorithm (Metropolis et al. 1953)-such as the Metropolis-Hastings, Gibbs, Hamiltonian, and other methodsare now in widespread use, facilitating broad application of the Bayesian paradigm (e.g., Dunson and Johndrow 2020;Gilks et al. 1996;Howson and Urbach 2006:xi;Robert and Casella 2011).
To further contextualize the application of Bayesian statistics, we provide a fictional example that illustrates how one can use this probabilistic framework to solve an idealized archaeological research problem. To do this, we choose to use a parable 1 rather than a real case study in order to avoid the complexities of site formation processes and sampling bias. The contrived, fictional example in this parable also helps focus attention on specific aspects of Bayesian inference, which we feel are most instructive. The parable of the "Monico Culture and the Bayesian Archaeologist" demonstrates how inferences can be made using data and prior information about a hypothesis, how to evaluate the uncertainty surrounding a hypothesis, why this approach seems less ambiguous than NHST, and consequently, why it is becoming increasingly popular.

THE MONICO CULTURE: A SIMPLIFIED APPLICATION OF BAYESIAN STATISTICS
The "Monico culture" is a fictitious group of people who might have lived between the ethnographic present and long ago across multiple environmental settings and sociocultural contexts worldwide. The imaginary archaeological record of the Monico is well known. In general, their material culture reflects patterns of foraging, farming, and pastoralist economies. Monico sociocultural dynamics are broad. They range from egalitarian practices exhibited at highly mobile camps to a greater social complexity derived from more permanent settlements. Some Monico experts argue that later Monico settlements show evidence of intensive food production, trade of exotic goods, and a highly centralized political organization administered by an increasingly hierarchical elite (Figure 1).
A famous Bayesian archaeologist, an authority on the Monico, has excavated a postcontact period site associated with this culture. Excavation work at the site, named Monico-1, has yielded an impressive faunal assemblage among the widely diverse material culture. The archaeofauna is composed of two species of animals: "dog" and "coyote." Individual animals of both species are represented by complete skeletons. Consequently, in this report, the archaeologist uses the term "individual" to refer to complete dogs or coyotes. Likewise, when mentioning "the number of" dogs or coyotes, the archaeologist means a count of complete individuals of the respective species. So far, the archaeologist has identified 100 such individuals and assigned them to their respective species. Based on the observations, the assemblage is composed of 71 dogs and 29 coyotes ( Figure 2). However, the archaeologist has also excavated a bone fragment that is difficult to identify. The archaeologist wishes to know the most probable species to which this fragment belongs.
The archaeologist defines "probability" as the relative frequency or proportion of times that an event occurs. On the basis of the data alone, the probability, P, of dog remains in the assemblage is whereas the probability of coyote remains is Given these probabilities, it is reasonable for the archaeologist to believe that the unidentifiable bone specimen is more likely to be from a dog. However, the archaeologist is skeptical. Moreover, as a  Oral histories passed down over generations have documented that dogs were thought to be a close sibling of people. Notably, the Monico culture is known to have had taboos against killing or eating dogs. However, oral histories have also revealed that the Monico did eat dogs during times of severe food scarcity. With this additional or "prior" information, the archaeologist decides to observe the skeletons more closely to check for the presence of butchery marks (i.e., cut marks) on the dog remains. The archaeologist tabulates this additional information on the recovered bones under two butchery conditions: (1) butchery marks are present, and (2) butchery marks are absent. Table 1 shows the frequencies of butchery marks on the skeletons of each species.
To convert these data into a probability table, the archaeologist standardizes (or divides) all of the values by the total number of observations (100 in this case). The inner cells (dark font, light shading) in Table 2 provide the probabilities of butchery marks and species occurring together, or jointly, which are therefore known as "joint probabilities." The values in the right and bottom margins of Table 2 are suitably named "marginal probabilities." These represent the presence and absence of butchery marks (on the right) and the species identified (bottom). The marginal totals are the total probabilities of each subsetted space (species or butchery mark). By definition, all probabilities lie in the range of 0 to 1, and the total sum of the marginal rows or columns (i.e., the sum over all marginal outcomes) must be 1.
At this point, the archaeologist focuses on the unidentifiable bone specimen and finds several butchery marks on it. The archaeologist can use this additional information to gain an inferential advantage by accounting for, or conditioning on, the presence of butchery marks-a process called "conditioning." The archaeologist conditions the species identified on the presence or absence of butchery marks. This procedure is otherwise known as subsetting or stratifying the variable "species identified" by the presence or absence of butchery marks.
Naturally, the archaeologist asks, "What is the probability that the unidentifiable bone specimen is from a dog compared to the probability that it is from a coyote, given that butchery marks are present on the bones of the individual?" The archaeologist observed 32 animals from Monico-1 with butchery marks present.
Of those, butchery marks were present on nine dogs and 23 coyotes. The archaeologist can therefore calculate the probabilities of the individual belonging to one species or the other, given that butchery marks are present (statisticians use the "|" symbol to mean "given that" and to signify that conditioning is taking place). Therefore, after observing butchery marks on the individual (unidentified) bone, the archaeologist can state that the probability is 0.72 that it came from a coyote. In other words, the archaeologist is 72% certain that the bone came from a coyote.
A few days later, a local newspaper reporter became aware of an ongoing archaeological excavation at another Monico village site nearby, named Monico-2. Sources reveal to the reporter that the excavators there are also recovering faunal remains. Because the archaeologist is a well-known expert on the Monico's eating habits, the reporter contacts the archaeologist and communicates the fact that the new faunal assemblage at Monico-2 is wholly composed of remains from dog species.
Even though the investigators at Monico-2 have not yet conducted a thorough faunal analysis, the reporter asks the archaeologist how likely it is that the Monico were butchering and eating dogs at the new site. By now, the archaeologist has estimated the probabilities of finding butchery marks associated with each animal species based on experience at the Monico-1 village. To make a probabilistic inference about behavior at the new site, the archaeologist conditions on the "species identified" instead of on the "presence of butchery marks." Out of the 71 dogs identified at Monico-1, the archaeologist observed nine with butchery marks and 62 without. This means that, based on the evidence from Monico-1, the probability of finding evidence of butchery on dogs is P(Butchery mark present | Dog) = 9 71 = 0.13, whereas the probability of no butchery evidence on dogs is P(Butchery mark absent | Dog) = 62 71 = 0.87.
After a moment's thought, the archaeologist tells the reporter that (based on knowledge from Monico-1) the probability of the dog bones from Monico-2 having resulted from dietary activities is relatively low, at around 13%. This calculation draws on Bayes's theorem, as well as on the information regarding the Monico's relationship with their dogs and the butchery practices at Monico-1.

What Is Bayes's Theorem?
Bayes's theorem is the algebraic formalization of the probabilistic table work that we conducted in the previous section using a discrete event. The theorem is most useful when a conditional probability statement is known, and one wishes to obtain its inverse conditional statement. For example, from the previous model, we know that P(Butchery mark present | Dog) = 0.13. If we wish to know the inverse conditional statement-P(Dog | Butchery mark present)-we can calculate it using P(Dog | Butchery mark present) = P(Butchery mark present | Dog) × P(Dog) P(Butchery mark present) .
Tables 1 and 2 provide the necessary values to plug into this expression so that   where the joint probability is divided by the marginal P(B). Statisticians call P(A|B) the posterior probability of A given B; P(B|A), the inverse conditional (or likelihood) of B given A; and P(A), the prior probability of A.

The Bayesian Archaeologist: Continued
After a few days, the reporter acquires more information from the continued excavations at the Monico-2 village. The frequencies and joint probabilities are described in Tables 3 and 4. The reporter is quite excited to inform the archaeologist that excavators had recovered 10 dogs, all but one of which had butchery marks on them. By contrast, the archaeologists at the Monico-2 site had recovered only one coyote that exhibited butchery marks on the remains. The researchers at Monico-2 used an appropriate NHST test statistic, the one-sided z-test for proportions (Diez et al. 2019:194-197), with continuity correction, to test whether the observed dog butchery rate (9/10) was statistically significantly greater than 50%-the default null hypothesis in this test. The Monico-2 archaeologists rejected the null hypothesis with a p-value <0.05 (z-ratio = 2.21, mean = 5, sdev = 1.58, p = 0.013).
Because of the small sample size, they also conducted a onesided binomial test, which yielded results in line with the z-test results (successes = 9, trials = 10, p = 0.01074). Based on these statistically significant results, the Monico-2 archaeologists told the reporter that the majority of dogs were butchered at the site. Moreover, according to the reporter, the archaeologists also suggested that the evidence and results of the statistical analysis indicated that the people at the Monico-2 village included dogs as an important part of their diet. In light of this evidence, the reporter begins to question the ethnographic record on the dietary taboos of the Monico.
The archaeologist at Monico-1 has a quick look at the tables, does a few calculations, and maintains that the probability of the Monico-2 villagers having butchered their dogs is now even lower, especially compared to the new probability of coyote butchery, which is slightly higher. The archaeologist insists on waiting for a larger sample before drawing firm conclusions, however. Incredulous, the reporter asks for an explanation as to why the archaeologist questions the significant null hypothesis tests conducted by the Monico-2 archaeologists. The archaeologist looks at the reporter and says, "Well, NHST procedures like the z-test only consider new data. These methods, unfortunately, do not account for all available information, new and prior, about Monico subsistence. Personally," the archaeologist continues, "I try not to form my opinions based solely on new data. Rather, I use new data to update my existing opinions made using prior knowledge, for example, from the Monico-1 site." The archaeologist then walks the reporter through the tables and begins to explain how they do their inference using Bayes's theorem.
The archaeologist explains that the posterior probabilities of dog and coyote butchery drawn from the (much larger) Monico-1 faunal assemblage have become new "prior" information on the probabilities that Monico villagers butchered dogs and coyotes. These quantities can be represented by   In this case, after updating the data, the new posterior probability of coyote butchery is also higher (changing even more from the prior probability than in the case of dogs). The archaeologist explains this to the reporter. Furthermore, the archaeologist urges caution given that the data and resulting probabilities from the original site were derived from a sample of 100 individuals, whereas the current selection represents a total of only 11. Although the probability calculations are correct, it would be prudent to wait for more data, given that the excavation at Monico-2 is ongoing. However, the archaeologist's Bayesian analysis suggests that, at this point, we should not expect butchery marks on any newly discovered dogs at the Monico-2 site.

LINKING BAYES'S THEOREM TO DATA AND HYPOTHESES
The Monico case study provides a tangible example of the different components of a Bayesian analysis, including estimating an event's probability and the probability of one event given another (using currently available data), along with the key concepts of likelihood, prior and posterior probabilities, and how to update one's knowledge using the previous Bayesian posterior as the new prior. Although the procedure exemplified here is specific to archaeological count data, Bayes's theorem is very general and can be useful for a wide variety of data and data-generating processes. This section generalizes Bayes's theorem to a variety of other scientific scenarios.
We stated earlier that Bayesian scientists use the data in hand (D) to assign probabilities to statements or hypotheses (H ) about a population. The statement P(H|D)-that is, the probability of the hypothesis given the data-formalizes this relationship. In our example of the Monico sites, the archaeologist was trying to calculate the probability that the Monico people butchered dogs and coyotes (the hypotheses) given the number of cut marks on their bones (the data in hand). To operationalize this statement in the context of data and hypotheses, Bayes's theorem functions as follows: where P(H|D) is the posterior probability of the hypothesis given the data, P(D|H ) is the probability of the data given the hypothesis (or the likelihood) of the observed data, P(H) is the prior probability of the hypothesis (before the data were collected), and P(D) is the probability of the data in hand (out of all possible values of the data). Alternatively, generalizing and using more modern statistical vernacular, this operation can be expressed as In this manner, Bayesian statistics offers an alternative statistical framework for updating and evaluating hypotheses through a mechanism that obtains a posteriori information about the posterior of interest based on the data, a statistical model (expressed as a likelihood), and appropriately formulated prior information. In other words, with an explicit statement of our prior information, a clearly defined statistical model, and a desire to update our understanding, Bayes's theorem provides us with a probabilistic framework for making interpretations.
In addition to the coherent and explicit nature of the framework, there is another attractive feature of the Bayesian paradigmnamely, that it allows us to learn from experience. Priors enable the explicit contextualization of previous knowledge or beliefs about the topic under investigation (Buck et al. 1996;Cowgill 1993). Using previous knowledge should be a natural tendency for archaeologists. As Buck and colleagues (1996) discuss, archaeologists apply previous knowledge often, for example, when inferring the function of newly discovered artifacts by using their association to artifacts and features that have already been discovered. Similarly, the archaeologist in our example was able to contextualize the data from the Monico-2 site based on Monico-1 observations. Few other interpretive frameworks offer a clear structure for updating beliefs in light of new information, and yet this is such an important part of most intuitive approaches to learning about the world in which we live. Moreover, today's posterior information (based on current data and prior information) is in a suitable form to become the prior for further work if and when more data become available.

From Inferences about Discrete Points to Data Distributions
Thus far, the example has shown how Bayesian inference can be applied to hypotheses defined by statements about discrete events. In the fictitious example above, the hypotheses were represented by statements about whether the observed faunal remains were the result of butchery. The observed data assigned probabilities to each hypothesis, thereby indicating the amount or degree of belief in the hypothesis. These data were discrete events from only two sites. Yet, in reality, although the population of the proportion of butchered dog bones is the outcome of the same behavioral process (butchery), the observed values are likely to vary from site to site.
Consequently, many archaeologists might wish to compare their single-site data to the universe of known sites. In this case, the hypotheses to be evaluated are characterized by the values of a probability model's parameters. Although we mentioned this earlier, at this point, it is worth recalling that such parameters describe certain characteristics of a sample or population. For archaeologists, the most common parameters are those that measure central tendency, such as the mean or median. Bayesian inference can be conducted using other parameters, as well as the full distribution of the posterior, data and prior information. These are usually represented by probability models. Likely the most well-known such model is the normal probability model, in which the probability distribution has a symmetrical, bell shape around a single mean value. When (sample) data and associated models of probability are involved, it is conventional to use the Roman symbol x to represent the observed (or sample) data, and the Greek symbol θ (theta) to represent the parameter (or multiple set of parameters) of the model of the population that we are trying to learn about. Given x and a model with parameter(s) θ, we can re-couch Bayes's theorem and its three components-the "likelihood," the "prior," and the "posterior"-in the context of data distributions and their probability models.
The "likelihood" is a statistical function, or a mathematical expression, that associates individual data quantities with their respective probability values. Its form is determined by the specific probability model being used, but in general terms, it is represented by P(x|θ)-that is, the probability distribution of newly observed data conditioned on the parameter(s). Consequently, the "likelihood" is the probability of observing particular data values given some specific (or hypothesized) values of the unknown parameters. Therefore, this is a formal statement of the relationship between the parameters about which we want to learn and the data we collect.
The "prior" is also a function and can be represented by P(θ). It is a statement of what we know about the probability distribution of the parameter(s) before new data are collected. In simple terms, we can think of this as the probability we attach to observing specified values of the unknown parameters based on what we knew before we observed the data. This is a formal statement of our knowledge prior to collecting the latest data.
The "posterior" is what we want to obtain: a combination of the information contained in the new data, the likelihood, and the prior. The posterior is represented by P(θ|x). As presented in the previous section, this is the probability of the hypothesis given the data, or P(H|D). It is the probability distribution of the model's parameter(s) conditioned on the data. In simple terms, we can think of this as the probability we attach to specified or hypothetical values of the unknown parameters after observing new data. In this context, we can express Bayes's theorem as The Bayesian Archaeologist and the Uncertainty of Hypotheses As described above, the Bayesian inference about Monico-2 given to the reporter was based only on the new Monico-2 data and the archaeologist's prior expert experience with Monico-1. However, if the archaeologist wants to give the reporter the best possible estimate, they could use all available evidence, including the Monico-2 data, their expert knowledge, and information from other archaeological sites. To do this, the archaeologist reviews the published literature and identifies additional information on the proportion of dogs with butchery marks recovered from 38 previously excavated Monico sites. The archaeologist then seeks to investigate the variability of dog butchery behavior as evidenced by the proportion of dogs with butchery marks at each Monico site, with a view to obtaining a probabilistic prior statement about the theta parameter, θ (the proportion of dogs with butchery marks). Table 5 illustrates the distribution of θ values across the frequency and proportions of sites. The table shows that out of the 38 sites, 20 reported having between 0% and 5% of dogs showing evidence of butchery marks. Twelve sites have between 6% and 15% of dogs showing evidence of butchery marks, whereas another four sites report values for θ between 16% and 35%. Meanwhile, another two archaeological sites report that θ ranges from 36% to 75%. There are no sites with more than 75% of dog remains showing evidence of butchery.
To begin, the archaeologist speaks with other experts about nutrition, the archaeology of food, and Monico archaeology and ethnography. Based on personal scientific knowledge, the archaeologist hypothesizes that to consider dogs as having made a substantial food contribution at a Monico site, there would need to be evidence of butchery marks on at least 50% of individual dogs. "So," the archaeologist thinks, "my first hypothesis, H 1 , is that the value of θ should be at least 50%, or 0.5, for any specific Monico site. What is the probability of this hypothesis being correct for Monico-2 based on the data I have and my prior knowledge?" The Monico-2 site sample indicated that, out of 10 individual dogs, nine had butchery marks on them (so, θ = 0.9). The archaeologist wants to use prior knowledge, including the information from the literature review, to understand the variability of θ at Monico village sites.
The archaeologist first records the dog butchery proportions (θ) from the 38 sites found in the literature. To summarize these data, in Table 5 (column 1), the archaeologist groups the θ values into equal intervals in increments of 0.10 (10%, except the first interval, which is smaller). They also record the number of sites reporting θ values in each interval (column 2) and then calculates the prior probability of each θ interval by dividing the number in each cell of column 2 of Table 5 by the total number of sites (i.e., 38). In this way, the third column of Table 5 reports the proportion of sites within each θ interval. This frequency distribution also serves as the prior distribution of θ values.
The archaeologist then plots the distribution of the proportion of dogs butchered at Monico sites (Table 5) in order to visualize the resulting prior knowledge that can be derived from this dataset (Figure 3).
Recall that, in the Bayesian framework, one needs the likelihood (P(x|θ)), the probability of the data (P(x)), and the prior probability of the hypothesis (P(θ)) to compute the posterior probability of the hypothesis that θ > 0.50, given the data (P(θ>0.5|x)). Figure 3 illustrates the prior probability, P(θ), for different θ values.
Note that in contrast to the single-event values in the previous examples above, the components of Bayes's theorem in this case are distributions of values. Applying Bayesian statistics in such situations provides a particular advantage because the framework enables archaeologists to evaluate the probability of a hypothesis and the associated uncertainty. Consequently, to continue with the Bayesian analysis of the Monico-2 data in light of the prior knowledge from the 38 sites (represented in Figure 3), the archaeologist needs a model to represent the probability of the data, x, and associated parameter(s), θ, in order to compute (1) the likelihood, P(x|θ), and (2) the probability of the data, P(x).

The Likelihood
To compute the probability of the Monico-2 data given the hypothesis, the archaeologist needs a function that can represent the likelihood, P(x|θ), of these data, x; given the parameter of interest, θ. Archaeologists frequently employ a probability function termed the "binomial model" to calculate the likelihood of data composed of binary observations-that is, observations expressed as 1/0, yes/no, true/false, or present/absent. In this case, the binomial model is appropriate for observations indicating the presence or absence of butchery marks on individual dog skeletons, as in the Monico-2 data. For this reason, the archaeologist wants to compute the likelihood that nine out of 10 dog skeletons from this site exhibited butchery marks on them.
Mathematically, the binomial model is expressed by: The symbols k and N represent the data: k is the number of dogs observed with butchery marks, whereas N is the total dogs observed. The model's parameter, θ, in this example, represents the proportion of dogs with butchery marks out of all dogs observed at Monico-2.
The archaeologist uses the parameter estimation method called "maximum likelihood" (ML) to determine the most likely value of θ that would have generated the data. ML asks, under the binomial model, which value of θ is most likely to lead to the data observed. In this case, the archaeologist's binomial data are k = 9 dogs with butchery marks and N = 10 total dogs. ML evaluates which value of the θ parameter maximizes P(x|θ), the likelihood, over a systematic range of quantities between 0 and 1.
To estimate the most likely value of θ, the archaeologist assumes that the probability of observing each butchered dog is independent of the others, making the probability of observing nine butchered dogs θ 9 . Conversely, the probability of observing a single unbutchered dog is (1 − θ) (10−9) , and the probability of both nine butchered dogs and one unbutchered dog occurring is θ 9 × (1 − θ) (10−9) . However, to compute the likelihood of the data, the archaeologist also needs to account for the number of different ways that the nine observations of dogs with butchery marks, k, can occur in the sequence of 10 dog observations, N.  Table 5). Note that small values of θ have a higher prior probability than larger ones.
Once N k has been computed, the archaeologist may continue to estimate the likelihood value of a given quantity of θ by calculating: P(x|u) = 10 × u 9 × (1 − u) (10−9) across the range of θ values from 0 to 1 to find the likelihood distribution of the data and, therefore, the value of θ that maximizes the likelihood function. This approach is illustrated in Figure 4, from which the archaeologist learns that the ML estimate of θ (given the Monico-2 data) is 0.9; in other words, the observations at Monico-2 are most likely if the proportion of dogs butchered across Monico-2 (θ) is also 0.9 (or 90%).

The Prior
Much like using the binomial probability model to obtain the likelihood distribution of the Monico-2 data, the archaeologist can use another probability model to express P(θ), the probability distribution of θ, also known as the prior. In this case, the archaeologist needs a probability function that models the distribution of θ, the proportion of dogs with butchery marks, across the 38 sites observed before the excavation of Monico-2. Statisticians frequently use the beta probability function to model the distribution of proportions such as θ. The mathematical expression of the beta model is: The shape of the beta model is therefore controlled by two parameters, a and b, which in turn control key summary statistics such as the model's mean and variance. Unlike with the likelihood model, the archaeologist in this case wants to find a distribution of θ that quantitatively describes their prior knowledge. To do this, the beta parameters can be adjusted to fit the shape of the prior data distribution in Figure 3. Through a visual best fit, the archaeologist estimates that the values a = 1.5 and b = 16 result in a probability distribution that resembles that of the prior knowledge about θ (i.e., the shape shown in Figure 3). Consequently,

The Posterior
The archaeologist is aware that statisticians frequently use the binomial and beta distributions in the context of Bayesian analyses because they work well together for modeling the likelihood and prior probability distributions, respectively, simplifying the calculations needed to compute the posterior. Such convenient pairs of probability models are known as "conjugates." As a result of the modeling choices made, the archaeologist may algebraically combine the binomial likelihood data with the parameters of the beta prior distribution to produce a posterior beta distribution represented by: They therefore generate values of P(x|θ), which is the likelihood, and P (θ), which is the prior probabilities, to calculate P(θ|x), which is the posterior probability distribution, across a fine grid of θ values in the [0, 1] interval (1,000 values between 0 and 1). These are illustrated in Figure 6.
The archaeologist then focuses on P(θ|x), the posterior distribution. The posterior will help them to make inferences about the probability of θ and its associated uncertainty ( Figure 6). The archaeologist can visually represent the estimate of θ (the expected proportion of dogs with butchery marks at Monico-2, based on the observed data and prior knowledge from the 38 other Monico archaeological sites) and the 90% uncertainty range of its estimate with a graph in Figure 7.
Unlike the NHST framework, the Bayesian posterior probability enables the archaeologist to assign probabilities to hypotheses about parameter values. In this case, the hypothesis is that the value of θ-the proportion of dogs butchered at Monico-2-is greater than 0.5 (50%; Table 6). The values shown in Table 6 are inferences resulting from calculations made using the posterior distribution. The archaeologist computed the probability that θ is greater than 0.5 (top-left value in the table), and the values of θ at the fifth, fiftieth, and ninety-fifth probability percentiles. Recall that earlier, the archaeologist in conjunction with other scientists proposed that cut marks would need to appear on at least 50% (or 0.5) of the dog remains at a Monico site in order to consider dogs "an important food contribution." However, Table 6 shows that the value of θ only has a 10% chance of being greater than 50%. Therefore, the inference that dogs were a substantial part of the Monico diet at Monico-2 is not highly probable. For example, the archaeologist thinks, "If a meteorologist told me that there was a 10% chance of rain today, I would not carry an umbrella." Importantly, the uncertainty around the value of θ can also be expressed as a probability interval. In the Bayesian framework, these probability intervals are known as the "highest probability density intervals" (HPDIs), and they differ from NHST's CIs. One of the most important differences is that the interpretation of the HPDI is much more straightforward. The HPDI is the probability of FIGURE 5. Standardized beta probability model, with parameters a = 1.5 and b = 16, representing the archaeologist's prior probabilities depicted in Figure 3. Note the similarity to Figure 3 in terms of shape, and in particular, the location of the mode and range of values.
Beyond Chronology, Using Bayesian Inference to Evaluate Hypotheses in Archaeology November 2022 | Advances in Archaeological Practice | A Journal of the Society for American Archaeology the parameter given the data, whereas, as we described earlier, the CI is not a probability about the value of the parameter estimate. In the case of θ, Figure 7 tells the archaeologist that there is quite a lot of uncertainty around the true value of θ. For example, the median, or fiftieth percentile, estimate of θ is 0.38-meaning that once the available prior information from the literature and the Monico-2 data are incorporated, it is most likely that the Monico-2 occupants included dogs in their diet 38% of the time.  Erik R. Otárola-Castillo et al.
However, the 90% HPDI spans 0.23 to 0.53 (23% to 53%), which means that, based on our prior information and current data, there is a 90% chance that θ is between these values, and only a 10% chance that it is larger or smaller than these limits. Although the variation in θ reaches over 50%, it does so only slightly and, again, it is not very probable. These results mean that the archaeologist is very uncertain about the occupants' proclivity to butcher dogs (presumably) for dietary purposes at Monico-2, especially considering the small sample size and the fact that the current Monico-2 data differ quite markedly from those found at other sites.

CONCLUSION
Bayesian inference has advantages for archaeologists that extend well beyond the realm of radiocarbon calibration and chronological modeling. The NHST framework has served archaeologists well for many years, but it has limitations. Unfortunately, NHST bases inference on new data alone due to its inherent structure. Its language and assumptions can be convoluted and confusing, and the approach cannot be used to compare multiple working hypotheses directly. Bayesian inference overcomes many of these problems for archaeologists. In many ways, archaeologists often think through problems using a Bayesian framework without knowing they are doing so and without using a formal probabilistic framework. Like the Bayesian archaeologist in our parable, most archaeologists do not form inferences about the past using new data isolated from the existing body of knowledge. Instead, we continually update our prior knowledge with new evidence to make decisions, form opinions, and generate conclusions. The advantage of Bayesian inference over NHST is that it affords archaeologists (1) a more natural toolkit with which to learn from data; (2) straightforward language to make hypotheses quantifiable, explicit, and transparent; and (3) the ability to use probability for comparing multiple hypotheses and conducting further evaluation.
Consequently, the Bayesian approach represents a paradigm shift in archaeological inference. Bayesian statistics offers a coherent inferential framework that explicitly outlines the way in which one's prior information is updated with new data to produce the current state of knowledge. The process helps to evaluate the degree to which current and new evidence support hypotheses. This may be conducted iteratively until there is a desirable amount of confidence (or lack thereof) in the accuracy of a hypothesis. In this context, the Bayesian framework resembles a learning process not unlike scientific investigation. For example, archaeologists continually update their knowledge and degree of belief in hypotheses using new information gathered through multiple data collection methods, including excavation, survey, experimental, laboratory, and other analytical activities.
An increasing number of archaeologists are using Bayesian statistics to calibrate radiocarbon dates, build chronologies, and evaluate their hypotheses about the past. The popularity of chronology-related Bayesian software has made Bayesian inference in that context a simple operation-meaning that most users will find the software easy to operate without a basic understanding of the logic of Bayesian inference and its three fundamental components: the likelihood, the prior, and the posterior. Moreover, without such fundamental understanding, the analytical power of Bayesian statistics, beyond chronology construction, may not be obvious, therefore slowing rather than enhancing more general adoption.
To mitigate this problem, this article highlights how archaeologists may use Bayesian inference to approach complex questions through a simple fictional example. This approach allows archaeologists to evaluate, compare, and update their hypotheses directly, using the weight of evidence and a straightforward process. We consider this one of the most significant impacts of the Bayesian paradigm. In addition, Bayesian inference requires archaeologists to become cognizant of and transparent about prior and current information for statistical analyses within a probabilistic structure. The framework explicitly incorporates all information (prior and current) to enable a more comprehensive understanding of a problem.
As a result, applications of this method are conducive to replication, allowing them to be improved upon by other archaeological scientists. In this light, Bayesian inference dovetails with ongoing efforts to promote open science methods and open data in archaeological research. This context encourages researchers to outline the entire logical process that underlies their results. Due to its advantages, we believe that Bayesian inference is well positioned to become a standard approach to evaluating quantitative hypotheses in archaeology.