## 1 Introduction

Bayes’ theorem offers a normative account of how beliefs should be updated in light of new data. According to it, the probability of a belief or hypothesis *H* conditional on data *D* is:

*P*(*H*|*D*) = *P*(*D*|*H*)*P*(*H*) / *P*(*D*)

where the likelihood *P*(*D*|*H*) is the probability of the data given the hypothesis *H*, and the prior *P*(*H*) reflects the degree of belief in the hypothesis before seeing the data. Across a wide variety of domains, Bayesian models have emerged as a powerful tool for understanding human cognition. One useful aspect of such models is that they provide a normative standard against which human cognition and decision making can be compared. This approach has been applied successfully in domains including concept learning (Kemp, 2012; Sanborn et al., 2010), causal inference (Lucas & Griffiths, 2010), motor control (Wolpert, 2009), and perception (Vincent, 2015).

Despite the success of the Bayesian approach, and though people in the aggregate sometimes appear to behave qualitatively in accordance with Bayesian reasoning (Griffiths et al., 2010), there is strong evidence that *individuals* usually do not. People tend not to update their beliefs in accordance with Bayes’ theorem, either underweighting the prior (Kahneman & Tversky, 1973) or the likelihood (Phillips & Edwards, 1966) or both (Benjamin et al., 2019). *Base rate neglect* occurs when people discount information about prior probabilities when updating their beliefs. It has been replicated in field settings and in hypothetical scenarios (e.g., Bar-Hillel, 1980; Kahneman & Tversky, 1973; Kennedy et al., 1997) as well as in lab experiments such as sampling balls from urns (e.g., Griffin & Tversky, 1992; Grether, 1980). Interestingly, in addition to underweighting the prior, people also often underweight the likelihood: that is, they fail to update their beliefs as strongly as Bayes’ theorem predicts. This phenomenon, known as *conservatism*, has also been widely replicated across a variety of situations (Corner et al., 2010; Grether, 1992; Hammerton, 1973; Holt & Smith, 2009; Peterson & Miller, 1965; Phillips & Edwards, 1966; Slovic & Lichtenstein, 1971).

To some extent, base rate neglect and conservatism cannot easily be separated. Given a point prior hypothesis and a single data point, it is in fact mathematically impossible to identify whether the prior or the likelihood is responsible for a particular pattern of inference: a weaker inference than expected could reflect either conservative updating or stronger priors than were assumed, while a stronger inference than expected could reflect either weaker priors or overweighting of the likelihood. Most research exploring base rate neglect and conservatism does not disentangle the effects of priors and likelihoods, and those studies that do disentangle them focus on aggregate behaviour (for a review see Benjamin et al., 2019). As a result, little is known about how conservatism and base rate neglect co-occur within the same individual. More problematically, as Mandel (2014) points out, people’s priors are typically not measured at all; it is instead assumed that they correspond to the given base rate. However, if they do not — for instance, if participants are suspicious about the accuracy of the base rate or represent it with some fuzziness in memory — this could look like conservative updating or base rate neglect when it is not.

Even those studies that explicitly measure people’s priors are somewhat lacking, since virtually all of them elicit priors (and posteriors) as point estimates rather than as full distributions (for overviews and discussion see, e.g., Benjamin et al., 2019; Mandel, 2014; Wallsten & Budescu, 1983). This matters because, as illustrated in Figure 1, distributional shape plays an important role in belief updating: even perfect Bayesian reasoners whose priors have the exact same expected value may draw different conclusions if their priors have different distributional shapes. Thus, determining whether people update their beliefs in accordance with Bayes’ theorem depends heavily on obtaining an accurate measure of the full distribution of prior beliefs.

Of course, this is only relevant if people actually *do* represent probabilities as distributions, at least implicitly. It is generally assumed that this is the case, as described by Wallsten and Budescu (1983) when discussing the measurement of subjective probabilities: “Upon being asked to evaluate the probability of an outcome, a person will search his or her memory for relevant knowledge, combine it with the information at hand, and (presumably) provide the best judgment possible…If the same situation were replicated a large number of times, and if the person had no memory of his or her previous judgments, the encoded probabilities, *X*, would give rise to a distribution for that particular individual” (p. 153). This view reflects considerable (if often implicit) agreement; even those who suggest that people make specific inferences on the basis of samples rather than full distributions assume that the underlying representation from which the samples are generated is a distribution (e.g., Lieder & Griffiths, 2020; Mozer et al., 2008; Vul et al., 2009; Vul & Pashler, 2008). If people do represent probabilities as distributions, even if only implicitly, then their cognitive processes can be described adequately only by eliciting probability distributions. Indeed, there is a rich literature on how best to elicit and measure full belief distributions (for a review see Schlag et al., 2015). This literature, developed in applied contexts such as political science and expert elicitation, has rarely been used in research on Bayesian belief updating; it provides the methodology that we employ here.

Our aim was to investigate the extent to which people demonstrate base rate neglect and/or conservatism in a simple probability task. We did this by eliciting from each individual both their prior and their posterior. Following the advice of Garthwaite et al. (2005), we limited ourselves to eliciting one-dimensional probability distributions and did so using a graphical user interface. In particular, our method of eliciting probability distributions is similar to that used by Goldstein and Rothschild (2014), which has been demonstrated to accurately elicit probability distributions of the kind we measure here (i.e., one-dimensional and unimodal). This method is superior to methods based on verbal reports (Goldstein & Rothschild, 2014) and follows best practice in asking participants to estimate proportions rather than probabilities, as participants find the former easier to estimate (Gigerenzer & Hoffrage, 1995).

The probability task that we used consists of a game, common in this literature, known as the urn problem (Corner et al., 2010; Johnson & Kotz, 1977; Peterson & Miller, 1965). In a typical version of this game, participants are asked to imagine a container like an urn containing two different types of object (e.g., red and blue chips). Objects are drawn from the container and revealed to the participant sequentially. Based on this information, people are asked to estimate the overall proportion of one of the types of object (e.g., red chips) in the container.^{1}

We report three experiments. In Experiment 1, we presented people with the urn problem but elicited their priors and posteriors as distributions rather than single estimates by having them draw histograms. We had two main goals in doing this: to establish what people actually assume if the prior is left unspecified, and to determine to what extent each person’s reasoning was well-captured by Bayes’ theorem using their stated prior. Our findings suggested that people showed substantial individual differences in their reported priors as well as how closely they followed the predictions of Bayes’ theorem. That said, the majority demonstrated strong base rate neglect, with most people completely or almost completely disregarding their stated priors. They also showed a moderate degree of conservatism, updating their beliefs somewhat less than a fully Bayesian reasoner would, with no readily apparent systematic relationship between the two. We followed up in Experiment 2 by presenting people with explicit information about a stronger prior distribution in order to determine whether this changed the extent to which they incorporated it. Most participants still showed some conservatism and base rate neglect, although less strongly. To ensure that these results were not due to the particular prior used in that experiment, Experiment 3 used a different prior — the prior that, in the aggregate, people assume when not explicitly given a prior (as determined by Experiment 1). Experiment 3 confirmed that explicitly giving participants a prior caused them to neglect the base rate less than when they were required to infer the prior for themselves.^{2}

## 2 Experiment 1

### 2.1 Method

According to Bayes’ theorem, the degree to which a person’s prior influences their posterior is determined by the amount of data they see: the more data, the more the posterior is shaped by the likelihood rather than the prior. There were three conditions, the first two being control conditions. In the OnlyFive condition people were shown five chips drawn sequentially from the urn and then reported a probability distribution. In the OnlyUnlimited condition participants were first shown five chips and then allowed to view as many chips as they wanted before reporting a probability distribution. Finally, in the main condition participants reported their prior, were shown five chips and reported their posterior. They were then allowed to view as many additional chips as they wanted before reporting a second posterior. In this way, the main condition encompassed the previous two conditions. The purpose of the two control conditions was to allow us to determine whether asking participants to repeatedly draw the posteriors in the main condition affected what they drew. We tested for this by comparing whether the posteriors drawn in the main condition matched the corresponding posteriors drawn in the two control conditions.

#### 2.1.1 Participants

452 people (249 male, 201 female, 2 non-binary) were recruited via Prolific Academic and paid 60 British pence. Mean age was 31 years. Ninety were excluded because they failed the bot check (see below) or did not adjust any bars when estimating distributions. All participants gave informed consent and all three experiments in this paper were approved by the University of Melbourne School of Psychological Sciences Human Ethics Advisory Group (ID: 1544692).

#### 2.1.2 Materials

In all conditions, participants were shown an image of a bag that they were told contained red and blue chips. They were asked to provide their probability distributions by adjusting sliders corresponding to bars of a histogram, as shown in Figure 2. The first bar represented the participant’s estimate of the probability that 0% of the chips in the bag were red, the second that 10% were red, and so on, with the final slider representing their estimate of the probability that 100% of the chips were red. The sliders were initialised randomly and constrained so that the total probability added up to 100%. In this way, by varying the position of the sliders, people could draw their probability distributions. When they were satisfied with the distribution, they pressed the submit button to continue.

#### 2.1.3 Procedure

**Bot check**. All participants were initially asked a series of four multiple-choice questions to determine that they were human with adequate English abilities and not a bot. These questions posed analogies of the form “Mother is to daughter as father is to…” (in this example, the correct answer is “son”). Providing an incorrect answer to any of these questions counted as failing the bot check; data from these participants were not analysed. Following the bot check, instructions were presented, demographic information was collected, and people were allocated randomly to one of three conditions.

**OnlyFive condition**. Participants in this condition (*N* = 126) were shown an image of a bag that they were told contained red and blue chips. Five chips (four red and one blue) were then drawn from the bag and presented to each participant one at a time in a random order. Participants were asked to report their estimate of the proportion of red chips in the bag using the histogram visualisation tool shown in Figure 2.

**OnlyUnlimited condition**. This condition was identical to the OnlyFive condition, except after the first five chips were presented, instead of reporting their posterior participants (*N* = 129) were given the option of drawing an additional chip. If they chose to draw one, after a delay of one second they were informed of the colour of the chip and given the option to draw another. This process could be repeated as many times as the participant desired. For the first five chips, four were red and one was blue, but the position of the blue chip in the sequence was randomised between participants. In every additional sequence of five chips, the pattern repeated: four chips were always red and one was always blue, but the position of the blue chip was randomised. After the participant was satisfied that they had drawn enough chips, they were asked to report their estimate of the proportion of red chips using the histogram visualisation tool shown in Figure 2.

**Main condition**. This condition was identical to the previous two except that each person (*N* = 107) was asked to estimate the probability distribution three times: once before being shown any chips, once after being shown five chips, and finally after having had the opportunity to view as many additional chips as they desired. Thus, each participant estimated one prior probability distribution and two posterior probability distributions, one after five chips and one after an unlimited number.

### 2.2 Modelling

Our research questions required determining the extent to which people under-weighted their prior and/or likelihood when reasoning about what chips they expected to see. We thus modelled participants as Bayesian reasoners who made inferences according to the following equation:

*P*(*x*|*n* _{r},*n* _{b}) ∝ *P*(*n* _{r},*n* _{b}|*x*)*P*(*x*)

where *x* represents the proportion of chips in the bag that are red and *n* _{r} and *n* _{b} represent the observed data (i.e., the number of red and blue chips, respectively, drawn from the bag). Thus, *P*(*x*|*n* _{r},*n* _{b}) is the posterior and *P*(*x*) is the prior.
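To make the update concrete, the following is a minimal Python sketch of a fully Bayesian update (no base rate neglect or conservatism) over the same 11-point grid (0%, 10%, …, 100% red) used by the elicitation tool. This is our own illustration, not the authors’ analysis code; the function and variable names are invented.

```python
import numpy as np

# Illustrative sketch (not the authors' code): a fully Bayesian update
# over the 11-point grid (0%, 10%, ..., 100% red) used by the
# elicitation tool.
def bayes_posterior(prior, n_r, n_b):
    x = np.linspace(0.0, 1.0, len(prior))     # candidate proportions of red
    likelihood = x**n_r * (1.0 - x)**n_b      # binomial kernel (constant omitted)
    unnormalised = likelihood * np.asarray(prior, dtype=float)
    return unnormalised / unnormalised.sum()  # normalise to sum to 1

# Uniform prior, then observe four red chips and one blue chip.
prior = np.full(11, 1.0 / 11.0)
posterior = bayes_posterior(prior, n_r=4, n_b=1)
print(posterior.argmax())  # prints 8, i.e. a mode at 80% red
```

With a uniform prior the posterior simply tracks the likelihood, peaking at the observed proportion of red chips.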

**Prior**. We represent the prior used to form the posterior (i.e., the effective prior) as a weighted average of the stated prior, ϕ, and a uniform prior *U*, where β represents the weighting (Equation 3):

*P*(*x*) = βϕ(*x*) + (1−β)*U*(*x*)

The value for β is a constant ranging from 0 to 1, with β=0 indicating that the stated prior was ignored entirely when calculating the posterior (i.e., complete base rate neglect) and β=1 indicating that the prior was weighted appropriately (i.e., no base rate neglect at all).

**Likelihood**. In an urn problem such as ours, *P*(*n* _{r},*n* _{b}|*x*) is given by a binomial likelihood function. In order to capture the extent to which each participant over-weights or under-weights the evidence, we use a parameter γ which intuitively captures how many “effective” red (*n* _{r}) or blue (*n* _{b}) chips the participant incorporates into their calculations (Equation 4):

*P*(*n* _{r},*n* _{b}|*x*) ∝ *x*^{γ*n* _{r}}(1−*x*)^{γ*n* _{b}}

Thus, when γ = 1, participants are neither over-weighting nor under-weighting the evidence; γ < 1 indicates conservatism while γ > 1 indicates over-weighting the data.^{3}
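Equations 3 and 4 combine into a single predicted posterior. The Python sketch below is our own illustrative reimplementation of that combination (the example prior is invented), not the authors’ analysis code.

```python
import numpy as np

def model_posterior(stated_prior, n_r, n_b, beta, gamma):
    """Predicted posterior given a stated prior and the beta/gamma weights."""
    phi = np.asarray(stated_prior, dtype=float)
    uniform = np.full_like(phi, 1.0 / len(phi))
    effective_prior = beta * phi + (1.0 - beta) * uniform     # Equation 3
    x = np.linspace(0.0, 1.0, len(phi))
    likelihood = x**(gamma * n_r) * (1.0 - x)**(gamma * n_b)  # Equation 4
    post = likelihood * effective_prior
    return post / post.sum()

# An invented stated prior peaked at 50% red; then four red and one blue chip.
phi = np.array([1, 1, 2, 3, 5, 8, 5, 3, 2, 1, 1], dtype=float)
phi /= phi.sum()
neglect = model_posterior(phi, 4, 1, beta=0.0, gamma=1.0)   # ignores the prior
bayesian = model_posterior(phi, 4, 1, beta=1.0, gamma=1.0)  # optimal Bayesian
```

With β=0 the stated prior is replaced by a uniform one (complete base rate neglect), so the posterior peaks at the observed proportion of 80% red; with β=1 the peaked prior pulls the posterior back towards 50%. Lowering γ below 1 flattens the likelihood, mimicking conservatism.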

As with the reported priors, the reported posteriors for each person were smoothed by adding 0.01 to the zero values and then normalising; this ensured that fits would not be artificially distorted by the propagation of zero values. (Analyses without this smoothing had qualitatively similar outcomes.) Optimal values for β and γ were calculated for the Main condition (the only one with both priors and posteriors) in aggregate as well as separately for each individual. The analysis was performed in R using the optim function with the L-BFGS-B method, with β constrained to lie between 0.0000001 and 0.9999999 and γ between 0.0000001 and 50. The function being minimised was the mean squared error between the model’s prediction and the reported posterior at the 11 points where the posterior was measured. The supplement contains information about the model fits, which were very good in all experiments.
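As a rough illustration of this fitting step, the sketch below mirrors the procedure in Python (the paper used R’s optim; the function names, starting values, and sanity-check values here are our own assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def smooth(dist):
    # As in the paper: replace zero values with 0.01, then renormalise.
    d = np.where(np.asarray(dist, dtype=float) == 0.0, 0.01, dist)
    return d / d.sum()

def fit_beta_gamma(stated_prior, reported_posterior, n_r, n_b):
    phi = smooth(stated_prior)
    target = smooth(reported_posterior)
    x = np.linspace(0.0, 1.0, len(phi))
    uniform = np.full_like(phi, 1.0 / len(phi))

    def mse(params):
        beta, gamma = params
        prior = beta * phi + (1.0 - beta) * uniform
        lik = x**(gamma * n_r) * (1.0 - x)**(gamma * n_b)
        pred = lik * prior
        pred = pred / pred.sum()
        return np.mean((pred - target) ** 2)  # error at the 11 grid points

    res = minimize(mse, x0=[0.5, 1.0], method="L-BFGS-B",
                   bounds=[(1e-7, 1 - 1e-7), (1e-7, 50.0)])
    return res.x  # best-fitting (beta, gamma)

# Sanity check: recover known parameters from a model-generated posterior.
phi = np.array([1, 1, 2, 3, 5, 8, 5, 3, 2, 1, 1], dtype=float)
phi /= phi.sum()
x = np.linspace(0.0, 1.0, 11)
true_post = x**(0.8 * 4) * (1.0 - x)**(0.8 * 1) * (0.3 * phi + 0.7 / 11)
true_post /= true_post.sum()
beta_hat, gamma_hat = fit_beta_gamma(phi, true_post, 4, 1)
```

The bounds match those reported in the text; minimising the mean squared error between predicted and reported posteriors recovers the β and γ used to generate the synthetic posterior.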

### 2.3 Results

#### 2.3.1 Aggregate performance

In order to ensure that the act of eliciting a prior or multiple posteriors did not change how participants reported probability distributions, we first compare the posteriors obtained from the two control conditions (i.e., the OnlyFive and OnlyUnlimited conditions) to the corresponding posteriors obtained from the Main condition (i.e., the posterior obtained after participants saw five chips and the posterior obtained after participants saw as many additional chips as they desired). As shown in Figure 3, the aggregate posteriors are extremely similar regardless of whether participants were asked to report their priors first (solid lines) or not (dotted lines). In both subplots, for all three conditions the mode is at 80%, indicating that participants on average correctly reported that they expected about 80% of the chips to be red regardless of the condition. Comparing the right subplot to the left, we see that the peak is narrower, indicating that participants were more certain of the proportion of red chips after seeing more chips. Overall, this indicates that participants understood the task and reported reasonable distributions. More importantly, these results demonstrate that asking participants to estimate the prior did not substantially alter their subsequent estimates of the posterior.^{4} This allows us to focus on the Main condition, where each participant estimated three probability distributions: one before viewing any chips, one after viewing five chips, and one after viewing an unlimited number of additional chips.

We can ask several questions of this data on the aggregate level. First, what prior distribution was reported? Participants were not given any information about the quantity of red or blue chips in the bag, so this question allows us to investigate what they presumed in the absence of any instruction. Figure 4 shows the aggregate prior (red line), which has a small peak at 50%, suggesting that on balance people think that a 50/50 split of red and blue chips is more likely than any other mixture. That said, the probability distribution is also fairly uniform across all possible values, indicating that participants would not be terribly surprised if the bag contained all red chips, all blue chips, or any of the other possible combinations.

A second question we can ask of the aggregate data is, when we fit it to our model by adjusting β and γ, what do the resulting parameters tell us about the degree of base rate neglect and conservatism shown by the population as a whole? As Figure 4 makes clear, the best-fit parameters after both five chips and unlimited chips were similar. In both cases, they reflect that the aggregate posteriors were best captured assuming people ignore their reported priors completely (i.e., β = 0) and show a moderate degree of conservatism in updating (i.e., γ < 1). We can understand intuitively why this is the case by comparing the reported posteriors with the predicted posteriors that we would expect from an optimal Bayesian reasoner (grey line, β=γ=1). After five chips, such a reasoner would have a bimodal posterior, which reflects the influence of the prior. Similarly, after an unlimited number of chips, the posterior would be broader than we observe.

#### 2.3.2 Individual performance

One of our main motivations was to understand how individuals (rather than populations) represented and updated their beliefs. Figure 5 shows the distribution of β and γ obtained when both parameters were fit simultaneously to each participant. It is apparent that there is substantial individual variation and few differences based on whether five or unlimited chips were seen. That said, most people showed partial or complete base rate neglect: around half completely disregarded their priors (51.4% of people after seeing five chips and 56.1% after seeing unlimited chips had β<0.1) and only a minority showed no base rate neglect at all (17.8% after five chips and 22.4% after unlimited chips had β>0.9). Participants varied more in how they weighted the likelihood, with around half being conservative (50.4% after five chips and 57.9% after unlimited chips had γ<1). There was no obvious systematic relationship between β and γ values within individuals; it was not the case that a low β meant a high γ or vice versa (Spearman correlation, after five chips: ρ=.040, *p*=.680; after unlimited chips: ρ=.18, *p*=.060; see the supplement for the scatterplots).

To get an intuitive sense of what people are doing, we can inspect the individual distributions. Figure 6 shows some representative examples; all participants are shown in the supplement. There is considerable heterogeneity: people report a wide variety of both priors and posteriors. That said, examining the distributions makes clear how the weightings of the prior and the likelihood can be teased apart. Under-weighting the prior results in a posterior distribution with a different shape (with multiple peaks) or a different peak (closer to the likelihood) than the posterior distribution produced by an optimal Bayesian learner with that prior. By contrast, different likelihood weights change the height of the peak: conservative updating results in a peak that is lower than the Bayesian prediction, while over-weighting the likelihood results in a peak that is higher than predicted. As such, inspection of the individual curves is useful for understanding qualitatively what the quantitative fits of β and γ reveal.

Although our model fits in general were excellent (82.2% of people were fit with an MSE of 0.01 or less and 96.7% with 0.05 or less), one might still worry about whether our results were driven in part by the participants who were not fit well by the model. For instance, if all of the people for whom β=0 were also fit badly, this might not mean that most people showed base rate neglect after all. In order to ensure that this was not the case, we redid all analyses after excluding the people with mean squared error greater than 0.01. This did not change the qualitative results, with most of the 76 remaining participants still showing a high degree of base rate neglect (see the supplement). This suggests that our results are not an artefact of poor fits, and we can be somewhat confident in our interpretation of the parameters.

One might also wonder how robust our method of estimating β and γ is. To address this concern, we performed a robustness analysis. As described in the supplement, this analysis used 12 different priors to construct posteriors by systematically sampling a wide range of β and γ values, and then investigated the extent to which the β and γ values could be recovered from the constructed posteriors. We found that, provided the prior was not uniform (in which case β is undefined), our estimates of β were highly accurate so long as γ was not large. This makes sense because a large γ corresponds to a substantial overweighting of the likelihood, which minimises the influence of any prior and thereby makes β difficult to estimate. Similarly, γ was also recovered accurately provided it was not too large, presumably because when γ is very large the data is overweighted so much that small differences in γ become undetectable. Importantly for our purposes, very few of our participants overweighted the data that much. Even among those for whom the model inferred γ > 1, most had estimated γ values of 10 or less, in which case our estimates of β and γ should be accurate for all the priors that were considered except for prior 10. Even for this prior, our robustness analysis indicated that γ would be estimated reliably; the difficulty would be in estimating β, and it arose because prior 10 was sharply peaked with the peak on the opposite side to the true proportion of red to blue chips. Considering just the participants who were fit well by our model, none reported a prior resembling prior 10, suggesting that, for these participants, the estimated values of β and γ would be accurate.

The robustness analysis demonstrated that whether or not β and γ can be recovered accurately depends in part on the prior. As such, it is useful to ask to what extent we can expect to recover β and γ using the priors actually reported by the participants. To address this issue, we performed a recoverability analysis. This analysis used the β and γ values estimated for each participant to generate a posterior from that participant’s prior. This posterior was then used to estimate β and γ. We showed that, for our data, the original and recovered β and γ values had a correlation of 0.97 or more, across all three experiments. This showed that, given each participant’s prior, if the estimated β and γ values were true we could, in principle, recover them. For further details, the reader is referred to the supplement.
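A toy version of such a recoverability check (our own sketch with an invented prior and parameter ranges, not the supplement’s actual procedure) might look like the following: generate posteriors from known β and γ values, re-fit them, and correlate the true and recovered parameters.

```python
import numpy as np
from scipy.optimize import minimize

# Toy recoverability check (our own sketch, not the supplement's code):
# simulate posteriors from known beta/gamma values, re-fit, and correlate.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 11)
uniform = np.full(11, 1.0 / 11.0)
phi = np.array([1, 1, 2, 3, 5, 8, 5, 3, 2, 1, 1], dtype=float)
phi /= phi.sum()

def predict(beta, gamma, n_r=4, n_b=1):
    prior = beta * phi + (1.0 - beta) * uniform
    lik = x**(gamma * n_r) * (1.0 - x)**(gamma * n_b)
    post = lik * prior
    return post / post.sum()

# Twenty simulated participants with known (beta, gamma) pairs.
true_params = rng.uniform([0.05, 0.2], [0.95, 3.0], size=(20, 2))
recovered = []
for beta, gamma in true_params:
    target = predict(beta, gamma)
    loss = lambda p: np.mean((predict(p[0], p[1]) - target) ** 2)
    res = minimize(loss, [0.5, 1.0], method="L-BFGS-B",
                   bounds=[(1e-7, 1 - 1e-7), (1e-7, 50.0)])
    recovered.append(res.x)
recovered = np.array(recovered)

r_beta = np.corrcoef(true_params[:, 0], recovered[:, 0])[0, 1]
r_gamma = np.corrcoef(true_params[:, 1], recovered[:, 1])[0, 1]
```

In this idealised setting (exact model-generated posteriors, moderate γ), the correlations between true and recovered parameters should be very high, consistent with the 0.97-or-greater correlations reported above.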

## 3 Experiment 2

In Experiment 1, most participants showed base rate neglect, partially or completely ignoring their own reported prior when updating their beliefs in light of new data. Why did they do this? One possibility is that the task demands encouraged them to do so, since no prior was ever explicitly given and physically seeing chips being drawn may have made the data more salient. In Experiment 2, we investigated this possibility by explicitly giving participants the prior. People in the Peaked prior condition were shown a distribution with a mode at a proportion of 80% red chips (as this most closely aligned with the data the participants would subsequently receive). Those in the Uniform prior condition were shown a completely flat prior; this is a useful comparison because reasoning based on this prior is equivalent to reasoning that completely ignores the prior. As a result, if participants always ignore their prior then the posteriors they report should be the same in both conditions; if not, the posterior should be sharper in the Peaked condition.

### 3.1 Method

#### 3.1.1 Participants

300 people (184 male, 113 female, 3 non-binary) were recruited via Prolific Academic and paid 60 British pence. Mean age was 26 years. Sixty-one people were excluded because they either failed the bot check or did not adjust any bars when estimating distributions.

#### 3.1.2 Materials and Procedure

This experiment involved the same procedure and instructions as Experiment 1 except that we presented participants with an explicit prior distribution using the same “bar” format that they used to report their own. In the Peaked condition (*N* = 121) people were informed that a previous participant who had completed the task several times had stated that “There were usually about four times more red chips than blue chips in the bag (like, 80% red)” and had also drawn the plot in the left panel of Figure 7 to illustrate their statement. Conversely, the people in the Uniform condition (*N* = 118) were informed that a previous participant who had completed the task several times had stated that “The number of red and blue chips in the bag keeps changing, doesn’t seem to be a pattern to it” and had drawn the plot in the right panel of Figure 7 to illustrate their statement.

Because Experiment 2 presented participants with an explicit prior, the procedure did not involve a prior elicitation step. Instead, after having been told the prior, participants were shown five chips (four red and one blue in random order, as before) and were asked to draw their posterior.

### 3.2 Results

#### 3.2.1 Aggregate performance

We first present the aggregate distributions in each condition, along with the best-fit β and γ values. As Figure 8 shows, participants in the Peaked condition were not entirely ignoring the prior; their posterior is tighter and sharper than in the Uniform condition, as one would expect if they were taking the prior into account. That said, a comparison to the posterior inferred by an optimal Bayesian — along with the inferred β and γ values — demonstrates that people still showed substantial underweighting of the base rate (albeit less than before) and some degree of conservatism.

#### 3.2.2 Individual performance

As before, we performed individual-level analyses by fitting each participant to the value of β and γ that best captured their reported posterior based on the prior they were given. The distribution of these parameters in each condition is shown in Figure 9 (recall that there are no β values in the Uniform condition because in that condition β was undefined). There is again substantial individual variation, but most people in the Peaked condition showed partial or complete base rate neglect: 38.8% of participants disregarded their priors (with β<0.1) and only 3.3% showed no base rate neglect at all (with β>0.9). That said, more participants than in Experiment 1 paid *some* attention to the prior they were given, even if they did not weight it as strongly as an optimal Bayesian reasoner would have. The degree to which participants weighted the likelihood depended on their condition. Participants in the Peaked condition were less likely to be conservative than those in the Uniform condition: 38.8% in Peaked and 61.0% in Uniform had γ<1.

As in Experiment 1, we did not find an obvious systematic relationship between β and γ values within individuals (Spearman correlation, ρ = 0.141; *p* = 0.124); see the supplement for the scatterplots and further discussion.

Although our model fits were again excellent (79.1% of people were fit with MSE less than 0.01, and 98.7% with MSE less than 0.05), we redid all analyses after excluding the people that were not fit well by our model (i.e., the people with mean squared error greater than 0.01). As documented in the supplement, this did not change the qualitative results: the remaining 189 people still appeared to show some base rate neglect in the aggregate, but the Peaked condition had a sharper posterior than the Uniform condition, demonstrating that participants in that condition did take the prior into account at least somewhat.

As mentioned earlier, we conducted a robustness analysis that considered 12 different priors. For this experiment, prior 11 and prior 3 are particularly relevant as they correspond to the priors shown to participants in the Peaked and Uniform conditions respectively. Assuming that participants used the prior given to them, our analysis demonstrated that, for the uniform prior, the estimation of γ was accurate for all combinations of β and γ. For the peaked prior, the estimation of γ was accurate when the actual γ was less than 10, and its accuracy decayed only gradually as the actual γ increased beyond that, so the estimated γ still approximated the actual γ even when the actual γ was high. For the Peaked condition, almost all participants had an estimated γ less than 10, which means that we can be confident that their actual γ values were estimated accurately.

As before, we also performed a recoverability analysis. Assuming that participants used the prior that was provided to them, this analysis confirmed that, using the posterior implied by each participant’s individual β and γ values, we could recover the original β and γ values. This shows that if an individual’s estimated β and γ values were true we could, in principle, recover them. For further details, the reader is directed to the supplement.
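
The logic of such a recoverability analysis can be illustrated with a small simulation: generate the posterior implied by known β and γ values, then fit the model back to that posterior and check that the original parameters are returned. Everything below (the grid, the prior, the data, and the use of a generic optimiser) is a hypothetical sketch, not the paper's actual procedure.

```python
import numpy as np
from scipy.stats import binom
from scipy.optimize import minimize

# Illustrative grid, peaked prior, and binomial likelihood (4 red in 5 draws).
x = np.linspace(0.01, 0.99, 99)
prior = np.exp(-(x - 0.5)**2 / (2 * 0.1**2))
prior /= prior.sum()
lik = binom.pmf(4, 5, x)

def implied_posterior(beta, gamma):
    """Posterior under the assumed power-weighted model."""
    p = prior**beta * lik**gamma
    return p / p.sum()

# Simulate a participant with known weights, then try to recover them.
true_beta, true_gamma = 0.4, 0.7
simulated = implied_posterior(true_beta, true_gamma)

def loss(params):
    b, g = params
    # MSE to the simulated posterior, scaled up so the optimiser's
    # default stopping tolerances remain effective on tiny values.
    return 1e4 * np.mean((implied_posterior(b, g) - simulated) ** 2)

res = minimize(loss, x0=[0.5, 1.0], bounds=[(0.0, 1.0), (0.01, 50.0)],
               method="L-BFGS-B")
beta_hat, gamma_hat = res.x
```

If the fitted values land back on (0.4, 0.7), the parameters are recoverable for this prior and data; repeating this across a grid of true values maps out where recovery degrades.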

## 4 Experiment 3

The Peaked condition of Experiment 2 suggested that even when the prior is made explicit, people underweight it relative to how they should weight it according to Bayes’ theorem. In this experiment, we further investigate this phenomenon by comparing a condition where people are provided with a prior (the Given condition) to one where they are not and so must estimate it themselves (the Estimated condition). Building on our results from Experiment 1, we arrange for the prior provided in the Given condition to be approximately equal to the average prior in the Estimated condition. This means that any differences in aggregate performance between the two conditions can be attributed to the fact that individuals are given the prior in one condition but not in the other.

### 4.1 Method

#### 4.1.1 Participants

300 participants (132 male, 162 female, 6 non-binary) were recruited via Prolific Academic and paid 60 British pence. Mean age was 25 years. Thirty-nine people were excluded because they either failed the bot check or did not adjust any bars when estimating distributions.

#### 4.1.2 Materials and Procedure

The experiment involved the same procedure and instructions as before except for the following differences. In the Estimated condition participants were not provided with a prior. This condition was thus identical to the Main condition of Experiment 1, except that the experiment stopped after the participants had reported the first posterior (i.e., after the participant had seen five chips). In the Given condition participants (*N* = 133) *were* provided with a prior. It was thus identical to the Peaked condition of Experiment 2 except that the prior they were shown corresponded to the aggregate prior reported in Experiment 1. We designed it this way because it means that in both conditions we would expect people to have the same prior (at least in the aggregate); the conditions differ only in whether that prior was explicitly provided or not. This, therefore, allowed us to determine whether people are more likely to use a prior if it is explicitly provided to them.

### 4.2 Results

#### 4.2.1 Aggregate performance

As shown by Figure 10, the prior reported by the participants in the Estimated condition (solid red line) was very similar to the prior provided to the participants in the Given condition (dashed black line). This suggests that any differences in the posteriors between the two conditions are unlikely to be due to differences in their priors.

Figure 11 shows the aggregate posterior distributions for each condition, shown alongside the optimal Bayesian prediction as well as the prediction made using the best-fit parameters β and γ. As expected, the best fit parameters for the Estimated condition (β = 0, γ = 0.48) are very similar to the best fit parameters in the Five condition in Experiment 1 (β = 0, γ = 0.55), with participants in the aggregate demonstrating complete base rate neglect (β=0). Conversely, in the Given condition participants made much more use of the prior (β = 0.32). This resulted in a posterior with two modes corresponding to the peaks of the prior and the likelihood. This is consistent with the finding from the Peaked condition of Experiment 2 that when the prior is made explicit, participants make use of it, but not to the extent predicted by Bayes’ theorem.

#### 4.2.2 Individual performance

As before, we performed individual-level analyses by finding, for each participant, the values of β and γ that best captured their reported posterior given their prior. The distribution of these parameters in each condition is shown in Figure 12. The results in the Estimated condition are very similar to the analogous condition of Experiment 1, with many participants disregarding their prior (43.8% had a β < 0.1, compared to 51.4% previously) and a minority weighting it appropriately (23.4% had a β > 0.9, compared to 17.8% previously). The results from the Given condition are consistent with the observation from Experiment 2 that participants pay more attention to the base rate when the prior is made explicit: fewer people in the Given condition than the Estimated one ignored the base rate entirely (26.3% had β < 0.1) and more weighted it appropriately (41.4% had β > 0.9). As before, a moderate number of participants reasoned conservatively (51.6% in the Estimated condition and 70.7% in the Given condition had γ < 1). There was also again no obvious systematic relationship between β and γ (Spearman correlation, Estimated: ρ = .137, *p* = .124; Given: ρ = −.04, *S* = 406487, *p* = .675; see supplement for scatterplots). Thus, the degree to which an individual weights the prior does not predict the degree to which they weight the likelihood.

The model fits for Experiment 3 were just as good as in previous experiments (81.6% of people had an MSE of less than 0.01, and 98.5% less than 0.05). Nevertheless, as before, we redid all analyses after excluding the people with MSE greater than 0.01, leaving 213 in the dataset. As shown in the supplement, this did not change the qualitative results. At both the aggregate and individual levels, participants in the Estimated condition were more likely to ignore their prior, whereas in the Given condition more participants used the prior.

As before, we performed a robustness analysis. In this analysis, prior 12 corresponds to the prior provided to participants in the Given condition (which is very similar to the mean prior assumed by participants in the Estimated condition as shown by Figure 10). This analysis demonstrated that both β and γ can be accurately recovered if actual γ is less than 30. Given that estimated γ was always less than 25 (and usually much less), we can be confident that this condition held for all participants. A recoverability analysis demonstrated that, for each individual, if the estimated β and γ were true, we could, in principle, recover them. Please see the supplement for further information.

## 5 Discussion

In this paper we asked to what extent human probabilistic reasoning conforms to the normative standards prescribed by Bayes’ theorem when participants present their probability estimates as entire distributions rather than as point estimates. Our first experiment was inspired by the standard balls-and-urn task. Participants were shown a bag containing a number of chips, some red and some blue, and were asked to provide three probability distributions (one prior and two posteriors) using a visual histogram tool similar to that of Goldstein and Rothschild (2014). The task description gave no information about the likely ratio of red to blue chips. Fitting individual participants revealed that, regardless of whether they saw only five chips or were allowed to view as many chips as they desired, the majority showed substantial base rate neglect (i.e., they ignored the prior they had reported) and varied in the degree to which they were conservative (i.e., they updated their beliefs less strongly than a normative Bayesian reasoner would).

In order to determine whether people ignored their prior because it was not explicitly stated, in Experiment 2 we presented people with either a uniform or a peaked prior and then asked for their posterior distributions after seeing five chips. Here the aggregate results revealed that participants still underweighted the base rate to some extent even when given an explicit prior. However, their posteriors were sharper when they were given a peaked prior than a uniform one, indicating that the priors were not being ignored entirely. This was supported by fitting individual participants: although variation was again substantial, more people used the prior when it was made explicit in the Peaked condition than when it was not.

Experiment 3 further investigated this phenomenon by directly comparing a condition where people were given a prior to a condition where they were not. We arranged for the given prior to be approximately equal to the mean prior that participants would deduce for themselves. This means that comparing the aggregate performance in the two conditions allowed us to determine to what extent explicitly giving participants a prior induces them to use it. Experiment 3 confirmed the findings of Experiment 2: when the prior is made explicit, people weight it more than when it is not.

To interpret this finding, it is necessary to understand how the prior distribution represents the confidence the participant has in their prior knowledge. The more the stated prior departs from the uniform distribution, the more confidence the participant is expressing that certain proportions are more likely to occur than other proportions. For example, if a participant were to report a prior that had a peak at *x* = 0.5 and was zero at all other values of *x*, they would be stating that they are 100% confident that the proportion of red chips in the urn is exactly 0.5. Bayes’ theorem uses the degree of confidence people have in their prior knowledge (encoded in the shape of their prior) to calculate their posterior. Our modeling went beyond Bayes’ theorem by allowing for the possibility that the stated prior may not be the effective prior (i.e., it may not be the prior people actually use to construct their posterior). We found that many people disregarded their stated prior and instead constructed their posterior from a uniform prior. We are agnostic as to the reason why these people did this. It could be that they were less confident in their prior knowledge than the shape of their stated prior would indicate. Alternatively, they may have constructed their posterior from a uniform prior because it was cognitively easier to do so.
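
This point can be made concrete with a toy update in which the posterior is simply the normalised product of prior and likelihood. A maximally confident prior (all mass at 0.5) cannot be moved by any data, whereas an effective prior that is uniform reproduces the normalised likelihood; the grid and chip counts here are illustrative, not the experiments' values.

```python
import numpy as np
from scipy.stats import binom

x = np.linspace(0.01, 0.99, 99)
lik = binom.pmf(4, 5, x)   # illustrative data: 4 red chips in 5 draws

def bayes_posterior(prior):
    """Standard (unweighted) Bayesian update on the grid."""
    p = prior * lik
    return p / p.sum()

# A maximally confident prior: all mass on x = 0.5.
delta = np.zeros_like(x)
delta[np.argmin(np.abs(x - 0.5))] = 1.0

# An effectively uniform prior, as used by participants who disregard
# their stated prior.
uniform = np.full_like(x, 1.0 / len(x))

post_delta = bayes_posterior(delta)    # all mass stays at x = 0.5
post_unif = bayes_posterior(uniform)   # proportional to the likelihood
```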

In Experiment 1 and in the Estimated condition of Experiment 3, participants were given no information as to the likely proportion of red chips, so how did they estimate the prior? Most likely, they did so by drawing on logic and previous knowledge. As there was nothing to suggest that there would be more red chips than blue chips or vice versa, we would expect the reported prior to be approximately symmetrical around *x* = 0.5. Furthermore, since all proportions of red chips were possible, we would expect a fairly uniform prior to reflect this fact. Finally, past experience would suggest that an approximately equal ratio of red to blue chips is more likely: it is common practice for packages of assorted goods to contain approximately equal quantities of each good (one would expect a package of assorted biscuits, for example, to have roughly equal numbers of each type of biscuit). Consequently, it would not be unreasonable to assume that proportions near the point *x* = 0.5 may be more likely than those further away. These considerations can explain why the aggregate priors shown by the red lines in Figure 4 are approximately uniform and symmetric around the point *x* = 0.5 with a slight peak there.

Some of the subtleties that arose in our analysis illustrate both the benefits and complexities of measuring and fitting full probability distributions. There are several benefits. For instance, this method allows us to disentangle, for an individual participant on a single reasoning problem, to what extent they under-weight or over-weight both their prior *and* their likelihood. This is not possible using any other methodology: mathematically, a single point posterior can arise from any of an infinite number of possible weightings of the prior and likelihood, because overweighting the prior is equivalent to underweighting the likelihood and vice versa. The studies that do attempt to disentangle prior and likelihood weightings do so by presenting multiple problems, systematically varying both the priors and the evidence (e.g., Benjamin et al., 2019; Griffin & Tversky, 1992). This is sometimes useful, but presumes that people weight their priors and likelihoods similarly across all problems. Our results suggest that this is not necessarily the case: people showed more base rate neglect in some circumstances than in others. In particular, people demonstrated more base rate neglect when they estimated the prior than when it was given to them. Surprisingly, we found that in all three experiments there was no correlation between an individual’s base rate neglect and their degree of conservatism. We had expected these two variables to trade off against each other, and so to be anti-correlated. Instead, we found that they were independent of each other, suggesting that they are governed by distinct cognitive processes. To our knowledge this is a novel finding; future research is necessary to determine how robust it is and how far it extends.

Another benefit of fitting full probability distributions is that because each individual was fit separately for both prior and likelihood weights, we could determine how each of these weights varied among people. For instance, Experiment 1 demonstrated that most people either completely ignored their priors (with β close to 0) or weighted them appropriately (with β close to 1); that is, the distribution over β was bimodal, with few intermediate values. This bimodal distribution was not observed in Experiment 2 but was in Experiment 3. Further research will be needed to determine when it is and is not observed.

One potential worry about the validity of our method is the extent to which people can actually accurately report their underlying distribution. If people reason by drawing a small number of samples from their distribution, as some suggest (Vul et al., 2014), it is not obvious that this would be sufficient for people to reconstruct and report the actual distribution. Although this is a possibility we cannot rule out with certainty, it seems unlikely. The distributions people reported seem reasonable both individually and in the aggregate, and reflect the overall patterns one would expect: tightening with additional information in Experiment 1, stronger inferences with a stronger prior in Experiment 2, and more reliance on the prior when it is made explicit in Experiment 3. Moreover, previous work has demonstrated that people can accurately report similar probability distributions (Goldstein & Rothschild, 2014), which they could not do if they were limited to drawing a small number of samples from the underlying distribution.

More broadly, this research demonstrates *why* it can be useful to elicit and analyse entire distributions rather than single point estimates. As long as (i) the prior is not uniform and (ii) the prior and likelihood have different modes (unlike in Experiment 2), the two terms make different contributions to the shape of the posterior distribution, so their individual contributions can be estimated. There is a great deal of potential in applying this methodology to long-standing problems in human reasoning. Might the framing of the problem (i.e., how the problem is presented to the participants) affect base rate neglect (Barbey & Sloman, 2007) at least in part because people may implicitly assume priors with different distributional shapes (reflecting different levels of confidence or extent) depending on how the problem is presented to them? Might base rate neglect be smaller for priors that are easier to use, represent, or sample from? To what extent do anchoring effects change if the information is presented as a full distribution? Do the same individuals weight their priors and likelihoods the same across different problems? These are only some of the questions that can now be addressed.
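
Condition (i) has a simple mechanical reading if one assumes the posterior is formed as prior^β × likelihood^γ and then normalised: a uniform prior raised to any power β is still uniform, so β drops out of the update entirely and cannot be estimated from the reported distribution (which is also why β was undefined in the Uniform condition of Experiment 2). A small self-contained check, with an illustrative grid and likelihood:

```python
import numpy as np
from scipy.stats import binom

x = np.linspace(0.01, 0.99, 99)
lik = binom.pmf(4, 5, x)                 # illustrative likelihood
uniform = np.full_like(x, 1.0 / len(x))  # uniform prior over the grid

def weighted(prior, beta, gamma):
    """Power-weighted update: prior**beta * likelihood**gamma, normalised."""
    p = prior**beta * lik**gamma
    return p / p.sum()

# Under a uniform prior, every value of beta yields the same posterior,
# so beta is unidentifiable from the reported distribution.
p0 = weighted(uniform, 0.0, 1.0)
p1 = weighted(uniform, 1.0, 1.0)
assert np.allclose(p0, p1)
```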

In sum, this paper presents initial research demonstrating the utility of eliciting and fitting full distributions when studying probabilistic reasoning. Across three experiments, we found substantial variation in the extent to which people showed base rate neglect and conservatism, which our method allowed us to measure in individuals on single problems. While most people tended to disregard the base rate, they did so less when it was explicitly presented. Moreover, there was no apparent systematic relationship between base rate neglect and conservatism within individuals. There is a great deal of potential in applying this methodology to other problems in human probabilistic reasoning.