Expertise determines frequency and accuracy of contributions in sequential collaboration

Many collaborative online projects such as Wikipedia and OpenStreetMap organize collaboration among their contributors sequentially. In sequential collaboration, one contributor creates an entry which is then consecutively encountered by other contributors who decide whether to adjust or maintain the presented entry. For numeric and geographical judgments, sequential collaboration yields improved judgments over the course of a sequential chain and results in accurate final estimates. We hypothesize that these benefits emerge since contributors adjust entries according to their expertise, implying that judgments of experts have a larger impact compared with those of novices. In three preregistered studies, we measured and manipulated expertise to investigate whether expertise leads to higher change probabilities and larger improvements in judgment accuracy. Moreover, we tested whether expertise results in an increase in accuracy over the course of a sequential chain. As expected, experts adjusted entries more frequently, made larger improvements, and contributed more to the final estimates of sequential chains. Overall, our findings suggest that the high accuracy of sequential collaboration is due to an implicit weighting of judgments by expertise.


Introduction
Online collaborative projects such as Wikipedia and OpenStreetMap have become increasingly important sources of information over the last two decades and are frequently used by many people. Prior research showed that Wikipedia yields highly accurate information both in general (Giles, 2005) and for specific topics (Kräenbring et al., 2014;Leithner et al., 2010). Also, OpenStreetMap provides geographic information with a similar accuracy as commercial map services and governmental data (Ciepłuch et al., 2010;Haklay, 2010;Zhang and Malczewski, 2018;Zielstra and Zipf, 2010). To gather information, both Wikipedia and OpenStreetMap build on a sequential process referred to as sequential collaboration (Mayer and Heck, 2022). In this process, one contributor creates an entry which is then sequentially adjusted or maintained by the following contributors. Mayer and Heck (2022) showed that sequential collaboration represents a successful way of eliciting group judgments. In three online studies, participants either answered general-knowledge questions or located European cities on geographic maps. Participants were randomly assigned to sequential chains of four or six contributors. Each chain started with a contributor providing an independent judgment. Next, other contributors encountered the latest version of the judgment and could then decide whether to adjust or maintain it. For instance, the first individual may start by locating Rome on a map of Italy without additional information. The second contributor may then maintain the location, whereas the third contributor may move the location more to the south. Participants were unaware of their position in the sequential chain, the change history of judgments, and how often a judgment had already been adjusted.
In the three studies by Mayer and Heck (2022), change probability and change magnitude decreased over the course of a sequential chain, whereas judgment accuracy improved. Observing an incremental increase in judgment accuracy over the course of a sequential chain represents a rather weak benchmark of performance since it lacks a comparison standard for the accuracy of the final judgments. As a remedy, one can also compare the accuracy of sequential collaboration against a stronger benchmark. In fact, the studies by Mayer and Heck (2022) provided preliminary evidence that the final judgments of sequential chains were as accurate as, and in some cases even more accurate than, estimates obtained by unweighted averaging, that is, by computing the mean of independent individual judgments for the same number of participants. Similar results were reported by Miller and Steyvers (2011) for a more complex ordering task. This is an important finding given that unweighted averaging is known to yield highly accurate estimates in various contexts and tasks, a phenomenon known as wisdom of crowds (Hueffer et al., 2013; Larrick and Soll, 2006; Steyvers et al., 2009; Surowiecki, 2004).
Even though these initial results are promising, the mechanisms contributing to the increase in accuracy of sequential judgments are still unclear. In the present paper, we investigate whether the expertise of contributors affects both the probability of adjusting a presented judgment and the accuracy of revised judgments. We hypothesize that individuals with higher expertise are better at distinguishing between presented judgments they can improve and those they cannot improve. This would lead to a systematic opt-out mechanism: experts provide new, more accurate judgments when possible, but otherwise maintain a presented judgment (Mayer and Heck, 2022). Sequential collaboration would thus facilitate an implicit weighting of judgments by expertise, in turn leading to increasingly accurate judgments over the course of a sequential chain.
In the following, we first define expertise and discuss its relevance for judgment accuracy in various contexts. Based on the literature on the role of expertise for individual judgments, we propose a theoretical framework of how individuals' expertise drives a differential opt-out mechanism that improves the accuracy of sequential judgments. We conducted three experimental studies using a city-location task and a random-dots estimation task. In each study, we either measured individuals' knowledge or manipulated their skill for the task at hand. Thereby, we examined whether expertise influences how frequently presented judgments in sequential collaboration are adjusted and how much they are improved. As expected, we found that contributors with higher expertise adjust presented judgments more frequently and also provide larger improvements if adjustments are made. Furthermore, individuals with higher expertise have a larger impact on sequential chains than individuals with lower expertise, and this effect is more pronounced the later experts enter into the chain.
[...] decision-making, the more individuals are aware of the expertise of other group members, the more accurate group decisions become (Baumann and Bonner, 2013). However, in such settings, it is crucial to explicitly communicate the expert status of group members before the discussion starts (Bonner et al., 2002). Moreover, when eliciting independent judgments from a group of individuals, weighting these judgments by expertise improves the accuracy of the aggregated estimates (Budescu and Chen, 2014; Lin and Cheng, 2009). For this purpose, expertise can be estimated statistically based on the observed performance (Mayer and Heck, 2023; Merkle et al., 2020; Merkle and Steyvers, 2011) or measured empirically by asking participants to rate their own, task-relevant knowledge (Ungar et al., 2012).
However, tasks need to have a certain level of 'demonstrability' for expertise to have an impact on group decisions (Bonner et al., 2022; Laughlin and Ellis, 1986). For a task to be demonstrable, team members completing the task need to rely on the same system of communication and require sufficient information to solve the task. Moreover, team members who cannot solve the task still need to recognize and accept a correct solution if it is proposed by others, whereas members who can solve the task need to have sufficient motivation, ability, and time to demonstrate the accuracy of their solution to others. Highly demonstrable tasks ('intellective tasks') can profit from group members' task-related expertise. In contrast, less demonstrable, highly subjective tasks ('judgmental tasks') may not profit from expertise in a similar way since forming, communicating, and recognizing a correct answer is less clear (Bonner et al., 2022). According to this theoretical framework, sequential collaboration requires a sufficient level of task demonstrability, and thus we focus on intellective tasks in the following.

Implicit weighting of judgments by expertise
We hypothesize that sequential collaboration provides accurate outcomes because the process facilitates an implicit weighting of judgments by expertise. A weighting of judgments emerges due to the opportunity for contributors to opt out of providing a judgment. By opting out and maintaining the presented judgment, contributors assign more weight to the presented judgment (Mayer and Heck, 2022). In contrast, when opting in and adjusting a presented judgment, contributors give more weight to their own judgment compared with the presented judgment. Thus, if contributors show a differential opt-in and opt-out behavior depending on their expertise, judgments are implicitly weighted. This should, in turn, lead to increasingly accurate judgments over the course of a sequential chain since weighting by expertise improves aggregation of individual judgments (e.g., Budescu and Chen, 2014;Mayer and Heck, 2023;Merkle et al., 2020).
Such a process requires contributors to rely on task-related, metacognitive knowledge about their expertise. Metacognition describes contributors' 'cognition about cognitive phenomena' (Flavell, 1979). In the context of sequential collaboration, metacognitive knowledge (Lai, 2011) about one's own expertise allows contributors to evaluate the accuracy of presented judgments and their own capacity to provide improvements (Kruger and Dunning, 1999; Laughlin and Ellis, 1986). Given that contributors decide whether to opt out of providing a judgment based on their metacognitive knowledge of their expertise (Bennett et al., 2018), sequential collaboration does not require a particular mechanism for identifying experts. It is thus not necessary to assign expert roles (Baumann and Bonner, 2013), to directly assess individuals' expertise (Ungar et al., 2012), or to estimate expertise statistically (Mayer and Heck, 2023; Merkle et al., 2020). Instead, contributors determine the weighting of judgments within sequential chains implicitly based on their metacognitive assessment of their expertise and the evaluation of the presented judgment. Achieving high accuracy only requires that some of the contributors have sufficient expertise to detect and improve inaccurate judgments by others.
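The opt-out mechanism described above can be illustrated with a toy simulation. The sketch below is not the authors' model: it assumes that each contributor holds a private, noisy estimate of the truth (with a noise level standing in for low expertise) and adjusts the presented judgment only if it differs from that private estimate by more than the contributor's own typical error. The function name `run_chain`, the normal noise model, and the `threshold` parameter are all hypothetical.

```python
import random

def run_chain(truth, noise_sds, rng, threshold=1.0):
    """Simulate one sequential chain under a toy opt-out rule (illustrative only).

    noise_sds holds one noise level per contributor; a smaller value
    stands in for higher expertise (an assumption of this sketch).
    """
    # The first contributor provides an independent judgment.
    current = rng.gauss(truth, noise_sds[0])
    history = [current]
    for sd in noise_sds[1:]:
        own = rng.gauss(truth, sd)  # private estimate of this contributor
        # Opt in (adjust) only if the presented judgment lies outside the
        # contributor's own typical error; otherwise opt out (maintain).
        if abs(current - own) > threshold * sd:
            current = own
        history.append(current)
    return history

if __name__ == "__main__":
    rng = random.Random(2023)
    # A novice (sd = 30) followed by three contributors of mixed expertise.
    print(run_chain(100.0, [30.0, 5.0, 30.0, 5.0], rng))
```

Under this rule, contributors with small noise replace inaccurate entries but leave accurate ones untouched, which is one way a chain could weight judgments by expertise without any explicit expert labels.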

Probability of making adjustments
Sequential collaboration requires a two-stage response process in which contributors first decide whether to adjust a presented judgment (opt in) or whether to maintain it (opt out); only in the former case, they provide a new, revised judgment (Mayer and Heck, 2022). In the following, we derive predictions for the first stage in terms of the probability of adjusting a presented judgment.
Figure 1 note: In Panel B, a positive (negative) value of improvement indicates that a revised judgment is more (less) accurate than the presented judgment.
As discussed above, we expect that contributors with high expertise are better at distinguishing between presented judgments they can improve and those they cannot improve (Bennett et al., 2018). Figure 1A illustrates our expectations of how the following two factors influence contributors' decision whether to adjust or maintain a presented judgment: first, contributors' expertise, and second, the deviation of the presented judgment from the correct answer (referred to as presented deviation in the following). Contributors high in expertise should be able to detect highly accurate judgments as being correct, and in turn refrain from adjusting such judgments. Moreover, they should be able to detect even small deviations of the presented judgment from the correct judgment and show a high probability of adjusting presented judgments that differ considerably from the correct answer. In contrast, we assume that contributors with less expertise cannot reliably distinguish between presented judgments they can improve and those they cannot improve. Such contributors should show a substantial probability of (unnecessarily) adjusting already accurate judgments but a lower probability of adjusting inaccurate judgments compared with experts. Figure 1A shows that the expected pattern implies an interaction, that is, a steeper slope of the change probability (as a function of the presented deviation) for contributors with higher than for those with lower expertise. As long as contributors have sufficient task-relevant expertise, a positive relationship between presented deviation and change probability should emerge (i.e., a main effect), meaning that larger presented deviations are more likely to be detected and adjusted. Since we assume that contributors with higher expertise can detect even small presented deviations and tend to adjust such judgments, we also expect an overall higher change probability for contributors with higher than for those with lower expertise. 
However, since we expect a crossed interaction (Figure 1), this effect will only emerge when the range of presented deviations across items is sufficiently large.

Improvement of presented judgments
When deciding to adjust a presented judgment, contributors in sequential collaboration have to provide a new, revised judgment in the second stage of the response process. We assume that, similar to change probability, the amount of improvement of presented judgments also depends on the two factors expertise and presented deviation. As illustrated in Figure 1B, we expect a main effect of presented deviation, meaning that with increasing deviation, contributors improve presented judgments to a larger degree. We also expect that improvements increase with increasing expertise since contributors high in expertise can provide more accurate judgments compared with those low in expertise (Merkle et al., 2020;Ungar et al., 2012). As shown in Figure 1, we also assume an interaction of contributors' expertise and presented deviation. Individuals high in expertise should make no or only minor adjustments to presented judgments that are already accurate while providing medium improvements to moderately inaccurate judgments and large improvements to highly inaccurate judgments. In contrast, contributors with lower expertise may not be able to make similarly large improvements to highly and moderately inaccurate presented judgments. In fact, contributors low in expertise may even revise presented judgments that are already accurate, leading to negative improvements if the presented deviation is zero.

The role of expertise in chains of sequential judgments
The predictions above focus on a single step of sequential collaboration and apply to the level of individual contributions. In the following, we derive additional predictions for actual chains of sequential judgments of groups of at least two contributors. First, we consider whether contributors respond differently when encountering presented judgments of contributors who are high or low in expertise. As discussed above, contributors high in expertise should generally be better at distinguishing between accurate and inaccurate judgments. Assuming that the expertise of contributors is linked to the accuracy of their judgments (e.g., Merkle et al., 2020), we thus expect that contributors high in expertise adjust presented judgments more frequently if judgments were made by others low in expertise than by those high in expertise. In contrast, we expect that contributors low in expertise show a similar change probability irrespective of whether presented judgments were made by contributors with higher or lower expertise.
Concerning the accuracy of revised judgments, presented judgments should be improved most by contributors with high expertise who encounter judgments of contributors with low expertise. We expect smaller improvements if contributors revise presented judgments of others who have a similar level of expertise, that is, when contributors with high (low) expertise correct others with high (low) expertise. However, contributors with low expertise are expected to worsen presented judgments provided by others with high expertise. Our last predictions concern the overall accuracy of chains of sequential judgments. We expect that final estimates become more accurate the more contributors with high expertise enter a sequential chain. Moreover, accuracy should be higher if individuals with high expertise contribute later in the chain, since it becomes less likely that their judgments are changed (and possibly worsened) by other, less-skilled contributors.
To test our predictions, we conducted three experiments. Experiments 1 and 2 focus on a single sequential step of sequential collaboration (i.e., at the level of individual contributions) which allows us to experimentally control the deviation of presented judgments. In contrast, Experiment 3 studies the effects of expertise and presented deviation in actual chains of sequential judgments to examine the role of contributors with varying expertise entering the sequential chain at different points.

Methods
In Experiment 1, we measured expertise in a city-location task and manipulated the presented deviation of judgments before letting participants decide whether to adjust or maintain location judgments with varying distances to the correct answer. To this end, we draw on the paradigm established by Mayer and Heck (2022) to investigate sequential collaboration. In the original study, participants positioned 57 European cities on maps. We modified the paradigm with some of these items serving as a baseline measure of individual expertise. Thereby, expertise was operationalized as knowledge acquired in the past (Schunn and Anderson, 1999). The remaining items were used to examine how participants adjust judgments in terms of change probability and improvement. The study design, sample size, hypotheses, and planned analyses were preregistered at https://aspredicted.org/cj9uu.pdf. Materials, analysis scripts, and data are available at https://osf.io/z2cxv/.

Participants
We recruited 290 participants via a German panel provider. They were compensated with €0.75 for a median study duration of 9.63 minutes. We excluded one participant who provided judgments that were on average more accurate than the mean accuracy of judgments found in a small test sample in which we instructed participants to look up the correct locations of each city before providing a judgment. Furthermore, we excluded eight participants who positioned more than 10% of the cities outside the highlighted areas which marked the countries of interest. After these exclusions, the final sample comprised 281 participants who were on average 46.49 years old (SD = 15.33) with 48.80% of participants being female. Concerning educational background, 15.70% had a college degree, 15% held a high school diploma, 31.10% had vocational education, and 38.20% had a lesser educational attainment.

Materials and procedure
Participants had to locate 57 European cities on seven different European maps, namely (1) Austria and Switzerland, (2) France, (3) Italy, (4) Spain and Portugal, (5) United Kingdom and Ireland, (6) Germany, and (7) Poland, Czech Republic, Hungary, and Slovakia. All maps had a resolution of 800 × 500 pixels and were scaled to 1:5,000,000. Table A1 in the Appendix provides a list of all cities and the phase they were presented in. The 17 cities used for measuring expertise were selected based on the accuracy of independent location judgments for these cities collected in Experiment 3 of Mayer and Heck (2022). Cities were selected to have a wide range of difficulty while ensuring that all seven European maps were represented.
In the first phase of Experiment 1, participants provided independent location judgments for the 17 cities which served as a measure of expertise. Each trial showed the instruction to place one city on the map as accurately as possible. Next, in the sequential phase, participants were instructed that each map already showed a location judgment of a previous participant and that they could decide whether to adjust or maintain the position. Again, only one of the remaining 40 cities was presented in each trial, but the map already contained a preselected location judgment. By clicking on the map, participants could adjust the presented judgment by providing a new position for the city, whereas clicking the button 'continue' allowed them to maintain the presented judgment. Participants were not provided with any additional information about the source of the judgment or the expertise of the previous contributor. Figure 2 displays the map of Italy with four preselected location judgments for Rome reflecting different distances from the correct location. All presented judgments were placed in the country or countries of interest colored in white. The seven maps and the corresponding cities were presented in block-randomized order (i.e., both the order of maps and of cities within maps were randomized).
Participants also provided demographic information and indicated their subjective knowledge concerning the locations of large European cities. Finally, they were debriefed and thanked for their participation.
Unknown to the participants, the locations presented in the sequential phase were not provided by other participants. Instead, we manipulated the presented deviation by selecting locations with a certain Euclidean distance to the correct answer (0, 40, 80, or 120 pixels). The presented deviations were selected based on judgments obtained in Mayer and Heck (2022) and pretested in a pilot study ensuring that participants were on average able to improve the presented distances. Moreover, in the Supplementary Materials (https://osf.io/z2cxv/), we show that the presented deviations correspond to plausible values from the empirical distribution of independent judgments of participants (which were collected for measuring expertise). The plots for all cities show that correct judgments as well as distances of 40, 80, or 120 pixels were inside the range of provided answers. For all 40 cities, one deviation was randomly selected such that each deviation was presented 10 times. For each map, the four levels of presented deviations were duplicated as rarely as possible. To manipulate presented deviation, it was necessary to deceive participants about the presented locations allegedly being judgments of other participants. The study was reviewed and approved by the ethics committee of the University of Mannheim, and participants were debriefed after participation. To ensure that participants complied with the instructions and completed the study without technical issues, the online study was accessible only for participants using a computer (but not for mobile devices). We prevented looking up correct answers by implementing a time limit of 40 seconds for each response. Moreover, participants were excluded during the study if they left the browser tab more than five times despite repeated warnings.
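The manipulation of presented deviation can be sketched in a few lines. The code below is only an illustration under simplifying assumptions: it balances the four deviation levels across 40 items and places a location at a given Euclidean distance in a random direction. It does not enforce two constraints described above (keeping the location inside the highlighted countries and spreading levels evenly within each map), and the function names are hypothetical.

```python
import math
import random

DEVIATIONS = [0, 40, 80, 120]  # pixel distances to the correct location

def balanced_deviations(n_items, rng):
    """Assign each deviation level equally often, in random order."""
    reps = n_items // len(DEVIATIONS)
    levels = DEVIATIONS * reps
    rng.shuffle(levels)
    return levels

def presented_location(correct_xy, distance, rng):
    """Return a point at the given Euclidean distance from the correct
    location, in a uniformly random direction (an assumption of this sketch)."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    x, y = correct_xy
    return (x + distance * math.cos(angle), y + distance * math.sin(angle))

if __name__ == "__main__":
    rng = random.Random(1)
    levels = balanced_deviations(40, rng)
    print(levels[:8])
    print(presented_location((400.0, 250.0), levels[0], rng))
```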

Results and discussion
We estimated participants' expertise based on the independent location judgments for the 17 cities that were shown without a previous judgment. As an operationalization of expertise, we computed the mean of the Euclidean distances between the location judgments and the correct positions for each participant. To ensure that larger values indicate higher expertise, we multiplied the average distances by −1. This measure of expertise was included as a continuous predictor in the analyses below. To assess the validity of this task-related expertise measure, we computed the correlation with the self-reported knowledge about the location of European cities. The large, positive correlation of r = 0.43 (t(279) = 7.91, p < .001) indicates a satisfactory convergent validity.
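As a minimal sketch of this operationalization, the following code computes the negative mean Euclidean distance per participant and then standardizes the scores, as done for the later analyses. The data layout (nested dictionaries) and the use of the sample standard deviation for standardization are assumptions made for illustration.

```python
import math
from statistics import mean, stdev

def expertise_scores(judgments, correct):
    """judgments: {participant: {city: (x, y)}}; correct: {city: (x, y)}.

    Returns standardized expertise scores, where larger values indicate
    higher expertise (negative mean distance to the correct positions).
    """
    raw = {
        p: -mean(math.dist(xy, correct[city]) for city, xy in cities.items())
        for p, cities in judgments.items()
    }
    m = mean(raw.values())
    s = stdev(raw.values())  # sample SD; the choice of SD is an assumption
    return {p: (v - m) / s for p, v in raw.items()}

if __name__ == "__main__":
    correct = {"Rome": (100.0, 100.0), "Paris": (200.0, 50.0)}
    judgments = {
        "p1": {"Rome": (103.0, 104.0), "Paris": (200.0, 50.0)},
        "p2": {"Rome": (160.0, 180.0), "Paris": (230.0, 90.0)},
        "p3": {"Rome": (120.0, 100.0), "Paris": (200.0, 80.0)},
    }
    print(expertise_scores(judgments, correct))
```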
We tested the effects of participants' expertise, presented deviation, and their interaction on change probability using a generalized linear mixed model. The model used a logistic link function to predict the decision whether to adjust (= 1) or maintain (= 0) a presented judgment. We standardized our expertise measure for all analyses. Moreover, we applied a mean-centered linear contrast with values −1.5, −0.5, 0.5, and 1.5 for the four levels of deviations between presented and correct locations. Standardizing the expertise measure and applying a mean-centered contrast to the presented deviations allows us to interpret the additive terms in the model as main effects and the multiplicative term as interaction. The model accounts for the nested data structure by including random intercepts for items and participants (Pinheiro and Bates, 2000). Figure 3A displays the average change probability for cities depending on participants' expertise and presented deviation. Table 1 [...] However, contrary to our predictions, Figure 3A shows that individuals with higher expertise changed correct judgments more frequently than individuals with lower expertise. The high change probability for accurate presented judgments may be due to demand effects. In fact, participants did not know that 25% of the presented judgments had a perfect accuracy. Hence, they may not have expected that optimal behavior required maintaining a substantial proportion of the presented judgments.
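For reference, the model described in this paragraph can be written out as follows. The notation is introduced here for illustration and is not taken from the original: $Y_{ij}$ is the decision of participant $i$ on item $j$ (1 = adjust, 0 = maintain), $E_i$ is the standardized expertise score, and $D_{ij} \in \{-1.5, -0.5, 0.5, 1.5\}$ is the linear contrast for the presented deviation.

```latex
\operatorname{logit} P(Y_{ij} = 1)
  = \beta_0 + \beta_1 E_i + \beta_2 D_{ij} + \beta_3 E_i D_{ij} + u_i + v_j,
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2), \quad
v_j \sim \mathcal{N}(0, \sigma_v^2)
```

Because both predictors are centered, $\beta_1$ and $\beta_2$ can be read as main effects and $\beta_3$ as the interaction, while $u_i$ and $v_j$ are the random intercepts for participants and items.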
Next, we examined the effects of presented deviation, expertise, and their interaction on the improvement of presented judgments. As dependent variable, we computed the improvement as the difference in accuracy between the presented and the revised judgments. For this purpose, accuracy was defined as the Euclidean distance between a judgment and the correct position, and improvement as the distance of the presented judgment minus the distance of the revised judgment. For the presented judgments, accuracy is thus equivalent to the presented deviation (i.e., a distance of 0, 40, 80, or 120 pixels to the correct position). Positive (negative) values of the improvement measure imply that a revised judgment is more (less) accurate than the presented judgment. Since participants could decide whether to adjust or maintain presented judgments, we only included trials in the analysis in which participants actually adjusted the presented judgments.
Table 1 note: CI = confidence interval. All models included crossed random effects for participants and items. The models for change probability (0 = no adjustment, 1 = adjustment) assumed a logistic link function.
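The improvement measure can be sketched as below. The sign convention follows the stated interpretation (positive values mean the revised judgment is closer to the correct position), and maintained judgments are dropped before analysis. The trial layout with `presented`, `revised`, and `correct` keys is a hypothetical format chosen for this example.

```python
import math

def improvement(presented_xy, revised_xy, correct_xy):
    """Distance of the presented judgment minus distance of the revised
    judgment; positive values indicate a more accurate revision."""
    return math.dist(presented_xy, correct_xy) - math.dist(revised_xy, correct_xy)

def improvements_for_adjusted(trials):
    """Keep only trials in which the presented judgment was adjusted
    (revised is None when the judgment was maintained)."""
    return [
        improvement(t["presented"], t["revised"], t["correct"])
        for t in trials
        if t["revised"] is not None
    ]

if __name__ == "__main__":
    trials = [
        {"presented": (40.0, 0.0), "revised": (10.0, 0.0), "correct": (0.0, 0.0)},
        {"presented": (0.0, 0.0), "revised": None, "correct": (0.0, 0.0)},
    ]
    print(improvements_for_adjusted(trials))
```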
We used improvement as dependent variable in a linear mixed model with (standardized) expertise and presented deviation (linear contrast) as independent variables and added random intercepts for participants and items. Figure 3B displays the average improvement in judgment accuracy, whereas Table 1 shows the estimated regression coefficients. As expected, improvement increased for larger presented deviations (β = 32.289, CI = [31.398, 33.181]) and higher expertise (β = 15.545, CI = [13.492, 17.598]). In line with the expected pattern shown in Figure 1, the model also showed a significant interaction such that more knowledgeable participants showed a steeper increase in improvement than less knowledgeable participants (β = 3.819, CI = [2.930, 4.707]).

Experiment 2
Experiment 1 allows only weak causal conclusions since expertise was merely measured rather than manipulated. As a remedy, we implemented a new study design in which expertise was operationalized as a skill or strategy. We manipulated the level of expertise in a random-dots estimation task (Honda et al., 2022) in which participants had to estimate the number of randomly positioned, colored dots. Participants in the experimental group learned a strategy to provide accurate estimates for the number of presented points. Importantly for sequential collaboration, the same strategy can also be used to evaluate the accuracy of presented judgments. In contrast, participants in the control condition completed a control task and should thus have no advantage in providing and evaluating judgments. In a pilot study, we examined whether the manipulation of expertise was successful and whether participants in the control condition came up with any solution strategy themselves, which was not the case. The preliminary data were also used to calibrate the time limit per item and to define outliers. Hypotheses, study design, sample size, and planned analyses were preregistered at https://aspredicted.org/8c6wh.pdf. Materials, data and analysis scripts are available at https://osf.io/z2cxv/.

Participants
We recruited 124 college students from the University of Marburg and a study exchange platform. Participants received course credit or the opportunity to take part in a gift-card lottery in exchange for participation. The median time to complete the study was 17.30 minutes. We excluded five participants from the analysis. One participant did not complete the study conscientiously, one vastly underestimated and another vastly overestimated the number of dots for most items, one almost always provided the perfectly exact number of dots, and one did not answer the attention-check questions about the instructions correctly. The remaining 119 participants (69.70% female) had a mean age of 25.50 (SD = 9.94).

Procedure
Participants were randomly assigned either to the expertise-manipulation condition (referred to as 'experts' in the following) or the control condition ('novices'). Experts were introduced to raster scanning, a strategy for accurately estimating the number of objects on a presented image by mentally overlaying a 3 × 3 raster on top of the presented image. With the raster in mind, one can pick one of the nine areas with an approximately average number of dots and count the number of dots within this box. Next, one simply multiplies the result by nine to obtain an estimate for the total number of dots in the image. To facilitate multiplication in one's head, we advised participants to multiply the number of dots by 10 and then subtract the number of counted dots once. Participants in the control condition only read an essay about the importance of accurate judgments. Afterward, both groups answered four attention-check questions concerning the instructions.
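The raster-scanning strategy can be written down as a short procedure. The sketch below overlays a 3 × 3 grid on a 600 × 600 image, picks a cell with a typical number of dots, and multiplies by nine using the mental shortcut described above (times ten, minus the count once). Using the median cell as a stand-in for "an approximately average number of dots" is an assumption of this example, as are the function names.

```python
from statistics import median_low

def cell_counts(dots, size=600, k=3):
    """Count dots in each cell of a k x k raster over a size x size image."""
    counts = [0] * (k * k)
    cell = size / k
    for x, y in dots:
        col = min(int(x // cell), k - 1)
        row = min(int(y // cell), k - 1)
        counts[row * k + col] += 1
    return counts

def raster_scan_estimate(dots, size=600, k=3):
    """Estimate the total number of dots from one typical raster cell."""
    counts = cell_counts(dots, size, k)
    typical = median_low(counts)  # stand-in for the "approximately average" cell
    # Multiply by nine as advised: times ten, then subtract the count once.
    return typical * 10 - typical

if __name__ == "__main__":
    import random
    rng = random.Random(7)
    dots = [(rng.uniform(0, 600), rng.uniform(0, 600)) for _ in range(350)]
    print(len(dots), raster_scan_estimate(dots))
```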
As practice trials, all participants had to independently estimate the number of dots for five images. Only in the experimental condition, these five images were overlaid with a visible 3 × 3 raster to train raster scanning. Next, participants saw another set of five images, now always shown without a raster, and were again asked to provide independent judgments for the number of presented dots. The judgments in this phase served as a manipulation check. The following sequential phase was similar to the one in Experiment 1. In each trial, participants saw one of the 30 remaining images (20 test images and 10 easy images for motivational purposes), each with an (alleged) judgment of a previous participant regarding the number of shown dots. They decided whether to adjust or maintain the presented judgment by clicking on respective buttons. Only if they decided to adjust the presented judgment did an open-text box appear in which the new judgment could be entered. After indicating to maintain the judgment or providing a new judgment, participants continued to the next image. The images were shown in random order with a time limit of 60 seconds (including a warning after 40 seconds). As in Experiment 1, presented judgments were not provided by previous participants but rather preselected to manipulate the deviation of presented judgments from the correct answer. Participants received no additional information about the (alleged) previous contributor. After providing demographic information, participants in the experimental condition were asked whether they had actually used raster scanning, whereas participants in the control condition were asked whether they had used any particular strategy to estimate the number of dots. Finally, we asked whether they completed the study conscientiously and debriefed participants.

Materials
We generated 30 images (600 × 600 pixels; Figure 4) with white background depicting between 100 and 599 randomly positioned, nonoverlapping, colored dots using the R package ggplot2 (Wickham, 2016). Five of these images were used to train participants, and five were used for the manipulation check. The remaining 20 images were shown jointly with an (alleged) judgment of the number of dots. These preselected values were either correct (deviation = 0%) or deviated by ±35% or ±70% from the correct answer. In contrast to Experiment 1, presented deviations were not randomly assigned to items but were fixed for all participants. Moreover, for motivational purposes, we also showed 10 additional images depicting only 10 to 59 dots, which were displayed with a judgment that was either correct or deviated by ±20% or ±35% from the correct answer. For these items, a pilot study showed that participants in both conditions could easily detect whether the presented judgment was correct or not since the time limit allowed them to simply count the small number of dots. We manipulated the deviation of presented judgments on five levels. As in Experiment 1, it is thus in principle possible for participants to infer the manipulation if they knew the exact number of dots presented. However, since the deviation was not operationalized by a fixed additive error but rather by a multiplicative constant, it is unlikely that participants could detect the manipulation or acted differently than they would have when seeing actual judgments of previous participants. Moreover, the Supplementary Materials (https://osf.io/z2cxv/) provide plots showing that the five levels of presented deviations fall within the empirical distribution of the independent judgments which were collected for the manipulation check.
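To illustrate why a multiplicative deviation is harder to spot than a fixed additive error, the following Python sketch computes presented judgments for the five deviation levels. The function name and the rounding to whole dots are assumptions made for illustration; the actual stimuli were generated with R.

```python
# Five deviation levels for the test images, in percent of the correct answer.
DEVIATIONS_PCT = (-70, -35, 0, 35, 70)


def presented_judgment(true_count: int, deviation_pct: int) -> int:
    """Return a preselected judgment that deviates from the truth by a fixed percentage."""
    # Rounding to a whole number of dots is an assumption of this sketch.
    return round(true_count * (100 + deviation_pct) / 100)


# Hypothetical images with 200 and 400 dots.
for true_count in (200, 400):
    values = [presented_judgment(true_count, d) for d in DEVIATIONS_PCT]
    print(true_count, values)
```

For the same +35% level, the absolute error is 70 dots for the 200-dot image but 140 dots for the 400-dot image, so no constant offset recurs across items.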

Results and discussion
To test whether the manipulation was successful, we examined whether experts showed a higher accuracy than novices for the five items shown without a previous judgment. As a measure of accuracy, we computed the percentage error for each item, defined as the absolute difference between the judgment and the correct answer divided by the correct answer and multiplied by 100. Using this measure allowed us to analyze average accuracy across items even though the number of dots varied from 100 to almost 600. Including only the independent judgments for the five items in the manipulation-check phase, we fitted a linear mixed model with condition as independent variable (dummy-coded with 1 = expert condition, 0 = novice condition). We found a significant negative effect of condition on the percentage error; that is, judgments in the expert condition were more accurate than those in the novice condition.
We first tested the expected patterns for change probability shown in Figure 1A. While expertise was coded with a dummy contrast (1 = expert condition, 0 = novice condition), we used two orthogonal, centered contrasts for presented deviation. Since the presented deviation includes both over- and underestimation of the correct answer, we used a centered, V-shaped contrast (values: 4, −1, −6, −1, 4) to test whether change probability is lowest for correct presented judgments but increases the more presented judgments deviate from the correct judgment. The regression coefficient of this contrast is positive for a V-shape, negative for an inverse V-shape, and zero in the absence of such an effect. Participants, however, may not adjust over- and underestimated presented judgments equally often. Hence, we also included a centered, linear contrast (values: 2, 1, 0, −1, −2) which tests whether the slope of the V-shaped contrast differs between these two cases. A value of zero indicates a symmetric V-shape, whereas a positive (negative) coefficient indicates a steeper (less steep) slope for underestimated than for overestimated presented judgments.
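The two properties claimed for the contrasts, being centered and mutually orthogonal, can be checked directly. The short Python sketch below uses the contrast values given in the text; the helper names are hypothetical.

```python
# Contrast values for the five presented-deviation levels, as given in the text.
v_contrast = [4, -1, -6, -1, 4]
linear_contrast = [2, 1, 0, -1, -2]


def is_centered(contrast):
    """A contrast is centered if its weights sum to zero."""
    return sum(contrast) == 0


def dot(a, b):
    """Dot product of two contrasts; zero means they are orthogonal."""
    return sum(x * y for x, y in zip(a, b))


print(is_centered(v_contrast), is_centered(linear_contrast))
print(dot(v_contrast, linear_contrast))
```

Because the contrasts are orthogonal, the V-shaped and the linear coefficient can be interpreted separately: the first captures the overall increase in change probability with deviation, the second the asymmetry between the two limbs.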
Figure 5A illustrates the average change probability including 99% confidence intervals. Change probabilities followed the expected V-shape as a function of the presented deviation. Moreover, experts generally changed items more frequently than novices. This impression was confirmed by a significant, positive V-shaped contrast for presented deviation in the linear mixed model (β = 0.208, CI = [0.160, 0.256]) and a significant positive effect of condition (β = 0.570, CI = [0.169, 0.972]) (Table 1). Moreover, we found a positive linear contrast indicating a larger effect of presented deviation (i.e., a steeper slope of the V-shape) for underestimated than for overestimated judgments (β = 0.312, CI = [0.176, 0.448]). As expected, the interaction between condition and the V-shaped contrast of the presented deviation was positive, suggesting that experts better distinguished between accurate and inaccurate judgments, although the confidence interval narrowly included zero (β = 0.056, CI = [−0.003, 0.114]). However, in contrast to our predictions, participants in the expert condition adjusted correct presented judgments more frequently than participants in the novice condition (Figure 5). Besides demand effects, this could be due to the raster-scanning strategy providing only an approximate estimate of the actual number of presented dots. While the approximation leads to improved judgments, it is still prone to errors. Hence, participants may have adjusted presented judgments even when these were already correct. Lastly, we found a significant interaction between condition and the linear contrast of presented deviation indicating that the V-shape was more symmetric (with respect to over- and underestimated judgments) for experts than for novices (β = −0.329, CI = [−0.503, −0.155]).
Next, we tested whether expertise, presented deviation, and their interaction affect the improvement of presented judgments. Similar to Experiment 1, improvements were computed as the difference between the percentage errors of the presented and the revised judgment. Again, positive (negative) values indicate that presented judgments are improved (worsened). We used a linear mixed model to predict the improvement of presented judgments using the same contrasts for condition and presented deviation as in the model for change probability. As in Experiment 1, we only included trials in which participants adjusted the presented judgment. Figure 5B displays the mean improvement of presented judgments including 99% confidence intervals and violin plots; Table 1 shows the estimated regression coefficients. As expected, presented deviation had a significant V-shaped effect on improvement such that presented judgments were improved more the larger the deviation from the correct judgment was (β = 6.782, CI = [6.323, 7.240]). Compared with the novice condition, participants in the expert condition improved presented judgments more if there was room for improvement and also worsened correct judgments less (β = 8.827, CI = [4.458, 13.196]). Furthermore, the model showed a positive interaction between condition and the V-shaped contrast for presented deviation (β = 0.647, CI = [0.191, 1.104]). These results are closely in line with the expected patterns derived from our theoretical framework (Figure 1).
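As a minimal sketch of the two dependent measures defined in the text, the following Python functions compute the percentage error of a judgment and the improvement achieved by a revision. The function names and the example numbers are hypothetical.

```python
def percentage_error(judgment: float, correct: float) -> float:
    """Absolute deviation from the correct answer, in percent of the correct answer."""
    return 100 * abs(judgment - correct) / correct


def improvement(presented: float, revised: float, correct: float) -> float:
    """Percentage error of the presented judgment minus that of the revised judgment.

    Positive values mean the revision improved the presented judgment,
    negative values mean it was worsened.
    """
    return percentage_error(presented, correct) - percentage_error(revised, correct)


# Hypothetical trial: 200 dots shown, presented judgment 270, revised to 220.
print(improvement(presented=270, revised=220, correct=200))
# Hypothetical trial in which a correct judgment is worsened.
print(improvement(presented=200, revised=230, correct=200))
```

Expressing both errors relative to the correct answer keeps the measure comparable across images with very different numbers of dots.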

Experiment 3
Experiments 1 and 2 show that change probability and improvement of presented judgments depend on expertise, presented deviation, and their interaction. However, both studies implemented only a single incremental step in sequential collaboration using preselected values for the presented judgments.
The same effects should hold if individuals encounter actual judgments of previous individuals rather than preselected judgments. Moreover, as outlined in the Introduction, the benefits of expertise on the accuracy of chains of sequential judgments should be especially pronounced for the final estimates.
To test these assumptions, we again relied on the random-dots estimation task using the raster-scanning strategy as a manipulation of expertise. However, we now implemented a sequential-collaboration paradigm in which participants actually encountered judgments made by previous participants (Mayer and Heck, 2022). The design allowed us to manipulate the number and position of experts and novices in a sequential chain. The hypotheses, study design, sample size, and planned analyses were preregistered at https://aspredicted.org/HZT_QW3. Materials, data, and analysis scripts can be found at https://osf.io/z2cxv/.

Materials and procedure
We used the same experimental paradigm as in Experiment 2 while making some minor changes. In the expertise condition, we already excluded participants during participation if they did not answer at least three questions about the raster-scanning strategy correctly. Thereby, we avoided the necessity to exclude participants who subsequently contributed to the same sequential chains. We also generated five new items for the sequential-collaboration phase.
Participants were randomly assigned either to the expertise-manipulation or the control condition. We then built sequences of two participants which differed with respect to the status and order of contributors (i.e., novice-novice, expert-novice, novice-expert, and expert-expert). As in Experiment 2, the first participant in each chain saw preselected judgments which were either correct or deviated by ±35% or ±70% from the correct number of dots. Again, the distribution of independent judgments obtained in the manipulation-check phase revealed that the manipulation for preselected judgments resembled actual judgments made by participants (Supplementary Materials, https://osf.io/z2cxv/). If the first participant in a chain made an adjustment, the second participant saw the revised judgment; otherwise, the second participant saw the originally presented judgment. Overall, this procedure results in a mixed design with expertise and composition of dyads as between-subjects factors and presented deviation as within-subjects factor.

Participants
Using a German panel provider, we recruited 464 participants who were compensated by the panel provider for a median participation time of 18.30 minutes. We excluded one participant because they answered '1' to all items, which in turn made it necessary to remove another participant assigned to the same sequential chain. Moreover, we excluded five participants for technical reasons due to duplicate assignments to sequential chains. The final sample included 457 participants (46.80% female) with a mean age of 46.16 (SD = 14.36) and heterogeneous educational background (college degree: 34.80%; high-school diploma: 26.00%; vocational education: 24.10%; lesser educational attainment: 15.10%).

Results and discussion
We computed the same dependent measures as in Experiment 2. As a manipulation check, we fitted a linear mixed model to test whether the independent judgments for the five items during the manipulation-check phase were more accurate for experts than for novices. As expected, the expertise manipulation led to a decrease in the percentage error (β = −28.898, CI = [−36.319, −21.477], t(117.16) = −4.277, p < .001), indicating that judgments of experts were about twice as accurate as those of novices (mean error = 27.46% vs. 56.36%, respectively).

Effects of one step in sequential collaboration
Replicating the analyses of Experiment 2, we first tested the effects of presented deviation, expertise, and their interaction on change probability and improvement of presented judgments in a single sequential step. We analyzed change probabilities in the sequential phase by including only those participants who saw preselected judgments (but not those who saw the judgments of other participants). As in Experiment 2, we fitted a generalized linear mixed model to predict whether a presented judgment was changed using the same contrasts for presented deviation and condition. Figure 6A displays the average change probabilities in Experiment 3. As expected, the V-shaped effect of presented deviation emerged, with a steeper slope for underestimated than for overestimated judgments. In contrast to Experiment 2, the plot does not indicate a main effect of condition. These impressions were supported by the model-based analysis (Table 1).
Most importantly for sequential collaboration, Figure 6A illustrates the interaction of expertise and presented deviation on change probability. In line with our expectations, experts adjusted presented judgments less often than novices if judgments were already correct, but more often if judgments deviated by ±70% from the correct answer. In the mixed model, the corresponding interaction term of condition and the V-shaped contrast was significant (β = 0.075, CI = [0.040, 0.111]). As described in the Introduction, given that we predicted and found a crossed interaction (Figure 1), the absence of a main effect of expertise may simply be due to a limited range of presented deviations.
Next, we tested the effect of expertise and presented deviation on the improvement of presented judgments. For this analysis, we only included participants at the first chain position. Figure 6B displays the improvement of presented judgments which followed a V-shaped pattern, with already correct presented judgments being slightly worsened.
We fitted a linear mixed model for the percentage improvement again using the same contrasts for condition and presented deviation. In line with our hypotheses, the model showed a V-shaped effect of presented deviation (β = 6.653, CI = [6.290, 7.016]), a main effect of condition, indicating more improvement of judgments provided by participants in the expertise than in the novice condition (β = 16.518, CI = [10.701, 22.336]), but no interaction of condition and presented deviation (β = −0.024, CI = [−0.527, 0.479]). Moreover, the interaction between the linear slope for presented deviation and expertise was significant, indicating a steeper slope for the left than the right limb of the V-shape for experts compared with novices.

Robustness analyses using all participants
As a robustness check, we also tested our predictions for a single sequential step while including participants at both chain positions. The deviation of presented judgments thus becomes a continuous variable since participants at the second chain position may see revised judgments of participants at the first position. In the linear mixed models, we thus included the standardized deviation and the corresponding, quadratic trend as predictors.
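The two predictors used in this robustness check can be sketched as follows. The example deviations are hypothetical, and the use of the sample standard deviation for standardization is an assumption, since the text does not specify it.

```python
from statistics import mean, stdev

# Hypothetical presented deviations (in percent), including values that
# participants at the second chain position might see after a revision.
deviations = [-70.0, -35.0, 0.0, 12.5, 48.0, 70.0]


def standardize(values):
    """z-standardize values using the sample standard deviation (an assumption here)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


z = standardize(deviations)          # standardized deviation (linear predictor)
z_squared = [x ** 2 for x in z]      # corresponding quadratic trend

for linear_term, quadratic_term in zip(z, z_squared):
    print(round(linear_term, 3), round(quadratic_term, 3))
```

With a continuous deviation, the quadratic term takes over the role of the V-shaped contrast: it is large for both strong over- and underestimation and near zero for deviations close to the mean.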

Effects on chains of sequential judgments
We tested the expected impact of experts at the chain level based on data of participants at the second chain position. We first fitted a generalized linear mixed model to predict whether change probability for the second contributor in a sequential chain differed between the four compositions of sequential chains (i.e., novice-novice, expert-novice, novice-expert, or expert-expert). For this purpose, we implemented two contrasts: The first compared novice-novice chains against expert-novice chains, whereas the second compared novice-expert chains against expert-expert chains. In line with our expectations, change probability was larger for experts correcting novices than for experts correcting other experts (β = 0.326, CI = [0.063, 0.588], z = 2.432, p = .015). In contrast, novices changed the entries of experts and novices similarly frequently (β = 0.136, CI = [−0.098, 0.370], z = 1.140, p = .254). These patterns are illustrated in Figure 7A.
To test how novices and experts improve each other's judgments, we only considered judgments that were adjusted by participants at the second chain position and implemented a linear mixed model with percentage improvement as dependent variable and composition of sequential chain as predictor. We additionally used Helmert contrasts to test our expectations concerning the improvement of each other's judgments by contrasting the novice-expert chain with all other chains, the expert-novice chain with the novice-novice and expert-expert chains, and, lastly, testing the novice-novice and expert-expert chains against each other. Figure 7B displays the empirical means for percentage improvement for all compositions of sequential chains. In line with this pattern, we found a significant Helmert contrast for the novice-expert sequential chain (β = 3.760, CI = [1.264, 6.256], t(215.08) = 2.952, p = .004). Furthermore, we found a significant contrast for the expert-novice chain (β = −3.852, CI = [−7.227, −0.477], t(221.47) = −2.237, p = .026). In fact, Figure 7B shows that novices worsen judgments of experts. Lastly, we did not find a significant difference in improvement between expert-expert and novice-novice groups (upper bound of the CI = 0.208, t(222.70) = −1.894, p = .060).
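The three Helmert contrasts described above can be written down as weight vectors over the four chain compositions. The Python sketch below shows one possible coding; the ordering of the groups, the scaling, and the signs of the weights are assumptions for illustration and need not match the coding used in the analysis scripts.

```python
# Assumed ordering of the four chain compositions (first-second contributor).
groups = ["novice-expert", "expert-novice", "novice-novice", "expert-expert"]

# Helmert-type contrasts as described in the text (scaling and signs assumed).
helmert = {
    "novice-expert vs. all other chains": [3, -1, -1, -1],
    "expert-novice vs. novice-novice and expert-expert": [0, 2, -1, -1],
    "novice-novice vs. expert-expert": [0, 0, 1, -1],
}


def dot(a, b):
    """Dot product of two contrasts; zero means they are orthogonal."""
    return sum(x * y for x, y in zip(a, b))


# Each contrast is centered, and all pairs are mutually orthogonal.
names = list(helmert)
for name in names:
    print(name, sum(helmert[name]))
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(names[i], "x", names[j], dot(helmert[names[i]], helmert[names[j]]))
```

Under this coding, a positive first coefficient means more improvement in novice-expert chains than in the remaining chains on average, which matches the direction of the reported effect.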
Overall, these findings are in line with our expectations that experts improve judgments of novices most, novices worsen judgments of experts, and only little improvement can be found when novices correct novices and experts correct experts. Finally, we tested which composition of sequential chains led to the most accurate estimates at the end of a sequential chain. We fitted a linear mixed model with percentage error of the final judgment in a sequential chain as dependent variable and chain composition as predictor. Depending on whether the two participants in each chain adjusted the presented judgment, the final judgment could either be the presented judgment, the judgment entered by the first participant, or the judgment entered by the second participant. We used a linear contrast to test whether percentage error decreases, or equivalently, whether accuracy increases across chain compositions.
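The rule for determining the final judgment of a two-person chain can be sketched as follows. The function name and the use of `None` to represent a decision to maintain the judgment are assumptions of this illustration.

```python
from typing import Optional


def final_judgment(presented: int, first: Optional[int], second: Optional[int]) -> int:
    """Return the final judgment of a two-person sequential chain.

    `first` and `second` are the judgments entered by the two contributors;
    None means that the contributor maintained the judgment they saw.
    """
    # The second contributor sees the revised judgment if the first one
    # adjusted it, and the originally presented judgment otherwise.
    seen_by_second = first if first is not None else presented
    # The chain ends with the second contributor's entry if there is one.
    return second if second is not None else seen_by_second


# Hypothetical chains starting from a presented judgment of 270 dots.
print(final_judgment(270, None, None))   # nobody adjusts
print(final_judgment(270, 220, None))    # only the first contributor adjusts
print(final_judgment(270, 220, 205))     # both contributors adjust
```

The percentage error of this final value is the dependent variable of the chain-level model.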
As expected, we found a significant linear trend between chain composition and accuracy of the final estimates (β = 5.779, CI = [2.199, 9.359], t(216.79) = 3.164, p = .002). Figure 7C illustrates this pattern with the percentage error being largest for sequential chains with two novices and smallest for sequential chains with two experts. Regarding mixed sequential chains which included both an expert and a novice, the percentage error was smaller when chains ended rather than started with an expert. Overall, the more and the later experts enter sequential chains, the better the final estimates.

General discussion
In three experiments, we studied when and how contributors with varying expertise adjust presented judgments in sequential collaboration. The results for individual contributions (i.e., a single sequential step) show that the probability of changing a judgment increases as the deviation from the correct answer increases and as participants' expertise increases. Most importantly, compared with novices, contributors with high expertise were better at distinguishing between accurate and inaccurate judgments as indicated by a steeper slope of the change probability. Core aspects of the predicted data pattern in Figure 1 were thus supported. However, the data did not consistently show that contributors with high expertise adjust perfectly accurate judgments less frequently than contributors with low expertise. Concerning the accuracy of revised judgments, the improvement of presented judgments increased for larger presented deviations and higher expertise in two of three experiments.
Expertise is thus an important predictor of change probability and the amount of improvement of judgments in sequential collaboration. This supports our theoretical assumption that contributors adjust and maintain judgments based on their expertise which in turn facilitates an implicit weighting of judgments. Even though this weighting happens at the individual level within each sequential step, the increased accuracy due to overweighting judgments of contributors with higher expertise can be observed at the chain level (Mayer and Heck, 2022). The data provided evidence for an important prerequisite for such a weighting, namely, contributors with high expertise better differentiate between accurate and inaccurate judgments than contributors with low expertise. However, we found only mixed support for our prediction that contributors with high expertise have a lower change probability for perfectly accurate judgments than contributors with low expertise.
In Experiment 3, we also studied chains of two sequential judgments. As expected, experts adjusted judgments of novices more frequently than those of other experts, and experts improved judgments of novices most, whereas novices tended to worsen judgments of experts. Moreover, the final estimates of sequential chains became more accurate the more and the later experts entered a sequential chain. This shows that the number of experts and the position in which they enter a sequential chain affect the accuracy of group estimates. Accurate judgments of experts at the beginning of a sequential chain may be worsened by novices later, in turn resulting in reduced accuracy. In contrast, possibly inaccurate judgments by novices at the beginning can be corrected by experts later.
Our findings add to the literature on the wisdom of crowds, supporting the notion that weighting judgments by expertise increases accuracy (Budescu and Chen, 2014;Mayer and Heck, 2023;Merkle et al., 2020). In contrast to other experimental designs and statistical techniques, sequential collaboration does not require researchers to identify experts before or after the judgment task. Instead, sequential collaboration results in an implicit weighting of judgments by expertise. This is achieved by the contributors' metacognitive assessment of whether they can improve a presented judgment. Our results thus shed light on the mechanisms of why the aggregation of individual judgments in sequential collaboration results in high accuracy. Note, however, that the evidence for the high accuracy of sequential collaboration is still sparse (Mayer and Heck, 2022;Miller and Steyvers, 2011). Thus, further studies are necessary to test the robustness and performance of sequential collaboration in different tasks and populations.

Limitations and future research directions
In all our experiments, we deceived participants about the source of the presented judgments. Both the presented city locations and the number of dots were not judgments of previous participants as stated in the instructions. Instead, we manipulated presented deviation experimentally by generating hypothetical judgments that closely resembled actual judgments. Even though the manipulation used only few levels of deviation, participants would require substantive knowledge about the correct answers for a considerable amount of items in order to become aware of the manipulation. Moreover, due to the design of the sequential-collaboration paradigm (Mayer and Heck, 2022), it is plausible that presented judgments were actually made by other participants previously. This is also supported by the empirical distribution of independent judgments which were collected for measuring expertise (Experiment 1) and as a manipulation check (Experiments 2 and 3). For these items, the preselected deviations fall within the distribution of actual judgments, which provides evidence for their plausibility. In addition, a design presenting participants with authentic judgments by others was implemented in Experiment 3 in which participants formed sequential chains and encountered actual judgments of previous participants.
Irrespective of the source of the presented judgments, all three studies provided converging evidence for the predicted data patterns. Overall, it thus seems unlikely that participants noticed the manipulation and acted differently toward the presented judgments than they would have when seeing actual judgments of previous participants.
We designed the tasks in our experiments to be highly demonstrable, meaning that contributors have the opportunity to demonstrate their expertise (Bonner et al., 2022). However, demonstrability can still be low if participants are not sufficiently motivated to complete a task (Laughlin and Ellis, 1986). If this was the case in our study, contributors with high expertise may have opted out more frequently than would be beneficial for achieving a high accuracy. Also, contributors may have provided generally imprecise judgments and guesses to proceed more quickly. However, these appear to be minor concerns for the validity of our results. Moreover, in 'natural' applied settings (e.g., online collaborative projects), the motivation of volunteers to provide demonstrable solutions should be very high.
Our studies show that expertise predicts change probability and improvement in chains of sequential judgments. However, it remains unclear whether the high accuracy of final estimates is due to the sequential judgment process itself or due to the possibility to opt out of answering. Bennett et al. (2018) showed that merely providing an opportunity to opt out increases the accuracy of independent individual judgments. Essentially, individuals use their metacognitive knowledge to select those tasks that fit their individual expertise best. Regarding sequential collaboration, future research should thus disentangle the effects of the judgment-elicitation process (i.e., contributors building a chain of sequential judgments) and of the opportunity to opt out of providing a judgment.
Our three studies are also limited in their generalizability to online collaborative projects such as Wikipedia and OpenStreetMap since they differ in various features. First, we examined the effect of expertise only in one sequential step of sequential collaboration and for short sequential chains of only two contributors even though sequential chains are typically much longer and more complex in online collaborative projects. Simplifying such a process into single steps is a typical approach in experimental research. Nevertheless, we expect that the effects of expertise and deviation on change probability and improvement of judgments should similarly hold for longer sequential chains, given that participants are not aware of the number of contributors or previous judgments. Moreover, tasks in our experiments differ considerably from tasks in online collaborative projects. Tasks in these projects are typically more judgmental and less demonstrable than providing numeric or geographical judgments, since they involve decisions on which, where, and how to include information. These projects also provide more infrastructure for contributions, such as discussion forums and change logs. In contrast to scientific experiments, contributors are not fully anonymous and typically volunteer for editing in these projects. All these factors may influence whether contributors adjust or maintain Wikipedia articles or OpenStreetMap objects and how they contribute to these projects.

Conclusion
Sequential collaboration is a key mechanism found in many large-scale, online collaborative projects. Our studies show that expertise is an important predictor of whether individuals adjust or maintain presented entries, how much they improve an entry, and how accurate the final estimates are. Thereby, we provide first evidence for the implicit weighting of expertise in sequential collaboration, which can explain the high accuracy of online collaborative projects.