The benefits of preregistration for hypothesis-driven bilingualism research

Abstract
Preregistration is an open science practice that requires the specification of research hypotheses and analysis plans before the data are inspected. Here, we discuss the benefits of preregistration for hypothesis-driven, confirmatory bilingualism research. Using examples from psycholinguistics and bilingualism, we illustrate how non-peer reviewed preregistrations can serve to implement a clean distinction between hypothesis testing and data exploration. This distinction helps researchers avoid casting post-hoc hypotheses and analyses as confirmatory ones. We argue that, in keeping with current best practices in the experimental sciences, preregistration, along with sharing data and code, should be an integral part of hypothesis-driven bilingualism research.


Introduction
An important aspect of hypothesis-driven research is PREREGISTRATION, an open science practice that consists of the specification of research question(s), method(s) and analysis plan(s) before data collection. Preregistration is a relatively simple yet powerful tool for improving transparency in bilingualism research, and we suggest that, in keeping with current best practices in the experimental sciences, bilingualism researchers include preregistration as an essential component of hypothesis-driven research, along with other open science practices such as releasing materials, data and code alongside publications (Chambers, Feredoes, Muthukumaraswamy & Etchells, 2014; Nosek, Ebersole, DeHaven & Mellor, 2018b; Nosek & Lakens, 2014; Nosek, Ebersole, DeHaven & Mellor, 2018a; Open Science Collaboration, 2015).
There are several positions regarding the goals of preregistration. Many researchers view it as a tool specific to confirmatory research because it can help assess the falsifiability of an experimental study's predictions, control for false positive error probability in null hypothesis significance testing (NHST), and mitigate researcher biases (e.g., Lakens, 2019; Chambers, 2019; Nosek, Beck, Campbell, Flake, Hardwicke, Mellor, van 't Veer & Vazire, 2019). Under this view, preregistration helps implement the distinction between confirmatory analyses (used for hypothesis testing) and exploratory analyses (used for hypothesis generation) (e.g., de Groot, 1956/2014; Chambers, 2019; Nosek et al., 2018b; Nosek et al., 2019; Wagenmakers, Wetzels, Borsboom, van der Maas & Kievit, 2012). More recently, preregistration has also been considered for qualitative research with the aim of making documentation of research plans more transparent (Haven & Grootel, 2019). Other research groups acknowledge the contribution of preregistration to scientific transparency, but call into question the validity of the distinction between confirmatory and exploratory research, and the usefulness of preregistration to help implement this distinction (e.g., Devezer, Navarro, Vandekerckhove & Buzbas, 2020; Szollosi et al., 2020; Szollosi & Donkin, 2019; cf. Wagenmakers, 2019). From this point of view, a shift to the development of more explicit theories would make preregistration unnecessary. In this paper, we take the position that preregistration is crucial to separate confirmatory from exploratory analyses. In our view, the preregistration of confirmatory hypotheses can counter questionable research practices and unconscious biases (Box 1). Consequently, it can enhance research transparency in confirmatory bilingualism (L2) research.
Concerns about (non-)transparency and researcher biases are well-known in psychological science (Wicherts, Borsboom, Kats & Molenaar, 2006; Simmons, Nelson & Simonsohn, 2011). L2 research is similarly affected by a lack of clarity about pre-data collection hypotheses and analysis plan choices. This problem is compounded by the fact that L2 studies rarely release their research materials (Derrick, 2016; Marsden, Thompson & Plonsky, 2018c) or their data (Larson-Hall & Plonsky, 2015; Bolibaugh, Vanek & Marsden, 2020).
To address these issues, two journals in the field of bilingualism, Language Learning and Bilingualism: Language and Cognition, have introduced a new type of article, Registered Reports, which allows researchers to submit their hypotheses, methods, and analysis protocols for peer review prior to data collection (Marsden, Morgan-Short, Trofimovich & Ellis, 2018b).

Box 1. Three questionable research practices and biases
• The garden of forking paths
In hypothesis-driven research, there are many possible data analysis paths, and one of several potential paths can be selectively chosen and reported (Gelman & Loken, 2013). For example, one could choose a particular measure, region of interest or time-window that was not originally selected for analysis, or delete outliers based on an arbitrary criterion. Such multiple analysis paths cumulatively create so many researcher degrees of freedom that one can describe them using a decision tree. This bias is often an unconscious one (Gelman & Loken, 2013, pp. 9-10): It's not that the researchers performed hundreds of different comparisons and picked ones that were statistically significant. Rather, they start with a somewhat-formed idea in their mind of what comparison to perform, and they refine that idea in light of the data. (...) they are using their scientific common sense to formulate their hypotheses in a reasonable way, given the data they have. The mistake is in thinking that, if the particular path that was chosen yields statistical significance, that this is strong evidence in favor of the hypothesis.
• Multiple testing
For purely statistical reasons, if one conducts enough statistical tests, some test will eventually come out significant. For example, in psycholinguistic eye-tracking reading research, one can easily end up conducting dozens of statistical tests to evaluate a single hypothesis. Simulations in von der Malsburg and Angele (2017) demonstrate that multiple analyses in eye-tracking dramatically inflate Type I error, leading to a large proportion of false positive rejections of the null hypothesis.
• Post-hoc hypothesizing
When data are analyzed without having explicitly stated the predictions, one may easily convince oneself that an unforeseen result was expected all along, and subsequently report this unexpected finding as a confirmatory one. This bias is commonly referred to as 'hypothesizing after the results are known' (HARKing) (Simmons et al., 2011; Kerr, 1998). This can skew the scientific record with less well-grounded theories, cherry-picked after the fact (Chambers, 2019).
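The inflation described under multiple testing can be illustrated with a small simulation. The sketch below is not the simulation reported by von der Malsburg and Angele (2017); it is a minimal illustration that assumes all tests are independent and that the null hypothesis is true for every test, so that each p-value is uniformly distributed. The function name, the number of simulated studies and the random seed are hypothetical choices made for this example.

```python
import random


def familywise_false_positive_rate(n_tests, alpha=0.05, n_studies=20000, seed=1):
    """Proportion of simulated null studies with at least one 'significant' test.

    Assumptions (for illustration only): the tests are independent and the
    null hypothesis holds for all of them, so each p-value is uniform on [0, 1].
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # A study counts as a false positive if any of its tests has p < alpha.
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_studies


one = familywise_false_positive_rate(1)
six = familywise_false_positive_rate(6)
print(f"1 test per study:  {one:.3f}")
print(f"6 tests per study: {six:.3f}")
```

With a single test the simulated rate stays close to the nominal 5%, whereas with six tests roughly a quarter of the null studies yield at least one significant result. Real eye-tracking measures are correlated, so the independence assumption is a simplification.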
Here, we discuss a different approach: non-peer reviewed preregistration using open science platforms such as the Open Science Framework (OSF, https://osf.io/) or AsPredicted (https://aspredicted.org/). On these platforms, researchers have the opportunity to create a public or private, time-stamped, non-modifiable record of a planned study prior to data inspection, either before or during data collection. Here, we argue that non-peer reviewed preregistration can counteract the questionable research practices presented in Box 1. We first illustrate them with an example from our own work on native (L1) sentence processing. Then, we discuss correlates in the L2 literature and explain how non-peer reviewed preregistrations can improve L2 research.

Possible pitfalls of hypothesis-driven research: An example from L1 sentence processing
We briefly introduce our study, which attempted to replicate the findings of an eye-tracking reading study that compared the processing of two different syntactic dependencies (Dillon, Mishler, Sloggett & Phillips, 2013; Jäger, Mertzen, Van Dyke & Vasishth, 2020). This example can be easily translated to bilingualism settings where, similar to our example, processing patterns are investigated for different syntactic constructions, but also for different speaker groups, such as native vs. non-native speakers (Felser & Cunnings, 2012).
Our example concerns a phenomenon called AGREEMENT ATTRACTION. For subject-verb agreement dependencies, previous work has shown that a processing disruption elicited by an ungrammatical plural verb can be weakened if a plural noun (an "attractor") intervenes between the subject and the verb (as in 1a vs. 1b; Wagers, Lau & Phillips, 2009; Pearlmutter, Garnsey & Bock, 1999; Dillon et al., 2013). Dillon and colleagues used a within-subjects design to examine whether the attraction effect extended to ungrammatical antecedent-reflexive dependencies, where an attractor matched the reflexive in number (1c vs. 1d).
(1)
a. Subject-verb agreement; attraction
   *The amateur bodybuilder who worked with the personal trainers amazingly were competitive for the gold medal.
b. Subject-verb agreement; no attraction
   *The amateur bodybuilder who worked with the personal trainer amazingly were competitive for the gold medal.
c. Reflexive; attraction
   *The amateur bodybuilder who worked with the personal trainers amazingly injured themselves on the lightest weights.
d. Reflexive; no attraction
   *The amateur bodybuilder who worked with the personal trainer amazingly injured themselves on the lightest weights.
Building on work by Sturt (2003), they argued that, unlike subject-verb agreement configurations, the processing of antecedent-reflexive dependencies should be syntactically constrained (Chomsky, 1981). If so, attraction effects were expected in subject-verb dependencies but not in antecedent-reflexive dependencies, yielding an interaction between dependency type and attraction. Dillon et al. (2013) analyzed multiple reading measures and observed the predicted interaction only in total reading time. This result was taken as support for the hypothesis that subject-verb agreement and reflexives show different susceptibility to agreement attraction, and thus are differentially constrained by syntactic principles. In our large-sample replication study (Jäger et al., 2020), the goal was to replicate the statistically significant interaction in total reading time from the original study. Our confirmatory analysis of total reading time showed no effect, while the exploratory analyses of first-pass regressions and regression-path durations did (Table 1).
The study by Dillon and colleagues and our attempted replication serve to illustrate the potential issues of the garden of forking paths, multiple testing and post-hoc theorizing. First, even for a confirmatory replication study, where one analyzes the same region and reading measure that showed the interaction in the original study, garden of forking paths scenarios arise if an analysis path is not defined prior to data inspection. For example, different decisions regarding statistical tests and outlier treatment could still be made after data inspection.
Second, for the analyses of the Dillon et al. study and our replication study, six statistical tests were conducted. Testing six eye-tracking measures increases the Type I error probability from 5% to 26.5% (i.e., 1 − 0.95^6 = 0.265) (Bonferroni, 1936). It is possible to correct for multiple testing. For example, a Bonferroni correction would require an adjusted Type I error of 0.05/6 for the six statistical tests we conducted, which implies that the absolute critical t-/z-value would be 2.64. If this criterion were used, there would be no significant effects in either the original study or the replication attempt (see observed z/t-values in Table 1). A better solution to the multiple testing problem may be to avoid it altogether by having precise predictions about the dependent measure(s), and focus on (Bayesian) estimation of effects rather than NHST (e.g., Norouzian, 2020; Kruschke, 2014).
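The two numbers in this paragraph can be reproduced with a few lines of code. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, the familywise error formula assumes independent tests, and the critical value is computed from the standard normal distribution (i.e., it treats the t-value as approximately z-distributed, which is reasonable only for large samples).

```python
from statistics import NormalDist


def familywise_error(n_tests, alpha=0.05):
    """Probability of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_critical_z(n_tests, alpha=0.05):
    """Two-sided critical z-value after a Bonferroni correction.

    Assumption: the test statistic is (approximately) standard normal.
    """
    adjusted_alpha = alpha / n_tests
    # Two-sided test: half of the adjusted alpha goes into each tail.
    return NormalDist().inv_cdf(1 - adjusted_alpha / 2)


fwer = familywise_error(6)
crit = bonferroni_critical_z(6)
print(f"Familywise Type I error for 6 tests: {fwer:.3f}")
print(f"Bonferroni-adjusted critical |z|:     {crit:.2f}")
```

Changing the first argument shows how quickly the familywise error grows, and how much stricter the critical value becomes, as more measures or regions are tested.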
Third, suppose that the effect that was expected a priori at the critical auxiliary verb or the reflexive had been found further downstream in the sentence or even before the critical region. Without specifying the critical region in advance, one could easily have found a post-hoc theory for the effect showing up in another region and reported this as if it had been predicted all along.
Finally, both the original and the replication study show some evidence of the effect of interest. However, the effect occurs in different measures across the two studies. Because of the exploratory nature of the first-pass regression and regression-path duration results in the replication attempt, we cannot treat these hypothesis tests as confirmatory ones. Exploratory analyses per se are an important part of doing science, but they should be presented as such (e.g., Bishop, 2020; de Groot, 1956/2014; Nosek et al., 2018b).

Problematic research practices in L2 research
The issues above can also arise in L2 research. Two common examples of forks in the analysis path are outlier treatment and the selection of interest regions in reading studies. For example, a synthesis of methodological decisions in L2 self-paced reading (SPR) research showed a variety of outlier removal criteria across 64 studies, such as standard deviations around the mean, reading time cutoffs, or both (Marsden et al., 2018c; see Nicklin & Plonsky, 2020, for discussion of outlier treatment). Moreover, L2 reading studies on the same grammatical phenomena can vary substantially in their selection of interest regions. For a subset of the L2 SPR studies on local ambiguity processing synthesized in Marsden et al. (2018c), some studies reported statistical analyses for the ambiguous sentence region, and other studies for some, or all, of the subsequent regions. In addition, the critical regions varied between studies, consisting of a single word or several words combined.
A problem closely related to the selective reporting of interest regions is conducting statistical tests for many different regions and/or eye-tracking measures. Godfroid (2020) reported that an average of 3.4 eye-tracking measures per study are analyzed in the L2 eye-tracking literature, further inflating Type I error probability. The Type I error issue might be particularly prevalent in L2 studies because many of them use frequentist NHST and only report binary decisions about the presence or absence of an effect without also reporting effect estimates (Marsden et al., 2018c). One unfortunate consequence is that other researchers cannot gain knowledge about the magnitude of an effect across studies, or conduct meta-analyses due to the lack of information from previous studies (Plonsky, 2013; Larson-Hall & Plonsky, 2015; Plonsky & Oswald, 2014; Al-Hoorie & Vitta, 2019; for an introduction to meta-analyses in bilingualism research, see Plonsky & Oswald, 2015; Plonsky, Sudina & Hu, 2020).

Non-peer reviewed preregistration in psycholinguistic research
Table 1. Comparison of the findings by Dillon et al. (2013) and Jäger et al. (2020). The table shows the interaction effect of Dependency type × Attraction, computed using generalized linear mixed models (effects on first-pass regressions were estimated using a logit link function). The interaction effect was expected to have a negative sign. Significant effects at a 0.05 α-level are shown in bold. Note that the published analyses in Jäger et al. (2020) differ from the ones we present here due to different model assumptions made in the present paper for expository purposes.
Bilingualism: Language and Cognition
For preregistration to counter questionable research practices and biases, it is not sufficient to a priori specify the dependent measure(s), because many researcher degrees of freedom remain. A complete preregistration requires a full description of the research questions and hypotheses, study design, methods, speaker group selection criteria, data collection procedure, participant sample size or stopping rule, outcome variable(s), as well as an analysis plan including statistical models, information on data exclusion and statistical inference criteria. This not only ensures greater transparency, but can also keep one's biases in check because analysis decisions are made public prior to data analysis, preventing selective reporting of effects. For example, assume that for a planned study we preregister no outlier exclusions, but later find an effect only when removing certain data points. This could be reported as an exploratory finding. Without preregistration, it may be tempting to report the most 'interesting' result as confirmatory, preventing other researchers from evaluating the findings in light of the analysis choices.
In addition, if our published preregistration committed to a predicted effect for a particular region and measure, based on theory or previous findings, we can no longer convince ourselves that a surprising result was originally predicted and restate the hypotheses post-hoc. One may argue that if one has strong theoretical predictions, preregistration is redundant because the analysis choices are predetermined by the theory. However, Silberzahn et al. (2018) convincingly illustrated that different analysis choices can be made even under highly constraining conditions. Their study recruited 29 research groups in the psychological sciences to answer the same research question for one particular dataset. Of the 29 groups, 20 observed a significant and nine a non-significant result. Strikingly, the range of effect estimates reported by the different research groups allowed for different conclusions.
Although we take the view that preregistration without peer review can be an effective way to reduce unconscious biases in one's work, the lack of peer review means that the preregistration of a study can be as thorough or as vague as the researcher deems appropriate. Vaguely specified research plans still allow for many possible analysis paths, and selective reporting of effects. Consequently, it is up to the scientific community to make non-peer reviewed preregistration a success or a failure: only a thoroughly implemented preregistration and a precisely followed research plan can reduce unconscious biases and help to separate confirmatory hypothesis tests from exploratory ones.
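One simple way to guard against a vaguely specified plan is to check a draft against the components of a complete preregistration named earlier in this section. The sketch below is a hypothetical checklist, not a template provided by OSF or AsPredicted: the class, field and function names are our own, and the example entries are illustrative placeholders loosely based on the study discussed above.

```python
from dataclasses import dataclass, fields


@dataclass
class Preregistration:
    """Hypothetical checklist; one field per component of a complete plan."""
    research_questions_and_hypotheses: str = ""
    study_design: str = ""
    methods: str = ""
    speaker_group_selection_criteria: str = ""
    data_collection_procedure: str = ""
    sample_size_or_stopping_rule: str = ""
    outcome_variables: str = ""
    statistical_models: str = ""
    data_exclusion: str = ""
    inference_criteria: str = ""


def missing_components(prereg):
    """Return the names of all components that are still empty."""
    return [f.name for f in fields(prereg) if not getattr(prereg, f.name).strip()]


# Illustrative draft: only three of the ten components are filled in.
draft = Preregistration(
    research_questions_and_hypotheses="Dependency type x Attraction interaction, negative sign",
    outcome_variables="Total reading time at the critical region",
    data_exclusion="No outlier exclusions",
)
missing = missing_components(draft)
print("Still unspecified:", ", ".join(missing))
```

A non-empty field is of course no guarantee of precision; the check only flags components that have not been addressed at all.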

Selecting dependent measures for a preregistration
If one wants to preregister a study, but lacks prior knowledge of a particular phenomenon, an experiment could be piloted and exploratory analyses conducted to identify which measure(s) show the predicted effect. One could then generate hypotheses from this and test them in a confirmatory study (e.g., Nicenboim, Vasishth, Engelmann & Suckow, 2018; Nicenboim, Vasishth & Rösler, 2020). If, on the other hand, there are previous findings on a phenomenon, these could serve as the basis for a preregistration. However, when the literature shows equivocal results as discussed above, what steps could be taken to consolidate the support in favor of or against a theory? This is not straightforward. For example, in the Dillon et al. (2013) study and our replication study, the effect of interest was observed in different reading measures. If, based on linguistic theory, we believe that the effect of interest should be found in earlier reading measures (first-pass regression and regression-path duration as in our replication study), the only way to test this is by conducting a replication study. This replication should aim for a sufficiently large participant sample and a sufficiently precise effect estimate, and specify the dependent measure(s) and critical region(s) in advance. Otherwise, in a future study we may find some other dependent measure showing the effect, which may again tempt us to draw a bullseye around the arrow that happened to land where it did.

How to get started with a non-peer reviewed preregistration
Preregistration templates are available on OSF and AsPredicted for novel studies as well as for replication studies (e.g., https://bit.ly/OSFtemplates; https://bit.ly/AsPredtemplate). If one prefers to create a Registered Report-type preregistration (i.e., in manuscript format), it is possible to upload a preregistration manuscript on OSF. It is not enough to upload this document to the project's public repository, because the preregistration could be removed or replaced at any point. Rather, one needs to create a time-stamped, non-editable version, which can be made public immediately or embargoed until, for example, the associated paper is submitted or published. If the preregistration is withdrawn at any stage after creating a "frozen" version of it, some metadata (title, authors, description, reason for withdrawing preregistration) will remain publicly available. A new version of the preregistration can be made available before the data are inspected. We have previously made attempts at such manuscript-style preregistrations, e.g., for Vasishth, Mertzen, Jäger and Gelman (2018) (see https://osf.io/dgewb for the non-editable preregistration).

Conclusion
We have used examples from L1 sentence processing and the L2 literature to illustrate some of the problems that can arise during the research process. We then discussed how preregistration allows researchers to better separate confirmatory and exploratory analyses, which can help them counter questionable research practices and unconscious biases. Our view is that, if done thoroughly, non-peer reviewed preregistration would greatly benefit the bilingualism community. We suggest that the hypothesis-driven L2 research process should standardly include preregistration, in addition to the release of materials, data and code upon publication to increase research transparency and reproducibility.