Introduction
Social scientists use list experiments in surveys to estimate the prevalence of sensitive attitudes and behaviors in a population of interest, with topics including racial prejudice (Kuklinski, Cobb, and Gilens 1997), vote-buying (Gonzalez-Ocantos et al. 2011), sexual behavior (Chuang et al. 2021), and voter turnout (Holbrook and Krosnick 2010). A recent review shows that the standard difference in means estimator in the list experiment is underpowered to capture the extent of sensitivity bias in common applications. This happens because the bias reduction of list experiments relative to direct questioning comes at the cost of increased variance (Blair, Coppock, and Moor 2020).
Miller (1984) proposes double list experiments (DLEs) as an alternative research design to improve along the bias-variance frontier. DLEs consist of two parallel list experiments implemented simultaneously, with the average of the treatment effects in each experiment as an estimator of the prevalence of the sensitive trait. Because in this design every respondent sees the sensitive item once, the variance of the pooled DLE estimate is, in expectation, reduced by half.
While DLEs promise more precise estimates, they are yet to become widespread practice. This is because they bring the additional burden of justifying the list experiment assumptions for two lists of baseline items, which in turn requires extensive piloting. This is a challenge considering that different baseline lists can yield diverging prevalence estimates of the same sensitive behavior (Chuang et al. 2021).
This paper outlines two statistical tests to detect a form of strategic misreporting that would violate the identification assumptions in a DLE. Both tests leverage variation in the timing with which the sensitive item is presented to respondents in DLEs. I refer to these as treatment schedules. In a DLE, respondents see two baseline lists and the sensitive item appears at random in the first or second list. When respondents see the sensitive item in the first list, they can alter their response to both lists. When they see the sensitive item in the second list, they can only alter their response to that list. By comparing the association between responses across treatment schedules, one can detect carryover design effects, which helps in assessing the validity of prevalence estimates.
I propose two tests to detect carryover design effects: the difference in differences and Stephenson’s signed rank test (Stephenson 1981). I illustrate the implementation of these tests with a reanalysis of a DLE on support for anti-immigration organizations in California (Alvarez et al. 2019) and examine their properties via simulation.
DLEs: Promise and challenge
As a running example, consider the study by Alvarez et al. (2019) on support for anti-immigration organizations in California. Participants in an online survey in 2014 were asked to indicate how many, not which ones, of the following organizations they support:

Californians for Disability (organization advocating for people with disabilities)

California National Organization for Women (organization advocating for women’s equality and empowerment)

American Family Association (organization advocating for pro-family values)

American Red Cross (humanitarian organization)
In the standard list experiment, the control group sees the list as it appears above. The treatment group also sees the following sensitive item:

Organization X (organization advocating for immigration reduction and measures against undocumented immigration)
Respondents saw the names of real organizations, but the replication materials censor them for ethical reasons. In the standard list experiment, the difference in means between treatment and control estimates the proportion of the population who supports Organization X. This estimator is valid under standard experimental assumptions, plus two more (Blair and Imai 2012). First, respondents do not misreport holding the sensitive trait (no liars). This assumption is violated if respondents who hold the sensitive trait give exactly the same response under treatment and control. Li (2019) develops estimate bounds that allow researchers to relax this assumption.
Second, participants do not alter their response to baseline items when the sensitive item is included (no design effects). This is violated when respondents deflate (inflate) their responses to avoid (emphasize) association with the sensitive item (Miller 1984). Blair and Imai (2012) propose a test to detect violations of this assumption in the standard list experiment.
A recent meta-analysis shows that the list experiment estimator is underpowered to detect sensitivity biases in common applications (Blair, Coppock, and Moor 2020). This is because of the bias-variance tradeoff. A validation study shows that, compared to direct questioning, list experiments produce estimates closer to the true prevalence, albeit with wider confidence intervals (Rosenfeld, Imai, and Shapiro 2015).
An alternative to reduce variability in estimates without compromising bias reduction is to implement a DLE (Miller 1984). A DLE differs from the standard list experiment in two ways. First, DLEs include two lists of baseline items as separate questions, usually close to each other in the survey flow.
Continuing with the running example, Alvarez et al. (2019) include a second list:

American Legion (veterans service organization)

Equality California (gay and lesbian advocacy organization)

Tea Party Patriots (conservative group supporting lower taxes and limited government)

Salvation Army (charitable organization)
For simplicity, I refer to these as lists A and B. The second way in which DLEs differ from the standard design is that the sensitive item is randomly assigned to appear in A or B. This is equivalent to conducting two parallel list experiments. In this case, some respondents receive A under treatment and B under control, while others receive A under control and B under treatment.
This implies one difference in means for each list. The DLE estimator is the average of these two. Because each respondent serves as both treatment and control in parallel experiments, DLE estimates have roughly half of the variance of the single-list estimator (Miller 1984).
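The pooled estimator described above can be sketched in a few lines of Python. This is a minimal illustration rather than the replication code: the function name `dle_estimates` and the arguments `y_a`, `y_b`, and `sensitive_in_a` are hypothetical, and the sketch assumes every respondent answers both lists and sees the sensitive item in exactly one of them.

```python
import numpy as np

def dle_estimates(y_a, y_b, sensitive_in_a):
    """Return (list A estimate, list B estimate, pooled DLE estimate).

    y_a, y_b: item counts reported for lists A and B (hypothetical names).
    sensitive_in_a: 1 if the respondent saw the sensitive item in list A,
                    0 if they saw it in list B.
    """
    y_a = np.asarray(y_a, dtype=float)
    y_b = np.asarray(y_b, dtype=float)
    in_a = np.asarray(sensitive_in_a) == 1

    # List A: treated respondents are those who saw the sensitive item in A.
    tau_a = y_a[in_a].mean() - y_a[~in_a].mean()
    # List B: treated respondents are those who saw the sensitive item in B.
    tau_b = y_b[~in_a].mean() - y_b[in_a].mean()
    # The DLE estimator averages the two single-list differences in means.
    return tau_a, tau_b, (tau_a + tau_b) / 2


# Toy data, invented for illustration only.
print(dle_estimates([3, 2, 2, 2], [1, 2, 3, 2], [1, 1, 0, 0]))
```

With the toy data, the list A estimate is 0.5, the list B estimate is 1.0, and the pooled estimate is their average, 0.75.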
DLEs promise increased precision at the cost of an additional survey question and no additional assumptions. However, one must now justify these assumptions for two lists, which requires extensive piloting to find the right combination of items. Under limited resources, the additional piloting may decrease the budget available to conduct confirmatory analysis, resulting in a smaller sample size than what a single-list experiment could afford. This means the precision improvements from implementing a DLE can be offset by sample loss. Section E of the appendix provides guidelines to navigate this tradeoff with simulations that compare the statistical power of the single-list experiment and the DLE under different degrees of expected sample loss.
Even after carefully choosing baseline items, the challenge is that single-list prevalence estimates can vary considerably across comparable lists (Chuang et al. 2021). Since single-list estimates often have wide confidence intervals, a likely scenario is to find different point estimates with confidence intervals that overlap. This means one cannot determine whether the average of the two is a credible approximation of the true prevalence.
The design in Alvarez et al. (2019) helps to illustrate this point. The study also includes a second sensitive item:

Organization Y (citizen border patrol group combating undocumented immigration)
Organizations X and Y are mutually exclusive, so one can analyze them as separate DLEs. Since respondents always see list A first, the experiment has four possible combinations of sensitive items and their placement; these appear in Table 1. Each experiment has three different estimates: two single-list estimates and the pooled DLE estimate.
Figure 1 shows these estimates for both sensitive items. For Organization X, all estimators suggest a nonzero prevalence rate around 0.3. For Organization Y, estimates vary more. The estimate for list A suggests a prevalence of 0.1 that is indistinguishable from zero, list B suggests a nonzero prevalence of 0.4, and the pooled DLE estimate suggests a nonzero prevalence of 0.3.
The baseline lists do not change across organizations, so the different estimate patterns come from how respondents interpret the lists after the inclusion of the sensitive item. Organization Y is a group attempting to take matters against undocumented immigration into their own hands rather than just pushing for stricter policies, so it is more likely to stand out.
Since list A always appears first, the pattern of estimates for Organization Y suggests response deflation. In the list experiment using list A, only the treatment group sees the sensitive item, so they can deflate their responses to avoid signaling support for Organization Y. Since the control group has not seen the sensitive item yet, they respond truthfully, which biases the single-list estimate toward zero. In contrast, since list B always appears second, both treatment and control groups have already seen the sensitive item, so both shift in the same direction.
Since the confidence intervals for the single-list estimates for lists A and B in the Organization Y experiment overlap, one cannot determine whether this pattern reflects violations of the list experiment assumptions. The next section outlines two statistical tests that allow researchers to make uncertainty statements about this possibility.
Statistical tests
Setup
The DLEs in Alvarez et al. (2019) keep the order of baseline lists fixed and randomize the location of the sensitive item. This is not the only admissible version of the DLE. Researchers can choose whether to randomize the order of lists and the location of the sensitive item. The possible combinations are outlined in Table 2.
The fixed-fixed design is inadmissible since it does not include any experimental manipulation. The randomized-fixed design changes the order of baseline lists but fixes the sensitive item, usually in the second list. This design is not compatible with the proposed tests because the location of the sensitive item does not vary. One may consider this desirable since it prevents respondents from altering responses to the second list after seeing the sensitive item in the first list. The proposed tests only apply to designs that manipulate the location of the sensitive item: fixed-randomized and randomized-randomized. One may consider adopting these designs, at least at the piloting stage, to justify the choice of baseline lists through the proposed tests.
The purpose of the tests is to detect strategic misreporting that violates the assumptions of a DLE. The no design effects assumption states that the inclusion of a sensitive item does not alter the responses to the baseline items (see Blair and Imai 2012 for details). Ceiling and floor effects are common violations of this assumption. If none or all of the baseline items apply to a respondent, then answering truthfully may entail revealing the sensitive item to the researcher unequivocally, which betrays anonymity. This may lead respondents with answers at the extremes to deflate or inflate their responses (Miller 1984; Kuklinski, Cobb, and Gilens 1997). The conventional advice to avoid ceiling and floor effects is to prevent extreme answers while crafting baseline lists, for example, by inducing negative correlation between items (Glynn 2013).
This paper focuses on the strategic misreporting that happens when the inclusion of a sensitive item in the first list of a DLE leads respondents to alter responses in both lists. I term this carryover design effects. Ceiling or floor effects are unlikely to produce them, since they pertain to the distribution of items within a baseline list.
Instead, carryover design effects happen when the inclusion of a sensitive item leads respondents to interpret baseline items differently (Miller 1984). In this context, response deflation happens when the sensitive item is a frowned-upon attitude, such as admitting to racial prejudice. Similarly, response inflation happens when the absence of the sensitive behavior is frowned upon, such as supporting the regime in a dictatorship. In either case, deflation/inflation happens because the sensitive item stands out to respondents, which cues them on the attitude or behavior of interest. This leads to altered responses in the baseline items that are associated with the sensitive item. These effects can happen regardless of the number of baseline items that apply to a respondent, and even among those who do not hold the sensitive trait.
Carryover design effects happen in DLEs because list experiment questions have a distinct format, tend to appear close to each other in survey flows, and often follow the advice of using lists with paired items to induce positive correlation across lists for the sake of precision (Glynn 2013). These factors allow respondents to connect items across lists when the sensitive item stands out. For example, in Alvarez et al. (2019), supporters of the American Family Association (list A) may also support the Tea Party Patriots (list B). The inclusion of an anti-immigration organization as the sensitive item in the first list may alert respondents about the researcher’s interest in support for conservative organizations, which may lead to response deflation in both lists if the sensitive item appears first.
Fixed-randomized and randomized-randomized DLE designs allow diagnosis of carryover design effects. Let ${Y_{i1}} = {z_i}{Y_{i1}}\left( 1 \right) + \left( {1 - {z_i}} \right){Y_{i1}}\left( 0 \right)$ be individual $i$’s observed response to the first list, and ${Y_{i2}} = \left( {1 - {z_i}} \right){Y_{i2}}\left( 1 \right) + {z_i}{Y_{i2}}\left( 0 \right)$ the observed response to the second list, with ${z_i}$ indicating whether a respondent sees the sensitive item first. At the individual level, the researcher only observes the paired responses $\left[ {{Y_{i1}}\left( 1 \right),{Y_{i2}}\left( 0 \right)} \right]$ or $\left[ {{Y_{i1}}\left( 0 \right),{Y_{i2}}\left( 1 \right)} \right]$. These reflect treatment schedules with the sensitive item appearing first or second, respectively. Under the first schedule, respondents can react to the sensitive item in both questions. Under the second schedule, respondents can only react to the sensitive item in ${Y_{i2}}$.
To illustrate carryover design effects, Table 3 shows the answers of a hypothetical respondent. The first two columns denote the observed responses under different scenarios. As a baseline, the respondent answers the same number in both questions. For now, assume that the sensitive item does not apply to this individual and that they do not engage in any kind of misreporting.
Now imagine that the respondent does not hold the sensitive trait but still seeks to avoid association with it by deflating (inflating) their response by one unit. In this case, the observed responses depend on the placement of the sensitive item. When ${z_i} = 1$ , the relationship between ${Y_{i1}}\left( 1 \right)$ and ${Y_{i2}}\left( 0 \right)$ stays the same because both shift in the same direction. However, under ${z_i} = 0$ , the respondent alters their response only in the second list. In this example, responses become further apart because the baseline counts were the same.
More generally, with similar baseline lists, carryover design effects under the ${z_i} = 1$ schedule lead to changes in responses in the same direction, although not necessarily in the same magnitude. However, the ${z_i} = 0$ schedule only allows for strategic misreporting in the second list. The goal of the tests is to detect this asymmetric shift across treatment schedules.
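A small numerical illustration may help fix ideas. The baseline count of 2 in both lists is a hypothetical value chosen for this sketch, not one taken from Table 3. Consider a respondent who does not hold the sensitive trait and deflates by one unit whenever they have seen the sensitive item:

$$z_i = 1:\quad \left[ {Y_{i1}}\left( 1 \right),{Y_{i2}}\left( 0 \right) \right] = \left[ 2 - 1,\; 2 - 1 \right] = \left[ 1,\; 1 \right], \qquad {Y_{i1}} - {Y_{i2}} = 0$$

$$z_i = 0:\quad \left[ {Y_{i1}}\left( 0 \right),{Y_{i2}}\left( 1 \right) \right] = \left[ 2,\; 2 - 1 \right] = \left[ 2,\; 1 \right], \qquad {Y_{i2}} - {Y_{i1}} = -1$$

Under the first schedule the paired responses move together and their gap is unchanged. Under the second schedule only the second response moves, so the response given with the sensitive item present falls below the one given without it.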
Difference in differences
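This test compares whether mean responses vary across treatment schedules. The quantities
$${\hat \tau _1} = \frac{1}{{{N_{1T}}}}\sum\limits_{i = 1}^N {{z_i}{Y_{i1}}} - \frac{1}{{{N_{1C}}}}\sum\limits_{i = 1}^N {\left( {1 - {z_i}} \right){Y_{i1}}}$$
and
$${\hat \tau _2} = \frac{1}{{{N_{2T}}}}\sum\limits_{i = 1}^N {\left( {1 - {z_i}} \right){Y_{i2}}} - \frac{1}{{{N_{2C}}}}\sum\limits_{i = 1}^N {{z_i}{Y_{i2}}}$$
denote the difference in means between responses with and without the sensitive item for the first and second list, with ${N_{\rm{*}}}$ as the sample size for the treatment ($T$) and control ($C$) groups in each question.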
The null hypothesis is that the two differences in means are equal, ${H_0}:{\hat \tau _1} - {\hat \tau _2} = 0$. For a fixed-randomized DLE, ${\hat \tau _1}$ and ${\hat \tau _2}$ correspond to the single-list prevalence estimates, and the test is equivalent to the consistency test proposed by Chuang et al. (2021). For the randomized-randomized design, this quantity is the difference in differences in means between responses to the first and second question instead.
Since the control group in the first question has not seen the sensitive item yet, the sign of the test statistic depends mainly on ${\hat \tau _1}$. A negative test statistic suggests deflation, while a positive value suggests inflation. Calculating the difference in differences is straightforward, but the computation of p-values must consider the clustered structure of the data, since each participant has two responses. Both randomization inference and linear regression with clustered standard errors accommodate this structure.
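The randomization inference route can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the replication materials: the function names `diff_in_diff` and `ri_pvalue` are hypothetical, treatment schedules are assumed to be completely randomized, and the sharp null is taken to be that each respondent's total across both lists would be the same under either schedule. Reshuffling the schedule indicator at the respondent level keeps both responses of a participant together, which is how the sketch respects the clustered structure.

```python
import numpy as np

def diff_in_diff(y1, y2, z):
    """Difference between the first-list and second-list differences in means.

    y1, y2: responses to the first and second list.
    z: 1 if the sensitive item appeared in the first list, 0 otherwise.
    """
    y1, y2, z = (np.asarray(a) for a in (y1, y2, z))
    tau1 = y1[z == 1].mean() - y1[z == 0].mean()  # first list treated when z = 1
    tau2 = y2[z == 0].mean() - y2[z == 1].mean()  # second list treated when z = 0
    return tau1 - tau2

def ri_pvalue(y1, y2, z, n_perm=2000, seed=0):
    """Two-sided Monte Carlo randomization p-value (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z)
    observed = diff_in_diff(y1, y2, z)
    # Permute schedules across respondents; each respondent's pair of
    # responses stays attached to a single reshuffled schedule.
    draws = np.array(
        [diff_in_diff(y1, y2, rng.permutation(z)) for _ in range(n_perm)]
    )
    p_value = np.mean(np.abs(draws) >= abs(observed) - 1e-12)
    return observed, p_value
```

A useful check on the logic: the statistic equals the difference in mean totals ${Y_{i1}} + {Y_{i2}}$ between the two schedules, so it is unaffected by a respondent truthfully counting the sensitive item in whichever list contains it.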
Signed rank test
The alternative test evaluates whether one can attribute extreme differences in paired responses to the variation in treatment schedules. Rosenbaum (2007, 2020) proposes Stephenson’s (1981) signed rank test to detect heterogeneous effects in pair-randomized experiments.
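The test applies to DLEs since responses are paired by participant. The test statistic is
$$\tilde T = \sum\limits_{i = 1}^N {{\rm{sgn}}\left\{ {\left( {2{z_i} - 1} \right)\left( {{Y_{i1}} - {Y_{i2}}} \right)} \right\} \cdot {{\tilde q}_i}}$$
which is the sum of signed ranks ${\tilde q_i}$, with the sign given by the difference between the response with the sensitive item and the response without it, and ${\tilde q_i}$ defined as
$${\tilde q_i} = \binom{{q_i} - 1}{m - 1}\;\;{\rm{for}}\;\;{q_i} \ge m,\qquad {\tilde q_i} = 0\;\;{\rm{otherwise}},$$
with ${q_i}$ denoting the rank of the absolute difference in paired responses $\left| {{Y_{i1}} - {Y_{i2}}} \right|$. So ${\tilde q_i}$ records the number of possible subsets of size $m$ in the data in which $\left| {{Y_{i1}} - {Y_{i2}}} \right|$ is the largest.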
The choice of $1 \le m \le N$ determines what counts as an extreme difference. For example, with $m = 2$ the test is equivalent to Wilcoxon’s signed rank, but with ranks ranging from $0$ to $N - 1$. As $m$ increases, more ranks are considered zero and more weight goes to large differences. The choice of $m$ is arbitrary, but researchers can use simulations at the pre-analysis stage to find the value that maximizes the power of the test. In the application and simulations, I report only $m = 10$ to facilitate exposition. Section D of the appendix reports results under additional values and gives an example of how to calibrate $m$.
Without ties in ranks, $\tilde T$ is a distribution-free statistic, meaning its p-values are known in advance. With ties, one can compute exact p-values in small samples, while the analytical derivation is a good approximation for experiments with large samples (Rosenbaum 2020).
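The statistic can be computed along the following lines. This is a sketch under assumptions that are mine rather than the article's: the function name `stephenson_statistic` is hypothetical, the sign of each pair is taken from the response with the sensitive item minus the response without it, and tied absolute differences all receive the lowest rank in their tie group (one plus the number of strictly smaller absolute differences). Other tie-handling rules would give different values.

```python
from math import comb

import numpy as np

def stephenson_statistic(y1, y2, z, m=10):
    """Sum of signed Stephenson ranks for paired DLE responses.

    y1, y2: responses to the first and second list.
    z: 1 if the sensitive item appeared in the first list, 0 otherwise.
    m: subset size that determines what counts as an extreme difference.
    """
    y1, y2, z = (np.asarray(a) for a in (y1, y2, z))
    # Response with the sensitive item minus response without it.
    diff = (2 * z - 1) * (y1 - y2)
    abs_diff = np.abs(diff)
    # Assumed tie rule: rank = 1 + number of strictly smaller |differences|.
    q = np.searchsorted(np.sort(abs_diff), abs_diff, side="left") + 1
    # comb(q - 1, m - 1) is zero whenever q < m, so small ranks drop out.
    # Plain Python integers avoid overflow for large N and m.
    return sum(
        int(s) * comb(int(r) - 1, m - 1) for s, r in zip(np.sign(diff), q)
    )
```

With $m = 2$ the weights reduce to $q_i - 1$, which matches the description of ranks running from $0$ to $N - 1$. Pairs with identical responses have sign zero and never contribute.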
To illustrate the behavior of this test statistic, consider the last two columns of Table 3. Under the baseline, treatment assignment does not change responses, so the sign of ${\tilde q_i}$ flips randomly and $\tilde T$ is zero in expectation. The only way $\tilde T$ can be negative is in the presence of response deflation. In the example, the first treatment schedule does not contribute to the test statistic, but the second schedule adds negatively.
One limitation of this test is that $\tilde T$ can be positive in the presence of either response inflation or a nonzero prevalence rate, leading to false positives. In the example, response inflation exhibits the mirror pattern of response deflation, now contributing to a positive test statistic. However, as the last two rows show, a respondent who reports the sensitive item without strategic misreporting also contributes to a positive test statistic on either treatment schedule.
This means that the signed rank test is more appropriate to evaluate the alternative hypothesis of ${H_a}:\tilde T < 0$. Addressing response inflation with this test requires a null hypothesis different from the sharp null, which involves making statements on the prevalence rate, sample size, distribution of outcomes, and $m$. These are rarely known in advance.
Application
Table 4 applies both tests to the running example, treating each sensitive organization separately. Since Figure 1 suggests response deflation for Organization Y, I report two-sided p-values for the difference in differences and one-sided lower-tail p-values for the signed rank.
For Organization X, including the sensitive item in the first question leads to a difference in means about $0.08$ points larger than in the second question; the p-value of $0.62$ suggests little evidence against the null of equal differences in means. For Organization Y, the difference in differences is around $-0.26$, which implies a smaller difference in means when the sensitive item goes first. The p-value of $0.08$ gives some evidence against the null, although not sufficient to reject it under conventional standards.
The signed rank test statistic is positive for both sensitive items, and both p-values are $1$, so one may conclude that the difference in differences test is more appropriate. The simulations in the next section check whether this intuition generalizes.
Simulation
Setup
I simulate DLEs with a sample size of $1,000$ respondents and fixed list order. The potential outcome of responses to the first list is ${Y_{i1}}\left( 0 \right)\sim B\left( {4,0.5} \right)$. This implies four baseline items, each applying to respondent $i$ with probability 0.5. This creates responses centered around middle values, which mimics an attempt to avoid floor and ceiling effects. The potential outcome for the second list, ${Y_{i2}}\left( 0 \right)$, follows the same distribution and associates with ${Y_{i1}}\left( 0 \right)$ with rank correlation $\rho $. I consider $\rho \in \left\{ {0,0.4,0.8} \right\}$ to capture how inducing correlations between lists affects performance.
I assume 15% of the respondents hold the sensitive trait at random. Following the simulations in Blair, Coppock, and Moor (2020), a standard list experiment is underpowered to detect this under conventional standards, but a DLE has over 80% power. This is a case in which opting for a DLE is consequential.
Also at random, a proportion $\gamma \in \left[ {0,1} \right]$ of the participants alter responses by $1$ or $2$ units when they see the sensitive item, doing so in both questions if they see it first. The magnitude is chosen at random and independently between questions. This reflects a setting with moderate but not symmetrical carryover design effects. To facilitate interpretation, I simulate response deflation and inflation separately. Figure D3 in the appendix shows how inflation and deflation introduce bias in DLE estimates.
For each combination of parameters, I simulate $1,000$ experiments and calculate power as the proportion of tests with p-values smaller than or equal to $0.05$. For the difference in differences, the p-values are always two sided. For the signed rank, the p-values are left-tailed for deflation and right-tailed for inflation.
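The data-generating process described in this section can be approximated with the sketch below. It is not the author's simulation code, and several choices are assumptions made only for illustration: the function name `simulate_dle` and its arguments are hypothetical; correlation between lists is induced by copying each first-list item into the second list with probability `rho`, which yields a Pearson correlation of roughly `rho` between baseline counts rather than an exact rank correlation; treatment schedules are split evenly; and altered responses are clipped to the range a respondent could feasibly report.

```python
import numpy as np

def simulate_dle(n=1000, rho=0.4, prevalence=0.15, gamma=0.2,
                 direction=-1, seed=0):
    """Simulate one DLE; direction=-1 for deflation, +1 for inflation."""
    rng = np.random.default_rng(seed)

    # Four baseline items per list, each applying with probability 0.5.
    items1 = rng.binomial(1, 0.5, size=(n, 4))
    fresh = rng.binomial(1, 0.5, size=(n, 4))
    copied = rng.random((n, 4)) < rho  # assumed device to correlate lists
    items2 = np.where(copied, items1, fresh)
    base1, base2 = items1.sum(axis=1), items2.sum(axis=1)

    trait = rng.random(n) < prevalence      # holds the sensitive trait
    z = rng.permutation(np.arange(n) % 2)   # 1: sensitive item in first list
    misreport = rng.random(n) < gamma       # alters responses once cued

    # Shifts of 1 or 2 units, drawn independently for each question.
    shift1 = direction * rng.integers(1, 3, size=n)
    shift2 = direction * rng.integers(1, 3, size=n)

    # First list: item present, and misreporting possible, only when z = 1.
    y1 = base1 + trait * (z == 1) + misreport * (z == 1) * shift1
    # Second list: by then every respondent has seen the sensitive item.
    y2 = base2 + trait * (z == 0) + misreport * shift2

    # Assumption: reported counts cannot leave the feasible range.
    y1 = np.clip(y1, 0, 4 + (z == 1))
    y2 = np.clip(y2, 0, 4 + (z == 0))
    return y1, y2, z
```

Power for a given parameter combination would then be the share of simulated experiments, each generated with a different `seed`, in which the chosen test returns a p-value at or below $0.05$.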
Results
Figure 2 shows the power of the proposed tests across parameter combinations for deflation and inflation. In general, power increases with the proportion of unintended responses in both tests. The exception is the signed rank test under inflation, which is sensitive to false positives as it captures the positive prevalence rate. Exception aside, both tests are well powered to detect a proportion of unintended responses that exceeds the true prevalence rate.
Everything else constant, the difference in differences has more power under response deflation than under inflation. One implication of this result is that, if possible, researchers should prefer sensitive items that are frowned upon over those one would pretend to have. For example, “I do not support the regime” over “I support the regime.” Yet this conversion is not always straightforward.
Finally, as the correlation between baseline lists increases, the difference in differences has less power under both deflation and inflation. Under deflation, the performance of the signed rank test improves with the correlation. The difference appears trivial in stylized simulations, but Figure D5 in the appendix shows that it becomes more pronounced as the magnitude of response deflation increases. Since previous work recommends inducing positive correlation between lists to increase the precision of the DLE estimator (Glynn 2013), researchers should consider reporting both tests if response deflation is a concern.
Conclusion
I propose two statistical tests to assess the validity of estimates in DLEs. These tests facilitate the widespread implementation of a variant of the list experiment that improves along the bias-variance frontier. This is compatible with previous efforts to increase precision, such as using responses to direct questions (Aronow et al. 2015) or auxiliary information (Chou, Imai, and Rosenfeld 2017) to adjust estimates.
These tests are most useful at the pre-analysis stage, as researchers can use them to justify the choice of baseline and sensitive items. Future work should use the tests as metrics to identify best research design practices to further improve our ability to address sensitive attitudes and behaviors through surveys.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/XPS.2023.24.
Data availability statement
The data, code, and any additional materials required to replicate all analyses in this article are available at the Journal of Experimental Political Science Dataverse within the Harvard Dataverse Network, at: https://doi.org/10.7910/DVN/SUV8DK (Diaz 2023).
Acknowledgments
I thank Jake Bowers, Alex Coppock, Luke Sanford, workshop participants at Tulane CIPR, and audiences at the 2020 NYU CESS Conference on Experimental Political Science, Polmeth 2021, APSA 2021, and Latin American Polmeth 2021 for valuable feedback.
Competing interests
None.