1. Introduction
Personnel security vetting interviews are a critical step in determining access to protected national information. In Sweden, as in many countries, security clearance assessments are legally required for anyone engaging in security-sensitive activities. A core objective of the Swedish vetting process is to determine whether an individual is loyal to the nation’s interests, as defined by the Swedish Protective Security Act (Säkerhetsskyddslag, 2018). Yet, as recent espionage cases illustrate (Jonsson and Gustafsson, 2022), even trusted insiders can turn out to be disloyal, prompting concerns over the reliability of such assessments.
1.1. Loyalty as a psychological and sociocultural construct
Loyalty is a multifaceted and elusive construct that encompasses emotional, moral, and strategic commitments to the state. In the context of protective security, a successfully completed vetting process is a core requirement for obtaining a personnel security clearance, which in turn is necessary to access nationally protected information (Herbig, 2008). While workplace loyalty may be described as an employee’s willingness to stay with an organization (Hirschman, 1970), loyalty in the context of national security goes beyond organizational commitment. It involves a deeper identification with the values, norms, and interests of the state, often requiring individuals to prioritize security obligations over personal or professional affiliations. This includes ‘post-exit’ loyalty, that is, the expectation that individuals will maintain confidentiality even after their employment ends (Alvesson, 2000).
Kleinig (2014, 2017) distinguishes loyalty from allegiance: while allegiance involves public declarations, loyalty involves sustained behavioral commitments. This difference is critical in security vetting, where applicants may affirm allegiance verbally, but their deeper loyalties (e.g., to kin, culture, or ideology), whether aligned or not with their stated allegiance, may need to be inferred from more indirect cues. This underscores the judgment challenge: evaluators must infer internal states from partial and ambiguous behavioral evidence. Loyalty is not only about declared values, but about how behavior and identity are interpreted within institutional logics. While classic work (e.g., Goffman, 1959) emphasized the performative aspects of social roles, more recent perspectives suggest that institutional trust depends not just on enactment, but on shared cultural expectations about what loyalty entails (Kleinig, 2017). As Baron (2009) notes, the very fact of coming from another country can be framed as a source of potential ‘dual loyalty,’ raising suspicion that commitments may be divided between different political or cultural communities. This poses not only a challenge of genuine competing loyalties but also the risk that vetting decisions are shaped by stereotypes and assumptions about migrants or minorities, rather than by evidence of actual behavior. From a psychological standpoint, loyalty can be conceptualized as comprising three distinct yet interconnected components: affective, normative, and continuance commitments (Meyer and Allen, 1991). Affective commitment reflects an emotional attachment to a group or institution, such as a deep-seated sense of belonging to one’s country.
Yet this very depth of attachment can generate a genuine dilemma: even when an individual has established themselves within a new cultural context, enduring ties to another culture—whether through values, kinship, or family bonds—may give rise to competing standpoints that conflict with the positions the individual would otherwise be expected to adopt. Normative commitment refers to a sense of moral obligation or duty to remain loyal, driven by internalized values or perceived expectations. Continuance commitment, on the other hand, involves a rational cost–benefit analysis of the consequences of disloyalty, such as potential personal, professional, or legal repercussions. These dimensions illustrate that loyalty is not simply a matter of expressed beliefs or public allegiance but a complex interplay of emotions, values, and strategic calculations. Although loyalty is often assessed in organizational research using standardized scales (e.g., Meyer and Allen, 1991, 1993), in Swedish personnel security vetting, it is not typically quantified in this way. Instead, loyalty is evaluated holistically through semi-structured interviews, where the interviewer interprets applicants’ narratives, cues, and contextual markers to reach a judgment.
Social Identity Theory (SIT; Tajfel and Turner, 1979) suggests that individuals tend to categorize others into in-groups and out-groups, where ethnic affiliation can serve as the basis for assigning someone to an out-group (Riek et al., 2006). This, in turn, can lead to stereotyping and biased decision-making in both directions: either by placing undue weight on ethnicity or by failing to recognize it as a potential risk factor. In security vetting, traits such as speaking a foreign language or maintaining ties to another country can consciously or unconsciously be interpreted as indicators of out-group membership. Such markers may trigger cognitive biases and affective responses in evaluators, raising concerns about divided loyalty, conflicted allegiances, or perceived cultural distance (Veijt and Thijsen, 2021). To empirically examine how such categorization processes might contribute to variability in loyalty assessments, we developed three ordinal-scale items that tapped central aspects of loyalty in the Swedish vetting context. These items asked evaluators to judge (a) the extent to which the applicant could be assumed to be loyal to the interests protected by the Security Protection Act (2018: 585), (b) the extent to which there was a risk of a conflict of loyalty, and (c) the extent to which the applicant could be assumed to prioritize loyalty to Sweden over loyalty to another party if forced to choose. While these items capture key elements of how loyalty is operationalized in practice, they do not fully reflect the construct’s multidimensional nature. Their value lies in providing a systematic and comparable basis for examining variability across evaluators.
As mentioned earlier, evaluative contexts such as recruitment are complex, with multiple factors interacting with ethnic or cultural markers to shape decision outcomes (Bertrand and Mullainathan, 2004; Lancee, 2021; Highhouse and Brooks, 2023; Lippens et al., 2023; Veijt and Thijsen, 2021). Research on employment interviews shows that salient but job-irrelevant cues, such as unusual accents or appearance, are particularly likely to bias judgments because they attract disproportionate attention (Levashina et al., 2014; Lippens et al., 2023; Wingate et al., 2024). Field experiments in Sweden and other European countries show that women from minority backgrounds may face compounded disadvantages due to both ethnic and gender-based stereotypes (Bursell, 2014; Veijt and Thijsen, 2021). Experimental evidence also indicates that bias operates even when applicants present identical qualifications: Moss-Racusin et al. (2012) found that science faculty rated male applicants as more competent and hireable than equally qualified female applicants. At the same time, recent evidence suggests that gender is not invariably a high-risk source of interviewer bias, which may reflect broader progress associated with increased representation and heightened awareness of equality legislation (Wingate et al., 2024). Given the similarities between recruitment and personnel security vetting in terms of gatekeeping, subjectivity, and ambiguity, it is plausible that gender may serve as a cue that either amplifies or buffers perceived risk.
While Social Identity Theory does not directly predict gender-based differences in vetting, the interaction of multiple identity markers may nonetheless influence how evaluators interpret foreign ties and assess loyalty. Out-group membership cues may sometimes reflect legitimate risk indicators, but they can also disproportionately influence judgments even when the substantive information is identical. This tension between security vigilance and stereotyping lies at the heart of evaluative uncertainty in personnel vetting. Vetting specialists operate under uncertainty, interpreting ambiguous cues with little guidance. This can result not only in systematic bias but also in inconsistent judgments when evaluators are presented with the same information, which Kahneman et al. (2021) describe as noise.
1.2. Judging under uncertainty: Cognitive efficiency and variability
Assessing loyalty presents a significant challenge due to its inherently abstract and multidimensional nature (Herbig, 2008; Hirschman, 1970; Shklar, 1993). Under conditions of uncertainty, evaluators often rely on heuristics that may lead to biased judgments (Gigerenzer and Gaissmaier, 2011; Tversky and Kahneman, 1974). While heuristics are adaptive and generally effective in everyday contexts, they also introduce the risk of systematic error, particularly when interpreting emotionally charged or socially salient cues such as foreign connections, cultural practices, or language use.
This cognitive vulnerability is amplified by the absence of standardized interpretive frameworks in Swedish vetting protocols. Unlike structured assessments that use defined scoring rubrics, loyalty evaluations often rely on open-ended and discretionary judgments. As a result, different evaluators may reach different conclusions based on identical information, resulting in noise, that is, undesirable variability in judgments made by individuals who are presented with the same information (Kahneman et al., 2021). Low levels of interrater reliability are a concern that directly undermines the consistency, fairness, and legal defensibility of security clearance decisions (Kahneman et al., 2021; Meehl, 1954).
Kahneman et al. (2021) identify two primary sources of judgmental noise: level noise and pattern noise. Level noise refers to systematic differences in overall judgment severity among evaluators. For example, one vetting officer may habitually adopt a stricter stance—interpreting most foreign ties as problematic—while another may consistently take a more lenient view. Even when assessing the same applicant profile, their thresholds for acceptable risk may differ, introducing disparity into decision outcomes. Pattern noise, by contrast, arises from idiosyncratic reactions to specific case features. One evaluator might be particularly sensitive to an applicant’s previous residence in a foreign country due to personal experience, while another might focus more heavily on financial history. This form of noise reflects inconsistency not in overall severity, but in which aspects are emphasized and how information is weighted across cases. In practice, such divergences may arise from salience bias—where distinctive or personally meaningful cues (e.g., foreign names or specific country associations) draw disproportionate attention and are interpreted differently by individual evaluators (Wingate et al., 2024).
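The distinction between level and pattern noise can be made concrete with a small variance decomposition. The sketch below uses hypothetical ratings (not data from this study): differences in judges’ mean ratings capture level noise, and the judge-by-case interaction that remains after removing judge and case main effects captures pattern noise.

```python
import numpy as np

# Hypothetical ratings: 3 judges x 4 identical case profiles (7-point scale).
ratings = np.array([
    [6, 5, 6, 5],   # a lenient judge
    [4, 3, 4, 3],   # a strict judge (level noise relative to the first)
    [5, 5, 2, 5],   # a judge with an idiosyncratic reaction to case 3
], dtype=float)

grand_mean = ratings.mean()
judge_means = ratings.mean(axis=1, keepdims=True)
case_means = ratings.mean(axis=0, keepdims=True)

# Level noise: variability in judges' overall severity.
level_noise = ((judge_means - grand_mean) ** 2).mean()

# Pattern noise: judge-by-case interaction after removing both main effects.
interaction = ratings - judge_means - case_means + grand_mean
pattern_noise = (interaction ** 2).mean()
```

In this toy example, the second judge contributes mainly level noise (uniformly lower ratings), whereas the third judge contributes mainly pattern noise (a deviation specific to one case).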
We argue that pattern noise may be especially likely to arise when judgments are shaped by implicit social identity dynamics, such as in-group/out-group distinctions. Social Identity Theory (SIT; Tajfel and Turner, 1979) suggests that evaluators are more likely to exhibit both bias and inconsistency when confronted with culturally distant applicants. In this light, SIT not only predicts directional bias but helps explain why some cues are interpreted with less agreement than others, bridging bias with noise. Importantly, bias and noise are not mutually exclusive; rather, they interact in ways that compound the unpredictability of loyalty assessments. The result is a vetting process that risks being not only unreliable but also opaque, as applicants cannot reasonably anticipate how their profiles will be interpreted. In practical terms, this means that identical cases can yield opposite outcomes, or that one evaluator approves a high-risk applicant while another rejects a low-risk one. Such inconsistencies threaten both national security and procedural legitimacy, underscoring the need to reduce judgmental noise in security vetting interviews.
1.3. Protective security vetting in Sweden: Legal and procedural framework
The process of security vetting in Sweden is governed by the Swedish Protective Security Act (Säkerhetsskyddslag 2018: 585), which requires organizations to assess individuals for employment or assignments that involve access to classified information, critical infrastructure, or other security-sensitive roles. Such vetting typically involves a background check and a semi-structured interview, where evaluators assess loyalty, reliability, and vulnerabilities in relation to national security. There are different supervisory authorities that correspond to different types of security-sensitive activities. It is up to the relevant supervisory authority to give practical guidance on how to perform the security vetting process.
Sweden’s Defense Materiel Administration (DMA), which supervises security-sensitive defense industries, requires the use of a semi-structured interview protocol. The protective security vetting interviews take place within the defense industry, a sector that exemplifies the characteristics of knowledge-intensive companies (Alvesson, 2000), where employees are often entrusted with sensitive information to perform complex security-sensitive activities. However, decision makers retain considerable discretion, and there are no formal scoring protocols or interrater calibration procedures. This creates fertile ground for both noise and bias, echoing the concerns raised by Kahneman et al. (2021).
1.4. Structure and scope of Swedish vetting interviews
Although they resemble employment interviews in format, Swedish personnel security vetting interviews are distinct in both purpose and scope. They are conducted primarily to support decisions about granting security clearances for positions involving classified or sensitive defense-related information. The interviews probe a broad range of topics, including birthplace, citizenship, financial status, living conditions, relationships, criminal history, and substance use, including alcohol and illegal drugs. The DMA’s protocol spans 16 sections, starting with demographic and social ties, and concluding with explicit questions on loyalty and reliability. These interviews are not used to decide job fit per se, but to assess risks that could compromise national security, such as divided loyalties or vulnerabilities to coercion. The interview is designed as an information-gathering exercise, with roots in investigative interviewing methods (Brandon et al., 2018; Meissner, 2021). While the interviews are semi-structured to ensure coverage of critical topics, they lack standardized scoring rubrics, leaving evaluators to integrate diverse cues into holistic judgments. This flexibility can be advantageous for uncovering nuanced risks, but it also creates opportunities for inconsistency (noise) and bias.
1.5. Evaluating loyalty through foreign ties
Connections to foreign countries represent one of the most scrutinized factors in Swedish personnel security vetting. Both the Swedish Security Service and the DMA have flagged such ties as potential risk factors that may lead to divided loyalties (FMV, 2022; Säkerhetspolisen, 2023). The DMA protocol explicitly asks the interviewer to investigate any foreign ties and knowledge of languages that are not normally taught in Swedish schools (FMV, 2023). Interviewers are instructed that such characteristics can be relevant for assessing loyalty, though they are not themselves evidence of disloyalty.
The present study investigates how such connections are interpreted by a coherent sample of professional vetting officers in the Swedish Defense Industry. Specifically, we test whether
1. connections to foreign countries are seen as relevant to a personnel security clearance, and
2. judgments vary systematically across interviewers when all are presented with identical interview data.
By focusing on foreign ties, we can observe not only directional bias (e.g., lower loyalty ratings for certain countries) but also judgmental noise, both level noise (differences in overall harshness) and pattern noise (differences in which countries trigger concern). We use the term perceived otherness to capture how evaluators may interpret certain foreign ties through the lens of social identity processes, perceiving applicants as more or less aligned with the Swedish in-group depending on cultural, geographic, or political distance. This concept links the SIT perspective to both bias (systematic directional differences) and noise (inconsistent emphasis across judges) and frames the dual sources of variability that motivate our research questions.
1.6. Research questions
Grounded in Social Identity Theory and prior research on allegiance, assimilation, and professional loyalty, this study explores five interrelated questions. Each question is linked to bias, level noise, and/or pattern noise.
1. Perceived Relevance (bias): To what degree is an applicant’s connection to a foreign country perceived as relevant to a personnel security clearance?
2. Country-Specific Effects (bias and pattern noise): Does the nature of the foreign country (e.g., European vs. Asian vs. South American) systematically influence the likelihood of a ‘yes’ response to a dichotomous question concerning the presence of relevant information for a personnel security clearance?
3. Downstream Judgments (bias): Does judging a connection to a foreign country as relevant to the security clearance affect subsequent loyalty ratings on a 7-point scale?
4. Cultural Distance and Variability (pattern noise): Does greater cultural distance to the focal country lead to more disagreement in loyalty ratings, even when average ratings are lower?
5. Judge-Level Predictors (level and pattern noise): How much variability in loyalty judgments stems from judge-level tendencies (level noise), and how much from judge-stimulus interactions (pattern noise), such as differential responses to applicant gender or cultural background?
2. Method
2.1. Participants and data structure
A chain-referral sampling method was used to recruit 58 people (42 women and 16 men; age range: 25–55 years) who regularly conduct security vetting interviews in the defense industry in Sweden. The sample comprised 16 protective security specialists and 42 recruitment specialists. Both categories routinely perform protective security vetting interviews for security clearance decisions, but their responsibilities differ: recruitment specialists oversee the full recruitment process in addition to the vetting interview, while protective security specialists focus solely on the clearance component. All participants operated within the Swedish protective security vetting system within the defense industry, and the findings should be interpreted in the context of Sweden’s legal framework and organizational practices for personnel security. Table 1 summarizes participant demographics by group (gender, mean age, and years of experience). Recruiting personnel with experience in security vetting is inherently challenging due to the sensitivity of the work and the relatively small number of professionals in this field. As such, the sample size and composition reflect the practical constraints of recruitment. Participants were given two movie tickets for their participation.
Table 1 Participant demographics by job title (gender, mean age, and years of experience)

Note: N = 58. Participants were on average 35.5 years old (SD = 9.4). Years of experience performing security vetting interviews ranged from less than 1 year (<1) to more than 10 years (>10).
2.2. Materials
The instrument was developed on the Qualtrics Survey Platform (https://www.qualtrics.com). The stimuli consisted of 30 DMA protocols in total, containing details about applicants being assessed for a security-sensitive position. All applicants were Swedish citizens and thus connected to Sweden. However, in four of the five conditions, the protocols included an additional connection to another country (the United Kingdom, Germany, Brazil, or Thailand) through either a parent, spouse, or birthplace. These additional connections were the focal manipulation, allowing us to examine how connections to specific countries influenced both the binary security-relevance decision and subsequent loyalty ratings. The applicant’s name was drawn from the top 10 most common given names in the respective country to signal cultural origin. An example excerpt of a protocol is provided in Appendix D. The focal country was presented in a section of the protocol that contained information about citizenship and birthplace. Cultural similarity was operationalized in terms of geographic region, political–security alliances, and broader institutional ties to Sweden. The United Kingdom and Germany, both EU/NATO partners and frequent defense collaborators, were treated as culturally proximate benchmarks. Although Sweden maintains defense export ties with both Brazil and Thailand (e.g., Gripen fighter aircraft), these countries remain geographically and institutionally distant, lacking the alliance structures that characterize Sweden’s partnerships with the United Kingdom and Germany. This operationalization aligns with Social Identity Theory, which predicts stronger in-group perceptions toward close European partners and more pronounced out-group perceptions toward culturally and institutionally distant affiliations.
Each protocol was approximately eight printed pages and included demographic and contextual details such as birthplace, citizenship, education, occupation, hobbies, and family structure, along with information about social ties, financial standing, and potential risk indicators. All elements apart from the manipulated variables were held constant to ensure comparability.
Each participant read five protocols following a mixed between–within split-plot design. The within-participant factor was the focal country of the applicant (Sweden, the United Kingdom, Germany, Brazil, and Thailand). The between-participants factors were the applicant’s gender (man/woman) and the type of relationship to the focal country, which comprised three levels: (1) born in Sweden with parents from the focal country, (2) married to a person from the focal country, or (3) born in the focal country.
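Fully crossing these factors yields the 30 DMA protocols described above (5 countries × 3 relationship types × 2 applicant genders). The sketch below is one plausible enumeration with hypothetical condition labels; the exact handling of the Sweden-only condition follows the description in the text.

```python
from itertools import product

# One plausible enumeration of the stimulus design: 5 focal countries
# x 3 relationship types x 2 applicant genders = 30 protocol variants.
# Condition labels are hypothetical, not taken from the study materials.
countries = ["Sweden", "United Kingdom", "Germany", "Brazil", "Thailand"]
relationships = ["parents from focal country",
                 "married to focal-country national",
                 "born in focal country"]
genders = ["man", "woman"]

protocols = list(product(countries, relationships, genders))
```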
The first question a participant answered after reading the protocol was: ‘Has information that may be relevant to the security assessment emerged?’ If that question was answered with a yes, the participant was asked to state a reason. The participant then answered 13 questions using a 7-point sliding scale ranging from 1 = ‘do not agree at all’ to 7 = ‘fully agree’ (see Appendix B for the full scale). Three of the 13 questions concerned loyalty; these were:
- L1: To what extent do you agree with the statement that the person can be assumed to be loyal to the interests that the Security Protection Act (2018: 585) aims to protect, on a scale from 1 to 7?
- L2: To what extent do you agree with the statement that there is a risk of a conflict of loyalty, on a scale from 1 to 7?
- L3: To what extent do you agree with the statement that the person can be assumed to value loyalty to Sweden higher than loyalty to another party if forced to choose between Sweden and the other party, on a scale from 1 to 7?
Cronbach’s alpha for these three items was 0.57 (standardized α = 0.62, 95% CI [0.33, 0.73]), reflecting moderate internal consistency. Although we report Cronbach’s alpha for the three ordinal loyalty items for comparability with prior work, this statistic assumes interval-level measurement and may misestimate reliability for ordinal data (Footnote 1). Given the multidimensional nature of loyalty and the small item set, this level of reliability was expected, and the analyses accounted for measurement error through the hierarchical item response modeling framework. These items were designed by the researchers specifically for this study, based on guidance from the DMA’s interview framework.
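For transparency, Cronbach’s alpha is a simple function of item and total-score variances. The sketch below computes it on a toy response matrix (illustrative values, not the study’s data):

```python
import numpy as np

# Toy responses: 6 raters x 3 loyalty items on a 7-point scale (illustrative only).
X = np.array([
    [6, 5, 6],
    [4, 4, 5],
    [7, 6, 6],
    [3, 4, 2],
    [5, 3, 4],
    [6, 6, 7],
], dtype=float)

k = X.shape[1]
item_vars = X.var(axis=0, ddof=1)        # sample variance of each item
total_var = X.sum(axis=1).var(ddof=1)    # variance of the sum score

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total-score variance)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```

Alpha rises when items covary strongly relative to their individual variances, which is why a small, multidimensional item set (as here) tends to yield only moderate values.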
2.3. Procedure
Participants accessed general information about the experiment and a consent form via an anonymous link. After consenting, the first phase consisted of background questions and additional details about the experiment (see Appendix A). The information and instructions were designed to avoid triggering thoughts about the research questions.
In the second phase, participants were randomly assigned to the experimental conditions; all participants rated all five countries (the within-participant factor). Relationship to the five countries was randomly varied between participants, and the gender of the applicant was randomly varied between participants and countries. The third and final phase consisted of questions about protective security in general and asked the participant to define loyalty in the context of protective security.
The study complied strictly with the guidelines of the Swedish Research Council’s ethics committee and the European Research Council’s ethics committees for research involving human participants. Participation in the study was voluntary and did not involve coercion, deception, or any invasive or potentially dangerous methods. All participants provided signed consent before taking part in the study. According to the Swedish Ethical Review Authority and the guidelines of Lund University, formal ethical approval was not required (Footnote 2).
2.4. Data analysis
All data were complete, and initial visual inspection revealed no anomalies. Subsequently, the data were summarized descriptively to reveal important trends among the participants and variables involved. Statistical models were fit using a full Bayesian workflow, including posterior estimation, hypothesis testing, and model comparison. This approach follows best practices in applied Bayesian modeling (e.g., Gelman et al., 2013; Kruschke, 2015; McElreath, 2020; Schad et al., 2022; Wagenmakers et al., 2010), integrating posterior credible intervals, directional Bayes factors, and posterior predictive checks to evaluate evidence and model fit. All data and R code required to reproduce the analyses, including model specifications, priors, convergence diagnostics, and plotting scripts, are available in the Supplementary Material. To balance explanatory scope with model parsimony, and to guard against overfitting given the imbalance in participant gender and profession, we fit and compared a set of full and reduced models. The full models included focal country (C), relationship type (T), participant gender (A), applicant gender (G), and participant profession (Group), while the reduced models excluded T, A, and Group. Predictor selection was guided by our research questions and relevant theory. The focal country was the central manipulation and was included in all models. Applicant gender was included as a theoretically meaningful variable based on previous findings that gender can be a source of bias in evaluative contexts (Bursell, 2014; Lippens et al., 2023).
We initially fit the full models and conducted sensitivity analyses using the power-scaling method (Kallioinen et al., 2024), which quantifies the relative influence of the prior and likelihood on the posterior. This method is particularly informative for hierarchical models, where shrinkage effects can be strong. The results revealed moderate prior sensitivity for several group-level parameters, particularly random effects, prompting us to refit the models using slightly wider priors that place more weight on the data. These adjusted models yielded results consistent with the originals, supporting the stability of the conclusions.
We next explored whether inclusion of the predictors (T, A, Group) improved model fit using 10-fold cross-validation (Vehtari et al., 2017). This approach evaluates how well a model captures the observed data while penalizing for complexity. We relied on predictive performance via k-fold and visual comparisons of prior and posterior distributions to judge model adequacy.
In addition to evaluating model fit and parameter stability, we used Bayes factors via the Savage-Dickey density ratio method (Wagenmakers et al., 2010) for targeted hypothesis tests, such as whether a given parameter (e.g., the effect of the relevance judgment, D, on loyalty ratings) differs meaningfully from zero. This approach allowed us to address research questions framed as comparisons between a null hypothesis (no effect) and an alternative (some effect), using the same posteriors already estimated for inference. The final models, one binary probit model for clearance decisions and one ordinal item response model (IRM) for loyalty ratings, were selected based on superior fit to the observed data, K-fold cross-validation, convergence diagnostics, and prior sensitivity analysis. The exclusion of T, A, and Group did not materially affect the posterior estimates for the retained predictors, further reinforcing the robustness of the findings. We considered including participant gender (A) and profession (Group) as main effects and in interaction terms. However, their inclusion introduced risks of confounding due to collinearity (e.g., most men were protective security specialists), as well as instability due to sample imbalance (only 16 men). This limited our ability to estimate these effects reliably. We therefore examined their effects in follow-up models, as described below, and excluded them from the final models to preserve clarity and interpretability. Although not all predictors were retained in the final models, this exploratory approach allowed us to assess both theoretically grounded and practically relevant sources of judgmental variability.
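The Savage-Dickey method estimates a Bayes factor for a point null by comparing the prior and posterior densities of a parameter at zero. A minimal sketch, using a normal(0, 2) prior as in the model specification and stand-in posterior draws (in practice these would be MCMC samples from the fitted brms model):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Prior on the effect, matching the N(0, 2) priors used in the models.
prior_density_at_0 = stats.norm.pdf(0.0, loc=0.0, scale=2.0)

# Stand-in posterior draws for illustration (hypothetical location/scale);
# in practice: MCMC samples of the parameter from the fitted model.
posterior_draws = rng.normal(loc=0.8, scale=0.3, size=20_000)

# Savage-Dickey density ratio: BF_01 = posterior density at 0 / prior density at 0.
posterior_density_at_0 = stats.gaussian_kde(posterior_draws)(0.0)[0]
bf_01 = posterior_density_at_0 / prior_density_at_0
bf_10 = 1.0 / bf_01   # evidence for a nonzero effect
```

Because the posterior here is concentrated away from zero, its density at zero is lower than the prior’s, so BF_10 > 1, favoring the alternative.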
The binary probit model, together with the IRM, allows us to infer both bias and noise, capturing not only what participants decided but also how consistent they were in their decisions. The binary probit model addresses RQ1 and RQ2 concerning the relevance judgment (D) and the effect of country (C) and gender (G) on that judgment. The ordinal item response model (IRM) was used to evaluate RQ3–RQ5 concerning loyalty ratings, including the downstream effect of relevance judgments (D) on loyalty, and variability across countries (C) and individuals (ID). In the IRM, we estimated discrimination parameters (α) to quantify within-participant consistency in how the 7-point scale was applied. In this context, higher discrimination values indicate that judges applied the 7-point loyalty scale more consistently, whereas lower discrimination values signal greater inconsistency (i.e., more within-participant noise) in how identical information was translated into ratings. All analyses were conducted in R (v4.4.3; R Core Team, 2021) using the brms package (Bürkner, 2017, 2018, 2021), which interfaces with Stan (v2.36; Carpenter et al., 2017; Stan Development Team, 2024).
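The role of the discrimination parameter can be illustrated by simulation. In the parameterization sketched below (one common convention, not necessarily the one used internally by brms), a higher α shrinks the noise around the latent trait, so the same signal is mapped onto the 7-point scale more consistently:

```python
import numpy as np

rng = np.random.default_rng(1)

def ordinal_response(theta, alpha, cutpoints, n, rng):
    """Map a latent loyalty signal theta onto a 7-point scale.

    Higher discrimination alpha shrinks the noise around theta,
    so responses land in the same categories more consistently."""
    latent = theta + rng.standard_normal(n) / alpha
    return np.digitize(latent, cutpoints) + 1  # categories 1..7

# Hypothetical equally spaced cutpoints and latent signal.
cutpoints = np.array([-2.25, -1.35, -0.45, 0.45, 1.35, 2.25])
theta = 0.5  # the same latent signal shown to both simulated judges

consistent = ordinal_response(theta, alpha=3.0, cutpoints=cutpoints, n=2000, rng=rng)
noisy = ordinal_response(theta, alpha=0.5, cutpoints=cutpoints, n=2000, rng=rng)
```

The high-α judge’s ratings cluster in a few adjacent categories, whereas the low-α judge spreads the identical signal across much of the scale, which is exactly the within-participant noise the discrimination parameter indexes.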
To explore possible interaction effects between participant and applicant gender, we fit separate subset models for men and women participants as well as an additional model including the interaction of participant and applicant gender. These exploratory models helped assess whether observed effects were driven by a particular group. Due to sampling constraints, two of the 60 possible combinations of country, relationship type, and participant/stimulus gender were not present in the dataset. The gender of the stimulus (G) was retained as a covariate in the models, but showed limited effects across models.
We also ran exploratory models, including participant profession (Group; protective security specialist vs. recruitment specialist), to assess whether role-specific experience influenced judgments. These models used the same priors, outcome variables, and estimation settings as the main models and were interpreted cautiously due to sample size limitations and potential confounding with gender.
2.5. Model 1: Predicting decisions about relevant information (binary outcome D)
To model participants’ responses to the dichotomous question about whether any information relevant to personnel security clearance had emerged, we fit a binary probit regression model using focal country (C) and the applicant’s gender (G) as predictors. This model directly addresses Research Questions 1 and 2, examining whether connections to specific countries (C) and applicant gender (G) systematically influence relevance judgments (bias), and whether judges vary in their responses to different countries (pattern noise). Each participant’s data were modelled with their own intercept and varying slopes for country, capturing differences in decision tendencies across judges.
$$\begin{align*}& D\sim \mathrm{Bernoulli}\left({\mu}_i\right)\\& \mathrm{probit}\left({\mu}_i\right)={\beta}_0+{\beta}_C{C}_{\left[i\right]}+{\beta}_G{G}_{\left[i\right]}+{u}_{0\left[{ID}_{\left[i\right]}\right]}+{u}_{C\left[{ID}_{\left[i\right]}\right]}\bullet {C}_{\left[i\right]}\\&{\beta}_0,{\beta}_C,{\beta}_G\sim \mathcal{N}\left(0,2\right),\kern1em {\sigma}_u\sim \mathrm{Exponential}(0.5),\\&\left[\begin{array}{c}{u}_{0\left[ ID\right]}\\ {}{u}_{C\left[ ID\right]}\end{array}\right]\sim \mathcal{N}\left(\left[\begin{array}{c}0\\ {}0\end{array}\right],\Sigma \right),\kern1em \Sigma =\mathrm{diag}\left({\sigma}_u\right)\,\Omega\,\mathrm{diag}\left({\sigma}_u\right),\kern1em \Omega \sim \mathrm{LKJ}(2)\end{align*}$$
where D is the participant’s binary decision (1 = yes; 0 = no), and C and G represent the country and gender of the fictitious applicant. All $\beta$ coefficients followed moderately weak priors: ${\beta}_0,{\beta}_C,{\beta}_G\sim \mathcal{N}\left(0,2\right)$. Country-specific random slopes per participant were modeled with standard deviations drawn from an $\mathrm{Exponential}(0.5)$ prior, and their correlations followed an LKJ(2) prior (Lewandowski et al., Reference Lewandowski, Kurowicka and Joe2009). Model fit and convergence were verified using posterior predictive checks and diagnostics from the Stan sampler.
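To make the generative logic of Model 1 concrete, the following sketch simulates binary decisions from a hierarchical probit with varying intercepts and country slopes. This is illustrative Python with made-up parameter values (the study used R/brms), and the gender term is omitted for brevity:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

n_judges, n_countries = 58, 5  # Sweden + four focal countries (illustrative)
beta_0 = -1.0                                   # hypothetical baseline (probit scale)
beta_c = np.array([0.0, 1.9, 3.4, 2.0, 2.3])    # hypothetical country effects vs. Sweden

# Judge-level deviations: varying intercepts (level noise) and varying
# country slopes (pattern noise), as in the model specification above.
u0 = rng.normal(0.0, 0.8, size=n_judges)
uc = rng.normal(0.0, 1.2, size=(n_judges, n_countries))

# Probit link maps latent scores to 'yes' probabilities, then to decisions D.
eta = beta_0 + beta_c[None, :] + u0[:, None] + uc
p_yes = norm.cdf(eta)
D = rng.binomial(1, p_yes)

# Foreign-country conditions should elicit more 'yes' responses than Sweden.
print(D[:, 1:].mean() > D[:, 0].mean())
```

Fitting the model inverts this process: from the observed decisions, the posterior recovers both the systematic country effects (bias) and the spread of judge-specific slopes (noise).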
2.6. Model 2: Modelling ordinal loyalty ratings (item response model)
To analyze participants’ ordinal ratings on three 7-point loyalty-related questions (L1, L2, L3), we employed a hierarchical cumulative probit model with varying thresholds per item and varying intercepts and slopes by judge (Bürkner, Reference Bürkner2021; Bürkner and Vuorre, Reference Bürkner and Vuorre2019; Liddel and Kruschke, Reference Liddel and Kruschke2018). This model addresses Research Questions 3–5 by examining how the binary relevance judgment (D) influences loyalty ratings (RQ3), how country and applicant gender affect perceived loyalty (RQ4), and the extent of between-judge variability in overall severity (level noise) and in responses to specific countries (pattern noise), as well as judges’ within-participant consistency in how they used the rating scale (captured through discrimination parameters; RQ5). Including a discrimination parameter allows us to quantify within-judge consistency, with lower α indicating greater judgmental noise in the mapping from latent judgments to observed ratings.
Let Yij ∈ {1, …, 7} denote the ordinal response from participant i on item j. The model assumes:
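In the discrimination-augmented cumulative probit parameterization (as implemented in brms), this amounts to:

$$\Pr \left({Y}_{ij}\le k\right)=\Phi \left({\alpha}_{ij}\left({\tau}_{kj}-{\mu}_{ij}\right)\right),\kern1em k=1,\dots, 6$$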
In this model:

- $\Phi$ is the standard normal cumulative distribution function (probit link),
- ${\tau}_{kj}$ is the kth threshold for item j, where each item has six thresholds separating seven response categories,
- ${\mu}_{ij}$ is the latent location (trait level) for participant i and item j, and
- ${\alpha}_{ij}$ is the discrimination parameter for participant i and item j, reflecting how consistently the participant applies the rating scale.
Location model:
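Consistent with the fixed-effect priors and random-effects structure specified below, the latent location for participant i and item j can be written as (a reconstruction from the stated priors; the participant-specific country slope is inferred from the pattern-noise estimates reported in the Results):

$$ {\mu}_{ij}={\beta}_C{C}_{\left[i\right]}+{\beta}_G{G}_{\left[i\right]}+{\beta}_D{D}_{\left[i\right]}+{u}_i+{u}_{C\left[i\right]}\bullet {C}_{\left[i\right]}+{v}_j$$

where ${u}_i$ and ${u}_{C\left[i\right]}$ are the participant-level intercept and country slopes, and ${v}_j$ is the item-level intercept.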
Discrimination model:
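On the discrimination side, brms models the logarithm of the discrimination parameter, so a formulation consistent with the $\eta$ priors below is (again a reconstruction from the stated priors):

$$ \log {\alpha}_{ij}={\eta}_C{C}_{\left[i\right]}+{\eta}_G{G}_{\left[i\right]}+{\eta}_D{D}_{\left[i\right]}+{w}_i+{x}_j$$

where ${w}_i$ and ${x}_j$ are participant- and item-level deviations; the log link keeps ${\alpha}_{ij}$ positive.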
Threshold priors were weakly informative and spaced around standard normal quantiles (see Kurz, Reference Kurz2021):
$$\begin{align*}&{\tau}_{1j}\sim \mathcal{N}\left(-1.07,1\right),\kern1em {\tau}_{2j}\sim \mathcal{N}\left(-0.57,1\right),\kern1em {\tau}_{3j}\sim \mathcal{N}\left(-0.18,1\right),\\&{\tau}_{4j}\sim \mathcal{N}\left(0.18,1\right),\kern1em {\tau}_{5j}\sim \mathcal{N}\left(0.57,1\right),\kern1em {\tau}_{6j}\sim \mathcal{N}\left(1.07,1\right)\end{align*}$$
While these priors do not enforce monotonicity on their own, brms ensures ordered thresholds by modeling them via a cumulative transformation of unconstrained parameters in Stan (see Stan Development Team, 2023). Thus, threshold order is guaranteed during estimation regardless of prior specification.
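The transformation Stan applies for its `ordered` type can be sketched as follows (a minimal Python illustration, not the study's code): the first element is free and every subsequent element adds a strictly positive increment.

```python
import numpy as np

def ordered_transform(raw):
    """Map an unconstrained vector to strictly increasing thresholds,
    mirroring Stan's `ordered` type: the first element is free and each
    subsequent element adds an exponentiated (hence positive) increment."""
    out = np.empty_like(raw, dtype=float)
    out[0] = raw[0]
    out[1:] = raw[0] + np.cumsum(np.exp(raw[1:]))
    return out

raw = np.array([-1.2, -0.5, -1.0, 0.3, -0.8, 0.1])  # arbitrary unconstrained values
tau = ordered_transform(raw)
print(np.all(np.diff(tau) > 0))  # thresholds are monotone by construction
```

Because monotonicity is built into the transformation, the independent normal priors on the thresholds act only as soft information about their locations, never as an ordering constraint.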
Fixed effect priors were set as ${\beta}_C,{\beta}_G,{\beta}_D\sim \mathcal{N}\left(0,2\right)$ for location, and ${\eta}_C,{\eta}_G,{\eta}_D\sim \mathcal{N}\left(0,1\right)$ for discrimination. Random intercepts and slopes (participant and item level) were modeled as ${u}_i,{v}_j,{w}_i,{x}_j\sim \mathcal{N}\left(0,{\sigma}^2\right)$, with ${\sigma}\sim \mathrm{Exponential}(0.5)$. This structure captures both perceived loyalty (location) and response consistency (discrimination), allowing for both participant- and item-level variation.
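To illustrate how the location and discrimination components jointly generate observed ratings, the following sketch (illustrative Python with made-up parameter values, not the fitted model) draws responses from a discrimination-augmented cumulative probit. Lower discrimination spreads responses across more categories, i.e., noisier scale use:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
tau = np.array([-1.07, -0.57, -0.18, 0.18, 0.57, 1.07])  # six thresholds, 7 categories

def sample_rating(mu, alpha, n, rng):
    """Draw ordinal ratings from a cumulative probit with discrimination:
    P(Y <= k) = Phi(alpha * (tau_k - mu))."""
    cum = norm.cdf(alpha * (tau - mu))
    probs = np.diff(np.concatenate(([0.0], cum, [1.0])))  # category probabilities
    return rng.choice(np.arange(1, 8), size=n, p=probs)

# Same latent location, different discrimination: low alpha spreads responses
# across more categories (more within-judge noise in scale use).
consistent = sample_rating(mu=0.5, alpha=2.0, n=2000, rng=rng)
noisy = sample_rating(mu=0.5, alpha=0.5, n=2000, rng=rng)
print(consistent.std() < noisy.std())
```

This is the sense in which the estimated discrimination parameters index within-judge consistency: two judges with the same latent impression of an applicant can still produce very different rating distributions if their discrimination differs.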
We compared the prior and posterior distributions for all fixed effects to evaluate the influence of prior assumptions across both models. Figure 1 shows an illustrative example from the IRM model for the fixed effect of Brazil, where the posterior is narrowly centered around -1, well separated from the prior, demonstrating clear and substantial evidence for a negative effect on loyalty ratings associated with that condition. On the 7-point scale, this corresponds to an expected reduction of approximately 0.7 rating points compared to Sweden, equivalent to moving from ‘agree somewhat’ toward ‘neutral’ in perceived loyalty (see comparative plots for all the fixed effects in Appendix C).

Figure 1 Prior vs. posterior for Brazil (b_CCL). The weakly informative prior in grey and the posterior distribution in blue. The figure shows how strongly the posterior distribution is informed by the data.
2.7. Subset models: Moderation by participant gender and profession
We conducted follow-up analyses using gender-specific subsets of the data. These subset analyses explore whether participant gender moderates the effect of applicant gender on judgments (RQ5). Given the imbalance in participant gender and potential confounding with profession, these models were treated as exploratory and interpreted cautiously. Simplified models were fit to the male and female participant samples separately, using only applicant gender as a predictor. We also examined a model including the interaction of participant and applicant gender. These models used the same priors, outcome variables, and estimation settings as the main models. To ensure valid inference, we also reviewed the distribution of stimulus gender within each subgroup.
We also conducted exploratory models, including participant profession (protective security specialist vs. recruitment specialist) as a predictor. These analyses addressed whether professional background was associated with differences in clearance judgments or loyalty ratings. The models extended the main binary probit and ordinal IRM specifications by adding a main effect of profession (Group). As with the gender analyses, these models used the same priors, outcome variables, and estimation settings as the main models. K-fold cross-validation indicated that including profession did not improve predictive performance. Given the small and unbalanced group sizes, these results are treated as exploratory.
3. Results
3.1. Descriptive statistics
Across all experimental conditions, fewer than half (48.3%) of the protocols resulted in a ‘yes’ to the question of whether there was any information that may be relevant to the security assessment. A majority (89%) of the justifications provided for ‘yes’ responses explicitly referred to the applicant’s foreign connection, indicating that when participants did see a profile as clearance-relevant, this was usually attributed to the country tie. It is important to note that these counts of ‘connection as motive’ (reported in Table 2) are a subset of the overall relevance judgments (D). That is, the column shows how often participants explained their ‘yes’ judgments in terms of the foreign tie, not the total frequency of ‘yes’ responses per condition. Median ratings of loyalty declined with increasing cultural distance from Sweden, consistent with the study’s central hypothesis (see Table 2). This pattern was evident across all three loyalty questions (L1–L3), though the extent of the decline varied by type of relationship to the focal country. Table 2 reports the median ratings for each question by focal country, along with counts of ‘yes’ responses and the number of these responses explicitly justified by a foreign connection. The total number of protocols rated per condition (n) indicates the maximum possible number of ‘yes’ responses in that condition. In total, participants rated 232 protocols explicitly describing a connection to one of the foreign countries, out of 290 protocols overall.
Table 2 Median rating per question by focal country, counts of ‘yes’ responses, and connection to a foreign country as justification

Note: Table 2 reports the median loyalty ratings (L1–L3) by focal country. The ‘Counts of yes’ column shows the total number of protocols within each condition that participants judged as containing information relevant to the clearance decision (D = 1). The ‘Connection as motive’ column reports, among those ‘yes’ responses, how many were explicitly justified by reference to the applicant’s foreign connection. The final column (n) indicates the total number of protocols per condition, that is, the maximum possible number of ‘yes’ responses.
3.2. Qualitative analysis of the reason for answering yes
All stated reasons for choosing ‘yes’ to the question of whether any information had emerged that would be relevant to the security clearance were reviewed. In keeping with the study’s operational definition of relevant information, responses that explicitly referred to a connection with a foreign nation were classified as correct identifications of information potentially relevant to the personnel security assessment.
This coding served two purposes. First, it provided a validity check for the binary relevance judgment (D) used in the subsequent Bayesian models, ensuring that affirmative responses generally reflected recognition of the intended stimulus feature. Second, it allowed us to quantify the prevalence of foreign connections as a driver of relevance judgments. As shown in Table 2 (column ‘Connection as motive’), the overwhelming majority of ‘yes’ responses corresponded to such connections, suggesting that participants consistently interpreted foreign associations as meeting the relevance criterion.
Building on this qualitative verification, we next examine quantitative patterns in the binary relevance judgments and the ordinal loyalty ratings, using Bayesian hierarchical models with focal country (C), stimulus gender (G), and the binary relevance judgment (D) as predictors.
3.3. Quantitative model results
The final models are described in the data analysis section. Below, the results are linked to their corresponding research questions.
3.3.1 Relevant information for personnel security clearance (Model 1)
The reduced probit model with focal country (C) and applicant gender (G) converged without issues, and inspection of posterior samples showed good effective sample sizes. Participants were less likely to answer ‘yes’ when evaluating applicants with no connection to a country other than Sweden (Intercept: M = –4.08, 95% CrI [–6.35, –2.08]). Compared to Sweden, connections to Germany (β = 3.39, 95% CrI [0.46, 5.86]) were associated with increased probability of identifying clearance-relevant information. Effects for Brazil (β = 2.30, 95% CrI [–0.75, 5.08]), the United Kingdom (β = 1.91, 95% CrI [–1.00, 4.43]), and Thailand (β = 1.97, 95% CrI [–1.15, 4.83]) were more uncertain, with intervals overlapping zero.
RQ1 asked whether the presence of a connection to one of the focal countries increased the likelihood of judging information as relevant for clearance. All applicants were Swedish citizens, but some also had a documented connection to a focal country, a fact that, under clearance-relevant criteria, should generally trigger a ‘yes’ response. To directly test this, we used one-sided Bayes factors (Savage–Dickey method, prior SD = 2) for the hypothesis that each country effect was positive (H1: β > 0). We found strong evidence for Germany (β = 3.39, 95% CrI [1.01, 5.49], BF10 = 77.4) and moderate evidence for the United Kingdom (β = 1.91, 95% CrI [–0.47, 4.06], BF10 = 10.4), Brazil (β = 2.30, 95% CrI [–0.24, 4.63], BF10 = 14.0), and Thailand (β = 1.97, 95% CrI [–0.63, 4.38], BF10 = 8.95; see footnote 3). Despite these effects, the observed proportion of ‘yes’ responses was lower than expected in all four of the focal country conditions (<0.50), indicating that many judges did not treat such connections as relevant to the personnel security clearance. This suggests heterogeneous thresholds or divergent interpretations of what constitutes relevant information.
Participant-level variability in interpreting each country was substantial (Figure 2). The standard deviations of the random slopes for each country (relative to Sweden) were sizeable, particularly for Brazil (SD = 11.60, 95% CrI [6.18, 19.81]) and Thailand (SD = 12.72, 95% CrI [7.14, 21.05]). Variability was also evident for Germany (SD = 5.71, 95% CrI [0.51, 13.49]) and the United Kingdom (SD = 8.06, 95% CrI [3.57, 15.34]). To formally quantify support for this variability, we computed Bayes Factors comparing the hypothesis that the standard deviation was greater than zero against a point null hypothesis of no variation. For all countries, the posterior probability that the standard deviation exceeded zero was > 0.999, corresponding to Bayes Factors greater than 1000 in favor of nonzero variability. This provides overwhelming evidence that participants varied substantially in how they interpreted candidate profiles, consistent with a high degree of pattern noise. Exploratory models including participant profession further suggested that protective security specialists were more likely than recruitment specialists to judge foreign connections as clearance relevant (β = 3.01, 95% CrI [0.39, 5.56], BF10 = 75.9 for β > 0), indicating strong evidence in favor of a positive effect. On the probit scale, this corresponds to a substantially higher probability of responding ‘yes’ when evaluating identical protocols. However, the posterior estimates indicated no systematic differences between the two groups in how they rated loyalty or in their rating consistency (see Section 3.4.2). These results address RQ1 by showing country-specific differences in binary clearance judgments and highlight substantial between-judge disagreement (noise) in interpreting identical case material.

Figure 2 Between-judge variability in identifying a foreign connection as clearance-relevant. Countries ordered by increasing cultural distance from Sweden. Violin width = posterior density; points/lines = median and 95% CrI of the SD parameter (variability).
Note: Between-judge variability in identifying a connection to one of the focal countries as relevant for a personnel security clearance. Violin plots show the posterior distribution of standard deviations for the random slopes of country effects (relative to Sweden), with 95% credible intervals overlaid. Wider violins indicate greater uncertainty in the estimate. Larger values reflect greater between-judge variability (i.e., less agreement) in how connections to that country were judged as relevant. Black dots represent posterior means; boxplots show medians and interquartile ranges.
RQ2 asked whether the nature of the foreign country (e.g., European vs. Asian vs. South American) systematically influenced the likelihood of a ‘Yes’ response. We conducted all six possible pairwise comparisons between the four focal countries using two-sided Bayesian contrasts (prior SD = 2). None of the pairwise comparisons between the focal countries yielded substantial evidence for or against a difference. Bayes factors ranged from 0.94 to 1.25, suggesting that the data were equivocal with respect to differences between the countries. Thus, while connections to foreign countries increased perceived relevance relative to Sweden (RQ1), we found no clear evidence that specific foreign countries differed in how relevant they were judged. For example, the credible intervals for the estimates of Germany vs. Brazil (β = 1.09, 95% CrI [–2.69, 4.85]) and Germany vs. Thailand (β = 1.42, 95% CrI [–2.36, 5.20]) illustrate high uncertainty. This pattern was consistent across all other pairwise comparisons, with Bayes Factors close to 1 throughout, indicating a lack of compelling evidence for differences between countries. These results suggest that, while a foreign-country connection increases perceived relevance compared to Sweden (RQ1), there is no detectable difference in perceived relevance between the specific foreign countries tested.
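Such pairwise contrasts fall directly out of the posterior draws: the contrast is simply the distribution of the draw-wise difference between two coefficients. A minimal sketch (illustrative Python with hypothetical posterior summaries, not the fitted model's draws):

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical posterior draws for two country effects on the probit scale,
# standing in for MCMC samples; wide, overlapping posteriors as in the study.
beta_germany = rng.normal(3.4, 1.4, size=4000)
beta_brazil = rng.normal(2.3, 1.5, size=4000)

# The pairwise contrast is the distribution of the draw-wise difference.
contrast = beta_germany - beta_brazil
lo, hi = np.percentile(contrast, [2.5, 97.5])

# A wide interval straddling zero: the data are equivocal about a difference.
print(lo < 0 < hi)
```

Because the difference inherits the uncertainty of both coefficients, even two posteriors with visibly different means can yield a contrast interval that comfortably includes zero, which is the pattern observed here.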
3.4. Loyalty ratings (Model 2)
We modelled participants’ 7-point ratings on the three loyalty items using a hierarchical cumulative probit model with item-varying thresholds, random intercepts, and slopes by participant. The model included a latent discrimination parameter to account for response consistency. Convergence diagnostics were satisfactory, and posterior predictive checks indicated adequate descriptive fit.
3.4.1 Main effects on loyalty ratings
RQ3 asked whether applicants judged to have clearance-relevant information (D = ‘Yes’) also received lower loyalty ratings. They did: on the latent scale, the effect of D was clearly negative (β_D = −0.76, 95% CrI [−1.06, −0.51]). A one-sided Bayes factor test (H1: β_D < 0) provided decisive evidence (posterior mass at β_D ≥ 0 ≈ 0; BF10 ≫ 100), confirming a robust downstream effect from the relevance judgment to loyalty ratings. This negative effect also manifested on the original 1–7 scale: switching from D = 0 (‘no relevant information’) to D = 1 (‘relevant information’) reduced the expected loyalty rating by 0.77 points (posterior mean Δ = −0.77, 95% CrI [−1.06, −0.51]). This descriptive result reflects the posterior-predictive implication of the latent-scale coefficient and reinforces the conclusion that ‘Yes’ judgments are followed by lower loyalty ratings. Figure 3C visualizes this shift in expected ratings by judgment type (D), showing clearly that applicants judged to be relevant receive systematically lower scores on the observed 1–7 scale. Compared to Sweden, loyalty ratings were lowest for Brazil (β = –1.56, 95% CrI [–2.12, –1.08]) and Thailand (β = –1.48, 95% CrI [–2.01, –1.02]), followed by the United Kingdom (β = –1.27, 95% CrI [–1.77, –0.84]) and Germany (β = –1.23, 95% CrI [–1.72, –0.80]). These fixed effects are depicted in Figure 3A, which shows the posterior distributions for each country.
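The mapping from a latent-scale coefficient to a shift in expected rating points depends on where the baseline sits relative to the thresholds and on the discrimination. As a generic illustration of how such a posterior-predictive implication is computed (Python sketch with the standard-normal-quantile thresholds from the prior specification, unit discrimination, and a baseline at zero; not the fitted model, whose implied shift was ≈0.77 points):

```python
import numpy as np
from scipy.stats import norm

tau = np.array([-1.07, -0.57, -0.18, 0.18, 0.57, 1.07])  # item thresholds

def expected_rating(mu, alpha=1.0):
    """Expected value on the 1-7 scale under a cumulative probit model:
    category probabilities are differences of Phi(alpha * (tau_k - mu))."""
    cum = norm.cdf(alpha * (tau - mu))
    probs = np.diff(np.concatenate(([0.0], cum, [1.0])))
    return np.dot(np.arange(1, 8), probs)

# Expected-rating shift implied by a latent drop of 0.76 from a zero baseline.
shift = expected_rating(0.0) - expected_rating(-0.76)
print(round(shift, 2))
```

The same computation applied to the fitted model's posterior draws, condition by condition, yields the 0.77-point reduction reported above; the size of the shift in rating points varies with the baseline location because the thresholds are unevenly informative across the latent scale.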

Figure 3 Top panel (A): Fixed effects on latent loyalty (relative to Sweden). Violin width = posterior density; points/lines = median and 95% CrI. Middle panel (B): Between-judge variability captured as the posterior distribution of the SD across judges of country-specific random slopes (larger values = more disagreement). Wider posteriors indicate greater uncertainty, whereas larger SD values indicate more variability. Bottom panel (C): Expected loyalty ratings by relevance judgment (D). Violin width = posterior density of expected ratings on the 1–7 scale; points/lines = median and 95% CrI.
RQ4 asked whether greater cultural distance increases disagreement between judges (pattern noise). Between-judge pattern noise was summarized as the standard deviation (SD) of the random slope per country (relative to Sweden). We found substantial heterogeneity in participant-specific reactions to each country: Brazil (SD = 0.43, 95% CrI [0.13, 0.72]), Germany (SD = 0.27, [0.02, 0.56]), the United Kingdom (SD = 0.21, [0.01, 0.49]), and Thailand (SD = 0.14, [0.01, 0.37]). Posterior probabilities of SD > 0 were ~1.0 for all four countries, with corresponding Bayes Factors (BF10) all > 1000, indicating decisive evidence for nonzero variability in each case. These results are visualized in Figure 3B, which shows the posterior distributions of between-judge variability for each country. The wider and higher distributions for Brazil and Germany indicate greater disagreement among judges in these conditions.
Discrimination parameters were estimated relative to Sweden; positive values indicate sharper category distinctions (greater consistency), negative values the converse. Evidence for higher discrimination was clearest for Thailand (0.46, 95% CrI [0.11, 0.83]). Estimates for the United Kingdom, Germany, and Brazil were centered above zero but more uncertain (discrimination parameter for the United Kingdom = 0.21, [−0.14, 0.57]; Germany = 0.30, [−0.05, 0.67]; Brazil = 0.19, [−0.18, 0.57]). The discrimination effect for D itself was small and uncertain (0.19, [−0.05, 0.42]), indicating little evidence that judging a case as relevant/irrelevant changed how sharply participants used the rating scale.
3.4.2. Variability in judgement harshness
To address RQ5, we examined how much variability in loyalty judgements stemmed from general differences in rating severity across participants (level noise) versus differential responses to specific countries (pattern noise). Level noise was estimated using the standard deviation of the participant-level intercepts (sd(Intercept) = 0.75, 95% CrI [0.51, 1.05]), indicating meaningful variation in baseline rating tendencies (see Figure 4).

Figure 4 Participants’ general tendency in ratings of loyalty. Participant-level intercepts (latent scale) with 95% credible intervals, showing general leniency versus harshness in loyalty ratings. Participants above zero tend to give more favorable ratings, while those below zero are harsher relative to the average judge. Other judge-level characteristics (e.g., professional background as recruitment vs. protective security specialist) were tested in exploratory models and did not systematically explain this variability, suggesting that differences reflect individual judgment styles rather than professional role.
Pattern noise was operationalized as the standard deviation of the country-specific random slopes. All four focal countries showed decisive evidence for between-judge variability in loyalty judgments, with posterior probabilities Pr(SD > 0) = 1.0 and Bayes factors BF10 > 1000 in each case. The largest variability was observed for Brazil (SD = 0.43, 95% CrI [0.13, 0.72]), followed by Germany (0.27, [0.02, 0.56]), the United Kingdom (0.21, [0.01, 0.49]), and Thailand (0.14, [0.01, 0.37]).
To test whether greater cultural distance was associated with more disagreement, we compared the average SDs for the high-distance countries (Brazil, Thailand) with those for the lower-distance countries (UK, Germany). This yielded a Bayes factor of BF10 = 1.79, indicating only anecdotal evidence for the hypothesis that cultural distance increases disagreement in loyalty assessments. Still, Figure 3B shows clear differences in posterior spread across countries, especially the wider variability in Brazil compared to Thailand, which may reflect differences in perceived ambiguity or salience.
Exploratory models, including participant profession, indicated that protective security specialists were more likely to judge information as relevant to clearance than recruitment specialists, but did not differ systematically in loyalty ratings or rating consistency. To probe judge-level predictors of pattern noise, we fit exploratory subset models (not part of the final specification), examining the A × G interaction for ratings. Results were consistent with a male judge’s leniency for male applicants:
- Men only: β_G = 0.44, 95% CrI [0.14, 0.77], BF10 = 622 (strong evidence for more favorable ratings of male applicants).
- Women only: β_G = −0.02, 95% CrI [−0.21, 0.17], BF10 = 0.69 (no evidence for a directional effect; BF01 ≈ 1.45).
- Combined A × G model: β_{A × G} = −0.46, 95% CrI [−0.83, −0.10], BF10 = 172 (strong evidence that the applicant-gender effect is stronger among male judges).
These analyses should be interpreted cautiously, given the small number of male judges (16 of 58).
4. Discussion
First and foremost, it is important to note that participants evaluated applicant profiles that were equivalent in all respects except for the gender of the applicant, the relational context, and the country to which the applicant was connected. There was no other information in the protocols that would, on its own, justify systematically lower ratings of loyalty for certain countries. Still, clear differences emerged in both relevance judgments and subsequent loyalty ratings. This suggests that evaluations were shaped not by intrinsic features of the applicants, but by perceptions of foreign ties and their implied meaning. While the Swedish Security Service (2023) and the Swedish Defense Materiel Administration (2022) emphasize that ties to foreign countries are relevant to an assessment of a personnel security clearance, highlighting relevance is not equivalent to making assumptions about loyalty. Our findings indicate, however, that relevance judgments may serve as an anchor to more consequential inferences, particularly when connections involve countries perceived as culturally or politically distant from Sweden. From an applied perspective, even moderate shifts on the 7-point loyalty scale are meaningful. For example, the downstream effect of judging a foreign connection as relevant (D = 1) lowered loyalty ratings by about 0.7 points, roughly equivalent to moving from ‘agree somewhat’ toward ‘neutral.’ In practice, a shift of this magnitude can alter whether a candidate is perceived as neutral versus positively loyal. For clearance decisions, even half a scale point may tip judgments toward concern. Thus, the observed differences are not only statistically supported but also consequential for applied vetting outcomes. Likewise, between-judge variability of 0.4–0.7 points on the latent scale means that two evaluators could reach diverging conclusions when reviewing identical protocols. 
These magnitudes highlight that the problem is not only statistical but practical: without calibration, organizations cannot ensure equal treatment across applicants or confidence that clearance decisions reflect shared standards of loyalty.
The findings of this study shed light on how judgments of loyalty are constructed in personnel security vetting, with implications for both theory and practice. Grounded in Social Identity Theory (SIT) and perspectives on professional loyalty, we examined whether applicants’ ties to foreign countries are seen as relevant to personnel security clearance (RQ1), the nature and direction of country-specific effects (RQ2), the downstream impact on loyalty ratings (RQ3), the variability in these ratings (RQ4), and the role of judge-level characteristics in shaping these responses (RQ5).
4.1. RQ1 and RQ2: Relevance and country-specific effects
Fewer than half of the participants judged a connection to a country other than Sweden as relevant to the personnel security clearance, a clear signal that the guidance issued by the Swedish Security Service and the DMA has not yet been translated into consistent application of clearance criteria among evaluators. Judgments varied systematically by country, with Brazil and Thailand consistently eliciting higher numbers of ‘yes’ responses to the question concerning relevant information for a personnel security clearance. These effects align with expectations from Social Identity Theory, suggesting that cues of cultural or geographic distance may increase perceptions of out-group membership and, in turn, heighten security concerns. These findings also echo broader patterns documented in field experiments on ethnic discrimination in hiring (Lancee, Reference Lancee2021; Veijt and Thijsen, Reference Veijt and Thijsen2021). Our findings on between-judge variability in evaluations, especially in response to culturally distant profiles, align with the idea of ‘perceived otherness’ being amplified even when formal qualifications are equivalent. The qualitative nuances in judgment variability suggest that similar socio-cognitive dynamics may underlie both hiring discrimination and security vetting assessments.
4.2. RQ3 and RQ4: Downstream judgments and cultural distance
Importantly, the binary judgment of relevance (D) had a clear downstream impact on the continuous loyalty ratings, indicating that once a connection was judged relevant (i.e., potentially problematic), participants tended to rate the applicant’s loyalty more negatively. However, the degree of consensus about which countries warranted concern varied. While both Brazil and Thailand were associated with lower loyalty ratings compared to Sweden, the pattern of variability differed. Thailand, despite its lower ratings, elicited greater agreement among participants—suggesting it was a more cognitively salient and unambiguous out-group. In contrast, Brazil produced more disagreement between judges, indicating less shared understanding of its implications for loyalty. This pattern supports the notion that cultural distance interacts with ambiguity to modulate pattern noise (Kahneman et al., Reference Kahneman, Sibony and Sunstein2021). Figure 3 illustrates this contrast: violin plots of judge-level random slopes show greater spread in effects for Brazil than for Thailand, suggesting lower interpretive consensus in the former condition.
4.3. RQ5: Judge-level predictors and variability in interpretation
Participants also varied in their general harshness (level noise), as shown in the ranked plot of judge-specific intercepts (Figure 4). Moreover, the random slopes by country revealed how different judges responded differently to the same protocol features (pattern noise). This variability, while expected in open-ended judgment tasks, raises concerns about consistency in security vetting, particularly when judgments are based on ambiguous social or cultural cues. While participants demonstrated relatively consistent down-rating for some countries (e.g., Thailand), other contexts (e.g., Brazil) exposed interpretive fragmentation. This highlights the role of not just individual bias, but also the absence of stable institutional norms for interpreting foreign ties. Exploratory models, including participant profession, indicated that protective security specialists were more likely to judge information as relevant to clearance than recruitment specialists, but did not differ systematically in loyalty ratings or rating consistency.
4.4. Bias: Systematic patterns in judgement
We observed consistent directional effects in how judges responded to specific cultural and relational cues. For instance, judgments tended to be more negative toward applicants associated with Brazil and Thailand. From a decision science perspective, such effects reflect bias—predictable deviations from normative judgment—rather than random inconsistency. These biases may stem from deeply ingrained associations or stereotypes, but may also derive from implicit institutional norms regarding loyalty.
The Industry Security Manual (Defence Materiel Administration, 2022) helps contextualize these findings. In Swedish security vetting practice, loyalty is increasingly framed as a matter of risk calibration. The state’s trust in an individual hinges not solely on formal connections but on interpretive frames shaped by institutional logics. These logics are not culturally neutral; what appears as bias in our models may be rooted in culturally situated constructions of loyalty, security, and risk.
4.5. Noise: Between-judge variability and conceptions of loyalty
Even after adjusting for stimulus features, we found substantial between-judge variability, that is, noise in the evaluation of identical applicant profiles. These findings suggest that reducing noise in personnel security decisions may require more than procedural standardization; it may also demand deeper conceptual alignment among evaluators. In other words, before standardizing how evaluations are conducted, the object of judgment itself—what constitutes loyalty—must be clarified.
Our modelling approach, which separately estimated random slopes and item-specific thresholds, allowed us to distinguish between:

- general tendencies (e.g., judges systematically harsher or more lenient overall), and
- context-specific divergences (e.g., differing interpretations of the same cultural cues).
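These two components can be illustrated with a minimal generative sketch. All numbers below are hypothetical and chosen for illustration only; this is a simplified simulation, not the study's Bayesian model. Each judge carries an overall severity offset (a random intercept, producing level noise) and a judge-specific reaction to the same country cue (a random slope, producing pattern noise).

```python
import numpy as np

rng = np.random.default_rng(42)

n_judges, n_cases = 50, 8
baseline = 5.0          # average loyalty rating (7-point scale)
country_effect = -0.8   # average penalty for a foreign connection

# Level noise: judges differ in overall harshness (random intercepts)
judge_intercepts = rng.normal(0.0, 0.6, size=n_judges)
# Pattern noise: judges differ in how strongly they react to the cue
# (random slopes); a larger SD here means less interpretive consensus
judge_slopes = rng.normal(country_effect, 0.5, size=n_judges)

has_foreign_tie = rng.integers(0, 2, size=(n_judges, n_cases))
ratings = (baseline
           + judge_intercepts[:, None]
           + judge_slopes[:, None] * has_foreign_tie
           + rng.normal(0.0, 0.4, size=(n_judges, n_cases)))

# Dispersion of judge-specific slopes: the quantity visualized by the
# violin plots of random slopes in Figure 3
slope_spread = judge_slopes.std(ddof=1)
```

In this sketch, widening the standard deviation of `judge_slopes` mimics the Brazil condition, where the same cue was interpreted more idiosyncratically across judges; shrinking it mimics the stronger consensus observed for Thailand.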
While Thailand consistently elicited lower loyalty ratings across judges—suggesting relatively strong consensus—Brazil showed more dispersed random slopes, indicating higher between-judge variability. We interpret this asymmetry as consistent with the idea of perceived otherness: when a cue such as a foreign connection is culturally or politically salient (as with Thailand), evaluators converge in their judgments, whereas more ambiguous cases (such as Brazil) create space for idiosyncratic interpretations, thereby amplifying noise (see Wingate et al., Reference Wingate, Rasheed, Risavy and Robie2024, for a related discussion of ambiguity and judgment variability).

At the same time, alternative explanations should be acknowledged, including the potential influence of stereotype strength or specific Swedish–country relations. Moreover, the three loyalty items demonstrated only moderate internal consistency (ordinal α = 0.68), which suggests that some of the observed between-judge dispersion may reflect variability in how different evaluators understood or emphasized the underlying construct of ‘loyalty.’ Future research could investigate these sources of noise more systematically, for instance by disentangling the roles of perceived otherness, contextual salience, and measurement limitations.

Our findings reaffirm insights from Herbig (Reference Herbig2008), who emphasized that interpretive ambiguity and the absence of standardized evaluative criteria increase both bias and noise. Although this study did not formally model participants’ conceptions of loyalty, the observed variability underscores a broader challenge: even trained professionals may reach divergent conclusions when evaluative standards are insufficiently defined. These findings have practical implications for improving consistency in security vetting.
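For readers unfamiliar with the internal-consistency statistic cited above: ordinal α applies the standard alpha formula to the polychoric rather than the Pearson correlation matrix of the items. The sketch below shows the conventional (covariance-based) Cronbach's alpha on hypothetical item scores as a simplified stand-in; it is not the study's actual computation.

```python
import numpy as np

def cronbach_alpha(items):
    """Conventional Cronbach's alpha for an n_respondents x n_items
    score matrix. Ordinal alpha, as reported in the text, follows the
    same formula but is computed from the polychoric rather than the
    Pearson correlation matrix of the items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_vars / total_var)

# Hypothetical responses to three loyalty items (7-point scale)
scores = [[5, 4, 6],
          [3, 3, 2],
          [6, 5, 6],
          [4, 4, 3],
          [2, 3, 2]]
alpha = cronbach_alpha(scores)
```

A moderate value such as the reported 0.68 indicates that the three items only partially track a single underlying construct, which is one reason item-level interpretation may vary across judges.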
Interventions such as calibration workshops for vetting officers, development of standardized loyalty assessment frameworks, and incorporation of bias-awareness modules into training could be tested empirically. Future research should evaluate whether such approaches reduce bias and noise without constraining professional judgment.
This study was conducted entirely within the Swedish protective security vetting system, which operates under a specific legal framework and organizational culture. The study is also limited by design. We deliberately recruited a homogeneous sample of Swedish security vetting officers, which kept the number of participants low but also kept extraneous sources of noise to a minimum; recruiting participants from different types of security-sensitive activities could have increased both level noise and pattern noise. The sample consisted of 58 participants (16 men, 42 women), reflecting a notable gender imbalance. While all participants completed the same set of tasks, any analyses involving gender comparisons should be interpreted with caution. Subset models were used to explore potential interaction effects between participant and applicant gender; however, given the smaller number of male participants, estimates for this group are subject to greater uncertainty. No strong claims are made about gender differences, and we report full posterior intervals to reflect this uncertainty. The imbalance mirrors known demographic trends in relevant professional settings but limits the generalizability of gender-based comparisons.
4.6. Directions for future research
Future studies can also examine how specific conceptions of loyalty shape evaluative judgments in personnel security contexts. Follow-up research could explore whether judges who articulate clearer or more internally consistent definitions of loyalty also show lower variability in their decisions, or whether different conceptualizations lead to systematically different judgments. Moreover, expanding the sample to include practicing security professionals from other security-sensitive areas than the defense industry could test the generalizability of these findings beyond the current context. Finally, qualitative approaches may be valuable for understanding how judges interpret ambiguous cues and justify their decisions, particularly when the available evidence leaves room for subjective inference. Such work could inform the development of training protocols and interpretive guidelines that reduce unwanted noise while maintaining attention to valid risks.
5. Conclusion
This study reveals that variability in loyalty judgments is not merely a function of individual opinion but reflects systematic influences of stimulus features (e.g., cultural background, gender) and judge-level characteristics (e.g., profession, conception of loyalty). Drawing on the bias and noise framework from behavioral decision science (Kahneman et al., Reference Kahneman, Sibony and Sunstein2021), our findings suggest that both bias (directional effects of stimulus characteristics) and noise (unwanted variability across judges given the same input) are inherent challenges in personnel security clearance decisions. Addressing these challenges may require not only procedural standardization, but also clearer shared definitions of the constructs under evaluation—most critically, loyalty itself.
Supplementary material
The supplementary material, which includes the complete R code for this article, can be found at: https://github.com/DanMalmgren/Security-Vetting---Study-1.
Data availability statement
The data for this article can be found together with the supplementary material.
Funding statement
This research received no specific grant from any funding agency, commercial or not-for-profit sectors.
Competing interest
The authors declare no competing interests.
Appendix A: Instructions to participants
The participants were given the following instruction before doing the experiment:
‘It is important that, in your assessment of each protocol, you take into consideration that other parts of the background investigation, such as verification of grades, references, and certificates, confirm the information stated in the protocol in the applicable fields. In other words, the background check has not revealed any discrepancies in relation to what has been stated by the examined individual. The term ‘examined individual’ refers to the person undergoing the security clearance process for potential participation in a security-sensitive activity.
You are asked to assess each document individually. Please disregard any similarities between the different documents.
Carefully read through each document. You will then be required to make several evaluations per examined individual. Press the blue button with a white arrow when you are ready to start the experiment.’
Appendix B: Seven-point response scale
The response scale was phrased in the following manner:
1. Do not agree at all
2. Do not agree
3. Partially do not agree
4. Neither agree nor disagree
5. Partially agree
6. Agree
7. Fully agree
Appendix C: Posterior vs. prior plots

Figure C1 Posterior vs. prior plot for probit regression—UK.

Figure C2 Posterior vs. prior plot for probit regression—Germany.

Figure C3 Posterior vs. prior plot for probit regression—Brazil.

Figure C4 Posterior vs. prior plot for probit regression—Thailand.

Figure C5 Posterior vs. prior plot for probit regression—gender of applicant.

Figure C6 Posterior vs. prior plot for probit regression—intercept.

Figure C7 Posterior vs. prior plot for the IRM—the United Kingdom.

Figure C8 Posterior vs. prior plot for the IRM—Germany.

Figure C9 Posterior vs. prior plot for the IRM—Brazil.

Figure C10 Posterior vs. prior plot for the IRM—Thailand.

Figure C11 Posterior vs. prior plot for the IRM—gender of applicant.

Figure C12 Posterior vs. prior plot for the IRM—dichotomous decision.
Appendix D: Example of protocol (marital connection to Brazil)

Note: Page 1 of protocol

Note: Page 2 of protocol

Note: Page 3 of protocol

Note: Page 4 of protocol

Note: Page 5 of protocol

Note: Page 6 of protocol

Note: Page 7 of protocol

Note: Page 8 of protocol

