
Causal Structural Modeling of Survey Questionnaires via a Bootstrapped Ordinal Bayesian Network Approach

Published online by Cambridge University Press:  03 January 2025

Yang Ni*
Affiliation:
Department of Statistics, Texas A&M University, College Station, TX, USA
Su Chen
Affiliation:
Center of Transforming Data to Knowledge, Rice University, Houston, TX, USA; Department of Statistics, Rice University, Houston, TX, USA
Zeya Wang
Affiliation:
Dr. Bing Zhang Department of Statistics, University of Kentucky, Lexington, KY, USA
*Corresponding author: Yang Ni; Email: yni@stat.tamu.edu

Abstract

Survey questionnaires are commonly used by psychologists and social scientists to measure latent traits of study subjects. Causal inference methods such as the potential outcome framework and structural equation models have been used to infer causal effects. However, the majority of these methods assume knowledge of the true causal structure, which is unknown in many applications in the psychological and social sciences. This calls for alternative causal approaches for analyzing such questionnaire data. Bayesian networks are a promising option because they do not require the causal structure to be known a priori but learn it objectively from data. Although there have been some recent successes in using Bayesian networks to discover causality in psychological questionnaire data, these techniques tend to suffer from causal non-identifiability with observational data. In this paper, we propose the use of a state-of-the-art Bayesian network that is proven to be fully identifiable for observational ordinal data. We develop a causal structure learning algorithm based on an asymptotically justified BIC score function, a hill-climbing search strategy, and the bootstrapping technique, which is able not only to identify a unique causal structure but also to quantify the associated uncertainty. Using simulation studies, we demonstrate the power of the proposed learning algorithm by comparing it with alternative Bayesian network methods. For illustration, we consider a dataset from a psychological study of the functional relationships among the symptoms of obsessive-compulsive disorder and depression. Without any prior knowledge, the proposed algorithm reveals some plausible causal relationships. This paper is accompanied by a user-friendly open-source R package, OrdCD, available on CRAN.
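The bootstrapping step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the OrdCD implementation: the function name `bootstrap_edge_probabilities` and the `learn_dag` callable (standing in for any structure learner, such as a BIC-scored hill-climbing search) are hypothetical and introduced only for illustration. The sketch resamples the rows with replacement, relearns a graph on each resample, and reports how often each directed edge appears.

```python
import random
from collections import Counter


def bootstrap_edge_probabilities(data, learn_dag, n_boot=100, seed=0):
    """Estimate edge inclusion probabilities by nonparametric bootstrap.

    data      : list of rows (one row per respondent).
    learn_dag : hypothetical callable mapping a dataset to an iterable of
                directed edges (parent, child); any structure learner fits.
    """
    rng = random.Random(seed)
    n = len(data)
    counts = Counter()
    for _ in range(n_boot):
        # Resample n rows with replacement.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        # Count each directed edge at most once per bootstrap replicate.
        counts.update(set(learn_dag(sample)))
    # Fraction of replicates in which each edge was selected.
    return {edge: c / n_boot for edge, c in counts.items()}


def toy_learner(sample):
    """Toy stand-in learner (an assumption, not a real method)."""
    return {(0, 1)}


probs = bootstrap_edge_probabilities([[1, 2], [3, 4], [2, 5]], toy_learner, n_boot=20)
```

With a real learner, the resulting frequencies play the role of edge inclusion probabilities, which express the uncertainty attached to each edge.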

Information

Type
Application and Case Studies
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of Psychometric Society

Figure 1 All possible three-node DAGs. The conditional independence assertion encoded by each graph is shown at the top of each DAG.

Figure 2 Conditional probability tables from two Markov equivalent BNs.

Figure 3 Simulation true (a) DAG and (b) CPDAG. The (blue) bidirected edges in (b) are edges that can be oriented in either direction in the Markov equivalence class represented by the CPDAG.

Figure 4 Simulated survey data with ten 5-point Likert scale questions. Panel (a): Sample size $n=500$ and signal strength varies from 0.25 to 2. Panel (b): Signal strength $\sigma =2$ and sample size varies from 500 to 32,000. In both panels, dotted lines indicate the irreducible error (SHD = 4) for an oracle cBN.

Figure 5 Simulated survey data with varying number of 5-point Likert scale questions $q=10,20,30,40,50$. The sample size is fixed at $n=500$ and the signal strength is fixed at $\sigma =2$. Left panel: The SHD is normalized by dividing the raw SHD by the total number of edges in a complete DAG (i.e., $\frac {q(q-1)}{2}$). Right panel: CPU time of oBN in seconds tested on a 2.9 GHz 6-Core Intel Core i9 CPU.
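The normalization described in the Figure 5 caption can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, both graphs are assumed to be fully directed DAGs given as sets of `(parent, child)` pairs, and a reversed edge is assumed to count as a single error (conventions for SHD differ on this point).

```python
def shd(true_edges, est_edges):
    """Structural Hamming distance between two DAGs (assumed convention:
    a missing, extra, or reversed edge each counts as one error)."""
    true_edges, est_edges = set(true_edges), set(est_edges)
    true_skel = {frozenset(e) for e in true_edges}
    est_skel = {frozenset(e) for e in est_edges}
    # Adjacencies present in exactly one of the two graphs.
    missing_or_extra = len(true_skel ^ est_skel)
    # Shared adjacencies whose orientation differs.
    reversed_count = sum(
        1
        for (a, b) in est_edges
        if frozenset((a, b)) in true_skel and (a, b) not in true_edges
    )
    return missing_or_extra + reversed_count


def normalized_shd(true_edges, est_edges, q):
    """Divide the raw SHD by q(q-1)/2, the edge count of a complete DAG."""
    return shd(true_edges, est_edges) / (q * (q - 1) / 2)


# Example with q = 3 nodes: one reversed, one missing, and one extra edge.
true_dag = {(0, 1), (1, 2)}
est_dag = {(1, 0), (0, 2)}
raw = shd(true_dag, est_dag)
norm = normalized_shd(true_dag, est_dag, q=3)
```

Dividing by $\frac{q(q-1)}{2}$ puts the error on a common scale, so results for different numbers of questions $q$ can be compared directly.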

Table 1 Sensitivity to the choice of link functions. The average (standard error) SHD is reported

Figure 6 BIC as a function of iteration on the OCD-Depression data.

Figure 7 Estimated OCD-Depression networks using oBN with 500 bootstrap samples. The edge width is proportional to its probability. Nodes within the box are the ten OCD-related variables.

Table 2 OCD-Depression data. A list of significant edges identified by oBN ranked by inclusion probabilities

Figure 8 Estimated OCD-Depression networks using PC+oBN. Nodes within the box are the ten OCD-related variables.

Figure 9 Estimated OCD-Depression networks using PC. The (blue) bidirected edges are edges of which the directionality is undetermined. Nodes within the box are the ten OCD-related variables.

Figure 10 Estimated OCD-Depression networks using cBN with BIC and hill-climbing. The (blue) bidirected edges are edges of which the directionality is undetermined. Nodes within the box are the ten OCD-related variables.

Figure 11 Estimated OCD-Depression networks using OSEM. The (blue) bidirected edges are edges of which the directionality is undetermined. Nodes within the box are the ten OCD-related variables.