
A Hierarchical Gaussian Process Approach to Understanding Mental Health Trajectories via Item Response Theory Models

Published online by Cambridge University Press:  27 March 2026

Zoe Gibbs McBride*
Affiliation:
Department of Statistics, Brigham Young University, Provo, USA; Department of Statistics, University of Connecticut, Storrs, USA
Xiaojing Wang
Affiliation:
Department of Statistics, University of Connecticut, Storrs, USA
Corresponding author: Zoe Gibbs McBride; Email: zmcbride@stat.byu.edu

Abstract

Longitudinal mental health assessments in mobile health (mHealth) settings are useful for monitoring subjects’ mental health statuses but are often difficult to analyze because they generally appear on an ordinal scale and at unequal time intervals. In this article, we explore the use of Gaussian processes (GPs) and hierarchical modeling techniques to understand mental health trajectories based on repeated multi-item mHealth surveys on a Likert scale. We introduce the GP model for health trajectories, which is based on item response theory. In the study of trajectories, a subject’s longitudinal collection of mHealth responses can be thought of as a single high-dimensional observation. We show how the GP is flexible enough to capture trends in individual trajectories even with the challenges associated with high-dimensional data. We also demonstrate how basis splines can be used to effectively capture nonlinear trends in the mean function of the GP. The high dimensionality and ordinal nature of the data often make sampling from the posterior distribution in a Bayesian setting too slow to be practical. We show that using a Hilbert approximation for the GP trajectories can facilitate efficient sampling. We apply these methods to a longitudinal study that monitored college students’ self-esteem.

Information

Type
Application and Case Studies - Original
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press on behalf of Psychometric Society

1 Introduction

Longitudinal mental health assessments in mobile health (mHealth) applications have been found to be a scalable and effective way to capture information on a user’s current mental health status (Pardes et al., Reference Pardes, Lynch, Miclette, McGeoch and Daly2022). Because these assessments may be performed many times over the course of months or years, researchers have the opportunity to model the trajectory of users’ mental health over time. For example, researchers may gain an understanding of changes in an mHealth user’s depressive symptoms over time by the regular completion of the Patient Health Questionnaire-4 (PHQ-4) (Löwe et al., Reference Löwe, Wahl, Rose, Spitzer, Glaesmer, Wingenfeld, Schneider and Brähler2010). While each assessment yields only a handful of item responses, the high-frequency sampling makes each individual’s longitudinal record intrinsically high-dimensional, and analysis thus focuses on recovering the underlying latent trajectory that generates the data.

Yet recovering these trajectories accurately demands proper handling of the measurement scale: most questionnaire items use Likert responses, which violate assumptions like normality and can bias standard models (Wu & Leung, Reference Wu and Leung2017). Ordinal regression addresses this by assuming a latent continuous variable that is mapped to observed categories via a link function and threshold values (McCullagh, Reference McCullagh1980).
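The threshold mechanism underlying ordinal regression can be illustrated with a minimal sketch; the cutpoint values below are arbitrary and for illustration only:

```python
import numpy as np

def to_ordinal(z, thresholds):
    """Map latent continuous values to ordinal categories 1..K.

    thresholds: sorted interior cutpoints (tau_1 < ... < tau_{K-1});
    a latent value falling between tau_{k-1} and tau_k is assigned category k.
    """
    thresholds = np.asarray(thresholds)
    # searchsorted counts how many cutpoints fall at or below each z
    return np.searchsorted(thresholds, z) + 1

# Illustrative cutpoints for K = 5 categories
tau = [-2.0, -0.5, 0.5, 2.0]
z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(to_ordinal(z, tau))  # [1 2 3 4 5]
```

A link function then determines the distribution of the latent variable, and hence the category probabilities, as in the models discussed next.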

Like ordinal regression, polytomous IRT models (Reeve & Fayers, Reference Reeve and Fayers2005) use continuous latent variables to quantify constructs from multi-category items. In particular, the graded response model (GRM) (Samejima, Reference Samejima1968) is tailored to ordered outcomes, such as Likert-scale responses. Specifically, in the typical GRM model,

(1.1) $$ \begin{align} P_{jk}(\theta) & = \begin{cases} 1 & \text{ for } k=1, \\ \text{Pr}(Y_j \geq k |\theta) & \text{ for } k=2, \dots, K_j,\\ 0 & \text{ for } k>K_j, \end{cases} \end{align} $$

where j represents a questionnaire item, k represents an ordered response, $K_j$ represents the maximum ordered response for item j, $\theta $ represents the latent construct of interest, and $P_{jk}(\cdot )$ represents an inverse link function (Samejima, Reference Samejima1968). Additional parameters may also be included in the model to adjust for item difficulty and discrimination (Hays et al., Reference Hays, Morales and Reise2000). A special case of the GRM (Cole et al., Reference Cole, Turner and Gitchel2019) includes item difficulty parameters, $\delta _j$ , item discrimination parameters, $a_j$ , and shared threshold parameters, $\tau _{k}$ , with a logistic link such that

(1.2) $$ \begin{align} P_{jk}(\theta) & = \begin{cases} 1 & \text{ for } k=1, \\ \frac{1}{1+\text{exp}(-[a_j(\theta-\delta_{j})-\tau_{k-1}])} & \text { for } k=2, \dots, K_j,\\ 0 & \text{ for } k>K_j. \end{cases} \end{align} $$

This model assumes that the thresholds $\tau _k$ are constant across items, that is, that the same response scale applies to every item, an assumption with precedent in psychometric research (Lubbe & Schuster, Reference Lubbe and Schuster2019).
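As a concrete illustration of Equation (1.2), the following sketch computes the cumulative probabilities $P_{jk}(\theta)$ for one item; the parameter values are illustrative, not estimates:

```python
import numpy as np

def grm_cumulative(theta, a_j, delta_j, tau, K_j):
    """P_{jk}(theta) = Pr(Y_j >= k) under the shared-threshold GRM of Eq. (1.2).

    tau: shared thresholds (tau_1, ..., tau_{K-1}); tau[k - 2] is tau_{k-1}.
    Returns the vector (P_{j1}, ..., P_{jK_j}).
    """
    probs = [1.0]  # k = 1 equals 1 by definition
    for k in range(2, K_j + 1):
        eta = a_j * (theta - delta_j) - tau[k - 2]
        probs.append(1.0 / (1.0 + np.exp(-eta)))
    return np.array(probs)

# Illustrative parameter values only
p = grm_cumulative(theta=0.5, a_j=1.2, delta_j=0.0, tau=[-2, -0.5, 0.5, 2], K_j=5)
print(np.round(p, 3))  # decreases monotonically from 1
```

Because the thresholds are ordered, the cumulative probabilities decrease in $k$, which guarantees nonnegative category probabilities when adjacent values are differenced.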

IRT models may be extended for use with longitudinal data by allowing $\theta $ to vary within subjects over time (Choi, Reference Choi2024; Proust-Lima et al., Reference Proust-Lima, Philipps, Perrot, Blanchin and Sébille2022; Wang & Nydick, Reference Wang and Nydick2020). For example, Liu and Hedeker (Reference Liu and Hedeker2006) used IRT and mixed-effects modeling techniques to model latent subject ability over time based on multi-item Likert-scale questionnaires. Random intercepts and slopes were included to account for variation between subjects and over time. While simple parametric approaches for analyzing longitudinal data, such as random time slopes, are useful in assessing change in latent continuous variables over time, they often lack the flexibility to capture nonlinear trends in the data or are misspecified (Wang et al., Reference Wang, Chiou and Müller2016). Thus, it is often preferred to use nonparametric methods to model trajectories as a function of time, that is, to perform functional data analysis (FDA) (Liu & Wang, Reference Liu and Wang2020; Wang et al., Reference Wang, Chiou and Müller2016).

Gaussian processes (GPs) are stochastic processes such that every finite collection of random variables has a multivariate Gaussian distribution (Liu et al., Reference Liu, Wang and Kong2019). GPs are specified by a covariance matrix that accounts for correlation between points and controls for smoothness, making them an attractive option for FDA (Shi & Choi, Reference Shi and Choi2011). Because of their correlation structure, they are useful for modeling trajectories in a variety of settings (Chen & Zhang, Reference Chen and Zhang2020; Kim et al., Reference Kim, Lee and Essa2011).

There is precedent for using GP regression in both ordinal-data and IRT settings. For example, Komori et al. (Reference Komori, Teraji, Shiroshita and Nittono2022) used Bayesian ordinal GP regression (GPR) to determine which facial features in children are perceived as cute, where cuteness was measured on an integer scale from 1 to 5. Others have explored the use of power link functions in ordinal GPR (Li et al., Reference Li, Wang and Dey2019). Chen and Zhang (Reference Chen and Zhang2020) used a latent GP model for longitudinal survey data in psychology applications, remarking how their approach is similar to IRT models. While their paper provides a foundation for using GPs with longitudinal surveys, they demonstrated the ordinal component of their model only in a simple simulation setting, without the complexity of real-world data that may complicate computation and parameter estimation. Kang and Kottas (Reference Kang and Kottas2024) showed how GPs can be used to understand mental health trajectories derived from ecological momentary assessments (EMAs). However, their method requires analyzing different items separately. It also uses the continuation-ratio logits representation of the multinomial distribution, resulting in a different latent curve for each consecutive category combination. The interpretation of these curves is less clear than that of a single curve representing the latent construct of interest.

GPs are frequently used in a Bayesian setting, where model fitting is usually performed with Markov chain Monte Carlo (MCMC) techniques. However, sampling from the posterior distribution of a GP can be challenging, especially in the often high-dimensional setting of a longitudinal analysis (Riutort-Mayol et al., Reference Riutort-Mayol, Bürkner, Andersen, Solin and Vehtari2023). Many researchers, including Chen and Zhang (Reference Chen and Zhang2020), choose to use empirical Bayes to estimate these parameters. Nevertheless, a fully Bayesian approach to fitting a GP model may be desired.

In addition to modeling individual-level trends using a GP, researchers may also be interested in population-level trajectories within an IRT model. This may be done by modeling the mean function of a GP in a hierarchical setting using parametric or nonparametric approaches (Chen & Zhang, Reference Chen and Zhang2020). While Chen and Zhang (Reference Chen and Zhang2020) mentioned the possibility of using nonparametric approaches for estimating the mean function of the GP prior placed on individual trajectories, in practice, they modeled the mean function of the GP using only simple linear regression, reducing the model’s flexibility in recovering overall trends.

Smoothing splines, which balance model fit and smoothness, are a popular FDA technique that can be useful for modeling the mean of a GP (Chen & Zhang, Reference Chen and Zhang2020; Rice & Rosenblatt, Reference Rice and Rosenblatt1983; Shi et al., Reference Shi, Wang, Murray-Smith and Titterington2007). Smoothing splines can generally be described as a set of piecewise polynomial equations joined together at points called knots (Wang, Reference Wang2011). B-splines are a type of spline that have been shown to have excellent numerical properties, are sparse (desirable in high-dimensional settings), and are flexible in the presence of equally spaced knots (De Boor, Reference De Boor1978; Eilers & Marx, Reference Eilers and Marx2010). Specifically, define q knots $\boldsymbol {\xi }=\{\xi _1\leq \xi _2 \leq \cdots \leq \xi _q\}$ . Then the $j^{th}$ B-spline of degree p evaluated at point x can be defined recursively as (Lyche et al., Reference Lyche, Manni and Speleers2018)

(1.3) $$ \begin{align} B_{j,p, \boldsymbol{\xi}}(x) := \frac{x-\xi_j}{\xi_{j+p}-\xi_j}B_{j,p-1, \boldsymbol{\xi}}(x)+\frac{\xi_{j+p+1}-x}{\xi_{j+p+1}-\xi_{j+1}}B_{j+1, p-1, \boldsymbol{\xi}}(x), \end{align} $$

with

(1.4) $$ \begin{align} B_{j,0,\boldsymbol{\xi}}(x):=\begin{cases} 1 & \text{if } x\in [\xi_j, \xi_{j+1}),\\ 0 & \text{otherwise}. \end{cases} \end{align} $$
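The recursion in Equations (1.3) and (1.4) can be sketched directly; the knot vector below is arbitrary, and 0/0 terms in the recursion are treated as 0, as is conventional:

```python
import numpy as np

def bspline_basis(j, p, knots, x):
    """Evaluate B_{j,p}(x) via the recursion in Eqs. (1.3)-(1.4).

    knots: nondecreasing knot vector (0-indexed here).
    """
    if p == 0:
        return 1.0 if knots[j] <= x < knots[j + 1] else 0.0
    left, right = 0.0, 0.0
    d1 = knots[j + p] - knots[j]
    if d1 > 0:  # skip 0/0 terms
        left = (x - knots[j]) / d1 * bspline_basis(j, p - 1, knots, x)
    d2 = knots[j + p + 1] - knots[j + 1]
    if d2 > 0:
        right = (knots[j + p + 1] - x) / d2 * bspline_basis(j + 1, p - 1, knots, x)
    return left + right

# Cubic (p = 3) basis on equally spaced knots; interior values sum to 1
knots = np.arange(0.0, 11.0)  # xi = 0, 1, ..., 10
x = 5.3
vals = [bspline_basis(j, 3, knots, x) for j in range(len(knots) - 4)]
print(round(sum(vals), 6))  # 1.0 (partition of unity on the interior)
```

The partition-of-unity property printed above is one of the numerical advantages mentioned in the text.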

Smooth curves may be estimated as a weighted sum of basis functions evaluated at points along the domain. In terms of regression, these weights may be thought of as coefficients that need to be estimated. Shi et al. (Reference Shi, Wang, Murray-Smith and Titterington2007) showed that B-spline basis functions are an excellent choice for GP functional regression, where it is computationally efficient to estimate the coefficients associated with the basis functions. B-splines also have local support, which allows one part of a curve fit to be adjusted without causing undue adjustments to other parts of the curve (De Boor, Reference De Boor1978). In practice, the degree $p=3$ is commonly chosen and is the default in many standard statistical packages, such as the splines package in R (R Core Team, 2022). A drawback of splines is that they may be highly sensitive to the locations of the knots, which can be difficult to choose or estimate (Yang et al., Reference Yang, Zhang and Zhang2023). Cross-validation and other model fit evaluation techniques may be used to make these decisions.

There are many challenges associated with GP ordinal regression. For example, without additional constraints, the ordinal regression problem is scale invariant. A common solution is to fix the variance associated with the GP and to model the correlation matrix. However, modeling correlation matrices is much more challenging than covariance matrices (DeYoreo & Kottas, Reference DeYoreo and Kottas2018). The ordering constraint in the threshold values, $\boldsymbol {\tau }$ , also results in highly correlated posterior draws. DeYoreo and Kottas (Reference DeYoreo and Kottas2018) used mixture modeling, which allows the threshold values to be fixed, reducing computational burdens. However, the interpretation of this mixture modeling in an IRT setting is not clear.

One option to improve computational efficiency is the use of low-rank approximations of GPs (Reeve & Fayers, Reference Reeve and Fayers2005). Solin and Särkkä (Reference Solin and Särkkä2020) proposed the use of Hilbert space methods for reduced-rank GP regression. Riutort-Mayol et al. (Reference Riutort-Mayol, Bürkner, Andersen, Solin and Vehtari2023) explored the practical use of this rank reduction technique and demonstrated that Hilbert basis functions can be used to adequately approximate GPs in practical settings when the number of basis functions is sufficiently large and an appropriate boundary factor is assumed. More specific details on the computational challenges associated with a GP IRT model will be discussed in Section 4.2 after the full model is specified for clarity. We will also provide more detail on the Hilbert approximation and how it improves computational speed and posterior draw mixing in Section 4.3.

This article contributes to the current literature by demonstrating how mental health trajectories can be modeled as a GP within a hierarchical IRT model using Hilbert approximations for computational efficiency. Specifically, we use a GRM with equal thresholds for each item and place a GP prior on the latent construct of interest for each individual over time. We also capture population-level trends over time via smoothing splines. We explore this modeling approach in the context of mHealth survey data on a Likert scale collected in an educational setting. Because mHealth data are high-dimensional, we approximate the posterior distribution of the latent mental health trajectory GP using Hilbert basis functions to facilitate computational efficiency while still obtaining reasonable parameter uncertainty estimates. The efficacy of this approach is demonstrated via simulation. We also discuss model inference in the context of a motivating example. While in this article we present the methodology under a GRM assumption, we emphasize that the methods can easily be extended to other IRT settings where ordinal data meant to capture the trajectory of a unidimensional latent construct are collected frequently over time.

The rest of the article is outlined as follows. In Section 2, we introduce our motivating example. In Section 3, we propose the ordinal GP model for health trajectories (GPMHT). In Section 4, we outline our methods for Bayesian inference, including the prior specification, challenges associated with sampling from the joint posterior, and the Hilbert space approximation of the joint posterior. In Section 5, we discuss our simulation study to show that our Bayesian sampling techniques converge adequately and recover the unknown parameters and trajectories. In Section 6, we present the results of our motivating example analysis and compare our model to a model fit without a GP prior. Finally, we offer a few conclusions in Section 7.

2 Motivating example: College Experience study

Researchers at Dartmouth College have been using mHealth applications to study student health and behavior for many years and have made their data publicly available (Wang et al., Reference Wang, Chen, Chen, Li, Harari, Tignor, Zhou, Ben-Zeev and Campbell2017). The most recent study, known as the College Experience Study, follows over 200 students who enrolled as freshmen in 2017 or 2018 over the course of their four years at Dartmouth via the StudentLife app (Nepal et al., Reference Nepal, Liu, Pillai, Wang, Vojdanovski, Huckins, Rogers, Meyer and Campbell2024). We focus on the students’ freshman year for reasons discussed in Appendix A.

The College Experience data comprise sensing data and EMAs intended to capture students’ behavior and mental health. We wish to develop a model that can be used to understand student mental health trajectories based on EMA responses. Included in the EMA data are responses to the photographic affect meter (PAM) (Pollak, Reference Pollak2012), PHQ-4, and four questions from the State Self-Esteem Scale (SSES) (Heatherton & Polivy, Reference Heatherton and Polivy1991) to measure mood, depression, and self-esteem, respectively. Each of these measures was requested approximately weekly from students via the app, although response rates vary. Other information on the students was collected, such as location, audio, and movement. However, due to a large amount of missing data, we focus on developing a flexible modeling framework for the EMAs without covariates.

We chose to model the SSES data because they represent multi-item surveys where responses varied over time. To avoid overfitting, we only included the 191 students who completed the SSES at least 30 times over the course of the three terms. The SSES is generally issued as a 20-item questionnaire to assess performance, social, and appearance self-esteem. Responses are on a Likert scale with the following options: 1) not at all, 2) a little bit, 3) somewhat, 4) very much, and 5) extremely. In the College Experience study, only four items were administered: 1) Right now, I worry about what other people think of me, 2) Right now, I am pleased with my appearance, 3) Right now, I feel as smart as others, and 4) Right now, overall, I feel good about myself. Note that item one is reverse coded. However, it is assumed for the rest of the article that it has been recoded so that a 5 represents high self-esteem and a 1 represents low self-esteem.

Importantly, the GRM assumes that there is a unidimensional latent construct that is measured by each item on a questionnaire (Andrich, Reference Andrich1978), while the SSES was designed to measure multiple aspects of a person’s self-esteem. However, studies have found that while the SSES captures multiple aspects of self-esteem, the composite score can be used as an effective measure of general self-esteem consistent with other rating scales (Linton & Richard, Reference Linton and Richard1996; Webster et al., Reference Webster, Howell and Shepperd2022).

The concept of an overall self-esteem measure is supported by our exploratory data analysis. Figure 1 contains the average student response to the SSES items by day. We see that although the questions vary in “difficulty” (responses tend to be highest for item 1 and lowest for item 4), the trend in responses appears to be consistent across all items. Thus, we felt comfortable using the four SSES items to measure an overall self-esteem construct.

Figure 1 Average SSES responses by day and question over the course of the students’ freshman year.

While the trend in self-esteem appears clear in Figure 1, the trend is less obvious in the patterns of individual students. Figure 2 shows the responses of a random subset of eight students over the three terms. Due to the discrete nature of the scale, it is difficult to ascertain common patterns in self-esteem from these raw values. We show in this article how we can obtain smooth individual-level self-esteem trajectories for students that help us better picture trends in their mental health.

Figure 2 SSES responses by day, item, and student over the course of the students’ freshman year.

In general, we found that item 1 responses tend to be higher than the responses for items 2–4. We also found evidence that item discrimination may be lower for item 1 than the other items, as indicated by the decreased variation in responses. Please refer to a histogram of the data in Appendix B for additional support of these assertions (Figure B1).

The goals of the analysis of the College Experience study are as follows: 1) to understand how the self-esteem of all freshmen at Dartmouth changes over the course of their first year on campus and 2) to quantify and visualize how individual students’ self-esteem trajectories differ from those of the average Dartmouth freshman. Understanding general trends in self-esteem can guide the university in offering interventions to the general student body. By studying trends in individual students, officials may be able to identify students whose trajectories deviate from the average in concerning ways and who might need extra support.

3 The proposed Gaussian process model for health trajectories

In this section, we introduce the GPMHT, describe it in the context of our motivating example, and discuss adjustments for other settings.

3.1 GPMHT

Let $Y_{i,j,t}$ represent subject i’s response to item j of a Likert-scale questionnaire on day t. We assume that all responses can take on integer values $1, \dots , K,$ where $K>2$ . Let $\delta _j$ represent the “difficulty” of item j for $j=1, \dots , J$ and let $\tau _0<\tau _1 < \tau _2 < \dots < \tau _{K-1}<\tau _K$ , where $\tau _0=-\infty $ and $\tau _K=\infty $ , represent threshold values that map values from the real numbers to the ordinal Likert scale. Let $\theta _{i,t}$ represent the latent construct of interest, self-esteem in our motivating example, for subject i on day t. Then, we define

(3.1) $$ \begin{align} Pr(Y_{i,j, t} =k) &= \begin{cases} 1- \text{logit}^{-1}(\phi_{i,j,t, 1}) & \text{if } k=1,\\ \text{logit}^{-1}(\phi_{i,j,t,k-1})-\text{logit}^{-1}(\phi_{i,j,t,k}) & \text{if } 1<k<K,\\ \text{logit}^{-1}(\phi_{i,j,t,k-1}) & \text{if } k=K,\\ \end{cases} \end{align} $$
(3.2) $$ \begin{align} \phi_{i,j,t,k} & = a_j(\theta_{i,t}-\delta_j) -\tau_k, \end{align} $$

where $\text {logit}^{-1}(\phi )=1/(1+e^{-\phi })$ . This is a formulation of the GRM described in Equation (1.2). Thus, we assume that the items of the SSES measure a unidimensional construct, that the difference between two consecutive points on the scale is equal for all items, that higher ratings represent higher values of the latent construct, and that responses to items administered at the same time are conditionally independent of each other given the latent trait. We chose the logit link to relax the assumption of an underlying latent normal distribution implied by a probit link.
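The category probabilities in Equations (3.1) and (3.2) can be sketched as follows; the parameter values are illustrative only, and the telescoping structure guarantees the probabilities sum to one:

```python
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

def category_probs(theta, a_j, delta_j, tau):
    """Category probabilities Pr(Y = k), k = 1..K, from Eqs. (3.1)-(3.2).

    tau: interior thresholds (tau_1, ..., tau_{K-1}), assumed ordered.
    """
    K = len(tau) + 1
    phi = a_j * (theta - delta_j) - np.asarray(tau)  # phi_{i,j,t,k}, k = 1..K-1
    cum = inv_logit(phi)                             # cumulative terms
    probs = np.empty(K)
    probs[0] = 1.0 - cum[0]
    probs[1:K - 1] = cum[:-1] - cum[1:]              # adjacent differences
    probs[K - 1] = cum[-1]
    return probs

# Illustrative parameter values only
p = category_probs(theta=0.8, a_j=1.0, delta_j=0.2, tau=[-2, -0.5, 0.5, 2])
print(np.round(p, 3), round(float(p.sum()), 6))
```

Because the thresholds are ordered, the cumulative terms decrease in $k$ and every category probability is nonnegative.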

We now introduce a hierarchical component to the model to understand the trajectory of the latent construct $\theta _{i,t}$ . Assume that we have observations (item responses) recorded for subject i at $T_i$ unique time points. Let $\mathbf {t}_i=(t_{i,1}, \dots , t_{i, T_i})'$ represent a $T_i \times 1$ vector of observed days/time points for subject i. We consider $\boldsymbol {\theta }_i=(\theta _{i,t_{i,1}}, \dots , \theta _{i,t_{i,T_i}})'$ to be realizations from a function of $\mathbf {t}_i$ such that $\boldsymbol {\theta }_i=f(\mathbf {t}_i)$ . Let $t_{i,r}$ and $t_{i,s}$ represent the $r^{th}$ and $s^{th}$ elements of $\mathbf {t}_i$ , respectively. Then we assume that this function follows a GP prior (Shi & Choi, Reference Shi and Choi2011) with a squared exponential kernel such that

(3.3) $$ \begin{align} \boldsymbol{\theta}_i&=f(\mathbf{t}_i) \sim GP(\boldsymbol{\mu}(\mathbf{t}_i), \boldsymbol{\psi}(\mathbf{t}_i, \mathbf{t}_i')), \end{align} $$
(3.4) $$ \begin{align} \psi(t_{i,r}, t_{i,s})& = \exp\left\{-\frac{(t_{i,r} - t_{i,s})^2}{2\ell^2}\right\}, \end{align} $$

where $\boldsymbol {\mu }(\mathbf {t}_i)$ is a $T_i \times 1$ vector representing the mean function and $\boldsymbol {\psi }$ is a $T_i \times T_i$ matrix where the element in the $r^{th}$ row and $s^{th}$ column equals $\psi (t_{i,r}, t_{i,s})$ . Further, $\ell $ represents the lengthscale of the kernel, determining the smoothness of the GP. We chose to use the squared exponential kernel because it is a simple kernel function that depends only on the distance between points (regardless of direction) while ensuring smoothness. Other kernels could be explored if desired. Specifically, a Matérn kernel ( $\nu =3/2$ or $5/2$ ) could be used for rougher sample paths, a rational quadratic kernel for multi-scale variation, or periodic components if seasonality is desired. Given the ordinal likelihood and our sampling resolution, differences among these kernels primarily affect the high-frequency roughness of $g_i(\mathbf {t}_i)=\boldsymbol {\theta }_i-\mu (\mathbf {t}_i)$ , which is weakly identified relative to the low-frequency mean $\mu (\mathbf {t}_i)$ . For this reason, and to keep the manuscript focused, we retain the SE kernel in the main analysis and note kernel exploration as a natural extension for future work.
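A minimal sketch of Equations (3.3) and (3.4), drawing one zero-mean latent trajectory from the unit-variance squared exponential kernel (the day grid and lengthscale below are hypothetical, not estimates from the analysis):

```python
import numpy as np

def se_kernel(t, lengthscale):
    """Unit-variance squared exponential kernel, Eq. (3.4)."""
    d = t[:, None] - t[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
t = np.linspace(0, 270, 40)          # hypothetical observation days
K = se_kernel(t, lengthscale=30.0)
# Draw one latent trajectory theta_i ~ GP(0, psi); jitter aids stability
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(t)))
theta_i = L @ rng.standard_normal(len(t))
print(theta_i.shape)  # (40,)
```

A larger lengthscale yields smoother draws; in the full model the zero mean is replaced by the spline mean function described next.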

While $\boldsymbol {\theta }_i$ represents smooth individual trajectories at the individual level, we also want to collect information on the average trajectory at the population level. We do this using cubic B-splines with D degrees of freedom (this corresponds to D-3 interior knots). Specifically, we set

(3.5) $$ \begin{align} \mu(t)=\boldsymbol{B}_{\mu}(t)'\boldsymbol{\beta}, \end{align} $$

where $\boldsymbol {B}_{\mu }(t)$ corresponds to a $D \times 1$ vector of B-spline basis functions evaluated at t, and $\boldsymbol {\beta }$ is a $D\times 1$ vector of coefficients. We place the knots equally across the days of the study. A higher value of D corresponds to a more flexible curve but is also prone to overfitting if the curve becomes too wiggly. In practice, we chose the value of D by comparing cross-validation results under different values, as will be discussed in Section 6.2.
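The mean-function construction in Equation (3.5) can be sketched as follows, assuming SciPy (≥ 1.8, for `BSpline.design_matrix`); `D`, the day range, and `beta` are illustrative, not values from the analysis. Note one convention caveat: with a clamped knot vector, D basis functions require D − 4 interior knots, whereas R’s `bs()` with the intercept dropped uses D − 3.

```python
import numpy as np
from scipy.interpolate import BSpline

# mu(t) = B(t)' beta with cubic B-splines and equally spaced knots
D, k = 8, 3                                           # D basis functions, degree 3
t0, t1 = 0.0, 270.0                                   # hypothetical study days
interior = np.linspace(t0, t1, D - k + 1)[1:-1]       # D - 4 interior knots
knots = np.r_[[t0] * (k + 1), interior, [t1] * (k + 1)]
days = np.arange(t0, t1)                              # evaluation grid
B = BSpline.design_matrix(days, knots, k).toarray()   # (270, D) design matrix
beta = np.linspace(-0.5, 0.5, D)                      # hypothetical coefficients
mu = B @ beta                                         # population mean trajectory
print(B.shape, mu.shape)
```

Each row of `B` sums to one (partition of unity), so `mu` is a locally weighted average of the coefficients, which is what gives the fitted mean its local-support behavior.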

There are several constraints required to ensure identifiability of the GRM in Equation (3.2). Because $Y_{i,j,t}$ depends on $\theta _{i,t}$ , the item parameters, and the cutpoints only through $a_j(\theta _{i,t}-\delta _j)-\tau _k$ , the likelihood is invariant to location and scale transformations of the latent scale. We therefore adopt the following conventions. First, to fix the location, we omit an intercept from $\mu (t)$ . With $\{\tau _k\}$ ordered, this determines the decomposition between latent traits, item locations, and step offsets. Second, to fix the scale, we take the GP prior in Equation (3.3) to have unit marginal variance.

While we have formulated this model as a GRM with equal thresholds, Equations (3.1) and (3.2) can be adjusted for other settings. Specifically, a different link function may be used in Equation (3.1) while still maintaining the ordinal nature of the data. In Equation (3.2), one may wish to assume different scales for each item by incorporating item-specific thresholds $\tau _{j,k}$ . These adjustments may be easily made, although care should be taken to ensure parameter identifiability.

4 Bayesian inference on the posterior

Because of the complexity of the model outlined in Section 3.1, we turned to Bayesian methods for inference. In this section, we first introduce our prior specification. We then explain the challenges associated with sampling from the joint posterior and describe the Hilbert space approximation as a potential solution.

4.1 Prior specification

We selected fairly noninformative priors for our unknown parameters, reflecting our lack of prior information. Specifically, we assumed that $\ell \sim \mathcal {LN}(0,0.5)$ , where $\mathcal {LN}$ represents the log-normal distribution. We assumed that $\tau _1, \dots , \tau _{K-1}$ each were normally distributed with mean zero and variance four under the constraint that $\tau _1<\tau _2< \dots <\tau _{K-1}$ . We then assumed $a_j^* \sim \mathcal {LN}(0, 0.5)$ such that $a_j=a^*_j/(1/J\sum _{j=1}^Ja^*_j)$ . Next, we let $\delta ^*_j \sim \mathcal {N}(0, 1)$ and $\delta _j=\delta ^*_j-1/J\sum _{j=1}^J\delta _j^*$ , where $\mathcal {N}$ represents the normal distribution. Note that the sum-to-zero constraint on $\delta _j$ and the mean-one constraint on $a_j$ are not strictly necessary for identifiability. However, in our setting, we found them helpful to speed convergence, with minimal impact on the interpretation of the latent long-term trends. Since we only have four items for analysis, identifying a complex latent structure over time may be difficult (Akour & AL-Omari, Reference Akour and AL-Omari2024). Other authors have discussed this issue, including Fox (Reference Fox2010), who notes that it can be challenging to identify a complex IRT model by fixing parameters of a prior, and these additional constraints may be helpful. In practice, the combination of weakly informative priors and constraints mainly aids convergence when relatively few items or observations are available. As more intensive mHealth data accumulate, the likelihood will dominate the weakly informative priors, and fewer constraints may be needed. Finally, we assumed $\beta _d \sim \mathcal {N}(0, \sigma ^2)$ for $d=1, \dots , D$ . We then set $\sigma \sim \mathcal {C}^+(0,1),$ where $\mathcal {C}^+$ represents the half-Cauchy distribution. In this way, we established a random effects interpretation for fitting the splines, and the spline coefficients are centered at zero.
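The constraint transformations on the item parameters can be sketched directly; the raw draws below simply illustrate the centering and normalization, not the actual fitted values:

```python
import numpy as np

rng = np.random.default_rng(1)
J = 4  # four SSES items
# Raw draws from the priors described above (illustrative only)
a_star = rng.lognormal(mean=0.0, sigma=0.5, size=J)
delta_star = rng.normal(loc=0.0, scale=1.0, size=J)
# Constraints: discriminations average to one, difficulties sum to zero
a = a_star / a_star.mean()
delta = delta_star - delta_star.mean()
print(round(float(a.mean()), 6))  # 1.0
```

Applying the transformations inside the sampler, rather than constraining the raw parameters, keeps the priors simple while anchoring the latent scale.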
One can interpret this prior as a Bayesian equivalent to ridge regression, so that the spline coefficients are drawn toward zero (James et al., Reference James, Witten, Hastie and Tibshirani2013). A discussion of the full likelihood and joint posterior distribution can be found in Appendices C and D, respectively.

4.2 Challenges with sampling from full joint posterior distribution

Because of the complexity of the joint posterior distribution induced by the GPMHT likelihood and its prior specification, its closed form is not readily available. Thus, we have to rely on MCMC techniques to sample from the posterior, which is a computationally expensive task. Note that MCMC methods sampling from the exact joint posterior require inverting the $n\times n$ covariance matrix of the latent GP at each update, an $\mathcal {O}(n^3)$ operation (Riutort-Mayol et al., Reference Riutort-Mayol, Bürkner, Andersen, Solin and Vehtari2023). In intensive longitudinal designs, where n grows with both the number of participants and repeated measurements, performing these inversions at every iteration becomes computationally prohibitive (Gu et al., Reference Gu, Wang and Berger2018).

Even aside from the $\mathcal {O}(n^3)$ cost of repeatedly inverting GP covariance matrices, Gibbs sampling (Gelfand, Reference Gelfand2000)—though appealing when full conditionals are available in closed form—introduces additional difficulties. Recall that

(4.1) $$ \begin{align} Y_{i,j,t}=k \ \text{if}\ \tau_{k-1}\le \phi^*_{i,j,t}\le \tau_k, \end{align} $$

where $\phi ^*_{i,j,t}=a_j(\theta _{i,t}-\delta _j)+\epsilon _{i,j,t}$ , $\epsilon _{i,j,t}\sim \mathrm {Logistic}(0,1)$ and refer to Appendix E for details. Under this threshold formulation, the full conditional of $\phi ^*_{i,j,t}$ is truncated into the region $[\tau _{k-1},\tau _k]$ when $Y_{i,j,t}=k$ . When $a_j(\theta _{i,t}-\delta _j)$ lies far from this interval, draws would concentrate in extreme tails, causing numerical instability and poor mixing. In addition, the ordered cut-off points $\tau _1<\cdots <\tau _{K-1}$ mix slowly, since their moves must respect both the order constraint and the current values of $\boldsymbol {\phi }^{*}=\{\phi _{i,j,t}^*\}$ and $\mathbf {Y}=\{Y_{i,j,t}\}$ . These couplings among $\theta _{i,t}$ , $\delta _j, a_j$ , and $\boldsymbol {\tau }$ are amplified further in high-dimensional longitudinal data, yielding highly autocorrelated chains even with long runs.
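To make the tail-concentration issue concrete, here is a sketch of inverse-CDF sampling from the truncated logistic full conditional implied by Equation (4.1); the location and interval values are illustrative:

```python
import numpy as np

def sample_truncated_logistic(loc, lower, upper, rng):
    """Draw phi* ~ Logistic(loc, 1) truncated to [lower, upper] via inverse CDF.

    Mirrors the full conditional under Eq. (4.1): given Y = k, phi* is
    confined to [tau_{k-1}, tau_k]. When loc lies far from this interval,
    cdf(upper) - cdf(lower) is tiny, the very instability described above.
    """
    cdf = lambda x: 1.0 / (1.0 + np.exp(-(x - loc)))
    lo, hi = cdf(lower), cdf(upper)
    u = rng.uniform(lo, hi)
    return loc + np.log(u / (1.0 - u))  # inverse logistic CDF

rng = np.random.default_rng(2)
draw = sample_truncated_logistic(loc=0.3, lower=-0.5, upper=0.5, rng=rng)
print(-0.5 <= draw <= 0.5)  # True
```

For a distant location such as `loc=10`, `hi - lo` underflows toward zero, illustrating why these Gibbs updates become numerically fragile.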

We first attempted to fit the GP ordinal regression model using an MCMC sampler with a probit link for computational simplicity. In applied analyses, the strong posterior dependence between the cut-off points and the latent trajectories led to poor mixing and inadequate exploration of the parameter space; even for long runs, standard convergence diagnostics did not indicate convergence. Increasing the number of iterations was impractical due to the cubic cost of the repeated GP covariance matrix inversions required at each iteration.

We further explored methods outside of Gibbs sampling, including the use of Rstan or Stan (Stan Development Team, 2024). Stan performs MCMC using the No U-Turn Sampler (NUTS) version of Hamiltonian Monte Carlo to facilitate efficient sampling from the posterior (Hoffman & Gelman, Reference Hoffman and Gelman2014). While NUTS is efficient in many regards, its computation times can still be high, especially when estimating continuous functions like a GP and in the case of high-dimensional data, such as the College Experience study (Stan Development Team, 2024). In practice, applying NUTS to the exact joint posterior did not achieve convergence after seven days of computation, primarily because each iteration requires repeated Cholesky factorizations of large GP covariance matrices. We therefore adopted a Hilbert-space GP approximation to reduce computational burden and enable tractable sampling.

4.3 Hilbert approximation for GPs

We give a brief description of the Hilbert approximation for GPs. For more detailed explanations and derivations, one should see Riutort-Mayol et al. (Reference Riutort-Mayol, Bürkner, Andersen, Solin and Vehtari2023). It can be shown that stationary covariance functions may be represented by their spectral densities (Riutort-Mayol et al., Reference Riutort-Mayol, Bürkner, Andersen, Solin and Vehtari2023). The squared-exponential kernel spectral density is defined as

(4.2) $$ \begin{align} S(\boldsymbol{\omega}) = \sigma^2 \sqrt{2\pi}\,\ell\exp\left\{-\frac{1}{2} \ell^2 \boldsymbol{\omega}'\boldsymbol{\omega}\right\}. \end{align} $$

Next, we define the domain $\Omega=[-L, L]$ for some positive value L. Let $\lambda_q$ and $\{\phi_q(x)\}_{q=1}^{\infty}$ represent the eigenvalues and eigenfunctions, respectively, of the Laplacian operator on the domain $\Omega$ with Dirichlet boundary conditions. Then, for sufficiently large m,

(4.3) $$ \begin{align} \psi(t_r, t_s) \approx \sum_{q=1}^m S(\sqrt{\lambda_q}) \phi_q(t_r) \phi_q(t_s) = \boldsymbol{\phi}(t_r)'\boldsymbol{\Delta}\boldsymbol{\phi}(t_s), \end{align} $$

where $\boldsymbol {\phi }(t)=\{\phi _q(t)\}_{q=1}^m \in \mathbb {R}^m$ is the column vector of basis functions and $\boldsymbol {\Delta }\in \mathbb {R}^{m\times m}$ is a diagonal matrix of $S(\sqrt {\lambda _q})$ for $q=1, \dots , m$ . Let $\boldsymbol {\Phi }_i$ be the $T_i\times m$ matrix of eigenfunctions. Then we can write

(4.4) $$ \begin{align} \mathbf{f}_i \overset{\cdot}{\sim} \mathcal{MVN}(\boldsymbol{\mu}_i, \boldsymbol{\Phi}_i \boldsymbol{\Delta}\boldsymbol{\Phi_i}'), \end{align} $$

where $\boldsymbol {\mu }_i$ is the $T_i \times 1$ mean vector evaluated at times $t_{i,1}, \dots , t_{i, T_i}$ . If the mean is assumed to be zero, by properties of the multivariate normal distribution,

(4.5) $$ \begin{align} f_i(t) \approx \sum_{q=1}^m S(\sqrt{\lambda_q})^{1/2}\phi_q(t)\eta_q, \end{align} $$

where $\eta_q \sim \mathcal{N}(0,1)$. Provided the inputs satisfy $|t|\le L$, so that they lie in $\Omega$, and m is sufficiently large, this is an appropriate approximation for the density of a GP. The approximation is easily adjusted for the case of a nonzero mean, since the mean can simply be added to the GP realization.
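The pieces above can be assembled into a short numerical check (a sketch with illustrative values of $\sigma$, $\ell$, L, and m; not the paper's code). In one dimension, the Laplacian eigenpairs on $[-L,L]$ with Dirichlet boundaries are $\lambda_q=(\pi q/2L)^2$ and $\phi_q(t)=\sqrt{1/L}\sin(\sqrt{\lambda_q}(t+L))$, so the low-rank kernel $\boldsymbol{\Phi}\boldsymbol{\Delta}\boldsymbol{\Phi}'$ in Equation (4.3) should closely match the exact squared-exponential kernel:

```python
import numpy as np

def sq_exp_kernel(t1, t2, sigma, ell):
    # Exact squared-exponential covariance psi(t1, t2).
    return sigma**2 * np.exp(-(t1 - t2)**2 / (2.0 * ell**2))

def spectral_density(omega, sigma, ell):
    # Spectral density of the squared-exponential kernel, Equation (4.2).
    return sigma**2 * np.sqrt(2.0 * np.pi) * ell * np.exp(-0.5 * ell**2 * omega**2)

def hilbert_basis(t, L, m):
    # Eigenvalues lambda_q and eigenfunctions phi_q of the Laplacian on [-L, L].
    q = np.arange(1, m + 1)
    lam = (np.pi * q / (2.0 * L))**2
    phi = np.sqrt(1.0 / L) * np.sin(np.sqrt(lam) * (t[:, None] + L))  # n x m
    return phi, lam

t = np.linspace(-1.0, 1.0, 50)            # inputs well inside [-L, L]
sigma, ell, L, m = 1.0, 0.3, 5.0, 60
Phi, lam = hilbert_basis(t, L, m)
Delta = spectral_density(np.sqrt(lam), sigma, ell)
K_approx = Phi @ np.diag(Delta) @ Phi.T   # Equation (4.3)
K_exact = sq_exp_kernel(t[:, None], t[None, :], sigma, ell)

# One approximate GP draw via Equation (4.5): f = Phi diag(Delta)^{1/2} eta,
# an O(mn) operation requiring no n x n matrix inversion.
eta = np.random.default_rng(1).standard_normal(m)
f = Phi @ (np.sqrt(Delta) * eta)
```

With these illustrative settings the elementwise error between the approximate and exact kernels is negligible, illustrating why m basis functions suffice when m is large relative to $L/\ell$.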

In general, this approximation is more efficient than sampling from an exact GP because it reframes the problem as a linear model with coefficients $\eta_1, \dots, \eta_m$. It also avoids inverting the covariance matrix (Riutort-Mayol et al., Reference Riutort-Mayol, Bürkner, Andersen, Solin and Vehtari2023). While the one-time cost of computing the basis functions is $\mathcal{O}(m^2n)$, the cost at each iteration of the sampler is only $\mathcal{O}(mn+m)$. With $m \ll n$, this yields a significant computational advantage in high-dimensional settings over samplers that require inverting the covariance matrix at a cost of $\mathcal{O}(n^3)$.

Of course, the approximation in Equation (4.5) depends on the values of L and m. We followed the methodology suggested in Riutort-Mayol et al. (Reference Riutort-Mayol, Bürkner, Andersen, Solin and Vehtari2023) to tune these hyperparameters, as described in more detail in Appendix F. As part of the tuning, we also used context-specific metrics to assess whether L and m were sufficiently large that larger values would not lead to significantly different results. Specifically, we calculated the within-sample posterior predictive mean absolute deviation for items 1–4 ($MAD_1$–$MAD_4$), which takes values between 0 and 4, to assess the adequacy of the approximation. To calculate this, assume we have R samples from the joint posterior distribution for each parameter. Let $\hat{y}_{i,j,t}^{(r)}$ represent the $r$th predicted value for $y_{i,j,t}$, found by sampling from the model in Equations (3.1) and (3.2) with $a_j$, $\theta_{i,t}$, $\delta_j$, and $\tau_1, \dots, \tau_{K-1}$ replaced by their $r$th draws from the joint posterior distribution. We can then calculate

(4.6) $$ \begin{align} MAD_j= \frac{1}{\sum_{i=1}^n|\mathbf{t}_i|}\sum_{i=1}^n\sum_{t \in \mathbf{t}_i} \frac{1}{R}\sum_{r=1}^R|y_{i,j, t}-\hat{y}_{i,j,t}^{(r)}|, \end{align} $$

where $|\mathbf{t}_i|$ represents the cardinality of $\mathbf{t}_i$. For each observation, we used $R=6,000$ samples from the posterior to calculate this metric. We also used the predicted values to calculate the accuracy of the model for each item; we refer to these values as $Acc_1$–$Acc_4$. The accuracy metric penalizes all incorrect predictions equally, ignoring the ordinal structure of the responses.
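If the posterior predictive draws for one item are arranged as an $R \times n_{\mathrm{obs}}$ array (a hedged sketch; the variable names are ours), Equation (4.6) and the accuracy metric reduce to simple averages over draws and observations:

```python
import numpy as np

def mad_and_accuracy(y_obs, y_pred):
    """Posterior predictive MAD_j (Equation (4.6)) and Acc_j for one item.

    y_obs:  observed ordinal responses, shape (n_obs,), pooled over all
            subjects i and observed time points t
    y_pred: posterior predictive draws, shape (R, n_obs)
    """
    mad = np.mean(np.abs(y_pred - y_obs[None, :]))  # mean over draws, then obs
    acc = np.mean(y_pred == y_obs[None, :])         # exact-category match rate
    return mad, acc

# Toy example with R = 2 draws and three observations.
y_obs = np.array([1, 2, 3])
y_pred = np.array([[1, 2, 3],
                   [2, 2, 4]])
mad, acc = mad_and_accuracy(y_obs, y_pred)
```

Averaging the absolute error over the full $R \times n_{\mathrm{obs}}$ array is equivalent to the nested sums in Equation (4.6) because every observation has the same number of draws R.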

Each time the model was fit, we sampled from the posterior using four chains with 3,000 samples each. A burn-in of 1,500 was used for each chain. We used four chains to ensure that chains converged to similar values regardless of starting point. Because we used multiple chains in our sampling procedure, we assessed the between and within chain variability of our parameter estimates using $\hat {R}$ , where values close to one indicate that the chains have converged adequately (Vehtari et al., Reference Vehtari, Gelman, Simpson, Carpenter and Bürkner2021). We also found the bulk effective sample size (ESS) (Vehtari et al., Reference Vehtari, Gelman, Simpson, Carpenter and Bürkner2021) for each parameter to assess the movement within chains. It is recommended that each parameter have an $\hat {R}$ value less than 1.01, and a sufficiently large ESS (at least 100 per chain) (Stan Development Team, 2024). We found 3,000 draws from the posterior gave us ample room to meet these convergence standards. Sampling was done on 4 cores using around 30 GB of RAM for 191 subjects and 272 possible observed time points. Each fitting took around 6 hours. Note that, in the following sections, we will provide diagnostics and simulations that support the efficacy of our sampling algorithm. The Stan code is available in Appendix L.
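As a rough illustration of the $\hat{R}$ diagnostic (a simplified split-$\hat{R}$ for a single parameter; Vehtari et al. (2021) additionally rank-normalize the draws, which we omit here), one compares between- and within-chain variability:

```python
import numpy as np

def split_rhat(chains):
    # chains: (n_chains, n_draws) post-burn-in draws for one parameter.
    # Each chain is split in half so within-chain trends also inflate R-hat.
    n_chains, n_draws = chains.shape
    half = n_draws // 2
    splits = chains[:, :2 * half].reshape(2 * n_chains, half)
    n = splits.shape[1]
    B = n * splits.mean(axis=1).var(ddof=1)   # between-chain variance
    W = splits.var(axis=1, ddof=1).mean()     # within-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled variance estimate
    return float(np.sqrt(var_plus / W))

rng = np.random.default_rng(2)
mixed = rng.standard_normal((4, 3000))       # well-mixed chains: R-hat near 1
stuck = mixed + np.arange(4)[:, None]        # chains stuck at different levels
```

Well-mixed chains give $\hat{R}\approx 1$, while chains exploring different regions of the posterior inflate $\hat{R}$ well above the 1.01 threshold used in our analysis.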

In addition to tuning the Hilbert approximation, the number of splines used for the mean function must be tuned. We did this by comparing the within-sample predictive accuracy and mean absolute deviation measures discussed before for several different numbers of knots. We also compared cross-validation predictive accuracy measures to decide. We do not go into great detail here since spline tuning is well-studied, but the process for the College Experience study will be described in Section 6.2.

5 Simulation study

In order to assess whether samples from our MCMC algorithm adequately approximate the joint posterior distribution and recover unknown parameters and latent trajectories, we performed multiple simulation studies in this section.

5.1 Simulation for parameter recovery

To perform the simulation study for parameter recovery, we simulated $M=100$ data sets. These data were designed to emulate the data observed in the College Experience study by using the same structure of missingness and parameter values derived from the College Experience study. The procedure for generating each data set is described in Appendix G. After creating a simulated data set, we then used $\texttt{Rstan}$ to sample from the joint posterior. For the simulation, we used four chains with 2,000 samples each and a burn-in of 1,000 per chain; we decreased the number of draws from the main analysis to reduce computation time for the simulations. For the Hilbert approximation, we used $c=2.271$ and $m=21$ to match the values we found to be adequate for the College Experience data.

5.1.1 Parameter recovery results

To assess parameter recovery, uncertainty, and convergence, we calculated several diagnostics for each of the parameters and 100 simulated data sets. Specifically, we found the bias, mean squared error (MSE), standard error (SE), standard deviation (SD), and coverage probability (CP) of the 95% credible intervals (CIs) for each parameter. Formulas for these metrics can be found in Appendix H. We also reported the average $\hat {R}$ and ESS value for each parameter across all 100 data sets. All of the diagnostics mentioned in this section for the simulation study are shown in Table 1.

Table 1 Simulation diagnostics based on 100 simulated data sets for each parameter

We found that each of the parameters exhibits a bias close to zero. Further, the SE and MSE values were close to zero for all parameters. The reported SD values for the cutoffs, $\tau _1, \dots , \tau _4$ , were somewhat high, possibly due to the high correlation in the parameters. However, the CPs for the 95% CIs were all at least 95% for these parameters. The CP for $\ell $ was 96%, suggesting that our Hilbert approximation is working appropriately. All parameters had a 95% CI CP of at least 93%, well within the range of reasonable values when only 100 data sets are simulated. We also found average $\hat {R}$ values less than 1.01, suggesting the convergence of the chains. The average ESS was also reasonably large, averaging at least 100 samples per chain, for all parameters.

5.2 Simulation for trajectory recovery

Next, we simulated data to assess the model’s ability to recover individual mental health trajectories, with $\boldsymbol {\theta }_i$ defined as the sum of a population-level mean function (Equation (3.5)) and a subject-specific, zero-mean GP with covariance $\boldsymbol {\psi }(\mathbf {t}, \mathbf {t}')$ (Equation (3.4)). Focusing on the Hilbert approximation to the GP, we evaluated recovery of individual deviations, $g_i(\mathbf {t}_i)=\boldsymbol {\theta }_i - \boldsymbol {\mu }(\mathbf {t}_i)$ , across scenarios using a single simulated dataset per setting to reflect practical conditions.

First, let us consider a scenario where the data are generated as in Section 5.1. However, we drew the mean function from a multivariate normal distribution with mean vector $\mathbf {0}$ , a correlation matrix given by a squared exponential kernel with lengthscale 0.5, and SD 0.5. By generating the data from a multivariate normal distribution with an infinitely differentiable covariance function, we are able to see whether splines, which have a finite number of knots, can adequately approximate a smooth mean function in this setting. We chose a lengthscale of 0.5 because we wanted to evaluate trajectory recovery under the scenario where the true underlying function is moderately wiggly. After fitting the model to the generated data, we then compared plots of the individual deviations from the mean function to the true deviations in the simulated data. The results for eight randomly selected trajectories are shown in Figure 3. The black lines indicate the true trajectory while the blue lines represent the median of the posterior distribution. Note that some of the black lines (true trajectory) are shorter than the blue lines (estimated trajectory) because in the College Experience study different students entered and left the study at different points in the year. As described in Section 5.1 and Appendix G, the data were generated to mimic this entry and exit pattern. Since our goal was to model the trajectory over the course of the entire freshman year, we felt it was useful to see the trajectory estimates (and uncertainty) associated with time points outside of a subject’s study period. This led to the blue and black curves displaying different lengths. We see that in most scenarios, the algorithm does an excellent job at capturing individual deviations from the mean for periods where data are available. For most parts of the curve, the true trajectory lies within the 95% CI bounds. 
We also see how the GP is useful in flexibly capturing a variety of different shapes in individual deviations from the mean function. Importantly, the uncertainty associated with our trajectory estimates is much greater during periods where data are unavailable (i.e., late entry or early exit) when compared to periods where data are available. To see how well the splines capture the mean trend, please see Appendix I.1. Note that the smooth mean function in the generated data was approximated well by the spline mean function with 16 degrees of freedom.
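For reference, the smooth mean function used in this scenario can be generated as follows (a sketch with an illustrative time grid and seed; the actual grid follows the College Experience design):

```python
import numpy as np

rng = np.random.default_rng(11)
t = np.linspace(0.0, 5.0, 80)          # illustrative time grid
ell, sd = 0.5, 0.5                     # lengthscale and SD from this scenario
# Squared-exponential correlation scaled by sd^2; a small jitter on the
# diagonal keeps the covariance numerically positive definite.
K = sd**2 * np.exp(-(t[:, None] - t[None, :])**2 / (2.0 * ell**2))
mu = rng.multivariate_normal(np.zeros(t.size), K + 1e-9 * np.eye(t.size))
```

Because the squared-exponential kernel is infinitely differentiable, draws of `mu` are smooth but moderately wiggly at lengthscale 0.5, which is exactly the regime the spline mean function must approximate.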

Figure 3 A comparison of simulated trajectories (black curves) to the estimated trajectories (as determined by the median of the posterior distribution, in blue). The shaded region represents 95% CI bounds. The data were designed to match the College Experience study. Thus, the black curves may be shorter than the blue curves to mimic late entry or early exit in the study. The full estimated curve is shown in blue since we want to make inference on the entire freshman year.

Many other simulations were performed to assess individual trajectory recovery under different values of $\ell $ , different numbers of items, and different numbers of possible ordinal responses. For brevity, the results of a select number of these simulations are in Appendix I. In general, we found that the model appropriately captures individual deviations from trajectories, even in challenging situations, including small values of $\ell $ and low numbers of possible ordinal responses.

5.3 Simulation for comparison to baseline model

We also performed a simulation to compare model fit and trajectory recovery between our model and a baseline model without GP trajectories. Specifically, we defined the baseline model using Equations (3.1) and (3.2), but assumed $\theta_{i,t}=\mu(t)+\alpha_i$, where $\alpha_i \sim \mathcal{N}(0, \sigma^2_{int})$ is a subject-specific random intercept and $\mu(t)$ is defined as in Equation (3.5). To be consistent with the GPMHT, we used 16 degrees of freedom for the B-splines in the mean function. We used the same data generation procedure as described in Section 5.2 and fit the generated data with both the GPMHT and baseline models. The within-sample accuracy and mean absolute deviation for each item and model are shown in Table 2. We do not see much of a difference in within-sample predictive accuracy. Note that in the context of ordinal regression, we might not expect large changes in predictive accuracy or model fit measures based on changes to the latent trajectory, because fitted values and predictions depend on the threshold values. Small changes in the latent trajectory will not change the observed ordinal category unless they cross a threshold, especially when the two models share the same mean function and thus center the latent values in the same threshold region.

Table 2 Accuracy and mean absolute deviation metrics for items 1–4 in simulated data comparing a model with no GP trajectories (Baseline) to the GPMHT

We can also compare the latent trajectories in the two models. Since both models have the same mean function, we compare the individual deviations from the average under the two models. These are shown in Figure 4. We see that the 95% CI bounds for the individual deviation from the average under the baseline model are narrower than those of the GPMHT. However, the baseline model does not capture any changes in the deviation over time. This difference is especially stark for IDs 14, 85, and 148, where there are very clear nonlinear trends in the individual deviations from the average. Conversely, the GPMHT mostly captures these trends, and the uncertainty associated with them, although it does struggle with some subjects who entered the study late (see ID 76). In general, even without changes to model fit or predictive accuracy, we see value in the additional complexity provided by a GP trajectory.

Figure 4 A comparison of simulated trajectories (black curves) to the estimated trajectories for both the baseline model (red) and GPMHT (blue) determined by the median of the posterior distribution). The shaded regions represent 95% CI bounds. This simulation was designed to emulate the College Experience study.

6 Application to college experience study

In this section, we discuss our GPMHT in the context of our motivating example: the College Experience study. We explain how the data were prepared for use in the model, discuss computational details, and present parameter estimates with their associated 95% CIs. Finally, we show how the fitted model can be used to make inference on individual mental health trajectories, as well as the average mental health trajectory of all the participants in the study.

6.1 Data preparation and missingness

The cleaned data represent 191 students over the course of 272 days during their freshman year. The weekend before the beginning of the fall term of their freshman year represents day 0 for all students, regardless of when they entered the study. For a complete discussion on data preparation, see Appendices A and J. Note that the data do not indicate when each student is sent a notification, only when they respond to a notification. Thus, it is not clear whether the cause of a missing EMA response is due to a non-response to an EMA survey or if no EMA was sent that day. Regardless of the cause, missing EMA responses are not imputed or considered in the analysis, implying that non-responses are missing at random (MAR). We acknowledge that this is a simplifying assumption. For example, on days where students are experiencing lower self-esteem, they may be less likely to respond to the EMA. However, without knowing which days an EMA was sent, it is not possible to distinguish mechanisms of missingness. Formally, we treat missingness as ignorable under an MAR assumption conditional on observed history and calendar/time-of-term effects: letting $R_{it}$ indicate response for person i on day t, we assume

$$\begin{align*}p(R_{it}=1 \mid \theta_{i,1:t}, \mathcal{H}_{it}) \;=\; p(R_{it}=1 \mid \mathcal{H}_{it}), \end{align*}$$

where $\mathcal {H}_{it}$ includes past observed EMA responses and calendar/time-of-term indicators. Because prompt delivery times were not logged, we cannot distinguish non-prompt days from nonresponse, so this assumption is required for analysis.

6.2 College experience data computational details

We fit the GPMHT with the Hilbert approximation on the College Experience data using four different degrees of freedom: 8, 16, 24, and 32. Recall that the prior on $\boldsymbol {\beta }$ represents Bayesian regularization akin to ridge regression in a frequentist setting (James et al., Reference James, Witten, Hastie and Tibshirani2013). Thus, we needed the number of knots to be sufficiently large to ensure adequate flexibility. However, the model is controlled for overfit through regularization. The tuning results for the optimal value of L and m for each number of knots are shown in Table 3. The intermediate tuning steps and additional information are available in Appendix K.

Table 3 The optimal values of m and L, where $L=cS$ for each B-spline scenario

Note: DF refers to the degrees of freedom associated with the B-splines. For each DF value, we also present other metrics, including the estimated value $\hat {\ell }$ , accuracy, and mean absolute deviation for each item.

Regardless of the number of knots used for the B-splines in the mean function, the value of $\hat {\ell }$ was estimated to be between 1.197 and 1.225. This small range of values suggests that the number of knots does not significantly impact the estimation of $\ell $ . We also found the model fit, as determined by within-sample predictive accuracy and mean absolute deviation, to be comparable across all models.

To make a final decision on the number of knots, we also looked at the out-of-sample predictive accuracy of the models with the optimal values of L and m for a given number of B-spline knots as determined by our Hilbert approximation tuning. This was done by randomly selecting around 20% of observations (not entire subjects) to be withheld from the data during training. We then predicted the withheld values and calculated the accuracy and mean absolute deviation for each item. Similar to the within-sample values, 6,000 draws from the predictive posterior distribution were used to compute these values. The results are shown in Table 4.

Table 4 Cross-validation accuracy and mean absolute deviation for each B-spline degree of freedom considered

We did not find a difference in cross-validation predictive accuracy or mean absolute deviation depending on the degrees of freedom used for the B-splines. The differences are small, but according to Table 3, there does appear to be a slight improvement in within-sample predictive metrics when more than 8 degrees of freedom are used. Recall that because our prior on $\boldsymbol {\beta }$ invokes a ridge regression framework, excess coefficients are shrunk toward zero, but we cannot make up for too few knots. To ensure that the number of knots is large enough but does not cause a computational burden or a tendency to overfit, we selected a value of 16 degrees of freedom. Associated with the 16 degrees of freedom, we used a value of $m=21$ and $c=2.271$ , according to the tuned values in Table 3.

6.3 GPMHT parameter inference

After taking 6,000 draws from the posterior distribution, we computed parameter estimates by finding the median of the draws for each parameter. We calculated 95% CIs, as determined by the $2.5^{th}$ and $97.5^{th}$ percentiles, to quantify the uncertainty in our estimates. We also computed $\hat {R}$ values and ESS to check for convergence. Each of these values for the GPMHT fit on the College Experience data is shown in Table 5. We found a moderately large value of $\hat {\ell }=1.225 \ (1.084, 1.375)$ . Conversely, the SD associated with B-spline coefficients was fairly small at $\hat {\sigma }=0.416 \ (0.284, 0.661)$ . Further, we found that responses to the SESS tended to be highest for item 1, followed by items 4, 3, and 2. Note that lower values of $\delta _j$ indicate that responses tend to be higher for item j. Discrimination was lowest for item 1 and highest for item 2.
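The summaries reported in Table 5 follow the standard recipe; as a brief sketch (toy draws for illustration):

```python
import numpy as np

def posterior_summary(draws):
    # Median point estimate and equal-tailed 95% credible interval
    # from posterior draws for one parameter.
    med = float(np.median(draws))
    lo, hi = np.percentile(draws, [2.5, 97.5])
    return med, float(lo), float(hi)

draws = np.arange(1.0, 101.0)          # toy stand-in for 6,000 posterior draws
est, lo, hi = posterior_summary(draws)
```

The same function applies unchanged to any scalar parameter, including $\ell$, $\sigma$, each $a_j$ and $\delta_j$, and the cut-offs.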

Table 5 GPMHT parameter estimates, 95% CI bounds, $\hat {R}$ values, and ESS for the College Experience study motivating example

6.4 Mental health trajectories

We now focus on the goals of the study: to determine how self-esteem changes over the course of the freshmen year at both the population and individual levels. We begin by focusing on the population level by examining the mean function in Equation (3.5) evaluated across the year. This is shown in Figure 5, where the solid black line represents the median of the posterior distribution, and the shaded region represents the 95% CI bounds. The blue and red vertical lines, respectively, indicate approximate term start and end days. Note that these are approximate due to the mismatch in days in a term for each year, as well as students’ individual schedules.

Figure 5 Average self-esteem trajectory for students in the College Experience study. Blue and red vertical lines indicate approximate term start and end days, respectively.

Overall, we see that students’ self-esteem tends to fluctuate throughout the fall term. It then increases toward the end of the term, peaking during winter break. Afterward, we see a steady decline in self-esteem that reaches a minimum during the second half of winter term. We again see a slight increase in self-esteem going through spring break, with another gradual decrease in self-esteem during the spring term. Self-esteem tends to level out at the end of the academic year. It might be slightly lower on average than at the beginning of the academic year, but this difference does not appear to be significant. This trajectory may help stakeholders understand when students may need additional mental health resources such as in the middle of winter term.

Next, we look at the self-esteem trajectories for individual students. Specifically, we want to determine which students, if any, exhibit trajectories that may warrant concern or additional follow-up. To do this, we focus on the students' deviation from the average, or $g_i(\mathbf{t}_i)=\boldsymbol{\theta}_i-\boldsymbol{\mu}(\mathbf{t}_i)$. This is a particularly helpful view because, while all students might have low mental health attributes during the winter term, students whose self-esteem decreases more than average may need additional support. We consider nine randomly selected students in Figure 6. We see that IDs 96 and 182 started out with self-esteem values around average, but their self-esteem steadily decreased throughout the year relative to the other students. Conversely, while ID 196 started with lower self-esteem than average, their self-esteem increased relative to their peers.

Figure 6 Estimated deviation from the average self-esteem trajectory by subject, along with 95% credible intervals.

6.5 College experience study comparison to baseline model

We now compare our model to the baseline model without GP trajectories defined in Section 5.3. We fit this baseline model using 6,000 draws from the posterior. We calculated within-sample and cross-validation predictive accuracy and mean absolute deviation metrics the same way we did for the GPMHT. These are shown in Table 6. Overall accuracy was slightly higher and the mean absolute deviation was slightly lower for the GPMHT, indicating slightly better fit by the GPMHT. Even with only slight changes to model fit, through simulations in Section 5.3, we showed that our model is capable of recovering nonlinear trends in individual latent mental health quantities, and how they deviate from the average population, which cannot be done with random intercepts or slopes alone.

Table 6 Within-sample (WS) and cross-validation (CV) posterior predictive accuracy and mean absolute deviation for the comparison baseline model without GP trajectories

Because the baseline model and the GPMHT have the same mean function, there is not expected to be a meaningful difference in population mean functions. Conversely, Figure 7 displays individual students’ deviations from the population average self-esteem trajectory. Here, we can see the real benefit of the GP prior. While the baseline model only provides a single line for a random intercept, we can see using the GPMHT how students progress over the course of the year. For example, while IDs 12 and 187 have similar random intercepts, ID 12 appears to display an upward trend in the trajectory, while person 187 displays a downward trend. Further, we can see how the uncertainty in the curve associated with the GPMHT changes over time. The uncertainty is lower in periods where the observed data are dense, and higher elsewhere. Conversely, using a single random intercept gives no information on how our level of uncertainty changes over time. Without using a GP prior, the nuance of individual deviations from the population average is lost. With the GP, we are able to easily visualize and quantify how students’ self-esteem changes with respect to their peers, thereby enabling individualized interventions.

Figure 7 A comparison of the deviation from the average trajectory of self-esteem over the course of the freshman year. The shaded areas corresponded to the appropriate 95% credible intervals.

7 Conclusion

In this article, we introduced the GPMHT for modeling longitudinal Likert-scale survey data. We showed how a GP can be used to model the trajectory of a construct of interest underlying the survey items. Because of the complexity of the model and the high-dimensional nature of the data, we used Bayesian techniques combined with a Hilbert space approximation to obtain model estimates and quantify uncertainty. Through simulations, we showed that our sampling algorithm was able to recover the unknown parameters.

We applied our model to self-esteem data collected during Dartmouth students' freshman year. We showed that, on average, self-esteem tends to peak during winter break, followed by a large decrease. We see another increase in self-esteem during spring break, after which self-esteem decreases and levels out during the spring term. In addition to overall population trends, we are also able to study how individuals' latent trajectories deviate from the population average. The flexibility of the GP allows us to model nonlinear patterns in these deviations.

Understanding health trajectories in self-esteem can help stakeholders optimize the time of interventions to help populations of interest. By focusing on both population and individual trajectories, interventions for an entire population, as well as individualized interventions for those whose self-esteem is trending negatively or much lower than the population average, may be offered. Further research should be done to investigate how covariates can effectively be used in the GP mean function to better understand trajectories. Additional simulations could also be performed to investigate the effectiveness of the Hilbert approximation in settings with extreme response patterns. In regard to the College Experience study, we could also explore mental health trajectories based on other metrics, as well as the correlation between trajectories derived from different metrics.

Data availability statement

The College Experience study dataset used in this article is publicly available at https://www.kaggle.com/datasets/subigyanepal/college-experience-dataset.

Author contributions

Conceptualization: X.W.; Data curation: Z.G.M.; Data visualization: Z.G.M.; Formal analysis: Z.G.M.; Funding acquisition: X.W.; Methodology: Z.G.M. and X.W.; Software: Z.G.M.; Writing original draft: Z.G.M.; Writing—review and editing: Z.G.M. and X.W. Both authors approved the final submitted draft.

Funding statement

This research was supported by a grant from the U.S. National Science Foundation, https://www.nsf.gov/awardsearch/showAward?AWD_ID=1848451 (Award No. 1848451).

Competing interests

The authors declare none.

Ethical standards

The research meets all ethical guidelines, including adherence to the legal requirements of the study country.

Appendix

A Reasoning for only using freshman year for the analysis

Unfortunately, the data do not indicate when students were enrolled in classes. Dartmouth runs on a term system, so it is not clear when students are likely to be enrolled. However, Dartmouth freshmen are required to take classes during the fall, winter, and spring terms. Because it is likely that mental health and behavior differ during terms of enrollment from other terms, and the goal of this article is to model mental health trajectories, we focused our analysis only on the first three terms of the freshman year. This period covers a total of 272 days, although some students enrolled in the study later in the fall term than others, so the number of observed days may be substantially smaller. Note that for freshmen entering in both 2017 and 2018, the weekend before the start of the fall term is used as day 0, regardless of when a specific student entered the study. By numbering the days in this way, a specific day for a student refers to approximately the same time frame in the course of the freshman year (midterms, breaks, etc.), no matter when the student joined the study.

B Histogram of SSES responses

Figure B1 Histogram of SSES item responses.

C Likelihood

To derive the full likelihood, we assumed that there are N study participants. Then, we defined the collection of unknown parameters $\boldsymbol {\Theta }=(\ell , \sigma , \tau _1, \dots , \tau _{K-1}, a_1, \dots , a_J, \delta _1, \dots , \delta _{J}, \boldsymbol {\theta }_1, \dots , \boldsymbol {\theta }_N, \boldsymbol {\beta })'$ . Let $\mathbf {Y}$ represent a random vector of all responses and $\mathbf {y}$ represent a vector of observed responses. Further, let $y_{i,j,t}$ denote the observed response for person i to item j at time t. Then the likelihood, conditional on the set of unknown parameters, is

(C.1) $$ \begin{align} Pr(\mathbf{Y}=\mathbf{y}|\boldsymbol{\Theta}) = \prod_{i=1}^N\prod_{j=1}^J\prod_{t=t_{i,1}}^{t_{i,T_i}} Pr(Y_{i,j,t}=y_{i,j,t}|\boldsymbol{\Theta}), \end{align} $$

where $Pr(Y_{i,j,t}=y_{i,j,t}|\boldsymbol {\Theta })$ is given by Equations (3.1) and (3.2).
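As an illustration, the likelihood in Equation (C.1) can be evaluated on the log scale by summing log-probabilities over the observed (person, item, time) triples. The sketch below is a minimal Python implementation, not the authors' code (the article fits the model in Stan); `grm_prob` and `log_likelihood` are hypothetical helper names, and the category probabilities take the graded response model form used in Appendix E.

```python
import numpy as np

def logistic_cdf(x):
    # CDF of the standard logistic distribution, F(x) = 1 / (1 + exp(-x)).
    return 1.0 / (1.0 + np.exp(-x))

def grm_prob(k, theta, a, delta, tau):
    """P(Y = k) under the graded response model:
    F(a(theta - delta) - tau_{k-1}) - F(a(theta - delta) - tau_k),
    with the implicit boundary thresholds tau_0 = -inf and tau_K = +inf."""
    eta = a * (theta - delta)
    lo = -np.inf if k == 1 else tau[k - 2]             # tau_{k-1}
    hi = np.inf if k == len(tau) + 1 else tau[k - 1]   # tau_k
    return logistic_cdf(eta - lo) - logistic_cdf(eta - hi)

def log_likelihood(y_obs, theta, a, delta, tau):
    """Equation (C.1) on the log scale: sum of log P(Y_{i,j,t} = y_{i,j,t})
    over the observed triples. `y_obs` maps (i, j, t) -> observed category k,
    and `theta[i][t]` holds person i's latent trait at time t."""
    total = 0.0
    for (i, j, t), k in y_obs.items():
        total += np.log(grm_prob(k, theta[i][t], a[j], delta[j], tau))
    return total
```

Only observed triples contribute, which accommodates the unequal observation times in the mHealth data.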

D Joint posterior

Recall that according to Bayes’ Theorem, we find the joint posterior distribution, $\pi (\boldsymbol {\Theta }|\mathbf {y})$ , up to a normalizing constant by

(D.1) $$ \begin{align} \pi(\boldsymbol{\Theta}|\mathbf{y}) \propto Pr(\mathbf{Y}=\mathbf{y}|\boldsymbol{\Theta})\pi(\boldsymbol{\Theta}), \end{align} $$

where $\pi (\boldsymbol {\Theta })$ , the joint prior distribution, equals

(D.2) $$ \begin{align} \pi(\boldsymbol{\Theta})= \left[\prod_{i=1}^N \pi(\boldsymbol{\theta}_i|\ell, \boldsymbol{\beta}, \sigma)\right] \pi(\ell) \pi(\sigma) \pi(\boldsymbol{\beta})\left[\prod_{j=1}^J\pi(\delta_j)\pi(a_j)\right]\pi(\boldsymbol{\tau}). \end{align} $$

E Derivation of alternative parameterization of GRM

Note that since the CDF of the logistic distribution is defined as $F(x)=1/(1+e^{-x})$ , we can write $P(Y_{i,j,t}=k)=F(a_j(\theta _{i,t}-\delta _j)-\tau _{k-1})-F(a_j(\theta _{i,t}-\delta _j)-\tau _k)$ . Also note that $F(x)=1-F(-x)$ . So, $F(a_j(\theta _{i,t}-\delta _j)-\tau _{k-1})-F(a_j(\theta _{i,t}-\delta _j)-\tau _k)=(1-F(-a_j(\theta _{i,t}-\delta _j)+\tau _{k-1}))-(1-F(-a_j(\theta _{i,t}-\delta _j)+\tau _k))=F(\tau _k-a_j(\theta _{i,t}-\delta _j))-F(\tau _{k-1}-a_j(\theta _{i,t}-\delta _j))$ . Next define the latent variable $\phi ^*_{i,j,t}=a_j(\theta _{i,t}-\delta _j)+\epsilon _{i,j,t}$ , where $\epsilon _{i,j,t}\sim \text {Logistic}(0,1)$ . Then $P(Y_{i,j,t}=k)=F(\tau _k-a_j(\theta _{i,t}-\delta _j))-F(\tau _{k-1}-a_j(\theta _{i,t}-\delta _j))=P(\tau _{k-1}-a_j(\theta _{i,t}-\delta _j)\leq \epsilon _{i,j,t}\leq \tau _k-a_j(\theta _{i,t}-\delta _j)) =P(\tau _{k-1}<a_j(\theta _{i,t}-\delta _j)+\epsilon _{i,j,t}<\tau _{k})=P(\tau _{k-1}<\phi ^*_{i,j,t}<\tau _k)$ . Then Equation (3.1) may equivalently be written as

(E.1) $$ \begin{align} Y_{i,j,t} =k \text{ if } \tau_{k-1}\leq \phi^*_{i,j,t} \leq \tau_k, \end{align} $$

where $\phi ^*_{i,j,t}=a_j(\theta _{i,t}-\delta _j)+\epsilon _{i,j,t}$ and $\epsilon _{i,j,t}\sim \text {Logistic}(0,1)$ .
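The equivalence of the two parameterizations can be checked numerically: draws of the latent variable $\phi^*_{i,j,t}$ binned by the thresholds should reproduce the closed-form GRM category probabilities. Below is a small Monte Carlo sketch in Python; the parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic_cdf(x):
    # CDF of the standard logistic distribution.
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical values for one person-item pair at one time point.
theta, a, delta = 0.4, 1.2, -0.3
tau_full = np.array([-np.inf, -2.0, 0.0, 2.0, np.inf])   # tau_0, ..., tau_K

# Closed-form GRM probabilities: F(a(theta-delta) - tau_{k-1}) - F(a(theta-delta) - tau_k).
eta = a * (theta - delta)
exact = np.array([logistic_cdf(eta - tau_full[k - 1]) - logistic_cdf(eta - tau_full[k])
                  for k in range(1, 5)])

# Latent-variable parameterization: phi* = a(theta - delta) + eps, eps ~ Logistic(0, 1),
# with Y = k exactly when tau_{k-1} <= phi* < tau_k.
phi = eta + rng.logistic(0.0, 1.0, size=200_000)
empirical = np.array([np.mean((tau_full[k - 1] <= phi) & (phi < tau_full[k]))
                      for k in range(1, 5)])
```

With 200,000 draws, the binned frequencies agree with the closed-form probabilities to roughly Monte Carlo error.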

F Hilbert approximation tuning

We give a short description here of the tuning algorithm developed by Riutort-Mayol et al. (2023), but encourage the reader to refer to their paper for greater detail. As suggested, we let S be the maximum value of the standardized input domain. Then we defined $L=cS$ for some value of $c\geq 1.2$ . We call c the “boundary parameter.” Via simulation, Riutort-Mayol et al. (2023) found that for the squared exponential kernel with lengthscale $\ell $ , the minimum appropriate value of m is around $1.75cS/\ell $ . There is an inverse relationship between m and the smallest estimable value of $\ell $ , denoted $\ell _{min}$ , while the boundary parameter should be larger for larger values of $\ell $ , such that $c> 3.2 \ell /S$ .

Because the value of $\ell $ is unknown, one must tune the algorithm to ensure that m and c are appropriately sized to recover it. We did this by first setting m and c to the large values of $30$ and $5$ , respectively. We then used these values to fit the model and obtain an initial guess for $\ell $ , which we set as $\ell _{min}$ . Using this value of $\ell _{min}$ , we calculated the values of m and c based on the inequalities listed above. Then, we sampled from the posterior using $\texttt {Rstan}$ and obtained an estimate for $\ell $ from the median of the posterior draws, denoted $\hat {\ell }$ . We first checked that $\hat {\ell }-0.01>\ell _{min}$ , ensuring that the estimated value of $\ell $ is greater than the minimum estimable $\ell $ based on the values of m and c used. If the condition was not true, we updated $\ell _{min}$ to $\hat {\ell }$ , recalculated m and c, and repeated this step until the condition was satisfied. If the condition was true, then we increased m by 5 and calculated $c=\max (1.2, 3.2\hat {\ell }/S)$ . We also calculated the minimum estimable value of $\ell $ given the new values of c and m, such that $\ell _{min}=1.75 cS/m$ . We then sampled from the posterior distribution, obtained a new $\hat {\ell }$ , and again ensured that $\hat {\ell }-0.01>\ell _{min}$ . We continued increasing m until we found reasonable evidence of convergence through stable estimates of $\ell $ , which we defined as estimates of $\ell $ that changed by less than 0.01 between iterations.
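As a sketch, the tuning loop can be organized as follows. This is an illustrative Python outline, not the authors' code (which fits the model with RStan): `fit` is a placeholder for a full model fit returning the posterior median of $\ell$, `tune_hilbert` is a hypothetical name, and the sizing rules $c=\max(1.2, 3.2\hat{\ell}/S)$ and $\ell_{min}=1.75cS/m$ are taken from the description above.

```python
import math

def tune_hilbert(fit, S, m0=30, c0=5.0, m_step=5, tol=0.01, max_iter=200):
    # Step 1: a generous initial fit gives a first guess at ell, taken as ell_min.
    ell_min = fit(m0, c0)
    # Step 2: size c and m for that guess using the rules described above.
    c = max(1.2, 3.2 * ell_min / S)
    m = math.ceil(1.75 * c * S / ell_min)
    prev = None
    for _ in range(max_iter):
        ell_hat = fit(m, c)
        if ell_hat - tol <= ell_min:
            # The estimate is not safely above the smallest recoverable
            # lengthscale: re-size (m, c) around the new estimate and refit.
            c = max(1.2, 3.2 * ell_hat / S)
            m = math.ceil(1.75 * c * S / ell_hat) + 1
            ell_min = 1.75 * c * S / m
            continue
        if prev is not None and abs(ell_hat - prev) < tol:
            return m, c, ell_hat              # estimates of ell have stabilized
        prev = ell_hat
        m += m_step
        c = max(1.2, 3.2 * ell_hat / S)
        ell_min = 1.75 * c * S / m            # minimum estimable ell for new (m, c)
    raise RuntimeError("tuning did not stabilize within max_iter refits")
```

In practice each call to `fit` is a full MCMC run, so the loop is expensive; the iteration cap is a safety net added in this sketch, not part of the published procedure.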

Throughout this process, we focused on the value of $\ell $ since it is the most difficult parameter to estimate and primarily defines the behavior of the GP kernel. Thus, if we are able to recover $\ell $ , it is reasonable to assume that the Hilbert approximation is functioning appropriately (Riutort-Mayol et al., 2023).

G Data generation procedure for simulation study in Section 5.1

In the simulation, we generated data for 191 individuals to match the sample size of the College Experience study. To ensure reasonable parameter values, we used the GPMHT parameter estimates from the model fitted to the College Experience study data as the true values for the simulation. See Table 5 in Section 6.3 for these values. Specifically, we assumed $\ell =1.226$ , $\sigma =0.431$ , $\boldsymbol {\delta }=(-1.568, 0.710, 0.582$ , $0.276)'$ , $\boldsymbol {a}=(0.809, 1.133, 0.989, 1.068)'$ , and $\boldsymbol {\tau }=(-4.128, -2.003, 0.353, 2.733)'$ .

For each individual, we assumed daily data collection from day 0 to 271. The day values were centered and scaled to form the input vector $\mathbf {t}$ for the squared exponential kernel $\boldsymbol {\psi }(\mathbf {t}, \mathbf {t}')$ (Equation (3.4)). Basis splines $\boldsymbol {B}_{\mu }(t)'$ (Equation (3.5)) were computed for each t using 16 degrees of freedom, as determined for the College Experience study via the comparison of model fit and accuracy metrics as described in Section 6.2. Coefficients in $\boldsymbol {\beta }$ were drawn independently from $N(0, \sigma ^2)$ . For each person, $\boldsymbol {\theta }_i$ was generated by sampling from a $272$ -dimensional multivariate normal distribution with mean $\mathbf {0}$ and covariance $\boldsymbol {\psi }(\mathbf {t}, \mathbf {t}')$ (Equation (3.3)), then shifting the mean by $\boldsymbol {B}_{\mu }(t)'\boldsymbol {\beta }$ . We retained only the days corresponding to observed measurements for each individual in the College Experience study. These values, along with $\boldsymbol {\delta }$ , $\boldsymbol {a}$ , and $\boldsymbol {\tau }$ , were entered into Equations (3.1) and (3.2) to compute $Pr(Y_{i,j,t} = k)$ for $i = 1,\dots ,191$ , $j = 1,\dots ,4$ , and $k = 1,\dots ,5$ . Each $Y_{i,j,t}$ was then sampled from the multinomial distribution in Equation (3.2).
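The generation of one latent trajectory can be sketched as follows. This Python fragment is illustrative only: it assumes $\sigma$ acts as the marginal standard deviation of the squared exponential kernel, substitutes a generic smooth curve for the B-spline mean $\boldsymbol{B}_{\mu}(t)'\boldsymbol{\beta}$ (since the fitted 16-df basis is not reproduced here), and adds a small jitter term purely for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(42)

def sq_exp_kernel(t, ell, sigma):
    """Squared exponential kernel: sigma^2 * exp(-(t - t')^2 / (2 ell^2)),
    assuming sigma is the kernel's marginal standard deviation."""
    d = t[:, None] - t[None, :]
    return sigma**2 * np.exp(-d**2 / (2.0 * ell**2))

# Standardized day grid: days 0..271, centered and scaled as in the study.
days = np.arange(272)
t = (days - days.mean()) / days.std()

ell, sigma = 1.226, 0.431                               # values used in the simulation
K = sq_exp_kernel(t, ell, sigma) + 1e-8 * np.eye(272)   # jitter for stability

# Stand-in for the B-spline mean function B_mu(t)' beta (not the fitted basis).
mu = 0.3 * np.sin(np.pi * t / t.max())

# One person's latent trajectory theta_i drawn from the GP.
theta_i = rng.multivariate_normal(mu, K)
```

After drawing the full 272-day trajectory, only the days with observed measurements would be retained, mirroring the procedure above.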

H Simulation study diagnostics

Let $\hat {\omega }_m$ be the median of the posterior draws for a given parameter, $\omega $ , for the $m^{th}$ data set. Then we calculated the bias, that is, $\text {Bias}=\sum _{m=1}^M(\hat {\omega }_m-\omega )/M$ , the mean squared error, that is, $\text {MSE}=\sum _{m=1}^M(\hat {\omega }_m-\omega )^2/M$ , and the standard error, that is, $\text {SE}=\sqrt {\sum _{m=1}^M\left (\hat {\omega }_m - \sum _{m=1}^M\hat {\omega }_m/M\right )^2/(M-1)}$ . We defined $\omega _m^{[k]}$ to be the $k^{th}$ iteration of the MCMC sampler for the $m^{th}$ data set, so that the standard deviation may be written as $\text {SD}=\sum _{m=1}^M \sqrt {\sum _{k=1}^K(\omega _m^{[k]}-\hat {\omega }_m)^2/(K-1)}/M$ . Finally, we let $\omega _{m, q}$ be the $q^{th}$ percentile of MCMC draws for parameter $\omega $ in the $m^{th}$ data set. Then the 95% coverage probability is $CP=\sum _{m=1}^M I(\omega _{m, 0.025}< \omega < \omega _{m,0.975})/M$ , where $I(\cdot )$ is an indicator function.
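For concreteness, these diagnostics can be computed from an $M \times K$ array of posterior draws as in the following Python sketch (`simulation_diagnostics` is a hypothetical helper name):

```python
import numpy as np

def simulation_diagnostics(omega_true, draws):
    """Bias, MSE, SE, SD, and 95% coverage for one parameter, computed from
    an (M, K) array of MCMC draws across M simulated data sets."""
    draws = np.asarray(draws, dtype=float)
    est = np.median(draws, axis=1)                       # omega_hat_m
    bias = np.mean(est - omega_true)
    mse = np.mean((est - omega_true) ** 2)
    se = np.std(est, ddof=1)                             # spread of the M point estimates
    # Average posterior spread, measured around the posterior median.
    K = draws.shape[1]
    sd = np.mean(np.sqrt(np.sum((draws - est[:, None]) ** 2, axis=1) / (K - 1)))
    lo = np.percentile(draws, 2.5, axis=1)
    hi = np.percentile(draws, 97.5, axis=1)
    cp = np.mean((lo < omega_true) & (omega_true < hi))  # 95% coverage probability
    return {"bias": bias, "mse": mse, "se": se, "sd": sd, "cp": cp}
```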

I Simulation for trajectory recovery

This section contains additional simulations that explore the ability of the GPMHT model to recover parameter trajectories. Each simulation uses as a basis the simulation described in Section 5.2 and Appendix G, with adjustments made as described.

I.1 Mean trajectory simulation

While our focus is on the GP deviations from the mean, we briefly present the estimated mean function for this simulation to verify that the splines adequately capture the mean trend. Figure I1 compares the true mean function to the estimated mean function. The true curve lies within the 95% CI bounds over almost the entire study period. The largest deviations from the truth occur at the beginning and end of the study, where there may be fewer participants and observations because students joined the study late or dropped out early.

Figure I1 A comparison of the simulated mean function (black curve) to the estimated mean function (as determined by the median of the posterior distribution, in blue). The shaded region represents 95% CI bounds. The data were designed to match the College Experience study.

Figure I2 A comparison of simulated trajectories (black curves) to the estimated trajectories (as determined by the median of the posterior distribution, in blue). The shaded region represents 95% CI bounds. The value of $\ell $ was chosen to be 0.5.

I.2 Value of $\ell $

The Hilbert approximation can be more difficult to tune for smaller values of $\ell $ , which produce more wiggly trajectories (Riutort-Mayol et al., 2023). Thus, we repeated the same simulation as described before but with $\ell =0.5$ (Figure I2). The curves demonstrated greater variation from the truth than in the scenario where $\ell =1.226$ . While there are more areas where the true trajectory lies outside the 95% CI bounds, we are still able to capture the general shape of the trajectories.

I.3 Number of ordinal responses per item

For the next simulation, we studied how the number of ordinal response categories per item in a mental health questionnaire might impact the recovery of the GP trajectories. Using the same setup as described for the initial simulation study, we considered the scenario where only three ordinal responses per item were possible. We set $\boldsymbol {\tau }=(-\infty , -1.35, -0.90, \infty )'$ . The results are shown in Figure I3. We see that even with fewer responses per item, trajectory recovery remains fairly high. However, the model does seem to struggle more to capture peaked curves, such as that seen for person 85.

Figure I3 A comparison of simulated trajectories (black curves) to the estimated trajectories (as determined by the median of the posterior distribution, in blue). The shaded region represents 95% CI bounds. This simulation allowed for only three ordinal responses per item.

Next, we considered how adding more ordinal responses per item would change recovery. Here, we used seven ordinal responses per item with $\boldsymbol {\tau }=(-4.30, -2.05, -1.30, -0.83, 1.87, 5.34)'$ . These results are shown in Figure I4. Not surprisingly, we see that the trajectory recovery seems to be higher when there are more response options per item. This is likely because adding more responses makes the ordinal scale behave more similarly to a continuous scale defined by a GP.

Figure I4 A comparison of simulated trajectories (black curves) to the estimated trajectories (as determined by the median of the posterior distribution, in blue). The shaded region represents 95% CI bounds. This simulation allowed for seven ordinal responses per item.

I.4 Number of items

Finally, we studied how the number of items might impact recovery. To do this, we removed items 3 and 4 and used $\boldsymbol {a}=(1.11, 0.89)'$ and $\boldsymbol {\delta }=(-0.85, 0.85)'$ . Figure J1 contains the results of this study. With only two items, the model may struggle with some shapes (person 186), but does well in other scenarios. Overall, having fewer items does not appear to substantially inhibit model performance.

Figure J1 A comparison of simulated trajectories (black curves) to the estimated trajectories (as determined by the median of the posterior distribution, in blue). The shaded region represents 95% CI bounds. This simulation contained only two items.

To further understand how the number of items impacts recovery, we also performed the simulation with two items added to the original four, for a total of six items. We used $\boldsymbol {a}=(0.25, 0.17, 4.54, 0.84, 0.08, 0.13)'$ and $\boldsymbol {\delta }=(-2.25, -0.58, 1.07, 1.70, 0.00, 0.07)'$ . Figure J2 shows that the additional items may have tightened the 95% CI bounds. Because these bounds are narrower, there may be more areas along the true trajectories that lie outside them.

Figure J2 A comparison of simulated trajectories (black curves) to the estimated trajectories (as determined by the median of the posterior distribution, in blue). The shaded region represents 95% CI bounds. This simulation contained six items.

J College experience study data preparation

Students entered the study during their freshman year in 2017 or 2018. We used the Dartmouth academic calendar to establish the beginning and end of terms. For students who began in 2017, we included all observations on or before June 5, 2018, and for those who began in 2018, we included all observations on or before June 4, 2019. These dates correspond to the end of finals for the Spring term in the associated academic year. For the 2017 academic year, we defined day 0 to begin on September 7th, while for the 2018 academic year, day 0 began on September 13th. These dates correspond to the first observations in the study for each year and approximate the beginning of the term for the students. Day 0 was defined the same for each cohort of students, regardless of when individual students entered the study. We then numbered each day from the start of the study, ending with day 271. For input into the GP kernel, we centered and scaled these days to obtain values between $-$ 1.73 and 1.73.
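The day-numbering and standardization step can be sketched directly; the fragment below assumes the population standard deviation is used for scaling, which reproduces the reported range of roughly $-$1.73 to 1.73.

```python
import numpy as np

days = np.arange(272)                    # day 0 through day 271
t = (days - days.mean()) / days.std()    # center and scale (population SD)
# t now runs from about -1.73 to 1.73, matching the value S = 1.73 used as
# the boundary of the standardized input domain in the Hilbert approximation.
```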

While we tried to align days across the freshman year, we note that there is some mismatch between the dates of the two cohorts, one beginning in 2017 and the other in 2018. Because term start and end dates, holidays, midterms, and so on vary from year to year, there is not an exact match between the years. Additionally, depending on students’ class schedules and extracurricular activities, their arrival on campus and the timing of midterms and finals may differ slightly. However, we are still able to capture general trends in self-esteem across years, even with some slight mismatch in days.

K College experience study Hilbert space approximation tuning

Table K1 Hilbert space GP approximation tuning and diagnostics

Note: DF refers to the degrees of freedom associated with the B-splines while Iter. refers to the iteration number for tuning the Hilbert approximation with splines of the indicated number of degrees of freedom. We also present other metrics associated with each iteration, including m, c, the value of $\ell _{min}$ associated with m and c, the estimated value $\hat {\ell }$ , accuracy, and mean absolute deviation for each item. Bolded lines indicate the results under the optimal values of m and c for each specific degree of freedom as determined by our tuning algorithm. Recall that in the Hilbert approximation, $L=cS$ , where in the College Experience study, $S=1.73$ .

L Stan code

Footnotes

This manuscript is part of the special section, Integrating and Analyzing Complex High-Dimensional Data in Social and Behavioral Sciences Research. We thank Drs. Eric F. Lock and Katrijn Van Deun for serving as Co-Guest Editors.


References

Akour, M., & AL-Omari, H. (2024). Empirical investigation of the stability of IRT item-parameters estimation. International Online Journal of Educational Sciences, 5(2), 291–301.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561–573. https://doi.org/10.1007/bf02293814
Chen, Y., & Zhang, S. (2020). A latent Gaussian process model for analysing intensive longitudinal data. British Journal of Mathematical and Statistical Psychology, 73(2), 237–260. https://doi.org/10.1111/bmsp.12180
Choi, I.-H. (2024). The impact of measurement noninvariance across time and group in longitudinal item response modeling. Asia Pacific Education Review, 25(4), 911–924. https://doi.org/10.1007/s12564-024-09985-y
Cole, K. L., Turner, R. C., & Gitchel, W. D. (2019). A study of polytomous IRT methods and item wording directionality effects on perceived stress items. Personality and Individual Differences, 147, 63–72. https://doi.org/10.1016/j.paid.2019.03.046
De Boor, C. (1978). A practical guide to splines (Vol. 27). Springer. https://doi.org/10.1007/978-1-4612-6333-3
DeYoreo, M., & Kottas, A. (2018). Bayesian nonparametric modeling for multivariate ordinal regression. Journal of Computational and Graphical Statistics, 27(1), 71–84. https://doi.org/10.1080/10618600.2017.1316280
Eilers, P. H., & Marx, B. D. (2010). Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2(6), 637–653. https://doi.org/10.1002/wics.125
Fox, J.-P. (2010). Bayesian item response modeling: Theory and applications. Springer. https://doi.org/10.1007/978-1-4419-0742-4
Gelfand, A. E. (2000). Gibbs sampling. Journal of the American Statistical Association, 95(452), 1300–1304. https://doi.org/10.1080/01621459.2000.10474335
Gu, M., Wang, X., & Berger, J. O. (2018). Robust Gaussian stochastic process emulation. The Annals of Statistics, 46(6A), 3038–3066. https://doi.org/10.1214/17-AOS1648
Hays, R. D., Morales, L. S., & Reise, S. P. (2000). Item response theory and health outcomes measurement in the 21st century. Medical Care, 38(9), II-28–II-42. https://doi.org/10.1097/00005650-200009002-00007
Heatherton, T. F., & Polivy, J. (1991). Development and validation of a scale for measuring state self-esteem. Journal of Personality and Social Psychology, 60(6), 895–910. https://doi.org/10.1037/0022-3514.60.6.895
Hoffman, M. D., & Gelman, A. (2014). The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning with applications in R. Springer. https://doi.org/10.1007/978-1-4614-7138-7
Kang, J., & Kottas, A. (2024). Flexible Bayesian modeling for longitudinal binary and ordinal responses. Statistics and Computing, 34(6), 206. https://doi.org/10.1007/s11222-024-10525-2
Kim, K., Lee, D., & Essa, I. (2011). Gaussian process regression flow for analysis of motion trajectories. In 2011 International Conference on Computer Vision (pp. 1164–1171). https://doi.org/10.1109/ICCV.2011.6126365
Komori, M., Teraji, T., Shiroshita, K., & Nittono, H. (2022). Examination of morphological traits of children’s faces related to perceptions of cuteness using Gaussian process ordinal regression. Frontiers in Psychology, 13. https://doi.org/10.3389/fpsyg.2022.979341
Li, D., Wang, X., & Dey, D. (2019). Power link functions in an ordinal regression model with Gaussian process priors. Environmetrics, 30(6), e2564. https://doi.org/10.1002/env.2564
Linton, K. E., & Richard, G. M. (1996). Self-esteem in adolescents: Validation of the state self-esteem scale. Personality and Individual Differences, 21(1), 85–90. https://doi.org/10.1016/0191-8869(96)83741-x
Liu, L. C., & Hedeker, D. (2006). A mixed-effects regression model for longitudinal multivariate ordinal data. Biometrics, 62(1), 261–268. https://doi.org/10.1111/j.1541-0420.2005.00408.x
Liu, Y., Wang, F., & Kong, A. W. K. (2019). Probabilistic deep ordinal regression based on Gaussian processes. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 5300–5308). https://doi.org/10.1109/iccv.2019.00540
Liu, Y., & Wang, X. (2020). Bayesian nonparametric monotone regression of dynamic latent traits in item response theory models. Journal of Educational and Behavioral Statistics, 45(3), 274–296. https://doi.org/10.3102/1076998619887913
Löwe, B., Wahl, I., Rose, M., Spitzer, C., Glaesmer, H., Wingenfeld, K., Schneider, A., & Brähler, E. (2010). A 4-item measure of depression and anxiety: Validation and standardization of the patient health questionnaire-4 (PHQ-4) in the general population. Journal of Affective Disorders, 122(1–2), 86–95. https://doi.org/10.1016/j.jad.2009.06.019
Lubbe, D., & Schuster, C. (2019). A graded response model framework for questionnaires with uniform response formats. Applied Psychological Measurement, 43(4), 290–302. https://doi.org/10.1177/0146621618789394
Lyche, T., Manni, C., & Speleers, H. (2018). Foundations of spline theory: B-splines, spline approximation, and hierarchical refinement. In Splines and PDEs: From approximation theory to numerical linear algebra: Cetraro, Italy 2017 (pp. 1–76). Springer. https://doi.org/10.1007/978-3-319-94911-6_1
McCullagh, P. (1980). Regression models for ordinal data. Journal of the Royal Statistical Society: Series B (Methodological), 42(2), 109–127. https://doi.org/10.1111/j.2517-6161.1980.tb01109.x
Nepal, S., Liu, W., Pillai, A., Wang, W., Vojdanovski, V., Huckins, J. F., Rogers, C., Meyer, M. L., & Campbell, A. T. (2024). Capturing the college experience: A four-year mobile sensing study of mental health, resilience and behavior of college students during the pandemic. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 8(1), 1–37. https://doi.org/10.1145/3643501
Pardes, A., Lynch, W., Miclette, M., McGeoch, E., & Daly, B. P. (2022). Use of a mobile health (mHealth) platform for remote assessment of suicidal ideation, depression, and anxiety: A longitudinal retrospective study. Innovations in Digital Health, Diagnostics, and Biomarkers, 2(2022), 8–15. https://doi.org/10.36401/iddb-21-03
Pollak, J. (2012). The photographic affect meter: A novel application to measure momentary emotional states [Doctoral dissertation]. Cornell University. https://ecommons.cornell.edu/server/api/core/bitstreams/4271938c-c677-4518-b3db-01e1f8514b78/content
Proust-Lima, C., Philipps, V., Perrot, B., Blanchin, M., & Sébille, V. (2022). Modeling repeated self-reported outcome data: A continuous-time longitudinal item response theory model. Methods, 204, 386–395. https://doi.org/10.1016/j.ymeth.2022.01.005
R Core Team. (2022). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. https://www.R-project.org/
Reeve, B. B., & Fayers, P. (2005). Applying item response theory modeling for evaluating questionnaire item and scale properties. In Assessing quality of life in clinical trials: Methods of practice (Vol. 2, pp. 55–73). https://doi.org/10.1093/oso/9780198527695.003.0005
Rice, J., & Rosenblatt, M. (1983). Smoothing splines: Regression, derivatives and deconvolution. The Annals of Statistics, 11(1), 141–156. https://doi.org/10.1214/aos/1176346065
Riutort-Mayol, G., Bürkner, P.-C., Andersen, M. R., Solin, A., & Vehtari, A. (2023). Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming. Statistics and Computing, 33(1), 17. https://doi.org/10.1007/s11222-022-10167-2
Samejima, F. (1968). Estimation of latent ability using a response pattern of graded scores. Psychometrika, 34(S1), 1–97. https://doi.org/10.1007/bf03372160
Shi, J., Wang, B., Murray-Smith, R., & Titterington, D. (2007). Gaussian process functional regression modeling for batch data. Biometrics, 63(3), 714–723. https://doi.org/10.1111/j.1541-0420.2007.00758.x
Shi, J. Q., & Choi, T. (2011). Gaussian process regression analysis for functional data. Chapman and Hall/CRC. https://doi.org/10.1201/b11038
Solin, A., & Särkkä, S. (2020). Hilbert space methods for reduced-rank Gaussian process regression. Statistics and Computing, 30(2), 419–446. https://doi.org/10.1007/s11222-019-09886-w
Stan Development Team. (2024). RStan: The R interface to Stan (R package version 2.32.6). https://mc-stan.org/
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P.-C. (2021). Rank-normalization, folding, and localization: An improved R-hat for assessing convergence of MCMC (with discussion). Bayesian Analysis, 16(2), 667–718. https://doi.org/10.1214/20-BA1221
Wang, C., & Nydick, S. W. (2020). On longitudinal item response theory models: A didactic. Journal of Educational and Behavioral Statistics, 45(3), 339–368. https://doi.org/10.3102/1076998619882026
Wang, J.-L., Chiou, J.-M., & Müller, H.-G. (2016). Functional data analysis. Annual Review of Statistics and Its Application, 3(1), 257–295. https://doi.org/10.1146/annurev-statistics-041715-033624
Wang, R., Chen, F., Chen, Z., Li, T., Harari, G., Tignor, S., Zhou, X., Ben-Zeev, D., & Campbell, A. T. (2017). StudentLife: Using smartphones to assess mental health and academic performance of college students. In Mobile health: Sensors, analytic methods, and applications (pp. 7–33). https://doi.org/10.1007/978-3-319-51394-2_2
Wang, Y. (2011). Smoothing splines: Methods and applications. CRC Press. https://doi.org/10.1201/b10954
Webster, G. D., Howell, J. L., & Shepperd, J. A. (2022). Self-esteem in 60 seconds: The six-item state self-esteem scale (SSES-6). Assessment, 29(2), 152–168. https://doi.org/10.31234/osf.io/emfwn
Wu, H., & Leung, S.-O. (2017). Can Likert scales be treated as interval scales?—A simulation study. Journal of Social Service Research, 43(4), 527–532. https://doi.org/10.1080/01488376.2017.1329775
Yang, G., Zhang, B., & Zhang, M. (2023). Estimation of knots in linear spline models. Journal of the American Statistical Association, 118(541), 639–650. https://doi.org/10.1080/01621459.2021.1947307
Figure 1 Average SSES responses by day and question over the course of the students’ freshman year.

Figure 2 SSES responses by day, item, and student over the course of the students’ freshman year.

Table 1 Simulation diagnostics based on 100 simulated data sets for each parameter

Figure 3 A comparison of simulated trajectories (black curves) to the estimated trajectories (as determined by the median of the posterior distribution, in blue). The shaded region represents 95% CI bounds. The data were designed to match the College Experience study. Thus, the black curves may be shorter than the blue curves to mimic late entry or early exit in the study. The full estimated curve is shown in blue since we want to make inference on the entire freshman year.

Table 2 Accuracy and mean absolute deviation metrics for items 1–4 in simulated data comparing a model with no GP trajectories (Baseline) to the GPMHT

Figure 4 A comparison of simulated trajectories (black curves) to the estimated trajectories for both the baseline model (red) and GPMHT (blue), as determined by the median of the posterior distribution. The shaded regions represent 95% CI bounds. This simulation was designed to emulate the College Experience study.

Table 3 The optimal values of m and L, where $L=cS$ for each B-spline scenario

Table 4 Cross-validation accuracy and mean absolute deviation for each B-spline degree of freedom considered

Table 5 GPMHT parameter estimates, 95% CI bounds, $\hat {R}$ values, and ESS for the College Experience study motivating example

Figure 5 Average self-esteem trajectory for students in the College Experience study. Blue and red vertical lines indicate approximate term start and end days, respectively.

Figure 6 Estimated deviation from the average self-esteem trajectory by subject, along with 95% credible intervals.

Table 6 Within-sample (WS) and cross-validation (CV) posterior predictive accuracy and mean absolute deviation for the comparison baseline model without GP trajectories

Figure 7 A comparison of the deviation from the average trajectory of self-esteem over the course of the freshman year. The shaded areas correspond to the appropriate 95% credible intervals.
