Disrupted state transition learning as a computational marker of compulsivity

Background Disorders involving compulsivity, fear, and anxiety are linked to beliefs that the world is less predictable. We lack a mechanistic explanation for how such beliefs arise. Here, we test the hypothesis that in people with compulsivity, fear, and anxiety, learning a probabilistic mapping between actions and environmental states is compromised. Methods In Study 1 (n = 174), we designed a novel online task that isolated state transition learning from other facets of learning and planning. To determine whether this impairment is due to learning that is too fast or too slow, we estimated state transition learning rates by fitting computational models to two independent datasets, which tested learning in environments in which state transitions were either stable (Study 2: n = 1413) or changing (Study 3: n = 192). Results Study 1 established that individuals with higher levels of compulsivity are more likely to demonstrate an impairment in state transition learning. Preliminary evidence here linked this impairment to a common factor comprising compulsivity and fear. Studies 2 and 3 showed that compulsivity is associated with learning that is too fast when it should be slow (i.e., when state transitions are stable) and too slow when it should be fast (i.e., when state transitions change). Conclusions Together, these findings indicate that compulsivity is associated with a dysregulation of state transition learning, wherein the rate of learning is not well adapted to the task environment. Thus, dysregulated state transition learning might provide a key target for therapeutic intervention in compulsivity.

Figure S2. Interaction between condition effect and psychopathology. In each plot, the terms "positive" or "threat" denote whether a positive or negative image was interposed between state transitions, and the terms "common" or "rare" denote the probability of the state transition (common = 50%, rare = 30%; the remaining state transition was of non-interest because both actions led to the final state 20% of the time). No interaction approached a significant effect, as indicated by the ROPE lying near the mode of each interaction effect. We additionally tested whether differences between conditions (e.g., common positive − rare positive) depended on psychopathology group; all of these were nonsignificant.

Figure S3. We conducted a step-wise model comparison, where each step is denoted by the box next to each subplot. Each subplot shows the iBIC difference of each tested model compared to the best-fitting model (the one with the lowest iBIC).

S3. Logic of step-wise model testing and full model descriptions
We fit the models to actual data from Gillan et al. (2016) in a step-wise fashion. We will first describe the logic behind each step. We then list in full detail the model formalisms.
We first compared two standard learning architectures from the literature, which differ in how first-stage actions are learnt. One model combines two prediction errors concerning first-stage actions (after reaching stage 2 and after receiving a reward) into a single action value (the 1AV model), where the second prediction error is weighted by an eligibility trace. The second encodes two separate first-stage action values, one for each type of prediction error, with two separate parameters controlling the influence of these values over first-stage actions (Step 1); we call this the "typical no decay" model, as it was used in Gillan et al. (2016). At the conclusion of this step, it was apparent that keeping two separate first-stage action values (i.e., the typical model) provided the best fit to the data.
We thus tested whether the 'typical no decay' learning architecture would be enhanced by the inclusion of incremental state transition learning with individual learning rates (Step 2). For this purpose, we compared the 'typical no decay' model, which learns transitions via a counting rule, to a model that learns transitions from state prediction errors multiplied by a subject-specific transition learning rate, the Incremental State Transition Learning (ISTL) model. The results showed that subjects' behavior was best captured by a model with between-subject variation in the rate of incremental transition learning from experienced transitions.
In Step 3, we tested three alternative ways in which state transition learning might differ from the incremental state transition learning model: a model where the learning rate does not vary across subjects ('2AV + Fixed LR'), a model where transition learning is realised via Bayesian inference ('2AV + Bayes'), and a model where transitions for actions not taken are updated via counterfactual inference ('2AV + Counterfactual'). These alternative models did not explain the data as well as the model that allowed between-subject variation in learning rate (ISTL no decay).
We then determined whether transition learning rates decrease as information is accumulated, as expected from an optimal observer (Step 4). Thus, we compared the best-fitting model so far with models where learning rates decay over trials to 0 ('2AV + Dynamic LR') or to some fixed learning rate ('2AV + Dynamic LR + Intercept'). These dynamic LR models did not explain the data as well as a model that allowed between-subject variation in learning rate but kept learning rates constant within subjects (ISTL no decay).
Next, since the behavioural signatures characterizing average performance could be generated by a pure model-based learner, we tested whether model-free learning was at all necessary to explain the choice data (Step 5). For this purpose, we compared ISTL to an 'MB' learner that used only model-based inference (and perseveration) to determine first-stage actions. Again, this did not explain the data as well as the ISTL no decay model. Finally, we tested a set of models in which learned information decays over trials. As reported in Gillan et al. (2016), decay may allow the model to better explain participants' behavior. Thus, we first compared the ISTL model without decay to the original Gillan et al. (2016) model, which we refer to as the 'Typical' model; in this model, values of actions not chosen decay at a rate of 1 − learning rate. Second, we enhanced the Typical model by including the same incremental state transition learning process that is included in the ISTL no decay model, allowing state transitions to also decay at a rate of 1 − state transition learning rate ('ISTL + one decay'). Last, we tested variants of these two models wherein values of unchosen actions decayed at a rate that constituted a separate free parameter ('Typical + two decay' and 'ISTL', respectively). The ISTL model explained the data best of all models.

Model descriptions
July 16, 2021

Models with decay on unchosen actions

Gillan et al 2016 model = Typical model
The following model includes two updates in the model-free system for first-stage actions: namely a temporal difference update (known as TD(0)) and a Monte Carlo update. The Monte Carlo update reinforces first-stage action values according to the final reward only. Note that the Monte Carlo update is equivalent to λ = 1 in Daw (2011).
Allowing each of these possible model-free updates to influence first-stage action learning was carried out in Gillan et al. (2016). The first action value represents the prediction of which second-stage state one will arrive in, each of which has its own value depending on how rewarded it has been in the recent past. The second first-stage action value represents the prediction that the first-stage action will be rewarded after the second-stage choice. Separating these first-stage action values in turn removes the requirement for an eligibility trace. All models use the Bellman equation to derive model-based action values.
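The two model-free updates described above can be sketched as follows. This is a minimal illustration; the variable names, the shared learning rate alpha, and the absence of the inverse-temperature rescaling are simplifying assumptions, not the paper's exact parameterization.

```python
import numpy as np

def update_first_stage_values(q_mf0, q_mf1, q_mf2, a, s, a2, reward, alpha):
    """Sketch of the two model-free updates on first-stage action values.

    q_mf0[a] predicts the value at the second stage (TD(0) update);
    q_mf1[a] predicts the final reward (Monte Carlo update).
    """
    # TD(0): move q_mf0 toward the value of the second-stage action taken
    q_mf0[a] += alpha * (q_mf2[s, a2] - q_mf0[a])
    # Monte Carlo: move q_mf1 toward the final reward itself
    q_mf1[a] += alpha * (reward - q_mf1[a])
    return q_mf0, q_mf1
```

Keeping the two prediction errors in separate value stores, as in the typical model, is what removes the need for an eligibility trace to bridge the two stages.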

Variables
Below, t = time, s = state, a = action. At stage one, two images appeared, one of which could be selected with a given action. Each image was always chosen with the same action, which did not change across the task. Each action led to two possible states (determined by a transition matrix), wherein each state had two unique images. At this second stage, an image is again selected with a given action. Note that only an 's' subscript is used for second-stage action values, since one could transition to either state 2 or state 3. The selection of this image would lead to a monetary reward determined by a latent probability that drifted across the task (see Gillan et al., 2016).
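As a rough illustration of this task structure, the following sketch simulates a single trial. The 70/30 common/rare transition split, the drift standard deviation, and the 0.25–0.75 reward-probability bounds are illustrative assumptions typical of two-step tasks, not necessarily this task's exact values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rows: first-stage actions; columns: second-stage states 2 and 3.
P_TRANSITION = np.array([[0.7, 0.3],
                         [0.3, 0.7]])

def simulate_trial(reward_probs, a1, a2):
    """One trial: sample the second-stage state from the transition matrix,
    sample a Bernoulli reward, then let reward probabilities drift."""
    s = rng.choice(2, p=P_TRANSITION[a1])             # 0 -> state 2, 1 -> state 3
    reward = int(rng.random() < reward_probs[s, a2])  # probabilistic payoff
    # Latent reward probabilities drift via a bounded Gaussian random walk
    drifted = reward_probs + rng.normal(0.0, 0.025, reward_probs.shape)
    return s, reward, np.clip(drifted, 0.25, 0.75)
```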

NOTE
See Gillan et al. (2016) for how the inverse temperature parameters are rescaled by the learning rate, which improves parameter estimation. Actions taken are denoted 'a' and unchosen actions 'u'. Second-stage states arrived at are denoted 's', whereas states not arrived at are denoted 'x'.
T = transition matrix. M = one-hot vector indicating which first-stage action was taken on the last trial.
First-stage action values:
Q^{MF0}_{t,a} = first-stage action value predicting value at the second stage.
Q^{MF1}_{t,a} = first-stage action value predicting reward after the second stage.
Q^{MB}_{t,a} = model-based value of first-stage action a.
Second-stage action values:
Q^{MF2}_{t,s,a} = second-stage action value predicting reward.

Free Parameters
All α parameters below were drawn from Beta priors spanning the 0–1 range.
1 − α = decay rate on chosen and unchosen actions.
All β parameters below were drawn from Gamma priors spanning the positive real numbers.
β_st = strength of perseveration at the first stage. This multiplies the M vector, which encodes the previously enacted first-stage action.
β^{MF2} = inverse temperature for Q^{MF2} at the second stage.

Learning computations
• Updating the transition matrix: Each trial, a transition counter is updated. For example, if state 1, action 1 led to state 2 once, and on the next trial the same transition occurs, the entry T_counting(1, 2) would increment from 1 to 2. T can be one of two matrices at any given trial, determined by the T_counting matrix. When T_counting(1, 1) + T_counting(2, 2) > T_counting(1, 2) + T_counting(2, 1), then T1 is used. When the inequality is the converse, then T2 is used. If they are equal, then T3 is used.
• Updating chosen action-values: Model-based q-values are then computed via the Bellman equation:
Q^{MB}_{t,a} = Σ_s P(s | a) max_{a′} Q^{MF2}_{t,s,a′}
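The counting rule and matrix-selection logic above can be sketched as follows (a minimal illustration; the text's 1-based indices map to 0-based rows and columns here):

```python
import numpy as np

def select_transition_matrix(t_counting):
    """Sketch of the matrix-selection rule: t_counting[a, s] tallies how
    often first-stage action a led to second-stage state s. Returns 1, 2,
    or 3 for T1, T2, or T3 (the indifference case)."""
    diag = t_counting[0, 0] + t_counting[1, 1]
    off_diag = t_counting[0, 1] + t_counting[1, 0]
    if diag > off_diag:
        return 1   # evidence favors T1
    if diag < off_diag:
        return 2   # evidence favors T2
    return 3       # equal evidence: use T3
```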

Decision computations
First-stage action: Second-stage action:
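A minimal sketch of these decision computations, assuming a softmax over a weighted sum of the action values plus a perseveration bonus; the exact combination rule and the inverse-temperature rescaling used in the paper may differ.

```python
import numpy as np

def softmax(x):
    z = np.asarray(x, dtype=float)
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

def model_based_values(p_transition, q_mf2):
    """Bellman step: each first-stage action's MB value is the
    transition-probability-weighted best second-stage value."""
    return p_transition @ q_mf2.max(axis=1)

def first_stage_probs(q_mf0, q_mf1, q_mb, m, b_mf0, b_mf1, b_mb, b_st):
    """Hypothetical first-stage softmax over the three action values plus
    perseveration (b_st * M, where M is the last-action one-hot vector)."""
    return softmax(b_mf0 * q_mf0 + b_mf1 * q_mf1 + b_mb * q_mb + b_st * m)
```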

ISTL + one decay
Here we amend the Typical model used in Gillan et al. (2016) by instantiating state transition learning as an incremental process.
γ = learning rate for state transitions.
• Updating the transition matrix: Each trial, a transition estimate is updated with a learning rate, and the probabilities are then renormalized. For instance, if action 1 is taken and a transition to state 2 occurs:
P(s = 2 | a = 1)_{t+1} = P(s = 2 | a = 1)_t + γ(1 − P(s = 2 | a = 1)_t)
and
P(s = 3 | a = 1)_{t+1} = 1 − P(s = 2 | a = 1)_{t+1}
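A minimal sketch of this incremental update, assuming the delta-rule form implied by "updated with a learning rate" plus renormalization (array indexing is illustrative):

```python
import numpy as np

def update_transitions(p, a, s_observed, gamma):
    """Sketch of the incremental transition update: the experienced
    transition probability moves toward 1 with rate gamma, and the
    complementary transition for the same action is renormalized.
    p[a, s] holds P(state s | action a)."""
    p = p.copy()
    p[a, s_observed] += gamma * (1.0 - p[a, s_observed])   # state-prediction-error update
    p[a, 1 - s_observed] = 1.0 - p[a, s_observed]          # renormalize the row
    return p
```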

ISTL model
The final and winning ISTL model included the same incremental state transition learning process as in ISTL + one decay, with the addition that transitions for actions not taken decayed at the same rate toward the prior on state transitions (i.e., 0.5). Each trial, a transition estimate is updated with a learning rate and the probabilities are then renormalized, exactly as in ISTL + one decay. Second, the ISTL model includes a separate free decay parameter on all unchosen actions, the same as defined in the Typical + two decay model.
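The additional decay step on the unchosen action's transition estimates can be sketched as follows (an illustration under the same assumed indexing as above; the combination with the value-decay parameter is omitted):

```python
import numpy as np

def decay_unchosen_transitions(p, a_taken, gamma):
    """Sketch of the ISTL decay step: transition estimates for the
    not-taken first-stage action decay toward the 0.5 prior at the same
    rate gamma used for learning."""
    p = p.copy()
    u = 1 - a_taken               # the unchosen first-stage action
    p[u] += gamma * (0.5 - p[u])  # pull each estimate toward the prior
    return p
```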

Free Parameters
All α parameters below were drawn from Beta priors spanning the 0–1 range.
α1 = learning rate for both updates on first-stage action values.
α2 = learning rate for the update on second-stage action values.
All β parameters below were drawn from Gamma priors spanning the positive real numbers.
β^{MF0} = inverse temperature for Q^{MF0} at the first stage.
β^{MF2} = inverse temperature for Q^{MF2} at the second stage.

Learning computations
• Updating the transition matrix: Each trial, a transition counter is updated. For example, if state 1, action 1 led to state 2 once, and on the next trial the same transition occurs, the entry T_counting(1, 2) would increment from 1 to 2. T can be one of two matrices at any given trial, determined by the T_counting matrix. When T_counting(1, 1) + T_counting(2, 2) > T_counting(1, 2) + T_counting(2, 1), then T1 is used. When the inequality is the converse, then T2 is used. If they are equal, then T3 is used.
• Updating action-values: Model-based q-values are then computed via the Bellman equation:
Q^{MB}_{t,a} = Σ_s P(s | a) max_{a′} Q^{MF2}_{t,s,a′}

Decision computations
First-stage action: Second-stage action:

Free Parameters
All α parameters below were drawn from Beta priors spanning the 0–1 range.
α1 = learning rate for both updates on first-stage action values.
β^{MF2} = inverse temperature for Q^{MF2} at the second stage.

Learning computations
• Updating the transition matrix: Each trial, a transition estimate is updated with a learning rate, and the probabilities are then renormalized. For instance, if action 1 is taken and a transition to state 2 occurs:
P(s = 2 | a = 1)_{t+1} = P(s = 2 | a = 1)_t + γ(1 − P(s = 2 | a = 1)_t)
and
P(s = 3 | a = 1)_{t+1} = 1 − P(s = 2 | a = 1)_{t+1}
• Updating action values: Model-based q-values are then computed via the Bellman equation:
Q^{MB}_{t,a} = Σ_s P(s | a) max_{a′} Q^{MF2}_{t,s,a′}

Decision computations
First-stage action: Second-stage action:

ISTL + no decay + Counterfactual
Same as ISTL + no decay except that transitions for actions not taken are updated as if the not-taken action led to the state that was not experienced for the taken action. This counterfactual inference is predicated on the assumption (which was told to participants and experienced in practice) that the two actions cannot most often lead to the same state.
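A sketch of this counterfactual variant, under the same assumed indexing as the incremental update above:

```python
import numpy as np

def counterfactual_update(p, a_taken, s_observed, gamma):
    """Sketch of the counterfactual variant: the taken action is updated
    toward the observed state, and the not-taken action is updated as if
    it had led to the other state (the two actions are assumed not to
    share a most-frequent outcome)."""
    p = p.copy()
    # Experienced transition for the taken action
    p[a_taken, s_observed] += gamma * (1.0 - p[a_taken, s_observed])
    p[a_taken, 1 - s_observed] = 1.0 - p[a_taken, s_observed]
    # Counterfactual transition for the unchosen action
    u, x = 1 - a_taken, 1 - s_observed
    p[u, x] += gamma * (1.0 - p[u, x])
    p[u, 1 - x] = 1.0 - p[u, x]
    return p
```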

ISTL + no decay + Dynamic LR
Same as ISTL + no decay except that here γ decays to 0 on each trial according to an equation in which ϵ determines the starting learning rate and N_action is a tally of how many times a given action was taken.

Dynamic LR + Intercept
Same as ISTL + no decay + Dynamic LR except that here γ decays to a variable baseline learning rate, ω, where ϵ determines the time it takes to decay to the baseline learning rate ω, and N_action is a tally of how many times a given action was taken.

Fixed LR
Same as the 2AV + LR model, except that here γ is fixed across subjects.

Bayesian transition learning
T = transition matrix. M = one-hot vector indicating which first-stage action was taken on the last trial.

Fixed parameter
The Beta prior defining evidence in favor of transition matrix T1 was initialized with mode = 0.5.

Free Parameters
All α parameters below were drawn from Beta priors spanning the 0–1 range.
α1 = learning rate for both updates on first-stage action values.
α2 = learning rate for the update on second-stage action values.
κ = concentration of the prior over the belief in either possible transition matrix.
All β parameters below were drawn from Gamma priors spanning the positive real numbers.
β_st = strength of perseveration at the first stage. This multiplies the M vector, which encodes the previously enacted first-stage action.
β^{MF2} = inverse temperature for Q^{MF2} at the second stage.

Learning computations
• Updating the transition matrix: Note that we use a mode of 0.5 to define the prior belief in the correct transition matrix, and a free parameter to quantify the spread of the belief distribution, formally known as the concentration of the prior distribution.
The mode (fixed) and concentration (free) of the Beta distribution defining the prior belief in T1 and T2 were converted to E1 (evidence in favor of T1) and E2 (evidence in favor of T2) parameters describing the shape of the Beta distribution by the following equations:
E1 = mode × (κ − 2) + 1 and E2 = (1 − mode) × (κ − 2) + 1, with mode = 0.5.
The posterior of the Beta prior is updated analytically: E1 = E1 + 1 when common transitions predicted by T1 are experienced, and E2 = E2 + 1 when common transitions predicted by T2 are experienced.
Each time model-based action values are computed, evidence for each transition matrix is derived from the mean of the Beta distribution by:
p1 = E1 / (E1 + E2)
which represents the probability that T1 is the true transition matrix; p2 = 1 − p1. Then
Q^{MB}_{t+1} = (Bellman equation for T1)(p1) + (Bellman equation for T2)(p2).

• Updating action-values
Model-based q-values are then computed via the Bellman equation under each candidate transition matrix and mixed by the posterior belief, as described above.
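The Beta-evidence bookkeeping described above can be sketched as follows. The mode/concentration conversion shown is the standard Beta parameterization (valid for κ > 2) and is an assumption here; function names are illustrative.

```python
def beta_shape_from_mode(mode, kappa):
    """Convert a Beta distribution's mode and concentration into its two
    shape parameters (standard parameterization, assumed here)."""
    e1 = mode * (kappa - 2.0) + 1.0
    e2 = (1.0 - mode) * (kappa - 2.0) + 1.0
    return e1, e2

def update_belief(e1, e2, supports_t1):
    """Analytic posterior update: one pseudo-count is added to whichever
    evidence term the experienced common transition supports."""
    return (e1 + 1.0, e2) if supports_t1 else (e1, e2 + 1.0)

def p_t1(e1, e2):
    """Belief that T1 is the true matrix: the mean of the Beta posterior."""
    return e1 / (e1 + e2)
```

Model-based values can then be computed under each candidate matrix and mixed by `p_t1` and its complement.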

Decision computations
First-stage action: Second-stage action:

MB
Here, first-stage actions are only influenced by model-based planning and a perseveration parameter.
T = transition matrix. M = one-hot vector indicating which first-stage action was taken on the last trial.

Free Parameters
All α parameters below were drawn from Beta priors spanning the 0–1 range.
α1 = learning rate for both updates on first-stage action values.
β^{MF2} = inverse temperature for Q^{MF2} at the second stage.

Learning computations
• Updating the transition matrix: Each trial, a transition estimate is updated with a learning rate, and the probabilities are then renormalized. For instance, if action 1 is taken and a transition to state 2 occurs:
P(s = 2 | a = 1)_{t+1} = P(s = 2 | a = 1)_t + γ(1 − P(s = 2 | a = 1)_t)
and
P(s = 3 | a = 1)_{t+1} = 1 − P(s = 2 | a = 1)_{t+1}
• Updating action-values: Model-based q-values are then computed via the Bellman equation:
Q^{MB}_{t,a} = Σ_s P(s | a) max_{a′} Q^{MF2}_{t,s,a′}

Decision computations
First-stage action: Second-stage action:
Second-stage action values: Q^{MF2}_{t,s,a} = second-stage action value predicting reward.

Free Parameters
All parameters below were drawn from Beta priors spanning the 0–1 range.
• Updating the transition matrix: This is determined by the T_counting matrix. When T_counting(1, 1) + T_counting(2, 2) > T_counting(1, 2) + T_counting(2, 1), then T1 is used. When the inequality is the converse, then T2 is used. If they are equal, then T3 is used.
• Updating action-values

Figure S4: Parameter recovery. The figure comprises data simulated from the winning model (ISTL) and best-fit group hyperparameters for the subsample of subjects (MB > 2.5). The simulation contained 400 generated agents over 200 trials of the two-step task. The group hyperparameters were derived by fitting the model, and thus parameter recovery reflects the empirical range of parameter values. The plot comprises the full set of correlations between fitted and true parameters from the simulation and subsequent model fitting. The vertical axis of the heatmap (labels with "F") represents the fitted parameters, and the horizontal axis represents the ground-truth parameters that generated the data. The following are the group priors from which we generated the data; next to each parameter, parentheses denote the abbreviation used in the heatmap: