
Thought flow nets: From single predictions to trains of model thought

Published online by Cambridge University Press:  06 September 2024

Hendrik Schuff*
Affiliation:
Ubiquitous Knowledge Processing Lab, Technical University of Darmstadt, Darmstadt, Germany
Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, Stuttgart, Germany
Heike Adel
Affiliation:
Hochschule der Medien, Stuttgart, Germany
Ngoc Thang Vu
Affiliation:
Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, Stuttgart, Germany
Corresponding author: Hendrik Schuff; Email: hendrik.schuff@tu-darmstadt.de

Abstract

When humans solve complex problems, they typically construct, reflect on, and revise sequences of ideas, hypotheses, and beliefs until a final decision or conclusion is reached. In contrast, current machine learning models are mostly trained to map an input to a single, fixed output. In this paper, we investigate how we can equip models with the ability to represent, construct, and evaluate a second, third, and $k$-th thought within their prediction process. Drawing inspiration from Hegel’s dialectics, we propose and evaluate the thought flow concept, which constructs a sequence of predictions. We present a self-correction mechanism that (a) is trained to estimate the model’s correctness and (b) performs iterative prediction updates based on the gradient of the correctness prediction. We introduce our method focusing initially on question answering (QA) and carry out extensive experiments which demonstrate that (i) our method is able to correct its own predictions and (ii) it can improve model performance by a large margin. In addition, we conduct a qualitative analysis of thought flow correction patterns and explore how thought flow predictions affect human-AI collaboration in a crowdsourcing study. We find that (iii) thought flows improve user performance and are perceived as more natural, correct, and intelligent than single and/or top-3 predictions.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Figure 1. In contrast to the vanilla approach of mapping an input to an output in a single step (grey box), we propose a method that allows models to sequentially “reconsider” and update their predictions through a “thought flow” extension. In this (real) QA example, the orange box marks our thought flow extension, which corrects a flawed answer in two steps and ultimately returns the correct ground-truth answer (marked in bold).

Table 1. Overview of the concepts from Hegel’s dialectics which we draw inspiration from (left), their main characteristics (middle), and their corresponding elements in our proposed thought flow method (right)

Figure 2. The steps of our prediction update scheme. The example shows the second answer change from Fig. 1. $\boldsymbol{x}$ refers to the model input and represents the question and its given textual context, “enc.” denotes an encoder function (e.g., a Transformer model), and $f_{\text{pred}}$ maps the encoding $\phi (\boldsymbol{x})$ to logits that correspond to probability distributions over the start and end positions of the predicted answer. In addition to this standard model architecture, we propose a function $f_{\text{corr}}$ that is trained to predict an estimate of a correctness score $s$ (e.g., the $\text{F}_1$ score) given $\phi (\boldsymbol{x})$ and the probability distributions predicted by $f_{\text{pred}}$.
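For concreteness, a correctness estimator of the kind this caption describes can be sketched as a small feed-forward network over the encoding and the two answer distributions. The following is an illustrative sketch only, not the authors’ implementation: the class name `CorrectnessEstimator`, the layer sizes, and the MLP architecture are all assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

class CorrectnessEstimator:
    """Toy stand-in for f_corr: maps the encoding phi(x) together with the
    start/end answer distributions to a self-estimated correctness score
    s in [0, 1] (e.g., an F1 estimate). Sizes and layers are illustrative."""

    def __init__(self, enc_dim, seq_len, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = enc_dim + 2 * seq_len  # phi(x) + start dist. + end dist.
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (hidden, 1))
        self.b2 = np.zeros(1)

    def __call__(self, phi_x, z_start, z_end):
        # Concatenate the encoding with the two probability distributions.
        x = np.concatenate([phi_x, softmax(z_start), softmax(z_end)])
        h = np.maximum(x @ self.W1 + self.b1, 0.0)          # ReLU layer
        s = 1.0 / (1.0 + np.exp(-(h @ self.W2 + self.b2)))  # sigmoid -> [0, 1]
        return s.item()
```

A network of this form could be trained with a regression loss (e.g., mean squared error) against the observed $\text{F}_1$ score of each training prediction.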

Figure 3. Simplified visualization of the modification step (during inference) shown in Fig. 2, depicted within a cut through the logit space. The axes correspond to the elements of ${\hat{\boldsymbol{z}}_{\text{start}}}$ that represent the index positions of the words “feather” and “their” as shown in the example in Fig. 1. The dotted isolines correspond to the self-estimated correctness scores obtained from $f_{\text{corr}}$. Before the modification, the system’s prediction corresponding to $\hat{\boldsymbol{z}}^{(\textbf{1})}$ would be an answer starting at “feather.” The gradient $\nabla$ points in the direction of improved self-estimated correctness of $\hat{\boldsymbol{z}}^{(\textbf{1})}$. After a gradient step in the direction of $\nabla$, the change in logits towards ${\hat{\boldsymbol{z}}^{(2)}}$ shifts the answer start position. The modification behavior emerges solely from the logit modifications and can lead to complex modification patterns (see Section 3.3).
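The gradient step this caption describes can be illustrated with a toy example. In the sketch below, a numerical gradient stands in for the analytic gradient of the correctness estimator, $\delta$ acts as a plain step size rather than the paper’s gradient scaling target, and the surrogate correctness function is purely hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def thought_flow_step(f_corr, z, delta=1.0, eps=1e-4):
    """One prediction update: move the logits z in the direction of the
    gradient of the self-estimated correctness f_corr(z). The gradient is
    approximated by central finite differences for illustration only."""
    grad = np.zeros_like(z)
    for i in range(z.size):
        z_plus, z_minus = z.copy(), z.copy()
        z_plus[i] += eps
        z_minus[i] -= eps
        grad[i] = (f_corr(z_plus) - f_corr(z_minus)) / (2 * eps)
    return z + delta * grad  # gradient-ascent step scaled by delta

# Hypothetical "correctness": highest when class 2 gets all the mass.
f_corr = lambda z: softmax(z)[2]

z = np.array([2.0, 0.0, 0.0, 0.0])   # initial logits favor class 0
z_new = thought_flow_step(f_corr, z, delta=50.0)
```

With a sufficiently large step, a single update already moves the argmax to a different class, mirroring how one step of the flow can change the system’s decision while the behavior emerges solely from the logit modification.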

Figure 4. Thought flows with different gradient scaling targets $\delta$, averaged over three seeds of a QA model. Higher values of $\delta$ correspond to more aggressive decision changes. Without a stopping oracle that stops when the thought flow no longer improves an answer (top left), only $\delta =0.1$ provides consistently stable but very small $\text{F}_1$ improvements. With an oracle (top right), higher values of $\delta$ reach larger and faster $\text{F}_1$ improvements of up to $\gt$9%. Nearly all performance gains are achieved by the first decision change (bottom left). A detailed analysis of flows using $\delta =10^4$ shows that the observed $\text{F}_1$ improvements are the result of slight decreases and stronger increases in both precision and recall (bottom right). The y-axes of plots (a), (b), and (c) use a symlog scale. Improvements are reported as absolute $\text{F}_1$ score differences to the base model performance of 63.5% $\text{F}_1$.

Table 2. Emergent thought flow modification patterns identified in 150 randomly sampled thought flows using $\delta =1$. The correct answer is marked in bold, and the predicted answer per flow step is marked in orange

Table 3. Multi-step modification examples ($\delta =1$). The correct answer is marked in bold, the predicted answer per flow step is marked in orange

Table 4. Examples of long thought flows. The table shows two questions for which the resulting thought flows contain 45 prediction changes and 12 decision changes, respectively. In contrast to previous tables, the answer contexts are omitted and only the predicted answer spans are displayed. Different shades of orange background reflect repeated answer spans and highlight, for example, a 2-cycle in the first example

Table 5. Examples of successful (incorrect $\rightarrow$ correct) and unsuccessful cross-sentence modifications. The correct answer is marked in bold, and the predicted answer per flow step is marked in orange

Table 6. Examples of successful (incorrect $\rightarrow$ correct) and unsuccessful span extension modifications. The correct answer is marked in bold, and the predicted answer per flow step is marked in orange

Figure 5. User study interface showing the TF condition (ours).

Table 7. Statistical results of our human evaluation ($N=55$). “$^{*}$” marks dependent variables on which a significant effect of the system condition was observed (Friedman tests and LRT tests for GLMM/CLMM). Pairwise differences between conditions (Holm-adjusted Tukey/Conover tests) are reported as compact letter display codings. For example, the “human-like” column shows that the post hoc test detected a significant difference between single and TF but no significant difference between any other pair. Similarly, the last column shows pairwise differences between all conditions; the TF condition reaches significantly higher human answer $\text{F}_1$-scores than any other condition. Variables for which TF is among the best-performing systems are marked in cyan; variables for which it is found to be the sole superior system are marked in green

Table 8. Detailed $p$ values for all main effects and pairwise comparisons shown in Table 7. Significant $p$ values are marked in bold. Cell colors follow the color coding used in Table 7

Figure 6. Example thought flows on CIFAR-100 instances. The black rectangle shows the initial class probabilities from the base model (step 0), that is, the unmodified prediction, from a bird’s-eye perspective. The corresponding predicted label is marked in italics. To the right of the black rectangle, the thought flow is depicted. The white lines mark the maximum probability across classes at each step. The ground-truth label is marked with a gray box. For readability, we only show classes that reach a probability of at least 1% within the thought flow.

Figure B1. Fraction of instances for which a thought flow with different decision step thresholds results in increased or decreased values of precision and recall.

Table B1. Additional examples of the modification patterns presented in Table 2. The correct answer is marked in bold, and the predicted answer per flow step is marked in orange

Figure C1. User study interface showing the TF condition (ours).

Figure C2. User study interface showing the top-3 condition.

Figure C3. User study interface showing the single condition.

Figure C4. User study interface showing an attention check.

Figure D1. Example thought flows from different models on CIFAR, demonstrating the diverse range of correction dynamics. A detailed description of the plots is provided in Fig. 6.

Figure D2. Further example thought flows from different models on CIFAR, demonstrating the diverse range of correction dynamics. A detailed description of the plots is provided in Fig. 6.