
Artificial fine-tuning tasks for yes/no question answering

Published online by Cambridge University Press:  30 June 2022

Dimitris Dimitriadis*
Affiliation:
Aristotle University of Thessaloniki, Thessaloniki, Greece
Grigorios Tsoumakas
Affiliation:
Aristotle University of Thessaloniki, Thessaloniki, Greece
*Corresponding author. E-mail: dndimitri@csd.auth.gr

Abstract

Current research in yes/no question answering (QA) focuses on transfer learning techniques and transformer-based models. Models trained on large corpora are fine-tuned on tasks similar to yes/no QA, and the captured knowledge is then transferred to the yes/no QA task itself. Most previous studies use existing similar tasks, such as natural language inference or extractive QA, for the fine-tuning step. This paper takes a different perspective, hypothesizing that an artificial yes/no task can transfer useful knowledge that improves the performance of yes/no QA. We introduce three such tasks for this purpose, by adapting three corresponding existing tasks: candidate answer validation, sentiment classification, and lexical simplification. Furthermore, we experiment with three different variants of the BERT model (BERT base, RoBERTa, and ALBERT). The results show that our hypothesis holds for all three artificial tasks, despite the small size of the corresponding datasets used for fine-tuning, the differences between these tasks, the decisions we made when adapting the original ones, and the tasks' simplicity. This offers an alternative perspective on the yes/no QA problem that is more creative and, at the same time, more flexible, as it can exploit multiple other existing tasks and their datasets to improve yes/no QA models.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2022. Published by Cambridge University Press

Figure 1. The adaptation of the answer validation task. An answer validator gets as input a question and an enriched passage, instead of a candidate answer, to predict whether the special tokens are in the correct position. The check mark (green) indicates the new input to the answer validator, the X mark (red) indicates the input from the original task that we removed, and the unmarked line (blue) indicates the part that stays the same before and after the task adaptation.
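As a concrete illustration of this adaptation, positive and negative instances could be built along the following lines. This is a minimal sketch; the `[S]`/`[E]` marker tokens and the `make_av_instance` helper are illustrative assumptions, not the paper's actual implementation:

```python
import random

START, END = "[S]", "[E]"  # hypothetical special tokens marking a candidate span

def make_av_instance(question, passage_tokens, answer_span, correct=True, rng=random):
    """Build a (question, enriched passage, label) triple.

    A positive instance wraps the true answer span with the special
    tokens; a negative one wraps a randomly chosen wrong span of the
    same length, so the validator must judge token placement.
    """
    start, end = answer_span
    if not correct:
        length = end - start
        # pick any other start position of the same span length
        wrong_starts = [i for i in range(len(passage_tokens) - length) if i != start]
        start = rng.choice(wrong_starts)
        end = start + length
    enriched = (passage_tokens[:start] + [START]
                + passage_tokens[start:end] + [END]
                + passage_tokens[end:])
    return question, " ".join(enriched), int(correct)
```

The yes/no label then simply states whether the markers enclose the answer, which turns span validation into a binary task with the same input shape as yes/no QA.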

Figure 2. The adaptation of the sentiment classification task. Instead of classifying a sentence as negative or positive, the model answers the question "are the two sentences both positive?" The check mark (green) indicates the new input and output of the sentiment classifier, the X mark (red) indicates the original output that we removed, and the unmarked line (blue) indicates the part that stays the same before and after the task adaptation.
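One simple way to realize this adaptation is to pair up sentences from an existing single-sentence sentiment dataset and label each pair yes only when both sentences are positive. The sketch below assumes binary labels (1 = positive); the `make_sc_pairs` helper is illustrative, not the paper's code:

```python
from itertools import combinations

def make_sc_pairs(labeled_sentences):
    """Turn single-sentence sentiment labels into yes/no instances.

    Each instance pairs two sentences under the implicit question
    "are the two sentences both positive?"; the answer is yes (1)
    only when both sentences are labelled positive.
    """
    pairs = []
    for (s1, y1), (s2, y2) in combinations(labeled_sentences, 2):
        pairs.append((s1, s2, int(y1 == 1 and y2 == 1)))
    return pairs
```

Because every unordered sentence pair yields one instance, even a small sentiment corpus produces a quadratically larger artificial yes/no dataset.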

Figure 3. The adaptation of the lexical simplification task. A classifier gets as input a pair of sentences and predicts whether one sentence is a simplified version of the other, instead of getting an original sentence and generating the simplified version. The check mark (green) indicates a new input/output of the classifier, the X mark (red) indicates an output of the original task, and the unmarked line (blue) indicates the part that stays the same before and after the task adaptation.
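Given a corpus of aligned (original, simplified) sentence pairs, this adaptation reduces to keeping aligned pairs as positives and mismatching pairs as negatives. A minimal sketch, with the `make_sv_instances` helper as an illustrative assumption:

```python
import random

def make_sv_instances(aligned_pairs, rng=random):
    """Build yes/no instances from (original, simplified) sentence pairs.

    A positive instance keeps an aligned pair; a negative instance
    swaps in the simplification of a different original, so the
    classifier must decide whether one sentence simplifies the other.
    """
    instances = []
    n = len(aligned_pairs)
    for i, (orig, simp) in enumerate(aligned_pairs):
        instances.append((orig, simp, 1))
        # sample the simplification of a different sentence as a negative
        j = rng.choice([k for k in range(n) if k != i])
        instances.append((orig, aligned_pairs[j][1], 0))
    return instances
```

Sampling one negative per positive keeps the artificial dataset balanced, which matters given the small dataset sizes noted in the abstract.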

Table 1. The datasets for each newly constructed task

Figure 4. The transitions for training the question answering classifier. Each transition indicates fine-tuning a model on a task.

Table 2. Results on the BoolQ development set with and without transfer learning. The results are the mean accuracy over 5 runs with different random seeds. Underlined scores indicate better performance than the baseline with the same model architecture. Bold indicates better performance than all the baseline models. The last column corresponds to the questions correctly answered by the corresponding model, transfer task, and transfer data

Table 3. Each confusion matrix compares the predictions of two models fine-tuned on a task (AV, SC, SV) on the BoolQ development set. A corresponds to the number of correct answers and $\lnot A$ to the number of wrong answers for a model
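Such a matrix can be computed by cross-tabulating whether each model answers a question correctly. A minimal sketch, with the `correctness_confusion` helper as an illustrative assumption:

```python
def correctness_confusion(preds_a, preds_b, gold):
    """Cross-tabulate two models by per-question correctness.

    Returns a 2x2 dict keyed by (model_a_correct, model_b_correct),
    mirroring the A / not-A layout of the paper's confusion matrices.
    """
    matrix = {(a, b): 0 for a in (True, False) for b in (True, False)}
    for pa, pb, y in zip(preds_a, preds_b, gold):
        matrix[(pa == y, pb == y)] += 1
    return matrix
```

The off-diagonal cells reveal complementarity: questions that one fine-tuned model gets right and the other gets wrong are exactly those a combination scheme could recover.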

Table 4. Results on the BoolQ development set when combining the three transfer tasks, either by adapting the RoBERTa model to all transfer tasks (all-in-one scheme) or by fitting the RoBERTa model separately to each transfer task and then combining the models with a voting scheme
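The voting scheme amounts to taking the majority answer over the three separately fine-tuned models for each question. A minimal sketch, assuming one list of yes/no predictions per model; the `majority_vote` helper is illustrative:

```python
from collections import Counter

def majority_vote(*model_predictions):
    """Combine per-question yes/no predictions from several models.

    With an odd number of models (three here), the majority answer
    is always well defined for a binary label.
    """
    combined = []
    for votes in zip(*model_predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined
```

An odd number of binary voters avoids ties, which is one reason combining exactly three transfer tasks is convenient.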

Table 5. (Un)expected outcomes of the RoBERTa model fine-tuned on the AV task. Examples from the BoolQ development set

Table 6. (Un)expected outcomes of the RoBERTa model fine-tuned on the SC task. Examples from the BoolQ development set

Table 7. (Un)expected outcomes of the RoBERTa model fine-tuned on the SV task. Examples from the BoolQ development set

Figure 5. The attention between the question and the correct passage (left side, (a),(b)) and between the question and the wrong passage (right side, (c),(d)). The thickness of the arrows indicates the magnitude of the attention scores, while the different colors correspond to the 12 attention heads of the BERT base model.

Figure 6. The attention of the [CLS] token for the pair (Q1,P) under the models TM (a) and M (b), and for the pair (Q2,P) (c),(d). Positive and negative values are colored blue and orange, respectively, with color saturation based on the magnitude of the value.

Figure 7. The attentions between the sentences of a positive instance (a),(b) and a negative one (c),(d).

Figure 8. The attention between the sentences of a negative instance after repositioning the word "worst".

Figure 9. The attention of the [CLS] token for M (a),(b) and TM (c),(d) for the pairs (Q,P) and (Q,*P).

Figure 10. The attention between the sentences of a correct instance (a) and two incorrect ones (b),(c). The model was trained on the simplification validation task.

Figure 11. The attention between the sentences of a correct instance (a) and two incorrect ones (b),(c). The model was trained on the QA task instead of the simplification validation task.

Figure 12. The attentions of TM (left side) and the attentions of M (right side).