
Supervised contrastive learning in few-shot soft prompt tuning

Published online by Cambridge University Press:  16 April 2026

Ali Edalat
Affiliation:
School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
Yadollah Yaghoobzadeh*
Affiliation:
School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
Corresponding author: Yadollah Yaghoobzadeh; Email: y.yaghoobzadeh@ut.ac.ir

Abstract

Soft prompt learning methods offer parameter-efficient tuning of pre-trained language models for few-shot scenarios. This study explores the integration of supervised contrastive learning (SCL) into two leading soft prompt tuning models: DifferentiAble pRompT (DART) and PTuning. By incorporating SCL as an auxiliary task, we observe consistent performance enhancements across 13 few-shot natural language understanding tasks, including benchmarks such as SST-2, TREC, MNLI, and real-world datasets such as Overruling, TC, and ADE. We also delve into SCL’s impact in label-imbalanced settings, introducing a novel approach called balanced batch in SCL (BBSCL). BBSCL employs balanced mini-batches, sampling the majority class proportionally to the minority class to stabilize SCL calculations. Our results indicate that SCL and BBSCL significantly boost the performance and robustness of soft prompt learning models, especially on datasets with intricate label spaces. Experimentally, DART + SCL and PTuning + SCL outperform their base models by an average of $2.1\%$ across the 13 tasks. Additionally, we find that SCL’s contribution is more substantial in scenarios with complex and less separable label spaces. Compared to large language models such as GPT-3.5 and OpenChat, our enhanced soft prompt learning models with SCL and BBSCL extensions exhibit superior performance in both balanced and imbalanced few-shot settings. This research not only improves the effectiveness of few-shot tuning techniques but also deepens our understanding of this area.
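The SCL objective added to DART and PTuning follows the standard supervised contrastive formulation, which pulls same-label representations together and pushes different-label ones apart within a mini-batch. As a rough, self-contained illustration (not the authors' exact implementation; the function name and the toy vectors are hypothetical), the per-batch loss can be sketched as:

```python
import math

def supcon_loss(embeddings, labels, tau=0.1):
    """Supervised contrastive loss for one mini-batch (illustrative sketch).

    embeddings: list of L2-normalized vectors (lists of floats)
    labels:     list of class labels, same length as embeddings
    tau:        temperature hyperparameter
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        # Positives: other samples in the batch with the same label.
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no positives contributes nothing
        denom = sum(math.exp(dot(embeddings[i], embeddings[a]) / tau)
                    for a in range(n) if a != i)
        loss_i = -sum(
            math.log(math.exp(dot(embeddings[i], embeddings[p]) / tau) / denom)
            for p in positives) / len(positives)
        total += loss_i
        anchors += 1
    return total / anchors
```

When used as an auxiliary task, this term would typically be combined with the model's classification loss as a weighted sum (e.g. total loss = class-discrimination loss + lambda * SCL loss, with lambda a tunable weight); note that the loss is small when same-label embeddings are already close and large when labels are interleaved in embedding space.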

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press

Figure 1. DART + SCL consists of two components: the DART model and the SCL module. The DART model optimizes two objectives: the “Class Discrimination Objective,” which learns to distinguish between different text classes, and the “Fluency Constraint Objective.” The SCL module uses the “Class Discrimination Objective” to encode text into a vector representation by averaging the hidden states of the [CLS] token and the label. The left side of this figure shows the PTuning + SCL model, which consists of two components: the PTuning model with the “Class Discrimination Objective” and the SCL module.


Figure 2. SCL loss calculation for an imbalanced batch with two positive and four negative samples.


Figure 3. BBSCL loss calculation for an imbalanced batch with two positive and four negative samples.


Table 1. Results on the ten few-shot datasets used in the DART paper, which are standard in the prompt tuning literature, and on three real-world datasets (ADE, TC, and Overruling) in few-shot settings. Mean $\displaystyle \pm$ std performances are reported. Models use RoBERTa-large. Adding SCL yields consistent improvements over the DART and PTuning models. We report the PTuning and DART results from their original papers to better ground our reproduced DART and PTuning results. Our GPU type differs from that of the DART paper. We focus on our reproduced results here to evaluate our SCL and BBSCL extensions fairly. Avg: average performance


Figure 4. Evaluation prompt for the SST2 dataset in the zero-shot setting.


Figure 5. Evaluation prompt for the CR dataset in the zero-shot setting.


Figure 6. Evaluation prompt for the SUBJ dataset in the zero-shot setting.


Figure 7. Complete evaluation prompt for the SST2 dataset in the few-shot setting. For visualization purposes, we use a 2-shot balanced setting and a batch inference size of 2 for the questionnaire texts.


Figure 8. Comparison of hard and soft prompting methods in balanced and imbalanced settings for the Overruling dataset. For hard prompting, we consider the PET method. For soft prompting, we consider DART and our proposed extension when the PLM is RoBERTa-large (the setting that showed the best performance in the previous experiments). Results are reported over different data folds, with the mean as the bar’s height and the standard deviation as the error bar. ($\displaystyle \rho$ is the imbalance ratio; for the imbalanced setting, we consider $\rho ^+=0.25$.)


Figure 9. Comparison of hard and soft prompting methods in balanced and imbalanced settings for the TC dataset. For hard prompting, we consider the PET method. For soft prompting, we consider DART and our proposed extension when the PLM is RoBERTa-large (the setting that showed the best performance in the previous experiments). Results are reported over different data folds, with the mean as the bar’s height and the standard deviation as the error bar. ($\displaystyle \rho$ is the imbalance ratio; for the imbalanced setting, we consider $\rho ^+=0.25$.)


Table 2. Performance in few-shot balanced and imbalanced settings. Models use BERT-large-uncased. ($\displaystyle \rho$ is the imbalance ratio; Avg: average performance; $\displaystyle \frac {pos}{all}=\rho ^+$; $\displaystyle \frac {neg}{all}=\rho ^-$)


Table 3. Performance in few-shot imbalanced settings. Models use RoBERTa-large. ($\displaystyle \rho$ is the imbalance ratio; $\displaystyle \frac {pos}{all}=\rho ^+$; $\displaystyle \frac {neg}{all}=\rho ^-$)


Figure 10. F1 performance of DART, DART + SCL, and DART + BBSCL (random) on the Overruling dataset w.r.t. four different imbalance ratios ($\displaystyle \rho ^+ \in \{0.1, 0.25, 0.36, 0.5\}$). $\displaystyle \rho ^+=0.5$ is the balanced setting, and lower $\displaystyle \rho ^+$ means higher label imbalance.


Table 4. Impact of SCL and BBSCL on the DART and PTuning models. Performances are in imbalanced settings (100 examples for the positive and 300 for the negative class), well beyond few-shot scale. Results are reported over different data folds. Mean $\displaystyle \pm$ std performances are reported. ($\displaystyle \rho$ is the imbalance ratio; $\displaystyle \frac {pos}{all}=\rho ^+$; $\displaystyle \frac {neg}{all}=\rho ^-$; Avg: average performance)


Table 5. F1 of DART, DART + SCL, and DART + BBSCL (random) on the Overruling dataset w.r.t. four different imbalance ratios ($\displaystyle \rho ^+ \in \{0.1, 0.25, 0.36, 0.5\}$). $\displaystyle \rho ^+=0.5$ is the balanced setting, and lower $\displaystyle \rho ^+$ means higher label imbalance


Figure 11. Test data distribution of Overruling dataset in 2D space. Test data are a sample of the entire data space. The vector representation of texts is prepared with the RoBERTa-large language model. We used the UMAP (McInnes, Healy, and Saul 2018) dimension reduction method to present this distribution in two-dimensional space. Classes are well-separated (low $R_D=0.44$), explaining smaller SCL gains here.


Figure 12. Test data distribution of TC dataset in 2D space. Test data are a sample of the entire data space. The vector representation of texts is prepared with the RoBERTa-large language model. We used the UMAP (McInnes, Healy, and Saul 2018) dimension reduction method to present this distribution in two-dimensional space. TC dataset shows moderate overlap ($R_D=0.84$), where SCL improves separability.


Figure 13. Test data distribution of ADE dataset in 2D space. Test data are a sample of the entire data space. The vector representation of texts is prepared with the RoBERTa-large language model. We used the UMAP (McInnes, Healy, and Saul 2018) dimension reduction method to present this distribution in two-dimensional space. ADE exhibits high class overlap ($R_D=0.85$), leading to significant SCL benefits.


Figure 14. Impact of batch size on BBSCL method in the imbalanced setting.


Table 6. Impact of SCL and BBSCL on text classification with fine-tuning frozen PLMs. Performances are in few-shot balanced (16 examples per class) and imbalanced settings (8 examples for the positive and 24 for the negative class). For PLMs, we use RoBERTa-large and BERT-large-uncased. Results are reported over 5 runs with different random seeds. Mean $ \displaystyle \pm$ std performances are reported. ($ \displaystyle \rho$ is the imbalance ratio; $\displaystyle \frac {pos}{all}=\rho ^+$; $\displaystyle \frac {neg}{all}=\rho ^-$; CE is cross entropy; Avg: average performance; in CE + SCL (oversample), we oversample to balance batches during training; in CE + BBSCL (random), we sample negative nodes from the majority class with replacement to construct label-balanced groups, so groups may share negative nodes; in CE + BBSCL (partition), we sample negative nodes from the majority class without replacement to construct label-balanced groups, so no nodes are shared between groups.)


Table 7. Impact of SCL and BBSCL on text classification with fine-tuning frozen PLMs. Performances are in imbalanced settings (64 examples for the positive and 192 for the negative class), well beyond few-shot scale. Results are reported over 5 runs with different random seeds. Mean $\displaystyle \pm$ std performances are reported. ($\displaystyle \rho$ is the imbalance ratio; $\displaystyle \frac {pos}{all}=\rho ^+$; $\displaystyle \frac {neg}{all}=\rho ^-$; CE is cross entropy; Avg: average performance)


Table 8. Comparison of LLMs and soft prompt learning models in balanced and imbalanced settings. Based on the results in Tables 1 to 6, and given the superior performance of models using RoBERTa-large as the pretrained language model (PLM) and DART as the soft prompt model in most cases, we exclusively employ RoBERTa-large in our experiments with the DART model. Results are reported based on 3 different data folds in the few-shot setting and 3 different run seeds in the zero-shot setting. We sample 200 data points from the test dataset for evaluation to manage the API costs of GPT-3.5. Mean $\displaystyle \pm$ std performances are reported. GPT-3.5 and OpenChat results are based on their APIs on 8 February. ($\displaystyle \rho$ is the imbalance ratio; $\displaystyle \frac {pos}{all}=\rho ^+$; $\displaystyle \frac {neg}{all}=\rho ^-$; Avg: average performance)


Algorithm 1: BBSCL – Multiclass (random sampling variant)
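Based on the descriptions above, the random sampling variant of BBSCL keeps every minority-class sample in each group and draws an equal number of majority-class samples with replacement, so the SCL loss is always computed over label-balanced groups. A minimal sketch of this grouping step, under our reading of Algorithm 1 (function and variable names are hypothetical, and the per-group loss computation is omitted):

```python
import random

def balanced_groups_random(minority_idx, majority_idx, seed=0):
    """Split an imbalanced batch into label-balanced groups (random variant).

    Each group contains all minority-class indices plus an equal number of
    majority-class indices drawn *with replacement*, so different groups may
    share majority samples. The SCL loss would then be computed per group
    and averaged. The partition variant instead samples without replacement,
    yielding disjoint groups.
    """
    rng = random.Random(seed)
    k = len(minority_idx)  # group is balanced: k minority + k majority
    n_groups = max(1, len(majority_idx) // k)
    groups = []
    for _ in range(n_groups):
        sampled = [rng.choice(majority_idx) for _ in range(k)]
        groups.append(minority_idx + sampled)
    return groups
```

For example, with 2 minority and 6 majority samples this yields 3 groups of 4 indices each, every group containing both minority samples.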


Table A1. Significance levels of p-values for comparisons between DART and its extensions, as well as P-Tuning and its extensions, using RoBERTa-large as the pretrained language model in Table 1. Only the average performance across all datasets is considered. Significance markers are assigned as follows: “***” for p-value < 0.001, “**” for p-value < 0.01, “*” for p-value < 0.05, and an empty string otherwise. A hyphen (“–”) indicates no comparison was possible


Table A2. Significance levels of p-values for comparisons between DART and its extensions, as well as P-Tuning and its extensions, using RoBERTa-large as the pretrained language model in Table 2. Only the average performance across all datasets is considered. Significance markers are assigned as follows: “***” for p-value < 0.001, “**” for p-value < 0.01, “*” for p-value < 0.05, and an empty string otherwise. A hyphen (“–”) indicates no comparison was possible


Table A3. Significance levels of p-values for comparisons between DART and its extensions, as well as P-Tuning and its extensions, using BERT-large-uncased as the pretrained language model in Table 3, in the imbalanced setting. Only the average performance across all datasets is considered. Significance markers are assigned as follows: “***” for p-value < 0.001, “**” for p-value < 0.01, “*” for p-value < 0.05, and an empty string otherwise. A hyphen (“–”) indicates no comparison was possible


Table A4. Significance levels of p-values for comparisons between DART and its extensions, as well as P-Tuning and its extensions, using BERT-large-uncased as the pretrained language model in Table 3, in the balanced setting. Only the average performance across all datasets is considered. Significance markers are assigned as follows: “***” for p-value < 0.001, “**” for p-value < 0.01, “*” for p-value < 0.05, and an empty string otherwise. A hyphen (“–”) indicates no comparison was possible


Table A5. Significance levels of p-values for comparisons between DART and its extensions, as well as P-Tuning and its extensions in Table 6 when the setting is imbalanced and PLM is RoBERTa-large. Only the average performance across all datasets is considered. Significance markers are assigned as follows: “***” for p-value < 0.001, “**” for p-value < 0.01, “*” for p-value < 0.05, and an empty string otherwise. A hyphen (“–”) indicates no comparison was possible


Table A6. Significance levels of p-values for comparisons between DART and its extensions, as well as P-Tuning and its extensions in Table 6 when the setting is imbalanced and PLM is BERT-large-uncased. Only the average performance across all datasets is considered. Significance markers are assigned as follows: “***” for p-value < 0.001, “**” for p-value < 0.01, “*” for p-value < 0.05, and an empty string otherwise. A hyphen (“–”) indicates no comparison was possible


Table A7. Significance levels of p-values for comparisons between DART and its extensions in Table 7 when the setting is imbalanced with different ratios. Significance markers are assigned as follows: “***” for p-value < 0.001, “**” for p-value < 0.01, “*” for p-value < 0.05, and an empty string otherwise. A hyphen (“–”) indicates no comparison was possible


Table A8. Significance of improvements over the base models (RoBERTa and BERT) in Table 4 when we use the balanced setting


Table A9. Significance of improvements over the base models (RoBERTa and BERT) in Table 4 when we use the imbalanced setting


Table A10. Significance of improvements over the base models (RoBERTa and BERT) in Table 5 when we use the imbalanced setting


Table A11. Significance levels of p-values for comparisons between GPT-3.5, OpenChat, DART, and DART + SCL in Table 8 when the setting is balanced. For GPT-3.5 and OpenChat, we consider the best performance across different settings. Only the average performance across all datasets is considered. Significance markers are assigned as follows: “***” for p-value < 0.001, “**” for p-value < 0.01, “*” for p-value < 0.05, and an empty string otherwise. A hyphen (“–”) indicates no comparison was possible


Table A12. Significance levels of p-values for comparisons between GPT-3.5, OpenChat, DART, DART + SCL, DART + BBSCL (random), DART + BBSCL (partition), and DART + SCL (over) in Table 8 when the setting is imbalanced. Only the average performance across all datasets is considered. Significance markers are assigned as follows: “***” for p-value < 0.001, “**” for p-value < 0.01, “*” for p-value < 0.05, and an empty string otherwise. A hyphen (“–”) indicates no comparison was possible

Supplementary material: File

Edalat and Yaghoobzadeh supplementary material

Download Edalat and Yaghoobzadeh supplementary material (File)
File 3.5 MB