
Masked transformer through knowledge distillation for unsupervised text style transfer

Published online by Cambridge University Press:  25 July 2023

Arthur Scalercio
Affiliation:
Institute of Computing, Universidade Federal Fluminense, Niterói, RJ, Brazil
Aline Paes*
Affiliation:
Institute of Computing, Universidade Federal Fluminense, Niterói, RJ, Brazil
Corresponding author: Aline Paes; Email: alinepaes@ic.uff.br

Abstract

Text style transfer (TST) aims at automatically changing a text’s stylistic features, such as formality, sentiment, authorial style, humor, and complexity, while preserving its content. Although the scientific community has investigated TST since the 1980s, the task has recently regained attention through deep unsupervised strategies that address the challenge of training without parallel data. In this manuscript, we investigate how relying on sequence-to-sequence pretrained models affects TST performance when the pretraining step leverages pairs of paraphrase data. Furthermore, we propose a new technique to enhance the sequence-to-sequence model by distilling knowledge from masked language models. We evaluate our proposals on three unsupervised style transfer tasks with widely used benchmarks: author imitation, formality transfer, and polarity swap. The evaluation relies on quantitative and qualitative analyses and on comparisons with state-of-the-art models. For the author imitation and formality transfer tasks, we show that the proposed techniques improve all measured metrics and lead to state-of-the-art (SOTA) results in content preservation and in the overall score for the author imitation domain. In the formality transfer domain, we match the SOTA method on the style control metric. Regarding the polarity swap domain, we show that the knowledge distillation component improves all measured metrics, whereas the paraphrase pretraining increases content preservation at the expense of style control. Based on the results obtained in these domains, we also discuss whether the tasks we address share the same nature and should be equally treated as TST tasks.
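As a rough illustration of the paraphrase pretraining step mentioned above, the sketch below fine-tunes a generic encoder-decoder model on sentence-paraphrase pairs with a standard cross-entropy objective. The model name, toy data, and hyperparameters are illustrative assumptions, not the exact setup used in the paper.

```python
# Hypothetical sketch of sequence-to-sequence pretraining on paraphrase pairs.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Toy paraphrase pairs: each source sentence maps to one of its paraphrases.
paraphrase_pairs = [
    ("the movie was very enjoyable", "i really enjoyed the film"),
    ("he arrived late to the meeting", "he got to the meeting late"),
]

model.train()
for source, target in paraphrase_pairs:
    batch = tokenizer(source, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt").input_ids
    loss = model(**batch, labels=labels).loss  # cross-entropy against the paraphrase
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```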

Information

Type
Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Table 1. Non-disentanglement publications according to model characteristics


Figure 1. Training illustration when the model is predicting the token $y_3$ using an MLM. $P_\theta$ is the student distribution, while $P_\phi$ is the teacher soft distribution provided by the MLM.
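To make the figure concrete, here is a minimal, hypothetical sketch of the distillation term it depicts: the student distribution $P_\theta$ at one target position is pulled towards the soft teacher distribution $P_\phi$ produced by the masked language model. Function names, shapes, and the temperature are illustrative assumptions rather than the paper's exact formulation.

```python
# Toy sketch of distilling an MLM's soft distribution into the student.
import torch
import torch.nn.functional as F

def mlm_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between the MLM teacher distribution and the student
    distribution at a single target position (both logits of shape [vocab_size])."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="sum")

# Toy vocabulary of 5 tokens.
student_logits = torch.randn(5)   # stand-in for P_theta's logits when predicting y_3
teacher_logits = torch.randn(5)   # stand-in for P_phi's logits with y_3 masked out
print(mlm_distillation_loss(student_logits, teacher_logits))
```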


Figure 2. Adversarial training illustration. $G$ denotes the generator network, while $D$ denotes the discriminator network.
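For illustration only, the following toy sketch shows one discriminator update and one generator update of the kind of adversarial scheme depicted in Figure 2, operating on fixed-size sentence representations; the actual architectures and losses used in the paper may differ.

```python
# Toy adversarial updates: D separates target-style from generated text,
# G is updated to fool D.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = nn.Linear(16, 1)                                  # toy discriminator
d_opt = torch.optim.Adam(D.parameters(), lr=1e-3)

real_repr = torch.randn(4, 16)                        # target-style sentence representations
fake_repr = torch.randn(4, 16, requires_grad=True)    # stand-in for generator outputs

# Discriminator step: push D(real) towards 1 and D(generated) towards 0.
d_loss = F.binary_cross_entropy_with_logits(D(real_repr), torch.ones(4, 1)) \
       + F.binary_cross_entropy_with_logits(D(fake_repr.detach()), torch.zeros(4, 1))
d_opt.zero_grad()
d_loss.backward()
d_opt.step()

# Generator step: the generator is rewarded when D labels its outputs as real;
# gradients of this loss flow back through fake_repr into the generator.
g_loss = F.binary_cross_entropy_with_logits(D(fake_repr), torch.ones(4, 1))
g_loss.backward()
```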


Table 2. Results with automatic evaluation metrics. BScore stands for BARTScore, and HM is the harmonic mean of BLEU and accuracy. The best results are in bold.
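For reference, the harmonic mean of two scores $a$ and $b$ is $\mathrm{HM} = \frac{2ab}{a+b}$; here $a$ is BLEU (content preservation) and $b$ is transfer accuracy (style control), so a high HM requires both to be high.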


Table 3. T-test over the mean population for the BARTScore and SIM metrics on the test set
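A test of this kind can be reproduced, under the assumption that it is a paired t-test over per-sentence scores from two systems on the same test set, with a few lines of SciPy; the scores below are invented placeholders.

```python
# Illustrative paired t-test over per-sentence content-preservation scores.
from scipy import stats

scores_system_a = [0.71, 0.68, 0.74, 0.69, 0.72]  # e.g. BARTScore per test sentence
scores_system_b = [0.66, 0.65, 0.70, 0.64, 0.69]

t_stat, p_value = stats.ttest_rel(scores_system_a, scores_system_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```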


Table 4. Results from the human evaluation on Yelp and GYAFC datasets. The best results are in bold.


Table 5. Evaluation results for the ablation study. Acc. stands for accuracy, BScore for BARTScore, and HM for the harmonic mean between BLEU and accuracy


Table 6. Pearson correlation between content preservation metrics over N systems
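As a minimal sketch of how such a correlation is computed, assuming one score per system for each content preservation metric (the numbers below are invented), Pearson's $r$ can be obtained with SciPy:

```python
# Pearson correlation between two content-preservation metrics over N systems.
from scipy import stats

bleu_per_system = [31.2, 28.5, 35.0, 22.1, 30.4]        # e.g. BLEU for each system
bartscore_per_system = [-2.9, -3.1, -2.6, -3.5, -2.8]    # e.g. BARTScore for each system

r, p_value = stats.pearsonr(bleu_per_system, bartscore_per_system)
print(f"Pearson r = {r:.3f} (p = {p_value:.4f})")
```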


Table 7. Transferred sentences for the author imitation task


Table 8. Hyperparameters used in training. For sentiment transfer, $0$ means negative and $1$ positive. For author imitation, $0$ means Shakespearean and $1$ modern English


Table 9. Hyperparameters for training. For formality transfer, $0$ means informal and $1$ formal


Table 10. Examples of transferred sentences in the sentiment transfer task


Figure 3. Evaluation metrics for both transfer directions during training, for our FULL MODEL and its NO PARA variation on the Yelp dataset


Table 11. Metrics from model variations trained from scratch. The best results are in bold.