
Abstractive summarization with deep reinforcement learning using semantic similarity rewards

Published online by Cambridge University Press: 31 October 2023

Figen Beken Fikri*
Affiliation:
Faculty of Engineering and Natural Sciences, Sabancı University, Istanbul, Türkiye
Kemal Oflazer
Affiliation:
Qatar Computer Science Program/Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA
Berrin Yanıkoğlu
Affiliation:
Faculty of Engineering and Natural Sciences, Sabancı University, Istanbul, Türkiye; Center of Excellence in Data Analytics (VERIM), Sabancı University, Istanbul, Türkiye
Corresponding author: Figen Beken Fikri; Email: fbekenfikri@sabanciuniv.edu

Abstract

Abstractive summarization is an approach to document summarization that is not limited to selecting sentences from the document but can generate new sentences as well. We address two main challenges in abstractive summarization: how to evaluate the performance of a summarization model and what constitutes a good training objective. We first introduce new evaluation measures based on the semantic similarity of the input and the corresponding summary. The similarity scores are obtained from a BERTurk model, fine-tuned on the Turkish Natural Language Inference and Semantic Textual Similarity benchmark datasets, using either a cross-encoder or a bi-encoder architecture. We show that these measures correlate better with human evaluations than Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores and BERTScore. We then introduce a deep reinforcement learning algorithm that uses the proposed semantic similarity measures as rewards, together with a mixed training objective, to generate summaries that are more natural in terms of human readability. We show that training with the mixed objective, rather than the maximum-likelihood objective alone, improves similarity scores.
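To make the described setup concrete, the sketch below shows one way a bi-encoder similarity reward and a mixed self-critical objective could be wired together. It is an illustrative sketch, not the authors' implementation: the SentenceTransformer checkpoint, the function names, and the mixing weight gamma are assumptions, whereas the paper fine-tunes BERTurk on Turkish NLI and STS data and combines a self-critical policy-gradient term with the maximum-likelihood loss. In self-critical training, the greedily decoded summary serves as the baseline, so a sampled summary is reinforced only when its reward exceeds that of the greedy output.

from sentence_transformers import SentenceTransformer, util

# Bi-encoder used to score summaries. The checkpoint name is a placeholder
# (assumption); the paper uses BERTurk fine-tuned on Turkish NLI/STS data.
bi_encoder = SentenceTransformer("dbmdz/bert-base-turkish-cased")

def similarity_reward(references, candidates):
    """Cosine similarity between each candidate summary and its reference."""
    ref_emb = bi_encoder.encode(references, convert_to_tensor=True)
    cand_emb = bi_encoder.encode(candidates, convert_to_tensor=True)
    # Diagonal keeps only the score of each candidate against its own reference.
    return util.cos_sim(ref_emb, cand_emb).diagonal()

def mixed_loss(sample_log_prob, sample_reward, greedy_reward, mle_loss, gamma=0.98):
    """Self-critical policy-gradient loss mixed with the MLE loss.

    sample_log_prob: summed token log-probabilities of the sampled summary.
    sample_reward / greedy_reward: similarity rewards of the sampled summary
        and of the greedily decoded baseline summary.
    gamma: mixing weight between the RL and MLE terms (illustrative value).
    """
    rl_loss = (greedy_reward - sample_reward) * sample_log_prob
    return gamma * rl_loss.mean() + (1.0 - gamma) * mle_loss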

Information

Type
Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Table 1. Sample translations from the Semantic Textual Similarity benchmark dataset along with the corresponding English sentences. The similarity score between two Turkish sentences is set to the similarity between the corresponding English sentences (Beken Fikri et al. 2021) (Section 3.1)

Table 2. Example sentences illustrating the Natural Language Inference task (Budur et al. 2020) (Section 3.2)

Figure 1. Cross-encoder and bi-encoder model architectures (Reimers and Gurevych 2019).

Table 3. Correlations between the semantic textual similarities predicted by models fine-tuned with varying architectures and training sets and the corresponding ground-truth similarity scores on the Semantic Textual Similarity benchmark test set. Pearson and Spearman correlations are reported as $\rho \times 100$

Table 4. Correlations of ROUGE, BERTScore, and the proposed evaluation methods with human judgments (Section 4.1.3). Pearson and Spearman correlations are reported as $\rho \times 100$

Figure 2. Self-critical policy gradient training process with bi-encoder/cross-encoder similarity rewards.

Figure 3. Pearson correlations of the evaluations with human judgments.

Figure 4. Spearman correlations of the evaluations with human judgments.

Table 5. Results of the mT5 summarization models trained on the MLSUM dataset. We report average results on the MLSUM test set for the MLE-only objective (the checkpoint that performed best on the validation set in terms of bi-encoder and cross-encoder similarity) and the RL training objectives (Section 4.2). All values are scaled to 100

Table 6. Human evaluations of the summarization models (higher is better). Results are shown for the bi-encoder and cross-encoder models separately and together (all); n is the sample size (Section 4.2.4)

Table 7. Sample article with the reference summary and the summaries generated by the MLE-only model and by the MLE + RL models with ROUGE-L and bi-encoder similarity rewards, respectively

Table 8. Sample article with the reference summary and the summaries generated by the MLE-only model and by the MLE + RL models with ROUGE-L and cross-encoder similarity rewards, respectively