
Turkish abstractive text summarization using pretrained sequence-to-sequence models

Published online by Cambridge University Press: 13 May 2022

Batuhan Baykara*
Affiliation:
Department of Computer Engineering, Boğaziçi University, Bebek, 34342 Istanbul, Turkey
Tunga Güngör
Affiliation:
Department of Computer Engineering, Boğaziçi University, Bebek, 34342 Istanbul, Turkey
*Corresponding author. E-mail: batuhan.baykara@boun.edu.tr

Abstract

The tremendous increase in the number of documents available on the Web has turned finding the relevant piece of information into a challenging, tedious, and time-consuming activity. Accordingly, automatic text summarization has become an important field of study, gaining significant attention from researchers. Lately, with the advances in deep learning, neural abstractive text summarization with sequence-to-sequence (Seq2Seq) models has gained popularity. There have been many improvements in these models, such as the use of pretrained language models (e.g., GPT, BERT, and XLM) and pretrained Seq2Seq models (e.g., BART and T5). These improvements have addressed certain shortcomings in neural summarization, including challenges such as saliency, fluency, and semantics, enabling the generation of higher quality summaries. Unfortunately, these research attempts have mostly been limited to the English language. Monolingual BERT models and multilingual pretrained Seq2Seq models have been released recently, providing the opportunity to utilize such state-of-the-art models in low-resource languages such as Turkish. In this study, we make use of pretrained Seq2Seq models and obtain state-of-the-art results on two large-scale Turkish datasets, TR-News and MLSum, for the text summarization task. Then, we utilize the title information in the datasets and establish hard baselines for the title generation task on both datasets. We show that the input given to the models is of substantial importance for the success of such tasks. Additionally, we provide an extensive analysis of the models, including cross-dataset evaluations, various text generation options, and the effect of preprocessing in ROUGE evaluations for Turkish. It is shown that the monolingual BERT models outperform the multilingual BERT models on all tasks across all the datasets. Lastly, qualitative evaluations of the summaries and titles generated by the models are provided.
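
As a concrete illustration of this setup, the following minimal sketch loads a multilingual pretrained Seq2Seq model from the Hugging Face hub and generates an abstractive summary for a Turkish article. This is not the training pipeline used in the paper: the mT5 checkpoint name and generation parameters are illustrative assumptions, and a checkpoint fine-tuned on TR-News or MLSum would be needed for meaningful output.

# Minimal sketch: abstractive summarization with a multilingual pretrained
# Seq2Seq model. Checkpoint and parameters are illustrative; in practice a
# model fine-tuned on a Turkish summarization dataset is required.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/mt5-small"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

article = "Buraya Türkçe haber metni gelir."  # placeholder article content
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(
    **inputs,
    num_beams=4,          # beam search; decoding options are analyzed below
    early_stopping=True,
    max_length=128,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))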

Information

Type: Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2022. Published by Cambridge University Press

Figure 1. A high-level transformer-based encoder-decoder network.


Figure 2. A number of noising methods experimented with in the BART model. T1-T6 denote tokens. The box that the arrows point to shows the denoised text.


Figure 3. Various downstream tasks such as machine translation, semantic textual similarity, and text summarization on the mT5 framework, shown with examples in Turkish.


Table 1. Comparison of summarization datasets with respect to sizes of training, validation, and test sets, and average content, abstract, and title lengths (in terms of words and sentences)


Table 2. Comparison of summarization datasets with respect to vocabulary size and type-token ratio of content, abstract, title, and overall
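
The type-token ratio in Table 2 is the number of distinct tokens divided by the total token count; below is a minimal sketch assuming whitespace tokenization (the exact tokenization used for this statistic is not specified here).

# Minimal sketch: vocabulary size and type-token ratio over a text collection.
def vocab_size_and_ttr(texts):
    tokens = [tok for text in texts for tok in text.split()]
    vocab = set(tokens)
    return len(vocab), (len(vocab) / len(tokens) if tokens else 0.0)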


Table 3. Two news articles selected from TR-News and MLSum (TR)


Table 4. Tokenization outputs of the methods for a given Turkish sentence, which translates to “If one day, my words are against science, choose science”
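
To make such a comparison reproducible, the sketch below prints the subword segmentations produced by a monolingual Turkish BERT (WordPiece) tokenizer and the mT5 (SentencePiece) tokenizer. The checkpoint names and the exact Turkish wording of the sentence are assumptions.

# Minimal sketch: compare subword tokenizations of a Turkish sentence.
from transformers import AutoTokenizer

berturk = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")  # assumed checkpoint
mt5 = AutoTokenizer.from_pretrained("google/mt5-small")                   # assumed checkpoint

sentence = "Bir gün sözlerim bilim ile ters düşerse bilimi seçin"  # assumed Turkish wording
print(berturk.tokenize(sentence))  # WordPiece: continuation pieces start with '##'
print(mt5.tokenize(sentence))      # SentencePiece: word-initial pieces start with '▁'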


Figure 4. Average number of tokens generated by the tokenizers of the models for content, abstract, and title.


Table 5. Novelty ratios of the datasets with respect to the summary generation and title generation tasks. N1, N2, and N3 denote uni-gram, bi-gram, and tri-gram ratios, respectively
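
Novelty here is the fraction of target n-grams that do not occur in the source text; below is a minimal sketch assuming whitespace tokenization and per-instance computation.

# Minimal sketch: n-gram novelty ratio (N1/N2/N3 for n = 1, 2, 3).
def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novelty_ratio(source, target, n):
    # Fraction of target n-grams that are absent from the source text.
    src = ngram_set(source.split(), n)
    tgt_tokens = target.split()
    tgt = [tuple(tgt_tokens[i:i + n]) for i in range(len(tgt_tokens) - n + 1)]
    return sum(g not in src for g in tgt) / len(tgt) if tgt else 0.0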


Table 6. Text summarization results of pretrained encoder-decoder models on the TR-News, MLSum (TR), and Combined-TR datasets. ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) scores are given in F-measure. “-” denotes that the result is not available. Bold values show the highest scores obtained in the experiments per dataset
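
A minimal sketch of computing these F-measure scores with the rouge-score package is given below; the paper's exact evaluation toolchain is not reproduced here, and note that the package's built-in stemmer is English-only, which is one reason Turkish-specific preprocessing is analyzed separately in Table 14.

# Minimal sketch: ROUGE-1/2/L F-measures with the rouge-score package.
# use_stemmer=True would apply an English Porter stemmer, which is not
# appropriate for Turkish.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
scores = scorer.score("referans özet metni", "üretilen özet metni")  # (reference, candidate)
for name, score in scores.items():
    print(name, round(score.fmeasure, 4))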


Table 7. Title generation (abstract as input) results of pretrained encoder-decoder models on TR-News, MLSum (TR), and Combined-TR datasets. ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) are given in F-measure. Bold values show the highest scores obtained in the experiments per dataset


Table 8. Title generation (LEAD-3 as input) results of pretrained encoder-decoder models on TR-News, MLSum (TR), and Combined-TR datasets. ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) are given in F-measure. Bold values show the highest scores obtained in the experiments per dataset


Table 9. Results of the ablation study on the number of LEAD sentences for title generation. ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) are given in F-measure
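
The LEAD-k inputs used in Tables 8 and 9 are simply the first k sentences of the article content. Below is a minimal sketch with a naive rule-based sentence splitter; a proper Turkish sentence tokenizer would handle abbreviations and numbers more robustly.

import re

# Minimal sketch: build a LEAD-k input from the article content.
def lead_k(content, k=3):
    # Naive split at sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", content.strip())
    return " ".join(sentences[:k])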


Table 10. Novelty ratios of the summaries generated by the models per dataset. N1, N2, and N3 denote uni-gram, bi-gram, and tri-gram ratios, respectively. Bold values show the highest scores obtained in the experiments per dataset (the mBERT-uncased results are misleading due to the high number of unknown tokens in the output and are therefore ignored)


Table 11. Novelty ratios of the titles (abstracts are given as input) generated by the models per dataset. N1, N2, and N3 denote uni-gram, bi-gram, and tri-gram ratios, respectively. Bold values show the highest scores obtained in the experiments per dataset


Table 12. Cross-dataset evaluation results for the summary generation and the title generation (abstract as input) tasks. The values correspond to ROUGE-1 scores


Table 13. Results for the summary generation and title generation (abstract as input) tasks with various beam sizes and the early-stopping method. The values correspond to ROUGE-1 scores. Bold values show the highest scores obtained in the experiments per dataset
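
These decoding options correspond to standard beam-search parameters of the Hugging Face generate API; below is a minimal, self-contained sketch in which the checkpoint, beam sizes, and input text are illustrative.

# Minimal sketch: sweep beam sizes with and without early stopping.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")  # assumed checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
inputs = tokenizer("Türkçe girdi metni", return_tensors="pt", truncation=True)

for num_beams in (2, 4, 8):
    for early_stopping in (False, True):
        ids = model.generate(**inputs, num_beams=num_beams,
                             early_stopping=early_stopping,  # stop once enough finished hypotheses exist
                             max_length=128)
        print(num_beams, early_stopping, tokenizer.decode(ids[0], skip_special_tokens=True))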


Table 14. ROUGE scores calculated with different preprocessing settings. “Punct removed” refers to removing punctuation, whereas “Punct kept” refers to keeping punctuation before the ROUGE calculations. “Stems taken” refers to applying stemming to the words, whereas “Stems not taken” refers to leaving the words in their surface form before the ROUGE calculations
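
A minimal sketch of these preprocessing settings is given below, assuming the snowballstemmer package's Turkish stemmer; the paper may rely on a different Turkish morphological analyzer.

# Minimal sketch: normalize a text before ROUGE scoring.
import string
import snowballstemmer

stemmer = snowballstemmer.stemmer("turkish")

def preprocess(text, remove_punct, take_stems):
    if remove_punct:
        text = text.translate(str.maketrans("", "", string.punctuation))
    # Caution: Python's lower() maps 'I' to 'i', while Turkish maps 'I' to 'ı'.
    tokens = text.lower().split()
    if take_stems:
        tokens = stemmer.stemWords(tokens)
    return " ".join(tokens)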


Table 15. An example from the test set of TR-News accompanied by the summaries generated by the models. The spelling and grammatical errors in the original texts are left as is. The news article’s content is given as the input, and the reference summary is the abstract of the article. The words in bold denote novel unigrams (unigrams which are not present in the input text) generated by the models, whereas the underlined texts are for reference in the discussion


Table 16. An example from the test set of MLSum (TR) accompanied by the summaries generated by the models. The news article’s content is given as the input, and the reference summary is the abstract of the article. The words in bold denote novel unigrams (unigrams which are not present in the input text) generated by the models, whereas the underlined texts are for reference in the discussion


Table 17. An example from the test set of TR-News accompanied by the titles generated by the models. The news article’s abstract is given as the input, and the title of the article is expected as the output. The words in bold denote novel unigrams (unigrams which are not present in the input text) generated by the models, whereas the underlined texts are for reference in the discussion


Table 18. An example from the test set of MLSum (TR) accompanied by the titles generated by the models. The news article’s abstract is given as the input, and the title of the article is expected as the output. The words in bold denote novel unigrams (unigrams which are not present in the input text) generated by the models


Table A1. Cross-dataset evaluation results for the summary generation task


Table A2. Cross-dataset evaluation results for the title generation (abstract as input) task


Table A3. The analysis results for the summary generation task given various beam sizes and the early-stopping method


Table A4. The analysis results for the title generation (abstract as input) task given various beam sizes and the early-stopping method


Table A5. ROUGE scores with different preprocessing settings for the summary generation task. “Punct removed” refers to removing punctuation, whereas “Punct kept” refers to keeping punctuation before the ROUGE calculations. “Stems taken” refers to applying stemming to the words, whereas “Stems not taken” refers to leaving the words in their surface form before the ROUGE calculations


Table A6. ROUGE scores with different preprocessing settings for the title generation (abstract as input) task. “Punct removed” refers to removing punctuation, whereas “Punct kept” refers to keeping punctuation before the ROUGE calculations. “Stems taken” refers to applying stemming to the words, whereas “Stems not taken” refers to leaving the words in their surface form before the ROUGE calculations


Table A7. ROUGE-1 scores of all the models calculated under different preprocessing settings on the TR-News dataset for the text summarization task. “Punct removed” refers to removing punctuation, whereas “Punct kept” refers to keeping punctuation before the ROUGE calculations. “Stems taken” refers to applying stemming to the words, whereas “Stems not taken” refers to leaving the words in their surface form before the ROUGE calculations


Table A8. ROUGE-1 scores of all the models calculated under different preprocessing settings on the MLSum (TR) dataset for the text summarization task. “Punct removed” refers to removing punctuation, whereas “Punct kept” refers to keeping punctuation before the ROUGE calculations. “Stems taken” refers to applying stemming to the words, whereas “Stems not taken” refers to leaving the words in their surface form before the ROUGE calculations


Table A9. ROUGE-1 scores of all the models calculated under different preprocessing settings on the Combined-TR dataset for the text summarization task. “Punct removed” refers to removing punctuation, whereas “Punct kept” refers to keeping punctuation before the ROUGE calculations. “Stems taken” refers to applying stemming to the words, whereas “Stems not taken” refers to leaving the words in their surface form before the ROUGE calculations