
Towards diverse and contextually anchored paraphrase modeling: A dataset and baselines for Finnish

Published online by Cambridge University Press:  16 March 2023

Jenna Kanerva*, Filip Ginter, Li-Hsin Chang, Iiro Rastas, Valtteri Skantsi, Jemina Kilpeläinen, Hanna-Mari Kupari, Aurora Piirto, Jenna Saarni, Maija Sevón and Otto Tarkka

TurkuNLP, Department of Computing, University of Turku, Turku, Finland
*Corresponding author. Email: jmnybl@utu.fi

Abstract

In this paper, we study natural language paraphrasing from both the corpus creation and modeling points of view. We focus in particular on a methodology that allows the extraction of challenging paraphrase pairs in their natural textual context, leading to a dataset potentially more suitable for evaluating models' ability to represent meaning, especially in document context, than datasets gathered using various sentence-level heuristics. To this end, we introduce the Turku Paraphrase Corpus, the first large-scale, fully manually annotated corpus of paraphrases in Finnish. The corpus contains 104,645 manually labeled paraphrase pairs, of which 98% are verified to be true paraphrases, either universally or within their present context. To control the diversity of the paraphrase pairs and avoid certain biases easily introduced in automatic candidate extraction, the paraphrases are manually collected from several paraphrase-rich text sources. This yields a challenging dataset with longer and more lexically diverse paraphrases than can be expected from heuristically collected pairs. In addition to quality, manual collection also allows us to preserve the original document context for each pair, making it possible to study paraphrasing in context. To our knowledge, this is the first paraphrase corpus that provides the original document context for the annotated pairs.

We also study several paraphrase models trained and evaluated on the new data. Our initial paraphrase classification experiments indicate the challenging nature of the dataset: when classifying with the detailed labeling scheme used in the corpus annotation, accuracy lags substantially behind human performance. However, when evaluating the models on a large-scale paraphrase retrieval task over almost 400M candidate sentences, the results are highly encouraging, with 29–53% of the pairs ranked in the top 10, depending on the paraphrase type. The Turku Paraphrase Corpus is available at github.com/TurkuNLP/Turku-paraphrase-corpus, as well as through the popular HuggingFace datasets library, under the CC-BY-SA license.
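For readers who want to inspect the data, a minimal sketch for loading the corpus through the HuggingFace datasets library is shown below. The dataset identifier and the configuration name are assumptions inferred from the repository name above, and the field names may differ in the actual release.

```python
from datasets import load_dataset

# NOTE: the dataset identifier and the "plain" configuration are assumptions
# inferred from the GitHub repository name; check the release for exact names.
dataset = load_dataset("TurkuNLP/turku_paraphrase_corpus", "plain")

example = dataset["train"][0]
print(example)  # expected: the two paraphrase texts plus the manual label
```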

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Table 1. Manual paraphrase extraction statistics for different text sources. Documents refers to the number of document pairs producing paraphrases; Empty refers to the percentage of candidate document pairs not producing any paraphrase candidates (all other metrics are calculated after discarding the empty pairs); Yield refers to the average number of paraphrase pairs extracted from one document pair; Coverage is the total proportion of text (in terms of alphanumeric characters) selected in paraphrase extraction from the original source documents; and Length is the average length of the original document in alphanumeric characters. Note that the alternative subtitle statistics are based on the first round of annotations only, where the movie/episode selection is not biased towards high-yield documents, and here one subtitling document refers to a 15-minute segment of a movie/episode


Table 2. The number of paraphrase pairs in the released corpus originating from different text sources (rewrites, introduced in Section 5.3, are included in the statistics)


Table 3. The sections of the corpus and their sizes in terms of number of paraphrase pairs


Figure 1. Label distribution in the whole corpus.


Figure 2. Histogram of different labels in the corpus conditioned on cosine similarity of the paraphrase pairs.
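The cosine similarities in Figure 2 are computed between sentence embeddings of the two texts in each pair. The sketch below illustrates such a computation; the sentence-transformers model name is an assumption for illustration only, not necessarily the encoder used in the paper.

```python
from sentence_transformers import SentenceTransformer, util

# Model name is an assumption for illustration purposes only.
model = SentenceTransformer("TurkuNLP/sbert-cased-finnish-paraphrase")

pair = ["Ulkona sataa vettä.", "Sataa."]  # a toy Finnish paraphrase pair
embeddings = model.encode(pair, convert_to_tensor=True)
cosine = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"cosine similarity: {cosine:.3f}")
```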


Figure 3. Comparison of paraphrase length distributions in terms of tokens per paraphrase.


Figure 4. Comparison of paraphrase pair cosine similarity distributions.


Figure 5. Percentage of the types of systematic differences characterizing the paraphrases in Opusparcus, TaPaCo, and our corpus. Others refers to all paraphrases including differences not automatically detectable by the method used.


Table 4. Baseline classification performance on the two test sets, when the base label and the flags are predicted separately. In the upper section, we merge the subsumption flags with the base class prediction, but leave the flags i and s separate. The rows W. avg and Acc, on the other hand, refer to performance on the complete labels, comprising all allowed combinations of base label and flags. W. avg is the average of P/R/F values across the classes, weighted by class support. Acc is the accuracy
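To make the W. avg and Acc rows concrete, the snippet below illustrates, with toy labels rather than the paper's data, how support-weighted P/R/F and accuracy are typically computed with scikit-learn.

```python
from sklearn.metrics import accuracy_score, classification_report

# Toy gold and predicted complete labels (base label plus optional flags);
# these values are illustrative only, not taken from the paper.
gold = ["4", "4>", "3", "4", "2", "4i"]
pred = ["4", "4",  "3", "4>", "2", "4i"]

# The "weighted avg" row averages per-class P/R/F weighted by class support,
# corresponding to W. avg in Table 4.
print(classification_report(gold, pred, zero_division=0))
print("Accuracy:", accuracy_score(gold, pred))
```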


Figure 6. (a) Distribution of the manually annotated labels in the opus-parsebank set including both development and test examples. (b) Comparison of the types of paraphrases in the manually and automatically extracted data. The manually extracted data refers to the training set of our corpus, while the automatically extracted data refers to the combination of the opus-parsebank-dev and opus-parsebank-test sets.


Figure 7. Heatmap of estimated negative example density per tile, in increments of 0.2, for opus-parsebank-dev. Lexical similarity is plotted on the y-axis and prediction confidence on the x-axis, creating two-dimensional tiles when both are divided into increments of 0.2. Each tile is further annotated with a density score indicating the percentage of negative examples in the tile, based on the manually annotated labels.
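The tiling in Figure 7 can be reproduced in spirit with a simple 2D histogram: bin the pairs by lexical similarity and prediction confidence in 0.2 increments, then compute the share of negatives per tile. The sketch below uses synthetic inputs in place of the real opus-parsebank-dev annotations.

```python
import numpy as np

# Synthetic stand-ins for the real annotations (assumptions for illustration).
rng = np.random.default_rng(0)
lex_sim = rng.random(1000)            # per-pair lexical similarity in [0, 1]
confidence = rng.random(1000)         # classifier prediction confidence in [0, 1]
is_negative = rng.random(1000) < 0.3  # manually annotated negative examples

edges = np.arange(0.0, 1.2, 0.2)      # tile boundaries in increments of 0.2
total, _, _ = np.histogram2d(lex_sim, confidence, bins=[edges, edges])
negatives, _, _ = np.histogram2d(lex_sim[is_negative], confidence[is_negative],
                                 bins=[edges, edges])

# Percentage of negative examples per tile (0 where a tile is empty).
density = np.divide(negatives, total, out=np.zeros_like(negatives),
                    where=total > 0)
print(np.round(density * 100, 1))
```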


Table 5. Final classification performance on the two test sets, as in Table 4


Figure 8. The top-1 retrieval accuracy (higher is better) of all positive paraphrases in the Turku Paraphrase Corpus test set and the opus-parsebank-test set. The test sets consist of 19,893 and 19,271 unique retrieval candidates, respectively. The exact accuracy numbers are visualized on top of the bars.
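Top-1 retrieval accuracy as in Figure 8 amounts to checking, for each query sentence, whether its annotated paraphrase is the nearest neighbour among all candidates. A minimal sketch with pre-normalized embeddings (synthetic here, standing in for encoded sentences):

```python
import numpy as np

def top1_accuracy(queries, candidates, gold_idx):
    """Share of queries whose nearest candidate (by cosine) is the gold pair."""
    sims = queries @ candidates.T  # cosine via dot product (inputs L2-normalized)
    return float(np.mean(sims.argmax(axis=1) == gold_idx))

# Synthetic, L2-normalized embeddings for illustration.
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 16)); q /= np.linalg.norm(q, axis=1, keepdims=True)
c = rng.normal(size=(20, 16)); c /= np.linalg.norm(c, axis=1, keepdims=True)
print(top1_accuracy(q, c, gold_idx=np.arange(5)))
```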


Figure 9. The average ranking positions, normalized to percentages (lower is better), for the Turku Paraphrase Corpus test set by various models. The ranking is measured separately for each paraphrase label (2, 3, 4</>, and 4), disregarding the flags i and s. The exact numbers are visualized on top of the bars (percentages calculated out of 19,893 candidate sentences).


Figure 10. The retrieval of the opus-parsebank test set paraphrase candidates by various models. The numbers on top of the bars indicate the average ranking as a percentage (out of 19,271 candidate sentences) for each class of paraphrase candidates. The ranking is measured separately for each paraphrase label (1, 2, 3, 4</>, and 4), disregarding the flags i and s.


Figure 11. The retrieval of test set paraphrase pairs by the fine-tuned Finnish SBERT, the multilingual SBERT, and the vanilla FinBERT, out of 400M candidate sentences. The white numbers indicate the percentage of pairs in the given category, and the retrieval is measured for the three main classes of paraphrase: 4, 4< or 4>, and 3 (disregarding flags s and i), and for several top-k cut-offs. NA means that the correct sentence did not rank in the top-2048 list, which was the upper technical limit in the experiment.
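Retrieval over almost 400M candidates with a top-2048 cut-off requires an efficient nearest-neighbour index. The text here does not name the indexing tool used, so the following is only a sketch using FAISS as an assumed library choice, with an exact inner-product index over a small synthetic candidate set; at the real scale an approximate index (e.g., IVF or HNSW) would be the practical choice.

```python
import numpy as np
import faiss  # assumed library choice; the paper's indexing setup is not given here

d = 768                                       # assumed embedding dimension
candidates = np.random.rand(100_000, d).astype("float32")
faiss.normalize_L2(candidates)                # inner product == cosine after this

index = faiss.IndexFlatIP(d)                  # exact search; fine for a demo
index.add(candidates)

queries = np.random.rand(4, d).astype("float32")
faiss.normalize_L2(queries)
scores, ids = index.search(queries, 2048)     # top-2048 cut-off as in Figure 11
print(ids.shape)                              # (4, 2048) ranked candidate indices
```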