Urdu paraphrase detection: A novel DNN-based implementation using a semi-automatically generated corpus

Published online by Cambridge University Press: 29 May 2023

Hafiz Rizwan Iqbal, Information Technology University, Lahore, Pakistan
Rashad Maqsood, Information Technology University, Lahore, Pakistan
Agha Ali Raza*, Lahore University of Management Sciences, Lahore, Pakistan
Saeed-Ul Hassan, Manchester Metropolitan University, Manchester, UK
*Corresponding author: Agha Ali Raza; Email: agha.ali.raza@lums.edu.pk

Abstract

Automatic paraphrase detection is the task of measuring the semantic overlap between two given texts. A major hurdle in the development and evaluation of paraphrase detection approaches, particularly for South Asian languages like Urdu, is the lack of standard evaluation resources. The few paraphrase corpora available for these languages are manually created. As a result, they are small and poorly suited to evaluating mainstream data-driven and deep neural network (DNN)-based approaches. Consequently, there is a need to develop semi- or fully automated corpus generation approaches for resource-scarce languages. There is currently no semi- or fully automatically generated sentence-level Urdu paraphrase corpus. Moreover, no existing study localizes and compares approaches for Urdu paraphrase detection that focus on various mainstream deep neural architectures and pretrained language models.

This study addresses these gaps by presenting a semi-automatic pipeline for generating paraphrased corpora for Urdu. It also presents a corpus generated using the proposed pipeline. This corpus contains 3147 semi-automatically extracted Urdu sentence pairs that are manually tagged as paraphrased (854) or non-paraphrased (2293). Finally, this paper proposes two novel DNN-based approaches for paraphrase detection in Urdu text: Word Embeddings n-gram Overlap (henceforth WENGO) and a modified approach, Deep Text Reuse and Paraphrase Plagiarism Detection (henceforth D-TRAPPD). Both approaches have been evaluated on two related tasks: (i) paraphrase detection, and (ii) text reuse and plagiarism detection. The evaluations show that D-TRAPPD ($F_1 = 96.80$ for paraphrase detection and $F_1 = 88.90$ for text reuse and plagiarism detection) outperformed WENGO ($F_1 = 81.64$ for paraphrase detection and $F_1 = 61.19$ for text reuse and plagiarism detection) as well as other state-of-the-art approaches for these two tasks. The corpus, models, and our implementations are freely available for download by the research community.
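As a reading aid for the figures quoted above, the following minimal Python sketch computes the class shares of the corpus from the counts given in the abstract and shows the standard definition of $F_1$ as the harmonic mean of precision and recall. The function name `f1_score` and the example precision and recall values are illustrative assumptions, not taken from the article, and the abstract does not state which averaging scheme the authors used for their reported $F_1$ scores.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Class counts as reported in the abstract.
paraphrased = 854
non_paraphrased = 2293
total = paraphrased + non_paraphrased  # 3147 sentence pairs

# Share of each class; the corpus is imbalanced towards non-paraphrased pairs.
paraphrased_share = paraphrased / total
non_paraphrased_share = non_paraphrased / total

# Hypothetical precision/recall values, chosen only to illustrate the formula.
example_f1 = f1_score(0.9, 0.8)

print(f"total pairs: {total}")
print(f"paraphrased share: {paraphrased_share:.2%}")
print(f"non-paraphrased share: {non_paraphrased_share:.2%}")
print(f"example F1: {example_f1:.4f}")
```

Because roughly 73% of the pairs are non-paraphrased, a classifier that always predicts the majority class would already reach about that accuracy, which is one reason $F_1$ is a more informative measure than accuracy on a corpus with this distribution.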

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Table 1. Existing corpora for Urdu paraphrase detection and for text reuse and plagiarism detection. For document-level corpora, the size is the total number of documents, that is, the sum of the source and the suspicious documents. For sentence-level corpora, the size is the number of pairs, where each pair consists of a source sentence and the corresponding suspicious sentence.

Table 2. Distribution of the source sentence pairs w.r.t. classes in the COUNTER corpus

Figure 1. Corpus generation process.

Table 3. SUSPC human evaluation statistics

Table 4. SUSPC statistics

Table 5. Distribution of the semi-automatically generated sentence pairs in SUSPC w.r.t. COUNTER corpus

Figure 2. Example sentence pairs from SUSPC: (a) paraphrased and (b) non-paraphrased.

Figure 3. Proposed deep neural network architecture for paraphrased text reuse detection.

Table 6. Result comparison of the proposed approaches on UPPC

Table 7. Result comparison of the proposed approaches on SUSPC

Table 8. Result comparison of the proposed approaches on USTRC

Table 9. Comparison of the best results for: (i) paraphrase detection task, and (ii) text reuse and plagiarism detection task

Figure 4. Effect of epochs on precision, recall, and $F_1$ for paraphrase detection task (Section 5.1.1) on: (a) UPPC and (b) SUSPC.

Figure 5. Effect of epochs on precision, recall, and $F_1$ for text reuse and plagiarism detection task (Section 5.1.2) on USTRC for: (c) binary classification, and (d) multi-classification.

Table 10. Comparison with the state-of-the-art approaches on Urdu Short Text Reuse Corpus