
Ben-Sarc: A self-annotated corpus for sarcasm detection from Bengali social media comments and its baseline evaluation

Published online by Cambridge University Press:  27 May 2024

Sanzana Karim Lora*, G. M. Shahariar, Tamanna Nazmin, Noor Nafeur Rahman, Rafsan Rahman, Miyad Bhuiyan and Faisal Muhammad Shah

Affiliation: Department of Computer Science and Engineering, Ahsanullah University of Science and Technology, Dhaka, Bangladesh

*Corresponding author: Sanzana Karim Lora; Email: sanzanalora@yahoo.com

Abstract

Research on sarcasm detection in the Bengali language has so far been narrow owing to the unavailability of resources. In this paper, we introduce 'Ben-Sarc', a large-scale self-annotated Bengali corpus for the sarcasm detection research problem, containing 25,636 comments manually collected from different public Facebook pages and assessed by external evaluators. We then present a complete strategy for applying traditional machine learning, deep learning, and transfer learning models to detect sarcasm in text using the Ben-Sarc corpus. Finally, we compare the performance of these three families of models on Ben-Sarc. Transfer learning using Indic-Transformers Bengali Bidirectional Encoder Representations from Transformers (BERT) as the pre-trained source model achieved the highest accuracy, 75.05%. The long short-term memory model obtained the second-highest accuracy with 72.48%, and Multinomial Naive Bayes the third-highest with 72.36%, for deep learning and traditional machine learning, respectively. The Ben-Sarc corpus is made publicly available in the hope of advancing the Bengali Natural Language Processing community, and can be found at https://github.com/sanzanalora/Ben-Sarc.
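This page does not reproduce the authors' code, but the transfer-learning setup described in the abstract can be illustrated with a minimal sketch: fine-tuning a pre-trained Bengali BERT checkpoint for binary sarcasm classification using the Hugging Face transformers library. The checkpoint ID, the corpus file name ("ben_sarc.csv"), the column names ("comment", "label"), and all hyperparameters below are assumptions for illustration only, not details confirmed by the paper.

```python
# Illustrative sketch (not the authors' exact pipeline): fine-tune a
# Bengali BERT checkpoint on the Ben-Sarc corpus for binary sarcasm
# detection with Hugging Face transformers.
import pandas as pd
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Assumed checkpoint ID for Indic-Transformers Bengali BERT.
MODEL_ID = "neuralspace-reverie/indic-transformers-bn-bert"

# Assumed corpus layout: a CSV with a Bengali "comment" column and a
# 0/1 "label" column (non-sarcastic / sarcastic).
df = pd.read_csv("ben_sarc.csv")
dataset = Dataset.from_pandas(df[["comment", "label"]])
dataset = dataset.train_test_split(test_size=0.1, seed=42)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def tokenize(batch):
    # Truncate/pad each comment to a fixed length for batching.
    return tokenizer(batch["comment"], truncation=True,
                     padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

# Two output labels: sarcastic vs. non-sarcastic.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID,
                                                           num_labels=2)

args = TrainingArguments(
    output_dir="ben-sarc-bert",
    num_train_epochs=3,              # illustrative hyperparameters only
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"])
trainer.train()
print(trainer.evaluate())
```

The paper's full evaluation also covers traditional machine learning (e.g. Multinomial Naive Bayes) and LSTM-based deep learning baselines; the sketch above illustrates only the transfer-learning path that produced the best reported accuracy.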

Information

Type
Article
Creative Commons
Creative Commons Licence: CC BY-NC-SA
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (http://creativecommons.org/licenses/by-nc-sa/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is used to distribute the re-used or adapted article and the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use.
Copyright
© The Author(s), 2024. Published by Cambridge University Press
Figures and Tables

Table 1. Comparison of the performance and other aspects of our study with other existing studies

Table 2. The content of the Facebook pages

Table 3. Inter-annotator agreement of Ben-Sarc assessed by human evaluators

Figure 1. Data distribution of the Ben-Sarc dataset according to labels.

Table 4. An overview of the Ben-Sarc dataset preprocessing

Figure 2. Length-frequency distribution of the Ben-Sarc dataset.

Table 5. Overall summary of the Ben-Sarc dataset

Figure 3. Visualization of the statistics of the Ben-Sarc dataset.

Figure 4. Workflow of the proposed approach.

Figure 5. Long short-term memory model without pre-trained word embeddings.

Figure 6. Long short-term memory + CNN + pre-trained word embeddings model.

Figure 7. Stacked long short-term memory + CNN + pre-trained word embeddings model.

Figure 8. BiLSTM + pre-trained word embeddings model.

Figure 9. Illustration of transfer learning (Torrey and Shavlik 2010).

Figure 10. Model architecture of sarcasm detection for transfer learning.

Table 6. Performance (in %) of 5-fold and 10-fold cross-validation for experiment I

Table 7. Performance (in %) of each model for the best setting of experiment II

Table 8. Hyperparameter settings of each best-performing model for experiment II

Table 9. Performance (in %) of each model for the best setting of experiment III

Table 10. Hyperparameter settings of each best-performing model for experiment III

Table 11. A short overview of the best-performing model (in %) from each experiment

Table 12. Hyperparameters and the values tuned across all models for experiment II

Table 13. Hyperparameters and the values tuned across all models for experiment III

Table 14. Error analysis of different inputs and outputs for the best-performing models of experiments I, II, and III.