
Hate speech detection in low-resourced Indian languages: An analysis of transformer-based monolingual and multilingual models with cross-lingual experiments

Published online by Cambridge University Press: 27 August 2024

Koyel Ghosh*
Affiliation:
Department of Computer Science and Engineering, Central Institute of Technology, Kokrajhar, Assam, India
Apurbalal Senapati
Affiliation:
Department of Computer Science and Engineering, Central Institute of Technology, Kokrajhar, Assam, India
*Corresponding author: Koyel Ghosh; Email: ghosh.koyel8@gmail.com

Abstract

Warning: This paper deals with hate speech detection and may contain examples of abusive or offensive phrases.

Cyberbullying and online harassment through offensive comments are pervasive across social media platforms such as Twitter, Facebook, and YouTube. Hateful comments must be detected and removed to prevent harassment and violence on social media. In the Natural Language Processing (NLP) domain, the most prevalent task is comment classification, which remains challenging, and transformer-based language models are at the forefront of progress in this area. This paper analyzes the performance of transformer-based language models such as BERT, ALBERT, RoBERTa, and DistilBERT on binary classification of Indian hate speech datasets. We use the existing HASOC (Hindi and Marathi) and HS-Bangla datasets. We evaluate several multilingual language models, such as MuRIL-BERT and XLM-RoBERTa, and a few monolingual language models, such as RoBERTa-Hindi, Maha-BERT (Marathi), Bangla-BERT (Bangla), and Assamese-BERT (Assamese), and we also perform cross-lingual experiments. For further analysis, we run multilingual, monolingual, and cross-lingual experiments on our Hate Speech Assamese (HS-Assamese; Indo-Aryan language family) and Hate Speech Bodo (HS-Bodo; Sino-Tibetan language family) datasets (HS dataset version 2) and achieve promising results. The motivation for the cross-lingual experiments is to encourage researchers to explore the power of transformers. Note that no pre-trained language models are currently available for Bodo or any other Sino-Tibetan language.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Table 1. Class-wise distribution for the HASOC-Hindi (2019), HASOC-Marathi (2021), and HS-Bangla (2021) datasets

Figure 1. Dataset samples of (a) HASOC-Hindi (2019), (b) HASOC-Marathi (2021), and (c) HS-Bangla (2021) datasets, respectively.

Table 2. Class distribution of the training and test sets for the HS-Assamese and HS-Bodo datasets

Figure 2. Samples of the (a) HS-Assamese and (b) HS-Bodo datasets, where hate comments are tagged HOF and all others NOT.

Figure 3. Architecture of hate speech detection model which includes (a) input representation, (b) transformer encoder block, and (c) BERT model.

Figure 4. Experiments with the hate speech detection model: (a) multilingual experiment, (b) monolingual experiment, and (c) cross-lingual experiment.

Table 3. Hyperparameters for all the experiments

Table 4. Calculations of precision, recall, F1 score, and accuracy of various TLMs on HASOC-Hindi (2019), HASOC-Marathi (2021), HS-Bangla (2021), HS-Assamese, and HS-Bodo datasets, respectively
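
As a reminder of how the four measures named in the Table 4 caption relate to a binary confusion matrix, here is a minimal Python sketch. It assumes HOF is treated as the positive class; the function name `binary_metrics` and the counts are hypothetical and are not taken from the paper, which may report macro- or weighted-averaged scores instead.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy for the positive (HOF) class.

    tp/fp/fn/tn are confusion-matrix counts; zero denominators yield 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Made-up counts for illustration only; not results from the paper.
example = binary_metrics(tp=40, fp=10, fn=20, tn=30)
print(example)
```

Swapping the roles of HOF and NOT (so that `tn` becomes `tp`, and so on) gives the per-class scores for NOT; averaging the two per-class values gives a macro score.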

Figure 5. Confusion matrix of best models such as MuRIL-BERT for HASOC-Hindi (2019) (a), Maha-BERT for HASOC-Marathi (2021) (b), MuRIL-BERT for HS-Bangla (2021) (c), Assamese-BERT for HS-Assamese (d) and m-BERT for HS-Bodo (e).