
Sentiment analysis of code-mixed Dravidian languages leveraging pretrained model and word-level language tag

Published online by Cambridge University Press:  11 September 2024

Supriya Chanda*
Affiliation:
Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, Uttar Pradesh, India
Anshika Mishra
Affiliation:
Vellore Institute of Technology Bhopal, Bhopal, Madhya Pradesh, India
Sukomal Pal
Affiliation:
Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, Uttar Pradesh, India
*
Corresponding author: Supriya Chanda; Email: supriyachanda.rs.cse18@itbhu.ac.in

Abstract

The exponential growth of social media data in the era of Web 2.0 has necessitated advanced techniques for sentiment analysis. While sentiment analysis on monolingual datasets has received significant attention, sentiment analysis on code-mixed datasets remains understudied. Code-mixed data often contain a mixture of monolingual content (possibly in transliterated form), single-script but multilingual content, and multi-script multilingual content. This paper explores the issue from three important angles. What is the best strategy for handling such data for sentiment detection? Should the classifier be trained on the whole dataset or only on its pure code-mixed subset? How important is language identification (LID) for the task, and if LID is performed, how and when should it be used to yield the best performance? We explore these questions in the light of three datasets of Tamil–English, Kannada–English, and Malayalam–English YouTube social media comments. Our solution incorporates mBERT and an optional LID module. We report our results using a set of metrics: precision, recall, $F_1$ score, and accuracy. The solutions provide considerable performance gains and some interesting insights for sentiment analysis of code-mixed data.
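To make the notion of a word-level language tag concrete, the following is a minimal sketch of a script-based tagger for the three language pairs, assuming Unicode block membership as the tagging rule. The function names (`word_lang_tag`, `tag_sentence`) and the tag set (`ta`, `kn`, `ml`, `en`, `univ`) are illustrative, not the paper's actual LID module; in particular, romanized (transliterated) Dravidian words are written in Latin script and would be tagged `en` by this rule, which is precisely why a learned LID component can matter.

```python
def word_lang_tag(word: str) -> str:
    """Tag a token by Unicode script block (illustrative rule-based LID).

    Tamil:     U+0B80–U+0BFF
    Kannada:   U+0C80–U+0CFF
    Malayalam: U+0D00–U+0D7F
    ASCII alphabetic tokens are treated as English/romanized; everything
    else (numbers, emoji, punctuation) is tagged language-universal.
    """
    for ch in word:
        cp = ord(ch)
        if 0x0B80 <= cp <= 0x0BFF:
            return "ta"
        if 0x0C80 <= cp <= 0x0CFF:
            return "kn"
        if 0x0D00 <= cp <= 0x0D7F:
            return "ml"
    if word.isascii() and word.isalpha():
        return "en"
    return "univ"


def tag_sentence(text: str) -> list[tuple[str, str]]:
    """Whitespace-tokenize and tag each token."""
    return [(w, word_lang_tag(w)) for w in text.split()]
```

A caveat on the design: script detection is cheap and deterministic, but it can only separate scripts, not languages, so single-script multilingual content (e.g. fully romanized Tamil–English comments) needs a statistical LID model instead.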

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press
Figure 1. A monolingual example from Tamil–English dataset.

Figure 2. A code-mixed example from Tamil–English dataset.

Table 1. Data distribution for sentiment detection of code-mixed text in Dravidian languages

Table 2. Example of code-mixed text in Dravidian languages from three language pairs for all classes

Figure 3. Model architecture for identifying monolingual and code-mixed data.

Figure 4. Model architecture for multi-class classification with rule-based language tag.

Figure 5. Model architecture for hierarchical approach with mBERT.

Table 3. The statistics of monolingual and code-mixed data involved in training, development, and test datasets of all three language pairs

Table 4. Level of code-mixing (CMI values) involved in training, development, and test datasets of all three language pairs
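For readers unfamiliar with the CMI values reported in Table 4, the following is a minimal sketch of the standard Code-Mixing Index (Das and Gambäck's formulation), computed from word-level language tags. The function name `cmi` and the `univ` tag for language-independent tokens are assumptions for illustration.

```python
from collections import Counter


def cmi(tags: list[str]) -> float:
    """Code-Mixing Index for one utterance.

    CMI = 100 * (1 - max_i(w_i) / (n - u)), where n is the number of
    tokens, u the number of language-independent tokens (tagged "univ"),
    and w_i the token count of language i. A purely monolingual
    utterance scores 0; higher values mean heavier mixing.
    """
    n = len(tags)
    lang_tags = [t for t in tags if t != "univ"]
    u = n - len(lang_tags)
    if n == u:  # only language-independent tokens: no mixing measurable
        return 0.0
    dominant = max(Counter(lang_tags).values())
    return 100.0 * (1 - dominant / (n - u))
```

For example, an utterance tagged `["ta", "ta", "en", "en"]` is maximally balanced between two languages and scores 50.0, while an all-Tamil utterance scores 0.0.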

Table 5. Precision, recall, $F_1$-scores, and support for all experiments on Tamil–English test data

Table 6. Precision, recall, $F_1$-scores, and support for all experiments on Kannada–English test data

Table 7. Precision, recall, $F_1$-scores, and support for all experiments on Malayalam–English test data

Table 8. Weighted average $F_1$-scores and support for three language pairs where the model is trained only on code-mixed (CM) data but tested on all, monolingual, and CM data

Table 9. Comparison of precision, recall, $F_1$-scores, and support with our proposed model for RQ-2 on Tamil–English test data

Table 10. Comparison of precision, recall, $F_1$-scores, and support with our proposed model for RQ-2 on Kannada–English test data

Table 11. Precision, recall, $F_1$-scores, and support with our proposed model for RQ-2 on Malayalam–English test data

Table 12. Comparison of precision, recall, $F_1$-scores, and support with our proposed model for RQ-3 on Tamil–English test data

Table 13. Comparison of precision, recall, $F_1$-scores, and support with our proposed model for RQ-3 on Kannada–English test data

Table 14. Comparison of precision, recall, $F_1$-scores, and support with our proposed model for RQ-3 on Malayalam–English test data

Table 15. Errors in the gold standard

Table 16. Errors made by the LID model