
Probing a pretrained RoBERTa on Khasi language for POS tagging

Published online by Cambridge University Press:  06 September 2024

Aiom Minnette Mitri
Affiliation:
Department of Information Technology, North Eastern Hill University, Shillong, Meghalaya, India
Eusebius Lawai Lyngdoh
Affiliation:
Department of Information Technology, North Eastern Hill University, Shillong, Meghalaya, India
Sunita Warjri
Affiliation:
Faculty of Fisheries and Water Protection, University of South Bohemia in České Budějovice, Czech Republic
Goutam Saha
Affiliation:
Department of Information Technology, North Eastern Hill University, Shillong, Meghalaya, India
Saralin A. Lyngdoh
Affiliation:
Department of Linguistics, North Eastern Hill University, Shillong, Meghalaya, India
Arnab Kumar Maji*
Affiliation:
Department of Linguistics, North Eastern Hill University, Shillong, Meghalaya, India
Corresponding author: Arnab Kumar Maji; Email: akmaji@nehu.ac.in

Abstract

Part-of-speech (POS) tagging, though considered preliminary to any Natural Language Processing (NLP) task, is crucial to account for, especially in a low-resource language like Khasi, which lacks any form of formal corpus. POS tagging is context sensitive, which makes the task challenging. In this paper, we investigate a deep learning approach to the POS tagging problem in Khasi. A deep learning model called Robustly Optimized BERT Pretraining Approach (RoBERTa) is pretrained on a language modelling task. We then create RoBERTa for POS (RoPOS) tagging, a model that performs POS tagging by fine-tuning the pretrained RoBERTa and leveraging its embeddings for the downstream POS tagging task. The existing tagset designed specifically for the Khasi language is employed for this work, and the corresponding tagged dataset is taken as our base corpus. Further, we propose additional tags for this tagset to meet the requirements of the language and increase the size of the existing Khasi POS corpus. Other machine learning and deep learning models are also trained and tested on the same task, and a comparative analysis is made of the various models employed. Two different setups are used for the RoPOS model, and the best testing accuracy achieved is 92 per cent. Comparative analysis indicates that RoPOS outperforms the other models when used for inference on texts outside the domain of the POS-tagged training dataset.
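The core idea described in the abstract — a pretrained RoBERTa encoder fine-tuned with a token-classification head for POS tagging — can be sketched roughly as below. This is an illustrative sketch using the Hugging Face `transformers` API, not the authors' actual RoPOS code; the tiny configuration and tagset size are hypothetical stand-ins for the Khasi-pretrained model and tagset.

```python
# Minimal sketch of the RoPOS idea: a RoBERTa encoder topped with a
# token-classification head that predicts one POS tag per subword token.
# All sizes here are illustrative, not the paper's settings.
import torch
from transformers import RobertaConfig, RobertaForTokenClassification

NUM_TAGS = 54  # hypothetical size of the Khasi POS tagset

# A tiny randomly initialised config stands in for the pretrained Khasi RoBERTa;
# in practice one would load the pretrained checkpoint instead.
config = RobertaConfig(
    vocab_size=5000,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=NUM_TAGS,
)
model = RobertaForTokenClassification(config)

# Dummy batch: 2 sentences of 10 subword tokens each.
input_ids = torch.randint(0, 5000, (2, 10))
attention_mask = torch.ones_like(input_ids)
out = model(input_ids=input_ids, attention_mask=attention_mask)

# The head emits one score per tag per subword token.
print(out.logits.shape)  # torch.Size([2, 10, 54])
```

Fine-tuning then amounts to training this model on (subword tokens, aligned tag ids) pairs with the usual cross-entropy objective.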

Information

Type
Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Figure 1. BERT architecture (Image source: Thabah et al. 2022).

Table 1. List of borrowed words extracted from the corpus

Figure 2. Aligning labels to tokens.
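The label-to-token alignment shown in Figure 2 is a standard step when POS tags are word-level but the model consumes subword tokens: the first piece of each word keeps the word's tag, while continuation pieces and special tokens are masked with −100 so the loss ignores them. The helper below is a hedged, self-contained sketch of this convention (the example tokenisation and tag ids are invented for illustration).

```python
# Sketch of word-label to subword-token alignment, as used when fine-tuning
# a subword-based model (BPE or WordPiece) for POS tagging.
def align_labels(word_labels, word_ids):
    """word_ids maps each subword token to its source-word index
    (None for special tokens such as <s> / </s>)."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)              # special tokens: ignored by the loss
        elif wid != prev:
            aligned.append(word_labels[wid])  # first piece keeps the word's tag
        else:
            aligned.append(-100)              # continuation pieces: masked out
        prev = wid
    return aligned

# Hypothetical two-word sentence with tag ids [3, 7], where the second
# word splits into two subword pieces and the sequence is wrapped in specials.
word_ids = [None, 0, 1, 1, None]
print(align_labels([3, 7], word_ids))  # [-100, 3, 7, -100, -100]
```

The −100 sentinel matches the default `ignore_index` of PyTorch's cross-entropy loss, so masked positions contribute nothing to training.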

Table 2. Dataset for training, testing and validation

Table 3. Hardware configuration

Table 4. Parameter and hyperparameter setting for pretraining RoBERTa

Figure 3. Basic structure of RoPOS model.

Table 5. RoPOS model parameter and hyperparameter setting

Figure 4. Accuracy and Loss of RoPOS.

Table 6. Deep learning models’ configuration

Figure 5. Accuracy plots of deep learning models.

Table 7. Machine learning models’ configuration

Table 8. Reported accuracy

Table 9. RoPOS with BPE: classification report on a random article

Table 10. RoPOS with Wordpiece Encoding: classification report on a random article

Table 11. F1-score, Precision and Recall while inferencing on texts from the same domain (Newspaper Articles)

Table 12. F1-score, Precision and Recall while inferencing on out-of-domain texts taken from the Bible

Table 13. F1-score, Precision and Recall while inferencing on out-of-domain literary Khasi texts

Table 14. Observed performance of RoPOS with Wordpiece

Figure 6. Confusion Matrix on inferencing with RoPOS (with Wordpiece tokens) on a random article from the newspaper domain.

Figure 7. Confusion Matrix on inferencing with RoPOS (with Wordpiece tokens) on a random text from the Bible domain.

Figure 8. Confusion Matrix on inferencing with RoPOS (with Wordpiece tokens) on a random article from the domain of Khasi Literary texts.