Part-of-speech tagger for Bodo language using deep learning approach

Published online by Cambridge University Press: 03 June 2024

Dhrubajyoti Pathak*
Affiliation:
Centre for Linguistic Science and Technology, IIT Guwahati, Guwahati, Assam, India
Sanjib Narzary
Affiliation:
Centre for Linguistic Science and Technology, IIT Guwahati, Guwahati, Assam, India
Sukumar Nandi
Affiliation:
Centre for Linguistic Science and Technology, IIT Guwahati, Guwahati, Assam, India
Bidisha Som
Affiliation:
Centre for Linguistic Science and Technology, IIT Guwahati, Guwahati, Assam, India
*Corresponding author: Dhrubajyoti Pathak; Email: drbj153@iitg.ac.in

Abstract

Language processing systems such as part-of-speech (POS) tagging, named entity recognition, machine translation, speech recognition, and language modeling have been well studied in high-resource languages. Nevertheless, research on these systems for several low-resource languages, including Bodo, Mizo, Nagamese, and others, is either yet to commence or is in its nascent stages. Language models (LMs) play a vital role in the downstream tasks of modern natural language processing. Extensive studies have been carried out on LMs for high-resource languages. However, these low-resource languages are still underrepresented. In this study, we first present BodoBERT, an LM for the Bodo language. To the best of our knowledge, this work is the first such effort to develop an LM for Bodo. Second, we present an ensemble deep learning-based POS tagging model for Bodo. The POS tagging model is based on combinations of BiLSTM with conditional random field and stacked embedding of BodoBERT with BytePairEmbeddings. We cover several LMs in the experiment to see how well they work in POS tagging tasks. The best-performing model achieves an F1 score of 0.8041. A comparative experiment was also conducted on Assamese POS taggers, considering that the language is spoken in the same region as Bodo.
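
The "stacked embedding" mentioned in the abstract can be read as concatenating, for each token, the vectors produced by several embedding sources before they reach the sequence tagger. The sketch below illustrates only that idea, under that assumption. The two embedders are hypothetical toy functions, not BodoBERT or BytePairEmbeddings, and the name `stack_embeddings` is introduced here for illustration.

```python
from typing import Callable, List

Vector = List[float]
Embedder = Callable[[str], Vector]


def stack_embeddings(token: str, embedders: List[Embedder]) -> Vector:
    """Concatenate the vectors that each embedder returns for one token."""
    stacked: Vector = []
    for embed in embedders:
        stacked.extend(embed(token))
    return stacked


# Hypothetical stand-in for a contextual LM embedding (3 dimensions).
def toy_contextual(token: str) -> Vector:
    return [float(len(token)), 0.0, 1.0]


# Hypothetical stand-in for a subword embedding (2 dimensions).
def toy_subword(token: str) -> Vector:
    return [float(token.count("a")), float(token.count("o"))]


sentence = ["bodo", "raw"]
# One stacked feature vector per token: 3 + 2 = 5 dimensions each.
features = [stack_embeddings(t, [toy_contextual, toy_subword]) for t in sentence]
```

In a tagger of the kind the abstract describes, per-token vectors like these would be the input to the BiLSTM, with the CRF layer scoring the resulting tag sequence.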

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial licence (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Table 1. Statistics of Bodo POS annotated dataset

Table 2. Dataset format

Table 3. Tagset used in the dataset

Table 4. Training datasize of different language models

Table 5. POS tagging performance in the stacked method using BiLSTM-CRF architecture

Figure 1. Block diagram of POS tagging model.

Table 6. Performance of POS tagging model in different methods

Table 7. The F1 score of different language models in POS tagging task on Bodo and Assamese language

Table 8. Tag-wise performance of best-performing Bodo POS tagging model

Figure 2. Learning curve of BodoBERT $+$ BytePairEmbeddings based POS model.

Figure 3. Confusion matrix of BodoBERT $+$ BytePairEmbeddings POS model.