
A bidirectional LSTM-based morphological analyzer for Gujarati

Published online by Cambridge University Press: 31 May 2024

Jatayu Baxi*
Affiliation:
Department of Computer Engineering, Dharmsinh Desai University, Nadiad, India
Brijesh Bhatt
Affiliation:
Department of Computer Engineering, Dharmsinh Desai University, Nadiad, India
*Corresponding author: Jatayu Baxi; Email: jatayubaxi.ce@ddu.ac.in

Abstract

Morphological analysis is a crucial preprocessing stage for building state-of-the-art natural language processing applications. We propose a bidirectional LSTM (long short-term memory)-based approach to developing a morphological analyzer for the Gujarati language. Our analyzer predicts the root word and the morphological features of a given inflected word. We experimented with two methods of label representation for predicting morphological features: the monolithic representation method and the individual label representation method. We also created a gold morphological dataset of 16,234 unique words for the Gujarati language. The dataset contains morpheme splitting and grammatical feature information for each inflected word. Changing the label representation technique in the proposed model improves accuracy over the existing baseline system by a large margin. The proposed system performs well across POS categories without knowledge of language-specific suffix rules.
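The difference between the two label representations can be illustrated with a minimal sketch. The feature inventory below (`FEATURES`) and the helper names are hypothetical and chosen only for illustration; they are not the tag set or code used in the article. The sketch assumes that a monolithic label joins all feature values into one composite class, while the individual representation keeps one label per feature. The monolithic count shown is an upper bound, since in practice only combinations attested in the data become classes.

```python
from itertools import product

# Hypothetical feature inventory, for illustration only (not the article's tag set).
FEATURES = {
    "gender": ["masc", "fem", "neut"],
    "number": ["sg", "pl"],
    "case": ["nom", "erg", "gen", "loc"],
}


def monolithic_label(analysis):
    """Join all feature values into a single composite class label."""
    return "|".join(analysis[name] for name in FEATURES)


def individual_labels(analysis):
    """Keep one label per feature, so each feature can be predicted separately."""
    return {name: analysis[name] for name in FEATURES}


def monolithic_class_count():
    """Upper bound on composite classes: product of the feature value counts."""
    return len(list(product(*FEATURES.values())))


def individual_class_count():
    """Total classes across per-feature predictors: sum of the value counts."""
    return sum(len(values) for values in FEATURES.values())


example = {"gender": "fem", "number": "pl", "case": "gen"}
print(monolithic_label(example))
print(individual_labels(example))
print(monolithic_class_count(), individual_class_count())
```

With this toy inventory, the composite label space grows multiplicatively with each added feature, while the per-feature label space grows additively. This is one plausible reason why a change of label representation could affect accuracy on sparse feature combinations.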

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press
Table 1. List of case markers for Gujarati noun

Table 2. Examples of moods in Gujarati verb

Table 3. Examples of Gujarati verb aspects

Table 4. Gujarati adjective inflection

Table 5. List of features along with corresponding labels

Table 6. Comparison of number of classes in baseline and proposed individual label representation technique

Table 7. Details about dataset

Figure 1. System architecture for the morpheme segmentation and grammatical feature prediction task, respectively.

Figure 2. Example of encoding for the identification of morpheme boundary.

Table 8. Morpheme boundary detection results: POS category wise

Table 9. Comparison of accuracy F1 scores: monolithic and individual feature representation

Table 10. Results for morphological feature tagging task for individual features using proposed label representation technique: POS category and feature wise

Figure 3. Comparison of result accuracy between the monolithic representation and individual feature representation.

Table 11. Analysis of training and validation errors for baseline and present system

Table 12. Comparison of the results of proposed approach with other deep neural network architectures: RNN and LSTM

Table 13. Comparison of results obtained from neural model and unsupervised model

Table 14. System output prediction for the words containing multiple suffix attachments

Table 15. Examples of incorrect root identification