Word sense disambiguation corpus for Kashmiri

Published online by Cambridge University Press:  27 May 2024

Tawseef Ahmad Mir*
Affiliation:
Alliance School of Advanced Computing, Alliance University, Bangalore, India
Aadil Ahmad Lawaye
Affiliation:
Department of Computer Science, Baba Ghulam Shah Badshah University, Rajouri, India
*Corresponding author: Tawseef Ahmad Mir; Email: tawseefmir1191@gmail.com

Abstract

Ambiguity is considered an inherent attribute of all natural languages. Word sense disambiguation (WSD) is the process of assigning the correct interpretation to an ambiguous word based on the context in which it occurs. Supervised approaches to WSD generally perform better than their counterparts, but they require a sense-annotated corpus to carry out the disambiguation process. This paper presents the first standard WSD dataset for the Kashmiri language. The raw corpus used to develop the sense-annotated dataset was collected from different resources and contains about 1 M tokens. From this raw corpus, a sense-annotated corpus was created for 124 commonly used ambiguous Kashmiri words. The senses used in the annotation process were obtained from Kashmiri WordNet, an important lexical resource for the Kashmiri language. The resulting sense-tagged corpus is diverse in nature and contains 19,854 sentences. Based on this annotated corpus, the Lexical Sample WSD task for Kashmiri is carried out using different machine-learning algorithms (J48, IBk, Naive Bayes, Dl4jMlpClassifier, SVM). The models are trained on two kinds of features: bag-of-words (BoW) and word embeddings obtained using the Word2Vec model. Performance is measured with standard metrics, viz. accuracy, precision, recall, and F1-measure. The results vary with both the algorithm and the features used: with the BoW model, SVM reported better results than the other algorithms, whereas Dl4jMlpClassifier performed better with word embeddings.
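
To make the lexical-sample setup concrete, the following is a minimal sketch of supervised WSD with bag-of-words features, evaluated with accuracy and macro-averaged precision, recall, and F1. It is not the authors' implementation: the classifier names in the abstract suggest a Weka-based toolchain, whereas this sketch uses a small Naive Bayes classifier written from scratch in Python. The English target word "bank", the sense labels `finance` and `river`, and the toy sentences are all hypothetical stand-ins for the annotated Kashmiri data.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(sentence, target):
    """Lowercased context tokens, excluding the target word itself."""
    return [tok for tok in sentence.lower().split() if tok != target]


class NaiveBayesWSD:
    """Multinomial Naive Bayes over BoW features with add-one smoothing."""

    def fit(self, contexts, senses):
        self.sense_counts = Counter(senses)
        self.word_counts = defaultdict(Counter)
        for tokens, sense in zip(contexts, senses):
            self.word_counts[sense].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total = sum(self.sense_counts.values())
        best_sense, best_score = None, -math.inf
        for sense, n in self.sense_counts.items():
            counts = self.word_counts[sense]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior of the sense
            for tok in tokens:
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense


def macro_scores(gold, predicted):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(g == p for g, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical toy data: (sentence, sense label) for the target word "bank".
TARGET = "bank"
TRAIN = [
    ("he deposited money in the bank", "finance"),
    ("the bank approved the loan", "finance"),
    ("they sat on the bank of the river", "river"),
    ("the river bank was muddy", "river"),
]
TEST = [
    ("the bank gave her a loan", "finance"),
    ("fish swam near the river bank", "river"),
]

model = NaiveBayesWSD().fit(
    [bag_of_words(s, TARGET) for s, _ in TRAIN],
    [sense for _, sense in TRAIN],
)
gold = [sense for _, sense in TEST]
predictions = [model.predict(bag_of_words(s, TARGET)) for s, _ in TEST]
scores = macro_scores(gold, predictions)
print(predictions, scores)
```

In a lexical-sample task one such model is trained per target word, so the same procedure would be repeated for each of the 124 ambiguous words, with the sense labels drawn from the sense inventory rather than invented as here.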

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press
Figure 1. Architecture for sense annotated dataset preparation process.

Figure 2. PoS tagged instance extracted from PoS tagged dataset.

Table 1. Target-words with total senses in Kashmiri WordNet and instances in annotated Dataset.

Table 2. Senses for word (thud) in Kashmiri WordNet

Figure 3. Sense inventory snapshot.

Figure 4. Example sentence from sense annotated corpus.

Table 3. Results Produced by Different Machine Learning Algorithms Using Different Features

Table 4. Senses predictions for the word (kaem) by Dl4jMlpClassifier

Figure 5. Average accuracies of SVM and Dl4jMlpClassifier with respect to number of senses.