
A case study on decompounding in Indian language IR

Published online by Cambridge University Press:  03 June 2024

Siba Sankar Sahu*
Affiliation:
School of Computer Science Engineering and Technology, Bennett University, Greater Noida, Uttar Pradesh, India
Sukomal Pal
Affiliation:
Indian Institute of Technology (BHU), Department of Computer Science and Engineering, Varanasi, Uttar Pradesh, India
*
Corresponding author: Siba Sankar Sahu; Email: sibasankarsahu.rs.cse17@itbhu.ac.in, siba.sahu@bennett.edu.in

Abstract

Decompounding is an essential preprocessing step in text-processing tasks such as machine translation, speech recognition, and information retrieval (IR). Here, the IR issues are explored from five viewpoints. (A) Does word decompounding impact Indian language IR? If yes, to what extent? (B) Can corpus-based decompounding models be used in Indian language IR? If yes, how? (C) Can machine learning-based and deep learning-based decompounding models be applied in Indian language IR? If yes, how? (D) Among the different decompounding models (corpus-based, hybrid machine learning-based, and deep learning-based), which provides the best effectiveness in the IR domain? (E) Among the different IR models, which provides the best effectiveness from the IR perspective? This study proposes different corpus-based, hybrid machine learning-based, and deep learning-based decompounding models for Indian languages (Marathi, Hindi, and Sanskrit), and evaluates each model from the IR perspective only. It is observed that the different decompounding models improve IR effectiveness. The deep learning-based decompounding models outperform the corpus-based and hybrid machine learning-based models in Indian language IR. Among the deep learning-based models, the Bi-LSTM-A model performs best and improves mean average precision (MAP) by 28.02% in Marathi. Similarly, the Bi-RNN-A model improves MAP by 18.18% and 6.1% in Hindi and Sanskrit, respectively. Among the retrieval models, the In_expC2 model outperforms others in Marathi and Hindi, and the BB2 model outperforms others in Sanskrit.
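To make the corpus-based idea concrete, the sketch below shows a minimal frequency-based decompounding baseline in the spirit of the corpus-based models discussed above (an illustrative assumption, not the authors' exact method): a compound is split at the position whose two parts have the highest geometric-mean corpus frequency, and kept whole when no split outscores the whole-word frequency. The toy word and frequency counts are hypothetical.

```python
from math import sqrt

def best_split(word, freq, min_len=3):
    """Return a (left, right) split of `word`, or None to keep it whole.

    `freq` maps corpus tokens to their occurrence counts; `min_len` is the
    minimum length allowed for each constituent part.
    """
    whole = freq.get(word, 0)
    best, best_score = None, float(whole)
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        fl, fr = freq.get(left, 0), freq.get(right, 0)
        if fl and fr:
            # Geometric mean of the two parts' corpus frequencies.
            score = sqrt(fl * fr)
            if score > best_score:
                best, best_score = (left, right), score
    return best

# Hypothetical corpus frequencies for a toy example.
freq = {"rain": 120, "bow": 80, "rainbow": 5}
print(best_split("rainbow", freq))  # → ('rain', 'bow')
```

Real systems for Marathi, Hindi, and Sanskrit additionally have to undo sandhi at the split point (the characters at the boundary change when morphemes join), which is what the learning-based models in the paper are trained to handle.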

Information

Type
Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Figure 1. Sandhi-window (AnaNg) as the prediction target in a compound word.


Figure 2. Model architecture for Sandhi Split – Stage 1.


Figure 3. A basic RNN architecture.


Table 1. Parameters of the different Encoder–Decoder models


Figure 4. A single-cell LSTM architecture.


Figure 5. A single-cell GRU architecture.


Table 2. Statistics of the text collections


Table 3. MAP scores of different corpus-based decompounding models in Marathi (39 T queries)


Table 4. MAP scores of different corpus-based decompounding models in Hindi (50 T queries)


Table 5. MAP scores of different corpus-based decompounding models in Sanskrit (50 T queries)


Table 6. MAP scores of different hybrid machine learning-based decompounding models in Marathi (39 T queries)


Table 7. MAP scores of different hybrid machine learning-based decompounding models in Hindi (50 T queries)


Table 8. MAP scores of different hybrid machine learning-based decompounding models in Sanskrit (50 T queries)


Table 9. MAP scores of different deep learning-based decompounding models in Marathi (39 T queries)


Table 10. MAP scores of different deep learning-based decompounding models in Hindi (50 T queries)


Table 11. MAP scores of different deep learning-based decompounding models in Sanskrit (50 T queries)


Figure 6. Query-by-query evaluation in Marathi with the In_expC2 model.


Figure 7. Query-by-query evaluation in Hindi with the In_expC2 model.


Figure 8. Query-by-query evaluation in Sanskrit with the BB2 model.


Table A.1. List of prefixes used in the different Indian languages for IR