
EHMMQA: English, Hindi, and Marathi multilingual question answering framework using deep learning

Published online by Cambridge University Press: 24 May 2024

Pawan Lahoti*
Affiliation:
Department of Computer Science and Engineering, Malaviya National Institute of Technology Jaipur, Rajasthan, India.
Namita Mittal
Affiliation:
Department of Computer Science and Engineering, Malaviya National Institute of Technology Jaipur, Rajasthan, India.
Girdhari Singh
Affiliation:
Department of Computer Science and Engineering, Malaviya National Institute of Technology Jaipur, Rajasthan, India.
Corresponding author: Pawan Lahoti; Email: lahotipawan@gmail.com

Abstract

Multilingual question answering (MQA) provides effective access to multilingual data by returning accurate and precise answers irrespective of the query language. Although a wide range of datasets is available for monolingual QA systems in natural language processing, benchmark datasets specifically designed for MQA remain considerably limited. The absence of comprehensive benchmark datasets hinders the development and evaluation of MQA systems. To overcome this issue, the proposed work develops the EHMQuAD dataset, an MQA dataset for the low-resource languages Hindi and Marathi alongside English. EHMQuAD is built using a synthetic corpora generation approach, and an alignment step is performed after translation to make the dataset more accurate. Further, the EHMMQA model is proposed as an abstract framework based on a deep neural network that accepts question-context pairs and returns an accurate answer to each question. Shared question and shared context representations are designed separately to build this system. Experiments with the proposed model are conducted on the MMQA, Translated SQuAD, XQuAD, MLQA, and EHMQuAD datasets, with exact match (EM) and F1-score as performance measures. The proposed model (EHMMQA) is compared with state-of-the-art MQA baseline models under all possible monolingual and multilingual settings. The results show that EHMMQA is a considerable step toward an MQA system covering Hindi and Marathi, establishing a new state of the art for these languages.
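The abstract names EM and F1-score as the evaluation measures. As a rough illustration only, and not the authors' implementation, the sketch below computes SQuAD-style exact match and token-level F1 for a predicted answer span against a single gold answer; the normalization rules (lowercasing, punctuation stripping, English article removal) are the common SQuAD conventions and are an assumption here, and for datasets with several gold answers the score is typically the maximum over all references.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace (SQuAD-style; assumed here)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    """EM is 1.0 when the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the predicted and gold answer spans."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: an exact span match after normalization, and a partial overlap
# that earns no EM credit but a non-zero F1.
print(exact_match("the Taj Mahal", "Taj Mahal"))   # 1.0
print(f1_score("Mahal in Agra", "the Taj Mahal"))  # 0.4
```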

Information

Type
Article
Creative Commons
CC BY-NC
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial licence (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use.
Copyright
© The Author(s), 2024. Published by Cambridge University Press
Figures and tables

Table 1. List of acronyms and notations used in the work
Algorithm 1. Pseudo code for the modified TAR methodology
Figure 1. NMT architectures: (a) LSTM-based model, (b) bi-LSTM-based model, (c) transformer-based model (Vaswani et al. 2017)
Table 2. BLEU scores of en-hi and en-mr translation
Table 3. Statistics of the EHMQuAD dataset over SQuAD v1.1
Figure 2. An example of the EHMQuAD dataset
Figure 3. Examples showing types of errors
Figure 4. Proposed unified model of MQA
Table 4. Statistics of the MQA dataset
Table 5. Details of monolingual and multilingual settings for experimental setup-I
Table 6. Details of monolingual and multilingual settings for experimental setup-II
Table 7. Details of the monolingual model
Table 8. Performance analysis of the proposed model (monolingual settings, experimental setup-I) against various other models
Table 9. Performance analysis of the proposed model (multilingual settings, experimental setup-I) against various other models
Table 10. Performance analysis of the proposed model (monolingual settings, experimental setup-II) against various other models
Table 11. Performance analysis of the proposed model (multilingual settings, experimental setup-II) against various other models
Table 12. Statistical test results comparing the proposed model with the baseline model
Table 13. Ablation study of various word representations