
Is Attention always needed? A case study on language identification from speech

Published online by Cambridge University Press:  31 May 2024

Atanu Mandal*
Affiliation:
Department of Computer Science and Engineering, Jadavpur University, Kolkata, INDIA
Santanu Pal
Affiliation:
Wipro AI Lab, Wipro India Limited, Bengaluru, INDIA
Indranil Dutta
Affiliation:
School of Languages and Linguistics, Jadavpur University, Kolkata, INDIA
Mahidas Bhattacharya
Affiliation:
School of Languages and Linguistics, Jadavpur University, Kolkata, INDIA
Sudip Kumar Naskar
Affiliation:
Department of Computer Science and Engineering, Jadavpur University, Kolkata, INDIA
*
Corresponding author: Atanu Mandal; Email: atanumandal0491@gmail.com

Abstract

Language identification (LID) is a crucial preliminary step in Automatic Speech Recognition (ASR) that identifies the spoken language from audio samples. Contemporary systems that can process speech in multiple languages require users to explicitly specify one or more languages before use. The LID task therefore plays a significant role in multilingual settings where ASR systems cannot determine the spoken language, leading to unsuccessful speech recognition outcomes. The present study introduces a convolutional recurrent neural network (CRNN)-based LID model designed to operate on the mel-frequency cepstral coefficient (MFCC) features of audio samples. Furthermore, we replicate certain state-of-the-art methodologies, specifically the convolutional neural network (CNN) and the Attention-based convolutional recurrent neural network (CRNN with Attention), and conduct a comparative analysis with our CRNN-based approach. We conducted comprehensive evaluations on thirteen distinct Indian languages, and our model achieved over 98 per cent classification accuracy. The LID model exhibits high performance, ranging from 97 per cent to 100 per cent, for languages that are linguistically similar. The proposed LID model extends readily to additional languages and demonstrates strong robustness to noise, achieving 91.2 per cent accuracy in a noisy setting when applied to a European Language (EU) dataset.
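The abstract's pipeline rests on MFCC features extracted from each audio sample. As background, here is a minimal NumPy sketch of the standard MFCC extraction steps (framing, windowed FFT, mel filterbank, log, DCT); all parameter values are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch: frame -> windowed FFT -> mel filterbank -> log -> DCT.
    Parameters are illustrative, not the paper's configuration."""
    # Frame the signal and apply a Hann window
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hanning(n_fft)
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log mel energies, then DCT-II keeping the first n_ceps coefficients
    logmel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return logmel @ dct.T  # shape: (n_frames, n_ceps)
```

A one-second clip at 16 kHz yields a `(97, 13)` feature matrix with these settings; in practice a tuned library implementation (e.g. librosa) would be used instead.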

Information

Type
Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press
Table 1. List of languages in the Eighth Schedule of the Constitution of India (as of 1 December 2007), with their language families and the states in which they are spoken

Figure 1. Our CRNN framework, consisting of a convolution block and an LSTM block. The convolution block extracts features from the input audio; the output of the final convolution layer is fed as input to a bidirectional LSTM network, which is followed by a linear layer with a softmax classifier.
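The architecture in Figure 1 can be sketched in PyTorch as follows. This is a schematic illustration only: the layer counts, channel widths, kernel sizes, and hidden size are assumptions for exposition, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Sketch of the Figure 1 pipeline: conv block -> BiLSTM -> linear + softmax.
    Layer counts and sizes are illustrative assumptions, not the paper's exact setup."""
    def __init__(self, n_mfcc=13, n_classes=13, hidden=128):
        super().__init__()
        # Convolution block over (batch, 1, time, n_mfcc) inputs
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),  # downsample along time only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        # Bidirectional LSTM over the flattened conv feature maps
        self.lstm = nn.LSTM(64 * n_mfcc, hidden, batch_first=True,
                            bidirectional=True)
        # Linear layer producing per-language scores
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                      # x: (batch, time, n_mfcc)
        h = self.conv(x.unsqueeze(1))          # (batch, 64, time//4, n_mfcc)
        b, c, t, f = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)
        out, _ = self.lstm(h)                  # (batch, time//4, 2*hidden)
        logits = self.fc(out[:, -1])           # last time step summarises the clip
        return logits.log_softmax(dim=-1)      # log-probabilities over languages
```

For example, `CRNN()(torch.randn(2, 100, 13))` returns a `(2, 13)` tensor of log-probabilities over the thirteen language classes.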

Figure 2. Schematic diagram of the Attention module.

Table 2. Statistics of the Indian language (IN) dataset

Table 3. Statistics of the EU dataset

Table 4. Comparative evaluation results (in terms of accuracy) of our model and the model of Kulkarni et al. (2022) on the Indian language dataset

Table 5. Experimental results for Indian languages

Table 6. Confusion matrix for CRNN with Attention framework

Table 7. Confusion matrix for CRNN

Table 8. Confusion matrix for CNN

Table 9. Most common errors

Table 10. Experimental results of LID for close languages

Table 11. Confusion matrix for cluster 1

Table 12. Confusion matrix for cluster 2

Table 13. Confusion matrix for cluster 3

Table 14. Comparative evaluation results (in terms of accuracy) of our model and the model of Bartz et al. (2017) on the EU dataset

Table 15. Ablation study on convolution kernel sizes

Table 16. Experimental results for manually balancing the samples for each category to 100

Table 17. Experimental results for manually balancing the samples for each category to 200

Table 18. Experimental results for manually balancing the samples for each category to 571

Table 19. A comprehensive performance analysis of our various proposed frameworks

Figure 3. Comparison of model results for varying dataset sizes.

Table A1. Confusion matrix of manually balancing the samples for each category to 100 with CNN

Table A2. Confusion matrix of manually balancing the samples for each category to 200 with CNN

Table A3. Confusion matrix of manually balancing the samples for each category to 571 with CNN

Table B1. Confusion matrix of manually balancing the samples for each category to 100 with CRNN

Table B2. Confusion matrix of manually balancing the samples for each category to 200 with CRNN

Table B3. Confusion matrix of manually balancing the samples for each category to 571 with CRNN

Table C1. Confusion matrix of manually balancing the samples for each category to 100 with CRNN with Attention

Table C2. Confusion matrix of manually balancing the samples for each category to 200 with CRNN with Attention

Table C3. Confusion matrix of manually balancing the samples for each category to 571 with CRNN with Attention