1. Introduction
Folk music embodies a profound legacy and diversity while also transcending cultural boundaries with its universal language. It serves as a means to safeguard our cultural heritage and pass down our history to future generations, ensuring the preservation of our rich legacy. Folk music in India encompasses a vast array of traditions, each reflecting the unique cultural heritage of its respective region. The Northeastern states of India exhibit a particularly diverse and rich folk music tradition. These cultures have a vibrant tapestry of folk songs that are deeply rooted in their history, customs, and rituals. These songs provide insights into the local customs, beliefs, and values of the communities they belong to.
One such community is the Mizo community from Mizoram, which is the southernmost state among the Northeastern states of India, as shown in Figure 1. The Mizo tribe has a significant population, with over 840,000 speakers (as per the 2011 census) within Mizoram as well as in neighboring states such as Manipur, Tripura, and Assam. There are also Mizo-speaking populations in certain parts of Bangladesh, toward the western border of Mizoram, as well as in Myanmar, on the eastern border of Mizoram. Various sub-tribes like Thado, Paite, Lusei, Pawi, and others reside within the Mizo community, with the Lusei dialect adopted as the lingua franca of modern-day Mizoram. The Mizo language belongs to the Tibeto-Burman language family (Weidert 1975), specifically the Kuki-Chin subgroup. The Mizo tribe and its language, with their vibrant cultural traditions and linguistic heritage, contribute to the diverse tapestry of North-East India’s rich ethnic and linguistic mosaic.
The Mizos have a rich and diverse cultural heritage. Their folklore is naturally passed down from previous generations orally. Mizo folk songs depict stories of the Mizo society, tradition, and culture, at a certain time in history. They reflect the glorious past of the Mizos, including their way of farming and harvesting, hunting, war, natural disasters, romance and nuptials, place of females in social strata, place of males in society, etc.

Figure 1. Speaker population of the Mizo language (footnote a), marked with green dots.
According to the works of Thanmawia (1998), Lalthangliana (2005), and Lalremruati (2012, 2019), Mizo folk songs can be categorized under five main themes depending on their purpose and use: hunting and war chants, lamentations, satire, love, and nature themes. Songs are also categorized depending on their types of tune, called thlûk (Lalthangliana 1993; Khiangte 2001, 2002; Lalzarzova 2016). This is termed ‘Hlabu’. Those having the same tune or melody are called ‘hlabu khat’. Different categories from these earlier works are discussed in brief as follows:
- 
1. War chants: Bawh hla (war chants) are chanted solo by warriors after a successful war or raid where they have taken the head of an enemy. These personal and subjective chants are spontaneous and convey the singer’s emotions and mood, reflecting a sense of pride and ego (Lalremruati 2019).
- 
2. Hunting chants: Hlado (hunting chants) share the same melody as Bawh hla and are spontaneously composed and chanted after a triumphant hunt. They typically emphasize the singer’s supremacy over the common man, employing words that express the singer’s ego and pride (Lalremruati 2019).
- 
3. Lamentations: Songs of this nature are traditionally chanted during times of adversity. The Mizos experienced famines during their settlement in the Than ranges, leading to significant loss of life (Lalremruati 2019). These songs, known as ‘ṭhuthmun zai’ (songs sung while sitting), emerged as a way for people to offer condolences and gather together, sitting and singing these songs in solidarity (Thanmawia 1998; Lalremruati 2012, 2019).
- 
4. Satire: These songs serve the purpose of ridicule and can be both aggressive and offensive, but they also encompass cheerful and humorous elements. Known as ‘intuk hla’, they are utilized to lighten the mood in gatherings that are otherwise heavy and tense (Thanmawia 1998; Lalthangliana 2005; Lalremruati 2019).
- 
5. Nature themed: Mizo folk songs frequently depict the beauty of nature and its influence on society, emphasizing the Mizo people’s reliance on and connection with the natural world, including its flora and fauna. These songs often draw parallels between elements of nature and the affection shared between couples, intertwining themes of nature and love (Lalremruati 2019).
- 
6. Couplet and triplet: Mizo folk songs can be categorized based on the number of lines they contain. The first form of folk song, known as a couplet (tlar hnih zai), is believed to consist of two lines (Lalthangliana 1993). It is further believed that earlier songs primarily comprised couplets and triplets (Khiangte 2001).
- 
7. Songs named after individuals: Mizo folk songs are also categorized according to the names of their original composers. Subsequently, other composers utilize the same tune to create different lyrics, a practice often done as a tribute to honor the original composer (Lalremruati 2012; Lalzarzova 2016).
- 
8. Songs named after merry and festive occasions: The Mizos celebrate various festivals, many of which are connected to the agricultural season. Among the most common ones are Chapchar Kût, Mim Kût, and Ṭhalfavang Kût. Chapchar Kût marks the joyous completion of rice plantation, Ṭhalfavang Kût celebrates the harvest, while Mim Kût is a solemn festival dedicated to the souls of the deceased, featuring rituals, feasting, and mournful singing and dancing (Khiangte 2002).
The Indian Government has implemented heritage preservation schemes that aim to preserve and promote oral traditions, performing arts, social and ritual events, etc., from various states. Projects by the All India Radio and the Indira Gandhi National Centre for the Arts (footnote b) focus on preserving dying folk songs and classical Indian music; however, they have not yet included Mizo folk songs or the Mizo language in the Technology Development for Indian Languages (footnote c) repository. It is crucial to protect the language, culture, and traditions of the Mizo community, as Mizo is classified as ‘vulnerable’ on the 2010 UNESCO list of endangered languages (Moseley 2010).
The cultural development of Mizo society has been significantly influenced by globalization, which, together with linguistic changes in the Mizo language, has gradually diminished the significance of traditional folk songs. The vocabulary of these songs differs from spoken Mizo and includes borrowings from the Paite dialect, making it challenging to sing or comprehend the lyrics. As a result, passing down this cultural heritage to the younger generation has become increasingly difficult amidst the rapid social changes, depriving them of access to and practice of folk tales, which are vital for maintaining cultural roots. Thus, preserving folk songs has become even more crucial.
Hence, this work aims to address this issue by proposing a framework for preservation and classification of Mizo folk songs. The main contributions of this work are listed below:
- 
Creation of a Mizo folk song database for research in music processing.
- 
Utilization of the database toward identification of unique acoustic characteristics of the Mizo folk songs from a speech processing point of view.
- 
Acoustic classification of Mizo folk song categories using a long short-term memory (LSTM) network with a custom attention layer (LSTM-attn).
The paper is structured as follows. Section 2 presents a brief summary of existing literature on Music Information Retrieval (MIR), existing analysis methods, features, and classification methods of folk songs in other languages. Section 3 discusses the methodology including data used, acoustic features employed, as well as detailed discussion of the proposed LSTM-attn model. Section 4 discusses the experiments and results. Section 5 concludes with a summarization of the work, its limitations, and future scope.
2. Literature survey
2.1 Music information retrieval and recent techniques
MIR deals with problems of music access, filtering, tool development, and retrieval (Orio 2006). According to Orio (2006), the applications of MIR are intended to help users find specific music in a large collection using a particular similarity matching technique and criteria. Major tasks in MIR include (i) audio fingerprinting, (ii) audio-textual alignment, (iii) cover song identification, (iv) music genre identification and classification, and (v) music recommendation, among others (Srinivasa Murthy and Koolagudi 2018; Blaß and Bader 2019). Our proposed framework focuses on the tasks of music identification and classification. The basic framework of an MIR system is shown in Figure 2.

Figure 2. Block diagram of a typical MIR system.
In Deruty et al. (2022), music production of contemporary pop music is carried out using AI tools. Different musical instrument sound generation tools are utilized with automatic music labeling features in the form of symbolic representations and the coupling of composition with sound editing and mixing. In Shah et al. (2022), music genre classification is carried out using machine learning models such as Support Vector Machines (SVM), Random Forest, Extreme Gradient Boosting, and Convolutional Neural Networks (CNN). They used the popular GTZAN dataset for training and testing. It is seen that CNN performs the best compared to the traditional models. Deep learning models are also used in Mersy (2021), wherein a depth-wise separable CNN is trained on electronic dance music and its performance is validated with a CNN that is tested on a source-separated spectrogram and a normal spectrogram. The source-separated spectrogram proves better in terms of classification performance for a limited dataset. Genre classification on the GTZAN dataset and the Free Music Archive dataset is also undertaken by Ashraf et al. (2020), using deep learning models such as CNN, Recurrent Neural Network (RNN), and CNN-RNN models with global layer regularization (GLR), with Mel-spectrograms used to evaluate the performances. In the GLR technique, every hidden unit of a layer shares the same normalization terms. It is seen that CNN-RNN networks performed better on the two datasets due to the utilization of deep features.
In recent years, retrieval of music information from real-time embeddings has been seen in Stefani and Turchet (2022). Twelve acoustic guitar techniques are compiled, and the onset of such musical instances is detected. Cepstral features are used to train and test Deep Neural Network (DNN) models, which are deployed to a Raspberry Pi-based embedded computer, with an accuracy of 99.1%. Image embeddings and acoustic embeddings are used in Dogan et al. (2022) for zero-shot audio classification. Similarly, in Lazzari, Poltronieri, and Presutti (2023), the pitch class from music structures is embedded into continuous vectors using existing methods and custom encodings using LSTM neural networks. This performs better than those techniques that use chord symbolic annotations.
2.2 Classification methods for identification of folk song and music
In music and singing processing, songs that share the same tune or melodies and have similar acoustic components are grouped together. It is usually done for efficient storage and retrieval, where ordering is necessary for ease of access. Classification of songs is also important in finding out the geographical origin of folk songs/music, for music recommendation, etc.
A few decades back, classification methods based on the ending notes, number of lines in the song, number of syllables in a line, etc., were adopted in Elschekova (1966), Keller (1984), Bohlman (1988), Umapathy, Krishnan, and Jimaa (2005), and Van Kranenburg et al. (2007), but not without problems and limitations (Keller 1984). In recent years, machine learning models have been heavily employed for musical classification, mainly based on the genre. Music genre classification techniques are found in Jiang et al. (2002), Aucouturier and Pachet (2003), Umapathy et al. (2005), Orio (2006), Meng et al. (2007), Lee et al. (2009), and Fu et al. (2010), which are based on different temporal and spectral methods. Several approaches and models have emerged over the years. Conditional Random Fields (CRFs) have been employed by Liu et al. (2007) and Li et al. (2019) along with Gaussian Mixture Models (GMM) and Restricted Boltzmann Machines. In Li, Ding, and Yang (2017), it is seen that CRF-GMM outperformed traditional classification models (by approximately 4.6%–18.13%).
Attention neural network-based architecture for folk song classification is also explored (Arronte-Alvarez and Gomez-Martin 2019). Musical motif embedding is also introduced to represent folk songs in different languages. For motif embedding, the Word2Vec model (Mikolov et al. 2013) has been used and later employed in the ANN architecture. They classified folk songs of Chinese, Swedish, and German origins. Their results are comparable to those in existing studies (Cuthbert, Ariza, and Friedland 2011; Le and Mikolov 2014). In Loh and Emmanuel (2006), the extreme learning machine (ELM) is used for music genre classification, wherein features are extracted from 160 songs of four different genres: classical, pop, rock, and dance music. ELM and SVM have been used to classify these songs.
In the Indian context, classification of Punjabi folk musical instruments from audio segments is carried out by Singh and Koolagudi (2017). Although vocal singing is not considered in their paper, the classification methods they used for polyphonic musical signals could be employed for the classification of vocal singing with multiple singers. A classification accuracy of 71% is achieved using the J48 classifier, which is increased to 91% by further improvement of input data samples. Feature selection is performed based on the performance of the J48 classifier. The selected features are then supplied to eight additional classifiers, where the highest classification rate is achieved by the Logistic classifier (95%). In Das, Bhattacharyya, and Debbarma (2021), a classification system for Kokborok music using traditional machine learning techniques is developed. A computational method to minimize the errors for each class is developed, with an alpha ($\alpha$) value defined to estimate better accuracy, which successfully improved the original classification accuracy. In Das et al. (2023), music source separation is carried out for hunting chants of Mizo folk songs using techniques like REpeating Pattern Extraction Technique (REPET), Robust Principal Component Analysis (RPCA), and Non-negative Matrix Factorization (NMF). It is seen that RPCA obtained the best signal-to-distortion ratio and signal-to-noise ratio for separation of vocals and musical accompaniments, followed by REPET and NMF.
Based on the literature survey, the following research gaps are noted:
- 
Folk songs have received relatively less attention compared to mainstream or commercial music genres. Consequently, there is a limited pool of research studies and resources dedicated to under-resourced folk songs. This limitation affects the depth and breadth of research findings and the development of specialized tools and techniques for analysis. 
- 
Folk songs are often part of an oral tradition, passed down through generations without written documentation. This poses challenges in preserving and documenting these songs, leading to the risk of songs being lost or forgotten over time. 
- 
The oral nature of folk songs and the lack of standardized annotation and metadata for these songs make it difficult to compare and analyze them systematically. 
3. Methodology
In Figure 3, the methodology followed for Mizo folk song classification is depicted. Firstly, the dataset is built by collecting folk songs from different sources. Then pre-processing is carried out for extraction of acoustic features. Next, the extracted features are used for training and testing of the classification models. These different stages of the framework are detailed in the subsequent sections.

Figure 3. Methodology for classification of Mizo folk songs.
3.1 Mizo folk song dataset
The dataset for this study is collected from three sources, namely, publicly available Mizo folk songs of performances in cultural events and competitions, which are sourced from the internet, songs provided by the Art & Culture Department (A&C), Mizoram, and songs collected in field recordings. The internet data included audio from YouTube videos, mainly from documentaries, competition performances, and recordings made by educational institutes. The sampling frequency is 48,000 samples per second in mono channel.
Songs obtained from the A&C Department have vocals accompanied by cow-hide drums. Recording has been done using the Shure SM58 dynamic vocal microphone, at a sampling rate of 44,100 samples per second and a bit rate of 1,411 kbps, in stereo. The songs were originally recorded for use in a folk song competition by the technical staff at the A&C Department, and later shared with the authors of this study. Field data is recorded from a male singer who performed several categories of folk songs. Recording is carried out in a quiet room by the authors of this work. A Zoom H1n portable recorder has been used, with a sampling frequency of 44,100 samples per second and a bit rate of 1,411 kbps, in stereo mode. The recorder is mounted on a tripod and placed approximately 1 ft. from the singer.
Data imbalance is observed mainly due to singers being more familiar with certain song categories than others. From all the collected songs, Hunting chants (Hlado), Children’s songs (Pawnto hla), and Elderly songs (Pi Pu zai) have the largest number of song samples and the longest durations. Here, the Hlado songs obtained from the internet have been chanted in an open field, and the ones from field data are recorded in a quiet room. Pawnto hla has been sung by kids in an open playground. Pi Pu zai has been recorded in a room full of people who sang the songs together in a group. The data distribution can be seen in Table 1. Except for these three chosen categories, other categories contain between 1 and 20 song samples each. This large imbalance makes classification across all available categories infeasible. Hence, for the purpose of this paper, the said three categories have been chosen.
Table 1. Categories of Mizo folk song dataset used in this study

Data pre-processing tasks such as cleaning, segmentation, noise removal, and normalization are carried out on the dataset. Unwanted segments like background noise, coughing sounds, swallowing sounds, and tongue clicking in the recordings are removed. However, in order to avoid aliasing and windowing effects, about 0.5–1 sec regions of silence are left uncut at the start and end of each audio clip. Amplitude normalization is performed by dividing the song signal by its absolute maximum amplitude, in order to keep the amplitudes in the range of −1 to +1.
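The amplitude normalization step described above can be illustrated with a minimal sketch. It assumes the signal is already available as a plain sequence of floating-point samples; the function name `peak_normalize` and the handling of an all-zero clip are illustrative choices, not details given in the paper.

```python
def peak_normalize(samples):
    """Scale samples so that the largest absolute amplitude becomes 1.

    Hypothetical helper: the paper only states that the absolute maximum
    amplitude is used to keep values within [-1, +1].
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silent clip: nothing to scale (assumed behaviour).
        return [0.0 for _ in samples]
    return [s / peak for s in samples]


clip = [0.1, -0.5, 0.25, 0.0]
normalized = peak_normalize(clip)
```

After this step every sample lies between −1 and +1, and at least one sample has magnitude exactly 1.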
A consistent naming structure is maintained for each category of the data: songcategory_source_genderSpeakerNo_speechNo. So, for a folk song type Hlado, the first song, performed by the second male singer obtained from A&C Department, can be written as: hlado_src2_m2_0001.wav. This dataset will be made available upon request to the authors or through the Natural Language Processing Laboratory, Department of Computer Science & Engineering, Tezpur University.
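As an illustration of the naming structure, the sketch below parses and rebuilds a file name of the form shown in the example. The `src` prefix, the single-letter gender code (`m` or `f`), and the four-digit song index are inferred from the single example `hlado_src2_m2_0001.wav` and should be read as assumptions; the function names are hypothetical.

```python
import re

# Assumed pattern: songcategory_source_genderSpeakerNo_speechNo.wav
_NAME = re.compile(
    r"^(?P<category>[a-z]+)_src(?P<source>\d+)"
    r"_(?P<gender>[mf])(?P<speaker>\d+)_(?P<index>\d{4})\.wav$"
)


def parse_name(filename):
    """Split a dataset file name into its labelled parts."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return {
        "category": match["category"],
        "source": int(match["source"]),
        "gender": match["gender"],
        "speaker": int(match["speaker"]),
        "index": int(match["index"]),
    }


def build_name(category, source, gender, speaker, index):
    """Inverse of parse_name: assemble a file name from its parts."""
    return f"{category}_src{source}_{gender}{speaker}_{index:04d}.wav"


info = parse_name("hlado_src2_m2_0001.wav")
```

Keeping the category as the first field makes it straightforward to derive class labels directly from file names.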
3.2 Acoustic feature extraction
For the experiments, Matlab (MATLAB 2022) and Praat (Boersma 2001) tools are used. For the purpose of feature extraction, the songs are sampled at 48,000 samples per second, and frame-wise extraction is carried out. A frame size of 25 msec and a frameshift of 10 msec are used (Paliwal, Lyons, and Wójcicki 2010; O’Shaugnessy 1987, p. 179). The following acoustic features are used:
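To make the framing parameters concrete, the following sketch converts the 25 msec frame size and 10 msec frameshift into sample counts at 48,000 samples per second (1,200 and 480 samples, respectively) and slices a signal into overlapping frames. It is only an illustration in Python, not the Matlab/Praat procedure used in the study; dropping a trailing partial frame is an assumption.

```python
def frame_signal(samples, sample_rate=48000, frame_ms=25, shift_ms=10):
    """Slice a signal into overlapping frames.

    At 48 kHz, 25 ms -> 1200 samples per frame and 10 ms -> 480 samples
    per shift. A trailing partial frame is discarded (assumed behaviour).
    """
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# A silent 3-second clip at 48 kHz, used only to count frames.
three_seconds = [0.0] * (3 * 48000)
frames = frame_signal(three_seconds)
```

Each feature in the list that follows would then be computed once per frame.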
- 
1. Fundamental Frequency (F0): F0 is the frequency at which the vocal folds vibrate during voice production. In this work, F0 is extracted using the autocorrelation function, which is computed as:
\begin{equation} R(i) = \sum_{n=i}^{N-1} x(n)\,x(n-i) \tag{1} \end{equation}
where $1 \le i \le p$ for a finite duration of $x(n)$, and $p$ is the range of lag values (O’Shaugnessy 1987, p. 196; Huang et al. 2001, p. 321). Six parameters of F0 (minimum, maximum, mean, range, standard deviation, and median) are extracted.
- 
2. Signal energy: The energy of a continuous-time signal $x(t)$ is obtained by integrating the squared amplitude of $x$ over time (Haykin and Van Veen 2007, p. 20; O’Shaugnessy 1987, p. 180). It is computed as
\begin{equation} E_x = \int_{-\infty}^{\infty} x^2(t)\, dt \tag{2} \end{equation}
- 
3. Zero crossing rate: The rate at which a signal crosses the zero axis, i.e., changes sign, is known as the zero-crossing rate (ZCR). For a signal $x(t)$, ZCR is computed as
\begin{equation} ZCR(x(t)) = \frac{1}{2M}|sgn(x(t))-sgn(x(t-1))|W(i-j) \tag{3} \end{equation}
where $W(i)$ represents a window of size $M$ samples and $sgn$ is the signum function; with this normalization the ZCR lies in the range [0, 1] (O’Shaugnessy 1987, p. 182). A higher ZCR value implies higher frequency content in the signal (Lerch 2012, p. 62).
- 
4. Strength of excitation (SoE): SoE is the relative strength of impulse-like excitation at the Glottal Closure Instants. In this work, SoE is extracted using the zero frequency filtering (ZFF) method (Yegnanarayana and Murty 2009). The slope of the ZFF signal at each epoch is the SoE (Mittal 2016; Kadiri and Alku 2020).
- 
5. Cepstral peak prominence (CPP): It is a commonly used acoustic measure of voice quality in different applications of speech analysis like singing voice studies (Baker et al. 2022) and speech dysphonia (Fraile and Godino-Llorente 2014). We have extracted CPP with voice detection and without voice detection, as in Murton, Hillman, and Mehta (2020).
- 
6. Mel frequency cepstral coefficients (MFCC): MFCCs describe the overall spectral envelope of a signal (Lerch 2012; O’Shaugnessy 1987). The $i^{th}$ coefficient, as in Lerch (2012, p. 51), is computed as
\begin{equation} MFCC_i(n) = \sum_{k'=1}^{K'} \log|X'(k',n)| \cdot \cos\left(i \cdot \left(k' - \frac{1}{2}\right)\frac{\pi}{K'}\right) \tag{4} \end{equation}
where $|X'(k',n)|$ is the Mel spectrum at frame block $n$. In this work, 13 MFCC coefficients are used.
- 
7. Formant frequencies: Acoustic resonances in the vocal tract are called formants (O’Shaugnessy 1987). They are crucial in examining the articulatory response of the vocal tract (Ladefoged and Johnson 2014). A $10^{th}$-order linear prediction is used for generating the first four formants, and the songs are resampled to 10,000 samples per second.
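The three temporal features in the list above (F0 from the autocorrelation sum $R(i)$, signal energy, and ZCR) can be sketched for a single frame as follows. This is a simplified illustration, not the Matlab/Praat implementation used in the study: the F0 search range of 75–600 Hz, the discrete sum standing in for the energy integral, the rectangular window with the sum taken over the whole frame for ZCR, and treating zero as positive in the sign function are all assumptions.

```python
import math


def autocorrelation(x, lag):
    """R(i) = sum over n = i .. N-1 of x(n) * x(n - i)."""
    return sum(x[n] * x[n - lag] for n in range(lag, len(x)))


def estimate_f0(x, sample_rate, f0_min=75.0, f0_max=600.0):
    """Pick the lag with the largest autocorrelation in an assumed F0 range."""
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag = max(range(lag_min, lag_max + 1),
                   key=lambda lag: autocorrelation(x, lag))
    return sample_rate / best_lag


def energy(x):
    """Discrete counterpart of the energy integral: sum of squared samples."""
    return sum(v * v for v in x)


def zero_crossing_rate(x):
    """Sign changes in the frame, normalized by 2M (M = frame length)."""
    def sgn(v):
        return 1 if v >= 0 else -1  # zero treated as positive (assumption)
    changes = sum(abs(sgn(x[n]) - sgn(x[n - 1])) for n in range(1, len(x)))
    return changes / (2 * len(x))


# Synthetic 200 Hz tone, 25 ms long at 8 kHz, used only as a sanity check.
rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / rate) for n in range(200)]
f0 = estimate_f0(tone, rate)
```

On the synthetic tone the autocorrelation peaks at a lag of 40 samples, which corresponds to 200 Hz. Real singing frames would additionally need voicing detection before the six F0 statistics are computed.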
3.3 Proposed models
At present, it is still difficult to implement a fully unsupervised learning model for audio, since the singing signal is highly non-linear. Moreover, sufficient data to implement an unsupervised model is currently unavailable for Mizo folk songs. Hence, an approach using a supervised deep learning model, LSTM, is proposed in the following subsection.
3.3.1 LSTM with attention mechanism (LSTM-attn)
An LSTM model with an attention mechanism is proposed. This attention mechanism enhances the ability of the LSTM to focus on specific regions of the input acoustic feature vector at each time step. It computes attention scores based on the similarity between the data points in the feature vector, and assigns weights to different regions of the input vector, giving ‘attention’ to the most relevant data points. LSTM-attn in this work is computed as in Algorithm 1, using the following parameters:
- 
One-hot encoded input sequence, x, with dimension 2102 $\times$ 29 $\times$ 1 (for the 29 selected acoustic features)
- 
Two weight parameters, $Q_w$ and $K_w$, are defined as learnable weight matrices for Query projection and Key projection, respectively.
- 
Output sequence, y, which is an attention-weighted sequence with the same dimension as x. 
Algorithm 1. Attention mechanism for LSTM

The model summary of LSTM-attn is shown in Figure 4. The first LSTM layer in this figure is the input layer, which takes an input of shape 2102 $\times$ 29 $\times$ 1. This layer uses ReLU (Rectified Linear Unit) activation with 64 units to introduce non-linearity in the input vector. It allows the model to capture the temporal differences in the input feature sequence. The output produced has a shape of 32 $\times$ 29 $\times$ 64, as batchSize = 32.

Figure 4. Summary of the proposed LSTM-attn model with custom attention layer.
This is then passed to the attention layer, where attn_scores are computed based on the importance of the data points, as per the algorithm mentioned above. The dimensions of the weight matrices $Q_w$ and $K_w$ are customized as 64 $\times$ 29, taking the size of both the time axis and the feature axis from the input vector, rather than weighting the time axis alone as done in conventional attention mechanisms. This projection of input data to a higher dimensionality for the query vector allows the model to concentrate on the most relevant features in the batch, while the key weight is set to retain the dimension of the input vector. Moreover, this customization allows for pairwise relationships between features within the input sequence while preserving all input features. The shape of the output is maintained from the previous layer. This setting was seen to improve model performance compared with allowing the model to assign random weights.
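Since Algorithm 1 is not reproduced in the text, the sketch below shows only one plausible reading of the attention layer, written in plain Python for a single sequence. It assumes that the 29 $\times$ 64 output of the first LSTM layer, `h`, is projected by $Q_w$ and $K_w$ (each 64 $\times$ 29), that the scores are a row-wise softmax of the product of the query and transposed key matrices, and that the output is the score-weighted sum of `h`. The random initial weights, the absence of score scaling, and all function names are illustrative assumptions rather than the authors' implementation.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(h, q_w, k_w):
    """Assumed form: y = softmax((h Q_w)(h K_w)^T) h, same shape as h."""
    q = matmul(h, q_w)                        # 29 x 29
    k = matmul(h, k_w)                        # 29 x 29
    attn_scores = softmax_rows(matmul(q, transpose(k)))
    return matmul(attn_scores, h), attn_scores


random.seed(0)
steps, units = 29, 64                          # shapes quoted in the text
h = [[random.uniform(-1, 1) for _ in range(units)] for _ in range(steps)]
q_w = [[random.uniform(-0.1, 0.1) for _ in range(steps)] for _ in range(units)]
k_w = [[random.uniform(-0.1, 0.1) for _ in range(steps)] for _ in range(units)]
y, scores = attention(h, q_w, k_w)
```

Under these assumptions the output keeps the 29 $\times$ 64 shape of its input, which matches the statement that the shape is maintained from the previous layer.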
Next is another LSTM layer with 128 ReLU units, whose output sequence has a shape of 32 $\times$ 29 $\times$ 128. Then, the output is flattened to get a vector of shape 32 $\times$ 3712. A fully connected dense layer with 128 units and ReLU activation is then added, which reduces the shape of the vector to 32 $\times$ 128. Subsequently, the softmax layer follows with 3 units (i.e., the number of classes, which in our case is the number of song categories) to produce the 32 $\times$ 3 output as class probabilities.
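The layer-by-layer shapes quoted above can be checked with a few lines of arithmetic. The sketch below only traces tensor shapes and does not build or train a model; the layer labels are illustrative names, not identifiers taken from Figure 4.

```python
def trace_shapes(batch=32, steps=29, lstm1=64, lstm2=128, dense=128, classes=3):
    """Return (layer label, output shape) pairs for the described stack."""
    return [
        ("lstm_1", (batch, steps, lstm1)),
        ("attention", (batch, steps, lstm1)),   # shape kept from previous layer
        ("lstm_2", (batch, steps, lstm2)),
        ("flatten", (batch, steps * lstm2)),    # 29 * 128 = 3712
        ("dense", (batch, dense)),
        ("softmax", (batch, classes)),
    ]


for name, shape in trace_shapes():
    print(f"{name:10s} {shape}")
```

The flattened width of 3712 follows directly from 29 time steps multiplied by 128 units.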
3.3.2 Machine learning models
In addition to the proposed LSTM-attn model above, four commonly used supervised machine learning models, SVM, K-Nearest Neighbor (KNN), Naive Bayes, and Ensemble learning, are employed for comparison with the results obtained from the LSTM-attn model. These models have been found to have the highest classification rates as compared to other models for shorter segments of speech (Grimaldi, Cunningham, and Kokaram 2003; Huang et al. 2014).
4. Experiments and discussion of results
4.1 Experiments
In this work, a total of 29 acoustic features and parameters have been extracted and divided into four feature sets, which are evaluated individually and in combination. This is done to find out which groups of features are relevant for the classification task based on their acoustic properties. The features are grouped as follows: set-1: Temporal features (F0, Energy, ZCR); set-2: Source feature (SoE); set-3: Source-system features (CPP, MFCC); set-4: System features (Formants); set-1 + 2: Temporal + source features (F0, Energy, ZCR, SoE); set-1 + 2 + 3: Temporal + source + source-system features (F0, Energy, ZCR, SoE, CPP, MFCC); and set-1 + 2 + 3 + 4: all sets of features (F0, Energy, ZCR, SoE, CPP, MFCC, Formants). Class labels 1, 2, and 3 are assigned to the hunting chants, children’s songs, and elderly songs, respectively.
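The grouping above can be expressed as a small lookup structure. This is an illustrative sketch only: the dictionary `FEATURE_SETS` and the helper `build_combinations` are hypothetical names, and entries such as MFCC and Formants stand for groups of feature columns whose exact counts are not restated here.

```python
# Four base feature sets, keyed as in the text (names are group labels,
# not individual feature columns).
FEATURE_SETS = {
    "set-1": ["F0", "Energy", "ZCR"],  # temporal features
    "set-2": ["SoE"],                  # source feature
    "set-3": ["CPP", "MFCC"],          # source-system features
    "set-4": ["Formants"],             # system features
}


def build_combinations(feature_sets):
    """Return the four base sets plus the three cumulative combinations."""
    names = list(feature_sets)
    combos = {name: list(feats) for name, feats in feature_sets.items()}
    # Cumulative unions: set-1 + 2, set-1 + 2 + 3, set-1 + 2 + 3 + 4.
    for end in range(2, len(names) + 1):
        label = "set-" + " + ".join(n.split("-")[1] for n in names[:end])
        combos[label] = [f for n in names[:end] for f in feature_sets[n]]
    return combos


for label, feats in build_combinations(FEATURE_SETS).items():
    print(f"{label}: {', '.join(feats)}")
```

Keeping the base sets in one place means each experiment can select its input columns by label, and the cumulative sets cannot drift out of step with the base definitions.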
In total, there are seven feature combinations used for classification of the folk songs. Performing the classification with these combinations helps to identify which acoustic features are relevant for the classification of Mizo folk songs. The songs in the three categories, whose typical length is 1–5 min, are divided into manageable chunks of 3 sec. From the original 93 song files, a total of 2948 samples are thus generated.
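Fixed-length chunking of this kind can be sketched as follows. The function `segment_bounds` and its `keep_remainder` flag are hypothetical; in particular, whether a trailing chunk shorter than 3 sec is kept or discarded is not stated in the text, so the default below (discard) is an assumption.

```python
def segment_bounds(duration_s, chunk_s=3.0, keep_remainder=False):
    """Return (start, end) times in seconds for consecutive fixed-length chunks.

    Assumption: chunks do not overlap, and a trailing partial chunk is
    dropped unless keep_remainder is True.
    """
    n_full = int(duration_s // chunk_s)
    bounds = [(i * chunk_s, (i + 1) * chunk_s) for i in range(n_full)]
    tail_start = n_full * chunk_s
    if keep_remainder and tail_start < duration_s:
        bounds.append((tail_start, duration_s))
    return bounds


# A 10-second recording yields three full 3-second chunks.
print(segment_bounds(10.0))
```

The returned boundaries can then be converted to sample indices by multiplying by the sampling rate of the recording.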
After removal of NaN and zero values, the feature vector is one-hot encoded to reshape and make it compatible with the model. The shape of the vector becomes 2102 $\times$ 29 $\times$ 1. Using the seven feature set combinations, experiments are carried out wherein the shape of the input vector changes depending on the number of features considered. For these experiments, the ‘adam’ optimizer is used, along with a constant batch size of 32 for different epochs: 10, 20, 30, 40, and 50. Only the epoch with the best result, i.e., 10, is reported in this study. With the small size of the feature vector, it is deemed sufficient to choose 10 epochs for this work. For the four ML models, after eliminating NaN values, the dimension of the final feature vector becomes 2183 $\times$ 29. The models are trained using k-fold cross validation (k = 5) on 80% of the data, and 20% is set aside for testing.
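The evaluation protocol for the ML models (a 20% hold-out set plus 5-fold cross validation on the remaining 80%) can be illustrated with index arithmetic alone. This is a sketch under stated assumptions: the helper `holdout_and_folds` is hypothetical, and the shuffling, the seed, and the absence of stratification are choices made here for illustration, not details taken from the study.

```python
import random


def holdout_and_folds(n_samples, test_fraction=0.2, k=5, seed=0):
    """Split sample indices into a held-out test set and k CV splits.

    Returns (test_idx, splits), where splits is a list of
    (fit_idx, val_idx) pairs drawn from the training portion only.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: shuffled, unstratified
    n_test = int(n_samples * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Deal the training indices into k near-equal folds.
    folds = [train_idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        val_idx = folds[i]
        fit_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((fit_idx, val_idx))
    return test_idx, splits


test_idx, splits = holdout_and_folds(2183)
print(len(test_idx), [len(val) for _, val in splits])
```

Each training sample is used for validation exactly once across the five splits, while the held-out test indices never appear in any of them.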
4.2 Discussions
In Table 2, the accuracy results of the proposed LSTM-attn model are shown. The data are split 80:20 into training and testing sets. Although slightly lower, the performance of the model for each feature set is comparable to that of the existing ML models used in this study. As the classes are rather distinct from one another, class-2 and class-3 exhibit minimal misclassification between them, while a higher number of misclassifications is seen between class-1 and class-3.
Table 2. Macro-averages of long short-term memory with attention layer model for classification of three categories of Mizo folk songs, with 20% testing data

Interestingly, it is observed that LSTM-attn performance deteriorates as the feature set combines more acoustic features. The accuracy plots of the feature sets are shown in Figure 5. Set-1 provides the best accuracy (95.01%) among the individual feature sets. However, as more feature sets are combined, the performance of the model decreases, reaching 91.21% for set-1 + 2 + 3 + 4, which contains all the features. This is attributed to the fact that LSTMs are sequential models that work well in capturing patterns and dependencies in sequential data. As the combined feature sets are not inherently sequential by nature, the performance of the LSTM-attn is seen to deteriorate.
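For readers unfamiliar with macro-averaging, as used for the precision, recall, and f1-score in Table 2, the computation from a confusion matrix can be sketched as below. The function `macro_metrics` is a hypothetical helper, the macro f1-score is taken here as the mean of the per-class f1-scores (one common convention, assumed rather than confirmed by the text), and the example matrix is made up for illustration and is not data from this study.

```python
def macro_metrics(confusion):
    """Macro-averaged precision, recall and f1 from a confusion matrix.

    confusion[i][j] is the number of class-i samples predicted as class j.
    """
    n = len(confusion)
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))  # column sum
        actual = sum(confusion[c])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    # Macro-average: unweighted mean over classes.
    return tuple(sum(v) / n for v in (precisions, recalls, f1s))


# Made-up 3-class example: class 1 and class 3 are confused with each
# other, while class 2 is always classified correctly.
example = [[8, 0, 2],
           [0, 10, 0],
           [2, 0, 8]]
print(macro_metrics(example))
```

Because each class contributes equally to the average, macro scores are not dominated by the largest class, which matters when the three song categories have different numbers of segments.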

Figure 5. Accuracy plots for LSTM-attn with 10 epochs for different feature sets (Accuracy $\times$ 100).
Out of the four different supervised classifiers employed, it can be seen from Table 3 that the Ensemble method achieves the best accuracy of 97.71% for temporal features in set-1 and for all features in set-1 + 2 + 3 + 4, with 66 incorrectly classified data points. It can also be observed from Figure 6 that there is hardly any misclassification between class-3 (elderly songs) and class-2 (children’s songs). This is because children’s voices and adults’ voices are clearly distinct, so the models are able to train and predict well. Misclassification is highest between the hunting chants and the elderly songs. Although the rhythm and tempo of these songs are not similar, both are sung and performed by adults. As such, the characteristics of these two categories of songs may show some similarity in terms of excitation features.
Table 3. Classifier performances for three categories of Mizo folk songs, with different combinations of acoustic features (5-fold cross validation with 20% data for testing)


Figure 6. Confusion matrices of the four ML models and LSTM-attn with different feature sets.
4.2.1 A comparative analysis of classification using different feature sets
In set-1, the LSTM-attn model yields a testing accuracy of 95.01%, which, despite a slight decrease, is considered a fairly good performance given the limited data size. The accuracy plot in Figure 5(a) shows slight improvements. The precision, recall, and f1-score are also the highest among all the feature sets, as shown in Table 2. The Ensemble model also performs well, achieving an accuracy of 97.71%. The KNN, SVM, and Naive Bayes classifiers obtained testing accuracies of 96.79%. Overall, this feature set with temporal features demonstrates the best accuracy when considering both the machine learning models and the proposed LSTM-attn model.
In set-2, LSTM-attn achieves the highest performance (71.73%) despite the challenges of a reduced feature set, with its f1-score dipping to 0.74. There is no misclassification of class-3 as class-2 for any model except Naive Bayes, as seen in Figure 6(b). The Ensemble model performs the best among the ML models (64.91%). The use of a single acoustic feature (SoE) in this set limits the performance. The low accuracy may also be due to the fact that the excitation strength is estimated using a ZFF model (Yegnanarayana and Gangashetty 2011), which uses a fixed window length in the trend removal of zero-frequency resonators. This fixed windowing does not work well for singing voice and expressive voice due to high source-filter system interaction (Kadiri and Yegnanarayana 2015; Kadiri, Alku, and Yegnanarayana 2020).
In set-3, source-system features produced better classification accuracy than set-2. LSTM-attn obtained 91.69% accuracy with its f1-score at 0.89. The Ensemble method performs the best (92.66%), while SVM has the lowest accuracy (89.68%) and the highest classification error. The accuracy plot in Figure 5(c) shows higher accuracies with little improvement over 10 epochs.
In set-4, LSTM-attn obtains the best performance (69.88%). The challenges with feature set-4 become evident as the models struggle to effectively categorize the folk songs. As can be seen in Figure 6(d), the misclassification ratio closely mirrors the classification accuracy, which can be attributed to the influence of musical notes in vocal singing, causing shifts in formant frequencies (Heaton 2010). Although relatively steady, the accuracy plot in Figure 5(d) also shows a slight dip toward the last epoch.
With set-1 + 2, the classification accuracy improves when the temporal features are combined with the source features, given that set-2 does not perform well on its own. The LSTM-attn obtains 93.11% accuracy, although the ML models achieve better accuracy.
With feature set-1 + 2 + 3, improvements in the accuracy of the classification are observed, with fewer classification errors as shown in Figure 6(f). The f1-score is at 0.92 and the accuracy plot in Figure 5(f) shows a steady curve between training and testing data. The KNN model classifies well, with the fewest misclassified data points. The LSTM-attn achieves the lowest accuracy of 93.11% with an f1-score of 0.93.
Finally, for set-1 + 2 + 3 + 4, the incorporation of system features (formants) into the previous three types of features does not improve the classification accuracy for LSTM-attn, KNN, SVM, and Naive Bayes. However, in the Ensemble model, the accuracy is improved and misclassification is reduced. It is observed that the performance of LSTM-attn does not necessarily improve with more diversified feature combinations. Its accuracy is lower than with feature sets whose features are all of the same type.
4.2.2 Comparison with existing works
Given the absence of prior research on the acoustic analysis and classification of Mizo folk songs, this study draws comparisons with similar research on folk songs in other low-resource Indian languages. Currently, the existing works do not typically employ more recent deep learning techniques because these languages are under-resourced.
Table 4 lists different works on the classification of Indian folk songs, out of which Kokborok (Das et al. 2021) seems to be the closest to Mizo in terms of language family (Tibeto-Burman) and geographical location. Although the experimental setup is dissimilar, in the case of Kokborok, 63% classification accuracy has been achieved by using statistical computational methods to reduce the classification error of the feature sets.
Table 4. Classifier performance compared with existing studies of under-resourced folk songs

Among existing works with a similar experimental setup, Pandey and Dutta (2014) classify folk songs of the categories Gais, Rais, and Phag. A 5-fold cross validation and an 80:20 training-testing ratio were employed, achieving 91.3% with an SVM classifier. Although the performance of our proposed LSTM-attn model is lower than that of the four existing ML models used in this study, it shows a slight improvement over the existing works.
5. Summary and conclusion
Mizo is a low-resource language that lacks the tools and technology required for the archival of its folk music. It has been observed that very few acoustic studies exist for Indian folk songs and music in spite of their richness in cultural and regional diversity. A survey of the literature on Mizo folk songs, as well as on recent methods of folk song and music classification, has been carried out.
This work proposes an LSTM-attn model consisting of an LSTM layer, a custom attention layer, a fully connected dense layer, and a softmax layer. Its performance is compared with that of existing machine learning models, namely SVM, KNN, Naive Bayes, and Ensemble models. Three categories of Mizo folk songs are used as the dataset for classification: Hunting chants (Hlado), Children’s songs (Pawnto hla), and Elderly songs (Pi pu zai), with the total duration of the songs being approximately 2 hrs. A total of 29 acoustic features grouped into temporal features, source features, source-system features, and vocal tract filter features are extracted from the Mizo folk songs. Classification is carried out with 20% of the data segregated for testing. The highest accuracy achieved by the LSTM-attn is 95.01% (for temporal features), while it achieved 91.21% for all features combined. The results are comparable to existing studies of folk song classification in other Indian languages.
Our work is constrained by the relatively small dataset, which necessitates the segmentation of song samples into 3-sec segments. This approach may result in the loss of contextual information and potential discontinuities. Consequently, important audio events or transitions that span longer durations could be divided across different segments, posing challenges in capturing the complete audio content. Moreover, the frame-wise analysis employed might have overlooked important tonal characteristics of the Mizo language present in these folk songs. Performance of the proposed LSTM-attn model could be improved with a larger sample size in each class. A comprehensive evaluation of the model will be undertaken in the future. Additionally, analysis will be conducted to address the issue of lower accuracy in diverse acoustic feature sets. Exploration of the tone-tune relationship in the Mizo language will also be undertaken, building upon previous studies by Ramdinmawii and Nath (2022) and Gogoi and Nath (2023).
This work would significantly contribute to India’s efforts in preserving intangible cultural heritage, benefiting Mizoram’s Art & Culture Department, currently engaged in archiving the state’s heritage. Additionally, this method can have broader applications in MIR, not only in Mizo but also in other Tibeto-Burman languages like Tani (Arunachal Pradesh), Meitei (Manipur), and Garo (Meghalaya).
Acknowledgement
The authors thank the Director and Technician, Department of Art & Culture, Mizoram, for their contribution in sharing their prerecorded songs. The authors are also grateful to the late Pu Lalkhuma (Sialsuk) for his valuable contribution to the dataset. Lastly, great appreciation goes to the owners of YouTube channels who permitted us to use their content for our dataset in this work.
 
  
 
 
 
 
 
 
 
 
 
 
 
 
 












