Combining acoustic signals and medical records to improve pathological voice classification

  • Shih-Hau Fang (a1), Chi-Te Wang (a1) (a2) (a3), Ji-Ying Chen (a1) (a3), Yu Tsao (a4) and Feng-Chuan Lin (a2) (a3)...


This study proposes two multimodal frameworks to classify pathological voice samples by combining acoustic signals and medical records. In the first framework, acoustic signals are transformed into static supervectors via Gaussian mixture models; then, a deep neural network (DNN) combines the supervectors with the medical record and classifies the voice signals. In the second framework, both acoustic features and medical data are processed through first-stage DNNs individually; then, a second-stage DNN combines the outputs of the first-stage DNNs and performs classification. Voice samples were recorded in a specific voice clinic of a tertiary teaching hospital, including three common categories of vocal diseases, i.e. glottic neoplasm, phonotraumatic lesions, and vocal paralysis. Experimental results demonstrated that the proposed framework yields significant accuracy and unweighted average recall (UAR) improvements of 2.02–10.32% and 2.48–17.31%, respectively, compared with systems that use only acoustic signals or medical records. The proposed algorithm also provides higher accuracy and UAR than traditional feature-based and model-based combination methods.
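The second framework and the UAR metric can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the feature dimensions, layer sizes, single-layer first stages, and random untrained weights are all assumptions chosen for illustration, and the helper names (`dense`, `init`, `two_stage_forward`, `unweighted_average_recall`) are hypothetical. The sketch only shows the data flow: each modality passes through its own first-stage network, and a second-stage layer classifies the concatenated outputs into three classes. UAR is computed as the mean of the per-class recalls.

```python
import numpy as np

def dense(x, w, b):
    """One fully connected layer followed by a ReLU activation."""
    return np.maximum(x @ w + b, 0.0)

def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def init(rng, n_in, n_out):
    """Random (untrained) weights and zero biases for one layer."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def two_stage_forward(acoustic, medical, params):
    """First-stage network per modality, then a second-stage fusion layer."""
    h_acoustic = dense(acoustic, *params["acoustic"])
    h_medical = dense(medical, *params["medical"])
    fused = np.concatenate([h_acoustic, h_medical], axis=1)
    w, b = params["fusion"]
    return softmax(fused @ w + b)

def unweighted_average_recall(y_true, y_pred, n_classes):
    """Mean of per-class recalls; every class counts equally."""
    recalls = []
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():  # skip classes absent from y_true
            recalls.append(np.mean(y_pred[mask] == c))
    return float(np.mean(recalls))

# Hypothetical sizes: none of these values come from the article.
N_ACOUSTIC, N_MEDICAL, N_HIDDEN, N_CLASSES = 39, 34, 16, 3

rng = np.random.default_rng(0)
params = {
    "acoustic": init(rng, N_ACOUSTIC, N_HIDDEN),
    "medical": init(rng, N_MEDICAL, N_HIDDEN),
    "fusion": init(rng, 2 * N_HIDDEN, N_CLASSES),
}

# Five synthetic samples standing in for acoustic features and medical records.
acoustic = rng.normal(size=(5, N_ACOUSTIC))
medical = rng.normal(size=(5, N_MEDICAL))
probs = two_stage_forward(acoustic, medical, params)

# A classifier that always predicts class 0 on an imbalanced label set.
y_true = np.array([0, 0, 0, 0, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0])
accuracy = float(np.mean(y_true == y_pred))
uar = unweighted_average_recall(y_true, y_pred, N_CLASSES)
```

In the toy labels at the end, always predicting the majority class gives an accuracy of 4/6 but a UAR of only 1/3, which is why reporting UAR alongside accuracy is informative when the disease categories are unevenly represented.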


This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence, which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.

Corresponding author: Yu Tsao




