
Combining acoustic signals and medical records to improve pathological voice classification

Published online by Cambridge University Press:  11 June 2019

Shih-Hau Fang
Affiliation:
Department of Electrical Engineering, Yuan Ze University, and MOST Joint Research Center for AI Technology and All Vista Healthcare Innovation Center, Taoyuan, Taiwan
Chi-Te Wang
Affiliation:
Department of Electrical Engineering, Yuan Ze University, and MOST Joint Research Center for AI Technology and All Vista Healthcare Innovation Center, Taoyuan, Taiwan; Department of Otolaryngology Head and Neck Surgery, Far Eastern Memorial Hospital, New Taipei City, Taiwan; Department of Special Education, University of Taipei, Taipei, Taiwan
Ji-Ying Chen
Affiliation:
Department of Electrical Engineering, Yuan Ze University, and MOST Joint Research Center for AI Technology and All Vista Healthcare Innovation Center, Taoyuan, Taiwan; Department of Special Education, University of Taipei, Taipei, Taiwan
Yu Tsao*
Affiliation:
Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan
Feng-Chuan Lin
Affiliation:
Department of Otolaryngology Head and Neck Surgery, Far Eastern Memorial Hospital, New Taipei City, Taiwan; Department of Special Education, University of Taipei, Taipei, Taiwan
Corresponding author: Yu Tsao, Email: yu.tsao@citi.sinica.edu.tw

Abstract

This study proposes two multimodal frameworks for classifying pathological voice samples by combining acoustic signals and medical records. In the first framework, acoustic signals are transformed into static supervectors via Gaussian mixture models; a deep neural network (DNN) then combines the supervectors with the medical record and classifies the voice signals. In the second framework, the acoustic features and the medical data are first processed by separate first-stage DNNs; a second-stage DNN then combines the outputs of the first-stage DNNs and performs the classification. Voice samples were recorded at the voice clinic of a tertiary teaching hospital and cover three common categories of vocal diseases, namely glottic neoplasm, phonotraumatic lesions, and vocal paralysis. Experimental results demonstrate that the proposed frameworks yield significant improvements in accuracy and unweighted average recall (UAR) of 2.02–10.32% and 2.48–17.31%, respectively, compared with systems that use only acoustic signals or only medical records. The proposed algorithms also provide higher accuracy and UAR than traditional feature-based and model-based combination methods.
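The feature fusion in the first framework can be illustrated with a minimal sketch: fit a Gaussian mixture model to an utterance's acoustic frames, flatten its component means into a fixed-length supervector, and concatenate that with the medical-record vector to form the DNN input. This is only one simple variant under stated assumptions (a per-utterance diagonal-covariance GMM rather than, e.g., MAP adaptation from a universal background model); the function names and toy dimensions below are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_supervector(frames, n_components=4, seed=0):
    """Fit a GMM to the acoustic frames (rows = frames, cols = MFCC dims)
    and flatten its component means into a fixed-length supervector."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          random_state=seed).fit(frames)
    return gmm.means_.ravel()  # shape: (n_components * n_features,)

def fuse_features(frames, medical_record, n_components=4):
    """Concatenate the acoustic supervector with the medical-record
    feature vector, forming the joint input to a downstream classifier."""
    sv = gmm_supervector(frames, n_components)
    return np.concatenate([sv, np.asarray(medical_record, dtype=float)])

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 13))   # e.g. 200 frames of 13-dim MFCCs
record = [62, 1, 0, 1, 3]             # toy demographic/symptom features
x = fuse_features(frames, record)
print(x.shape)                        # (4*13 + 5,) = (57,)
```

The supervector makes variable-length recordings comparable: every utterance maps to the same fixed dimensionality regardless of its number of frames, which is what allows a static concatenation with the medical-record features.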

Information

Type
Original Paper
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Authors, 2019
Table 1. FEMH data description.

Table 2. Phonotrauma data description.

Table 3. Demographic and symptom features retrieved from the FEMH database.

Fig. 1. Block diagram of the proposed HGD framework.

Fig. 2. Block diagram of the proposed TSD framework.

Fig. 3. Waveforms of voice samples with neoplasm (a), vocal palsy (b), and phonotrauma (c). Wide-band spectrograms of voice samples with neoplasm (d), vocal palsy (e), and phonotrauma (f).

Fig. 4. Distribution of the first and second principal components of the acoustic MFCC features. Panels (a) and (d) show the histograms of the individual components, while (b) and (c) show the joint distribution in the two-dimensional principal-component space (the first and second principal components are placed on different axes in (b) and (c)).

Fig. 5. Distribution of the first and second principal components of the medical-record features. Panels (a) and (d) show the histograms of the individual components, while (b) and (c) show the joint distribution in the two-dimensional principal-component space (the first and second principal components are placed on different axes in (b) and (c)).

Table 4. Performance comparison.

Table 5. P-values for accuracy (ACC).

Table 6. P-values for unweighted average recall (UAR).