A technical framework for automatic perceptual evaluation of singing quality

Published online by Cambridge University Press:  14 September 2018

Chitralekha Gupta*
Affiliation:
NUS Graduate School for Integrative Sciences and Engineering, National University of Singapore, Singapore; Computer Science Department, National University of Singapore, Singapore
Haizhou Li
Affiliation:
Electrical and Computer Engineering Department, National University of Singapore, Singapore
Ye Wang
Affiliation:
NUS Graduate School for Integrative Sciences and Engineering, National University of Singapore, Singapore; Computer Science Department, National University of Singapore, Singapore
*Corresponding author: Chitralekha Gupta. Email: chitralekha@u.nus.edu

Abstract

Human experts evaluate singing quality based on many perceptual parameters such as intonation, rhythm, and vibrato, with reference to music theory. We previously proposed the Perceptual Evaluation of Singing Quality (PESnQ) framework, which combines acoustic features related to these perceptual parameters with the cognitive modeling concept of the telecommunication standard Perceptual Evaluation of Speech Quality to evaluate singing quality. In this study, we extend the PESnQ framework to better approximate human judgments. First, we find that a linear combination of the human scores for the individual perceptual parameters can predict the overall human judgment of singing quality. This provides us with a human parametric judgment equation. Next, the individual perceptual parameter scores predicted from the PESnQ acoustic features show a high correlation with the respective human scores, which enables more meaningful feedback to learners. Finally, we compare the performance of early fusion and late fusion of the acoustic features in predicting the overall human scores, and find that late fusion is superior to early fusion. This work underlines the importance of modeling human perception in automatic singing quality assessment.
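
As an illustration only, the human parametric judgment equation described above can be written as a linear model. The symbols here are assumptions introduced for this sketch and are not taken from the paper: s_k is the human score for the k-th perceptual parameter, K is the number of parameters, w_k are regression weights, and w_0 is an optional intercept (the abstract does not say whether one is used).

```latex
\hat{S}_{\text{overall}} = w_0 + \sum_{k=1}^{K} w_k \, s_k
```

Under this reading, the weights would be estimated by regressing the overall human scores on the per-parameter human scores.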

Information

Type
Original Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Authors, 2018

Fig. 1. The concept of the PESnQ framework. The perceptual parameters are motivated by the rules of singing as dictated by music theory and how humans perceive it. Our proposed PESnQ framework comprises elements from signal acoustics, perceptual parameters, and human perception to obtain a perceptually-valid score for singing quality called the PESnQ score.

Fig. 2. Diagram of PESnQ scoring with two approaches: early fusion and late fusion.
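
The difference between the two approaches can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes plain least-squares linear models at every stage, and the function names (`fit_linear`, `early_fusion`, `late_fusion`, `pearson`) are hypothetical. Early fusion maps all acoustic features directly to the overall score; late fusion first predicts each perceptual parameter score and then combines those predictions with weights learned from the human parameter scores.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear fit with an intercept (stored as the last weight)."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict_linear(w, X):
    """Apply weights from fit_linear to a feature matrix."""
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ w

def pearson(a, b):
    """Pearson's correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def early_fusion(X_train, y_train, X_test):
    """Assumed early fusion: acoustic features -> overall score in one step."""
    return predict_linear(fit_linear(X_train, y_train), X_test)

def late_fusion(X_train, P_train, y_train, X_test):
    """Assumed late fusion: features -> per-parameter scores -> overall score.

    P_train holds human scores for each perceptual parameter (one column each).
    """
    # Stage 1: one regressor per perceptual parameter.
    stage1 = [fit_linear(X_train, P_train[:, k]) for k in range(P_train.shape[1])]
    P_hat_test = np.column_stack([predict_linear(w, X_test) for w in stage1])
    # Stage 2: combination weights learned from human parameter scores,
    # then applied to the predicted parameter scores.
    w_comb = fit_linear(P_train, y_train)
    return predict_linear(w_comb, P_hat_test)
```

With linear models at both stages, the late-fusion output is still a linear function of the acoustic features; what changes is the supervision, since each perceptual parameter is fitted against its own human scores. The models actually used in the paper may differ.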

Table 1. List of perceptual parameters

Table 2. Acoustic features, perceptual features, and their descriptions, corresponding to the human perceptual parameters for singing quality evaluation

Table 3. Pearson's correlation between the human scores of the individual perceptual parameters. (vq: voice quality, pronun: pronunciation, pdr: pitch dynamic range)

Fig. 3. Performance of the best set of acoustic features from [8] in predicting the individual perceptual parameters when trained separately for each of them, at the utterance level and at the song level (average and median). (vq: voice quality, pronun: pronunciation, pdr: pitch dynamic range)

Table 4. Comparison of Pearson's correlation between the overall human judgment and the overall PESnQ score predicted by the early and late fusion methods

Table 5. Comparison of Pearson's correlation in predicting the 5th judge in a leave-one-judge-out experiment by the early and late fusion methods
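
A leave-one-judge-out evaluation of this kind might look like the sketch below. It is an assumption-laden illustration rather than the paper's protocol: the hypothetical function `leave_one_judge_out` trains a linear model on the mean score of the remaining judges and reports Pearson's correlation against the held-out judge, and for brevity it does not split the recordings into separate training and test sets.

```python
import numpy as np

def pearson(a, b):
    """Pearson's correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def leave_one_judge_out(features, judge_scores):
    """Return one correlation per held-out judge.

    features:     (n_samples, n_features) array of acoustic features.
    judge_scores: (n_samples, n_judges) array, one column per judge.
    """
    n_samples, n_judges = judge_scores.shape
    A = np.hstack([features, np.ones((n_samples, 1))])  # add intercept column
    correlations = []
    for j in range(n_judges):
        # Training target: mean score of all judges except judge j (assumption).
        target = np.delete(judge_scores, j, axis=1).mean(axis=1)
        w, *_ = np.linalg.lstsq(A, target, rcond=None)
        # Compare the model's prediction with the held-out judge's scores.
        correlations.append(pearson(A @ w, judge_scores[:, j]))
    return correlations
```

In a fuller experiment, the same loop would wrap whichever early-fusion or late-fusion model is being compared, with recordings also held out for testing.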

Fig. 4. (a) Early fusion versus (b) late fusion to obtain the PESnQ score. Pearson's correlation is 0.725 for the early fusion method and 0.904 for the late fusion method, both statistically significant at p < 0.001.

Table 6. Pearson's correlation between the individual perceptual parameter score predictions and the overall singing quality (PESnQ) score obtained by the late fusion method. (vq: voice quality, pronun: pronunciation, pdr: pitch dynamic range)