Combining augmented statistical noise suppression and framewise speech/non-speech classification for robust voice activity detection

Published online by Cambridge University Press: 14 July 2017

Yasunari Obuchi*
Affiliation:
School of Media Science, Tokyo University of Technology, 1404-1 Katakura, Hachioji, Tokyo 192-0982, Japan
* Corresponding author: Y. Obuchi, Email: obuchiysnr@stf.teu.ac.jp

Abstract

This paper proposes a new voice activity detection (VAD) algorithm based on statistical noise suppression and framewise speech/non-speech classification. Many VAD algorithms that are robust in noisy environments have been developed, and the most successful ones are related to statistical noise suppression in some way. Accordingly, we formulate our VAD algorithm as a combination of noise suppression and subsequent framewise classification. The noise suppression part is improved by introducing the idea that any unreliable frequency component should be removed, so that the decision is made from the remaining signal. This augmentation can be realized using a few additional parameters embedded in the gain-estimation process. The framewise classification part can be either model-less or model-based. A model-less classifier has the advantage that it can be applied to any situation, even if no training data are available. In contrast, a model-based classifier (e.g., a neural network-based classifier) requires training data but tends to be more accurate. The accuracy of the proposed algorithm is evaluated using the CENSREC-1-C public framework and confirmed to be superior to that of many existing algorithms.
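The two-stage structure described in the abstract (noise suppression followed by a framewise decision) can be illustrated with a minimal model-less sketch. This is not the paper's gain-estimation process, which is not specified on this page: plain spectral over-subtraction is swapped in as a stand-in, and the function names, the `alpha` and `min_gain` parameters, the initial-frames noise estimate, and the toy spectra are all assumptions made for illustration.

```python
from typing import List, Sequence


def estimate_noise(frames: Sequence[Sequence[float]], n_init: int) -> List[float]:
    """Average the power spectrum over the first n_init frames.

    Assumption: the recording starts with non-speech frames.
    """
    n_bins = len(frames[0])
    return [sum(f[k] for f in frames[:n_init]) / n_init for k in range(n_bins)]


def frame_scores(
    frames: Sequence[Sequence[float]],
    noise: Sequence[float],
    alpha: float = 2.0,
    min_gain: float = 0.1,
) -> List[float]:
    """Score each frame by the power that survives suppression.

    The gain 1 - alpha * noise / power is a plain over-subtraction rule
    (a stand-in, not the paper's estimator). Bins whose gain falls below
    min_gain are treated as unreliable and removed from the score.
    """
    scores = []
    for frame in frames:
        kept = 0.0
        for power, noise_power in zip(frame, noise):
            gain = 1.0 - alpha * noise_power / power if power > 0.0 else 0.0
            if gain >= min_gain:
                kept += gain * power
        scores.append(kept)
    return scores


def classify(scores: Sequence[float], threshold: float) -> List[bool]:
    """Model-less framewise decision: True means speech."""
    return [s > threshold for s in scores]


# Toy power spectra (4 bins per frame): three noise-only frames,
# two frames with extra energy in the middle bins, one noise-only frame.
frames = [[1.0, 1.0, 1.0, 1.0]] * 3 + [[1.0, 9.0, 8.0, 1.0]] * 2 + [[1.0, 1.0, 1.0, 1.0]]
noise = estimate_noise(frames, n_init=3)
scores = frame_scores(frames, noise, alpha=2.0, min_gain=0.1)
labels = classify(scores, threshold=1.0)
print(labels)
```

In this sketch the threshold classifier needs no training data, which mirrors the model-less case in the abstract; a model-based variant would replace `classify` with a trained classifier that takes the suppressed spectra as input.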

Information

Type
Original Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Authors, 2017

Table 1. Detail of CENSREC-1-C real dataset.

Fig. 1. Comparison of framewise scores. The same threshold-based classifier was applied.

Fig. 2. Effect of over-subtraction expressed by various α.

Fig. 3. Effect of gain augmentation expressed by various β.

Fig. 4. Effect of prominent component removal expressed by various η.

Table 2. Length and number of frames of Noisy UT-ML-JPN database.

Fig. 5. CNN topology. Relu stands for rectified linear unit. The output layer (ip2) has two units corresponding to speech and non-speech.

Fig. 6. Preliminary evaluation of classifier ensemble.

Fig. 7. Preliminary evaluation by comparing various classifiers.

Fig. 8. ROC curves for CENSREC-1-C obtained by various classifiers. DT and SVM represent the voting results of 100 classifiers.

Fig. 9. ROC curves for CENSREC-1-C obtained by the proposed algorithms. Reference points of CRF, SKF, and SKF/GP were cited from the published papers.

Fig. 10. Detail of CENSREC-1-C evaluation.

Table 3. Processing time for CENSREC-1-C.