
Noise masking method based on an effective ratio mask estimation in Gammatone channels

Published online by Cambridge University Press:  15 May 2018

Feng Bao*
Affiliation:
Department of Electrical and Computer Engineering, University of Auckland, Auckland 1010, New Zealand
Waleed H. Abdulla
Affiliation:
Department of Electrical and Computer Engineering, University of Auckland, Auckland 1010, New Zealand
*
Corresponding author: F. Bao, Email: fbao026@aucklanduni.ac.nz

Abstract

In computational auditory scene analysis, accurate estimation of the binary mask or ratio mask plays a key role in noise masking. An inaccurate estimate often introduces artifacts and temporal discontinuities in the synthesized speech. To overcome this problem, we propose a new ratio mask estimation method based on Wiener filtering in each Gammatone channel. To reconstruct the Wiener filter, we exploit the relationship between the speech and noise power spectra in each Gammatone channel to build the objective function for a convex optimization of the speech power. To improve estimation accuracy, the estimated ratio mask is further modified using its adjacent time–frequency units and then smoothed by interpolating with the estimated binary masks. Objective tests, including signal-to-noise ratio improvement, spectral distortion, and intelligibility, together with a subjective listening test, demonstrate the superiority of the proposed method over the reference methods.
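The core pipeline the abstract describes can be illustrated in a minimal sketch: a Wiener-style ratio mask per Gammatone time–frequency unit, a binary mask from a local SNR threshold, and an interpolation of the two. This is an illustrative reconstruction, not the authors' implementation; the per-unit speech and noise powers are assumed given (in the paper they come from a convex optimization), and the weight `eta` and threshold `threshold_db` are hypothetical stand-ins for the paper's η and ξ parameters.

```python
import numpy as np

def wiener_ratio_mask(speech_power, noise_power, eps=1e-12):
    """Wiener-style ratio mask per T-F unit: P_x / (P_x + P_n), in [0, 1]."""
    return speech_power / (speech_power + noise_power + eps)

def binary_mask(speech_power, noise_power, threshold_db=0.0, eps=1e-12):
    """Binary mask: 1 where the local SNR (in dB) exceeds the threshold."""
    snr_db = 10.0 * np.log10((speech_power + eps) / (noise_power + eps))
    return (snr_db > threshold_db).astype(float)

def smooth_with_binary(ratio_mask, bin_mask, eta=0.5):
    """Interpolate the ratio mask with the binary mask (eta is a
    hypothetical weight standing in for the paper's smoothing factor)."""
    return eta * ratio_mask + (1.0 - eta) * bin_mask
```

For a T–F unit with speech power 3 and noise power 1, the Wiener-style gain is 3/4; the binary mask fires because the local SNR (about 4.8 dB) exceeds a 0 dB threshold, and the interpolated mask lies between the two values.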

Information

Type
Original Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is unaltered and is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use or in order to create a derivative work.
Copyright
Copyright © The Authors, 2018

Fig. 1. Block diagram of the proposed method.


Fig. 2. Block diagram of the speech synthesis mechanism.


Fig. 3. Average HIT-FA score histogram with respect to factor η.


Fig. 4. Average HIT-FA score histogram with respect to threshold ξ.


Fig. 5. An example of normalized cross-correlation coefficients in different channels (Input SNR = 5 dB, white noise). (a) True noise condition. (b) Estimated noise condition.


Table 1. Iterative algorithm of $\hat{P}_{x}$.


Fig. 6. Speech power error comparison against the number of iterations.


Fig. 7. Cochleogram comparison (Input SNR = 5 dB, factory1 noise). (a) Cochleogram resynthesized by the ideal ratio mask; (b) cochleogram resynthesized by the estimated binary mask VB; (c) cochleogram resynthesized by the initial ratio mask VR; (d) cochleogram resynthesized by the ratio mask modified with adjacent T–F units R; (e) cochleogram resynthesized by the ratio mask smoothed with the binary mask R.


Fig. 8. Speech waveform comparison of five channels (Input SNR = 5 dB, white noise). (a) Clean speech; (b) Noisy speech; (c) Enhanced speech.


Fig. 9. Spectrogram comparison (Input SNR = 5 dB, factory1 noise), (“She had your dark suit in greasy wash water all year”).


Table 2. SSNR improvement results.


Table 3. LSD results.


Table 4. STOI results.


Fig. 10. The MUSHRA results for five types of noises.