Combining augmented statistical noise suppression and framewise speech/non-speech classification for robust voice activity detection

  • Yasunari Obuchi
Abstract

This paper proposes a new voice activity detection (VAD) algorithm based on statistical noise suppression and framewise speech/non-speech classification. Although many VAD algorithms have been developed to be robust in noisy environments, the most successful ones are related to statistical noise suppression in some way. Accordingly, we formulate our VAD algorithm as a combination of noise suppression and subsequent framewise classification. The noise suppression part is improved by introducing the idea that any unreliable frequency component should be removed, so that the decision can be made using the remaining signal. This augmentation can be realized using a few additional parameters embedded in the gain-estimation process. The framewise classification part can be either model-less or model-based. A model-less classifier has the advantage that it can be applied to any situation, even when no training data are available. In contrast, a model-based classifier (e.g., a neural network-based classifier) requires training data but tends to be more accurate. The accuracy of the proposed algorithm is evaluated using the CENSREC-1-C public framework and confirmed to be superior to that of many existing algorithms.
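As a rough illustration of the two-stage structure the abstract describes (noise suppression that floors unreliable frequency components, followed by a model-less framewise decision), the following Python sketch uses a simple spectral-subtraction-style suppressor and an energy-threshold classifier. This is a generic sketch, not the paper's actual augmented gain estimation; all function names, parameters, and threshold values here are hypothetical.

```python
import numpy as np

def framewise_vad(signal, frame_len=400, hop=160,
                  noise_frames=10, gain_floor=0.1, threshold=5.0):
    """Two-stage framewise VAD sketch: spectral noise suppression
    followed by a model-less (threshold-based) speech/non-speech
    decision. Illustrative only; parameters are hypothetical."""
    # Slice the signal into overlapping Hann-windowed frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])

    # Per-frame power spectra.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Noise spectrum estimated from the leading frames (assumed non-speech).
    noise = power[:noise_frames].mean(axis=0) + 1e-12

    # Spectral-subtraction-style gain; bins with low a-posteriori SNR are
    # treated as unreliable and clamped to a small floor, so the decision
    # rests mainly on the remaining (reliable) components.
    post_snr = power / noise
    gain = np.maximum(1.0 - 1.0 / post_snr, gain_floor ** 2)
    clean_power = gain * power

    # Model-less classification: residual frame energy relative to the
    # residual energy of the noise-only leading frames.
    frame_energy = clean_power.sum(axis=1)
    noise_ref = frame_energy[:noise_frames].mean() + 1e-12
    return frame_energy / noise_ref > threshold  # True = speech frame
```

Running this on, say, white noise with a sinusoidal burst in the middle marks the burst frames as speech and the surrounding frames as non-speech. A model-based variant would replace the final threshold test with a trained classifier applied to the suppressed spectra.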

Copyright
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Corresponding author
Corresponding author: Y. Obuchi, Email: obuchiysnr@stf.teu.ac.jp
References
[1] Rabiner, L.R.; Sambur, M.R.: An algorithm for determining the endpoints of isolated utterances. Bell Syst. Tech. J., 54 (2) (1975), 297–315.
[2] Bou-Ghazale, S.E.; Assaleh, K.: A robust endpoint detection of speech for noisy environments with application to automatic speech recognition, in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Orlando, FL, USA, 2002, IV-3808–IV-3811.
[3] Martin, A.; Charlet, D.; Mauuary, L.: Robust speech/non-speech detection using LDA applied to MFCC, in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Salt Lake City, UT, USA, 2001, 237–240.
[4] Kinnunen, T.; Chemenko, E.; Tuononen, M.; Fränti, P.; Li, H.: Voice activity detection using MFCC features and support vector machine, in Int. Conf. on Speech and Computer, Moscow, Russia, 2007, 556–561.
[5] Shen, J.-L.; Hung, J.-W.; Lee, L.-S.: Robust entropy-based endpoint detection for speech recognition in noisy environments, in Int. Conf. on Spoken Language Processing, Sydney, Australia, 1998, 232–235.
[6] Ramirez, J.; Yelamos, P.; Gorriz, J.M.; Segura, J.C.: SVM-based speech endpoint detection using contextual speech features. Electron. Lett., 42 (7) (2006), 426–428.
[7] Ishizuka, K.; Nakatani, T.: Study of noise robust voice activity detection based on periodic component to aperiodic component ratio, in ISCA Tutorial and Research Workshop on Statistical and Perceptual Audition, Pittsburgh, PA, USA, 2006, 65–70.
[8] Cournapeau, D.; Kawahara, T.: Evaluation of real-time voice activity detection based on high order statistics, in Interspeech, Antwerp, Belgium, 2007, 2945–2948.
[9] Lee, A.; Nakamura, K.; Nisimura, R.; Saruwatari, H.; Shikano, K.: Noise robust real world spoken dialog system using GMM based rejection of unintended inputs, in Interspeech, Jeju Island, Korea, 2004, 173–176.
[10] Zhang, X.-L.; Wu, J.: Deep belief networks based voice activity detection. IEEE Trans. Audio Speech Lang. Process., 21 (4) (2013), 697–710.
[11] Hughes, T.; Mierle, K.: Recurrent neural networks for voice activity detection, in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 2013, 7378–7382.
[12] Ryant, N.; Liberman, M.; Yuan, J.: Speech activity detection on YouTube using deep neural networks, in Interspeech, Lyon, France, 2013, 728–731.
[13] Fujita, Y.; Iso, K.: Robust DNN-based VAD augmented with phone entropy based rejection of background speech, in Interspeech, San Francisco, CA, USA, 2016, 3663–3667.
[14] Ramirez, J.; Segura, J.C.; Benitez, C.; de la Torre, A.; Rubio, A.: An effective subband OSF-based VAD with noise reduction for robust speech recognition. IEEE Trans. Speech Audio Process., 13 (6) (2005), 1119–1129.
[15] Kingsbury, B.; Jain, P.; Adami, A.G.: A hybrid HMM/traps model for robust voice activity detection, in Interspeech, Denver, CO, USA, 2002, 1073–1076.
[16] Saito, A.; Nankaku, Y.; Lee, A.; Tokuda, K.: Voice activity detection based on conditional random field using multiple features, in Interspeech, Makuhari, Japan, 2010, 2086–2089.
[17] Sohn, J.; Kim, N.S.; Sung, W.: A statistical model-based voice activity detection. IEEE Signal Process. Lett., 6 (1) (1999), 1–3.
[18] Ephraim, Y.; Malah, D.: Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process., 32 (6) (1984), 1109–1121.
[19] Fujimoto, M.; Ishizuka, K.: Noise robust voice activity detection based on switching Kalman filter. IEICE Trans. Inf. Syst., E91-D (3) (2008), 467–477.
[20] Cohen, I.; Berdugo, B.: Speech enhancement for non-stationary noise environments. Signal Process., 81 (2001), 2403–2418.
[21] Obuchi, Y.; Takeda, R.; Kanda, N.: Voice activity detection based on augmented statistical noise suppression, in APSIPA Annu. Summit and Conf., Hollywood, CA, USA, 2012, 1–4.
[22] Obuchi, Y.: Framewise speech-nonspeech classification by neural networks for voice activity detection with statistical noise suppression, in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Shanghai, China, 2016, 5715–5719.
[23] Ephraim, Y.; Malah, D.: Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process., ASSP-33 (2) (1985), 443–445.
[24] Obuchi, Y.; Takeda, R.; Togami, M.: Bidirectional OM-LSA speech estimator for noise robust speech recognition, in IEEE Automatic Speech Recognition and Understanding Workshop, Big Island, HI, USA, 2011, 173–178.
[25] Kitaoka, N. et al.: CENSREC-1-C: an evaluation framework for voice activity detection under noisy environments. Acoust. Sci. Technol., 30 (5) (2009), 363–371.
[26] Kim, C.; Stern, R.M.: Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis, in Interspeech, Brisbane, Australia, 2008, 2598–2601.
[27] Speech Resources Consortium (NII-SRC): University of Tsukuba Multilingual Speech Corpus (UT-ML). http://research.nii.ac.jp/src/en/UT-ML.html.
[28] Deng, L.; Acero, A.; Plumpe, M.; Huang, X.: Large-vocabulary speech recognition under adverse acoustic environments, in Int. Conf. on Spoken Language Processing, Beijing, China, 2000, 806–809.
[29] Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explorations, 11 (1) (2009), 10–18.
[30] Jia, Y. et al.: Caffe: convolutional architecture for fast feature embedding, arXiv preprint (2014), arXiv:1408.5093.
[31] Fujimoto, M.; Watanabe, S.; Nakatani, T.: Voice activity detection using frame-wise model re-estimation method based on Gaussian pruning with weight normalization, in Interspeech, Makuhari, Japan, 2010, 3102–3105.
[32] ITU-T: A silence compression scheme for G.729 optimized for terminals conforming to Recommendation V.70, ITU-T Recommendation G.729 – Annex B, 1996.
[33] ETSI ES 202 050 v1.1.5: Speech processing, transmission and quality aspects (STQ); distributed speech recognition; advanced front-end feature extraction algorithm; compression algorithm, 2007.
APSIPA Transactions on Signal and Information Processing
  • ISSN: 2048-7703
  • EISSN: 2048-7703