Nested Gibbs sampling for mixture-of-mixture model and its application to speaker clustering

Published online by Cambridge University Press:  31 August 2016

Naohiro Tawara*
Affiliation:
Department of Communications and Computer Engineering, Waseda University, Tokyo, Japan
Tetsuji Ogawa
Affiliation:
Department of Communications and Computer Engineering, Waseda University, Tokyo, Japan
Shinji Watanabe
Affiliation:
Mitsubishi Electric Research Laboratories, MA, USA
Tetsunori Kobayashi
Affiliation:
Department of Communications and Computer Engineering, Waseda University, Tokyo, Japan
*Corresponding author: N. Tawara, tawara@pcl.cs.waseda.ac.jp

Abstract

This paper proposes a novel model estimation method that uses nested Gibbs sampling to estimate a mixture-of-mixture model, in which the distribution of each mixture component is itself represented by a mixture model. This model is suitable for analyzing multilevel data, such as videos and acoustic signals, which are composed of frame-wise observations. Deterministic procedures, such as the expectation–maximization algorithm, have been employed to estimate these kinds of models, but this approach often suffers from a large bias when the amount of data is limited. To avoid this problem, we introduce a Markov chain Monte Carlo-based model estimation method. In particular, we aim to identify a suitable sampling method for mixture-of-mixture models. Gibbs sampling is a possible approach, but it can easily become trapped in a local optimum when each component is represented by a multi-modal distribution. Thus, we propose a novel Gibbs sampling method, called “nested Gibbs sampling,” which represents the lower-level (fine) data structure based on elemental mixture distributions and the higher-level (coarse) data structure based on mixture-of-mixture distributions. We applied this method to a speaker clustering problem and conducted experiments under various conditions. The results demonstrated that the proposed method outperformed conventional sampling-based, variational Bayesian, and hierarchical agglomerative methods.
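To make the two-level structure described in the abstract concrete, the following is a minimal sketch of how a mixture-of-mixture model could score one segment of frame-wise observations. It assumes diagonal-covariance Gaussian components, and every name in it (`segment_log_posterior`, `h`, `w`, `means`, `variances`) is hypothetical rather than the paper's notation. It only evaluates a fixed model; it is not the nested Gibbs sampler itself.

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """Log density of diagonal-covariance Gaussians.
    x: (T, D) frames; mean, var: (K, D). Returns a (T, K) array."""
    diff = x[:, None, :] - mean[None, :, :]
    terms = np.log(2.0 * np.pi * var)[None, :, :] + diff ** 2 / var[None, :, :]
    return -0.5 * terms.sum(axis=2)

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def segment_log_posterior(frames, h, w, means, variances):
    """Log posterior over higher-level components for one segment.
    frames: (T, D); h: (S,) higher-level weights (assumed name);
    w: (S, K) lower-level weights; means, variances: (S, K, D)."""
    n_upper = h.shape[0]
    log_joint = np.empty(n_upper)
    for s in range(n_upper):
        # Lower level: each frame is scored by the s-th elemental mixture.
        per_frame = logsumexp(
            np.log(w[s])[None, :] + log_gauss_diag(frames, means[s], variances[s]),
            axis=1,
        )
        # Higher level: all frames of the segment share one component s.
        log_joint[s] = np.log(h[s]) + per_frame.sum()
    return log_joint - logsumexp(log_joint, axis=0)

# Toy usage with made-up parameters: two higher-level components,
# each a two-component Gaussian mixture in two dimensions.
rng = np.random.default_rng(0)
h = np.array([0.5, 0.5])
w = np.array([[0.5, 0.5], [0.5, 0.5]])
means = np.array([[[-5.0, -5.0], [-4.0, -6.0]],
                  [[5.0, 5.0], [4.0, 6.0]]])
variances = np.ones((2, 2, 2))
frames = rng.normal(loc=[5.0, 5.0], scale=1.0, size=(50, 2))
log_post = segment_log_posterior(frames, h, w, means, variances)
best = int(np.argmax(log_post))
```

In this toy setup the frames are drawn near the means of the second higher-level component, so `best` selects that component. In a speaker clustering reading, the higher-level index would play the role of the speaker and the segment the role of an utterance, which is an interpretation of the abstract rather than a statement of the authors' implementation.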

Information

Type
Original Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Authors, 2016
Fig. 1. Hierarchical structure of multi-level data analysis. Segment-wise (higher-level) observations are composed of a set of frame-wise (lower-level) observations. The left figure illustrates the hierarchical structure in speech data composed of frame-wise observations (e.g. Mel-frequency cepstral coefficients).

Fig. 2. Graphical representation of mixture-of-mixture model. The white square denotes frame-wise observations, and dots denote the hyper-parameters of prior distributions.

Algorithm 1: Model estimation algorithm using the VB method.

Algorithm 2: Model estimation algorithm based on the proposed nested Gibbs sampling method.

Fig. 3. LML obtained using the proposed nested Gibbs sampler, applied to A1+station noise. Refer to Table 1 for the details of test set A1. Each panel shows results with a different sampling size Nsamp. The eight lines correspond to the results of eight trials using different random seeds. (a) NGibbs=1; (b) NGibbs=5.

Fig. 4. LML as a function of K value. Each plot shows the results obtained by applying the proposed nested Gibbs sampler to five different datasets (id: 000, 001, …, 004). Refer to Table 1 for the details of test set B1. (a) B1 (clean); (b) B1+crowd; (c) B1+street; (d) B2+party; (e) B2+station.

Table 1. Details of test set.

Fig. 5. K values obtained by the existing Gibbs sampler and the proposed nested Gibbs sampler applied to (a) clean (A1) and (b) noisy (A1 + crowd) speech.

Fig. 6. LML obtained by Gibbs and nested Gibbs sampling with SA, applied to A1. Each panel shows the results for a different initial temperature βinit. The eight lines correspond to the results of eight trials with different seeds. (a) Gibbs (βinit=1 (w/o annealing)); (b) nested Gibbs (βinit=1 (w/o annealing)); (c) Gibbs (βinit=10); (d) nested Gibbs (βinit=10); (e) Gibbs (βinit=30); (f) nested Gibbs (βinit=30).

Table 2. K values for clean test sets.

Table 3. K values for noisy test sets. Four types of noise (crowd, street, party, and station) are superimposed on the speech of nine datasets.