
A sampling-based speaker clustering using utterance-oriented Dirichlet process mixture model and its evaluation on large-scale data

Published online by Cambridge University Press:  28 October 2015

Naohiro Tawara*
Affiliation:
Department of Computer Science, Waseda University, Tokyo, Japan
Tetsuji Ogawa
Affiliation:
Department of Computer Science, Waseda University, Tokyo, Japan
Shinji Watanabe
Affiliation:
Mitsubishi Electric Research Laboratories, Cambridge, MA, USA
Atsushi Nakamura
Affiliation:
Graduate School of Natural Sciences, Nagoya City University, Nagoya, Japan
Tetsunori Kobayashi
Affiliation:
Department of Computer Science, Waseda University, Tokyo, Japan
*Corresponding author: N. Tawara. Email: tawara@pcl.cs.waseda.ac.jp

Abstract

An infinite mixture model is applied to model-based speaker clustering, with sampling-based optimization, so that the number of speakers can be estimated from the data. To this end, non-parametric Bayesian modeling is implemented with Markov chain Monte Carlo sampling and incorporated into the utterance-oriented speaker model. The proposed model is called the utterance-oriented Dirichlet process mixture model (UO-DPMM). This paper demonstrates that UO-DPMM can be applied successfully to large-scale data and that it outperforms conventional hierarchical agglomerative clustering, especially when the number of utterances is large.
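
As a rough illustration of why a Dirichlet process mixture does not need the number of speakers fixed in advance, the sketch below draws a partition from the Chinese restaurant process, the prior over cluster assignments that a Dirichlet process induces. It is not the authors' UO-DPMM: it has no acoustic likelihood and no utterance-level speaker model, and the function name `sample_crp_partition` and the concentration parameter `alpha` are assumptions introduced here for illustration only.

```python
import random


def sample_crp_partition(n_items, alpha, seed=0):
    """Draw one partition of n_items from a Chinese restaurant process.

    Hypothetical helper for illustration; alpha (> 0) is the concentration
    parameter. Returns (assignments, cluster_sizes).
    """
    rng = random.Random(seed)
    assignments = []    # cluster index assigned to each item, in order
    cluster_sizes = []  # number of items currently in each cluster
    for _ in range(n_items):
        # An existing cluster k is chosen with weight equal to its size;
        # a brand-new cluster is chosen with weight alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)   # open a new cluster
        else:
            cluster_sizes[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, cluster_sizes


if __name__ == "__main__":
    assignments, sizes = sample_crp_partition(n_items=200, alpha=2.0, seed=1)
    print("number of clusters:", len(sizes))
    print("cluster sizes:", sizes)
```

In a full mixture-model sampler, each weight would additionally be multiplied by the likelihood of the item (here, an utterance) under the corresponding cluster, and items would be reassigned repeatedly rather than seated once. The number of occupied clusters then varies from sweep to sweep instead of being set beforehand.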

Information

Type
Original Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Authors, 2015

Fig. 1. Graphical models of utterance-oriented mixture models for (a) finite and (b) infinite speakers.

Table 1. Details of test set. # speakers, # utterances, # samples, and total duration denote the number of speakers, number of utterances, number of frame-wise observations, and total duration.

Algorithm 1 Speaker clustering using UO-DPMM. Threshold is 30 for TIMIT and 50 for CSJ.

Table 2. Speaker clustering results for TIMIT. #cl. denotes the number of clusters estimated.

Table 3. Speaker clustering results for CSJ. #cl. denotes the number of clusters estimated.

Fig. 2. K values obtained from the proposed method for (a) T-1, (b) T-2, (c) C-1, (d) C-2, (e) C-3, and (f) C-4. The eight lines in each panel show the results of eight trials using different seeds.