
Latent acoustic topic models for unstructured audio classification

Published online by Cambridge University Press:  10 December 2012

Samuel Kim*
Affiliation: 3710 S. McClintock Ave, RTH 320, Los Angeles, CA 90089, U.S.A.
Panayiotis Georgiou
Affiliation: 3710 S. McClintock Ave, RTH 320, Los Angeles, CA 90089, U.S.A.
Shrikanth Narayanan
Affiliation: 3710 S. McClintock Ave, RTH 320, Los Angeles, CA 90089, U.S.A.
*Corresponding author: Samuel Kim. E-mail: worshipersam@gmail.com

Abstract

We propose the notion of latent acoustic topics to capture contextual information embedded within a collection of audio signals. The central idea is to learn, in an unsupervised manner, a probability distribution over a set of latent topics for a given audio clip, assuming that latent acoustic topics exist and that each audio clip can be described in terms of them. To this end, we use latent Dirichlet allocation (LDA) to implement acoustic topic models over elemental acoustic units, referred to as acoustic words, and perform text-like audio signal processing. Experiments on audio tag classification with the BBC sound effects library demonstrate the usefulness of the proposed latent audio context modeling schemes. In particular, the proposed method is shown to be superior to other latent structure analysis methods, such as latent semantic analysis (LSA) and probabilistic latent semantic analysis (pLSA). We also demonstrate that topic models can serve as complementary features to content-based features, offering about 9% relative improvement in audio classification when combined with the traditional Gaussian mixture model (GMM)–support vector machine (SVM) technique.
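The pipeline described in the abstract can be sketched as follows: frame-level features are vector-quantized into "acoustic words", each clip becomes a bag-of-acoustic-words count vector, LDA infers a per-clip topic distribution, and an SVM classifies clips from those topic features. This is a minimal illustrative sketch, not the authors' exact implementation: the use of k-means for the dictionary, scikit-learn's LDA, and all parameter values (dictionary size, topic count, kernel) are assumptions chosen to keep the toy example small and runnable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy data: 20 clips, each a sequence of 13-dim frame features (e.g. MFCC-like).
# A class-dependent mean offset makes the toy problem learnable.
n_clips, frames_per_clip, dim = 20, 50, 13
labels = np.repeat([0, 1], n_clips // 2)            # two audio tags
clips = [rng.normal(loc=labels[i], size=(frames_per_clip, dim))
         for i in range(n_clips)]

# 1) Acoustic word dictionary via k-means over all frames (the paper's
#    dictionaries go up to 1000 words; 16 is for illustration only).
n_words = 16
km = KMeans(n_clusters=n_words, n_init=5, random_state=0).fit(np.vstack(clips))

# 2) Bag-of-acoustic-words counts per clip.
counts = np.zeros((n_clips, n_words), dtype=int)
for i, clip in enumerate(clips):
    for w in km.predict(clip):
        counts[i, w] += 1

# 3) LDA infers a latent-topic distribution per clip; each row of theta
#    is that clip's probability distribution over topics.
n_topics = 4
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
theta = lda.fit_transform(counts)

# 4) SVM on the topic-distribution features for tag classification.
clf = SVC(kernel="rbf").fit(theta, labels)
train_acc = clf.score(theta, labels)
print(f"train accuracy on toy data: {train_acc:.2f}")
```

In the paper's hybrid setting, such topic-distribution features would be combined with content-based GMM features before the SVM stage; here only the topic branch is shown.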

Information

Type
Original Article
Creative Commons
The online version of this article is published within an Open Access environment subject to the conditions of the Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license. The written permission of Cambridge University Press must be obtained for commercial re-use.
Copyright
Copyright © The Authors, 2012.
Figures and Tables

Fig. 1. Graphical representation of the topic model using LDA.

Fig. 2. Diagram of the proposed acoustic topic modeling procedure for unstructured audio signals.

Fig. 3. An example interpretation of the acoustic topic models as a type of probabilistic clustering.

Fig. 4. Illustrative examples of acoustic topic modeling: (a) topic distribution in a given audio clip, and (b)–(f) the 5 most probable acoustic words in each of the five most probable topics (#46, #80, #50, #66, and #83, respectively). The number of acoustic words is 1000 and the number of latent topics is 100.

Fig. 5. A simple diagram of the two-step learning strategy for the audio tag classification task.

Table 1. Summary of the BBC sound effects library.

Table 2. Distribution of onomatopoeic and semantic labels in the BBC sound library (22 onomatopoeic labels and 21 semantic labels).

Fig. 6. Audio tag classification results using LSA, pLSA, and ATM according to the number of latent components: (a) onomatopoeic labels and (b) semantic labels.

Fig. 7. Audio tag classification results for onomatopoeic labels using (a) ATM, (b) pLSA, and (c) LSA according to the number of latent components and the size of the acoustic word dictionary.

Fig. 8. Audio tag classification results for semantic labels using (a) ATM, (b) pLSA, and (c) LSA according to the number of latent components and the size of the acoustic word dictionary.

Fig. 9. Audio tag classification results for (a) onomatopoeic labels and (b) semantic labels using ATM, pLSA, and LSA according to the size of the acoustic word dictionary, with the number of topics set to 5% of the dictionary size.

Fig. 10. Audio tag classification results for onomatopoeic labels using (a) ATM and GMM, and (b) their hybrid, according to the number of latent clusters.

Fig. 11. Audio tag classification results for semantic labels using (a) ATM and GMM, and (b) their hybrid, according to the number of latent clusters.

Fig. 12. Per-class F-measure of audio tag classification results for (a) onomatopoeic labels and (b) semantic labels, using ATM, GMM, and their hybrid method. The number of latent components is 100, and the size of the acoustic word dictionary is 1000.