Chord-aware automatic music transcription based on hierarchical Bayesian integration of acoustic and language models

Published online by Cambridge University Press:  22 November 2018

Yuta Ojima
Affiliation:
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto, Japan
Eita Nakamura
Affiliation:
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto, Japan
Katsutoshi Itoyama
Affiliation:
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto, Japan
Kazuyoshi Yoshii*
Affiliation:
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto, Japan
*Corresponding author: Kazuyoshi Yoshii. Email: yoshii@kuis.kyoto-u.ac.jp

Abstract

This paper describes a method for automatic music transcription with chord estimation for music audio signals. We focus on the fact that concurrent structures of musical notes, such as chords, form the basis of harmony and are taken into account in music composition. Since chords and musical notes are closely linked, we propose joint pitch and chord estimation based on a hierarchical Bayesian model that consists of an acoustic model representing the generative process of a spectrogram and a language model representing the generative process of a piano roll. The acoustic model is formulated as a variant of non-negative matrix factorization (NMF) whose binary variables indicate a piano roll. The language model is formulated as a hidden Markov model (HMM) that has chord labels as latent variables and emits a piano roll; it can also represent the sequential dependency of the piano roll. The two models are integrated through the piano roll in a hierarchical Bayesian manner, and all the latent variables and parameters are estimated using Gibbs sampling. The experimental results showed the potential of the proposed method for unified music transcription and grammar induction.
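The generative structure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Poisson observation model, the Bernoulli emissions, the array shapes, and all function and variable names (`sample_chords`, `sample_piano_roll`, `sample_spectrogram`, `W`, `H`, `S`) are assumptions made for illustration, and the sketch only samples forward from the model rather than performing Gibbs sampling.

```python
import numpy as np


def sample_chords(n_steps, init, trans, rng):
    """Sample chord labels z[0..n_steps-1] from a first-order Markov chain."""
    z = np.empty(n_steps, dtype=int)
    z[0] = rng.choice(len(init), p=init)
    for t in range(1, n_steps):
        z[t] = rng.choice(trans.shape[1], p=trans[z[t - 1]])
    return z


def sample_piano_roll(z, emission, rng):
    """Language model (assumed form): S[k, t] ~ Bernoulli(emission[z[t], k])."""
    probs = emission[z].T  # shape (n_pitches, n_steps)
    return (rng.random(probs.shape) < probs).astype(int)


def sample_spectrogram(S, W, H, rng):
    """Acoustic model (assumed form): X[f, t] ~ Poisson(sum_k W[f, k] H[k, t] S[k, t]).

    The binary piano roll S masks the NMF activations H, so a pitch
    contributes energy only in the steps where it is switched on.
    """
    rate = W @ (H * S)
    return rng.poisson(rate)


def demo(seed=0):
    """Draw one chord sequence, piano roll, and spectrogram from toy parameters."""
    rng = np.random.default_rng(seed)
    n_chords, n_pitches, n_freq, n_steps = 3, 12, 20, 8
    init = np.full(n_chords, 1.0 / n_chords)
    trans = np.full((n_chords, n_chords), 0.1) + 0.7 * np.eye(n_chords)
    emission = rng.uniform(0.05, 0.9, size=(n_chords, n_pitches))
    W = rng.gamma(1.0, 1.0, size=(n_freq, n_pitches))   # basis spectra
    H = rng.gamma(1.0, 1.0, size=(n_pitches, n_steps))  # activations
    z = sample_chords(n_steps, init, trans, rng)
    S = sample_piano_roll(z, emission, rng)
    X = sample_spectrogram(S, W, H, rng)
    return z, S, X
```

Calling `demo()` returns a chord sequence, a binary piano roll, and a non-negative integer spectrogram. The point of the sketch is that the piano roll is the only quantity shared by the two models, which is why inferring it jointly with the chords lets the language model constrain the acoustic decomposition.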

Information

Type
Original Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Authors, 2018

Fig. 1. A hierarchical generative model consisting of language and acoustic models that are linked through binary variables representing the presence of pitches.

Fig. 2. An acoustic model based on a variant of NMF with binary variables indicating a piano roll.

Fig. 3. Harmonic and noise spectra learned from data.

Fig. 4. A language model based on an autoregressive HMM that emits sequentially dependent binary variables.

Algorithm 1 Posterior inference

Fig. 5. The emission probabilities $\overline{\boldsymbol{\pi}}$ obtained by the tatum-level model assuming the independence of S.

Fig. 6. The emission probabilities $\overline{\boldsymbol{\pi}}$ obtained by the tatum-level model assuming the sequential dependency of S.

Table 1. Accuracy of unsupervised chord estimation.

Table 2. Experimental results of multipitch analysis based on the frame-level model for 30 piano pieces labeled as ENSTDkCl.

Table 3. Experimental results of multipitch analysis based on the tatum-level model for 30 piano pieces labeled as ENSTDkCl.

Fig. 7. Piano rolls estimated for MUS-chpn_p19_ENSTDkCl. (a) Sparse model, (b) chord-aware model (α=1), (c) chord-aware model (α=12.5), (d) chord-aware Markov model (α=12.5).

Fig. 8. Piano rolls of two musical pieces estimated by using the frame-level and tatum-level models.

Fig. 9. The emission probabilities $\overline{\boldsymbol{\pi}}$ estimated for MUS-chpn_p19_ENSTDkCl.

Table 4. Experimental results of multipitch analysis based on the pre-trained frame-level model.

Table 5. Experimental results of multipitch analysis based on the pre-trained tatum-level model.

Table 6. Performance comparison between five methods.