
Audio-to-score singing transcription based on a CRNN-HSMM hybrid model

Published online by Cambridge University Press: 20 April 2021

Ryo Nishikimi*
Affiliation:
Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
Eita Nakamura
Affiliation:
Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan; The Hakubi Center for Advanced Research, Kyoto University, Kyoto 606-8501, Japan
Masataka Goto
Affiliation:
National Institute of Advanced Industrial Science and Technology (AIST), Ibaraki 305-8568, Japan
Kazuyoshi Yoshii
Affiliation:
Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan; PRESTO, Japan Science and Technology Agency, Tokyo 102-0076, Japan
*Corresponding author: Ryo Nishikimi. Email: nishikimi@sap.ist.i.kyoto-u.ac.jp

Abstract

This paper describes an automatic singing transcription (AST) method that estimates a human-readable musical score of a sung melody from an input music signal. Because a singing voice varies considerably in pitch and timing, a naive cascading approach that estimates an F0 contour and quantizes it with estimated tatum times cannot avoid many pitch and rhythm errors. To solve this problem, we formulate a unified generative model of a music signal that consists of two parts: a semi-Markov language model representing the generative process of latent musical notes conditioned on musical keys, and an acoustic model based on a convolutional recurrent neural network (CRNN) representing the generative process of an observed music signal from the notes. The resulting hybrid of the CRNN and a hidden semi-Markov model (HSMM), the CRNN-HSMM, enables us to estimate the most likely musical notes from a music signal with the Viterbi algorithm, while leveraging both the grammatical knowledge about musical notes and the expressive power of the CRNN. The experimental results showed that the proposed method outperformed the conventional state-of-the-art method and that integrating the musical language model with the acoustic model had a positive effect on AST performance.
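
To make the naive cascading approach concrete, the following is a minimal Python sketch of quantizing an F0 contour on a tatum grid by majority vote. It is an illustration only, not the baseline implementation evaluated in the paper. The function names, the convention that unvoiced frames carry an F0 of 0, the A4 = 440 Hz tuning, and the rule that a tatum becomes a rest when fewer than half of its frames are voiced are all assumptions made for this example.

```python
import numpy as np


def hz_to_midi(f0_hz):
    """Convert F0 values in Hz to fractional MIDI note numbers (A4 = 440 Hz = 69)."""
    return 69.0 + 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / 440.0)


def quantize_f0_by_tatum(f0_hz, frame_times, tatum_times):
    """Assign one semitone pitch (or None for a rest) to each tatum interval.

    f0_hz       : per-frame F0 in Hz; 0 marks an unvoiced frame (assumed convention)
    frame_times : per-frame time stamps in seconds
    tatum_times : tatum boundaries in seconds, in increasing order
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    frame_times = np.asarray(frame_times, dtype=float)
    pitches = []
    for start, end in zip(tatum_times[:-1], tatum_times[1:]):
        in_interval = (frame_times >= start) & (frame_times < end)
        voiced = f0_hz[in_interval]
        voiced = voiced[voiced > 0]
        # Treat the tatum as a rest when fewer than half of its frames are voiced.
        if voiced.size == 0 or voiced.size * 2 < in_interval.sum():
            pitches.append(None)
            continue
        # Round each voiced frame to the nearest semitone and take the most common one.
        semitones = np.rint(hz_to_midi(voiced)).astype(int)
        values, counts = np.unique(semitones, return_counts=True)
        pitches.append(int(values[np.argmax(counts)]))
    return pitches


def merge_tatums_into_notes(pitches):
    """Merge consecutive tatums with equal pitch into (pitch, onset_tatum, n_tatums)."""
    notes = []
    for index, pitch in enumerate(pitches):
        if notes and notes[-1][0] == pitch:
            last_pitch, onset, duration = notes[-1]
            notes[-1] = (last_pitch, onset, duration + 1)
        else:
            notes.append((pitch, index, 1))
    # Drop the rest segments so that only sung notes remain.
    return [note for note in notes if note[0] is not None]
```

In this sketch every tatum is decided in isolation from the F0 values alone. Vibrato, pitch glides, or F0 estimation errors can therefore flip individual tatums, and two consecutive notes of the same pitch are always merged into one. No knowledge of keys or typical rhythms is available to correct such errors, which is the gap the language model in the proposed method is meant to fill.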

Information

Type
Original Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Author(s), 2021. Published by Cambridge University Press in association with Asia Pacific Signal and Information Processing Association

Fig. 1. The problem of automatic singing transcription. The proposed method takes as input a spectrogram of a target music signal and tatum times and estimates a musical score of a sung melody.

Fig. 2. The proposed hierarchical probabilistic model that consists of an SMM-based language model representing the generative process of musical notes from local keys and a CRNN-based acoustic model representing the generative process of an observed spectrogram from the musical notes. We aim to infer the latent notes and keys from the observed spectrogram.

Fig. 3. Representation of a melody note sequence and variables of the language model.

Fig. 4. The acoustic model $p(\mathbf{X} \mid \overline{\mathbf{P}}, \overline{\mathbf{C}})$ representing the generative process of the spectrogram $\mathbf{X}$ from note pitches $\overline{\mathbf{P}}$ and residual durations $\overline{\mathbf{C}}$.

Fig. 5. Architecture of the CNN. The three numbers in parentheses in each layer indicate the channel size, the kernel height, and the kernel width.

Table 1. The AST performances of the different methods.

Fig. 6. Examples of musical scores estimated by the proposed method, the CRNN method, the HHSMM-based method, and the majority-vote method from the separated audio signals and the estimated F0 contours and tatum times. Transcription errors are indicated by the red squares. Capital letters attached to the red squares represent the following error types: pitch error (P), rhythm error (R), deletion error (D), and insertion error (I). Error labels are not shown in the transcription result by the majority-vote method, which contains too many errors.

Fig. 7. The transition probabilities $\bar{\boldsymbol{\phi}}$ and $\boldsymbol{\psi}$ trained from the existing musical scores. The triangles indicate (a) the seven pitch classes on the C major scale and (b) the eighth-note-level metrical positions.

Table 2. The AST performances obtained from the different input data.

Fig. 8. Examples of musical scores obtained with and without singing voice separation when the ground-truth tatum times were used. The left and right figures illustrate the positive and negative impacts of singing voice separation, respectively.