A multi-branch ResNet with discriminative features for detection of replay speech signals

Published online by Cambridge University Press: 29 December 2020

Xingliang Cheng
Affiliation:
Center for Speech and Language Technologies, Beijing National Research Center for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, China
Mingxing Xu
Affiliation:
Center for Speech and Language Technologies, Beijing National Research Center for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, China
Thomas Fang Zheng*
Affiliation:
Center for Speech and Language Technologies, Beijing National Research Center for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, China
*Corresponding author: Thomas Fang Zheng. Email: fzheng@tsinghua.edu.cn

Abstract

The security of automatic speaker verification (ASV) systems is gaining increasing attention. Replay attacks, one of the most common spoofing methods, are easy to implement but difficult to detect. Many researchers have focused on designing features that capture the distortion introduced by replay. Constant-Q cepstral coefficients (CQCC), based on the magnitude of the constant-Q transform (CQT), are among the most prominent features in replay detection. However, CQCC ignores phase information, which may also be distorted during replay. In this work, we propose a CQT-based modified group delay feature (CQTMGD) that captures the phase information of the CQT. Furthermore, we propose a multi-branch residual convolutional network, ResNeWt, to distinguish replay attacks from bona fide attempts. We evaluated our proposal on the ASVspoof 2019 physical access dataset. The results show that CQTMGD outperformed the traditional MGD feature, and that fusion with other magnitude-based and phase-based features brought a further improvement. Our best fusion system achieved a minimum tandem detection cost function (min-tDCF) of 0.0096 and an equal error rate (EER) of 0.39% on the evaluation set, outperforming all other state-of-the-art methods in the ASVspoof 2019 physical access challenge.
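
The EER quoted in the abstract is the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of how such a figure could be computed from detection scores. It assumes that higher scores indicate bona fide speech; the function name `equal_error_rate` and the toy scores are hypothetical, and this is not the official ASVspoof scoring tool.

```python
import numpy as np


def equal_error_rate(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for a detector where higher score = bona fide.

    Hypothetical helper for illustration only.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Every observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # False rejection rate: bona fide trials scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # Pick the threshold where the two error rates are closest and
    # report their average as the EER estimate.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


# Toy scores (made up): one bona fide trial and one replay trial are misranked.
eer, thr = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])
print(f"EER = {eer:.2%} at threshold {thr}")
```

With real trial lists the candidate thresholds are much denser, so the two error curves usually cross almost exactly; the averaging step only matters for small score sets like this one.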

Information

Type
Original Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Author(s), 2020 published by Cambridge University Press

Algorithm 1. CQTMGD extraction method.

Table 1. The overall architecture of ResNeWt18. The shape of a residual block [27] is given inside the brackets, and the number of stacked blocks in a stage is given outside the brackets. “C=32” denotes grouped convolution [28] with 32 groups. “2-d fc” denotes a fully connected layer with 2 units.
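
To illustrate what the “C=32” grouped convolution in this caption implies for model size, the sketch below counts the weights of a convolution layer with and without grouping. The channel widths and kernel size are hypothetical and are not taken from Table 1; the function name `conv_weight_count` is likewise an assumption made for this example.

```python
def conv_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of weights (bias excluded) in a 2-D convolution layer.

    With grouped convolution, each output channel only sees
    in_channels / groups input channels.
    """
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    k_h, k_w = kernel_size
    return (in_channels // groups) * out_channels * k_h * k_w


# Hypothetical layer: 128 -> 128 channels with a 3x3 kernel.
dense = conv_weight_count(128, 128, (3, 3), groups=1)
grouped = conv_weight_count(128, 128, (3, 3), groups=32)
print(dense, grouped, dense // grouped)
```

For the same channel width, splitting the layer into 32 groups reduces the weight count by a factor of 32, which is what lets a multi-branch block be wider at a comparable parameter budget.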

Table 2. A summary of the ASVspoof 2019 physical access dataset [31].

Fig. 1. An illustration of the simulation processes in the ASVspoof 2019 physical access dataset (adapted from [31]).

Table 3. Performance in the ASVspoof 2019 physical access development set with different features using ResNeWt18 as the classifier.

Fig. 2. The correlation between the decision scores of the systems using different features in the ASVspoof 2019 physical access development set. A|B denotes the concatenation of features A and B along the frequency axis.
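
The A|B notation and the score correlation in this caption can be sketched as follows. This is an illustration under stated assumptions: features are taken to be arrays of shape (frequency bins, frames), the correlation is taken to be the Pearson coefficient (the caption does not say which measure is used), and the helper names and array sizes are hypothetical.

```python
import numpy as np


def concat_along_frequency(feat_a, feat_b):
    """Stack two (freq_bins, frames) features along the frequency axis (A|B)."""
    if feat_a.shape[1] != feat_b.shape[1]:
        raise ValueError("features must have the same number of frames")
    return np.concatenate([feat_a, feat_b], axis=0)


def score_correlation(scores_x, scores_y):
    """Pearson correlation between the decision scores of two systems."""
    return float(np.corrcoef(scores_x, scores_y)[0, 1])


rng = np.random.default_rng(0)
feat_a = rng.standard_normal((84, 200))  # hypothetical: 84 bins, 200 frames
feat_b = rng.standard_normal((84, 200))
print(concat_along_frequency(feat_a, feat_b).shape)
```

Two systems whose scores are weakly correlated tend to make different errors, which is the usual motivation for fusing them.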

Table 4. Performance (EER%) in the ASVspoof 2019 physical access development set with different models.

Table 5. Comparison with relevant systems in the ASVspoof 2019 physical access evaluation set.

Table 6. Comparison with relevant systems in the ASVspoof 2017 V2 evaluation set.

Table 7. Contribution analysis in the ASVspoof 2019 physical access development set, compared with the best baseline system.

Table 8. Performance (EER%) analysis in the ASVspoof 2019 physical access evaluation dataset pooled by environment configurations.

Table 9. Performance (EER%) analysis in the ASVspoof 2019 physical access evaluation dataset pooled by replay configurations.

Table 10. Performance (EER%) analysis of the best ResNeWt system in the ASVspoof 2019 physical access evaluation dataset pooled by talker-to-ASV distance.

Fig. 3. The attention distribution of the ResNeWt model for the spoofing category, obtained with the class activation mapping technique [13]. Each row represents an input feature set of the ResNeWt model. Each column represents a randomly selected audio sample from the ASVspoof 2019 physical access development dataset. The filename of each sample is shown at the top of the column. The first two columns (on the left) are genuine attempts; the last two columns (on the right) are replay attacks. The green box shows that the models pay considerable attention to the lower-frequency range. Best viewed in color.

Fig. 4. The distribution of the duration of the trailing silence for various T60 values. All outliers are hidden for clarity.

Table 11. Results of trailing silence analysis in the ASVspoof 2019 physical access development set. The condition “O” means that the dataset is original, and the condition “R” means that the trailing silence is removed. The condition “X - Y” means the model is trained under condition “X” and tested under condition “Y”. The number on the left of the arrow indicates the performance in the original dataset (i.e. on condition “O - O”).

Fig. 5. The F-ratio analysis results.

Fig. 6. The detailed F-ratio analysis results in the ASVspoof 2017 V2 dataset (grouped by replay configurations).

Fig. 7. The detailed F-ratio analysis results in the ASVspoof 2019 physical access dataset (grouped by replay configurations). Attack ID: (replay device quality, attacker-to-talker distance). Environment ID: (room size, T60, talker-to-ASV distance). All the factors fall into three categories (from “a” to “c” or from “A” to “C”).