Recurring spoken term discovery in the zero-resource constraint using diagonal patterns

Published online by Cambridge University Press:  02 June 2025

Sudhakar Pandiarajan*
Affiliation:
School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, Tamilnadu, India
Sreenivasa Rao K
Affiliation:
Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, West Bengal, India
Pabitra Mitra
Affiliation:
Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, West Bengal, India
*Corresponding author: Sudhakar Pandiarajan; Email: sudhakar.p@vit.ac.in

Abstract

Spoken term discovery (STD) is challenging when a large volume of spoken content is generated without annotations. Unsupervised approaches address this challenge by computing pattern matches directly from the acoustic feature representation of the speech signal. However, such approaches produce many false alarms because of inherent speech variabilities, which degrades performance in the STD task. To overcome these challenges and improve performance, we propose a two-stage approach. First, we identify an acoustic feature representation that emphasizes spoken content irrespective of this variability. Second, we employ the proposed diagonal pattern search to capture spoken term matches in an unsupervised way, without any transcriptions. Validated on the Microsoft Speech Corpus for Low-Resource Languages, the proposed approach achieves an 18% gain in hit ratio and a 37% reduction in false alarm ratio compared with state-of-the-art methods.
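
The abstract does not spell out how the diagonal pattern search works, so the following is only a generic sketch of the underlying idea, not the authors' method: when two utterances contain the same term, the frame-level similarity matrix tends to show a run of high values along a diagonal. The function names and the parameters `sim_threshold` and `min_length` are hypothetical, and cosine similarity is an assumed choice of measure.

```python
import math

def cosine_similarity_matrix(x, y):
    """Cosine similarity between every frame of x and every frame of y.

    x and y are lists of feature vectors (one vector per frame).
    """
    def unit(v):
        norm = math.sqrt(sum(c * c for c in v)) or 1.0  # guard against zero vectors
        return [c / norm for c in v]

    xu = [unit(v) for v in x]
    yu = [unit(v) for v in y]
    return [[sum(a * b for a, b in zip(u, w)) for w in yu] for u in xu]

def diagonal_matches(sim, sim_threshold=0.8, min_length=3):
    """Find runs of similar frames along each diagonal of sim.

    Returns (row_start, col_start, length) for every run of at least
    min_length consecutive cells whose similarity is >= sim_threshold.
    Both parameters are illustrative assumptions.
    """
    n, m = len(sim), len(sim[0])
    matches = []
    for offset in range(-(n - 1), m):           # every diagonal, lower to upper
        i, j = max(0, -offset), max(0, offset)  # first cell on this diagonal
        run = 0
        while i < n and j < m:
            if sim[i][j] >= sim_threshold:
                run += 1
            else:
                if run >= min_length:
                    matches.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= min_length:                   # run that reaches the matrix edge
            matches.append((i - run, j - run, run))
    return matches

def one_hot(k, dim=9):
    """Toy stand-in for an acoustic feature vector."""
    return [1.0 if d == k else 0.0 for d in range(dim)]

# Frames 2..4 of x reappear as frames 1..3 of y.
x = [one_hot(k) for k in (0, 1, 2, 3, 4, 5)]
y = [one_hot(k) for k in (6, 2, 3, 4, 7, 8)]
matches = diagonal_matches(cosine_similarity_matrix(x, y))
print(matches)  # [(2, 1, 3)]
```

A strict diagonal run assumes both occurrences are spoken at the same rate; real speech varies in tempo, which is one reason such a search needs care in practice.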

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press
Figure 1. Spoken term match detected by the proposed approach. (a) and (b) show the upper and lower diagonal costs computed from the similarity matrix, respectively. (c) highlights the matched regions with red rectangles.

Figure 2. (a) and (b) show the cost matrix and diagonal cost for the proposed approach. (c) and (d) show the cost matrix and cost path for subsequence DTW. Similarly, (e) and (f) show the cost matrix and cost path for the segmental DTW approach.

Table 1. Details of the Microsoft Low-Resource Language corpus. # Docs. is the number of spoken documents; # Speakers is the number of speakers.

Figure 3. Match propagation in the similarity matrix as the thresholds are varied over $0.5 \le \eta \le 1$ and $7 \le \lambda \le 15$.

Figure 4. Relationship between the true positive rate (TPR) and the false positive rate (FPR) obtained with the proposed approach by varying $\eta$ and $\lambda$.

Figure 5. Performance comparison across methods and features.

Table 2. Performance of the STD task using the RASTA-PLP representation. # Matches$_{act}$ represents the ground-truth matches.

Table 3. Performance of the STD task using the Wav2vec representation. # Matches$_{act}$ represents the ground-truth matches.

Table 4. Performance of the STD task using the Mel-Specnorm representation. # Matches$_{act}$ represents the ground-truth matches.

Table 5. Performance of the proposed approach in comparison with the state-of-the-art systems
