
An efficient classification algorithm for NGS data based on text similarity

Published online by Cambridge University Press: 17 September 2018

Xiangyu Liao
Affiliation:
Department of Oncology, The First College of Clinical Medical Science, China Three Gorges University, Yichang Central People's Hospital, Yichang, Hubei 443000, P.R. China
Xingyu Liao
Affiliation:
School of Information Science and Engineering, Central South University, Changsha, Hunan 410083, P.R. China
Wufei Zhu*
Affiliation:
Department of Endocrinology, The First College of Clinical Medical Science, China Three Gorges University, Yichang Central People's Hospital, Yichang, Hubei 443000, P.R. China
Lu Fang
Affiliation:
Department of Endocrinology, The First College of Clinical Medical Science, China Three Gorges University, Yichang Central People's Hospital, Yichang, Hubei 443000, P.R. China
Xing Chen
Affiliation:
Department of Endocrinology, The First College of Clinical Medical Science, China Three Gorges University, Yichang Central People's Hospital, Yichang, Hubei 443000, P.R. China
*Author for correspondence: Wufei Zhu, E-mail: zhuwufei@aliyun.com

Abstract

With the advancement of high-throughput sequencing technologies, the amount of available sequencing data is growing at a pace that has begun to seriously challenge the data processing and storage capacities of modern computer systems. Removing redundancy from such data by clustering can be crucial for reducing memory use, disk space and running time. Clustering also helps to reduce dataset noise in some analysis applications. In this study, we propose a high-performance short sequence classification algorithm (HSC) for next generation sequencing (NGS) data, based on an efficient hash function and text similarity. First, HSC converts all reads into k-mers and forms a unique k-mer set by merging duplicated and reverse complementary elements. Second, all unique k-mers are stored in a hash table, where the k-mer string is stored in the key field and the IDs of the reads containing that k-mer are stored in the value field. Third, each hash unit is transformed into a short text consisting of reads. Fourth, texts that satisfy the similarity threshold are combined into a long text; this merge operation is repeated until no remaining texts satisfy the merge condition. Finally, each long text is transformed into a cluster consisting of reads. We tested HSC on five real datasets. The experimental results showed that HSC clusters 100 million short reads within 2 hours and performs very well in reducing memory consumption. HSC is much faster than existing tools and can easily handle tens of millions of sequences. In addition, when HSC is used as a preprocessing tool to produce assembly data, the memory and time consumption of the assembler is greatly reduced, and the assembler achieves better assemblies in terms of N50, NA50 and genome fraction.
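
The five steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the value of k or the threshold, so Jaccard similarity over read-ID sets, `k=4` and `threshold=0.5` are assumptions, and all function names are hypothetical. The merge loop is quadratic and is meant only to show the idea on toy input.

```python
from collections import defaultdict

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement, so that
    both orientations map to the same hash key."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def build_kmer_table(reads, k):
    """Steps 1-2: map each unique canonical k-mer (key) to the set of
    IDs of the reads that contain it (value)."""
    table = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            table[canonical(seq[i:i + k])].add(read_id)
    return table


def jaccard(a, b):
    """Assumed similarity measure: shared read IDs over all read IDs."""
    return len(a & b) / len(a | b)


def merge_texts(texts, threshold):
    """Step 4: repeatedly merge any two texts whose similarity meets the
    threshold, until no pair satisfies the merge condition."""
    texts = [set(t) for t in texts]
    merged = True
    while merged:
        merged = False
        for i in range(len(texts)):
            for j in range(i + 1, len(texts)):
                if jaccard(texts[i], texts[j]) >= threshold:
                    texts[i] |= texts[j]
                    del texts[j]
                    merged = True
                    break
            if merged:
                break
    return texts


def cluster_reads(reads, k=4, threshold=0.5):
    """Steps 3 and 5: treat each hash unit's read-ID set as a short text,
    merge similar texts, and return the resulting clusters of read IDs."""
    table = build_kmer_table(reads, k)
    return merge_texts(table.values(), threshold)


if __name__ == "__main__":
    toy_reads = ["AAAACC", "AAACCC", "GTGTGT"]
    for cluster in cluster_reads(toy_reads):
        print(sorted(cluster))
```

In this toy input the first two reads share the k-mers `AAAC` and `AACC`, so their hash units merge into one cluster, while the third read shares no k-mer with them and stays separate.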

Information

Type
Research Paper
Copyright
Copyright © Cambridge University Press 2018 

Fig. 1. The illustration of the pipeline of HSC.

Fig. 2. The principle of converting a hash unit to a short text.

Table 1. Details of datasets.

Fig. 3. Classification performance of HSC on simulated data. The y-axis represents the position on the reference, and the x-axis represents the ID of the cluster.

Table 2. The classification results of HSC on lib1.

Table 3. The classification results of HSC on lib2.

Table 4. The classification results of HSC on lib3.

Table 5. The classification results of HSC on lib4.