Asymptotic Behavior of k-Word Matches Between two Uniformly Distributed Sequences

M. R. Kantorovitz; H. S. Booth; C. J. Burden; S. R. Wilson

doi:10.1239/jap/1189717545


Part of: Distribution theory Limit theorems

Published online by Cambridge University Press: 14 July 2016


M. R. Kantorovitz∗
Affiliation: Australian National University and University of Illinois
H. S. Booth∗∗
Affiliation: Australian National University
C. J. Burden∗∗∗
Affiliation: Australian National University
S. R. Wilson∗∗∗
Affiliation: Australian National University

∗Postal address: Department of Mathematics, University of Illinois, Urbana, IL 61801, USA. Email address: ruth@math.uiuc.edu
∗∗H. S. Booth died 26 May 2005.
∗∗∗Postal address: Mathematical Sciences Institute, Australian National University, Canberra, ACT 0200, Australia.


Abstract


Given two sequences of length n over a finite alphabet A of size |A| = d, the D2 statistic is the number of k-letter word matches between the two sequences. This statistic is used in bioinformatics for EST sequence database searches. Under the assumption of independent and identically distributed letters in the sequences, Lippert, Huang and Waterman (2002) raised questions about the asymptotic behavior of D2 when the alphabet is uniformly distributed. They expressed a concern that the commonly assumed normality may create errors in estimating significance. In this paper we answer those questions. Using Stein's method, we show that, for large enough k, the D2 statistic is approximately normal as n gets large. When k = 1, we prove that, for large enough d, the D2 statistic is approximately normal as n gets large. We also give a formula for the variance of D2 in the uniform case.
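As an illustration of the statistic described above, the following is a minimal sketch of how D2 could be computed for two strings. It relies on the observation that counting matching pairs of k-letter words is the same as summing, over all words, the product of that word's counts in the two sequences. The function names `kword_counts` and `d2` are hypothetical and do not come from the paper, and overlapping words are counted at every start position, which is an assumption about the intended definition.

```python
from collections import Counter


def kword_counts(seq, k):
    """Count every k-letter word in seq, one word per start position (overlaps included)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of (i, j) pairs whose k-letter words in seq_a and seq_b are identical.

    Illustrative sketch only: names and conventions are assumptions, not the authors' code.
    """
    counts_a = kword_counts(seq_a, k)
    counts_b = kword_counts(seq_b, k)
    # A word seen a times in seq_a and b times in seq_b contributes a * b matching pairs.
    return sum(count * counts_b[word] for word, count in counts_a.items())


if __name__ == "__main__":
    print(d2("ACGTACGT", "TTACGTAA", 3))
```

Counting words once per sequence and then multiplying avoids comparing every pair of positions directly, so the work grows with the sequence lengths rather than with their product.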

Keywords

Stein's method; count vector; k-word matches; sequence comparison

MSC classification

Secondary: 62E20 (Asymptotic distribution theory); 92D20 (Protein sequences, DNA sequences); 60F99 (None of the above, but in this section)

Type: Research Article
Information: Journal of Applied Probability, Volume 44, Issue 3, September 2007, pp. 788–805

DOI: https://doi.org/10.1239/jap/1189717545
Copyright: Copyright © Applied Probability Trust 2007

References

[1] Barbour, A. and Chryssaphinou, O. (2001). Compound Poisson approximation: a user's guide. Ann. Appl. Prob. 11, 964–1002.

[2] Billingsley, P. (1995). Probability and Measure, 3rd edn. John Wiley, New York.

[3] Burke, J., Davison, D. and Hide, W. (1999). d2 cluster: a validated method for clustering EST and full-length cDNA sequences. Genome Res. 9, 1135–1142.

[4] Carpenter, J. E., Christoffels, A., Weinbach, Y. and Hide, W. A. (2002). Assessment of the parallelization approach of d2 cluster for high-performance sequence clustering. J. Comput. Chem. 23, 755–757.

[5] Chen, L. H. Y. (1975). Poisson approximation for dependent trials. Ann. Prob. 3, 534–545.

[6] Christoffels, A. et al. (2001). STACK: sequence tag alignment and consensus knowledgebase. Nucleic Acids Res. 29, 234–238.

[7] Dembo, A. and Rinott, Y. (1996). Some examples of normal approximations by Stein's method. In Random Discrete Structures (IMA Vol. Math. Appl. 76), Springer, New York, pp. 25–44.

[8] Johnson, N. L. and Kotz, S. (1970). Distributions in Statistics. Continuous Univariate Distributions. 1. Houghton Mifflin Co., Boston, MA.

[9] Lippert, R. A., Huang, H. and Waterman, M. S. (2002). Distributional regimes for the number of k-word matches between two random sequences. Proc. Nat. Acad. Sci. USA 99, 13980–13989.

[10] Miller, R. T. et al. (1999). A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base. Genome Res. 9, 1143–1155.

[11] Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. J. Molec. Biol. 147, 195–197.

[12] Stein, C. (1972). A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. Sixth Berkeley Symp. Math. Statist. Prob., Vol. II, University of California Press, Berkeley, pp. 583–602.

[13] Stein, C. (1986). Approximate Computation of Expectations. Institute of Mathematical Statistics, Hayward, CA.

[14] Vinga, S. and Almeida, J. S. (2003). Alignment-free sequence comparison – a review. Bioinformatics 19, 513–523.

[15] Waterman, M. S. (1995). Introduction to Computational Biology. Chapman & Hall, New York.

[16] Zhang, Y. X. et al. (2002). Genome shuffling leads to rapid phenotypic improvement in bacteria. Nature 415, 644–646.
