Error bounds on multivariate Normal approximations for word count statistics

Published online by Cambridge University Press: 01 July 2016

Haiyan Huang*
Affiliation: University of Southern California
* Current address: Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA. Email address: hhuang@hsph.harvard.edu

Abstract

Given a sequence S and a collection Ω of d words, it is of interest in many applications to characterize the multivariate distribution of the vector of counts U = (N(S, w_1), …, N(S, w_d)), where N(S, w) is the number of times a word w ∈ Ω appears in the sequence S. We obtain an explicit bound on the error made when approximating the multivariate distribution of U by the normal distribution, when the underlying sequence is i.i.d. or first-order stationary Markov over a finite alphabet. When the limiting covariance matrix of U is nonsingular, the error bounds decay at rate O((log n) / √n) in the i.i.d. case and O((log n)^3 / √n) in the Markov case. In order for U to have a nondegenerate covariance matrix, it is necessary and sufficient that the counted word set Ω is not full, that is, that Ω is not the collection of all possible words of some length k over the given finite alphabet. To supply the bounds on the error, we use a version of Stein's method.
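As a rough illustration of the count vector U and of why a full word set leads to degeneracy, the following Python sketch may help. It is not taken from the paper: the function names (`count_word`, `count_vector`, `full_word_set`) are hypothetical, and it assumes that N(S, w) counts overlapping occurrences in a non-cyclic sequence. Under that assumption, the counts over all words of length k always sum to n − k + 1, so the components of U satisfy a deterministic linear relation and their covariance matrix cannot be nonsingular.

```python
import itertools
import random


def count_word(seq, word):
    """Count occurrences of `word` in `seq`, allowing overlaps (assumed convention)."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)


def count_vector(seq, words):
    """Return U = (N(S, w_1), ..., N(S, w_d)) for the given word collection."""
    return [count_word(seq, w) for w in words]


def full_word_set(alphabet, k):
    """All possible words of length k over `alphabet` (the 'full' case)."""
    return ["".join(p) for p in itertools.product(alphabet, repeat=k)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is reproducible
    alphabet = "ACGT"
    n, k = 200, 2

    # An i.i.d. uniform sequence over the alphabet (one of the two settings in the abstract).
    seq = "".join(rng.choice(alphabet) for _ in range(n))

    # A word set that is not full: only some of the 16 possible 2-letter words.
    partial = ["AC", "GT", "AA"]
    print("U (partial set):", count_vector(seq, partial))

    # The full set: the counts are tied together by a fixed sum, n - k + 1.
    full = full_word_set(alphabet, k)
    total = sum(count_vector(seq, full))
    print("sum over full set:", total, "expected:", n - k + 1)
```

The fixed sum only illustrates the necessity direction of the nondegeneracy condition; the sufficiency direction and the error bounds themselves rest on the paper's Stein's method argument, which this sketch does not attempt to reproduce.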

Information

Type
General Applied Probability
Copyright
Copyright © Applied Probability Trust 2002 
