
Poisson mixtures

Published online by Cambridge University Press:  12 September 2008

Kenneth W. Church
Affiliation:
AT&T Bell Laboratories, Murray Hill, NJ 07974, USA. e-mail: kwc@research.att.com
William A. Gale
Affiliation:
AT&T Bell Laboratories, Murray Hill, NJ 07974, USA. e-mail: kwc@research.att.com

Abstract

Shannon (1948) showed that a wide range of practical problems can be reduced to the problem of estimating probability distributions of words and ngrams in text. It has become standard practice in text compression, speech recognition, information retrieval and many other applications of Shannon's theory to introduce a “bag-of-words” assumption. But obviously, word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph. The proposed Poisson mixture captures much of this heterogeneous structure by allowing the Poisson parameter θ to vary over documents subject to a density function φ. φ is intended to capture dependencies on hidden variables such as genre, author, topic, etc. (The Negative Binomial is a well-known special case where φ is a Γ distribution.) Poisson mixtures fit the data better than standard Poissons, producing more accurate estimates of the variance over documents (σ²), entropy (H), inverse document frequency (IDF), and adaptation (Pr(x ≥ 2 | x ≥ 1)).
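As a rough illustration of the contrast the abstract draws, the following minimal sketch compares a single Poisson with the Negative Binomial special case (a Poisson whose rate θ follows a Γ density) at the same mean rate. The function names, the mean of 0.5 occurrences per document, the shape value 0.25, and the definition of IDF as −log₂ Pr(x ≥ 1) are assumptions chosen for illustration; they are not taken from the paper.

```python
import math


def poisson_pmf(x, theta):
    """Pr(x) for a Poisson with rate theta, computed in log space."""
    return math.exp(-theta + x * math.log(theta) - math.lgamma(x + 1))


def negbin_pmf(x, mean, shape):
    """Pr(x) for a Poisson whose rate is Gamma-distributed.

    The Gamma mixing density has the given mean and shape, which yields a
    Negative Binomial with success probability p = shape / (shape + mean).
    """
    p = shape / (shape + mean)
    log_coeff = math.lgamma(shape + x) - math.lgamma(shape) - math.lgamma(x + 1)
    return math.exp(log_coeff + shape * math.log(p) + x * math.log1p(-p))


def summarize(p0, p1, variance):
    """Variance, IDF and adaptation from Pr(x = 0) and Pr(x = 1)."""
    p_ge1 = 1.0 - p0          # fraction of documents containing the word
    p_ge2 = 1.0 - p0 - p1     # fraction containing it at least twice
    return {
        "variance": variance,
        "idf": -math.log2(p_ge1),       # assumed definition: -log2 Pr(x >= 1)
        "adaptation": p_ge2 / p_ge1,    # Pr(x >= 2 | x >= 1)
    }


MEAN = 0.5    # hypothetical average occurrences per document
SHAPE = 0.25  # hypothetical Gamma shape; smaller means burstier

poisson_stats = summarize(
    poisson_pmf(0, MEAN), poisson_pmf(1, MEAN), variance=MEAN
)
mixture_stats = summarize(
    negbin_pmf(0, MEAN, SHAPE),
    negbin_pmf(1, MEAN, SHAPE),
    variance=MEAN + MEAN ** 2 / SHAPE,  # Negative Binomial variance
)

for name, stats in (("Poisson", poisson_stats), ("Mixture", mixture_stats)):
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

With the same mean, the mixture places more mass on both zero and on repeated occurrences, so its variance, IDF, and adaptation all come out higher than the single Poisson's. That is the direction of the correction the abstract describes, under the assumptions stated above.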

Information

Type
Articles
Copyright
Copyright © Cambridge University Press 1995
