Word clustering and disambiguation based on co-occurrence data

Published online by Cambridge University Press: 17 June 2002

HANG LI
Affiliation:
Theory NEC Laboratory, Real World Computing Partnership, c/o Internet Systems Research Laboratories, NEC Corporation, 4-1-1 Miyazaki, Miyamae-ku, Kawasaki 216-8555, Japan. Current address: Microsoft Research Asia, 5F Sigma Center, No. 49 Zhichun Road, Haidian District, Beijing, China 100080. E-mail: hangli@microsoft.com. Home page: http://www.research.microsoft.com/users/hangli/

Abstract

We address the problem of clustering words (or constructing a thesaurus) based on co-occurrence data, and of conducting syntactic disambiguation by using the acquired word classes. We view the clustering problem as that of estimating a class-based probability distribution specifying the joint probabilities of word pairs. We propose an efficient algorithm based on the Minimum Description Length (MDL) principle for estimating such a probability model. Our clustering method is a natural extension of that proposed in Brown, Della Pietra, deSouza, Lai and Mercer (1992). We next propose a syntactic disambiguation method which combines the use of automatically constructed word classes and that of a hand-made thesaurus. The overall disambiguation accuracy achieved by our method is 88.2%, which compares favorably with the accuracies obtained by state-of-the-art disambiguation methods.
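
As an illustration of the idea of scoring a class-based joint model with MDL, the following minimal sketch computes a two-part description length for a given hard clustering of word pairs. The abstract does not state the formulas, so everything here is an assumption: the factorization P(x, y) = P(Cx, Cy) P(x | Cx) P(y | Cy), maximum-likelihood parameter estimates, the common (k/2) log N model-cost approximation, and all names (`description_length`, `pair_counts`, `left_class`, `right_class`) and toy counts are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def description_length(pair_counts, left_class, right_class):
    """Two-part code length in bits for a hard class-based joint model.

    Assumed model (not confirmed by the abstract):
        P(x, y) = P(Cx, Cy) * P(x | Cx) * P(y | Cy)
    with maximum-likelihood estimates and a (k / 2) * log2(N) model cost.
    """
    total = sum(pair_counts.values())
    left_freq, right_freq = Counter(), Counter()
    left_class_freq, right_class_freq = Counter(), Counter()
    class_pair_freq = Counter()

    # Accumulate word, class, and class-pair frequencies.
    for (x, y), count in pair_counts.items():
        cx, cy = left_class[x], right_class[y]
        left_freq[x] += count
        right_freq[y] += count
        left_class_freq[cx] += count
        right_class_freq[cy] += count
        class_pair_freq[(cx, cy)] += count

    # Data description length: negative log-likelihood of the observed pairs.
    data_bits = 0.0
    for (x, y), count in pair_counts.items():
        cx, cy = left_class[x], right_class[y]
        p = (
            (class_pair_freq[(cx, cy)] / total)
            * (left_freq[x] / left_class_freq[cx])
            * (right_freq[y] / right_class_freq[cy])
        )
        data_bits -= count * math.log2(p)

    # Model description length: half a log of the sample size per free parameter.
    k_left, k_right = len(left_class_freq), len(right_class_freq)
    n_params = (
        (k_left * k_right - 1)
        + (len(left_freq) - k_left)
        + (len(right_freq) - k_right)
    )
    model_bits = 0.5 * n_params * math.log2(total)
    return data_bits + model_bits


# Hypothetical toy co-occurrence counts of (noun, verb) pairs.
pair_counts = {
    ("wine", "drink"): 4,
    ("beer", "drink"): 4,
    ("bread", "eat"): 4,
    ("rice", "eat"): 4,
}
verb_class = {"drink": "drink", "eat": "eat"}

# Each noun in its own class versus nouns merged into two classes.
singleton_nouns = {w: w for w in ("wine", "beer", "bread", "rice")}
merged_nouns = {"wine": "beverage", "beer": "beverage",
                "bread": "food", "rice": "food"}

singleton_bits = description_length(pair_counts, singleton_nouns, verb_class)
merged_bits = description_length(pair_counts, merged_nouns, verb_class)
print(singleton_bits, merged_bits)
```

On this toy data the two clusterings fit the observations equally well, but the merged one needs fewer parameters, so its total description length is shorter (42 bits against 46). A clustering procedure in this spirit would prefer merges that reduce the total; how the paper's algorithm searches over merges is not described in the abstract and is not modeled here.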

Type
Research Article
Copyright
© 2002 Cambridge University Press

Footnotes

A previous version of this paper appeared in COLING-ACL'98 (Li and Abe 1998b).