Word clustering and disambiguation based on co-occurrence data

HANG LI

doi:10.1017/S1351324902002838

Abstract

We address the problem of clustering words (or constructing a thesaurus) based on co-occurrence data, and conducting syntactic disambiguation by using the acquired word classes. We view the clustering problem as that of estimating a class-based probability distribution specifying the joint probabilities of word pairs. We propose an efficient algorithm based on the Minimum Description Length (MDL) principle for estimating such a probability model. Our clustering method is a natural extension of that proposed in Brown, Della Pietra, deSouza, Lai and Mercer (1992). We next propose a syntactic disambiguation method which combines the use of automatically constructed word classes and that of a hand-made thesaurus. The overall disambiguation accuracy achieved by our method is 88.2%, which compares favorably against the accuracies obtained by the state-of-the-art disambiguation methods.
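The abstract does not spell out the form of the class-based model or the MDL criterion, so the following is only a minimal sketch under stated assumptions. It assumes a hard-clustering factorization P(n, v) = P(Cn, Cv) · P(n | Cn) · P(v | Cv) with maximum-likelihood estimates, and a generic two-part code that charges (k/2) log2 N bits for k free parameters given N observed pairs. The function name `description_length`, the toy data, and the parameter count are illustrative choices, not taken from the paper.

```python
import math
from collections import Counter


def description_length(pairs, noun_class, verb_class):
    """Two-part description length in bits (assumed form, see lead-in)."""
    total = len(pairs)
    noun_freq = Counter(n for n, _ in pairs)
    verb_freq = Counter(v for _, v in pairs)
    class_pair_freq = Counter((noun_class[n], verb_class[v]) for n, v in pairs)

    # Aggregate word counts into class counts.
    noun_class_freq = Counter()
    for n, count in noun_freq.items():
        noun_class_freq[noun_class[n]] += count
    verb_class_freq = Counter()
    for v, count in verb_freq.items():
        verb_class_freq[verb_class[v]] += count

    # Data cost: negative log-likelihood under the assumed factorization.
    data_bits = 0.0
    for n, v in pairs:
        cn, cv = noun_class[n], verb_class[v]
        p_classes = class_pair_freq[(cn, cv)] / total
        p_noun = noun_freq[n] / noun_class_freq[cn]
        p_verb = verb_freq[v] / verb_class_freq[cv]
        data_bits -= math.log2(p_classes * p_noun * p_verb)

    # Parameter cost: free parameters of the joint class distribution plus
    # the within-class word distributions (assumed count).
    kn, kv = len(noun_class_freq), len(verb_class_freq)
    num_params = (kn * kv - 1) + (len(noun_freq) - kn) + (len(verb_freq) - kv)
    param_bits = 0.5 * num_params * math.log2(total)

    return param_bits + data_bits


# Hypothetical toy data: (noun, verb) co-occurrences.
pairs = [("wine", "drink"), ("beer", "drink"),
         ("bread", "eat"), ("rice", "eat")] * 5
verbs = {"drink": "drink", "eat": "eat"}
singletons = {"wine": "wine", "beer": "beer", "bread": "bread", "rice": "rice"}
merged = {"wine": "beverage", "beer": "beverage",
          "bread": "food", "rice": "food"}

singleton_dl = description_length(pairs, singletons, verbs)
merged_dl = description_length(pairs, merged, verbs)
```

On this toy data the two clusterings fit the data equally well, but the merged clustering needs fewer parameters, so its total description length is shorter. Under an MDL criterion of this kind, that is the condition under which merging two classes would be preferred.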

Footnotes

A previous version of this paper appeared in COLING-ACL'98 (Li and Abe 1998b).

Crossref Citations

This article has been cited by the following publications. This list is generated based on data provided by Crossref.

Jing, Hongyan and Tzoukermann, Evelyne 1999. Information retrieval based on context distance and morphology. p. 90.

Zhao, Shaojun and Lin, Dekang 2005. Natural Language Processing – IJCNLP 2004. Vol. 3248, p. 545.

Weeds, Julie and Weir, David 2005. Co-occurrence Retrieval: A Flexible Framework for Lexical Distributional Similarity. Computational Linguistics, Vol. 31, Issue. 4, p. 439.

Jin, Peng Sun, Xu Wu, Yunfang and Yu, Shiwen 2007. Computational Linguistics and Intelligent Text Processing. Vol. 4394, p. 267.

Kandylas, Vasileios Ungar, Lyle Sandler, Ted and Jensen, Shane 2008. Multiway Clustering for Creating Biomedical Term Sets. p. 449.

Momtazi, Saeedeh and Klakow, Dietrich 2009. A word clustering approach for language model-based sentence retrieval in question answering systems. p. 1911.

Liu, Ying Nan, Wang and Zheng, Tie 2009. Spectral Clustering for Chinese Word. p. 529.

Bassiou, Nikoletta and Kotropoulos, Constantine 2011. Long distance bigram models applied to word clustering. Pattern Recognition, Vol. 44, Issue. 1, p. 145.

Marx, Zvika Dagan, Ido and Shamir, Eli 2011. Cross-partition clustering: revealing corresponding themes across related datasets. Journal of Experimental & Theoretical Artificial Intelligence, Vol. 23, Issue. 2, p. 153.

Ghayoomi, Masood 2012. Advances in Natural Language Processing. Vol. 7614, p. 126.

Goyal, Pawan Behera, Laxmidhar and McGinnity, T. M. 2013. A novel neighborhood based document smoothing model for information retrieval. Information Retrieval, Vol. 16, Issue. 3, p. 391.

Bekos, Michael A. van Dijk, Thomas C. Fink, Martin Kindermann, Philipp Kobourov, Stephen Pupyrev, Sergey Spoerhase, Joachim and Wolff, Alexander 2014. Algorithms - ESA 2014. Vol. 8737, p. 87.

Vanishree, M. and Sudha, R. 2014. A context sensitive document indexing approach for Information Retrieval. p. 395.

Esther Hannah, M. Mukherjee, Saswati and Balaramar, Sakthi 2014. A Redundancy Elimination Approach towards Summary Refinement. IERI Procedia, Vol. 10, p. 245.

Dařena, František and Žižka, Jan 2015. Modern Computational Models of Semantic Discovery in Natural Language. p. 71.

Horng, Gwo-Jiun 2015. Opportunistic Content Sharing Scheme for Distributed Network in City Environments. Wireless Personal Communications, Vol. 84, Issue. 4, p. 2327.

Mosharraf, Maedeh Taghiyareh, Fattaneh and Alaee, Sara 2017. Investigating eLearning Research Trends in Iran via Automatic Semantic Network Generation. Journal of Global Information Technology Management, Vol. 20, Issue. 2, p. 91.

Bekos, Michael A. van Dijk, Thomas C. Fink, Martin Kindermann, Philipp Kobourov, Stephen Pupyrev, Sergey Spoerhase, Joachim and Wolff, Alexander 2017. Improved Approximation Algorithms for Box Contact Representations. Algorithmica, Vol. 77, Issue. 3, p. 902.

Jia, Jingdong Mo, Hanlin Capretz, Luiz Fernando and Chen, Zupeng 2018. Grouping environmental factors influencing individual decision‐making behavior in software projects: A cluster analysis. Journal of Software: Evolution and Process, Vol. 30, Issue. 1.

Wang, Yipei Fan, Xingyu Chen, Luoxin Chang, Eric I-Chao Ananiadou, Sophia Tsujii, Junichi and Xu, Yan 2019. Mapping anatomical related entities to human body parts based on wikipedia in discharge summaries. BMC Bioinformatics, Vol. 20, Issue. 1.
