An information-theoretic, vector-space-model approach to cross-language information retrieval*


In this article, we demonstrate several novel ways in which insights from information theory (IT) and computational linguistics (CL) can be woven into a vector-space-model (VSM) approach to information retrieval (IR). Our proposals focus, essentially, on three areas: pre-processing (morphological analysis), term weighting, and alternative geometrical models to the widely used term-by-document matrix. The latter include (1) PARAFAC2 decomposition of a term-by-document-by-language tensor, and (2) eigenvalue decomposition of a term-by-term matrix (inspired by Statistical Machine Translation). We evaluate all proposals, comparing them to a ‘standard’ approach based on Latent Semantic Analysis, on a multilingual document clustering task. The evidence suggests that proper consideration of IT within IR is indeed called for: in all cases, our best results are achieved using the information-theoretic variations upon the standard approach. Furthermore, we show that different information-theoretic options can be combined for still better results. A key function of language is to encode and convey information, and contributions of IT to the field of CL can be traced back a number of decades. We think that our proposals help bring IR and CL more into line with one another. In our conclusion, we suggest that the fact that our proposals yield empirical improvements is not coincidental given that they increase the theoretical transparency of VSM approaches to IR; on the contrary, they help shed light on why aspects of these approaches work as they do.
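The 'standard' approach referred to above — Latent Semantic Analysis over a term-by-document matrix, with an information-theoretic (log-entropy) term-weighting scheme — can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy count matrix, the rank `k`, and all variable names are invented for the example, and in the article's multilingual setting the documents would be a parallel corpus with a vocabulary spanning all languages.

```python
import numpy as np

# Toy term-by-document count matrix (terms x documents).
# Counts are illustrative only.
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 3, 0, 1],
    [0, 0, 2, 2],
    [1, 0, 0, 3],
], dtype=float)

# Log-entropy weighting, a common information-theoretic scheme in LSA:
# local weight log(1 + f_ij); global weight 1 + sum_j p_ij log p_ij / log n,
# which lies in [0, 1] and downweights terms spread evenly across documents.
n_docs = A.shape[1]
row_sums = A.sum(axis=1, keepdims=True)
p = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)
plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
global_w = 1.0 + plogp.sum(axis=1) / np.log(n_docs)
W = np.log1p(A) * global_w[:, None]

# Rank-k truncated SVD: documents get k-dimensional latent representations,
# and cosine similarity in that space drives clustering/retrieval.
k = 2
U, s, Vt = np.linalg.svd(W, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one row per document

def cosine(a, b):
    """Cosine similarity between two latent document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine(doc_vecs[0], doc_vecs[1])
```

The article's alternatives replace the final step: instead of an SVD of a single term-by-document matrix, a PARAFAC2 decomposition is applied to a term-by-document-by-language tensor, or an eigenvalue decomposition to a term-by-term matrix.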

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering