An information-theoretic, vector-space-model approach to cross-language information retrieval


In this article, we demonstrate several novel ways in which insights from information theory (IT) and computational linguistics (CL) can be woven into a vector-space-model (VSM) approach to information retrieval (IR). Our proposals focus, essentially, on three areas: pre-processing (morphological analysis), term weighting, and alternative geometrical models to the widely used term-by-document matrix. The latter include (1) PARAFAC2 decomposition of a term-by-document-by-language tensor, and (2) eigenvalue decomposition of a term-by-term matrix (inspired by Statistical Machine Translation). We evaluate all proposals, comparing them to a ‘standard’ approach based on Latent Semantic Analysis, on a multilingual document clustering task. The evidence suggests that proper consideration of IT within IR is indeed called for: in all cases, our best results are achieved using the information-theoretic variations upon the standard approach. Furthermore, we show that different information-theoretic options can be combined for still better results. A key function of language is to encode and convey information, and contributions of IT to the field of CL can be traced back a number of decades. We think that our proposals help bring IR and CL more into line with one another. In our conclusion, we suggest that the fact that our proposals yield empirical improvements is not coincidental given that they increase the theoretical transparency of VSM approaches to IR; on the contrary, they help shed light on why aspects of these approaches work as they do.
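
The abstract names term weighting as one of the areas where information theory enters the VSM pipeline, but it does not give formulas. Purely as an illustration, the sketch below computes log-entropy weighting, an entropy-based scheme commonly used with Latent Semantic Analysis. It is not necessarily the weighting the article proposes; the function name `log_entropy_weights` and the toy counts are hypothetical.

```python
import math


def log_entropy_weights(counts):
    """Apply log-entropy weighting to a term-by-document count matrix.

    counts: list of rows, one per term; each row holds that term's raw
    frequency in every document. (Hypothetical helper for illustration.)
    """
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        total = sum(row)
        # Entropy of the term's distribution over documents (natural log).
        entropy = 0.0
        for tf in row:
            if tf > 0:
                p = tf / total
                entropy -= p * math.log(p)
        # Global weight is 1 when the term occurs in a single document and
        # 0 when it is spread uniformly over all documents.
        if total > 0 and n_docs > 1:
            global_w = 1.0 - entropy / math.log(n_docs)
        else:
            global_w = 0.0
        # Local weight dampens raw frequency with log(1 + tf).
        weighted.append([global_w * math.log(1 + tf) for tf in row])
    return weighted


# Toy matrix: term 0 is uniform across documents, term 1 is concentrated.
toy_counts = [[2, 2, 2, 2], [4, 0, 0, 0]]
toy_weights = log_entropy_weights(toy_counts)
```

In this toy case the uniformly distributed term receives a weight near zero everywhere, while the concentrated term keeps its full local weight; this is the sense in which an entropy-based weighting rewards terms that carry more information about which document they came from.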

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering