An information-theoretic, vector-space-model approach to cross-language information retrieval*

PETER A. CHEW; BRETT W. BADER; STEPHEN HELMREICH; AHMED ABDELALI; STEPHEN J. VERZI

doi:10.1017/S1351324910000185

Published online by Cambridge University Press: 05 January 2011

Affiliations

PETER A. CHEW: Moss Adams LLP, Albuquerque, NM 87110-4189, USA. E-mail: PeterAChew@aol.com
BRETT W. BADER and STEPHEN J. VERZI: Sandia National Laboratories, Albuquerque, NM 87185-0519, USA. E-mails: bwbader@sandia.gov, sjverzi@sandia.gov
STEPHEN HELMREICH and AHMED ABDELALI: New Mexico State University, New Mexico, 88003-8001, USA. E-mails: helmreich@zianet.com, aabdelal@nmsu.edu

Abstract

In this article, we demonstrate several novel ways in which insights from information theory (IT) and computational linguistics (CL) can be woven into a vector-space-model (VSM) approach to information retrieval (IR). Our proposals focus, essentially, on three areas: pre-processing (morphological analysis), term weighting, and alternative geometrical models to the widely used term-by-document matrix. The latter include (1) PARAFAC2 decomposition of a term-by-document-by-language tensor, and (2) eigenvalue decomposition of a term-by-term matrix (inspired by Statistical Machine Translation). We evaluate all proposals, comparing them to a ‘standard’ approach based on Latent Semantic Analysis, on a multilingual document clustering task. The evidence suggests that proper consideration of IT within IR is indeed called for: in all cases, our best results are achieved using the information-theoretic variations upon the standard approach. Furthermore, we show that different information-theoretic options can be combined for still better results. A key function of language is to encode and convey information, and contributions of IT to the field of CL can be traced back a number of decades. We think that our proposals help bring IR and CL more into line with one another. In our conclusion, we suggest that the fact that our proposals yield empirical improvements is not coincidental given that they increase the theoretical transparency of VSM approaches to IR; on the contrary, they help shed light on why aspects of these approaches work as they do.
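
For orientation, the 'standard' baseline named in the abstract (Latent Semantic Analysis over a weighted term-by-document matrix) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the toy monolingual corpus, the choice of log-entropy as the information-theoretic weighting, the rank of 2, and all function names are assumptions made for this example only.

```python
import numpy as np


def term_document_matrix(docs):
    """Return (vocabulary, raw counts) with terms as rows and documents as columns."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for tok in doc.lower().split():
            counts[index[tok], j] += 1
    return vocab, counts


def log_entropy_weight(counts):
    """Log-entropy weighting (an assumed scheme, not taken from the article).

    Local weight: log(1 + tf). Global weight: 1 + sum_j p_ij log p_ij / log n,
    which is 0 for a term spread evenly over all n documents and 1 for a term
    confined to a single document. Requires n > 1 and no all-zero rows.
    """
    n_docs = counts.shape[1]
    p = counts / counts.sum(axis=1, keepdims=True)
    # Treat 0 * log 0 as 0 without triggering log(0) warnings.
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    global_w = 1.0 + plogp.sum(axis=1) / np.log(n_docs)
    return np.log1p(counts) * global_w[:, None]


def lsa_document_vectors(weighted, rank):
    """Project documents into a rank-dimensional space via truncated SVD."""
    _, s, vt = np.linalg.svd(weighted, full_matrices=False)
    return (np.diag(s[:rank]) @ vt[:rank]).T  # shape: (n_docs, rank)


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between row vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T


# Hypothetical toy corpus: two documents about animals, two about markets.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the stock market fell sharply",
    "the market rallied as stock prices rose",
]
vocab, counts = term_document_matrix(docs)
weighted = log_entropy_weight(counts)
sim = cosine_similarity_matrix(lsa_document_vectors(weighted, rank=2))
print(np.round(sim, 2))
```

In this sketch the weighting step suppresses terms that are spread across every document (such as "the"), so the truncated SVD is driven by the terms that actually discriminate between topics; documents can then be clustered by their pairwise cosine similarities.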

Information

Type: Papers
Information: Natural Language Engineering, Volume 17, Issue 1, January 2011, pp. 37–70
DOI: https://doi.org/10.1017/S1351324910000185
Copyright: Copyright © Cambridge University Press 2011
