
Evaluating vector space models with canonical correlation analysis

Published online by Cambridge University Press:  20 September 2011

SAMI VIRPIOJA
Affiliation:
Department of Information and Computer Science, Aalto University School of Science, P.O. Box 15400, FI-00076 Aalto, Finland e-mails: sami.virpioja@tkk.fi, mari-sanna.paukkeri@tkk.fi, tiina.lindh-knuutila@tkk.fi, krista.lagus@tkk.fi
MARI-SANNA PAUKKERI
Affiliation:
Department of Information and Computer Science, Aalto University School of Science, P.O. Box 15400, FI-00076 Aalto, Finland e-mails: sami.virpioja@tkk.fi, mari-sanna.paukkeri@tkk.fi, tiina.lindh-knuutila@tkk.fi, krista.lagus@tkk.fi
ABHISHEK TRIPATHI
Affiliation:
Department of Computer Science, University of Helsinki, Finland, and Xerox Research Centre Europe (XRCE), 6, Chemin de Maupertuis, 38240, Meylan, France e-mail: abhishektripathi.at@gmail.com
TIINA LINDH-KNUUTILA
Affiliation:
Department of Information and Computer Science, Aalto University School of Science, P.O. Box 15400, FI-00076 Aalto, Finland e-mails: sami.virpioja@tkk.fi, mari-sanna.paukkeri@tkk.fi, tiina.lindh-knuutila@tkk.fi, krista.lagus@tkk.fi
KRISTA LAGUS
Affiliation:
Department of Information and Computer Science, Aalto University School of Science, P.O. Box 15400, FI-00076 Aalto, Finland e-mails: sami.virpioja@tkk.fi, mari-sanna.paukkeri@tkk.fi, tiina.lindh-knuutila@tkk.fi, krista.lagus@tkk.fi

Abstract

Vector space models are used in language processing applications for calculating semantic similarities of words or documents. The vector spaces are generated with feature extraction methods for text data. However, evaluating the feature extraction methods can be difficult. Indirect evaluation in an application is often time-consuming and the results may not generalize to other applications, whereas direct evaluations that measure the amount of captured semantic information usually require human evaluators or annotated data sets. We propose a novel direct evaluation method based on canonical correlation analysis (CCA), the classical method for finding linear relationships between two data sets. In our setting, the two sets are parallel text documents in two languages. A good feature extraction method should provide representations that reflect the semantic content of the documents. Assuming that the underlying semantic content is independent of the language, we can study which feature extraction methods capture the content best by measuring the dependence between the representations of a document and its translation. In the case of CCA, the applied measure of dependence is correlation. The evaluation method is based on unsupervised learning, is language- and domain-independent, and requires no resources besides a parallel corpus. In this paper, we demonstrate the evaluation method on a sentence-aligned parallel corpus. The method is validated by showing that the results obtained with bag-of-words representations are intuitive and agree well with previous findings. Moreover, we compare the proposed evaluation method with indirect evaluations in simple sentence matching tasks and with a quantitative manual evaluation of word translations. The results of the proposed method correlate well with the results of the indirect and manual evaluations.
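The core computation described above, measuring the correlation between the representations of a document and its translation, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `canonical_correlations`, the ridge term `reg`, and the synthetic data standing in for sentence features are all assumptions made for the example. Summing the correlations into a single score is likewise an illustrative choice; the abstract does not specify how the correlations are aggregated.

```python
import numpy as np


def canonical_correlations(X, Y, reg=1e-3):
    """Regularized canonical correlations between two views.

    X is (n, p) and Y is (n, q); row i of X and row i of Y describe the
    same item (for example, a sentence and its translation). Returns the
    min(p, q) correlations in descending order.
    """
    # Center each feature.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Sample covariances; the ridge term (an assumption of this sketch)
    # keeps the within-view covariances positive definite.
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    # Whiten both views with Cholesky factors: T = Lx^{-1} Cxy Ly^{-T}.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    T = np.linalg.solve(Lx, Cxy)
    T = np.linalg.solve(Ly, T.T).T

    # The singular values of T are the canonical correlations.
    return np.linalg.svd(T, compute_uv=False)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 500
    # Hypothetical data: a shared 3-dimensional "semantic content" Z is
    # observed through two different linear maps plus noise.
    Z = rng.normal(size=(n, 3))
    X = Z @ rng.normal(size=(3, 5)) + 0.05 * rng.normal(size=(n, 5))
    Y = Z @ rng.normal(size=(3, 5)) + 0.05 * rng.normal(size=(n, 5))

    aligned = canonical_correlations(X, Y)
    # Breaking the pairing between rows should destroy the dependence.
    shuffled = canonical_correlations(X, Y[rng.permutation(n)])
    print("sum of correlations, aligned pairs: ", aligned.sum())
    print("sum of correlations, shuffled pairs:", shuffled.sum())
```

In this toy setting, a higher score for the aligned pairs than for the shuffled ones reflects the idea in the abstract: representations that capture language-independent content should be strongly correlated across the two views. In practice the correlations would typically be estimated on held-out data, since CCA can overfit high-dimensional features.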

Type
Articles
Copyright
Copyright © Cambridge University Press 2011

