
Evaluating vector space models with canonical correlation analysis*

  • SAMI VIRPIOJA (a1), MARI-SANNA PAUKKERI (a1), ABHISHEK TRIPATHI (a2), TIINA LINDH-KNUUTILA (a1) and KRISTA LAGUS (a1)...
Abstract

Vector space models are used in language processing applications for calculating semantic similarities of words or documents. The vector spaces are generated with feature extraction methods for text data. However, evaluating feature extraction methods can be difficult. Indirect evaluation in an application is often time-consuming, and the results may not generalize to other applications, whereas direct evaluations that measure the amount of captured semantic information usually require human evaluators or annotated data sets. We propose a novel direct evaluation method based on canonical correlation analysis (CCA), the classical method for finding linear relationships between two data sets. In our setting, the two sets are parallel text documents in two languages. A good feature extraction method should provide representations that reflect the semantic content of the documents. Assuming that the underlying semantic content is independent of the language, we can identify the feature extraction methods that capture the content best by measuring the dependence between the representations of a document and its translation. In the case of CCA, the applied measure of dependence is correlation. The evaluation method is based on unsupervised learning, is language- and domain-independent, and requires no resources besides a parallel corpus. In this paper, we demonstrate the evaluation method on a sentence-aligned parallel corpus. The method is validated by showing that the results obtained with bag-of-words representations are intuitive and agree well with previous findings. Moreover, we compare the proposed evaluation method with indirect evaluation methods in simple sentence matching tasks and with a quantitative manual evaluation of word translations. The results of the proposed method correlate well with the results of the indirect and manual evaluations.
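The idea of scoring a feature extraction method by the correlation between two views of the same content can be illustrated with a small sketch. The code below is not the authors' implementation: it uses plain (unregularized) CCA computed with NumPy, synthetic data standing in for the two language views, and correlations measured on the same data they were estimated from, which tends to overstate them. The function names, the noise parameter, and the use of the sum of canonical correlations as a score are all assumptions made for illustration.

```python
import numpy as np


def _orthonormal_basis(M, tol=1e-10):
    """Center the columns of M and return an orthonormal basis of their span."""
    Mc = M - M.mean(axis=0)
    U, s, _ = np.linalg.svd(Mc, full_matrices=False)
    # Drop directions with negligible variance so rank-deficient input is handled.
    return U[:, s > tol * s.max()]


def canonical_correlations(X, Y):
    """Canonical correlations between paired samples X (n x p) and Y (n x q).

    They equal the singular values of Qx^T Qy, where Qx and Qy are
    orthonormal bases of the centered column spaces of X and Y.
    """
    Qx = _orthonormal_basis(X)
    Qy = _orthonormal_basis(Y)
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)  # guard against rounding slightly above 1


def make_views(noise, n=500, k=3, d=6, seed=0):
    """Hypothetical toy data: two noisy linear views of one shared latent content Z."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, k))  # language-independent "semantic content"
    X = Z @ rng.standard_normal((k, d)) + noise * rng.standard_normal((n, d))
    Y = Z @ rng.standard_normal((k, d)) + noise * rng.standard_normal((n, d))
    return X, Y


# A representation that preserves the shared content (low noise) should score
# higher than one that mostly discards it (high noise).
good_score = canonical_correlations(*make_views(noise=0.1)).sum()
poor_score = canonical_correlations(*make_views(noise=5.0)).sum()
print(f"low-noise features:  {good_score:.3f}")
print(f"high-noise features: {poor_score:.3f}")
```

In this toy setting the rows of `X` and `Y` play the role of aligned sentences, and the columns play the role of extracted features in each language. With real text, the number of features can approach or exceed the number of aligned sentences, in which case unregularized CCA on training data yields trivially high correlations, so some form of regularization and held-out evaluation would be needed.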

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering