
Unsupervised learning of semantic representation for documents with the law of total probability*

  • YANG WEI, JINMAO WEI and ZHENGLU YANG
Abstract

Representing the semantic information of documents is the basis for many applications, such as document summarization, web search, and text analysis. Although many studies have addressed this problem by enriching document vectors with the relatedness of the words they contain, performance remains far from satisfactory because the physical boundaries of documents hinder the evaluation of relatedness between words. To address this problem, we propose an effective approach that further infers the implicit relatedness between words via their common related words. To avoid overestimating this implicit relatedness, we constrain the inference with the marginal probabilities of the words, based on the law of total probability. We confirm theoretically and experimentally that the proposed method reliably measures the relatedness between words. Thorough evaluation on real datasets shows that the proposed method achieves significant improvements in document clustering over state-of-the-art methods.
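To make the idea concrete, the following is a minimal sketch in Python of one way such an inference could be carried out. It is an illustration under stated assumptions, not the authors' exact formulation: the function name, the two-step matrix product over common related words, and the column-rescaling scheme used to enforce the marginal constraint are all assumptions introduced here.

import numpy as np

# Illustrative sketch (not the authors' exact method): infer implicit
# relatedness between words through their common related words, then
# rescale so the implied marginals obey the law of total probability.
def implicit_relatedness(cooc, eps=1e-12):
    """cooc: (V, V) word co-occurrence count matrix."""
    p_w = cooc.sum(axis=1) / cooc.sum()                      # marginal P(w_i)
    p_cond = cooc / (cooc.sum(axis=1, keepdims=True) + eps)  # P(w_j | w_i)
    # Two-step inference via common related words w_k:
    # P_hat(w_j | w_i) = sum_k P(w_k | w_i) * P(w_j | w_k)
    p_hat = p_cond @ p_cond
    # Law of total probability: sum_i P(w_i) * P_hat(w_j | w_i) should
    # equal P(w_j); rescale each column to prevent overestimation.
    implied = p_w @ p_hat
    return p_hat * (p_w / (implied + eps))

After the rescaling, the marginal implied by the inferred conditionals matches the observed marginal of each word exactly, which is the sense in which the law of total probability restricts the inference.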

Footnotes

* This work was supported by the National Natural Science Foundation of China under grants 61070089 and 61772288, the Science Foundation of Tianjin under grant 14JCYBJC15700, and the National Science Foundation of China under grants 11431006 and U1636116.