
Short-text learning in social media: a review

  • Antonela Tommasel (a1) and Daniela Godoy (a1)


Social networks occupy a ubiquitous and pervasive place in the lives of their users. The substantial amount of content generated and shared by social networking users offers new research opportunities across a wide variety of disciplines, including media and communication studies, linguistics, sociology, psychology, information and computer sciences, and education. This situation, in combination with the continuous growth of social media data, creates an imperative need for content organisation. Thus, large-scale text learning tasks in social environments arise as one of the most relevant problems in machine learning and data mining. Social media data pose several challenges due to their sparse, high-dimensional and large-volume characteristics. This survey reviews the field of social media data learning, focusing on classification and clustering techniques, as they are two of the most frequent learning tasks. It reviews not only new techniques that have been developed to tackle the challenges posed by short texts, but also how traditional techniques can be adapted to overcome such challenges. Finally, open issues and research opportunities for social media data learning are discussed.


