Skip to main content
×
×
Home

Silhouette + attraction: A simple and effective method for text clustering

  • MARCELO L. ERRECALDE (a1), LETICIA C. CAGNINA (a1) and PAOLO ROSSO (a2)
Abstract

This article presents silhouette–attraction (Sil–Att), a simple and effective method for text clustering, which is based on two main concepts: the silhouette coefficient and the idea of attraction. The combination of both principles allows us to obtain a general technique that can be used either as a boosting method, which improves results of other clustering algorithms, or as an independent clustering algorithm. The experimental work shows that Sil–Att is able to obtain high-quality results on text corpora with very different characteristics. Furthermore, its stable performance on all the considered corpora is indicative that it is a very robust method. This is a very interesting positive aspect of Sil–Att with respect to the other algorithms used in the experiments, whose performances heavily depend on specific characteristics of the corpora being considered.

Copyright
Footnotes
Hide All

This research work has been partially funded by UNSL, CONICET (Argentina), DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) research project, and the WIQ-EI IRSES project (grant no. 269180) within the FP 7 Marie Curie People Framework on Web Information Quality Evaluation Initiative. The work of the third author was done also in the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.

Footnotes
References
Hide All
Alexandrov, M., Gelbukh, A., and Rosso, P. 2005. An approach to clustering abstracts. In Montoyo, A., Muñoz, R., and Métais, E. (eds.), Natural Language Processing and Information Systems, vol. 3513, pp. 275–85. Lecture Notes in Computer Science. Germany: Springer-Verlag Berlin Heidelberg,.
Aranganayagi, S., and Thangavel, K., 2007. Clustering categorical data using silhouette coefficient as a relocating measure. In Proceedings of the International Conference on Computational Intelligence and Multimedia Applications (ICCIMA-2007), vol. 02, California, USA: IEEE Computer Society, pp. 13–7.
Azzag, H., Monmarche, N., Slimane, M., Venturini, G., and Guinot, C., 2003. AntTree: a new model for clustering with artificial ants. In Proceedings of the IEEE Conference on Evolutionary Computation (CEC-2003), vol. 04, New Jersey, USA: IEEE Press, pp. 2642–7.
Banerjee, S., Ramanathan, K., and Gupta, A., 2007. Clustering short texts using Wikipedia. In Proceedings of the 30th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR-2007), New York, USA: ACM, pp. 787–8.
Barnette, J. J., and McLean, J. E., 1998. The Tukey honestly significant difference procedure and its control of the type I error-rate. In Proceedings of 27 Annual Meeting of the Mid-South Educational Research Association, John Petry. Louisiana, USA, pp. 215.
Berry, M. W. (ed.), 2003. Survey of Text Mining: Clustering, Classification, and Retrieval. New York, USA: Springer Science+Business Media, Inc.
Bezdek, J. C., and Pal, N. R., 1995. Cluster validation with generalized Dunn’s indices. In Proceedings of the Second New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems, California, USA: IEEE Computer Society Press, pp. 190–3.
Bonato dos Santos, J., Heuser, C., Pereira Moreira, V., and Krug Wives, L., 2011. Automatic threshold estimation for data matching applications. Information Sciences 181 (13): 2685–99.
Brun, M., Sima, C., Hua, J., Lowey, J., Carroll, B., Suh, E., and Dougherty, E. R., 2007. Model-based evaluation of clustering validation measures. Pattern Recognition 40 (3): 807–24.
Cagnina, L., Errecalde, M., Ingaramo, D., and Rosso, P., 2008. A discrete particle swarm optimizer for clustering short-text corpora. In Proceedings of the International Conference on Bioinspired Optimization Methods and their Applications (BIOMA08), Ljubljana, Slovenia: Jozef Stefan Institute, pp. 93103.
Cagnina, L., Errecalde, M., Ingaramo, D., and Rosso, P., 2014. An efficient particle swarm optimization approach to cluster short texts. Information Sciences 265 : 3649.
Carullo, M., Binaghi, E., and Gallo, I., 2009. An online document clustering technique for short web contents. Pattern Recognition Letters 30 (10): 870–6.
Charikar, M., Chekuri, C., Feder, T., and Motwani, R., 2004. Incremental clustering and dynamic information retrieval. SIAM Journal on Computing 33 (6): 1417–40.
Choi, M. J., Tan, V. Y. F., Anandkumar, A., and Willsky, A. S., 2011. Learning latent tree graphical models. Journal of Machine Learning Research 12 : 1771–812.
Cutting, D. R., Karger, D. R., Pedersen, J. O., and Tukey, J. W., 1992. Scatter/gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR 2002), Interface Design and Display, New York, USA: ACM, pp. 318–29.
Davies, D. L., and Bouldin, D. W., 1979. A cluster separation measure. IEEE Transactions on Analysis and Machine Intelligence 1 (2): 224–7.
Dunn, J. C., 1974. Well separated clusters and optimal fuzzy-partitions. Journal of Cybernetics 4 (1): 95104.
Errecalde, M., Ingaramo, D., and Rosso, P., 2008. Proximity estimation and hardness of short-text corpora. In 19th International Workshop on Database and Expert Systems Application, California, USA: IEEE Computer Society, pp. 15–9.
Errecalde, M., Ingaramo, D., and Rosso, P., 2010. ITSA*: an effective iterative method for short-text clustering tasks. In Proceedings of the 23rd International Conference on Industrial Engineering and other Applications of Applied Intelligent Systems, IEA/AIE 2010, vol. 6096, Lecture Notes in Computer Science, Germany: Springer-Verlag Berlin, Heidelberg., pp. 550–9.
Fisher, R., 1925. Statistical Methods for Research Workers. Edinburgh, Scotland: Oliver & Boyd.
Hasan, M. A., Chaoji, V., Salem, S., and Zaki, M. J., 2009. Robust partitional clustering by outlier and density insensitive seeding. Pattern Recognition Letters 30 (11): 9941002.
He, H., Chen, B., Xu, W., and Guo, J., 2007. Short text feature extraction and clustering for web topic mining. In Proceedings of the 3rd International Conference on Semantics, Knowledge and Grid, California, USA: IEEE Computer Society, pp. 382–5.
Hearst, M. A., 2006. Clustering versus faceted categories for information exploration. Communications of the ACM 49 (4): 5961.
Hu, X., Sun, N., Zhang, C., and Chua, T., 2009. Exploiting internal and external semantics for the clustering of short texts using world knowledge. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, New York, USA: ACM, pp. 919–28.
Ingaramo, D., Errecalde, M., Cagnina, L., and Rosso, P. 2009. Particle Swarm Optimization for clustering short-text corpora. In Computational Intelligence and Bioengineering, pp. 319. Amsterdam, The Netherlands: IOS Press.
Ingaramo, D., Errecalde, M., and Rosso, P. 2010a. A general bio-inspired method to improve the short-text clustering task. In Proceedings of 11th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2010, vol. 6008, pp. 661–72. Lecture Notes in Computer Science. Germany: Springer-Verlag Berlin Heidelberg.
Ingaramo, D., Errecalde, M., and Rosso, P., 2010b. A new AntTree-based algorithm for clustering short-text corpora. Journal of Computer Science & Technology 10 (1): 17.
Ingaramo, D., Leguizamón, G., and Errecalde, M., 2005. Adaptive clustering with artificial ants. Journal of Computer Science & Technology 5 (4): 264–71.
Ingaramo, D., Pinto, D., Rosso, P., and Errecalde, M. 2008. Evaluation of internal validity measures in short-text corpora. In Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2008, vol. 4919, pp. 555–67. Lecture Notes in Computer Science. Germany: Springer-Verlag Berlin Heidelberg.
Ingaramo, D., Rosas, M. V., Errecalde, M., and Rosso, P., 2011. Clustering iterativo de textos cortos con representaciones basadas en concepto. Procesamiento del Lenguaje Natural 46 : 1926.
Jiang, D., Pei, J., and Zhang, A., 2003. DHC: a density-based hierarchical clustering method for time series gene expression data. In Proceedings 3rd IEEE Symposium on BioInformatics and BioEngineering (BIBE 2003), California, USA: IEEE Computer Society, pp. 393400.
Jing, L. 2005. Survey of text clustering. Technical report. Department of Mathematics. The University of Hong Kong, Hong Kong, China.
Karthikeyan, T., Peter, S. J., and Chidambaranathan, S., 2011. Hybrid algorithm for noise-free high density clusters with self-detection of best number of clusters. International Journal of Hybrid Information Technology 4 (2): 3954.
Karypis, G., Han, E., and Kumar, V., 1999. Chameleon: hierarchical clustering using dynamic modeling. IEEE Computer 32 (8): 6875.
Kaufman, L., and Rousseeuw, P. J. 1990. Finding Groups in Data: An Introduction to Cluster Analysis, New York, USA: John Wiley and Sons.
Kruskal, W. H., and Wallis, W. A., 1952. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association 47 (260): 583621.
Kyriakopoulou, A. 2008. Text classification aided by clustering: a literature review. In Tools in Artificial Intelligence, pp. 233–52. Rijeka, Croatia: INTECH.
Larocca Neto, J., Santos, A. D., Kaestner, C. A. A., and Freitas, A. A. 2000. Document clustering and text summarization. In Mackin, N. (ed.), Proceedings of the 4th International Conference Practical Applications of Knowledge Discovery and Data Mining (PADD-2000), London, England: The Practical Application Company, pp. 4155.
Levene, H. 1960. Robust tests for equality of variances. In Contributions to Probability and Statistics, pp. 278–92. California, USA: Stanford University Press.
Liu, X., and Croft, W. B. 2004. Cluster-based retrieval using language models. In Sanderson, M., Jrvelin, K., Allan, J., and Bruza, P. (eds.), Proceedings of the 27th annual International ACM Conference on Research and Development in Information Retrieval (SIGIR 2004), New York, USA: ACM, pp. 186–93.
MacQueen, J. B., 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, California, USA: University of California Press, pp. 281–97.
Makagonov, P., Alexandrov, M., and Gelbukh, A. 2004. Clustering abstracts instead of full texts. In Text, Speech and Dialogue, 10th International Conference (TSD 2004), vol. 3206, pp. 129–35. Lecture Notes in Artificial Intelligence, Germany: Springer-Verlag Berlin Heidelberg.
Manning, C. D., Raghavan, P., and Schutze, H., 2008. Introduction to Information Retrieval. Cambridge, England: Cambridge University Press.
Ng, A. Y., Jordan, M. I., and Weiss, Y. 2001. On spectral clustering: analysis and an algorithm. In The Proceedings of the 2001 Neural Information Processing Systems (NIPS) Conference. vol. 14, pp. 849–56. Advances in Neural Information Processing Systems. Massachusetts, USA: MIT Press.
Ng, R. T., and Han, J., 1994. Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB ’94), California, USA: Morgan Kaufmann Publishers Inc., pp. 144–55.
Paterlini, A. A., Nascimento, M. A., and Traina, C. Jr, 2011. Using pivots to speed-up k-medoids clustering. Journal of Information and Data Management 2 (2): 221–36.
Pinto, D., Benedí, J. M., and Rosso, P. 2007. Clustering narrow-domain short texts by using the Kullback-Leibler distance. In Computational Linguistics and Intelligent Text Processing, 8th International Conference (CICLing 2007), vol. 4394, pp. 611–22. Lecture Notes in Computer Science, Germany: Springer-Verlag Berlin Heidelberg.
Pinto, D., and Rosso, P. 2007. On the relative hardness of clustering corpora. In Text, Speech and Dialogue, 10th International Conference (TSD 2007), vol. 4629, pp. 155–61. Lecture Notes in Computer Science, Germany: Springer-Verlag Berlin Heidelberg.
Pons-Porrata, A., Berlanga-Llavori, R., and Ruiz-Shulcloper, J., 2007. Topic discovery based on text mining techniques. Information Processing and Management 43 (3): 752–68.
Popova, S., and Khodyrev, I. 2011. Local theme detection and annotation with keywords for narrow and wide domain short text collections. In Proceedings of the 5th International Conference on Advances in Semantic Processing (SEMAPRO 2011). Delaware, USA: International Academy, Reseach, and Industry Association (IARIA), pp. 4955.
Qi, G.-J., Aggarwal, C., and Huang, T., 2012. Community detection with edge content in social media networks. In IEEE 28th International Conference on Data Engineering (ICDE 2012), California, USA: IEEE Computer Society - Conference Publishing Services (CPS), pp. 534–45.
Rousseeuw, P., 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20 (1): 5365.
Selim, S. Z., and Alsultan, K., 1991. A simulated annealing algorithm for the clustering problem. Pattern Recognition 24 (10): 1003–8.
Shannon, C. E., 1948. A mathematical theory of communication. The Bell system technical journal 27 : 379423.
Slonim, N., Friedman, N., and Tishby, N., 2002. Unsupervised document classification using sequential information maximization. In Proceedings of the 25th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR 2002), New York, USA: ACM, pp. 129–36.
Spearman, C., 1904. The proof and measurement of association between two things. The American Journal of Psychology 15 (1): 72101.
Stein, B., and Busch, M., 2005. Density-based cluster algorithms in low-dimensional and high-dimensional applications. In 2nd International Workshop on Text-Based Information Retrieval (TIR 2005), Fachberichte Informatik, Mainz, Germany: University of Koblenz-Landau, pp. 4556.
Stein, B., and Meyer zu Eissen, S. 2002. Document categorization with MajorClust. In Proceedings Workshop on Issues in the Theory of Security (WITS 2002), Barcelona, Spain: Technical University of Barcelona, pp. 91–6.
Stein, B., and Meyer zu Eissen, S., 2004. Topic identification: framework and application. In Proceedings of the International Conference on Knowledge Management (I-KNOW 2004), Graz, Austria: Know-Center, pp. 353–60.
Stein, B., Meyer zu Eissen, S., and Wißbrock, F., 2003. On cluster validity and the information need of users. In 3rd International Conference on Artificial Intelligence and Applications (AIA 2003), Benalmádena, Spain: ACTA Press, pp. 216–21.
Steinbach, M., Karypis, G., and Kumar, V., 2000. A comparison of document clustering techniques. In Working Notes - Workshop on Text Mining. The 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Massachusetts, USA, pp. 109–11.
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., and Varga, D., 2006. The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Paris, France: European Language Resources Association (ELRA), pp. 2142–7.
Takeda, T., and Takasu, A., 2007. Updatenews: a news clustering and summarization system using efficient text processing. In Proceedings of the 7th ACM/IEEE CS Joint Conference on Digital Libraries, New York, USA: ACM, pp. 438–9.
Tan, P. N., Steinbach, M., and Kumar, V., 2005. Introduction to Data Mining. Massachusetts, USA: Addison-Wesley Longman Publishing Co..
Tu, L., and Chen, Y., 2009. Stream data clustering based on grid density and attraction. ACM Transactions on Knowledge Discovery from Data 3 : 127.
Tukey, J. W. 1953. The problem of multiple comparisons. In The Collected Works of John W. Tukey VIII. Multiple Comparisons: 1948–1983, pp. 1300. New York, USA: Chapman and Hall.
Tukey, J. W., 1977. Exploratory Data Analysis. Massachusetts, USA: Pearson Education, Inc.
van Rijsbergen, C. J. 1979. Information Retrieval, 2 edn. London, England: Butterworths.
Xu, W., Liu, X., and Gong, Y. 2003. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR 2003). New York, USA: ACM, pp. 267–73.
Yang, T., Jin, R., Chi, Y., and Zhu, S., 2009. Combining link and content for community detection: a discriminative approach. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09. New York, USA: ACM, pp. 927–36.
Zha, H., He, X., Ding, C., Simon, H., and Gu, M. 2001. Spectral relaxation for k-means clustering. In The Proceedings of the 2001 Neural Information Processing Systems (NIPS) Conference, vol. 14, pp. 1057–64. Advances in Neural Information Processing Systems. Massachusetts, USA: MIT Press.
Zhang, D., Wang, J., and Si, L., 2011. Document clustering with universum. In Proceedings of the 34th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2011), New York, USA: ACM, pp. 873–82.
Zhou, Y., Cheng, H., and Yu, J. X., 2009. Graph clustering based on structural/attribute similarities. In Proceedings of the VLDB Endowment, vol. 2. Very Large Data Base Endowment Inc (VLDB Endowment), Lyon, France, pp. 718–29.
Zhao, Y., and Karypis, G., 2004. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning 55 : 311–31.
Zhao, Y., Karypis, G., and Fayyad, U., 2005. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery 10 : 141–68.
Recommend this journal

Email your librarian or administrator to recommend adding this journal to your organisation's collection.

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering
Please enter your name
Please enter a valid email address
Who would you like to send this to? *
×

Metrics

Altmetric attention score

Full text views

Total number of HTML views: 2
Total number of PDF views: 61 *
Loading metrics...

Abstract views

Total abstract views: 660 *
Loading metrics...

* Views captured on Cambridge Core between September 2016 - 21st September 2018. This data will be updated every 24 hours.