A new approach for textual feature selection based on N-composite isolated labels

  • Samir Elloumi (a1)

Abstract

Textual Feature Selection (TFS) aims to extract from a text the parts or segments that are most relevant to the information it expresses. The selected features are useful for automatic indexing, summarization, document categorization, knowledge discovery, and so on. Given the huge amount of electronic textual data published daily, TFS must address many challenges related to semantics as well as processing efficiency. In this paper, we propose a new approach for TFS based on Formal Concept Analysis. Specifically, we propose to extract textual features by exploring the regularities in a formal context where isolated points exist. We introduce the notion of N-composite isolated points: a set of N words that is treated as a single textual feature. We show that a small value of N (between 1 and 3) allows significant textual features to be extracted, compared with existing approaches, even when the initial formal context is not completely covered.
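
To make the vocabulary of the abstract concrete, the following is a minimal Python sketch of a formal context and of an isolated-point test. It is an illustration under stated assumptions, not the paper's algorithm: the toy context, the function names, and the definition used for "isolated" (a sentence/word pair that belongs to exactly one formal concept) are assumptions, since the abstract does not give them. The sketch covers only single words (N = 1); the closure of a word set is shown as one way a group of words might be seen as a single unit.

```python
# Toy formal context (hypothetical data): objects are sentences,
# attributes are the words each sentence contains.
CONTEXT = {
    "s1": {"formal", "concept", "analysis"},
    "s2": {"formal", "concept"},
    "s3": {"text", "mining"},
    "s4": {"text", "analysis"},
}


def extent(words, context):
    """Return the objects that contain every word in `words`."""
    return {obj for obj, attrs in context.items() if words <= attrs}


def intent(objects, context):
    """Return the words shared by every object in `objects`."""
    shared = set().union(*context.values())
    for obj in objects:
        shared &= context[obj]
    return shared


def closure(words, context):
    """Standard FCA closure: all words implied by `words` in this context."""
    return intent(extent(words, context), context)


def is_isolated(obj, word, context):
    """Assumed definition: (obj, word) belongs to exactly one formal concept.

    This holds when every object that has `word` also has all words of `obj`,
    so the rectangle extent({word}) x context[obj] lies inside the relation.
    """
    return all(context[obj] <= context[other]
               for other in extent({word}, context))


if __name__ == "__main__":
    print(closure({"formal"}, CONTEXT))
    print(is_isolated("s3", "mining", CONTEXT))
    print(is_isolated("s3", "text", CONTEXT))
```

In this toy context the pair (s3, "mining") passes the test because "mining" occurs in no other sentence, whereas (s3, "text") does not, because "text" is shared with a sentence that has different words.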

Corresponding author

*Corresponding author. Email: samir.elloumi@fst.utm.tn
