A new approach for textual feature selection based on N-composite isolated labels

Samir Elloumi

doi:10.1017/S1351324919000160

Published online by Cambridge University Press: 29 April 2019

Samir Elloumi*
Affiliation: University of Tunis El Manar, Faculty of Sciences of Tunis, Computer Science Department, Tunis, Tunisia
*Corresponding author. Email: samir.elloumi@fst.utm.tn

Abstract

Textual Feature Selection (TFS) aims to extract the parts or segments of a text that are most relevant to the information it expresses. The selected features are useful for automatic indexing, summarization, document categorization, knowledge discovery, and so on. Given the huge amount of electronic textual data published daily, TFS raises many challenges related to both semantics and processing efficiency. In this paper, we propose a new approach to TFS grounded in Formal Concept Analysis. Specifically, we extract textual features by exploring the regularities in a formal context where isolated points exist. We introduce the notion of N-composite isolated points: a set of N words treated as a single textual feature. We show that a small value of N (between 1 and 3) yields significant textual features compared with existing approaches, even when the initial formal context is not completely covered.
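To make the Formal Concept Analysis vocabulary above concrete, here is a minimal sketch on a toy formal context. It rests on several assumptions that the abstract does not state. The context data and all function names are hypothetical. The definition of an isolated point used here (a sentence-word pair that belongs to exactly one formal concept) is taken from the general FCA literature rather than from this abstract. The brute-force enumeration is for illustration only and is not the paper's algorithm. The sketch stops at single isolated pairs, because the abstract does not specify how N words are combined into one composite feature.

```python
from itertools import combinations

# Hypothetical toy formal context: each sentence (object) maps to the
# set of words (attributes) it contains.
context = {
    "s1": {"market", "stock"},
    "s2": {"market", "stock", "bank"},
    "s3": {"bank", "loan"},
}


def common_words(objects, ctx):
    """Derivation operator: the words shared by every object in `objects`."""
    if not objects:
        # By convention, the empty set of objects shares all words.
        return set().union(*ctx.values())
    return set.intersection(*(ctx[o] for o in objects))


def objects_having(words, ctx):
    """Derivation operator: the objects that contain every word in `words`."""
    return {o for o, ws in ctx.items() if words <= ws}


def formal_concepts(ctx):
    """Enumerate all formal concepts by closing every subset of objects.

    Exponential in the number of objects; only suitable for tiny examples.
    """
    concepts = set()
    objs = sorted(ctx)
    for size in range(len(objs) + 1):
        for subset in combinations(objs, size):
            intent = common_words(set(subset), ctx)
            extent = objects_having(intent, ctx)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


def isolated_pairs(ctx):
    """Return the (object, word) pairs covered by exactly one concept.

    Assumed definition: a pair is isolated when a single formal concept
    (maximal rectangle) contains it.
    """
    concepts = formal_concepts(ctx)
    result = set()
    for obj, words in ctx.items():
        for word in words:
            covering = [c for c in concepts if obj in c[0] and word in c[1]]
            if len(covering) == 1:
                result.add((obj, word))
    return result


if __name__ == "__main__":
    for pair in sorted(isolated_pairs(context)):
        print(pair)
```

In this toy context the pair (s2, bank) is not isolated, because it is covered both by the concept whose intent is {market, stock, bank} and by the concept whose intent is {bank}. Under the assumed definition, isolated pairs are the ones that pin down a single concept, which is why they are natural candidates for labelling that concept with a word.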

Keywords

Textual features; Formal concept; Composite isolated point; Difunctional relation

Information

Type: Article
Information: Natural Language Engineering, Volume 26, Issue 2, March 2020, pp. 221-243

DOI: https://doi.org/10.1017/S1351324919000160
Copyright: © Cambridge University Press 2019
