Detecting sexual predators in chats using behavioral features and imbalanced learning*

CLAUDIA CARDEI; TRAIAN REBEDEA

doi:10.1017/S1351324916000395

Published online by Cambridge University Press: 31 January 2017

CLAUDIA CARDEI: Affiliation:
Department of Computer Science, University Politehnica of Bucharest, 060042 Bucharest, Romania. E-mail: claudia.cardei@gmail.com
TRAIAN REBEDEA: Affiliation:
Department of Computer Science, University Politehnica of Bucharest, 060042 Bucharest, Romania. E-mail: traian.rebedea@cs.pub.ro

Abstract

This paper presents a system for detecting sexual predators in online chat conversations using two-stage classification and behavioral features. A sexual predator is defined as a person who tries to obtain sexual favors in a predatory manner, usually from underage people. The proposed approach combines several text categorization methods with empirical behavioral features developed specifically for this task. After investigating various approaches to the sexual predator identification problem, we found that a two-stage classifier achieves the best results. In the first stage, a Support Vector Machine classifier distinguishes conversations with suspicious content from safe online discussions. Because most real-life chat conversations do not involve a sexual predator, this stage acts as a filter, so that the actual detection of predators is performed only on suspicious chats that are very likely to contain one. In the second stage, a Random Forest classifier determines which of the users in a suspicious discussion is the actual predator. The system was tested on the corpus provided by the PAN 2012 workshop organizers, and the results are encouraging: as far as we know, our solution outperforms all previous approaches developed for this task.
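The control flow of the two-stage approach described in the abstract can be illustrated with a minimal sketch. It is not the system from the paper: the Support Vector Machine and Random Forest classifiers are replaced here by trivial keyword rules so that the example stays self-contained, and every name in it (`two_stage_detect`, `WATCH_TERMS`, `toy_is_suspicious`, `toy_is_predator`, the user ids) is a hypothetical placeholder. Only the cascade structure, filtering conversations first and classifying users second, follows the text.

```python
from typing import Callable, Dict, List, Set

# A conversation maps each author id to the messages that author wrote.
Conversation = Dict[str, List[str]]


def two_stage_detect(
    conversations: List[Conversation],
    is_suspicious: Callable[[Conversation], bool],
    is_predator: Callable[[str, Conversation], bool],
) -> Set[str]:
    """Run a two-stage cascade and return the ids of flagged authors."""
    flagged: Set[str] = set()
    for conv in conversations:
        # Stage 1: conversation-level filter (an SVM in the paper).
        if not is_suspicious(conv):
            continue
        # Stage 2: user-level decision, run only on suspicious
        # conversations (a Random Forest in the paper).
        for author in conv:
            if is_predator(author, conv):
                flagged.add(author)
    return flagged


# Toy stand-ins for the trained classifiers (assumptions, not the
# paper's features): a small hypothetical watch list of words.
WATCH_TERMS = {"secret", "alone"}


def _uses_watch_term(messages: List[str]) -> bool:
    return any(WATCH_TERMS & set(m.lower().split()) for m in messages)


def toy_is_suspicious(conv: Conversation) -> bool:
    return any(_uses_watch_term(msgs) for msgs in conv.values())


def toy_is_predator(author: str, conv: Conversation) -> bool:
    return _uses_watch_term(conv[author])


chats: List[Conversation] = [
    {"u1": ["hi there", "how was school"], "u2": ["fine thanks"]},
    {"u3": ["keep this secret ok"], "u4": ["why"]},
]
flagged = two_stage_detect(chats, toy_is_suspicious, toy_is_predator)
print(flagged)
```

Passing the two classifiers in as callables keeps the cascade independent of any particular model, so trained classifiers could be substituted for the toy rules without changing `two_stage_detect`.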

Type: Articles
Information: Natural Language Engineering, Volume 23, Issue 4, July 2017, pp. 589-616

DOI: https://doi.org/10.1017/S1351324916000395
Copyright: © Cambridge University Press 2017

Footnotes

This work has been partially funded by the Sectorial Operational Programme Human Resources Development 2007–2013 of the Romanian Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/132397. Moreover, Claudia Cardei would like to thank Google for the Anita Borg scholarship granted in 2014 which partly funded this work on sexual predator identification in online conversations.
