Detecting sexual predators in chats using behavioral features and imbalanced learning*


This paper presents a system for detecting sexual predators in online chat conversations using two-stage classification and behavioral features. A sexual predator is defined as a person who tries to obtain sexual favors in a predatory manner, usually from underage people. The proposed approach combines several text categorization methods with empirical behavioral features developed specifically for this task. After investigating various approaches to the sexual predator identification problem, we found that a two-stage classifier achieves the best results. In the first stage, we employ a Support Vector Machine classifier to distinguish conversations with suspicious content from safe online discussions. Because most real-life chat conversations do not involve a sexual predator, this stage acts as a filter, so that the actual detection of predators is performed only on suspicious chats that are very likely to involve one. In the second stage, we use a Random Forest classifier to determine which of the users in a suspicious discussion is the actual predator. The system was tested on the corpus provided by the PAN 2012 workshop organizers, and the results are encouraging: as far as we know, our solution outperforms all previous approaches developed for this task.
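The two-stage design described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn, uses synthetic numeric vectors in place of the text and behavioral features (and in place of the PAN 2012 corpus), and uses `class_weight="balanced"` as one possible way to handle class imbalance. All variable and function names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption: NOT the PAN 2012 corpus).
# Roughly 10% of conversations are labeled suspicious, mimicking imbalance.
n_conv = 300
conv_suspicious = (rng.random(n_conv) < 0.1).astype(int)
conv_features = rng.normal(size=(n_conv, 5)) + 3.0 * conv_suspicious[:, None]

# Two users per conversation; in a suspicious conversation the first user
# is labeled as the predator (a simplification made for this sketch).
user_conv = np.repeat(np.arange(n_conv), 2)
user_predator = np.zeros(2 * n_conv, dtype=int)
user_predator[0::2] = conv_suspicious
user_features = rng.normal(size=(2 * n_conv, 4)) + 2.0 * user_predator[:, None]

# Stage 1: an SVM separates suspicious conversations from safe ones.
stage1 = LinearSVC(class_weight="balanced", max_iter=5000)
stage1.fit(conv_features, conv_suspicious)

# Stage 2: a Random Forest, trained only on users who took part in
# suspicious conversations, separates predators from the other users.
in_suspicious = conv_suspicious[user_conv] == 1
stage2 = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
stage2.fit(user_features[in_suspicious], user_predator[in_suspicious])


def detect_predators(conv_X, user_X, user_to_conv):
    """Return indices of users flagged by both stages of the cascade."""
    suspicious = stage1.predict(conv_X) == 1
    # Only users from conversations that pass the filter reach stage 2.
    candidates = np.flatnonzero(suspicious[user_to_conv])
    if candidates.size == 0:
        return candidates
    is_predator = stage2.predict(user_X[candidates]) == 1
    return candidates[is_predator]


flagged = detect_predators(conv_features, user_features, user_conv)
print(len(flagged), "users flagged out of", len(user_conv))
```

The sketch applies the cascade to the same data it was trained on, so the printed count only shows the control flow and says nothing about accuracy; a real evaluation would use a held-out test set.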

This work has been partially funded by the Sectorial Operational Programme Human Resources Development 2007–2013 of the Romanian Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/132397. Moreover, Claudia Cardei would like to thank Google for the Anita Borg scholarship granted in 2014 which partly funded this work on sexual predator identification in online conversations.


Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering