Authorship analysis of aliases: Does topic influence accuracy?



Aliases play an important role in online environments by facilitating anonymity, but they can also be used to hide the identity of cybercriminals. Previous studies have investigated the alias matching problem, which asks whether two aliases belong to the same author and can therefore assist with identifying users. Those studies create their training data by randomly splitting the documents associated with an alias into two sub-aliases. Models built on such data regularly achieve over 90% accuracy in recovering the linkage between these ‘random sub-aliases’. In this paper, we show that random sub-alias generation is itself what enables these high accuracies, and thus that it does not adequately model the real-world problem. In contrast, creating sub-aliases using topic-based splitting drastically reduces the accuracy of all authorship methods tested. We then present a methodology that can be applied to datasets that are not topic-controlled, producing topic-based sub-aliases that are more difficult to match. Finally, we present an experimental comparison of many authorship methods to determine which match aliases best under these conditions, finding that local n-gram methods perform better than the others.
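The difference between the two ways of generating sub-aliases can be illustrated with a minimal sketch. It is not the paper's procedure: it assumes each document already carries a topic label, and the function names `random_split` and `topic_split`, the `topic_of` callback, and the greedy balancing rule are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def random_split(docs, seed=0):
    """Shuffle one alias's documents and cut them into two sub-aliases.

    Both halves are drawn from the same pool, so they tend to share topics.
    """
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def topic_split(docs, topic_of):
    """Split one alias's documents so the two sub-aliases share no topic.

    `topic_of` is an assumed callback returning a topic label for a document.
    """
    by_topic = defaultdict(list)
    for doc in docs:
        by_topic[topic_of(doc)].append(doc)

    left, right = [], []
    # Greedy balancing (an illustrative choice): take topics from largest to
    # smallest and give each whole topic to whichever side is currently smaller.
    for topic in sorted(by_topic, key=lambda t: (-len(by_topic[t]), str(t))):
        target = left if len(left) <= len(right) else right
        target.extend(by_topic[topic])
    return left, right
```

With documents stored as `(topic, text)` pairs, `topic_split(docs, lambda d: d[0])` yields two sub-aliases with disjoint topic sets, whereas `random_split(docs)` usually leaves the same topics on both sides. A matcher evaluated on the second kind of split can therefore lean on topical vocabulary rather than on authorial style.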



