Automated unsupervised authorship analysis using evidence accumulation clustering

ROBERT LAYTON; PAUL WATTERS; RICHARD DAZELEY

doi:10.1017/S1351324911000313

Published online by Cambridge University Press: 21 November 2011

ROBERT LAYTON and PAUL WATTERS: Internet Commerce Security Laboratory, University of Ballarat, Australia (e-mails: r.layton@icsl.com.au, p.watters@ballarat.edu.au)
RICHARD DAZELEY: Data Mining and Informatics Research Group, University of Ballarat, Australia (e-mail: r.dazeley@ballarat.edu.au)

Abstract

Authorship Analysis aims to extract information about the authorship of documents from features within those documents. Typically, this is performed as a classification task with the aim of identifying the author of a document, given a set of documents of known authorship. Alternatively, unsupervised methods have been developed primarily as visualisation tools to assist the manual discovery of clusters of authorship within a corpus by analysts. However, there is a need in many fields for more sophisticated unsupervised methods to automate the discovery, profiling and organisation of related information through clustering of documents by authorship. An automated and unsupervised methodology for clustering documents by authorship is proposed in this paper. The methodology is named NUANCE, for n-gram Unsupervised Automated Natural Cluster Ensemble. Testing indicates that the derived clusters have a strong correlation to the true authorship of unseen documents.
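
The general idea of clustering documents by character n-grams with an evidence accumulation ensemble can be sketched as follows. This is a minimal illustration only, not the NUANCE implementation: the abstract does not specify the features, distance measure, base clusterer or final cut, so everything here is an assumption. In particular, the trigram relative-frequency profiles, the plain k-means base clusterer, the random choice of k per run, and the fixed 0.5 co-association threshold are all assumed, and every function name is hypothetical.

```python
import random
from collections import Counter


def ngram_profile(text, n=3):
    """Relative frequencies of character n-grams (assumed feature set)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams) or 1
    return {g: c / total for g, c in Counter(grams).items()}


def distance(p, q):
    """Squared Euclidean distance between two sparse profiles."""
    return sum((p.get(g, 0.0) - q.get(g, 0.0)) ** 2 for g in set(p) | set(q))


def mean_profile(profiles):
    """Centroid of a non-empty list of sparse profiles."""
    total = Counter()
    for p in profiles:
        for g, v in p.items():
            total[g] += v
    return {g: v / len(profiles) for g, v in total.items()}


def kmeans(profiles, k, rng, iterations=10):
    """Plain k-means with random initial centroids; returns cluster labels."""
    centroids = rng.sample(profiles, k)
    labels = [0] * len(profiles)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: distance(p, centroids[c]))
                  for p in profiles]
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean_profile(members)
    return labels


def co_association(partitions, n_items):
    """Fraction of base partitions in which each pair shares a cluster."""
    counts = [[0] * n_items for _ in range(n_items)]
    for labels in partitions:
        for i in range(n_items):
            for j in range(n_items):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]


def consensus_clusters(co, threshold=0.5):
    """Single-linkage cut: connected components of pairs above the threshold."""
    n = len(co)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def cluster_documents(docs, n=3, runs=20, seed=0):
    """Accumulate evidence over several k-means runs, then take a consensus."""
    rng = random.Random(seed)
    profiles = [ngram_profile(d, n) for d in docs]
    partitions = []
    for _ in range(runs):
        k = rng.randint(2, max(2, len(docs) - 1))  # assumed: vary k per run
        partitions.append(kmeans(profiles, k, rng))
    return consensus_clusters(co_association(partitions, len(docs)))
```

Calling `cluster_documents(list_of_texts)` on two or more documents returns one integer label per document. The number of final clusters is not fixed in advance; it falls out of the consensus step, which is the property that makes an ensemble of this kind attractive for unsupervised use.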

Type: Articles
Information: Natural Language Engineering, Volume 19, Issue 1, January 2013, pp. 95–120

DOI: https://doi.org/10.1017/S1351324911000313
Copyright: © Cambridge University Press 2011
