Robust stylometric analysis and author attribution based on tones and rimes

  • Renkui Hou (a1) (a2) and Chu-Ren Huang (a2)

Abstract

In this article, we propose an innovative and robust approach to stylometric analysis that requires no annotation and leverages lexical and sub-lexical information. In particular, we propose to leverage the phonological information of tones and rimes in Mandarin Chinese, extracted automatically from unannotated texts. The texts from different authors were represented by tones, tone motifs, and word length motifs, as well as rimes and rime motifs. Support vector machines and random forests were used to build the text classification models for authorship attribution. From the experimental results, we conclude that the combination of bigrams of rimes, word-final rimes, and segment-final rimes can effectively discriminate between texts by different authors when random forests are used to build the classification model. This robust approach can in principle be applied to other languages with an established phonological inventory of onsets and rimes.
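As a rough illustration of the kind of feature the abstract describes, the following minimal sketch derives rime bigrams from tone-numbered pinyin syllables and turns them into relative frequencies. It is not the authors' implementation: the function names (`split_syllable`, `rime_bigrams`, `relative_frequencies`), the tone-numbered pinyin input format, and the treatment of the pinyin letters "y" and "w" as onsets are all assumptions made for this example, and the conversion from Chinese characters to pinyin is left out.

```python
from collections import Counter

# Pinyin initials, with two-letter initials first so "zh" is tried before "z".
# Assumption: the glide letters "y" and "w" are treated as onsets for simplicity.
ONSETS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
          "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable):
    """Split a tone-numbered pinyin syllable into (onset, rime, tone)."""
    tone = syllable[-1:] if syllable[-1:].isdigit() else ""
    body = syllable[:-1] if tone else syllable
    for onset in ONSETS:
        if body.startswith(onset):
            return onset, body[len(onset):], tone
    return "", body, tone  # zero-onset syllable such as "an4"


def rime_bigrams(syllables):
    """Return the sequence of adjacent rime pairs for a list of syllables."""
    rimes = [split_syllable(s)[1] for s in syllables]
    return list(zip(rimes, rimes[1:]))


def relative_frequencies(items):
    """Map each distinct item to its share of all items (one feature each)."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items()}


# Hypothetical input: the syllables of a short text, already in pinyin.
text = ["zhong1", "guo2", "ren2"]
features = relative_frequencies(rime_bigrams(text))
```

Under these assumptions, each text becomes a dictionary of rime-bigram frequencies, which could then be arranged into a feature vector for a classifier such as a random forest or a support vector machine.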

Corresponding author

*Corresponding author. Email: hourk0917@163.com

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering