
Cognitive stylometry: A computational study of defamiliarization in modern Chinese

Published online by Cambridge University Press: 05 December 2025

Maciej Kurzynski*
Affiliation:
Department of Chinese, Lingnan University, Hong Kong
*Corresponding author: Maciej Kurzynski; Email: maciej.kurzynski@ln.edu.hk

Abstract

Autoregressive language models generate text by predicting the next word from the preceding context. The regularities internalized from specific training data make this mechanism a useful proxy for historically situated readerly expectations, reflecting what earlier linguistic communities would find probable or meaningful. In this article, I pre-train a GPT model (223M parameters) on a broad corpus of Chinese texts (FineWeb Edu Chinese V2.1) and fine-tune it on the collected writings of Mao Zedong (1893–1976) to simulate the evolving linguistic landscape of post-1949 China. Identifying token sequences with the sharpest drops in perplexity – a measure of the model’s surprise – reveals the core phraseology of “Maospeak,” the militant language style that developed from Mao’s writings and pronouncements. A comparative analysis of modern Chinese fiction demonstrates how literature becomes unfamiliar to the fine-tuned model, generating perplexity spikes of increasing magnitude. The findings suggest a mechanism of attentional control: whereas propaganda backgrounds meaning through repetition (cognitive overfitting), literature foregrounds it through deviation (non-anomalous surprise). By visualizing token sequences as perplexity landscapes with peaks and valleys, the article reconceives style as a probabilistic phenomenon and showcases the potential of “cognitive stylometry” for literary theory and close reading.
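The perplexity comparison described in the abstract can be illustrated with a minimal sketch. It is not the author's code: the function names, the use of an absolute difference (base minus fine-tuned) as the "drop," and the toy probabilities are all assumptions made for illustration. Perplexity is computed in the standard way, as the exponential of the mean negative log-probability of the tokens in a window.

```python
import math

def perplexity(log_probs):
    """Perplexity of a token window, given natural-log token probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))

def sharpest_drops(base_lp, tuned_lp, n=16, top_k=1):
    """Rank n-token windows by perplexity drop (base minus fine-tuned).

    Assumption: the drop is measured as an absolute difference; a ratio
    or log-ratio would be an equally plausible choice.
    Returns a list of (drop, start_index) pairs, largest drop first.
    """
    if len(base_lp) != len(tuned_lp):
        raise ValueError("both models must score the same token sequence")
    scored = []
    for start in range(len(base_lp) - n + 1):
        base_ppl = perplexity(base_lp[start:start + n])
        tuned_ppl = perplexity(tuned_lp[start:start + n])
        scored.append((base_ppl - tuned_ppl, start))
    scored.sort(reverse=True)
    return scored[:top_k]

# Hypothetical toy data: eight tokens scored by two models. The fine-tuned
# model finds the last four tokens far more probable than the base model does.
base_lp = [math.log(0.5)] * 4 + [math.log(0.1)] * 4
tuned_lp = [math.log(0.5)] * 4 + [math.log(0.8)] * 4

drop, start = sharpest_drops(base_lp, tuned_lp, n=4)[0]
print(f"largest drop {drop:.2f} at token offset {start}")
```

In practice the log-probabilities would come from scoring the same text with the pre-trained and the fine-tuned model; the window with the largest drop marks the phrase the fine-tuned model has learned most strongly.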

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial licence (https://creativecommons.org/licenses/by-nc/4.0), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original article is properly cited. The written permission of Cambridge University Press or the rights holder(s) must be obtained prior to any commercial use.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Table 1. Corpus and model training parameters.

Table 2. Top 20 most significantly learned 16-character phrases with context.

Table 3. Top 20 most significantly learned 16-character phrases with context (continued).

Figure 1. Comparison of average perplexities on 16-grams (top) and 1,024-token documents (bottom), as calculated on 1,000 trials with 10,000 samples per trial.

Figure 2. Per-character perplexity plots for three excerpts from the Selected Works of Mao Zedong. The yellow line (Base) shows the perplexity of the pre-trained model, while the lines from orange to red show the decreasing perplexity over five epochs of fine-tuning on the Mao Corpus. The $x$-axis displays the Chinese character sequence, and the $y$-axis shows perplexity on a logarithmic scale.

Figure 3. Shannon entropy comparison between Mao’s Selected Works and Chinese novels across different $n$-gram sizes on 1,000 trials.

Figure 4. Per-character perplexity plots for three excerpts from novels. The light blue line (Base) shows the perplexity of the pre-trained model, while the darker lines show the decreasing perplexity over five epochs of fine-tuning on the Mao Corpus. The $x$-axis displays the Chinese character sequence, and the $y$-axis shows perplexity on a logarithmic scale. Top: Zhang Wei’s The Ancient Ship; Middle: Lilian Lee’s Farewell My Concubine; Bottom: Dung Kai-cheung’s Works and Creations.

Supplementary material: File

Kurzynski supplementary material (File, 10.5 MB)