Word-specific tonal realizations in Mandarin

Published online by Cambridge University Press:  13 May 2026

Yu-Ying Chuang*
Affiliation:
Department of Taiwan Culture, Languages and Literature, National Taiwan Normal University, Taiwan
Melanie J. Bell
Affiliation:
Anglia Ruskin University, Cambridge, UK
Yu-Hsiang Tseng
Affiliation:
Department of Linguistics, University of Tübingen, Tübingen, Germany
R. Harald Baayen
Affiliation:
Department of Linguistics, University of Tübingen, Tübingen, Germany
*Corresponding author: Yu-Ying Chuang; Email: yuying.chuang@ntnu.edu.tw

Abstract

The pitch contours of Mandarin two-character words are generally understood as being shaped by lexical tones on the constituent single-character words, in interaction with articulatory constraints imposed by factors such as speech rate, coarticulation with adjacent tones, segmental makeup, and predictability. This study shows that tonal realization is also partially determined by words’ meanings. We first show, on the basis of a corpus of Taiwan Mandarin spontaneous conversations, using a generalized additive regression model and focusing on the rise-fall tonal pattern, that after controlling for effects of speaker and context, word type is a stronger predictor of tonal realization than all of the previously established word-form-related predictors combined. Importantly, the addition of information about meaning in context improves prediction accuracy even further. We then proceed to show, using computational modeling with context-specific word embeddings, that token-specific pitch contours predict word type with 50% accuracy on held-out data, and that context-sensitive, token-specific embeddings can predict the shape of pitch contours with 40% accuracy. These accuracies, which are an order of magnitude above chance level, suggest that the relation between words’ pitch contours and their meanings is sufficiently strong to be potentially functional for language users. The theoretical implications of these empirical findings are discussed.

Information

Type
General Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (http://creativecommons.org/licenses/by-nc-sa/4.0), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is used to distribute the re-used or adapted article and the original article is properly cited. The written permission of Cambridge University Press or the rights holder(s) must be obtained prior to any commercial use.
Copyright
© The Author(s), 2026. Published by Cambridge University Press on behalf of Linguistic Society of America

Figure 1. Toy data set. The left-hand panel shows the f0 contours of single tokens of six Taiwan Mandarin words with the RF tonal pattern, produced in isolation by the same speaker. The right-hand panel shows the RF contour predicted by a simple GAM, using a thin plate regression spline smooth for normalized time as predictor.

Figure 2. The left-hand panel shows by-word adjustment contours from the toy model with only by-word factor smooth and normalized time as predictors. The right-hand panel plots the fitted contour for each word, with the predicted general contour (identical for all words) indicated by the dashed line.
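
The relation between the two panels can be sketched as follows, assuming the usual additive structure of a GAM in which a word's fitted contour is the general contour plus that word's adjustment. The function name, the word labels, and all values are hypothetical and serve only as an illustration.

```python
def fitted_contour(general, adjustment):
    """Add a by-word adjustment to the general contour, point by point."""
    if len(general) != len(adjustment):
        raise ValueError("contours must be sampled at the same time points")
    return [g + a for g, a in zip(general, adjustment)]


# Hypothetical rise-fall contour sampled at three normalized time points.
general = [0.0, 1.0, 0.0]

# Hypothetical by-word adjustments (not values from the article).
adjustments = {
    "word_a": [0.5, 0.0, -0.5],
    "word_b": [-0.5, 0.0, 0.5],
}

fitted = {word: fitted_contour(general, adj) for word, adj in adjustments.items()}
```

Under this assumption, two words that share the same general contour can still differ in their fitted contours solely through their adjustment curves.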

Figure 3. Partial effects in the baseline GAM. The upper left-hand panel shows the predicted base contours for speakers self-identified as female and speakers self-identified as male. The next four panels show, for female speakers, how the base contour is modulated by duration, utterance position, previous bigram probability, and following bigram probability, respectively. The final panel presents, again for female speakers, the effect of tonal coarticulation with the tone of the preceding word, when the following word has a high-level tone.

Figure 4. The left-hand panel shows model fit improvement gauged by decrease in AIC units when a predictor (or set of predictors) is added to the baseline model for the word-type analysis. The right-hand panel shows the concurvity score of individual predictors in two models using the full data set of 3,778 tokens: the omnibus-segment model (light gray), with factor smooths for all segment-related control variables added to the baseline, and the word model (dark gray), with only a factor smooth for word added to the baseline.
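
The AIC comparison in the left-hand panel can be illustrated with a minimal sketch. It assumes the standard definition AIC = 2k - 2 log L, with k taken as a model's (effective) degrees of freedom. The function names and all numbers are hypothetical and are not taken from the article.

```python
def aic(log_likelihood: float, edf: float) -> float:
    """Akaike information criterion: 2k - 2*logL, with k the (effective) degrees of freedom."""
    return 2.0 * edf - 2.0 * log_likelihood


def aic_decrease(baseline_aic: float, extended_aic: float) -> float:
    """Decrease in AIC units; positive values favor the extended model."""
    return baseline_aic - extended_aic


# Hypothetical log-likelihoods and degrees of freedom, for illustration only.
baseline_aic = aic(log_likelihood=-5200.0, edf=40.0)
word_model_aic = aic(log_likelihood=-4900.0, edf=150.0)
improvement = aic_decrease(baseline_aic, word_model_aic)
```

Because the penalty term grows with the degrees of freedom, a predictor only yields a decrease in AIC when its gain in log-likelihood outweighs the added complexity.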

Figure 5. Examples of the pitch contours predicted by the general smooth for time for female speakers, combined with the partial effects of the factor smooth for word. These partial effects do not include the general intercept or the differences in pitch between female and male speakers. As they represent the pure effect of word on the pitch contour, irrespective of other predictors, the curves are centered around zero on the y-axis (indicated by a horizontal dotted line). The vertical dotted lines in the panels indicate the average (word-specific) syllable boundary.

Figure 6. Model accuracy under 100 runs of cross-validation for the word-type analysis. The boxplots represent the distributions of reduction in SSE.
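
A minimal sketch of the quantity summarized in the boxplots follows. It assumes that the reduction in SSE is expressed as a proportion of the baseline model's SSE on held-out data; the caption does not specify whether the reduction is absolute or proportional, so this definition, the function names, and the values are all assumptions made for illustration.

```python
def sse(observed, predicted):
    """Sum of squared errors between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def sse_reduction(observed, baseline_pred, model_pred):
    """Proportional reduction in SSE relative to the baseline (assumed definition)."""
    baseline_sse = sse(observed, baseline_pred)
    return (baseline_sse - sse(observed, model_pred)) / baseline_sse


# Hypothetical held-out pitch values and predictions from two models.
observed = [1.0, 2.0, 3.0, 4.0]
baseline_pred = [2.5, 2.5, 2.5, 2.5]
model_pred = [1.5, 2.0, 3.0, 3.5]

reduction = sse_reduction(observed, baseline_pred, model_pred)
```

Repeating this computation over many random train/test splits would give one reduction value per run, which is the kind of distribution a boxplot can display.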

Figure 7. Examples of the pitch contours predicted by the general smooth for time for female speakers, combined with the partial effects of the factor smooth for word. Predictions obtained with the novel and original data sets are indicated by dark and light gray, respectively. The upper panels present words that have different samples of tokens in the two data sets, whereas the lower panels present a random selection of four words of which the same tokens were used in the two analyses.

Figure 8. The partial effect of the word factor smooth predicted by the three models for a selection of eight words.

Figure 9. The effect of smoothing parameters on the mean squared error (MSE) for training (left) and test (right) data. The dashed lines indicate the estimated smoothing parameter by GAM in the full model. For both curves, a 95% confidence interval is indicated, which for the training data is so narrow that it is hardly visible.

Figure 10. The left-hand panel shows model fit improvement gauged by decrease in AIC units when a predictor (or set of predictors) is added to the baseline model for the sense analysis. The right-hand panel shows the concurvity score of individual predictors in three models using the smaller data set of 3,458 tokens: the omnibus-segment model with factor smooths for all segment-related control variables (light gray), the word-type model with a factor smooth for word (dark gray), and the sense model with a factor smooth for sense (black).

Figure 11. Examples of the pitch contours predicted by the general smooth for time for female speakers, combined with the partial effects of the factor smooth for sense. The left-hand panel shows the fitted tonal contours for different senses of the word bu2yao4, a negation marker in Mandarin. The four senses are ‘prohibition’, ‘dissuasion’, ‘unnecessity’, and ‘to wish something to not happen’. The upper right-hand panel shows the fitted tonal contours for the two senses of shi2zai4, meaning ‘truly’ and ‘indeed’, respectively. The lower right-hand panel plots the fitted contours for the two senses of neng2gou4: ‘being capable of’ and ‘enabling’.

Figure 12. Model accuracy under 100 runs of cross-validation for the sense analysis. The boxplots represent the distributions of reduction in SSE.

Figure 13. Predicted pitch contours of the partial effects of the factor smooth for sense, for the five most frequent senses (upper row) and the five least frequent senses (lower row). Numbers in parentheses indicate the number of tokens in the data set for the different senses.

Figure 14. Contextualized embeddings, obtained from a pretrained Chinese GPT-2 model, cluster by word type in the two-dimensional plane obtained with t-distributed stochastic neighbor embedding (van der Maaten & Hinton 2008). Convex hulls (gray polygons) show that the tokens of the different word types form well-localized and highly distinct clusters.

Figure 15. One token randomly selected for a selection of words. The dots plot the observed pitch contour (raw data), and pitch vectors obtained from the word-type and the omnibus-segment models are represented by the dark gray and light gray curves, respectively. The vertical dotted lines indicate syllable boundaries.

Figure 16. Mean comprehension accuracies for training data (left) and test data (right) for LDL and ResLDL mappings from omnibus-segment (black) and word (gray) pitch vectors. Mean accuracy is obtained from thirty stratified random training and testing splits, each trained and evaluated independently. Error bars indicate double the standard error.
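
The summary statistics described in this caption (a mean over independent splits, with error bars of twice the standard error) can be sketched as follows. The function name and the accuracy values are hypothetical; a short list stands in for the thirty splits, and the standard error is assumed to be the sample standard deviation divided by the square root of the number of splits.

```python
import math
import statistics


def mean_and_error_bar(accuracies):
    """Return the mean accuracy and an error bar of two standard errors."""
    mean = statistics.mean(accuracies)
    standard_error = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, 2.0 * standard_error


# Hypothetical per-split accuracies (not values from the article).
split_accuracies = [0.48, 0.50, 0.52, 0.50]
mean_accuracy, error_bar = mean_and_error_bar(split_accuracies)
```

An interval of the mean plus or minus two standard errors roughly approximates a 95% confidence interval when the per-split accuracies are close to normally distributed.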

Figure 17. Mean production accuracies for training data (left) and test data (right) for LDL and ResLDL mappings from omnibus-segment (black) and word-type (gray) pitch vectors. Mean accuracy is obtained from thirty stratified random training and testing splits, each trained and evaluated independently. Error bars indicate double the standard error.

Figure 18. Pitch contours for the sample of fifteen word types introduced in Figure 5. The light gray lines represent the average of the pitch vectors generated by the word-type GAM across all tokens of that type (i.e. the average of the contours used to train LDL). The dark gray lines represent the predictions generated by LDL. These LDL contours were predicted from ‘centroid’ word meaning, obtained by averaging the CEs of all tokens of the same type. The black lines represent the word-specific contours predicted by the word GAM as presented in Figure 5 and reproduced here after centering and scaling. That is, these black lines show the pure effect of word on the pitch contour, irrespective of other predictors. The vertical dotted lines in the panels indicate the average (word-specific) syllable boundary.
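
Two operations mentioned in this caption, averaging token embeddings into a ‘centroid’ and centering and scaling a contour, can be sketched as follows. The sketch assumes that the centroid is an element-wise mean and that scaling divides by the sample standard deviation; the caption does not state the exact scaling used, and the function names and values are hypothetical.

```python
import statistics


def centroid(vectors):
    """Element-wise mean of equal-length vectors (e.g. token embeddings of one word type)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def center_and_scale(contour):
    """Subtract the mean and divide by the sample standard deviation (assumed scaling)."""
    mean = statistics.mean(contour)
    sd = statistics.stdev(contour)
    return [(value - mean) / sd for value in contour]


# Hypothetical two-dimensional embeddings for two tokens of the same word type.
token_embeddings = [[1.0, 2.0], [3.0, 4.0]]
type_centroid = centroid(token_embeddings)

# Hypothetical contour sampled at three time points.
scaled_contour = center_and_scale([1.0, 2.0, 3.0])
```

Centering and scaling in this way puts contours from different sources on a common scale, so that their shapes rather than their absolute pitch levels can be compared.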