
Iconic prosody enhances the depictive power of ideophones

Published online by Cambridge University Press:  09 October 2025

Kimi Akita*
Affiliation: Department of English Linguistics, School of Humanities, Nagoya University, Aichi, Japan
Shigeto Kawahara
Affiliation: The Institute of Cultural and Linguistic Studies, Keio University, Tokyo, Japan
*Corresponding author: Kimi Akita; Email: akita.kimi.s4@f.mail.nagoya-u.ac.jp

Abstract

Prosody not only signals the speaker’s cognitive states but can also imitate various concepts. However, previous studies on the latter, the iconic function of prosody, have mostly analyzed novel words and nonlinguistic vocalizations. To fill this gap in the literature, the current study examined the iconic potential of the prosodic features of existing Japanese imitative words known as ideophones. In Experiment 1, female Japanese speakers pronounced 20 sentences containing ideophones in infant-directed speech. They used a higher f0 to express faster and more pleasant movements. Similar iconic associations were observed in Experiment 2, in which Japanese speakers chose the best-matching pitch–intensity–duration combination for each ideophone. In Experiment 3, Japanese speakers chose the best-matching voice quality (creaky voice, falsetto, harsh voice or whisper) for the ideophones. Falsetto was preferred for a light object’s fast motion, harsh voice for violent motion and whisper for quiet motion. Based on these results, we entertain the possibility that the iconic prosody of ideophones provides a missing link in evolutionary theories of language that posit a beginning in iconic vocalizations. On this evolutionary path, ideophones with varying degrees of iconic prosody can be located between nonlinguistic vocalizations and arbitrary words.

Information

Type
Article
Creative Commons
Creative Commons License: CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Table 1. Stimulus sentences for Experiment 1, with abbreviated semantic labels for cross-referencing in parentheses


Table 2. Mean semantic ratings, with standard deviation in parentheses, for all the ideophones that were examined


Table 3. Principal component loadings


Figure 1. Mean standardized f0 of the V2 of ideophones, from the lowest to the highest.


Figure 2. Mean standardized intensity of ideophones, from the lowest to the highest.


Figure 3. Mean standardized duration of ideophones, from the lowest to the highest.


Figure 4. The speed and pleasantness of ideophones and the mean standardized f0 of their V2.


Table 4. The results of the Bayesian mixed regression model for the mean f0 of the V2 of ideophones


Figure 5. The speed and pleasantness of ideophones and their mean standardized intensity.


Table 5. The results of the Bayesian mixed regression model for the intensity of ideophones


Figure 6. The speed and pleasantness of ideophones and their mean standardized duration.


Table 6. The results of the Bayesian mixed regression model for the duration of ideophones


Figure 7. Proportions of high and low f0 sounds (A, B, C, D versus E, F, G, H) preferred for the 20 ideophones, ordered in the same way as the corresponding figure in Experiment 1.


Table 7. The results of the Bayesian mixed regression model for the preferred f0 level


Figure 8. Proportions of high- and low-intensity sounds (A, B, E, F versus C, D, G, H) preferred for the 20 ideophones, ordered in the same way as the corresponding figure in Experiment 1.


Table 8. The results of the Bayesian mixed regression model for the preferred intensity level


Figure 9. Proportions of long and short sounds (A, C, E, G versus B, D, F, H) preferred for the 20 ideophones, ordered in the same way as the corresponding figure in Experiment 1.


Table 9. The results of the Bayesian mixed regression model for the preferred duration


Figure 10. Proportions of the four voice qualities preferred for the 20 ideophones.


Table 10. The results of the Bayesian regression model for the preferred voice qualities, with creaky voice as a baseline


Figure 11. Possible evolutionary path from nonlinguistic vocalizations to non-ideophonic, symbolic words via ideophones.