11.1 The Perceptual Center in Speech
The notion of the perceptual center (or the P-center) dates back to the beginnings of speech rhythm research that focused on temporal isochrony (Morton et al., 1976), though the concept has not lost its appeal to the present day (Lin and De Jong, 2023; Zoefel et al., 2023). The P-center is defined as the subjectively perceived moment of occurrence, highlighting that acoustic and perceptual onsets of rhythmic events do not necessarily co-occur (Morton et al., 1976). Instead, the P-center seems to lag behind the acoustic onset of the corresponding rhythmic event, such as a (monosyllabic) word. The discovery was made with a recording of English digits from one to nine that were evenly concatenated to create an isochronous rhythmic sequence. The evenly spaced concatenation, however, sounded irregular to the experimenters (and other listeners). The sequence could only be made regular once the digits were concatenated with reference to the perceived, rather than acoustic, isochrony (Morton et al., 1976). It was observed that the perceptual onsets of the concatenated digits deviated systematically, but somewhat inconsistently, from their acoustic counterparts. They did not coincide with local peaks in signal amplitude or fundamental frequency (Morton et al., 1976).
Figure 11.1 illustrates this idea with the recording of a speaker producing the words bad, mad, sad, had, ad, pad at a steady pace as cued by a 2.5 Hz metronome (with an interval of 400 ms between beat onsets). If we measure the resulting intervals between successive word onsets, the produced sequence of words is irregular. And yet it sounds isochronous, just as the speaker intended to produce it. Under the P-center view, perceptual isochrony of this example derives from an even spacing of the P-centers of rhythmic speech events. The concept has also been applied to music (London et al., 2019; Vos and Rasch, 1981), and some discussions of the P-center propose that it constitutes the level of the beat in speech, thus linking the rhythmic structure of speech and music (Allen, 1972; Cumming et al., 2015; Harsin, 1997; Harsin and Green, 1994; Hoequist, 1983; Scott, 1998). The beat in music is defined as an underlying grid of equal time intervals that provides temporal structure to musical notes (Savage et al., 2015). It is uncontroversially regular in contrast to speech timing that shows no evidence for isochrony and very limited evidence for regularity (Arvaniti, 2009; Dauer, 1983; Rathcke and Smith, 2015), at least on the surface of measurable acoustics (Turk and Shattuck-Hufnagel, 2013). The P-center effect indicates that equal time intervals may exist in speech after all, and that they are perceptual rather than acoustic in nature.
Figure 11.1 Illustration of the P-center effect. The word sequence consisting of six English words (bad, mad, sad, had, ad, and pad) was produced in sync with a metronome at 2.5 Hz (or 400 ms between beat onsets) and sounds highly regular. However, the resulting inter-onset intervals between successive word onsets vary between minimally 312 ms (mad-sad) and maximally 534 ms (had-ad), demonstrating a discrepancy between (irregular) acoustics and (regular) perception typical of the P-center effect.
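The measurement described in the caption can be sketched in a few lines of Python. This is a minimal illustration only: the function names are hypothetical, and the onset times below are invented placeholders. Only the 312 ms (mad-sad) and 534 ms (had-ad) intervals and the 400 ms metronome period come from the caption; the remaining values were chosen arbitrarily so that the example is complete.

```python
def inter_onset_intervals(onsets_ms):
    """Return the intervals between successive acoustic word onsets (in ms)."""
    return [later - earlier for earlier, later in zip(onsets_ms, onsets_ms[1:])]


def deviations_from_beat(iois_ms, beat_ms=400.0):
    """Return how far each inter-onset interval departs from the metronome period."""
    return [ioi - beat_ms for ioi in iois_ms]


words = ["bad", "mad", "sad", "had", "ad", "pad"]

# ASSUMPTION: placeholder onsets, not measurements from Figure 11.1.
# They reproduce only the two intervals reported in the caption.
onsets_ms = [0, 410, 722, 1100, 1634, 2000]

iois = inter_onset_intervals(onsets_ms)
for (first, second), ioi, dev in zip(zip(words, words[1:]), iois, deviations_from_beat(iois)):
    print(f"{first}-{second}: {ioi} ms ({dev:+.0f} ms relative to the beat)")
```

With real data, the onsets would come from a manual or automatic segmentation of the recording; the point of the sketch is simply that acoustic inter-onset intervals can vary widely around the metronome period even when the sequence sounds regular.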

11.2 Methods of Examining the P-Center
Following its discovery, the P-center was extensively researched using a great variety of methods with the ultimate goal of developing an algorithm that would automatically identify P-center location in speech. One method gave rise to the example shown in Figure 11.1. In this task, participants are asked to produce a series of words or syllables in time with a (real or imagined) metronome (e.g., Chow et al., 2015; Fowler, 1979; Fox and Lehiste, 1987a; Marcus, 1981; Šturm and Volín, 2016; Tuller and Fowler, 1980). The subsequent analyses focus either on determining the alignment point of the metronome beat and the speech signal or on identifying the magnitude of discrepancies between timings of words with identical versus varied phonological structure. Another commonly used task is based on the perceptual adjustment for isochrony (e.g., Cooper et al., 1988; Harsin, 1997; Marcus, 1981; Pompino-Marschall, 1989; Scott, 1998). In a version of this task, listeners are given one base word repeating with a fixed inter-onset interval and asked to adjust the timing of a following, phonologically different word such that it matches the regular inter-onset intervals of the preceding base word repetitions. The subsequent analyses of listeners’ adjustments examine temporal deviations between the inter-onset intervals established by the base words and the perceptually matched words that deviate from the base word in their phonological structure. Another variant of the task asks listeners to align words to metronome beats (Pompino-Marschall, 1989). Finally, participants have also been asked to tap along with a designated syllable of a looped sequence of words (Allen, 1972). The timing of the tap can then be analyzed, determining the location of the syllable’s P-center.
Having discovered the P-center effect with a series of isolated monosyllabic words, Morton et al. (1976: 405) were careful to add that the properties (and the existence) of the P-center may well be “subject to phonological, semantic, or syntactic influences” that play a role in natural speech. These influences have not yet been empirically addressed (see Chapter 22). While it has also been suggested that the P-center effect may explain the apparent lack of isochrony in speech acoustics (e.g., Lehiste, 1977; Morton et al., 1976), no studies have examined the P-center in connected speech. Across all methods mentioned above, speech materials consist of isolated, real or nonce, words presented with an intervening pause. The materials can be varied with regard to the identity of onset consonants and the phonological complexity of their clusters (e.g., seed, bead, lead, blead), the vowel quality in the syllable nucleus (e.g., bad, bed, bid), the presence or absence of coda consonants, and the complexity of codas (e.g., see, seek, seeks), but they have been generally restricted to mono- or bisyllabic words. Evidence from related work on beat perception in natural connected speech indicates that the subjectively perceived onset of the beat in spoken sentences indeed deviates from the acoustic onset of phonological syllables (Lin and Rathcke, 2020; Rathcke et al., 2021), providing preliminary support for the P-center effect in speech of higher complexity than one-word utterances (though without demonstrating an isochronous distribution of P-centers in natural speech).
11.3 On the Location of the P-Center
There is no generally accepted model of the exact P-center location (Villing et al., 2011). The evidence on which factors affect it and in what ways is mixed. It is mostly assumed to lie somewhere close to the vowel onset (Barbosa et al., 2005; Franich, 2018; Hoequist, 1983; Marcus, 1981) and to be mostly affected by syllable onsets rather than codas (Howell, 1988; Marcus, 1981; Pompino-Marschall, 1989; Scott and Howell, 1992; Šturm and Volín, 2016), though these generalizations may be limited to Germanic and possibly Romance languages that have predominantly been studied to date (Šturm and Volín, 2016). In Cantonese, however, monosyllabic words produced in time with a metronome do not show the tendency for the beat to lag behind the acoustic syllable onset (Chow et al., 2015). The speech-to-metronome synchronization in this tonal language is tightly timed to syllable-initial consonants rather than vowels (Chow et al., 2015), casting doubts on the existence of the P-center in Cantonese and, more generally, on the effect being a cross-linguistic universal as previously suggested (Hoequist, 1983).
Overall, the effect has been documented in nontonal languages including English (e.g., Cooper et al., 1988; Fowler, 1979; Fox, 1987; Harsin, 1997; Tuller and Fowler, 1980), German (Pompino-Marschall, 1989; Pompino-Marschall et al., 1989), Brazilian Portuguese (Barbosa et al., 2005), Spanish (Hoequist, 1983), Czech (Šturm and Volín, 2016), and Japanese (Fox, 1987; Hoequist, 1983). Extending typological diversity, the effect has recently been ascertained in Medumba, a tonal language of the Bantu family (Franich, 2018), and Mandarin Chinese (Lin and De Jong, 2023). This finding suggests that the lack of the P-center effect in Cantonese cannot be due to lexical tone. Chow et al. (2015) explain their result with reference to syllable structure in Cantonese whose onsets are either empty or occupied by one or maximally two (an obstruent plus a glide) consonants. The authors suggest that the phonotactic restriction may have the acoustic consequence that the prevocalic part in Cantonese syllables is relatively short and minimally variable, leading to vowel onsets being less reliable acoustic landmarks than onsets of syllable-initial consonants. However, the syllable structure of Cantonese is quite comparable to that of Japanese (Kubozono, 1989; Otake et al., 1993), yet Japanese speakers (Hoequist, 1983) and listeners (Fox, 1987) display the P-center effect in production and perception comparable to the one found with English speakers and listeners. Moreover, Mandarin Chinese has even more restricted syllable phonotactics than Cantonese (e.g., Zhao and Berent, 2016), though a recent production study indicates that the P-center of Mandarin Chinese is located close to the acoustic vowel onset, just as in nontonal languages. Language-specific syllable phonotactics is thus less likely to be the main reason for the cross-linguistic differences in the P-center effect.
While Cantonese boasts a complex tone system with several dynamic and level tones, Medumba has a two-way contrast (Franich, 2018) and Mandarin has a four-way contrast (Lin and De Jong, 2023). It is unclear whether this difference in tonal inventory can account for the discrepancy in P-center findings across the three tonal languages. It is also unclear whether pitch plays any role in influencing the location of the P-center. While Chow et al. (2015) did not observe any differences between P-centers of Cantonese words carrying different tones and Lin and De Jong (2023) only examined syllables with tone-1, Franich (2018) measured differently timed P-centers in words carrying a low versus high tone, with high tones leading to earlier P-centers. In their seminal study, Morton et al. (1976) did not find an effect of pitch on P-center location in English. This finding was confirmed in a recent study with more complex English materials (Lin and Rathcke, 2020; Rathcke and Lin, 2023), though there remains a possibility that pitch may shape beat perception in some (not necessarily tonal) languages.
In languages that clearly demonstrate the P-center, its location seems to be affected by the properties of the whole syllable or word, though the effects of onset, nucleus, and coda are neither similar in magnitude nor additive, and the evidence documenting the (phonological versus phonetic) nature of P-center shifts is mixed. Early work by Marcus (1981) experimented with natural and manipulated versions of monosyllabic words for English digits and found that their P-center was located later in the syllable if the duration of the onset was shorter, or if the vowel or coda duration was longer. Fox and Lehiste (1987a) asked if such durational influences on P-center shifts were phonological rather than phonetic in nature, given that many phonological contrasts (e.g., tense versus lax vowels in English) go hand in hand with timing alternations (long versus short). They conducted a study into the effect of vowel quality as opposed to vowel duration on P-center location, examining English monosyllables with lax versus tense vowel nuclei. The results indicated little role of vowel phonology in shifting the location of the P-center within a syllable, confirming that the nature of the P-center effect was purely phonetic rather than phonological. An opposite conclusion was reached by Šturm and Volín (2016), who demonstrated that P-center location in bisyllabic words of Czech was strongly affected by phonological vowel length rather than phonetic vowel duration.
Cooper et al. (1986) studied the phonetic influence of syllable onsets and nuclei by varying the duration of fricative noise in a fricative-vowel syllable, the duration of acoustic silence in a fricative-stop-vowel syllable, or the duration of the vowel itself. The perception of the P-center in the resulting stimuli was mostly affected by the duration of the syllable-initial consonant(s) and, to a lesser extent, by the duration of the vowel, showing temporal shifts similar to those documented by Marcus (1981). Following up on this work, Cooper et al. (1988) examined the role of syllable rime in more detail, testing the hypothesis put forward by Marcus (1981) that the rime behaves as a unit such that durational variability in the vowel versus the coda does not exert an independent influence on the location of the P-center. Two experiments systematically manipulated the duration of the vowel in a vowel-consonant syllable with either covarying or constant duration of the rime. The results did not provide evidence in support of the hypothesis by Marcus (1981). Instead, they suggested that both constituents of the rime (i.e., vowels and codas) had comparable effects on P-center location.
A series of experiments with more varied materials conducted by Pompino-Marschall (1989), however, showed that the phonetic effects of segment duration on P-center location were not as linear and additive as suggested by earlier research. Rather, the duration of the syllable onset, vowel, and coda interacted in complex ways, jointly determining the direction and the magnitude of shifts in P-center location. Adding to this complexity, Harsin (1997) provided further evidence that the durational effect of syllable onsets on P-center location did not equally apply across a wide range of consonants but was moderated by the phonological category of the onset. Specifically, syllables with sonorants versus obstruents of the same duration differed in their P-centers and did not display a unified effect of consonant lengthening on a later location of the P-center that had been generally shown in earlier work with more limited materials (Allen, 1972; Cooper et al., 1988; Fowler, 1979; Marcus, 1981). Given that sonorants and obstruents show remarkable differences in their energy distributions and amplitude envelopes, subsequent work focused primarily on attempts to model P-center location as a function of spectro-temporal properties of a syllable, even though experimental evidence to this end had been rather mixed (Harsin, 1997; Marcus, 1981; Morton et al., 1976; Tuller and Fowler, 1980).
Testing an acoustic account of the P-center in their original work, Morton et al. (1976) excluded local peaks in signal amplitude or in fundamental frequency as suitable signal-driven anchors of the P-center location. Subsequent studies further elaborated that the P-center did not coincide with any acoustic landmarks in speech (e.g., Cooper et al., 1986; Marcus, 1981). However, Howell (1988) and Scott and Howell (1992) revisited the acoustic account of the P-center and proposed a model based on the amplitude envelope and a syllabic “center of gravity,” suggesting that perceptual judgments are linked to the distribution of the energy in a syllable (see Chapter 3). In this model, the center of gravity refers to the moment when the energy peak of a syllable is reached, typically at the consonant-vowel transition. The slope of the energy rise toward the center of gravity is assumed to encode onset consonants and is crucial to the calculations of P-center location. If the energy contour rises quickly right from the syllable onset (as for some syllable-initial fricatives), the P-center occurs earlier; if it shows a more gradual increase, the P-center is located later (the concept came to be widely known as syllable rise time, e.g., Goswami et al., 2011; Leong et al., 2011). This model stands in contrast to the most recent acoustic representation of P-center location that somewhat downplays the energy of some consonants, notably fricatives, in order to derive the P-center (Šturm and Volín, 2016). According to Šturm and Volín (2016), the P-center is best represented as the moment of the fastest energy change (maxD) occurring at the consonant-vowel transitions within a syllable, though for the algorithm to perform well, the high energy of some consonants such as fricatives ought to be significantly downplayed and smoothed (Šturm and Volín, 2016: 42). Previous attempts to localize the P-center at the midpoint of the amplitude rise time did not have that feature (Cummins and Port, 1998). The two algorithms are available for researchers with an interest in the study of the P-center, either from the first author’s website (Cummins and Port, 1998) or upon individual request (Šturm and Volín, 2016).
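The two envelope-based ideas discussed above (the moment of fastest energy rise and the midpoint of the amplitude rise) can be illustrated with a simplified sketch. It is not the published maxD algorithm, nor the implementation by Cummins and Port: the moving-average envelope, the 20 ms window, the 10% and 90% rise thresholds, all function names, and the synthetic test signal are assumptions made purely for illustration, and no down-weighting of fricative energy is attempted.

```python
import numpy as np


def amplitude_envelope(signal, sr, win_ms=20.0):
    """Rectify the signal and smooth it with a moving average.

    ASSUMPTION: a deliberately crude envelope; published models use more
    elaborate filtering and frequency weighting.
    """
    win = max(1, int(sr * win_ms / 1000))
    kernel = np.ones(win) / win
    return np.convolve(np.abs(signal), kernel, mode="same")


def fastest_rise_time(env, sr):
    """Time (s) at which the envelope increases most steeply."""
    return int(np.argmax(np.diff(env))) / sr


def rise_midpoint_time(env, sr, lo=0.1, hi=0.9):
    """Time (s) halfway between the first crossings of lo and hi fractions of the peak."""
    peak_idx = int(np.argmax(env))
    rise = env[: peak_idx + 1]
    i_lo = int(np.argmax(rise >= lo * env[peak_idx]))
    i_hi = int(np.argmax(rise >= hi * env[peak_idx]))
    return 0.5 * (i_lo + i_hi) / sr


def synthetic_syllable(sr=8000):
    """Toy stimulus: 100 ms silence, 50 ms linear rise, 200 ms plateau, 100 ms silence."""
    shape = np.concatenate([
        np.zeros(int(0.10 * sr)),
        np.linspace(0.0, 1.0, int(0.05 * sr)),
        np.ones(int(0.20 * sr)),
        np.zeros(int(0.10 * sr)),
    ])
    t = np.arange(shape.size) / sr
    return shape * np.sin(2 * np.pi * 200 * t)  # 200 Hz carrier


sr = 8000
env = amplitude_envelope(synthetic_syllable(sr), sr)
print("fastest rise (s):", fastest_rise_time(env, sr))
print("rise midpoint (s):", rise_midpoint_time(env, sr))
```

On the toy stimulus, both landmarks fall inside the rising portion of the envelope, well after the acoustic onset at 100 ms. For real speech, the two landmarks can diverge, and neither should be read as a validated P-center estimate without the perceptual weighting discussed in the text.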
Despite some differences, both algorithms of P-center location operate within the domain of a syllable and sample acoustic properties of the local amplitude envelope delimited by the syllable boundaries. There is, however, some evidence that the P-center can also be affected by a preceding or following syllable. For example, Fox and Lehiste (1987b) showed that the P-center shifts to a later location if an additional syllable is suffixed to form a bisyllabic word. In contrast, it shifts (even more substantially) to an earlier location if an additional syllable is prefixed. While Šturm and Volín (2016) focused specifically on bisyllabic words, they did not compare them to monosyllables, so it is unclear whether or not their maxD algorithm should account for polysyllabic complexity and in what ways. In recent work, we applied the algorithm to more varied and naturally complex materials in English and examined the potential of the maxD-derived landmark to predict the location of finger taps produced during a task requiring participants to synchronize with the beat of repeated sentences (Rathcke et al., 2021). The results showed that the maxD landmark was statistically as good a predictor of finger-tap locations as vowel onsets. Even though both the task and the stimuli of this sensorimotor synchronization experiment by far exceeded the complexity of more traditional P-center paradigms, the findings confirmed that the P-center effect existed in the perception of rhythmic beat structure in natural English speech. That is, there is a discrepancy between the acoustic onset of a syllable and the perceived onset of the syllable beat.
11.4 Explanations of the P-Center Effect
The original interest in the effect was grounded in the idea that rhythm meant isochrony and motivated by the search for some temporal constancy in language. The discovery of the effect gave rise to the hypothesis that isochrony in language might be perceptual and not acoustic (Lehiste, 1977; Morton et al., 1976). Even though the idea that speech rhythm can be defined purely on the basis of duration and timing has received much criticism (Arvaniti, 2009; Kohler, 2009) and is not unanimously shared (White and Malisz, 2020), the P-center effect maintains its relevance to speech rhythm research as it signifies a notable discrepancy between speech acoustics and perception. Such discrepancy is not unique to the P-center but generally applies to a range of speech perception phenomena that show nonlinear relationships with physical input properties (e.g., Dilley and Pitt, 2010; Goldstone and Hendrickson, 2010; Warren, 1968).
As noted by Morton et al. (1976: 408), the concept of the P-center has “no explanatory power” of its own as it simply describes one temporal aspect of speech perception. Not surprisingly, approaches to explaining the origin of the P-center effect somewhat mirror approaches to understanding speech perception in general. A prominent account in this regard is the proposal put forward by Fowler and colleagues (Fowler, 1979, 1986, 1994; Tuller and Fowler, 1980). It follows the motor theory of speech perception (Liberman and Mattingly, 1985), assuming that articulatory gestures constitute perceptual units in connected speech and that P-centers track the kinematic signal of speech production or, more specifically, the temporal regularity of vowel gestures (see Chapter 2). Accordingly, the P-center effect originates in the fact that “listeners extract information from the acoustic signal that specifies articulatory timing” (Fowler et al., 1988: 94). While articulatory recordings do not straightforwardly support the motor account of the P-center (De Jong, 1992, 1994; Pompino-Marschall et al., 1989), the key suggestion that beat perception in speech may be locked to vowels is further found in other discussions of the P-center effect (Barbosa et al., 2005; Rathcke et al., 2021). For example, Barbosa et al. (2005) hypothesize that P-centers can best be understood as a surface manifestation of the perceptual task to predict upcoming vowel onsets in a sequence of syllables. Similarly, Rathcke et al. (2021) argue that vowels have a special importance for speech perception as they shape the sonority contour of the acoustic speech signal. The resulting fluctuations in signal sonority may support speech segmentation and assist first-language acquisition (Räsänen et al., 2018). Naturally evolved drummed languages such as Amazonian Bora also make use of such sonority fluctuations and map rhythmic units onto intervocalic intervals (Seifart et al., 2018), further corroborating the special role of vowels for the perception of speech rhythm. In languages spoken around the world, the nucleus (most frequently occupied by a vowel) represents the only obligatory constituent of a syllable and is often reflected in a local sonority maximum (Blevins, 1995).
Rathcke et al. (2021) also note that during rhythmic synchronization with the beat of natural sentences, it was particularly the very first vowel of a sentence (i.e., a vowel occurring after an acoustic silence) that showed anticipation. All subsequent vowels (i.e., vowels embedded in a meaningful sentence) were synchronized with more precisely and in a less anticipatory way. Notably, previous studies of P-center location experimented with isolated words concatenated using silent pauses. It is therefore not implausible to hypothesize that the P-center may reflect a temporal prediction of a vowel onset that is expected to occur after a silent pause. This explanation of the P-center effect is in line with current evidence of negative mean asynchrony obtained in rhythmic synchronization tasks with a variety of auditory stimuli (Aschersleben, 2002; Repp and Su, 2013). Accordingly, measurable anticipation of regular auditory prompts occurs specifically when those prompts are interspersed with acoustic silences but is attenuated, or even completely removed, in complex, continuous rhythmic contexts such as music (see Chapter 6). This hypothesis of P-center origin requires experimental testing in future research.
Alternative accounts of the P-center assume that the effect is rather acoustic (Howell, 1988; Scott and Howell, 1992; Šturm and Volín, 2016; Vos and Rasch, 1981) or psychoacoustic (Harsin, 1993, 1997; Pompino-Marschall, 1989) in nature. These accounts highlight that the P-center is neither unique to speech nor completely independent of the spectro-temporal features of the stimuli tested. Psychoacoustic models define P-centers with reference to critical-band audio frequency regions that matter for the human auditory system (see Chapter 3). Accordingly, the P-center effect arises due to a salient acoustic energy change within the critical audio frequency bands of an entire syllable. The P-center itself can then be best modeled as a tracker of acoustic changes at relevant frequencies to which a perceptual weighting function is applied. The weighting function integrates the knowledge of critical bands as well as perceptual thresholds that need to be reached for the P-center effect to arise at a given point in time.
Purely acoustic models tend to abstract away from the complexities of critical frequency bands and nonlocal influences on P-center location. These models see the origin of the effect in the perceptual system sampling amplitude envelopes at onsets of auditory input units (e.g., syllables, tones, or metronome clicks) and responding particularly sensitively to salient points of the maximal rate of change. Most recent installments of this account further suggest that the perceptual system may not be simply attracted to the local moments of the fastest energy change but is sensitive to the overall rate of change in the amplitude envelope (i.e., slope and rise time). Accordingly, only some acoustic signals lend themselves readily to the perception of a clear P-center at a certain point in time (Villing et al., 2011), while others may be perceived like a “broad slur” (Benadon, 2014). A notion of a “beat bin” has been put forward to account for the fact that the clarity of the P-center tends to vary across different types of auditory input (Danielsen et al., 2019), though this idea has not yet been comprehensively investigated, particularly in speech.
11.5 Unresolved Issues and Future Directions
Apart from the controversies surrounding the exact P-center location, difficulties with the development of suitable P-center algorithms, and limited availability of cross-linguistic evidence, the current understanding of the effect faces one key issue: individual variability. As Pompino-Marschall (1989) notes, listener performance in P-center tasks can differ rather substantially. Early work even reported on difficulties in determining P-center locations with inexperienced listeners (Cooper et al., 1988; Fox and Lehiste, 1987b; Morton et al., 1976). Several studies reviewed above are based on data from no more than three or four participants, which, in the presence of large individual variability, suggests that the difficulty in establishing P-center location may be even greater than currently appreciated, though it may also be more meaningful than currently assumed. Studying the P-center effect with a more representative sample of 23 Cantonese participants, Chow et al. (2015: 63) also noted that the participants “behaved quite differently from one another.” The answer to the question of which individual listener traits and characteristics moderate the perceptual variability may potentially help to better explain the origins and the nature of the effect. Some preliminary findings indicate that musical training (or, more generally, musical aptitude) plays a role in rhythmic tasks such as P-center paradigms (Rathcke et al., 2021; Šturm and Volín, 2016). Participants with higher levels of musical training show reduced variability of P-center responses (Šturm and Volín, 2016) and higher accuracy in rhythmic synchronization with vowel onsets (Rathcke et al., 2021). Having studied the effect with highly skilled musicians, Villing et al. (2011: 1626) found consistent results across a range of tasks and argued that P-centers demonstrated “a reliable and universal percept,” a conclusion that probably owed a lot to the homogeneity of the participating group of listeners.
However, the role of musical aptitude and training is possibly limited to the perception of participants whose native languages do not have lexical tone, as Chow et al. (2015) did not observe any systematic differences in P-center location among musically trained and untrained Cantonese participants. Little is known about whether, and how, the native language(s) of listeners shape their beat perception in speech and other complex auditory signals, which is a fruitful avenue to explore in future studies of the P-center effect.
Summary
The P-center refers to the perceptual moment of occurrence of a speech unit and has been hypothesized to represent the beat in spoken language. It can be found among many other controversial concepts surrounding the idea of rhythm in speech and language. Over decades of study, the exact location and the nature of the P-center have remained largely unresolved, though the concept itself has retained its potential to inform future research.
Implications
The P-center effect has direct implications for the construction of speech stimuli, specifically for those experiments that work with concatenated monosyllables interspersed by silent pauses. If the P-center effect is not considered, an acoustically constant inter-onset interval connecting a string of phonologically variable monosyllables may be perceived as irregular, while a (slightly) jittered concatenation results in a good approximation of perceptual regularity.
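The stimulus-construction point above can be made concrete with a short sketch. It assumes that a per-item estimate of the P-center lag (the delay between an item's acoustic onset and its P-center) is already available, for example from an adjustment task or an envelope-based landmark; the function name and the lag values are hypothetical and serve only to show the arithmetic.

```python
def onsets_for_isochronous_pcenters(pcenter_lags_ms, interval_ms=400.0):
    """Return acoustic onset times (ms) that place P-centers on a regular grid.

    pcenter_lags_ms: for each item, the assumed lag from acoustic onset to P-center.
    The first P-center is placed at the largest lag so that no onset is negative.
    """
    first_pcenter_ms = max(pcenter_lags_ms)
    return [
        first_pcenter_ms + i * interval_ms - lag
        for i, lag in enumerate(pcenter_lags_ms)
    ]


# ASSUMPTION: illustrative lags, not measured values.
lags_ms = [20, 60, 110, 40]
onsets_ms = onsets_for_isochronous_pcenters(lags_ms)

pcenters_ms = [onset + lag for onset, lag in zip(onsets_ms, lags_ms)]
acoustic_iois = [b - a for a, b in zip(onsets_ms, onsets_ms[1:])]
pcenter_iois = [b - a for a, b in zip(pcenters_ms, pcenters_ms[1:])]

print("acoustic inter-onset intervals:", acoustic_iois)
print("P-center intervals:", pcenter_iois)
```

The acoustic inter-onset intervals come out jittered while the P-center intervals are constant, which is the pattern described in the paragraph above. How good the resulting perceptual regularity is depends entirely on the quality of the lag estimates.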
Gains
The key contribution of the P-center to the current understanding of rhythm in speech and language is rather profound as it establishes that the perception of temporal structure in speech, just as the perception of spectral and other features, deviates from the acoustic signal in complex ways and is not universal but language-specific. As such, the P-center signifies that a purely acoustic study of speech rhythm is likely to be futile.
