38.1 Introduction
Young infants are sensitive to the rhythmic patterns and regularities in their auditory input, which is primarily composed of the speech and singing that they hear from their caregivers (Hilton et al., Reference Hilton, Moser and Bertolo2022). This sensitivity is considered to play a fundamental role in the process of early language acquisition in the first years of life (Gervain, Reference Gervain2018; Gervain and Mehler, Reference Gervain and Mehler2010; Jusczyk, Reference Jusczyk1997; Nazzi and Ramus, Reference Nazzi and Ramus2003), as well as to support the development of more advanced language processing abilities in childhood, such as phonological awareness and reading. Infants begin learning the rhythmic patterns of their native language or languages in utero, when they perceive limited segmental but intact suprasegmental information from ambient speech (Moon and Fifer, Reference Moon and Fifer2000). This equips them with the ability to differentiate their native language(s) from rhythmically different nonnative languages already at birth (DeCasper and Fifer, Reference DeCasper and Fifer1980; Nazzi et al., Reference Nazzi, Bertoncini and Mehler1998). This initial language-specific rhythmic sensitivity becomes further fine-tuned and extended over infants’ first year of life. This is observed in the emerging abilities to discriminate between rhythmically similar languages (Bosch and Sebastián-Gallés, Reference Bosch and Sebastián-Gallés1997; Molnar et al., Reference Molnar, Gervain and Carreiras2014) and to detect native prosodic patterns at the utterance, lexical, and sub-lexical levels (Friederici et al., Reference Friederici, Friedrich and Christophe2007; Höhle et al., Reference Höhle, Bijeljac-Babic, Herold, Weissenborn and Nazzi2009; Weber et al., Reference Weber, Hahne, Friedrich and Friederici2004).
These early prosodic abilities, in turn, support the processing and segmentation of incoming continuous speech (e.g., Jusczyk et al., Reference Jusczyk, Cutler and Redanz1993; Hallé and de Boysson-Bardies, Reference Hallé and de Boysson-Bardies1996; Mattys et al., Reference Mattys, Jusczyk, Luce and Morgan1999; Morgan and Saffran, Reference Morgan and Saffran1995; and refer to Chapter 39 for a detailed discussion of infants’ early perception of native language rhythm).
In this chapter, we discuss how infants’ environments are optimized to support their challenging task of identifying and learning the prosodic and rhythmic patterns of their native language(s). We review findings showing that this optimization is not manifested as an enhancement of language-specific rhythmic cues in infant-directed speech. Instead, we discuss evidence for the presence of language-general rhythmic regularities in natural caregiver–infant interactions, in which caregivers communicate with infants not only through speech but also through music, touch, and movement. We show that infants are exposed to multiple rhythmic input modalities, which together constitute optimized intersensory rhythmic input that infants experience from their first months of gestation and after birth. This optimized multisensory stimulation fosters infants’ early ability to extract rhythmic regularities from their environment, supporting subsequent language acquisition.
38.2 Infant-Directed Speech: Rhythmic Information in Infants’ Early Language Input
Infants’ early linguistic environment consists primarily of infant-directed (ID) speech, which refers to the speech style or register that adults spontaneously use in interactions with young infants (Golinkoff et al., Reference Golinkoff, Can, Soderstrom and Hirsh-Pasek2015; Soderstrom, Reference Soderstrom2007). Compared to speech used in interactions among adults or adult-directed (AD) speech, ID speech is characterized by affective, acoustic, and linguistic properties proposed to serve specific functions such as regulating infants’ emotional states (Papoušek et al., Reference Papoušek, Bornstein, Nuzzo, Papoušek and Symmes1990), capturing and maintaining infants’ attention during communicative interactions (Fernald and Simon, Reference Fernald and Simon1984), and facilitating early language development (Kuhl, Reference Kuhl2000). Evidence for ID speech has been documented across languages and language families, and adult listeners are highly accurate at differentiating it from AD speech even in unfamiliar languages, suggesting universality in the acoustic properties of ID speech (Hilton et al., Reference Hilton, Moser and Bertolo2022). Even though ID speech is produced by adults for an infant audience, infants are not passive perceivers of this speech register. From their first months of life, infants prefer listening to ID compared to AD speech (Byers-Heinlein et al., Reference Byers-Heinlein, Tsui and Bergmann2021; Cooper and Aslin, Reference Cooper and Aslin1990), and their active feedback shapes the acoustic properties of the ID speech produced by their caregivers (Lam and Kitamura, Reference Lam and Kitamura2012; Murray and Trevarthen, Reference Murray and Trevarthen1986; Smith and Trainor, Reference Smith and Trainor2008). 
Caregivers are also sensitive to their infants’ developmental needs and evolving listening preferences (Kitamura and Burnham, Reference Kitamura and Burnham1998, Reference Kitamura and Burnham2003; Kitamura and Lam, Reference Kitamura and Lam2009), and adjust their speech accordingly. For instance, ID speech to newborns is highly soothing and comforting, while ID speech to six–nine-month-olds acquires more directive and attention-grabbing qualities (Kitamura and Burnham, Reference Kitamura and Burnham2003), reflecting infants’ readiness to actively engage with objects and people in their environment and attend to linguistic information in their speech input (Newman and Hussain, Reference Newman and Hussain2006). Similarly, as infants approach their second birthday, ID speech becomes faster and exhibits a reduction in mean pitch and pitch range (Cox et al., Reference Cox, Bergmann and Fowler2023), progressively approximating the less prosodically exaggerated child-directed and AD speech registers.
Several acoustic and prosodic adaptations in ID relative to AD speech have been proposed to directly support early language processing and subsequent language acquisition. For instance, phonetic exaggeration, typically indexed by the expansion of the acoustic distance between the three corner vowels /i/, /u/, and /a/ (Burnham et al., Reference Burnham, Kitamura and Vollmer-Conna2002; Kuhl et al., Reference Kuhl, Andruski and Chistovich1997), has been proposed to directly foster early speech perception abilities by providing infants with exposure to clear speech and more distinct phonetic categories (Kuhl, Reference Kuhl2000). Indeed, infants show more efficient vowel processing and word recognition when presented with ID speech containing acoustically exaggerated vowels (Lovcevic et al., Reference Lovcevic, Burnham and Kalashnikova2022; Peter et al., Reference Peter, Kalashnikova, Santos and Burnham2016; Song et al., Reference Song, Demuth and Morgan2010), and the extent to which individual mothers exaggerate vowels in ID speech correlates with their infants’ concurrent speech perception abilities and future vocabulary size (Hartman et al., Reference Hartman, Ratner and Newman2017; Kalashnikova and Burnham, Reference Kalashnikova and Burnham2018; Kalashnikova and Carreiras, Reference Kalashnikova and Carreiras2022; Liu et al., Reference Liu, Kuhl and Tsao2003; but see Martin et al., Reference Martin, Schatz and Versteegh2015; McMurray et al., Reference McMurray, Kovack-Lesh, Goodwin and McEchron2013, for findings contradicting this claim). Prosodic characteristics of ID speech including expanded pitch range and slow speech rate have also been linked to enhanced speech processing. 
In experimental paradigms, infants are more successful at segmenting continuous ID speech characterized by expanded pitch range (Floccia et al., Reference Floccia, Keren-Portnoy and DePaolis2016; Schreiner and Mani, Reference Schreiner and Mani2017; Thiessen et al., Reference Thiessen, Hill and Saffran2005) and recognizing words produced in slow ID speech (Song et al., Reference Song, Demuth and Morgan2010).
Based on this evidence linking the characteristics of caregivers’ ID speech to language outcomes, several studies have hypothesized that specific acoustic cues to the rhythmic patterns of infants’ native language would also be exaggerated compared to AD speech, and that this would support the development of native prosodic competence and speech segmentation abilities. Traditionally, the world’s languages have been classified as belonging to one of three rhythmic classes defined by the size of their prosodic units, which include the metrical foot in stress-timed languages (e.g., English), the syllable in syllable-timed languages (e.g., Spanish), and the mora in mora-timed languages (e.g., Japanese) (Abercrombie, Reference Abercrombie1967). More recently, this classification has been debated (e.g., Turk and Shattuck-Hufnagel, Reference Turk and Shattuck-Hufnagel2013), but there is evidence that a language’s rhythmic structure can be defined by several durational measures that represent the proportion and variability of duration with which consonantal and vocalic intervals occur in that language (e.g., Ramus et al., Reference Ramus, Nespor and Mehler1999, Reference Ramus, Hauser, Miller, Morris and Mehler2000; and refer to Chapter 11 for a discussion of other measures of speech rhythm). A recent meta-analysis shows that infants rely on these durational cues in language differentiation tasks, whereby smaller differences in vocalic variability and larger differences in consonantal variability between languages lead to better discrimination performance from birth to 12 months (Gasparini et al., Reference Gasparini, Langus, Tsuji and Boll-Avetisyan2021).
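The durational measures referred to here (%V, the proportion of utterance duration that is vocalic, and ΔV/ΔC, the variability of vocalic and consonantal interval durations) can be computed directly from a segmented utterance. The sketch below is an illustrative implementation only; the function name and the toy interval durations are invented for demonstration, not drawn from any of the cited studies.

```python
from statistics import pstdev

def rhythm_metrics(intervals):
    """Ramus et al. (1999)-style durational rhythm metrics.

    `intervals`: sequence of (label, duration_s) pairs, with label 'V'
    for a vocalic interval and 'C' for a consonantal interval.
    Returns (%V, deltaV, deltaC): the proportion of utterance duration
    that is vocalic, and the standard deviations of vocalic and
    consonantal interval durations.
    """
    v = [d for lab, d in intervals if lab == 'V']
    c = [d for lab, d in intervals if lab == 'C']
    percent_v = 100.0 * sum(v) / (sum(v) + sum(c))
    return percent_v, pstdev(v), pstdev(c)

# Hypothetical segmentation of a short utterance (durations in seconds).
toy = [('C', 0.08), ('V', 0.12), ('C', 0.10), ('V', 0.15),
       ('C', 0.06), ('V', 0.11)]
percent_v, delta_v, delta_c = rhythm_metrics(toy)
```

On such a segmentation, stress-timed languages tend toward lower %V and higher ΔC than syllable-timed ones, which is the kind of cross-language contrast the discrimination findings above build on.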
These language-specific durational cues may be difficult to extract for a novice language-learner, so they would be expected to be exaggerated in the slower and prosodically exaggerated ID speech. To test this, Payne et al. (Reference Payne, Post, Astruc, Prieto and Vanrell2009) compared durational rhythmic cues in English, Spanish, and Catalan ID speech to two-, four-, and six-year-olds (a register also referred to as child-directed speech, reflecting the recipient children’s ages). These three languages differ significantly in their rhythmic structure (English is stress-timed, Spanish is syllable-timed, and Catalan falls in between). Results showed that even though ID speech differed acoustically from AD speech in each language, instead of enhancing cross-linguistic differences, the ID adaptations resulted in greater rhythmic similarities. Across languages, ID speech was more vocalic and had reduced variability in consonantal segments. Lee et al. (Reference Lee, Kitamura, Burnham and McAngus Todd2014) assessed the same cues in English ID speech to infants from 0 to 12 months but failed to replicate the ID versus AD speech differences reported by Payne et al. However, using a modelling approach that assessed the sonorant structure of ID and AD speech (which measures the regularity with which syllables occur in speech based on prominence cues such as intensity, pitch, spectral balance, and duration; Lee and Todd, Reference Lee and Todd2004), this study also reported that stressed and unstressed syllables were differentiated to a lesser extent in ID compared to AD speech. Finally, Tajima et al. (Reference Tajima, Tanaka, Martin and Mazuka2013) investigated the durational cues of ID and AD speech in Japanese, a mora-timed language. Their analyses also failed to reveal any ID adjustments that would support the identification of mora segments in continuous speech. 
As can be seen, there are some inconsistencies across these studies, which are likely due to cross-dialectal and cross-linguistic differences and assessments of speech directed to infants and children spanning a wide age range. Despite this issue, this evidence suggests that ID speech does not enhance, and may even distort, durational cues that signal language-specific rhythm structure, potentially in favor of the prosodic adjustments that serve the affective and attention-grabbing functions of this speech register (Kempe et al., Reference Kempe, Brooks and Gillis2005).
Even if caregivers do not exaggerate language-specific rhythmic properties in ID speech, the findings reviewed above indicate that the segmental and suprasegmental properties of this register result in a temporal structure that is significantly different from AD speech (Nencheva and Lew-Williams, Reference Nencheva and Lew-Williams2022). Specifically, ID speech exhibits greater rhythmic regularities, resulting in a more isochronous signal. At first glance, this may appear to contradict the notion that ID speech is optimized for promoting the acquisition of language-specific rhythmic competence, but it may be the case that rhythmic regularization leads to more efficient neural encoding and processing of speech by young infants. This argument is based on extensive evidence from neurophysiological studies for entrainment between endogenous oscillatory neural activity and incoming sensory information (Ding et al., Reference Ding, Patel and Chen2017; and see Chapters 3, 5, 35, 36, and 37).
In the case of speech, specifically, linguistic information across multiple timescales is conveyed by the energy fluctuations of the amplitude envelope (i.e., the lexical and phrasal rate 1–4 Hz, the syllabic rate 5–8 Hz, and the phonological rate 30–50 Hz), which are tracked by corresponding frequency bands of neural oscillations in the auditory cortex (delta, theta, and gamma bands, respectively) (Ding et al., Reference Ding, Melloni, Zhang, Tian and Poeppel2016; Ding and Simon, Reference Ding and Simon2014; Ghitza, Reference Ghitza2012; Peelle and Davis, Reference Peelle and Davis2012; Poeppel and Assaneo, Reference Poeppel and Assaneo2020; and refer to Chapter 35 for a detailed discussion of cortical tracking of speech). Speech processing is primarily governed by efficient entrainment to the slowly occurring information in the speech envelope (delta and theta oscillation bands) (Ghitza, Reference Ghitza2012; Gross et al., Reference Gross, Hoogenboom and Thut2013; Luo and Poeppel, Reference Luo and Poeppel2007), which significantly correlates with speech intelligibility and successful comprehension (Arnal and Giraud, Reference Arnal and Giraud2012; Vanthornhout et al., Reference Vanthornhout, Decruy, Wouters, Simon and Francart2018). Both top-down processes and bottom-up information contribute to the efficiency of cortical entrainment in individual listeners (Di Liberto et al., Reference Di Liberto, Lalor and Millman2018; Doelling et al., Reference Doelling, Arnal, Ghitza and Poeppel2014; Peelle et al., Reference Peelle, Gross and Davis2013). Entrainment is more efficient when listeners have higher proficiency in the language in which the stimuli are presented (Lizarazu et al., Reference Lizarazu, Carreiras and Molinaro2023), and when they actively attend to the stimulus string (Golumbic et al., Reference Zion Golumbic, Ding and Bickel2013; Obleser and Kayser, Reference Obleser and Kayser2019). 
The bottom-up cues that modulate entrainment include speech rate and rhythmic regularity (Aubanel et al., Reference Aubanel, Davis and Kim2016; Lizarazu et al., Reference Lizarazu, Lallier and Molinaro2019). Thus, it is plausible that young infants who have limited access to top-down information during speech processing particularly benefit from exposure to ID speech, which is precisely a slow, attention-grabbing, and more isochronous speech signal.
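As an illustration of the envelope-band analysis described above, the sketch below extracts a broadband amplitude envelope from a signal and isolates its delta-range (1–4 Hz) component. All parameters and the synthetic amplitude-modulated "stimulus" are assumptions made for demonstration, not the materials of any cited study.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def band_envelope_phase(signal, fs, lo, hi):
    """Return the instantaneous phase of the lo-hi Hz band of the
    amplitude envelope of `signal`, sampled at `fs` Hz."""
    env = np.abs(hilbert(signal))                      # broadband amplitude envelope
    b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype='band')
    band = filtfilt(b, a, env - env.mean())            # one envelope band, zero-phase filtered
    return np.angle(hilbert(band))                     # instantaneous phase over time

# Synthetic stimulus: a 30 Hz carrier amplitude-modulated at 2 Hz,
# i.e. a delta-rate fluctuation like the phrasal rhythm of speech.
fs = 100
t = np.arange(0, 10, 1 / fs)
stimulus = (1 + 0.8 * np.sin(2 * np.pi * 2 * t)) * np.sin(2 * np.pi * 30 * t)
delta_phase = band_envelope_phase(stimulus, fs, lo=1, hi=4)
```

Neural tracking analyses then compare such stimulus envelope phases with the phase of cortical activity filtered into the matching band.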
Several studies support this proposal. Falk and Kello (Reference Falk and Kello2017) analyzed the temporal structure of the amplitude envelope of German ID speech and singing to six-month-old infants. They found that acoustic energy across frequencies corresponding to different units in the linguistic hierarchy were clustered to a greater extent in ID than AD registers. That is, the hierarchical temporal structure of ID registers was more regular compared to AD counterparts. Leong et al. (Reference Leong, Kalashnikova, Burnham and Goswami2014) assessed naturally produced English ID and AD speech to infants from seven to 11 months, focusing specifically on low-frequency information in the amplitude envelope. Their findings revealed greater delta–theta phase synchronization in ID speech compared to greater theta–gamma synchronization in AD speech. These results indicate greater rhythmic regularity in ID compared to AD speech, in particular at the rates at which stressed and unstressed syllables occur. Pérez-Navarro et al. (Reference Pérez-Navarro, Lallier, Clark, Flanagan and Goswami2022) recently replicated this result for Spanish child-directed speech to four-year-olds, suggesting that these adjustments extend across rhythm classes (i.e., in stress-timed English and syllable-timed Spanish) and may be preserved as children develop more advanced language abilities over time. Critically, the benefits of these rhythmic adjustments in ID and child-directed speech are reflected in direct measures of infant speech processing: Infants show more efficient neural entrainment to ID compared to AD speech (Kalashnikova et al., Reference Kalashnikova and Burnham2018; Menn et al., Reference Menn, Michel, Meyer, Hoehl and Männel2022a). 
Thus, the language-general rhythmic regularities in ID speech play an important role in facilitating early processing and encoding of speech, which may in turn promote infants’ subsequent learning of the specific rhythmic patterns of their native language and the extraction of meaningful linguistic units from continuous speech.
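The band-to-band phase relations discussed in this section, such as the delta–theta synchronization reported by Leong et al., are commonly quantified with an n:m phase-locking index. The sketch below shows the general form of such an index on synthetic phase series; the 2 Hz and 4 Hz test oscillations and the 1:2 locking ratio are illustrative assumptions, not the parameters of the cited studies.

```python
import numpy as np

def nm_phase_locking(phase_slow, phase_fast, n=1, m=2):
    """n:m phase-locking index between two phase time series (radians).
    Returns a value in [0, 1]: 1 indicates a perfectly constant n:m
    phase relation, 0 indicates no consistent phase relation."""
    return np.abs(np.mean(np.exp(1j * (n * phase_fast - m * phase_slow))))

fs = 200
t = np.arange(0, 10, 1 / fs)
slow = np.angle(np.exp(1j * 2 * np.pi * 2 * t))    # 2 Hz ("delta"-rate) phase
fast = np.angle(np.exp(1j * 2 * np.pi * 4 * t))    # 4 Hz ("theta"-rate) phase
locked = nm_phase_locking(slow, fast, n=1, m=2)    # fast runs at exactly 2x slow
rng = np.random.default_rng(0)
unlocked = nm_phase_locking(slow, rng.uniform(-np.pi, np.pi, t.size))
```

Here `locked` is near 1 because the two series hold a fixed 1:2 relation, while `unlocked` is near 0 for the random phase series; higher values of such an index in ID than AD speech are what indicates its greater rhythmic regularity.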
38.3 ID Singing: A Multimodal Source of Rhythmic Information
These temporal adjustments bring ID speech closer to the rhythmically regular and melodic structure of music (Daikoku and Goswami, Reference Daikoku and Goswami2022). Music is another highly prominent auditory signal available to infants before and after birth (Kisilevsky et al., Reference Kisilevsky, Hains, Jacquet, Granier-Deferre and Lecanuet2004), and provides them with valuable rhythmic information (Papadimitriou et al., Reference Papadimitriou, Smyth, Politimou, Franco and Stewart2021). After birth, the most common source of music for most infants is their caregivers’ singing (Trehub et al., Reference Trehub, Unyk and Kamenetsky1997). Caregivers across the world’s cultures have been documented to sing to their infants (Trehub and Russo, Reference Trehub, Russo, Russo, Ilari and Cohen2020), and their ID singing differs from AD singing in a number of acoustic and prosodic properties, including higher pitch, increased amplitude, and greater frequency variation. ID singing has been proposed to serve primarily an affective function by regulating infants’ arousal and emotional states (Cirelli et al., Reference Cirelli, Trehub and Trainor2018; Salimpoor et al., Reference Salimpoor, Benovoy, Longo, Cooperstock and Zatorre2009; Trainor, Reference Trainor2006) as well as promoting social bonding in caregiver–infant interactions (Cirelli and Trehub, Reference Cirelli and Trehub2019), which is crucial for healthy socio-emotional development (Poćwierz-Marciniak and Harciarek, Reference Poćwierz-Marciniak and Harciarek2021). As is the case for ID speech, caregivers dynamically adjust the characteristics of ID singing according to their infants’ emotional and developmental needs and their own intention to capture and maintain their infants’ attention or modify the infants’ emotional state (Delavenne et al., Reference Delavenne, Gratier and Devouche2013; Rock et al., Reference Rock, Trainor and Addison1999).
Lullabies, for instance, have a slower tempo, simpler melodies, and a reduced pitch range, and they aim to calm and soothe the infant. Play songs, on the other hand, have a faster tempo, more complex and varied melodies, and a wider pitch range, and their aim is to engage and animate the infant (Trainor and Trehub, Reference Trainor and Trehub1998). Infants, in turn, show their enjoyment of ID singing by relaxing, displaying less distress, and exhibiting more positive emotions (Shannon, Reference Shannon2006; Trehub et al., Reference Trehub, Ghazban and Corbeil2015). Interestingly, infants’ responses are also molded by the type of songs that they hear. Infants direct attention internally, focusing on self-regulation, during lullabies, but direct their attention externally toward the caregiver through mutual gaze during rhythmic play songs (Rock et al., Reference Rock, Trainor and Addison1999).
Critically, the exaggerated acoustic features of ID singing delineate its hierarchical beat structure, helping infants process phrase boundaries, rhythm, and grouping structures (Longhi, Reference Longhi2009). This means that not only the musical but also the linguistic information encountered in this register can be processed more effectively, supporting language acquisition. Several recent studies have assessed neural entrainment in young infants in response to pre-recorded sung ID speech and live, naturally produced ID singing and found efficient tracking at the delta and theta frequencies, relevant for speech encoding (Attaheri et al., Reference Attaheri, Choisdealbha and Di Liberto2022; Menn et al., Reference Menn, Michel, Meyer, Hoehl and Männel2022a; Nguyen et al., Reference Nguyen, Reisner and Lueger2023). Nguyen et al. (Reference Nguyen, Reisner and Lueger2023) measured neural entrainment of seven-month-old infants in response to lullabies and play songs, which display different acoustic and prosodic properties, as discussed above. Infants tracked lullabies more efficiently, which was primarily related to the slow tempo and beat clarity in this song type. However, infants were more likely to produce rhythmic movements when listening to play songs, which was related to the perceived loudness of these stimuli. Thus, it appears that these different types of ID songs not only elicit different attentional responses from infants (Rock et al., Reference Rock, Trainor and Addison1999) but may also have differential functions in facilitating infants’ encoding of the linguistic information in these songs (Franco et al., Reference Franco, Suttora, Spinelli, Kozar and Fasolo2022). 
Evidence is starting to emerge with several studies showing moderate but significant correlations between individual infants’ efficiency of neural entrainment to ID songs measured in the first year of life and language outcomes, in particular, vocabulary size in the second year (Attaheri et al., Reference Attaheri, Choisdealbha and Di Liberto2022; Menn et al., Reference Menn, Ward and Braukmann2022b; Nguyen et al., Reference Nguyen, Reisner and Lueger2023). Thus, the regular prosodic and rhythmic structure of ID singing, similarly to ID speech, may play a significant role in fostering infants’ speech processing and supporting later acquisition of the specific rhythmic properties of their native language and subsequent language acquisition.
ID speech and singing convey auditory-only information that facilitates infants’ auditory processing. However, solely focusing on the auditory modality misrepresents the true multimodal nature of caregiver–infant communication and fails to capture the facilitative role that ID adjustments in other sensory modalities may play in supporting language development, to which we now turn.
Infants are typically exposed to ID speech and singing in the context of active face-to-face interactions with their caregivers, in which the auditory input is supported by rich multimodal information that can also foster the process of encoding and segmenting incoming speech input. For instance, infants attend to the speaker’s face when listening to ID speech and singing, which leads to enhanced neural entrainment and greater interpersonal synchronization with their caregivers (Lense et al., Reference Lense, Shultz, Astésano and Jones2022; Tan et al., Reference Tan, Kalashnikova, Di Liberto, Crosse and Burnham2022). Caregivers also employ touch and gesture to emphasize relevant linguistic information when interacting with their young infants (Abu-Zhaya et al., Reference Abu-Zhaya, Seidl and Cristia2017; O’Neill et al., Reference O’Neill, Bard, Linnell and Fluck2005). Furthermore, and particularly relevant to infants’ perception of rhythmic information in the input, ID singing (but not ID speech; Zentner and Eerola, Reference Zentner and Eerola2010) elicits active rhythmic movement from the infants themselves. Therefore, in addition to perceiving rhythmic information in the auditory modality, infants have the opportunity to experience it in their own passive and active body movements.
Similarly to adults, infants are enticed to move when they hear music (Gordon, Reference Gordon2003). Several studies show that infants in their first year of life are more likely to produce rhythmic movements in response to ID singing and musical sequences compared to other stimuli, including ID speech (e.g., Ilari, Reference Ilari2015; Nguyen et al., Reference Nguyen, Reisner and Lueger2023; Zentner and Eerola, Reference Zentner and Eerola2010). Most importantly, there is also evidence to suggest that infants’ movements are coordinated with the temporal structure of the input, signalling a connection between infants’ rhythmic perception and rhythmic movement production (de l’Etoile et al., Reference de l’Etoile, Bennett and Zopluoglu2020; Longhi, Reference Longhi2009; Zentner and Eerola, Reference Zentner and Eerola2010). For instance, a recent study by de l’Etoile et al. (Reference de l’Etoile, Bennett and Zopluoglu2020) assessed whether infants’ movements related to rhythmic regularities in the stimuli, and whether infants were sensitive to rhythmic training. Six–10-month-old infants’ movements were recorded in response to silence, irregular auditory stimuli, and rhythmically regular song recordings. Infants’ movements were more regularly timed in response to the rhythmically regular songs. While infants’ ability to precisely time their rhythmic movements to the tempo of musical stimuli that they hear seems to emerge much later in preschool years (Drake et al., Reference Drake, Jones and Baruch2000; Eerola et al., Reference Eerola, Luck and Toiviainen2006), they show early sensitivity to the rhythmic structure of music, which is observed in the greater quantity and higher regularity of their rhythmic movements.
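The notion of "more regularly timed" movements can be operationalized in several ways; one simple, hypothetical index is the coefficient of variation (CV) of the intervals between successive movement onsets, where a lower CV means more regular timing. The onset times below are invented for illustration and do not reproduce the cited data.

```python
from statistics import mean, pstdev

def timing_regularity(onsets_s):
    """Coefficient of variation of inter-onset intervals.
    Lower values indicate more regularly timed movements."""
    intervals = [b - a for a, b in zip(onsets_s, onsets_s[1:])]
    return pstdev(intervals) / mean(intervals)

# Hypothetical movement onsets (seconds) for one infant in each condition.
to_regular_song = [0.0, 0.52, 1.01, 1.55, 2.04, 2.58]
to_irregular_sound = [0.0, 0.31, 1.22, 1.49, 2.47, 2.71]
cv_song = timing_regularity(to_regular_song)
cv_irregular = timing_regularity(to_irregular_sound)
```

In this toy case the near-isochronous onsets produced during the regular song yield a much lower CV than the erratic onsets, the kind of contrast reported between rhythmically regular and irregular stimuli.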
38.4 Early Vestibular, Tactile, and Somatosensory Experiences with Rhythm
Movement plays a significant role in rhythm processing, so hearing is evidently not the only sensory modality involved in perceiving the rhythm of music and speech. Most importantly, infants start perceiving rhythm through vestibular, tactile, and somatosensory (VTS) experiences (Provasi et al., Reference Provasi, Anderson and Barbu-Roth2014) well before they start moving in response to music and ID singing. Specifically, the first experiences with rhythm occur already in utero, encompassing multiple sources of rhythmic stimulation produced by the mother’s body movements (Lecanuet and Schaal, Reference Lecanuet and Schaal2002). As a consequence, vestibular and somatosensory input has been found to be central to rhythm perception after birth and during early development (Phillips-Silver and Trainor, Reference Phillips-Silver and Trainor2005; Tichko et al., Reference Tichko, Kim and Large2022; Trainor et al., Reference Trainor, Gao, Lei, Lehtovaara and Harris2009; and refer to Chapter 4 for a sensorimotor account of multimodal prosody, including the role of the vestibular system). The VTS system is the first to develop in the human fetus: Cutaneous and trigeminal somatosensory receptors mature at around four to seven weeks of gestation (Humphrey, Reference Humphrey1964). While the early emergence of somatosensory anatomy and function is shortly followed by vestibular and chemosensory functioning, auditory and visual functioning are first observed much later in prenatal development (Bremner and Spence, Reference Bremner and Spence2017). From the third trimester of gestation, fetuses perceive rhythmic signals coming from the intrauterine and extrauterine environment through bone conduction (Sohmer et al., Reference Sohmer, Perez, Sichel, Priner and Freeman2001).
Specifically, the maternal voice is transmitted directly to the amniotic fluid via body tissues and bones, with pitch and the first overtones being fully conducted through the spine and the pelvic arch (Petitjean, Reference Petitjean1989). External low frequencies (corresponding to the first four octaves of the piano) are perceived in utero, with bone conduction preserving the rhythmic patterns of both speech and music (Granier-Deferre et al., Reference Granier-Deferre and Busnel2011; Mampe et al., Reference Mampe, Friederici, Christophe and Wermke2009).
VTS mechanisms, crucial for early rhythmic experiences, also provide the context in which fetal “hearing” occurs. Maternal heartbeat, breathing, and walking all produce movement together with sound (Kisilevsky et al., Reference Kisilevsky, Killen, Muir and Low1991), so it is probably the cross-modal temporal synchrony generated by the coupling of VTS and auditory signals that is crucial in shaping rhythm processing in utero (Provasi et al., Reference Provasi, Anderson and Barbu-Roth2014). Consistent with this, premature infants who receive VTS stimulation adapt their breathing to the rate and acceleration of VTS stimuli (Zimmerman and Barlow, Reference Zimmerman and Barlow2012). The very early exposure to the isochronous pulse of maternal heartbeat and breathing patterns may tune the developing neural and physiological systems that support rhythm perception after birth toward familiar patterns (Teie, Reference Teie2016). Memory of these experiences is maintained after birth (Ullal-Gupta et al., Reference Ullal-Gupta, Vanden Bosch der Nederlanden, Tichko, Lahav and Hannon2013), with newborns preferring to listen to music and nursery rhymes heard prenatally compared to unfamiliar tracks (DeCasper and Spence, Reference DeCasper and Spence1986; Hepper, Reference Hepper1991). That is, VTS stimulation combines with auditory input to constitute the prenatal experience of rhythm. The effects of these early multimodal experiences are seen in infants’ preferences for regular, binary rhythms after birth and throughout development (DeCasper and Sigafoos, Reference DeCasper and Sigafoos1983; Doheny et al., Reference Doheny, Hurwitz, Insoft, Ringer and Lahav2012; Lahav et al., Reference Lahav, Saltzman and Schlaug2007; Tichko et al., Reference Tichko, Kim and Large2022).
Passive experience with movements that generate VTS stimulation continues after birth, significantly influencing the perception of auditory rhythm in infants (Phillips-Silver and Trainor, Reference Phillips-Silver and Trainor2005). Across cultures, caregivers use VTS rhythms to calm their newborns, for instance by moving them back and forth or by rocking them while singing or walking (Provasi et al., Reference Provasi, Anderson and Barbu-Roth2014). A recent study demonstrated that the cadence of parent movements and walking is related to the spontaneous motor tempo of young infants (Rocha et al., Reference Rocha, Southgate and Mareschal2021). Furthermore, studies with adult and infant participants showed that body movement can bias the auditory encoding of ambiguous rhythms (Phillips-Silver and Trainor, Reference Phillips-Silver and Trainor2005, Reference Phillips-Silver and Trainor2007). Phillips-Silver and Trainor (Reference Phillips-Silver and Trainor2005) exposed infants to a rhythmically ambiguous auditory pattern while they were bounced in time with a duple or a triple meter. After this exposure, infants preferred the meter pattern to which they were bounced. Watching someone else moving in time with a meter, however, was not sufficient to elicit similar effects in infants or adults, demonstrating that the active movement of the body is crucial in rhythm encoding and processing (Phillips-Silver and Trainor, Reference Phillips-Silver and Trainor2007, Reference Phillips-Silver and Trainor2008). These findings confirm that active and passive motion generating VTS experience not only shapes the early development of structural and functional mechanisms underlying rhythm processing but also influences rhythm processing right after birth and throughout development.
VTS experiences continue to mediate the interaction between rhythm ability and language development throughout infancy and early childhood. VTS and auditory rhythmic abilities have been proposed as a potential scaffold for early interpersonal synchrony between mothers and infants (Trehub, Reference Trehub2003), and VTS–auditory coupling plays a role in sustaining early language acquisition. For instance, synchronous tactile cues were shown to help four- and five-month-olds to find words in continuous speech, promoting word learning (Abu-Zhaya et al., Reference Abu-Zhaya, Seidl and Cristia2017; Seidl and Cristià, Reference Seidl and Cristià2008). Moreover, audio-tactile stimulation resulted in enhanced event-related potential (ERP) responses and higher beta-band activity (at 15−16 Hz) recorded using electroencephalography (EEG), compared to auditory-only stimulation in eight-month-old infants (Tanaka et al., Reference Tanaka, Kanakogi, Kawasaki and Myowa2018). Consistent with this, parents tend to spontaneously synchronize the location and timing of their touches on the infant’s body with word rhythm in ID speech (Custode and Tamis-LeMonda, Reference Custode and Tamis-LeMonda2020; Lew-Williams et al., Reference Lew-Williams, Ferguson, Abu-Zhaya and Seidl2019; Tincoff et al., Reference Tincoff, Seidl, Buckley, Wojcik and Cristia2019). Later in development, preschoolers who can entrain their motion to an external beat display more faithful neural encoding of temporal modulations in speech and achieve higher scores on tests of early language skills (Woodruff Carr et al., Reference Woodruff Carr, White-Schwoch, Tierney, Strait and Kraus2014).
On the other hand, deficits in rhythmic, VTS, and sensorimotor abilities are shared across atypically developing populations, including individuals with developmental dyslexia (Beker et al., Reference Beker, Foxe and Molholm2021; Goswami, Reference Goswami2002; Power et al., Reference Power, Mead, Barnes and Goswami2013), attention deficit disorder (Carrer, Reference Carrer2015; Puyjarinet et al., Reference Puyjarinet, Bégel, Lopez, Dellacherie and Dalla Bella2017), autism (Fitzpatrick et al., Reference Fitzpatrick, Romero and Amaral2017; Franich et al., Reference Franich, Wong, Yu and To2021, see also Chapter 47 on rhythmic-prosodic synchrony in speakers with and without autism), and developmental coordination disorder (Chang et al., Reference Chang, Li and Chan2021; Puyjarinet et al., Reference Puyjarinet, Bégel, Lopez, Dellacherie and Dalla Bella2017; Trainor et al., Reference Trainor, Chang, Cairney and Li2018). For instance, individuals with dyslexia who exhibit weaker performance in rhythm perception and production tasks also tend to show weaker phonological awareness (Flaugnacco et al., Reference Flaugnacco, Lopez and Terribili2014; Forgeard et al., Reference Forgeard, Schlaug and Norton2008; Goswami et al., Reference Goswami, Gerson and Astruc2010; Huss et al., Reference Huss, Verney, Fosker, Mead and Goswami2011; Kalashnikova et al., Reference Kalashnikova, Burnham and Goswami2021; Lee et al., Reference Lee, Sie, Chen and Cheng2015; Thomson and Goswami, Reference Thomson and Goswami2008) and poorer reading skills (Dellatolas et al., Reference Dellatolas, Watier, Le Normand, Lubart and Chevrie-Muller2009; Flaugnacco et al., Reference Flaugnacco, Lopez and Terribili2015; Goswami et al., Reference Goswami, Gerson and Astruc2010, Reference Goswami, Huss, Mead, Fosker and Verney2013; Muneaux et al., Reference Muneaux, Ziegler, Truc, Thomson and Goswami2004; Thomson and Goswami, Reference Thomson and Goswami2008).
Furthermore, individuals with dyslexia also demonstrate impaired processing of auditory rise times, which has been linked to inefficient entrainment between the neural oscillatory activity and the speech signal (Huss et al., Reference Huss, Verney, Fosker, Mead and Goswami2011; Leong et al., Reference Leong, Hämäläinen, Soltész and Goswami2011; Chapter 35).
38.5 Directions for Future Research
This chapter highlights the need for research on early rhythm development to focus on understanding how infants’ rhythmic experiences intertwine across modalities, with VTS, auditory, visual, and sensorimotor inputs being simultaneously perceived and processed in an integrated manner. Specifically, it is vital to understand how these different input modalities interact and are integrated, as well as their combined effects on infants’ language development. Recent methodological and technological advances offer the opportunity to investigate this multimodal integration in increasingly ecologically valid paradigms (see Chapter 36). This can be achieved by using infant-friendly neurophysiological and neuroimaging techniques (e.g., EEG, functional near infrared spectroscopy) combined with online behavioral measures of infants’ gaze or motion (e.g., Nguyen et al., Reference Nguyen, Reisner and Lueger2023; Rocha et al., Reference Rocha, Attaheri and Choisdealbha2024), which provide temporally precise indices of infants’ rhythm perception and production. These measures can now be successfully assessed in the context of live caregiver–infant interactions (e.g., Haresign et al., Reference Haresign, Phillips and Whitehorn2022; Nguyen et al., Reference Nguyen, Abney, Salamander, Bertenthal and Hoehl2021; Piazza et al., Reference Piazza, Hasenfratz, Hasson and Lew-Williams2020; Wass et al., Reference Wass, Whitehorn, Haresign, Phillips and Leong2020), which also capture the interpersonal synchrony during rhythm production and the dynamic changes in caregivers’ behaviors produced in response to the infants’ communicative cues.
This research will also have significant practical applications, such as the development of rhythm-based intervention programs for infants with developmental or neurological disorders, given that rhythm training has been shown to improve language and reading-related abilities (Bonacina et al., Reference Bonacina, Cancer, Lanzi, Lorusso and Antonietti2015; Flaugnacco et al., Reference Flaugnacco, Lopez and Terribili2015; Habib et al., Reference Habib, Lardy and Desiles2016; Overy, Reference Overy2000; Thomson et al., Reference Thomson, Leong and Goswami2013). Evidence for the effectiveness of rhythmic training outside the auditory and visual modalities, which remains highly limited to date, will also provide key information for optimizing the early rhythmic experiences of infants affected by sensory deprivation early in life (e.g., Hidalgo et al., Reference Hidalgo, Falk and Schön2017; Holland et al., Reference Holland, Bouwer, Dalgelish and Hurtig2010; Karam et al., Reference Karam, Russo, Branje, Price and Fels2008; Özcan et al., Reference Özcan, Caligiore, Sperati, Moretta and Baldassarre2016; Petitto et al., Reference Petitto, Holowka, Sergio and Ostry2001; Russo, Reference Russo2023).
38.6 Conclusion
This chapter has reviewed the growing evidence that ID communication is rich in rhythmic information, supports the development of infants’ early rhythm perception and production abilities, and plays a fundamental role in the process of early language development. This evidence demonstrates that ID speech, infants’ primary source of linguistic information, exhibits greater rhythmic regularity than AD speech, which facilitates infants’ encoding and processing of this register. This rhythmic regularity is also a characteristic of ID singing, a prominent signal used in natural caregiver–infant interactions. Most importantly, these auditory signals are not the only sources of regular rhythmic information available to infants. Young infants experience rhythm passively and actively across multiple modalities, including auditory, visual, tactile, and sensorimotor. The integration of rhythmic information from these different sources sustains the encoding of auditory rhythmic stimuli from the earliest stages of development and supports infants in the task of extracting rhythmic information specific to the language or languages spoken in their environment.
38.7 Acknowledgements
This work was supported by the Basque Government through the BERC 2022–2025 program and funded by the Spanish State Research Agency through BCBL Severo Ochoa excellence accreditation CEX2020–001010/AEI/10.13039/501100011033. Marina Kalashnikova’s work was supported by the Spanish State Research Agency through the Ramon y Cajal research fellowship, RYC2018–024284-I. Laura Fernández-Merino’s work was supported by a Predoctoral Grant from the Spanish Ministry of Science, Innovation and Universities and the European Social Fund, PRE2019–087623. Sofia Russo’s work was supported by a Postdoctoral Grant from the Department of Developmental Psychology and Socialization, Università degli Studi di Padova Rif. 2022ASSDPSS14.
Summary
Sensitivity to their native language’s rhythmic patterns allows infants to segment continuous speech. ID speech contains rhythmic cues, but it is not the only mode of ID communication to do so. Early development of rhythmic skills is supported by infants’ experience with rhythm across auditory, visual, tactile, and sensorimotor modalities.
Implications
Infants’ rhythm experience is not restricted to the auditory domain. Future research should investigate how infants’ rhythmic experiences intertwine and become integrated across sensory modalities involved in ID communication. This understanding is vital for defining the environmental factors that facilitate the development of early rhythmic abilities and language acquisition.
Gains
Infants experience optimized intersensory rhythmic input. This multisensory stimulation plays an important role in facilitating efficient entrainment between infants’ oscillatory neural activity and the input signal as well as in promoting interpersonal synchronization within the caregiver–infant dyad, both critical for successful language acquisition and healthy socio-cognitive development.