Rhythm in Language Acquisition

doi:10.1017/9781009295888.041

Section 6 - Rhythm in Language Acquisition

Published online by Cambridge University Press: 23 April 2026

Edited by

Lars Meyer and

Antje Strauss

Show author details

Lars Meyer: Affiliation:
Max Planck Institute for Human Cognitive and Brain Sciences
Antje Strauss: Affiliation:
University of Konstanz

Book contents

Information

Type: Chapter
Information: Rhythms of Speech and Language
Physiology, Cognition, Culture
, pp. 627 - 740

DOI: https://doi.org/10.1017/9781009295888.041 [Opens in a new window]

Publisher: Cambridge University Press

Print publication year: 2026
Creative Commons: This content is Open Access and distributed under the terms of the Creative Commons Attribution licence CC-BY-NC 4.0 https://creativecommons.org/cclicenses/

Section 6 Rhythm in Language Acquisition

35 Rhythm and Language Acquisition: A Temporal Sampling Perspective

35.1 Introduction

The importance of rhythm for language processing is a core theme throughout this volume. Both the acoustic structure of the speech signal and the physiology of speech perception and production are intimately tied to rhythm, and so the proposal that rhythm is also at the centre of language acquisition is perhaps unsurprising. Rhythm can be understood as a kind of hidden glue that underpins the acoustic structure of human languages and the representation of sensory structure by human brains. Indeed, almost four decades ago, Mehler et al. (Reference Mehler, Jusczyk and Lambertz1988) proposed that sensitivity to rhythm was a universal precursor to language acquisition by young infants. Here, I adapt temporal sampling (TS) theory (Goswami, Reference Goswami2011) to provide a rhythm-based framework for explaining language acquisition by infants (see also Goswami, Reference Goswami2019, Reference Goswami2020, Reference Goswami2022a). TS theory was originally proposed as a causal framework for integrating the acoustic and motor difficulties with rhythm found in children with dyslexia with an oscillatory model of sensory encoding. The acoustic rhythmic difficulties in dyslexia were indexed primarily by sensory difficulties in discriminating changes in speech energy called amplitude rise times (ARTs, depicted in Figure 35.1), and by perceptual difficulties in discriminating syllable stress patterns and musical rhythm patterns, along with problems in tapping in time with a regular metronome beat (e.g., Thomson and Goswami, Reference Thomson and Goswami2008; Huss et al., Reference Huss, Verney, Fosker, Mead and Goswami2011; Goswami et al., Reference Goswami and Leong2013). The rhythmic neural encoding aspects of TS theory were focused on low-frequency oscillations in the delta (0.5–4 Hz) and theta (4–8 Hz) neurophysiological bands, thought to play a core role in rhythmic entrainment (Ghitza and Greenberg, Reference Ghitza and Greenberg2009; see Chapter 3). These features were integrated into a sensory/neural explanatory system for understanding the phonological difficulties that characterise children with dyslexia, and that also characterise many children with what was previously termed specific language impairment (SLI – now termed developmental language disorder, DLD). Since 2011, TS theory has been developed further in the light of experimental data. Following an outline of TS theory drawing on sensory/neural data from adults and children, two longitudinal studies testing the theory with infant participants will be described in detail.

Figure 35.1

A schematic depiction of ARTs.

Panel 1A depicts the amplitude envelope for the sentence ‘drive round, pick my children back up’, and Panel 1B shows the rise time for the syllable ‘my’, taken from the longer sentence.

(a)

Graph A is titled Example of Speech Amplitude Envelope and shows a waveform of speech. See long description.

Figure 35.1(a) Long description

The horizontal axis represents time ranging from 0.2 to 1.4 seconds. The vertical axis represents signal amplitude, which ranges from minus 0.3 through 0.4. The waveform of speech reads, drive round, pick my children back up. A solid line outlines the waveform. The values are estimated.

(b)

Graph B is titled Onset Rise Time for My, and like graph A shows a waveform of speech. See long description.

Figure 35.1(b) Long description

The horizontal axis represents time ranging from 0.02 through 0.16 seconds. The vertical axis represents signal amplitude, which ranges from minus 0.3 through 0.4. This graph displays a waveform which originates at (0.00, 0) and terminates at (0.16, 0). A solid line for the envelope overlaps the waveform and originates at (0.00, 0) and terminates at (0.16, 0). A dotted line indicates the rising amplitude of the onset. A vertical horizontal line also overlaps the waveform to indicate computation of the rise time, originates at (0.00, 0) and terminates at (0.7, 0). The values are estimated.

Used with permission of Walter de Gruyter and Company from Goswami and Leong, 2013; permission conveyed through Copyright Clearance Center, Inc.

35.2 Temporal Sampling Theory

Temporal ‘sampling’ refers to the fact that the brain depends on networks of neurons that oscillate between excitation and inhibition to encode sensory information (‘brain waves’). Temporal coding is an important aspect of information coding in the brain (Buzsáki and Draghun, Reference Buzsáki and Draguhn2004), and for encoding the complex speech signal, the synchronous activity of oscillating networks of neurons at different timescales (frequency bands, primarily delta, 0.5–4 Hz, theta, 4–8 Hz, beta, 15–25 Hz, and gamma, 30–80 Hz) is thought to be critical (Giraud and Poeppel, Reference Giraud and Poeppel2012; see also Obleser et al., Reference Obleser, Herrmann and Henry2012). Neurophysiology shows that these frequency bands are ubiquitous throughout the brain and encode many aspects of sensory information (Buzsáki, Reference Buzsáki2006). The acoustic speech signal can be modelled as the summation of a number of frequency bands fluctuating in intensity (amplitude) over time (the amplitude envelope, AE; Drullman, Reference Drullman, Greenberg and Ainsworth2006). The auditory system codes amplitude modulations (AMs) in natural sounds across different frequency channels and at different timescales (Joris et al., Reference Joris, Schreiner and Rees2004; Turner, Reference Turner2010), and there is systematic AM structure in the AE of all natural sounds, not just speech (Turner, Reference Turner2010; see Daikoku and Goswami, Reference Daikoku and Goswami2022, for AM structure of music, birdsong, wind, and rain). The AE of both child-directed speech (CDS) such as nursery rhymes and infant-directed speech (IDS) or BabyTalk can be analysed in terms of its constituent temporal modulation frequencies (Leong and Goswami, Reference Leong and Goswami2015; Leong et al., Reference Leong, Kalashnikova, Burnham and Goswami2017). This computational modelling shows that the dominant modulation frequency bands for these highly rhythmic speech registers cluster into three AM bands, with centre frequencies around ~2 Hz, ~5 Hz, and ~20 Hz, thereby approximately matching the electrophsyiological bands of delta, theta, and beta/low gamma. These insights concerning the AM structure of CDS and IDS fit well with adult multi-time resolution speech processing models (MTRMs) (see Poeppel, Reference Poeppel2003; Poeppel et al., Reference Poeppel, Idsardi and van Wassenhove2008; also see Chapters 3 and 5), models that played a core role in the development of TS theory (see Goswami, Reference Goswami2011).

35.2.1 TS Theory: Sensory Data

Adult MTRMs proposed that the different temporal integration windows characterised by the different oscillatory networks yielded packets of information at different linguistic grain sizes that were matched to stored representations of words in the mental lexicon (Poeppel et al., Reference Poeppel, Idsardi and van Wassenhove2008). Adult MTRMs focused on the linguistic units of the syllable and the phoneme, thought to be captured by temporal integration windows in the theta and gamma bands, respectively. The infant and child speech modelling studies discussed in Section 35.2 showed that the stressed syllable is also an important linguistic grain size, and is captured by temporal integration windows in the delta band (Leong and Goswami, Reference Leong and Goswami2015). Leong and Goswami’s speech modelling revealed that if one cycle of AM at each temporal rate (delta (~2 Hz), theta (~5 Hz), and beta/low gamma (~20 Hz)) in English nursery rhymes is assumed to match one level of linguistic structure (stressed syllable, syllable, onset-rime), then the AM structure in the AE of CDS provides sufficient sensory information for the brain to extract phonological units of different sizes with over 80–90% efficiency (the spectral-amplitude modulation hierarchy model, S-AMPH, Leong and Goswami, Reference Leong and Goswami2015). This suggests that applying an MTRM approach to the simplified rhythmic genres of IDS and CDS, which underpin language acquisition across cultures, could enable ‘acoustic-emergent’ phonology. When brain rhythms and speech rhythms at these different rates (delta, theta, beta/low gamma) are in temporal alignment, then an emergent phonological system can be extracted from the speech signal by automatic acoustic statistical learning processes (see Goswami, Reference Goswami2022a, for detail; statistical learning is required as speech–brain alignment is only probabilistically accurate in yielding phonological units). This automatic learning can, in principle, enable perceptual organisation of the acoustic speech stream into syllable stress patterns (prosodic feet), syllables, and onset-rime units (m-ash, cr-ash, spl-ash). Further, automatic learning of higher-level acoustic prosodic hierarchies that are also present in the sensory signal could enable the perceptual organisation of acoustic information relevant to extracting syntax and grammar, such as prosodic and intonational phrasing (the ‘prosodic phrasing’ hypothesis; see Cumming et al., Reference Cumming, Wilson and Goswami2015). Perceptual detection of prosodic hierarchies also depends on accurate ART perception (Cumming et al., Reference Cumming, Wilson and Goswami2015).

In TS theory, rhythm patterns are hypothesised to be core to these organisational processes. The S-AMPH computational modelling work based on nursery rhymes further revealed that each AM band is in temporal synchronicity with its adjacent band, via phase relations. Phase relations are temporal relations between different cycles of rhythmic timing, such as slower versus faster bands of AM. In IDS and CDS, these phase relations are governed by the delta AM band (Leong et al., Reference Leong, Stone, Turner and Goswami2014). In particular, phase relations between the delta (~2 Hz) and theta (~5 Hz) AM bands in nursery rhymes are critical to perceiving their ‘beat’ structure or metrical structure (e.g., trochaic versus iambic; Leong et al., Reference Leong, Stone, Turner and Goswami2014, Leong and Goswami, Reference Leong and Goswami2015). Delta–theta AM phase alignment is also core to perceiving rhythm in music (Daikoku and Goswami, Reference Daikoku and Goswami2022). Indeed, the finding that the delta AM band sits at the top of the phase hierarchy in CDS and IDS is thought-provoking, as its centre rate of ~2 Hz matches the dominant beat rate of music (120 bpm) (Fraisse, Reference Fraisse and Deutsch1982; see also Chapters 26 and 27). Temporal analyses of the lullabies sung by mothers to their infants across cultures reveals a cross-cultural convergence on this beat rate of ~2 Hz (Trehub and Trainor, Reference Trehub, Trainor, Rovee-Collier, Lipsitt and Hayne1998). Accordingly, AM-based modelling of the speech signal suggests that cultural practices such as BabyTalk and lullabies may facilitate the automatic extraction of phonology and possibly syntax via the AM phase hierarchy, by statistical sensory learning beginning in the cradle. Infant auditory statistical learning is already known to be highly efficient (e.g., Saffran, Reference Saffran2001), and while IDS and CDS are simplified speech registers, they nevertheless encompass more rhythmic categories than trochaic versus iambic, for example anapest rhythms.

TS theory proposes that this automatic statistical learning depends intimately on ART discrimination, as different ARTs are associated with different temporal modulation rates. TS theory (Goswami, Reference Goswami2011) was in part developed because psychoacoustic studies had repeatedly demonstrated impaired ART discrimination by children with dyslexia, across languages (to date English, French, Finnish, Dutch, Spanish, Chinese, and Hungarian, summarised in Goswami, Reference Goswami2015). Difficulties in discriminating ARTs (which index the rate of change of AMs) could be expected to cause impairments in distinguishing the different modulation frequency ranges in speech, which would result in inefficient phase-locking to these frequency ranges by neuronal oscillatory networks (Goswami, Reference Goswami2011). Adult neuroscience studies were able to demonstrate that ARTs (called ‘speech edges’ in that literature) played a key mechanistic role in the alignment of brain rhythms and speech rhythms (Gross et al., Reference Gross, Hoogenboom and Thut2013; Doelling et al., Reference Doelling, Arnal, Ghitza and Poeppel2014; Chapter 8). For example, ARTs in the theta band provide sensory landmarks that automatically trigger brain rhythms and speech rhythms into temporal alignment, via phase-resetting ongoing neural activity (Doelling et al., Reference Doelling, Arnal, Ghitza and Poeppel2014). Further, the oscillatory networks that are responsive to speech inputs are also hierarchically organised, with delta at the top (Gross et al., Reference Gross, Hoogenboom and Thut2013). Accordingly, prior research with children and adults shows that there is a mechanistic neural hierarchy of oscillations, an AM phase hierarchy in IDS and CDS, and a linguistic hierarchy of phonological units that all match in terms of their intrinsic relational structure. This provides a sensory/neural framework for language acquisition via automatic statistical learning of this relational AM structure, which is depicted in simplified form in Figure 35.2.

Figure 35.2

TS theory and infant language acquisition.

The schematic depiction of TS theory emphasises the core role of ART and rhythm processing via automatic neural entrainment to AMs at ~2 Hz (delta) and ~5 Hz (theta) rates, sensory-neural processes that support the developing lexicon. Although only briefly discussed in this chapter, individual differences in rhythmic entrainment to visual speech and motor entrainment may also have an impact on the developing lexicon.

A flowchart illustrates a proposed model of to understand how infants acquire language. See long description.

Figure 35.2 Long description

The flowchart starts with the acoustic input, which is followed by delta sampling network and theta sampling network, a box for delta theta phase alignment helping to determine prosodic representations or syllable level representations and a box indicating the developing lexicon. Motor entertainment to the beat and visual entertainment to the beat also contribute to developing the lexicon.

Reproduced from Goswami (2022a); licensed under CC BY 4.0.

35.2.2 TS Theory: Neural Data

Children with dyslexia have phonological impairments at many levels of the linguistic hierarchy, showing difficulties in identifying and manipulating phonological units such as stressed syllables, syllables, onset-rimes, and phonemes in oral tasks (Goswami, Reference Goswami and Skeide2022b, for a recent review). If automatic statistical learning of AM phase relations supports acoustic-emergent phonology, and given the well-established sensory discrimination difficulties with ARTs found in children with dyslexia, it is logical to expect atypical oscillatory responses to natural speech in these children. The first studies investigating the neural aspects of TS theory in children predicted atypical responding in both the delta and theta bands. These early studies used a rhythmic speech paradigm, in which a ‘talking head’ repeated the syllable ‘ba’ at 2 Hz while an electroencephalogram (EEG) was recorded (Power et al., Reference Power, Mead, Barnes and Goswami2012, Reference Power, Mead, Barnes and Goswami2013). Children were listening for oddball syllables that were slightly out of time with the 2 Hz rhythm, and pressed a button when this occurred. Even when children with dyslexia were equated to control children without dyslexia for their performance in the button-press task, their brains were out of time regarding the delta-band response (but not the theta-band response; Power et al., Reference Power, Mead, Barnes and Goswami2013). The key finding was that the oscillatory response in the delta band had a different preferred phase (preferred phase refers to the point in time when most neurons are discharging their electrical potentials). Brain excitation in the delta band was maximal at a different point in time for the group of children with dyslexia compared to control children. By TS theory, this would suggest that the peak delta-band neural response for dyslexic children was aligned with less informative points in the speech stimulus, a phase difference that might be expected to impair the neural representation of phonological information.

Subsequent dyslexia studies testing TS theory in English indicated that the dyslexic brain encoded a less accurate representation of delta-band envelope information (Power et al., Reference Power, Colling, Mead, Barnes and Goswami2016; Keshavarzi et al., Reference Keshavarzi, Mandke and Macfarlane2022; Mandke et al., Reference Mandke, Flanagan and Macfarlane2022). These investigations recorded either EEG or MEG (magnetoencephalography) during speech listening tasks, and either used a method called the multivariate temporal response function (mTRF) to estimate different features of the heard speech such as the speech AE from the EEG responses (generating a correlational r value; Di Liberto et al., Reference Di Liberto, O’Sullivan and Lalor2015; Crosse et al., Reference Crosse, di Liberto, Bednar and Lalor2016), or other methods such as phase-locking values (PLVs) in MEG to estimate the cortical encoding of speech in different frequency bands. These developmental studies showed that neural representation in the delta band was less accurate (exhibiting a significantly lower mTRF r value) in a sentence-listening task for children with dyslexia compared to both age-matched control children and younger reading-level-matched control children using EEG (Power et al., Reference Power, Colling, Mead, Barnes and Goswami2016). It was also less accurate in a story-listening task compared to age-matched control children in neuroimaging studies using MEG (exhibiting a significantly lower PLV; Mandke et al., Reference Mandke, Flanagan and Macfarlane2022) and EEG (exhibiting a significantly lower mTRF r value; Keshavarzi et al., Reference Keshavarzi, Mandke and Macfarlane2022). In the English EEG studies, neural accuracy was significantly related to measures of phoneme awareness and non-word reading, and in the MEG study, to vocabulary development. Neuroimaging studies in other languages have also found impaired low-frequency envelope encoding for children with dyslexia (in both the delta and theta bands; see Destoky et al., Reference Destoky, Bertels and Niesen2020, Reference Destoky, Bertels and Niesen2022, French) and atypical speech–brain synchronisation in the delta band (see Molinaro et al., Reference Molinaro, Lizarazu, Lallier, Bourguignon and Carreiras2016, Spanish). Importantly, speech reconstruction studies with infants, children, and adults (in which EEG recorded during speech listening is used to reconstruct low-frequency speech information <8 Hz using mTRFs) show that delta-band information yields phonetic information as well as prosodic (syllable stress) information (Di Liberto et al., Reference Di Liberto, O’Sullivan and Lalor2015, Reference Di Liberto, Peter and Kalashnikova2018, Reference Di Liberto, Attaheri and Cantisani2023). Accordingly, this review of neural data supporting TS theory indicates that low-frequency envelope information is encoded less accurately by children with dyslexia, and that this impaired encoding affects the quality of their phonological development at many levels of the linguistic hierarchy, including the phoneme level.

35.2.3 Implications for a TS Theory of Language Acquisition

The sensory and neural data with children gathered to test TS theory have a number of implications for language acquisition. Regarding sensory factors, if ART sensitivity could be measured in infancy, it would be expected to be a significant predictor of linguistic development. As infants get older and become able to participate in metrical rhythm tasks, individual differences in rhythm perception and production might also be expected to relate to ART discrimination and to predict linguistic development. These sensory predictions of TS theory were investigated in the SEEDS project. Regarding neural factors, both the accuracy of continuous speech encoding in the delta and theta bands as measured by mTRFs and the timing of oscillatory responses (such as preferred phase) in the delta band to rhythmic inputs might be expected to predict later language outcomes. These predictions of TS theory were investigated in the Cambridge UK BabyRhythm project. Both were longitudinal infant projects, one with an at-risk sample (SEEDS) and one with a typically developing sample (BabyRhythm), and an overview of their findings is presented here.

35.3 SEEDS: Infants at Family Risk for Dyslexia

The SEEDS (of Literacy) project recruited a group of infants at family risk for dyslexia (FR group) and a group of infants not at family risk for dyslexia (NFR group) in Australia, and followed their development from the age of five months. One aim of SEEDS was to investigate whether infant measures of auditory sensitivity to rhythm would be related to subsequent linguistic and phonological skills and, eventually, to reading achievement (Kalashnikova et al., Reference Kalashnikova, Goswami and Burnham2018, Reference Kalashnikova, Goswami and Burnham2019, Reference Kalashnikova, Burnham and Goswami2020a, Reference Kalashnikova, Burnham and Goswami2020b, Reference Kalashnikova, Burnham and Goswami2020c, Reference Kalashnikova, Burnham and Goswami2021). Children were assigned to the FR and NFR groups at the beginning of the project based on a parent having a pre-existing dyslexia diagnosis and/or their performance on a comprehensive parental screening battery that included language, reading, and cognitive tasks. This targeted sample proved difficult to recruit; hence, most publications are based on samples of 50 infants or fewer.

An infant behavioural procedure for assessing sensitivity to ART was developed based on the psychoacoustic stimuli used to measure ART discrimination in younger children (Corriveau et al., Reference Corriveau, Goswami and Thomson2010). An infant version of a two-alternative forced choice (2AFC) adaptive threshold procedure was created based on looking preference. Infants sat on their parent’s lap facing three monitors in a sound-attenuated infant-testing booth. Once infants fixated the central monitor, the images of a chequerboard appeared on the right and left screens. Infants’ fixations to the monitor on one side produced a repeating auditory stimulus with a fixed rise time (e.g., 15 ms, 15 ms, 15 ms, 15 ms …), while fixations to the monitor on the other side produced an auditory stimulus with alternating rise times (15 ms, 300 ms, 15 ms, 300 ms …). Greater fixation to the alternating stimulus was taken as a measure of discrimination, and step size was reduced accordingly (e.g., 15 ms, 270 ms, 15 ms, 270 ms …), enabling a threshold (a just noticeable difference in ART) to be established (see Kalashnikova et al., Reference Kalashnikova, Goswami and Burnham2018, for a full description). When the FR and NFR groups were aged 10 months, they already showed a significant difference in ART discrimination in this task (Kalashnikova et al., Reference Kalashnikova, Goswami and Burnham2018), as would be expected by TS theory.

Early auditory rhythm sensitivity as indexed by the ART measure administered at 10 months was a significant predictor of subsequent vocabulary development (Kalashnikova et al., Reference Kalashnikova, Goswami and Burnham2019). When they were aged 36 months, the SEEDS cohort were given a standardised vocabulary measure from the Stanford-Binet (Roid, Reference Roid2003). They were also given two experimental measures of phonological development at 42 months. These were a mispronunciation detection task (e.g., ‘abble’ for apple) and a non-word repetition task (the children orally copied items such as ‘sep’, ‘gattom’, and ‘katapet’). A linear mixed-effects analysis showed that ART discrimination at age 10 months was a predictor of vocabulary at 36 months of age, but not (contrary to hypothesis) of phonological development at 42 months (Kalashnikova et al., Reference Kalashnikova, Goswami and Burnham2019). The phonological tasks did not yet show group differences, possibly due to reduced sensitivity at this age. By contrast, infants who were less sensitive to ART had smaller vocabularies, and reduced vocabulary development is known to be an early hallmark of risk for dyslexia. Prospective longitudinal studies of children at family (genetic) risk of dyslexia frequently report deficits in expressive vocabulary (e.g., at 17 months; Koster et al., Reference Koster, Been and Krikhaar2005), and van Viersen et al. (Reference van Viersen, de Bree and Verdam2017) showed that children at risk for dyslexia exhibit delayed growth patterns of both receptive and expressive vocabulary sizes from 17 to 35 months. In the latter study, these early vocabulary scores reliably discriminated between at-risk children who later did and did not develop dyslexia.

One explanation for reduced vocabulary development in FR samples could be that reduced sensitivity to ARTs impairs the learning of new words. This would be predicted by TS theory if infants with poorer rise time sensitivity process the speech signal less effectively. Prior research with children with dyslexia has shown a significant relationship between ART discrimination and the ability to learn novel words (Thomson and Goswami, Reference Thomson and Goswami2010). Thomson and Goswami reported that children with dyslexia were worse at learning novel words in a paired associate learning (PAL) task compared to both age-matched and younger reading-level-matched control children. Within the dyslexic group, children with better ART sensitivity performed better in the PAL task. To test the possibility that reduced ART sensitivity impairs novel word learning, a ‘fast mapping’ task was administered when the SEEDS cohort were aged 19 months (Kalashnikova et al., Reference Kalashnikova, Burnham and Goswami2020a). Two non-word items, ‘lif’ and ‘wug’, were presented repeatedly along with two novel visual objects in a habituation paradigm, and word learning was tested in a subsequent preferential looking paradigm. Both groups of infants showed similar patterns of habituation, and similar engagement and attention during the learning component of the task. However, while the NFR infants showed increased looking time to the correct novel referent at test, the FR infants did not. The FR infants were hence less successful at learning new words early in the language acquisition process, as would be expected on the basis of TS theory.

Of course, it could be argued that the FR infants were simply poorer at PAL. This explanation was ruled out by a subsequent study of PAL when the SEEDS sample were aged 48 and 54 months (Kalashnikova et al., Reference Kalashnikova, Burnham and Goswami2020b). In the later PAL task, the SEEDS children were required to learn associations between four novel words and four novel objects, with learning assessed by (1) finding the correct object in response to its label, and (2) producing the novel word when shown the corresponding object. Their letter knowledge was also assessed as a measure of pre-reading skill at both ages, and vocabulary was measured again at 48 months. Kalashnikova et al. (Reference Kalashnikova, Burnham and Goswami2020b) found no deficit in PAL for the FR children on either measure of learning, and no relationship between PAL and letter knowledge. However, the FR children still had significantly smaller vocabularies than the NFR children. Accordingly, linguistic development is still compromised, even though the difficulties in novel word learning present at 19 months for the FR children could no longer be demonstrated using the later PAL task at 48 and 54 months. Further analyses revealed that it was earlier performance in the non-word repetition task given at 42 months that predicted PAL. This suggests developmental continuity between early auditory difficulties, subsequent early word-learning difficulties, individual differences in later phonological sensitivity, and subsequent vocabulary learning.

By four years of age (48 months), the SEEDS sample were able to complete some of the measures of rhythmic sensitivity used in the TS studies of children with dyslexia. These measures comprised a rhythm discrimination task based on musical beats (following Huss et al., Reference Huss, Verney, Fosker, Mead and Goswami2011, hereafter musical beat perception) and a measure of rhythm production based on tapping to a 2 Hz metronome beat (following Thomson and Goswami, Reference Thomson and Goswami2008; see Kalashnikova et al., Reference Kalashnikova, Burnham and Goswami2021). By TS theory, both measures of rhythmic processing should be related to phonological development and should be impaired in the FR group. The non-word repetition data gathered at 42 months was thus studied in relation to rhythm discrimination and rhythm production at 48 months. When aged four years, the FR group showed significantly poorer rhythm discrimination (significantly lower ‘d’ scores) in the musical beat perception task compared to the NFR group (Kalashnikova et al., Reference Kalashnikova, Burnham and Goswami2021). There was no group difference in rhythm production (tapping accuracy), although the FR group displayed more variable performance (Kalashnikova et al., Reference Kalashnikova, Burnham and Goswami2021; Figure 35.2). As noted above, when aged four years the SEEDS sample also received measures of letter knowledge and vocabulary. The FR and NFR children showed significant group differences in all three linguistic tasks (non-word repetition, letter knowledge, and vocabulary development), with the FR children showing consistently worse performance. Correlation analyses showed that tapping variability was significantly related to performance on all three linguistic tasks, whereas variability in musical beat perception was not. A mediation analysis showed that metronome tapping was a significant predictor of non-word repetition, and non-word repetition was a significant predictor of letter knowledge (Kalashnikova et al., Reference Kalashnikova, Burnham and Goswami2021). Accordingly, the link between rhythm and pre-literacy skills (letter knowledge) is mediated by phonological processing (non-word repetition). Again, these longitudinal relations are consistent with TS theory. In future SEEDS analyses, we plan to investigate developmental continuities between ART discrimination, phonological development, and early literacy.

35.4 Cambridge UK BabyRhythm Project: Typically Developing Infants

In order to test neural predictions of TS theory with infants, large samples are required. EEG data collected from infants is quite noisy, and there can be substantial loss of data. The Cambridge UK BabyRhythm project was able to recruit 122 typically developing infants, who were followed from the age of two months to 42 months with very little drop-out. Infants attended for eight EEG sessions at the ages of two, four, five, six, seven, nine, and 11 months, and language outcomes were measured at 12, 15, 18, 24, 30, and 42 months. To date, only language analyses up to 24 months are available (see also Chapter 36). Based on TS theory, three tasks were selected for the EEG recordings: a nursery rhyme task and two rhythmic tasks. Infants watched videos of a ‘talking head’ singing nursery rhymes in IDS, of a ‘talking head’ saying ‘ta’ repeatedly at 2 Hz in IDS, and of a visual presentation of a ball dropping on to a drum accompanied by a metronome beat at 2 Hz. These three audiovisual tasks enabled us to assess cortical tracking of natural speech in the delta and theta bands by applying mTRF analyses to the nursery rhyme EEG, and to generate measures of preferred phase using the EEG recorded to both speech (‘ta’) and non-speech (drumbeat) rhythmic inputs.

These neural measures were used to predict a range of language outcomes. Although many different experimental measures of phonology and grammar were employed, along with standardised measures such as the UK version of the CDI (communicative development inventory; Alcock et al., Reference Alcock, Meints and Rowland2020), the Covid-19 pandemic had a very disruptive effect on data collection, and only some of the language outcome measures could be used for the neural prediction analyses. The most robust measures were the UK-CDI, in which parents estimated their infants’ word comprehension and production, an infant-led vocabulary measure called the CCT (computerised comprehension task; Friend and Keplinger, Reference Friend and Keplinger2003), in which infants selected a named object on a tablet screen to indicate comprehension, and a non-word repetition task introduced as a game about naming toys. The vocabulary measures were administered from 12 months of age, and the non-word repetition task from 18 months of age (see Rocha et al., Reference Rocha, Ní Choisdealbha and Attaheri2024, for full analyses).

On the basis of prior TS studies with children with dyslexia, it was expected that the accuracy of encoding of envelope information in the delta band along with individual differences in preferred phase in the delta band could be predictors of language outcomes. Regarding the accuracy of low-frequency envelope encoding, the EEG recorded in response to the sung nursery rhymes presented at the four-, seven-, and 11-month laboratory sessions was analysed using the mTRF method. These longitudinal analyses compared the accuracy of the speech envelope information encoded in the delta (0.5–4 Hz), theta (4–8 Hz), and alpha (8–12 Hz, a control band) bands at each age. Encoding accuracy was estimated for the first 60 infants in the UK BabyRhythm sample (Attaheri et al., Reference Attaheri, Ní Choisdealbha and Di Liberto2022), and is currently being replicated with the remaining infants (Attaheri et al., Reference Attaheri, Ní Choisdealbha and Rocha2024). In Attaheri et al. (Reference Attaheri, Ní Choisdealbha and Di Liberto2022), delta-band encoding was significantly stronger (significantly larger r values) than theta-band encoding at four months, and continued to be stronger at all subsequent ages, with no evidence for significant alpha-band encoding at any age. Other neural factors related to speech encoding by adults, such as phase-amplitude coupling (PAC) measures, were also present from the earliest measurement point (four months). Adult speech-encoding studies have highlighted theta–gamma PAC as particularly relevant for speech processing (Hyafil et al., Reference Hyafil, Fontolan, Kabdebon, Gutkin and Giraud2015). Our infant data showed that delta-led PAC was equally important as theta-led PAC during the nursery rhyme listening task, with delta–beta, delta–gamma, theta–beta, and theta–gamma PAC all significant at all ages (four, seven, and 11 months) (Attaheri et al., Reference Attaheri, Ní Choisdealbha and Di Liberto2022). In ongoing analyses (Attaheri et al., Reference Attaheri, Ní Choisdealbha and Rocha2024), language data collected between 12 and 24 months is being related to the neural data. Currently, only theta–gamma PAC appears to be a significant source of individual differences. Stronger theta–gamma PAC is related to better language outcomes, while delta-related PAC is not (Attaheri et al., Reference Attaheri, Ní Choisdealbha and Rocha2024). The other neural predictor of better language outcomes as measured by the mTRF analyses was the accuracy of delta-band encoding at 11 months (Attaheri et al., Reference Attaheri, Ní Choisdealbha and Rocha2024). Worse language outcomes were related to the theta–delta power ratio (higher ratios were associated with worse outcomes), and to the overall amount of theta power (higher theta power spectral density was associated with worse language outcomes; Attaheri et al., Reference Attaheri, Ní Choisdealbha and Rocha2024). Accordingly, both delta-band and theta-band oscillatory responses and their dynamic interactions are important for language acquisition, as would be expected on the basis of TS theory.

Given the earlier data collected with children with dyslexia (Power et al., Reference Power, Mead, Barnes and Goswami2013), the preferred phase of entrainment in the delta band to rhythmic speech should also relate to individual differences in language outcomes. Preferred phase analyses were conducted for the two rhythmic tasks (‘ta’ and drumbeat), which were administered at two, six, and nine months in the auditory-only or auditory-visual (AV) domain, and at five and eight months as visual-only (VO) speech. Phase analyses revealed that infants already exhibited phase consistency to rhythmic syllables and drumbeats at two months of age, when most participants were sleeping and hence experienced auditory-only input (Ní Choisdealbha et al., Reference Ní Choisdealbha, Attaheri and Rocha2023). Individual phase angles in the syllable task at two months were a significant predictor of later vocabulary outcomes as measured by the CCT at 18 months. By the time the infants were aged nine months, phase angles in the syllable task were associated with the number of parent-estimated known words at 24 months on the UK-CDI, for both comprehension and production. Regarding the drumbeat task, phase angles at nine months were predictive of non-word repetition skills at the age of 24 months, an early measure of phonological sensitivity (Ní Choisdealbha et al., Reference Ní Choisdealbha, Attaheri and Rocha2023).

For VO speech, similar results were obtained (Ní Choisdealbha et al., Reference Ní Choisdealbha, Attaheri and Rocha2024). At five months, infants already exhibited an increased cortical response to the stimuli at the stimulus frequency (thereby showing cortical tracking at 2 Hz), but phase consistency was variable. At eight months, the infants showed significant phase consistency to VO speech, and oscillatory responses were moving towards a common preferred phase angle. Individual differences in phase angle at eight months were significantly related to later vocabulary at 24 months. Taken together, the analyses reported by Ní Choisdealbha et al. show that individual differences in the phase angle of the neural response to rhythmic speech stimuli are predictive of later language outcomes, for auditory-only, VO, and AV speech. Accordingly, key neural parameters of TS theory explain some of the individual variation in children’s later language outcomes.

Ideally, the Cambridge UK BabyRhythm study would have explored whether there was an association between individual differences in ART sensitivity and these different neural outcome measures. However, the 2AFC behavioural paradigm from the SEEDS study did not yield reliable data with the UK BabyRhythm infants. As the infants were still wearing the EEG cap when they received the 2AFC paradigm, instead of using behavioural thresholds we used their ERPs as an index of ART sensitivity (Ní Choisdealbha et al., Reference Ní Choisdealbha, Attaheri and Rocha2022). The ERP data revealed that the infants showed robust mismatch responses to all the steps on the thresholding staircase, including the ones that we had expected to be below threshold on the basis of the SEEDS data. Even the smallest just noticeable difference of 161 ms in rise time (15 ms versus 176 ms) was discriminated by 85% of the sample at seven months of age, and there was no consistent pattern to the presence or absence of a mismatch response across the 10 values used for the staircase. This finding shows that infants are able to discriminate ARTs with high accuracy, revealing the readiness of the infant brain to process speech rhythm. The ART data also demonstrate the value of neural measures in assessing the parameters of TS theory. Future studies will need to present more fine-grained differences between ART stimuli in order to find individual infant thresholds and relate them to language outcomes.

35.5 Conclusions

Applying TS theory to infant language acquisition appears to be a promising avenue for future research. Referring back to Figure 35.2, the SEEDS and the Cambridge UK BabyRhythm projects have provided evidence for a number of factors theorised to be critical to language acquisition. The UK BabyRhythm project has shown that the infant brain is highly sensitive to ART differences (Ní Choisdealbha et al., Reference Ní Choisdealbha, Attaheri and Rocha2022), and entrains to delta-band and theta-band speech envelope information from four months of age (Attaheri et al., Reference Attaheri, Ní Choisdealbha and Di Liberto2022, Reference Attaheri, Ní Choisdealbha and Rocha2024). Hence, the delta and theta sampling networks shown in Figure 35.2 are gaining information about AMs at the ‘stress’ and ‘syllable’ rates from early in language learning. Computational speech analyses of IDS collected in the SEEDS project showed that delta–theta phase alignment of AMs is foregrounded in BabyTalk (Leong et al., Reference Leong, Kalashnikova, Burnham and Goswami2017), and although phase-phase analyses have yet to be completed, the UK BabyRhythm neural data showed that phase-amplitude mechanisms are already in play during speech processing (Attaheri et al., Reference Attaheri, Ní Choisdealbha and Di Liberto2022). Hence, phase-phase mechanisms may also be online. The effects of individual differences in these sensory/neural variables on the developing lexicon were documented by both projects. The SEEDS data showed that infant rise time sensitivity at 10 months predicted vocabulary development at three years, with behavioural rhythm measures at four years subsequently predicting pre-literacy variables, mediated by phonological sensitivity (non-word repetition). The UK BabyRhythm data suggested that the accuracy of delta-band cortical tracking of speech envelope information at 11 months predicted later vocabulary, as did theta–gamma PAC (Attaheri et al., Reference Attaheri, Ní Choisdealbha and Rocha2024), while individual differences in neural phase angle to rhythmic speech and non-speech stimuli also predicted both later vocabulary and non-word repetition (Ní Choisdealbha et al., Reference Ní Choisdealbha, Attaheri and Rocha2023). This suggests that both neural phase alignment in the delta band and the accuracy of speech envelope encoding are critical for later language outcomes (see also Rios-Lopez et al., Reference Rios-Lopez, Molinaro, Bourguignon and Lallier2022). Accordingly, data from both infant projects converge in showing that the sensory/neural factors foregrounded by TS theory make important contributions to later language outcomes for English-learning infants.

Box 35.1Chapter Overview

Summary

This chapter presents a theoretical overview of how rhythm may be important for language development, using the framework of TS theory. Sensory and neural data for TS theory from children are reviewed, and the TS-proposed causal sensory/neural mechanisms are assessed by utilising recent infant longitudinal data.

Implications

Amplitude ‘rise time’ discrimination and neural entrainment to rhythmic acoustic signals are physical characteristics of the nervous system and not under conscious control, yet they govern in part the efficiency of language acquisition. Future projects in other languages are now required to test the language universality of TS theory and these sensory/neural factors.

Gains

Deeper understanding of the sensory and neural mechanisms that govern individual differences in language acquisition may open the door to novel interventions, possibly based on rhythm, that may enhance language development.

36 Neural and Behavioural Rhythmic Tracking during Language Acquisition: Findings, Methods, and Outstanding Issues

36.1 Introduction

Language acquisition is a multimodal phenomenon. Within the womb, the fetus is exposed to the rhythm of their mother’s speech via a low-pass filter. They hear the rumbling of their mother talking; they feel her movements. At birth, infants can recognise their mother’s voice (Mehler et al., Reference Mehler, Bertoncini, Barriere and Jassik-Gerschenfeld1978) and show familiarity with stories read to them in utero (Decasper and Spence, Reference Decasper and Spence1986). They are not just passive recipients; in their earliest communications, their cries follow the pattern of the language they are exposed to (Mampe et al., Reference Mampe, Friederici, Christophe and Wermke2009). At birth, even with months of auditory experience under their belt, their language system is flexible and open to the input it receives. Young infants can discriminate between sounds in languages they have never been exposed to before, an ability that is lost over the first year of life as the system acquires expertise for its language(s) (Maurer and Werker, Reference Maurer and Werker2014). The journey towards adult-like language expertise is long; infants have to learn vocabulary, syntax, and grammar. All these elements have been extensively studied in infants and young children, and we have a wealth of knowledge of key roles of, for example, ostension (Csibra, Reference Csibra2010) or statistical learning (Romberg and Saffran, Reference Romberg and Saffran2010).

In recent decades, fuelled in part by observations from language disorders, adult speech perception, and music perception, a new contender on the block has emerged as a critical component of linguistic success – rhythm perception. The grossly oversimplified story (discussed with the detail it deserves in other chapters of this edition in Section 6) is that speech is a rhythmic signal, and that efficient processing of the rhythm of speech facilitates language acquisition. The patterning of syllables and stress syllables gives anchors, or perceptual edges, in the speech signal that allow the listener to attend to important information in speech (Doelling et al., Reference Doelling, Arnal, Ghitza and Poeppel2014). Rhythmic cues give structure to the speech signal for the listener to follow. What is intriguing is that when we speak to infants (see Chapters 23 and 38), we emphasise this rhythm, slowing down and adding greater emphasis. Our voices take on a sing-song quality that is not a reflection of the expert speaker but is attuned to our novice listeners. This phenomenon is known as infant-directed speech (IDS). IDS is linked to enhanced word learning. Infants learn new words better when they are presented in IDS than adult-directed speech (ADS) (Ma et al., Reference Ma, Golinkoff, Houston and Hirsh-Pasek2011), and this benefit is also true for adults learning a new language (Ma et al., Reference Ma, Golinkoff, Houston and Hirsh-Pasek2011). Critically, IDS is not necessary for learning throughout the acquisition journey – once a language has been sufficiently mastered, older infants learn well without it (Ma et al., Reference Ma, Golinkoff, Houston and Hirsh-Pasek2011). Caregivers are therefore responsive to the needs of their infant, modulating the acoustic properties of the IDS they produce according to infant age, and likely reflecting infant attention to different acoustic cues, within the bidirectional and dynamic caregiver–infant speech interactions (Cox et al., Reference Cox, Bergmann and Fowler2023). Similarities in prosody have been demonstrated amongst diverse societies (Broesch and Bryant, Reference Broesch and Bryant2018), and IDS is largely considered universal, at least in form if not quantity (Cox et al., Reference Cox, Bergmann and Fowler2023). Adults can distinguish IDS from ADS in non-native languages from short, contextless audio excerpts (Hilton et al., Reference Hilton, Moser and Bertolo2022). If the greater rhythmicity of IDS is a critical universal property, we must settle on some core understandings of what we mean by rhythm. In music, rhythm describes a series of temporal intervals (see Chapter 27). It is often characterised by isochrony or equal spacing between event onsets. Whilst naturalistic speech never has the regularity of a metronome or click track, IDS has greater isochrony than ADS. We can consider the proximate mechanisms that may be at play whilst infants are listening to this special rhythmic signal, the most intuitive being that infants are (neurophysiologically) tracking the rhythm of IDS. For this to be the case, we must meet two criteria. First, that the infants can neurally track an auditory rhythm, and second, that the speech contains an auditory rhythm for infants to track.

36.1.1 Criterion 1: Infants Perceive Auditory Rhythm

There is good evidence from the field of music cognition that infants can perceive auditory rhythms. We see this behaviourally, for example in habituation studies where we see that infants discriminate tempo changes (Baruch and Drake, Reference Baruch and Drake1997) and metre (Hannon and Johnson, Reference Hannon and Johnson2005). Through infancy, infants’ spontaneous movement behaviour changes in response to music, and whilst infants cannot reliably synchronise to music, they show tempo-flexibility, moving faster to faster auditory tempi and slower to slower tempi (Rocha and Mareschal, Reference Rocha and Mareschal2017; Yu and Myowa, Reference Yu and Myowa2021; Zentner and Eerola, Reference Zentner and Eerola2010). We are also able to measure rhythm perception neurally, with electroencephalographic (EEG) mismatch responses showing that infants detect a missing beat (Winkler et al., Reference Winkler, bor Há den, Ladinig, Sziller and Honing2009) and interpret metrical structure (Flaten et al., Reference Flaten, Marshall, Dittrich and Trainor2022). A more direct approach to measuring infant neural responses to musical beats has used steady-state evoked potentials (SSEPs), which reflect the amount of neural energy at different frequencies. An established measure in adult music cognition (Nozaradan et al., Reference Nozaradan, Peretz, Missal and Mouraux2011), this approach has been used to show that infants have enhanced energy at the perceived beat and metre frequencies of auditory rhythmic patterns (Cirelli et al., Reference Cirelli, Spinelli, Nozaradan and Trainor2016; Flaten et al., Reference Flaten, Marshall, Dittrich and Trainor2022).

36.1.2 Criterion 2: IDS Contains Auditory Rhythm

If we are therefore happy to proceed with our argument that infants perceive critical timing information in auditory rhythmic stimuli such as repeated tones or real music, the next criterion for rhythm as a key to language acquisition is to show that there is indeed rhythm in the speech signal for infants to track (see Chapter 23). Studies of the acoustic signal of naturalistic IDS show increased amplitude modulations around 2 Hz (Leong et al., Reference Leong, Kalashnikova, Burnham and Goswami2017). To investigate this, Leong et al. applied a computational model to child-directed speech (CDS) and revealed that the speech is hierarchically organised, known as the spectral-amplitude modulation phase hierarchy (S-AMPH). The approach consists of a set of algorithms that are used to derive underlying spectral characteristics of the speech signal. It uses probabilistic demodulation to model the rhythm patterns in speech, giving a low-dimensional representation of the acoustic and temporal properties of the speech envelope (Goswami and Leong, Reference Goswami and Leong2013; Leong and Goswami, Reference Leong and Goswami2014, Reference Leong and Goswami2015; Leong et al., Reference Leong, Kalashnikova, Burnham and Goswami2017). This data-driven modelling approach allows us to identify various amplitude modulations corresponding to linguistic boundaries. For example, in the first report on S-AMPH (Leong and Goswami, Reference Leong and Goswami2015), the application of the modelling approach to CDS revealed amplitude modulations corresponding to prosodic stress (stress AM ~2 Hz), syllables (~5 Hz), and phoneme rate (~20 Hz). Furthermore, they argued that this nested hierarchy of speech rhythms could be used by an infant to build stimulus-driven phonological maps of a speech system in any given language. Particularly in CDS, these amplitude modulations are exaggerated, and possibly provide the essential acoustic landmarks for children.

Given the above support for our core criteria, it is not surprising that there has been increased focus in recent years on understanding the mechanisms by which rhythmic processing of speech may support typical and atypical language acquisition. The human auditory cortex has been shown to reliably track the amplitude envelope of the speech signal. This is achieved by phase aligning endogenous neural oscillations with the amplitude envelope of the temporally regular auditory information. The speech envelope refers to the amplitude fluctuations over time, typically occurring in low frequencies (< 10 Hz), which help the listener track the speech rhythm. Using magnetoencephalography (MEG), speech tracking of the amplitude envelope was demonstrated in healthy adult listeners (e.g., Gross et al., Reference Gross, Hoogenboom and Thut2013; Peelle et al., Reference Peelle, Gross and Davis2013) but since then has been revealed in infant EEG (Attaheri et al., Reference Attaheri, Choisdealbha and Di Liberto2022a; Jessen et al., Reference Jessen, Obleser and Tune2021; Menn et al., Reference Menn, Michel, Meyer, Hoehl and Männel2022). There is evidence that this speech envelope-tracking ability develops from childhood to adulthood, and even supports better performance in speech in noise (Destoky et al., Reference Destoky, Bertels and Niesen2020; Vander Ghinst et al., Reference Vander Ghinst, Bourguignon and Niesen2019). The speech-tracking literature has predominantly used measures such as speech–brain coherence, phase-locking value, and mutual information. These methods essentially measure statistical dependency between the speech signal and underlying neurophysiological data. In the rest of this chapter, we aim to provide an account of the state-of-the-art methods being developed to elucidate the relationship between rhythm and language, summarise where the literature converges and diverges, highlight open questions, and discuss the developments in our field that can enhance understanding of these phenomena.

36.2 A Primer on Neural Measures of Rhythm Processing Suitable for Use with Infants

Neural measurements from the earliest moments in life have been possible for some decades now, including via EEG, MEG, and near-infrared spectroscopy (fNIRS). Most relevant studies to this chapter use M/EEG for its excellent temporal resolution. EEG measures spontaneous neuronal activity generated by ensembles of neurons, from the surface of the scalp. MEG on the other hand measures the magnetic components of this underlying neuronal activity. Infant EEG often comprises high-density (64–128-channel) recording using water-based geodesic sensor nets that need little preparation, aiding infant compliance (Figure 36.1a). Modern systems are improving traditional issues with signal-to-noise ratio (SNR), with infant active electrode caps that can be pre-gelled and applied almost as quickly as nets.

Figure 36.1

Infant neural activity can be measured passively using EEG or MEG systems.

1A.

An infant wearing a geodesic sensor net.

A photograph of a child wearing an E E G cap that measures brain waves. See long description.

Figure 36.11A. Long description

Photo A presents a child seated in the lap of an adult wearing a specialized cap on their head with many small electrodes attached. This procedure is known as electroencephalography or E E G, which measures electrical activity of the brain.

1B.

MEG adapted with lightweight optically pumped magnetometers.

A photograph of a child wearing an electrode cap similar to the one in photograph A. See long description.

Figure 36.11B. Long description

Photo B presents a child seated in a specialized chair inside a room wearing an electrode cap similar to the E E G cap in photo A. This procedure is known as magnetoencephalography M E G, which uses specialised sensors to measure magnetic fields of the brain. There are a bunch of wires on the floor of the room.

Picture credits: 1A: Eleanor Smith; 1B: Paul Allen

Infant EEG has further benefited from technological advances in signal processing post data collection. As it is challenging to ensure infants remain stationary during an experiment, data can suffer from non-canonical movement artefacts, difficult to remove using standard adult-defined techniques. However, recent noteworthy advancements in toolboxes and tutorials specifically for infant EEG data allow greater precision in the analysis of noisy data (Gabard-Durnam et al., Reference Gabard-Durnam, Leal, Wilkinson and Levin2018; Lopez et al., Reference Lopez, Monachino and Morales2022). These general technological advancements have facilitated the growth in complex methodologies suitable for answering questions on infant speech perception. On the other hand, MEG offers the same temporal resolution as EEG and has a reasonable spatial resolution allowing us to investigate activity between networks of brain regions. A crucial limitation of the traditional cryogenically cooled MEG system is that it has a fixed array of sensors, making head movements a confound in typical experiments. As the sensor array is fixed, any head motion relative to the sensor array can cause changes in the SNR and spatial blurring of the underlying sources. Recognising this limitation, several algorithms are now available to correct head movement artefacts. However, changes in the SNR (as sources move relative to the array) during recording place a limit on the amount of movement that can be compensated (Medvedovsky et al., Reference Medvedovsky, Taulu, Bikmullina and Paetau2007). The problem of head movement is much more pronounced in the paediatric population, where infants and/or toddlers find it very difficult to stay still in unnatural (i.e., laboratory) environments. This limitation is better overcome by EEG and fNIRs, which involves placing the sensors directly on the participants’ heads. Recent exciting developments in MEG hardware have led to the development of room temperature MEG sensors, which involve the use of optically pumped magnetometers (OPMs) (Boto et al., Reference Boto, Meyer and Shah2017, Reference Boto, Holmes and Leggett2018). The lightweight sensors (OPMs) can be mounted in a helmet, making the scanner a wearable device. This new approach is gaining traction, and early adoption with children demonstrates significant improvements in the SNR with OPMs when testing children with epilepsy (Feys et al., Reference Feys, Corvilain and Aeby2022), cortical tracking of speech (de Lange et al., Reference de Lange, Boto and Holmes2021), and hyper-scanning during play (Holmes et al., Reference Holmes, Rea and Hill2023). Being able to place the OPMs directly over a participant’s head has two distinct advantages: (1) improved SNR, and (2) improved spatial resolution. This makes a compelling use case in developmental populations, particularly during naturalistic experiments. For example, in a study by Hill et al. (Reference Hill, Boto and Holmes2019), the OPM-MEG system was used to measure somatosensory activity underlying maternal touch in two- and five-year-olds. Therefore, whilst this chapter mostly discusses infant EEG, we see great potential for MEG research in the coming years.

36.3 Methodological Overview

Human speech is intrinsically rhythmic. This is mainly the result of coordinated movement by the oro-musculature involved in speech production. In a stress-timed language such as English, the rhythm in speech typically translates to the occurrence of stress and unstressed syllables in connected speech (Cummins and Port, Reference Cummins and Port1998; Nespor et al., Reference Nespor, Shukla, Mehler, Oostendorp, Ewen, Hume and Rice2011). The speech rhythm (i.e., prosody), indexed by the changes in the amplitude envelope of the signal, offers critical cues for speech segmentation (see Chapter 11 for an alternative perspective). Whilst there are variations in the rate of speech, both within and between speakers, healthy adult listeners change their ongoing neural oscillations to match the incoming speech signal. This is a key mechanism for speech perception. Nevertheless, how the auditory cortex achieves this impressive feat, and the precise oscillatory mechanisms underlying it, remain largely elusive. Moreover, and of particular interest to developmental neuroscientists, there are the questions, what does this mechanism look like in infancy and childhood? (How) does it aid language acquisition? And what happens when these mechanisms break down early in childhood? Progress towards answering these questions has been made through measurement of the associations between the speech signal and ongoing neural oscillations using M/EEG. Owing in large part to these speech-tracking methods, the mechanism of ‘neural entrainment’ as a basis for speech processing and language acquisition has also received considerable support. Here, we look at speech–brain coherence, phase-locking value (PLV), mutual information (MI), and multivariate temporal response function (mTRF) as examples of methods that have been used to study neural entrainment.

36.3.1 Speech–Brain Coherence

Coherence is a statistical measure that is used to identify statistical dependency between two signals, x(t) (e.g., speech time series) and y(t) (e.g., neural time series). It is given by:

(1)

{Coh}_{xy} (f) = \frac{{|S_{xy} (f)|}^{2}}{S_{xx} (f) S_{yy} (f)}

where $S_{xy} (f)$ is cross-spectral density between x and y, and $S_{xy} (f)$ and $S_{yy} (f)$ are the auto-spectral density of x and y, respectively. The spectral densities are estimated using Fourier transform. Values of coherence range between 0 (random coupling) and 1 (perfect synchronisation) (Pascual-Marqui et al., Reference Pascual-Marqui, Lehmann and Koukkou2011).

36.3.2 Phase-Locking Value (PLV)

PLV measures frequency-specific phase synchronisation between two signals. It is computed by calculating the distribution of phase difference extracted from two source time series x(t) and y(t). It is formally given by:

(2)

{PLV}_{t} = \frac{1}{N} |\sum_{n = 1}^{N} exp (jθ (t, n))|

where $θ (t, n)$ gives the phase difference $ϕ_{1} (t, n) - ϕ_{2} (t, n)$ . The phase information is typically extracted using the Hilbert transform. PLV provides a summary statistic of the phase difference at t (Lachaux et al., Reference Lachaux, Rodriguez, Martinerie and Varela1999).

36.3.3 Mutual Information (MI)

MI serves as a measure of mutual dependence between two random variables. It is used to quantify the amount of information that can be obtained about one variable by observing the other variable. Unlike speech–brain coherence or PLV, MI captures both linear and non-linear interactions between the two signals. An additional advantage of the method is that the same framework can be extended to study different aspects of the underlying signals (e.g., phase-phase, amplitude-amplitude, phase-amplitude, or cross-frequency coupling). The MI between two random variables $X$ and $Y$ is mathematically given as follows:

(3)

I (X; Y) = \sum_{x \in X} \sum_{y \in Y} P (x, y) log \frac{P (x, y)}{P (x) P (y)}

where $P (x)$ and $P (y)$ are the marginal distributions of variables $X$ and $Y$ , respectively, and $P (x, y)$ is the joint distribution of these variables.

The general steps involved in all three above methods include: (1) band-pass filtering of the neural time series and the speech signal in the same frequency bands; (2) extraction of the relevant quantity (e.g., spectral density, phase, or amplitude information); before (3) subjecting it to the relevant mathematical operation.

36.3.4 Multivariate Temporal Response Function (mTRF)

The mTRF is a novel method for investigating the neurophysiological processing of the auditory signal. Unlike the methods mentioned above, the mTRF method involves decoding the patterns of neural activity related to a particular stimulus feature using a set of linear filters, which could include acoustic envelope, spectrogram, phonemes, or phonetic features (Crosse et al., Reference Crosse, Di Liberto, Bednar and Lalor2016; Di Liberto et al., Reference Di Liberto, O’Sullivan and Lalor2015). These filters are trained on, for example, 80% of the data and then applied to the remaining 20% to generate predictions (or the time course) of the stimulus feature in question. The mTRF approach has some advantages. First, an explicit pre-selection of channels (or ROIs) is not required as data from all the channels is used to create a stimulus reconstruction. Second, the commonly used backward modelling approach can maximise sensitivity to key signal differences between highly correlated sensors. This is achieved by mapping data from all sensor locations simultaneously and by detecting correlations in the data.

36.3.5 Comparison of Approaches

Relevant developmental research in the auditory domain has historically been dominated by the use of non-speech sounds as stimuli, such as amplitude-modulated or frequency-modulated tones, to measure auditory steady-state response (ASSR). These approaches remain very popular because the neural responses to such stimuli are very robust and can reliably be recorded across the lifespan. However, such experiments suffer from a lack of ecological validity and don’t allow us to measure the development of neural responses in a naturalistic setting. Experiments with the use of naturalistic, immersive paradigms have recently started to increase, using the methods described above. Such paradigms using audiovisual stories, nursery rhymes, or IDS allow us to study how multiple streams of information are processed by the infant’s brain. This gives a clear benefit of increased generalisability of the findings.

All the methods outlined in our chapter (coherence, PLV, MI, and mTRF) generally suffer from the same limitations; that is, developmental studies tend to have smaller sample sizes and noisier data than the adult studies from which these techniques have been developed. The ability of each method to deal with inherent low SNR should be considered by the researcher. Further, all the methods briefly reviewed here (except for MI) rely on linear relationships between the speech signal and the neurophysiological data. This assumption may not be sufficient to fully encapsulate the brain’s response to speech stimuli. A further limitation specific to mTRF concerns model selection, as the researcher must define the specific speech parameter that they are interested in studying (e.g., speech envelope, spectrogram, or phonetic features). Choosing the right model for the mTRF can be challenging, and different models may perform differently for different types of stimuli or neural responses. The set of models that generate statistically significant results for one research group may not generalise to other tasks/conditions/datasets. We also think it is important to highlight that the mTRF reconstruction values are often very small. This might be partly to do with the noisy nature of M/EEG signals. Whilst the effects reported in the literature using the mTRF method show statistical significance when compared to a null distribution, their clinical significance remains under-explored, and this will be a critical next step for the field.

Finally, it is worth noting that the brain’s responses to rhythmic stimulation can be a mixture of series of evoked responses and non-phase-aligned oscillatory (or induced) responses (David et al., Reference David, Kilner and Friston2006). It is important to disentangle the two when studying oscillatory responses in infants as researchers risk attributing oscillatory functions to evoked activity. This can be achieved by removing the averaged evoked response from the data before analysing it or by incorporating computational models (e.g., Doelling et al., Reference Doelling, Assaneo, Bevilacqua, Pesaran and Poeppel2019) with theoretical models of language acquisition.

36.4 Synthesis of Infant Rhythmic Processing Literature

As identified in Table 36.1, we are now well equipped to ask and answer questions on the neural underpinnings of rhythmic speech processing. The studies outlined below offer a snapshot of ‘neural entrainment’ research in infants and how this mechanism may aid language acquisition. The precise definition of neural entrainment remains hotly debated (Giraud, Reference Giraud2020; Haegens, Reference Haegens2020; Meyer et al., Reference Meyer, Sun and Martin2020), and we prefer the term speech tracking. Here, speech tracking is defined as the neural process by which the ongoing neurophysiological activity follows the patterns of the speech signal. However, a causal link has yet to be established.

Table 36.1A summary of techniques that have been used to measure speech tracking developmentally

Table 36.1A summary of techniques that have been used to measure speech tracking developmentally
Method	Relevant papers
Speech–brain coherence	Fransen et al. (Reference Fransen, van Ede and Maris2015); Mandke et al. (Reference Mandke, Flanagan and Macfarlane2022); Molinaro et al. (Reference Molinaro, Lizarazu, Lallier, Bourguignon and Carreiras2016); Pascual-Marqui et al. (Reference Pascual-Marqui, Lehmann and Koukkou2011)
Phase-locking value	Lachaux et al. (Reference Lachaux, Rodriguez, Martinerie and Varela1999); Peelle et al. (Reference Peelle, Gross and Davis2013)
Mutual information	Gross et al. (Reference Gross, Hoogenboom and Thut2013); Teng et al. (Reference Teng, Cogan and Poeppel2019); Zan et al. (Reference Zan, Presacco, Anderson and Simon2019)
Multivariate temporal response function	Attaheri et al. (Reference Attaheri, Choisdealbha and Di Liberto2022a); Crosse et al. (Reference Crosse, Di Liberto, Bednar and Lalor2016); Di Liberto et al. (Reference Di Liberto, O’Sullivan and Lalor2015)
Phase-amplitude coupling	Attaheri et al. (Reference Attaheri, Choisdealbha and Di Liberto2022a, Reference Attaheri, Choisdealbha and Rocha2024); Canolty et al. (Reference Canolty, Edwards and Dalal2006)

To our knowledge, the first study to investigate the differential neural substrates of IDS and ADS tracking measured neurophysiological (EEG) responses to recordings of naturalistic IDS and ADS in seven-month-old, pre-verbal infants (Kalashnikova et al., 2018). In this study, spectral analysis revealed that the theta-band power over the left hemisphere was significantly larger than the right hemisphere. The hemispheric differences provide compelling evidence in support of the asymmetric sampling hypothesis (AST) (Hickok and Poeppel, Reference Hickok and Poeppel2007; Poeppel, Reference Kalashnikova2003). It is possible that the functional asymmetry postulated by AST may have origins as early as seven months of age, when infants are at the beginnings of language production, producing babbling. Furthermore, analysis using mTRF showed that theta-band (4–8 Hz) cortical tracking of the speech envelope was greater for IDS than ADS. Here, the authors investigated theta tracking as they were interested in how the exaggerated prosodic features of IDS, such as higher pitch and slower tempo, may enhance the salience of speech sounds for infants and make them easier to process. These findings are in line with the literature outlined in our introduction, which suggests that these unique characteristics of IDS may play an important role in early language acquisition in infants. That amplitude envelope was tracked gives a first insight into the idea that it is indeed the rhythm of IDS that is a critical component. However, the choice of investigating theta-band oscillations in response to the envelope reflects the authors’ interest in IDS directing the infants’ attentional spotlight, and it would be very interesting to understand how different saliency cues drive cortical tracking. Without additional manipulations, it is not possible to know if it specifically or exclusively the enhanced low-frequency rhythms of IDS driving cortical tracking.

In a longitudinal study, Attaheri et al. (Reference Attaheri, Choisdealbha and Di Liberto2022a) measured cortical tracking of sung speech in infants at four, seven, and 11 months of age using mTRF applied to EEG data in canonical delta, theta, and alpha bands. Audiovisual stimuli were used of a woman performing various British nursery rhymes (e.g., ‘Twinkle Twinkle Little Star’). The results revealed that infants had above-chance performance cortical tracking in delta and theta bands across the three time points. They also identified the presence of strong phase-amplitude coupling with delta–theta bands as the drivers. More details on this study and its functional interpretation can be found in Chapter 35. Whilst the current data cannot provide direct evidence for the involvement of this neural process in the extraction of linguistically meaningful information, the data form part of a longitudinal study that continued to track infants into the third year of life with detailed language assessments. Through this design, it is possible to see the extent to which early processing of the amplitude envelope of sung speech predicts language acquisition (Attaheri et al., Reference Attaheri, Choisdealbha and Di Liberto2022a, Reference Attaheri, Choisdealbha and Rocha2024). However, what is also intriguing about these results is a complex developmental pattern. The original longitudinal findings show the strongest mTRF values of cortical tracking at four months of age, with significantly lower levels at 11 months. Attaheri et al. (Reference Attaheri, Panayiotou and Phillips2022b) also replicated their findings with adults. This study used identical stimuli and the same analysis pipeline to that of the infant study. Here, their findings revealed that adult cortical responses to sung speech reflected very similar underlying processes, showing increased delta- and theta-band tracking, with similar mTRF values for adults as for the infants at the youngest time point tested (four months). The suggested overall trajectory may therefore be an inverted U shape, with younger infants performing similarly to adults, but with weaker tracking in the intervening period. Such interpretation remains speculative, especially as it is not clear from these results whether greater tracking at four months than 11 months is the result of a true developmental characteristic (e.g., increased salience of the stressed syllable amplitude modulation at this early age), or a physical or methodological characteristic (e.g., cleaner EEG data at the earlier age whilst the infant is less mobile).

In another study of how natural IDS facilitates the neural processing of prosody in infants, Menn et al. (Reference Menn, Michel, Meyer, Hoehl and Männel2022) used EEG to measure speech–brain coherence in seven–nine-month-old infants. The infants listened to either IDS or ADS presented live by their caregiver. The results showed statistically significant speech–brain coherence for IDS and ADS at prosodic rates. However, the speech–brain coherence was significantly greater for IDS compared to ADS, specifically in the prosodic rates. The authors suggest that natural IDS may facilitate infants’ ability to track and learn the rhythmic features of speech, which could in turn support language development. The rhythmic patterns of IDS, which are characterised by exaggerated intonation, slower tempo, and higher pitch, are thought to aid in infant attention and arousal, as well as in the formation of speech representations in the brain. The main contribution of this study is the ecological validity of the naturalistic speech, with the design set up such that the caregiver and infant were communicating as they would at home, in the IDS condition. However, Menn et al. also draw our attention to the fact that the ADS condition was not exactly matched to the IDS, as caregivers were instructed to additionally remove all ostensive cues, such as mutual gaze. It would be of great interest to understand the additive benefit of such cues in future work.

Across the studies and techniques discussed thus far, it is worth noting that most use non-specific linguistic timescales and assume that there is a one-to-one mapping between speech rhythms and canonical neural oscillations (e.g., delta, theta, gamma bands). Of significant importance is the inter-speaker variability or different registers of speech such as ADS or IDS that can produce speech rhythms across a broad range (see chapters in Section 6). Therefore, it is important to first identify the specific linguistic timescales of interest in the speech material before studying the corresponding neural oscillations. This question was first addressed in adults (Keitel et al., Reference Keitel, Gross and Kayser2018) by manually annotating their stimulus material, and by applying data-driven filtering to CDS (Mandke et al., Reference Mandke, Flanagan and Macfarlane2022). Both these studies identified prosodic features < 5 Hz. These statistical regularities are noticeably lower than assumptions made in the literature; for example, syllable rate is reflected in the theta band (4–8 Hz) and phoneme rate in the gamma band (> 30 Hz). Constraining the neural oscillations by linguistic boundaries identified in the stimulus material will improve the precision of interpretation, particularly in the language acquisition literature.

Overall, these studies provide valuable insights into the neural mechanisms underlying the processing of prosody in infants and highlight the importance of natural IDS in supporting early language development. However, speech envelope tracking alone may not be sufficient to account for language acquisition, as it oversimplifies the computations undertaken by the infant brain. It fails to consider the role of other features contained in the speech signal, such as phonetic features, formant transitions, temporal fine structure, and so on. For example, Inbar et al. (Reference Inbar, Genzer, Perry, Grossman and Landau2023) recently investigated the neurophysiological basis of intonation units (IUs), a fundamental unit of human languages (Inbar et al., Reference Inbar, Grossman and Landau2020). In their naturalistic listening study using EEG, Inbar et al. (Reference Inbar, Genzer, Perry, Grossman and Landau2023) demonstrated robust evoked responses to IU in adult listeners. For further details, we direct the reader to Chapter 15. The evidence from the adult speech-tracking literature strengthens proposals that as the acoustic information travels along the auditory pathway, higher-order structures extract more complex representations from the speech signal. This representational hierarchy receives support from the fact that the anatomy of the auditory system is also hierarchically organised. Future work to account for how meaning is assigned to these speech features (e.g., amplitude envelope), and how these are further used by the developing brain in speech production, will be valuable next steps (see Chapters 17 and 18).

36.5 Multimodal Rhythm Perception and Production in Relation to Language Acquisition

In our introduction, we highlighted that speech is a multimodal act, and in this section, we wish to stress that the rhythm in speech is multimodal. Typically, when infants are exposed to speech, they are not only hearing the auditory signal but also gaining rich visual information. For example, when singing to an infant, adults’ metrically strong moments involve temporally aligned eye-widening and blinking, in addition to the movement of the mouth (Lense et al., Reference Lense, Shultz, Ast Esano and Jones2022). Infants are receptive to this and their looking at the eyes of the singer is coordinated with these eye movements (Lense et al., Reference Lense, Shultz, Ast Esano and Jones2022). IDS is produced with larger mouth movements than ADS (Green et al., Reference Green, Nip, Wilson, Mefferd and Yunusova2010) and more head movements (Smith and Strader, Reference Smith and Strader2014). Eyebrow movements and head nods are particularly useful cues to phrase boundaries and are again more prominent in IDS than ADS (de la Cruz-Pavía et al., Reference de la Cruz-Pavía, Gervain, Vatikiotis-Bateson and Werker2020). Such inter-sensory redundancies (i.e., synchronous information across modalities) facilitate the detection of changes in prosody above an auditory cue alone (Bahrick et al., Reference Bahrick, McNew, Pruden and Castellanos2019). For more details on the multimodal nature of the speech input that infants receive, we direct the reader to Chapter 38.

Aside from the focus on the rich multimodality of the language stimulus directed to infants, it is also critical that we do not forget the multimodality of infants’ attempts at language production. The relationship between gross motor actions across limbs and the development of speech is well recognised. Early repetitive motor movements, such as kicking or hand-waving, in which infants can spend 40% of their time, have been described as stereotypies, reflexive or rhythmic actions that precede more deliberately controlled movement (Thelen, Reference Thelen1981). We can think of these rhythmic movements as a ‘passive’ response to the speech, with seminal studies showing that neonates’ earliest movements are associated with the timing of adult speech (Condon and Sander, Reference Condon and Sander1974). However, we can also think further about rhythmic movements whilst infants are actively generating speech sounds themselves. From a dynamic systems theory approach, rhythmic motor actions produced with the mouth and hand may entrain each other, such that the generation of a well-practised action such as hand-banging may ‘pull in’ the timing of vocalisations (Iverson and Thelen, Reference Iverson and Thelen1999). It is well documented that fluent speech is preceded by canonical babbling, where the infant produces repetitions of consonant-vowel syllables (Kuhl, Reference Kuhl2004). Rhythmic movements such as shaking a rattle reach their peak around the time that infants begin canonical babbling, and drop off once babbling is established (Ejiri, Reference Ejiri1998; Iverson et al., Reference Iverson, Hall, Nickel and Wozniak2007). Infant babbling is frequently temporally coordinated with rhythmic movement such as hand-banging, and the vocalisations that co-occur with such movement show more mature properties, which sustain after the movement ends (Ejiri and Masataka, Reference Ejiri and Masataka2001).

Therefore, in addition to considering infants’ neural tracking of speech rhythm, we believe it is critical to also consider infants’ motoric rhythmic responses to speech. Whilst the cortical tracking methods described above give fine-grained temporal information, the seminal studies of infant movement discussed thus far largely rely on micro-coding of video data, constrained by the frame rate of the video collected. The advancement of motion capture technology now facilitates nuanced analysis of infant movement without the need for the frame-by-frame hand coding of video. Optical 3D motion capture uses reflective markers placed at strategic points on the infant, from which x-, y-, and z-coordinates can be derived. In Figure 36.2a, the infant is wearing rigid bodies (prearranged unique combinations of markers stuck to a firm board), attached to the limbs and head via soft, elasticated, fabric straps. Optical motion capture systems can record infant movement at up to 2,000 frames per second, allowing incredibly high precision measurement. However, these systems are still relatively expensive, requiring the use of multiple near-infrared cameras to measure the reflection of light from the markers. The recent emergence of markerless motion-tracking technology for 2D pose estimation can allow even more naturalistic recording of infant movement via normal video. Markerless motion tracking uses deep-learning models trained on large video datasets to tag key points such as wrist, elbow, or shoulder, and even facial features. Figure 36.2b shows the application of an open-source markerless motion capture model to an infant drumming.

Figure 36.2

Motion capture methods.

2A.

Infant wearing rigid body reflective marker arrangements for optical motion tracking.

A photo of a baby in a high chair. The baby looks toward his left and smiles. He is wearing a headband with sensors.

2B.

Infant recorded on home webcam and analysed offline using OpenPose open-source markerless motion capture.

A photo of a baby in a high chair. A skeletal-like structure is superimposed on the baby's body.

Picture credits: 2A: Eleanor Smith

Motion capture studies of infants’ rhythmic movement whilst listening to rhythmic stimuli provide interesting insights. Whilst as a group they do not show sensorimotor synchronisation to the rate of auditory presentation at an adult level, over the first two years of life, infants show tempo-flexibility or move faster to faster auditory tempi and slower to slower tempi (Rocha and Mareschal, Reference Rocha and Mareschal2017; Zentner and Eerola, Reference Zentner and Eerola2010). Case studies show that some infants may show such adaptation to the rate of music as early as three to four months of age (Fujii et al., Reference Fujii, Watanabe and Oohashi2014). Due to the convincing evidence that infants are tracking the amplitude envelope of speech sounds and other rhythmic auditory stimuli (Attaheri et al., Reference Attaheri, Choisdealbha and Di Liberto2022a; Menn et al., Reference Menn, Michel, Meyer, Hoehl and Männel2022), it is tempting to conceive of infant rhythmic movement as a reflection of the strength of the tracking of the signal (i.e., to hypothesise that those who are tracking the stimulus well will also show better temporal matching of their movement to that stimulus). However, infants’ spontaneous rhythmic movements are found equally in silence as they are to a rhythmic musical stimulus (de l’Etoile et al., Reference de l’Etoile, Bennett and Zopluoglu2020; Fujii et al., Reference Fujii, Watanabe and Oohashi2014). Zentner and Eerola (Reference Zentner and Eerola2010) showed equal rates of rhythmic movements for a simple drumbeat as naturalistic music, but less for IDS or ADS. Whilst the relationship between infants’ quantity and rate of rhythmic movement is therefore not directly tied to what they are hearing, it is very interesting to consider how the development of rhythmic movement and sensorimotor synchronisation may unfold. For example, in the Cambridge UK BabyRhythm longitudinal study, where infant drumming to speech and non-speech rhythmic stimuli was recorded using motion capture, Rocha et al. (Reference Rocha, Attaheri and Choisdealbha2024) show that infant drumming becomes more rhythmic with age. This was particularly true when infants were drumming in a silent control condition, which mirrors previous findings that infants’ spontaneous motor tempo becomes faster and more regular over the first years of life (Rocha et al., Reference Rocha, Southgate and Mareschal2021). The Cambridge UK BabyRhythm study compared infant drumming in silence to an isochronous 2 Hz drumbeat, an isochronous 2 Hz repetition of the syllable ‘ta’, and naturalistic sung nursery rhymes. Rocha et al. found that infant drumming in the presence of a drumbeat showed a similar maturation as their spontaneous motor tempo, becoming faster and more rhythmic with age. However, infants did not show the same pattern of becoming more regular with age in the linguistic conditions (repeated syllables and sung nursery rhymes). It is possible that infants are upregulating variability, in response to more complex auditory stimuli, perhaps reflecting a trade-off between greater adaptation to the stimulus with age and rhythmicity.

36.6 Conclusion

This chapter aimed to unpack how rhythm supports language acquisition. Furthermore, we have provided an overview of the methods and highlighted some open questions. It will be of great interest to better understand gross motor rhythmic action in the context of speech perception and production, and the interplay between the seemingly good early cortical tracking, with the emphasis on neural alignment, and less precise or more variable behavioural tracking.

The focus of this chapter on neural tracking of the rhythmic auditory information in speech reflects cutting-edge neuroscience, using ever more sophisticated techniques to drill into the minutiae of perception. The focus is undoubtedly on the auditory modality, but even where this focus is being broadened to consider, for example, visual information present on the face (Lense et al., Reference Lense, Shultz, Ast Esano and Jones2022; Ní Choisdealbha et al., Reference Ní Choisdealbha, Attaheri and Rocha2024), it is still often measuring the unidirectional impact of features of speech, often presented on a screen, to the neural firing of the infant. In the final section, we make several recommendations as to how we can integrate diverse areas of knowledge and capitalise on the rapid technological developments, to consider the role of rhythm in infant language acquisition more holistically.

We outlined how methodological advancements have provided insight into the way that the infant brain processes language. Recent years have shown that the rhythmic information carried by the amplitude envelope of the speech signal is a core characteristic of IDS that infants are indeed processing. However, it is important to acknowledge other low-level features (e.g., envelope, spectrogram, temporal fine structure) and high-level features (e.g., phonetic features) that are part of the speech signal and are reflected in the EEG signal (Di Liberto et al., Reference Di Liberto, O’Sullivan and Lalor2015). To what extent are these additional timing or landmark cues important? To what extent are these cues more or less important in the special case of IDS? It is important that we do not simply apply our learning from adult speech studies to developmental problems. In the coming years, we can add thorough consideration of the other rhythmic properties of IDS that may be critical, for example, in the visual, touch, or motor domains. In doing so, and without constraining our focus on rhythm to only reflect the amplitude envelope, we can fully consider the breadth of developmental scaffolding that IDS provides.

Finally, in addition to understanding the rich multimodality of spoken language, it is critical to note that outside of the lab, exposure to IDS occurs in bidirectional social interactions (Menn et al., Reference Menn, Männel and Meyer2023), within a wider conversational context (Golinkoff et al., Reference Golinkoff, Can, Soderstrom and Hirsh-Pasek2015). Regarding the infant as simply a passive receiver of (auditory) information does a great disservice to how we know that infants develop language. For example, neonates vocalise more when a parent is present (Caskey et al., Reference Caskey, Stephens, Tucker and Vohr2011). Other strands of research into early communication are taking a hyper-scanning approach, where the neural activity of both the infant and the caregiver is recorded simultaneously (e.g., Nguyen et al., Reference Nguyen, Abney, Salamander, Bertenthal and Hoehl2021a, Reference Nguyen, Schleihauf and Kayhan2021b, Reference Nguyen, Schleihauf and Kungl2021c). The M/EEG methods we have described in this chapter can handle the complexity of this kind of information, and we should attempt to embrace this complexity to further centre the infant as a participator in, rather than the recipient of, IDS.

Box 36.1Chapter Overview