Neurocomputational modeling of speech motor development

Abstract This review describes a computational approach for modeling the development of speech motor control in infants. We address the development of two levels of control: articulation of individual speech sounds (defined here as phonemes, syllables, or words for which there is an optimized motor program) and production of sound sequences such as phrases or sentences. We describe the DIVA model of speech motor control and its application to the problem of learning individual sounds in the infant’s native language. Then we describe the GODIVA model, an extension of DIVA, and how chunking of frequently produced phoneme sequences is implemented within it.

The DIVA model of speech motor control
The Directions Into Velocities of Articulators (DIVA) model is an artificial neural network that provides a quantitative account of the computations underlying speech motor control (Guenther, 1995; Tourville & Guenther, 2011; Golfinopoulos, Tourville, & Guenther, 2010; see Guenther, 2016 for a detailed treatment). It contains a network of simulated components which represent brain structures responsible for producing speech. The model includes an articulatory synthesizer that mimics the behavior of the vocal tract, and the neural network learns to control movements of the synthesizer's articulators in order to produce intelligible speech. We focus herein on a higher-level treatment of the model's neural computations and developmental processes, avoiding mathematical equations and computer implementation details for tractability.
To understand the model, we will start by defining a speech sound to be a "chunk" of speech that has its own optimized motor program in the brain. These chunks could be phonemes, syllables, and/or words, depending on the age and linguistic experience being considered. In keeping with a number of prior proposals (e.g., Kozhevnikov & Chistovich, 1965; Levelt, 1993; MacNeilage & Davis, 1990) and supported by distributional analyses of phoneme combinations (Sun & Poeppel, 2022; Kessler & Treiman, 1997), we suggest that the syllable is the most typical sound chunk with an optimized motor program. However, motor programs likely also exist for individual phonemes as well as frequently produced multisyllabic utterances, such as common words or names of familiar people and locations. Note that the motor programs can be hierarchical; for example, a syllabic motor program will consist of individual phoneme motor programs along with optimized transitions between these phoneme motor programs.
The model assumes that, in the mature speaker, speech production begins with an intended linguistic message being translated by higher-level brain regions into a sequence of speech sounds. Motor sequencing circuits then activate the appropriate nodes of a speech sound map in ventral premotor cortex (vPMC), which is the highest processing level represented in DIVA. While this model focuses on segmental control (production of phonemes, syllables, and words), it should be noted that prosodic control is also an essential goal of speech motor development (Mattys, Jusczyk, Luce, & Morgan, 1999; Kehoe & Stoel-Gammon, 1997).

Neural components of the DIVA model
The brain structures whose functions are simulated by the DIVA model are illustrated in Figure 1. Each box corresponds to a set of modeled neurons, or nodes, that together form a neural map of some type of speech-relevant information. Larger boxes indicate cortical regions and smaller boxes indicate subcortical nuclei. Arrows represent excitatory projections while circles represent inhibitory projections, with the projection target being the area touching the arrowhead or circle. Production of a speech sound starts with activation of a node representing that particular sound in a speech sound map in the left ventral premotor cortex. Activation of this node leads to motor commands that arrive in motor cortex via two control systems: a feedforward control system and a feedback control system.
The feedforward control system generates previously learned motor programs for speech sounds. This process involves two components. The first component of feedforward control ensures that the motor program is initiated at the appropriate time. Timing control is carried out by a cortico-basal ganglia loop that includes an initiation map in the supplementary motor area (SMA). This loop identifies the appropriate sensory, motor, and cognitive context for producing the speech sound. We suggest that the input structures of the basal ganglia monitor these contextual cues, with the caudate monitoring cognitive context and the putamen monitoring sensory and motor contexts. When the appropriate context for producing a speech chunk is identified, a corresponding node is activated in the initiation map via the globus pallidus (GP), substantia nigra pars reticulata (SNr), and the ventral anterior (VA) thalamic nucleus. This initiation map node activation triggers the readout (execution) of the learned motor program for the current speech sound.
The second component of the feedforward control system comprises the motor programs themselves, which generate feedforward commands for producing learned speech sounds. These commands are encoded by synaptic projections from the speech sound map to an articulator map in the right and left ventral primary motor cortex (vMC). The cortico-cortical projections from left vPMC to vMC are supplemented by a cerebellar loop passing through the pons, cerebellar cortex lobule VI (Cb-VI), and the ventral lateral (VL) nucleus of thalamus. This division of motor execution between cerebellar and basal ganglia loops was originally proposed in a theory founded on nonhuman primate neurophysiology (Hikosaka, Nakamura, Sakai, & Nakahara, 2002), with later support being provided by human neuroimaging (Doyon et al., 2009). Note that multiple instances of a structure in Figure 1, such as the Cb, are implemented as separate non-overlapping neural populations within that structure. For example, separate Cb networks process feedforward commands, auditory targets, and somatosensory targets.
The     detects and corrects for mismatches between the auditory target and the current auditory feedback.Axonal projections from speech sound map nodes in vPMC -both directly and via a cortico-cerebellar loop involving the pons, cerebellum (Cb), and medial geniculate (MG) nucleus of the thalamus arrive at an    in the higher-order auditory cortical areas in posterior auditory cortex (pAC), including the posterior superior temporal gyrus and sulcus and the planum temporale.These projections signal the expected auditory percept generated by the sound currently being produced.
The auditory target for the current sound is compared to incoming reafferent auditory signals. This information is transmitted to cortical areas via MG and is represented in the model's auditory state map. If the current auditory state does not match the target, auditory error nodes in the higher-order auditory cortical areas become active. These types of predictive and error-related responses have been localized to auditory cortex by neural recordings in humans (Hashimoto & Sakai, 2003; Okada, Matchin, & Hickok, 2018; Ozker, Doyle, Devinsky, & Flinker, 2022). Auditory error node activities are then transformed into corrective motor commands through projections from the auditory error nodes to the feedback control map in right vPMC, which in turn projects to the articulator map in vMC both directly and via a loop through the pons, Cb, and VL. Auditory error is computed as a simple subtraction of the target from the state. This subtraction is enabled by making the Auditory State, Target, and Error Maps contain identical representations of speech sounds and equalizing the strength of inputs from the Target and State Maps to the Error Map.
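The error computation and its conversion into a corrective command can be illustrated with a minimal Python sketch. This is not the DIVA implementation: the two-formant representation, the numerical values, the linear `inverse_model` weights, and the function names are all assumptions chosen for illustration.

```python
# Toy sketch (not the DIVA implementation): auditory error as a subtraction,
# mapped to a corrective articulator command by a hypothetical linear inverse model.

def auditory_error(state, target):
    """Error map: subtract the target from the current state, element-wise."""
    return [s - t for s, t in zip(state, target)]

def corrective_command(error, inverse_model, gain=1.0):
    """Feedback control map: turn the error into articulator velocities that
    oppose it. inverse_model[i][j] is an assumed weight from auditory
    dimension j to articulator i."""
    return [-gain * sum(w * e for w, e in zip(row, error)) for row in inverse_model]

# Hypothetical formant values in Hz (F1, F2), for illustration only.
target = [700.0, 1200.0]
state = [650.0, 1300.0]
inverse_model = [[0.01, 0.0],    # articulator 0 is assumed to shift mainly F1
                 [0.0, 0.005]]   # articulator 1 is assumed to shift mainly F2

error = auditory_error(state, target)
command = corrective_command(error, inverse_model)
```

Because the state and target share one representation, the subtraction is well defined in every dimension, which is the reason the model gives the State, Target, and Error Maps identical representations.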
The DIVA model also contains a somatosensory feedback control subsystem, the main components of which are hypothesized to reside in ventral somatosensory cortex (vSC). Projections from the speech sound map to the somatosensory target map encode the expected somatosensory feedback during sound production. These projections include cortico-cortical as well as cortico-cerebellar loop projections via the ventral posterior medial (VPM) thalamic nucleus. The model's somatosensory state map represents proprioceptive and tactile information from the speech articulators. If the somatosensory state does not match the current target, the somatosensory error map sends a corrective command via the feedback control map to correct subsequent motor commands. Studies in which articulator sensory feedback is perturbed during speaking suggest that the somatosensory error map resides primarily in ventral somatosensory cortex (Golfinopoulos, Tourville, Bohland, Ghosh, Nieto-Castanon, & Guenther, 2011).
The components of the DIVA model are a set of heterogeneous, biophysically realistic neural networks. Different neural network structures were chosen for each component based on the distinct function they serve. For example, different architectures were required for the error maps, which compute differences between two input signals, and the Initiation Map, which controls the timing of activation in a downstream structure. Some components in Figure 1 were not instantiated as full neural networks, such as VA and VL, which serve as simple relays from the basal ganglia to the cortex.
Unlike in other models of speech motor control (e.g., Hickok, 2014), feedforward commands in DIVA proceed directly to primary motor cortex, without comparison to an internal model of sensory consequences. The lack of sensorimotor knowledge present at this processing stage is not problematic in the scenarios addressed by the model, in which auditory targets have already been well learned. However, this simplification does limit the applicability of DIVA to particular speech phenomena, such as internal error correction (Nozari, Dell, & Schwartz, 2011) and attempts to imitate unfamiliar sounds (e.g., Hao & Jong, 2016).
Because most projections in the model are long-range and originate in the cerebral cortex, they are modeled as excitatory, to match known neuroanatomy (DeFelipe & Fariñas, 1992; see Urrutia-Piñones, Morales-Moraga, Sanguinetti-González, Escobar, & Chiu, 2022 regarding exceptions to this pattern). In the case of error maps, inputs are modeled as inhibitory, which is necessary for detecting differences between sensory states and sensory targets. Correlates in the brain of these projections to error maps likely use feedforward inhibition, in which a source area provides long-range excitatory projections to inhibitory neurons in a target area, effectively inhibiting certain excitatory neurons in that target area (Li, Ji, Liang, Li, Xiao, Tao, & Zhang, 2014; Naskar, Qi, Pereira, Gerfen, & Lee, 2021). All pathways in Figure 1 are assumed to have been established by birth, though the micro-scale patterns and weights of connections maintain plasticity, allowing for further postnatal development (Kostović & Jovanov-Milošević, 2006; Dubois et al., 2014).

Implementation of speech motor learning in DIVA
In order for the DIVA model to produce speech, it must undergo a learning process analogous to what occurs in the developing infant brain. The stages of this process are simplified for the purposes of implementation into a babbling phase and an imitation phase.
The babbling phase involves the generation of semi-random articulator movements through activation of nodes in the model's articulator map (corresponding to vMC), which drives movements of the speech articulators and the generation of auditory and somatosensory feedback signals. The resulting combination of auditory, somatosensory, and articulatory representations is used to tune inverse models that map somatosensory and auditory errors into corrective motor commands via the feedback control map in Figure 1. The learning in this stage is not phoneme- or syllable-specific; the learned sensory-motor transformations are applicable to all speech sounds that will be learned later.
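As a rough illustration of how random babbling could tune an inverse model, the following Python sketch pairs semi-random articulator movements with their auditory consequences and fits a sensory-to-motor mapping with a simple delta rule. The linear toy "vocal tract" (`JACOBIAN`), the learning rule, and all names and parameter values are assumptions made for this example; they are not taken from the DIVA implementation.

```python
import random

def vocal_tract(d_motor, jacobian):
    """Toy linear synthesizer (assumed): auditory change caused by a movement."""
    return [sum(j * m for j, m in zip(row, d_motor)) for row in jacobian]

def babble(jacobian, steps=2000, lr=0.1, seed=0):
    """Tune an inverse model w (auditory change -> articulator movement)
    from semi-random movements, using a delta rule on each babble."""
    rng = random.Random(seed)
    n_motor = len(jacobian[0])
    n_aud = len(jacobian)
    w = [[0.0] * n_aud for _ in range(n_motor)]
    for _ in range(steps):
        # Semi-random articulator movement and its auditory consequence.
        d_motor = [rng.uniform(-1.0, 1.0) for _ in range(n_motor)]
        d_aud = vocal_tract(d_motor, jacobian)
        # Movement the current inverse model would infer from the sound change.
        predicted = [sum(wij * a for wij, a in zip(row, d_aud)) for row in w]
        # Nudge the weights toward the movement that was actually made.
        for i in range(n_motor):
            residual = d_motor[i] - predicted[i]
            for j in range(n_aud):
                w[i][j] += lr * residual * d_aud[j]
    return w

# Hypothetical 2-articulator, 2-formant mapping, for illustration only.
JACOBIAN = [[1.0, 0.5], [0.0, 2.0]]
inverse_model = babble(JACOBIAN)
```

Nothing in this procedure refers to a particular phoneme or syllable, which mirrors the point that the transformations learned during babbling apply to any sound learned later.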
During the imitation phase, the model is presented with sample speech sounds to learn, similar to an infant being exposed to the sounds of their native language. These sounds take the form of time-varying acoustic signals corresponding to phonemes, syllables, or words. Based on these samples, the model first learns an auditory target for each sound. Learning of a sound's auditory target involves activation of a speech sound map node that will later represent the sound for production. This occurs via a speech recognition system when the model "hears" the sound, which corresponds to a child hearing a new speech sound directed at them by a parent, for example. This in turn leads to adjusting synaptic weights in the projections from that speech sound map node to the auditory cortex to encode the sound's auditory target.
After an auditory target for a sound has been learned, the model can attempt to produce the sound. The appropriate nodes in the initiation map and speech sound map must first be activated. At first, the model will not have a tuned motor program for producing the sound in a feedforward manner, nor will it have a somatosensory target. Thus, the system will depend primarily on auditory feedback for guidance. On each production attempt, the motor program will be updated to incorporate the commands generated by the auditory feedback control subsystem on that attempt. These commands are generated by first determining the auditory error (i.e., the distance and direction in auditory space between the target and what was produced) in the Auditory Error Map. The auditory error is then sent to the Feedback Control Map, where it is transformed into articulator movements that will reverse the auditory error. This corrective signal is then sent to the Articulator Map, where it adjusts the velocities of articulator movements. Subsequent attempts will then have a more accurate feedforward command to guide production.
Over time, the feedforward commands will become sufficient by themselves for reliably producing the sound. That is, the motor program will have become accurate enough that it generates very few auditory errors, obviating the need for auditory feedback control in most instances. At this point the model can fluently produce the speech sound. As the speech articulators grow, the auditory feedback control subsystem continually corrects for changes in the biomechanics of the vocal tract. These corrective commands are subsumed into the motor program, thus allowing it to stay tuned despite significant changes to the shapes and sizes of the articulators over the course of life.
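The way corrective commands are folded into the motor program over repeated attempts can be sketched with a one-dimensional toy example in Python. The `plant` function, the gains, and the update rule are assumptions made for illustration, and the sketch ignores within-attempt correction: it only shows the error shrinking as feedback commands accumulate in the feedforward command.

```python
def plant(command):
    """Toy vocal tract (assumed): auditory outcome is twice the motor command."""
    return 2.0 * command

def practice(target, attempts=10, inverse_gain=0.5, learning_rate=0.5):
    """Repeat a sound, folding each attempt's corrective command into the
    feedforward command. Returns the final command and the error per attempt."""
    feedforward = 0.0   # no tuned motor program before the first attempt
    errors = []
    for _ in range(attempts):
        produced = plant(feedforward)
        error = produced - target            # state minus target
        correction = -inverse_gain * error   # feedback command that reverses the error
        feedforward += learning_rate * correction
        errors.append(abs(error))
    return feedforward, errors

feedforward, errors = practice(target=1.0)
```

With these assumed values the auditory error halves on every attempt, so after a handful of repetitions the feedforward command alone nearly reaches the target. If `plant` were changed between attempts, as a stand-in for vocal tract growth, the same loop would retune the command.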
As the model repeatedly produces a sound, it also learns a somatosensory target region for that sound, analogous to the auditory target region. The somatosensory target represents the expected proprioceptive and tactile sensations elicited when producing the sound. This target is different from the auditory target in that it cannot be learned from other speakers, as essential information about tactile patterns, tongue shape, etc. is not available to a listener. The somatosensory target must instead be learned through self-monitoring of one's own correct productions, a process that occurs at a later stage than the learning of auditory targets.
The simulation study of Callan, Kent, Guenther, and Vorperian (2000) provides an example of how the DIVA model has been used to investigate speech motor development. This study involved computer simulations of the process of learning and correctly producing English vowels during developmental growth of the vocal tract. The model was grounded in empirical data by including the sizes and shapes of infant vocal tracts measured with magnetic resonance imaging. Vowel formants were successfully produced along a developmental timeline that matched the one observed in real developing infants, demonstrating the feasibility of the model. The simulation provided additional insight into speech development by showing how infants could make use of motor equivalence to produce a sound, even under the constraints of changing articulator shapes and sizes.

Development of speech motor programs
The motor learning process implemented in computer simulations of the DIVA model as described in the previous section is a highly simplified approximation of speech development in children. In the current section, we provide a more detailed account of the stages of speech development in infants and children with reference to components of the DIVA model.

Overview of infant babbling
The first two months of infancy are characterized by a phonation stage (see Oller, 1980, and Stark, 1980, for reviews of infant babbling), during which speech-like vocalizations are only rarely exhibited. The few speech-like sounds that can be observed consist largely of phonation with the mouth closed or nearly closed. The next developmental phase, occurring from 2 to 3 months of age, is known as the "goo" stage and is characterized by the production of crude syllable-like sequences composed mostly of velar consonant-like elements in combination with vowel-like elements. By 4 to 6 months old, most infants enter the expansion stage, characterized by the production of several new sound types, including labiolingual and bilabial trills, growls, and squeals. The expansion stage may also contain some marginal babbling, consisting of vocal tract closures in combination with better-formed vowel-like utterances. Seven months of age sees most infants entering the canonical babbling stage, in which syllables with adult-like timing characteristics emerge. During this stage, many utterances consist of reduplicated syllables such as "bababa". The nonreduplicated babbling stage follows at around 10 months old; it is characterized by the use of different consonants and vowels within the same babbling sequence (e.g., "dadabi"). It has been suggested (MacNeilage & Davis, 1990) that during the nonreduplicated babbling stage infants begin learning how to produce the phonemes of their native language.
An important feature of this developmental sequence is that many non-speech vocalizations and articulator movements occur well before the onset of frequent speech sounds. It is this observation that motivates the two learning stages of the DIVA model. In the first stage, sensory-motor relationships between the motor, somatosensory, and auditory systems are learned. In a sense, this stage consists of learning about the biophysics of the vocal tract; that is, the infant learns the sensory consequences of various oromotor actions. In the second stage, individual speech sounds from the native language are learned. While these stages are typically carried out sequentially in model simulations for convenience, the real speech motor learning process is not so discrete (e.g., de Boysson-Bardies, Sagart, & Durand, 1984; Boysson-Bardies, Hallé, Sagart, & Durand, 1989; Mitchell & Kent, 1990) and involves processes not addressed in computer simulations of DIVA. Table 1 provides an overview of these processes, which are detailed in the following paragraphs.

Development of auditory and somatosensory maps
The ability to produce the speech sounds of a language depends heavily on the ability to perceive these sounds. Auditory representations of speech signals (corresponding to the DIVA auditory state and auditory error maps) show signs of language specificity in infants as young as 6 months of age (e.g., Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992). This likely reflects modifications in auditory cortical neuronal responses to optimally capture the auditory signatures of the native language. This developmental process likely does not require knowledge of the phonological units that make up the language, as it occurs at a very early stage of development (see row 1 of Table 1). The shaping of auditory representations can instead be driven by the statistical nature of the acoustic signals experienced by the infant (e.g., Guenther & Gjaja, 1996; Guenther, Husain, Cohen, & Shinn-Cunningham, 1999).
The somatosensory representations of the speech network, corresponding to the somatosensory state map in Figure 1, must also undergo development. Unlike auditory signals for speech, the somatosensory patterns associated with the sounds of a language cannot be learned by listening to native speakers. Thus, development of the somatosensory maps for speech likely lags behind development of auditory maps during the very early stages of infancy, at a time when articulations are limited. Once the infant starts producing more speech-like articulatory movements in the expansion, canonical babbling, and nonreduplicated babbling stages, their somatosensory maps likely become increasingly sensitive to the somatosensory patterns arising from these movements (row 2 of Table 1).

Development of sensory-motor transformations
The first movements of speech-related body parts begin almost immediately after birth, when an infant uses their vocal folds and respiratory system to cry and their lips, jaw, and tongue to feed. These movements generate somatosensory feedback and often auditory feedback as well, providing opportunities for the infant's brain to learn about sensory consequences of oromotor actions. Our motor systems have the ability to anticipate sensory consequences of movements commanded by motor cortical activity. Tuning of these sensory-motor predictions, often referred to as forward models, likely begins with early non-speech actions, then accelerates as the infant creates more and more speech-like utterances as they move through the goo, expansion, canonical, and nonreduplicated babbling stages (rows 3, 4, and 5 in Table 1).
The articulatory movements which occur during infant babbling can also be used to tune transformations in the reverse direction, that is, sensory-to-motor transformations, or inverse models. These transformations consist of learned mappings between auditory and somatosensory representations of ongoing vocalizations and the articulator movements that produce them. Prior to the development of auditory and somatosensory targets for speech sounds, nodes in the auditory and somatosensory error maps are not yet signaling "errors" per se; these nodes instead represent changes (velocities) in the auditory and somatosensory state that occur due to ongoing movements of speech articulators. This combination of motor activations and resulting sensory velocities enables the tuning of auditory-motor and somato-motor transformations well before an infant develops awareness of phonological units such as phonemes and words.
Later, as auditory and somatosensory targets are learned, the nodes in the auditory and somatosensory error maps stop reflecting ongoing changes in the sensory state and begin to reflect desired sensory changes (i.e., sensory errors, which can be thought of as desired sensory velocities for reaching the target). This development, which can be inferred to have occurred when infants begin to produce language-specific speech sounds, is reflected in the DIVA model by the transition from the babbling phase to the imitation phase, though the model does not simulate specific mechanisms for the cause of this transition. Some tuning of sensory-motor transformations likely continues into adulthood; evidence for such plasticity is provided by adaptation to auditory and somatosensory feedback perturbations (e.g., Houde & Jordan, 1998; Golfinopoulos et al., 2011; Lametti, Nasir, & Ostry, 2012).

Speech recognition and phonological target acquisition
The learning processes described thus far do not require any knowledge of the distinct phonemes, syllables, or words of a language. Instead, they tune transformations between the largely continuous motor, somatosensory, and auditory spaces without regard for the discrete phonological units that make up a language. These transformations form the essential elements of the feedback control system schematized in Figure 1.
The ultimate goal of the speech motor system is, however, to produce the discrete speech sounds of the native language. Before a child can learn to articulate these sounds, they must learn how to parse continuous auditory signals into discrete phonological categories such as words, syllables, and phonemes. This learning process corresponds to tuning of the speech recognition system and speech sound map in Figure 1. These learning processes (row 6 in Table 1) fall under the domain of speech perception and are not currently implemented in computer simulations of the DIVA model. Instead, speech sounds are presented to the model for learning; these sounds take the form of time-varying auditory signals (in particular, formant frequencies). Note that conscious awareness of phonemes is not a prerequisite for learning to produce phoneme strings; indeed, infants and children successfully learn words like "cat" and "hat" that differ only by a single phoneme despite not yet being consciously aware of phoneme units.

Development of sensory targets and feedforward control
As infants acquire auditory targets corresponding to phonemes and syllables, their brains store information about the sensory signals making up these objectives of speech motor output (row 7 in Table 1). The infant will then try to replicate these auditory targets. Projections to the auditory target map from the speech sound map encode these time-varying auditory targets for sounds represented in the speech sound map, so that these targets can be activated later during production of the corresponding sounds.
Infants have been reported to imitate caregivers' vocalizations as early as 2 months old (Kuhl & Meltzoff, 1996; Kokkinaki & Kugiumutzakis, 2000; Gratier & Devouche, 2011), while other accounts argue that this capacity emerges closer to 1 year of age (Jones, 2009). These initial utterances enable the infant to learn feedforward commands for producing these sounds on their own (row 8 in Table 1). Within the DIVA model, these feedforward commands are stored in synaptic projections from the speech sound map to the primary motor cortical areas, both directly and via a cortico-cerebellar loop.
Finally, after an infant can successfully produce speech sounds, the infant's brain develops a somatosensory target map containing representations of the somatic sensations created by accurately producing the sound (row 9 in Table 1). These targets are used by the somatosensory feedback control system to rapidly detect and correct production errors in ongoing utterances.

Computational modeling of developmental speech disorders
In addition to modeling normal development of speech production, variations of DIVA have also been used to simulate possible mechanisms of childhood disorders that affect speech production. Max, Guenther, Gracco, Ghosh, and Wallace (2004) used mechanisms from DIVA to propose an account of developmental stuttering caused by dysfunctional use of auditory feedback. Subsequent simulation studies implemented this hypothesis (Civier, Tasko, & Guenther, 2010), as well as alternative possible causes of the disorder (Civier, Bullock, Max, & Guenther, 2013). The neural etiology of childhood apraxia of speech has been addressed by DIVA modeling, in a study that simulated the disorder as resulting from impaired feedforward signaling (Terband, Maassen, Guenther, & Brumberg, 2009; Miller & Guenther, 2021). A recent application of the model used it to explore motor and auditory processing in children with autism spectrum disorder (Chenausky, Brignell, Morgan, Norton, Tager-Flusberg, Schlaug, & Guenther, 2021). A promising future direction for similar investigations may be the use of LaDIVA, a modification of the model which incorporates detailed laryngeal physiology, for understanding voice disorders such as pediatric dysphonia (Weerathunge, Alzamendi, Cler, Guenther, Stepp, & Zañartu, 2022).

Sequencing of speech motor programs
The previous sections discussed how the DIVA model simulates production of single speech motor programs and how these programs are learned and refined. Here we describe an extension to the DIVA model called the Gradient Order DIVA (GODIVA) model (Bohland, Bullock, & Guenther, 2010) that describes the neural processes underlying the buffering and sequential production of longer utterances consisting of multiple speech sounds, such as phrases or sentences. In infancy, the capacity for rudimentary speech sound sequencing begins to manifest during nonreduplicated babbling (Levitt & Utman, 1992; Nathani, Ertmer, & Stark, 2006). GODIVA provides a description for developmental processes underlying the learning of these abilities. Before exploring these mechanisms, we give an overview of the components of the model.

Neural components of the GODIVA model
Figure 2 illustrates a simplified schematic of the GODIVA model. The model consists of two basal ganglia-thalamo-cortical loops (shaded regions in the figure): a motor loop (whose components are shared with the DIVA model) responsible for initiating and terminating speech motor programs, and a planning loop that forms a phonological working memory that buffers upcoming speech sounds. The planning loop involves the posterior inferior frontal sulcus (pIFS) in lateral prefrontal cortex and the presupplementary motor area (preSMA) in the medial premotor cortex working in concert with the basal ganglia via projections to the head of the caudate nucleus, whereas the motor loop involves vPMC and SMA working in concert with the basal ganglia via projections to the putamen.
The model's cortical components can also be divided into medial and lateral cortical regions (indicated by dashed boxes in Figure 2), which represent distinct aspects of the speech utterance. One set of structures, the left lateral cortical areas pIFS and vPMC, contains representations of the speech sequence's phonological content (hypothesized to reside in left pIFS) and corresponding motor programs (hypothesized to reside in left vPMC). A second set, the medial premotor areas preSMA and SMA, is responsible for the metrical structure of the phonological sequence. Specifically, preSMA is hypothesized to contain a representation of syllabic frame structure and metrical patterning for an upcoming utterance, whereas SMA contains an initiation map (as in DIVA) that is responsible for turning on and turning off individual speech motor programs at particular instants in time. The planning loop regions preSMA and pIFS in GODIVA both use a gradient order working memory representation in which nodes representing actions to be produced sooner have higher activation levels than those to be produced later; such a representation has been proposed in prior computational models of working memory and sequencing (e.g., Lashley, 1951; Grossberg, 1978; Houghton, 1990; Houghton & Hartley, 1996). The following subsections provide further detail regarding the model's medial and lateral streams.
Processing of sequential structure in medial premotor cortex
The GODIVA model posits that preSMA contains a representation of the global metrical structure of an upcoming speech utterance, whereas SMA is primarily responsible for initiating the motor execution of speech articulations. The SMA and preSMA elements in GODIVA are inspired in part by single unit electrophysiological studies of action sequencing in non-human primates. For example, Shima and Tanji (2000) trained macaque monkeys to perform different sequences of three hand/arm movements (e.g., push-pull-turn) while recording from neurons in SMA and preSMA. Broadly speaking, neurons in SMA were more closely tied to particular movements, whereas neurons in preSMA often represented more global aspects of the full sequence, for example neurons that fired at the beginning of only one particular three-movement sequence, or neurons that fired during production of the second (or first, or third) movement of the sequence regardless of whether the movement was a push, pull, or turn. Subsequent human neuroimaging studies found a corresponding association between speech sequence complexity and preSMA activation (Bohland & Guenther, 2006; Rong, Isenberg, Sun, & Hickok, 2018).
In GODIVA, preSMA nodes represent the syllable frame structure and stress patterning of the utterance, which determine the utterance's metrical structure. Projections from preSMA nodes to SMA are responsible (in concert with the basal ganglia, as described below) for activating and deactivating the proper SMA initiation map nodes (each of which launches a distinct motor program) in the proper order and with the proper stress. In this way, the medial stream of the GODIVA model dictates the metrical structure and tempo of a multi-sound utterance.
Each node in the phonological content buffer represents a different phonological unit (e.g., a phoneme or consonant cluster). The order of upcoming speech sounds to be produced is represented by the gradient of activity across these nodes. GODIVA, like the DIVA model, implements speech sound map nodes residing in vPMC. Once pIFS selects the next motor program to execute, as determined by the highest-activity node in its phonological buffer, this selection is transmitted to left vPMC via projections from pIFS. Execution of the motor program begins at the instant the corresponding SMA initiation map node is activated (at which time the sound's representation is deleted from the pIFS phonological content buffer), and the motor program terminates when the initiation map node activity is extinguished.
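The readout scheme described above (select the most active node, execute it, delete it from the buffer, repeat) can be illustrated with a minimal Python sketch. This is not the GODIVA implementation: the function name `competitive_queue_readout`, the dictionary representation of the buffer, and the activation values are all hypothetical, and the continuous neural dynamics of the model are replaced by a simple discrete loop.

```python
def competitive_queue_readout(activations):
    """Read out buffered units in order of decreasing activation.

    `activations` maps each phonological unit to an activation level
    (hypothetical values); higher activation means "produce sooner".
    """
    buffer = dict(activations)  # work on a copy so the caller's plan is untouched
    order = []
    while buffer:
        # Select the highest-activity node in the buffer.
        winner = max(buffer, key=buffer.get)
        order.append(winner)
        # Delete the selected unit from the buffer once it has been initiated.
        del buffer[winner]
    return order


# Illustrative activation gradient for the word "snow".
plan = {"/s/": 0.9, "/n/": 0.6, "/ō/": 0.3}
print(competitive_queue_readout(plan))  # ['/s/', '/n/', '/ō/']
```

In this sketch the serial order is never stored explicitly; it is carried entirely by the relative activation levels, which is the defining property of a gradient order representation.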
Motor sequence chunking and automatization in the basal ganglia loop
We propose that, early in development, the working memory areas preSMA and pIFS must be heavily involved in the speech sequencing process since frequently occurring sequences have not yet been "automated" by transferring control of the sequence to subcortical structures. In GODIVA, if a particular movement sequence is repeated many times, nodes in the basal ganglia learn to recognize the sensorimotor context for initiating the individual items in the sequence. After learning, the sequence is represented by its own speech sound map node, and activating this node leads to readout of the learned movement sequence. The learning process is schematized in Figure 3.
The cortico-basal ganglia motor loop accomplishes this automation of frequently used speech sequences in early childhood by encoding these sequences as "chunks" with their own optimized motor programs. This chunking would reduce the processing load on prefrontal and premotor cortical areas (Alm, 2004; Redgrave et al., 2010). For example, the speech motor system of a young child might attempt to produce the word "snow" (Figure 3, Panel A). vMC contains nodes encoding articulatory gestures (labeled G) for the phonemes /s/, /n/, and /ō/. Each phonemic gesture has a corresponding cell in the SMA initiation map (labeled I) that is responsible for initiating the gesture via projections to vMC. During this early stage of development, vPMC does not contain a motor program for the entire syllable /snō/. Instead, the syllable is represented by individual motor programs for each phoneme that must be activated independently via inputs from the pIFS phonological buffer. Similarly, preSMA and pIFS contain only phonemic elements, not larger units such as consonant clusters.
At this stage, production of the word requires activation of the nodes /s/, /n/, and /ō/ in the phonological content buffer in pIFS, as well as the structural representation for /snō/ in the sequential structure buffer in preSMA. Projections from pIFS sequentially activate the vPMC nodes corresponding to the motor programs for /s/, /n/, and /ō/. Projections from these vPMC nodes sequentially activate the matching gestural nodes in vMC. The timing of this sequential activation process is determined by the medial premotor areas. PreSMA-to-SMA projections activate nodes in the initiation map for the individual phonemes in the proper order and with the proper timing. Once a motor program has been completed, the preSMA, vPMC, and pIFS nodes for that program's elements are deactivated, allowing the next motor program to commence.
Panel B of Figure 3 schematizes the production of /snō/ at a more mature stage of development. At this stage, vPMC contains a motor program for the entire syllable /snō/, with subcortical loops through the cerebellum (green dashed arrows) effectively taking over coordination of the individual motor gestures. The importance of the cerebellum for vocal sequence learning has been empirically supported by pediatric clinical studies and animal lesion models (Ziegler & Ackermann, 2017; Pidoux, Blanc, Levenes, & Leblois, 2018; Glickstein, 1994). Once these cortical-subcortical loops are established, working memory buffers in preSMA and pIFS will contain cluster-sized sub-syllabic units, thereby reducing the number of items that have to be stored in working memory for /snō/. The task of initiating the gesture for /n/ in /snō/ now gets carried out by the basal ganglia motor loop (red dashed arrow) instead of preSMA. This learning process reduces the number of pIFS, preSMA, and vPMC nodes that must be activated to produce the word. The required number of cortico-cortical connections (black arrows) has decreased substantially, having been replaced by subcortical communications through the cerebellum (green arrows) and basal ganglia (red arrows). Evidence for speech learning-related reductions in processing load has been demonstrated by neuroimaging studies of nonnative consonant cluster learning (Segawa, Tourville, Beal, & Guenther, 2015; Masapollo et al., 2021).
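The reduction in working memory load for /snō/ can be made concrete with a small Python sketch. It is an illustration only: the function `load_buffer`, the representation of learned chunks as a set of phoneme tuples, and the greedy longest-match rule are assumptions introduced here, not mechanisms specified by GODIVA, in which chunks emerge gradually through repeated production.

```python
def load_buffer(phonemes, chunk_inventory):
    """Segment a phoneme sequence into the items a working memory buffer would hold.

    `chunk_inventory` is a set of phoneme tuples that already have their own
    learned motor program (hypothetical representation). Segmentation is
    greedy longest-match, an assumption made for illustration.
    """
    items = []
    i = 0
    while i < len(phonemes):
        # Try the longest remaining candidate first; a single phoneme always matches.
        for length in range(len(phonemes) - i, 0, -1):
            candidate = tuple(phonemes[i:i + length])
            if length == 1 or candidate in chunk_inventory:
                items.append(candidate)
                i += length
                break
    return items


snow = ["s", "n", "ō"]

# Early in development: no learned clusters, so every phoneme is a separate item.
early = load_buffer(snow, chunk_inventory=set())
print(len(early), early)  # 3 [('s',), ('n',), ('ō',)]

# Later in development: the cluster /sn/ has been chunked.
later = load_buffer(snow, chunk_inventory={("s", "n")})
print(len(later), later)  # 2 [('s', 'n'), ('ō',)]
```

Under these assumptions the buffer holds three items before chunking and two afterward, mirroring the move from phonemic elements to cluster-sized units described for Panel B.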

Summary
This review described neuro-computational approaches for modeling infant and child speech motor development. We first provided an overview of the DIVA model, which characterizes feedforward and feedback mechanisms of speech production controlled by a network of cortical and subcortical loops. The feedforward control system is thought to involve cortico-cortical projections from premotor to motor cortex, as well as contributions from the cerebellum. The auditory and somatosensory feedback control systems monitor the perceptual consequences of speech output, which are compared to sensory predictions transmitted from premotor cortex to higher-order sensory areas. These sensory areas compute error signals, which are sent to motor cortex as corrective motor commands.
We described how early stages of speech motor learning can be simulated with the DIVA model. Speech motor development involves a number of learning processes occurring in a quasi-parallel fashion. Infant babbling and other vocalizations begin tuning forward maps which map motor outputs to resulting auditory and somatosensory perceptions. Auditory maps develop in a way that highlights important acoustic distinctions in a language and de-emphasizes irrelevant distinctions. Analogously, somatosensory maps become sensitive to the tactile and proprioceptive feedback patterns that occur when producing sounds from the native language. Auditory targets for speech sound "chunks" such as phonemes, syllables, and words are formed by monitoring the environment for native language samples, and feedforward commands are tuned as a child attempts to produce these sound chunks.
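The interplay between feedback correction and feedforward tuning summarized above can be sketched as a toy simulation in Python. The sketch makes strong simplifying assumptions that are not part of the text: the target and the motor command are single numbers, the "vocal tract" is an identity mapping from command to sensory outcome, and the feedforward command for the next attempt is updated by adding a fraction of the corrective command. The names `practice`, `feedback_gain`, and `learning_rate` are hypothetical.

```python
def practice(target, attempts=10, feedback_gain=0.5, learning_rate=0.5):
    """Tune a scalar feedforward command over repeated production attempts.

    Returns the final feedforward command and the absolute error on each attempt.
    """
    feedforward = 0.0  # untrained feedforward command
    errors = []
    for _ in range(attempts):
        # Toy vocal tract (assumption): the sensory outcome equals the command.
        produced = feedforward
        # Sensory error: mismatch between the target and the produced outcome.
        error = target - produced
        # The error is converted into a corrective motor command.
        corrective = feedback_gain * error
        # The corrective command is folded into the next attempt's feedforward command.
        feedforward += learning_rate * corrective
        errors.append(abs(error))
    return feedforward, errors


final_command, errors = practice(target=1.0)
print(round(final_command, 3))
print([round(e, 3) for e in errors])
```

With these parameter values the error shrinks by a constant factor on every attempt, so reliance on feedback correction decreases as the feedforward command improves.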
Next, we addressed computational modeling of a more advanced stage of child speech development, in which longer phonological sequences such as phrases or sentences are produced. Modeling of these processes uses the Gradient Order DIVA (GODIVA) model. High-level language processing regions maintain temporary stores of upcoming phonological content and metrical structure in competitive queues. These regions control the output of the downstream initiation maps and speech sound maps to produce sequences of speech sounds. GODIVA also describes a mechanism of speech sequence learning, or chunking, via cortico-basal ganglia loops. Frequently produced motor sequences that formerly required cortical control for every sequential step are automated into syllabic motor programs controlled mostly by the basal ganglia and cerebellum, reducing cortical processing load as the child proceeds through speech development.

Figure 1. Neural correlates of the DIVA model. The main neural output of the model is provided by the vMC Articulator Map, which integrates feedforward commands from VL and the Speech Sound Map with feedback commands from VL and the Feedback Control Map. [Abbreviations: Cb=cerebellum (specific lobule unknown); Cb-VI=cerebellum lobule VI; GP=globus pallidus; MG=medial geniculate nucleus of the thalamus; pAC=posterior auditory cortex; SMA=supplementary motor area; SNr=substantia nigra pars reticulata; VA=ventral anterior nucleus of the thalamus; VL=ventral lateral nucleus of the thalamus; vMC=ventral motor cortex; VPM=ventral posterior medial nucleus of the thalamus; vPMC=ventral premotor cortex; vSC=ventral somatosensory cortex.]

Figure 3. Illustration of speech sequence learning via "chunking" in the GODIVA model. (A) Network involved in producing the word "snow" early in speech motor development. Cortico-cortical projections are indicated by black arrows. (B) Network involved in producing the word "snow" later in development. The development of basal ganglia (red dashed arrows) and cerebellar (green dashed arrows) loops allows for the use of fewer cortical nodes and projections. [Abbreviations: BG, basal ganglia; Cb, cerebellum; G, gestural node; I, initiation map node; pIFS, posterior inferior frontal sulcus; preSMA, presupplementary motor area; S, syllabic structure node; SMA, supplementary motor area; vMC, ventral primary motor cortex; vPMC, ventral premotor cortex.]

Table 1. Time-courses for development of the major capacities of the speech motor system. The estimated amount of learning occurring in a neural system within a given time window is indicated as being LOW, MEDIUM, or HIGH. [Abbreviations: Aud.=auditory; Som.=somatosensory.]