1.1 Introduction
The goal of generative linguistics is to model the internalized grammars that allow members of a linguistic community to produce and understand utterances they have not previously encountered. Most generative models of the phonological component share the assumption that the grammar defines a mapping between two levels: abstract mental representations, which contain all nonpredictable information about the sound structure of morphemes, and surface representations, which encode predictable alternations. Opinions diverge, however, on the formal properties of the internalized grammar and the nature of the mapping from underlying to surface representations, as well as on the relationship between phonology and phonetics, on one hand, and usage-based factors such as frequency and variation, on the other.
Research in cross-language phonology has been informed by developments in generative phonology. At the same time, second language (L2) research has contributed insight to the understanding of the phonological grammar above and beyond what can be learned from the study of monolingual behavior. Speakers encountering a new language may be confronted with structures that do not occur in their first language (L1), and L2 production and perception of these novel structures may provide a valuable window into the nature of the L1 grammar by revealing how learners generalize beyond prior experience. Second language patterns may also provide insight into factors that are independent of experience with a particular language, such as the relative complexity, the articulatory difficulty, or the perceptual salience of particular structures, as well as the influence of universal grammatical principles and/or learning biases. And the study of bilingual speakers is especially crucial in shedding light on the interaction between distinct internalized grammars.
The goal of this chapter is to explore the strengths and limitations of different models of phonology in accounting for data from multilingual speakers, including the realization of L2 phonemes, allophones, syllables, and suprasegmental structures (tone, stress, pitch accent, and intonation). While bilinguals fall on a continuum from early and consistently balanced users of two languages to adult learners of a new language, the majority of research in second language phonology has focused on data from speakers who have internalized an L1 before acquiring additional languages. Much of the work discussed here will therefore focus on the extent to which L2 patterns can be explained in terms of the L1 grammar. Issues to be addressed include the role of phonological versus phonetic structure in a grammar, the relationship between the production grammar and the perception grammar, and the role of language-independent factors such as putative innate learning biases, articulatory difficulty, and perceptual salience.
This chapter begins with a brief overview of competing models of the phonological grammar that have emerged in the last half-century of work in generative phonology. The focus then turns to how particular areas of L2 phonology have been approached within different models. We consider the limits of reliance on the L1 to explain L2 patterns, and then examine how different views of the nature of phonological representations and derivations have intersected with differing explanations of L2 phonology.
1.2 Models of the L1 Phonological Grammar
Chomsky and Halle’s (1968) foundational work The Sound Pattern of English (SPE) set forth a framework in which the job of the phonological grammar was to map abstract mental representations of stored, nonpredictable lexical structure into surface representations that display allophonic and allomorphic variation. Phonological representations at both the underlying and the surface levels were argued to consist of a linear string of segments, defined as a matrix of phonological feature specifications, along with indicators of morpheme and word boundaries. Surface representations were realized by a system of phonetic implementation which accounted for such aspects of phonetic detail as variation in the realization of a single surface representation.
SPE provided a comprehensive and explicit model of phonological analysis, inspiring researchers to extend this model to numerous languages. As the database expanded, so did the notion of phonological representations. Research on tonal processes, intonation, templatic morphology, and vowel harmony motivated abandonment of the principle of strict linear organization. In Autosegmental Phonology (e.g., Goldsmith, 1979), features were argued to occupy distinct tiers, with the possibility of one-to-many mappings between elements on different tiers; for example, a falling tone might be represented by two tones (High and Low) associated with a single vowel, while a sequence of high-toned syllables might be represented by a single High tone associated with multiple syllables. Research on word and phrasal stress and prominence motivated the claims of Metrical Phonology (e.g., Hayes, 1985) that segments may be organized hierarchically into prosodic units such as syllables, feet, and prosodic words, and research on multiply-articulated segments motivated Feature Geometry (e.g., Clements, 1985; Sagey, 1986), which argued for hierarchical structure within segments, with features organized under place, manner, and laryngeal nodes. Perhaps the most radical departure from SPE phonology was associated with Articulatory/Gestural Phonology (e.g., Browman & Goldstein, 1989; Gafos, 2002), which included direct reference to the positions of articulators in phonological representations and to the timing of articulatory gestures in phonological derivations.
This suggested a rethinking of some earlier analyses; for example, the loss of /t/ in connected pronunciation of the phrase perfect memory could be understood not as a result of a phonological rule deleting /t/ but rather as a result of the English pattern of overlap in time between the velar, the coronal, and the labial closures of the /ktm/ sequence, which makes the coronal closure difficult to perceive (Browman & Goldstein, 1989).
The extension of the SPE model to a variety of languages also led to new ideas about the relationship between the grammars of individual languages and the role of typological generalizations pointing to cross-linguistic preferences for certain surface structures; for example, the finding that the presence of final voiced obstruents in a language implies the presence of final voiceless obstruents suggests that the latter are universally more natural (less marked) than the former. The assumption that particular surface structures may be considered universally more or less preferred has implications for the nature of phonological derivations. In SPE, the grammar makes no direct reference to surface structures, which simply emerge in the mapping from underlying to surface representations via ordered rewrite rules, where each rule effects a structural transformation at that level, without looking back to lexical representations or ahead to surface representations. The importance of surface structure restrictions was pointed out early on, however, by researchers who noted that individual languages often employ multiple rules that “conspire” to ensure that certain structures never appear in surface representations (Kisseberth, 1970), and furthermore that children often seem to apply rules that simplify their productions in the direction of universally preferred surface structures, even where these rules are not supported by the data of the adult language, which may contain the dispreferred structures (Stampe, 1979).
The role of cross-linguistic surface naturalness in grammars was addressed in SPE in the form of the Evaluation/Simplicity Metric, which equates the naturalness of a grammar with the number of symbols contained in its rule set. While the Evaluation Metric rewards rules that refer to more natural classes of sounds, as these require fewer features to specify, it does not distinguish a cross-linguistically common rule such as devoicing of final obstruents from its rare and possibly unattested counterpart of final obstruent voicing, since both rules require the same number of symbols. The failure to distinguish rules that create marked versus unmarked surface structures was acknowledged in the final chapter of SPE, where a system using default fill-in of the more natural value of a feature was proposed, though this system was not widely adopted. Alternative proposals for incorporating cross-linguistic naturalness into phonology included allowing constraints on preferred surface structures to either block or trigger the application of a rule, as well as extending the principles and parameters approach of syntax to include a universal set of phonological parameters to which learners have access, where the default setting defines the least marked (cross-linguistically most common and presumably most natural) option.
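The Evaluation Metric’s blindness to markedness can be made concrete with a small sketch. The encoding below (a rule as a triple of target, change, and environment specifications, with the metric simply counting symbols) is our own illustration, not SPE’s exact notation:

```python
# A rule is modeled as (target, change, environment), each a list of
# symbols; the Evaluation Metric simply counts symbols across the rule.
def symbol_count(rule):
    return sum(len(part) for part in rule)

# Final obstruent devoicing (cross-linguistically common):
# [-sonorant] -> [-voice] / __ #
devoicing = (["-sonorant"], ["-voice"], ["#"])

# Final obstruent voicing (rare, possibly unattested):
# [-sonorant] -> [+voice] / __ #
voicing = (["-sonorant"], ["+voice"], ["#"])

# The metric assigns both rules the same cost, so it cannot prefer the
# natural rule over its unnatural mirror image.
assert symbol_count(devoicing) == symbol_count(voicing) == 3
```

The tie between the two rules is precisely the failure the final chapter of SPE acknowledged.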
The move to incorporate typological generalizations into the grammar ran up against the apparent paradox that even a principle that has clear cross-linguistic support is not necessarily obeyed in every language: some languages do allow surface structures that are considered marked. The emergence of Optimality Theory (OT; Kager, 1999; Prince & Smolensky, 2004) provided a way to make typological markedness of surface forms a driving force in phonological derivations while still allowing some languages to violate so-called universals. In this framework, rules were replaced by a set of constraints that define the optimal surface realization of an underlying representation. In classic OT, constraints are of two types: markedness constraints (or well-formedness constraints), which encode universal preferences by penalizing marked surface structures, and faithfulness constraints (or correspondence constraints), which penalize changes from underlying structure. While the set of constraints is assumed in classic OT to be universal, constraint rankings are considered language-specific, and any constraint may be violated under pressure from higher-ranking constraints. In accordance with the Subset Principle (Berwick, 1985), whereby the initial state grammar for L1 acquisition is the one defining the most restrictive grammar, children are argued to begin by ranking markedness constraints over faithfulness constraints (see papers in Kager, Pater, and Zonneveld, 2004). Reranking is motivated by positive linguistic evidence: hearing a marked structure motivates demotion of the violated constraint.
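The core derivational logic of classic OT can be sketched in a few lines. The constraint definitions below are deliberately crude (one markedness constraint plus two faithfulness constraints, with string mismatch standing in for featural unfaithfulness), and the candidate set is hand-picked rather than generated, so this is an illustration of ranked, violable evaluation rather than a serious OT implementation:

```python
def violations(cand, underlying):
    return {
        # markedness: penalize a word-final voiced obstruent
        "NoFinalVoicedObs": 1 if cand and cand[-1] in "bdgvz" else 0,
        # faithfulness (Max): penalize deletion of underlying segments
        "Max": max(len(underlying) - len(cand), 0),
        # faithfulness (Ident): penalize changed segments (a crude proxy)
        "Ident": sum(a != b for a, b in zip(cand, underlying)),
    }

def optimal(underlying, candidates, ranking):
    # lexicographic comparison down the hierarchy: one violation of a
    # high-ranked constraint outweighs any number of lower-ranked ones
    return min(candidates,
               key=lambda c: [violations(c, underlying)[k] for k in ranking])

cands = ["bad", "bat", "ba"]  # faithful, devoiced, and deleted candidates

# German-like ranking: markedness dominates, so the devoiced candidate wins
assert optimal("bad", cands, ["NoFinalVoicedObs", "Max", "Ident"]) == "bat"

# English-like ranking: faithfulness dominates, so the marked form surfaces
assert optimal("bad", cands, ["Ident", "Max", "NoFinalVoicedObs"]) == "bad"
```

The same universal constraint set yields both grammars; only the ranking differs, which is how OT reconciles typological markedness with languages that tolerate marked structures.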
Alongside assumptions about the nature of phonological representations and derivations, a third area where assumptions diverge concerns the sorts of facts that the grammar must account for. While it has long been recognized that perception is affected by one’s native language, explicit formal grammars of the mapping from acoustic signals to phonological representations (e.g., Boersma, 1998) were not proposed until the generative enterprise had been underway for several decades. And while sociolinguists tackled the question of variation in language early on, it is only relatively recently that mainstream phonologists have built grammars whose job is not to select a single grammatical realization of a given underlying representation, but rather to predict the probability of different surface representations of an underlying form (e.g., the Gradual Learning Algorithm of Boersma and Hayes, 2001; the Maximum Entropy grammar of Hayes and Wilson, 2008).
Despite differing views on the nature of phonological representations and the mapping between underlying and surface levels, almost all generative phonologists would agree that acquiring a language involves internalizing a grammar. This makes data from second language acquisition (SLA) a fruitful area for research aimed at discovering the nature of the initial state grammar in language acquisition and the process of moving toward a target-like grammar based on the data to which the learner is exposed.
1.3 The Role of the L1 Grammar in L2 Acquisition
A major question in the study of multilingual speakers has been the extent to which the grammar of an already internalized language influences the acquisition of a new language. Much early work was driven by the Contrastive Analysis Hypothesis (Lado, 1957), which predicts positive transfer of features shared by the L1 and the L2 and interference or negative transfer where the two languages differ. In this view, learners’ patterns that deviate from target language norms are classified as errors, indicative of incomplete learning. However, the generative view that learning a language involves constructing a grammar shifted the focus to learners’ L2 productions as coherent systems (interlanguages; Selinker, 1972) that are worthy of study in their own right as reflections of the developing L2 grammar.
Much of the research inspired by the interlanguage approach maintains a crucial role for the L1 grammar, with most researchers assuming that for languages learned after early childhood, the initial state grammar is (a clone of) the L1 grammar, though developmental processes may also play some role (e.g., the Ontogeny Model; Major, 1987). In a rule-based framework, the learner must suppress rules of the L1 that do not apply in the new language and learn a new set of rules where warranted. In frameworks that assume universal parameters or constraints, the learner’s task is to reset the parameters or rerank the constraints, and the assumption of universality predicts that any intermediate stages in the process will correspond to possible L1 grammars.
The study of interlanguage data as a coherent system led to the recognition of two sorts of patterns that are not readily explainable in terms of either transfer from the L1 or input from the L2: the fact that not all novel structures appear to be equally difficult for learners to master, and the fact that learners alter L2 structures in ways that may not be explainable as an effect of the L1 (which lacks the offending structure) or the L2 (which tolerates the structure). Such patterns have often been argued to provide evidence for the generative view that language acquisition is governed by innate, universal biases.
1.4 Representations in L2 Phonology: Phonemes, Features, Articulatory Gestures, and Acoustic Cues
A long tradition of work in the Speech Learning Model (Flege, 1995), the Perceptual Assimilation Model (Best & Tyler, 2007), and the Native Language Magnet Model (Kuhl & Iverson, 1995) focuses on how learners establish phoneme categories in L2 production and perception, a task that is complicated when the L1 and the L2 phoneme inventories do not match, or when a single phoneme category is realized differently in different positions (e.g., Flege, 1995; Trofimovich, Gatbonton, & Segalowitz, 2007). Yet in most generative models of the phonological grammar, phonemes per se are not primitives of mental representation; rather, segments are defined by a matrix of distinctive feature specifications. Furthermore, as Kang (2008) points out, generative models differ in their view of which features are present in lexical representations. SPE models assume that only contrastive features are specified, with rules filling in predictable features (such as aspiration for English voiceless stops in particular positions). In OT, on the other hand, the set of ranked constraints ensures the proper surface distribution of predictable features regardless of whether those features are specified in lexical representations or not, and the principle of Lexicon Optimization (Prince & Smolensky, 2004) assumes that learners will posit lexical representations that are as close as possible to surface representations. These differing views raise the question of whether L2 acquisition proceeds phoneme-by-phoneme, based on the phonetic similarity of L1 and L2 phonemes, or feature-by-feature – and if the latter, which types of features (contrastive versus predictable) are relevant.
1.4.1 Features versus Phonemes
The problem of phoneme substitution is one area where features have been argued to play a role in L2 phonology. Hancin-Bhatt (1994) laid out a feature-based approach to the long-standing puzzle of differential substitution illustrated by the observation that the preferred substitute for English /θ/ is /s/ for speakers of some languages (such as Turkish) but /t/ for speakers of others (such as Japanese), although each language has both /s/ and /t/ in its phoneme inventory. Hancin-Bhatt (1994) argues that the choice of substitute is a function of phoneme similarity, measured in terms of the number of specified features shared between two segments. She also argues that whether a feature value is specified depends on the functional load of that feature within a language; thus, the featural similarity of two phonemes can differ across languages. A conceptually similar appeal to the role played by a single feature within different languages is encoded in OT by Lombardi (2003) and by Wester, Gilbers, and Lowie (2007), who argue that speakers who choose /s/ as a substitute for /θ/ are those whose L1 grammar assigns a high rank to the constraint requiring faithfulness to manner (the [+continuant] value shared by /s/ and /θ/); speakers whose L1 offers no evidence for such ranking choose the universally less marked substitute /t/. A challenge for such approaches, however, is posed by Brannen (2002), who points out that speakers of European and Canadian French, with essentially identical phoneme inventories and phonological systems, differ in their choice of replacement for English /θ/. Brannen attributes the differential choice of phoneme substitute to differences in the phonetic realization of /s/ in the two varieties of French.
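Hancin-Bhatt’s similarity metric, counting shared specified features, can be sketched as follows. The feature matrices here are invented for illustration (her actual specifications are derived from each language’s contrasts and functional loads), but they show how the same target segment can attract different substitutes depending on which features an L1 specifies:

```python
def shared_specified(seg1, seg2, specs):
    # count features specified for both segments with the same value
    f1, f2 = specs[seg1], specs[seg2]
    return sum(1 for feat, val in f1.items() if f2.get(feat) == val)

def best_substitute(target, inventory, specs):
    return max(inventory, key=lambda seg: shared_specified(target, seg, specs))

# A hypothetical L1 that specifies [continuant]: /θ/ and /s/ share two features
specs_a = {
    "θ": {"coronal": True, "continuant": True},
    "s": {"coronal": True, "continuant": True, "strident": True},
    "t": {"coronal": True, "continuant": False},
}
assert best_substitute("θ", ["s", "t"], specs_a) == "s"

# A hypothetical L1 that specifies [strident] but not [continuant]:
# /t/ now shares more specified features with /θ/ than /s/ does
specs_b = {
    "θ": {"coronal": True, "strident": False},
    "s": {"coronal": True, "strident": True},
    "t": {"coronal": True, "strident": False},
}
assert best_substitute("θ", ["s", "t"], specs_b) == "t"
```

The two specification tables stand in for two different L1s, capturing the claim that featural similarity, and hence the preferred substitute, is language-particular even when phoneme inventories match.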
Appeals to the L1 feature system have also been used to explain why certain novel structures seem to pose greater difficulty in perception for some L1 groups than for others. Brown (2000) argued that the ability to perceive a new contrast in L2 correlates with whether the feature which encodes that contrast plays a role in defining some contrast in the L1 grammar. Brown argues that Mandarin speakers are more successful than Japanese speakers in perceiving the English /ɹ–l/ phoneme contrast because Mandarin contrasts two coronal fricatives (alveolar and retroflex), so Mandarin speakers can deploy the feature distinguishing these segments to perceive the English contrast. This approach is somewhat abstract, since the feature [lateral] which is assumed to distinguish the English liquids does not function in Mandarin. And even this more abstract view of featural correspondence cannot account for the fact that Japanese speakers distinguish Russian /r–l/ with relative ease, presumably because the Russian phonetic cues to the liquid contrast are more robust than those for the English contrast, as well as more distinct from the cues associated with the allophonic variants of the Japanese liquid phoneme (Archibald, 2009; Larson-Hall, 2004).
Another problem for the claim that the perceptibility of an L2 contrast depends on the availability of a relevant contrastive L1 feature is outlined by Martinez, Goad, and Dow (2023), who investigated the ability of native speakers of French, whose language contrasts oral and nasal mid vowels, to discriminate the oral-nasal contrast in Brazilian Portuguese mid and high vowels. The French speakers not only were significantly less able than native Portuguese speakers to discriminate the novel high vowel oral-nasal contrast, they in fact performed no better than English and Caribbean Spanish speakers, in whose native languages vowel nasality is not contrastive. Thus, the French listeners seemed unable to deploy their native language [nasal] feature to discriminate a new oral-nasal contrast. However, some evidence suggests that the ability to perceive a contrast in one phoneme may generalize across different places of articulation and different prosodic positions. De Jong, Silbert, and Park (2009) found that the ability of Korean learners of English to identify English (nonsibilant) obstruents contrasting in [voice] or [continuant] correlated across sounds sharing the same feature; both contrasts are new for Korean speakers, in whose L1 neither voicing nor continuancy is contrastive for nonsibilant sounds.
In production, however, the picture is more complex. Using the same set of English obstruents, De Jong, Hao, and Park (2009) found correlations between the accurate production of voicing contrasts across different places of articulation, but no similar correlation for the production of a continuancy contrast for labial and coronal positions. They propose that while the gestures involved in the realization of [voice] values for labials and velars are similar enough to generalize across different places of articulation, the realization of [continuant] contrasts requires mastery of different gestures for different places of articulation. This is consistent with the findings of Olson (2019a), who trained English learners of Spanish on Spanish-like production of one of the three Spanish voiceless stop consonants (which differ in voice onset time [VOT] from their English counterparts). The move to more Spanish-like VOT values did generalize beyond the trained segment to other stops, with a comparable degree of improvement for all three places of articulation. De Jong, Hao, and Park (2009) suggest that feature-based acquisition plays a greater role in perception than in production, since production requires the acquisition of gestural routines associated with particular feature values that may not be uniform across all sounds sharing a feature.
The role of features in learning new sequences of segments in L2 (new phonotactics) is less well explored. Moreton (2002) argued that English listeners’ perception of ambiguous consonant-consonant-vowel (CCV) sequences showed a stronger bias against /dlV/ than /bwV/ because, although both combinations of phonemes are illegal in English onsets, /dl/ violates a prohibition on onset sequences of [coronal] consonants. However, Davidson (2010) questions the hypothesis that L2 onset clusters that are featurally most similar to L1 onsets are most easily produced; she found that although English and Catalan differ in permitted onset clusters, both language groups exhibited the same hierarchy of difficulty in the production of unfamiliar Russian cluster types, suggesting that difficulty is a function of language-independent phonetic factors rather than featural similarity to L1 clusters (though L1 differences did affect the types of errors).
1.4.2 Integrated Models of Sound Structure
Like the papers discussed so far, much work in the L2 acquisition of new phonemes and phoneme sequences ends with the conclusion that both phonological and phonetic factors play a role in L2 phonology. However, this conclusion generally assumes the traditional separation of an abstract phonology from a phonetic component dealing with details of articulation and acoustics. As discussed earlier, that separation no longer represents unanimous opinion.
One model of a perception grammar incorporating acoustic information is that of Escudero and Boersma (2004), which addresses the perception by L1 Spanish speakers of the /i–ɪ/ contrast in either Scottish English or Southern British English. Because these two target dialects differ in the cues used to signal the /i–ɪ/ contrast (durational versus spectral), learners develop different constraint rankings based on the dialect to which they are exposed. Their perception grammars map the L2 acoustic signals to phonological feature-based representations by employing constraints referring not only to abstract phonological features but also to phonetic cues such as segment durations and formant values. This groundbreaking analysis makes explicit the way in which learners can move from the initial state L1 constraint rankings toward an L2-like perception grammar, on the basis of the L2 data to which they are exposed. More recently, van Leussen and Escudero (2015) present a revision of this approach couched in terms of the strength of the connections along the paths mapping from the acoustic level to the lexical level.
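The intuition behind cue-based perception grammars can be sketched as a weighted mapping from acoustic values to categories. The cue dimensions are the ones at issue here (duration and spectral quality), but the thresholds, weights, and linear decision rule below are our own drastic simplification of the constraint-ranking formalism:

```python
def perceive(duration_ms, f1_hz, weights):
    # positive evidence pushes toward /i/; reference values are hypothetical
    score = (weights["duration"] * (duration_ms - 100)   # longer -> more /i/-like
             + weights["spectral"] * (350 - f1_hz))      # lower F1 -> more /i/-like
    return "i" if score > 0 else "ɪ"

# a learner whose grammar weights durational cues heavily
duration_listener = {"duration": 1.0, "spectral": 0.01}
# a learner whose grammar weights spectral cues heavily
spectral_listener = {"duration": 0.01, "spectral": 1.0}

# an ambiguous token: short (suggesting /ɪ/) but with low F1 (suggesting /i/)
token = dict(duration_ms=80, f1_hz=300)
assert perceive(**token, weights=duration_listener) == "ɪ"
assert perceive(**token, weights=spectral_listener) == "i"
```

The two listeners categorize the same acoustic token differently, mirroring the claim that exposure to different target dialects yields different perception grammars over the same signal.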
A model of a production grammar incorporating articulatory information is Davidson’s (2006) analysis of L1 English speakers’ pronunciations of Russian onset clusters such as /zg/. This analysis is based on data from Davidson’s (2005) study which found that while many English speakers produce the Russian clusters with what is perceived as an inserted schwa between the two consonants, this interconsonantal portion is both acoustically and articulatorily distinct from lexical schwas. Davidson (2005) argues that the L2 productions involve gestural mistiming rather than vowel insertion, and Davidson (2006) models this mistiming with a grammar that employs constraints referring to consonant release and the alignment of consonantal gestures, with the reranking of constraints over time accounting for the range of production patterns in the data.
While these approaches depart quite radically from classic generative notions of the phonological grammar as abstract, categorical, and distinct from details of phonetic realization, they provide possible paths for moving beyond the phonetics–phonology dichotomy that has dominated much of the discussion in L2 phonology.
1.5 Derivations in L2 Phonology: Integrating Markedness and Frequency Effects
The shift from derivations involving rules to derivations involving constraints referring to universal principles of markedness is arguably the widest departure from SPE phonology. This change was, however, anticipated by the recognition in the study of L2 phonology that equally novel L2 structures may pose different degrees of difficulty. Well before the emergence of OT in the 1990s, Eckman (1977) proposed the Markedness Differential Hypothesis, which supplements the Contrastive Analysis Hypothesis by correlating difficulty of acquisition not only with novelty but with markedness. Eckman (1981) pointed out not only that marked structures were more difficult to acquire but that the apparently spontaneous emergence in interlanguage phonologies of rules repairing marked structures posed a challenge for frameworks which assumed that rules were learned from ambient data.
Another shift in generative phonology has been the move from grammars that derive a single surface form to grammars that predict the probability of different surface representations, taking into account the frequency of different structures in the input to the learner. These changes have pointed toward new ways of approaching SLA data and answering some of the questions raised in earlier accounts. In the rest of this section we examine evidence for markedness and frequency effects in three areas: the acquisition of laryngeal contrasts, of complex syllable margins, and of L2 stress and tone.
1.5.1 Acquisition of Final Laryngeal Contrasts
One of the best documented examples of the differential difficulty of equally novel structures involves the acquisition of obstruent voicing contrasts in syllable- or word-final position, with numerous studies showing greater accuracy in the production of English final voiceless than voiced obstruents among speakers of a wide range of L1s that lack any final obstruents (Mandarin, Tswana, Japanese, Angami; see Broselow, 2018; Eckman, 1981; Edge, 1991; Flege & Davidian, 1984; Wang, 1995; Weinberger, 1987; Wiltshire, 2006; Wissing & Zonneveld, 1996; Yavas, 1997). As Eckman (1981, 2004) points out, the greater difficulty of final voiced obstruents is consistent with the cross-linguistic generalization that while many languages (e.g., Dutch, Polish, Catalan) allow only voiceless obstruents in final position, the reverse pattern is unattested. But in a rule-based analysis, the repairs evidenced among L2 learners (deletion or devoicing of the final obstruent and insertion of a vowel after the obstruent) must reflect rules of the learners’ grammar, and these rules are neither part of the L1 grammar nor motivated by the data of the L2.
As Broselow, Chen, and Wang (1998) point out, the existence of languages banning voiced obstruents in final position (but not the reverse) motivates a constraint against the marked structure, and the assumption that each constraint is universal means that even speakers whose L1 lacks any final obstruents have the constraint NoFinalVoicedObstruent as part of their grammar, although the effects of this constraint become visible only once learners have demoted the more general constraint banning all final obstruents. The principle that learners rank markedness constraints as high as possible predicts an L1 grammar that ranks NoFinalObstruent and NoFinalVoicedObstruent above faithfulness constraints, but, as Broselow (2004) points out, it is not clear why learners faced with both voiced and voiceless final obstruents in English should first demote only the more general constraint to arrive at an intermediate ranking NoFinalVoicedObstruent ≫ Faithfulness ≫ NoFinalObstruent. Broselow argues that this intermediate stage can be predicted once frequency is taken into account; in the Gradual Learning Algorithm (Boersma & Hayes, 2001), the rate at which a constraint is demoted is a function of the frequency with which the constraint is violated. Any form that violates the more specific NoFinalVoicedObstruent will by definition also violate the more general NoFinalObstruent, causing the more general constraint to be demoted more quickly.
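This frequency-based account can be sketched with a toy demotion procedure. The real Gradual Learning Algorithm updates rankings stochastically, with evaluation noise, on mismatches between the learner’s output and the datum; the deterministic update, step size, starting values, and token proportions below are invented purely for illustration:

```python
rank = {"NoFinalVoicedObs": 100.0, "NoFinalObs": 100.0, "Faith": 50.0}
STEP = 0.1
VOICED_SHARE = 0.4  # assumed share of voiced tokens among English final obstruents

def hear_final_obstruents(n):
    # every heard final obstruent violates the general constraint, but only
    # the voiced subset also violates the specific one, so the general
    # constraint accumulates demotions faster
    rank["NoFinalObs"] -= STEP * n
    rank["NoFinalVoicedObs"] -= STEP * n * VOICED_SHARE

def ranking():
    return sorted(rank, key=rank.get, reverse=True)

hear_final_obstruents(800)   # early exposure
# the intermediate stage: voiceless codas allowed, voiced codas still repaired
assert ranking() == ["NoFinalVoicedObs", "Faith", "NoFinalObs"]

hear_final_obstruents(1200)  # continued exposure
# target-like stage: both markedness constraints now rank below faithfulness
assert ranking() == ["Faith", "NoFinalVoicedObs", "NoFinalObs"]
```

Because violations of the specific constraint are a proper subset of violations of the general one, the general constraint necessarily falls below faithfulness first, predicting the attested intermediate ranking.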
The problem of explaining why learners use rules that have no apparent L1 or L2 support, identified by Eckman (1981), is also avoided in the constraint-based approach, since in OT there are no rules to be learned; the grammar generates all possible surface outputs, and the task of the grammar is simply to determine the optimal surface correspondent of an underlying representation. In this approach, the choice of repair may (at least in some cases) be determined by markedness constraints. For example, Wang (1995) and Broselow et al. (1998) argue that Mandarin learners’ choice of repair for forms with final voiced obstruents – deletion or devoicing of the obstruent versus insertion of a vowel after the obstruent – was governed by a preference for disyllabic words; learners were more likely to delete or devoice a final obstruent in disyllabic forms (thus preserving syllable count) than in monosyllables, where vowel insertion (addition of a syllable) was more likely. Though the disyllabic preference has been attributed to Mandarin itself, it appears to play a role in L2 phonology even in the absence of L1 support. In a study of Brazilian Portuguese speakers’ productions of English coda stops, Cardoso (2007) found that learners who were sufficiently advanced in English to produce final stops were more likely to do so in polysyllabic words than in monosyllables, which were more likely to undergo insertion of a vowel after the final stop.
One unavoidable question is whether the greater difficulty of voiced obstruents is actually not a matter of the production grammar at all, but is rather based in perception – that is, whether speakers whose L1 lacks final laryngeal contrasts simply misperceive final voiced obstruents as voiceless, and then faithfully reproduce what they hear. However, several studies comparing L2 final obstruent perception and production cast doubt on this; for example, Wissing and Zonneveld’s (1996) Tswana speakers performed better on perceiving than producing English final laryngeal contrasts, and L1 Mandarin speakers in Flege’s (1988) study performed as well as native English speakers on identification of final voiceless and voiced obstruents in naturally produced English words. Nonetheless, difficulty in the perception of voicing most likely does play a role in Mandarin speakers’ interlanguage; English obstruents in final position may be partially devoiced and also may be unreleased, and Flege’s (1988) study found that when stimuli were edited to remove release bursts, the Mandarin speakers’ accuracy in voicing identification dropped below that of English speakers. Thus, a full understanding of the Mandarin facts is incomplete without a model of the perception grammar alongside the production grammar.
Another example of differential difficulty that may be rooted in perceptual difficulty is the fact that the L1 Mandarin speakers in Broselow and Xu’s (2004) study were more successful in producing final /m/ than final voiced obstruents. Since Mandarin lacks both final [m] and final voiced obstruents, learners who can produce the former but not the latter appear to have demoted NoFinalLabial more quickly than NoFinalVoicedObstruent, despite the fact that voiced obstruents are more frequent than labial consonants in English word-final position. Broselow and Xu (2004) suggest that the rate of constraint demotion depends not on absolute frequency but rather on perceived frequency: a structure must be perceived accurately to be identified as a violation, and they suggest that Mandarin listeners are simply better at perceiving final /m/ than final voiced obstruents, for two reasons. First, the Mandarin speakers’ L1 contrast between coda /n/ and /ŋ/ requires them to attend to place cues in nasal codas. Second, English speakers may fail to produce the cues that Mandarin speakers rely on to identify voicing (Flege, 1988).
1.5.2 Acquisition of Complex Syllable Margins
Second language syllable structure is another area in which universal markedness has been argued to correlate with difficulty of acquisition. The assumption that segments within a syllable tend to be organized in terms of sonority (with vowels having highest sonority, followed in order of decreasing sonority by glides, liquids, nasals, fricatives, and stops) has been a staple of investigations of L1 syllable typology. Selkirk (1984) articulates two principles defining the relative markedness of syllable types: (i) the Sonority Sequencing Generalization (SSG), which favors syllable onsets that increase in sonority and codas that decrease in sonority; and (ii) the Minimal Sonority Distance principle (MSD), which favors syllable margins whose consonants are more widely separated on the sonority scale. These principles together define a markedness scale along which (i) onset clusters such as English /sp, st, sk/, which violate the SSG, are more marked than clusters such as /sl, sn, pl/, which obey the SSG; and (ii) onset clusters with greater sonority distance, such as stop-liquid /pl/, are less marked than fricative-liquid /sl/, which is in turn less marked than onsets such as fricative-nasal /sn/. These principles predict a hierarchy of differential difficulty in the realization of novel syllable margins.
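Both principles lend themselves to a procedural statement. The sketch below uses one common textbook convention for the numeric sonority scale; the specific values are an assumption, and only their relative ordering matters.

```python
# Illustrative sonority scale (higher = more sonorous). The numbers are a
# common textbook convention, not a theoretical claim.
SONORITY = {"stop": 1, "fricative": 2, "nasal": 3, "liquid": 4,
            "glide": 5, "vowel": 6}

def obeys_ssg(onset):
    """Sonority Sequencing Generalization: an onset is unmarked if
    sonority strictly rises toward the nucleus."""
    values = [SONORITY[c] for c in onset]
    return all(a < b for a, b in zip(values, values[1:]))

def sonority_distance(onset):
    """Minimal Sonority Distance: a larger minimum step between adjacent
    consonants means a less marked onset."""
    values = [SONORITY[c] for c in onset]
    return min(b - a for a, b in zip(values, values[1:]))

assert not obeys_ssg(["fricative", "stop"])    # /st/: sonority falls
assert obeys_ssg(["fricative", "liquid"])      # /sl/: sonority rises
# /pl/ (distance 3) < marked /sl/ (distance 2) < marked /sn/ (distance 1)
assert (sonority_distance(["stop", "liquid"])
        > sonority_distance(["fricative", "liquid"])
        > sonority_distance(["fricative", "nasal"]))
```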
Studies investigating the L2 accuracy of sC onsets in terms of markedness as defined by the SSG have yielded mixed results. The Brazilian Portuguese speakers in Cardoso and Liakin’s (2008) study produced SSG-conforming English /sn/, /sl/ more accurately than the marked /st/ onset, even though /st/ onsets were more frequent in the learners’ input. Learners also produced /sl/ more accurately than /sn/, in accord with the MSD principle. Additional support for the role of markedness in production of sC was found by Cardoso, Collins, and Cardoso (2021), who trained Brazilian Portuguese speakers on one of the onsets /st, sn, sl/. The participants trained on the most marked onset improved in their production of the untrained less marked onsets, but training on a less marked onset did not improve performance on more marked onsets. Markedness did not, however, account for the Portuguese speakers’ perception of English onsets; Cardoso, John, and French (2008) found the highest perceptual accuracy rates for the most marked (but most frequent) /st/ onsets, suggesting that accuracy in perception is more affected by input frequency than by markedness. Furthermore, a number of studies of different L1–L2 combinations have found no clear correlation between the SSG-defined markedness of sC onsets and their rate of acquisition in L2; the reader is referred to Edwards (2014) for an overview.
Studies investigating the role of the MSD in a wider range of onsets and codas have also yielded mixed results. Broselow and Finer (1991) found that L1 speakers of Japanese and Korean, languages which allow only the least marked obstruent-glide onsets, produced stop-liquid onsets in English pseudowords more accurately than the more marked fricative-liquid onsets. Assuming a parameter-setting framework, they attribute this effect to the learners having moved to a parameter setting intermediate between the native and the target language settings. Hancin-Bhatt and Bhatt (1998) investigated the role of markedness in a constraint-based approach, examining the production of English complex onsets and codas by L1 speakers of Spanish and Japanese. They found that for L1 Spanish speakers, English stop-glide onsets were actually less accurately produced than the more marked stop-liquid onsets, though the Japanese speakers in their study did show accuracy rates consistent with MSD-defined onset markedness. A markedness effect in production of codas was found for the Spanish speakers, who were less accurate in producing the more marked liquid-nasal codas than the less marked liquid-obstruent codas, but not for the Japanese speakers. Hancin-Bhatt and Bhatt (1998) argue that the results that cannot be explained by markedness are attributable to effects of highly ranked L1 constraints determining the types of segments that are permitted in onset and coda in the native language, which also determine the preferred repair. Thus, the combination of constraints defining coda cluster markedness along with high-ranked L1 constraints defining acceptable coda single consonants together select the correct surface forms without the need for an L2-specific rule of consonant deletion. The OT approach therefore provides an account of phenomena that are unexplained in a rules-and-parameter-based approach.
Sonority-based markedness cannot, however, account for the full range of differential difficulty found in production of novel clusters. For example, Davidson’s (2010) study of the production of novel onset types by L1 English and Catalan speakers revealed that for clusters involving a sonority plateau (either stop-stop or fricative-fricative), all of which are equally marked in terms of both SSG and MSD, participants were more accurate on onsets in which both consonants were voiceless (/fs/, /tp/) than on voiced clusters (/zv/, /db/). Davidson (2010) explains this pattern in terms of the articulatory difficulty of sustaining voicing in obstruent clusters and the greater perceptual salience of fricatives versus stops in pre-consonantal position. As discussed in Section 1.4, Davidson (2006) provides a grammar incorporating constraints on gestural timing that accounts for patterns of non-native production as a failure to closely coordinate the articulation of the two onset consonants. Wilson and Davidson (2013) provide, in addition, a proposal for building frequency effects into the grammar with a Bayesian model of the English and Catalan productions that predicts the probability of different realizations of each type of onset, an approach that recognizes and accounts for the high degree of variability that is a feature of most L2 performances.
1.5.3 Acquisition of Suprasegmentals
The study of stress and tone systems has yielded several generally agreed upon universal preferences. Some research supports a role for markedness in L2 learners’ productions even in the absence of direct evidence from the native or target languages.
In many languages, the position of stress within a word depends upon the weight of the syllables; for example, in Latin, the penultimate syllable of a word is stressed if and only if it is heavy (CVV or CVC, as opposed to light CV). In languages that take syllable weight into account in assigning stress, it is always the heavier syllables that attract stress, and this is encoded in OT as the WSP (Weight-to-Stress Principle) constraint, which requires heavy syllables to be stressed (Kager, 1999). While English does exhibit some preference for stressing heavy syllables – particularly syllables containing a long vowel – the English lexicon contains many violations of the WSP – for example, calendar, where stress falls on the initial light syllable. Nonetheless, Guion, Harada, and Clark (2004) and Guion (2005) found a strong tendency among speakers of L1 Spanish and L1 Seoul Korean to place stress on heavy syllables in English pseudowords, a system that receives at best weak support in Spanish (in which a final CVC syllable is somewhat more likely to be stressed than a final CV syllable) and no support in Seoul Korean, which is not a stress-based language. Similarly, Özçelik (2018) found that L2 learners of Turkish, where prominence falls on the word-final syllable, showed a tendency to accompany the move to final stress in Turkish with lengthening of vowels in word-final syllables.
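The weight-sensitive stress pattern described for Latin can be stated as a small procedural sketch. Syllables are represented here as abstract CV skeleta; this representation, and the restriction to penult stress, are illustrative simplifications.

```python
def is_heavy(syllable):
    """A syllable is heavy if its rime contains a long vowel/diphthong (CVV)
    or a coda consonant (CVC); it is light if it is an open short-vowel
    syllable (CV). Syllables are abstract CV skeleta such as "CVC"."""
    rime = syllable[syllable.index("V"):]  # everything from the first V on
    return len(rime) >= 2                  # VV, VC, VVC, ... all count as heavy

def stress_penult(syllables):
    """Latin-style Weight-to-Stress effect on the penult: the penultimate
    syllable is stressed if and only if it is heavy."""
    return is_heavy(syllables[-2])

assert stress_penult(["CV", "CVC", "CV"])      # heavy penult -> stressed
assert not stress_penult(["CV", "CV", "CV"])   # light penult -> unstressed
```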
In the area of tone, one well-established markedness principle is the OCP (Obligatory Contour Principle), which disfavors sequences of identical tones. Zhang (2016) found that native speakers of English, Korean, and Japanese studying Mandarin made more errors on words containing sequences of identical tones (whether contour or level tones) than nonidentical tones, consistent with the OCP. The exception was sequences of high tones, which showed high accuracy, a fact that Zhang (2016) ascribes to the universally least marked status of high tones. The learners’ productions are analyzed as reflecting a grammar in which the constraint penalizing identical tone sequences is outranked by the constraint favoring high tone.
1.5.4 The Role of Markedness in Phonological Derivations
The introduction of markedness as the major determinant of the mapping from underlying to surface representations provided a way to use learners’ grammars to account both for why some novel structures seem to be more difficult than other, equally novel structures and for why learners choose particular repair strategies (for example, the preference for disyllabic words motivates insertion of a vowel in a monosyllable but not in a disyllable).
However, structures that are considered more marked are arguably precisely those structures that are more difficult to produce and/or perceive. One position held by many is that the comparative rarity of marked structures is not within the province of grammatical explanation – the rarity of marked structures simply reflects the fact that marked structures are those most likely to be altered or lost in the process of transmitting a language from one generation to another (e.g., Blevins, 2004). And if marked structures are marked because of inherently greater difficulty, the fact that L2 speakers tend to be less successful in producing marked than unmarked structures does not in itself constitute an argument that markedness is built into phonological grammar. For example, voicing in final obstruents is not only difficult to perceive (particularly if cues like release burst are absent), it is also difficult to sustain during the narrow vocal tract constriction required for obstruent production. Even markedness principles that have no obvious explanation in terms of articulatory or perceptual difficulty may have a functional grounding; for example, Frisch (2009) argues that the OCP is based in processing difficulty, as the neurons that fire in the presence of one set of cues may not have sufficient time to recover if an identical stimulus follows too closely. The debate concerning the role of markedness in the grammar is far from being resolved, but it is clear that accounts of L2 phonology must take into consideration all the factors that may contribute to difficulty of acquisition in L2 (Archibald, 2021).
1.6 Grammar Interaction in Bilingual Speakers
The view that learning a language means inducing an internalized abstract grammar raises the question of whether bilinguals maintain distinct independent phonological grammars for each language, or whether these two systems may interact. Here factors such as age of acquisition and language dominance must be taken into account.
Flege’s (1991) study of VOT values in the production of /t/ by early versus later Spanish-English bilinguals suggests that age of acquisition affects the degree of separation in systems of phoneme realization. Flege (1991) found that participants who had learned English at five to six years of age produced monolingual-appropriate VOT values in both Spanish and English words, suggesting that these speakers maintain not only independent phoneme categories but also independent systems of phonetic realization of those categories. The productions of the early bilinguals contrasted with those of the L1 Spanish speakers who had learned English as adults, who showed VOT values for English /t/ that were significantly shorter (more Spanish-like) than those of monolingual English speakers. Similarly, Gonzales and Lotto (2013) found that when confronted with stop consonants with VOT values that fell within the voiced category in English but the voiceless category in Spanish, early Spanish-English bilinguals (but not English monolinguals) shifted their identification of the stop’s voicing depending on whether the pseudoword in which the stops were embedded contained Spanish-specific versus English-specific cues.
Yet even early bilinguals may show evidence of permeability across different linguistic systems. For example, Macleod and Stoel-Gammon (2005) found that French-English bilinguals who had acquired both languages before the age of three evidenced monolingual-like VOT values for voiceless stops in each language but not for English voiced stops, which were realized with VOT values more like those of French stops. It is difficult to determine whether such effects result from the interpenetration of grammatical systems or from external factors such as exposure to accented (nonmonolingual) varieties of one of the languages and/or to a range of variability in the realization of phoneme categories even among monolingual speakers. In any event, the findings that bilinguals sometimes exhibit realizations of a phoneme category that are somewhere between the monolingual norms of each language (e.g., Flege & Eefting, 1987) as well as the phenomenon of phonetic drift, whereby even a late-acquired L2 may cause changes in the phonetics of the L1 (Chang, 2019), suggest that, at least in terms of the pronunciation of segments, a speaker’s coexisting internalized linguistic systems may interact.
Another area to look to for insight into the relationship between coexisting grammars is cases where a speaker’s languages differ in the presence versus absence of a phonological process. For example, voiced stops are realized as spirants between vowels in Spanish but not in English, so bilinguals must map intervocalic voiced stops to spirants in Spanish but suppress this mapping in English productions. Amengual (2019) found that L1 English speakers who had learned Spanish after age eight showed less spirantization in read Spanish words than participants who had acquired Spanish earlier, suggesting incomplete mastery of the Spanish process by later learners. However, evidence that at least some Spanish-English bilinguals can successfully switch between phonological systems is presented by Olson (2019b), who analyzed the production of intervocalic voiced stops within all-Spanish, all-English, and code-switched sentences. Spirantization of word-initial voiced stops following a vowel-final word was more likely when the stop was in a Spanish word, regardless of whether the surrounding environment was English or Spanish, though the degree of difference in the stops in Spanish versus English words was more pronounced in participants who scored toward the balanced end of the language dominance scale. The difficulties of finding truly balanced bilinguals and of controlling for factors such as the types of linguistic data to which bilinguals are exposed mean that the answer to the question of whether and how coexisting grammars interact remains elusive.
1.7 Future Directions
The past decades of research in generative phonology have seen a wealth of new proposals concerning the nature of phonological representations and phonological derivations, the role of markedness in the phonological grammar, the relationship between phonetics and phonology and between production and perception, and the role of usage factors such as frequency in language learning and grammar construction. New views of the phonological grammar have suggested new ways of approaching old problems in the study of L2 phonology: the inclusion of gestural targets in the elements making up phonological representations suggests ways to move beyond the phonetics–phonology dichotomy; the appeal to specific perception grammars clarifies the role of the L1 in L2 perception; the move to probabilistic rather than categorical grammar addresses the role of frequency and variation in L2 production; and the incorporation of markedness in phonological derivations sheds light on the problems of the differential difficulty of novel L2 structures and the differential repair strategies employed by speakers of different L1s or for different L2 structures. Models of L2 perception that draw on insights from connectionist approaches (e.g., van Leussen & Escudero, 2015; Bordag, Gor, & Opitz, 2021) provide avenues for insight into the nature of speakers’ lexical representations, and new imaging techniques may help to reveal distinctions in mental representations that are not obvious from more traditional acoustic investigation of speakers’ productions (e.g., Song & Eckman, 2021). Second language data have provided and will continue to provide a fruitful area in which to explore the predictions of different models of the internalized knowledge that constitutes a grammar.
2.1 Introduction
Usage-based approaches to language attribute a crucial role to use and usage patterns in the emergence of linguistic structure. Usage-based functionalism argues that language use directly shapes language structure (e.g., Bybee, 1999; Givón, 1998; Haiman, 1994). Work on variation and change of morphological and syntactic forms (e.g., Bybee, Perkins, & Pagliuca, 1994; Haiman, 1994; Hopper, 1987; Hopper & Traugott, 1993), in addition to sounds and sound systems (e.g., Bybee, 2001, 2010), within multiple languages supports this linguistic paradigm. This chapter will discuss in broad terms aspects of usage-based approaches to language in order to underscore the potential these models have to unify multiple lines of linguistic inquiry into bilingual phonetics and phonology and simultaneously provide new explanations for and understanding of bilingual speech production, perception, and processing. Although the focus of this work is on the sounds and sound systems of speakers of more than one language, a usage-based approach assumes neither modularity of grammar (a formal separation of linguistic subsystems such as morphology and syntax) nor any necessary theoretical division between the way language operates in monolinguals and in multilinguals.
Within usage-based grammar, “all types of regularities (and hence predictability and productivity), including grammatical structure, emerge bottom-up from the information in the lexicon, through conventionalization, which in turn results from repetition” (Hinskens, 2011, p. 104). The lexicon, then, contains highly redundant, detailed information on words and word combinations that reflect linguistic and extralinguistic features derived from experiences (in perception and production) of use. An exemplar model of lexical representation (Bybee, 2001, 2010), described in more detail in Section 2.2, proposes that all meaningful characteristics of experienced tokens of words are stored in memory. These details include phonetic realizations, as well as linguistic and extralinguistic information derived from the contexts of use. Linguistic structure emerges as generalizations of form and meaning (of varying degree of abstraction) across experienced tokens of words and constructions, forming a highly organized network of interrelated forms.
Just as usage-based models propose no modularity of grammar in the cognitive architecture of a single language, no formal separation of distinct languages is assumed either. As Onysko (2019, p. 31) notes, “while on the social plane languages can indeed appear as entities belonging to certain groups of speakers, internal separations for different codes in the mind of a speaker are difficult to draw … The processing and production of language is dependent on activity in a mental network within which there are virtually no boundaries.” A usage-based approach to bilingual phonetics and phonology views the lexicon as a permeable system (e.g., Dussias & Sagarra, 2007) and cross-language influence as potentially bidirectional (e.g., Putnam, Carlson, & Reitter, 2018). In the way that different varieties (dialects) of a single language can bring about shifts in variant forms through contact and use (e.g., Baayen, Beaman, & Ramscar, 2021), interlingual influence is evident across languages (Kroll, Bogulski, & McClain, 2012). That is, it is consistent with a usage-based view that the principles operating within one language will also operate across languages (Sherkina, 2003, p. 139). Bilingualism, then, can be viewed as a particular type of use in which units of language are “bound together on the dimensions of co-occurrence, similarity and overlapping domains of appropriate use via the general cognitive processes of categorization, generalisation and schematization. The result is a highly flexible and always relative model that does not rely on pre-established named languages” (Wasserscheidt, 2021, p. 13).
Thus, although much empirical research within usage-based frameworks stems from monolingual data, the theories and methods derived from this approach require no modifications when applied to bilingual and multilingual individuals and settings.
Usage-based approaches to phonology remain a minority position in terms of the proportion of academic work produced (e.g., Bérces & Honeybone, 2020). The approach forms part of a comparatively “young paradigm” with regard to linguistic thinking and phonological theorizing (e.g., Hinskens, 2011). At the same time, usage-based approaches are in the ascendancy (Backus, 2013), with numerous independent lines of research converging on findings consistent with exemplar-based models of language (e.g., Bybee, 2001). Methods and theory derived from this paradigm are rife with potential, then, when applied to bilingual phonetics and phonology. Although a usage-based approach has been labeled “part of a twenty-first-century ‘assault on autonomous phonology’” and thought to be characterized “by eclecticism in its proponents and iconoclasm in its views” (Bérces & Honeybone, 2020, p. 5), this chapter will summarize works that characterize the approach as a cognitively plausible model capable of offering a unified theoretical framework with which to consider instances of language production, perception, and processing. As noted by Quick and Verschik (2019, p. 8), a usage-based approach does not “restrict itself solely to linguistic forms, psycholinguistic or communicative processes, or sociolinguistic reasons,” but rather integrates under one theoretical umbrella different research agendas through a common focus on use and usage patterns. As such, any eclecticism derives, in part, from the breadth of contact phenomena unified through usage.
As a means to illustrate key constructs within a usage-based approach to phonetics and phonology, this chapter will focus on work that examines sounds and sound systems in a bilingual (or multilingual) individual or setting. Bybee (2001, 2010) offers a full discussion of the usage-based approach summarized in this chapter, and Backus (2021) provides an overview of usage-based approaches to themes not detailed in this chapter but of intense interest within contact linguistics, such as code-switching, borrowing, and convergence. The reader is referred to these sources for additional detail. Section 2.2 gives a brief overview of crucial usage-based concepts, highlighting those underlying recent work investigating the sounds and sound systems of bilinguals (see also Section 2.3). Section 2.4 highlights current approaches to bilingual phonetics and phonology employed by a wide range of scholars that test and employ the usage-based constructs overviewed in Sections 2.2 and 2.3. Section 2.5 highlights a few potential directions for future research, and the chapter concludes with Section 2.6.
2.2 Usage-Based Approaches to Language
Variation and change are inherent facts of language. It is not counter to expectations, then, for both variation and change to be evident in situations of contact. That the sounds of language vary and change in ways that correlate with language contact is long established (compare Thomason & Kaufman, 1988; Winford, 2003). Patterns of articulation (e.g., Flege, 1987), the organization of sound systems (e.g., Van Coetsem, 1988; Weinreich, 1952), and prosodic characteristics (e.g., Harris, Miglio, & Gries, 2015; Shenk, 2006) have each been examined in cases of bilingual individuals and communities, with differences from monolingual benchmarks noted. How these sound permutations are accounted for, however, has not traditionally been from a usage-based perspective. Nevertheless, such a model has the potential to unite scholarship within the many subdisciplines of contact linguistics (Backus, 2021).
Exemplar theory models a nonmodular lexicon in which each word (or combination of words) is stored with redundant phonetic, semantic, and contextual information. This information is word-specific and represented in memory by a “cloud” of remembered exemplars that are directly shaped by speakers’ experiences with words in both production and perception. As words are experienced, they are mapped via similarity matching to clouds of stored exemplars (Langacker, 2016, p. 54), strengthening and shaping variant forms of words. As noted by Docherty and Foulkes (2014, p. 42), exemplar-based models of phonological representation (Bybee, 2001; Pierrehumbert, 2001) are well situated to “account for the production, processing and acquisition of social-indexical information woven into the speech signal alongside lexical-propositional content,” all of which is of interest in situations of language contact and bilingualism.
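A minimal sketch of exemplar storage and similarity matching may help fix the idea. The words, the two acoustic dimensions (duration and F2), and all feature values below are invented stand-ins for the rich phonetic detail the model assumes.

```python
# Toy exemplar-matching sketch: a word's representation is a "cloud" of
# stored phonetic tokens; an incoming token is mapped to the word whose
# nearest stored exemplar it most resembles. All values are illustrative.
import math

clouds = {
    # word -> list of stored exemplars, each a (duration_ms, f2_hz) pair
    "pen": [(180, 1800), (175, 1750), (190, 1820)],
    "pin": [(160, 2100), (165, 2150), (155, 2080)],
}

def classify(token):
    """Return the word whose closest stored exemplar is nearest the token."""
    return min(clouds,
               key=lambda w: min(math.dist(token, e) for e in clouds[w]))

word = classify((170, 1790))
assert word == "pen"
# Crucially, the perceived token is itself stored, updating the cloud and
# thereby shaping future perception and production.
clouds[word].append((170, 1790))
```

The feedback step in the last line is what makes the representation usage-based: every experienced token incrementally reshapes the category it was matched to.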
An important measure of experience is frequency. “Token frequency” measures a speaker’s exposure to and experience with a linguistic unit (the number of instances of use of a word), while “type frequency” refers to the number of different lexical items that participate in a pattern. According to Bybee (2006), each time a word is used it is exposed to the eroding effects of articulatory pressures and routinization, and, as such, frequent words exhibit greater degrees of reduction and are leaders in reductive change compared to words of lower frequency. As words are used, they reduce, and this reduction becomes registered in memory as part of the phonetic makeup of the remembered experiences: this is the “Reducing Effect.” At the same time, high token frequency strengthens lexical representations and makes words resistant to changes via analogical pressures (the “Conserving Effect”). Higher type frequency (more lexical items sharing form ~ meaning overlap) contributes to increased productivity of a pattern.
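The distinction between the two measures can be made concrete with a toy corpus; the corpus and the choice of the English -ed pattern are purely illustrative.

```python
# Token vs. type frequency on a tiny invented corpus.
from collections import Counter

corpus = ("walked walked talked played walked jumped "
          "played talked walked run").split()

# Token frequency: how often each word occurs in the corpus.
token_freq = Counter(corpus)
assert token_freq["walked"] == 4  # high token frequency: reduction-prone,
                                  # yet resistant to analogical change

# Type frequency of the -ed past-tense pattern: how many *different*
# lexical items instantiate it. Higher type frequency drives productivity.
ed_types = {w for w in corpus if w.endswith("ed")}
assert len(ed_types) == 4         # walked, talked, played, jumped
```

On this view the same count data support two distinct predictions: walked, as the highest-token-frequency item, should lead phonetic reduction, while the four-type -ed pattern should be the productive default extended to novel verbs.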
Token frequency has stood as a key piece of evidence buttressing the usage-based framework against the long-standing generative tradition, which maintains a strict separation between use (“performance”) and linguistic structure or knowledge (“competence”). Indeed, negative evidence has been treated as a potential falsification of usage-based phonology generally: where high-frequency words fail to lead a reductive change, usage-based claims are taken to be delegitimized (e.g., Hinskens, 2011). Yet, importantly, multiple factors operate simultaneously on variant forms of words to shape realizations, and token frequency is just one measure of use. Bybee (2002a) proposes another important (and still underutilized) measure that is contextually informed. Given (i) that contexts of use condition target word pronunciation, (ii) that these instances of use in production are subsequently stored in memory, and (iii) that words differ significantly in their patterns of use, Bybee (2002a) proposes that, in addition to lexical frequency, researchers should consider contextual frequency measures to more accurately model variation.
Forrest (2017), for example, utilizes this type of contextual frequency measure, examining the cumulative impact of extralexical phonetic context on the variable realization of the English suffix -ing ([ɪn], [ɪŋ]). Results show that words occurring more frequently in contexts disfavoring [ɪn] (prevelar initial consonant, prepause) are realized more often with [ɪŋ], even in phonetic contexts favoring [ɪn]. Forrest (2017, p. 144) explains that speakers “store exemplars of these words with either -in or -ing, which in turn affects their use of these words in other contexts, independent of internal constraints.” A target’s phonetic production context thus modulates phonetic variation in predictable ways. Few studies, nevertheless, consider the cumulative effect of words’ history of use in specific conditioning contexts as Forrest (2017) does. Other factors operating in the production context to condition phonetic variation are extralinguistic, and words’ distributions among different speakers can likewise accumulate in memory and condition variation. In a study of word-medial English -t-, Hay and Foulkes (2016, pp. 321–322) reveal word-level effects of cumulative use among younger (more innovative) speakers: a word’s likelihood of use among this group of speakers has a statistically significant effect on its lexical representation. The authors (Hay & Foulkes, 2016, p. 304) explain that the “non-random association between phonetic forms and social groupings, or social meanings, can also have long-term consequences for the storage of words.” The lexical differentiation derived through use contributes to variation, as frequent variants become targets for production through a type of feedback loop (Kemmer & Barlow, 2000, p. ix).
As such, words’ likelihood of use in particular linguistic (Forrest, 2017) or extralinguistic (Hay & Foulkes, 2016) contexts can have an independent (lexicalized) effect on the variant forms of words realized. Thus, in order to better understand patterns of variable phonological reduction, it is productive in studies of variation (and change) to consider words’ proportion of use in the specific discourse environments conditioning variation.
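A contextual frequency measure of this kind can be operationalized, in the simplest case, as a word’s proportion of use in the conditioning context. The sketch below uses invented counts and is not any particular study’s exact operationalization:

```python
# Hypothetical token counts: for each word, how many of its tokens
# occurred in a context favoring reduction vs. elsewhere.
usage = {
    "running": {"favoring": 40, "other": 10},
    "morning": {"favoring": 5,  "other": 45},
}

def contextual_frequency(word):
    """Proportion of a word's tokens occurring in the conditioning context."""
    counts = usage[word]
    return counts["favoring"] / (counts["favoring"] + counts["other"])

print(contextual_frequency("running"))  # 0.8
print(contextual_frequency("morning"))  # 0.1
```

Under the exemplar logic sketched in the text, a word like the hypothetical "running" (mostly experienced in the reducing context) would be predicted to carry reduced variants into other contexts as well.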
Individual words’ history of use, then, can help account for patterns of variation and change. Additionally, within a usage-based approach to phonetics and phonology, the organization (storage) of words in memory reveals the origins of patterns of production and perception. The exemplar model of lexical representation (Bybee, 2001, p. 29) proposes that the episodic traces of words stored in memory are organized into a dynamic network of connections reflecting phonological and semantic overlap between words. These schematic ties between words are gradient, dependent on the number of ties and the strength of association between lexical items. Words with a high degree of phonological and semantic similarity, or a strong degree of association through co-occurrence and entrenchment, share stronger lexical connections than words lacking such associations. These connections represent the form/meaning overlap from which morphology is emergent (Bybee, 1999, p. 224). Through coactivation of proximally stored lexical units and/or storage in overlapping phonetic space, words in memory can influence one another (Bybee, 2001, pp. 152–153). This predicts increased mutual influence between languages where phonological and semantic overlap is greatest, either enhancing reduction or strengthening nonreduced forms (Brown, 2015; Brown & Harper, 2009).
The gradient ties between possible Spanish-English lexical items are illustrated in Figure 2.1 (Brown, 2015, p. 401). The associations in memory between two morphologically related words in one language (for example, the masculine singular and the feminine singular forms for “doctor” in Spanish) may be quite strong given the overlap not just of form and meaning but perhaps also of common contexts of use. The lexical representation of a bilingual might also have strong ties across languages, in this case to a cognate with similar phonology (English doctor), but the strength of association with a translation equivalent lacking phonological overlap (English pain and Spanish dolor “pain”) would be predicted to be weaker.

Figure 2.1 Bilingual (Spanish-English) lexical representation.
Cross-linguistic phonological influence is lexically specific, but general articulatory routines are built up as abstractions of varying degrees of specificity across different lexical items (Bybee, 2012, p. 218). The nature of these connections is dynamic and can be expected to change across “multiple time scales” (Putnam et al., 2018, n.p.) as a function of use. Produced forms can reflect the exemplar makeup (the experienced tokens) as well as activated exemplars that are relevant to the linguistic and social contexts. As Lin, Yao, and Luo (2021, p. 3) note, “exemplar theory allows various linguistic and social factors (e.g., frequency, recency, salience, attitude, awareness) to adjust the activation levels of relevant exemplars, and hence influence the calculation of the production target.” Understanding patterns of use and considering lexical coactivation through proximal storage or schematic ties enable a more precise vision of bilingual phonetics and phonology. These aspects of usage-based theory (experienced tokens stored in memory, lexical associations built up across particular items) are explored with bilingual data in Section 2.4.
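The “calculation of the production target” referred to in the Lin, Yao, and Luo quotation can be sketched, under simplifying assumptions, as an activation-weighted average over a one-dimensional exemplar cloud. All values below are invented, and real exemplar models differ in how activations are set:

```python
# Minimal exemplar-cloud sketch (all numbers invented): each stored
# exemplar pairs a phonetic value (e.g., VOT in ms) with an activation
# level set by contextual factors such as recency or salience.
exemplars = [
    (15.0, 1.0),   # (phonetic value, activation)
    (20.0, 0.5),
    (60.0, 0.25),  # e.g., a trace shaded toward the bilingual's other language
]

def production_target(cloud):
    """Activation-weighted mean over the exemplar cloud."""
    total = sum(act for _, act in cloud)
    return sum(val * act for val, act in cloud) / total

print(round(production_target(exemplars), 1))  # 22.9
```

Raising the activation of particular exemplars (e.g., through a salient social context) shifts the computed target toward those traces, which is one way to model the context-sensitive variation discussed in the text.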
2.3 Domain-General Cognitive Processes
As noted by López-Beltrán and Carlson (2020, p. 4), usage-based approaches “propose that linguistic knowledge is shaped by the everyday interaction of language use with domain-general cognitive mechanisms.” These cognitive processes that give rise to linguistic structure are not unique to language or language acquisition specifically, but rather are characteristics of human cognition generally. The role that these domain-general neurocognitive abilities play in the emergence and storage of linguistic knowledge is central to usage-based approaches. As detailed by Schmid (2016, pp. 3–4):
Linguistic knowledge is not autonomous, abstract, and stative but is instead continuously refreshed and reorganized under the influence of communicative events in social situations. Linguistic entrenchment can essentially be regarded as a lifelong cognitive reorganization process, the course and quality of which are conditioned by exposure to and use of language on one hand and by the application of domain general cognitive abilities and processes to language on the other.
Bybee (2010) enumerates five important domain-general cognitive processes, arguing that “taking language to be an embodied activity that occurs in real time, in real situations and passes through real cognitive systems has great potential for leading to the explanation of what we perceive as linguistic structure” (Bybee, 2010, p. 221). A brief summary of these processes (categorization, chunking, rich memory storage, analogy, cross-modal association), in the context of their application or applicability to studies of bilingualism, shows how they account for the emergence of sounds and sound systems in speakers of more than one language.
As detailed already, as speaker-hearers experience words, those experiences are mapped onto existing exemplars of that word (if experienced previously) or categorized based on degree of similarity with other stored exemplars. Categorization can be based on form or meaning (Pierrehumbert, 2003), and the strength and the number of ties in memory are gradient (Bybee, 2001). Categories emerge from experiences that resemble one another, often with the most frequently occurring member being the most central or prototypical member of the category. Thus, categories of words, sounds, meanings, and constructions are inherently gradient (reflecting variable degrees of similarity). Bybee (2010, p. 90) notes that “graded category membership and the central role of the frequent member thus pervade linguistic dimensions, playing a role in synchronic language use (as reflected in corpus distributions), in child language acquisition and in the diachronic expansion of categories.”
Categorization is a pervasive domain-general process that interacts with other cognitive mechanisms. Under a usage-based model, categorization and similarity matching underlie the emergence of exemplars and their schematic ties to other exemplars (Gampe, Quick, & Daum, 2021), creating a network of interconnected words and constructions. Words stored in close proximity to one another may stimulate coactivation, either altering the remembered representations slightly (Brown & Amengual, 2015) or raising the activation of particular variants (Simonet, 2014). The effects of similarity in language contact (e.g., Hakimov & Backus, 2021) allow for analyses of cognate effects (even so-called false cognates) in bilingual phonetic productions (e.g., Brown & Copple, 2018; Mora & Nadeu, 2009) and processing (e.g., Carrasco-Ortiz, Amengual, & Gries, 2021).
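Similarity-based categorization of this kind can be sketched as mapping a new token to the category of its nearest stored exemplar. The F1 values, the one-dimensional phonetic space, and the category labels are all invented simplifications:

```python
# Minimal similarity-based categorization sketch (values invented):
# a new token is assigned to the category whose stored exemplars it
# most closely resembles, here in a one-dimensional phonetic space.
clouds = {
    "/i/": [280.0, 300.0, 310.0],  # F1 values (Hz) of stored exemplars
    "/e/": [420.0, 450.0, 470.0],
}

def categorize(f1):
    """Return the category containing the nearest stored exemplar."""
    return min(clouds, key=lambda c: min(abs(f1 - e) for e in clouds[c]))

print(categorize(305.0))  # /i/
print(categorize(440.0))  # /e/
```

Because each incoming token is also stored, categories built this way drift with experience, which is the dynamic, gradient behavior the text attributes to exemplar categorization.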
An oft-repeated usage-based adage is that linguistic items that are used together fuse together (Bybee, 2002b, p. 112). As units of language are repeatedly sequenced together in use, they become packaged together in memory such that the combination can be accessed holistically as a unit. This chunking of separate linguistic units, or the formation of a more complex unit out of smaller language parts, eases access and enhances articulatory routinization. As such, two detectable effects of chunking are phonetic reduction within the chunked units and autonomy yielding loss of analyzability and compositionality (Bybee, 2010, p. 56). Co-occurrence frequencies have been shown to influence patterns of code-switching (e.g., Hakimov, 2021), and, in cases of interlingual influence of sounds, items that are stored and accessed whole may be expected to pattern differently or to reflect a gradient degree of influence when compared to similar sounds in words not stored holistically as a chunk.
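The co-occurrence frequency that drives chunking can be estimated, in the simplest case, from bigram counts over a stream of speech. The toy utterance stream below is invented; corpus studies of chunking use far richer association measures:

```python
from collections import Counter

# Toy utterance stream (invented): chunking predicts that frequently
# co-occurring word pairs come to be stored and accessed as units.
words = "i dont know i dont know i dont think so".split()

# Count adjacent word pairs (bigrams).
bigrams = Counter(zip(words, words[1:]))

print(bigrams[("i", "dont")])     # 3
print(bigrams[("dont", "know")])  # 2
```

High-count pairs such as the toy "i dont" are the candidates for holistic storage, and hence the loci where within-chunk phonetic reduction would be predicted.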
A third domain-general cognitive process (rich memory storage) refers to “the memory storage of the details of experience with language, including phonetic detail for words and phrases, contexts of use, meanings and inferences associated with utterances” (Bybee, 2010, p. 7). Indeed, the exemplar model discussed previously accounts for the emergence of lexically specific patterns of variation and change. The rich information retained in memory is accessed in both production and perception and contributes to the emergence of phonological abstractions across experienced tokens. The finding that the phonological systems of two (or more) languages may be permeable to cross-linguistic influence in lexically specific ways (e.g., Amengual, 2012, 2016; Brown & Amengual, 2015) is straightforwardly accounted for by acknowledging the way in which exemplars effectively record speaker experiences with language. The role that frequency plays will be discussed in Section 2.4.1.
The fourth domain-general process that underlies the creation of linguistic structure is the capacity to make analogical connections across remembered experiences. Whereas high lexical frequency promotes the reduction apparent in chunking and routinization, under analogical processes high-frequency items resist analogical leveling; this is the “conserving effect” (Bybee, 1999). In the context of bilingualism, as in the context of monolingualism, the productivity of a pattern will be driven by high type frequency (the number of different lexical items to which the pattern applies) and schematicity (the degree of similarity of category members). A highly productive pattern of one language may be prone to influence another. For example, Harris et al. (2015) demonstrate a case in which the prosodic pattern of contrast marking of one language affects the marking of new (versus old) information in the prosodic system of the bilinguals’ other language: bilingual speakers of Chicano Spanish encode new information with more English-like pitch patterns than do monolingual Mexican Spanish speakers.
The cognitive process of cross-modal association is the association in memory that speakers make between phonetic forms (the words spoken) and their meanings. Meaning can be propositional, contextual, and/or pragmatic (Bybee, 2010). Context has a major impact on meaning, and aspects of the context and inferences made from it can become conventionalized parts of grammatical meaning. Co-occurring experiences will be associated in cognition, and meaning will be applied to the largest linguistic unit. Morford et al. (2019, p. 363) note that cross-modal associations provide a link between meaning and form even across languages, finding that “form and meaning can be mapped within and across modalities.” Linguistic knowledge and structure, then, emerge from domain-general processes inherent in human cognition generally. Section 2.4 highlights studies that exemplify the application of these everyday processes in bilingual interactions.
2.4 Current Approaches
Each of these “domain-general processes operate[s] through repetition on a massive scale, within individuals and certainly within communities” (Bybee, 2010, p. 221). Usage-based approaches view language as a complex, adaptive system that is constantly changing and updating in response to ongoing experiences and use (e.g., Beckner et al., 2009). The understanding that bilingualism is a dynamic process adapting to and shaped by experience (e.g., Adamou, Feltgen, & Padure, 2021) generates multiple empirically testable hypotheses, many of which are interrelated. Studies that closely examine the effects of usage and experience with language on bilingual phonetics and phonology are highlighted in Section 2.4.1, and the role of language organization (schematic ties) is discussed in Section 2.4.2. In both sections, the importance of lexically specific patterns that reflect use and usage patterns will be made evident.
2.4.1 Usage-Based Approaches and Language Experience
Language knowledge involves statistical knowledge (e.g., Ellis, 2012). Usage-based approaches hold that this lexically specific knowledge of repetitions and probabilistic distributions of words (and word combinations) both shapes and predicts the very structure of language. The impact of repetition and frequency on the cognitive representation of language is crucial within usage-based frameworks (e.g., Barlow & Kemmer, 2000). An important measure of experience is lexical frequency, both token frequency (with its reducing and conserving effects) and type frequency (with its role in productivity) (Bybee, 2006), as defined in Section 2.2.
Torres Cacoullos and Ferreira (2000), for example, employ lexical frequency in a novel approach to their analysis of variable realizations of voiced labiodental [v] and bilabial [β] fricatives in Traditional New Mexican Spanish. This study tests whether patterns of “v” and “b” articulations are better characterized as retentions in modern Spanish (archaisms) or as instances of influence from English “v” (contact-induced). The highest rates of labiodental realizations are found among the high-frequency words, with no effect of orthography or English cognate status, suggesting that, for these words, pronunciations reflect retentions of older forms. These words are highly entrenched through usage and not prone to adapt to analogical pressure from English. At the same time, loss of the archaic form (standardization) is evident in the low-frequency words. These words favor a bilabial articulation when the English cognate has “b,” and this analogical change is also in line with usage-based proposals. The lexically specific effects reveal two distinct sources of the variation and can be accounted for within an exemplar model where words are built up as clouds of experienced tokens.
More recent work by Levy and Hanulíková (2019) explores whether vowels produced by bilingual (Standard German, Swabian) school-aged children vary as a function of lexical frequency and finds that vowels in frequent words were produced with greater variability (larger Euclidean distances from the mean of each vowel) than vowels in infrequent words. Additionally, arguing from a usage-based perspective, these authors explored whether greater input variability would lead to greater production variability. Specifically, Levy and Hanulíková (2019) explored whether greater exposure to different languages and to dialect and speaker variability (i.e., foreign accents) could predict vowel dispersions. They compare the first two formant frequencies of the eight German vowels embedded in high- and low-frequency words produced in a picture-naming task. Greater vowel dispersion is found in speakers with greater input variability (of both regional and foreign-accented speech). Further, language background (bilingual versus monolingual) did not predict production variability, but it did predict vowel height: speakers with exposure to a language other than Standard German had lower first formants (F1) than monolinguals. Levy and Hanulíková (2019, p. 19) conclude that their findings, which show that input variability and lexical frequency account for increased variability in vowel production, are “difficult to explain without assuming the storage of individual word tokens, with rich acoustic detail, in a single lexicon used for comprehension and production.” As noted by Cohn and Renwick (2021, p. 107), the type of fine phonetic detail evident in production supports “the view espoused by Exemplar Theory that lexical representations are built up of clouds of specific utterances that include and are affected by highly specific phonetic properties (e.g., Johnson, 1997, 2007; Pierrehumbert, 2001, 2002).”
Of course, lexical frequency as a straightforward count of occurrence may not always be the optimal measure of experience (e.g., Divjak, 2017; Schmid, 2016). Other predictors that capture words’ distributions with regard to linguistic and social contexts may more precisely capture variable realizations of sounds. Additionally, the effects of forms’ frequencies relative to one another (e.g., Kanwit & Terán, 2020) have been shown to constrain production variability. Gradoville, Waltermire, and Long (2021), for instance, explore words’ frequent patterns of use in separate languages (in this case Spanish and Portuguese). They find that realizations of intervocalic /d/ in Spanish reflect the extent to which the /d/ word is used in each language of the pair. Words used more frequently in Spanish (biased toward Spanish) have greater /d/ lenition, in line with monolingual Spanish patterns, and words with greater relative frequency of use in Portuguese (biased toward Portuguese) have stronger /d/ realizations, in line with contact effects. The authors note that the lexically specific usage patterns have the effect of shifting the exemplar cluster toward one realization or the other (more Spanish-like or more Portuguese-like), indicative of experiences with the words.
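A relative-frequency measure of this kind can be sketched as a word’s proportion of use in one language of the pair. The word labels and counts below are hypothetical, and this is not the authors’ exact operationalization:

```python
# Hypothetical per-language token counts for words used by
# Spanish ~ Portuguese bilinguals (all numbers invented).
counts = {
    "word_a": {"spanish": 90, "portuguese": 10},
    "word_b": {"spanish": 20, "portuguese": 80},
}

def spanish_bias(word):
    """Proportion of a word's use occurring in Spanish (0-1)."""
    c = counts[word]
    return c["spanish"] / (c["spanish"] + c["portuguese"])

print(spanish_bias("word_a"))  # 0.9 -> Spanish-biased: more /d/ lenition predicted
print(spanish_bias("word_b"))  # 0.2 -> Portuguese-biased: stronger /d/ predicted
```

The bias score then serves as a word-level predictor of variant realization, capturing the idea that usage in each language shifts the word’s exemplar cluster.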
The degree to which experiences with words affect lexical representations depends in part upon perceptual salience. Attention is drawn to salient features, which are those that are unexpected given the listener’s prior experiences with words in specific linguistic and social contexts (e.g., Jaeger & Weatherholtz, 2016). Drager and Kirtley (2016, p. 17) demonstrate that “when attention is paid to an item during storage, the exemplar receives greater weight.” The exemplar weighting due to surprisal can in turn influence patterns of production variability, perception, and social salience. Barnes (2015), for example, examines features of Spanish in contact with Asturian on the Iberian Peninsula and finds that while two distinct suffixes appear in Spanish as a result of contact (/-u/, /-es/), it is the low-probability, high-surprisal form that obtains social salience. The high cognitive salience of [u] in this variety allows this contact feature to become a marker of social indexation (Barnes, 2015, p. 234) via experience-based associations of linguistic form and social meaning.
Gaskins (2020) provides an additional study exemplifying the ways in which linguistic output can reflect language input. Exemplars capture phonetic variability, as discussed in Section 2.2, and, as such, it is important to consider the frequency of words’ distributions in particular discourse and social contexts as opposed to solely considering lexical token frequency. How words are used has an impact on their lexical representation. In addition to considering token and type frequency effects in Polish-English acquisition data, Gaskins’ (2020) qualitative usage-based study of bilingual input-output shows that hearing words in isolation (as opposed to, for example, in the middle of a clause) makes words more accessible in acquisition. The usage pattern experienced in the input accounts for the tendency for “social words” (words such as mommy, daddy, and those used to accomplish particular pragmatic ends) to be acquired earlier than words more frequently distributed in the middle of clauses, such as nouns, verbs, adjectives, and closed-class items.
Contexts of use shape the variants of words produced, and words differ significantly in their likelihood of use in particular contexts constraining variation (i.e., Bybee, 2002a). These contextual conditioning factors responsible for the episodic traces stored in memory can be linguistic or extralinguistic. Despite having a demonstrated effect in predicting segment lenition (e.g., Raymond, Brown, & Healy, 2016) and durations (e.g., Brown et al., 2021), cumulative measures of this type are not yet commonly employed in testing hypotheses with bilingual data (though see application to Spanish ~ English bilingual data [Brown, 2015] and discussion for Italian dialectal contact data [Piccardi, 2019]). Usage-based work of this type not only supports the exemplar model proposal that use affects lexical representations but also allows for empirical testing of potential influence between lexical items in memory. Such lexical connections are the focus of Section 2.4.2.
2.4.2 Usage-Based Approaches and Language Organization
As Section 2.4.1 makes evident, input frequencies leave an imprint in memory via episodic traces of experiences with language. Nevertheless, within an exemplar model, lexical items are presumed to form part of a network of connections across lexical types. Associations emerge across lexical items based upon similarity of form and/or similarity of meaning. The schematic ties are abstractions from usage events with varying degrees of entrenchment (Langacker, 2016, p. 54) that are not distinct from the lexical units from which they emerge. In usage-based exemplar theory, it is proposed that lexical items that are similar in form and in meaning are stored in close proximity to one another and may be subject to coactivation in language processing, perception, and use.
Similarity of form and meaning across different languages has long been of theoretical interest in contact linguistics. A usage-based approach to bilingual phonetics and phonology views the emergent grammar as a cognitive organization of experiences with usage (e.g., Bybee, 2006). As Hakimov and Backus (2021, p. 468) note, “[t]his organization is grounded in the brain’s capacity to perceive and categorize new information as identical, similar or different to what is already stored.” This identity matching is important in that languages may be most susceptible to interlingual influences in sites of high congruence (Bullock & Gerfen, 2004, p. 95). Lexical items with considerable phonological, semantic, and orthographic similarity across languages (i.e., cognates) are of interest for the role they play in language acquisition, language processing, and language production. Detailed analyses of variable productions of cognate words provide empirical evidence in line with the exemplar model.
For example, Amengual (2012) conducts a study comparing the voice onset time (VOT) of word-initial voiceless /t/ in experimentally elicited data. Spanish /t/ has a relatively short VOT (typically 0–20 ms), while English /t/ commonly has a notably longer VOT (30–120 ms). Amengual (2012) tested whether any interlingual influence would be evident in the VOT of Spanish words among four different Spanish-English bilingual groups in addition to a bilingual control group without English knowledge and use (Spanish ~ Catalan bilinguals). The Spanish-English bilingual speakers were L1 English-L2 Spanish, L1 Spanish-L2 English, bilingual speakers of Spanish as a heritage language, and bilingual speakers of English as a heritage language. The stimuli contained /t-/ initial words with and without semantic and phonological overlap across languages (cognate words [Eng. terrible, Span. terrible]; noncognate words [Eng. keyboard, Span. teclado]). Acoustic analyses of 3,920 cognate and noncognate tokens revealed significant differences in the VOT of Spanish words with an English cognate among the Spanish ~ English bilingual speakers, but no such difference in the control group. Specifically, words with an English cognate were produced by the experimental groups with a significantly longer VOT compared to the noncognate words, suggesting that the sound systems of the bilingual speakers do not operate independently of one another. Amengual (2012, p. 526) observes that the “results support an interaction between the phonological and the lexical levels of representation across the two languages of a bilingual.”
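At its core, the cognate effect reported here is a difference in mean VOT between cognate and noncognate tokens. The sketch below uses invented measurements (not Amengual’s data, which comprised 3,920 tokens analyzed with inferential statistics):

```python
from statistics import mean

# Invented VOT measurements (ms) for Spanish /t/-initial words
# produced by a hypothetical Spanish-English bilingual.
vot_cognate    = [18.0, 24.0, 30.0, 27.0]  # e.g., tokens of Span. "terrible"
vot_noncognate = [12.0, 15.0, 18.0, 14.0]  # e.g., tokens of Span. "teclado"

# A positive difference means cognates drift toward the longer
# English-like VOT range.
diff = mean(vot_cognate) - mean(vot_noncognate)
print(diff)  # 10.0
```

In a real study this difference would be tested with mixed-effects models over many speakers and items; the sketch only illustrates the direction of the predicted effect.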
Enhanced permeability of a bilingual individual’s sound systems through cognate items is not only fine-grained (as in VOT differences); it is also evident in probabilistic patterns of variant realizations of words (e.g., Brown & Amengual, 2015). The study by Gradoville et al. (2021) mentioned in Section 2.4.1 exemplifies this. These authors examine Spanish ~ Portuguese bilingual productions in Rivera (on the Uruguay–Brazil border), where Portuguese phonological influence on the Spanish spoken by residents has been previously noted (e.g., Waltermire & Gradoville, 2020). A notable difference between the two languages is the articulation of intervocalic /d/: in noncontact varieties of Spanish the phone is typically realized as a voiced dental approximant ([ð̞]) or deleted, whereas in monolingual Brazilian Portuguese it is produced as a stop [d] (or as an affricate before the high front vowel [i]). Gradoville et al. (2021) capitalize on this difference in variant distributions to explore whether the nature of the interlingual influence on pronunciation is gradient, with greater influence on phonetic realizations found in cognates than in words sharing fewer lexical connections between languages.
In acknowledging the gradient nature of degree of overlap (and, consequently, the differing magnitudes of predicted interlingual influence), Gradoville et al. (2021) operationalize the cross-linguistic similarity of cognates with four different measurements based upon previous studies (each of which significantly predicted variation in the data). Using variationist methodology, they extract 2,435 instances of intervocalic /d/ (word-medial, word-initial) from the naturalistic speech of 40 bilingual speakers of Riverense Spanish. Acoustic measurements of the intensity ratio between the intervocalic consonant /d/ and the following vowel were taken to explore differences in strength of articulation of the target sound. This study demonstrates that as both the frequency and the cognate similarity with Portuguese increase, the degree of /d/ weakening in Spanish decreases. Frequent cognates highly similar to their Portuguese counterparts affect the production of Spanish /d/ more than words with a lesser degree of phonological overlap. Gradoville et al. (2021, p. 742) note that these findings are illustrative of “how two of the main properties of the usage-based model, namely the effect of usage on linguistic structure and the categorization based on similarity, operate in a situation of bilingualism where the two languages share many cognates.”
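An intensity-ratio measure of lenition can be sketched schematically as follows. The dB values and the exact formula are simplifying assumptions; operationalizations of this measure vary across studies:

```python
# Sketch of an intensity-ratio lenition measure (dB values invented):
# the ratio of the consonant's minimum intensity to the following
# vowel's maximum intensity. Values near 1 indicate a weak,
# approximant-like /d/; lower values indicate a stronger, more
# stop-like articulation.
def intensity_ratio(c_min_db, v_max_db):
    """Consonant-to-vowel intensity ratio for an intervocalic token."""
    return c_min_db / v_max_db

print(round(intensity_ratio(58.0, 62.0), 2))  # 0.94 -> lenited
print(round(intensity_ratio(40.0, 64.0), 2))  # 0.62 -> stop-like
```

A continuous measure of this kind, rather than a categorical stop/approximant coding, is what lets the study detect the gradient shifts in articulation strength discussed above.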
Further evidence for organized storage and coactivation of connected exemplars is found in work on the social indexation of linguistic forms. As previously mentioned, the exemplar makeup of words contains experiential traces of social and stylistic information stemming from the production contexts. As features of a speaker’s identity are activated in interactions, so too are the linguistic forms to which the social information is indexed (Drager & Kirtley, 2016, p. 9). Barnes (2019) examines the effect of such coactivation of socially indexed forms in a language contact situation. In a forced-choice perceptual study of word-final vowel realizations ([o] Spanish-like, [u] Asturian-like), Barnes (2019) demonstrates a significant effect of sociolinguistic priming on vowel classifications: the social characteristics of the speaker activate particular exemplars, which in turn affect the vowel categorizations of the participant listeners. Barnes (2019, p. 35) suggests that, taken with the results of the study as a whole, the “observed behavior supports models of linguistic representation that incorporate socially weighted encoding and that take into consideration the effects that salience and awareness have on exemplar activation.”
As the studies cited above illustrate, usage-based approaches to language generate myriad testable hypotheses that are straightforwardly applied to instances of bilingualism. Utilizing a variety of techniques and data sources, usage-based works elucidate the domain-general cognitive processes that give rise to language structure (Section 2.3), demonstrate the connection between language experience and lexical representation (Section 2.4.1), and explore intra- and interlingual effects of coactivation and similarity (Section 2.4.2). Section 2.5 highlights some potential areas of further inquiry employing the usage-based tenets outlined in this chapter.
2.5 Future Directions
This chapter has provided a sketch of usage-based linguistics generally, and of exemplar model characteristics specifically, in order to highlight ways in which such a framework can be utilized in analyses of bilingual phonetics and phonology. The (albeit incomplete) review of contact and bilingualism studies that incorporate (or explicitly test) usage-based tenets brings to light insights that are obtained through the adoption of such an approach. Because usage-based approaches are relatively “young” in their inception and development, and consequently in their application to contact studies (Backus, 2021), opportunities abound to revisit issues of contact (borrowing, code-switching, convergence) from a usage-based perspective.
For example, as was highlighted in the discussion of the Gradoville et al. (2021) study, the strength and the direction of interlingual effects are modulated by relative frequencies of use (in the separate languages) and by the degree of phonological overlap. Gradoville et al. (2021) note that their findings suggest the cognate effect is not constant, but rather gradient, modulated by both the degree of form-meaning overlap and the relative frequency of forms. In order to better understand this effect, the authors note that “it is important to study different language pairings based on quantity and similarity of cognates to ascertain the extent to which these two properties influence the quantity of crosslinguistic phonetic influence in the situation in question” (Gradoville et al., 2021, p. 742). Such empirical proposals should be taken up.
In addition to examining different language pairings (with different proportions of overlapping cognate forms), schematic ties at greater degrees of abstraction can be analyzed. For example, a type of “cognate status” not yet sufficiently explored in bilinguals (to my knowledge) is the potential for interlingual effects of phonesthemes (e.g., Mompean, Fregier, & Valenzuela, 2020). From a usage-based perspective, Mompean et al. (2020, p. 516) conducted a cross-linguistic analysis of phonesthemes, “or recurrent pairings of phoneme clusters and some element of meaning in sets of words across a language’s vocabulary,” in English, French, Spanish, and Macedonian. Apart from the future work proposed by these researchers to control for the potential effects of historical language contact in the strengthening of existing phonesthemes or the triggering of phonesthemic networks (Mompean et al., 2020, p. 540), future projects could test the notion of gradient strength of associations (form/meaning overlap) between lexical items. Such work could additionally contribute to our understanding of productivity and type frequency effects.
As the structured variability evident in production is seen as theoretically interesting in usage-based models, variationist methodologies are a natural fit (e.g., De Vogelaer & Seiler, 2012). Just as sociolinguistic approaches have been strengthened through the incorporation of exemplar-theoretic models (e.g., Docherty & Foulkes, 2014), studies of bilingual phonetics and phonology have been likewise enriched. Variationists recognize the importance of changes in the frequencies of competing forms and consider seriously the multiple factors (linguistic, extralinguistic) that contribute to variant forms of words. As Poplack (2020) has argued, the outcomes of language contact (within the bilingual individual and the bilingual community) cannot be fully understood without considering the fine details of the inherent variability evident in each of the languages involved. As such, future work that matches variationist methods with usage-based theory stands to generate a broader empirical base from which to test theory (as opposed to traditional introspection or anecdotal reporting) and can yield further insights into bilingual phonetics and phonology.
Of course, it is also possible to explore the nature of bilinguals’ sound systems and potential interlingual influence through experimental design (e.g., Amengual, 2012; Brown, 2018). Studies seeking explanations in patterns of input (experience) and in the general cognitive mechanisms (e.g., Tomasello, 2003) that give rise to linguistic structure can be informed by experimentally elicited data. Researchers commonly control numerous factors of the production context, the language mode, population characteristics, and more. However, the independent effect of lexical items is less commonly considered. An appreciation of the disparity between lexical items (going beyond simple differences in token frequency) is important to account for variation. Words carry with them an accumulation of experiences of use, and these word histories can shape realizations. Even “identical words,” such as homonyms, are shaped by different episodes of use (e.g., Gahl, 2008). The usage-based framework would thus advocate careful consideration of the role of lexical effects in any experimental paradigm.
Usage-based research values a consideration of speakers’ experiences with words and word combinations in specific social contexts to account for linguistic variation and change. Just as this model does not rely on named languages for explanations (e.g., Wasserscheidt, 2021), labels for different types of bilinguals (sequential, simultaneous, early, late, balanced, cyclical, etc.) likely capture different usage patterns and experiences. As such, these population categorizations could be reframed: explanations for phonetic and phonological differences across speaker groups could be sought in the different usage patterns and usage histories particular to each bilingual group.
This chapter’s discussion has privileged studies of language production. Each of the usage-based tenets outlined, nevertheless, is germane to language perception and language processing. Though usage-based theory and methods apply equally to multiple grammatical dimensions, usage patterns, and cognitive processes, future overviews could narrow their scope to look exclusively at production, perception, or processing. Likewise, given the breadth of topics discussed in this chapter, some important topics were necessarily omitted. Future overviews could highlight usage-based approaches to phonetics and phonology in acquisition data (e.g., Ambridge, 2020a, 2020b), in second language and heritage language studies (e.g., López-Beltrán & Carlson, 2020), or in strictly adult synchronic variation.
2.6 Conclusion
This chapter has reviewed several concepts central to the usage-based approach to language and has highlighted the potential explanatory role in bilingual phonetics and phonology of usage, mental representations, form-meaning combinations, and entrenchment (Backus, 2021). The exemplar model of lexical representation (Bybee, 2001) proposes that speakers’ experiences with words (in perception and production) become recorded in memory. In this way, understanding the details of use and usage patterns becomes crucial to explaining features of bilinguals’ sounds and sound systems, the nature of and mechanisms behind interlingual influence, and the emergence of linguistic structure. This model, with its focus on aspects of use and usage patterns, has the potential to inform many separate disciplines (e.g., Quick & Verschik, 2019), as well as to provide new insights into familiar objects of inquiry within language contact studies generally. Likewise, by offering new types of data and language usage, bilingual phonetics and phonology can simultaneously help inform usage-based theory. The works referenced herein exemplify the myriad ways in which the theory and methods derived from usage-based approaches to linguistic theory (e.g., Bybee, 2010) can yield new insight into the complex, adaptive system that is bilinguals’ phonetic and phonological repertoires.
3.1 Introduction
Broadly speaking, sociolinguistics refers to the study of language in its social contexts. More specifically, sociolinguists investigate the ways in which social factors influence language variation and change and how language reflects individual identity in relation to community norms. Sociolinguistics has evolved over time and is usually described as comprising three waves of inquiry (e.g., Bell, 2016; Eckert, 2012; Schilling-Estes, 2013; Tagliamonte, 2015). The first wave, pioneered by William Labov in the early 1970s, considered language variation in the context of broad social factors. Using quantitative analyses and a survey-based approach, macro-social factors such as gender, age, ethnicity, and socioeconomic status were treated as closely tracking speech behavior and as the causes of variation between vernacular and standard varieties. The second wave, pioneered by Milroy (1980) and Eckert (1989), among others, focused on the ways in which variation occurred at the local level through social networks. These studies adopted an ethnographic approach to understand the ways in which factors vested with social meaning by communities of practice influenced language change. Variation was found to be constrained by aspects such as changes in the local economy, the density of social networks across communities, and adherence to social groups and norms.
Most recently, the third wave of sociolinguistics, as exemplified by Coupland (2001), Bucholtz and Hall (2005), and more recent work by Eckert (e.g., Eckert, 2008), centers on the role of style in individual variation and examines how speakers express ideologies, identities, and affiliations through language, drawing on variants associated with other social groups as a means of identity construction and stance-taking. Research by Foulkes and Docherty (2006) also served to make social aspects central to the field of phonetics and phonology. Stylistic use of language can be analyzed to gain insight into bilingual speakers’ agentive use of a variety of linguistic variants to construct their identity. Ways of pronouncing become a means of self-identification, indexing speakers’ affiliations with social groups at the local or individual level.
This chapter analyzes bilingual phonetics and phonology through a sociolinguistic lens. In the following sections, we examine the concept of bilingualism, the acquisition of phonetic and phonological competence, and social factors and sociolinguistic methods used to analyze bilingual speech. We conclude by offering remarks on future areas of research based on the gaps identified in the current literature.
3.2 What Makes a Bilingual Speaker?
In this chapter, we employ the definition of bilingualism described by Silva-Corvalán (2014, p. 1) for Spanish-English acquisition in children, namely, “the use of two or more languages in everyday life” (see also Grosjean, 2010). We refer to the use of three or more languages as multilingualism. The empirical study of bilingualism in both children and adults often relies on the notion of age of acquisition to understand and measure both language proficiency and dominance in bilingual individuals. Two key terms referring to age of acquisition are sequential bilingualism, the acquisition of one language before another, and simultaneous bilingualism, exposure to two languages simultaneously from birth or an early age. Sequential bilinguals can be further divided based on their age of acquisition (i.e., early and late sequential learners). Separating out these three groups of bilinguals, Amengual (2019) finds that early sequential bilinguals produced more Spanish-like realizations of approximants in Spanish than late sequential learners or simultaneous bilinguals, suggesting that the type of bilingualism still affects acoustic productions in adulthood. This conceptualization of bilingualism is particularly amenable to quantification in statistical models, where age of acquisition can be entered as a continuous numerical variable.
In that same vein, the notion of dominance has raised questions regarding what it means to be bilingual and what the competencies of bilingual speakers are in their respective languages. To start, dominance can be framed in terms of both language use and proficiency: speakers who use, and are more proficient in, one language over another are dominant in the former (Bell, 2013; Li, 2000; Silva-Corvalán, 2014). Silva-Corvalán (2014, p. 19) reports that an individual’s use of a language reflects their knowledge of that language. Grosjean (2010) supports this assertion, stating that both use and proficiency are relevant to understanding bilingualism at the individual level.
Importantly, dominance does not imply that varying competencies negate the notion of bilingualism. Speakers are still considered bilingual even if they vary in language dominance and proficiency (Bell, 2013). Some speakers may be referred to as passive bilinguals: while they do not speak one of their languages fluently, they are still able to understand it. These speakers are crucially still considered bilingual, despite the disparity in competence across their languages. Grosjean (2010, p. xv) states that it is a myth to consider that bilinguals must be fluent in both of their languages to be deemed bilingual. Zentella (1997), for example, rejects this myth in her study of a group of bilingual children in New York, finding variable tendencies of use among the speakers. Neither age of acquisition nor greater competency in one language than the other makes a bilingual speaker less bilingual or an “imperfect” bilingual. Rather, Grosjean (2010) argues that the use of one language over the other is often contextual: bilingual speakers may prefer one language over another in certain scenarios.
This argument is particularly relevant in the context of heritage speakers. Heritage speakers have had contact with a language, often at home, other than the dominant one(s) in their social setting (Valdés, 2000; Valdés & Geoffrion-Vinci, 2011). This can provide them with a degree of comprehension and production competence in the heritage language, although that proficiency is subject to extreme variation (Potowski, 2005). Valdés and Geoffrion-Vinci (2011, p. 599) provide criteria that represent heritage speakers in the context of the United States:
1. Speakers have ties to a language not dominant in their social setting, often through family or culture;
2. Speakers understand and communicate in these languages to some degree;
3. Speakers vary in their adherence to and participation in the minority group.
Heritage speakers of Spanish in the United States can be children of immigrants who arrived at a young age or in adolescence, or they may have been born in the United States as second- or third-generation Americans (Díaz-Campos, 2013; Potowski, 2005). Importantly, heritage speakers vary in spoken proficiency and comprehension. Recent studies of heritage speakers of several languages provide important sociophonetic insight into these understudied communities. Hoffman and Walker (2010) find that descendants of Italian and Chinese immigrants in Toronto employ certain heritage language features in English to indicate group identity. Hirson and Sohail (2007) report that Punjabi immigrants in London use coda rhotics that map onto the degree to which they identify with their geographic and ethnic background: speakers more likely to self-identify as British drop rhotics in post-vocalic position, whereas those who continue to identify as Asian tend to maintain the rhotic. Finally, Sharma (2018) finds that, in the case of an Indian-American television presenter, the use of US English variants recedes in favor of Indian-American phonetic features when the speaker wants to set aside his on-air persona, revealing his “real me” through variants with reduced rhoticity, voice onset time (VOT), diphthongization, and l-velarization. In all of these studies, we see that a sociophonetic perspective not only provides insight into the speech systems of bilinguals, heritage speakers, and language learners but also affords a better understanding of the social attitudes and factors that influence speech.
The environment in which speakers acquire their languages is also relevant to the consideration of bilingualism as a sociolinguistic variable. According to Bell (2013), primary bilinguals are speakers who learned both languages naturally, in a purely social context. In contrast, secondary bilinguals are speakers who acquired their second language in a tutored setting, often later in life. While both types of speakers can achieve fluency, the sociolinguistic competence that comes with naturally acquiring a language is more developed in primary bilinguals, owing to their regular contact with a speech community.
A variety of social factors associated with bilingualism have been examined in the context of language variation and change. Grosjean (2010, p. 172) contextualizes aspects of the acquisition environment (e.g., input amount, community role, attitudes), along with the age and language use of the speaker, as fundamental aspects in the study of bilingualism, finding that these factors affect the degree to which language is learned, cognitively processed, and stored.
When considering bilingualism through a sociolinguistic lens, it is important to remember that bilingualism itself is a social phenomenon. The notion of a bilingual society gives context to the importance of language use to express identity, assert sociopolitical power, or interact with peers in the community. The push toward monolingual societies is relatively recent in human history, stemming from the creation of the Western nation-state (Bell, 2013; Zentella, 1995), which assumes that a unified nation should speak one language. However, bilingualism and multilingualism are the norm in societies around the globe, with some estimates showing that more than 60 percent of the world’s population is multilingual (e.g., Ansaldo et al., 2008; de Bot, 1992; Romaine, 1994; Silva-Corvalán, 2014).
Bell (2013) suggests that large-scale bilingualism can result from sociopolitical constructs such as immigration or the imposition of new borders around a speaker group. Societal conceptualizations of bilingualism may also be described using the term diglossia (Ferguson, 1959), which refers to the phenomenon in which each language is utilized only in certain contexts within a society. The use of Kichwa in Andean Ecuador exemplifies a diglossic situation: some speakers use Spanish in the workplace and in formal situations, while they reserve Kichwa for familial and community-oriented contexts (Cole, 2022). Diglossia is not always so cut and dried, though; languages are not always neatly organized into specific contexts of use.
In some societies, the ability to communicate in more than one language is seen as a marker of prestige or viewed as preferable to being monolingual. In contrast, the prejudice that bilingualism is detrimental to brain development unfortunately still pervades Western discourse (Bialystok, Craik, & Luk, 2012). The idea that bilingualism in children will cause “confusion” between languages, or lags in brain development or linguistic capabilities, has been repeatedly disproven. Grosjean (2010) elaborates that the view of bilingualism as a hindrance or negative trait is related to socioeconomic status: monolingual speakers may view bilinguals of high socioeconomic status positively, while bilinguals who are immigrants or minorities may be seen as speaking imperfectly or experiencing difficulty in academic and even general communication. Despite these views, studies show that bilingualism is advantageous over monolingualism in both the development of cognitive ability and the expression of identity (Bell, 2013; Bullock & Toribio, 2009). Lindholm-Leary (2020) references numerous studies of dual language programs in elementary education to argue that students with regular instruction in two languages tend to reach levels of achievement that match or exceed those of their monolingually educated peers.
Speakers’ attitudes toward their languages can inform the ways in which bilingualism and identity intersect. From a sociolinguistic perspective, language and identity are intrinsically linked: speakers’ use of variants indexes affiliation with certain groups related to gender, age, ideals, and values. For example, Berrebi, Bassel, and Gafter (2022) investigate how the phonemic distinction between /ħ/ and /x/ is preserved among Mizrahi Jews and lost among Ashkenazi Jews. This distinction has social meaning for the Mizrahi group, as it allows speakers to linguistically differentiate themselves from the European-descended Ashkenazi group and others in Israel. Recently, sociolinguistic variables have also been examined using the notion of indexical fields to determine the sociolinguistic values that communities tend to assign to different variants (Delgado-Díaz, Galarza, & Díaz-Campos, 2021; Pollock, Willis, & Díaz-Campos, 2023).
3.3 Language Dominance
The notion of language dominance has been found to be particularly relevant to understanding how one language may affect another from a cognitive point of view (see Chapter 29, this volume; Amengual, 2019; Amengual & Chamorro, 2015). Insight into language dominance is often gained through language background questionnaires, which aim to capture how speakers acquire and use their languages, how proficient they perceive themselves to be, and relevant demographic factors.
The Language Contact Profile (LCP), designed by Freed et al. (2004), was one of the first standardized surveys used in second language acquisition (SLA) research, and it includes a variety of usage factors (i.e., language use with certain interlocutors and across activities, time spent using the language, etc.). The LCP is regularly used in the study abroad literature to determine whether more contact hours with local native speakers increase learners’ production of language-specific or dialect-specific features.
Another example is the Language Background Questionnaire (LBQ) used by Kim (2011) in a study of Spanish heritage speaker VOT, which operationalized dominance using three factors: age of acquisition of the L2, frequency of use of the L2, and language proficiency. Speakers with exposure to both Spanish and English before age seven who used less Spanish and scored lower in Spanish proficiency than in English were considered “English-dominant.” Kim found that English-dominant heritage speakers produced stop consonants more similar to those of monolingual English speakers than to those of monolingual Spanish speakers, leading her to conclude that the strength and the direction of interference depend on a bilingual speaker’s dominant language.
In the analysis of bilingual phonology, Simonet (2010) uses an LBQ to assess the role of language dominance in lateral production among Spanish-Catalan bilinguals. In his investigation, he found that both groups of speakers produced laterals in their L2 comparable to those of their L1. Simonet (2010) takes the fact that speakers do not produce monolingual variants as a sign that they acquire a new phonetic category for the L2, one they keep separate from the L1. The LBQ asked participants to identify their native language, the language used most in childhood, the language they currently use most frequently, and the degree to which they used Catalan and Spanish in various family and work contexts. This background questionnaire illustrates three important aspects of gauging dominance: (1) the language that speakers deem to be their native language, which indicates proficiency without a test of language ability; (2) childhood languages, which provide information on the age of acquisition of each language; and (3) usage rates and domains, which compare language use and reveal the contexts and spaces in which each language is used in the community.
These three aspects of gauging language dominance are manifested in one of the key background questionnaires used in sociophonetic analyses of bilingual language, the Bilingual Language Profile (BLP), developed by Birdsong, Gertken, and Amengual (2012). The BLP includes both sociodemographic factors and questions about language dominance. The authors conceptualize language dominance as a continuous, gradient factor that can change over the course of an individual’s life. The BLP collects information on sociodemographic factors, language history, language use, language proficiency, and language attitudes. Questions include seven-point Likert scales and ratings to help participants describe how they use their languages. The final result is a dominance score ranging from -218 to 218, with 0 representing balanced bilingualism and a negative or positive score indicating dominance of one language over the other. The scoring criteria of the BLP are particularly useful for sociolinguistic considerations of bilingualism because they quantify results, allowing for statistical modeling that is compatible with typical analyses in the field, as sociolinguistic analyses often rely on both descriptive and inferential statistics to illustrate trends and make predictions about language use.
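The BLP’s arithmetic can be sketched in a few lines of Python. This is a hypothetical illustration, not the official scoring script; the module weightings and raw maxima below reflect my understanding of the published instrument and should be verified against Birdsong, Gertken, and Amengual (2012) before reuse.

```python
# Hypothetical sketch of the BLP dominance arithmetic. The weighting
# factors (assumed here) equalize the four modules so that each
# contributes at most ~54.5 points per language (global maximum ~218).
WEIGHTS = {"history": 0.454, "use": 1.09, "proficiency": 2.27, "attitudes": 2.27}

def global_score(modules: dict) -> float:
    """Weighted sum of one language's raw module totals (max ~218)."""
    return sum(WEIGHTS[name] * raw for name, raw in modules.items())

def dominance_score(lang_a: dict, lang_b: dict) -> float:
    """Positive = dominant in language A; negative = language B; 0 = balanced."""
    return global_score(lang_a) - global_score(lang_b)

# Assumed raw maxima per module: history 120, use 50, proficiency 24, attitudes 24.
lang_a = {"history": 120, "use": 50, "proficiency": 24, "attitudes": 24}
lang_b = {"history": 0, "use": 0, "proficiency": 0, "attitudes": 0}
score = dominance_score(lang_a, lang_b)  # ≈ 218: extreme dominance in language A
```

Because the output is a single signed number, it slots directly into regression models as a continuous predictor, which is precisely what makes the BLP attractive for the inferential statistics common in sociolinguistic work.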
3.3.1 Language Dominance at the Segmental Level
Dominance can be tracked across various levels of linguistic structure. As one example at the segmental level, McKinnon (2020) considers contact-induced changes in the Spanish stop system in a bilingual Kaqchikel-Spanish community in Guatemala. Participants completed a modified oral BLP, using an adapted Likert scale format, that accommodated literacy rates in the community. McKinnon (2020) found that voiced and voiceless stops were treated differently within this community. He argued that this stemmed from the fact that voiced stops are not part of the Kaqchikel inventory, meaning that phonemic category creation must occur. While category creation occurred for most speakers, female participants, as well as those with greater Kaqchikel proficiency and more integration in the community, were found to produce more constricted variants of the voiced stops /b d g/. Similar to McKinnon (2020), Michnowicz and Carpenter (2013) found an effect of language and dialect contact on VOT patterns in the Yucatan area.
Utilizing a similar background questionnaire to measure language dominance and proficiency, Cole (2022) examines use of the palatal lateral approximant in Cañar, Ecuador. She adapted the BLP to include open-ended questions suited to interview participants. Rather than having participants rate their percentage use of Kichwa/Spanish, she asked them to describe how they used their language(s) in daily life, who they spoke to, and how frequently they used their language(s) in general. Cole (2022) then used thematic analysis to identify patterns and themes in participant responses, which were later organized and coded post hoc. For example, participants who mentioned using only Kichwa in daily life received a score of 6/6, whereas those who mentioned using both languages equally received a score of 3/6. A similar scale was used to quantify responses for proficiency. Both qualitative and quantitative analyses found language dominance to be key to understanding community use of the regional variant: speakers who used more Kichwa and self-identified as having strong Kichwa abilities were more likely to use the palatal lateral approximant, while speakers who used more Spanish in daily life used this variant less. These findings suggest a connection between language dominance and use.
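The post hoc quantification described here can be sketched as a simple mapping from thematic codes to scale points. The theme labels and intermediate values below are illustrative assumptions; only the endpoints mentioned above (6/6 for Kichwa-only use, 3/6 for balanced use) come from the description of Cole (2022).

```python
# Illustrative mapping of thematic codes to a 0-6 daily-use dominance scale.
# Only "only_kichwa" = 6 and "both_equally" = 3 are taken from the text;
# the remaining labels and values are hypothetical placeholders.
THEME_TO_SCORE = {
    "only_kichwa": 6,
    "mostly_kichwa": 5,
    "more_kichwa": 4,
    "both_equally": 3,
    "more_spanish": 2,
    "mostly_spanish": 1,
    "only_spanish": 0,
}

def code_response(theme: str) -> int:
    """Return the dominance score for a thematically coded interview response."""
    return THEME_TO_SCORE[theme]
```

The point of such a scheme is that open-ended interview data, once coded, become quantifiable on the same footing as Likert-scale responses and can enter the same statistical models.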
Another segmental study that uses the BLP to measure language dominance comes from Mohamed, González, and Muntendam (2019), who study glottal stop epenthesis in Spanish-Arabic bilingual speech in Puerto Rico. They argue that language dominance is a significant factor in glottalization rates, with Arabic-dominant speakers exhibiting higher rates of glottalization than Spanish-dominant participants. In fact, Spanish-dominant speakers did not differ significantly from monolingual speakers in their rates of glottalization, illustrating that language dominance, rather than the number of languages spoken, is the relevant factor conditioning glottalization rates.
In order to account for the interaction between bilingual speakers’ two sound systems, Guión (2003) developed the Single-System Hypothesis (SSH). The SSH holds that bilingual speakers cannot fully separate their phonetic systems, causing each to influence the other. Guión (2003) sees this as related to language use: increased use of one language results in greater activation of that language’s sound system, leading to increased influence on the other language system. In a study of the perception of foreign accent in Kichwa speakers, Guión (2003) finds that frequent use of Kichwa resulted in speakers’ Spanish being perceived as more accented than that of their monolingual Spanish-speaking counterparts.
The previous studies examined the role of language dominance in the description of bilingual segmental phenomena. They employed adapted versions of the BLP in order to understand the background of participants and the role that individual factors play in explaining bilingual sociolinguistic variation. By adapting the BLP, these researchers have shown that it is feasible to utilize equitable and accessible materials that fit the needs of the research while also addressing the socioeconomic and educational needs of the community in question. Overall, we see that language dominance is an important factor in the description of bilingual phonological systems at the segmental level.
3.3.2 Language Dominance at the Suprasegmental Level
Sociolinguistic literature has shown that language dominance serves as a conditioning factor in bilingual speech at the suprasegmental level (e.g., in intonation and prosody) as well as at the segmental level. For example, Baird (2015) studies Spanish in contact with K’ichee’ in Guatemala, focusing on relative peak alignment in Spanish intonation systems in three speaker groups: monolingual Spanish speakers from Quetzaltenango, bilingual K’ichee’-Spanish speakers from Cantel, and bilingual speakers from Nahualá. The BLP was used to help show that language dominance was a significant factor in understanding early peak alignment in bilingual speech. Significant differences in overall level of dominance in K’ichee’ corresponded to differences in peak alignment in the Spanish spoken by these speakers. Baird (2015) asserts that this finding is not related to language-internal differences between the two regional varieties of K’ichee’, as the peak alignment tendencies are the same in both. The BLP scores of the speakers of these two varieties, coupled with tendencies in peak alignment, suggest that overall dominance in K’ichee’ was a contributing factor to the significantly earlier peak alignments found in Nahualá Spanish. These findings illustrate the crucial role that language dominance often plays in understanding bilingual phonetic variation and its relevance in describing individual variation. Baird (2015) argues that describing individual-level patterns is fruitful for understanding language variation in bilingual populations.
In another study using the BLP to examine suprasegmental phenomena, Henriksen and Fafulas (2017) study the effects of language contact between Yagua and Spanish on prosodic timing in the Peruvian Amazon. These researchers support BLP responses with qualitative data from interviews that focused on the speakers’ age of acquisition of Yagua, self-reported proficiency, and educational background as it related to their language use, as well as information regarding the contexts in which speakers used their languages. Age or sequence of acquisition was a primary factor in conceptualizing dominance in this study. However, in addition to speaker age, speakers were deemed Spanish- or Yagua-dominant depending on the language they spoke more in daily life and toward which they held the most positive attitudes. Henriksen and Fafulas (2017) suggest that while Yagua-dominant speakers employ prosodic timing in Spanish comparable to Yagua, Spanish-dominant speakers’ timing is more Spanish-like. In this way, dominance, which encompasses age of acquisition, usage patterns, proficiency, and language attitudes, plays an influential role in determining intonational output.
In an analysis of the autosegmental systems of Cuzco and Lima, Peru, O’Rourke (2012) examines speakers with bilingual Spanish-Quechua and Spanish-only backgrounds. She measures the focus and peak alignment of utterances to determine common patterns found in the data. Most of the monolingual speakers from Lima and only some of the bilinguals from Cuzco employed features that expressed contrastive focus in ways similar to the patterns described for other varieties of Spanish. On the other hand, speakers with greater Quechua contact had access to a larger repertoire of tools to express focus across both languages and did not rely as heavily on peak alignment (e.g., native speakers of Quechua had low peaks on sentence subjects, or did not show height distinctions between contrastive and broad focus).
These studies show the effect of language contact conditions and specifically the role of language dominance in describing effects on suprasegmental patterns in bilingual speech. Although more research has focused on these topics in recent years, many questions about intonation remain, and suprasegmental phenomena continue to be a fruitful topic for further study.
3.4 Age of Acquisition
Age of acquisition has been shown to be relevant in the transfer of phonetic material from one language to another in the speaker’s cognitive space. Sociophonetic studies that focus on bilingual individuals illustrate the importance of the age at which a speaker acquires each language they control, as well as methods for measuring and operationalizing age and sequence of language acquisition. While we have divided this section into subsections, an exhaustive review of age of acquisition is beyond the scope of this chapter. We discuss recent and relevant examples.
3.4.1 Age of Acquisition at the Segmental Level
A considerable amount of research has focused on segmental phenomena, emphasizing the importance of the age at which speakers acquire language in predicting variation. In a study on the vowel spaces of Peruvian Quechua-Spanish bilinguals in Lima and Cuzco, O’Rourke (2010) operationalizes age of acquisition as an ordinal value in statistical analysis. Speakers were described as belonging to one of three groups: monolinguals from Lima and Cuzco, native bilinguals from Cuzco who acquired both languages simultaneously at home, and Cuzco L2 Spanish speakers who acquired Spanish in school. O’Rourke (2010) finds that age of acquisition is significant in accounting for variability in vowel productions, alongside regional differences between Cuzco and Lima. The Cuzco native bilinguals varied the most in their productions of vowels, with /e/ vowels similar to those of the L2 Spanish speakers, and /a/ and /u/ vowels similar to those of the Cuzco monolingual speakers. The L2 Spanish speakers differed from both the Cuzco bilinguals and the monolinguals in their /o/ and /u/ vowels, which were more backed than the vowels of the other groups. These findings provide important insight into the vowel space of bilingual Spanish-Quechua speakers, and illustrate how age of acquisition is instrumental in determining the acceptable variation in speakers’ vowel categories, an area where few studies have been conducted.
In her work on Spanish in contact with Kichwa in Cañar, Ecuador, Cole (2022) uses both statistical modeling and qualitative analysis to study the palatal lateral approximant [ʎ], a variant attributed to Spanish-Kichwa contact and isolated Andean communities. Cole (2022) found that age was an important factor predicting variation in Cañar: early bilinguals (those who acquired Kichwa before age seven) used more palatal lateral approximants than monolinguals or those who acquired Kichwa later in life as an L2. These findings suggest a possible connection between the acquisition of the Kichwa palatal lateral approximant early in life and its increased use by early bilingual speakers in this community.
In a Scandinavian context, Strandberg, Gooskens, and Schüppert (2021) examine phonetic transfer in the production of the Finnish and Swedish long mid front vowels [øː] and [œː] in a group of simultaneous bilingual Finnish and Swedish speakers in Finland. While Swedish uses both of these allophones for /ø/, Finnish uses only [ø]. Performing an acoustic analysis that included social variables related to speech register, language background, region, and age, the authors collected data from 115 speakers. Regarding age, they found that younger speakers used higher F2 values when producing [œː] compared to the two older groups. Strandberg et al. (2021) argue that this may represent a change in apparent time with regard to the height and fronting of the two allophones. Changes in height occurred only in the most spontaneous speech style, whereas changes in fronting occurred across all three styles examined, suggesting that younger speakers, who have early and more recent acquisitional exposure to both languages, are producing the allophonic distinction.
3.4.2 Age of Acquisition at the Suprasegmental Level
While there has not been as much focus on suprasegmental phenomena, research does indicate that age of acquisition plays a similarly important role in describing variation at this level. In a study on contact between Yagua and Spanish, Henriksen and Fafulas (2017) employ both the BLP and qualitative questions to analyze prosodic patterns across two groups of speakers based on language use. The Yagua-dominant group had acquired Yagua as an L1, used it in daily life, and learned Spanish later in life; meanwhile, the Spanish-dominant speakers learned Yagua and Spanish simultaneously at home, used Spanish more in daily life, and reserved Yagua for family interactions. These authors find a direct relation between dominant language, age, and the order in which speakers acquired their languages. Spanish-dominant bilinguals who acquired Yagua and Spanish simultaneously exhibited prosodic timing similar to Spanish norms. In contrast, Yagua-dominant bilinguals who acquired Yagua as an L1 had prosodic timing in Spanish more similar to Yagua. These findings support the idea that the age at which speakers acquire their first and second languages plays a role in the extent to which they may transfer phonetic patterns from their L1 to their L2, a claim that is supported by models such as Flege’s (1995) Speech Learning Model, which emphasizes the importance of speaker age in the acquisition of phonetic boundaries.
Age of acquisition also plays an important role in studies of immigrant populations, with distinctions made between recent arrivals and long-term residents in areas of contact. Erker and Bruso (2017) examine filled pauses produced by twenty-four Spanish speakers in the Boston, Massachusetts, area of the United States. Speakers with more English exposure and proficiency differed in their production of filled pauses in Spanish compared to more monolingual-like speakers. Increased English contact was found to favor centralization of vowels, causing [e] productions to become more schwa-like, which the authors describe as a contact-induced change. A salient pattern was that recent arrivals in the city favored the use of [e] over [a], but they also added a new category to their phonetic inventory, the schwa [ə], suggesting a convergence in filled pause behavior due to increased language contact. These representative examples show the importance of age of acquisition in the study of both the segmental and suprasegmental phonology of bilingual populations.
3.5 Language Attitudes and Perception
The study of attitudes is an important aspect of sociolinguistic research, as it provides insights into community and individual identities. Methodological approaches to the study of attitudes include interviews, survey data (e.g., Masgoret & Gardner, 2003), and indirect techniques (e.g., the matched guise proposed by Lambert et al., 1960), as well as perceptual tasks of a diverse nature.
Relevant to bilingual communities, the investigation by Lambert et al. (1960) was groundbreaking in describing attitudes toward French and English in Canada. A group of sixty-four English speakers and sixty-six French speakers assessed stimuli using attributes such as intelligent, honest, friendly, and attractive. The findings show that attributes related to individual competence (e.g., intelligence) were associated with English. Furthermore, community stereotypes were prevalent not only in the English-speaking group but also in the French-speaking group. In a methodologically similar study conducted in Los Angeles, in the US, to assess attitudes toward Chicano English, Arthur, Farrar, and Bradford (1974) also show that participants negatively assessed Chicano English stimuli.
Recent research in bilingual communities has provided a more comprehensive perspective on attitudes toward features of pronunciation associated with local groups. A study by Barnes (2016) examines /-o/ and /-u/ alternation in samples from twenty-four bilingual speakers of Asturian and Spanish. The findings of the investigation show that the Asturian variant (e.g., [u]) is favored by male speakers of lower educational levels and lower-prestige occupations. While it was hypothesized that knowledge of Asturian would have a positive effect on the use of the local variant, this variable was not significant. Barnes (2016) attributes this result to the stable use of local sociolinguistic variants in Asturian Spanish.
In a study examining the attitudes of monolingual Madrid Spanish speakers and bilingual Barcelona Spanish-Catalan speakers, Davidson (2019) focuses on two phonetic features associated with Catalonian Spanish: lateral velarization and intervocalic /s/ voicing. Using a matched-guise task, Davidson (2019) had fifty-four participants assess stimuli containing the two variants. His investigation found that explicit negative associations were more common with velarized /l/ than with /s/ voicing, and that monolinguals saw velarized /l/ as reflecting Catalonian identity. Nevertheless, the study also determined that these local features were assessed positively within the Catalan-Spanish bilingual community, reflecting an ideology that values bilingualism and language contact.
In a study of language identification, Lipski (2019) determines that Palenquero-Spanish phonotactics serve to facilitate bilingual speakers’ word recognition. The findings, which come from three experimental instruments carried out with fifty-three Palenquero-Spanish bilinguals, indicate age-related differences among the speakers. As a result of revitalization efforts, younger speakers show increased proficiency in Palenquero and do not view the language as stigmatized in the way that speakers from the previous generation did. This social situation is shown to play a role in perceptual differences between the two groups. On the one hand, younger speakers used lexical items to aid them in identification, with words showing phonological correspondence helping most notably. Older speakers, on the other hand, relied on grammatical structure to carry out identification, possibly as a result of the perceived stigma that the spoken language carried.
Working with seventy-five US-born heritage speakers from San Antonio, Texas, Chappell (2019a) focuses on the social perception of orthographic <v> produced as bilabial or labiodental allophones. Heritage speakers perceived both allophonic variants as socially meaningful: women who produced the labiodental [v] were perceived as hardworking, older, intelligent, and competent in Spanish, while men who produced this variant received the opposite classification. Men using [v] were seen as younger, less hardworking, less intelligent, and less Hispanic. Late immigrants received higher evaluations in the areas of intelligence and “hispanicity” than heritage speakers overall. Both monolingual and heritage speakers were found to share the same sociolinguistic meanings for these allophonic variants, as well as the same expectations regarding “standard” language and hyperarticulated speech with [v]. Overall, this study suggests that variants associated with language contact can have a valence associated with gender identity, and that heritage speakers share a set of language values that reflect community practices.
This section served as a representative look at some attitudes associated with bilingualism. More thorough descriptions of sociophonetic perception, including in contact situations, are discussed in the volumes by Chappell (2019b) and Díaz-Campos (2014).
3.6 Quantitative Methods
A crucial component of many sociolinguistic studies on bilingual phonetics and phonology is the use of descriptive and inferential statistics, which illustrate trends across quantitative data through the use of cross-tabulations, graphs, and charts. Cross-tabulations can provide information on overall tendencies found in data sets, allowing for quick comparison between data points. Graphs and charts are useful to help readers understand general tendencies across data.
In contrast to descriptive statistics, inferential statistics make predictions and illustrate interactions between variables in a data set. These tests are common in studies on sociophonetics and vary widely in both complexity and power. The ANOVA (Analysis of Variance) is a statistical test of whether a significant difference in means exists across three or more independent groups. One limitation of this test is that a significant result indicates only that the group means differ somewhere; it does not identify which group’s mean differs from the others. To address this problem, post hoc tests such as the Tukey test are used to determine which pairs of group means differ significantly.
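The logic of the ANOVA can be illustrated with a short computation. The following is a minimal sketch in Python using invented scores for three hypothetical speaker groups, not data from any study discussed here:

```python
def one_way_anova_f(groups):
    """Compute the F statistic for a one-way ANOVA.

    F = (between-group mean square) / (within-group mean square):
    a large F suggests at least one group mean differs, but a
    post hoc test (e.g., Tukey) is still needed to say which.
    """
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total observations
    grand_mean = sum(sum(g) for g in groups) / n

    # Between-group sum of squares (k - 1 degrees of freedom)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares (n - k degrees of freedom)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Invented relative peak alignment scores for three hypothetical groups
monolingual = [0.52, 0.55, 0.50, 0.53]
bilingual_a = [0.54, 0.51, 0.56, 0.53]
bilingual_b = [0.71, 0.68, 0.73, 0.70]

f_stat = one_way_anova_f([monolingual, bilingual_a, bilingual_b])
```

Because the F statistic signals only that some group mean differs, a post hoc procedure would still be needed to identify which pairwise comparisons drive the result.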
One example of multiple ANOVA usage in sociolinguistics comes from Baird (2015), who studies the effect of speaker dialect on peak alignment tendencies among bilingual and monolingual speakers. The results of this statistical test indicate that speaker dialect significantly affected relative peak alignment scores in each speaker group, while a post hoc Tukey test showed that bilingual speakers from Nahualá had significantly different relative peak alignment scores from the other groups. Baird (2015) also conducted ANOVAs for each dialect, setting the speaker as the independent variable, and found with a Tukey post hoc test that individual differences played a significant role in relative peak alignment scores in both bilingual and monolingual speaker groups.
Sociolinguistic studies often favor regression analyses, as multiple observations from the same speaker are a key component of sociolinguistic data, which violates the assumption of independent samples underlying ANOVAs and makes them unsuitable. Rather than merely describing the relationship between group means, regression analyses allow researchers to make predictions about the effect of one or more independent variables on a dependent variable. The two most frequent types of regression analysis used in sociolinguistic studies are logistic and linear regressions. Logistic regressions allow for categorical dependent variables, while linear regressions require continuous ones. For either type of analysis, independent variables can be either categorical or continuous, as discussed by Díaz-Campos and Dickinson (2019).
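For a continuous dependent variable, the core of a linear regression with a single predictor can be computed in closed form. The sketch below is purely illustrative: the speaker ages and the continuous acoustic measure are invented, not drawn from any study cited here.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for one continuous predictor.

    Returns (intercept, slope) for the line y = intercept + slope * x,
    minimizing the squared prediction errors.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

# Invented data: speaker ages paired with a made-up continuous
# acoustic measure (the kind of dependent variable a linear
# regression requires; a categorical variant choice would call
# for logistic regression instead).
ages = [20, 30, 40, 50]
measure = [10, 14, 18, 22]
intercept, slope = fit_linear(ages, measure)
# slope is 0.4: the measure is predicted to rise 0.4 units per year
```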
One example of a mixed-effects logistic regression in a sociophonetic analysis comes from Cole (2022), who builds such a model with the Rbrul software (Johnson, 2009), developed for R (R Foundation, 2022), to better understand the effects of social factors on the use of the palatal lateral approximant [ʎ] in a Kichwa-Spanish bilingual community in Cañar, Ecuador. In a mixed-effects regression, variables can be treated as random effects, meaning that the model controls for variation across multiple categories, such as an individual speaker producing multiple tokens, or multiple instances of a lexical item. In these data, the continuous independent variable of participant age was found to significantly condition the use of the palatal lateral approximant (Table 3.1).
Table 3.1 Mixed-effects logistic regression for age of speakers in Cañar, Ecuador
| Variable | Log-odds | p-value | df | N | R2 |
|---|---|---|---|---|---|
| Age | 0.075 | 0.0001 | 3 | 557 | 0.229 |
A mixed-effects model produces several types of statistical output that allow for interpretation of the relationship between independent and dependent variables (Johnson, 2009). First, log-odds measure the strength and direction of the relationship between an independent variable and the dependent variable; a positive log-odds value indicates that a factor favors the application value of the dependent variable. Thus, in the case of Cañar, as age increases, so does the use of [ʎ]. Second, a p-value below 0.05 indicates that the relationship between the dependent and the independent variable (i.e., the increase in [ʎ] usage as a function of age) is unlikely to be due to chance alone. Third, the degrees of freedom refer to the number of parameters considered in the model. Fourth, N refers to the number of tokens analyzed in the model. Finally, the R2 value refers to the extent to which the regression equation accounts for the variance in the data. This statistic helps to determine whether the model can be improved by including more variables. The inclusion of additional variables in a mixed-effects model should be performed with caution, as “overfitting” a model does not result in a clearer picture of the effect of predictors on the dependent variable.
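The meaning of a log-odds coefficient can be made concrete by converting the linear predictor to a probability with the inverse-logit (logistic) function. The sketch below uses the age coefficient from Table 3.1, but the intercept is invented for illustration, since the table does not report one:

```python
import math

def prob_from_log_odds(intercept, coef, x):
    """Convert a linear predictor (in log-odds) into a probability
    via the inverse-logit function: p = 1 / (1 + exp(-log_odds))."""
    log_odds = intercept + coef * x
    return 1 / (1 + math.exp(-log_odds))

AGE_COEF = 0.075   # log-odds per year of age (Table 3.1)
INTERCEPT = -3.0   # hypothetical value; not reported in the table

p_age_20 = prob_from_log_odds(INTERCEPT, AGE_COEF, 20)
p_age_60 = prob_from_log_odds(INTERCEPT, AGE_COEF, 60)
# Because the coefficient is positive, the predicted probability
# of [ʎ] rises with age: p_age_60 > p_age_20.
```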
Another means of representing data is through conditional inference trees (CIT), which illustrate the relationship and the ordering of independent variables that have an effect on a dependent variable. For example, in Figure 3.1, McKinnon (2020) utilizes a CIT to show factors that influence VOT values in Spanish-Kaqchikel contact environments in Guatemala.

Figure 3.1 CIT illustrating social variation in the VOT of the voiced stop /g/.
Using the CIT, McKinnon describes the factors that influence VOT values for the phoneme /g/: speaker sex, age, BLP proficiency, BLP language history, and social group integration. Among male speakers, VOT differed between younger speakers and older speakers with greater or reduced Spanish proficiency. Female speakers’ VOT was influenced by language acquisition history, as well as by integration in their social group. Based on these splits, the CIT provides a comparison of VOT rates for the distinct groups within the community.
Statistical models in sociolinguistic and specifically sociophonetic analyses must consider various concurrent factors. This raises the problem of independence of observations: some analyses assume that each token comes from a different speaker, when in fact several or many tokens are often produced by the same speaker. Mixed-effects models make it possible to control for individual speakers and lexical items, yielding a more accurate analysis of the significant factors and better predicting the influence of factors on the application value selected for a particular dependent variable.
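The independence problem can be shown with a toy calculation using invented token counts: pooling all tokens lets one prolific speaker dominate the estimate, whereas summarizing within speakers first, in the spirit of a by-speaker random effect, weights each individual equally:

```python
# Hypothetical tokens of a variant per speaker: (uses, total tokens)
speakers = {
    "s1": (2, 10),    # 20% use of the variant
    "s2": (3, 10),    # 30% use
    "s3": (90, 100),  # 90% use, and far more tokens than the others
}

# Naive pooled rate: treats all 120 tokens as independent
# observations, so speaker s3 dominates the estimate.
pooled = (sum(u for u, t in speakers.values())
          / sum(t for _, t in speakers.values()))

# Speaker-level rate: average of per-speaker proportions,
# weighting each individual equally.
by_speaker = sum(u / t for u, t in speakers.values()) / len(speakers)

# pooled is much higher than by_speaker, purely because one
# speaker contributed most of the tokens.
```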
3.7 Final Remarks
This chapter provided a brief summary of bilingualism and how it is treated from a sociophonetic and sociolinguistic perspective. Some of the main social factors (e.g., age and language dominance) and theoretical ideas (e.g., types of bilingual speakers) under this approach were discussed and exemplified with current research.
Numerous questions remain to be answered. Many parts of the world experience bilingualism and multilingualism, yet most research is dedicated to major (often European) languages such as English, Spanish, and French. Future research will benefit from the sociolinguistic examination of other language contact areas, as well as sites of contact between these major languages and understudied ones. In the case of Spanish, various instances of contact exist with Indigenous languages (e.g., Cuicatec, Quechua, Guaraní, Mixtec, Maya), across national borders (e.g., the border between the US and Mexico, that between the Dominican Republic and Haiti, and those between Brazil and its Spanish-speaking neighbors, such as Venezuela and Peru), and globally (e.g., with French, English, etc.), and many of these have not received thorough examination from a social perspective. As a result, linguists are left without a complete picture of either the sound system or the system of social meaning that is unique to each of these groups of bilingual speakers.
There is also an opportunity for study of the intonational system of bilingual communities, which has received some attention in recent years. While both segmental and suprasegmental phonology are frequent topics in bilingual sociophonetics, little is still known about the social norms related to these topics, especially in Latin America (Díaz-Campos, Escalona Torres, & Filimonova, 2020). Indigenous languages offer rich intonational systems that merit further research, as study in contact areas may show language contact phenomena that provide a better understanding of cross-linguistic patterns, contributing to the advancement of sociolinguistic theory.
An important focus for the study of bilingual sociophonetics moving forward is likely to be attitudes toward bilingual varieties. Using available techniques to gather information about speech perceptions in bilingual communities may contribute to current understanding of the language variation and change caused by language contact. In the present chapter, we have reviewed studies that demonstrate the importance of understanding attitudes and perception across a variety of social contexts. This includes European Spanish in contact with Catalan, Latin American Spanish in contact with Palenquero, and American Spanish in contact with English. These studies have shown that bilingual heritage speakers face stigmatizing interpretations of their L1-influenced speech, and that minority speakers often internalize these negative evaluations toward their vernacular variety. Their group identity is often regarded with covert prestige, while at the same time they share the same interpretative values as mainstream monolinguals.
The complexity of social meanings associated with sociolinguistic variables can be analyzed with indexical fields, as proposed by Eckert (2008) and applied by Delgado-Díaz et al. (2021) or by Pollock et al. (2023). These fields, which help demonstrate how speakers impart ideologies onto certain phonetic variants, can be indexed through linguistic and social choices that reflect the ideologies that exist in bilingual communities. Speakers imbue variants with social characteristics, and perceive these variants based on their own backgrounds. Factors such as age, social standing, geographic location, age of acquisition, and dominant language play a role in predicting which speakers view a variant favorably and which do not. By understanding the possible associations made by speakers with a particular variant, future researchers can understand who uses certain variants in a community and why. Findings in this area have implications for the symbolic value and the place that an individual has in the complex network of their speech community.
4.1 Introduction
In this chapter, we explore how psycholinguists think about bilingual language processing, and how this informs their work when conducting phonetic and phonological bilingualism research. As reviewed in Section 4.2, in this tradition, the mind is conceptualized as a set of mechanical processes that map between levels of mental representation. Cognition is seen as a form of computation, emerging from the interaction of these processes. For psycholinguists, this perspective is typically instantiated within a connectionist framework. As discussed later, this framework uses an abstract model of biological computations in neural networks to realize cognitive computations. In Section 4.3, we illustrate how this conceptualization has been used to explore bilinguals’ knowledge and processing of sound structure using various experimental techniques. We examine how the coactivation of representations – the claim that elements from multiple languages are simultaneously processed – has motivated studies of speech production processes and representations in speech perception. In Section 4.4, we conclude by considering how psycholinguistic theories can inform as well as be informed by other perspectives.
4.2 The Psycholinguistic Approach
4.2.1 Objectives of Psycholinguistic Research: Effects versus Capacities
What is the goal of psycholinguistic research into bilingualism? Given the focus on experimental data, researchers from other traditions might assume that it is to catalog and precisely characterize the effects observed in experiments. For example, in speech production research a substantial number of experiments have examined the switch cost: the slow-down in reaction times that occurs when participants are unexpectedly cued to switch languages (Meuter & Allport, 1999). Considerable effort has been expended in clarifying the nature of such effects and what factors moderate them (see Gade et al. [2021] for a recent meta-analysis). However, precisely characterizing effects is not the goal of psycholinguistic theories. Such theories aim to specify a system of interacting cognitive mechanisms, or the processes underlying the behavioral effects (Cummins, 2000). Effects are the behavioral manifestations of these interacting mechanisms.
For example, bilingual speakers can speak in one language but not another. A Spanish-English bilingual can name a picture of a dog as either perro or dog, controlling the language of production. How is this possible? The core of Green’s (1998) psycholinguistic theory of language control is an inhibition mechanism that suppresses processing of representations from the nontarget language. When a speaker has the goal to speak Spanish, this inhibition mechanism suppresses the representations of English words, blocking them from being prepared for speech; Spanish representations are not inhibited. If this inhibition carries over from one trial to the next, then trying to retrieve the inhibited language on a subsequent trial – that is, after a language switch – will be more difficult, producing the observed switch cost effect. In this way, mechanistic theories of language capacities account for the effects we observe in experiments.
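The carry-over logic of this account can be caricatured in a few lines of code. This is a deliberately simplified toy model with invented latency values, not a faithful implementation of Green’s (1998) theory; it shows only how persisting inhibition of the previous trial’s nontarget language yields slower responses on switch trials:

```python
BASE_RT = 600   # hypothetical baseline naming latency (ms)
CARRYOVER = 80  # hypothetical residual-inhibition penalty (ms)

def run_trials(languages):
    """Return one latency per trial: switching to a language that
    was inhibited on the previous trial incurs a carry-over penalty."""
    rts, previous = [], None
    for lang in languages:
        rt = BASE_RT
        if previous is not None and lang != previous:
            # The now-target language was the suppressed nontarget
            # language last trial; overcoming that residual
            # inhibition takes extra time (the switch cost).
            rt += CARRYOVER
        rts.append(rt)
        previous = lang
    return rts

rts = run_trials(["ES", "ES", "EN", "EN", "ES"])
# Trials 3 and 5 are switch trials and come out slower.
```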
4.2.2 The Architecture of the Language Processing System
In psycholinguistic theories, mechanisms underlying specific capacities are situated in the context of the broader cognitive system: the set of interacting mechanisms that underlie speech production and perception capacities, as elicited in specific tasks (e.g., read aloud a sentence, provide a definition for a spoken word; note that typically researchers only discuss a small set of tasks, and therefore specify only a subset of the full range of mechanisms that make up the entire cognitive system). From the perspective of psycholinguistics, then, phonetic and phonological capacities (the focus of this volume) can be studied only in the context of the specific speaking and listening tasks that engage the mechanisms underlying these capacities. Critically, the larger set of mechanisms that underlie task performance can influence sound structure processing mechanisms. For example, common experimental tasks used to elicit speech sound production include naming pictures and reading aloud.
Figure 4.1 shows a simplified schematic of the mechanisms involved in these two tasks (following Barry et al., 2001; see Roelofs [2004] for more detailed discussion of sound structure processing across these tasks). Both tasks require sound structure processes to generate the sounds of spoken language (the name of the picture, the sounds of the printed word). While these processes are shared, many other components are distinct, used in one task but not the other. Glossing over many of the details, picture naming requires that the perceiver identify the visual properties of the object; link these visual features to a representation of the depicted object; access the meaning of this object; select a word to express this meaning; and only then process the sounds associated with this word. Reading aloud can proceed along a similar path: identifying letters; linking letters to a word; then accessing the meaning of the word and proceeding just as in picture naming. However, letters (unlike visual features of other objects) can also be linked to sounds without accessing meaning (consider: you can easily pronounce nonwords like blinch). These processing distinctions predict that semantic variables (e.g., priming of words with similar meanings; Damian, Vigliocco, & Levelt, 2001) and lexical variables (e.g., word frequency; Mousikou & Rastle, 2015) will exert a significantly stronger influence in picture naming than in reading. Thus, consideration of the broader cognitive task helps clarify the different factors that may modulate speaker performance – including the processing of sound structure. Later, we'll discuss how this general approach is used to select the tasks we use to study phonetic and phonological capacities, as well as the process of drawing inferences from behavior in these tasks.

Figure 4.1 Simplified schematic of cognitive mechanisms in picture naming and reading aloud.
4.2.3 Connectionism as a Mechanistic Framework for Language Processing
While the simplified flow chart (Figure 4.1) can help guide our thinking about the mental processes involved in these tasks, it says nothing about the mechanistic nature of these processes. Much of the theorizing in psycholinguistics has used the connectionist framework to meet this goal. Connectionist theories are abstract models of the computations that occur within neurobiological networks. This abstraction is built around several core computational principles (Smolensky, 1999). The core representational principle is that mental representations are instantiated via patterns of numerical activity over simple processing units. The core processing principle is the spread of activity over numerically weighted links that connect representational units. Figure 4.2 illustrates how retrieval of word-forms in speech production could be implemented within a connectionist framework (after Costa, Caramazza, & Sebastian-Galles, 2000). Suppose a Spanish-Catalan bilingual is trying to produce the Catalan word gat (“cat”). Figure 4.2(a) focuses on the initial processing of this target word. The speaker has selected the corresponding word unit GAT as the target for retrieval. Activation from this target processing unit spreads along the connections linking word units to sound units, providing activation to units representing the constituent sounds /ɡ/, /a/, and /t/. Other sound units remain inactive. Figure 4.2(b) considers how the pattern of activation changes when we allow another lexical unit to be partially activated. Overlapping semantic representations cause the translation equivalent GATO to be activated. While language control mechanisms suppress the activation of nontarget-language words, GATO's activation remains nonzero, leading to partial activation of its constituent sounds. Because GATO is a cognate, this activation flow converges with that of the target word, boosting the activation of its sounds.

(a) Target word and sounds, along with connections.

(b) Introduction of a coactivated cognate with connections and implications for sound activation.
Figure 4.2 Connectionist implementation of retrieval of phonological forms of words from Spanish-Catalan bilingual memory for target word GAT.
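The spreading-activation mechanism in Figure 4.2 can be sketched in a few lines of code. This is a minimal illustration, assuming a weight of 1.0 on every word-to-sound link and an arbitrary coactivation level of 0.4 for GATO; none of these values come from a published model.

```python
# Minimal sketch of connectionist spreading activation for the GAT/GATO
# example (Figure 4.2). All weights and activation values are illustrative
# assumptions, not parameters from any published model.

def spread(word_activations, connections):
    """Propagate activation from word units to sound units over weighted links."""
    sound_activations = {}
    for word, act in word_activations.items():
        for sound, weight in connections.get(word, {}).items():
            sound_activations[sound] = sound_activations.get(sound, 0.0) + act * weight
    return sound_activations

# Word-to-sound connections (weight 1.0 on every link, for simplicity).
connections = {
    "GAT":  {"g": 1.0, "a": 1.0, "t": 1.0},
    "GATO": {"g": 1.0, "a": 1.0, "t": 1.0, "o": 1.0},
}

# (a) Only the target word is active.
target_only = spread({"GAT": 1.0}, connections)

# (b) The cognate translation equivalent is partially coactivated.
with_cognate = spread({"GAT": 1.0, "GATO": 0.4}, connections)

print(target_only["g"])    # 1.0
print(with_cognate["g"])   # 1.4 -- cognate overlap boosts the target's sounds
print(with_cognate["o"])   # 0.4 -- nontarget-only sound is weakly active
```

Because GAT and GATO share their first three sounds, the cognate's partial activation converges on the target's sound units, reproducing the boost described in the text.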
This example illustrates some key components of connectionist-informed psycholinguistic theories of bilingualism. There are graded levels of activation; mental representations are not simply active or inactive, but participate in processing to varying degrees. In bilingualism research, this is the basis for the concept of coactivation – the graded activation of representations across languages (e.g., Spanish GATO during processing of Catalan target GAT). This concept links up to a very general issue in bilingualism research: the extent to which language processing focuses on only one language at a time (language-selective processing) or involves simultaneous processing of both languages (e.g., perception: Marian & Spivey, 2003; production: Kroll, Bobb, & Wodniecka, 2006). Coactivation is a computational mechanism that can be used to realize a nonselective theory of language processing. In the sections that follow we will discuss how this concept has been a critical part of speech production and perception research.
Once the mechanisms underlying a capacity have been specified, we can generate the predicted behavioral effects that can be observed in experiments. Response times (the time needed to initiate a response) are typically modeled as being driven by relative activation levels of units at a given level of representation. The more active a target representation is – and/or the less active any competing nontarget representations are – the shorter the response time will be. For example, Figure 4.2 shows how the presence of a cognate enhances the activation of target word sounds. Consistent with this relative difference in activation, response times for cognates are typically faster than for noncognates (Costa et al., 2000).
The distribution of nontarget responses that will be produced when processing is disrupted (i.e., speech errors) is also modeled as reflecting relative activation levels. If processing is disrupted (modeled by, for example, adding random noise to the activation levels of units), the more active a nontarget word is, the more likely it is to be produced as an error. For example, when bilinguals read aloud mixed language text, they occasionally make mistakes. At the point of a language switch, they produce the highest rate of cross-language intrusions, producing the translation equivalent of the printed word (e.g., reading soup as sopa; Gollan & Goldrick, 2016). This is argued to reflect coactivation by shared semantic representations. Although, as discussed already, language control mechanisms inhibit nontarget language words (Meuter & Allport, 1999), shared semantic representations will strongly activate the translation equivalent, making it a potent competitor for production.
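The noise-based error mechanism described above can be sketched as a simple simulation: add random noise to illustrative activation levels and count how often each nontarget word wins selection. All activation values and the noise level here are assumptions chosen for illustration.

```python
# Sketch: relative activation plus random noise predicts which nontarget
# word intrudes as a speech error. Activation values are illustrative.
import random

random.seed(0)

def select(activations, noise_sd=0.3):
    """Pick the most active unit after adding Gaussian noise (disruption)."""
    noisy = {w: a + random.gauss(0.0, noise_sd) for w, a in activations.items()}
    return max(noisy, key=noisy.get)

# Target "soup", its strongly coactivated translation "sopa", and an
# unrelated word with low resting activation.
activations = {"soup": 1.0, "sopa": 0.8, "milk": 0.1}

trials = 10_000
errors = {"sopa": 0, "milk": 0}
for _ in range(trials):
    choice = select(activations)
    if choice != "soup":
        errors[choice] += 1

# The more active nontarget (the translation equivalent) intrudes far more
# often than the weakly active unrelated word.
print(errors["sopa"] > errors["milk"])  # True
```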
4.2.4 Other Mechanistic Frameworks
While connectionist models have served as the dominant mechanistic framework (Li, 2013), others have been deployed to account for bilingual language processing data. A prominent alternative develops symbol-manipulating algorithms that instantiate a specific theoretical proposal. These can be implemented on digital computers, allowing explicit tests of their performance against data. Fibla, Sebastian-Galles, and Cristia (2022) provide a recent example of this approach, evaluating seven algorithms implementing contrasting theories of how infants learn to segment words from continuous speech (e.g., do infants posit word boundaries when encountering a low-frequency sequence of speech sounds, or a low-frequency sequence of syllables?). Developing computational implementations of these algorithms, and applying them to naturally occurring data, allows for explicit comparison of the ability of each theory to account for empirical data.
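One family of strategies of the kind evaluated in such work can be sketched as follows: posit a word boundary wherever the transitional probability between adjacent syllables is low. The toy corpus, threshold, and vocabulary below are illustrative assumptions, not materials from Fibla et al. (2022).

```python
# Toy sketch of transitional-probability word segmentation: boundaries are
# posited where the probability of the next syllable given the current one
# dips. Corpus and threshold are invented for illustration.
from collections import Counter

def transitional_probs(syllables):
    """TP(x -> y) = count(x followed by y) / count(x as a non-final syllable)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

def segment(syllables, tps, threshold=0.8):
    """Insert a boundary wherever the TP to the next syllable falls below threshold."""
    words, current = [], [syllables[0]]
    for prev, nxt in zip(syllables, syllables[1:]):
        if tps[(prev, nxt)] < threshold:
            words.append("".join(current))
            current = []
        current.append(nxt)
    words.append("".join(current))
    return words

# Toy "speech stream" built from three words: pabi, kudo, tila.
stream = "pa bi ku do ti la ku do pa bi ti la pa bi ku do".split()
words = segment(stream, transitional_probs(stream))
print(words)  # ['pabi', 'kudo', 'tila', 'kudo', 'pabi', 'tila', 'pabi', 'kudo']
```

Within-word syllable pairs (e.g., pa→bi) always co-occur, so their TP is 1.0; across word boundaries the next syllable varies, so TP drops below the threshold and a boundary is posited.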
4.2.5 Interim Summary: The Psycholinguistic Perspective
Psycholinguistics aims to understand the set of interacting cognitive mechanisms that yield behavioral effects. The mechanistic nature of these processes has frequently (but not exclusively) been framed within the theoretical framework of connectionism. In the remainder of the chapter, we will explore how this perspective has informed the study of phonological and phonetic processes in speech perception and production.
4.3 Studies of Processing of Sounds: Establishing the Cognitive Mechanisms That Underlie Experimental Effects
4.3.1 Phonetics and Processing in Production
Bilinguals are able to speak in either of their two languages, selecting the target language without making errors. How do they accomplish this selection? One proposal is an inhibition-based language control mechanism in which representations from both languages compete for selection, with the target selected when its activation level surpasses that of competitors from either language (see Figure 4.2 for a visualization of this process; Green, 1998; for reviews of models with a focus on lexical selection, see Declerck & Philipp, 2015; Runnqvist, Strijkers, & Costa, 2014). According to Green's (1998) inhibition-based language control model, bilinguals must inhibit their nontarget language in order to use their target language. The degree of inhibition can vary depending on the language context bilinguals are speaking in. For example, a Spanish-English bilingual will process English sounds differently in a mixed context (i.e., a context in which both languages are being used) than in a single language context (i.e., a context in which only one language is being used). Drawing on the cognitive psychology literature on task switching (viewing speaking different languages as akin to performing different tasks), Meuter and Allport (1999) divided mixed language contexts into stay contexts (i.e., the previous word is in the same language as the target word) and switch contexts (i.e., the previous word is in a different language from the target word). Language selection becomes more difficult when bilinguals are using both of their languages; when representations from both languages are simultaneously activated, greater use of inhibition is required.
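The carry-over of inhibition and the resulting switch cost can be reduced to simple arithmetic. This is a minimal sketch; the base response time and inhibition cost are arbitrary assumed values, not estimates from Green (1998) or any experiment.

```python
# Sketch of the inhibition-carry-over account of switch costs: speaking one
# language inhibits the other, and leftover inhibition slows retrieval after
# a switch. All numeric values (base RT, cost) are illustrative assumptions.

def simulate(trial_languages, base_rt=600, inhibition_cost=80):
    """Return a response time (ms) per trial; switch trials pay the carry-over cost."""
    rts, previous = [], None
    for lang in trial_languages:
        is_switch = previous is not None and lang != previous
        # Retrieving a language that was just inhibited takes extra time.
        rts.append(base_rt + (inhibition_cost if is_switch else 0))
        previous = lang
    return rts

trials = ["EN", "EN", "ES", "ES", "EN"]   # two switches: EN->ES, ES->EN
rts = simulate(trials)
print(rts)  # [600, 600, 680, 600, 680]
```

Stay trials proceed at the base rate; the two switch trials each pay the assumed 80 ms cost, yielding the stay/switch contrast the text describes.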
In switched contexts, inhibition of the now-target language leaves representations from the nontarget language relatively more active, decreasing sound contrasts via cascading activation (Goldrick & Blumstein, 2006; Goldrick, Runnqvist, & Costa, 2014). As discussed earlier, bilinguals store lexical items in memory at multiple levels of representation. Recall the example of GAT and GATO, shown in Figure 4.2. The concept “furry four-legged feline” is associated with the lexical item GATO for Spanish and GAT for Catalan. These are in turn mapped onto the phonemes and articulatory gestures corresponding to gato and gat. When bilinguals retrieve an item from either language, the shared semantic representations of translation equivalents lead to parallel activation of lexical representations in both languages. This parallel activation cascades through all levels of the representational system: activation spreading from conceptual representations yields activation of both the desired lexical item and its translation equivalent; these in turn spread activation to the phonemic and phonetic representations of both the target and its translation equivalent. Parallel activation of the target language's and the nontarget language's phonemes yields competition, as they specify incompatible articulatory targets for production. The result is a decrease in the phonetic contrast between sounds in the two languages – a blending of the properties of both languages in production.
Critically, this is predicted to be modulated by the context of language production. If, as claimed earlier, there is increased coactivation of words when mixing and switching languages, increased cascading activation from the nontarget language will decrease the contrast between sounds. Specifically, words produced in a stay language context are predicted to have a larger phonetic contrast than words in a switched language context (e.g., Goldrick et al., 2014).
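The blending prediction can be expressed arithmetically: the produced VOT is a weighted average of the target and coactivated nontarget phonetic targets. The VOT values and coactivation weights below are illustrative assumptions, not measurements.

```python
# Sketch: cascading activation blends the phonetic targets of the two
# languages, shrinking the cross-language VOT contrast more in switch
# contexts (higher coactivation) than in stay contexts. All values are
# illustrative assumptions.

SPANISH_T_VOT = 15.0    # short-lag /t/, ms (assumed)
ENGLISH_T_VOT = 70.0    # long-lag /t/, ms (assumed)

def produced_vot(target_vot, nontarget_vot, coactivation):
    """Weighted blend of target and coactivated nontarget phonetic targets."""
    return (1 - coactivation) * target_vot + coactivation * nontarget_vot

def contrast(coactivation):
    """Cross-language difference in produced /t/ VOT at a given coactivation level."""
    english = produced_vot(ENGLISH_T_VOT, SPANISH_T_VOT, coactivation)
    spanish = produced_vot(SPANISH_T_VOT, ENGLISH_T_VOT, coactivation)
    return english - spanish

stay_contrast = contrast(coactivation=0.1)    # less coactivation when staying
switch_contrast = contrast(coactivation=0.3)  # more coactivation after a switch
print(stay_contrast)    # 44.0
print(switch_contrast)  # 22.0
```

Raising the assumed coactivation weight pulls each language's production toward the other's target, so the switch-context contrast is smaller than the stay-context contrast, as the theory predicts.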
Various experimental methods can be used to test these predictions. The most commonly used are spontaneous speech, cued picture naming, and reading. In spontaneous speech tasks, participants are given some sort of prompt and speak naturally, either to another participant or to an experimenter. The main benefit of spontaneous speech is that it is more naturalistic than speech in other experimental tasks. Participants control when they mix languages, allowing experimenters to observe naturalistic mixing and switching patterns. However, naturalistic approaches are limited, as it can be extremely difficult to conduct phonetic analysis when there is little control over the phonetic environments produced. To properly sample a large range of phonetic environments (allowing for statistical control of environmental effects), large amounts of data need to be collected. Only a few such data sources exist (e.g., Balukas & Koops, 2015; Deuchar et al., 2014; Johnson, 2021). The labor-intensive process of assembling such data means that the range of bilingual populations represented in such work is practically limited.
Reading tasks have participants read text, either sentences (e.g., Amengual, 2018) or paragraphs (e.g., Gollan & Goldrick, 2018), typically presented one at a time. Benefits of reading tasks include control over the environment in which sound sequences are produced, control of the language context, and the fact that all participants produce the same speech. However, if the experimental goal is to investigate the shared activation of a bilingual's languages, a reading task may not be ideal because of the cognitive processes involved in reading aloud (see Figure 4.1). As discussed previously, the cascading activation theory claims that cross-language activation is driven by semantic processing. To test this theory, methodologies that require semantic retrieval are necessary. Because letters have regular relationships to phonemes, reading can bypass semantics – something readers regularly do when reading a novel word aloud without knowing its meaning. For language pairs with the same orthography (e.g., English and Spanish), this raises the possibility that cross-linguistic activation is driven by perceptual overlap between printed words, not semantics.
An alternative laboratory method is cued picture naming, where participants name pictures (with optional prior training) and are cued as to which language to name the picture in. Cues that have been used include colors (e.g., Meuter & Allport, 1999) and flags (e.g., Kleinman & Gollan, 2018). As with reading aloud, cued picture naming allows control over the environment in which sound sequences are produced and control of the language context; it also ensures that all participants produce the same range of speech sounds. There are some limitations, such as the unnaturalistic production of language (i.e., single word productions). However, a cued picture naming task is beneficial for investigating the shared activation of a bilingual's languages across language contexts. This is because, unlike reading, naming pictures requires activation at the semantic level (see Figure 4.1). Cued picture naming thus provides a means of inducing semantically driven coactivation of languages, enabling a strong test of whether this activation cascades down to phonetics.
Cued picture naming was adopted in studies such as Goldrick et al. (2014), which demonstrated that increased coactivation across different language contexts decreases the contrast between sounds. They tested ten Spanish-English bilinguals residing in Barcelona, Spain (all had knowledge of Catalan), using a cued picture naming task in mixed language contexts, contrasting stay and switch trials. All target words started with either voiced or voiceless alveolar stops. These target sounds were chosen because the production of stops differs between English and Spanish in the voice onset time (VOT) of the stop. Voice onset time is the time between the release of the consonant's constriction and the onset of periodicity. As shown in Figure 4.3, both Spanish and English have a two-way distinction between stops; however, they differ in how they realize this distinction. Spanish voiced stops are produced with prevoicing (i.e., voicing before the release of the consonant constriction) and voiceless stops are produced with short-lag VOT (i.e., around 0–30 ms VOT). In contrast, English voiced stops are produced with short-lag VOT and voiceless stops are produced with long-lag VOT (i.e., around 30–120 ms; Lisker & Abramson, 1964). For voiceless stops, a decrease in sound contrast for VOT would therefore appear as a relatively long-lag VOT when the target is Spanish /t/ and a relatively short-lag VOT when the target is English /t/. The results showed a decrease in sound contrast in switch contexts in comparison with stay contexts (with the reduction in contrast driven by the nondominant, second or third language English), providing evidence that when there is increased semantically driven coactivation, phonetic contrasts between languages decrease.

Figure 4.3 Differences between Spanish and English VOT, using dental/alveolar stops as an example. Sounds in brackets indicate phones, while sounds in slashes indicate phonemes (the difference between the two is explained in Section 4.3.2). Spanish has prevoiced voiced stops and short-lag voiceless stops, while English has short-lag voiced stops and long-lag voiceless stops.
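The language-specific category structure in Figure 4.3 can be made concrete with a toy classifier that maps a VOT value onto each language's phoneme categories. The boundary values are rough assumptions based on the short-lag range (0–30 ms) given in the text, not fitted category boundaries.

```python
# Sketch: mapping a word-initial stop's VOT onto language-specific phoneme
# categories, following the Figure 4.3 layout. Boundaries are assumptions.

def classify(vot_ms, language):
    """Return the phoneme a given VOT value would realize in each language."""
    if language == "Spanish":
        return "/d/" if vot_ms < 0 else "/t/"      # prevoiced vs short-lag
    if language == "English":
        return "/d/" if vot_ms < 30 else "/t/"     # short-lag vs long-lag
    raise ValueError(language)

# The same short-lag token (15 ms) is /t/ in Spanish but /d/ in English.
print(classify(15, "Spanish"))  # /t/
print(classify(15, "English"))  # /d/
print(classify(-60, "Spanish")) # /d/ (prevoiced)
print(classify(70, "English"))  # /t/ (long-lag)
```

The short-lag token illustrates the phone/phoneme mismatch at the heart of Section 4.3.2: a single phone is assigned to different phonemes in the two languages.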
4.3.2 Phonology and Representations in Perception
The coactivation that occurs in the production of sounds is also found in perception. Whereas in production shared semantics start the coactivation process, in perception it is overlap in form that triggers coactivation. This has been examined in a number of different tasks. In an auditory lexical decision task, word and nonword stimuli are presented auditorily to a participant; the task is to indicate whether the stimulus is or is not a word. The coactivation of languages during lexical processing is indexed by presenting stimuli with various form and/or meaning properties, and examining which properties speed or slow the decision to accept or reject the stimulus as a lexical item. For example, Lagrou, Hartsuiker, & Duyck (2011) had bilinguals listen to words that share form but not meaning (i.e., cross-language homophones or “false friends”). Bilinguals listen to words in one of their languages (e.g., English) and then in their other language (e.g., Dutch); cross-language homophones are presented in both of these conditions. A Dutch-English listener may hear lief “sweet” or leaf in the respective language blocks. If coactivation is present in perception, the high degree of cross-language overlap for these items will lead to strong coactivation relative to nonhomophonous controls. Consistent with a difference in processing, lexical decisions are slower and less accurate for these items relative to nonhomophonous control words. This is argued to reflect greater cross-language competition due to parallel activation of words in the two languages. While this is a plausible account of the findings, the measurement of cross-language activation is quite indirect.
The visual world eye-tracking paradigm offers a potentially more direct index of cross-language activation. Participants are shown a screen with four pictures (or interact with a surface that has four objects). In a prototypical study of bilingual speech perception (Marian & Spivey, 2003; McDonald & Kaushanskaya, 2020), these include a target word (e.g., desk), a cross-language competitor whose translation equivalent shares a sound with the target word (e.g., screwdriver, with Spanish translation equivalent desarmador), and two unrelated fillers (e.g., ring and caterpillar, whose Spanish translation equivalents, anillo and oruga, respectively, do not overlap with desk). In the most basic version of the paradigm, after viewing these pictures, participants click a button to hear an auditory stimulus (e.g., desk) and then click the picture of what they heard. While they do this, their eye movements are tracked. Critically, the proportion of fixations on different pictures is used to index lexical activation. The critical measure is whether participants fixate on the cross-language competitor more than on the fillers. This provides evidence, within a single trial, that listeners are more strongly considering one potential nontarget word (the cross-language competitor) than other nontarget words.
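The basic analysis can be sketched with invented fixation counts for a single toy trial; real studies aggregate fixation proportions over time windows and trials, so the numbers below are purely illustrative assumptions.

```python
# Sketch of the basic visual-world analysis: compare the proportion of
# fixations to the cross-language competitor against the mean for the
# unrelated fillers. The fixation counts are invented toy data.

def fixation_proportions(fixation_counts):
    """Convert raw fixation counts per picture into proportions of all fixations."""
    total = sum(fixation_counts.values())
    return {pic: n / total for pic, n in fixation_counts.items()}

# Toy trial: target "desk", competitor "screwdriver" (Spanish: desarmador),
# fillers "ring" and "caterpillar".
counts = {"desk": 60, "screwdriver": 25, "ring": 8, "caterpillar": 7}
props = fixation_proportions(counts)

filler_mean = (props["ring"] + props["caterpillar"]) / 2
competitor_advantage = props["screwdriver"] - filler_mean
print(round(props["screwdriver"], 2))  # 0.25
print(round(competitor_advantage, 3))  # 0.175
```

A positive competitor advantage over the filler baseline is the signature of cross-language coactivation in this paradigm.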
McDonald and Kaushanskaya (2020) used this paradigm to examine what, precisely, is the nature of the form overlap that leads to coactivation in comprehension. Form could refer to two different levels of linguistic organization: phonemes or phones. Phonemes characterize the potentially meaningful units of contrast in a language; the articulatory realization of a phoneme in a particular phonetic context is referred to as a phone. Cross-language overlap can differ at these two levels of organization. Figure 4.3 illustrates this for alveolar stops in English and dental stops in Spanish. In word-initial position, English and Spanish distinguish voicing for stop consonants; this yields overlapping representations of the phonemic contrast (e.g., voiced vs. voiceless alveolar/dental stops, denoted by /d/ vs. /t/). However, these phonemes are realized via distinct phones. For example, in Spanish, the voiced dental stop phoneme /d/ is realized via a prevoiced stop (denoted by [d]). In contrast, the English alveolar phoneme /d/ is realized by a short-lag, unaspirated [t]. A similar divergence is found for /t/ (realized as [t] in Spanish and the long-lag aspirated [tʰ] in English).
It has been proposed that bilinguals utilize a shared mapping between various levels of language processing when perceiving speech, allowing for coactivation of lexical items across languages. However, there are two possibilities as to which linguistic level drives coactivation in perception: phonological (i.e., shared phonemes are activated; see Shook & Marian [2013] for more information) or phonetic (i.e., shared phone representations are activated; see Grosjean [1997] for more information). If shared phones drive lexical coactivation, then hearing the English word desk with the initial phone [t] would activate the phonemes /t/Spanish and /d/English, since both phonemes are associated with this phone. If, however, shared phonemes drive lexical coactivation, then hearing the English word desk would activate Spanish words starting with /d/Spanish, which is in the same phonological category as /d/English but is produced with a different phone ([d] vs. [t]).
McDonald and Kaushanskaya (2020) used the visual world paradigm to examine lexical coactivation when words' initial sounds were phonemically matched (e.g., /t/English and /t/Spanish) versus phonetically matched (e.g., [t]English and [t]Spanish). They tested Spanish-English bilinguals in two eye-tracking studies. In the first, the target word and the competitor word shared word-initial phonemes (but not phones) (e.g., desk and desarmador “screwdriver”); in the second, the target word and the competitor shared word-initial phones (but not phonemes) (e.g., desk and techo “roof”). The results of the phonemic experiment showed a higher number of fixations to the cross-linguistic competitor image (for target desk, desarmador) than to unrelated images, providing evidence of coactivation. In contrast, coactivation was absent in the phonetic experiment: there was no difference between fixations to the cross-linguistic competitor (for target desk, techo) as compared to the control items. When a Spanish-English bilingual hears the word desk, the cross-language words that become coactivated are those starting with the shared phoneme /d/ rather than the shared phone [t], indicating that shared cross-language phonological representations, not shared phonetic representations, drive lexical coactivation.
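The two hypotheses can be contrasted with a toy lexicon that indexes words by their initial phoneme and initial phone. The lexicon and the encoding of the input are illustrative assumptions, not the materials of McDonald and Kaushanskaya (2020).

```python
# Sketch contrasting the two coactivation hypotheses: does hearing English
# "desk" (initial phoneme /d/, realized as short-lag phone [t]) coactivate
# Spanish words sharing the phoneme, or those sharing the phone? The tiny
# lexicon below is an illustrative assumption.

# Each entry: word -> (initial phoneme, initial phone)
spanish_lexicon = {
    "desarmador": ("/d/", "[d]"),   # prevoiced realization of Spanish /d/
    "techo":      ("/t/", "[t]"),   # short-lag realization of Spanish /t/
}

english_input = {"word": "desk", "phoneme": "/d/", "phone": "[t]"}

def coactivated(lexicon, cue, level):
    """Words whose initial unit at the given level matches the input."""
    index = 0 if level == "phoneme" else 1
    return [w for w, units in lexicon.items() if units[index] == cue[level]]

print(coactivated(spanish_lexicon, english_input, "phoneme"))  # ['desarmador']
print(coactivated(spanish_lexicon, english_input, "phone"))    # ['techo']
```

The phoneme-level hypothesis predicts looks to desarmador, the phone-level hypothesis looks to techo; the reported fixation results favor the phoneme-level prediction.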
4.3.3 Current Direction: Psycholinguistics of Speech Production Beyond Isolated Words
As the examples thus far show, psycholinguistic studies tend to focus on how participants perform a variety of tasks; however, the most commonly used tasks involve single words. This is clearly not ecologically valid. Individuals don't produce and perceive language as isolated single words; they do so in the context of sentential structures and larger discourse. Although studying single words can be informative for building theories, expanding current theories of bilingual language processing to include more aspects of cognition will require methods using larger strings (phrases, sentences, entire discourses). Such methodologies will provide evidence about the processing of language in the context of natural usage. Research in speech perception has emphasized the importance of larger processing contexts for some time (see, e.g., Lauro & Schwartz [2017] for a meta-analysis of studies of how cognate effects in speech perception are influenced by sentence context). However, psycholinguistic studies of phonological and phonetic processes in language production have only recently begun shifting toward an emphasis on investigating bilingual language cognition beyond single word stimuli.
As discussed already, Green's (1998) inhibitory control model has primarily been tested in language production in the context of single word experiments that require participants to switch languages. To investigate control processes in a sentential context, it is important that we consider other theoretical perspectives that have been developed to account for spontaneous code-switching. One influential account is the matrix language framework, developed by Myers-Scotton and colleagues (e.g., Myers-Scotton & Jake, 2009). This distinguishes the matrix from the embedded language in a code-mixed utterance. The matrix language is the more active language, supplying the majority of the words in the utterance, the grammatical structure, and most closed class function words (e.g., determiners, pronouns). The less activated embedded language supplies the minority of the words in the utterance, typically including only open class content words (e.g., nouns, verbs).
This framework informed the design of a recent study examining language control in sentence versus single word productions (Li & Gollan, 2021). This work aimed to understand how sentential context, biased toward the matrix language and against the embedded language, influences language control. One possibility is that context serves to inhibit the embedded language. Alternatively, it might boost the activation of the matrix language. To tease these two possibilities apart, this study used the cognate facilitation effect. As discussed in Section 4.2.3, production of cognates is facilitated due to parallel activation of phonologically similar translation equivalents. The cognate facilitation effect can therefore be used to index the activation of nontarget language representations. If the embedded language is inhibited, we expect to see a reduction in the cognate facilitation effect. Li and Gollan (2021) found no such effect; cognate facilitation was not significantly different in sentences versus single word productions (which lack a matrix language). This suggests that in parallel to inhibitory mechanisms (Green, 1998), facilitatory mechanisms help control language production in sentence (but not single word) contexts. Our understanding of language control may radically change as psycholinguistic studies of language production begin to explore more ecologically valid contexts (including contexts in which the individual chooses when to switch, as opposed to being told to switch).
4.4 How Can Psycholinguistic Theories Inform as Well as Be Informed by Other Perspectives?
4.4.1 Speech Learning Model, Perceptual Assimilation Model, and Second Language Linguistic Perception Model
Some influential theories in phonetics about bilingual sound systems are the Speech Learning Model (SLM; Flege, 1995; Flege & Bohn, 2021), the Perceptual Assimilation Model (PAM; Best & Tyler, 2007; see also Chapter 7, this volume), and the Second Language Linguistic Perception Model (L2LP; van Leussen & Escudero, 2015; see also Chapter 8, this volume). All three provide theoretical frameworks for how bilingual sound systems are created and structured. The SLM examines the interaction of L1 and L2 sound categories (typically focusing on data from production). Recently, the SLM has been revised (SLM-r; Flege & Bohn, 2021) to state that the sound acquisition mechanisms used in L1 acquisition remain available for L2 acquisition. Like the SLM, the SLM-r holds that the L1 and the L2 sound systems interact, but the SLM-r further holds that the systems are updated throughout an individual's lifetime depending on the input the individual receives. The PAM primarily focuses on how the structure of the L1 sound system influences the acquisition of the L2 sound system (typically focusing on data from perception). The L2LP claims that the L2 sound system is acquired through the lens of the L1 and is updated based on experience (typically focusing on data from perception). These theories focus on the creation and maintenance of contrasting speech sound categories. Their explanatory utility has been clear, based on their ability to make explicit predictions about the contrasts that will be difficult for speakers and listeners (based on the structure of the interacting languages and the experience speakers and listeners have with each of them).
The psycholinguistic tradition has much to gain from such perspectives. While psycholinguistic theories make claims regarding the elements of sound structure representations, there is no explicit encoding of the “system of contrasts” that defines the predictions of the SLM, PAM, and L2LP theories. Understanding this systematic structure is a key insight of phonetic and phonological theory that has not, so far, been a major focus of psycholinguistic investigation. Furthermore, outside of the domain of phonotactics (Dell et al., 2021), psycholinguistic theories of phonological and phonetic processing have not made significant contact with research on language learning. Perspectives such as the SLM, PAM, and L2LP have many insights to offer here.
We believe these theories would, in turn, benefit from incorporating psycholinguistic perspectives. These models have not typically considered the detailed structure of the tasks used to gauge knowledge of sound structure; as discussed earlier, this can have critical implications for interpreting empirical data. There has also been less consideration of the potentially graded nature of linguistic representations (i.e., coactivation), a concept that has proved to be a potent component of psycholinguistic theorizing.
4.4.2 Exemplar Theory
Exemplar theories of bilingual phonetics and phonology (Amengual, 2018; see also Chapters 2 and 11, this volume) claim that speaker-listeners store speech experiences (in both perception and production) at many levels of linguistic and nonlinguistic structure. This includes phonetic and phonological information as well as lexical, social, indexical, and environmental information – potentially all aspects of the experience. The co-presence of multiple dimensions of structure allows for interactions that cross traditional divisions of linguistic information. For example, words with similar phonological structure can be associated with distinct phonetic properties; cognates can exhibit a greater degree of cross-language phonetic influence than phonologically similar noncognates (Amengual, 2012; Brown, 2014).
Exemplar theories offer a more radical view of the interaction of different information sources in language processing; beyond word-specific phonetics, they open up the possibility of exploring interactions with information traditionally considered “nonlinguistic,” such as social structure (see later for more discussion). Psycholinguistic perspectives, in turn, have much to offer exemplar perspectives. In focusing on processing, psycholinguistic models aim to capture the dynamic variation that occurs on a moment-to-moment basis within a trial. In contrast, exemplar theories have traditionally focused on the storage of information (see Fink & Goldrick [2015] for discussion). An integrated psycholinguistic-exemplar perspective could capitalize on the strengths of both approaches.
4.4.3 Sociolinguistics
Psycholinguistic practice typically assumes that we can study the mind separately from the social context of the individual. Sociolinguistics is one research tradition that foregrounds social context (for a more detailed overview, see Chapter 3, this volume). Leveraging insights from this field to understand differences in the linguistic practices of different communities may enhance psycholinguistic theories (see Green and Wei [2014] for discussion of possible implications for language control mechanisms). For example, Sharma and Sankaran (2011) investigated variation in /t/-production in a bilingual Punjabi British community. They collected spontaneous speech via interviews with participants from three different generations: first generation (born in India, immigrated to the United Kingdom as adults); older second generation (born in the UK to Punjabi British parents in the 1960s–1970s); and younger second generation (born in the UK between 1970 and 1995). The distinction between the older and the younger second generation was made because of the societal changes that occurred during the lifetimes of these individuals; discrimination was more severe during the older second generation’s youth. Sharma and Sankaran (2011) showed that patterns of /t/-production differed across generations: the first generation tended to have more Punjabi influence in their English (i.e., more retroflexion of /t/, typical of Punjabi articulation), while the older and the younger second generations showed more British English influence in their English (i.e., more glottalization of /t/, a common feature of British English). However, second-generation individuals also showed a distinct pattern. Because of the discrimination they experienced during their formative years, as well as cultural expectations from their own community, they exhibited both patterns – but in different contexts.
They used Punjabi-related /t/-retroflexion within their community and British-English-related /t/-glottalization outside it. Clearly, to understand this aspect of language processing, we must understand the societal factors that drive it as well as the generational structure of the social groups. Incorporating social information into psycholinguistic research may provide additional insights and broaden the applicability of theories to multiple social/cultural contexts.
Psycholinguistic perspectives can also enhance sociolinguistic research by providing a more nuanced understanding of the many mechanisms that contribute to speech perception and production. For example, psycholinguistic studies suggest that /t/ variation can be driven, in part, by the scope of advance planning during speech production. Kilbourn-Ceron and Goldrick (2021) examined variation in word-final /t/. When both preceded and followed by a vowel, /t/ is frequently pronounced as a flap (e.g., contrast bat with batter). The results of this study show that the likelihood of pronouncing /t/ as a flap increases when an upcoming vowel-initial word is easier, versus more difficult, to retrieve from memory. Sociolinguists can use such frameworks to consider what factors to control for when analyzing speech data for impacts of social information.
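The kind of analysis described above can be sketched as a logistic regression of a binary flapping outcome on a predictor indexing how easy the upcoming word is to retrieve. The sketch below uses synthetic data and a frequency proxy of our own devising; it illustrates the statistical logic only, not the actual dataset or model of Kilbourn-Ceron and Goldrick (2021).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration: does ease of retrieving the upcoming word
# (proxied here by its standardized log frequency) predict the
# probability of flapping word-final /t/?
n = 2000
log_freq = rng.normal(0.0, 1.0, n)            # assumed retrieval-ease proxy
true_b0, true_b1 = -0.2, 0.8                  # assumed effect: easier retrieval -> more flapping
p_flap = 1 / (1 + np.exp(-(true_b0 + true_b1 * log_freq)))
flapped = rng.random(n) < p_flap              # simulated binary outcome: flap vs. [t]

# Fit a logistic regression by gradient ascent on the mean log-likelihood.
X = np.column_stack([np.ones(n), log_freq])
y = flapped.astype(float)
beta = np.zeros(2)
for _ in range(5000):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += 0.01 * X.T @ (y - p) / n          # gradient step

print(f"estimated slope for log frequency: {beta[1]:.2f}")
assert beta[1] > 0  # positive slope: easier upcoming word -> more flapping
```

Controlling for predictors like this one is exactly the kind of safeguard the paragraph above suggests sociolinguists can adopt when testing for effects of social information.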
4.5 Conclusions
Psycholinguistics uses experimental methods to study bilingual language processing in specific tasks, aiming to understand the cognitive mechanisms that underlie bilingual language processing capacities. While the field has long focused on decontextualized single-word processing, more recent work has aimed to examine more ecologically valid contexts. This shift in focus helps establish greater connections with other perspectives; these links will serve to deepen our understanding of bilingual language cognition in context.
5.1 Noninvasive Neurolinguistic Technologies
The study of neuroplasticity in bilinguals has progressed significantly in the last decades, showing that bilingualism can alter the function and structure of brain regions that support speech communication (Hernandez, 2013; Li, Legault, & Litcofsky, 2014). In this chapter, we overview these changes through the lens of current noninvasive neuroimaging technologies. From a mechanistic perspective, neural signals are less opaque than behavior. Because behavioral tasks demand attention, decision-making, and metalinguistic awareness, these cognitive processes cannot always be factored out of the behavioral response. As we show in this chapter, several neurophysiological signals can be measured without overt behavioral responses or even attention. These neural signals provide a unique opportunity to segregate computations that are specific to speech from simultaneous task-dependent processing.
Neurolinguistic technologies (Figure 5.1a) can be divided into two groups with respect to the signals that they measure: (1) technologies like electroencephalography (EEG; Kaan, 2007; Luck, 2014) and magnetoencephalography (MEG; Hansen, Kringelbach, & Salmelin, 2010), which track changes in neural activity with millisecond precision, and (2) technologies like magnetic resonance imaging (MRI; Huettel, 2012; Poldrack, Mumford, & Nichols, 2011) and functional near-infrared spectroscopy (fNIRS), which track spatially localized changes in brain structure and function based on secondary signals, such as local blood oxygenation.

Figure 5.1 Neuroimaging. (a) Noninvasive neurolinguistic technologies and (b) representative brain regions and white matter tracts for speech processing. MGB = medial geniculate body; IC = inferior colliculus; CN = cochlear nucleus; AC = auditory cortex; STG = posterior superior temporal gyrus.
One of the most established technologies in bilingual research is EEG (Kaan, 2007; Luck, 2014). This neuroimaging modality is considerably cheaper than most other technologies and relatively portable. It works by using scalp-based electrodes that capture far-field electrical activity of neurons. The synchronized depolarization (or hyperpolarization) of hundreds of post-synaptic neurons creates a localized electric dipole that can be measured with microvolt precision (Luck, 2014). In EEG research, the processing of speech features is typically indexed by positive and negative peaks of electrical activity time-locked to the onset of the sounds. As we note in Sections 5.4 and 5.5, these event-related potentials (ERPs; Beres, 2017) provide valuable insights into the encoding and retrieval of speech representations in bilingual populations. Sustained EEG responses, on the other hand, reflect ongoing electrical activity phase-locked to fast (e.g., fundamental frequency contours) and slow (e.g., speech amplitude envelopes) oscillations in the speech waveform. As we discuss in Sections 5.3 and 5.6, sustained responses are more appealing for investigating the neural dynamics of speech intelligibility.
MEG (Hansen et al., 2010) measures a signal complementary to EEG, based on magnetic field fluctuations resulting from concomitant changes in electrical dipole activity. Modern MEG systems are equipped with hundreds of superconducting quantum interference devices (SQUIDs) that detect changes in the magnetic field. This neuroimaging technology has better spatial resolution than EEG, localizing dipoles at the centimeter to millimeter scale. Also, MEG can measure rapid cortical oscillations in the gamma band (>30 Hz) that are not as clearly represented in EEG. This rapid oscillatory activity can be used to investigate the processing of fast phonetic changes in the speech signal. However, MEG remains less popular than EEG mainly because it is quite expensive, sensitive to interference (requiring a dedicated facility), and thus not yet portable.
Neuroimaging technologies that measure changes in blood flow and oxygenation in response to nearby neural activity provide lower temporal resolution than MEG and EEG but offer better spatial resolution. Functional MRI (fMRI; Huettel, 2012; Poldrack et al., 2011) is an adaptation of existing structural MRI technology. It requires a dedicated multi-room facility because the scanner uses a strong static magnetic field, which poses minimal biological risk but major logistical obstacles. Structural MRI three-dimensionally images soft tissue through temporary electromagnetic perturbation of hydrogen atoms at high spatial resolution (around 1 cubic millimeter per voxel). Structural MRI (Saba, 2016) can be used to observe changes in neuronal cell bodies (gray matter) and in the bundles of myelinated neuronal axons connecting distant brain regions (white matter). On the other hand, fMRI is used to link changes in the blood oxygenation level dependent (BOLD) signal to specific language functions. When applied to human brain imaging, fMRI has a spatial resolution below 10 cubic millimeters (as low as 3 mm³ in newer scanners) and a temporal resolution of around 1–2 seconds, depending on scanner and study parameters. However, this high spatial resolution makes fMRI very sensitive to participants’ movement, which poses obstacles when head motion is likely, such as in studies involving speech production or young children.
Finally, fNIRS (Ferrari & Quaresima, 2012; Pinti et al., 2020) measures infrared light scattered through the scalp, skull, and brain to detect changes in blood oxygenation along the surface of the cerebral cortex. These measures are correlated with the BOLD signal measured by fMRI. Light sources and detectors are arranged in a grid or honeycomb array and may target specific brain regions of interest or provide broad, whole-scalp coverage. This diffuse sampling approach limits fNIRS’ spatial resolution to a few square centimeters, although multiplexing techniques are improving on that figure. In contrast to fMRI, fNIRS cannot be used to monitor neural activity in subcortical regions. However, fNIRS samples the slow hemodynamic signal rapidly (around 10 Hz) and is therefore more resilient than fMRI to transient artifacts like head motion. Lastly, fNIRS has lower cost and higher portability than fMRI. For these reasons, fNIRS is commonly used with infants and young children.
5.2 Neurobiology of Speech Processing
The neural encoding of speech features begins in the central auditory pathway, a few tens of milliseconds before the auditory signal reaches the primary auditory cortex (Picton et al., 1985). This pathway connects the cochlea with a series of subcortical nuclei: the cochlear nucleus and inferior colliculus in the brainstem and midbrain, and the medial geniculate body in the thalamus (Figure 5.1b, top). These nuclei provide an early sensory representation of spectral and temporal phonetic cues, such as consonant-vowel transients, formants, pitch, or amplitude modulation (Wolfe et al., 2006). While the auditory pathway has traditionally been regarded as a subcortical relay station for the primary auditory cortex, it is malleable to auditory experience (Krishnan et al., 2005) and supports computations that are critical for speech perception, such as the spectral contrast effect (Stilp, 2020). As we show in Section 5.3, the sensory representation of phonetic cues in the central auditory pathway is further influenced by bilingualism.
Once the neural signal leaves the auditory pathway, phonetic features are integrated into more complex representations of sounds in the primary auditory cortex and the superior temporal gyrus (STG; Figure 5.1b, middle) (Mesgarani et al., 2014; Yi, Leonard, & Chang, 2019). These complex representations are subsequently mapped onto meanings, in anterior temporal regions, and articulatory gestures, in left inferior parietal and frontal regions (Hickok & Poeppel, 2007). Recent invasive electrocorticography (ECoG) work in humans has demonstrated that the representation of speech sounds in the STG is mediated by local populations of neurons that are differentially attuned to phonological features (Leonard & Chang, 2014). According to the dual stream model (Hickok & Poeppel, 2007, 2016), the integration of speech sound representations with meaning and articulatory representations is achieved along two different neural pathways: the ventral and the dorsal streams (Figure 5.1b, middle). The ventral stream projects from superior to middle and inferior regions in the temporal lobe. This stream is innervated by several white matter tracts, such as the inferior fronto-occipital fasciculus (IFOF) (Figure 5.1b, bottom), which connects posterior and anterior temporal regions related to lexical and semantic processing (Friederici, 2015). As we discuss later, the structural integrity of this tract is modulated by bilingual experience (Kuhl et al., 2016).
The dorsal stream (Figure 5.1b, middle) projects to the left inferior parietal and frontal lobe regions involved in the selection, initiation, and sequencing of articulatory gestures (Bouchard et al., 2013; Tourville & Guenther, 2011). The more anterior parts of the frontal lobes are thought to be involved in domain-general executive functions related to attention, working memory, control, and monitoring (Abutalebi & Green, 2007; Fedorenko & Blank, 2020). Critically, these executive functions are also supported by subcortical structures, such as the basal ganglia and the cerebellum (Figure 5.1b, top), which are likewise sensitive to bilingual experience (Pliatsikas, 2020). Additionally, the dorsal stream supports the integration of perceptual and sensorimotor representations in both perception and production tasks (Bohland, Tourville, & Guenther, 2019). Communication between brain regions specialized in speech perception and production is facilitated by the arcuate fasciculus (Figure 5.1b, bottom), an important white matter tract that underlies the dorsal stream. While speech production and speech perception are often examined separately, they can exhibit a high degree of interaction. Unfortunately, the neural computations underpinning the overt articulation of speech sounds are often difficult to investigate because of the signal artifacts introduced by articulatory movement. This experimental limitation has kept the number of neuroscientific studies on overt bilingual speech production small, although that number is now rising (e.g., see Ganushchak, Christoffels, & Schiller, 2011).
5.3 Early Sensory Encoding of Phonetic Cues
The sensory representation of phonetic cues along the central auditory pathway can be investigated using the frequency following response (FFR). The FFR is a short-latency (7–10 ms) neurophysiological index of sound encoding fidelity (Skoe & Kraus, 2010; Coffey et al., 2019). When the central auditory system is stimulated with a sound, neurons synchronize their oscillations by firing at the same phase of each cycle in the sound waveform. This sustained neural oscillatory activity (>70 Hz) is captured by the FFR, providing a faithful representation of the speech waveform (Figure 5.2a). Because of its short neural latency and fast timescale, the FFR provides a unique window into the effects of bilingualism during the first stages of auditory processing. Additionally, the FFR is a preattentive signal: it can be evoked even when listeners are not paying attention to the sounds. Thus, the FFR can also be used to factor out the effects of attention on auditory processing.

Figure 5.2 The frequency following response. (a) Waveforms and spectrograms of speech sounds (e.g., a rising Mandarin lexical tone) and their averaged FFR are very similar. (b) In FFR experiments, participants are usually exposed to thousands of stimulus repetitions. One FFR montage includes three electrodes placed on the vertex (Cz, active channel), left mastoid (LM, ground), and right mastoid (RM, reference). (c) Top: Native speakers exhibit more faithful temporal encoding of phonetic cues, shown here for the neural tracking of a rising Mandarin tone in one native speaker of Mandarin and one of English. Bottom: Native speakers also exhibit more robust spectral encoding of phonetic cues, shown here for the fundamental frequency (F0) of a flat-pitch Mandarin tone in one native speaker of Mandarin and one of English.
The FFR can be recorded with a simple setup of three EEG electrodes (Figure 5.2b) that target phase-locked activity from subcortical nuclei in the brainstem, midbrain, and thalamus (Coffey et al., 2016). Recent MEG and ECoG work (Coffey et al., 2019; Gnanateja et al., 2021) has also identified a contribution from the primary auditory cortex; however, this contribution does not seem to be significant for stimulus frequencies higher than 200 Hz (Bidelman, 2018). Because the FFR generators are far from the scalp electrodes, the FFR is typically averaged across thousands of stimulus repetitions to improve the signal-to-noise ratio (SNR).
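The rationale for averaging thousands of repetitions can be made concrete with a toy simulation. This is a minimal sketch, not an FFR analysis pipeline: a weak phase-locked 150 Hz "response" (an assumed amplitude and frequency) is buried in independent noise on every trial, and averaging N trials shrinks the noise by roughly the square root of N.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic demonstration of trial averaging: all parameters are assumptions.
fs = 2000                                    # sampling rate (Hz)
t = np.arange(0, 0.25, 1 / fs)               # 250 ms epoch
signal = 0.1 * np.sin(2 * np.pi * 150 * t)   # phase-locked component, amplitude 0.1

def avg_snr(n_trials):
    """Amplitude SNR of the across-trial average for n_trials noisy epochs."""
    trials = signal + rng.normal(0.0, 1.0, (n_trials, t.size))  # noise sd = 1
    avg = trials.mean(axis=0)
    noise_rms = np.std(avg - signal)         # residual noise after averaging
    return 0.1 / noise_rms

snr_100, snr_2000 = avg_snr(100), avg_snr(2000)
print(f"SNR with 100 trials: {snr_100:.2f}; with 2000 trials: {snr_2000:.2f}")
assert snr_2000 > snr_100                    # more repetitions, cleaner average
```

With noise that averages out and a response that is phase-locked to stimulus onset, twenty times as many trials yields roughly a fourfold-or-better SNR gain, which is why FFR protocols routinely present thousands of repetitions.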
While the FFR was discovered nearly sixty years ago (Worden & Marsh, 1968), its popularity has recently increased as evidence has emerged that it captures language-dependent plasticity (Krishnan et al., 2005). The FFR exhibits more faithful (Figure 5.2c, top) and more robust (Figure 5.2c, bottom) encoding of native speech than of non-native speech (Krishnan et al., 2005). The FFR is also modulated by non-native speech training (Reetzke et al., 2018) and age of acquisition (AOA; Giroud et al., 2020; Skoe et al., 2017). In a seminal FFR study, Krizman et al. (2012) investigated the effects of bilingualism on the sensory representation of the fundamental frequency (F0) of speech stimuli in the presence of a background competitor. Like many other FFR studies, they focused on the F0 because this speech cue is represented with high SNR in the FFR. They collected FFRs from simultaneous Spanish-English bilinguals and English monolinguals and found more robust neural encoding of F0 among bilinguals. They also found a significant positive correlation between F0 encoding and individual performance in an auditory-based cognitive control task. Because bilinguals scored higher than monolinguals on this task, this FFR study provides a link between early auditory processing and cognitive control in bilingual populations.
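One common way to quantify "robustness of F0 encoding" is to compare the spectral magnitude of the averaged FFR at the stimulus F0 against the surrounding noise floor. The sketch below illustrates that logic on a synthetic waveform; the 100 Hz F0, amplitudes, and noise-band limits are assumptions for illustration, not the cited studies' actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "averaged FFR": a 100 Hz F0 component plus residual noise.
fs = 2000
t = np.arange(0, 0.25, 1 / fs)
ffr = 0.2 * np.sin(2 * np.pi * 100 * t) + rng.normal(0, 0.05, t.size)

# Magnitude spectrum of the averaged response.
spectrum = np.abs(np.fft.rfft(ffr)) / t.size
freqs = np.fft.rfftfreq(t.size, 1 / fs)

f0_bin = np.argmin(np.abs(freqs - 100))      # bin nearest the stimulus F0
noise_bins = (freqs > 120) & (freqs < 180)   # flanking region, assumed F0-free
f0_snr = spectrum[f0_bin] / spectrum[noise_bins].mean()
print(f"F0 peak is {f0_snr:.1f}x the neighboring noise floor")
assert f0_snr > 3                            # F0 clearly stands out of the noise
```

A "more robust F0" in a group comparison then amounts to a larger peak-to-floor ratio of this kind in that group's averaged responses.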
The FFR gain reported for Spanish-English bilinguals in Krizman et al. (2012) has recently been extended to other bilingual populations. For instance, Skoe et al. (2017) collected FFRs from early bilingual speakers of eleven different languages and also found more robust encoding of F0 among bilinguals compared to English monolinguals. Because the language backgrounds included in this study were typologically diverse, the FFR gain was likely the result of early auditory exposure to more than one language. Combined, the studies discussed in this section show that bilingualism can impact the early sensory representation of phonetic cues.
5.4 Neural Discrimination of Speech Sounds
In Section 5.3, we discussed the effects of bilingualism on the early sensory representation of phonetic cues. In this section, we show that these effects persist at later stages of processing, as the neural signal propagates throughout the cortex and sounds are consciously perceived. Perhaps the most popular neurophysiological signal used to examine the neural processing of speech sounds in the cortex is the mismatch negativity (MMN) (Garrido et al., 2009; Näätänen et al., 2007). The MMN is a neurophysiological index of automatic sound discrimination. This EEG component is characterized by a negative deflection that peaks at approximately 100–250 ms following the onset of a sound that mismatches a preceding sequence of sounds (Figure 5.3a).

Figure 5.3 The mismatch negativity. (a) Schematic representation of the oddball paradigm for standard and deviant sounds contrasting in phonological voicing (left) and their average EEG responses (right). The origin of the x-axis corresponds to the onset of the sounds. (b) Electrode configuration in a 64-channel cap (left) and schematic representation of MMN sources and putative dynamics (right). IFG = inferior frontal gyrus; AC = auditory cortex; STG = posterior superior temporal gyrus.
The MMN is elicited with the oddball paradigm (Figure 5.3a, left). In this paradigm, listeners are exposed to repetitions of the same standard stimulus (e.g., the syllable /ba/) before the deviant stimulus (e.g., the syllable /pa/) is randomly presented. The characteristic negative deflection of the MMN is obtained by subtracting the average EEG response to the standard stimulus from the average response to the deviant stimulus (Figure 5.3a, right). This negative deflection, which is maximal at fronto-central electrodes (e.g., FCz in Figure 5.3b, left), is thought to reflect differences in electrical activity from neurons in the STG and the inferior frontal gyrus (IFG) in response to predictable stimuli (activity suppression) and unpredictable stimuli (activity facilitation) (Garrido et al., 2008, 2009). In this model, STG neurons mediate the automatic detection of unexpected changes in the afferent auditory signal (Figure 5.3b, right). Meanwhile, IFG neurons generate a high-level response to the error signal delivered from the STG. Once the error signal is received, the IFG resets the expectations about upcoming stimuli in the STG. Within this prediction-error framework, the magnitude of the MMN is thought to reflect the magnitude of the error signal.
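The subtraction logic above can be sketched in a few lines. This is a toy simulation on synthetic data, not real EEG: the deviant ERP is assumed to carry an extra negativity peaking around 175 ms, and the difference wave (deviant minus standard) recovers it.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic oddball ERPs; all amplitudes and latencies are assumptions.
fs = 500
t = np.arange(-0.1, 0.5, 1 / fs)                        # epoch around sound onset (s)
mmn_shape = -2e-6 * np.exp(-((t - 0.175) / 0.05) ** 2)  # extra negativity, peak ~175 ms

def average_erp(n_trials, deviant):
    """Average of n_trials noisy epochs; deviants carry the extra negativity."""
    base = 1e-6 * np.sin(2 * np.pi * 4 * t)             # shared auditory response
    extra = mmn_shape if deviant else 0.0
    trials = base + extra + rng.normal(0, 5e-6, (n_trials, t.size))
    return trials.mean(axis=0)

standard_erp = average_erp(800, deviant=False)          # standards are frequent
deviant_erp = average_erp(200, deviant=True)            # deviants are rare
difference_wave = deviant_erp - standard_erp            # the MMN

peak_time = t[np.argmin(difference_wave)]               # most negative point
print(f"MMN peak at {peak_time * 1000:.0f} ms")
assert 0.1 < peak_time < 0.25                           # canonical MMN window
```

Note that the shared auditory response cancels in the subtraction, so the difference wave isolates whatever the deviant evokes over and above the standard, which is exactly why the MMN is read off the difference wave rather than either ERP alone.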
While the MMN can be used to index the detection of subtle acoustic differences in stimulus frequency, duration, or intensity (Näätänen et al., 2007), this EEG component reflects more than what can be measured in the acoustic signal. Dehaene-Lambertz, Dupoux, and Gout (2000) demonstrated this sensitivity with MMNs collected from native speakers of French or Japanese in response to the word contrast /ebuzo-ebzo/. In Japanese, the word /ebzo/ is perceived with an epenthetic vowel /u/ (as in /ebuzo/) that is not present in the acoustic signal. Thus, if the MMN were sensitive only to changes in the acoustic signal, its amplitude should not differ between native French and Japanese speakers. Their results, however, showed that the amplitude of the MMN was significantly weaker among Japanese speakers. This finding demonstrates that the MMN is also sensitive to phonological features that are not directly encoded in the acoustic signal.
The effects of language experience on the MMN have been documented for a wide variety of segmental and suprasegmental contrasts (Chandrasekaran, Gandour, & Krishnan, 2007; Näätänen et al., 2007). The amplitude and the latency of the MMN are also influenced by perceptual training and non-native language experience (Jakoby, Goldstein, & Faust, 2011; Winkler et al., 1999). In this body of literature, advanced learners tend to exhibit larger MMN amplitudes than poor learners. Longer MMN latencies, on the other hand, are thought to index longer processing times in less-experienced listeners.
The MMN has also been used to evaluate the degree of interference between phonological systems in bilinguals. For example, Tamminen et al. (2013) collected MMNs from Finnish monolinguals and simultaneous Swedish-Finnish bilinguals using a vowel pair that straddled the phonological boundary between two Finnish vowels but belonged to the same vowel category in Swedish. They found longer MMN latencies in bilinguals compared to monolinguals. Because participants were not explicitly biased toward Swedish or Finnish during the oddball session, these longer latencies were attributed to the conflict between the phonological interpretations of the same vowel contrast in Swedish versus Finnish. In line with this interpretation, Peltola et al. (2012) found shorter MMN latencies in late Finnish-Swedish bilinguals, for whom the degree of cross-language interference is expected to be reduced. Combined, these findings suggest that the degree of neural interference between two phonological systems is modulated by AOA.
To understand how language-specific phonological systems are disentangled in the bilingual brain, other MMN studies have explicitly manipulated the language context of the experimental task. For example, García-Sierra et al. (2012) established the language context by asking bilingual participants to read Spanish or English magazines while being exposed to voice onset time (VOT) exemplars straddling either the Spanish ([ba] vs. [pa]) or the English ([pa] vs. [pʰa]) voicing boundary. This experimental manipulation was possible because the MMN is a preattentive signal; participants could thus be instructed to focus on their magazines during the oddball session. Critically, the amplitude of the MMN in response to each phonetic contrast was enhanced in the language context in which the contrast was phonologically distinctive. This finding shows that the neural encoding of phonetic contrasts in the bilingual brain is modulated top-down by language context.
5.5 Phonological Retrieval of Words and Cognitive Control
In Section 5.4, we discussed the effects of cross-language interference on the neural discrimination of speech sounds. In this section, we show that cross-language interference can also affect the ability to recall fine-grained phonological representations of words in covert speech production tasks. The effects of cross-language interference in these tasks are usually observed at latencies longer than 200 ms. One popular ERP in this regard is the N2 (Morales et al., 2015; Sussman et al., 2004). As with the MMN, the N2 is characterized by a negative shift time-locked to the onset of a specific event in the task. Prior EEG and invasive electrophysiological work has linked this negative shift to neural activity in the cingulate cortex and supplementary motor regions associated with conflict detection and cognitive control (Cavanagh & Frank, 2014; Cavanagh et al., 2010). In the bilingual literature, cognitive control typically refers to a series of mental computations that enhance processing in one language while attenuating automatic processing in the other language (Grant, Dennis, & Li, 2014).
The N2 is often used to index stimulus-response conflict or response inhibition in experimental paradigms like the go/no-go task (Barac, Moreno, & Bialystok, 2016). In this task, participants are instructed to press a button as fast as they can in response to certain events (go trials; e.g., symbols printed in a target color) and to withhold their response for others (no-go trials; e.g., symbols printed in a nontarget color). While the instructions are simple, response inhibition becomes more challenging as the ratio of go to no-go trials increases. To perform successfully in the go/no-go task, participants must devote considerable resources to inhibiting their response in no-go trials.
The go/no-go task can be particularly difficult when stimulus dimensions create a conflict between go and no-go signals. In one such seminal study, Rodriguez-Fornells et al. (2005) investigated the effects of cross-language interference on covert speech in simultaneous German-Spanish bilinguals and German monolinguals. Covert speech is often required to avoid the EEG artifacts generated by the overt articulation of speech gestures. To prevent these artifacts, they created a go/no-go version of a picture-naming task. In one condition, German monolinguals and German-Spanish bilinguals were instructed to respond by pressing a button when the German name of the picture shown on the screen started with a vowel. In a different condition, bilinguals were instructed to respond when the name of the picture started with a vowel in Spanish. In the bilingual group, response inhibition was expected to be harder for incongruent no-go trials (e.g., pictures whose name begins with a consonant in the target language but a vowel in the nontarget language) relative to congruent no-go trials (e.g., pictures whose name begins with a consonant in both German and Spanish).
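The trial logic of the German-target condition can be made explicit with a short sketch. The word pairs below are illustrative examples of our own choosing, not the study's actual stimuli: a trial is "go" if the German picture name begins with a vowel, and a no-go trial is congruent only if the Spanish name also begins with a consonant.

```python
# Hypothetical sketch of trial classification in the German-target condition
# of a go/no-go picture-naming design (illustrative stimuli, not the
# Rodriguez-Fornells et al. materials).
VOWELS = set("aeiouäöü")

def classify(german: str, spanish: str) -> str:
    """Label a picture by its German (target) and Spanish (nontarget) names."""
    if german[0].lower() in VOWELS:
        return "go"                       # German name starts with a vowel
    if spanish[0].lower() not in VOWELS:
        return "no-go congruent"          # consonant-initial in both languages
    return "no-go incongruent"            # consonant in German, vowel in Spanish

# Illustrative translation pairs (German name, Spanish name):
pairs = [("Apfel", "manzana"), ("Buch", "libro"), ("Schule", "escuela")]
for german, spanish in pairs:
    print(f"{german}/{spanish}: {classify(german, spanish)}")
```

Under this scheme, "Schule/escuela" is the interesting case: the German name demands withholding the response while the Spanish name, if coactivated, signals "go," which is precisely the conflict indexed by the enlarged N2 in bilinguals.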
To assess the degree of cross-language interference across groups, Rodriguez-Fornells et al. (2005) compared the amplitude of the N2 between incongruent and congruent no-go trials. As expected, N2 amplitude in no-go trials was not modulated by trial congruence in monolinguals. The bilingual group, however, revealed a strong degree of cross-language interference, indexed by a larger N2 amplitude in incongruent no-go trials relative to congruent no-go trials. Interestingly, the latency of the no-go effect in the bilingual group was delayed by approximately 200 ms with respect to the monolingual group. This finding is consistent with the longer MMN latency observed among simultaneous bilinguals in response to cross-linguistically ambiguous phonetic contrasts (Tamminen et al., 2013). Consequently, the results of Rodriguez-Fornells and colleagues demonstrate that the cross-language interference effect is not limited to the perceptual domain but is also realized in (covert) speech production.
5.6 From Isolated Words to Connected Speech
We have so far focused on responses over very short time intervals, phase- or time-locked to isolated words or events. These responses are averaged across several dozen to thousands of repetitions to amplify reliable properties of the signal and produce interpretable patterns. One downside of these methods, then, is that stimuli must be brief, repetitious, and punctuated by onset and offset boundaries between which the signal of interest can be measured. One way to get around these experimental limitations is to use longer and less repetitive stimuli, such as spoken narratives (e.g., audiobooks and podcasts; Hamilton & Huth, 2020). While isolated words and sentences are experimentally more controlled than spoken narratives, they are less naturalistic and engage the brain differently (Alexandrou et al., 2020; Hickok & Poeppel, 2007). Spoken narratives, on the other hand, are quite appealing for investigating the neural correlates of speech intelligibility (i.e., the ability to recognize auditory word-forms) and comprehension (i.e., the ability to understand auditory word-forms) in naturalistic listening contexts.
One of the most prominent acoustic correlates of sentence intelligibility is the speech amplitude envelope (Rosen et al., 1992), defined as slow changes in the amplitude of the speech waveform over time (Figure 5.4a). Research on speech intelligibility has shown that the amplitude envelope conveys enough information to accurately perceive vowels, consonants, and sentences. A seminal study in this regard (Shannon et al., 1995) demonstrated that the combined amplitude envelopes of only three separate frequency channels are sufficient to retain approximately 70 percent of sentence intelligibility.
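To make the notion concrete, a broadband amplitude envelope can be computed as the magnitude of the analytic (Hilbert-transformed) signal, low-pass filtered to retain only slow modulations. The Python sketch below is purely illustrative: the synthetic signal, the 10 Hz cutoff, and the filter order are arbitrary choices, not parameters taken from the studies cited here.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def amplitude_envelope(waveform, fs, cutoff_hz=10.0):
    """Broadband amplitude envelope: magnitude of the analytic signal,
    low-pass filtered to keep only slow (< cutoff_hz) modulations."""
    env = np.abs(hilbert(waveform))
    b, a = butter(4, cutoff_hz / (fs / 2), btype="low")
    return filtfilt(b, a, env)

# A synthetic "speech-like" signal: a 150 Hz carrier amplitude-modulated
# at a roughly syllabic rate (4 Hz). All parameters are illustrative.
fs = 1000
t = np.arange(0, 2.0, 1 / fs)
modulator = 1.0 + 0.8 * np.sin(2 * np.pi * 4 * t)
signal = modulator * np.sin(2 * np.pi * 150 * t)
env = amplitude_envelope(signal, fs)   # closely follows the 4 Hz modulator
```

Because the modulation here is far slower than the carrier, the recovered envelope tracks the 4 Hz modulator almost exactly, which is the property that makes the envelope a useful summary of syllabic-rate structure in real speech.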

(a) EEG oscillations in delta (1–4 Hz) and theta (4–8 Hz) bands entrain to the broadband envelope of the speech waveform. This can be observed in both time (left) and spectral (right) domains.

(b) In fMRI univariate analysis, the magnitude of the hemodynamic response in one voxel (or averaged across voxels) is used to index functional differences in speech processing.

(c) A common approach in multivariate fMRI analysis is to calculate the neural distance or dissimilarity between distributions of hemodynamic responses elicited with different speech sounds in a region of interest.
Figure 5.4 Connected speech and fMRI.
The neural tracking of amplitude modulation can be examined using multiple technologies. Fast gamma activity between 30 Hz and 100 Hz is represented with high SNR in MEG, and thus this technology is well suited to investigate the temporal coding of phonetic segments (Giraud & Poeppel, 2012; Teng et al., 2017). Most EEG and MEG research (Ding & Simon, 2014), however, has focused on the neural tracking of amplitude modulation patterns associated with the temporal coding of longer speech constituents, such as syllables (4–8 Hz) and auditory word-forms (< 4 Hz) (Figure 5.4a). This research has found that the coupling or phase-locking synchronization between brain oscillations and amplitude envelopes deteriorates when speech signals are acoustically degraded (Ding, Chatterjee, & Simon, 2014). Conversely, the neural tracking of amplitude envelopes improves when listeners are more attentive to the speech signal (Rimmele et al., 2015). Hence, neural tracking of amplitude modulation provides a proxy for investigating the neural basis of speech intelligibility and attention in naturalistic speech contexts (Di Liberto, Crosse, & Lalor, 2018; Nourski et al., 2009; Peelle & Davis, 2012).
A handful of studies have leveraged different linear regression models to assess the synchronization between EEG oscillations and amplitude envelopes in monolinguals (e.g., Crosse et al., 2016; Di Liberto, O’Sullivan, & Lalor, 2015) and bilinguals (Di Liberto et al., 2021; Zinszer et al., 2022). In two such studies, Song and Iverson (2018) and Reetzke, Gnanateja, and Chandrasekaran (2021) found better neural tracking of amplitude modulation among late-onset bilinguals compared to English monolinguals. This result has been attributed to the greater recruitment of attentional resources in second language (L2) speech contexts. Specifically, the findings of Reetzke et al. (2021) indicate that prolonged non-native immersion may not suffice to compensate for comprehension difficulties in non-native naturalistic listening environments. This view is consistent with the Interface Hypothesis (Sorace, 2011), which argues that advanced L2 speakers may still struggle to integrate the outputs of different linguistic computations in real time. As we discuss in Section 5.7, the integration of these outputs could be facilitated by white matter connections between different gray matter regions in the language network. Interestingly, these white matter connections can be more resistant to change than most of the gray matter regions that they innervate (Pliatsikas, 2020).
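The regression logic behind these studies can be illustrated with a minimal temporal response function (TRF) estimator: time-lagged copies of the envelope are regressed onto the neural signal with ridge regularization, and the correlation between the predicted and observed responses serves as the tracking index. The sketch below uses simulated data and invented parameters; it is a toy version of the forward-model approach implemented in toolboxes such as mTRF (Crosse et al., 2016), not the analysis pipeline of any study discussed here.

```python
import numpy as np

def lagged_matrix(env, n_lags):
    """Design matrix whose k-th column is the envelope delayed by k samples."""
    X = np.zeros((len(env), n_lags))
    for k in range(n_lags):
        X[k:, k] = env[: len(env) - k]
    return X

def trf_ridge(env, neural, n_lags=30, lam=1.0):
    """Estimate a temporal response function (TRF) by ridge regression."""
    X = lagged_matrix(env, n_lags)
    w = np.linalg.solve(X.T @ X + lam * np.eye(n_lags), X.T @ neural)
    return w, X @ w

# Simulated data: a neural channel responds to the envelope through a
# decaying impulse response, plus noise. All parameters are invented.
rng = np.random.default_rng(0)
env = rng.standard_normal(2000)
true_trf = np.exp(-np.arange(30) / 8.0)
eeg = lagged_matrix(env, 30) @ true_trf + 0.5 * rng.standard_normal(2000)

w, pred = trf_ridge(env, eeg)
tracking = np.corrcoef(pred, eeg)[0, 1]   # the neural-tracking index
```

In the studies above, a higher tracking value for one group or condition is interpreted as stronger cortical synchronization with the envelope; degrading the stimulus or withdrawing attention lowers it.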
Other EEG and MEG studies have examined the effects of bilingual experience on the amplitude or power (Figure 5.4a, right) of specific brain oscillations while listening to connected speech. Here, neural power is thought to index the integration of phonemes into more complex linguistic constituents, such as syllables, words, or phrases. These studies (e.g., Ding et al., 2016, 2017) have identified a potential relationship between the amplitude of neural oscillations in the delta (1–4 Hz) and theta (4–8 Hz) bands and the comprehension (delta amplitude) and intelligibility (theta amplitude) of speech signals. More recently, one MEG study (Blanco-Elorrieta et al., 2020) has found that the amplitude of neural oscillations in the delta band is modulated by language proficiency and language dominance. This finding indicates that the neural integration of phonemes into higher-order linguistic constituents is influenced by bilingual experience.
5.7 Structural and Functional MRI
The preceding sections have discussed the effects of bilingual experience along multiple neural latencies, from tens (e.g., FFR) to hundreds (e.g., MMN) of milliseconds following the stimulus onset. In this section, we focus on the structural and functional changes that underlie these effects. These changes can be investigated using MRI (Hernandez, 2013; Li et al., 2014). Most fMRI metrics are concerned with the amount and the type of BOLD signal in a particular region of interest (ROI) (Figure 5.4b). The amount of BOLD signal can be measured as the difference in magnitude, estimated as a beta coefficient (Poline & Brett, 2012), of the hemodynamic response in one experimental condition (e.g., while listening to non-native speech) over one control condition (e.g., while listening to native speech).
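The logic of this contrast can be sketched as an ordinary least-squares fit of condition regressors to a simulated voxel time series. The example below is a deliberate simplification of the standard general linear model (it omits hemodynamic response convolution, drift terms, and preprocessing), and the block design, condition labels, and effect sizes are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
# Block design: non-native speech (1), rest (0), native speech (2), rest (0)
cond = np.tile(np.repeat([1, 0, 2, 0], 10), 5)        # 200 scans
non_native = (cond == 1).astype(float)
native = (cond == 2).astype(float)
X = np.column_stack([non_native, native, np.ones(len(cond))])  # + intercept

# Simulated voxel time series: stronger response to non-native speech.
voxel = 2.0 * non_native + 1.0 * native + 0.3 * rng.standard_normal(len(cond))

betas, *_ = np.linalg.lstsq(X, voxel, rcond=None)
contrast = betas[0] - betas[1]    # "non-native > native" effect size
```

The beta coefficients recover the simulated condition amplitudes, and the contrast between them is the quantity typically reported as a difference in BOLD magnitude between experimental and control conditions.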
Several fMRI studies have leveraged speech training to understand the emergence of non-native speech representations in the cortex. Using corrective feedback, Callan et al. (2003) trained native speakers of Japanese to learn the English contrast /r-l/. Japanese speakers were trained with English minimal word pairs for approximately one month. Scans conducted right before and after the training regime revealed training-induced changes in BOLD signal in a wide variety of temporal, parietal, and frontal regions in the dorsal and the ventral auditory streams. The findings of Callan et al. (2003) have been extended for the most part to other speech contrasts in subsequent fMRI training studies (Golestani & Zatorre, 2004, 2009; Wang et al., 2003; Wong, Perrachione, & Parrish, 2007).
To investigate how patterns of BOLD activity translate into differential representations of speech sounds, fMRI research has also deployed multivariate pattern analysis (MVPA; see Figure 5.4c) (e.g., Feng, Yi, & Chandrasekaran, 2019; Haxby, 2012; Raizada et al., 2010). In MVPA, the focus is on how stimulus or task conditions affect the distribution of BOLD signals across a vector of voxels extracted from a particular brain region. Because local populations of neurons in the STG are differentially attuned to phonological features (Mesgarani et al., 2014), differences in the distributed patterns of activity across the STG reflect linguistically meaningful information. Using MVPA, prior fMRI work has shown that non-native training can enhance the representation of non-native speech patterns in the STG (Raizada et al., 2010). Relatedly, in a recent fMRI study by Feng et al. (2021), the degree of similarity between multivariate representations of phonemes in non-native and native listeners was measured to predict individual differences in non-native speech learning. The study found that native-like representations emerged in the STG after a few hundred trials during one training session, demonstrating the rapid change in functional response in this region.
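A minimal version of the dissimilarity computation in Figure 5.4c can be sketched as follows: trial-wise voxel patterns are averaged per condition, and neural distance is taken as one minus the Pearson correlation between the mean patterns. All patterns below are simulated, and the two "phoneme" prototypes are invented stand-ins for conditions in a real design.

```python
import numpy as np

rng = np.random.default_rng(2)
n_voxels, n_trials = 50, 40
proto_r = rng.standard_normal(n_voxels)   # hypothetical pattern for one phoneme
proto_l = rng.standard_normal(n_voxels)   # hypothetical pattern for another

# Trial-wise patterns: prototype plus measurement noise
trials_r = proto_r + 0.5 * rng.standard_normal((n_trials, n_voxels))
trials_l = proto_l + 0.5 * rng.standard_normal((n_trials, n_voxels))

def corr_distance(a, b):
    """Neural dissimilarity: 1 minus the Pearson correlation of two patterns."""
    return 1.0 - np.corrcoef(a, b)[0, 1]

between = corr_distance(trials_r.mean(axis=0), trials_l.mean(axis=0))
# Split-half distance within one condition serves as a noise baseline
within = corr_distance(trials_r[:20].mean(axis=0), trials_r[20:].mean(axis=0))
```

A between-condition distance that exceeds the within-condition baseline indicates that the region carries pattern information distinguishing the two sounds, which is the core inference in the MVPA studies cited above.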
Language experience-dependent changes in brain function are usually accompanied by structural changes in gray or white matter. One popular metric for structural MRI is voxel-based morphometry (VBM) (Danylkiv & Krafnick, 2020). This metric measures gray matter density within a voxel, which reflects the morphology of neuron cell bodies. Using this metric, MRI research has documented changes in gray matter density following non-native language training and non-native immersion (Li et al., 2014; Pliatsikas, 2020). Gray matter changes can also occur in both dorsal and ventral auditory streams, as well as in frontal and subcortical regions linked to cognitive control in bilinguals. For example, gray matter volume in the primary auditory cortex of bilinguals and successful non-native speech learners has been observed to differ from that of monolinguals (Ressel et al., 2012). As noted in Pliatsikas (2020), language experience-dependent structural changes may occur at different times depending on the brain region. For example, changes in cortical gray matter density are typically best observed during initial stages of non-native exposure, whereas changes in subcortical gray matter supporting cognitive control tend to require more language experience to emerge.
Changes in white matter connectivity may require even more language experience to emerge, and they are best observed in early and simultaneous bilinguals using metrics like fractional anisotropy (FA), which measures the integrity of myelinated axons based on the diffusion of water molecules. Further, MRI research on bilingualism has documented important changes in white matter tracts (García-Pentón et al., 2014; Olsen et al., 2015) in the dorsal and the ventral streams, including the IFOF and the superior longitudinal fasciculus. For instance, Kuhl et al. (2016) have documented increases in FA for these two fasciculi in bilinguals, but only after extensive non-native immersion and language use. This long time course to adjust white matter connections between different brain regions could explain why, as predicted by the Interface Hypothesis (Sorace, 2011), late bilinguals struggle to interface between different types of linguistic computations even after years of immersion.
5.8 Functional Near-Infrared Spectroscopy
Resilient to motion artifacts and easy to use, fNIRS technology is increasingly adopted in spaces where fMRI-based research is impractical, such as in child development research (Yeung, 2021). Measuring brain responses in neonates using fNIRS is relatively common (Peng & Hou, 2021), and sleeping babies three or fewer days old have shown broad increases in brain responses to foreign-accented over native-accented speech (Giordano et al., 2021). Using the oddball paradigm, Petitto et al. (2012) identified changes in left inferior frontal activity in infants exposed to one versus more than one language since birth. These results highlight how multilingual experience can alter the automatic neural discrimination of sounds from very early ages. As in ERP and fMRI research, fNIRS studies using executive function tasks have also identified differences between monolingual and bilingual children in inferior frontal cortex activity (Xie et al., 2021) and lateralization (Arredondo et al., 2017). This research shows that the effects of bilingualism on language-specific and domain-general processing can be traced back to the first years of life. In adults as well, an fNIRS adaptation of MMN research on lexical tone has found significant effects of AOA on speech sound processing (Zinszer et al., 2015).
The reduced operational barriers of fNIRS also permit wider deployment than fMRI, such as among bimodal bilinguals and cochlear implant users, because the optical signal is not subject to the same implant-related imaging artifacts as fMRI (Saliba et al., 2016). Researchers have used portable fNIRS systems to image hundreds of primary school students performing word reading and rhyming tasks in their second languages in Japan (Sugiura et al., 2015) and Côte d’Ivoire (Jasińska & Guei, 2018). Across sub-Saharan Africa and South Asia, fNIRS has dramatically increased the reach of developmental neuroscience to low- and middle-income communities, where fMRI is often impossible due to cost and infrastructure needs (Lloyd-Fox et al., 2016). In these spaces where cost and logistics prohibit fMRI-based research, fNIRS provides a distinct methodological advantage.
5.9 Final Remarks
In the foregoing sections, we have described just a thin cross-section of the neuroscientific research on bilingualism, but in doing so we aim to portray the broad methodological flexibility and unique sensitivity afforded by neural measures. Neurolinguistic methods have yielded a detailed, if still evolving, picture of how acoustic signals are transformed into linguistically relevant information by the human brain. Studies of monolingual speech have predominated in our understanding of language in the brain via the dual stream model, but, with a turn to bilingualism, neurolinguists have made major contributions to our understanding of both the brain and behavior.
Electrophysiological measures derived from onset (e.g., ERPs) and sustained (e.g., brain oscillations) responses to speech have demonstrated that bilingualism can influence the neural encoding and integration of speech units across multiple neural latencies and timescales, including those involved in preattentive processing. Tuning processes have also been revealed through structural and functional MRI, demonstrating language experience-dependent changes in vast cortical and subcortical networks. We have also illustrated several experimental paradigms that require neither attention nor response from participants. These paradigms permit research in children or in more naturalistic contexts, such as reading a magazine or listening to an audiobook. They are also useful to segregate computations that are specific to speech from simultaneous task-dependent processing.
Each of these discoveries has been enabled by the transparency afforded by neuroimaging, allowing researchers to access subtle processing differences where behavioral results offer a single response measure at the end of a series of processes. Moreover, the temporal and the spatial resolutions of neuroimaging data are steadily improving to permit finer-grained discrimination between linguistic representations in univariate and multivariate analyses, while, at the same time, falling equipment costs have dramatically changed the geography of neuroscience. Researchers in low- and middle-income communities are increasingly enabled to study underrepresented populations and under-documented languages. Questions about speech processing and bilingualism once thought impractical to pursue are now well within reach for neurolinguists around the world.
Acknowledgments
The MEG picture was generously provided by Paul Ferrari, director of the Jack H. Miller Magnetoencephalography Center at the Helen DeVos Children’s Hospital. The fMRI picture was generously provided by Douglas Befroy, facility director of the Biomedical Imaging Center, a core facility within the Center for Biomedical Research Support at the University of Texas at Austin. The fNIRS picture was generously provided by Maria Arredondo and Minerva Ramirez. Maria Arredondo is the director of the ChiLDS Lab at the University of Texas at Austin. Brains, white matter tracts, and head model images were created in Brainstorm (Tadel et al., 2011).
6.1 Introduction
Computational approaches have over the past two decades become an integral part of language acquisition research, establishing themselves as an important methodological counterpart to theoretical and experimental approaches. The benefit of using computational models lies in the explicitness of those models. Researchers formulating a linguistic theory in a computational context need to make that theory explicit enough that it can be implemented in a computer program. This typically means that the researcher develops an algorithm that prescribes how linguistic input is processed in a step-by-step fashion, transforming it into some sort of output. In the context of language acquisition, the model provides an exact characterization of the relation between linguistic input and linguistic knowledge that is somehow learned from that input. There is no room for vagueness or ambiguity in describing the process, as the entire input–output mapping has to be spelled out in a formal (programming) language.
Due to their explicitness, computational models provide insights into aspects of formal learnability. That is, models can be used to prove that a certain theory (of acquisition) is learnable or not. Simply put, when input is presented to the model, and the model cannot compute the intended output, the theory must be incorrect or incomplete. For example, Yang (2004) used computational modeling to show that a word segmentation model based on syllable transitional probabilities, as described in Saffran, Newport, and Aslin (1996), cannot segment monosyllabic words. Such a limitation could mean that the model needs to be revised, or perhaps the model needs to operate on a different type of representation.
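This style of argument can be made concrete with a stylized simulation of the transitional-probability (TP) strategy. In the Python sketch below, all syllable inventories are invented: word boundaries are posited at local minima in the TPs between adjacent syllables. The strategy recovers invented bisyllabic words, but a stream of monosyllabic words, where a boundary would be needed at every transition, yields no local minima at all, illustrating the kind of limitation Yang identified.

```python
from collections import Counter

def transitional_probs(stream):
    """TP(a -> b) = count(a b) / count(a _), over a stream of syllables."""
    pairs = Counter(zip(stream, stream[1:]))
    firsts = Counter(stream[:-1])
    return {(a, b): c / firsts[a] for (a, b), c in pairs.items()}

def segment(stream, tps):
    """Posit a word boundary wherever the TP between adjacent syllables
    is a strict local minimum (a Saffran-style segmentation strategy)."""
    tp = [tps[(a, b)] for a, b in zip(stream, stream[1:])]
    boundaries = {i + 1 for i in range(1, len(tp) - 1)
                  if tp[i] < tp[i - 1] and tp[i] < tp[i + 1]}
    words, current = [], []
    for i, syl in enumerate(stream):
        if i in boundaries:
            words.append(".".join(current))
            current = []
        current.append(syl)
    words.append(".".join(current))
    return words

# Invented bisyllabic words ba.by, do.gy, ga.vu in varied order: within-word
# TPs are 1.0, between-word TPs are lower, so TP minima mark boundaries.
stream = "ba by do gy ga vu ba by ga vu do gy ga vu ba by do gy ba by".split()
words = segment(stream, transitional_probs(stream))

# A stream of monosyllabic words would need a boundary at every transition,
# but adjacent positions cannot all be local minima; here none is found and
# the whole stream is returned as a single undersegmented chunk.
stream_mono = ["I", "see", "you"] * 4
chunks = segment(stream_mono, transitional_probs(stream_mono))
```

The contrast between the two runs is the point of the simulation: the same explicit procedure succeeds on one input type and demonstrably fails on another, which is exactly the kind of learnability result that computational implementation makes possible.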
In addition to the explicitness advantage, computational models crucially lead to testable predictions and hence provide an essential piece in the empirical testing of any theoretical proposal. Models are generally tested either against natural language corpora or against data from experiments with human participants. The former allows one to scale up the empirical testing of a theoretical proposal by testing it in a whole-language simulation (as opposed to a relatively small set of items used in experiments with human participants). The latter allows one to test if the model performs in a way similar to human learners, and thus addresses the cognitive plausibility of the model.
The past two decades have seen great advancements in the modeling of phonetic and phonological acquisition in a monolingual setting. There has been a wide body of research focused on the question of how infants learn phonetic categories from acoustic input (e.g., Adriaans, 2018; Adriaans & Swingley, 2017; de Boer & Kuhl, 2003; Dillon, Dunbar, & Idsardi, 2013; Feldman et al., 2013; McMurray, Aslin, & Toscano, 2009; Swingley & Alarcon, 2018; Vallabha et al., 2007). Research here has focused on two things: (i) identifying the learning mechanism that allows infants to form categories, and (ii) examining the properties of the input that may help or hinder the learning mechanism in achieving its goal. The consensus of these studies has been that the learning mechanism involves some type of clustering along acoustic dimensions (e.g., Maye, Werker, & Gerken, 2002), but it has not been clear how this mechanism operates on realistic input data. Computational and corpus studies have advanced our understanding of phonetic category learning by showing that categories are hard to learn from isolated phonetic tokens (Swingley, 2009), and that learning can be supported by contextual information, such as the lexical or phonological contexts in which phonetic tokens occur (e.g., Adriaans, 2018; Dillon et al., 2013; Feldman et al., 2013; Swingley & Alarcon, 2018).
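The clustering idea can be illustrated with a minimal mixture-of-Gaussians learner fit by expectation-maximization to simulated voice onset time (VOT) tokens. This is a toy, one-dimensional analogue of the distributional learners in this literature (e.g., McMurray, Aslin, & Toscano, 2009; Vallabha et al., 2007); the VOT means, variances, sample sizes, and initialization are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
# Simulated voice onset times (ms): a short-lag (/d/-like) and a long-lag
# (/t/-like) category; all distributional parameters here are invented.
vot = np.concatenate([rng.normal(15, 5, 300), rng.normal(70, 10, 300)])

# Two-component Gaussian mixture fit by expectation-maximization.
mu = np.array([10.0, 80.0])        # initial category means
sigma = np.array([20.0, 20.0])     # initial category spreads
pi = np.array([0.5, 0.5])          # initial mixing weights
for _ in range(50):
    # E-step: each category's responsibility for each token
    dens = pi * np.exp(-0.5 * ((vot[:, None] - mu) / sigma) ** 2) / sigma
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate means, spreads, and weights from responsibilities
    n_k = resp.sum(axis=0)
    mu = (resp * vot[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((resp * (vot[:, None] - mu) ** 2).sum(axis=0) / n_k)
    pi = n_k / len(vot)
```

With well-separated input distributions the learner recovers two categories near the generating means; the open question raised in the text is how such a mechanism behaves when the input is less clean, overlapping, or embedded in lexical context.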
In phonology, computational studies have focused on the learnability of phonological constraints and their rankings. The learnability perspective has been an integral aspect of Optimality Theory (OT; Prince & Smolensky, 2004; Tesar & Smolensky, 2000). In OT (and more recent approaches using Harmonic Grammar; Legendre, Miyata, & Smolensky, 1990), the phonological learning problem is defined as finding the appropriate ranking (or weighting) for a universal set of constraints, and various computational approaches have been proposed to determine the language-specific ranking of constraints (e.g., Boersma & Hayes, 2001; Potts et al., 2010; Prince & Tesar, 2004; Tesar & Smolensky, 2000). More recently, studies in computational phonology have focused on the induction of the constraints themselves (e.g., Adriaans & Kager, 2010; Gouskova & Gallagher, 2020; Hayes & Wilson, 2008). Computational approaches have thus helped to address questions regarding the origins of phonological constraints, to what extent they are learned from input data, and what type of input data they are learned from.
Despite these advancements in the modeling of phonetic and phonological acquisition in a monolingual setting, only a few studies have begun to address the computational modeling of bilingual acquisition. This chapter will put the problem of simultaneous bilingual phonetic and phonological acquisition in a computational perspective. First, a general introduction to computational modeling will be provided, using a simplified model of phonotactic learning as an example to illustrate the main methodological issues. The chapter will then discuss recent studies that have used computational modeling to study bilingual phonetic and phonological acquisition in three main areas: phonetic and phonological cues for bilingual input separation, bilingual phonology in computational models of speech comprehension, and computational models of L2 speech perception. The chapter will conclude by discussing several key challenges in advancing the development of computational models of bilingual phonetic and phonological acquisition.
6.2 The Computational Perspective
Before we discuss computational approaches to bilingual phonetics and phonology, we need to establish a general computational perspective, which will then be used to frame the problem of bilingual acquisition. Computational approaches present formal characterizations of the learning problem by distinguishing three basic components, illustrated in Figure 6.1. The input is the data available to the learner. Representing the input is a nontrivial issue, and choices that are made regarding the representation of input data affect how the model in the end will perform (Marr, 1982). Ultimately, the input of a model should be the same as the input to a human learner. In the case of early language development, this would be speech data. However, due to the complexities of working with unprocessed speech data, computational models typically operate on a simplified representation of the input.

Figure 6.1 A computational model provides an explicit description of an input–output mapping.
In phonetic learning, the input is often represented along one or more particular phonetic dimensions of interest, specified by the researcher. For example, a model learning a voicing contrast might be presented with input tokens represented along a single voice onset time dimension (e.g., McMurray et al., 2009). A model of vowel learning might be presented with tokens along two or three dimensions, such as vowel formants and duration (e.g., Vallabha et al., 2007). A more detailed representation of the speech signal can be obtained using techniques from Automatic Speech Recognition, where it is common to represent the speech signal as a sequence of vectors of thirty-nine (delta) Mel-Frequency Cepstral Coefficients (MFCCs), extracted from the signal at 10 ms intervals (Jurafsky & Martin, 2009), but the link to linguistic properties is not immediately apparent in such cases.
In phonological learning, the input to a computational model may vary from relatively simple sequential symbolic representations of phonological categories to more complex linguistically annotated forms. There is no one right way to represent the input to the learner, because different models aim to achieve different goals, and simplifications are necessary to keep the computational problem solvable. One must therefore critically examine whether the input representation is appropriate to answer the research question at hand.
The middle part in Figure 6.1 is the learning model that operates on the input. The model specifies how linguistic knowledge is learned from input data. This is the researcher’s theory of the learning process, implemented as a computer program. The model performs computations on the input, which then ultimately leads to some form of output. The output usually takes the shape of a new set of linguistic representations, such as a set of categories or a full-fledged grammar. In the context of acquisition, the model’s output should match either what infants or children know about a particular language or some general property of the language to be learned.
There are two things that should be noted at this point. First, all forms of modeling involve some level of simplification of the real-life scenario. That is, the input is simplified, the model is simplified, or (usually) both are simplified. This is inherent to the modeling approach, at least for the types of models that are currently available. Ideally, a computational model of language acquisition would explain the entire mapping from the first speech input to output in the form of an adult grammar. In practice, models focus on particular subproblems, each with their own input and output representations. One current issue in computational modeling is to uncover how these different learning problems are connected, as it seems increasingly unlikely that these different learning problems are solved in a strictly sequential way (e.g., Adriaans, 2018; Dillon et al., 2013; Feldman et al., 2013; Swingley, 2009). Each modeling study should thus be seen as providing a piece of the acquisition puzzle, and those varying pieces need to be connected to obtain a more complete picture of acquisition.
Second, it is important to realize that computational models can provide explanations at different levels. Marr (1982) identifies three different levels of modeling. At the computational level, models are focused on the goal of a computation, and the logical ways in which a problem can be solved. Such models may identify types of information which are necessary to solve a problem, but they do not aim to describe the process by which a human learner would solve a problem. In contrast, models at the algorithmic (or mechanistic) level aim to explain how a particular input representation is transformed into an output representation. Finally, the implementational level is concerned with the physical realization of the transformation.
These different levels mean that models should be interpreted and evaluated differently in the context of language acquisition. The main question regarding evaluation at the computational level is whether the model solves the problem or not. At this level, the best model is one that obtains maximal accuracy on some learning task. At the algorithmic level, the question is whether the model approaches the problem in the same way that human learners approach the problem. Since humans make errors, a model with maximal accuracy is not necessarily the best model. The model crucially needs to predict human errors, which in acquisition might take on the form of overgeneralization, U-shaped learning curves, intermediate developmental stages, and so on. At the implementational level, the question is whether the model reflects the physical structure of the learning process in the human brain. Data from neural studies are needed to evaluate such models.
Computational models typically operate on large amounts of input data, and the computations performed by the model can be quite complex. This can make it difficult to develop a thorough understanding of the modeling approach. The computational approach will therefore be illustrated in Section 6.2.1 using a fairly simplistic phonotactic learning model, which operates on a small sample of input data. Due to these simplifications, the computations can be tracked by hand, and it quickly becomes apparent how different methodological considerations affect the performance of the model. Later on in this chapter, we will illustrate how this phonotactic learning problem could be modeled for bilingual learners.
6.2.1 A Computational Model of Phonotactic Learning
The model of phonotactic learning that will be used is one that learns sequential biphone-based phonotactic probabilities (e.g., Bailey & Hahn, 2001; Cairns et al., 1997; Jusczyk, Luce, & Charles-Luce, 1994; Vitevitch & Luce, 1999). The model is simplified in the sense that it has a very limited interpretation of phonotactics: probabilities of adjacent segments. The model does not refer to prosodic structure, phonological features, or nonlocal dependencies. The input to the model is also simplified. We will assume that the input to the learner consists exclusively of the following two arbitrary American English utterances taken from the Buckeye Corpus (Pitt et al., 2007):
(6.1)
well i work in the accounting department i’m an accounting assistant
The entire corpus contains close to 300,000 words, and the corpus has been used as a phonetically transcribed approximation of spoken American English in various computational studies (e.g., Daland & Pierrehumbert, 2011). It should be noted, however, that this is an adult-directed conversational speech corpus, and a more ecologically valid set of acquisition input data would ideally involve transcriptions of child- or infant-directed speech. The “toy” corpus of two utterances shown here nevertheless suffices to illustrate the workings of the computational model.
6.2.1.1 Representing the Input
We first need to establish an appropriate input representation for this learning task. For the purposes of phonotactic learning, we require a phonemic transcription of the two utterances. Already at this early stage we are faced with making certain modeling decisions that will affect the performance of our model. Phonemic transcriptions provided in corpora are typically either canonical transcriptions, which means they are the result of looking up the orthographic word in a pronunciation dictionary, or they are variable transcriptions, which means they have been manually coded (or corrected) to reflect some of the variability found in spoken language, such as reductions and assimilations. A machine-readable canonical transcription for the toy corpus can be obtained using The CMU Pronouncing Dictionary, Version 0.7b (Carnegie Mellon University, 2014), which transforms the utterances into a machine-readable phonemic representation (in ARPAbet notation):
(6.2)
w eh l # ay # w er k # ih n # dh ah # ah k aw n t ih ng # d ih p aa r t m ah n t ay m # ae n # ah k aw n t ih ng # ah s ih s t ah n t
This canonical transcription is notably different from the manually coded variable transcription that has been included in the corpus:
(6.3)
w ah # aa # w er k # ih n # n ih # ah k aw iy # ih p aa r t m ih n aa m # ah # ah k aw n iy ng # ih s ih s t eh n t
The difference between these two different input representations can be seen in the word accounting, which is transcribed as ah k aw n t ih ng in the canonical transcription and as either ah k aw iy or ah k aw n iy ng in the variable transcription. The variable transcription provides a closer match to the actual speech signal, as it codes the outcome of natural speech production processes that occur in everyday spoken language. (The word accounting has various other forms in the corpus.) Nevertheless, canonical transcriptions are commonly used in modeling studies. In fact, phonological learning is commonly assumed to operate on word types rather than on word tokens (e.g., Albright, 2009; Hay, Pierrehumbert, & Beckman, 2004; Pierrehumbert, 2003; Richtsmeier, 2011), which presupposes that the learner has internalized a vocabulary of canonical phonological forms.
Additional input representations are possible, depending on assumptions that are made regarding the input that is used in learning. Several studies on phonotactic learning have suggested that early phonotactics might be learned from continuous speech, rather than from the lexicon (e.g., Adriaans & Kager, 2010, 2017; Brent & Cartwright, 1996; Daland & Pierrehumbert, 2011; Sundara & Breiss, 2020). In these models, phonotactic probabilities are used for the detection of word boundaries in continuous speech. If we assume that words (or word boundaries) are not available to the learner, then this would change the representation of the input. Below is an adapted version of the variable input representation, where word boundaries have been removed:
(6.4)
w ah aa w er k ih n n ih ah k aw iy ih p aa r t m ih n aa m ah ah k aw n iy ng ih s ih s t eh n t
A final issue regarding input representations is that the learner might be biased to process or attend to different parts or properties of the input. For example, one might hypothesize that in early phonological learning the learner might focus more on consonants than on vowels in the input (e.g., Bonatti et al., 2005; Hochmann et al., 2011). Such biases can be taken into account in modeling by filtering out particular tokens from the input representation (e.g., Kastner & Adriaans, 2018). For example, if we assume that phonotactics is learned exclusively from a consonantal tier, then we could represent the toy corpus as follows:
(6.5)
w w k n n k p r t m n m k n ng s s t n t
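The consonantal-tier representation in (6.5) amounts to filtering vowel symbols out of the transcription before any counting takes place. This can be sketched in a few lines of Python (our own illustration; the vowel set is hand-listed for this toy transcription, with er treated as vocalic):

```python
# Vowel symbols occurring in the toy transcription in (6.4).
VOWELS = {"aa", "ah", "aw", "eh", "er", "ih", "iy"}

# The boundary-less variable transcription from (6.4).
utterance = ("w ah aa w er k ih n n ih ah k aw iy ih p aa r t m ih n aa m "
             "ah ah k aw n iy ng ih s ih s t eh n t").split()

# Keep only the consonantal segments, yielding the tier in (6.5).
consonant_tier = [seg for seg in utterance if seg not in VOWELS]
print(" ".join(consonant_tier))
# w w k n n k p r t m n m k n ng s s t n t
```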
The right representation for training and testing computational models thus depends on the researcher’s assumptions (and evidence) regarding the nature of the input that is used for learning. The choice of input representation has direct consequences for the linguistic knowledge that is learned from it. As we will see, a bilingual learning environment will complicate this input issue further.
6.2.1.2 The Learning Model
The model represents the researcher’s theory regarding the human learning mechanism (in the case of modeling at the algorithmic level) or the general problem-solving strategy (in the case of modeling at the computational level). In this example, we will assume that probabilistic phonotactics is learned through a mechanism of statistical learning (e.g., Saffran, Aslin, & Newport, 1996; Saffran, Newport, et al., 1996). Specifically, phonotactic probabilities are learned by computing transitional probabilities for each sequence of two phonemes in the input (e.g., Albright, 2009; Cairns et al., 1997; Vitevitch & Luce, 1999). We adopt this learning mechanism not because it provides a full account of a language’s phonotactics (it doesn’t) but because it is a straightforward model that is easy to compute directly from the toy corpus. Formally, it is an N-gram (in this case bigram) language model applied to phonemic representations (see Jurafsky & Martin, 2009). Applying this learning algorithm to the toy corpus of variable transcriptions in (6.4) results in the phonotactic probabilities shown in Table 6.1.
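For concreteness, the transitional probabilities can be computed in a few lines of Python. This is our own minimal sketch of the biphone model, not code from the studies cited; several of the values it produces can be checked against Table 6.1:

```python
from collections import Counter

# The variable transcription in (6.4), with word boundaries removed.
corpus = ("w ah aa w er k ih n n ih ah k aw iy ih p aa r t m ih n aa m "
          "ah ah k aw n iy ng ih s ih s t eh n t").split()

# Transitional probability: TP(x y) = count(x y) / count(x as first member).
biphones = Counter(zip(corpus, corpus[1:]))
firsts = Counter(x for x, _ in zip(corpus, corpus[1:]))
tp = {xy: count / firsts[xy[0]] for xy, count in biphones.items()}

print(round(tp[("er", "k")], 2))   # 1.0 (cf. Table 6.1)
print(round(tp[("w", "ah")], 2))   # 0.5
print(round(tp[("aa", "m")], 2))   # 0.33
```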
Table 6.1 A model of phonotactic learning based on biphone transitional probabilities (TPs) in the toy corpus
| Biphone | TP | Biphone | TP | Biphone | TP | Biphone | TP | Biphone | TP |
|---|---|---|---|---|---|---|---|---|---|
| aa m | 0.33 | aw n | 0.50 | iy ih | 0.50 | n iy | 0.25 | s t | 0.50 |
| aa r | 0.33 | eh n | 1.00 | iy ng | 0.50 | n n | 0.25 | t eh | 0.50 |
| aa w | 0.33 | er k | 1.00 | k aw | 0.67 | n t | 0.25 | t m | 0.50 |
| ah aa | 0.25 | ih ah | 0.17 | k ih | 0.33 | ng ih | 1.00 | w ah | 0.50 |
| ah ah | 0.25 | ih n | 0.33 | m ah | 0.50 | p aa | 1.00 | w er | 0.50 |
| ah k | 0.50 | ih p | 0.17 | m ih | 0.50 | r t | 1.00 | – | – |
| aw iy | 0.50 | ih s | 0.33 | n ih | 0.25 | s ih | 0.50 | – | – |
It should be noted that the model could be applied to any of the representations in (6.2)–(6.5). Each of these representations would lead to a different set of phonotactic probabilities.
The model processes the input data either incrementally (one biphone is presented to the model at a time, and at each step the model updates its probabilities) or in a batch (all biphones are presented to the model at once, and the model performs computations over the entire set). The outcomes of these two ways of input processing may or may not be the same, depending on how the model employs memory. If an incremental model has some form of memory decay, it will gradually forget older tokens, and the resulting model will be different from the batch model because it relies more heavily on more recently processed biphones. This also means that the order in which data are presented to the model matters. However, if the incremental model makes the assumption of perfect memory, the outcome of the incremental and the batch versions will be the same: a set of probabilities computed over the entire input (such as the ones shown in Table 6.1).
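The contrast between perfect memory and memory decay can be made concrete with a small sketch (our own illustration; the decay parameter and function name are hypothetical, not taken from the literature). With decay = 1.0 the incremental model reproduces the batch probabilities; with decay < 1.0, recently processed biphones weigh more heavily:

```python
from collections import defaultdict

def incremental_tp(corpus, decay=1.0):
    # Process one biphone at a time; before each update, existing counts
    # are multiplied by `decay` (1.0 = perfect memory, < 1.0 = forgetting).
    biphone = defaultdict(float)
    first = defaultdict(float)
    for x, y in zip(corpus, corpus[1:]):
        for d in (biphone, first):
            for key in d:
                d[key] *= decay
        biphone[(x, y)] += 1.0
        first[x] += 1.0
    return {xy: c / first[xy[0]] for xy, c in biphone.items()}

# A fragment of the toy corpus in (6.4).
corpus = "w ah aa w er k ih n n ih ah k aw iy".split()
perfect = incremental_tp(corpus, decay=1.0)
decayed = incremental_tp(corpus, decay=0.9)

print(perfect[("w", "ah")])  # 0.5: w is followed once by ah and once by er
# With decay, w er outweighs w ah because it was processed more recently.
print(decayed[("w", "ah")] < decayed[("w", "er")])  # True
```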
The probabilities shown in Table 6.1 are of course not representative of English phonotactics, since they are based on only two utterances. Normally one would apply the model to a larger, more realistic data set (such as the whole corpus), after which the performance of the model can be evaluated.
6.2.1.3 The Output and Its Evaluation
We have now applied a formalized learning procedure to some input data, and this generated output in the form of a set of phonotactic probabilities. For this to be a “good” model of phonotactic learning, it should perform well on some sort of evaluation task. Evaluation of a model depends on the research question it is meant to answer. Some straightforward questions that could be asked are: To what extent do these probabilities capture English phonotactics? To what extent do these probabilities match English speakers’ knowledge of phonotactics? To what extent do the probabilities match infants’ knowledge of phonotactics? To what extent are these probabilities useful for further language development? Each question requires a different type of data to be used as a gold standard in evaluation. For example, if we are interested in modeling English speakers’ knowledge of phonotactics, the probabilities could be matched with human judgments of the well-formedness of nonwords (e.g., Albright, 2009). If we are interested in the usefulness of these probabilities for further language development, we could assess the effectiveness of the model in predicting the locations of word boundaries in continuous speech (e.g., Adriaans & Kager, 2010; Daland & Pierrehumbert, 2011).
Setting up a test case for the model requires us to make another formalization: How exactly are the probabilities used to predict well-formedness or segmentation behavior? For well-formedness judgments, nonword probabilities could be obtained by multiplying biphone probabilities within the nonword. Using this approach, our toy model in Table 6.1 would predict that, for example, m ah k is more well-formed than m ih n, because it has a higher probability (0.5 * 0.5 = 0.25 for m ah k, versus 0.5 * 0.33 = 0.17 for m ih n).
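This multiplication scheme is easily stated as code. The sketch below uses a handful of probabilities from Table 6.1; the function name wellformedness is our own:

```python
from math import prod

# A few biphone probabilities taken from Table 6.1.
tp = {("m", "ah"): 0.5, ("ah", "k"): 0.5, ("m", "ih"): 0.5, ("ih", "n"): 0.33}

def wellformedness(segments, tp):
    # Multiply the TPs of all biphones in the form;
    # an unattested biphone contributes probability 0.
    return prod(tp.get(pair, 0.0) for pair in zip(segments, segments[1:]))

print(wellformedness("m ah k".split(), tp))  # 0.25
print(wellformedness("m ih n".split(), tp))  # 0.165
```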
For word segmentation, two different formalizations have been proposed regarding the application of transitional probabilities to the task of boundary detection. One is segmentation based on “troughs”: a boundary occurs whenever the probability of some bigram sequence xy is lower than both of its neighboring bigrams wx and yz. While this implementation is straightforward, it has some learnability consequences. As Yang (2004) pointed out, one consequence of this formalization is that unigram words cannot be segmented as they would require two minima next to each other, which is not possible by definition. If one assumes syllables as the basic unit over which to compute transitional probabilities, then monosyllabic words cannot be segmented using the trough-based approach. If one assumes phonemes as the basic unit, then monophonemic words cannot be segmented. The latter is less consequential because English has only a small number of monophonemic words (e.g., “a,” “I”).
A different approach is to set a threshold on the probabilities: whenever a probability is below some threshold t, a boundary is inserted. This approach does not face the same limitation as the trough-based approach, but it raises another issue: t is a parameter of the model, and it is not clear a priori what its value should be. Changing the value of the parameter will change the behavior of the model, so this is a nontrivial issue. It also points to a more general issue in modeling. Computational models often have various parameters that can be set to improve the fit of the model to the data. This is a potential criticism of the modeling approach, since it could be argued that, with the right parameter settings, any phenomenon or data set could be modeled. What does the model itself then prove, exactly? It is therefore important that computational studies give an explicit description of the parameters in the model, which values they take, and how they affect the model’s performance.
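Both boundary-insertion schemes can be formalized in a few lines. The sketch below is our own rendering of the two approaches, applied to a hypothetical four-segment sequence rather than the toy corpus:

```python
def segment_threshold(segments, tp, t):
    # Insert "#" between x and y whenever TP(x y) falls below the threshold t.
    out = [segments[0]]
    for x, y in zip(segments, segments[1:]):
        if tp.get((x, y), 0.0) < t:
            out.append("#")
        out.append(y)
    return " ".join(out)

def segment_trough(segments, tp):
    # Insert "#" inside biphone xy when its TP is lower than both
    # neighboring biphones wx and yz (a "trough").
    probs = [tp.get(pair, 0.0) for pair in zip(segments, segments[1:])]
    out = [segments[0]]
    for i in range(1, len(segments)):
        p = probs[i - 1]
        left = probs[i - 2] if i >= 2 else float("inf")
        right = probs[i] if i < len(probs) else float("inf")
        if p < left and p < right:
            out.append("#")
        out.append(segments[i])
    return " ".join(out)

# Hypothetical mini-example: the low-probability biphone (b, c) is a trough
# and also falls below the threshold.
tp = {("a", "b"): 0.9, ("b", "c"): 0.1, ("c", "d"): 0.8}
print(segment_threshold("a b c d".split(), tp, t=0.3))  # a b # c d
print(segment_trough("a b c d".split(), tp))            # a b # c d
```

Note that segment_trough can never place boundaries on both sides of a single segment, since that would require each of two adjacent biphone probabilities to be strictly lower than the other; this is precisely Yang’s (2004) point about unigram words.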
In (6.6) is an example of how the model in Table 6.1 would segment the first utterance from the training set based on the two different approaches. We use an utterance from the training set here for illustrative purposes, but it should be noted that normally one would use novel test utterances that did not occur in the training set. The gold standard (“correct”) segmentation is included to illustrate where the boundaries should have been.
(6.6)
i. Threshold (t = 0.3): w ah # aa w er k ih n # n # ih # ah k aw iy ih # p aa r t m ih n
ii. Trough: w ah # aa w er k ih n n ih # ah k aw iy ih # p aa # r t m ih n
iii. Gold standard: w ah # aa # w er k # ih n # n ih # ah k aw iy # ih p aa r t m ih n
We can now ask which of the two approaches leads to a better segmentation, by employing commonly used evaluation metrics such as Precision (= number of hits / (number of hits + number of false alarms)) and Recall (= number of hits / (number of hits + number of misses)). While such metrics indicate which approach is better at solving the problem of word segmentation, this does not necessarily mean that human learners use the same approach. A stronger test of a computational model would therefore be to derive novel predictions from the model and design experiments specifically aimed at testing these predictions. But even without such a test, the example shown in this section illustrates how computational approaches force the researcher to posit an explicit theory, and to assess which approach would be more successful in solving a particular problem.
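As a worked example of these metrics, boundary positions can be read off (6.6) by recording each boundary as the index of the token it precedes (the position sets below are hand-coded by us from (6.6i) and (6.6iii); this is a sketch, not an evaluation from the literature):

```python
def precision_recall(predicted, gold):
    # Precision = hits / (hits + false alarms); Recall = hits / (hits + misses).
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Boundary positions hand-coded from (6.6): each boundary is recorded
# as the index of the token it immediately precedes.
threshold_pred = {2, 8, 9, 10, 15}   # from (6.6i)
gold = {2, 3, 6, 8, 10, 14}          # from (6.6iii)

p, r = precision_recall(threshold_pred, gold)
print(p, r)  # 0.6 0.5 — three hits, two false alarms, three misses
```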
6.3 Computational Modeling of Bilingual Phonetic and Phonological Acquisition
Computational approaches can be extremely valuable in understanding the complexities of the bilingual speech input that infants receive, and the mechanisms they employ to learn two languages from that input. Using computational models, concrete solutions can be formulated by which infants could navigate their complex language environment. Despite this potential, computational approaches to bilingual phonetics and phonology have only recently started to emerge. This section will first discuss existing studies that have used computational approaches in three different research areas. We will then zoom in on certain aspects of bilingual modeling by returning to our toy example. These examples show the potential of computational modeling to address essential questions in bilingualism that are hard to address directly with infant experiments.
6.3.1 Phonetic and Phonological Cues for Bilingual Input Separation
One major challenge in early bilingual acquisition is to distinguish two different languages in the input speech stream. The ability to detect and separate languages could, for example, allow infants to develop separate statistical distributions to learn sound categories for two languages (Reference Curtin, Byers-Heinlein and WerkerCurtin, Byers-Heinlein, & Werker, 2011; Reference Sundara and ScutellaroSundara & Scutellaro, 2011). Because the ability to detect and/or separate languages in the input is crucial for the development of two languages, computational studies have focused on identifying cues that might be used by learners to solve this problem. Several studies have assessed the degree to which either acoustic or phonological cues can be used to separate two languages in the input.
Several studies have attempted to identify ways in which the two languages that are present in a child’s input might be separated on the basis of the acoustic properties of the signal (Carbajal, 2018; Carbajal, Dawud, et al., 2016; Carbajal, Fér, & Dupoux, 2016; Dehak et al., 2011; de Seyssel & Dupoux, 2020). These studies use an i-vector approach, which was originally developed for automatic speaker identification (Dehak et al., 2010). In this approach, the entire acoustic space of a speech data set is modeled as a Universal Background Model (UBM), and individual utterances within the data set are represented as deviations from the UBM. Utterances can then be clustered into different languages, and new test utterances are classified according to their similarity to the clusters. This approach has been used for automatic language identification (Dehak et al., 2011), and more recently for the modeling of language separation in bilingual speech input (Carbajal, 2018; Carbajal, Dawud, et al., 2016; Carbajal, Fér, et al., 2016; de Seyssel & Dupoux, 2020).
Carbajal, Dawud, et al. (2016) used i-vector representations to separate English and Xitsonga speech in training conditions that employed either separated (monolingual) training data or mixed (bilingual) training data. They found that separation using this approach was more successful when trained on monolingual data than on mixed bilingual data. Their results also pointed to the relevance of speaker information in overcoming the complexities of being exposed to mixed data. De Seyssel and Dupoux (2020) extended this approach to include a bilingual condition where speakers spoke each of the two languages (English and Finnish, or English and German). They found that close languages (English and German) are more difficult to separate than more distant languages (English and Finnish), and that bilingual input where speakers speak both languages is harder to cluster than input where speakers speak only one language (the “one parent, one language” approach).
Taking a phonological approach to language separation, Adriaans (2020) assessed segmental and phonotactic cues for input separation of English and Dutch. Computational modeling was used to determine the effectiveness of different cues in predicting the origin language in mixed data, as well as the robustness of these cues when dealing with different degrees of mixed input. Probabilistic models based on either the relative frequencies of segments or on phonotactic probabilities (biphone and triphone probabilities) were trained on combined samples from English and Dutch corpora in a variety of input mixing proportions, ranging from completely separated training data to 50:50 mixed input. The study found that phonotactics (in particular, biphones) provided the model with a cue for language separation which scored well in terms of both accuracy and robustness.
6.3.2 Bilingual Phonology in Computational Models of Speech Comprehension
Some of the earliest computational approaches to bilingual phonetics/phonology were phonological components embedded in larger computational models of spoken word comprehension. For example, Li and Farkas (2002) proposed a Self-Organizing Connectionist Model of Bilingual Processing (SOMBIP) which includes a bilingual lexicon of phonological forms. The bilingual lexicon was trained by presenting the model with the 400 most frequent English and Chinese word types taken from the Hong Kong Bilingual Corpus from CHILDES (MacWhinney, 2014; Yip & Matthews, 2000). No explicit language markers or labels were given to the model. Instead, the model was presented simultaneously with semantic and phonological representations and the representations self-organized on the basis of the similarity between representations (which included a fixed CVVCCVVC prosodic template, where each C and V was represented by five different features; see also Li & MacWhinney, 2002). After training, the lexicon showed two distinct clusters of English and Chinese lexical representations, indicating that a bilingual lexicon can be learned without explicit language labels.
In similar work using Self-Organizing Maps with English and Spanish words, Shook and Marian (2013) noted that their Bilingual Language Interaction Network for Comprehension of Speech (BLINCS) separates the two input languages automatically on the basis of phonotactics, giving further computational support for the potential relevance of phonotactics in language separation during bilingual acquisition.
6.3.3 Computational Models of L2 Speech Perception
Several models of L2 speech perception have been implemented as computational models. These models do not perform simultaneous bilingual acquisition, but rather explore the effects of an established first language on L2 perception. For example, Keidel et al. (2003) trained a neural network on English CV syllables, and tested the model on the discrimination of isiZulu stimuli. Their model showed similar discrimination behavior to the human data reported in Best, McRoberts, and Goodell (2001; see also Chapter 7, this volume). The model achieved this performance using exclusively acoustic information, without reference to articulatory gestures.
More recently, van Leussen and Escudero (2015) presented a computational implementation of Escudero’s Second Language Linguistic Perception (L2LP) model (Escudero, 2005; see also Chapter 8, this volume). Their model gradually learned a two-way Spanish L2 vowel contrast after establishing a three-way Dutch L1 vowel contrast. One interesting feature of their model, in addition to being able to learn subsets of categories, is that it aims to explain the learning trajectory from non-native to native-like perception, as opposed to L2 perception at one given point in development.
6.3.4 A Computational Model of Bilingual Phonotactic Learning
To illustrate the problem of bilingual input, we return to the “toy” phonotactic learning example from Section 6.2.1, and put the computational model in a simulated bilingual environment. Being simultaneously exposed to two different languages complicates the learning problem, at least from a computational point of view. This can be seen in the example in (6.7), where we have expanded the two-sentence English toy corpus to include two more sentences taken from another language: Dutch. The two sentences were taken arbitrarily from the Spoken Dutch Corpus (Goddijn & Binnenpoorte, 2003), a corpus which is similar to the English Buckeye corpus in terms of size and level of transcription.
(6.7)
well i work in the accounting department i’m an accounting assistant
laten we dit in de toekomst ook voortzetten (“let’s also continue this in the future”)
vooral als je in september dan pas weer start (“especially if you only start again in September”)
All four sentences come from adult spoken language corpora, so this simulated bilingual toy corpus is not reflective of an actual child’s input, but they contain parts of the phonology of their respective languages, which allows us to illustrate the effects of language mixing on phonotactic probabilities.
Using computational modeling, we can simulate bilingual phonotactic learning under different input conditions. For example, we can train one model that assumes that the learner is able to separate the languages in the input, and another model that assumes that the learner is not able to separate the input. Comparing the output of the two models allows us to quantify the effects of language mixing.
The Separated model would generate two language models, one for each language. The English part of this Separated model would be identical to the model presented in Table 6.1, and the Dutch part of the model would be computed in a similar way, but exclusively from the Dutch data. When using phonotactic probabilities in a word segmentation task, the Separated model would segment English test sentences the way it did in (6.6), because there is no influence of the second language.
In contrast, the Mixed model would compute transitional probabilities over the entire four-sentence data set, and would use the resulting probabilities on any test set it is presented with, regardless of origin language. The phonotactic probabilities in this Mixed model would match the probabilities of neither English nor Dutch. By means of simulation we can quantify how detrimental this would be to a particular acquisition task, such as word segmentation.
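The Separated and Mixed training regimes can be sketched as follows. Since the Dutch phonemic transcriptions are not reproduced in this section, the "Dutch" data below are a purely hypothetical placeholder string (not the actual corpus material); the point is only that pooling the input changes the English probabilities:

```python
from collections import Counter

def train_tp(utterances):
    # Count biphones within each utterance (no biphones across utterances),
    # then convert counts to transitional probabilities.
    biphones, firsts = Counter(), Counter()
    for utt in utterances:
        biphones.update(zip(utt, utt[1:]))
        firsts.update(x for x, _ in zip(utt, utt[1:]))
    return {xy: c / firsts[xy[0]] for xy, c in biphones.items()}

# The toy English corpus from (6.4).
english = [("w ah aa w er k ih n n ih ah k aw iy ih p aa r t m ih n aa m "
            "ah ah k aw n iy ng ih s ih s t eh n t").split()]
# Hypothetical placeholder standing in for the Dutch transcriptions.
dutch = ["d ah n er s t ah k er t".split()]

separated = {"eng": train_tp(english), "nl": train_tp(dutch)}
mixed = train_tp(english + dutch)

print(separated["eng"][("er", "k")])  # 1.0 in the English-only model
print(mixed[("er", "k")])  # lower: the placeholder adds er-initial biphones
```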
Table 6.2 shows the phonotactic probabilities of the English biphones from Table 6.1 when the training set of the model includes the two Dutch utterances. (This is a subset of the total Mixed model, which also contains biphones that occur exclusively in the Dutch data.) Due to the mixed input, the probabilities have changed in two ways. They have changed in an absolute sense: most of the probabilities are lower than in the English-only (Separated) model. This is because the addition of a new language has resulted in an increase of the combinatorial possibilities. For example, where the Separated model says that in English n can be followed by four different phonemes (ih, iy, n, t), the Dutch data have introduced three more possible successors to n (s, p, d) in the Mixed model. This by itself is not problematic: in both the Separated and the Mixed model, all of these n-initial biphones have equal probabilities; the absolute values are simply lower in the Mixed model. (In fact, all of these combinations are possible combinations in “real” English as well; they just do not occur in “toy” English.)
Table 6.2 A biphone model with mixed English-Dutch statistics
| Biphone | TP | Biphone | TP | Biphone | TP | Biphone | TP | Biphone | TP |
|---|---|---|---|---|---|---|---|---|---|
| aa m | 0.13 | aw n | 0.50 | iy ih | 0.50 | n iy | 0.14 | s t | 0.38 |
| aa r | 0.25 | eh n | 0.25 | iy ng | 0.50 | n n | 0.14 | t eh | 0.20 |
| aa w | 0.13 | er k | 1.00 | k aw | 0.40 | n t | 0.14 | t m | 0.10 |
| ah aa | 0.13 | ih ah | 0.11 | k ih | 0.20 | ng ih | 1.00 | w ah | 0.50 |
| ah ah | 0.13 | ih n | 0.44 | m ah | 0.25 | p aa | 0.67 | w er | 0.25 |
| ah k | 0.25 | ih p | 0.11 | m ih | 0.25 | r t | 0.50 | – | – |
| aw iy | 0.50 | ih s | 0.22 | n ih | 0.14 | s ih | 0.13 | – | – |
More interestingly, the relative phonotactic probabilities change in the Mixed model as well. For example, w ah and w er were equally probable in the Separated model, but in the Mixed model w ah has become more likely than w er. It is because of these relative changes that the model will behave differently on any task we might give the model, such as predicting well-formedness judgments or predicting word boundaries. This can be seen in the following example where the Separated and the Mixed model make different predictions in terms of word boundaries:
(6.8)
i. Separated: w ah # aa w er k ih n n ih # ah k aw iy ih # p aa # r t m ih n
ii. Mixed: w ah aa w er k # ih n n ih # ah k aw iy ih # p aa # r t # m ih n
iii. Gold standard: w ah # aa # w er k # ih n # n ih # ah k aw iy # ih p aa r t m ih n
The Mixed model makes different predictions at three locations (the points where (6.8i) and (6.8ii) differ), and comparison to the gold standard indicates that in two of the three cases the mixed prediction is incorrect. Mixing English and Dutch data in this case thus had a negative impact on the performance of the model. This is of course just a toy simulation, so no conclusions should be drawn about mixing English and Dutch from this example, but the example illustrates how mixed statistics could lead to problems in solving computational tasks that are central to language development. Computational modeling provides a way to quantify the impact of mixed input, and to explore potential solutions to separate data coming from two different languages (e.g., Adriaans, 2020; Carbajal, Dawud, et al., 2016; Shook & Marian, 2013).
6.4 Future Directions
This chapter discussed current computational approaches to bilingual phonetic and phonological acquisition. There is a sizeable gap between the number of computational studies on monolingual acquisition and the number of computational studies on bilingual acquisition, and it will take many more modeling studies to start closing that gap. The gap is not entirely surprising, though. As other chapters in this volume illustrate, bilingualism comes in many different forms. Bilingual learners vary greatly in terms of the age at which the second language is introduced, the input quantities for each of the different languages, the distribution of languages over speakers in the learner’s environment, and so on. One key challenge in the computational modeling of bilingual phonetics and phonology will be to take all these factors into account. Computational approaches can help us to understand bilinguals’ impressive learning mechanisms by making explicit the learning process and the connections between the factors that affect it, and by testing hypotheses regarding bilingual acquisition against natural language data.
Currently, however, there is a limited availability of ecologically valid training and test materials for bilingual computational models. Modeling studies have used adult spoken language corpora of different languages (often from different sources) to simulate bilingual input. While this gives an impression of the overall complexity of mixed input from two languages, and there is some evidence that suggests that input from a single bilingual speaker might not be very different from the combined input from two monolingual speakers (Danielson et al., 2014), ultimately there is a need for large phonetically and/or phonologically transcribed bilingual corpora to be used in simulations of bilingual learning.
Another challenge lies in modeling the individual learning trajectories of bilingual learners. Modeling has so far mostly focused on identifying overall solutions to learning a language. A major challenge with modeling bilingualism, perhaps even more so than with monolingual modeling, is to account for individual differences that may arise from factors such as different degrees of parental language mixing (e.g., Byers-Heinlein, 2013). Studies that test the behavior of a model at different input mixing proportions (e.g., Adriaans, 2020) are a first step toward understanding how differently shaped input can lead to different outcomes.
Regarding the bilingual learning mechanisms used in phonetic and phonological acquisition, much remains unknown. Computational models can help here by exploring how a particular problem could in principle be solved. Highly debated questions such as whether bilingual learning mechanisms are fundamentally different from monolingual mechanisms, or the extent to which bilingual learners separate their languages, can be approached from a computational perspective. In order to assess computational models that simulate human bilingual learning and learning trajectories, the output of the models can be evaluated against developmental findings. Ultimately a cycle between computational modeling and experiments with human participants is needed to uncover the full complexities involved in bilingual phonetic and phonological acquisition.