
An exception-filtering approach to phonotactic learning

Published online by Cambridge University Press:  22 April 2025

Huteng Dai*
Affiliation:
Department of Linguistics, University of Michigan, Ann Arbor, MI, USA; Department of Linguistics, Rutgers University, New Brunswick, NJ, USA

Abstract

Phonotactic learning has been fertile ground for research in phonology. However, the challenge of lexical exceptions in phonotactic learning remains largely unexplored. Traditional learning models, which typically assume that all observed input data are grammatical, often blur the distinction between lexical exceptions and grammatical words, thereby skewing the learning results. To address this issue, this article introduces a categorical-grammar-plus-exception-filtering approach that harnesses the discrete nature of categorical grammars to filter out lexical exceptions using statistical criteria adapted from probabilistic models. Applied to naturalistic corpora of English, Polish and Turkish, the learnt grammars correlate highly with the acceptability judgements from behavioural experiments. Compared to benchmark models, the model performs increasingly better with data that contain a higher proportion of lexical exceptions, reaching its peak in learning Turkish non-local vowel phonotactics, which highlights its ability to handle lexical exceptions.
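The core idea of exception filtering can be sketched in a few lines. The snippet below is a minimal illustrative implementation, not the article's actual algorithm: it assumes a bigram (two-factor) model, estimates each bigram's expected type frequency from a simple unigram-independence baseline, and marks a bigram as grammatical only when its observed/expected ratio exceeds a threshold (here called `theta_max`, echoing the $\theta_{\max}$ parameter reported in the tables below). Attested bigrams that fail the test are treated as lexical exceptions.

```python
from collections import Counter
from itertools import product


def exception_filtering(words, theta_max=0.1):
    """Toy exception-filtering bigram learner (illustrative sketch).

    A bigram is judged grammatical iff its observed/expected
    type-frequency ratio exceeds theta_max; attested bigrams that
    fail the test are treated as lexical exceptions.
    """
    unigrams, bigrams = Counter(), Counter()
    for w in words:
        symbols = ['#'] + list(w) + ['#']  # '#' marks word boundaries
        unigrams.update(symbols)
        bigrams.update(zip(symbols, symbols[1:]))

    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    grammar = {}
    for a, b in product(unigrams, repeat=2):
        observed = bigrams[(a, b)]
        # Expected count if symbols combined independently of each other
        expected = total_bi * (unigrams[a] / total_uni) * (unigrams[b] / total_uni)
        grammar[(a, b)] = (observed / expected) > theta_max if expected else False
    return grammar


# A frequent bigram survives; a rare (exceptional) one is filtered out:
g = exception_filtering(["ab"] * 10 + ["ba"], theta_max=0.5)
# g[("a", "b")] is True; g[("b", "a")] is False
```

The unigram-product baseline and the exact handling of word boundaries are simplifying assumptions made for the sketch; the article's learner uses its own definitions of $O$ and $E$ and iterates the filtering procedure.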

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Figure 1 The learning problem in the presence of exceptions (adapted from Mohri et al. 2018: 8). In both (a) and (b), filled dots represent attested data, while unfilled dots indicate unattested data. In (b), 0 marks ungrammatical items and 1 marks grammatical items, assuming Boolean grammaticality.

Figure 2 The relationship between lexicon, grammar and performance.

Table 1 The distinction between attestedness and grammaticality (adapted from Hyman 1975)

Figure 3 Extraction of vowel tier from the Turkish word [døviz] ‘currency’. The vowel tier contains the vowels in this word, disregarding the non-tier consonants.
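The tier projection described in this caption amounts to deleting every non-tier symbol. As a hypothetical helper (the eight-vowel Turkish inventory below is written in IPA and simplified for illustration; this is not the article's code):

```python
# Assumed eight-vowel Turkish inventory (IPA), for illustration only
TURKISH_VOWELS = set("aeiou\u00f8y\u0131")  # a e i o u ø y ı


def vowel_tier(word):
    """Project the vowel tier: keep vowels, discard non-tier consonants."""
    return ''.join(s for s in word if s in TURKISH_VOWELS)


# vowel_tier("d\u00f8viz") -> "\u00f8i"  (i.e., [døviz] projects to [øi])
```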

Table 2 The list of idealised input data and corresponding hypothesis grammar, as well as expected frequencies for length 3. The input data $S_3$ here is idealised and identical to the target language $L_3$

Figure 4 The learning procedure of the Exception-Filtering learner.

Table 3 Initialisation

Table 4 Compute O and E

Table 5 Update G, Con and S

Table 6 Steps 2 and 3 after the first iteration

Table 7 Type frequency of English onsets in the input data

Table 8 A grammar learnt from the English sample. The first symbols of two-factor sequences correspond to rows (labelled at left), and the second symbols to columns (labelled at the top). Shaded cells indicate the attested two-factors in the input data, with darker grey for grammatical two-factors and lighter grey for ungrammatical ones

Table 9 Type frequency, averaged Likert ratings and predicted grammaticality by the learnt grammar of English nonce word onsets, sorted by averaged Likert ratings. Detected exceptions (non-zero frequency and g = 0) are shaded

Table 10 Results of the best performances by the Exception-Filtering ($\theta _{\max }$ = 0.1), Baseline and HW learners (Max $O/E$ = 0.3, $n$ = 3). Correlation tests are reported with respect to averaged Likert ratings in English; best scores are in bold

Table 11 Polish consonant inventory (derived from the input data)

Table 12 Learnt grammar from Polish input data. The first symbols of two-factor sequences correspond to rows (labelled at left), and the second symbols to columns (labelled at the top). Shaded cells indicate the attested two-factors in the input data, with darker grey for grammatical two-factors and lighter grey for ungrammatical ones

Table 13 Type frequency, averaged Likert ratings and predicted grammaticality by the learnt grammar of Polish onsets, sorted by Likert rating. Detected exceptional onsets are highlighted

Table 14 Results of the best performances by the Exception-Filtering ($\theta _{\max }$ = 0.1), Baseline and HW learners (Max $O/E$ = 0.7, $n$ = 2). Correlation tests are reported with respect to averaged Likert ratings in Polish, categorised by attestedness; best scores are in bold

Table 15 Turkish vowel system

Table 16 The type frequency of two-factors in the input data; cells of documented grammatical two-factors are highlighted

Table 17 Performance comparison of Exception-Filtering ($\theta _{\max }$ = 0.5), Baseline and HW learner ($\text {Max } O/E$ = 0.7, $n$ = 3) in the first test data set (categorical labels). Best scores are in bold

Table 18 Comparing the learnt grammars of (a) the Exception-Filtering learner and (b) the HW learner

Table 19 Performance comparison of Exception-Filtering and HW learner in the second test data set adapted from Zimmer’s (1969) experiment. Best scores are in bold

Figure 5 Scatter plots based on the learning results of the two learners. Expected grammaticality is highlighted based on documented phonotactic generalisations. Some words have two response rates, as they appeared in two separate experiments. Overlapping words are omitted from the plots.

Supplementary material: File

Dai supplementary material

Download Dai supplementary material (File), 26.5 KB