
An exception-filtering approach to phonotactic learning

Published online by Cambridge University Press:  22 April 2025

Huteng Dai*
Affiliation:
Department of Linguistics, University of Michigan, Ann Arbor, MI, USA; Department of Linguistics, Rutgers University, New Brunswick, NJ, USA

Abstract

Phonotactic learning has been fertile ground for research in phonology. However, the challenge of lexical exceptions in phonotactic learning remains largely unexplored. Traditional learning models, which typically assume that all observed input data are grammatical, often blur the distinction between lexical exceptions and grammatical words, thereby skewing the learning results. To address this issue, this article introduces a categorical-grammar-plus-exception-filtering approach that harnesses the discrete nature of categorical grammars to filter out lexical exceptions using statistical criteria adapted from probabilistic models. Applied to naturalistic corpora of English, Polish and Turkish, the learnt grammars correlate highly with the acceptability judgements from behavioural experiments. Compared to benchmark models, the model performs increasingly better with data that contain a higher proportion of lexical exceptions, reaching its peak in learning Turkish non-local vowel phonotactics, which highlights its ability to handle lexical exceptions.
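The core idea of exception filtering can be sketched in a few lines. The snippet below is a minimal illustrative implementation, not the article's actual algorithm: it assumes a bigram (two-factor) model, estimates each bigram's expected type frequency from a simple unigram-independence baseline, and marks a bigram as grammatical only when its observed/expected ratio exceeds a threshold (here called `theta_max`, echoing the $\theta_{\max}$ parameter reported in the tables below). Attested bigrams that fail the test are treated as lexical exceptions.

```python
from collections import Counter
from itertools import product


def exception_filtering(words, theta_max=0.1):
    """Toy exception-filtering bigram learner (illustrative sketch).

    A bigram is judged grammatical iff its observed/expected
    type-frequency ratio exceeds theta_max; attested bigrams that
    fail the test are treated as lexical exceptions.
    """
    unigrams, bigrams = Counter(), Counter()
    for w in words:
        symbols = ['#'] + list(w) + ['#']  # '#' marks word boundaries
        unigrams.update(symbols)
        bigrams.update(zip(symbols, symbols[1:]))

    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    grammar = {}
    for a, b in product(unigrams, repeat=2):
        observed = bigrams[(a, b)]
        # Expected count if symbols combined independently of each other
        expected = total_bi * (unigrams[a] / total_uni) * (unigrams[b] / total_uni)
        grammar[(a, b)] = (observed / expected) > theta_max if expected else False
    return grammar


# A frequent bigram survives; a rare (exceptional) one is filtered out:
g = exception_filtering(["ab"] * 10 + ["ba"], theta_max=0.5)
# g[("a", "b")] is True; g[("b", "a")] is False
```

The unigram-product baseline and the exact handling of word boundaries are simplifying assumptions made for the sketch; the article's learner uses its own definitions of $O$ and $E$ and iterates the filtering procedure.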

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Figure 1 The learning problem in the presence of exceptions (adapted from Mohri et al. 2018: 8). In both (a) and (b), filled dots represent attested data, while unfilled dots indicate unattested data. In (b), 0 marks ungrammatical items and 1 marks grammatical items, assuming Boolean grammaticality.

Figure 2 The relationship between lexicon, grammar and performance.

Table 1 The distinction between attestedness and grammaticality (adapted from Hyman 1975)

Figure 3 Extraction of vowel tier from the Turkish word [døviz] ‘currency’. The vowel tier contains the vowels in this word, disregarding the non-tier consonants.
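The tier projection described in this caption amounts to deleting every non-tier symbol. As a hypothetical helper (the eight-vowel Turkish inventory below is written in IPA and simplified for illustration; this is not the article's code):

```python
# Assumed eight-vowel Turkish inventory (IPA), for illustration only
TURKISH_VOWELS = set("aeiou\u00f8y\u0131")  # a e i o u ø y ı


def vowel_tier(word):
    """Project the vowel tier: keep vowels, discard non-tier consonants."""
    return ''.join(s for s in word if s in TURKISH_VOWELS)


# vowel_tier("d\u00f8viz") -> "\u00f8i"  (i.e., [døviz] projects to [øi])
```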

Table 2 The list of idealised input data and corresponding hypothesis grammar, as well as expected frequencies for length 3. The input data $S_3$ here is idealised and identical to the target language $L_3$

Figure 4 The learning procedure of the Exception-Filtering learner.

Table 3 Initialisation

Table 4 Compute O and E

Table 5 Update G, Con and S

Table 6 Steps 2 and 3 after the first iteration

Table 7 Type frequency of English onsets in the input data

Table 8 A grammar learnt from the English sample. The first symbols of two-factor sequences correspond to rows (labelled at left), and the second symbols to columns (labelled at the top). Shaded cells indicate the attested two-factors in the input data, with darker grey for grammatical two-factors and lighter grey for ungrammatical ones

Table 9 Type frequency, averaged Likert ratings and predicted grammaticality by the learnt grammar of English nonce word onsets, sorted by averaged Likert ratings. Detected exceptions (non-zero frequency and g = 0) are shaded

Table 10 Results of the best performances by the Exception-Filtering ($\theta _{\max }$ = 0.1), Baseline and HW learners (Max $O/E$ = 0.3, $n$ = 3). Correlation tests are reported with respect to averaged Likert ratings in English; best scores are in bold

Table 11 Polish consonant inventory (derived from the input data)

Table 12 Learnt grammar from Polish input data. The first symbols of two-factor sequences correspond to rows (labelled at left), and the second symbols to columns (labelled at the top). Shaded cells indicate the attested two-factors in the input data, with darker grey for grammatical two-factors and lighter grey for ungrammatical ones

Table 13 Type frequency, averaged Likert ratings and predicted grammaticality by the learnt grammar of Polish onsets, sorted by Likert rating. Detected exceptional onsets are highlighted

Table 14 Results of the best performances by the Exception-Filtering ($\theta _{\max }$ = 0.1), Baseline and HW learners (Max $O/E$ = 0.7, $n$ = 2). Correlation tests are reported with respect to averaged Likert ratings in Polish, categorised by attestedness; best scores are in bold

Table 15 Turkish vowel system

Table 16 The type frequency of two-factors in the input data; cells of documented grammatical two-factors are highlighted

Table 17 Performance comparison of Exception-Filtering ($\theta _{\max }$ = 0.5), Baseline and HW learner ($\text {Max } O/E$ = 0.7, $n$ = 3) in the first test data set (categorical labels). Best scores are in bold

Table 18 Comparing the learnt grammars of (a) the Exception-Filtering learner and (b) the HW learner

Table 19 Performance comparison of Exception-Filtering and HW learner in the second test data set adapted from Zimmer’s (1969) experiment. Best scores are in bold

Figure 5 Scatter plots based on the learning results of the two learners. Expected grammaticality is highlighted based on documented phonotactic generalisations. Some words have two response rates, as they appeared in two separate experiments. Overlapping words are omitted from the plots.

Supplementary material: File

Dai supplementary material

Download Dai supplementary material (File), 26.5 KB