
An algorithm for learning phonological classes from distributional similarity

Published online by Cambridge University Press:  28 April 2020

Connor Mayer*
Affiliation:
University of California, Los Angeles

Abstract

An important question in phonology is to what degree the learner uses distributional information rather than substantive properties of speech sounds when learning phonological structure. This paper presents an algorithm that learns phonological classes from only distributional information: the contexts in which sounds occur. The input is a segmental corpus, and the output is a set of phonological classes. The algorithm is first tested on an artificial language, with both overlapping and nested classes reflected in the distribution, and retrieves the expected classes, performing well as distributional noise is added. It is then tested on four natural languages. It distinguishes between consonants and vowels in all cases, and finds more detailed, language-specific structure. These results improve on past approaches, and are encouraging, given the paucity of the input. More refined models may provide additional insight into which phonological classes are apparent from the distributions of sounds in natural languages.
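The pipeline the abstract describes — building context-count vectors for each segment and normalising them with PPMI (as in Tables I and II) — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the trigram windowing, the `#` word-boundary marker, and the toy corpus are all assumptions.

```python
import numpy as np

def context_counts(corpus, segments):
    """Count, for each segment, how often each (left, right) pair of
    neighbours surrounds it (a trigram window over each word)."""
    contexts = sorted({(w[i - 1], w[i + 1])
                       for w in corpus for i in range(1, len(w) - 1)})
    ctx_idx = {c: j for j, c in enumerate(contexts)}
    seg_idx = {s: i for i, s in enumerate(segments)}
    M = np.zeros((len(segments), len(contexts)))
    for w in corpus:
        for i in range(1, len(w) - 1):
            M[seg_idx[w[i]], ctx_idx[(w[i - 1], w[i + 1])]] += 1
    return M

def ppmi(M):
    """Positive pointwise mutual information normalisation of a count matrix."""
    p = M / M.sum()
    p_row = p.sum(axis=1, keepdims=True)
    p_col = p.sum(axis=0, keepdims=True)
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log2(p / (p_row @ p_col))
    return np.maximum(pmi, 0.0)   # clamp -inf (zero counts) and negatives to 0

# Toy usage: '#' marks word boundaries (an assumed convention).
corpus = ["#pati#", "#tipa#"]
segments = sorted({c for w in corpus for c in w if c != '#'})
vectors = ppmi(context_counts(corpus, segments))
```

The resulting rows are the segment embeddings on which class discovery would operate; the actual algorithm's clustering step is not reproduced here.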

Information

Type
Articles
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Author(s), 2020. Published by Cambridge University Press.
Figure 1 The phonological classes of Parupa.

Table I Count vectors for a toy language.

Table II Count vectors for a toy language, normalised using PPMI.

Figure 2 A PCA visualisation of the vector embeddings of Parupa, generated using trigram counts and PPMI normalisation: (a) all segments; (b) consonants; (c) vowels.
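The PCA views used throughout the figures can be reproduced from any embedding matrix with a plain SVD. A minimal sketch, assuming `X` is a numpy array of PPMI row vectors (this is not the paper's plotting code):

```python
import numpy as np

def pca_project(X, k=2):
    """Project the row vectors of X onto their first k principal components."""
    Xc = X - X.mean(axis=0)                        # centre each context dimension
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                           # coordinates in PC space
```

Plotting the two returned columns against each other gives a scatter like panels (a)–(c); segments with similar distributions land near one another.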

Figure 3 Classes retrieved for Parupa.

Figure 4 A plot of the median number of expected and unexpected classes found by the algorithm as the percentage of noisy words increases. Error bars span the minimum and maximum number of classes retrieved from a corpus at that noise level.

Figure 5 A PCA visualisation of the vector embeddings of Samoan: (a) all segments; (b) consonants; (c) vowels.

Figure 6 Classes retrieved for Samoan.

Figure 7 A PCA visualisation of the vector embeddings of Samoan consonants from a corpus without vowels (scaling factor: 1.3).

Figure 8 Classes retrieved for Samoan with no vowels.

Figure 9 A PCA visualisation of the vector embeddings of English: (a) all segments; (b) consonants; (c) vowels (scaling factor: 1.1).

Figure 10 Classes retrieved for English.

Figure 11 English vowels projected onto PC3.

Figure 12 A PCA visualisation of the vector embeddings of French: (a) all segments; (b) consonants; (c) vowels (scaling factor: 1.7).

Figure 13 Classes retrieved for French.

Figure 14 A PCA visualisation of the vector embeddings of Finnish from a corpus with only vowels.

Figure 15 Classes retrieved for the Finnish corpus containing only vowels.

Figure 16 A PCA visualisation of the vector embeddings of Finnish: (a) all segments; (b) consonants; (c) vowels (scaling factor: 1.2).

Figure 17 Vowel classes retrieved for the full Finnish corpus.

Supplementary material: Mayer supplementary material (File, 410.7 KB)