A computational approach to investigating phonological complexity with latent variables and dimensionality reduction

Published online by Cambridge University Press: 06 April 2026

Frederik Hartmann*
Affiliation:
Linguistics, University of North Texas, USA

Abstract

This article investigates phonological complexity by using artificial neural network and Bayesian structural equation models to derive representations of phonological complexity from counts of the segments associated with particular features in languages’ phonemic inventories. These latent representations can then be used alongside principal component analysis to further analyse how interactions between phonological features affect overall complexity, and what phonological complexity patterns a model can detect in a phonological feature data set. The results indicate that the per-feature segment counts investigated tend to contribute positively to a language’s complexity, and that the latent complexity variables approximate a log-normal distribution. This implies that phonological complexifications co-occur with other complexifications diachronically while tending to be more constrained at the upper and lower ends of the complexity range.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press
Table 1 Phonological inventory of Rotokas following Firchow & Firchow (1969), the source used in PHOIBLE.

Figure 1 Left: two possible vectors constructed for a two-dimensional toy data set and the distance of each data point to its projection on the respective vector (dashed lines). Right: distribution of the point projections on the two vectors.

Figure 2 Sketch of a single principal component $PC$, which is a linear combination of dimensions $D_1$, $D_2$ and $D_3$.
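
Written out, and assuming hypothetical loadings $w_1$, $w_2$ and $w_3$ (symbols that do not appear in the caption), such a component would take the form:

$$PC = w_1 D_1 + w_2 D_2 + w_3 D_3$$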

Figure 3 Sketch of a latent variable $L$ and its effects on a set of three variables $\{A, B, C\}$.

Figure 4 Illustration of a hypothetical algebraic relationship between a latent state and two output segment counts. The latent state has the value 6, and the output counts are 12 and 3. In this simplified example, we assume a linear relationship.
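
As a worked reading of these numbers, assuming a linear map without an intercept and hypothetical coefficients $\beta_1$ and $\beta_2$ (neither is given in the caption), the two counts would follow from the latent value as:

$$12 = \beta_1 \cdot 6 \;\Rightarrow\; \beta_1 = 2, \qquad 3 = \beta_2 \cdot 6 \;\Rightarrow\; \beta_2 = 0.5$$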

Figure 5 Simplified encoder–decoder network with three layers, where $a_x^{(l)}$ is the activation of neuron $x$ in layer $l$ and $w_x^{(l)}$ is connection weight $x$ leading forward from layer $l$.

Table 2 Parameters tested in the grid search and their intervals.

Table 3 Two neural network architectures tested in the grid search: small and large.

Figure 6 Feature effect values of the input segment counts in the construction of the latent complexity variable or PC1, weighted by feature frequency.

Table 4 Spearman rank correlation coefficients between the feature effects from the three models.

Figure 7 Comparison of the distribution of normalised latent complexity values obtained from the neural network and Bayesian models along with the raw feature sums. The black curves indicate the shape of the corresponding approximate log-normal distributions (see analysis below). Left: normalised latent complexity values; right: log-transformed normalised latent complexity values.

Table 5 Differences in ELPD values, with their standard errors (SE), between the fitted outcome distributions, relative to the best-fitting distributions (top row).

Figure 8 Histograms of the frequency distributions of each individual phonological feature count in the data set.

Table 6 Posterior estimates with credible intervals of the log-normal distribution for the original and shuffled data sets.

Table 7 Spearman rank correlation coefficients between the feature effects from the three models.

Table 8 The five most and least complex languages as derived from the models and the raw counts.

Table 9 Feature counts of Klao (Atlantic-Congo) and Wappo (Yuki-Wappo). The two have nearly identical sums of feature counts (Klao = 200, Wappo = 206), but differ in their (normalised) calculated complexity: Klao = 0.1 (neural network), 0.17 (Bayesian); Wappo = 0.04 (neural network), 0.11 (Bayesian).

Figure 9 Raw feature effect values of the input segment counts in the construction of the latent complexity variable or PC1.

Figure 10 Histograms of the frequency distributions of each individual phonological feature count in the data set.

Table 10 Results from the grid search analysis ordered by average mean-squared-error loss over 10 runs.