Blowing in the wind: Using ‘North Wind and the Sun’ texts to sample phoneme inventories

Published online by Cambridge University Press: 07 June 2021

Louise Baird
ARC Centre of Excellence for the Dynamics of Language, The Australian National University (louise.baird@anu.edu.au)
Nicholas Evans
ARC Centre of Excellence for the Dynamics of Language, The Australian National University (nicholas.evans@anu.edu.au)
Simon J. Greenhill
ARC Centre of Excellence for the Dynamics of Language, The Australian National University & Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History (greenhill@shh.mpg.de)

Abstract

Language documentation faces a persistent and pervasive problem: How much material is enough to represent a language fully? How much text would we need to sample the full phoneme inventory of a language? In the phonetic/phonemic domain, what proportion of the phoneme inventory can we expect to sample in a text of a given length? Answering these questions in a quantifiable way is tricky, but asking them is necessary. The cumulative collection of Illustrative Texts published in the Illustration series in this journal over more than four decades (mostly renditions of the ‘North Wind and the Sun’) gives us an ideal dataset for pursuing these questions. Here we investigate a tractable subset of the above questions, namely: What proportion of a language’s phoneme inventory do these texts enable us to recover, in the minimal sense of having at least one allophone of each phoneme? We find that, even with this low bar, only three languages (Modern Greek, Shipibo and the Treger dialect of Breton) attest all phonemes in these texts. Unsurprisingly, these languages sit at the low end of phoneme inventory sizes (respectively 23, 24 and 36 phonemes). We then estimate the rate at which phonemes are sampled in the Illustrative Texts and extrapolate to see how much text it might take to display a language’s full inventory. Finally, we discuss the implications of these findings for linguistics in its quest to represent the world’s phonetic diversity, and for JIPA in its design requirements for Illustrations and in particular whether supplementary panphonic texts should be included.
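
The recovery criterion described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the transcript has already been segmented and each token mapped to its phoneme (so any allophone counts towards its phoneme), and the function name `inventory_coverage` and the toy data are hypothetical.

```python
def inventory_coverage(inventory, transcript_tokens):
    """Return (proportion of the inventory recovered, sorted missing phonemes)."""
    inventory = set(inventory)
    # A phoneme counts as recovered if it occurs at least once in the transcript.
    observed = inventory & set(transcript_tokens)
    missing = sorted(inventory - observed)
    return len(observed) / len(inventory), missing


# Hypothetical toy data, not taken from any Illustration.
inventory = ["p", "t", "k", "a", "i", "u"]
tokens = ["p", "a", "t", "a", "k", "i", "p", "a"]
coverage, missing = inventory_coverage(inventory, tokens)
print(f"recovered {coverage:.0%} of the inventory; missing: {missing}")
```

Tokens that fall outside the stated inventory are ignored here; counting them separately would correspond to the "unexpected phonemes" tallied in Figure 1(d).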

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2021. Published by Cambridge University Press on behalf of the International Phonetic Association

Figure 1 Histograms showing distributions of the number of phonemes, for each language: (a) the size of the phoneme inventory; (b) the length of the Illustration transcript; (c) the number of phonemes not found in the transcript; and (d) the number of unexpected phonemes in the transcript (errors).

Figure 2 Map of languages in JIPA Illustrations. Language families with more than three languages in the corpus are colour coded.

Figure 3 Relationship between (a) number of unobserved phonemes vs. inventory size and (b) number of unobserved phonemes vs. the length of the full Illustrative Text, measured in phoneme tokens. The orange line shows the significant phylogenetically controlled least-squares regression line, indicating the strength of the relationship.

Figure 4 The complementary cumulative distribution of segment token frequencies in four randomly selected languages. The attested frequencies are plotted in blue, while the fits of three candidate distributions are indicated with dashed lines. The y axis, ‘p(X ≥ x)’, is the probability of observing a phoneme X with a frequency greater than or equal to x.
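
Reading p(X ≥ x) as the share of phonemes whose token count is at least x, an empirical complementary cumulative distribution can be computed as below. This is a sketch only: the function name `ccdf_of_frequencies` and the toy token list are hypothetical, and fitting the candidate distributions is not attempted.

```python
from collections import Counter


def ccdf_of_frequencies(tokens):
    """Return [(x, p(X >= x))] over the distinct token counts in `tokens`."""
    counts = Counter(tokens)        # phoneme -> number of tokens
    freqs = list(counts.values())   # one frequency per observed phoneme
    n = len(freqs)
    points = []
    for x in sorted(set(freqs)):
        # Share of phonemes that occur at least x times.
        share = sum(1 for f in freqs if f >= x) / n
        points.append((x, share))
    return points


# Hypothetical toy transcript: "a" occurs 4 times, "b" twice, "c" once.
ccdf_points = ccdf_of_frequencies(list("aaaabbc"))
```

Plotting these points on log-log axes gives the kind of curve the caption describes.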

Figure 5 Rate at which a language’s phoneme inventory is recovered in the NWS transcript. Individual languages are coloured by the size of their respective inventories: (a) shows the recovery rate as a function of transcript percentage, while (b) shows the recovery rate as a function of transcript length, and (c) shows the rate in (b) transformed into a log scale. The black lines are local regression ‘LOESS’ curves fitted to the data.
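
A recovery curve of the kind plotted in panel (b) can be sketched as the running proportion of the inventory seen after each token. The function name `recovery_curve` and the toy data are hypothetical, and the LOESS smoothing is not reproduced.

```python
def recovery_curve(inventory, tokens):
    """Proportion of `inventory` observed after each successive token."""
    inventory = set(inventory)
    seen = set()
    curve = []
    for tok in tokens:
        if tok in inventory:   # tokens outside the inventory are ignored
            seen.add(tok)
        curve.append(len(seen) / len(inventory))
    return curve


# Hypothetical toy data, not taken from any Illustration.
curve = recovery_curve(
    ["p", "t", "k", "a", "i", "u"],
    ["p", "a", "t", "a", "k", "i", "p", "a"],
)
```

Dividing each token's position by the transcript length gives the percentage axis of panel (a); taking the logarithm of the position gives the scale of panel (c).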

Figure 6 Consistency of recovery rates for the phonemic orthography in Czech (log scale for number of tokens observed). The line in blue indicates the recovery rate found in the JIPA Illustration, while the lines in red indicate the recovery rates for the same language in large bible corpora.

Figure 7 Comparison of recovery methods on various Czech bibles. The black cross shows the actual number of tokens needed to capture the full inventory. The blue and orange points show the estimated number of tokens from the LM and GAM methods respectively. The cloud of grey points shows the estimates from the simulation method, where each point is a single simulation.

Figure 8 Estimated number of segment tokens needed to fully recover a language’s phoneme inventory under the Simulation approach. Languages are ranked by number of phonemes in their inventory from the largest (top) to the smallest (bottom). The first vertical line indicates the median number of tokens required, while the second indicates the number of tokens required to capture 95% of the simulated language texts.
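
The caption does not spell out the simulation procedure, so the following is only a generic sketch of one way such an estimate could be obtained, not the authors' method. It assumes every phoneme has a known, strictly positive probability (the hypothetical `probabilities` mapping), draws tokens independently until the whole inventory has appeared, and summarises the run lengths by their median and an approximate 95th percentile.

```python
import random
import statistics


def tokens_until_full(probabilities, rng):
    """Draw tokens at random until every phoneme has been seen; return the count."""
    phonemes = list(probabilities)
    weights = [probabilities[p] for p in phonemes]  # must all be > 0
    seen = set()
    n = 0
    while len(seen) < len(phonemes):
        seen.add(rng.choices(phonemes, weights=weights, k=1)[0])
        n += 1
    return n


def simulate(probabilities, runs=1000, seed=0):
    """Return (median, approximate 95th percentile) of tokens needed over `runs`."""
    rng = random.Random(seed)
    lengths = sorted(tokens_until_full(probabilities, rng) for _ in range(runs))
    median = statistics.median(lengths)
    p95 = lengths[int(0.95 * (runs - 1))]  # simple rank-based percentile
    return median, p95


# Hypothetical three-phoneme distribution, for illustration only.
median, p95 = simulate({"a": 0.5, "b": 0.3, "c": 0.2}, runs=200, seed=1)
```

Independent draws ignore phonotactics and word structure, so a sketch like this is best read as a rough lower-effort baseline rather than a model of real text.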

Figure 9 Histogram showing the overall amount of text needed for full recovery of a phoneme inventory across all languages using the three methods.

Figure 10 Scatter plots showing the relationship between (a) the global ranking of each phoneme, and (b) the frequency of each phoneme within the JIPA Illustrations, compared to the average percentage through the transcript of the phonemes’ first observation. The colours indicate the global ranking of how common the phoneme is according to PHOIBLE, with bluer points being found in the top 50, and yellower points found less frequently. Points coloured in red are the phonemes that are not observed after the complete transcript.