Detecting linguistic variation with geographic sampling

Published online by Cambridge University Press:  06 May 2024

Ezequiel Koile*
Affiliation:
Department of Linguistic and Cultural Evolution, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
George Moroz*
Affiliation:
Linguistic Convergence Laboratory, HSE University, Moscow, Russia
*Corresponding authors. Emails: ezequiel_koile@eva.mpg.de and agricolamz@gmail.com

Abstract

Geolectal variation is often present where a single language is spoken across a vast geographic area, and it can appear in phonological, morphosyntactic, and lexical features. For practical reasons, it is not always possible to conduct fieldwork in every location of interest in order to obtain the full pattern of variation, so a sample of locations must be chosen. We propose and test a method for sampling these locations, with the goal of obtaining a distribution of typological features that is representative of the whole area. We apply k-means and hierarchical clustering algorithms to define this sample, based on the geographic distribution of the locations. We test our methods against simulated data with several spatial configurations, and also against real data from Circassian dialects (Northwest Caucasian). Our results show that the method detects this variation significantly more efficiently than random sampling, which makes it useful to fieldworkers when designing their research.
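
To make the sampling idea concrete, the following is a minimal sketch of cluster-based location sampling using a plain k-means (Lloyd's algorithm) in standard-library Python. The function name `kmeans_sample`, its parameters, and in particular the rule of visiting the settlement nearest each cluster centroid are assumptions made for illustration; the abstract does not specify how a representative is drawn from each cluster, and this is not presented as the authors' implementation.

```python
import math
import random


def kmeans_sample(points, k, n_iter=50, seed=0):
    """Cluster 2-D settlement coordinates into k groups and return the
    indices of one representative settlement per non-empty cluster.

    Assumption (not from the paper): the representative is the cluster
    member closest to the cluster centroid.
    """
    rng = random.Random(seed)
    # Initialise the centroids on k randomly chosen settlements.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: attach each settlement to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for i, p in enumerate(points):
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(i)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                xs = [points[i][0] for i in members]
                ys = [points[i][1] for i in members]
                new_centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    # One representative per non-empty cluster: the member nearest the centroid.
    chosen = [
        min(members, key=lambda i: math.dist(points[i], centroids[j]))
        for j, members in enumerate(clusters)
        if members
    ]
    return sorted(chosen)


# Hypothetical coordinates for nine settlements in three loose groups.
settlements = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11),
               (20, 0), (21, 0), (20, 1)]
sample = kmeans_sample(settlements, k=3, seed=1)
```

Because k-means depends on its random initialisation, different seeds can yield different samples; a hierarchical clustering cut into k groups could be swapped in for the clustering step without changing the selection rule.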

Information

Type
Articles
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Figure 1. (a) Circular equidistant, (b) center–periphery, (c) dialect chain, and (d) uniform distributions for Ns = 117 settlements distributed in Nc = 6 categories, with a count configuration Q = (42,21,19,13,11,11) and r = 26. In (a), the centers of the distributions form a regular hexagon. In (b), the most populated category lies at the origin, while the other centers form a regular pentagon around it. In (c), they form a straight line. In (d), all centers coincide at the origin. The dots within each category are surrounded by their normal ellipses (Fox and Weisberg, 2018). The centers around which the data are generated in (a) and (b) form regular polygons, while the polygons shown here link the centroids of the generated data a posteriori and are therefore not regular. Similarly, the dialect chain in (c) does not form a horizontal segment, although the centers used for data generation did.
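
As an illustration of configuration (a), the sketch below places Nc category centers on a regular polygon and scatters each category's settlements around its center. Several details are assumptions not stated in the caption: that r is the distance of each center from the origin, that points are drawn from an isotropic normal distribution, and the spread `sigma`; the function name `circular_equidistant` is likewise hypothetical.

```python
import math
import random


def circular_equidistant(counts, r, sigma=10.0, seed=0):
    """Simulate settlements for the circular equidistant configuration.

    counts : settlements per category, e.g. Q = (42, 21, 19, 13, 11, 11)
    r      : assumed distance of every category center from the origin
    sigma  : assumed standard deviation of the scatter (not given in the caption)
    Returns a list of (x, y, category_index) tuples.
    """
    rng = random.Random(seed)
    n_categories = len(counts)
    data = []
    for j, n in enumerate(counts):
        # Centers sit at equal angles on a circle, i.e. a regular polygon.
        angle = 2 * math.pi * j / n_categories
        cx, cy = r * math.cos(angle), r * math.sin(angle)
        for _ in range(n):
            data.append((rng.gauss(cx, sigma), rng.gauss(cy, sigma), j))
    return data


Q = (42, 21, 19, 13, 11, 11)
simulated = circular_equidistant(Q, r=26)
```

As the caption notes, the centroids of data generated this way will generally not form a regular polygon, because each centroid is estimated from a finite random sample.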

Map 1. Distribution of dialects in the Circassian settlements.

Map 2. Distribution of the Circassian reflexes of *qh.

Figure 2. Results for the discovery fraction as a function of the settlement sample fraction for different values of the parameters. The discovery fraction clearly improves when clustering algorithms are used on data with spatial structure (rows 1–3), while performance neither improves nor degrades for the cases with no spatial structure (last row).

Figure 3. Results for the discovery fraction as a function of entropy. The discovery fraction clearly improves at higher values of entropy, although the results depend on the number of categories.
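
For readers who want to relate a count configuration to the entropy axis, here is a minimal sketch that computes the Shannon entropy of the category proportions. The choice of base 2 (bits) and the function name `shannon_entropy` are assumptions; the caption does not state which entropy definition or logarithm base is used.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in bits, by assumption) of a count configuration."""
    total = sum(counts)
    # Categories with zero settlements contribute nothing to the sum.
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


# Count configuration from the simulated example, Q = (42, 21, 19, 13, 11, 11).
h_q = shannon_entropy((42, 21, 19, 13, 11, 11))
# A perfectly even split over the same number of categories gives the maximum.
h_max = shannon_entropy((1, 1, 1, 1, 1, 1))
```

Entropy is highest when settlements are spread evenly across categories and drops as one category dominates, which is why it also depends on the number of categories.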

Figure 4. Discovery rate as a function of settlement sample fraction for Circassian dialect data (eight categories).

Figure 5. Discovery rate as a function of settlement sample fraction for the Circassian reflexes of *qh (four categories).