Drawing areal information from a corpus of noisy dialect data

Published online by Cambridge University Press:  19 May 2020

Alfred Lameli*
Affiliation:
Albert-Ludwigs-Universität Freiburg, German Department, Platz der Universität 3, 79098 Freiburg/Breisgau, Germany, phone: ++49-761-203-3250
Elvira Glaser
Affiliation:
Universität Zürich, German Department, Schönberggasse 9, 8001 Zürich, Switzerland
Philipp Stöckle
Affiliation:
Austrian Academy of Sciences, Austrian Centre for Digital Humanities (ACDH), Postgasse 7–9, 1010 Wien, Austria
*Author for correspondence: Alfred Lameli, Email: lameli@germanistik.uni-freiburg.de

Abstract

This article is an analysis of linguistic survey data representing German dialects in Switzerland in 1933/34 based on the so-called Wenker sentences. The data are impressionistic in terms of applied phonetic transcriptions, which were produced by non-specialists using the Latin alphabet. Due to the lack of pre-defined standardization, the phonetic transcriptions are very heterogeneous. From a technical perspective, this leads to very noisy data, which is why the validity of the Wenker data in general and the Swiss Wenker data in particular has been questioned. Using methods from computational linguistics, we compare, for the first time, Wenker data with linguistic data collected at virtually the same time by linguistics professionals. Direct comparison with a sample from the published atlas of German-speaking Switzerland (SDS) reveals that despite the noisiness of the data, they nevertheless provide reliable information, e.g., in terms of the spatial structuring of Swiss dialects. The study is thus a successful pilot for other corpus-based studies dealing with unstructured Wenker data in other regions.

Information

Type
Articles
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is unaltered and is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use or in order to create a derivative work.
Copyright
© The Author(s), 2020. Published by Cambridge University Press

Map 1. North-South divide in German-speaking Switzerland (Christen et al., 2019:32).

Map 2. East-West divide in German-speaking Switzerland (Christen et al., 2019:33).

Table 1. Characteristics of the Swiss Wenker corpus

Figure 1. Extract from the legend of the SDS bauen map (vol. I), demonstrating the classification of phonetic variants.

Table 2. Comparison of the Swiss Wenker corpus with the SDS sample provided by Scherrer (2012; see also Scherrer & Kellerhals, 2014)

Map 3. Spatial distribution of selected realizations from the Wenker sample; A: bauen (‘build-inf’); B: schneien (‘snow-inf’); C: früher (‘early-comp’); D–F: smoothing of A–C data.

Figure 2. Statistical distribution of the realizations in Map 3 together with the 3% level of all realizations (red line).

Map 4. Realizations of the stem vowel std. // in bauen (‘build-inf’) from the Wenker sample against the SDS sample.

Map 5. Realizations of the stem vowel std. // in schneien (‘snow-inf’) from the Wenker sample against the SDS sample.

Map 6. Realizations of früher (‘early-comp’) from the Wenker sample against the SDS sample.

Map 7. Linguistic distance between Swiss-German sites; A: MDS plot of unweighted data (three dimensions in RGB color space); B: nearest-neighbor smoothing of A (three neighbors).

Figure 3. Token frequency in the Wenker sample against the frequency rank of tokens; A: overall pattern; B: log-log plot of frequency distribution against fitted power law (red) and lognormal distribution (blue); C: same as B for rank one to 200; D: same as B for rank 201 to 11,807.

Map 8. Distribution of most frequent tokens in the spatially balanced extract of the Wenker sample (N = 392 sites) based on LDϕ measure.

Map 9. DBSCAN clustering; A: clustering based on unweighted distance matrix (color = clusters, gray = noise); B: clustering based on MDS data.

Figure 4. Dendrogram of UPGMA classification.

Map 10. Areal classification of data following Ward’s algorithm.

Map 11. Spatial clustering of the Wenker data based on weighted Ward-like clustering; A: seven-cluster solution; B: eight-cluster solution.

Figure 5. Wenker sample against SDS sample; A: matrix correlation with linear fit (blue) and non-linear fit (red); B: matrix correlation with log-Wenker data.

Map 12. Comparison of one-dimensional MDS coordinates between the Wenker sample (A) and the SDS sample (B).