Corpus-based dialectometry with topic models

Published online by Cambridge University Press:  20 May 2024

Olli Kuparinen* (University of Helsinki)
Yves Scherrer (University of Helsinki)
*Corresponding author: Olli Kuparinen; Email: olli.kuparinen@tuni.fi

Abstract

This paper presents a topic modeling approach to corpus-based dialectometry. Topic models are most often used in text mining to find latent structure in a collection of documents. They are based on the idea that frequently co-occurring words represent the same underlying topic. In this study, topic models are applied directly to interview transcriptions containing dialectal speech, without any annotations or preselected features. The transcriptions are modeled on complete words, on character n-grams, and after automatic segmentation. Data from three languages, Finnish, Norwegian, and Swiss German, are scrutinized. The proposed method is capable of discovering clear dialectal differences in all three datasets, while reflecting the differences between them. The method provides a significant simplification of the dialectometric workflow, simultaneously saving time and increasing objectivity. Using the method on non-normalized data could also benefit text mining, which is the traditional field of topic modeling.
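To make the idea concrete, the following is a minimal sketch of a topic model over character trigrams, not the authors' implementation. It assumes a plain multiplicative-update NMF solver written in pure Python, word-boundary padding with "_", and four invented toy strings standing in for speaker transcripts; all function names here are hypothetical.

```python
import random
from collections import Counter

def char_ngrams(text, n=3):
    """Return character n-grams of each word, padded with '_' (an assumed convention)."""
    grams = []
    for word in text.split():
        padded = "_" + word + "_"
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

def count_matrix(docs, n=3):
    """Build a document-by-n-gram count matrix and the sorted n-gram vocabulary."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    vocab = sorted(set().union(*counts))
    return [[float(c[g]) for g in vocab] for c in counts], vocab

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factorize v ~ w h with non-negative w, h using multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # Update the component-by-term matrix h.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # Update the document-by-component matrix w.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h

def component_shares(w):
    """Normalize each document's component weights so they sum to 1."""
    return [[x / (sum(row) or 1.0) for x in row] for row in w]

# Invented toy strings, one per hypothetical speaker (not taken from any corpus).
docs = ["mie olen kotona", "mie meen kottiin", "mä oon kotona", "mä meen kotiin"]
v, vocab = count_matrix(docs, n=3)
w, h = nmf(v, k=2)
shares = component_shares(w)
```

Each row of `shares` is one speaker's distribution over components, which is the kind of quantity the maps below display as pie charts. Swapping `char_ngrams` for whole-word tokens or pre-segmented units would give the other two input types mentioned in the abstract.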

Information

Type
Articles
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Table 1. The time and number of recordings in each corpus.

Map 1. The component distribution of each speaker in a seven-component NMF model based on complete words in the Finnish dataset. The distributions are presented as pie charts, with each component in use shown as a section in its own color. The blue line denotes the border between Eastern and Western dialects; the red lines differentiate smaller dialect areas. The location in the far west is Värmland in central Sweden, to which people from Savonia migrated in the sixteenth century.

Map 2. The component distribution of each speaker in a three-component LDA model on complete words in the Norwegian dataset. The thin lines separate the four dialect areas of Eastern, Western, Trøndersk, and Northern dialects.

Map 3. The component distribution of each speaker in a three-component NMF model on trigrams in the Norwegian dataset.

Map 4. The component distribution of each speaker in a seven-component NMF model on Morfessor-segmented data in the Swiss German dataset. The German-speaking area is presented in beige and the big lakes in blue.

Table 2. An example sequence of the input types from the Swiss German dataset.

Table 3. The highest-ranking models on dialectal completeness in the Finnish data.

Figure 1. The highest-ranking terms in each component of a seven-component NMF model based on complete words in the Finnish dataset.

Table 4. The highest-ranking models on dialectal completeness in the Norwegian data.

Figure 2. The highest-ranking terms in each component of a three-component LDA model on complete words in the Norwegian dataset.

Figure 3. The highest-ranking terms in each component of a three-component NMF model on trigrams in the Norwegian dataset.

Table 5. The highest-ranking models on dialectal completeness in the Swiss German data.

Figure 4. The highest-ranking terms in each component of a seven-component NMF model on Morfessor-segmented units in the Swiss German dataset.

Map 5. The distribution of components 1 and 2 around Zurich compared with the transcribers of each interview.