
Language model adaptation for language and dialect identification of text

Published online by Cambridge University Press:  31 July 2019

T. Jauhiainen*
Affiliation: Department of Digital Humanities, University of Helsinki, Helsinki 00014, Finland
K. Lindén
Affiliation: Department of Digital Humanities, University of Helsinki, Helsinki 00014, Finland
H. Jauhiainen
Affiliation: Department of Digital Humanities, University of Helsinki, Helsinki 00014, Finland
*Corresponding author. Email: tommi.jauhiainen@helsinki.fi

Abstract

This article describes an unsupervised language model (LM) adaptation approach that can be used to enhance the performance of language identification methods. The approach is applied to a current version of the HeLI language identification method, now called HeLI 2.0, which we describe in detail. The resulting system is evaluated using the datasets from the German dialect identification and Indo-Aryan language identification shared tasks of the VarDial workshops 2017 and 2018. The new approach with LM adaptation provides considerably higher F1-scores than the basic HeLI or HeLI 2.0 methods, or the other systems that participated in the shared tasks. The results indicate that unsupervised LM adaptation should be considered an option in all language identification tasks, especially those where encountering out-of-domain data is likely.
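The abstract does not spell out the adaptation loop, but the table captions below (selection of the k most confident predictions, confidence thresholds, iterative adaptation) suggest a self-training scheme: identify the unlabeled test lines, move the most confident ones into the training data, and retrain. The following is a minimal sketch of that idea only; the character n-gram scoring, the penalty value for unseen n-grams, and the margin-based confidence are simplified stand-ins for the full HeLI 2.0 method, not the authors' exact implementation.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Overlapping character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_models(corpus):
    """Relative-frequency character n-gram model per language."""
    models = {}
    for lang, texts in corpus.items():
        counts = Counter()
        for t in texts:
            counts.update(char_ngrams(t))
        total = sum(counts.values())
        models[lang] = {g: c / total for g, c in counts.items()}
    return models

def identify(models, text, penalty=7.0):
    """Pick the language with the lowest average negative log10
    probability per n-gram; unseen n-grams get a fixed penalty.
    Confidence is the margin between the best and second-best score."""
    grams = char_ngrams(text)
    scores = {}
    for lang, model in models.items():
        s = sum(-math.log10(model[g]) if g in model else penalty
                for g in grams)
        scores[lang] = s / max(len(grams), 1)
    best = min(scores, key=scores.get)
    ranked = sorted(scores.values())
    conf = ranked[1] - ranked[0] if len(ranked) > 1 else 0.0
    return best, conf

def adapt(models, corpus, unlabeled, k=2, iterations=2):
    """Unsupervised LM adaptation, self-training style: each iteration,
    label the remaining lines, add the k most confident ones to the
    training corpus under their predicted language, and retrain."""
    remaining = list(unlabeled)
    for _ in range(iterations):
        if not remaining:
            break
        labeled = sorted((identify(models, t) + (t,) for t in remaining),
                         key=lambda x: -x[1])
        for lang, _, text in labeled[:k]:
            corpus[lang].append(text)
        remaining = [t for _, _, t in labeled[k:]]
        models = train_models(corpus)
    return models
```

Iterating the loop lets early, confident decisions pull the models toward the test domain before harder lines are judged, which is why the tables below report both the choice of k and the number of iterations.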

Information

Type
Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© Cambridge University Press 2019

Table 1. List of the Swiss German varieties used in the datasets distributed for the 2017 GDI shared task. The sizes of the training and the test sets are in words


Table 2. List of the Swiss German varieties used in the datasets distributed for the 2018 GDI shared task. The sizes of the training, the development, and the test sets are in words


Table 3. List of the Indo-Aryan languages used in the datasets distributed for the 2018 ILI shared task. The sizes of the training, the development, and the test sets are in words


Table 4. Average accuracies within the 10% portions when the results are sorted by the confidence score CM


Table 5. The weighted F1-scores obtained by the identifier using LM adaptation with different values of k when tested on the development partition of the GDI 2017 dataset


Table 6. Weighted F1-scores with confidence threshold for LM adaptation on the development set


Table 7. Macro F1-scores with iterative LM adaptation on the development partition


Table 8. The weighted F1-scores using different methods on the 2017 GDI test set. The results from the experiments presented in this article are bolded. The system description papers of each team, if existing, are listed in Section 2.1


Table 9. The macro F1-scores gained with different values of k when evaluated on the GDI 2018 development set


Table 10. Macro F1-scores with iterative LM adaptation on the GDI 2018 development set


Table 11. The macro F1-scores using different methods on the 2018 GDI test set. The results from the experiments presented in this article are bolded. The system description papers of each team, if existing, are listed in Section 2.1


Table 12. The macro F1-scores gained with different values of k when tested on the ILI 2018 development set


Table 13. Macro F1-scores with iterative LM adaptation on the ILI 2018 development set


Table 14. The macro F1-scores using different methods on the 2018 ILI test set. The results presented for the first time are in bold. The system description papers of each team, if existing, are listed in Section 2.2


Table 15. The macro F1-scores for the second part of test set using different training data combinations


Table 16. Time measurements for creating the LMs and predicting the language for different test sets. The measurements give some indication of the computational efficiency but can only be considered rough estimates