Combining n-grams and deep convolutional features for language variety classification

Published online by Cambridge University Press:  18 July 2019

Matej Martinc*
Affiliation:
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
Senja Pollak
Affiliation:
Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia; Usher Institute of Population Health Sciences and Informatics, Edinburgh Medical School, Usher Institute, University of Edinburgh, Edinburgh, UK
*Corresponding author. Email: matej.martinc@ijs.si

Abstract

This paper presents a novel neural architecture capable of outperforming state-of-the-art systems on the task of language variety classification. The architecture is a hybrid that combines character-based convolutional neural network (CNN) features with weighted bag-of-n-grams (BON) features and can therefore leverage both character-level and document/corpus-level information. We tested the system on the Discriminating between Similar Languages (DSL) language variety benchmark data set from the VarDial 2017 DSL shared task, which contains data from six different language groups, as well as on two smaller data sets: the Arabic Dialect Identification (ADI) Corpus from the VarDial 2016 ADI shared task and the German Dialect Identification (GDI) Corpus from the VarDial 2018 GDI shared task. Without any language group-specific parameter tuning, the system outperformed the winning system in the DSL shared task by about 0.4 percentage points and the winning system in the ADI shared task by about 0.2 percentage points in terms of weighted F1 score. An ablation study suggests that weighted BON features contribute more to the overall performance of the system than the CNN-based features, which partially explains why deep learning approaches were uncompetitive in past VarDial DSL shared tasks. Finally, we have implemented our system as a workflow on the ClowdFlows platform to make it easily accessible to members of the research community who do not program.
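The hybrid idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`char_ngrams`, `tfidf_bon`, `conv_maxpool`, `hybrid_features`), the n-gram range, the smoothed IDF variant, and the fixed toy filters are all assumptions made for illustration. The real system learns its convolutional filters, whereas here a hand-written 1-D convolution with global max pooling merely stands in for the CNN branch.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max (assumed range)."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf_bon(docs, n_min=1, n_max=3):
    """Build TF-IDF-weighted bag-of-n-grams vectors, one list per document."""
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {g: math.log((1 + n_docs) / (1 + sum(g in c for c in counts))) + 1
           for g in vocab}
    return vocab, [[c[g] * idf[g] for g in vocab] for c in counts]


def conv_maxpool(text, filters):
    """Toy stand-in for the CNN branch: slide each filter over scaled
    character codes and keep the maximum response (global max pooling)."""
    codes = [ord(ch) / 1000.0 for ch in text]
    feats = []
    for weights in filters:
        k = len(weights)
        responses = (sum(w * x for w, x in zip(weights, codes[i:i + k]))
                     for i in range(len(codes) - k + 1))
        feats.append(max(responses, default=0.0))
    return feats


def hybrid_features(docs, filters):
    """Concatenate character-level (conv) and corpus-level (BON) features."""
    _, bon = tfidf_bon(docs)
    return [conv_maxpool(d, filters) + b for d, b in zip(docs, bon)]
```

The concatenated vector would then be passed to a classifier. The point of the sketch is only that the convolutional features depend on a single document, while the TF-IDF weights depend on statistics of the whole corpus.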

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© Cambridge University Press 2019

Table 1. Winning systems for author profiling (AP) classification tasks in the PAN AP and VarDial DSL shared tasks (language variety tasks in bold)

Figure 1. System architecture: layer names and input parameters are written in bold, layer output sizes are written in normal text, msl stands for maximum sequence length, and csl stands for concatenated sequence length.

Table 2. DSLCC v4.0, ADIC and GDIC corpora

Figure 2. Confusion matrix for language group classification (TF-IDF weighting scheme).

Figure 3. Confusion matrix for language group classification (BM25 weighting scheme).

Table 3. Results of the proposed language variety classifier on the DSLCC v4.0 for different language groups, as well as for the discrimination between language groups (All-language groups). Results for all language varieties (All-language varieties) are also provided and compared with those of the official VarDial 2017 winners. Results for both weighting schemes, TF-IDF and BM25, are reported separately

Figure 4. Confusion matrix for Spanish language varieties classification (TF-IDF weighting scheme).

Figure 5. Confusion matrix for Spanish language varieties classification (BM25 weighting scheme).

Figure 6. Confusion matrix for Farsi language varieties classification (TF-IDF weighting scheme).

Figure 7. Confusion matrix for Farsi language varieties classification (BM25 weighting scheme).

Figure 8. Confusion matrix for French language varieties classification (TF-IDF weighting scheme).

Figure 9. Confusion matrix for French language varieties classification (BM25 weighting scheme).

Figure 10. Confusion matrix for Indonesian and Malay variety classification (TF-IDF weighting scheme).

Figure 11. Confusion matrix for Indonesian and Malay variety classification (BM25 weighting scheme).

Figure 12. Confusion matrix for Portuguese language varieties classification (TF-IDF weighting scheme).

Figure 13. Confusion matrix for Portuguese language varieties classification (BM25 weighting scheme).

Figure 14. Confusion matrix for Slavic language varieties classification (TF-IDF weighting scheme).

Figure 15. Confusion matrix for Slavic language varieties classification (BM25 weighting scheme).

Table 4. Accuracy comparison of our system to the VarDial 2017 DSL winners on validation sets

Table 5. Results of the proposed language variety classifier on the ADIC and GDIC. Results for both weighting schemes, TF-IDF and BM25, are reported separately

Figure 16. Confusion matrix for Arabic language varieties classification (TF-IDF weighting scheme).

Figure 17. Confusion matrix for Arabic language varieties classification (BM25 weighting scheme).

Figure 18. Confusion matrix for German language varieties classification (TF-IDF weighting scheme).

Figure 19. Confusion matrix for German language varieties classification (BM25 weighting scheme).

Table 6. Results of the error analysis on 405 misclassified Slavic documents

Table 7. Results of the ablation study. Column CNN F1 (weighted) presents the performance of the system in terms of weighted F1 if only CNN-based features are used, column BON F1 (weighted) presents the performance if only TF-IDF-weighted BON features are used, and column All F1 (weighted) presents the performance when these two types of features are combined
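For readers unfamiliar with the metric named in the table columns, the following is a minimal sketch of weighted F1 under its common definition: the per-class F1 scores averaged with weights proportional to each class's number of true instances. That the paper uses exactly this definition is an assumption, and the function name `weighted_f1` and the example labels are illustrative only.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)  # number of true instances per class
    total = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1  # weight each class by its support
    return total / len(y_true)
```

Because each class is weighted by its support, frequent varieties influence the score more than rare ones, which matters on data sets with unbalanced classes.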

Table 8. Results of the error analysis on Slavic documents misclassified by the BON classifier and correctly classified by the CNN classifier and on Slavic documents misclassified by the CNN classifier and correctly classified by the BON classifier

Figure 20. ClowdFlows implementation of the two-step approach for language variety classification on the DSLCC v4.0. The workflow is publicly available at http://clowdflows.org/workflow/13322/.