
Moving to continuous classifications of bilingualism through machine learning trained on language production

Published online by Cambridge University Press: 24 May 2024

M. I. Coco*
Affiliation:
Department of Psychology, “Sapienza” University of Rome, Rome, Italy; I.R.C.C.S. Fondazione Santa Lucia, Rome, Italy
G. Smith*
Affiliation:
School of Health Sciences, University of East Anglia, Norwich, UK
R. Spelorzi
Affiliation:
Department of Linguistics and English Language, University of Edinburgh, Edinburgh, UK
M. Garraffa
Affiliation:
School of Health Sciences, University of East Anglia, Norwich, UK
*Authors for correspondence: M. I. Coco; Email: moreno.coco@uniroma1.it. G. Smith; Email: giuditta.smith@uea.ac.uk

Abstract

Recent conceptualisations of bilingualism are moving away from strict categorisations, towards continuous approaches. This study supports this trend by combining empirical psycholinguistic data with machine learning classification modelling. Support vector classifiers were trained on two datasets of coded productions by Italian speakers to predict the class they belonged to (“monolingual”, “attriters” and “heritage”). All classes can be predicted above chance (>33%), although the classifier's performance varies substantially across classes: monolinguals are identified much better (F-score >70%) than attriters (F-score <50%), which are the most confusable class. Further analyses of the classification errors expressed in the confusion matrices show that attriters are identified as heritage speakers nearly as often as they are correctly classified. Cluster clitics are the most discriminative features for classification. Overall, this study supports a conceptualisation of bilingualism as a continuum of linguistic behaviours rather than sets of a priori established classes.

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Table 1. Characteristics of study participants, for the original dataset and the novel dataset

Table 2. Coding strategy, with examples from the data

Table 3. Descriptive statistics of the SVM classification performances

Table 4. Output of a linear model predicting F-score as a function of the three classes of speakers in our study (attriters, heritage and monolinguals, as the reference level)

Figure 1. Visualisation of the confusion matrices for the classification performance of our models (A: trained and tested on the original dataset; B: trained on the original dataset and tested on the novel dataset). Model predictions are organised over the rows, while the targets, i.e., the expected outcomes, are organised over the columns. The percentages indicate how many cases, per class, matched or did not match between predictions and targets. The colours of the tiles range from white (few cases) to orange (most cases). (C) Percentage of times a given type of production was selected by the classifier as a key feature, i.e., it significantly improved performance. The types of production are depicted as colours and organised as stacked bars. In the image, cluster glie-lo refers to cluster 3rd, and cluster me-lo to cluster 1st/2nd. The x-axis indicates whether the feature was selected as the first or second feature. Note: All models contained a maximum of two types of production; hence, there are no further ranks.
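
The layout described in the caption of Figure 1 (predictions over the rows, targets over the columns) and the per-class F-scores reported in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names and the toy labels below are invented purely for illustration, and normalising each cell by its target-class (column) total is an assumption about how the "per class" percentages were computed.

```python
from collections import Counter

# Class labels follow the abstract; everything else here is hypothetical.
CLASSES = ["monolingual", "attriter", "heritage"]


def confusion_counts(targets, predictions, classes=CLASSES):
    """Return counts[predicted][target]: rows are predictions, columns are targets."""
    pairs = Counter(zip(predictions, targets))
    return {p: {t: pairs[(p, t)] for t in classes} for p in classes}


def column_percentages(counts, classes=CLASSES):
    """Express each cell as a percentage of its target-class (column) total.

    Assumption: 'per class' in the caption means per target class.
    """
    totals = {t: sum(counts[p][t] for p in classes) for t in classes}
    return {
        p: {t: (100.0 * counts[p][t] / totals[t] if totals[t] else 0.0)
            for t in classes}
        for p in classes
    }


def f_score(counts, cls, classes=CLASSES):
    """F-score for one class: harmonic mean of precision and recall."""
    true_pos = counts[cls][cls]
    n_predicted = sum(counts[cls][t] for t in classes)  # row total
    n_actual = sum(counts[p][cls] for p in classes)     # column total
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_actual if n_actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented toy labels (NOT data from the study): four speakers per class.
targets = ["monolingual"] * 4 + ["attriter"] * 4 + ["heritage"] * 4
predictions = [
    "monolingual", "monolingual", "monolingual", "attriter",
    "attriter", "attriter", "heritage", "heritage",
    "heritage", "heritage", "heritage", "attriter",
]

counts = confusion_counts(targets, predictions)
percentages = column_percentages(counts)
scores = {c: f_score(counts, c) for c in CLASSES}
```

In this toy example half of the attriters are predicted as heritage speakers, so `percentages["heritage"]["attriter"]` equals the correctly classified share `percentages["attriter"]["attriter"]`, which mirrors the kind of confusion the abstract describes without reproducing any of the reported values.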