
Realistic and broad-scope learning simulations: first results and challenges

Published online by Cambridge University Press:  29 May 2023

Maureen de SEYSSEL*
Affiliation:
Laboratoire de Sciences Cognitives et de Psycholinguistique, Département d’Études Cognitives, ENS, EHESS, CNRS, PSL University, Paris, France; Laboratoire de Linguistique Formelle, Université Paris Cité, CNRS, Paris, France
Marvin LAVECHIN
Affiliation:
Laboratoire de Sciences Cognitives et de Psycholinguistique, Département d’Études Cognitives, ENS, EHESS, CNRS, PSL University, Paris, France; Meta AI Research, Paris, France
Emmanuel DUPOUX
Affiliation:
Laboratoire de Sciences Cognitives et de Psycholinguistique, Département d’Études Cognitives, ENS, EHESS, CNRS, PSL University, Paris, France; Meta AI Research, Paris, France
*
Corresponding authors: Maureen de Seyssel and Marvin Lavechin; Emails: maureen.deseyssel@gmail.com; marvinlavechin@gmail.com

Abstract

There is a current ‘theory crisis’ in language acquisition research, resulting from fragmentation both across theoretical approaches and across the linguistic levels studied. We identify a need for integrative approaches that go beyond these limitations, and analyse the strengths and weaknesses of current theoretical approaches to language acquisition. In particular, we argue that language learning simulations, if they integrate realistic input and multiple levels of language, can contribute significantly to our understanding of language acquisition. We then review recent results obtained through such language learning simulations. Finally, we propose guidelines to help the community build better simulations.

Information

Type
Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Table 1. Four dimensions along which theoretical approaches to language acquisition can be sorted


Figure 1. General outline of a realistic learning simulation (centre) in relation to real infants (left) and traditional theoretical approaches (right). 1. Verbal frameworks inspire and help set up the entire language learning simulation by describing the environment, learner, and outcome models; 2. Corpus studies of children’s input help us build realistic models of the environment. In the best case, the model of the environment is a subset of a real environment, obtained through child-centred long-form recordings, for instance; 3. Machine learning provides effective artificial language learners. The learner model is relatively unconstrained as learning mechanisms used by the real learner (i.e., infants) remain largely unobservable; 4. Correlational models describe how the input should relate to the outcome measures; 5. Experimental and corpus studies of children’s outcomes show how we can evaluate learning outcomes of the artificial learner. The real versus predicted outcome measures allow us to compare humans to machines and provide new predictions for correlational models that relate input to outcomes in infants.


Table 2. Non-exhaustive list of language learning assumptions for infants and whether they are included within the STELA simulation


Figure 2. Overview of the STELA learner and outcome measures. a. (left): model of the learner; b. (right): add-on models for two types of outcome measures.


Figure 3. Phonetic (left) and lexical (right) scores for native and non-native input at different quantities of training data. The phonetic score is expressed in terms of ABX accuracy, obtained from the discrete representations for native and non-native inputs. The lexical score is expressed in terms of accuracy on the spot-the-word task, on the high-frequency words for native and non-native inputs. Error bars represent standard errors computed across mutually exclusive training sets. Two-way ANOVAs with factors of nativeness and training language were carried out for each quantity of speech. Significance markers indicate whether the native models are better than the non-native ones; significance was only computed when enough data points were available to run meaningful comparisons. Significance levels: na: not applicable, ns: not significant, * p<.05, ** p<.001, *** p<.0001. Figure taken from Lavechin, de Seyssel et al. (2022c).
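The decision rule behind the ABX accuracy reported above is simple: given a token A from one category, a token B from another, and a test token X from A's category, a trial counts as correct when X is closer to A than to B in representation space. The sketch below illustrates only this decision rule on fixed-length vectors with a cosine distance; the actual ABX pipelines used in this line of work align variable-length frame sequences before averaging distances, which is omitted here.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def abx_accuracy(cat_a, cat_b):
    """ABX accuracy over all triplets with A, X drawn (without repetition)
    from cat_a and B drawn from cat_b: a trial is correct when X is
    closer to A than to B."""
    correct, total = 0, 0
    for i, a in enumerate(cat_a):
        for j, x in enumerate(cat_a):
            if i == j:
                continue
            for b in cat_b:
                correct += cosine_distance(x, a) < cosine_distance(x, b)
                total += 1
    return correct / total

# Toy data: two noisy clusters stand in for two phonetic categories
rng = np.random.default_rng(0)
cat_a = rng.normal(scale=0.1, size=(5, 8)) + np.array([1.0] + [0.0] * 7)
cat_b = rng.normal(scale=0.1, size=(5, 8)) + np.array([0.0] * 7 + [1.0])
print(abx_accuracy(cat_a, cat_b))  # well-separated clusters score near-perfect
```

Chance level on this task is 0.5; an accuracy above chance indicates that the representations separate the two categories.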


Figure 4. An example spectrogram of an English utterance, along with the corresponding phonemes (top tier) and the units discovered by a STELA model trained on 3200 hours of English. Transcription: “The valley was filled”.


Figure 5. Panel (a) shows native discrimination accuracy, as measured in an ABX discrimination task, obtained by American English and Metropolitan French CPC models (each model is evaluated on the phonemes of its native language). Panel (b) shows the native advantage, computed as the average relative difference between the native and the non-native model for the same pairs of models (a positive native advantage indicates that the native model is better at discriminating native sounds than the non-native model). Figure adapted from Lavechin et al. (2022b).


Figure A1. Model of the learner used in STELA. The acoustic model is composed of a convolutional encoder that delivers a vector of continuous values z_t every 10 ms. This is sent to a recurrent network aggregator that integrates context and delivers vectors c_t with the same time step. Contrastive Predictive Coding is trained to predict the outputs of the encoder in the near future (up to 120 ms). The output of the aggregator is sent to a K-means algorithm that discretises the continuous representations c_t into discrete units q_t. Then, a language model (a long short-term memory (LSTM) network) is trained to predict the next unit q_t from the past ones.
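The data flow in the caption above (waveform → encoder z_t → aggregator c_t → K-means units q_t) can be sketched as follows. This is a schematic with stand-ins, not the STELA implementation: the dimensions are hypothetical, a fixed random projection replaces the learned convolutional filters, a leaky integrator replaces the trained recurrent aggregator, and the CPC objective and the LSTM language model over the q_t units are omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME_DIM = 256   # hypothetical dimensionality of z_t
N_UNITS = 50      # hypothetical K-means codebook size

def encoder(waveform, hop=160):
    """Stand-in for the convolutional encoder: one z_t vector per 10 ms hop
    (160 samples at 16 kHz), via a fixed random projection."""
    n_frames = len(waveform) // hop
    frames = waveform[: n_frames * hop].reshape(n_frames, hop)
    W = rng.standard_normal((hop, FRAME_DIM)) / np.sqrt(hop)
    return np.tanh(frames @ W)            # shape: (n_frames, FRAME_DIM)

def aggregator(z):
    """Stand-in for the recurrent aggregator: a leaky integrator that mixes
    each z_t with past context, keeping the 10 ms time step."""
    c = np.zeros_like(z)
    state = np.zeros(z.shape[1])
    for t in range(len(z)):
        state = 0.9 * state + 0.1 * z[t]
        c[t] = state
    return c

def kmeans_quantise(c, k=N_UNITS, iters=10):
    """Discretise continuous c_t vectors into unit ids q_t with plain Lloyd k-means."""
    centroids = c[rng.choice(len(c), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(c[:, None] - centroids[None], axis=2)
        q = dists.argmin(axis=1)
        for j in range(k):
            if (q == j).any():
                centroids[j] = c[q == j].mean(axis=0)
    return q

# One second of fake 16 kHz audio -> 100 frames -> 100 discrete units
wav = rng.standard_normal(16000)
z = encoder(wav)
c = aggregator(z)
q = kmeans_quantise(c)
print(z.shape, q.shape)  # (100, 256) (100,)
```

The resulting q_t sequence is what a downstream language model would be trained on, one discrete unit per 10 ms frame.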