7 - Native Language Identification on EFCAMDAT

from Part III - Data Driven Models

Published online by Cambridge University Press:  30 November 2017

Xiao Jiang
Affiliation:
Computer Laboratory, University of Cambridge, UK
Yan Huang
Affiliation:
Department of Theoretical and Applied Linguistics, University of Cambridge, UK
Yufan Guo
Affiliation:
IBM Research, USA
Jeroen Geertzen
Affiliation:
Department of Theoretical and Applied Linguistics, University of Cambridge, UK
Theodora Alexopoulou
Affiliation:
Department of Theoretical and Applied Linguistics, University of Cambridge, UK
Lin Sun
Affiliation:
Greedy Intelligence, China
Anna Korhonen
Affiliation:
Department of Theoretical and Applied Linguistics, University of Cambridge, UK
Thierry Poibeau
Affiliation:
Centre National de la Recherche Scientifique (CNRS), Paris
Aline Villavicencio
Affiliation:
Universidade Federal do Rio Grande do Sul, Brazil

Abstract

Native Language Identification (NLI) is the task of determining the native language (L1) of learners of a second language (L2) on the basis of their written texts. To date, research on NLI has focused on relatively small corpora. We apply NLI to EFCAMDAT, an L2 English learner corpus that is not only several times larger than previous L2 corpora but also provides pseudo-longitudinal data across several proficiency levels. Using accurate machine learning classification with a wide range of linguistic features, our investigation reveals patterns in the longitudinal data that are useful both for the further development of NLI and for its application to research on L2 acquisition.

Introduction

Native language identification (NLI) is a task aimed at detecting the native language (L1) of writers on the basis of their second language (L2) production. NLI is important for natural language processing (NLP) applications including language tutoring systems and authorship profiling. Moreover, NLI can offer useful empirical data for research on L2 acquisition. For example, NLI can shed light on how L1 background influences L2 learning, and on differences between the writings of L2 learners across different L1 backgrounds.

To date, studies on NLI have focused on relatively small learner corpora. Furthermore, none of them have investigated the influence of L1s across L2 proficiency levels. Our work takes the first step toward addressing these problems. We apply NLI to EFCAMDAT, the EF-Cambridge Open Language Database (Geertzen, Alexopoulou, and Korhonen, 2013), an open-access L2 learner corpus.

EFCAMDAT consists of writings that learners submitted to Englishtown, the online school of EF. EFCAMDAT stands out for its size, the diversity of student backgrounds, and its coverage of proficiency levels. The first release, from 2013 (Geertzen, Alexopoulou, and Korhonen, 2013), on which this paper is based, amounts to 30 million words, making the corpus several times larger than any other available L2 corpus. Using a standard machine learning–based methodology for NLI, we explore the optimal linguistic features for NLI on this data at different proficiency levels. We discover patterns that can be useful both for the further development of NLI and for its application to research on L2 acquisition.
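To make the task setup concrete, the following is a minimal sketch of NLI framed as supervised text classification. It is not the classifier or feature set used in this chapter, which the excerpt here does not specify. It assumes character trigrams as features and a multinomial Naive Bayes model with add-one smoothing. The names `char_ngrams` and `NaiveBayesNLI` are hypothetical, and the sample sentences are invented for illustration rather than drawn from EFCAMDAT.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesNLI:
    """Multinomial Naive Bayes over character n-gram counts (add-one smoothing).

    Illustrative only: an assumed stand-in for "a standard machine
    learning-based methodology", not the chapter's actual model.
    """

    def fit(self, texts, labels):
        self.ngram_counts = defaultdict(Counter)  # L1 label -> n-gram counts
        self.doc_counts = Counter(labels)         # L1 label -> number of texts
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = char_ngrams(text)
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        grams = char_ngrams(text)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + vocab_size
            for gram in grams:
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data: (learner text, L1 label).
train = [
    ("I am agree with this opinion", "L1-A"),
    ("She have many informations about it", "L1-A"),
    ("Yesterday I go to school by bus", "L1-B"),
    ("He very like play basketball", "L1-B"),
]
model = NaiveBayesNLI().fit([t for t, _ in train], [l for _, l in train])
predicted_l1 = model.predict("I am agree that he have informations")
```

With so few examples the prediction itself carries no weight; the point is only the shape of the pipeline: learner text in, feature extraction, a per-L1 score, and the highest-scoring L1 out. Richer feature types or separate models per proficiency level would slot into the same structure.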

In this introductory section, we first review the history of research on NLI and introduce the data sets that have been used in earlier NLI research. We then briefly summarise our contribution.

  • Native Language Identification on EFCAMDAT
    • By Xiao Jiang, Computer Laboratory, University of Cambridge, UK, Yan Huang, Department of Theoretical and Applied Linguistics, University of Cambridge, UK, Yufan Guo, IBM Research, USA, Jeroen Geertzen, Department of Theoretical and Applied Linguistics, University of Cambridge, UK, Theodora Alexopoulou, Department of Theoretical and Applied Linguistics, University of Cambridge, UK, Lin Sun, Greedy Intelligence, China, Anna Korhonen, Department of Theoretical and Applied Linguistics, University of Cambridge, UK
  • Edited by Thierry Poibeau, Centre National de la Recherche Scientifique (CNRS), Paris, Aline Villavicencio, Universidade Federal do Rio Grande do Sul, Brazil
  • Book: Language, Cognition, and Computational Models
  • Online publication: 30 November 2017
  • Chapter DOI: https://doi.org/10.1017/9781316676974.007