
Multilingual native language identification

Published online by Cambridge University Press:  02 December 2015

SHERVIN MALMASI
MARK DRAS
Affiliation:
Centre for Language Technology, Department of Computing, Macquarie University, Sydney, Australia e-mail: shervin.malmasi@mq.edu.au, mark.dras@mq.edu.au

Abstract

We present the first comprehensive study of Native Language Identification (NLI) applied to text written in languages other than English, using data from six languages. NLI is the task of predicting an author’s first language using only their writings in a second language, with applications in Second Language Acquisition and forensic linguistics. Most research to date has focused on English but there is a need to apply NLI to other languages, not only to gauge its applicability but also to aid in teaching research for other emerging languages. With this goal, we identify six typologically very different sources of non-English second language data and conduct six experiments using a set of commonly used features. Our first two experiments evaluate our features and corpora, showing that the features perform well and at similar rates across languages. The third experiment compares non-native and native control data, showing that they can be discerned with 95 per cent accuracy. Our fourth experiment provides a cross-linguistic assessment of how the degree of syntactic data encoded in part-of-speech tags affects their efficiency as classification features, finding that most differences between first language groups lie in the ordering of the most basic word categories. We also tackle two questions that have not previously been addressed for NLI. Other work in NLI has shown that ensembles of classifiers over feature types work well and in our final experiment we use such an oracle classifier to derive an upper limit for classification accuracy with our feature set. We also present an analysis examining feature diversity, aiming to estimate the degree of overlap and complementarity between our chosen features employing an association measure for binary data. Finally, we conclude with a general discussion and outline directions for future work.
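The abstract frames NLI as supervised classification over surface features such as function words and POS n-grams. As a minimal stdlib-only sketch of that framing (not the authors' actual system, which uses trained discriminative classifiers), a nearest-centroid classifier over POS n-gram profiles illustrates the idea; the tag sequences and class names below are invented toy data:

```python
from collections import Counter

def ngrams(tokens, n):
    """Extract overlapping n-grams from a token (e.g. POS tag) sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def centroid(docs, n=2):
    """Average n-gram frequency profile of one L1 class's documents."""
    total = Counter()
    for d in docs:
        total.update(ngrams(d, n))
    return {g: c / len(docs) for g, c in total.items()}

def classify(doc, centroids, n=2):
    """Assign the L1 class whose profile shares the most n-gram mass."""
    feats = Counter(ngrams(doc, n))
    def overlap(profile):
        return sum(min(feats[g], profile.get(g, 0)) for g in feats)
    return max(centroids, key=lambda c: overlap(centroids[c]))

# Toy example: two hypothetical L1 groups with different noun-adjective
# ordering habits, detectable from POS bigrams alone.
centroids = {
    "L1-A": centroid([["DET", "NOUN", "ADJ"] * 3]),
    "L1-B": centroid([["DET", "ADJ", "NOUN"] * 3]),
}
print(classify(["DET", "NOUN", "ADJ"], centroids))  # → L1-A
```

The same pipeline applies unchanged to any language once a tagger is available, which is what makes the cross-linguistic comparison in the paper possible.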

Information

Type
Articles
Copyright
Copyright © Cambridge University Press 2015 

Fig. 1. An example of an NLI system that attempts to classify the native languages (L1) of the authors of non-native (L2) English texts.


Fig. 2. Three of the most common overuse patterns found in the writing of L1 Spanish learners. They show erroneous pluralization of adjectives, determiner misuse and overuse of the word that.


Table 1. A summary of the basic properties of the L2 data used in our study. The text length is the average number of tokens across the texts along with the standard deviation in parentheses


Table 2. A breakdown of the six languages and the L1 classes used in our study. Texts is the number of documents in each L1 class and Length represents the average text length in tokens, except for Chinese, which is measured in characters


Table 3. A listing of the tagsets used for the languages in our experiments, including the size of the tagset


Table 4. Function word counts for the various languages in our study


Fig. 3. A constituent parse tree for an example sentence along with the context-free grammar production rules which can be extracted from it.
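The extraction Fig. 3 depicts can be sketched with a small stdlib-only routine: parse a bracketed constituent tree string and collect its context-free production rules, dropping preterminal-to-word rules so the features stay lexicalization-free. The bracketed-string format and helper names are illustrative assumptions, not the authors' code (which relies on a statistical parser):

```python
def parse_tree(s):
    """Parse a bracketed tree like '(S (NP (DT the) (NN dog)) (VP (VBZ barks)))'
    into nested (label, children) tuples; terminal words stay as strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def walk(i):
        label = tokens[i + 1]          # tokens[i] is '('
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = walk(i)
                children.append(child)
            else:
                children.append(tokens[i])  # terminal word
                i += 1
        return (label, children), i + 1
    tree, _ = walk(0)
    return tree

def production_rules(tree, rules=None):
    """Collect CFG rules from internal nodes, skipping preterminal -> word."""
    if rules is None:
        rules = []
    label, children = tree
    rhs = [c[0] for c in children if isinstance(c, tuple)]
    if rhs:
        rules.append(label + " -> " + " ".join(rhs))
    for c in children:
        if isinstance(c, tuple):
            production_rules(c, rules)
    return rules

print(production_rules(parse_tree("(S (NP (DT the) (NN dog)) (VP (VBZ barks)))")))
# → ['S -> NP VP', 'NP -> DT NN', 'VP -> VBZ']
```

Each extracted rule string then serves as one feature in the classifier, exactly as a POS n-gram would.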


Fig. 4. An example of a dataset that is not balanced by topic: class 1 contains mostly documents from topic A while class 2 is dominated by texts from topic B. Here, a learning algorithm may distinguish the classes through other confounding variables related to topic.


Table 5. NLI classification accuracy (per cent) for Chinese (eleven classes), Arabic (seven classes), Italian (fourteen classes), Finnish (nine classes), German (eight classes) and Spanish (six classes). Results are reported using both binary and frequency-based feature representations. The production rules features were only tested on some languages


Fig. 5. Comparing feature performance on the CLC and Toefl11 corpora. POS-1/2/3: POS uni/bi/trigrams, FW: Function words, PR: Production rules.


Fig. 6. The learning curves (classification accuracy score versus training set size) for two feature types, Function words and POS trigrams, across three languages: English (Toefl11, row 1), Chinese (CLC, row 2) and Italian (VALICO, row 3).


Table 6. The six L1 classes used for each language in Experiment II


Table 7. Comparing classification results across languages


Fig. 7. Performance of our syntactic features (Function words and part-of-speech 1–3 grams) across the three languages.


Table 8. Accuracy for classifying texts as native or non-native


Fig. 8. A learning curve for the Spanish native speaker (NS) author versus non-native speaker author classifier trained on POS trigrams. The standard deviation range is also highlighted.


Fig. 9. NLI classification accuracy for L2 English data from the Toefl11 corpus, using POS n-grams extracted with the CLAWS, Penn Treebank and Universal POS tagsets.


Fig. 10. NLI classification accuracy for the L2 Chinese data, using POS n-grams extracted with the Penn Chinese Treebank and Universal POS tagsets.


Fig. 11. NLI classification accuracy for the L2 Italian data using POS n-grams extracted with the EAGLES and the Universal POS tagsets.


Table 9. Example oracle results for an ensemble of three classifiers
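The oracle used in the final experiment counts a document as correct if any classifier in the ensemble predicted its label, giving an upper bound on what any fusion scheme over those classifiers could achieve. A minimal sketch of that computation (the prediction lists below are invented toy data, not the paper's results):

```python
def oracle_accuracy(predictions, gold):
    """Upper-bound ensemble accuracy: a document is correct if ANY
    member classifier's prediction matches the gold label.

    predictions: list of per-classifier prediction lists, aligned with gold.
    """
    correct = sum(
        1 for i, g in enumerate(gold)
        if any(p[i] == g for p in predictions)
    )
    return correct / len(gold)

# Toy example: two classifiers over three documents.
preds = [["A", "B", "A"],   # classifier 1
         ["B", "B", "B"]]   # classifier 2
gold = ["A", "B", "C"]
print(oracle_accuracy(preds, gold))  # → 0.666... (doc 3 missed by both)
```

Because the oracle only requires one member to be right, the gap between it and any real combiner's accuracy shows how much of the ensemble's potential diversity remains unexploited.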


Table 10. Oracle classifier accuracy for the three languages in Experiment V


Fig. 12. The Q coefficient matrices of five features for Chinese (left), Arabic (middle) and English (right). The matrices are displayed as heat maps. POS 1/2/3: POS uni/bi/trigrams, FW: Function words, PR: Production rules.
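The Q coefficient in Fig. 12 is Yule's Q, a standard association measure for binary data: on a 2x2 contingency table it is (ad - bc) / (ad + bc). Applying it to per-document correctness vectors of two feature-based classifiers (an assumption consistent with the diversity analysis described in the abstract, not a copy of the authors' code) can be sketched as:

```python
def yules_q(correct1, correct2):
    """Yule's Q over two classifiers' per-document correctness vectors
    (True = document classified correctly). Q near 1: the classifiers
    succeed and fail on the same documents (redundant features);
    Q near 0: independent errors (complementary features)."""
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct1, correct2):
        if a and b:
            n11 += 1
        elif not a and not b:
            n00 += 1
        elif a:
            n10 += 1
        else:
            n01 += 1
    return (n11 * n00 - n10 * n01) / (n11 * n00 + n10 * n01)

# Two classifiers that err on exactly the same toy documents: Q = 1.
print(yules_q([True, True, False, False], [True, True, False, False]))  # → 1.0
```

Low pairwise Q between feature types is the property that makes combining them in an ensemble worthwhile, which links this analysis back to the oracle experiment.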