Exploiting native language interference for native language identification

Published online by Cambridge University Press:  26 November 2020

Ilia Markov*
Affiliation:
University of Antwerp, CLiPS, Antwerp, Belgium
Vivi Nastase
Affiliation:
University of Stuttgart, Stuttgart, Germany
Carlo Strapparava
Affiliation:
FBK-irst, Fondazione Bruno Kessler, Trento, Italy
*Corresponding author. E-mail: ilia.markov@uantwerpen.be

Abstract

Native language identification (NLI)—the task of automatically identifying the native language (L1) of persons based on their writings in the second language (L2)—is based on the hypothesis that characteristics of L1 will surface and interfere in the production of texts in L2 to the extent that L1 is identifiable. We present an in-depth investigation of features that model a variety of linguistic phenomena potentially involved in native language interference in the context of the NLI task: the languages’ structuring of information through punctuation usage, emotion expression in language, and similarities of form with the L1 vocabulary through the use of anglicized words, cognates, and other misspellings. The results of experiments with different combinations of features in a variety of settings allow us to quantify the native language interference value of these linguistic phenomena and show how robust they are in cross-corpus experiments and with respect to proficiency in L2. These experiments provide a deeper insight into the NLI task, showing how native language interference explains the gap between baseline, corpus-independent features, and the state of the art that relies on features/representations that cover (indiscriminately) a variety of linguistic phenomena.

Type
Article
Copyright
© The Author(s), 2020. Published by Cambridge University Press

