Advancing CALL research via data-mining techniques: Unearthing hidden groups of learners in a corpus-based L2 vocabulary learning experiment

  • Hansol Lee, Mark Warschauer and Jang Ho Lee

In this study, we used a data-mining approach to identify hidden groups in a corpus-based second-language (L2) vocabulary experiment. After a vocabulary pre-test, a total of 132 participants performed three online reading tasks (in random orders) equipped with the following glossary types: (1) concordance lines and definitions of target lexical items, (2) concordance lines of target lexical items, and (3) no glossary information. Although the results of a previous study based on variable-centred analysis (i.e. multiple regression analysis) revealed that more glossary information could lead to better learning outcomes (Lee, Warschauer & Lee, 2017), using a model-based clustering technique in the present study allowed us to unearth learner types not identified in the previous analysis. Instead of the performance pattern found in the previous study (more glossary led to higher gains), we identified one learner group who exhibited their ability to make successful use of concordance lines (and thus are optimized for data-driven learning, or DDL; Johns, 1991), and another group who showed limited L2 vocabulary learning when exposed to concordance lines only. Further, our results revealed that L2 proficiency intersects with vocabulary gains of different learner types in complex ways. Therefore, using this technique in computer-assisted language learning (CALL) research to understand differential effects of accommodations can help us better identify hidden learner types and provide personalized CALL instruction.
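As a rough illustration of what model-based clustering does, the sketch below fits one-dimensional Gaussian mixtures by expectation-maximization and compares candidate numbers of groups with BIC. It is not the authors' analysis: the scores are synthetic, and the names `fit_mixture`, `bic`, and `quantile_sample` are hypothetical helpers written for this example. The BIC is computed as 2·logL − p·ln(n), so larger values indicate a better trade-off between fit and complexity.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_mixture(data, k, iterations=200, var_floor=1e-3):
    """Fit a 1-D Gaussian mixture with k components by EM.

    Returns (weights, means, variances, log-likelihood).
    """
    xs = sorted(data)
    n = len(xs)
    # Deterministic start: split the sorted data into k equal chunks.
    chunks = [xs[i * n // k:(i + 1) * n // k] for i in range(k)]
    means = [sum(c) / len(c) for c in chunks]
    variances = [max(sum((x - m) ** 2 for x in c) / len(c), var_floor)
                 for c, m in zip(chunks, means)]
    weights = [1.0 / k] * k

    def mixture_density(x):
        return sum(w * normal_pdf(x, m, v)
                   for w, m, v in zip(weights, means, variances))

    for _ in range(iterations):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj,
                var_floor)

    loglik = sum(math.log(mixture_density(x)) for x in xs)
    return weights, means, variances, loglik


def bic(loglik, k, n):
    """BIC as 2*logL - p*ln(n); p counts k means, k variances, k-1 weights."""
    n_params = 3 * k - 1
    return 2 * loglik - n_params * math.log(n)


def quantile_sample(mean, sd, size):
    """Evenly spaced normal quantiles: a deterministic stand-in for real scores."""
    dist = NormalDist(mean, sd)
    return [dist.inv_cdf((i + 0.5) / size) for i in range(size)]


# Synthetic "gain scores" for two hypothetical learner groups (not study data).
scores = quantile_sample(2.0, 1.0, 60) + quantile_sample(8.0, 1.0, 60)

results = {}
for k in (1, 2, 3):
    weights, means, variances, loglik = fit_mixture(scores, k)
    results[k] = {"means": means, "bic": bic(loglik, k, len(scores))}
    print(k, [round(m, 2) for m in means], round(results[k]["bic"], 1))
```

In this toy setting a single-group model describes the pooled scores poorly, while the two-component fit recovers the two underlying group means. A real analysis would normally work with several variables at once and compare different covariance structures as well as different numbers of groups.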

References
AbuSeileek, A. F. (2011) Hypermedia annotation presentation: The effect of location and type on the EFL learners’ achievement in reading comprehension and vocabulary acquisition. Computers & Education, 57(1): 1281–1291.
Bergman, L. R. & Magnusson, D. (1997) A person-oriented approach in research on developmental psychopathology. Development and Psychopathology, 9(2): 291–319.
Boulton, A. (2009) Data-driven learning: Reasonable fears and rational reassurance. Indian Journal of Applied Linguistics, 35(1): 81–106.
Boulton, A. & Cobb, T. (2017) Corpus use in language learning: A meta-analysis. Language Learning, 67(2): 348–393.
Chen, I.-J. & Yen, J.-C. (2013) Hypertext annotation: Effects of presentation formats and learner proficiency on reading comprehension and vocabulary learning in foreign languages. Computers & Education, 63: 416–423.
Chun, D. M. (2001) L2 reading on the Web: Strategies for accessing information in hypermedia. Computer Assisted Language Learning, 14(5): 367–403.
Cobb, T. (1999) Applying constructivism: A test for the learner-as-scientist. Educational Technology Research and Development, 47(3): 15–31.
Cobb, T., Greaves, C. & Horst, M. (2001) Can the rate of lexical acquisition from reading be increased? An experiment in reading French with a suite of on-line resources. In Raymond, P. & Cornaire, C. (eds.), Regards sur la didactique des langues seconds. Montréal: Éditions logique, 133–153.
Csizér, K. & Dörnyei, Z. (2005) Language learners’ motivational profiles and their motivated learning behavior. Language Learning, 55(4): 613–659.
Cunningham, S., Moor, P. & Carr, J. C. (2003) Cutting edge: Advanced with phrase builder. Harlow: Pearson Education.
Dolnicar, S. (2002) A review of unquestioned standards in using cluster analysis for data-driven market segmentation. In Shaw, R. N., Adam, S. & McDonald, H. (eds.), ANZMAC 2002: Proceedings of the Australian and New Zealand Marketing Academy Conference 2002. Deakin University, 2–4 December, 31–37.
Doornik, J. A. & Hansen, H. (2008) An omnibus test for univariate and multivariate normality. Oxford Bulletin of Economics and Statistics, 70(s1): 927–939.
Educational Testing Service (2016) TOEIC® listening and reading test scores and the CEFR levels.
Faul, F., Erdfelder, E., Lang, A.-G. & Buchner, A. (2007) G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39: 175–191.
Field, A. P. (2009) Discovering statistics using SPSS (3rd ed.). London: Sage.
Firooz, H. (2015, March 4) When not to use Gaussian mixture model (EM clustering).
Fitzmaurice, G. M., Laird, N. M. & Ware, J. H. (2012) Applied longitudinal analysis (2nd ed.). Hoboken: John Wiley & Sons.
Flowerdew, L. (2008) Pedagogic value of corpora: A critical evaluation. In Frankenberg-Garcia, A. (ed.), Proceedings of the 8th Teaching and Language Corpora conference. Associação de Estudos e de Investigação Cientifíca do ISLA-Lisboa, 115–119.
Flowerdew, L. (2015) Data-driven learning and language learning theories: Whither the twain shall meet. In Leńko-Szymańska, A. & Boulton, A. (eds.), Multiple affordances of language corpora for data-driven learning. Amsterdam: John Benjamins, 15–36.
Fraley, C. & Raftery, A. E. (1998) How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8): 578–588.
Fraley, C. & Raftery, A. E. (2002) Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458): 611–631.
Fraley, C., Raftery, A. E., Scrucca, L., Murphy, T. B. & Fop, M. (2017) mclust: Gaussian mixture modelling for model-based clustering, classification, and density estimation (R package version 5.3).
Frankenberg-Garcia, A. (2012) Learners’ use of corpus examples. International Journal of Lexicography, 25(3): 273–296.
Frankenberg-Garcia, A. (2014) The use of corpus examples for language comprehension and production. ReCALL, 26(2): 128–146.
Fraser, C. A. (1999) Lexical processing strategy use and vocabulary learning through reading. Studies in Second Language Acquisition, 21(2): 225–241.
Gass, S. M., Behney, J. & Plonsky, L. (2013) Second language acquisition: An introductory course (4th ed.). New York: Routledge.
Godwin-Jones, R. (2001) Tools and trends in corpora use for teaching and learning. Language Learning & Technology, 5(3): 7–12.
Henze, N. & Zirkler, B. (1990) A class of invariant consistent tests for multivariate normality. Communications in Statistics – Theory and Methods, 19(10): 3595–3617.
Huang, L.-S. (2011) Language learners as language researchers: The acquisition of English grammar through a corpus-aided discovery learning approach mediated by intra- and interpersonal dialogues. In Newman, J., Baayen, H. & Rice, S. (eds.), Corpus-based studies in language use, language learning, and language documentation. Amsterdam: Rodopi, 91–122.
Hummel, K. M. & French, L. M. (2016) Phonological memory and aptitude components: Contributions to second language proficiency. Learning and Individual Differences, 51: 249–255.
Johns, T. (1991) Should you be persuaded: Two examples of data-driven learning. In Johns, T. & King, P. (eds.), Classroom concordancing. English Language Research Journal, 4: 1–16.
Jung, Y. G., Kang, M. S. & Heo, J. (2014) Clustering performance comparison using K-means and expectation maximization algorithms. Biotechnology & Biotechnological Equipment, 28(Supp. 1): S44–S48.
Lee, H. & Lee, J. H. (2013) Implementing glossing in mobile-assisted language learning environments: Directions and outlook. Language Learning & Technology, 17(3): 6–22.
Lee, H. & Lee, J. H. (2015) The effects of electronic glossing types on foreign language vocabulary learning: Different types of format and glossary information. The Asia-Pacific Education Researcher, 24(4): 591–601.
Lee, H., Warschauer, M. & Lee, J. H. (2017) The effects of concordance-based electronic glosses on L2 vocabulary learning. Language Learning & Technology, 21(2): 32–51.
Lee, H., Warschauer, M. & Lee, J. H. (2018) The effects of corpus use on second language vocabulary learning: A multilevel meta-analysis. Applied Linguistics. Advance online publication.
Leńko-Szymańska, A. & Boulton, A. (2015) Introduction: Data-driven learning in language pedagogy. In Leńko-Szymańska, A. & Boulton, A. (eds.), Multiple affordances of language corpora for data-driven learning. Amsterdam: John Benjamins, 1–14.
Lomicka, L. L. (1998) “To gloss or not to gloss”: An investigation of reading comprehension online. Language Learning & Technology, 1(2): 41–50.
Maris, E. (1998) Covariance adjustment versus gain scores—revisited. Psychological Methods, 3(3): 309–327.
Martin, K. I. & Ellis, N. C. (2012) The role of phonological short-term memory and working memory in L2 grammar and vocabulary learning. Studies in Second Language Acquisition, 34(3): 379–413.
Meilă, M. & Heckerman, D. (2001) An experimental comparison of model-based clustering methods. Machine Learning, 42(1/2): 9–29.
Mun, E. Y., von Eye, A., Bates, M. E. & Vaschillo, E. G. (2008) Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2): 481–495.
Nassaji, H. (2003) L2 vocabulary learning from context: Strategies, knowledge sources, and their relationship with success in L2 lexical inferencing. TESOL Quarterly, 37(4): 645–670.
Papi, M. & Teimouri, Y. (2014) Language learner motivational types: A cluster analysis study. Language Learning, 64(3): 493–525.
Pires, A. M. & Branco, J. A. (2010) Projection-pursuit approach to robust linear discriminant analysis. Journal of Multivariate Analysis, 101(10): 2464–2485.
Plass, J. L., Chun, D. M., Mayer, R. E. & Leutner, D. (1998) Supporting visual and verbal learning preferences in a second-language multimedia learning environment. Journal of Educational Psychology, 90(1): 25–36.
Poole, R. (2012) Concordance-based glosses for academic vocabulary acquisition. CALICO Journal, 29(4): 679–693.
Royston, P. (1991) sg3.5: Comment on sg3.4 and an improved D’Agostino test. Stata Technical Bulletin, 3: 23–24.
Rüschoff, B. & Ritter, M. (2001) Technology-enhanced language learning: Construction of knowledge and template-based learning in the foreign language classroom. Computer Assisted Language Learning, 14(3–4): 219–232.
Schmitt, N. (2000) Vocabulary in language teaching. Cambridge: Cambridge University Press.
Schmitt, N. (2008) Review article: Instructed second language vocabulary learning. Language Teaching Research, 12(3): 329–363.
Scrucca, L., Fop, M., Murphy, T. B. & Raftery, A. E. (2016) mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1): 289–317.
Skehan, P. (1986) Cluster analysis and the identification of learner types. In Cook, V. (ed.), Experimental approaches to second language acquisition. Oxford: Pergamon, 81–94.
Staples, S. & Biber, D. (2015) Cluster analysis. In Plonsky, L. (ed.), Advancing quantitative methods in second language research. New York: Routledge, 243–274.
Tacq, J. (2010) Multivariate normal distribution. In Peterson, P., Baker, E. & McGaw, B. (eds.), International encyclopedia of education (3rd ed.). Oxford: Elsevier, 332–338.
Tseng, W.-T. & Schmitt, N. (2008) Toward a model of motivated vocabulary learning: A structural equation modeling approach. Language Learning, 58(2): 357–400.
Witten, I. H., Frank, E., Hall, M. A. & Pal, C. J. (2016) Data mining: Practical machine learning tools and techniques (4th ed.). Cambridge, MA: Morgan Kaufmann.
Yamamori, K., Isoda, T., Hiromori, T. & Oxford, R. L. (2003) Using cluster analysis to uncover L2 learner differences in strategy use, will to learn, and achievement over time. International Review of Applied Linguistics in Language Teaching, 41(4): 381–409.
Yanguas, I. (2009) Multimedia glosses and their effect on L2 text comprehension and vocabulary learning. Language Learning & Technology, 13(2): 48–67.
Yeung, K. Y., Fraley, C., Murua, A., Raftery, A. E. & Ruzzo, W. L. (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics, 17(10): 977–987.
Yoshii, M. (2006) L1 and L2 glosses: Their effects on incidental vocabulary learning. Language Learning & Technology, 10(3): 85–101.

  • ISSN: 0958-3440
  • EISSN: 1474-0109
  • URL: /core/journals/recall