
Developing, evaluating, and refining an automatic generator of diagnostic multiple choice cloze questions to assess children's comprehension while reading*


We describe the development, pilot-testing, refinement, and four evaluations of the Diagnostic Question Generator (DQGen), which automatically generates multiple choice cloze (fill-in-the-blank) questions to test children's comprehension while reading a given text. Unlike previous methods, DQGen tests comprehension not only of an individual sentence but also of the context preceding it. To test different aspects of comprehension, DQGen generates three types of distractors: ungrammatical distractors test syntax; nonsensical distractors test semantics; and locally plausible distractors test inter-sentential processing.
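To make the three distractor types concrete, here is an invented toy example (ours, not DQGen output): a cloze question whose candidates illustrate what each distractor type is meant to diagnose.

```python
# Illustrative cloze question in the DQGen style.
# The story text and all candidate words are invented examples,
# not output of the actual system.

question = {
    "context": "Sam planted a seed in the garden. Every day he watered it.",
    "stem": "Soon a green ____ sprouted from the soil.",
    "answer": "plant",
    "distractors": {
        "ungrammatical": "quickly",  # wrong part of speech: tests syntax
        "nonsensical": "bicycle",    # grammatical but senseless: tests semantics
        "plausible": "weed",         # fits the sentence, not the story:
                                     # tests inter-sentential processing
    },
}

# A reader who picks a distractor reveals which level of
# comprehension broke down.
for kind, word in question["distractors"].items():
    print(f"{kind}: {question['stem'].replace('____', word)}")
```

Choosing the plausible distractor suggests the child processed the sentence in isolation but not the preceding context, which is precisely the diagnostic signal sentence-level methods cannot provide.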

A pilot study of DQGen 2012 evaluated its overall questions and individual distractors, guiding its refinement into DQGen 2014.


Twenty-four elementary students generated 200 responses to multiple choice cloze questions that DQGen 2014 generated from forty-eight stories. In 130 of the responses, the child chose the correct answer. We define the distractiveness of a distractor as the frequency with which students choose it over the correct answer. The incorrect responses were consistent with expected distractiveness: twenty-seven were plausible, twenty-two were nonsensical, fourteen were ungrammatical, and seven were null.
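The response counts above determine the observed distractiveness of each distractor type directly; a minimal sketch of that computation, using only the aggregate counts reported here:

```python
# Aggregate response counts reported in the evaluation above
# (200 responses total, of which 130 were correct).
responses = {
    "correct": 130,
    "plausible": 27,
    "nonsensical": 22,
    "ungrammatical": 14,
    "null": 7,
}

total = sum(responses.values())
assert total == 200

# Observed distractiveness per distractor type: the rate at which
# children chose that distractor instead of the correct answer.
for kind in ("plausible", "nonsensical", "ungrammatical"):
    rate = responses[kind] / total
    print(f"{kind}: {rate:.1%}")
```

The computed rates preserve the expected ordering: plausible distractors drew the most incorrect responses, ungrammatical the fewest.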


To compare DQGen 2014 against DQGen 2012, five human judges categorized candidate choices without knowing their intended type or whether they were the correct answer or a distractor generated by DQGen 2012 or DQGen 2014. The percentage of distractors categorized as their intended type was significantly higher for DQGen 2014.


We evaluated DQGen 2014 against human performance based on 1,486 similarly blind categorizations by twenty-seven judges of sixteen correct answers, forty-eight distractors generated by DQGen 2014, and 504 distractors authored by twenty-one humans. Surprisingly, DQGen 2014 did significantly better than humans at generating ungrammatical distractors and marginally better than humans at generating nonsensical distractors, albeit slightly worse at generating plausible distractors. Moreover, vetting DQGen 2014's output and writing distractors only when necessary would halve the time needed to write them all and yield higher-quality distractors.


This paper combines material from Mostow and Jang (2012), our AIED2015 paper (Huang and Mostow 2015) on a comparison to human performance, and substantial new content including improvements to DQGen and the evaluations reported in Sections 4.1 and 4.2. The research reported here was supported in part by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A080157, the National Science Foundation through Grant IIS1124240, and by the Taiwan National Science Council through the Graduate Students Study Abroad Program. We thank the other LISTENers who contributed to this work; everyone who categorized and wrote distractors; the reviewers of our BEA2012 and AIED2015 papers and this article for their helpful comments; and Prof. Y. S. Sun at National Taiwan University and Dr. M. C. Chen at Academia Sinica for enabling the first author to participate in this program. The opinions expressed are those of the authors and do not necessarily represent the views of the Institute, the U.S. Department of Education, the National Science Foundation, or the National Science Council.

References
Agarwal, M., and Mannem, P. 2011a. Automatic gap-fill question generation from text books. In Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 56–64. Stroudsburg, PA: Association for Computational Linguistics.
Agarwal, M., Shah, R., and Mannem, P. 2011b. Automatic question generation using discourse cues. In Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 1–9. Stroudsburg, PA: Association for Computational Linguistics.
Aldabe, I., and Maritxalar, M. 2010. Automatic distractor generation for domain specific texts. In Loftsson, H., Rögnvaldsson, E., and Helgadóttir, S. (eds.), Advances in Natural Language Processing: The 7th International Conference on NLP, Reykjavík, Iceland, pp. 27–38. Berlin/Heidelberg: Springer.
Aldabe, I., Maritxalar, M., and Martinez, E. 2007. Evaluating and improving distractor-generating heuristics. In Ezeiza, N., Maritxalar, M., and S. M. (eds.), Proceedings of the Workshop on NLP for Educational Resources, in conjunction with RANLP 2007, pp. 7–13. Borovets, Bulgaria.
Aldabe, I., Maritxalar, M., and Mitkov, R. 2009, July 6–10. A study on the automatic selection of candidate sentences and distractors. In Dimitrova, V., Mizoguchi, R., Boulay, B. D., and Graesser, A. (eds.), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED2009), pp. 656–8. Brighton, UK: IOS Press.
Becker, L., Basu, S., and Vanderwende, L. 2012. Mind the gap: learning to choose gaps for question generation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 742–51. Montreal, Canada: Association for Computational Linguistics.
Biemiller, A., 2009. Words Worth Teaching: Closing the Vocabulary Gap. Columbus, OH: SRA/McGraw-Hill.
Brown, J. C., Frishkoff, G. A., and Eskenazi, M. 2005, October 6–8. Automatic question generation for vocabulary assessment. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 819–26. Vancouver, BC, Canada. Stroudsburg, PA, USA: Association for Computational Linguistics.
Burton, S. J., Sudweeks, R. R., Merrill, P. F., and Wood, B., 1991. How to Prepare Better Multiple-Choice Test Items: Guidelines for University Faculty. Salt Lake City, UT: Brigham Young University Testing Services and The Department of Instructional Science.
Cassels, J. R. T., and Johnstone, A. H. 1984. The effect of language on student performance on multiple choice tests in chemistry. Journal of Chemical Education 61 (7): 613.
Chang, K.-M., Nelson, J., Pant, U., and Mostow, J. 2013. Toward exploiting EEG input in a reading tutor. International Journal of Artificial Intelligence in Education 22 (1, “Best of AIED2011 Part 1”): 29–41.
Chen, W., Mostow, J., and Aist, G. S. 2013. Recognizing young readers’ spoken questions. International Journal of Artificial Intelligence in Education 21 (4): 255–69.
Coniam, D. 1997. A preliminary inquiry into using corpus word frequency data in the automatic generation of English language cloze tests. CALICO Journal 14 (2–4): 15–33.
Correia, R., Baptista, J., Mamede, N., Trancoso, I., and Eskenazi, M. 2010, September 22–24. Automatic generation of cloze question distractors. In Proceedings of the Interspeech 2010 Satellite Workshop on Second Language Studies: Acquisition, Learning, Education and Technology, Waseda University, Tokyo, Japan.
Fellbaum, C. 2012. WordNet. In The Encyclopedia of Applied Linguistics. Hoboken, NJ: Blackwell Publishing Ltd.
Gates, D., Aist, G., Mostow, J., Mckeown, M., and Bey, J. 2011. How to generate cloze questions from definitions: a syntactic approach. In Proceedings of the AAAI Symposium on Question Generation, pp. 19–22. Arlington, VA, AAAI Press.
Goto, T., Kojiri, T., Watanabe, T., Iwata, T., and Yamada, T. 2010. Automatic generation system of multiple-choice cloze questions and its evaluation. Knowledge Management & E-Learning: An International Journal (KM& EL) 2 (3): 210–24.
Graesser, A. C., and Bertus, E. L. 1998. The construction of causal inferences while reading expository texts on science and technology. Scientific Studies of Reading 2 (3): 247–69.
Haladyna, T. M., Downing, S. M., and Rodriguez, M. C. 2002. A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement In Education 15 (3): 309–34.
Heilman, M., and Smith, N. A. 2009. Question Generation Via Overgenerating Transformations and Ranking (Technical Report CMU-LTI-09-013). Pittsburgh, PA: Carnegie Mellon University.
Heilman, M., and Smith, N. A. 2010, June. Good question! Statistical ranking for question generation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, pp. 609–17. Los Angeles, CA, Association for Computational Linguistics.
Hensler, B. S., and Beck, J. E. 2006, June 26–30. Better student assessing by finding difficulty factors in a fully automated comprehension measure [best paper nominee]. In Ashley, K. and Ikeda, M. (eds.), Proceedings of the 8th International Conference on Intelligent Tutoring Systems, pp. 21–30. Jhongli, Taiwan, Springer-Verlag.
Huang, Y.-T., Chen, M. C., and Sun, Y. S. 2012, November 26–30. Personalized automatic quiz generation based on proficiency level estimation. In Proceedings of the 20th International Conference on Computers in Education (ICCE 2012), pp. 553–60. Singapore.
Huang, Y.-T., and Mostow, J. 2015, June 22–26. Evaluating human and automated generation of distractors for diagnostic multiple-choice cloze questions to assess children’s reading comprehension. In Conati, C., Heffernan, N., Mitrovic, A., and Verdejo, M. F. (eds.), Proceedings of the 17th International Conference on Artificial Intelligence in Education, pp. 155–64. Madrid, Spain, Lecture Notes in Computer Science, vol. 9112. Switzerland: Springer International Publishing.
Kendall, M. G., and Babington Smith, B. 1939. The problem of m rankings. The Annals of Mathematical Statistics 10 (3): 275–87.
Kintsch, W. 2005. An overview of top-down and bottom-up effects in comprehension: the CI perspective. Discourse Processes 39 (2–3): 125–8.
Klein, D., and Manning, C. D. 2003, July 7–12. Accurate unlexicalized parsing. In E. W. Hinrichs and D. Roth (eds.), Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 423–30. Sapporo, Japan, Association for Computational Linguistics.
Kolb, P. 2008. DISCO: a multilingual database of distributionally similar words. In Proceedings of KONVENS-2008 (Konferenz zur Verarbeitung natürlicher Sprache), pp. 5–12. Berlin.
Kolb, P. 2009. Experiments on the difference between semantic similarity and relatedness. In Proceedings of the 17th Nordic Conference on Computational Linguistics-NODALIDA’09, Odense, Denmark.
Landis, J. R., and Koch, G. G. 1977. The measurement of observer agreement for categorical data. Biometrics 33 (1): 159–74.
Lee, J., and Seneff, S. 2007, August 27–31. Automatic generation of cloze items for prepositions. In Proceedings of INTERSPEECH, pp. 2173–6. Antwerp, Belgium.
Li, L., Roth, B., and Sporleder, C. 2010. Topic models for word sense disambiguation and token-based idiom detection. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1138–47. Uppsala, Sweden, Association for Computational Linguistics.
Li, L., and Sporleder, C. 2009. Classifier combination for contextual idiom detection without labelled data, In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 315–23. Singapore, Association for Computational Linguistics.
Lin, Y.-C., Sung, L.-C., and Chen, M. C. 2007. An automatic multiple-choice question generation scheme for English adjective understanding. In Workshop on Modeling, Management and Generation of Problems/Questions in eLearning, the 15th International Conference on Computers in Education (ICCE 2007), Amsterdam, Netherlands, pp. 137–42.
Liu, C.-L., Wang, C.-H., Gao, Z.-M., and Huang, S.-M. 2005, June 29. Applications of lexical information for algorithmically composing multiple-choice cloze items. In Proceedings of the Second Workshop on Building Educational Applications Using NLP, Ann Arbor, Michigan, pp. 1–8. Stroudsburg, PA: Association for Computational Linguistics.
Ming, L., Calvo, R. A., Aditomo, A., and Pizzato, L. A. 2012. Using wikipedia and conceptual graph structures to generate questions for academic writing support. IEEE Transactions on Learning Technologies 5 (3): 251–63.
Mitkov, R., Ha, L. A., and Karamanis, N. 2006. A computer-aided environment for generating multiple choice test items. Natural Language Engineering 12 (2): 177–94.
Mitkov, R., Ha, L. A., Varga, A., and Rello, L. 2009, March 31. Semantic similarity of distractors in multiple-choice tests: extrinsic evaluation. In Basili, R. and Pennacchiotti, M. (eds.), EACL 2009 Workshop on GEMS: GEometrical Models of Natural Language Semantics, pp. 49–56. Athens, Greece, Association for Computational Linguistics.
Mostow, J. 2013, July. Lessons from Project LISTEN: what have we learned from a reading tutor that listens? (keynote). In H. C. Lane, K. Yacef, J. Mostow, and P. Pavlik (eds.), Proceedings of the 16th International Conference on Artificial Intelligence in Education, pp. 557–8. Memphis, TN, LNAI, vol. 7926. Springer.
Mostow, J., Beck, J. E., Bey, J., Cuneo, A., Sison, J., Tobin, B., and Valeri, J. 2004. Using automated questions to assess reading comprehension, vocabulary, and effects of tutorial interventions. Technology, Instruction, Cognition and Learning 2 (1–2): 97–134.
Mostow, J., and Chen, W. 2009, July 6–10. Generating instruction automatically for the reading strategy of self-questioning. In Dimitrova, V., Mizoguchi, R., Boulay, B. D., and Graesser, A. (eds.), Proceedings of the 14th International Conference on Artificial Intelligence in Education, pp. 465–72. Brighton, UK: IOS Press.
Mostow, J., and Jang, H. 2012, June 7. Generating diagnostic multiple choice comprehension cloze questions. In NAACL-HLT 2012 7th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 136–46. Montréal, Association for Computational Linguistics.
Niraula, N. B., Rus, V., Stefanescu, D., and Graesser, A. C. 2014. Mining gap-fill questions from tutorial dialogues. In Proceedings of the 7th International Conference on Educational Data Mining, pp. 265–8. London, UK.
Pearson, P. D., and Hamm, D. N. 2005. The history of reading comprehension assessment. In Paris, S. G. and Stahl, S. A. (eds.), Children’s Reading Comprehension and Assessment, pp. 13–69. London, United Kingdom, CIERA.
Pino, J., Heilman, M., and Eskenazi, M. 2008. A selection strategy to improve cloze question quality. In Proceedings of the Workshop on Intelligent Tutoring Systems for Ill-Defined Domains. 9th International Conference on Intelligent Tutoring Systems, pp. 22–34. Montreal, Canada.
Piwek, P., and Boyer, K. E. 2012. Varieties of question generation: introduction to this special issue. Dialogue and Discourse 3 (2): 1–9.
Raghunathan, K., Lee, H., Rangarajan, S., Chambers, N., Surdeanu, M., Jurafsky, D., and Manning, C. 2010. A multi-pass sieve for coreference resolution. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 492–501. MIT, Cambridge, MA, Association for Computational Linguistics.
Rus, V., Wyse, B., Piwek, P., Lintean, M., Stoyanchev, S., and Moldovan, C. 2010. The first question generation shared task evaluation challenge. In Proceedings of the 6th International Natural Language Generation Conference, pp. 251–7. Dublin, Ireland, Association for Computational Linguistics.
Shrout, P. E., and Fleiss, J. L. 1979. Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin 86 (2): 420–8.
Sleator, D. D. K., and Temperley, D. 1993, August 10–13. Parsing English with a link grammar. In Third International Workshop on Parsing Technologies, Tilburg, NL, and Durbuy, Belgium.
Smith, S., Sommers, S., and Kilgarriff, A. 2008. Learning words right with the Sketch Engine and WebBootCat: automatic cloze generation from corpora and the web. In Proceedings of the 25th International Conference of English Teaching and Learning & 2008 International Conference on English Instruction and Assessment, pp. 1–8. Lisbon, Portugal.
Sumita, E., Sugaya, F., and Yamamoto, S. 2005. Measuring non-native speakers’ proficiency of English by using a test with automatically-generated fill-in-the-blank questions. In Proceedings of the Second Workshop on Building Educational Applications Using NLP, pp. 61–8. Ann Arbor, Michigan, Association for Computational Linguistics.
Tapanainen, P., and Järvinen, T. 1997. A non-projective dependency parser. In Proceedings of the 5th Conference on Applied Natural Language Processing, pp. 64–71. Washington, DC, Association for Computational Linguistics.
Toutanova, K., Klein, D., Manning, C., and Singer, Y. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. Proceedings of the Human Language Technology Conference and Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Edmonton, Canada, pp. 252–9.
Unspecified. 2006. Tiny invaders. National Geographic Explorer (Pioneer Edition).
van den Broek, P., Everson, M., Virtue, S., Sung, Y., and Tzeng, Y. 2002. Comprehension and memory of science texts: inferential processes and the construction of a mental representation. In Otero, J., Leon, J., and Graesser, A. C. (eds.), The Psychology of Science Text Comprehension, pp. 131–54. Mahwah, NJ: Erlbaum.
Zesch, T., and Melamud, O. 2014. Automatic generation of challenging distractors using context-sensitive inference rules. In Workshop on Innovative Use of NLP for Building Educational Applications (BEA), pp. 143–8. Baltimore, MD.
Zhang, X., Mostow, J., and Beck, J. E. 2007, July 9–13. Can a computer listen for fluctuations in reading comprehension?. In R. Luckin, K. R. Koedinger, and J. Greer (eds.), Proceedings of the 13th International Conference on Artificial Intelligence in Education, pp. 495–502. Marina del Rey, CA: IOS Press.
Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering