Developing, evaluating, and refining an automatic generator of diagnostic multiple choice cloze questions to assess children's comprehension while reading*
Published online by Cambridge University Press: 14 April 2016
We describe the development, pilot-testing, refinement, and four evaluations of Diagnostic Question Generator (DQGen), which automatically generates multiple choice cloze (fill-in-the-blank) questions to test children's comprehension while reading a given text. Unlike previous methods, DQGen tests comprehension not only of an individual sentence but of the context preceding it. To test different aspects of comprehension, DQGen generates three types of distractors: ungrammatical distractors test syntax; nonsensical distractors test semantics; and locally plausible distractors test inter-sentential processing.
(1) A pilot study of DQGen 2012 evaluated its overall questions and individual distractors, guiding its refinement into DQGen 2014.
(2) Twenty-four elementary students generated 200 responses to multiple choice cloze questions that DQGen 2014 generated from forty-eight stories. In 130 of the responses, the child chose the correct answer. We define the distractiveness of a distractor as the frequency with which students choose it over the correct answer. The incorrect responses were consistent with expected distractiveness: twenty-seven were plausible, twenty-two were nonsensical, fourteen were ungrammatical, and seven were null.
(3) To compare DQGen 2014 against DQGen 2012, five human judges categorized candidate choices without knowing their intended type or whether they were the correct answer or a distractor generated by DQGen 2012 or DQGen 2014. The percentage of distractors categorized as their intended type was significantly higher for DQGen 2014.
(4) We evaluated DQGen 2014 against human performance based on 1,486 similarly blind categorizations by twenty-seven judges of sixteen correct answers, forty-eight distractors generated by DQGen 2014, and 504 distractors authored by twenty-one humans. Surprisingly, DQGen 2014 did significantly better than humans at generating ungrammatical distractors and marginally better than humans at generating nonsensical distractors, albeit slightly worse at generating plausible distractors. Moreover, vetting DQGen 2014's output and writing distractors only when necessary would take half the time of writing them all, and would produce higher-quality distractors.
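As a worked check of the response counts reported in evaluation (2), the sketch below tallies them and computes, for each distractor type, its share of all responses and its share of incorrect responses. This is only one possible reading of "distractiveness"; the abstract does not give an exact formula, and the dictionary keys and function names here are illustrative assumptions, not identifiers from DQGen.

```python
from collections import Counter

# Response counts as reported in evaluation (2). The labels are our own
# naming (an assumption), not identifiers taken from DQGen.
responses = Counter({
    "correct": 130,
    "plausible": 27,
    "nonsensical": 22,
    "ungrammatical": 14,
    "null": 7,
})


def share_of_all(counts):
    """Fraction of all responses that fall in each category."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def share_of_incorrect(counts, correct_label="correct"):
    """Fraction of incorrect responses that fall in each non-correct category."""
    wrong = {label: n for label, n in counts.items() if label != correct_label}
    total_wrong = sum(wrong.values())
    return {label: n / total_wrong for label, n in wrong.items()}


total_responses = sum(responses.values())
overall = share_of_all(responses)
among_errors = share_of_incorrect(responses)

for label in ("plausible", "nonsensical", "ungrammatical", "null"):
    print(f"{label:13s} {overall[label]:.3f} of all, "
          f"{among_errors[label]:.3f} of incorrect")
```

Under this reading, the counts sum to the 200 reported responses, and the ordering plausible > nonsensical > ungrammatical matches the expected distractiveness described above.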
- Natural Language Engineering, Volume 23, Issue 2, March 2017, pp. 245-294
- Copyright © Cambridge University Press 2016
This paper combines material from Mostow and Jang (2012), our AIED2015 paper (Huang and Mostow 2015) on a comparison to human performance, and substantial new content including improvements to DQGen and the evaluations reported in Sections 4.1 and 4.2. The research reported here was supported in part by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A080157, the National Science Foundation through Grant IIS1124240, and by the Taiwan National Science Council through the Graduate Students Study Abroad Program. We thank the other LISTENers who contributed to this work; everyone who categorized and wrote distractors; the reviewers of our BEA2012 and AIED2015 papers and this article for their helpful comments; and Prof. Y. S. Sun at National Taiwan University and Dr. M. C. Chen at Academia Sinica for enabling the first author to participate in this program. The opinions expressed are those of the authors and do not necessarily represent the views of the Institute, the U.S. Department of Education, the National Science Foundation, or the National Science Council.