Crowdsourcing evaluation of the quality of automatically generated questions for supporting computer-assisted language teaching

Abstract
How can state-of-the-art computational linguistic technology reduce the workload and increase the efficiency of language teachers? To address this question, we combine insights from research in second language acquisition and computational linguistics to automatically generate text-based questions for a given text. The questions are designed to draw the learner’s attention to target linguistic forms – phrasal verbs, in this particular case – by requiring them to use the forms or their paraphrases in the answer. Such questions help learners create form-meaning connections and are well suited for both practice and testing. We discuss the generation of a novel type of question combining a wh-question with a gapped sentence, and report the results of two crowdsourcing evaluation studies investigating how well automatically generated questions compare to those written by a language teacher. The first study compares our system output to gold standard human-written questions via crowdsourcing rating. An equivalence test shows that automatically generated questions are comparable to human-written ones. The second crowdsourcing study investigates two types of questions (wh-questions with and without a gapped sentence), their perceived quality, and the responses they elicit. Finally, we discuss the challenges and limitations of creating and evaluating question-generation systems for language learners.


Introduction
Questions are habitually used by teachers to test comprehension, encourage discussion, and check understanding of learning materials. We argue that in a language-learning classroom particular questions can facilitate the acquisition and practice of different linguistic forms by creating a functional need to notice and process a linguistic form (Robinson, Mackey, Gass & Schmidt, 2012). This idea is supported by a large body of research on input enhancement (Sharwood Smith, 1993; see Simard, 2018, for a recent overview) and processing instruction, particularly research on structured input activities (see VanPatten, 2017, for a recent review).
Cite this article: Chinkina, M., Ruiz, S. & Meurers, D. (2020). Crowdsourcing evaluation of the quality of automatically generated questions for supporting computer-assisted language teaching. ReCALL 32(2): 145-161. https://doi.org/10.1017/S0958344019000193
In our work, we combine insights from research in second language acquisition and computational linguistics to automatically generate text-based questions that draw the learner's attention to target linguistic forms by requiring learners to process and use these forms in their answers. In this study we focus on English phrasal verbs. Phrasal verbs are multi-word verbs that function syntactically and semantically as a single unit (e.g. end up [finish]). They are the first linguistic form we examine as they represent a considerable teaching and learning load (Garnier & Schmitt, 2016). Phrasal verbs exhibit both lexical and syntactic properties that make them particularly difficult for language learners to master (Larsen-Freeman & Celce-Murcia, 2015). This includes the specifics of their compositionality and the seemingly random nature of some of the particles that are part of phrasal verbs (Side, 1990).
With this in mind, and considering the intersection of second language acquisition, natural language processing (NLP) methods, and computer-assisted language learning (CALL) (see Lu, 2018; Meurers, 2012; Meurers & Dickinson, 2017; Reinders & Stockwell, 2017), we compare questions targeting phrasal verbs written by an English teacher to those automatically generated by an intelligent CALL (ICALL) application in order to assess whether the questions produced by computers are equivalent in terms of quality to those written by humans. As we will show, automatically generated questions are qualitatively comparable to those devised by a language teacher, and therefore we argue that this technology can be integrated into future language instruction via computer-assisted language teaching applications.

Questions in traditional and computer-assisted language teaching
Form-focused instruction is premised on the idea that mere exposure to input is insufficient for second language acquisition to occur (Long, 1991). Learners need to notice certain features of the language (e.g. grammatical encodings, lexical items) in the input in order for these features to be acquired. In line with Ellis's (2016) remarks about focus on form, it has also been argued that different kinds of attention-drawing activities are needed to facilitate the acquisition and practice of different linguistic forms (Robinson et al., 2012).
Questions offer the possibility to provide form-focused instruction because they can target specific parts in reading materials that contain language constructions that learners need to systematically pay attention to or notice, thereby producing a functional demand to process second language input (Ellis, 2016). In the case of phrasal verbs, the linguistic target in this study, questions targeting these structures can be designed to draw the learner's attention to form by focusing on both form and meaning, whereby the only way to answer a given question correctly is by understanding both the lexical and morphological form and the meaning of the targeted phrasal verb (see Appendix for examples).

From manually written to automatically generated questions in CALL
While input enhancement and language-learning activities are traditionally implemented manually or at times hard-coded in CALL tools, computational linguistic methods can support their automation, resulting in ICALL applications (Meurers et al., 2010; Ziegler et al., 2017). By leveraging computational linguistic tools and methods, we have developed a system that automatically generates wh-questions and gapped sentences from text, with the primary goal of drawing learners' attention to target linguistic forms. For instance, given the source text (1), a program we developed automatically generated the question (1a) targeting the phrasal verb tick up:
(1) Source text: [...] Cancellations "ticked up slightly and unexpectedly" in early April amid press coverage about the coming increases, the Netflix letter said.
a. Computer: According to the Netflix letter, what did cancellations do? Cancellations _______ slightly and unexpectedly in early April amid press coverage about the coming increases.
Our system relies on Stanford CoreNLP, a natural language processing toolkit by Manning, Surdeanu, Bauer, Finkel, Bethard and McClosky (2014). In general terms, the task of NLP is to assign a structure representing syntactic relationships between words in a given sentence. More specifically, we use it for sentence splitting, tokenizing, lemmatizing, constituency and dependency parsing, and to resolve coreferences. Given an analysed sentence, our algorithm generates questions from it as follows. It detects the target linguistic form (e.g. phrasal verbs), identifies grammatical functions in the sentence (e.g. subject, predicate), and turns a declarative sentence into an interrogative one by applying syntactic transformation rules. A sentence with a gap is generated by substituting all parts of a target linguistic construction (e.g. the verb and the particle of a phrasal verb) with a gap. The technical side of the implementation of our system is described in more detail in . Here, we focus on evaluating the approach and extend it with a novel question type in question-generation research, the combination of a wh-question and a gapped sentence.
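The gap-substitution step described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the `Token` class, the `is_target` flag, and the `make_gapped_sentence` function are hypothetical names introduced here, and the sketch assumes that an upstream analysis (e.g. a parser) has already marked which tokens belong to the target phrasal verb.

```python
from dataclasses import dataclass

GAP = "_______"
PUNCTUATION = {".", ",", "?", "!", ";", ":"}


@dataclass
class Token:
    """A token plus a flag set by a (hypothetical) upstream analysis step."""
    text: str
    is_target: bool = False  # True for every part of the target form


def make_gapped_sentence(tokens):
    """Replace each contiguous run of target tokens with a single gap."""
    pieces = []
    previous_was_target = False
    for tok in tokens:
        if tok.is_target:
            # Emit one gap for the whole run (verb + particle).
            if not previous_was_target:
                pieces.append(GAP)
            previous_was_target = True
            continue
        previous_was_target = False
        if tok.text in PUNCTUATION and pieces:
            pieces[-1] += tok.text  # attach punctuation to the preceding piece
        else:
            pieces.append(tok.text)
    return " ".join(pieces)


# Hand-annotated tokens for the sentence from example (1).
sentence = [
    Token("Cancellations"),
    Token("ticked", is_target=True),
    Token("up", is_target=True),
    Token("slightly"), Token("and"), Token("unexpectedly"),
    Token("in"), Token("early"), Token("April"), Token("."),
]
gapped = make_gapped_sentence(sentence)
```

In this sketch a separable phrasal verb whose parts are not adjacent would yield two gaps, one per part; whether that matches the behaviour of the system described here is not specified in the text.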

Computational linguistic methods for evaluating automatically generated questions
The computational linguistic task of automatic question generation has explored a range of question types, from factual recall questions (Wolfe, 1976) to deeper discussion questions (Adamson, Bhartiya, Gujral, Kedia, Singh & Rosé, 2013). The work at the intersection of computational linguistics and language learning has addressed the generation of wh-questions (Heilman, 2011; Mitkov & Ha, 2003) as well as that of cloze exercises (i.e. sentences where the target form is replaced with a gap) (Becker, Basu & Vanderwende, 2012; Brown, Frishkoff & Eskenazi, 2005; Mostow et al., 2004). In order to leverage the advances in question generation and apply them in the language-learning context in a focused task, we propose to generate questions consisting of both a wh-question and a sentence with a gap. In the following sections, we discuss the advantages of this question type over simple open-ended wh-questions and compare the two question types in an online study.
As for the performance of question-generation systems, it has been assessed either by using automatic measures, such as BLEU (Papineni, Roukos, Ward & Zhu, 2002), or by collecting human judgments. For instance, Zhang and VanLehn (2016) recruited students to judge the comparability of computer-generated, web-crawled and human-written biology questions based on several 5-point scales (relevance, fluency, ambiguity, pedagogy, depth). Heilman and Smith (2010) conducted a crowdsourcing study to assess the goodness of computer-generated questions using one 5-point scale and used the collected judgments to train a statistical ranker for their question-generation system. Crowdsourcing is an attractive option for evaluating question-generation systems given its time and cost effectiveness along with the similarity of the crowd ratings to expert judgments (Benoit, Conway, Lauderdale, Laver & Mikhaylov, 2016; Snow, O'Connor, Jurafsky & Ng, 2008). Using crowdsourcing to compare computer-generated and human-written questions seems like a logical next step in this line of research.

Research questions and hypotheses
The purpose of our study was to compare the perceived quality of automatically generated questions to that of human-written ones. Zhang and VanLehn (2016) conducted a similar kind of evaluation in an offline setting. The researchers showed that university students' ratings of computer-generated questions testing comprehension of biology texts are comparable to their ratings of questions written by a teacher. Heilman and Smith (2010), on the other hand, turned to crowdsourcing for assessing the quality of their automatically generated factual questions using one "goodness" scale, but did not compare it to the perceived quality of human-written questions. Informed by this research, we opted for a crowdsourcing evaluation and defined two important aspects as the basis for a comparison between automatically generated and manually written questions: well-formedness and answerability. A question is considered well formed if it does not contain grammar mistakes. The answerability of a question, on the other hand, refers to its semantics. A question is considered answerable if it is formulated in a way that is understandable and an answer to it can be found in the source text. We target these characteristics with our first research question:
RQ1. Are computer-generated questions comparable to those written by English teachers in well-formedness and answerability?
Although there is no previous research comparing computer-generated and human-written questions via crowdsourcing, based on the aforementioned related work by Zhang and VanLehn (2016) and Heilman and Smith (2010), we expected that the questions produced by the computer and the English teacher would be comparable regarding the two aspects under investigation.
As the combination of a wh-question and a gapped sentence that we generate is novel in the question-generation research field (see Heilman, 2011), we were particularly interested in whether this type of question is perceived as better formed and more easily answerable than standard wh-questions. Therefore, we formulated the second research question:
RQ2. Are wh-questions followed by a gapped sentence perceived as better with respect to well-formedness and answerability than open-ended wh-questions?
For this novel type of question, we predicted that a wh-question and a gapped sentence may cancel out each other's potential disadvantages, and thus their combination would be rated higher than a single wh-question with respect to both perceived well-formedness and answerability.
Finally, to further explore the differences between the two types of questions (with and without a gapped sentence) in terms of what answers they can elicit, we formulated the third question:
RQ3. Do wh-questions followed by a gapped sentence elicit more correct responses and target phrasal verbs than open-ended wh-questions?
We predicted that the addition of a gapped sentence would limit the participants' choice of an answer phrase to the phrasal verb given in the text. Thus, the combination of a wh-question and a gapped sentence would increase the likelihood of obtaining a correct response and have a higher probability of containing the target phrasal verbs from the source text as part of the answer than simple open-ended wh-questions.
To address these research questions, we conducted two crowdsourcing studies on the Figure Eight platform (https://www.figure-eight.com), discussed in detail in the following sections.

Study 1: Quality of automatically generated questions
For questions to be effective in a real-life language classroom, they must be reasonably well formed and answerable. The goal of the first study was to evaluate our system by comparing the quality of computer-generated questions to the gold standard questions written by the English teacher in these two respects.

Data for Study 1
The data consisted of 138 questions designed to facilitate the acquisition and practice of phrasal verbs. Stanford CoreNLP (Manning et al., 2014) and additional algorithms were used to automatically detect 92 sentences containing unique phrasal verbs in a corpus of 40 English news articles. For these sentences, our question-generation system produced 69 questions, both well and ill formed, all of which were included in the data set. An English teacher wrote 113 questions targeting the same sentences, so we randomly selected 69 of those to include in the data set. To illustrate, the questions that follow are instances of well-formed questions by a human (2a) and a computer (2b), derived from the same source text. Question (3a) is a well-formed () human-written question, whereas the computer-generated question (3b) is ill formed (-).
(2) Source text: [...] Beijing's drive to make the nation a leader in robotics through its "Made in China 2025" initiative launched last year has set off a rush as municipalities up and down the country vie to become China's robotics center.
a. Human (): What has the "Made in China 2025" initiative done since it was launched last year? It has _____ a rush for municipalities to become China's robotics center.
b. Computer (): According to the article, what has Beijing's drive done? Beijing's drive has _____ a rush as municipalities up and down the country vie to become China's robotics center.
(3) Source text: [...] Twitter is also working to better define its role in the social media landscape. This week it rolled out a video ad that showed it as the place to go for live news, updates and discussion about current events.
a. Human (): What is Twitter doing to better define its role in the social media landscape? It ______ a video ad this week.
b. Computer (-): According to the article, what did this week do? This week ______ a video ad that showed it as the place to go for live news.

Participants of Study 1
Although the main advantage of crowdsourcing is that it provides access to a large number of people all around the world, it comes with a risk of recruiting unsuitable contributors (see Stewart, Chandler & Paolacci, 2017, for a recent review on the use of crowdsourcing in behavioral research). For this study, we needed judgments that are as close to expert ones (e.g. English teachers) as possible. The following steps helped us achieve this.
First, we used the functionality of the crowdsourcing website to select only English-speaking countries, thereby increasing the probability of the contributors being native speakers of English. However, when we only received one response in the first five hours, we extended the list to include some other European countries where English proficiency is high, which, according to the EF English Proficiency Index (Education First, 2017), are the Netherlands, Denmark, Norway, Sweden, Finland, Germany, and Austria.
We included test questions to further filter out unsuitable contributors. In order to proceed to the main task, each contributor first took a quiz in which they had to correctly rate and answer four out of five test questions. The test questions looked exactly like the questions from the main task, except that some of them were manually edited to either be ungrammatical or unanswerable in order to ensure an even distribution of low-rated and high-rated test questions, as recommended by Figure Eight guidelines. Finally, a small number of test questions looked different and required the participants to specify whether they were in fact proficient speakers of English and whether their answers were reliable. In this way, we made sure that the contributors understood the task at hand, that they were able to distinguish between a well-formed and an ill-formed question, and that their language skills were advanced enough to answer a question given a source text.
In order to perform the main task, participants had to keep their accuracy rate above 70% by correctly answering randomly inserted test questions among the other question items. In total, 364 reliable contributors took part in this study.
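The two filtering rules described above (a quiz requiring four of five correct test questions, followed by a running accuracy requirement above 70%) can be sketched as follows. This is only an illustration of the logic, not the crowdsourcing platform's API: the function names `passes_quiz` and `stays_reliable` are hypothetical, and each test question is assumed to be recorded as a simple True/False outcome.

```python
def passes_quiz(quiz_results, required_correct=4):
    """Entry quiz: at least `required_correct` test questions must be correct."""
    return sum(quiz_results) >= required_correct


def stays_reliable(test_results, threshold=0.70):
    """Main task: accuracy on hidden test questions must stay above threshold."""
    if not test_results:
        return True  # assumption: no hidden test questions seen yet
    accuracy = sum(test_results) / len(test_results)
    return accuracy > threshold


# A contributor who got 4 of 5 quiz items right, then 8 of 10 hidden test items.
admitted = passes_quiz([True, True, False, True, True])
retained = stays_reliable([True] * 8 + [False] * 2)
```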

Procedure of Study 1
Participants were presented with a source text (an excerpt from a news article, one to three sentences long) and a question about this text. Each question had to be rated on two separate 5-point Likert scales: one for well-formedness and the other for answerability. To help ensure that participants were paying attention, they were also required to answer the question about the source text. Finally, they were asked to guess whether the presented question was written by the English teacher or generated by the computer. We collected 10 judgments per question item.

Results of Study 1
To investigate whether computer-generated questions were rated as high as human-written ones, we first calculated the intra-class correlation (ICC) between the contributors' ratings. The ICC was smaller than .1 (i.e. .08 and .09 for well-formedness and answerability, respectively), meaning that the contributors provided different ratings for different question items, so that we can assume the judgments to be independent.
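For readers unfamiliar with the intra-class correlation, the sketch below computes a one-way ICC from a balanced set of ratings grouped by contributor. The text does not state which ICC variant was used, so the one-way random-effects form, the balanced design, and the function name `icc_oneway` are all assumptions made for illustration.

```python
def icc_oneway(groups):
    """One-way random-effects ICC for a balanced design.

    `groups` is a list of equally long lists, e.g. one list of ratings
    per contributor. Returns (MSB - MSW) / (MSB + (k - 1) * MSW).
    """
    k = len(groups[0])
    if k < 2 or len(groups) < 2 or any(len(g) != k for g in groups):
        raise ValueError("need at least two groups of equal size k >= 2")
    n = len(groups)
    grand_mean = sum(sum(g) for g in groups) / (n * k)
    group_means = [sum(g) / k for g in groups]
    # Mean square between groups and mean square within groups.
    ms_between = k * sum((m - grand_mean) ** 2 for m in group_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Toy data: each contributor always gives the same rating -> ICC of 1.
fully_clustered = icc_oneway([[1, 1], [5, 5]])
```

A value near zero indicates that knowing which contributor produced a rating tells us little about the rating itself, which is the sense in which the judgments are treated as independent above.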
To test whether the quality of the questions generated by the computer was equivalent to those written by the teacher, we conducted Schuirmann's (1987) two one-sided tests of equivalence (medium effect size d = 0.5, alpha level of .05) for each of the two scales. All results were statistically significant on both scales: well-formedness, t₁(912) = 9.814, p₁ < .001, t₂(912) = −5.677, p₂ < .001, 90% CI [0.025, 0.220]; answerability, t₁(944) = 7.322, p₁ ≤ .001, t₂(944) = −8.170, p₂ < .001, 90% CI [−0.134, 0.079]. As the null and the alternative hypothesis are reversed in equivalence testing, statistically significant results indicate that the two samples are indeed equivalent. Thus, the results show that questions generated by the computer are not inferior or superior to those written by the English teacher in well-formedness or answerability, considering medium size effects.
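The logic of the two one-sided tests can be made concrete with a short sketch. It is an approximation rather than a reproduction of the analysis reported here: it uses only the Python standard library and therefore replaces the t distribution with the standard normal, which is a reasonable stand-in only for large samples such as those above (df > 900). The function name `tost_equivalence` and the toy ratings are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_equivalence(a, b, d=0.5, alpha=0.05):
    """Two one-sided tests with equivalence bounds of +/- d pooled SDs.

    Normal approximation to the t distribution (assumption: large samples).
    """
    n1, n2 = len(a), len(b)
    s1, s2 = stdev(a), stdev(b)
    pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    delta = d * pooled_sd                    # raw-scale equivalence bound
    se = sqrt(s1**2 / n1 + s2**2 / n2)       # Welch standard error
    diff = mean(a) - mean(b)
    t_lower = (diff + delta) / se            # H0: diff <= -delta
    t_upper = (diff - delta) / se            # H0: diff >= +delta
    z = NormalDist()
    p_lower = 1 - z.cdf(t_lower)
    p_upper = z.cdf(t_upper)
    return {
        "t1": t_lower, "p1": p_lower,
        "t2": t_upper, "p2": p_upper,
        # Equivalence is claimed only if BOTH one-sided tests reject.
        "equivalent": max(p_lower, p_upper) < alpha,
    }


# Toy 5-point ratings with identical means (hypothetical data).
human = [4, 5, 4, 5, 3] * 40
computer = [5, 4, 4, 5, 3] * 40
result = tost_equivalence(human, computer)
```

Note the sign pattern: when the samples are equivalent, the first statistic is large and positive and the second large and negative, as in the values reported above.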
To investigate differences of smaller effect sizes, we used t-tests to compare the questions produced by the computer and the human. The results showed that there is a statistically significant difference between human-written and computer-generated questions with respect to their well-formedness with a small effect size, t(1,316) = 2.48, p = .013, d = 0.133.¹ On the answerability scale, there was no such significant difference, t(1,362) = −0.509, p = .611, d = 0.027.
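The effect size d reported alongside these t-tests is Cohen's d, the difference between two means expressed in pooled standard deviations. A minimal sketch of the computation follows; the function name `cohens_d` and the toy data are hypothetical, and the pooled-variance form shown is the common textbook definition rather than a confirmed detail of the analysis above.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample variance."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Toy data: means 3 and 2, both with sample variance 2.
d_example = cohens_d([2, 4], [1, 3])
```

By Cohen's conventional benchmarks (0.2 small, 0.5 medium, 0.8 large), a value such as 0.133 falls below even the small-effect threshold.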
Finally, we analysed the contributors' guesses about whether the questions were written by the English teacher or generated by the computer using a mixed-effects model. There was a strong correlation between rating a question high and thinking that it was written by the English teacher on the well-formedness scale, t(1,299) = 17.12, p < .001, d = 0.806, and the answerability scale, t(1,307) = 11.71, p < .001, d = 0.610. In fact, the top 11% of computer-generated questions (i.e. those having scored the highest on well-formedness) were thought to be written by the English teacher. Overall, participants thought that 74% of human-written and 67% of computer-generated questions were produced by a teacher.

Discussion of the results of Study 1
The results of the first study imply that the questions automatically generated by our system are comparable to those written by a human with respect to well-formedness and answerability, although the questions written by the English teacher were rated as slightly better formed. Interestingly, most of the well-formed and answerable questions were thought to be written by the English teacher, even if they had in fact been generated automatically. This indicates that computers are not expected to be able to produce high-quality output in the sense that automatically generated questions are expected to be more ungrammatical and unnatural.
¹ The exact numbers differ slightly from those in Chinkina, Ruiz and Meurers (2017), as we excluded two unreliable responses from the original data analysis. However, this did not lead to different levels of statistical significance.

Study 2: Types of questions and the answers they elicit
In the second crowdsourcing study, we wanted to find out (a) whether the addition of a gapped sentence to an otherwise open-ended wh-question influences the rating of a question and (b) whether wh-questions followed by a gapped sentence elicit more phrasal verbs than open-ended wh-questions. The task and the procedure were the same as in the first study, but the selection criteria for both data and participants differed.

Data for Study 2
For each source sentence, we generated two types of questions, namely an open-ended wh-question and the same wh-question followed by a gapped sentence. As we did not intend to evaluate our system in this study, we excluded all ungrammatical and unanswerable computer-generated questions. Given the source text used in Example (2), the following questions were part of the data set in our second study:
(4) Source text: [...] Beijing's drive to make the nation a leader in robotics through its "Made in China 2025" initiative launched last year has set off a rush as municipalities up and down the country vie to become China's robotics center.
Overall, the data consisted of 96 human-written and 96 computer-generated questions. They were randomized in such a way that the two types of questions (with and without a gapped sentence) for the same source sentence were never shown together on the same page. We collected five judgments per question item.

Participants in Study 2
For the second study, we selected contributors with a high reliability, as specified in their profile on the crowdsourcing page, but did not limit the participation based on their level of English. To ensure the contributors' suitability, we included a quiz of five test questions, four of which had to be answered and rated correctly in order to proceed to the main task. By assuming that users working on an English-language crowdsourcing website have enough of a language background for this second study, we aimed to mimic a study with English learners of different levels of proficiency. In this study, we collected judgments from 545 contributors including 68 participants who had already taken part in the first study. However, for the evaluation purposes, we only analysed the data from the 477 new contributors.

Procedure of Study 2
As in the first study, participants were asked to answer the presented questions and rate them on the two separate 5-point Likert scales in terms of well-formedness and answerability. For this second study, we analysed both the ratings and the responses to the questions. Unlike in the first study, we did not ask participants to guess whether a question was written by a teacher or generated by a computer.

Results of Study 2
As participants in the second study were not selected based on their English proficiency level, there was less agreement among subjects when rating questions regarding well-formedness and answerability (ICC = 0.34 and 0.37, respectively). Hence, we used mixed-effect models to account for the dependencies across observations.
The analysis was conducted using the lme4 package Version 1.1-12 in the R environment Version 3.2.1 (R Core Team, 2013). We estimated a model for each of the two continuous dependent variables: the perceived well-formedness and answerability of question items. The models included fixed effects for the source of a question item (human or computer) and the item type (with or without a gapped sentence), as well as crossed random effects for both participants and items (Baayen, 2008). An effect was considered significant if the absolute value of the t statistic was greater than or equal to 2.0 (Baayen, 2008; Gelman & Hill, 2006).
First, we found that participants did not rate computer-generated questions significantly lower than human-written questions (well-formedness, b = 0.024, SE = 0.047, t = 0.500; answerability, b = 0.065, SE = 0.060, t = 1.080), which is in line with the results of our first study with proficient English speakers. As for the addition of a gapped sentence, it did indeed influence the rating of a question item. The results showed that this had an effect on both the perceived well-formedness, b = 0.158, SE = 0.054, t = 2.930, and answerability, b = 0.127, SE = 0.055, t = 2.300. In other words, the addition of a gapped sentence to a simple open-ended wh-question improved the perceived well-formedness and answerability of the question.
Finally, we conducted logistic regression analyses (Jaeger, 2008) to investigate which type of questions elicited more correct responses and more phrasal verbs. In the first model, the dependent variable was analysed as a binary outcome: correct versus incorrect. In the second model, the dependent variable was also treated as a binary outcome: presence versus absence of the phrasal verb from the source text. We selected a random sample of 20% of responses and excluded nonsensical (e.g. "good!") and non-English (e.g. "konuşma") answers from the data. Out of 359 answers, 277 (77.2%) contained exact matches of the phrasal verbs given in the source text. Only 12 (3.3%) contained rephrasings of phrasal verbs, and the remaining 70 (19.5%) answers were marked as incorrect. As expected, the logistic regression results showed that, as compared to simple wh-questions, questions followed by a gapped sentence had a higher probability of eliciting correct responses, b = 0.791, SE = 0.278, p = .004, as well as of containing the target phrasal verbs from the source text, b = 2.577, SE = 0.484, p < .001.
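To make these numbers easier to interpret, the following sketch recomputes the response percentages and converts the two coefficients into odds ratios. It assumes, as is standard for logistic regression, that the reported b values are on the log-odds scale; the variable names are hypothetical.

```python
from math import exp

# Response categories in the analysed sample of 359 answers.
counts = {"exact match": 277, "rephrasing": 12, "incorrect": 70}
total = sum(counts.values())
percentages = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Coefficients for the gapped-sentence condition (assumed to be log-odds).
coefficients = {
    "correct response": 0.791,
    "target phrasal verb in answer": 2.577,
}
# exp(b) is the multiplicative change in the odds of the outcome.
odds_ratios = {label: exp(b) for label, b in coefficients.items()}

for label, ratio in odds_ratios.items():
    print(f"{label}: odds multiplied by about {ratio:.1f}")
```

Under this assumption, adding a gapped sentence roughly doubles the odds of a correct response and multiplies the odds of the answer containing the target phrasal verb by about 13.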

Discussion of the results of Study 2
The results of the second study show question ratings in line with those from the proficient English speakers in the first study: computer-generated and human-written questions were rated similarly for both well-formedness and for answerability. This confirms our hypothesis that automatically generated questions are perceived as qualitatively comparable to those written by humans, in line with findings from previous studies on automatic question generation (e.g. Zhang & VanLehn, 2016).
Going beyond the results of the first study, the second study shows that wh-questions followed by a gapped sentence are rated higher than open-ended ones on both well-formedness and answerability scales. Apparently, a gapped sentence providing an answer context for a question can render an otherwise ambiguous question more specific so that it is perceived as better formed and easier to answer. The wh-questions followed by a gapped sentence also elicited more correct responses and more phrasal verbs. Therefore, our intuition that such a gapped answer sentence can be used to narrow down the reader's focus to the target linguistic form in the source sentence is confirmed.

Implications of our work for computer-assisted language teaching
The results of the two studies indicate that participants rated the questions produced by the computer and the English teacher similarly, confirming our initial hypothesis that computer-generated questions are comparable to those produced by humans with respect to well-formedness and answerability. A potential implication of this finding is that language teachers can use our question-generation system to automatically generate questions from reading materials, which in turn may save them time and effort when preparing their class materials. This becomes particularly relevant when considering individual differences between students, which are especially substantial in second language learning, so that teachers in principle should offer different reading material to individual students or subgroups of students. Question generation here fits naturally with ICALL tools supporting the automated retrieval of reading material in line with the individual learner's zone of proximal development (Chen & Meurers, 2019) and the school curriculum (Chinkina & Meurers, 2016).
As we predicted for our second and third research questions, the form of a question item has proven to be an important factor in judging the quality of a question and eliciting correct responses and target phrasal verbs. The combination of a wh-question and a gapped sentence was rated higher in terms of perceived well-formedness and answerability than single open-ended wh-questions. The combination of a wh-question and a gapped sentence provides a more explicit context for answering a question. From the technical point of view, the generation of short wh-questions and verbatim gapped sentences is less prone to errors than that of their longer counterparts. The more specific (and therefore long) a wh-question is, the more syntactic elements it contains, thus raising the probability of a question being ungrammatical. At the same time, when the number of syntactic elements is kept to a minimum, there is a risk that a question will be too general or ambiguous. On the other hand, gapped sentences are typically grammatical and unambiguous (Becker et al., 2012), but they do not serve a communicative goal. Therefore, combining a general wh-question with a more specific gapped sentence can help avoid the aforementioned pitfalls of the two question types: It maximizes the grammaticality and minimizes the ambiguity of the whole question item while keeping the task communicative.

Linguistic and technical limitations, challenges, and considerations
Importantly, the perceived similarity between computer-generated and human-written questions not only provides evidence for the generally good quality of questions that can be generated but also reveals the limitations of both approaches. In particular, in addition to occasional grammar mistakes, both automatically and manually produced questions can be too vague, overly specific, or include superfluous information. This is illustrated by the following examples:
Although leveraging and fine-tuning of computational linguistic tools can help improve the quality of automatically generated questions, there are linguistic and technical considerations that need to be taken into account when creating questions and evaluating their quality. We discuss them in detail in this section.

Inclusion of non-restrictive phrases and clauses
The computer-generated questions that received the highest scores in our studies were concise, which showcases the importance of considering the syntactic structure of a sentence. For instance, removing non-restrictive clauses (usually separated by commas or other punctuation) while keeping restrictive ones usually led to well-formed questions, such as the following one, which received the highest score on both the well-formedness and answerability scales:
Interestingly, this seemed to be the case even when the question did not provide enough information to answer it correctly. For example, the following question does not specify the conditions under which Jia might be forced to put up more collateral. Nevertheless, the question also received the highest scores on both scales:
(9) Source text: [...] Such share pledges can be risky: if Leshi Internet stock fell sharply, Jia might be forced to put up more collateral or sell down his stake.
a. Computer: According to the article, what might Jia be forced to do? Jia might be forced to _______ his stake.
We only removed non-restrictive clauses from a gapped sentence when they were separated by commas, which was the case for 33% of computer-generated questions. We never removed prepositional phrases when they were in the same clause with the target form. Subordinate and coordinate clauses were not removed when they followed the main clause and were not separated by a comma, as in example (8).
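The comma-based heuristic described above can be approximated with a short sketch. This is a deliberately crude illustration, not the system's actual implementation: the function name `make_gapped_sentence` and the `GAP` constant are hypothetical, and plain string splitting stands in for the syntactic parse that a real question-generation pipeline would rely on. It keeps only the comma-delimited segment that contains the target form and replaces that form with a gap.

```python
GAP = "_______"


def make_gapped_sentence(sentence: str, target: str) -> str:
    """Keep the comma-delimited segment containing `target` and blank it out.

    Assumption: commas delimit removable clauses. A parser-based system
    would distinguish restrictive from non-restrictive clauses instead.
    """
    segments = [seg.strip() for seg in sentence.split(",")]
    for seg in segments:
        if target in seg:
            # Replace only the first occurrence of the target form.
            gapped = seg.replace(target, GAP, 1)
            # Restore sentence-initial capitalization and final punctuation.
            gapped = gapped[0].upper() + gapped[1:]
            if not gapped.endswith((".", "?", "!")):
                gapped += "."
            return gapped
    raise ValueError(f"target form {target!r} not found in sentence")


source = ("If Leshi Internet stock fell sharply, Jia might be forced to "
          "put up more collateral or sell down his stake.")
print(make_gapped_sentence(source, "sell down"))
```

Unlike the gapped sentence in example (9), this sketch does not shorten the coordinated verb phrase, which requires syntactic information beyond punctuation. It also drops the conditional clause, illustrating how a purely comma-based split can discard information needed to answer the question.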
To obtain some quantitative evidence regarding our intuition about the superiority of the questions with removed non-restrictive clauses, we conducted the following pilot analyses. First, we filtered out obviously ungrammatical computer-generated questions, where the errors were caused by the parser or the coreference resolution module. We then annotated the remaining 60 computer-generated questions (M well-formedness = 4.59, M answerability = 4.62) and conducted Welch's t-tests. The results showed that the perceived well-formedness of the computer-generated questions with removed non-restrictive clauses (M = 4.73, SD = 0.23) was higher than that of the ones where no part of the sentence following the target form was removed (M = 4.52, SD = 0.49), and the difference was significant, t(58) = 2.30, p = .02, 95% CI [4.73, 4.52]. For answerability, on the other hand, the questions with removed clauses (M = 4.59, SD = 0.79) were rated as more difficult to answer than the ones that did not undergo the modification (M = 4.64, SD = 0.54). However, the difference was non-significant, t(28) = −0.23, p = .82, 95% CI [−0.45, 0.36]. The results suggest that the removal of non-restrictive clauses generally leads to better-formed questions, but more data are needed to explore under which conditions such questions remain easy to answer.
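For readers who want to run a comparison of this kind on their own rating data, Welch's t-test can be sketched using only the Python standard library. The function name `welch_t` and the two rating lists are hypothetical and serve purely as an illustration; they are not the data from our studies. The sketch returns the t statistic and the Welch–Satterthwaite degrees of freedom, which are generally not integers. Obtaining a p-value additionally requires the t distribution, which a statistics package would provide.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard errors of the two group means (unequal variances allowed).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Hypothetical mean ratings per question for two groups of questions.
with_removed_clauses = [5.0, 4.5, 5.0, 4.5, 5.0]
without_removal = [4.5, 4.0, 5.0, 3.5, 4.5, 4.0]

t, df = welch_t(with_removed_clauses, without_removal)
print(f"t = {t:.2f}, df = {df:.1f}")
```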
Although the heuristic of splitting the sentence into clauses separated by commas seems to work well, excluding conditional clauses may lead to unanswerable questions, especially in a richer context. This leads us to the next subsection, where we discuss the limitations of the task of automatic question generation and its evaluation.

Limitations of natural language processing tools and algorithms
The quality of automatically generated questions relies on the accuracy of the natural language processing tools that our question-generation system is built on. In fact, the main causes of ill-formed questions were erroneous coreference resolution (43%) and incorrect parses (28%) of the source sentences, with ill-formedness being operationalized as an average rating below 3 on a 5-point scale. Other factors influencing question quality include:
i. The question item may not present enough information to answer it correctly (e.g. a missing restrictive clause) or be too specific compared to a more general context of the paragraph.
ii. The question item may have superfluous information (e.g. a non-restrictive phrase or a clause), making it too long and potentially unnatural.
iii. According to feedback from the participants of the studies, questions may be perceived as less well formed if the subject in the gapped sentence repeats the subject in the wh-question. Although the question item as a whole could sound more natural if the subject in the gapped sentence were substituted with a pronoun, this poses a computational challenge because of the aforementioned suboptimal performance of coreference resolution tools. Given the alternative of generating a wrong pronoun (e.g. he instead of she), we opted for the safe, albeit slightly less natural, option of keeping the subject in both the wh-question and the gapped sentence. As a result, all computer-generated examples in this paper demonstrate this limitation.
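The operationalization of ill-formedness mentioned above (an average rating below 3 on a 5-point scale) is simple to express in code. The following is a minimal sketch under the assumption that ratings are stored per question item in a dictionary; the function name `ill_formed_items`, the item identifiers, and the ratings are hypothetical.

```python
from statistics import mean

# Threshold taken from the text: average rating below 3 on a 5-point scale.
ILL_FORMED_THRESHOLD = 3.0


def ill_formed_items(ratings_by_item):
    """Return the ids of items whose mean well-formedness rating is below 3."""
    return [
        item_id
        for item_id, ratings in ratings_by_item.items()
        if mean(ratings) < ILL_FORMED_THRESHOLD
    ]


# Hypothetical crowd ratings for three question items.
ratings = {"q1": [5, 4, 5], "q2": [2, 3, 2], "q3": [3, 3, 3]}
print(ill_formed_items(ratings))
```

Note that an item with a mean of exactly 3, such as `q3` here, is not flagged, because the criterion is a strict inequality.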

Evaluation of question-generation systems
First and foremost, it should be noted that any kind of human evaluation is subjective. In our studies, this issue became particularly salient when raters encountered the test questions in the first crowdsourcing experiment. The Figure Eight guidelines recommend an even distribution of answers for test questions (i.e. testing both good and bad questions in our case). Ill-formed or unanswerable test questions were not difficult to write and did not receive criticism from the participants, even from those who rated these questions incorrectly (we accepted any rating below 4 for such questions). The rating of good test questions, however, proved to be more challenging and subjective. As an alternative, one could test participants only on ill-formed questions, possibly also giving only a binary choice (Is this question grammatical or ungrammatical?) instead of the 5-point scale (How well formed is this question?).
Malicious activities (e.g. randomly clicking through the task, copy-pasting answers) are another limitation of a crowdsourcing experiment design – or any web-based design, for that matter (Gadiraju, Demartini, Kawase & Dietze, 2015). In our second study, where the quality control mechanism was not as strict as in the first, participants used the exact wording from the source text in 97% of their answers. When designing a similar study in the future, one could block the copy-paste functionality in order to prevent participants from directly copying answers from the text.
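One way to quantify how often participants reuse the exact wording of the source text is sketched below. This is an assumed operationalization for illustration only, not necessarily the procedure used in our study: an answer counts as verbatim if, after lowercasing and stripping punctuation, its word sequence occurs contiguously in the source text. The function names `normalize`, `is_verbatim`, and `verbatim_share`, as well as the sample answers, are hypothetical.

```python
import re


def normalize(text):
    """Lowercase the text and reduce it to space-separated word tokens."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def is_verbatim(answer, source):
    """True if the answer's word sequence occurs contiguously in the source."""
    norm_answer = normalize(answer)
    # Padding with spaces enforces matches on whole-word boundaries.
    return bool(norm_answer) and f" {norm_answer} " in f" {normalize(source)} "


def verbatim_share(answers, source):
    """Proportion of answers copied word for word from the source text."""
    if not answers:
        return 0.0
    return sum(is_verbatim(a, source) for a in answers) / len(answers)


source = "Jia might be forced to put up more collateral or sell down his stake."
answers = ["sell down his stake", "Sell down his stake.", "reduce his holdings"]
print(f"{verbatim_share(answers, source):.2f}")
```

A check of this kind cannot tell deliberate copy-pasting apart from a legitimate answer that happens to reuse the source wording, so it is better suited to describing response behavior than to excluding participants.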

Conclusion and outlook
To conclude, answering questions is an integral part of facilitating the acquisition and practice of vocabulary and grammar in a language-learning classroom. In the two studies presented in this paper, we found evidence that automatically generated and human-written questions can be comparable with respect to both well-formedness and answerability. The findings are in line with previous research involving expert judges evaluating the quality of computer-generated and human-written questions (e.g. Zhang & VanLehn, 2016), although our discussion also identified clear room for improvement in question generation.
We found that the addition of a gapped sentence to a wh-question significantly improves its well-formedness and answerability. Moreover, the responses elicited by wh-questions followed by a gapped sentence contain significantly more correct answers and phrasal verbs than those elicited by open-ended wh-questions. From the computational linguistic perspective, these findings imply that question-generation systems can benefit from leveraging and combining different types of questions.
Although we focused on phrasal verbs as the target linguistic form in this study, our system is able to generate questions to any verb phrase, and in principle any automatically identifiable dependent. In future studies, we plan to assess the quality of computer-generated questions targeting different linguistic forms appearing in texts of different genres to empirically test question-generation effectiveness. For this purpose, a large-scale randomized controlled field study with intermediate English language learners is currently being planned as part of a grant proposal. The study is designed to provide an evidence-based assessment of the effectiveness of question-generation technology in a real-life educational setting and compare it to more traditional approaches.
Interestingly, proficient speakers of English thought that most of the questions were written by an English teacher, although the proportion of computer-generated and human-written questions in the study was the same. This finding shows that people are often unaware of the state of the art in computational linguistics and how it can or could connect to the needs of real-life teaching and learning. We believe that computer-assisted language teaching, that is, the use of technology not only by language learners but also, and primarily, by language teachers, can play an important role in supporting teachers in facing current challenges. Automated approaches arguably will become particularly important for the class-internal differentiation that is increasingly required to adaptively support different subgroups of learners, for which automatically generated materials are ideally suited.

About the authors
Maria Chinkina is a doctoral candidate at the LEAD Graduate School & Research Network and the University of Tübingen, Germany. In her thesis, she explores and implements computational linguistic techniques, such as information retrieval, input enrichment, and question generation, that help language learners to create a richer grammatical intake from the given text input. Her research focus lies at the intersection of computational linguistics, second language acquisition and computer-assisted language learning.
Simón Ruiz is a post-doctoral researcher at the English department of the University of Tübingen, Germany, from where he also obtained his PhD. His research focuses on individual differences in second language acquisition, second language teaching and learning, implicit and explicit learning in second language acquisition, and intelligent computer-assisted language learning.
Detmar Meurers is professor of computational linguistics at the University of Tübingen, Germany, and on the steering board of the LEAD Graduate School & Research Network in empirical educational science there. As head of the ICALL-Research.com group, his work focuses on intelligent computer-assisted language learning and computational linguistic methods in second language acquisition research and language teaching. He has published on automatic short-answer assessment, the analysis of learner corpora, linguistic complexity analysis, tutoring systems, and input enrichment and enhancement applications.