
How good are LLMs in generating input texts for reading tasks in German as a foreign language?

Published online by Cambridge University Press: 11 September 2025

Anastasia Drackert*
Affiliation: Gesellschaft für akademische Studienvorbereitung und Testentwicklung (g.a.s.t.), Bochum, Germany; Department of German Studies, Ruhr-Universität Bochum, Germany

Andrea Horbach
Affiliation: Leibniz Institute for Science and Mathematics Education, Kiel, Germany; Institute for Psychology of Learning and Instruction, Christian Albrecht University, Kiel, Germany

Anja Peters
Affiliation: Gesellschaft für akademische Studienvorbereitung und Testentwicklung (g.a.s.t.), Bochum, Germany

*Corresponding author: Anastasia Drackert; Email: drackert@gast.de

Abstract

To study the potential of generative AI for producing high-quality input texts for reading comprehension tasks at specific CEFR levels in German, we investigated the comparability of reading texts from a high-stakes German exam, which served as benchmarks for this study, and texts generated by ChatGPT (3.5 and 4). These three types of texts were analyzed with respect to a variety of linguistic features and evaluated by three assessment experts. Our findings indicate that AI-generated texts provide a valuable starting point for the production of test materials, but they require adjustments to align with the benchmark texts. Computational analysis and expert evaluations identified key discrepancies that necessitate careful control of certain textual features. Specifically, modifications are needed to address the frequency of nominalizations, lexical density, the use of technical vocabulary, and non-idiomatic expressions that are direct translations from English. To enhance comparability with the benchmark texts, it is also essential to incorporate features such as examples illustrating the discussed phenomena and passive constructions into the AI-generated content. We discuss the consequences of using ChatGPT for input text generation and point out important aspects to consider when using generated texts as input materials in assessment tasks.
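The article does not reproduce the prompts the authors used. Purely as an illustration of the workflow described above, the following minimal Python sketch shows how a reading text at a target CEFR level might be requested from ChatGPT programmatically via the OpenAI API. The prompt wording, topic, word count, model choice, and temperature are illustrative assumptions, not the study's actual setup.

```python
# Hypothetical sketch: requesting a CEFR-level German reading text from ChatGPT.
# Prompt, topic, and parameters are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Schreibe einen Sachtext auf Deutsch (ca. 450 Wörter) auf dem GER-Niveau C1 "
    "zum Thema 'Stadtökologie'. Der Text soll als Lesetext in einer Sprachprüfung "
    "dienen: neutral im Ton, klar gegliedert und mit konkreten Beispielen für die "
    "beschriebenen Phänomene."
)

response = client.chat.completions.create(
    model="gpt-4",  # the study compared texts from GPT-3.5 and GPT-4
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```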

Abstract in German

Creating exam texts for receptive test sections is a time-consuming and resource-intensive process. To determine the potential of generative AI for producing input texts for a high-stakes exam of German as a foreign language, we compared reading passages written by trained item writers with ChatGPT-generated texts. The texts were evaluated with respect to a range of linguistic features by means of a computational analysis and judged by three test development experts. Our results show that AI-generated texts offer a valuable starting point for the creation of exam texts, but that adjustments are required to achieve comparability with the item writers' texts. In particular, modifications are necessary with regard to the following aspects: illustration of the presented content through examples, lexical density, use of technical vocabulary, idiomaticity, nominalizations, and passive constructions. Finally, we discuss the consequences of using ChatGPT for the creation of input texts.
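Both abstracts name the textual features that had to be controlled. As a rough illustration only, and not the computational analysis actually used in the study, the sketch below approximates three of these features for German with spaCy: lexical density, nominalization frequency, and "werden" passives. The suffix list and the passive heuristic are simplifying assumptions.

```python
# A minimal sketch of feature extraction for German reading texts with spaCy.
# Assumes the small German model is installed:
#   python -m spacy download de_core_news_sm
import spacy

nlp = spacy.load("de_core_news_sm")

# Heuristic: common derivational suffixes that flag German nominalizations.
NOMINALIZATION_SUFFIXES = ("ung", "heit", "keit", "tion", "nis", "schaft")
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}


def text_features(text: str) -> dict:
    doc = nlp(text)
    words = [t for t in doc if t.is_alpha]
    if not words:
        return {}
    nouns = [t for t in words if t.pos_ == "NOUN"]
    nominalizations = [
        t for t in nouns if t.lemma_.lower().endswith(NOMINALIZATION_SUFFIXES)
    ]
    # Rough passive heuristic: a sentence containing a form of "werden" plus a
    # past participle (TIGER tag VVPP) is counted as a werden-passive sentence.
    passive_sents = sum(
        1
        for sent in doc.sents
        if any(t.lemma_ == "werden" and t.pos_ == "AUX" for t in sent)
        and any(t.tag_ == "VVPP" for t in sent)
    )
    return {
        "tokens": len(words),
        "lexical_density": sum(t.pos_ in CONTENT_POS for t in words) / len(words),
        "nominalizations_per_100_words": 100 * len(nominalizations) / len(words),
        "passive_sentences": passive_sents,
    }


print(text_features(
    "Die Untersuchung wurde von drei Expertinnen durchgeführt. "
    "Die Ergebnisse zeigen deutliche Unterschiede in der lexikalischen Dichte."
))
```

Such per-text feature profiles could then be compared across benchmark and AI-generated texts, for example with simple descriptive statistics per text group.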

Information

Type: Research Article
Copyright: © The Author(s), 2025. Published by Cambridge University Press.
