
How good are LLMs in generating input texts for reading tasks in German as a foreign language?

Published online by Cambridge University Press: 11 September 2025

Anastasia Drackert*
Affiliation: Gesellschaft für akademische Studienvorbereitung und Testentwicklung (g.a.s.t.), Bochum, Germany; Department of German Studies, Ruhr-Universität Bochum, Germany

Andrea Horbach
Affiliation: Leibniz Institute for Science and Mathematics Education, Kiel, Germany; Institute for Psychology of Learning and Instruction, Christian Albrecht University, Kiel, Germany

Anja Peters
Affiliation: Gesellschaft für akademische Studienvorbereitung und Testentwicklung (g.a.s.t.), Bochum, Germany

*Corresponding author: Anastasia Drackert; Email: drackert@gast.de

Abstract

To study the potential of generative AI for producing high-quality input texts for reading comprehension tasks at specific CEFR levels in German, we investigated the comparability of reading texts from a high-stakes German exam, which served as benchmarks for this study, and texts generated by ChatGPT (3.5 and 4). These three types of texts were analyzed with respect to a variety of linguistic features and evaluated by three assessment experts. Our findings indicate that AI-generated texts provide a valuable starting point for the production of test materials, but they require adjustments to align with the benchmark texts. Computational analysis and expert evaluations identified key discrepancies that necessitate careful control of certain textual features. Specifically, modifications are needed to address the frequency of nominalizations, lexical density, the use of technical vocabulary, and non-idiomatic expressions that are direct translations from English. To enhance comparability with the benchmark texts, it is also essential to incorporate features such as examples illustrating the discussed phenomena and passive constructions into the AI-generated content. We discuss the consequences of using ChatGPT for input text generation and point out important aspects to consider when using generated texts as input materials in assessment tasks.
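The article does not reproduce the prompts the authors used. Purely as an illustration of the workflow described above, the following minimal Python sketch shows how a reading text at a target CEFR level might be requested from ChatGPT programmatically via the OpenAI API. The prompt wording, topic, word count, model choice, and temperature are illustrative assumptions, not the study's actual setup.

```python
# Hypothetical sketch: requesting a CEFR-level German reading text from ChatGPT.
# Prompt, topic, and parameters are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Schreibe einen Sachtext auf Deutsch (ca. 450 Wörter) auf dem GER-Niveau C1 "
    "zum Thema 'Stadtökologie'. Der Text soll als Lesetext in einer Sprachprüfung "
    "dienen: neutral im Ton, klar gegliedert und mit konkreten Beispielen für die "
    "beschriebenen Phänomene."
)

response = client.chat.completions.create(
    model="gpt-4",  # the study compared texts from GPT-3.5 and GPT-4
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```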

Abstract in German

Creating exam texts for receptive test sections is a time-consuming and resource-intensive process. To determine the potential of generative AI for producing input texts for a high-stakes exam of German as a foreign language, we compared reading passages written by trained item writers with ChatGPT-generated texts. The texts were evaluated with respect to a range of linguistic features by means of a computational analysis and judged by three test development experts. Our results show that AI-generated texts offer a valuable starting point for the creation of exam texts, but that adjustments are required to achieve comparability with the item writers' texts. In particular, modifications are necessary with regard to the following aspects: illustration of the presented content through examples, lexical density, use of technical vocabulary, idiomaticity, nominalizations, and passive constructions. Finally, we discuss the consequences of using ChatGPT for the creation of input texts.
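Both abstracts name the textual features that had to be controlled. As a rough illustration only, and not the computational analysis actually used in the study, the sketch below approximates three of these features for German with spaCy: lexical density, nominalization frequency, and "werden" passives. The suffix list and the passive heuristic are simplifying assumptions.

```python
# A minimal sketch of feature extraction for German reading texts with spaCy.
# Assumes the small German model is installed:
#   python -m spacy download de_core_news_sm
import spacy

nlp = spacy.load("de_core_news_sm")

# Heuristic: common derivational suffixes that flag German nominalizations.
NOMINALIZATION_SUFFIXES = ("ung", "heit", "keit", "tion", "nis", "schaft")
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}


def text_features(text: str) -> dict:
    doc = nlp(text)
    words = [t for t in doc if t.is_alpha]
    if not words:
        return {}
    nouns = [t for t in words if t.pos_ == "NOUN"]
    nominalizations = [
        t for t in nouns if t.lemma_.lower().endswith(NOMINALIZATION_SUFFIXES)
    ]
    # Rough passive heuristic: a sentence containing a form of "werden" plus a
    # past participle (TIGER tag VVPP) is counted as a werden-passive sentence.
    passive_sents = sum(
        1
        for sent in doc.sents
        if any(t.lemma_ == "werden" and t.pos_ == "AUX" for t in sent)
        and any(t.tag_ == "VVPP" for t in sent)
    )
    return {
        "tokens": len(words),
        "lexical_density": sum(t.pos_ in CONTENT_POS for t in words) / len(words),
        "nominalizations_per_100_words": 100 * len(nominalizations) / len(words),
        "passive_sentences": passive_sents,
    }


print(text_features(
    "Die Untersuchung wurde von drei Expertinnen durchgeführt. "
    "Die Ergebnisse zeigen deutliche Unterschiede in der lexikalischen Dichte."
))
```

Such per-text feature profiles could then be compared across benchmark and AI-generated texts, for example with simple descriptive statistics per text group.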

Information

Type: Research Article
Copyright: © The Author(s), 2025. Published by Cambridge University Press.
