
ChatGPT in foreign language lesson plan creation: Trends, variability, and historical biases

Published online by Cambridge University Press:  11 December 2024

Alex Dornburg, University of North Carolina at Charlotte, USA (adornbur@charlotte.edu)
Kristin J. Davin, University of North Carolina at Charlotte, USA (kdavin@charlotte.edu)

Abstract

The advent of generative artificial intelligence (AI) models holds potential for aiding teachers in the generation of pedagogical materials. However, numerous knowledge gaps concerning the behavior of these models hinder the development of research-informed guidance for their effective usage. Here, we assess trends in prompt specificity, variability, and weaknesses in foreign language teacher lesson plans generated by zero-shot prompting in ChatGPT. Iterating a series of prompts that increased in complexity, we found that output lesson plans were generally high quality, though additional context and specificity in a prompt did not guarantee a concomitant increase in quality. Additionally, we observed extreme cases of variability in outputs generated by the same prompt. In many cases, this variability reflected a conflict between outdated (e.g. reciting scripted dialogues) and more current research-based pedagogical practices (e.g. a focus on communication). These results suggest that the training of generative AI models on classic texts concerning pedagogical practices may bias generated content toward teaching practices that have been long refuted by research. Collectively, our results offer immediate translational implications for practicing and training foreign language teachers on the use of AI tools. More broadly, these findings highlight trends in generative AI output that have implications for the development of pedagogical materials across a diversity of content areas.

Information

Type
Research Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of EUROCALL, the European Association for Computer-Assisted Language Learning
Table 1. Prompts

Figure 1. Average score of each prompt across iterations. Bars are shaded to correspond to specific prompts; error bars above the mean scores indicate the 25% and 75% quantiles.

Figure 2. Visualization of the non-metric multidimensional scaling (NMDS) ordination plot of scores from all prompt groups showing group cluster overlap and separation. Ellipses represent estimated clusters and are shaded to correspond to the prompt group. Circles represent replicate scores and are shaded to the corresponding prompt group.
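An ordination like the one in Figure 2 can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: the binary score matrix (10 replicates × 25 rubric criteria), the random seed, and the choice of the scikit-learn `MDS` estimator are all assumptions made here for demonstration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Hypothetical rubric scores: 10 prompt replicates x 25 criteria,
# each criterion scored present (1) or absent (0)
rng = np.random.default_rng(1)
scores = rng.integers(0, 2, size=(10, 25)).astype(bool)

# Precompute pairwise Jaccard dissimilarities between replicates
D = squareform(pdist(scores, metric="jaccard"))

# Non-metric MDS (NMDS) on the precomputed dissimilarity matrix
nmds = MDS(n_components=2, metric=False,
           dissimilarity="precomputed", random_state=0)
coords = nmds.fit_transform(D)  # one 2-D point per replicate
print(coords.shape)  # (10, 2)
```

Plotting `coords` colored by prompt group, with a confidence ellipse per group, reproduces the kind of cluster-overlap view the caption describes.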

Figure 3. Alignment of prompt responses to rubric criteria. The resulting score of each prompt’s (P.1–P.5) output relative to a perfect target score containing all 25 criteria. Circles indicate scores with dotted lines corresponding to each of the 10 prompt iterations. Circle shadings correspond to specific prompts along the x-axis (P.1–P.5).

Figure 4. Distances between prompt output scores. (A) Heatmap depicting the computed distances between prompt replicate scores. Brighter shading indicates increased dissimilarity. (B) Dendrograms estimated using hierarchical clustering based on computed Jaccard distances. Shading in the center corresponds to prompt (P.1–P.5). Dendrogram node heights correspond to the Jaccard distance at which clusters merge. Prompt IDs in both panels indicate prompt group and replicate number (i.e. P.4.4 = P.4, fourth replicate).
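The pipeline behind Figure 4 (pairwise Jaccard distances followed by hierarchical clustering) can be sketched as follows. This is an illustrative sketch under assumed inputs, not the authors' code: the binary score matrix, the random seed, and the `average` linkage method are hypothetical choices.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical binary score vectors: rows are prompt replicates,
# columns are the 25 rubric criteria (True = criterion met)
rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(10, 25)).astype(bool)

# Pairwise Jaccard distances between replicate score vectors
d = pdist(scores, metric="jaccard")
dist_matrix = squareform(d)  # square form for a heatmap (panel A)

# Agglomerative clustering on the Jaccard distances (panel B)
Z = linkage(d, method="average")

# Example: cut the dendrogram into two clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(dist_matrix.shape)  # (10, 10)
```

`dist_matrix` feeds a heatmap like panel A, and `Z` can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree in panel B.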

Figure 5. Summary of output scores by prompt group. Twenty-five categories were scored as present or absent for each prompt replicate. Bar charts indicate the frequency of replicates meeting scoring criteria. Bars are shaded to correspond to each prompt group.

Supplementary material

- Dornburg and Davin supplementary material 1 (File, 276 Bytes)
- Dornburg and Davin supplementary material 2 (File, 276 Bytes)
- Dornburg and Davin supplementary material 3 (File, 53 KB)
- Dornburg and Davin supplementary material 4 (File, 39.3 KB)