
Impact of retrieval augmented generation and large language model complexity on undergraduate exams created and taken by AI agents

Published online by Cambridge University Press: 22 August 2025

Erick Tyndall*
Affiliation: Department of Systems Engineering and Management, Air Force Institute of Technology, Wright-Patterson Air Force Base, Ohio, USA

Colleen Gayheart
Affiliation: Department of Systems Engineering and Management, Air Force Institute of Technology, Wright-Patterson Air Force Base, Ohio, USA

Alexandre Some
Affiliation: Department of Systems Engineering and Management, Air Force Institute of Technology, Wright-Patterson Air Force Base, Ohio, USA

Joseph Genz
Affiliation: Department of Anthropology, University of Hawaiʻi at Hilo, Hilo, Hawaii, USA

Torrey Wagner
Affiliation: Department of Systems Engineering and Management, Air Force Institute of Technology, Wright-Patterson Air Force Base, Ohio, USA

Brent Langhals
Affiliation: Department of Systems Engineering and Management, Air Force Institute of Technology, Wright-Patterson Air Force Base, Ohio, USA

*Corresponding author: Erick Tyndall; Email: erick.tyndall@us.af.mil

Abstract

The capabilities of large language models (LLMs) have advanced to the point where entire textbooks can be queried using retrieval-augmented generation (RAG), enabling AI to integrate external, up-to-date information into its responses. This study evaluates the ability of two OpenAI models, GPT-3.5 Turbo and GPT-4 Turbo, to create and answer exam questions based on an undergraduate textbook. Fourteen exams, each comprising four true-false, four multiple-choice, and two short-answer questions, were created from an open-source Pacific Studies textbook. Model performance was evaluated with and without access to the source material using text-similarity metrics such as ROUGE-1, cosine similarity, and word embeddings. Analysis of the fifty-six resulting exam scores revealed that RAG-assisted models significantly outperformed those relying solely on pre-trained knowledge, and that GPT-4 Turbo consistently outperformed GPT-3.5 Turbo in accuracy and coherence, especially in short-answer responses. These findings demonstrate the potential of LLMs to automate exam generation while maintaining assessment quality, but they also underscore the need for policy frameworks that promote fairness, transparency, and accessibility. Given the regulatory considerations outlined in the European Union AI Act and the NIST AI Risk Management Framework, institutions using AI in education must establish governance protocols, bias-mitigation strategies, and human oversight measures. The results contribute to ongoing discussions on responsibly integrating AI into education, suggesting not only performance benefits but also actionable governance mechanisms, such as verifiable retrieval pipelines and oversight protocols, that can guide institutional policies supporting AI-assisted assessment while preserving academic integrity.
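To make the evaluation approach concrete, the sketch below computes two of the text-similarity metrics named in the abstract, ROUGE-1 F1 and cosine similarity, directly from their standard definitions over term counts. It is a minimal illustration only: the example answer strings are hypothetical, and the study's actual pipeline (per the frameworks and libraries in Table 1) may rely on dedicated packages such as rouge-score or scikit-learn rather than this hand-rolled version.

```python
# Minimal sketch of two text-similarity metrics from the abstract:
# ROUGE-1 F1 (unigram overlap) and cosine similarity over term-count vectors.
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(candidate: str, reference: str) -> float:
    """Cosine similarity between raw term-count vectors."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    dot = sum(cand[t] * ref[t] for t in cand)
    norm = (math.sqrt(sum(v * v for v in cand.values()))
            * math.sqrt(sum(v * v for v in ref.values())))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Hypothetical reference and model answers for a short-answer question.
    reference = ("Navigation in the Marshall Islands relied on wave piloting "
                 "and stick charts.")
    answer = ("Marshallese navigators used stick charts and wave piloting "
              "techniques.")
    print(f"ROUGE-1 F1: {rouge1_f1(answer, reference):.3f}")
    print(f"Cosine:     {cosine_similarity(answer, reference):.3f}")
```

In practice, a graded exam response would be scored against an instructor-approved reference answer, with the same metrics applied to RAG-assisted and non-RAG model outputs to enable the comparison described above.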

Information

Type: Research Article

Creative Commons Licence: CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.

Copyright: © The Author(s), 2025. Published by Cambridge University Press
Figure 1. Number of artifacts (models & datasets) per application, based on huggingface.com tags.

Table 1. Key frameworks, libraries, and modules

Table 2. Hardware and software

Table 3. Metrics and brief descriptions

Table 4. Model performance for one exam

Table 5. Model rankings per exam

Table 6. Model overall rankings

Table 7. AI-assisted exam workflow archetypes

Supplementary material

Tyndall et al. supplementary material (File, 17.3 KB)