
Optimizing the key information section of informed consent documents in biomedical research: A multi-phased study using large language models

Published online by Cambridge University Press:  02 January 2026

Shahana Chumki
Affiliation:
Office of the Vice President for Research, University of Michigan, Ann Arbor, MI, USA
Caleb Smith
Affiliation:
Office of Research, University of Michigan Medical School, Ann Arbor, MI, USA
Terri Ridenour
Affiliation:
Office of the Vice President for Research, University of Michigan, Ann Arbor, MI, USA
Kristi Hottenstein
Affiliation:
Michigan Institute for Clinical & Health Research (MICHR), University of Michigan Medical School, Ann Arbor, MI, USA
Lana Gevorkyan
Affiliation:
Office of the Vice President for Research, University of Michigan, Ann Arbor, MI, USA
Corey Zolondek
Affiliation:
Institutional Review Board for Health Sciences & Behavioral Sciences (IRB-HSBS), University of Michigan, Ann Arbor, MI, USA
Joshua Fedewa
Affiliation:
Institutional Review Board Medical School (IRBMED), University of Michigan Medical School, Ann Arbor, MI, USA
Judith Birk*
Affiliation:
Office of the Vice President for Research, University of Michigan, Ann Arbor, MI, USA
* Corresponding author: J. Birk; Email: jbirk@umich.edu

Abstract

Introduction:

Informed consent is a cornerstone of ethical research, but the lack of widely accepted standards for the key information (KI) section of informed consent documents (ICDs) creates challenges for institutional review board (IRB) review and participant comprehension. This study explored the use of GPT-4o, a large language model (hereafter, the AI tool), to generate standardized KI sections.

Methods:

An AI tool was developed to interpret and generate KI content from ICDs. The evaluation involved a multi-phased process where IRB subject matter experts, principal investigators (PIs), and IRB reviewers assessed the AI output for accuracy, differentiation between standard care and research, appropriate information prioritization, and structural coherence.

Results:

Iterative refinements improved the AI’s accuracy and clarity, with initial assessments highlighting factual errors that decreased over time. Many PIs found the AI-generated sections comparable to their own and expressed a high likelihood of using the tool for future drafts. Blinded evaluations by IRB reviewers highlighted the AI tool’s strengths in describing study benefits and maintaining readability. However, the findings underscore the need for further improvements, particularly in ensuring accurate risk descriptions, to enhance regulatory compliance and IRB reviewer confidence.

Conclusions:

The AI tool shows promise in enhancing the consistency and efficiency of KI section drafting in ICDs. However, it requires ongoing refinement and human oversight to fully comply with regulatory and institutional standards. Collaboration between AI and human experts is essential to maximize benefits while maintaining high ethical and accuracy standards in informed consent processes.

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial licence (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original article is properly cited. The written permission of Cambridge University Press or the rights holder(s) must be obtained prior to any commercial use.
Copyright
© The Regents of the University of Michigan, 2026. Published by Cambridge University Press on behalf of Association for Clinical and Translational Science

Figure 1. Institutional review board (IRB) subject matter experts' (SMEs') factual accuracy assessment of artificial intelligence (AI)-generated key information (KI) sections. Cohort 1: IRB SMEs evaluated assigned KI sections produced by eight iterations of the AI tool; each new KI output was assessed for factual accuracy against the original IRB-approved, human-authored KI section. Individual raters were assigned three different consents and assessed each consent across all eight iterations (iter1–8). Cohort 2: IRB SMEs evaluated a new series of designated informed consents across three AI iterations (iter9–11) to confirm the AI tool's adaptability to content it had not seen previously. Dark gray: not factually correct; white: factually correct. C1: Cohort 1 (initial set of studies); C2: Cohort 2 (subsequent set of studies); iter1–11: consecutive iterations of the AI model. *Major updates to the content or delivery of prompts.


Figure 2. Additional institutional review board (IRB) subject matter expert (SME) assessment of artificial intelligence (AI)-generated key information (KI) sections. Comments were evaluated using ChatGPT-4o in four categories: (1) factual accuracy, (2) standard of care vs. research differentiation, (3) information weighting, and (4) style and structure. Scores range from 0 to 1, with higher scores indicating better performance (see Supplemental Material 4 for detailed prompt instructions for the ChatGPT-4o analysis of IRB SME comments). The color-coded table uses darker gray for lower scores and lighter gray for higher scores, illustrating changes in accuracy, clarity, and relevance from iteration 1 to 11. iter1–11: consecutive iterations of the AI model. *Major updates to the content or delivery of prompts.


Table 1. Principal Investigators’ evaluation of artificial intelligence (AI)-generated key information (KI) sections


Figure 3. Principal investigator (PI) and institutional review board (IRB) reviewer perspectives on using an artificial intelligence (AI) tool to draft key information (KI) sections. The side-by-side bar chart compares PIs' likelihood of future use with IRB reviewers' likelihood of recommending the tool for future use, showing the frequency distribution of responses across five Likert scale categories on the x-axis: "5 (very likely)," "4 (likely)," "3 (neutral)," "2 (not likely)," and "1 (not at all likely)." Gray bars represent PI likelihood of future use; black bars represent IRB reviewer likelihood to recommend future use. The y-axis displays the percentage frequency of responses within each category.


Table 2. IRB reviewers’ evaluation of artificial intelligence (AI)-generated vs. human-generated key information (KI) sections

Supplementary material

Chumki et al. supplementary material 1 (File, 18 KB)
Chumki et al. supplementary material 2 (File, 7.1 KB)
Chumki et al. supplementary material 3 (File, 36.8 KB)
Chumki et al. supplementary material 4 (File, 42.1 KB)