
A critical juncture: Integrating large language models in biostatistical workflows

Published online by Cambridge University Press:  18 February 2026

Vihaan Sahu*
Affiliation: Medicine, Georgian National University SEU, Georgia
*Corresponding author: V. Sahu; Email: vsahu@seu.edu.ge

Type
Letter
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press on behalf of Association for Clinical and Translational Science

To the Editor

Grambow et al. [1] provide essential empirical insights into the integration of large language models (LLMs) in biostatistical workflows, revealing both rapid adoption and significant, unmitigated risks. Their survey of biostatisticians demonstrates substantial integration, with 63.8% (44/69) of respondents already using LLMs and nearly half (46.5%) using them daily [1]. The primary uses are quantitative coding (77.5%) and writing tasks (76.3% for editing), indicating a profound shift in daily practice [1]. However, this adoption has starkly outpaced the development of adequate safeguards. The most critical finding is that 70.7% (29/41) of users reported encountering incorrect LLM outputs with potentially serious consequences, spanning incorrect code generation, statistical misinterpretation, content fabrication, and inappropriate tone [1].

This scenario represents a genuine paradigm shift. LLMs are not merely augmenting workflows but are fundamentally transforming how biostatisticians generate code, communicate findings, and access knowledge, creating a new, hybrid human-AI workflow [2,3]. This transformation necessitates a parallel shift in our approach to verification and quality control. The current reliance on individualized, ad hoc verification strategies such as personal expertise, external checks, and debugging is insufficient [1]. This is compounded by a glaring institutional support gap: while 49.3% of respondents reported organizational encouragement to use LLMs, only 18.8% had access to formal guidance or training [1].
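
As one concrete illustration of what a more systematic check could look like, the minimal Python sketch below replaces ad hoc eyeballing of an LLM-generated analysis with an automated comparison against an independently computed reference. The function llm_welch_t is a hypothetical stand-in for model-generated code, and the synthetic data are illustrative; neither is drawn from the survey. The point is the verification pattern, not the specific code.

# Minimal sketch only: llm_welch_t is a hypothetical stand-in for
# LLM-generated analysis code; the reference comes from SciPy.
import numpy as np
from scipy import stats

def llm_welch_t(a, b):
    # Hypothetical LLM-generated Welch t-test, computed from first principles.
    a, b = np.asarray(a, float), np.asarray(b, float)
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, 2 * stats.t.sf(abs(t), df)

def verify(a, b, tol=1e-8):
    # Systematic check: recompute with a trusted library and compare.
    t_llm, p_llm = llm_welch_t(a, b)
    t_ref, p_ref = stats.ttest_ind(a, b, equal_var=False)
    return abs(t_llm - t_ref) < tol and abs(p_llm - p_ref) < tol

rng = np.random.default_rng(0)
print("agrees with reference:", verify(rng.normal(0, 1, 40), rng.normal(0.5, 1, 40)))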

This disconnect between rapid adoption and inadequate safeguards points to a series of essential experiments. First, we need rigorous, longitudinal studies of how LLM integration affects the reproducibility and quality of clinical research analyses, as initial comparative assessments show alarming variability in endpoint calculations such as objective response rate [4]. Second, domain-specific verification frameworks must be developed and validated to systematically detect the high-frequency error types identified [1,5]. Third, comparative evaluations of different LLM architectures and prompting strategies for core biostatistical tasks are urgently required, moving beyond simple accuracy metrics to assess reasoning, hallucination rates, and robustness, as recommended by scoping reviews [6,7].
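
To make the verification-framework point concrete, the following hypothetical Python sketch (not a published framework, and not drawn from any of the cited studies) shows one such domain-specific check: deterministically recomputing an objective response rate from patient-level best-response data and flagging an LLM-reported value that deviates beyond a small tolerance. The column name best_response, the CR/PR coding, and the llm_reported_orr value are all illustrative assumptions.

# Illustrative only: column names, response coding, and the LLM-reported
# value are assumptions, not taken from the survey or the cited assessments.
import pandas as pd

RESPONDERS = {"CR", "PR"}  # complete and partial responses count toward ORR

def recompute_orr(best_response):
    # ORR = share of evaluable patients whose best response is CR or PR.
    evaluable = best_response.dropna()
    return evaluable.isin(RESPONDERS).mean()

def flag_discrepancy(llm_orr, recomputed_orr, tol=0.005):
    # True when the model-quoted ORR differs from the recomputed value.
    return abs(llm_orr - recomputed_orr) > tol

df = pd.DataFrame({"best_response": ["CR", "PR", "SD", "PD", "PR", None, "SD", "CR"]})
llm_reported_orr = 0.50  # value quoted by a model, to be verified
orr = recompute_orr(df["best_response"])
print(f"recomputed ORR = {orr:.3f}; flagged: {flag_discrepancy(llm_reported_orr, orr)}")

Checks of this kind could be written once per endpoint and run automatically, turning the ad hoc verification burden described above into a reusable, auditable asset.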

The impact on clinical and translational science is twofold. Properly managed, LLMs could accelerate trial design, analysis, and the communication of complex results. However, the current high error rate and lack of standardized verification pose a direct threat to research integrity and subsequent clinical decision-making [1,8,9]. The strong demand for structured support (75.9% of respondents wanted case studies and 69.0% interactive tutorials) must be met with field-specific resources that emphasize critical evaluation and the eight core principles for responsible use proposed by the authors, including collaborative verification and transparency [1,10].

Grambow et al. [1] have laid a crucial evidentiary foundation. Their work should catalyze a coordinated effort to develop the specialized training, robust evaluation frameworks, and institutional policies required to ensure that this paradigm shift enhances, rather than undermines, the methodological rigor that is the cornerstone of valid clinical and translational science.

Author contributions

Vihaan Sahu: Conceptualization, Writing - original draft, Writing - review and editing.

Competing interests

The author declares no conflicts of interest.

References

1. Grambow SC, Desai M, Weinfurt KP, et al. Integrating large language models in biostatistical workflows for clinical and translational research. J Clin Transl Sci. 2025;9:18. doi: 10.1017/cts.2025.10064.
2. Dell’Acqua F, McFowland E III, Mollick E, et al. Navigating the jagged technological frontier: field experimental evidence of the effects of AI on knowledge worker productivity and quality. Harvard Business School Working Paper 24-013. Boston, MA: Harvard Business School; 2023. doi: 10.2139/ssrn.4573321.
3. Peng S, Kalliamvakou E, Cihon P, Demirer M. The impact of AI on developer productivity: evidence from GitHub Copilot. arXiv preprint; 2023. doi: 10.48550/arXiv.2302.06590.
4. Denecke K, May R, Rivera Romero O. Potential of large language models in health care: Delphi study. J Med Internet Res. 2024;26:e52399. doi: 10.2196/52399.
5. Perlis RH, Fihn SD. Evaluating the application of large language models in clinical research contexts. JAMA Netw Open. 2023;6:e2335924. doi: 10.1001/jamanetworkopen.2023.35924.
6. Komandur R, McDunn J, Nair N, et al. Artificial intelligence in biomedical data analysis: a comparative assessment of large language models for automated clinical trial interpretation and statistical evaluation. medRxiv. 2025. doi: 10.1101/2025.02.05.25321607.
7. Lee J, Park S, Shin J, Cho B. Analyzing evaluation methods for large language models in the medical field: a scoping review. BMC Med Inform Decis Mak. 2024;24:366. doi: 10.1186/s12911-024-02709-7.
8. Zhou H, Liu F, Gu B, et al. A survey of large language models in medicine: progress, application, and challenge. arXiv preprint; 2023. doi: 10.48550/arXiv.2311.05112.
9. Thapa S, Adhikari S. ChatGPT, Bard, and large language models for biomedical research: opportunities and pitfalls. Ann Biomed Eng. 2023;51:2647-2651. doi: 10.1007/s10439-023-03284-0.
10. Low YS, Jackson ML, Hyde RJ, et al. Answering real-world clinical questions using large language model-based systems. arXiv preprint; 2024. doi: 10.48550/arXiv.2407.00541.