
Development and validation of natural language processing algorithms in the national ENACT network

Published online by Cambridge University Press: 22 August 2025

Yanshan Wang*
Affiliation:
Clinical and Translational Science Institute, University of Pittsburgh, Pittsburgh, PA, USA; Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, USA; Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
Jordan Hilsman
Affiliation:
Clinical and Translational Science Institute, University of Pittsburgh, Pittsburgh, PA, USA; Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, USA
Chenyu Li
Affiliation:
Clinical and Translational Science Institute, University of Pittsburgh, Pittsburgh, PA, USA; Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, USA; Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
Michele Morris
Affiliation:
Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
Paul M. Heider
Affiliation:
Biomedical Informatics Center and Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, USA
Sunyang Fu
Affiliation:
McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
Min Ji Kwak
Affiliation:
McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA
Andrew Wen
Affiliation:
McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
Joseph R. Applegate
Affiliation:
McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
Liwei Wang
Affiliation:
McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
Elmer Bernstam
Affiliation:
McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA; Division of General Internal Medicine, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA
Hongfang Liu
Affiliation:
McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
Jack Chang
Affiliation:
Clinical and Translational Science Institute, University of Rochester Medical Center, Rochester, NY, USA
Daniel R. Harris
Affiliation:
Institute for Biomedical Informatics, University of Kentucky, Lexington, KY, USA
Alexandria Corbeau
Affiliation:
Institute for Biomedical Informatics, University of Kentucky, Lexington, KY, USA
Darren Henderson
Affiliation:
Institute for Biomedical Informatics, University of Kentucky, Lexington, KY, USA
John Osborne
Affiliation:
Department of Biomedical Informatics and Data Science, University of Alabama at Birmingham, Birmingham, AL, USA
Richard E. Kennedy
Affiliation:
Division of Gerontology, Geriatrics, and Palliative Care, Department of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
Nelly-Estefanie Garduno-Rapp
Affiliation:
Clinical Informatics Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
Justin F. Rousseau
Affiliation:
Clinical Informatics Center, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Neurology, University of Texas Southwestern Medical Center, Dallas, TX, USA
Chao Yan
Affiliation:
Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
You Chen
Affiliation:
Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
Mayur B. Patel
Affiliation:
Department of Surgery, Vanderbilt University Medical Center, Nashville, TN, USA
Tyler J. Murphy
Affiliation:
Department of Surgery, Vanderbilt University Medical Center, Nashville, TN, USA
Bradley A. Malin
Affiliation:
Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
Chan Mi Park
Affiliation:
Department of Gerontology, Hebrew SeniorLife, Marcus Institute for Aging Research, Boston, MA, USA
Jungwei W. Fan
Affiliation:
Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, USA; Center for Clinical and Translational Science, Mayo Clinic, Rochester, MN, USA
Sunghwan Sohn
Affiliation:
Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, USA
Sandeep Pagali
Affiliation:
Department of Medicine, Mayo Clinic, Rochester, MN, USA
Yifan Peng
Affiliation:
Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA; Clinical & Translational Science Center, Weill Cornell Medicine, New York, NY, USA
Aman Pathak
Affiliation:
Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, FL, USA
Yonghui Wu
Affiliation:
Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, FL, USA
Zongqi Xia
Affiliation:
Department of Neurology, University of Pittsburgh, Pittsburgh, PA, USA
Salvatore Loguercio
Affiliation:
Scripps Research, Scripps Research Translational Institute, La Jolla, CA, USA
Steven E. Reis
Affiliation:
Clinical and Translational Science Institute, University of Pittsburgh, Pittsburgh, PA, USA
Shyam Visweswaran
Affiliation:
Clinical and Translational Science Institute, University of Pittsburgh, Pittsburgh, PA, USA; Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
*Corresponding author: Y. Wang; Email: yanshan.wang@pitt.edu

Abstract

Objective:

Electronic Health Record (EHR) data are critical for advancing translational research and AI technologies. The ENACT network offers access to structured EHR data across 57 CTSA hubs. However, substantial information is contained in clinical narratives, requiring natural language processing (NLP) for research. The ENACT NLP Working Group was formed to make NLP-derived clinical information accessible and queryable across the network.

Methods:

We established the ENACT NLP Working Group with 13 sites selected based on criteria including clinical notes access, IT infrastructure, NLP expertise, and institutional support. We divided sites into five focus groups targeting clinical tasks within disease contexts. Each focus group consisted of two development sites and two validation sites. We extended the ENACT ontology to standardize NLP-derived data and conducted multisite evaluations using the Open Health Natural Language Processing (OHNLP) Toolkit.

Results:

The working group achieved 100% site retention and deployed NLP infrastructure across all sites. We developed and validated NLP algorithms for rare disease phenotyping, social determinants of health, opioid use disorder, sleep phenotyping, and delirium phenotyping. Performance varied across sites (F1 scores 0.53–0.96), highlighting the impact of data heterogeneity. We extended the ENACT common data model and ontology to incorporate NLP-derived data while maintaining Shared Health Research Informatics NEtwork (SHRINE) compatibility.
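
For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The following is a minimal sketch of how a per-site F1 could be computed from confusion counts; the site names and counts are made up for illustration and are not taken from the study, and the function name is hypothetical.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1 (harmonic mean of precision and recall) from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (true positive, false positive, false negative) counts per site.
# These numbers are illustrative only and do not come from the article.
site_counts = {
    "site_a": (90, 5, 5),
    "site_b": (40, 30, 40),
}

# The same algorithm can score very differently when note styles differ by site.
site_f1 = {site: f1_score(*counts) for site, counts in site_counts.items()}

for site, score in site_f1.items():
    print(f"{site}: F1 = {score:.2f}")
```

Reporting the score per site, rather than pooling all notes, is what makes this kind of cross-site variation visible.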

Conclusion:

This study demonstrates the feasibility of deploying NLP infrastructure across large, federated networks. The focus group approach proved more practical than general-purpose approaches. Key lessons include the challenge of data heterogeneity and the importance of collaborative governance. The work also provides a foundation that other networks can build on to implement NLP capabilities for translational research.

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of Association for Clinical and Translational Science

Figure 1. Participating sites in the Evolve to Next-Gen Accrual to Clinical Trials (ENACT) network natural language processing (NLP) working group.


Table 1. Focus group tasks and associated development sites, deployment sites, cohort definitions, and clinical note types


Figure 2. An overview of the ENACT NLP workflow. *SHRINE = Shared Health Research Informatics NEtwork.


Table 2. Performance of the algorithm developed by the sleep phenotyping focus group


Table 3. Performance of the algorithm developed by the housing status focus group


Table 4. Performance of the algorithm developed by the delirium phenotyping focus group

Supplementary material: File

Wang et al. supplementary material
