
Best practices for clinical trials data harmonization and sharing on NHLBI BioData Catalyst (BDC) learned from CONNECTS network COVID-19 studies

Published online by Cambridge University Press:  26 March 2025

Jeran K. Stratford*
Affiliation:
RTI International, Research Triangle Park, NC, USA
Huaqin Helen Pan
Affiliation:
RTI International, Research Triangle Park, NC, USA
Alex Mainor
Affiliation:
Vanderbilt University Medical Center, Nashville, TN, USA
Edvin Music
Affiliation:
Department of Epidemiology, University of Pittsburgh School of Public Health, Pittsburgh, PA, USA
Joshua Froess
Affiliation:
Department of Epidemiology, University of Pittsburgh School of Public Health, Pittsburgh, PA, USA
Alex C. Cheng
Affiliation:
Vanderbilt University Medical Center, Nashville, TN, USA
Alexandra Weissman
Affiliation:
Department of Emergency Medicine, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
David T. Huang
Affiliation:
Department of Emergency Medicine, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA; Department of Critical Care Medicine, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
Elizabeth C. Oelsner
Affiliation:
Division of General Medicine, Columbia University Irving Medical Center, New York, NY, USA
Sonia M. Thomas
Affiliation:
RTI International, Research Triangle Park, NC, USA
*Corresponding author: J.K. Stratford; Email: jstratford@rti.org

Abstract

Collaborative and transparent sharing of COVID-19 clinical trial and large-scale observational study data is critical to accelerating scientific discovery and informing clinical practice. Responsible data sharing requires addressing challenges associated with data privacy and confidentiality, data linkage, data quality, variable harmonization, data formats, and comprehensive metadata documentation to produce a high-quality, contextually rich, findable, accessible, interoperable, and reusable (FAIR) dataset. This communication explores the experiences and lessons learned from sharing National Heart, Lung, and Blood Institute (NHLBI) COVID-19 clinical trial (including adaptive platform trials) and cohort study datasets through the NHLBI BioData Catalyst® (BDC) ecosystem, focusing on the challenges and successes of harmonizing these datasets for broader research use. Our findings highlight the importance of establishing standardized data formats, adopting common data elements, and creating and maintaining robust data governance structures that address common challenges (i.e., data privacy and data-sharing limitations resulting from informed consent). These efforts resulted in a set of comprehensive and interoperable datasets from 5 clinical trials and 13 cohort studies that will enable downstream reuse in analyses and collaborations. The principles and strategies outlined, derived through experience with consortia data, can lay the groundwork for advancing collaborative and efficient data sharing.

Information

Type
Special Communication
Creative Commons
Creative Commons License: CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of Association for Clinical and Translational Science

Figure 1. A. CONNECTS common data elements (CDEs) development and utilization. Many CONNECTS studies were ongoing (blue lines) prior to development and initial publication of the CONNECTS CDEs in June 2021 (yellow flag). Therefore, concentrated time for retrospective harmonization (solid green lines) was required to align study data with the CONNECTS CDEs to maximize dataset interoperability. In part, CDE adoption during study design coupled with concurrent data collection and intermittent harmonization (dashed green line) during ACTIV4-HT contributed to the reduction in time between study completion and dataset release (red stars). B. CONNECTS study variables mapped to CONNECTS CDEs. The count of mapping levels assigned to the study variable(s)/CDE pairing across CONNECTS studies was evaluated and visualized. An “Identical” mapping (blue) signifies that study data were collected exactly as recommended by the NHLBI COVID-19 CDE. A “Comparable” mapping (orange) means that the study variable and NHLBI COVID-19 CDE are conceptually similar but differ in phrasing or response options. A “Related” mapping (gray) indicates that the study variable and the NHLBI COVID-19 CDE cover a similar topic, but the mapping relationship is uncertain. ACTIV4-HT was the only study to adopt CONNECTS CDEs during study design, which greatly increased the number of “Identical” mappings, thus maximizing interoperability. Please note that ACTIV4a v1.0, v1.1, and v1.2 are different trial arms (drugs), not different versions of the same trial arm (drug).
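
The per-study tally of mapping levels described for Figure 1B can be illustrated with a minimal Python sketch. It is not the authors' tooling: the function name `count_mapping_levels`, the tuple layout, and every study, variable, and CDE name in the example rows are hypothetical and invented for illustration. Only the three level names come from the caption.

```python
from collections import Counter
from enum import Enum


class MappingLevel(Enum):
    """The three mapping levels named in the Figure 1B caption."""
    IDENTICAL = "Identical"    # collected exactly as the CDE recommends
    COMPARABLE = "Comparable"  # same concept, different phrasing or response options
    RELATED = "Related"        # similar topic, uncertain mapping relationship


def count_mapping_levels(mappings):
    """Tally mapping levels per study.

    `mappings` is an iterable of (study, variable, cde, level) tuples;
    this layout is an assumption made for the sketch.
    """
    counts = {}
    for study, _variable, _cde, level in mappings:
        # MappingLevel(level) raises ValueError for an unknown level name.
        counts.setdefault(study, Counter())[MappingLevel(level)] += 1
    return counts


# Hypothetical rows, invented for illustration only.
example_mappings = [
    ("STUDY_A", "age_years", "CDE_AGE", "Identical"),
    ("STUDY_A", "smoke_status", "CDE_SMOKING", "Comparable"),
    ("STUDY_B", "age_group", "CDE_AGE", "Related"),
    ("STUDY_B", "sex", "CDE_SEX", "Identical"),
]

level_counts = count_mapping_levels(example_mappings)
```

Using an enumeration rather than free-text labels means a misspelled level fails loudly instead of silently creating a fourth category in the counts.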

Table 1. Current data management and sharing status for CONNECTS studies. To request available study datasets, click the link in the “Data request” column at the study website https://nhlbi-connects.org/data-request

Figure 2. BDC submission workflow. Data generators who submitted datasets to BDC completed a multistep process involving multiple systems. The figure outlines the tasks in this data generator-led workflow for each step, with references to the relevant submission forms. The outcomes produced at each step that enable advancing to the next phase are outlined. dbGaP = Database of Genotypes and Phenotypes; QC = quality control; BDC = NHLBI BioData Catalyst®; DMC = data management core; a = bdcatalystdatasharing@nih.gov; b = nhlbigeneticdata@nhlbi.nih.gov.

Figure 3. ACTIV4a adaptive platform trial data collection timelines. Adaptive platform trials allow interventions to enter or leave the platform based on a predefined decision algorithm. This flexibility results in staggered completion of longitudinal data collection (separate lock dates for each intervention). To make data available as soon as possible while balancing the effort required for data submission, harmonized datasets that are completed at the same time are aggregated (colors) into a single data release. One impact of this approach is the need to access multiple releases to obtain all data for one of the domains (P2Y12 for severe baseline disease). *Release 2 includes updated Release 1 data and is preferentially recommended for analysis. EMR = electronic medical records; SGLT2 = sodium-glucose cotransporter-2; criza = crizanlizumab.
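
The release aggregation described for Figure 3 can be sketched in Python as follows. This is a simplified illustration, not the procedure the study teams used: it assumes "completed at the same time" means an identical lock date, and the function names, dataset names, and dates are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def group_into_releases(lock_dates):
    """Group datasets that share a lock date into numbered releases.

    `lock_dates` maps a dataset name to its lock date. Releases are
    numbered from 1 in chronological order of lock date.
    """
    by_date = defaultdict(list)
    for dataset, locked_on in lock_dates.items():
        by_date[locked_on].append(dataset)
    return {
        number: sorted(by_date[locked_on])
        for number, locked_on in enumerate(sorted(by_date), start=1)
    }


def releases_for_domain(releases, domain_datasets):
    """Return the release numbers needed to obtain every dataset of one domain."""
    wanted = set(domain_datasets)
    return sorted(n for n, datasets in releases.items() if wanted & set(datasets))


# Hypothetical dataset names and lock dates, invented for illustration only.
example_lock_dates = {
    "arm_a_core": date(2022, 1, 31),
    "arm_b_core": date(2022, 1, 31),
    "arm_b_followup": date(2022, 9, 30),
}

releases = group_into_releases(example_lock_dates)
needed = releases_for_domain(releases, ["arm_b_core", "arm_b_followup"])
```

In this toy example the second arm's data span two releases, which mirrors the caption's point that one domain may require access to more than one release.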

Supplementary material: File

Stratford et al. supplementary material
