
A method to enable clinical and translational research teams with custom real-world data from electronic health record systems

Published online by Cambridge University Press:  02 January 2026

Thomas R. Campion Jr.*
Affiliation:
Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA Information Technologies & Services Department, Weill Cornell Medicine, New York, NY, USA Clinical & Translational Science Center, Weill Cornell Medicine, New York, NY, USA Department of Pediatrics, Weill Cornell Medicine, New York, NY, USA
Evan T. Sholle
Affiliation:
Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA Information Technologies & Services Department, Weill Cornell Medicine, New York, NY, USA
Xiaobo Fuld
Affiliation:
Information Technologies & Services Department, Weill Cornell Medicine, New York, NY, USA
Cindy Chen
Affiliation:
Information Technologies & Services Department, Weill Cornell Medicine, New York, NY, USA
Marcos A. Davila
Affiliation:
Information Technologies & Services Department, Weill Cornell Medicine, New York, NY, USA
Vinay I. Varughese
Affiliation:
Information Technologies & Services Department, Weill Cornell Medicine, New York, NY, USA
Curtis L. Cole
Affiliation:
Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA Information Technologies & Services Department, Weill Cornell Medicine, New York, NY, USA Clinical & Translational Science Center, Weill Cornell Medicine, New York, NY, USA Department of Medicine, Weill Cornell Medicine, New York, NY, USA
*
Corresponding author: T. R. Campion Jr.; Email: thc2015@med.cornell.edu

Abstract

Introduction:

Custom transformations of real-world data (RWD) from electronic health record (EHR) systems are necessary because physicians in multiple specialties and basic scientists from a variety of disciplines define study variables describing health and disease statuses differently. To increase RWD use, we hypothesized that a solution supporting three workflows – discovery, collection, and analysis – using existing rather than novel tools and requiring financial commitment from investigators would scale to meet the needs of clinical and translational research teams and ensure regulatory compliance at an academic medical center.

Materials and methods:

Weill Cornell Medicine (WCM) implemented custom research data repositories (RDRs) consisting of i2b2 for discovery, REDCap for collection, and Microsoft SQL Server for analysis. WCM subsidized the central information technology (IT) department to manage RDRs and required investigators to commit $50,000 for RDR startup and $7500 for annual maintenance.

Results:

From 2013 through 2025, WCM launched more than 17 custom RDRs for pediatrics, myeloproliferative neoplasms, obstetrics and gynecology, pulmonary and critical care, chronic kidney disease, and ophthalmology among other areas. Custom RDRs enabled academic output (e.g., publications, grants) as well as local quality improvement activities.

Discussion:

Custom RDRs facilitated delivery of fit-for-purpose data sets derived from EHR systems and other RWD sources. Over time, RDRs have evolved from an infrastructure product delivered by central IT to a data partnership between investigators and IT.

Conclusion:

Custom RDRs and data partnerships may help increase the use of RWD from EHR and other sources by clinical and translational research teams.

Information

Type
Research Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press on behalf of Association for Clinical and Translational Science

Introduction

Clinical and translational researchers increasingly seek real-world data (RWD) from electronic health records (EHR) and other source systems to generate real-world evidence (RWE). Using RWD investigators can conduct epidemiological studies, develop predictive models, evaluate interventions, and enable artificial intelligence among other activities [Reference Liaw, Guo and Ansari1,Reference Bian, Lyu and Loiacono2]. For healthcare organizations and pharmaceutical companies, supporting researchers with RWD has been challenging due to multiple factors including but not limited to technology, governance, workforce, sustainability, data literacy, and accessibility by non-informaticians [Reference Campion, Craven, Dorr, Bernstam and Knosp3Reference Elkin, Lindsell and Facelli7].

Fundamentally, study teams need data sets consisting of rows representing a unit of analysis (e.g., patient, encounter, clinical observation) and columns representing variables of interest (e.g., demographics, comorbidities, laboratory results) that are “research-ready” with respect to reliability and validity [Reference Hersh, Cimino and Payne8] and formatted for import into a statistical software package (e.g., SAS, R, Python). Physician scientists trained in diverse care specialties and basic scientists from a variety of disciplines often define study variables describing health and disease statuses differently, necessitating custom transformations of source system data to support varying scientific aims and use cases. Expertise required to generate transformed data sets usually exists in institutional informatics service groups, such as those that operate an enterprise data warehouse for research (EDW4R) [Reference Campion, Craven, Dorr and Knosp9], rather than in study teams, and the effort required from informatics staff and investigators to support studies with RWD is often considerable [Reference Knosp, Craven, Dorr, Bernstam and Campion10].

To make RWD from EHR systems available to investigators for analytics, academic medical centers (AMCs) have implemented repositories ranging from general for an institution [Reference Baghal, Zozus, Baghal, Al-Shukri and Prior11Reference Cimino, Ayres and Remennik20] to custom for investigator teams [Reference Kortüm, Müller and Kern21Reference Gallagher, Smith and Matthews26]. To add new elements from a source clinical system to a general institutional repository, data engineers historically have performed dimensional modeling and data harmonization activities to ensure consistency [Reference Chute, Beck, Fisk and Mohr13,Reference Johnson27]. However, a general repository’s “one-size-fits-all” approach can fail to meet specific needs of investigators because clinical concept definitions may lose specificity from source systems, requiring staff to rework query approaches and resulting in sluggish delivery of data sets [Reference Chute, Beck, Fisk and Mohr13,Reference Wade, Hum and Murphy16,Reference Cimino, Ayres and Remennik20]. To enable investigators to query general repositories and access data sets for particular studies, institutions have implemented commercial business intelligence tools (e.g., Power BI, Tableau, BusinessObjects, Qlik) [Reference Cimino, Ayres and Remennik20,Reference Horvath, Winfield, Evans, Slopek, Shang and Ferranti28], novel solutions [Reference Danciu, Cowan and Basford12,Reference Lowe, Ferris, Hernandez and Weber15], and applications supported by academic consortia (e.g., i2b2, OHDSI) [Reference Murphy, Weber and Mendis29,Reference Hripcsak, Duke and Shah30]. Information technology (IT) departments typically have managed general data repositories in coordination with an Institutional Review Board (IRB) and other local administrative units [Reference Campion, Craven, Dorr and Knosp9].

In contrast to general institutional data repositories, studies have described approaches for domain-specific custom repositories in cardiology, ophthalmology, urology, and perinatal care among other clinical areas [Reference Kortüm, Müller and Kern21Reference Pennington, Ruth and Italia24]. Of these approaches, two used predefined data models [Reference Hall, Greenberg and Muglia22,Reference Hruby, McKiernan, Bakken and Weng23] while two implemented custom data models specific to investigator needs [Reference Kortüm, Müller and Kern21,Reference Pennington, Ruth and Italia24]. Similar to general repositories, custom repositories have provided data access and querying to investigators using homegrown [Reference Pennington, Ruth and Italia24] and commercial business intelligence [Reference Kortüm, Müller and Kern21] tools along with applications popular among AMCs [Reference Hall, Greenberg and Muglia22,Reference Hruby, McKiernan, Bakken and Weng23]. Compared to general repositories, fewer reports of custom repositories appear to have addressed financial sustainability [Reference Hall, Greenberg and Muglia22,Reference Gallagher, Smith and Matthews26] or described approaches to regulatory oversight [Reference Gallagher, Smith and Matthews26]. In our experience, a custom repository managed by an individual investigator group rather than central IT may fail to adhere to best practices for information security and regulatory compliance, posing a risk to individual patient privacy, institutional reputation, and overall public trust in the biomedical research enterprise. Additionally, although a custom repository may more readily provide scientists with data sets meeting study-specific requirements, it may require dedicated personnel at greater expense than available through a general institutional resource. 
Although it is unknown to what extent custom repositories have spread to other investigator groups within their home institutions or to other sites, AMCs have widely adopted certain research informatics tools, including i2b2 and REDCap [Reference Harris, Taylor and Minor31,Reference Kohane, Churchill and Murphy32].

Over time significant variation has characterized institutional approaches to supporting investigators with RWD from EHR systems with respect to data, staffing, organization, and tooling [Reference Campion, Craven, Dorr, Bernstam and Knosp3,Reference MacKenzie, Wyatt, Schuff, Tenenbaum and Anderson33] while sustainability challenges have persisted [Reference Campion, Craven, Dorr, Bernstam and Knosp3,Reference DiLaura, Turisco, McGrew, Reel, Glaser and Crowley34,Reference Obeid, Tarczy-Hornoch and Harris35] and led some sites toward closer partnerships with industry [Reference Campion, Craven, Dorr, Bernstam and Knosp3]. As optimal approaches for making RWD accessible to investigators remain unknown, the objective of this paper is to describe one institution’s approach with respect to technology, regulatory, governance, finance, and investigator engagement. To the best of our knowledge, the literature does not describe approaches that enable AMCs to provide custom RWD repositories for multiple investigator groups. We hypothesized that custom research data repositories (RDRs) managed by central IT that supported three workflows – discovery, collection, and analysis – using existing rather than novel tools and requiring financial commitment from investigators would scale to meet the needs of study teams and ensure regulatory compliance.

Materials and methods

Setting

Weill Cornell Medicine (WCM) and NewYork-Presbyterian (NYP) have long shared a clinical affiliation and commitment to biomedical research. In 2025 WCM, the medical college of Cornell University, employed more than 2000 attending physicians who treated patients in 40 outpatient facilities across New York City and admitted patients to NYP/Weill Cornell Medical Center (NYP/WCMC) on the Upper East Side of Manhattan. Known for multispecialty care, WCM recorded 3.3 million annual patient visits and held more than 1300 active NIH awards in 2024.

As separate legal entities with separate IT organizations, WCM and NYP implemented separate EHR systems – Epic in WCM outpatient practices starting in 2000 and Allscripts Sunrise Clinical Manager (SCM) in NYP inpatient and emergency settings starting in 2010 – with automated data interfaces and shared medical record numbers (MRNs) to facilitate patient care and billing. Additionally, WCM and NYP supported specialty-specific clinical applications (e.g., anesthesiology, cardiology) with interfaces to the primary EHR systems. In 2020, WCM and NYP consolidated patient care and billing workflows in a single shared Epic implementation.

At WCM, the Information Technologies & Services Department (ITS) provided electronic infrastructure for the clinical, education, and research missions. Within ITS the Research Informatics division, which received financial assistance from the Joint Clinical Trials Office of WCM and NYP along with the NIH-funded WCM Clinical & Translational Science Center, supported scientists with electronic patient data through a suite of tools and services called Architecture for Research Computing in Health (ARCH) [Reference Campion, Sholle, Pathak, Johnson, Leonard and Cole36]. Undergirding ARCH applications was a Microsoft SQL Server database environment called Secondary Use of Patients’ Electronic Records (SUPER) that aggregated research and clinical data from across WCM and NYP, as well as automated the extraction, transformation, and loading (ETL) of patient data for applications [Reference Sholle, Kabariti, Johnson, Leonard, Pathak and Varughese37]. At WCM data from SUPER enabled a number of research systems widely used in AMCs including but not limited to i2b2 for cohort discovery [Reference Murphy, Weber and Mendis29], Leo for natural language processing (NLP) [Reference Patterson, Freiberg and Skanderson38], the Observational Medical Outcomes Partnership (OMOP) common data model (CDM) for large-scale retrospective data analysis [Reference Hripcsak, Duke and Shah30], and REDCap [Reference Harris, Taylor, Thielke, Payne, Gonzalez and Conde39] with dynamic data pull (DDP) [Reference Campion, Sholle and Davila40] for electronic data capture (EDC) and adjudication of EHR data. We used the SUPER infrastructure to provide custom RDRs.

Technology approach

As shown in Figure 1, a custom RDR contained raw patient data of interest (i.e., protected health information) extracted from one or more source systems as defined by a particular group of investigators in coordination with Research Informatics. If source system data were not available in SUPER, Research Informatics worked with data source owners to obtain data on behalf of investigators for subsequent integration into an RDR. To define patients of interest for an RDR, investigators specified inclusion criteria based on data documented in EHR systems (e.g., diagnosis codes, encounters with particular physicians) or lists of individual patient MRNs (e.g., participants enrolled in a specific IRB protocol).

Figure 1. A custom research data repository (RDR) aggregates data from disparate sources, transforms data into research-ready formats, and supports three workflows using off-the-shelf tools.
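To make the inclusion-criteria pattern concrete, the following is a minimal sketch in Python with SQLite; table names, codes, and MRNs are hypothetical examples, and the production RDRs ran on Microsoft SQL Server against SUPER rather than SQLite.

```python
import sqlite3

# Hypothetical schema and values for illustration only; real RDR inclusion
# logic runs against source tables defined with investigators.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE diagnosis (mrn TEXT, icd10_code TEXT, dx_date TEXT);
INSERT INTO diagnosis VALUES
  ('1001', 'E11.9',  '2023-01-05'),  -- type 2 diabetes
  ('1002', 'I10',    '2023-02-10'),  -- hypertension
  ('1003', 'E11.65', '2023-03-12');
""")

# Inclusion by diagnosis codes (e.g., any E11.* code) ...
by_code = conn.execute(
    "SELECT DISTINCT mrn FROM diagnosis WHERE icd10_code LIKE 'E11%'"
).fetchall()

# ... or by an investigator-supplied MRN list (e.g., consented participants).
mrn_list = ["1002", "1003"]
by_list = conn.execute(
    "SELECT DISTINCT mrn FROM diagnosis WHERE mrn IN (%s)"
    % ",".join("?" * len(mrn_list)), mrn_list
).fetchall()

print(sorted(m for (m,) in by_code))  # ['1001', '1003']
print(sorted(m for (m,) in by_list))  # ['1002', '1003']
```

Either form of predicate, a code pattern or an explicit identifier list, defines the patient universe shared by every downstream RDR component.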

For investigators to access data, an RDR deployed customized instances of three tools – i2b2, REDCap, and Microsoft SQL Server – in support of three workflows – discovery, collection, and analysis, respectively. Working with an investigator group, Research Informatics customized data available in each tool to meet scientific needs.

Discovery

For i2b2, custom ontologies built according to an investigator group’s specification (e.g., REDCap project data, groups of procedure codes) appeared within an i2b2 RDR instance alongside common ontologies (e.g., ICD-10, RxNorm) accessible in the general-purpose i2b2 instance at the institution, which contained data for all patients [Reference Sholle, Cusick, Davila, Kabariti, Flores and Campion41]. Through an RDR’s i2b2 instance, investigators had the ability to browse all clinical concepts modeled from the RDR, generate queries using a drag-and-drop interface, and obtain counts of patients to support activities preparatory to research. Most RDR i2b2 users could only view patient counts while some RDR administrators could view and export identified patient-level data using the ExportXLS plug-in pursuant to regulatory approval. As described elsewhere, an RDR i2b2 instance consisted of a specific i2b2 project, which defined user and data access, with logical separation of data achieved through SQL views that protected patient privacy and prevented data duplication with negligible impact on query performance [Reference Sholle, Davila, Kabariti, Schwartz, Varughese and Cole42].
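The logical-separation pattern can be sketched as follows, assuming hypothetical table names and using SQLite for illustration (the cited reference describes the production SQL Server implementation):

```python
import sqlite3

# Minimal sketch: the RDR's i2b2 project queries a view, not the base table,
# so users see only rows for cohort patients and no data are duplicated.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE observation_fact (mrn TEXT, concept TEXT, value REAL);
CREATE TABLE rdr_cohort (mrn TEXT PRIMARY KEY);
INSERT INTO observation_fact VALUES
  ('1001', 'SpO2', 94.0), ('1002', 'SpO2', 98.0), ('1003', 'SpO2', 91.0);
INSERT INTO rdr_cohort VALUES ('1001'), ('1003');

-- View restricting the fact table to the RDR's patient set.
CREATE VIEW rdr_observation_fact AS
SELECT f.* FROM observation_fact f
JOIN rdr_cohort c ON c.mrn = f.mrn;
""")

rows = conn.execute(
    "SELECT mrn, value FROM rdr_observation_fact ORDER BY mrn"
).fetchall()
print(rows)  # [('1001', 94.0), ('1003', 91.0)]
```

Granting the i2b2 project permissions on the view rather than the base table is what protects patient privacy without physically copying data.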

Collection

Through REDCap, researchers recorded novel measures that did not exist in source systems and annotated existing EHR data. For REDCap, custom DDP definitions [Reference Campion, Sholle and Davila40] automatically retrieved data elements from source systems (e.g., peripheral capillary oxygen saturation documented in the medical intensive care unit nursing flowsheet) for study team members to review prior to saving into case report forms (CRFs). For an RDR, investigators were able to augment existing REDCap projects with DDP or create new REDCap projects using DDP with guidance from Research Informatics. Data recorded in REDCap were available for aggregation and transformation with other data in an RDR.

Analysis

For Microsoft SQL Server, custom data marts transformed raw source system data into research-grade, rows-and-columns data sets ready for statistical analysis. Data marts required multiple iterations between investigators and Research Informatics to define, build, and test. Because different research questions required different units of analysis as rows (e.g., patient, encounter, procedure), definitions of variables as columns (e.g., comorbidity, lab result of interest), and patient cohorts, Research Informatics limited the initial number of data marts investigators could create to two. For example, one data mart defined a dichotomous variable for presence of diabetes according to diagnosis codes while another data mart used a laboratory result above a particular threshold. Additional data marts were available through a separate scope of work and funding mechanism.
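As an illustration of how two data marts can encode the same clinical concept differently, the following pandas sketch flags diabetes by diagnosis code in one mart and by a laboratory threshold in the other; column names and the HbA1c cutoff are hypothetical examples, not institutional definitions.

```python
import pandas as pd

# Hypothetical raw source-system extracts.
dx = pd.DataFrame({"mrn": ["1001", "1002", "1003"],
                   "icd10_code": ["E11.9", "I10", "Z00.00"]})
labs = pd.DataFrame({"mrn": ["1001", "1002", "1003"],
                     "hba1c_pct": [8.1, 6.9, 5.2]})

# Data mart 1: dichotomous diabetes flag derived from diagnosis codes.
mart1 = (dx.assign(diabetes_dx=dx["icd10_code"].str.startswith("E11"))
           .groupby("mrn", as_index=False)["diabetes_dx"].max())

# Data mart 2: diabetes flag from a lab result above a threshold (HbA1c >= 6.5%).
mart2 = labs.assign(diabetes_lab=labs["hba1c_pct"] >= 6.5)[["mrn", "diabetes_lab"]]

# The two definitions can disagree for the same patient, which is precisely
# why investigator groups needed custom rather than one-size-fits-all marts.
print(mart1.merge(mart2, on="mrn"))
```

In this toy data, a patient with an HbA1c of 6.9% but no E11.* code is flagged by the lab-based mart only, motivating the iterative definition work described above.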

For access to data marts, Microsoft SQL Server Reporting Services allowed all investigators to view and download flat files via a secure web-based front end while Microsoft SQL Server Management Studio and similar command line tools enabled investigators with advanced SQL skills to perform sophisticated queries of relational data models. Custom data marts existed in physically separated databases with permissions granted by Research Informatics according to regulatory approval. Additionally, using logical separation achieved through SQL views as initially implemented for i2b2, an RDR contained an instance of the OMOP CDM specific to its patient inclusion criteria [Reference Sholle, Davila, Kabariti, Schwartz, Varughese and Cole42] to promote standards-based data science activities.

Regulatory compliance

In addition to technology, an RDR provided safeguards for regulatory compliance. An IRB protocol specific to an RDR enabled ongoing data refreshes from source systems, including for research data from studies where patients provided consent for future use, and distribution of RDR data only in support of studies governed by separate IRB protocols; no data analysis was permitted otherwise. For data requests in support of data sets extracted from an RDR for specific research and quality improvement purposes, Research Informatics served as the institutional honest broker [Reference Boyd, Saxman, Hunscher, Smith, Morris and Kaston43], reviewing IRB protocols and coordinating with the IRB, Privacy Office, and study teams as necessary. Investigators submitted data requests via a standardized process documented by Research Informatics in ServiceNow, the commercially available IT service management system used department-wide in WCM ITS. Each RDR data request logged in ServiceNow described data elements of interest, a copy of the IRB protocol covering analysis, and determination by a Research Informatics analyst citing IRB protocol language for release of data. ServiceNow also logged access requests to an RDR’s i2b2 and Microsoft SQL Server instances pursuant to IRB protocols with permissions to those resources managed by Research Informatics. Investigators controlled permissions to their REDCap projects according to IRB protocols using built-in system features for user access management. Each RDR component logged all user activity for audit purposes.

Governance and finance

After initial consultation with Research Informatics to determine RDR feasibility, investigators formed a project team consisting of faculty and staff to define RDR parameters (e.g., data sources, patient inclusion criteria) and determine the order in which to deploy RDR components (e.g., i2b2, data mart 1, REDCap, data mart 2). Typically, faculty involvement included a senior faculty member providing overall project sponsorship and one or more junior faculty members providing specific scientific leadership, such as in defining rules to transform EHR data into data mart variables and participating in quality assurance testing.

To obtain institutional subsidy for RDR development, investigator groups needed to obtain formal approval from the ARCH Steering Committee, which consisted of senior research and IT leaders. In reviewing proposals, the committee evaluated alignment with institutional priorities, scientific merit, and financial feasibility. Critically, investigators needed to commit to providing $50,000 in first-year startup funds and $7500 in ongoing annual maintenance fees. Fees paid by investigators did not cover the full cost of RDR development (i.e., Research Informatics staff, computing resources) but served to limit demand to those dedicated to research and committed to time-consuming work with electronic patient data. Annual maintenance included storage, compute, and security for an RDR plus source system data refreshes, bug fixes, and 40 hours of custom SQL development. Additional projects beyond 40 hours required a new financial commitment and scope of work.

Investigator engagement

After obtaining approval from the ARCH Steering Committee, an investigator team regularly convened with Research Informatics through small and large group meetings to advance RDR activities. In a small group meeting, which typically occurred every two weeks, a junior faculty member and research coordinator from an investigator team met with a business analyst and project manager from Research Informatics to define requirements, review data engineering delivered by software developers (e.g., i2b2 ontologies, REDCap DDP terms, data marts), address issues such as harmonization of disparate data sources and adjudication of data disparities, and determine next steps. In a large group meeting, which typically occurred quarterly or semi-annually, the small group participants plus leadership from the investigator group and Research Informatics convened to review overall RDR progress and challenges. Both an investigator team and Research Informatics personnel agreed to pursue RDR development in phases delivered in sequence rather than in parallel. After completing project deliverables, meetings between an investigator team and Research Informatics occurred on an ad hoc basis unless a new scope of work initiated RDR expansion, at which point small and large group meetings resumed.

Results

As described in Table 1, between 2013 and 2025, 17 investigator groups at our institution implemented an RDR to support a variety of use cases. Additionally, RDR techniques enabled two major multi-institutional efforts, the NIH All of Us Research Program and PCORI ADAPTABLE study [Reference Campion, Pompea, Turner, Sholle, Cole and Kaushal44,Reference Turner, Pompea, Williams, Kraemer, Sholle and Chen45]. Some examples of scientific output supported by RDRs include but are not limited to studies in neurology [Reference Kamel, Okin, Merkler, Navi, Campion and Devereux46,Reference Barbour, Hesdorffer, Tian, Yozawitz, McGoldrick and Wolf47], mental health [Reference Deferio, Levin, Cukor, Banerjee, Abdulrahman and Sheth48,Reference Adekkanattu, Sholle, DeFerio, Pathak, Johnson and Campion49], vaccine safety [Reference Son, Riley, Staniczenko, Cron, Yen and Thomas50], COVID-19 [Reference Stringer, Labar, Geleris, Sholle, Berlin and McGroder51,Reference Butler, Mozsary, Meydan, Foox, Rosiene and Shaiber52], pulmonary critical care [Reference Schenck, Hoffman, Cusick, Kabariti, Sholle and Campion53,Reference Schenck, Hoffman, Oromendia, Sanchez, Finkelsztein and Hong54], and myeloproliferative neoplasms [Reference Sholle, Krichevsky, Scandura, Sosner and Campion55Reference Krichevsky, Sholle, Adekkanattu, Abedian, Ouseph and Taylor57], with dozens of additional abstracts and posters addressing diverse additional clinical areas. In addition to supporting specific disease areas, RDRs also afforded students the opportunity to collaborate with us and with WCM clinicians on impactful papers that applied informatics approaches to challenging biomedical questions [Reference Fu, Sholle, Krichevsky, Scandura and Campion58Reference Yin, Guo, Sholle, Rajan, Alshak and Choi60]. 
Notably, one research coordinator, whose frequent i2b2 use led his principal investigator to launch an RDR and who collaborated actively in RDR development, subsequently completed a doctoral program in biomedical informatics. On GitHub (https://github.com/wcmc-research-informatics/custom-rdr) we have uploaded RDR resources including an IRB protocol template, application form, project management template, process flowchart, and data request template.

Table 1. Research data repository (RDR) activities by investigator group

Grants enabled investigators to create RDRs, and RDRs enabled investigators to obtain new grants. For example, an NIH K23 enabled a junior faculty member to launch a nephrology-focused RDR, and an RDR enabled a multidisciplinary team of health informatics and psychiatry investigators to secure an NIH R01 [Reference Cusick, Adekkanattu, Campion, Sholle, Myers and Banerjee61]. Along with enabling financial support from private foundations for research in pediatrics [Reference Pan, Wu, Weiner and Grinspan62] as well as myeloproliferative neoplasms [Reference Gazda, Pan, Erdos, Abu-Zeinah, Racanelli and Horn63,Reference Erdos, Alshareef, Silver, Scandura and Abu-Zeinah64], investigators have used RDRs to receive internal awards from WCM’s cancer center and other entities.

To support new and varied RDR use cases, SUPER expanded to include data from Epic, Allscripts SCM, CompuRecord, Xcelera, Standard Molecular, FreezerPro, and Genoptix, as well as a host of additional ancillary and legacy systems. To encourage use of RDR assets, the WCM Data Catalog made descriptions of data marts available to institutional investigators.

Discussion

We have deployed seventeen custom RDRs, addressing research areas ranging from pediatric behavioral health to myeloproliferative neoplasms, that provide a regulatory and conceptual framework for investigators to engage with RWD from EHR systems. RDRs have enabled thousands of queries against underlying data and have resulted in dozens of publications in peer-reviewed journals as well as extramural funding. Custom RDRs represent one approach institutions can consider to address unique challenges in secondary use of EHR data. However, this approach has developed over time, and challenges have led to shifts in our methodology as we adjusted to researchers’ priorities.

In 2021, the National Institutes of Health (NIH) Clinical and Translational Science Award (CTSA) Steering Committee charged the EDW4R Working Group, which aimed to determine best practices for supporting investigators with electronic patient data, with the following:

The informatics community has done an outstanding job of building capabilities, but the usability of the platforms for the majority of investigators is a huge barrier. The biggest opportunity today is to create tools and workflows that non-informaticians can master with modest effort so that our bottleneck in informatics is relieved.

Our experience suggests that “tools and workflows” alone cannot enable use of RWD by non-informaticians. Rather, we have observed that team science among clinicians, biostatisticians, and informaticians is most critical to engagement with RWD [Reference Campion, Craven, Dorr, Bernstam and Knosp3]. The self-service components of the RDR, while capable of being mastered with moderate effort, were among those that saw the least utilization. As described in an earlier analysis of local i2b2 usage [Reference Turner, Pompea, Williams, Kraemer, Sholle and Chen45], the custom ontology items we spent considerable time developing, which theoretically allowed investigators to run self-service queries and “alleviate the bottleneck,” saw little to no use despite extensive efforts to engage investigators, train staff, and raise awareness of features.

These findings are consistent with observations in other settings. Our experience in an AMC supporting study teams with RWD through custom RDRs parallels the experience of informaticians in industry with respect to the value of building infrastructure versus delivering data and analysis. Reflecting on supporting pharmaceutical industry colleagues with RWD, OHDSI community leader Patrick Ryan noted the following [4]:

[W]hile everyone says they want to ‘generate real-world evidence’ and “conduct observational analyses,” what it [sic] became apparent to me is that for the vast majority of those people, what they really want is to “consume real-world evidence” and “receive the results of observational analyses.” They want to pose questions and get answers, but they don’t want to do the work between Q & A. The difference between being an evidence producer and an evidence consumer is quite important in how you perceive your role and responsibility. And it has nothing to do with the tool itself, it has to do with lack of training in epidemiologic principles, lack of in-depth knowledge of the source data, lack of statistical intuition for non-randomized trials, and more important than anything else, LACK OF DEDICATED TIME.

Our experience developing and maintaining custom RDRs reflects these sentiments. Users of the repositories – in many cases, clinical leaders who had invested financial support in the development of the resource – liked in theory the idea of having a “website to visit” to see how many patients met certain criteria, but in reality expected statisticians and/or data analysts to proactively present figures and summary statistics, elicit requirements and definitions, and execute ad hoc queries to address specific and highly complex clinical questions.

To address these requirements, we learned over time to emphasize the importance of engaging all stakeholders involved in a study from the outset to ensure that data were extracted and transformed with maximum efficiency. This was particularly critical with respect to biostatistical analysis: early RDRs defined by close collaboration between clinicians and informatics staff, without involvement from the biostatisticians who would ultimately analyze the extracted data, often required extensive reconfiguration to accommodate biostatistical workflows. Engaging biostatisticians also afforded the opportunity to coach clinicians to conceptualize patient data on a spectrum, with raw, untransformed EHR data on one side of the continuum and a manicured flat file ready for statistical analysis on the other (Figure 2).

Figure 2. Spectrum of transformation of real-world data from electronic health record systems to enable analytics. OMOP = Observational Medical Outcomes Partnership; CDM = common data model.

We sought to emphasize through the notion of this spectrum that data are available to investigators free of charge, but that transformation and thoughtful interrogation require skills and money. Investigators with maximal support from dedicated biostatisticians capable of generating flat files themselves fell more on the raw side of the spectrum and sought to complete their repository with raw data from additional sources. Investigators earlier in their careers and/or with limited biostatistical resources were instead encouraged to request a more transformed data set to both address a specific research question and cultivate the skillset that might leave them better prepared to engage with additional, more raw data sets such as an instance of the OMOP CDM. The majority of these conversations took place within the “analysis” component (i.e., SQL Server) of the RDR; the “discovery” arm (i.e., i2b2) was quickly supplanted, as self-service web client queries struggled to express complex clinical logic that was easily represented in a SQL query against an underlying database, and the “collection” arm required little custom work, as investigators were responsible for defining their own data capture instruments. Interaction between the arms of the repository also led to further unexpected consolidation, as the availability of rows-and-columns EHR data sets obviated the need to ingest the same data into a REDCap data capture instrument. The “team science” component of the RDR afforded further opportunities to streamline effort. For example, rather than import lab values and demographics into REDCap and configure FHIR ingestion pipelines, we advised biostatisticians to export REDCap data and merge it with EHR data at the point of statistical analysis.
Although we lack detailed records of RDR use over time, we observed that when RDRs enabled partnerships among clinicians, biostatisticians, and informaticians that allowed each role to “practice at the top of their license,” time to science accelerated and IT resources were used efficiently.
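The merge-at-analysis pattern described above – exporting REDCap data and joining it with an EHR extract only at the point of statistical analysis – can be sketched with standard-library Python. All field names here (mrn, symptom_score, the lab columns) are hypothetical illustrations, not the actual instruments or warehouse tables used at our institution.

```python
import csv
import io

# Hypothetical REDCap export: one row per enrolled participant.
redcap_csv = """record_id,mrn,enrollment_date,symptom_score
1,100234,2021-03-01,7
2,100981,2021-03-04,3
"""

# Hypothetical EHR extract keyed on the same medical record number (mrn).
ehr_csv = """mrn,last_creatinine,last_hemoglobin
100234,1.1,13.2
100981,0.9,11.8
"""

def rows(text):
    """Parse a CSV export into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

# Index the EHR extract by patient identifier for one lookup per participant.
ehr_by_mrn = {row["mrn"]: row for row in rows(ehr_csv)}

# Merge at analysis time rather than ingesting EHR data back into REDCap
# through a FHIR pipeline.
merged = [
    {**redcap_row, **ehr_by_mrn.get(redcap_row["mrn"], {})}
    for redcap_row in rows(redcap_csv)
]
```

In practice a biostatistician would perform the equivalent join in R or SAS against full exports; the point of the pattern is that the data capture system never needs to store the EHR variables.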

Along with skills and technology, time from investigators was a major factor in RDR use. In some cases [Reference Schenck, Hoffman, Cusick, Kabariti, Sholle and Campion53,Reference Schenck, Hoffman, Oromendia, Sanchez, Finkelsztein and Hong54], RDRs had lead investigators who were able not only to dedicate protected time to collaboration with biostatisticians but also to write code themselves; scientific output emerged more quickly, and analytic yield was higher (i.e., more complex studies published/presented in more formal venues). In other cases, researchers with less protected time to engage with data and/or less access to statisticians capable of performing analyses relied more on self-service tools like i2b2, resulting in fewer large-scale retrospective observational studies but more feasibility assessments for prospective studies. This was particularly notable in disease areas with relatively low prevalence/incidence, where the primary mode of evidence production was prospective randomized controlled trials rather than large-scale retrospective observational studies. For some of these RDRs, investigators chose to forgo continued custom efforts in favor of existing solutions available at the institutional level that did not require a financial commitment, such as i2b2 and REDCap, and that enabled investigator-initiated trials as well as manual chart reviews. RDRs originally initiated by clinicians with plans to conduct robust retrospective observational research programs await the availability of colleagues with protected time and expertise to fully leverage their capacities.

For each of the three workflows we aimed to support – discovery, collection, and analysis – we provisioned a specific tool: i2b2, REDCap, and Microsoft SQL Server, respectively. Because it uses off-the-shelf components and focuses on customization and transformation of raw data rather than on tool development, this approach makes the tools interchangeable. For example, institutions wishing to invest more effort in the platform provided by the OHDSI consortium may wish to support discovery workflows through an instance of ATLAS or Leaf configured for OMOP. Similarly, REDCap may be replaced by other industry-standard electronic CRF platforms, and SQL Server could be replaced by any database management system.
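To make the interchangeability of the “analysis” arm concrete, the sketch below expresses one piece of cohort logic – a diagnosis followed within 90 days by an elevated lab result – as plain SQL, run here against an in-memory SQLite database purely for illustration. The table names, codes, and threshold are hypothetical; a production RDR would issue a comparable query to SQL Server or any other engine.

```python
import sqlite3

# Illustrative schema only; real RDRs use site-specific tables and columns.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE diagnosis (mrn TEXT, icd10 TEXT, dx_date TEXT);
CREATE TABLE lab (mrn TEXT, loinc TEXT, value REAL, lab_date TEXT);
INSERT INTO diagnosis VALUES
  ('100234', 'E11.9', '2021-01-15'),
  ('100981', 'I10',   '2021-02-02');
INSERT INTO lab VALUES
  ('100234', '4548-4', 8.1, '2021-02-01'),
  ('100981', '4548-4', 5.9, '2021-02-20');
""")

# Cohort logic that is awkward in a self-service web client but direct in SQL:
# a type 2 diabetes diagnosis followed within 90 days by an elevated A1c.
cohort = con.execute("""
SELECT DISTINCT d.mrn
FROM diagnosis d
JOIN lab l
  ON l.mrn = d.mrn
 AND l.loinc = '4548-4'
 AND l.value > 6.5
 AND julianday(l.lab_date) - julianday(d.dx_date) BETWEEN 0 AND 90
WHERE d.icd10 = 'E11.9'
""").fetchall()
```

Because the query is standard SQL, swapping the database engine changes the connection line, not the clinical logic.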

While the approach we describe here is modular and extensible, it has limitations. Charging fees encourages investigator groups to remain engaged with the process, but the underlying informatics work – and, perhaps more importantly, the IT and regulatory infrastructure required for it to be feasible – requires extensive institutional subsidy. Although we recovered startup and maintenance costs from investigators, estimates of partial staff efforts over time – 3–5 engineers, 4–6 analysts, and 1–2 project managers, among others – as well as of technology infrastructure [Reference Sholle, Kabariti, Johnson, Leonard, Pathak and Varughese37] are incomplete. As described elsewhere, measuring the impact of informatics infrastructure on publications and grants remains an unsolved problem [Reference Campion, Craven, Dorr, Bernstam and Knosp3]. Future work will address quantification of the true cost and effect of RDRs. Low-resource settings may face challenges provisioning staff, and studies have demonstrated a need for institutional support beyond grants to enable use of RWD from EHR systems [Reference Campion, Craven, Dorr, Bernstam and Knosp3]. Future studies can investigate workforce development for thoughtful interrogation of EHR data [Reference Labkoff, Quintana and Rozenblit5,Reference Lazem and Sheikhtaheri65,Reference Goriacko, Mirhaji, John, Parimi, Henninger and Soby66]. Although our approach to custom RDRs seeks to make RWD from EHR systems more accessible to investigators, we did not apply large language models (LLMs) [Reference Lee, Jain, Chen, Ono, Biswas and Rudas67]. Future work can address the use of LLMs incorporating structured and unstructured data to address needs across AMCs.

Conclusion

Custom RDRs provided a pathway to engage different investigator groups to leverage RWD from EHR systems for activities ranging from retrospective observational analyses to prospective identification of cohorts of eligible patients for clinical trials to the validation and rollout of predictive models. Other institutions seeking to support investigators at scale may wish to learn from our experience deploying this approach.

Acknowledgments

The authors thank Joseph Kabariti, John P. Leonard, Stephen B. Johnson, and Jyotishman Pathak.

Author contributions

Thomas R. Campion Jr.: Conceptualization, Funding acquisition, Methodology, Supervision, and Writing – original draft; Evan T. Sholle: Conceptualization, Formal analysis, Methodology, and Writing – original draft; Xiaobo Fuld: Conceptualization and Methodology; Cindy Chen: Conceptualization and Writing – original draft; Marcos A. Davila: Conceptualization and Methodology; Vinay I. Varughese: Conceptualization, Funding acquisition, and Methodology; Curtis L. Cole: Conceptualization, Funding acquisition, Methodology, and Writing – original draft.

Funding statement

This study received support from the National Institutes of Health National Center for Advancing Translational Sciences through grant number UL1TR002384 as well as support from the Joint Clinical Trials Office of Weill Cornell Medicine and NewYork-Presbyterian.

Competing interests

The authors declare none.

References

Liaw, S-T, Guo, JGN, Ansari, S, et al. Quality assessment of real-world data repositories across the data life cycle: a literature review. J Am Med Inform Assoc. 2021;28:1591–1599.
Bian, J, Lyu, T, Loiacono, A, et al. Assessing the practice of data quality evaluation in a national clinical data research network through a systematic scoping review in the era of real-world data. J Am Med Inform Assoc. 2020;27:1999–2010.
Campion, TR, Craven, CK, Dorr, DA, Bernstam, EV, Knosp, BM. Understanding enterprise data warehouses to support clinical and translational research: impact, sustainability, demand management, and accessibility. J Am Med Inform Assoc. 2024;31:1522–1528.
How Do You Support OHDSI Tools? - Implementers - OHDSI Forums [Internet]. (https://forums.ohdsi.org/t/how-do-you-support-ohdsi-tools/3813/2) Accessed March 8, 2025.
Labkoff, SE, Quintana, Y, Rozenblit, L. Identifying the capabilities for creating next-generation registries: a guide for data leaders and a case for “registry science”. J Am Med Inform Assoc. 2024;31:1001–1008. doi: 10.1093/jamia/ocae024.
Marwaha, JS, Downing, M, Halamka, J, et al. Mobilizing data during a crisis: building rapid evidence pipelines using multi-institutional real world data. Healthcare. 2024;12:100738.
Elkin, PL, Lindsell, C, Facelli, J, et al. Data science and artificial intelligence in biology, health, and healthcare. J Clin Transl Sci. 2025;9:e56.
Hersh, WR, Cimino, J, Payne, PRO, et al. Recommendations for the use of operational electronic health record data in comparative effectiveness research. EGEMS (Wash DC). 2013;1:1018.
Campion, TR, Craven, CK, Dorr, DA, Knosp, BM. Understanding enterprise data warehouses to support clinical and translational research. J Am Med Inform Assoc. 2020;27:1352–1358.
Knosp, BM, Craven, CK, Dorr, DA, Bernstam, EV, Campion, TR. Understanding enterprise data warehouses to support clinical and translational research: enterprise information technology relationships, data governance, workforce, and cloud computing. J Am Med Inform Assoc. 2022;29:671–676.
Baghal, A, Zozus, M, Baghal, A, Al-Shukri, S, Prior, F. Factors associated with increased adoption of a research data warehouse. Stud Health Technol Inform. 2019;257:31–35.
Danciu, I, Cowan, JD, Basford, M, et al. Secondary use of clinical data: the Vanderbilt approach. J Biomed Inform. 2014;52:28–35.
Chute, CG, Beck, SA, Fisk, TB, Mohr, DN. The enterprise data trust at Mayo Clinic: a semantically integrated warehouse of biomedical data. J Am Med Inform Assoc. 2010;17:131–135.
Kamal, J, Liu, J, Ostrander, M, et al. Information warehouse – a comprehensive informatics platform for business, clinical, and research applications. AMIA Annu Symp Proc. 2010;2010:452–456.
Lowe, HJ, Ferris, TA, Hernandez, PM, Weber, SC. STRIDE – an integrated standards-based translational research informatics platform. AMIA Annu Symp Proc. 2009;2009:391–395.
Wade, TD, Hum, RC, Murphy, JR. A dimensional bus model for integrating clinical and research data. J Am Med Inform Assoc. 2011;18 Suppl 1:i96–i102.
Starren, JB, Winter, AQ, Lloyd-Jones, DM. Enabling a learning health system through a unified enterprise data warehouse: the experience of the Northwestern University Clinical and Translational Sciences (NUCATS) Institute. Clin Transl Sci. 2015;8:269–271.
Mosa, ASM, Yoo, I, Apathy, NC, Ko, KJ, Parker, JC. Secondary use of clinical data to enable data-driven translational science with trustworthy access management. Mo Med. 2015;112:443–448.
Waitman, LR, Warren, JJ, Manos, EL, Connolly, DW. Expressing observations from electronic medical record flowsheets in an i2b2 based clinical data repository to support research and quality improvement. AMIA Annu Symp Proc. 2011;2011:1454–1463.
Cimino, JJ, Ayres, EJ, Remennik, L, et al. The National Institutes of Health’s Biomedical Translational Research Information System (BTRIS): design, contents, functionality and experience to date. J Biomed Inform. 2014;52:11–27.
Kortüm, KU, Müller, M, Kern, C, et al. Using electronic health records to build an ophthalmologic data warehouse and visualize patients’ data. Am J Ophthalmol. 2017;178:84–93.
Hall, ES, Greenberg, JM, Muglia, LJ, et al. Implementation of a regional perinatal data repository from clinical and billing records. Matern Child Health J. 2018;22:485–493.
Hruby, GW, McKiernan, J, Bakken, S, Weng, C. A centralized research data repository enhances retrospective outcomes research capacity: a case report. J Am Med Inform Assoc. 2013;20:563–567.
Pennington, JW, Ruth, B, Italia, MJ, et al. Harvest: an open platform for developing web-based biomedical data discovery and reporting applications. J Am Med Inform Assoc. 2014;21:379–383.
Natter, MD, Quan, J, Ortiz, DM, et al. An i2b2-based, generalizable, open source, self-scaling chronic disease registry. J Am Med Inform Assoc. 2013;20:172–179.
Gallagher, SA, Smith, AB, Matthews, JE, et al. Roadmap for the development of the University of North Carolina at Chapel Hill Genitourinary OncoLogy Database – UNC GOLD. Urol Oncol. 2014;32:32.e1–32.e9.
Johnson, SB. Generic data modeling for clinical repositories. J Am Med Inform Assoc. 1996;3:328–339.
Horvath, MM, Winfield, S, Evans, S, Slopek, S, Shang, H, Ferranti, J. The DEDUCE Guided Query tool: providing simplified access to clinical data for research and quality improvement. J Biomed Inform. 2011;44:266–276.
Murphy, SN, Weber, G, Mendis, M, et al. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J Am Med Inform Assoc. 2010;17:124–130.
Hripcsak, G, Duke, JD, Shah, NH, et al. Observational health data sciences and informatics (OHDSI): opportunities for observational researchers. Stud Health Technol Inform. 2015;216:574–578.
Harris, PA, Taylor, R, Minor, BL, et al. The REDCap consortium: building an international community of software platform partners. J Biomed Inform. 2019;95:103208.
Kohane, IS, Churchill, SE, Murphy, SN. A translational engine at the national scale: informatics for integrating biology and the bedside. J Am Med Inform Assoc. 2012;19:181–185.
MacKenzie, SL, Wyatt, MC, Schuff, R, Tenenbaum, JD, Anderson, N. Practices and perspectives on building integrated data repositories: results from a 2010 CTSA survey. J Am Med Inform Assoc. 2012;19:e119–e124.
DiLaura, R, Turisco, F, McGrew, C, Reel, S, Glaser, J, Crowley, WF. Use of informatics and information technologies in the clinical research enterprise within US academic medical centers: progress and challenges from 2005 to 2007. J Investig Med. 2008;56:770–779.
Obeid, JS, Tarczy-Hornoch, P, Harris, PA, et al. Sustainability considerations for clinical and translational research informatics infrastructure. J Clin Transl Sci. 2018;2:267–275.
Campion, TR, Sholle, ET, Pathak, J, Johnson, SB, Leonard, JP, Cole, CL. An architecture for research computing in health to support clinical and translational investigators with electronic patient data. J Am Med Inform Assoc. 2022;29:677–685.
Sholle, ET, Kabariti, J, Johnson, SB, Leonard, JP, Pathak, J, Varughese, VI. Secondary use of patients’ electronic records (SUPER): an approach for meeting specific data needs of clinical and translational researchers. AMIA Annu Symp Proc. 2017;2017:1581–1588.
Patterson, OV, Freiberg, MS, Skanderson, M, et al. Unlocking echocardiogram measurements for heart disease research through natural language processing. BMC Cardiovasc Disord. 2017;17:151.
Harris, PA, Taylor, R, Thielke, R, Payne, J, Gonzalez, N, Conde, JG. Research electronic data capture (REDCap): a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform. 2009;42:377–381.
Campion, TR, Sholle, ET, Davila, MA. Generalizable middleware to support use of REDCap Dynamic Data Pull for integrating clinical and research data. AMIA Jt Summits Transl Sci Proc. 2017;2017:76–81.
Sholle, ET, Cusick, M, Davila, MA, Kabariti, J, Flores, S, Campion, TR. Characterizing basic and complex usage of i2b2 at an academic medical center. AMIA Jt Summits Transl Sci Proc. 2020;2020:589–596.
Sholle, ET, Davila, MA, Kabariti, J, Schwartz, JZ, Varughese, VI, Cole, CL, et al. A scalable method for supporting multiple patient cohort discovery projects using i2b2. J Biomed Inform. 2018;84:179–183.
Boyd, AD, Saxman, PR, Hunscher, DA, Smith, KA, Morris, TD, Kaston, M, et al. The University of Michigan Honest Broker: a web-based service for clinical and translational research and practice. J Am Med Inform Assoc. 2009;16:784–791.
Campion, TR, Pompea, ST, Turner, SP, Sholle, ET, Cole, CL, Kaushal, R. A method for integrating healthcare provider organization and research sponsor systems and workflows to support large-scale studies. AMIA Jt Summits Transl Sci Proc. 2019;2019:648–655.
Turner, SP, Pompea, ST, Williams, KL, Kraemer, DA, Sholle, ET, Chen, C, et al. Implementation of informatics to support the NIH All of Us Research Program in a healthcare provider organization. AMIA Jt Summits Transl Sci Proc. 2019;2019:602–609.
Kamel, H, Okin, PM, Merkler, AE, Navi, BB, Campion, TR, Devereux, RB, et al. Relationship between left atrial volume and ischemic stroke subtype. Ann Clin Transl Neurol. 2019;6:1480–1486.
Barbour, K, Hesdorffer, DC, Tian, N, Yozawitz, EG, McGoldrick, PE, Wolf, S, et al. Automated detection of sudden unexpected death in epilepsy risk factors in electronic medical records using natural language processing. Epilepsia. 2019;60:1209–1220.
Deferio, JJ, Levin, TT, Cukor, J, Banerjee, S, Abdulrahman, R, Sheth, A, et al. Using electronic health records to characterize prescription patterns: focus on antidepressants in nonpsychiatric outpatient settings. JAMIA Open. 2018;1:233–245.
Adekkanattu, P, Sholle, ET, DeFerio, J, Pathak, J, Johnson, SB, Campion, TR. Ascertaining depression severity by extracting Patient Health Questionnaire-9 (PHQ-9) scores from clinical notes. AMIA Annu Symp Proc. 2018;2018:147–156.
Son, M, Riley, LE, Staniczenko, AP, Cron, J, Yen, S, Thomas, C, et al. Nonadjuvanted bivalent respiratory syncytial virus vaccination and perinatal outcomes. JAMA Netw Open. 2024;7:e2419268.
Stringer, WS, Labar, AS, Geleris, JD, Sholle, EV, Berlin, DA, McGroder, CM, et al. Three hospitalized non-critical COVID-19 subphenotypes and change in intubation or death over time: a latent class analysis with external and longitudinal validation. PLoS ONE. 2025;20:e0316434.
Butler, D, Mozsary, C, Meydan, C, Foox, J, Rosiene, J, Shaiber, A, et al. Shotgun transcriptome, spatial omics, and isothermal profiling of SARS-CoV-2 infection reveals unique host responses, viral diversification, and drug interactions. Nat Commun. 2021;12:1660.
Schenck, EJ, Hoffman, KL, Cusick, M, Kabariti, J, Sholle, ET, Campion, TR. Critical carE Database for Advanced Research (CEDAR): an automated method to support intensive care units with electronic health record data. J Biomed Inform. 2021;118:103789.
Schenck, EJ, Hoffman, K, Oromendia, C, Sanchez, E, Finkelsztein, EJ, Hong, KS, et al. A comparative analysis of the respiratory subscore of the sequential organ failure assessment scoring system. Ann Am Thorac Soc. 2021;18:1849–1860.
Sholle, E, Krichevsky, S, Scandura, J, Sosner, C, Campion, TR. Lessons learned in the development of a computable phenotype for response in myeloproliferative neoplasms. IEEE Int Conf Healthc Inform. 2018;2018:328–331.
Abu-Zeinah, G, Krichevsky, S, Cruz, T, Hoberman, G, Jaber, D, Savage, N, et al. Interferon-Alpha for treating polycythemia vera yields improved myelofibrosis-free and overall survival. Leukemia. 2021;35:2592–2601.
Krichevsky, S, Sholle, ET, Adekkanattu, PM, Abedian, S, Ouseph, M, Taylor, E, et al. Automated information extraction from unstructured hematopathology reports to support response assessment in myeloproliferative neoplasms. Methods Inf Med. 2024;63:176–182.
Fu, JT, Sholle, E, Krichevsky, S, Scandura, J, Campion, TR. Extracting and classifying diagnosis dates from clinical notes: a case study. J Biomed Inform. 2020;110:103569.
Michael, CL, Sholle, ET, Wulff, RT, Roboz, GJ, Campion, TR. Mapping local biospecimen records to the OMOP common data model. AMIA Jt Summits Transl Sci Proc. 2020;2020:422–429.
Yin, AL, Guo, WL, Sholle, ET, Rajan, M, Alshak, MN, Choi, JJ, et al. Comparing automated vs. manual data collection for COVID-specific medications from electronic health records. Int J Med Inform. 2022;157:104622.
Cusick, M, Adekkanattu, P, Campion, TR, Sholle, ET, Myers, A, Banerjee, S, et al. Using weak supervision and deep learning to classify clinical notes for identification of current suicidal ideation. J Psychiatr Res. 2021;136:95–102.
Pan, S, Wu, A, Weiner, M, Grinspan, ZM. Development and evaluation of computable phenotypes in pediatric epilepsy: 3 cases. J Child Neurol. 2021;36:990–997.
Gazda, AJ, Pan, D, Erdos, K, Abu-Zeinah, G, Racanelli, A, Horn, EM, et al. High pulmonary hypertension risk by echocardiogram shortens survival in polycythemia vera. Blood Adv. 2025;9:1320–1329.
Erdos, K, Alshareef, A, Silver, RT, Scandura, JM, Abu-Zeinah, G. Outcomes for ruxolitinib only versus combination with interferon in treating patients with myelofibrosis. Blood Neoplasia. 2025;2:100082.
Lazem, M, Sheikhtaheri, A. Barriers and facilitators for the implementation of health condition and outcome registry systems: a systematic literature review. J Am Med Inform Assoc. 2022;29:723–734.
Goriacko, P, Mirhaji, P, John, J, Parimi, P, Henninger, EM, Soby, S, et al. Incorporating Real-World Data Research in Training First-Year Medical Students Using OHDSI OMOP and Atlas Tools. (https://www.ohdsi.org/wp-content/uploads/2023/10/goriacko-pavel_IncorporatingRWEResearchTrainingFirst-YearMedicalStudentsUsingOMOPAtlas_2023-Selvin-Soby.pdf) Accessed December 1, 2025.
Lee, SA, Jain, S, Chen, A, Ono, K, Biswas, A, Rudas, Á, et al. Clinical decision support using pseudo-notes from multiple streams of EHR data. npj Digital Med. 2025;8:394.

Figure 1. A custom research data repository (RDR) aggregates data from disparate sources, transforms data into research-ready formats, and supports three workflows using off-the-shelf tools.


Table 1. Research data repository (RDR) activities by investigator group
