
EHRchitect: An open-source software tool for medical event sequences data extraction from Electronic Health Records

Published online by Cambridge University Press:  26 March 2025

Kostiantyn Botnar*
Affiliation:
Department of Pharmacology and Toxicology, University of Texas Medical Branch at Galveston, Galveston, TX, USA
Justin T. Nguyen
Affiliation:
Department of Pharmacology and Toxicology, University of Texas Medical Branch at Galveston, Galveston, TX, USA
Madison G. Farnsworth
Affiliation:
Department of Human Pathophysiology and Translational Medicine, University of Texas Medical Branch at Galveston, Galveston, TX, USA
George Golovko
Affiliation:
Department of Pharmacology and Toxicology, University of Texas Medical Branch at Galveston, Galveston, TX, USA
Kamil Khanipov
Affiliation:
Department of Pharmacology and Toxicology, University of Texas Medical Branch at Galveston, Galveston, TX, USA
Corresponding author: K. Botnar; Email: kobotnar@utmb.edu

Abstract

Background:

Analysis of Electronic Health Records (EHR) is pivotal in advancing medical research. Numerous real-world EHR data providers offer access through exported datasets. Although such exports open up substantial research possibilities, the data require quality control and restructuring before meaningful analysis. Particular challenges arise in the analysis of medical event sequences (e.g., diagnoses or procedures), which provides critical insight into the progression of conditions, treatments, and outcomes. Identifying causal relationships, patterns, and trends requires a more complex approach to data mining and preparation.

Methods:

This paper introduces EHRchitect, an application written in Python that addresses these quality control challenges by automating dataset transformation, creating a clean, formatted, and optimized MySQL database (DB), and extracting sequential data according to the user’s configuration.

Results:

The tool creates a clean, formatted, and optimized DB, enabling medical event sequence data extraction according to users’ study configuration. Event sequences encompass patients’ medical events in specified orders and time intervals. The extracted data are presented as distributed Parquet files, incorporating events, event transitions, patient metadata, and events metadata. The concurrent approach allows effortless scaling for multi-processor systems.

Conclusion:

EHRchitect streamlines the processing of large EHR datasets for research purposes. It facilitates extracting sequential event-based data, offering a highly flexible framework for configuring event and timeline parameters. The tool delivers temporal characteristics, patient demographics, and event metadata to support comprehensive analysis. The developed tool significantly reduces the time required for dataset acquisition and preparation by automating data quality control and simplifying event extraction.

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial licence (https://creativecommons.org/licenses/by-nc/4.0), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of Association for Clinical and Translational Science

Figure 1. EHRchitect database preparation pipeline. Comma-separated values (CSV) files with raw data, packed in a ZIP archive, are downloaded from the URL the user provides. Each CSV file is cleaned and converted to the EHRchitect database format. The program creates a new MySQL database using the MySQL server credentials provided by the user and uploads the transformed data, followed by database optimization.


Figure 2. EHRchitect data extraction pipeline. The user describes a study in a JavaScript Object Notation (JSON) file and passes it to the program. EHRchitect selects data according to all specified inclusion and exclusion criteria and time restrictions, and delivers the resulting records with metadata and temporal characteristics in distributed Parquet files. The configuration file describes a study as a sequence of events with specified time constraints. Each event is defined by a list of codes (e.g., ICD-10, RxNorm) and a category (e.g., diagnosis, medication).


Figure 3. Study configuration example. A – an example of a study schema. The study explores the impact of amoxicillin treatment on sepsis outcomes among patients with severe burn wounds (SBW). SBW is defined through a set of ICD-10 codes (“T31.7,” “T31.8,” “T31.9”). Amoxicillin is defined through the RxNorm code “723.” Sepsis is defined through the ICD-10 code “A41.9.” Study temporal parameters: amoxicillin should be prescribed within seven days after the SBW. The outcome should appear within one month after the treatment, or within one month after the SBW in the untreated cohort. Only records from 2010 to 2020 are considered. B – the study configuration file.
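As a rough illustration of how the treated arm of this study could be written down as a JSON document, the sketch below builds one from a Python dictionary. Every key name here ("date_range", "events", "after", "max_days", and so on) is hypothetical and chosen only for readability; the actual EHRchitect schema is the one shown in panel B. Treating "one month" as 30 days is likewise an assumption.

```python
import json

# Hypothetical study description: key names are illustrative only and are
# NOT the EHRchitect schema (see panel B for the real configuration file).
study = {
    "name": "amoxicillin_sbw_sepsis",
    "date_range": {"start": "2010-01-01", "end": "2020-12-31"},
    "events": [
        {
            "id": "sbw",
            "category": "diagnosis",
            "codes": ["T31.7", "T31.8", "T31.9"],  # ICD-10
        },
        {
            "id": "amoxicillin",
            "category": "medication",
            "codes": ["723"],  # RxNorm
            "after": "sbw",
            "max_days": 7,  # prescribed within seven days after the SBW
        },
        {
            "id": "sepsis",
            "category": "diagnosis",
            "codes": ["A41.9"],  # ICD-10
            "after": "amoxicillin",
            "max_days": 30,  # "one month", approximated as 30 days (assumption)
        },
    ],
}

# Serialize to the text that would be saved as a .json file.
config_text = json.dumps(study, indent=2)
```

The untreated cohort, in which sepsis is counted from the SBW rather than from the treatment, would need a second event chain and is left out of this sketch.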


Figure 4. The pulmonary embolism treatment research. A. Schematic research configuration. B. Example of the EHRchitect configuration file for the research.


Figure 5. Description of exclusion criteria in the configuration file. All exclusion criteria are described as events under the “exclude” object of the parent event to which they apply. If the period is absent, as in the “Previous outcome cases” event, the exclusion is applied to the entire period before the parent event.
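The look-back rule in this caption can be sketched as a small check. This is an illustration under assumptions, not the EHRchitect implementation: the function name `is_excluded` and the parameter `period_days` are hypothetical, a specified period is assumed to be a look-back window ending at the parent event, and events on the same day as the parent are assumed not to trigger exclusion.

```python
from datetime import date
from typing import Iterable, Optional


def is_excluded(
    parent_date: date,
    exclusion_dates: Iterable[date],
    period_days: Optional[int] = None,
) -> bool:
    """Return True if any exclusion event precedes the parent event.

    period_days=None mirrors an absent period: the whole history before
    the parent event is searched. Otherwise only the last `period_days`
    days before the parent event are searched (assumed semantics).
    """
    for event_date in exclusion_dates:
        if event_date >= parent_date:
            continue  # only events strictly before the parent are considered
        days_before = (parent_date - event_date).days
        if period_days is None or days_before <= period_days:
            return True
    return False
```

For example, a sepsis record from 2010 would exclude a patient whose parent event is in 2015 when no period is given, but not when the look-back is limited to 365 days.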


Figure 6. Result tables. A. The patients metadata table contains the demographic parameters of all patients across the study. B. The events metadata table describes the study events. C. Each event group includes patient records selected according to its description. D. The transition table shows patient records of consecutive events that satisfy the time conditions. Columns with the suffix “_0” report the start event. Columns with the suffix “_1” report the finish event. Column “t_0” contains the number of days between the start and finish events. All tables are linked by the “patient_id” parameter. Event records are identified by the “event_id” and “code” fields.
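To make the shape of the transition table concrete, here is a minimal sketch of how such rows could be derived from two event groups. It is not the tool's code: the function `build_transitions`, the input record layout, and the exact column names "event_id_0", "code_0", "event_id_1", and "code_1" are assumptions based only on the suffix convention described in the caption, and the sample records are invented for illustration.

```python
from datetime import date


def build_transitions(start_events, finish_events, max_days):
    """Pair start and finish events of the same patient.

    A pair is kept when the finish event occurs 0..max_days days after
    the start event; t_0 holds that gap in days.
    """
    rows = []
    for start in start_events:
        for finish in finish_events:
            if start["patient_id"] != finish["patient_id"]:
                continue
            gap = (finish["date"] - start["date"]).days
            if 0 <= gap <= max_days:
                rows.append({
                    "patient_id": start["patient_id"],
                    "event_id_0": start["event_id"],
                    "code_0": start["code"],
                    "event_id_1": finish["event_id"],
                    "code_1": finish["code"],
                    "t_0": gap,
                })
    return rows


# Invented sample records: two patients with a burn diagnosis and a prescription.
starts = [
    {"patient_id": 1, "event_id": "sbw", "code": "T31.7", "date": date(2015, 3, 1)},
    {"patient_id": 2, "event_id": "sbw", "code": "T31.8", "date": date(2016, 1, 1)},
]
finishes = [
    {"patient_id": 1, "event_id": "amoxicillin", "code": "723", "date": date(2015, 3, 4)},
    {"patient_id": 2, "event_id": "amoxicillin", "code": "723", "date": date(2016, 1, 20)},
]

transitions = build_transitions(starts, finishes, max_days=7)
```

Only patient 1 yields a row here, because patient 2 received the prescription 19 days after the start event, outside the seven-day window. The nested loop is quadratic and is meant only to show the logic; a database join or grouped lookup would be the natural choice at EHR scale.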

Supplementary material: File

Botnar et al. supplementary material

File 973 Bytes