Extending outbreak investigation with machine learning and graph theory: Benefits of new tools with application to a nosocomial outbreak of a multidrug-resistant organism

Objective: From January 1, 2018, until July 31, 2020, our hospital network experienced an outbreak of vancomycin-resistant enterococci (VRE). The goal of our study was to improve existing processes by applying machine-learning and graph-theoretical methods to a nosocomial outbreak investigation. Methods: We assembled medical records generated during the first 2 years of the outbreak period (January 2018 through December 2019). We identified risk factors for VRE colonization using standard statistical methods, and we extended these with a decision-tree machine-learning approach. We then elicited possible transmission pathways by detecting commonalities between VRE cases using a graph theoretical network analysis approach. Results: We compared 560 VRE patients to 86,684 controls. Logistic models revealed predictors of VRE colonization as age (aOR, 1.4 (per 10 years), with 95% confidence interval [CI], 1.3–1.5; P < .001), ICU admission during stay (aOR, 1.5; 95% CI, 1.2–1.9; P < .001), Charlson comorbidity score (aOR, 1.1; 95% CI, 1.1–1.2; P < .001), the number of different prescribed antibiotics (aOR, 1.6; 95% CI, 1.5–1.7; P < .001), and the number of rooms the patient stayed in during their hospitalization(s) (aOR, 1.1; 95% CI, 1.1–1.2; P < .001). The decision-tree machine-learning method confirmed these findings. Graph network analysis established 3 main pathways by which the VRE cases were connected: healthcare personnel, medical devices, and patient rooms. Conclusions: We identified risk factors for being a VRE carrier, along with 3 important links with VRE (healthcare personnel, medical devices, patient rooms). Data science is likely to provide a better understanding of outbreaks, but interpretations require data maturity, and potential confounding factors must be considered.

Electronic medical records contain information relevant for outbreak investigations; consequently, by integrating the relevant data sources, we can potentially inform and improve patient screening and isolation strategies. However, this integration necessarily leads to large quantities of data, which can be difficult to analyze using standard statistical techniques. Machine-learning or "artificial intelligence" methods comprise a toolbox of approaches that have become popular for analyzing such "big data". 1,2 To date in the field of hospital epidemiology, machine-learning techniques have predominantly been used to extend existing statistical methods to provide deeper insights in the analysis of infections, infection management and outbreak detection (eg, Roth et al, 3 Luz et al, 4 and Leclère et al 5 ). Furthermore, methods based on graph theory (hereafter "network graph methods") have recently been used to identify, for example, superspreaders in community-based outbreaks. 6,7 Here, we applied a network graph approach to the largest documented outbreak of a multidrug-resistant organism (MDRO) in Switzerland, which occurred in 2018-2019 in our hospital group and affected >560 patients. 8 This particular MDRO, vancomycinresistant Enterococcus faecium (VRE) of the sequence type 796 (ST796), predominantly colonizes the gastrointestinal tract, and is known for rapid intrahospital and interhospital spread. Infections due to VRE are associated with increased mortality, morbidity and higher hospital costs. 9,10 Comprehensive literature has described the risk factors for VRE colonization including (among others) length of hospital stay, duration and type of antibiotic use, proximity to a colonized or infected patient, contact with environmental contamination, and immunosuppression or hematologic malignancy. [11][12][13][14][15] The apparent complexity and multifactorial nature of this outbreak provided motivation for using machine-learning and graph-theoretical methods to attempt to untangle and better understand these complex interactions.
Active surveillance screening is a key measure in identifying asymptomatic VRE carriage, with patient contact isolation being the standard precaution to limit further transmission. [16][17][18][19] Therefore, during this outbreak, VRE-positive patients (colonized or infected) were isolated, and a proactive "contact tracing" process was introduced. Although such contact tracing is relatively straightforward to implement, there was room for improvement and optimization of the process because transmission is not necessarily dependent only on rather limited definitions of geographical and organizational proximity.
With this study, we sought to address the following key questions: • What are the risk factors for VRE colonization?
• Which patients should be screened?
• What are the "hot spots" in terms of devices, rooms and employees where transmission may have occurred? • What is the potential benefit of a contact screening approach based on the network graph approach, compared to the traditional proximity-based contact screening?
To ease readability, throughout the document we have noted the use of standard statistical approaches versus new machinelearning-type techniques.

Methods
The outbreak occurred in a 900-bed, tertiary-care, university hospital in Bern, Switzerland. The hospital sees ∼60,000 admissions and 380,000 patient days per year, with most medical disciplines represented.
The outbreak was originally detected in January 2018 following 2 cases of VRE bloodstream infections on the oncology ward, and this consequently led to an outbreak management protocol being introduced based on international guidelines. Briefly, VRE-positive (ie, colonized or infected) patients were isolated, and a proactive proximity-based contact-tracing process was introduced, whereby people were screened if they were hospitalized in the same room and ward (and therefore potentially exposed) as a newly detected VRE-positive patient in the prior 7 days. In addition, cleaning was intensified with measures such as daily disinfection and UV light cleaning procedures (among others). An upcoming publication describes the outbreak and procedures in more detail. 20 In addition, we assembled diverse data from electronic medical records generated during the first 2 years of the outbreak (January 1, 2018-December 31, 2019), covering different aspects of medical care (Table 1). For comparisons, we labeled all patients (including children) with VRE acquisition during the outbreak period as VRE-positive (cases), and all other patients, whether tested or not, were assumed to be VRE negative (controls). Notably, a sampling and subtyping performed in 2018 revealed that 91.7% of the isolates were identified as sequence type 796; the analysis presented here pertains to all subtypes. Summary statistics of cases and controls were presented as number and percentage for categorical, and median and interquartile ranges for continuous, variables. Group differences were investigated using the χ 2 test (or variants thereof) for categorical variables and the Mann-Whitney-Wilcoxon test for continuous variables.

Standard approach
We identified risk factors for VRE colonization by fitting uni-and multivariable logistic regression models with dependent variable colonization (0 = no, 1 = yes), and independent variables those available and integrated in the project specific data warehouse (ie, those in Table 1). The most parsimonious model was found by forward selection then backward deletion using a P value of >.10 as the inclusion criterion. These analyses were performed in R version 3.5 or newer software 21 using base functions and the following packages: data.table, ggplot2, tableone, and survey.

Decision tree approach
In a second step, a machine-learning "decision tree" approach was used to identify a VRE positive individual. In decision-tree learning, a tree-like model of decisions and subsequent outcomes is chosen, which models the path from observations to conclusions. The branches of the tree represent criteria on the observations (eg, age <30 years), and the leaves represent the class labels (eg, those colonized or infected). The learning algorithm builds such a tree by choosing variables, and it learns criteria on these to split the data set most appropriately to the dependent variable, thus revealing the variables that are most predictive, as well as the tree that generates the predicted result. For our purposes, the classification tree was built with the same independent and dependent variables as the logistic regression (ie, those in Table 1).
This approach served to validate initial hypotheses of risk factors derived from the logistic regression models and provided an interpretable decision tree with decision thresholds. To mitigate the strong imbalance between cases and controls in the data set, we applied a minority class oversampling approach resulting in equal numbers of positive and negative samples. The analysis was performed using Python version 3.7 software 22 with the following packages: data processing (pyodbc, numpy, pandas, scipy), machine learning (scikit-learn), and visualization (matplotlib).

Network graph approach
In a third step, a network graph model involving the identified risk factors was developed. We identified potential transmission hotspots: healthcare employees with frequent VRE patient contact, rooms in which VRE patients were present, and medical devices used in conjunction with diagnosing and treating these patients. Interactions documented by healthcare workers were limited to those between patients and nursing employees because physician interactions are not documented at this granularity for in-patient care (which is a limitation of the data available to us). In terms of medical devices, only interactions tracked in the information technology systems were included in the analysis, and this limited the scope of this study to larger, often nonportable devices. Rooms in which patients stayed at any point during their hospitalization were included in the analysis, even if the stay was <24 hours.
The network graph analysis used eigenvector centrality to generate a daily "hotspot list", which was sent to the relevant organizational unit involved in disinfecting rooms and medical devices. Employees remained as important links in the analysis, but hardly any employees underwent VRE screening by occupational health as it was deemed voluntary at the time.
An example of a small section of the network graph is shown in Figure 1. For background information on this approach, please refer to the Supplementary Material (online).

Model validation
The outbreak occurred in many parts of the hospital, and the analysis was performed on data from the whole hospital. However, for the model validation process, we focused on the oncology ward because this ward performed screenings of all patients every week, whereas other wards did not or performed them only sporadically. This meant that for the oncology ward, we had a "ground truth" with which to compare our model predictions; that is, for each week we were able to compare the predicted VRE positive cases from the model to the observed cases.
We compared the current gold standard "proximity-based screening" and therefore nonprioritized approach, with the new prioritized list for screening patients derived from the network graph method. First, the "screening efficiency" was calculated; this represents the percentage fewer screenings compared to the gold standard. Second, the "screening benefit" was determined as the additional patients we could potentially have detected compared to the gold standard (refer to the Supplementary Material online for details).

Ethical approval
The Cantonal Ethics Committee (Bern, Project ID no. 2020-00173) approved this study.

Results
We compared 560 VRE patients to 86,684 controls from January 1, 2018, to December 31, 2019 (Table 1). Compared to the general hospital population during this period, VRE-positive patients were predominantly male (61.8% vs 50.8%), had an older median age (73 vs 58 years), had surgery during their hospitalization (67.3% vs 57.2%), and were more likely to have been in the ICU during their hospitalization (34.5% vs 8.0%).

Decision tree approach
Many of the risk factors identified in the logistic regression analysis are already well established in the literature. However, the presence of complex interactions between variables exhibited by the multiple collinearities in the fitted statistical models motivated a more in-depth investigation using machine learning methods. Decisiontree-based machine-learning methods (with a maximum depth of 6 levels) confirmed the findings from the logistic regression modelling, with the number of antibiotics (importance, 0.21), Charlson score (importance, 0.19), and number of rooms stayed in (importance, 0.15) being the 3 most important patient-associated characteristics ( Table 3). The "importance" is a metric defining the rank of this covariate in splitting the data set: the higher the importance, the earlier this covariate is used in the branching process. As an additional result, the criteria of the decision tree provide proper thresholds that split the data appropriately. Furthermore, the analysis allowed insights that were not apparent from the logistic regression; for example, those patients receiving antibiotics and those aged >55 years were at higher risk for colonization ( Supplementary Fig. S3 online).

Network graph approach
The subsequent complex network analysis established 3 possible pathways by which the 560 VRE cases are connected (although not necessarily in a causal relationship): healthcare personnel, medical devices, and patient rooms. A sample "hot spot list" for a particular day is shown in Table 4; this ordered list of medical devices, healthcare personnel, and patient rooms has been sorted in the order of the likeliness of being colonized with VRE. Depending on the situational environment, appropriate actions can be defined regarding how to isolate patients and clean the devices and rooms more frequently and/or thoroughly, or potentially also to screen the respective personnel (although the latter was not implemented in this study).

Screening efficiency and benefit
In terms of model validation, the screening efficiency was estimated to result in 40.0% fewer screenings compared to the current gold standard (95% CI, 17.3-54.6). With the new screening approach, we identified 102 (95% CI, 66-138) positive patients who were missed from the proximity based screening process and who had an equal or higher centrality as the lowest-ranked positive patient.

Discussion
We originally set out to address a number of key questions with respect to the outbreak and its investigation. We now address each of these in turn.
What are the risk factors for VRE colonization?
Using both standard statistical methods and machine learning, we identified risk factors for VRE colonization in line with those already published in the literature. Patient age, underlying diseases and severity of illness, prolonged hospitalization, surgery, and exposure to antimicrobial drugs were important factors. [23][24][25] Using 2 approaches confirming essentially the same results might be considered an inefficient use of statistical resources, but we prefer to view the 2 approaches as complementary, providing a slightly different perspective to the analysis. Although this was ultimately not particularly beneficial here, this will not always be the case.
Which patients should be screened?
Using a complex network graph analysis, we were able to further investigate 3 main pathways by which the VRE cases are connected: healthcare personnel, medical devices, and patient rooms. Interestingly, the importance of the number of rooms a patient stayed in and patient device interactions reflects recent work by Weterings et al 26 and Gouliouris et al, 27 which identified environmental contamination (and cleaning) as important factors in VRE outbreaks.
Using a network-graph approach to identify super-spreaders is not a new concept 28,29 ; to date, however, there have been few applications using medical data, and, to our knowledge, these approaches have not been applied to nosocomial outbreaks. Our centrality-based screening is theoretically related to that of Klemm et al, 30 which suggests that the dynamic influence of a node in the classical susceptible-infected-recovered (SIR) transmission model can be estimated using eigenvector centrality.
What are the "hot spots" in terms of devices, rooms and employees where transmission may have occurred?
The analysis identified probable "hot spots" based on the hospital system's human and nonhuman connections. This yielded prioritized lists of rooms and devices that might require special measures for outbreak control. What is the potential benefit of a contact screening approach based on the network-graph approach, compared to the traditional proximity-based contact screening?
We proposed potential approaches for estimating the screening efficiency and the benefits of a more focused screening approach. These could certainly be improved, and we would recommend investing adequate time in defining appropriate metrics at the project initiation stage.
In more general terms, the network-graph analysis builds on the results from the logistic regression and decision-tree methods by identifying specific patients to be screened and rooms and devices to be cleaned.
Although the network-graph method was generally successful, the lack of a fully integrated data warehouse was a considerable barrier to the speed of implementation. For example, the time and resource management system did not include information on physicians and other key personnel within the hospital, so "employee" interactions were limited to those performed by nursing healthcare professionals. This represents an example of the main (financial and otherwise) cost of implementing such an approach: an important prerequisite for fast implementation (and therefore actionable results) is the availability of a hospitalwide, fully integrated, data warehouse (refer to the "blueprint" in the Appendix online). The most time-consuming and resource-intensive part of the project was the data source integration (Appendix, step 2) and operationalization (step 7), the necessary effort for which surprised the research team ( Supplementary  Fig. S2 online). The idea to pursue "machine-learning" methods is currently popular, but this can often lead to an analysis-driven approach. With this in mind, and given that, at least anecdotally, "85% of data science projects fail," 31 one of the most important success factors was to have a clear project mandate with metrics for success prior to starting the project.
This study had several limitations. Our study was monocentric, and it could have been improved by validating against data from another hospital. Despite the limitations, even our incomplete picture provides additional valuable insights regarding specific known interactions with VRE-positive patients in the greater network displayed in our research.
In summary, we implemented a new approach to reduce unnecessary screening. The method provides the basis for a smart contact-tracing system for the next outbreak, independent of pathogen, and potentially also for data-driven outbreak monitoring. We calculated the benefits of the new method compared to the existing contact tracing. The differential can be understood as an "added value" in terms of the yield of screening, or analogously, the cost-benefit of "avoided" screening.
Finally, data science provides a better understanding of outbreaks, but interpretations should include consideration of data source maturity, the scope of included sources, and potential confounding factors.