
Use of random forest to estimate population attributable fractions from a case-control study of Salmonella enterica serotype Enteritidis infections

Published online by Cambridge University Press:  12 February 2015

W. GU*
Affiliation:
Centers for Disease Control and Prevention, Enteric Diseases Epidemiology Branch, Atlanta, GA, USA
A. R. VIEIRA
Affiliation:
Centers for Disease Control and Prevention, Enteric Diseases Epidemiology Branch, Atlanta, GA, USA
R. M. HOEKSTRA
Affiliation:
Centers for Disease Control and Prevention, Division of Foodborne, Waterborne, and Environmental Diseases, Atlanta, GA, USA
P. M. GRIFFIN
Affiliation:
Centers for Disease Control and Prevention, Enteric Diseases Epidemiology Branch, Atlanta, GA, USA
D. COLE
Affiliation:
Centers for Disease Control and Prevention, Enteric Diseases Epidemiology Branch, Atlanta, GA, USA
* Author for correspondence: Dr W. Gu, Enteric Diseases Epidemiology Branch, Division of Foodborne, Waterborne, and Environmental Diseases, National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, 1600 Clifton Road NE, Atlanta, GA 30333, USA. (Email: vhg8@cdc.gov)

Summary

To design effective food safety programmes, we need to estimate from case-control studies how many sporadic foodborne illnesses are caused by specific food sources. Logistic regression has substantive limitations for analysing structured questionnaire data with numerous exposures and missing values. We adapted random forest to analyse data from a case-control study of Salmonella enterica serotype Enteritidis illness for source attribution. To estimate summary population attributable fractions (PAFs) of exposures grouped into transmission routes, we devised a counterfactual estimator that predicts the reduction in illness associated with removing grouped exposures. For comparison, we fitted the data using logistic regression models with stepwise forward and backward variable selection. Forward and backward variable selection in the logistic regression models gave inconsistent parameter estimates and identified different significant exposures. By contrast, the random forest model produced estimated PAFs of grouped exposures consistent in rank order with results obtained from outbreak data, with egg-related exposures having the highest estimated PAF (22·1%, 95% confidence interval 8·5–31·8). Random forest might be structurally more coherent and efficient than logistic regression models for attributing Salmonella illnesses to sources involving many causal pathways.
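
As a reading aid, a counterfactual PAF estimator of the kind described in the summary can be written in a generic form. This is an assumed textbook-style formulation, not necessarily the authors' exact definition. The symbols are introduced here for illustration: \hat{p} is the fitted model's predicted probability of illness, x_i is the exposure record of person i, and x_i^{(G \to 0)} is the same record with every exposure in group G set to "unexposed". Which observations enter the sums (for example, cases only) depends on the study design.

```latex
\widehat{\mathrm{PAF}}_G
  = \frac{\sum_{i=1}^{n} \hat{p}(x_i) \;-\; \sum_{i=1}^{n} \hat{p}\bigl(x_i^{(G \to 0)}\bigr)}
         {\sum_{i=1}^{n} \hat{p}(x_i)}
```

Under this form, the numerator is the predicted number of illnesses avoided by removing the grouped exposures, and the denominator is the predicted number of illnesses under the observed exposures.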

Information

Type
Original Papers
Copyright
Copyright © Cambridge University Press 2015 

Table 1. Percentage of missing data and estimated odds ratios of significant exposures identified by different variable selection methods of logistic regression


Fig. 1. Permutation importance (blue circles) by mean decrease in classification accuracy of the random forest model [normalized by the standard deviation of the differences in classification accuracy of pre- and post-permutation out-of-bag (unused) data] and exposure frequency in cases (red-grey circles) of individual exposures measured.
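
The normalization named in the Fig. 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the layout of `trees` as `(predict, oob_indices)` pairs, the function names, and the toy data are all hypothetical, and the score follows the caption's wording by dividing the mean accuracy drop by the standard deviation of the per-tree drops.

```python
import random
import statistics


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label equals the true label."""
    hits = sum(1 for x, y in zip(rows, labels) if predict(x) == y)
    return hits / len(rows)


def normalized_permutation_importance(trees, X, y, feature, rng):
    """Return (mean_drop, score) for one feature.

    trees   : list of (predict, oob_indices) pairs, one per tree
              (hypothetical layout chosen for this sketch)
    feature : column index to permute within each tree's out-of-bag rows
    score   : mean_drop / SD of the per-tree drops, or None if the SD is zero
    """
    drops = []
    for predict, oob in trees:
        rows = [list(X[i]) for i in oob]
        labels = [y[i] for i in oob]
        before = accuracy(predict, rows, labels)
        # Shuffle only the chosen column, leaving the other columns intact.
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        permuted = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
        after = accuracy(predict, permuted, labels)
        drops.append(before - after)
    mean_drop = statistics.mean(drops)
    sd = statistics.stdev(drops) if len(drops) > 1 else 0.0
    score = mean_drop / sd if sd > 0 else None
    return mean_drop, score


# Toy data: the label equals column 0, and column 1 is irrelevant.
toy_X = [[i % 2, (i // 2) % 2] for i in range(40)]
toy_y = [row[0] for row in toy_X]
# Stand-in "trees": each one predicts from column 0 and has its own OOB rows.
toy_trees = [(lambda x: x[0], list(range(k, k + 20))) for k in range(0, 20, 2)]
```

Calling `normalized_permutation_importance(toy_trees, toy_X, toy_y, 0, random.Random(0))` permutes the informative column, so accuracy drops. Permuting column 1 leaves every prediction unchanged, so the mean drop is zero.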


Fig. 2. Predicted percentage reduction of illness as a function of probabilistic reduction in grouped exposures based on counterfactual modelling of hypothetical interventions.
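
The curve described in the Fig. 2 caption can be illustrated with a small simulation. This is a hedged sketch rather than the study's implementation: `predicted_reduction`, `toy_risk`, and the exposure names are hypothetical, and the hand-written risk function stands in for a fitted random forest. Each exposure in the group is switched off independently with probability `removal_prob`, and the predicted illness total is compared with the baseline.

```python
import random


def predicted_reduction(predict_risk, rows, group, removal_prob, n_draws=200, seed=0):
    """Percentage reduction in predicted illness when each exposure in `group`
    is removed independently with probability `removal_prob`.

    predict_risk : callable mapping an exposure dict to a probability of illness
    rows         : list of exposure dicts (0/1 values)
    """
    rng = random.Random(seed)
    baseline = sum(predict_risk(x) for x in rows)
    total = 0.0
    for _ in range(n_draws):
        for x in rows:
            counterfactual = dict(x)  # copy so the observed record is untouched
            for name in group:
                if counterfactual[name] and rng.random() < removal_prob:
                    counterfactual[name] = 0
            total += predict_risk(counterfactual)
    expected = total / n_draws  # average predicted illness total over the draws
    return 100.0 * (baseline - expected) / baseline


def toy_risk(x):
    """Made-up additive risk model, used only to make the sketch runnable."""
    return 0.1 + 0.3 * x["eggs"] + 0.1 * x["chicken"]


toy_rows = [
    {"eggs": 1, "chicken": 0},
    {"eggs": 1, "chicken": 1},
    {"eggs": 0, "chicken": 1},
    {"eggs": 0, "chicken": 0},
]

# Trace the curve for the hypothetical "eggs" group.
curve = [(p, predicted_reduction(toy_risk, toy_rows, ["eggs"], p)) for p in (0.0, 0.5, 1.0)]
```

With `removal_prob=1.0` the exposures are removed outright, so the endpoint of the curve corresponds to a counterfactual PAF for the group under the stand-in model.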


Table 2. Estimated summary population attributable fractions for grouped exposures obtained by the random forest model, based on the Salmonella Enteritidis case-control study data collected by FoodNet in 2002

Supplementary material: File

Gu supplementary material

Appendix
