
Variable ranking and selection with random forest for unbalanced data

Published online by Cambridge University Press:  20 December 2022

Ute Bradter*
Affiliation:
School of Biology, University of Leeds, Leeds, United Kingdom Department of Terrestrial Ecology, Norwegian Institute for Nature Research, Trondheim, Norway
John D. Altringham
Affiliation:
School of Biology, University of Leeds, Leeds, United Kingdom
William E. Kunin
Affiliation:
School of Biology, University of Leeds, Leeds, United Kingdom
Tim J. Thom
Affiliation:
Yorkshire Wildlife Trust, Skipton, United Kingdom
Jerome O’Connell
Affiliation:
School of Biology, University of Leeds, Leeds, United Kingdom ProvEye, Kerry, Ireland
Tim G. Benton
Affiliation:
School of Biology, University of Leeds, Leeds, United Kingdom
*Corresponding author. E-mail: Ute.bradter@nina.no

Abstract

When one or several classes are much less prevalent than another class (unbalanced data), class error rates and variable importances of the machine learning algorithm random forest can be biased, particularly when sample sizes are smaller, imbalance levels higher, and effect sizes of important variables smaller. Using simulated data varying in size, imbalance level, number of true variables, their effect sizes, and the strength of multicollinearity between covariates, we evaluated how eight versions of random forest ranked and selected true variables out of a large number of covariates despite class imbalance. The version that calculated variable importance based on the area under the curve (AUC) was least adversely affected by class imbalance. For the same number of true variables, effect sizes, and multicollinearity between covariates, the AUC variable importance still ranked true variables highly at the smaller sample sizes and higher imbalance levels at which the other seven versions no longer achieved high ranks for true variables. Conversely, using the Hellinger distance to split trees or downsampling the majority class ranked true variables lower and more variably, even at the larger sample sizes and lower imbalance levels at which the other algorithms still ranked true variables highly. In variable selection, a higher proportion of true variables was identified when covariates were ranked by AUC importances, and the proportion increased further when the AUC was used as the criterion in forward variable selection. In three case studies, known species–habitat relationships and their spatial scales were identified despite unbalanced data.
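The AUC equals the Mann–Whitney probability that a randomly chosen presence receives a higher score than a randomly chosen absence, which is why it is a useful criterion under class imbalance: it depends only on the ranking of scores, not on class prevalence. A minimal sketch of this computation (illustrative Python only; the analyses in the paper were carried out in R):

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via the Mann-Whitney statistic:
    the probability that a random presence scores higher than a
    random absence (ties count half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# The same score separation yields the same AUC whether classes are
# balanced or strongly unbalanced:
balanced = auc([0.8, 0.7, 0.6], [0.5, 0.4, 0.3])        # 3 vs 3
unbalanced = auc([0.8], [0.5, 0.4, 0.3, 0.2, 0.1, 0.05])  # 1 vs 6
```

Both calls return 1.0 here, illustrating the prevalence-independence of the AUC that motivates its use for variable importance with unbalanced data.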

Information

Type
Methods Paper
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Open Practices
Open data
Copyright
© The Author(s), 2022. Published by Cambridge University Press

Figure 1. Conceptual diagram of the simulation study: (a) Out of a set of 121 environmental covariates, 14 variables were selected as true variables (true var). Seven of the 14 variables had low variance inflation factors (VIFs) indicating low multicollinearity with other covariates and the other seven variables had high variance inflation factors. True variables were standardized to a mean of 0 and a standard deviation of 1. Eighteen binary, simulated response variables were created based on weak (β = 0.4), medium (β = 0.9), and strong (β = 1.4) effect sizes for true variables and either one, two, or four true variables with either low or high VIFs. (b) From each simulated response variable consisting of presences and absences, 60 datasets were created differing in the sample size (300, 200, or 100 absences) and the imbalance level (20, 15, 10, or 5% presences). Presences and absences were randomly drawn five times for each of the 12 combinations of sample size with imbalance level. In total, 1,080 datasets were created (18 simulated responses × 3 sample sizes × 4 imbalance levels × 5 random samples).
(c) For each of the 1,080 datasets consisting of simulated presences and absences and the corresponding 121 environmental covariates, variable importances were calculated for each of the 121 covariates using eight alternative random forest versions: four random forest versions were implemented with R package randomForest (RF, RF-Down, RF-Down-Acc, RF-Priors), two versions were implemented with R package ranger (Rang-Hellinger, Rang-Weight) and two versions with R package party (CFor, CFor-AUC): (1) default RF (RF), (2) the majority class downsampled to the size of the minority class (RF-Down), (3) the majority class downsampled to 64% of the size of the minority class (RF-Down-Acc), (4) weights as class priors (RF-Priors), (5) Hellinger distance as splitting criterion (Rang-Hellinger), (6) weights applied to the splitting rule and majority vote (Rang-Weight), (7) random forest based on conditional inference trees (CFor), (8) random forest based on conditional inference trees with AUC permutation importance (CFor-AUC). After the calculation of variable importances, the covariates were ranked from the covariate with the highest importance to the covariate with the lowest importance and the ranks for true variables were extracted. (d) For each of the 1,080 datasets we carried out variable selection using three alternative versions: (1) Variable ranking based on permutation importances from the default RF, followed by a forward selection with the Out-of-Bag error as selection criterion (RF/OOB), (2) Variable ranking based on AUC permutation importance and forests with conditional inference trees, followed by a forward selection with the Out-of-Bag error as selection criterion (CFor-AUC/OOB), and (3) Variable ranking based on AUC permutation importance and forests with conditional inference trees, followed by a forward selection with the AUC as selection criterion (CFor-AUC/AUC).
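The sampling design in panel (b) and the downsampling variants in panel (c) can be sketched as follows. This is an illustrative Python sketch with hypothetical function names, not the authors' code (the study used the R packages randomForest, ranger, and party):

```python
import random

def draw_unbalanced(presences, absences, n_absences, pct_presences, seed=0):
    """Draw one simulated dataset: n_absences absences plus enough
    presences to reach the target imbalance level (percentage of
    presences in the combined dataset)."""
    rng = random.Random(seed)
    n_pres = round(n_absences * pct_presences / (100 - pct_presences))
    return rng.sample(presences, n_pres), rng.sample(absences, n_absences)

def downsample_majority(majority, minority_size, fraction=1.0, seed=0):
    """Downsample the majority class to fraction * minority class size,
    as in RF-Down (fraction=1.0) or RF-Down-Acc (fraction=0.64)."""
    rng = random.Random(seed)
    k = max(1, round(minority_size * fraction))
    return rng.sample(majority, k)

# e.g. 300 absences at a 20% imbalance level gives 75 presences:
pres, absn = draw_unbalanced(list(range(1000)), list(range(1000)), 300, 20)
```

Repeating such draws five times per combination of sample size and imbalance level reproduces the 18 × 3 × 4 × 5 = 1,080 datasets described in the caption.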


Figure 2. Example of model selection using AUC. First, AUC is calculated for all nested models using several repetitions for each nested model. The black curve shows the mean AUC over all repetitions with the gray polygon showing the mean ± standard deviation. Then, the model (vertical black line) with the highest mean AUC (dashed horizontal black line) is identified. The AUC value corresponding to the mean minus the standard deviation for this model (dotted horizontal black line) serves as a threshold accounting for the variation between repetitions. The selected model is the smallest (lowest number of covariates) model with a mean AUC at or above the threshold value (blue lines).
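The selection rule in this caption can be expressed compactly: find the nested model with the highest mean AUC, subtract its standard deviation to get a threshold, and pick the smallest model whose mean AUC reaches that threshold. An illustrative Python sketch with hypothetical input lists:

```python
def select_model(sizes, mean_aucs, sd_aucs):
    """Select the smallest nested model whose mean AUC is at or above
    the threshold (mean AUC of the best model minus its standard
    deviation). sizes, mean_aucs, sd_aucs are parallel lists, one
    entry per nested model."""
    best = max(range(len(sizes)), key=lambda i: mean_aucs[i])
    threshold = mean_aucs[best] - sd_aucs[best]
    candidates = [i for i in range(len(sizes)) if mean_aucs[i] >= threshold]
    return min(candidates, key=lambda i: sizes[i])

# Hypothetical example: nested models with 1..5 covariates. The best
# mean AUC is 0.86 (4 covariates), threshold 0.86 - 0.03 = 0.83, and
# the smallest model reaching 0.83 has 3 covariates.
idx = select_model([1, 2, 3, 4, 5],
                   [0.70, 0.82, 0.85, 0.86, 0.84],
                   [0.03, 0.03, 0.02, 0.03, 0.03])
```

This one-standard-deviation rule favors parsimony: it accepts a slightly lower mean AUC in exchange for fewer covariates, within the variation observed between repetitions.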


Figure 3. Boxplots showing ranks of permutation importance for the single true predictor with a low VIF and a medium effect size of 0.9. For each of the 60 simulated test sets with a total of 121 covariates, permutation importance was calculated 10 times for each of the eight algorithm versions. The true predictors were ranked by the mean permutation importance averaged over the 10 repetitions. Boxes show the interquartile range and median; whiskers extend to at most 1.5 × the interquartile range. Boxes represent the eight algorithm versions, from left to right: default RF (RF), two downsampled versions (RF-Down, RF-Down-Acc), priors representing the proportion of class sizes (RF-Priors), Hellinger distance as splitting criterion (Rang-Hellinger), weighted random forest (Rang-Weight), random forest based on conditional inference trees (CFor), and AUC permutation importance (CFor-AUC). Sample sizes decrease from row a to row c. Imbalance levels increase from left to right.


Figure 4. Boxplots showing ranks of permutation importance for four true variables with low VIFs and a medium effect size of 0.9. For each of the 60 simulated test sets with a total of 121 covariates, permutation importance was calculated 10 times for each of the eight algorithm versions. The true predictors were ranked by the mean permutation importance averaged over the 10 repetitions. Boxes show the interquartile range and median; whiskers extend to at most 1.5 × the interquartile range. Boxes represent the eight algorithm versions, from left to right: default RF (RF), two downsampled versions (RF-Down, RF-Down-Acc), priors representing the proportion of class sizes (RF-Priors), Hellinger distance as splitting criterion (Rang-Hellinger), weighted random forest (Rang-Weight), random forest based on conditional inference trees (CFor), and AUC permutation importance (CFor-AUC). Sample sizes decrease from row a to row c. Imbalance levels increase from left to right.


Figure 5. Boxplots showing ranks of permutation importance for true variables with a high effect size of 1.4. For each of the 360 simulated test sets with a total of 121 covariates, permutation importance was calculated 10 times for each of the eight algorithm versions. The true predictors were ranked by the mean permutation importance averaged over the 10 repetitions. The boxplots summarize the ranking across all true variables with high effect sizes of 1.4 across simulated data with one, two, and four true variables and with low and with high VIFs of the true variables. Boxes show the interquartile range and median; whiskers extend to at most 1.5 × the interquartile range. Boxes represent the eight algorithm versions, from left to right: default RF (RF), two downsampled versions (RF-Down, RF-Down-Acc), priors representing the proportion of class sizes (RF-Priors), Hellinger distance as splitting criterion (Rang-Hellinger), weighted random forest (Rang-Weight), random forest based on conditional inference trees (CFor), and AUC permutation importance (CFor-AUC). Sample sizes decrease from row a to row c. Imbalance levels increase from left to right.


Figure 6. Boxplots of (a) sensitivity and (b) specificity of variable selection for three alternative variable selection approaches with random forest. Variable selections were implemented for simulated data with 121 covariates including one, two, or four covariates with a medium effect size of 0.9 and with low variance inflation factors. Sensitivity represents the rate at which true covariates are identified. Specificity represents the rate at which noise covariates are rejected. Boxes represent, from left to right, (1) RF/OOB: the default variable selection approach with no adjustments for unbalanced data, (2) CFor-AUC/OOB: covariates are ranked by the AUC permutation importance instead of the default permutation importance, and (3) CFor-AUC/AUC: covariates are ranked by the AUC permutation importance and the threshold criterion in the forward selection is AUC instead of OOB. Boxes show the interquartile range and median; whiskers extend to at most 1.5 × the interquartile range; dots show outliers.
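Sensitivity and specificity of a variable selection, as defined in this caption, can be computed by comparing the selected set against the known true variables. An illustrative Python sketch (function and variable names are hypothetical):

```python
def selection_rates(true_vars, selected, all_covariates):
    """Sensitivity: fraction of true variables that were selected.
    Specificity: fraction of noise covariates that were rejected."""
    true_vars, selected = set(true_vars), set(selected)
    noise = set(all_covariates) - true_vars
    sensitivity = len(true_vars & selected) / len(true_vars)
    specificity = len(noise - selected) / len(noise)
    return sensitivity, specificity

# 121 covariates, 4 of them true; the selection finds 3 true variables
# and mistakenly keeps 2 noise covariates:
all_cov = [f"x{i}" for i in range(121)]
sens, spec = selection_rates(["x0", "x1", "x2", "x3"],
                             ["x0", "x1", "x2", "x10", "x20"], all_cov)
```

With few true variables among many covariates, specificity stays high even when some noise covariates slip through, so sensitivity is usually the more discriminating measure in this setting.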


Figure 7. Predicted probability of (a) Northern lapwing, (b) common snipe, and (c) common redshank presence in the study area. Only areas below 500 m elevation, which were those surveyed, are shown. Also shown is whether these species were recorded on the transects. Maps derived using NATMAP Soilscapes, Cranfield University (NSRI) and for the Controller of HMSO 2009; Land-Form PANORAMA, OS MasterMap, and Strategi data downloaded from the EDINA Digimap OS service, Crown Copyright/database right 1993, 2007, and 2009, an Ordnance Survey/EDINA supplied service; Landsat Surface Reflectance products courtesy of the U.S. Geological Survey Earth Resources Observation and Science Center.


Table 1. Selected multi-scale covariates in models for Northern lapwing, common snipe, and common redshank in the Yorkshire Dales, UK.

Supplementary material: File

Bradter et al. supplementary material

File 5 MB