Assessing the performance and accuracy of invasive plant habitat suitability models in detecting new observations in Wisconsin

Abstract Land managers require tools that improve understanding of suitable habitat for invasive plants and that can be incorporated into survey efforts to improve efficiency. Habitat suitability models (HSMs) have attributes that can meet these requirements, but it is not known how well they perform, as they are rarely field-tested for accuracy. We developed ensemble HSMs in the state of Wisconsin for 15 species using five algorithms (boosted regression trees, generalized linear models, multivariate adaptive regression splines, MaxEnt, and random forests), evaluated performance, determined variables that drive suitability, and tested accuracy. All models performed well during the development phase (Area Under the Curve [AUC] > 0.7 and True Skill Statistic [TSS] > 0.4). While variable importance and directionality were species specific, the most important predictor variables across all of the species' models were mean winter minimum temperatures, total summer precipitation, and tree canopy cover. After model development, we obtained 5,005 new occurrence records from community science observations for all 15 focal species to test the models' abilities to accurately predict presence. Using a correct classification rate of 80%, models for just 8 of the 15 species correctly predicted suitable habitat (α ≤ 0.05). Exploratory analyses found that the number of reporters of these new data and the total number of new occurrences reported per species contributed to increased correct classification. Results suggest that while some models perform well on evaluation metrics, relying on these metrics alone is not sufficient and can lead to errors when models are used for surveying. We recommend that any model be tested for accuracy in the field before use to avoid this potential issue.


Introduction
Invasive plants are a pervasive problem that impact natural and managed lands. They can disrupt native plant, animal, and microbial communities (Dukes and Mooney 2004); alter ecosystem processes (DiTomaso 2000; Levine et al. 2003); be costly to control (Epanchin-Niell et al. 2010); and even impact human health (Mazza et al. 2014). While efforts to prevent invasions and their associated impacts have had success (e.g., Rothlisberger et al. 2010), establishment and spread of invasive species remains common in natural areas. Land management agencies commonly employ early detection and rapid response (EDRR) to limit these impacts (Reaser et al. 2019; Westbrooks 2004). Within this framework, early detection is a critical first step, as it minimizes spread, which prevents impacts across a large area (Maistrello et al. 2016) and allows for cost-effective management (Westbrooks 2004).
Invasive species monitoring efforts, however, are a challenge for land management organizations. Maxwell et al. (2012) estimate that 1% to 2% of managed natural lands are monitored annually for invasive plants. While increasing monitoring is the optimal solution, few have budgets to enhance these efforts, and many state and federal agencies have reduced monitoring programs (NISC 2018). As professional staff and monitoring resources are unlikely to increase, developing tools to improve efficiency of existing monitoring efforts is an attractive alternative that has been a high priority (WISC 2013). Habitat suitability models (HSMs) are one example, as they have the potential to assist with early detection of invasive species. HSMs can improve detection of target species with much greater efficiency (up to 80% detection rate; Wang et al. 2014) compared with the random searches currently employed (Crall et al. 2013). However, a general disconnect exists between researchers and land managers (Renz et al. 2009), which suggests that although these models have potential benefits to land managers, their actual implementation on the landscape is limited.
HSMs have been developed for many invasive species (e.g., Allen and Bradley 2016; Magarey et al. 2018) at a range of scales (e.g., Young et al. 2020). While predictor variables often differ depending on the focal species, climatic variables (e.g., precipitation and temperature), remotely sensed vegetation indices, landscape attributes (e.g., aspect, slope), solar irradiation, distance to dispersal corridors (e.g., water, roads), and soil attributes are commonly used in the creation of HSMs (Evangelista et al. 2008; Kumar et al. 2006; Stohlgren et al. 2003). Predictions are best when true absence data are included, but these data sets are difficult to obtain. Therefore, targeted background approaches have been developed and demonstrated to be an effective alternative (Crall et al. 2013; Phillips et al. 2009). Different algorithms have been utilized in modeling suitable habitat; however, ensemble predictions utilizing multiple algorithms have shown better predictive capability than reliance on any single algorithm (Magarey et al. 2018; Morisette et al. 2013). Updating models with new data (an iterative approach) can also improve predictions over time (Crall et al. 2013).
While HSMs are available for a wide range of invasive species, their use by practitioners and land managers is limited. Improvement in transparency in habitat modeling development is one approach researchers are taking to address this (Araújo et al. 2019;Sofaer et al. 2019) and improve understanding and adoption of HSMs on the landscape. However, these tools are rarely field-tested for accuracy (Barbet-Massin et al. 2018;West et al. 2016). Thus, the effectiveness of these tools on the landscape is not well known, and this may reduce adoption (Funk et al. 2020;Sofaer et al. 2019). While validation can be difficult, increases in community scientist and stakeholder monitoring efforts that are publicly shared provide a resource to test the accuracy of these models and potentially increase adoption of HSMs.
As part of an outreach effort from 2014 to 2018 focusing on detection of invasive species within Wisconsin (Great Lakes Early Detection Network established by Crall et al. [2012]), we collected known occurrences of invasive plants and promoted monitoring throughout the state to stakeholders and through a community science group (WIFDN 2020). These data were used to develop HSMs for 15 invasive plants in the state of Wisconsin using an ensemble approach. With these models, we assessed performance and the drivers of suitable habitat and, after 2 yr, determined the accuracy of the HSMs using occurrences reported by community scientists and stakeholders. As the accuracy of HSMs varied among species, further analysis sought to understand which factors influence the ability of models to correctly classify invasive plant presence.

Predictor Variables
Fourteen candidate predictor variables (Table 1) were utilized for model development. Topographic variables (aspect, elevation, and slope) and distance to dispersal corridors (roads, waterways) were created from a digital elevation model and the Topologically Integrated Geographic Encoding and Referencing (TIGER) data sets, respectively, using ArcMap (ESRI v. 10.5, 2016, Redlands, CA). Climate predictors were obtained from the AdaptWest Project (Wang et al. 2016) and processed to create seasonal average precipitation and average summer maximum and winter minimum temperatures (1981 to 2010 climate normals). Soil attributes (percent organic matter and percent clay) were derived from the USDA gridded soil survey and were included to distinguish soil types across the study area. Finally, a 10-yr average enhanced vegetation index (EVI; Didan 2015) and percent tree canopy cover (Sexton et al. 2013), both calculated from MODIS satellite data, were used as measures of greenness to differentiate canopy density and vegetation types.

Occurrence Data
More than 100,000 occurrence records for more than 100 species of terrestrial invasive plants were obtained through the Wisconsin Department of Natural Resources and the Early Detection and Distribution Mapping System from 2000 to 2016. Occurrences were evaluated, and duplicates and occurrences with GPS error >30 m were removed. Fifteen invasive plants (Table 2) were ultimately chosen for modeling in this study.

Habitat Suitability Models
All habitat models were created using the Software for Assisted Habitat Modeling (SAHM) modules within the broader VisTrails framework (Freire and Silva 2012). SAHM was designed to assist researchers with the development of spatial distribution models by allowing the user to make decisions in an interactive workflow. For each species modeled, the presence data were complemented with pseudo-absence records by utilizing additional reported invasive plant occurrences taken from the same surveys performed for the focal species. Phillips et al. (2009) describe this "targeted background" approach as providing a better representation of absence data than randomly assigned background points when true absence data are not available.
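A minimal sketch of this targeted background approach, assuming a simple tabular occurrence data set (species names are from the study, but the records, coordinates, and column names here are hypothetical):

```python
# Targeted background sketch (Phillips et al. 2009): pseudo-absences for a
# focal species are drawn from occurrences of OTHER invasive plants reported
# by the same survey effort, so the background reflects where surveyors
# actually looked. Data and column names are hypothetical.
import pandas as pd

records = pd.DataFrame({
    "species": ["A. petiolata", "A. petiolata", "P. sativa", "E. esula", "P. sativa"],
    "lon": [-89.4, -88.0, -90.1, -91.2, -89.9],
    "lat": [43.1, 44.7, 45.0, 46.1, 43.5],
})

def targeted_background(records, focal_species):
    """Split survey records into presences (focal) and pseudo-absences (all others)."""
    presence = records[records["species"] == focal_species]
    background = records[records["species"] != focal_species]
    return presence, background

presence, background = targeted_background(records, "P. sativa")
```

Because the background points come from the same survey effort as the presences, sampling bias (e.g., proximity to roads) affects both sets similarly, which is the rationale for preferring this over random background points.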
The 14 predictor layers were projected, aggregated (mean values), resampled (bilinear interpolation), and clipped to the extent of the state of Wisconsin using the PARC module within SAHM. These predictors were selected based on previous research in Wisconsin (Crall et al. 2013) to identify fine-scale habitat suitability. While spatial resolutions of 30, 100, and 250 m were considered, we selected a final spatial resolution of 30 m, as this provided similar results as coarser resolutions (data not shown) and provided a more detailed map that could be utilized by land managers. A 10-fold cross-validation test of the training models was conducted on each individual species' models to determine whether overfitting of the data had occurred during model development. The cross-validation method randomly withholds 10% of the data and tests it against the trained model, iterating the process 10 times, with the goal of observing limited variance in the predictions.

Management Implications
With availability of resources for invasive plant monitoring and control efforts nationwide in decline, new tools and technologies are required to fill this gap. Habitat suitability models (HSMs) are an effective tool for improving understanding and detection of invasive plants on the landscape. HSMs use environmental and physical variables at known locations of a focal species to predict where other similar locations exist on the landscape. While HSMs are common in the scientific community, gaps exist between development of these tools and their use by land management agencies. As these models are rarely evaluated for accuracy, their implementation in the field has been limited by concerns over real-world performance. As developers rarely have resources to field validate, this study sought to determine whether HSMs accurately identify the presence of community scientist observations of 15 invasive plants throughout Wisconsin. A total of 5,005 verified new occurrences collected over 2 yr from a structured network of community scientists was utilized to test the ability of these models to correctly identify suitable habitat. While model performance looked successful during development, only 8 of the 15 models performed at or above our performance threshold (80% correct classification). This result underscores the importance of rigorous field testing of the accuracy of these models with separate data before release to land managers.
In cases where predictor layers were highly correlated (|r| > 0.75) based on any of three correlation coefficients (Pearson, Spearman, and Kendall), the predictor with the greater biological importance for the focal species and the higher percent deviance explained in a simple generalized additive model was retained.
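A sketch of this multicollinearity screen, flagging predictor pairs that exceed |r| = 0.75 on any of the three indices (the predictor names and data here are hypothetical; choosing which member of a flagged pair to drop remains the judgment call described above):

```python
# Flag predictor pairs whose Pearson, Spearman, or Kendall correlation
# exceeds |r| = 0.75. Data and variable names are illustrative only.
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

def correlated_pairs(predictors, names, threshold=0.75):
    """Return (name_i, name_j) pairs exceeding the threshold on any index."""
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            x, y = predictors[:, i], predictors[:, j]
            rs = [pearsonr(x, y)[0], spearmanr(x, y)[0], kendalltau(x, y)[0]]
            if max(abs(r) for r in rs) > threshold:
                flagged.append((names[i], names[j]))
    return flagged

rng = np.random.default_rng(0)
temp = rng.normal(size=200)
data = np.column_stack([
    temp,                                          # e.g., winter minimum temp
    temp * 0.9 + rng.normal(scale=0.1, size=200),  # strongly correlated layer
    rng.normal(size=200),                          # independent layer
])
pairs = correlated_pairs(data, ["tmin_winter", "tmax_summer", "evi"])
```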
Five algorithms (boosted regression trees, generalized linear models, multivariate adaptive regression splines, maximum entropy [MaxEnt], and random forests [RFs]) were used to produce habitat suitability predictions for the state of Wisconsin for the species listed in Table 2. Each algorithm's predictive performance was assessed using both threshold-independent (Area Under the Curve [AUC]) and threshold-dependent (sensitivity, specificity, true skill statistic [TSS]) metrics. If AUC values for the training and cross-validation test differed by ≥0.075, model hyperparameters (see Supplementary Table S1 for more details) were adjusted to curtail overfitting. If overfitting could not be limited to meet this standard, those models were not included in analyses. Each algorithm produced a unique probability surface of predicted suitability for the study area. These probability surfaces were converted to a binary simplification using a threshold cutoff found by maximizing the average of the sensitivity plus specificity. Pixels at or above this threshold were given a value of 1, while unsuitable pixels were designated as 0. The binary predictions were then summed to create an ensemble model (see Morisette et al. 2013) of suitability for a specific species, and a final model was created in which a pixel was deemed suitable if at least one algorithm predicted it as suitable.
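The thresholding and ensemble step can be sketched as follows: each algorithm's probability surface is binarized at the cutoff maximizing sensitivity + specificity (equivalent to maximizing Youden's J along the ROC curve), and the binary layers are summed, with a pixel suitable in the final map if at least one algorithm flags it. The probabilities below are hypothetical stand-ins for two algorithms' outputs at six pixels:

```python
# Binarize each algorithm's suitability probabilities at the threshold that
# maximizes sensitivity + specificity, then combine into an ensemble.
# Labels and probabilities are illustrative only.
import numpy as np
from sklearn.metrics import roc_curve

def binarize_max_ss(y_true, y_prob):
    """Binary prediction using the threshold maximizing sensitivity + specificity."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    best = thresholds[np.argmax(tpr + (1 - fpr))]  # Youden's J = sens + spec - 1
    return (y_prob >= best).astype(int)

y_true = np.array([0, 0, 0, 1, 1, 1])              # background vs. presence
probs_by_algorithm = [
    np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9]),      # e.g., BRT output
    np.array([0.2, 0.1, 0.3, 0.7, 0.6, 0.95]),     # e.g., RF output
]
ensemble = sum(binarize_max_ss(y_true, p) for p in probs_by_algorithm)
final_suitable = (ensemble >= 1).astype(int)        # suitable if >=1 model agrees
```

Raising the agreement requirement (e.g., `ensemble >= 3`) shrinks the predicted suitable area at the cost of more false negatives, a trade-off discussed later in the article.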
Variable importance and response curves were produced by SAHM and examined for similarities across the five models. Variable importance is calculated by determining the change in AUC when predictor values are permuted between the occurrence records and background data points. The larger the change produced by the permutation, the greater the influence the predictor has on the model. Because the predictors retained by each of the five modeling approaches varied, we relativized the importance values to a percentage of all the variables used in the model. Marginal response curves were also produced for each predictor in the model by calculating the response when the other predictors are held constant at their mean values (Lötter and Maitre 2014; Pearson 2007).
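A minimal sketch of permutation-based variable importance on these lines, using a generic classifier and synthetic data rather than SAHM's implementation: shuffle one predictor at a time, measure the AUC drop, and relativize the drops to percentages.

```python
# Permutation importance sketch: larger AUC drops after shuffling a
# predictor indicate greater influence. Model and data are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=400, n_features=4, n_informative=2,
                           n_redundant=0, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)
base_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

rng = np.random.default_rng(1)
drops = []
for j in range(X.shape[1]):
    Xp = X.copy()
    rng.shuffle(Xp[:, j])                     # permute one predictor's values
    perm_auc = roc_auc_score(y, model.predict_proba(Xp)[:, 1])
    drops.append(max(base_auc - perm_auc, 0.0))

relative = 100 * np.array(drops) / sum(drops)  # relativize to percentages
```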

Community Science Outreach Campaign
In 2017, following the development of our HSMs, we developed an active outreach campaign through Wisconsin First Detector Network to report invasive plants in Wisconsin. Our campaign trained individuals to identify key invasive species we were modeling and report occurrences. Over the 2-yr time frame, 34 workshops conducted in 19 counties trained 986 people.

Testing Accuracy of Models from Community Scientist Observations
As a result of our community outreach campaign, more than 15,000 new invasive plant reports were submitted by community scientists in the state of Wisconsin. We used a subset of these reports to assess how well our models predicted the presence of specific invasive plants. Reports were used only if they were verified by an expert reviewer and had a GPS error ≤30 m. Additionally, accuracy was only tested for species with >40 observations in this new data set. This resulted in 5,005 new reports to test 15 species distribution models. Each report was run through the species-specific ensemble model to determine the sensitivity associated with the model predictions (Jorgensen et al. 2021). A χ² test was performed in R (R Core Team 2019) to determine whether the correct classification rate of these new occurrence records was higher than an 80% cutoff rate. This cutoff value for assessing predictive performance is commonly used in remote-sensing evaluation (Engler et al. 2013).
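One way to sketch this accuracy test (here in Python rather than R, with hypothetical counts): compare the number of new occurrences falling in suitable pixels against the counts expected under an 80% correct classification rate using a χ² goodness-of-fit test, and require both a rate above 80% and a significant test statistic.

```python
# Chi-square test of whether correct classification exceeds an 80% cutoff.
# The counts below are hypothetical, not the study's reported values.
from scipy.stats import chisquare

n_correct, n_total = 354, 365                   # new reports in suitable pixels
observed = [n_correct, n_total - n_correct]     # [correct, incorrect]
expected = [0.8 * n_total, 0.2 * n_total]       # counts under the 80% cutoff
stat, p_value = chisquare(observed, f_exp=expected)

rate = n_correct / n_total
exceeds_cutoff = (rate > 0.8) and (p_value <= 0.05)
```

Because `chisquare` is two-sided, the explicit `rate > 0.8` condition is needed to confirm the departure from 80% is in the direction of better, not worse, classification.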
To determine the drivers of correct classification of the community scientist observations across the 15 species evaluated, we developed an RF model in R using the RANDOMFOREST package. RF is a nonparametric ensemble machine learning approach that stochastically builds regression trees via a bootstrap technique (i.e., random sampling with replacement). RF models are robust to data sets that include large numbers of both continuous and categorical variables relative to the number of observations (Cutler et al. 2007). The importance of predictor variables is ranked by each tree in the forest, and the tree rankings are then averaged to develop unbiased estimates of the most important predictors where the overall model error is minimized. Twenty-six predictor variables were used to understand what affects the accuracy of classifying community scientist observations: 20 model evaluation metrics (training AUC, TSS, sensitivity, and specificity for each individual model in the species ensemble), three variables associated with the HSM (number of occurrences, number of pseudo-absence records, and number of counties where the species had been reported), and three variables associated with the community scientist data (the number of new occurrence records used to test the models, the number of reporters contributing these reports, and the number of counties with new reports). The model was tuned to curtail overfitting by setting the mtry parameter to the square root of the number of predictors and the number of trees to 500 (Probst et al. 2019).
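The same tuning can be sketched with scikit-learn (a stand-in for the R RANDOMFOREST package used in the study): `max_features="sqrt"` corresponds to mtry = √p and `n_estimators=500` to the number of trees. The response and predictor values below are randomly generated placeholders, not the study's data.

```python
# Random forest regression with mtry = sqrt(n_predictors) and 500 trees,
# mirroring the tuning described above. Data are hypothetical: 15 species
# (rows) by 26 candidate predictors (columns).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(15, 26))              # 15 species x 26 candidate predictors
y = rng.uniform(0.4, 1.0, size=15)         # correct classification rates

rf = RandomForestRegressor(
    n_estimators=500,       # number of trees
    max_features="sqrt",    # mtry = sqrt(number of predictors)
    random_state=0,
).fit(X, y)

importance = rf.feature_importances_        # averaged across trees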

Model Characteristics and Important Variables
The ensemble binary models for all 15 species' HSMs were created and deemed appropriate based on evaluation metrics (AUC, TSS) (Figure 1; Supplementary Figures S1–S14). Two to 14 variables (Table 3) were retained in the final models, depending upon multicollinearity or the intrinsic variable selection method of each algorithm for each species. The relative importance of the predictor variables was both species and model specific. The top three predictors (Table 3) were either consistently important (i.e., within the top three at least 30% of the time), occasionally important (in the top three 10% to 30% of the time), or rarely influential on model predictions (in the top three <10% of the time). Minimum winter temperature was the most common important variable across species' models (66% of the top three variables across all 15 models), which is consistent with other work modeling invasive species habitat suitability (Young et al. 2020). We also found summer precipitation (47%), maximum summer temperature (41%), and percent tree canopy cover (31%) consistently important across species' models. Precipitation in the fall, spring, and winter was less consistently important, ranking within the top three most important predictors in 27%, 23%, and 27% of the models, respectively. Similarly, percent clay in the soil and distance to road networks were among the top three most important predictors for 14% and 12% of the models, respectively. Crall et al. (2013) found percent clay to be the most important predictor for modeling suitable habitat of spotted knapweed (Centaurea stoebe L.) in Wisconsin, while it ranked within the top three important predictors 14 times in our study, two of which were for C. stoebe.
Predictors that were rarely ranked in the top three most important were the topographic variables (aspect, slope), EVI, distance to water, and percent soil organic matter (Table 3). Wisconsin has relatively homogeneous topography, with only a 426-m difference between the highest and lowest points in the state (mean = 320 m) and a mean slope of just 2.7% rise (range: 0% to 59%). The limited variation in these variables across Wisconsin likely reduced their importance in model predictions. Similarly, distance to water was only ranked among the top three important predictors for wetland invasive species (Phragmites and purple loosestrife [Lythrum salicaria L.]).
In addition to differences in important predictor variables among species, the directionality of the response also varied. For example, response to tree canopy cover varied across the species' models, with 9 models responding positively, 10 negatively, and 2 with a bimodal response (data not shown). While others suggest that HSMs for species with common growth forms can be combined, as they behave similarly (Iannone et al. 2015), our results suggest that the drivers are species specific and agree with those of Allen and Bradley (2016), who reported that invasive plant species occupy different niches.

Habitat Suitability Model Performance
All models met or exceeded acceptable thresholds for both threshold-dependent (TSS) and threshold-independent (AUC) evaluation metrics (Supplementary Table S2). AUCs for the models were above the threshold for excellent performance (0.9) 62.5% of the time and never fell below the acceptable threshold (0.7) (Hosmer and Lemeshow 1989). TSS, an indication of how well the models are able to accurately distinguish presence from background, was above the minimum threshold of 0.4 established by Jarnevich et al. (2018) for all but one of the retained model outputs (multivariate adaptive regression splines [MARS]: Lonicera spp.). Assessment of AUC and TSS values from the 10-fold cross-validation also indicated good model results, as differences between the training and mean cross-validation testing sets indicated limited overfitting, with only one exception. Any models that were considered to be overfitting the training data were tuned appropriately (Supplementary Table S1).

Accuracy of HSMs in Correctly Classifying Community Scientist Observations
Over two growing seasons, an additional 6,035 new occurrence records for the 15 modeled species were reported by 114 reporters across 86% of Wisconsin's counties and verified by experts. Of these new occurrences, 1,030 were excluded from this data set, as they were in "novel" areas where models did not assign a prediction (the majority were waterbodies and roads, which were excluded in model development). This resulted in 5,005 new occurrence records.
We used these 5,005 new data points as an independent data set to evaluate HSMs for each of the 15 species. Occurrences used per species ranged from 48 to 1,291 and were distributed across 9 to 32 of Wisconsin's 72 counties (Table 4). Eight of the 15 models performed at or above 80% correct classification of the new data points in suitable pixels, with leafy spurge (Euphorbia esula L.), wild parsnip (Pastinaca sativa L.), and Torilis spp. performing at or above 97% correct classification (Table 4). All three of these species' models also had excellent AUC and TSS values across all five of the model algorithms. However, the models for autumn olive (Elaeagnus umbellata Thunb.), common tansy (Tanacetum vulgare L.), and the teasels (Dipsacus spp.) correctly classified less than 60% of these new data despite good to excellent AUC values and above-threshold performance on other model evaluation metrics.
Comparing E. umbellata and P. sativa emphasizes this pattern, as both species performed similarly on AUC and TSS and had several hundred new occurrence records used to validate the ensemble models (454 for E. umbellata, and 365 for P. sativa). Despite these similarities, the P. sativa ensemble correctly classified presence 97% of the time, compared with 56% correct classification by the E. umbellata ensemble model.
Simplified models that overpredict suitable habitat could be one reason for high performance on the community scientist observations. The total area deemed suitable for each species differed substantially: teasels (Dipsacus spp.), for example, were predicted suitable in only 11% of Wisconsin, compared with common buckthorn (Rhamnus cathartica L.) at 63% of the state (Table 4). Of the species that performed above our 80% threshold, Lonicera spp., E. esula, C. stoebe, and P. sativa were predicted suitable in greater than 50% of Wisconsin's area, but crown-vetch [Securigera varia (L.) Lassen], garlic mustard [Alliaria petiolata (M. Bieb.) Cavara & Grande], Torilis spp., and Japanese barberry (Berberis thunbergii DC.) were predicted suitable in ≤36% of Wisconsin. While these results suggest some level of generalization, with large areas of the state deemed suitable habitat for a given species, even these models would still substantially reduce the area land managers need to monitor. If a greater reduction is desired, increasing the number of models that must agree (we required one) could further reduce the area. However, this may increase the potential for false negatives and should be used with caution.
Although referenced as a key component of the modeling process, field validation is not widely observed in the literature (Barbet-Massin et al. 2018). Our data corroborate the notion that useful models must be tested with independently acquired data, as relying on the evaluation metrics (training and cross-validation test) from the model development phase does not correlate with accuracy on newly acquired occurrence data. With only 8 of the 15 models performing at or above an 80% correct classification rate, we recommend further development of the remaining 7 models using the newly collected data as a means to improve model performance (Crall et al. 2013). Below this threshold, these models would likely have limited applicability for land managers employing EDRR management efforts. End users, however, should set their own expectations for accuracy requirements. We hope that our results provide additional momentum for field validating HSMs, as others have also identified this as a high-priority issue (Sofaer et al. 2019).
Finally, to investigate whether any single metric or set of metrics was important in describing correct classification of community scientist observations, an RF analysis was performed to determine the most important factors driving correct classification (n = 15). The predictors included the evaluation metrics (AUC, TSS) from each of the five algorithms as well as information about the community scientist data to explore the influence these variables had on predicting suitable habitat. These variables explained only 14.9% of the variability in the data. The number of community scientist presence records and the total number of reporters of these new records were the two most important variables (1.0 and 0.81 scaled importance, respectively); all other variables had scaled importance <0.62. Partial dependence plots (Figure 2) depict the directionality of these predictor variables on the response when all other variables are held constant. For example, more than 250 new occurrences and more than 19 reporters dramatically improved correct classification. As these new data came from across the state of Wisconsin, we consider the number of reporters of these data a proxy for spatial extent but found it surprising that another proxy for spatial extent (number of counties with reports) was not important. Model evaluation metrics had little influence on the overall result. This agrees with the variability in correct classification performance at or above our 80% threshold despite relatively similar model evaluation metrics, and suggests that evaluation metrics alone should not be used to assess model performance.
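Partial dependence of this kind can be computed with a simple loop: the model's average prediction is traced across one predictor's range while the remaining predictors keep their observed values. The model, data, and feature names below are hypothetical, not the study's RF analysis.

```python
# One-dimensional partial dependence sketch: for each grid value of one
# predictor, set that predictor to the value in every row and average the
# model's predictions. Data are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 3))                     # 3 hypothetical predictors
y = (X[:, 0] > 0.5).astype(float) + rng.normal(scale=0.1, size=60)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

def partial_dependence_1d(model, X, feature, grid):
    """Average prediction with one feature fixed at each grid value."""
    out = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v                       # fix the focal predictor
        out.append(model.predict(Xv).mean())     # average over the data
    return np.array(out)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
response = partial_dependence_1d(rf, X, 0, grid)
```

A step change in `response` at a particular grid value mirrors the thresholds reported above (e.g., the jump in correct classification beyond 250 occurrences or 19 reporters).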
Others have suggested that specific model evaluation metrics are poor indicators of model performance (AUC [Lobo et al. 2008]; TSS [Wunderlich et al. 2019]). Additional research is needed to better understand which metrics best predict model performance in the field. While robust data sets can help with performance, other factors likely influence the success or failure of these models. More explicit descriptions of the distribution of the new occurrence data (distance to roads or trails, elevation, etc.; Mair and Ruete 2016) could offer better insight into the drivers of successful classification, as spatial bias may occur during the data-collection phase. Regardless, Crall et al. (2013) and Jarnevich et al. (2018) recommend overcoming poor field validation performance by using an iterative approach, whereby models are created and vetted in a development phase, passed through a field evaluation test, and then updated. Our results support repeating this process when developing HSMs for early detection of new invasive plant populations until acceptable results are achieved. Life history, habitats invaded, and other species-specific attributes may also influence results and should be evaluated in the future.
Here we presented 15 ensemble HSMs for a range of invasive plants in the state of Wisconsin. While climate predictors were most often the most important variables, the specific variable, its directionality, and the mixture of other retained variables differed among models, suggesting that modeling should be done for individual species when possible.
Working with community scientists allowed for the development of a separate data set to confirm the performance of the models. While all but one species performed at or above acceptable ranges for model evaluation metrics, when community scientist data were tested, just 53% of models could correctly classify these new data at >80%. Known issues of transferability to new environments or environmental conditions (e.g., Latif et al. 2016) or to entirely new locations (e.g., Lauria et al. 2015) not used to train the models can produce these low-performing models. Therefore, while not all of our models are ready for deployment in land monitoring and management systems, at least eight exceeded expectations for detection of new populations. Model evaluation metrics (e.g., AUC, TSS) were not good predictors of accuracy on community scientist observations. This effort highlights the benefits of developing a network of community scientists and stakeholders for HSM efforts. While validation of submitted data and education are required, the additional information made available across wide spatial extents broadens the ability of the scientific community to create useful models that perform well in the field and improves interaction between scientists and land managers.