
Selecting robust features for machine-learning applications using multidata causal discovery

Published online by Cambridge University Press: 07 July 2023

Saranya Ganesh S.*
Affiliation:
Institute of Earth Surface Dynamics, University of Lausanne (UNIL), Lausanne, Switzerland
Tom Beucler
Affiliation:
Institute of Earth Surface Dynamics, University of Lausanne (UNIL), Lausanne, Switzerland
Frederick Iat-Hin Tam
Affiliation:
Institute of Earth Surface Dynamics, University of Lausanne (UNIL), Lausanne, Switzerland
Milton S. Gomez
Affiliation:
Institute of Earth Surface Dynamics, University of Lausanne (UNIL), Lausanne, Switzerland
Jakob Runge
Affiliation:
Institute of Data Science, German Aerospace Center (DLR), Jena, Germany; Faculty of Electrical Engineering and Computer Science, Technische Universität Berlin, Berlin, Germany
Andreas Gerhardus
Affiliation:
Institute of Data Science, German Aerospace Center (DLR), Jena, Germany
*Corresponding author: Saranya Ganesh S.; Email: saranyaganesh.s@gmail.com

Abstract

Robust feature selection is vital for creating reliable and interpretable machine-learning (ML) models. When designing statistical prediction models in cases where domain knowledge is limited and underlying interactions are unknown, choosing the optimal set of features is often difficult. To mitigate this issue, we introduce a multidata (M) causal feature selection approach that simultaneously processes an ensemble of time series datasets and produces a single set of causal drivers. This approach uses the causal discovery algorithms PC$_1$ or PCMCI that are implemented in the Tigramite Python package. These algorithms utilize conditional independence tests to infer parts of the causal graph. Our causal feature selection approach filters out causally spurious links before passing the remaining causal features as inputs to ML models (multiple linear regression and random forest) that predict the targets. We apply our framework to the statistical intensity prediction of Western Pacific tropical cyclones (TCs), for which it is often difficult to accurately choose drivers and their dimensionality reduction (time lags, vertical levels, and area-averaging). Using more stringent significance thresholds in the conditional independence tests helps eliminate spurious causal relationships, thus helping the ML model generalize better to unseen TC cases. M-PC$_1$ with a reduced number of features outperforms M-PCMCI, noncausal ML, and other feature selection methods (lagged correlation and random), even slightly outperforming feature selection based on explainable artificial intelligence. The optimal causal drivers obtained from our causal feature selection help improve our understanding of underlying relationships and suggest new potential drivers of TC intensification.
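
The selection idea summarized above can be illustrated with a minimal sketch. It does not use Tigramite and is not the authors' implementation: it is a simplified, PC$_1$-like stand-in written with NumPy only, and every function name (`lagged_design`, `parcorr_pvalue`, `select_features`, `simulate_case`) as well as the synthetic data are hypothetical. Several time series datasets are pooled into one sample, each lagged candidate is tested against the target with a partial-correlation test (Fisher z approximation), first unconditionally and then conditioned on the strongest other candidate, and the surviving features are fed to a multiple linear regression. The actual algorithm iterates over larger conditioning sets, which this sketch omits.

```python
import math
import numpy as np

def lagged_design(series, target_idx, max_lag):
    """For one dataset of shape (T, N), return lagged candidates X, target y, and (var, lag) names."""
    T, N = series.shape
    y = series[max_lag:, target_idx]
    cols, names = [], []
    for var in range(N):
        for lag in range(1, max_lag + 1):
            cols.append(series[max_lag - lag:T - lag, var])
            names.append((var, lag))
    return np.column_stack(cols), y, names

def parcorr_pvalue(x, y, z=None):
    """Partial correlation of x and y given z, with a two-sided Fisher z p-value."""
    n, k = len(x), 0
    if z is not None and z.shape[1] > 0:
        k = z.shape[1]
        Z = np.column_stack([np.ones(n), z])
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r = float(np.clip(np.corrcoef(x, y)[0, 1], -0.999999, 0.999999))
    stat = math.atanh(r) * math.sqrt(n - k - 3)
    return r, math.erfc(abs(stat) / math.sqrt(2))

def select_features(datasets, target_idx, max_lag, alpha):
    """Pool all datasets, then keep candidates that pass both test stages at level alpha."""
    Xs, ys = [], []
    for s in datasets:
        X, y, names = lagged_design(s, target_idx, max_lag)
        Xs.append(X)
        ys.append(y)
    X, y = np.vstack(Xs), np.concatenate(ys)
    # Stage 0: unconditional test for every lagged candidate.
    strength = {}
    for j in range(X.shape[1]):
        r, p = parcorr_pvalue(X[:, j], y)
        if p <= alpha:
            strength[j] = abs(r)
    # Stage 1: condition each survivor on the strongest other survivor.
    ranked = sorted(strength, key=strength.get, reverse=True)
    kept = []
    for j in ranked:
        others = [i for i in ranked if i != j]
        if not others:
            kept.append(j)
            continue
        _, p = parcorr_pvalue(X[:, j], y, X[:, [others[0]]])
        if p <= alpha:
            kept.append(j)
    return [names[j] for j in kept], X[:, kept], y

def simulate_case(rng, T=200):
    """Synthetic case: column 1 drives columns 0 (target) and 2 at lag 1."""
    data = np.zeros((T, 3))
    for t in range(1, T):
        data[t, 1] = 0.7 * data[t - 1, 1] + rng.normal()
        data[t, 0] = 0.8 * data[t - 1, 1] + rng.normal()
        data[t, 2] = 0.9 * data[t - 1, 1] + rng.normal()
    return data

rng = np.random.default_rng(0)
cases = [simulate_case(rng) for _ in range(10)]  # ensemble of datasets
selected, X_sel, y = select_features(cases, target_idx=0, max_lag=1, alpha=0.01)

# Multiple linear regression on the selected features only.
A = np.column_stack([np.ones(len(y)), X_sel])
coef = np.linalg.lstsq(A, y, rcond=None)[0]
r2 = 1.0 - (y - A @ coef).var() / y.var()
print("selected (variable, lag):", selected)
print("training R^2:", round(float(r2), 3))
```

In this toy setup, the lagged target and the lagged third variable are correlated with the target only through the shared driver, so conditioning on the driver tends to remove them. Lowering `alpha` makes the tests more stringent, which mirrors the role of the significance threshold described in the abstract.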

Information

Type
Methods Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press
Figure 1. Multidata causal feature selection applied to TC prediction: After reducing the dimensionality of spatiotemporal fields to yield time series for several TC cases (Step I), the ensemble of aligned time series is fed to the multidata causal discovery algorithm to calculate the optimal set of causal drivers (Step II), which can be fed to a regression algorithm to make robust predictions (Step III).

Figure 2. (a) Causal-MLR models using M-PC$_1$ systematically outperform LSTMs (dashed line) on all sets and their noncausal counterparts (solid lines) on the validation and test sets. (b) Causal-RF models outperform their noncausal counterparts (solid lines) on the training and validation sets.

Table 1. $R^2$ score for each experiment’s best model on the validation set, along with the number of selected features (in parentheses)

Figure 3. While (a) both causal and noncausal models fit the training set better when their number of features is increased, M-PC$_1$ causal feature selection provides the best generalization to unseen cases in the validation set (b) and the test set (c, with a zoomed-in version in d), especially when the number of input features is below 100 (d). For all methods, selected features are fed to MLR for predicting maximum winds for WPAC TCs at a lead time of 1 day.

Figure 4. Most frequent and significant predictors used by the best causal-MLR model, organized by (a) top nine meteorological variables; (b) pressure level; and (c) time lag. (d–f) Most frequently selected features for the lag correlation method. For the two rightmost columns, we retained the four most frequent features: relative humidity (inner), vertical velocity (inner), and horizontal divergence (inner and outer).

Supplementary material: PDF

Ganesh et al. supplementary material
