S ELECTING R OBUST F EATURES FOR M ACHINE L EARNING A PPLICATIONS USING M ULTIDATA C AUSAL D ISCOVERY

Impact Statement : While causal feature selection helps design more robust ML models, its joint application to multiple dataset remains limited because standard causal discovery algorithms output a different set of drivers for each dataset, which is impractical. To mitigate this issue, we apply a newly-developed “multidata” causal feature selection approach, which identiﬁes a single set of optimal causal drivers from an ensemble of time series. Applied to the statistical prediction of TC intensity, our approach outperforms standard feature selection methods by helping simple regression algorithms better generalize to unseen cases. In addition to making our models robust, causal feature selection also eliminates redundant predictors while identifying new ones, leading to lighter models and aiding scientiﬁc discovery.


Introduction
Machine Learning (ML) combines statistical methods and numerical optimization to learn a group of tasks from data.Progress in computational capabilities, combined with the availability of large amounts of data, allows the development of ML models to predict and understand nonlinear systems such as climate processes and extreme weather events.For environmental applications, processing big data that are nonlinearly related often requires (i) dimensionality reduction; and (ii) strategically selecting the model's features to make ML models cheaper to run, generalize better, and easier to explain [1,2,3].This Article compares different methods to discover a subset of the most relevant features in environmental datasets [1,4,5] and explores the effect of causal feature selection on statistical prediction skill.For this, we work at the intersection of causal inference and ML, an active area of research [6] because causal relations help acquire robust knowledge beyond the support of observed data distributions [7].Causal inference can broadly be categorized into three research directions: (i) causal representation learning; (ii) causal discovery; and (iii) causal reasoning [7,8].To select features, we here explore the use of causal discovery, a methodology for learning qualitative cause-and-effect relationships between a collection of variables from data that have not been obtained under controlled experimental conditions [9,10].Incorporating causal relationships in ML models via feature selection can make ML models more interpretable [1,11,3,12] and less susceptible to overfitting [13,14,15].
There are two main challenges when applying causal discovery in environmental sciences.The first challenge is algorithmic: Often environmental data consists of multiple realizations of the same process with slight differences, and causal discovery algorithms that apply to such multiple realization problems remain underexploited [2].The second challenge is the lack of benchmarking: Causal feature selection is rarely compared against other feature selection methods for ML-based predictions.Here, we address these two gaps by introducing a causal feature selection framework to estimate causal relationships from multiple time series datasets [15,11,16,17,18,19].We compare feature selection algorithms by training simple ML prediction models for each of the selected sets of features and evaluating their predictive performances.Our framework is applied to the prediction of Tropical Cyclone (TC) intensity to demonstrate that causal feature selection (i) improves the out-of-sample skill, and (ii) uncovers the best predictors in real-world situations.

Methodology: Causal Feature Selection for Multiple Realizations
Our implementation of causal feature selection [20,21,22] uses the recently-developed Multidata (M) functionality for two causal discovery algorithms based on time-series, explained below.Our multidata causal discovery approach used to pre-select causally relevant predictors has two steps: (i) the causal discovery algorithms; and (ii) applying these algorithms to a dataset comprising data from multiple sources.From a causal perspective, the setup used in this study is simplified because only the variables that are time-lagged with respect to the target variables are considered potential predictors.As a result, causal discovery effectively reduces to a feature selection algorithm that removes all those predictors which are (conditionally) independent of the target (given the other predictors) and which hence do not provide any additional information for predicting the target.The multidata functionality itself, however, is more general and also applies to the time series causal discovery tasks that also consider contemporaneous causal relationships.
Here, we explore the use of the causal discovery algorithms PC 1 and PCMCI [17] for causal feature selection.The PC 1 algorithm is a variant of the PC algorithm [9].First, PC 1 initializes the potential causal drivers (  ) of each target variable   as the set of all variables (  ) = {  − |  = 1, . . .,   ,  =   , . . .,  max } within the considered range [  ,   ] of time lags, where the   with  = 1, . . .,   are the potential predictors and where   and   respectively are the minimal and maximal time lags at which direct causal influences can occur..Then, PC 1 iteratively removes variables from (  ) that are irrelevant or redundant for the prediction of   .Specifically, PC 1 removes elements   − from (  ) that are conditionally independent of   given subsets   ⊆ (  ) whose cardinality  increases iteratively.For  = 0 all   − with   − ⊥ ⊥   are removed, for  = 1 those with   − ⊥ ⊥   | 1 where  1 is the strongest driver from the previous step, for  = 2 those with   − ⊥ ⊥   | 2 where  2 are the two strongest drivers from the previous step, and so on.In this work, conditional independence is tested using partial correlation (in general, though, the algorithm can be combined with any conditional independence test).The corresponding independence test is based on a standard significance level   = 0.02 and uses a strength of association that is based on the absolute partial correlation value.This iteration is continued until  is greater than the cardinality of (  ).The PC algorithm is different from PC 1 in so far as that PC, for every , does not only test for conditional independence given exactly one cardinality  subset of (  ) but tests for conditional independence given all cardinality  subsets of (  ).
The PCMCI algorithm [17], after first running PC 1 , reinitializes all links and then subjects all links to the momentary conditional independence (MCI) tests   − ⊥ ⊥   | (  ) \ {  − }, (  − ), removing the link if independence is not rejected.Here, the condition on (  − ) helps to remove false positives that tend to be inflated due to autocorrelation.Controlling false positives is important for a causal discovery setting but is not necessarily important for a causal feature selection/prediction setting as considered in this paper.Within the study presented here, we employ both the PC 1 and the PCMCI algorithm to empirically analyze which of the two methods is preferable for causal feature selection.As with all causal discovery methods, PC 1 and PCMCI rely on certain assumptions.The essential assumption is that conditional independencies in the data are in one-to-one correspondence with -separations [23] in the causal graph [24,21,9,25].Moreover, both methods assume causal sufficiency [9], i.e., the absence of unobserved variables that causally influence two observed variables.The latter assumption is not necessarily fulfilled in our context, even though we included a range of potential predictors.This means that some of the obtained causal features might still be spurious and may not work if the target distribution differs from the training distribution (out-of-distribution prediction).
When PC 1 and PCMCI are applied to a single multivariate time series, samples are drawn from this time series in a sliding-window fashion.The drawn set is internally passed to the statistical hypothesis tests of conditional independencies.For this sliding-window approach to be valid, the causal relationships need to remain unchanged throughout the time series (causal stationarity assumption).When PC 1 and PCMCI are applied to multiple multivariate time series, if all time series of this ensemble can be assumed to share the same causal relationships within specific time ranges, then we can combine the sample sets from all ensemble members1 into a single, larger dataset.This larger dataset, which includes data from multiple sources (e.g., from multiple storms), is then internally passed to the conditional independence tests.The PC 1 and PCMCI algorithms can then proceed as usual.Consequently, although the input is an ensemble of multivariate time series, the output is still a single set of predictors.In addition to its practicality, our multidata approach benefits from an enlarged set of samples, increasing the power of the conditional independence tests.Hence, we expect the sets of predictors obtained by running multidata causal discovery on an ensemble of time series to be more reliable than the sets of predictors obtained by running causal discovery on any single member of this ensemble-if the assumption of a common causal structure holds.An alternative approach would be to run PC 1 and PCMCI on any single member time series and then appropriately aggregate the resulting sets of predictors (across the members).
3 Application: Statistical Prediction of Tropical Cyclone Intensity

Motivation
The increasing frequency of intense TCs [26,27] combined with growing coastal populations have escalated the vulnerability of the tropical urban coasts.For context, more than half of Earth's population is projected to live in the Tropics by 2050 [28] and more than a billion people worldwide could be living in low-elevation coastal zones by 2060, particularly in South and East Asia [29].Predicting storm intensity changes, including rapid intensification in TCs, remains a major challenge [30], because of unresolved complexities of storm dynamics in numerical models.Furthermore, numerical models suffer from a reduction in forecast skills with an increase in lead time [31].An alternative to numerical forecasting is statistical forecasting, as statistical models can improve cyclogenesis and intensity forecasts [32,33].For instance, statistical models based on logistic regression, random forest, decision tree, and randomized decision trees [34] outperformed the National Hurricane Center in predicting TC rapid intensifications over the Atlantic and Eastern Pacific basins [35,36].A potential drawback of statistical models is that it is often difficult to choose appropriate predictors for reliable forecasts.To better predict TC intensity, the models need to represent the physical mechanisms behind TC intensification more accurately; these include large-scale circulations, local conditions, and internal processes [37,36].We argue that one way to make statistical models more robust is to apply causal discovery algorithms and eliminate causally-spurious predictors.In this study, we apply the PC 1 and PCMCI methods to generate a single set of causally relevant predictors from multiple TC time series.

Data
The TC dataset is created using multiple environmental variables at different pressure levels known to be favorable for TC intensification [38,39,40] from the global high-resolution ECMWF ReAnalysis-5 [41](ERA5) with 25 km horizontal resolution, and 3-hourly temporal resolution (see SI -section A for the full list).Here, we selected a total of 260 TC cases spanning from 2001 to 2020 in the Northwest Pacific basin (WPAC).The TCs with a lifetime of more than 6 days up to landfall are selected for the study to understand the effect of environmental parameters on TC intensity up to 3 days time lag, so each case has a time span from genesis up to landfall based on each TC best track.TC tracks are obtained from the International Best Track Archive for Climate Stewardship [42].Rather than directly feeding the time series of predictor variables for the cases at each gridpoint around the storm, the values are averaged in horizontal areas defined with respect to the TC Center.Each atmospheric predictor is post-processed into two sets of time series representing inner-core (TC center to a radius of 200 km) and outer-core characteristics (annulus from a radius of 200-800 km).The choice of averaged areas follows the current practice in operational statistical intensity prediction schemes [43].The distinction between outer-core and inner-core processes is justified because TC intensity is affected by environmental conditions in the storm's neighborhood and internal processes within the storm [44,45].From an ML perspective, this preliminary dimensionality reduction removes features with high spatial correlations, reduces the complexity of the statistical models, and possibly improves model generalizability by removing some of the predictors' spatial heterogeneity in different storms.Once this preliminary dimensionality reduction is done, our goal is to eliminate spurious features, here defined based on meteorological variable, vertical (pressure) levels, time lag, and horizontal averaging sector (inner or outer core).We describe each TC using a total of   = 234 time series of horizontally-averaged 3D variables at given vertical levels and 2D variables (see SI Table 1 for details).With regards to the time lags, we explore the time steps between 24h before the target (corresponding to   = 8) and 72h before the target (corresponding to   = 24).This results in a total number of 3978 (234 potential predictors × 17 time steps) for the causal algorithms, which eliminate the spurious links between the features and the targets.We randomly split the data by TC to avoid spatiotemporal correlation: Out of the selected 260 TCs, we randomly split 205 cases from 2001-2020 into a training set (150 cases) and validation set (55 cases) while keeping 55 cases from recent years (2017-2020) in the test set, without any overlaps.The regression task is to forecast three variables with a 1-day lead time, including (1) Maximum wind speed at 10m (Max.10m Wind, in m/s), (2) Minimum Sea-Level Pressure (MSLP, in hPa), and (3) horizontally-integrated total precipitation (Tot.Intg.Precip. in  2 ) accumulated over 3-hrly intervals.Maximum sustained wind speed at 10m (averaged over 1min, 3min, or 10min depending on the Regional Specialized Meteorological Centre) is the standard measurement for the intensity currently used operationally.We include MSLP as it correlates better with TC damage [46,36].Additionally, MSLP is easier to estimate as it is an integrated quantity and only requires a couple of measurements near the storm center.Finally, we included total integrated precipitation as a potential target because most fatalities and damage from TCs are caused by heavy precipitation and storm surges [47].
Figure 1: Multidata Causal Feature Selection applied to TC prediction: After reducing the dimensionality of spatiotemporal fields to yield time series for several TC cases (Step I), the ensemble of aligned time series is fed to the multidata causal discovery algorithm to calculate the optimal set of causal drivers (Step II), which can be fed to a regression algorithm to make robust predictions (Step III)

Causal Machine Learning
In this section, we describe the feature selection methodology in the context of TC intensity prediction.Our causal feature selection framework is shown in Fig 1 .Once the four-dimensional fields have been reduced to time series, we align the time series in the training set based on the minimum pressure value recorded during each TC's lifetime, which is a smoother measure of TC intensity than maximum surface wind speeds [48].Temporally aligning the multivariate time series of different ensemble members is key, as the resulting ensemble is more likely to satisfy the common causal structure assumption, improving prediction skills using causal feature selection.After aligning the time series, we feed the training set as inputs to the PC 1 and PCMCI algorithms (both implemented in tigramite) to extract the most significant causal features.Here, an input feature may be defined as an environmental or derived variable (see SI section A, Table S1) at any given pressure level which is spatially averaged either by the inner or outer core radii at a given 3 hourly time-step.We stress that PC 1 and PCMCI are only applied to the training data.The implementation of both PC 1 and PCMCI contains several hyperparameters, including minimum and maximum time lags for the analysis, fixed to 1 day (8 timesteps) and 3 days (24 timesteps), respectively, for our prediction task.Further tunable hyperparameters are the significance level for conditional independence testing (  ) and a significance threshold for the p-value matrix (alpha-level, only used for PCMCI), which control the selected number of features.Once PC 1 and PCMCI have selected the most significant causal drivers of the targets from the ensemble of time series, these drivers are used as inputs to the prediction model.We logarithmically vary the values of   and alpha-level, which in effect controls the number of selected features, as more stringent (  ) values will result in the selection of fewer and more significant features.From this set of experiments, we determine the best hyperparameters suitable for each target of interest by maximizing the validation performance of the trained regression models.We use Multiple linear regression (MLR) and Random Forest (RF) Regression models to predict the targets from the causally selected features.The MLR algorithms for Causal-MLR experiments were prepared using the Scikit Learn [49] implementation of the Linear Regression algorithm and its corresponding default parameters.Each MLR algorithm was trained to predict one of three unscaled target variables using the selected, standard-scaled features.We also included RF regression models using the same causal feature selection set-up (Fig 1 ) to explore the impact of causal feature selection for more complex nonlinear regression methods.We used the RF regressor from the scikit learn library [49] to prepare the causal-RF and optimized its hyperparameters with a randomized search.

Non-Causal Machine Learning Baselines
This study is motivated by the working hypothesis that regression models using causally-selected features outperform non-causal baselines in terms of generalization.Here, non-causal baselines subsume both the case of no feature selection and the case of feature selection based on non-causal criteria.We compare our causal feature selection to non-causal feature selection methods such as lagged correlation, random selection as well as an eXplainable Artificial Intelligence (XAI) method of feature selection using RF regression (More details are provided in SI section C.).To test our causal approach's ability to effectively use time lags, we also train a Long Short-Term Memory (LSTM) network using all time lags between   and   and without feature selection.We implement the LSTM using the PyTorch [50] library and conducted a hyperparameter search with the Optuna [51] framework.A more detailed description of the architecture is provided in SI Section C.

Performance of Causal Machine Learning
Our first objective is to find the set of causal features that are best linked to the intensification in TCs at a lead time of 1-day.To measure the suitability of a set of causal features, we evaluate how MLR as well as RF models trained with the causally-selected features perform when predicting TCs that are unseen during training.We evaluate prediction skill holistically (see SI section D), but focus on the coefficient of determination (2 ) in the main text, with  2 = 1 corresponding to a perfect prediction and  2 = 0 to an error of one standard deviation.In Fig 2, we show the performance of Causal-ML models to predict the maximum winds, 24 hours in advance using M-PC 1 method.A less stringent significance threshold results in a larger set of features that are retained during training, which has a clear negative effect on the model validation performance.We found that feature selection using M-PCMCI is comparable to M-PC 1 when the minimum time lag is 6 hours (shown in SI section B), but the performance of M-PCMCI drastically deteriorates when we increase the minimum time lag to 1 day.Here, we only show the causal ML results based on M-PC 1 here.For comparison, similar experiments with reduced lead times where minimum and maximum time lags are set to 6 hours and 2 days, respectively, are shown in SI Section B. PCMCI's main advantage is to better control false positives in the presence of strong autocorrelation [17], which is more important for an actual causal discovery setting than for the causally-informed prediction setting considered in this paper.
Causal MLR scores are better than those of non-causal MLRs (Fig 2a), which use all inputs without feature selection.The non-causal baselines clearly overfit the training set.The Causal MLR is also compared with a recurrent neural network using an LSTM layer, and Causal MLRs outperform the best LSTM model for all targets, which is remarkable given their simplicity.When comparing the RF models, we find that Causal-RF scores are better than non-Causal-RF for the validation set, while test set scores are comparable for wind speed predictions.In general, the ( 2 ) values are highest for the predictions of MSLP, followed by wind and total integrated precipitation (Fig 2 , S1, S2).The optimal set of causal drivers that performs well on the training and validation sets (without leading to overfitting) is sparse, containing only 31 features in the causal MLR case for predicting maximum wind (Table 1).This result suggests that many of the features are spuriously linked to TC intensity and can be removed without sacrificing the predictive skill of simple MLR models compared to a non-causal RF baseline.Similar results are obtained for the prediction of other targets, as shown in Fig S1, S2 in SI section B.

Comparison with Feature Selection Baselines
Our second objective is to compare the performance of Causal MLRs to MLRs with non-causal feature selection baselines (described in SI section C).For the maximum wind predictions (Fig 3), PC 1 consistently outperforms the two simpler feature selection baselines (random and lagged correlation), especially on the validation and test sets (Fig 3b,c).Lagged correlation, in particular, selects sets of predictors that perform very poorly in comparison, especially during validation.The ability of an XAI-based feature selection to capture nonlinearities seems to improve predictor selection, resulting in  2 values that are almost comparable to the PC 1 causal feature selection method.Nevertheless, the causal PC 1 method retains an advantage for very sparse models (less than 50 input features), suggesting that the initial selection of causally-relevant predictors allows these sparse models to beat the corresponding noncausal sparse models.PC 1 performs better than the other methods for the two other targets, which is shown in SI section C (Fig S3 -S6).We note that lagged correlation performance is comparable to Causal-MLR and XAI-based feature selection in predicting Total integrated precipitation.This motivates adapting our conditional independence tests for non-normal distributions, which we leave for future work.The performance comparison based on ( 2 ) of all best ML models used in the study, along with their number of input features, is listed in Table 1.

Optimal Causal Features
Here, we expand upon our results from the previous section to understand why causal-MLR models outperform the models that use other feature selection methods.For this purpose, we rely on the frequency of predictor selection (across the models)by the best-performing causal-MLR and lagged correlation MLR models 2 .We find that both methods choose different predictors for the maximum wind predictions while identifying inner core relative humidity as critical for wind prediction.However, divergence is a major predictor in the causal-MLR (Fig 4a .),despite not being in the 10 most frequently selected predictors for the lagged-correlation models (Fig 4d .).The vertical distribution (Fig 4b,e) and the time lag information (Fig 4c,f) of the most frequently chosen features reveal several differences in the causal models as compared to the lagged correlation models.Unlike the lag correlation method, the causal method selects features at  1:  2 score for each experiment's best model on the validation set, along with the number of selected features (in parentheses).Causal-MLR gives the best performance with the least features. .PC 1 then ranks features based on significance test statistics, which gives a good measure of predictor relevance.Once   is in the parent set (  ), its neighbors will be iteratively removed because they are not conditionally independent of   .This confirms the interpretation that the selected time lags and vertical levels are "most predictive" of   , and that the spatiotemporal neighbors of   are eliminated because of redundancy, which is due to the high spatiotemporal correlations in our dataset.Next, from a scientific viewpoint, the causal models clearly emphasize the low-level inner-core convergence (div), middle and upper tropospheric relative humidity (RH, rhum)in the inner core and the upper-level divergence (outdiv) in the outer core as most important predictors whereas the lagged correlation models rely on middle-and upper-tropospheric vertical motions (vvel) for the prediction task.Finally, in the time-lag plots (Fig 4c,f), the divergence links in the causal models are chosen at time lags of more than 2 days (-60hr<t<-50hr), while lagged correlation models focus on features at the shortest lead times (t>-48hr) as they are more correlated with decreasing time lags.

Feature selection using Multidata Causal Discovery
Causal-MLR models rely on low-level convergence and upper-level divergence at longer time lags.In contrast, the lagged correlation MLR models mostly rely on mid-to upper-tropospheric vertical motion at shorter lead times.One way to interpret this is that the mid-tropospheric vertical motion could be a confounder, which is removed by the PC 1 algorithm.In this case, the difference in generalization skill may be attributed to the lagged correlation MLRs making predictions based on causally-spurious links.The causal relationship involved here can be understood in mass adjustment terms:mass conservation requires low-level convergence and upper-level divergence to be balanced by upward mass transport.This upward motion can invigorate convection and aid TC intensification.Hence, the vertical motion should be considered as a consequence of divergence rather than an independent process that drives TC intensification by itself.We believe that the removal of mid-level vertical motion in the PC 1 features shows that causal discovery algorithms can successfully remove causally-spurious links.

Conclusion
This paper described a causal feature selection framework to predict and understand complex geophysical events that can be considered multiple realizations of the same process with small perturbations.We applied this framework to multiple TC time series to identify common causal links and used them as predictors in MLR and RF regression models.
Our results show that causal feature selection is superior to traditional feature selection methods for finding sets of predictors that help regression models generalize to unseen TC cases, especially for very sparse models (Fig 3).Of the two causal methods, we find that the PC 1 algorithm is more appropriate for feature selection, as it only keeps the most informative features, effectively removes confounding features (e.g., mid-tropospheric vertical motion in Fig 4), and is less sensitive to the forecast lead time (Fig S3 -S5) than PCMCI.
Temporally aligning the multiple time series based on a common reference point before causal feature selection tangibly improves model prediction skills.The retention of spurious links in the lag correlation models negatively affected generalizability.From these observations, we conclude that causal feature selection holds potential in our continued effort to improve statistical TC intensity models.Future efforts will involve (1) assessing whether current operational intensity prediction baselines can be improved by the causality-based predictor selection; (2) expanding the study to all ocean basins; and (3) discovering new potential predictors that may improve operational TC intensity predictions.
While not studied in this paper, the multidata causal discovery also opens the possibility to analyze systems whose causal structure changes in time: If one can align the individual member time series on a common time axis and can assume that, although changing in time, their causal structures are the same, then a dataset for independence testing can still be created by taking one sample per member time series.Repeating this procedure for every time step would yield one set of predictors per variable and per time step.Similarly, one could obtain one set of predictors per variable and per time window in a sequence of time windows (useful for slowly varying causal structures).Finally, we note that the multidata approach does not rely on any particular causal discovery algorithm.Therefore, while not shown here, the multidata approach can also be employed with the PCMCI + algorithm [18], a variant of PCMCI that allows contemporaneous causal influences and the Latent-PCMCI (LPCMCI) algorithm [52], a variant of PCMCI that allows for contemporaneous causal influences and latent confounders (available within the open-source Python package tigramite).Lastly, one could further optimize predictions by selecting the subset of causal predictors with the highest (validation set-)skill as discussed from an information-theoretic perspective in [15].In our context, however, iterating through all feature subsets is computationally prohibitive.The noncausal feature selection baselines that are used in the main manuscript are described below.
Random Feature Selection is a sampling method where features are chosen randomly.Random sampling is analogous to drawing out a set of cards after shuffling without any criteria.Our implementation of this algorithm randomly selects a set of input features (size ranging from 10 to 1000) from all possible combinations of variables and time lags.
Lagged Correlation considers the absolute correlation between the prediction targets and different time-lagged input features.We adopted a kitchen sink approach where we obtained the correlation values between targets and all time-lagged variables by c.These correlation values are ranked and the features with the highest correlations are then chosen as MLR inputs.The size of these sets of features ranges from 10 to 1000.
XAI takes the training dataset to build a random forest regression model using Python's scikit-learn library.By using this baseline method, we explore whether the use of feature importance (when nonlinear relationships between variables and targets are included) can result in a better selection of features.The Gini feature importance as measured by the LSTM Neural Network -We prepared three Long Short-Term Memory (LSTM) recurrent neural networks as baselines, training the LSTM models on standard-scaled input data and configuring each LSTM to predict one of the standardscaled target variables (i.e., one of MSLP, precipitation, or Surface Wind).We implemented each LSTM as a sequential model using PyTorch; their architecture includes an LSTM layer, a dropout layer, a linear hidden layer, and a linear output layer.As we targeted standard scaled outputs, the output of the network needed to represent positive and negative values.To do this, we set the output activation function to the identity function and we set the hidden layer activation function to hyperbolic tangent.We selected the Adam optimizer and mean-square error loss for our training, and proceeded to conduct a hyperparameter search using the Optuna framework.The study employed 10 trials that tested LSTM and hidden layers 50-100 units wide, dropout rates between 0.0 and 0.5, and learning rates between 1e−4 and 1e−3.We note that we set the number of units in the LSTM layer and the hidden layer to be equal to each other in all conducted trials.

E. Optimal Causal Predictors
The predictors and time lags for the best causal-MLR model with time-aligned inputs for the prediction of maximum wind speeds 1-day in advance are shown in Table 5.

Figure 2 :
Figure 2: (a.) Causal MLR models using M-PC 1 systematically outperform LSTMs (dashed line) on all sets and their non-causal counterparts (solid lines) on the validation and test sets; (b.) Causal RF models outperform their non-causal counterparts (solid lines) on the training and validation sets

Figure 3 :
Figure 3: While (a) both causal and non-causal models fit the training set better when their number of features is increased, M-PC 1 causal feature selection provides the best generalization to unseen cases in the validation (b) and test sets (c and d zoomed-in version), especially when the number of input features is below 100 (d).For all methods, selected features are fed to MLR for predicting maximum winds for WPAC TCs at a lead time of 1 day

Figure 4 :
Figure 4: Most frequently and significant predictors used by the best Causal-MLR model organized by (a) top nine meteorological variables; (b) pressure level; and (c) time lag.(d-f) Most frequently selected features for the lag correlation method.For the two rightmost columns, we retained the four most frequent features (Relative humidity (inner), Vertical velocity (inner) and horizontal divergence (inner and outer).

Figure 5 :Figure 6 :
Figure 5: Performance of Causal ML for multiple tests by varying hyperparameter for the prediction of Minimum MSLP by causal-MLR (a) compared to noncausal ML (Solid lines) and LSTM (dashed lines), where as (b) shows the performance of Causal-RF compared to noncausal-RF (solid lines)

Figure 7 :
Figure 7: Comparison of M- 1 and M-PCMCI based Causal-MLR model performance for 6 hour predictions without time alignment and with time lags ranging from 6 hrs to 2 days for selected targets

Figure 9 :
Figure 9: Same as the previous figure, but for the prediction of total integrated precipitation

Figure 10 :Figure 11 :
Figure 10: Same as the previous figure, but for the prediction of Minimum MSLP

Figure 12 :
Figure 12: Same as Figure 7. but for predicting Total area integrated precipitation A comparison of the performance of Causal MLR to the performance of MLR models based on other feature selection baselines for the targets, Minimum MSLP and Total Integrated Precipitation are shown in Figures11 and 12respectively.The Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) of the best model prediction of Maximum Surface Winds, MSLP and Total integrated Precipitation on both the training, validation and test sets for the best ML models.All metrics signify a good performance for Causal-ML with far less number of inputs compared to the number of inputs from the best models using the Non-causal-RF and Non-causal MLR methods.For the best ML models used, RMSE are listed in Table3and MAE are listed Table4.

Table 4 :
MAE (lower values means better models)

Table 5 :
List of 31 causally linked predictors for Maximum wind (1 day lead-time) at significant time lags with best model using  1