Extreme heatwave sampling and prediction with analog Markov chain and comparisons with deep learning

We present a data-driven emulator, a stochastic weather generator (SWG), suitable for estimating probabilities of prolonged heatwaves in France and Scandinavia. This emulator is based on the method of analogs of circulation, to which we add temperature and soil moisture as predictor fields. We train the emulator on a run of an intermediate complexity climate model and show that it is capable of predicting conditional probabilities (forecasting) of heatwaves out of sample. Special attention is paid to evaluating this prediction with a proper score appropriate for rare events. To accelerate the computation of analogs, dimensionality reduction techniques are applied and their performance is evaluated. The probabilistic prediction achieved with the SWG is compared with that of a Convolutional Neural Network (CNN). With hundreds of years of training data available, CNNs perform better at the task of probabilistic prediction. In addition, we show that the SWG emulator trained on 80 years of data is capable of estimating extreme return times, of the order of thousands of years, for heatwaves longer than several days, more precisely than a fit based on the generalised extreme value distribution. Finally, the quality of the synthetic extreme teleconnection patterns obtained with the SWG is studied. We showcase two examples of such patterns, for heatwaves in France and Scandinavia, that compare favorably to a very long climate model control run.


Introduction
Heatwaves are expected to be among the most impactful effects of climate change in the 21st century (Russo et al., 2015; Seneviratne et al., 2021). While climate scientists have long warned that global warming will increase the rate of heatwaves, in some regions, such as Europe, the trends have accelerated beyond what was expected (Christidis et al., 2020). Extreme heatwaves such as the Western European heatwave in 2003 (García-Herrera et al., 2010) and the Russian heatwave in 2010 (Barriopedro et al., 2011; Chen et al., 2022) damaged agriculture and posed health hazards. Such events have also affected other regions, including Asia, e.g. the Korean Peninsula in the summers of 2013, 2016 and 2018 (Choi et al., 2021; Min et al., 2020). Thus, understanding the potential drivers of heatwaves, forecasting them, and performing long-term projections is necessary. These tasks, also relevant for climate change attribution studies (National Academies of Sciences, Engineering, and Medicine, 2016), are difficult since they require the computation of small probabilities and thus either massive ensemble simulations with expensive models or drawing conclusions from the relatively short observational record.
The possibility of using atmospheric circulation analogs for predicting weather dates back to the works of Bjerknes (1921) and Lorenz (1969). The idea is quite intuitive: one expects weather patterns to repeat given similar initial conditions. However, estimates of the amount of data (a catalog of analogs) actually necessary for short-range predictions revealed the superiority of physics-based Numerical Weather Prediction (NWP) models (Balaji, 2021; van den Dool, 2007). On the other hand, for predicting certain types of large-scale patterns, analog forecasting may require fewer resources and can be competitive with, or superior to, NWP on time scales longer than the Lyapunov time, such as Subseasonal-to-Seasonal (S2S) scales (Van den Dool et al., 2003; Cohen et al., 2019), and especially for long-range predictions. Examples include the prediction of ENSO (Ding et al., 2019; Wang et al., 2020) and of the Madden-Julian Oscillation (Krouma et al., 2023).
Analog forecasting involves constructing potential trajectories that the system could have explored from a given state. These methods, which assume a Markov property (Yiou, 2014), belong to the family of methods referred to as Stochastic Weather Generators (SWGs). SWGs can be used as climate model emulators because they generate long sequences from which statistical properties of the spatio-temporal dynamics and correlation structures of the system can be inferred without running expensive General Circulation Models (GCMs) (Ailliot et al., 2015). Other applications include downscaling (Rajagopalan and Lall, 1999; Wilks, 1992) and data assimilation (Lguensat et al., 2017).
In the last decade there has been rapid advancement of other types of data-driven forecasting and emulation in earth system modelling, based on deep learning (Reichstein et al., 2019). For example, in a seminal work, Ham et al. (2019) showed that a Convolutional Neural Network (CNN) achieves positive skill for ENSO prediction compared to NWP. Such success stories are sometimes matched by similar skill from the analog method (Ding et al., 2019; Wang et al., 2020). Recently, deep learning has been used to predict extreme heatwaves (Chattopadhyay et al., 2020; Jacques-Dumas et al., 2022; Lopez-Gomez et al., 2022), targeting categorical scores related to hit rates. To provide probabilistic predictions, subsequent studies employed random forests (van Straaten et al., 2022) and neural networks (Miloshevich et al., 2023a). Probabilistic forecasting, in the context of uncertainty quantification, plays an important role in the current development of machine-learning-driven techniques applied to, for instance, post-processing of weather forecasts (Schulz and Lerch, 2022; Grönquist et al., 2021).
Geopotential height anomalies and soil moisture were chosen as the inputs in Miloshevich et al. (2023a) based on physical understanding of the precursors to heatwaves. Persistent positive geopotential height anomalies are known drivers, since they favor clear skies and produce subsidence. It has been argued that soil moisture carries memory of previous land-atmosphere conditions (Koster and Suarez, 2001; Seneviratne et al., 2006). In fact, soil moisture was identified by Seneviratne et al. (2012) as among the important drivers of European heatwaves. In contrast, in Northern European regions soil moisture may play a lesser role in preconditioning (Felsche et al., 2023). The association between soil moisture deficits and heatwave occurrence has been documented widely (Shukla and Mintz, 1982; Rowntree and Bolton, 1983; D'Andrea et al., 2006; Vautard et al., 2007; Seneviratne et al., 2010; Fischer et al., 2007; Lorenz et al., 2010; Stéfanon et al., 2012; Hirschi et al., 2011; Schubert et al., 2014; Zhou et al., 2019; Benson and Dirmeyer, 2021; Zeppetello et al., 2022; Vargas Zeppetello and Battisti, 2020; Miralles et al., 2014, 2019), contributing to the understanding that in dry regimes heatwaves can be amplified through reduced evapotranspiration. In contrast, the SWG of Jézéquel et al. (2018), designed for heatwaves, did not take this crucial input and thus was not able to reproduce the corresponding feedback. It is important to better understand the contributions of such drivers in order to improve projections of climate change impacts on heatwave probabilities (Berg et al., 2015; Field et al., 2012; Horton et al., 2016). For instance, using circulation analogs in a pre-industrial simulation, Horowitz et al. (2022) showed that circulation patterns explain around 80% of the temperature anomalies in the United States, while negative soil moisture anomalies added a significant positive contribution, especially mid-continent.
Recently, the analog method was successfully adapted to predicting chaotic transitions in a low-dimensional system (Lucente et al., 2022a). In fact, the analog method is generally expected to perform better when the number of relevant degrees of freedom is not too high. This is natural given that it belongs to the family of k-nearest-neighbor algorithms, which suffer from the curse of dimensionality (Beyer et al., 1999). Consequently, for our problem of heatwave prediction we will consider linear and nonlinear dimensionality reduction techniques and evaluate the performance of the SWG in real vs latent space. This approach is partially motivated by the emergence of generative modelling for climate and weather applications, e.g. studies combining deep learning architectures with Extreme Value Theory (EVT) for generating extremes (Bhatia et al., 2021; Boulaguiem et al., 2022) and realistic climate situations (Besombes et al., 2021).
Beyond finite-horizon probabilistic prediction, data-driven methods can be used to create emulators capable of estimating risks of "black swan" events, events so far removed from the climatology that they have not been observed yet but whose impact may be devastating. Such risk assessments are often carried out using EVT based on reanalysis, with long runs of high-fidelity models (methods such as ensemble boosting (Gessner et al., 2021)) or with rare event algorithms (Ragone et al., 2018; Ragone and Bouchet, 2021). More recently, other statistical approaches such as Markov state models (Finkel et al., 2023) have been used to estimate the rates of extreme sudden stratospheric warming events based on ensemble hindcasts of the European Centre for Medium-Range Weather Forecasts (ECMWF).
The first goal of this study is to compare the performance of two data-driven probabilistic forecasting approaches for extreme heatwaves in two European regions: France and Scandinavia. We aim to explore the use of a Stochastic Weather Generator (SWG) that relies on the method of analog Markov chains, optimize it for our task, and compare or combine it with more modern deep learning approaches such as CNNs. We will investigate ways to accelerate the computation of analogs using dimensionality reduction techniques, such as training a Variational Autoencoder (VAE) to project the state of the system onto a small-dimensional latent space. Our second goal is to generate synthetic time series with the help of the SWG trained on a short model run, and to compare the statistics of extreme events that are under-resolved in this short run to a long control run. We will work with data from PlaSim, an intermediate complexity General Circulation Model (GCM). The choice of the right metrics is important, since many current data-driven forecasts are trained on the Mean Square Error (MSE), which is not suited for the evaluation of extremes. Yet extremes represent the most immediate societal risks (Watson, 2022) and are the subject of this manuscript. The choice of GCM is motivated by the ability of PlaSim to generate inexpensive long runs, and by the recent study of Miloshevich et al. (2023b), who showed that the composite statistics of the large-scale 500 hPa geopotential height fields, conditioned on heatwaves, reveal similar patterns for PlaSim as for higher-fidelity models such as CESM and the ERA5 reanalysis.

[Figure 1. The map shows the relevant areas. Blue: North Atlantic - Europe (NAE); the dimensions of this region are 22 by 48 grid cells. Red+Blue: Northern Hemisphere (above 30 N); the dimensions of this region are 24 by 128. Green: France; Purple: Scandinavia.]
The paper is organised as follows: Section 2.1 describes the details of our dataset generated by the PlaSim simulation; Section 2.2 defines the notation and the heatwave events. Section 2.3 delineates the goal of the probabilistic inference, the scoring function which determines the goodness of the predictions, and the training and validation protocol. Section 2.4 reviews the CNN introduced in Miloshevich et al. (2023a); Section 2.5 introduces the analog Markov chain, i.e. the SWG, and the corresponding steps involving coarse-graining, the definition of the metric, how to make probabilistic predictions with the SWG, and how to construct return time plots. Section 3 spells out the results, with Section 3.1 covering probabilistic forecasting and Section 3.2 discussing extended return time plots and teleconnections. Finally, Section 4 concludes the findings of this study.

PlaSim model data
The data have been generated from 80 batches, which are independent 100-year-long stationary simulations of PlaSim (Fraedrich et al., 2005, 1998). PlaSim is an intermediate complexity General Circulation Model (GCM), suitable for methodological developments like the one performed here. It can be run to generate long simulations at a lower computational cost than the new generation of the CMIP6 ensemble.
PlaSim consists of an atmospheric component coupled to ice, ocean and land modules. The atmospheric component solves the primitive equations of fluid dynamics, adapted to the geometry of the Earth, in spectral space. Unresolved processes, such as boundary heat fluxes, convection and clouds, are parameterized.

Table 1. Break-up of the input data into training and validation for the different subsets of the full D8000 dataset. In D100 and D500 we follow 5-fold cross-validation, while in D8000 exceptionally 10-fold cross-validation was employed. The latter implies that we slide the test window 10 times so that it explores the full dataset, while the training set is always the complement.

abbreviation | training years | validation years
D100         | 80             | 20
D500         | 400            | 100
D8000        | 7200           | 800

The interactions between the soil and the atmosphere are crucial for this study. They are governed by the bucket model of Manabe (1969), which controls the soil water content by replenishing it from precipitation and snow melt and depleting it through surface evaporation. The soil moisture capacity is prescribed based on its geographical distribution.
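For readers who want the gist of the land-surface scheme, here is a minimal single-cell sketch of a Manabe-style bucket update in python. The variable names, the daily time step and the beta-law throttling of evaporation are illustrative assumptions on our part, not the PlaSim source.

```python
import numpy as np

def bucket_step(w, precip, snowmelt, pot_evap, w_cap, beta_crit=0.75):
    """One daily update of a Manabe-style bucket (illustrative, not PlaSim code).

    w         -- soil water content [m]
    precip    -- daily precipitation reaching the soil [m/day]
    snowmelt  -- daily snow melt [m/day]
    pot_evap  -- potential evaporation demand [m/day]
    w_cap     -- prescribed soil moisture capacity [m]
    """
    # Evaporation is throttled when the bucket is drier than a critical level.
    beta = np.clip(w / (beta_crit * w_cap), 0.0, 1.0)
    evap = beta * pot_evap
    # Replenish from precipitation and snow melt, deplete by evaporation;
    # water in excess of the capacity is lost as runoff.
    return np.clip(w + precip + snowmelt - evap, 0.0, w_cap)
```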
The model was run with conditions corresponding to the Earth's climate in the 1990s, with fixed greenhouse gas concentrations and with boundary conditions (incoming solar radiation, sea surface temperatures and sea ice cover) that are cyclically repeated each year. We also include the daily cycle in this work (as in Miloshevich et al. (2023a)). The model is run at T42 resolution, which corresponds to cells of 2.8 by 2.8 degrees and results in a 64 by 128 grid over the whole globe. The vertical resolution is 10 layers. The fields are sampled every 3 hours and daily averages are taken.
In this paper we work with different subsets of the full 8000 year long dataset defined in Table 1.

Physical fields and geographical domains
The inputs to our various data pipelines consist of fields F(r, t), where t is time and r ∈ D are discrete grid points of a domain of interest D. The fields are:

• T: 2-meter temperature
• Z: 500 hPa geopotential height
• S: soil moisture

We will consider three domains:

• D_NAE: North Atlantic - Europe (24 by 48 cells)
• D_France: the area of France, masked over land
• D_Scandinavia: the area of Scandinavia, masked over land

The areas of France and Scandinavia, which correspond to the geographical areas of the heatwaves, are defined in Figure 1. The data-driven methods introduced below typically accept some stacked version of the fields, which will be represented by the letter X:

X = (Z(r, t), T(r, t), S(r, t)),    (1)

where each component corresponds to a specific field.

Definition of heatwave events
As in Miloshevich et al. (2023a), we are concerned with the prediction of T = 15 day-long heatwaves. Our events of interest are thus defined through the time- and space-integrated anomaly of the 2-meter temperature T_2m,

A(t) = \frac{1}{T} \int_t^{t+T} \frac{1}{|D|} \int_D \big( T_{2m}(r, u) - E[T_{2m}(r, u)] \big) \, dr \, du,    (2)

where the symbol E denotes the average over the years conditioned on the calendar day, so that the integrand is the anomaly with respect to the daily climatology. A class of heatwaves is then defined by exceedance of a threshold, A(t) ≥ a. The symbol D corresponds to the domain of interest described in Section 2.2; |D| is the surface area of D, so equation (2) contains an area-weighted average. T is the chosen duration (in days) of the heatwaves. For probabilistic prediction we set T = 15 days, while for return times we relax this requirement and study a family of return time plots. The paper will involve two slightly different definitions of heatwaves based on equation (2), depending on the task to be performed. In Section 2.4.1 we will introduce probabilistic prediction, for which heatwaves are defined as exceeding a threshold a that is fixed once and for all, which allows several heatwaves per summer. In Section 2.6.6, instead, we will work with a block-maximum definition, which permits only a single heatwave per summer, so each year gets a unique value a; moreover, a range of durations T will be considered.
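As a concrete illustration, a minimal numpy sketch of the discrete version of equation (2) follows: area-average the temperature anomaly over D with cos-latitude weights, then take a forward T-day running mean. The array names and shapes are our assumptions.

```python
import numpy as np

def heatwave_index(t2m_anom, lats, mask, T=15):
    """A(t): forward T-day running mean of the area-weighted 2-meter
    temperature anomaly over the domain mask (a sketch of equation (2)).

    t2m_anom -- anomalies w.r.t. the calendar-day climatology, shape (days, nlat, nlon)
    lats     -- latitudes in degrees, shape (nlat,)
    mask     -- boolean domain mask (e.g. France), shape (nlat, nlon)
    """
    w = np.cos(np.deg2rad(lats))[:, None] * mask       # area weights on D
    t_D = (t2m_anom * w).sum(axis=(1, 2)) / w.sum()    # spatial average per day
    # forward running mean: A(t) = (1/T) * sum_{u=t}^{t+T-1} t_D(u)
    kernel = np.ones(T) / T
    return np.convolve(t_D, kernel, mode="valid")      # length days - T + 1
```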

Committor Function
Our goal is a probabilistic forecast of a committor function (Lucente et al., 2022a), in the nomenclature of rare event simulations. Based on equation (10) we define, for each state, the label y = y(t) as a function of the corresponding time t, with y(t) ∈ {0, 1}, so that y = 1 iff A(t) > a, where a is the 95th percentile of A. Our objective is to find the probability q of y = 1 at time t, given the state X observed at an earlier time t − τ (see Section 2.6.3):

q_\tau(t) = \Pr\big( y(t) = 1 \,\big|\, X(t - \tau) \big).    (3)

The parameter τ is referred to as the lead time, or sometimes the lag time. We stress that for non-zero τ it is the initial observed state X that is shifted in time, rather than the labels y; thus the events of interest (both y = 1 and y = 0) are kept the same, allowing controlled comparisons among different values of τ. Moreover, the season for which we compute the committor functions and compare them to the ground truth always involves the same days, June 1 to August 16, because the last potential heatwave may last T = 15 days and thus end on August 30 (the last day of summer in PlaSim).

Normalized Logarithmic Skill Score
The quality of the prediction will be measured with the Normalized Logarithmic Score (NLS) (Miloshevich et al., 2023a), due to its attractive properties for rare events (Benedetti, 2010) and its appropriateness for probabilistic predictions (Wilks, 2019). We briefly introduce the NLS below. The dataset is composed of N states {X_n}_{1 ≤ n ≤ N}, each labeled by y_n ∈ {0, 1}, so it can equivalently be represented by N independent pairs (X_n, y_n). The aim of the prediction task is to provide an estimate \hat{p} of the unknown probability p_y(x) = P(Y = y | X = x). It has been argued (Miloshevich et al., 2023a; Benedetti, 2010) that among the possible scores to assess the quality of a probabilistic prediction, the most suitable is the logarithmic score

S_N = -\frac{1}{N} \sum_{n=1}^{N} \big[ y_n \log \hat{p}_1(X_n) + (1 - y_n) \log\big(1 - \hat{p}_1(X_n)\big) \big].    (4)

It should be noted that the logarithmic score coincides with the empirical cross-entropy widely used in machine learning. The score S_N is positive and negatively oriented (the smaller the better). The value of S_N is not indicative in itself, but it is useful for comparing the quality of two different predictions. The NLS is therefore defined as an affine transformation of the logarithmic score, which allows us to compare the data-driven forecast with the climatological one. To be more precise, let \bar{p} be the climatological frequency of the extremes (\bar{p} = P(y = 1)) and S_N^{(c)} the logarithmic score of this climatological prediction, i.e.

S_N^{(c)} = -\big[ \bar{p} \log \bar{p} + (1 - \bar{p}) \log(1 - \bar{p}) \big].    (5)

Then the NLS is defined as

\mathrm{NLS} = \frac{S_N^{(c)} - S_N}{S_N^{(c)}}.    (6)

Thus the NLS is positively oriented (the larger the better) and bounded above by 1. Since we identify y = 1 with the 95th percentile of A(t), we have \bar{p} = 0.05. In other words, we always compare the results to the climatological baseline, which yields an NLS equal to zero; any prediction with a smaller NLS is completely useless from the probabilistic perspective. Finally, we stress that other scores devised for rare events, such as the MCC (Matthews, 1975), used most notably in Jacques-Dumas et al. (2022), are useful for categorical prediction but do not necessarily translate into high NLS when training neural networks, especially if early stopping is used, since the optimal epoch may vary depending on the chosen score.
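A minimal sketch of the NLS computation from validation labels y and predicted probabilities (the function name and the estimation of the climatological frequency from the sample are our choices; in the paper \bar{p} = 0.05 by construction):

```python
import numpy as np

def nls(y, p, eps=1e-12):
    """Normalized Logarithmic Skill Score (equation (6)); y in {0,1}, p = P(y=1|X)."""
    p = np.clip(p, eps, 1 - eps)
    # logarithmic score = empirical cross-entropy, equation (4)
    s = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    p_clim = y.mean()          # climatological frequency (0.05 in the paper)
    s_clim = -(p_clim * np.log(p_clim) + (1 - p_clim) * np.log(1 - p_clim))
    return (s_clim - s) / s_clim   # 0 = climatology, 1 = perfect prediction
```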

Training/Validation protocol for committor function
We split the dataset into a training set (TS) and a validation set (VS). The ML algorithm of choice is trained on the TS, and the various hyper-parameters are optimized on the VS depending on the target metric (see Section 2.4.2). Different methods are then compared on the VS, as will be done throughout this paper. The splits are based on Table 1 and correspond to selecting the first number of years; e.g. D100 implies choosing the first 100 years of the simulation.
To have a better idea of the spread of the skill (equation (6)) we rely on 5-fold cross-validation. This means that the TS/VS split is performed 5 times, sliding the VS window consecutively through the full data interval chosen for the particular study (D100, D500, etc.). Each TS/VS split is referred to as a "fold", i.e. there are 5 folds (numbered 0-4); for instance, for D100, fold number 3, the VS starts in year 60 and ends in year 79, while the TS is, as usual, the complement of that set. Notice that years are also numbered from 0. In order to ensure that each VS has the same number of extreme events we perform custom stratification (Miloshevich et al., 2023a): we shuffle the years of the simulation between the different folds so that each interval receives the same (or almost the same) number of extreme events, as sketched below. The procedure does not modify the mean benchmarks but tends to decrease the error bars, which strongly depend on the number of extreme events within the VS.
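The stratification can be sketched as a greedy assignment of years to folds that balances the count of extreme events; this is an illustrative stand-in for the paper's procedure, not its exact code.

```python
import numpy as np

def stratified_folds(events_per_year, n_folds=5):
    """Greedy assignment of years to folds balancing extreme-event counts
    (an illustrative stand-in for the custom stratification)."""
    order = np.argsort(events_per_year)[::-1]     # years with most events first
    fold_of_year = np.empty(len(events_per_year), dtype=int)
    load = np.zeros(n_folds)
    for year in order:
        k = int(np.argmin(load))                  # currently least-loaded fold
        fold_of_year[year] = k
        load[k] += events_per_year[year]
    return fold_of_year                           # fold index per year
```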

Convolutional Neural Network
The Convolutional Neural Network (CNN) we use for this study was proposed by Miloshevich et al. (2023a). Since the precise inputs are slightly different in this paper (and a whole new region, Scandinavia, is considered here for the first time), the training had to be repeated, but the hyperparameters were left unchanged.
In this manuscript the CNN accepts an input of 24 by 48 grid points over three fields (see equation (1)). We perform masking (Miloshevich et al., 2023a) of the temperature and soil moisture, i.e. values outside of the respective heatwave area (France or Scandinavia, depending on which heatwaves we study, see Figure 1) are set to zero to avoid spurious correlations. In this paper we also study the heatwaves occurring in Scandinavia, and for consistency we perform the masking over the corresponding area in that case. The inputs are then normalized so that each grid cell of each field has mean 0 and standard deviation 1.

[Figure 2. On the left, the three input fields are labeled directly on the plot. On the right, the target of the inference is displayed (for instance probabilistic prediction, as described in Section 2.6.5). Top: the direct CNN approach. Middle: the analog method (SWG), which is the main topic of this work and which is compared to the CNN for this task. Bottom: the option to perform dimensionality reduction of the input fields and pass them as the input on which the SWG is "trained". Subsequently, the SWG can be used to generate conditional or unconditional synthetic series (green boxes). This synthetic data can be used for making probabilistic predictions or estimating the tails of distributions, such as return time plots.]
The input is fed through three convolutional and two max-pooling layers; the intermediate output is flattened and passed through a dense layer, which is connected to the final output. The final output consists of two neurons and a soft-max activation, which constrains the outputs to the (0, 1) interval, consistent with the domain of probabilities. The architecture is trained with the Adam optimizer to minimize cross-entropy, as is appropriate for a classification task, and early stopping is applied, which halts the training when the average fold-wise NLS reaches its maximum.
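A hedged Keras sketch of an architecture of this type (three convolutions, two max-poolings, one dense layer, a two-neuron softmax head) follows; the filter counts and kernel sizes are our guesses, not the tuned hyperparameters of Miloshevich et al. (2023a).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(24, 48, 3)):
    """Sketch of the probabilistic-classification CNN; layer sizes are illustrative."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu"),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(2, activation="softmax"),   # (P(y=0), P(y=1))
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

# Early stopping on a validation metric, as in the text:
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)
```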

Stochastic Weather Generator algorithm
The principle of the Stochastic Weather Generator (SWG), as described in Yiou (2014), is based on the analog Markov chain method. The idea is to use existing observations or model output that catalog the dynamical evolution of the system (the dynamical history) in order to generate synthetic series.
[Algorithm 1: The algorithm generates a single synthetic trajectory by using the analogs of the training set (see Table 1). Sets of K nearest analogs of each sample are computed based on the metric given in equation (13). The states whose analogs we compute are constrained by the condition that their calendar days must lie between May 15 and August 30 (there are 30 days per month in PlaSim). This results in N_s = 105 · n_years states. There is an additional condition on the upper bound of the calendar day of the analogs of these days, which is August 27, preventing the SWG from accessing days outside the summer season, since the time step is τ_s = 3 days. The analogs are computed using the parallelized kd-tree search implemented in the scipy package of python, once per parameter α_0 (see equation (13) and the discussion in Section 2.6.3) and per fold.]

The method consists of constructing a catalog of analogs: states which have similar circulation patterns and other thermodynamic characteristics (in our case temperature and soil moisture). Each state in the catalog comes from a realization of the dynamics (in principle it could come not only from simulations but also from reanalysis). To construct a synthetic time series, one starts from any of these states and draws randomly (with uniform probability) one analog from a pre-defined set of K nearest neighbors. The similarity between states is assessed via a Euclidean metric in the feature space (Karlsson and Yakowitz, 1987; Davis, 2002; Yiou, 2014), as defined below (Section 2.6.3). The following step is to look up in the dynamical history how that analog evolved after a time τ_s, and to use the corresponding state as the next member of the synthetic time series (see Figure 3). The process is then repeated. The details of the implementation will become clear in the following sections; they are represented schematically in Figure 3 and encapsulated procedurally in Algorithm 1. The procedure is quite general, and it can be modified in many respects: for instance, the analogs are not always drawn with uniform probability (Lall and Sharma, 1996; Buishand and Brandsma, 2001; Rajagopalan and Lall, 1999; Yiou, 2014), the catalog of analogs is often constrained by a moving window, or the analogs from the same year may be excluded from the possible candidates (Yiou, 2014). The common rationale behind our choices is to prioritize methods that demand fewer resources and use as much information as possible.
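A condensed python sketch of Algorithm 1 using scipy's kd-tree follows; the variable names and the feature construction are ours (the actual metric is given in Section 2.6.3), and the calendar-day bookkeeping is reduced to a comment.

```python
import numpy as np
from scipy.spatial import cKDTree

def synthetic_trajectory(features, n_steps, K=10, rng=None):
    """One synthetic trajectory of the analog Markov chain (sketch of Algorithm 1).

    features -- (n_states, n_features) catalog of coarse-grained states, ordered
                in time; consecutive rows are one SWG step (tau_s = 3 days) apart.
    Returns catalog indices; temperature etc. are read off the catalog afterwards.
    In the real algorithm the calendar day of each analog is additionally capped
    (at August 27) so that stepping forward never leaves the summer season.
    """
    rng = rng or np.random.default_rng()
    tree = cKDTree(features)                      # parallelizable kd-tree search
    i = int(rng.integers(len(features) - n_steps - 1))   # random initial state
    path = [i]
    for _ in range(n_steps):
        _, nbrs = tree.query(features[i], k=K)    # K nearest analogs of state i
        j = int(rng.choice(nbrs))                 # draw one analog uniformly
        i = min(j + 1, len(features) - 1)         # follow its evolution tau_s later
        path.append(i)
    return np.array(path)
```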

Time step and coarse-graining
The first question that arises is the appropriate choice of the time step τ_s of the SWG (see the schematic in Figure 3). It is natural to use the synoptic time scale of cyclones, on which the dynamics partially decorrelates; this corresponds to τ_s = 3 days, as demonstrated by inspecting the autocovariance in Miloshevich et al. (2023b). This means that we are constructing a Markov chain with a time step consistent with ignoring synoptic time-scale correlations. One may expect that the results do not depend too much on τ_s as long as it does not exceed the total correlation time (see Appendix B for further details). Coarse-graining is applied as follows: F(r, t_0) is replaced by

\tilde{F}(r, t_0) = \frac{1}{\tau_c} \sum_{t = t_0}^{t_0 + \tau_c - 1} F(r, t).    (7)

The tilde will be suppressed without loss of generality, since the coarse-graining is applied throughout, except in the input to the CNN. For simplicity, we perform the coarse-graining with a window equal to the time step, τ_c = τ_s.
Note that we will also work with scalars obtained as area integrals F_D over the areas of France (D = F) and Scandinavia (D = S), defined as

F_D(t) = \frac{1}{|D|} \int_D F(r, t) \, dr.    (8)

This allows us to rewrite the definition of heatwaves (equation (2)) in a new, more concise notation,

A(t) = \frac{1}{T} \int_t^{t+T} T_D(u) \, du.    (9)

Taking the coarse-graining into account and setting T = 15 days, we can express equation (2) as the discretization

A(t) = \frac{\tau_c}{T} \sum_{j=0}^{T/\tau_c - 1} \tilde{T}_D(t + j \tau_c).    (10)

State space and metric
Next, we address how to combine fields of a different nature (units) in the SWG. The fields can be stacked in various ways; for instance, tuples can be constructed:

X(t) = \big( Z(r, t), \; T_D(t), \; S_D(t) \big).    (11)

This representation, adopted for the SWG and consisting of the geopotential field together with integrals over D, is to be contrasted with the input received by the CNN, where S and T are masked (as described in Section 2.5) outside the areas corresponding to the heatwaves, while the SWG receives only the integrated temperature and soil moisture (over the same heatwave areas).
In the case of the SWG, each field is normalized by its global field-wise standard deviation (with the global field-wise mean subtracted), as opposed to the cell-wise normalization used for the CNN:

\bar{F} = \frac{F - \langle F \rangle}{\sigma(F)}.    (12)

The overbar will be omitted in what follows and implied instead.
The Euclidean distance between two points X_1 and X_2 is defined as

d(X_1, X_2) = \sqrt{ \sum_i \frac{\alpha_i}{\dim(X^{(i)})} \left\| X_1^{(i)} - X_2^{(i)} \right\|^2 },    (13)

where \dim(X^{(i)}) counts the number of grid points concerned (1 for scalars such as S_D and T_D, and 1152 for Z). When α_1 = α_2 = 0 we recover a definition consistent with Yiou (2014), except that we assign uniform probabilities to the set of K nearest analogs.
Of particular interest is the fact that geopotential contains global information on a shorter time scale, while soil moisture contains local information on a longer time scale (Miloshevich et al., 2023a). To zeroth order, one expects that for an efficient algorithm the fields need to be rescaled by their dimensionality, as is done in equation (13); thus we choose the parameters α_i = 1 as a first guess. To optimize the performance of the SWG for the conditional prediction problem (equation (3)) we perform a grid search: for each τ we compute the committor q and measure its skill (equation (6)) as a function of the number of nearest neighbors K retained and of the coupling coefficient for geopotential α_0, while the other two coefficients are set to one: α_{1,2} = 1.
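Because equation (13) is a weighted Euclidean distance, it can be implemented by rescaling each block of features by the square root of α_i/dim_i and then querying a plain kd-tree, as in the earlier sketch. A python illustration (our names, assuming the field-wise normalization of equation (12)):

```python
import numpy as np

def build_features(Z, T_D, S_D, alpha0=1.0, alpha1=1.0, alpha2=1.0):
    """Stack fields so that plain Euclidean distance reproduces the metric (13).

    Z   -- geopotential anomalies, shape (n_states, 1152)  (24 x 48 grid, flattened)
    T_D -- area-integrated temperature, shape (n_states,)
    S_D -- area-integrated soil moisture, shape (n_states,)
    """
    def std_norm(F):
        # global field-wise normalization, equation (12)
        return (F - F.mean()) / F.std()
    blocks = [
        np.sqrt(alpha0 / Z.shape[1]) * std_norm(Z),
        np.sqrt(alpha1) * std_norm(T_D)[:, None],
        np.sqrt(alpha2) * std_norm(S_D)[:, None],
    ]
    return np.hstack(blocks)
```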
Other choices of distance function have been explored in the literature, such as the Mahalanobis metric (Yates et al., 2003; Stephenson, 1997), which is equivalent to performing ZCA "whitening", i.e. a rotation onto the PCA components plus a normalization by the covariance matrix. We do not explicitly compute the Mahalanobis distance in this work, although we do compute the Euclidean distance after performing dimensionality reduction via PCA, as described below.

Dimensionality reduction
When dealing with high-dimensional fields, one often encounters the "curse of dimensionality": the Euclidean distance becomes poorly indicative of the similarity of two data points. It can therefore be useful to project the dynamics onto a reduced space; furthermore, the construction of the catalog is then much more efficient from a computational point of view. Here we adopt two different dimensionality reduction techniques, namely a Variational Autoencoder (VAE) and Principal Component Analysis (PCA), also known as Empirical Orthogonal Function (EOF) analysis. In a nutshell, PCA is a linear transformation of the original data, where the samples are projected onto the eigenvectors of the empirical correlation matrix (see Appendix A). Usually the projections are made only onto the eigenvectors corresponding to the largest eigenvalues, as they are the ones that contain most of the variability of the data. However, this variability may not be entirely relevant to the prediction of heatwaves, so this criterion is not necessarily useful. The VAE, instead, is a non-linear probabilistic projection onto a latent space (usually with a Gaussian measure). It consists of a probabilistic decoder p_θ(x|z) and a probabilistic (stochastic) encoder q_φ(z|x), and the training is performed with the aim of maximizing the probability of the data p(x) (Doersch, 2016) (more details can be found in Appendix A). Here, we adopt a deep Fully Convolutional Neural Network (FCNN) as the encoder, followed by an up-sampling FCNN decoder. Once the data have been projected into the latent space through one of the two methods, the SWG can be built exactly as explained above, simply replacing the fields in the high-dimensional space with the low-dimensional ones.
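For the PCA variant, the construction can be sketched in a few lines with scikit-learn (a minimal illustration; the random arrays merely stand in for the flattened, normalized geopotential anomalies):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-ins for flattened geopotential anomalies (n_states, 1152 grid cells).
Z_train = np.random.randn(5000, 1152)
Z_val = np.random.randn(1000, 1152)

pca = PCA(n_components=8)                    # d = 8 was near-optimal for Scandinavia
Z_latent_train = pca.fit_transform(Z_train)  # fit the EOFs on the training set only
Z_latent_val = pca.transform(Z_val)          # project validation states on the same EOFs
# The SWG catalog is then built from Z_latent_* exactly as in real space.
```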

SWG committor
Due to the necessity of transitioning between the training set (TS) and the validation set (VS) (see Figure 3), we require two transition matrices, M_tr and M_va. The former is a straightforward application of what was discussed above, while the latter allows a trajectory starting in the VS to find an appropriate analog in the TS. During this procedure (when τ = 0) we retain the mean value of temperature over the τ_c days already known in the VS (see Section 2.6.2), and re-use it when computing the synthetic A(t) of equation (10). All the remaining entries in A(t) are based on the temperatures of the subsequent states visited in the training set.
When τ_s = 3 days, T = 15 days and τ = 0, we need to evolve the trajectory only four times; thus each state x can have at most K^4 different values of A. Therefore, if we denote by N_a(x) the number of trajectories for which A is greater than a, the committor function of the state x is

q(x) = \frac{N_a(x)}{K^4}.

Because we are also interested in larger lead times, such as τ = 15 days, we actually need to start the trajectories earlier, which leads to exponentially growing computational trees, with a total of 9 steps of the Markov chain applied starting from each day in the VS. To reduce the computational burden we use Monte Carlo sampling, with 10,000 trajectories started from each day of the VS, rather than exploring the full tree. We do not observe improvements in the skill when refining with more trajectories.
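The Monte Carlo estimate can be sketched as follows; `step_fn` stands for one analog Markov-chain step (as in the Algorithm 1 sketch), and the bookkeeping of which visited states enter A is simplified to the last T/τ_c = 5 coarse-grained states, an assumption on our part.

```python
import numpy as np

def committor_mc(start_idx, step_fn, temps, a, n_traj=10_000, n_steps=9, event_len=5):
    """Monte Carlo committor estimate for one validation state (a sketch).

    step_fn(i, rng) -- one analog Markov-chain step returning the next catalog index
    temps           -- coarse-grained T_D for each catalog state
    a               -- heatwave threshold; A is the mean of the event-window states
    """
    rng = np.random.default_rng()
    hits = 0
    for _ in range(n_traj):
        i, visited = start_idx, []
        for _ in range(n_steps):
            i = step_fn(i, rng)
            visited.append(temps[i])
        # the last `event_len` coarse-grained states span the T = 15 day event window
        if np.mean(visited[-event_len:]) > a:
            hits += 1
    return hits / n_traj
```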
Due to the large number of trajectories we parallelised our python code with the numba package. The SWG requires the computation of a large matrix of nearest neighbors, for which we use the kd-tree of the scikit-learn package, run in parallel on 16 CPU cores of an Intel Xeon Gold 6142. Assuming no dimensionality reduction was performed prior to this operation, the feature space resulting from the Z field is 1152-dimensional (the number of grid cells), and we run a grid search to find the optimal α_0 coefficient in equation (13) for each of the 5 folds. The procedure takes approximately 18 hours when dimensionality reduction (see Appendix A) is not applied.
Finally, because of the coarse-graining (see Section 2.6.2) and the definition of the labels (equation (8)), special care must be taken when comparing the quality of the prediction of equation (3) with that of Miloshevich et al. (2023a), where X was daily. Indeed, the coarse-grained field X retains information up to τ_c days later due to the forward coarse-graining (equation (7)). Thus, for a fair comparison between committors with different coarse-graining times, one must consistently shift the lead time τ by the difference τ_c^{(1)} − τ_c^{(2)}, to ensure that the fields X_1 and X_2 have no information about the future state of the system. In this paper we consider a coarse-graining time of 3 days.

Return Time Plots
The return time plots are produced using the method described in detail in Lestang et al. (2018); here we mostly summarize the algorithm. The idea is to estimate the return time of the block-maximum (summer) value of A(t) surpassing a particularly high threshold a. The heatwave definition thus differs slightly from the one used in the committor definition (Section 2.4.1): we do not fix a at the 95th percentile, but instead identify the largest value of A(t) each summer, which defines the threshold a of that year. The interval (the duration of the block) is defined so that a heatwave may start on June 1 and end on August 30, so the interval depends on the duration T of the event, which for this application may take values other than 15 days; if T = 30 days, for instance, the last heatwave may start no later than August 1. Next, the yearly thresholds are sorted in descending order, {a_i}_{1 ≤ i ≤ N} with a_1 ≥ a_2 ≥ ... ≥ a_N, where N is the total number of years. With this definition the most extreme return time is identical to the length of the dataset (in years) under consideration (Table 1). The simplest approach is to associate the rank i of a threshold with the inverse return time, r(a_i) = N/i years. This is intuitively reasonable for large a, but runs into problems at the other end of the a axis. A more precise estimator relies on a Poisson-process assumption, P(t) = λ e^{−λt} (which mostly affects the small return times); the modified block-maximum estimator of Lestang et al. (2018) reads, in our notation,

r(a_i) = -\frac{1}{\ln(1 - i/N)} \quad \text{(in years)},

where N is the length of the dataset in years and i = rank(a_i) is the position of the particular year in the list ordered by threshold.
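A compact implementation of this estimator (the function name and the handling of the last rank, where the formula degenerates, are our choices):

```python
import numpy as np

def return_times(yearly_max):
    """Modified block-maximum estimator of Lestang et al. (2018), as read above.

    yearly_max -- block (summer) maxima of A, one value per year
    Returns thresholds a_i (descending) and return times r(a_i) in years.
    """
    a = np.sort(yearly_max)[::-1]            # a_1 >= a_2 >= ... >= a_N
    N = len(a)
    i = np.arange(1, N + 1)
    # Poisson assumption: P(summer max > a_i) ~ i/N  =>  r = -1 / log(1 - i/N);
    # for i = 1 this gives r ~ N years, matching the simple rank estimate.
    with np.errstate(divide="ignore"):
        r = -1.0 / np.log(1.0 - i / N)       # r(a_N) -> 0 is usually discarded
    return a, r
```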

Generalized Extreme Value (GEV) fit
We chose the scikit-extremes package of Correoso (2019) to perform the Generalized Extreme Value (GEV) fits. The yearly summer maximum a_max(year) of A(t) is computed and then fitted using the Maximum Likelihood Estimator (MLE). Because the initial guess for the parameters matters and can have an impact on the shape parameter of the GEV distribution, linear moments of the distribution are used to obtain the starting values. The GEV cumulative distribution function is

G(x) = \exp\left\{ -\left[ 1 + \xi \left( \frac{x - \mu}{\sigma} \right) \right]^{-1/\xi} \right\},

where ξ is the shape parameter, μ the location and σ the scale. To estimate the confidence intervals, the delta method is employed. Some years in the simulation are cool enough to have a negative anomaly, a_max(year) < 0; in fact there is a handful of such years in the TS of D100. To better estimate the GEV parameters, we remove those years.
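The paper uses scikit-extremes; an equivalent hedged sketch with scipy.stats is given below (note that scipy's shape convention is c = −ξ, and the random numbers merely stand in for the block maxima).

```python
import numpy as np
from scipy.stats import genextreme

# Stand-in for the block maxima a_max(year) over 80 training years.
yearly_max = np.random.gumbel(loc=2.0, scale=1.0, size=80)

a_max = yearly_max[yearly_max > 0]       # drop the cool a_max < 0 years, as in the text
c, loc, scale = genextreme.fit(a_max)    # MLE; scipy's shape convention is c = -xi
# Return level for a return period of r years: P(yearly max <= level) = 1 - 1/r.
r = np.array([10.0, 100.0, 1000.0, 7000.0])
levels = genextreme.ppf(1.0 - 1.0 / r, c, loc=loc, scale=scale)
```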

Probabilistic forecasting
The first goal is to compare the Stochastic Weather Generator (SWG) and the Convolutional Neural Network (CNN) as tools for probabilistic heatwave forecasting. This is done in the framework depicted schematically in Figure 2.

Comparisons with CNN
Figure 4 shows the NLS (equation (6)) for the predictions of the CNN (orange curve) vs the SWG (blue curve), described in Sections 2.5 and 2.6, for France (top panels) and Scandinavia (bottom panels). The comparison is made on D500 (see Table 1), but similar results are obtained on a shorter dataset (see Appendix C). The left panels are plotted as a function of the lead time τ and, just as in Miloshevich et al. (2023a), we see a downward trend in NLS due to the chaotic nature of the atmosphere (the predictability diminishes with time). As described in Section 2.6.3, the parameter α_0 and the number of nearest neighbors of the SWG were tuned based on cross-validation, which allows us to provide error bars (shading around the curves). We find that using 10 nearest neighbors is almost always optimal (right panels), whereas the optimal values of α_0 (middle panels) tend to be of the order of 50-100. Later we shall see that α_0 may also depend on the preprocessing of the data (such as dimensionality reduction). We follow identical data-processing procedures for both the SWG and the CNN (Section 2.4.3). However, because of the coarse-graining in the SWG, the comparisons are only appropriate if we shift the orange curve ahead by τ_c − 1 = 2 days (increasing τ); this adjustment is explained in Section 2.6.5. Figure 4 shows that in France at τ = 2 days the NLS reaches approximately 0.40 ± 0.04 for the CNN and an estimated 0.30 ± 0.02 for the SWG. In Scandinavia (bottom panels) we have similar results: approximately 0.37 ± 0.03 for the CNN vs only 0.25 ± 0.02 for the SWG. Large lead times τ result in lower predictive skill, due to the chaotic nature of the atmosphere. The behavior of the curves leads to the conclusion that the CNN predicts heatwaves better up to 10 days before the event, at which point the CNN and SWG skills are no longer statistically distinguishable. In the case of France, the regime of large τ is mostly dominated by predictability due to soil moisture (Miloshevich et al., 2023a), whereas in Scandinavia this long-term information is absent, and the NLS converges to near-zero values at large τ. It is known (Felsche et al., 2023) that it is heatwaves in Southern Europe, rather than Northern Europe, that are preceded by negative soil moisture anomalies, consistent with what we see here. Our interpretation is that in France the soil may be in a regime of strong summer moisture deficit, which significantly amplifies the soil-atmosphere feedbacks.
Looking at the optimal value of the parameter α_0, which controls the importance of geopotential in the metric (equation (13)), we see that for France it is of order α_0 ∼ 50, an order of magnitude higher than the α_0 = 1 we expected based on weighting by the number of relevant grid points. In Scandinavia the optimal value is not too different, of order α_0 ∼ 100, but the dependence is weaker. The reason for this could again be the fact that soil moisture does not contribute much to heatwave predictability in that region, thus limiting its usefulness in the metric (equation (13)).

[Figures 5 and 6. NLS (equation (6)) as a function of lead time τ and the hyper-parameters of the SWG (see the caption of Figure 4), for France and Scandinavia respectively. The SWG is indicated by the same blue curve as in Figure 4, while the orange and green curves correspond to VAE-SWG, where the geopotential was passed through two different autoencoders (equation (18)).]
The benchmarks indicate that the CNN outperforms the SWG independently of the area of interest (France or Scandinavia) and even of the dataset length (Figure 12).

Dimensionality reduction
We now move on to the question of projecting the underlying space onto a latent space, as described in Section 2.6.4. Three approaches are considered: one with no dimensionality reduction; one where a deep Fully Convolutional Neural Network (FCNN) is used as an encoder followed by an up-sampling FCNN decoder; and one where, instead of an autoencoder, we use PCA projections (scipy package). We plot the results in Figures 5 and 6 for France and Scandinavia, respectively. There, the simple SWG trained without dimensionality reduction is represented by the blue curve, the autoencoder (latent dimension d = 16) by the orange curve, and the other curves represent PCA for different latent dimensions d. For France we see no statistically significant differences, even in the drastic case where the latent space consists of the two largest principal components only. This is probably due to the strong predictive power of the soil moisture field (as clearly shown in Miloshevich et al. (2023a)). Indeed, the results for Scandinavia (which is less affected by soil moisture), shown in Figure 6, display a degradation when d < 4, while for d ≥ 4 all methods provide essentially the same score. Interestingly, the highest score (although within the error bars) is obtained for d = 8, suggesting that an optimal dimension may exist.
The computation of analogs (equation (13)) is rather demanding in high dimensions, and with datasets longer than 1000 years it becomes unfeasible, because we also have to tune α_0 and the number of nearest neighbors. It is much more efficient if the dimension of the space has first been reduced, with a method such as an autoencoder. However, as can be seen from Figures 5 and 6, the benefits of a complicated multi-layer architecture are limited in this case, given that the green curve, corresponding to PCA, is almost the same (within the error bars) as the one obtained with the autoencoder. One notable difference between the cases with and without dimensionality reduction is that the optimal value, α_0 = 5, is an order of magnitude smaller than the one obtained in Figure 4, corresponding to the SWG learned in real space.
These conclusions should be taken in the context of the predictability of temperature using geopotential, whose statistics are rather linear and Gaussian. Autoencoders may prove more useful for more complex fields, such as precipitation.

Sampling extremes
From now on, our goal is to use the Stochastic Weather Generator (SWG) and evaluate the quality of the statistics of the extremes it produces. In contrast to Section 3.1, we will consider heatwave extremes as rare as one-in-7000-year events.

Computing return times
We aim to estimate extreme return times from short sequences and compare them to the control run D8000 (Table 1). To this end we use an SWG trained on D100 (Table 1) and initialized with α_0 = 1 and 10 nearest neighbors, parameters that, in contrast to Section 3.1, we do not optimize. The synthetic sequences generated this way are shown for France in Figure 7a and for Scandinavia in Figure 7b. The curves were obtained using the method described in Section 2.6.6: the analogs are initialized on June 1 of each year (using the simulation data) and then advanced according to Algorithm 1. Each trajectory lasts up to a synthetic August 30.
The main conclusion from Figure 7a is that the analog method is rather well suited for this task: not only do the synthetic trajectories match the return times of the reference real trajectory (the 80 years of analogs), but they also provide correct estimates of the return times of a much longer trajectory (8000 years), at a fairly low computational cost compared to running such a trajectory. For instance, for T = 15 day heatwaves, the most extreme event, with a return time of 7200 years, has an anomaly of 7.07, well within the error bars of the SWG estimate 7.02 ± 0.46. The generalization is consistent across thresholds, except for the extremes of the 6-day heatwaves, where we have a sampling issue since τ_s = 3 days. The procedure is repeated for Scandinavia in Figure 7b. The results are generally consistent, although we observe more deviations from the prediction of the SWG: heatwaves of duration T = 15 days with return times of order 2-3 thousand years, as well as the longest return times for T = 30 days, tend to be systematically underestimated. Note, however, that these events are rare even in the control run, so their thresholds have large uncertainties. Excluding these extreme events, the predictions are almost always in agreement with the return times calculated from the long run, and when they are not, the relative error is smaller than 15%.
Since Extreme Value Theory (EVT) is often employed for this type of risk assessment, we compare our approach in Figure 8 to Generalized Extreme Value (GEV) fits obtained, using the package developed by Correoso (2019), from the 79 extreme T = 14 day heatwaves in the 80 years of D100 (our training data); for details on why only 79 heatwaves are used, see Section 2.7. Generally speaking, the GEV fit performs worse than the SWG: in France it underestimates the severity of the extremes, while in Scandinavia it produces very large confidence intervals. While the GEV fit is also consistent with the extremes we observe in the control run, it provides much looser confidence intervals; the SWG, on the other hand, tends to shadow the extremes of the long control simulation more closely. This could be ascribed to the hypothesis that there is an upper bound for temperature extremes (e.g. Zhang and Boos (2023)). The parameters of the GEV fit carry uncertainties (especially the shape parameter) which can be very large when the available data are limited (here: 80 years). This leads to wide uncertainties for return periods larger than the training length, although in the case of Scandinavia the return levels of the long control simulation do fall within the confidence intervals of the GEV fits, albeit not on the fitted curve itself. The SWG simulations, in contrast, stay close to the intrinsic properties of the temperature, driven by the predictors we use. This explains why the SWG simulations follow the long control run with a relatively narrow confidence interval, which does not grow with the return period.

[Figure 7. Return time plots for France (a) and Scandinavia (b). Here we use the parameters α = 1 (default) and K = 10 nearest neighbors; the analogs are initialized on June 1 of each year (using the simulation data) and then advanced according to Algorithm 1. Each trajectory ends on the last day of summer and is sampled 800 times, providing a much longer synthetic series and thus estimates of longer return times. Return times are computed for T = {6, 15, 30, 60} day heatwaves (indicated in the inset legend), with dots corresponding to the statistics from the control run (D8000, see Table 1), while the shaded areas correspond to the bootstrapped synthetic trajectory: the whole sequence is split into 10 portions, which allows estimating the mean and variance. The shading corresponds to the mean plus or minus one standard deviation.]
We do not tune α_1 and α_2, which could give insight into which variable matters more in this respect (temperature or soil moisture); however, since soil moisture has no impact on heatwaves in Scandinavia, one can reason that there it is the temperature. In other words, to produce reliable return time plots, the Euclidean metric should take into account not only the circulation patterns but also the temperature and, in some cases, the soil moisture. A possible criticism of our approach is that we have chosen the value α_0 = 1, which seems arbitrary in view of the discussion in Section 3.1.1 on the optimality of α_0 ∼ 50 for the probabilistic forecast. It should be noted, however, that once the task changes there is no longer any guarantee that α_0 ∼ 50 remains optimal (and it does not, as shown in Appendix D).

[Figure 8. The synthetic time series generated via the SWG, identical to the green shading in Figure 7, are plotted with orange shading, except that we shade two standard deviations rather than one, for consistency. The 7200-year control run (identical to the green dots in Figure 7) is plotted with green dots.]

Teleconnection patterns for extremes
We use synthetic trajectories generated by an SWG trained on only 80 years (from D100, see Table 1) to estimate extreme teleconnection patterns that were never observed in that run. Note that we do not use the autoencoder to generate new circulation patterns; instead, the SWG generates new sequences from the already existing states, which results in new values of A(t) (equation (2)), since A(t) is a running mean over the temperature series. This is precisely the recipe followed in Section 3.2.1. To validate the teleconnection patterns thus obtained, we compare them to the control run (D8000). The results are plotted in Figure 9. First, we identify the 10 most extreme heatwaves in D100 and plot the composite over their first three days (δ = {0, 1, 2} days since onset) in Figure 9a. The threshold for these events happens to be approximately 4. The pattern consists of anticyclonic (red) and cyclonic (blue) anomalies (the word "anomalies" is suppressed in what follows for brevity).
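Composite maps of this kind amount to averaging the geopotential anomaly over the selected events; a minimal sketch (our names and shapes):

```python
import numpy as np

def composite(Z_anom, onsets, deltas=(0, 1, 2)):
    """Average Z anomaly over the first days after the onset of each event.

    Z_anom -- daily anomalies, shape (days, nlat, nlon)
    onsets -- onset day index of each selected heatwave (e.g. the 10 strongest)
    """
    return np.mean([Z_anom[t + d] for t in onsets for d in deltas], axis=0)
```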
Next, we compare this data-poor regime (essentially 10 events) to the composite map computed from the 7200-year control run in Figure 9c, using the same threshold of 4 and again plotting the first three days (δ = {0, 1, 2} days) of the composites. In contrast to the previous situation, many more events satisfy the constraint. The picture changes somewhat: most notably, the cyclone and anticyclone become more pronounced, while the other patterns maintain a relationship consistent with a Rossby wave pattern. This is in line with the understanding of the teleconnection patterns expected in European heatwaves, and consistent with the wavenumber-3 teleconnection for heatwaves in France obtained by Ragone and Bouchet (2021) and by Miloshevich et al. (2023b), computed for a 1000-year-long sequence.
Finally, we compare both figures with a very long synthetic SWG run in Figure 9e. The resulting teleconnection pattern is again similar to that of Figure 9c. Overall, the advantage offered by the SWG in this case is rather small and mostly concerns the shape of the anticyclone over Northern Europe, which appears narrower in the short D100 composites.
We repeat the exact same steps for the heatwaves in Scandinavia, from the 10 most extreme events of D100 in Figure 9b to the composites resulting from the synthetic SWG run in Figure 9f. In this case the SWG performs better, in that it captures the magnitude of the anticyclone over Scandinavia more accurately (compared to Figure 9d) than the D100 composite, although the other features are hard to distinguish.

Discussion
We have systematically compared the performance of a Stochastic Weather Generator (SWG) and a Convolutional Neural Network (CNN) (Miloshevich et al., 2023a) tasked with predicting the occurrence of prolonged heatwaves over two European areas, using simulation data from the intermediate complexity climate model PlaSim. We have also studied the ability of the SWG to sample extreme return times and composite maps.
In addition to the heatwaves in France considered in Miloshevich et al. (2023a), we also apply our methodology to heatwaves in Scandinavia, an area with a different hydrology, which impacts long-term predictability. In order to achieve better predictive ability with the SWG, we took the version developed earlier (Yiou, 2014), which used only analogs of circulation (predictors we refer to as global), and upgraded it with additional predictors, temperature and soil moisture integrated over the area of interest (where the heatwaves occur), which we refer to as local. This was done because of the ample evidence for the importance of slow drivers such as soil moisture (Zeppetello et al., 2022). Consequently, we introduced appropriate weights in the definition of the Euclidean distance to control the importance of global vs local predictors. The benchmarks of CNN vs SWG were obtained with the Normalized Logarithmic Score (NLS), which is particularly well suited to rare events (Benedetti, 2010). The conclusion is that the CNN outperforms the SWG in probabilistic forecasting, although this is most evident for particularly large (hundreds of years long) training sets, which are not available in the observational record. This is in line with the statement of Miloshevich et al. (2023a) on the convergence of deep learning methods, which require massive datasets. These conclusions are interesting for extreme event prediction, but are also relevant for future developments of genealogical rare event algorithms that aim to resample heatwaves based on the forecasts provided by either a CNN or an SWG (in an approach similar to Lucente et al. (2022b)).
We have studied the performance of the SWG for the estimation of large return times and extreme teleconnections. This is achieved by generating synthetic trajectories of extensive length from the initial training data of a short, 80-year PlaSim run. We have shown that an out-of-the-box (without hyperparameter optimization) implementation of the SWG, which takes the three relevant fields as input to the Euclidean metric, is capable of reproducing the return times of a much longer control run (7200 years) for a range of heatwave durations (15, 30 and 60 days). These SWG estimates shadow the control run more closely (with smaller variance) than the Extreme Value Theory (EVT) estimates typically used to address such questions. However, the version of the SWG whose weights were optimized for the different task of conditional (intermediate-range to subseasonal) probabilistic prediction tends to slightly underestimate the return times. Nevertheless, its synthetic teleconnection patterns have qualitatively accurate features. This is an important result in view of works such as Yiou et al. (2023) that rely on SWGs to estimate the risks of extremes.
In the hope of improving the performance of the SWG, we considered two dimensionality reduction techniques to project the geopotential onto a smaller latent space: traditional Empirical Orthogonal Functions (EOFs) and Variational Autoencoders (VAEs). We found that, while improving the efficiency with which the analogs are computed, dimensionality reduction did not modify the predictive skill, which leads us to suggest EOFs as the preferable dimensionality reduction technique for this task. However, it is possible that VAEs could still be useful for other types of extremes, such as precipitation, where the application of the traditional analog method is more controversial; we leave such studies for the future. When looking for the optimal dimensionality of the latent space, we found that a few 500 hPa geopotential EOF components, coupled with local temperature and soil moisture, are sufficient to obtain the optimal forecasting skill for the SWG, which, as stated earlier, nonetheless falls short of the CNN prediction that takes the full fields as input.
Another possible extension of this work would be to compare the SWG to other modern tools for learning propagators, for instance simple U-Net architectures or the corresponding Bayesian Neural Networks (BNNs) (Thuerey et al., 2020, 2021), which provide the uncertainty of the prediction and thus a kind of propagator. Many other types of architecture are possible; however, given the overall simplicity of the SWG, when training climate model emulators on small datasets the SWG should be treated as an indispensable baseline method to justify the use of complex architectures. We have supported this paper with all the necessary code and documentation for a straightforward implementation of the SWG that can be applied to other projects.
As stated above, another interesting potential application for SWGs and other statistical methods such as deep learning is rare event algorithms (Lucente et al., 2022a). Rare event algorithms have been used in the past to sample heatwaves (Ragone et al., 2018; Ragone and Bouchet, 2021) and could potentially allow sampling other extreme events (Webber et al., 2019) with expensive high fidelity models at a cheaper cost. Data driven approaches could be used to improve such rare event simulations, because knowing the probability of an event provides prior information to the rare event algorithm (Chraibi et al., 2021). A recent work (Jacques-Dumas et al., 2023) compares SWG and CNN in computing importance functions for two simple Atlantic meridional overturning circulation models. We believe that providing similar benchmarks is important; thus in this manuscript we concentrate on the comparisons between deep learning and analog forecasting.
Finally, the results here were obtained from a dataset generated by an intermediate complexity climate model, PlaSim, with an ocean driven by stationary climatology and with relatively coarse resolution. As other modes of long-term predictability come from the ocean, the next study should investigate the same questions with a higher fidelity model with better parametrisations. In particular, the question of transfer learning should be addressed, i.e. how to properly generalize out-of-sample by pretraining on simpler models and fine tuning on more complex datasets, such as CESM or reanalysis.

B. Daily steps
Throughout this paper we have used a 3 day step for both the Markov chain increment and the coarse-graining. While the motivation for this was the synoptic decorrelation time, one naturally wonders how optimal that choice is for predicting extremes. Moreover, the CNN was designed to learn from daily fields, which meant we had to shift the NLS curves by 2 days for proper comparisons. Here we show what happens when both times are set to 1 day. Given the temperature autocorrelation time, one would expect subsequent days to be strongly correlated, which would violate the Markov condition. Nevertheless, comparing the blue and orange curves in Figure 11 we see hardly any difference.
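For concreteness, the coarse-graining can be read as simple non-overlapping block means over the daily series; the sketch below (function name ours) reduces to the daily chain when the step is 1.

```python
import numpy as np

def coarse_grain(daily_series, step=3):
    """Non-overlapping block means of a daily series; step=3 gives the
    3 day coarse-graining used in the paper, step=1 leaves the series
    unchanged (the daily Markov chain)."""
    a = np.asarray(daily_series, dtype=float)
    n = (len(a) // step) * step         # drop the incomplete trailing block
    return a[:n].reshape(-1, step).mean(axis=1)
```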

C. Committors for 100 years
The results of Figure 4 show that CNN outperforms SWG when using D500 (see Table 1), in other words 400 years of training data. Climate models allow working with large datasets; however, in practice we are often limited by the paucity of the observational record. Thus we provide benchmarks for the shorter dataset D100, i.e. 80 years of training, the same as used in Section 3.2. The results are plotted in Figure 12, which shows that, while CNN still gives better results on average, the spread of the NLS tends to be quite large. The trends for the weight hyperparameter and the number of neighbors remain the same.
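For reference, here is a sketch of one common way to normalize the logarithmic score against climatology, in the spirit of (Benedetti, 2010); the exact form of equation (6) may differ in detail.

```python
import numpy as np

def normalized_log_score(p, y, eps=1e-12):
    """p: predicted heatwave probabilities, y: binary outcomes (1 = heatwave).
    Returns a skill score: 1 is a perfect forecast, 0 is no better than
    the climatological base rate."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    score = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    p_clim = np.clip(y.mean(), eps, 1 - eps)  # climatological frequency
    score_clim = -(p_clim * np.log(p_clim) + (1 - p_clim) * np.log(1 - p_clim))
    return 1.0 - score / score_clim
```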

D. Return times for different values of the hyperparameter
Here we attach Figure 13, which is equivalent to Figure 7a except for the value of the weight hyperparameter, set here to 50. This value was found to be optimal for this area in Section 3.1. As we can see, it is not optimal for generating synthetic return times, which are under-estimated. This under-estimation is related to the variance reduction problem. We do not include the case of Scandinavia, but the results look very similar.
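For comparison, the GEV-based alternative used as a baseline can be sketched with scipy: fit yearly maxima and invert the fitted distribution to obtain return levels. The synthetic data and parameter values below are illustrative and do not correspond to the fit shown in Figure 8.

```python
import numpy as np
from scipy.stats import genextreme

# Illustrative stand-in for 80 years of yearly temperature-anomaly maxima.
rng = np.random.default_rng(0)
yearly_maxima = genextreme.rvs(c=0.1, loc=2.0, scale=1.0, size=80, random_state=rng)

# Fit the GEV and invert it: the return level for period R is the
# (1 - 1/R) quantile of the fitted distribution of yearly maxima.
c, loc, scale = genextreme.fit(yearly_maxima)
return_periods = np.array([10, 100, 1000, 10000])  # years
return_levels = genextreme.ppf(1 - 1 / return_periods, c, loc=loc, scale=scale)
```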

Figure 2. Schematics of the different methodologies used for probabilistic forecasting and estimation of extreme statistics. On the left we show the three input fields, which are labeled directly on the plot. On the right the target of the inference is displayed (for instance probabilistic prediction as described in 2.6.5). On top we have the direct CNN approach. In the middle we show the analog method (SWG), which is the main topic of this work and which is compared to CNN for this task. At the bottom the option is presented to perform dimensionality reduction of the input fields and pass them as the input on which SWG is "trained". Subsequently, SWG can be used to generate conditional or unconditional synthetic series (green boxes). This synthetic data can be used to make probabilistic predictions or to estimate tails of distributions, such as return time plots.

Figure 3. Schematics of the flow of analogs. When applying the algorithm for the estimation of the committor (equation (3)), the first step consists of starting from a state in the Validation Set (VS), finding its analog in the Training Set (TS) and applying the time evolution operator. All subsequent transitions occur within the TS.

Algorithm 1: generation of a synthetic temperature series.
Input: K ← the number of nearest neighbors retained
Input: N ← the number of states in the training set
Input: M (an N × K matrix) ← the precomputed matrix of nearest neighbors
Input: T = T(s) ← the temperature sequence corresponding to historical states s ∈ 1 … N
Input: s ← the initial state of the synthetic trajectory
Output: synthetic temperature series
1  define variables (K, N, M)
2  append T(s) to the synthetic temperature series
3  while the synthetic trajectory continues do
4      k ← U(K), a uniform random draw, i.e. 1 ≤ k ≤ K
5      s ← M(s, k), a random analog of the current state
6      s ← s + Δt, the state is advanced according to its historical dynamical evolution
7      append s to the synthetic trajectory sequence
8      append T(s) to the synthetic temperature series
9  end
10 return the synthetic temperature series
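A minimal Python translation of Algorithm 1, assuming the nearest-neighbor matrix has been precomputed; function and variable names are ours.

```python
import numpy as np

def synthetic_series(neighbors, temperature, s0, n_steps, step=1, seed=0):
    """neighbors: (N, K) integer matrix, neighbors[s, k] is the index of the
    k-th nearest analog of historical state s; temperature: (N,) values
    associated with each state; s0: initial state; step: Markov-chain
    increment in (coarse-grained) time units."""
    rng = np.random.default_rng(seed)
    n_states, n_neighbors = neighbors.shape
    s = s0
    series = [temperature[s]]
    for _ in range(n_steps):
        k = rng.integers(n_neighbors)   # uniform draw among the K analogs
        s = neighbors[s, k] + step      # jump to an analog, then advance in time
        if s >= n_states:               # guard: stay inside the historical catalog
            break
        series.append(temperature[s])
    return np.array(series)
```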

Figure 4. Basic SWG (blue curve) vs CNN (orange curve). All three panels display NLS (equation (6)) on the y axis. The left panel has the lead time τ on the x axis, the central panel has the weight hyperparameter of SWG (see equation (13)) on the x axis, and the right panel has the number of nearest neighbors (also a hyperparameter of SWG) on the x axis. On the central and right panels the choice τ = 0 was made. The dots show data points corresponding to the mean over the cross-validation, whereas the thickness of the shaded area represents two standard deviations. These conventions will be re-used in the subsequent figures.

Figure 5. Basic SWG vs VAESWG: on the y axis we have NLS (equation (6)) as a function of the lead time τ and the hyper-parameters of SWG (see caption of Figure 4). SWG is indicated by the same (identical) blue curve as in Figure 4, while the orange and green curves correspond to VAESWG, where the geopotential was passed through two different autoencoders (equation (18)).

Figure 6. Basic SWG vs VAESWG: on the y axis we have NLS (equation (6)) as a function of the lead time τ and the hyper-parameters of SWG (see caption of Figure 4). SWG is indicated by the same (identical) blue curve as in Figure 4, while the orange and green curves correspond to VAESWG, where the geopotential was passed through two different autoencoders (equation (18)).

Figure 7. Return time plot for (a) France and (b) Scandinavia heatwaves using analogs of the North Atlantic and Europe and the method based on equation (15). Here we use the weight hyperparameter set to 1 (default) and the number of nearest neighbors set to 10; the analogs are initialized on June 1 of each year (using the simulation data) and then advanced according to Algorithm 1. The trajectory ends at the last day of summer. Each trajectory is sampled 800 times, providing much longer synthetic series and thus estimating longer return times. Return times are computed for {6, 15, 30, 60} day heatwaves (indicated in the inset legend), with dots corresponding to the statistics of the control run (D8000, see Table 1), while shaded areas correspond to the bootstrapped synthetic trajectory: the whole sequence is split into 10 portions, which allows estimating the mean and variance. The shading corresponds to the mean plus or minus one standard deviation.

Figure 8. Return time plot for (a) France and (b) Scandinavia 15 day heatwaves using the method based on equation (14). Temperature anomalies (K) are plotted on the y axis and return times (years) on the x axis. The black dots correspond to the most extreme heatwaves in the 80 years of D100. The blue dashed line corresponds to the GEV fit performed on the 80 years of D100 (minus the years with negative anomaly maxima). The 95 percent confidence intervals are indicated by the blue shading. The synthetic time series generated via SWG, identical to the green shading in Figure 7, are plotted using orange shading, except that we chose to shade two standard deviations rather than one for consistency. The 7200 year control run (identical to the green dots in Figure 7) is plotted using green dots.

Figure 9. Composite maps of geopotential height anomalies (meters) at 500 hPa for 15 day heatwaves in France and Scandinavia. (a) Forward 3 day running mean at day 0, i.e. the heatwave onset: a composite of the 10 most extreme France heatwaves in the 80 year long dataset, with threshold ≈ 4 K. (b) Same as (a) but for Scandinavia heatwaves. (c) Composite for the control run over the collection of France heatwaves above the 4 K threshold. (d) Same as (c) but for Scandinavia heatwaves. (e) Composite for the synthetic run performed by SWG above the same threshold for France heatwaves. (f) Same as (e) but for Scandinavia heatwaves.

Figure 10. Top panels: original geopotential anomalies. Bottom panels: a realisation of the reconstructed geopotential anomalies after passing them through the VAE.

Figure 11. Basic SWG with daily increments vs the 3 day step. Here we show NLS (equation (6)) as a function of the lead time τ for the optimal hyper-parameters obtained for the usual analog Markov chain based on the representation (equation (11)) and the procedure described in 2.6.1, as in Figure 6. Both figures share identical blue curves (corresponding to the same setup). The orange curve corresponds to a Markov chain whose increment and coarse-graining time are set to 1 day, i.e. no coarse-graining. Note that for fairness of comparison we have shifted the orange and green curves by 2 days (see the discussion in Section 2.4.2). For details on the interpretation of the different panels see the caption of Figure 4.

Figure 12. Basic SWG (blue curve) vs CNN (orange curve). This figure is identical to Figure 4 except for the number of years used to train/validate the algorithm; here we have relied on D100 (Table 1).

Figure 13. Return time plot for France heatwaves using analogs of the North Atlantic and Europe. Here we use the weight hyperparameter set to 50 and the number of nearest neighbors set to 10; the analogs are initialized on June 1 of each year (using the simulation data) and then advanced according to Algorithm 1. The trajectory ends at the last day of summer. Each trajectory is sampled 800 times, providing much longer synthetic series and thus estimating longer return times. Return times are computed for {6, 15, 30, 60} day heatwaves (indicated in the inset legend), with dots corresponding to the statistics of the control run (D8000, see Table 1), while shaded areas correspond to the bootstrapped synthetic trajectory: the whole sequence is split into 10 portions, which allows estimating the mean and variance. The shading corresponds to the mean plus or minus one standard deviation.