## Impact Statement

Spatially co-occurring climate extremes such as heavy precipitation events or temperature extremes can have devastating impacts on human and natural systems. Modeling the complex spatial dependencies between climate extremes at different locations is notoriously difficult, and traditional approaches from the field of extreme value theory are relatively inflexible. We show that combining extreme value theory with a deep learning model (generative adversarial networks) can represent complex spatial dependencies between extremes well. Hence, instead of running expensive climate models, the approach can be used to sample many instances of spatially co-occurring extremes with a realistic dependence structure, which may be used for climate risk modeling and stress testing of climate-sensitive systems.

## 1. Introduction

Understanding and modeling climate extremes such as floods, heatwaves, and heavy precipitation are of paramount importance because they often lead to severe impacts on socioeconomic systems (Field et al., Reference Field, Barros and Stocker2012). Many impacts are associated with compounding drivers, for instance, multiple hazards occurring at the same time at the same or different locations and affecting the same system (Leonard et al., Reference Leonard, Westra, Phatak, Lambert, van den Hurk, McInnes, Risbey, Schuster, Jakob and Stafford-Smith2014; Zscheischler et al., Reference Zscheischler, Westra, van den Hurk, Seneviratne, Ward, Pitman, AghaKouchak, Bresch, Leonard, Wahl and Zhang2018). Ignoring potential dependencies between multiple hazards can lead to severe misspecification of the associated risk (Zscheischler and Seneviratne, Reference Zscheischler and Seneviratne2017; Hillier and Dixon, Reference Hillier and Dixon2020). However, estimating dependencies between extremes is a challenging task, requiring large amounts of data and/or suitable approaches to model the phenomena of interest in a computationally feasible time.

A particularly challenging class of events is spatially compounding events (Zscheischler et al., Reference Zscheischler, Martius, Westra, Bevacqua, Raymond, Horton, van den Hurk, AghaKouchak, Jézéquel, Mahecha, Maraun, Ramos, Ridder, Thiery and Vignotto2020). Spatially compounding events occur when multiple connected locations are affected by the same or different hazards within a limited time window, thereby causing an impact. The compounding is established via a system capable of spatial integration, which accumulates hazard impacts in spatially distant locations. For instance, the global food system is vulnerable to multiple co-occurring droughts and heatwaves in key crop-producing regions of the world (Anderson et al., Reference Anderson, Seager, Baethgen, Cane and You2019; Mehrabi and Ramankutty, Reference Mehrabi and Ramankutty2019). Similarly, spatially extensive floods (Jongman et al., Reference Jongman, Hochrainer-Stigler, Feyen, Aerts, Mechler, Botzen, Bouwer, Pflug, Rojas and Ward2014) or sequences of cyclones (Raymond et al., Reference Raymond, Horton, Zscheischler, Martius, AghaKouchak, Balch, Bowen, Camargo, Hess, Kornhuber, Oppenheimer, Ruane, Wahl and White2020) can deplete emergency response funds. Climate extremes can be correlated over very large distances due to teleconnections (Boers et al., Reference Boers, Goswami, Rheinwalt, Bookhagen, Hoskins and Kurths2019), but modeling such dependencies is challenging.

One approach to tackle the challenge of spatially correlated extremes is to create a large amount of data by running very long simulations with state-of-the-art climate models, which have the physical spatial dependencies built into them. For instance, over recent years, large ensembles have been created for a number of climate models (Deser et al., Reference Deser, Lehner, Rodgers, Ault, Delworth, DiNezio, Fiore, Frankignoul, Fyfe, Horton, Kay, Knutti, Lovenduski, Marotzke, McKinnon, Minobe, Randerson, Screen, Simpson and Ting2020). However, these simulations typically have rather coarse resolution, are usually not stationary in time, and are very expensive to run.

Extreme value theory provides mathematically justified models for the tail region of a multivariate distribution $ \left({X}_1,\dots, {X}_d\right) $, $ d\ge 2 $. This enables extrapolation beyond the range of the data and accurate estimation of the small probabilities of rare (multivariate) events. Statistical methods building on this theory are popular in a broad range of domains such as meteorology (Le et al., Reference Le, Davison, Engelke, Leonard and Westra2018), climate science (Naveau et al., Reference Naveau, Hannart and Ribes2020), and finance (Poon et al., Reference Poon, Rockinger and Tawn2004). Applications have so far been limited to low-dimensional settings for several reasons. On the one hand, even for moderately large dimensions *d*, the fitting and simulation of parametric models are computationally intensive because they require computing complex likelihoods (Dombry et al., Reference Dombry, Engelke and Oesting2017). On the other hand, the extremal dependence structures in applications are difficult to model, and the existing approaches are either simplistic or over-parameterized. In spatial applications, for instance, Euclidean distance is used to parametrize the isotropic dependence of stationary max-stable random fields (Blanchet and Davison, Reference Blanchet and Davison2011). Most real-world datasets that cover larger spatial domains do, however, feature nonstationarities in space that cannot be captured by current geostatistical models, which assume stationarity; see Engelke and Ivanovs (Reference Engelke and Ivanovs2021) for a review of recent extreme value methods in higher dimensions.

Methods from machine learning are well-suited for complex and nonstationary data sets. Their loss functions are, however, typically designed to predict well in the bulk of the distribution. It is therefore difficult to construct approaches that perform accurately outside the range of the training data. In prediction problems, one possibility is to adapt the loss function to make the algorithm more sensitive to extreme values, as done for instance in quantile regression forests (Meinshausen, Reference Meinshausen2006) or extreme gradient boosting (Velthoen et al., Reference Velthoen, Dombry, Cai and Engelke2021) for the prediction of high conditional quantiles. Jalalzai et al. (Reference Jalalzai, Clémençon and Sabourin2018) discuss this problem for classification in extreme regions and propose a new notion of asymptotic risk that helps to define classifiers with good generalization capacity beyond the observations’ range in the predictor space.

Rather than predicting an unknown response, in this work, we are interested in a generative model that is able to learn a high-dimensional distribution $ \left({X}_1,\dots, {X}_d\right) $ both in the bulk and in the extremes. We concentrate on generative adversarial networks (GANs), which are known to be competitive models for multivariate density estimation. While classical applications of GANs are often in the field of image analysis (Zhang et al., Reference Zhang, Xu, Li, Zhang, Wang, Huang and Metaxas2017; Choi et al., Reference Choi, Choi, Kim, Ha, Kim and Choo2018; Karras et al., Reference Karras, Aila, Laine and Lehtinen2018), they have been used much more broadly in domains such as finance (Efimov et al., Reference Efimov, Xu, Kong, Nefedov and Anandakrishnan2020), fraud detection (Zheng et al., Reference Zheng, Yuan, Wu, Li and Lu2019), speech recognition (Sahu et al., Reference Sahu, Gupta and Espy-Wilson2020), and medicine (Schlegl et al., Reference Schlegl, Seeböck, Waldstein, Schmidt-Erfurth and Langs2017).

There are two different aspects that characterize the extremal properties of multivariate data: the univariate tail behavior of each margin and the extremal dependence between the largest observations. For GANs, it has recently been shown (Wiese et al., Reference Wiese, Knobloch and Korn2019; Huster et al., Reference Huster, Cohen, Lin, Chan, Kamhoua, Leslie, Chiang, Sekar, Meila and Zhang2021) that the marginal distributions of the generated samples are either bounded or light-tailed if the input noise is uniform or Gaussian, respectively. Huster et al. (Reference Huster, Cohen, Lin, Chan, Kamhoua, Leslie, Chiang, Sekar, Meila and Zhang2021) propose to use heavy-tailed input to overcome this issue. Bhatia et al. (Reference Bhatia, Jain and Hooi2021) propose the ExGAN algorithm that uses conditional GANs to perform importance sampling of extreme scenarios. The main difference from our approach is that ExGAN simulates single extreme events, while our model relies on block maxima and therefore does not suffer from serial correlation and seasonality issues. Moreover, ExGAN does not model the marginal distributions with extreme value theory and, since the input noise of ExGAN is Gaussian, this may be problematic especially for heavy-tailed data sets in view of the results of Wiese et al. (Reference Wiese, Knobloch and Korn2019) and Huster et al. (Reference Huster, Cohen, Lin, Chan, Kamhoua, Leslie, Chiang, Sekar, Meila and Zhang2021) mentioned above.

In this article, we concentrate on modeling the extremal properties of both the marginal distributions and the dependence structure in a realistic way, even for high-dimensional and spatially nonstationary data. Our model, called evtGAN, combines the asymptotic theory of extremes with the flexibility of GANs to overcome the limitations of classical statistical approaches, which are restricted to low-dimensional and stationary data. By using extrapolation for the marginals from extreme value theory, our method can be applied to datasets with arbitrary combinations of light-tailed and heavy-tailed distributions. We use a stationary 2000-year climate simulation to test and validate our approach.

In sum, we demonstrate how our method combines the best of extreme value theory and GANs to efficiently and accurately learn complex multidimensional distributions with nontrivial extremal dependence structures. Furthermore, one can easily sample from these learned distributions, which facilitates the study of complex data such as spatially distributed climate extremes with highly heterogeneous dependence structures. The remainder of the article is structured as follows: in Section 2, we briefly lay out the theoretical background of extreme value theory and GANs, and present the main algorithm of our methodology. Section 3 contains a description of the data, the model’s architecture and the hyperparameters used for its training, and a presentation of the obtained results, which we then discuss in Section 4. Finally, concluding remarks are laid out in Section 5.

## 2. Methodology

In the climate community, there is a great need for methods able to efficiently provide an empirical description of the climate system, including extreme events, starting from as few ensemble runs as possible (Castruccio et al., Reference Castruccio, Hu, Sanderson, Karspeck and Hammerling2019; Deser et al., Reference Deser, Lehner, Rodgers, Ault, Delworth, DiNezio, Fiore, Frankignoul, Fyfe, Horton, Kay, Knutti, Lovenduski, Marotzke, McKinnon, Minobe, Randerson, Screen, Simpson and Ting2020). Emulation techniques that tackle this challenge have been proposed in the last few years, usually focusing on the spatial correlation properties of the phenomena of interest (McKinnon et al., Reference McKinnon, Poppick, Dunn-Sigouin and Deser2017; McKinnon and Deser, Reference McKinnon and Deser2018; Link et al., Reference Link, Snyder, Lynch, Hartin, Kravitz and Bond-Lamberty2019; Brunner et al., Reference Brunner, Gilleland and Wood2021). In this work, our aim is to build an emulator specifically designed for extreme events that is able to reproduce the spatial tail dependencies of the data and extrapolate outside the range of the training samples. Our method does not require discarding any simulations and is therefore highly efficient. In this section, we first recall some background on extreme value theory and GANs, and then describe our algorithm called evtGAN. Similar to a copula approach (e.g., Brunner et al., Reference Brunner, Gilleland and Wood2021), it relies on the idea of disentangling the modeling of the marginal distributions and the extremal dependence structure in order to use extreme value theory for the former and GANs for the latter.

### 2.1. Extreme value theory

Extreme value theory provides mathematically justified methods to analyze the tail of a *d*-dimensional random vector $ \mathbf{X}=\left({X}_1,\dots, {X}_d\right) $. This branch of statistical theory has been widely applied in climate science, for example, to study droughts (Burke et al., Reference Burke, Perry and Brown2010), floods (Asadi et al., Reference Asadi, Engelke and Davison2018), heatwaves (Tanarhte et al., Reference Tanarhte, Hadjinicolaou and Lelieveld2015), and extreme rainfall (Le et al., Reference Le, Davison, Engelke, Leonard and Westra2018).

For multivariate data, there are two different aspects that determine the quality of an extrapolation method: the univariate tail behavior of each margin and the extremal dependence between the largest observations. We explain how to measure and model those based on componentwise maxima. Let $ {\mathbf{X}}_i=\left({X}_{i1},\dots, {X}_{id}\right) $, $ i=1,\dots, k $, be a sample of $ \mathbf{X} $ and let $ {\mathbf{M}}_k=\left({M}_{k1},\dots, {M}_{kd}\right) $, where $ {M}_{kj}=\max \left({X}_{1j},\dots, {X}_{kj}\right) $ is the maximum over the *j*th margin.
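
As a concrete illustration, componentwise block maxima can be computed in a few lines of NumPy; the helper below is our own sketch (not code from the article), assuming observations are stacked row-wise.

```python
import numpy as np

def componentwise_block_maxima(X, k):
    """Componentwise maxima over consecutive blocks of length k.

    X: array of shape (m, d) with m = n_blocks * k observations of a
       d-dimensional vector.
    Returns an array of shape (n_blocks, d): per block and per margin j,
    M_kj = max(X_1j, ..., X_kj).
    """
    m, d = X.shape
    n_blocks = m // k
    # Drop trailing observations that do not fill a complete block.
    X = X[: n_blocks * k]
    return X.reshape(n_blocks, k, d).max(axis=1)

# Toy example: 3 "years" of 365 daily values at 4 grid points.
rng = np.random.default_rng(0)
X = rng.gumbel(size=(3 * 365, 4))
M = componentwise_block_maxima(X, k=365)
print(M.shape)  # (3, 4)
```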

#### 2.1.1. Univariate theory

The Fisher–Tippett–Gnedenko theorem (e.g., Coles et al., Reference Coles, Bawa, Trenner and Dorazio2001) describes the limit behavior of the univariate maximum $ {M}_{kj} $. It states that if there exist sequences $ {a}_{kj}\in \unicode{x211D} $, $ {b}_{kj}>0 $ such that

$$ \frac{M_{kj}-{a}_{kj}}{b_{kj}}\overset{d}{\to }{Z}_j,\hskip1em k\to \infty, \hskip4em (1) $$

for a nondegenerate random variable $ {Z}_j $, then the distribution function $ {G}_j $ of $ {Z}_j $ is in the class of generalized extreme value (GEV) distributions with

$$ {G}_j(x)=\exp \left\{-{\left[1+{\xi}_j\left(\frac{x-{\mu}_j}{\sigma_j}\right)\right]}_{+}^{-1/{\xi}_j}\right\}, $$

where $ {x}_{+} $ denotes the positive part of a real number *x*, and $ {\xi}_j\in \unicode{x211D} $, $ {\mu}_j\in \unicode{x211D} $, and $ {\sigma}_j>0 $ are the shape, location, and scale parameters, respectively. The shape parameter is the most important since it indicates whether the *j*th margin is heavy-tailed ($ {\xi}_j>0 $), light-tailed ($ {\xi}_j=0 $), or has a finite upper end-point ($ {\xi}_j<0 $).

It can be shown that under mild conditions on the distribution of margin $ {X}_j $, appropriate sequences exist for (1) to hold. The above theorem therefore suggests fitting a generalized extreme value distribution $ {\hat{G}}_j $ to maxima taken over blocks of the same length, which is common practice when modeling yearly maxima. The shape, location, and scale parameters can then be estimated in different ways, including moment-based estimators (Hosking, Reference Hosking1985), Bayesian methods (Yoon et al., Reference Yoon, Cho, Heo and Kim2010), and maximum likelihood estimation (Hosking et al., Reference Hosking, Wallis and Wood1985); we use the latter in this work. This allows extrapolation in the direction of the *j*th marginal, since the size of a *T*-year event can be approximated by the $ \left(1-1/T\right) $-quantile of the distribution $ {\hat{G}}_j $, even if *T* is larger than the length of the data record.
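
As a sketch of this workflow, the following fits a GEV to simulated yearly maxima and reads off a 100-year return level. Note that SciPy's `genextreme` parameterizes the shape as c = -ξ relative to the convention used above; the simulated data are an illustrative stand-in, not the article's data.

```python
import numpy as np
from scipy.stats import genextreme

# Simulate yearly maxima for one margin (Gumbel-type tail, i.e., xi ~ 0).
rng = np.random.default_rng(1)
maxima = rng.gumbel(loc=20.0, scale=2.0, size=2000)

# Maximum likelihood fit of the GEV; SciPy's shape parameter c equals -xi.
c, mu, sigma = genextreme.fit(maxima)
xi = -c

# Size of the T-year event: the (1 - 1/T)-quantile of the fitted GEV.
T = 100
r100 = genextreme.ppf(1 - 1 / T, c, loc=mu, scale=sigma)
print(xi, r100)
```

For the chosen Gumbel example the true 100-year level is about 29.2, and the fitted shape should be close to zero.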

#### 2.1.2. Multivariate theory

For multivariate data, the correct extrapolation of the tail not only depends on correct models for the marginal extremes, but also on whether the dependence between marginally large values is well captured. For two components $ {X}_i $ and $ {X}_j $ with limits $ {Z}_i $ and $ {Z}_j $ of the corresponding maxima in (1), this dependence is summarized by the extremal correlation $ {\chi}_{\mathrm{ij}}\in \left[0,1\right] $ (Schlather and Tawn, Reference Schlather and Tawn2003),

$$ {\chi}_{\mathrm{ij}}=\underset{q\to 1}{\lim}\;P\left({F}_i\left({X}_i\right)>q\mid {F}_j\left({X}_j\right)>q\right), $$

where $ {F}_i $ and $ {F}_j $ denote the marginal distribution functions of $ {X}_i $ and $ {X}_j $.

The larger $ {\chi}_{\mathrm{ij}} $, the stronger the dependence in the extremes between the two components, that is, between $ {Z}_i $ and $ {Z}_j $. If $ {\chi}_{\mathrm{ij}}>0 $ we speak of asymptotic dependence, and otherwise of asymptotic independence. While the extremal correlation is a useful summary statistic, it does not reflect the complete extremal dependence structure between $ {X}_i $ and $ {X}_j $. Under asymptotic dependence, the latter can be characterized by the so-called spectral distribution (de Haan and Resnick, Reference de Haan and Resnick1977). Let $ {\tilde{X}}_j=-1/\log {F}_j\left({X}_j\right) $, $ j=1,\dots, d $, be the data normalized to standard Fréchet margins. The spectral distribution describes the extremal angle of $ \left({\tilde{X}}_i,{\tilde{X}}_j\right) $, given that the radius exceeds a high threshold, that is,

$$ H(x)=\underset{u\to \infty }{\lim}\;P\left(\frac{{\tilde{X}}_i}{{\tilde{X}}_i+{\tilde{X}}_j}\le x\;\Big|\;{\tilde{X}}_i+{\tilde{X}}_j>u\right),\hskip1em x\in \left[0,1\right]. $$

Under strong extremal dependence, the spectral distribution centers around 1/2; under weak extremal dependence, it has mass close to the boundary points 0 and 1.
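
In practice, the extremal correlation is estimated at a finite quantile level. The helper below is an illustrative sketch of such a rank-based estimator; the level q = 0.95 and the Gaussian toy data are our own choices, not from the article.

```python
import numpy as np

def empirical_chi(x, y, q=0.95):
    """Finite-level empirical estimate of the extremal correlation:
    P(F_x(X) > q | F_y(Y) > q), with marginals replaced by ranks."""
    n = len(x)
    u = np.argsort(np.argsort(x)) / n  # rank transform to [0, 1)
    v = np.argsort(np.argsort(y)) / n
    return np.mean((u > q) & (v > q)) / np.mean(v > q)

rng = np.random.default_rng(2)
z = rng.normal(size=5000)
x = z + 0.1 * rng.normal(size=5000)   # strongly dependent pair
y = rng.normal(size=5000)             # independent pair
print(empirical_chi(x, z), empirical_chi(y, z))
```

For the strongly dependent pair the estimate is close to 1, while for the independent pair it is close to 0 (up to sampling noise at the finite level q).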

In multivariate extreme value theory, a popular way to ensure a correct extrapolation of tail dependence under asymptotic dependence is by modeling the joint distribution $ \mathbf{Z}=\left({Z}_1,\dots, {Z}_d\right) $ of the limits in (1) as a max-stable distribution with multivariate distribution function:

$$ P\left({G}_1\left({Z}_1\right)\le {u}_1,\dots, {G}_d\left({Z}_d\right)\le {u}_d\right)=\exp \left\{-\mathrm{\ell}\left(-\log {u}_1,\dots, -\log {u}_d\right)\right\},\hskip1em \mathbf{u}\in {\left[0,1\right]}^d, \hskip4em (2) $$

where $ \mathrm{\ell} $ is the so-called stable tail dependence function (e.g., de Haan and Ferreira, Reference de Haan and Ferreira2007). This is a copula approach in the sense that the right-hand side is independent of the original marginal distributions $ {G}_j $ and only describes the dependence structure. In practice, the $ {G}_j $ are replaced by estimates $ {\hat{G}}_j $ that are fitted to the data first. The right-hand side of (2) is also called an extreme value copula (e.g., Gudendorf and Segers, Reference Gudendorf, Segers, Jaworski, Durante, Härdle and Rychlik2010).
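
To make the right-hand side of (2) concrete, one can evaluate the extreme value copula for a specific parametric choice of $ \mathrm{\ell} $. The logistic family used below is a standard textbook example and is not the model adopted in the article, which instead learns the dependence structure with a GAN.

```python
import numpy as np

def logistic_ell(x, theta):
    """Logistic stable tail dependence function:
    ell(x_1, ..., x_d) = (sum_i x_i^(1/theta))^theta, theta in (0, 1].
    theta = 1 gives independence; theta -> 0 complete dependence."""
    x = np.asarray(x, dtype=float)
    return np.sum(x ** (1.0 / theta)) ** theta

def ev_copula(u, theta):
    """Extreme value copula as in (2): C(u) = exp{-ell(-log u_1, ..., -log u_d)}."""
    u = np.asarray(u, dtype=float)
    return np.exp(-logistic_ell(-np.log(u), theta))

u = (0.7, 0.9)
print(ev_copula(u, theta=1.0))   # independence: 0.7 * 0.9 = 0.63
print(ev_copula(u, theta=0.3))   # strong dependence: larger than 0.63
```

With theta = 1 the copula reduces to the product of its arguments (independence), while smaller theta yields a larger joint probability, reflecting stronger dependence.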

A natural extension of max-stable distributions to the spatial setting is given by max-stable processes $ \left\{Z(t):t\in {\unicode{x211D}}^d\right\} $ (e.g., Davison et al., Reference Davison, Padoan and Ribatet2012), where the maxima $ {Z}_i=Z\left({t}_i\right) $ are observed at spatial locations $ {t}_i $, $ i=1,\dots, d $. These types of models have been widely applied in many different areas such as flood risk (Asadi et al., Reference Asadi, Davison and Engelke2015), heat waves (Engelke et al., Reference Engelke, de Fondeville and Oesting2019), and extreme precipitation events (Buishand et al., Reference Buishand, de Haan and Zhou2008). A popular parametric model for a max-stable process $ {Z} $ is the Brown–Resnick process (Kabluchko et al., Reference Kabluchko, Schlather and de Haan2009; Engelke et al., Reference Engelke, Malinowski, Kabluchko and Schlather2015). This model class is parameterized by a variogram function $ \gamma $ on $ {\unicode{x211D}}^d $. In this case, each bivariate distribution is in the so-called Hüsler–Reiss family (Hüsler and Reiss, Reference Hüsler and Reiss1989), and the corresponding dependence parameter is the variogram function evaluated at the distance between the two locations,

$$ {\lambda}_{\mathrm{ij}}=\gamma \left({t}_i-{t}_j\right). $$

One can show that the extremal correlation coefficient for this model is given by $ {\chi}_{\mathrm{ij}}=2-2\Phi \left(\sqrt{\lambda_{\mathrm{ij}}}/2\right) $ , where $ \Phi $ is the standard normal distribution function. A popular parametric family is the class of fractal variograms that can be written as $ {\gamma}_{\alpha, s}(h)={h}^{\alpha }/s $ , $ \alpha \in \left(0,2\right] $ , $ s>0 $ . While these methods enjoy many interesting properties, they are often quite complex to fit (Dombry et al., Reference Dombry, Engelke and Oesting2017) and impose strong assumptions on the spatial phenomena of interest, such as spatial stationarity, isotropic behavior, and asymptotic dependence between all pairs of locations.

A simple way of fitting a Brown–Resnick process to data is to compute the empirical versions $ {\hat{\chi}}_{\mathrm{ij}} $ of the extremal correlation coefficients and then numerically find the parameter values (e.g., α and *s* for the fractal variogram family) whose implied model extremal coefficients minimize the squared error compared to the empirical ones.
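
This fitting recipe amounts to a small least-squares problem. The code below is an illustrative sketch on synthetic, noiseless "empirical" extremal correlations (so the true parameters should be recovered); function names and the synthetic setup are our own.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def chi_brown_resnick(h, alpha, s):
    """Extremal correlation implied by the fractal variogram
    gamma(h) = h^alpha / s: chi = 2 - 2 * Phi(sqrt(gamma(h)) / 2)."""
    return 2.0 - 2.0 * norm.cdf(np.sqrt(h ** alpha / s) / 2.0)

def fit_variogram(h, chi_hat):
    """Least-squares fit of (alpha, s) to empirical extremal correlations."""
    def loss(p):
        alpha, s = p
        return np.sum((chi_brown_resnick(h, alpha, s) - chi_hat) ** 2)
    res = minimize(loss, x0=[1.0, 1.0], bounds=[(1e-3, 2.0), (1e-3, None)])
    return res.x

# Synthetic check: recover known parameters from noiseless "empirical" values.
h = np.linspace(0.1, 5.0, 40)
chi_true = chi_brown_resnick(h, alpha=1.5, s=2.0)
alpha_hat, s_hat = fit_variogram(h, chi_true)
print(alpha_hat, s_hat)
```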

### 2.2. Generative adversarial network

GANs (Goodfellow et al., Reference Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio, Ghahramani, Welling, Cortes, Lawrence and Weinberger2014) are attracting great interest thanks to their ability to learn complex multivariate distributions that are possibly supported on lower-dimensional manifolds. These methods are generative since they allow the generation of new realistic samples from the learned distribution that can be very different from the training samples.

Given observations $ {\mathbf{U}}_1,\dots, {\mathbf{U}}_n $ from some *d*-dimensional unknown target distribution $ {p}_{\mathrm{data}} $, GANs can be seen as a game opposing two agents, a discriminator *D* and a generator *G*. The generator *G* takes as input a random vector $ \mathbf{Y}=\left({Y}_1,\dots, {Y}_p\right) $ from a $ p $-dimensional latent space and transforms it to a new sample $ {\mathbf{U}}^{\ast }=G\left(\mathbf{Y}\right) $. The components $ {Y}_j $, $ j=1,\dots, p $, are independent and typically have a standard uniform or Gaussian distribution.

The discriminator *D* has to decide whether a new sample $ {\mathbf{U}}^{\ast } $ generated by *G* is fake, following the distribution $ {p}_G $, or coming from the real observations with distribution $ {p}_{\mathrm{data}} $. The discriminator expresses its guess as a value between 0 and 1, corresponding to its predicted probability that the sample comes from the real data distribution $ {p}_{\mathrm{data}} $. Both the generator and the discriminator become better during the game, in which *D* is trained to maximize the probability of correctly labeling samples from both sources, and *G* is trained to minimize *D*’s performance and thus learns to generate more realistic samples.

Mathematically, the optimization problem is a two-player minimax game with cross-entropy objective function:

$$ \underset{G}{\min}\,\underset{D}{\max}\;{\mathrm{E}}_{\mathbf{U}\sim {p}_{\mathrm{data}}}\left[\log D\left(\mathbf{U}\right)\right]+{\mathrm{E}}_{\mathbf{Y}}\left[\log \left(1-D\left(G\left(\mathbf{Y}\right)\right)\right)\right]. \hskip4em (3) $$

In equilibrium, the optimal generator satisfies $ {p}_G={p}_{\mathrm{data}} $ (Goodfellow et al., Reference Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio, Ghahramani, Welling, Cortes, Lawrence and Weinberger2014).

In practice, the discriminator and the generator are modeled through feed-forward neural networks. While the equilibrium with $ {p}_G={p}_{\mathrm{data}} $ is only guaranteed thanks to the convex–concave property of the nonparametric formulation (3), the results remain very good as long as suitable adjustments to the losses, training algorithm, and overall architecture are made to improve the stability and convergence of training. For instance, a standard adjustment to the generator’s loss function was already proposed in the original GAN paper (Goodfellow et al., Reference Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio, Ghahramani, Welling, Cortes, Lawrence and Weinberger2014), and consists of training the generator to minimize $ -\log \left(D\left(G\left(\mathbf{Y}\right)\right)\right) $ rather than $ \log \left(1-D\left(G\left(\mathbf{Y}\right)\right)\right) $. This new loss function is called the nonsaturating heuristic loss and was suggested to mitigate the near-zero gradients that occur early in training. Indeed, while the generator is still poorly trained, the discriminator can easily detect its samples and therefore assigns values close to zero to them. This results in very slow improvement of the generator during backpropagation, as the gradients associated with values around $ D\left(G\left(\mathbf{Y}\right)\right)=0 $ are too small, and thus the loss $ L(G)=\log \left(1-D\left(G\left(\mathbf{Y}\right)\right)\right) $ saturates. The nonsaturating loss provides stronger gradients in early training without compromising the dynamics of the two agents.
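
The saturation argument can be checked with a one-line gradient computation: at a value p = D(G(Y)) close to zero, the saturating loss log(1 - p) has derivative -1/(1 - p), roughly -1, while the nonsaturating loss -log(p) has derivative -1/p, which is much larger in magnitude. The value p = 0.01 below is only an illustration of the early-training regime.

```python
# Gradients (w.r.t. p = D(G(Y))) of the two generator losses when the
# discriminator easily rejects fake samples (p near 0):
#   saturating:     L = log(1 - p)  ->  dL/dp = -1 / (1 - p)
#   nonsaturating:  L = -log(p)     ->  dL/dp = -1 / p
p = 0.01  # early training: discriminator assigns ~0 to generated samples
grad_saturating = -1.0 / (1.0 - p)
grad_nonsaturating = -1.0 / p
print(abs(grad_saturating), abs(grad_nonsaturating))  # ~1.01 vs 100
```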

The architecture of the generator *G* and the discriminator *D* can be specifically designed for the structure of the data under consideration. In this direction, Radford et al. (Reference Radford, Metz and Chintala2016) introduced convolutional neural networks into the GAN framework, yielding the deep convolutional generative adversarial network (DCGAN). It has become the standard choice for image data and, more generally, for any high-dimensional vector of observations with a complex underlying dependence structure, showing superior performance in representation and recognition tasks.

### 2.3. evtGAN

Here we give an overview of evtGAN. Let $ {\mathbf{X}}_1,\dots, {\mathbf{X}}_m $ be independent observations of dimensionality *d*, where $ {\mathbf{X}}_i=\left({X}_{i1},\dots, {X}_{id}\right) $ and $ m=k\cdot n $. Now let $ {\mathbf{Z}}_1,\dots, {\mathbf{Z}}_n $ be independent observations obtained as block maxima taken sequentially over the span of *k* observations, that is, $ {\mathbf{Z}}_1=\max \left({\mathbf{X}}_1,\dots, {\mathbf{X}}_k\right) $, $ {\mathbf{Z}}_2=\max \left({\mathbf{X}}_{k+1},\dots, {\mathbf{X}}_{2k}\right) $, and so on (see also Section 2.1), where the maxima are taken in each component. The evtGAN algorithm (Algorithm 1) below takes these empirical block maxima as input and follows a copula approach where marginal distributions and the dependence structure are treated separately. In particular, this allows us to impose different amounts of parametric assumptions on the margins and the dependence.

It is known that classical GANs that are trained with bounded or light-tailed noise input distributions in the latent space will also generate bounded or light-tailed samples, respectively (see Wiese et al., Reference Wiese, Knobloch and Korn2019; Huster et al., Reference Huster, Cohen, Lin, Chan, Kamhoua, Leslie, Chiang, Sekar, Meila and Zhang2021). That means that they are not well-suited for extrapolating in a realistic way outside the range of the data. For this reason, we suppose that the approximation of the margins by generalized extreme value distributions $ {G}_j $ as in (1) holds, which allows for extrapolation beyond the data range. On the other hand, we do not make explicit use of the assumption of a multivariate max-stable distribution as in (2). This has two reasons. First, while the univariate limit (1) holds under very weak assumptions, the multivariate max-stability as in (2) requires much stronger assumptions (see Resnick, Reference Resnick2008) and may be too restrictive for cases with asymptotic independence (e.g., Wadsworth and Tawn, Reference Wadsworth and Tawn2012). Second, even if the data follow a multivariate max-stable distribution (2), the probabilistic structure is difficult to impose on a flexible machine learning model such as a GAN.

The margins are first fitted by a parametric GEV distribution (line 1 in Algorithm 1). They are then normalized to (approximate) pseudo-observations by applying the empirical distribution functions to each margin (line 2). This standardization to uniform margins stabilizes the fitting of the GAN. Alternatively, we could use the fitted GEV distributions for normalization but this seems to give slightly worse results in practice. The pseudo-observations contain the extremal dependence structure of the original observations. Since this dependence structure is very complex in general and our focus is to reproduce gridded spatial fields, we do not rely on restrictive parametric assumptions but rather learn it by a highly flexible DCGAN (line 3). From the fitted model we can efficiently simulate any number of new pseudo-observations that have the correct extremal dependence (line 4). Finally, we transform the generated pseudo-observations back to the original scale to have realistic properties of the new samples also in terms of the margins (line 5).
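
The five steps above can be sketched in code. Since training a DCGAN is beyond a short snippet, steps 3–4 are replaced here by a bootstrap placeholder that merely resamples pseudo-observations (and so, unlike a GAN, cannot generate genuinely new dependence patterns); the GEV margin fitting, rank-based pseudo-observations, and GEV back-transform follow the description above. All function names and the toy data are our own.

```python
import numpy as np
from scipy.stats import genextreme

def evtgan_pipeline(Z, n_new, rng):
    """Sketch of the evtGAN steps with the GAN replaced by a bootstrap.

    Z: (n, d) array of componentwise block maxima.
    """
    n, d = Z.shape
    # 1. Fit a GEV distribution to each margin (SciPy's shape c = -xi).
    fits = [genextreme.fit(Z[:, j]) for j in range(d)]
    # 2. Normalize to pseudo-observations with the empirical cdf
    #    (mid-ranks, so values stay strictly inside (0, 1)).
    U = np.argsort(np.argsort(Z, axis=0), axis=0) / n + 1 / (2 * n)
    # 3.-4. Placeholder for "train a DCGAN on U, then sample from it":
    #    here we bootstrap rows of U, which preserves the dependence
    #    structure but is NOT the generative step used in evtGAN.
    U_new = U[rng.integers(0, n, size=n_new)]
    # 5. Back-transform each margin with the fitted GEV quantile function.
    Z_new = np.column_stack(
        [genextreme.ppf(U_new[:, j], *fits[j]) for j in range(d)]
    )
    return Z_new

rng = np.random.default_rng(3)
Z = rng.gumbel(loc=20.0, scale=2.0, size=(100, 5))  # toy block maxima
Z_new = evtgan_pipeline(Z, n_new=500, rng=rng)
print(Z_new.shape)  # (500, 5)
```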

## 3. Application

We assess the ability of the proposed method to correctly model the spatial tail dependence structure between climate extremes. Because observational records are relatively short and contain temporal trends, we rely on 2000 years of climate model output that is representative of present-day weather over western Europe. In particular, we apply our approach to summer temperature and winter precipitation maxima, yielding 2000 and 1999 maxima for temperature and precipitation, respectively. We then compare the performance of our algorithm to the Brown–Resnick model, a state-of-the-art statistical model for spatial extremes, using the fractal variogram family and the fitting procedure described in Section 2.1.2. To obtain an unbiased estimate of the error that each method makes, we divide the data into a training set of 50 observations and take the rest as a test set. A sensitivity analysis with other sample sizes is also carried out. Both methods are fitted using only the training data set, and the evaluation treats the information contained in the test set as ground truth. When not stated otherwise, the results from the Brown–Resnick model are analytical, while the results from evtGAN are obtained by simulating 10,000 data points.

### 3.1. Data

The model experiment, which uses large ensemble simulations with the EC-Earth global climate model (v2.3; Hazeleger et al., Reference Hazeleger, Wang, Severijns, Stefnescuă, Bintanja, Sterl, Wyser, Semmler, Yang, van den Hurk, van Noije, van der Linden and van der Wiel2012), has been widely used in climate impact studies (Van der Wiel et al., Reference van der Wiel, Stoop, Van Zuijlen, Blackport, van den Broek and Selten2019a, Reference van der Wiel, Lenderink and De Vries2021; Tschumi et al., Reference Tschumi, Lienert, van der Wiel, Joos and Zscheischler2021; Van Kempen et al., Reference van Kempen, van der Wiel and Melsen2021; Vogel et al., Reference Vogel, Rivoire, Deidda, Rahimi, Sauter, Tschumi, van der Wiel, Zhang and Zscheischler2021) and was originally designed to investigate the influence of natural variability on climate extremes. A detailed description of these climate simulations is provided in Van der Wiel et al. (Reference van der Wiel, Wanders, Selten and Bierkens2019b); here we only provide a short overview. The large ensemble contains 2000 years of daily weather data, representative of the present-day climate. Present-day was defined by the observed value of global mean surface temperature (GMST) over the period 2011–2015 (HadCRUT4 data; Morice et al., Reference Morice, Kennedy, Rayner and Jones2012); the 5-year time slice in long transient model simulations (forced by historical and Representative Concentration Pathway (RCP) 8.5 scenarios) with the same GMST was simulated repeatedly. To create the large ensemble, at the start of the time slice, 25 ensemble members were branched off from the 16 transient runs. Each ensemble member was integrated for the 5-year time slice period. Differences between ensemble members were forced by choosing different seeds for the atmospheric stochastic perturbations (Buizza et al., Reference Buizza, Milleer and Palmer1999).
This resulted in a total of $ 16\times 25\times 5=2000 $ years of meteorological data at T159 horizontal resolution (approximately 1°), from which we selected temperature (Kelvin) and precipitation (m/day) for this article.

We choose an area that is big enough to be relevant for climate applications while being small enough to ensure a fair comparison between our approach and the classical statistical approach to modeling spatial tail dependence, the Brown–Resnick process. Our analysis thus focuses on a total of $ 18\times 22 $ grid points covering most of western Europe. For that area, we compute for each grid point the summer temperature maxima and winter precipitation maxima.

### 3.2. Architecture and hyper-parameters

In our application, we use the nonsaturating heuristic loss (Goodfellow et al., Reference Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio, Ghahramani, Welling, Cortes, Lawrence and Weinberger2014) for the generator and the standard empirical cross-entropy loss for the discriminator. Training proceeds iteratively, with two discriminator updates per generator update. Larger numbers of discriminator iterations per generator iteration were also considered, but they did not improve results for the considered sample sizes and input dimensions and only lengthened training. We incorporate an exponential moving average scheme (Gidel et al., Reference Gidel, Berard, Vignoud, Vincent and Lacoste-Julien2019) into the training algorithm, as it demonstrated greater stability, much better convergence, and improved results.
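The training schedule can be sketched as follows. This is a schematic sketch in plain NumPy; the function names and the toy parameter updates are illustrative assumptions, not the paper's implementation, which uses the losses and optimizer described in this section:

```python
import numpy as np

ALPHA = 0.9    # exponential moving average decay (see Section 3.2)
N_CRITIC = 2   # discriminator updates per generator update

def train(gen_params, update_discriminator, update_generator, n_iters):
    """Schematic GAN loop: the discriminator takes N_CRITIC steps per
    generator step, and an EMA copy of the generator parameters is
    maintained and used for sampling after training."""
    ema_params = gen_params.copy()
    for _ in range(n_iters):
        for _ in range(N_CRITIC):
            update_discriminator()
        gen_params = update_generator(gen_params)
        # The EMA smooths the generator's parameter trajectory.
        ema_params = ALPHA * ema_params + (1.0 - ALPHA) * gen_params
    return gen_params, ema_params

# Toy run: the "generator update" just moves parameters toward a target.
target = np.array([1.0, -2.0])
calls = []
raw, ema = train(np.zeros(2),
                 update_discriminator=lambda: calls.append(1),
                 update_generator=lambda p: p + 0.1 * (target - p),
                 n_iters=50)
```

The EMA copy trails the raw parameters, which is exactly what stabilizes sampling when the adversarial dynamics oscillate.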

Since we focus on reproducing gridded spatial fields, we use the DCGAN model (Radford et al., Reference Radford, Metz and Chintala2016) as an architectural design to take advantage of convolutions (Fukushima, Reference Fukushima1980). For a more accurate analysis of the input data, we apply zero-padding of a single layer to the train set observations before they are fed to the discriminator (Albawi et al., Reference Albawi, Mohammed and Al-Zawi2017), increasing the dimension of its input to $ 20\times 24 $ . We use classical regularization tools such as drop-out (Srivastava et al., Reference Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov2014) and batch-normalization (Ioffe and Szegedy, Reference Ioffe, Szegedy, Bach and Blei2015), and we train the neural network with the Adam optimizer (Kingma and Ba, Reference Kingma and Ba2015), a learning rate of $ 2\times {10}^{-4} $ , a batch size of 50, and a total of $ \mathrm{30,000} $ training epochs. The value of the exponential moving average parameter in the training algorithm is set to $ \alpha =0.9 $ . The hyperparameter values are the result of extensive heuristic tuning and are only valid for the input dimension and the sample sizes involved in training the network. That being said, we are fairly confident that only minimal tuning of the hyperparameters/architecture would be required if, for instance, higher-resolution inputs were used, as these models are fairly robust to such changes. Indeed, models based on the general GAN paradigm are traditionally trained on much higher-dimensional objects such as images, and they learn well using an architecture that is very close to the one we have adopted.
One may argue that the underlying distributions defining real-world images are “simpler,” and therefore easier to learn; this is why we propose the evtGAN model to address the increased complexity of the underlying distribution of the data at hand. Initially, we started with a coarser grid (lower-resolution input) covering France only, and then extended the domain to a larger part of Europe as the results were very good. For this change, the model’s architecture only needed a few additional layers in the generator. For a given input resolution, increasing the complexity of the model (using a deeper one) merely increased training time and delayed convergence. Figures 1 and 2 illustrate the final architectures of the discriminator and the generator, as well as the values of some of the hyperparameters used for training.
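For instance, the single layer of zero-padding applied to each $ 18\times 22 $ field before it enters the discriminator can be written as follows. This is a minimal sketch; in practice the padding may equally be implemented as a layer inside the network:

```python
import numpy as np

# One layer of zeros around each 18 x 22 field yields the 20 x 24
# discriminator input described in Section 3.2.
field = np.arange(18 * 22, dtype=float).reshape(18, 22)
padded = np.pad(field, pad_width=1, mode="constant", constant_values=0.0)
```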

### 3.3. Results

A particularity of evtGAN is that it decomposes the full high-dimensional distribution of the data into its marginal distributions and its dependence structure and processes them separately. We thus first report estimated key parameters of the marginal distributions. For temperature, the location parameter $ \mu $ is typically higher over land than over oceans (Figure 3a). Furthermore, there is a trend toward lower values of $ \mu $ for more northern regions, illustrating the well-known latitudinal temperature gradient from south to north. The scale parameter $ \sigma $ , representing the width of the distribution, is also higher over land than over the ocean, with a fairly homogeneous distribution over land (Figure 3b). The shape parameter $ \xi $ is well below 0 for most regions except some areas in the Atlantic north of Spain and in the Mediterranean at the North African coast, indicating a bounded tail (Figure 3c).

For precipitation, the location parameter $ \mu $ differs less between land and ocean than for temperature but is spatially more heterogeneous (Figure 3d), illustrating orographic effects on precipitation extremes. For the scale parameter $ \sigma $ , there is also no clear land–sea contrast, but areas with markedly higher values (corresponding to higher interannual variability in extremes) occur over the western part of the Iberian peninsula and along a coastal band at the northern Mediterranean coast (Figure 3e). The shape parameter $ \xi $ is spatially quite heterogeneous, with values ranging from −0.15 to 0.25 (Figure 3f). Overall, the parameters of the extreme value distributions of precipitation extremes show much higher spatial heterogeneity than those for temperature extremes, suggesting that it might be more difficult to learn a model that represents well all tail dependencies for the entire region.
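The marginal step underlying these estimates (fitting a GEV at each grid point and mapping the maxima to uniform scale, so that the GAN only has to learn the dependence structure) can be sketched with SciPy as follows. Note that SciPy's shape parameter `c` equals $ -\xi $ , and the synthetic maxima below are purely illustrative:

```python
import numpy as np
from scipy.stats import genextreme

# Synthetic block maxima at one grid point (illustrative values only).
rng = np.random.default_rng(1)
maxima = genextreme.rvs(c=0.2, loc=30.0, scale=2.0, size=100, random_state=rng)

# Fit the GEV and transform to (0, 1) with the fitted CDF; GAN samples on
# uniform scale are later mapped back with the inverse CDF (ppf).
c, loc, scale = genextreme.fit(maxima)
u = genextreme.cdf(maxima, c, loc=loc, scale=scale)
back = genextreme.ppf(u, c, loc=loc, scale=scale)
xi_hat = -c   # SciPy's c is the negative of the GEV shape parameter xi
```

The back-transformation reproduces the original maxima, which is what allows evtGAN samples on uniform scale to be placed back on the physical scale of each grid point.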

Next, we look at examples of bivariate scatterplots of simulated temperature extremes from the different approaches for three cases with varying tail dependence (Figure 4). The three rows correspond to three pairs of grid points with weak, mild, and strong tail dependence, respectively. The train sets (Figure 4a,f,k) illustrate the distributions from which the models are tasked to learn while the test sets (Figure 4b,g,l) are used as ground truth. As can be clearly seen, extremes are much less likely to co-occur between locations that are characterized by weak tail dependence (Figure 4b) compared to locations that are characterized by strong tail dependence (Figure 4l). Purely based on visual judgment, it seems that evtGAN is able to characterize the different tail dependencies relatively well and can simulate samples outside of the range of the train set (Figure 4c,h,m) whereas DCGAN only simulates distributions bounded to the range of the train set (Figure 4d,i,n) and Brown–Resnick tends to overestimate tail dependence in cases where tail dependence is weak (Figure 4e,j).

The corresponding figure for precipitation extremes is shown in Figure 5. Conclusions are similar to the ones derived based on temperature extremes. The evtGAN simulates distributions with different tail dependencies and samples that are far outside the range of the train set (Figure 5c,h,m). DCGAN simulates distributions that are bounded to the train set range (Figure 5d,i,n). On the other hand, Brown–Resnick overestimates tail dependence in the case of mild tail dependence (Figure 5j), and underestimates tail dependence when tail dependence is strong (Figure 5o).

A scatterplot of bivariate extremal correlations between 100 randomly selected locations estimated from the train set, evtGAN, and Brown–Resnick, respectively, against estimates based on the test set (1950 samples) is shown in Figure 6. The estimates derived directly from the train sets (Figure 6a,d) are the benchmark; by design, better performance is not possible. Clearly, pairs of locations with stronger tail dependence are much more likely for temperature (Figure 6a) than for precipitation (Figure 6d), confirming the impression obtained from the spatial homogeneity of the parameters in the extreme value distributions (Figure 3). Furthermore, evtGAN captures the observed relationship better than Brown–Resnick. Brown–Resnick has difficulties in particular with pairs that have weak or no tail dependence (extremal correlation equal to 0, lower left in the figures), which is consistent with Figures 4 and 5.
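Bivariate extremal correlations such as those compared here can be estimated empirically. The following is a hedged sketch of a simple rank-based estimator at a finite threshold; the exact estimator used for the figures may differ:

```python
import numpy as np

def extremal_correlation(x, y, q=0.95):
    """Empirical extremal correlation at threshold q:
    chi(q) = P(U > q, V > q) / (1 - q) on rank-transformed margins.
    chi near 1 indicates strong tail dependence, near 0 independence."""
    n = len(x)
    u = np.argsort(np.argsort(x)) / n   # pseudo-uniform ranks in [0, 1)
    v = np.argsort(np.argsort(y)) / n
    return np.mean((u > q) & (v > q)) / (1.0 - q)

rng = np.random.default_rng(2)
z = rng.normal(size=5000)
chi_dep = extremal_correlation(z, z)                      # comonotone pair
chi_ind = extremal_correlation(z, rng.normal(size=5000))  # independent pair
```

On the comonotone pair the estimate is close to 1, and on the independent pair it is close to 0, matching the two regimes visible in the lower left and upper right of Figure 6.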

To further explore the behavior of evtGAN and Brown–Resnick in dealing with different levels of dependence, Figure 7 shows a comparison of the estimated spectral distributions for the same pairs of grid points as in Figures 4 and 5, characterized by weak (a,d), mild (b,e), and strong tail dependence (c,f), respectively. The magenta bars show the estimates based on the 1950 samples in the test set. The estimates of evtGAN are very close to the ground truth for all cases, that is, weak, mild, and strong dependence (red lines in Figure 7), except for some bias in the case of mild dependence in precipitation extremes (Figure 7e). In contrast, the performance of the Brown–Resnick model (black lines in Figure 7) is much more variable. It captures relatively well the two pairs with weak tail dependence (Figure 7a,d) and does a decent job for strong dependence in temperature extremes (Figure 7c) but fails completely for the remaining three cases (Figure 7b,e,f). Furthermore, in contrast to Brown–Resnick, evtGAN is able to represent asymmetric dependence structures, as is evident from Figure 7.
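Spectral distributions like those in Figure 7 can be estimated nonparametrically. A common rank-based recipe (our own illustration, with hypothetical function names) transforms the margins to unit Fréchet and keeps the angles of the points with the largest radii:

```python
import numpy as np

def angular_components(x, y, k=200):
    """Angles w = z1 / (z1 + z2) of the k points with the largest radius
    r = z1 + z2 after a rank transform to unit Frechet margins; their
    empirical distribution approximates the spectral distribution."""
    n = len(x)
    u = (np.argsort(np.argsort(x)) + 1) / (n + 1)
    v = (np.argsort(np.argsort(y)) + 1) / (n + 1)
    z1, z2 = -1.0 / np.log(u), -1.0 / np.log(v)   # unit Frechet margins
    r = z1 + z2
    keep = np.argsort(r)[-k:]
    return z1[keep] / r[keep]

rng = np.random.default_rng(3)
z = rng.normal(size=5000)
w_dep = angular_components(z, z)                      # comonotone pair
w_ind = angular_components(z, rng.normal(size=5000))  # independent pair
```

For a fully dependent pair all angles collapse at $ 1/2 $ , whereas for an independent pair they spread toward the endpoints 0 and 1, which is the contrast the histograms in Figure 7 visualize.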

We finally present a sensitivity analysis of the performance of evtGAN for different realistic sample sizes, namely 30, 50, and 100, while learning for 30,000 epochs (Figure 8). Overall, the error tends to decrease the longer the models learn, and it does not seem to matter whether we evaluate the performance on the train set or the test set (the difference between black and red lines in Figure 8 is small). For small sample sizes ( $ n=30 $ ), the best performance might be reached at fewer than 30,000 epochs (Figure 8a,b), and the $ {l}^2 $ norm between the extremal coefficients of evtGAN and the train set could be used as a stopping criterion. An improvement in performance with increasing sample size is clearly visible (compare Figure 8a,b with Figure 8e,f). Furthermore, the model errors for temperature extremes are smaller than those for precipitation extremes (left vs. right column in Figure 8).
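Such a stopping criterion could be implemented as follows. This is a hedged sketch with hypothetical names; in a real run, the checkpoints and extremal-coefficient vectors would come from the actual training history:

```python
import numpy as np

def extremal_l2(train_chi, gen_chi):
    """l2 norm between the vector of pairwise extremal coefficients of
    the train set and that of samples generated at a checkpoint."""
    return np.linalg.norm(np.asarray(train_chi) - np.asarray(gen_chi))

def best_epoch(history):
    """Select the checkpoint minimizing the criterion from a list of
    (epoch, train_chi, gen_chi) tuples recorded during training."""
    return min(history, key=lambda h: extremal_l2(h[1], h[2]))[0]
```

For example, among three hypothetical checkpoints where the generated extremal coefficients first undershoot, then roughly match, then drift away from the train-set values, `best_epoch` picks the middle one.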

## 4. Discussion

The use of evtGAN combines the best of two worlds: correct extrapolation based on extreme value theory on the one hand, and flexible dependence modeling through GANs on the other hand. GANs are an excellent tool to model complex dependencies in high-dimensional spaces. However, they are typically not tailored to model extremes in the marginals well. Indeed, for a standard DCGAN implementation where the marginals are not estimated by GEV distributions but empirically transformed, Figures 4d,i,n and 5d,i,n show that the generated samples are bounded by the range of the training sample. For an accurate extrapolation that resembles the marginal distributions of the test set, extreme value theory is required (panels c,e,h,j,m, and o of Figures 4 and 5).

On the other hand, classical methods of spatial extreme value theory such as the Brown–Resnick process have accurate extrapolation properties for the marginal distributions. However, for an application to a spatially heterogeneous data set on a large domain (Figure 3), their parametric assumption may be too restrictive. Indeed, Figure 6c,f shows a clear bias of the Brown–Resnick model in terms of bivariate extremal correlations, which is particularly visible for pairs with weak extremal dependence. Another indication for this bias can be seen in Figure 5j,o where the distributions of the Brown–Resnick samples differ strongly from the test set distributions. The evtGAN does not make prior assumptions on spatial stationarity or isotropy and therefore it does not exhibit a bias (points in Figure 6b,e are centered around the diagonal). This is particularly noteworthy since modeling complex nonstationarities for extremes is very difficult with parametric models (Huser and Genton, Reference Huser and Genton2016; Engelke and Hitz, Reference Engelke and Hitz2020). Considering the fitted bivariate distributions of evtGAN and Brown–Resnick underlines this point.

The spectral distributions of the Brown–Resnick model are restricted to a parametric class of functions that, for instance, are symmetric around $ 1/2 $ . The black lines in Figure 7 show that this is too restrictive for our data, since the strength of dependence is not correctly modeled (Figure 7b) or the asymmetry is not captured (Figure 7c,f). The evtGAN (red lines), on the other hand, can model both weak and strong dependence, and it even adapts to possible asymmetries. This is also evident from the scatterplots in Figures 4 and 5, where the shape of the Brown–Resnick samples is restricted to a parametric subclass of distributions, the so-called Hüsler–Reiss family (Hüsler and Reiss, Reference Hüsler and Reiss1989).
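The symmetry restriction can be made concrete via the Pickands dependence function of the Hüsler–Reiss family, which satisfies $ A(w)=A(1-w) $ for every value of the dependence parameter. The short illustration below uses $ \lambda $ for the usual Hüsler–Reiss parameter:

```python
import numpy as np
from scipy.stats import norm

def husler_reiss_pickands(w, lam):
    """Pickands dependence function of the Husler-Reiss family with
    dependence parameter lam > 0. The symmetry A(w) = A(1 - w) means
    the implied spectral distribution is symmetric around 1/2."""
    w = np.asarray(w, dtype=float)
    a = lam + np.log(w / (1.0 - w)) / (2.0 * lam)
    b = lam + np.log((1.0 - w) / w) / (2.0 * lam)
    return w * norm.cdf(a) + (1.0 - w) * norm.cdf(b)

w = np.linspace(0.01, 0.99, 99)
A = husler_reiss_pickands(w, lam=1.0)
```

Whatever the value of $ \lambda $ , swapping $ w $ and $ 1-w $ merely exchanges the two summands, so an asymmetric empirical spectral distribution can never be matched by this family.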

A further restriction of classical spatial max-stable processes is that they are limited to modeling asymptotic dependence. For precipitation, Figure 6d–f shows that most of the test extremal correlations are close to zero, indicating asymptotic independence. While evtGAN is able to capture this fairly well (Figure 6e), the Brown–Resnick model always has positive extremal correlations, explaining the bias in the bottom left corner of Figure 6f. A spatial model with asymptotic independence (e.g., Wadsworth and Tawn, Reference Wadsworth and Tawn2012) would be a possible remedy, but such processes would still suffer from the limitations induced by nonstationarity and asymmetry described above.

Overall, evtGAN tends to perform better in capturing dependencies between temperature extremes than precipitation extremes (Figure 8). This is likely related to the fact that extremes in temperature are more spatially coherent (Keggenhoff et al., Reference Keggenhoff, Elizbarashvili, Amiri-Farahani and King2014; Figure 3).

## 5. Conclusions

Understanding and evaluating the risk associated with extreme events is of primary importance for society, as recently emphasized in the 6th Assessment Report of the Intergovernmental Panel on Climate Change (Seneviratne et al., Reference Seneviratne, Zhang, Adnan, Badi, Dereczynski, Di Luca, Ghosh, Iskandar, Kossin, Lewis, Otto, Pinto, Satoh, Vicente-Serrano, Wehner, Zhou, Masson-Delmotte, Zhai, Pirani, Connors, Péan, Berger, Caud, Chen, Goldfarb, Gomis, Huang, Leitzell, Lonnoy, Matthews, Maycock, Waterfield, Yelekçi, Yu and Zhou2021). Extreme event analysis and impact assessments are often limited by the available sample sizes. Furthermore, simulations with complex climate models are very expensive. Here we combine a machine learning approach with extreme value theory to model complex spatial dependencies between extreme events in temperature and precipitation across Europe based on a limited sample size. We demonstrate that this hybrid approach outperforms the typically used approach in multivariate extreme value theory and can well represent the marginal distributions and extremal dependencies across spatially distributed climate extremes. The approach can be easily adapted to other types of extremes and used to create the large sample sizes that are often required for climate risk assessments.

## Data Availability Statement

The temperature and precipitation maxima alongside the scripts to reproduce the results in this study are available at https://doi.org/10.5281/zenodo.5821485.

## Author Contributions

Conceptualization: Y.B., E.V., S.E.; Data curation: K.W.; Data visualization: Y.B.; Formal analysis: Y.B.; Methodology: Y.B., E.V., S.E.; Software: Y.B., E.V.; Supervision: J.Z., S.E.; Writing—original draft: J.Z., E.V., S.E. All authors approved the final submitted draft.

## Funding Statement

This work was funded by the Swiss National Science Foundation (grant nos. 179876, 189908, and 186858) and the Helmholtz Initiative and Networking Fund (Young Investigator Group COMPOUNDX; grant agreement no. VH-NG-1537).

## Competing Interests

The authors declare no competing interests exist.