Modeling and simulating spatial extremes by combining extreme value theory with generative adversarial networks

Abstract Modeling dependencies between climate extremes is important for climate risk assessment, for instance when allocating emergency management funds. In statistics, multivariate extreme value theory is often used to model spatial extremes. However, most commonly used approaches require strong assumptions and are either too simplistic or over-parameterized. From a machine learning perspective, generative adversarial networks (GANs) are a powerful tool to model dependencies in high-dimensional spaces. Yet in the standard setting, GANs do not represent dependencies in the extremes well. Here we combine GANs with extreme value theory (evtGAN) to model spatial dependencies in summer maxima of temperature and winter maxima of precipitation over a large part of western Europe. We use data from a stationary 2000-year climate model simulation to validate the approach and explore its sensitivity to small sample sizes. Our results show that evtGAN outperforms classical GANs and standard statistical approaches to model spatial extremes. Already with about 50 years of data, which corresponds to commonly available climate records, we obtain reasonably good performance. In general, dependencies between temperature extremes are better captured than dependencies between precipitation extremes due to the high spatial coherence in temperature fields. Our approach can be applied to other climate variables and can be used to emulate climate models when running very long simulations to determine dependencies in the extremes is deemed infeasible.


Introduction
Understanding and modeling climate extremes such as floods, heatwaves, and heavy precipitation are of paramount importance because they often lead to severe impacts on socioeconomic systems (Field et al., 2012). Many impacts are associated with compounding drivers, for instance, multiple hazards occurring at the same time at the same or different locations and affecting the same system (Leonard et al., 2014; Zscheischler et al., 2018). Ignoring potential dependencies between multiple hazards can lead to severe misspecification of the associated risk (Zscheischler and Seneviratne, 2017; Hillier and Dixon, 2020). However, estimating dependencies between extremes is a challenging task, requiring large amounts of data and/or suitable approaches to model the phenomena of interest in a computationally feasible time.
A particularly challenging class of events is spatially compounding events. Spatially compounding events occur when multiple connected locations are affected by the same or different hazards within a limited time window, thereby causing an impact. The compounding is established via a system capable of spatial integration, which accumulates hazard impacts in spatially distant locations. For instance, the global food system is vulnerable to multiple co-occurring droughts and heatwaves in key crop-producing regions of the world (Anderson et al., 2019; Mehrabi and Ramankutty, 2019). Similarly, spatially extensive floods (Jongman et al., 2014) or sequences of cyclones can deplete emergency response funds. Climate extremes can be correlated over very large distances due to teleconnections (Boers et al., 2019), but modeling such dependencies is challenging.
One approach to tackle the challenge of spatially correlated extremes is to create a large amount of data by running very long simulations with state-of-the-art climate models, which have the physical spatial dependencies built into them. For instance, in recent years large ensembles have been created for a number of climate models (Deser et al., 2020). However, these simulations typically have rather coarse resolution, are usually not stationary in time, and are very expensive to run.
Extreme value theory provides mathematically justified models for the tail region of a multivariate distribution (X_1, …, X_d), d ≥ 2. This enables extrapolation beyond the range of the data and the accurate estimation of the small probabilities of rare (multivariate) events. Statistical methods building on this theory are popular in a broad range of domains such as meteorology (Le et al., 2018), climate science (Naveau et al., 2020), and finance (Poon et al., 2004). Applications have so far been limited to low-dimensional settings for several reasons. On the one hand, even for moderately large dimensions d, the fitting and simulation of parametric models is computationally intensive because it requires computing complex likelihoods (Dombry et al., 2017). On the other hand, the extremal dependence structures in applications are difficult to model, and the existing approaches are either simplistic or over-parameterized. In spatial applications, for instance, Euclidean distance is used to parametrize the isotropic dependence of stationary max-stable random fields (Blanchet and Davison, 2011). Most real-world datasets that cover larger spatial domains do, however, feature nonstationarities in space that cannot be captured by the stationarity assumed in current geostatistical models; see Engelke and Ivanovs (2021) for a review of recent extreme value methods in higher dimensions.
Methods from machine learning are well-suited for complex and nonstationary data sets. Their loss functions are, however, typically designed with the purpose to predict well in the bulk of the distribution. It is therefore a difficult problem to construct approaches with an accurate performance outside the range of the training data. In prediction problems, one possibility is to adapt the loss function to make the algorithm more sensitive to extreme values, as for instance done in quantile regression forests (Meinshausen, 2006) or extreme gradient boosting (Velthoen et al., 2021) for the prediction of high conditional quantiles. Jalalzai et al. (2018) discuss this problem for classification in extreme regions and propose a new notion of asymptotic risk that helps to define classifiers with good generalization capacity beyond the observations' range in the predictor space.
Rather than predicting an unknown response, in this work, we are interested in a generative model that is able to learn a high-dimensional distribution (X_1, …, X_d) both in the bulk and in the extremes. We concentrate on generative adversarial networks (GANs), which are known to be competitive models for multivariate density estimation. While classical applications of GANs are often in the field of image analysis (Zhang et al., 2017; Choi et al., 2018; Karras et al., 2018), they have been used much more broadly in domains such as finance (Efimov et al., 2020), fraud detection (Zheng et al., 2019), speech recognition (Sahu et al., 2020), and medicine (Schlegl et al., 2017).
There are two different aspects that characterize the extremal properties of multivariate data: the univariate tail behavior of each margin and the extremal dependence between the largest observations. For GANs, it has recently been shown (Wiese et al., 2019; Huster et al., 2021) that the marginal distributions of the generated samples are either bounded or light-tailed if the input noise is uniform or Gaussian, respectively. Huster et al. (2021) propose to use heavy-tailed input to overcome this issue. Bhatia et al. (2021) propose the ExGAN algorithm that uses conditional GANs to perform importance sampling of extreme scenarios. The main difference to our approach is that ExGAN simulates single extreme events, while our model will rely on block maxima and therefore does not suffer from serial correlation and seasonality issues. Moreover, ExGAN does not model the marginal distributions with extreme value theory and, since the input noise of ExGAN is Gaussian, this may be problematic especially for heavy-tailed data sets in view of the results of Wiese et al. (2019) and Huster et al. (2021) mentioned above.
In this article, we concentrate on modeling the extremal properties of both the marginal distributions and the dependence structure in a realistic way even for high-dimensional and spatially nonstationary data. Our model, called evtGAN, combines the asymptotic theory of extremes with the flexibility of GANs to overcome the limitations of classical statistical approaches to low dimensional and stationary data. By using extrapolation for the marginals from extreme value theory, our method can be applied to datasets with arbitrary combinations of light-tailed and heavy-tailed distributions. We use a stationary 2000-year climate simulation to test and validate our approach.
In sum, in this article, we demonstrate how our method combines the best of extreme value theory and GANs in order to efficiently and accurately learn complex multidimensional distributions with nontrivial extremal dependence structures. Furthermore, one can easily sample from these learned distributions, which facilitates the study of complex data such as spatially distributed climate extremes with spatially highly heterogeneous dependence structures. The remainder of the article is structured as follows: in Section 2, we briefly lay out the theoretical background of extreme value theory and GANs, and present the main algorithm of our methodology. Section 3 contains a description of the data and the model's architecture and hyperparameters used for its training, and a presentation of the obtained results, which we then discuss in Section 4. Finally, concluding remarks are laid out in Section 5.

Methodology
In the climate community, there is a great need for methods able to efficiently provide an empirical description of the climate system, including extreme events, starting from as few ensemble runs as possible (Castruccio et al., 2019; Deser et al., 2020). Emulation techniques that tackle this challenge have been proposed in the last few years, usually focusing on the spatial correlation properties of the phenomena of interest (McKinnon et al., 2017; McKinnon and Deser, 2018; Link et al., 2019; Brunner et al., 2021). In this work, our aim is to build an emulator specifically designed for extreme events that is able to reproduce the spatial tail dependencies of the data and extrapolate outside the range of the training samples. Our method does not require discarding any simulations and is therefore highly efficient. In this section, we first recall some background on extreme value theory and GANs, and then describe our algorithm called evtGAN. Similar to a copula approach (e.g., Brunner et al., 2021), it relies on the idea of disentangling the modeling of the marginal distributions and the extremal dependence structure in order to use extreme value theory for the former and GANs for the latter.

Extreme value theory
Extreme value theory provides mathematically justified methods to analyze the tail of a d-dimensional random vector X = (X_1, …, X_d). This branch of statistical theory has been widely applied in climate science, for example, to study droughts (Burke et al., 2010), floods (Asadi et al., 2018), heatwaves (Tanarhte et al., 2015), and extreme rainfall (Le et al., 2018). For multivariate data, there are two different aspects that determine the quality of an extrapolation method: the univariate tail behavior of each margin and the extremal dependence between the largest observations. We explain how to measure and model those based on componentwise maxima. Let X_1, …, X_k be independent copies of X and define the vector of componentwise maxima M_k = (M_k1, …, M_kd), where M_kj = max(X_1j, …, X_kj) is the maximum over the jth margin.

Univariate theory
The Fisher-Tippett-Gnedenko theorem (e.g., Coles et al., 2001) describes the limit behavior of the univariate maximum M_kj. It states that if there exist sequences a_kj ∈ R, b_kj > 0 such that

(M_kj − a_kj) / b_kj → Z_j in distribution as k → ∞,  (1)

then if G_j is the distribution function of a nondegenerate random variable Z_j, it is in the class of generalized extreme value (GEV) distributions,

G_j(z) = exp{ −[1 + ξ_j (z − μ_j)/σ_j]_+^(−1/ξ_j) },

where x_+ denotes the positive part of a real number x, and ξ_j ∈ R, μ_j ∈ R, and σ_j > 0 are the shape, location, and scale parameters, respectively. The shape parameter is the most important parameter since it indicates whether the jth margin is heavy-tailed (ξ_j > 0), light-tailed (ξ_j = 0), or has a finite upper end-point (ξ_j < 0). It can be shown that under mild conditions on the distribution of margin X_j, appropriate sequences exist for (1) to hold. The above theorem therefore suggests fitting a generalized extreme value distribution Ĝ_j to maxima taken over blocks of the same length, which is common practice when modeling yearly maxima. The shape, location, and scale parameters can then be estimated in different ways, including moment-based estimators (Hosking, 1985), Bayesian methods (Yoon et al., 2010), and maximum likelihood estimation; we use the latter in this work. This allows extrapolation in the direction of the jth marginal, since the size of a T-year event can be approximated by the (1 − 1/T)-quantile of the distribution Ĝ_j, even if T is larger than the length of the data record.
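The block-maxima fit and return-level extrapolation described above can be sketched in a few lines; the following is a minimal illustration on simulated data (the data and the 100-year horizon are our own choices, and note that scipy's shape parameter c equals −ξ in the notation used here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated yearly block maxima for one grid cell, standing in for
# the observed maxima Z_1j, ..., Z_nj of the text (hypothetical data).
maxima = stats.genextreme.rvs(c=0.2, loc=30.0, scale=1.5, size=50,
                              random_state=rng)

# Maximum likelihood fit of the GEV; scipy returns (c, loc, scale)
# with the convention c = -xi.
c_hat, loc_hat, scale_hat = stats.genextreme.fit(maxima)

# The T-year event is approximated by the (1 - 1/T)-quantile of the
# fitted GEV, even when T exceeds the record length of 50 years.
T = 100
return_level = stats.genextreme.ppf(1 - 1 / T, c_hat,
                                    loc=loc_hat, scale=scale_hat)
print(return_level)
```

The same recipe is applied independently to every margin in Algorithm 1 below.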

Multivariate theory
For multivariate data, the correct extrapolation of the tail not only depends on correct models for the marginal extremes, but also on whether the dependence between marginally large values is well captured. For two components X_i and X_j with limits Z_i and Z_j of the corresponding maxima in (1), this dependence is summarized by the extremal correlation χ_ij ∈ [0, 1] (Schlather and Tawn, 2003),

χ_ij = lim_{u→1} P( F_i(X_i) > u | F_j(X_j) > u ).

The larger χ_ij, the stronger the dependence in the extremes between the two components, that is, between Z_i and Z_j. If χ_ij > 0 we speak of asymptotic dependence, and otherwise of asymptotic independence. While the extremal correlation is a useful summary statistic, it does not reflect the complete extremal dependence structure between X_i and X_j. Under asymptotic dependence, the latter can be characterized by the so-called spectral distribution (de Haan and Resnick, 1977). Let X̃_j = −1/log F_j(X_j), j = 1, …, d, be the data normalized to standard Fréchet margins. The spectral distribution describes the extremal angle of (X̃_i, X̃_j), given that the radius exceeds a high threshold, that is,

H(x) = lim_{r→∞} P( X̃_i / (X̃_i + X̃_j) ≤ x | X̃_i + X̃_j > r ),  x ∈ [0, 1].

Under strong extremal dependence, the spectral distribution centers around 1/2; under weak extremal dependence, it has mass close to the boundary points 0 and 1.
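An empirical counterpart of χ_ij replaces the limit by a conditional exceedance frequency at a fixed high threshold u. The sketch below illustrates this on simulated data; the helper name, the threshold u = 0.95, and the rank-based normalization are our own choices, not taken from the text:

```python
import numpy as np

def extremal_correlation(x, y, u=0.95):
    """Empirical chi: frequency that x exceeds its u-quantile
    given that y exceeds its u-quantile (illustrative helper)."""
    n = len(x)
    # Rank-transform each margin to approximate F_j(X_j) in [0, 1).
    rx = np.argsort(np.argsort(x)) / n
    ry = np.argsort(np.argsort(y)) / n
    joint = np.mean((rx > u) & (ry > u))   # P(both exceed)
    return joint / (1 - u)                 # divide by P(y exceeds)

rng = np.random.default_rng(1)
z = rng.standard_normal(10_000)
x = z + 0.1 * rng.standard_normal(10_000)  # strongly dependent with z
y = rng.standard_normal(10_000)            # independent of x

print(extremal_correlation(x, z))  # large: strong tail dependence
print(extremal_correlation(x, y))  # small: near independence
```

At any finite threshold this is only an approximation of the limit; in practice u is chosen high enough that the estimate stabilizes but enough exceedances remain.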
In multivariate extreme value theory, a popular way to ensure a correct extrapolation of tail dependence under asymptotic dependence is by modeling the joint distribution Z = (Z_1, …, Z_d) of the limits in (1) as a max-stable distribution with multivariate distribution function

P(Z_1 ≤ z_1, …, Z_d ≤ z_d) = exp{ −ℓ( −log G_1(z_1), …, −log G_d(z_d) ) },  (2)

where ℓ is the so-called stable tail dependence function (e.g., de Haan and Ferreira, 2007). This is a copula approach in the sense that the function ℓ on the right-hand side is independent of the original marginal distributions G_j and only describes the dependence structure. In practice, the G_j are replaced by estimates Ĝ_j that are fitted to the data first. The right-hand side of (2) is also called an extreme value copula (e.g., Gudendorf and Segers, 2010).
A natural extension of max-stable distributions to the spatial setting is given by max-stable processes {Z(t) : t ∈ R^d} (e.g., Davison et al., 2012), where the maxima Z_i = Z(t_i) are observed at spatial locations t_i, i = 1, …, d. These types of models have been widely applied in many different areas such as flood risk (Asadi et al., 2015), heat waves (Engelke et al., 2019), and extreme precipitation events (Buishand et al., 2008). A popular parametric model for a max-stable process Z is the Brown-Resnick process (Kabluchko et al., 2009; Engelke et al., 2015). This model class is parameterized by a variogram function γ on R^d. In this case, each bivariate distribution is in the so-called Hüsler-Reiss family (Hüsler and Reiss, 1989), and the corresponding dependence parameter is the variogram function evaluated at the distance between the two locations, γ(t_i − t_j). One can show that the extremal correlation coefficient for this model is given by

χ_ij = 2 − 2Φ( √γ(t_i − t_j) / 2 ),

where Φ is the standard normal distribution function. A popular parametric family is the class of fractal variograms, γ_{α,s}(h) = h^α / s, with α ∈ (0, 2] and s > 0. While these methods enjoy many interesting properties, they are often quite complex to fit (Dombry et al., 2017) and impose strong assumptions on the spatial phenomena of interest, such as spatial stationarity, isotropic behavior, and asymptotic dependence between all pairs of locations.
A simple way of fitting a Brown-Resnick process to data is to compute the empirical versions χ̂_ij of the extremal correlation coefficients and then numerically find the parameter values (e.g., α and s for the fractal variogram family) whose implied model extremal coefficients minimize the squared error compared to the empirical ones.
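This least-squares fit can be sketched as follows. The example below recovers fractal variogram parameters from noisy synthetic extremal correlations; the true parameter values, noise level, and optimizer choice are our own assumptions for illustration:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def chi_model(h, alpha, s):
    # Brown-Resnick extremal correlation implied by the fractal
    # variogram gamma(h) = h**alpha / s: chi = 2 - 2*Phi(sqrt(gamma)/2).
    gamma = h ** alpha / s
    return 2 - 2 * norm.cdf(np.sqrt(gamma) / 2)

# Toy "empirical" estimates chi_hat at 40 pairwise distances, generated
# from known parameters (alpha=1.5, s=2.0) plus small noise.
rng = np.random.default_rng(2)
h = np.linspace(0.1, 5.0, 40)
chi_hat = chi_model(h, 1.5, 2.0) + 0.02 * rng.standard_normal(40)

def loss(params):
    alpha, s = params
    if not (0 < alpha <= 2) or s <= 0:   # enforce parameter domain
        return np.inf
    return np.sum((chi_model(h, alpha, s) - chi_hat) ** 2)

res = minimize(loss, x0=[1.0, 1.0], method="Nelder-Mead")
alpha_hat, s_hat = res.x
print(alpha_hat, s_hat)
```

With real data, χ̂_ij would come from empirical conditional exceedance frequencies rather than from a known model.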

Generative adversarial network
GANs (Goodfellow et al., 2014) are attracting great interest thanks to their ability of learning complex multivariate distributions that are possibly supported on lower-dimensional manifolds. These methods are generative since they allow the generation of new realistic samples from the learned distribution that can be very different from the training samples.
Given observations U_1, …, U_n from some d-dimensional unknown target distribution p_data, GANs can be seen as a game opposing two agents, a discriminator D and a generator G. The generator G takes as input a random vector Y = (Y_1, …, Y_p) from a p-dimensional latent space and transforms it to a new sample U* = G(Y). The components Y_j, j = 1, …, p, are independent and typically have a standard uniform or Gaussian distribution.
The discriminator D has to decide whether a new sample U * generated by G is fake and following the distribution p G , or coming from the real observations with distribution p data . The discriminator expresses its guesses with a value between 0 and 1 corresponding to its predicted probability of the sample coming from the real data distribution p data . Both the generator and the discriminator become better during the game, in which D is trained to maximize the probability of correctly labeling samples from both sources, and G is trained to minimize D's performance and thus learns to generate more realistic samples.
Mathematically, the optimization problem is a two-player minimax game with cross-entropy objective function

min_G max_D  E_{U ∼ p_data}[ log D(U) ] + E_Y[ log(1 − D(G(Y))) ].  (3)

In equilibrium, the optimal generator satisfies p_G = p_data (Goodfellow et al., 2014).
In practice, the discriminator and the generator are modeled through feed-forward neural networks. While the equilibrium with p_G = p_data is only guaranteed thanks to the convex-concave property of the nonparametric formulation (3), the results remain very good as long as suitable adjustments to the losses, training algorithm, and overall architecture are made to improve the stability and convergence of training. For instance, a standard adjustment to the generator's loss function was already proposed in the original GANs paper (Goodfellow et al., 2014), and consists of training the generator to minimize −E_Y[ log D(G(Y)) ] instead of E_Y[ log(1 − D(G(Y))) ]. This new loss function is called the nonsaturating heuristic loss, and was suggested to mitigate the near-zero gradients that occur in early training. Indeed, while the generator is still poorly trained, the discriminator can easily detect its samples and therefore assigns values close to zero to them. This results in a very slow improvement of the generator during backpropagation, as the gradients of log(1 − D(G(Y))) around D(G(Y)) = 0 are too small; the nonsaturating loss provides stronger gradients in early training without compromising the dynamics of the two agents.
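The gradient argument for the nonsaturating loss can be checked directly. Writing p = D(G(Y)), the saturating loss log(1 − p) has derivative −1/(1 − p), while the nonsaturating loss −log p has derivative −1/p; a minimal numerical sketch (the sample values of p are our own illustration):

```python
import numpy as np

# Discriminator outputs on fake samples; early in training p is near 0.
p = np.array([0.01, 0.1, 0.5, 0.9])

grad_saturating = -1 / (1 - p)   # d/dp of log(1 - p)
grad_nonsaturating = -1 / p      # d/dp of -log(p)

# Near p = 0 the saturating gradient stays close to -1, while the
# nonsaturating gradient is two orders of magnitude larger, which is
# what drives faster early learning of the generator.
print(grad_saturating[0], grad_nonsaturating[0])
```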
The architecture of the generator G and the discriminator D can be specifically designed for the structure of the data under consideration. In this direction, Radford et al. (2016) introduced convolutional neural networks into the GAN framework, yielding the deep convolutional generative adversarial network (DCGAN). It is considered the standard choice for image data and, more generally, for any high-dimensional vector of observations with a complex underlying dependence structure, showing superior performance in representation and recognition tasks.

evtGAN
Here we give an overview of evtGAN. Let X_1, …, X_m be independent observations of dimensionality d, where X_i = (X_i1, …, X_id) and m = k · n. Now let Z_1, …, Z_n be independent observations obtained as block maxima taken sequentially over spans of k observations, that is, Z_1 = max(X_1, …, X_k), Z_2 = max(X_{k+1}, …, X_{2k}), and so on (see also Section 2.1), where the maxima are taken in each component. The evtGAN algorithm (Algorithm 1) below takes these empirical block maxima as input and follows a copula approach where marginal distributions and the dependence structure are treated separately. In particular, this allows us to impose different amounts of parametric assumptions on the margins and the dependence.
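Computing componentwise block maxima amounts to a reshape-and-reduce operation. A minimal sketch on simulated data (array shapes and variable names are our own illustration, e.g. k = 365 daily values per year):

```python
import numpy as np

rng = np.random.default_rng(4)
k, n, d = 365, 10, 3                 # block length, blocks, dimensions
X = rng.standard_normal((k * n, d))  # m = k*n raw observations of X

# Z[i, j] = max over block i of margin j: the componentwise block
# maxima Z_1, ..., Z_n used as input to evtGAN.
Z = X.reshape(n, k, d).max(axis=1)
print(Z.shape)
```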
It is known that classical GANs that are trained with bounded or light-tailed noise input distributions in the latent space will also generate bounded or light-tailed samples, respectively (see Wiese et al., 2019;Huster et al., 2021). That means that they are not well-suited for extrapolating in a realistic way outside the range of the data. For this reason, we suppose that the approximation of the margins by generalized extreme value distributions G j as in (1) holds, which allows for extrapolation beyond the data range. On the other hand, we do not make explicit use of the assumption of a multivariate max-stable distribution as in (2). This has two reasons. First, while the univariate limit (1) holds under very weak assumptions, the multivariate max-stability as in (2) requires much stronger assumptions (see Resnick, 2008) and may be too restrictive for cases with asymptotic independence (e.g., Wadsworth and Tawn, 2012). Second, even if the data follow a multivariate max-stable distribution (2), the probabilistic structure is difficult to impose on a flexible machine learning model such as a GAN.

Algorithm 1: evtGAN
1. For j = 1, …, d, fit a GEV distribution Ĝ_j to the data Z_1j, …, Z_nj with estimated parameters μ̂_j, σ̂_j, ξ̂_j.
2. Normalize all margins empirically to a standard uniform distribution to obtain pseudo-observations U_i = ( F̂_1(Z_i1), …, F̂_d(Z_id) ), i = 1, …, n, where F̂_j is the empirical distribution function of the Z_1j, …, Z_nj.
3. Train a DCGAN G on the normalized data U_1, …, U_n based on the loss in Equation (3).
4. Generate n* new data points U*_1, …, U*_{n*} from G with uniform margins.
5. Normalize back to the scale of the original observations, Z*_ij = Ĝ_j^{−1}(U*_ij), j = 1, …, d, i = 1, …, n*.
The margins are first fitted by a parametric GEV distribution (line 1 in Algorithm 1). They are then normalized to (approximate) pseudo-observations by applying the empirical distribution functions to each margin (line 2). This standardization to uniform margins stabilizes the fitting of the GAN. Alternatively, we could use the fitted GEV distributions for normalization, but this seems to give slightly worse results in practice. The pseudo-observations contain the extremal dependence structure of the original observations. Since this dependence structure is very complex in general and our focus is to reproduce gridded spatial fields, we do not rely on restrictive parametric assumptions but rather learn it with a highly flexible DCGAN (line 3). From the fitted model we can efficiently simulate any number of new pseudo-observations that have the correct extremal dependence (line 4). Finally, we transform the generated pseudo-observations back to the original scale so that the new samples also have realistic marginal properties (line 5).
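The marginal steps of Algorithm 1 (lines 1, 2, and 5) can be sketched without the GAN itself. In the sketch below, the simulated data and the stand-in for the generator output are our own assumptions; step 3-4 would be a trained DCGAN, which we replace by the pseudo-observations themselves just to show the round trip through the margins:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical block maxima at d grid points (the Z_1, ..., Z_n of
# Algorithm 1); scipy's shape c equals -xi.
n, d = 50, 4
Z = stats.genextreme.rvs(c=0.1, loc=25, scale=2, size=(n, d),
                         random_state=rng)

# Line 1: fit a GEV to each margin.
gev_params = [stats.genextreme.fit(Z[:, j]) for j in range(d)]

# Line 2: pseudo-observations via the empirical CDF, using the
# common rank/(n+1) scaling to keep values strictly inside (0, 1).
U = (np.argsort(np.argsort(Z, axis=0), axis=0) + 1) / (n + 1)

# Lines 3-4 would train a DCGAN on U and sample new pseudo-
# observations; here U itself stands in for the generated samples.
U_star = U

# Line 5: map back to the original scale via the fitted GEV quantiles.
Z_star = np.column_stack([
    stats.genextreme.ppf(U_star[:, j], *gev_params[j]) for j in range(d)
])
print(Z_star.shape)
```

Because the quantile transform is monotone, the back-transformed samples keep the dependence structure of the pseudo-observations while inheriting the fitted GEV margins.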

Application
We assess the ability of the proposed method to correctly model the spatial tail dependence structure between climate extremes. Because observational records are relatively short and contain temporal trends, here we rely on 2000 years of climate model output that is representative of present-day weather over western Europe. In particular, we apply our approach to summer temperature and winter precipitation maxima, yielding 2000 and 1999 maxima for temperature and precipitation, respectively. We then compare the performance of our algorithm to the Brown-Resnick model, a state-of-the-art statistical model for spatial extremes. We use the fractal variogram family and the fitting procedure described in Section 2.1.2. To retrieve an unbiased estimate of the error that each of the methods makes, we divide the data into a training set of 50 observations, and the rest is taken as a test set. A sensitivity analysis with other sample sizes is also carried out. Both methods are fitted using only the training data set, and the evaluation is made considering the information contained in the test set as ground truth. Unless stated otherwise, the results from the Brown-Resnick model are analytical, while the results from evtGAN are obtained by simulating 100,000 data points. (Buizza et al., 1999). This resulted in a total of 16 × 25 × 5 = 2000 years of meteorological data at T159 horizontal resolution (approximately 1°), among which we selected temperature (Kelvin) and precipitation (meters/day) for this article.

Data
We choose an area that is big enough to be relevant for climate applications while being small enough to ensure a fair comparison between our approach and the classical statistical approach to modeling spatial tail dependence, the Brown-Resnick process. Our analysis thus focuses on a total of 18 × 22 grid points covering most of western Europe. For that area, we compute for each grid point the summer temperature maxima and winter precipitation maxima.

Architecture and hyper-parameters
In our application, we consider the nonsaturating heuristic loss (Goodfellow et al., 2014) for the generator, and the standard empirical cross-entropy loss for the discriminator. Training is done iteratively, with the discriminator trained twice for every generator update. A larger number of discriminator iterations per generator iteration was initially considered, but it did not achieve better results for the considered sample sizes and input dimensions and only resulted in longer training. We incorporate an exponential moving average scheme (Gidel et al., 2019) into the training algorithm, as it has demonstrated more stability, far better convergence, and improved results.
Since we focus on reproducing gridded spatial fields, we use the DCGAN model (Radford et al., 2016) as an architectural design to take advantage of convolution (Fukushima, 1980). For a more accurate analysis of the input data, we apply zero-padding of a single layer to the training observations before they are fed to the discriminator (Albawi et al., 2017), increasing the dimension of its input to 20 × 24. We use classical regularization tools such as drop-out (Srivastava et al., 2014) and batch-normalization (Ioffe and Szegedy, 2015), and we train the neural network with the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 2 × 10⁻⁴, a batch size of 50, and a total of 30,000 training epochs. The value of the exponential moving average parameter in the training algorithm is set to α = 0.9. The hyperparameter values are the result of extensive heuristic tuning, and are only valid for the input dimension and the sample sizes involved in training the network. That being said, we are fairly confident that only minimal intervention in tuning the hyperparameters and architecture would be required if, for instance, higher-resolution inputs were used, as these models are fairly robust to such changes. Indeed, models based on the general GANs paradigm are traditionally trained on much higher-dimensional objects such as images, and they learn well using an architecture very close to the one we have adopted. One may argue that the underlying distributions defining real-world images are "simpler," and therefore easier to learn, which is why we propose the evtGAN model to address the increased complexity of the underlying distribution of the data at hand. Initially, we started with a coarser grid (lower-resolution input) covering France only, and then decided to cover a larger part of Europe as the results were very good.
In this change, the model's architecture needed to be adjusted only by adding a few layers to the generator. For a given input resolution, increasing the complexity of the model (using a deeper one) just increased training time and pushed convergence further down the line. Figures 1 and 2 illustrate the final architectures of the discriminator and the generator, as well as the values of some of the hyperparameters used for training.

Results
A particularity of evtGAN is that it decomposes the full high-dimensional distribution of the data into its marginal distributions and its dependence structure and processes them separately. We thus first report estimated key parameters of the marginal distributions. For temperature, the location parameter μ is typically higher over land than over oceans (Figure 3a). Furthermore, there is a trend toward lower values of μ for more northern regions, illustrating the well-known latitudinal temperature gradient from south to north. The scale parameter σ, representing the width of the distribution, is also higher over land than over the ocean, with a fairly homogeneous distribution over land (Figure 3b). The shape parameter ξ is well below 0 for most regions except some areas in the Atlantic north of Spain and in the Mediterranean at the North African coast, indicating a bounded tail (Figure 3c). For precipitation, the location parameter μ is more similar between land and ocean compared to temperature but spatially more heterogeneous (Figure 3d), illustrating orographic effects on precipitation extremes. For the scale parameter σ, there is also no clear land-sea contrast, but areas with relatively much higher values (corresponding to higher interannual variability in extremes) occur over the western part of the Iberian peninsula and along a coastal band at the northern Mediterranean coast (Figure 3e). The shape parameter ξ is spatially quite heterogeneous, with values ranging from −0.15 to 0.25 (Figure 3f). Overall, the parameters of the extreme value distributions of precipitation extremes show much higher spatial heterogeneity than those for temperature extremes, suggesting that it might be more difficult to learn a model that represents all tail dependencies well for the entire region.

Figure 1. Discriminator's architecture and values of hyperparameters used for training. lrelu stands for the leaky relu function and its argument corresponds to the value of the slope when x < 0; drop stands for drop-out (see Section 3.2) and its argument corresponds to the drop-out probability; F stands for the kernel's dimension, and S for the value of the stride; FC stands for fully connected, and sigmoid for the sigmoid activation function.

Figure 2. Generator's architecture and values of hyperparameters used for training. lrelu stands for the leaky relu function and its argument corresponds to the value of the slope when x < 0; drop stands for drop-out (see Section 3.2) and its argument corresponds to the drop-out probability; F stands for the kernel's dimension, and S for the value of the stride; BN stands for batch-normalization (see Section 3.2); FC stands for fully connected.
Next, we look at examples of bivariate scatterplots of simulated temperature extremes from the different approaches for three cases with varying tail dependence (Figure 4). The three rows correspond to three pairs of grid points with weak, mild, and strong tail dependence, respectively. The train sets (Figure 4a,f,k) illustrate the distributions from which the models are tasked to learn while the test sets (Figure 4b,g,l) are used as ground truth. As can be clearly seen, extremes are much less likely to co-occur between locations that are characterized by weak tail dependence (Figure 4b) compared to locations that are characterized by strong tail dependence (Figure 4l). Purely based on visual judgment, it seems that evtGAN is able to characterize the different tail dependencies relatively well and can simulate samples outside of the range of the train set (Figure 4c,h,m) whereas DCGAN only simulates distributions bounded to the range of the train set (Figure 4d,i,n) and Brown-Resnick tends to overestimate tail dependence in cases where tail dependence is weak (Figure 4e,j).
The corresponding figure for precipitation extremes is shown in Figure 5. Conclusions are similar to the ones derived based on temperature extremes. The evtGAN simulates distributions with different tail dependencies and samples that are far outside the range of the train set (Figure 5c,h,m). DCGAN simulates distributions that are bounded to the train set range (Figure 5d,i,n). On the other hand, Brown-Resnick overestimates tail dependence in the case of mild tail dependence (Figure 5j), and underestimates tail dependence when tail dependence is strong (Figure 5o).
A scatterplot of bivariate extremal correlations between 100 randomly selected locations estimated from the train set, evtGAN, and Brown-Resnick, respectively, against estimates based on the test set (1950 samples) is shown in Figure 6. The estimates derived directly from the train sets (Figure 6a,d) are the benchmark; by design, no method can perform better. Clearly, pairs of locations with stronger tail dependence are much more likely for temperature (Figure 6a) than for precipitation (Figure 6d), confirming the impression obtained from the spatial homogeneity of the parameters in the extreme value distributions (Figure 3). Furthermore, the estimates from evtGAN follow the test-set benchmark more closely than those from Brown-Resnick. Brown-Resnick has difficulties in particular with pairs that have weak or no tail dependence (extremal correlation equal to 0, lower left in the figures), which is consistent with Figures 4 and 5.
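The extremal correlations compared here can be estimated empirically from ranks; a minimal sketch on synthetic data (the threshold u = 0.95 and the data themselves are illustrative assumptions, not the study's exact estimator):

```python
import numpy as np

def extremal_correlation(x, y, u=0.95):
    """Empirical estimate of chi(u) = P(F_Y(Y) > u | F_X(X) > u).

    Values near 0 indicate (asymptotic) independence in the tails, values
    near 1 strong tail dependence; chi is evaluated at a finite threshold u
    as an approximation to the limit u -> 1.
    """
    n = len(x)
    # Rank-based pseudo-observations on (0, 1).
    fx = np.argsort(np.argsort(x)) / n
    fy = np.argsort(np.argsort(y)) / n
    exc_x = fx > u
    return np.mean(fy[exc_x] > u) if exc_x.any() else 0.0

rng = np.random.default_rng(1)
z = rng.standard_normal(10_000)
# A perfectly dependent pair versus an independent pair.
print(extremal_correlation(z, z))                            # 1.0
print(extremal_correlation(z, rng.standard_normal(10_000)))  # roughly 1 - u = 0.05
```

The rank transform makes the estimate invariant to the marginal distributions, so it isolates the dependence structure, which is the quantity of interest here.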
To further explore the behavior of evtGAN and Brown-Resnick in dealing with different levels of dependence, Figure 7 shows a comparison of the estimated spectral distributions for the same pairs of grid points shown in Figures 4 and 5, characterized by weak (a,d), mild (b,e), and strong tail dependence (c,f), respectively. The magenta bars show the estimates based on the 1950 samples in the test set. The estimates of evtGAN are very close to the ground truth in all cases, that is, weak, mild, and strong dependence (red lines in Figure 7), except for some bias in the case of mild dependence in precipitation extremes (Figure 7e). In contrast, the performance of the Brown-Resnick model (black lines in Figure 7) is much more variable. It captures relatively well the two pairs with weak tail dependence (Figure 7a,d) and does a decent job for strong dependence in temperature extremes (Figure 7c), but fails completely for the remaining three cases (Figure 7b,e,f). Furthermore, in contrast to Brown-Resnick, evtGAN is able to represent asymmetric dependence structures, as is evident from Figure 7.
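Such spectral (angular) distributions can be estimated nonparametrically; a minimal sketch using the standard rank-based transformation to unit Fréchet margins (the radius threshold q = 0.95 and the synthetic data are illustrative, not the paper's exact settings):

```python
import numpy as np

def angular_components(x, y, q=0.95):
    """Empirical angular (spectral) components of a bivariate sample:
    rank-transform both margins to unit Frechet, then keep the angles
    w = z1 / (z1 + z2) of the points with the largest radius z1 + z2.
    A histogram of w approximates the spectral distribution; mass near
    1/2 indicates joint extremes, mass near 0 or 1 near-independence."""
    n = len(x)
    # Rank-based pseudo-observations in (0, 1), then unit Frechet margins.
    u = (np.argsort(np.argsort(x)) + 1) / (n + 1)
    v = (np.argsort(np.argsort(y)) + 1) / (n + 1)
    z1, z2 = -1.0 / np.log(u), -1.0 / np.log(v)
    r = z1 + z2
    keep = r > np.quantile(r, q)
    return z1[keep] / r[keep]

rng = np.random.default_rng(2)
x = rng.standard_normal(5000)
w = angular_components(x, x + 0.1 * rng.standard_normal(5000))
print(w.min() >= 0, w.max() <= 1)  # True True: angles lie in [0, 1]
```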
We finally present a sensitivity analysis of the performance of evtGAN for different realistic sample sizes, namely 30, 50, and 100, while training for 30,000 epochs (Figure 8). Overall, the error tends to decrease the longer the models are trained, and it does not seem to matter whether we evaluate the performance on the train set or the test set (the difference between black and red lines in Figure 8 is small). For small sample sizes (n = 30), the best performance might be reached before 30,000 epochs (Figure 8a,b), and the l2 norm between the extremal coefficients of evtGAN and the train set could be used as a stopping criterion. An improvement in performance with increasing sample size is clearly visible (compare Figure 8a,b with Figure 8e,f). Furthermore, the model errors for temperature extremes are smaller than the ones for precipitation extremes (left vs. right column in Figure 8).
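The suggested stopping criterion is straightforward to compute from vectors of pairwise extremal correlations; a minimal sketch with made-up numbers (all values are illustrative):

```python
import numpy as np

def l2_criterion(chi_model, chi_ref):
    """l2 norm between vectors of pairwise extremal correlations,
    in the spirit of C_tr = ||chi_evtGAN - chi_tr||_2."""
    return np.linalg.norm(np.asarray(chi_model) - np.asarray(chi_ref))

# Toy illustration: a shrinking criterion across epochs would justify
# continued training; an increase could trigger early stopping.
chi_train = [0.30, 0.30, 0.10]
chi_epoch_early = [0.50, 0.40, 0.30]
chi_epoch_late = [0.33, 0.32, 0.12]
print(l2_criterion(chi_epoch_early, chi_train)
      > l2_criterion(chi_epoch_late, chi_train))  # True
```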

Discussion
The use of evtGAN combines the best of two worlds: correct extrapolation based on extreme value theory on the one hand, and flexible dependence modeling through GANs on the other hand. GANs are an excellent tool to model complex dependencies in high-dimensional spaces. However, they are typically not tailored to model extremes in the marginals well. Indeed, for a standard DCGAN implementation where the marginals are not estimated by GEV distributions but empirically transformed, Figures 4d,i,n and 5d,i,n show that the generated samples are bounded by the range of the training sample. For an accurate extrapolation that resembles the marginal distributions of the test set, extreme value theory is required (panels c,e,h,j,m, and o of Figures 4 and 5).
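A minimal sketch of this two-step construction on synthetic data, with the GAN replaced by a placeholder resampling step (scipy's `genextreme` plays the role of the GEV fit; all sizes and parameters are illustrative, not the study's configuration):

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(0)

# Synthetic "block maxima" at d locations (stand-in for annual maxima fields).
n, d = 100, 5
X = rng.gumbel(loc=20.0, scale=2.0, size=(n, d))

# Step 1 (extreme value theory): fit a GEV distribution to each marginal.
fits = [genextreme.fit(X[:, j]) for j in range(d)]

# Step 2 (dependence): probability integral transform to uniform margins;
# in evtGAN, a DCGAN would be trained on U to learn the spatial dependence.
U = np.column_stack([genextreme.cdf(X[:, j], *fits[j]) for j in range(d)])
U = np.clip(U, 1e-6, 1 - 1e-6)  # guard against boundary values before inverting

# Placeholder for GAN sampling: here we simply resample rows of U.
U_gen = U[rng.integers(0, n, size=200)]

# Step 3: back-transform through the fitted GEV quantile functions, which is
# what allows generated samples to extrapolate beyond the training range.
X_gen = np.column_stack([genextreme.ppf(U_gen[:, j], *fits[j]) for j in range(d)])
print(X_gen.shape)  # (200, 5)
```

The division of labor is the key design choice: the GEV fits govern how far each marginal extrapolates into the tail, while the network only ever sees uniform margins and thus learns the dependence structure alone.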
On the other hand, classical methods of spatial extreme value theory such as the Brown-Resnick process have accurate extrapolation properties for the marginal distributions. However, for an application to a spatially heterogeneous data set on a large domain (Figure 3), their parametric assumption may be too restrictive. Indeed, Figure 6c,f shows a clear bias of the Brown-Resnick model in terms of bivariate extremal correlations, which is particularly visible for pairs with weak extremal dependence. Another indication for this bias can be seen in Figure 5j,o where the distributions of the Brown-Resnick samples differ strongly from the test set distributions. The evtGAN does not make prior assumptions on spatial stationarity or isotropy and therefore it does not exhibit a bias (points in Figure 6b,e are centered around the diagonal). This is particularly noteworthy since modeling complex nonstationarities for extremes is very difficult with parametric models (Huser and Genton, 2016;Engelke and Hitz, 2020). Considering the fitted bivariate distributions of evtGAN and Brown-Resnick underlines this point.
The spectral distributions of the Brown-Resnick model are restricted to a parametric class of functions which, for instance, are symmetric around 1/2. The Brown-Resnick estimates in Figure 7 (black lines) show that this is too restrictive for our data, since either the strength of dependence is not modeled correctly (Figure 7b) or the asymmetry is not captured (Figure 7c). The evtGAN, in contrast, follows the test-set spectral distributions across weak, mild, and strong dependences, and it even adapts to possible asymmetries. This is also evident from the scatterplots in Figures 4 and 5, where the shape of the Brown-Resnick samples is restricted to a parametric subclass of distributions, the so-called Hüsler-Reiss family (Hüsler and Reiss, 1989). A further restriction of classical spatial max-stable processes is the fact that they are limited to modeling asymptotic dependence. For precipitation, it can be seen in Figure 6d-f that most of the test extremal correlations are close to zero, indicating asymptotic independence. While the evtGAN is able to capture this fairly well (Figure 6e), the Brown-Resnick model always has positive extremal correlations, explaining the bias in the bottom left corner of Figure 6f. A spatial asymptotically independent model (e.g., Wadsworth and Tawn, 2012) would be a possible remedy for this, but such processes would still suffer from the limitations induced by nonstationarity and asymmetry described above.
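The last restriction can be made concrete: for a Brown-Resnick process with variogram γ(h), the pairwise extremal correlation is χ(h) = 2 − 2Φ(√γ(h)/2), which is strictly positive at any finite distance, so exact asymptotic independence (χ = 0) is unattainable. A small sketch, assuming an illustrative power variogram γ(h) = (h/λ)^α:

```python
import numpy as np
from scipy.stats import norm

def chi_brown_resnick(h, lam=1.0, alpha=1.0):
    """Pairwise extremal correlation chi(h) = 2 - 2 * Phi(sqrt(gamma(h)) / 2)
    of a Brown-Resnick process with power variogram gamma(h) = (h / lam)**alpha
    (lam and alpha are illustrative choices here)."""
    gamma = (h / lam) ** alpha
    return 2.0 - 2.0 * norm.cdf(np.sqrt(gamma) / 2.0)

h = np.array([0.1, 1.0, 10.0, 100.0])
# chi decays with distance but never reaches exactly 0, i.e. the model
# always implies some asymptotic dependence between any pair of sites.
print(chi_brown_resnick(h))
```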
Overall, evtGAN tends to perform better in capturing dependencies between temperature extremes than precipitation extremes (Figure 8). This is likely related to the fact that extremes in temperature are more spatially coherent (Keggenhoff et al., 2014; Figure 3).

Conclusions
Understanding and evaluating the risk associated with extreme events is of paramount importance for society, as recently emphasized in the 6th Assessment Report of the Intergovernmental Panel on Climate Change (Seneviratne et al., 2021). Extreme event analysis and impact assessments are often limited by the available sample sizes. Furthermore, simulations with complex climate models are very expensive. Here we combine a machine learning approach with extreme value theory to model complex spatial dependencies between extreme events in temperature and precipitation across Europe based on a limited sample size. We demonstrate that this hybrid approach outperforms the approach typically used in multivariate extreme value theory and can well represent the marginal distributions and extremal dependencies across spatially distributed climate extremes. The approach can be easily adapted to other types of extremes and used to create the large sample sizes that are often required for climate risk assessments.

Figure 8. Model performance versus training epochs for different sample sizes in evtGAN for temperature (a,c,e) and precipitation (b,d,f) extremes. A number of observations equal to n_test = N − n_train were sampled from evtGAN, where N = 2000 for temperature and N = 1999 for precipitation. The mean l2 norms for the train (black) and test set (red) are defined as C_te = ||χ_evtGAN − χ_te||_2 and C_tr = ||χ_evtGAN − χ_tr||_2, where χ_evtGAN, χ_te, and χ_tr denote the vectors of extremal correlations calculated on the samples from evtGAN, the test set, and the train set, respectively.

e5-14 Environmental Data Science
Data Availability Statement. The temperature and precipitation maxima alongside the scripts to reproduce the results in this study are available at https://doi.org/10.5281/zenodo.5821485.

Competing Interests. The authors declare no competing interests exist.