Understanding precipitation changes through unsupervised machine learning

Abstract: Despite the importance of quantifying how the spatial patterns of heavy precipitation will change with warming, we lack tools to objectively analyze the storm-scale outputs of modern climate models. To address this gap, we develop an unsupervised, spatial machine-learning framework to quantify how storm dynamics affect changes in heavy precipitation. We find that changes in heavy precipitation (above the 80th percentile) are predominantly explained by changes in the frequency of these events, rather than by changes in how these storm regimes produce precipitation. Our study shows how unsupervised machine learning, paired with domain knowledge, may allow us to better understand the physics of the atmosphere and anticipate the changes associated with a warming world.


Introduction: Understanding the Changing Spatial Patterns of Heavy Precipitation
According to the latest Intergovernmental Panel on Climate Change report [11], "there is high confidence that heavier precipitation events across the globe will increase in both intensity and frequency with global warming". As the severity of storms and tropical cyclones magnifies, there will be associated increases in flood-related risk [16] and challenges in water management [2,3]. To first order, heavy precipitation is bounded by the water vapor holding capacity of the atmosphere, which increases by about 7% per 1 K of warming following an approximate Clausius-Clapeyron scaling [30]. This is referred to as the "thermodynamic contribution" to heavy precipitation changes [12] and gives a solid theoretical foundation for spatially-averaged changes in heavy precipitation.
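The ~7 %/K figure can be sanity-checked directly from the Clausius-Clapeyron relation, $d\ln e_{\text{sat}}/dT = L_v/(R_v T^2)$. The sketch below is our own illustration (not code from the study), using standard approximate constants:

```python
# Minimal sketch of the Clausius-Clapeyron rate underlying the ~7 %/K scaling.
# Constants are standard approximations (L_v is weakly temperature-dependent).
L_V = 2.5e6   # latent heat of vaporization, J/kg
R_V = 461.5   # specific gas constant for water vapor, J/(kg K)

def cc_rate(T):
    """Fractional increase in saturation vapor pressure per 1 K at temperature T (K)."""
    return L_V / (R_V * T ** 2)

for T in (270.0, 280.0, 290.0, 300.0):
    print(f"T = {T:.0f} K: {100.0 * cc_rate(T):.1f} %/K")  # roughly 6-7 %/K
```

The rate falls slowly with temperature (about 7.4 %/K at 270 K, 6.0 %/K at 300 K), consistent with the commonly quoted ~7 %/K value.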
Yet climate change adaptation requires knowledge of how heavy precipitation will change at the local scale, i.e., understanding the changing spatial patterns of heavy precipitation under warming. Focusing on the tropics, where most of the vulnerable world population lives [10], these changing spatial patterns are primarily dictated by changes in atmospheric vertical velocity ("dynamical" changes), because horizontal gradients in temperature are weak there. This is referred to as the "dynamic contribution" to heavy precipitation changes [12].
A comprehensive understanding of this "dynamic contribution" remains elusive. Approximate scalings can be derived based on quasi-geostrophic dynamics [21,29] and convective storm dynamics [27,1], and some of the effects from dynamics can be approximated using observational and reanalysis data [35]. But actionable findings require Earth-like simulations of the present and future climates (e.g., [32]), which can resolve regional circulation changes and their effects on storms in their full complexity. These simulations are computationally demanding and output large amounts of multi-scale, three-dimensional data that challenge traditional data analysis tools. For example, the state-of-the-art storm-resolving SPCAM (Super Parameterized Community Atmospheric Model [18,19]) simulations we use in this study (Section 3.1) output 3.4 Terabytes of data over 90 days, with 76,944,384 samples of precipitation and the corresponding storm-scale vertical velocity fields (see Fig 1 for examples).

Theory: Heavy Precipitation Decomposition
In this section, we outline our strategy for analyzing heavy precipitation changes, including the spatial details that elucidate storm formation.
The crux of this analysis rests on the assumption that we can meaningfully cluster different vertical velocity fields into $K$ different convection types. Our analysis transforms these convection types into "cluster probabilities". For the $N$ total data points assigned to a set of $K$ clusters, we first compute a vector of counts $(n_1, \ldots, n_K)$, indicating the number of data points assigned to each cluster index $k$, with $\sum_{k=1}^{K} n_k = N$. We then convert these counts to a normalized probability vector $\pi \equiv (\pi_1, \ldots, \pi_K) \equiv (n_1/N, \ldots, n_K/N)$.
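This counting step amounts to a histogram over cluster labels. A minimal sketch (the labels below are a toy example of our own invention, not the paper's data):

```python
import numpy as np

def cluster_probabilities(assignments, K):
    """Convert per-sample cluster indices into a normalized probability vector pi.

    assignments: integer array of length N with values in {0, ..., K-1}.
    Returns pi with pi[k] = n_k / N, so that pi sums to 1.
    """
    counts = np.bincount(assignments, minlength=K)  # the counts n_k
    return counts / counts.sum()

# Toy example: N = 8 samples assigned to K = 3 clusters.
labels = np.array([0, 2, 2, 1, 2, 0, 2, 2])
pi = cluster_probabilities(labels, K=3)  # pi = (0.25, 0.125, 0.625)
```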
To calculate these $\pi_k$'s, we use variational autoencoders in conjunction with k-means clustering, also called vector quantization [14,22]. Details on the exact coarse-graining procedure are presented in Section 3.2; for now, we take the $K$ different convection types and their probabilities as given.
This unsupervised quantization of regimes of convection through ML allows us to quantitatively understand changes in heavy precipitation ($P_{\text{heavy}}$) arising from both changes in convection-regime characteristics and changes in regime probability. Here we define $P_{\text{heavy}}$ as a fixed high quantile of precipitation (e.g., the 80th-99.99th percentiles) at a given spatial location. To model the effects of climate change, we define $\Delta P_{\text{heavy}}$ as its absolute change from the "Control" to the "Warmed" climate, and show below that relative changes in heavy precipitation can be decomposed using changes in $\pi$ (Eq. 2.7).
We derive a decomposition of heavy precipitation changes for global warming by making a series of simple physical assumptions about precipitation. While these approximations facilitate our results' interpretability and lend physical meaning to each term, they need not hold to derive Eq. 2.5, the final form of the decomposition: it could also be derived by decomposing the heavy precipitation field into its spatial average and an anomaly, before further decomposing the anomaly using the objectively-identified dynamical regimes. To first order, precipitation ($P$) scales like condensation rate ($C$), which depends on the full vertical velocity ($w$) and atmospheric water vapor (here quantified using specific humidity $q$) fields:

$$P \approx C(w, q). \qquad (2.1)$$

Note that Eq. 2.1 neglects the dependence on microphysical processes (see e.g., [26]) to focus on the thermodynamical and dynamical components of precipitation. When focusing on heavier precipitation, we de facto sample atmospheric columns that are so humid that the specific humidity $q$ equals its saturation value $q_{\text{sat}}$. This allows us to further simplify Eq. 2.1 in the case of high quantiles of $P$:

$$P_{\text{heavy}} \approx C_{\text{heavy}}(w, q_{\text{sat}}). \qquad (2.2)$$

We now make the assumption that the thermodynamic dependence on $q_{\text{sat}}$ can be factored out of the right-hand side of Eq. 2.2, and denote the dynamical pre-factor as $\mathcal{D}(w)$:

$$P_{\text{heavy}} \approx \mathcal{D}(w)\, q_{\text{sat}}. \qquad (2.3)$$

The previous assumption has been historically justified by assuming a moist adiabatic temperature profile and scaling the vertical velocity using a single value for the entire vertical profile for intense events [30,27]. However, it is more accurate to assume that vertical velocity profiles collapse when changing the vertical coordinate from pressure to the normalized integral of the moisture lapse rate [1].
We can now linearly decompose the dynamical pre-factor $\mathcal{D}(w)$ into the $K$ regimes identified by our unsupervised learning framework:

$$\mathcal{D}(w) \approx \sum_{k=1}^{K} \pi_k \mathcal{D}_k, \qquad (2.4)$$

where $\pi_k$ is the normalized probability of each dynamical regime determined by the vertical velocity field information. Combining Eq.'s 2.3 and 2.4, and taking a logarithmic derivative with respect to climate change, allows us to decompose relative changes in upper precipitation quantiles as follows:

$$\frac{\Delta P_{\text{heavy}}}{P_{\text{heavy}}} \approx \frac{\Delta q_{\text{sat}}}{q_{\text{sat}}} + \frac{\Delta \mathcal{D}}{\mathcal{D}}, \qquad (2.5)$$

where $\Delta$ denotes absolute changes from the reference to the warm climate. Lastly, we approximate the thermodynamic contribution to upper-quantile precipitation as the relative change in near-surface saturation specific humidity, which can be further approximated as spatially uniform:

$$\frac{\Delta q_{\text{sat}}}{q_{\text{sat}}} \approx \frac{\Delta q_{\text{sat}}(T_s, p_s)}{q_{\text{sat}}(T_s, p_s)}, \qquad (2.6)$$

where $T_s$ is near-surface temperature and $p_s$ is near-surface pressure. Expanding Eq. 2.5 and substituting $\mathcal{D}(w)$ using Eq. 2.4 yields:

$$\frac{\Delta P_{\text{heavy}}}{P_{\text{heavy}}} \approx \frac{\Delta q_{\text{sat}}}{q_{\text{sat}}} + \frac{1}{\mathcal{D}} \sum_{k=1}^{K} \left( \mathcal{D}_k \Delta \pi_k + \pi_k \Delta \mathcal{D}_k \right). \qquad (2.7)$$

Eq. 2.7 shows that relative changes in $P_{\text{heavy}}$ are the sum of a well-understood, spatially-uniform "thermodynamic" increase in saturation specific humidity ($q_{\text{sat}}$, see Eq. 2.6) and a spatially-varying term. This spatially-varying term is the sum of $K$ regime probability shifts $\Delta \pi_k$ (changes in our unsupervised ML-derived convection cluster-assignment probabilities or "cluster sizes", covered in more detail in Section 3) and $K$ changes in regime characteristics $\Delta \mathcal{D}_k$ (changes in the "dynamic contribution" pre-factors, in precipitation units).
Our simulation data already contain $P_{\text{heavy}}$ and $q_{\text{sat}}$, and we can derive $\pi_k$ from our unsupervised learning framework, giving us all the information we need to calculate the elusive pre-factors $\mathcal{D}_k$ and their changes with warming. Using Eq. 2.4, we linearly regress $P_{\text{heavy}}/q_{\text{sat}}$ on the regime probabilities $\pi_k$ in both the reference and warm climates to derive the pre-factors $\mathcal{D}_k$, which are the weights of the multiple linear regression. This is a step toward understanding how the spatial patterns of storm-scale dynamical changes, which are notably hard to analyze, can affect the spatial patterns of heavy precipitation. Understanding these changes is critical to trusting local climate change predictions.
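This regression step can be sketched on synthetic data as follows; the pre-factor magnitudes, intercept, and noise level below are invented purely for illustration, and we use two regime probabilities as regressors (mirroring the paper's choice of regressing on two regimes only):

```python
import numpy as np

# Synthetic illustration: at each location we know the target y = P_heavy / q_sat
# and the regime probabilities pi (K = 2 regressors here).
rng = np.random.default_rng(0)
n_locations, K = 500, 2

pi = rng.dirichlet(np.ones(K + 1), size=n_locations)[:, :K]  # drop one regime
D_true = np.array([12.0, 3.0])   # "true" dynamical pre-factors (made up)
y = pi @ D_true + 0.5 + 0.1 * rng.standard_normal(n_locations)  # 0.5 = intercept

# Least-squares fit with an intercept column: y ~ X @ [D_1, D_2, D_0]
X = np.column_stack([pi, np.ones(n_locations)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
D_hat, intercept = coef[:K], coef[K]
print(D_hat, intercept)   # recovers approximately [12, 3] and 0.5
```

The fitted weights are the empirical pre-factors; repeating the fit in the warmed climate and differencing gives the $\Delta \mathcal{D}_k$ used later.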

Machine Learning Approach
We will now discuss the data, models, and statistical techniques used in this paper. Additional details can be found in the Supplemental Information.

Data: High-resolution, Earth-like Simulations of Global Surface Warming
The multi-scale modeling framework [33] used to generate our training and test data is composed of small, locally periodic 2D subdomains of explicit high-resolution physics embedded within each grid column of a coarser-resolution (1.9° × 2.5°) host planetary circulation model [19]. In total, we performed six simulations of present-day climate launched from different initial conditions (but consistent resolution), configured with storm-resolving models that are 512 km in physical extent, each with 128 grid columns at a horizontal resolution of 4 km. We approximate the atmosphere with a simple bulk one-moment microphysical scheme and thirty vertical levels. We extract snapshots every four hours. We then perform six additional simulations with sea surface temperatures increased by 4 K. We compare the "Control" simulations against those with uniform increases in sea surface temperatures ("Warmed"). For our purposes, this creates a testbed for climate change, but we acknowledge that surface warming is only an approximation of the thermodynamic consequences of increasing CO2 concentrations.
Figure 2: The VAE-based approach to understanding changes in heavy precipitation, $P_{\text{heavy}}$. Given a vertical velocity field $w$, the VAE non-linearly reduces the input dimension to yield a latent representation $z$. We can cluster the simulated data sets (control and warmed) through their latent representations into three recognizable regimes of convection (continental shallow, deep, and marine shallow). The corresponding cluster-assignment probabilities, $\pi_k$'s, allow us to also quantify the dynamical contribution, $\mathcal{D}$, to heavy precipitation. Photos taken by Griffin Mooers.
To investigate the "dynamic mode" of precipitation, we choose vertical velocity to represent the state of the atmosphere. These vertical velocity fields contain information about complex updraft and gravity-wave dynamics across multiple scales. We consider the entire 15°S-15°N latitude band, which contains diverse tropical convective regimes. Examples of these vertical velocity snapshots, selected by precipitation percentile, can be seen in Fig 1.

VAE Training
Our ML methodology identifies dynamical regimes on the basis of two million two-dimensional vertical velocity fields. We first nonlinearly reduce the dimensionality of the data using a fully-convolutional VAE, whose architecture is depicted in Fig 2. To train the VAE, we perform beta-annealing [17,6], expanding the Evidence Lower Bound (ELBO) traditionally used to train VAEs with a $\beta$ parameter that we linearly anneal from 0 to 1 over 1600 training epochs. The number of layers and channels in the encoder and decoder are depicted in Fig 2 (4 layers in each, stride of two). After manual hyperparameter tuning, we choose ReLUs as activation functions in both the encoder and the decoder. We pick a relatively small kernel size of 3 to preserve the small-scale updrafts and downdrafts of our vertical velocity fields. The dimension of our latent space is 1000.
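The annealing schedule itself can be written in a few lines. This is a minimal sketch of the schedule only (the function names are ours, and the full training loop is not shown):

```python
def beta_schedule(epoch, total_epochs=1600):
    """Linear beta-annealing: the weight on the KL term grows from 0 to 1,
    then stays at 1 for any epochs beyond total_epochs."""
    return min(1.0, epoch / total_epochs)

def annealed_loss(reconstruction_loss, kl_divergence, epoch):
    """Negative annealed ELBO: reconstruction term plus beta-weighted KL term."""
    beta = beta_schedule(epoch)
    return reconstruction_loss + beta * kl_divergence
```

Early in training the model focuses on reconstruction; the KL regularization is phased in gradually, which is commonly used to avoid posterior collapse.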
We refer the reader to SI B for the details of the VAE's benchmarking and performance evaluation and proceed to use that VAE to investigate how convective regimes respond to climate change.

Quantization Procedure
The goal of the following analysis is to characterize upper-quantile precipitation patterns arising from climate shifts, leveraging vertical velocity fields. The sheer amount of data requires additional statistical treatment.
Although the use of a VAE encoder makes our high-resolution simulation data more manageable, we require additional work to derive the formal convective probability information, $\pi$. The main idea is to convert a high-dimensional, continuous probability distribution over velocity fields into a fixed-size, discrete probability distribution over $K$ quantization points [28]. Then, we use the coarse-grained, discrete distribution to compute various quantities of interest.
We use a convolutional VAE to nonlinearly embed our 2D input data (vertical velocity fields) into a lower-dimensional latent space. To quantize the emergent latent space, we employ k-means clustering (more details in SI A): we encode our training data into the latent space and cluster them into $K$ clusters. We also define a vector of cluster-assignment probabilities $\pi_k$ for $k = 1, \ldots, K$ as the fraction of training data assigned to each cluster $k$. This dimensionality reduction and clustering can be thought of as a lossy compression of the data [37]. As we will see, the discrete structure helps us compute various quantities of interest. While the quantization approximation can, in principle, be made arbitrarily precise using a large $K$, we use $K = 3$ in practice for interpretability, based on the findings of [24]. We cluster convection into three distinct regimes: (1) Marine Shallow Convection, (2) Continental Shallow Cumulus Convection, and (3) Deep Convection. We have found that larger values of $K$ created repetitive regimes of Deep Convection, while smaller values do not properly separate these three distinct species of convection [24].
We are especially interested in changes of the cluster-assignment probabilities under global warming. In order to compare the present and future data distributions, we train the VAE and learn the cluster centers on present-climate data. This yields the present cluster-assignment probabilities $\pi_0$. In a second step, we encode all future-climate data into the latent space and assign each datum to the nearest (control) cluster center, yielding the future cluster-assignment probabilities $\pi_4$. The difference vector of assignment probabilities, before and after global warming, is given by $\Delta \pi = \pi_4 - \pi_0$. This information can then be used as a proxy for dynamical regime shifts with warming. We visualize these shifts and interpret their implications for heavy precipitation below. With just the information from the regime probabilities, we can model the spatial patterns of precipitation changes at upper percentiles (Fig. S4). Our model becomes less accurate at lower precipitation quantiles, partly because we are not using specific humidity information (the approximation of Eq. 2.2 is only valid for high precipitation quantiles). This degree of accuracy at the upper percentiles suggests that changes in the location of convective dynamical regimes can explain a large fraction of changes in heavy precipitation, which should be further tested in diverse climate change modeling frameworks.
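The frozen-centroid bookkeeping that produces $\pi_0$, $\pi_4$, and $\Delta\pi$ can be sketched as follows; the latents and centroids below are random stand-ins, so only the mechanics (nearest-centroid assignment with centers frozen at their control-climate values) are illustrative:

```python
import numpy as np

def assign_and_count(latents, centroids):
    """Assign each latent vector to its nearest (frozen) centroid; return pi."""
    # d2: (N, K) matrix of squared Euclidean distances to each centroid
    d2 = ((latents[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    labels = d2.argmin(axis=1)
    return np.bincount(labels, minlength=len(centroids)) / len(latents)

rng = np.random.default_rng(1)
centroids = rng.standard_normal((3, 8))          # centers learned on control run
z_control = rng.standard_normal((1000, 8))       # stand-in control latents
z_warmed = rng.standard_normal((1000, 8)) + 0.3  # stand-in shifted distribution

pi_0 = assign_and_count(z_control, centroids)    # control probabilities
pi_4 = assign_and_count(z_warmed, centroids)     # warmed data, same frozen centers
delta_pi = pi_4 - pi_0                           # regime-shift signal
```

Because $\pi_0$ and $\pi_4$ each sum to one, the components of $\Delta\pi$ sum to zero: a regime can only become more probable at the expense of the others.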

Decomposing the Dynamic Contribution to Heavy Precipitation Changes
We now isolate the dynamical contributions to changes in heavy precipitation by decomposing their spatial patterns (Fig 3d) into changes in regime probability $\pi_k$ and changes in regime characteristics $\mathcal{D}_k$. Unlike traditional approaches that spatially average this information, we use our fully-convolutional encoder and latent-space clustering to leverage storm-scale variability.
We calculated changes in regime probability (how regimes move in space) in Section 3, so we must now calculate changes in how each regime produces precipitation, which involves the following two steps. First, we empirically estimate the $\mathcal{D}_k$ by using the probabilities of deep and shallow convection (see footnote 2). Second, we estimate changes in the "deep" and "shallow" convection dynamical pre-factors as $\Delta \mathcal{D}_k = \mathcal{D}_k^4 - \mathcal{D}_k^0$. We now have the requisite information to understand the drivers of heavy precipitation changes themselves. We ask: did the patterns of heavy precipitation simply follow the changing patterns of the convective regimes, or are there more complex changes in how deep convection produces rain? We address this question by comparing how much of the spatial variance in heavy precipitation $\Delta P_{\text{heavy}}$ can be explained by changes in convection probability ($\Delta \pi$), and how much of it can be explained by changes in the dynamical pre-factors ($\Delta \mathcal{D}$). This comparison relies on the following decomposition of the heavy precipitation variance $\mathrm{var}\,\Delta P_{\text{heavy}}$, derived in Sec. C of the SI:

$$\mathrm{var}\,\Delta P_{\text{heavy}} \approx \overline{\Big( q_{\text{sat}} \sum_{k} \mathcal{D}_k \Delta \pi_k \Big)'^2} + \overline{\Big( q_{\text{sat}} \sum_{k} \pi_k \Delta \mathcal{D}_k \Big)'^2} + \mathrm{CT} + R, \qquad (4.1)$$

where CT are cross-terms and $R$ groups all terms of the decomposition that are not needed to compare differences in precipitation from regime shifts vs. intra-regime changes. To understand what is most crucial to precipitation changes, we depict the spatial mean of Eq. 4.1's terms in Fig 4.
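The variance comparison can be sketched numerically. All fields below are synthetic stand-ins with invented magnitudes, chosen only to show how the "regime shift" and "intra-regime" shares of the spatial variance are computed:

```python
import numpy as np

def anomaly(field):
    """Spatial anomaly: remove the spatial mean."""
    return field - field.mean()

# Hypothetical per-location fields (flattened lat/lon grid), made-up magnitudes:
rng = np.random.default_rng(2)
n = 2000
q_sat = 0.02 + 0.005 * rng.random(n)                 # kg/kg, near-surface
term_shift = q_sat * 1.0 * rng.standard_normal(n)    # ~ q_sat * sum_k D_k dpi_k
term_intra = q_sat * 0.3 * rng.standard_normal(n)    # ~ q_sat * sum_k pi_k dD_k
dP = term_shift + term_intra                         # toy heavy-precip change

var_total = np.mean(anomaly(dP) ** 2)
var_shift = np.mean(anomaly(term_shift) ** 2)
var_intra = np.mean(anomaly(term_intra) ** 2)
cross = var_total - var_shift - var_intra            # cross-terms, by residual

print(var_shift / var_total, var_intra / var_total)  # shift term dominates here
```

In this toy setup the shift term dominates by construction; in the paper, the analogous comparison is an empirical result of the simulations.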
For high precipitation quantiles where our model works best (especially above the 80th percentile), changes in the upper quantiles are dominated by regime probability shifts rather than by intra-regime changes. We are also confident in the robustness of these results based on the degree of similarity between the findings on the importance of deep convection shifts in our work and in [35]. The fact that, in satellite and reanalysis data, probability shifts in convection type are also the drivers of changes in extreme precipitation is a reassuring measure of support for our findings, as well as for the performance of the GSRM representation of the global high-resolution vertical velocity structure more broadly. At the same time, it suggests that machine learning approaches like the one we deploy with minimal domain knowledge are just as faithful to the physics of the atmosphere as more traditional convection analysis.

Footnote 2: More specifically, we estimate the dynamical pre-factors ($\mathcal{D}_k$) by regressing $P_{\text{heavy}}/q_{\text{sat}}$ on $\pi_1$ and $\pi_2$, neglecting the "Continental Shallow Cumulus" regime, as it concentrates over arid continental zones with high lower-tropospheric stability and low latent heat fluxes, making conditions unfavorable for precipitation [9].

Conclusion
Based on our findings, proper prediction of spatial shifts in deep convection with global warming should allow us to anticipate regional and local changes in heavy precipitation. This highlights the importance of leveraging the full spatial extent of this information (traditionally averaged out) to derive accurate regional and local changes. Although our findings were achieved with unsupervised learning, we are reassured that similar findings can be derived from regime-based analysis of deep convection in observational satellite data [35]. The necessity of understanding this rich spatial information indicates a role for ML methods like variational autoencoders and clustering routines in the analysis of storm-scale climate information, to deepen our understanding of intense events. Our next step to build credibility in this unsupervised model is to deploy the workflow on more diverse climate change data, such as the High-Resolution Model Inter-comparison Project (HighResMIP) [13,15].

Table 1: The MSE of both of our models ("linear baseline" and VAE) calculated across training/validation/test data. For both training and test data, we see low reconstruction errors, suggesting satisfactory skill and generalization ability. Overall, the VAE outperforms the "linear" baseline for the clustering that leads to our dynamical regime shifts, $\pi$'s.

Supplemental Information: As a justification for our use of a Variational Autoencoder, we provide a comparison of its performance against a simpler linear baseline. We also include additional steps showing how we decompose the full spatial variance of heavy precipitation for analysis, as well as plots of all the terms of the full decomposition.

A. K-Means Clustering Approach
We apply the K-means clustering algorithm to partition the latent space of our VAE and analyze which physical properties can be clustered in this reduced-order $z$ space. This approach first randomly assigns centroids to locations in the $z$ space (note we actually use the more modern k-means++ algorithm [5] to maximize the initial distances between the centroids). Latent representations of each sample, $z_i$, in the test dataset of size $N$, are assigned to their nearest centroid. The second stage of the algorithm moves each centroid to the mean of its assigned cluster. The process repeats until the sum of the squared distances (the inertia, $J$) between the latent-space data points and the centroids is minimized [22,23]:

$$J = \sum_{l=1}^{C} \sum_{z_i \in \text{cluster } l} \lVert z_i - \bar{z}_l \rVert^2,$$

in which $\bar{z}_l$ is the mean of the samples belonging to cluster $l$ and $C$ is the total number of cluster centers. We always calculate ten different K-means initializations and then select the initialization with the lowest inertia. This process allows us to derive the three data-driven convection regimes within SPCAM highlighted in Fig 3.
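A compact version of this procedure is sketched below (plain random initialization for brevity, whereas the paper uses k-means++; the data are synthetic three-cluster blobs standing in for latent representations):

```python
import numpy as np

def kmeans(z, K, n_iter=50, seed=0):
    """Plain K-means: alternate nearest-centroid assignment and mean updates."""
    rng = np.random.default_rng(seed)
    centroids = z[rng.choice(len(z), size=K, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((z[:, None] - centroids[None]) ** 2).sum(-1)  # squared distances
        labels = d2.argmin(1)                                # assignment step
        for l in range(K):                                   # update step
            if np.any(labels == l):
                centroids[l] = z[labels == l].mean(0)
    d2 = ((z[:, None] - centroids[None]) ** 2).sum(-1)
    labels = d2.argmin(1)
    inertia = d2[np.arange(len(z)), labels].sum()            # the inertia J
    return centroids, labels, inertia

# Best-of-ten: rerun with different seeds, keep the lowest-inertia solution.
rng = np.random.default_rng(3)
z = np.concatenate([rng.standard_normal((200, 5)) + mu for mu in (-4.0, 0.0, 4.0)])
centroids, labels, inertia = min(
    (kmeans(z, K=3, seed=s) for s in range(10)), key=lambda r: r[2]
)
```

Keeping the lowest-inertia run guards against poor random initializations, which is the same rationale as the paper's ten-restart strategy.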
We qualitatively choose the optimal number of cluster centroids (centers), $K$, by incorporating domain knowledge, rather than relying on the rate of decrease in $J$ as $K$ increases or on a single quantitative score such as the Silhouette Coefficient [34] or Davies-Bouldin Index [7]. More specifically, we identify the maximum number of "unique clusters". We define a "unique cluster" of convection as a group in the latent space where the typical physical properties (vertical structure, intensity, and geographic domain) of the vertical velocity fields are not similar to the physical properties of another group elsewhere in the latent space. Empirically, this exercise enables us to create three unique regimes of convection (Fig 3). When we increase $K$ above three, we get sub-groups of "Deep Convection" without differences in vertical mode, intensity, or geography. Thus we do not consider $K > 3$ to be physically meaningful for our purposes.
Because we seek to contrast common clusters between different climates, we do not use agglomerative (hierarchical) clustering, unlike other recent works that cluster compressed representations of clouds from ML models [8,20]. Using the K-means approach, we can save the cluster centroids at the end of the algorithm. This provides a basis for cluster assignments of latent representations of out-of-sample test datasets when we use a common encoder, as in Section B. More specifically, we only use the cluster centroids to get label assignments in other latent representations. We do not move the cluster centroids themselves once they have been optimized on the original test dataset (the second part of the K-means algorithm). Keeping the centers of the clusters the same between different types of test data ensures we can objectively contrast cluster differences through the lens of the common latent space.

B. VAE Benchmarking and Performance Evaluation
We train our VAE on 160,000 unique vertical velocity fields and use an additional 125,000 samples to validate and optimize the model hyperparameters. Finally, we leverage 1,000,000 vertical velocity fields in the test dataset for robust analysis. The high count in the test dataset is necessary both because of the high spatio-temporal correlations common in meteorological data and because of the geographic conditioning in our analysis: we need enough samples at each lat/lon grid cell, not just globally. To determine whether our data are nonlinear enough to warrant the use of a VAE, we also train a baseline model of the same architecture but with all activation functions replaced by the identity ("linear"). The fact that the VAE reconstructs the vertical velocity snapshots with both lower error and a higher degree of structural similarity suggests that significant non-linearity is involved in compressing and rebuilding the 2D fields (Tab 1 and Tab 2). This problem is therefore well suited to the non-linear dimensionality reduction of the VAE encoder and less so to linear models.

C. Full Decomposition of the Spatial Variance of Heavy Precipitation
We derive the decomposition of the spatial variance of heavy precipitation in four steps. First, we multiply Eq. 2.7 by $P_{\text{heavy}} = q_{\text{sat}} \mathcal{D}$, where the regression of Eq. 2.4 includes an intercept $\mathcal{D}_0$:

$$\Delta P_{\text{heavy}} \approx \underbrace{\frac{\Delta q_{\text{sat}}}{q_{\text{sat}}} P_{\text{heavy}}}_{\equiv Q} + q_{\text{sat}} \Delta \mathcal{D}_0 + q_{\text{sat}} \sum_{k} \left( \mathcal{D}_k \Delta \pi_k + \pi_k \Delta \mathcal{D}_k \right). \qquad (5.2)$$

For convenience, we use $Q$ to denote the first term of Eq. 5.2. We then take its spatial anomaly by applying the spatial anomaly operator $(\cdot)'$:

$$\Delta P'_{\text{heavy}} \approx Q' + \Big( q_{\text{sat}} \sum_{k} \mathcal{D}_k \Delta \pi_k \Big)' + \Big( q_{\text{sat}} \sum_{k} \pi_k \Delta \mathcal{D}_k \Big)' + q'_{\text{sat}} \Delta \mathcal{D}_0, \qquad (5.3)$$

where we have used the fact that $\Delta \mathcal{D}_0$ is uniform in space ($\Delta \mathcal{D}_0 = \overline{\Delta \mathcal{D}_0}$ and $\Delta \mathcal{D}'_0 = 0$). Squaring Eq. 5.3 yields:

$$\left( \Delta P'_{\text{heavy}} \right)^2 \approx Q'^2 + \Big( q_{\text{sat}} \sum_{k} \mathcal{D}_k \Delta \pi_k \Big)'^2 + \Big( q_{\text{sat}} \sum_{k} \pi_k \Delta \mathcal{D}_k \Big)'^2 + \left( q'_{\text{sat}} \Delta \mathcal{D}_0 \right)^2 + \mathrm{CT}, \qquad (5.4)$$

where the cross-terms CT can be decomposed into cross-terms involving spatial shifts in regime probability, cross-terms involving changes in how each regime produces precipitation, and additional cross-terms. Taking the spatial mean $\overline{(\cdot)}$ of Eq. 5.4, and noting that the spatial variance is defined as the spatial mean of the squared spatial anomaly, we derive the following decomposition:

$$\mathrm{var}\,\Delta P_{\text{heavy}} \approx \overline{\Big( q_{\text{sat}} \sum_{k} \mathcal{D}_k \Delta \pi_k \Big)'^2} + \overline{\Big( q_{\text{sat}} \sum_{k} \pi_k \Delta \mathcal{D}_k \Big)'^2} + \overline{\mathrm{CT}} + \overline{Q'^2} + \overline{\left( q'_{\text{sat}} \Delta \mathcal{D}_0 \right)^2} + \epsilon, \qquad (5.8)$$

where we have introduced the decomposition's numerical residual, $\epsilon$, which helps us assess which terms are significant. Grouping the terms irrelevant to the comparison between regime spatial shifts and intra-regime changes into a single term,

$$R \equiv \overline{Q'^2} + \overline{\left( q'_{\text{sat}} \Delta \mathcal{D}_0 \right)^2} + \epsilon,$$

we recover Eq. 4.1 from the manuscript's main text. For additional context on the significance of our decomposition, we plot all the terms in Fig. S2. As Fig. S2 shows, our decomposition is most valid for high precipitation percentiles, where the residual is of lesser magnitude than the other quantities.

Figure 1: Selected SPCAM tropical vertical velocity fields.

Figure 3: Changes induced by +4 K of simulated surface warming. Panels (a-c) display normalized probability shifts ($\Delta \pi_k$'s) in the three dynamical regimes found through clustering with $K = 3$, corresponding to (a) "marine shallow", (b) "continental shallow cumulus", and (c) "deep" convection. The difference in the 95th percentile of precipitation between the control and warmed simulations is shown in (d). We subtract the spatial-mean change (e, the "thermodynamics" from Eq. 2.6) from the total change (d) to yield the "dynamic" contribution (f). Using Eq. 2.7, we decompose the changing spatial patterns (f) into five terms, including (g) probability changes in deep convection, (h) changes in deep convective precipitation, and three additional terms depicted in Fig. S5. The patterns of storms change (a-c), which changes the patterns of heavy precipitation (f), mostly because deep convective storms shift location (g).

Fig 3 shows the probability shifts in convection type ($\Delta \pi_k$) from the "Control" to the "Warmed" climate. The dominant climate change signal captured by our unsupervised framework is the increased geographic concentration of deep convection (Fig 3c). More specifically, deep convection becomes more probable over warm ocean waters, especially the Pacific Warm Pool [4], while shallow convection becomes less common in these unstable regions (Fig 3a). This result is consistent with observational trends showing an intensification of already powerful storms over the warm tropical waters [4]. At first glance, the pattern of this unsupervised deep convection shift with warming looks quite similar to the shifts in upper precipitation quantiles (Fig 3c vs. f).

Figure 4: Based on Eq. 4.1, we decompose the total magnitude of the change in heavy precipitation. In our decomposition, we compare the mean of the spatial anomaly of convective probability shifts ($\Delta \pi$) to the changes in the dynamical pre-factors ($\Delta \mathcal{D}$). We find that the convective regime shifts are of greater importance in explaining the changes in heavy precipitation (80th-99.99th percentiles).

Figure S2: Derived from Eq. 5.8, we plot each term of the full decomposition of the variance of the change in heavy precipitation, $\mathrm{var}(\Delta P_{\text{heavy}})$. We focus primarily on precipitation percentiles 80-99, where our model is valid (the numerical residual, grey, is smaller than the key terms) and we have sufficient data (Fig. 3). Across these upper quantiles of precipitation, we find that the change in probability of convection type ($\Delta \pi$, red) is of greater importance than the changes in the dynamical pre-factors ($\Delta \mathcal{D}$, blue). For additional context compared to Figure 4, we include all terms from Eq. 5.8.

Figure S3: The shifts in different percentiles of precipitation with global warming, where we again stratified and plotted the data by latitude/longitude grid cell. As in Fig 3d, we again remove the spatial mean to highlight the dynamical pattern and to see at what threshold the alignment with the VAE-identified deep convection shifts (Fig 3c) is greatest. The top percentiles, including panels (f-h), are pixelated because of a lack of samples out on the tail of the PDF.

Figure S4: Results of the simple regression model we use to predict patterns in heavy precipitation ($P_{\text{heavy}}/q_{\text{sat}}$) using just the dynamic contributions, $\pi_k$ and $\mathcal{D}_k$, identified by our unsupervised ML framework. Our model works very well for high precipitation percentiles, where the dynamic contributions are greatest, and less well for lower percentiles, where thermodynamics are also important.

Figure S5: From Eq. 2.7, we can decompose the changing spatial patterns (Fig 3f) into five terms; shown here are probability changes in shallow convection (a), changes in deep convective precipitation (b), and the intercept of the dynamical pre-factor (c).
Table 1 (fragment): Linear Baseline MSE: 3.10 × 10⁻³ (training), 4.70 × 10⁻³ (validation), 5.10 × 10⁻² (test).