Towards learned emulation of interannual water isotopologue variations in General Circulation Models

Abstract Simulating abundances of stable water isotopologues, that is, molecules differing in their isotopic composition, within climate models allows for comparisons with proxy data and, thus, for testing hypotheses about past climate and validating climate models under varying climatic conditions. However, many models are run without explicitly simulating water isotopologues. We investigate the possibility of replacing the explicit physics-based simulation of oxygen isotopic composition in precipitation using machine learning methods. These methods estimate isotopic composition at each time step for given fields of surface temperature and precipitation amount. We implement convolutional neural networks (CNNs) based on the successful UNet architecture and test whether a spherical network architecture outperforms the naive approach of treating Earth’s latitude-longitude grid as a flat image. Conducting a case study on a last millennium run with the iHadCM3 climate model, we find that roughly 40% of the temporal variance in the isotopic composition is explained by the emulations on interannual and monthly timescale, with spatially varying emulation quality. The tested CNNs outperform simple baseline models such as random forest and pixel-wise linear regression substantially. A modified version of the standard UNet architecture for flat images yields results that are as good as the predictions by the spherical CNN. Variations in the implementation of isotopes between climate models likely contribute to an observed deterioration of emulation results when testing on data obtained from different climate models than the one used for training. Future work toward stable water-isotope emulation might focus on achieving robust climate–oxygen isotope relationships or exploring the set of possible predictor variables.


Impact Statement
Information on the hydrological cycle is imprinted onto the isotopic composition of precipitation, which subsequently is oftentimes preserved in natural climate archives like speleothems or ice deposits.Some climate models, so-called isotope-enabled general circulation models (iGCMs), simulate isotopes explicitly and, thus, allow comparing climate model output under paleoclimate scenarios to samples taken from natural climate archives.However, isotopes are not included in most climate simulations due to computational constraints or the complexity of their implementation.We test the possibility of using machine learning methods to infer the isotopic composition from surface temperature and precipitation amounts, which are standard outputs for a wide range of climate models.

Introduction
Reliable analysis of current climate change, as well as robust prediction of future Earth system behavior, have become a crucial foundation for all endeavors to protect humanity's prosperity, mitigate ecological disasters, or formulate plans for adaptation (Langsdorf et al., 2022).This analysis hinges on an accurate understanding and modeling of complex mechanisms in the climate system, which in turn relies on knowledge of the system's past behavior.To analyze past climatic conditions outside the comparatively short period of instrumental measurements, we depend on environmental processes recording and preserving information on the climate system in natural "climate archives".One way to recover past climate information from such archives is to measure the relative abundance of isotopes, particularly of the isotopes of the constituents of water molecules (Mook, 2000).Due to differences in mass, molecules with varying isotopic compositions, so-called isotopologues, differ in their behavior in chemical reactions and phase transitions.For the special case of water, molecules containing heavy 18 O atoms, further denoted heavy isotopes, evaporate slower but condensate faster than ones containing the lighter 16 O.These effects are imprinted on the global hydrological cycle.The resulting patterns of the isotopic composition of precipitation depend on many variables such as precipitation amount, temperature, relative humidity, and the circulation of the atmosphere (Dansgaard, 1964).This makes heavy isotopes in water an important tracer of the hydrological cycle and consequently a valuable proxy for past climatic changes.
Isotopic abundances are canonically expressed in the delta notation.For stable oxygen isotopes 18 O and 16 O, this is given by (1) Here the ratio of concentrations of the isotopic species in a given sample is compared to a defined reference standard.For  18 O of precipitation, this standard is an artificially created sample with an isotopic composition that is typical for ocean surface water (Baertschi, 1976).One important task in paleoclimatology is to test whether hypotheses about the past climate are compatible with proxy data like  18 O measured in natural climate archives (e.g.Bühler et al., 2022).To compare simulations of hypothetical climate states to those measurements, a special sub-type of climate models, so-called isotope-enabled General Circulation Models (iGCMs), was developed.They explicitly simulate isotopic compositions by following the isotopic water species through the hydrological cycle (Brady et al., 2019;Tindall et al., 2009;Yoshimura et al., 2008;Colose et al., 2016;Werner et al., 2016).However, many climate models and climate model simulations exist that do not include information on water isotopologues.Simulating  18 O is costly because it typically requires duplicating large parts of the water cycle for each simulated water species (Tindall et al., 2009).In light of recent advances in data science, the question arises whether this isotopic output can instead be emulated using machine learning (ML) models that infer the  18 O at each location from other climate variables after a model run is finished.We thus call this approach "offline-emulation".Conducting the emulation "offline", i.e. not coupled to the climate simulation, is possible because isotopes are passive tracers of the hydrological cycle that have no feedback onto the climate system.
Within this study, we narrow the broad task of "offline-emulation" by making a number of choices for the learned isotope emulation.The first choice is to only emulate the isotopic composition of precipitation.This is the quantity that is also output by iGCMs; it neglects all processes that might disturb the signal until it is stored in a climate archive (see e.g.Casado et al., 2018).Because observations of isotopes in precipitation are sparse and only exist starting from the 1960s (IAEA/WMO, 2020), we conduct our experiments entirely on simulated data.We limit ourselves to using surface temperature and precipitation amount as the two fundamental predictor variables since these variables possess strong correlations to  18 O that are well known experimentally (Dansgaard, 1964) and from simulations (see Figure 2, C) and are frequently simulated in climate models.
We decide to emulate yearly  18 O data from last millennium (850 CE to 1849 CE) climate simulations.This is motivated by the combination of the high data availability of simulation runs of sufficient length, and the archiving resolution of paleoclimate records during this time period which is typically between monthly and sub-decadal.We also contrast the yearly emulation results with experiments using monthly resolution.
As a measure of emulator performance, we will use the  2 score, which measures the fraction of explained temporal variance, as detailed in Section 2.2.5.While we use ML methods that exploit spatial correlations in the data by design, we leave explicit incorporation of temporal correlations largely to future investigation.
Working within these constraints, our paper presents the following contributions: • We train a deep neural network to estimate stable oxygen isotopes in precipitation ( 18 O) given surface temperature and precipitation and compare to common regression baselines.

Data and Methodology
Our approach to emulating  18 O is sketched in Figure 1.For each time step, we start with variables that we know are statistically related to  18 O, namely surface temperature and precipitation amount.
All variables are standardized pixel-wise, i.e. we subtract the mean and divide by the standard deviation, both computed from the training set.We then estimate the standardized spatial field of  18 O from the predictor variables by training a machine learning (ML) regression model.Subsequently, the standardization for the inferred  18 O is reversed, resulting in our estimate for the isotopic composition.

Data
We use data from the isotope-enabled version of the Hadley Center Climate Model version 3 climate model (hereafter iHadCM3, Tindall et al., 2009).iHadCM3 is a fully coupled atmosphere-ocean general circulation model (AOGCM).The horizontal resolution of iHadCM3 is 3.75°in the longitudinal direction, and 2.5°in the latitudinal direction.We exclude -90°and 90°from the latitudinal values because  18 O is not simulated at these latitudes.We focus on the last millennium (850 CE to 1849 CE), which is characterized by a stable climate with variability on interannual-to-centennial timescales, but no major trends (Jungclaus et al., 2017).Additionally, the last millennium is well documented in climate archives and observations (PAGES2k-Consortium, 2019; Konecky et al., 2020;Comas-Bru et al., 2020;Morice et al., 2012).Diagnostics of the iHadCM3 data set are visualized in Figure 2. As can be seen from Figure 2 B, the standard deviation of the simulated  18 O is large over dry regions like the Sahara desert or the Arabian peninsula.This is partly related to the way  18 O is computed in the climate models: in these regions the abundances of 18 O and 16 O are both small because of generally low precipitation amounts, leading to numerically unstable ratios and missing values on the monthly time scale.Overall, 0.3% of the  18 O values are missing on the monthly timescale, with a strong clustering in the regions with numerical instabilities described above (compare Figure A.10).We take this into account by adapting the loss we use to train our ML methods to deal with missing values, as described in Section 2.2.3.
To test the extrapolation and robustness of our emulator, we use last millennium simulations of three other climate models: AGCM Scripps Experimental Climate Prediction Center's Global Spectral Model (hereafter isoGSM, Yoshimura et al., 2008), iCESM1 version 1.2 (hereafter iCESM, Brady et al., 2019), and ECHAM5/MPI-OM (hereafter ECHAM5-wiso, Werner et al., 2016).While iCESM and ECHAM5wiso are fully coupled AOGCMs, isoGSM is an atmospheric GCM forced by sea surface temperatures and sea ice distributions of a last millennium run with the CCSM4 climate model (Landrum et al., 2013).We re-grid the other climate model simulations to the iHadCM3 grid using bilinear interpolation from the CDO tool set (Schulzweida, 2020).

Pre-processing
We apply the following pre-processing steps to the climate simulation data: • We set valid ranges for all variables, thereby excluding implausibly large or small values, using the following choices: surface temperature range: [173, 373] K,  18 O range: [−100, 100] , precipitation amount: [−1, 10000] mm month .Wide ranges are chosen because we aim to exclude only implausible values that might deteriorate emulator performance without artificially removing model deficiencies.Thus, we also keep small negative precipitation values that climate models might produce due to numerical inaccuracies in rare occasions.
• Time steps with missing values in the predictor variables are excluded from the data set.This leads to the exclusion of 31 of the 12000 monthly time steps of iHadCM3.• We form yearly averages from monthly data.Missing  18 O data points are omitted in the yearly averaging.We argue that this does not impact our results negatively, because the invalid 0.3% of  18 O values cluster in regions, where due to numerical instabilities in the "ground truth" iHadCM3 simulation, learning a physically consistent emulation would not have been possible anyway (compare Figure A.10). • We re-grid the yearly data sets to the irregular grid on which the investigated spherical network operates (see Section 2.2) using a first-order conservative remapping scheme (Schulzweida, 2020).

Methodology
To obtain a spatially consistent emulation, and to utilize the fact that the local statistical relations between  18 O and the predictor variables are similar in many grid boxes on Earth's surface, we choose two approaches based on Convolutional Neural Networks (CNNs).Both utilize the successful UNet architecture (Ronneberger et al., 2015), whose multi-scale analysis can simultaneously capture fine structure variations and utilize large-scale contextual information.UNet architectures have been successfully applied in a climate science context before (e.g.Kadow et al., 2020).The first of our two approaches treats data on the latitude-longitude grid as a flat image.The second explicitly incorporates the spherical geometry of the data.

Flat network
Because our data naturally lie on the surface of a sphere, distortions arise when treating the equally spaced longitude-latitude grid as a flat image using e.g. a plate carrée projection (lat/lon projection).
We test if we can still obtain reasonable results with this naive setup.Furthermore, we try to partially remedy the effects of the distortions within the "flat" approach, by modifying the standard UNet architecture in three ways : • We use area-weighted loss functions.
• We use periodic padding in the longitudinal direction, i.e. we append the rightmost column to the very left of the plate carrée map (and vice versa) before computing convolutions.Thereby we assure continuity along the 0°-360°coordinate discontinuity.• We incorporate CoordConv (Liu et al., 2018), a tweak to convolutional layers that appends the coordinates to the features input into each convolution, thus, allowing networks to learn to break translational symmetry if necessary.

Spherical network
As a more sophisticated technique, a multitude of approaches to directly incorporate the spherical nature of data into a neural network architecture has been proposed (Cohen et al., 2019(Cohen et al., , 2018;;Coors et al., 2018;Defferrard et al., 2020;Esteves et al., 2018;Lam et al., 2022).We reproduce the approach of Cohen et al. (2019), where the network operates on an icosahedral grid, with grid boxes centered on the vertices.Using the icosahedron offers a straightforward way to increase or decrease resolution for a UNet-like design, as we can recursively subdivide each of its triangles into four smaller triangles, projecting all newly created vertices onto the sphere again.We denote the number of recursive refinements of the grid as , with  = 0 identifying the grid containing only the twelve vertices of the regular icosahedron.As the refined icosahedral grid is locally very similar to a flat hexagonal grid, we can use an appropriately adapted implementation of the usual efficient way to compute convolutions.Additionally, the architecture of Cohen et al. (2019) is equivariant to a group of symmetry transformations, meaning that if the input to the CNN is transformed by an element of the symmetry group, the output transforms accordingly.This fits well with the approximate symmetries present in the Earth system, like symmetry to reflections on the equatorial plane or rotations around the polar axis.We validate our implementation of the method on a toy problem described by Cohen et al. (2019): the classification of handwritten digits projected onto a spherical surface.We obtain results that are comparable to those reported by Cohen et al. (2019), see Appendix A.1.1 for more details.

Loss function
To train our UNet architectures for isotope emulation, we use a weighted mean squared error loss between the standardized  18 O ground truth  and the predicted values Ŷ : where the loss is averaged over a batch of size  and the set of valid grid boxes G  at time step .A grid box is valid if the simulated ground truth data has no missing value at this time step in this grid box.|G  | denotes the cardinality of G  and   are weighting coefficients.For the convolutional UNet working on the plate carrée projection, we choose   to be proportional to the cosine of the latitude of the center of grid cell , which is an approximation of the physical size of the grid cell.We rescale the weights, such that they sum to the total number of grid boxes.For the icosahedral UNet, all grid boxes are of approximately equal size.Therefore, no weighting is applied and   is a constant independent of .

Baselines
In addition to the UNet models, we implement three simple baseline models to assess the relative benefit of complex and deep models in our emulation problem.These baselines are: • Grid-box-wise linear regression, the simplest conceivable model: regress  18 O on temperature and precipitation amount in a separate model for each grid box.• Grid-box-wise random forest regression model: in contrast to the linear regression baseline, we train a single random forest (Breiman, 2001) to make predictions on all grid boxes.To allow the model to learn spatially varying relationships, we include the coordinates as predictor variables.2• Grid-to-grid approach (PCA regression): relations between  18 O and other climatic variables tend to behave similarly over large areas (see Figure 2 C), justifying a dimension reduction of the input and output spaces before applying a multivariate linear regression.This is implemented by computing the principal components of the input and output spaces.Schematically, the computation goes as follows: Approximately optimal numbers of principal components are obtained as follows: we iterate over a 50 × 50 logarithmically spaced grid of candidate values for the number of input and output principal components.For each configuration, the emulation model is trained and its performance is measured on a held-out validation set.We then select the combination of numbers of input and output principal components which yields the best results on the validation set.As a last step, the selected model is retrained, now including the validation set data.Principal component analysis can be performed on arbitrary grids, which makes it equally applicable to the projected 2D data and the icosahedral representation.

Metrics
The metric we use for evaluating emulation approaches is the  2 score, also called the "coefficient of determination", which quantifies what fraction of the temporal variance in the test set is explained by the ML estimate in each grid box.The  2 score compares the  18 O ground truth   and an estimate Ŷ in a given grid box  as where MSE(  , Ŷ ) is the mean squared error and  2  the variance of the test set ground truth, both taken over the time axis at grid box .A value of  2 = 1 indicates perfect emulation, while a model that simply outputs the temporal mean at every time step has  2 = 0.The score can become arbitrarily negative.
Additionally, we compute the Pearson correlation coefficient between the true and emulated time series at some grid boxes.To select time steps in which a method's performance is particularly strong or weak, we calculate the Anomaly Correlation Coefficient (ACC) between emulation and ground truth.ACC is defined as the Pearson correlation coefficient between the true and emulated anomaly patterns for a given time step.Anomalies are computed with respect to the training set mean.
If error intervals on performance metrics are given, unless stated otherwise, they are 1 intervals computed over a set of ten runs.Thus, the uncertainties only account for the uncertainty of the stochastic aspects of the ML model parameter optimization, discarding any uncertainty that is related to the data.
Implementation details for training and configuration of the ML methods are provided in Appendix A.1 and code to reproduce our experiment is freely available at https://github.com/jonathanwider/isoEm

Results
We structure the Results section as follows.First, we give a detailed spatiotemporal overview to illustrate the characteristics of the ML-based emulation results.To this purpose, we use the best performing emulation method as an example.Subsequently, we compare emulation methods amongst each other, contrasting deep architectures and baselines as well as "flat" and "spherical" approaches.We follow up with a range of sensitivity experiments, and conclude by conducting a cross-model experiment, i.e. we train a ML model on data from one climate model and then use the trained model to emulate  18 O in other climate model simulations.

Spatiotemporal Overview of Emulation Results
In Section 3.3, we will discover that the best performing ML emulation method, a deeper version of the flat UNet architecture, reaches an average  2 score of 0.389 ± 0.006 on the plate carrée grid.This means that in the global average almost 40% of the temporal variance in the test set is explained by our emulation on the interannual timescale.We use this best ML method to introduce spatial and temporal characteristics of the emulation.
The prediction quality varies spatially, as shown in Figure 3A. 2 scores of 0.6 or larger are reached in 18.5% of the grid cells, and  2 ≤ 0 for only 5.4% of grid cells.The best results are achieved over tropical oceans, which are regions with strong correlations of  18 O and precipitation amounts.Performance is good over large parts of the Arctic and over western Antarctica as well, which is important because these regions are especially relevant for the comparison with  18 O measurements from ice cores.We illustrate the performance in these regions by comparing emulated and ground truth time series in the grid boxes closest to two ice core drilling sites in panels B and C of Figure 3: the North Greenland Ice Core Project ("NGRIP", 75.1°N, 42.3°W, Berggren et al., 2009) and the West Antarctic Ice Sheet Divide ice core project ("WAIS Divide", 79.5°S, 112.1°W,Buizert et al., 2015).For these drilling sites, the correlation between our emulation and the exact output time series of an isotope-enabled climate model exceeds 70%.
In general, spatial variations in performance follow the correlation structure between  18 O and the predictor variables (Figure 2C): in regions with strong absolute correlations between  18 O and surface temperature or precipitation amount, the  2 scores are higher than in regions where none of the predictor variables is strongly correlated with  18 O.Thus, performance is worse over landmasses, especially in the low and mid-latitudes.
Next, we visualize emulation and climate model output for individual time steps.For a year with typical emulator performance3 , we plot emulated (panel A) and simulated (panel B) anomalies in Figure 4. We can see that the large-scale patterns match well between emulation and simulation: there are strong positive anomalies over the Arctic, related to positive temperature anomalies in this time step, and the large-scale structure over the Pacific is captured as well.Strong negative anomalies over parts of South America and northern India and Pakistan are reproduced.Emulation and ground truth simulation differ in their fine-scale structure: the ground truth is generally less smooth than the emulation and seems particularly noisy over some dry regions like the Sahara and the Arabic desert.In these regions, there is a potential for numerical inaccuracies in the isotopic component of climate models, due to small abundances of each isotopic species, and it is hard to untangle which parts of the "noisy" signal have a climatic origin and which parts are simulation artifacts.A part of the overall smoother nature of the UNet regression results can be attributed to the MSE Loss giving a large (quadratic) penalty for strong Figure 3. Test set emulation performance of the best ML emulation method.The bluer the colors, the better the emulation.Blue colors indicate regions in which the performance is better than a trivial baseline model that returns the correct test set mean at every time step.This plot displays the average of the  2 scores over ten runs.Additionally, we show the time series of the ML emulation (green, mean over ten runs) and the true simulation data (black) for grid boxes next to two ice core drilling sites.Panel (B) "NGRIP" (Greenland).Panel (C) "WAIS Divide" (West Antarctica).
deviations from the true values, thus, priming the network against predicting values in the tails of the distribution.Figure A.4 compares emulation and simulation for three additional time steps: time steps in which the emulation works particularly well or poorly, and a climatically interesting year: 1816 CE, the "year without a summer" (Luterbacher and Pfister, 2015), which is caused by a volcanic eruption included in the volcanic forcing of the iHadCM3 simulation.For 1816 CE, we observe that the emulator reproduces a strong negative  18 O anomaly in regions where  18 O is primarily influenced by temperature, namely in the Arctic, Northern North America and Siberia.

Comparing Machine Learning Methods
The ML emulation models (UNet architectures and simpler baselines) differ in the quality of their emulation.In the following, we compare the methods amongst each other.For details on the training procedures, network architectures, and method implementations, see Appendix A.1.We also address the question of whether using an inherently spherical approach is beneficial over treating the latitudelongitude grid as "flat".However, the comparison is not trivial: the approaches are developed for data Table 1.Globally averaged  2 scores for the different ML-emulation methods.Results are calculated for the icosahedral grid that the method of Cohen et al. (2019) operates on and the plate carrée grid.When a method works with data on the other grid, the emulated data is interpolated."Flat UNet, unmodified" and "Flat UNet, modified" refer to the flat network architecture described in Section 2.2.1, either not applying or applying the modifications to remedy projection artifacts described in that chapter.The other baseline models perform worse.In particular, it seems that the random forest baseline, which regresses on a pixel-to-pixel level is not able to capture the spatially varying relationships between  18 O and the predictor variables surface temperature and precipitation amount well enough, even when including coordinates as additional inputs.The spatial performance differences between the UNet methods and the best baselines are visualized in Figure A.5.The improvements by the UNets are largest over oceans.
On the icosahedral grid, the icosahedral UNet and the modified flat UNet achieve  2 scores that are not significantly different.On the plate carrée grid, however, the results of the icosahedral UNet are much worse.This drop can largely be attributed to the interpolation method (see Figure A.7): on the plate carrée grid, neither training data nor results of the flat UNet are interpolated, while for the icosahedral UNet interpolations are necessary in both cases.

Sensitivity experiments
We conduct a range of sensitivity experiments, to test a) the influence of each predictor variable on the results, b) whether we can further improve the performance of our ML method and c) whether emulation quality varies with timescale.
First, we use the modified flat UNet architecture as employed in Section 3.2 and test, how the results differ if we exclude one of the predictor variables.The globally averaged  2 score on the plate carrée grid drops from 0.377 ± 0.005 if both precipitation and temperature are used to 0.327 ± 0.006 when using only precipitation and to 0.251 ± 0.004 when only using temperature.The spatial differences in emulation quality follow the large-scale behavior of the correlation structure in panel C of Figure 2. When precipitation is excluded, the performance decreases most over low latitudes, while the  2 score drops over polar regions without temperature.This is visualized in Figure A.8.
To potentially improve the emulation results even further, we create variations of the modified flat UNet architecture: a "wider" version in which the number of computed features per network layer is doubled ( 2 = 0.386 ± 0.008, plate carrée grid), and a "deeper" version with six additional network layers4 , which obtains  2 = 0.389±0.006(plate carrée grid), both improving over the default choice by roughly 0.01 .Additionally, we test, whether results could be improved by tuning the learning rate of the employed optimizer by testing a grid of 20 logarithmically spaced values between 10 −4 and 10 −1 .The performance is best for learning rates between 10 −3 and 10 −2 .However, no substantial improvements over the default parameter choice were reached in the limited range of tested values.
The monthly timescale differs from the interannual scale by a pronounced seasonal cycle of  18 O in many regions.Thus, even a simple climatology can explain a part of the variability in  18 O.To exclude this trivially explainable part from the computation of the  2 score, we compute the score separately for each month.Results are similar to the results on the interannual scale with roughly 40% of variance explained.The higher time resolution suggests exploring whether the emulation can profit from taking the temporal context into account.We test this by including not only the temperature and precipitation of the current time step but also of the previous month as inputs to the emulation of  18 O.Results do not improve strongly, however, possibly because the investigated timescale is still larger than the average atmospheric moisture residence time (Trenberth, 1998).

Cross-Model Comparison
For practical applicability, it is essential that an emulator's performance is robust under varying climatic conditions and under potential biases of the climate model that produces the training data for the emulator.We address these questions by testing how well our emulation generalizes to data generated with different climate models (iCESM, ECHAM5-wiso, isoGSM).To do so, we train the best model architecture so far, the deeper modified flat UNet, on data from iHadCM3.Subsequently, the trained network is used to emulate  18 O for the test sets of the other climate model data sets.Results of the emulation are visualized in Figure 5.For all data sets the mean  2 score is positive, meaning that in the global average, the emulation is preferable to predicting the mean state of the corresponding training set.The  2 score is highest for the ECHAM5-wiso simulation and lowest for isoGSM, where 80% less variance is explained than on iHadCM3.
In all three cross-prediction cases, the performance drops strongly in the Pacific Ocean west of South America, a region that is important for the El Niño-Southern Oscillation (ENSO).This might hint at inter-model differences in the spatial pattern of ENSO variability.For isoGSM, the emulation quality over Antarctica is considerably worse than for all other models.The Antarctic in isoGSM is much less depleted in  18 O (less negative  18 O) than in the other models while showing similar equator to pole temperature gradients (Bühler et al., 2022).This can potentially impact the relationship between the temporal variations of temperature and  18 O.For isoGSM and iCESM,  2 is negative over large areas of the mid-latitude oceans.As synopticscale variability of moisture transport pathways might be an important factor for  18 O in the mid-latitudes, adding predictor variables that encode information on the atmospheric circulation in the respective models could improve the results.The independence of the isoGSM and iCESM runs in these regions must be assessed carefully: isoGSM is forced by sea-surface temperatures and sea-ice distributions of a last millennium run with CCSM4, which is a predecessor model of iCESM.Therefore, characteristics of iCESM might also be present in the isoGSM results.
We also test, how well the baseline ML models generalize when employed to estimate  18 O for other climate models.The very simplistic pixel-wise linear regression yields better results than the PCAregression baseline.Figure A.9 shows the cross-model performance of the linear regression baseline.While the  2 score for iHadCM3 itself is significantly smaller than for the UNet model, the loss of performance when doing cross-prediction is much smaller, for iCESM the results are even better than those obtained with the UNet model.Especially over mid-latitude oceans, the  2 scores of the linear regression are better the ones obtained with the UNet.

Discussion
In a first step towards data-driven emulation of water isotopes in precipitation, we show that in a simulated data set 40% of the interannual  18 O variance can be explained by ML models.The emulation quality follows patterns of the correlation between  18 O and the predictor variables precipitation amount and surface temperature.This hints at the possibility of further improving the emulation by including other variables that are statistically connected to  18 O as predictors. 18 O composition depends on atmospheric moisture transport, which in turn, depends on atmospheric circulation.Thus, variables encoding information on atmospheric circulation such as sea-level pressure are promising candidates which should be explored in future research.This could be particularly relevant in the midlatitudes, where the comparably poor performance of the emulators might be due to synoptic-scale moisture transport variability which is not well captured by annual or monthly means of precipitation and temperature.In addition, relative humidity seems a promising candidate as it is important for the evolution of  18 O during the evaporation process.
It should be noted that correlation structures between predictor variables and  18 O are likely timescale dependent.Our results suggest that temperature, precipitation and atmospheric circulation variations due to internal variability in the climate system and short-scale external forcing such as volcanic eruptions and solar variability are the most important factors controlling interannual  18 O variability.On the other hand, changes in long-term external forcings such as greenhouse gas concentrations and Earth's orbital configuration, and variations in oceanic circulation have been found to explain  18 O changes on millennial and orbital (10,000 years and longer) timescales (He et al., 2021).This varying importance of factors controlling climate variations can also result in timescale-dependent relationships between the predictor variables surface temperature and precipitation amount (Rehfeld and Laepple, 2016), which limits the generalization of emulators between timescales.Meanwhile, on timescales from hours to weeks, the memory in the atmosphere is higher.Thus, taking into account previous time steps and explicitly tracking moisture pathways for example in tropical or extratropical cyclones could improve the emulation performance.On these timescales, ML methods to model sequences of data, like long short-term memory (LSTM), recurrent neural networks (RNNs), or transformer models could be good alternatives.
A tested spherical CNN architecture shows no clear benefit over a modified version of the standard flat UNet for our task of emulating  18 O in precipitation globally.We suppose that this is partly due to the strong latitudinal dependence of the statistical relationships between  18 O and the predictor variables (as indicated by correlations in Figure 2C).Thus, the strength of the spherical network architecture, namely its equivariance to rotations, possibly does not offer a strong benefit.Additionally, the interpolation between the plate carrée grid and the icosahedral grid deteriorates the results.This might be remedied by "differentiating through" the interpolation or directly learning the interpolation, as is done in Lam et al. (2022).Using ML architectures that are equivariant to approximate symmetries in the Earth system might still be beneficial in many applications, since adapting the ML approach to symmetries of the problem reduces overfitting and the demands for training data.One might for example use Cohen and Welling (2016) as a starting point and test a network that is equivariant under rotations around the polar axis and reflections on the equatorial plane.
The cross-model emulations can be seen as a supplement to test for the generalization to the (unavailable) real-world  18 O data.Assuming that each model possesses deficiencies in its  18 O simulation, robustness under varying models would hint at robustness in the generalization to real-world data.Additionally, reliable  18 O emulations for climate models that do not possess an implementation of water isotopologues would ideally be done with an emulator that does not overfit to a certain climate model it was trained on.Two reasons that might make an ML emulator perform poorly under cross-emulations are a) weak statistical connections between  18 O and the predictor variables in the training set and b) differences in the statistical connections of  18 O and the predictor variables between climate models.We investigate whether drops in cross-prediction performance can be attributed to these causes in Appendix A.2 and Figure A.3.Indeed, most regions, in which there is a drop in emulation performance, coincide with regions of differing correlation structures or weak correlations between  18 O and the predictor variables in the iHadCM3 data set (Figure A.3,B4 to D4 and B2 to D2).In regions with weak correlations between  18 O and the predictor variables in the iHadCM3 data set such as the Southern Hemisphere mid-latitudes, the UNet has to predict  18 O based on spatial similarity structures (teleconnections).The poor performance in the Southern Hemisphere mid-latitudes in IsoGSM and iCESM suggest that the spatial similarity structures differ between those two GCMs and iHadCM3.Here, predictors that encode atmospheric circulation more directly such as sea-level pressure could be beneficial in future studies.This interpretation is supported by a much sharper drop in performance of the UNet architectures than simple linear regression when methods were trained on the iHadCM3 climate model and then used to emulate other climate model data.As a result, the  2 scores on the other climate models were comparable between UNet and linear regression.This suggests that the UNet might overfit to the spatial anomaly patterns in iHadCM3 given the limited information provided by the predictor variables.This overfitting will partly reproduce deficiencies of the respective data set used for the training of the emulator.It was shown previously that the models used in our study differ in their mean climate state.For example, iHadCM3 and ECHAM5-wiso show a similar global temperature state but iHadCM3  18 O is much more negative in the global mean (Bühler et al., 2022).Similar differences in the spatial anomaly patterns between models need to be explored further to understand their contribution to poor cross-model emulation performance.To obtain a more robust emulator that is applicable across models, one might utilize data from multiple climate models and climate states (e.g.Last Glacial Maximum, mid-Holocene, Pliocene) in the training set.
Spatially, ML estimates are smoother than the true simulated data.The ground truth data show very noisy behavior over dry regions, part of which is likely due to numerical instabilities in the computation of  18 O for very low precipitation amounts.Missing data points also occur more frequently in these regions, thus, potentially biasing the emulator and its measured performance.Because of these inconsistencies in the input data, it might be beneficial to focus on particular regions when developing an emulator with the aim of comparing to a certain natural climate archive.Examples are the polar regions for comparisons to ice core data or the mid-latitudes for speleothem records.Restricting the spatial extent would also alleviate artifacts of the map projection and render spherical approaches unnecessary.Alternatively, one might think about the application of ML to do in-painting of missing values of  18 O for the training of emulators, similar to Kadow et al. (2020).In this case, the incomplete  18 O would serve as an input to the ML method in addition to precipitation and temperature.
Training an isotope emulator on real-world data would avoid uncertainties originating from climate models and the implementation of isotopes within them.Databases of observed  18 O in precipitation (IAEA/WMO, 2020) or  18 O from natural climate archives (Konecky et al., 2020) are publicly available.However, challenges arise from the spatial scarcity and unequal distribution of data, and the short temporal coverage of observations.Here, using graph networks like the one developed by Defferrard et al. (2020) might be an option, and likely strong prior constraints would need to be used to compensate for small data set sizes.For the future goal of comparing emulations to  18 O measured in natural climate archives, archive-specific processes need to be taken into account.This is because  18 O in precipitation is not archived directly, but always as the response of a sensor of the archiving medium.For example, precipitation  18 O is archived in speleothem records as calcite carbonate in accumulating layers that form from cave drip water (Fairchild and Baker, 2012).
We calculate yearly  18 O as the unweighted average of monthly  18 O.In most natural climate archives, yearly  18 O is weighted by precipitation amount.We tested the influence of such a weighting and find that it does not impact the emulator performance negatively (not shown).However, climate archives can also show seasonal preference in their sensitivity to  18 O (Wackerbarth et al., 2010;Fohlmeister et al., 2017;Baker et al., 2019) such that there is likely no optimal way for computing yearly values.Including archive-specific processes could either be a second step in a two-step approach, where an ML emulator is trained to predict  18 O in precipitation and then a proxy system model (Evans et al., 2013) is used to forward-model archive-specific processes.Alternatively, one might include a differentiable proxy system model in the ML pipeline.This would make it possible to train the ML architecture directly with proxy data instead of  18 O measured in precipitation.

Conclusion
In this study, we explored the ability of machine learning methods to emulate oxygen isotopes as simulated by isotope-enabled general circulation models (GCMs).Focussing on interannual variability in a last millennium simulation, we show that UNet neural networks improve the emulation performance compared to baseline methods such as pixel-wise linear regression and PCA-regression.Averaged over all grid-cells, our best-performing UNet architecture explains ∼40% of the temporal  18 O variance.The emulation performs best in polar regions, where  18 O is strongly controlled by surface temperature variations, and in low latitude ocean areas, where  18 O is highly correlated with precipitation amounts.Lowest performances occur in arid regions, partly because of numerical instabilities in the simulation of  18 O for very low precipitation amounts.Using a spherical network architecture does not improve the results compared to a modified flat architecture which accounts better for Earth's spherical geometry than a default UNet architecture.This might be because our spherical UNet architecture is not optimized to capture latitudinal dependences in the relationships between  18 O and the predictor variables.
We tested the generalization of the emulator trained on output from the iHadCM3 GCM to last millennium simulations with other GCMs.While the performance is better than predicting the model's climatology for all GCMs, the explained variance is substantially lower than for iHadCM3.Performances are especially poor in regions where the correlation structure between  18 O and the predictor variables differs from the correlation structure in iHadCM3 and in regions with low correlations between  18 O and the predictor variables in iHadCM3.In the latter case, the UNet architecture learns spatial dependence structures to improve the emulation of  18 O.This improves the performance within iHadCM3 compared to pixel-wise regression.However, these spatial structures seem to differ too much between GCMs to facilitate skillful cross-model emulations, especially in the mid-latitudes where encoding synoptic-scale circulation variations could be important to capture  18 O variations.
To further improve emulation performance, adding more predictor variables could be a promising next step.In particular, variables such as sea level pressure, which capture characteristics of the atmospheric circulation more directly than surface temperature and precipitation amount, could help in regions with currently poor performance.To compare emulated isotopes to  18 O measured in natural climate archives such as ice cores and speleothems, a way of incorporating archive-specific processes needs be investigated.This could be done by incorporating differentiable proxy system models into UNet architectures or by applying proxy system models to the emulator output in a two-step approach.For comparison with  18 O measurements in natural climate archives, the timescales of variations recorded by the archives are important.While we focused on interannual timescales in this study, shorter as well as longer timescales could be explored in future research to understand the importance of synoptic-scale processes, local predictor variables and external forcings for  18 O emulation across timescales. of handwritten digits from the MNIST data set, projected onto the icosahedral grid, in this case the refinement level of the icosahedral grid  equals 4. For the validation tasks, three versions of this data set are generated: One where the digits are not rotated ("N"), one in which rotations of the symmetry group of the icosahedron are applied ("I") and one in which we apply rotations of the group of rotations of the sphere("R").

A.1. Implementation Details
We use pytorch (Paszke et al., 2019) to implement the CNNs and scikit-learn (Pedregosa et al., 2011) for the baseline models.Our code is publically available on GitHub.5

A.1.1. Validation Experiment
The validation experiment is a variation of one of the seminal tasks in ML: the classification of the MNIST data set of handwritten digits (LeCun and Cortes, 2005).As a toy example, Cohen et al. (2019) project the digits onto the southern hemisphere of the icosahedral grid, some examples are visualized in Figure A.1.Then, three types of data sets are produced: the digits can either be left as they are ("non-rotated") -or the icosahedral grid can be rotated before the digits are projected onto it, either by a symmetry transformation of the icosahedron or by an unconstrained rotation ("fully-rotated").Because the network architecture of Cohen et al. (2019) is equivariant to rotations of the icosahedron, the network should be able to classify digits that have been rotated by elements of this symmetry group with the same accuracy as non-rotated digits, even if during training it was never shown any rotated digits.
The data set creation process, our network architecture (Figure A.2), and the training procedure for this task follow Cohen et al. (2019) as closely as possible.We train for 60 epochs on the nonrotated and fully-rotated data sets and for one epoch on the icosahedrally rotated data set, which is 60 times larger than the other sets.We use a cross-entropy loss function and the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001 and the other parameters at their default values  1,2 = (0.9, 0.999),  = 10 −8 .
Our results in Table A.1 are comparable or better than those reported in Cohen et al. (2019).Notably, they demonstrate the equivariance of our implementation -the classification results for digits rotated by an icosahedral symmetry are almost identical to those of non-rotated digits.6Small differences between the results of our implementation and the implementation of Cohen et al. (2019) can partly be attributed to the fact that none of our hyperparameters were tuned -and to implementation details not described in Cohen et al. (2019), like the batch size or the used optimizer.

A.1.2. Isotope Emulation Experiment
UNet Training.To optimize the UNet models, we use the Adam optimizer (Kingma and Ba, 2014), with lr = 0.001 and the other parameters at their default values:  1,2 = (0.9, 0.999),  = 10 −8 .We use early stopping with a patience of 5, i.e. we abort training if no global minimum of the validation set loss Table A.1.Classification accuracies (fraction of correctly classified to the total number of tested digits) for the icosahedral MNIST validation experiment (Cohen et al., 2019)."N", "I" and "R" indicate data sets, where either no rotations, rotations of the symmetry group of the icosahedron, or rotations of the symmetry group of the sphere were applied.Shown results are averages over three runs.is reached in 5 consecutive epochs.As a result, the training runs are stopped after roughly 20 epochs.We use a batch size of 8. Training a model takes on the order of minutes to tens of minutes for a single run on a basic graphics card (NVIDIA GeForce GTX1650).
For both icosahedral and flat UNets, we use batch normalization (BN, Ioffe and Szegedy, 2015) and ReLU-activations after all but the last convolutional layer.We use 2x2 max-pooling to go to coarser spatial resolution and nearest neighbor interpolation (plate carrée projection) and a modified form of bilinear interpolation (icosahedral grid) to increase resolution.Data from skip connections and upsampling are simply concatenated.
Icosahedral UNet Approach.We create the icosahedral data set using an icosahedral grid at refinement level  = 5, meaning we apply 5 recursive steps in which each of the triangular faces of the icosahedron is divided into four smaller triangles, with the grid points subsequently projected back to the spherical surface.This results in a grid with 5 × 2 2 +1 + 2 = 10242 vertices, while the iHadCM3 latitude-longitude grid encompasses 6816 grid points.Because the areas covered by the iHadCM3 grid cells vary with latitude, the resolution of the icosahedral grid is higher than the resolution of the iHadCM3 grid close to the equators, while close to the poles iHadCM3 data is better resolved than the icosahedral grid.
The architecture used for the icosahedral grid is visualized in Figure A.12.It is inspired by an architecture in Cohen et al. (2019).There is an implementation uncertainty remaining that relates to the treatment of the vertices of the icosahedron at  = 0, i.e. the 12 corners of the unrefined grid.In opposition to all the other pixels in the icosahedral grid, these possess only five neighboring pixels (instead of six), thus introducing irregularities in the convolution patterns.Unable to extract the exact treatment of these corner pixels in Cohen et al. (2019), we made the choice to set these corner pixel values to zero in every layer.This will likely introduce biases at very coarse resolutions (i.e. small).
We use the MSE-Loss from Equation (2) without weighting, because all pixels in the icosahedral grid cover an approximately equal area.Padding is done similarly to Cohen et al. (2019) in a way that asserts continuity between the faces of the icosahedral grid.
Flat UNet.The default architecture for the flat UNets is visualized in Figure A.11.By default, we use zero padding to keep the resolution before and after convolutions identical, in most experiments, however, only the latitudes get zero-padded, while we pad the longitudes cyclically to avoid discontinuities.

A.2. Investigating reasons for cross-prediction performance drops
We investigate the drops in  2 score that occur when predicting with a model trained on iHadCM3 data on simulation data from other climate models.This experiment and its results were described in Section 3.4.As described in the main text, a possible explanation might be differences in the statistical connections between  18 O and the predictor variables amongst the different climate models.To estimate these differences, we calculate the root of the squared differences in correlation coefficients between each model and iHadCM3 grid-box wise: Additionally, the generalization to other models might be impaired in regions where there are weak or no statistical relationships between  18 O and the predictor variables in the iHadCM3 data.This may occur due to iHadCM3 misrepresenting the  18 O relationships in the Earth system -or there might just not be strong connections between  18 O and the predictor variables in these regions.In a simplistic approach, we assess this by plotting regions in which neither the correlation of temperature to  18 O nor the correlation of precipitation amount and  18 O exceeds an absolute value of 0.25 as hatches in In panels (B4) to (D4), we plot the difference in cross-prediction performance between each model and iHadCM3, whose training set was used to train the emulator.We observe that especially for the ENSO region west of South America and over the North Atlantic Ocean (a region relevant for the North Atlantic Oscillation, see Wanner et al., 2001) a decline in performance coincides with changes in the correlation structure between the models.Over the southern oceans, there are both changes in correlation structure between the models and weak correlations for iHadCM3.We show anomalies produced by the ML-emulator ("emulation") and the "true" result in the iHadCM3 data set ("ground truth").The anomalies are computed as the difference to the training set mean.We select time steps in which the emulator performs especially strong ("90th percentile") and weak ("10th percentile"), as measured by the anomaly correlation coefficient (ACC).Additionally, we show the time step, for which the ACC reaches its median value, and a year with a pronounced climatic anomaly: 1816 CE, the "year without a summer".Additionally, the results are bad in regions, where we expect interpolations to have a negative effect: Over coastal and in the polar regions, where the icosahedral grid cells are larger then the iHadCM3 grid cells and therefore data partly gets "averaged" when interpolating from the finer iHadCM3 grid and to the coarser icosahedral grid and back.Thus, we can attribute a large part of the  2 score differences to the interpolations between the grids.The parameters were chosen to roughly match in size with the icosahedral UNet architecture.Before the data is input into the network, it is resized in order to assure divisibility during down-and up-scaling, thus the input resolution (green) to the first layer is 72 × 96 instead of 71 × 96 as in the iHadCM3 grid.In the end, the output of the architecture is scaled back to the original grid.The number of computed features per layer is written below the corresponding data blocks (black).The number of input features (  ) depends on the number of chosen variables (it equals 2 if using both temperature and precipitation amount), and the number of output features   equals 1 for all our applications.This graphic doesn't include the coordinate features that get appended to the data-tensors before each convolution, if CoordConv (Liu et al., 2018) is used.BN is short for batch normalization, and 3x3 conv indicates a convolutional layer with a 3 × 3 convolutional kernel.
Figure A.12. Sketch of the default icosahedral UNet architecture that takes into account the spherical nature of our data.The parameters are chosen similar to details given in the paper of Cohen et al. (2019).The icosahedral refinement level  is given in green, the number of features per layer in black below the corresponding blocks."S2R" (scalar-to-regular) and "R2R" (regular-to-regular) are convolutional layers defined by Cohen et al. (2019) to achieve equivariance to a group of symmetry transformations.

Figure 1 .
Figure 1.Our approach to the emulation of  18 O in precipitation: for each time step, we use surface temperature and precipitation amount as predictor variables.Subsequently, the data is standardized pixel-wise by subtracting the mean and dividing it by the standard deviation at each pixel (top right).Means and standard deviations are based on the training set of the investigated climate model simulation.We use a machine learning emulation model (ML Regressor) to obtain a standardized estimate for  18 O.The emulator output (bottom right) is then de-standardized using the training set mean and standard deviation of  18 O at every pixel, to arrive at the final emulation result (bottom left).When applying the ML model to data from climate models other than the one that was used for training (e.g. in the cross-comparison experiment in Section 3.4) we use the mean and standard deviation from the training set of the new model.

Figure 2 .
Figure 2. Statistical properties of the iHadCM3  18 O data: (A) mean state of isotopic composition ( 18 O) in the precipitation in precipitation and (B) standard deviation of  18 O on an annual timescale.(C) absolute correlations of  18 O with temperature (green) and precipitation amount (brown) on interannual timescale; for each grid cell only the stronger of the two is shown.
• We split the data into test and training sets.We use 850-1750 CE for training and 1751-1849 CE for testing.The data are split chronologically instead of randomly to make the test and training set as independent as possible, and prevent the network from exploiting auto-correlations from previous or subsequent time steps.If a validation set (used for making choices of ML hyperparameters) is needed, we split off 10% of the training set randomly unless specified otherwise.• Before the ML methods are applied, the data are standardized pixel-wise by subtracting the training set mean and dividing by the standard deviation of the corresponding climate model, as visualized in Figure 1.

Figure 4 .
Figure 4. Typical emulation results on iHadCM3 data set: We show anomalies as they are output by the ML-emulator ("Emulation") and the "true" result in the simulation data set ("Ground truth").The anomalies are computed with respect to the training set mean.For the selected time step, the anomaly correlation coefficient (ACC) reaches its median value.

Figure 5 .
Figure 5. Results for the cross-prediction task: A UNet is trained on the iHadCM3 training data set.The performance is then evaluated on the test set of various climate models, shown  2 scores are averages over ten runs.

Figure A. 1 .
Figure A.1.Validation task for the icosahedral neural network: shown are randomly selected examplesof handwritten digits from the MNIST data set, projected onto the icosahedral grid, in this case the refinement level of the icosahedral grid  equals 4. For the validation tasks, three versions of this data set are generated: One where the digits are not rotated ("N"), one in which rotations of the symmetry group of the icosahedron are applied ("I") and one in which we apply rotations of the group of rotations of the sphere("R").

Figure A. 2 .
Figure A.2. Sketch of the architecture used to validate our implementation of the icosahedral network of Cohen et al. (2019).Spatial resolutions (green text) relate to the refinement level of the icosahedral grid.The architecture was chosen according to the details given in the supplementary material of the original paper."S2R" and "R2R" are specific types of convolutional layers developed to maintain equivariance, and IcoBN is an adapted version of batch normalization.For details, see Cohen et al. (2019).

Here
indicates the (temporal) Pearson correlation coefficients computed for each model and grid box.The results are visualized in panels (B2) to (D2) of Figure A.3.

Figure A. 3 .
Figure A.3.Investigating reasons for emulation quality decrease in cross-prediction experiments: (A1) to (D1) show the absolute correlations between  18 O and the predictor variables.For each grid cell, only the larger of the two absolute correlation coefficients is shown.(B2) to (D2) show the differences in the correlation structure between the models as computed in Equation (A.1).(A3) to (D3) show the cross-prediction  2 scores of the best deeper flat UNet model and are identical to the content of Figure 5. (B4) to (D4) show  2 scores differences between each model and iHadCM3.In hatched regions no correlation coefficient has an absolute value bigger than 0.25 in the iHadCM3 data set.

Figure A. 4 .
Figure A.4. Emulation results on iHadCM3 data set:We show anomalies produced by the ML-emulator ("emulation") and the "true" result in the iHadCM3 data set ("ground truth").The anomalies are computed as the difference to the training set mean.We select time steps in which the emulator performs especially strong ("90th percentile") and weak ("10th percentile"), as measured by the anomaly correlation coefficient (ACC).Additionally, we show the time step, for which the ACC reaches its median value, and a year with a pronounced climatic anomaly: 1816 CE, the "year without a summer".

Figure A. 5 .
Figure A.5.The difference in  2 score between the UNet model and the PCA-based regression baseline for each grid type,(A): plate carrée grid, (B): icosahedral grid.On the plate carrée grid, we use the "modified" version of the flat UNet, including the modifications remedy distortions due to the spherical nature of the data.Green colors indicate better performance of the UNet compared to the baseline model.Before computing the differences,  2 scores of ten runs are averaged for each configuration.

Figure A. 6 .
Figure A.6.Effects of modifications to the "standard" UNet, which treats the longitude-latitude grid as a flat image: In each panel, we show the difference in  2 score with respect to the unmodified version.We investigate the following modifications: (A) A loss function, in which the contribution of each grid box is weighted by the area it covers on Earth's surface.(B) CoordConv (Liu et al., 2018), a modification to CNNs, that allows the network to access the coordinates of locations in the image and break translational equivariance.(C) padding the longitudes cyclically instead of using zero padding and (D) the joint effect of the modifications.Shown in each panel are differences between averages over the  2 scores of ten runs.

Figure A. 7 .
FigureA.7.Interpolation between grids degrades results.(A) Performance of the icosahedral UNet architecture, here the results were interpolated to the flat grid.(B) results of the flat UNet architecture, after interpolating the predictions to the icosahedral grid and back to the plate carrée grid.The prediction quality is almost indistinguishable from the results of the icosahedral UNet.Additionally, the results are bad in regions, where we expect interpolations to have a negative effect: Over coastal and in the polar regions, where the icosahedral grid cells are larger then the iHadCM3 grid cells and therefore data partly gets "averaged" when interpolating from the finer iHadCM3 grid and to the coarser icosahedral grid and back.Thus, we can attribute a large part of the  2 score differences to the interpolations between the grids.

Figure A. 8 .
Figure A.8. Difference in emulation quality (2 score) between using only one of the predictor variables (surface temperature and precipitation amount) and using both simultaneously.Regions shaded in purple indicate a drop in performance when leaving out the corresponding predictor variable.Before computing the differences, averages over the  2 scores of ten runs were formed for each configuration..

Figure A. 9 .
Figure A.9. Results for the cross-prediction task with a pixel-wise linear regression model.The model was trained on the iHadCM3 data set, where the performance is worse than with the UNet model.For cross-predictions on other data sets, however, the emulation quality of the linear model is comparable to or even better than the performance of the UNet, as can be seen by comparing the globally averaged  2 scores in panels (B) to (D) to the corresponding panels of Figure5.

Figure
Figure A.10. Investigating missing  18 O values: Spatial distribution of missing  18 O data for iHadCM3 on monthly timescale.While for over 80% of the gridboxes, not a single value is missing, missing values are clustering mainly in hot and dry regions.For the worst grid box,  18 O is missing in 25% of the time steps.

Figure A. 11 .
FigureA.11.Sketch of our default flat UNet architecture.The parameters were chosen to roughly match in size with the icosahedral UNet architecture.Before the data is input into the network, it is resized in order to assure divisibility during down-and up-scaling, thus the input resolution (green) to the first layer is 72 × 96 instead of 71 × 96 as in the iHadCM3 grid.In the end, the output of the architecture is scaled back to the original grid.The number of computed features per layer is written below the corresponding data blocks (black).The number of input features (  ) depends on the number of chosen variables (it equals 2 if using both temperature and precipitation amount), and the number of output features   equals 1 for all our applications.This graphic doesn't include the coordinate features that get appended to the data-tensors before each convolution, if CoordConv(Liu et al., 2018) is used.BN is short for batch normalization, and 3x3 conv indicates a convolutional layer with a 3 × 3 convolutional kernel.
• To respect the underlying geometry of the climate model data, we investigate the performance of a spherical network architecture.• We present cross-model results, where a regressor trained on simulated data from one climate model is used to emulate  18 O in a run from a different model.
(plate carrée and icosahedral) and the necessary interpolations may deteriorate performance.Thus, we compute performances on both grids, interpolating the predictions from one grid to another.Results for the globally averaged  2 scores are shown in Table1.The best model in the comparison is the "modified" version of the flat UNet that includes the three modifications (area-weighted loss, adapted padding, CoordConv) described in Section 2.2.The effect of the individual modifications is detailed in TableA.2 and Figure A.6.All UNet architectures outperform all baseline architectures that operate on the same grid.The best UNet method explains 7% more of the test set variance than the best baseline model, PCA regression.

Table A .
2. Effect of modifications to flat UNet architecure, globally averaged  2 scores.Shown are standard deviations and mean over ten runs.