
Identifying climate models based on their daily output using machine learning

Published online by Cambridge University Press: 03 July 2023

Lukas Brunner*
Affiliation:
Department of Meteorology and Geophysics, University of Vienna, Vienna, Austria
Sebastian Sippel
Affiliation:
Institute for Meteorology, Leipzig University, Leipzig, Germany; Institute for Atmospheric and Climate Science, ETH Zurich, Zurich, Switzerland
Corresponding author: Lukas Brunner; Email: l.brunner@univie.ac.at

Abstract

Climate models are primary tools for investigating processes in the climate system, projecting future changes, and informing decision makers. The latest generation of models provides increasingly complex and realistic representations of the real climate system, while there is also growing awareness that not all models produce equally plausible or independent simulations. Therefore, many recent studies have investigated how models differ from observed climate and how model dependence affects model output similarity, typically drawing on climatological averages over several decades. Here, we show that temperature maps of individual days drawn from datasets never used in training can be robustly identified as “model” or “observation” using the CMIP6 model archive and four observational products. An important exception is a prototype storm-resolving simulation from ICON-Sapphire which cannot be unambiguously assigned to either category. These results highlight that persistent differences between simulated and observed climate already emerge at short timescales, but very high-resolution modeling efforts may be able to overcome some of these shortcomings. Moreover, temporally out-of-sample test days can be assigned their dataset name with up to 83% accuracy. Misclassifications occur mostly between models developed at the same institution, suggesting that effects of shared code, previously documented only for climatological timescales, already emerge at the level of individual days. Our results thus demonstrate that the use of machine learning classifiers, once trained, can overcome the need for several decades of data to evaluate a given model. This opens up new avenues to test model performance and independence on much shorter timescales.
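The core idea of the abstract — training a classifier to label individual daily temperature maps as "model" or "observation" — can be illustrated with a minimal sketch. The data here are synthetic (a fixed spatial bias pattern added to the "model" days stands in for the persistent model–observation differences the paper describes), and the NumPy-only logistic regression, grid size, sample counts, and training settings are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for daily temperature maps (flattened lat-lon grids).
# "Model" days get a small fixed spatial bias pattern added, mimicking the
# persistent model-observation differences described in the abstract.
n_days, n_grid = 400, 500
bias = 0.3 * rng.standard_normal(n_grid)
obs = rng.standard_normal((n_days, n_grid))
mod = rng.standard_normal((n_days, n_grid)) + bias

X = np.vstack([obs, mod])
y = np.concatenate([np.zeros(n_days), np.ones(n_days)])  # 0 = obs, 1 = model

# Remove each day's global mean, mirroring the preprocessing named in the
# figure captions (here a plain, unweighted mean for simplicity).
X = X - X.mean(axis=1, keepdims=True)

# Plain logistic regression trained by gradient descent.
w, b = np.zeros(n_grid), 0.0
lr = 0.1
for _ in range(500):
    z = np.clip(X @ w + b, -30, 30)     # clip logits for numerical safety
    p = 1.0 / (1.0 + np.exp(-z))
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * (p - y).mean()

pred = 1.0 / (1.0 + np.exp(-np.clip(X @ w + b, -30, 30))) > 0.5
acc = (pred == y).mean()
print(f"training accuracy: {acc:.2f}")
```

Because the bias pattern is fixed across days, the learned coefficients `w` recover (a noisy version of) that pattern — the synthetic analogue of the coefficient map in Figure 1a.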

Information

Type
Application Paper
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Figure 1. (a) Logistic regression coefficients learned from 17,200 randomly drawn daily samples in the period 1982–2001 to separate models and observations. (b) Climatological, multi-model mean temperature difference from the mean of the four observational datasets in the period 2005–2014. See Supplementary Figure S7 for corresponding maps of the individual models. Coefficients and climatologies are calculated from daily data with the global mean removed.
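The captions repeatedly refer to daily data "with the global mean removed". A minimal sketch of that preprocessing step is below; the area (cosine-of-latitude) weighting is an assumption on my part — the caption states only that the global mean was removed — and the field itself is synthetic.

```python
import numpy as np

# Hypothetical daily near-surface temperature field on a 1-degree lat-lon
# grid (values in K); synthetic data for illustration only.
rng = np.random.default_rng(1)
lat = np.linspace(-89.5, 89.5, 180)
field = 288.0 + rng.standard_normal((180, 360))

# Cosine-of-latitude weights so each grid cell counts by its approximate area.
w = np.cos(np.deg2rad(lat))[:, None] * np.ones((180, 360))

# Subtract the (area-weighted) global mean of this day from every grid cell.
global_mean = np.average(field, weights=w)
anomaly = field - global_mean
```

After this step the anomaly field has an (area-weighted) global mean of zero but keeps its spatial structure, so a classifier must rely on spatial patterns rather than overall warm or cold offsets.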


Figure 2. Distribution of predicted probabilities for the dataset-out-of-sample test days: for each dataset, the probabilities are estimated by a classifier that has not been trained on that dataset. The vertical dotted line at 0.5 marks the decision threshold between the two categories. ICON-Sapphire is never used in training and has only 1 year of data available. (a) Results for logistic regression classifiers using data with the daily global mean removed. (b) Same as (a) but for the convolutional neural network. (c) Same as (b) but using data with the seasonal cycle removed in addition.


Figure 3. Confusion matrix showing the frequency of predicted versus true labels. The main diagonal shows correct predictions using green shading, purple shading indicates misclassifications within a model family (see Supplementary Table S3), and red shading indicates other misclassifications. Values are in % relative to the total number of samples in each category. The number in each box gives the value rounded to the last shown digit; rows may not add up to exactly 100% due to rounding.


Figure 4. Same as Figure 3 but with test data from the end of the century (2091–2100). Labels from datasets which do not cover this period are omitted in the true category.

Supplementary material: PDF

Brunner and Sippel supplementary material


Download Brunner and Sippel supplementary material (PDF)
PDF 22 MB