A novel heuristic method for detecting overfit in unsupervised classification of climate model data

Published online by Cambridge University Press:  13 December 2023

Emma J. D. Boland*
Affiliation:
Polar Oceans, British Antarctic Survey, Cambridge, United Kingdom
Erin Atkinson
Affiliation:
Department of Physics, University of Toronto, Toronto, Canada
Dani C. Jones
Affiliation:
Polar Oceans, British Antarctic Survey, Cambridge, United Kingdom
*Corresponding author: Emma J. D. Boland; Email: emmomp@bas.ac.uk

Abstract

Unsupervised classification is becoming an increasingly common method to objectively identify coherent structures within both observed and modelled climate data. However, in most applications of this method, the user must choose in advance the number of classes into which the data are to be sorted. Typically, a combination of statistical methods and expertise is used to choose the appropriate number of classes for a given study; however, it may not be possible to identify a single “optimal” number of classes. In this work, we present a heuristic method, the ensemble difference criterion, for unambiguously determining the maximum number of classes supported by model data ensembles. This method requires robustness in the class definition between simulated ensembles of the system of interest. For demonstration, we apply this to the clustering of Southern Ocean potential temperatures in a CMIP6 climate model, and show that the data support between four and seven classes of a Gaussian mixture model.

Information

Type
Application Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press
Figure 1. An overview of the classification process using 1200 random Southern Ocean profiles from the year 2000. (a) Raw profiles are (b) normalized so that each depth level has a mean of zero and a standard deviation of one. (c) The profiles are transformed into PCA space. Two PCs are shown here; each point corresponds to one temperature profile. (d) The data are classified according to a GMM. Three classes are chosen to generate this example, with an equi-probability surface shown for each, separately.
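
The pipeline in this caption (normalize each depth level, project onto principal components, fit a Gaussian mixture model) can be illustrated with a minimal sketch. It is not the authors' code: the profiles are synthetic stand-ins for the model output, the use of scikit-learn and the 30 depth levels are assumptions, and only the 1200 profiles, two PCs, and three classes follow the caption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for temperature profiles (NOT real model data):
# 1200 profiles on 30 hypothetical depth levels, drawn from three regimes.
n_profiles, n_levels = 1200, 30
depth = np.linspace(0.0, 1.0, n_levels)
surface_temp = rng.choice([2.0, 10.0, 18.0], size=n_profiles)
profiles = surface_temp[:, None] * np.exp(-3.0 * depth)
profiles += rng.normal(0.0, 0.5, size=(n_profiles, n_levels))

# (b) Normalize: each depth level gets zero mean and unit standard deviation.
normalized = StandardScaler().fit_transform(profiles)

# (c) Transform into PCA space, keeping two principal components.
pcs = PCA(n_components=2).fit_transform(normalized)

# (d) Classify with a three-class Gaussian mixture model.
gmm = GaussianMixture(n_components=3, random_state=0).fit(pcs)
labels = gmm.predict(pcs)
```

Here `labels` holds one class index per profile. The number of retained PCs and the number of classes are the free choices that the paper's criterion is meant to constrain.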

Figure 2. Comparison of the class assignments from Argo (a, Jones et al., 2019) and the ensemble mode class from UK-ESM (b), for 2001–2017. See text for further details. The ocean fronts from Kim and Orsi (2014) are marked by the colored lines as labeled.

Figure 3. The mean temperature profiles for an eight-class GMM from Jones et al. (2019) (a) compared with the ensemble mean profiles from this study (b). Dashed lines indicate the standard deviation across samples for the Argo data, and the ensemble average of the temporal standard deviations for the UK-ESM data. The dotted line indicates the ensemble standard deviation in the temporal mean profiles for the UK-ESM data. Red arrows indicate the mapping from a UK-ESM class to the closest class in the Argo data, and the blue arrows indicate the same for the Argo classes and the closest classes in the UK-ESM data.

Figure 4. (a) BIC curves and (b) $\Delta$BIC curves for three of the ensembles.

Figure 5. (a) SIL curves and (b) $\Delta$SIL curves for three of the ensembles.

Figure 6. An example of using the ensemble difference criterion (EDC) to match classes resulting from a 7-class GMM of two different ensemble members (r1i1p1f2 and r2i1p1f2).

Figure 7. An example of using the ensemble difference criterion (EDC) to match classes resulting from an 8-class GMM of two different ensemble members (r1i1p1f2 and r2i1p1f2). Note that the map between the classes of the two ensemble members is not bijective: classes 3 and 4 in the first member are both matched to class 6 in the second (highlighted in red).
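
The bijectivity test described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the published EDC implementation: each class is represented by its mean profile, classes are matched to the closest class in the other ensemble member by Euclidean distance (an assumed metric), and the function names and toy profiles are hypothetical.

```python
from math import dist


def match_classes(means_a, means_b):
    """Map each class index in member A to the closest class mean in member B."""
    return {
        i: min(range(len(means_b)), key=lambda j: dist(profile, means_b[j]))
        for i, profile in enumerate(means_a)
    }


def is_bijective(mapping, n_classes_b):
    """True when every class in B is matched by exactly one class in A."""
    return sorted(mapping.values()) == list(range(n_classes_b))


# Toy mean profiles (temperature at three depth levels); values are synthetic.
member_a = [[10.0, 8.0, 4.0], [2.0, 1.5, 1.0], [2.2, 1.6, 1.1]]
member_b = [[9.8, 7.9, 4.1], [2.1, 1.5, 1.0], [-1.0, 0.0, 0.5]]

mapping = match_classes(member_a, member_b)
print(mapping)                                # {0: 0, 1: 1, 2: 1}
print(is_bijective(mapping, len(member_b)))   # False
```

In this toy case two classes of the first member collapse onto one class of the second, which mirrors the situation the caption highlights. Under this reading, a class count would be considered supported only when the map is bijective.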

Supplementary material: File

Boland et al. supplementary material