Data augmentation for disruption prediction via robust surrogate models

Published online by Cambridge University Press: 04 October 2022

Katharina Rath*
Affiliation:
Department of Statistics, Ludwig-Maximilians-Universität München, Germany; Max-Planck-Institut für Plasmaphysik, Garching, Germany
David Rügamer
Affiliation:
Department of Statistics, Ludwig-Maximilians-Universität München, Germany; Institute of Statistics, RWTH Aachen University, Germany
Bernd Bischl
Affiliation:
Department of Statistics, Ludwig-Maximilians-Universität München, Germany
Udo von Toussaint
Affiliation:
Max-Planck-Institut für Plasmaphysik, Garching, Germany
Cristina Rea
Affiliation:
Plasma Science and Fusion Center, Massachusetts Institute of Technology, Cambridge, MA, USA
Andrew Maris
Affiliation:
Plasma Science and Fusion Center, Massachusetts Institute of Technology, Cambridge, MA, USA
Robert Granetz
Affiliation:
Plasma Science and Fusion Center, Massachusetts Institute of Technology, Cambridge, MA, USA
Christopher G. Albert
Affiliation:
Fusion@OEAW, Institute of Theoretical and Computational Physics, Graz University of Technology, Austria
*Email address for correspondence: katharina.rath@ipp.mpg.de

Abstract

The goal of this work is to generate large, statistically representative data sets for training machine learning models for disruption prediction, starting from data from only a few existing discharges. Such a comprehensive training database is important for achieving satisfactory and reliable prediction results with artificial neural network classifiers. Here, we aim for a robust augmentation of the training database for multivariate time series data using Student $t$ process regression. We apply Student $t$ process regression in a state space formulation via Bayesian filtering to tackle the challenges posed by outliers and noise in the training data set and to reduce the computational complexity. Thus, the method can also be used when the time resolution is high. We use an uncorrelated model for each dimension and impose correlations afterwards via colouring transformations. We demonstrate the efficacy of our approach on plasma diagnostics data of three different disruption classes from the DIII-D tokamak. To evaluate whether the distribution of the generated data is similar to that of the training data, we additionally perform statistical analyses using methods from time series analysis, descriptive statistics and classic machine learning clustering algorithms.
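
The idea of imposing correlations on samples from an uncorrelated model can be illustrated with a generic whitening and colouring transformation based on Cholesky factors. The following is a minimal sketch under stated assumptions, not necessarily the post-processing used in the article: the function name `colour`, the array layout (samples in rows, dimensions in columns) and the choice of the Cholesky factorisation are all illustrative.

```python
import numpy as np


def colour(samples, target_cov):
    """Return samples whose empirical covariance equals target_cov.

    samples:    array of shape (n_samples, n_dims), n_samples > n_dims
    target_cov: symmetric positive definite array of shape (n_dims, n_dims)
    Both the name and the interface are hypothetical, for illustration only.
    """
    mean = samples.mean(axis=0)
    centred = samples - mean

    # Whitening: remove whatever empirical covariance the samples have,
    # so that the whitened data have identity covariance.
    emp_cov = np.cov(centred, rowvar=False)
    chol_emp = np.linalg.cholesky(emp_cov)          # emp_cov = L L^T
    white = np.linalg.solve(chol_emp, centred.T).T  # rows are L^{-1} x

    # Colouring: multiply by the Cholesky factor of the target covariance.
    chol_target = np.linalg.cholesky(target_cov)
    return white @ chol_target.T + mean
```

Whitening first makes the result independent of any residual correlation in the input samples; the colouring step alone would suffice only if the inputs were exactly uncorrelated with unit variance.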

Information

Type
Research Article
Creative Commons
Creative Commons Licence: CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
Copyright © The Author(s), 2022. Published by Cambridge University Press

Figure 1. Predicted mean and $95\,\%$ confidence band with (a) GP and (b) TP trained on $N = 100$ training data points following $f(t) = \sin(2t)\cos(0.4t)$ corrupted by Gaussian noise $0.1\,\mathcal{N}(0, 1)$, with several outliers.
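
A toy data set of the kind described in the Figure 1 caption could be generated as in the following sketch. The test function, the noise level and $N = 100$ come from the caption; the time range, the number of outliers, their magnitude and the function name `make_toy_data` are assumptions made here for illustration, since the caption only states "several outliers".

```python
import numpy as np


def make_toy_data(n=100, noise_scale=0.1, n_outliers=5,
                  outlier_scale=2.0, t_max=10.0, seed=0):
    """Noisy samples of f(t) = sin(2t) cos(0.4t) with a few outliers.

    n_outliers, outlier_scale and t_max are assumed values, not taken
    from the article.
    """
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, t_max, n)
    f = np.sin(2.0 * t) * np.cos(0.4 * t)

    # Gaussian noise 0.1 * N(0, 1), as stated in the caption.
    y = f + noise_scale * rng.standard_normal(n)

    # Shift a few randomly chosen points up or down to act as outliers.
    idx = rng.choice(n, size=n_outliers, replace=False)
    y[idx] += outlier_scale * rng.choice([-1.0, 1.0], size=n_outliers)
    return t, f, y, idx
```

Returning the outlier indices alongside the data makes it easy to check how strongly a fitted GP or TP mean is pulled towards the corrupted points.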

Figure 2. Data processing flow.

Figure 3. (a) Training data and (b) 10 generated data sets from the state space Student $t$ surrogate model together with the estimated mean (black solid line) and $95\,\%$ confidence (grey shaded region) for test case (i). Different colours correspond to different shots of training data and different samples of the generated data, respectively.

Table 1. Post-processing method comparison for test case (i). Mean and standard deviation over five dimensions and $N = 1000$ samples generated from the trained model for statistical metrics described in § 3. Best values are highlighted in bold.

Table 2. Post-processing method comparison for disruption data from DIII-D. Mean and standard deviation of the Wasserstein metric between training and generated data for stable and unstable phases of the disruptive discharges. The Wasserstein metric is averaged over five dimensions and $N = 1000$ samples generated from the trained model.

Figure 4. Comparison of the cross-covariance in the training and generated data with cross-covariance (solid lines on top of each other, numerical error of order $10^{-16}$), covariance or sampled covariance post-processing (dashed lines) and uncorrelated model (dotted line) for test case (i).

Figure 5. Comparison of the covariance of training data (a) and the difference from the generated data (b) with uncorrelated model, (c) empirical covariance, (d) cross-covariance and (e) sampled covariance post-processing for test case (i). Note the different scaling in the colour scale.

Figure 6. Kernel density estimation of the 2-D kernel PCA embedding of the (a) training data and generated data via (b) uncorrelated model, (c) empirical covariance, (d) cross-covariance and (e) sampled covariance post-processing for test case (i). The embedded training data are shown in grey in all plots. The colour scale representing the density is the same in all plots.

Table 3. The $F1$ score for DTW SOM clustering of different post-processing methods for test case (i).

Algorithm 1 Multivariate Student-t filter (Solin & Särkkä 2015)

Algorithm 2 Multivariate Student-t smoother (Solin & Särkkä 2015)

Table 4. Optimized hyperparameters for the state space Student $t$ surrogate model for all test cases.

Figure 7. (a) Training data and (b) 10 generated data sets from the state space Student $t$ surrogate model together with the estimated mean (black solid line) and $95\,\%$ confidence (grey shaded region) for test case (ii). Different colours correspond to different shots of training data and different samples of the generated data, respectively.

Table 5. Post-processing method comparison for test case (ii). Mean and standard deviation over five dimensions and $N = 1000$ samples generated from the trained model for statistical metrics described in § 3. Best values are highlighted in bold.

Figure 8. Comparison of cross-covariance of training data and generated data with cross-covariance (solid lines on top of each other, numerical error of order $10^{-16}$), covariance or sampled covariance (dashed lines) post-processing and uncorrelated model (dotted line) for test case (ii).

Figure 9. Comparison of covariance of training data (a) and difference of generated data (b) with uncorrelated model, (c) empirical covariance, (d) cross-covariance and (e) sampled covariance post-processing for test case (ii). Note the different scaling in the colour scale.

Figure 10. Kernel density estimation of the 2-D kernel PCA embedding of the (a) training data and generated data via (b) uncorrelated model, (c) empirical covariance, (d) cross-covariance and (e) sampled covariance post-processing for test case (ii). The embedded training data are shown in grey in all plots. The colour scale representing the density is the same in all plots.

Table 6. The $F1$ score for DTW SOM clustering of different post-processing methods for test case (ii).

Figure 11. (a) Training data and (b) 10 generated data sets from the state space Student $t$ surrogate model together with the estimated mean (black solid line) and $95\,\%$ confidence (grey shaded region) for test case (iii). Different colours correspond to different shots of training data and different samples of the generated data, respectively.

Figure 12. Comparison of cross-covariance of training data and generated data with cross-covariance (solid lines on top of each other, numerical error of order $10^{-16}$), covariance or sampled covariance (dashed lines) post-processing and uncorrelated model (dotted line) for test case (iii).

Figure 13. Comparison of covariance of training data (a) and difference of generated data (b) with uncorrelated model, (c) empirical covariance, (d) cross-covariance and (e) sampled covariance post-processing for test case (iii). Note the different scaling in the colour scale.

Figure 14. Kernel density estimation of the 2-D kernel PCA embedding of the (a) training data and generated data via (b) uncorrelated model, (c) empirical covariance, (d) cross-covariance and (e) sampled covariance post-processing for test case (iii). The embedded training data are shown in grey in all plots. The colour scale representing the density is the same in all plots.

Table 7. Post-processing method comparison for test case (iii). Mean and standard deviation over five dimensions and $N = 1000$ samples generated from the trained model for statistical metrics described in § 3. Best values are highlighted in bold.

Table 8. The $F1$ score for DTW SOM clustering of different post-processing methods for test case (iii).