Probabilistic Prediction of Oceanographic Velocities with Multivariate Gaussian Natural Gradient Boosting

Published online by Cambridge University Press:  02 May 2023

Michael O’Malley*
Affiliation:
STOR-i Centre for Doctoral Training, Department of Mathematics and Statistics, Lancaster University, Lancaster, United Kingdom
Adam M. Sykulski
Affiliation:
Department of Mathematics, Imperial College London, London, United Kingdom
Rick Lumpkin
Affiliation:
Physical Oceanography Division, NOAA/Atlantic Oceanographic and Meteorological Laboratory, Miami, FL, USA
Alejandro Schuler
Affiliation:
Center for Targeted Machine Learning, University of California, Berkeley, Berkeley, CA, USA
*Corresponding author: Michael O’Malley; Email: michael.omalley1011@gmail.com

Abstract

Many single-output regression problems require estimates of uncertainty along with the point predictions. For this purpose, there exists a class of regression algorithms that predict a conditional distribution rather than a point estimate. The off-the-shelf options are much more limited, however, when the prediction output is multivariate and a joint measure of uncertainty is required. In this paper, we predict a distribution around a multivariate random vector of dimension P, such that the joint uncertainty would quantify the probability of any vector in P-dimensional space. This is more expressive than providing separate uncertainties in each dimension. To enable joint probabilistic regression, we propose a natural gradient boosting approach based on nonparametrically modeling the conditional parameters of the multivariate predictive distribution, where we focus on the multivariate Gaussian distribution. Our method is robust, can be easily trained without extensive tuning, and performs competitively in comparison to existing approaches. The motivating application of our methodology is to predict two-dimensional oceanographic currents measured by freely floating Global Drifter Program drifters using remotely sensed data. We also demonstrate the method’s performance on simulated data. We find this method excels when strong correlation between output dimensions is present. As part of this work, we have added the model to the open source package at github.com/stanfordmlgroup/ngboost.
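The claim that a joint predictive distribution is more expressive than separate per-dimension uncertainties can be illustrated with a minimal sketch. The code below is not taken from the paper or from the ngboost package; the function names and the example parameter values are hypothetical, and only the Python standard library is used. It evaluates the negative log-likelihood of a two-dimensional observation under a bivariate Gaussian, once with the correlation included and once with the correlation forced to zero (independent marginals).

```python
import math


def joint_nll(y, mu, var1, var2, rho):
    """Negative log-likelihood of a 2-D point y under a bivariate Gaussian.

    mu is the mean vector, var1 and var2 are the marginal variances,
    and rho is the correlation between the two output dimensions.
    """
    s1, s2 = math.sqrt(var1), math.sqrt(var2)
    z1 = (y[0] - mu[0]) / s1
    z2 = (y[1] - mu[1]) / s2
    # Quadratic form (y - mu)^T Sigma^{-1} (y - mu) written out for P = 2.
    quad = (z1 ** 2 - 2.0 * rho * z1 * z2 + z2 ** 2) / (1.0 - rho ** 2)
    # log det(Sigma) = log(var1 * var2 * (1 - rho^2)).
    log_det = math.log(var1 * var2 * (1.0 - rho ** 2))
    return 0.5 * (2.0 * math.log(2.0 * math.pi) + log_det + quad)


def independent_nll(y, mu, var1, var2):
    """Same marginals, but the correlation is ignored (rho = 0)."""
    return joint_nll(y, mu, var1, var2, 0.0)


# Illustrative values only: unit variances and a strong positive correlation.
mu = (0.0, 0.0)
rho = 0.9
along = (1.0, 1.0)     # consistent with the positive correlation
against = (1.0, -1.0)  # contradicts the positive correlation

nll_joint_along = joint_nll(along, mu, 1.0, 1.0, rho)
nll_joint_against = joint_nll(against, mu, 1.0, 1.0, rho)
nll_indep_along = independent_nll(along, mu, 1.0, 1.0)
nll_indep_against = independent_nll(against, mu, 1.0, 1.0)
```

With independent marginals the two points receive identical scores, whereas the joint model assigns a lower negative log-likelihood to the point that agrees with the correlation and a much higher one to the point that contradicts it. This is the sense in which a joint uncertainty distinguishes vectors that per-dimension uncertainties cannot.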

Information

Type
Methods Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Figure 1. Simulated data from equation (5). A sample of 100 points is shown in panel (a). We plot each parameter in plots (b) to (f). The true parameters of the distribution are shown in blue, the NGBoost fit is shown in orange, and the neural network fit with one hidden layer (100 neurons in the hidden layer) is shown in red. Both model fits are trained on training points.

Table 1. Average KL divergence in the test set (to three decimal places) from the predicted distribution to the true distribution as the number of training data points N varies.

Figure 2. A plot showing the five longest trajectories in the dataset described in detail in Appendix A, where each drifter has a lifetime of between 1100 and 1378 days. A point is plotted every 10 days.

Figure 3. The values of the prediction feature for the sample of drifters shown in Figure 2. We only show the first 50 days to allow the reader to see the more granular patterns.

Table 2. Average test set metrics defined in Section 5.2.

Figure 4. Summaries of test set results within latitude–longitude bins for the North Atlantic Ocean application. Panel (a) shows the spatial distribution of where the data are sampled. Panel (b) shows the spatial difference in negative log-likelihood between NGB and Indep NGB; negative values (blue) indicate that NGB performs better than Indep NGB, and positive values (red) indicate the reverse. Panel (c) shows the average prediction of , where is extracted from the predicted covariance matrix in the held out set from NGB. Panel (d) shows the mean currents estimated by NGB. All major ocean features are captured by the model (Lumpkin and Johnson, 2013).

Figure 5. Plots showing the predictions from both NGB (a) and Indep NGB (b) for the first 12 days of drifter ID 54386 starting from September 23, 2005. The velocity predictions are plotted every 2 days for visualization purposes. For the same reason, each velocity measurement is translated to where the drifter would end up after 1 day if it continued at the constant predicted velocity. The model used for the prediction did not see this trajectory when trained. The faded blue line shows the trajectory of the drifter with a plotted point every day. The 70% prediction region is the boundary from equation (6). The 70% level is chosen purely for visualization purposes, to prevent the ellipses from overlapping in the plot. Conversion to easting-northing computed using a transverse Mercator projection centered at latitude and longitude.

Figure 6. The same metrics as shown in Figure 4, zoomed in on the smaller region analyzed in Section 5.4. The plot has a finer granularity than Figure 4; the resolution here is a bin granularity. The definitions of the plotted metrics in panels (a)–(d) can be found in the caption of Figure 4.

Table 3. Summary of data used in the application.

Supplementary material: PDF

O’Malley et al. supplementary material