
Uniform-in-phase-space data selection with iterative normalizing flows

Published online by Cambridge University Press:  25 April 2023

Malik Hassanaly*
Affiliation:
Computational Science Center, National Renewable Energy Laboratory, Golden, CO, USA
Bruce A. Perry
Affiliation:
Computational Science Center, National Renewable Energy Laboratory, Golden, CO, USA
Michael E. Mueller
Affiliation:
Computational Science Center, National Renewable Energy Laboratory, Golden, CO, USA; Mechanical and Aerospace Engineering, Princeton University, Princeton, NJ, USA
Shashank Yellapantula
Affiliation:
Computational Science Center, National Renewable Energy Laboratory, Golden, CO, USA
Corresponding author: Malik Hassanaly; E-mail: malik.hassanaly@gmail.com

Abstract

Improvements in computational and experimental capabilities are rapidly increasing the amount of scientific data that are routinely generated. In applications constrained by memory and computational intensity, excessively large datasets may hinder scientific discovery, making data reduction a critical component of data-driven methods. Datasets are growing in two directions: the number of data points and their dimensionality. Whereas dimension reduction typically aims at describing each data sample in a lower-dimensional space, the focus here is on reducing the number of data points. A strategy is proposed to select data points such that they uniformly span the phase space of the data. The proposed algorithm relies on estimating the probability map of the data and using it to construct an acceptance probability. An iterative method is used to accurately estimate the probability of rare data points when only a small subset of the dataset is used to construct the probability map. Instead of binning the phase space to estimate the probability map, its functional form is approximated with a normalizing flow. Therefore, the method naturally extends to high-dimensional datasets. The proposed framework is demonstrated as a viable pathway to enable data-efficient machine learning when abundant data are available.
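
To make the description above concrete, the sketch below illustrates the density-based accept/reject idea in its simplest form. It is not the authors' Algorithm 1 or 2: a SciPy kernel density estimate (`scipy.stats.gaussian_kde`) stands in for the normalizing flow, the iterative refinement of the probability estimate is omitted, and the function name, the subset size `n_fit`, and the rescaling of the acceptance probability are illustrative assumptions.

```python
# Minimal sketch of uniform-in-phase-space downselection (illustrative only):
# a kernel density estimate replaces the normalizing flow used in the paper,
# and a single pass replaces the paper's iterative refinement.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

def uniform_phase_space_select(data, n, n_fit=2_000):
    """Select roughly n rows of data (N x D) so they span phase space near-uniformly."""
    # Step 1: estimate the probability map from a small random subset of the data
    # (a stand-in for training the normalizing flow on a subset).
    fit_idx = rng.choice(len(data), size=min(n_fit, len(data)), replace=False)
    density = gaussian_kde(data[fit_idx].T)(data.T)
    # Step 2a: acceptance probability inversely proportional to the estimated density,
    # rescaled so the expected number of accepted points is approximately n.
    p_accept = 1.0 / np.maximum(density, 1e-12)
    p_accept *= n / p_accept.sum()
    p_accept = np.minimum(p_accept, 1.0)
    # Step 2b: independent accept/reject decision for every data point.
    return data[rng.random(len(data)) < p_accept]

# Example: reduce 2 x 10^4 samples of a narrow 2D Gaussian to ~10^3 spread-out points.
full = rng.normal(0.0, 0.1, size=(20_000, 2))
reduced = uniform_phase_space_select(full, n=1_000)
print(reduced.shape)
```

In the article, the density model is a normalizing flow trained iteratively, which is what allows the probabilities of rare data points to be estimated accurately and the approach to scale to high-dimensional datasets.
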

Information

Type
Research Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© Alliance for Sustainable Energy, LLC, 2023. Published by Cambridge University Press

Figure 1. Illustration of the proposed method for a canonical example. (Left) Histogram of the original distribution. (Middle) Acceptance probability plotted against the random variable value when downsampling to $ n=100 $ and $ n=\mathrm{1,000} $ data points. (Right) Distribution of the downsampled dataset when downsampling to $ n=100 $ and $ n=\mathrm{1,000} $ data points.

Figure 2. Illustration of the results obtained with Algorithm 1. The full dataset is overlaid with the reduced dataset. (Top left) Random selection of $ n={10}^3 $ data points. (Bottom left) Result of Algorithm 1 with $ n={10}^3 $. (Top right) Random selection of $ n={10}^4 $ data points. (Bottom right) Result of Algorithm 1 with $ n={10}^4 $.

Figure 3. (Left) Result of Algorithm 2 with $ n={10}^3 $. (Right) Result of Algorithm 2 with $ n={10}^4 $.

Figure 4. Effect of iterations on accuracy of acceptance probability. (Left) Conditional mean of relative error in the acceptance probability plotted against the exact acceptance probability. (Right) Conditional mean of absolute error in the acceptance probability plotted against the exact acceptance probability.

Figure 5. (Left) Negative log-likelihood against standard deviation for a standard normal distribution. (Right) History of the negative log-likelihood loss of the normalizing flow trained at each flow iteration.

Figure 6. Scatter plot of the full dataset with $ D=2 $.

Figure 7. (Top) Comparison of selection schemes for 100 samples out of $ {10}^5 $ data points distributed according to $ \mathcal{N}\left(\left[0,0\right],\operatorname{diag}\left(\left[0.1,0.1\right]\right)\right) $. (Bottom) PDF of the distance to the nearest neighbor in the downselected dataset. (Left) Random selection scheme. (Middle) Brute-force optimization of the distance criterion. (Right) Algorithm 2.

Table 1. Dependence of the distance criterion value on the number of iterations and instances selected ($ n $).

Table 2. Dependence of the distance criterion value on the number of iterations and instances selected ($ n $).

Figure 8. Sensitivity of the distance criterion to the number of flow iterations, normalized by the distance criterion after the first flow iteration. Symbols represent the mean distance criterion. Results are shown for $ D=2 $ and $ D=4 $, with $ n={10}^3 $, $ n={10}^4 $, and $ n={10}^5 $.

Table 3. Distance criterion values for different numbers of instances selected ($ n $) and data dimensions ($ D $).

Figure 9. Two-dimensional projection of $ n=\mathrm{10,000} $ selected data points out of an original four-dimensional dataset. (Left) Random sampling. (Middle) Stratified sampling. (Right) Algorithm 2.

Table 4. Distance criterion values for different numbers of instances selected ($ n $) and data dimensions ($ D $).

Figure 10. Sensitivity of the distance criterion to $ M $, normalized by the distance criterion at $ M={10}^6 $. Symbols represent the mean distance criterion. Results are shown for $ D=2 $, $ D=3 $, and $ D=4 $, with $ n={10}^3 $, $ n={10}^4 $, and $ n={10}^5 $.

Figure 11. Scaling of the computational cost against the number of data points $ N $ in the full dataset (left) and the dimension $ D $ (right). The computational cost is divided into the probability map estimation, called Step 1; the construction of the acceptance probability, called Step 2a; and the data selection, called Step 2b. The total computational cost is also indicated for both scaling plots.

Figure 12. (Left) Scaling of Step 2a overlaid with linear scaling. (Right) Weak scaling of Step 2b overlaid with linear scaling.

Figure 13. Illustration of the combustion dataset. The dots are colored by progress variable source term $ \dot{\omega} $.

Figure 14. Statistics of the mean error (left) and maximal error (right) obtained over five repetitions of downselection and training of a neural network and a Gaussian process on the combustion dataset, for random sampling, stratified sampling, and Algorithm 2. Bar height shows the ensemble mean and the error bar shows the standard deviation. Bar and error bar sizes are rescaled by the ensemble mean obtained with Algorithm 2.

Figure 15. Prediction errors over the full dataset conditioned on sampling probability, calculated with $ n={10}^4 $ for five independent datasets constructed with random sampling, stratified sampling, and Algorithm 2. Inset zooms in on low sampling probability. (Left) Gaussian process with $ n={10}^3 $. (Middle) Feed-forward neural network with $ n={10}^3 $. (Right) Feed-forward neural network with $ n={10}^4 $.

Figure 16. Illustration of the synthetic datasets. Noise level increases from left to right.

Figure 17. Statistics of the mean error (left) and maximal error (right) obtained over five repetitions of downselection and training of a Gaussian process on the synthetic dataset, for random sampling, stratified sampling, and Algorithm 2. Bar height shows the ensemble mean and the error bar shows the standard deviation. Bar and error bar sizes are rescaled by the ensemble mean obtained with Algorithm 2.

Figure 18. Statistics of the mean error (left) and maximal error (right) obtained over five repetitions of downselection and training of a neural network on the synthetic dataset, for random sampling, stratified sampling, and Algorithm 2. Bar height shows the ensemble mean and the error bar shows the standard deviation. Bar and error bar sizes are rescaled by the ensemble mean obtained with Algorithm 2.

Figure A1. Illustration of the effect of using more than two iterations to select $ n=\mathrm{1,000} $ (top) and $ n=\mathrm{10,000} $ (bottom) data points with one iteration (left), two iterations (middle), and three iterations (right) of Algorithm 2.

Figure A2. Illustration of random (left) and stratified (right) selection schemes of $ n=\mathrm{1,000} $ (top) and $ n=\mathrm{10,000} $ (bottom) data points.

Figure A3. Illustration of the effect of using more than two iterations to select $ n=\mathrm{1,000} $ (top) and $ n=\mathrm{10,000} $ (bottom) data points with one iteration (left), two iterations (middle), and three iterations (right) of Algorithm 2, using a binning method for the density estimation.

Figure B1. Two-dimensional projection of $ n=\mathrm{10,000} $ selected data points out of an original four-dimensional building power consumption dataset. (Left) Random sampling. (Middle) Algorithm 2 (two iterations). (Right) Algorithm 2 (three iterations).

Table B1. Dependence of the distance criterion value on the number of iterations and instances selected ($ n $) for the building power consumption dataset.
