Predicting turbulent dynamics with the convolutional autoencoder echo state network

Abstract The dynamics of turbulent flows is chaotic and difficult to predict. This makes the design of accurate reduced-order models challenging. The overarching objective of this paper is to propose a nonlinear decomposition of the turbulent state to predict the flow based on a reduced-order representation of the dynamics. We divide the turbulent flow into a spatial problem and a temporal problem. First, we compute the latent space, which is the manifold onto which the turbulent dynamics live. The latent space is found by a series of nonlinear filtering operations, which are performed by a convolutional autoencoder (CAE). The CAE provides the decomposition in space. Second, we predict the time evolution of the turbulent state in the latent space, which is performed by an echo state network (ESN). The ESN provides the evolution in time. Third, by combining the CAE and the ESN, we obtain an autonomous dynamical system: the CAE-ESN. This is the reduced-order model of the turbulent flow. We test the CAE-ESN on the two-dimensional Kolmogorov flow and the three-dimensional minimal flow unit. We show that the CAE-ESN: (i) finds a latent-space representation of the turbulent flow that has ${\lesssim }1\,\%$ of the degrees of freedom than the physical space; (ii) time-accurately and statistically predicts the flow at different Reynolds numbers; and (iii) takes ${\lesssim }1\,\%$ computational time to predict the flow with respect to solving the governing equations. This work opens possibilities for nonlinear decomposition and reduced-order modelling of turbulent flows from data.


Introduction
In the past few decades, large amounts of data have been generated from experiments and numerical simulations of turbulent flows (e.g.Duraisamy, Iaccarino & Xiao 2019).
To analyse high-dimensional data, low-order representations are typically sought (e.g.Taira et al. 2017).Reduced-order modelling consists of predicting the time evolution of high-dimensional systems with a low-order representation of the system.By predicting the system based on a (relatively) small number of degrees of freedom, reduced-order modelling significantly reduces the computational cost and provides insights into the physics of the system.In this paper, we construct a reduced-order model by: (i) inferring a lower-dimensional manifold, the latent space, where the turbulent dynamics live; and (ii) predicting the turbulent dynamics in the latent space.To generate the latent space, techniques such as proper orthogonal decomposition (POD) (Lumley 1970) and dynamic mode decomposition (DMD) (Schmid 2010) are commonly used.They have been applied successfully in multiple settings, such as extracting spatiotemporal features and controlling flowfields (e.g.Rowley, Colonius & Murray 2004;Brunton & Noack 2015;Rowley & Dawson 2017).The downside of these methodologies is that they are linear approximators, which require a large number of modes to describe turbulent flowfields (e.g.Alfonsi & Primavera 2007;Muralidhar et al. 2019).To reduce the number of modes to accurately describe flowfields, nonlinear mappings have shown promising results in recent years (e.g.Brunton, Noack & Koumoutsakos 2020;Fernex, Noack & Semaan 2021).
A robust data-driven method that computes a low-dimensional representation of the data is the autoencoder (Kramer 1991;Goodfellow, Bengio & Courville 2016).Autoencoders typically consist of a series of neural networks that map the original field to (and back from) the latent space.In fluids, Milano & Koumoutsakos (2002) developed a feed-forward neural network autoencoder to investigate a turbulent channel flow.Since then, the advent of data-driven techniques tailored for the analysis of spatially varying data, such as convolutional neural networks (CNNs) (Lecun et al. 1998), has greatly extended the applicability of autoencoders (Hinton & Salakhutdinov 2006;Agostini 2020).For example, Fukami, Fukagata & Taira (2019) employed CNNs to improve the resolution of sparse measurements of a turbulent flowfield, Murata, Fukami & Fukagata (2020) analysed the autoencoder modes in the laminar wake past a cylinder, and Kelshaw, Rigas & Magri (2022) proposed a physics-informed autoencoder for super-resolution of turbulence, to name only a few.
Once the latent space is generated, we wish to predict the temporal dynamics within the latent space.To do so, one option is to project the governing equations onto the low-order space (Antoulas 2005).This is a common method when the governing equations are known, but it becomes difficult to implement when the equations are not exactly known (Yu, Yan & Guo 2019).An alternative option is the inference of differential equations within the latent space, with genetic programming (e.g.Schmidt & Lipson 2009) or symbolic regression (e.g.Loiseau, Noack & Brunton 2018).In this paper, we focus on developing a reduced-order modelling approach that does not require differential equations.
To forecast temporal dynamics based on a sequence of inputs, recurrent neural networks (RNNs) (Rumelhart, Hinton & Williams 1986) are the state-of-the-art data-driven architectures (e.g.Goodfellow et al. 2016;Chattopadhyay, Hassanzadeh & Subramanian 2020).RNNs are designed to infer the correlation within data sequentially ordered in time via an internal state, which is updated at each time step.Through this mechanism, RNNs retain the information of multiple previous time steps when predicting the future evolution of the system.This is useful in reduced-order modelling, in which the access to a limited number of variables (a subset of the full state of the system) causes the evolution of the latent space dynamics (i.e.partially observed dynamics) to be non-Markovian, as explained in Vlachas et al. (2020).In fluids, RNNs have been deployed for predicting flows past bluff bodies of different shapes (Hasegawa et al. 2020), replicating the statistics of turbulence (Srinivasan et al. 2019;Nakamura et al. 2021) and controlling gliding (Novati, Mahadevan & Koumoutsakos 2019), to name a few.Among RNNs, reservoir computers in the form of echo state networks (ESNs) (Jaeger & Haas 2004;Maass, Natschläger & Markram 2002) are versatile architectures for the prediction of chaotic dynamics.ESNs are universal approximators under non-stringent assumptions of ergodicity (Grigoryeva & Ortega 2018;Hart, Hook & Dawes 2021), which perform (at least) as well as other architectures such as long short-term memory (LSTM) networks (Vlachas et al. 2020), but they are simpler to train.This is because training the ESNs is a quadratic optimisation problem, whose global minimum is a solution of a linear system (Lukoševičius 2012).In contrast, the training of LSTMs require an iterative gradient descent, which can converge to a local minimum of a multimodal loss function (Goodfellow et al. 2016).ESNs are designed to be straightforward and computationally cheaper to train than other networks, but their performance is sensitive to the selection of hyperparameters, which need to be computed through ad hoc algorithms (Racca & Magri 2021).In fluid dynamics, ESNs have been employed to (i) optimise ergodic averages in thermoacoustic oscillations (Huhn & Magri 2022), (ii) predict extreme events in chaotic flows (Doan, Polifke & Magri 2021;Racca & Magri 2022a,b) and control their occurrence (Racca & Magri 2022a, 2023), (iii) infer the model error (i.e.bias) in data assimilation of thermoacoustics systems (Nóvoa, Racca & Magri 2023) and (iv) investigate the stability and covariant Lyapunov vectors of chaotic attractors (Margazoglou & Magri 2023).Racca & Magri (2022a) showed that ESNs perform similarly to the LSTMs of Srinivasan et al. (2019) in the forecasting of chaotic flows, whilst requiring ≈0.1 % data.
The objective of this work is threefold.First, we develop the CAE-ESN by combining convolutional autoencoders (CAEs) with ESNs to predict turbulent flowfields.Second, we time-accurately and statistically predict a turbulent flow at different Reynolds numbers.Third, we carry out a correlation analysis between the reconstruction error and the temporal prediction of the CAE-ESN.
This paper is organised as follows.Section 2 presents the two-dimensional turbulent flow and introduces the tools for nonlinear analysis.Section 3 describes the CAE-ESN.Section 4 analyses the reconstruction of the flowfield.Section 5 analyses the time-accurate prediction of the flow and discusses the correlation between reconstruction error and temporal prediction of the system.Section 6 analyses the prediction of the statistics of the flow and the correlation between time-accurate and statistical performance.Section 7 extends the CAE-ESN to the prediction of three-dimensional turbulence.Finally, § 8 summarises the results.

Kolmogorov flow
We consider the two-dimensional non-dimensionalised incompressible Navier-Stokes equations where p is the pressure, u = (u 1 , u 2 ) is the velocity and x = (x 1 , x 2 ) are the spatial coordinates.The time-independent diverge-free forcing is f = sin(n f x 2 )e 1 , where e 1 = (1, 0) T and n f is the wavenumber of the forcing.The Reynolds number is Re = √ χ/ν, where χ is the amplitude of the forcing and ν is the kinematic viscosity (Chandler & Kerswell 2013).We solve the flow in the doubly periodical domain x ∈ T 2 = [0, 2π] × [0, 2π].For this choice of forcing and boundary conditions, the flow is typically referred to as the Kolmogorov flow (e.g.Platt, Sirovich & Fitzmaurice 1991).We integrate the equations using KolSol (https://github.com/MagriLab/KolSol),which is a publicly available pseudospectral solver based on the Fourier-Galerkin approach described by Canuto et al. (1988).The equations are solved in the Fourier domain with a fourth-order explicit Runge-Kutta integration scheme with a timestep dt = 0.01.The results are stored every δt = 0.1.As suggested in Farazmand (2016), we select the number of Fourier modes from convergence tests on the kinetic energy spectra (see supplementary material available at https://doi.org/10.1017/jfm.2023.716).The solution in the Fourier domain is projected onto a 48 × 48 grid in the spatial domain with the inverse Fourier transform.The resulting 4608-dimensional velocity flowfield, q(t) ∈ R 48×48×2 , is the flow state vector.
The Kolmogorov flow shows a variety of regimes that depend on the forcing wavenumber, n f , and Reynolds number, Re (Platt et al. 1991).In this work, we analyse n f = 4 and Re = {30, 34}, for which we observe quasiperiodic and turbulent solutions, respectively.To globally characterise the flow, we compute the average dissipation rate, D, per unit volume (2.3a,b) where d(x 1 , x 2 , t) is the local dissipation rate and ||•|| is the L 2 norm.The dissipation rate has been employed in the literature to analyse the Kolmogorov flow (Chandler & Kerswell 2013;Farazmand 2016).We characterise the solutions in figure 1 and Appendix B through the average dissipation rate, D, and the local dissipation rate at the centre of the domain, d π,π .The phase plots show the trajectories obtained with the optimal time delay given by the first minimum of the average mutual information, τ (Kantz & Schreiber 2004).In the first regime (Re = 30), the average dissipation rate is a limit cycle (figure 1a), whilst the local dissipation rate has a toroidal structure (figure 1b), which indicates quasiperiodic variations in the flow state.The average dissipation rate is periodic, despite the flow state being quasiperiodic, because some temporal frequencies are filtered out when averaging in space (more details are given in Appendix B).In the second regime (Re = 34), the solution is turbulent for both global and local quantities (figure 1c,d).

Lyapunov exponents and attractor dimension
As explained in § 4, in order to create a reduced-order model, we need to select the number of degrees of freedom of the latent space.We observe that at a statistically stationary regime, the latent space should be at least as large as the turbulent attractor.Therefore, we propose using the turbulent attractor's dimension as a lower bound for the latent space dimension.In chaotic (turbulent) systems, the dynamics are predictable only for finite times because infinitesimal errors increase in time with an average exponential rate given by the (positive) largest Lyapunov exponent, Λ 1 (e.g.Boffetta et al. 2002).The inverse of the largest Lyapunov exponent provides a timescale for assessing the predictability of chaotic systems, which is referred to as the Lyapunov time (LT), 1LT = Λ −1 1 .975 A2-4  To quantitatively assess time accuracy ( § 5), we normalise the time by the LT.In addition to being unpredictable, chaotic systems are dissipative, which means that the solution converges to a limited region of the phase space, i.e. the attractor.The attractor typically has a significantly smaller number of degrees of freedom than the original system (Eckmann & Ruelle 1985).An upper bound on the number of degrees of freedom of the attractor, i.e. its dimension, can be estimated via the Kaplan-Yorke dimension (Kaplan & Yorke 1979) where Λ i are the j largest Lyapunov exponents for which j i=1 Λ i ≥ 0. Physically, the m largest Lyapunov exponents are the average exponential expansion/contraction rates of an m-dimensional infinitesimal volume of the phase space moving along the attractor (e.g.Boffetta et al. 2002).To obtain the m largest Lyapunov exponents, we compute the evolution of m random perturbations around a trajectory that spans the attractor.The space spanned by the m perturbations approximates an m-dimensional subspace of the tangent space.Because errors grow exponentially in time, the evolution of the perturbations is exponentially unstable and the direct computation of the Lyapunov exponents numerically overflows.To overcome this issue, we periodically orthonormalise the perturbations, following the algorithm of Benettin et al. (1980) (see the supplementary material).In so doing, we find the quasiperiodic attractor to be 3-dimensional and the chaotic attractor to be 9.5-dimensional.Thus, both attractors have approximately 3 order of magnitude fewer degrees of freedom than the flow state (which has 4608 degrees of freedom, see § 2).We take advantage of these estimates in § § 4-5, in which we show that we need more than 100 POD modes to accurately describe the attractor that has less than 10 degrees of freedom.The leading Lyapunov exponent in the chaotic case is Λ 1 = 0.065, therefore, the LT is 1LT = 0.065 −1 ≈ 15.4.

Convolutional autoencoder echo state network (CAE-ESN)
In order to decompose high-dimensional turbulent flows into a lower-order representation, a latent space of the physical dynamics is computed.For time prediction, the dynamics are mapped onto the latent space on which their evolution can be predicted at a lower computational cost than the original problem.In this work, we generate the low-dimensional space using a CAE (Hinton & Salakhutdinov 2006), which offers a nonlinear reduced-order representation of the flow.By projecting the physical dynamics onto the latent space, we obtain a low-dimensional time series, whose dynamics are predicted by an ESN (Jaeger & Haas 2004).

Convolutional autoencoder
The autoencoder consists of an encoder and a decoder (figure 2a).The encoder, g(•), maps the high-dimensional physical state, q(t) ∈ R N phys , into the low-dimensional latent state, z ∈ R N lat with N lat N phys ; whereas the decoder, f (•), maps the latent state back into the physical space with the following goal q(t) q(t), where q(t) = f (z(t)), z(t) = g(q(t)), ( where q(t) ∈ R N phys is the reconstructed state.We use CNNs (Lecun et al. 1998) as the building blocks of the autoencoder.In CNNs, a filter of size k f × k f × d f , slides through the input, y 1 ∈ R N 1x ×N 1y ×N 1z , with stride, s, so that the output, y 2 ∈ R N 2x ×N 2y ×N 2z , of the CNN layer is where func is the nonlinear activation function, and W l 1 l 2 km and b m are the weights and bias of the filter, which are computed by training the network.We choose func = tanh following Murata et al. (2020).A visual representation of the convolution operation is shown in figure 2(b).The size of the output is a function of the input size, the kernel size, the stride and the artificially added entries around the input (i.e. the padding).By selecting periodic padding, we enforce the boundary conditions of the flow (figure 2c).The convolution operation filters the input by analyzing patches of size k f × k f .In doing so, CNNs take into account the spatial structure of the input and learn localised structures in the flow.Moreover, the filter has the same weights as it slides through the input (parameter sharing), which allows us to create models with significantly less weights than fully connected layers.Because the filter provides the mathematical relationship between nearby points in the flowfield, parameter sharing is physically consistent with the governing equations of the flow, which are invariant to translation.
The encoder consists of a series of padding and convolutional layers.At each stage, we (i) apply periodic padding, (ii) perform a convolution with stride equal to two to half the spatial dimensions (Springenberg et al. 2014) and increase the depth of the output and (iii) perform a convolution with stride equal to one to keep the same dimensions and increase the representation capability of the network.The final convolutional layer has a varying depth, which depends on the size of the latent space.The decoder consists of a series of padding, transpose convolutional and centre crop layers.An initial periodic padding enlarges the latent space input through the periodic boundary conditions.The size of the padding is chosen so that the spatial size of the output after the last transpose convolutional layer is larger than the physical space.In the next layers, we (i) increase the spatial dimensions of the input through the transpose convolution with stride equal to two and (ii) perform a convolution with stride equal to one to keep the same dimensions and increase the representation capability of the network.The transpose convolution is the inverse operation of the convolution shown in figure 2(b), in which the roles of the input and the output are inverted (Zeiler et al. 2010).The centre crop layer eliminates the outer borders of the picture after the transpose convolution, in a process opposite to padding, which is needed to match the spatial size of the output of the decoder with the spatial size of the input of the encoder.A final convolutional layer with a linear activation function sets the depth of the output of the decoder to be equal to the input of the encoder.
We use the linear activation to match the amplitude of the inputs to the encoder.The encoder and decoder have similar number of trainable parameters (more details are given in Appendix C).
In this study, we use a multiscale autoencoder (Du et al. 2018), which has been successfully employed to generate latent spaces of turbulent flowfields (Nakamura et al. 2021).The multiscale autoencoder employs three parallel encoders and decoders, which have different spatial sizes of the filter (figure 2a).The size of the three filters are 3 × 3, 5 × 5 and 7 × 7, respectively.By employing different filters, the multiscale architecture learns spatial structures of different sizes, which are characteristic features of turbulent flows.We train the autoencoder by minimising the mean squared error (MSE) between the outputs and the inputs where N t is the number of training snapshots.To train the autoencoder, we use a dataset of 30 000 time units (generated by integrating (2.1)-(2.2)),which we sample with timestep δt CNN = 1.Specifically, we use 25 000 time units for training and 5000 for validation.
We divide the training data in minibatches of 50 snapshots each, where every snapshot is 500δt CNN from the previous input of the minibatch.The weights are initialised following Glorot & Bengio (2010), and the minimisation is performed by stochastic gradient descent with the AMSgrad variant of the Adam algorithm (Kingma & Ba 2017;Reddi, Kale & Kumar 2018) with adaptive learning rate.The autoencoder is implemented in Tensorflow (Abadi et al. 2015).

Echo state networks
Once the mapping from the flowfield to the latent space has been found by the encoder at the current time step, z(t i ) = g(q(t i )), we wish to compute the latent state at the next time step, z(t i+1 ).Because the latent space does not contain full information about the system, the next latent state, z(t i+1 ) cannot be computed straightforwardly as a function of the current latent state only, z(t i ) (for more details refer to Kantz & Schreiber 2004;Vlachas et al. 2020).To compute the latent vector at the next time step, we incorporate information from multiple previous time steps to retain a recursive memory of the past.(This can be alternatively motivated by dynamical systems' reconstruction via delayed embedding Takens (1981), i.e. z(t i+1 ) = F (z(t i ), z(t i−1 ), . . ., z(t i−d emb )), where d emb is the embedding dimension.)This can be achieved with RNNs, which compute the next time step as a function of previous time steps by updating an internal state that keeps memory of the system.Because of the long-lasting time dependencies of the internal state, however, training RNNs with backpropagation through time is notoriously difficult (Werbos 1990).ESNs overcome this issue by nonlinearly expanding the inputs into a higher-dimensional system, the reservoir, which acts as the memory of the system (Lukoševičius 2012).The output of the network is a linear combination of the reservoir's dynamics, whose weights are the only trainable parameters of the system.Thanks to this architecture, training ESNs consists of a straightforward linear regression problem, which avoids backpropagation through time.
As shown in figure 3, in an ESN, at any time t i : (i) the latent state input, z(t i ), is mapped into the reservoir state, by the input matrix, W in ∈ R N r ×(N lat +1) , where N r > N lat ; (ii) the reservoir state, r ∈ R N r , is updated at each time iteration as a function of the current input and its previous value; and (iii) the updated reservoir is used to compute the output, which is the predicted latent state at the next timestep, ẑ(t i+1 ).This process yields the discrete dynamical equations that govern the ESN's evolution (Lukoševičius 2012) Echo state network Decoder The purpose of washout is for the reservoir state to become (i) up-to-date with respect to the current state of the system and (ii) independent of the arbitrarily chosen initial condition, r(t 0 ) = 0 (echo state property).After washout, we train the output matrix, W out , by minimising the MSE between the outputs and the data over the training set.
Training the network on N tr + 1 snapshots consists of solving the linear system (ridge regression) where R ∈ R (N r +1)×N tr and Z d ∈ R N lat ×N tr are the horizontal concatenations of the reservoir states with bias, [r; 1], and of the output data, respectively; I is the identity matrix and β is the Tikhonov regularisation parameter (Tikhonov et al. 2013).In the closed-loop configuration (figure 3), starting from an initial data point as an input and an initial reservoir state obtained through washout, the output, ẑ, is fed back to the network as an input for the next time step prediction.In doing so, the network is able to autonomously evolve in the future.After training, the closed-loop configuration is deployed for validation and test on unseen dynamics.

Validation
During validation, we use part of the data to select the hyperparameters of the network by minimising the error of the prediction with respect to the data.In this work, we optimise the input scaling, σ in , spectral radius, ρ, and Tikhonov parameter, β, which are the key hyperparameters for the performance of the network (Lukoševičius 2012).
We use a Bayesian optimisation to select σ in and ρ, and perform a grid search within [σ in , ρ] to select β (Racca & Magri 2021).The range of the hyperparameters vary as a function of the testcase (see the supplementary material).In addition, we add to the training inputs, z tr , Gaussian noise, N , with a zero mean and standard deviation, such that To select the hyperparameters, we employ the recycle validation (RV) (Racca & Magri 2021).The RV is a tailored validation strategy for the prediction of dynamical systems with RNNs, which has been shown to outperform other validation strategies, such as the single shot validation (SSV), in the prediction of chaotic flows (more details are given in Appendix D).In the RV, the network is trained only once on the entire dataset, and validation is performed on multiple intervals already used for training.This is possible because RNNs operate in two configurations (open-loop and closed-loop), which means that the networks can be validated in closed-loop on data used for training in open-loop.

Spatial reconstruction
We analyse the ability of the autoencoder to create a reduced-order representation of the flowfield, i.e. the latent space.The focus is on the spatial reconstruction of the flow.Prediction in time is discussed in § § 5-6.

Reconstruction error
The autoencoder maps the flowfield onto the latent space and then reconstructs the flowfield based on the information contained in the latent space variables.The reconstructed flowfield (the output) is compared with the original flowfield (the input).The difference between the two flowfields is the reconstruction error, which we quantify with the normalised root-mean-squared error (NRMSE) where i indicates the ith component of the physical flowfield, q, and the reconstructed flowfield, q, and σ (•) is the standard deviation.We compare the results for different sizes of the latent space with the reconstruction obtained by POD (Lumley 1970), also known as principal component analysis (Pearson 1901).In POD, the N lat -dimensional orthogonal basis that spans the reduced-order space is given by the eigenvectors Φ (phys) i associated with the largest N lat eigenvalues of the covariance matrix , where Q d is the vertical concatenation of the N t flow snapshots available during training, from which the mean has been subtracted.Both POD and the autoencoder minimise the same loss function (3.3).However, POD provides the optimal subspace onto which linear projections of the data preserve its energy.On the other hand, the autoencoder provides a nonlinear mapping of the data, which is optimised to preserve its energy.(In the limit of linear activation functions, autoencoders perform similarly to POD (Baldi & Hornik 1989;Milano & Koumoutsakos 2002;Murata et al. 2020).)The size of the latent space of the autoencoder is selected to be larger than the Kaplan-Yorke dimension, which is N KY = {3, 9.5} for the quasiperiodic and turbulent testcases, respectively ( § 2.1).We do so to account for the (nonlinear) approximation introduced by the CAE-ESN, and numerical errors in the optimisation of the architecture.
Figure 4 shows the reconstruction error over 600 000 snapshots (for a total of 2 × 10 6 snapshots between the two regimes) in the test set.We plot the results for POD and the autoencoder through the energy (variance) captured by the modes where NRMSE is the time-averaged NRMSE.The rich spatial complexity of the turbulent flow (figure 4b), with respect to the quasiperiodic solution (figure 4a), is apparent from the magnitude and slope of the reconstruction error as a function of the latent space dimension.

A2-11
In the quasiperiodic case, the error is at least one order of magnitude smaller than the turbulent case for the same number of modes, showing that fewer modes are needed to characterise the flowfield.In both cases, the autoencoder is able to accurately reconstruct the flowfield.For example, in the quasiperiodic and turbulent cases the nine-dimensional autoencoder latent space captures 99.997 % and 99.78 % of the energy, respectively.The nine-dimensional autoencoder latent space provides a better reconstruction than 100 POD modes for the same cases.Overall, the autoencoder provides an error at least two orders of magnitude smaller than POD for the same size of the latent space.These results show that the nonlinear compression of the autoencoder outperforms the optimal linear compression of POD.

Autoencoder principal directions and POD modes
We compare the POD modes provided by the autoencoder reconstruction, Φ (phys) Dec , with the POD modes of the data, Φ (phys) True .We do so to interpret the reconstruction of the autoencoder as compared with POD.For brevity, we limit our analysis to the latent space of 18 variables in the turbulent case.We decompose the autoencoder latent dynamics into proper orthogonal modes, Φ (lat) i , which we name 'autoencoder principal directions' to distinguish them from the POD modes of the data.Figure 5(a) shows that four principal directions are dominant, as indicated by the change in slope of the energy, and that they contain roughly 97 % of the energy of the latent space signal.We therefore focus our analysis on the four principal directions, from which we obtain the decoded field where a i (t) = Φ (lat)T i z(t) and f (•) is the decoder ( § 3.1).The energy content in the reconstructed flowfield is shown in figure 5(b).On the one hand, the full latent space, which consists of 18 modes, reconstructs accurately the energy content of the first 50 POD modes.On the other hand, qDec4 closely matches the energy content of the first few POD modes of the true flow field, but the error increases for larger numbers of POD modes.This happens because qDec4 contains less information than the true flow field, so that fewer POD modes are necessary to describe it.The results are further corroborated by the scalar product with the physical POD modes of the data (figure 6).The decoded field obtained from all the principal components accurately describes the majority of the first 50 physical POD modes, and qDec4 accurately describes the first POD modes.These results indicate that only few nonlinear modes contain information about several POD modes.
A visual comparison of the most energetic POD modes of the true flow field and of qDec4 is shown in figure 7. The first eight POD modes are recovered accurately.The decoded latent space principal directions, f (Φ (lat) i ), are plotted in figure 8.We observe that the decoded principal directions are in pairs as the linear POD modes, but they differ significantly from any of the POD modes of figure 7.This is because each mode contains information about multiple POD modes, since the decoder infers nonlinear interactions among the POD modes.Further analysis of the modes and the latent space requires tools from Riemann geometry (Magri & Doan 2022).This is beyond the scope of this study.

Time-accurate prediction
Once the autoencoder is trained to provide the latent space, we train an ensemble of 10 ESNs to predict the latent space dynamics.We use an ensemble of networks to take into account the random initialisation of the input and state matrices (Racca & Magri 2022a).We predict the low-dimensional dynamics to reduce the computational cost, which becomes prohibitive when time-accurately predicting high-dimensional systems with RNNs (Pathak et al. 2018a;Vlachas et al. 2020).Figure 9 shows the computational time required to forecast the evolution of the system by solving the governing equations and using the CAE-ESN.The governing equations are solved using a single GPU Nvidia Quadro RTX 4000, whereas the CAE-ESN uses a single CPU Intel i7-10700K (ESN) and single GPU Nvidia Quadro RTX 4000 (decoder) in sequence.The CAE-ESN is two orders of magnitude faster than solving the governing equations because the ESNs use sparse matrix multiplications and can advance in time with a δt larger than the numerical solver (see chapter 3.4 of Racca 2023).

Quasiperiodic case
We train the ensemble of ESNs using 15 000 snapshots equispaced by δt = 1.The networks are validated and tested in closed-loop on intervals lasting 500 time units.One closed-loop prediction of the average and local dissipation rates (2.3a,b) in the test set are plotted in figure 10.The network accurately predicts the two quantities for several oscillations (figure 10a,b).The CAE-ESN accurately predicts the entire 4608-dimensional state (see figure 3) for the entire interval, as shown by the NRMSE for the state in figure 10(c).
Figure 11 shows the quantitative results for different latent spaces and reservoir sizes.We plot the percentiles of the network ensemble for the mean over 50 intervals in the test set, • , of the time-averaged NRMSE.The CAE-ESN time-accurately predicts the system in all cases analysed, except for small reservoirs in large latent spaces (figure 11c,d).This is because larger reservoirs are needed to accurately learn the dynamics of larger latent spaces.For all the different latent spaces, the accuracy of the prediction increases with the size of the reservoir.For 5000 neurons reservoirs, the NRMSE for the prediction in time is the same order of magnitude as the NRMSE of the reconstruction (figure 4).This means that the CAE-ESN learns the quasiperiodic dynamics and accurately predicts the future evolution of the system for several characteristic timescales of the system.

Turbulent case
We validate and test the networks for the time-accurate prediction of the turbulent dynamics against the prediction horizon (PH), which, in this paper, is defined as the time interval during which the NRMSE (4.1) of the prediction of the network with respect to the true data is smaller than a user-defined threshold, k where t p is the time from the start of the closed-loop prediction and k = 0.3.The PH is a commonly used metric, which is tailored for the data-driven prediction of diverging trajectories in chaotic (turbulent) dynamics (Pathak et al. 2018b;Vlachas et al. 2020).The PH is evaluated in LTs ( § 2).(The PH defined here is specific for data-driven prediction of chaotic dynamics, and differs from the definition of Kaiser et al. (2014).) Figure 12(a) shows the prediction of the dissipation rate in an interval of the test set, i.e. on data not seen by the network during training.The predicted trajectory closely matches the true data for about 3 LTs.Because the tangent space of chaotic attractors is exponentially unstable, the prediction error ultimately increases with time (figure 12b).To quantitatively assess the performance of the CAE-ESN, we create 20 latent spaces for each of the latent space sizes [18,36,54].Because we train 10 ESNs per each latent space, we analyse results from 600 different CAE-ESNs.Each latent space is obtained by training autoencoders with different random initialisations and optimisations of the weights of the convolutional layers ( § 3.1).The size of the latent space is selected to be larger than the Kaplan-Yorke dimension of the system, N KY = 9.5 ( § 2.1), in order to contain the attractor's dynamics.In each latent space, we train the ensemble of 10 ESNs using 60 000 snapshots equispaced by δt = 0.5.Figure 13 shows the average PH over 100 intervals in the test set for the best performing latent space of each size.The CAE-ESN successfully predicts the system for up to more 1.5 LTs for all sizes of the latent space.Increasing the reservoir size slightly improves the performance with a fixed latent space size, whilst latent spaces of different sizes perform similarly.This means that small reservoirs in small latent spaces are sufficient to time-accurately predict the system.
A detailed analysis of the performance of the ensemble of the latent spaces of the same size is shown in figure 14. Figure 14(a) shows the reconstruction error.The small amplitude of the errorbars indicates that different autoencoders with same latent space size reconstruct the flowfield with similar energy.Figure 14(b) shows the performance of the CAE-ESN with 10 000 neurons reservoirs for the time-accurate prediction as a function of the reconstruction error.In contrast to the error in figure 14(a), the PH varies significantly, ranging from 0.75 to more than 1.5 LTs in all latent space sizes.There is no discernible correlation between the PH and reconstruction error.This means that the time-accurate performance depends more on the training of the autoencoder than the latent space size.To quantitatively assess the correlation between two quantities, [a 1 , a 2 ], we use the Pearson correlation coefficient (Galton 1886;Pearson 1895) (5. 2) The values r = {−1, 0, 1} indicate anticorrelation, no correlation and correlation, respectively.The Pearson coefficient between the reconstruction error and the PH is r = −0.10 for the medians of the 60 cases analysed in figure 14(b), which means that the PH is weakly correlated with the reconstruction error.This indicates that latent spaces that similarly capture the energy of the system, may differ in capturing the dynamical content of the system; i.e. some modes that are critical for the dynamical evolution of the system do not necessarily affect the reconstruction error, and therefore may not necessarily be captured by the autoencoder.This is a phenomenon that also affects different reduced-order modelling methods, specifically POD (Rowley & Dawson 2017;Agostini 2020).In figure 14(c), we plot the PH of the medians of the errorbars of figure 14(b) for different latent spaces.Although the performance varies within the same size of the latent space, we observe that the PH improves and the range of the errorbars decreases with increasing size of the latent space.This is because autoencoders that better approximate the system in a L 2 sense are also more likely to include relevant dynamical information for the time evolution of the system.From a design perspective, this means that latent spaces with smaller reconstruction error are needed to improve and reduce the variability of the prediction of the system.A different approach, which is beyond the scope of this paper, is to generate latent spaces tailored for time-accurate prediction (a more detailed discussion can be found in the supplementary material).

Statistical prediction
We analyse the statistics predicted by the CAE-ESN with long-term predictions.
Long-term predictions are closed-loop predictions that last tens of LTs, whose instantaneous state differs from the true data because of chaos (infinitesimal errors exponentially grow in chaotic systems).However, in ergodic systems, the trajectories evolve within a bounded region of the phase space, the attractor, which means that they share the same long-term statistics.The long-term predictions are generated from 50 different starting snapshots in the training set.In the quasiperiodic case, each time series is obtained by letting the CAE-ESN evolve in closed-loop for 2500 time units, whereas in the chaotic case the CAE-ESN evolves for 30 LTs. (If the network predicts values outside the range of the training set, we discard the remaining part of the closed-loop prediction, similarly to Racca & Magri (2022a).)To quantify the error of the predicted probability density function (p.d.f.) with respect to the true p.d.f., we use the first-order Kantorovich metric (Kantorovich 1960), also known as the Wasserstein distance or Earth mover's distance.The Kantorovich metric evaluates the work to transform one p.d.f. to another, which is computed as where CDF is the cumulative distribution function and k is the random variable of interest.Small values of K indicate that the two p.d.f.s are similar to each other, whereas large values of K indicate that the two p.d.f.s significantly differ from each other.
6.1.Quasiperiodic case To predict the statistics of the quasiperiodic case, we train the network on small datasets.We do so because the statistics are converged for the large dataset (15 000 time units) used for training in § 5 (figure 15a).We use datasets with non-converged statistics to show that we can infer the converged statistics of the system from the imperfect data available during training.We analyse the local dissipation rate at the central point of the domain, d π,π , which shows quasiperiodic oscillations ( § 2). Figure 15(b) shows the Kantorovich metric for the CAE-ESN as a function of the number of training snapshots in the dataset for fixed latent space and reservoir size.The networks improve the statistical knowledge of the system with respect to the training set, i.e. the original data, in all cases analysed, which generalises the findings of Racca & Magri (2022a) to spatiotemporal systems.The improvement is larger in smaller datasets, and it saturates for larger datasets.This means that (i) the networks accurately learn the statistics from small training sets and (ii) the networks are able to extrapolate from imperfect statistical knowledge and provide an accurate estimation of the true statistics.Figure 16 shows the performance of the CAE-ESN as a function of the reservoir size for the small dataset of 1500 training snapshots.In all cases analysed, the CAE-ESN outperforms the training set, with values of the Kantorovich metric up to three times smaller than those of the training set.Consistently with previous studies on ESNs (Huhn & Magri 2022;Racca & Magri 2022a), the performance is approximately constant for different reservoir sizes.The results indicate that small reservoirs and small latent spaces are sufficient to accurately extrapolate the statistics of the flow from imperfect datasets.

Turbulent case
In the turbulent case, we analyse the ability of the architecture to replicate the p.d.f. of the average dissipation rate, D, to assess whether the networks have learned the dynamics in a statistical sense.Figure 17 shows the results for the best performing CAE-ESN with latent space of size 18, 36 and 54 discussed in § 5. Figure 17(a) shows the Kantorovich metric as a function of the size of the reservoir.The networks predict the statistics of the dissipation rate with Kantorovich metrics of the same order of magnitude of the training set, which has nearly converged statistics.This means that the network accurately predict the p.d.f. of the dissipation rate (figure 17b).In agreement with the quasiperiodic case, the performance of the networks is approximately constant as a function of the reservoir size.These results indicate that small latent spaces and small networks can learn the statistics of the system.
A detailed analysis of the statistical performance of the 20 latent spaces of size [18,36,54] is shown in figure 18.On the one hand, figure 18(a) shows scatter plot between the Kantorovich metric and the reconstruction error.In agreement with the results for the PH (see figure 14), we observe that (i) the Kantorovich metric varies significantly within latent spaces of the same size, (ii) the two quantities are weakly correlated (r = −0.16)and (iii) the variability of the performance decreases with increasing latent space size (figure 18c).On the other hand, figure 18(b) shows the scatter plot for the Kantorovich metric as a function of the PH, which represents the correlation between the time-accurate and statistical performances of the networks.The two quantities are (anti)correlated, with a Pearson coefficient r = −0.76.Physically, the results indicate that (i) increasing the size of the latent space is beneficial for predicting the statistics of the system and (ii) the time-accurate and statistical performances of latent spaces are correlated.This means that the design guidelines for the time-accurate prediction discussed in § 5 extend to the design of latent spaces for the statistical prediction of turbulent flowfields.

Predicting higher-dimensional turbulence: the minimal flow unit
Because of its relatively small size, the two-dimensional Kolmogorov flow ( § 2) enables us to thoroughly assess the accuracy and robustness of the proposed method with a variety of parametric and ensemble computations.In this section, we explain how to modify the CAE-ESN to tackle three-dimensional fields.We deploy the CAE-ESN for the statistical prediction of the minimal flow unit (MFU) (Jimenez & Moin 1991).This flow is governed by the incompressible Navier-Stokes equations, presented in § 2, with velocity u = (u, v, w).As shown in figure 19, the MFU consists of a channel flow-like configuration with no-slip boundary conditions in the wall-normal direction, y, at u(x, ±δ, z, t) = 0, where δ is the half-width of the channel, and periodic boundary conditions in the streamwise, x, and transverse, z, directions.The flow is forced with a body forcing f = ( f 0 , 0, 0) in the streamwise direction.We consider a channel of dimension πδ × 2δ × 0.34πδ with δ = 1.0 and a forcing term that provides a friction Reynolds number, Re τ ≈ 140 as in Blonigan, Farazmand & Sapsis (2019).These conditions result in a turbulent flow, as shown in figure 19.We use an in-house direct numerical simulation (DNS) code based on work by Bernardini, Pirozzoli & Orlandi (2014) to simulate the MFU on a grid 32 × 256 × 16, yielding a 32 × 256 × 16 × 3 ≈ 4 × 10 5 -dimensional physical state vector.We solve the equations using a hybrid third-order low-storage Runge-Kutta algorithm coupled with a second-order Crank-Nicolson scheme (Bernardini et al. 2014).
The timestep is computed at the initial state to ensure the stability of the numerical scheme (initial Courant-Friedrichs-Lewy (CFL) number of 2/3), and is subsequently kept constant to a value approximately equal to 1/100 of the eddy turnover time.y-direction.This padding is physically consistent with the boundary conditions.Similarly, in the decoder part of the CAE, three-dimensional transpose convolutional layers are employed.To balance the size of the CAE and reconstruction accuracy, only two parallel encoder/decoder streams are used in the multiscale autoencoder architecture, which have spatial filter sizes 3 × 5 × 3 and 5 × 7 × 5, respectively.For the reducing/increasing size layers, a striding of the form (2, 4, 2) is used.Details on the CAE architecture are provided in Appendix C. The CAE is trained using the same method as described in § 3.1.The architecture the ESN is identical to the architecture presented in § 3.2, but with a reservoir containing 150 000 neurons.The hyperparameters of the ESN are obtained using the RV (Racca & Magri 2021), in which the training dataset is the encoded MFU evolution.
Figure 19(b) shows the performance of the CAE through the reconstruction error discussed in § 4. The energy for POD and three different CAEs, with latent space dimensions of 384, 768, 1536 and 3072, respectively, is computed over the 400 snapshots in the test set, similarly to figure 4 in the Kolmogorov flow.The CAE outperforms POD, with POD requiring, for example, approximately 10 times larger latent spaces to obtain the same reconstruction error as the CAE with 1536-dimensional latent space.The performance of the combined CAE-ESN, with a CAE of latent dimension 1536, is finally assessed statistically in figure 20.This latent dimension was chosen as it provided sufficient reconstruction accuracy in the CAE.We perform long-term predictions of the CAE-ESN for a duration of 100 eddy turnover times, and compare the resulting mean and fluctuating velocity profiles with the true DNS data.The CAE-ESN correctly predicts the true mean (figure 20a) and fluctuating velocity profiles (figure 20b) of the MFU.In conclusion, the CAE-ESN can statistically learn and predict the dynamics of three-dimensional turbulent flows through relatively small latent spaces.

Conclusions
In this paper, we propose the CAE-ESN for the prediction of two-and three-dimensional turbulent flowfields from data.The CAE nonlinearly maps the flowfields onto (and back from) the latent space, whose dynamics are predicted by an ESN.We assess the capabilities of the CAE-ESN methodology on a variety of studies on a two-dimensional flowfield, in both quasiperiodic and turbulent regimes.The CAE-ESN is finally deployed to predict the statistics of a three-dimensional turbulent flow.First, we show that the CAE-ESN time-accurately and statistically predicts the flow.The latent space requires 1 % of the degrees of freedom of the original flowfield.In the prediction of the spatiotemporal dynamics, the CAE-ESN is at least 100 times faster than solving the governing equations.This is possible thanks to the nonlinear compression provided by the autoencoder, which has a reconstruction error that is 1 % of that from POD.Second, we analyse the performance of the architecture at different Reynolds numbers.The architecture correctly learns the dynamics of the system in both quasiperiodic and turbulent testcases.This means that the architecture is robust with respect to changes in the physical parameters of the system.Third, we analyse the performance of the CAE-ESN as a function of the reconstruction error.We show that (i) the performance varies between latent spaces with similar reconstruction error, (ii) larger latent spaces with smaller reconstruction errors provide a more consistent and more accurate prediction on average and (iii) the time-accurate and statistical performance of the CAE-ESN are correlated.This means that relatively small reservoirs in relatively small latent spaces are sufficient to time-accurately and statistically predict the turbulent dynamics, and increasing the latent space size is beneficial for the performance of the CAE-ESN.Finally, we extend the CAE-ESN to three-dimensional turbulent flows.By analysing the MFU as a testcase, we show that POD requires a latent space that is approximately 10 times larger than the latent space inferred by the CAE-ESN for the same reconstruction error.By letting the CAE-ESN evolve as an autonomous dynamical system, the turbulent velocity statistics of the MFU are accurately inferred.By proposing a data-driven framework for the nonlinear decomposition and prediction of turbulent dynamics, this paper opens up possibilities for nonlinear reduced-order modelling of turbulent flows.These results indicate that the RV outperforms and is more reliable than the SSV in the prediction of the turbulent flow.

Figure 2 .
Figure 2. (a) Schematic representation of the multiscale autoencoder, (b) convolution operation for filter size (2 × 2 × 1), stride = 1 and padding = 1 and (c) periodic padding.In (b,c), blue and white squares indicate the input and the padding, respectively.The numbers in (c) indicate pictorially the values of the flowfield, which can be interpreted as pixel values.

Figure 3 .
Figure 3. Schematic representation of the closed-loop evolution of the ESN in the latent space, which is decompressed by the decoder.
where σ (•) is the standard deviation and k z is a tunable parameter (Appendix D).Adding noise to the data improves the ESN forecasting of chaotic dynamics because the network explores more regions around the attractor, thereby becoming more robust to errors (Lukoševičius 2012;Vlachas et al. 2020;Racca & Magri 2022a).

Figure 4 .
Figure 4. Reconstruction error in the test set as a function of the latent space size in (a) the quasiperiodic case and (b) the chaotic case for POD and the CAE.

Figure 5 .Figure 6 .
Figure 5. (a) Energy as a function of the latent space principal directions and (b) energy as a function of the POD modes in the physical space for the field reconstructed using 4 and 18 (all) latent space principal direction and data (True).

Figure 7 .
Figure 7. POD modes, one per row, for (a) the reconstructed flow field based on four principal components in latent space and (b) the data.Even modes are shown in the supplementary material.The reconstructed flow contains the first eight modes.

Figure 9 .Figure 10 .
Figure 9. Computational time required to forecast the evolution of the system for one time unit.Times for the CAE-ESN are for the largest size of the reservoir, which takes the longest time.

Figure 11 .
Figure 11.The 25th, 50th and 75th percentiles of the average NRMSE in the test set as a function of the latent space size and reservoir size in the quasiperiodic case.

Figure 12 .Figure 13 .Figure 14 .
Figure 12.Prediction of the dissipation rate in the turbulent case in the test set, i.e. unseen dynamics, for a CAE-ESN with a 10 000 neurons reservoir and 36-dimensional latent space.The vertical line in panel (a) indicates the PH.

Figure 15 Figure 16 .
Figure 15.(a) P.d.f. of the local dissipation rate, d π,π , in the training set for different numbers of training snapshots.(b) The 25th, 50th and 75th percentiles of the Kantorovich metric for the training set (Train) and for the CAE-ESN (Predicted) as a function of the training set size.Quasiperiodic regime.

Figure 17
Figure 17.(a) The 25th, 50th and 75th percentiles of the Kantorovich metric of the dissipation rate, for the training set (Train) and for different reservoir and latent space sizes.(b) P.d.f. for the true and training data, and predicted by 10 000 neurons CAE-ESNs.

Figure 18 Figure 19
Figure 18.(a) The 25th, 50th and 75th percentiles for the Kantorovich metric as a function of the reconstruction error, (b) medians of the Kantorovich metric as a function of the median of the PH and (c) the 25th, 50th and 75th percentiles of the medians of the Kantorovich metric of (b) as a function of the latent space.The lines in (a,b) are obtained by linear regression.

Figure 20 .
Figure 20.Profile of (a) mean streamwise and (b) fluctuating streamwise and wall-normal velocity profiles.Grey line: true statistics.Red/blue dashed lines: statistics from the CAE-ESN.