From snapshots to manifolds - A tale of shear flows

We propose a novel non-linear manifold learning from snapshot data and demonstrate its superiority over Proper Orthogonal Decomposition (POD) for shedding-dominated shear flows. Key enablers are isometric feature mapping, Isomap (Tenenbaum et al., 2000), as encoder and K-nearest neighbours (KNN) algorithm as decoder. The proposed technique is applied to numerical and experimental datasets including the fluidic pinball, a swirling jet, and the wake behind a couple of tandem cylinders. Analyzing the fluidic pinball, the manifold is able to describe the pitchfork bifurcation and the chaotic regime with only three feature coordinates. These coordinates are linked to vortex-shedding phases and the force coefficients. The manifold coordinates of the swirling jet are comparable to the POD mode amplitudes, yet allow for a more distinct manifold identification which is less sensitive to measurement noise. As similar observation is made for the wake of two tandem cylinders (Raiola et al., 2016). The tandem cylinders are aligned in streamwise distance which corresponds to the transition between the single bluff body and the reattachment regimes of vortex shedding. Isomap unveils these two shedding regimes while the Lissajous plots of first two POD mode amplitudes feature a single circle. The reconstruction error of the manifold model is small compared to the fluctuation level, indicating that the low embedding dimensions contains the coherent structure dynamics. The proposed Isomap-KNN manifold learner is expected to be of large importance in estimation, dynamic modeling and control for large range of configurations with dominant coherent structures.


Introduction
The chaotic nature of turbulent flows and their importance in physical and engineering systems has motivated countless studies attempting to obtain simplified models for control purposes in engineering applications. In particular, unbounded shear flows such as jets and wakes have received unabated interest due to their importance for drag reduction, control of unsteady loads, mixing enhancement etc.
Despite their chaotic nature, such flows are characterized by recurrent flow patterns that are typically referred to as coherent structures. The beauty of the coherent structures has fascinated scientists since Leonardo Da Vinci (Marusic and Broomhall, 2021) and has hinted at the possibility that the flow dynamics can be represented as a system evolving on a low-dimensional attractor. Proper Orthogonal Decomposition (POD, Berkooz et al., 1993), also called Principal Component Analysis in statistics, has received significant attention since it allows to decompose a flow field into orthogonal modes sorted according to their contribution to the variance of the quantity to be analyzed. One of the most exploited advantages of POD in fluid mechanics is their capability to simplify the Navier-Stokes equations into a system of linear differential equations employing Galerkin projections (Noack et al., 2003). Low-order models obtained from POD open a door to a vast space of applications such as flow control (Brunton and Noack, 2015) and also crisp least-order models for bifurcations and interactions of coherent structures (Deng et al., 2019).
A myriad of POD studies hint at low-dimensional manifolds describing turbulent shear flows. In the case of oscillatory flows, two-dimensional manifolds have been identified from laminar two-dimensional cylinder flows (Noack et al., 2003) to experimental turbulent wakes behind finite cylinders at high Reynolds numbers (Bourgeois et al., 2013). These manifolds are the cornerstone of mean-field Galerkin models. Even flows with several frequencies may live on a mean-field manifold (Luchtenburg et al., 2009). The pioneering POD model of Aubry et al. (1988) derives such a manifold for the turbulent boundary layer from the Reynolds equations. For more complex flows, the energy spectrum of POD typically reveals O(10) distinct most energetic eigenvalues associated with physically interpretable modes. This distinct spectrum is followed by a continuous eigenvalue distribution with less interpretable and increasingly fine-scaled modes. The distinct POD modes can be hypothesized as the conductors of a large turbulence orchestra 'slaved' their conductor POD mode amplitudes (Callaham et al., 2022).
While the focus of POD is on obtaining reduced-order models optimal in terms of energy, the field of statistical learning provides a vast amount of tools for dimensionality reduction (Hastie et al., 2009). Multi-dimensional scaling (MDS), for instance, is based on the singular value decomposition of the data distance matrix and allows representing a dataset in a low-dimensional space preserving the distance between the snapshots in the high-dimensional space (Torgerson, 1952;Kruskal, 1964). MDS has been used in fluid mechanics mainly for visualization purposes and effectively captures some hidden features of the flows (Kaiser et al., 2014;Foroozan et al., 2021).
The capability of obtaining low-dimensional representations is tempting to identify embedded manifolds of the flows under study. Manifold learning attempts to identify a topologically closed surface, the manifold, over which the dataset actually resides. In statistical words, the dataset can be described to lie on or near a manifold in a low-dimensional space in which the manifold expresses some basic features of it. In this sense the manifold is in fact the set of relations that connect snapshots to each other. Interestingly, often high-dimensional systems appear to evolve on low-dimensional manifolds, thus simplifying their modelling if the manifold can be identified. These aspects pushed the development of the manifold learning techniques in the last decades. Remarkable examples are locally linear embedding (Roweis and Saul, 2000), isometric mapping (Tenenbaum et al., 2000), and diffusion map (Coifman and Lafon, 2006).
The traditional dimensionality-reduction methods are structured on linear models and thus fail in capturing the manifolds when a non-linear structure is present in the data. Turbulent flows exhibit a non-linear behaviour which motivates the investigation of non-linear models for manifold learning. Tenenbaum et al. (2000) have shown that the dimensionality reduction based on geodetic distances can be a powerful tool in preserving the actual behaviour of non-linear datasets. This technique is referred to as isometric feature mapping, or Isomap. Surprisingly, the application of Isomap in fluid mechanics is minimal. Tauro et al. (2014) successfully employed Isomap to identify manifolds from flow-visualization data while others used it for combustion (Bansal et al., 2011), and design optimization (Franz et al., 2014). Recently Otto and Rowley (2021) discussed the limitation of the linear methods in the case of selection and placement of sensors in a flow field. This manuscript introduces a framework of manifold learning as an encoder for unbounded shear flows with a K−Nearest Neighbors (KNN) decoder. The input snapshots, which can be obtained either from a simulation or an experiment, are encoded using Isomap as the primary tool. The high-dimensional space transforms to a low-dimensional space to identify the hidden embedding manifold of the dataset. In this new space, the manifold is interpreted to unravel the relationship between the manifold low-dimensional characteristics and the main features of the flow dynamics. We can reproduce the snapshots in the high-dimensional space with a KNN decoder using this new, easy-to-understand space and fast computing. This whole encoder-decoder model provides a robust framework to analyze shear flows and then implement applications (such as designing flow control systems) based on it.
Four datasets with different features have been used to investigate the framework's performance. The selected datasets vary from DNS simulations of wake flows to fully-turbulent experimental datasets with measurement noise. The simulation datasets are based on the wake of the fluidic pinball which in recent years has been shown to be a suitable test-bed configuration to study general flow phenomena like bifurcations and flow control (Deng et al., 2019). To study different flow regimes, the results from the simulations at Re = 80 and Re = 130 are reported and discussed, allowing to identify the manifold learning capabilities both in a simpler bifurcation and in a more complex chaotic environment. The first experimental dataset consists of PIV measurements in a highly functional swirling jet configuration. This configuration has a wide application in modern gas turbine combustors and aerodynamically stabilizes lean premixed flames (Lückoff et al., 2017;Lückoff et al., 2021). Both the turbulent regime and the measurement noise challenge the encoderdecoder framework. The last tested dataset relates to the flow in the wake of two tandem cylinders. Tandem cylinders are characterized by several working regimes depending on the streamwise cylinder distance. The proposed dataset is at the intersection of two regimes, however by using POD in a previous work, Raiola et al. (2016) could not unveil a regime switch.
The paper is organized as follows: in §2 after the introduction, a detailed description of the developed framework is provided; the datasets and flow configurations employed are described in §3; the most important outcomes from the analysis using the encoder-decoder framework are presented in §4, and finally, the conclusion and the possible future steps have been put in §5. Two appendices describe a criterion for the choice of Isomap parameters and discuss possible criteria for the definition of the manifold residual variance.

Isomap -KNN manifold learner
In this work, a manifold learner methodology for fluid data is developed. The proposed approach consists of three steps. First, data is gathered either from simulations or experiments. Second, the so-obtained data is embedded into a low-dimensional space using Isometric feature Mapping (Isomap, Tenenbaum et al., 2000). This encoding part, which is fully data-driven, is carried out with the aim of revealing a hidden manifold that allows us to relate the new coordinates to physical features of the flow such as, for instance, force coefficients. Finally, a decoding part that enables return to the high-dimensional space and reconstructs the original flow field is developed. The proposed decoder is based on K−nearest neighbours (KNN) and linear interpolation. Figure 1 shows the three stages of our procedure, which are described in detail in what follows.
Let us consider that N flow field snapshots have been observed, either from an experimental setting or a simulation. Then, each snapshot is an observation (point) in the high-dimensional space R P , where each dimension (feature) contains information about a point of the field. Let X ∈ R N ×P be the data matrix containing the stated information and x i ∈ R P be each of its rows, i.e. the flow fields for i = 1, . . . , N. The dataset in X is complex by nature and being able to extract a meaningful small number of coordinates that capture the main characteristics of the flow is challenging. Isomap is a nonlinear dimensionality reduction technique that finds a low-dimensional embedding of the data points that best preserve the geodesic distances measured in the high-dimensional input space. In order to estimate these geodesic distances, the shortest paths in a graph connecting neighbouring points are employed. These distances are then used as an input in classical MDS (Torgerson, 1952) to construct the low-dimensional embedding so that the Euclidean pairwise distances resemble those in the neighbouring graph. Therefore, the Isomap algorithm runs as follows. First, the Euclidean distances d X (i, j) between flow fields x i and x j are computed for all i, j = 1, . . . , N . Second, for i = 1, . . . , N , N k X (i), is defined as the set of the k closest observations to x i . Based on these neighborhoods, the neighboring graph G is defined over these data points such that two nodes (flow fields) i and j are connected by an edge of weight d X (i, j) if they are neighbors, i.e. there is an edge between i and j if x j ∈ N X (i). Observe that G approximates the highdimensional manifold containing the observed data. Third, the shortest paths between all pair of vertices in G are computed, yielding d G (i, j) for all i, j = 1, . . . , N , using Floyd's algorithm (Floyd, 1962). Let D G be the matrix containing these shortest path distances. Finally, obtain the low-dimensional embedding Γ ∈ R N ×p , p << P , using MDS. The new coordinates for the N samples are then found so that their pairwise Euclidean distance resembles d G (i, j). This is equivalent to finding Γ which minimizes the cost function H is the Gram matrix in the input space, being H = I N − 1 N I N the centering matrix, I N the identity matrix of dimension N , the Hadamard (element-wise) product and · F the Frobenius norm. Fixing a dimension p for the low-dimensional embedding, the value of Γ minimizing the quantity in (1) is the matrix made up of the p eigenvectors γ 1 , . . . , γ p corresponding to the p largest (positive) eigenvalues of the matrix Λ arising from the eigendecomposition of B, namely B = VΛV and Γ = V p .
The aforementioned Isomap algorithm admits the choice of other norms different from the Euclidean, different ways of identifying the neighbors to construct G, other shortest path algorithms or a non-classical approach to MDS. However, the choices made in our methodology are motivated by the implemented version of Isomap in the RDRToolbox in the R software (R Core Team, 2020; Bartenhagen, 2021), which has been used to carry out our analyses.
In order to assess the performance of Isomap, Tenenbaum et al. (2000) propose using the definition of residual variance as in (2). Let D Γ be the matrix of Euclidean distances between each pair of points in the low-dimensional embedding. Then, the residual variance is defined as one minus the squared correlation coefficient between the vectorization of the distance matrices D G and D Γ , yielding where R 2 refers to the squared correlation coefficient and vec is the vectorization operator.
Observe that the result in (2) is a number between 0 and 1 which accounts for the amount of information that remains unexplained by the low-dimensional embedding of the original data. Therefore, the lower the value in (2) the better. For a discussion about the definition of residual variance to assess the performance of Isomap against other dimensionality reduction methods such as POD we refer the reader to Appendix B.
In order to provide a decoder to create a correspondence between Isomap coordinates γ 1 , . . . , γ p and the ones in the high-dimensional space, namely R P , we employ a purely datadriven approach. Any flow field x i ∈ R P has its low-dimensional counterpart y i ∈ R p , i = 1, . . . , N. Then, let f : R p −→ R P be the unknown mapping which transforms the flow fields in the low-dimensional space onto the high-dimensional ones. To reconstruct the flow field for any y ∈ R p we assume that its K−Nearest Neighbors y (1) , . . . , y (K) and their high-dimensional counterparts, namely x (1) , . . . , x (K) , are identified. Therefore, the reconstruction (or decoding) of y, denoted as x, can be obtained as a first order Taylor expansion starting from the nearest neighbor to be mapped back to the original space, i.e.
x (1) , as where the gradient tensor in y (1) , namely ∇f (y (1) ) = ∂f ∂γ 1 (y (1) ), . . . , ∂f ∂γp (y (1) ) is estimated assuming an orthogonal projection of the K −1 directions provided by the K−Nearest Neighbors in R P to those in R p . This is: which yields ∇f (y (1) ) = (∆Y ∆Y) −1 ∆Y ∆X if least squares minimization is used to approximate it, and ∆X and ∆Y are the left-hand side and the second term in the righthand side of (4), respectively.

Datasets
In this section we describe the datasets that have been used to test our methodology. Three configurations are considered, which yield different flow fields and regimes under both experimental and simulation setups.

Fluidic pinball dataset
The fluidic pinball is a flow configuration consisting of three rotatable cylinders of equal diameter D whose axes are located in the vertices of an equilateral triangle, as sketched in figure 2a. The triangle set of three cylinders has a center-to-center side length 3D/2 and is immersed in a viscous incompressible flow with a uniform upstream velocity U ∞ . The Reynolds number for this setup is defined as Re = U ∞ D/ν, where ν is the kinematic viscosity of the fluid. The wake flow undergoes a set of interesting transitions at different values of the Reynolds number, thus allowing to explore reduced-order modeling and flow control strategies in a wide range of scenarios. In the recently published literature, the fluidic pinball has been used as a benchmark configuration for testing the mean-field modeling (Deng et al., 2019), cluster-based network modeling (Deng et al., 2022), and machine learning control (Cornejo Maceda et al., 2021;Raibaudo et al., 2020). The numerical results of the fluidic pinball are investigated by employing a software developed to study multiple-input multiple-output flow control by Noack and Morzyński (2017). Direct numerical simulation of the incompressible Navier-Stokes equations is used to compute the two-dimensional viscous wake behind the pinball configuration. As shown in figure 3(a), the computational domain [−6D, 20D] × [−6D, 6D], excluding the interior of the cylinders, is described by a Cartesian coordinate system whose origin is located in the middle of the top and bottom cylinders. As the rotation of the cylinders is not considered in this study, a no-slip condition on the cylinders and the far-field velocity U ∞ are considered as the boundary conditions. The unsteady Navier-Stokes solver is based on fully implicit time integration using an iterative Newton-Raphson approach and finite-element method (FEM) discretization on an irregular grid structure with 4225 triangles and 8633 vertices (Deng et al., 2019). The steady solution, used as the initial condition at each corresponding Reynolds number, is calculated by the solver for the steady Navier-Stokes equations in the same way.
With increasing Reynolds number, the flow experiences a transition from a laminar flow to periodic vortex shedding and finally to chaos (Deng et al., 2019). Five different regimes have been identified, as summarized in figure 3c. The transition from a steady symmetric flow to a periodic symmetric vortex shedding occurs at Re 1 ≈ 18 (following a Hopf bifurcation, Andronov et al., 1971;Strogatz et al., 1994). The symmetry of the vortex shedding vanishes at Re 2 ≈ 68 (pitchfork bifurcation, Strogatz et al., 1994) thus entering a periodic asymmetric regime. A secondary frequency appears with a higher Reynolds number, and the flow experiences another transition to a quasi-periodic asymmetric regime at Re 3 ≈ 104 (Neimark-Säcker bifurcation, Kuznetsov and Sacker, 2008). Finally, at Re ≈ 115, the flow enters into a chaotic symmetric regime.
In this study, we focused on two different flow states at the selected Re, representative of the two most complex flow states identified by Deng et al. (2022). At Re = 80 for the periodic asymmetric regime, there exist a total of six invariant sets in the system state space: three unstable stable solutions, one unstable limit cycle, and two stable limit cycles, resulting from the primary Hopf bifurcation and the secondary pitchfork bifurcation. The dataset at Re = 80 includes the snapshots from the simulations starting from the symmetric steady solution and its mirror-conjugated snapshots, as well as the snapshots from the simulations starting from the two mirror-conjugated asymmetric steady solutions. In this case, it is able to ensure that the dataset contains the complete manifold with six invariant sets. At Re = 130 for the chaotic symmetric regime, three unstable stable solutions and one chaotic attracting set can be found. The dataset considered at Re = 130 only contains the snapshots from the simulation starting from the symmetric steady solution, since we are interested in the transition from the unstable invariant set to the chaotic attracting set in this case. Therefore, simulations are performed at Re equal to 80 and 130, thus covering the typical complex wake dynamics regime with multiple invariant sets and chaotic attracting set.

Swirling jet dataset
Swirling jets have a wide variety of applications in modern gas turbine combustors and aerodynamically stabilize lean premixed flames. In present work we analyze an experimental dataset obtained with stereoscopic particle image velocimetry by Lückoff et al. (2021). Figure 2c reports a schematic of the swirling nozzle and of the measurement domain employed in the work by Lückoff et al. (2021). This configuration consists of a feeding line which provides the mass flow rate to a jet issued from a swirling nozzle. The swirling flow is produced using a radial swirl generator and the swirl number Sw, defined as the ratio of the axial flux of tangential momentum to the axial flux of axial momentum, can range between 0 and 1.5. The nozzle exit has a diameter of D = 55mm and has a centerbody at the center with diameter D CB = 35mm thus the hydraulic diameter of the mixing tube has a diameter of D h = 20mm. The Reynolds number is defined using the hydraulic diameter and the experimental facility can provide jets with a Reynolds number in the range [13000,32000]. Present dataset has been generated for Re = 20000 and Sw = 0.7. The measurement domain is located at the nozzle exit to analyze the flow in the combustion chamber. The combustion chamber is a cylinder with a inner diameter of D CCh = 200mm and length of L CCh = 300mm and is made of quartz glass to enable flow measurement using time-resolved particle image velocimetry (PIV). For the present dataset, 2183 snapshots were captured and evaluated using a commercial PIV software. The raw data were filtered for removing the outliers. The velocity at the actuators outlet has been measured using hot-wire sensors. The reader can refer to the work by Lückoff et al. (2021) for further details on the measurement setup and PIV image processing.
The main coherent structure in this kind of flow is a helical structure known as the precessing vortex core (PVC, Syred, 2006), which is generated due to a global self-excited instability Müller et al. (2020). Although the origin of the PVC is well understood, its impact on combustion performance, especially flame stability, is still a matter of study. The existence of a dominant coherent structure in this flow encourages the idea that manifold learning can be successfully implemented to study the behavior of the PVC under different conditions.

Tandem cylinders dataset
The third dataset employed in this work consists of flow field measurements in the wake of tandem cylinders near a wall. The data refer to the work by Raiola et al. (2016); the experimental configuration is summarized here for completeness. As sketched in figure 2b, this configuration consists of two equal cylinders located in a cross-flow which are separated by a ratio of L/D denoted by longitudinal pitch ratio (with L being the longitudinal distance between two cylinders centres and D the diameter of the cylinders equal to 32mm). Both cylinders are placed at a similar distance to a wall with a ratio of W/D denoted as wall gap ratio (with W being the distance of the cylinders from the wall). The wind tunnel velocity U ∞ is set constant and equal to 2.3m/s in order to achieve a Reynolds (based on cylinder diameter) number of 4900 (Raiola et al., 2016). In this study, the longitudinal pitch ratio (L/D) is set to 1.5 and the wall gap ratio (W/D) is set to 3.
While Raiola et al. (2016) reported that a gap ratio W/D = 3 is sufficient to have negligible wall-interaction effects, three main flow behaviours in the wake of tandem cylinders can be identified based on the Reynolds number and the distances between the two cylinders. Zdravkovich (1997) classified the flow around tandem cylinders with identical diameters into three major regimes (extended body, reattachment, and co-shedding), depending on the longitudinal pitch ratio. At low longitudinal distances (L/D < 1.5), the vortex shedding for the upstream cylinder is suppressed, and the system act as a unified bluff-body, which is categorized as the extended body regime or single bluff-body regime. By increasing the longitudinal distance(1.5 < L/D < 4), the flow starts to show some more complex behaviour which mainly can be characterized by the reattachment of the separated free shear layers from the upstream cylinder on the surface of the downstream cylinder. This regime is referred to as 'reattachment' regime. Furthermore, by increasing the longitudinal distances, both cylinders start to have their wake with typical characteristics of a Kármán street. This regime is often defined 'co-shedding' regime. Alam et al. (2018) report the existence of a transitional L/D range between the reattachment and co-shedding regimes also referred to as 'critical' or 'bistable' flow spacing. While it can be argued that a similar bistable regime should occur also between the extended body and the reattachment regimes, Raiola et al. (2016) did not identify such feature for L/D = 1.5. This dataset appears to be especially suited to discover whether non-linear manifold learning could unveil such a kind of bistable regime.

Results
This section presents and discusses the performance of the proposed encoder-decoder algorithm. The first subsection is dedicated to the performance of the encoder part. We discuss its strengths in unravelling the physical characteristics of the flows distilling the manifolds. In the second subsection, the decoder's ability to reconstruct the original flow fields from the obtained low-dimensional coordinates is analyzed.

Encoder's capabilities
In this section, the embedding manifolds obtained from Isomap are presented for the datasets described in section 3. A discussion about the choice of the number of neighbours k to build the neighbouring graph G is included in Appendix A.
In order to compute the dimensionality of the datasets at hand, the residual variance as defined in (2) is obtained for each number of dimensions. The choice of p is then made based on the elbow method following Tenenbaum et al. (2000). The dimension beyond which the residual variance experiences negligible variation can be identified as the true dimensionality of the dataset. This method is also widely used the compute the proper number of the clusters in the clustering techniques (Kaufman and Rousseeuw, 2009). The 'true' dimensionality is the proper place in a trade-off between the simplicity of the embedded manifold and the loss of information due to truncation.
An example of residual variance as a function of the number of selected dimensions is reported in the figure 4a for the case of the wake of the pinball at Re = 80 and Re = 130. For the lowest Reynolds number, it can be seen that 3 dimensions are sufficient to describe the bulk of the variance since the resulting residual variance for dimensions higher than 3 remains approximately the same. Note that due to the definition of the residual variance, residual variance might be not monotonically decreasing with increasing number of coordinates; for more details the reader is referred to appendix B. On the other hand, truncating at three dimensions appears to still be acceptable although induces a larger error in terms of explained variance.
It must be remarked that, from now on, we are limited in plotting the projection of the manifold on the first 3 dimensions; nonetheless, the number of dimensions needed to represent accurately the manifold might be larger, depending on the complexity of the dynamics. Figure 4a shows the residual variances of the pinball configuration dataset for Re = 80 and p = 1, . . . , 10. For this configuration, the chosen value of k is equal to 8; for more details about the choice of k the reader is referred to appendix A The elbow is attained for p = 3 which is considered to be the true dimensionality of this problem. The residual variance is approximately the same for p > 3, thus it can be argued that the manifold of input data, embedded in a higher dimensional state space, has three key dimensions.

Fluidic pinball
The same procedure has been done for Re = 130 case, employing k = 12. The residual variance decreases monotonically for increasing number of dimensions and is already below 20% after three-dimensions (p = 3) (black curve in figure 4a). However, the residual variance value at p = 3 increases by increasing the Reynolds number to higher values and entering the more chaotic regimes. This indicates that in the chaotic regime the 'true' dimensionality is higher due to the arising of a more complex dynamics. Therefore, it is reasonable to expect that, for increasing Reynolds number, the number of dimensions needed to explain the bulk of the variance should increase. This behaviour is not surprising, and it is observed in virtually all dimensionality reduction techniques.
In terms of manifold shape, both for Re = 80 and 130 data lies on a paraboloid with the first two coordinates (γ 1 and γ 2 ) being representative of the periodic vortex shedding and the third coordinate being representative of a shift-mode characteristic of the transient dynamics from the onset of vortex shedding to the periodic von Kármán wake, analogously to what found for the cylinder wake by Noack et al. (2003). A similar paraboloid shape is reported for the fluidic pinball at Re = 30 by Deng et al. (2019). It is remarkable that, when analyzing all the solutions at Re = 80, the manifold correctly identifies the first unstable limit cycle for the symmetric unstable solution and is able to identify the differences between the asymmetric upward and downward limit cycle. As well for Re = 130, the chaotic nature of the data shows a less smooth manifold, which is still possible to visualize the characteristic paraboloid shape. When employing POD and plotting the first three temporal modes, similar shapes could be obtained although less clear, expecially at Re = 130 as evident from the comparison of figures 4 (c) and (d).
Although the dimensions identified by Isomap do not necessarily have a physical meaning, it is a useful exercise to establish whether there exists some correlation between such coordinates and relevant flow quantities. Figure 5(a) shows a clear correlation between the drag coefficient and the coordinate γ 3 . This correlation is reasonably expected, since we observed that the third coordinate is representative of the shift mode. Some pairs of coordinates are related to higher-order harmonics of the flow, as is the case for γ 1 −γ 2 , γ 4 −γ 5 , and γ 6 − γ 7 . Interestingly enough, if we extend the analysis to higher-order coordinates, it is possible to identify a high degree of correlation between the γ 8 and the lift coefficient C L , as shown in figure 5(b). Although these interpretations are case-sensitive and can be affected by changes in the flow configuration and starting conditions in the Isomap algorithm, we have identified situations in which some coordinates in the low-dimensional space could be related to the main flow features. While it is outside of the scope to assess the interpretability of the coordinates identified by Isomap, we spotlight the possibility of existence of such kind of correlations. This could be a powerful catalyst for the extension of the encoder-decoder framework presented here, and it will be object of future study.

Swirling jet
In the more challenging experimental case of the swirling jet, the same encoder procedure based on Isomap using k = 8 has been carried out. POD and Isomap performance as encoders is compared. Figure 6(a) shows the residual variances of Isomap (black dots) and POD (green triangles), which is measured as stated in Appendix B, for different numbers of dimensions. In this case, the values of residual variance are significantly larger compared to those in the simulation cases studied before most likely due to the turbulent nature of the flow and possibly due to measurement noise in the experimental data. It is worth noting that the residual variances of Isomap are lower than those of POD for all the dimensions depicted, thus indicating that Isomap is a better manifold learner in this case to preserve the geometry of high-dimensional dataset. To investigate this advantage, we can compare the resulting embedding in both cases for three dimensions, which accounts for a reasonable amount of residual variance. Figures 6(b) and 6(d) show the encoded data by Isomap, whereas figures 6(c) and 6(e) illustrate the POD results, respectively. Although for both methods the general shape of the embedded manifold is similar to a hollow cylinder, the one obtained by Isomap shows a more clearly defined shape. Furthermore, the diameter of the hollow cylinder in POD is smaller and less circular with more spread points. Thus, the Isomap encoding provides a more helpful base to interpret the manifold and eventually relate the low dimensional coordinates to the physical features of the flow.

Tandem cylinders
Regarding the tandem cylinder dataset, Figure 7(a) shows that an appropriate number of dimensions in terms of the residual variance for both Isomap (black dots) and POD (green triangles) is two. Furthermore, as in the previous case, Isomap residual variances outperforms those of POD. As in the swirling-jet case discussed above, the Isomap results in figure 7(b) show a clearer manifold compared to the one obtained by POD (figure 7(c)). Furthermore, it also shows some separate groups of snapshots related to some physical features of the flow while the manifold resulted from POD ultimately fails to capture this behavior of the system. Setting the number of groups to three, the results of the classification in the polar coordinates are shown by means of different colors in figure 7(b) The flow fields of each group close to γ 1 = 0 line are plotted in figures 7(d)-(f) from the outer group to the inner one, as the prototype of each group. Investigating each group by looking at these prototypes, which are highlighted as red dots in figure 7(b), reveals that as we move outward from the center (γ 1 = γ 2 = 0), the distance between two consecutive vortices in the wake decreases, and the vortices appear less intense. In other words, moving from the groups from the outer part to the inner one it is found a transition between two different vortex shedding regimes which could correspond to the bluff-body regime and the reattachment regime of tandem cylinders. As described in §3.3, previous studies suggest that this configuration of the flow lies on the bluff-body regime, as most of the snapshots in low-dimensional space classify in the outer and middle groups, and the results of our encoding procedure are consistent with previous studies done by using POD (Raiola et al., 2016). However, Isomap can capture that even in this configuration, some behavior of the next regime can coexist with the dominant one, suggesting that we cannot define a specific number for L/D as the classifier of the flow regimes, and by increasing L/D the flow smoothly changes its behaviors.

Decoder's performance
This section assesses the quality of the decoder approach described in §2. The normalized mean squared error (NMSE) in a test set of observations T is computed as wherex is the decoder's reconstruction of the flow field x ∈ R P . Table 1 shows the average NMSE obtained (last column) for the different cases studied, namely Pinball for Re = 80 and Re = 130, the swirling jet flow and the wake of tandem cylinders. The sample data has been split into 70% training and 30% testing. The number of neighbors considered in Isomap (k) and the number of neighbors considered in the decoding stage (K) are reported in the sixth and seventh columns, respectively. The first column in Table 1 depicts the case study, the second the value of Re, the third the number of samples, fourth the number of features in the high-dimensional space and fifth the the number of dimensions considered for Isomap. Observe that the NMSEs reported in Table 1 are actually representation errors, i.e. the error incurred when reconstructing the original datasets by means of our encoder-decoder methodology. These representation errors show that the decoder has an excellent performance on the simulation datasets and has a reasonably good performance on both noisy experimental datasets. Figure 8 compares reconstructed and actual snapshots for a random point in the test subset. As expected from the reported errors, the differences between the two snapshots in the pinball case are hardly noticeable. In the case of the tandem cylinders, although there are some noticeable differences between reconstructed and actual snapshots, the general behavior of the flow is well preserved in the reconstructed snapshot.
A comparison using as input low-dimensional coordinate resulting from Isomap and POD has been carried out to investigate further the decoder performance. The NMSE of the Isomap decoder was calculated using the same 70% training dataset. The reconstructon error with POD modes was calculated as the reconstruction error employing the first two or three modes, depending on the true dimensionality p of the dataset. Results show a clear superiority of the KNN decoder in all the cases except for the tandem-cylinders one. In this case, the performance of the two methods is similar, probably due to the rather small size of the input experimental dataset.

Conclusions
In this paper we have developed an encoder-decoder framework based on manifold learning techniques to tackle the problem of understanding shear flows. The proposed manifold learner is Isomap and it is coupled with a KNN decoder. We show that flows which can be described with a limited set of coherent structures are suitable candidates for manifold learning. In the applications proposed in this manuscript we have chosen phenomena whose snapshots correctly sampled all the events defining the system "clock", i.e. jet and wake flows in which the measurement domain samples the evolution of the main vortical features. We have shown that the manifolds unraveled using the Isomap encoder are representative of meaningful physical quantities and are suitable for a reduced-order modelling of shear flows. The pure physics-uninformed results in the fluidic pinball case also have a correlation with physical properties like vortex-shedding phases or the force coefficients and open ground to use this technique to model wake flows and design flow-control systems. We have shown that when handling experimental cases with complex behavior, despite the presence of acceptable measurement noise, the new encoder-decoder tool outperforms the classical dimensionality-reduction techniques like POD in terms of clarity and interpretability of the identified manifolds, and it is less sensitive to experimental noise. In such cases, not only the identified manifolds are more reliable, but they also distill some physical information that POD is not able to catch, including the coexistence of the two shedding regimes and transition between them in the case of the wake of two tandem cylinders.
Finally, the developed decoder proved to have outstanding capabilities in reconstructing the original flow fields from the identified manifold, allowing for the instantaneous identification of the flow state with applications to closed-loop flow control.
The proposed manifold learner may significantly reduce the dimension of the state space as compared to POD expansions with similar representation error, particularly for transient flows. Hence, the autoencoder methodology may be used for full-state estimation and control design. For both tasks, every unnecessary coordinate acts as a noise amplifier and constitutes a danger for the system to get 'off track'. The authors actively pursue this direction.  Samko et al. (2006) Regarding the upper bound, k max , pick the largest value of k so that the following equation holds: where E is the number of edges and N is the number of nodes in G. Once the valid range of k is found, Samko et al. (2006) propose to pick k ∈ [k min , k max ] so that the residual variance is minimum. In all our studies in section 4 using any value in the valid range selected this way results in a low residual variance. We refer the reader to Figure 10 for an illustration of the impact of the choice of k on the residual variance for the pinball dataset for Re = 130.
B Different approaches to define the residual variance.
There exist numerous approaches to assess the fits provided by different dimensionality reduction techniques. Regarding Isomap, the so-called residual variance as stated in (2) is the common choice.
Regarding POD, the classical definition of residual to assess its performance is given where var states for the variance and z j are the Principal Components, j = 1, . . . , P.
The ways of measuring the fits in Isomap and POD given by (2) and (7) are not comparable. In order to evaluate Isomap and POD up to the same standard, Tenenbaum et al. (2000) proposed to replace the geodesic distances approximated by D G in (2) by the pairwise Euclidean distances in the input high-dimensional space, D X , where the element in row i and column j in D X is x i − x j 2 , i, j = 1, . . . , n. Then a related definition of residual variance for POD is, 1 − R 2 (vec(D X ), vec(D Z )), where D Z is the matrix of Euclidean distances between the retained Principal Components z 1 , . . . , z d . Nevertheless, the definition of residual variance in (8) does not capture the ability of POD of reproducing the geodesic distances in the high-dimensional space. To do so, a different but related proposal to measure the residual variance for POD combines (2) and (8) as 1 − R 2 (vec(D G ), vec(D Z )).
Finally, we point out that the residual variance as defined in (2), (8) and (9) may not decrease strictly when the number of dimensions increases. In other words, the correlation between a distance matrix obtained from a set of points in R d , D d , and another distance matrix, D, does not necessarily decrease if the distance between the embedded points in R d−1 by dropping one of the dimensions, D d−1 , is considered instead. Figure 11 shows the values for the different definitions of residual variances presented in this work in each of the case studies in section 4 for different numbers of dimensions. The black line corresponds to (2), the red one is (7), blue (8) and green (9). For all the dimensions considered, the residual variance of Isomap surpasses POD using definitions (2) and (8).