Sparsity-promoting algorithms for the discovery of informative Koopman-invariant subspaces

Abstract Koopman decomposition is a nonlinear generalization of eigen-decomposition, and is being increasingly utilized in the analysis of spatio-temporal dynamics. Well-known techniques such as the dynamic mode decomposition (DMD) and its linear variants provide approximations to the Koopman operator, and have been applied extensively in many fluid dynamic problems. Despite being endowed with a richer dictionary of nonlinear observables, nonlinear variants of the DMD, such as extended/kernel dynamic mode decomposition (EDMD/KDMD), are seldom applied to large-scale problems, primarily due to the difficulty of discerning the Koopman-invariant subspace from thousands of resulting Koopman eigenmodes. To address this issue, we propose a framework based on multi-task feature learning to extract the most informative Koopman-invariant subspace by removing redundant and spurious Koopman triplets. In particular, we develop a pruning procedure that penalizes departure from linear evolution. These algorithms can be viewed as sparsity-promoting extensions of EDMD/KDMD. Furthermore, we extend KDMD to a continuous-time setting and show a relationship between the present algorithm, sparsity-promoting DMD and an empirical criterion from the viewpoint of non-convex optimization. The effectiveness of our algorithm is demonstrated on examples ranging from simple dynamical systems to two-dimensional cylinder wake flows at different Reynolds numbers and a three-dimensional turbulent ship-airwake flow. The latter two problems are designed such that very strong nonlinear transients are present, thus requiring an accurate approximation of the Koopman operator. Underlying physical mechanisms are analysed, with an emphasis on characterizing transient dynamics. The results are compared with existing theoretical expositions and numerical approximations.


Introduction
Complex unsteady flow phenomena such as turbulence (Pope 2001), flow instability (Drazin & Reid 2004; Lietz, Johnsen & Kushner 2017) and fluid-structure interactions (Dowell & Hall 2001) are prevalent in many physical systems. To analyse and understand such phenomena, it is useful to extract coherent modes associated with important dynamical mechanisms and track their evolution. Koopman operator theory (Budišić, Mohr & Mezić 2012) offers an elegant framework to decompose spatio-temporal fields associated with nonlinear dynamics into a linear combination of time-evolving modes ordered by frequency and growth rate (Rowley et al. 2009).
Consider a general continuous nonlinear dynamical system ẋ = F(x), where the system state x evolves on a manifold M ⊂ R^N. Here F : M → TM is a vector-valued smooth function and TM is the tangent bundle, i.e. for every p ∈ M, F(p) ∈ T_pM. The aforementioned task of decomposition is equivalent to finding a set of L observables associated with the system, {g_i}_{i=1}^{L} (g_i : M → C), that evolve linearly in time while spanning the system state x. The significance of the above statement is that it represents a global linearization of the nonlinear dynamical system (Budišić et al. 2012; Brunton et al. 2016a).
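As a concrete, if elementary, illustration of such a finite set of linearly evolving observables, consider the well-known example from Brunton et al. (2016a): for ẋ₁ = μx₁, ẋ₂ = λ(x₂ − x₁²), the observables (x₁, x₂, x₁²) span a Koopman-invariant subspace containing the state. A minimal numerical sketch (the parameter values and RK4 integrator below are our own illustrative choices):

```python
import numpy as np

# For dx1/dt = mu*x1, dx2/dt = lam*(x2 - x1^2), the observables
# g = (x1, x2, x1^2) evolve linearly: dg/dt = K_c g (Brunton et al. 2016a).
mu, lam = -0.1, -1.0
K_c = np.array([[mu, 0.0, 0.0],
                [0.0, lam, -lam],
                [0.0, 0.0, 2 * mu]])   # d(x1^2)/dt = 2*mu*x1^2

def f(x):  # the nonlinear vector field
    return np.array([mu * x[0], lam * (x[1] - x[0] ** 2)])

def rk4(rhs, y, dt, n):
    # classical fourth-order Runge-Kutta time stepping
    for _ in range(n):
        k1 = rhs(y); k2 = rhs(y + 0.5 * dt * k1)
        k3 = rhs(y + 0.5 * dt * k2); k4 = rhs(y + dt * k3)
        y = y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return y

x0 = np.array([1.0, 0.5])
x_T = rk4(f, x0, 1e-3, 2000)                  # nonlinear evolution to T = 2
g0 = np.array([x0[0], x0[1], x0[0] ** 2])
g_T = rk4(lambda g: K_c @ g, g0, 1e-3, 2000)  # linear evolution of observables
print(np.allclose(x_T, g_T[:2], atol=1e-8))   # prints True
```

The first two components of the linearly evolved observable vector reproduce the nonlinear state trajectory, which is exactly the global linearization alluded to above.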
Formally, this idea can be traced back to Koopman operator theory for Hamiltonian dynamical systems, introduced by Koopman (1931) to study the evolution of observables g on L²(M), the vector space of square-integrable functions defined on the manifold M. More generally, for a certain vector space of observable functions F defined on the manifold M and t > 0, the Koopman semigroup (parameterized by t) is the set {K_t}_{t∈R⁺}, K_t : F → F, that governs the dynamics of observables in the form K_t g(x(0)) ≜ g(S(t, x(0))) = g(x(t)), where S(t, ·) is the flow map of the dynamical system (Mezić 2013). (Note that, for a discrete dynamical system resulting from the discretization (with time step Δt) of a continuous dynamical system, the corresponding discrete-time Koopman operator is an element of this semigroup: K_Δt.) Here Kg ≜ lim_{t→0}(K_t g − g)/t is referred to as the continuous-time Koopman operator, i.e. the Koopman generator. While the Koopman operator is linear over the space of observables, F is most often infinite dimensional, e.g. L²(M), which makes the approximation of the Koopman operator a difficult problem. Throughout this work, we assume that F = L²(M). Readers interested in a detailed discussion of the choice of F are referred to Bruce, Zeidan & Bernstein (2019).
A special subspace of F, referred to as a minimal Koopman-invariant subspace G, has the following property: for any φ ∈ G and any t ∈ R⁺, K_t φ ∈ G, and each component of the state x belongs to G. Existing techniques such as the extended dynamic mode decomposition (Williams, Kevrekidis & Rowley 2015) are capable of approximating Koopman operators, but typically yield high-dimensional spaces (the L-dimensional space in figure 1). In this work we are interested in accurately approximating such a subspace - but with the minimal possible dimension and Koopman invariance - using learning techniques, as illustrated in figure 1. This can yield useful coordinates for a multitude of applications including modal analysis and control (Arbabi, Korda & Mezic 2018).
Figure 1. A set of nonlinear mappings Ψ maps the nonlinear dynamics of the state x(t) on M to linear dynamics on a hyperplane of minimal dimension.

In general, physical systems governed by partial differential equations, e.g. fluid dynamics, are infinite dimensional. From a numerical viewpoint, the number of degrees of freedom can be related to the spatial discretization (for example, the number of grid points). Although a finite-dimensional manifold can be extracted (Holmes et al. 2012), e.g. O(10)-O(10³) dimensions via proper orthogonal decomposition (POD), finding a Koopman-invariant subspace on such manifolds is still challenging. Currently, the most popular method to approximate the Koopman operator is the dynamic mode decomposition (DMD) (Schmid 2010), mainly for two reasons. First, it is straightforward and computationally efficient compared with nonlinear counterparts such as extended DMD (EDMD) (Williams et al. 2015) and kernel DMD (KDMD) (Williams, Rowley & Kevrekidis 2014). Second, the essence of DMD is to decompose a spatio-temporal field into several temporally growing/decaying travelling/stationary harmonic waves, which are prevalent in fluid mechanics. However, the accuracy of DMD is limited by the assumption that the Koopman-invariant subspace lies in the space spanned by snapshots of the state x. Thus, DMD is mainly used to identify and visualize coherent structures. Indeed, DMD can be interpreted as an L² projection of the action of the Koopman operator on the linear space spanned by snapshots of the system state (Korda & Mezić 2018b). To overcome the above limitations, it is natural to augment the observable space with either the history of the state (Arbabi & Mezić 2017a; Brunton et al. 2017; Le Clainche & Vega 2017; Kamb et al. 2018) or nonlinear observables of the state (Williams et al. 2014, 2015).
Time-delay embedding can be very useful in reduced-order modelling of systems for which sparse measurements can be easily obtained, assuming the inputs and outputs are not high dimensional (Korda & Mezić 2018a). Although time-delay embedding is simple to implement and has strong connections to Takens' embedding (Kamb et al. 2018;Pan & Duraisamy 2019), the main practical issue arises in reduced-order modelling of high-fidelity simulations in a predictive setting due to the requirement of a large number of snapshots of the full-order model. Furthermore, if one is only interested in the post-transient dynamics of the system state on an attractor, linear observables with time delays are sufficient to extract an informative Koopman-invariant subspace (Mezić 2005;Arbabi & Mezić 2017a,b;Brunton et al. 2017;Röjsel 2017;Pan & Duraisamy 2019). However, if one is interested in the strongly nonlinear transient dynamics leading to an attractor or reduced-order modelling for high-fidelity numerical simulations (Carlberg et al. 2013;Huang, Duraisamy & Merkle 2018;Parish, Wentland & Duraisamy 2020;Xu, Huang & Duraisamy 2020), time-delay embedding may become less appropriate as several delay snapshots of the full-order model are required to initialize the model. In that case, nonlinear observables may be more appropriate.
Driven by the interest in modal analysis and control of transient flow phenomena, we consider augmentation of the observable space with nonlinear functions of the state, e.g. EDMD (Williams et al. 2015)/KDMD (Williams et al. 2014). Although it has been reported that KDMD allows for a set of more interpretable Koopman eigenvalues (Williams et al. 2014) and better accuracy (Röjsel 2017), issues such as mode selection, spurious modes (Zhang et al. 2017) and the choice of dictionary/kernel in EDMD/KDMD remain open. In fact, the choice of kernel type and hyperparameters can significantly affect the resulting eigenmodes, the distribution of eigenvalues (Kutz et al. 2016) and the accuracy of predictions.
Searching for an accurate and informative Koopman-invariant subspace has long been a pursuit in the DMD community. Rowley et al. (2009) and Schmid, Violato & Scarano (2012) considered selecting dominant DMD modes in order of their amplitudes. However, following such a criterion (Tu et al. 2013; Kou & Zhang 2017) may result in the selection of irrelevant modes that have large amplitudes but decay rapidly. As a result, Tu et al. (2013) considered weighting the loss term by the magnitude of the eigenvalues to penalize the retention of fast-decaying modes. Sparsity-promoting DMD (referred to as 'spDMD' throughout the paper), developed by Jovanović, Schmid & Nichols (2014), recasts mode selection in DMD as an optimization problem with an ℓ1 penalty. With a preference for stable modes over fast-decaying ones, Tissot et al. (2014) proposed a simpler criterion based on time-averaged-eigenvalue-weighted amplitude. This was followed by Kou & Zhang (2017), who used a similar criterion but computed the 'energy' of each mode, yielding performance similar to spDMD at a lower computational cost. Based on the orthonormality property of the pseudo-inverse, Hua et al. (2017) proposed an ordering of Koopman modes by defining a new 'energy'. Compared with previous empirical criteria, this 'energy' for each mode involves a pseudo-inverse that combines the influence of all eigenmodes, and, therefore, the contribution from each mode cannot be isolated. Instead of selecting modes from a 'reconstruction' perspective, Zhang et al. (2017) studied the issue of spurious modes by evaluating the deviation of the identified eigenfunctions from linear evolution in an a priori sense. Furthermore, optimized DMD (Chen, Tu & Rowley 2012; Askham & Kutz 2018) performs DMD and mode selection simultaneously, and is in spirit the forerunner of recently proposed neural network-based models for Koopman eigenfunctions (Li et al. 2017; Takeishi, Kawahara & Yairi 2017; Lusch, Kutz & Brunton 2018; Otto & Rowley 2019; Yeung, Kundu & Hodas 2019; Pan & Duraisamy 2020). Setting aside issues related to non-convex optimization (Dawson et al. 2016), the extension of optimized DMD to EDMD/KDMD is not straightforward. Further, neural network-based models require large amounts of data, are prone to local minima and lack interpretability. There have been a few attempts towards mode selection in EDMD/KDMD. Brunton et al. (2016a) present an iterative method that augments the dictionary of EDMD until a convergence criterion is reached for the subspace; this is effectively a recursive implementation of EDMD. Recently, Haseli & Cortés (2019) showed that, given a sufficient amount of data, if any accurate Koopman eigenfunction is spanned by the dictionary, it must correspond to one of the obtained eigenvectors. Moreover, they proposed the idea of mode selection by checking whether the reciprocal of an identified eigenvalue also appears when the temporal sequence of data is reversed, which is similar to the idea of comparing eigenvalues on the complex plane from different trajectories proposed by Hua et al. (2017). In contrast to the 'bottom-up' method of Brunton et al. (2016a) with subspace augmentation, Hua et al. (2017) proposed a 'top-down' subspace subtraction method relying on iteratively projecting the features onto the null space. A similar idea can be traced back to Kaiser et al. (2017), who proposed a search for the sparsest vector in the null space.
As the main contribution of this work, we propose a novel EDMD/KDMD framework equipped with the following strategy to extract an accurate yet minimal Koopman-invariant subspace.
(i) We first evaluate the normalized maximal deviation of the evolution of each eigenfunction from linear evolution in an a posteriori fashion.
(ii) Using the above criterion, we select a user-defined number of accurate EDMD/KDMD modes.
(iii) Among the accurate EDMD/KDMD modes obtained above, informative modes are selected using multi-task feature learning (Argyriou, Evgeniou & Pontil 2008a; Bach et al. 2012).
To the best of our knowledge, this is the first model-agnostic attempt to address sparse identification of an accurate minimal Koopman-invariant subspace. Our contribution also extends the classical spDMD (Jovanović et al. 2014) to a more general setting that includes EDMD/KDMD. The applications are focused on strongly transient flows, and new classes of stable Koopman modes are identified.
The organization of the paper is as follows. In § 2 we provide a review of EDMD/KDMD in discrete time and present corresponding extensions to continuous time. Following this, we discuss current challenges in standard EDMD/KDMD. In § 3 we propose the framework of sparse identification of informative Koopman-invariant subspaces for EDMD/KDMD with hyperparameter selection and provide implementation details. A novel optimization procedure that penalizes departure from linear evolution is detailed in §§ 3.1 and 3.2. Differences and connections between spDMD, Kou's empirical criterion and our proposed framework are shown from the viewpoint of optimization. In § 4 numerical verification and modal analysis are provided on examples ranging from a fixed-point attractor to a transient cylinder wake flow and a transient ship airwake. In § 5 conclusions are drawn.

Review of EDMD/KDMD
In this section we provide a summary of our framework to extract the Koopman operator using EDMD/KDMD in both discrete time and continuous time. Although the original EDMD is seldom used for data-driven Koopman analysis of fluid flows due to its poor scaling, it has recently been reported that a scalable version of EDMD can be applied to high-dimensional systems (DeGennaro & Urban 2019). Since our algorithm to extract the Koopman-invariant subspace is agnostic to the technique used to compute the original set of Koopman modes, we include both EDMD and KDMD for completeness.

Discrete-time formulation
For simplicity, consider M sequential snapshots {x_i}_{i=1}^{M} sampled uniformly in time with spacing Δt on a trajectory. The EDMD algorithm (Williams et al. 2015) assumes a dictionary of L linearly independent functions ψ_i : M → C, i = 1, …, L, that approximately spans a Koopman-invariant subspace F_L,

F_L = span{ψ_1, …, ψ_L}.   (2.1)

Thus, we can write any g ∈ F_L as g(x) = Ψ(x)a with a ∈ C^L, where the feature (row) vector Ψ(x) is

Ψ(x) = [ψ_1(x) ⋯ ψ_L(x)].   (2.2)

Consider a set of L observables {g_l}_{l=1}^{L} = {Ψa_l}_{l=1}^{L}, where each a_l ∈ C^L is arbitrary. After the discrete-time Koopman operator K_Δt is applied to each g_l, given data {x_i}_{i=1}^{M}, we have, for l = 1, …, L and i = 1, …, M − 1,

K_Δt g_l(x_i) = Ψ(x_{i+1})a_l = Ψ(x_i)Ka_l + r_{i,l},   (2.3)

where r_{i,l} is simply the residual for the lth function on the ith data point. The standard EDMD algorithm seeks a constant matrix K ∈ C^{L×L} that governs the linear dynamics in the observable space by minimizing the sum of the squares of the residuals r_{i,l} from (2.3) over all samples and functions,

min_K Σ_{l=1}^{L} Σ_{i=1}^{M−1} |r_{i,l}|² = min_K ‖(Ψ'_x − Ψ_x K)A‖_F²,   (2.4)

where A = [a_1 ⋯ a_L], Ψ_x = [Ψ(x_1)^⊤ ⋯ Ψ(x_{M−1})^⊤]^⊤ and Ψ'_x = [Ψ(x_2)^⊤ ⋯ Ψ(x_M)^⊤]^⊤. The minimal-norm minimizer of (2.4) is

K = Ψ_x^+ Ψ'_x (AA^H)(AA^H)^+,

where ·^+ denotes the pseudo-inverse and ·^H denotes the conjugate transpose. Note that when the set of observables fully spans F_L, i.e. A is full rank, (AA^H)(AA^H)^+ reduces to the identity. Then we have the more familiar

K_EDMD = Ψ_x^+ Ψ'_x,

which is independent of the choice of {a_l}_{l=1}^{L}. Assuming that all eigenvalues of K_EDMD are simple (an immediate consequence for an ergodic system; Parry 2004), for i = 1, …, L, the corresponding Koopman eigenfunction ϕ_i associated with the Koopman eigenvalue λ_i is

ϕ_i(x) = Ψ(x)v_i, with K_EDMD v_i = λ_i v_i,

and the Koopman modes b_i for the state, x ≈ Σ_{i=1}^{L} ϕ_i(x)b_i, are often determined in practice by ordinary least squares.
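The discrete-time EDMD steps above can be sketched in a few lines; the monomial dictionary and toy linear map below are our own illustrative choices, not the test cases of this paper:

```python
import numpy as np

# Minimal EDMD sketch: dictionary Psi(x) = [1, x, x^2] for a scalar state (L = 3).
def psi(x):
    return np.array([1.0, x, x ** 2])

# Data: one trajectory of the linear map x_{k+1} = 0.9 x_k, for which the
# dictionary is exactly Koopman invariant (1 -> 1, x -> 0.9 x, x^2 -> 0.81 x^2).
x = 1.3 * 0.9 ** np.arange(50)
Psi_x = np.stack([psi(v) for v in x[:-1]])   # (M-1, L) feature matrix
Psi_xp = np.stack([psi(v) for v in x[1:]])   # time-shifted feature matrix

K_edmd = np.linalg.pinv(Psi_x) @ Psi_xp      # K_EDMD = Psi_x^+ Psi_x'
lam, V = np.linalg.eig(K_edmd)               # eigenvalues and eigenvectors
# eigenfunctions are phi_i(x) = Psi(x) v_i; eigenvalues recover {1, 0.9, 0.81}
print(np.sort(lam.real))
```

Because the dictionary here spans an exactly invariant subspace, the recovered eigenvalues are exact; for a generic dictionary they are only an L² projection, which is precisely the source of the spurious modes discussed in § 2.3.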

Continuous-time formulation
Consider M data snapshots of the dynamical system with state x and its time derivative ẋ sampled along a trajectory. Recall that the generator of the Koopman semigroup is Kg ≜ lim_{t→0}(K_t g − g)/t, where D(K) is the domain on which the aforementioned limit is well defined and the closure of D(K) is F. One can write the evolution of any observable g = Ψa ∈ F_L as

(d/dt)g(x) = Ψ̇(x)a = Ψ(x)Ka + r,   (2.14)

where Ψ̇(x) ≜ [∇ψ_1(x)·F(x) ⋯ ∇ψ_L(x)·F(x)] and r is the residual. Similarly, one can find a K that minimizes the sum of the squares of the residuals r over all samples and observables. With this minimizing solution, the Koopman eigenfunctions take the same form as in the discrete-time formulation, while the continuous-time Koopman eigenvalues μ_i can be converted to the aforementioned discrete-time sense as λ_i = e^{μ_i Δt}.

Discrete-time formulation
Instead of explicitly expressing a dictionary of candidate functions, one can define a dictionary implicitly via the kernel trick, which is essentially the dual form of EDMD (Williams et al. 2014). Note that, from the EDMD formulation in (2.4), any vector in the range of K orthogonal to the range of Ψ_x^H is annihilated and, therefore, cannot be inferred (Williams et al. 2014). Assuming Ψ_x is of rank r, we can obtain a full singular value decomposition (SVD) Ψ_x = QΣZ^H and the corresponding economy-sized SVD Q_rΣ_rZ_r^H. With the aforementioned definitions, we have K = Z_r K̂ Z_r^H without loss (Otto & Rowley 2019). Since multiplication by a unitary matrix preserves the Frobenius norm, the objective can be rewritten in terms of K̂, where Q_{r,⊥}^H denotes the last m − r rows of Q^H. By taking derivatives with respect to K̂, one can obtain the minimal-norm minimizer (2.21). Note that, since any column in the span of A that is orthogonal to the span of Z_r will be annihilated by Z_r^H and, thus, cannot be utilized to determine K̂, it is reasonable to restrict a_l to the column space of Z_r. Assuming L is sufficiently large such that the column space of A fully spans Z_r, (2.22) can be proved (details in appendix A). With (2.22), we can rewrite (2.21) as the familiar KDMD formulation,

K̂_KDMD = Σ_r^{−1} Q_r^H Ψ'_x Ψ_x^H Q_r Σ_r^{−1}.

Again, such a minimizer is independent of the choice of A.
Note that, to compute Koopman eigenvalues and eigenfunctions, one only needs access to Ψ_xΨ_x^H and Ψ'_xΨ_x^H, i.e. the inner products among features on all data points. Fortunately, on a compact domain M, it is well known from Mercer's theorem (Mercer 1909) that, once a suitable non-negative kernel function k(·,·) : M × M → R is defined, k(x, y) is the inner product of a potentially infinite-dimensional feature vector Ψ evaluated at x, y ∈ M. Note that the choice of such a feature vector might not be unique, but the corresponding reproducing kernel Hilbert space (RKHS) is (Aronszajn 1950). In the case of a Gaussian kernel, one can choose the canonical feature vector k(·, x), which consists of 'bumps' of a certain bandwidth distributed on M. From the viewpoint of kernel PCA (Williams et al. 2014), Q_r, resulting from the finite-dimensional rank truncation of the Gram matrix Ψ_xΨ_x^H, is a numerical approximation to the r most dominant variance-explaining mode shapes in the RKHS evaluated on the data points (Rasmussen 2003), and Z_r represents the r dominant variance-explaining directions in terms of the feature vector in the RKHS.
Similar to EDMD, given K̂_KDMD = V̂ΛV̂^{−1} with V̂ = [v̂_1 ⋯ v̂_r], for i = 1, …, r, the corresponding ith Koopman eigenfunction ϕ_i and the Koopman modes for a vector observable g follow from (2.27), where the overline symbol denotes the complex-conjugate operator, and the appearance of the Jacobian of the kernel indicates that a differentiable kernel function is required for the extension to continuous time. For common kernels used in Koopman analysis, the kernel function, Jacobian and hyperparameters are listed in table 1.
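A minimal sketch of the discrete-time KDMD procedure; the Gaussian kernel, toy linear map and rank-truncation threshold are our own illustrative assumptions, with the eigenfunction construction following the standard formulation of Williams et al. (2014):

```python
import numpy as np

def kernel(X, Y, sigma=1.0):
    # Gaussian kernel Gram matrix: k(x, y) = exp(-||x - y||^2 / sigma^2)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

# Snapshots of a toy linear 2-D map x_{k+1} = A x_k (illustrative choice).
A = np.array([[0.9, 0.0], [0.0, 0.5]])
X = np.zeros((30, 2)); X[0] = [1.0, 0.8]
for k in range(29):
    X[k + 1] = A @ X[k]
Xm, Xp = X[:-1], X[1:]

G = kernel(Xm, Xm)                       # Psi_x Psi_x^H via the kernel trick
Ahat = kernel(Xp, Xm)                    # Psi_x' Psi_x^H
Q, s2, _ = np.linalg.svd(G)              # G = Q Sigma^2 Q^H (G is PSD)
r = int((s2 > 1e-10 * s2[0]).sum())      # numerical rank truncation
Qr, Sr = Q[:, :r], np.sqrt(s2[:r])
K_hat = (Qr / Sr).T @ Ahat @ (Qr / Sr)   # Sigma^-1 Q^H Ahat Q Sigma^-1
lam, V_hat = np.linalg.eig(K_hat)
# eigenfunctions at a new point x: kernel(x, Xm) @ (Qr / Sr) @ V_hat
print(np.round(np.sort(np.abs(lam))[::-1][:4], 3))
```

For this linear map the identified spectrum contains close approximations of the true Koopman eigenvalues 0.9 and 0.5 (and products thereof), up to the rank truncation.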

Challenges in EDMD/KDMD
In this section we briefly discuss two broad challenges in the use of EDMD and KDMD for Koopman analysis.

Mode selection
The number of approximated Koopman tuples (eigenfunction, eigenvalue, mode) from EDMD grows with the dictionary size, whereas that from KDMD grows with the number of snapshots. However, in most cases, a significant number of the eigenfunctions fail to evolve linearly, or are redundant in their contribution to the reconstruction of the state x. For example, as shown by Budišić et al. (2012), the Koopman eigenfunctions that vanish nowhere form an Abelian group under pointwise products of functions, while polynomial observables evolve linearly for a general linear system. These eigenfunctions, associated with the polynomial observables, are redundant in terms of providing an intrinsic coordinate for the linear dynamics.
When the number of features is larger than the number of data snapshots, EDMD eigenvalues can be misleading (Otto & Rowley 2019), and spurious eigenfunctions that do not evolve linearly often appear even when the number of data snapshots is sufficient. Analytically, it is clear that a Koopman eigenfunction in the span of the dictionary will be associated with one of the eigenvectors obtained from EDMD, given that Ψ_x is full rank and contains sufficiently many snapshots M (Haseli & Cortés 2019). Indeed, EDMD is an L² projection of the Koopman operator under the empirical measure (Korda & Mezić 2018b). As a result, we seek a Koopman-invariant subspace following the standard EDMD/KDMD. Since KDMD can be viewed as an efficient way of populating a dictionary of nonlinear features in high-dimensional spaces, the above arguments apply to KDMD as well. It should be noted that numerical conditioning can play a critical role, since full-rank matrices can still be ill-conditioned.

Choice of dictionary (for EDMD) or kernel (for KDMD)
Although the use of a kernel defines an infinite-dimensional feature space, the resulting finite number of effective features can still be affected by both the type of kernel and the hyperparameters in the kernel, as clearly shown by Kutz et al. (2016). Compared with EDMD/KDMD, which are based on a fixed dictionary of features, neural network approaches (Lusch et al. 2018;Otto & Rowley 2019;Pan & Duraisamy 2020) have the potential to be more expressive in searching for a larger Koopman-invariant subspace. From a kernel viewpoint (Cho & Saul 2009), feedforward neural networks enable adaptation of the kernel function to the data. Such a characteristic could become significant when the underlying Koopman eigenfunction is discontinuous. From an efficiency standpoint, a kernel-guided scalable EDMD (DeGennaro & Urban 2019) may be pursued. This can be achieved by generating kernel-consistent random Fourier features or approximating a few components of the feature vector constructed from Mercer's theorem, i.e. the eigenfunctions of the Hilbert-Schmidt integral operator on the RKHS.

Sparse identification of informative Koopman-invariant subspace
To address the challenges described in § 2.3, we develop a novel framework that uses EDMD/KDMD modes to identify a sparse, accurate and informative Koopman-invariant subspace. Our framework first prunes spurious, inaccurate eigenmodes, and then determines a sparse representation of the system state x from the accurate eigenmodes. In addition to the training data required in standard EDMD/KDMD, a validation trajectory data set is required to avoid overfitting on the training data. The terms spEDMD/spKDMD will refer to the mode-selected variants of EDMD and KDMD, respectively.

Pruning spurious modes by a posteriori error analysis
Given a validation trajectory x(t), t ∈ [0, T], of the nonlinear dynamical system, for i = 1, …, L we define the goodness of the ith eigenfunction in an a posteriori way, as the maximal normalized deviation from linear evolution conditioned on the trajectory x(t),

Q_i ≜ max_{t∈[0,T]} |ϕ_i(x(t)) − e^{λ_i t}ϕ_i(x(0))| / ((1/T)∫_0^T |ϕ_i(x(τ))|² dτ)^{1/2}.   (3.1)

In practice, we evaluate the above integral discretely in time. A similar a priori, and less restrictive, method has been proposed previously. In contrast, in the proposed method, the maximal error is evaluated in an a posteriori way to better differentiate spurious modes from accurate ones. For any 1 ≤ L̂ ≤ L, we can always select the top L̂ accurate eigenmodes out of the L eigenmodes, denoting their indices in the eigen-decomposition as {i_1, …, i_L̂}. Then, for the next sparse reconstruction step, we simply use ϕ_L̂ ≜ [ϕ_{i_1} ⋯ ϕ_{i_L̂}] to reconstruct the state x. To choose an appropriate L̂ for linearly reconstructing the system state x, we monitor the normalized reconstruction error for the aforementioned set of top-L̂ accurate eigenmodes,

R_L̂ ≜ ‖(I − Φ_L̂Φ_L̂^+)X‖_F / ‖X‖_F,   (3.4)

where I is the identity matrix, Φ_L̂ is the matrix of the selected eigenfunctions evaluated on the validation snapshots and X is the corresponding snapshot matrix. As a result, the evaluation of (3.4) for each L̂ is of similar expense to a least-squares regression. For an increasing number of selected eigenfunctions L̂, the reconstruction error R_L̂ decreases, while the largest linear-evolution error Q_{i_L̂} increases. A truncation L̂ can then be defined by the user to strike a balance between the linear-evolution error Q_{i_L̂} and the reconstruction error R_L̂. In the next subsection we will further select a subset of eigenmodes spanning the minimal Koopman-invariant subspace.
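The pruning step can be sketched as follows; the eigenvalues, the trajectory and the injected spurious mode are toy assumptions, while the two functions implement the linear-evolution error Q_i and the reconstruction error R_L̂ described above:

```python
import numpy as np

def linear_evolution_error(phi_traj, lam, t):
    """Q_i: max normalized deviation of phi_i(x(t)) from e^{lam_i t} phi_i(x(0))."""
    pred = np.exp(np.outer(t, lam)) * phi_traj[0]        # (n_t, L) linear evolution
    dev = np.abs(phi_traj - pred).max(axis=0)
    scale = np.sqrt(np.mean(np.abs(phi_traj) ** 2, axis=0))
    return dev / scale

def reconstruction_error(Phi_sel, X):
    """R: ||(I - Phi Phi^+) X||_F / ||X||_F for the selected eigenfunctions."""
    return np.linalg.norm(X - Phi_sel @ np.linalg.pinv(Phi_sel) @ X) / np.linalg.norm(X)

# Toy check: two accurate eigenfunctions and one spurious (noisy) one.
t = np.linspace(0.0, 2.0, 101)
lam = np.array([-0.5, -1.0 + 2j, -0.2])
phi_traj = np.exp(np.outer(t, lam))
rng = np.random.default_rng(1)
phi_traj[:, 2] += 0.5 * rng.standard_normal(t.size)      # spurious mode
Q = linear_evolution_error(phi_traj, lam, t)
order = np.argsort(Q)                                    # most accurate modes first
print(order[-1])                                         # prints 2: the spurious mode

# A state built from the two accurate modes is reconstructed almost exactly.
B_toy = np.array([[1.0, 0.5], [0.0, 2.0]])
X_state = phi_traj[:, :2] @ B_toy
print(reconstruction_error(phi_traj[:, order[:2]], X_state))
```

Sweeping the number of retained modes in `order` and monitoring `reconstruction_error` reproduces the trade-off between Q and R discussed above.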

Sparse reconstruction via multi-task feature learning
Numerical experiments revealed that, in the selected set of the L̂ most accurate eigenfunctions, two types of redundant eigenfunctions were found: (i) nearly constant eigenfunctions with eigenvalues close to zero; (ii) pointwise products of Koopman eigenfunctions introduced by the nonlinear observables, which are not useful for the linear reconstruction of the state.
To filter the above modes, we consider sparse regression with the L̂ most accurate eigenfunctions as features and the system state x as the target. Note that, since we have guaranteed the accuracy of the selected eigenmodes, one can choose the features either a priori, ϕ_i(x(t)), or a posteriori (multi-step prediction), e^{λ_i t}ϕ_i(x(0)). Here we choose the latter, since it is directly related to prediction and can be reused from the previous step without additional computational cost. We denote the corresponding multi-step prediction feature matrix as Ψ̂_L̂, whose (j, k) entry is e^{λ_{i_k} t_j}ϕ_{i_k}(x(0)). Note that similar features were also considered in spDMD (Jovanović et al. 2014) and optimized DMD (Chen et al. 2012). Finally, the fact that there is no control over the magnitudes of the implicitly defined features in the standard KDMD may cause unequal weighting between different features. Thus, we scale the initial value of all eigenfunctions to unity, obtaining Ψ̂_L̂,scaled. Since x is finite dimensional, searching for a sparse combination of Ψ̂_L̂ to reconstruct x is equivalent to the solution of a multi-task feature learning problem with a preference for a relatively small set of features. Note that this type of problem has been studied extensively in the machine learning community (Argyriou et al. 2008a; Zhao et al. 2015). In this work, given X and Ψ̂_L̂,scaled, we leverage the multi-task ElasticNet (Pedregosa et al. 2011) to search for a row-wise sparse B, which solves the following convex optimization problem:

min_B (1/2M)‖X − Ψ̂_L̂,scaled B‖_F² + αρ‖B‖_{2,1} + (α(1 − ρ)/2)‖B‖_F².   (3.9)

Here ‖·‖_{2,1}, defined in (3.11) as ‖W‖_{2,1} ≜ Σ_i ‖W_i‖_2, is the so-called ℓ1/ℓ2 norm of a matrix W, where W_i is the ith row of W. This norm is special in that it controls the number of shared features learned across all tasks, i.e. the ith Koopman mode b_i is either driven to a zero vector or not, while the standard ℓ1 norm only controls the number of features for each task independently. As a simple illustration, the ℓ1/ℓ2 norm of three different N × N square matrices (here N = 5) with 0-1 binary entries is displayed in figure 2.
Since √N ≤ 1 + √(N − 1) ≤ N, minimizing the ℓ1/ℓ2 norm leads to a penalty on the number of non-zero rows. As shown in the second term on the right-hand side of (3.9), minimizing the ℓ1/ℓ2 norm penalizes the number of retained Koopman eigenmodes. Figure 2. Illustration of the ℓ1/ℓ2 norm (defined in (3.11)) for different N × N 0-1 binary matrices.
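A quick numerical check of this row-sparsity-promoting behaviour; the three binary matrices below are our own choices, analogous in spirit to those of figure 2:

```python
import numpy as np

def l1l2(W):
    # l1/l2 norm ||W||_{2,1} = sum of row-wise l2 norms
    return np.linalg.norm(W, axis=1).sum()

N = 5
one_row = np.zeros((N, N)); one_row[0] = 1.0  # N ones packed into a single row
diag = np.eye(N)                              # N ones spread over N rows
full = np.ones((N, N))                        # N^2 ones
print(l1l2(one_row), l1l2(diag), l1l2(full))  # sqrt(5), 5, 5*sqrt(5)
```

With the same number of non-zero entries (N ones), concentrating them in a single row gives the smallest ℓ1/ℓ2 norm (√N) while spreading them across N rows gives the largest (N), which is exactly why the penalty drives whole Koopman modes to zero.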
The above procedure not only serves the purpose of selecting modes that explain the behaviour of all components of the state, but is also particularly natural for EDMD/KDMD, since Koopman modes are obtained via regression. Here α is a penalty coefficient that controls the total amount of ℓ1/ℓ2 and ℓ2 regularization, while ρ is the ElasticNet mixing parameter (Zou & Hastie 2005) that ensures uniqueness of the solution when highly correlated features exist. In our case, we choose ρ = 0.99 and sweep α over a certain range, with the L_r non-zero features for each α denoted as Ψ̂_{L_r}, while monitoring the normalized residual min_B ‖X − Ψ̂_{L_r}B‖_F/‖X‖_F to choose an appropriate α. It has to be mentioned that the sparsest solution from the multi-task ElasticNet was sometimes found to shrink coefficients to small values instead of exactly zero. This is a consequence of the insufficiency of the current optimization algorithm, which employs coordinate descent (Pedregosa et al. 2011). Hence, for each target component, we consider an additional hard-thresholding step, setting the contribution of any mode to zero if the magnitude of its coefficient is smaller than a threshold in [10⁻³, 10⁻²].
Finally, we refit the Koopman modes as B_{L_r} = Ψ̂_{L_r}^+ X, which avoids the bias introduced by the penalty term in (3.9) (spDMD does not refit B). To summarize, the general idea of the framework is illustrated in figure 3. As a side note for interested readers, if one only performs multi-task feature learning without hard thresholding and refitting, one obtains a smooth ElasticNet path instead of a discontinuous one. However, the smooth ElasticNet path can make it difficult to choose a proper α visually, especially when the given dictionary of EDMD/KDMD is not rich enough to cover an informative Koopman-invariant subspace. Further discussion of the computational complexity of our framework is presented in appendix B.
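The select-threshold-refit sequence can be sketched with scikit-learn's multi-task ElasticNet (the library of Pedregosa et al. 2011). The toy features below mimic scaled a posteriori eigenfunction evolutions; real-valued data are used for simplicity (complex modes would require splitting into real and imaginary parts, an implementation detail not specified here), and the eigenvalues, penalty and threshold are our own illustrative choices:

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNet

# Toy features e^{lam_i t} phi_i(x(0)) with phi_i(x(0)) scaled to unity.
t = np.linspace(0.0, 2.0, 200)[:, None]
lams = np.array([-0.3, -1.0, -5.0, -8.0])   # two informative, two redundant modes
Psi = np.exp(t * lams)                      # (M, Lhat); all features equal 1 at t = 0
B_true = np.zeros((4, 3))
B_true[0] = [1.0, 0.5, 0.0]                 # only modes 0 and 1 build the state
B_true[1] = [0.0, 1.0, 2.0]
X = Psi @ B_true                            # three state components

model = MultiTaskElasticNet(alpha=5e-3, l1_ratio=0.99, fit_intercept=False)
model.fit(Psi, X)                           # row-sparse solve of (3.9)
B = model.coef_.T                           # row i of B = Koopman mode b_i
keep = np.linalg.norm(B, axis=1) > 1e-2     # hard-thresholding step
B_refit = np.linalg.pinv(Psi[:, keep]) @ X  # refit to remove the shrinkage bias
print(keep)                                 # only the informative modes survive
```

Note that the `l1_ratio` argument plays the role of ρ and `alpha` the role of α in (3.9); sweeping `alpha` while monitoring the refit residual mirrors the selection procedure described above.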
Thus far, we have presented our main contribution: a novel optimization-based framework to search for an accurate and minimal Koopman-invariant subspace from data. An appealing aspect of our framework is that it is model agnostic, which makes it easy to extend from the standard EDMD/KDMD to more advanced approximation methods (Azencot, Yin & Bertozzi 2019; Jungers & Tabuada 2019; Mamakoukas et al. 2019). In the following subsection we present two mathematical insights: (1) multi-task feature learning generalizes spDMD under a specific constraint; (2) a popular empirical criterion can be viewed as a single step of proximal gradient descent.

Relationship between spDMD, Kou's criterion and multi-task feature learning
For simplicity, neglecting the ElasticNet mixing (i.e. taking ρ = 1), (3.9) with L modes leads to a multi-task Lasso problem,

min_B (1/2M)‖X − Ψ̂_L B‖_F² + α‖B‖_{2,1}.   (3.12)

Recall that in spDMD (Jovanović et al. 2014) the DMD modes φ_1, …, φ_L with ‖φ_i‖_2 = 1 remain the same as in standard DMD. Similarly, if we posit a structural constraint on B in (3.12) by enforcing the modes to be those from DMD, then there exist amplitudes α_1, …, α_L such that the ith row of B is b_i = α_i φ_i^⊤.   (3.13)

Note that then ‖B‖_{2,1} = Σ_{i=1}^{L}|α_i|. Hence, we recover the ℓ1 optimization step of spDMD (Jovanović et al. 2014) from (3.12) as

min_{α_1,…,α_L} (1/2M)‖X − Σ_{i=1}^{L} α_i ψ̂_i φ_i^⊤‖_F² + α Σ_{i=1}^{L}|α_i|,   (3.14)

where φ_1, …, φ_L are the DMD modes corresponding to the eigenvalues λ_1, …, λ_L and ψ̂_i denotes the ith column of Ψ̂_L. Hence, multi-task feature learning solves a less constrained optimization than spDMD in the context of DMD. Kou & Zhang (2017) proposed an empirical criterion for mode selection by ordering modes according to an 'energy' I_i defined in (3.15). From an optimization viewpoint, consider the a posteriori prediction matrix X_DMD, determined from DMD using the snapshot pair (X, X'). Here X_DMD is a sum of rank-1 contributions from the different modes (Schmid 2010). Hence, a general mode selection technique with a user-defined preference weighting w is the following weighted ℓ0 non-convex optimization problem:

min_a ‖X_DMD − Σ_{i=1}^{L} a_i ψ̂_i φ_i^⊤‖_F² + λ‖a‖_{w,0},

where ‖a‖_{w,0} ≜ Σ_i w_i|a_i|_0, and |a_i|_0 is one if a_i ≠ 0 and zero otherwise. Note that this pseudo-norm can be viewed as a limiting case of a weighted composite-sine-function-smoothed ℓ0 regularization (Wang et al. 2019).
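A minimal iterative hard-thresholding (proximal gradient) sketch for a weighted ℓ0 problem of this generic form; the Gaussian dictionary D, uniform weights and step size are toy assumptions rather than the DMD setting:

```python
import numpy as np

# min_a 0.5*||y - D a||^2 + lam * sum_i w_i |a_i|_0, solved by proximal
# gradient descent: gradient step on the quadratic, then hard thresholding.
rng = np.random.default_rng(0)
D = rng.standard_normal((100, 10))          # toy dictionary (full column rank)
a_true = np.zeros(10); a_true[[2, 7]] = [1.5, -2.0]
y = D @ a_true                              # noiseless measurements

lam, w = 0.05, np.ones(10)                  # penalty and uniform weights
eta = 1.0 / np.linalg.norm(D, 2) ** 2       # step size 1/L, L = ||D||_2^2
a = np.zeros(10)
for _ in range(300):
    z = a - eta * D.T @ (D @ a - y)         # gradient step on 0.5*||y - D a||^2
    # prox of the separable weighted l0 term: keep z_i only if z_i^2 > 2*lam*eta*w_i
    a = np.where(z ** 2 > 2 * lam * eta * w, z, 0.0)
print(np.flatnonzero(a))                    # recovers the support {2, 7}
```

Choosing non-uniform weights w_i that penalize fast-decaying modes changes only the per-entry thresholds, which is the mechanism linking this scheme to the empirical criterion discussed next.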
To solve this non-convex optimization problem, a less known but rather efficient alternative to the popular ℓ1 relaxation (such as that in spDMD) is iterative least-squares hard thresholding. This has been used in the sparse identification of nonlinear dynamics (SINDy) (Brunton, Proctor & Kutz 2016b), and convergence to a local minimum has been proved (Zhang & Schaeffer 2019). A more rigorous framework that generalizes such an algorithm is the proximal gradient method (Parikh & Boyd 2014). Much as Newton's method is the standard tool for unconstrained smooth optimization, the proximal gradient method is the standard tool for constrained non-smooth optimization. It is straightforward to derive the iterative algorithm that extends to the weighted ℓ0 norm from step k to step k + 1, where η_k is the step size at step k. Note that the weighted ℓ0 norm is a separable sum over the a_i. After some algebra, the proximal operator reduces to an element-wise hard-thresholding operator H_√(λη_k)(a), defined as zero if |a| < √(λη_k) and a otherwise. In particular, if one takes the initial step size to be extremely small, η_1 ≪ 1, then the second term in (3.18) can be neglected. Thus, for i = 1, . . . , L, with a weighting scheme that penalizes fast-decaying modes, one immediately obtains the thresholding criterion for the ith entry of a. Plugging (3.21a,b) into (3.18), the first iteration of (3.18) reduces to mode selection with Kou's criterion in (3.15). Normally, β_i is very large for unstable modes and small for decaying modes. It is important to note that (a) such a choice of w, preferring unstable/long-lasting modes over decaying modes, is still user defined; (b) the optimization is in an a priori sense to obtain DMD. Thus, the insufficiency of the a priori formulation to account for temporal evolution is compensated by this criterion, while DMD in an a posteriori formulation (e.g.
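The hard-thresholding proximal step described above can be sketched in a few lines of pure Python. Function names are illustrative, and we follow the text's √(λη_k) threshold convention, extended to the weighted case so that entry i survives only if |a_i| ≥ √(λ η w_i).

```python
import math

def hard_threshold(a, tau):
    """Element-wise hard-thresholding H_tau: zero out entries with |a_i| < tau,
    keep the rest unchanged (the prox of the l0 pseudo-norm)."""
    return [0.0 if abs(ai) < tau else ai for ai in a]

def prox_weighted_l0(a, w, lam, eta):
    """One proximal step for lam * sum_i w_i |a_i|_0 with step size eta:
    entry i is zeroed when |a_i| < sqrt(lam * eta * w_i)."""
    return [0.0 if abs(ai) < math.sqrt(lam * eta * wi) else ai
            for ai, wi in zip(a, w)]
```

With weights w_i chosen to penalize fast-decaying modes, a single such prox step reproduces a Kou-type thresholding rule, which is precisely the connection drawn above.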
spDMD) includes such an effect implicitly in the optimization. Hence, it is possible that in some circumstances spDMD and Kou's criterion achieve similar performance (Kou & Zhang 2017). Lastly, as summarized in figure 4, it is important to mention the similarities and differences between spKDMD and spDMD: (1) spKDMD refits the Koopman modes while spDMD does not; and (2) the amplitude of every mode in spKDMD is fixed at unity, whereas it has to be determined in spDMD.

Hyperparameter selection
For simplicity, hyperparameter selection is discussed only for KDMD in this section. To fully determine a kernel in KDMD, one has to choose: (i) the kernel type; (ii) the kernel parameters, e.g. the scale parameter σ; (iii) the rank truncation r.
In this work, for simplicity, we fix the kernel type to be an isotropic Gaussian. Motivated by previous work on error evaluation of Koopman modes by Zhang et al. (2017), we consider cross-validated evaluation of the a priori mean normalized accuracy, defined in (3.23) and (3.24) (continuous and discrete forms, respectively) for the ith eigenfunction, on validation data for different rank truncations and kernel parameters. Note that evaluating the maximal instead of the mean normalized accuracy would make the error metric strongly dependent on the local sparsity of the training data in the feature space. This is particularly true when a single trajectory of a high-dimensional dynamical system is used and randomly shuffled cross-validation is performed (Pan & Duraisamy 2018).
For each set of hyperparameters, we first compute the number of eigenfunctions for which the error defined in (3.23) and (3.24) is below a certain threshold on both the training and validation data of each fold of cross-validation. Then we compute the average of this number over all folds, which indicates the quality of the corresponding subspace. Finally, we plot the average number versus the rank truncation r and kernel scale parameter σ to select the hyperparameters.
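The counting heuristic above can be sketched as follows; the data layout (one list of per-eigenfunction error pairs per fold) and the function names are illustrative.

```python
def count_accurate(errors, threshold):
    """Count eigenfunctions whose a priori normalized error is below the
    threshold on BOTH training and validation data for one CV fold.
    `errors` is a list of (train_err, valid_err) pairs, one per eigenfunction."""
    return sum(1 for tr, va in errors if tr < threshold and va < threshold)

def average_over_folds(folds, threshold):
    """Average the count over all cross-validation folds; a larger average
    indicates a better (sigma, r) hyperparameter combination."""
    return sum(count_accurate(f, threshold) for f in folds) / len(folds)
```

In practice this average would be tabulated over the (σ, r) grid and the combination maximizing it (subject to simplicity considerations) would be retained.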

Implementation
We implement the described framework in Python with moderate parallelism in each module. We use scipy.special.hermitenorm (Jones, Oliphant & Peterson 2001) to generate normalized Hermite polynomials, and MultiTaskElasticNet in scikit-learn (Pedregosa et al. 2011) for multi-task feature learning, where we set the maximum number of iterations to 10^5 and the tolerance to 10^-12. Message passing interface (MPI) parallelism via mpi4py (Dalcin et al. 2011) is used for the grid search in hyperparameter selection. To prepare data sets of hundreds of gigabytes collected from high-fidelity simulations, a distributed SVD written in C++, named the parallel data processing (PDP) tool, is developed for dimension reduction. A brief description of this tool is given in appendix D.
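As a dependency-free illustration of the normalized Hermite polynomials mentioned above: scipy.special.hermitenorm evaluates the probabilists' polynomials He_n, and the extra 1/√(n!) scaling shown here is one common normalization (orthonormality under the standard Gaussian weight), assumed for illustration.

```python
import math

def hermite_norm(n, x):
    """Probabilists' Hermite polynomial He_n(x) via the three-term recurrence
    He_{n+1}(x) = x*He_n(x) - n*He_{n-1}(x) (what scipy.special.hermitenorm
    evaluates), scaled by 1/sqrt(n!) so the family is orthonormal with
    respect to the standard Gaussian weight."""
    if n == 0:
        return 1.0
    he_prev, he = 1.0, x          # He_0, He_1
    for k in range(1, n):
        he_prev, he = he, x * he - k * he_prev
    return he / math.sqrt(math.factorial(n))
```

For example, He_2(x) = x^2 − 1, so the normalized value at x is (x^2 − 1)/√2.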

Two-dimensional fixed point attractor with a known finite Koopman-invariant subspace

We first consider a simple nonlinear dynamical system with a fixed point, which has an analytically determined, finite-dimensional, non-trivial Koopman-invariant subspace (Brunton et al. 2016a; Kaiser et al. 2017), to show the effectiveness of the proposed method. We consider a continuous-time formulation. The governing equations are ẋ1 = μx1, ẋ2 = λ(x2 − x1²), with μ = −0.05 and λ = −1. One natural choice of the minimal set of Koopman eigenfunctions that linearly reconstructs the state is given in Brunton et al. (2016a). The way we generate training, validation and testing data is described below, with the distribution of the data shown in figure 5. As an illustration, we consider two models to approximate the Koopman operator from training data, with R defined in (3.4). From figure 6(a), considering both the linearly evolving error and the quality of the reconstruction, we choose the cut-off threshold at L̂ = 10. We observe a sharp cut-off in figure 6(a) around a number of selected eigenmodes of L̂ = 8. This is a reasonable choice, since from the eigenvalues in figure 6(b) we notice that the analytic Koopman eigenmodes are not covered until the first eight most accurate eigenmodes are selected. Note that the legend in figure 6(a) is ordered by the maximal deviation from linear evolution; e.g. the second most accurate mode is the 34th mode, with zero eigenvalue. Indeed, the first four eigenfunctions (index = 1, 34, 35, 36) are redundant in terms of reconstruction in this problem (this could be interesting if the system were instead Hamiltonian). The fifth (index = 29) and sixth (index = 33) eigenmodes correspond to two of the analytic eigenfunctions that span the system, and the seventh (index = 32) eigenmode is the product of the fifth and sixth eigenfunctions. Similarly, the ninth and tenth eigenfunctions (index = 31, 28) appear to be polynomial combinations of the true eigenfunctions.
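A minimal sketch of this system and its Koopman eigenfunctions, using the analytic forms from Brunton et al. (2016a): the coefficient b = λ/(λ − 2μ) follows from requiring that φ2 = x2 − b x1² evolve linearly with rate λ. The integrator and variable names are illustrative.

```python
import math

mu, lam = -0.05, -1.0

def rhs(x1, x2):
    # dx1/dt = mu*x1, dx2/dt = lam*(x2 - x1**2)
    return mu * x1, lam * (x2 - x1 * x1)

def rk4_step(x1, x2, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = rhs(x1, x2)
    k2 = rhs(x1 + 0.5 * dt * k1[0], x2 + 0.5 * dt * k1[1])
    k3 = rhs(x1 + 0.5 * dt * k2[0], x2 + 0.5 * dt * k2[1])
    k4 = rhs(x1 + dt * k3[0], x2 + dt * k3[1])
    x1 += dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0
    x2 += dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0
    return x1, x2

b = lam / (lam - 2.0 * mu)   # = 10/9 for these parameter values

def phi(x1, x2):
    """Minimal Koopman eigenfunctions: phi1 = x1 (eigenvalue mu),
    phi2 = x2 - b*x1**2 (eigenvalue lam), phi3 = x1**2 (eigenvalue 2*mu)."""
    return x1, x2 - b * x1 * x1, x1 * x1
```

Integrating a trajectory and checking that each φ_i(t) equals exp(λ_i t) φ_i(0) is exactly the 'linear evolution' property that the error analysis in figure 6 quantifies.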
According to (3.9), to further remove redundant modes, we perform multi-task feature learning on the L̂ = 10 eigenmodes. The corresponding ElasticNet path is shown in figure 7. Note that each α corresponds to a minimizer of (3.9). To choose α so as to find a proper Koopman-invariant subspace, it is advisable to check the trend of the normalized reconstruction error and the number of non-zero features. Given the dictionary, for simple problems for which there exists an exact Koopman-invariant subspace that also spans the system state, a proper model can be obtained by selecting α ≈ 10^-6, which ends up with only three eigenfunctions, as shown in figure 7. Moreover, as is common for EDMD with a polynomial basis (Williams et al. 2014, 2015), a pyramid of eigenvalues appears in figure 7. As shown in figure 8, both the identified eigenvalues and the contours of the phase angle and magnitude of the selected eigenfunctions from spEDMD match the analytic eigenfunctions given in (4.3a-c) very well. As expected, the prediction on unseen testing data is also excellent. Note that the indices of the true eigenfunctions ϕ1, ϕ2 and ϕ3 ordered by Kou's criterion in (3.15) are 8, 5 and 6: all of the true eigenfunctions are missing from the top three modes chosen by Kou's criterion. Indeed, the top three modes chosen by Kou's criterion have nearly zero eigenvalues.

Results of continuous-time KDMD with mode selection
The mode selection algorithm presented above can be applied in precisely the same form to KDMD, given a set of eigenfunctions and eigenvalues. Error analysis of the eigenfunctions is shown in figure 9, from which we again choose L̂ = 10. As before, eigenvalues ordered by maximal deviation from linear evolution are shown in the legend of figure 9(a). Again, in figure 9(a), we observe a sharp decrease in the reconstruction error once the four most accurate modes are included. This is expected, as the second to fourth most accurate modes are analytically exact, as seen in figure 9(b). As shown in figures 10 and 11, both spEDMD and spKDMD arrive at the same analytic eigenfunctions, up to a constant factor. It should be noted that, although polynomials do not analytically lie in the RKHS (Minh 2010), good approximations can still be achieved conditioned on the data at hand, i.e. x1, x2 ∈ [−0.5, 0.5]. Again, the indices of the true eigenfunctions ϕ1 to ϕ3 ordered by Kou's criterion are 8, 2 and 3; hence ϕ1 is missing from the top three modes chosen by Kou's criterion. Similarly, the first mode chosen by Kou's criterion has a zero eigenvalue.

Effect of SVD regularization
Singular value decomposition truncation is a standard regularization technique for the solution of potentially ill-conditioned linear systems. In standard EDMD, for example, the matrix G in (2.7) can be ill-conditioned, leading to spurious eigenvalues of K. Hence, Williams et al. (2014) recommend the SVD truncation in (2.8) to obtain a robust solution for K. Effectively, this shrinks the number of EDMD/KDMD modes. It has to be recognized, however, that mode reduction by SVD truncation is not the same as mode selection. Most importantly, one should not confuse numerical spuriousness from poor conditioning with functional spuriousness from the orthogonal projection error of the Koopman operator (Korda & Mezić 2018b). Indeed, SVD truncation does not always lead to a better approximation of a Koopman-invariant subspace. It is rather a linear dimension reduction that optimally preserves the variance in the feature space conditioned on the training data, without knowledge of the linear evolution property of each feature. For demonstration, we take the above fixed point attractor system with the same data and the standard EDMD algorithm with the same order of Hermite polynomials. The results of prediction on unseen testing data shown in figure 12 indicate that, even though only three eigenfunctions (indeed, three features in the Hermite polynomial basis) are required, standard EDMD fails to identify the correct eigenfunctions with 26 SVD modes, while the results improve with 31 modes retained. The sensitivity of standard EDMD to SVD truncation is likely a result of the use of normalized Hermite polynomials, for which SVD truncation leads to a strong preference for the subspace spanned by the higher-order polynomials. We did not observe such a sensitivity for KDMD unless the subspace was truncated below 10 modes.
It should be noted that the noise, drawn from N(0, 0.3²), is generated on the coarsest mesh shown in figure 13 and interpolated to the finer meshes. Grid convergence with increasing mesh resolution is assessed by comparing the temporal evolution of the drag coefficient C_D and lift coefficient C_L. Note that the dynamics of a cylinder wake involves four regimes: near-equilibrium linear dynamics, nonlinear algebraic interaction between the equilibrium and the limit cycle, exponential relaxation towards the limit cycle, and periodic limit cycle dynamics (Chen et al. 2012; Bagheri 2013). Instead of considering data only from each of these regimes separately (Chen et al. 2012; Taira et al. 2019) or only from the last two regimes, where exponential linear dynamics is expected (Bagheri 2013), we start collecting data immediately after the flow field becomes unstable and stop after the flow field has experienced several limit cycles. Note that the regime with algebraic interaction is non-modal (Schmid 2007) and, therefore, cannot be expressed as individual exponential terms (Bagheri 2013); this makes it a challenging problem for DMD (Chen et al. 2012). For each Re, 891 snapshots of the full velocity fields U and V with sampling time interval Δt are collected as two matrices of size N_grid × N_snapshots. Following this, each velocity component is shifted and scaled (normalized) to [−1, 1]. Since space-filling sampling in any high-dimensional space would be extremely difficult, we split the trajectory into training, validation and testing data by sampling with strides, similar to the 'even-odd' sampling scheme previously proposed by Otto & Rowley (2019). As illustrated in figure 14, given index i, the ith point belongs to the training set if i mod 3 = 0, to the validation set if i mod 3 = 1, and to the testing set if i mod 3 = 2. Consequently, the time interval within each of the training, validation and testing trajectories is tripled to 3Δt = 0.3 t_ref.
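The stride-based split can be sketched as follows (function name illustrative):

```python
def stride_split(n_snapshots):
    """Split snapshot indices 0..n-1 into training (i mod 3 == 0),
    validation (i mod 3 == 1) and testing (i mod 3 == 2) sets, so each
    sub-trajectory keeps a uniform (tripled) sampling interval."""
    train = [i for i in range(n_snapshots) if i % 3 == 0]
    valid = [i for i in range(n_snapshots) if i % 3 == 1]
    test = [i for i in range(n_snapshots) if i % 3 == 2]
    return train, valid, test
```

Unlike a random shuffle, this split preserves a uniform time step within each subset, which matters because the Koopman approximation is built on temporally equispaced snapshot pairs.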
Thus, the training, validation and testing sets contain 297 snapshots each. Finally, we stack the data matrices along the first axis, corresponding to the number of grid points, and perform the distributed SVD described in § 3.4. For all three cases, the top 20 POD modes are retained, corresponding to 99 % of the kinetic energy. Next, we apply our algorithm to discrete-time KDMD with an isotropic Gaussian kernel on this reduced-order nonlinear system. We choose the hyperparameters σ = 3 and r = 180. Further details are given in § C.2.
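A serial numpy sketch of the SVD-based POD reduction (the paper uses the distributed C++ PDP tool for data of this size; the function name and the energy criterion are illustrative):

```python
import numpy as np

def pod_reduce(X, energy=0.99):
    """Reduce a snapshot matrix X (n_grid x n_snapshots) with a thin SVD:
    keep the smallest r whose singular values capture the requested
    kinetic-energy fraction, and return spatial modes and POD coefficients."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    frac = np.cumsum(s**2) / np.sum(s**2)
    r = int(np.searchsorted(frac, energy)) + 1  # first index reaching `energy`
    modes = U[:, :r]                            # spatial POD modes
    coeffs = s[:r, None] * Vt[:r, :]            # temporal POD coefficients
    return modes, coeffs, r
```

The Koopman analysis then operates on the r temporal coefficients rather than on the full grid, which is what keeps the kernel computations tractable.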

Results of discrete-time KDMD with mode selection
For all three Reynolds numbers, a posteriori error analysis is shown in figure 15. A good choice of the number of accurate modes L̂ retained for reconstruction is around 60, since the corresponding maximal deviation from linear evolution is still around 5 % while the reconstruction error reaches a plateau for L̂ > 60. After mode selection on validation data, an α-family of solutions is obtained, with the corresponding reconstruction error and number of non-zero terms shown in figure 16. Note that the chosen solution is highlighted with blue circles. As shown in table 2, nearly half of the accurate KDMD eigenmodes identified are removed by the proposed sparse feature selection. Note that for all three cases, the number of selected modes (around 32 to 34) is still larger than that required in neural network models (around 10) (Otto & Rowley 2019; Pan & Duraisamy 2020). This is because the subspace spanned by KDMD/EDMD relies on a predetermined dictionary rather than being data-adaptive like neural network models. Nevertheless, owing to the additional expressiveness from nonlinearity, we will see in § 4.2.3 that spKDMD performs significantly better than DMD (Schmid 2010) and spDMD (Jovanović et al. 2014), while enjoying the properties of convex optimization at a much lower computational cost than the inherently non-convex and computationally intensive neural network counterparts. The predictions of the top eight POD coefficients (denoted x1 to x8) on testing data are displayed in figures 17-19. The results match the ground truth very well for all three cases. Figure 21 shows that there appear to be five clusters of selected eigenvalues, while most of the modes are removed by the proposed algorithm. Similar observations were made by Bagheri (2013) when DMD was applied to the full transient dynamics. A pattern consisting of a stable eigenvalue on the unit circle surrounded by several decaying eigenvalues is observed for all clusters.
The stable eigenvalue contributes to the limit cycle behaviour, while the decaying eigenvalues account for the transient phenomena. Due to symmetry, only eigenvalues in the first quadrant are shown in the bottom row of figure 21. It is observed that the frequency associated with the type-II cluster is approximately twice that of type-I, in good agreement with previous analytical results from weakly nonlinear theory (Bagheri 2013). The frequency f is normalized as St = fD/U∞, where St is the Strouhal number.
Recall that in the laminar parallel shedding region (47 < Re < 180), the characteristic Strouhal number St scales with −1/√Re (Fey, König & Eckelmann 1998). Therefore, it is expected that the St of both types tends toward higher frequency as Re increases from 70 to 130. Furthermore, it is interesting to note that the corresponding Strouhal numbers of lift and drag on the limit cycle, St_L and St_D (each of the lift and drag coefficients exhibits only one frequency in the limit-cycle regime for the range of Re studied in this work), coincide with the stable frequencies of types I and II, respectively, as indicated in figure 21. This is due to the anti-symmetrical/symmetrical structure of the velocity fields of the type-I/II Koopman modes, respectively, as can be inferred from figure 22. A schematic is also shown in figure 20. The higher-frequency mode is symmetrical (along the free stream direction) in U and anti-symmetrical in V; as a consequence, it contributes only to the oscillation of drag. The lower-frequency mode is anti-symmetrical in U and symmetrical in V, and contributes only to the oscillation of lift. Thus, the fluctuation in lift mostly results from the stable mode of type-I, while that of drag results from the stable mode of type-II, at twice the frequency.
• The minimal dimension of the Koopman-invariant subspace that approximately captures the limit cycle attractor for all three Re, which fall into the laminar vortex shedding regime (White & Corfield 2006), is five, consistent with previous multi-scale expansion analysis near the critical Re (Bagheri 2013).
• The lobes of the stable Koopman modes in the type-I cluster are approximately 50 % wider than those in the type-II cluster.
• Similarity/colinearity among Koopman modes within each cluster is observed.
Such a finding was previously reported in the theoretical analysis of Bagheri (2013): a similarity in spatial structure exists among the Koopman modes belonging to the same family, even though the structures are clearly phase lagged.
• As Re increases from 70 to 130, mode shapes flatten downstream while expanding upstream.
• At Re = 70, the shear layer in the stable Koopman modes continues to grow within the computational domain. However, at Re = 100 and 130, the shear layer stops growing after a certain distance that is negatively correlated with Re.

Net contribution of clustered Koopman modes
By examining figures 22-24, we observe that the colinearity of the spatial structures within each cluster can cause cancellations. A famous example of such non-oscillatory cancellation is the 'shift mode' defined by Noack et al. (2003), in conjunction with two oscillating modes. As the 'shift mode' decays, the stationary component of the flow transitions from the unstable equilibrium to the time-averaged mean. The existence of such non-oscillatory decaying Koopman modes is also confirmed by weakly nonlinear analysis (Bagheri 2013). Interestingly, our algorithm is able to identify not only the non-oscillatory cancellation (from the 'shift mode') but also oscillatory cancellations from two clusters with distinct frequencies. Such cancellations elegantly explain why no unstable Koopman eigenvalues appear in this flow, given the coexistence of an attracting limit cycle and an unstable equilibrium. These modes can be understood as 'oscillatory shift modes', a generalization of the model proposed by Noack et al. (2003). Since modes within each cluster appear to be colinear with each other, it is intriguing to investigate the net contribution of each cluster. For the Re = 70 case, the effects of the different clusters at different times in the transient regime are shown in figure 26. We note the following observations throughout the transient period from t = 80 to t = 200.
• The net contribution from 'cluster 0' does not exhibit strong oscillations. Its U component shows a decrease in the length of the reverse flow region behind the cylinder with an increase in the low-speed wake layer thickness, while the V component remains unchanged. This is similar to the effect of the 'shift mode', which also characterizes the decay of recirculation behind the cylinder.
• Initially, at t = 80, the net contribution from 'cluster I' is rather weak, primarily due to the 'cancellation' from the lagged phase.
Furthermore, the development of vortex shedding downstream from the top and bottom surfaces of the cylinder is nearly parallel. This corresponds to the initial wiggling of the low-speed wake flow downstream. As time increases, the pattern of the corresponding vortices develops away from the centreline. spDMD captures the growth in x1 and x2 while missing a turnaround near the onset of the transient regime in x5 and x8. As expected, DMD with 20 POD coefficients performs the worst, especially for the modes where transient effects are dominant. Given the results in figure 27, among the top eight POD coefficients, x6 and x7 appear to be the most challenging to model: DMD and spDMD cannot match the limit cycle, while spKDMD performs very well. Notice that the frequency in x6 and x7 corresponds to St_D. Hence, there will be a difference between spDMD and spKDMD in predicting the fluctuation of C_D.
Finally, a comparison of the identified Koopman eigenvalues between DMD, spDMD and spKDMD is shown in figure 28. On the one hand, both spDMD and spKDMD exactly capture the stable eigenmodes that correspond to St_D and St_L. This is expected, since DMD with 200 POD coefficients represents the dynamics very well, and deviation from limit cycle behaviour would be penalized in spDMD. On the other hand, several erroneous stable modes are obtained by DMD. This explains the deviation of the a posteriori prediction from the ground truth limit cycle in figure 27. For the decaying modes, similarity/colinearity is observed between two clusters of eigenvalues from spDMD and spKDMD. However, spKDMD contains more high-frequency modes than spDMD. Finally, it is interesting to note that, although the correct stable eigenvalues are captured accurately by both spDMD and spKDMD, the former does not capture accurate amplitudes for the stable eigenvalues of type-II, as seen in figure 27.
As a side note, when the temporal mean was used instead of the maximal error in the definition of Q in (3.2), spKDMD with the above settings failed to find a stable eigenvalue corresponding to St_D.

Three-dimensional transient turbulent ship airwake
Understanding unsteady ship-airwake flows is critical to the design of safe shipboard operations, such as takeoff and landing of fixed- or rotary-wing aircraft (Forrest & Owen 2010), especially when the wind direction becomes stochastic. Here we obtain snapshots from an unsteady Reynolds-averaged Navier-Stokes (URANS) simulation of a ship airwake using FLUENT (Ansys 2016) with the shear-layer-corrected k-ω two-equation turbulence model. Unsteadiness arises both from bluff-body separation and from an abrupt change in wind direction. We consider a conceptual ship geometry called simple frigate shape 2 (SFS2); for details of the geometry, readers are referred to Yuan, Wall & Lee (2018). The configuration of the simulation set-up is shown in figure 29, where α∞ denotes the angle of the side wind.
To prepare a proper initial condition, a URANS simulation with U∞ = 15 m s⁻¹ and α∞ = 0°, i.e. no side wind, is conducted until a physical initial condition is reached. Following this, the last snapshot is used as the initial condition for a new run with an impulsive change in the wind direction from α∞ = 0° to α∞ = α0 = 5°. The boundary conditions at the outlet/inlet are pressure outlet/velocity inlet, while the top and bottom are set as symmetry planes for simplicity. No-slip conditions are applied on the surface of the ship. Further details of the simulation set-up are provided in Sharma et al. (2019).
The sampling time interval is Δt = 0.1 s, with 500 consecutive samples of the three velocity components. This corresponds to several flow-through times over the ship length. The domain of interest is a Cartesian region of mesh size 24 × 40 × 176 starting on the rear landing deck. For dimension reduction, the trajectory of the top 40 POD coefficients (temporal mean subtracted) is collected, preserving more than 99 % of the kinetic energy. Discrete-time KDMD with an isotropic Gaussian kernel is employed to perform the nonlinear Koopman analysis, where σ = 200 and r = 135 are chosen. Details of the hyperparameter selection are provided in § C.3.

Results of discrete-time KDMD with mode selection
First, the error analysis of the eigenfunctions is shown in figure 30, from which we choose L̂ ≈ 60 for good reconstruction. However, the level of deviation from linear evolution is around 10 %. This error is reflected later as a deviation in the a posteriori prediction on the testing trajectory. (This implies difficulties in finding an accurate yet informative Koopman operator with isotropic Gaussian kernels; however, choosing an optimal kernel type is beyond the scope of this work.) Second, the result of mode selection is summarized in table 3 (summary of mode selection for discrete-time KDMD on the ship airwake). Note that nearly 2/3 of the modes are removed. Furthermore, model performance in terms of a posteriori prediction on the testing trajectory is evaluated. A comparison between KDMD and the ground truth for contours of the velocity components on a special z-plane (1.2 m above the landing deck) is shown in figure 31. The following effects of an impulsive change in wind direction are observed from t = 1.5 s to t = 30 s and are well captured by spKDMD:
• growth of a large shear layer in U on the rear (left) side of the superstructure of the ship;
• a strong side-wind sweep in V above the landing deck propagating downstream;
• development of a vortex on the upwind (right) downstream side of the ship.
Furthermore, the three velocity components of the Koopman mode decomposition on the previously mentioned z-plane are shown in figure 32, together with isocontours of vorticity coloured by the velocity magnitude for the two stable harmonics. Note that the frequency is normalized using U∞ as the reference velocity and the funnel width of the ship, L = 3.048 m, as the characteristic length scale (Forrest & Owen 2010). As expected, spKDMD yields a large number of decaying modes with only three non-trivial stable harmonics, since significant transient effects are present in the data. From the Koopman mode decomposition in figure 32, we observe the following:
• modes with eigenvalues close to each other exhibit similar spatial structures;
• modes associated with higher frequencies are dominated by smaller scales;
• the stable harmonic mode near St = 0.09 is associated with a strong cone-shaped vortex originating from the upwind (right) rear edge of the superstructure of the ship;
• the stable harmonic mode near St = 0.068 corresponds to vortex shedding induced by the funnel;
• the slowly decaying mode near St = 0.022 shows unsteady circulation directly behind the superstructure;
• the steady mode (St = 0) is consistent with the large circulation behind the superstructure, the roll-up wind on the side of the landing deck, and the vertical suction towards the floor of the landing deck.

Comparison with spDMD
We again repeat the comparison between spKDMD and spDMD (Jovanović et al. 2014). Note that DMD on the first 40 POD modes performs poorly, similar to § 4.2.3, and is therefore not shown here. To make a fair comparison against the spKDMD of the previous subsection, however, we collect the first 200 POD coefficients for spDMD to ensure that a sufficient number of modes are used to fit the trajectory well. We then carefully choose the penalty coefficient in spDMD so that the same number of modes is retained as in spKDMD. As shown in figure 33, within the time horizon t < 50, a posteriori evaluation shows that spKDMD offers much improved predictions on the testing trajectory compared with spDMD (Jovanović et al. 2014). Moreover, as further illustrated in figure 34(a), the eigenvalues identified by spKDMD contain only two stable modes, while nearly all eigenvalues from spDMD are located near the unit circle, among which 30 out of 56 are slightly unstable. These unstable modes inevitably render the identified system numerically unstable when predicting beyond the training time horizon, whereas spKDMD predicts a 'physically consistent' limit cycle behaviour. As indicated in figure 34(b), this instability is related to the inability of the (linear) features to approximate the Koopman-invariant subspace: only eight modes are within 10 % of maximal deviation from linear evolution, compared with 60 modes for KDMD. We note that similar observations of drastically different eigenvalue distributions were reported in the original KDMD paper (Williams et al. 2014).

Conclusions
Two classical nonlinear approaches for the approximation of the Koopman operator, EDMD and KDMD, are revisited. From an algorithmic perspective, the main contributions of this work are (a) sparsity-promoting techniques based on a posteriori error analysis, and (b) multi-task learning techniques for mode selection that extend spDMD to these nonlinear variants. Furthermore, analytical relationships between spDMD, Kou's criterion and the proposed method are derived from the viewpoint of optimization. The algorithm (code available at https://github.com/pswpswpsw/SKDMD) is first evaluated in detail on a simple two-state dynamical system for which the Koopman decomposition is known analytically.
Since linear observables with time delays are sufficient to extract an informative Koopman-invariant subspace when one is interested only in the post-transient dynamics of a system on an attractor, the present techniques are evaluated on two fluid flows that involve strong transients: two-dimensional flow over a cylinder at different Reynolds numbers, and a three-dimensional (3-D) ship airwake. We demonstrate the effectiveness of discovering accurate and informative Koopman-invariant subspaces from data and of constructing accurate reduced-order models from the viewpoint of Koopman theory. Furthermore, with the proposed algorithm, the parametric dependency of the Koopman mode shapes on the Reynolds number is investigated for the cylinder flows. In this case, as Re increases from 70 to 130, the shapes of the stable modes become flatter downstream and larger upstream. Moreover, similarity of mode shapes between Koopman modes with similar eigenvalues is observed in both fluid flows. Specifically, five clusters of eigenvalues are observed in the case of the two-dimensional cylinder wake flow, consistent with the weakly nonlinear theoretical analysis of Bagheri (2013). The type-I and type-II clusters are found to correspond to fluctuations in lift and drag, respectively. We identify non-oscillatory as well as oscillatory cancellations from these two clusters with distinct frequencies. These modes can be understood as 'oscillatory shift modes', a generalization of the model proposed by Noack et al. (2003). For the 3-D ship-airwake case, two stable modes and one slowly decaying mode, with distinct frequencies and mode shapes resulting from vortex shedding, are extracted, and accurate predictive performance is observed in the transient regime.

Table 4. Computational complexity of each step in the proposed sparsity-promoting framework.
The computational complexity of each step of the algorithm is summarized in table 4, where n is the dimension of the system state, M is the number of snapshots in the training data (for conciseness, we assume the number of training snapshots equals the number of validation snapshots), r is the rank of the reduced system after SVD, L̂ is the user-defined cut-off for 'accurate' features and N_iter is the user-defined maximum number of iterations to achieve a residual threshold, e.g. 10^-12.
As shown in table 4, the theoretical computational complexity of the multi-task ElasticNet with an α sweep is O(N_α N_iter L^2 r). Note that this is a worst-case estimate: for every α except the first, we reuse the previous optimal solution as the initial condition for the new objective function, which warm-starts the iterative optimization. Also, thanks to the SVD-based dimension reduction, the cost scales linearly with the reduced system rank r. Moreover, the user-defined cut-off L for linearly evolving features keeps this cost from scaling with the number of snapshots M. Lastly, applying KDMD incurs a theoretical complexity that is cubic in the number of snapshots, which in a typical high-fidelity simulation is O(10^3); that is to say, r < 10^3 and L < 10^3. We note that computational efficiency can be further improved, but this is left for future work.
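The warm-started α sweep can be sketched as follows. This is a minimal sketch using scikit-learn's MultiTaskElasticNet on toy data; the matrix sizes, data and sweep range are illustrative assumptions, not the values used in the paper, where the feature matrix and multi-output target come from the EDMD/KDMD eigenfunctions.

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNet

rng = np.random.default_rng(0)
# Toy stand-in for an (M x L) feature matrix and an (M x r) multi-output target.
X = rng.standard_normal((50, 10))
W_true = np.zeros((10, 4))
W_true[:3] = rng.standard_normal((3, 4))          # only 3 informative features
Y = X @ W_true + 0.01 * rng.standard_normal((50, 4))

# warm_start=True reuses the previous solution as the initial guess for the
# next alpha, so only the first solve pays the full iteration cost.
model = MultiTaskElasticNet(alpha=1.0, l1_ratio=0.5, warm_start=True,
                            max_iter=10000)
coefs = []
for alpha in np.logspace(0, -4, 20):               # sweep alpha, large to small
    model.alpha = alpha
    model.fit(X, Y)
    coefs.append(model.coef_.copy())               # shape (n_targets, n_features)

# Row-group sparsity relaxes as alpha shrinks: more features become active.
active = [int((np.linalg.norm(c, axis=0) > 1e-8).sum()) for c in coefs]
```

The group (row-wise) penalty is what removes entire Koopman features across all outputs at once, rather than individual coefficients.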
Appendix C. Hyperparameter selection

C.1. Two-dimensional fixed-point attractor

We perform a grid search in parallel for the selection of σ and the truncated rank r over the range σ ∈ [10^{-1}, 10], with 80 points uniformly distributed in the log sense, and r = 36, 50, 70, to find a proper combination of r and σ. As shown in figure 35, a higher rank r leads to a larger number of linearly evolving eigenfunctions; the figure also shows that choosing a proper scale σ is more critical than choosing r. Considering the simplicity of this problem, σ = 2 and r = 36 suffice.
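The grid search itself can be sketched as below. The `score` function is a hypothetical placeholder for the quantity maximized in the paper (the number of linearly evolving 'accurate' eigenfunctions from a KDMD fit at scale σ and rank r); only the grid construction mirrors the text.

```python
import numpy as np
from itertools import product

# Hypothetical stand-in for the number of accurate eigenfunctions returned by
# a KDMD fit at kernel scale sigma and truncation rank r (peaks near sigma=2).
def score(sigma, r):
    return -abs(np.log10(sigma) - np.log10(2.0)) - 0.01 * r

sigmas = np.logspace(-1, 1, 80)   # 80 points, uniform in log10, over [1e-1, 10]
ranks = [36, 50, 70]

# Each (sigma, r) evaluation is independent, so the sweep parallelizes trivially.
best_sigma, best_r = max(product(sigmas, ranks), key=lambda p: score(*p))
```

In the actual workflow each grid point requires a full KDMD fit plus the pruning step, which is why the sweep is run in parallel.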

C.2. Cylinder flow case
As the flow is not turbulent, we choose hyperparameters for the Re = 70 case and fix them for all the other cases. We sweep σ over [1, 10^5] with 30 points uniformly distributed in the log sense and r = 120, 140, 160, 180, 200, as shown in figure 36. From the plot we choose r = 180 and σ = 3 as an approximate choice of the hyperparameters. Again, we observe that the number of accurate eigenfunctions first increases and then decreases as σ increases, and that the rank truncation saturates at higher σ, which is related to the variation in the characteristic scale of the features with respect to σ.

C.3. Turbulent ship-airwake case

A grid search in parallel for the selection of σ and r is performed over the range σ ∈ [1, 10^5] with 50 points uniformly distributed in the log sense and r = 40, 80, 120, 130. As shown in figure 37, a good choice of σ is 200 for the case of α_∞ = 5°. Note that, since the hyperparameter selection is performed with five-fold cross-validation on the training data, each fold contains only up to 166 × 0.8 ≈ 132 data points, i.e. the maximal possible rank is 132, whereas in the actual training the maximal rank is 166. As long as the system is well conditioned, the higher the rank, the richer the subspace. Here we take σ = 200 and r = 135.
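The cross-validation rank bound quoted above can be checked with a short sketch; the split below uses scikit-learn's KFold purely as an illustration of the fold sizes implied by five-fold cross-validation on 166 snapshots.

```python
import numpy as np
from sklearn.model_selection import KFold

M = 166                     # snapshots in the ship-airwake training set
train_sizes = [len(tr) for tr, _ in KFold(n_splits=5).split(np.arange(M))]

# A kernel Gram matrix built on a training fold has rank at most the fold
# size, hence the 166 * 0.8 ~= 132 bound on the achievable rank during CV.
max_rank_during_cv = min(train_sizes)
```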

Appendix D. Parallel data processing tool
The application of data-driven methodologies to increasingly large data sets has exposed the need for scalable tools. Method development is easily achieved in scripting environments, but for application to large multidimensional problems, memory and data I/O become limiting factors. To illustrate this, consider a state-of-the-art 3-D computation: with multiple variables associated with each grid cell, and meshes that can be expected to contain millions to hundreds of millions of cells, work-space memory requirements can reach terabytes. Utilizing available distributed tools and I/O strategies, a generalized toolset, PDP, has been developed for the application of complex methods to arbitrarily large data sets. Originally developed for use in reduced-order modelling, the generality of the toolset allows a range of complex methodologies to be applied; it was leveraged in this work because the problem size is intractable on desktop machines. The toolset builds on the commonly available ScaLAPACK, PBLAS and PBLACS (Blackford et al. 1997) routines; ScaLAPACK is widely available on most computing resources where the original simulations are run.
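The terabyte-scale estimate above follows from a back-of-the-envelope calculation; the cell, variable and snapshot counts below are illustrative assumptions, not values from a specific simulation.

```python
# Work-space memory for a snapshot data matrix stored in double precision:
# cells x variables x snapshots x 8 bytes per double.
cells = 100_000_000          # 'hundreds of millions of cells'
variables = 5                # e.g. density, three velocity components, pressure
snapshots = 500
bytes_total = cells * variables * snapshots * 8
terabytes = bytes_total / 1e12   # well beyond the memory of a desktop machine
```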