## 1. Introduction

Nearly all surfaces in fluid-related industries are rough at their operating Reynolds numbers. These rough surfaces easily alter the aero- or hydrodynamic properties of a fluid system and induce unfavourable consequences; for instance, reduced stability of aircraft, increased fuel consumption of cargo ships and reduced energy harvesting capacity of wind turbines (Gent, Dart & Cansdale 2000; Dalili, Edrisy & Carriveau 2009; Schultz *et al.* 2011). Therefore, it is crucial to restore or replace surfaces that have degraded past a critical level. This requires monitoring roughness on industrial surfaces and efficiently predicting their effect on the flow.

One of the most important effects of surface roughness on flow is the increased drag. This additional drag is generally expressed by the Hama roughness function (Hama 1954) or the equivalent sand-grain height. The Hama roughness function is the downward shift of the mean velocity profile induced by roughness. The equivalent sand-grain height is the roughness height from the Moody diagram (Moody 1944) that would cause the same drag as the rough surface of interest. It is worth noting that to calculate the equivalent sand-grain height, the Hama roughness function in the fully-rough regime – where the skin-friction coefficient is independent of Reynolds number – needs to be determined. To accurately determine the Hama roughness function of irregular surfaces, numerical simulations with fully resolved rough structures or experiments are needed. However, neither simulations nor experiments are cost-efficient, and hence neither is feasible for practical purposes. To alleviate the high costs, researchers have found empirical correlations between rough surface statistics and their contribution to drag (Flack & Schultz 2010; Chan *et al.* 2015; Forooghi *et al.* 2017; Flack, Schultz & Barros 2020). However, no universal model has emerged that is able to accurately predict the drag from a sufficiently wide range of roughness. This is mainly attributed to the high-dimensional feature space of irregular rough structures (Chung *et al.* 2021).

Neural networks are known to be capable of extracting patterns in high-dimensional spaces and therefore have been successfully used in various studies using fluid flow data (Lee & You 2019; Kim & Lee 2020; Fukami, Fukagata & Taira 2021). Jouybari *et al.* (2021) developed a multilayer perceptron (MLP) type neural network to find a mapping of 17 different rough surface statistics to equivalent sand-grain height. They reported a state-of-the-art performance in predicting equivalent sand-grain heights from rough surface statistics. A total of 45 data samples, obtained from direct numerical simulations (DNSs) and experiments, were used for training and validating their neural network. Having a large number of data samples to train a neural network is generally advantageous; nonetheless, constructing a fluid database that is sufficiently large to train a network from scratch imposes significant computational or experimental cost and is often considered impractical. Therefore, for the practical usage of neural networks in fluid-related applications, a framework that allows neural networks to learn a generalized mapping from a limited number of data samples is needed.

In this study, we propose a transfer learning framework to improve the generalization ability of neural networks for predicting the drag on rough surfaces when only a limited number of high-fidelity data samples are available. Transfer learning is a method that improves learning performance by adapting knowledge learned from one domain to another (Pan & Yang 2009; Chakraborty 2021). The method has been used, for example, for self-driving cars, which can be pre-trained with data from virtual car simulators (Martinez *et al.* 2017; Pan *et al.* 2017; Akhauri, Zheng & Lin 2020). The driving environment in virtual car simulators is inherently different from the real world; however, pre-training a neural network with ‘approximate knowledge’ can provide better initial weights for training real self-driving cars.

Similarly, for flow over rough surfaces, empirical correlations between drag and surface roughness provide an approximate knowledge of real flow physics (Colebrook & White 1937; Moody 1944; Flack & Schultz 2010; Chan *et al.* 2015; Forooghi *et al.* 2017; Flack *et al.* 2020). These correlations were developed by fitting large experimental and numerical datasets. However, it is not straightforward to make direct use of these datasets, in particular because information on surface topographies and flow statistics is not always accessible. Indeed, in many cases we have advanced our understanding of physics from the valuable information embedded in empirical correlations, such as those developed by Colebrook & White (1937) and Moody (1944).

The aim of this study is to show that transfer learning of empirical correlations can significantly improve the performance of neural networks for modelling drag on rough surfaces. Our aim is also to analyse a simple neural network with and without transfer learning to gain insight into the learning behaviour, which is strongly connected to the physics of the problem. The objective of the developed neural network model is to predict drag of a particular class of roughness contained in one high-fidelity dataset, which is composed of irregular homogeneous rough surfaces. However, we foresee that transfer learning will play a central role in combining empirical relations with larger datasets from a range of sources to develop predictive models of a significantly larger class of rough walls than those considered here. The paper is organized as follows: the details of the developed transfer learning framework including the neural networks and datasets are explained in § 2. The results of learning a mapping of surface statistics to the Hama roughness function using the transfer learning framework are shown in § 3. The effects of transfer learning are analysed and discussed in § 4, followed by the concluding remarks in § 5.

## 2. Transfer learning framework

The developed transfer learning framework is composed of two parts: (1) pre-training step and (2) fine-tuning step (figure 1). In the pre-training step, neural networks were trained to learn an ‘approximate’ knowledge of the mapping of surface statistics to the Hama roughness function. The training data are created by evaluating empirical correlation functions given the surface statistics of different surface topographies. In the fine-tuning step, neural networks were tuned to learn high-fidelity physics from a small DNS dataset by adapting their pre-trained ‘approximate’ knowledge to the domain of real physics. In § 3, we show that transfer learning of empirical correlations improves the generalization ability of neural networks when learning the drag of rough surfaces from a small amount of data. First, however, we explain the details of the neural networks and pre-training and fine-tuning steps.

### 2.1. Neural network

We used a variation of the MLP type neural network architecture developed by Jouybari *et al.* (2021). The input of the employed network is a set of 17 different rough surface statistics calculated from the surface topography of interest. The 17 input parameters are composed of eight primary surface statistics and nine products of the eight primary statistics. These parameters are selected based on their perceived importance in affecting drag on rough surfaces (Jouybari *et al.* 2021).

Let $x$, $y$ and $z$ be the streamwise, wall-normal and spanwise directions, $k(x,z)$ the roughness height distribution and $A_t$ the total roughness plan area. Then, one may define the following measures of an irregular rough surface:

In addition, we define the crest height $k_{c}$ and finally the fluid area at each wall-normal plane $A_{f}(y)$. From these quantities, the 17 statistical input parameters ($\{I_{1}, I_{2},\ldots, I_{17}\}$) can be calculated as listed in table 1. Here, $Sk$, $kur$, $por$, $ES_{x}$, $ES_{z}$, $inc_{x}$ and $inc_{z}$ indicate skewness, kurtosis, porosity, and effective slopes and average inclinations in $x$ and $z$ directions, respectively. Before training, the parameters $I_{1}$ to $I_{17}$ were normalized with the mean and standard deviation values calculated from the dataset of empirical correlations (see § 2.2 for the details of the empirical dataset).

The input parameters pass through three hidden layers with 18, 7 and 7 neurons, respectively (see figure 2). Leaky rectified linear unit (ReLU) activations ($\max (0.01x,x)$) were applied after the hidden layers. The output of the employed neural network was a scalar value of the Hama roughness function $\Delta U^{+}$. The superscript $+$ indicates the normalization by the viscous scale $\delta _{\nu }=\nu /u_{\tau }$, where $\nu$ and $u_{\tau }$ are the kinematic viscosity and the friction velocity, respectively. In this study, the friction Reynolds number $Re_{\tau }=u_{\tau } H/ \nu =500$ was used, where $H$ is the channel half-height. At the output layer, a Sigmoid activation function, $a/(1+\textrm {e}^{-x})$, was applied to bound the prediction value between 0 and $a$ ($a=20$ in this study).
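As an illustration of the layer sizes and activations described above, the following minimal Python sketch builds a network with the stated 17-18-7-7-1 layout and counts its weights. It assumes the standard logistic form $a/(1+\textrm{e}^{-x})$ for the output activation. The uniform random initialization, the zero biases and all function names are illustrative assumptions and do not reproduce the implementation used in the study.

```python
import math
import random

LAYER_SIZES = [17, 18, 7, 7, 1]  # 17 inputs, three hidden layers, scalar output
A_MAX = 20.0                     # output bound a


def leaky_relu(x, slope=0.01):
    """Leaky ReLU, max(0.01 x, x)."""
    return max(slope * x, x)


def scaled_sigmoid(x, a=A_MAX):
    """Logistic function scaled so that the output lies in (0, a)."""
    return a / (1.0 + math.exp(-x))


def init_params(seed=0):
    """Assumed initialization: small uniform random weights, zero biases."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def forward(params, inputs):
    """Propagate one vector of 17 normalized statistics through the network."""
    h = list(inputs)
    last = len(params) - 1
    for idx, (weights, biases) in enumerate(params):
        z = [sum(w * x for w, x in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        activation = scaled_sigmoid if idx == last else leaky_relu
        h = [activation(v) for v in z]
    return h[0]


def count_weights(params):
    """Number of weight-matrix entries, excluding biases."""
    return sum(len(weights) * len(weights[0]) for weights, _ in params)


params = init_params()
n_weights = count_weights(params)
prediction = forward(params, [0.0] * 17)
```

The weight count is $17\times 18 + 18\times 7 + 7\times 7 + 7\times 1 = 488$, which agrees with the value of $N_{w}$ quoted in § 4.1.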

### 2.2. Pre-training step

In the pre-training step, neural networks were trained with a large dataset obtained from empirical correlation functions between rough surface statistics and the Hama roughness function $\Delta U^{+}$. We used three correlations:

and

Here, (EMP1) was proposed by Chan *et al.* (2015), (EMP2) by Forooghi *et al.* (2017) and (EMP3) by Flack & Schultz (2010) and Flack *et al.* (2020).

As mentioned earlier, we will develop neural networks to predict the drag of rough surfaces contained in a small high-fidelity dataset (see § 2.3 for the details of the DNS dataset). The range of the roughness function in this dataset was $5.3<\Delta U^{+}<8.3$. The lower bound lay at the boundary between transitionally and fully rough regimes. Equation (EMP1) was derived using surfaces in the transitionally and fully rough regimes with $0.4<\Delta U^{+}<11.4$. Equation (EMP2) was modelled from surfaces with roughness function in the range of $5.4<\Delta U^{+}<10.1$. Finally, (EMP3) was developed with rough surfaces in the fully rough regime, where most of the surfaces became fully rough when $\Delta U^{+}$ was over $5.5 - 6.0$. Therefore, all three empirical models contained ‘approximate’ knowledge of the drag of rough surfaces that could help the neural networks to learn the DNS dataset in the fine-tuning step.

A large dataset was constructed by randomly generating 10 000 irregular rough surfaces and calculating the associated Hama roughness function using (EMP1), (EMP2) and (EMP3). The surfaces were constructed using the Fourier-filtering algorithm suggested by Jacobs, Junge & Pastewka (2017). The algorithm generated self-affine surfaces with a pre-defined power spectrum $C$,

where $q$ is the wavenumber and $h$ the Hurst exponent. The Hurst exponent $h=0.8$ was used as in Jacobs *et al.* (2017). This power-law dependence is a well-known attribute of self-affine realistic surfaces (Mandelbrot 1982; Persson *et al.* 2004). Figure 3(*a*) shows a few examples of the generated random surfaces. The randomness was imposed by choosing random amplitudes of the power spectrum and random phase shifts of Fourier modes. Accordingly, the surface statistics, such as the skewness and the effective slope, depended on the particular combination of the random amplitudes and phases of Fourier modes.

For training and validating the neural network, 7000 and 3000 data samples were used, respectively. Figure 3(*b*) shows the scatter plots of the Hama roughness functions, calculated by (EMP1), (EMP2) and (EMP3), against $k_{rms}^{+}$, $Sk$ and $ES_{x}$ from the 3000 validation data samples. The surface generation method based on the power spectrum $C$ generated surfaces with skewness bounded between $-1$ and $+1$ and effective slopes limited to ${\lesssim }0.35$. The bounded skewness and limited effective slopes arose from the wavy surfaces generated by Fourier modes and the self-affine power spectrum defined for realistic surfaces, as in Jacobs *et al.* (2017). The random surfaces were thus considered to be in the low-slope regime, where the drag tends to increase with increasing effective slope (Flack & Schultz 2014). From figure 3(*b*), we observed different distributions, or knowledge, of the Hama roughness function from the different empirical models. The aim is to seed this diverse knowledge of empirical correlations in neural networks during the pre-training step.

Three neural networks with the same architecture and the same initial weights and biases were trained simultaneously to learn the different empirical models, (EMP1) to (EMP3). Let $NN^{i}$ be the neural network trained with the $i$th empirical model, $W^{i}_{j,k}$ the weight matrix connecting the $j$th and $k$th layers of $NN^{i}$, $N_{w}$ the total number of weights in $NN^{i}$, $b^{i}_{k}$ the bias vector added in the $k$th layer, $\Delta U^{+}_{i}$ the Hama roughness function predicted by the $i$th empirical model and $\Delta \tilde {U}^{+}_{i}$ the Hama roughness function predicted by $NN^{i}$. The neural networks were trained to minimize a loss for pre-training:

The first term on the right-hand side is the mean-squared-error loss and the second is the weight regularization loss, also used in Jouybari *et al.* (2021). The sizes of the weight matrix $W^{i}_{j,k}$ and bias vector $b^{i}_{k}$ are $n_{j} \times n_{k}$ and $n_{k}$, respectively, where $n_{j}$ and $n_{k}$ are the number of neurons on the $j$th and $k$th layers.

The weights and biases of the neural networks were updated in the direction of minimizing the pre-training loss $L_{p}$ on the training data samples through the Adam optimizer (Kingma & Ba 2014). The pre-training loss $L_{p}$ on the validation data samples was also evaluated after each iteration of updates. Batch sizes of 16 and 32 were used for training and validating, respectively. To provide a wide distribution of data samples in the batches, i.e. to prevent the neural networks from learning biased data focused on average characteristics, we required each batch to contain 50 % of its data samples with $\Delta U^{+}$ in the range of 6 to 8, 25 % with $\Delta U^{+}$ in the range of 4 to 6 and the remaining 25 % with $\Delta U^{+}$ in the range of 8 to 10. After a sufficient number of iterations for training and validating, the set of networks ($\{NN^{1},NN^{2},NN^{3}\}$) that produced the lowest pre-training loss on the validation data samples was chosen to be merged into a single pre-trained network $NN^{pre}$. The learned sets of weights and biases in the three neural networks, $\{W^{1}_{j,k},W^{2}_{j,k},W^{3}_{j,k}\}$ and $\{b^{1}_{k},b^{2}_{k},b^{3}_{k}\}$, were averaged into the weights and biases of $NN^{pre}$, $W^{pre}_{j,k}$ and $b^{pre}_{k}$, as $W^{pre}_{j,k} = \sum _{i=1}^{3} W^{i}_{j,k}/3$ and $b^{pre}_{k} = \sum _{i=1}^{3} b^{i}_{k}/3$. These pre-trained weights and biases were employed as the initial weights and biases for the fine-tuning step (§ 2.3) to transfer the knowledge of empirical correlations to the domain of high-fidelity physics.
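The batch composition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the half-open bin edges, the sampling without replacement within a batch, the synthetic stand-in data and the names `BINS` and `make_batch` are all hypothetical, and samples falling outside the three ranges are simply never drawn here.

```python
import random

# (lower bound, upper bound, fraction of the batch) for each Delta U+ range
BINS = [(6.0, 8.0, 0.50), (4.0, 6.0, 0.25), (8.0, 10.0, 0.25)]


def make_batch(samples, batch_size, rng):
    """Draw one batch of (features, delta_u) pairs with fixed per-range quotas."""
    batch = []
    for lower, upper, fraction in BINS:
        # Assumed convention: a sample belongs to a range if lower <= value < upper
        pool = [s for s in samples if lower <= s[1] < upper]
        batch.extend(rng.sample(pool, int(batch_size * fraction)))
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
# Synthetic stand-in for the training set: (features placeholder, Delta U+ value)
samples = [(None, rng.uniform(4.0, 10.0)) for _ in range(7000)]
batch = make_batch(samples, 16, rng)
```

With a training batch size of 16, the quotas correspond to 8, 4 and 4 samples from the three ranges.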

A key factor for the performance of the fine-tuned neural network is the content of the database used in the pre-training step. If the bound of $\Delta U^{+}$ in one's high-fidelity dataset is known *a priori*, then using a narrow range of $\Delta U^{+}$ for pre-training increases the transfer learning performance significantly. We provide an example of this by training the neural networks with a bound of $\Delta U^{+}$ in Appendix A. In the main text, we assume that the range of $\Delta U^{+}$ of the high-fidelity dataset is *a priori* unknown.

### 2.3. Fine-tuning step

A total of 35 DNSs of channel flow over rough surfaces at $Re_{\tau }\approx 500$ were conducted to construct a high-fidelity dataset. The 35 rough surfaces were generated following the method proposed by Pérez-Ràfols & Almqvist (2019). Figure 4(*a*) shows a few examples of the generated surfaces. Note that different surface generation methods were used to construct the DNS and empirical datasets (§ 2.2). The method used for DNS imposed an additional constraint on the roughness height probability distribution. This constraint enabled the generation of rough surfaces with distinctive characteristics. For the probability distribution, we used Weibull distributions of the form:

By using different shape ($s$) and scale ($\lambda$) parameters, rough surfaces with non-zero skewness were generated. Gaussian distributions were used to generate rough surfaces with zero skewness. Rough surfaces generated with the Weibull height distribution imitate the roughness caused by wear (Pérez-Ràfols & Almqvist 2019). By using different surface generation methods in the pre-training and fine-tuning steps, we can verify that pre-training on one class of rough surfaces can help learn the drag on a different class of rough surfaces. Such verification is important, as in practice, artificially generated rough surfaces for pre-training would not perfectly imitate real-world rough surfaces.
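To illustrate how the shape parameter controls skewness, the sketch below draws heights from the standard two-parameter Weibull distribution, with density $(s/\lambda )(k/\lambda )^{s-1}\exp [-(k/\lambda )^{s}]$ for $k\geq 0$, and estimates the sample skewness. This standard form is an assumption about the notation of the elided equation. The sketch samples independent heights only; it does not reproduce the surface generation method of Pérez-Ràfols & Almqvist (2019), which also controls the spatial correlation, and the function names are illustrative.

```python
import random


def sample_skewness(values):
    """Third standardized moment of a sample (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def weibull_heights(shape, scale, n, seed=0):
    """Independent Weibull-distributed heights (no spatial correlation)."""
    rng = random.Random(seed)
    # Note the argument order of random.weibullvariate: (scale, shape)
    return [rng.weibullvariate(scale, shape) for _ in range(n)]


# Shape 1 is the exponential distribution (strongly positively skewed);
# a shape near 3.6 gives a nearly symmetric height distribution.
sk_small_shape = sample_skewness(weibull_heights(shape=1.0, scale=1.0, n=20000))
sk_large_shape = sample_skewness(weibull_heights(shape=3.6, scale=1.0, n=20000))
```

The scale parameter only stretches the heights and leaves the skewness unchanged, so in this sketch the skewness is set by the shape parameter alone.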

Details of the methodology and validation of the conducted DNS are reported in the study of Yang *et al.* (2021). Here, we provide a brief overview. A pseudo-spectral incompressible Navier–Stokes solver, SIMSON (Chevalier, Lundbladh & Henningson 2007) with an immersed boundary method (Goldstein, Handler & Sirovich 1993), was employed for the channel simulations. Periodic boundary conditions were used in the streamwise and spanwise directions. No-slip conditions were applied at the top and bottom planes of the channel. Roughness effects were added to both the top and bottom planes. A minimal channel approach was adopted to find a small computational domain size that could accurately reproduce the Hama roughness function calculated from a full channel simulation (Chung *et al.* 2015). Further details on the domain size and grid resolution are provided in Appendix B.

The scatter plots of the calculated Hama roughness function on the 35 data samples against the surface statistics involved in calculating the input of neural networks are shown in figure 4(*b*). The generated surfaces in the fine-tuning and pre-training steps showed moderately different characteristics. For instance, the surfaces generated in the fine-tuning step showed higher limits of the effective slopes (up to ${\sim }0.8$) compared with those in the pre-training step (up to ${\sim }0.35$). However, note that despite the larger effective slope, the surfaces were not in the regime where drag tends to decrease with increasing effective slope (Flack & Schultz 2014). The 35 data samples were split into 6, 4 and 25 data samples for training, validation and test, respectively. More than 70 % of the data samples were used for testing, i.e., not used during the fine-tuning step, to fairly evaluate the generalization ability of the developed neural networks. Note that a total of 10 data samples were used in the fine-tuning step and these fine-tuning data samples were completely separated from the test data samples. The data used for fine-tuning and testing the neural networks are provided in table 2. The surface statistics and the Hama roughness function in the test dataset are distributed widely in the total dataset as shown by the $\times$ markers in figure 4(*b*).

To reduce the uncertainty arising from splitting a very small dataset into training and validation sets, we adopted an approach based on averaging many neural networks. More specifically, we employed 210 neural networks, $\{NN^{1}, NN^{2},\ldots, NN^{210}\}$, to learn from the 210 different data combinations derived by selecting six training and four validation data samples from the ten fine-tuning data samples ($\binom{10}{6} = \binom{10}{4} = 10!/(6!\,4!) = 210$). The weights and biases of these 210 neural networks ($\{W^{1}_{j,k},W^{2}_{j,k},\ldots,W^{210}_{j,k}\}$ and $\{b^{1}_{k},b^{2}_{k},\ldots, b^{210}_{k}\}$) were initialized by the pre-trained weights and biases, $W^{pre}_{j,k}$ and $b^{pre}_{k}$. Similar to the pre-training step, the weights and biases of each neural network $NN^{i}$ were updated by the Adam optimizer that minimizes a loss for fine tuning $L^i_{f}$,
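The 210 training and validation combinations can be enumerated directly. The short Python sketch below does so for ten sample indices; the index labels and variable names are illustrative assumptions.

```python
from itertools import combinations

fine_tune_ids = list(range(10))  # indices of the ten fine-tuning samples

splits = []
for train_ids in combinations(fine_tune_ids, 6):
    # The four samples not chosen for training form the validation set
    val_ids = tuple(i for i in fine_tune_ids if i not in train_ids)
    splits.append((train_ids, val_ids))
```

Each entry of `splits` would then define the data seen by one of the networks $NN^{i}$.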

Here, $\Delta U^{+}$ is the ground truth and $\Delta \tilde {U}_{i}^{+}$ is the roughness function predicted by $NN^{i}$. After a sufficient number of updates, the networks with the smallest fine-tuning loss on the validation data samples were chosen to predict the roughness function $\Delta U^{+}$ on test data (§ 3).

## 3. Prediction of roughness functions

We compare the predicted Hama roughness functions on the 25 test data samples that were not included in the fine-tuning step. The roughness functions were calculated by the following prediction methods: (1) neural networks trained with transfer learning (NNTF); (2) neural networks trained without transfer learning (NN) and (3) empirical correlations (EMP). Neural networks in NN were trained without the pre-training step, i.e. the weights and biases of the networks were randomly initialized instead of being initialized by $W^{pre}_{j,k}$ and $b^{pre}_{k}$. Our focus here is on the comparison between NNTF and NN to demonstrate the effects of transfer learning. However, we also include EMP predictions in the comparison because these methods are widespread in the literature.

The performance of neural networks can vary greatly depending on the selection of data for training and validation. This is because it is generally hard to form a distribution covering the entire feature space from a limited number of data samples. By ensemble averaging the predictions from neural networks learned from different data combinations, we can partly alleviate this problem. The ensemble predictions from NNTF and NN are thus obtained by averaging the prediction values of the 210 neural networks as

Similarly, we use ensemble predictions of the empirical models $\{$(EMP1), (EMP2), (EMP3)$\}$ in EMP as

where $\Delta \tilde {U}_{i, emp}^{+}$ is the Hama roughness function predicted by the $i$th empirical model.
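Written out, the two ensemble averages described above would take the following form. This is a reconstruction from the surrounding description, assuming equal weighting of all members; the symbol $\Delta \tilde {U}_{emp}^{+}$ for the averaged empirical prediction is introduced here for illustration only.

$$
\Delta \tilde {U}^{+} = \frac{1}{210}\sum _{i=1}^{210} \Delta \tilde {U}_{i}^{+},
\qquad
\Delta \tilde {U}_{emp}^{+} = \frac{1}{3}\sum _{i=1}^{3} \Delta \tilde {U}_{i, emp}^{+}.
$$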

Figure 5 shows the maximum and average errors calculated from the ensemble predictions by NNTF, NN, and EMP. The predictions from NN exhibited average and maximum errors of 7.62 % and 23.05 %, respectively, while the corresponding errors associated with the predictions from NNTF were 6.25 % and 14.15 %. Note that both NN and NNTF are performing better than EMP here. The most significant advantage of using transfer learning is the reduction of the maximum error, which decreases by nearly nine percentage points (from 23.05 % to 14.15 %) when NNTF is used instead of NN. As we will discuss more quantitatively in the next section, this error reduction demonstrates the capability of NNTF in learning a better generalized mapping from a small amount of DNS data.

To further investigate transfer learning, the best and worst performing neural networks out of the 210 networks in NNTF and NN were extracted and compared in figure 6. The errors from the best performing networks in NNTF and NN (figure 6*a*) showed a similar trend as the ensemble predictions (figure 5); both average and maximum errors were clearly reduced by using transfer learning. We also note that the ensemble predictions from NNTF (figure 5) performed marginally better compared with both the best performing neural network in NN and empirical correlation in EMP (figure 6*a*).

The advantage of using transfer learning was more clearly demonstrated in the worst performing case (figure 6*b*). The maximum error of the worst performing neural network in NN was nearly 60 %, which showed that this network failed to provide a reliable prediction. However, the network in NNTF exhibited significantly smaller errors, which indicated that transfer learning was helping the network to provide a safer prediction even in the worst scenario. Therefore, these results showed that transfer learning enables the networks to learn a more generalized mapping from limited DNS data by adapting an ‘approximate’ knowledge from empirical correlations. It is also important to emphasize that ensemble predictions should be used in practice to reduce the uncertainty arising from a limited pool of data as can be seen from the wide range of errors that occurred with the best and worst performing neural networks in figure 6.

## 4. Analysis of transfer learning and approximate knowledge

In the previous section, we demonstrated that transfer learning of empirical correlations improves the learning performances of neural networks. In this section, we analyse the effects of transfer learning and why ‘approximate knowledge’ in the empirical correlations can help neural networks.

### 4.1. Effects of transfer learning on weight generalization

In this section, we investigate how transfer learning improves the generalization ability by characterizing the learned weights of the networks. Let $w^{i}_{l}$, with $l=1,2,\ldots,N_{w}$, be the vectorized elements of the weight matrices $W^{i}_{j,k}$ in $NN^{i}$ ($N_{w}=488$ in this study). Then, we define the deviation of weights $w^{i}_{l}$ from different $NN^{i}$ as

where $\mu _{l}= \sum _{i=1}^{210}w^{i}_{l}/210$. Note that the networks in NNTF were initialized with the pre-trained weights, and the networks in NN were initialized with the same random weights. Therefore, the deviation $\sigma _{l}$ indicates how a weight in a neural network is updated differently for different combinations of training and validation data samples (i.e. different $NN^{i}$). In the ideal case with an infinite amount of data samples, the deviation will approach zero as the network converges to specific weights representing a unique generalized mapping of the real physics.
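A minimal Python sketch of this per-weight statistic is given below. It assumes that $\sigma _{l}$ is the population standard deviation of weight $l$ across the networks; the normalization used in the study may differ, and the toy data and function name are illustrative.

```python
from statistics import fmean, pstdev


def weight_deviation(weight_vectors):
    """Return (mu, sigma) per weight index l.

    weight_vectors[i][l] is the l-th vectorized weight of network i.
    """
    per_weight = list(zip(*weight_vectors))  # transpose: one tuple per index l
    mu = [fmean(ws) for ws in per_weight]
    # Assumed definition: population standard deviation across networks
    sigma = [pstdev(ws) for ws in per_weight]
    return mu, sigma


# Toy example: three networks with two weights each
toy_weights = [[0.1, 1.0], [0.1, 2.0], [0.1, 3.0]]
mu, sigma = weight_deviation(toy_weights)
```

In the toy example the first weight is identical in all networks, so its deviation vanishes, whereas the second weight depends on the network and has a non-zero deviation.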

The distributions of the deviation $\sigma _{l}$ calculated from NNTF and NN are compared in figure 7. A number of weights with large $\sigma _{l}$ were observed in NN, which indicated that the updates of weights largely deviated depending on the data. This implied that the current data were insufficient for NN to learn a generalized mapping. However, the deviations were significantly reduced in NNTF, which implied that the networks in NNTF were learning a better-generalized mapping compared with NN. This was because the networks in NNTF were initialized with the pre-trained weights containing the ‘approximate’ knowledge. These weights provided a better initial state for the weight optimization problem in the fine-tuning step.

### 4.2. Effects of transfer learning on sensitivities of input surface statistics

This section investigates how transfer learning affects the neural networks in learning the sensitivities of the input surface statistics. The sensitivities of input surface statistics in predicting drag were obtained by calculating the derivatives of drag with respect to each input (surface statistical measure). A similar analysis was employed by Kim & Lee (2020) for evaluating the influences of flow variables in predicting heat transfer characteristics. We define the sensitivity of the input $I_{i}$ in predicting drag $\Delta \tilde {U}^{+}$ (defined in (3.1)) as

Here, $\langle \cdot \rangle$ denotes averaging of the calculated derivatives over the 25 test DNS data samples. The derivative ${\partial \Delta \tilde {U}^{+}}/{\partial I_{i}}$ can be calculated analytically, as neural networks are composed of differentiable functions and matrix multiplications. It is more efficient, however, to use an automatic differentiation algorithm. In this study, we used the automatic differentiation algorithm implemented in PyTorch (Paszke *et al.* 2017).
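The idea of averaging input derivatives over a set of samples can be illustrated without a deep learning library. The sketch below swaps automatic differentiation for central finite differences and uses a hypothetical two-input function in place of a trained network. It assumes the sensitivity is the plain sample mean of the derivative; any absolute value or normalization in the original definition is not reproduced.

```python
def central_difference(f, x, i, h=1e-5):
    """Approximate the partial derivative of f with respect to x[i] at x."""
    up = list(x)
    down = list(x)
    up[i] += h
    down[i] -= h
    return (f(up) - f(down)) / (2.0 * h)


def mean_sensitivities(f, samples):
    """Average the derivative of f with respect to each input over all samples."""
    n_inputs = len(samples[0])
    return [
        sum(central_difference(f, x, i) for x in samples) / len(samples)
        for i in range(n_inputs)
    ]


def toy_model(x):
    """Hypothetical stand-in for a trained network with two inputs."""
    return 2.0 * x[0] + x[0] * x[1]


toy_samples = [[0.0, 1.0], [1.0, 3.0]]
sensitivities = mean_sensitivities(toy_model, toy_samples)
```

For a real network, an automatic differentiation library returns the same derivatives exactly and in a single backward pass, which is why it is preferred over finite differences.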

Figure 8 shows the sensitivities of the input surface statistics when trained with NNTF and NN. It was observed that for both neural networks, $I_{12}=ES_{z}\times Sk$, $I_{10}=ES_{x}\times Sk$ and $I_{6}=por$ had the highest sensitivities. In other words, these were the most important surface measures when it came to predicting the friction drag for the particular class of rough surfaces contained in the high-fidelity dataset. These statistics are not explicitly included in the empirical models (EMP1), (EMP2) and (EMP3). Therefore, this indicates that the neural networks were learning the mapping of surface statistics to drag beyond the empirical models. However, the importance of $I_{12}$, $I_{10}$ and $I_{6}$ was partially implied by previous studies. The importance of the product of effective slope and skewness was partly implied in the study by Forooghi *et al.* (2017), as their empirical model (EMP2) was nonlinear with respect to skewness and effective slope. The importance of porosity in predicting drag was recently emphasized by Jouybari *et al.* (2021). Although the result shown here was for a particular class of rough surfaces, it demonstrated the necessity of expanding the feature space of surface statistics for drag prediction to more complex nonlinear combinations of the most basic statistical moments.

To understand what influence transfer learning has on sensitivities, we can identify from figure 8 the input statistics that are ranked differently by NN and NNTF in terms of importance. The surface statistics related to effective slopes and skewness, e.g. $I_{12}=ES_{z}\times Sk$, $I_{10}=ES_{x}\times Sk$, $I_{2}=Sk$, $I_{4}=ES_{x}$ and $I_{5}=ES_{z}$, were found to be more sensitive in NNTF. These higher sensitivities in NNTF mainly arose from the inclusion of the approximate knowledge about the drag dependencies on effective slope and skewness in the empirical correlations. Conversely, the sensitivity of the input defined as the square of skewness ($I_{17}=Sk \times Sk$) was smaller in NNTF compared with NN. This indicated that NNTF learns that the square of skewness compromises the information about the sign of skewness, which is known to be important for drag prediction (Jelly & Busse 2018). In addition, NNTF reduces the sensitivities of statistics related to kurtosis ($I_{3}=kur$, $I_{13}=ES_{x}\times kur$ and $I_{11}=ES_{z}\times kur$). As kurtosis measures the outliers (DeCarlo 1997), the reduction indicates that the effects of the outlier roughness heights are relaxed in NNTF.

### 4.3. Approximate knowledge in the empirical correlations

As briefly introduced in § 2.2, the empirical correlations contain different types of approximate knowledge. This is because the empirical correlations were fitted to different types of surfaces (Chung *et al.* Reference Chung, Hutchins, Schultz and Flack2021): (EMP1) was derived from data of a pipe with sinusoidal irregular roughness; (EMP2) was developed from data of rough surfaces with different arrangements and size distributions of roughness elements; and (EMP3) was constructed from a wide range of realistic rough surfaces, such as gravel and commercial steel pipes. As a consequence, each of the relations (EMP1), (EMP2) and (EMP3) depends on a different measure of roughness height ($Ra^{+}$, $k_{c}^{+}$ or $k_{rms}^{+}$) and different combinations of surface statistics ($Sk$ and/or $ES_{x}$). Figure 9 shows contour levels of the drag predicted by (EMP1), (EMP2) and (EMP3) in a map spanned by skewness $Sk$ and effective slope $ES_x$. Note that the roughness heights for (EMP1)–(EMP3) ($Ra^{+}$, $k_{c}^{+}$ and $k_{rms}^{+}$) were set to the respective values that produced $\Delta U^{+}=-3.5$ at a common location ($ES_{x} = 0.3$ and $Sk = 0$) as in Chung *et al.* (Reference Chung, Hutchins, Schultz and Flack2021).

The correlation (EMP1) contains the approximate knowledge that drag tends to increase with increasing effective slope, while it does not contain any dependence of drag on skewness. This is because (EMP1) was derived from fitting data with a narrow range of skewness, as shown by the black $\circ$ markers in EMP1 of figure 9. Conversely, the correlation (EMP3) contains the knowledge that drag tends to increase with increasing skewness, without a dependence on the effective slope. This is because the majority of the fitting data for (EMP3) (black $\circ$ markers in EMP3 of figure 9) lies in an insensitive regime of effective slope – between the sparse regime ($ES<0.3 - 0.6$) and dense regime ($ES>0.4 - 3.0$) – with relatively small effects on drag (Jiménez Reference Jiménez2004). Equation (EMP2) shows a nonlinear dependence of drag on the skewness and the effective slope. As a result, this model includes approximate knowledge about the drag dependence on the product of skewness and effective slope, as also discussed in § 4.2.

As the empirical correlations are derived by fitting data from a particular set of surfaces, a naive extrapolation of them to predict the drag of other sets of rough surfaces may lead to large errors. For example, if the correlations are directly applied to predict drag on the current test surfaces of the DNS dataset (red $\times$ markers in figure 9), the average and maximum errors of (EMP1), (EMP2) and (EMP3) are ($17\,\%$, $29\,\%$), ($8\,\%$, $17\,\%$) and ($25\,\%$, $44\,\%$), respectively. Note that the errors from (EMP2) are the smallest owing to the similar surface spaces between the current DNS data and the fitting data. Interestingly, large errors of the empirical models do not necessarily lead to inaccurate fine-tuned neural networks. To demonstrate this, we fine-tuned neural networks pre-trained with only one empirical correlation. We found that the average and maximum errors of the ensemble networks fine-tuned from (EMP1), (EMP2) and (EMP3) were ($6\,\%$, $18\,\%$), ($6\,\%$, $17\,\%$) and ($7\,\%$, $14\,\%$), respectively. Despite the high errors of the empirical correlations (EMP1) and (EMP3), their resulting fine-tuned networks were as effective as the fine-tuned network from (EMP2). This indicated that an empirical model for pre-training does not necessarily need to be accurate; it merely needs to provide approximate knowledge of the dependencies on surface statistics. Moreover, NNTF, which is pre-trained with all of the approximate knowledge provided by the three empirical models, achieves a more favourable overall performance (figure 5) compared with the fine-tuned networks pre-trained with any single EMP. This is because each empirical model contributes its own information on the drag dependencies. This can be qualitatively shown by visualizing knowledge domains.

Figure 10 compares the knowledge domains of high-fidelity physics (DNS domain), approximate knowledge in the empirical models (EMP domain) and non-physical knowledge from the randomly initialized neural networks (non-physics domain). To visualize the knowledge domains, we extracted principal axes of the data samples composed of $(ES_{x}, Sk, \Delta U^{+})$ in each domain. Note that $\Delta U^{+}$ in the DNS, EMP and non-physics domains are obtained from simulations, empirical models and randomly initialized neural networks, respectively. The DNS domain is composed of rough surfaces in the DNS dataset, while the EMP and non-physics domains are composed of rough surfaces in the empirical dataset. After computing the principal axes in each domain, ellipsoids that approximately bound the DNS, EMP and non-physics domains along their principal axes are visualized. As shown in figure 10(*a*), no particular correlation (or rather alignment) can be found between the non-physics domain and the DNS domain. However, an alignment between the EMP domain and the DNS domain is observed in figure 10(*b*). Therefore, it is expected that neural networks with approximate knowledge from the EMP domain can more easily adapt to the DNS domain compared with those without any physical knowledge.

We also quantified the alignment between the knowledge domains by calculating the angles between the principal axes. The angles between the principal axes in the EMP and DNS domains were computed as ($16.5^{\circ }$, $18.1^{\circ }$, $8.9^{\circ }$), while the angles between the principal axes in the non-physics and DNS domains were computed as ($120.9^{\circ }$, $119.9^{\circ }$, $12.3^{\circ }$). Accordingly, the aligned ‘approximate knowledge’ from EMPs helped the networks to adapt to the DNS domain; thus, a better performance of neural networks was achieved by transfer learning of empirical correlations.
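The alignment measure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `principal_axes` and `axis_angles` are hypothetical, the samples are used without any normalization (an assumption, as the scaling of $(ES_{x}, Sk, \Delta U^{+})$ is not specified here), and the demonstration data are synthetic points chosen only so that the expected angles are known. Because the sign of a principal axis is arbitrary, the sketch optionally folds angles into $[0^{\circ}, 90^{\circ}]$.

```python
import numpy as np


def principal_axes(samples):
    """Principal axes of an (n, 3) sample array, one axis per row,
    ordered by decreasing variance along the axis."""
    centered = samples - samples.mean(axis=0)
    # Rows of vt are the right singular vectors, i.e. the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt


def axis_angles(axes_a, axes_b, fold=False):
    """Angles in degrees between corresponding unit axes of two domains.
    With fold=True, angles are mapped to [0, 90] to remove the sign
    ambiguity of principal axes."""
    cosines = np.einsum("ij,ij->i", axes_a, axes_b)
    angles = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
    return np.minimum(angles, 180.0 - angles) if fold else angles


# Synthetic demonstration data (not from the paper): six points spread
# along the coordinate axes, and a copy rotated by 30 degrees about the
# third axis.
base = np.array([[3.0, 0, 0], [-3.0, 0, 0],
                 [0, 2.0, 0], [0, -2.0, 0],
                 [0, 0, 1.0], [0, 0, -1.0]])
theta = np.radians(30.0)
rotation = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                     [np.sin(theta), np.cos(theta), 0.0],
                     [0.0, 0.0, 1.0]])
rotated = base @ rotation.T

angles = axis_angles(principal_axes(base), principal_axes(rotated), fold=True)
```

In this synthetic case the first two axes differ by the imposed rotation angle and the third axis is unchanged. Whether to fold the angles is a design choice; unfolded angles depend on the axis orientation returned by the decomposition.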

## 5. Conclusions

We have developed a transfer learning framework to learn the Hama roughness function from a limited number of DNS data samples. The framework is composed of two steps: (1) a pre-training step and (2) a fine-tuning step. In the pre-training step, neural networks learn ‘approximate’ knowledge from empirical correlations. In the fine-tuning step, the networks are fine-tuned using a small DNS dataset. Neural networks trained with transfer learning show a significant improvement in predicting the roughness function on test data samples not included in the fine-tuning step. This is because the ‘approximate’ knowledge in empirical correlations is aligned with high-fidelity physics, which helps neural networks to learn better-generalized weights. In addition, a sensitivity analysis shows that the neural networks with transfer learning clearly emphasized the importance of certain input surface statistics (effective slopes, skewness and porosity) in predicting drag. These extracted statistics are in generally good agreement with what has been reported by other investigations, but also highlight that nonlinear functions of several surface statistics could provide high correlation with drag. We have shown that the prediction performance is enhanced when the ‘approximate’ knowledge of the empirical dataset is well aligned with the high-fidelity dataset. Therefore, it is advantageous to employ empirical correlations developed for classes of rough surfaces that are similar to one's high-fidelity dataset. The current NNTF, trained with irregular homogeneous rough surfaces, would not be effective when predicting drag on regular rough surfaces (e.g. cuboids, bars, etc.) or inhomogeneous rough surfaces. Similarly, predicting values of the roughness function beyond the training range is also not expected to be effective, owing to the general limitations of data-driven methods in extrapolation.

To further increase the generalization ability of neural networks, ultimately, the dataset has to be expanded. A large database would also help to study classes of rough surfaces with undiscovered empirical correlations, where pre-training is currently not possible. Accordingly, a collective community effort is needed to construct and train on Big Data of flows over many classes of rough surfaces in different flow conditions. One step in this direction is the online database at http://roughnessdatabase.org (Engineering and Physical Sciences Research Council & University of Southampton 2020), currently under construction. Moreover, new primary surface statistics that strongly affect drag may be discovered in the future by extending the sensitivity analysis in § 4.2 with a larger surface parameter space and Big Data. Also, expansion of the network structure would enable one to fully leverage information inside large databases. The structure of the neural network in this study is very simple, as our focus has been to show that transfer learning can improve the performance of a given network. Thus, this network structure should not be considered as the optimal model for predicting drag. In addition to expanding the database, further optimization of the neural networks will also improve the performance for drag prediction. The current transfer learning code will be made available online (e.g. see https://www.bagherigroup.com/research/open-source-codes/).

Finally, while we have considered transport of momentum (i.e. drag coefficient), empirical relations are widespread also for characterizing transport of energy (Nusselt number) and mass (Sherwood number). Therefore, the proposed framework can also be applied to other engineering applications where only a limited amount of high-fidelity data is available, but a significant amount of knowledge has been accumulated.

## Funding

This work was supported by the Swedish Energy Agency under grant number 51554-1, Swedish Foundation for Strategic Research (FFL15-001), and Friedrich and Elisabeth Boysen Foundation (BOY-151). S.L. and S.B. would like to thank Dr O. Beijbom for the helpful discussions regarding the neural networks.

## Declaration of interests

The authors report no conflict of interest.

## Appendix A. Effects of a bounded roughness function in the datasets

Here, we provide an example of training the neural networks with bounded $\Delta U^{+}$. If the bound of $\Delta U^{+}$ in one's high-fidelity dataset is known, the transfer learning performance can be improved by accordingly bounding the $\Delta U^{+}$ for pre-training. In this example, we used the bound of $\Delta U^{+}>6$, as the surfaces in the current DNS dataset asymptotically reach the fully rough regime when $\Delta U^{+} \approx 6$ (Yang *et al.* Reference Yang, Stroh, Chung and Forooghi2021). We excluded the data samples with $\Delta U^{+} < 6$ from the 35 DNS data samples (figure 4). As a result, a total of 29 DNS data samples were used in this section. The 29 data samples were split into 5, 3 and 21 data samples for training, validation and testing. Note that more than 70 % of the data samples were used for testing, as in § 3. The $\Delta U^{+}$ of the empirical dataset was bounded to $\Delta U^{+}>6$, as for the DNS dataset. After pre-training neural networks with the bounded $\Delta U^{+}$, we fine-tuned a total of 56 neural networks, which learnt from the 56 different combinations of 5 training and 3 validation data samples out of the total 8 fine-tuning data samples ($\binom{8}{5}=\binom{8}{3}=8!/(5!\,3!)=56$).
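The enumeration of the 56 training/validation splits can be sketched as below. This is an illustrative sketch only: the helper `enumerate_splits` and the integer sample labels are hypothetical, and the fine-tuning of one network per split is not shown.

```python
from itertools import combinations


def enumerate_splits(sample_ids, n_train):
    """Yield every (training, validation) partition of the fine-tuning
    samples in which exactly n_train samples are used for training."""
    ids = tuple(sample_ids)
    for train in combinations(ids, n_train):
        validation = tuple(i for i in ids if i not in train)
        yield train, validation


# Hypothetical labels 0..7 for the 8 fine-tuning data samples.
fine_tuning_ids = range(8)
splits = list(enumerate_splits(fine_tuning_ids, n_train=5))
```

Each element of `splits` pairs 5 training labels with the 3 remaining validation labels, so one network could be fine-tuned per element and the resulting predictions combined into an ensemble.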

The average and maximum errors on the 21 test data samples predicted by the ensembles of (1) NNTF with bounded $\Delta U^{+}$ (NNTF-BU), (2) NN with bounded $\Delta U^{+}$ (NN-BU) and (3) EMP with bounded $\Delta U^{+}$ (EMP-BU) were ($5.49\,\%$ and $14.60\,\%$), ($19.61\,\%$ and $45.61\,\%$) and ($13.22\,\%$ and $27.07\,\%$), respectively (figure 11). The errors from NN-BU were notably increased compared with the errors from NN ($7.62\,\%$ and $23.05\,\%$, figure 5). This was because the number of data samples used for training NN-BU had decreased by ${\sim }20\,\%$ compared with the number used for training NN. However, despite the decrease of the number of fine-tuning data samples, NNTF-BU showed similar errors compared with NNTF ($6.25\,\%$ and $14.15\,\%$, figure 5). Accordingly, NNTF-BU achieved nearly 15 % and 30 % decrease in error percentages compared with NN-BU, which was more significant than those achieved from NNTF (figure 5). Therefore, the transfer learning performance can be improved when the range of $\Delta U^{+}$ of surfaces in one's high-fidelity dataset is *a priori* known. It is also worth mentioning that the errors from EMP-BU are similar to those from EMP ($12.30\,\%$ and $27.07\,\%$, figure 5), which indicates that the employed empirical models are reasonably valid even for $\Delta U^{+}$ slightly under 6.
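As a worked check, the quoted decreases of nearly 15 % and 30 % can be read as differences in percentage points between the NN-BU and NNTF-BU errors reported above, for the average and maximum errors respectively:

$$19.61\,\% - 5.49\,\% = 14.12\,\%, \qquad 45.61\,\% - 14.60\,\% = 31.01\,\%.$$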

## Appendix B. Grid resolution and domain size

To determine the appropriate mesh resolution in our simulations, a grid independence test was conducted. We conducted the test for the rough surface with the smallest Taylor microscale ($\lambda _{T}$) among the 35 DNS data samples. The definition of the Taylor microscale proposed by Yuan & Piomelli (Reference Yuan and Piomelli2014) was adopted in the current study. We studied the grid independence using five different resolutions in a minimal channel domain $(2.0H \times 2.0H \times 0.4H)$: case 1, $(192 \times 401 \times 36)$; case 2, $(256 \times 361 \times 48)$; case 3, $(256 \times 401 \times 48)$; case 4, $(256 \times 451 \times 48)$; case 5, $(300 \times 401 \times 60)$. The calculated mean velocity ($U^{+}$) profiles are shown in figure 12. The profiles from the five different cases were found to be nearly identical in both the inner and outer layers. Thus, in the current study, we used the grid resolution of case 3. This resolution corresponded to $\varDelta ^{+}<5$ in both streamwise and spanwise directions. In addition, with this resolution, the grid sizes were at least four times smaller than the Taylor microscales of all surfaces in both streamwise and spanwise directions $(\lambda _{T}/\varDelta >4.5)$, which satisfied the grid resolution constraint introduced in Jouybari *et al.* (Reference Jouybari, Yuan, Brereton and Murillo2021).
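For orientation, the wall-parallel grid spacings of the five test cases can be tabulated with a short script. This is a sketch under stated assumptions: the tuples are taken to be ordered as (streamwise, wall-normal, spanwise), the streamwise and spanwise grids are assumed uniform, and the wall-normal direction is left out because its point distribution is not described here. The function names and the friction Reynolds number argument `re_tau` are hypothetical; no value of `re_tau` is given in this appendix.

```python
def wall_parallel_spacing(domain, grid):
    """Streamwise and spanwise grid spacing in units of H, assuming
    uniform grids in both wall-parallel directions."""
    length_x, _, length_z = domain
    n_x, _, n_z = grid
    return length_x / n_x, length_z / n_z


def in_wall_units(spacing_over_h, re_tau):
    """Convert a spacing in units of H to viscous wall units for a given
    friction Reynolds number (hypothetical input, not from the text)."""
    return spacing_over_h * re_tau


# Minimal channel domain and the five grid resolutions listed above.
MINIMAL_DOMAIN = (2.0, 2.0, 0.4)
CASES = {
    1: (192, 401, 36),
    2: (256, 361, 48),
    3: (256, 401, 48),
    4: (256, 451, 48),
    5: (300, 401, 60),
}

spacings = {case: wall_parallel_spacing(MINIMAL_DOMAIN, grid)
            for case, grid in CASES.items()}

for case, (dx, dz) in spacings.items():
    print(f"case {case}: dx/H = {dx:.5f}, dz/H = {dz:.5f}")
```

Cases 2 to 4 share the same wall-parallel spacing and differ only in the number of wall-normal points, so under these assumptions they isolate the wall-normal resolution, while cases 1 and 5 vary the wall-parallel resolution.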

The sensitivity to the domain size was also studied. We performed an additional 12 full-channel DNS to inspect the feasibility of using minimal channel DNS in the current study. For a given channel half-height $H$, the domain and grid sizes for the full channel simulations were respectively $(8H \times 2H \times 4H)$ and $(900 \times 401 \times 480)$ in the streamwise, wall-normal and spanwise directions. The smallest domain and grid sizes for the minimal channel simulations were $(2.0H \times 2.0H \times 0.4H)$ and $(256 \times 401 \times 48)$, respectively. The surface statistics of the rough surfaces were reasonably well converged in both domains. For instance, the maximum differences of $k_{rms}^{+}$, $ES$ and $Sk$ between the full and minimal channel domains for the rough surfaces were $0.038$, $0.025$ and $0.006$, respectively. The resulting average and maximum errors between $\Delta U^{+}$ calculated from the full and minimal channel domains were $1.5\,\%$ and $4.6\,\%$, respectively. As these errors were notably smaller than those from the predictions of neural networks (§ 3 and Appendix A), we found the current minimal channel approach to be sufficient for our purposes. The full details of the surfaces and simulations are available in Yang *et al.* (Reference Yang, Stroh, Chung and Forooghi2021).