Data-driven prediction of the equivalent sand-grain height in rough-wall turbulent flows

This paper investigates a long-standing question about the effect of surface roughness on turbulent flow: what is the equivalent roughness sand-grain height for a given roughness topography? Deep Neural Network (DNN) and Gaussian Process Regression (GPR) machine learning approaches are used to develop a high-fidelity prediction approach of the Nikuradse equivalent sand-grain height $k_s$ for turbulent flows over a wide variety of different rough surfaces. To this end, 45 surface geometries were generated and the flow over them simulated at $\hbox{Re}_\tau=1000$ using direct numerical simulations. These surface geometries differed significantly in moments of surface height fluctuations, effective slope, average inclination, porosity and degree of randomness. Thirty of these surfaces were considered fully-rough and they were supplemented with experimental data for fully-rough flows over 15 more surfaces available from previous studies. The DNN and GPR methods predicted $k_s$ with an average error of less than 10% and a maximum error of less than 30%, which appears to be significantly more accurate than existing prediction formulas. They also identified the surface porosity and the effective slope of roughness in the spanwise direction as important factors in drag prediction.


Introduction
At sufficiently high Reynolds numbers all surfaces are hydrodynamically rough, as is almost always the case for flows past the surfaces of naval vehicles. Reviews of roughness effects on wall-bounded turbulent flows are provided by Raupach et al. (1991) and Jiménez (2004). The most important effect of surface roughness in engineering applications is an increase in the hydrodynamic drag (Flack 2018), which is due predominantly to the pressure drag generated by the small-scale recirculation regions associated with individual roughness protuberances.
For the foreseeable future, the most practical approach to making predictive flow calculations for many realistic applications is to use engineering one-point closures of turbulence, such as two-equation turbulent eddy-viscosity models to the Reynoldsaveraged Navier-Stokes (RANS) equations. Existing rough-wall corrections to this type of closure typically model the increase in hydrodynamic drag on a single length scalethe equivalent sand-grain height (Nikuradse 1933) k s -without physically resolving the surface or changing the governing equations. In the fully rough flow regime, where the wall friction depends on the roughness alone and is independent of the Reynolds number, k s was observed to quantify the increase in hydrodynamic drag through the empirical relation with the roughness function, ∆U + (defined as the offset of the log-linear velocity profile of a rough-wall flow relative to that of a smooth-wall one): where κ = 0.41 is von Kármán's constant and + represents normalization in wall units. A universal length scale (e.g. k s in Nikuradse's relation, or in the Moody diagram (Moody 1944)) that can predict accurately the surface drag coefficient is not known a priori and does not appear to be equivalent to any single geometrical length scale, such as an average or a root-mean-square (rms) of roughness height (Flack 2018). It is also well-established that k s can depend on many geometrical parameters such as the effective slope (Napoli et al. 2008;Yuan & Piomelli 2014a) and the skewness of the roughness height distribution (Flack & Schultz 2010). Readers are referred to Flack & Schultz (2010) and Bons (2002) for extensive reviews on this topic. Empirical expressions for k s based on a small number of geometrical roughness parameters include, among others: k s = c 1 k avg α 2 rms + c 2 α rms , k s = c 1 k avg Λ c2 s , and k s = c 1 k rms 1 + S k c2 , (1.2) proposed by Bons et al. (2001), van Rij et al. (2002) and Flack & Schultz (2010) respectively. Here k avg is the average height, α is the local streamwise slope angle and Λ s = S/S f S f /S s −1.6 (where S, S f , S s are, respectively, the platform area, the total frontal area, and the total windward wetted area of the roughness) while k rms and S k are the rms and skewness of the roughness height fluctuations, and c 1 and c 2 are constants. The hydrodynamic lengthscale k s appears to be correlated with different sets of geometrical parameters for each type of rough surface and no universal correlation currently exists for flow over surfaces of arbitrary roughness. For example, for synthetic roughness comprising closely packed pyramids ) and random sinusoidal waves (Napoli et al. 2008), it has been shown that k s scales on the effective slope when the surface slope is gentle (i.e. within the 'waviness' regime), whereas the skewness and rms height, but not slope magnitude, become important when the slope is steeper (i.e. within the 'roughness' regime). The boundary between these two regimes has been shown to be surface dependent (Yuan & Piomelli 2014a).
Some more recent studies of k s correlations are summarized below. Thakkar et al. (2017) carried out DNS of transitionally-rough turbulent flows for different irregular roughness topographies. They found that the roughness function is influenced by solidity, skewness, the streamwise correlation length scale and the rms of roughness height. Flack et al. (2019) performed several experiments to systematically investigate the effects of the skewness and amplitude of roughness height on the skin friction. They found that the rms and skewness of roughness height fluctuations are important scaling parameters for prediction of roughness function; however, the surfaces with positive, negative and zero skewness values needed different correlations. Also, Chan et al. (2015) simulated turbulent pipe flows over sinusoidal roughness geometries and confirmed strong dependence of roughness function on the average height and streamwise effective slope.
In previous studies, the small number of roughness parameters used to devise k s correlations tended to limit their application to a narrow range of surface roughnesses. Since it appears that many geometrical parameters, such as porosity, moments of roughness height (e.g. rms, skewness and kurtusis), effective slope, and surface inclination angle might affect k s , it is useful to employ a data science approach suited to modeling large multi-variate/multi-output systems.
Specifically, we use Machine Learning (ML) to explore k s -prediction approaches that depend on a large set of surface-topographical parameters, with the expectation that the resulting models may be applied accurately to a wider range of surfaces. Since the prediction of k s from surface topography is essentially a labeled regression problem, supervised ML operations were performed using Deep Neural Networks (DNN) and Gaussian Process Regressions (GPR). Both methods are explained thoroughly in section 3. Readers are referred to the monograph by Rasmussen & Williams (2006) and the review provided by LeCun et al. (2015) for detailed descriptions of these methods. An initial ensemble of 60 sets of data on k s as a function of topographical parameters-45 direct numerical simulation (DNS) results and 15 experimental results-was considered. All experimental data sets are fully rough, and of the DNS data, 30 are considered fully-rough flows; all fully-rough cases were used for ML training and testing. To the best of our knowledge, this ensemble of roughness geometries is the most extensive used for developing a k s -prediction method.
In this paper, we first present the governing equations, solution methodologies, simulation parameters and different roughness topographies and then discuss the post-processed DNS results used to calculate k s for each surface. Finally, we describe the ML models, their predictions of k s and their uncertainty.

Governing equations
The governing equations of incompressible continuity and linear momentum-the Navier-Stokes (NS) equations-for a constant-property Newtonian fluid, were solved by DNS. These equations are written in indicial notation as where i, j = 1, 2, 3, x 1 , x 2 and x 3 (or x, y, z) are the streamwise, wall-normal and spanwise coordinates, with corresponding velocity components of u 1 , u 2 and u 3 (or u, v, w) and P is defined as p/ρ, where p is the pressure and ρ is the fluid density; ν is the kinematic viscosity. An immersed boundary method (Yuan & Piomelli 2014b) was used to enforce the fine-grained roughness boundary conditions on a non-conformal Cartesian grid. The corresponding body force F i is added to the the right hand side of the momentum equations to impose a no-slip boundary condition at the fluid-roughness interface. To solve the equations, second-order central differencing was used for spatial discretizations and second-order Adams-Bashforth semi-implicit time advancement was employed. The numerical solver was parallelized using a message passing interface (MPI) method (Keating 2004). A double-averaging decomposition (Raupach & Shaw 1982) was used to resolve turbulent and dispersive components of flow variables in the presence of roughness. In this decomposition, any instantaneous flow variable θ may be decomposed into three components, as where the time-averaging operator is θ and the intrinsic spatial-averaging operator is θ = 1/A f x,z θdA (and A f (y) is the area occupied by fluid at an elevation y). The Reynolds and dispersive fluctuating components are then θ = θ − θ and θ = θ − θ respectively. θ is called the double-averaged component. The wall shear stress (including both viscous and pressure drag contributions on a rough wall) was determined by integrating the time-averaged immersed boundary method body force in the x-direction F 1 as where V represents the simulation domain volume below the roughness crest and L xi is the domain length in the x i direction. Readers are referred to Yuan & Piomelli (2014b,c) for details of the implementation and validation of the immersed boundary method and the τ w calculation.

Surface roughness
In figure 1, surface plots of the 45 roughness geometries used in these simulations are displayed; their statistical properties are given in table 1. Each case name in figure 1 and table 1 begins with the letter C or E, which denotes whether the data is computational or experimental, followed by an identifying index for that particular surface. For computational cases, this index is followed by: a characteristic length scale (as a percentage of δ) used for roughness synthesis; an identifier of whether the surface roughness is regular (reg) or random (rnd); and finally an identifier for one additional surface feature and its position in a series of surfaces with different sizes of that feature. These features were: the streamwise inclination angle I x in surfaces C01 to C12; the porosity P o in surfaces C13 to C24; and the streamwise effective slope E x in surfaces C25 to C30. For the experimental data two indices were assigned to each surface. The first denotes the year in which the data were published and the second is the surface designation in that publication. Thus surfaces with index 16 are from Flack et al. (2016), those with index 18 are from Barros et al. (2018), and those with index 19 are from Flack et al. (2019). Note that these experimental data were obtained from fully-developed channel flows, where the drag was measured through the pressure drop along the channel. Thus their results are expected to be more accurate than those of boundary layer studies where the drag is usually inferred.
Surfaces C01 through C24 were created using ellipsoidal elements (Scotti 2006) of different size, aspect ratio and inclination. For regular roughness, each element had the same orientation and semi-axis lengths, (λ 1 , λ 2 , λ 3 ) = (1.0, 0.7, 0.5)k c , where k c is the peak-to-trough height (also called the crest height). For random roughness, the elements had random orientations and semi-axis lengths (with uniform distributions of the random variables). The average orientation and semi-axis lengths for random roughness were the same as the corresponding regular surface. Surfaces C25 through C30 comprised sinusoidal waves in the x direction, of the same magnitude but different wavelengths, to generate different values of effective slope E x . The wavelengths were 3δ/4, 3δ/8 and δ/6. Surfaces C31 and C37 comprised the random sand-grain roughness of Scotti, which were produced by randomly oriented ellipsoidal elements with fixed semi-axes of (1.0, 0.7, 0.5)k c . Surfaces C32 through C36 and C38 through C42 were generated as the low-order (the first 5, 10, 20, 30 and 50) modes of Fourier transforms of white noise in the streamwise and spanwise directions; they therefore describe random surfaces with large-to small-wavelength roughness. Cases C43, C44 and C45 are DNS results from full-span channel computations of flow over surfaces of: random sand-grain roughness; the roughness found on a turbine blade (Yuan & Aghaei Jouybari 2018); and arrays of cubes (from the study of Aghaei Jouybari et al. 2019) respectively. Case C46 is a full-span DNS of case C21, generated to validate the minimal-channel approach of the preceding  Table 1. Statistical parameters of roughness topography and the equivalent sand-grain height ks for each roughness geometry. Ra, kavg, kc, kt, krms and ks values from DNS are normalized by the channel half height δ, while corresponding experimental values are given in mm. ks is not listed for cases thought to be transitionally rough. cases. A baseline smooth-wall flow was also simulated using a full-span channel (Yuan & Aghaei Jouybari 2018). The geometric parameters reported for each surface in table 1 are: roughness peak-totrough height (also termed crest height) k c (i.e. distance between the highest and the lowest surface points); mean peak-to-trough height k t (i.e. the average of peak-to-trough heights obtained from surface tiles of size δ × δ, similar to Forooghi et al. (2017)); mean roughness height k avg ; first-order moment of height fluctuations R a ; root-mean-square k rms , skewness S k and kurtosis K u of the roughness height fluctuations; surface porosity P o ; effective slope in the x i direction E xi ; and inclination angle (in radians) in the x i direction I xi , together with the hydrodynamic lengthscale k s deduced from the mean velocity field using equation (1.1).
These geometrical parameters are defined as: (2.11) where k(x, z) is the roughness height distribution and A f (y) and A t (y) are the fluid and total planar areas at each y location. S k (∂k/∂x i ) is the skewness of ∂k/∂x i distribution.
In table 1, k avg , k c , k rms and k s are then normalized by the first-order moment of height fluctuations R a and were incorporated in the ML algorithms in this form. All surfaces considered were in the ranges k c /δ 0.17 and R a /δ 0.04.

Simulation parameters
Direct numerical simulation was used to calculate the velocity and pressure fields in turbulent open-channel flows over 45 different rough surfaces and one smooth one, at a constant frictional Reynolds number Re τ = u τ δ/ν = 1000, where u τ is the friction velocity and δ is the channel half-height. In these simulations, the domain sizes were (L x , L y , L z ) = (3, 1, 1)δ. The origin of the y axis was the elevation of the lowest trough for each rough surface. The number of grid points was (n x , n y , n z ) = (400, 300, 160). A uniform mesh was used in the x and z directions, yielding grid sizes of ∆x + = 7.5 and ∆z + = 6.3, where + denotes normalization in wall units. For all cases, the mesh was stretched in the y direction with a hyperbolic tangent function, with the third grid point from the origin at y + < 1. For the rough-wall cases, at the roughness crest, ∆y/k c 0.017, with this ratio taking its highest value for Case C11. The maximum grid size was ∆y + max = 9.5 at the channel center line, where the Kolmogorov length scale η + ≈ 6. Moin & Mahesh (1998) have proposed that one requirement for obtaining reliable first-and second-order flow statistics is that the grid resolution be fine enough to capture accurately most of the dissipation, while  noted that most of the dissipation in curved channel flow occurs at scales greater than 15η (based on average dissipation). It follows that for DNS computations of these kinds of flow statistics in channel and boundary-layer flows, ∆x/η and ∆z/η are typically chosen between 7 to 15, and 4 to 8 respectively (see, for example, Kim et al. (1987), Spalart (1988) and Yuan & Piomelli (2014c)). The grid sizes in this study were chosen accordingly and were: ∆x/η < 7.5, ∆y/η < 4.0, and ∆z/η < 6.5.
Periodic boundary conditions were imposed in the streamwise and spanwise directions, with no-slip and symmetry boundary conditions at the bottom and top boundaries respectively. After each simulation had reached statistical stationarity, data were collected for ensemble averaging over 10 large-eddy turn-over times (δ/u τ ). In these simulations, the time step τ + 0.04 and so was significantly smaller than the largest acceptable one of τ + ≈ 0.2 recommended by Choi & Moin (1994) for DNS.
The surface Taylor micro-scales λ T,x and λ T,z , in the x and z directions, were used to evaluate the adequacy of the grid resolution for resolving details of flow in the roughness sublayer, following Yuan & Piomelli (2014b). These geometric micro-scales were obtained by fitting a parabola to the two-point autocorrelation of the surface height fluctuation in the respective direction. They represent the size of an equivalent 'roughness element' in the context of random multiscale roughness. The streamwise and spanwise values of λ T , rescaled by u τ /ν as λ + T , and the respective grid sizes are given in table 2 (part I). For each case, λ + T,xi is of order 10 to 10 2 , indicating that the average size of the roughness element is large in viscous units. On average, roughness elements were well resolved by the grid, with typically 4 to 12 grid points per λ T,xi microscale in each direction. For reference purposes, Yuan & Piomelli (2014a) reported a resolution of λ T,x /∆x ≈ 4 in their large-eddy simulations of channel flow over surfaces with sand-grain roughness. The cases in table 2 for which λ T was not well resolved in at least one direction (λ T,x /∆x < 3 or λ T,z /∆z < 3) may also not have been fully-rough flows (as discussed in the following section), and so were not included in the ensemble of flows for ML training and testing.
In rough-wall flows, the pressure drag is caused primarily by the local flow structures and separation in the vicinity of individual roughness protuberances, which are predominately near-wall phenomena. To carry out the 46 separate DNS simulations for determining k s efficiently, with sufficient near-wall resolution, a small-span channel simulation approach was employed. The concept of minimal-span simulation was introduced by Jimenez & Moin (1991). Chung et al. (2015) and MacDonald et al. (2017) carried out analyses of the performance of DNS over small spanwise domains for full and open channel flows on rough and smooth walls and showed that minimal-span simulations captured the essential near-wall dynamics and yielded accurate computations of wall friction, and of mean velocities and Reynolds stresses as far from the wall as y ≈ 0.3δ, when the following constraints were met: where δ ν = ν/u τ and λ r,xi is the characteristic roughness wavelength in the x i direction. Alternatively, the surface Taylor microscale may be used as the lengthscale in this constraint. Conditions (2.14a,c) were satisfied by choosing domain sizes L + x and L + z of 3000 and 1000 respectively, while condition (2.14b) was met for all cases except C11, which fell below the L y k c /0.15 constraint by about 10%. C11 is a case with random geometry; protuberances beyond 0.15δ exist but are rare. The criteria of (2.14) were developed originally for simulations of flow over surfaces with uniformly distributed roughness elements. In this study, the random roughness geometries used require an additional criterion on the sufficiency of the domain size: the area L x L z should be large enough to achieve statistical convergence of surface parameters, such as k rms and E xi , and of the flow parameter k s . To check the adequacy of the chosen domain size, an additional simulation was carried out of Case C21, the surface comprising the largest dominant spatial wavelength (and consequently the most limited sampling of random geometrical components with this wavelength) and a longtailed height-fluctuation pdf with a kurtosis of around 8. In this validation simulation, denoted Case C46, the domain sizes were doubled in x and z, by duplicating C21 in these directions. The double-averaged velocity profiles U + = u + (y + ) for Cases C21 and C46 are in a very good agreement over the log-linear region, as shown in figure 2. Each surface statistic differs by no more than 3%, with the greatest discrepancy found in I z , while the equivalent sandgrain roughness height k s is almost equal in the two cases. The chosen domain size was therefore considered sufficient for accuracy and convergence of statistics describing flow over the random roughness geometries of this study.

Post-processed results
In figure 2, the streamwise double-averaged velocity profiles computed in these simulations are shown. The profiles in the logarithmic region are described for the smooth-wall case and the fully-rough rough-wall cases as u + = 1 κ ln(y + ) + 5.0, and (3.1a) respectively, where d is the zero-plane displacement, obtained as the location of the centroid of the wall-normal profile of the averaged drag force (Jackson 1981). The shift in the y coordinate by d accounts for the flow blockage by surface roughness elements, and the values of d are given in table 2 (part II).
To determine whether a particular flow was within the fully rough regime, equation (3.1b) was applied to the computed logarithmic velocity profile to yield a test value of k s , denoted as k s in table 2 (part II). With k s determined for all cases, those with k + s greater than a threshold value of 50 were deemed to be in the fully rough regime (30 surfaces), in which case k s was set to equal k s . Those below the threshold were possibly transitionally rough (15 surfaces) and so were not included in ML predictions in this study. The threshold value of k + s -the lower end of the fully rough regime-has been observed to vary significantly for different types of roughness and is typically between 20 and 80. For example, the threshold values for surfaces C43 and C44 are roughly 80 and 20 (Yuan & Piomelli 2014a), and 50 for surface C45 (Bandyopadhyay 1987).
The threshold value of k + s which signifies the beginning of the fully rough regime was not determined more precisely because of the cost of carrying out, for each surface, simulations at successively higher values of k + s until k s /R a became invariant with the Reynolds number. In the GPR prediction, potential uncertainties in k s which might arise through treating all flows with k + s > 50 as fully rough, and other sources of possible error, were compensated for by incorporating an assumed 10 % noise level in the learning stage of the prediction of k s , as discussed in section 3.2. The values of k + s = 50 as the threshold for fully rough flows and the assumed noise level were chosen as part of a trade-off to maximize the number of usable data, to avoid overfitting, while acknowledging possible uncertainties in the modeling data.
In figure 3, pair plots of the different topographic roughness parameters are shown as scatter plots (lower left), joint pdfs (upper right), and distribution pdfs (diagonal). Pair scatter plots for the true (DNS and experimental) value of k s and other roughness parameters are along the bottom row of this figure. It can be seen that, for the roughness cases chosen, there is some correlation between kurtosis and rms roughness (column 1, row 6), kurtosis and skewness (column 5, row 6), and skewness and porosity (column 2, row 5). The relationship between others appears to be more random. From the graphs in the bottom row, it can be seen that k s /R a scales on porosity to some power, albeit with some scatter (column 2, row 7). It also appears that k s /R a might decrease with skewness for surfaces with S k < 0 and increase with skewness in cases with S k > 0 (column 5, row 7). Surfaces with positive skewness yielded higher values of k s compared to those with negative skewness, consistent with the observation of Flack et al. (2019). Beyond these observations, there does not appear to be a clear linear correlation between k s and any individual roughness parameter, which makes the search for a functional dependence of k s on these parameters a problem well suited to ML. The measures of inclination, I x and I z , showed no clear correlation with other variables or with k s /Ra.

ML predictions of the equivalent sand-grain height
The ML techniques of DNN and GPR were employed to predict k s from the data sets described in the previous section. The objectives of this exercise were to generate and collect data, and make qualitative comparisons between ML predictions and those from conventional correlations, rather than evaluating and comparing the performance of various ML procedures per se. DNN and GPR approaches were used because our experience was that they predicted k s with high accuracy, notwithstanding their simplicity. Other approaches such as the Support Vector Machine (SVM) technique were considered initially, but their preliminary predictions were not as accurate as those found using DNN and GPR approaches.
The main characteristics of DNN and GPR methods are described below: •The inputs for both techniques were 17 roughness geometrical parameters, 8 of which were the primary variables k rms /R a , I x , |I z |, P o , E x , E z , S k and K u (defined in equations 2.4 to 2.13). The other 9 were products of the primary variables, which were added to improve the efficiency of each learning stage. They were x , p 8 = E 2 z and p 9 = S 2 k . These particular products were chosen because of their perceived importance for certain types of roughness.
•The database consisted of 45 different sets: 30 DNS of turbulent channel flows over different surfaces at Re τ = 1000, and 15 experimental data sets at higher Reynolds numbers, with all data sets in the fully-rough turbulent-flow regime.
•The DNN architecture was a Multi Layer Perceptron, with three hidden layers (with 18, 7 and 7 neurons respectively). The activation functions at all nodes were of the Rectified Linear Unit kind, and kernel regularization was used to avoid overfitting. The network had 521 trainable weights in total. The preset parameters to the algorithm were optimized based on available data, through a hyper-parameter tuning process. Specifically, 270 configurations were first generated with different lengths (representing the number of layers) and widths (representing the number of neurons). For each configuration, the DNN compiler was performed 1000 times with random selections of training (70% of total) and testing (30% of total) datasets to identify the best performance of the configuration. The configuration that yielded the best results was considered as the optimal one, the results of which are presented here. The cost of data fitting for one iteration (out of 1000) for each DNN configuration was about one second. In total, it took about 75 hours to obtain the optimal DNN network. This architecture was found to provide suitable accuracy in modeling without overfitting, for this particular multivariate labeled regression problem.
•The GPR procedure used Rational Quadratic kernels to represent k s as a superposition of scaled Gaussian functions of the independent variables of the modeling problem. Similar to the DNN method, the training and testing data were chosen randomly, with respective ratios of 70% and 30% of the total data points. The preset parameters (e.g. kernel type, number of iterations, etc.) were also tuned with the available data by running the GPR compiler for about 8000 times. It took about 35 hours to obtain the optimal fit. The GPR method has the capability of incorporating uncertainty or noise in the determination of model parameters in the learning stages. Such noise might arise through: numerical and discretization errors; uncertainty in the form and model coefficients of equation (1.1); the applicability and fitting range of equation (1.1) (which was deduced from high Reynolds number experiments) to simulations at much lower Reynolds numbers; and the possibility that some of the training data may have been from simulations in which the flow was not quite fully rough. A noise level of 10% in k s /R a values was chosen as an upper estimate of the likely uncertainty from these sources. Noise levels of 5% and 15% were also tested, but little sensitivity of the k s prediction was found to the assumed noise level within the tested range.
The values of k s predicted from the surface topography parameters, henceforth called k sp , are compared to the actual k s values in figure 4, for the DNN and GPR methods respectively. Scatter plots of k sp and the true value of k s in figures 4(a) and (d) reveal a tight clustering of data along the y = x diagonal, with only a few outlying points. This very high degree of correlation between k sp and k s implies that both techniques have been applied with equal success to this prediction problem. The error range, figures 4(b) and (e), is less than ±30% (L ∞ norm) and the average error (L 1 norm) is less than 8%, for both techniques.
The consistency between both the k s predictions and error bands for two quite different ML techniques suggests that they are both well-suited to this kind of problem, and possibly close to an optimum for this class of ML approach.
The error values as percentages, for the DNN and GPR methods, are given in table 3, together with the error in the empirical relation proposed by Flack et al. (2016), and given by Forooghi et al. (2017), as well as their respective recalibrated correlations: when extended to all cases in the current database. It is interesting to note that the form of equation (3.2) was chosen for surfaces generated by grit blasting-closely-packed, random, three-dimensional roughnesses with a wide range of scales (E01-E06), while many of the simulated surfaces are two-dimensional, some are characterized by discrete elements of similar sizes, while others are sparse or wavy (characterized by low slopes). Equation (3.3), on the other hand, includes a slope parameter and was calibrated for numerically generated surfaces consisting elements of random sizes and a prescribed shape. For most cases, the errors from the DNN and GPR methods were of the same order of magnitude and much smaller than the error in using equation (3.2) or (3.3). In the DNN and GPR predictions of simulation cases, the greatest errors (about 25%-30%) arose in cases E07 and E08. The surfaces associated with these cases are characterized by fractal features (with spectral slopes of -0.5 and -1.0, respectively (Barros et al. 2018)). The size of the errors for these cases might be attributed to the small number of surfaces with this feature used in the training set (as opposed to the many surfaces that are mostly characterized by single-scale elements). A close examination of the prediction errors for the DNS cases showed a subtle trend between relatively high errors and low roughness solidity (or low E s and insignificant wake sheltering), in, for example, cases C28 and C44. Both these cases are characterized by large-wavelength, wavy features, suggesting an under-representation of sparse roughness in the dataset. Beyond this observation, no clear correlation was found between the error and other primary roughness parameters included herein or surface categorizations (2D/3D, random/regular). The errors associated with using equation (3.2) are small for surfaces E01 through E06, which were used to calibrate this relation. The errors in using equations (3.2) and (3.3) over all surfaces in the database are 120% and 430% respectively. However, when recalibrated against the full database, equations (3.4) and (3.5) have a significantly smaller error band with maximum values of 79% and 84%. The high error values of the empirical correlations, compared to DNN or GPR prediction, are attributed to the small number of geometrical variables used in their calibrations and the restricted range of the models' parameters.

Uncertainty estimation
In addition to predictions of equivalent sand-grain height, the GPR method provides confidence margins as functions of each input parameter. These margins can be useful for indicating the kinds of surfaces for which additional training data could improve confidence in predictions. This feature of the GPR approach makes it very attractive for studies of this kind, since DNS and experimental generation of data can be expensive.
The confidence intervals determined by the GPR technique are shown as functions of the normalized surface rms roughness height, effective slope, porosity and skewness in figure 5. Wider intervals indicate higher estimated values of predictive error, such as at roughness porosity of 0.68, and skewnesses of -1.5 and 2.0. Surfaces of roughness with similar values of porosity and skewness would then be priorities for additional simulations or experiments.

Sensitivity analysis
The dependence of DNN predictions of k s on individual roughness parameters is explored by determining the change in the error norms when each of the primary surface parameters is removed from the data from which the DNN prediction was made. In table 4, the actual error for each surface, and the values of the L 1 and L ∞ norms of errors in the prediction of k s over the 45 surfaces, are reported when the parameter(s) in the first row is (are) the excluded one(s). The errors of the base prediction (which includes all 8 primary parameters) are listed in the second column. In the following discussion, we focus on the L 1 norm for ease of comparison over all 45 cases.
When the values of L 1 are considered, the relative importance of these surface parameters for predicting k s is: E x , I x , |I z |, E z , P o , k rms /R a , S k , and of least importance, K u . The L 1 -norm error is small when all parameters are included (7.4%). Excluding any single one of these parameters increases the L 1 -norm error up to around 9%. On the other hand, the exclusion of K u from the input parameters does not worsen predictions of k s significantly. Instead, this observation appears to be a consequence of correlation between K u and other surface parameters like k rms /R a (see figure 3). When such correlations exist Figure 5. Confidence interval (CI) of predictions with the GPR method, with predicted values of ks/Ra in blue lines (called ksp) and true values of ks/Ra in red dots. GPR predictions for both training and testing data sets are shown -ks and ksp are very close to each other for the training data points, while they deviate (less than 30% of error) for some test data points. Line jaggedness is associated with projection of a high-dimensional space to one-dimensional ones. and one correlating parameter is excluded, the DNN process redistributes the weightings given to other correlated parameters, with little loss in predictive accuracy.
To reduce the correlation between the excluded parameters and the remaining ones, one may exclude groups of parameters that are thought to characterize the same type of surface feature. For this reason, a sensitivity analysis was carried out on the effect of groups of variables on prediction of k s . The characteristics of surface slope, element inclination angle, porosity, and intensity of height fluctuations, are contained in pairs of (E x , E z ), (I x , I z ), (P o , S k ) and (k rms , K u ), respectively. Parameters within each pair have been shown to be correlated to some degree in figure 3. accuracy of k s prediction is affected, if any one of these pairs is excluded. According to the table, the prediction of k s is sensitive to all four pairs, but with greater sensitivities to the surface porosity (described by P o , S k ) and the surface slope (described by E x and E z ). As expected, the elimination of both parameters of a pair worsens the prediction more than removing either single parameter (from around 7-9% errors to up to 14%). According to the sensitivity analysis, all parameters considered are of some importance in the prediction of k s . The effective x-slope E x and roughness height skewness S k have been suggested as especially significant in earlier studies (Napoli et al. 2008;Flack & Schultz 2010;Yuan & Piomelli 2014a). The inclination angle in the streamwise direction I x makes a significant contribution to the k s prediction because, physically, I x characterizes the average aerodynamic shape of the roughness elements. Surfaces with I x > 0 are aerodynamically bluff bodies when compared with surfaces of the same size but with I x = 0, and surfaces with I x < 0 tend to be more streamlined and hence produce less drag.
An important finding from this study is that the effective z-slope E z is of similar importance to accurate k s prediction as S k or E x . The exclusion of E z adversely affects the prediction for a large number of rough surfaces. Physically, E z describes whether the surface is close to a two-dimensional (2D) roughness with E z = 0 (such as a transverse bar roughness) or a three-dimensional (3D) roughness with finite E z . It is known that a k-type 2D roughness produces a higher drag than a 3D roughness with the same height due to the larger spanwise lengthscale that the 2D roughness imparts to the flow (Volino et al. 2009).

Comparison between ML algorithms and polynomial models
Explicit algebraic data representations, such as polynomial functions, can also be determined for the data sets of this study, using fitting or minimization procedures. In such methods, a set of basis functions is proposed for a model, the unknown coefficients of which are then optimized according to specified constraints. They are a generalization of the models of equation (1.2), which were based on experimental observations of the dependence of k s on a small number of surface parameters. A 30-degree-freedom polynomial basis was proposed as a 'white-box' model for k s , analogous to a low-order Taylor series expansion for k s : where a i (i = 0, 1, · · · , 29) are the model coefficients. To keep this model as simple as possible and to bring the effects of all contributing factors into account, we used terms as α i θ αj for a test variable θ that take only positive values (e.g. k rms ), and terms as α i θ +α j |θ| α k for those variables that take both positive and negative values (e.g. S k ). For the latter, the power of θ in the first term is fixed (at one) instead of fitted, to eliminate the possibility of an imaginary number. Combinations of six parameters (E x , E z , P o , S k , k rms /Ra and K u ), taken in pairs, were also included. Since, for the present collection of surfaces, strong correlations were observed between individual variables within the three pairs of (E x , E z ), (P o , S k ) and (k rms /Ra, K u ), shown in figure 3, only one variable from each pair was used for the combination terms in equation (3.6). Using the other variable from any of these pairs instead would not lead to a significant change in the prediction using equation (3.6). The high-dimensional space of a i is poorly suited to curve-fitting and minimization procedures which use stochastic gradient descent algorithms. However, it is well suited to robust minimization methods like the differential evolution algorithm (Storn & Price 1997), with which global minima can often be found efficiently in spaces of high dimension. In this case, it is used to determine the values of the coefficients a i which minimize the L 1 norm.
In figure 6, the prediction quality of this white-box model with optimized coefficient values is shown. This method yields an average prediction error of 12% and a maximum one of 51% when using all 45 fully-rough data sets (to give the best possible prediction accuracy) for the model training.
The optimized values of a i 's are The predictive accuracy of this optimized explicit model equation is considerably lower than that of the DNN and GPR methods. One reason for this reduced accuracy is that low-order functions of geometrical parameters do not faithfully represent the dependence of k s on surface parameters because each coefficient in the model is required to take the same value over the entire surface-parameter space. In ML approaches, such restrictions need not apply as they are not constrained to low-order polynomial functions but instead adopt a methodical search for the best representation of k s as a function of the surface parameters. This search is carried out through 'feature selection' in the first layers of DNN and the properties of the basis functions adopted in GPR, each of which are designed to yield the same mean and standard deviation of k s /R a as in the original dataset (Rasmussen & Williams 2006).

Concluding remarks
The construction of a predictive model from a large ensemble of dataset for the equivalent sandgrain height k s of a surface of arbitrary roughness, as a function of many different measures of surface topography, is a labeled regression problem that is wellsuited to machine learning techniques. In this paper, data from 45 different rough surfaces (in fully rough flows) were used to devise DNN and GPR predictions for k s as functions of 8 different surface-roughness parameters.
Both models were able to predict k s for the 45 surfaces with an average error below 10%, with the largest error for any one surface less than 30%. These predictions were significantly better than those of existing formulas, and of a 30 degree-of-freedom polynomial model fitted to the same data, where the greatest error for any surface was about 50%.
Sensitivity analyses revealed that inclusion of nearly all the surface roughness descriptive parameters was necessary to minimize the average prediction error, but that exclusion of either measures of porosity or measures of the surface slope increased the maximum prediction error more significantly than omitting other parameters.
Machine learning techniques are well suited to this modeling problem because: i ) it is complex insofar as different kinds of surface roughness yield different flow phenomena which are modeled most accurately in different ways, making the prospect of a general physical model very remote; and ii ) the dependent surface-roughness variables upon which k s is modeled are a large non-orthogonal set for which robust multivariable regression techniques are required. As machine learning methods, they take no account of physical modeling concepts or observed phenomena within roughness sublayers, such as recirculation regions, enhanced turbulence production in the wake of roughness elements, assumed scalings for drag etc., each of which is applicable to flows over some rough surfaces but not others. Nor are they hindered by the lack of orthogonality of the surface roughness parameters as the dependent variables of k s . The techniques used can be configured readily to mimic models with very many degrees of freedom and, when compared to polynomial models, their feature selection properties provide the equivalent of different values for polynomial coefficients in different regions of the surface-parameter space. In this application, both approaches of DNN and GPR yielded models with very similar predictive accuracy, even though the techniques themselves were very different. We therefore conclude that they yield high-fidelity predictions of the equivalent sandgrain roughness height for turbulent flows over a wide range of rough surfaces, as a significant improvement over other methods. Improved prediction might be achieved by enlarging the database to include rough-wall flows with surface parameters which correspond to the relatively low prediction confidence in the GPR method, and by including additional roughness parameters as inputs which might describe sparseness and two-dimensionality, such as the solidity, correlation lengthscales and other two-point surface statistics.
In addition to the k s prediction described here, the DNS database and the ML techniques in general can also be used to uncover relations between roughness geometry and physics-related quantities, such as the flow pattern around roughness protuberances, flow separation locations, characteristics of the shear layers associated with the separation bubbles, the wake sheltering volume, etc. Specifically, a ML network trained to correlate these flow characteristics (as outputs) to the roughness geometry (as inputs) may be an efficient tool for determining the sets of roughness geometrical features which are important for characterizing these effects. Knowledge of such a set of significant roughness parameters may also guide the construction of rough-surface databases that yield more efficient and more widely applicable predictions of k s or other quantities.

Declaration of interests
The authors report no conflict of interest.

Acknowledgements
The authors gratefully acknowledge the financial support of the Office of Naval Research (Award No. N00014-17-1-2102). Computational support was provided by Michigan State University's Institute for Cyber-Enabled Research. The authors also gratefully thank Professor Karen A. Flack of the US Naval Academy for providing the experimental data sets.

Supplementary materials
The rough-wall flow database (including k s , surface height map and surface parameters) and the trained DNN and GPR networks, called Prediction of the Roughness Equivalent Sandgrain Height (PRESH), can be accessed online in the first author's GitHub repository at https://github.com/MostafaAghaei/Prediction-of-the-roughnessequivalent-sandgrain-height. With this package of data and programs, interested researchers can: i) use the ML networks described in this paper to make predictions of k s for surfaces of their own roughness topography; ii) download the code and train new DNN and GPR networks to predict k s for a different set of surfaces of arbitrary topography; and iii) use the database of 45 rough-wall flows for other applications. It is recommended to use the ML configurations described in this paper for surfaces with parameters inside the ranges specified in figure 3. Extrapolations (using inputs which are beyond the specified range) will lead to additional uncertainty.
The PRESH and the database will be actively updated by the authors to improve the prediction accuracy and universality. We welcome interested researchers to share their datasets with us.