## 1. Introduction

Stellarators are one of the most promising designs for magnetic confinement fusion as they are inherently steady state and are less susceptible to current-driven instabilities (Helander Reference Helander2014). However, stellarators have historically been plagued by large collisionally induced losses of particles and energy associated with cross-field drifts of particles caused by the inhomogeneity and curvature of the magnetic field. In principle, stellarators can be optimized to reduce these collisional losses. In practice, impressive improvements in plasma confinement have been obtained. For example, experiments at the W7-X stellarator have shown greatly reduced neoclassical transport (Beidler *et al.* Reference Beidler, Smith, Alonso, Andreeva, Baldzuhn, Beurskens, Borchardt, Bozhenkov, Brunner and Damm2021). Furthermore, advances in stellarator optimization techniques have led to designs with precise quasi-symmetry (QS) (Landreman & Paul Reference Landreman and Paul2022; Landreman, Buller & Drevlak Reference Landreman, Buller and Drevlak2022) and quasi-isodynamicity (QI) (Goodman *et al.* Reference Goodman, Mata, Henneberg, Jorge, Landreman, Plunk, Smith, Mackenbach and Helander2022).

After neoclassical losses are minimized, confinement is limited by the anomalous transport of heat and particles by turbulence. This turbulence is driven by plasma microinstabilities on length scales of the gyroradius. For example, the ion-temperature gradient (ITG) instability is believed to be a major cause for ion-temperature clamping in W7-X, preventing heating of ions above 2 keV (Beurskens *et al.* Reference Beurskens, Bozhenkov, Ford, Xanthopoulos, Zocco, Turkin, Alonso, Beidler, Calvo and Carralero2021) (along with weak energy exchange between electrons and ions). While there have been recent attempts to optimize stellarators to reduce turbulence-induced transport, they have mainly relied on proxies based solely on the magnetic geometry (Mynick, Pomphrey & Xanthopoulos Reference Mynick, Pomphrey and Xanthopoulos2010; Proll *et al.* Reference Proll, Mynick, Xanthopoulos, Lazerson and Faber2016; Roberg-Clark, Xanthopoulos & Plunk Reference Roberg-Clark, Plunk, Xanthopoulos, Nührenberg, Henneberg and Smith2023) or linear simulations (Jorge *et al.* Reference Jorge, Dorland, Kim, Landreman, Mandell, Merlo and Qian2023). However, linear physics may not accurately predict nonlinear saturation mechanisms that ultimately determine the rate of heat and particle loss (McKinney *et al.* Reference McKinney, Pueschel, Faber, Hegna, Talmadge, Anderson, Mynick and Xanthopoulos2019). Unfortunately, using nonlinear analysis for optimization is usually very challenging. Gyrokinetics (Catto Reference Catto1978; Antonsen & Lane Reference Antonsen and Lane1980; Frieman & Chen Reference Frieman and Chen1982) is one of the most commonly used models to study turbulence in magnetic confinement fusion devices, and is also the model used for this paper. Typical nonlinear gyrokinetics simulations usually require hundreds to thousands of CPU hours, making them infeasible to use within an optimization loop.

In this work, we demonstrate the ability to reduce turbulent losses by optimizing stellarator configurations using nonlinear turbulence simulations directly rather than relying on proxies. To run nonlinear simulations inside the optimization loop, we use the new GPU-native gyrokinetic code GX (Mandell, Dorland & Landreman Reference Mandell, Dorland and Landreman2018; Mandell *et al.* Reference Mandell, Dorland, Abel, Gaur, Kim, Martin and Qian2023). GX uses pseudo-spectral methods in velocity space. GPU acceleration combined with flexible velocity resolution allows for nonlinear GX simulations that only take minutes to run. For this work, we focus on ITG turbulence, which contributes to major energy losses that limit plasma confinement and so is one of the most important microinstabilities to consider for reactor design (Kotschenreuther, Rewoldt & Tang Reference Kotschenreuther, Rewoldt and Tang1995*b*; Horton Reference Horton1999; Helander, Proll & Plunk Reference Helander, Proll and Plunk2013). To that end, we will be running electrostatic simulations with a Boltzmann (adiabatic) response assumed for the electrons.

The optimization is performed using the stellarator equilibrium and optimization code DESC (Dudt & Kolemen Reference Dudt and Kolemen2020; Conlin *et al.* Reference Conlin, Dudt, Panici and Kolemen2023; Dudt *et al.* Reference Dudt, Conlin, Panici and Kolemen2023; Panici *et al.* Reference Panici, Conlin, Dudt, Unalmis and Kolemen2023). DESC also uses pseudo-spectral methods and directly solves the ideal MHD force-balance equation to compute the magnetic equilibrium. The quantities computed from the resulting magnetic fields are used as input for GX. Stochastic optimization methods are used to robustly handle the noisy landscape of turbulent heat fluxes.

The paper is organized as follows. The stochastic optimization method used in this work is described in § 2. Results are shown in § 3, with analysis on the effects of global magnetic shear for reduced turbulence explored in § 4. Transport simulations using the T3D stellarator transport code (Qian *et al.* Reference Qian, Buck, Gaur, Mandell, Kim and Dorland2022) are shown in § 5. Finally, the conclusions follow in § 6.

## 2. Optimization methods

In this optimization routine, we seek to minimize the nonlinear heat flux returned by GX. Specifically, GX returns as output the time trace of the heat flux (normalized to gyro-Bohm units). An example of some heat traces is shown in figure 5(*a*). At the beginning of the simulation, linear growth of the fastest-growing instability dominates. However, eventually nonlinear effects cause the heat flux to decrease and saturate to a statistical steady state. We use the time average of this steady-state flux as our heat flux objective $f_Q$. To take the time average, we take the second half of the time trace and compute the weighted Birkoff average

where the sum is over each point in the trace (with $N$ total points) and $I = \sum _i^N \exp ({-{\rm i}/(N(1-i/N))})$ is a normalization factor. This gives greater weight to values in the middle of the trace.

Since GX is a local flux-tube gyrokinetic code, each simulation is on a single field line on a single surface specified by the field line label $\alpha$ and the radial coordinate $\rho = \sqrt {\psi /\psi _b}$ (where $\boldsymbol{B} = \boldsymbol {\nabla } \psi \times \boldsymbol {\nabla } \alpha$ and $\psi$ is the toroidal magnetic flux). Therefore, $f_Q$ is also computed for only a single field line and surface. For all of the optimization examples, we only simulate on the $\rho = \sqrt {\psi /\psi _b} = \sqrt {0.5}$ surface and the $\alpha = 0$ (except for the multiple field line examples in § 3.3) field line. However, we will run post-processing simulations across different surfaces and field lines.

To run these simulations, GX requires a set of geometric quantities that can be computed from numerical equilibria. Given that the magnetic field is written in Clebsch form $\boldsymbol {B} = \boldsymbol {\nabla } \psi \times \boldsymbol {\nabla } \alpha$, the set of quantities needed are

where $\boldsymbol {b} = \boldsymbol {B}/B$, $\alpha$ is the straight field line label, $\boldsymbol{\kappa} = \boldsymbol {b} \boldsymbol {\cdot }\boldsymbol {\nabla } \boldsymbol {b}$ is the curvature and $z$ is some coordinate representing the distance along the field line. For these simulations, the geometric toroidal angle $\phi$ is used for $z$. All of these quantities are easily computed using utility functions in DESC.

One issue with directly minimizing nonlinear heat fluxes is that their time traces are often very noisy. Even the resulting time averages are usually very noisy in parameter space. This can be seen in figure 1 showing the time-averaged nonlinear heat flux when scanning over the $Z_{m,n} = Z_{0,-1}$ boundary mode of the initial equilibrium used in this study. Optimizers designed for smooth objectives may easily get stuck in local minima and make very little progress. To help with this issue, we use the simultaneous perturbation stochastic approximation (SPSA) method (Spall Reference Spall1987) to minimize the heat fluxes. In this algorithm, the $i$th component of the gradient at point $\boldsymbol {x_n}$ is approximated as

where $\boldsymbol {c_n}$ is a random perturbation vector whose components are sampled from a Rademacher distribution, and $k$ is a finite-difference step-size. Therefore, we are effectively using the finite-difference method in all directions at once. Finally, for this method, rather than specifying stopping tolerances, we instead specify a maximum number of iterations.

The main feature of the SPSA method is that it only requires two objective function measurements per iteration. This makes it suitable for high-dimensional optimization problems whose objective functions are expensive to compute. SPSA can also robustly handle noisy objective functions, and so has been used for simulation optimization, including Monte Carlo simulations (Chan, Doucet & Tadic Reference Chan, Doucet and Tadic2003).

To still use automatic differentiation and smooth optimization methods for other objectives like quasi-symmetry residuals, we split the optimization routine into two parts. We first minimize the turbulent heat flux using stochastic gradient descent. Next, we minimize the quasi-symmetry residuals using a least-squares method that uses automatic differentiation. This order is chosen arbitrarily, but the reverse ordering can also work. The two objectives are thus

where $f_Q$ is the heat flux from GX and $A$ is the aspect ratio. For these optimizations, we target an aspect ratio of $A_{{\rm target}} = 8$. We use the two-term quasi-symmetry objective, where a magnetic field is quasi-symmetric if the quantity

is a flux function. Computationally, we evaluate the equivalent form

at several different flux surfaces and try to minimize the resulting values. For this study, we always choose flux surfaces at $\rho = 0.6,\ 0.8\ \text {and}\ 1$. We target quasi-helical (QH) symmetry, so that the helicity is $(M, N) = (1, 4)$.

The GX simulation parameters used in the optimization loop and for post-processing, along with their justification, are in Appendix B.

Finally, the optimization routine is performed in stages. Each stage increments the maximum boundary Fourier mode being optimized over. For example, in the first stage, only boundary modes with modes $m,n$ satisfying $|m| \leq 1$ and $|n| \leq 1$ are used as optimization variables. In the next stage, boundary modes with $|m| \leq 2$ and ${|n| \leq 2}$ are used. The $m=0,n=0$ mode is excluded to prevent the major radius from changing. This reduces the number of optimization variables at the beginning of the optimization and ‘warm-starts’ each successive stage. This type of method has been used with great success for previous stellarator optimization results, including for optimizing quasi-symmetry (Landreman & Paul Reference Landreman and Paul2022). After each stage, we start the next stage with an equilibrium chosen from the previous stage that had achieved both low heat flux and low quasi-symmetry error. No strict rule is used, and the choice depended on several factors like the value of the gradient estimate and the $\iota$ profile.

## 3. Results

### 3.1. Turbulence optimization

As an initial test of our stochastic optimization method, we begin by solely optimizing for turbulence. The optimization routine is the same as described in the previous section, except instead of optimizing for both quasi-symmetry and aspect ratio in the second half of each iteration, we only optimize for the target aspect ratio. Usually, combining the turbulence and aspect ratio objectives, as is done in the first part, is insufficient to reach our target aspect ratio. We have to force the optimizer to take more aggressive steps due to the simulation noise, which makes it more difficult to maintain the desired aspect ratio. For this optimization (and all of the examples in this work), the initial equilibrium is an approximately quasi-helically symmetric equilibrium with an aspect ratio of eight and four field-periods.

Figure 2 shows the normalized nonlinear heat flux across each iteration of the optimization process. Due to the stochastic nature of the optimization routine, the initial gradient estimates are very poor, leading to an increase in the heat flux in the first several iterations. However, as the optimization continued, there is eventually a rapid decrease in the heat flux as the gradient approximation begins to track the true gradient. The optimizer continues to steadily decrease the heat flux for most of the remaining iterations. Note that since we specify a maximum number of iterations rather than stopping tolerances, the optimization ends even when the heat flux had increased slightly at the end.

A scan of the nonlinear heat fluxes across $\rho$ and optimized cross-sections of the flux surfaces are shown in figure 3. For this simulation (and all of the simulations in this section), the resolution is increased to the values in table 3. As seen in those plots, some surfaces see moderate to drastic improvements to the nonlinear heat flux. In particular, the $\rho = 0.4$ surfaces has a reduction of approximately an order of magnitude. However, other surfaces see much less improvement, such as at $\rho = 0.2$ and $\rho = 0.8$. It is not completely well understood why that is, but it could be related to the fact that only the $\rho = \sqrt {0.5}$ surface is being optimized. This does reveal a potential weakness in the optimization strategy that will be addressed in future work. Nevertheless, the following sections show much more uniform improvement. Interestingly, based off of the surface boundary plots, the cross-sections of the optimized equilibria seem much less strongly shaped than in the initial equilibrium. Instead, the magnetic axis has significantly more torsion.

The contours of $|\boldsymbol {B}|$ in Boozer coordinates are plotted in figure 4. Surprisingly, the contours seem to resemble those from quasi-isodynamic equilibria despite not including a QI term in the objective functions. This may be related to the fact that it seems like the optimizer favoured a high mirror ratio. Additionally, it is well known that QI stellarators can be optimized to have the maximum-J property, which has numerous benefits including improved confinement of fast ions, neoclassical confinement of thermal ions (Sánchez *et al.* Reference Sánchez, Velasco, Calvo and Mulas2023; Velasco *et al.* Reference Velasco, Calvo, Sánchez and Parra2023) and enhanced stabilization against trapped-electron modes (TEMs) (Proll *et al.* Reference Proll, Helander, Connor and Plunk2012; Helander *et al.* Reference Helander, Proll and Plunk2013; Helander Reference Helander2014). It has been further theorized and shown in gyrokinetic simulations of W7-X that possessing the maximum-J property can also be beneficial for ITG turbulence (Proll *et al.* Reference Proll, Plunk, Faber, Görler, Helander, McKinney, Pueschel, Smith and Xanthopoulos2022). However, the TEM studies required modelling the full kinetic electron dynamics. In this study, we only use a Boltzmann response for the electrons.

### 3.2. Combined turbulence–quasi-symmetry optimization

Next, we include the two-term quasi-symmetry objective in the second part of the optimization loop. The final heat flux traces and optimized cross-sections of the flux surfaces are shown in figure 5. The cross-sections in figure 5(*b*) show relatively modest changes in the shape of the optimized stellarator. However, the time trace (simulated at the $\psi /\psi _b = 0.5$ surface) in figure 5(*a*) shows approximately a factor of three decrease in the nonlinear heat flux.

To ensure that the nonlinear heat flux was reduced throughout the plasma volume, we again ran simulations at radial locations of $\rho = 0.2$, $0.4$, $0.6$ and $0.8$. The resulting heat fluxes as well as the maximum symmetry breaking residuals across $\rho$ are plotted in figure 6. As seen in the plots, despite optimizing only for the $\rho = \sqrt {0.5}$ surface, the heat flux is reduced throughout the plasma volume. With regards to quasi-symmetry, for $\rho < 0.5$, the optimized equilibrium has smaller maximum symmetry-breaking modes than the initial equilibrium. However, this reverses in the outer surfaces. This is unexpected considering that the quasi-symmetry residuals were computed at $\rho = 0.6, 0.8$ and $1$. Nevertheless, the degree of symmetry breaking is still comparable to WISTELL-A (Bader *et al.* Reference Bader, Faber, Schmitt, Anderson, Drevlak, Duff, Frerichs, Hegna, Kruger and Landreman2020), another optimized QH stellarator. Indeed, the $|\boldsymbol {B}|$ plots in figure 7 show contours characteristic of a quasi-helically symmetric stellarator (plotted at the $\psi = 0.5$ surface).

Unlike in tokamaks, different field lines in stellarators experience different curvatures, magnetic fields, etc. and so may have very different fluxes (Dewar & Glasser Reference Dewar and Glasser1983; Faber *et al.* Reference Faber, Pueschel, Proll, Xanthopoulos, Terry, Hegna, Weir, Likin and Talmadge2015). Therefore, we also run simulations across $\alpha$ at the radial location $\rho = \sqrt {0.5}$. The resulting scan is shown in figure 8(*a*). At each $\alpha$ simulated, the optimized stellarator has a lower heat flux, indicating that the heat flux has been reduced across the entire flux surface.

It is interesting to see that for the initial geometry, there are large variations in the heat flux. This might indicate that some flux tubes are too short to sample enough area on the flux surface or that there is some coupling between the flux tubes. However, it should be noted that the $\iota$ of this equilibrium (plotted in figure 9) is close to $5/4$. Recent work has shown a very large variation across $\alpha$ on low-order rational surfaces (Buller *et al.* Reference Buller, Mandell, Parisi, Kim, Adkins, Gaur, Dorland and Landreman2023). In comparison, the optimized equilibrium has an $\iota$ that is slightly above 0.9. This could have negative impacts on the optimization routine as the optimizer may choose to approach equilibria with $\iota$ near low-order rationals. This could then lead to not only misleading heat fluxes, but also poor MHD stability. More work is needed to investigate the limitations of flux-tube codes, the effects of rational surfaces and the resulting consequences for optimization.

In a fusion reactor, since the transport is very stiff, the temperature profiles may evolve to approach the critical temperature gradient (the temperature gradient at which the heat flux is zero). Therefore, a decrease in the heat flux is less useful if the critical temperature gradient also decreases significantly. To verify that the critical temperature gradient did not decrease, we also ran several nonlinear simulations with different temperature gradients at $\rho = \sqrt {0.5}$. The results are shown in figure 8(*b*) which shows that the critical temperature gradient does not seem to change significantly. It would be more beneficial if the critical temperature gradient had also increased. Specifically targeting the critical temperature gradient will be the focus of future work.

We also run linear simulations for the initial and optimized equilibria, with the growth rates shown in figure 10(*a*) at $k_x = 0$ and figure 10(*b*) at $k_x = 0.4$. Although the optimized equilibrium has lower heat fluxes, it has a significantly higher peak growth rate at ${k_x = 0}$. However, it has lower growth rates at lower $k_y$, and one would expect that the nonlinear heat flux scales like $\gamma /\langle k_{\perp }^2\rangle$ (where the angle brackets indicate a flux-surface average) (Mariani *et al.* Reference Mariani, Brunner, Dominski, Merle, Merlo, Sauter, Görler, Jenko and Told2018). Therefore, this might result in the observed lower nonlinear heat fluxes. However, this trend reverses for the simulations at $k_x = 0$.4. Furthermore, recent work has shown that both peak growth rates and the quasilinear $\gamma /\langle k_{\perp }^2\rangle$ estimate are both poor proxies for the nonlinear heat flux (Buller *et al.* Reference Buller, Mandell, Parisi, Kim, Adkins, Gaur, Dorland and Landreman2023).

As a preliminary example, we replicate the plots from figure 10 but with a quasi-linear estimate of the form

where $\langle k_{\perp }^2 \rangle$ is

Here, $\phi$ is the electrostatic potential, $\sqrt {g}$ is the Jacobian and the integration is along the simulated field line. This expression is similar to that used by Mariani *et al.* (Reference Mariani, Brunner, Dominski, Merle, Merlo, Sauter, Görler, Jenko and Told2018). The resulting plots are shown in figure 11. We have enforced that $\min (f_{QL}) = 0$, which assumes that only growing modes contribute to the flux. As seen in the plots, while the optimized configuration has a smaller $f_{QL}$ at smaller $k_y$ at $k_x = 0$, it has a much larger $f_{QL}$ at $k_x = 0.4$. This makes it challenging to use such a quasilinear estimate of the heat flux as a proxy, necessitating our approach of directly computing the nonlinear heat flux.

Finally, we compare the optimized stellarator to another configuration solely optimized for quasi-symmetry. The plot of the heat fluxes across different surfaces is shown in figure 12. The heat fluxes for the precise QH equilibrium are higher than those from the approximately QH equilibrium used as the initial point. This shows that just optimizing for quasi-symmetry can be detrimental for turbulence and stresses the need to optimize for both.

### 3.3. Optimization with multiple field lines

As described in the previous section, several physical and computational factors can lead to large variations in the heat fluxes at different $\alpha$. To avoid this issue, we rerun the optimization loop but simulate on field lines of both $\alpha = 0$ and $\alpha = {\rm \pi}\iota /4$. This makes the new objective function

This new objective function serves as a way of reducing the heat flux across multiple field lines while also testing our stochastic optimizer against additional objectives. Similar situations include trying to optimize across different flux surfaces, different temperature/density gradients, etc.

The plots in figure 13 show the cross-sections, maximum symmetry breaking modes, and heat fluxes across $\rho$ and $\alpha$ for the initial equilibrium and the final optimized equilibria after optimizing at one and two field lines. While this new equilibrium has higher heat fluxes than when optimizing for just a single field line, it instead has lower maximum symmetry breaking modes and better quasisymmetry.

The scans across multiple $\alpha$ in figure 13(*d*) show approximately a 50 % variation in the heat flux compared with the $\alpha = 0$ point. Unfortunately, this is larger than the approximately 30 % variation in the single-field line case and comparable to the relative variation in the initial equilibrium. Nevertheless, the heat fluxes across each $\alpha$ are still 2–3 times smaller than in the initial equilibrium and this shows that the stochastic optimizer is still effective with additional terms in the objective function. More investigation is needed to more effectively and precisely optimize with multiple field lines.

### 3.4. Fluid approximation

In this final case, we change the simulation parameters within the optimization loop to approach the fluid limit in GX. This is based off of previous work that showed a fluid approximation with only a few velocity moments can accurately match the true heat fluxes at higher velocity resolution at moderate collision frequencies (Buck *et al.* Reference Buck, Dorland, Mandell, Kim, Fischer and Qian2022). That study was motivated by a recently developed three-field model for the density, temperature and momentum to approximate growth rates and nonlinear heat fluxes (Hegna, Terry & Faber Reference Hegna, Terry and Faber2018).

To approximate this model, we reduce the velocity resolution to four Hermite moments and two Laguerre moments, as in the gyrofluid model. While normally this requires closure relations like those of Beer (Reference Beer1995) and Mandell *et al.* (Reference Mandell, Dorland and Landreman2018), we instead increase the collisionality to damp the high wavenumber modes and increase the temperature gradient to $a/L_T = 5$. We employ a Dougherty collision operator (Dougherty Reference Dougherty1964).

The plots of the maximum-symmetry breaking modes and the heat flux across $\rho$ for the kinetic and fluid cases are shown in figure 14. While the new equilibrium does not achieve lower heat fluxes than in the kinetic case, it still achieves approximately a factor of two reduction compared to the initial equilibrium except at $\rho = 0.8$ despite using very different physical parameters. The fluid case retains comparable levels of quasi-symmetry as well. These results indicate that the optimization routine is robust against varying levels of fidelity. This opens new possibilities of potentially varying the fidelity across iterations. This can either decrease the computational cost further or allow us to increase other resolution parameters.

## 4. Effects of magnetic shear on reduced turbulence

It is well known since the findings of Greene & Chance (Reference Greene and Chance1981) that sufficient positive global magnetic shear can be destabilizing for intermediate to high wavenumber instabilities like the ballooning mode. An analogous relationship for transport was investigated by Kotschenreuther *et al.* (Reference Kotschenreuther, Dorland, Beer and Hammett1995*a*), where it was found that while increased shear can reduce thermal transport, it can also negatively impact the critical gradient.

To further investigate these effects, we optimized for several more precisely quasi-symmetric equilibria, but with different target shears. The nonlinear heat flux traces are shown in figure 15. The ${\rm shear} = 0.01$ case is when there is no shear target. While increasing the shear slightly does reduce the heat flux, continuing to increase it eventually increases the heat flux again, consistent with the findings of Kotschenreuther *et al.* (Reference Kotschenreuther, Dorland, Beer and Hammett1995*a*). It is somewhat surprising that the heat flux is this sensitive to the shear since the shear is very small. For comparison, the global magnetic shear for W7-X in these units is approximately 0.16. Future work will better address the relationship between shear and turbulence in stellarators. This relationship can be very important as very low global magnetic shear is characteristic of vacuum quasi-symmetric stellarators (Landreman & Paul Reference Landreman and Paul2022). While the shear increases for finite $\beta$ QS stellarators (Landreman *et al.* Reference Landreman, Buller and Drevlak2022), it is still small compared with shear in tokamaks.

## 5. Transport simulations

To study how the changes in nonlinear heat fluxes affected the macroscopic profiles, we used the T3D transport code (Qian *et al.* Reference Qian, Buck, Gaur, Mandell, Kim and Dorland2022). T3D uses the same algorithm as Trinity (Barnes *et al.* Reference Barnes, Abel, Dorland, Görler, Hammett and Jenko2010), but is written in Python and adapted to work with stellarator geometries. T3D takes advantage of the separation of scales in the local $\delta f$ gyrokinetic model where the distribution function is written as $F = F_{0s} + \delta f_s$. It can be shown that $F_{0s}$ is a Maxwellian and it is assumed that the perturbation $\delta f$ scales like $\delta f \sim \epsilon F_{0s}$, where $\epsilon = \rho /a$. The Maxwellian $F_{0s}$ evolves slower than the perturbation $\partial F_{0s}/\partial t \sim \epsilon ^2 \partial \delta f_s/\partial t$. This allows for T3D to evolve the macroscopic density, pressure and temperature profiles using heat and particle fluxes computed by gyrokinetic codes like GX.

We run T3D simulations for the initial approximately QH equilibrium and optimized equilibrium from § 3.2. Both equilibria were scaled to the same major radius and on-axis magnetic field as W7-X. An adiabatic response is still assumed for the electrons, and neoclassical contributions are ignored to isolate the effects from turbulence. The plots of the final steady-state temperature profiles for both equilibria are shown in figure 16.

From the plot of the temperature profiles, there is a clear increase in the temperature for the optimized equilibrium. The temperature at the innermost simulated point increased by approximately 10 $\%$. This seems lower than expected considering that the optimized equilibrium had lower heat fluxes by approximately a factor of three. The true heat flux scales like $Q \sim Q_{{\rm sim}} n T v_{{\rm th}} (\rho /a)^2$, where $Q_{{\rm sim}}$ is the heat flux from the simulation. Since $v_{{\rm th}} \sim T^{1/2}$ and $\rho \sim v_{{\rm th}}$, the heat flux scales like $Q \sim Q_{{\rm sim}} T^{5/2}$. For equivalent sources, $T \sim Q_{{\rm sim}}^{2/5}$. Therefore, we would expect the temperature to increase by a factor of $Q^{0.4} \sim 1.55$. However, the simulations from § 3.2 assumed a temperature gradient of $a/L_{Ti} = 3$. From the plot of the temperature gradients, most points are simulated with lower gradients, and the equilibria have the same critical temperature gradient (as seen in figure 8*b*). Nevertheless, the improvement is still very encouraging. Future work will investigate using simulations to optimize for a higher critical gradient or for specific target profiles.

## 6. Conclusions

In this work, we directly optimized stellarators for reduced nonlinear heat fluxes and good quasi-symmetry by coupling GX and DESC. By directly running nonlinear simulations, we include the nonlinear saturation mechanisms that determine the steady-state heat flux, and so avoid potential limitations of linear or quasilinear models. The SPSA method is used to handle the noisy heat flux objective and to cheaply estimate the gradient. The newly optimized equilibria show factors of 2–4 improvement in the nonlinear heat flux across several flux surfaces and multiple field lines while also having comparable or improved quasi-symmetry. We also investigated how global magnetic shear affects the nonlinear heat flux and found that slightly increasing it can significantly reduce the flux. However, increasing the shear too much led to increases in the heat flux. Our T3D simulations showed that by reducing the heat fluxes, the steady-state temperature profiles did increase slightly. Unfortunately, the transport is sufficiently stiff so that macroscopic profiles will approach the critical gradient. Since the initial and optimized equilibria have similar critical gradients, this limits the improvements in the temperature profile. Nevertheless, these results demonstrate that we can efficiently include nonlinear gyrokinetic simulations within the optimization loop. Future work will include adding an objective for MHD stability, optimizing for the nonlinear critical gradient and also directly optimize for desired macroscopic profiles. More improvements will also be made to the SPSA optimizer to make it more effective near minima. Currently, near the minima, the true gradient is small enough to be comparable to the simulation noise. This leads to poorer convergence near the minima and so requires more sophisticated optimization methods. Consequently, we will implement other optimization methods that have been effective for stochastic optimization, such as Bayesian optimization.

## Acknowledgements

The authors thank M. Zarnstoff and B. Buck for insightful and fruitful conversations. Research support came from the U.S Department of Energy (DOE). P.K and colleagues at UMD were supported by the DOE via the Scientific Discovery Through Advanced Computing Program under award number DE-SC0018429 as well as under award number DE-FG0293ER54197. P.K has also been supported through the DOE CSGF Program under award number DE-SC0024386 (while at Princeton University). R.C., D.W.D., E.K., and D.P. were supported by the U.S. Department of Energy under contract numbers DE-AC02-09CH11466, DE-SC0022005 and Field Work Proposal No. 1019. This work started as part of P.K.'s Summer 2022 DOE SULI internship under award number DE-AC02-09CH11466. N.R.M was supported by the DOE Fusion Energy Sciences Postdoctoral Research Program administered by the Oak Ridge Institute for Science and Education (ORISE) for the DOE via Oak Ridge Associated Universities (ORAU) under DOE contract number DE-SC0014664, and by the Laboratory Directed Research and Development Program of the Princeton Plasma Physics Laboratory under U.S. Department of Energy contract number DE-AC02-09CH11466. R.J. is supported by the Portuguese FCT–Fundação para a Ciência e Tecnologia, under Grant 2021.02213.CEECIND. This work has been carried out within the framework of the EUROfusion Consortium, funded by the European Union via the Euratom Research and Training Programme (Grant Agreement No 101052200 – EUROfusion). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the European Commission can be held responsible for them. Instituto Superior Técnico activities also received financial support from FCT through Projects UIDB/50010/2020 and UIDP/50010/2020. Computations were performed on the Traverse and Stellar clusters at Princeton/PPPL as well as the Perlmutter cluster at NERSC.

*Editor Per Helander thanks the referees for their advice in evaluating this article.*

## Declaration of interests

The authors report no conflict of interest.

## Data availability statement

The data that support the findings of this study are openly available in Zenodo at http://doi.org/10.5281/zenodo.10277921 (Kim Reference Kim2024).

## Appendix A. Codes

#### A.1. GX

GX employs the radially local $\delta \!f$ approach to solve the gyrokinetic equation. In this approximation, the distribution function $F_s$ is represented as $F_s = F_{0s} + \delta \! f_s = F_{Ms}(1 - Z_s \varPhi /\tau _s) + h_s$, where $F_{0s}$ is Maxwellian and $\delta \! f_s$ is a perturbation consisting of a Boltzmann part proportional to the electrostatic potential $\varPhi$ and a general perturbation $h_s$. The subscript $s$ labels species. Assuming electrostatic fluctuations, the gyroaveraged perturbation $g_s = \langle f_s \rangle$ then obeys the electrostatic gyrokinetic equation

where $v_{\parallel }$ is the velocity parallel to the magnetic field, $\boldsymbol {b} = \boldsymbol {B}/B$, $\mu$ is the magnetic moment, $z$ is some coordinate along the field line, $\kappa$ is the field line curvature, $\boldsymbol {v}_{Ms}$ is the sum of the magnetic and curvature drift velocities, and $\boldsymbol {v_E}$ is the $\boldsymbol {E} \times \boldsymbol {B}$ velocity. For this work, electromagnetic effects are ignored and a Boltzmann response is used to model the perturbed electron distribution.

In GX, (A1) is projected onto the Hermite and Laguerre basis functions

where $\mathrm {He}_m(x)$ and $\mathrm {L}_l(x)$ are the (probabilist's) Hermite and Laguerre polynomials, respectively. More details on the expansion and numerical algorithm can be found from Mandell *et al.* (Reference Mandell, Dorland and Landreman2018) and Mandell *et al.* (Reference Mandell, Dorland, Abel, Gaur, Kim, Martin and Qian2023).

By choosing a Hermite–Laguerre basis, the resulting equations for the spectral coefficients reduce to the gyrofluid equations at low resolution (Dorland & Hammett Reference Dorland and Hammett1993; Beer Reference Beer1995). Particle number, momentum and energy are also conserved at low resolution. Overall, any inaccuracies are due to the closure or dissipation model used. Therefore, GX allows for lower velocity resolution than other similar codes like GS2 (Kotschenreuther *et al.* Reference Kotschenreuther, Rewoldt and Tang1995*b*; Dorland *et al.* Reference Dorland, Jenko, Kotschenreuther and Rogers2000) and stella (Barnes, Parra & Landreman Reference Barnes, Parra and Landreman2019) that use finite-difference methods in velocity space. Flexible velocity resolution combined with a GPU implementation allows GX to run nonlinear (electrostatic with adiabatic electrons) gyrokinetic simulations in minutes rather than hours or days.

#### A.2. DESC

DESC (Dudt & Kolemen Reference Dudt and Kolemen2020; Conlin *et al.* Reference Conlin, Dudt, Panici and Kolemen2023; Dudt *et al.* Reference Dudt, Conlin, Panici and Kolemen2023; Panici *et al.* Reference Panici, Conlin, Dudt, Unalmis and Kolemen2023) is a new stellarator equilibrium and optimization code. Two of the main features of DESC are its pseudo-spectral representation of the magnetic geometry and the use of automatic differentiation to compute exact derivatives.

DESC directly solves the ideal MHD force balance equation

DESC uses as its computational domain the coordinates $(\rho,\theta,\zeta )$, defined as

where $\psi$ is the enclosed toroidal flux, $\psi _a$ is the total enclosed toroidal flux in the plasma, $\theta ^*$ is a straight field line poloidal angle, $\lambda$ is a stream function and $\phi$ is the geometric toroidal angle. It then expands $\lambda$, and the cylindrical coordinates $R$ and $Z$ in terms of a Fourier–Zernike basis,

where $\mathcal {R}_l^{|m|}$, $\mathcal {Z}_l^m$ and $\mathcal {F}$ are the shifted Jacobi polynomials, Zernike polynomials and Fourier series, respectively.

It can be shown that the force error $\boldsymbol {J} \times \boldsymbol {B} - \boldsymbol {\nabla } p$ has only two independent components

where $\sqrt {g}$ is the Jacobian and the superscripts indicate the contravariant components. DESC then solves for the spectral coefficients $R_{lmn}$, $Z_{lmn}$ and $\lambda _{lmn}$ that minimize this force error.

DESC also uses automatic differentiation through JAX (Bradbury *et al.* Reference Bradbury, Frostig, Hawkins, Johnson, Leary, Maclaurin, Necula, Paszke, VanderPlas and Wanderman-Milne2018) in its optimization routines. Briefly, in automatic differentiation, the chain rule is applied through the code, allowing DESC to compute exact derivatives faster and more accurately than derivatives from finite differencing routines. However, at the time of this writing, this requires writing the objective function in native Python and GX is written in C++/CUDA. Furthermore, it is unclear how effective it would be to differentiate through the time stepping of such a noisy system. Therefore, this would require extensive code developments for both DESC and GX, and so AD will not be used when optimizing for reduced turbulence.

## Appendix B. Simulation parameters used in optimization

#### B.1. Turbulence, turbulence-QS and multiple $\alpha$ optimization

For the GX simulations within the optimization loop, we use the simulation parameters listed in table 1.

The number of poloidal turns, parallel resolution and number of simulated modes (proportional to $n_x$ and $n_y$) are chosen to enable cheaper simulations. However, one poloidal turn has been used in previous nonlinear W7-X gyrokinetic benchmark studies (González-Jerez *et al.* Reference González-Jerez, Xanthopoulos, García-Regaña, Calvo, Alcusón, Bañón Navarro, Barnes, Parra and Geiger2022*a*). Furthermore, from a quasilinear estimate, the heat flux scales like $1/\langle k_{\perp }^2\rangle$ (where the angle brackets indicate a flux-surface average) (Mariani *et al.* Reference Mariani, Brunner, Dominski, Merle, Merlo, Sauter, Görler, Jenko and Told2018). Consequently, higher wavenumber modes contribute less to the heat flux. The gradients are typical for experimental profiles for W7-X (Beurskens *et al.* Reference Beurskens, Bozhenkov, Ford, Xanthopoulos, Zocco, Turkin, Alonso, Beidler, Calvo and Carralero2021) and have been previously used for benchmark, transport and optimization studies (González-Jerez *et al.* Reference González-Jerez, Xanthopoulos, García-Regaña, Calvo, Alcusón, Bañón Navarro, Barnes, Parra and Geiger2022*b*; Roberg-Clark *et al.* Reference Roberg-Clark, Plunk, Xanthopoulos, Nührenberg, Henneberg and Smith2023; Bañón Navarro *et al.* Reference Bañón Navarro, Di Siena, Velasco, Wilms, Merlo, Windisch, LoDestro, Parker and Jenko2023). Finally, it has been demonstrated that GX simulations can yield accurate heat flux traces using the specified number of Hermite and Laguerre polynomials (Mandell *et al.* Reference Mandell, Dorland, Abel, Gaur, Kim, Martin and Qian2023). To check our results, the post-processing simulations are run at higher resolution.

#### B.2. Fluid approximation optimization

When running the optimization at fluid resolution, the simulation parameters are identical except those listed in table 2.

The velocity resolution is decreased to that of the gyrofluid model and the collision frequency is increased. Since there is far greater dissipation in the model, we also increase the temperature gradient.

#### B.3. Post-processing simulations

To ensure our final results are well converged, we increase the resolution of the post-processing simulations to those listed in table 3.

The scans over $\rho$, $\alpha$ and $a/L_T$ of course include additional values.

#### B.4. DESC equilibria parameters

The parameters used for the equilibria are shown in table 4. These are considered ‘moderate’ resolution and are typical values used for optimization. This is a vacuum case, so both the pressure and current are zero. When optimized for precise QH symmetry, the resulting equilibrium is similar to the QH equilibirum of Landreman & Paul (Reference Landreman and Paul2022).

#### B.5. Optimization timings

Table 5 shows the approximate timings of different parts of the optimization loop using a single Tesla V100 GPU on the Princeton Traverse cluster. All values are in minutes. In total, each optimization completed in approximately 20 hours. It takes slightly more than 12 hours on a single NVIDIA A100 GPU on the NERSC Perlmutter cluster. The ‘optimization’ timings are for a single iteration.