Unsupervised Discovery of Inertial-Fusion Plasma Physics using Differentiable Kinetic Simulations and a Maximum Entropy Loss Function

Plasma supports collective modes and particle-wave interactions that lead to complex behavior in inertial fusion energy applications. While plasma can sometimes be modeled as a charged fluid, a kinetic description is needed to study nonlinear effects in the higher-dimensional momentum-position phase space that describes the full complexity of plasma dynamics. We create a differentiable solver for the three-dimensional partial differential equation of plasma kinetics and introduce a domain-specific objective function. Using this framework, we perform gradient-based optimization of neural networks that provide forcing function parameters to the differentiable solver given a set of initial conditions. We apply this to an inertial-fusion relevant configuration and find that the optimization process exploits a novel physical effect that had previously remained undiscovered.


Fusion, Charged Fluid Dynamics, and Kinetics
Thermonuclear fusion involves conditions where matter is best represented as a warm plasma in which the ions and electrons form a charged particle soup. Such plasmas can often be described using a combination of Maxwell's equations and the Navier-Stokes equations. However, many non-linear effects in plasmas cannot be modeled using the fluid description without further modifications.
These effects are termed kinetic effects. To model kinetic effects, a more complete description is required. This can be accomplished either by retaining the velocities of individual particles of the fluid or by retaining a statistical representation of the particles. In either case, the objective becomes to model the phase-space dynamics rather than a version integrated in velocity.

1 Ergodic LLC, San Francisco, CA, USA 2 University of Michigan, Ann Arbor, MI, USA. Correspondence to: Archis Joglekar <archisj@umich.edu>.
Phase space dynamics for plasmas are given by formulations or realizations of the non-linear transport equation. A common simplification, valid on fast time-scales, is to assume that the ions, and their phase space, remain stationary. Then, the transport equation for electrons is given by

∂f/∂t + v ∂f/∂x + E ∂f/∂v = (∂f/∂t)_coll,   (1)

where E is obtained from ∂E/∂x = ∫ f dv − 1, and f = f(t, x, v). Equation 1, along with Gauss's Law, is often termed the Vlasov-Poisson-Fokker-Planck (VPFP) equation set. Solving the VPFP set is often analytically intractable, even in one spatial and one velocity dimension (1D-1V). This is because the left-hand side has a stiff linear transport term, has a non-linear term in E ∂f/∂v, and can sustain wave propagation and other hyperbolic partial-differential-equation (PDE) behavior. Additionally, the right-hand side is typically represented by an advection-diffusion (Fokker-Planck) operator. Making progress on kinetic plasma physics requires computational simulation tools.
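To make the structure of the left-hand side concrete, a minimal collisionless (Vlasov-Poisson) update can be sketched as below. This is an illustration of the split advection-plus-field-solve structure on a periodic 1D-1V grid, not the solver used in this work, which also treats the Fokker-Planck right-hand side; the grid sizes and sign conventions here are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of one time step of the collisionless (Vlasov-Poisson) part
# of eq. 1 on a periodic 1D-1V grid, using Strang splitting: advect in x,
# update E from Gauss's law, advect in v.

def gauss_law_E(f, kx, dv):
    """Spectral solve of dE/dx = (integral of f dv) - 1 on the periodic x grid."""
    rho = f.sum(axis=1) * dv - 1.0
    rho_k = np.fft.fft(rho)
    E_k = np.zeros_like(rho_k)
    E_k[1:] = rho_k[1:] / (1j * kx[1:])      # integrate in k; drop the k = 0 mode
    return np.real(np.fft.ifft(E_k))

def advect_x(f, kx, v, dt):
    """Exact (spectral) free-streaming: f(x, v) <- f(x - v dt, v)."""
    fk = np.fft.fft(f, axis=0)
    phase = np.exp(-1j * kx[:, None] * v[None, :] * dt)
    return np.real(np.fft.ifft(fk * phase, axis=0))

def advect_v(f, v, E, dt):
    """Semi-Lagrangian velocity advection: f(x, v) <- f(x, v - E(x) dt)."""
    out = np.empty_like(f)
    for i, Ei in enumerate(E):
        out[i] = np.interp(v - Ei * dt, v, f[i], left=0.0, right=0.0)
    return out

def vlasov_step(f, kx, v, dv, dt):
    f = advect_x(f, kx, v, dt / 2)
    E = gauss_law_E(f, kx, dv)
    f = advect_v(f, v, E, dt)
    return advect_x(f, kx, v, dt / 2)
```

The split structure mirrors the stiffness argument above: the linear transport term is handled exactly in Fourier space, while the non-linear E ∂f/∂v term is a shift along the characteristics in v.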
Numerical solutions to the 1D-1V VPFP equation set have been applied in research on laser-plasma interactions in the context of plasma-based accelerators (Thomas, 2016), space physics (Chen et al., 2019), fundamental plasma physics (Pezzi et al., 2019), and inertial fusion (Strozzi et al., 2007; Fahlen et al., 2009; Banks et al., 2011). In this work, we extend previous findings (Fahlen et al., 2009) in the context of inertial fusion using a gradient-based approach towards the discovery of new physics.

Gradient-Based Optimization in Plasma Physics
There have been several recent applications of gradient descent towards fusion plasma physics. Analytic approaches have resulted in the development of adjoint methods for shape derivatives of functions that depend on magnetohydrodynamics (MHD) equilibria (Antonsen et al., 2019; Paul et al., 2020). These methods have been used to perform optimization of stellarator design (Paul et al., 2021). In other work, by using gradients obtained from analytic (Zhu et al., 2017) and automatic differentiation (AD) (McGreivy et al., 2021), the FOCUS and FOCUSADD codes optimize coil shape. These advances are founded on the concept of performing a sensitivity analysis towards device design.

Figure 1. Left - We create a differentiable loop to learn forcing function parameters that optimize for a free energy and entropy metric. Middle - The optimization uncovers a superadditive process that results in a long-lived second wavepacket. Right - Deeper inspection shows that the learning process exploits novel kinetic plasma physics in order to minimize damping, where the second wavepacket interacts with the electrons from the first wavepacket.
This work applies AD towards learning physical relationships and discovering novel phenomena in the VPFP dynamical system. We do this by training neural networks through differentiable simulations that solve the 1D-1V version of eq. 1 and optimize for a custom objective function.

Applications of Differentiable Physical Simulations
Differentiable physics simulations have been used in a variety of contexts: for learning parameters for molecular dynamics (Schoenholz & Cubuk, 2019), for learning differencing stencils in PDEs (Bar-Sinai et al., 2019; Zhuang et al., 2020; Kochkov et al., 2021), and for controlling PDEs (Holl et al., 2020).
In the work here, we train a neural network that provides control parameters to the PDE solver. We choose physical parameters as inputs and control parameters as outputs of the neural network. This enables the neural network to learn a function that describes the physical relationship between the plasma parameters and the forcing function parameters e.g. the resonance frequency.
We train the neural network in an unsupervised fashion using a cost function based on the maximum entropy principle. This enables us to create self-learning plasma physics simulations, where the optimization process provides a physically interpretable function that can enable physics discovery.

Parameter and Function Learning using Differentiable Simulations
In this section, we provide a step-by-step description of how a traditional simulation-based computational physics workflow may be modified to perform closed-loop optimization as applied to parameter and function learning.

Open Loop -Manual Workflow
Figure 2a depicts a typical workflow of a computational scientist, represented as a cyclic graph. The scientist defines the parametric inputs which create the state vector x. This can contain any parameters that are used to define the simulation, e.g. the grid size, the number of solver steps, etc. For didactic purposes, the physical parameters to the simulation may be separated into a different vector of inputs, p_d, e.g. the forcing function parameters, the viscosity coefficient, etc.
Each of x and p_d is passed to the algorithm that solves the PDE, which is represented by the function V. The output of these simulations is stored in the final state vector x_f. The final state is postprocessed using a domain-specific set of algorithms devised by the scientist or otherwise. The results of the postprocessing are interpreted by the scientist, who then determines the next set of inputs and parameters.

Figure 2b shows a more automated workflow. We replace the gray-box postprocessing step with the calculation of a scalar quantity S using a Cost Function C on the final state x_f. This reduces the complexity of the interpretation of the postprocessing and enables a more rapid search in parameter space.

Figure 2. (c) A gradient-descent based optimization algorithm replaces the parameter scan to provide a more efficient search mechanism. This requires the components in the purple background to be written in an auto-differentiable framework. (d) We add a neural network that generates the forcing function parameters as a function of other parameters. This generalizes the learnings from (c) and enables interpolation and extrapolation within the learned parameters.

Closed Loop - Brute Force Parameter Scan
The decrease in required human effort for completing one cycle enables the scientist to execute this loop as a brute-force parameter scan over a pre-defined parameter space. At the end, the scientist can look up the minimum/maximum of the scalar cost function and find the parameters which provide that minimum.
The parameter scan approach scales with the number of unique parameters and the number of values of each parameter, e.g. a 2-D search in x and y requires N_x × N_y calculations. Therefore, the parameter scan approach quickly becomes inefficient when there are many parameters to scan, or when the required resolution in parameter space is very high. To search this parameter space efficiently, and to escape the multiplicative scaling with each additional parameter, we can use gradient descent.
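As a concrete illustration of this scaling argument, the sketch below compares evaluation counts for a grid scan and for gradient descent on the same two-parameter landscape. The quadratic cost is a hypothetical stand-in for S, not a plasma quantity; each evaluation represents one full PDE solve in the real workflow.

```python
import numpy as np

# Stand-in scalar cost with a single minimum at (1.2, -0.7); each evaluation
# of S here represents one full PDE solve in the real workflow.
def S(wx, wy):
    return (wx - 1.2)**2 + (wy + 0.7)**2

# Brute-force scan: resolution must be chosen in advance, and the cost is
# Nx * Ny = 100 * 100 = 10,000 "solves" for two parameters.
grid = np.linspace(-2.0, 2.0, 100)
costs = S(grid[:, None], grid[None, :])
i, j = np.unravel_index(np.argmin(costs), costs.shape)
scan_estimate = (grid[i], grid[j])   # accurate only to the grid spacing

# Gradient descent: ~200 "solves" reach far better resolution, and the step
# size adapts to the landscape instead of being fixed by the grid.
w = np.array([0.0, 0.0])
for _ in range(200):
    grad = np.array([2.0 * (w[0] - 1.2), 2.0 * (w[1] + 0.7)])
    w -= 0.1 * grad
```

The scan's accuracy is capped by the pre-defined grid spacing (~0.04 here), while the gradient-descent iterate converges to the minimum to within numerical precision using fewer evaluations.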

Gradient-Descent-Based Parameter Learning
Figure 2c includes two modifications. The scientist/parameter-search gray box has been replaced with a gradient-descent-based optimization algorithm. This algorithm provides the updated parameters, e.g. ω_G, a guess for the resonant frequency of the system, for the next iteration of the loop. The gradient-descent algorithm requires the calculation of an accurate gradient. By writing our PDE solver V and the cost function C using a numerical framework that supports automatic differentiation, we are able to calculate ∂S/∂ω_G, since the gradient-descent update step is given by

ω_G ← ω_G − η ∂S/∂ω_G,

with learning rate η. For example, if we wish to learn the resonant frequency, ω, that optimizes for the scalar, S, we compute

∂S/∂ω = (∂C/∂x_f)(∂x_f/∂ω).   (2)

Assuming a well-behaved solution manifold, performing gradient descent tends to reduce the number of iterations required to find the minimum in comparison to an evenly-spaced parameter scan, especially when the required resolution is unknown (Nocedal & Wright, 2006). Put another way, gradient descent effectively provides an adaptive stepping mechanism in lieu of the pre-defined values that represent a parameter scan.
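The loop in fig. 2c can be sketched with a toy stand-in for the solver-and-cost pipeline. Here V and C are collapsed into a single scalar response S(ω) with a known resonance (a hypothetical Gaussian response, not the VPFP solve), and the gradient that AD would supply is written out analytically.

```python
import numpy as np

# Toy sketch of the fig. 2c loop: gradient descent on a forcing-function
# parameter. S(omega) stands in for the full solve-then-cost pipeline
# C(V(x, p_d)); its minimum at OMEGA_R plays the role of the plasma's
# resonant frequency. In the real workflow, dS/domega comes from automatic
# differentiation through the PDE solver rather than an analytic formula.

OMEGA_R, SIGMA = 1.1, 0.3   # hypothetical resonance and linewidth

def S(omega):
    """Scalar cost: negative of a resonance curve, minimized at OMEGA_R."""
    return -np.exp(-(omega - OMEGA_R)**2 / (2 * SIGMA**2))

def dS_domega(omega):
    """Analytic gradient of S (the quantity AD would provide)."""
    return -S(omega) * (omega - OMEGA_R) / SIGMA**2

def learn_resonance(omega_guess, lr=0.05, steps=500):
    """Iterate the update step omega <- omega - lr * dS/domega."""
    for _ in range(steps):
        omega_guess -= lr * dS_domega(omega_guess)
    return omega_guess
```

Starting from any guess within the response linewidth, the loop converges to the resonance without a pre-defined grid of trial frequencies.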

Gradient-Descent-Based Function Learning
In the final step, shown in fig. 2d, we replace the lookup-like capability of the parameter optimization and choose to learn a function that provides the learned parameters. The advantage of learning a function is the ability to interpolate and extrapolate within the input space.
Here, we choose to use neural networks, with a parameter vector θ, to parameterize the desired function. This allows us to extend the gradient-descent-based methodology and leverage existing numerical software to implement this differentiable programming loop. Now, the forcing function parameter is generated by a function G given the parameter vector θ. To extend the example from sec. 2.3, ω is now a function given by ω = G(x, p_d; θ).
We compute the same gradient as in eq. 2 and add a correction factor that arises because the parameter (vector) is now θ, rather than ω. The necessary gradient for the update is now given by

∂S/∂θ = ∂S/∂ω × ∂ω/∂θ.   (3)
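A minimal sketch of this function-learning loop, under the same kind of toy assumptions as before: the "true" resonance ω_r(k) is a hypothetical Bohm-Gross-like dispersion chosen for illustration (not the paper's physics), dS/∂ω stands in for the gradient from the differentiable solve, and the chain rule of eq. 3 is written out as hand-coded backpropagation through a one-hidden-layer network G(k; θ).

```python
import numpy as np

rng = np.random.default_rng(0)

def omega_res(k):
    """Toy 'true' resonance vs wavenumber (Bohm-Gross-like; illustrative only)."""
    return np.sqrt(1.0 + 3.0 * k**2)

SIGMA = 0.3

def dS_domega(omega, k):
    """Gradient of the toy cost S = -exp(-(omega - omega_r(k))^2 / (2 SIGMA^2))."""
    d = omega - omega_res(k)
    return np.exp(-d**2 / (2 * SIGMA**2)) * d / SIGMA**2

# One-hidden-layer network G(k; theta) -> omega.
H = 16
W1 = rng.normal(0.0, 1.0, (H, 1)); b1 = np.zeros((H, 1))
W2 = rng.normal(0.0, 0.1, (1, H)); b2 = np.array([[1.0]])

def forward(k):
    h = np.tanh(W1 @ k + b1)
    return (W2 @ h + b2), h

ks = rng.uniform(0.1, 0.5, (1, 64))          # training wavenumbers
lr = 0.05
for _ in range(5000):
    om, h = forward(ks)
    g = dS_domega(om, ks) / ks.shape[1]      # dS/domega, batch-averaged
    # chain rule dS/dtheta = dS/domega * domega/dtheta (eq. 3), by hand:
    dW2 = g @ h.T;  db2 = g.sum(axis=1, keepdims=True)
    dh = (W2.T @ g) * (1.0 - h**2)           # backprop through tanh
    dW1 = dh @ ks.T; db1 = dh.sum(axis=1, keepdims=True)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2
```

After training, G interpolates the resonance curve for wavenumbers between the training samples, which is exactly the generalization a single-parameter optimization cannot provide.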

Physics Discovery as an Optimization Problem
We will start with a basic physical effect that is well understood: electrostatic wavepackets in inertial fusion plasmas. Then we will ask the question of "what happens when you excite this system?"

Electrostatic Wavepackets in Inertial Fusion
When electrostatic waves are driven to large amplitude, electrons can become trapped in the large potential (O'Neil, 1965). Simulations of Stimulated Raman Scattering (SRS) in inertial confinement fusion (ICF) scenarios show that large-amplitude waves of finite extent are generated in the laser-plasma interaction, and that particle trapping is correlated with the transition to the high-reflectivity burst regime of SRS (Strozzi et al., 2007;Ellis et al., 2012).
Simulating wavepackets, similar to those generated in SRS, but in isolation, has illuminated kinetic dynamics where trapped electrons, with velocity ≈ v ph , transit the wavepacket which is moving at the slower group velocity v g . The transit of the trapped electrons from the back of the wavepacket to the front results in the resumption of Landau damping at the back and the wavepacket is then damped away (Fahlen et al., 2009).
Recent work modeled the interaction of multiple speckles with a magnetic field acting as a control parameter. Since the effect of the magnetic field is to rotate the distribution in velocity space, the field strength serves as a parameter by which the authors control scattered particle propagation. Using this, along with carefully placed laser speckles, they show that scattered light and particles can serve as the trigger for SRS (Winjum et al., 2019).
Here, we ask: What happens when a non-linear electron plasma wavepacket is driven on top of another?

Figure 3. Given a first wavepacket with wavenumber k0 and a desired time of second wavepacket excitation t1, the task is to learn functions that give the optimal frequency ω1 and spatial location x1 of the second wavepacket.

To answer this question, we reframe it as an optimization problem and ask: What is the best way to excite a wavepacket that interacts with a pre-existing wavepacket?
To solve this optimization problem, we turn to the framework established in the previous section and in fig. 2. To automate this process, we can use the workflow described in fig. 2c or fig. 2d. In both, the physicist is still required to provide their expertise in the form of 1/ the parameterization, and 2/ the objective function.

Physicist Input 1: Parameterizing the Forcing Function with a Neural Network
We start with a large-amplitude, finite-length electrostatic wavepacket driven by a forcing function with parameters x_i, the location of excitation; ω_i, the frequency; t_i, the time of excitation; and k_i, the wavenumber of the i-th wavepacket.
Since we seek to excite a second wavepacket that can interact with the detrapped electrons, we set the wavenumber k_1 = k_0. We reparameterize the resonant frequency, ω_1, with a frequency shift, ∆ω_1, and the linear resonant frequency ω_0, such that ω_1 = ω_0 + ∆ω_1.
We use the time of excitation of the second wavepacket, t_1, as an independent variable along with k_0. For each t_1 and k_0, we seek to learn functions that produce x_1 and ∆ω_1, i.e. we seek to learn x_1(t_1, k_0) and ∆ω_1(t_1, k_0). The entire parameter vector for the second wavepacket is given by

p_1 = (x_1(t_1, k_0), ω_0 + ∆ω_1(t_1, k_0), t_1, k_1 = k_0).

This framing is also illustrated in fig. 3 where, given k_0 and t_1, we seek functions for ω_1 and x_1.

Physicist Input 2: A Minimum Energy and Maximum Entropy Loss Function
We reparameterize ∆ω_1 and x_1 with a neural network with a parameter vector, θ*, that maximizes the electrostatic energy (minimizes the free energy) and maximizes the kinetic entropy, i.e.

θ* = argmin_θ (−U_E − S_f),   (7)

where

U_E = ½ ∫ E² dx dt   and   S_f = −∫ f ln(f / f_MX) dx dv

are the electrostatic energy, and entropy, terms in the loss function, respectively, and f_MX = f_MX(n, T), with n = ∫ f dv and T = ∫ f v² dv, where f_MX is the Maxwell-Boltzmann distribution.
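The two loss terms can be computed from gridded simulation output as sketched below. The exact normalizations and integration domains of the paper's loss may differ; the entropy here is taken relative to the local Maxwell-Boltzmann f_MX built from the moments defined above, which is one plausible reading of that definition.

```python
import numpy as np

def electrostatic_energy(E, dx):
    """U_E = 1/2 * integral of E^2 dx for one snapshot (sum over time in practice)."""
    return 0.5 * np.sum(E**2) * dx

def entropy_term(f, v, dx, dv):
    """S_f = -integral of f ln(f / f_MX), with f_MX built from the local n and T."""
    n = f.sum(axis=1) * dv                          # n(x) = integral of f dv
    T = (f * v[None, :]**2).sum(axis=1) * dv / n    # T(x) from integral of f v^2 dv
    f_mx = (n / np.sqrt(2.0 * np.pi * T))[:, None] * np.exp(-v[None, :]**2 / (2.0 * T[:, None]))
    ratio = np.where(f > 0, f / f_mx, 1.0)          # avoid log(0) where f vanishes
    return -np.sum(f * np.log(ratio)) * dx * dv

def loss(f, E, v, dx, dv):
    """Loss to be minimized: negative electrostatic energy plus negative entropy."""
    return -(electrostatic_energy(E, dx) + entropy_term(f, v, dx, dv))
```

Note that S_f vanishes when f is exactly Maxwellian everywhere, so minimizing the loss rewards both a large-amplitude field and a distribution driven far from f_MX.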

Function Learning Results in the Exploitation of Novel Physics
In this section, we describe the outcome of posing "Physics Discovery as an Optimization Problem" using our differentiable simulations framework. We start at the highest level, describing the training dynamics, and work our way deeper into the results by attempting to interpret what the training process has learned. Figure 4 shows that the loss value is reduced over the duration of the training process, which runs for 60 epochs over roughly 30 hours. The convergence in the loss metric suggests that we were able to train an overparameterized neural network with 35 samples of data in 60 epochs. We attribute this to the effect of having so-called 'physical' gradients from training through a PDE solver (Holl et al., 2021).

Training Dynamics suggest Sample Efficiency
Figure 4. The loss given by eq. 14 is plotted as a function of time. Each cross represents an epoch, and batch-wise fluctuations are also displayed. The training converges after roughly 30 hours.

Figure 4 implies that the training process did indeed minimize eq. 14. To understand how this quantity was minimized, it is important to diagnose the simulations that minimize it. Looking at the electric field as a function of time, we find that the wavepacket survives for a much longer duration without damping.

ELECTRIC FIELD
In this section, we analyze the observed behavior of the electric field in the simulations that minimize the metric function. Figure 5 shows the electric field profile for three different simulations. In fig. 5a, only the first wavepacket is excited, and in fig. 5b, only the second wavepacket is excited. In fig. 5c, both wavepackets are excited.
Early in time, at t = 400 ω_p⁻¹ (green), when only the first wavepacket has been excited, figs. 5a and 5c agree perfectly. The second wavepacket is excited at t = 500 ω_p⁻¹. At t = 900 ω_p⁻¹, once some time has passed after the excitation of the second wavepacket, the first wavepacket has not fully damped away; it is visible as small bumps in figs. 5a and 5c. The second wavepacket is also present at this time and easily seen in fig. 5b. A larger-amplitude wavepacket is seen in fig. 5c. Late in time, the difference in amplitude between the second wavepacket in figs. 5b and 5c is obvious. The second wavepacket has nearly damped away in fig. 5b. In fig. 5c, the second wavepacket continues to persist, at nearly the same energy as it was at t = 900 ω_p⁻¹. This superadditive behavior, where f(x) + f(y) ≤ f(x + y), suggests constructive interference. We observe this effect for all wavenumbers we model.

PHASE SPACE
To determine the mechanism behind this phenomenon, we turn to the phase-space dynamics. In fig. 6, (i) is a space-time plot of the electric field. The two dash-dotted red lines at the front and back of the wavepacket are parallel and indicate the velocity of the wavefront. In fig. 6a(i), the front of the wavepacket propagates at a seemingly faster rate than the rear. This is due to the etching effect (Fahlen et al., 2009). In fig. 6b, the wave survives for a much longer time, as was also illustrated in fig. 5. In fig. 6a(ii) and (iii), we see that the rear of the wavepacket is Maxwellian. As previously shown, this is why the rear of the wavepacket damps faster than the front as in fig. 6a(i) (Fahlen et al., 2009).
In the simulations described here, fig. 6b shows that the distribution function at the back of the wavepacket has trapped-particle activity (fig. 6b(ii)) and near-zero slope at the phase velocity of the wave (fig. 6b(iii)). Both plots show that the slope is negligible because of the arrival of streaming detrapped particles from the first wavepacket. Due to this effect, the re-emergence of Landau damping that follows the loss of trapped particles in isolated wavepackets no longer occurs here. This results in a reduction of the etching, and the wavepacket propagates freely for some time while the particles from the first wavepacket propagate and arrive at the rear of the second wavepacket.

Conclusion
In this preprint, we show how one may be able to discover novel physics using differentiable simulations by posing a physical question as an optimization problem. This removes the physicist from the loop and allows the simulation process to use physical gradients in order to arrive at the solution. The job of the physicist is abstracted one level up and now consists of providing their domain expertise in 1/ determining which parametric or functional dependencies to learn (and whether they should be learned using neural networks or other parameterized functions) and 2/ determining the nature of the objective function that best fulfills the desired criteria.
In fig. 2, we show how one may adapt an existing computational science workflow to the autodidactic process described here. In the work performed here, this process enabled the discovery of parameters in a 4D search space with known bounds but an unknown resolution. We trained the model over a coarse grid in k 0 and t 1 , and learned functions for x 1 and ω 1 . Using gradient descent here allows an escape from the curse of dimensionality and reduces the problem from a 4D search to a 2D search + 2D gradient descent.
This discovery process is not limited to differentiable simulations. While in fig. 2, V represents a PDE solve, it only needs to be an AD-enabled function that is a model for a physical system. For example, rather than a PDE solve, V could represent a pre-trained neural-network-based emulator for experimental data. In such a scenario, one may be able to learn forcing function parameters for an experiment using the proposed workflow.

Figure 6. Middle - Phase-space plots at the back (top) and front (bottom) of the wavepacket. In (b), the phase space shows significant activity at the back of the wavepacket while in (a), the distribution function is nearly undisturbed. Right - The spatially averaged distribution function. This confirms that the distribution function has returned to a Maxwell-Boltzmann at the back of the wavepacket in (a), while in (b), the distribution function remains flat at the phase velocity of the wave. This is the reason behind the loss of damping.
Finally, in neural network literature, the gradient required for the update is ∂S/∂θ = ∂S/∂G × ∂G/∂θ. We see that this is the same as eq. 3 after the addition of one more node in the computational graph for V, the function that models the physical system. This node, here the PDE calculation, allows the neural network training process to become unsupervised and data-efficient.