A Fast Algorithm for Scanning Transmission Electron Microscopy (STEM) Imaging and 4D-STEM Diffraction Simulations

Scanning transmission electron microscopy (STEM) is an extremely versatile method for studying materials on the atomic scale. Many STEM experiments are supported or validated with electron scattering simulations. However, using the conventional multislice algorithm to perform these simulations can require extremely large calculation times, particularly for experiments with millions of probe positions as each probe position must be simulated independently. Recently, the PRISM algorithm was developed to reduce calculation times for large STEM simulations. Here, we introduce a new method for STEM simulation: partitioning of the STEM probe into"beamlets,"given by a natural neighbor interpolation of the parent beams. This idea is compatible with PRISM simulations and can lead to even larger improvements in simulation time, as well requiring significantly less computer RAM. We have performed various simulations to demonstrate the advantages and disadvantages of partitioned PRISM STEM simulations. We find that this new algorithm is particularly useful for 4D-STEM simulations of large fields of view. We also provide a reference implementation of the multislice, PRISM and partitioned PRISM algorithms.


Introduction
Transmission electron microscopy is a powerful tool for studying atomic-scale phenomena due to its unmatched spatial resolution, and ability to perform imaging, diffraction, and multiple types of spectroscopic measurements [1][2][3].
Scanning TEM (STEM) is a particularly versatile TEM technique, as the STEM probe size can be tuned to any desired experimental length scale, from sub-Ångstrom to tens of nanometers, to best match the length scale of the structures being probed [4]. The size of the probe is also completely decoupled from the step size between adjacent probe positions, allowing experimental fields-of-view up to almost one square millimeter [5]. Advances in detector technology have lead to high speed electron cameras capable of recording full 2D images of the diffracted STEM probe with microsecond-scale dwell times, which has lead to many experiments which record the full fourdimensional dataset, in a family of methods called 4D-STEM [6]. In parallel, the rise of powerful computational methods have enabled measurements of many different material properties with high statistical significance [7].
The combination of computational methods and advanced STEM experimentation has lead to atomic-resolution 3D tomographic reconstructions [8], measurements of highly beam sensitive samples over functional length scales [9], images of samples with resolution better than the diffraction limit [10], and many other advances in STEM imaging techniques.
Many of the technique developments and validation of these experiments make heavy use of electron scattering simulations. The application of extremely data-intensive machine learning methods which to STEM experiments can also be aided by simulations [11].
It is possible to simulate the propagation and scattering of STEM probes through a material by directly computing the Bloch wave eigenstates of the electron scattering matrix (S-matrix) [12]. The Bloch Wave method can be employed in diffraction simulations [13], but it is only practical to use for small, periodic unit cells. The majority of the STEM simulations performed currently implement the multislice method [14]. The multislice method is typically applied in STEM simulations by performing a separate quantummechanical electron scattering simulation for each probe position [15][16][17]. The multislice algorithm can therefore require extremely large computation times when simulating STEM experiments which can contain 1000 2 probe positions or even higher. To alleviate this issue, various authors have implemented parallelized simulation codes that make use of multiple central processing unit (CPU) or graphics processing unit (GPU) resources [18][19][20][21][22][23][24][25][26].
It is possible to perform large STEM simulations more efficiently by computing them as a superposition of plane waves [27]. This idea was developed into an efficient simulation algorithm by [28], who named it the planewave reciprocal-space interpolated scattering matrix (PRISM) algorithm. In the PRISM algorithm, the S-matrix elements are directly computed by multislice simulations. The equivalence of the Bloch wave Smatrix and multislice simulation outputs have been investigated in detail by [29,30]. The PRISM algorithm has been implemented into multiple simulation codes [31][32][33]. It has also been extended to a double-S-matrix formalism which can provide an even higher speed boost relative to multislice for inelastic scattering such as in STEM electron energy loss spectroscopy (STEM-EELS) simulations [34].
The PRISM algorithm achieves large decreases in calculation times by reducing the sampling of the probe wavefunction in reciprocal space, and is highly accurate when the detector configuration is given by large monolithic regions. However, PRISM simulations are less accurate where fine details in the STEM probe and diffracted Bragg disks are necessary, for example in [35][36][37]. A different form of interpolation has been proposed by [38], where the STEM probe is partitioned into different beams by interpolation of basis functions constructed from the initial STEM probe. This partitioning of the probe has been shown to be a highly efficient and accurate representation of dynamical scattering of the STEM probe in experimental data, and is fully compatible with the PRISM algorithm [39].
In this manuscript, we introduce the partitioned PRISM algorithm for use in STEM simulations. We describe the theory of multislice, PRISM, and partitioned PRISM simulations, and provide a reference implementation of these algorithms. We show that beam partitioning simulations provide an excellent trade off between calculation times and accuracy, by measuring the error of diffracted STEM probes with respect to multislice simulations as a function of the number of included beams. We also use this method to simulate the full field of view for a common experimental geometry, a metal nanoparticle resting on an amorphous substrate. These simulations demonstrate that the partitioned PRISM method can produce comparable accuracy for coherent diffraction to PRISM simulations, but for much lower calculation times and lower random access memory (RAM) usage. This is important since many PRISM simulations are constrained by the available RAM of a GPU to hold the S-matrix. Finally, we demonstrate the utility of this method in 4D-STEM simulations by simulating the full 4D dataset of an extremely large (512 2 probes, 4.6 million atoms) sample cell and measuring the sample strain, where the partitioned PRISM algorithm provides superior performance to a PRISM simulation using roughly the same total calculation time.

Theory
For previously published TEM simulation methods, we will briefly outline the required steps here. We refer readers to Kirkland for more information on these methods [40]. We will also only describe the scattering of the electron beam while passing through a sample; probeforming optics and the microscope transfer function mathematics are described in many other works [40][41][42].

Elastic Scattering of Fast Electrons
Transmission electron microscopy simulations aim to describe how an electron wavefunction ψ(r) evolves over the 3D coordinates r = (x, y, z). The evolution of the slow-moving portion of the wavefunction along the optical axis z can be described by the Schrödinger equation for fast electrons [40] where λ is the relativistic electron wavelength, ∇ xy 2 is the 2D Laplacian operator, σ is the relativistic beamsample interaction constant and V (r) is the electrostatic potential of the sample.

The Bloch Wave Algorithm
The Bloch wave method uses a basis set that satisfies Eq. 1 everywhere inside the sample boundary, which is assumed to be periodic in all directions. This basis set is calculated by calculating the eigendecomposition of a set of linear equations that approximate Eq. 1 up to some maximum scattering vector |q max |. Then, for each required initial condition (such as different STEM probe positions on the sample surface), we compute the weighting coefficients for each element of the Bloch wave basis. Finally, the exit wave after interaction of the sample is calculated by multiplying these coefficients by the basis set. This procedure can be written in terms of a scattering matrix (S-matrix) as [40] ψ f (r) = S ψ 0 (r), where ψ 0 (r) and ψ f (r) are the incident and exit wavefunctions respectively. The Bloch wave method can be extremely efficient for small simulation cells, such as periodically tiled crystalline materials. High symmetry is also an asset for Bloch wave simulations, as it allows the number of beam plane waves (beams) included in the basis set to be limited to a small number. However, for a large STEM simulation consisting of thousands or even millions of atoms in the simulation, the S-matrix may contain billions or more entries, which requires an impractical amount of time to calculate the eigendecomposition (roughly Θ(B 3 ) for B beams). Using Eq. 2 many times for multiple electron probes could require extremely large computational times. Thus Bloch wave methods are only used for plane wave, diffraction, or very small size STEM simulations.

The Multislice Algorithm
The formal operator solution to Eq. 1 is given by [40], where ψ f (r) is the exit wavefunction after traveling a distance z from the initial wave ψ 0 (r). This expression is commonly approximately solved with the multislice algorithm first given by Cowley and Moodie [14], which alternates solving the two operators using only the linear term in the series expansion of the exponential operator.
In the multislice algorithm, we first divide up the sample into a series of thin slices with thickness t. Solving for the first operator on Eq. 3 yields an expression for free space propagation between slices separated by t, with the solution given by where P t is the Fresnel propagator defined by where q = (q x , q y ) are the 2D Fourier coordinates and r = (x, y) are the 2D real space coordinates. F x [ · ] denotes the two-dimensional Fourier transform with respect to x and F † x [ · ] the 2D inverse Fourier transform with respect to x.
To solve for the second operator in Eq. 3, integrate the electrostatic potential of the sample over the slice of thickness t Fig. 1a shows an example of this slicing procedure. If we assume that the electron scattering inside this slice occurs over infinitesimal thickness, the resulting solution to this operator is We can then write one iteration of the multislice algorithm as where V k is the projected potential at slice k. This algorithm is shown schematically in Fig. 1b. The multislice solution of a wavefunction ψ after k potential slices is then for which we introduce the short notation M k if the potential is assumed to be fixed. It is important to note that T (ψ, V k ) is linear in ψ and nonlinear in V k and thus M V k is linear in ψ and nonlinear in V . Traditionally, a STEM or 4D-STEM simulation was computed by shifting the incoming wave function to the scan positions ρ and computing the resulting far-field intensity using the multislice algorithm at each position: An example of this output is shown in Fig. 1c. This requires us to perform a full multislice calculation at each scan position, which makes large fields of view that could contain millions of probe positions computationally expensive.

The PRISM Algorithm for STEM Simulations
Recently, [28] proposed an elegant solution to this problem. The incident wave-function of a microscope in a scanning geometry usually passes through a beamforming aperture with maximum allowed wave vector h max and is then focused onto the sample. It can therefore be described in Fourier space as with Ψ(h) the Fourier transform of ψ(r) and ρ the twodimensional scan coordinate. Using the linearity of the multislice algorithm with respect to ψ and Eq. 10, we can then rewrite Eq. 9 as We take Eq. 11 to link our algorithm to the existing Bloch wave literature in electron microscopy. Traditionally, the set h of incoming plane waves is referred to as "beams," and the linear operator that maps from plane waves entering he sample to plane waves exiting the sample is referred to as the S-matrix. Using Eq. 11, we can define the real-space scattering matrix S r,h := M k e 2πih·r , which is the set of exit waves produced by running the multislice algorithm on the set of plane waves present in the probe-forming aperture of the microscope. The scattering matrix encapsulates all amplitude and phase information that is required to describe a scattering experiment with variable illumination, given a fixed sample potential V .
Given the S-matrix and a maximum scattering angle h max in the condenser aperture, we can rewrite Eq. 11 with the real-space scattering matrix as To introduce the concepts used in the PRISM algorithm we now need to consider the variables r and h on a discretely sampled grid. The bandwidth-limitation |h| < h max means that the incoming probe is represented by a finite number of Fourier coefficients To compute the S-matrix, we run the multislice algorithm for each wavevector h b that is sampled in the detector plane. These steps are shown schematically in Figs. 2a and b. This strategy yields favorable computational complexity when a large number of probe positions needs to be calculated, which is necessary for large field-of-view STEM simulations. It has the additional advantage that a series of scanning diffraction experiments with different illumination conditions can be simulated without recomputing the S-matrix. This method was named the plane-wave reciprocal-space interpolated scattering matrix (PRISM) algorithm in [28]. It was first implemented into a simulation code parallelized for both CPUs and GPUs in the Prismatic implementation [31]. Since then, the PRISM algorithm has also been implemented in the GPU simulations codes py multislice [33] and abTEM [32].
The PRISM algorithm introduced an additional concept to improve the scaling behaviour of STEM simulations via the scattering matrix. If only each f -th beam in the condenser aperture is sampled, the field of view in real-space contains f 2 copies of the probe with size N 1 /f 2 × N 2 /f 2 . If one of these probe copies is cropped out, and then the far-field intensities computed via Eq. 12, we can perform simulations that trade a small amount of accuracy for a significant speed-up in computation times [28]. The new model then reads where we have introduced a cropping operator, This cropping procedure to compute STEM probes using PRISM is shown in Figs. 2c and d.

Natural Neighbor Interpolation of the Scattering Matrix
Theoretical and experimental investigations of S-matrix reconstructions have shown that once the plane wave tilts have been removed from all beams, the resulting matrix elements are remarkably smooth [39,43,44]. We have also observed that in many PRISM simulations, the information contained in neighboring beams is very similar. These observations have inspired us to propose a new method for simulating STEM experiments. Rather than computing all beams of the S-matrix with the multislice algorithm, we could instead interpolate them from a reduced set P of parent beams, which are computed with the multislice algorithm in the manner described above.
Defining the interpolation weights as a matrix w ∈ R |P|xB that stores the interpolation weights for each beam, we can then compute the 4D-STEM intensities as The remaining tasks are then to choose an interpolation strategy to determine the weights w, and to choose the set of parent beams P.
To maximise the flexibility in choosing the parent beams, which form the the interpolation bases of the S-matrix, the chosen interpolation scheme must be able to interpolate an unstructured grid of parent beams. Here we have chosen to employ the natural neighbour interpolation [45,46].
We note two additional methods which can save further computational time. First, part of the computational overhead when performing matrix multiplication of the S-matrix is the cropping operator. When using interpolation factors of f > 1 for either traditional or partitioned PRISM, this overhead can represent a significant amount of computation time due to the need for a complex indexing system to reshape a subset of the S-matrix. Thus in many cases, simulations with f = 2 may require longer computational times than f = 1. We therefore recommend that the scaling behaviour be tested in each case.
A second and more universal speed-up can be achieved by modifying the beamlet basis functions. To position a STEM probe at any position that is not exactly centered on a pixel with respect to the plane wave basis functions, we use the Fourier shift theorem to apply the sub-pixel shifts of the initial probe, represented by the phase factors e −2πih b ·ρ in Equ. 13. This requires that the beamlet basis functions be stored in Fourier space coordinates, multiplied by a plane wave to perform the sub-pixel shift, and then an inverse Fourier transform be performed before multiplication by the S-matrix. To avoid this potentially computationally-costly step, we can set the simulation parameters such that all STEM probe positions fall exactly on the potential array pixels (for example, calculating 512 x 512 probe positions from a 1024 x 1024 pixel size potential array). This eliminates the summation over the complete set of basis beams b and the shift of the STEM probe can be achieved by indexing operations alone, allowing the probe basis functions to be stored in real space. After factoring out the summation over the Fourier basis, the 4D-STEM intensities can then be calculated as which defines our new "beamlet" basis, depicted in Fig. 3 a). These new probe basis functions can be pre-computed and stored in memory, such that only summation over the the parent beams is necessary to calculate a diffraction pattern.

Algorithmic Steps of Partitioned PRISM Simulations
Fig . 3 shows the steps of our new simulation algorithm as a flow chart. As in all of the above electron scattering simulation methods, the first step is to compute the projected potentials from the atomic coordinates. Fig. 3a shows the sum of all 40 projected potential slices, each having a thickness of 2Å.
The second step, shown in Fig. 3b is to choose a set of parent beams, and then calculate the weight function of all beamlets using the desired partitioning scheme.
Here we have used two rings of triangularly tiled beams, where each ring has a constant radius and the beams are separated by 10 mrads across the 20 mrad STEM probe. The beamlet weights w are calculated using natural neighbor interpolation, and are shown in Fig. 3 a) in the top right panel. The parent beams are indicated by small red circles in the condenser aperture, and the beamlet weight distributions for each parent beam are show in gray scale. By taking the inverse Fourier transform of each weight function, we can generate the real space beamlet basis functionsψ p . Fig. 3 c) shows the third step of the partitioning simulation algorithm, where we perform a plane wave multislice simulation for each of the parent beams defined above. After the plane waves have been propagated and transmitted through all 40 slices, the tilt of each beam is removed. These outputs are then stored in the compact S-matrix S r,p .
Finally, we compute the intensity of each desired STEM probe position as shown in Fig. 3 d). First, if we are using a PRISM interpolation factor other than f = 1, we crop out a subset of the S-matrix. Next, each beam of the S-matrix is multiplied by the beamlet basis functions, and all beamlet exit waves summed to form the complex STEM probe in real space. Finally, we take the Fourier transform of these probes and compute the intensity from the magnitude squared of the wavefunction. If we require sub-pixel shifts of the STEM probes with the cropped region of the S-matrix, we must first multiply the probe basis functions by the appropriate complex plane wave in Fourier space to achieve the desired shift. This adds some computational overhead to each probe, and so if possible we suggest using a potential sampling pixel size that produces a simulation image size which is an integer multiple of the spacing between adjacent STEM probes.

Computational and memory complexity
We now approximate computational complexity and memory complexity for the multislice, PRISM, and partitioned PRISM algorithms. We neglect calculation time for the sample projected potential slice and thermal diffuse scattering, as the added computational and memory complexity is equal for all methods. For simplicity, we assume a quadratic simulation cell with N = N 1 = N 2 . Each slice of the multislice algorithm requires transmission and propagation operations in Equ. 7, which is 6N 2 log 2 (N), and 2N 2 operations to multiply the potential and the Fresnel propagator. For a STEM simulation with P STEM probe positions and H slices for the sample, the total multi-slice complexity is then Θ(HP(6N 2 log 2 (N) + 2N 2 ) [28].
The complexity of the PRISM algorithm is given by Θ( HB f 2 6N 2 log 2 (N) + 2N 2 + PBN 2 4f 4 ) [28], which consists of HB multi-slice simulations for each of the sampled beams, and PBN 2 4f 4 operations for the summation of the beams. For the partitioned PRISM algorithm with B p partitions, the complexity for the multi-slice calculations is Θ( . For the realspace summation with subpixel precision, a maximum of The memory complexity of the multi-slice algorithm is lowest, since only the current wave of size Θ(N 2 ) needs to be held in memory for an unparallelized implementation. All algorithms need Θ( PN 2 f 2 ) memory to store the results of the calculation, if 4D datasets are computed. The PRISM algorithm requires Θ( BN 2 4 ) memory to store the compact S-matrix. For simulations which require a finely-sampled diffraction disk, B can quickly grow to 10 4 or larger, since the number of beams scales with the square of the bright-field disk radius, such that large-scale simulations with fine diffraction disks can outgrow the available memory on many devices. The memory requirements of the partitioned PRISM algorithm scale with Θ( BpN 2 4 ). Since the number of parent beams B p can be chosen freely, the memory requirements of the partitioned PRISM algorithm can be freely adjusted to the available hardware.

Methods
All the simulations shown in this paper were performed using a set of custom Matlab codes. In addition to implementing the partitioned PRISM algorithm, we have also implemented both the conventional multislice and PRISM algorithms for STEM simulation, in order to make a fair comparison between the different methods. We have used a single frozen phonon configuration in all cases, in order to increase the number of features visible in diffraction space. No effort was made for performance optimization or parallelization beyond MATLAB's inline compiler optimizations.
The microscope parameters used in Figs 4,5 and 6 were an accelerating voltage of 80 kV, a probe convergence semiangle of 20 mrad, and a pixel size of 0.1Å. The probe was set to zero defocus at the entrance surface of the simulation cell. The projected potentials were calculated using a 3D lookup table method [47], using the parameterized atomic potentials given in [40]. Slice thicknesses of 2Å were used for all simulations, and an anti-aliasing aperture was used to zero the pixel intensities at spatial frequencies above 0.5 · q max during the propagation step.
The atomic coordinates utilized for our single probe position and imaging simulations is identical to that used previously [28].
The structure consists of a Pt nanoparticle with a multiply twinned decahedral structure, with screw and edge dislocations present in two of the grains. The nanoparticle measures approximately 7 nm in diameter, and was tilted such that two of the platinum grains are aligned to a low index zone axis. It was embedded into an amorphous carbon support to a depth of approximately 1 nm, with all overlapping carbon atoms removed. The cell size is 10 nm × 10 nm × 8 nm, and contains 57 443 total atoms. The nanoparticle coordinates were taken from [48], and the amorphous carbon structure was adapted from [49].
The atomic coordinates of our 4D-STEM simulations were a multilayer stack of semiconductor materials inspired by the experiments in [50]. The simulation cell consists of a GaAs substrate where the Ga and As sites are randomly replaced with 10% Al and P respectively. The multilayers are an alternating stack of GaAs doped  with 10% P and pure GaAs respectively, each 9 unit cells thick along a [001] direction. The lattice parameters of the GaAs and GaAsP were fixed to be +1.5% and -1.5% of the substrate lattice parameter, which was set to 5.569Å. The field of view was approximately 500 x 500Å, and the potential pixel size and slice thicknesses were set to 0.1Å and 2Å respectively. The cell thickness was approximately 40Å, giving 4.6 million atoms inside the simulated volume. The STEM probe convergence semiangle was set to 2.2 mrads, the accelerating voltage was set to 300 kV, and the probe was scanned over the field of view with 2Å step sizes, giving an output of 250 × 250 probes.
The simulations shown in Figs. 4, 5, and Fig. 7 were computed on a laptop with an Intel Core i7-10875H GPU, operating at 2.30 GHz with 8 cores, and 64 GB of DDR4 RAM operating at 2933MHz. The simulations shown in Fig. 6 were performed on Intel Xeon Processors E5-2698v3 with 8 Physical cores (16 threads) and 25GB RAM per simulation. The multislice 512 × 512 results were obtained by splitting the 512 × 512 array into 32 jobs with 16 × 512 positions. Prism f = 2 results with 512 × 512 probe positions were obtained by splitting the array in to 8 jobs with 64 × 512 each, all using an identical calculated S-matrix. All calculations were performed using Matlab's single floating point complex numbers, and simulation run times were estimated using built-in MATLAB functions, and memory usages were based on theoretical calculations.

Calculation of Individual STEM Probes
To demonstrate the accuracy of our proposed algorithm, we have performed STEM simulations of a common sample geometry: a multiply-twinned Pt nanoparticle resting on an amorphous surface. The total projected potential of this sample is plotted in Fig. 4a, as well as the location of a STEM probe positioned just off-center. We have tested a series of beam partitioning schemes, shown graphically in Fig. 4b. The first case tested was a single beam, which is equivalent to convolving a plane wave HRTEM simulation with the STEM probe.
We have also used natural neighbor partitioning to calculate the beamlet weights when using a series of concentric hexagonal rings of beams, distorted slightly to the circular probe geometry. These simulations include partitioning the 20 mrad probe by 20, 10, 5, 2.5, and 1.25 mrads, resulting in a total of 7, 19, 61, 217 and 817 parent beams respectively.
The calculated diffraction space intensities of the probes corresponding to the above cases are shown in Fig. 4c, along with the corresponding conventional multislice simulation. We see that using a single parent beam is extremely inaccurate, reproducing only the coarsest features of the multislice simulation. However, the partitioning scheme rapidly converges to the multislice result, shown by the error images plotted in Fig. 4d. The 19 beam case has errors falling roughly within 5%, while the 61 beam case drops to <2%. The calculated probe for the 217 beam case has errors on the order of <0.5%, which would likely be indistinguishable from an identical experiment due to measurement noise. Finally, the 817 beam case is essentially error-free.
We can make additional observations about the character of the errors present in the partitioning algorithm. Inside the initial probe disk and in directly adjacent regions, the errors are roughly equally distributed in the positive and negative directions. However, at higher angles the errors are biased in the negative direction. This indicates that the partitioning approximation is highly accurate at low scattering angles where coherent diffraction dominates the signal [51], and is less accurate at high scattering angles where thermal diffuse scattering dominates [52]. We attribute this effect to the complex phase distribution of the pixels; at low scattering angles, adjacent beams have very similar phase distributions, which in turn makes the interpolation a good approximation. However, at high scattering angles the phases of each pixel are substantially more random, due to thermal motion of the atoms. This means that if too few beams are used to approximate the signal, the coherent summations will tend towards zero due to the random phase factors. Thus when using a small number of beams in partitioned STEM simulations, high angle scattering intensities can be underestimated.
The estimated calculation times for these simulations are shown in Fig. 4e. When calculating a single STEM probe, multislice is always fastest because the only overhead to the calculation is computation of the projected potentials. The partitioned simulations by contrast require evaluation of the S-matrix, which requires the same calculation time as each STEM probe multislice propagation for each beam. However, once the S-matrix has been computed, calculation of STEM probes via matrix multiplication becomes substantially faster than multislice. The overall simulation time becomes lower than multislice if many STEM probe positions must be calculated. For the 61, 217 and 817 beam cases, these crossovers in calculation time occur for 32, 135, and 1000 probe positions respectively. Therefore even for 1D simulations of STEM probe positions, the partitioning scheme is faster, and for simulations with a grid of 2D probe position this scheme is significantly faster than multislice.
However, using the beam partitioning algorithm on the entire field of view does not utilize the algorithm to its full speed-up potential. The beam partitioning approximation is also compatible with the PRISM approximation. Partitioning reduces the number of entries of the S-matrix in diffraction space, whereas PRISM reduces the number of entries using cropping in real space. Fig. 5a shows STEM simulations that combine partitioning with a PRISM interpolation factor of f = 5. The 25-fold reduction in sampling of the STEM probes is evident in Fig. 5b, where the underlying beam pixels are clearly visible in the STEM probe. The partitioning scheme used is identical to that of Fig. 5b, except for the 1.25 and 2.5 mrad partitioning cases. For the 1.25 mrad partitioning, the number of parent beams outnumbers the number of available beams; after removing duplicate beams, this simulation becomes equivalent to a PRISM f = 5 simulation. The 2.5 mrad partitions were changed to a diagonal grid, where every other beam is included in order to produce a more uniform sampling of the S-matrix.
The calculated probe intensities are shown in Fig. 5c, along with the corresponding multislice simulation (which was sampled on the same 25-fold reduced grid). The errors of the partitioned PRISM simulations have been compared to the multislice simulation in Fig. 5d. The resulting convergence towards zero error is essentially identical to the non-PRISM case (where f = 1). These simulations are also slightly biased towards negative errors at high scattering angles.
The estimated calculation times are plotted in Fig. 5e, as a function of the number of probe positions. These simulations are substantially faster than multislice. The 61, 161 and 325 beam cases have a crossover in the calculation with multislice for 32, 83 and 155 probe positions respectively. If the error for the 161 beam case is within an acceptable tolerance, a 1000 x 1000 probe position simulation of this sample can be performed in roughly 50 minutes, without additional parallelization or utilization of GPU or HPC resources.

Calculation of Full STEM Images
We have also simulated full STEM images with a variety of standard detector configurations, in order to demonstrate the potential of the partitioned PRISM algorithm. These simulations are shown in Fig. 6, and include four radially-symmetric detector configurations. These are a bright field (BF) image from 0-8 mrads, an annular bright field (ABF) image from 9-20 mrads, a low angle annular dark field (LAADF) image from 25-60 mrads, and a high angle annular dark field (HAADF) image from 60-100 mrads. We have performed these simulations with 512 x 512 probe positions using multislice, PRISM and partitioned PRISM algorithms. The PRISM simulations used interpolation factors of f = 1, 2, 4, and 8, giving a total number of beams equal to 7377, 1885, 489, and 137 beams respectively. The partitioning included was the scheme described above, where the 20 mrad STEM probe was subdivided by 10, 5, and 2.5 mrads into the parent beams, and where no partitioning was performed (i.e. the original PRISM algorithm). The number of beams for the 10, 5, and 2.5 mrad partitioning were equal to 19, 61, and 217 respectively, except for the 2.5 mrad partitioning for f = 8 interpolation, where the simulation is equivalent to PRISM (137 beams). Fig. 6a shows a summary of the results, where the rootmean-square (RMS) errors in units of probe intensity and calculation times relative to multislice simulations are plotted. Additionally, the RAM requirements for storing the S-matrix are shown by the marker sizes. Overall, the results follow the same trend as in the previous section. Using less beams either in the partitioning or higher PRISM interpolation results in a less accurate simulation for all cases. The only exception to this is PRISM f = 1 simulations, which are mathematically identical to multislice simulations [28]. Interestingly, the PRISM f = 1 simulations are faster than f = 2, due to not needing any matrix indexing operations to crop out a portion of the S-matrix. However f = 1 PRISM simulations also have the largest RAM requirements by a large margin, requiring 15.5 GB. This is potentially an issue for large simulations if we wish to utilize GPU resources, since RAM capacities of current GPUs are often in the range of 4-16 GB, and there is additional overhead for other arrays that must be calculated. This problem can be alleviated by streaming only part of the S-matrix into the GPU RAM [31], but then the large speed-up afforded by performing only a single matrix multiplication per STEM probe is lost.
The BF and ABF simulation errors shown in Fig. 6a, the partitioned simulations have a very favourable balance between calculation time and accuracy. For PRISM interpolation factors of f = 2 and f = 4, the partitioned simulations have essentially identical accuracy to the PRISM simulations, while requiring far lower calculation times and less RAM to store the S-matrix. The 5 mrad partitioning case (61 beams) for example is 46 (f = 2) and 171 (f = 4) times faster than an equivalent multislice simulation, while having RMS errors on the order of 0.2% and 0.1% respectively for the 0-8 mrads BF image and RMS errors on order of 0.5% and 0.2% respectively for the 9-20 mrads ABF image.
For the LAADF and HAADF images shown in Fig. 6a, the partitioned simulations show somewhat less favourable error scaling than the PRISM algorithm. While the calculation times are reduced by partitioning for a given PRISM interpolation factor, the errors increase roughly inversely proportional to the number of include beams. These errors are still relatively low however, staying roughly constant with the interpolation factor f . Fig. 6b shows the STEM images for the f = 4 cases including conventional PRISM and the 3 partitioning schemes. It is immediately evident that all images contain the same qualitative information, for example showing that the ABF image is far more interpretable than the BF image. Visually, the BF and ABF images appear indistinguishable from each other, with all atomic-scale features preserved across the different partitioning schemes. The LAADF and HAADF images similarly all contain the same qualitative information, and all highlight the differences between these two dark field imaging conditions. Here however we can see an overall reduction of image intensity inside the nanoparticle for the partitioned simulations with less beams. In the LAADF case, the 19 beam image is noticeably dimmer than the other cases, and for the HAADF case both the 19 and 61 beam partitioning show reduced intensities.
To show the errors more quantitatively as a function of the probe position, we have plotted the difference images with respect to a multislice simulation in Fig. 6c. For the BF images, a slight offset in the overall intensities is visible, likely due to the slightly different probe and detector sampling when using f = 4 interpolation. The spatially resolved differences are very low however, for both PRISM and the 2.5 and 5 mrad partitioning simulations. In the regions of highest scattering in the nanoparticle, some errors along the atomic planes are visible in the 10 mrad partitioned simulation. The 2.5 and 5 mrads partioned PRISM simulations are an excellent replacement for the PRISM simulations, as they offer large calculation time speed-ups for a negligible change in the error.
In the LAADF and HAADF error images plotted in Fig. 6c, the errors are increasing proportionally to the inverse of the number of beams included, as we observed in Fig. 6a. The HAADF images show higher overall errors than the LAADF images, due to the increasing randomness of the pixel phases at high scattering angles where thermal diffuse scattering dominates the signal. The relative error is also higher at these high scattering angles, as the number of electrons that scatter to these high angles are significantly lower than those which reach the other detector configurations. Both the LAADF and HAADF errors scale nearly linearly with the nanoparticle projected potential, which indicate that they may be tolerable for relative measurements such as comparing different thicknesses or the signal measured for different atomic species. For quantitative intensity simulations at high angles, we recommend using as many beams as possible for the partitioned PRISM algorithm.

Calculation of 4D-STEM Datasets
Many 4D-STEM experimental methods require fine enough sampling of reciprocal space to resolve the edges of scattered Bragg disks, or fine details inside the unscattered and scattered Bragg disks [6]. In particular, for machine learning methods which are trained on simulated data, we want the sampling and image sizes to be as close to the experimental parameters as possible [53,54]. Here, compare the PRISM and partitioned PRISM algorithms for 4D-STEM simulations, and assess their accuracy by performing a common 4D-STEM workflow of strain mapping by measuring the Bragg disk spacing [55].
We have simulated a 4D-STEM experiment for a multilayer stack of semiconductor materials similar to the experiments shown in [50], shown in Fig. 7   factors of f = 25, giving 21 total beams, shown in Fig. 7a. The second combined a PRISM interpolation of f = 5 with partitioning into 19 beams, shown in Fig. 7b. These parameters were chosen to require approximately the same total calculation time (157 and 186 minutes for pure PRISM and partitioned PRISM respectively). Both simulations used the same atomic potentials which required 113 minutes to compute. The S-matrix calculation steps required 42 and 30 minutes for the pure PRISM and partitioned PRISM simulations respectively. Finally the 62 500 probe positions required 2 and 43 minutes for the pure PRISM and partitioned PRISM simulations respectively.
We have used the py4DSTEM package [56] to measure strain in both of the simulations shown in Fig. 7, by fitting the positions of the Bragg disks. These strains are compared to the ideal strain, estimated by convolving the underlying lattice spacing with a Gaussian kernel with a standard deviation given by the 5Å estimated size of the STEM probe. In the pure PRISM simulation shown in Fig. 7a, there are artifacts visible in both strain maps. The strain perpendicular to the layer direction shows rapid oscillations of ±0.1%, while the strain parallel to the layer direction shows discrete steps. Both of these are due to the very small cropping box used when f = 25, which cuts off the tails of the STEM probe in this simulation. Additionally, the limited sampling of the diffraction disk edges strongly limits the achievable precision in the disk position measurements.
By contrast, the partitioned simulation shown in Fig. 7b samples diffraction space 5 times more finely in both the x and y directions. The resulting strain maps are much flatter, and the measured strain positions agree better with the ideal measurements. This simulation demonstrates that beam partitioning combined with PRISM interpolation can provide a much more efficient use of the calculation time required to generate the Smatrix beams than a pure PRISM simulation. This partitioning case uses approximately the same number of beams and requires roughly the same calculation time, but is substantially more accurate at the low scattering angles used in a coherent diffraction 4D-STEM simulation. We also estimate that a multislice simulation of this same experiment would require approximately 60 days using the same simulation parameters. Even if we were to increase the beam sampling by a factor of 8, the partitioned PRISM simulation would still complete in less than a day.

Conclusion
We have introduced the beam partitioning algorithm for STEM simulation. This algorithm splits the STEM probe into a series of basis functions generated by natural neighbor interpolation between a set of parent beams. We construct the diffracted STEM probe by matrix multiplication of these basis functions with plane wave multislice simulations of each parent beam which are stored in a S-matrix that can be re-used for each new STEM probe position. We have demonstrated that the resulting algorithm converges rapidly to low error with respect to the conventional multislice algorithm, and that it is fully compatible with the PRISM algorithm for STEM simulation.
We have compared our new algorithm to multislice and PRISM simulations of a nanoparticle on an amorphous substrate. With these simulations, we have shown that in general partitioned beam simulations can provide the same accuracy as PRISM at low to intermediate scattering angles (where coherent diffraction dominates the signal), but with much lower calculation times and lower RAM usage. We have also shown that at high scattering angles, beam partitioning simulation accuracy is somewhat worse than the PRISM algorithm, though still with lower calculation times. These low calculation times may allow the partitioned PRISM algorithm to be used "in the loop" with 3D tomographic reconstruction algorithms, in order to properly model the nonlinear dependence of STEM image contrast on the underlying atomic potentials.
Finally, we have also demonstrated the utility of partitioned PRISM for simulations of large 4D-STEM datasets.
We used a common sample geometry composed of a multilayer stack of semiconductor materials with varying compositions on a substrate, and performed strain mapping from the diffracted probe signals by measuring the position of the Bragg disks and fitting a lattice. These simulations show that the partitioned PRISM algorithm is particularly well suited for performing fast simulations of large fields of view where high sampling of diffraction space is needed.
We believe that our algorithm will find widespread application in simulations of very large simulated cells, such as those calculated with molecular dynamics. Our simulations also show that the beam partitioning S-matrix can efficiently represent complex three-dimensional scattering, which may make it useful for inverting experimental data efficiently.

Data and Source Code Availability
Reference implementations of the algorithms presented in this paper (multislice, PRISM and partitioned beam STEM simulations) are available at github.com/cophus/superPRISM.