
Acceleration of the particle-in-cell code Osiris with graphics processing units

Published online by Cambridge University Press:  07 January 2025

Roman P. Lee
Affiliation:
Department of Physics and Astronomy, University of California, Los Angeles, CA 90095, USA
Jacob R. Pierce*
Affiliation:
Department of Physics and Astronomy, University of California, Los Angeles, CA 90095, USA
Kyle G. Miller
Affiliation:
Laboratory for Laser Energetics, University of Rochester, Rochester, NY 14623-1299, USA
Maria Almanza
Affiliation:
Department of Physics and Astronomy, University of California, Los Angeles, CA 90095, USA
Adam Tableman
Affiliation:
Department of Physics and Astronomy, University of California, Los Angeles, CA 90095, USA
Viktor K. Decyk
Affiliation:
Department of Physics and Astronomy, University of California, Los Angeles, CA 90095, USA
Ricardo A. Fonseca
Affiliation:
GoLP/Instituto de Plasmas e Fusão Nuclear, Instituto Superior Técnico, 1049-001 Lisboa, Portugal
ISCTE – Instituto Universitário de Lisboa, Av. Forças Armadas, 1649-026 Lisboa, Portugal
E. Paulo Alves
Affiliation:
Department of Physics and Astronomy, University of California, Los Angeles, CA 90095, USA
Warren B. Mori
Affiliation:
Department of Physics and Astronomy, University of California, Los Angeles, CA 90095, USA
Department of Electrical and Computer Engineering, University of California, Los Angeles, CA 90095, USA
*Email address for correspondence: jacobpierce@physics.ucla.edu

Abstract

Fully relativistic particle-in-cell (PIC) simulations are crucial for advancing our knowledge of plasma physics. Modern supercomputers based on graphics processing units (GPUs) offer the potential to perform PIC simulations of unprecedented scale, but require robust and feature-rich codes that can fully leverage their computational resources. In this work, this demand is addressed by adding GPU acceleration to the PIC code Osiris. An overview of the algorithm, which features a CUDA extension to the underlying Fortran architecture, is given. Detailed performance benchmarks for thermal plasmas are presented, which demonstrate excellent weak scaling on NERSC's Perlmutter supercomputer and high levels of absolute performance. The robustness of the code to model a variety of physical systems is demonstrated via simulations of Weibel filamentation and laser-wakefield acceleration run with dynamic load balancing. Finally, measurements and analysis of energy consumption are provided that indicate that the GPU algorithm is up to ${\sim }$14 times faster and $\sim$7 times more energy efficient than the optimized CPU algorithm on a node-to-node basis. The described development addresses the PIC simulation community's computational demands both by contributing a robust and performant GPU-accelerated PIC code and by providing insight into efficient use of GPU hardware.
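The abstract describes the GPU support as a CUDA extension to Osiris's underlying Fortran architecture. As a rough, hypothetical illustration of how such an extension is typically wired together (none of the names or the kernel below are taken from Osiris), a Fortran routine can call an extern "C" wrapper that launches a CUDA kernel on device-resident particle data:

```cuda
// Hypothetical sketch of a CUDA extension callable from Fortran.
// These names only illustrate the general pattern of an extern "C"
// wrapper that a Fortran routine can bind to via ISO_C_BINDING.
#include <cuda_runtime.h>

// Simple kernel: advance particle positions by v*dt (1-D, illustrative only).
__global__ void push_positions(float* x, const float* v, float dt, int np)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < np) {
        x[i] += v[i] * dt;
    }
}

// C-linkage wrapper that Fortran can call through an interface block
// declared with bind(C).
extern "C" void push_positions_cuda(float* d_x, const float* d_v,
                                    float dt, int np)
{
    const int block = 256;
    const int grid  = (np + block - 1) / block;
    push_positions<<<grid, block>>>(d_x, d_v, dt, np);
    cudaDeviceSynchronize();
}
```

On the Fortran side, such a wrapper would be declared in an interface block with bind(C) (using the value attribute for the scalar arguments), so device pointers and scalars pass through unchanged.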

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
Copyright © The Author(s), 2025. Published by Cambridge University Press

Figure 1. Schematic of the CPU memory layout, chunk pool data structure and species chunk manager. The CPU species object stores contiguous arrays of particle position relative to the nearest cell, momentum, charge and cell index (x, p, q and i) for each tile. Note that, generally, particle data are maintained only on the GPU; the figure depicts the state immediately after particles have been copied from device to host for particle diagnostics. The species chunk manager, whose state is maintained on both CPU and GPU, stores pointers to chunks of device memory for each tile. Chunks contiguously store all four data fields for a fixed number of particles. Each species chunk manager may transfer pointers to chunks to and from the chunk pool with nearly zero overhead, enabling arbitrary load balancing between tiles. Particle transfer between the CPU species object and the species chunk manager is achieved through an object for buffered data movements, discussed in § 2.2.
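Conceptually, the chunk pool is a free list of fixed-size device buffers whose ownership moves between tiles by exchanging pointers rather than by copying or reallocating particle data. A minimal host-side sketch of this idea, with hypothetical names, sizes and field layout (not the Osiris implementation), is:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical chunk: one contiguous device allocation holding the four
// particle fields (x, p, q, i) for a fixed number of particles.
constexpr int CHUNK_NP = 1024;  // particles per chunk (illustrative value)

struct Chunk {
    char* data = nullptr;  // single device buffer; fields live at fixed offsets
    float* x() const { return reinterpret_cast<float*>(data); }          // positions
    float* p() const { return x() + 3 * CHUNK_NP; }                      // momenta
    float* q() const { return p() + 3 * CHUNK_NP; }                      // charges
    int*   i() const { return reinterpret_cast<int*>(q() + CHUNK_NP); }  // cell indices
    static size_t bytes() {
        return 7 * CHUNK_NP * sizeof(float) + CHUNK_NP * sizeof(int);
    }
};

// Pool of pre-allocated chunks: handing a chunk to a tile, or taking it back,
// is just a pointer copy, so no device allocation happens during rebalancing.
class ChunkPool {
public:
    explicit ChunkPool(int nchunks) {
        for (int c = 0; c < nchunks; ++c) {
            Chunk ch;
            cudaMalloc(reinterpret_cast<void**>(&ch.data), Chunk::bytes());
            free_.push_back(ch);
        }
    }
    Chunk acquire()                { Chunk ch = free_.back(); free_.pop_back(); return ch; }
    void  release(const Chunk& ch) { free_.push_back(ch); }
private:
    std::vector<Chunk> free_;
};

// Per-tile manager: the list of chunks currently owned by one tile.
struct SpeciesChunkManager {
    std::vector<Chunk> chunks;
    void grow(ChunkPool& pool)   { chunks.push_back(pool.acquire()); }
    void shrink(ChunkPool& pool) { pool.release(chunks.back()); chunks.pop_back(); }
};
```

Because a transfer only moves a pointer between two lists, rebalancing particles between tiles costs essentially nothing beyond the bookkeeping, which is the "nearly zero overhead" property the caption refers to.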


Figure 2. Comprehensive thermal plasma benchmarks run on Perlmutter. (a) Weak scaling tests for the cases described in the text, showing near-perfect scaling up to 4096 GPUs on Perlmutter. (b) Parameter scan showing dependence of throughput on particles per cell and grid sizes. (c) Comparison of absolute performance on one GPU and one core of one CPU for different interpolation orders. All simulations are run in three dimensions with quadratic interpolation and 512 particles per cell unless otherwise specified.
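For context, the weak-scaling result in (a) is read here in the conventional sense (assumed, not spelled out in the caption): the problem size grows in proportion to the number of GPUs $N$ so that the work per GPU is fixed, and the efficiency is
\[
\eta(N) = \frac{t_1}{t_N},
\]
where $t_1$ and $t_N$ are the times per step on $1$ and $N$ GPUs; near-perfect scaling corresponds to $\eta(N) \approx 1$ up to $N = 4096$.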


Table 1. Absolute throughput in gigaparticles per second and speedup for the GPU and CPU algorithms on one GPU and one core of one CPU on Perlmutter with different interpolation orders. The benchmarks are measured on a thermal plasma. The 3-D CPU and GPU performances are plotted in figure 2(c).


Figure 3. Examples of spatially inhomogeneous plasma simulations run on Perlmutter. (a) Three-dimensional simulation of the Weibel instability on 4096 GPUs, demonstrating the tangling of magnetic field lines. (b) Charge density from a 2-D simulation of a laser-wakefield accelerator (LWFA) in the nonlinear regime with dynamic load balancing. The black lines indicate the computational boundaries of different MPI ranks.


Figure 4. Energy consumption per particle push for the GPU and CPU algorithms. The comparison is between 4 h runs on a Perlmutter GPU node with 4 GPUs and on a Perlmutter CPU node with 128 CPU cores. The GPU node is ${\sim }14$ times faster than the CPU node and uses ${\sim }2$ times more energy, leading to an overall energy efficiency ${\sim }7$ times higher. Further details of what is being measured are given in the text.
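The quoted factor of ${\sim }7$ follows directly from the other two ratios. Over the fixed 4 h run, the GPU node completes ${\sim }14$ times as many particle pushes while consuming ${\sim }2$ times as much energy, so the energy per push satisfies
\[
\frac{E_{\rm CPU}/N_{\rm CPU}}{E_{\rm GPU}/N_{\rm GPU}}
= \frac{E_{\rm CPU}}{N_{\rm CPU}}\,\frac{N_{\rm GPU}}{E_{\rm GPU}}
\approx \frac{E_{\rm CPU}}{N_{\rm CPU}}\,\frac{14\,N_{\rm CPU}}{2\,E_{\rm CPU}} = 7,
\]
where $E$ is the node energy consumed over the run and $N$ the number of particle pushes completed on each node.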