The 2020 OpenACC Summit was held August 31 to September 1, 2020 as a fully digital event that included a keynote from Martijn Marsman of the University of Vienna, a Birds of a Feather (BoF) interactive discussion, two "Ask the Experts" sessions, several OpenACC talks, and two GPU Bootcamps.
Welcome and Keynote
August 31, 2020 | 08:00 AM - 12:30 PM PDT | Digital Event
Welcome and Opening Remarks: Jack Wells, Vice President, OpenACC
OpenACC Technical Committee Updates: Jeff Larkin, OpenACC Technical Committee Chair
Future Specification Feedback: Jeff Larkin, OpenACC Technical Committee Chair
New Members Welcome: Sunita Chandrasekaran, OpenACC User Adoption Chair
NVIDIA HPC SDK Updates: Michael Wolfe, NVIDIA
GPU Hackathon Reports: Julia Levites, OpenACC Marketing Committee Chair
Talks
Adaptation of Stochastic GW to GPUs: Lessons and Implications
Monday, August 31, 2020 | Daniel Neuhauser, University of California, Los Angeles (UCLA)
Recently, our Stochastic Donkeys team participated in a GPU Hackathon where the in-house Stochastic GW code was adapted to GPUs. The code efficiently calculates charging energies that are important for chemistry, physics, and materials science. It does so for very large "dirty" real-life systems with thousands of electrons or more, using only a set of 50-1000 trivially parallelizable calculations, each involving the propagation of only 10 quantum wavefunctions. The GPU adaptation was successful, reaching a factor of 10-50 improvement (compared with a single-CPU code) in different parts. This factor is far from the roughly three-order-of-magnitude improvement possible for dense matrix algebra, but it is still quite important, since it indicates that a large improvement in wall time would be achievable with a GPU farm. I will also review the GPU adaptation, which was quite straightforward with the help of OpenACC.
Experiences with Porting a Nuclear Physics CI Code to GPUs Using OpenACC
Monday, August 31, 2020 | Dossay Oryspayev, NERSC
We present our experiences with porting and optimizing the MFDn software, a Fortran 90 code for performing ab initio nuclear No-Core Shell Configuration Interaction calculations, on GPUs using OpenACC. MFDn has been in production since the early 2000s. The main computational tasks performed in this code are the construction of a nuclear many-body Hamiltonian in a truncated no-core shell configuration interaction space, and the partial diagonalization of the Hamiltonian using an iterative eigensolver such as the locally optimal block preconditioned conjugate gradient (LOBPCG) method or the Lanczos method. The largest matrices we can currently handle have a dimension of up to about 30 billion, with more than 10^14 nonzero matrix elements. The current implementation uses a hybrid MPI and OpenMP programming model and performs well on CPU-based HPC platforms, including Cori at NERSC and Theta at ALCF, both with Intel Xeon Phi 'Knights Landing' processors. Under the NERSC Exascale Science Application Program (NESAP), we are currently porting and optimizing this code to run efficiently on Perlmutter, a system with more than 6,000 NVIDIA GPUs to be installed at NERSC in late 2020. We will discuss the advantages of using OpenACC for porting MFDn to GPUs and point out some of the challenges as well.
Porting Nek5000 on GPUs Using OpenACC and CUDA
Monday, August 31, 2020 | Niclas Jansson, KTH Royal Institute of Technology
Nek5000 is a spectral element code for fluid dynamics applications. In this talk, we present our work on porting and optimising Nek5000 for GPUs. Based on the existing OpenACC port, our main focus has been on optimising the small, dense matrix-matrix multiplications arising from the spectral element formulation. Starting with the proxy application Nekbone at EuroHack'19, a significant speedup could be obtained with optimised OpenACC loops, with further performance gains achieved by rewriting time-critical routines in CUDA. With these tuned kernels, we report on the current progress of our GPU port by demonstrating the performance of Nek5000 when solving real cases on both NVIDIA P100 and V100 GPUs.
Acceleration Without Breaking: The Search for Sustainable Portable Performance in CASTEP
Monday, August 31, 2020 | Phil Hasnip, University of York
CASTEP is a leading “first principles” materials simulation program, capable of predicting the electronic, chemical and vibrational properties of materials. It was designed “from the ground up” to exploit large-scale conventional HPC, using modern Fortran and MPI coupled with good software engineering principles. The advent and rapid development of accelerators such as GPGPUs presents both a great opportunity and a great challenge: how can a single codebase support efficient serial, parallel and accelerator execution models? In this talk I will discuss how OpenACC is being used to address these questions, and the successes, failures and current challenges in our quest to exploit fully all of the compute resources available on modern HPC machines.
Accelerating Microbiome Research with OpenACC
Tuesday, September 1, 2020 | Igor Sfiligoi, San Diego Supercomputer Center
UniFrac is a commonly used metric in microbiome research for comparing microbiome profiles to one another. Computing UniFrac on modest sample sizes used to take a workday on a server-class CPU-only node, while modern datasets would require a large compute cluster to be feasible. After porting to GPUs using OpenACC, computing the same modest sample sizes now takes only a few minutes on a single NVIDIA V100 GPU, while modern datasets can be processed on a single GPU in hours. The OpenACC programming model made porting the code to GPUs extremely simple; the first prototype was completed in just over a day. Getting full performance did, however, take much longer, since proper memory access patterns are fundamental for this application.
Accelerating Gyrokinetic Tokamak Simulation (GTS) Code Using OpenACC
Tuesday, September 1, 2020 | Min-Gu Yoo, Princeton Plasma Physics Laboratory
The GTS (Gyrokinetic Tokamak Simulation) code has been developed to study turbulent plasma physics in a tokamak, one of the most promising concepts for a nuclear fusion reactor. GTS is a global gyrokinetic particle simulation based on the Particle-In-Cell (PIC) method, which requires a large number of particles to reduce statistical noise. GTS has used MPI and OpenMP to parallelize the particle calculations, but it is still expensive, as several thousand CPU cores are required to deal with a billion particles. For further speed-up and lower computational cost, we recently ported the code to GPUs using OpenACC. We could easily parallelize most of the particle calculations, such as particle advancing and charge deposition, by adding a few lines of OpenACC directives to the code. As a result, we achieved over 10x speed-up for the particle calculations and over 4x speed-up for whole-code performance, including non-parallelized routines.
Porting Legacy Monte Carlo Ray-Tracing to GPU Using ISO-C Code-Generation and OpenACC #pragmas
Tuesday, September 1, 2020 | Peter Willendrup, European Spallation Source & Technical University of Denmark
McStas and McXtrace are Monte Carlo ray-tracing codes for neutron and X-ray scattering, aged 22 and 9 years respectively. Both generate ISO-C code using a classical Lex+Yacc grammar and include many user-contributed physics component models. The presentation will introduce the codes and their main applications, and give details on how we are porting the codes to GPU using OpenACC #pragmas, keeping much of the code base identical to its CPU counterpart and relying on a wide set of competencies in our development team. We will further report on how our team participated in several GPU Hackathons and showcase our current status, demonstrating speedups in the range of 10-600x over the performance of a single CPU core. Finally, we will outline our development roadmap toward full GPU support.
Accelerating Kinetic Low-Temperature Plasma Simulations via OpenACC
Tuesday, September 1, 2020 | Andrew Powis, Princeton University
Low-temperature plasmas (electron temperatures from a few eV to 10s of eVs, and low ionization fraction) encapsulate an enormous range of physical behaviour, including complex boundary phenomena, generation of excited states, and plasma chemistry. Many of these phenomena are kinetic in nature, and therefore require resolution of the full six-dimensional phase space. Although perhaps not as glamorous as fusion reactors, low-temperature plasmas also encompass the majority of industrial plasma applications, as well as the bulk of laboratory experiments. Perhaps the most ubiquitous application of this plasma regime is in materials processing, particularly the etching of silicon wafers in microchip manufacturing.
Our code, Low-Temperature-Plasma Particle-in-Cell (LTP-PIC), relies on the mixed Lagrangian/Eulerian PIC method to model these phenomena. A majority of this algorithm is highly amenable to GPU architectures. Recently our development team took part in the Princeton GPU Hackathon (in collaboration with Oak Ridge National Laboratory). Throughout the Hackathon we were able to accelerate a majority of our code via OpenACC, with some routines seeing a 100-300x speedup on a Tesla V100 when compared to a single CPU thread. Our motivation has always been to maintain a single code base and portability over systems ranging from desktops to homogeneous and heterogeneous supercomputers. Here we report on our progress and, with some caveats, our mostly successful experience with OpenACC.
Collisions also play an important role in low-temperature plasmas, and in PIC they are modeled via Monte Carlo processes, with a typical simulation requiring upwards of trillions of high-quality random numbers. We are working to port a suitable pseudo-random number generator for use on GPUs with OpenACC, and see this as a potentially important feature to incorporate into the standard moving forward.
Tuesday, September 1, 2020 | Antonio Ragagnin, Ludwig-Maximilians-Universität München
The parallel N-body code Gadget3 is one of the most widely used codes for large cosmological hydrodynamic simulations. In this talk we present our OpenACC solution for bringing its most expensive algorithm to the GPU: its neighbour-searching algorithm, which is based on tree walks and particle exchanges between MPI ranks. Our approach overlaps computations on the CPUs and GPUs: while GPUs asynchronously compute interactions between particles within their MPI ranks, CPUs perform tree walks and MPI communications of neighbouring particles. With this approach we obtain a speedup of 2 at CSCS (NVIDIA P100 + PCI Express) and on CINECA Marconi100 (POWER9 + V100 connected with NVLink). The next generation of zoom-in cosmological simulations may be challenging for accelerators due to their inhomogeneous initial conditions, which imply inhomogeneous tree structures. We will discuss our preliminary results and approaches in the context of zoom-in simulations.
Ask the Experts
September 1, 2020 | 10:20 AM - 12:20 PM PDT | Digital Session
September 1, 2020 | 07:00 - 08:00 PM PDT | Digital Session
Come join OpenACC experts to ask how to start accelerating or continue optimizing your code on GPUs with OpenACC, how to start teaching OpenACC, how to host or participate in a GPU Hackathon, and more!