From its humble beginnings as a small OpenACC user meeting, our annual OpenACC and Hackathons Summit has grown to showcase leading research accelerated by the OpenACC directives-based programming model and optimized through the Open Hackathons that we manage across national laboratories, research institutions, and supercomputing centers worldwide.

The 2022 Summit is scheduled for August 2nd through 4th and will feature two keynotes, invited talks, tutorials covering HPC and AI topics, an interactive panel discussion, and networking opportunities.

Welcome and Keynote

Day 1 Keynote

Portability Within the Upcoming European Exascale Infrastructure 
Dirk Pleiter, KTH Royal Institute of Technology / PDC Center for High Performance Computing

As in other regions, high performance computing (HPC) in Europe is on a path towards increased diversity of compute hardware solutions. This concerns both processors, such as the Arm-based CPUs resulting from the European Processor Initiative (EPI), and compute accelerator solutions ranging from different types of GPUs to accelerators developed within the EPI. While acknowledging the benefits of increased competition between different technologies and vendors, the challenges for application developers need to be mitigated through portable application implementation strategies as well as suitable development environments and support models. In this talk, we review the emerging European HPC infrastructure as well as the application ecosystem, with a particular emphasis on the role of directive-based programming models like OpenACC.

Day 2 Keynote

Having it all: Can software be productive, performant and portable?
Chris Maynard, University of Reading

Weather and climate models simulate complex, multi-scale physics. Through their predictive power, they can deliver substantial value for society. However, it can take a decade or more to develop a new model. Moreover, computer architectures and the programming models used to develop the software have evolved on a much shorter timescale and become more complex. This presentation will discuss the Domain Specific Language (DSL) approach being adopted by the Met Office in developing its next-generation model, LFRic, and describe how OpenACC is being used to exploit the computational power of GPUs. To conclude, some of the implications of the use of DSLs for parallelism, for general-purpose languages, and for programming model specifications such as OpenACC will be considered.

Tutorials

HPC Tutorial: Introduction to Accelerated Computing with OpenACC and ISO Standard Languages

Jeff Larkin, NVIDIA

Porting and optimizing legacy applications for GPUs doesn't have to be difficult when you use the right tools. OpenACC is a directive-based parallel programming model that enables C, C++, and Fortran applications to be ported to GPUs quickly while maintaining a single code base. Learn the basics of parallelizing an application using OpenACC directives and the NVIDIA HPC Compiler. Also learn to identify important parts of your application, parallelize those parts for the GPU, optimize data movement, and improve GPU performance. Become a GPU hero by joining this session.
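
For readers new to the model, here is a minimal sketch of the kind of directive the tutorial introduces: a C++ SAXPY loop offloaded with a single OpenACC pragma. The file name and build flags are illustrative and assume the NVIDIA HPC SDK.

```cpp
#include <cstdio>
#include <vector>

// SAXPY (y = a*x + y) with one OpenACC directive.
// Hypothetical build line: nvc++ -acc=gpu -Minfo=accel saxpy.cpp
int main() {
    const int n = 1 << 20;
    const float a = 2.0f;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    float *xp = x.data(), *yp = y.data();

    // copyin: x is only read on the GPU; copy: y moves in and back out.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (int i = 0; i < n; ++i)
        yp[i] = a * xp[i] + yp[i];

    printf("y[0] = %f\n", yp[0]);  // expect 4.0
    return 0;
}
```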

AI Tutorial: Exploring Neural Network Surrogate Models with miniWeatherML

Muralikrishnan (Murali) Gopalakrishnan Meena, Oak Ridge National Laboratory and Matthew Norman, Oak Ridge National Laboratory

There is growing interest in using machine learning (ML) as a part of numerical scientific experiment workflows. While ML architectures and applications span a very broad range, this tutorial narrows the scope to creating and deploying Neural Network surrogate models using simulation data.

miniWeatherML is a C++ mini-app geared toward training, benchmarking, and rapid prototyping end-to-end workflows using surrogate models. The code aims to be complex enough to be scientifically challenging and interesting, yet small and self-contained enough to be tractable and quickly understood. Users can create simplified numerical weather experiments to generate training data, train the surrogate model, and then deploy the model online in a real simulation to observe its behavior in phenomena such as “supercell” convection.

miniWeatherML uses a portable C++ library called Yet Another Kernel Launcher (YAKL), which is designed to create readable user-level code that will be as familiar as possible to a Fortran programmer. It also uses a custom Neural Network deployment library called the Portable Online Neural Network Inferencing (PONNI) library, which aims to help users easily deploy trained models in a portable manner on CPUs and GPUs. miniWeatherML runs out of the box on CPUs and on NVIDIA, AMD, and Intel GPUs.
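
As a rough sketch of the user-level style described above (based on YAKL's documented conventions; exact API details may vary between versions), a kernel launch in YAKL looks roughly like this:

```cpp
#include "YAKL.h"

// Rough sketch of the YAKL style; details may differ across YAKL versions.
int main() {
    yakl::init();
    {   // Arrays must be destroyed before yakl::finalize(), hence the scope.
        const int nx = 1024;
        // A device-resident array with C-style (row-major, zero-based) indexing.
        yakl::Array<double,1,yakl::memDevice,yakl::styleC> a("a", nx);

        // The lambda runs on GPU or CPU depending on how YAKL was compiled,
        // giving single-source loops much like annotated Fortran.
        yakl::c::parallel_for( yakl::c::Bounds<1>(nx), YAKL_LAMBDA (int i) {
            a(i) = 2.0 * i;
        });
    }
    yakl::finalize();
    return 0;
}
```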

Talks

Design of OpenACC compiler for Fujitsu A64FX SVE

Mitsuhisa Sato, RIKEN R-CCS

The Fujitsu A64FX is a manycore processor designed for the Supercomputer Fugaku. This processor supports Arm SVE (Scalable Vector Extension) SIMD instruction sets. We are working on the design of a prototype OpenACC compiler to make use of the SVE more efficiently than existing solutions such as OpenMP do. 

While OpenACC programs can be compiled and executed on the A64FX without any modifications, the compiler may be able to exploit the SIMD operations of each core, much as a GPU does, for more efficient execution. In this talk, the design choices and preliminary results will be presented.
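
As a purely hypothetical illustration of that idea (not the actual compiler design), consider how OpenACC's levels of parallelism might be retargeted: the same gang/vector clauses that map to thread blocks and threads on a GPU could map to A64FX cores and SVE SIMD lanes.

```cpp
#include <cstdio>

// Hypothetical mapping sketch: on a GPU, "gang" becomes a thread block and
// "vector" a thread; an A64FX-targeting compiler could instead map gangs
// to cores and the vector dimension to SVE SIMD lanes.
int main() {
    const int n = 4096;
    static double a[4096], b[4096];
    for (int i = 0; i < n; ++i) b[i] = i;

    #pragma acc parallel loop gang vector copyin(b) copyout(a)
    for (int i = 0; i < n; ++i)
        a[i] = 2.0 * b[i] + 1.0;

    printf("a[%d] = %f\n", n - 1, a[n - 1]);
    return 0;
}
```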

The Prime Directive: A "minimal interference" approach to GPU acceleration in the CASTEP code 

Phil Hasnip, University of York

CASTEP is a leading "first principles" materials simulation program, using quantum mechanics to predict materials' electronic, chemical and mechanical properties. It is written in modern Fortran according to good software engineering principles, and uses MPI + OpenMP for efficient CPU parallelism. 

In this talk, I will discuss how OpenACC is being used to enable GPU offloading in CASTEP, highlighting both the successes and the current challenges. Performance results from the latest developments will be presented, including on the UK's Tier-2 Bede IBM Power9 facility, where GPU-enabled simulation speeds routinely exceed 4x those of the CPU-only version.
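
CASTEP itself is Fortran, but the "minimal interference" idea translates directly into a small sketch (ours, not CASTEP source, and in C++ for consistency with the other examples on this page): unstructured data directives keep arrays resident on the GPU while the surrounding driver code is left untouched.

```cpp
#include <cstdio>
#include <vector>

// The offloaded routine assumes its arrays are already on the device.
void apply_potential(double *psi, const double *v, int n) {
    #pragma acc parallel loop present(psi[0:n], v[0:n])
    for (int i = 0; i < n; ++i)
        psi[i] *= v[i];
}

int main() {
    const int n = 1 << 18;
    std::vector<double> psi(n, 1.0), v(n, 0.5);
    double *p = psi.data(), *vp = v.data();

    // Place the data once; the pre-existing iteration loop is unchanged.
    #pragma acc enter data copyin(p[0:n], vp[0:n])
    for (int iter = 0; iter < 10; ++iter)
        apply_potential(p, vp, n);
    #pragma acc exit data copyout(p[0:n]) delete(vp[0:n])

    printf("psi[0] = %f\n", p[0]);  // 0.5^10
    return 0;
}
```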

thornado: a GPU-optimized application for spectral neutrino transport using OpenACC and OpenMP

J. Austin Harris, Oak Ridge National Laboratory

The toolkit for high-order neutrino radiation-hydrodynamics (thornado) is being developed to simulate spectral neutrino transport in nuclear astrophysics applications; for example, core-collapse supernovae and binary neutron star mergers. The computational cost associated with simulations of these events is largely dominated by modeling the neutrino-matter coupling, making it the obvious focus of algorithm development and performance optimizations that are necessary for deploying high-fidelity simulations. 

Thornado is developed as a supplemental module for neutrino transport in large-scale astrophysical simulation frameworks such as Flash-X. As such, the focus is on node-level parallelism and performance. We discuss the challenges and strategies encountered in porting thornado to efficiently use GPUs with the OpenACC and OpenMP programming models and present recent results from a detailed performance analysis.
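
A schematic of what supporting both models can look like (illustrative only; thornado itself is Fortran): the same loop carries an OpenACC or an OpenMP offload directive depending on how it is built.

```cpp
#include <cstdio>

// _OPENACC is predefined when compiling with OpenACC enabled; otherwise
// we fall back to the OpenMP target-offload spelling of the same kernel.
void scale(double *a, const double *b, int n) {
#ifdef _OPENACC
    #pragma acc parallel loop copyin(b[0:n]) copyout(a[0:n])
#else
    #pragma omp target teams distribute parallel for \
        map(to: b[0:n]) map(from: a[0:n])
#endif
    for (int i = 0; i < n; ++i)
        a[i] = 2.0 * b[i];
}

int main() {
    const int n = 1000;
    static double a[1000], b[1000];
    for (int i = 0; i < n; ++i) b[i] = i;
    scale(a, b, n);
    printf("a[1] = %f\n", a[1]);  // expect 2.0
    return 0;
}
```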

On adding GPU support to the Particle-In-Cell code ECsim

Elisabetta Boella, Lancaster University

The Particle-In-Cell (PIC) method is a computational technique widely used in plasma physics to model plasmas at a kinetic level. In the PIC algorithm, macroparticles representative of several real plasma particles are moved under the influence of the self-consistent electromagnetic fields. The latter are calculated on a grid via Maxwell's equations, where the source terms are obtained from the particles. 

In this work, we describe the effort to prepare our home-grown PIC code ECsim for future exascale architectures. In particular, we report on the implementation of OpenACC directives and the memory management technique that we adopted to port the particle kernels to GPUs. We show results from code profiling on CPUs and GPUs obtained with the NVIDIA® Nsight™ Systems profiler. Offloading these particle kernels to GPUs leads to an overall 5x speed-up with respect to the CPU implementation of the code. Finally, we present results from weak and strong scaling tests obtained on the supercomputers Marconi100 (CINECA, Italy), MeluXina (LXP, Luxembourg) and JUWELS Booster (JSC, Germany).

In weak scaling tests, ECsim demonstrates 80% scaling efficiency on up to 1,024 GPUs. This work was performed as part of the CSCS OpenACC hackathon. Access to Marconi100, MeluXina and JUWELS Booster was obtained via HPC-Europa3, EuroHPC JU and PRACE, respectively.
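
A hypothetical, heavily simplified sketch of the kind of particle kernel being offloaded (the real ECsim kernels are 3D, electromagnetic, and interpolate fields from the grid):

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int np = 1 << 20;               // number of macroparticles
    const double dt = 1.0e-3, qm = -1.0;  // time step, charge-to-mass ratio
    const double E = 0.1;                 // field, taken uniform in this sketch
    std::vector<double> x(np, 0.0), v(np, 1.0);
    double *xp = x.data(), *vp = v.data();

    // One GPU thread per particle: the loop is embarrassingly parallel.
    #pragma acc parallel loop copy(xp[0:np], vp[0:np])
    for (int p = 0; p < np; ++p) {
        vp[p] += qm * E * dt;  // accelerate in the field
        xp[p] += vp[p] * dt;   // advance the position
    }

    printf("x[0] = %f\n", xp[0]);
    return 0;
}
```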

Techniques, Tricks and Algorithms for Efficient GPU-Based Processing of Higher Order Hyperbolic PDEs

Dinshaw Balsara, University of Notre Dame

GPU computing is expected to play an integral part in all modern exascale supercomputers. It is also expected that higher-order Godunov schemes will make up about 30% of the application mix on such supercomputers. It is, therefore, very important to prepare the community of users of higher-order schemes for hyperbolic partial differential equations (PDEs) for this emerging opportunity. Not every algorithm used in the space-time update of the solution of hyperbolic PDEs takes well to GPUs. However, we identify a small core of algorithms that take exceptionally well to GPU computing.

Based on an analysis of available options, we have been able to identify Weighted Essentially Non-Oscillatory (WENO) algorithms for spatial reconstruction along with Arbitrary DERivative (ADER) algorithms for time-extension followed by a corrector step as the winning three-part algorithmic combination. Even when a winning subset of algorithms has been identified, it is not clear that they will port seamlessly to GPUs. The low data throughput between CPU and GPU, as well as the very small cache sizes on modern GPUs, implies that we must think through all aspects of the task of porting an application to GPUs. 

For that reason, this paper identifies the techniques and tricks needed to make a successful port of this very useful class of higher-order algorithms to GPUs. Application codes face a further challenge: the GPU results need to be practically indistinguishable from the CPU results in order for the legacy knowledge bases embedded in these application codes to be preserved during the port to GPUs. This requirement often makes a complete code rewrite impossible. For that reason, it is safest to use an approach based on OpenACC directives, so that most of the code remains intact (as long as it was originally well-written).

This talk is intended to be a one-stop shop for anyone seeking to make an OpenACC-based port of a higher-order Godunov scheme to GPUs. We focus on three broad and high-impact areas where higher-order Godunov schemes are used. The first area is computational fluid dynamics (CFD). The second is computational magnetohydrodynamics (MHD) which has an involution constraint that must be mimetically preserved. The third is computational electrodynamics (CED) which has involution constraints and also extremely stiff source terms. 

Together, these three diverse uses of higher-order Godunov methodology cover many of the most important application areas. In all three cases, we show that the optimal use of algorithms, techniques and tricks, along with the use of OpenACC, yields superlative speedups on GPUs! As a bonus, we find a most remarkable and desirable result: some higher-order schemes, with their larger operation counts per zone, show better speedup than lower-order schemes on GPUs. In other words, the GPU is an optimal stratagem for overcoming the higher computational complexity of higher-order schemes! Several avenues for future improvement have also been identified.
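
One concrete consequence of the small-cache point above is the directive pattern such ports tend to use: per-zone scratch arrays are declared private so they can live in registers rather than compete for cache. A schematic (ours, not the paper's code; a plain average stands in for the actual reconstruction):

```cpp
#include <cstdio>

int main() {
    const int nz = 1 << 16;
    static double u[1 << 16], unew[1 << 16];
    for (int i = 0; i < nz; ++i) { u[i] = 1.0; unew[i] = 0.0; }

    double w[5];  // per-zone stencil scratch, privatized below
    #pragma acc parallel loop private(w) copyin(u) copy(unew)
    for (int i = 2; i < nz - 2; ++i) {
        for (int k = 0; k < 5; ++k) w[k] = u[i - 2 + k];
        // ... the WENO reconstruction and ADER time extension would go
        // here; this sketch only shows the private-scratch structure ...
        unew[i] = 0.2 * (w[0] + w[1] + w[2] + w[3] + w[4]);
    }

    printf("unew[2] = %f\n", unew[2]);  // expect 1.0
    return 0;
}
```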

Developing an Ultra-high-resolution E3SM Land Model on Summit using OpenACC

Dali Wang, Oak Ridge National Laboratory

Earth system models are essential tools to advance our Earth system prediction capability. Exascale computers provide new opportunities to simulate Earth systems at unprecedented scales. 

We are developing an ultra-high-resolution E3SM land model (uELM) to enable high-fidelity land simulations targeting the coming exascale computers. We report on early uELM model development and a pilot simulation over North America on Summit using OpenACC and the PGI Fortran compiler (20.4). This comprehensive effort 1) provided large-scale scientific software analysis; 2) developed a hybrid computational model to ensure efficient uELM execution on hybrid architectures; 3) generated a function unit testing (FUT) platform for accelerated code development; 4) effectively implemented the uELM code on GPUs using compiler directives; and 5) demonstrated that compiler directives are a practical and efficient approach for porting large-scale scientific code to GPUs.

For the broad interests of the audience, we dive into the implementation of a specific ELM module (EcosystemDyn) on a single Summit node within the FUT framework. Advanced OpenACC features, such as data offload (data regions, copyin/copyout), nested parallel loops, asynchronous kernel launch, and deep copy, have been deployed for the EcosystemDyn code optimization. As a result, the optimized parallel implementation of EcosystemDyn achieved a more than 140x speedup (50 ms vs. 7,600 ms) compared to a naive OpenACC implementation on a single NVIDIA V100. On a fully loaded computing node with 44 CPU cores (2 reserved) and 6 GPUs, the code achieved over a 3x speedup compared to the original code on the CPU. Furthermore, the memory footprint of the optimized parallel implementation is 300 MB, around 15% of the 2.15 GB consumed by a naive implementation. We also briefly describe the pilot uELM simulation over North America (24 million gridcells at 1 km resolution) that will utilize all the GPUs on around 700 Summit nodes.
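
A hypothetical miniature of the features just listed (illustrative C++, not the uELM Fortran): a data region, collapsed nested parallel loops over gridcells and layers, and two asynchronous kernels that overlap before a final wait.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int nc = 1024, nl = 16;  // gridcells x layers (made-up sizes)
    std::vector<double> t(nc * nl, 280.0), m(nc * nl, 0.3);
    double *tp = t.data(), *mp = m.data();

    // Structured data region: arrays stay on the device for both kernels.
    #pragma acc data copy(tp[0:nc*nl], mp[0:nc*nl])
    {
        // collapse(2) exposes the full gridcell-by-layer iteration space.
        #pragma acc parallel loop collapse(2) async(1)
        for (int c = 0; c < nc; ++c)
            for (int l = 0; l < nl; ++l)
                tp[c * nl + l] += 0.1;       // e.g. a temperature update

        // Independent update launched on a second async queue to overlap.
        #pragma acc parallel loop collapse(2) async(2)
        for (int c = 0; c < nc; ++c)
            for (int l = 0; l < nl; ++l)
                mp[c * nl + l] *= 0.99;      // e.g. a moisture update

        #pragma acc wait                     // join both queues
    }
    printf("t[0] = %f, m[0] = %f\n", tp[0], mp[0]);
    return 0;
}
```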

Finally, we summarize several findings on porting large scientific codes using compiler directives. More detailed background information can be found in the two attached papers. We hope our effort can inspire further discussion of compiler directive support on the coming exascale computers, including "Frontier" and its development platform "Spock" at Oak Ridge National Laboratory.

OpenACC Acceleration of an Agent-Based Biological Simulation Framework

Sunita Chandrasekaran, University of Delaware

Computational biology has increasingly turned to agent-based modeling to explore complex biological systems. Biological diffusion (diffusion, decay, secretion, and uptake) is a key driver of behavior in biological tissues. GPU computing can vastly accelerate the diffusion and decay operators in the partial differential equations used to represent biological transport in an agent-based biological modeling system.

In this paper, we utilize OpenACC to accelerate the diffusion portion of PhysiCell, a cross-platform agent-based biosimulation framework. We demonstrate an almost 40x speedup on the state-of-the-art NVIDIA A100 GPU compared to a serial run on AMD's EPYC 7742. We also demonstrate a 9x speedup on the 64-core AMD EPYC 7742 multicore platform.

By using OpenACC for both the CPUs and the GPUs, we maintain a single source code base, thus creating a portable yet performant solution. With the simulator's most significant computational bottleneck greatly reduced, we can continue cancer simulations over much longer time scales.
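
The single-source claim is easy to picture with a sketch (ours, not PhysiCell code): one annotated diffusion-decay sweep whose target is chosen entirely at compile time. The flags shown assume NVIDIA's nvc++.

```cpp
#include <cstdio>
#include <vector>

// One source, two targets:
//   nvc++ -acc=multicore diffuse.cpp   # CPU threads
//   nvc++ -acc=gpu       diffuse.cpp   # GPU offload
int main() {
    const int n = 1 << 20;
    const double D = 0.1, decay = 0.01;  // made-up coefficients
    std::vector<double> c(n, 1.0), cn(n, 0.0);
    double *cp = c.data(), *cnp = cn.data();

    // Explicit 1D diffusion-plus-decay sweep over interior points.
    #pragma acc parallel loop copyin(cp[0:n]) copy(cnp[0:n])
    for (int i = 1; i < n - 1; ++i)
        cnp[i] = cp[i] + D * (cp[i-1] - 2.0 * cp[i] + cp[i+1]) - decay * cp[i];

    printf("cn[1] = %f\n", cnp[1]);
    return 0;
}
```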

Exploring Performance and Readability with Fortran + Directives and Portable C++ with miniWeather

Matthew Norman, Oak Ridge National Laboratory

It can be hard to determine the most effective accelerator porting strategy for a Fortran application. While the majority of the work for legacy applications typically lies in refactoring the code to expose threading in a manner most amenable to an accelerator device, the choice of language and strategy is still very important. Two common approaches are Fortran + directives and portable C++ libraries. The goal of this talk is not so much to explore why a project might choose one approach over the other, but rather to explore what CPU and GPU performance and code readability look like with either choice. The miniWeather mini-app is an HPC parallel programming training and benchmarking application that is complex enough to be scientifically interesting yet small enough to be tractable, and it contains a wide variety of approaches in Fortran, C, and portable C++, including MPI, OpenMP CPU threads, OpenMP target offload, OpenACC, and standards-based parallelism in Fortran ("do concurrent"). There are ongoing efforts to bring in standards-based C++ parallelism as well. While these results are specific to the miniWeather application, they can provide an important data point to help developers make decisions for their own applications.
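
To make the comparison concrete, here is a toy version of the "portable C++" shape (ours, far simpler than YAKL or other real libraries): the loop and its directive are wrapped once, and user code becomes a lambda. Whether capturing lambdas work inside OpenACC regions depends on the compiler; recent NVIDIA compilers accept this pattern, and the sketch assumes managed memory (e.g. -gpu=managed) so no data clauses are needed.

```cpp
#include <cstdio>
#include <vector>

// A toy portable-C++ layer: the directive lives here, user code is a lambda.
template <class F>
void parallel_for(int n, F f) {
    // Assumes a multicore target or managed memory (nvc++ -acc -gpu=managed),
    // so no explicit data clauses are required in this sketch.
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i) f(i);
}

int main() {
    const int n = 1000;
    std::vector<double> a(n);
    double *ap = a.data();

    parallel_for(n, [=](int i) { ap[i] = 2.0 * i; });

    printf("a[10] = %f\n", ap[10]);  // expect 20.0
    return 0;
}
```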

Restructuring a finite-element magnetohydrodynamics MPI code for GPU using OpenACC

Chang Liu, Princeton Plasma Physics Laboratory

The model of magnetohydrodynamics (MHD) has been used to calculate the macroscopic instabilities occurring in thermonuclear fusion devices like the National Spherical Torus Experiment (NSTX) and the International Thermonuclear Experimental Reactor (ITER), and it plays an important role in developing fusion power plants. M3D-C1, one of the flagship codes at Princeton Plasma Physics Laboratory (PPPL), uses the finite element method to solve the MHD equations and has been widely used in the community.

We have participated in two GPU Hackathons at Princeton University, in 2019 and 2021. In this talk, we present what we learned when porting this code to GPUs using OpenACC. Currently, two parts of the code have been successfully ported: the calculation of the finite element matrix, and the particle pushing in the kinetic effect calculation. For the matrix calculation, we restructured the code to separate the physics part from the numerical integration part. The numerical part, which contains multi-layer nested loops, can run on GPUs efficiently thanks to OpenACC. We also utilized the Multi-Process Service (MPS) developed by NVIDIA to make the MPI code work with multiple GPUs. For the second part, we utilized NVIDIA Nsight Systems to find the best parallelization strategy for particle pushing. The speedups achieved for both parts are impressive.

We will also talk about the remaining work to be done for porting the other parts of the code, including the iterative matrix solver and the reduction calculator of particle moments, and discuss the challenges associated with the work.

[1] C. Liu, S.C. Jardin, H. Qin, J. Xiao, N.M. Ferraro, and J. Breslau, Comput. Phys. Commun. 275, 108313 (2022)
[2] C. Liu, C. Zhao, S.C. Jardin, N.M. Ferraro, C. Paz-Soldan, Y. Liu, and B.C. Lyons, Plasma Phys. Control. Fusion 63, 125031 (2021)
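
As a hypothetical illustration of the matrix-integration structure described above (not M3D-C1 source): multi-layer nested loops over elements and local degrees of freedom, collapsed into one large parallel iteration space, with the quadrature sum kept sequential inside each thread.

```cpp
#include <cstdio>

int main() {
    const int nelem = 512, ndof = 12, nquad = 25;  // made-up sizes
    static double mat[512][12][12];

    // collapse(3) fuses the element and row/column loops so the GPU sees
    // nelem * ndof * ndof independent work items.
    #pragma acc parallel loop collapse(3) copyout(mat)
    for (int e = 0; e < nelem; ++e)
        for (int i = 0; i < ndof; ++i)
            for (int j = 0; j < ndof; ++j) {
                double s = 0.0;
                for (int q = 0; q < nquad; ++q)  // quadrature, sequential
                    s += 1.0 / nquad;            // basis-function products go here
                mat[e][i][j] = s;
            }

    printf("mat[0][0][0] = %f\n", mat[0][0][0]);  // expect 1.0
    return 0;
}
```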

Porting a Python Markov Chain Monte Carlo method to RAPIDS

Jason Yalim, Arizona State University

During the San Diego Supercomputer Center (SDSC) GPU Hackathon, a Python application involving a Markov Chain Monte Carlo (MCMC) estimator for a six-dimensional integral was successfully ported to NVIDIA GPUs using the RAPIDS suite of software libraries and APIs. The Python code was initially written with NumPy and involved a fourth-order tensor product. Utilizing CuPy, the code was accelerated ~540x at the iteration level and ~2150x at the process level compared to the original software. This performance boost (mostly accomplished by leveraging tensor algebra, basic variable caching, an NVIDIA A100 Tensor Core GPU, and the RMM library's memory pool functionality) enables state-of-the-art convergence for MCMC and provides us with a valuable toolset for accelerating other research pipelines utilizing MCMC.

GPU Acceleration of Greenland Ice Sheet Simulations

Emma "Mickey" MacKie, University of Florida

Geostatistical simulation is a powerful means for modeling the topography beneath ice sheets, a critically needed parameter for ice sheet models and sea level rise projections. However, computational limitations preclude the application of this method at ice sheet scales. 

The most computationally demanding component of this algorithm is the nearest neighbor octant search, which scales exponentially with dataset size on a CPU. Over the course of the 2022 University of Florida Hackathon, we modified this algorithm to run on a GPU. 

The GPU implementation of the octant search does not show a slow-down for large datasets, resulting in a projected 30x improvement in simulation speed for a full-Greenland implementation. We discuss the challenges, insights, and ongoing efforts to optimize the geostatistical simulation algorithm for GPU performance for Greenland applications.
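
The project's implementation details aren't given here, so as a language-neutral illustration of the bottleneck's shape (written in the OpenACC/C++ idiom used elsewhere on this page), a brute-force nearest-neighbor search parallelizes naturally with one thread per query point:

```cpp
#include <cfloat>
#include <cstdio>
#include <cstdlib>

int main() {
    const int nq = 4096, nd = 4096;  // query and data point counts (made up)
    static float qx[4096], qy[4096], dx[4096], dy[4096];
    static int nearest[4096];
    for (int i = 0; i < nd; ++i) { dx[i] = rand() / (float)RAND_MAX;
                                   dy[i] = rand() / (float)RAND_MAX; }
    for (int i = 0; i < nq; ++i) { qx[i] = rand() / (float)RAND_MAX;
                                   qy[i] = rand() / (float)RAND_MAX; }

    // One thread per query; the inner O(nd) scan is the work the CPU
    // version repeats serially for every simulation node.
    #pragma acc parallel loop copyin(qx, qy, dx, dy) copyout(nearest)
    for (int i = 0; i < nq; ++i) {
        float best = FLT_MAX; int arg = -1;
        for (int j = 0; j < nd; ++j) {
            float d = (qx[i] - dx[j]) * (qx[i] - dx[j])
                    + (qy[i] - dy[j]) * (qy[i] - dy[j]);
            if (d < best) { best = d; arg = j; }
        }
        nearest[i] = arg;
    }

    printf("nearest[0] = %d\n", nearest[0]);
    return 0;
}
```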

Porting OVERFLOW CFD code to GPUs: To Hackathons and Beyond!

Charles Jackson, NASA Langley Research Center

OVERFLOW is an overset, structured computational fluid dynamics (CFD) code written in Fortran that is widely used in government, industry, and academia.

Over the last several years, the OVERFLOW developers have been working to port miniapps based on computationally expensive parts of OVERFLOW to run on GPUs, primarily using OpenACC. This effort started at our first hackathon in 2019, and since then the OVERFLOW team has attended two additional hackathons (virtually). These hackathon environments have provided a great place to collaborate with others and learn from experts. These learning experiences enabled porting two miniapps to run effectively on NVIDIA GPUs using OpenACC.

The first miniapp focused on motifs found in the solver itself; the final ported version runs three times faster on a single V100 than on a 40-core, dual-socket Intel Skylake node. The speedup in this solver miniapp required multiple design changes, including increasing the amount of parallelism available and the amount of work performed in each kernel. The second miniapp, which focused on overset MPI communication, also saw significant speedups over the CPU implementation by using a CUDA-aware MPI implementation through OpenACC.
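
The CUDA-aware MPI piece relies on a standard OpenACC idiom worth showing (a sketch, not OVERFLOW source): host_data exposes the device address of a buffer so MPI can transfer GPU memory directly, with no host staging. Run with two ranks, e.g. mpirun -np 2.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;
    std::vector<double> buf(n, rank);
    double *bp = buf.data();

    #pragma acc data copy(bp[0:n])
    {
        // Inside host_data, bp refers to the device copy of the buffer,
        // so a CUDA-aware MPI moves it GPU-to-GPU directly.
        #pragma acc host_data use_device(bp)
        {
            if (rank == 0)
                MPI_Send(bp, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(bp, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
    }
    MPI_Finalize();
    return 0;
}
```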

This presentation will discuss our experience at the hackathons, our process of porting the miniapps to run on the GPUs, and several lessons learned throughout.