The 2021 OpenACC Summit was held September 14th and 15th as a completely digital, remote event that included daily keynotes, a Birds of a Feather (BoF) interactive discussion, several invited talks, and an "Ask the Experts" session.
Welcome and Keynote
September 14, 2021 | 7:00 - 7:10 AM PDT | Digital Event
Welcome and Opening Remarks: Jack Wells, President, OpenACC
Day 1 Keynote: September 14, 2021 | 7:10 - 7:40 AM PDT
The Good, the Ugly and the Bad: What We Learned from Porting ICON to GPUs
William Sawyer, Swiss National Supercomputing Centre (CSCS)
The Icosahedral Non-hydrostatic (ICON) model employs a finite-volume solver of the equations of atmospheric motion and the physical parameterizations from the well-known ECHAM model. It was ported to accelerators using OpenACC accelerator directives as well as a source-to-source translator, which generates OpenACC or OpenMP directives from meta-directives. This approach was intended as an intermediate solution to allow single-source compatibility with the existing Fortran code base. In the longer term, ICON developers will gradually embrace emerging paradigms such as domain-specific languages (DSLs) and/or accelerator-aware extensions of existing programming languages. This presentation gives a historical perspective on what went well, and what went less well, during the long process of making ICON GPU-capable.
Day 2 Keynote: September 15, 2021 | 7:05 - 7:30 AM PDT
Barbara Chapman, HPE
In the HPC community’s efforts to deliver programming models that provide high performance along with productivity benefits, directive-based approaches have played an important role. As large-scale computer systems continue to grow in their size and complexity, the need for productive, directive-based programming models such as OpenACC has never been greater.
To facilitate application development using OpenACC, robust implementations on a broad variety of HPC platforms must be complemented by performance analysis and optimization tools. We describe HPE’s offerings for OpenACC that will help grow the OpenACC ecosystem.
Birds of a Feather
September 14, 2021 | 7:40 - 9:00 AM PDT | Digital Event
OpenACC Technical Committee Updates: Jeff Larkin, OpenACC Technical Committee Chair
OpenACC Unified Programming Environment for GPU and FPGA Multi-Hybrid Acceleration: Taisuke Boku, University of Tsukuba
Talks
Porting Non-equilibrium Green’s Functions Routines to GPUs
Alessandro Pecchia, ISMN-CNR
The libNEGF library is a general-purpose library for quantum transport calculations based on Non-equilibrium Green’s functions. The library is written in modern Fortran and has been interfaced to different Hamiltonian representations such as finite elements (tiberCAD), tight-binding (DFTB), and even ab initio within the Materials Studio suite, commercialized by Biovia. The library is an open-source project hosted on GitHub. The implemented algorithms follow a recursive block scheme, essentially based on the Schur complement. The time-consuming numerical steps are dense matrix inversions and matrix-matrix (MxM) multiplications. Recently we have worked on porting the code to GPUs with the aim of improving the performance of these linear algebra routines, which are well suited to vector architectures.
Mini-apps running selected routines in single precision show speedups exceeding 100x compared to the state-of-the-art multi-core Math Kernel Library (MKL) on Intel CPUs. Tensor cores in reduced precision can reach impressive speedups of 400x. A complete port has been accomplished for the double-precision routines, showing speedups exceeding 10x. In this talk some details of our OpenACC and CUDA implementations will be discussed.
Refactoring the MPS/University of Chicago Radiative MHD (MURaM) Model for GPU/CPU Performance Portability Using OpenACC Directives
Eric Wright, University of Delaware
The MURaM (Max Planck/University of Chicago Radiative MHD) code is a solar atmosphere radiative MHD model that has been broadly applied to solar phenomena ranging from the quiet to the active sun, including eruptive events such as flares and coronal mass ejections. The treatment of physics is sufficiently realistic to allow for the synthesis of emission from visible light to extreme UV and X-rays, which is critical for a detailed comparison with available and future multi-wavelength observations. This capability relies critically on the radiation transport solver (RTS) of MURaM, the most computationally intensive component of the code. The benefits of accelerating RTS are multifold: a faster RTS allows for the regular use of the more expensive multi-band radiation transport needed for comparison with observations, and it will pave the way for the acceleration of ongoing improvements in RTS that are critical for simulations of the solar chromosphere. We present challenges and strategies to accelerate the multi-physics, multi-band MURaM code using a directive-based programming model, OpenACC, in order to maintain a single source code across CPUs and GPUs.
Results for a 288^3 test problem show that MURaM with the optimized RTS routine achieves 1.73x speedup using a single NVIDIA V100 GPU over a fully subscribed 40-core Intel Skylake CPU node and, with respect to the number of simulation points (in millions) per second, a single NVIDIA V100 GPU is equivalent to 69 Skylake cores. We also measure parallel performance on up to 96 GPUs and present weak and strong scaling results.
On the Road to Code Portability
Stéphane Ethier, Princeton Plasma Physics Laboratory (PPPL)
PPPL scientists have successfully ported several codes to NVIDIA GPUs using the OpenACC programming model. It has been our preferred approach due to its ease of implementation and non-interference with the CPU code. While the hope was that OpenACC would become the de facto directive-based programming model for accelerators of all types, it appears that OpenMP is now being promoted more forcefully. Unfortunately, this transition period can be painful for developers who have to work with immature implementations.
GPU-Accelerated Multi-Phase Flow Simulation
Spencer H. Bryngelson, Georgia Institute of Technology
The computational cost of multi-phase flow simulations is dominated by the numerical methods used to represent the phase interfaces. In the cases presented, shock- and interface-capturing take derivatives across such discontinuities. These methods have high operation counts, making them good candidates for off-loading, but are vector operations on the state variables, potentially resulting in low arithmetic intensity. These competing heuristics are tested via implementations in the open-source multi-component flow code (MFC, mfc-caltech.github.io). Specifically, we implement a weighted essentially non-oscillatory (WENO) state variable reconstruction scheme and an HLLC approximate Riemann solver. OpenACC handles the GPU offloading, and host-device memory transfers are removed by computing the entire flow simulation on the GPUs. ORNL Summit is used to test the approach. Results show that 100-times speedups are achieved over the CPU-only implementation on the Summit nodes (6x NVIDIA V100 GPUs and 2x Power9 CPUs with 22 compute cores per CPU). Profiling reveals that the most expensive kernels, WENO and HLLC, both perform near the GPU roofline and achieve above 80% of peak compute intensity. We conclude that GPU offloading and its implementation via OpenACC is appropriate for interface-capturing compressible flow solvers, with significant speedups expected.
ACCelerating the FALL3D Flagship Code: Insights from Porting a Mini-App
Eduardo Cabrera Flores, Barcelona Supercomputer Center
Porting an application to accelerators can be challenging. In this talk we describe our roadmap for ACCelerating FALL3D, a ChEESE flagship code (an Eulerian model for the atmospheric transport and deposition of particles, aerosols, and radionuclides), using OpenACC through both the full app and a mini-app. Results indicate the benefits of our approach, delivering as much as a 29x improvement in runtime for certain benchmark scenarios when executed on multiple GPUs.
New Developments of the Quantum ESPRESSO Code: A Combined CUF-OpenACC Approach
Ivan Carnimeo, Scuola Internazionale Superiore di Studi Avanzati (SISSA)
In recent years there has been a considerable effort to port the Quantum ESPRESSO code to accelerated hardware architectures. A CUDA Fortran (CUF) based approach was initially chosen, due to its ease of learning and its similarity to standard Fortran. Recently, to overcome some drawbacks of the CUDA Fortran accelerated code, some portions of Quantum ESPRESSO have been accelerated by integrating OpenACC directives into the pre-existing CUF code. In this talk, Ivan Carnimeo will briefly illustrate the current state of the GPU port of the Quantum ESPRESSO code.
Porting Large-Scale Materials Science Codes to GPUs for Next Generation Exascale Architectures
Mauro Del Ben, Lawrence Berkeley National Laboratory
Due to their intensive computational workload, materials science codes have been and still are among those that benefit most from leadership-class HPC facilities. Graphics processing units (GPUs) currently dominate the HPC landscape and force developers to actively maintain and optimize core compute kernels going forward. In this talk, we will focus on our experiences navigating this portability effort for the BerkeleyGW software package. BerkeleyGW is a massively parallel software package employed to study the excited-state properties of electrons in materials using GW, Bethe-Salpeter Equation (BSE) methods and beyond. The code is capable of scaling out to tens of thousands of nodes and effectively utilizing strong-scaling CPU architectures. We will discuss our experiences porting BerkeleyGW to three different GPU programming models (CUDA, OpenACC, OpenMP Target), as well as the challenges we encountered along the way to achieve true performance portability.
Leveraging Profiling Tools for OpenACC Refactoring Projects: Case Study on PRIMo, a Flood Inundation Model
Daniel Howard, National Center for Atmospheric Research (NCAR), Computational & Information Systems Lab (CISL)
PRIMo (Parallel Raster Inundation Model), developed by Brett Sanders and Jochen Schubert at the University of California, Irvine, simulates metric-scale flood inundation by solving the 2D shallow water equations, providing useful hazard information to flood risk managers and planners. Earlier versions of the Fortran-based model utilized MPI alongside a Single Program Multiple Data (SPMD) design for running on classical CPU-based distributed parallel computing clusters. Given the original design approach, porting PRIMo to GPUs using OpenACC was a relatively straightforward task. However, performance bottlenecks were encountered during this development, motivating the use of profilers like Nsight Systems and Nsight Compute. In this talk, we will discuss our approach to using profiling tools to analyze code performance, allowing us to quickly pinpoint needed areas of improvement in the code and reduce overall development time. Specific problems encountered and fixed will also be detailed. Once these issues were resolved, a medium-size Hurricane Harvey test case was able to achieve a 56x speedup on one NVIDIA V100 GPU vs. the CPU-only version.
This work was made possible as a collaboration between Brett Sanders at UC Irvine and Adam Luke from Zeppelin Floods with Daniel Howard and Davide Del Vento at NCAR. Additional thanks go to the GPU Hackathons community and support provided by Dave Norton, Matt Stack, and Mahidhar Tatineni.
Sanders, B., Schubert, J., 2019. PRIMo: Parallel raster inundation model. Adv. Water Resour. 86, 79-95. https://doi.org/10.1016/j.advwatres.2019.02.007
A Transformation-Based Approach for the Design of Parallel/Distributed Scientific Software: the FFT
Lenore M. Mullin, MoA: Provably Optimal Tensors and State University of New York, Albany and Wileam Y. Phan, Lawrence Berkeley National Laboratory
We extend a methodology for designing efficient parallel and distributed scientific software for GPUs introduced in Rosenkrantz et al. This methodology utilizes sequences of mechanizable, algebra-based optimizing transformations. Starting from a high-level algebraic algorithm description in "A Mathematics of Arrays" (MoA), abstract multiprocessor plans are developed and refined to specify which computations are to be done by each processor. Starting with the OpenMP program in Fortran 90 produced in Rosenkrantz et al., we extend it to include OpenACC for GPU support. Our studies show what is needed in OpenACC and in Fortran compiler support for GPUs, and what issues we encountered and resolved.
 Harry B. Hunt, Lenore R. Mullin, Daniel J. Rosenkrantz, and James E. Raynolds (2008). “A Transformation-Based Approach for the Design of Parallel and Distributed Scientific Software: The FFT”. arXiv:0811.2535
 Lenore R. Mullin (1988), “A Mathematics of Arrays”, PhD dissertation. Syracuse University. doi:10.5555/915213.