Experiences Porting ITASSER Using OpenACC

February 27, 2024 | 10:00 AM PST
Elijah MacCarthy, Oak Ridge National Laboratory

GPU-I-TASSER is a GPU-capable protein structure prediction and structure-based function annotation method. It is developed from the Iterative Threading ASSEmbly Refinement (ITASSER) method, a top protein structure prediction tool according to the Critical Assessment of Structure Prediction (CASP) experiments.

In this webinar, we discuss GPU-I-TASSER, which ports the bottleneck Replica Exchange Monte Carlo (REMC) regions of the ITASSER pipeline to the device using OpenACC. We share our experiences applying extensive data management to these ported kernels, particularly the data-bound replica exchanges, to arrive at an efficient, optimized ITASSER. We record an average 10x speedup on a single NVIDIA P100 GPU over an Intel Xeon E5-2680v3 processor on Pittsburgh Supercomputing Center’s Bridges supercomputer.
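
The data-management idea mentioned above can be illustrated with a minimal sketch. This is not GPU-I-TASSER code: the function name `relax_on_device`, its parameters, and the toy update rule are all hypothetical, chosen only to show how an OpenACC structured data region keeps an array resident on the device across many kernel launches instead of copying it on every sweep.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy kernel: shrink every coordinate by a factor (1 - rate),
// repeated `sweeps` times. The enclosing data region copies `x` to the
// device once on entry and back once on exit, so the inner kernels reuse
// device-resident data. Without an OpenACC compiler the pragmas are
// ignored and the loops run serially on the host with the same result.
void relax_on_device(std::vector<double>& coords, int sweeps, double rate) {
    assert(sweeps >= 0);
    double* x = coords.data();
    const std::size_t n = coords.size();

    #pragma acc data copy(x[0:n])
    {
        for (int s = 0; s < sweeps; ++s) {
            // `present` asserts the data is already on the device,
            // so no transfer is issued for this kernel.
            #pragma acc parallel loop present(x[0:n])
            for (std::size_t i = 0; i < n; ++i) {
                x[i] -= rate * x[i];
            }
        }
    }
}
```

Calling `relax_on_device(v, 2, 0.5)` on a vector holding 1.0, -2.0, and 4.0 scales each entry by 0.25. In a real Monte Carlo code the host would typically need only a few scalars per sweep (for example, replica energies), which is where selective updates rather than full copies pay off.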

Register Now


MFC: Performant Multiphase Flow Simulation at Leadership-class Scale via OpenACC

Spencer Bryngelson, Georgia Institute of Technology

Multiphase compressible flow simulations are often characterized by large grids and small time steps, so meaningful simulations on CPU-based clusters can take several wall-clock days. Accelerating the corresponding kernels on GPUs appears attractive, but standard finite-volume and finite-difference methods are memory-bound, which damps speedups. Even when kernel speedups are realized, faster GPU-based kernels can make communication and I/O times prohibitive.

This webinar focuses on a portable strategy for GPU acceleration of multiphase and compressible flow solvers that addresses these challenges and obtains large speedups at scale. Employing a trio of approaches—OpenACC for offloading, Fypp to reveal hidden compile-time optimizations, and NVIDIA CUDA-aware MPI for remote direct memory access—enables the efficient use of the latest leadership-class systems.

Spencer Bryngelson, assistant professor at Georgia Institute of Technology, discusses how his team implemented this approach in the open-source solver MFC (https://mflowcode.github.io) to achieve 46% of peak FLOPs and high arithmetic intensity for the most expensive simulation kernels. In representative simulations, a single NVIDIA A100 GPU is 300 times faster than an Intel Xeon Cascade Lake CPU core. At the same time, near-ideal (within 3%) weak scaling is observed for at least 13,824 V100 GPUs on OLCF Summit. Strong scaling efficiency of 84% is retained for an 8-times increase in GPU count. Large multi-GPU simulations demonstrate the practical utility of this strategy.
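
To put the strong-scaling figure in concrete terms, here is a short worked calculation under the conventional definition of strong-scaling efficiency. That definition is an assumption, since the abstract does not state which one is used. The symbols are illustrative: T(N) is the wall-clock time on N GPUs for a fixed problem size, N_0 is the baseline GPU count, and k is the factor by which the GPU count grows.

```latex
E_{\text{strong}}(k) = \frac{T(N_0)}{k \, T(k N_0)}, \qquad
S(k) = \frac{T(N_0)}{T(k N_0)} = k \, E_{\text{strong}}(k)
\quad\Rightarrow\quad
S(8) = 8 \times 0.84 = 6.72
```

Under that reading, using 8 times as many GPUs on the same problem would shorten the run by a factor of about 6.7 rather than the ideal 8.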

Watch Recording          View Slides


Accelerating a Production Solar MHD Code with Fortran Standard Parallelism

Ronald M. Caplan, Predictive Science Inc.

There is growing interest in using standard language constructs for accelerated computing, thereby avoiding the need for external APIs. These constructs hold the potential to be more portable and "future-proof." For Fortran codes, the current focus is on the `do concurrent` (DC) loop. While there have been some successful examples of GPU acceleration using DC for benchmark and/or small codes, its widespread adoption requires demonstrations of its use in full-size applications.

In this webinar, Ronald M. Caplan from Predictive Science Inc. looks at the current capabilities and performance of DC in a production application called Magnetohydrodynamic Algorithm outside a Sphere (MAS). MAS, a state-of-the-art model for studying the Sun’s corona and heliosphere, is over 70,000 lines long and has previously been ported to GPUs using MPI+OpenACC. He attempts to eliminate as many of its OpenACC directives as possible in favor of DC.

Additionally, he shows that by using the NVIDIA NVFORTRAN compiler's Fortran 2023 preview implementation, unified managed memory, and modified MPI launch methods, the code can achieve GPU acceleration across multiple GPUs with zero OpenACC directives. However, doing so currently results in a non-trivial performance drop. Finally, Ronald Caplan discusses the improvements needed to avoid this loss and demonstrates how to use DC while retaining the original code's performance and reducing the number of OpenACC directives by a factor of five.

Watch Recording          View Slides          Read Paper


KokkACC: Enhancing Kokkos with OpenACC

Pedro Valero Lara, Oak Ridge National Laboratory

Template metaprogramming is gaining popularity as a high-level solution for achieving performance portability on heterogeneous computing resources. Kokkos is a representative approach that offers programmers high-level abstractions for generic programming, while most of the device-specific code generation and optimization is delegated to the compiler through template specializations. For this, Kokkos provides a set of device-specific code specializations in multiple backends, such as CUDA and HIP. However, maintaining and optimizing multiple device-specific backends for each new device type can be complex and error-prone.

In this webinar, Dr. Pedro Valero Lara from Oak Ridge National Laboratory presents an alternative OpenACC backend for Kokkos: KokkACC. KokkACC provides a high-productivity programming environment and, potentially, a multidevice backend. Competitive performance has been observed; in some cases, KokkACC is faster than NVIDIA's CUDA backend and much faster than OpenMP's GPU offloading backend. This work also includes implementation details and a detailed performance study conducted with a set of mini-benchmarks (AXPY and DOT product) and three mini-apps (LULESH, LAMMPS, and miniFE).
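
For readers unfamiliar with the two mini-benchmarks named above, the following sketch shows what AXPY and DOT compute, written as plain loops with OpenACC directives. It is not KokkACC or Kokkos code and makes no claim about how that backend is implemented; the function names `axpy` and `dot` and their signatures are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AXPY: y <- a*x + y, one independent update per element.
// The pragma is ignored by compilers without OpenACC support,
// in which case the loop runs serially on the host.
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const double* xp = x.data();
    double* yp = y.data();
    const std::size_t n = x.size();

    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}

// DOT: sum over i of x[i]*y[i], expressed as a parallel sum reduction.
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const double* xp = x.data();
    const double* yp = y.data();
    const std::size_t n = x.size();
    double sum = 0.0;

    #pragma acc parallel loop reduction(+:sum) copyin(xp[0:n], yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        sum += xp[i] * yp[i];
    }
    return sum;
}
```

AXPY is a pure streaming kernel and DOT adds a reduction, so together they exercise the two basic parallel patterns (parallel-for and parallel-reduce) that a portability layer has to map onto each backend.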

Recently, the team implemented support for transparent device selection as an extension of KokkACC. OpenACC can target different types of devices, so a single backend can be compiled to target different hardware architectures. KokkACC equips Kokkos with automatic and transparent device selection based on the computational cost of the application and the characteristics of the hardware, eliminating the burden of deciding which backend and device to use at compilation time. Two heterogeneous systems with different hardware capabilities were used for performance analysis. KokkACC provides speedups of up to 28x thanks to automatic and transparent device selection.

Watch Recording          View Slides          Read Paper


Analysis of OpenACC Validation and Verification Testsuite

Sunita Chandrasekaran, University of Delaware, Aaron Jarmusch, University of Delaware, and Christian Munley, University of Delaware

OpenACC is a high-level directive-based parallel programming model that can manage the sophistication of heterogeneity in architectures and abstract it from the users. The portability of the model across CPUs and accelerators has gained the model a wide variety of users. This means it is also crucial to analyze the reliability of the compilers' implementations.

To address this challenge, the OpenACC Validation and Verification team introduces a validation testsuite, supported by a streamlined infrastructure, to verify the OpenACC implementations across various compilers and system architectures.

This webinar will cover recent advancements in the testsuite, demonstrate the infrastructure’s workflow, present representative tests, analyze results from various systems, and outline future developments.

Watch Recording          View Slides          Read Paper


Quantum ESPRESSO on GPUs: Porting Strategy and Results

Fabrizio Ferrari Ruffino, Italian National Council of Research (CNR-IOM)

Quantum ESPRESSO (QE) is an open-source distribution of software code packages for materials simulation and modeling at the nanoscale based on density functional theory, pseudopotentials, and plane waves.

A flagship code of the European Union MAX (Materials design at the eXascale) Centre of Excellence for high performance computing (HPC) applications, QE's development entails refactoring codes into software stacks of conceptually distinct components, and adapting the porting strategies to each one of them, ranging from low-level libraries to high-level simulation drivers. An important part of the ongoing development relies on performance portability and the shift from the original CUDA Fortran-based accelerated version to a more directive-based one utilizing OpenACC and, more recently, OpenMP 5.

In this webinar, Fabrizio Ferrari Ruffino from the 'Istituto Officina dei Materiali' at the Italian National Council of Research (CNR-IOM) will present the current status of QE porting, including the strategies to allow the coexistence of different offloading models, and share some benchmark results from DFT self-consistent calculations and linear response. 

Watch Recording          View Slides          Read Paper