The Open Accelerated Computing Summit reflects our Organization's evolution and commitment to helping the research and developer community advance science by expanding their accelerated and parallel computing skills.

The 2023 Summit brought together preeminent researchers across national laboratories, research institutions and supercomputing centers worldwide to discuss work that aligns with our Organization's focus areas of developing and utilizing the OpenACC directives-based programming model to port, accelerate, or optimize scientific applications; sharing experiences and lessons learned from participating in a hackathon or bootcamp training event; and participating in ecosystem development through work that enables parallel programming models in compilers or tools and advances performance interoperability.

Keynotes

The Long but “Straight” Road towards Integration of Simulations, Data, and Learning on Oakforest-PACS II

Kengo Nakajima, University of Tokyo

Supercomputing is changing dramatically. Integration and convergence of simulation, data, and learning (S+D+L) is important as we move towards Society 5.0, a concept proposed by the Japanese Government that enables the integration of cyberspace and physical space. In 2015, we started the Big Data & Extreme Computing (BDEC) project to develop supercomputers and software for the integration of (S+D+L). In May 2021, we began operation of Wisteria/BDEC-01, the first BDEC system, consisting of computing nodes for computational science and engineering with A64FX (Odyssey) and nodes for Data Analytics/AI with NVIDIA A100 GPUs (Aquarius).
Additionally, we developed a software platform, “h3-Open-BDEC”, for the integration of (S+D+L) on Wisteria/BDEC-01, designed to extract the maximum performance from the supercomputer with minimum energy consumption by focusing on (1) innovative methods for numerical analysis with adaptive precision, accuracy verification, and automatic tuning; (2) a hierarchical data-driven approach based on machine learning; and (3) software for heterogeneous systems. Integration of (S+D+L) by h3-Open-BDEC enables a significant reduction in computation and power consumption compared with conventional simulations. In January 2025, we will begin operating the Oakforest-PACS II system (OFP-II) together with the University of Tsukuba. OFP-II will consist of NVIDIA H100 nodes with a total peak performance of 100-150 PFLOPS. This is our next platform for the integration of (S+D+L). Since October 2022, we have been supporting our users in migrating their applications to OFP-II's GPUs in collaboration with NVIDIA. This talk will describe and discuss our activities in the integration of (S+D+L) and our efforts towards OFP-II.

The Role of Compiler Directives in Scientific Computing: Now and into the Future

Thomas Schulthess, ETH Zurich / The Swiss National Supercomputing Center (CSCS)

Compiler directives, like OpenACC and OpenMP, have played an important role in accelerated scientific computing, especially for imperative languages like Fortran. Directives typically provide a parallel programming model for legacy software, making them useful for exploring new architectural features in a wide range of scientific applications. They are also thought to enhance the portability of software. However, to achieve optimal performance on different architectures, algorithms and thus imperative implementations need to change. Sustainable, performance-portable software development is thus better served with parallel programming constructs in standard languages (e.g. C++ or Fortran) or with descriptive programming models that can be implemented in Python with domain-specific libraries. As these new technologies continue to mature, it will be natural and advantageous for the role played by directives to also evolve in managing the expression of parallelism in scientific software. This talk presents my views on the role of compiler directives in scientific computing going forward.
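
To make the contrast concrete, here is a small illustrative sketch (not taken from the talk): the same SAXPY-style update written once as a directive-annotated legacy loop and once with an ISO C++17 parallel algorithm. The function names and the build flags mentioned in the comments are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>

// (a) Directive approach: annotate the existing loop; an OpenACC compiler may offload it.
void saxpy_directive(std::size_t n, float a, const float* x, float* y) {
  #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
  for (std::size_t i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

// (b) Standard-language approach: the same parallelism expressed with a C++17 execution policy,
//     left to the toolchain (e.g., nvc++ -stdpar) to map onto the target hardware.
void saxpy_standard(std::size_t n, float a, const float* x, float* y) {
  std::transform(std::execution::par_unseq, x, x + n, y, y,
                 [a](float xi, float yi) { return a * xi + yi; });
}
```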

Tutorials

GPU Programming with Python Tutorial

Aswin Kumar, NVIDIA and Mozhgan Kabiri Chimeh, NVIDIA

This two-hour tutorial covers the basics of GPU programming and provides an overview of how to port Python-based scientific applications to GPUs using CuPy and Numba. Throughout the session, attendees will learn how to analyze and accelerate Python codes on GPUs and apply these concepts to real-world problems. Participants will be given access to hands-on exercises and expert help during and after the tutorial. This tutorial will be held twice, once in an Asia-Pacific-friendly time zone and again in a North America and Europe-friendly time zone.

Panels

The Accelerated Evolution of Programming

Jeff Larkin, Moderator, Chair of the OpenACC Technical Committee, NVIDIA

The way we program computers is evolving at an accelerated pace. Classical programming languages are expanding, new programming languages are emerging, and fundamentally different paradigms are on the horizon. This panel will discuss the historical context, rapid development, and future direction of how computers are programmed. Experts will share their insights on how classical programming languages and models have progressed to lay the groundwork for new approaches, examine newly emerged models and developer enablement, and lastly, look towards emergent areas such as Quantum Computing and Large Language Models and how they could completely change the relationship between the programmer and computer.

Towards Sustainable Computing Competence through Mentorship

Moderator, TBA

Open Hackathons help researchers and computational scientists advance science by pairing them with expert mentors to work collaboratively on AI and HPC applications. Our mentors are at the core of the hackathons’ and teams’ success.

This community-driven mentor panel will discuss the evolution of mentorship given the rapid advancement of high-performance systems used to accelerate traditional HPC as well as AI. We'll discuss different ways of examining and addressing the diverse, newly emerging workloads that teams bring to Hackathons. Lastly, we'll discuss the benefits of being part of a mentor program and what additional steps will help existing mentors enhance their knowledge, increase networking opportunities, and inspire new mentors.

Talks

SOD2D: An OpenACC-Accelerated Code for Scale-Resolving Simulations of Turbulent Compressible Flows

Lucas Gasparino Ferreira da Silva, Barcelona Supercomputing Center

SOD2D is a code developed at BSC-CNS for performing large eddy simulations (LES) and direct numerical simulations (DNS) of compressible turbulent flows in realistic complex cases, with a particular focus on the aviation industry. It employs a high-order spectral element method (SEM) based on a continuous Galerkin discretization, in conjunction with an entropy viscosity method adapted to SEM, to achieve high accuracy and robustness at a reasonable computational cost. The aim of this project is to bridge the gap between industry needs and highly accurate flow simulations, which requires the use of high-fidelity models and large meshes. This in turn requires the use of HPC systems, which are increasingly relying on GPUs to achieve the required performance.

In this talk, we will present how we used OpenACC to allow our scale-resolving CFD code to run efficiently on multiple GPUs. Using OpenACC to port most of the code to GPUs (in particular, all of the compute algorithm), we were able to achieve excellent performance on NVIDIA GPUs, including the latest A100 and H100 architectures. As a preview of some of the most striking results, we found that a single H100 GPU can run a heavily refined mesh (~48M nodes) composed of 3rd-order hexahedra at a cost of about 370 ms per time-step, a feat that would require several nodes of the MareNostrum4 supercomputer at BSC-CNS. We will also present scalability data demonstrating how well the code performs when running on multiple GPUs, both when communicating data using MPI and when using the NVIDIA Collective Communications Library (NCCL), in tandem with OpenACC directives.
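
As a hedged illustration of the kind of multi-GPU setup described here (not SOD2D source code, which is Fortran), the sketch below binds one GPU to each MPI rank and uses host_data use_device so a GPU-aware MPI implementation can exchange device-resident data directly; the array name and exchange pattern are invented for the example.

```cpp
#include <mpi.h>
#include <openacc.h>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Bind this rank to one of the GPUs visible on the node (simple round-robin mapping).
  const int ngpus = acc_get_num_devices(acc_device_nvidia);
  if (ngpus > 0) acc_set_device_num(rank % ngpus, acc_device_nvidia);

  std::vector<double> halo(1024, static_cast<double>(rank));
  double* h = halo.data();
  const int n = static_cast<int>(halo.size());

  #pragma acc data copy(h[0:n])
  {
    #pragma acc parallel loop        // stand-in for the real compute kernels
    for (int i = 0; i < n; ++i) h[i] *= 2.0;

    // Pair ranks (0<->1, 2<->3, ...) and swap halos directly between GPUs:
    // inside host_data, h refers to the device copy, which GPU-aware MPI can use.
    const int peer = rank ^ 1;
    if (peer < size) {
      #pragma acc host_data use_device(h)
      MPI_Sendrecv_replace(h, n, MPI_DOUBLE, peer, 0, peer, 0,
                           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
  }
  MPI_Finalize();
  return 0;
}
```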

Porting an OpenACC Fortran HPC Code to the AMD GPU-Based Frontier System 

Igor Sfiligoi, University of California, San Diego

NVIDIA has been the main provider of GPU hardware in HPC systems for over a decade. Many applications that benefit from GPUs have thus been developed and optimized for the NVIDIA software stack. Recent exascale HPC systems are, however, introducing GPUs from other vendors, such as the AMD GPU-based Frontier system at the Oak Ridge Leadership Computing Facility (OLCF), which recently became available. AMD GPUs cannot be directly accessed using the NVIDIA software stack and require a porting effort by the application developers.

This talk provides an overview of our experience porting and optimizing the CGYRO code, a widely used fusion simulation tool written in Fortran with OpenACC-based GPU acceleration. While porting from the NVIDIA compilers to the Cray compilers on the AMD system was relatively straightforward, the performance optimization required more fine-tuning. In the optimization effort, we uncovered code sections that had performed well on NVIDIA GPUs but were unexpectedly slow on AMD GPUs. After AMD-targeted code optimizations, performance on AMD GPUs increased to meet our expectations. Modest speed improvements were also seen on NVIDIA GPUs, an unexpected benefit of this exercise.

GPU-Acceleration of the WEST Code for Large-Scale Many-Body Perturbation Theory

Victor Yu, Argonne National Laboratory

Many-body perturbation theory (MBPT) is a powerful method for simulating electronic excitations in molecules and materials. In this talk, we present a massively parallel, GPU-accelerated implementation of MBPT in the WEST code (www.west-code.org). Outstanding performance and scalability are achieved by employing a hierarchical parallelization strategy, nonblocking MPI communications, and mixed precision in selected portions of the code. The capability of the GPU version of WEST is demonstrated by large-scale MBPT calculations using up to 25,920 GPUs. Finally, we delve into our experience of switching our GPU programming model from CUDA to OpenACC, which is enabling us to attain enhanced performance portability. 

Clacc: OpenACC, Clang/LLVM, and Kokkos

Joel Denny, OpenACC Co-Chair of the Technical Committee, Oak Ridge National Laboratory

Clacc has developed OpenACC compiler, runtime, and profiling support for C and C++ by extending Clang and LLVM under the Exascale Computing Project (ECP).  OpenACC support in Clang and LLVM can facilitate the programming of GPUs and other accelerators in HPC applications and provide a popular compiler platform on which to perform research and development for related optimizations and tools for heterogeneous computing architectures.  A key Clacc design decision is to translate OpenACC to OpenMP to leverage the OpenMP offloading support that is actively being developed for Clang and LLVM.  A benefit of this design is support for two compilation modes: a traditional compilation mode that translates OpenACC source to an executable, and a source-to-source mode that translates OpenACC source to OpenMP source.  Clacc is hosted publicly on GitHub as part of the LLVM Department of Energy (DOE) Fork maintained at Oak Ridge National Laboratory (ORNL) (https://github.com/llvm-doe-org/llvm-project/wiki).
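
A rough sketch of the mapping Clacc performs conceptually (the actual generated OpenMP differs in detail): an OpenACC parallel loop and the OpenMP target-offload form it is translated to.

```cpp
#include <cstddef>

// A simple OpenACC kernel ...
void scale_acc(std::size_t n, double a, double* x) {
  #pragma acc parallel loop copy(x[0:n])
  for (std::size_t i = 0; i < n; ++i) x[i] *= a;
}

// ... and the OpenMP target-offload form it conceptually maps to.
void scale_omp(std::size_t n, double a, double* x) {
  #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
  for (std::size_t i = 0; i < n; ++i) x[i] *= a;
}
```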

This talk presents the latest developments in the Clacc project as well as future plans in light of the end of ECP later this year.  We will cover topics including recent Clacc support for OpenACC in C++, support for KokkACC (a new OpenACC backend for Kokkos which won the best paper award at WACCPD 2022), and a general summary of Clacc's current OpenACC feature support.  We will also invite the community to give feedback on their interest in seeing OpenACC support in the LLVM ecosystem going forward.

Porting CaNS Using OpenACC for Fast Fluid Dynamics Simulations at Scale

Pedro Costa, Delft University of Technology

Direct numerical simulations of the Navier-Stokes equations have greatly enhanced our understanding of turbulent fluid flows, impacting numerous environmental and industrial applications. Still, important unresolved issues remain, requiring massive computing power that has only recently come within reach with GPU computing at scale.

This talk focuses on the GPU porting effort of the numerical solver CaNS, a code for fast, massively parallel simulations of canonical fluid flows that has gained popularity throughout the years. CaNS is written in modern Fortran and recently received a fresh GPU port using OpenACC and a hardware-adaptive pencil domain decomposition library. We exploited OpenACC directives for host/device data movement, loop offloading, asynchronous kernel launching, and interoperability with CUDA and external GPU libraries. More importantly, the porting process revealed several practices in which standard Fortran combined with OpenACC secures a sustainable and flexible implementation while retaining the efficiency of the numerical tool. The exchange between domain-specific and GPU computing experts was key to the success of this effort.

We will cover the high-level implementation and performance of CaNS, including how OpenACC enabled swift interoperability with the external libraries behind the hardware-adaptive implementation. Additionally, we will discuss fine implementation details, highlighting simple yet impactful approaches, not widely documented, that are applicable to other applications. Finally, we will demonstrate performance and show how CaNS is being used on the supercomputer Leonardo to simulate wall turbulence at unprecedented Reynolds numbers (i.e., high flow speeds).
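
As a generic illustration of such interoperability (not CaNS code, which is Fortran), the sketch below hands OpenACC-managed device data to a CUDA library, here cuFFT, through host_data use_device so no extra host/device copies are needed.

```cpp
#include <cufft.h>
#include <vector>

// In-place 1-D forward FFT on data that OpenACC has already placed on the device.
void forward_fft(int n, std::vector<cufftDoubleComplex>& buf) {
  cufftDoubleComplex* p = buf.data();

  cufftHandle plan;
  cufftPlan1d(&plan, n, CUFFT_Z2Z, 1);           // one double-complex transform of length n

  #pragma acc data copy(p[0:n])
  {
    #pragma acc host_data use_device(p)
    cufftExecZ2Z(plan, p, p, CUFFT_FORWARD);     // here p is the device address from the data region
  }

  cufftDestroy(plan);
}
```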

GPU Implementation and Optimization of Numerical Ocean Model with OpenACC

Takateru Yamagishi, Research Organization for Information Science and Technology (RIST)

We have developed numerical ocean models to study and replicate the ocean's state and understand its impact on climate, weather, and the entire Earth system. Recognizing the importance of GPUs for high-performance numerical simulations, we have focused on implementing and optimizing our numerical ocean models for GPU usage. Given that these models are shared among researchers with diverse backgrounds, not all of whom are proficient in high-performance computing, we adopted an OpenACC directive-based approach to ensure accessibility and minimize specific challenges.

In this talk, we present our implementation where we employed several optimization methods characteristic of GPU implementation with OpenACC. These methods included eliminating redundant device data transfers, reorganizing loops to optimize GPU transactions, and utilizing scalar values to maximize GPU register utilization. Furthermore, we made algorithmic changes to the tracer advection equations to increase data parallelism, and we incorporated a hybrid use of CPU and GPU to implement multigrid preconditioners for the Poisson solver. Throughout this process, we maintained code readability.
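
A minimal sketch of the first of these optimizations, assuming a generic field array rather than the ocean model's data structures: the field is kept resident on the GPU across time steps with an unstructured data region, so the per-step kernels transfer nothing.

```cpp
#include <cstddef>
#include <vector>

// Keep the field resident on the GPU across the whole time loop: one upload, one download.
void integrate(std::size_t n, int nsteps, std::vector<double>& field) {
  double* f = field.data();

  #pragma acc enter data copyin(f[0:n])                // single host-to-device transfer
  for (int step = 0; step < nsteps; ++step) {
    #pragma acc parallel loop present(f[0:n])          // data already on the device: no per-step transfer
    for (std::size_t i = 0; i < n; ++i)
      f[i] *= 0.99;                                    // stand-in for the real update kernel
  }
  #pragma acc exit data copyout(f[0:n])                // single device-to-host transfer
}
```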

The results of our model implemented with OpenACC showed a significant performance improvement compared to the CPU implementation. Specifically, the GPU implementation was approximately five times faster, comparable to our previous implementation with CUDA. Moreover, our model demonstrated good weak scalability across multiple GPUs, ranging from 4 to 256 GPUs.

CUDA Acceleration of the Simulation Code for Tokamak Fast Ions

Tongnyeol Rhee, Korea Institute of Fusion Energy

The fusion of hot hydrogen ions within a tokamak is a promising and sustainable source of clean energy. To utilize this energy potential, comprehensive studies on tokamak plasmas are crucial, and the development of simulation codes suitable for various purposes is required. This presentation introduces the acceleration of a code that simulates hydrogen ion heating and the wall heat load from fast ions in a tokamak. Specifically, we focus on the translation of the original CPU-based code to the CUDA language, allowing efficient execution on a GPU. Finally, we compare the optimized CUDA-based GPU code with the original CPU code and the non-optimized GPU version.

Acceleration of an Incompressible Immersed Boundary-based CFD Solver over Multi-GPU Platforms Using OpenACC

Somnath Roy, Indian Institute of Technology, Kharagpur

An immersed boundary based incompressible CFD solver has been accelerated over multi-GPU platforms using OpenACC. First, the compute-heavy components of the solver are identified and optimized on a single GPU using OpenACC constructs. A banded matrix storage algorithm is used, which facilitates solving large meshes on a single GPU. Further, a domain decomposition algorithm with cell overlaps is used to hybridize the parallelization strategy over multi-GPU platforms. Asynchronous data transfer calls are used to leverage the benefits of computation-communication overlap. Krylov subspace based solvers from AmgX and in-house RBSOR matrix solvers are both tested on different applications. It is observed that in some cases involving internal flows, RBSOR gives performance comparable to preconditioned BiCGSTAB while allowing larger grid sizes due to its smaller memory footprint. Near-linear speed-up is obtained for large problems. The solver is demonstrated for applications involving aerodynamic and biological flows.
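
The computation-communication overlap can be pictured with a generic sketch like the one below (array names and the halo layout are assumptions, not the solver's code): the interior update runs on one async queue while the boundary strip is staged to the host on another.

```cpp
#include <cstddef>
#include <vector>

// One time step: interior update on queue 1 overlaps with halo staging on queue 2.
void step(std::size_t n, std::size_t halo,
          std::vector<double>& u, std::vector<double>& unew) {
  double* a = u.data();
  double* b = unew.data();

  #pragma acc data copyin(a[0:n]) copy(b[0:n])
  {
    #pragma acc parallel loop async(1)                  // bulk interior update
    for (std::size_t i = halo; i + halo < n; ++i)
      b[i] = 0.25 * (a[i - 1] + 2.0 * a[i] + a[i + 1]);

    #pragma acc update self(a[0:halo]) async(2)         // stage boundary strip for the exchange
    #pragma acc wait(2)
    // ... exchange the halo with neighbouring subdomains here (e.g., via MPI) ...
    #pragma acc update device(a[0:halo]) async(2)

    #pragma acc wait                                     // both queues finished before returning
  }
}
```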

GPU Porting of Plasma Physics Codes

Emily Bourne, École Polytechnique Fédérale de Lausanne (EPFL)

In the context of the EUROfusion consortium, the École Polytechnique Fédérale de Lausanne (EPFL) Advanced Computing Hub participates in the improvement of existing European fusion simulation codes to enable researchers to take full advantage of the new capabilities offered by new generations of supercomputers. These codes are intended to simulate the plasmas of tokamaks and stellarators, experimental magnetic confinement devices for the production of energy by nuclear fusion.

In this talk, we will present the global strategy used for the GPU porting of multiple fusion research codes. The chosen strategy aims to keep a single version of each code using OpenACC/OpenMP directives. Performance portability on different architectures (NVIDIA and AMD GPUs) and with various compilers will be investigated. We will focus on three representative codes to illustrate this global strategy: (1) CAS3D, a magnetohydrodynamic code used to study the properties of fusion plasmas in non-axisymmetric configurations such as stellarators; we will show how the introduction of generic pragmas allows the use of either OpenACC or OpenMP to exploit multi-level parallelism and get the best performance from the available compilers. (2) Soledge3X, which aims to simulate the complex physics of tokamak edge plasma; we will present how this code is coupled to tools such as PETSc on GPUs using OpenACC and CUDA. (3) ASCOT5, an orbit-following Monte Carlo code; we will present how we leverage the Thrust library and efficient memory access to ensure efficient execution. We will also explore how the independent time evolution of particles is used to improve load balancing.
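
A hypothetical C++ rendering of the "generic pragma" idea described for CAS3D (the actual code is Fortran and its macro names differ): one macro layer selects OpenACC or OpenMP offload directives at build time so the numerical loops stay single-source.

```cpp
#include <cstddef>

// Select the offload directive at build time; the loop body itself is written only once.
#if defined(USE_OPENACC)
  #define GPU_PARALLEL_LOOP _Pragma("acc parallel loop gang vector")
#elif defined(USE_OMP_TARGET)
  #define GPU_PARALLEL_LOOP _Pragma("omp target teams distribute parallel for")
#else
  #define GPU_PARALLEL_LOOP   /* serial fallback */
#endif

// Device data movement is assumed to be handled elsewhere (e.g., managed memory or outer data regions).
void axpy(std::size_t n, double a, const double* x, double* y) {
  GPU_PARALLEL_LOOP
  for (std::size_t i = 0; i < n; ++i) y[i] += a * x[i];
}
```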

SORDI.ai Hackathon: Advancing Industrial Object Detection with Synthetic Data

Chafic Abou Akar, BMW Group

The SORDI.ai Hackathon, organized by BMW Group in partnership with NVIDIA, OpenACC, Microsoft, idealworks, and many other companies and universities worldwide, focused on advancing industrial object detection through the groundbreaking Synthetic Object Recognition Dataset for Industries: SORDI.ai, the largest and most comprehensive collection of photo-realistic images for industrial research, covering 80+ object classes. Over 700 teams participated in the first stage, with the top 10 teams progressing to the second stage, where they leveraged Intelligent Video Analytics to develop real-time inventory systems for industrial assets.

In this presentation, we will delve into the technologies behind SORDI.ai and its capacity as a valuable resource for training computer vision models in various factory environments. Additionally, we will share how this hackathon showcased the immense innovative potential of using industrial synthetic data and cutting-edge technologies within the industry.

Experiences from Hackathons: Accelerating Quantum Enhanced Support Vector Machine Algorithm with cuQuantum

Tai-Yueh Li, National Synchrotron Radiation Research Center

Quantum-enhanced Support Vector Machine (QSVM) is a promising quantum machine learning algorithm with potential quantum advantages. QSVM can nonlinearly map classical data into high-dimensional quantum state spaces for classification, outperforming classical algorithms in certain datasets. However, its application to large-scale datasets is challenging due to the limited number of qubits in current quantum computers and susceptibility to noise. One solution is to use GPU acceleration to simulate the QSVM algorithm.

At last year's NCHC Open Hackathons in Taiwan, we used the NVIDIA cuQuantum SDK and the IBM Qiskit package to implement a GPU simulation of QSVM for tumor metastasis classification. The dataset consisted of 334 tumor metastasis samples, each with 48 gene sequence features, and we aimed to investigate the impact of feature quantity on QSVM classification accuracy. In this implementation, we analyzed two factors affecting QSVM computation time. First, the influence of data size: the time for mapping classical data to quantum states in QSVM is proportional to 0.5N^2, where N is the data size. Second, we considered the effect of varying feature quantities on the time needed for quantum circuit computation, where the time for M features in the quantum circuit is proportional to 2^M.
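
For readers unfamiliar with these scalings, they follow from the symmetry of the kernel matrix and from statevector simulation: evaluating the kernel requires roughly one state overlap per pair of samples, and simulating an M-qubit feature map stores 2^M amplitudes:

\[
N_{\mathrm{overlaps}} \approx \binom{N}{2} = \frac{N(N-1)}{2} \sim \tfrac{1}{2}N^{2},
\qquad
\dim\lvert\psi\rangle = 2^{M}.
\]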

During the Hackathon, we tested and validated the computations of QSVM on two A100 GPUs. Due to time limitations, we were unable to complete the full benchmark test; therefore, we performed more comprehensive testing using a single NVIDIA RTX A6000 GPU in our lab after the hackathons. The results showed that with a single dataset and 31 features, the A6000 GPU was approximately 19 times faster than the CPU, and the A100 GPU was about 33 times faster. In the case of 334 data points and 27 features, the A6000 GPU was approximately 1.6 times faster than the CPU, and the A100 GPU was 11 times faster. Furthermore, we observed that QSVM demonstrated a more stable and faster convergence in classifying training datasets compared to classical SVM algorithms. In summary, our work highlights the potential of using GPUs and the cuQuantum SDK to accelerate quantum computing research.

The Gaia AVU-GSR Parallel Solver: Preliminary OpenACC Porting of an LSQR-based Application Towards (Pre-)Exascale Systems

Valentina Cesare, Osservatorio Astrofisico di Catania

The Gaia AVU-GSR code finds the astrometric parameters of ~10^8 stars in the Milky Way by solving an overdetermined system of linear equations with the iterative LSQR algorithm. In this talk, we present the GPU porting of Gaia AVU-GSR using OpenACC. Previously parallelized on CPUs with an MPI+OpenMP approach, this work is a preliminary study to test the feasibility of porting this code to a GPU environment by properly replacing OpenMP directives with OpenACC. This resulted in a moderate speedup of 1.5x over the OpenMP version as tested on Marconi100 (Cesare et al., ADASS XXXI, in press; 2022a; 2022c). This work had a threefold goal: (1) it was propaedeutic to a final CUDA port where the algorithm was deeply optimized and the speedup increased to 14x (Cesare et al., 2022b; accepted for PASP); (2) it was essential to investigate the potential performance and scaling properties of the code towards (pre-)Exascale systems; and (3) it could be exploited to extend these results to other HPC LSQR-based applications. During the talk, an analysis of the performance differences between the OpenACC and CUDA codes will be presented. This achievement with OpenACC is a result of the CINECA GPU Hackathon 2021, where the interaction with expert mentors was crucial. A systematic approach to parallelizing scientific applications was also an important guideline (Cesare et al., 2020; Aldinucci et al., 2021). To understand this talk, attendees should be familiar with parallelization languages (MPI, OpenMP, OpenACC, and CUDA) and code profilers (NVIDIA Nsight Systems).

Performance Portability of Ensemble Kalman Filter Using C++ Senders/Receivers

Yuuichi Asahi, Japan Atomic Energy Agency (JAEA)

Generally, production-ready scientific simulations consist of many different tasks, including computation, communication, and file I/O. Compared with GPU-accelerated computation, communication and file I/O are slower and can become major bottlenecks. It is thus quite important to manage these tasks concurrently to hide these costs.

In this talk, we employ the proposed C++ standard senders/receivers model to mask the costs of communication and file I/O. As a case study, we implement a 2D turbulence simulation code with the local ensemble transform Kalman filter (LETKF) using C++ senders/receivers. In LETKF, the mock observation data are read from files, followed by MPI communications and dense matrix operations on GPUs. We demonstrate a performance-portable implementation with this framework while exploiting the performance gain from the introduced concurrency.
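
A minimal sketch of the senders/receivers style, assuming the stdexec reference implementation of the proposal rather than the authors' actual code: independent observation reading and communication tasks run concurrently on a thread pool, and the analysis step is chained to run once both complete.

```cpp
#include <exec/static_thread_pool.hpp>
#include <stdexec/execution.hpp>
#include <cstdio>

int main() {
  exec::static_thread_pool pool{4};
  auto sched = pool.get_scheduler();

  // Two independent tasks: reading observations and exchanging data can overlap.
  auto read_obs = stdexec::schedule(sched)
                | stdexec::then([] { std::puts("read observation files"); return 42; });
  auto exchange = stdexec::schedule(sched)
                | stdexec::then([] { std::puts("halo exchange (stub)"); });

  // The analysis step starts only after both predecessors complete.
  auto pipeline = stdexec::when_all(std::move(read_obs), std::move(exchange))
                | stdexec::then([](int nobs) {
                    std::printf("LETKF analysis with %d observations\n", nobs);
                  });

  stdexec::sync_wait(std::move(pipeline));   // drive the whole task graph to completion
  return 0;
}
```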

From HPC to AI: NCI Open Hackathon Application Case Study

Frederick Fung, National Computational Infrastructure (NCI)

Concluded in June 2023 in Canberra, Australia, the NCI Open Hackathon marks the second successful year of collaboration between the National Computational Infrastructure (NCI), the OpenACC Organization, and NVIDIA in hosting this in-person event. The Hackathon attracted a diverse range of applications from both Australian and international research groups, featuring innovative projects spanning fields such as disaster management models, mathematics, CFD, economics, physics simulations, large language models, and more. Throughout the Hackathon, NCI and NVIDIA hosted and mentored teams working on scientific software written in Python, C/C++, Julia, and Fortran. In this presentation, we will highlight the results achieved by the participating researchers and share our learning experiences with our HPC and AI community.

Sailfish: A GPU-based Ocean Numerical Model

José María González Ondina, University of Florida

GPUs are, by orders of magnitude, the most computationally powerful part of modern computers. Modern units contain thousands to tens of thousands of cores that can perform tens of teraflops. On the memory side, modern GPUs like NVIDIA’s A100 and H100 can have up to 80 GB of fast VRAM, enough to contain even big models without the need for partitioning and message passing.

Unfortunately, the scientific community has been slow to adapt its numerical models to use GPUs. When it does, it is common to use a hybrid approach, combining GPU code with MPI or heavy use of non-VRAM memory. These approaches incur the penalty of memory transfers, either through MPI message passing or between CPU and GPU memory, negating some of the speedups obtained by using GPUs.
We are developing a new numerical code written in Python that aims to be a drop-in replacement for ROMS. This code, called Sailfish, solves the primitive equations using algorithms very similar to ROMS’, but runs almost entirely on the GPU. Sailfish will be able to read ROMS inputs and write the same type of outputs, but it is an entirely new code written in Python, using NVIDIA’s CuPy library plus some small kernels written in C++. The popularity of Python in academia, combined with it being a modern language, allows for simple and clear code that will be easy to understand and modify.

Preliminary tests show that Sailfish can be hundreds of times faster than sequential ROMS. On a cluster like the University of Florida’s HiPerGator, with nodes of four NVIDIA A100 GPUs, it would be possible to run ~100 cases in the time it takes to run one ROMS case using multiple cores. This speedup will allow us to run ensembles, enabling better uncertainty analysis and informing Data Assimilation algorithms, as well as enormously increasing the number of different scenarios one can analyze.

Accelerating Quantum Chemistry Calculations in GAMESS via Fortran Standard Parallelism

Melisa Alkan, Stanford University

Directive-based approaches such as the OpenACC and OpenMP target offloading (OTO) models for accelerating computational codes have become increasingly popular with the adoption of Graphics Processing Units (GPUs) in scientific communities. With the advent of heterogeneous computer architectures, it is important to consider the portability of computational codes across various compilers and hardware systems. This talk describes the development and implementation of Fortran 2008 DO CONCURRENT (DC) code in the GAMESS quantum chemistry software package to accelerate the main computational bottleneck in most quantum chemistry codes: the evaluation of molecular integrals and their digestion into the Fock build. The performance of DC relative to the OpenACC and OTO models with different compilers on NVIDIA A100 and V100 GPUs is assessed. The results show that DC can speed up the Fock build by 3.0× compared with the OTO model.

Real-time Urban Microclimate Simulation Using Multi-GPU Computing

Mingyu Yang, Yonsei University

Numerical weather prediction primarily operates at the kilometer scale due to resolution constraints. Nevertheless, in urban scenarios, factors like building structures influence local weather phenomena such as wind, thereby necessitating meter-scale predictions. Addressing these demands has been challenging, given the limitations in CPU integration density, computational resources, and the paucity of innovative numerical methods.

To tackle these challenges, this study introduced a multi-GPU-based parallelization approach, leveraging OpenACC and CUDA Fortran and integrating cutting-edge, highly scalable libraries such as PaScaL_TDMA and cuFFT. This strategy ensures scalability in a multi-GPU environment, a process detailed in this presentation.

Additionally, the research underwent robust validation across diverse urban contexts, with quantitative analysis considering variables such as the number of GPUs, the simulation area, and the maximum achievable computational speed. In conclusion, our real-time analysis of a 10.49 km^2 urban area in Seoul at a 4m resolution (approx. 40 million grids) using eight NVIDIA A100 GPUs demonstrates the practicality and potential of this approach. These results underscore the possibility for substantial advancements in GPU-based urban environment simulators.

Porting an Atmospheric Model to GPUs: Low-cost to High-performance Results

Kazuya Yamazaki, University of Tokyo

In this talk, we present the GPU porting of a numerical model of the atmosphere written in Fortran 90/95, a subset of the SCALE Regional Model (SCALE-RM) developed at RIKEN, Japan, using OpenACC. Atmospheric models in general contain numerous small kernels without any hotspots, so kernel launch latencies take up a significant portion of the total model runtime. To hide this latency, the "async" clause was heavily utilized throughout the code. OpenACC supports explicit queue handling for asynchronous execution, which allowed us to easily configure kernels to run asynchronously with respect to the host CPU, but in specific orders on the GPU to satisfy inter-kernel dependencies.
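
The pattern can be sketched generically as follows (illustrative only, not SCALE-RM source, and in C++ rather than Fortran): successive dependent kernels are placed on the same async queue so they execute in order on the device while the host keeps enqueueing ahead, and synchronization happens only when results are needed.

```cpp
#include <cstddef>
#include <vector>

// Two small, dependent kernels issued back-to-back on the same queue; the host does not block.
void microphysics_like(std::size_t n, std::vector<double>& qv_vec, std::vector<double>& th_vec) {
  double* qv = qv_vec.data();
  double* th = th_vec.data();

  #pragma acc data copy(qv[0:n], th[0:n])
  {
    #pragma acc parallel loop async(1)        // kernel 1 on queue 1
    for (std::size_t i = 0; i < n; ++i) qv[i] = qv[i] > 0.0 ? qv[i] : 0.0;

    #pragma acc parallel loop async(1)        // kernel 2 reads qv, so it shares queue 1 and runs after kernel 1
    for (std::size_t i = 0; i < n; ++i) th[i] += 2.5e3 * qv[i];

    #pragma acc wait(1)                       // synchronize only when the results are actually needed
  }
}
```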

Performance measurement revealed that the subset SCALE-RM on the NVIDIA A100 GPU achieved five times the FLOPS/W of the Intel Xeon 8360Y, and twice the performance-per-watt of the power-efficient Fujitsu A64FX. Performance was also tested on lower-cost GPUs, including the GTX 1060. Although the theoretical FP64 performance of such consumer GPUs is severely limited, the subset SCALE-RM in FP64 performed almost proportionally to the memory bandwidth, with only a minor slowdown on low-cost GPUs, because SCALE-RM is an extremely memory-bound application.

These results indicate that the OpenACC-ported SCALE-RM performs better on both data-center and consumer GPUs than CPUs, in terms of power and cost-efficiency, respectively.

GPU-acceleration of the Massively-parallel Flow Solver RHEA Pairing MPI+OpenACC

Ahmed Abdellatif, Universitat Politècnica de Catalunya · BarcelonaTech

This work presents a computational approach for large-scale simulations of compressible turbulence on heterogeneous compute nodes with accelerators. It combines MPI parallelization for distributed computation with OpenACC for GPU acceleration. This work discusses GPU porting, data management, solver performance, and scalability on different multi-node/GPU systems.

Two OpenACC approaches, automated (managed) and manual (non-managed) data management, are compared. The non-managed GPU-accelerated approach achieves an overall speedup of 6x compared to the CPU version, with specific kernels like inviscid and viscous flux calculations showing speedups of approximately 26x and 16x, respectively.
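
A schematic contrast between the two styles, using an invented flux kernel rather than RHEA code: the managed variant leaves all data movement to CUDA unified memory (e.g., building with the NVIDIA compilers' managed-memory option), while the non-managed variant states explicitly what is copied to and from the device.

```cpp
#include <cstddef>

// (a) "Managed": no data clauses; unified memory pages data on demand
//     (e.g., built with nvc++ -acc -gpu=managed).
void flux_managed(std::size_t n, const double* q, double* flux) {
  #pragma acc parallel loop
  for (std::size_t i = 0; i + 1 < n; ++i) flux[i] = 0.5 * (q[i] + q[i + 1]);
}

// (b) "Non-managed": the programmer states exactly what moves to and from the device.
void flux_explicit(std::size_t n, const double* q, double* flux) {
  #pragma acc parallel loop copyin(q[0:n]) copyout(flux[0:n-1])
  for (std::size_t i = 0; i + 1 < n; ++i) flux[i] = 0.5 * (q[i] + q[i + 1]);
}
```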

Accelerating the Cloud Microphysics Parameterization on GPU Using the Directive-based Methods

Jian Sun, National Center for Atmospheric Research (NCAR)

Cloud microphysics is a computationally costly component of a climate model. To improve its performance, we port the cloud microphysics parameterization in the Community Atmosphere Model (CAM), known as the Parameterization of Unified Microphysics Across Scales (PUMAS), from CPU to GPU using directive-based methods (OpenACC and OpenMP target offload). Their performance is first examined in a PUMAS stand-alone kernel, where the directive-based methods can outperform a CPU node as long as there is enough computational burden on the GPU. Consistent behavior is observed when we run PUMAS on the GPU in a practical CAM simulation. A 3.6x speedup of the PUMAS execution time, including data movement between CPU and GPU, is achieved at a coarse horizontal resolution. This speedup further increases up to 5.4x at high resolution, which highlights the fact that the GPU favors larger problem sizes. This study demonstrates that using the GPU in a CAM simulation can noticeably reduce computational cost even with only a small portion of the code being GPU-enabled.