From humble beginnings as a small OpenACC user meeting, our annual OpenACC and Hackathons Summit has grown to showcase leading research from our diverse, worldwide community. This year, we are holding our first Asia-Pacific Summit, featuring leading research from preeminent organizations across Australia, India, Japan, Korea, and Taiwan. This digital event covers a variety of critical topics, ranging from AI-enabled physics and natural language processing to scientific libraries and frameworks for HPC to porting scientific applications to modern architectures, delving fully into the global research projects that are shaping the scientific and developer communities of the region.

The 2022 Summit is scheduled for August 23rd to 25th and will include opening remarks from OpenACC President Jack Wells, invited talks, tutorials covering HPC and AI topics, and the opportunity to interact with our regional sponsors during designated expo hours.

Register now!

Welcome

Opening Remarks

OpenACC Mission and Organization Update - Simulive

Jack Wells, OpenACC Organization

As the OpenACC organization grows beyond developing and promoting the OpenACC specification to include a broader set of modeling, simulation, and AI initiatives led by the Open Hackathons program, we would like to share with you our organization's updated mission and vision, report on what has been accomplished, and discuss activities the organization is focusing on going forward. We will also highlight opportunities for institutions and individuals to participate in outreach and service to the accelerated computing community.

Tutorials

HPC Tutorial: N-Ways to GPU Computing with Performance Portability (ISO Standard Language and Directives)

Aswin Kumar, NVIDIA

Porting and optimizing legacy applications for GPUs doesn't have to be difficult when you use the right tools. From directive-based parallel programming to a standard language model that enables C, C++, and Fortran applications to be ported to GPUs quickly while maintaining a single code base, you will learn the basics of parallelizing an application for more performance while keeping the code portable. Become a GPU Hero by joining this session.
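For a flavor of the two approaches the tutorial covers, here is a minimal Fortran sketch (ours, not the session materials) of the same vector update written first with an OpenACC directive and then with ISO-standard DO CONCURRENT; with nvfortran, the former builds with -acc and the latter with -stdpar=gpu.

program n_ways_demo
  implicit none
  integer, parameter :: n = 1000000
  real :: x(n), y(n)
  integer :: i

  x = 1.0
  y = 2.0

  ! Way 1: directive-based programming. One OpenACC directive asks the
  ! compiler to generate a GPU kernel for this loop.
  !$acc parallel loop
  do i = 1, n
     y(i) = 2.0*x(i) + y(i)
  end do

  ! Way 2: ISO-standard language parallelism. The same loop expressed
  ! with DO CONCURRENT, offloadable by nvfortran via -stdpar=gpu.
  do concurrent (i = 1:n)
     y(i) = 2.0*x(i) + y(i)
  end do

  print *, y(1)
end program n_ways_demo

Both versions keep a single portable code base: stripped of the directive, the first loop is still standard serial Fortran, and the second is standard Fortran as written.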

AI Tutorial: AI for Science

Aswin Kumar, NVIDIA

This tutorial will teach researchers fundamental concepts and practices in deep learning and how to apply them effectively to scientific problems. Participants will learn how to implement the primary components of deep learning workflows, including data-driven and physics-informed training, and how to evaluate models using accelerated visualization on scientific datasets.

Talks

GPU Acceleration of the Convergent Close-Coupling Code for Electron-Atom Scattering

Igor Bray, Curtin University

The Convergent Close-Coupling (CCC) computer code, which solves problems in electron-atom scattering, began its life in the late 1980s for a single CPU. Written in Fortran, it has been progressively extended to utilize all of the available cores on each node with OpenMP, across available nodes with MPI, and most recently takes advantage of GPU acceleration. 

This talk will discuss this journey, with a particular emphasis on the importance of partnerships with supercomputer staff whose expertise is essential in ensuring that the strengths of the modern-day computational infrastructure are made available to solve problems of current interest to science.

Evaluation of GPU Offloading Method Using Fortran Standard Parallelization 

Tetsuya Hoshino, University of Tokyo

The DO CONCURRENT construct, introduced as a standard feature in Fortran 2008, asserts that the iterations of the loop it applies to can be executed in parallel. The NVIDIA® nvfortran compiler supports DO CONCURRENT, allowing GPU execution of standard Fortran programs without CUDA® or directives.

In this presentation, we report the results of DO CONCURRENT implementations of an ICCG solver, an H-matrix vector multiplication, and a 3D diffusion kernel on the NVIDIA A100 GPU, compared against OpenACC and OpenMP 5.x.

The results show that the performance is comparable to that of OpenACC and OpenMP 5.x when the code is unaffected by DO CONCURRENT's limitations. However, when the code is affected by those limitations (i.e., no support for reduction operations and no explicit CPU-to-GPU data transfer), we found a large performance gap and a high implementation cost to work around them.
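As a hedged illustration of the limitation described above (our sketch, not the authors' benchmark code): a sum reduction is stated directly with an OpenACC clause, while standard Fortran 2018 DO CONCURRENT offers no way to declare it.

program reduction_demo
  implicit none
  integer, parameter :: n = 1000000
  real :: x(n), s
  integer :: i

  x = 1.0
  s = 0.0

  ! OpenACC states the reduction explicitly, so the compiler can build
  ! the parallel tree sum on the GPU.
  !$acc parallel loop reduction(+:s) copyin(x)
  do i = 1, n
     s = s + x(i)*x(i)
  end do

  print *, 'sum of squares =', s

  ! A DO CONCURRENT version of this loop is not standard-conforming
  ! Fortran 2018: iterations may not cooperatively update s. (Fortran
  ! 2023 later added a REDUCE locality specifier.) Workarounds, such as
  ! accumulating partial sums into a temporary array and summing it
  ! afterwards, are the kind of implementation cost the talk refers to.
end program reduction_demo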

Accelerating Machine Learning for Quantum Mechanical Data

Cheol Ho Choi, Kyungpook National University

Knowledge of the detailed mechanism behind the atomic layer deposition (ALD) can greatly facilitate the optimization of the manufacturing process. Computational modeling can potentially foster the understanding; however, the presently available capabilities of the accurate ab initio computational techniques preclude their application to modeling surface processes occurring on a long time scale, such as ALD. 

In this talk, we propose an iterative protocol for optimizing machine learning (ML) training data sets and apply ML-assisted ab initio calculations to model surface reactions occurring during the ALD process on the semiconductor surfaces. The protocol uses a recently developed low-dimensional projection technique, greatly reducing the amount of information required to achieve high accuracy (1 kcal/mol or less) of the developed ML models. Hence, the proposed protocol furnishes a very effective tool to study complex chemical reaction dynamics at a much reduced computational cost. 

GPU Acceleration of Geodynamic Simulation via OpenACC

Eh Tan, Institute of Earth Sciences, Academia Sinica

Numerical simulation of geodynamic processes is computationally expensive. The pursuit of higher resolution and more accurate physical simulation requires ever more computing power. The speed improvement of CPUs has stalled in recent years. Moreover, memory access speeds have improved only slowly over the decades, which further limits simulation performance. The advance of GPGPU (general-purpose computing on GPUs) can help solve this performance problem: GPUs provide quick memory access and fast context switching to hide memory access latency while keeping the computation units busy. Traditionally, the GPU and CPU have separate memory spaces, and programmers have to transfer data between them manually before and after the computation. The newer generations of NVIDIA GPUs provide a unified memory space that avoids manual data transfers. Additionally, we can port CPU code to GPUs using a few lines of OpenACC directives. In the end, we completely ported our explicit geodynamic simulation code to the GPU and achieved a 40x speed-up compared to single-CPU performance. In this talk, we will detail the porting strategy and compare the similarities between OpenACC and OpenMP.
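As a hedged sketch of the porting pattern described above (field names and the stencil are illustrative, not taken from the actual code), a typical explicit update loop needs only one directive, and with unified memory enabled (for example, nvfortran's -acc -gpu=managed) no explicit transfer directives at all:

program explicit_update_demo
  implicit none
  integer, parameter :: nx = 512, ny = 512
  real, parameter :: dt = 1.0e-3, dx = 1.0, dy = 1.0
  real :: vel(nx,ny), sxx(nx,ny), sxy(nx,ny), rho(nx,ny)
  integer :: i, j

  vel = 0.0; sxx = 1.0; sxy = 0.5; rho = 1.0

  ! One directive parallelizes the whole 2D update on the GPU; unified
  ! memory removes the manual CPU-GPU transfers discussed above.
  !$acc parallel loop collapse(2)
  do j = 2, ny-1
     do i = 2, nx-1
        vel(i,j) = vel(i,j) + dt/rho(i,j) * &
             ( (sxx(i+1,j) - sxx(i-1,j))/(2.0*dx) &
             + (sxy(i,j+1) - sxy(i,j-1))/(2.0*dy) )
     end do
  end do

  print *, vel(nx/2, ny/2)
end program explicit_update_demo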

GPU Parallelization of Two-way Coupled Particle-Fluid Solver

Prasad Perlekar, Tata Institute of Fundamental Research, Hyderabad

We present a study on the porting and parallelization of a Navier-Stokes solver for flow that couples with inertial or active particles. The challenge is performing fast and frequent interpolation of the flow field at particle positions and extrapolation of particle data back to the neighboring grid points.
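The difficulty is easy to see in a sketch (the names and the linear weighting below are our illustrative assumptions, not the solver's code): gathering fluid data to a particle is a simple read, but scattering particle data back to shared grid points needs atomic updates once the particle loop runs in parallel.

program scatter_demo
  implicit none
  integer, parameter :: np = 100000, ng = 1024
  real :: xp(np), qp(np)   ! particle positions and carried quantities
  real :: grid(ng)
  real :: w
  integer :: p, i

  call random_number(xp)
  xp = xp*(ng - 1) + 1.0   ! positions in [1, ng)
  qp = 1.0
  grid = 0.0

  ! Scatter (extrapolate) each particle's quantity to its two
  ! neighboring grid points with linear weights. Different particles
  ! can hit the same point, so the updates must be atomic.
  !$acc parallel loop private(i, w)
  do p = 1, np
     i = int(xp(p))        ! left neighbor
     w = xp(p) - real(i)   ! weight toward the right neighbor
     !$acc atomic update
     grid(i)   = grid(i)   + (1.0 - w)*qp(p)
     !$acc atomic update
     grid(i+1) = grid(i+1) + w*qp(p)
  end do

  print *, sum(grid)       ! conserves the total: should print ~np
end program scatter_demo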

Making the Work-Efficient Parallel Prefix Sum Do Less Work

Stephen Sanderson, The University of Queensland

The prefix sum (cumulative sum) algorithm can be accelerated for parallel processing through various algorithms, including the work-efficient algorithm in which the calculation is performed in "up sweep" and "down sweep" stages. 

A common use of the prefix sum is to aid in selecting a random element from an array of probabilities. The cumulative sum of the array is generated, and then an element is chosen such that it is the first element with a value greater than a random fraction of the total.

In this talk, modifications to the work-efficient parallel prefix sum algorithm and the subsequent search algorithm are developed which avoid unnecessary work in this specific "sum and search" use case, thereby minimizing expensive memory transactions. Optimization steps for this algorithm are discussed in terms of CUDA best practices, covering topics such as shared memory caching, bank conflict avoidance, and efficient memory access patterns.
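For readers unfamiliar with the two stages, here is a minimal serial sketch of the work-efficient (Blelloch) exclusive scan that the talk starts from; in a CUDA implementation each inner loop becomes one parallel step, with iterations mapped to threads. The talk's contribution is trimming unnecessary work from these stages for the sum-and-search use case.

program blelloch_demo
  implicit none
  integer, parameter :: n = 8   ! power of two for simplicity
  real :: a(n) = [3., 1., 7., 0., 4., 1., 6., 3.]
  integer :: d, i
  real :: t

  ! Up-sweep (reduce): build a tree of partial sums in place.
  d = 1
  do while (d < n)
     do i = 2*d, n, 2*d
        a(i) = a(i) + a(i-d)
     end do
     d = 2*d
  end do

  ! Down-sweep: convert the tree into an exclusive prefix sum.
  a(n) = 0.
  d = n/2
  do while (d >= 1)
     do i = 2*d, n, 2*d
        t = a(i-d)
        a(i-d) = a(i)
        a(i) = a(i) + t
     end do
     d = d/2
  end do

  print *, a   ! prints 0 3 4 11 11 15 16 22
end program blelloch_demo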

5-Dimensional Plasma Turbulence Simulations with OpenACC Directives

Kenji Imadera, Kyoto University

In magnetically confined fusion plasmas, the strong profile inhomogeneity can trigger micro-scale turbulence, driving particle and heat transport. 5-dimensional (5D) plasma turbulence simulation, which treats not only 3D real space but also 2D phase space, is considered to be an essential tool to explore how to control such a turbulence-driven transport.

We have developed a 5D turbulence code named GKNET (GyroKinetic Numerical Experiment of Tokamak) with 3D MPI decomposition; however, it is necessary to accelerate the calculation to simulate future, larger fusion reactors.

In this study, we implemented GPU parallelization in GKNET with OpenACC directives. We combined the 5D loops into one and distributed the heavy calculations across the GPUs. To use multiple GPUs on one node, we called the acc_set_device_num subroutine to explicitly bind a GPU to each MPI process, a technique available in MPI-OpenACC hybrid parallelization.
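A minimal sketch of that binding step, a common MPI-OpenACC pattern rather than GKNET's actual source (it assumes ranks are numbered consecutively within a node):

program bind_gpu_demo
  use mpi
  use openacc
  implicit none
  integer :: ierr, rank, ngpus

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Query the GPUs visible on this node and bind each MPI process to
  ! one of them before any compute region runs.
  ngpus = acc_get_num_devices(acc_device_nvidia)
  call acc_set_device_num(mod(rank, ngpus), acc_device_nvidia)

  ! ... MPI-OpenACC hybrid computation proceeds here ...

  call MPI_Finalize(ierr)
end program bind_gpu_demo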

We then measured how much GKNET simulations can be accelerated by OpenACC directives on MARCONI 100 at CINECA. We verified that GPUs can reduce the total cost to 1/15 in the case of 256 CPUs + 16 GPUs. In particular, the cost of the 5D loop is reduced to 1/25 by utilizing asynchronous execution to overlap calculation and communication, so GPU acceleration works efficiently to speed up the simulation.

Single Large Job Acceleration with GPU and Processing-in-memory (PIM) Support from OpenACC

Seungwon Lee, Samsung Advanced Institute of Technology (SAIT)

In this talk, we explore current trends affecting supercomputers (TOP500) and highlight case studies of large jobs at Samsung Advanced Institute of Technology (SAIT). For Molecular Dynamics (MD) and Density Functional Theory (DFT) simulations, we use GPUs instead of CPUs to improve simulation speed by up to 300 times. In addition, we optimized a large language model (BERT) to overlap computation and communication. Using a hyper-parameter auto-tuning method, we can enlarge the batch size and achieve the best time-to-result for BERT training on MLPerf Training v1.1 and v2.0 when using 1024 NVIDIA® A100 GPUs (https://github.com/SAITPublic/). Lastly, we propose an OpenACC extension for processing-in-memory (PIM) and present results on some HPC benchmarks.

Cryo-RALib - A Modular Library for Accelerating Alignment in Cryo-EM

Szu-Chi Chung, Department of Applied Mathematics, National Sun Yat-sen University

Thanks to GPU-accelerated processing, cryo-EM has become a rapid structure-determination method that permits capturing the dynamical structures of molecules in solution, as recently demonstrated by the determination of the COVID-19 spike protein structure in March 2020, shortly after the outbreak began in late January 2020. This rapidity is critical for vaccine development in response to an emerging pandemic. Compared to the Bayesian-based 2D classification widely used in the workflow, multi-reference alignment (MRA) is less popular: it is time-consuming despite its superiority in differentiating structural variations. Interestingly, the Bayesian approach has higher complexity than MRA. We therefore reason that the Bayesian approach owes its popularity to GPU acceleration, while a modular acceleration library for MRA has been lacking.

In this talk, we introduce a library called Cryo-RALib that expands the functionality of the NVIDIA® CUDA® library used by GPU ISAC. It contains a GPU-accelerated MRA routine for accelerating MRA-based classification algorithms. In addition, we connect cryo-EM image analysis with the Python data science stack to make it easier for users to perform data analysis and visualization. Benchmarking on the TaiWan Computing Cloud (TWCC) shows that our implementation can accelerate the computation by one order of magnitude. The library is available at https://github.com/phonchi/Cryo-RAlib.

OpenACC-based Acceleration of an Immersed Boundary Method CFD Solver

Somnath Roy, IIT Kharagpur

Immersed boundary methods are designed for solving flow over complex and moving geometries in the framework of a fixed structured mesh. This method uses a separate description of the boundary as a surface mesh file immersed inside the three-dimensional fixed mesh. The first step is to identify the near-surface volume mesh points, followed by interpolation of field variables at those points to satisfy the boundary conditions.

This talk discusses parallelizing the search and interpolation steps on GPUs using the OpenACC programming model and optimizing them for better scalability. Further, we will demonstrate the optimization of the pressure-correction Poisson equation using a red-black successive over-relaxation method that avoids branch divergence. Our implementation takes advantage of a diagonal storage scheme, made possible by the structured mesh, that suits the massively parallel GPU architecture. We have reported more than 100x speed-up for several problems involving aerodynamic and cardiovascular applications. The present developments have helped us perform direct numerical simulations of moderate-Reynolds-number turbulent flows and resolve the small-scale fluctuations.
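As a hedged sketch of the idea (array names and the five-point stencil are illustrative, not the solver's code): points are colored by the parity of i+j, each half-sweep updates one color using only opposite-color neighbors, and choosing the loop start by parity keeps the kernel free of branches.

program redblack_sor_demo
  implicit none
  integer, parameter :: nx = 256, ny = 256
  real, parameter :: omega = 1.8, h = 1.0/real(nx-1)
  real :: p(nx,ny), rhs(nx,ny)
  integer :: i, j, color, iter

  p = 0.0
  rhs = 1.0

  !$acc data copy(p) copyin(rhs)
  do iter = 1, 100
     ! Red points (i+j even), then black (i+j odd). Each half-sweep is
     ! race-free and contains no branch inside the parallel loops.
     do color = 0, 1
        !$acc parallel loop
        do j = 2, ny-1
           !$acc loop
           do i = 2 + mod(j+color, 2), nx-1, 2
              p(i,j) = (1.0-omega)*p(i,j) + 0.25*omega* &
                   ( p(i-1,j) + p(i+1,j) + p(i,j-1) + p(i,j+1) &
                   - h*h*rhs(i,j) )
           end do
        end do
     end do
  end do
  !$acc end data

  print *, p(nx/2, ny/2)
end program redblack_sor_demo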

Overall, this talk will present the applicability of OpenACC-based optimizations for obtaining high-fidelity solutions in moderately low computing time.

Towards Exascale Quantum Chemistry

Giuseppe Barca, Australian National University

Quantum chemistry calculations have become strategic for the development and screening of novel materials, drugs, catalysts, and other chemicals. Central to this success has been a synergistic advancement of quantum chemistry methods and the underlying computer systems that they use.

At the dawn of the exascale computational era, as high-performance computer hardware morphs from CPU- to accelerator-based massively parallel architectures, quantum chemical methods and their underpinning algorithms and implementations must evolve accordingly to enable the modelling of larger and more diverse molecular systems.

This talk will discuss challenges and world-record achievements along the path of developing algorithms and software designed for running on the world’s most powerful supercomputers to tackle one of the grand challenges of our time: To predict the chemistry of large molecular systems.

Design of OpenACC Compiler for Fujitsu A64FX SVE

Mitsuhisa Sato, RIKEN R-CCS

The Fujitsu A64FX is a manycore processor designed for the supercomputer Fugaku. This processor supports the Arm SVE (Scalable Vector Extension) SIMD instruction set. We are working on the design of a prototype OpenACC compiler that makes use of SVE more efficiently than existing solutions such as OpenMP. While OpenACC programs can already be compiled and executed on the A64FX without modification, the compiler may be able to use the SIMD operations of each core in a GPU-like manner for more efficient execution. In this talk, the design choices and preliminary results will be presented.
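As a sketch of the idea (the A64FX mapping below is our reading of the abstract, not the prototype compiler's confirmed design), the levels of parallelism in one OpenACC loop nest can be retargeted: what becomes thread blocks and threads on a GPU could become cores and SVE SIMD lanes on the A64FX.

program sve_mapping_demo
  implicit none
  integer, parameter :: n = 1024, m = 512
  real :: a(n,m), b(n,m), c(n,m)
  integer :: i, j

  a = 1.0
  b = 2.0

  ! One loop nest, two plausible mappings:
  !   GPU:    gang -> thread block,  vector -> CUDA thread
  !   A64FX:  gang -> core,          vector -> SVE SIMD lanes
  ! The inner loop runs over the contiguous first index, which suits
  ! SIMD vectorization.
  !$acc parallel loop gang
  do j = 1, m
     !$acc loop vector
     do i = 1, n
        c(i,j) = a(i,j) + b(i,j)
     end do
  end do

  print *, c(1,1)
end program sve_mapping_demo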

Machine Learning Potentials and Paradigm Shift in Material Modeling

Seungwu Han, Seoul National University

Recently, machine-learning (ML) approaches to developing interatomic potentials have been attracting considerable attention because they are poised to overcome the major shortcomings inherent to classical potentials and density functional theory (DFT): the difficulty of potential development and the huge computational cost, respectively. In this presentation, based on SIMPLE-NN [1] for training and using neural network potentials (NNPs), we present our recent results on various material simulations: highly efficient crystal structure prediction [2], accelerated computation of thermal conductivities [3], emission spectra of quantum dots [4], electrocatalysts in fuel cells, and simulation of semiconductor processing such as etching and atomic layer deposition.

References
1. K. Lee et al., Computer Physics Communications 242, 95 (2019).
2. S. Kang, W. Jeong et al., npj Computational Materials 8, 108 (2022).
3. J. M. Choi, K. Lee et al., Computational Materials Science 211, 111472 (2022).
4. S. Kang et al., ACS Materials Au 2, 103 (2022).

Simulation of Collective Fast Neutrino Flavor Conversions

Chun-Yu Lin, National Center for High-performance Computing

Detailed dynamics of collective neutrino flavor conversion play a pivotal role in extreme astrophysical scenarios such as supernova explosions and neutron star mergers. The process is described by coupled, non-linear, complex partial differential equations (PDEs) and poses numerical and computational challenges. In this talk, we present the use of a standard finite-difference method with Kreiss-Oliger dissipation to develop a stable simulation, and we confirm that the result is consistent with other approaches, such as a finite-volume method with a seventh-order weighted essentially non-oscillatory (WENO) scheme. We also explore GPU acceleration with OpenACC directives.
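For readers unfamiliar with the technique, here is a hedged one-dimensional sketch (our illustration, not the authors' code): Kreiss-Oliger dissipation adds a small, high-order difference term to each right-hand-side evaluation to damp grid-scale noise without disturbing the resolved solution. Normalization conventions vary; the prefactor below is one common choice.

program ko_dissipation_demo
  implicit none
  integer, parameter :: n = 1024
  real, parameter :: sigma = 0.05, dx = 1.0/real(n)
  real :: u(n), rhs(n)
  integer :: i

  call random_number(u)
  rhs = 0.0

  ! Third-order Kreiss-Oliger dissipation for a second-order scheme:
  ! a centered fourth difference with a small coefficient sigma.
  !$acc parallel loop copyin(u) copy(rhs)
  do i = 3, n-2
     rhs(i) = rhs(i) - sigma/(16.0*dx) * &
          ( u(i-2) - 4.0*u(i-1) + 6.0*u(i) - 4.0*u(i+1) + u(i+2) )
  end do

  print *, maxval(abs(rhs))
end program ko_dissipation_demo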

Empowerment of Cutting-Edge Speech Technology by Natural Language Processing

Mahesh Bhargava, Centre for Development of Advanced Computing (C-DAC)

Speech technology has come a long way since 1950. Early systems were developed for digit and isolated-word recognition. Subsequently, speech recognition vocabularies grew from a few hundred words to several thousand. Speech technology has seen many paradigm shifts and is now moving from dictation to conversational interaction with the pragmatic use of natural language processing (NLP).

With recent advancements in technology, Artificial Intelligence (AI) has touched almost all aspects of life and our surroundings. AI and deep learning have disrupted the entire approach to building speech-based applications and made them more convenient for users. Hybrid (HMM-DNN) and pure neural network (end-to-end) systems use deep learning and have achieved performance beyond traditional approaches. Hence, AI-powered speech recognition technology has gained enormous importance across the globe.

Speech technology plays a vital role in many applications for fast and efficient access and control. Almost all sectors and domains, such as agriculture, healthcare, travel and tourism, banking, law enforcement, defense, education, and government, are ready to adopt this technology in various application forms. Defense and intelligence agencies can take advantage of speech technology in many ways, such as intercepting or identifying suspect activities that happen over voice conversations and determining the identity, language, and gender of a speaker from speech samples.

Accelerating Simulations of the Early Universe

Simon Mutch, University of Melbourne

Understanding the formation and evolution of the first galaxies in the Universe is a vital piece of the puzzle in understanding how all galaxies, including our own Milky Way, came to be. It's also a key aim of major international facilities such as the James Webb Space Telescope and the forthcoming Square Kilometre Array. To maximize what we can learn from observations made by these facilities, we require the ability to accurately simulate how early galaxies affected and interacted with their environments.

This talk presents work being carried out by the Genesis simulations team, part of the Australian Research Council Centre of Excellence for All-sky Astrophysics in 3D (ASTRO 3D), to tackle this complex problem using a combination of HPC simulations and modeling techniques, with a particular focus on the associated computational challenges and some ways to address them.

OpenACC-Enabled GPU-FPGA Accelerated Computing for Astrophysics Simulation

Ryohei Kobayashi, University of Tsukuba

There are a variety of accelerators available to the high-performance computing (HPC) community. The use of graphics processing units (GPUs) has become very popular owing to their good peak performance and high memory bandwidth; however, they do not work well for applications that are poorly parallelizable in parts or that communicate frequently between nodes. Field-programmable gate arrays (FPGAs) have garnered significant interest in HPC research as their computational and communication capabilities have drastically improved in recent years. GPU-FPGA coupling could be ideal for multiphysics problems, where a simulation includes various computations that are difficult to accelerate with a GPU alone.

Currently, researchers at the University of Tsukuba are working on a programming environment that enables the use of both accelerators in a GPU-FPGA-equipped HPC cluster system with OpenACC. This talk will present the outline of the programming environment, implementation, and performance evaluation of a GPU-FPGA-accelerated application for astrophysics simulations.

Data-Driven Design for High-Performance SnSe-Based Thermoelectric Material

Hyunju Chang, Korea Research Institute of Chemical Technology

There has recently been growing activity in data-driven research using machine learning (ML) in materials science. However, the accuracy of ML predictions is limited by the quality of the data the ML model is built from.

This talk discusses how we developed a web-based materials research data platform to collect data from the entire research process (synthesis, characterization, and calculations) of doped SnSe, specifically targeting high-ZT thermoelectric materials, in order to obtain a high-quality scientific dataset. From the systematically collected data, we built an ML model to predict the thermoelectric properties of doped SnSe materials. Using this predictive model, we were able to rapidly screen doping elements and proposed several dopant candidates for SnSe with high predicted ZT values. Combined with various first-principles calculations, we could further identify the best doping element for a new thermoelectric material, which was confirmed with experimental synthesis and characterization. This data-driven research, using experimental and calculated data together, provides a cutting-edge strategy for designing new materials.

GPU-Facilitated Deep Learning Approach in Medical Image Analysis

Yi-Ju Lee, Institute of Statistical Science, Academia Sinica

Statistical methodology plays a critical role in both the theoretical and practical understanding of Artificial Intelligence (AI). Intelligence-based models of natural and artificial computation are reshaping our society and promising a new era for human wellness. Implementing AI in healthcare ecosystems has created a significant impact in medical fields. However, progress in such a demand-driven trend requires a regulatory framework and access to data. The shifting research paradigms allow us to create value in the context of high-tech applications and high-contact networks. In this talk, I will share our experiences using deep learning approaches in medical studies, especially using OpenACC on our NVIDIA GPUs.

Data-Driven and Physics-Informed Machine Learning for Design and Mechanics of Materials

Alankar Alankar, Indian Institute of Technology, Bombay

This presentation will focus on the application of data-driven machine learning (ML) methods to the design of materials and of physics-informed ML to the mechanics of materials. The first part focuses on applying ML to feature extraction from available data and to the discovery of multicomponent metal alloys. In the second part, a model based on a Physics-Informed Neural Network (PINN) for solving the elastic deformation of heterogeneous solids, with associated Uncertainty Quantification (UQ), is presented.

This work is implemented in Modulus, a framework developed by NVIDIA. The PINN approximates the momentum balance, assuming isotropic linear elastic constitutive behavior, by minimizing a loss function. Along with the governing equations, the associated initial and boundary conditions also enter the loss function. Additionally, the advantages of PINNs as surrogates are explored by varying the geometry and material properties.