The GTC 2020 conference offers several opportunities to learn more about OpenACC and collaborate with your peers through a variety of talks, tutorials, posters, and meet-the-experts hangouts. This year's GTC is a digital conference, with on-demand recordings and materials posted as soon as they are available. Registration is free!
Connect with the Experts
OpenACC is a programming model designed to help scientists and developers get started with GPUs faster and work more efficiently by maintaining a single source code for multiple platforms. OpenACC experts discuss how to start accelerating your code on GPUs, continue optimizing your GPU code, start teaching OpenACC, host or participate in a hackathon, and more.
Richard Loft, Director of Technology Development, Computational and Information Systems Laboratory, National Center for Atmospheric Research
Many have speculated that combining exascale GPU computational power with machine-learning algorithms could radically improve weather and climate modeling. We'll discuss the status of an ambitious project at the U.S. National Center for Atmospheric Research that's moving in that direction. Having achieved performance portability for a standalone version of the Model for Prediction Across Scales-Atmosphere (MPAS-A) on heterogeneous CPU/GPU architectures across thousands of GPUs using OpenACC, our project has begun looking at two new directions. First, we've launched an effort to port the MOM-6 Ocean Model. Second, machine-learning scientists at NCAR and elsewhere have begun evaluating replacing atmospheric parameterizations with machine-learned emulators, including the atmospheric surface layer, cloud microphysics, and aerosol parameterizations. We'll also discuss related efforts to apply machine-learning emulation to model physics.
Fast Evaluation of Eigenvalues of the Overlap-Dirac Operator in Lattice Gauge Theory [S21219]
Naga Vijayalakshmi Vydyanathan, Architect, NVIDIA | Nilmani Mathur, Professor, Tata Institute of Fundamental Research
We'll describe how to efficiently perform numerical calculations of lattice gauge theory with the Overlap-Dirac action on a 4-dimensional Euclidean space-time lattice. Lattice gauge theory computations on large lattices are crucial for the quantitative study of the physics of strong interactions. This computationally expensive problem requires large supercomputing facilities. We'll show how to accelerate the eigenvalue computations of the Overlap-Dirac operator for large lattices on multiple GPUs using MPI, OpenACC, and CUDA, and discuss performance results on large lattices using hundreds of GPUs. This application is the property of the Indian Lattice Gauge Theory Initiative, and its computations will be used to predict the results of experiments at high-energy laboratories such as CERN (Switzerland), KEK (Japan), BES (China), FAIR (Germany), and BNL (U.S.).
Fully Exploiting a GPU Supercomputer for Seismic Imaging [S21451]
Lionel Boillot, HPC Software Development Engineer, Total SA | Long Qu, HPC Software Development Engineer, Total SA
We'll show you how we ported modern seismic applications like Reverse Time Migration, Full Waveform Inversion, and One-Way Migration to the GPU-accelerated Pangea III supercomputer. We'll explain decisive code transformations to take full advantage of the computing power brought by NVIDIA V100, as well as Power9-enhanced multi-GPU support (NVLink/GPUDirect). We'll describe different CUDA optimization techniques to achieve an asynchronous implementation entirely overlapping communications with the propagation kernels. We'll also compare OpenACC and CUDA programming models, and outline a new hybrid GPU-CPU data compression algorithm, developed with the support of NVIDIA, that vastly outperforms the CPU version.
Porting the NAS SP-MZ Benchmark to GPUs with CUDA Fortran and OpenACC [S21377]
Josh Romero, Developer Technology Engineer, NVIDIA | Massimiliano Fatica, Director, NVIDIA
Learn strategies to effectively port existing Fortran programs to GPUs using a combination of CUDA Fortran, OpenACC programming, and managed memory. We'll describe how we ported the SP-MZ solver from the NAS Parallel Benchmark suite to NVIDIA GPUs, from initial profiling to assess time-consuming routines to performance results and comparisons. We'll detail the challenges arising from the existing structure of the benchmark code and compare several approaches we tested to improve performance under these constraints. We'll also highlight features available in CUDA Fortran and OpenACC and discuss where each programming approach is best utilized.
GPU Acceleration with C++: From C++17 Parallel Algorithms to CUDA [S21777]
David Olsen, PGI C++ Compiler Engineer, NVIDIA
NVIDIA opened the GPU to general-purpose programming with the release of CUDA in 2007. The CUDA programming model has steadily evolved to expand the base of applications that can be accelerated. CUDA has also served as a platform for building higher-level interfaces such as math libraries, class library-based models like Thrust, and directive-based models like OpenACC, bringing GPU programming to an even wider audience of developers. Meanwhile, the C++ language standard has steadily evolved as well, and now includes parallel algorithms purposely designed to enable high-level and portable GPU programming in standard C++ with no extensions or directives. Learn how you can use standard C++17 to program NVIDIA GPUs, and how it builds on and interoperates with class libraries, directives, and CUDA to offer a seamless environment for incremental parallelization and optimization to reap all of the benefits of GPU computing while maximizing productivity and portability.
VASP 6 with OpenACC: Performance Tuning and New Functionality on GPUs [S21737]
Stefan Maintz, Senior Development Technology Engineer, NVIDIA
Wednesday, March 25 | 11:00 AM - 11:50 AM | Hilton Hotel San Carlos Room (Level 2/Concourse)
VASP is a software package for atomic-scale materials modeling. It's one of the most widely used codes for electronic-structure calculations and first-principles molecular dynamics. The recently released VASP 6 introduced a new port to GPUs with OpenACC. By combining this directive-based approach with NVIDIA libraries, the developers can now maintain a unified code base. This increased productivity has led to a much more complete set of GPU-accelerated features than the CUDA C-based port available in the previous version. VASP 6 even featured first-day support for methods newly implemented in this release. We'll discuss performance relative to CPUs, showcase optimization strategies we applied, and demonstrate how libraries have been integrated.
Improving Developer Productivity with Linux Heterogeneous Memory Management (HMM) [S21571]
John Hubbard, Principal SW Engineer, NVIDIA
The Heterogeneous Memory Manager (HMM) integrates support for non-conventional memories, including high-bandwidth GPU memory, into the Linux kernel for direct access from a CPU. It also enables non-CPU processors, including GPUs, to access main program memory coherently with the CPU. While the CPU and GPU are non-symmetric with respect to compute architecture, HMM presents them as fully symmetric with respect to memory architecture. For anyone who has developed applications for GPUs, the potential impact of HMM on programmability of heterogeneous systems is obvious and huge for nearly every programming model: CUDA, Thrust, OpenACC, OpenMP, SHMEM, and even standard C++ and Fortran. Realizing this potential requires mainstreaming of HMM in the Linux kernel, enhancements to CUDA to fully enable HMM from the GPU side, and enhancements to compilers and runtime systems. We'll update the status of each of these, and provide some concrete examples of the transformative effect HMM will have on software development for accelerated computing.
OpenACC-Based Acceleration of Convex Optimizations in Financial Engineering [S22152]
Yash Ukidave, Senior R&D Engineer, Millennium Management LLC | Alireza Yazdani, Senior R&D Engineer, Millennium Management
We'll walk through OpenACC-based GPU acceleration for some of the problems in financial engineering. We'll address the challenges of using OpenACC for some convex optimization problems and linear solvers, and we'll discuss the algorithmic and code-centric approach taken to achieve the best performance without losing accuracy of the solution.
Toward Industrial LES/DNS in Aeronautics: Leveraging OpenACC for Massively Parallel CPU+GPU Simulations [S21958]
David Gutzwiller, Software Engineer, head of HPC, Numeca
Wednesday, March 25 | 03:30 PM - 03:55 PM | Hilton Hotel Santa Clara Room (Level 2/Concourse)
We'll describe recent advances toward industrial LES/DNS computational fluid dynamics within the scope of the EU TILDA (Towards Industrial LES/DNS in Aeronautics) project. The TILDA project aims to complete high-fidelity industrial LES/DNS simulations with upwards of 1 billion degrees of freedom, with a turnaround time on the order of one day. Achieving this requires near-linear efficiency on massively parallel, heterogeneous CPU+GPU compute resources. We'll describe the development of FineFR, a high-order CFD solver supporting heterogeneous CPU+GPU architectures. We'll emphasize the highly tuned OpenACC implementation, allowing very efficient data locality with minimal code intrusion. Finally, we'll present benchmark data and demonstration computations from the OLCF Summit Supercomputer showing near-linear scalability on upwards of 50,000 CPU cores and 7,000 NVIDIA GPUs.
OpenACC Compilers Update: New Features and Future Directions [S22032]
Xiaonan Tian, Compiler Engineer, NVIDIA
OpenACC is the primary GPU programming model for many of the most widely-used applications in science and engineering and has been used to successfully initiate or support GPU acceleration of hundreds of applications overall. In this talk you’ll learn about the key features of OpenACC that programmers are using today for effective GPU programming, the ease-of-use and debugging features we are adding to the NVIDIA OpenACC compilers, and future directions for OpenACC. We’ll explore and explain the importance of interoperability between programming models and show how OpenACC excels as a programming model you can use where and when you need it in combination with existing models like MPI, CUDA and standard language parallelism.
More Durable, More Reliable, Better Performing: Revolutionizing the Turbomachinery Design Cycle with GPU Acceleration [S21769]
Michael Ni, CEO, ADS CFD Inc
Learn how GPU technology will impact the next generation of turbomachinery designs.
GPU-Accelerated Tabu Search and Large Neighborhood Search to Solve Vehicle Routing Problem with Time Windows [S21587]
Minhao Liu, Senior Data Scientist, WalmartLabs | Deyi Zhang, Data Scientist, Walmart Labs
Learn how various metaheuristic optimization algorithms, implemented with GPU programming, can solve large-scale vehicle routing problems with time windows. We'll introduce a Tabu Search algorithm designed to exploit the parallelism in neighborhood search, along with its OpenACC-based implementation, which applies deep copy and manages complex data types. Then we'll introduce an Adaptive Large Neighborhood Search (ALNS) algorithm in which various combinations of destroy-and-repair heuristics are adaptively applied during the optimization process to explore a substantially wider neighborhood of the solution space. We'll also describe the ALNS implementation using CUDA C and its test runs on an NVIDIA DGX Station.
Tutorials and Training
Multi-GPU Programming with Message-Passing Interface [S21067]
Jiri Kraus, Senior Developer Technology Engineer, NVIDIA
Learn how to program multi-GPU systems or GPU clusters using the message-passing interface (MPI) and OpenACC or NVIDIA CUDA. We'll start with a quick introduction to MPI and how it can be combined with OpenACC or CUDA. Then we'll cover advanced topics like CUDA-aware MPI and how to overlap communication with computation to hide communication times. We'll also cover the latest improvements with CUDA-aware MPI, interaction with unified memory, the multi-process service, and MPI support in NVIDIA performance analysis tools.
Posters
Optimizing Stencil Operations in OpenACC [P21723]
Ronald Caplan, Computational Scientist, Predictive Science Inc.
Stencil operations are used widely in HPC applications and pose an optimization challenge on both CPUs and GPUs. On GPUs, fine-tuned optimizations can be formulated using low-level APIs such as CUDA, but many large established codes prefer a portable, higher-level API such as OpenACC. Although OpenACC lacks the fine-tuning of CUDA, it does allow for some tuning through a variety of parallelization constructs and loop directives. Here, we optimize a stencil operation within our production solar physics research code Magnetohydrodynamics Around a Sphere. We explore numerous OpenACC directive options (including tile, cache, collapse, etc.) and compare their performance over several problem types and sizes. The optimal result is used to run a full-scale simulation, which is then analyzed with Nsight Systems. An emerging cautionary result is that although many directive options yield a speedup of the operator, using the "wrong" directives can severely degrade performance.
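For readers unfamiliar with the directive options mentioned in this abstract, the following is a minimal sketch and not the poster's code. It applies the `collapse` and `tile` clauses to a generic 5-point averaging stencil, which is an assumption chosen for illustration; the operator actually tuned in the poster is not described here. The function names `stencil_collapse` and `stencil_tile` are hypothetical, and the tile sizes are arbitrary starting points. A compiler without OpenACC support ignores the pragmas and runs the loops serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 5-point averaging stencil on an nx-by-ny row-major grid.
// Interior points are updated; boundary values of `out` are left unchanged.
// Variant 1: collapse(2) merges both loops into one parallel iteration space.
void stencil_collapse(const std::vector<double>& in, std::vector<double>& out,
                      int nx, int ny) {
    assert(in.size() == static_cast<std::size_t>(nx) * ny);
    assert(out.size() == in.size());
    const double* a = in.data();
    double* b = out.data();
    const int n = nx * ny;
    #pragma acc parallel loop collapse(2) copyin(a[0:n]) copy(b[0:n])
    for (int i = 1; i < nx - 1; ++i) {
        for (int j = 1; j < ny - 1; ++j) {
            b[i * ny + j] = 0.25 * (a[(i - 1) * ny + j] + a[(i + 1) * ny + j] +
                                    a[i * ny + j - 1] + a[i * ny + j + 1]);
        }
    }
}

// Variant 2: tile(...) asks the compiler to process the two nested loops in
// small blocks, which may improve data reuse. The sizes 32 and 4 are
// illustrative only and would need tuning per problem and device.
void stencil_tile(const std::vector<double>& in, std::vector<double>& out,
                  int nx, int ny) {
    assert(in.size() == static_cast<std::size_t>(nx) * ny);
    assert(out.size() == in.size());
    const double* a = in.data();
    double* b = out.data();
    const int n = nx * ny;
    #pragma acc parallel loop tile(32, 4) copyin(a[0:n]) copy(b[0:n])
    for (int i = 1; i < nx - 1; ++i) {
        for (int j = 1; j < ny - 1; ++j) {
            b[i * ny + j] = 0.25 * (a[(i - 1) * ny + j] + a[(i + 1) * ny + j] +
                                    a[i * ny + j - 1] + a[i * ny + j + 1]);
        }
    }
}
```

Both variants compute the same result, so they can be timed against each other on a given grid size. Which one is faster depends on the compiler, the device, and the problem size, which is the point the abstract makes about choosing directives carefully.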
Takuma Yamaguchi, Ph.D. Student, University of Tokyo | Tsuyoshi Ichimura, Professor, University of Tokyo | Kohei Fujita, Assistant Professor, University of Tokyo | Wijerathne Lalith, Associate Professor, University of Tokyo | Muneo Hori, Director-General [Principal Scientist], Japan Agency for Marine-Earth Science and Technology | Ryota Kusakabe, Ph.D. Student, University of Tokyo
We show that OpenACC is useful for GPU porting of a practical application with FP21-FP32-FP64 mixed-precision computing. FP21 is our custom 21-bit floating-point data type for scientific computing. Here, we ported a soil liquefaction analysis solver that was developed for manycore CPU-based computers. The OpenACC-ported solver achieved a 10.7-fold speedup over the CPU-based solver on a system where the ratio of peak FP64 FLOPS was CPU:GPU = 1:10.2. It took only three weeks for a beginning GPU programmer to port the solver to GPU.
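The abstract does not specify the bit layout of FP21. As a rough sketch under stated assumptions, the code below treats an FP21 value as the upper 21 bits of an IEEE 754 single-precision number (1 sign bit, 8 exponent bits, 12 fraction bits) and packs three such values into one 64-bit word. Both the layout and the packing are assumptions made for illustration, the conversion truncates rather than rounds, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// ASSUMPTION: FP21 = top 21 bits of a binary32 value (1 sign, 8 exponent,
// 12 fraction bits). The poster abstract does not state the layout.
constexpr int kDroppedBits = 32 - 21;  // low fraction bits discarded

// Convert FP32 to the assumed FP21 bit pattern by truncation.
inline std::uint32_t fp32_to_fp21(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return bits >> kDroppedBits;
}

// Expand an assumed FP21 bit pattern back to FP32 (low bits become zero).
inline float fp21_to_fp32(std::uint32_t v) {
    const std::uint32_t bits = v << kDroppedBits;
    float x;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

// Pack three FP21 values into one 64-bit word (63 bits used, 1 unused).
inline std::uint64_t pack_fp21x3(float a, float b, float c) {
    return (static_cast<std::uint64_t>(fp32_to_fp21(a)) << 42) |
           (static_cast<std::uint64_t>(fp32_to_fp21(b)) << 21) |
           static_cast<std::uint64_t>(fp32_to_fp21(c));
}

// Extract element i (0, 1, or 2) from a packed word as FP32.
inline float unpack_fp21(std::uint64_t word, int i) {
    assert(i >= 0 && i < 3);
    const std::uint32_t v =
        static_cast<std::uint32_t>((word >> (42 - 21 * i)) & 0x1FFFFFu);
    return fp21_to_fp32(v);
}
```

Under these assumptions, values are stored in the reduced format to save memory and bandwidth and expanded to FP32 before arithmetic. Values whose low 11 fraction bits are zero, such as 1.0 or -2.5, round-trip exactly, while others lose up to about one part in 4096.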
Ruth Falconer, Head of Division, Abertay University | Alasdair Houston, Freelance Developer, N/A
Interested in characterizing spatial structure inherent in 3D scalar fields from simulated or imaged data? This poster presents an accelerated solution of the widely-used Minkowski functionals, using both OpenACC and CUDA on commodity GPUs. We'll present methods to minimize the memory footprint, and hence reduce data-transfer costs. Based on measurement frequency, an OpenACC rather than a CUDA solution might be appropriate. Next steps highlight the additional methods to further enhance and fine-tune the performance of the CUDA solution. Minkowski functionals have been widely applied in cosmology, material science, engineering, microbial ecology, and health care.
Samuel Elliott, Director of NWP Solutions, TempoQuest Inc.
The Weather Research and Forecasting (WRF) model is an open-source, mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. It is the most widely used regional weather forecasting model and a top 5 HPC application worldwide. TQI has implemented an OpenACC/CUDA-based version of WRF to take advantage of NVIDIA GPUs. By utilizing GPUs, the measured performance benefits enable better forecasting through higher resolution, temporal/geographical extents, and so on. Our poster discusses the GPU implementation of the model as well as performance benchmarks demonstrating the model’s practical performance benefits. TQI has also developed a cloud-based solution for running end-to-end AceCAST GPU-WRF workflows on AWS. This provides a solution for a wide range of users who would otherwise not have access to GPU-based compute resources, and automates a highly complex process that is a significant barrier for researchers and operational weather forecasters.
Jingcheng Shen, Ph.D. Student, Osaka University
The pipelined accelerator (PACC) helps lower the hurdle for implementing out-of-core stencil computation, such as large seismic simulations. However, the out-of-core applications themselves are facing data-movement bottlenecks because improvement of accelerators dwarfs that of interconnects. In this poster, we introduce temporal blocking techniques to reuse on-chip data, and propose a data-mapping scheme to eliminate data movement on the host side. The data-mapping scheme accelerates the program by 2.5x compared to previous work and by 35x compared to an OpenMP-based program. However, performance is still bound by data movement between the host and device, so we need to come up with further data-centric strategies. Moreover, the degradation in execution time of PACC code is about 25% compared to an in-core OpenACC code, which should be reduced if we can design further data-centric strategies.