GPU Technology Conference 2020

The GTC 2020 conference offers several opportunities to learn more about OpenACC and collaborate with your peers through talks, tutorials, posters, and meet-the-experts hangouts. This year's GTC is a digital conference, with on-demand recordings and materials posted as soon as they are available. Registration is FREE!

Connect with the Experts

OpenACC is a programming model designed to help scientists and developers get started with GPUs faster and work more efficiently by maintaining a single source code base for multiple platforms. OpenACC experts discuss how to start accelerating your code on GPUs, continue optimizing your GPU code, start teaching OpenACC, host or participate in a hackathon, and more.

Talks

Toward an Exascale Earth System Model with Machine Learning Components: An Update [S21834]

Richard Loft, Director of Technology Development, Computational and Information Systems Laboratory, National Center for Atmospheric Research

Many have speculated that combining exascale GPU computational power with machine-learning algorithms could radically improve weather and climate modeling. We'll discuss the status of an ambitious project at the U.S. National Center for Atmospheric Research that's moving in that direction. Having achieved performance portability for a standalone version of the Model for Prediction Across Scales-Atmosphere (MPAS-A) on heterogeneous CPU/GPU architectures across thousands of GPUs using OpenACC, our project has begun looking at two new directions. First, we've launched an effort to port the MOM-6 Ocean Model. Second, machine-learning scientists at NCAR and elsewhere have begun evaluating machine-learned emulators as replacements for atmospheric parameterizations, including the atmospheric surface layer, cloud microphysics, and aerosol parameterizations. We'll also discuss related efforts to apply machine-learning emulation to model physics.

Fully Exploiting a GPU Supercomputer for Seismic Imaging [S21451]

Lionel Boillot, HPC Software Development Engineer, Total SA | Long Qu, HPC Software Development Engineer, Total SA

We'll show you how we ported modern seismic applications like Reverse Time Migration, Full Wave Inversion, and One-Way Migration to the GPU-accelerated Pangea III supercomputer. We'll explain decisive code transformations to take full advantage of the computing power brought by NVIDIA V100, as well as Power9-enhanced multi-GPU support (NVLink/GPUDirect). We'll describe different CUDA optimization techniques to achieve an asynchronous implementation entirely overlapping communications with the propagation kernels. We'll also compare OpenACC and CUDA programming models, and outline a new hybrid GPU-CPU data compression algorithm, developed with the support of NVIDIA, that vastly outperforms the CPU version.

Toward Industrial LES/DNS in Aeronautics: Leveraging OpenACC for Massively Parallel CPU+GPU Simulations [S21958]

David Gutzwiller, Software Engineer, Head of HPC, Numeca

We'll describe recent advances toward industrial LES/DNS computational fluid dynamics within the scope of the EU TILDA (Towards Industrial LES/DNS in Aeronautics) project. The TILDA project aims to complete high-fidelity industrial LES/DNS simulations with upwards of 1 billion degrees of freedom, with a turnaround time on the order of one day. Achieving this requires near-linear efficiency on massively parallel, heterogeneous CPU+GPU compute resources. We'll describe the development of FineFR, a high-order CFD solver supporting heterogeneous CPU+GPU architectures. We'll emphasize the highly tuned OpenACC implementation, allowing very efficient data locality with minimal code intrusion. Finally, we'll present benchmark data and demonstration computations from the OLCF Summit Supercomputer showing near-linear scalability on upwards of 50,000 CPU cores and 7,000 NVIDIA GPUs.

GPU-Accelerated Tabu Search and Large Neighborhood Search to Solve Vehicle Routing Problem with Time Windows [S21587]

Minhao Liu, Senior Data Scientist, Walmart Labs | Deyi Zhang, Data Scientist, Walmart Labs

Learn how GPU programming can be used with metaheuristic optimization algorithms to solve large-scale vehicle routing problems with time windows. We'll introduce a Tabu Search algorithm designed to exploit the parallelism in neighborhood search, along with its OpenACC-based implementation, which applies deep copy and manages complex data types. Then we'll introduce an Adaptive Large Neighborhood Search (ALNS) algorithm in which various combinations of destroy-and-repair heuristics are adaptively applied during the optimization process to explore a substantially wider neighborhood of the solution space. We'll also describe the ALNS implementation in CUDA C and its test runs on an NVIDIA DGX Station.

Tutorials and Training

Multi-GPU Programming with Message-Passing Interface [S21067]

Jiri Kraus, Senior Developer Technology Engineer, NVIDIA

Learn how to program multi-GPU systems or GPU clusters using the message-passing interface (MPI) and OpenACC or NVIDIA CUDA. We'll start with a quick introduction to MPI and how it can be combined with OpenACC or CUDA. Then we'll cover advanced topics like CUDA-aware MPI and how to overlap communication with computation to hide communication times. We'll also cover the latest improvements with CUDA-aware MPI, interaction with unified memory, the multi-process service, and MPI support in NVIDIA performance analysis tools.

Posters

Optimizing Stencil Operations in OpenACC [P21723]

Ronald Caplan, Computational Scientist, Predictive Science Inc.

Stencil operations are used widely in HPC applications and pose an optimization challenge on both CPUs and GPUs. On GPUs, fine-tuned optimizations can be formulated using low-level APIs such as CUDA, but many large established codes prefer a portable, higher-level API such as OpenACC. Although OpenACC lacks the fine-tuning of CUDA, it does allow for some tuning through a variety of parallelization constructs and loop directives. Here, we optimize a stencil operation within our production solar physics research code Magnetohydrodynamics Around a Sphere. We explore numerous OpenACC directive options (including tile, cache, collapse, etc.) and compare their performance over several problem types and sizes. The optimal configuration is used to run a full-scale simulation and is analyzed with Nsight Systems. An emerging cautionary result is that although many directive options yield a speedup of the operator, using the "wrong" directives can result in drastically worse performance.
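As a rough illustration of the kind of directive choices the poster compares, the following C sketch writes the same loop nest twice, once with the OpenACC collapse clause and once with the tile clause. The 5-point averaging stencil is hypothetical and is not the MAS operator; the function names, grid layout, data clauses, and tile sizes are assumptions made for illustration only. Which variant is faster depends on the compiler, the GPU, and the problem size.

```c
#include <assert.h>

/* Hypothetical 5-point averaging stencil on an ny-by-nx row-major grid.
   Boundary cells of `out` are left untouched by both variants. */

/* Variant 1: fuse both loops into one iteration space with collapse(2). */
void stencil_collapse(const double *restrict in, double *restrict out,
                      int ny, int nx)
{
    assert(ny >= 3 && nx >= 3);
    #pragma acc parallel loop collapse(2) copyin(in[0:ny*nx]) copy(out[0:ny*nx])
    for (int j = 1; j < ny - 1; ++j) {
        for (int i = 1; i < nx - 1; ++i) {
            out[j * nx + i] = 0.25 * (in[j * nx + i - 1] + in[j * nx + i + 1]
                                    + in[(j - 1) * nx + i] + in[(j + 1) * nx + i]);
        }
    }
}

/* Variant 2: same loop nest, decomposed into tiles.
   The tile sizes (32, 4) are arbitrary placeholders, not tuned values. */
void stencil_tile(const double *restrict in, double *restrict out,
                  int ny, int nx)
{
    assert(ny >= 3 && nx >= 3);
    #pragma acc parallel loop tile(32, 4) copyin(in[0:ny*nx]) copy(out[0:ny*nx])
    for (int j = 1; j < ny - 1; ++j) {
        for (int i = 1; i < nx - 1; ++i) {
            out[j * nx + i] = 0.25 * (in[j * nx + i - 1] + in[j * nx + i + 1]
                                    + in[(j - 1) * nx + i] + in[(j + 1) * nx + i]);
        }
    }
}
```

A compiler without OpenACC support ignores the pragmas, so both functions run serially and produce identical results. With an OpenACC compiler, the two variants can be timed against each other on the target GPU.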

Low-Cost OpenACC Porting of Matrix Solver with FP21-FP32-FP64 Computing: an Earthquake Application [P21886]

Takuma Yamaguchi, Ph.D. Student, University of Tokyo | Tsuyoshi Ichimura, Professor, University of Tokyo | Kohei Fujita, Assistant Professor, University of Tokyo | Wijerathne Lalith, Associate Professor, University of Tokyo | Muneo Hori, Director-General [Principal Scientist], Japan Agency for Marine-Earth Science and Technology | Ryota Kusakabe, Ph.D. Student, University of Tokyo

We show that OpenACC is useful for GPU porting of a practical application with FP21-FP32-FP64 mixed-precision computing. FP21 is our custom 21-bit floating-point data type for scientific computing. Here, we ported a soil liquefaction analysis solver that was developed for manycore CPU-based computers. The OpenACC-ported solver achieved a 10.7-fold speedup over the CPU-based solver on a system where the ratio of peak FP64 FLOPS was CPU:GPU = 1:10.2. It took only three weeks for a beginning GPU programmer to port the solver to GPU.
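The abstract does not spell out the FP21 layout. The sketch below assumes 1 sign bit, 8 exponent bits, and 12 fraction bits, that is, the top 21 bits of an IEEE 754 FP32 value, with conversion by truncation and three values packed into one 64-bit word. These layout details and all function names are assumptions for illustration, not a description of the authors' implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed FP21 layout: the upper 21 bits of an FP32 bit pattern
   (1 sign, 8 exponent, 12 fraction). Special values such as NaNs whose
   payload sits only in the dropped bits are not handled here. */

/* FP32 -> FP21 by dropping the 11 low fraction bits (rounds toward zero). */
static uint32_t fp32_to_fp21(float x)
{
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &x, sizeof bits);
    return bits >> 11;
}

/* FP21 -> FP32 by restoring the dropped bits as zeros. */
static float fp21_to_fp32(uint32_t h)
{
    uint32_t bits = (h & 0x1FFFFFu) << 11;
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Pack three FP21 values into 63 bits of a 64-bit word. */
uint64_t pack3(float a, float b, float c)
{
    return ((uint64_t)fp32_to_fp21(a) << 42)
         | ((uint64_t)fp32_to_fp21(b) << 21)
         |  (uint64_t)fp32_to_fp21(c);
}

/* Unpack the three values back to FP32 for computation. */
void unpack3(uint64_t w, float out[3])
{
    out[0] = fp21_to_fp32((uint32_t)(w >> 42));
    out[1] = fp21_to_fp32((uint32_t)(w >> 21));
    out[2] = fp21_to_fp32((uint32_t)w);
}
```

Under these assumptions, storage and memory traffic drop to one third of FP64 for the packed data, while arithmetic is still done in FP32 or FP64 after unpacking. The truncation introduces a relative error of less than 2^-12 per stored value.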

Runtime Analysis of Spatial Structure: A CUDA Implementation of Minkowski Functionals [P21829]

Ruth Falconer, Head of Division, Abertay University | Alasdair Houston, Freelance Developer, N/A

Interested in characterizing spatial structure inherent in 3D scalar fields from simulated or imaged data? This poster presents an accelerated solution of the widely used Minkowski functionals, using both OpenACC and CUDA on commodity GPUs. We'll present methods to minimize the memory footprint, and hence reduce data-transfer costs. Based on measurement frequency, an OpenACC rather than a CUDA solution might be appropriate. Next steps highlight the additional methods to further enhance and fine-tune the performance of the CUDA solution. Minkowski functionals have been widely applied in cosmology, material science, engineering, microbial ecology, and health care.

AceCAST GPU-Enabled Weather Research and Forecasting Model Development and Applications [P22064]

Samuel Elliott, Director of NWP Solutions, TempoQuest Inc.

The Weather Research and Forecasting (WRF) model is an open-source, mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. It is the most widely used regional weather forecasting model and a top-five HPC application worldwide. TQI has implemented an OpenACC/CUDA-based version of WRF to take advantage of NVIDIA GPUs. The measured performance benefits of using GPUs enable better forecasting through higher resolution and larger temporal and geographical extents. Our poster discusses the GPU implementation of the model as well as performance benchmarks demonstrating the model’s practical performance benefits. TQI has also developed a cloud-based solution for running end-to-end AceCAST GPU-WRF workflows on AWS. This provides a solution for a wide range of users who would otherwise not have access to GPU-based compute resources, and automates a highly complex process that is a significant barrier for researchers and operational weather forecasters.

Accelerating Large Seismic Simulation Code with PACC Framework [P21901]

Jingcheng Shen, Ph.D. Student, Osaka University

The pipelined accelerator (PACC) framework helps lower the hurdle for implementing out-of-core stencil computations, such as large seismic simulations. However, out-of-core applications face data-movement bottlenecks because accelerator performance is improving much faster than interconnect performance. In this poster, we introduce temporal blocking techniques to reuse on-chip data, and propose a data-mapping scheme to eliminate data movement on the host side. The data-mapping scheme accelerates the program by 2.5x compared to previous work and by 35x compared to an OpenMP-based program. Performance is still bound by data movement between the host and device, and the execution time of the PACC code is about 25% longer than that of an in-core OpenACC code, a gap that further data-centric strategies should reduce.
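To make the idea of temporal blocking concrete, here is a minimal C sketch on a hypothetical 1D 3-point averaging stencil with fixed end values. It is not PACC code and does not use the PACC directives; the function names, the stencil, and the OpenACC data clauses are assumptions for illustration. Each chunk is loaded once with a halo of T cells per side, advanced T time steps while resident, and written back once, so data crosses the host-device link once per T steps instead of once per step.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Reference: T sweeps over the whole array, end values held fixed. */
void sweep_reference(double *u, int n, int T)
{
    double *tmp = malloc((size_t)n * sizeof *tmp);
    assert(tmp != NULL);
    for (int t = 0; t < T; ++t) {
        for (int i = 1; i < n - 1; ++i)
            tmp[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0;
        for (int i = 1; i < n - 1; ++i)
            u[i] = tmp[i];
    }
    free(tmp);
}

/* Temporal blocking: process `chunk` output cells at a time, with a halo
   of T cells per side so that T steps can run on the resident buffer. */
void sweep_blocked(const double *u, double *out, int n, int T, int chunk)
{
    assert(n >= 3 && T >= 1 && chunk >= 1);
    double *buf = malloc((size_t)(chunk + 2 * T) * sizeof *buf);
    double *tmp = malloc((size_t)(chunk + 2 * T) * sizeof *tmp);
    assert(buf != NULL && tmp != NULL);
    out[0] = u[0];
    out[n - 1] = u[n - 1];
    for (int s = 1; s < n - 1; s += chunk) {
        int e  = (s + chunk < n - 1) ? s + chunk : n - 1; /* outputs [s, e) */
        int lo = (s - T > 0) ? s - T : 0;                 /* halo, clamped  */
        int hi = (e + T < n) ? e + T : n;
        int m  = hi - lo;
        memcpy(buf, u + lo, (size_t)m * sizeof *buf);
        /* One transfer in and one transfer out per chunk (assumed clauses). */
        #pragma acc data copy(buf[0:m]) create(tmp[0:m])
        {
            for (int t = 0; t < T; ++t) {
                #pragma acc parallel loop present(buf[0:m], tmp[0:m])
                for (int i = 1; i < m - 1; ++i)
                    tmp[i] = (buf[i - 1] + buf[i] + buf[i + 1]) / 3.0;
                #pragma acc parallel loop present(buf[0:m], tmp[0:m])
                for (int i = 1; i < m - 1; ++i)
                    buf[i] = tmp[i];
            }
        }
        /* Halo cells are stale after T steps; keep only the chunk interior. */
        memcpy(out + s, buf + (s - lo), (size_t)(e - s) * sizeof *out);
    }
    free(buf);
    free(tmp);
}
```

The trade-off in this sketch is redundant computation in the halo cells in exchange for fewer transfers. A larger T reduces traffic further but widens the halo, so the best value depends on the ratio of compute throughput to interconnect bandwidth.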