GPU Technology Conference 2018


The GTC 2018 conference offers several opportunities to learn about OpenACC and collaborate with your fellow OpenACC users through a variety of talks, tutorials, posters, and meet-the-experts hangouts. In addition, you're invited to socialize with others interested in OpenACC at the OpenACC User Group meeting on Tuesday night.

User Group Meeting

March 27th, 7:30-9:30PM, San Jose CA, Mosaic Restaurant.

The OpenACC User Group meets a few times a year during key HPC events to discuss training, provide feedback on the specification, collaborate on OpenACC-related research and activities, share experiences and best practices, and have a good time with great company! Join us March 27th at GTC18; food and drinks are on us. Seating is limited, so please register in advance to attend.

Connect with the Experts

Monday, March 26, 1 - 2PM
Tuesday, March 27, 1 - 2PM
Wednesday, March 28, 11AM - 12PM
Thursday, March 29, 3 - 4PM
This session is designed for anyone who is either looking to start with GPUs or already accelerating their code with OpenACC on GPUs or CPUs. Join OpenACC experts and your fellow OpenACC developers to get expert advice, discuss your code, and learn how OpenACC directives are used by others.

Book Signing

Tuesday, March 27, 6PM, GTC Bookstore
The new OpenACC textbook, co-edited by Sunita Chandrasekaran and Guido Juckeland, will be available at the GTC bookstore at a discounted price, and Guido and Sunita will sign it for you.

Talks

S8709 - Accelerating Molecular Modeling Tasks on Desktop and Pre-Exascale Supercomputers

Monday, Mar 26, 4:00 PM - 4:50 PM – Hilton San Carlos

This talk will showcase recent successes in the use of GPUs to accelerate challenging molecular simulation analysis tasks on the latest Volta-based Tesla V100 GPUs on both Intel and IBM/OpenPOWER hardware platforms, and with large scale runs on petascale computers such as ORNL Summit. This presentation will highlight the performance benefits obtained from die-stacked memory on Tesla V100, the NVLink interconnect on the IBM OpenPOWER platforms, and the use of advanced features of CUDA, Volta's new Tensor units, and just-in-time (JIT) compilation to increase the performance of key analysis algorithms. We will present results obtained with OpenACC parallel programming directives, current challenges, and future opportunities. Finally, we will describe GPU-accelerated machine learning algorithms for tasks such as clustering of structures resulting from molecular dynamics simulations.

S8291 - Acceleration of a Computational Fluid Dynamics Code with GPU Using OpenACC

Wednesday March 28, 10 - 10:25AM, Room 212A

The goal of this session is to report the knowledge acquired at the Oak Ridge GPU Hackathon that took place on October 9th-13th 2017, through the acceleration of a CFD (Computational Fluid Dynamics) solver. The presentation will focus on the approach used to make the application suitable for GPU, the acceleration obtained, and the overall experience at the Hackathon. OpenACC was used to implement GPU directives in this work. The presentation will detail the different OpenACC directives used, their advantages and disadvantages, as well as the particularities of CFD applications.

S8926 - ORNL Summit: Accelerated Simulations of Stellar Explosions with FLASH: Towards Exascale Capability

Monday, Mar 26, 3:30 PM - 3:55 PM – Hilton Almaden 2

Multiphysics and multiscale simulations are found in a variety of computational science subfields, but their disparate computational characteristics can make GPU implementations complex and often difficult. Simulations of supernovae are ideal examples of this complexity. We use the scalable FLASH code to model these astrophysical cataclysms, incorporating hydrodynamics, thermonuclear kinetics, and self-gravity across considerable spans in space and time. Using OpenACC and GPU-enabled libraries coupled to new NVIDIA GPU hardware capabilities, we have improved the physical fidelity of these simulations by increasing the number of evolved nuclear species by more than an order of magnitude. I will discuss these and other performance improvements to the FLASH code on the Summit supercomputer at Oak Ridge National Laboratory.

S8848 - Adapting Minisweep, a Proxy Application, on Heterogeneous Systems Using OpenACC Directives

Tuesday, Mar 27, 5:00 PM - 5:25 PM – Room 211B

Learn how OpenACC, the widely popular, high-level, directive-based programming model, can help port radiation transport scientific codes to large-scale heterogeneous systems consisting of state-of-the-art accelerators such as GPUs. Architectures are rapidly evolving, and exascale machines are expected to offer billion-way concurrency. We need to rethink algorithms, languages, and programming models, among other components, to increase parallelism from a programming standpoint and migrate large-scale applications to these massively powerful platforms. This talk will discuss programming challenges and their corresponding solutions for porting a wavefront-based miniapplication for Denovo, a production code for nuclear reactor modeling, using OpenACC. Our OpenACC implementation running on NVIDIA's next-generation Volta GPU achieves an 85.06x speedup over serial code, which is larger than CUDA's 83.72x speedup over the same serial implementation.

S8811 - An Agile Approach to Building a GPU-enabled and Performance-portable Global Cloud-resolving Atmospheric Model

Monday, Mar 26, 3:00 PM - 3:25 PM – Hilton Santa Clara

We will give a high-level overview of the results of these efforts, and how we built a cross-organizational partnership to achieve them. Ours is a directive-based approach using OpenMP and OpenACC to achieve portability. We have focused on achieving good performance on three main architectural branches available to us, namely: traditional multi-core processors (e.g. Intel Xeons), many-core processors like the Intel Xeon Phi, and of course NVIDIA GPUs. Our focus has been on creating tools for accelerating the optimization process, techniques for effective cross-platform optimization, and methodologies for characterizing and understanding performance. The results are encouraging, suggesting a path forward based on standard directives for responding to the pressures of future architectures.

S8637 - Analysis of Performance Gap Between OpenACC and the Native Approach on P100 GPU and SW26010: A Case Study with GTC-P

Tuesday, Mar 27, 2:00 PM - 2:25 PM – Room 211B

We'll present our experience using OpenACC to port GTC-P, a real-world plasma turbulence simulation, to the NVIDIA P100 GPU and to SW26010, the Chinese home-grown many-core processor. We also developed the GTC-P code with the native approach on the Sunway TaihuLight supercomputer so that we can analyze the performance gap between OpenACC and the native approach on both the P100 GPU and SW26010. The experimental results show that the performance gap between OpenACC and CUDA on the P100 GPU is less than 10% with the PGI compiler. However, the gap on SW26010 is more than 50%, since register-level communication (RLC), which is supported only by the native approach, can avoid low-efficiency main memory access. Our case study demonstrates that OpenACC can deliver impressively portable performance on the P100 GPU, but the lack of a software cache via RLC in the OpenACC compiler on SW26010 results in a large performance gap between OpenACC and the native approach.

S8800 - A Novel Mapped Grid Approach for GPU Acceleration of High-Order Structured Grid CFD Solvers

Tuesday, Mar 27, 4:00 PM - 4:25 PM – Room 211A

We'll present use of state-of-the-art computational fluid dynamics algorithms and their performance on NVIDIA GPUs, including the new DGX-1 Station using multiple Tesla V100 GPU accelerators. A novel mapped grid approach to implementing high-order stencil based finite-difference and finite-volume methods is the highlight, but we'll also feature the use of flux-reconstruction on GPU using OpenACC.

S8188 - Application of OpenACC to Computer Aided Drug Discovery software suite "Sanjeevini"

Monday March 26, 2 - 2:25PM, Hilton San Carlos

In this session we demonstrate the features and capabilities of OpenACC for porting and optimizing the ParDOCK docking module of the Sanjeevini suite for Computer Aided Drug Discovery, developed at SCFBio, IIT Delhi, India. We used OpenACC to efficiently port the existing C++ ParDOCK code, with minimal modifications, to run on the latest NVIDIA P100 GPU card. With these code modifications and tuning, an average 6x improvement in turnaround time was achieved. With OpenACC, the code can now sample 10 times more ligand conformations, leading to an increase in accuracy. The OpenACC-ported ParDOCK code now predicts the correct pose of a protein-ligand interaction 96.8% of the time, compared to 94.3% earlier (for poses under 1 Å), and 89.9% of the time, compared to 86.7% earlier (for poses under 0.5 Å).

S8708 - Immersed Boundary Solver Parallelization using OpenACC

Wednesday, Mar 28, 11:30 AM - 11:55 AM – Room 212A

Multi-physics flow problems like Fluid-Structure Interaction (FSI) involve complex interaction physics and require the solution of non-linear partial differential equations. Efficient numerical solvers are extremely useful tools for researchers studying multi-physics interaction behavior. The advent of parallel algorithms and high performance computing has further revolutionized the field of computational engineering. It is therefore important to accelerate legacy solvers using state-of-the-art parallelization techniques. In the present work, a discrete finite-difference-based Immersed Boundary (IB) solver is optimized to efficiently study external or internal flow behavior around complex geometries at low Reynolds number. Performance enhancement is required in the computationally heaviest components of the solver, i.e., tagging of the intercepted cells and solving the continuity and momentum equations. Computational efficiency is improved by using the OpenACC programming standard for parallel computing on Graphics Processing Units (GPUs) and by using different iterative solvers for the velocity-pressure correction equation.

S8805 - Managing Memory of Complex Aggregate Data Structures in OpenACC

Monday, Mar 26, 11:30 AM - 11:55 AM – Grand Ballroom 220C

It is extremely challenging to move data between host and device memories when an application makes heavy use of deeply nested, complex aggregate data structures. This talk dives into VASP, ICON, and other real-world applications to show how the deep copy issue is solved in them with the PGI compiler and OpenACC APIs. The OpenACC 2.6 specification includes directives and rules that enable programmer-controlled manual deep copy, albeit in a form that can be intrusive in terms of the number of directives required. The OpenACC committee is designing new directives to extend explicit data management to aggregate data structures in a form that is more elegant and concise. The talk will also compare unified memory, manual deep copy, full deep copy, and true deep copy.
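To illustrate what manual deep copy looks like in practice, here is a minimal sketch in C. It is not taken from VASP, ICON, or the talk itself. The type `vec_t` and the function `scale_on_device` are hypothetical names chosen for illustration, and the sketch assumes a compiler that implements the OpenACC 2.6 pointer-attach behavior. Without OpenACC support the pragmas are ignored and the loop simply runs on the host.

```c
#include <assert.h>

/* Hypothetical aggregate type: a struct that owns a dynamically sized array. */
typedef struct {
    int n;
    double *data;
} vec_t;

/* Scale every element of v->data by s, managing the device copies by hand. */
void scale_on_device(vec_t *v, double s)
{
    assert(v != 0 && v->data != 0 && v->n >= 0);
    int n = v->n;

    /* Step 1: shallow copy of the struct itself (n and the raw pointer). */
    #pragma acc enter data copyin(v[0:1])
    /* Step 2: copy the pointed-to array. Assumption: with OpenACC 2.6
       semantics the device copy of v->data is attached to the device
       copy of the struct because the parent is already present. */
    #pragma acc enter data copyin(v->data[0:n])

    #pragma acc parallel loop present(v[0:1], v->data[0:n])
    for (int i = 0; i < n; ++i) {
        v->data[i] *= s;
    }

    /* Tear down in reverse order: member first, then the parent struct. */
    #pragma acc exit data copyout(v->data[0:n])
    #pragma acc exit data delete(v[0:1])
}
```

Even this single-member struct needs four data directives, which hints at why the approach is described as intrusive for deeply nested types.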

S8351 - MultiGPU Made Easy by OmpSs + CUDA/OpenACC

Wednesday, Mar 28, 3:00 PM - 3:25 PM – Grand Ballroom 220C

While OpenACC focuses on coding productivity and portability, CUDA enables extracting the maximum performance from NVIDIA GPUs. OmpSs, on the other hand, is a GPU-aware task-based programming model which may be combined with CUDA, and recently with OpenACC as well. Using OpenACC we will start benefiting from GPU computing, obtaining great coding productivity and nice performance improvements. We can next fine-tune the critical application parts developing CUDA kernels to hand-optimize the problem. OmpSs combined with either OpenACC or CUDA will enable seamless task parallelism leveraging all system devices.

S8314 - Multi GPU Programming with MPI

Monday, Mar 26, 11:00 AM - 11:50 AM – Hilton Almaden 2

Learn how to program multi-GPU systems or GPU clusters using the message passing interface (MPI) and OpenACC or NVIDIA CUDA. We'll start with a quick introduction to MPI and how it can be combined with OpenACC or CUDA. Then we'll cover advanced topics like CUDA-aware MPI and how to overlap communication with computation to hide communication times. We'll also cover the latest improvements with CUDA-aware MPI, interaction with unified memory, the multi-process service (MPS, aka Hyper-Q for MPI), and MPI support in NVIDIA performance analysis tools.

S8373 - MVAPICH2-GDR: Pushing the Frontier of Designing MPI Libraries Enabling GPUDirect Technologies

Wednesday, Mar 28, 2:00 PM - 2:50 PM – Room 211B

Learn about the latest developments in the high-performance message passing interface (MPI) over InfiniBand, iWARP, and RoCE (MVAPICH2) library that simplify the task of porting MPI applications to supercomputing clusters with NVIDIA GPUs. MVAPICH2 supports MPI communication directly from GPU device memory and optimizes it using various features offered by the CUDA toolkit, providing optimized performance on different GPU node configurations. These optimizations are integrated transparently under the standard MPI API, for better programmability. Recent advances in MVAPICH2 include designs for MPI-3 RMA using GPUDirect RDMA, a framework for MPI datatype processing using CUDA kernels, support for GPUDirect Async, support for heterogeneous clusters with GPU and non-GPU nodes, and more. We use the popular Ohio State University micro-benchmark suite and example applications to demonstrate how developers can effectively take advantage of MVAPICH2 in applications using MPI and CUDA/OpenACC. We provide guidance on issues like processor affinity to GPU and network that can significantly affect the performance of MPI applications that use MVAPICH2.

S8799 - On Porting Scalable Parallel CFD Application HiFUN on NVIDIA GPU

Wednesday, Mar 28, 10:30 AM - 10:55 AM – Room 212A

The present study deals with porting the scalable parallel CFD application HiFUN to NVIDIA Graphics Processing Units (GPUs) using an off-load strategy. The strategy focuses on improving single-node performance of the HiFUN solver with the help of GPUs. This work clearly brings out the efficacy of the off-load strategy using OpenACC directives on GPUs, and it may be considered one of the attractive models for porting legacy CFD codes to GPU-based supercomputing platforms.

S8190 - Performance Optimization for Scientific Applications

Wednesday March 28, 4-4:50PM, Room 201B

We'll take you on a journey through enabling applications for GPUs; interoperability of different languages (including Fortran, OpenACC, C, and CUDA); CUDA library interfacing; data management, movement, and layout tuning; kernel optimization; tool usage; multi-GPU data transfer; and performance modeling. We'll show how careful optimizations can have a dramatic effect and push application performance towards the maximum possible on the hardware. We'll describe tuning of multi-GPU communications, including efficient exploitation of high-bandwidth NVLink hardware. The applications used in this study are from the domain of numerical weather prediction, and also feature in the ESCAPE European collaborative project, but we'll present widely relevant techniques in a generic and easily transferable way.

S8750 - Porting VASP to GPUs with OpenACC

Monday, Mar 26, 10:00 AM - 10:50 AM – Hilton San Carlos

VASP is a software package for atomic-scale materials modeling. It's one of the most widely used codes for electronic-structure calculations and first-principles molecular dynamics. We'll give an overview and status of porting VASP to GPUs with OpenACC. Parts of VASP were previously ported to CUDA C with good speed-ups on GPUs, but also with an increase in the maintenance workload as VASP is otherwise written wholly in Fortran. We'll discuss OpenACC performance relative to CUDA, the impact of OpenACC on VASP code maintenance, and challenges encountered in the port related to management of aggregate data structures. Finally, we'll discuss possible future solutions for data management that would simplify both new development and maintenance of VASP and similar large production applications on GPUs.

S8847 - Solar Storm Modeling using OpenACC: From HPC Cluster to "In-House"

Tuesday, Mar 27, 2:30 PM - 2:55 PM – Room 211B

We explore using OpenACC to migrate applications required for modeling solar storms from CPU HPC clusters to an "in-house" multi-GPU system. We describe the software pipeline and the utilization of OpenACC in the computationally heavy codes. A major step forward is the initial implementation of OpenACC in our magnetohydrodynamics code MAS. Strategies for overcoming some of the difficulties encountered are discussed, including handling Fortran derived types, array reductions, and performance tuning. Production-level "time-to-solution" results will be shown for multi-CPU and multi-GPU systems of various sizes. The timings show that it is possible to achieve an acceptable "time-to-solution" on a single multi-GPU server/workstation for problems that previously required multiple HPC CPU nodes.

S8241 - Sunny Skies Ahead! Versioning GPU accelerated WRF to 3.7.1

Monday March 26, 4-4:25PM, Hilton Santa Clara

This talk details the inherent challenges in porting a GPU-accelerated community code (WRF) to a newer major version, integrating the community non-GPU changes with OpenACC directives from the earlier version. This is a non-trivial exercise - this particular version upgrade contained 143,000 modified lines of code which required reintegration into our accelerator directives. This work is important in providing support for newer features whilst still providing GPU support for the users. We also look at efforts to improve the maintainability of GPU accelerated community codes.

Tutorials and Labs

S8382 - Zero to GPU Hero with OpenACC

Monday, March 26, 9 - 10:20AM, Grand Ballroom 220C

GPUs are often the fastest way to obtain your scientific results, but many students and domain scientists don't know how to get started. In this tutorial we will take an application from simple, serial loops to a fully GPU-enabled application. Students will learn a profile-guided approach to accelerating applications, including how to find hotspots, how to use OpenACC to accelerate important regions of code, and how to get the best performance they can on GPUs. No prior experience in GPU programming or OpenACC is required, but experience with C, C++, or Fortran is a must. Several books will be given away to attendees who complete this tutorial.
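As a rough idea of the starting point for such a port, the sketch below annotates a simple serial loop with a single OpenACC directive. It is an illustrative example, not material from the tutorial. The function name `saxpy` and its signature are assumptions made here. A compiler without OpenACC support ignores the pragma and runs the loop serially.

```c
#include <assert.h>

/* Compute y = a*x + y for n elements.
   The directive asks the compiler to parallelize the loop and to move
   x to the device (copyin) and y to the device and back (copy). */
void saxpy(int n, float a, const float *x, float *y)
{
    assert(n >= 0);
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

In a real application a profiler would first identify which loops are worth annotating, and data clauses would typically be hoisted into larger data regions to avoid copying on every call.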

L8119 - Programming GPU-Accelerated OpenPOWER Systems with OpenACC

Monday, March 26, 9 - 11AM, Room LL21C

In this tutorial you will learn how to handle the massive computing performance offered by POWER systems with NVLink-attached GPUs – the technology also powering Sierra and Summit, two of the fastest supercomputers in the US. We will present the POWER architecture and highlight the available software stack, before we dive into programming the attached GPUs with OpenACC. By using real-world examples we will get to know the hardware architectures of both CPU and GPU and learn the most important OpenACC directives on the way. The resulting GPU-accelerated program can easily be used on other GPU-equipped machines and architectures, by nature of OpenACC's portable approach. The lab requires the attendees to bring their own laptop. We will work on IBM Minsky servers (POWER8 CPUs with P100 GPUs).

L8115 - In-depth Performance Analysis for OpenACC/CUDA/OpenCL Applications with Score-P and Vampir

Wednesday, March 28, 10AM - 12PM, Room LL21C

Work with Score-P/Vampir to learn how to dive into the execution properties of CUDA and OpenACC applications. We'll show how to use Score-P to generate a trace file and how to study it with Vampir. Additionally, we'll use the newly established OpenACC tools interface to present how OpenACC applications can be studied for performance bottlenecks. Prerequisites: Basic knowledge of CUDA/OpenACC and MPI is recommended but not required. This lab uses GPU resources in the cloud, so you are required to bring your own laptop.

L8179 - Fundamentals of Accelerated Computing with OpenACC – Profiling and Parallelizing Applications on Multicore Processors

Wednesday, March 28, 1:30 - 3:30PM, Room LL21C

In this session participants will go deep on OpenACC directives with hands-on instruction and labs. In this first session you will learn application profiling with the PGI PGProf profiler and how to build and run OpenACC code on multicore processors with the PGI compiler; the second session then moves on to the GPU for maximum performance. No prior OpenACC or GPU computing experience is required, but students are strongly encouraged to attend both sessions. Some experience with C/C++ or Fortran is required. Please bring your own computer for hands-on portions.

L8180 - Fundamentals of Accelerated Computing with OpenACC (cont'd) – Accelerating Applications on GPU Devices for Maximum Performance

Wednesday, March 28, 3:30 - 5:30PM, Room LL21C

Continuing the work done in the previous session, participants will go deep on OpenACC Directives with hands-on instruction and labs. In this session you will move work on to the GPU for maximum performance. No prior OpenACC and GPU Computing experience is required, but students are strongly encouraged to have attended the first OpenACC Bootcamp session. Some experience with C/C++ or Fortran is required. Please bring your own computer for hands-on portions.

L8116 - Best GPU Code Practices Combining OpenACC, CUDA, and OmpSs

Thursday, March 29, 10AM - 12PM, Room LL21C
