GTC 2017 will be an epicenter of OpenACC activities this coming May. The conference offers several opportunities to learn about OpenACC and collaborate with fellow OpenACC users through a variety of talks, tutorials, posters, and Connect with the Experts sessions. In addition, you're invited to socialize with others interested in OpenACC at the OpenACC User Group meeting on Monday night.

User Group Meeting

Monday, May 8, 7:00 - 9:00 PM, Hilton San Jose (San Jose, CA), Almaden Room

Seating is limited to 50. Please register in advance if you'd like to attend.

BoF

S7564 - Accelerator Programming Ecosystems

Tuesday, May 9, 4:00 - 4:50 PM, Marriott Salon 3

Emerging heterogeneous systems are opening up many programming opportunities. This panel will discuss the latest developments in accelerator programming, where programmers can choose among OpenMP, OpenACC, CUDA, and Kokkos for GPU programming. The panelists will examine what the primary criteria are for choosing a model: availability across multiple platforms, a rich feature set, applicability to a certain type of scientific code, compiler stability, or other factors. This will be an interactive Q&A session where participants can discuss their experiences with programming model experts and developers.

Connect with the Experts

S7564 - OpenACC: Start with GPUs and Optimize Your Code

Monday, May 8, 10:00 AM - 11:00 AM, LL Pod B
Tuesday, May 9, 10:00 AM - 11:00 AM, LL Pod A
Wednesday, May 10, 1:00 PM - 2:00 PM, LL Pod C
Thursday, May 11, 10:00 AM - 11:00 AM, LL Pod A

This session is designed for anyone who is looking to start with GPUs or is already accelerating their code with OpenACC on GPUs or CPUs. Join OpenACC experts and your fellow OpenACC developers to get expert advice, discuss your code, and learn how OpenACC directives are used by others.

Talks

S7626 - A Simple Guideline for Code Optimizations on Modern Architectures with OpenACC and CUDA

Tuesday, May 9, 3:00 - 3:25 PM, Marriott Salon 3

Learn a simple, four-step strategy for optimizing application runtime, illustrated on a two-dimensional discontinuous Galerkin solver for computational fluid dynamics on structured meshes. Starting from a sequential CPU code, we guide the audience through the steps that allowed us to speed up the GPU version to roughly 149 times the original runtime (measured on a K20Xm). The same optimization strategy applied to the CPU code yields roughly a 35x speedup over the original runtime (measured on an E5-1650v3 processor). Finally, different hardware architectures (Xeon CPUs, GPUs, KNL) are benchmarked with both the native CUDA implementation and one based on OpenACC.
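
The abstract doesn't spell out the four steps, but a typical first move in this kind of port is to annotate the hot loop nest with OpenACC directives and hoist data movement out of the time loop, as in this purely illustrative toy stencil (not the presenters' code):

```c
/* Illustrative only: expose loop parallelism, then keep the arrays resident
 * on the GPU across the whole time loop instead of copying every step. */
void relax(double *restrict u, double *restrict t, int nx, int ny, int steps) {
    #pragma acc data copy(u[0:nx*ny]) create(t[0:nx*ny])
    for (int s = 0; s < steps; s++) {
        #pragma acc parallel loop collapse(2)
        for (int i = 1; i < nx - 1; i++)
            for (int j = 1; j < ny - 1; j++)
                t[i*ny + j] = 0.25 * (u[(i-1)*ny + j] + u[(i+1)*ny + j]
                                    + u[i*ny + j-1]   + u[i*ny + j+1]);
        #pragma acc parallel loop
        for (int k = 0; k < nx * ny; k++)
            u[k] = t[k];
    }
}
```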

S7640 - Porting C++ Applications to GPUs with OpenACC for Lattice Quantum Chromodynamics

Monday, May 8, 1:00 - 1:25 PM, Room 212A

S7193 - Achieving Portable Performance for GTC-P with OpenACC on GPU, Multi-Core CPU, and Sunway Many-Core Processor

Thursday, May 11, 9:30 AM - 9:55 AM, Room 211B

The Gyrokinetic Toroidal Code developed at Princeton (GTC-P) delivers highly scalable plasma turbulence simulations at extreme scales on world-leading supercomputers such as Tianhe-2 and Titan. The aim of this work is to achieve portable performance from a single source code for GTC-P. We developed the first OpenACC implementation for GPU, CPU, and Sunway processors. The results showed the OpenACC version achieved nearly 90% of the performance of the NVIDIA CUDA version on GPU and of the OpenMP version on CPU; the Sunway OpenACC version achieved a 2.5x speedup on the entire code. Our work demonstrates that OpenACC can deliver portable performance to complex real-science codes like GTC-P. In addition, we propose adding thread-ID support to the OpenACC standard to avoid expensive atomic operations for reductions.

S7636 - Cache Directive Optimization in OpenACC Programming Model

Tuesday, May 9, 3:30 PM - 3:55 PM, Marriott Salon 3

OpenACC is a directive-based programming model that provides a simple interface to exploit GPU computing. Because the GPU employs a deep memory hierarchy, appropriate management of memory resources becomes crucial to ensure performance. The OpenACC programming model offers the cache directive to use on-chip hardware (read-only data cache) or software-managed (shared memory) caches to improve memory access efficiency. We have implemented several strategies to promote shared memory utilization in our PGI compiler suite. In this session, we briefly discuss our investigation of cases that can potentially be optimized by the cache directive and then dive into the underlying implementation. Our compiler is evaluated with self-written micro-benchmarks as well as some real-world applications.
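
For context, here is a minimal sketch of the cache directive (illustrative, not the presenters' code): in a 1-D stencil, adjacent iterations reuse elements of a[], so the directive asks the compiler to stage the reused window in fast on-chip memory.

```c
void smooth(const float *restrict a, float *restrict b, int n) {
    #pragma acc parallel loop copyin(a[0:n]) copyout(b[1:n-2])
    for (int i = 1; i < n - 1; i++) {
        /* request a[i-1], a[i], a[i+1] in on-chip (shared) memory */
        #pragma acc cache(a[i-1:3])
        b[i] = 0.25f * a[i-1] + 0.5f * a[i] + 0.25f * a[i+1];
    }
}
```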

S7635 - Comparison of OpenACC and OpenMP 4.5 Offloading: Speeding Up Simulations of Stellar Explosions

Tuesday, May 9, 2:30 PM - 2:55 PM, Room 212A

Learn about a case study comparing OpenACC and OpenMP 4.5 in the context of stellar explosions. Modeling supernovae requires multi-physics simulation codes to capture hydrodynamics, nuclear burning, gravitational forces, etc. As a nuclear detonation burns through the stellar material, it also increases the temperature. An equation of state (EOS) is then required to determine, say, the new pressure associated with this temperature increase. In fact, an EOS is needed after the thermodynamic conditions are changed by any physics routine. This means it is called many times throughout a simulation, requiring a fast EOS implementation. Fortunately, these calculations can be performed independently during each time step, so the work can be offloaded to GPUs. Using the IBM/NVIDIA early test system (precursor to the upcoming Summit supercomputer) at Oak Ridge National Laboratory, we use a hybrid MPI+OpenMP (traditional CPU threads) driver program to offload work to GPUs. We'll compare the performance results as well as some of the currently available features of OpenACC and OpenMP 4.5.
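
As a flavor of the comparison, here is a hedged sketch (with a hypothetical eos_pressure routine, not the talk's code) of the same independent-per-zone EOS sweep written in both models:

```c
typedef struct { double temp, dens, pres; } Zone;

/* hypothetical per-zone EOS: recompute pressure from temperature/density */
#pragma omp declare target
#pragma acc routine seq
static double eos_pressure(double t, double rho) { return 8.314 * t * rho; }
#pragma omp end declare target

/* OpenACC offload: every zone is independent, so the sweep parallelizes */
void eos_sweep_acc(Zone *zones, int n) {
    #pragma acc parallel loop copy(zones[0:n])
    for (int i = 0; i < n; i++)
        zones[i].pres = eos_pressure(zones[i].temp, zones[i].dens);
}

/* OpenMP 4.5 offload: the equivalent target construct */
void eos_sweep_omp(Zone *zones, int n) {
    #pragma omp target teams distribute parallel for map(tofrom: zones[0:n])
    for (int i = 0; i < n; i++)
        zones[i].pres = eos_pressure(zones[i].temp, zones[i].dens);
}
```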

S7735 - GPU Acceleration of the HiGrad Computational Fluid Dynamics Code with Mixed OpenACC and CUDA Fortran

Wednesday, May 10, 3:30 PM - 3:55 PM, Room 211B

We'll present the strategy and results for porting an atmospheric fluids code, HiGrad, to the GPU. HiGrad is a cross-compiled, mixed-language code that includes C, C++, and Fortran, and is used for atmospheric modeling. Deep subroutine calls necessitate detailed control of the GPU data layout with CUDA Fortran. Initial kernel accelerations with OpenACC are presented, followed by a discussion of tuning with OpenACC and a comparison with hand-tuned CUDA kernels. We'll demonstrate the performance improvements and the different techniques used to port this code to GPUs, using a mixed CUDA Fortran and OpenACC implementation for single-node performance, with scaling studies conducted with MPI on local supercomputers and Oak Ridge National Laboratory's Titan supercomputer, on architectures including the Tesla K40 and Tesla P100.

S7382 - GPUs Unleashed: Analysis of Petascale Molecular Simulations with VMD

Wednesday, May 10, 3:30 PM - 3:55 PM, Room 211B

We'll showcase recent successes in the use of GPUs to accelerate challenging molecular simulation analysis tasks on the latest NVIDIA Tesla P100 GPUs on both Intel and IBM/OpenPOWER hardware platforms, and large-scale runs on petascale computers such as Titan and Blue Waters. We'll highlight the performance benefits obtained from die-stacked memory on the Tesla P100, the NVIDIA NVLink interconnect on the IBM "Minsky" platform, and the use of NVIDIA CUDA just-in-time compilation to increase the performance of data-driven algorithms. We will present results obtained with OpenACC parallel programming directives, current challenges, and future opportunities. Finally, we'll describe GPU-accelerated machine learning algorithms for tasks such as clustering of structures resulting from molecular dynamics simulations.

S7133 - Multi-GPU Programming with MPI

Monday, May 8, 10:35 AM - 11:25 AM, Room 211B

Learn how to program multi-GPU systems or GPU clusters using the message passing interface (MPI) and OpenACC or NVIDIA CUDA. We'll start with a quick introduction to MPI and how it can be combined with OpenACC or CUDA. Then we'll cover advanced topics like CUDA-aware MPI and how to overlap communication with computation to hide communication times. We'll also cover the latest improvements with CUDA-aware MPI, interaction with Unified Memory, the multi-process service (MPS, aka Hyper-Q for MPI), and MPI support in NVIDIA performance analysis tools.
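
Below is a minimal sketch (assuming a CUDA-aware MPI build, and a field[] array already on the GPU inside an OpenACC data region) of the core pattern the session discusses: host_data exposes the device address so MPI can transfer GPU memory directly, without staging through the host.

```c
#include <mpi.h>

void exchange_halo(double *field, int n, int left, int right) {
    #pragma acc host_data use_device(field)
    {
        /* send first interior element left, receive into right ghost cell;
         * inside host_data, field is the *device* address */
        MPI_Sendrecv(&field[1],     1, MPI_DOUBLE, left,  0,
                     &field[n - 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
```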

S7546 - Multi-GPU Programming with OpenACC

Tuesday, May 9, 2:00 PM - 2:25 PM, Marriott Salon 3

This session will discuss techniques for using more than one GPU in an OpenACC program: how to address multiple devices, how to mix OpenACC and OpenMP to manage them, and how to combine OpenACC with MPI.
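
One of these techniques, driving several GPUs from a single host thread, might look like this minimal, illustrative sketch (not the presenter's code):

```c
#include <openacc.h>

/* split the array across all visible NVIDIA GPUs */
void scale_on_all_gpus(float *x, int n, float s) {
    int ndev = acc_get_num_devices(acc_device_nvidia);
    for (int d = 0; d < ndev; d++) {
        acc_set_device_num(d, acc_device_nvidia);
        int lo = (int)((long)n * d / ndev);
        int hi = (int)((long)n * (d + 1) / ndev);
        /* async lets the host loop on to feed the next device */
        #pragma acc parallel loop async copy(x[lo:hi - lo])
        for (int i = lo; i < hi; i++)
            x[i] *= s;
    }
    for (int d = 0; d < ndev; d++) {   /* drain every device's queue */
        acc_set_device_num(d, acc_device_nvidia);
        #pragma acc wait
    }
}
```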

S7521 - Ocean Circulation on GPUs

Monday, May 8, 3:30 PM - 3:55 PM, Room 212B

We'll show the development of an ocean circulation model in China and recent work to port the whole model to GPUs using OpenACC. Preliminary performance results for the GPU version will also be shown.

S7192 - OmpSs+OpenACC: Multi-target Task-Based Programming Model exploiting OpenACC GPU Kernels

Tuesday, May 9, 2:30 PM - 2:55 PM, Marriott Salon 3

Discover how the OmpSs programming model lets you combine different programming models, such as OpenACC, multi-threaded programming, CUDA, and OpenCL, while providing a single address space and directionality compiler directives. OmpSs is a flagship project at the Barcelona Supercomputing Center, as well as a forerunner of OpenMP features such as tasking. We'll present the advantages in terms of coding productivity and performance brought by our recent work integrating OpenACC kernels within the OmpSs programming model, as a step forward from our previous OmpSs + CUDA support. We'll also show how our runtime system can use hybrid GPU and CPU execution together without any code modification.

S7672 - OpenACC Best Practices: Accelerating the C++ NUMECA FINE/Open CFD Solver

Tuesday, May 9, 9:00 AM - 9:25 AM, Room 212B

We'll demonstrate the maturity and capabilities of OpenACC and the PGI compiler suite in a professional C++ programming environment. We'll explore in detail the adaptation of the general-purpose NUMECA FINE/Open CFD solver for heterogeneous CPU+GPU execution. We'll give extra attention to the OpenACC tips and tricks used to efficiently port the existing C++ programming model with minimal code modifications. Sample code blocks will demonstrate the implementation principles in a clear and concise manner. Finally, we'll present simulations completed in partnership with Dresser-Rand on the OLCF Titan supercomputer, showcasing the scientific capabilities of FINE/Open and the improvements in simulation turnaround time made possible through the use of OpenACC.

S7535 - Potential Field Solutions of the Solar Corona: Converting a PCG Solver from MPI to MPI+OpenACC

Tuesday, May 9, 2:00 PM - 2:25 PM, Room 212A

We'll describe a real-world example of adding OpenACC to a legacy MPI Fortran preconditioned conjugate gradient code, and show timing results for multi-node, multi-GPU runs. The code's application is obtaining 3D spherical potential field (PF) solutions of the solar corona using observational boundary conditions. PF solutions yield approximations of the coronal magnetic field structure and can be used as initial/boundary conditions for MHD simulations with applications to space weather prediction. We'll highlight key tips and strategies used when converting the MPI code to MPI+OpenACC, including linking Fortran code to the cuSPARSE library, using CUDA-aware MPI, maintaining performance portability, and dealing with multi-node, multi-GPU run-time environments. We'll show timing results for three increasingly sized problems when running the code with MPI only (up to 1728 CPU cores) and with MPI+GPU (up to 60 GPUs) using NVIDIA K80 and P100 GPUs.
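
One of those strategies, calling a GPU library on OpenACC-managed data, might look like the following C sketch (the talk's code is Fortran, and spmv_csr_device here is a hypothetical wrapper around a cuSPARSE-style sparse matrix-vector product):

```c
/* hypothetical wrapper around a library SpMV that expects device pointers */
void spmv_csr_device(int n, int nnz, const double *vals, const int *cols,
                     const int *rowptr, const double *x, double *y);

/* assumes all arrays are on the GPU via an enclosing OpenACC data region;
 * inside host_data, the array names refer to device addresses */
void pcg_matvec(int n, int nnz, const double *vals, const int *cols,
                const int *rowptr, const double *x, double *y) {
    #pragma acc host_data use_device(vals, cols, rowptr, x, y)
    spmv_csr_device(n, nnz, vals, cols, rowptr, x, y);
}
```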

S7275 - Real-time Monitoring of Financial Risk Management on GPU

Tuesday, May 9, 10:00 AM - 10:25 AM, Room 210E

Option-embedded bond pricing and value-at-risk (VaR) computations have become hotspots in financial risk management since the 2008 financial crisis. The goal of this work is to implement real-time option-embedded bond pricing and VaR computations in the production system of Shanghai Clearing House, the exclusive central counterparty for the over-the-counter market and the major securities settlement system in China. We developed both CUDA and OpenACC implementations on GPU. The results showed the CUDA versions achieved a 60x speedup for option-embedded bond pricing and a 10x speedup for VaR. In addition, the OpenACC versions can deliver portable performance for both option-embedded bond pricing and VaR computations.

S7628 - The Future of GPU Data Management

Tuesday, May 9, 10:00 AM - 10:25 AM, Room 211B

Optimizing data movement between host and device memories is an important step when porting applications to GPUs. This is true for any programming model (CUDA, OpenACC, OpenMP 4+, ...), and becomes even more challenging with complex aggregate data structures (arrays of structs with dynamically-allocated array members). The CUDA and OpenACC APIs expose the separate host and device memories, requiring the programmer or compiler to explicitly manage the data allocation and coherence. The OpenACC committee is designing directives to extend this explicit data management for aggregate data structures. CUDA C++ has managed memory allocation routines and CUDA Fortran has the managed attribute for allocatable arrays, allowing the CUDA driver to manage data movement and coherence. Future NVIDIA GPUs will support true unified memory, with operating system and driver support for sharing the entire address space between the host and the GPU. We will compare and contrast the current and future explicit memory movement with driver- and system-managed memory, and discuss how future developments will affect application development and performance.
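
To make the contrast concrete, here is a minimal sketch of today's explicit "manual deep copy" for an aggregate type, the pattern the proposed directives and managed memory would relieve the programmer of (illustrative code, not a committee example):

```c
typedef struct { int n; double *vals; } Field;

void field_to_device(Field *f) {
    #pragma acc enter data copyin(f[0:1])           /* shallow struct copy  */
    #pragma acc enter data copyin(f->vals[0:f->n])  /* member copy + attach */
}

void field_from_device(Field *f) {
    #pragma acc exit data copyout(f->vals[0:f->n])
    #pragma acc exit data delete(f[0:1])
}
```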

S7527 - Reduction of memory accesses for unstructured low-order finite-element analyses on Pascal GPUs

Thursday, May 11, 10:00 AM - 10:25 AM, Room 211B

We'll show a method that decreases random memory accesses on GPUs by splitting up calculations appropriately. The target application is unstructured low-order finite element analysis, a core application for manufacturing analyses. To reduce the memory access cost, we apply the element-by-element method for matrix-vector multiplication in the analysis. This method conducts a local matrix-vector computation for each element in parallel. Atomic and cache hardware in GPUs has improved, and we can exploit the data locality in the element-node connectivity by using atomic functions to accumulate local results. We port the code to GPUs using OpenACC directives and attain high performance at low development cost. We'll also describe the performance on NVIDIA DGX-1, which contains eight Pascal GPUs.
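
The element-by-element pattern with atomic accumulation might look like this minimal sketch (hypothetical tetrahedral data layout, not the authors' code; assumes all arrays are already on the device via a data region):

```c
void ebe_matvec(const int (*conn)[4], const double (*ke)[4][4],
                const double *x, double *y, int nelem) {
    #pragma acc parallel loop
    for (int e = 0; e < nelem; e++) {
        for (int a = 0; a < 4; a++) {
            double sum = 0.0;
            for (int b = 0; b < 4; b++)             /* local element product */
                sum += ke[e][a][b] * x[conn[e][b]];
            #pragma acc atomic update
            y[conn[e][a]] += sum;   /* safe concurrent adds to shared nodes */
        }
    }
}
```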

S7341 - Using OpenACC for NGS Techniques to Create a Portable and Easy-to-Use Code Base

Tuesday, May 9, 3:00 PM - 3:25 PM, Room 210C

Happy with your code but rewriting it every time the hardware platform changes? Know NVIDIA CUDA but want to use a higher-level programming model? OpenACC is a directive-based technique that enables more science and less programming, and it facilitates reusing a code base on more than one platform. This session will help you: (1) learn how to incrementally improve a bioinformatics code base using OpenACC without losing performance; (2) explore how to apply optimization techniques and the challenges encountered in the process. We'll share our experience using OpenACC for DNA next-generation sequencing techniques.

S7478 - Using OpenACC to parallelize irregular algorithms on GPUs

Wednesday, May 10, 3:30 PM - 3:55 PM, Marriott Salon 3

We'll dive deeper into using OpenACC and explore potential solutions that can overcome challenges faced while parallelizing an irregular algorithm, the sparse fast Fourier transform (sFFT). We'll analyze code characteristics using profilers and discuss the optimizations applied, the things we did right and wrong, and the roadblocks we faced and the steps taken to overcome them. We'll highlight how to compare data reproducibility between accelerators in heterogeneous platforms, and report on the algorithmic changes needed to go from sequential to parallel, especially for an irregular code, while using OpenACC. The results will demonstrate how to create a portable, productive, and maintainable code base without compromising on performance using OpenACC.

S7558 - Porting and Optimization of Search of Neighbour-particle by Using OpenACC

Monday, May 8, 1:00 PM - 1:25 PM, Room 212B

The MPS (moving particle semi-implicit) method is a particle method (not a stencil computation) used for computational fluid dynamics. The search for neighbor particles is the main bottleneck of MPS. We show our porting efforts and three optimizations of the neighbor-particle search using OpenACC. We evaluate our implementations on Tesla K20c, GeForce GTX 1080, and Tesla P100 GPUs, achieving 45.7x, 96.8x, and 126.1x speedups, respectively, compared with a single-threaded Ivy Bridge CPU.

Tutorials

L7114 - Multi GPU Programming with MPI and OpenACC

Monday, May 8, 12:30 PM - 2:30 PM, Room LL21E

Learn how to program multi-GPU systems or GPU clusters using the message passing interface and OpenACC. We'll start with a quick introduction to MPI and how an NVIDIA(R) CUDA(R)-aware MPI implementation can be used with OpenACC. Other topics covered will include how to handle GPU affinity in multi-GPU systems and using NVIDIA performance analysis tools. Prerequisites: C or Fortran; basic OpenACC and MPI are strongly recommended but not required. This lab utilizes GPU resources in the cloud; you are required to bring your own laptop with a modern browser.

L7130 - Introduction to OpenACC Directives

Monday, May 8, 9:30 AM - 11:30 AM, Room LL21E

During this lab, you will learn about OpenACC, a user-driven standard that offers a directive-based programming model enabling the scientific community to port their codes to multiple platforms without significant programming effort. The lab will introduce how to analyze and parallelize your code, as well as how to perform optimizations such as managing data movement (see the sketch below). With access to a variety of supercomputers, researchers are looking for a solution that allows their codes to run not only on GPUs but on any architecture with minimal or no code change. Scientists report 2-10x performance increases with as little as a few weeks of effort using OpenACC. Prerequisites: the lab does not assume any previous experience with OpenACC directives or GPU programming in general, but programming experience with C or Fortran is desirable. This lab utilizes GPU resources in the cloud; you are required to bring your own laptop.
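
As a flavor of the data-movement optimizations the lab covers, here is an illustrative sketch (not the lab's actual exercise): a single data region keeps the arrays resident on the GPU across both kernels instead of copying them for each loop.

```c
void saxpy_twice(const float *restrict x, float *restrict y, int n, float a) {
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];

        #pragma acc parallel loop  /* reuses device copies: no extra traffic */
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }
}
```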

L7115 - In-Depth Performance Analysis for OpenACC/CUDA®/OpenCL Applications with Score-P and Vampir, HZDR

Wednesday, May 10, 4:00 PM - 6:00 PM, Room LL21A

Work with Score-P/Vampir to learn how to dive into the execution properties of CUDA and OpenACC applications. We'll show how to use Score-P to generate a trace file and how to study it with Vampir. Additionally, we'll use the newly established OpenACC tools interface to show how OpenACC applications can be studied for performance bottlenecks. Prerequisites: basic knowledge of CUDA/OpenACC and MPI is recommended but not required. This lab utilizes GPU resources in the cloud; you are required to bring your own laptop.

L7106 - Best GPU Code Practices combining OpenACC, CUDA, and OmpSs

Monday, May 8, 3:00 PM - 5:00 PM, Room LL21E

We'll guide you step by step through porting and optimizing an oil-and-gas mini-application to efficiently leverage the computing power of NVIDIA GPUs. While OpenACC focuses on coding productivity and portability, CUDA enables extracting the maximum performance from NVIDIA GPUs. OmpSs, on the other hand, is a GPU-aware task-based programming model that may be combined with CUDA, and recently with OpenACC as well. Using OpenACC, we'll start benefiting from GPU computing, obtaining great coding productivity and a solid performance improvement. We can then fine-tune the critical application parts by developing CUDA kernels to hand-optimize the problem. OmpSs combined with either OpenACC or CUDA enables seamless task parallelism leveraging all system devices. Prerequisites: basic knowledge of OpenACC and CUDA. This lab utilizes GPU resources in the cloud; you are required to bring your own laptop.

Posters

P7230 - GPU Acceleration of Gulf Stream Dynamics Simulations using OpenACC

P7149 - Accelerated Stiffness Analysis of Structural Floors using BEM for Buildings' Lateral Assessment

P7258 - Accelerating Plasma Physics with GPUs

P7231 - Experience in Porting a Seismic Imaging Kernel to Multiple GPUs

P7256 - GPU Acceleration of MPAS Physics Scheme Using OpenACC

P7218 - Porting and Optimization of Search of Neighbour-particle by Using OpenACC