The GTC 2021 conference offers several opportunities to learn more about the intersection of HPC, AI, and data science. Join your peers through a variety of talks, tutorials, posters, and meet-the-experts hangouts across topics such as OpenACC and programming languages, developer tools, and industry-specific research and applications. This year's GTC will be a digital conference with live sessions as well as on-demand recordings and materials. Registration is FREE! Learn more.

Connect with the Experts

OpenACC is a programming model designed to help scientists and developers get started with GPUs faster and work more efficiently by maintaining a single source code for multiple platforms. OpenACC experts discuss how to start accelerating your code on GPUs, continue optimizing your GPU code, start teaching OpenACC, host or participate in a hackathon, and more.

Talks

OpenACC

Porting VASP to GPU Using OpenACC: Exploiting the Asynchronous Execution Model [S31717]

Martijn Marsman, Senior Scientist, University of Vienna

NVIDIA GPUs accelerate the most important applications in quantum chemistry (like Gaussian, VASP, Quantum ESPRESSO, GAMESS, NWChem, and CP2K) and molecular dynamics (like GROMACS, NAMD, LAMMPS, and Amber) that are also very popular in materials science, biophysics, drug discovery, and other domains. We'll answer your questions about how to get the best performance for your specific workload or figure out how you can benefit from accelerated computing.

Fluid Dynamic Simulations of Euplectella aspergillum Sponge [E31218]

Giorgio Amati, Senior HPC Engineer, CINECA

We present our experience in simulating the flow around a silica-based sponge, Euplectella aspergillum, using a TOP500 machine equipped with NVIDIA GPUs. A Lattice Boltzmann Method (LBM)-based code was used to explore fluid dynamical features of this complex structure. We'll present some physical results, together with details of code implementations and performance figures (up to about 4,000 V100 GPUs) for our MPI+OpenACC LBM code.

Aerodynamic Flow Control Simulations with Many GPUs on the Summit Supercomputer [S31231]

Nicholson Koukpaizan, Postdoctoral Research Associate, Oak Ridge National Laboratory

A GPU-accelerated computational fluid dynamics (CFD) solver was used for aerodynamic flow control simulations on the Summit supercomputer at the Oak Ridge Leadership Computing Facility. The GPU implementation of the Fortran 90 code relies on OpenACC directives to offload work to the GPU and on the Message Passing Interface (MPI) for multi-core/multi-device parallelism, taking advantage of the CUDA-aware MPI capability. We'll address implementation details, as well as performance results and optimization. Finally, we'll present new scientific results obtained by leveraging the GPUs for the control of aerodynamic flow separation using fluidic oscillators (actuators that generate spatially oscillating jets without any moving parts). We'll add a few details pertaining to CFD and aerodynamic flow control to make the talk accessible to people who are not necessarily familiar with these domains.

On Scalability, Portability, and Maintainability of Commercial CFD Solver HiFUN on NVIDIA GPU [S31758]

Munikrishna Nagaram, Chief Technology Officer, S & I Engineering Solutions

OpenACC provides a parallel programming paradigm for porting legacy computational fluid dynamics (CFD) solvers to hybrid HPC platforms without compromising the readability and maintainability of the code. HiFUN, a Message Passing Interface (MPI)-based, super-scalable, industry-standard CFD code, has adopted the OpenACC framework to exploit its supercomputing advantage on GPU-based hybrid platforms. We'll highlight (1) the OpenACC way of porting a production CFD code while retaining the MPI parallelism and source code maintainability; (2) profiling and performance analysis; (3) performance studies on Volta GPUs; and (4) performance evaluation on NVIDIA's newer Ampere architecture.

A Tale of Two Programming-Models: Enhancing Heterogeneity, Productivity, and Performance through OmpSs-2 + OpenACC Inter-Operation [E31193]

Simon Garcia de Gonzalo, Postdoctoral Researcher, Barcelona Supercomputing Center

Learn about the newly possible interoperation between two pragma-based programming models: OmpSs-2 and OpenACC. Two pragma-based programming models designed to function completely independently and unaware of each other can be made to collaborate effectively with minimal additional programming. We'll go over the separation of duties between the models and describe in depth the mechanism needed for interoperation. We'll provide concrete code examples using ZPIC, a 2D plasma simulator application written in OmpSs-2, OpenACC, and OmpSs-2 + OpenACC. We'll compare the performance and programmability benefits of the OmpSs-2 + OpenACC ZPIC implementation against the single-model implementations. OmpSs-2 + OpenACC is part of the latest OmpSs-2 release, and all ZPIC implementations are open source.

Introducing Developer Tools for Arm and NVIDIA systems [S32163]

David Lecomber, Senior Director, Arm

NVIDIA GPUs on Arm servers are here. In migrating to, or developing on, Arm servers with NVIDIA GPUs, developers using native code, CUDA, and OpenACC continue to need tools and toolchains to succeed and to get the most out of applications. We'll explore the role of key tools and toolchains on Arm servers, from Arm, NVIDIA, and elsewhere, and show how each tool fits in the end-to-end journey to production science and simulation.

Panel: Present and Future of Accelerated Computing Programming Approaches [S31146]

Sunita Chandrasekaran, Assistant Professor, University of Delaware | Bryce Lelbach, HPC Programming Models Architect, NVIDIA | Christian Trott, Principal Member of Staff, Sandia National Laboratories | Stephen Jones, CUDA Architect, NVIDIA | Jeff Larkin, Senior Developer Technologies Software Engineer and OpenACC Technical Committee Chair, NVIDIA | Joel Denny, Computer Scientist, Oak Ridge National Laboratory | Jack Deslippe, Application Performance Group Lead, NERSC, Lawrence Berkeley National Lab

With endless choices of programming environments in the parallel computing universe at a time of exascale computers, which programming model should you choose? Are base languages, like C++ and Fortran, a solid choice for today's codes? Will OpenACC and OpenMP directives stay around long enough to warrant investing time and effort in them now for application acceleration on GPUs? Or should a researcher go back to low-level programming models, like CUDA, to extract maximum performance from the code? Join our panel of experts as they debate the ultimate answer for the future parallel programmer.

Accelerating Machine Learning Applications Using CUDA Graph and OpenACC [S31212]

Leonel Toledo, Recognized Researcher, Barcelona Supercomputing Center (BSC) | Antonio J. Peña, Senior Researcher, Barcelona Supercomputing Center

We'll showcase the integration of CUDA Graph with OpenACC, which allows developers to write applications that benefit from GPU parallelism while increasing coding productivity. Since many scientific applications require high-performance computing systems for their calculations, it's important to provide a mechanism that allows developers to exploit the system's hardware to achieve the expected performance.

We will also explore the most important technical details regarding the integration of CUDA Graph and OpenACC. This allows programmers to define the workflow as a set of GPU tasks, potentially executing more than one at the same time.

Examples will be provided using CUDA, C++, and OpenACC; registrants are expected to be familiar with at least the fundamentals of these programming languages.

The ChEESE Effort Toward Building a GPU Ecosystem for Earth Science [E31267]

Piero Lanucara, CINECA

We'll present the major effort in the ChEESE project toward building a GPU ecosystem. Our presentation is organized around two main pillars:

A description of the ChEESE flagship applications and their GPU capabilities. We want to cover technical aspects related to the code developments and the technology used (CUDA or OpenACC, and the reasons for each choice), as well as benchmarking numbers on ChEESE use cases.

Where these applications will run (the “systems”) and what new science is made possible thanks to NVIDIA GPUs (in the case of ChEESE, what are the possible demonstrators that will run on PRACE and EuroHPC systems?).

Devito: High-Performance Imaging and Inversion Codes from Symbolic Computation and Python in Seconds [S31796]

Gerard Gorman, Associate Professor, Imperial College London

Devito is a domain-specific language (DSL) and code generation framework for designing highly optimized finite difference kernels for use in inversion methods. Devito utilizes SymPy to allow the definition of operators from high-level symbolic equations and generates optimized and automatically tuned code specific to a given target architecture, including Arm, GPUs, Power series, x86, and Xeon Phi. Devito is currently used in industry for petascale seismic imaging. Applications in other areas, such as medical imaging and scalable machine learning, are under development. Symbolic computation is a powerful tool that allows users to build complex solvers from only a few lines of a Python DSL, use advanced code optimization methods to generate parallel high-performance code, and (re)develop production-ready software in hours rather than months.

Inside the NVIDIA HPC Compilers [S31358]

Bryce Lelbach, HPC Programming Models Architect, NVIDIA

Learn about the architecture and latest features in NVC++ and NVFORTRAN. We'll cover the details of an exciting new feature of NVC++ that will be announced earlier at GTC. We'll also discuss the latest developments in Standard Parallelism in C++ and Fortran. With the NVIDIA HPC compilers, programming GPUs has never been easier! Our session involves four programming models: ISO Standard, CUDA, OpenACC, and OpenMP; two languages: C++ and Fortran; and one tool chain: the NVIDIA HPC compiler.

GPU-Accelerated Diamond Tiling Stencil Computations for Seismic Applications [S31491]

Long Qu, HPC Research Scientist, King Abdullah University of Science and Technology (KAUST)

Learn how we accelerate stencil computations for seismic applications using GPUs. Spatial blocking (SB) is the widely adopted, vendor-agnostic technique for increasing data reuse in the high memory subsystem level. We combine SB with the Multicore Wavefront Diamond (MWD) tiling approach to maximize performance throughput on GPUs. MWD allows for increased data reuse in shared memory, thus reducing the expensive data traffic to global memory. We assess our implementation using different programming models on the latest GPU hardware technologies. We present preliminary results highlighting the GPU integration of MWD-based stencil computations into reverse time migration (RTM) using datasets from the Society of Exploration Geophysicists.
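
For readers unfamiliar with spatial blocking, the idea can be sketched on the CPU in plain C. This is only an illustration of the general technique, not the speakers' implementation: it shows a 5-point 2D stencil swept tile by tile so that neighboring values are reused while they are still in fast memory. It does not attempt the MWD temporal tiling described in the abstract. The function name `stencil_blocked`, the tile sizes `bx` and `by`, and the averaging stencil itself are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major index of grid point (i, j) in a grid that is nx points wide. */
#define IDX(i, j, nx) ((size_t)(j) * (size_t)(nx) + (size_t)(i))

static int min_int(int a, int b) { return a < b ? a : b; }

/*
 * Hypothetical example: one sweep of a 5-point averaging stencil over the
 * interior of an nx-by-ny grid, visiting the interior in bx-by-by tiles.
 * Boundary points of `out` are left untouched.
 */
void stencil_blocked(int nx, int ny, int bx, int by,
                     const float *in, float *out)
{
    assert(bx > 0 && by > 0);

    for (int jj = 1; jj < ny - 1; jj += by) {          /* tile rows    */
        int jend = min_int(jj + by, ny - 1);
        for (int ii = 1; ii < nx - 1; ii += bx) {      /* tile columns */
            int iend = min_int(ii + bx, nx - 1);
            /* All points of one tile are updated before moving on. */
            for (int j = jj; j < jend; ++j) {
                for (int i = ii; i < iend; ++i) {
                    out[IDX(i, j, nx)] = 0.25f *
                        (in[IDX(i - 1, j, nx)] + in[IDX(i + 1, j, nx)] +
                         in[IDX(i, j - 1, nx)] + in[IDX(i, j + 1, nx)]);
                }
            }
        }
    }
}
```

The tile sizes would normally be tuned so that one tile plus its halo fits in the cache (or, on a GPU, in shared memory); the result is identical to an unblocked sweep, only the traversal order changes.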

Materials Design Toward the Exascale: Porting Electronic Structure Community Codes to GPUs [E32448]

Andrea Ferretti, Senior Researcher and Chair of the MaX Executive Committee, CNR - Nanoscience Institute

Materials are crucial to science and technology, and connected to major societal challenges ranging from energy and environment to information and communication, and manufacturing. Electronic structure methods have become key to materials simulations, allowing scientists to study and design new materials before running actual experiments. The MaX Centre of Excellence (Materials design at the eXascale) is devoted to enabling materials modeling at the frontiers of current and future HPC architectures. MaX's action focuses on popular open-source community codes in the electronic structure field (Quantum ESPRESSO, Yambo, Siesta, Fleur, CP2K, and BigDFT). We'll discuss the performance and portability of MaX flagship codes, with a special focus on GPU accelerators. Porting to GPUs has been demonstrated (all codes released as GPU-ready) following diverse strategies to address both performance and maintainability, while keeping the community engaged.

Talks Related to GPU Hackathons

Unraveling the Universe with Petascale Graph Networks [S31658]

Christina Kreisch, Graduate Student Researcher, Princeton University | Miles Cranmer, Graduate Student Researcher, Princeton University

Learn about using graph networks to find interpretable representations of physical laws in the universe with petabytes of data. We have 44,100 n-body simulations, each with over 20,000 nodes, where each node can be connected to ~30 other nodes. Leveraging NVIDIA’s toolkits and improving GPU utilization, we achieved over 8,000x speed-up in pre-processing. We'll demo our optimizations and graph network, which is built to back-propagate gradients through an entire simulation. One of the greatest challenges in astrophysics is understanding the relationship between galaxies and underlying parameters of the universe. Modeling such relationships with petabytes of data has been computationally prohibitive. We construct and train our graph network in an interpretable way, which allows us to contribute to existing theory by interpreting our graph network with symbolic regression in addition to providing new constraints on cosmological parameters. You don't need any particular prior knowledge for our session.

Scaling Graph Generative Models for Fast Detector Simulations in High-Energy Physics [S31574]

Ali Hariri, Student, Graduate Research Assistant, American University of Beirut

Accurate and fast simulation of particle physics processes is crucial for the high-energy physics community. Simulating the particle showers and interactions in the detector is both time-consuming and computationally expensive. The main goal of a fast simulator in the context of the Large Hadron Collider (LHC) is to map the events from the generation level to the reconstruction level. Traditional detector fast simulation approaches based on non-parametric techniques can significantly improve the speed of the full simulation; however, they also suffer from lower levels of fidelity. For this reason, alternative approaches based on machine-learning techniques can provide faster solutions, while maintaining higher levels of fidelity. We'll introduce a graph neural network-based autoencoder model that provides effective reconstruction of detector simulation for LHC collisions.

Tutorials and Training

Zero to GPU Hero with OpenACC [S31816]

Jeff Larkin, Senior Developer Technologies Software Engineer and OpenACC Technical Committee Chair, NVIDIA

Porting and optimizing legacy applications for GPUs doesn't have to be difficult when you use the right tools. OpenACC is a directive-based parallel programming model that enables C, C++, and Fortran applications to be ported to GPUs quickly while maintaining a single code base. Learn the basics of parallelizing an application using OpenACC directives and the NVIDIA HPC compilers. Also learn to identify important parts of your application, parallelize those parts for the GPU, optimize data movement, and improve GPU performance. Become a GPU Hero by joining this session.
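
To give a flavor of what directive-based porting looks like, here is a minimal sketch in C, not taken from the session materials. It parallelizes a SAXPY-style loop with `parallel loop` and wraps the repeated kernel launches in a `data` region so the arrays are copied to the GPU once rather than on every step. The function name `saxpy_repeated` and its parameters are hypothetical and chosen only for this example.

```c
#include <assert.h>

/*
 * Hypothetical example: apply y = a * x + y `steps` times.
 * The `data` region keeps x and y resident on the device for all steps;
 * each `parallel loop` then runs on data that is already present.
 * A compiler without OpenACC support ignores the pragmas and runs serially.
 */
void saxpy_repeated(int n, int steps, float a,
                    const float *restrict x, float *restrict y)
{
    assert(n >= 0 && steps >= 0);

    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        for (int s = 0; s < steps; ++s) {
            #pragma acc parallel loop present(x[0:n], y[0:n])
            for (int i = 0; i < n; ++i) {
                y[i] = a * x[i] + y[i];
            }
        }
    }
}
```

With the NVIDIA HPC compilers, a file like this would typically be built with `nvc -acc`; because the directives are ordinary pragmas, the same source still compiles and gives the same results with a compiler that has no OpenACC support.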