The Open Accelerated Computing Summit reflects our organization's evolution and commitment to helping the research and developer community advance science by expanding their accelerated and parallel computing skills. The 2024 Summit brought together preeminent researchers from national laboratories, research institutions, and supercomputing centers worldwide to discuss work that aligns with our organization's focus areas.
Keynotes
Broadening Access to AI Resources through the National AI Research Resource
Katie Antypas, National Science Foundation
The National AI Research Resource (NAIRR) is a concept for a national infrastructure that connects U.S.-based researchers and educators to the computational, data, software, model, and user support resources necessary to power AI innovation and advance the AI ecosystem in a responsible manner. A pilot for the NAIRR launched in January 2024 as a proof of concept for the eventual full-scale NAIRR. The NAIRR Pilot is focused on supporting research and education across broad and diverse communities, while also serving as a vehicle for gaining insights that will refine the design of a full NAIRR. This talk will provide an update on NAIRR Pilot activities, research highlights, and initial lessons learned.
COMPSs, Programming Distributed Computing Systems at the Intersection of HPC, AI and Data Analytics
Rosa M. Badia, Barcelona Supercomputing Center
With Exaflop systems already here, high-performance computing (HPC) involves larger and more complex supercomputers. At the same time, the user community is aware of the underlying performance and eager to exploit it by composing more complex application workflows. Moreover, current application trends aim to combine data analytics and artificial intelligence with HPC modelling and simulation. PyCOMPSs is a framework for parallel task-based programming in Python. Based on simple annotations, it can execute sequential Python programs in parallel on HPC clusters and other distributed infrastructures. In recent years, it has been extended to better support the integration of HPC workflows with artificial intelligence and data analytics. The environment also leverages heterogeneous computing systems, hiding their complexity from the workflow developer. The talk will give an overview of PyCOMPSs, illustrated with some recent examples, and will also briefly cover opportunities for EU+US collaboration through the DISCOVER-US project.
Tutorials
End-to-End LLM Tutorial
Yash Gupta, NVIDIA
The End-to-End LLM (Large Language Model) tutorial is designed from a real-world perspective that follows the data processing, development, and deployment pipeline paradigm. This talk focuses on the fundamentals and adaptation of large language models. During the talk, I will give a deep dive into the evolution of LLMs, expose you to scalable training techniques for pre-trained models, and provide a mathematical illustration of the attention mechanism. Because it is essential to employ a specific decoding strategy to generate appropriate output from an LLM, I will explain greedy and sampling-based decoding strategies and emphasize approaches for improving them. The talk will also cover the two major approaches to adapting a pre-trained LLM (instruction and alignment tuning) and applying parameter-efficient fine-tuning (PEFT). Lastly, this talk will include a quick demo of how an LLM can be optimized for inference using TensorRT-LLM and deployed with the TensorRT-LLM Backend.
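To make the decoding discussion concrete, the sketch below (in C++, with made-up logit values for a tiny vocabulary) shows how a greedy step and a temperature-based sampling step each pick the next token from a model's output logits. It is purely illustrative and not tied to any particular LLM framework.

```cpp
#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

// Convert raw logits into a probability distribution (softmax with temperature).
std::vector<double> softmax(const std::vector<double>& logits, double temperature) {
    std::vector<double> probs(logits.size());
    double max_logit = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - max_logit) / temperature);
        sum += probs[i];
    }
    for (double& p : probs) p /= sum;
    return probs;
}

int main() {
    // Hypothetical logits for a tiny 5-token vocabulary (illustrative values only).
    std::vector<double> logits = {2.1, 0.3, 1.7, -0.5, 0.9};

    // Greedy decoding: always pick the highest-scoring token.
    int greedy_token = static_cast<int>(
        std::max_element(logits.begin(), logits.end()) - logits.begin());

    // Sampling-based decoding: draw a token from the softmax distribution.
    std::vector<double> probs = softmax(logits, /*temperature=*/0.8);
    std::mt19937 rng(42);
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    int sampled_token = dist(rng);

    std::cout << "greedy token: " << greedy_token
              << ", sampled token: " << sampled_token << '\n';
}
```

Greedy decoding is deterministic and tends to be repetitive, while sampling trades determinism for diversity; improvement approaches such as top-k or nucleus (top-p) sampling restrict the distribution before drawing.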
Zero to GPU Hero with OpenACC
Aaron Jarmusch, University of Delaware
Porting and optimizing legacy applications for GPUs doesn't have to be difficult when you use the right tools. OpenACC is a directive-based parallel programming model that enables C, C++, and Fortran applications to be ported to GPUs quickly while maintaining a single code base. Learn the basics of parallelizing an application using OpenACC directives and the NVIDIA HPC Compiler: identify the important parts of your application, parallelize them for the GPU, optimize data movement, and improve GPU performance. Become a GPU Hero by joining this session.
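As a taste of the workflow the tutorial teaches, here is a minimal, hypothetical C++ sketch (not taken from any particular application) showing the typical pattern: a data directive keeps arrays resident on the GPU across iterations, and a parallel loop directive offloads the compute-intensive loop.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    const float a = 3.0f;
    float* xp = x.data();
    float* yp = y.data();

    // Keep both arrays on the GPU for the duration of the region.
    #pragma acc data copyin(xp[0:n]) copy(yp[0:n])
    {
        for (int iter = 0; iter < 100; ++iter) {
            // Offload the compute-intensive loop; iterations are independent.
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i) {
                yp[i] = a * xp[i] + yp[i];
            }
        }
    }
    std::printf("y[0] = %f\n", yp[0]);
}
```

Built with the NVIDIA HPC compiler and its -acc flag, the loop runs on the GPU; built without OpenACC support, the directives are ignored and the same source runs serially on the CPU.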
Panels
Will AI Evolve from Code Assistance to Critical Compiler Tools?
Sunita Chandrasekaran, Moderator, OpenACC User Group Chair, University of Delaware
LLMs can be worthwhile tools for programming and software development. As LLMs gain traction and become increasingly powerful, the fundamental question remains whether they will act as a primary source or a supplementary tool. So far, LLMs have proven largely useful for code assistance, but will they also be used for other critical purposes such as compiler development? Can LLMs learn how compilers are built or optimized? Could we build LLMs that help generate or recommend transformations of an intermediate representation that lead to better-optimized machine code?
Talks
Jack Wells, President of OpenACC, NVIDIA, and Barbara Chapman, Vice President of OpenACC, Hewlett Packard Enterprise (HPE)
Our organizational scope continues to grow, embracing a broad approach to accelerated computing and parallel programming and including a wider set of modeling, simulation, and AI initiatives led by our Open Hackathons program. In this talk, OpenACC President Jack Wells and Vice President Barbara Chapman share an update on the organization's accomplishments and discuss the activities on which the organization is focused going forward. They will also highlight opportunities for institutions and individuals to participate in outreach and service to the accelerated computing community.
Enhancing the Performance of High-Speed Engineering Flow Computations: The URANOS Case Study
Francesco De Vanna, Università degli studi di Padova
This talk discusses the efforts to enhance the performance of the in-house developed Computational Fluid Dynamics (CFD) solver URANOS. In particular, URANOS-2.0, an evolution of the 2023 solver release, is presented as optimized for pre-exascale architectures. As contemporary European HPC facilities within the current EuroHPC JU panorama utilize distinct GPU architectures—primarily AMD and NVIDIA—URANOS-2.0 adopts the OpenACC standard for portability. The latest release, resulting from several tuning and refactoring efforts, demonstrates excellent multi-GPU scalability, achieving strong scaling efficiency of over 80% across 64 compute nodes (256 GPUs) on both LUMI and Leonardo and weak scaling efficiency of about 95% on LUMI and 90% on Leonardo with up to 256 nodes (1024 GPUs). These improvements establish URANOS-2.0 as a leading supercomputing platform for compressible wall turbulence applications, making it ideal for aerospace and energy engineering tasks in the field of Direct Numerical Simulations (DNS), Wall-Resolved Large Eddy Simulations (WRLES), and the latest Wall-Modeled LES (WMLES). The open-source code is available at https://github.com/uranos-gpu/uranos-gpu.
LLM4VV: Exploring LLM-as-a-Judge for Validation and Verification Testsuites
Sunita Chandrasekaran, University of Delaware
Large language models (LLMs) are evolving and have revolutionized the landscape of software development. If used well, they can significantly accelerate the software development cycle. At the same time, the community is very cautious about models being trained on biased or sensitive data, which can lead to biased outputs along with the inadvertent release of confidential information. Additionally, the carbon footprint and lack of explainability of these black-box models continue to raise questions about the usability of LLMs.
Given the abundance of opportunities LLMs have to offer, this talk explores the idea of using an LLM as a judge for tests that evaluate compiler implementations of directive-based programming models, while also probing into the black box of LLMs. Based on our results, utilizing an agent-based prompting approach and setting up a validation pipeline structure drastically increased the quality of the judgments produced by DeepSeek Coder, the LLM chosen for this evaluation.
Ronald Caplan, Predictive Science Inc.
Standard language parallelism for accelerated computing has attracted great interest in recent years, as it can avoid the need for external APIs. Such APIs can have a non-trivial learning curve, be vendor-specific, or not be portable across vendors. Using a standard language can help domain scientists accelerate their code while staying within the language syntax they are familiar with.
For Fortran, the do concurrent (DC) loop construct has been shown to yield performance on NVIDIA GPUs on par with external APIs. Here, we test the support for DC on additional GPU platforms that have recently added support, including Intel GPUs using the Intel IFX compiler and AMD GPUs using HPE's CCE compiler. We use two production Fortran codes to test the current performance portability of DC across GPU vendors. We discuss implementation and compilation details, including the use of directive APIs such as OpenACC/OpenMP for data movement where needed or desired. Performance results are shown for real-world test cases on a variety of GPUs.
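The talk centers on Fortran's do concurrent; for readers more familiar with C++, the sketch below shows the same standard-language-parallelism idea using a C++17 parallel algorithm. This is a minimal illustration under that assumption, not code from the talk; compilers such as nvc++ can offload this execution policy to GPUs, playing the role that do concurrent plays in Fortran.

```cpp
#include <algorithm>
#include <execution>
#include <iostream>
#include <vector>

int main() {
    const std::size_t n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0), z(n, 0.0);
    const double a = 3.0;

    // Standard-language parallelism: the par_unseq execution policy asks the
    // implementation to run the transform in parallel (and, with nvc++ -stdpar,
    // to offload it to a GPU). Fortran's do concurrent expresses the same
    // intent at the loop level.
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), z.begin(),
                   [a](double xi, double yi) { return a * xi + yi; });

    std::cout << "z[0] = " << z[0] << '\n';
}
```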
IRIS-SDK: Intelligent Runtime System for Portable Programming in the Era of Extremely Heterogeneous Computing
Seyong Lee, Oak Ridge National Laboratory (ORNL)
From edge to exascale, computer architectures are becoming more heterogeneous and complex. These systems typically have fat nodes, with multicore CPUs and multiple hardware accelerators such as GPUs, FPGAs, and DSPs. This complexity is causing a crisis in programming systems and performance portability. Several programming systems are working to address these challenges, but the increasing architectural diversity is forcing software stacks and applications to be specialized for each architecture, resulting in poor portability and productivity. A more agile, proactive, and intelligent programming system is essential to increase performance portability and improve user productivity. To this end, this talk introduces IRIS, an intelligent runtime system for extremely heterogeneous computing that won a 2024 R&D 100 Award. IRIS and its ecosystem (IRIS-SDK) enable programmers to write portable and flexible programs across diverse heterogeneous architectures from edge to exascale by orchestrating multiple programming platforms in a single execution and programming environment.
Accelerating Multiple PDE-based Wave Simulations by Concurrent CPU-GPU Computing Using Directives
Kohei Fujita, The University of Tokyo
In this talk, I will explain our method for accelerating the repeated solution of time-evolution partial differential equation (PDE) problems, with guaranteed accuracy, through the concurrent use of CPUs and GPUs. Here, the memory-rich CPU is used to predict an initial solution from past time-history data, while the compute-efficient GPU simultaneously solves the PDE-based equations using an iterative solver. When applied to an elastic wave propagation problem on a single GH200 node, an 8.67-fold speedup was obtained compared with the conventional method running on the GPU. On the Alps supercomputer, a 6.98-fold speedup over the conventional GPU method was obtained, with a high weak-scaling efficiency of 94.3% up to 1,920 compute nodes. Although the method adds complexity to the base program, it can be implemented with low programming cost and high portability using OpenACC and OpenMP, indicating that directive-based parallel programming models are highly effective for analyses in heterogeneous computing environments.
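The core idea, overlapping a CPU-side prediction with a GPU-side iterative solve using directives, can be sketched as follows. This is a hypothetical C++ illustration of the pattern (placeholder kernels and made-up data, not the authors' implementation): an asynchronous OpenACC region keeps the GPU busy while OpenMP threads on the host extrapolate an initial guess from past time-history data.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 18;
    std::vector<double> u(n, 1.0), rhs(n, 0.5), guess(n, 0.0);
    std::vector<double> hist0(n, 0.9), hist1(n, 1.0);  // past time-history data
    double* up = u.data();   double* rp = rhs.data();
    double* gp = guess.data();
    double* h0 = hist0.data(); double* h1 = hist1.data();

    #pragma acc data copy(up[0:n]) copyin(rp[0:n])
    {
        // GPU: run a (stand-in) iterative solver sweep asynchronously.
        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; ++i) {
            up[i] = 0.5 * (up[i] + rp[i]);   // placeholder relaxation step
        }

        // CPU (OpenMP threads): meanwhile, extrapolate an initial guess for the
        // next time step from past time-history data.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            gp[i] = 2.0 * h1[i] - h0[i];     // simple linear extrapolation
        }

        // Synchronize before combining the GPU result with the CPU guess.
        #pragma acc wait(1)
    }
    std::printf("u[0] = %f, guess[0] = %f\n", up[0], gp[0]);
}
```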
The ECHO Code for Astrophysical Relativistic Plasmas: Acceleration on GPUs and Recent Applications
Luca Del Zanna, University of Florence
Sources of high-energy astrophysics are often characterized by the presence of a relativistic plasma, that is, an ionized fluid in which bulk flow velocities may approach the speed of light and temperatures, magnetic fields, or gravity may be extreme, requiring a relativistic treatment (general relativistic magnetohydrodynamics, GRMHD). However, GRMHD numerical simulations are computationally heavy, and the astrophysical community is drifting towards the use of GPUs or heterogeneous architectures, using various programming strategies (CUDA, SYCL, Kokkos, and OpenACC, to cite a few). Here I will describe the porting of the Eulerian Conservative High Order (ECHO) code for classic MHD and GRMHD, developed and continuously upgraded in Florence since 2000, to GPU-based platforms, simply by using modern ISO Fortran constructs such as DO CONCURRENT loops and PURE procedures. The new version is at least 16 times faster than the previous one, and it shows very good scaling up to 256 nodes (1024 NVIDIA Ampere A100 GPUs at CINECA). Numerical tests and applications to 3D relativistic MHD turbulence in astrophysical plasmas will also be shown. Our positive experience, which simply started with participation in an online Open Hackathon, may be useful to other scientists or engineers working with fluid dynamics (or MHD), and our results show that the standard language parallelism paradigm (either in Fortran or C++) could really be the path to follow for the future of high-performance computing.
Acceleration of Turbulent Transport Simulation in Hall Thrusters with OpenACC
Filippo Cichocki, ENEA, Frascati Research Center
HTheta is an in-house numerical code written in Fortran whose main goal is to study axial anomalous transport inside the channel of Hall thruster (HT) plasma propulsion devices. These devices exhibit an effective axial mobility of electrons across the radially imposed magnetic field that is 100 times larger than the value expected from collisional considerations, and it is commonly agreed that azimuthal fluctuations of the azimuthal electric field are mainly responsible for this. HTheta is an electrostatic 1D particle-in-cell code that covers the azimuthal direction inside the acceleration region of the HT channel in order to obtain such turbulent structures self-consistently and hence provide valuable fits for anomalous transport prediction. Being a particle-in-cell code, its main algorithms are the particle scatter, which obtains the charge density distribution from the positions of the charged particles; the Poisson equation solver, which obtains the electrostatic field from the computed charge density; and finally the particle push, which updates the positions and velocities of the macro-particles given the known electric and magnetic fields.
This talk will focus on the activities carried out and lessons learned during the CINECA Open Hackathon, whose main goal was to accelerate HTheta from a serial CPU code to a parallel GPU code using OpenACC directives. Most of the effort was put into accelerating the particle scatter and push subroutines, with significant code refactoring, while parallelization performance was analyzed using NVIDIA profiling tools (Nsight Systems and Nsight Compute).
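To illustrate why the scatter step needs care on a GPU, here is a schematic 1D particle-in-cell fragment in C++ with OpenACC directives (illustrative values and heavy simplifications, not the actual HTheta code): the scatter loop needs atomic updates because many particles deposit charge into the same cell, while the push loop is trivially parallel.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Schematic 1D electrostatic PIC fragment: scatter (charge deposition with
// atomic updates) and push (update of velocity and position from the field).
int main() {
    const int n_cells = 256, n_part = 100000;
    const double dx = 1.0, dt = 0.1, qm = -1.0;          // illustrative values
    std::vector<double> xp(n_part), vp(n_part, 0.0);
    std::vector<double> rho(n_cells, 0.0), efield(n_cells, 0.01);
    for (int p = 0; p < n_part; ++p) xp[p] = (p % n_cells) + 0.5;

    double* x = xp.data();  double* v = vp.data();
    double* r = rho.data(); double* e = efield.data();

    #pragma acc data copy(x[0:n_part], v[0:n_part], r[0:n_cells]) copyin(e[0:n_cells])
    {
        // Particle scatter: deposit charge onto the two nearest grid points.
        #pragma acc parallel loop
        for (int p = 0; p < n_part; ++p) {
            int i = static_cast<int>(x[p] / dx);
            double w = x[p] / dx - i;                     // linear weighting
            #pragma acc atomic update
            r[i % n_cells] += (1.0 - w);
            #pragma acc atomic update
            r[(i + 1) % n_cells] += w;
        }
        // (A Poisson solve for the field from rho would go here.)

        // Particle push: update velocity and position from the local field.
        #pragma acc parallel loop
        for (int p = 0; p < n_part; ++p) {
            int i = static_cast<int>(x[p] / dx) % n_cells;
            v[p] += qm * e[i] * dt;
            x[p] = std::fmod(x[p] + v[p] * dt + n_cells * dx, n_cells * dx);
        }
    }
    std::printf("rho[0] = %f, x[0] = %f\n", r[0], x[0]);
}
```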
Building the Roman Difference Imaging Pipeline
Lauren Aldoroty, Duke University
Dark energy comprises ~70% of the content of our Universe, yet its nature remains elusive. Type Ia supernovae (SNe Ia) are among the most precise distance indicators available to cosmologists, and studying them is critical to understanding the expansion of the Universe. NASA's Nancy Grace Roman Space Telescope (Roman) will provide an opportunity to study dark energy with unprecedented precision: it will discover on the order of 10^4 SNe Ia. Due to the large volume of expected data, the pipeline must be fast enough to enable follow-up observations of objects of interest. Additionally, Roman's wide field of view poses challenges to the accuracy of measurements. Developing methods to analyze this quantity of survey data with the required accuracy and precision is a critical obstacle to enabling transient science with Roman. In this talk, we describe the Roman difference imaging analysis (DIA) pipeline. We address these challenges by employing a fast Fourier transform (FFT) algorithm and running on GPUs in a high-performance computing cluster. We reflect on our participation in an Open Hackathon, our collaboration with NVIDIA, and the process of optimizing our pipeline.
How Open Hackathons Facilitated a Large Positive Impact to the ExaWind Project
Jon Rood, National Renewable Energy Laboratory (NREL)
ExaWind is a complex application that has been in development for over seven years to simulate the physics of entire wind energy farms down to the exact turbine geometries. It couples two main applications: AMR-Wind and Nalu-Wind. AMR-Wind uses structured grids with adaptive mesh refinement and solves the atmospheric domain between turbines, while Nalu-Wind uses unstructured grids and solves the flow directly over the turbines. The two applications are then overlapped using an overset methodology. In February 2024, the ExaWind application team from the National Renewable Energy Laboratory (NREL), along with two mentors from the Oak Ridge Leadership Computing Facility (OLCF), participated in a hackathon in collaboration with Open Hackathons with the very specific purpose of optimizing a large 16-turbine simulation on a large portion of the Frontier supercomputer at OLCF.
This talk will explain how the hackathon facilitated a 28x speedup of one of Exascale's most important simulations through the use of specific tools and crucial interactions with hackathon mentors, and will detail how the optimizations transpired. Attending the hackathon gave the NREL team the ability to focus on a single important and complex task while having the attention of skilled HPC engineers at OLCF, which culminated in a profound benefit to the ExaWind project.
Optimizing Code Translation with LLMs: A Reward-Driven Evaluation Pipeline for Model Alignment
Manish Bhattarai, Los Alamos National Laboratory
Code translation across programming languages faces challenges due to the lack of aligned datasets and biases in coding style, limiting model generalization. We propose a framework that combines Reward Modeling and Model Alignment to enhance translation quality and address these issues. Using 500,000 Fortran codes from Stack-V2, we train large language models (LLMs) without relying on direct translation pairs. Our method generates multiple translation variants and evaluates them based on execution metrics such as runtime, FLOPs, and memory usage, optimizing performance through a custom reward function. Coarse-grained profiling tools (gprof, Valgrind) and fine-grained LLM-based evaluations are used to align models with both computational efficiency and translation quality. We utilize NVIDIA NeMo's SteerLM for controllable fine-tuning, demonstrating initial success on the LLaMA 3.1-8B model using 200 GH200 GPUs. Our approach, through varied reward structures and LLM configurations, aims to overcome the limitations of current code translation frameworks, improving generalization, reducing bias, and enhancing the reliability of translations at scale.
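As a small illustration of how execution metrics like those mentioned above could be folded into a single scalar reward, here is a hypothetical C++ scoring function; the struct fields, weights, and formula are invented for illustration and are not the authors' actual reward model.

```cpp
#include <cmath>
#include <iostream>

// Hypothetical execution metrics gathered for one translated code variant.
struct ExecutionMetrics {
    bool   compiled_and_passed;  // did the variant build and produce correct output?
    double runtime_s;            // wall-clock runtime
    double gflops;               // achieved floating-point throughput
    double peak_mem_gb;          // peak memory usage
};

// Toy reward: reject failing variants, otherwise prefer fast, high-throughput,
// memory-frugal translations. Weights are illustrative only.
double reward(const ExecutionMetrics& m,
              double w_time = 0.5, double w_flops = 0.3, double w_mem = 0.2) {
    if (!m.compiled_and_passed) return -1.0;
    return w_time  * (1.0 / (1.0 + m.runtime_s))
         + w_flops * std::log1p(m.gflops)
         + w_mem   * (1.0 / (1.0 + m.peak_mem_gb));
}

int main() {
    ExecutionMetrics fast{true, 0.8, 120.0, 2.0};
    ExecutionMetrics slow{true, 5.0, 20.0, 8.0};
    std::cout << "reward(fast) = " << reward(fast)
              << ", reward(slow) = " << reward(slow) << '\n';
}
```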
Accelerating AI for Autonomous Navigation: Optimizing Navigation Transformers for Large-Scale Use Cases
Mehrnaz Sabet, Cornell University
As the integration of autonomous vehicles accelerates, there is an urgent need for faster and more efficient navigation models to meet growing demands. This challenge is particularly pronounced in air navigation, where small drones must operate autonomously and safely within shared airspace. Through a NASA-funded project, we have been developing AI-enabled solutions to facilitate the navigation of drones across various operations, including delivery, transportation, and inspection. Recognizing the need for scalable and efficient AI models for real-time air navigation, we participated in the OpenACC hackathon to explore how accelerated computing can enhance the training and inference performance of navigation transformers for large-scale applications.
In this talk, we will present our work on developing and optimizing two state-of-the-art navigation transformer models. By leveraging accelerated computing, we achieved a 55% improvement in training performance and a 40% increase in memory efficiency. Additionally, we developed a more compact navigation model to accelerate deployment in scenarios where balancing performance and generalizability is crucial. Attendees will gain insights into the architectural innovations, optimization strategies, and the broader implications of these advancements for the future of autonomous navigation. Our work bridges the gap between cutting-edge AI research and real-world applications, setting a new benchmark for efficiency in training and deploying navigation models.