The Open Accelerated Computing Summit reflects our organization's evolution and commitment to helping the research and developer community advance science by expanding its accelerated and parallel computing skills. The 2025 Summit brings together preeminent researchers from national laboratories, research institutions, and supercomputing centers worldwide to discuss work that aligns with our focus areas.


Keynotes

Accelerating Trust and Transparency in Scientific Computing at Scale

Michela Taufer, University of Tennessee, Knoxville

As scientific computing embraces AI and heterogeneous architectures, the ability to produce results that are both fast and trustworthy becomes critical. In this talk, Dr. Michela Taufer from the University of Tennessee, Knoxville, explores how GPUs and other accelerators can be used not just to compute faster, but to compute better. She will showcase her team's novel graph-based methods for identifying nondeterminism in large-scale HPC simulations and their accelerator-driven Merkle-tree strategy that enables high-throughput, fine-grained checkpointing. By leveraging parallel hash computations and asynchronous GPU-accelerated pipelines, they reduce storage overheads by over 100× and enable real-time verification of AI-driven workflows. Also discussed is the use of containerized environments with provenance tracking to build transparency into adaptive AI pipelines. This work reflects a broader vision of future scientific infrastructure, where accelerators power not only performance, but also explainability and reproducibility.
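
To make the Merkle-tree idea concrete, the sketch below builds a root hash over fixed-size chunks of a checkpoint buffer: every hash within a tree level is independent of the others, so each level maps naturally onto parallel hardware, and a change to any single chunk changes the root, which is what makes fine-grained verification cheap. This is a CPU-side illustration in C with OpenMP and a toy FNV-1a hash; the function names and hash choice are ours, not the team's GPU-accelerated implementation.

    #include <stdint.h>
    #include <stdlib.h>

    /* Toy 64-bit FNV-1a hash; a real pipeline would use a stronger, GPU-friendly hash. */
    static uint64_t fnv1a(const void *data, size_t len)
    {
        const unsigned char *p = (const unsigned char *)data;
        uint64_t h = 1469598103934665603ULL;
        for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 1099511628211ULL; }
        return h;
    }

    /* Merkle root over n_chunks fixed-size chunks of buf. */
    uint64_t merkle_root(const unsigned char *buf, size_t chunk_bytes, size_t n_chunks)
    {
        uint64_t *cur = malloc(n_chunks * sizeof *cur);
        uint64_t *nxt = malloc(n_chunks * sizeof *nxt);

        #pragma omp parallel for            /* leaf hashes: one per chunk, all independent */
        for (size_t i = 0; i < n_chunks; i++)
            cur[i] = fnv1a(buf + i * chunk_bytes, chunk_bytes);

        for (size_t n = n_chunks; n > 1; n = (n + 1) / 2) {
            #pragma omp parallel for        /* each parent hashes its own pair of children */
            for (size_t i = 0; i < n / 2; i++)
                nxt[i] = fnv1a(&cur[2 * i], 2 * sizeof *cur);
            if (n & 1) nxt[n / 2] = cur[n - 1];   /* odd node is carried up unchanged */
            uint64_t *t = cur; cur = nxt; nxt = t;
        }
        uint64_t root = cur[0];
        free(cur); free(nxt);
        return root;
    }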

Accelerating the Development of Open and Trustworthy AI for Science: The Trillion Parameter Consortium

Charlie Catlett, Argonne National Laboratory, Trillion Parameter Consortium

Achieving the promise of frontier AI models for scientific discovery requires not only immense computational power but also innovation in many areas, from training to evaluation, which in turn requires multidisciplinary collaboration. Although only a handful of organizations have the resources to train these models at scale, new strategies that build on high-performance pretrained models, such as domain-specific fine-tuning and agentic constructs, have emerged. International collaboration is accelerating progress in key areas such as data preparation; evaluation for scientific reasoning, trustworthiness, and safety; and application-focused fine-tuning. This presentation highlights the progress of international working groups convened by the Trillion Parameter Consortium across these and other areas. Together, these efforts are enabling the scientific community to accelerate its progress and to navigate a rapidly evolving AI landscape.

 

Panel

Vibe Coding for Science

Fernanda Foertter, Moderator, University of Alabama

Vibe coding is one of the biggest AI trends of 2025. Coined by Andrej Karpathy, co-founder of OpenAI, the term describes AI-assisted coding with large language models such as ChatGPT or Claude, in which users “fully give in to the vibes, embrace exponentials, and forget that the code even exists.” Users describe what they want to build in natural-language prompts, and the AI generates the application, fixes its own errors, and performs testing with limited human intervention.

For researchers, the appeal is clear and undeniable: speed, exploration, and accessibility, to name a few, at a time when the pressure to navigate the complexities of legacy applications shares center stage with the domain science itself, turning many researchers into de facto computer scientists. Even as its popularity grows, however, vibe coding presents challenges for researchers and scientists, including flawed logic, over-engineering, lack of transparency, and diminished accuracy.

Led by Fernanda Foertter, executive director at the University of Alabama, this panel examines the advantages and disadvantages of vibe coding for science and discusses the pitfalls and solutions that researchers should be aware of. 

Talks

State of the Union Address: OpenACC Organizational Updates and Future Directions

Jack Wells, President of OpenACC, NVIDIA; and Barbara Chapman, Vice President of OpenACC, Hewlett Packard Enterprise (HPE)

Our organizational scope continues to grow, embracing a broad approach to accelerated computing and parallel programming that now includes a wider set of modeling, simulation, and AI initiatives led by our Open Hackathons program. In this talk, OpenACC president Jack Wells and vice president Barbara Chapman share an update on the organization's accomplishments and discuss the activities on which the organization will focus going forward. They will also highlight opportunities for institutions and individuals to participate in outreach and service to the accelerated computing community.

BERTHA and PyBERTHA: State-of-the-Art for Full Four-component Dirac-Kohn-Sham Calculations 

Loriano Storchi, Università degli Studi "G. d'Annunzio" Chieti - Pescara

This presentation will outline the historical progression of the BERTHA project, a computational code designed for four-component Dirac-Kohn-Sham (DKS) calculations. BERTHA's development spans many years, marked by continuous adaptation to advances in supercomputer hardware architectures. The discussion will cover the initial parallelization strategies employing MPI and OpenMP and culminate in the complete GPU port of the code, including the implementation of its Python API, PyBERTHA. Furthermore, the presentation will offer a quick overview of the code's diverse applications, notably real-time TDDFT and the NOCV/CD approach.

Exascale and Record-setting Simulation on Tightly Coupled CPU-GPU Platforms Enabled via OpenACC Directives

Spencer Bryngelson, Georgia Institute of Technology

We present an optimized OpenACC implementation of the recently proposed information geometric regularization (IGR) for simulation of compressible fluid flows at unprecedented scale, applied to multi-engine spacecraft boosters. We improve upon state-of-the-art computational fluid dynamics (CFD) techniques in computational cost, memory footprint, and energy-to-solution. Unified memory on tightly coupled CPU-GPU or APU platforms increases the attainable problem size with negligible overhead, and mixed half- and single-precision storage and computation are used on well-conditioned numerics. We simulate flow at 200 trillion grid points and 1 quadrillion degrees of freedom, exceeding the current record by a factor of 20, and achieve a factor-of-4 wall-time speedup over optimized baselines. Ideal weak scaling is observed on OLCF Frontier, LLNL El Capitan, and CSCS Alps using the full systems. Strong scaling is near ideal at extreme conditions, achieving 80% efficiency on CSCS Alps relative to an 8-node baseline when scaling to the full system.
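
As a rough illustration of the unified-memory point (a generic sketch, not the authors' solver, with hypothetical array names), the OpenACC loop below carries no data clauses at all: with managed or unified memory enabled, for example nvc -acc -gpu=managed with the NVIDIA HPC compilers, or on hardware-coherent APUs, the same host pointers are valid on the device and pages migrate on demand, so the resident problem size is bounded by combined CPU and GPU memory rather than GPU memory alone.

    #include <stddef.h>

    /* Generic explicit update: no copyin/copyout clauses are needed when the
     * arrays live in managed (unified) memory, e.g. ordinary malloc'd buffers
     * compiled with -gpu=managed, or memory on a coherent CPU-GPU APU. */
    void update_field(double *q, const double *rhs, size_t n, double dt)
    {
        #pragma acc parallel loop
        for (size_t i = 0; i < n; i++)
            q[i] += dt * rhs[i];
    }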

AI and HPC Applications on Leadership Computing Platforms: Performance and Scalability Studies

JaeHyuk Kwack, Argonne National Laboratory

As HPC systems move into the exascale era, an increasingly diverse range of processing hardware is being deployed. The last decade saw the ascendance of NVIDIA GPU-accelerated systems among the largest-scale HPC systems and spurred the need for application developers to consider approaches to performance portability that preserved developer productivity. This challenge has been compounded in the last several years by the introduction of the first two exascale systems, Frontier and Aurora (#2 and #3 on the November 2024 TOP500 list, respectively). These systems use new and different GPUs: the AMD MI250X on Frontier and the Intel Data Center GPU Max 1550 on Aurora. This study investigates the performance and qualitative performance portability of 12 HPC and ML applications on three large-scale HPC systems with GPUs from three different vendors: Frontier (AMD), Aurora (Intel), and Polaris (NVIDIA A100). The performance of these applications is evaluated at single-GPU, single-node, and multi-node scales on each system. We show that the figures of merit (FOMs) of the applications on a single GPU of Aurora and Frontier ranged from 0.9–4x and 0.8–2.5x, respectively, relative to the performance on a GPU of Polaris, and that the FOMs on a single node of Aurora and Frontier ranged from 1.3–6.3x and 0.8–2.6x, respectively, relative to a single node of Polaris. The applications were scaled up to 512 nodes, showing good scaling efficiency across the board. Finally, we discuss useful concepts and experiences gained in running diverse applications on diverse HPC systems.

Accelerating Scientific Computing through OpenACC

Ezhilmathi Krishnasamy, University of Luxembourg

This presentation will explore the programming capabilities offered by OpenACC, a directive-based programming model designed to facilitate the acceleration of scientific applications on GPUs. OpenACC provides a rich set of directives and runtime API routines that allow developers to achieve performance comparable to that obtained with the CUDA programming model. Despite its directive-based nature, OpenACC demonstrates that good performance can indeed be achieved, as evidenced by prior scientific research and implementations.

The talk will highlight key insights regarding OpenACC, detailing its usage in terms of both data transfer and computational kernels. By understanding how to effectively leverage OpenACC, participants will gain valuable knowledge on optimizing their scientific computations and enhancing the performance of their applications on GPU architectures.
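
As a concrete, deliberately simple illustration of the two ingredients named above, the sketch below pairs an explicit OpenACC data region, which controls host-device transfers, with a parallel loop that expresses the computational kernel. It is a generic SAXPY example rather than code from the presentation, and should compile with, for example, nvc -acc -Minfo=accel or gcc -fopenacc.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const int n = 1 << 20;
        const float a = 2.0f;
        float *x = malloc(n * sizeof *x);
        float *y = malloc(n * sizeof *y);
        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        /* Data transfer: x is only read on the device, y is copied in and back. */
        #pragma acc data copyin(x[0:n]) copy(y[0:n])
        {
            /* Computational kernel: every iteration is independent. */
            #pragma acc parallel loop
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }

        printf("y[0] = %f\n", y[0]);   /* expect 4.0 */
        free(x); free(y);
        return 0;
    }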

A Fast and Efficient Code for Interface-resolved Simulations of Multiphase Turbulence

Alessio Roccon, University of Udine

Turbulent multiphase flows are ubiquitous in nature and in everyday life, playing a key role in many applications. Obtaining an accurate description of the dynamics of a dispersed multiphase flow is challenging because of the wide separation of scales that characterizes these flows. In this talk, we present MHIT36, a GPU-tailored solver for simulations of turbulent flows laden with drops and bubbles. The framework couples direct numerical simulation (DNS) of the Navier–Stokes equations, which describe the flow field, with a phase-field method to capture interfacial phenomena. Simulations are performed in a triply periodic domain, and the governing equations are discretized using a second-order finite difference scheme. The accurate conservative diffuse interface (ACDI) formulation is used to describe the transport of the phase-field variable.

From a computational perspective, MHIT36 adopts a two-dimensional domain decomposition with workload distributed across MPI tasks. The cuDecomp library is used for pencil transpositions and halo exchanges, while cuFFT and OpenACC directives accelerate the remaining computational kernels on GPUs. This parallelization strategy delivers excellent scaling efficiency up to 1024 GPUs, while preserving a modular structure that facilitates extension and modification. Performance was further enhanced during an Open Hackathon organized by CINECA, where specialized support enabled a 26x speedup.
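
For readers unfamiliar with what such kernels look like, the fragment below shows a generic second-order central-difference Laplacian over a local pencil, with the three spatial loops collapsed into a single parallel iteration space. The routine and array names are illustrative, the grid spacing is assumed uniform, and the arrays are assumed to have been placed on the device by an enclosing data region; it sketches the pattern rather than MHIT36's actual code.

    /* Second-order Laplacian of f on an nx x ny x nz pencil (interior points only). */
    void laplacian(const double *restrict f, double *restrict lap,
                   int nx, int ny, int nz, double dx)
    {
        const double idx2 = 1.0 / (dx * dx);
        #pragma acc parallel loop collapse(3) present(f[0:nx*ny*nz], lap[0:nx*ny*nz])
        for (int k = 1; k < nz - 1; k++)
            for (int j = 1; j < ny - 1; j++)
                for (int i = 1; i < nx - 1; i++) {
                    const int c = (k * ny + j) * nx + i;      /* linear index */
                    lap[c] = idx2 * (f[c - 1]       + f[c + 1]
                                   + f[c - nx]      + f[c + nx]
                                   + f[c - nx * ny] + f[c + nx * ny]
                                   - 6.0 * f[c]);
                }
    }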

Accelerating Wind Farm Wake Modeling with GPUs: The FLORIS Hackathon Experience

Rafael Mudafort, National Renewable Energy Laboratory (NREL)

Optimizing wind farm layout and control requires fast, accurate modeling of wake interactions between turbines. FLORIS, a steady-state wind farm simulator developed at the National Renewable Energy Laboratory (NREL), provides a flexible Python-based framework for this task. While FLORIS has seen extensive CPU optimizations, its Python-centric design has historically limited GPU acceleration efforts. With recent advancements in Python GPU libraries such as Numba and cuPyNumeric, the FLORIS team saw an opportunity to bring GPU computing to this highly parallel problem without abandoning the project’s extensibility requirements.

In February 2025, Team FloRizz participated in the NOAA/NREL/NCAR Open Hackathon to develop a GPU-accelerated wake solver for FLORIS. This talk will share the team’s journey from early challenges with cuPyNumeric to designing a new solver architecture tailored for GPU execution with Numba kernels, ultimately leading to a 190x speedup on a real-world challenge problem. I’ll discuss key technical decisions, performance results, and the trade-offs encountered during development. Finally, I’ll highlight follow-up work inspired by the hackathon and the broader impact on wind farm design and control optimization.

Accelerated Emulators for High Fidelity Population Synthesis Simulations

Anarya Ray, Northwestern University

Population synthesis simulations map initial conditions of interest to a unique distribution of observable system parameters. High-fidelity simulators hold the promise of novel discoveries from growing empirical datasets. However, realistic simulations are costly to implement across a dense grid of initial conditions. 

In this work, we develop scalable deep learning emulators for high-fidelity simulation frameworks. These emulators learn the underlying distribution of the simulator outputs as a function of input parameters. By efficiently interpolating across the grid of input parameters used to construct training sets, they enable accurate simulation-based inference and scalable data generation/feature prediction. 

In this talk, we discuss methods for accelerated training through distributed data loading across multiple GPUs, and demonstrate approaches for quantifying uncertainties in emulator predictions. We present the details of our emulator model, its training and optimization, and its application as a powerful generative tool in a diverse collection of problems that involve learning the underlying distribution of simulated data for performing inference, augmentation, and prediction. 

Finally, we showcase the performance of our emulators in the context of high-fidelity population synthesis simulations of astrophysical systems such as single and binary stars. While these demonstrations are astronomy-focused, our models can be applied to any problem that requires conditional density estimation of simulation outputs interpolated over initial condition grid-points for accurate density evaluation and scalable data generation.

Accelerating Stellar Structure Modeling with Neural PDE Solvers

Santiago López Tapia, Northwestern University

In this talk, we discuss the development of a Physics-Informed Neural Network (PINN) architecture to solve the steady-state structure equations of single stars as a mesh-free alternative (or complement) to traditional solvers like MESA. Focusing on the hydrostatic regime, we rewrite the governing equations (mass conservation, hydrostatic equilibrium, radiative transport, and energy generation) using enclosed mass as the independent variable. Our PINN outputs temperature, pressure, radius, and luminosity as functions of mass while incorporating boundary conditions as hard constraints within the architecture. Opacity and density are supplied to the PINN through auxiliary neural networks that are pre-trained to interpolate semi-empirical tabulated data with an R² above 99.9%. We demonstrate that the PINN achieves high accuracy in both supervised mode (leveraging sparse MESA outputs) and physics-only mode (solving purely from PDE residuals), with R² scores of 99.9% and 99.6%, respectively. The model is a multilayer perceptron (MLP) with sinusoidal activation functions (SIREN) and layer normalization, and, unlike traditional PINNs, its outputs are constrained so that the boundary conditions are satisfied exactly. While time dependence and convection are not yet included, our preliminary experiments show the promise of neural PDE solvers for accelerating stellar modeling. Ongoing work will extend this to full evolutionary models and binary interactions.
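
For reference, the standard textbook form of these four equations with the enclosed mass m as the independent variable is shown below; this is our transcription of the usual Lagrangian formulation, not necessarily the exact closures used in the talk.

    \frac{dr}{dm} = \frac{1}{4\pi r^{2}\rho}, \qquad
    \frac{dP}{dm} = -\frac{G m}{4\pi r^{4}}, \qquad
    \frac{dL}{dm} = \epsilon_{\mathrm{nuc}}, \qquad
    \frac{dT}{dm} = -\frac{3\kappa L}{64\pi^{2} a c\, r^{4} T^{3}}

Here \rho is the density, P the pressure, L the luminosity, T the temperature, \kappa the opacity, \epsilon_{\mathrm{nuc}} the nuclear energy generation rate, a the radiation constant, and c the speed of light; the PINN's outputs (T, P, r, and L as functions of m) are exactly the dependent variables of this system.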

Deep Neural Operators for Detailed Binary Evolution Simulation

Philipp Moura Srivastava, Northwestern University

Simulations of binary star systems are essential to studying important astronomical phenomena, including, but not limited to, gamma-ray bursts, radio pulsars, gravitational-wave mergers, supernovae, and X-ray binaries. Modern-day simulation software enables large-scale binary population studies but, unfortunately, omits large parts of the simulations. This renders some studies of these phenomena infeasible, and while recent work has achieved impressively low error rates by approximating the omitted parts using traditional signal-processing techniques, these error rates may not be sufficient. Typically, these methods work by creating an alignment between similar reference simulations and assigning a weight to each simulation. In this talk, we present a novel application of deep neural operators to solve initial value problems whose solutions describe the omitted parts of the simulations. Compared to traditional partial differential equation solvers, our approach accelerates the simulations by several orders of magnitude, making previously intractable studies feasible while significantly reducing energy consumption. We also explore the role of physics-informed loss functions in binary simulation approximation; specifically, their impact on our learned representations, the convergence of our architecture weights, and the physical consistency of predicted simulations.

StarEmbed-GPT: Toward a Foundation Model for General-purpose Inference on Variable Stars

Hong-Yu Chen, Northwestern University

We study zero-shot inference for periodic variable star classification using time-series foundation models (TSFMs). Specifically, we generate TSFM embeddings of ZTF light curves and evaluate a broad task suite: clustering, classification with four lightweight classifiers, period regression, and out-of-distribution (OOD) detection. The TSFMs yield information-rich embeddings that are competitive with strong baselines for clustering, classification, and OOD detection; cross-survey generalization remains an open direction we are actively studying. Building on these results, we have begun developing StarEmbed-GPT, a Transformer-based model fine-tuned on large ZTF light-curve corpora to produce reusable representations intended to support a broad set of downstream tasks on irregular time series from different surveys.

In this talk, we discuss our efforts to accelerate iteration on large datasets during the NCSA Open Hackathon, where we modified the AutoGluon/Chronos-Bolt fine-tuning code to support multi-GPU training, reducing training time by roughly 3.3×. Profiling our code with NVIDIA Nsight further helped us identify bottlenecks, such as GPU idle time caused by the data loader and DDP synchronization overhead, and informed directions for continued optimization.