SC 2022 | OpenACC

OpenACC.org will be represented at SC22 by our members, Open Hackathons community, supporters, and users. Please join us at the conference to learn more about the OpenACC organization, explore Open Hackathon and Bootcamp opportunities and outcomes, share your research and feedback, discuss the OpenACC specification, and be a part of our growing user community and organization.

Birds of a Feather (BoF)

OpenACC User Experience: Relevance, Hackathons, and Roadmaps

Tuesday, November 15 | 5:15 - 6:45PM | Location: D170

OpenACC is focused on helping the developer community advance by expanding their accelerated parallel computing skills, and supports a directive-based, high-level accelerated programming model on CPUs, GPUs and other devices. OpenACC supports over 25 hackathons globally each year and has facilitated acceleration of over 200 applications on multiple platforms, e.g., Frontier, Perlmutter, JUWELS, Summit, Sunway TaihuLight, and Piz Daint. This BoF invites scientists, programmers and researchers to discuss their experiences in adopting OpenACC for scientific applications, learn about the roadmaps from implementers, share best practices in community-facilitated training in software development, and hear about the latest developments in the language specification.

Welcome | Training and Education | HPE Updates | GCC Updates | Use Case: KokkACC

Best Practices for Training an Exascale Workforce Using Applied Hackathons and Bootcamps

Wednesday, November 16 | 5:15 - 6:45PM | Location: C144-145

Given the anticipated growth of the HPC market, HPC is challenged with expanding the size, diversity, and skill of its workforce. As we move toward exascale computing, how best do we prepare future computational scientists, and enable established domain researchers to stay current and master tools needed for exascale architectures? This BoF invites scientists, researchers, trainers, educators, and the RSEs that support them to discuss current learning and development programs, explore adding in-person and virtual hackathons to existing training modalities, and brainstorm implementation strategies to bridge between traditional programming curricula and hands-on skills needed by diverse communities within different environments.

Workshops

2022 International Workshop on Performance Portability and Productivity (P3HPC)

Sunday, November 13 | 8:30AM - 5:00PM | Location: C155

The aim of this workshop is to bring together developers and researchers with an interest in practical solutions, technologies, tools, and methodologies that enable the development of performance-portable applications across a diverse set of current and future high‑performance computers.

We draw from a broad research audience that includes standard languages and runtimes, algorithmic techniques, tools, libraries and non‑standard techniques such as domain-specific languages. We also expect to see submissions from application teams documenting their experiences, good and bad, and those developing metrics and measurement techniques for performance portability and productivity.

This workshop has a proven track record of playing a shepherding role for the broader community in identifying and adapting to technology trends while fostering increased transparency and rigor for performance, portability and productivity. The importance of this role will only increase with the growing diversity and scope of hardware and software platforms, including more-complex workflows and data center scale. For more details visit the workshop website.

Ninth SC Workshop on Best Practices for HPC Training and Education

Monday, November 14 | 8:30AM - 12:00PM | Location: C144-145

The inherent wide distribution, heterogeneity and dynamism of the current and emerging high-performance computing and software environments increasingly challenge cyberinfrastructure facilitators, trainers and educators. The challenge is how to support and train the current diverse users and prepare the future educators, researchers, developers and policymakers to keep pace with the rapidly evolving HPC environments to advance discovery and economic competitiveness for many generations.

The ninth annual half-day workshop on HPC training and education is an ACM SIGHPC Education Chapter coordinated effort, aimed at fostering more collaborations among the practitioners from traditional and emerging fields to explore educational needs in HPC, to develop and deploy HPC training and to identify new challenges and opportunities for the latest HPC platforms. The workshop will also be a platform for disseminating results and lessons learned in these areas and will be captured in a Special Edition of the Journal of Computational Science Education. For more information, please visit the workshop website.

Seventh International Workshop on Extreme Scale Programming Models and Middleware (ESPM2 2022)

Monday, November 14 | 8:30AM - 5:00PM | Location: C156

Next generation architectures and systems being deployed are characterized by high concurrency, low memory per core, and multiple levels of hierarchy and heterogeneity. These characteristics bring out new challenges in energy efficiency, fault-tolerance, and scalability that must be tackled by next generation programming models and associated middleware/runtimes. This workshop focuses on different aspects of programming models such as task-based parallelism (Legion, Habanero, Charm++, X10, HPX, etc.), PGAS (OpenSHMEM, UPC/UPC++, CAF, etc.), Deep Learning (PyTorch, TensorFlow, etc.), directive-based languages (OpenMP, OpenACC) and hybrid MPI+X, etc. It also focuses on their associated middleware (unified runtimes, interoperability for hybrid programming, tight integration of MPI+X, and support for accelerators and FPGAs) for next generation systems and architectures. The ultimate objective of the ESPM2 workshop is to serve as a forum that brings together application developers and researchers from academia and industry working in the areas of programming models, runtime systems, and languages. For more details visit the workshop page.

Ninth Workshop on Accelerator Programming Using Directives (WACCPD 2022)

Friday, November 18 | 8:30AM - 12:00PM | Location: D174

Heterogeneous node architectures are becoming omnipresent in today’s HPC systems. Exploiting the maximum compute capability out of such systems, while also maintaining code portability and maintainability, necessitates accelerator programming approaches such as OpenMP offloading, OpenACC, standard C++/Fortran parallelism, SYCL, DPC++, Kokkos, and RAJA. However, the use of these programming approaches remains a research activity and there are many possible trade-offs between performance, portability, maintainability, and ease of use that must be considered for optimal use of accelerator-based HPC systems.

Toward this end, the workshop will highlight the improvements over state-of-the-art through the accepted papers. In addition, the event will foster discussion with a keynote/panel to draw the community’s attention to key areas that will facilitate the transition to accelerator-based HPC. The workshop aims to showcase all aspects of innovative high-level language features, lessons learned while using directives/abstractions to migrate scientific legacy code, experiences using novel accelerator architectures, among others. For more details visit the workshop page.

Fifth International Workshop on Emerging Parallel Distributed Runtime Systems and Middleware (IPDRM 2022)

Friday, November 18 | 8:30AM - 5:00PM | Location: C146

The role of runtime and middleware has evolved over the past decades as we have begun the exascale era. For leadership class machines, advanced runtime technology not only plays an important role in tasking but also has gained prominence in providing consistent memory across accelerator architectures, intelligent network routing, and performance portability, among other properties. With diminishing returns from hardware fabrication technology, clusters add more specialized accelerators such as FPGAs, CGRAs, and custom ASICs. These current trends highlight middleware challenges such as task/data management while adding new opportunities for exploiting application-specific engines. Further, advances in fields such as AI/ML provide novel opportunities for guiding and exploiting the hardware/software substrate. For these reasons, we propose a new iteration of the IPDRM workshop which will provide a venue for a diverse group of international researchers from universities, industry, research institutions, and funding agencies to discuss the pressing challenges of today’s runtime/middleware technologies. For more details visit the workshop page.

Talks

Leveraging Compiler-Based Translation to Evaluate a Diversity of Exascale Platforms

Sunday, November 13 | 9:37AM - 10:00AM | Location: C155

Accelerator-based heterogeneous computing is the de facto standard in current and upcoming exascale machines. These heterogeneous resources empower computational scientists to select a machine or platform well-suited to their domain or applications. However, this diversity of machines also poses challenges related to programming model selection: inconsistent availability of programming models across different exascale systems, lack of performance portability for those programming models that do span several systems, and inconsistent performance between different models on a single platform. We explore these challenges on exascale-similar hardware, including AMD MI100 and NVIDIA A100 GPUs. By extending the source-to-source compiler OpenARC, we demonstrate the power of automated translation of applications written in a single front-end programming model (OpenACC) into a variety of back-end models (OpenMP, OpenCL, CUDA, HIP) that span the upcoming exascale environments. This translation enables us to compare performance within and across devices and to analyze programming model behavior with profiling tools.

Performance Portability of Sparse Block Diagonal Matrix Multiple Vector Multiplications on GPUs

Sunday, November 13 | 11:37AM - 12:00PM | Location: C155

The emergence of multiple accelerator-based computer architectures and programming models makes it challenging to achieve performance portability for large-scale scientific simulation software. In this paper, we focus on a sparse block diagonal matrix multiple vector (SpMM) computational kernel and discuss techniques that can be used to achieve performance portability on NVIDIA- and AMD-based accelerators using CUDA, HIP, OpenACC, and Kokkos. We show that performance portability can vary significantly across programming models, GPU architectures, and problem settings, up to 52x in the explored problems. Our study visits the performance portability aggregation metric to guide the development and the selection of performance portable variants.
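The abstract does not say which aggregation metric it means. One widely used candidate in the performance-portability literature is the harmonic mean of per-platform efficiencies, defined to be zero when the application fails to run on any platform in the set. The sketch below assumes that definition; the function name, platform labels, and efficiency values are made up for illustration and are not taken from the paper.

```python
# Minimal sketch of a harmonic-mean performance portability score.
# ASSUMPTION: this is the metric meant by the abstract; the paper may differ.
# Platform names and efficiency values below are hypothetical.

def performance_portability(efficiencies):
    """Return the harmonic mean of per-platform efficiencies.

    `efficiencies` maps a platform name to an efficiency in (0, 1],
    or to None when the variant does not run on that platform.
    The score is 0.0 if any platform is unsupported (or the set is empty).
    """
    values = list(efficiencies.values())
    if not values or any(e is None or e <= 0 for e in values):
        return 0.0
    return len(values) / sum(1.0 / e for e in values)


# A variant that reaches 50% efficiency on one GPU and 100% on another.
portable = performance_portability({"gpu_a": 0.5, "gpu_b": 1.0})

# A variant that is fast on one GPU but does not run on the other.
not_portable = performance_portability({"gpu_a": 1.0, "gpu_b": None})
```

Because the harmonic mean is dominated by the weakest platform, a variant that is excellent on one GPU and poor on another scores lower than one that is uniformly good, which is the property that makes such a score useful for selecting variants.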

Blending Accelerated Programming Models in the Face of Increasing Hardware Diversity

Sunday, November 13 | 4:15PM - 5:00PM | Location: C155

The choice of programming model for accelerated computing applications depends on a wide range of factors, which weigh differently across application domains, institutions, and even countries. Why does one application use standard programming languages like C++, while another uses embedded programming models like Kokkos or directives such as OpenACC, and yet another directly programs in vendor-specific languages like CUDA or HIP? This panel will work through a comparison of the various choices, and share hands-on experience from developers in different countries and fields of expertise. We’ll explore both technical and non-technical reasons for how the various approaches are mixed. Join us for a fun and insightful session!

Benchmarking Fortran DO CONCURRENT on CPUs and GPUs Using BabelStream

Monday, November 14 | 3:30PM - 4:00PM | Location: C155

Fortran DO CONCURRENT has emerged as a new way to achieve parallel execution of loops on CPUs and GPUs. This paper studies the performance portability of this construct on a range of processors and compares it with the incumbent models: OpenMP, OpenACC and CUDA. To do this study fairly, we implemented the BabelStream memory bandwidth benchmark from scratch, entirely in modern Fortran, for all of the models considered, which include Fortran DO CONCURRENT, as well as two variants of OpenACC, four variants of OpenMP (2 CPU and 2 GPU), CUDA Fortran, and both loop- and array-based references. BabelStream Fortran matches the C++ implementation as closely as possible, and can be used to make language-based comparisons. This paper represents one of the first detailed studies of the performance of Fortran support on heterogeneous architectures; we include results for AArch64 and x86_64 CPUs as well as AMD, Intel and NVIDIA GPU platforms.

Runtimes Systems for Extreme Heterogeneity: Challenges and Opportunities

Wednesday, November 16 | 1:30PM - 3:00PM | Location: C155-156

The goal of this panel is to discuss the latest runtime evolution and the impact on applications. Advances in this matter are key to executing science workflows and understanding their results, enabling efficient execution on diverse platforms, ensuring scalability of high-level descriptions of analytics workflows, and increasing user productivity and system utilization. In other words, they determine how easily and rapidly a science team can develop or port a workflow to a new platform, and how well the resulting implementation makes use of the platform and its resources.

Our panel covers a large number of different runtimes, including OpenMP, OpenACC, SYCL, COMPSs, PaRSEC, OmpSs, and StarPU. This is a great opportunity to bring together some of the most important and widely used runtimes and programming models, and present/discuss the latest efforts on each of them and the different perspectives to face the challenges of the upcoming extreme heterogeneity era.

Challenges and Success Stories Migrating Software and Applications to Frontier

Friday, November 18 | 8:35AM - 9:35AM | Location: C146

This talk will share stories from the CAAR PIConGPU and ECP SOLLVE projects. The stories will present our experiences porting applications from pre-exascale systems to the exascale system Frontier. It will highlight challenges we faced preparing and using relevant software tools, including alpaka and the OpenMP and OpenACC programming models, among other tools. The talk will also present insights we gathered from profiler/performance analysis tools. Takeaways will be drawn from both projects to share with the IPDRM community and, at the same time, seek input from the audience so we can together improve our techniques and approaches.

Analysis of Validating and Verifying OpenACC Compilers 3.0 and Above

Friday, November 18 | 8:39AM - 9:06AM | Location: D174

OpenACC is a high-level directive-based parallel programming model that can manage the sophistication of heterogeneity in architectures and abstract it from the users. The portability of the model across CPUs and accelerators has gained the model a wide variety of users. This means it is also crucial to analyze the reliability of the compilers' implementations. To address this challenge, the OpenACC Validation and Verification team has proposed a validation testsuite to verify the OpenACC implementations across various compilers with an infrastructure for a more streamlined execution. This paper will (a) cover the new developments since the last publication on the testsuite, (b) outline the use of the infrastructure, (c) discuss tests that highlight our workflow process, (d) analyze the results from executing the testsuite on various systems, and (e) outline future developments.

OmpSs-2 and OpenACC Interoperation

Friday, November 18 | 9:06AM - 9:33AM | Location: D174

We propose an interoperation mechanism to enable novel composability across pragma-based programming models. We study and propose a clear separation of duties and implement our approach by augmenting the OmpSs-2 programming model, compiler and runtime system to support OmpSs-2 + OpenACC programming. To validate our proposal we port ZPIC, a kinetic plasma simulator, to leverage our hybrid OmpSs-2 + OpenACC implementation. We compare our approach against OpenACC versions of ZPIC on a multi-GPU HPC system. We show that our approach manages to provide automatic asynchronous and multi-GPU execution, removing significant burden from the application’s developer, while also being able to outperform manually programmed versions, thanks to a better utilization of the hardware.

KokkACC: Enhancing Kokkos with OpenACC

Friday, November 18 | 10:30AM - 10:57AM | Location: D174

Template metaprogramming is gaining popularity as a high-level solution for achieving performance portability. Kokkos is a representative approach that offers programmers high-level abstractions while most of the device-specific code generation is delegated to the compiler. OpenACC is a high-level and directive-based programming model. This model allows developers to insert hints (pragmas) into their code that help the compiler to parallelize the code. This paper presents an OpenACC back end for Kokkos: KokkACC. KokkACC provides a high-productivity programming environment and, potentially, a multi-architecture back end. This work demonstrates the potential benefits of having a high-level and descriptive programming model based on OpenACC. We observe competitive performance; in some cases, KokkACC is faster than the CUDA back end and much faster than OpenMP’s GPU offloading back end. This work also includes a detailed performance study conducted with a set of mini-benchmarks (AXPY and DOT product) and three mini-apps (LULESH, miniFE and SNAP, a LAMMPS proxy mini-app).

SPEL: Software Tool for Porting E3SM Land Model with OpenACC in a Function Unit Test Framework

Friday, November 18 | 10:57AM - 11:24AM | Location: D174

Because most high-end computers adopt hybrid architectures, porting large-scale scientific code onto accelerators is necessary. The paper presents a generic method for porting large-scale scientific code onto accelerators using compiler directives within a modularized function unit test platform. We have implemented the method and designed a software tool (SPEL) to port the E3SM Land Model (ELM) onto the GPUs in the Summit computer. SPEL automatically generates GPU-ready test modules for all ELM functions, such as CanopyFlux, SoilTemperature, and EcosystemDynamics. SPEL breaks the ELM into a collection of standalone unit test programs for easy code verification and further performance improvement. We further optimize several ELM test modules with advanced techniques, including memory reduction, DeepCopy, reconstructed parallel loops, and asynchronous GPU kernel launch. We hope our study will inspire new toolkit developments that expedite large-scale scientific code porting with compiler directives.

GPU-Accelerated Sparse Matrix Vector Product Based on Element-by-Element Method for Unstructured FEM Using OpenACC

Friday, November 18 | 11:24AM - 11:51AM | Location: D174

The development of directive-based parallel programming models such as OpenACC has significantly reduced the cost of using accelerators such as GPUs. In this study, the sparse matrix vector product (SpMV), which is often the most computationally expensive part of physics-based simulations, was accelerated by GPU porting using OpenACC. Further speed-up was achieved by introducing the element-by-element (EBE) method in SpMV, an algorithm that is suitable for GPU architectures because it requires a large number of operations but few memory accesses. In a comparison on one compute node of the supercomputer ABCI, using GPUs resulted in a 21-fold speedup over the CPU-only case, even when using the typical SpMV algorithm, and an additional 2.9-fold speedup when using the EBE method. The results of this analysis were applied to a seismic response analysis considering soil liquefaction, and using GPUs resulted in a 42-fold speedup compared to using only CPUs.
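The idea behind the element-by-element method can be sketched on a toy problem: instead of assembling and storing a global sparse matrix, each element's small local matrix is recomputed on the fly and applied to the gathered nodal values, trading extra arithmetic for less memory traffic. The example below is a serial Python illustration only, not the authors' OpenACC implementation; the 1D two-node mesh, stiffness values, and function names are all assumptions made for the sketch.

```python
# Toy element-by-element (EBE) matrix-vector product on a 1D mesh.
# ASSUMPTIONS: two-node elements, made-up stiffness values, hypothetical names.

def element_matrix(k):
    """Local 2x2 matrix of a two-node element with stiffness k."""
    return [[k, -k], [-k, k]]


def ebe_matvec(connectivity, stiffness, x):
    """Compute y = A x without ever storing the global matrix A."""
    y = [0.0] * len(x)
    for nodes, k in zip(connectivity, stiffness):
        ke = element_matrix(k)            # recomputed per element, not stored
        xe = [x[n] for n in nodes]        # gather nodal values
        for i, ni in enumerate(nodes):    # scatter-add the local product
            y[ni] += sum(ke[i][j] * xe[j] for j in range(len(nodes)))
    return y


def assembled_matvec(connectivity, stiffness, x):
    """Reference path: assemble a dense global matrix, then multiply."""
    n = len(x)
    a = [[0.0] * n for _ in range(n)]
    for nodes, k in zip(connectivity, stiffness):
        ke = element_matrix(k)
        for i, ni in enumerate(nodes):
            for j, nj in enumerate(nodes):
                a[ni][nj] += ke[i][j]
    return [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]


connectivity = [(0, 1), (1, 2), (2, 3)]   # three elements, four nodes
stiffness = [1.0, 2.0, 3.0]
x = [1.0, 2.0, 4.0, 8.0]

y_ebe = ebe_matvec(connectivity, stiffness, x)
y_ref = assembled_matvec(connectivity, stiffness, x)
```

Both paths give the same vector. In a parallel version, the scatter-add step is where neighboring elements write to shared nodes, so it would need atomic updates or element coloring to avoid races; the abstract does not say which approach the authors take.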

Posters

KokkACC: Enhancing Kokkos with OpenACC

Thursday, November 17 | 8:30AM - 5:00PM | Location: C1-2-3

Kokkos is a representative approach among template metaprogramming solutions that offers programmers high-level abstractions for generic programming while most of the device-specific code generation and optimizations are delegated to the compiler through template specializations. For this, Kokkos provides a set of device-specific code specializations in multiple back ends, such as CUDA and HIP. However, maintaining and optimizing multiple device-specific back ends for each new device type can be complex and error-prone. To alleviate these concerns, this paper presents an alternative OpenACC back end for Kokkos: KokkACC. KokkACC provides a high-productivity programming environment and, potentially, a multi-architecture back end. We have observed competitive performance; in some cases, KokkACC is faster than NVIDIA’s CUDA back end and much faster than OpenMP’s GPU offloading back end. This work also includes implementation details and a detailed performance study conducted with a set of mini-benchmarks (AXPY and DOT product) and two mini-apps (LULESH and miniFE).

Scalable GPU accelerated simulation of multiphase compressible flow

Thursday, November 17 | 8:30AM - 5:00PM | Location: C1-2-3

We present a strategy for GPU acceleration of a multiphase compressible flow solver that brings us closer to exascale computing. Given the memory-bound nature of most CFD problems, one must be prudent in implementing algorithms and offloading work to accelerators for efficient use of resources. Through careful choice of OpenACC decorations, we achieve 46% of peak GPU FLOPS on the most expensive kernel, leading to a 500-times speedup on an NVIDIA A100 compared to one modern Intel CPU core. The implementation also demonstrates ideal weak scaling for up to 13824 GPUs on OLCF Summit. Strong scaling behavior is typical but improved by reduced communication times via CUDA-aware MPI.