June 12, 2025

Announcing OpenACC 3.4 at ISC 2025!

The OpenACC technical committee is proud to announce a new revision of the OpenACC specification! This is the first revision since November 2022, and countless volunteer hours have gone into maintaining, interpreting, and evolving the specification in the meantime; I’d like to thank everyone who has been involved. This revision has benefited from both longstanding members of the committee and the fresh eyes of new implementors asking questions.

Download OpenACC 3.4 Specification
 

The majority of the changes in OpenACC 3.4 are clarifications or restatements of existing behavior. In most cases, the only sign our users will see that these changes have been made will be improved consistency between current implementations and the eventual debut of new OpenACC compilers. Rather than walk through the nitty-gritty details, below I’ll highlight the main changes that our users will notice.

Data Clause Capture Modifier

Perhaps this release's most significant user-facing feature is the "capture" modifier. On shared/unified memory systems, data clauses are generally ignored because the data is already shared between the host and device; the capture modifier extends data clauses so that individual arrays can be placed in discrete device memory anyway. If you attended the OpenACC Birds-of-a-Feather (BOF) at Supercomputing 2024 in Atlanta, you’ve already heard about this feature and its availability in preview form in the NVIDIA compilers. This feature was too big to get wrong, so we wanted to expose it to users before finalizing the specification, and we received some great feedback from our users.

If the memory is already shared between the host and device, why is this even needed? Well, we encountered a few cases where unified memory and asynchronous execution could introduce errors in real user codes due to a race condition with the shared memory location. The capture modifier fixes this and allows users to reap the full benefits of shared memory and asynchrony. During the preview period for this feature, many of the developers we spoke to found even more ways, beyond what we’d originally envisioned, that it benefited their applications.

The following Fortran pseudocode represents a pattern that we found in user applications. The code was written before true shared memory devices started to emerge and has a significant problem when running with shared memory. When running with discrete memories, the runtime can allocate space for the stack variable "B" and then lazily deallocate that device memory at a later time, so it doesn’t matter that the subroutine may return before the device has finished using that memory. When running with shared/unified memory, however, the data clauses are ignored, so if the subroutine returns before the kernels have finished on the device, "B" may leave scope, meaning it would no longer be valid on the device.

Fortran
subroutine work(A, N)
  integer :: i, N
  real, dimension(N), intent(inout) :: A
  real, dimension(N) :: B
  !$acc enter data create(B(:)) async(1)
  !$acc kernels async(1)
  B(:) = 1.0
  !$acc end kernels
  ! A device copy of B is created here.
  !$acc parallel loop present(A(1:N),B(1:N)) async(1)
  do i=1,N
    A(i) = A(i) + B(i)
  end do
  !$acc exit data delete(B(:)) async(1)
  ! No synchronization here, so B is immediately deallocated on the host
  ! and (presumably) removed from the present table, deallocating it on
  ! the device too. If the implementation tracks properly though, maybe
  ! the deallocation is delayed or they're using stream-ordered memory
  ! allocation and freeing.
end subroutine work

 

To be fully transparent, we debated for a long time whether this code is valid, since it was written to assume that the host and accelerator have discrete memories and since it’s clear that the lifetime of "B" on the host may be shorter than the lifetime on the device. We eventually decided that rather than debate the legality, we would develop a solution so the user can safely do this on both shared and discrete memory devices.

The capture modifier was inspired a bit by C++ lambda capture. The idea is that the user can declare that for the dynamic lifetime of that region, a discrete device copy of the data is created, even if it already exists in shared memory. Now that same code would look like this:

Fortran
subroutine work(A, N)
  integer :: i, N
  real, dimension(N), intent(inout) :: A
  real, dimension(N) :: B
  !$acc data create(capture: B(:)) async(1)
  !$acc kernels async(1)
  B(:) = 1.0
  !$acc end kernels
  ! A device copy of B is created here.
  !$acc parallel loop present(A(1:N),B(1:N)) async(1)
  do i=1,N
    A(i) = A(i) + B(i)
  end do
  ! After this asynchronous data region completes, 'B' no longer requires
  ! discrete memory.
  !$acc end data
end subroutine work

With "B" safely captured in device memory, the subroutine can return and clean up the device version of "B" only after it is no longer needed on the device. 

A few caveats about this modifier before you use it in your code: First, it’s only available for structured data lifetimes ("data," "parallel," "kernels," "serial"). Dynamic lifetimes introduce some dangerous corner cases that we deemed too risky to solve, at least in this release of the specification. Second, because some of our creative early adopters found good use cases for it, we do allow you to capture on "copy" and "copyout," but it’s critical to ensure the memory you are copying out to is still available. In the example above, the lifetime of "B" ends with the subroutine, so if the runtime attempted to write to that memory, it would be an error.
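To make that second caveat concrete, here is a minimal C sketch of capturing on "copyout" in a safe, synchronous setting. The function and array names are mine, not from the specification, and the comments describe the intended OpenACC behavior; on a system without OpenACC the pragmas are simply ignored and the loop runs serially.

```c
#include <stddef.h>

/* Illustrative sketch: 'capture' on a copyout clause. The caller-provided
 * buffer 'out' must still be valid when the region finishes, because the
 * runtime writes the captured device copy back into it. */
void scale_into(const float *in, float *out, size_t n, float s)
{
    /* 'capture' forces a discrete device copy even on shared-memory
     * systems; 'copyout' writes it back to 'out' when the region ends. */
    #pragma acc parallel loop copyout(capture: out[0:n]) copyin(in[0:n])
    for (size_t i = 0; i < n; ++i)
        out[i] = s * in[i];
    /* This region is synchronous, so 'out' is guaranteed alive here.
     * With async(...), the caller would have to keep 'out' alive until
     * a matching acc wait completes, or the write-back would be an error. */
}
```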

I’m looking forward to seeing the ways this feature is useful in real codes and encourage you to share your experience with the community if you try "capture" in your application.

Improved Routine Parallelism Inferencing

When OpenACC was envisioned, C++ lambdas were not in common use, and C++ abstractions like Kokkos, Alpaka, or the standard C++ parallel algorithms did not exist. As these abstractions became more commonplace and users began developing backends using OpenACC, we received feedback that the OpenACC routine directive didn’t play as nicely with these approaches as we would like. In this release we’ve specified how compilers can infer the parallelism of a routine based on its context, so that the parallelism specification for the directive can be optional in these cases.

Taking a small step back, I’ve been asked before why acc routine requires a parallelism clause in the first place. Well, because so much of the decision of how to parallelize code is left to the compiler when using OpenACC, the compiler needed enough information to know both how to parallelize the loop containing calls to a routine and for which level of parallelism the routine itself should be built. Without both of these pieces of information it was difficult to build both the callsite and the routine without requiring advanced inlining, interprocedural analysis, or the ability to build all possibilities and decide at runtime. The routine directive allows the routine to reserve the type of parallelism it needs so that the compiler can build it and the callsite correctly.
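For contrast with the inferred examples below, here is a minimal sketch of the explicit style, where the routine reserves its level of parallelism up front. The function names are mine; the comments describe the intended OpenACC behavior, and without an OpenACC compiler the pragmas are ignored and the code runs serially.

```c
/* The explicit routine directive tells the compiler to build a device
 * version of saxpy_row that uses (at most) vector-level parallelism. */
#pragma acc routine vector
void saxpy_row(float *restrict y, const float *restrict x, float a, int n)
{
    #pragma acc loop vector
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* The callsite can then distribute rows across gangs, knowing each call
 * reserves vector parallelism for the loop inside the routine. */
void saxpy_all(float *restrict y, const float *restrict x,
               float a, int rows, int n)
{
    #pragma acc parallel loop gang
    for (int r = 0; r < rows; ++r)
        saxpy_row(&y[r * n], &x[r * n], a, n);
}
```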

We started by lifting the requirement to specify parallelism for C++ lambdas, since in common usage the compiler already has a lot of visibility into both the callsite and the code inside the lambda itself. In code like the example below, a compiler should have all the information it needs to build and effectively parallelize the code without the need for additional directives from the developer.
 

C/C++
// The C++ lambda function f must execute in a vector-single mode to ensure
// the containing parallel loop gets partitioned across vector lanes.
// The compiler implicitly adds: #pragma acc routine vector
auto f = []() {
  #pragma acc loop vector
  for (int i = 0; i < I; ++i)
    <some complex code>;
};
// Due to the implicit routine directive this loop is parallelized to gangs,
// reserving vector-level parallelism for the loop inside the lambda function.
#pragma acc parallel loop gang
for (int i = 0; i < I; ++i)
  f(); // implicitly vector routine.

Once we had this working for C++ lambdas and began getting data from compilers implementing this feature, we realized that the solution we’d developed for lambdas would work for some other routines, both in C/C++ and Fortran, when the right information is available to the compiler. Take, for instance, the example below:

C/C++
// The compiler implicitly adds: #pragma acc routine vector
void f() {
  #pragma acc loop vector
  for (int i = 0; i < I; ++i)
    <some complex code>;
}
// The C++ function g must execute in a worker-single mode to allow calling
// a vector routine f and to contain a loop partitioned across workers.
// The compiler implicitly adds: #pragma acc routine worker
void g() {
  f(); // implicitly vector routine.
  #pragma acc loop worker
  for (int i = 0; i < I; ++i)
    <some complex code>;
}
// Due to the implicit routine directives, all 3 loops in this code are
// parallelized as intended using different levels of parallelism in OpenACC:
// gang, worker, and vector.
void h() {
  #pragma acc parallel loop gang
  for (int i = 0; i < I; ++i)
    g(); // implicitly worker routine.
}

Conditional Atomics

I think this was the single feature that surprised me the most as we worked on 3.4. I’ll admit, I just didn’t understand this use case until I sat down with some of our users and they explained what they were doing and why this made sense. The "atomic" directive can be important for correctly parallelizing certain patterns; it ensures that certain data locations are always accessed atomically so that a data race doesn’t occur on that variable. So when a user asked us for the ability to make the atomic operation conditional, we didn’t understand why this would be useful, since it seemed to introduce a terrible race condition in the code.

Well, as it turned out, our users knew exactly what they were talking about and educated us along the way. This particular developer uses OpenACC both for GPU and CPU parallelization and, unsurprisingly, uses different parallelization schemes for each. When running on a GPU, the parallelization scheme results in a potential race condition on a particular variable, but when running on the CPU, it does not. Not only is the atomic unnecessary with the CPU parallelization scheme, but it’s terribly slow! Now, that user can add an "if" clause to the "atomic" directive and ensure that when it’s not needed for correctness, it won’t slow the code down.
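A minimal sketch of what that pattern might look like, assuming the 3.4 "if" clause sits directly on the "atomic" directive (the function name and flag are mine, not from the user's code). When the flag is false, the atomic protection, and its cost, is skipped; without an OpenACC compiler the pragmas are ignored and the loop runs serially either way.

```c
/* Count negative entries in parallel. On a GPU schedule the increments
 * can race, so the atomic is required; on a race-free CPU schedule the
 * caller can pass need_atomic = 0 and skip the (slow) atomic entirely. */
int count_negatives(const float *x, int n, int need_atomic)
{
    int count = 0;
    #pragma acc parallel loop copy(count)
    for (int i = 0; i < n; ++i) {
        if (x[i] < 0.0f) {
            /* New in OpenACC 3.4: the atomic applies only when the
             * condition is true at runtime. */
            #pragma acc atomic update if(need_atomic)
            count += 1;
        }
    }
    return count;
}
```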

Always Modifiers on Data Clauses

The final new feature I’d like to highlight is the "always" modifier on data clauses. Since OpenACC 2.0, the default behavior for data clauses has been “present or” behavior. Before data is copied to the accelerator, the “present table” is first queried to see whether the data is already there. If it is present, then the only action is to increment the present counter, which tells the runtime how many references there are to that data so that it won’t be removed before it’s safe to do so.
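A minimal sketch of that reference counting on nested regions (the function names are mine; the comments describe the intended OpenACC behavior, and without an OpenACC compiler the pragmas are ignored and the code runs serially):

```c
#include <stddef.h>

void inner(float *a, size_t n); /* defined below */

/* Outer structured data region: 'a' is copied in and its present
 * counter becomes 1. */
void outer(float *a, size_t n)
{
    #pragma acc data copy(a[0:n])
    {
        inner(a, n); /* counter is 2 inside inner's clause, 1 after */
    }                /* counter reaches 0: 'a' is copied out and freed */
}

/* This copy clause finds 'a' already present, so no transfer happens;
 * only the present counter is incremented and later decremented. */
void inner(float *a, size_t n)
{
    #pragma acc parallel loop copy(a[0:n])
    for (size_t i = 0; i < n; ++i)
        a[i] += 1.0f;
}
```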

But what if your data is changing and you want to update the accelerator more frequently? The "update" directive is the OG way to make the data on the host and device coherent, but we kept hearing from users that they’d like a more convenient way to do this. We started discussing this feature way back in 2018 and originally deemed it unnecessary, since the behavior was already supported with the update directive. Now, with OpenACC 3.4, we’ve finally added some syntactic sugar to make this pattern a little easier for our users. Below, you’ll see the pre-3.4 and version 3.4 way of accomplishing an “always copy-in” operation.

// Pre-OpenACC 3.4: force the transfer with an explicit update directive.
#pragma acc update device(a) if_present
#pragma acc parallel loop copy(a)
for (int i = 0; i < N; i++) { ... }


// OpenACC 3.4: the alwaysin modifier folds the update into the data clause.
#pragma acc parallel loop copy(alwaysin: a)
for (int i = 0; i < N; i++) { ... }

Prior to OpenACC 3.4, if you wanted to always update the values of a, even if the array is already present, you would use an "update" directive. Now with OpenACC 3.4, you can use the "alwaysin" modifier on the "copy" clause to declare that the data should be copied every time the clause is encountered, even if the data is already present.

With this new feature, we’ve introduced the "always," "alwaysin," and "alwaysout" modifiers to the appropriate data clauses to give you complete control over when data coherency is desired between the host and device.
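Here is one way the bidirectional "always" modifier might be used, assuming it is symmetric to the "alwaysin" example above (the function names are mine; without an OpenACC compiler the pragmas are ignored and the code runs serially):

```c
#include <stddef.h>

/* Operates on data already present on the device. */
void step(float *u, size_t n)
{
    #pragma acc parallel loop present(u[0:n])
    for (size_t i = 0; i < n; ++i)
        u[i] *= 0.5f;
}

/* Even though 'u' stays present across the outer data region, the
 * 'always' modifier forces a copy-in on entry to the inner region and a
 * copy-out on exit, keeping the host and device views coherent. */
void run(float *u, size_t n)
{
    #pragma acc data create(u[0:n])
    {
        #pragma acc data copy(always: u[0:n])
        {
            step(u, n);
        }
    }
}
```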

In Conclusion

I hope you're as excited as we are about the release of OpenACC 3.4. We believe that this version of the specification makes life easier both for developers who are implementing OpenACC in compilers and for our users. If you have any feedback about this release, want to share your OpenACC success story, or request something new, please reach out at feedback@openacc.org.

 

Author

Jeff Larkin
Jeff Larkin is a Senior Developer Technologies Software Engineer at NVIDIA Corporation, where he focuses primarily on porting and optimizing HPC applications. Jeff is an active contributor to the OpenACC and OpenMP standards bodies and is the Chair of the OpenACC Technical Committee. Prior to joining NVIDIA, Jeff worked in the Cray Supercomputing Center of Excellence. He holds a M.S. in Computer Science from the University of Tennessee and a B.S. in Computer Science from Furman University.