Announcing OpenACC 3.2

It’s been a year since we announced the completion of OpenACC 3.1, and I’m pleased to announce that we have now completed version 3.2. For the past several OpenACC releases we have kept to an annual cadence so that our releases aren’t held up by unfinished business and so that our implementers don’t have to absorb monolithic changes all at once. A common refrain on OpenACC technical calls is that we’d rather let a feature slip to the next release to get it right than hold up everything that’s complete just to finish one more thing. So, what has the OpenACC technical committee been working on this year?

Download OpenACC 3.2 Specification

The most notable enhancement we’ve made this year is the addition of an error handler API. For several years now we’ve had a goal of giving applications a way to catch and diagnose errors and then gracefully shut down. OpenACC applications are commonly run on some of the fastest computers in the world, and we don’t want a failure on one node to waste compute time on the rest of the nodes in a parallel job. We’ve extended the event handling API that we first introduced for profiling tools so that developers and tools can register an error handler routine, which can then run to gracefully shut down the application and potentially save state to help diagnose the error. It’s important to note that the API does not support recovering from an error, but it does enable developers to do things like inspect the state of their application, checkpoint their progress, or call MPI_Abort to shut down all instances of the application. Adding this feature brought another substantial improvement: it forced us to evaluate every API call and directive, consider when and why error conditions may occur, and define specific errors where there were previously blanket statements like “a runtime error may occur.” This careful examination of the specification should make debugging OpenACC applications much simpler for our developers. The feature took us a few years to get right, and we’re definitely proud of the results.

This careful examination of the specification had another notable side effect: we discovered a lot of repeated text. For quite a few of our runtime API routines, for instance, the text describing their behavior was nearly identical, which meant more text for readers to absorb and for us to maintain. As you look through version 3.2 you’ll discover that many sections have been combined, restated, or reorganized to eliminate unnecessary duplication. This change may not be very exciting to our users, but it improves readability and makes the specification quite a bit easier to maintain and implement. While we were at it, we also took the time to review the document for terms that the INCITS organization has identified as non-inclusive. It’s important to us that the specification follows industry best practices with regard to inclusiveness.

In terms of features that users will notice, one is the addition of “wait any” routines in the runtime API. Previously, if a program used multiple asynchronous queues in a loop, a matching loop may have been used to synchronize on these queues in order. There’s nothing wrong with this approach, but if the work was not well load-balanced, some queues may sit idle waiting for the queues ahead of them to synchronize. Other parallel programming models, such as MPI, have the ability to wait on any outstanding asynchronous work precisely to handle this kind of load imbalance, and we modeled our solution on that approach. The new acc_wait_any and acc_wait_any_device routines accept an array of asynchronous queues as an argument and return the index of a queue that is ready. The program can then process the work that’s dependent on this particular queue and move on or wait for another queue to complete. There is one complexity that I want to point out here: how to remove queues from the array of queues to test. If the routine returns a queue from the middle of the list, we didn’t want to force the developer to compress the list to remove the new gap, so instead the developer can replace that queue with the special value acc_async_sync to indicate that this position in the array should be skipped. I’ve provided an example of how this new feature may be used below.

int queues[10];
#pragma acc data copyin(list[0:10])
{
    for ( int i = 0; i < 10; i++ )
    {
        // Do some unbalanced operation on several queues
        #pragma acc enter data copyin(list[i].member[0:list[i].size]) async(i)
        // Put the queue number in the queues list; the index and queue number
        // do not need to match, as they do here.
        queues[i] = i;
    }
    int next;
    // Look for a queue that is ready to process
    while ( (next = acc_wait_any(10, queues)) >= 0 )
    {
        // Remove this queue from consideration next time around
        queues[next] = acc_async_sync;
        // Process work dependent on the queue that just completed
        #pragma acc parallel loop
        for ( int j = 0; j < list[next].size; j++ )
        {
            // do stuff
        }
    }
}

While we were improving the asynchronous behavior of the specification, we also made it possible to use the async clause on data constructs. This was initially left off the data construct because it introduces opportunities for race conditions between device and host data, in particular if a program frees data structures while they’re still being used on the device. It was also thought at the time that data allocation couldn’t be done asynchronously with respect to the host, but we’ve since seen evidence that this can be handled safely. After consulting with users, we decided that the convenience of making all data transfers associated with a data region asynchronous outweighed the potential for errors. Here are some things developers should remember when using this new feature:
Asynchronous execution isn’t guaranteed. An implementation may still choose to make the region synchronous.
The asynchronous behavior only applies to the data operations at the beginning and end of the region, not to all directives or API calls within the region. For instance, if the data region is asynchronous on queue 0 and it contains compute regions that enqueue on queue 1 (or if the compute regions are themselves synchronous), it’s the developer’s responsibility to ensure appropriate synchronization.
Just like with the existing unstructured data directives (enter data and exit data), it’s the developer’s responsibility to synchronize the asynchronous queue before using the data or modifying the data structure, for instance freeing the memory, on the host. 
Below is an example of an asynchronous data region in use.

// Mark this entire data region as asynchronous on queue 0
#pragma acc data copy(A[0:N]) async(0)
{
    // Execution MAY continue here before data allocation and copies complete.
    // This region MUST wait on queue 0 to ensure the data is ready, or enqueue
    // itself on queue 0 as well.
    #pragma acc parallel loop async(1) wait(0)
    for ( int i = 0; i < N; i++ ) { /* do stuff */ }
    // Since the data region MUST NOT copy or deallocate A until the parallel
    // region has finished, this wait is necessary.
    #pragma acc wait(1) async(0)
}
// Execution MAY continue here before data copies and deallocation occur.
// It's necessary to wait on queue 0 before operating on A to ensure the device
// has finished any data operations.
#pragma acc wait(0)

The final new user-facing feature that I will highlight is a change to how developers can initialize devices. Multi-accelerator systems have become commonplace in high performance computing, and we increasingly see OpenACC codes being written to support them. Prior to OpenACC 3.2, initializing a device of a certain type (an NVIDIA GPU, for instance) through the runtime API required initializing all devices of that type, which added certain overheads and side effects. The init directive had the ability to specify a device number, and some implementations had a workaround for the API, but the specification had no official way to initialize only a single device using the runtime API. In OpenACC 3.2 we’ve added acc_init_device and acc_shutdown_device to allow developers to initialize and shut down only a particular device on their system, hopefully reducing the overheads that some developers have found on multi-accelerator systems.

Another year and another great OpenACC release. Hopefully you’ll find something useful in this release. If you have comments or suggestions for how OpenACC can better serve your needs, please reach out to us at feedback@openacc.org. We take user feedback very seriously when developing the specification and would love to work with you to do more science with less programming. 


Jeff Larkin
Jeff Larkin is a Senior Developer Technologies Software Engineer at NVIDIA Corporation, where he focuses primarily on porting and optimizing HPC applications. Jeff is an active contributor to the OpenACC and OpenMP standards bodies and is the Chair of the OpenACC Technical Committee. Prior to joining NVIDIA, Jeff worked in the Cray Supercomputing Center of Excellence. He holds a M.S. in Computer Science from the University of Tennessee and a B.S. in Computer Science from Furman University.