December 18, 2019

OpenACC 3.0

At SC19, we announced the latest update to the OpenACC language specification, version 3.0.  It includes a number of minor updates in response to user requests, along with some clarifications.  We'll start with the process we use to develop the language, then cover what's new in 3.0 and what we're working on for subsequent versions.

We decided several years ago to be calendar-driven, instead of feature-driven. That means we try to release a new version of the spec every year, around the time of the SC conference.  Whatever features we finish working on by the end of that year become part of the next version.  Features that are still under discussion, or that require more work, are deferred to the following version.  Since releases are annual, if the remaining work is small, the delay is only a year.

Now for what's new in 3.0:

1. Updated base languages

We updated the base languages to C18, C++17, and Fortran 2018.  This was partly motivated by the desire to define behavior for C++ lambdas, which were added in C++14.  Behavior for many of the newer features in these languages isn't yet defined in OpenACC constructs; that work will continue in the coming years.  For instance, the behavior of the C++ parallel algorithms and the Fortran do concurrent construct inside compute constructs still needs to be defined.

2. Required a data clause on enter data and exit data directives

This was a change to avoid programmer errors.  In the 2.7 spec, an 'exit data' directive with no clauses was allowed, but it was a no-op.  No data was moved, nothing happened.  We saw this in some real programs, where the user added 'enter data' and 'exit data', expecting the 'exit data' to copy all the data from the 'enter data' directive back to the host, more or less like a 'data' construct.  This was a case where we wanted to tighten the spec to disallow an obvious error.

3. Added support for invoking C++ lambdas in a compute construct

This includes defining how the variables captured by the lambda are handled, and allowing a routine directive before a lambda name or definition.

4. Added a zero modifier to create and copyout data clauses

The zero modifier fills the device memory with zeros immediately after allocation.  Zero-filling on the device can be faster than copying the zeros over from CPU memory.

5. Added an if clause to the init, shutdown, set, and wait directives

This was requested by users, to avoid adding if statements to branch around the directives.

6. Added a devnum modifier to the wait directive and clause

This will allow synchronization across devices.  Many systems now have several GPUs on each node, so support for programs that use multiple GPUs is important.

7. Added an API routine acc_memcpy_d2d

This routine copies data directly between two devices.

What didn't make it into the 3.0 spec that we are actively working on?

One feature that we've been talking about for some years is "deep copy." This would allow programs with dynamic (pointer-based) data structures to define the desired behavior, and to copy the data structure more simply than is possible now with the manual deep copy directives. We have worked on deep copy for a couple of years, and even have an initial implementation available in the PGI compilers.  However, we were wary of trying to finalize the specification until we get feedback from that initial implementation, to learn whether it has enough features, or whether there are unnecessary features that should be removed.  We're still waiting on some of that feedback.

We are actively looking at more features for programming multiple devices. Currently, almost all programs that use multiple devices on a single node do this by using MPI parallelism on the node, one MPI rank per GPU. Some programs use OpenMP or some other thread model on the node, one thread per GPU. Almost no one tries to program multiple GPUs from a single host thread. Nevertheless, there are advantages to be gained from allowing direct interaction between the GPUs, such as synchronization between them or data movement between them, with minimal interaction from the CPU. There's even talk about defining how to distribute data and computation across devices, though to do that properly will require studying the lessons from High Performance Fortran and other similar languages.

A long-standing item is to define behavior for data clauses where the data is aliased.  The canonical example is:

#include <math.h>

void s(float* x, float* y, int n)
{
    #pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = sinf(x[i]);
}

What should happen when this routine is called with the same vector for both x and y?

 s(z,z,n);

The user wants both the copyin and copyout behavior to apply, but today that's not defined.

We are also looking at defining behavior for Fortran allocatable and Fortran pointer variables and arrays, making them as useful as C and C++ pointers are.  This includes assigning a Fortran pointer with a device memory address and allowing that Fortran pointer variable in a deviceptr clause.

If you have ideas or suggestions about possible improvements or additions to the OpenACC language, join the OpenACC Slack community or send email to feedback@openacc.org.

Author

Michael Wolfe
Michael Wolfe has worked on languages and compilers for parallel computing since graduate school at the University of Illinois in the 1970s. Along the way, he co-founded Kuck and Associates, Inc. (since acquired by Intel), tried his hand in academia at the Oregon Graduate Institute (since merged with Oregon Health & Science University), and worked on High Performance Fortran at PGI (since acquired by STMicroelectronics, and more recently by NVIDIA). He now spends most of his time as the technical lead on a team that develops and improves the PGI compilers for highly parallel computing, and in particular for NVIDIA GPU accelerators.