What's new in OpenACC 2.7!

In November 2018, OpenACC announced the latest update to the specification, version 2.7. It includes a number of minor updates to the previous version 2.6, in response to user requests drawn from their experiences using OpenACC in real applications. There are many changes throughout the text intended to clarify and simplify the specification without changing its meaning. Here we'll go through the nontrivial changes and their impact.

1. Treating the Host CPU as a device

In previous versions, it was unclear whether an implementation could treat the multicore host CPU as another device, or whether it was something distinct. OpenACC 2.7 is now clear that an implementation may treat the host as another device, or the only device.
This involved many textual changes throughout the spec, mostly changing phrases like 'copy data to the device' to 'copy data to a device' or 'to the current device'. No specific changes are required for implementations that already allow the host as a device, or that don't support the host as a device.

2. Partially-shared Memory Devices

Previous versions allowed for two types of devices: shared memory devices and non-shared memory devices. For instance, a classical GPU would be a non-shared memory device, whereas the host CPU would be a shared memory device. OpenACC 2.7 now also allows for devices that share some memory with the host but also have some discrete memory; accelerators with this behavior already exist, such as NVIDIA GPUs with CUDA Unified Memory. This change also involved many textual changes throughout the spec, mostly changing sentences like 'for a shared memory device, do this' to 'for data in shared memory, do this'.


3. Self clause for compute constructs

OpenACC 2.7 now allows a self clause on the parallel, kernels, or serial compute constructs. The self clause tells the implementation to execute the compute construct on the local device. This makes it straightforward to use both multicore parallelism and accelerator parallelism in the same program using OpenACC, without having to dynamically change the current device. Your program will use the self clause for compute constructs that you want to run in parallel on the multicore CPU, while other compute constructs will execute on the current device.

int ngpu = acc_get_num_devices(acc_device_nvidia);
// execute a parallel loop on the multicore host, one iteration per gpu
#pragma acc parallel loop self
for (int g = 0; g < ngpu; ++g) {
   // set the current gpu to a specific device number
   #pragma acc set device_type(acc_device_nvidia) device_num(g)
   // compute the data range for this gpu
   int ielt = (N+ngpu-1)/ngpu;
   int ilo = g*ielt;
   int ihi = ilo + ielt;
   if (ihi > N) ihi = N;
   // execute a parallel loop on the current gpu
   #pragma acc parallel loop copy(a[ilo:ihi-ilo])
   for (int i = ilo; i < ihi; ++i)
      a[i] = sinf(a[i]);
}

4. Array and Struct reductions

OpenACC 2.7 allows reductions for arrays and array sections, and for composite variables, where a composite variable is a C/C++ struct or class or a Fortran derived type. An array reduction is performed element-by-element, and a composite variable reduction is performed member-by-member. All members of the struct must have a datatype that is legal for the specified reduction operator. Users should be careful about asking for an array reduction with a large array section, since the implementation must allocate a temporary copy of the array section for each gang, worker, or vector lane involved in the reduction. In particular, this change allows a C or C++ complex sum reduction.

float _Complex cs;
float s[4];
#pragma acc parallel loop reduction(+:cs,s) default(present)
for (int i = 0; i < n; ++i) {
   cs += CMPLXF( a[i], b[i] );
   s[0] += a[i];
   s[1] += b[i];
   s[2] += c[i];
   if (d[i] > 0) s[3] += d[i];
}

5. Readonly modifier for copyin and cache

The copyin and cache clauses may include a readonly: modifier in OpenACC 2.7. This tells the compiler that the data will not be written, allowing more aggressive optimization, and in some cases allows use of faster non-coherent memory load instructions.

#pragma acc parallel loop copyin(readonly:a[0:n],c[0:n],w[0:2]) copy(b[1:n-1])
for (int i = 1; i < n - 1; ++i) {
  #pragma acc cache(readonly:c[i-1:3])
  b[i] = a[i]*2.0f + (c[i-1]*w[0] + c[i+1]*w[1]);
}

6. Default(none) or default(present) on data constructs

Previous OpenACC versions allowed a default(none) or a default(present) clause on compute constructs. If there were several compute constructs, the clause had to be repeated on each one. OpenACC 2.7 allows a program to add a default(none) or a default(present) clause on a data construct, which then implies that same clause on any compute construct that is lexically contained in that data construct. The following example has the same behavior as replicating the default(present) clause on each of the parallel loop constructs:

#pragma acc data copy(a[0:n], ...) default(present)
{
   #pragma acc parallel loop
   for (i = 0; i < n; ++i) {
      a[i] = ...
   }
   #pragma acc parallel loop
   for (i = 1; i < n - 1; ++i) {
      ... = a[i-1] ... a[i+1]
   }
}

7. Reduction implies copy

There were some inconsistencies and inefficiencies in the way reductions were implemented by different compilers. OpenACC 2.7 clarifies that a reduction clause on a compute (parallel or serial) construct implies a copy clause for the reduction variables. This could potentially change the behavior of some programs. For instance, in a program like the one below, it was unclear whether the reduction result gets stored in the device copy or the host CPU copy of s. OpenACC 2.7 clarifies that the reduction clause implies copy(s) on the compute construct, so the result gets stored in the device copy of s.

s = 0;
#pragma acc data copy(s)
 #pragma acc parallel loop reduction(+:s)
 for (i = 0; i < n; i++) {
   s += x[i];
 }

8. Definition of present

The specification uses the term 'present' to describe data that has been copied to device memory, but the term was never formally defined. OpenACC 2.7 clarifies the meaning of 'present' in terms of the present counters for that data. It also clarifies the behavior of the acc_map_data and acc_unmap_data API routines on those present counters.

9. Local device

The specification defines the term 'local device' to be the device on which the local thread is executing. When the program starts, the local thread is on the host, so the host CPU is the local device. When the program launches a compute region on the accelerator, the threads in that region will execute on the current device, which is the local device for those threads.

10. Var

The specification also uses the term 'var' to mean a variable, a subarray or array element, a composite variable member, or a Fortran common block name. This cleans up various places in the specification, since the whole list of possibilities doesn't need to be repeated at each appearance.

Changes to the OpenACC specification are largely driven by feedback from real users, meaning you. If you have a pressing need for something new, please describe that to feedback@openacc.org, so we can study it for subsequent releases.


Michael Wolfe
Michael Wolfe has worked on languages and compilers for parallel computing since graduate school at the University of Illinois in the 1970s. Along the way, he co-founded Kuck and Associates, Inc. (since acquired by Intel), tried his hand in academia at the Oregon Graduate Institute (since merged with the Oregon Health and Sciences University), and worked on High Performance Fortran at PGI (since acquired by STMicroelectronics, and more recently by NVIDIA). He now spends most of his time as the technical lead on a team that develops and improves the PGI compilers for highly parallel computing, and in particular for NVIDIA GPU accelerators.