What’s new in OpenACC 2.6?

Everyone is in a hurry today. And why not? We have CPUs with more cores than ever and we have GPUs with over 5,000 cores. The way to use these cores is to expose parallelism in our code. By doing that we can run faster (I don’t believe anyone has ever asked for less performance). One of the best ways to do this is to use OpenACC. OpenACC is great because it is simple to use (just “comments” in your code), allows the user to have control over the parallelization, and is portable.

As with any standard, OpenACC is evolving to meet the needs of more and more user applications. In November 2017, OpenACC announced version 2.6 of the specification. Sometimes reading specifications can be a bit dry, so to better understand the changes in 2.6, let’s divide them into three categories: (1) Big Changes, (2) Medium Changes, and (3) Small Changes.

Big Changes

Here are the changes to OpenACC that will have the largest impact on how users develop their codes:

1. Support for Manual Deep Copy
Deep copy has been the most requested OpenACC feature for some time. Version 2.6 adds support for manual deep copy through automatic and explicit pointer attach and detach behaviors, attach and detach clauses, and matching API routines. This is a first step toward a standard mechanism to copy and update pointer-based members of aggregate data structures between system memory and device memory.

The reason that Deep Copy is so important is that OpenACC only does a shallow copy between device and host so that nested dynamic structures are not fully copied to the device. For data structures that have pointers that point to other data, only the first level of data is copied to the device. All other data stays on the host, leading to performance issues.

This topic is discussed below in more detail.

2. Serial Construct (Offload serial code)
This is a popular request from users: offloading simple serial code regions that use or update data already present on the device. While the serial code itself may run a bit slower on the GPU than on the CPU, eliminating the data movement can be a big net performance win in some cases.

Medium Changes

These are changes that have some impact but not on a large portion of the user base. That said, they should not be ignored.

1. acc_get_property
This is the “acc_get_property” API routine that gets properties of the device. This allows programs to get information about the device, either for reporting or to make decisions about how to best use it at runtime.

2. no_create data clause on compute and data constructs
This is a user-requested feature to avoid a runtime error in a specific situation, where some array that appears in a compute region is sometimes not used, and when it is not used it will not be present on the device.

Small Changes

Here are the small changes, which are mostly clarifications of certain directives or more precise explanations, especially for individuals implementing the specification:

1. Clarification for reductions that span multiple nested loops
When a reduction spans multiple nested loops, the reduction clause must appear on each of the associated loops. This requirement also allows easier implementations.

2. Optional if and if_present clauses on host_data construct
This avoids having to insert runtime conditionals when a routine is called sometimes for processing on the host, and sometimes for processing on the device.

3. Define behavior of Fortran optional arguments in data clauses and procedure calls
This was an important clarification in the specification, to ensure that implementations support optional arguments coherently.

4. Clarified the data types used in the profile tool interface section
This is a minor update to make the intent of the values clearer to tool implementers.

Manual Deep Copy

Data structures, whether in C/C++ or Fortran, can contain pointers to other data, sometimes across more than one level (nested dynamic data structures). By default, copying one of these data structures from the host to the device copies only the top level of the structure. It can be difficult for compilers to follow pointers down a derived data type (a struct) when copying the data from the host to the device (and back). As a result, any pointers in the copied structure still point to data that is back on the host. In short, any dynamically allocated data in the data structure won’t get copied over.

To better understand this, consider a flat object model and pragma that look something like the following:

struct {
 int x[2];  // size 2
}*A;         // size 2
#pragma acc data copy(A[0:2])

If A is copied from the host to the device, the data storage looks like the following:


Since the object model is flat, that is, there are no dynamic parts in the structure, everything gets copied very nicely to the device.

In Fortran the equivalent code is something like the following:

type mytype
 integer :: x(2)
end type mytype
type (mytype) A(2)
!$acc data copy(A(:2))

However, if the data structure and pragma are something like the following:

struct {
 int *x;
}*A;         // size 2
#pragma acc data copy(A[0:2])

Then the host memory layout is roughly like the following:

The OpenACC pragma performs only a “shallow” copy to the device. The resulting memory layout on the host and the device looks like the following:


Notice that the structures in device memory point back to the elements on the host. OpenACC does not follow the pointers to copy the data they point to onto the device. You don’t want the GPU using the host copies of the data; you want it to use device copies. Additionally, copying data to and from the host can degrade performance.

The equivalent code in Fortran is something like the following:

type mytype
 integer, allocatable :: x(:)
end type mytype
type (mytype) A(2)
!$acc data copy(A(:2))

The data layout for Fortran is the same as C.

Before OpenACC 2.6, you had to copy all of the nested data to the device and change all of the references to use the pointers (something of a pain). In OpenACC 2.6 you can now copy the nested data and it will be connected (the specification calls it “attached”) in the data structure.

The simple example in OpenACC 2.6 now looks like the following:

struct {
  int *x;  // size 2
}*A;         // size 2
#pragma acc data copy(A[0:2])
#pragma acc data copy(A[0].x[0:2], A[1].x[0:2])

With the equivalent in Fortran being something like the following:

type mytype
 integer, allocatable :: x(:)
end type mytype
type (mytype) A(2)
!$acc data copy(A(:2))
!$acc data copy(A(1)%x(1:2), A(2)%x(1:2))

We’ve added a pragma to copy the nested dynamic data associated with A. This only works with compilers that support OpenACC 2.6, so be careful. Also, be sure to copy the dependent data after the main, or top-level, data.

The data layout for the example now looks like the following:


This is what you want to happen with nested dynamic data structures.

OpenACC at GTC 2018

During the NVIDIA GTC 2018 conference, there will be a large number of talks around OpenACC together with several excellent tutorials. A complete list of OpenACC events at GTC can be found on a GTC page.
Below is a snapshot of some of the sessions. Be sure to go to the above link to get a more complete list!

Talk Title

S8805 - Managing Memory of Complex Aggregate Data Structures in OpenACC: This talk dives into VASP, ICON, and other real-world applications to show how the deep copy issue is solved in them with the PGI compiler and OpenACC APIs.
S8291 - Acceleration of a Computational Fluid Dynamics Code with GPU Using OpenACC: The goal of this session is to report the knowledge acquired at the Oak Ridge GPU Hackathon, which took place on October 9th-13th, 2017, through the acceleration of a CFD (Computational Fluid Dynamics) solver.
Tutorial: S8382 - Zero to GPU Hero with OpenACC: Students will learn a profile-guided approach to accelerating applications, including how to find hotspots, how to use OpenACC to accelerate important regions of code, and how to get the best performance they can on GPUs. Several books will be given away to attendees who complete this tutorial.




Jeff Layton, Senior Solution Architect, NVIDIA