November 13, 2022

Announcing OpenACC 3.3 at SC22!

Another year has passed since the release of OpenACC 3.2 and, once again, the technical committee has been hard at work maintaining and improving the OpenACC specification. As a committee we decided several years ago that an annual release cadence is the best way to ensure that improvements are delivered in a thoughtful and timely manner. This does mean that sometimes features we've been working on don't quite make it into the final release, but I think this year's release is an important one for our users and implementers.

The main focus of the technical committee this year has been to identify, refine, and standardize extensions that appear in existing OpenACC applications. These features have already proven themselves important in certain applications, so standardizing them brings them to light for more developers and improves the portability of the applications that use them.

Increased Gang Parallelism

The first feature I'll highlight is an increase in the number of dimensions of gang parallelism that can be expressed in OpenACC code. For those unfamiliar, gangs are OpenACC's coarsest level of parallelism. Gangs operate independently, so although workers within a gang have a degree of data sharing and coordination available to them, gangs have no knowledge of other gangs and are not even guaranteed to run simultaneously. On GPU machines, gangs are generally implemented as CUDA thread blocks or OpenCL workgroups. To date, OpenACC has been limited to one dimension of gangs, and for many applications that was enough, but for certain coding patterns this was too limiting to expose sufficient parallelism for modern hardware. In particular, for applications with deep nests of function or subroutine calls, the entirety of the call tree may contain a lot of parallelism, but no single level of that call tree contains all of it. OpenACC 3.3 addresses this by enabling up to 3 dimensions of gangs, which may appear in different routines. This means that the total number of OpenACC gangs is the product of up to 3 dimensions of gangs, potentially increasing the parallelism of an application by a huge amount.
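
To put a rough number on it, a kernel launched with, say, num_gangs(64,1024) (hypothetical values chosen only for illustration) creates 64 × 1024 = 65,536 gangs in total, far more parallelism than either dimension exposes on its own.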

Take the code below as an (admittedly contrived) example. The application has a high-level loop suitable for gang parallelism, but calls into a function that also does work that is a candidate for gang parallelism, which in turn calls into a function containing a loop that is a good choice for vector parallelism. Prior to OpenACC 3.3, either the "i" or the "j" loop could be used for gang parallelism, but not both. It may be tempting to use worker parallelism on "j", but worker parallelism won't help to scale the work up to a large GPU the way gang parallelism will. As seen in the second listing, OpenACC 3.3 allows specifying up to a 3D grid of gangs (only 2D is used in the example). I can then assign both the "i" and "j" loops to a dimension of gangs, expanding the parallelism by a factor of "N". This specific example may seem a bit trivial, but this pattern has cropped up in several real science applications.

Mapping parallelism to gangs and vectors in nested routines with OpenACC

Using just gang+vector in OpenACC < 3.3:
// distribute the loop iterations over gangs
#pragma acc parallel loop gang
for (int i = 0; i < M; i++)
{
  outer(data, N);
}

// declare that this routine requires vector parallelism
#pragma acc routine vector
void outer(double *data, int N)
{
  for (int j = 0; j < N; j++)
  {
    // Do Something
    inner(data, N);
    // Do Even More
  }
}

// execution of this inner routine will also use vector parallelism
#pragma acc routine vector
void inner(double *data, int N)
{
  // parallelize this loop using vector parallelism
  #pragma acc loop vector
  for (int k = 0; k < N; k++)
  {
    // Do Something
  }
}

Newly introduced gang parallelism in OpenACC 3.3:

// create a two-dimensional grid of gangs and map the iterations
// of this loop to the outer gang dimension
#pragma acc parallel loop num_gangs(N,M) \
        gang(dim:2)
for (int i = 0; i < M; i++)
{
  outer(data, N);
}

// execute this routine at the gang level of parallelism
#pragma acc routine gang(dim:1)
void outer(double *data, int N)
{
  // map the iterations of this inner loop to the inner gang dimension
  #pragma acc loop gang(dim:1)
  for (int j = 0; j < N; j++)
  {
    // Do Something
    inner(data, N);
    // Do Even More
  }
}

// execute this routine using vector parallelism
#pragma acc routine vector
void inner(double *data, int N)
{
  // parallelize this loop using vector parallelism
  #pragma acc loop vector
  for (int k = 0; k < N; k++)
  {
    // Do Something
  }
}


Collapsing of Non-Tightly Nested Loops

Collapsing loops is my go-to first step in optimizing OpenACC loops. More often than not, the simple act of collapsing loops together and giving the compiler the grand total of all iterations of a loop nest to parallelize is a performance win. Unfortunately, not all loop nests are perfectly nested bundles of parallelism, so this trick doesn't work all the time. It's sometimes possible for the compiler to collapse loops that aren't tightly nested by moving the intervening code (the code that appears between two loops) and executing it redundantly, but it's not always safe or profitable to do this, so compilers generally won't do it by default. Until now, even if I as a programmer believed it would be safe for the compiler to make this transformation, OpenACC lacked the means for me to communicate that to the compiler. In 3.3 we've changed this. We discussed removing the restriction that collapsing applies only to tightly nested loops altogether, but believed it would be best for the developer to take an active role in opting into this new behavior by asserting that it's safe and desirable for the compiler to collapse the loops by rearranging the code. We thought it better that an advanced user opt into this capability than have it occur automatically and leave the user wondering what happened if the behavior isn't what they expected. For that reason, we've added the "force" modifier to the "collapse" clause so that developers can communicate their intent to the compiler.

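For anyone who hasn't used the clause before, the usual, tightly nested case looks like the sketch below (the loop bounds and body are placeholders): with collapse(2), the compiler treats the two loops as a single iteration space of M*N iterations, all of which it is free to parallelize.

// collapse the two tightly nested loops into a single iteration
// space of M*N iterations for the compiler to parallelize
#pragma acc parallel loop collapse(2)
for (int i = 0; i < M; i++)
{
  for (int j = 0; j < N; j++)
  {
    // Do Something with element (i, j)
  }
}
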
The code below demonstrates this new capability. The first listing shows that not all code can be automatically collapsed, in this case due to the indirect indexing of "nx" and "ny", which come from arrays "i2" and "i3". The developer has put a "loop" directive on the innermost loop, but because of the intervening code for looking up the values of "nx" and "ny", the compiler isn't likely to use gang parallelism across both loops, which may limit the scalability of these loops. In the second listing, the developer has moved the look-up of "nx" and "ny" inside the innermost loop, which enables collapsing, but because this work is now done redundantly it may result in poor performance on some platforms. The third listing uses the new "force" modifier to ask the compiler to perform the necessary transformations to collapse these loops. The compiler will probably do the same transformation as the second listing when generating code for large amounts of coarse-grained parallelism, like for GPUs, but may choose not to if it's generating code for a platform that doesn't benefit from this transformation. The compiler is now able to do what it believes is best for the platform because the developer has told it these sorts of transformations are safe by using the "force" modifier.

#pragma acc parallel loop
for (int nc = 0; nc < 2000; nc++)
{
  nx = i2[nc];
  ny = i3[nc];
  #pragma acc loop
  for (int nz = 0; nz < 2000; nz++)
  {
    // Do Stuff
  }
}


#pragma acc parallel loop \
  collapse(2)
for (int nc = 0; nc < 2000; nc++)
{
  for (int nz = 0; nz < 2000; nz++)
  {
    nx = i2[nc];
    ny = i3[nc];
    // Do Stuff
  }
}


#pragma acc parallel loop \
  collapse(force:2)
for (int nc = 0; nc < 2000; nc++)
{
  nx = i2[nc];
  ny = i3[nc];
  for (int nz = 0; nz < 2000; nz++)
  {
    // Do Stuff
  }
}


Expanded Fortran Runtime API

The last notable feature is for Fortran users. The OpenACC Runtime API now supports all of the same capabilities in Fortran that it has always had in C/C++. Initially, routines that worked with C "void*" were left out of the Fortran API because we weren't certain of the best way to handle them in Fortran. Over time, many of our users wrote their own wrappers for these functions because they viewed them as necessary in their applications. We discussed these wrappers with our users, compared among our implementations, and finally agreed upon the right way to handle these routines in Fortran. Going forward, we now have a precedent for how we will handle future additions to the runtime API so that we can maintain a complete runtime API for Fortran.
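
As a rough illustration of the kind of routines in question, the C sketch below (not taken from the specification, error checking omitted) exercises a few of the void*-based runtime calls, acc_malloc, acc_memcpy_to_device, acc_memcpy_from_device, and acc_free, that Fortran programmers previously had to wrap themselves and that now have standard Fortran counterparts.

#include <openacc.h>
#include <stdlib.h>

int main(void)
{
  size_t bytes = 1000 * sizeof(double);
  double *host = (double *) malloc(bytes);

  // allocate raw device memory and copy host data to it; these
  // routines traffic in void*, which is why Fortran interfaces
  // were originally left out of the runtime API
  void *dev = acc_malloc(bytes);
  acc_memcpy_to_device(dev, host, bytes);

  // ... operate on the device memory ...

  // copy results back and release the device allocation
  acc_memcpy_from_device(host, dev, bytes);
  acc_free(dev);
  free(host);
  return 0;
}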

On behalf of the OpenACC technical committee, I hope you'll find these additions in OpenACC 3.3 useful in your applications. We have more things that we worked on over the year that didn't quite make the cutoff for this release. Is there anything you need to make OpenACC even more useful in your applications? If so, please send your feedback to feedback@openacc.org or join our Slack and discuss it there.

Author

Jeff Larkin
Jeff Larkin is a Senior Developer Technologies Software Engineer at NVIDIA Corporation, where he focuses primarily on porting and optimizing HPC applications. Jeff is an active contributor to the OpenACC and OpenMP standards bodies and is the Chair of the OpenACC Technical Committee. Prior to joining NVIDIA, Jeff worked in the Cray Supercomputing Center of Excellence. He holds a M.S. in Computer Science from the University of Tennessee and a B.S. in Computer Science from Furman University.