2020

Announcing OpenACC 3.1

A year ago the OpenACC organization put out version 3.0 of the specification, a major upgrade that, among other things, moved forward the support for our base languages (C, C++, and Fortran) to their latest versions. The technical committee didn’t stop working though, and I’m pleased to announce the release of OpenACC 3.1 for November 2020. It’s hard to follow a major release like 3.0, but I believe the changes we made this year will help to make OpenACC implementations better, more interoperable, and easier to use with modern C++ and Fortran. A few years back OpenACC made the decision to move to an annual release model, putting out frequent small updates rather than saving up our changes into large, potentially difficult-to-implement releases. As such, some of the things we worked hard on this year just weren’t ready, and that’s OK. What we did complete certainly makes the specification better.

Since OpenACC 3.0 is considered a major release, with some significant changes, the committee spent much of our time working to reorganize, clarify, and clean up the existing specification to make life easier on our implementers and users. Some repeated text was consolidated to ensure consistency and to prevent future errors. Some sections were clarified to ensure consistent behavior among our implementations. This isn’t particularly exciting work, as hopefully our users will have no reason to even notice it, but it’s important work to ensure that our implementations provide a consistent experience to our users.

The keynote feature of 3.0 was updating our support for the latest versions of C, C++, and Fortran so that we could begin to reason about and support their latest features. We started the year talking with members of the C++ concurrency working group about how our two programming models can and should interact. We had similar discussions with members of the Fortran committee. In OpenACC 3.1 we add support for the Fortran BLOCK construct, the Fortran DO CONCURRENT construct, and C++ range-based for loops. We still have a long way to go before fully supporting the concurrency constructs in the base languages, but these small additions are things we found important and believe that implementations should add as they’re implementing 3.0.

Fortran Block Construct

The Fortran BLOCK construct provides a mechanism for scoping of variables and code within Fortran similar to curly braces in C and C++. This means that variables can be declared within this scope and ranges of code can have a clear starting and ending point at a granularity smaller than subroutines or functions. Since the BLOCK construct works so much like curly braces in C/C++, we’ve now defined them to work in the same way. Just like combined directives, PARALLEL LOOP for instance, END directives become optional when a BLOCK construct is used. Also, BLOCK constructs have the same privatization rules as those for structured blocks in C/C++. Below you’ll find an example of what a Fortran code using a BLOCK construct might look like.

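    !$acc data copy(y(:)) copyin(x(:))
    block
      ! The loop directive applies to a DO loop, so the update is written as one.
      !$acc parallel loop
      do i = 1, size(y)
        y(i) = y(i) + a * x(i)
      end do
    end block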

The advantage of using the BLOCK construct may not be obvious in this example, but now programmers can define scope for data and compute regions, declarations, and atomics using something that’s baked into the underlying language, rather than by adding the end directives. In my opinion, this encourages cleaner code by using the tools of the language.

Fortran Do Concurrent

I’ve personally spoken many times to the use of DO CONCURRENT in Fortran codes for asserting the viability of a loop for parallelization. By its very name the DO CONCURRENT loop implies that the iterations of a loop may be run in any order, including concurrently. Now, I could point you to some recent discussions on the Fortran committee regarding some nuances to that statement, but instead I’ll just show you a couple of examples of how DO CONCURRENT will work in an OpenACC 3.1 compiler.

First, DO CONCURRENT has a slightly different syntax than standard DO loops. In particular, rather than nesting several DO loops, a DO CONCURRENT loop specifies the iteration space for multiple loop indices. In theory, this reduces the amount of code you need to write and may give compilers more flexibility in picking an appropriate order for loop iterations. Notice below how the first version has two nested loops, but the second version, which does the same thing, has only a single DO CONCURRENT construct. Next, we’ve said that you can write the same ACC LOOP directive on a DO CONCURRENT loop and it will work across the full iteration space.

    !$acc parallel loop collapse(2)
    do j=1,M
      do i=1,N
        A1(i,j) = i+j
      end do
    end do

    !$acc parallel loop
    do concurrent (j=1:M,i=1:N)
      A2(i,j) = i + j
    end do

We’ve taken things a step further though. Since the intent of DO CONCURRENT is to write loops whose iterations can be executed in any order, we’ve specified that any DO CONCURRENT loop written inside a PARALLEL or KERNELS region will implicitly have an ACC LOOP directive. This means you will use OpenACC to create your parallelism and then the base language to identify what to parallelize. Taking advantage of the new BLOCK construct as well, now the above code looks only like this (notice how little OpenACC is now required):

    !$acc parallel
    block
      do concurrent (j=1:M,i=1:N)
        A2(i,j) = i + j
      end do
    end block

Now, I want to point out a few caveats here. First, it’s been discussed in the Fortran community that it is actually possible to write a legal DO CONCURRENT loop that cannot be executed in parallel. Since writing DO CONCURRENT inside an ACC PARALLEL or KERNELS region has an implied LOOP, it’s the programmer’s responsibility to ensure that the loop is in fact parallelizable. It’s my belief though that parallelizable loops are what users are actually writing and will actually desire. Second, if you need to use GANG, WORKER, or VECTOR clauses to adjust the parallelization of your loop, you’ll still need to decorate your DO CONCURRENT with an explicit ACC LOOP. Lastly, the TILE and COLLAPSE clauses are not allowed on a DO CONCURRENT loop. It’s a bit unclear exactly what’s desired there, and we want to see how people actually use this functionality before we specify it.

C++ Range-Based For Loops

The FOR loop has long been the standard when learning C or C++, but in some ways it’s a bit limiting in its syntax. The programmer gives an index variable with starting condition, an expression for the ending condition, and some operation to move the index toward the ending condition (hopefully). Range-based FOR loops are an alternative syntax that essentially focuses on each item in the range of iterations, rather than focusing on some index value. The compiler may then transform this into a more traditional loop, but the programmer gets to write something more concise. This may sound a bit abstract, so I’ll give an example.

    std::vector<int> indices(N), foo(N), bar(N);
    for ( int i = 0; i < N ; i++ )
    {
        indices[i] = N-i-1; 
        foo[i] = i;
    }
    for(auto i : indices)
    {
      bar[i] = 2 * foo[i];
    }

This code represents a coding pattern known as indirect addressing, by which the relevant locations in an array are stored in an indices array and that is used to find the location into the actual data arrays. In this case, the indices are simply reversed, but in many applications this technique can be used to greatly compress sparsely populated matrices. While the above example is trivial, the pattern itself is quite powerful. It’s perhaps worth mentioning though that, depending on a lot of factors, indirect indexing like this can greatly affect the code’s performance. The second FOR loop is the range-based FOR loop, where I’ve simply said that I want to operate on all values of the indices array, referring to each value as “i” within the loop. In this case, that fetches every index in the indices array and iterates over those.

In order to enable programmers to write with range-based FOR loops, we restated the restrictions on C and C++ loops a bit to make crystal clear what can and cannot be used. As long as you’re using an integer, pointer, or C++ random-access iterator, your loop doesn’t wrap around, and the iteration count can be computed in constant time, you’re in good shape. These restrictions guarantee that the loop will complete and ensure the compiler can determine how to schedule the iterations. Since range-based FOR loops can be defined in terms of regular loops that meet these restrictions, they are supported.
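To make that last point concrete, here is a minimal sketch of the same indirect-addressing loop written twice: once in range-based form and once expanded into an explicit random-access iterator loop. The function names scale_indirect and scale_indirect_expanded are hypothetical, and placing the directive directly on a loop over std::vector is an assumption about how an implementation might be used. How the vectors’ storage reaches the accelerator is implementation dependent and is not addressed here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not from the specification): doubles foo into bar
// through an index array, using a range-based for loop.
std::vector<int> scale_indirect(const std::vector<int>& indices,
                                const std::vector<int>& foo)
{
    // Assumption: indices holds unique, in-range positions, so no two
    // iterations write the same element of bar.
    assert(indices.size() == foo.size());
    std::vector<int> bar(foo.size(), 0);

    // A compiler without OpenACC support simply ignores this pragma.
    #pragma acc parallel loop
    for (auto i : indices)
    {
        bar[i] = 2 * foo[i];
    }
    return bar;
}

// The same loop expanded by hand into a random-access iterator loop.
// The trip count, end() - begin(), is available in constant time.
std::vector<int> scale_indirect_expanded(const std::vector<int>& indices,
                                         const std::vector<int>& foo)
{
    assert(indices.size() == foo.size());
    std::vector<int> bar(foo.size(), 0);

    #pragma acc parallel loop
    for (auto it = indices.begin(); it != indices.end(); ++it)
    {
        const int i = *it;
        bar[i] = 2 * foo[i];
    }
    return bar;
}
```

Both functions should produce the same result for the same inputs; the range-based version just leaves the iterator bookkeeping to the compiler.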

Beyond 3.1

This is certainly not everything we worked on this year, nor is it everything we believe we need to support, so what’s still on our radar? Several features came very close this year, but as we prototyped the features and as the deadline approached we decided that they needed a bit more time to bake.

Error Handler

OpenACC already has an extensive API for performance tools to gather data about the execution of an OpenACC program. Most of the major HPC profiling tools already support this profiling interface and even some users have written their own libraries to gather additional information about the execution of their program. Profiling information is only helpful though if the program executes correctly, and OpenACC currently does not define a mechanism for catching and diagnosing errors. The technical committee spent a lot of time this year defining an error handling model that mirrors the existing profiling interface for catching and diagnosing errors before the application is shut down. This enables the developer to query the executable to understand the conditions that led up to the error or potentially to call something like MPI_Abort to ensure a parallel application exits correctly. We also examined the existing error conditions very carefully, identifying places where the specification could be clearer regarding errors, and ensured that error conditions are more completely defined than they are now. Unfortunately, as this feature was prototyped and discussed in the committee, we decided it needed a bit more time before releasing it in the specification. I expect that we’ll pick this issue back up right away now that 3.1 has been completed.

Extended Fortran Runtime API

To the extent possible OpenACC supports C, C++, and Fortran equally in the specification and we have developers using OpenACC in each of these languages (and even interfacing to other languages). One place where support differs though is in the runtime API, where Fortran lacks some routines that provide direct access to device pointers. This was an intentional decision, since it’s less clear in Fortran the right way to return device pointers, but because of this missing feature many of our users have been forced to write their own wrappers to the memory management APIs so that they can be called from Fortran. When more than a few of our users have been forced to fill holes in our specification, clearly we need to do better to fill their needs. We surveyed users who have written their own API wrappers, gathered information about their use cases and their interfaces, and set forth to solve this problem once and for all users. Well, if you ask N users how to write the correct API wrappers you’re likely to get N+1 different solutions, and we did. As the release of 3.1 approached we decided to take a bit longer on these Fortran APIs, which deal with ideas that are foreign to Fortran, like device pointers, and make sure that we standardize on the right approach for our users. If you’re writing your own Fortran wrappers to OpenACC routines, please email feedback@openacc.org and help us to ensure that our final product will meet your needs.

Host-side Async

When the self clause was added to OpenACC we enabled users to easily express regions of code they wish to run in parallel on CPU threads vs. offloading to an attached accelerator, such as a GPU. This is a bit of a simplification of what’s possible with the self clause, but it’s certainly what we expect to be the most common use case. This opened up an interesting case though where a developer can write a parallel region that runs on the host CPU and then mark that region to possibly execute asynchronously. Clearly, this is a really powerful feature, but as our implementers began work on it, a variety of questions arose about how best to do so. At first, some very small clarifications seemed to address the questions, but as we dug deeper into the interactions with asynchronous queues on the accelerator device, we discovered that a bit more was needed to make CPU tasking such as this truly useful. The question that we’re still working on is not how to run tasks asynchronously on the CPU, but how best to allow them to interact with, in particular to synchronize with, tasks on the accelerators.

C++ Parallel Algorithms

This year we addressed the use of DO CONCURRENT for our Fortran users, but what about our C++ users? Can we also support interactions with the C++ Parallel Algorithms and OpenACC? As I said earlier, we started the year with discussions with members of the C++ committee regarding this possibility, but at this point compiler support for the C++ parallel algorithms is still a bit too nascent for us to speculate about how best to support it. We’ve not given up, because the OpenACC organization really believes in driving users to base language parallelism, but we definitely need a lot more prototyping before making this a reality.

C++ Atomics

Recent releases of C++ have put a lot of work into specifying a robust and predictable memory model. One feature of this memory model is support for atomics, something OpenACC also provides as a directive. If a base language provides support for atomics already, do we still need this directive in said base language? This is another area where we really need more time to understand the interactions between OpenACC and C++ atomics and ensure that we get everything specified correctly. Because of how extensive the C++ memory model is, we’re going to take our time to fully understand the implications of using C++ atomics in place of OpenACC atomics before making any changes.

Conclusions

The OpenACC 3.1 specification is now available and we hope that our implementers will begin to support it quickly. Although the user-facing changes are small, they are solid modernization changes that we believe will enhance the user experience and are things that implementers can work on as they’re coming up to speed on OpenACC 3.0 support. As always, if you have feedback about OpenACC, please reach out to us at feedback@openacc.org with your suggestions or questions. Also, if you haven’t already, please join the discussions in our Slack channel.

Author

Jeff Larkin

Jeff Larkin is a Senior Developer Technologies Software Engineer at NVIDIA Corporation, where he focuses primarily on porting and optimizing HPC applications. Jeff is an active contributor to the OpenACC and OpenMP standards bodies and is the Chair of the OpenACC Technical Committee. Prior to joining NVIDIA, Jeff worked in the Cray Supercomputing Center of Excellence. He holds an M.S. in Computer Science from the University of Tennessee and a B.S. in Computer Science from Furman University.