July 29, 2020

OpenACC and Base Language Parallelism

When OpenACC 3.0 was released in November 2019, the most exciting feature, in my opinion at least, was one that might easily be overlooked: updating our base languages. If you’re not familiar with this term, the base languages are the programming languages that we, as a directive-based parallel programming model, support, namely C, C++, and Fortran. When we released OpenACC 1.0 in November of 2011, the most important programming languages in scientific and high performance computing were C99, C++98, and Fortran 2003. In the eight years that followed, these programming languages moved forward, but our support for their new features did not. This is significant when you consider that each of these languages has added features to natively support parallel programming. Since much of the development of OpenACC comes from our interactions with OpenACC users, we were often fielding questions about C++ lambdas, C++ parallel algorithms, and Fortran DO CONCURRENT loops, but until we updated our support for these languages we had no way to even begin discussing these language features.

So how does a specification like OpenACC begin to support new programming language features? It’s easy: we list every new feature as unsupported and move our language references forward! Of course, this isn’t very satisfying, since programmers are no better off than they were before, so the next step is to look at each restricted feature, determine whether changes are needed to support it, and then lift the restriction as we deem it safe to do so. Some language features require no real effort on our part (I’m looking at you, C++ auto keyword) and some will require a lot of effort to fully support. The important thing, though, is that we now have the ability to discuss these new features and how to support and/or interoperate with them. With this in mind, I’d like to walk through some features that are under active discussion and ask that you provide feedback on their importance and direction.

C++ Parallel Algorithms

C++17 is a really exciting language for parallel programmers. It builds on the foundation set in C++11 with lambdas and Standard Template Library (STL) abstractions by adding execution policies that enable developers to express potential parallelism in their code. Take, for instance, the following implementation of a standard “axpy” operation.

#include <algorithm>
#include <cstddef>
#include <execution>

template <typename T>
void axpy(std::size_t N, T a, T* X, T* Y)
{
  // Computes Y[i] = a*X[i] + Y[i]; the policy permits parallel, vectorized execution
  std::transform(std::execution::par_unseq, X, X+N, Y, Y,
    [=](T x, T y){
      return y + a*x;
  });
}

In this example, we’re using a variety of C++ features to write concise, maintainable code. The std::transform function takes an execution policy, in this case std::execution::par_unseq, iterators marking the beginning and end of the input ranges, an output iterator, and a lambda that computes each result. A sophisticated compiler might run this code in parallel, potentially on CPU threads, on a GPU, or on some other interesting device. In fact, some compilers have already begun supporting operations like these on modern GPUs. One can look at this example and begin to imagine a world in which C++ has everything needed to run effectively on any parallel (or serial) processor without directives; it’s the ultimate sort of portability. So, we don’t need OpenACC at all, right? Well, maybe.

So what’s missing in the above example? The developer has expressed potential parallelism to the compiler, but that’s just one piece of the puzzle. What if the code needs to run on a device with discrete memory? There’s nothing here to express the locality of the data. What if I want to run this loop on one particular processor in my machine and another loop on a different one? There’s nothing expressing compute locality or dependencies between different regions either. What if I want to give the compiler more information on how to tune the execution for a particular processor of interest? Sorry, I can’t really do that either. Fortunately, OpenACC has solutions to all of these problems, but we need to define the interplay between C++ language-level parallelism and OpenACC directives carefully to ensure that we don’t paint the poor programmer into a corner.
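
For comparison, OpenACC can already express each of those missing pieces with directives. Below is a sketch of the same axpy written that way; the vector length and queue number are purely illustrative tuning choices, not values taken from the original example.

#include <cstddef>

template <typename T>
void axpy(std::size_t N, T a, T* X, T* Y)
{
  // Data locality: copy X to the device; copy Y there and back.
  // Tuning: an illustrative vector length for the target processor.
  // Asynchrony: launch into queue 1 so the host can do other work.
  #pragma acc parallel loop vector_length(128) \
              copyin(X[0:N]) copy(Y[0:N]) async(1)
  for (std::size_t i = 0; i < N; ++i)
    Y[i] += a * X[i];
  #pragma acc wait(1) // synchronize before Y is used on the host
}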

One possible solution for defining this interaction between C++ and OpenACC parallelism is to make the interoperability opt-in. With the example above, one can imagine adding a data region around the code to express the data locality for that region. Since both the OpenACC directives and the C++ parallel algorithm are visible to the compiler, a compiler might choose to use the OpenACC device data and execute the C++ parallel algorithm on the device. But what if the data region isn’t visible to the compiler because it is higher in the call stack (a common optimization)? How would a compiler know that the data exists and use that information to decide whether to parallelize the code and where to run it? Instead, perhaps it would be better for the developer to opt in to OpenACC interoperability by using a special execution policy, as in the code below. Such a policy would inform the compiler that the programmer wishes to have the code parallelized and that it should participate in the OpenACC execution and data model. In other words, treat it like a traditional acc parallel loop with regard to the data and where the code executes.

#include <algorithm>
#include <cstddef>
#include <execution>
#include <openacc.h>

template <typename T>
void axpy(std::size_t N, T a, T* X, T* Y)
{
#pragma acc data copyin(X[0:N]) copy(Y[0:N])
{
  // openacc::parallel is the proposed opt-in policy discussed above
  std::transform(openacc::parallel, X, X+N, Y, Y,
    [=](T x, T y){
      return y + a*x;
  });
}
}

Now, this isn’t ideal, since the code relies on an execution policy that is not specified in the C++ language itself, which limits portability to an extent. But C++ provides ways to abstract this so that the non-portable code can be minimized: when building with an OpenACC compiler, the OpenACC execution policy would be used; otherwise, a standard policy would be substituted. The ability to opt in to (or out of) the OpenACC execution model and its data presence checking likely carries enough benefit to outweigh the portability concerns, at least until we can gather more data from real applications using this feature.
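
As a minimal sketch of that abstraction, assuming the hypothetical openacc::parallel policy from above and the standard _OPENACC macro that OpenACC compilers define, the policy choice can be isolated to a single declaration (default_par is a name invented here for illustration):

#include <execution>
#ifdef _OPENACC
#include <openacc.h>
#endif

// Pick the execution policy once at compile time; every algorithm call
// site simply passes default_par and stays portable.
#ifdef _OPENACC
inline constexpr auto default_par = openacc::parallel; // hypothetical policy
#else
inline constexpr auto default_par = std::execution::par_unseq;
#endif

The axpy above would then call std::transform(default_par, ...) and build unchanged with or without an OpenACC compiler.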

One open question with this approach is how to best interact with asynchronous queues. Although C++ executors may eventually solve this problem, they are still several years away. I’ll propose the following as one possible solution to that problem, but I encourage you to provide feedback and your own suggestions on how best to solve these interactions.

#include <algorithm>
#include <cstddef>
#include <execution>
#include <openacc.h>

template <typename T>
void axpy(std::size_t N, T a, T* X, T* Y)
{
  // Launch work asynchronously into queue 1
  std::transform(openacc::parallel_async<1>, X, X+N, Y, Y,
    [=](T x, T y){
      return y + a*x;
  });
  // Do some unrelated work
  acc_wait(1); // Synchronize with queue 1
}

Fortran 2018 

Reports of Fortran’s death are greatly exaggerated. Fortran is still a vital language in HPC, and not just in “dusty deck” codes but in modern codes that are still under active development. Many of OpenACC’s greatest success stories come from Fortran applications, so it’s important that we keep up with recent standards. Fortran 2008 added, and Fortran 2018 extended, the DO CONCURRENT construct. This construct provides an alternative syntax for DO loops that gives the compiler additional assurances about its ability to run the iterations in any order. Although it’s not perfect, this construct can simplify the compiler’s job of finding parallelism. In addition to DO CONCURRENT, Fortran has long supported array operations that compilers may be able to parallelize (the matmul intrinsic, for instance). One thing that DO CONCURRENT lacks, however, is loop-level reductions, which are widely used in OpenACC codes. Dr. Fortran (Steve Lionel) promised at FortranCon 2020, however, that these will be coming in the next version of Fortran (dubbed Fortran 202X), so perhaps we’ll soon be able to write standard Fortran without directives too.

As with C++, though, expressing the parallelism in an algorithm is just one piece of the puzzle. Even if we assume that we can write all parallelizable loops using DO CONCURRENT and that a compiler will be able to parallelize them, expressing data and compute locality, task-based asynchronous operations, and interoperability with native programming models, like CUDA, all fall outside the realm of the language. So, for the time being, I expect that directives will remain necessary, even with DO CONCURRENT support in a compiler. In the example below, notice that OpenACC’s directives are added to a DO CONCURRENT loop in the same way they would be for a standard DO loop. Is this the right approach to modern Fortran? As with C++, the jury is still out, but I suspect some mixture of OpenACC data and compute directives will remain necessary for the foreseeable future.

subroutine saxpy(N, a, X, Y)
  implicit none
  integer,intent(in)                 :: N
  real(4),intent(in)                 :: a
  real(4),dimension(N),intent(in)    :: X
  real(4),dimension(N),intent(inout) :: Y
  integer                            :: i

!$acc parallel loop copyin(X) copy(Y) async(1)
  do concurrent (i=1:N)
    Y(i) = Y(i) + a * X(i)
  end do
!$acc wait(1) ! ensure Y is back on the host before returning
end subroutine

If you’re interested in learning more about my opinions on modern Fortran and OpenACC, I’ll point you to a talk I gave recently at FortranCon 2020.

General Language Modernization

The primary focus so far has been on how best to define interactions between OpenACC parallelism and language-level parallelism, but there are other interesting features in these languages that we may want to begin supporting in the OpenACC specification. For instance, our runtime API currently provides a C interface and a more limited Fortran interface. C++ users are limited to using the C interface, which lacks both namespacing and function overloading, among other things. It may be desirable to provide a C++-first API as well, to avoid polluting the global namespace and to allow more generic function names. It’s certainly something we can do, but should we? Similarly, the Fortran interface has holes around any API routines that require device pointers, as these aren’t a Fortran-native idea. I’m personally aware of several applications that had to work around these missing routines by writing their own Fortran bindings. Recent versions of Fortran allow us to revisit these routines and potentially create a more Fortran-native API that includes the complete functionality of the C API. I believe this is a shortcoming we clearly need to correct, but I’d like to hear from our users on what it should look like.
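
To make the C++ side of this concrete, here is a purely hypothetical sketch of what a namespaced, overloaded C++ layer over today’s C routines could look like. The C routines acc_copyin, acc_copyout, and acc_wait exist in the current specification; the openacc namespace and the C++ names below do not and are illustration only.

#include <cstddef>
#include <openacc.h> // the existing C interface

namespace openacc { // hypothetical namespace, not part of the spec today
  // Overloading on T removes the void* casts and sizeof arithmetic
  // that the C routines require at every call site.
  template <typename T>
  T* copyin(T* host, std::size_t count) {
    return static_cast<T*>(acc_copyin(host, count * sizeof(T)));
  }
  template <typename T>
  void copyout(T* host, std::size_t count) {
    acc_copyout(host, count * sizeof(T));
  }
  inline void wait(int queue) { acc_wait(queue); }
}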

Somewhat related to this is the addition of attribute syntax to the C++ language. Attributes, as the name implies, allow the programmer to attach certain properties to data structures and constructs. Had C++ attributes existed when OpenACC 1.0 was designed, perhaps the `declare` directive would have been unnecessary in C++, as one could have attached its arguments to a variable using an attribute instead. But just because we can move some of our pragmas to attributes, should we? Arguably there’s no new functionality, only a duplication of existing functionality with a new syntax. Do attributes really provide something greater than the existing pragmas? We’ve consulted a variety of C++ experts, and the verdict is split on this one. Take the example below: the pragma-based version is how someone might declare a variable with create today, and the attribute-based version is pseudocode for how that might look instead.

// How someone might do it today with a pragma:
double A[N];
#pragma acc declare create(A[0:N])

// Pseudocode for an attribute-based alternative:
double A[N] [[openacc::declare_create]];
So, clearly one could replace some of our C++ pragmas with attributes, but is there any real value here to warrant the additional changes both to the specification and to the compilers? The jury is still out, so I’d encourage you to give us your opinion.

In the Fortran realm, I’ve only mentioned DO CONCURRENT so far, as this is the language feature with the most significant overlap with OpenACC. Another really significant feature in Fortran 2018 that we’ve not yet addressed is co-arrays. Co-arrays are a Partitioned Global Address Space (PGAS) programming model, in which a special syntax on Fortran arrays adds codimensions that index data on other processors. These other processes are known as images in Fortran speak, and my image can read from and write to data on other images (for example, X(:)[2] refers to the array X on image 2). This functionality is useful both on distributed memory machines, where it can potentially replace MPI or SHMEM, and on shared memory machines, to handle things like NUMA. This adds even more complexity to our Fortran story, however, and right now we completely disallow the use of Fortran co-arrays with OpenACC features. We’d love to lift these restrictions, but there are a lot of questions still to be answered. For instance, should data clauses be restricted to data on the local image, or should they also allow data on other images? Should you be allowed to access data from another image within a compute construct? Can all possible hardware platforms support such a thing? Interaction with co-arrays is probably the single most complex unanswered question between OpenACC and Fortran, and it’s not one we’ll be able to solve quickly. As with everything I’ve presented so far, if you have an interesting use case that we should consider, please reach out to us.

Conclusions

This post is a snapshot in time, meant to give some idea of what the OpenACC committee is looking at with regard to modern C++ and Fortran. It is also a request for feedback. OpenACC strives to be user-driven, adding the things our users need rather than adding features simply for the sake of having something new, and your feedback is one of the ways we measure the need for new capabilities.

C++ and Fortran are now both parallel programming languages, so what does that mean for directive-based parallel programming models? It’s our hope that, eventually, everything you can write in OpenACC you will be able to write using native language parallelism, without the need for directives at all. With this in mind, the OpenACC committee has been working ever more closely with the language standards bodies to ensure better cross-pollination and knowledge sharing. As we begin to define how we interoperate with these parallel programming languages, we need to be mindful of which gaps we’re already filling, which gaps no longer need filling, and whether any new gaps have emerged. If you have feedback on anything I’ve written here, I encourage you to send an email to feedback@openacc.org. What challenges are you facing in your parallel programming, and how can we help you address them?

Author

Jeff Larkin
Jeff Larkin is a Senior Developer Technologies Software Engineer at NVIDIA Corporation, where he focuses primarily on porting and optimizing HPC applications. Jeff is an active contributor to the OpenACC and OpenMP standards bodies and is the Chair of the OpenACC Technical Committee. Prior to joining NVIDIA, Jeff worked in the Cray Supercomputing Center of Excellence. He holds an M.S. in Computer Science from the University of Tennessee and a B.S. in Computer Science from Furman University.