My objective, in a few words: to write code once and target as many architectures as possible, specifically for compute tasks where parallel computing pays off.
I want to start with an honourable mention: floor. It allows the same code to be translated into Vulkan, CUDA, OpenCL or native code.
Sadly, the project has limited documentation and is backed by the effort of a single person; as such, I cannot really consider it for any complex project.
Going back to more realistic scenarios, we have OpenACC and OpenMP.
I am purposefully excluding the nvidia compilers from this round of evaluation, mostly because they don’t have much incentive to support competitors’ hardware.
No major distribution ships with compilers capable of offloading. Well, actually they do: there are builds in most distributions, but since these are often older versions of the compiler, several features and bugfixes might be missing. They also frequently lack important components.
While OpenACC/OpenMP are often enabled in those builds, they will simply allow offloading tasks to other CPU cores.
Yeah, just threads with extra steps (or an improvement in readability, depending on your preferences, but still).
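To make this concrete, here is a minimal sketch (illustrative names, not from the original post) of what an OpenMP `target` region looks like. With a plain toolchain, or when no device is present, the pragma falls back to host execution, which is exactly the "threads with extra steps" situation; with an offloading-capable build the same loop runs on the GPU.

```cpp
#include <vector>

// SAXPY-style loop with an OpenMP target region. The map clauses
// describe host<->device transfers; they are inert when the pragma
// is not honoured, so the function works on any toolchain.
std::vector<float> saxpy(float a, const std::vector<float>& x, std::vector<float> y) {
    const int n = static_cast<int>(x.size());
    const float* xp = x.data();
    float* yp = y.data();
    #pragma omp target teams distribute parallel for map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (int i = 0; i < n; ++i)
        yp[i] = a * xp[i] + yp[i];
    return y;
}
// A typical offloading invocation with clang (flags assumed for an
// nvptx target; adjust to your toolchain):
//   clang++ -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda saxpy.cpp
```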
So, the only option for me to target nvidia cards was to compile my own toolchain. I found these guides, which worked with minimal changes to match the latest versions:
- https://gist.github.com/kristerw/4e9a735f2d755ffa73f9bf27edbf3c29 for gcc, it works with gcc 15
- https://gist.github.com/KaruroChori/db89525919e842501ebc89995d5b653a for clang, it works with clang 20.1
GCC 15
Let’s start with gcc. On paper it supports both OpenACC and OpenMP. In practice, the support is a joke.
OpenACC support is broken. Routines to be offloaded cannot be specified in nested scopes: no namespaces are allowed, and there is no way to mark them inside structures, so in effect it is only good for C code. Members and operators defined externally at the root of a file will work as routines if marked, but constructors and destructors cannot be captured: they lack a type, and the pragma fails to recognize them. So, working with C++, one is forced to manually tag each member function in the most verbose manner possible, which prevents any realistic usage of external libraries, and constructors and destructors must be forcefully inlined via `__attribute__((always_inline))`.
This was the best I got out of OpenACC support in gcc 15: not usable with C++, though probably interoperable.
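A sketch of what that workaround looks like in practice (the struct and function names are illustrative, not from the post): the out-of-line member is tagged at file scope, while the constructor, which the pragma cannot capture, is forcefully inlined into the kernel.

```cpp
// gcc 15 OpenACC workaround sketch: pragmas cannot appear in nested
// scopes, and constructors cannot be marked as routines at all.
struct vec2 {
    float x, y;
    // No `#pragma acc routine` is possible here, so force inlining
    // of the constructor into the offloaded loop body instead.
    __attribute__((always_inline)) vec2(float x_, float y_) : x(x_), y(y_) {}
    float dot(const vec2& o) const;
};

// Members defined externally at the root of the file can be marked.
#pragma acc routine seq
float vec2::dot(const vec2& o) const { return x * o.x + y * o.y; }

float sum_dots(const float* a, const float* b, int n) {
    float acc = 0.0f;
    #pragma acc parallel loop reduction(+:acc) copyin(a[0:2*n], b[0:2*n])
    for (int i = 0; i < n; ++i) {
        vec2 va(a[2*i], a[2*i+1]);   // constructor must inline here
        vec2 vb(b[2*i], b[2*i+1]);
        acc += va.dot(vb);
    }
    return acc;
}
```

Without `-fopenacc` the pragmas are simply ignored, so the code still builds and runs on the host, which also makes it easy to sanity-check.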
As for OpenMP, its support was not much better. Constructors were outright refused, with obscure linker errors about aliasing. Moving to a different build solved the issue; no idea what was wrong.
Clang 20.1
They only support OpenMP as of now; OpenACC support is ongoing but not yet in master, from what I understand.
There is not much to report: it just works. I encountered some oddities I cannot quite explain, but aside from those it is all good. Some notes on potential issues, and how to address them, can be found in the addendum below.
Conclusions
Use Clang. Setting up a custom toolchain is a bit of a mess, but at least it delivers. Next, I want to check a few more compilers which are optimized for this kind of application.
Addendum
A few notes about “discoveries” I made:
- OpenMP in C++ with Clang generally works, but you are likely to encounter compiler bugs more frequently than usual. Like this one, but there are more.
- Many bugs (and performance issues) were resolved by the additional flag `-fopenmp-cuda-mode`. Be careful: it has some limitations, but if you can cope with those it is a good optimization.
- Don’t even bother running offloaded code that is not compiled with `-O3`, as performance will be absurdly low: much worse than the degradation one would observe on a “normal” target.
- At times, the offloaded kernels are generated in an incomplete way (probably due to internal compiler bugs), but no error is reported. Make sure to validate output by testing offloaded code against the same code run on the CPU to avoid surprises.