Offloading to GPU with OpenACC
Many codes can be accelerated significantly by offloading computations to a GPU. Some NeSI Mahuika nodes have GPUs attached to them. If you want your code to run faster, and you are developing your own code or have access to the source code and feel comfortable editing it, read on.
Here we show how to tell the compiler which parts of your algorithm you want to run on a GPU. We'll use OpenACC, which adds directives to your source code. The advantage of OpenACC over other approaches is that the source code changes are generally small and your code remains portable, i.e. it will run on both CPU and GPU. The main disadvantage of OpenACC is that only a few compilers support it.
More information about OpenACC can be found here.
Example
In the following we show how to achieve this for a reduction operation over a large loop in C++ (a similar example can be written in Fortran):
#include <iostream>
#include <cmath>

int main() {
    double total = 0;
    int i, n = 1000000000;

    // Offload the loop to the GPU: copy 'total' to the GPU and back,
    // copy 'n' to the GPU only, and combine the partial sums with a '+' reduction.
    #pragma acc parallel loop copy(total) copyin(n) reduction(+:total)
    for (i = 0; i < n; ++i) {
        total += exp(sin(M_PI * (double) i / 12345.6789));
    }

    std::cout << "total is " << total << '\n';
    return 0;
}
Save the above code in file total.cxx.
Note the pragma
#pragma acc parallel loop copy(total) copyin(n) reduction(+:total)
We're telling the compiler that the loop following this pragma should be executed in parallel on the GPU. Since GPUs can run thousands of threads concurrently, the speedup can be significant. Also note that total is initialised on the CPU (above the pragma) and must be copied to the GPU and back to the CPU once the loop has completed; this is what the copy(total) clause requests. (It is also possible to initialise this variable on the GPU.) Likewise, the copyin(n) clause copies the number of iterations n from the CPU to the GPU.
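If your program contains several offloaded loops that reuse the same data, you can avoid repeated CPU-GPU transfers by wrapping them in a structured data region. Below is a minimal sketch of the same reduction written this way; the data clauses simply move from the loop directive to an enclosing data directive, and the result is copied back to the CPU when the data region ends.
#include <iostream>
#include <cmath>

int main() {
    double total = 0;
    int i, n = 1000000000;

    // Structured data region: total is copied to the GPU here and copied
    // back to the CPU when the region ends; n is only copied in.
    #pragma acc data copy(total) copyin(n)
    {
        // The reduction result is combined into the GPU copy of total.
        #pragma acc parallel loop reduction(+:total)
        for (i = 0; i < n; ++i) {
            total += exp(sin(M_PI * (double) i / 12345.6789));
        }
    }

    std::cout << "total is " << total << '\n';
    return 0;
}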
Compile
We can use the NVIDIA compiler
module load NVHPC
and type
nvc++ -Minfo=all -acc -o totalAccNv total.cxx
to compile the example.
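If you also want a CPU-only reference binary from the NVIDIA compiler for the timing comparison below, one option is to recompile without the -acc flag, in which case the OpenACC directives are ignored (the executable name totalCpu is just an example):
nvc++ -Minfo=all -o totalCpu total.cxx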
Alternatively, we can use the Cray C++ compiler to build the executable, but first we need to load a few modules:
module load craype-broadwell
module load cray-libsci_acc
module load craype-accel-nvidia60
module load PrgEnv-cray
(Ignore the warning "cudatoolkit >= 8.0 is required".) Furthermore, you may need to load cuda/fft or cuda/blas.
To compare the execution times of the CPU and GPU versions, we build two executables:
CC -h noacc -o total total.cxx
CC -o totalAccGpu total.cxx
The executable total is compiled with -h noacc, i.e. with OpenACC turned off.
Run
The following commands will submit the runs to the Mahuika queue (note the --gpus-per-node=P100:1 option in the case of the executable that offloads to the GPU):
time srun --ntasks=1 --cpus-per-task=1 ./total
time srun --ntasks=1 --cpus-per-task=1 --gpus-per-node=P100:1 ./totalAccGpu
| executable | time [s] |
|---|---|
| total | 7.6 |
| totalAccGpu | 0.41 |
Check out this page to find out how you can offload computations to a GPU using OpenMP.
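For a flavour of what that looks like, here is a minimal sketch of the same reduction using OpenMP target offload directives instead of OpenACC (this is only an illustration; see the page linked above for NeSI-specific build and run instructions):
#include <iostream>
#include <cmath>

int main() {
    double total = 0;
    int i, n = 1000000000;

    // map(tofrom:total) plays the role of OpenACC's copy(total); the loop
    // iterations are distributed across the GPU's teams and threads.
    #pragma omp target teams distribute parallel for map(tofrom:total) reduction(+:total)
    for (i = 0; i < n; ++i) {
        total += exp(sin(M_PI * (double) i / 12345.6789));
    }

    std::cout << "total is " << total << '\n';
    return 0;
}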