Offloading to GPU with OpenMP
Since version 4.0 of the OpenMP standard, and with substantial refinements in version 4.5, computations can be offloaded from the CPU to a GPU; see https://www.openmp.org/wp-content/uploads/SC18-BoothTalks-Jost.pdf
Example
The following example shows how to offload a reduction operation over a large loop:
#include <iostream>
#include <cmath>

int main() {
    int n = 1000000000;
    double total = 0;
    // Offload the loop to the GPU and sum the partial results into total
    #pragma omp target teams distribute parallel for map(tofrom: total) \
        map(to: n) reduction(+:total)
    for (int i = 0; i < n; ++i) {
        total += exp(sin(M_PI * (double) i/12345.6789));
    }
    std::cout << "total is " << total << '\n';
}
Save the above code in file total.cxx.
Note the pragma
#pragma omp target teams distribute parallel for map(tofrom: total) \
map(to: n) reduction(+:total)
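which maps the variables total and n to the GPU and creates teams of threads that perform the sum in parallel. The clause map(tofrom: total) copies total to the device before the loop and back to the host afterwards, map(to: n) copies n to the device only, and reduction(+:total) combines the partial sums computed by the threads.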
Compile
We'll use the Cray C++ compiler to build the executables, but first we need to load a few modules:
module load cray-libsci_acc/18.06.1 craype-accel-nvidia60 \
PrgEnv-cray/1.0.4 cuda92/blas/9.2.88 cuda92/toolkit/9.2.88
(You can ignore the warning "cudatoolkit >= 8.0 is required".)
To compare the execution times between the CPU and GPU version, we build two executables:
CC -h noomp -o total total.cxx
CC -o totalOmpGpu total.cxx
The executable total is compiled with -h noomp, i.e. with OpenMP turned off.
Run
The following commands will submit the runs to the Mahuika queue (note --partition=gpu --gres=gpu:1 in the case of the executable that offloads to the GPU):
time srun --ntasks=1 --cpus-per-task=1 ./total
time srun --ntasks=1 --cpus-per-task=1 --partition=gpu --gres=gpu:1 ./totalOmpGpu
| executable | time [s] |
|---|---|
| total | 10.9 |
| totalOmpGpu | 0.45 |