VTune
Intel VTune Amplifier XE is the premier performance profiler for C, C++, C#, Fortran, Assembly and Java. VTune Homepage
Available Modules¶
module load VTune/2019_update4
What is VTune?¶
VTune is a tool that lets you quickly identify where a program spends most of its execution time. This is known as profiling. It is good practice to profile a code before attempting to modify it to improve its performance. VTune collects key profiling data and presents them in an intuitive way. Another tool that provides similar information is Arm MAP.
How to use VTune¶
We'll show how to profile a C++ code with VTune - feel free to choose your own code instead. Start with
git clone https://github.com/pletzer/fidibench
and build the code using the "gimkl" toolchain
cd fidibench
mkdir build
cd build
module load gimkl CMake
cmake ..
make
This will compile a number of executables. Note that VTune does not require one to apply a special compiler switch to profile. You can profile an existing executable if you like. We choose "upwindCxx" as the executable to profile. It is under upwind/cxx, so
cd upwind/cxx
Load VTune and run the executable under the profiler with
module load VTune
srun --ntasks=1 --cpus-per-task=2 --hint=nomultithread amplxe-cl -collect hotspots -result-dir vtune-res ./upwindCxx -numCells 256 -numSteps 10
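The same profiling run can also be submitted as a batch job instead of being launched interactively. The following is a minimal sketch, assuming a Slurm cluster with sbatch available. The script name profile_upwind.sl, the job name and the ten-minute time limit are hypothetical and should be adapted to your own run.

```shell
# Write a Slurm batch script; the file name profile_upwind.sl is hypothetical
cat > profile_upwind.sl << 'EOF'
#!/bin/bash -e
#SBATCH --job-name=vtune-upwind   # hypothetical job name
#SBATCH --time=00:10:00           # assumed wall-time limit, adjust as needed
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --hint=nomultithread

# Same toolchain as used for the build, plus the profiler
module load gimkl VTune

# Same command line as the interactive run
srun --ntasks=1 --cpus-per-task=2 --hint=nomultithread amplxe-cl -collect hotspots -result-dir vtune-res ./upwindCxx -numCells 256 -numSteps 10
EOF

echo "wrote profile_upwind.sl"
```

Submitting it from the upwind/cxx build directory with "sbatch profile_upwind.sl" should leave the profiling data in vtune-res, as in the interactive case.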
Executable "upwindCxx" takes arguments "-numCells 256" (the number of cells in each dimension) and "-numSteps 10" (the number of time steps). Note the command "amplxe-cl -collect hotspots -result-dir" which was inserted before the executable. The output may look like
Top Hotspots
Function                                    Module          CPU Time
------------------------------------------  --------------  --------
Upwind<(unsigned long)3>::advect._omp_fn.1  upwindCxx        25.979s
_int_free                                   libc.so.6         9.170s
operator new                                libstdc++.so.6    6.521s
free                                        libmpi.so.12      0.300s
indicating that most of the CPU time is spent in the "advect" method (26s), with significant amounts of time spent allocating (6.5s, "operator new") and deallocating (9.2s, "_int_free") memory.
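To put these numbers in proportion, the CPU times from the table can be turned into percentages. This is only a sketch of the arithmetic: the values are copied by hand from the example output above, and the total covers the four listed functions rather than the program's full CPU time.

```shell
# CPU times in seconds, copied from the Top Hotspots table above
advect=25.979; int_free=9.170; op_new=6.521; mpi_free=0.300

# Sum of the four listed functions (not the program's full CPU time)
total=$(awk -v a="$advect" -v f="$int_free" -v n="$op_new" -v m="$mpi_free" \
    'BEGIN { printf "%.3f", a + f + n + m }')

# Share of the listed time spent in advect, and in allocation plus deallocation
advect_pct=$(awk -v a="$advect" -v t="$total" 'BEGIN { printf "%.1f", 100 * a / t }')
memory_pct=$(awk -v f="$int_free" -v n="$op_new" -v t="$total" \
    'BEGIN { printf "%.1f", 100 * (f + n) / t }')

echo "listed total: ${total}s, advect: ${advect_pct}%, new + _int_free: ${memory_pct}%"
# prints: listed total: 41.970s, advect: 61.9%, new + _int_free: 37.4%
```

Roughly three fifths of the listed time is in "advect" and more than a third is in memory management, which suggests that reducing allocations inside the time loop is worth investigating alongside the numerical kernel itself.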
Drilling further into the code¶
Often this is enough to give you a feel for where the code can be improved. To explore further you can fire up
amplxe-gui &
Go to the bottom and select "Open Result...", choose the directory where the profiling results are saved and click on the .amplxe file. The summary will look similar to the above table. However, you can now dive into selected functions to get more information. In this example, the detailed view shows that 16.5 out of 26 seconds were spent starting the two OpenMP threads.
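If no graphical display is available, the collected result can also be summarised in the terminal with the command-line tool's report mode. The following is a minimal sketch, assuming the vtune-res directory from the profiling run above. The guard and the report_status variable are illustrative additions that skip the report when amplxe-cl is not on the PATH or the result directory is missing.

```shell
result_dir=vtune-res   # result directory used in the profiling run above

if command -v amplxe-cl > /dev/null 2>&1 && [ -d "$result_dir" ]; then
    # Print the hotspots report for the existing result directory
    amplxe-cl -report hotspots -result-dir "$result_dir"
    report_status="printed"
else
    echo "amplxe-cl or $result_dir not found: run 'module load VTune' and the profiling step first"
    report_status="skipped"
fi
```

The text report repeats the per-function CPU times, so it is a convenient way to compare runs before and after a code change without opening the GUI.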