
VTune

Intel VTune Amplifier XE is the premier performance profiler for C, C++, C#, Fortran, Assembly and Java. VTune Homepage

Available Modules

module load VTune/2019_update4

What is VTune?

VTune is a tool that allows you to quickly identify where most of the execution time of a program is spent. This is known as profiling. It is good practice to profile a code before attempting to modify it to improve its performance. VTune collects key profiling data and presents them in an intuitive way. Another tool that provides similar information is Arm MAP.

How to use VTune

We'll show how to profile a C++ code with VTune - feel free to choose your own code instead. Start with 

git clone https://github.com/pletzer/fidibench

and build the code using the "gimkl" tool chain

cd fidibench
mkdir build
cd build
module load gimkl CMake
cmake ..
make

This will compile a number of executables. Note that VTune does not require a special compiler switch for profiling, so you can also profile an existing executable if you like. We choose "upwindCxx" as the executable to profile. It is under upwind/cxx, so

cd upwind/cxx

Load VTune and run the executable under the profiler with

module load VTune
srun --ntasks=1 --cpus-per-task=2 --hint=nomultithread amplxe-cl -collect hotspots -result-dir vtune-res ./upwindCxx -numCells 256 -numSteps 10
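If you repeat the profiling run (for example with different -numCells values), it can be convenient to give each run its own result directory. The following is a minimal sketch: next_result_dir is a hypothetical helper written for this page, not part of VTune or Slurm, and the numbering scheme (vtune-res, vtune-res-1, ...) is an assumption.

```shell
# Hypothetical helper (not part of VTune): print the first name of the form
# <base>, <base>-1, <base>-2, ... that does not exist in the current directory.
next_result_dir() {
  base=$1
  candidate=$base
  n=0
  while [ -e "$candidate" ]; do
    n=$((n + 1))
    candidate="$base-$n"
  done
  printf '%s\n' "$candidate"
}

# Possible use with the profiling command above (needs Slurm and the VTune module):
# srun --ntasks=1 --cpus-per-task=2 --hint=nomultithread \
#   amplxe-cl -collect hotspots -result-dir "$(next_result_dir vtune-res)" \
#   ./upwindCxx -numCells 256 -numSteps 10
```

The helper only picks a name; the profiler itself creates the directory when the run starts.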

Executable "upwindCxx" takes arguments "-numCells 256" (the number of cells in each dimension) and "-numSteps 10" (the number of time steps). Note the command "amplxe-cl -collect hotspots -result-dir vtune-res", which was inserted before the executable and tells VTune to collect hotspot data into the directory "vtune-res". The output may look like

Top Hotspots

Function                                    Module          CPU Time
------------------------------------------  --------------  --------
Upwind<(unsigned long)3>::advect._omp_fn.1  upwindCxx        25.979s
_int_free                                   libc.so.6         9.170s
operator new                                libstdc++.so.6    6.521s
free                                        libmpi.so.12      0.300s

indicating that most of the listed CPU time is spent in the "advect" method (26s), with significant amounts of time spent allocating (6.5s) and deallocating (9.2s) memory.
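To put these numbers in proportion, the CPU times can be converted into shares of the listed total. This is a small sketch using awk; hotspot_share is a hypothetical helper, and the "name|seconds" input format is an assumption made for this example (the values are copied by hand from the table above, not parsed from VTune output).

```shell
# Hypothetical helper: read lines of "<function>|<seconds>" and print each
# function's share of the summed CPU time of the listed functions.
hotspot_share() {
  awk -F'|' '
    { name[NR] = $1; secs[NR] = $2; total += $2 }
    END {
      for (i = 1; i <= NR; i++)
        printf "%-45s %7.3fs %5.1f%%\n", name[i], secs[i], 100 * secs[i] / total
    }'
}

# CPU times taken from the "Top Hotspots" table above.
hotspot_share <<'EOF'
Upwind<(unsigned long)3>::advect._omp_fn.1|25.979
_int_free|9.170
operator new|6.521
free|0.300
EOF
```

For the four functions listed, "advect" accounts for roughly 62% of the summed time, and the memory management routines for most of the rest. The percentages are relative to the listed hotspots only, not to the total run time of the program.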

Drilling further into the code

Often this is enough to give you a feel for where the code can be improved. To explore further you can fire up

amplxe-gui &

Go to the bottom and select "Open Result...", choose the directory where the profiling results are saved and click on the .amplxe file. The summary will look similar to the above table. However, you can now dive into selected functions to get more information. Below we see that 16.5 out of 26 seconds were spent starting the two OpenMP threads.
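If a graphical session is not available, a per-function report can also be printed in the terminal from the saved result directory. The sketch below assumes the result directory is "vtune-res" as above; show_hotspots is a hypothetical wrapper that only checks that the VTune module is loaded before calling the report action of amplxe-cl.

```shell
# Hypothetical wrapper: print the hotspots report for a saved result directory.
# Returns 1 with a hint if amplxe-cl is not on the PATH.
show_hotspots() {
  if ! command -v amplxe-cl >/dev/null 2>&1; then
    echo "amplxe-cl not found; run 'module load VTune' first" >&2
    return 1
  fi
  amplxe-cl -report hotspots -result-dir "$1"
}

# Example call (after "module load VTune"):
# show_hotspots vtune-res
```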