Hyperthreading
As CPU technology advanced, engineers realised that adapting CPU architecture to include logical processors within each physical core (conventionally, a CPU) allows some computation to occur simultaneously. The general name for this technology is simultaneous multithreading (SMT), and Intel's implementation of it is called Hyperthreading.
CPUs capable of Hyperthreading consist of two logical processors per physical core. The logical processors can operate on data/instruction threads simultaneously, meaning the physical core can make progress on two operations concurrently. In other words, the difference between logical and physical cores is that logical cores are not full stand-alone CPUs: each physical core is made up of two logical cores, which share some of the core's hardware.
Hyperthreading is enabled by default on NeSI machines, meaning, by default, Slurm will allocate two threads to each physical core.
Hyperthreading with Slurm
When you request a CPU from Slurm, you are requesting logical cores, of which, as mentioned above, there are two per physical core. If you use `--ntasks=n` to request CPUs, Slurm will start `n` MPI tasks, each assigned to one physical core. Since Slurm "sees" logical cores, once your job starts you will have twice as many CPUs as `ntasks`.

If you set `--cpus-per-task=n`, Slurm will request `n` logical CPUs per task, i.e. will set `n` threads for the job. Your code must be capable of running Hyperthreaded (for example using OpenMP) if `--cpus-per-task` is greater than 1.
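For example, a job script for a multithreaded (OpenMP) program might look like the following sketch; the job name, resource figures, and program name are illustrative only:

```shell
#!/bin/bash -e
#SBATCH --job-name=omp_job    # illustrative name
#SBATCH --ntasks=1            # a single task...
#SBATCH --cpus-per-task=8     # ...with 8 logical CPUs (4 physical cores)

# Match the OpenMP thread count to the Slurm allocation.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./my_openmp_program      # placeholder executable
```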
Setting `--hint=nomultithread` with `srun` or `sbatch` causes Slurm to allocate only one thread from each physical core to the job. This will allocate CPUs according to the following table:
Node name: wbn009

| Physical Core id | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|---|---|---|
| Logical CPU ids | 0, 1 | 2, 3 | 4, 5 | 6, 7 | 8, 9 | 10, 11 | 12, 13 | 14, 15 |
| Allocated CPU ids | 0 | 2 | 4 | 6 | 8 | 10 | 12 | 14 |

Number of allocated CPUs: 4 from each group of four physical cores, i.e. one logical CPU per physical core.
Adapted from Slurm's documentation page.
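As a concrete sketch, an allocation like the one shown above could come from a request such as this (the task count and script name are illustrative):

```shell
# Ask Slurm for one thread per physical core: the 8 tasks land on
# 8 physical cores, each using one logical CPU.
sbatch --ntasks=8 --hint=nomultithread job.sl   # job.sl is a placeholder script
```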
When to use Hyperthreading
Hyperthreading increases the efficiency of some jobs, but the fact that Slurm counts logical CPUs makes aspects of running non-Hyperthreaded jobs confusing, even when Hyperthreading is turned off in the job with `--hint=nomultithread`. To determine whether the code you are running is capable of running Hyperthreaded, consult the software's manual pages.
Alternatively, it is possible to perform an ad-hoc test to determine whether your code can make use of Hyperthreading. First, run a job that requests 2 threads per physical core as described above. Then use the `nn_seff` command to check the job's CPU efficiency. If CPU efficiency is greater than 100%, your code is making use of Hyperthreading and gaining performance from it. If your job gives an error or stays at 100% efficiency, it is likely you cannot run your code Hyperthreaded. 200% CPU efficiency would be the maximally efficient job; however, this is rarely observed, and anything over 100% should be considered a bonus.
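To make the efficiency figure concrete, the arithmetic behind this test can be sketched as follows, assuming CPU efficiency is reported as total CPU time relative to elapsed time per physical core (a simplified model of the `nn_seff` output, not its actual implementation):

```shell
# Hypothetical job: 1 physical core, one hour of wall-clock time,
# 90 minutes of total CPU time accumulated across both logical CPUs.
cpu_seconds=5400
elapsed_seconds=3600
physical_cores=1

# Efficiency as a whole-number percentage.
efficiency=$(( 100 * cpu_seconds / (elapsed_seconds * physical_cores) ))
echo "${efficiency}%"   # prints "150%": above 100%, so Hyperthreading is helping
```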
How to use Hyperthreading
- Non-Hyperthreaded jobs which use `--mem-per-cpu` requests should halve their memory requests, as those are based on memory per logical CPU, not per thread or task. For non-MPI jobs, or for MPI jobs that request the same number of tasks on every node, we recommend specifying `--mem` (i.e. memory per node) instead. See How to request memory (RAM) for more information.
- Non-MPI jobs which specify `--cpus-per-task` and use `srun` should also set `--ntasks=1`, otherwise the program will be run twice in parallel, halving the efficiency of the job.
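A job script following both recommendations above might look like this sketch (the resource figures and program name are illustrative):

```shell
#!/bin/bash -e
#SBATCH --ntasks=1            # needed with srun + --cpus-per-task for non-MPI jobs
#SBATCH --cpus-per-task=4     # 4 logical CPUs = 2 physical cores
#SBATCH --mem=2G              # memory per node, rather than --mem-per-cpu
srun ./my_threaded_program    # placeholder executable
```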
The precise rules about when Hyperthreading applies are as follows:
| | Mahuika | Māui |
|---|---|---|
| Jobs | Never share physical cores | Never share physical cores |
| MPI tasks within the same job | Never share physical cores | Share physical cores by default. You can override this behaviour by using `--hint=nomultithread` in your job submission script. |
| Threads within the same task | Share physical cores by default. You can override this behaviour by using `--hint=nomultithread` in your job submission script. | Share physical cores by default. You can override this behaviour by using `--hint=nomultithread` in your job submission script. |
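For instance, to stop MPI tasks sharing physical cores on Māui, the override could be applied like this (the task count and program name are illustrative):

```shell
#!/bin/bash -e
#SBATCH --ntasks=40           # illustrative MPI task count
#SBATCH --hint=nomultithread  # give each task its own physical core
srun ./my_mpi_program         # placeholder executable
```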
How many logical CPUs will my job use or be charged for?
The possible job configurations and their results are shown in the following table. We have also included some recommendations to help you make the best choices, depending on the needs of your workflow.
| Job configuration | Mahuika | Māui |
|---|---|---|
| `--cpus-per-task=1` (the default) | The job gets, and is charged for, two logical CPUs. `--hint=nomultithread` is irrelevant. | The job gets one logical CPU, but is charged for 80. `--hint=nomultithread` is irrelevant. This configuration is extremely uneconomical on Māui. Consider using Mahuika or the Māui ancillary nodes instead. |
| `--cpus-per-task=N` | The job gets, and is charged for, N logical CPUs, rounded up to the nearest even number. Set N to an even number if possible. | The job gets N logical CPUs, but is charged for 80. Set N to 80 if possible. |
| `--cpus-per-task=N` with `--hint=nomultithread` | The job gets, and is charged for, 2N logical CPUs. | The job gets 2N logical CPUs, but is charged for 80. Set N to 40 if possible. |
| `--ntasks=n` | Each task gets two logical CPUs. The job is charged for two logical CPUs per task. `--hint=nomultithread` is irrelevant. | Each task gets one logical CPU. The job is charged for 80 logical CPUs per allocated node. If possible, set the number of tasks per node to 80. |
| `--ntasks=n` with `--hint=nomultithread` | Each task gets two logical CPUs. The job is charged for two logical CPUs per task. | Each task gets two logical CPUs. The job is charged for 80 logical CPUs per allocated node. If possible, set the number of tasks per node to 40. |
| `--ntasks=n --cpus-per-task=N` | Each task gets N logical CPUs, rounded up to the nearest even number. The job is charged for that number of logical CPUs per task. Set N to an even number if possible. | Each task gets N logical CPUs. The job is charged for 80 logical CPUs per allocated node. If possible, set N and the number of tasks per node such that N × (tasks per node) = 80. |
| `--ntasks=n --cpus-per-task=N` with `--hint=nomultithread` | Each task gets 2N logical CPUs. The job is charged for 2N logical CPUs per task. | Each task gets 2N logical CPUs. The job is charged for 80 logical CPUs per allocated node. If possible, set N and the number of tasks per node such that N × (tasks per node) = 40. |
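The "rounded up to the nearest even number" rule in the Mahuika rows is plain integer arithmetic, since allocation happens in whole physical cores of two logical CPUs each; a quick illustration:

```shell
# Round a logical-CPU request up to a whole number of physical cores.
for n in 1 2 5 8; do
  charged=$(( (n + 1) / 2 * 2 ))
  echo "request ${n} -> charged ${charged}"
done
# request 1 -> charged 2
# request 2 -> charged 2
# request 5 -> charged 6
# request 8 -> charged 8
```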