Mahuika Slurm Partitions
General Limits¶
- No individual job can request more than 20,000 CPU hours (the number of CPUs multiplied by the requested walltime). Consequently, a shorter job can request more CPUs than a longer one (short-and-wide vs long-and-skinny).
- No user can have more than 1,000 jobs in the queue at a time.
These limits are defaults and can be altered on a per-account basis if there is a good reason. For example, we will increase the limit on queued jobs for users who need to submit large numbers of jobs, provided that they undertake to do so with job arrays, as in the sketch below.
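For example, under the 20,000 CPU-hour cap a 40-hour job could request up to 500 CPUs, while a 200-hour job could request at most 100. As a rough sketch of the job-array approach (the job name, array range and program are only illustrative), one submission like the following replaces hundreds of separately submitted jobs:

#!/bin/bash -e
#SBATCH --job-name=array_sketch        # illustrative name
#SBATCH --array=1-500                  # 500 tasks, distinguished by SLURM_ARRAY_TASK_ID
#SBATCH --time=01:00:00                # walltime per array task
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G

srun my_program input.${SLURM_ARRAY_TASK_ID}   # hypothetical program and input naming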
Partitions¶
A partition can be specified via the appropriate sbatch option, e.g.:
#SBATCH --partition=milan
However on Mahuika there is generally no need to do so, since the default behaviour is that your job will be assigned to the most suitable partition(s) automatically, based on the resources it requests, including particularly its memory/CPU ratio and time limit.
The `milan` partition is currently an exception: since it runs a different operating system version, it is currently configured to be opt-in only, and your job will not land there unless you request it.
If you do specify a partition and your job is not a good fit for that partition, you may receive a warning. Please do not ignore it. E.g.:
sbatch: `bigmem` is not the most appropriate partition for this job, which would otherwise default to `large`. If you believe this is incorrect then contact support and quote the Job ID number.
Name | Max Walltime | Nodes | CPUs/Node | GPUs/Node | Available Mem/CPU | Available Mem/Node | Max CPUs/job | Description |
---|---|---|---|---|---|---|---|---|
long | 3 weeks | 69 | 72 | None | 1500 MB | 105 GB | 720 | Jobs longer than 3 days. |
large | 3 days | long+157 | 72 | None | 1500 MB | 105 GB | 288 | Default partition. |
milan | 7 days | 56<br>8 | 256<br>256 | None | 1850 MB<br>3800 MB | 460 GB<br>960 GB | 2560 | Jobs using Milan Nodes. |
bigmem | 7 days | 6 | 72 | None | 6300 MB | 460 GB | 288 | Large amounts of memory. |
infill | 7 days | 6 | 54 | None | 5500 MB | 300 GB | | |
hugemem | 7 days | 2<br>1<br>1 | 80<br>128<br>176 | None | 18 GB<br>30 GB<br>35 GB | 1,500 GB<br>4,000 GB<br>6,000 GB | 256 | Very large amounts of memory. |
gpu | 7 days | 1<br>4<br>2<br>2<br>1 | 18, plus 54 shared with infill | 1 P100<br>2 P100<br>1 A100<br>2 A100<br>7 A100-1g.5gb | 6300 MB | 160 GB, plus 300 GB shared with infill | 64 | Nodes with GPUs. See below for more info. |
hgx | 7 days | 4 | 128 | 4 A100 | 6300 MB | 460 GB | 64 | Part of Milan Nodes. See below. |
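As an illustration of the automatic routing described above (all resource values here are only an example), a job whose requests exceed what the default large partition offers, say more than 1500 MB per CPU or a walltime beyond 3 days, needs no --partition directive; based on the table above, Slurm will place it in a suitable partition such as bigmem:

#!/bin/bash -e
#SBATCH --job-name=highmem_sketch   # illustrative name
#SBATCH --time=5-00:00:00           # 5 days: too long for large (3 days), fine for bigmem (7 days)
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=6G            # well above large's 1500 MB per CPU, within bigmem's 6300 MB
# no --partition line: the scheduler chooses the partition(s) from these requests

srun my_program                     # hypothetical executable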
Quality of Service¶
Orthogonal to the partitions, each job has a "Quality of Service", with
the default QoS for a job being determined by the allocation class of
its project. There are other QoSs which you can select with the
`--qos` option:
Debug¶
Specifying `--qos=debug`
will give the job very high priority, but is
subject to strict limits: 15 minutes per job, and only 1 job at a time
per user. Debug jobs may not span more than two nodes.
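For example, a short test run under this QoS might look like the following sketch (the task count, memory and program are illustrative); note that the requested time must stay within the 15-minute cap:

#!/bin/bash -e
#SBATCH --qos=debug             # high priority, strict limits
#SBATCH --time=00:10:00         # must not exceed 15 minutes
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=1500MB

srun my_test_program            # hypothetical executable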
Interactive¶
Specifying `--qos=interactive`
will give the job very high priority, but
is subject to some limits: up to 4 jobs, 16 hours duration, 4 CPUs, 128
GB, and 1 GPU.
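One common way to use this QoS is to start an interactive shell with srun, staying within the limits above (the resource values here are only an example):

srun --qos=interactive --time=02:00:00 --cpus-per-task=4 --mem=8G --pty bash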
Requesting GPUs¶
GPU code | GPU type |
---|---|
P100 | NVIDIA Tesla P100 PCIe 12GB cards |
A100 (gpu partition) | NVIDIA Tesla A100 PCIe 40GB cards |
A100-1g.5gb | 1 NVIDIA Tesla A100 PCIe 40GB card divided into 7 MIG GPU slices (5GB each). |
A100 (hgx partition) | NVIDIA Tesla A100 80GB, on an HGX baseboard with NVLink GPU-to-GPU interconnect between the 4 GPUs |
The default GPU type is P100, of which you can request 1 or 2 per node:
#SBATCH --gpus-per-node=1 # or equivalently, P100:1
To request A100 GPUs, use instead:
#SBATCH --gpus-per-node=A100:1
See GPU use on NeSI for more details about Slurm and CUDA settings.
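Putting these pieces together, a minimal GPU batch script might look like the following sketch (job name, resource values and program are illustrative assumptions):

#!/bin/bash -e
#SBATCH --job-name=gpu_sketch      # illustrative name
#SBATCH --gpus-per-node=P100:1     # one P100; use A100:1 to request an A100 instead
#SBATCH --cpus-per-task=4          # GPU jobs may use at most 64 CPUs (see below)
#SBATCH --mem=12G
#SBATCH --time=12:00:00

# load whatever CUDA/application modules your software needs, then run it
srun my_gpu_program                # hypothetical executable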
Limits on GPU Jobs¶
- There is a per-project limit of 6 GPUs being used at a time.
- There is also a per-project limit of 360 GPU-hours being allocated to running jobs. This reduces the number of GPUs available for longer jobs, so for example you can use 2 GPUs at a time if your jobs run for a week, 5 GPUs for two days, or 6 GPUs for one day jobs. The intention is to guarantee that all users can get their GPU debugging jobs running in a reasonably timely manner.
- Each GPU job can use no more than 64 CPUs. This is to ensure that GPUs are not left idle just because their node has no remaining free CPUs.
- There is a limit of one A100-1g.5gb GPU job per user.
Accessing A100 GPUs in the `hgx` partition¶
The A100 GPUs in the `hgx` partition are designated for workloads requiring large memory (up to 80GB) or multi-GPU computing (up to 4 GPUs connected via NVLink):
- Explicitly specify the partition to access them, with `--partition=hgx`, as in the example below.
- Hosting nodes are Milan nodes. Check the dedicated support page for more information about the Milan nodes' differences from Mahuika's Broadwell nodes.
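For example, a multi-GPU job on these nodes could be requested with a script along the following lines (CPU, memory and time values are illustrative):

#!/bin/bash -e
#SBATCH --job-name=hgx_sketch      # illustrative name
#SBATCH --partition=hgx            # the hgx partition must be requested explicitly
#SBATCH --gpus-per-node=A100:4     # up to 4 NVLink-connected A100s per node
#SBATCH --cpus-per-task=16         # GPU jobs are limited to 64 CPUs
#SBATCH --mem=100G
#SBATCH --time=24:00:00

srun my_multi_gpu_program          # hypothetical executable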
Mahuika Infiniband Islands¶
Mahuika is divided into “islands” of 26 nodes (or 1,872 CPUs). Communication between two nodes on the same island is faster than between two nodes on different islands. MPI jobs placed entirely within one island will often perform better than those split among multiple islands.
You can request that a job runs within a single InfiniBand island by adding:
#SBATCH --switches=1
Slurm will then run the job within one island provided that this does
not delay starting the job by more than the maximum switch waiting time,
currently configured to be 5 minutes. That waiting time limit can be
reduced by adding `@<time>` after the number of switches, e.g.:
#SBATCH --switches=1@00:30:00
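For example, here is a sketch of an MPI job that fits comfortably inside a single island and asks Slurm to keep it there (the job name, task count, walltime and program are illustrative):

#!/bin/bash -e
#SBATCH --job-name=mpi_island_sketch   # illustrative name
#SBATCH --ntasks=288                   # 4 nodes' worth of CPUs, well within a 26-node island
#SBATCH --mem-per-cpu=1500MB
#SBATCH --time=06:00:00
#SBATCH --switches=1                   # ask for all nodes to be on one InfiniBand island

srun my_mpi_program                    # hypothetical MPI executable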