Handling preemption
On the Mila cluster, jobs can preempt one another depending on their priority
(unkillable > high > low) (see the Slurm documentation).
The default preemption mechanism is to kill and re-queue the job automatically
without any notice. To allow a different preemption mechanism, every partition
has been duplicated (i.e. the duplicates have the same characteristics as their
counterparts) to allow a 120-second grace period before your job is killed, but
without requeuing it automatically. These partitions are identified by the
suffix -grace (main-grace, long-grace, main-cpu-grace, long-cpu-grace).
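For example, to target the grace-period variant of the main partition, you can select it directly in your submission script (a minimal sketch; the resource requests below are placeholders to adapt to your job):

#!/bin/bash
#SBATCH --partition=main-grace   # grace-period variant of the main partition
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
python3 my_script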
When using a partition with a grace period, a series of signals consisting of
first SIGCONT and SIGTERM, then SIGKILL, will be sent to the SLURM job. It is
good practice to catch those signals using the Linux trap command to terminate
the job cleanly and save whatever is needed to restart it. On each cluster,
you are allowed a grace period before SLURM actually kills your job (SIGKILL).
The easiest way to handle preemption is to trap the SIGTERM signal:
#SBATCH --ntasks=1
#SBATCH ....

exit_script() {
    echo "Preemption signal, saving myself"
    trap - SIGTERM # clear the trap
    # Optional: sends SIGTERM to child/sub processes
    kill -- -$$
}

trap exit_script SIGTERM

# The main script part
python3 my_script
Note
Requeuing:
The Slurm scheduler on the cluster cannot both grant a grace period before
preempting a job and requeue it automatically; your job will therefore be
cancelled at the end of the grace period.
To requeue it automatically, you can simply add the sbatch command inside
your exit_script function.
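For instance, the handler could look like the sketch below; the checkpoint step and the script path are placeholders to adapt to your job:

exit_script() {
    echo "Preemption signal, saving myself"
    trap - SIGTERM               # clear the trap
    # save a checkpoint here so the next run can resume (job-specific)
    sbatch my_script.sbatch      # resubmit the job (placeholder script path)
    kill -- -$$                  # optional: forward SIGTERM to child processes
}

trap exit_script SIGTERM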
Packing jobs
Sharing a GPU between processes
srun, when used in a batch job, is responsible for starting tasks on the
allocated resources (see srun).
SLURM batch script
#SBATCH --ntasks-per-node=2
#SBATCH --output=myjob_output_wrapper.out
#SBATCH --ntasks=2
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=18G
srun -l --output=myjob_output_%t.out python script args
This will run Python 2 times, each process with 4 CPUs and the same arguments.
--output=myjob_output_%t.out will create 2 output files, appending the task
id (%t) to the filename, plus 1 global log file for anything happening outside
the srun command.
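With the script above, you would therefore get myjob_output_0.out and myjob_output_1.out for the two tasks, and myjob_output_wrapper.out for the wrapper.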
Knowing that, if you want to pass 2 different arguments to the Python program,
you can use a multi-prog configuration file: srun -l --multi-prog silly.conf
0 python script firstarg
1 python script secondarg
Or you can specify a range of tasks.
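For example (a sketch, assuming your script accepts the task id as an argument):

0-1 python script %t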
%t being the taskid that your Python script will parse. Note the -l on the
srun command: this will prepend each line of output with the taskid (0:, 1:).
Sharing a node with multiple GPUs, 1 process/GPU
On the Digital Research Alliance of Canada clusters, several nodes, especially
nodes with large GPUs (P100), are reserved for jobs requesting the whole node;
packing multiple processes into a single job therefore lets you leverage those
faster GPUs.
If you want different tasks to access different GPUs in a single allocation,
you need to create an allocation requesting a whole node and use srun with a
subset of those resources (1 GPU each).
Keep in mind that every resource not specified on the srun command line will
inherit the global allocation specification, so you need to split each resource
into a subset (except --cpus-per-task, which is a per-task requirement).
Each srun represents a job step (%s).
Example for a GPU node with 24 cores, 4 GPUs and 128G of RAM
Requesting 1 task per GPU
#!/bin/bash
#SBATCH --nodes=1-1
#SBATCH --ntasks-per-node=4
#SBATCH --output=myjob_output_wrapper.out
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=6
# Launch one step per GPU in the background, then wait for all of them
srun --gres=gpu:1 -n1 --mem=30G -l --output=%j-step-%s.out --exclusive python script args1 &
srun --gres=gpu:1 -n1 --mem=30G -l --output=%j-step-%s.out --exclusive python script args2 &
srun --gres=gpu:1 -n1 --mem=30G -l --output=%j-step-%s.out --exclusive python script args3 &
srun --gres=gpu:1 -n1 --mem=30G -l --output=%j-step-%s.out --exclusive python script args4 &
wait
This will create 4 output files:
JOBID-step-0.out
JOBID-step-1.out
JOBID-step-2.out
JOBID-step-3.out
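Once the job is running, you can also confirm that each srun became its own job step (a quick check; pick whichever sacct fields you prefer):

sacct -j JOBID --format=JobID,AllocCPUS,State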
Sharing a node with multiple GPUs & multiple processes/GPU
Combining both previous sections, we can create a script requesting a whole node
with four GPUs, allocating 1 GPU per srun and sharing each GPU between multiple
processes.
Example, still with 24 cores, 4 GPUs and 128G of RAM
Requesting 2 tasks per GPU
#!/bin/bash
#SBATCH --nodes=1-1
#SBATCH --ntasks-per-node=8
#SBATCH --output=myjob_output_wrapper.out
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=3
srun --gres=gpu:1 -n2 --mem=30G -l --output=%j-step-%s-task-%t.out --exclusive --multi-prog silly.conf &
srun --gres=gpu:1 -n2 --mem=30G -l --output=%j-step-%s-task-%t.out --exclusive --multi-prog silly.conf &
srun --gres=gpu:1 -n2 --mem=30G -l --output=%j-step-%s-task-%t.out --exclusive --multi-prog silly.conf &
srun --gres=gpu:1 -n2 --mem=30G -l --output=%j-step-%s-task-%t.out --exclusive --multi-prog silly.conf &
wait
--exclusive is important: it ensures that each subsequent step/srun binds to a different set of CPUs.
This will produce 8 output files, 2 for each step:
JOBID-step-0-task-0.out
JOBID-step-0-task-1.out
JOBID-step-1-task-0.out
JOBID-step-1-task-1.out
JOBID-step-2-task-0.out
JOBID-step-2-task-1.out
JOBID-step-3-task-0.out
JOBID-step-3-task-1.out
Running nvidia-smi in silly.conf and parsing the output, we can see 4 GPUs
allocated and 2 tasks per GPU.
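For this check, silly.conf could simply look like this (a sketch):

0-1 nvidia-smi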
cat JOBID-step-* | grep Tesla
0: | 0 Tesla P100-PCIE... On | 00000000:04:00.0 Off | 0 |
1: | 0 Tesla P100-PCIE... On | 00000000:04:00.0 Off | 0 |
0: | 0 Tesla P100-PCIE... On | 00000000:83:00.0 Off | 0 |
1: | 0 Tesla P100-PCIE... On | 00000000:83:00.0 Off | 0 |
0: | 0 Tesla P100-PCIE... On | 00000000:82:00.0 Off | 0 |
1: | 0 Tesla P100-PCIE... On | 00000000:82:00.0 Off | 0 |
0: | 0 Tesla P100-PCIE... On | 00000000:03:00.0 Off | 0 |
1: | 0 Tesla P100-PCIE... On | 00000000:03:00.0 Off | 0 |
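Similarly, if you want to confirm that --exclusive gives each step a distinct set of CPUs, one quick check is to have each task print its allowed CPU list, for example with a step like (a sketch):

srun -n2 --exclusive -l grep Cpus_allowed_list /proc/self/status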