Using containers on clusters
How to use containers on clusters
On every cluster with Slurm, datasets and intermediate results should go in
$SLURM_TMPDIR while the final experiment results should go in $SCRATCH.
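For example, a typical job follows this pattern (placeholders such as <YOUR_DATASET> and <to_save> stand for your own files):
# On the compute node allocated by Slurm
rsync -avz $SCRATCH/<YOUR_DATASET> $SLURM_TMPDIR   # dataset on the node's local disk
# ... run your experiment, writing intermediate files to $SLURM_TMPDIR ...
rsync -avz $SLURM_TMPDIR/<to_save> $SCRATCH        # keep only the final results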
In order to use the container you built, you need to copy it to the cluster you
want to use.
Warning
You should always store your container in $SCRATCH !
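For example, from the machine where you built the container, you could copy the image to your scratch space with something like the following (the <login_node> hostname and the destination path are placeholders; use the values for your cluster):
# Run from the machine where you built the container
rsync -avz <YOUR_CONTAINER> <login_node>:<path_to_your_scratch>/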
Then reserve a node with srun/sbatch, copy the container and your dataset to the
node allocated by Slurm (i.e. to $SLURM_TMPDIR) and execute the code
<YOUR_CODE> within the container <YOUR_CONTAINER> with:
singularity exec --nv -H $HOME:/home -B $SLURM_TMPDIR:/dataset/ -B $SLURM_TMPDIR:/tmp_log/ -B $SCRATCH:/final_log/ $SLURM_TMPDIR/<YOUR_CONTAINER> python <YOUR_CODE>
Remember that /dataset, /tmp_log and /final_log were created in the
previous section. Now, each time we use singularity, we explicitly tell it,
with the -B option, to mount $SLURM_TMPDIR (on the cluster’s compute node)
into the folder /dataset inside the container, so that every dataset
downloaded by PyTorch into /dataset is actually stored in $SLURM_TMPDIR.
This allows us to write code and scripts that are invariant to the cluster
environment. The -H option specifies what the container’s home will be. For
example, if your code lives in $HOME/Project12345/Version35/ you can
specify -H $HOME/Project12345/Version35:/home, so that the container only
has access to the code inside Version35.
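Concretely, once inside the container, your code can always use the same paths, whatever the cluster (the file names below are only illustrative):
# Inside the container, on any cluster:
ls /dataset                     # really $SLURM_TMPDIR on the compute node
touch /tmp_log/checkpoint.tmp   # intermediate results, stored in $SLURM_TMPDIR
touch /final_log/results.txt    # final results, stored in $SCRATCH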
If you want to run multiple commands inside the container you can use:
singularity exec --nv -H $HOME:/home -B $SLURM_TMPDIR:/dataset/ \
-B $SLURM_TMPDIR:/tmp_log/ -B $SCRATCH:/final_log/ \
$SLURM_TMPDIR/<YOUR_CONTAINER> bash -c 'pwd && ls && python <YOUR_CODE>'
Example: Interactive case (srun/salloc)
Once you get an interactive session with SLURM, copy <YOUR_CONTAINER> and
<YOUR_DATASET> to $SLURM_TMPDIR
0. Get an interactive session
srun --gres=gpu:1 --pty bash
1. Copy your container on the compute node
rsync -avz $SCRATCH/<YOUR_CONTAINER> $SLURM_TMPDIR
2. Copy your dataset on the compute node
rsync -avz $SCRATCH/<YOUR_DATASET> $SLURM_TMPDIR
Then use singularity shell to get a shell inside the container
3. Get a shell in your environment
singularity shell --nv \
-H $HOME:/home \
-B $SLURM_TMPDIR:/dataset/ \
-B $SLURM_TMPDIR:/tmp_log/ \
-B $SCRATCH:/final_log/ \
$SLURM_TMPDIR/<YOUR_CONTAINER>
4. Execute your code
python <YOUR_CODE>
Or, instead of steps 3 and 4, use singularity exec to execute <YOUR_CODE> directly.
3. Execute your code
singularity exec --nv \
-H $HOME:/home \
-B $SLURM_TMPDIR:/dataset/ \
-B $SLURM_TMPDIR:/tmp_log/ \
-B $SCRATCH:/final_log/ \
$SLURM_TMPDIR/<YOUR_CONTAINER> \
python <YOUR_CODE>
You can also create the following alias to make your life easier.
alias my_env='singularity exec --nv \
-H $HOME:/home \
-B $SLURM_TMPDIR:/dataset/ \
-B $SLURM_TMPDIR:/tmp_log/ \
-B $SCRATCH:/final_log/ \
$SLURM_TMPDIR/<YOUR_CONTAINER>'
This will allow you to run any code with:
my_env python <YOUR_CODE>
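As with singularity exec above, the alias also works for chaining multiple commands, e.g.:
my_env bash -c 'pwd && ls && python <YOUR_CODE>'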
Example: sbatch case
You can also create an sbatch script:
#!/bin/bash
#SBATCH --cpus-per-task=6 # Ask for 6 CPUs
#SBATCH --gres=gpu:1 # Ask for 1 GPU
#SBATCH --mem=10G # Ask for 10 GB of RAM
#SBATCH --time=0:10:00 # The job will run for 10 minutes
# 1. Copy your container on the compute node
rsync -avz $SCRATCH/<YOUR_CONTAINER> $SLURM_TMPDIR
# 2. Copy your dataset on the compute node
rsync -avz $SCRATCH/<YOUR_DATASET> $SLURM_TMPDIR
# 3. Executing your code with singularity
singularity exec --nv \
-H $HOME:/home \
-B $SLURM_TMPDIR:/dataset/ \
-B $SLURM_TMPDIR:/tmp_log/ \
-B $SCRATCH:/final_log/ \
$SLURM_TMPDIR/<YOUR_CONTAINER> \
python "<YOUR_CODE>"
# 4. Copy whatever you want to save on $SCRATCH
rsync -avz $SLURM_TMPDIR/<to_save> $SCRATCH
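Save this script as, for instance, job.sh (the name is arbitrary) and submit it with:
sbatch job.sh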
Issue with PyBullet and OpenGL libraries
If you are running certain gym environments that require pyglet, you may
encounter a problem when running your Singularity container with the Nvidia
drivers enabled via the --nv flag. This happens because the --nv flag also
provides the OpenGL libraries:
libGL.so.1 => /.singularity.d/libs/libGL.so.1
libGLX.so.0 => /.singularity.d/libs/libGLX.so.0
If you don’t experience those problems with pyglet, you probably don’t need
to address this. Otherwise, you can resolve those problems by installing the
Mesa packages inside the container (apt-get install -y libosmesa6-dev
mesa-utils mesa-utils-extra libgl1-mesa-glx) and then making sure that your
LD_LIBRARY_PATH points to those libraries before the ones in
/.singularity.d/libs, for example in the %environment section of your
Singularity recipe:
%environment
# ...
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/mesa:$LD_LIBRARY_PATH
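The package installation itself belongs in the %post section of your Singularity recipe; a minimal sketch, assuming a Debian/Ubuntu base image:
%post
    # Install the Mesa OpenGL libraries inside the container
    apt-get update && apt-get install -y \
        libosmesa6-dev mesa-utils mesa-utils-extra libgl1-mesa-glx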
Mila cluster
On the Mila cluster, $SCRATCH is not yet defined; you should put the
experiment results you want to keep in /network/scratch/<u>/<username>/. In
order to use the sbatch script above and to match the other clusters’
environment names, you can define $SCRATCH as an environment variable
pointing to /network/scratch/<u>/<username> with:
echo "export SCRATCH=/network/scratch/${USER:0:1}/$USER" >> ~/.bashrc
Then, you can follow the general procedure explained above.
Digital Research Alliance of Canada
Using Singularity on the Digital Research Alliance of Canada clusters is
similar, except that you need to add Yoshua’s account name
(--account=rpp-bengioy) and load the singularity module. Here is an example
of an sbatch script using Singularity on an Alliance (formerly Compute
Canada) cluster:
Warning
You should use singularity/2.6 or singularity/3.4. There is a bug in singularity/3.2 which makes GPUs unusable.
#!/bin/bash
#SBATCH --account=rpp-bengioy # Yoshua pays for your job
#SBATCH --cpus-per-task=6 # Ask for 6 CPUs
#SBATCH --gres=gpu:1 # Ask for 1 GPU
#SBATCH --mem=32G # Ask for 32 GB of RAM
#SBATCH --time=0:10:00 # The job will run for 10 minutes
#SBATCH --output="/scratch/<user>/slurm-%j.out" # Modify the output of sbatch

# 1. You have to load singularity
module load singularity
# 2. Then you copy the container to the local disk
rsync -avz $SCRATCH/<YOUR_CONTAINER> $SLURM_TMPDIR
# 3. Copy your dataset on the compute node
rsync -avz $SCRATCH/<YOUR_DATASET> $SLURM_TMPDIR
# 4. Executing your code with singularity
singularity exec --nv \
-H $HOME:/home \
-B $SLURM_TMPDIR:/dataset/ \
-B $SLURM_TMPDIR:/tmp_log/ \
-B $SCRATCH:/final_log/ \
$SLURM_TMPDIR/<YOUR_CONTAINER> \
python "<YOUR_CODE>"
# 5. Copy whatever you want to save on $SCRATCH
rsync -avz $SLURM_TMPDIR/<to_save> $SCRATCH