Singularity
Overview
What is Singularity?
Running Docker on SLURM is a security problem (e.g. running as root, being able to mount any directory). The alternative is to use Singularity, which is a popular solution in the world of HPC.
There is a good level of compatibility between Docker and Singularity, and we can find many exaggerated claims about being able to convert containers from Docker to Singularity without any friction. Docker images from DockerHub are often 100% compatible with Singularity, and they can indeed be used without friction, but things get messy when we try to convert our own Docker build files to Singularity recipes.
Links to official documentation
official Singularity user guide (this is the one you will use most often)
official Singularity admin guide
Overview of the steps used in practice
Most often, the process to create and use a Singularity container is:
on your Linux computer (at home or work)
select a Docker image from DockerHub (e.g. pytorch/pytorch)
make a recipe file for Singularity that starts with that DockerHub image
build the recipe file, thus creating the image file (e.g. my-pytorch-image.sif)
test your singularity container before sending it over to the cluster
rsync -av my-pytorch-image.sif <login-node>:Documents/my-singularity-images
on the login node for that cluster
queue your jobs with sbatch ... (note that your jobs will copy my-pytorch-image.sif over to $SLURM_TMPDIR and will then launch Singularity with that image)
do something else while you wait for them to finish
queue more jobs with the same my-pytorch-image.sif, reusing it many times over
In the following sections you will find specific examples or tips to accomplish in practice the steps highlighted above.
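To make these steps concrete, here is a minimal sketch of the whole workflow. The recipe file name my-pytorch-image.def and its contents are placeholders, not a recommended setup:
# my-pytorch-image.def: start from a DockerHub image, then add what you need
Bootstrap: docker
From: pytorch/pytorch
%post
    pip install tqdm
# build locally (requires sudo), test, then ship the image to the cluster
sudo singularity build my-pytorch-image.sif my-pytorch-image.def
singularity exec my-pytorch-image.sif python -c "import torch; print(torch.__version__)"
rsync -av my-pytorch-image.sif <login-node>:Documents/my-singularity-images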
Nope, not on MacOS
Singularity does not work on MacOS, as of the time of this writing in 2021. Docker does not actually run on MacOS either; instead, it silently installs a virtual machine running Linux, which makes for a pleasant experience since the user does not need to care about the details of how Docker does it.
Given its origins in HPC, Singularity does not provide that kind of seamless experience on MacOS, even though it’s technically possible to run it inside a Linux virtual machine on MacOS.
Where to build images
Building Singularity images is a rather heavy task, which can take 20 minutes if you have a lot of steps in your recipe. This makes it a bad task to run on the login nodes of our clusters, especially if it needs to be run regularly.
On the Mila cluster, we are lucky to have unrestricted internet access on the compute nodes, which means that anyone can request an interactive CPU node (no need for GPU) and build their images there without problem.
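As a sketch, such a build session could look like this (the resource values are illustrative, and building directly from a DockerHub image does not require sudo):
# get an interactive CPU node, keep the cache on the local disk, then build
salloc --cpus-per-task=4 --mem=16G --time=1:00:00
export SINGULARITY_CACHEDIR=$SLURM_TMPDIR
singularity build my-pytorch-image.sif docker://pytorch/pytorch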
Warning
Do not build Singularity images from scratch every time you run a
job in a large batch. This would be a colossal waste of GPU time as well as
internet bandwidth. If you set up your workflow properly (e.g. using bind
paths for your code and data), you can spend months reusing the same
Singularity image my-pytorch-image.sif, as shown in the sketch below.
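For instance, instead of baking your code into the image, you can bind-mount it at runtime and keep reusing the same image while the code evolves (the project path and script name here are hypothetical):
singularity exec --nv \
        -B $HOME/my-project:/code \
        -B $SLURM_TMPDIR:/dataset \
        my-pytorch-image.sif \
        python /code/train.py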
Building the containers
Building a container is like creating a new environment, except that containers are much more powerful since they are self-contained systems. With singularity, there are two ways to build containers.
The first way is to do it yourself, like when you get a new Linux laptop and don't really know what you need: whenever you notice that something is missing, you install it. Here you can get a vanilla container with Ubuntu, called a sandbox, log into it and install each package yourself. This procedure takes time, but it will let you understand how things work and what you need. It is recommended if you need to figure out how things will be compiled or if you want to install packages on the fly. We'll refer to this procedure as singularity sandboxes.
The second way is for when you already know what you want: you write a list of everything you need, send it to singularity, and it installs everything for you. Those lists are called singularity recipes.
First way: Build and use a sandbox
You might ask yourself: On which machine should I build a container?
First of all, you need to choose where you'll build your container, since this operation requires a lot of memory and CPU.
Warning
Do NOT build containers on any login nodes!
(Recommended for beginners) If you need to use apt-get, you should build the container on your laptop with sudo privileges. You'll only need to install singularity on your laptop. Windows/Mac users can look at the official installation instructions, and Ubuntu/Debian users can directly use:
sudo apt-get install singularity-container
If you can't install singularity on your laptop and you don't need apt-get, you can reserve a CPU node on the Mila cluster to build your container.
In this case, in order to avoid too much I/O over the network, you should define the singularity cache directory locally:
export SINGULARITY_CACHEDIR=$SLURM_TMPDIR
If you can't install singularity on your laptop and you want to use apt-get, you can use singularity-hub to build your containers; see the recipe section below.
Download containers from the web
Fortunately, you may not need to create containers from scratch, as many have already been built for the most common deep learning software. You can find most of them on DockerHub.
Go on DockerHub and select the container you want to pull.
For example, if you want to get the latest PyTorch version with GPU support (replace runtime with devel if you need the full CUDA toolkit):
singularity pull docker://pytorch/pytorch:1.0.1-cuda10.0-cudnn7-runtime
Or the latest TensorFlow:
singularity pull docker://tensorflow/tensorflow:latest-gpu-py3
Currently the pulled image pytorch.simg or tensorflow.simg is read-only,
meaning that you won't be able to install anything on it. From now on, PyTorch
will be used as the example. If you use TensorFlow, simply replace every
occurrence of pytorch with tensorflow.
How to add or install stuff in a container
The first step is to transform your read-only container
pytorch-1.0.1-cuda10.0-cudnn7-runtime.simg into a writable version that will
allow you to add packages.
Warning
Depending on the version of singularity you are using, singularity will build a container with the extension .simg or .sif. If you're using .sif files, replace every occurrence of .simg with .sif.
Tip
If you want to use apt-get, you have to put sudo in front of the following commands.
This command will create a writable image in the folder pytorch.
singularity build --sandbox pytorch pytorch-1.0.1-cuda10.0-cudnn7-runtime.simg
Then you'll need the following command to log into the container.
singularity shell --writable -H $HOME:/home pytorch
Once you get into the container, you can use pip to install anything you need
(or apt-get, if you built the container with sudo).
Warning
Singularity mounts your home folder, so if you install things into
the $HOME of your container, they will be installed in your real
$HOME!
You should install your stuff in /usr/local instead.
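Concretely, inside the sandbox shell this means preferring system-wide installs; the package below is just an example:
# good: installs into the container's own prefix (e.g. /usr/local or /opt/conda)
pip install tqdm
# bad: --user would write to $HOME, i.e. your real home directory
# pip install --user tqdm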
Creating useful directories
One of the benefits of containers is that you can use them across different clusters. However, the location of the datasets and experiments folders can differ from one cluster to another. In order to be invariant to those locations, we will create some useful mount points inside the container:
mkdir /dataset
mkdir /tmp_log
mkdir /final_log
From now on, you won't need to worry about specifying where to pick up your
dataset when you write your code. Your dataset will always be in /dataset,
independently of the cluster you are using.
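For example, the same command then works on any cluster, and only the bind source changes (the script name and flag are hypothetical):
singularity exec -B $SLURM_TMPDIR:/dataset pytorch \
        python train.py --data_dir /dataset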
Testing
If you have some code that you want to test before finalizing your container, you have two choices. You can either log into your container and run Python code inside it with:
singularity shell --nv pytorch
Or you can execute your command directly with
singularity exec --nv pytorch python YOUR_CODE.py
Tip
--nv allows the container to use GPUs. You don't need this if you don't plan to use a GPU.
Warning
Don't forget to clear the caches of the packages you installed in the container.
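For example, something along these lines keeps the final image small (exact commands depend on your apt/pip versions; older pips use rm -rf ~/.cache/pip instead of pip cache purge):
apt-get clean && rm -rf /var/lib/apt/lists/*
pip cache purge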
Creating a new image from the sandbox
Once everything you need is installed inside the container, you need to convert it back to a read-only singularity image with:
singularity build pytorch_final.simg pytorch
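A quick sanity check on the rebuilt image could be (the import is just an example):
singularity exec --nv pytorch_final.simg python -c "import torch; print(torch.__version__)"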
Second way: Use recipes
A singularity recipe is a file specifying the software to install, the environment variables, the files to add, and the container metadata. It is a starting point for designing any custom container. Instead of pulling a container and installing your packages manually, you can specify in this file the packages you want and then build your container from it.
Here is a toy example of a singularity recipe installing some stuff:
################# Header: Define the base system you want to use ################
# Reference of the kind of base you want to use (e.g., docker, debootstrap, shub).
Bootstrap: docker
# Select the docker image you want to use (Here we choose tensorflow)
From: tensorflow/tensorflow:latest-gpu-py3
################# Section: Defining the system #################################
# Commands in the %post section are executed within the container.
%post
echo "Installing Tools with apt-get"
apt-get update
apt-get install -y cmake libcupti-dev libyaml-dev wget unzip
apt-get clean
echo "Installing things with pip"
pip install tqdm
echo "Creating mount points"
mkdir /dataset
mkdir /tmp_log
mkdir /final_log
# Environment variables that should be sourced at runtime.
%environment
# use bash as default shell
SHELL=/bin/bash
export SHELL
A recipe file contains two parts: the header and the sections. In the
header you specify which base system you want to use; it can be any docker
or singularity container. In the sections, you list the things you want to
install in the subsection %post, and the environment variables that need to be
sourced at each runtime in the subsection %environment. For a more detailed
description, please look at the singularity documentation.
In order to build a singularity container from a singularity recipe file, you should use:
sudo singularity build <NAME_CONTAINER> <YOUR_RECIPE_FILES>
Warning
You always need to use sudo when you build a container from a recipe. As there is no access to sudo on the cluster, you need a personal computer or singularity hub to build a container.
Build recipe on singularity hub
Singularity hub allows users to build containers from recipes directly on singularity-hub's cloud, meaning that you don't need to build containers yourself. You need to register on singularity-hub and link your singularity-hub account to your GitHub account, then:
Create a new GitHub repository.
Add a collection on singularity-hub and select the GitHub repository you created.
Clone the GitHub repository on your computer:
$ git clone <url>
Write the singularity recipe and save it as a file named Singularity.
Git add Singularity, commit and push on the master branch:
$ git add Singularity
$ git commit
$ git push origin master
At this point, robots from singularity-hub will build the container for you. You will be able to download your container from the website or directly with:
singularity pull shub://<github_username>/<repository_name>
Example: Recipe with OpenAI gym, MuJoCo and Miniworld
Here is an example of how you can use a singularity recipe to install a complex environment such as OpenAI gym, MuJoCo and Miniworld on a PyTorch-based container. In order to use MuJoCo, you'll need to copy the key stored on the Mila cluster at /ai/apps/mujoco/license/mjkey.txt to your current directory.
# This is a Singularity recipe that sets up a full Gym install with test dependencies
Bootstrap: docker
# Here we'll build our container upon the pytorch container
From: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
# Now we'll copy the mjkey file located in the current directory inside the container's root
# directory
%files
mjkey.txt
# Then we put everything we need to install
%post
export PATH=$PATH:/opt/conda/bin
apt -y update && \
apt install -y keyboard-configuration && \
apt install -y \
python3-dev \
python-pyglet \
python3-opengl \
libhdf5-dev \
libjpeg-dev \
libboost-all-dev \
libsdl2-dev \
libosmesa6-dev \
patchelf \
ffmpeg \
xvfb \
libhdf5-dev \
openjdk-8-jdk \
wget \
git \
unzip && \
apt clean && \
rm -rf /var/lib/apt/lists/*
pip install h5py
# Download Gym and MuJoCo
mkdir /Gym && cd /Gym
git clone https://github.com/openai/gym.git || true && \
mkdir /Gym/.mujoco && cd /Gym/.mujoco
wget https://www.roboti.us/download/mjpro150_linux.zip && \
unzip mjpro150_linux.zip && \
wget https://www.roboti.us/download/mujoco200_linux.zip && \
unzip mujoco200_linux.zip && \
mv mujoco200_linux mujoco200
# Export global environment variables
export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt
export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mjpro150/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin
cp /mjkey.txt /Gym/.mujoco/mjkey.txt
# Install Python dependencies
wget https://raw.githubusercontent.com/openai/mujoco-py/master/requirements.txt
pip install -r requirements.txt
# Install Gym and MuJoCo
cd /Gym/gym
pip install -e '.[all]'
# Change permissions to use mujoco_py as a non-sudoer user
chmod -R 777 /opt/conda/lib/python3.6/site-packages/mujoco_py/
pip install --upgrade minerl
# Export global environment variables
%environment
export SHELL=/bin/sh
export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt
export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mjpro150/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin
export PATH=/Gym/gym/.tox/py3/bin:$PATH
%runscript
exec /bin/sh "$@"
Here is the same recipe but written for TensorFlow:
# This is a Singularity recipe that sets up a full Gym install with test dependencies
Bootstrap: docker
# Here we'll build our container upon the tensorflow container
From: tensorflow/tensorflow:latest-gpu-py3
# Now we'll copy the mjkey file located in the current directory inside the container's root
# directory
%files
mjkey.txt
# Then we put everything we need to install
%post
apt -y update && \
apt install -y keyboard-configuration && \
apt install -y \
python3-setuptools \
python3-dev \
python-pyglet \
python3-opengl \
libjpeg-dev \
libboost-all-dev \
libsdl2-dev \
libosmesa6-dev \
patchelf \
ffmpeg \
xvfb \
wget \
git \
unzip && \
apt clean && \
rm -rf /var/lib/apt/lists/*
# Download Gym and MuJoCo
mkdir /Gym && cd /Gym
git clone https://github.com/openai/gym.git || true && \
mkdir /Gym/.mujoco && cd /Gym/.mujoco
wget https://www.roboti.us/download/mjpro150_linux.zip && \
unzip mjpro150_linux.zip && \
wget https://www.roboti.us/download/mujoco200_linux.zip && \
unzip mujoco200_linux.zip && \
mv mujoco200_linux mujoco200
# Export global environment variables
export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt
export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mjpro150/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin
cp /mjkey.txt /Gym/.mujoco/mjkey.txt
# Install Python dependencies
wget https://raw.githubusercontent.com/openai/mujoco-py/master/requirements.txt
pip install -r requirements.txt
# Install Gym and MuJoCo
cd /Gym/gym
pip install -e '.[all]'
# Change permissions to use mujoco_py as a non-sudoer user
chmod -R 777 /usr/local/lib/python3.5/dist-packages/mujoco_py/
# Then install miniworld
cd /usr/local/
git clone https://github.com/maximecb/gym-miniworld.git
cd gym-miniworld
pip install -e .
# Export global environment variables
%environment
export SHELL=/bin/bash
export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt
export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mjpro150/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin
export PATH=/Gym/gym/.tox/py3/bin:$PATH
%runscript
exec /bin/bash "$@"
Keep in mind that those environment variables are sourced at runtime and not at
build time. This is why you should also define them in the %post section,
since they are required to install MuJoCo.
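As a minimal illustration of this point, reusing a variable from the recipes above:
%post
    # visible while the container is being built, e.g. when pip installs mujoco-py
    export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mjpro150/
%environment
    # sourced every time the container is run
    export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mjpro150/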
Using containers on clusters
How to use containers on clusters
On every cluster with Slurm, datasets and intermediate results should go in
$SLURM_TMPDIR while the final experiment results should go in $SCRATCH.
In order to use the container you built, you need to copy it to the cluster you
want to use.
Warning
You should always store your container in $SCRATCH!
Then reserve a node with srun/sbatch, copy the container and your dataset to the
node given by SLURM (i.e. in $SLURM_TMPDIR) and execute the code
<YOUR_CODE> within the container <YOUR_CONTAINER> with:
singularity exec --nv -H $HOME:/home -B $SLURM_TMPDIR:/dataset/ -B $SLURM_TMPDIR:/tmp_log/ -B $SCRATCH:/final_log/ $SLURM_TMPDIR/<YOUR_CONTAINER> python <YOUR_CODE>
Remember that /dataset, /tmp_log and /final_log were created in the
previous section. Now, each time we use singularity, we explicitly tell it
to mount $SLURM_TMPDIR on the cluster's node into the folder /dataset
inside the container with the option -B, such that each dataset
downloaded by PyTorch into /dataset will be available in $SLURM_TMPDIR.
This allows us to have code and scripts that are invariant to the cluster
environment. The option -H specifies what will be the container's home. For
example, if you have your code in $HOME/Project12345/Version35/, you can
specify -H $HOME/Project12345/Version35:/home, so the container will only
have access to the code inside Version35.
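For instance, combining both options could look like this (the project path and script name are hypothetical):
singularity exec --nv \
        -H $HOME/Project12345/Version35:/home \
        -B $SLURM_TMPDIR:/dataset/ \
        $SLURM_TMPDIR/<YOUR_CONTAINER> \
        python /home/main.py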
If you want to run multiple commands inside the container you can use:
singularity exec --nv -H $HOME:/home -B $SLURM_TMPDIR:/dataset/ \
-B $SLURM_TMPDIR:/tmp_log/ -B $SCRATCH:/final_log/ \
$SLURM_TMPDIR/<YOUR_CONTAINER> bash -c 'pwd && ls && python <YOUR_CODE>'
Example: Interactive case (srun/salloc)
Once you get an interactive session with SLURM, copy <YOUR_CONTAINER> and
<YOUR_DATASET> to $SLURM_TMPDIR
0. Get an interactive session
srun --gres=gpu:1 --pty bash
1. Copy your container on the compute node
rsync -avz $SCRATCH/<YOUR_CONTAINER> $SLURM_TMPDIR
2. Copy your dataset on the compute node
rsync -avz $SCRATCH/<YOUR_DATASET> $SLURM_TMPDIR
Then use singularity shell to get a shell inside the container
3. Get a shell in your environment
singularity shell --nv \
-H $HOME:/home \
-B $SLURM_TMPDIR:/dataset/ \
-B $SLURM_TMPDIR:/tmp_log/ \
-B $SCRATCH:/final_log/ \
$SLURM_TMPDIR/<YOUR_CONTAINER>
4. Execute your code
python <YOUR_CODE>
or use singularity exec to execute <YOUR_CODE>.
3. Execute your code
singularity exec --nv \
-H $HOME:/home \
-B $SLURM_TMPDIR:/dataset/ \
-B $SLURM_TMPDIR:/tmp_log/ \
-B $SCRATCH:/final_log/ \
$SLURM_TMPDIR/<YOUR_CONTAINER> \
python <YOUR_CODE>
You can also create the following alias to make your life easier.
alias my_env='singularity exec --nv \
-H $HOME:/home \
-B $SLURM_TMPDIR:/dataset/ \
-B $SLURM_TMPDIR:/tmp_log/ \
-B $SCRATCH:/final_log/ \
$SLURM_TMPDIR/<YOUR_CONTAINER>'
This will allow you to run any code with:
my_env python <YOUR_CODE>
Example: sbatch case
You can also create a sbatch script:
#!/bin/bash
#SBATCH --cpus-per-task=6 # Ask for 6 CPUs
#SBATCH --gres=gpu:1 # Ask for 1 GPU
#SBATCH --mem=10G # Ask for 10 GB of RAM
#SBATCH --time=0:10:00 # The job will run for 10 minutes
# 1. Copy your container on the compute node
rsync -avz $SCRATCH/<YOUR_CONTAINER> $SLURM_TMPDIR
# 2. Copy your dataset on the compute node
rsync -avz $SCRATCH/<YOUR_DATASET> $SLURM_TMPDIR
# 3. Executing your code with singularity
singularity exec --nv \
-H $HOME:/home \
-B $SLURM_TMPDIR:/dataset/ \
-B $SLURM_TMPDIR:/tmp_log/ \
-B $SCRATCH:/final_log/ \
$SLURM_TMPDIR/<YOUR_CONTAINER> \
python "<YOUR_CODE>"
# 4. Copy whatever you want to save on $SCRATCH
rsync -avz $SLURM_TMPDIR/<to_save> $SCRATCH
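Assuming you saved this script under a name like job.sh, you would then submit it from the login node with:
sbatch job.sh
squeue -u $USER    # check that the job is queued or running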
Issue with PyBullet and OpenGL libraries
If you are running certain gym environments that require pyglet, you may
encounter a problem when running your singularity instance with the Nvidia
drivers using the --nv flag. This happens because the --nv flag also
provides the OpenGL libraries:
libGL.so.1 => /.singularity.d/libs/libGL.so.1
libGLX.so.0 => /.singularity.d/libs/libGLX.so.0
If you don't experience those problems with pyglet, you probably don't need
to address this. Otherwise, you can resolve those problems by running apt-get install
-y libosmesa6-dev mesa-utils mesa-utils-extra libgl1-mesa-glx, and then making
sure that your LD_LIBRARY_PATH points to those libraries before the ones in
/.singularity.d/libs.
%environment
# ...
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/mesa:$LD_LIBRARY_PATH
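A quick, if imperfect, way to check which OpenGL library is actually picked up is to run glxinfo (from the mesa-utils package installed above) under xvfb inside the container:
singularity exec --nv <YOUR_CONTAINER> bash -c 'xvfb-run -a glxinfo | grep -i "opengl renderer"'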
Mila cluster
On the Mila cluster, $SCRATCH is not yet defined; you should put the
experiment results you want to keep in /network/scratch/<u>/<username>/. In
order to use the sbatch script above and to match the other clusters'
environment names, you can define $SCRATCH as an alias for
/network/scratch/<u>/<username> with:
echo "export SCRATCH=/network/scratch/${USER:0:1}/$USER" >> ~/.bashrc
Then, you can follow the general procedure explained above.
Digital Research Alliance of Canada
Using singularity on the Digital Research Alliance of Canada clusters is similar, except
that you need to add Yoshua's account name and load the singularity module. Here is an
example of an sbatch script using singularity on an Alliance cluster:
Warning
You should use singularity/2.6 or singularity/3.4. There is a bug in singularity/3.2 which makes GPUs unusable.
#!/bin/bash
#SBATCH --account=rpp-bengioy                    # Yoshua pays for your job
#SBATCH --cpus-per-task=6                        # Ask for 6 CPUs
#SBATCH --gres=gpu:1                             # Ask for 1 GPU
#SBATCH --mem=32G                                # Ask for 32 GB of RAM
#SBATCH --time=0:10:00                           # The job will run for 10 minutes
#SBATCH --output="/scratch/<user>/slurm-%j.out"  # Modify the output of sbatch

# 1. You have to load singularity
module load singularity
# 2. Then you copy the container to the local disk
rsync -avz $SCRATCH/<YOUR_CONTAINER> $SLURM_TMPDIR
# 3. Copy your dataset on the compute node
rsync -avz $SCRATCH/<YOUR_DATASET> $SLURM_TMPDIR
# 4. Executing your code with singularity
singularity exec --nv \
        -H $HOME:/home \
        -B $SLURM_TMPDIR:/dataset/ \
        -B $SLURM_TMPDIR:/tmp_log/ \
        -B $SCRATCH:/final_log/ \
        $SLURM_TMPDIR/<YOUR_CONTAINER> \
        python "<YOUR_CODE>"
# 5. Copy whatever you want to save on $SCRATCH
rsync -avz $SLURM_TMPDIR/<to_save> $SCRATCH