Frequently asked questions (FAQs)
Connection/SSH issues
I’m getting connection refused while trying to connect to a login node
Login nodes are protected against brute force attacks and may ban your IP if too many connections or failures are detected. You will be automatically unbanned after 1 hour. For any further problems, please submit a support ticket.
Shell issues
How do I change my shell?
By default you will be assigned /bin/bash as your shell. If you would like to
change to another one, please submit a support ticket.
SLURM issues
How can I get an interactive shell on the cluster?
Use salloc [--slurm_options] without any executable at the end of the
command; this will launch your default shell in an interactive session. Remember
that an interactive session is bound to the login node where you started it, so you
risk losing your job if that login node becomes unreachable.
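For example, a small interactive GPU session could be requested as follows; the resource values are only illustrative, adjust them to your needs:
salloc --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=2:00:00
Exiting the shell ends the allocation.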
How can I reset my cluster password?
To reset your password, please submit a support ticket.
Warning: your cluster password is the same as your Google Workspace account. So, after reset, you must use the new password for all your Google services.
srun: error: --mem and --mem-per-cpu are mutually exclusive
You can safely ignore this; salloc has a default memory flag in case you
don’t provide one.
How can I see where and if my jobs are running?
Use squeue -u YOUR_USERNAME to see the status and location of all your jobs.
To get more information on a running job, try scontrol show job #JOBID
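If you want a more compact view that also shows which node each job is running on, squeue accepts a format string; the layout below is just one possibility, not the only one:
squeue -u YOUR_USERNAME -o "%.18i %.9P %.30j %.8T %.10M %R"
Here %T is the job state and %R is the allocated node list (or, for pending jobs, the reason they are still waiting).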
Unable to allocate resources: Invalid account or account/partition combination specified
Chances are your account is not set up properly. You should submit a support ticket.
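Before opening the ticket, you can check which accounts and partitions Slurm has associated with your user; the exact output depends on how the cluster is configured:
sacctmgr show associations user=YOUR_USERNAME format=Account,Partition,QOS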
How do I cancel a job?
To cancel a specific job, use
scancel #JOBID
To cancel all your jobs (running and pending), use
scancel -u YOUR_USERNAME
To cancel all your pending jobs only, use
scancel -t PD
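scancel also accepts filters other than the job ID; for example, assuming you named your jobs with --job-name, you can cancel every job carrying that name (JOB_NAME is a placeholder):
scancel -u YOUR_USERNAME --name=JOB_NAME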
How can I access a node on which one of my jobs is running?
You can ssh into a node on which you have a job running; your ssh connection
will be adopted by your job, i.e. if your job finishes, your ssh connection will
be automatically terminated. In order to connect to a node, you need
password-less ssh, either with a key present in your home directory or with an
ssh-agent. You can generate a key on the login node like this:
ssh-keygen    # press ENTER 3 times to accept the defaults
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
chmod 700 ~/.ssh
The ECDSA, RSA and ED25519 fingerprints for Mila’s compute nodes are:
SHA256:hGH64v72h/c0SfngAWB8WSyMj8WSAf5um3lqVsa7Cfk (ECDSA)
SHA256:4Es56W5ANNMQza2sW2O056ifkl8QBvjjNjfMqpB7/1U (RSA)
SHA256:gUQJw6l1lKjM1cCyennetPoQ6ST0jMhQAs/57LhfakA (ED25519)
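Once password-less ssh is set up, a typical sequence is to look up the node(s) your job was allocated and then connect directly; cn-c001 below is only a placeholder for whatever node name squeue reports:
squeue -u YOUR_USERNAME -o "%.18i %R"
ssh cn-c001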
I’m getting Permission denied (publickey) while trying to connect to a node
See the previous question.
Where do I put my data during a job?
Your /home as well as the datasets are on shared file systems; it is
recommended to copy them to $SLURM_TMPDIR to process them more efficiently and
leverage the higher-speed local drives. If you run a low-priority job subject to
preemption, it is better to save any output you want to keep on the shared file
systems, because $SLURM_TMPDIR is deleted at the end of each job.
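A minimal sketch of this pattern inside a job script; the dataset path, output path and train.py are placeholders, not actual locations:
#!/bin/bash
#SBATCH --time=1:00:00
# Copy the dataset from the shared filesystem to the fast local disk for this job.
cp -r /path/to/shared/dataset $SLURM_TMPDIR/dataset
# Run the workload against the local copy.
python train.py --data $SLURM_TMPDIR/dataset --output $SLURM_TMPDIR/output
# Copy anything worth keeping back to the shared filesystem before the job ends,
# since $SLURM_TMPDIR is wiped when the job terminates.
cp -r $SLURM_TMPDIR/output ~/results/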
slurmstepd: error: Detected 1 oom-kill event(s) in step #####.batch cgroup
You exceeded the amount of memory allocated to your job: either you did not
request enough memory, or your process has a memory leak. Try increasing
the amount of memory requested with --mem= or --mem-per-cpu=.
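For example, in a batch script you could request memory with one of the following directives (32G and 8G are only illustrative values; the two flags are mutually exclusive):
#SBATCH --mem=32G            # total memory for the job
#SBATCH --mem-per-cpu=8G     # or, alternatively, memory per allocated CPU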
PyTorch issues
I randomly get INTERNAL ASSERT FAILED at "../aten/src/ATen/MapAllocator.cpp":263
You are using PyTorch 1.10.x and hitting #67864,
for which the solution is PR #72232
merged in PyTorch 1.11.x. For an immediate fix, consider the following compilable Gist:
hack.cpp.
Compile the patch to hack.so and then export LD_PRELOAD=/absolute/path/to/hack.so
before executing the Python process that imports the broken PyTorch 1.10.
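The exact build command depends on your toolchain; compiling a single C++ file into a shared object generally looks something like the following (the flags are an assumption, not taken from the Gist, and your_script.py is a placeholder for your entry point):
g++ -O2 -fPIC -shared hack.cpp -o hack.so
export LD_PRELOAD=/absolute/path/to/hack.so
python your_script.py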
For Hydra users who are using the submitit launcher plug-in, the env_set key cannot
be used to set LD_PRELOAD in the environment, because it sets the variable too late at
runtime. The dynamic loader reads LD_PRELOAD only once and very early during the startup
of any process, before the variable can be set from inside the process. The hack must
therefore be injected using the setup key in the Hydra YAML config file:
hydra:
  launcher:
    setup:
      - export LD_PRELOAD=/absolute/path/to/hack.so
On MIG GPUs, I get torch.cuda.device_count() == 0 despite torch.cuda.is_available()
You are using PyTorch 1.13.x and hitting #90543, for which the solution is PR #92315 merged in PyTorch 2.0.
To avoid this problem, update to PyTorch 2.0. If PyTorch 1.13.x is required, a workaround is to add the following to your script:
unset CUDA_VISIBLE_DEVICES
But this is no longer necessary with PyTorch >= 2.0.
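To quickly check whether the workaround takes effect in your environment, you can print the device count from the shell before launching the real job:
unset CUDA_VISIBLE_DEVICES
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"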
I am told my PyTorch job abuses the filesystem with extreme amounts of IOPS
A fairly common issue in PyTorch is:
RuntimeError: one of the variables needed for gradient computation has been
modified by an inplace operation: [torch.cuda.FloatTensor [1, 50, 300]],
which is output 0 of SplitBackward, is at version 2; expected version 0
instead. Hint: enable anomaly detection to find the operation that failed to
compute its gradient, with torch.autograd.set_detect_anomaly(True).
PyTorch’s autograd engine contains an “anomaly detection mode”, which detects such things as NaN/infinities being created, and helps debugging in-place Tensor modifications. It is activated with
torch.autograd.set_detect_anomaly(True)
PyTorch’s implementation of the anomaly-detection mode tracks where every Tensor was created in the program. This involves the collection of the backtrace at the point the Tensor was created.
Unfortunately, collecting a backtrace involves a stat() system call for every
source file in the backtrace. Each such call counts as a metadata access on the
shared filesystem containing the source code (usually $HOME) and results in
intolerably heavy traffic to it, whatever the location of the dataset, and even
if the dataset is on $SLURM_TMPDIR. It is the source-code files being polled,
not the dataset. As there can be hundreds of PyTorch tensors created per
iteration and thousands of iterations per second, this mode results in extreme
amounts of IOPS to the filesystem.
Warning
Do not use torch.autograd.set_detect_anomaly(True) except for debugging an individual job interactively, and switch it off as soon as you are done using it.
Do not leave torch.autograd.set_detect_anomaly(True) enabled unconditionally in all your jobs. It is not a consequence-free aid: due to its heavy use of filesystem calls, it has a performance impact and slows down your code, on top of abusing the filesystem.
You will be contacted if you violate these guidelines, due to the severity of their impact on shared filesystems.
Conda refuses to create an environment with Your installed CUDA driver is: not available
Anaconda attempts to auto-detect the system’s NVIDIA driver version, and thus the maximum CUDA Toolkit version it supports, in order to choose an appropriate CUDA Toolkit version for the packages it installs.
However, on login and CPU nodes, there is no NVIDIA GPU and thus no need for
NVIDIA drivers. But that means conda’s auto-detection will not work on
those nodes, and packages declaring a minimum requirement on the drivers will
fail to install.
The solution in such a situation is to set the environment variable
CONDA_OVERRIDE_CUDA to the desired CUDA Toolkit version. For example,
CONDA_OVERRIDE_CUDA=11.8 conda create -n ENVNAME python=3.10 pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
This and other CONDA_OVERRIDE_* variables are documented in the conda manual.
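If you are unsure which value to use, you can check the driver version (and the maximum CUDA version it supports, shown in the nvidia-smi banner) on an actual GPU node:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-smi | head -n 5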