Sometimes you may want to run the same job with different arguments.
For example, you may want to launch an experiment using a few different learning rates.
This example shows an easy way to do this.
Prerequisites
Make sure to read the following sections of the documentation before using this
example:
# distributed/single_gpu/job.sh -> good_practices/launch_many_jobs/job.sh
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --gpus-per-task=l40s:1
## Or --gpus-per-task=rtx8000:1 to request a different GPU model
## Or --gpus-per-task=1 for any GPU model
#SBATCH --mem-per-gpu=16G
#SBATCH --time=00:15:00
# Exit on error
set -e
# Echo time and hostname into log
echo "Date: $(date)"
echo "Hostname: $(hostname)"
# To make your code as reproducible as possible with
# `torch.use_deterministic_algorithms(True)`, uncomment the following block:
## === Reproducibility ===
## Be warned that this can make your code slower. See
## https://pytorch.org/docs/stable/notes/randomness.html#cublas-and-cudnn-deterministic-operations
## for more details.
# export CUBLAS_WORKSPACE_CONFIG=:4096:8
## === Reproducibility (END) ===
# Stage dataset into $SLURM_TMPDIR
mkdir -p $SLURM_TMPDIR/data
cp /network/datasets/cifar10/cifar-10-python.tar.gz $SLURM_TMPDIR/data/
# General-purpose alternatives combining copy and unpack:
# unzip /network/datasets/some/file.zip -d $SLURM_TMPDIR/data/
# tar -xf /network/datasets/some/file.tar -C $SLURM_TMPDIR/data/
# Execute Python script
# Use the `--offline` option of `uv run` on clusters without internet access on compute nodes.
# Using the `--locked` option can help make your experiments easier to reproduce (it forces
# your uv.lock file to be up to date with the dependencies declared in pyproject.toml).
-srun uv run python main.py
+srun uv run python main.py "$@"
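The `"$@"` appended to the `srun` line is what forwards the arguments you pass at submission time on to `main.py`, preserving each argument exactly as given. A minimal sketch of this behavior, using a hypothetical `print_args` helper:

```shell
# Hypothetical helper: prints each argument it receives on its own line.
# "$@" expands to the arguments unchanged, so arguments containing spaces
# are passed through as single arguments rather than being re-split.
print_args() {
    printf '%s\n' "$@"
}

print_args --learning-rate 0.01
# prints:
# --learning-rate
# 0.01
```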
Running this example
You can run this example just like the single GPU job example, but you can now
also pass command-line arguments directly when submitting the job with sbatch!
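For instance, to launch the experiment with a few different learning rates, you could submit one job per value. This sketch assumes `main.py` accepts a `--learning-rate` argument; adapt the flag name to whatever your script actually parses:

```shell
# Hypothetical sweep: one sbatch submission per learning rate.
# The arguments after job.sh are forwarded to main.py via "$@".
for lr in 0.001 0.01 0.1; do
    sbatch job.sh --learning-rate "$lr"
done
```

Each submission is queued as a separate Slurm job, so the runs are scheduled and logged independently.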