Sometimes you may want to run the same job with different arguments.
For example, you may want to launch an experiment using a few different learning rates.
This example shows an easy way to do this.
Prerequisites
Make sure to read the following sections of the documentation before using this
example:
# distributed/single_gpu/job.sh -> good_practices/launch_many_jobs/job.sh
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --gpus-per-task=l40s:1
## Or --gpus-per-task=rtx8000:1 to request a different GPU model
## Or --gpus-per-task=1 for any GPU model
#SBATCH --mem-per-gpu=16G
#SBATCH --time=00:15:00
# Exit on error
set -e
# Echo time and hostname into log
echo "Date: $(date)"
echo "Hostname: $(hostname)"
# To make your code as reproducible as possible with
# `torch.use_deterministic_algorithms(True)`, uncomment the following block:
## === Reproducibility ===
## Be warned that this can make your code slower. See
## https://pytorch.org/docs/stable/notes/randomness.html#cublas-and-cudnn-deterministic-operations
## for more details.
# export CUBLAS_WORKSPACE_CONFIG=:4096:8
## === Reproducibility (END) ===
# Stage dataset into $SLURM_TMPDIR
mkdir -p $SLURM_TMPDIR/data
cp /network/datasets/cifar10/cifar-10-python.tar.gz $SLURM_TMPDIR/data/
# General-purpose alternatives combining copy and unpack:
# unzip /network/datasets/some/file.zip -d $SLURM_TMPDIR/data/
# tar -xf /network/datasets/some/file.tar -C $SLURM_TMPDIR/data/
# Execute Python script
# Use the `--offline` option of `uv run` on clusters without internet access on compute nodes.
# Using the `--locked` option can help make your experiments easier to reproduce (it forces
# your uv.lock file to be up to date with the dependencies declared in pyproject.toml).
-srun uv run python main.py
+srun uv run python main.py "$@"
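The `"$@"` appended to the `srun` line is what forwards the arguments you pass at submission time on to `main.py`, preserving each argument exactly as given. A minimal sketch of this behavior, using a hypothetical `print_args` helper:

```shell
# Hypothetical helper: prints each argument it receives on its own line.
# "$@" expands to the arguments unchanged, so arguments containing spaces
# are passed through as single arguments rather than being re-split.
print_args() {
    printf '%s\n' "$@"
}

print_args --learning-rate 0.01
# prints:
# --learning-rate
# 0.01
```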
Running this example
You can run this example just like the single GPU job example, but you can now
also pass command-line arguments directly when submitting the job with sbatch!
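For instance, to launch the experiment with a few different learning rates, you could submit one job per value. This sketch assumes `main.py` accepts a `--learning-rate` argument; adapt the flag name to whatever your script actually parses:

```shell
# Hypothetical sweep: one sbatch submission per learning rate.
# The arguments after job.sh are forwarded to main.py via "$@".
for lr in 0.001 0.01 0.1; do
    sbatch job.sh --learning-rate "$lr"
done
```

Each submission is queued as a separate Slurm job, so the runs are scheduled and logged independently.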