Track Experiments with Weights & Biases (WandB)
Weights & Biases is an experiment tracking platform for logging metrics, hyperparameters, and artifacts from training runs. Mila members supervised by core professors can access the shared Mila organization on wandb.ai for team-level project visibility and collaboration.
Before you begin
- Train your first ResNet18 model on CIFAR-10 on a single GPU using sbatch.
- Manage Python Dependencies with uv: install uv, manage project dependencies, run reproducible Slurm jobs, and run standalone scripts.
Request access to the Mila WandB organization
Students supervised by core professors are eligible for the Mila organization on wandb.ai. Write to it-support@mila.quebec to request access.
What this guide covers
- Sign in with single sign-on (SSO) using your @mila.quebec address.
- Authenticate the WandB CLI on the cluster.
- Initialize a run and log metrics in a training script.
- Name runs, attach tags, and group related runs.
- Configure Slurm job scripts for reliable WandB logging and run resumption.
- Identify whether a training job is I/O-bound or compute-bound using WandB system metrics and step timing.
Sign in for the first time
WandB uses Mila's SSO provider. Signing in with your @mila.quebec address for the
first time links your account to the Mila organization.
Migrate an existing WandB account
Add your Mila email first to avoid a duplicate account
Before following the steps below, add your @mila.quebec address to the
existing WandB account and make it the primary email. See the WandB
documentation on managing email addresses. Then log out of WandB before
proceeding.
- Go to wandb.ai and click Sign in.
- Enter your @mila.quebec email address. The password field disappears once a recognized SSO domain is detected.
- Click Log in. The browser redirects you to the Mila SSO page.
- Select the mila.quebec identity provider. WandB offers to link the existing account to the Mila organization.
Create a new account
Follow the same SSO steps above. At the account creation prompt, select Professional.
Which account type to select?
Select Professional at the account creation prompt. This unlocks team features required for the Mila organization. The Mila IT team manages organization-level billing, so no personal plan upgrade is required.
Authenticate the CLI on the cluster
Most WandB Python API calls require a valid API key stored in the environment.
Log in interactively
On a login node, install WandB as a tool with uv tool install --upgrade wandb, then authenticate with wandb login. The output looks like this:
Resolved 20 packages in 371ms
Prepared 20 packages in 1.87s
Installed 20 packages in 2.16s
+ annotated-types==0.7.0
+ certifi==2026.2.25
[...]
+ urllib3==2.6.3
+ wandb==0.25.1
Installed 2 executables: wandb, wb
wandb: Logging into https://api.wandb.ai.
wandb: Create a new API key at: https://wandb.ai/authorize?ref=models
wandb: Store your API key securely and do not share it.
wandb: Paste your API key and hit enter:
wandb: Appending key for api.wandb.ai to your netrc file: /home/mila/u/username/.netrc
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
Tip
uv tool install --upgrade wandb is a one-time step per cluster. The API
key is stored in ~/.netrc after wandb login; subsequent jobs do not need
to re-run wandb login unless the key is rotated.
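Before submitting a job, it can help to confirm that a key is actually available to non-interactive processes. The sketch below is one possible check, not part of the WandB API: the helper name wandb_key_available is hypothetical, and it assumes the key lives either in the WANDB_API_KEY environment variable or in an api.wandb.ai entry of the netrc file shown in the login output above.

```python
import netrc
import os
from pathlib import Path


def wandb_key_available(netrc_path=None, environ=None):
    """Return True if a WandB API key can be found without prompting.

    Hypothetical helper: checks the WANDB_API_KEY environment variable first,
    then looks for an api.wandb.ai entry in the netrc file.
    """
    environ = os.environ if environ is None else environ
    if environ.get("WANDB_API_KEY"):
        return True

    # Default to ~/.netrc, the file `wandb login` appends the key to.
    path = Path(netrc_path) if netrc_path else Path.home() / ".netrc"
    if not path.is_file():
        return False

    try:
        entry = netrc.netrc(str(path)).authenticators("api.wandb.ai")
    except netrc.NetrcParseError:
        return False

    # entry is (login, account, password); the API key is the password field.
    return entry is not None and bool(entry[2])
```

Calling wandb_key_available() at the top of a training script and exiting early when it returns False avoids a job that waits in the queue only to fail at wandb.init().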
Offline mode
On clusters whose compute nodes have no outbound internet access, set WANDB_MODE=offline in the job so that runs are written to local disk instead of being uploaded.
After the job finishes, sync the runs from the login node with wandb sync --sync-all.
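The same two steps can also be driven from Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, enable_offline_mode must run before wandb.init(), and sync_offline_runs assumes the offline runs were written under the directory passed as run_dir.

```python
import os
import subprocess


def enable_offline_mode(environ=os.environ):
    """Set WANDB_MODE=offline unless the job script already chose a mode."""
    environ.setdefault("WANDB_MODE", "offline")
    return environ["WANDB_MODE"]


def sync_offline_runs(run_dir="."):
    """Upload offline runs; call this on a login node after the job ends.

    Assumes the runs were written under run_dir (hypothetical layout).
    """
    # `wandb sync --sync-all` is the sync command named in this guide.
    subprocess.run(["wandb", "sync", "--sync-all"], cwd=run_dir, check=True)
```

Using setdefault keeps an explicit WANDB_MODE from the job script in control, so the same training script runs unchanged on clusters that do have internet access.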
Tip
On DRAC clusters, loading the httpproxy module (module load httpproxy) before
the srun line is an alternative to offline mode.
Initialize and log a training run
Every WandB run starts with wandb.init(). Call wandb.finish() at the end of
the script — it is called automatically at exit, but an explicit call is safer
in Slurm jobs.
Initialize a run
The wandb.init() arguments used in this guide:
- project: Groups runs in the WandB UI. Use one project per research question.
- name: Human-readable display name shown in the WandB Runs table.
- id: Sets the unique run ID. resume="allow" uses this to find and resume the run if it was preempted. Setting it to the Slurm job ID also links the run to its log file (slurm-<JOBID>.out).
- resume="allow": Creates a new run if the ID does not exist, or resumes it if it does. Useful when combined with checkpointing to recover from preemption.
- config: Pass the full argparse namespace and Slurm environment variables for easier debugging. Every key becomes a searchable, filterable column in the WandB Runs table under Config.
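The arguments described above could be assembled as follows. This is a sketch, not the guide's own script: build_init_kwargs and main are hypothetical names, "my-project" and the --lr option are placeholders, and wandb is imported inside main() so the helper can be exercised without the package installed.

```python
import argparse
import os


def build_init_kwargs(args, environ=None, project="my-project", name=None):
    """Assemble keyword arguments for wandb.init() (hypothetical helper)."""
    environ = os.environ if environ is None else environ

    # Capture every SLURM* variable under an env/ prefix for later debugging.
    slurm_env = {f"env/{k}": v for k, v in environ.items() if k.startswith("SLURM")}

    return {
        "project": project,                   # placeholder project name
        "name": name,                         # None lets WandB pick a name
        "id": environ.get("SLURM_JOB_ID"),    # None outside Slurm: WandB picks an ID
        "resume": "allow",                    # resume if the ID exists, else create
        "config": {**vars(args), **slurm_env},
    }


def main():
    import wandb  # imported here so the helper above has no hard dependency

    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=0.1)  # placeholder option
    args = parser.parse_args()

    wandb.init(**build_init_kwargs(args))
    try:
        pass  # training loop goes here
    finally:
        wandb.finish()  # explicit finish, as recommended for Slurm jobs
```

The {**a, **b} merge is equivalent to the vars(args) | {...} form on Python 3.9 and later. Call main() from the script's entry point.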
Log metrics
Call wandb.log() inside the batch loop to record training metrics at each
step.
Log validation metrics once per epoch, after the validation loop.
Tip
Prefix metric names with train/ and val/. WandB groups metrics with
matching prefixes automatically in the Charts panel.
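The logging pattern and the prefix convention can be sketched together. The names below (with_prefix, run_epoch, train_step, val_step) are hypothetical; log_fn stands in for wandb.log so the structure can be tried without a WandB run.

```python
def with_prefix(prefix, metrics):
    """Return a copy of metrics with every key renamed to '<prefix>/<key>'."""
    return {f"{prefix}/{key}": value for key, value in metrics.items()}


def run_epoch(train_batches, val_batches, train_step, val_step, log_fn, epoch):
    """Log training loss at every step and mean validation loss once per epoch."""
    for batch in train_batches:
        loss = train_step(batch)
        log_fn(with_prefix("train", {"loss": loss, "epoch": epoch}))

    # Validation is aggregated and logged a single time, after the loop.
    val_losses = [val_step(batch) for batch in val_batches]
    if val_losses:
        mean_loss = sum(val_losses) / len(val_losses)
        log_fn(with_prefix("val", {"loss": mean_loss, "epoch": epoch}))
```

In a real script, pass log_fn=wandb.log after wandb.init() has been called.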
Complete example
See the WandB setup example for a complete single-GPU training script with WandB logging integrated.
Organize runs
Three arguments help keep runs organized as a project grows. name= and tags=
make individual runs easy to identify and filter in the Runs table, while
group= clusters related runs — such as multi-seed runs or ablations — under a
single expandable row.
- group: Using SLURM_ARRAY_JOB_ID as the group automatically clusters all tasks of a job array into a single expandable row.
- tags: Labels runs for filtering in the Runs table.
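One way to derive name=, group=, and tags= from the Slurm environment is sketched below. The helper name, the "resnet18" base name, and the "baseline" tag are placeholders chosen for illustration.

```python
import os


def run_organization(environ=None, base_name="resnet18", tags=("baseline",)):
    """Build name/group/tags arguments for wandb.init() (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    array_job = environ.get("SLURM_ARRAY_JOB_ID")
    task = environ.get("SLURM_ARRAY_TASK_ID")

    # Append the array task index so runs in one group stay distinguishable.
    name = f"{base_name}-task{task}" if task is not None else base_name

    return {
        "name": name,
        "group": array_job,  # None outside a job array: the run is ungrouped
        "tags": list(tags),
    }
```

The result can be merged into the other arguments, for example wandb.init(project=..., **run_organization()).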
Diagnose training bottlenecks
WandB automatically records GPU utilization, CPU usage, and memory under the System tab of every run; no extra code is required. These metrics are the first place to check when a training job is slower than expected.
Read system metrics
Open a run in the WandB UI and select the System tab. The GPU Utilization chart shows the fraction of time the GPU spent on active compute during each sampling interval.
Two patterns indicate different root causes:
- Sustained utilization near 100% — the job is compute-bound. The GPU is the bottleneck; this is the expected state for well-configured training.
- Low or oscillating utilization — the GPU idles while waiting for the next batch. The data pipeline cannot deliver batches fast enough; the job is I/O-bound.
Tip
Common fixes for an I/O bottleneck: increase num_workers in the
DataLoader, enable pin_memory=True, or copy the dataset to
$SLURM_TMPDIR before the job starts.
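Step timing complements the GPU Utilization chart by measuring directly how long each step waits for data versus how long it computes. The sketch below uses the perf/data_load_s and perf/compute_s metric names from this guide; timed_batches, profile_epoch, step_fn, and log_fn are hypothetical names, with log_fn standing in for wandb.log.

```python
import time


def timed_batches(loader):
    """Yield (batch, seconds spent waiting for that batch)."""
    iterator = iter(loader)
    while True:
        start = time.perf_counter()
        try:
            batch = next(iterator)
        except StopIteration:
            return
        yield batch, time.perf_counter() - start


def profile_epoch(loader, step_fn, log_fn):
    """Log data-loading and compute time separately for every step."""
    for batch, data_load_s in timed_batches(loader):
        start = time.perf_counter()
        step_fn(batch)  # forward, backward, and optimizer step
        compute_s = time.perf_counter() - start
        log_fn({"perf/data_load_s": data_load_s, "perf/compute_s": compute_s})
```

If perf/data_load_s is consistently comparable to or larger than perf/compute_s, the job is likely I/O-bound. Because GPU kernels run asynchronously, step_fn should synchronize the device before returning (for example with torch.cuda.synchronize() in PyTorch); otherwise compute time is under-reported.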
Full job scripts
The wandb_setup example
provides a complete job script and training script (main.py) with data staging
and WandB integration.
Example run
Running these scripts produces a run like the wandb-example run.
Key concepts
wandb.init()
- Starts a WandB run. The most commonly used arguments are project, name, id, group, config, and resume.
config= (in wandb.init())
- Stores a dictionary of hyperparameters alongside the run. Each key becomes a searchable, filterable column in the WandB Runs table. Pass config=vars(args) | {f"env/{k}": v for k, v in os.environ.items() if k.startswith("SLURM")} to capture the full argparse namespace and Slurm environment variables for easier debugging.
group= (in wandb.init())
- Groups related runs under a single expandable row in the WandB Runs table. Useful for multi-seed runs or ablations. Pass SLURM_ARRAY_JOB_ID to group all tasks in a job array automatically.
wandb.log()
- Records a dictionary of metric values at the current step. Call once per iteration or epoch.
WANDB_MODE
- Controls logging mode. Set to offline on clusters without outbound internet access. Sync with wandb sync --sync-all after the job completes.
System metrics
- GPU utilization, CPU usage, and memory stats collected automatically by WandB during a run. Visible under the System tab of a run in the WandB UI.
perf/ prefix
- Convention for logging performance timing metrics (e.g., perf/data_load_s, perf/compute_s) separately from training metrics to support bottleneck diagnosis.