Synchronizing multiple tasks
Before you begin
- Use an interactive job to run multiple tasks.
What this guide covers
- Launching multiple tasks with sbatch
- Sharing variables between different tasks
Concept of this example
In this guide, we launch a job (using job_***.sh) which will run one or more tasks (whose instructions are stored in main_jax.py or main_torch.py) using libraries (defined in pyproject.toml).
Thus, each example is based on three files:
| File | Description |
|---|---|
| job_***.sh | Bash script used to request an allocation and launch a job (which itself runs multiple tasks based on the requested --ntasks) |
| main_***.py | Python script containing the instructions the tasks execute. In this example, we use either Jax (with the script main_jax.py) or Pytorch (with the script main_torch.py) |
| pyproject.toml | Configuration file listing the libraries that uv installs. We could have written one pyproject.toml per example (Jax and Torch), but we gathered both libraries in a single file to simplify this guide |
Introducing the different files
(You can also check the "Launch many jobs" example.)
In-depth script explanation on job_***.sh
Headers for the resources allocation
These are the header and the parameters we request for the resources allocation.
Environment variables
The environment variables MASTER_ADDR, MASTER_PORT and WORLD_SIZE are defined here and can be retrieved in each task. In Python, the value of an environment variable is read through os.environ.
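As a minimal sketch, the three variables exported by the job script could be read as shown below. The helper name read_rendezvous_settings and the fallback values are illustrative assumptions, not part of the example's scripts.

```python
import os
from typing import Mapping, Tuple


def read_rendezvous_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, int, int]:
    """Read the variables exported by the job script.

    The fallback values are assumptions chosen so that the function
    also works outside a Slurm job (single local task).
    """
    master_addr = env.get("MASTER_ADDR", "127.0.0.1")  # host of the first task
    master_port = int(env.get("MASTER_PORT", "29500"))  # port used for the rendezvous
    world_size = int(env.get("WORLD_SIZE", "1"))        # total number of tasks
    return master_addr, master_port, world_size


# Inside a task, the values come from the real environment:
master_addr, master_port, world_size = read_rendezvous_settings()
print(f"master={master_addr}:{master_port} world_size={world_size}")
```

Environment variables are always strings, so the port and the world size are converted to integers before use.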
Running the tasks
srun uv run python main_***.py
- The command srun creates tasks. The number of tasks is determined by the --ntasks parameter of our allocation. Here, we requested 4 tasks, so the command runs 4 times, in parallel tasks. Each task runs the command following srun, that is, uv run python main_torch.py or uv run python main_jax.py.
- uv run is used to ease the environment setup for our tasks. For more information, read our uv guide on portability. It is followed by the command that runs our script in this environment.
In-depth script explanation on main_***.py
Pytorch and Jax
This guide is based on two open source examples
Environment variables
In each file, we retrieve the Slurm environment variables SLURM_PROCID, SLURM_LOCALID, SLURM_NTASKS and SLURM_NODEID. Unlike the environment variables we defined previously (MASTER_ADDR, MASTER_PORT and WORLD_SIZE), these environment variables are specific to each task. More common Slurm environment variables are listed in the technical reference.
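The per-task variables could be gathered as in the sketch below. The TaskInfo class, the read_task_info helper and the fallback values are hypothetical names and choices made for illustration. The mapping of SLURM_PROCID to RANK and SLURM_NODEID to NODE_INDEX is an assumption based on the names used later in this guide.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TaskInfo:
    rank: int        # SLURM_PROCID: global index of the task within the job
    local_rank: int  # SLURM_LOCALID: index of the task on its own node
    num_tasks: int   # SLURM_NTASKS: total number of tasks in the job
    node_index: int  # SLURM_NODEID: relative index of the node running the task

    @property
    def is_first_task(self) -> bool:
        # First task of the first node (NODE_INDEX=0 and RANK=0).
        return self.node_index == 0 and self.rank == 0


def read_task_info(env: Mapping[str, str] = os.environ) -> TaskInfo:
    # Fallbacks (an assumption) let the script run outside Slurm as a single task.
    return TaskInfo(
        rank=int(env.get("SLURM_PROCID", "0")),
        local_rank=int(env.get("SLURM_LOCALID", "0")),
        num_tasks=int(env.get("SLURM_NTASKS", "1")),
        node_index=int(env.get("SLURM_NODEID", "0")),
    )


info = read_task_info()
print(f"[Node {info.node_index} | Task {info.rank}] local_rank={info.local_rank} of {info.num_tasks} tasks")
```

Slurm sets these variables separately for every task started by srun, so each task sees its own values.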
In main_torch.py:
- Initialize: in Torch, a group is defined
- Create a value, different for each task. The created value is based on the RANK, which is specific to each task
- Compute their sum

In main_jax.py:
- Initialize: this function is specific to Jax
- Create a value, different for each task. The created value is based on the RANK, which is specific to each task
- Compute their sum (see Jax Lax parallel operators)
The final sum is printed from the first task of the first node (NODE_INDEX=0 and RANK=0). This is the task where all the x values have been collected. On the other tasks, the sum is a partial result.
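The pattern (initialize, create one value per task, sum the values on the first task) can be illustrated without a cluster. The sketch below is a stdlib-only simulation in which threads stand in for Slurm tasks and a barrier stands in for the synchronization. It is not the torch or jax code of this example, and run_tasks is a hypothetical name.

```python
import threading
from typing import Dict, List, Optional


def run_tasks(world_size: int) -> float:
    """Simulate world_size tasks that each contribute x = rank; rank 0 sums them."""
    contributions: List[Optional[float]] = [None] * world_size
    barrier = threading.Barrier(world_size)  # stands in for the group synchronization
    collected: Dict[int, float] = {}

    def task(rank: int) -> None:
        x = float(rank)          # value specific to this task, based on its rank
        contributions[rank] = x  # publish the value
        barrier.wait()           # wait until every task has published its value
        if rank == 0:
            # Only the first task collects all the values and computes the sum.
            collected[rank] = sum(v for v in contributions if v is not None)

    threads = [threading.Thread(target=task, args=(rank,)) for rank in range(world_size)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return collected[0]


total = run_tasks(4)
print(f"[Node 0 | Task 0] sum = {total}")
```

In the real scripts the tasks are separate processes, possibly on different nodes, and the values travel over the network using the address and port given by MASTER_ADDR and MASTER_PORT.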
In-depth explanation on pyproject.toml
pyproject.toml is a configuration file used by packaging tools, uv in our case (more info on pyproject.toml files). The value of dependencies lists the libraries we are using in this example: torch is needed by the main_torch.py script, and jax by the main_jax.py script. If you use only one of them, you can delete the unused library from the pyproject.toml file.
Launching the example
- Connect to the cluster
- Launch the job
- (Optional) Check the job status
- Retrieve the results
Once the resources have been allocated and the script has run, an output file is created. By default, it is called slurm-{JOB_ID}.out, where JOB_ID is the ID of the job that ran.
For each example, we can see that the ranks of the tasks (i.e. their x values) are 0, 1, 2 and 3 respectively. Thus, their sum, retrieved on [Node 0 | Task 0], is 6.
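As a quick sanity check for other task counts, the expected total can be computed ahead of time, assuming each task contributes exactly its rank as in this example. The function name expected_sum is illustrative.

```python
def expected_sum(num_tasks: int) -> int:
    """Sum of the ranks 0, 1, ..., num_tasks - 1."""
    return num_tasks * (num_tasks - 1) // 2


# With the 4 tasks requested in this guide: 0 + 1 + 2 + 3
print(expected_sum(4))
```

If the value printed by the first task differs from this number, some task has probably not joined the group or not contributed its value.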
Next steps
- Launch many jobs from the same shell script: good practice for running the same experiment with different arguments.