SLURM
SLURM (formerly the Simple Linux Utility for Resource Management) is an open-source job scheduler for managing tasks on compute clusters.
The official SLURM cheatsheet can be found HERE.
SLURM has two main operating modes: batch and interactive. Batch is the preferred mode for watgpu.cs: you can start your task and get an email when it is complete. See the batch section for more details. If you require an interactive session (useful for debugging) with watgpu.cs hardware, see the salloc section.
Before you submit jobs, you should learn about how to monitor running jobs and view cluster resources.
Monitoring jobs and cluster resources
Current jobs
To look at the current queue of jobs, you can use squeue to display it. The command scurrent will also list all the jobs currently running on watgpu. By default, squeue shows every job the scheduler is managing at the moment. It will run much faster if you ask only about your own jobs with
$ squeue -u $USER
You can show only running jobs, or only pending jobs:
$ squeue -u <username> -t RUNNING
$ squeue -u <username> -t PENDING
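If the default columns are missing something you need, squeue output can be customized with its --format option. The column selection below is just one possible sketch:
$ squeue -u $USER --format="%.10i %.20j %.8T %.10M %.6D %R"
Here %i is the job ID, %j the job name, %T the job state, %M the elapsed time, %D the node count, and %R the reason a job is pending (or the node list once it is running).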
You can show detailed information for a specific job with scontrol:
$ scontrol show job -dd <jobid>
Do not run squeue from a script or program at high frequency, e.g., every few seconds. Responding to squeue adds load to Slurm and may interfere with its performance or correct operation.
Cancelling jobs
Use scancel with the job ID to cancel a job:
$ scancel <jobid>
You can also use it to cancel all your jobs, or all your pending jobs:
$ scancel -u $USER
$ scancel -t PENDING -u $USER
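If you gave your job a name with --job-name, you can also cancel by name instead of ID. The job name below is illustrative:
$ scancel -u $USER --name=my_experiment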
Monitoring cluster resources
You can use sresources to monitor the availability of cluster resources. It will give you basic information on the:
- total number of GPUs available on each compute node
- number of GPUs currently allocated on each compute node
- amount of available and allocated RAM on each compute node
- number of CPU cores available and allocated on each compute node
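sresources is a watgpu-specific helper. Roughly the same per-node information can also be pulled with the standard SLURM tool sinfo; the column widths below are just one possible sketch:
$ sinfo -N -o "%.12N %.20G %.10m %.8c %.10t"
This lists, for each node, its name (%N), configured generic resources such as GPUs (%G), memory in MB (%m), CPU count (%c), and state (%t).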
Batch mode usage
You can submit jobs using a SLURM job script. Below is an example of a simple script:
:warning: #SBATCH is the directive prefix that tells SLURM to read the arguments on that line. If you wish to disable a line, change it to ##SBATCH; do not remove the # at the beginning of the line.
#!/bin/bash
# To be submitted to the SLURM queue with the command:
# sbatch batch-submit.sh
# Set resource requirements: Queues are limited to seven day allocations
# Time format: HH:MM:SS
#SBATCH --time=00:15:00
#SBATCH --mem=10GB
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:1
# Set output file destinations (optional)
# By default, output will appear in a file in the submission directory:
# slurm-$job_number.out
# This can be changed:
#SBATCH -o JOB%j.out # File to which STDOUT will be written
#SBATCH -e JOB%j-err.out # File to which STDERR will be written
# email notifications: Get email when your job starts, stops, fails, completes...
# Set email address
#SBATCH --mail-user=(email address where notifications are delivered)
# Set types of notifications (from the options: BEGIN, END, FAIL, REQUEUE, ALL):
#SBATCH --mail-type=ALL
# Load up your conda environment
# Set up environment on watgpu.cs or in interactive session (use `source` keyword instead of `conda`)
source activate <env>
# Task to run
~/cuda-samples/Samples/5_Domain_Specific/nbody/nbody -benchmark -device=0 -numbodies=16777216
You can use SBATCH options like --mem; for example, the script above assigns 10GB of RAM to the job. For CPU core allocation, you can use --cpus-per-task; the script above assigns 2 cores to the job. The --gres=gpu:1 option assigns 1x GPU to your job.
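If your task needs to know how large its allocation is, SLURM exports it as environment variables inside the job. A minimal sketch (the program and its --threads flag are hypothetical; CUDA_VISIBLE_DEVICES is typically set when GPUs are allocated):
# Inside the batch script, after the #SBATCH lines:
echo "CPUs allocated: $SLURM_CPUS_PER_TASK"
echo "GPUs visible:   $CUDA_VISIBLE_DEVICES"
# ./my_program --threads="$SLURM_CPUS_PER_TASK"   # hypothetical program and flag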
Running the script
To run the script, simply run sbatch your_script.sh on watgpu.cs.
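A typical submission looks like the following (the job ID in the output is illustrative); sbatch prints the ID, which you can then pass to squeue or scancel:
$ sbatch your_script.sh
Submitted batch job 123456
$ squeue -u $USER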
Interactive mode usage
You can book/reserve resources on the cluster using the salloc command. Below is an example:
salloc --gres=gpu:2 --cpus-per-task=4 --mem=16G --time=2:00:00
The example above will reserve 2 GPUs, 4 CPU cores, and 16GB of RAM for 2 hours. Once you run the command, it will output the name of the host like so:
salloc: Nodes watgpu308 are ready for job
Here watgpu308 is the assigned host that the user can SSH to.
Ideally, you want to run this command in either a screen or tmux session on watgpu.cs.
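Putting it together, a minimal interactive workflow might look like this (the tmux session name and the assigned node are illustrative; yours may differ):
$ tmux new -s gpu-session                 # on watgpu.cs, so the allocation survives disconnects
$ salloc --gres=gpu:1 --cpus-per-task=2 --mem=8G --time=1:00:00
salloc: Nodes watgpu308 are ready for job
$ ssh watgpu308                           # connect to the assigned compute node
$ nvidia-smi                              # confirm the GPUs you were allocated
$ exit                                    # leave the node, then exit the salloc shell to release the allocation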