SLURM

SLURM (formerly the Simple Linux Utility for Resource Management) is a workload manager that schedules jobs and allocates resources on compute clusters.

The official SLURM cheatsheet can be found HERE

SLURM has two main operating modes: batch and interactive. Batch is the preferred mode for watgpu.cs: you can start your task and get an email when it is complete. See the batch section for more details. If you require an interactive session (useful for debugging) with watgpu.cs hardware, see the salloc section.

Before you submit jobs, you should learn how to monitor running jobs and view cluster resources.

Monitoring jobs and cluster resources

Current jobs

Use squeue to display the current job queue. The command scurrent also lists all the jobs currently running on watgpu.

By default squeue will show all the jobs the scheduler is managing at the moment. It will run much faster if you ask only about your own jobs with

$ squeue -u $USER

You can show only running jobs, or only pending jobs:

$ squeue -u <username> -t RUNNING  
$ squeue -u <username> -t PENDING  

You can show detailed information for a specific job with scontrol:

$ scontrol show job -dd <jobid>
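
When only the state of a job is needed, the scontrol output can be filtered. The following is a minimal sketch, assuming the output contains a `JobState=<STATE>` field; `job_state` is a hypothetical helper name, not a Slurm command.

```shell
# job_state is a hypothetical helper, not part of Slurm.
# It prints the value of the JobState field (e.g. RUNNING, PENDING)
# found in the output of `scontrol show job`.
job_state() {
    scontrol show job "$1" | sed -n 's/.*JobState=\([A-Z_]*\).*/\1/p'
}

# Usage:
#   job_state <jobid>
```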

Do not run squeue from a script or program at high frequency, e.g., every few seconds. Responding to squeue adds load to Slurm, and may interfere with its performance or correct operation.
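
If a script does need to wait for a job, a long polling interval keeps the load on Slurm low. The sketch below assumes that a check once per minute is acceptable on your cluster; `wait_for_job` and its default interval are illustrative choices, not part of Slurm.

```shell
# wait_for_job is a hypothetical helper, not part of Slurm.
# It blocks until the given job no longer appears in squeue,
# checking once per interval (default: 60 seconds).
wait_for_job() {
    jobid=$1
    interval=${2:-60}
    # -h suppresses the header, so the output is empty once the job
    # has left the queue. Errors for unknown job IDs are discarded.
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep "$interval"
    done
}

# Usage:
#   wait_for_job <jobid>        # check every 60 seconds
#   wait_for_job <jobid> 300    # check every 5 minutes
```

For most batch jobs, the email notifications described in the batch section avoid polling altogether.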

Cancelling jobs

Use scancel with the job ID to cancel a job:

$ scancel <jobid>

You can also use it to cancel all your jobs, or all your pending jobs:

$ scancel -u $USER

$ scancel -t PENDING -u $USER

Monitoring cluster resources

Use sresources to check the availability of cluster resources. For each compute node, it reports:

  • the total number of GPUs
  • the number of GPUs currently allocated
  • the amount of available and allocated RAM
  • the number of available and allocated CPU cores

Batch mode usage

You can submit jobs using a SLURM job script. Below is an example of a simple script.

Warning: #SBATCH is the prefix that tells SLURM to read the rest of the line as an option. To disable a line, change it to ##SBATCH; do not remove the # at the beginning of the line.

#!/bin/bash
    
# To be submitted to the SLURM queue with the command:
# sbatch batch-submit.sh

# Set resource requirements: Queues are limited to seven day allocations
# Time format: HH:MM:SS
#SBATCH --time=00:15:00
#SBATCH --mem=10GB
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:1

# Set output file destinations (optional)
# By default, output will appear in a file in the submission directory:
# slurm-$job_number.out
# This can be changed:
#SBATCH -o JOB%j.out # File to which STDOUT will be written
#SBATCH -e JOB%j-err.out # File to which STDERR will be written

# email notifications: Get email when your job starts, stops, fails, completes...
# Set email address
#SBATCH --mail-user=(email address where notifications are delivered to)
# Set types of notifications (from the options: BEGIN, END, FAIL, REQUEUE, ALL):
#SBATCH --mail-type=ALL
 
# Load up your conda environment
# Set up environment on watgpu.cs or in interactive session (use `source` keyword instead of `conda`)
source activate <env>
 
# Task to run
 
~/cuda-samples/Samples/5_Domain_Specific/nbody/nbody -benchmark -device=0 -numbodies=16777216

Resources are requested with #SBATCH options. In the script above, --mem=10GB assigns 10 GB of RAM to the job.

CPU cores are allocated with --cpus-per-task; the script above requests 2 cores. The --gres=gpu:1 option assigns one GPU to your job.

Running the script

To submit the script, run sbatch your_script.sh on watgpu.cs.
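
When submitting from another script, it helps to capture the job ID for later use with squeue or scancel. The sketch below relies on the sbatch `--parsable` option, which prints only the job ID (followed by `;cluster` on multi-cluster setups); `submit_job` is a hypothetical helper name.

```shell
# submit_job is a hypothetical helper, not part of Slurm.
# It submits a job script and prints only the job ID.
submit_job() {
    # --parsable makes sbatch print "jobid" or "jobid;cluster".
    out=$(sbatch --parsable "$@") || return 1
    # Drop the optional ";cluster" suffix.
    printf '%s\n' "${out%%;*}"
}

# Usage:
#   jobid=$(submit_job your_script.sh)
#   squeue -j "$jobid"
```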

Interactive mode usage

You can reserve resources on the cluster using the salloc command. Below is an example:

salloc --gres=gpu:2 --cpus-per-task=4 --mem=16G --time=2:00:00  

The example above will reserve 2 GPUs, 4 CPU cores, and 16GB of RAM for 2 hours. Once you run the command, it will output the name of the host like so:

salloc: Nodes watgpu308 are ready for job

Here, watgpu308 is the assigned host, which you can then SSH into.
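
If the salloc message has scrolled out of view, the assigned node can also be looked up with squeue. This is a sketch, assuming your allocation shows up as a running job under your username; `my_nodes` is a hypothetical helper name.

```shell
# my_nodes is a hypothetical helper, not part of Slurm.
# It prints the node list of each of your running jobs,
# one per line, with duplicates removed.
my_nodes() {
    # -h: no header, -t RUNNING: running jobs only, -o "%N": node list only
    squeue -h -u "${USER:-$(id -un)}" -t RUNNING -o "%N" | sort -u
}

# Usage (assumes a single-node allocation):
#   ssh "$(my_nodes | head -n 1)"
```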

Ideally, run this command inside a screen or tmux session on watgpu.cs, so that the allocation is not lost if your connection drops.