Queue and Priority

Jobs in WATGPU are run according to their priority. Higher-priority jobs move ahead in the queue and can sometimes preempt and requeue running jobs to free resources for themselves.

To implement these priorities, WATGPU's resources have been separated into several partitions. The partition on which a job runs determines the resources it can be allocated and its priority. Selecting a partition for a job is easy:

  • For interactive sessions: add the --partition=<PARTITION> argument to the salloc command.
  • For batch jobs: add #SBATCH --partition=<PARTITION> in your .sh script.
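As a sketch of the interactive case, an salloc invocation might look like the following; the resource values (GPU count, CPU cores, memory, time limit) are illustrative assumptions, not WATGPU defaults:

```shell
# Request an interactive session on the SCHOOL partition
# (1 GPU, 4 CPU cores, 16 GB RAM, 2-hour limit -- all values illustrative).
salloc --partition=SCHOOL --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=2:00:00
```

Once the allocation is granted, salloc opens a shell on the allocated resources; exiting that shell releases them.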

The partitions available on WATGPU are presented below:

ALL

This is the default partition. A job submitted to ALL will be able to run on any available resources on WATGPU. Jobs will have standard priority for the queue and might be preempted and requeued if a higher priority job is in need of resources.

All users have access to this partition:

#SBATCH --partition=ALL
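In context, the partition directive is one line of a full batch script. A minimal sketch follows; the job name, output pattern, resource values, and the train.py command are illustrative assumptions:

```shell
#!/bin/bash
#SBATCH --partition=ALL          # run on any available WATGPU resources
#SBATCH --job-name=example       # illustrative job name
#SBATCH --gres=gpu:1             # request one GPU
#SBATCH --cpus-per-task=4        # illustrative CPU count
#SBATCH --mem=16G                # illustrative memory request
#SBATCH --time=4:00:00           # illustrative time limit
#SBATCH --output=%x-%j.out       # write logs to <job-name>-<job-id>.out

python train.py                  # placeholder for the actual workload
```

Submit it with sbatch script.sh. Since ALL jobs can be preempted and requeued, it helps if the workload checkpoints its progress periodically.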

SCHOOL

A job submitted to SCHOOL will be allocated only to GPUs owned by the School. While fewer resources are available in SCHOOL than in ALL, the chances of getting preempted by higher priority jobs are reduced. Jobs will have standard priority for the queue and might be preempted and requeued if a higher priority job is in need of resources.

All users within the School have access to this partition:

#SBATCH --partition=SCHOOL

<GROUP>

When a group <GROUP> has entrusted their GPU(s) to WATGPU, they have access to a partition named <GROUP> linked to their contributed GPU(s). Jobs will have high priority in the queue. If not enough resources are available, they will preempt and requeue lower-priority jobs (from ALL and SCHOOL), up to the number of GPU(s) the group contributed.

All users from a group have access to their group partition:

#SBATCH --partition=<GROUP>

Users running jobs on their group's partition must also include the group name in the GPU request:

#SBATCH --gres=gpu:<GROUP_name>gpu:1
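Putting the two directives together, a group-partition batch script might look like the sketch below. <GROUP> and <GROUP_name> are placeholders exactly as above, and the remaining values and commands are illustrative assumptions:

```shell
#!/bin/bash
#SBATCH --partition=<GROUP>             # the group's high-priority partition
#SBATCH --gres=gpu:<GROUP_name>gpu:1    # GPU request naming the group, per the syntax above
#SBATCH --time=8:00:00                  # illustrative time limit
#SBATCH --output=%x-%j.out              # write logs to <job-name>-<job-id>.out

python train.py                         # placeholder for the actual workload
```

Because group jobs preempt rather than get preempted (within the group's contributed GPU count), they are a good fit for long runs that cannot easily checkpoint.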