Welcome to the WATGPU research cluster
Introduction
Welcome to WATGPU, a SCS owned GPU cluster aiming to facilitate access to computing resources for research purposes. This documentation serves as a comprehensive guide to understanding and utilizing WATGPU, a cluster managed through the Slurm workload manager.
Download pdf presentation here: 2024/07/25 version.
View the recording of the seminar from the 2024/07/25.
Getting access
- Faculty (CS or cross appointed): Please contact watgpu-admin@lists.uwaterloo.ca
- Student1: With the agreement of the faculty you are working with, send an email to watgpu-admin@lists.uwaterloo.ca with them in CC.
Before making an account request, please load an SSH key at https://authman.uwaterloo.ca
Contact
If you require assistance while using WATGPU, you can contact the following:
- Admin support: watgpu-admin@lists.uwaterloo.ca
Shared GPU Resources
WATGPU offers shared access to GPUs owned by the university and generous researchers. When these GPUs are idle, they contribute to a shared pool, providing users with enhanced computational capabilities.
Slurm: How it works
Slurm simplifies the user experience by allowing you to submit, monitor, and manage your computational jobs seamlessly. Through straightforward command-line interfaces, you can submit batch jobs, specify resource requirements, and monitor job progress. Slurm ensures fair resource allocation, allowing you to focus on your research without the complexity of manual resource management.
If you're new to shared GPU computing and Slurm, getting started with WATGPU is a breeze! Think of WATGPU as your personal computing powerhouse, shared with other users when their GPUs are idle. To begin, follow these simple steps:
- Login: Access watgpu.cs using your credentials.
- Submit a Job: Utilize the
sbatch
command to submit your script that you wish to run. Think of it like asking the server to perform specific computations for you using specific resources (like how many GPUs, how much memory ...). - Monitor Progress: Check the status of your job with the
squeue
command and see how it's progressing. You can usesqueue
to view the job queue and monitor job details. - Enjoy: Your job will be run by the server as soon as the requested resources are available.
For more in-depth information, visit this page.
Thank you for choosing WATGPU. We're here to simplify your computational tasks and enhance your research.
Happy Computing!
WatGPU cluster is a research resource. Access will only be granted to students actively involved in a SCS or cross appointed research group.
Current cluster loads
The cluster consists of four servers. GPU utilization of the servers are shown below, and the links will take you to other dashboards detailing other performance metrics of the servers. Current loads are:
watgpu-100.cs
watgpu-200.cs
watgpu-300.cs
watgpu-400.cs
Managing virtual environments on the cluster
Virtual environments are self-contained directories that contain all libraries and dependencies for a specific project, allowing you to work on multiple projects without conflicts between their requirements. They help you isolate project dependencies, manage versions, and ensure that different software stacks don't interfere with each other. This can be especially useful when working on WATGPU, since there might be specific versions of libraries that may be required for your specific research projects.
The default environment manager for all users on the WATGPU cluster is conda, though pip virtual environments are also supported.
Creating and activating conda environments is simple on WATGPU:
(base) $ conda create -n <conda-venv-name>
(base) $ conda activate <conda-venv-name>
<conda-venv-name> $
You can create pip virtual environments while a conda environment is activated, and after creating the pip virtual environment, you can deactivate the conda environment and activate the pip virtual environment normally.
(base) $ python -m venv <venv-name>
(base) $ conda deactivate
$ source <venv-name>/bin/activate
Installing useful tools through conda
Tools like nvcc and nvtop can be installed using conda since conda is also a package manager.
To perform a basic install of all CUDA Toolkit components using conda, run:
conda install cuda -c nvidia
You can install previous CUDA releases by following the instructions detailed in the Conda Installation section of NVIDIA's online documentation.
To install nvtop, which can be used to better monitor GPU utilization and GPU memory usage, run:
conda install --channel conda-forge nvtop
Further information about nvtop can be found here.
If a tool suggests you to use apt or apt-get to install it, please check if you can install the same tool through conda. If this isn't possible, you can send an email to watgpu-admin@lists.uwaterloo.ca and we can look into installing the tool for you.
SLURM
SLURM (formerly, Simple Linux Utility for Resource Management) is an application for managing tasks on computer systems.
The official SLURM cheatsheet can be found HERE
SLURM has two main operating modes: batch and interactive. Batch is the preferred mode for watgpu.cs: you can start your task and get an email when it is complete. See the batch section for more details. If you require an interactive session (useful for debugging) with watgpu.cs hardware, see the salloc section.
Before you submit jobs, you should learn about how to monitor running jobs and view cluster resources.
Monitoring jobs and cluster resources
Current jobs
To look at the queue of jobs currently, you can use squeue
to display it. The command scurrent
will also give all the current jobs running on watgpu.
By default squeue
will show all the jobs the scheduler is managing at the moment. It will run much faster if you ask only about your own jobs with
$ squeue -u $USER
You can show only running jobs, or only pending jobs:
$ squeue -u <username> -t RUNNING
$ squeue -u <username> -t PENDING
You can show detailed information for a specific job with scontrol
:
$ scontrol show job -dd <jobid>
Do not run squeue
from a script or program at high frequency, e.g., every few seconds. Responding to squeue
adds load to Slurm, and may interfere with its performance or correct operation.
Cancelling jobs
Use scancel
with the job ID to cancel a job:
$ scancel <jobid>
You can also use it to cancel all your jobs, or all your pending jobs:
$ scancel -u $USER
$ scancel -t PENDING -u $USER
Monitoring cluster resources
To view all cluster resources, you can use sresources
to monitor the availability of cluster resources. It will give you basic information on the:
- total number of GPUs available on each compute node
- the number of GPUs currently allocated on each compute node
- amount of available and allocated RAM on each compute node
- number of CPU cores available and allocated on each compute node
Batch mode usage
You can submit jobs using an SLURM job script. Below is an example of a simple script:
:warning: #SBATCH
is the trigger word for slurm to take into account your arguments. If you wish to disable a line consider using ##SBATCH
, do not remove #
at the beginning of the lines.
#!/bin/bash
# To be submitted to the SLURM queue with the command:
# sbatch batch-submit.sh
# Set resource requirements: Queues are limited to seven day allocations
# Time format: HH:MM:SS
#SBATCH --time=00:15:00
#SBATCH --mem=10GB
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:1
# Set output file destinations (optional)
# By default, output will appear in a file in the submission directory:
# slurm-$job_number.out
# This can be changed:
#SBATCH -o JOB%j.out # File to which STDOUT will be written
#SBATCH -e JOB%j-err.out # File to which STDERR will be written
# email notifications: Get email when your job starts, stops, fails, completes...
# Set email address
#SBATCH --mail-user=(email address where notifications are delivered to)
# Set types of notifications (from the options: BEGIN, END, FAIL, REQUEUE, ALL):
#SBATCH --mail-type=ALL
# Load up your conda environment
# Set up environment on watgpu.cs or in interactive session (use `source` keyword instead of `conda`)
source activate <env>
# Task to run
~/cuda-samples/Samples/5_Domain_Specific/nbody/nbody -benchmark -device=0 -numbodies=16777216
You can use SBATCH
variables like --mem
, for example the one above will assign 10GB of RAM to the job.
For CPU cores allocation, you can use --cpus-per-task
, for example the one above will assign 4 cores to the job.
The --gres=gpu:1
will assign 1x GPU to your job.
Running the script
To run the script, simply run sbatch your_script.sh
on watgpu.cs
Interactive mode usage
You can book/reserve resources on the cluster using the salloc
command. Below is an example:
salloc --gres=gpu:2 --cpus-per-task=4 --mem=16G --time=2:00:00
The example above will reserve 2 GPUs, 4 CPU cores, and 16GB of RAM for 2 hours. Once you run the command, it will output the name of the host like so:
salloc: Nodes watgpu308 are ready for job
here watgpu308
is the assigned host that the user can SSH to.
Ideally you want to run this command in either a screen
or tmux
session on watgpu.cs
Queue and Priority
Jobs in WATGPU are ran depending on their priority. Higher priority jobs go faster in the queue and can sometimes requeue running jobs to free resources for themselves.
To work with this priority, WATGPU resources have been separated into several partitions. The partition on which a job is running defines its possible allocated resources and its priority. Selecting a partition to run a job is easy:
- For interactive sessions: simply add
--partition=<PARTITION>
argument in thesalloc
command. - For batch jobs: add
#SBATCH --partition=<PARTITION>
in your.sh
script
Following is a presentation of the different partitions available in WATGPU:
ALL
This is the default partition. A job submitted to ALL
will be able to run on any available resources on WATGPU.
Jobs will have standard priority for the queue and might be preempted and requeued if a higher priority job is in need of resources.
All users have access to this partition:
#SBATCH --partition=ALL
SCHOOL
A job submitted to SCHOOL
will be allocated only to GPUs owned by the School. While fewer resources are available in SCHOOL
than in ALL
, the chances of getting preempted by higher priority jobs are reduced.
Jobs will have standard priority for the queue and might be preempted and requeued if a higher priority job is in need of resources.
All users within the School have access to this partition:
#SBATCH --partition=SCHOOL
<GROUP>
When a group <GROUP> has entrusted their GPU(s) to WATGPU, they have access to the partition named <GROUP>
linked to their contributed GPU(s).
Jobs will have high priority for the queue and will preempt and requeue lower priority jobs (from ALL
and SCHOOL
) if there are not enough resources available with a limit of GPU(s) up to their contribution.
All users from a group have access to their group partition:
#SBATCH --partition=<GROUP>
Users using their group's partition to request GPUs must now specify the group name for the GPU request in their jobs:
#SBATCH --gres=gpu:<GROUP_name>gpu:1
JupyterHub on WATGPU
JupyterHub allows you to easily run Jupyter notebooks on the cluster, providing you with another way to run interactive jobs to debug and test your code on the cluster. There is also the added benefit of being able to monitor your code's GPU utilization through the jupyterlab-nvdashboard extension that is enabled for all users, allowing you to understand if your code is efficiently using its allocated GPU.
Users who are looking to use Jupyter notebooks on WATGPU can securely sign in to WATGPU's JupyterHub at watgpu.cs.uwaterloo.ca/jupyterhub/ using their UWaterloo credentials and authenticate using Duo two-factor authentication.
After signing in, users can request resources by filling out a form that specifies, the compute node, number of CPU cores, memory (in GB), number of GPUs, and time.
For now, Jupyter notebooks run through JupyterHub are submitted to the ALL partition, and have some limits on the session's resources:
- up to 16 CPU cores
- up to 128 GB of RAM
- up to 4 GPUs
- up to 4 hours of runtime
Users requesting Jupyter notebooks through JupyterHub with more than 8 CPU cores are recommended to submit these jobs to the watgpu408 compute node since that node has more CPU cores than other compute nodes.
Jupyter notebooks through JupyterHub are meant for debugging and testing your code, not for full production runs - we recommend submitting an sbatch job for full runs.
Virtual environments vs Jupyter kernels for Jupyter notebooks
Virtual environments, as explained in our section about virtual environments on the WATGPU cluster, are used as isolated directories that hold any associated libraries and packages required for a project. Jupyter kernels are used by Jupyter to run Jupyter notebooks in different virtual environments. You can use different virtual environments as different Jupyter kernels, so each Jupyter notebook can run with the libraries and packages specific to that environment.
When you create a Python virtual environment, it typically comes with its own Python interpreter that can be used to install different Python packages within the environment. To use that environment as a Jupyter kernel, you need to make it available to Jupyter by installing the ipykernel
package and linking the environment to Jupyter, allowing you to select that environment as the kernel when you launch a Jupyter notebook.
Installing ipykernel
and linking conda venvs to Jupyter
Activate the conda virtual environment you want to add as a kernel, then install ipykernel using conda:
(base) $ conda activate <conda-venv>
<conda-venv> $ conda install -c anaconda ipykernel
then add the conda virtual environment as a kernel:
<conda-venv> $ python -m ipykernel install --user --name=<conda-venv>
Installing ipykernel
and linking pip venvs to Jupyter
Activate the pip virtual environment you want to add as a kernel, the install ipykernel using pip:
$ source <venv-name>/bin/activate
<venv-name> $ pip install ipykernel
then add the pip virtual environment as a kernel:
<venv-name> $ python -m ipykernel install --user --name=<venv-name>
You can then launch a Jupyter notebook from the directory that contains the pip virtual environment directory.
Accessing Jupyter notebooks in the cluster
This page has directions for how to manually launch a Jupyter notebook on the WATGPU cluster; since we've launched JupyterHub for WATGPU, we recommend users to launch Jupyter notebooks using JupyterHub, following instructions on the JupyterHub on WATGPU page.
Manually launching Jupyter notebooks in the cluster
One can access jupyter notebooks on compute systems using SOCKS proxy and ssh through watgpu.cs. From a (Linux/OSX) terminal ssh to watgpu.cs:
ssh -D 7070 user@watgpu.cs.uwaterloo.ca
On a browser on your local (client) system, configure traffic to use a SOCKS proxy at localhost port 7070. FoxyProxy for Firefox can make this configuration easy to set up and modify.
Preliminaries: Create a conda environment that includes the jupyter server
Also add required conda packages for your working environment e.g. pytorch:
conda create -y -n jupyter-server
conda activate jupyter-server
conda install -c conda-forge pytorch-gpu
pip install jupyter
conda deactivate
Using jupyter notebooks with your environment
Make an interactive reservation with the SLURM scheduler:
salloc --gres=gpu:1 --cpus-per-task=4 --mem=32G --time=1:00:00
Once the reservation starts, ssh to the allocated compute system e.g. watgpu208:
ssh watgpu208
Activate your jupyter-server environment and start a jupyter notebook:
conda activate jupyter-server
jupyter notebook --ip $(hostname -I | awk '{print $1}') --no-browser
...
[I 2023-10-13 02:18:02.902 ServerApp] notebook | extension was successfully loaded.
[I 2023-10-13 02:18:02.902 ServerApp] Serving notebooks from local directory: /u3/ldpaniak
[I 2023-10-13 02:18:02.903 ServerApp] Jupyter Server 2.7.3 is running at:
[I 2023-10-13 02:18:02.903 ServerApp] http://192.168.152.121:8888/tree?token=141a606c1ec9d9f76f65395bd1a4042fb3a5e04307283592
[I 2023-10-13 02:18:02.903 ServerApp] http://127.0.0.1:8888/tree?token=141a606c1ec9d9f76f65395bd1a4042fb3a5e04307283592
[I 2023-10-13 02:18:02.903 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 2023-10-13 02:18:02.915 ServerApp]
To access the server, open this file in a browser:
file:///u3/ldpaniak/.local/share/jupyter/runtime/jpserver-232011-open.html
Or copy and paste one of these URLs:
http://192.168.152.121:8888/tree?token=141a606c1ec9d9f76f65395bd1a4042fb3a5e04307283592
http://127.0.0.1:8888/tree?token=141a606c1ec9d9f76f65395bd1a4042fb3a5e04307283592
Copy the link with the 8888 port to your broswer which has been configured to use SOCKSv4 proxy on localhost port 7070. Your juptyer notebook will be available.
Be sure to shut down the server when done with Control-c.
Custom kernels
There is a single default kernel at the moment: "Python 3". You can also create your own kernels by opening a Terminal inside the notebook:
Once you've opened the terminal you can create your own kernel. Below is an example:
conda create --name myenv # create a custom conda environment which will install packages to, and add to the notebook as a kernel
conda install --yes numpy # install a package you want
conda install -c anaconda ipykernel #install ipykernel which we will use to add kernel to notebook
python -m ipykernel install --user --name=myenv # add the conda environment as a kernel
VSCode Tutorial: Working with WATGPU
In this tutorial, we'll walk you through the process of setting up and using Visual Studio Code (VSCode) to work with WATGPU. This guide will cover connecting to the login gateway and accessing specific clusters (e.g., watgpu108, watgpu208, ...) after allocating resources.
Note that this method of connecting to WATGPU should be used for debugging and understanding your code via testing and notebooks. If you wish to run long experiments, please use the SBATCH command.
Prerequisites
Before you begin, make sure you have the following prerequisites installed:
Quick steps to install the Remote - SSH extension:
- Open Visual Studio Code
- Click on the Extensions icon in the Activity Bar on the side of the window.
- Search for "Remote - SSH" in the Extensions view search box.
- Install the "Remote - SSH" extension.
Step 1: Configuring VSCode
Once the extension is installed, follow these steps to connect to the login gateway (replace your_username
with your actual username):
-
Press
Ctrl + Shift + P
(Windows/Linux) orCmd + Shift + P
(Mac) to open the command palette. -
Type "Remote-SSH: Open SSH Configuration File ...", select it and select the desired file (recommended file:
~/.ssh/config
). -
Insert the following part and replace
<username>
and<path/to/private/key>
with your username and the path to your private key registered in WatGPU:#WATGPU Host watGPU User <username> IdentityFile <path/to/private/key> HostName watgpu.cs.uwaterloo.ca Host watGPU108 User <username> IdentityFile <path/to/private/key> HostName watgpu108 ProxyJump watGPU Host watGPU208 User <username> IdentityFile <path/to/private/key> HostName watgpu208 ProxyJump watGPU Host watGPU308 User <username> IdentityFile <path/to/private/key> HostName watgpu308 ProxyJump watGPU
-
Save and refresh the Remote - SSH extension menu.
Step 2: Allocate Resources and Connect to Cluster
After successfully connecting via ssh to watgpu.cs.uwaterloo.ca
, you'll need to allocate resources using salloc
and then connect to a specific cluster:
-
Follow the Interactive mode part of the documentation to obtain resource allocation.
-
Once resources are allocated, connect to the appropriate cluster in VSCode:
That's it! You are now set up to work with WatGPU using Visual Studio Code.