Running Jobs
On Kelvin2, jobs are submitted via batch script to a scheduling system called SLURM. Jobs are not necessarily run in the order in which they are submitted, and jobs that require a large number of cores, a large amount of memory or a long walltime will have to queue until the requested resources become available. The system will run smaller jobs that can fit in the available gaps until all of the resources requested for the larger job become available.
Note
To improve your experience with the queue, and that of other users, please ensure that you are not requesting excessive amounts of resources for your jobs. After reading the documentation below, see our tips for optimising your jobscript.
Batch Scripts
A SLURM batch script typically contains four main components:
- A 'shebang' line, #!, which states the interpreter type, e.g. #!/bin/bash
- A series of #SBATCH directives that specify the resource requirements of the job and a variety of other attributes
- Lines that load modules and set the environment
- At least one executable bash line that runs code with any associated input parameters

A brief overview of #SBATCH directives and modules on Kelvin2 is given below.
#SBATCH directives
#SBATCH directives specify the resources required for your job and various other attributes. The definitive list of these is in the SLURM documentation; a few of the most commonly used ones are given below.
Attribute Directives: e.g. --job-name, --output, --error, --time, --mail-type, --mail-user

- --job-name - Specifies a name for the job. This helps identify the job when querying your queues.
- --output - Specifies the file in which the standard output will be written. This is the output that would have been printed to the terminal if you were running the application directly.
- --error - Specifies the file in which the standard errors will be written.
- --time - Specifies a time limit for the job, after which it will automatically exit, e.g. 01:30:00 for 1.5 hours, or 1-00:00:00 for 1 day. Providing an accurate time limit will help your queued jobs start running more quickly.
- --mail-type - Notifies the user by email when various events occur, such as when the job starts or ends.
- --mail-user - Specifies the email address to receive the notifications.
#SBATCH --job-name=my-kelvin2-job
#SBATCH --output=my-output-file
#SBATCH --error=my-error-file
#SBATCH --time=01:30:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<user-email-address>
Partition Directive: --partition
The --partition directive should be specified according to the resource requirements and time limit of your job. For example, if you wanted to run a job with a wall time of 3 days, you would need to use a low-priority partition, otherwise your job would never run, e.g.
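A sketch for the 3-day example above, using the k2-lowpri partition from the table below:

#SBATCH --partition=k2-lowpri
#SBATCH --time=3-00:00:00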
The main Kelvin2 partitions along with their time limits and computational resources are shown below. The default partition, if none is specified in the batch script, is k2-hipri.
Partition | Time Limit | CPU Cores per Node | Memory (GB) per Node |
---|---|---|---|
k2-hipri | 3 hours | 128 (94 Nodes) | 773 GB (59 Nodes), 1023 GB (35 Nodes) |
k2-medpri | 24 hours | " | " |
k2-lowpri | N/A | " | " |
k2-epsrc | N/A | " | " |
k2-himem | 3 days | 128 (6 Nodes), 256 (2 Nodes) | 2051 GB (4 Nodes), 2063 GB (4 Nodes) |
k2-epsrc-himem | N/A | " | " |
k2-gpu | 3 days | 48 (8 Nodes), 128 (4 Nodes) | 514 GB (8 Nodes), 1031 GB (4 Nodes) |
k2-gpu-interactive | 3 hours | " | " |
k2-epsrc-gpu | 3 days | " | " |
Other partitions are also available, including some for specific research groups. A comprehensive list of Kelvin2 nodes, their associated partitions and their computational resources can be found using the sinfo command.
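For example, the following sketch lists each partition with its time limit, node count, CPUs per node and memory per node (in MB), using standard sinfo output-format fields:

sinfo -o "%P %l %D %c %m"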
Resource Directives: e.g. --ntasks, --nodes, --cpus-per-task, --mem-per-cpu
#SBATCH directives for computational resources follow the format below.
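As a general sketch, each resource directive is a single option/value pair:

#SBATCH --<option>=<value>    # e.g. --ntasks, --nodes, --cpus-per-task, --mem-per-cpu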
MPI Application example
If your MPI application requires 40 CPU cores spread across a maximum of 2 nodes and 80GB of memory (2GB per CPU core), you would include the following directives.
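A sketch of the corresponding directives, assuming one MPI task per CPU core:

#SBATCH --ntasks=40
#SBATCH --nodes=2
#SBATCH --mem-per-cpu=2G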
OpenMP Application example
If your OpenMP application requires 40 CPU cores all on the same node and 80GB of memory (2GB per CPU core), you would include the following directives.
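A sketch of the corresponding directives, assuming a single task that runs 40 threads on one node:

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40
#SBATCH --mem-per-cpu=2G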
GPU Directives: --gres
Jobs submitted to the k2-gpu, k2-gpu-interactive (for interactive jobs only) and k2-epsrc-gpu partitions can access GPUs using the --gres flag by following the below template, where <N> is the number of GPUs requested per node.
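A sketch of the general pattern, where <gpu-type> is a placeholder for the GPU type strings shown in the table below:

#SBATCH --gres=gpu:<gpu-type>:<N>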
Available GPU Resources | Example --gres flag |
---|---|
32 x Tesla V100 PCIe 32GB (4 GPUs on 8 Nodes) | #SBATCH --gres=gpu:v100:1 |
12 x Tesla A100 SXM4 80GB (4 GPUs on 3 Nodes) | #SBATCH --gres=gpu:a100:1 |
28 x CI slices of a Tesla A100 SXM4 80GB (7 slices on 4 GPUs on 1 Node) | #SBATCH --gres=gpu:1g.10gb:1 |
Loading Modules
Kelvin2 has a large repository of software installed directly on the system, which can be loaded for your jobs from within the batch script.
The module avail command will show the list of applications that are currently installed in the central repository:
[<username>@login1 [kelvin2] ~]$ module avail
--- /opt/gridware/local/el7/etc/modules ---
apps/anaconda/2.5.0/bin
apps/anaconda3/2021.05/bin
apps/anaconda3/2022.10/bin
apps/anaconda3/5.2.0/bin
apps/annovar/20160201/noarch
apps/bamtools/2.3.0/gcc-4.8.5
...
Press space to navigate through the list one page at a time and, when done, press q to exit.
The most useful commands related to working with modules are:
- module display <module> - shows information about the software
- module load <module> - loads the selected module, including its binaries/libraries, into the user's environment
- module unload <module> - removes the module from the user's environment
- module list - shows the currently loaded modules in the user's environment
- module purge - clears all modules from the user's environment
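For example, to load one of the Anaconda modules from the listing above and confirm that it is in your environment:

module load apps/anaconda3/2022.10/bin
module list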
Two Batch Script Examples
For illustration, example jobscripts for CPU and GPU applications are shown below, but each should be customised for your specific job. More examples are given in the Applications section.
CPU Application (with OpenMPI)
#!/bin/bash
#SBATCH --job-name=mpi-job-name # Specify a name for the job
#SBATCH --output=mpi-job-output.out # Specify where stdout will be written
#SBATCH --error=mpi-job-error.err # Specify where stderr will be written
#SBATCH --time=01:30:00 # time limit of 1hr30min
#SBATCH --mail-user=<email address> # Specify email address for notifications
#SBATCH --mail-type=ALL # Specify types of notification (eg job Begin, End)
#SBATCH --ntasks=32 # Number of tasks to run (for MPI tasks this is number of cores)
#SBATCH --nodes=4 # Maximum number of nodes on which to launch tasks
#SBATCH --partition=k2-hipri # Specify SLURM partition to queue the job
module load mpi/openmpi/4.1.1/gcc-9.3.0 # Load the application and dependencies
module load <mpi-application>
mpirun -n 32 <application-command> [options] # Run the code with 32 MPI tasks
GPU Application (MATLAB)
#!/bin/bash
#SBATCH --job-name=matlab-job-name
#SBATCH --output=matlab-output.out
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=5G
#SBATCH --partition=k2-gpu
#SBATCH --gres=gpu:1g.10gb:1
module load matlab/R2022a
# Ulster University (UU) users must use UU's Matlab licence by uncommenting
# the following environment variable declarations:
#export MLM_LICENSE_FILE=27000@193.61.190.229
#export LM_LICENSE_FILE=27000@193.61.190.229
matlab -nosplash -nodisplay -r "matlab_gpu_script;"
Managing Submitted Jobs
SLURM also has commands for managing your jobs.
Checking job status
You can view the status of your submitted jobs by using the squeue command:
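For example, to list only your own jobs:

[<username>@login1 [kelvin2] ~]$ squeue -u <username>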
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
<jobid> k2-hipri hello-wo <username> R 0:02 1 node103
The status of your job is given by the Job State code (ST), and the nodes your job is running on (or the reason it is not yet running) are shown in the 'NODELIST(REASON)' column. Common Job State codes are:
State Code | State | Description |
---|---|---|
R | Running | Your job is currently running. |
PD | Pending | Your job is waiting for resources to be allocated. |
F | Failed | Your job submission has failed. |
The definitive list of job state codes is given in the SLURM documentation.
Cancelling jobs
If you need to cancel a pending or running job, you can use the scancel command:
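For example, using the job ID reported by squeue:

scancel <jobid>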
Users can delete their own jobs only.
Analysing previous jobs using sacct
Jobs that you have previously run can be analysed to check how much of the resources you allocated were actually used by the program. More information may be found in the SLURM documentation.
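For example, using the same format string shown in the tips section below:

sacct -j <job_num> --format="JobID,jobname,NTasks,nodelist,CPUTime,ReqMem,MaxVMSize,Elapsed"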
Interactive Jobs
Interactive jobs allow the user to interact with their applications as they are running on the compute node(s). This is useful for data visualisation and for debugging programs and more complex code compilations.
Note
Since interactive sessions involve waiting on human input, they are generally an inefficient use of computational resources. As such, we recommend using a batch script whenever possible to run your applications automatically in the background.
You can launch an interactive session by entering the srun command in the terminal, followed by a list of computational resources, followed by --pty bash.
For example, the following command requests 10 CPU cores and 10GB of RAM on 1 node for a 1-hour period.
[<username>@login1 [kelvin2] ~]$ srun -p k2-hipri -N 1 -n 10 --mem-per-cpu=1G --time=1:00:00 --pty bash
srun: job <jobid> queued and waiting for resources
srun: job <jobid> has been allocated resources
[<username>@node103 [kelvin2] ~]$
Once connected to the compute node you can load modules, set your environment and run your jobs interactively.
Interactive jobs with graphical applications
It is possible to interact with graphical applications on Kelvin2 by launching a VNC server and creating an SSH tunnel.
1. Launch an interactive job on a compute node by typing the following command into your Kelvin2 terminal. Modify your resource requirements as needed.
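    A sketch, reusing the resource request from the example above (adjust the partition, cores, memory and time to suit your application):

    srun -p k2-hipri -N 1 -n 10 --mem-per-cpu=1G --time=1:00:00 --pty bash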
2. On the compute node, launch a VNC server (TigerVNC) by typing the following into your Kelvin2 terminal.
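    Assuming TigerVNC's standard launcher command:

    vncserver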
3. If this is your first time launching a VNC server, you will be prompted to set a password. For security reasons, this should be different from your login/SSH password for Kelvin2.
4. Once your VNC server is launched, information about the new session, including the compute node address and display number, will be printed to your screen.
5. Take note of the number after the compute node address (e.g. the 1 at the end of node181.pri.kelvin2.alces.network:1). This specifies the port number you need when creating your tunnel. Since the number here is 1, we use port 5901; if the number were 7, we would use port 5907.
In a separate terminal on your local computer, enter the following command to launch an SSH tunnel. Modify your node name and port numbers as appropriate. Here
5903
specifies the port on your local host and5901
comes from the port number noted in the previous step. You will have to log in and authenticate using your usual credentials.From inside the QUB Network
From outside the QUB Network
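    A generic sketch of the tunnel command (the compute node address is the one reported when your VNC server started; <kelvin2-login-address> is a placeholder for whichever login address and SSH options you normally use from inside or from outside the QUB Network):

    ssh -L 5903:node181.pri.kelvin2.alces.network:5901 <username>@<kelvin2-login-address>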
7. Now open your VNC application and connect to localhost on port 5903 (or whichever local port you chose in Step 6). You will have to enter the password set in Step 3.
8. You will now have a graphical view of a terminal on Kelvin2. From here you may load modules and open graphical applications as required.
9. After you have finished your session:
    - Disconnect from your graphical session in your VNC application.
    - Find and close the vncserver on the compute node by typing the following commands into your Kelvin2 terminal.
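      A sketch using TigerVNC's standard options, where :1 is the display number noted in Step 5:

      vncserver -list        # list your running VNC sessions
      vncserver -kill :1     # stop the session running on display :1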
    - Close the tunnel by exiting the local terminal opened in Step 6.
Checkpointing
"Checkpointing" means periodically saving the progress of your computations, such that your job can be paused and resumed at a later time. It is good practice to use checkpointing for jobs that take a long time to complete to ensure that you do not lose all progress if any unplanned interruptions occur, e.g. out-of-memory errors/hardware failures/partition time-limits.
Many HPC software applications support checkpointing and instructions are typically included in their user manuals. A few examples are included below:
Tips for optimising your jobscript
- Use the job analysis tool sacct to check whether you allocated more memory than necessary, so you can request less in future jobs with the same application, e.g.
sacct -j <job_num> --format="JobID,jobname,NTasks,nodelist,CPUTime,ReqMem,MaxVMSize,Elapsed"
- Do not allocate more resources than are available in the requested nodes or your job will not run.
- When possible, try not to allocate a large number of CPUs or a large amount of memory in a single node; instead, spread the allocation among several nodes. 20 CPUs and 100 GB of memory is a reasonable amount to allocate on a single node.
- Give Slurm the freedom to distribute the resources among the nodes: use the flags --ntasks and --mem-per-cpu to allocate resources per CPU, rather than --ntasks-per-node or --mem.
- Always choose the correct partition: double-check the resources you require, in particular the wall time, and make sure they fit within that partition's limits.
- Specify an error output file separate from the standard output file. If the job crashes, this output contains useful information about why the job failed and how it can be fixed.
- Activate email notifications.
- If you are intending to run a large number of similar jobs, consider submitting a Job Array (a minimal sketch is shown after this list).
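A minimal job array sketch; my_program and the numbered input files are illustrative names, and each array task processes one input:

#!/bin/bash
#SBATCH --job-name=array-example
#SBATCH --output=array-%A_%a.out     # %A is the array master job ID, %a the task index
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --partition=k2-hipri
#SBATCH --array=1-10                 # launch tasks with indices 1 to 10

# Each task picks up its own index from SLURM_ARRAY_TASK_ID
./my_program input_${SLURM_ARRAY_TASK_ID}.dat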
Slurm cheat sheets
Common job commands
Description | Command |
---|---|
Submit a job | sbatch <jobscript> |
Delete a job | scancel <jobid> |
Job status (all) | squeue |
Job status (by job) | squeue -j <jobid> |
Job status (detailed) | scontrol show job <jobid> |
Show expected start time | squeue -j <jobid> --start |
Start an interactive job | srun --pty /bin/bash |
Monitor jobs resource usage | sacct -j <jobid> |
Slurm Environmental variables
Variable | Function |
---|---|
SLURM_ARRAY_JOB_ID | Job array's master job ID number. |
SLURM_ARRAY_TASK_ID | Job array ID (index) number. |
SLURM_CLUSTER_NAME | Name of the cluster on which the job is executing. |
SLURM_CPUS_PER_TASK | Number of cpus requested per task. Only set if the --cpus-per-task option is specified. |
SLURM_JOB_ACCOUNT | Account name associated with the job allocation. |
SLURM_JOB_ID | The ID of the job allocation. |
SLURM_JOB_NAME | Name of the job. |
SLURM_JOB_NODELIST | List of nodes allocated to the job. |
SLURM_JOB_NUM_NODES | Total number of nodes in the job's resource allocation. |
SLURM_JOB_PARTITION | Name of the partition in which the job is running. |
SLURM_JOBID | The ID of the job allocation. See SLURM_JOB_ID. Included for backwards compatibility. |
SLURM_JOB_USER | User name of the job owner |
SLURM_MEM_PER_CPU | Same as --mem-per-cpu |
SLURM_MEM_PER_NODE | Same as --mem |
SLURM_NTASKS | Same as -n, --ntasks |
SLURM_NTASKS_PER_NODE | Number of tasks requested per node. Only set if the --ntasks-per-node option is specified. |
SLURM_PROCID | The MPI rank (or relative process ID) of the current process. |
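For example, these variables can be used inside a jobscript so that resource values are not hard-coded (a sketch; my_mpi_app is an illustrative name):

mpirun -n $SLURM_NTASKS ./my_mpi_app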
More information about Slurm commands, flags and environment variables can be found on the Slurm web page.