
Job Submission

NI-HPC System Diagram


Job Handling

Slurm

Jobs on the cluster are under the control of a scheduling system, Slurm.

Jobs are scheduled according to the currently available resources and the resources that are requested. Information on how to request resources for your job is outlined here.

Jobs are not necessarily run in the order in which they are submitted. Jobs needing a large number of cores, a large amount of memory, or a long wall time will have to queue until the requested resources become available. While the larger job waits, the system will run smaller jobs that can fit into the available gaps.
Always request only the resources your job actually needs.

Submitting a Job

There are two classes of jobs that can be run on Kelvin2.

  • Non-interactive - sbatch
  • Interactive - srun

sbatch

Jobs are submitted via a jobscript. To learn more about writing a jobscript, see here.

Once you have created your jobscript, submit it using the sbatch command followed by its name:

sbatch my_jobscript.sh

Once your job is submitted you will receive a unique JOBID.
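For reference, a minimal jobscript might look like the sketch below. The job name and resource values are illustrative only; adjust them for your workload and check the partition table for what each partition allows.

```shell
#!/bin/bash
#SBATCH --job-name=myJob         # name shown in squeue output
#SBATCH --ntasks=1               # number of tasks
#SBATCH --mem-per-cpu=1G         # memory per CPU
#SBATCH --time=00:10:00          # wall time limit (hh:mm:ss)
#SBATCH --output=myJob_%j.out    # %j expands to the JOBID
#SBATCH --error=myJob_%j.err     # separate error output

echo "Running on $(hostname)"
```

Submitting this file with `sbatch my_jobscript.sh` prints a confirmation such as `Submitted batch job 12345`, where 12345 is the JOBID.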

srun

srun starts an interactive job, allowing users to run interactive applications directly on a compute node.

To start an interactive job :

srun --pty /bin/bash

  • Users should specify the resources required to run.
  • Input is given to the shell session or application that is started.
  • Output is shown on screen, or can be redirected elsewhere.
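Since interactive sessions compete for the same resources as batch jobs, it is good practice to state the resources you need explicitly when starting one. A sketch, with illustrative values:

```shell
# Interactive shell with 4 CPUs, 2 GB of memory per CPU, for at most 1 hour
srun --ntasks=1 --cpus-per-task=4 --mem-per-cpu=2G --time=01:00:00 --pty /bin/bash
```

The prompt returns once the scheduler has allocated the resources, at which point you are on a compute node.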

Queue status

Once you have submitted your jobs using sbatch or srun, you can then view your queue status to see how your jobs are doing along with further information:

squeue -u <username>

Example squeue output :

JOBID PARTITION NAME       USER  ST TIME NODES NODELIST(REASON)
 11    all       mpiJob    user1 PD 0:00 2     (Resources)
 2     all       serialJob user1 R  0:02 1     node101

Job states

Once you have executed the squeue command, take note of the current state (ST):

  • Running jobs (R) - your job is currently running on a compute node.
  • Pending jobs (PD) - your job is waiting for resources to become available.
  • Failed jobs (F) - your job submission has failed and should be deleted from the queue.

Deleting jobs

If you need to delete a job from the current queue you can use the scancel command with the unique JOBID of the job.

scancel 8

Users can delete their own jobs only.
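Besides cancelling a single job by JOBID, scancel can act on several jobs at once. For example, to clear all of your own jobs from the queue (substituting your own username):

```shell
# Cancel one job by its JOBID
scancel 8

# Cancel all of your own pending and running jobs
scancel -u <username>
```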

Slurm cheat sheet

Common job commands

Task                         SGE           Slurm
Submit a job                 qsub          sbatch
Delete a job                 qdel          scancel
Job status (all)             qstat, showq  squeue
Job status (by job)          qstat -j      squeue -j
Job status (detailed)        qstat -j      scontrol show job
Show expected start time     qstat -j      squeue -j --start
Start an interactive job     qrsh          srun --pty /bin/bash
Monitor job resource usage   qacct -j      sacct -j --format="JobID,jobname,NTasks,nodelist,CPUTime,ReqMem,MaxVMSize,Elapsed"

Slurm environment variables

Variable Function
SLURM_ARRAY_JOB_ID Job array's master job ID number.
SLURM_ARRAY_TASK_ID Job array ID (index) number.
SLURM_CLUSTER_NAME Name of the cluster on which the job is executing.
SLURM_CPUS_PER_TASK Number of cpus requested per task. Only set if the --cpus-per-task option is specified.
SLURM_JOB_ACCOUNT Account name associated with the job allocation.
SLURM_JOB_ID The ID of the job allocation.
SLURM_JOB_NAME Name of the job.
SLURM_JOB_NODELIST List of nodes allocated to the job.
SLURM_JOB_NUM_NODES Total number of nodes in the job's resource allocation.
SLURM_JOB_PARTITION Name of the partition in which the job is running.
SLURM_JOBID The ID of the job allocation. See SLURM_JOB_ID. Included for backwards compatibility.
SLURM_JOB_USER User name of the job owner
SLURM_MEM_PER_CPU Same as --mem-per-cpu
SLURM_MEM_PER_NODE Same as --mem
SLURM_NTASKS Same as -n, --ntasks
SLURM_NTASKS_PER_NODE Number of tasks requested per node. Only set if the --ntasks-per-node option is specified.
SLURM_PROCID The MPI rank (or relative process ID) of the current process.

More information about Slurm commands, flags, and environment variables can be found on the Slurm web page.
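As a sketch of how these variables can be used, a jobscript can reference them to label its own output (outside a Slurm job the variables are simply unset):

```shell
#!/bin/bash
#SBATCH --job-name=envDemo
#SBATCH --ntasks=1

# Slurm sets these variables in the job's environment at run time
echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) running on cluster $SLURM_CLUSTER_NAME"
echo "Allocated nodes: $SLURM_JOB_NODELIST"
```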

Optimization

  • Never run jobs on the login nodes. Doing so seriously disturbs other users who are logged in; login nodes are never to be used for running jobs.
  • If you need to run a job interactively, allocate an interactive session with srun; never use the login nodes.
  • Request only the resources your job needs. Use the job accounting tool sacct to check whether you allocated more memory than necessary, and adjust future jobs that run the same application.
  • Don't allocate more resources than necessary; doing so increases your queue time and wastes machine resources that other users could use.
  • Try to spread the allocation among several nodes; 20 CPUs and 100 GB of memory is a reasonable amount to allocate on a single node.
  • Don't allocate a large number of CPUs or a large amount of memory on a single node.
  • Give Slurm the freedom to distribute resources among the nodes: use the flags --ntasks and --mem-per-cpu to allocate resources per CPU rather than per node.
  • Where possible, do not restrict resources per node; avoid the flags --ntasks-per-node and --mem.
  • Never request more resources than a node provides. Review the training material for information about the resources per node in each partition.
  • Always choose the correct partition: double-check the resources you require, in particular the wall time, and match them to the partition. Check the partition table in the training material.
  • Specify an error output separate from the standard output. If the job crashes, this output contains useful information about why it failed and how it can be fixed.
  • Activate email notifications.
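Putting the points above together, a jobscript header following these recommendations might look like this sketch (the application name, email address, and resource values are placeholders, not fixed recommendations):

```shell
#!/bin/bash
#SBATCH --job-name=optDemo
#SBATCH --ntasks=20                  # let Slurm spread the tasks across nodes
#SBATCH --mem-per-cpu=5G             # request memory per CPU, not per node
#SBATCH --time=02:00:00              # realistic wall time for the chosen partition
#SBATCH --output=optDemo_%j.out      # standard output
#SBATCH --error=optDemo_%j.err       # separate error output
#SBATCH --mail-type=BEGIN,END,FAIL   # activate email notifications
#SBATCH --mail-user=you@example.com  # placeholder address

srun ./my_application                # hypothetical application binary
```

Note what is absent: no --ntasks-per-node and no --mem, so Slurm is free to place the 20 tasks wherever resources are available.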