Job scheduler
A job scheduler, also called a workload manager, is necessary for efficient use of the computing resources. It is the most important application for running your jobs on the compute nodes, so every user must know how to use it. Tasks that do not pass through the job scheduler will be killed without notice.
Figure: xkcd, Hard Reboot
The PTC cluster uses Slurm as its job scheduler. The installed version is 17.02.9. If you are new to Slurm, the official Quick Start User Guide is the best place to start learning about it. What follows is a basic guide to using Slurm on the PTC cluster.
Users of the old PTC cluster system might be familiar with Sun Grid Engine (SGE). Please refer to SGE to SLURM conversion and modify your scripts accordingly.
A summary of commands and options of Slurm is given here.
Partition
Partitions group compute nodes into logical sets. Each partition has its own job size limit, job time limit, and group of users permitted to use it. Partitions are sometimes called job queues. The sinfo command reports the state of the partitions.
$ sinfo
PARTITION AVAIL TIMELIMIT JOB_SIZE MAX_CPUS_PER_NODE NODES(A/I/O/T) CPUS(A/I/O/T)
espresso* up 20:00 1-2 10 0/32/0/32 0/1608/0/1608
The partition shown above is named espresso, and it is up and running. The * denotes that it is the default partition: your job will be assigned to the espresso partition if you do not specify a partition. The job time limit (TIMELIMIT) is 20 minutes, and jobs running beyond the time limit are killed automatically. A job assigned to the espresso partition can use at most two nodes (JOB_SIZE), and the maximum number of CPUs per node (MAX_CPUS_PER_NODE) is 10. Thus, a job in the espresso partition can use up to 2 * 10 = 20 CPUs concurrently. The last two fields show the number of nodes by state in the format "allocated/idle/other/total" (A/I/O/T) and the number of CPUs in the same format.
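The CPU limit worked out above can also be computed from the sinfo output. The following is a minimal sketch, assuming the column layout shown in this guide (JOB_SIZE in the fourth column, MAX_CPUS_PER_NODE in the fifth). The function name max_cpus_per_job is hypothetical, and non-numeric limits such as infinite or UNLIMITED are not handled.

```shell
# Hypothetical helper: print "<partition> <max CPUs per job>" for each
# partition line of sinfo-style output read from standard input.
# Assumes the column layout shown in this guide.
max_cpus_per_job() {
    awk 'NR > 1 {
        name = $1
        sub(/\*$/, "", name)          # drop the default-partition marker
        n = split($4, range, "-")     # JOB_SIZE is "min-max" or a single number
        print name, range[n] * $5     # max nodes * max CPUs per node
    }'
}

# Sample input copied from the sinfo output above. On the cluster, pipe
# sinfo output in the same format into the function instead.
printf '%s\n' \
    'PARTITION AVAIL TIMELIMIT JOB_SIZE MAX_CPUS_PER_NODE NODES(A/I/O/T) CPUS(A/I/O/T)' \
    'espresso* up 20:00 1-2 10 0/32/0/32 0/1608/0/1608' |
    max_cpus_per_job
```

For the espresso line this prints the partition name followed by the product of the two limits.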
Another useful command is scontrol show partition.
$ scontrol show partition espresso
PartitionName=espresso
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=2 MaxTime=00:20:00 MinNodes=0 LLN=NO MaxCPUsPerNode=10
Nodes=compute-0-[0-31]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=1608 TotalNodes=32 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=2000 MaxMemPerNode=UNLIMITED
Note again MaxNodes=2 MaxTime=00:20:00. AllowGroups=ALL means that any user can run jobs in the partition. scontrol show partitions shows the detailed information of all the partitions where you can submit jobs.
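Individual fields can be pulled out of the scontrol output when only one or two values are needed. This is a minimal sketch that relies only on the KEY=VALUE layout shown above. The helper name scontrol_field is hypothetical, and it assumes that values contain no spaces.

```shell
# Hypothetical helper: print the value of one KEY=VALUE field from
# scontrol-style output read from standard input.
scontrol_field() {
    # Put every KEY=VALUE pair on its own line, then strip the "KEY=" prefix.
    tr -s ' \t' '\n\n' | sed -n "s|^$1=||p"
}

# Sample line copied from the output above. On the cluster, use for example:
#   scontrol show partition espresso | scontrol_field MaxTime
echo 'MaxNodes=2 MaxTime=00:20:00 MinNodes=0 LLN=NO MaxCPUsPerNode=10' |
    scontrol_field MaxTime
```

Because the match is anchored at the start of each pair, asking for Nodes does not pick up MaxNodes or TotalNodes.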
A new user can use only the espresso partition. The other partitions, which have longer time limits and larger job sizes, become available once the user has shown enough knowledge and skill to use the cluster. It is recommended that you test and practice repeatedly in the espresso partition before using the other partitions.
The squeue command shows the status of running jobs.
$ squeue
JOBID PARTITION NAME USER STATE TIME TIME_LIMIT NODES
317 microcent my.scrip cbpark COMPLETI 0:50 1:00 1
315 fortnight zsh cbpark RUNNING 5:26 14-00:00:00 1
318 microcent my2.scri cbpark RUNNING 0:39 1:00 3
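When many jobs are listed, a per-state count is often easier to read than the full table. The sketch below assumes that STATE is the fifth column, as in the output above. The helper name jobs_by_state is hypothetical.

```shell
# Hypothetical helper: count jobs per state from squeue-style output on
# standard input. Assumes STATE is the fifth column, as in this guide.
jobs_by_state() {
    awk 'NR > 1 { count[$5]++ } END { for (s in count) print s, count[s] }' | sort
}

# Sample input copied from the squeue output above. On the cluster, pipe
# squeue output in the same format into the function instead.
printf '%s\n' \
    'JOBID PARTITION NAME USER STATE TIME TIME_LIMIT NODES' \
    '317 microcent my.scrip cbpark COMPLETI 0:50 1:00 1' \
    '315 fortnight zsh cbpark RUNNING 5:26 14-00:00:00 1' \
    '318 microcent my2.scri cbpark RUNNING 0:39 1:00 3' |
    jobs_by_state
```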
scontrol show job shows the details of a submitted job.
$ scontrol show job 315
JobId=315 JobName=zsh
UserId=cbpark(1001) GroupId=cbpark(1001) MCS_label=N/A
Priority=4294901729 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:10:32 TimeLimit=14-00:00:00 TimeMin=N/A
SubmitTime=2018-04-09T15:11:12 EligibleTime=2018-04-09T15:11:12
StartTime=2018-04-09T15:11:12 EndTime=2018-04-23T15:11:12 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=fortnight AllocNode:Sid=ptc:35960
ReqNodeList=(null) ExcNodeList=(null)
NodeList=compute-0-11
BatchHost=compute-0-11
NumNodes=1 NumCPUs=40 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=40,mem=80000M,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
Command=zsh
WorkDir=/home/cbpark
Power=
Job submission
The simplest command to run tasks on the compute nodes is srun.
$ srun -N 2 -n 5 -p espresso /bin/hostname
compute-0-21
compute-0-18
compute-0-18
compute-0-18
compute-0-18
The above command prints the hostnames of the compute nodes in the espresso partition (-p espresso) by running five tasks (-n 5) on two nodes (-N 2). (In most cases, it is not necessary to specify the number of nodes with -N.) Each task gets one CPU by default, so -n 5 starts five single-CPU tasks rather than giving five CPU cores to one task. The --cpus-per-task (-c) option changes this default, and it is the option to use when a single task needs several cores, as in multiprocessing (or parallel) jobs.
-N: number of compute nodes requested
-n: total number of tasks (processes)
-c: number of CPUs per task
Note that the maximum number of nodes per job in the espresso partition is 2. If you request more nodes than the limit, the job will not run: it is queued and stays pending, as the (PartitionNodeLimit) reason in the squeue output shows.
$ srun -N 5 -n 10 -p espresso /bin/hostname
srun: Requested partition configuration not available now
srun: job 330 queued and waiting for resources
$ squeue
JOBID PARTITION NAME USER STATE TIME TIME_LIMIT NODES NODELIST(REASON)
330 espresso hostname cbpark PENDING 0:00 20:00 5 (PartitionNodeLimit)
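A request can be checked against the partition limits before it is submitted. The following is a simplified sketch, not Slurm's actual scheduling logic: it compares only the node count and the total CPU count (-n times -c) with the MaxNodes and MaxCPUsPerNode values, and it ignores memory and other constraints. The function name fits_partition and its argument order are hypothetical.

```shell
# Hypothetical helper, simplified on purpose.
# usage: fits_partition NODES NTASKS CPUS_PER_TASK MAX_NODES MAX_CPUS_PER_NODE
fits_partition() {
    req_nodes=$1; req_tasks=$2; cpus_per_task=$3
    max_nodes=$4; max_cpus_per_node=$5
    if [ "$req_nodes" -gt "$max_nodes" ]; then
        echo "too many nodes: $req_nodes > $max_nodes"
        return 1
    fi
    req_cpus=$((req_tasks * cpus_per_task))        # -n times -c
    cpu_limit=$((req_nodes * max_cpus_per_node))   # CPUs usable on the requested nodes
    if [ "$req_cpus" -gt "$cpu_limit" ]; then
        echo "too many CPUs: $req_cpus > $cpu_limit"
        return 1
    fi
    echo "ok: $req_cpus CPUs on $req_nodes node(s)"
}

# espresso limits from above: MaxNodes=2, MaxCPUsPerNode=10
fits_partition 2 5 1 2 10                                  # srun -N 2 -n 5
fits_partition 5 10 1 2 10 || echo "request would stay pending"   # srun -N 5 -n 10
```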
The job can be canceled by running scancel 330 (see JOBID). To cancel all of your jobs, run
scancel -u <username>
where <username> is your user ID.
More complicated operations can be done by submitting a script, which can be reused later. Suppose the script is run.sh.
#! /bin/bash -l
#
#SBATCH --job-name=test
#SBATCH --output=output.txt
#
#SBATCH --partition=espresso
#SBATCH --ntasks=10
#SBATCH --nodes=2
#SBATCH --mem-per-cpu=100
#SBATCH --time=10:00
# Single quotes defer $(/bin/hostname) so that each task reports its own node.
srun bash -c 'echo "Greetings from $(/bin/hostname)"'
srun sleep 60
The above script prints Greetings from hostname 10 times (--ntasks=10) to output.txt (--output=output.txt) using two nodes (--nodes=2) and 100 MB of memory per CPU (--mem-per-cpu=100). The time limit is set to 10 minutes (--time=10:00). Note that lines beginning with #SBATCH are not comments: #SBATCH is a prefix for setting options. To comment out such a line, add one more #, i.e., ##SBATCH.
##SBATCH # This is a comment, but
#SBATCH # this is NOT a comment. Slurm will read this line.
The -l option in the first line means that the shell acts as if it had been invoked as a login shell.
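The --time=10:00 value in the script is read as minutes:seconds, which is easy to mistake for hours:minutes. Slurm accepts the forms minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, and days-hours:minutes:seconds. The sketch below converts such a string to seconds as a way to double-check a limit before submitting. The function name slurm_time_to_seconds is hypothetical and performs no input validation.

```shell
# Hypothetical helper: convert a Slurm time string to seconds.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; t = $0
        dash = index(t, "-")
        if (dash > 0) {                 # days-hours[:minutes[:seconds]]
            days = substr(t, 1, dash - 1)
            t = substr(t, dash + 1)
            n = split(t, p, ":")
            h = p[1]; m = (n >= 2) ? p[2] : 0; s = (n >= 3) ? p[3] : 0
        } else {                        # minutes | minutes:seconds | hours:minutes:seconds
            n = split(t, p, ":")
            if (n == 3)      { h = p[1]; m = p[2]; s = p[3] }
            else if (n == 2) { h = 0;    m = p[1]; s = p[2] }
            else             { h = 0;    m = p[1]; s = 0 }
        }
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

slurm_time_to_seconds 10:00         # the --time value used in run.sh
slurm_time_to_seconds 14-00:00:00   # a 14-day limit
```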
Before submitting the script to the job scheduler, it's advisable to validate the script.
$ sbatch --test-only run.sh
sbatch: Job 564 to start at 2018-04-12 00:20:54 using 80 processors on compute-0-[0-1]
To actually submit the script to the job scheduler, run
sbatch run.sh
See man sbatch or the sbatch page on the official website for available options. The options that might be useful are --deadline, --workdir, --mem, --ntasks-per-node, and so on.
If you want to cancel a running job and resubmit it, note the job ID and run
scontrol requeue <jobid>
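Instead of noting the job ID by hand, it can be captured when the job is submitted. The sketch below parses the "Submitted batch job" confirmation line that sbatch prints. The helper name jobid_from_sbatch is hypothetical, and the sample line stands in for real sbatch output.

```shell
# Hypothetical helper: extract the job ID from sbatch's confirmation line.
jobid_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# On the cluster:
#   jobid=$(sbatch run.sh | jobid_from_sbatch)
#   scontrol requeue "$jobid"
# Here a sample confirmation line is used in place of a real submission.
jobid=$(echo 'Submitted batch job 1076' | jobid_from_sbatch)
echo "job ID: $jobid"
```

sbatch also has a --parsable option that prints the job ID without the surrounding text, which makes the parsing step unnecessary.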
Sometimes, you want interactive operations for a job.
srun -p espresso --pty bash
will place your command prompt on a compute node. This is similar to the qrsh command of SGE. You can return to the master node with the exit command. If you're using another shell such as Zsh, use --pty zsh instead of --pty bash.
Selecting specific nodes
scontrol can show the detailed information of compute nodes.
$ scontrol show nodes
NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=20
CPUAlloc=0 CPUErr=0 CPUTot=40 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=compute-0-0 NodeHostName=compute-0-0 Version=17.02
OS=Linux RealMemory=193336 AllocMem=0 FreeMem=190091 Sockets=2 Boards=1
MemSpecLimit=4000
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1131 Owner=N/A MCS_label=N/A
Partitions=espresso,microcentury,longlunch,workday,testmatch,nextweek,nextmonth
BootTime=2019-05-02 15:12:56 SlurmdStartTime=2019-05-03 11:48:02
CfgTRES=cpu=40,mem=193336M
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
(...)
You can check the list of nodes assigned to each partition by running \sinfo. (The leading backslash bypasses any shell alias defined for sinfo.)
$ \sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
espresso* up 20:00 32 idle compute-0-[0-31]
NODELIST shows that the espresso partition has 32 nodes from compute-0-0 to compute-0-31.
The job scheduler automatically selects the compute nodes when a job is submitted with sbatch or srun. See the output of squeue to check the list of nodes allocated to the job. Alternatively, the user may select specific compute nodes by adding the -w (or --nodelist) option to the sbatch or srun command. For instance,
$ srun -p espresso -w 'compute-0-[1-2]' /bin/hostname
compute-0-2
compute-0-1
The above command selects the espresso partition and the compute-0-1 and compute-0-2 nodes. Similarly,
$ sbatch -p espresso -w 'compute-0-[1,3]' run.sh
Submitted batch job 1076
will submit run.sh to the espresso partition on the compute-0-1 and compute-0-3 nodes.
Interactive sessions
Another example is to jump into an interactive environment:
$ srun -p espresso -c 10 --pty /bin/bash
[cbpark@compute-0-0 ~]$
With this command, ten CPU cores are allocated (-c 10) in the espresso partition. To be assigned particular nodes, add the -w option.
$ srun -p espresso -c 10 -w 'compute-0-4' --pty /bin/bash
[cbpark@compute-0-4 ~]$ hostname
compute-0-4
Submitting multiple jobs in one script
Suppose that we want to submit multiple jobs to the job scheduler using one script. The useful option for that is --wrap of sbatch.
#! /bin/bash -l
#
#SBATCH -J multiple_jobs
#SBATCH -o %x-%j.log
#
#SBATCH -p longlunch
for i in $(seq 1 0.1 10); do
echo $i
sbatch -J multiple_jobs_$i -p microcentury -o %x-%j.log \
--wrap="echo $i; /bin/hostname; sleep 30; echo 'Job finished!'"
sleep 5 # pause 5 seconds between submissions
done
The important part is the sbatch --wrap command in the loop. (seq 1 0.1 10 generates a sequence of numbers from 1 to 10 in steps of 0.1.) It is recommended to pause for a few seconds between job submissions to give the job scheduler time to set up, run, and tear down the scheduled jobs. If the output of each job is not needed, set -o /dev/null instead of -o %x-%j.log. The --wrap option is particularly useful when the job command is simple enough. Otherwise, it is better to generate multiple job scripts.
After the above script is submitted to the job scheduler, the squeue command shows
$ squeue
178350 longlunch multiple cbpark RUNNING 1:31 3:00:00 1 compute-0-21
178366 microcent multiple cbpark RUNNING 0:57 1:00:00 1 compute-0-23
178369 microcent multiple cbpark RUNNING 0:45 1:00:00 1 compute-0-23
178373 microcent multiple cbpark RUNNING 0:14 1:00:00 1 compute-0-21
178374 microcent multiple cbpark RUNNING 0:10 1:00:00 1 compute-0-23
178375 microcent multiple cbpark RUNNING 0:07 1:00:00 1 compute-0-23
178376 microcent multiple cbpark RUNNING 0:07 1:00:00 1 compute-0-23
178377 microcent multiple cbpark RUNNING 0:06 1:00:00 1 compute-0-23
178378 microcent multiple cbpark RUNNING 0:05 1:00:00 1 compute-0-23
178379 microcent multiple cbpark RUNNING 0:02 1:00:00 1 compute-0-23
178380 microcent multiple cbpark RUNNING 0:02 1:00:00 1 compute-0-23
178381 microcent multiple cbpark RUNNING 0:01 1:00:00 1 compute-0-23
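When the job command is too long for --wrap, one job script per parameter value can be generated instead. The following is a minimal sketch of that approach. The parameter values, the temporary output directory, and the job_<value>.sh file names are hypothetical, and the job body mirrors the --wrap example. Submission is left as a comment so that the sketch only writes files.

```shell
# Generate one job script per parameter value instead of using --wrap.
outdir=$(mktemp -d)            # hypothetical output directory
for i in 1 2 3; do             # hypothetical parameter values
    script="$outdir/job_$i.sh"
    # The here-document is unquoted, so $i is expanded while the file is written.
    cat > "$script" <<EOF
#! /bin/bash -l
#SBATCH -J multiple_jobs_$i
#SBATCH -o %x-%j.log
#SBATCH -p microcentury

echo $i
/bin/hostname
sleep 30
echo 'Job finished!'
EOF
    echo "generated $script"
    # On the cluster, submit each script and pause between submissions:
    #   sbatch "$script"; sleep 5
done
```

Each generated file is an ordinary job script, so it can be inspected, edited, and resubmitted on its own.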
Excluding certain nodes
You may want to exclude some nodes from the resources granted to the job because, for instance, they are partially occupied by other users. In this case, add the --exclude option to the sbatch or srun commands or to the script. For example,
sbatch --exclude='compute-0-[0-5,10]' myscript.sh
The above command will exclude the compute nodes from compute-0-0 to compute-0-5 and compute-0-10 for the job. It can be added to the job script as follows:
#! /bin/bash -l
#SBATCH --exclude=compute-0-[0-5,10]
(...)
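Node list expressions such as compute-0-[0-5,10] are easy to get wrong, so it can help to expand one before passing it to -w or --exclude. On the cluster, scontrol show hostnames 'compute-0-[0-5,10]' prints the expanded list. The sketch below is a plain-shell approximation that handles only the simple prefix[ranges] form used in this guide. The function name expand_hostlist is hypothetical.

```shell
# Hypothetical helper: expand PREFIX[a-b,c,...] into one host name per line.
# Only the simple bracket form used in this guide is handled.
expand_hostlist() {
    printf '%s\n' "$1" | awk '{
        lb = index($0, "["); rb = index($0, "]")
        if (lb == 0) { print; next }            # no brackets: a single host
        prefix = substr($0, 1, lb - 1)
        n = split(substr($0, lb + 1, rb - lb - 1), parts, ",")
        for (i = 1; i <= n; i++) {
            if (split(parts[i], r, "-") == 2)   # a range such as 0-5
                for (k = r[1] + 0; k <= r[2] + 0; k++) print prefix k
            else                                # a single index such as 10
                print prefix parts[i]
        }
    }'
}

# Quote the expression so that the shell does not treat [ ] as a glob pattern.
expand_hostlist 'compute-0-[0-5,10]'
```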
OpenMP
OpenMP supports shared-memory parallel programming through multithreading: the threads run concurrently, and the runtime environment allocates them to different processors. (Do not confuse it with Open MPI.) Make sure that all the CPU cores you request are on the same node. Below is an example script using OpenMP.
#! /bin/bash -l
#
#SBATCH --job-name=omp_test
#SBATCH --output=omp_test.log
#
#SBATCH --partition=espresso
#SBATCH --nodes=1 # number of nodes
#SBATCH --cpus-per-task=8 # number of threads
#SBATCH --mem-per-cpu=100 # memory per cpu
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun -c $SLURM_CPUS_PER_TASK ./MYPROGRAM
Here MYPROGRAM is the executable you will run on the compute node. The most important parts are --nodes=1 and --cpus-per-task=8. The latter tells the job scheduler how many threads you intend to run with. Unless the number of nodes is 1, the job could be distributed over many nodes, leading to poor performance.
Furthermore, we recommend setting the number of threads to 40 or fewer because most compute nodes have only 40 CPUs. If the number exceeds 40, the job will be pending until a compute node with more CPUs becomes available.
$ squeue
1862756 longlunch toolarge cbpark PENDING 0:00 3:00:00 1 (Priority)
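Slurm sets SLURM_CPUS_PER_TASK only when --cpus-per-task is given, so a script that is run without that option would export an empty OMP_NUM_THREADS. A defensive variant, sketched below with the hypothetical helper name omp_threads, falls back to a single thread in that case.

```shell
# Hypothetical helper: number of OpenMP threads to use.
# Falls back to 1 when SLURM_CPUS_PER_TASK is unset or empty.
omp_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

export OMP_NUM_THREADS="$(omp_threads)"
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

In the job script above, this export would replace the plain export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK line.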
MPI
MPI stands for Message Passing Interface, a standard for communication between processes on parallel computers. Here is an example script for a job with 10 tasks using Open MPI.
#!/bin/bash -l
#
#SBATCH --job-name=mpi_test
#SBATCH --output=mpi_test.log
#
#SBATCH --partition=espresso
#SBATCH --ntasks=10
module load gnu7 openmpi
mpirun ./MYPROGRAM
Note that the -np flag is not required because mpirun automatically figures out the configuration from the Slurm environment variables. If a segmentation fault occurs, raising the stack size limit with ulimit might solve it. For example,
(...)
ulimit -s unlimited
mpirun ./MYPROGRAM
Here are some useful guides for using MPI under Slurm:
Deadly commands
Figure: xkcd, Admin Mourning
nohup
You might have heard of the nohup command and might even love to use it. It makes a command ignore hangup signals, which is useful, for instance, when you want the command to keep running after you close the terminal. It is harmless on your standalone machine. However, every task running in the cluster should be known to the job scheduler, since the job scheduler allocates resources and closes tasks gracefully. The nohup command sometimes deceives the job scheduler and keeps processes running in the background even after the submitted job has completed. The node can eventually freeze because resources such as memory and disk space fill up. DO NOT use nohup in the cluster.
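A job submitted with sbatch already keeps running after you log out, so nohup is not needed for that purpose. As a precaution, a job script can be scanned for nohup before submission. The check below is a minimal sketch: the helper name uses_nohup is hypothetical, comment lines are ignored, and the temporary file stands in for a real job script.

```shell
# Hypothetical helper: succeed if the script uses nohup outside comment lines.
uses_nohup() {
    grep -v '^[[:space:]]*#' "$1" |
        grep -Eq '(^|[^[:alnum:]_])nohup([^[:alnum:]_]|$)'
}

# A throwaway script standing in for a real job script.
tmp=$(mktemp)
printf '%s\n' '#! /bin/bash -l' '#SBATCH --partition=espresso' \
    'nohup ./MYPROGRAM &' > "$tmp"
if uses_nohup "$tmp"; then
    echo "nohup found: remove it and let Slurm manage the process"
fi
rm -f "$tmp"
```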

