Submitting a job
For those familiar with GridEngine, the Slurm documentation provides a Rosetta Stone of schedulers to ease the transition.
Slurm commands
Slurm allows requesting resources and submitting jobs in a variety of ways. The main Slurm commands to submit jobs are:
- srun
Requests resources and runs a command on the allocated compute node(s).
Blocking: will not return until the command ends.
- sbatch
Requests resources and runs a script on the allocated compute node(s).
Asynchronous: returns as soon as the job is submitted.
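For illustration, the two commands can be used as follows (hostname is just a placeholder command, and myfirstjob.sh is the example script introduced below):
$ srun hostname          # blocks until hostname has run on the allocated node
$ sbatch myfirstjob.sh   # returns immediately with a job ID; output goes to a file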
Slurm Basics
Job
A Job is an allocation of resources (CPUs, RAM, time, etc.) reserved for the execution of a specific process:
- The allocation is defined in the submission script as the number of Tasks (--ntasks) multiplied by the number of CPUs per Task (--cpus-per-task), and corresponds to the maximum resources that can be used in parallel (see the example below),
- The submission script, via sbatch, creates one or more Job Steps and manages the distribution of Tasks on Compute Nodes.
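For example, the following directives (arbitrary numbers, shown only as an illustration) request 4 Tasks with 2 CPUs each, i.e. an allocation of 8 CPUs usable in parallel:
#SBATCH --ntasks=4          # 4 Tasks
#SBATCH --cpus-per-task=2   # 2 CPUs per Task
# total allocation: 4 x 2 = 8 CPUs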
Tasks
A Task is a process to which are allocated the resources defined in the script via the --cpus-per-task, --mem and --mem-per-cpu options. A Task can use these resources like any other process (creating threads, or sub-processes that may themselves be multi-threaded).
This is the Job's resource allocation unit. CPUs not used by a Task are lost: they are not usable by any other Task or Step. If a Task creates more processes/threads than it has allocated CPUs, these threads will share the allocation.
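For instance, a multi-threaded Task usually matches its number of threads to its CPU allocation. A minimal sketch for an OpenMP program (my_openmp_program is a hypothetical binary, to be replaced by your own):
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# spawn exactly as many threads as CPUs allocated to the Task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
./my_openmp_program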
Job Steps
A Job Step represents a stage, or section, of the processing performed by the Job. It executes one or more Tasks. This division into Job Steps offers great flexibility in organizing the processing within the Job, and in managing and analyzing the allocated resources:
- Steps can be executed sequentially or in parallel,
- one Step can launch one or more Tasks, executed sequentially or in parallel,
- Steps are tracked by the sstat/sacct commands, allowing both Step-by-Step progress tracking of a Job during its execution, and detailed resource usage statistics for each Step (during and after execution).
Using srun for a single task, inside a submission script, is not mandatory.
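As a sketch of how Steps can be organized (pre_process.sh, compute_a.sh and compute_b.sh are hypothetical scripts), a Job can chain sequential Steps and run Steps in parallel:
#!/bin/bash
#SBATCH --job-name=steps_demo
#SBATCH --ntasks=2

# Step 1: one Task, run sequentially
srun --ntasks=1 ./pre_process.sh

# Steps 2 and 3: one Task each, run in parallel within the allocation
srun --ntasks=1 ./compute_a.sh &
srun --ntasks=1 ./compute_b.sh &
wait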
Partition
A Partition is a logical grouping of Compute Nodes. This grouping makes it possible to specialize and optimize each partition for a particular type of job.
See Computing resources and Clusters/Partitions overview for more details.
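You can list the partitions available on the cluster, with their state and time limits, using Slurm's sinfo command:
$ sinfo --summarize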
Job script
To run a job on the system, you need to create a submission script (also called a job script, or batch script). This script is a regular shell script (bash) with some directives specifying the number of CPUs, memory, etc., that will be interpreted by the scheduling system upon submission.
A very simple example:
#!/bin/bash
#
#SBATCH --job-name=test
hostname -s
sleep 60s
Writing submission scripts can be tricky; see Batch scripts for more details, and also our repository of example scripts.
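As a slightly more complete sketch (the partition, resources and time limit below are placeholders to adapt to your needs and to the cluster's partitions):
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --partition=Lake        # placeholder: choose a partition fitting your job
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=0-01:00:00       # D-HH:MM:SS

hostname -s
sleep 60s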
First job
Submit your job script with:
$ sbatch myfirstjob.sh
Submitted batch job 623
Slurm will return a $JOBID if the job is accepted, or an error message otherwise. Without any output options, the job's output defaults to slurm-$JOBID.out (slurm-623.out with the above example), in the submission directory.
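If you prefer another file name, you can set it in the script; for example, the %x and %j filename patterns expand to the job name and the job ID:
#SBATCH --output=%x-%j.out   # e.g. test-623.out
#SBATCH --error=%x-%j.err    # optional: send stderr to a separate file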
Once submitted, the job enters the queue in the PENDING (PD) state. When resources become available and the job has sufficient priority, an allocation is created for it and it moves to the RUNNING (R) state. If the job completes correctly, it goes to the COMPLETED state, otherwise, its state is set to FAILED.
Tip
You can submit jobs from any login node to any partition. Login nodes are only segregated for build (CPU µarch) and scratch access.
Monitor your jobs
You can monitor your job using either its name (#SBATCH --job-name) or its $JOBID with Slurm's squeue [1] command:
$ squeue -j 623
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
623 E5 test ltaulell R 0:04 1 c82gluster2
By default, squeue shows every pending and running job. You can filter for your own jobs using the -u $USER or --me option:
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
623 E5 test ltaulell R 0:04 1 c82gluster2
If needed, you can modify the output of squeue [1]. Here's an example (adding CPUs to the default output):
$ squeue --me --format="%.7i %.9P %.8j %.8u %.2t %.10M %.6D %.4C %N"
JOBID PARTITION NAME USER ST TIME NODES CPUS NODELIST
38956 Lake test ltaulell R 0:41 1 1 c6420node172
Useful bash aliases:
alias pending='squeue --me --states=PENDING --sort=S,Q --format="%.10i %.12P %.8j %.8u %.6D %.4C %.20R %Q %.19S"'  # my pending jobs
alias running='squeue --me --states=RUNNING --format="%.10i %.12P %.8j %.8u %.2t %.10M %.6D %.4C %R %.19e"'  # my running jobs
Analyzing currently running jobs
The sstat [3] command allows users to easily pull up status information about their currently running jobs. This includes information about CPU usage, task information, node information, resident set size (RSS), and virtual memory (VM). You can invoke the sstat command as follows:
$ sstat --jobs=$JOB_ID
By default, sstat will pull up significantly more information than what is needed in the command's default output. To remedy this, you can use the --format flag to choose what you want in your output. See the format flag in man sstat or sstat --helpformat.
Some relevant variables are listed in the table below:
Variable | Description
---|---
jobid | The id of the Job.
avecpu | Average CPU time of all tasks in job.
averss | Average resident set size of all tasks in job.
avevmsize | Average virtual memory size of all tasks in job.
maxrss | Maximum resident set size of all tasks in job.
maxvmsize | Maximum virtual memory size of all tasks in job.
MaxVMSizeNode | The node on which the maxvmsize occurred.
ntasks | Number of tasks in a job.
For example, let's print out a job's id, average CPU time, maximum RSS, and number of tasks:
sstat --jobs=$JOB_ID --format=jobid,avecpu,maxrss,ntasks
You can obtain more detailed information about a job using Slurm's scontrol [2] command. This can be very useful for troubleshooting.
$ scontrol show jobid $JOB_ID
$ scontrol show jobid 38956
JobId=38956 JobName=test
UserId=ltaulell(*****) GroupId=psmn(*****) MCS_label=N/A
Priority=8628 Nice=0 Account=staff QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:08 TimeLimit=8-00:00:00 TimeMin=N/A
SubmitTime=2022-07-08T12:00:20 EligibleTime=2022-07-08T12:00:20
AccrueTime=2022-07-08T12:00:20
StartTime=2022-07-08T12:00:22 EndTime=2022-07-16T12:00:22 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-07-08T12:00:22
Partition=Lake AllocNode:Sid=x5570comp2:446203
ReqNodeList=(null) ExcNodeList=(null)
NodeList=c6420node172
BatchHost=c6420node172
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=385582M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=385582M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/ltaulell/tests/env.sh
WorkDir=/home/ltaulell/tests
StdErr=/home/ltaulell/tests/slurm-38956.out
StdIn=/dev/null
StdOut=/home/ltaulell/tests/slurm-38956.out
Power=
NtasksPerTRES:0
Kill a job
For various reasons, you might want to cancel a pending or running job:
scancel $JOB_ID
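A few common variants (the job ID and name below refer to the earlier example):
$ scancel 623                            # cancel a specific job by its $JOBID
$ scancel --name=test                    # cancel jobs by name
$ scancel --user=$USER                   # cancel all of your jobs
$ scancel --state=PENDING --user=$USER   # cancel only your pending jobs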