Basic Slurm Commands🔗
Slurm commands are basic bash utilities, so you can view more in-depth documentation at any time from the terminal:
- `man <command>` (e.g., `man sbatch`) to open the full documentation with `less`
- OR `<command> --help` (e.g., `sbatch --help`) to print a brief overview of common options
For a comprehensive tutorial on Slurm commands, visit the Slurm Documentation.
Note
It is assumed that you have a good understanding of the core cluster concepts in Slurm. If you are unfamiliar with these concepts, please refer to the Introduction to Slurm.
Table of Contents🔗
- Viewing Cluster State
  - `sinfo`
- Requesting Compute Resources
  - `srun`
  - `salloc`
  - `sbatch`
- Monitoring Jobs in the Queue
  - `squeue`
  - `scancel`
  - `sattach`
- Viewing Job History
  - `sacct`
  - `sstat`
  - `seff`
- Scheduling Regular Jobs
  - `scrontab`
- Configuring Command Defaults
  - `slurm.conf`
Viewing Cluster State🔗
sinfo🔗
This command provides a summary of the current cluster state, including the
status of the various Partitions and Nodes. The STATE column indicates the current
state of the resource for that row. The `--format` option can be used to get
the full range of available information about each Partition or Node, including
things like the amount of CPU/Memory/GPU available and currently in use.
Try it yourself:
- View the manual page for `sinfo`
- List the names of the unique available partitions
- Retrieve the partition name, max number of CPUs, max job length and total memory available for the superhimem partition (HINT: use `--format` or `--Format`)
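One possible way to approach these exercises is sketched below (the `superhimem` partition name comes from the exercise; the format fields are just one reasonable choice):

```bash
# 1. Open the manual page for sinfo
man sinfo

# 2. List the names of the unique available partitions (-h suppresses the header)
sinfo -h --format="%P" | sort -u

# 3. Partition name, CPUs per node, maximum job length, and memory for superhimem
sinfo --partition=superhimem --Format="PartitionName,CPUs,Time,Memory"
```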
Requesting Compute Resources🔗
srun🔗
This command simply runs the subsequent shell command on the cluster, making any necessary allocations based on the options provided. The allocation lasts as long as the command takes to run and is blocking (i.e., you can't run any other commands until it returns).
While this can be useful for housekeeping tasks that require more resources
than are available on the Login or Data node, it is generally a better practice
to write reusable Slurm scripts and deploy them via `salloc` or `sbatch`.
Try it yourself:
- Print a summary of the `srun` command options
- Submit a job via `srun` to list files in your home directory
- Resubmit this job, but this time request two CPUs and 100 MB of RAM
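A possible set of answers to these exercises (the resource values follow the exercise; adjust them for real workloads):

```bash
# 1. Print a brief summary of srun options
srun --help

# 2. Submit a job that lists the files in your home directory
srun ls -l ~

# 3. The same job, this time requesting two CPUs and 100 MB of memory
srun --cpus-per-task=2 --mem=100M ls -l ~
```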
salloc🔗
When developing a script to run on the cluster it is often useful to get access to a remote virtual machine to experiment running your code. This is essential for debugging long running jobs, but can also be used for relatively short (a few hours) interactive jobs analyzing or visualizing data on the cluster.
Running the `salloc` command will allocate a job for you, providing
command line access to the virtual machine associated with that job.
Various levels of resources can be requested; see `man salloc` for more details.
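As a sketch, a minimal interactive request might look like the following (the resource values are illustrative only, and the exact behaviour after the allocation is granted depends on site configuration):

```bash
# Request an interactive allocation: 1 node, 4 CPUs, 8 GB of memory, for 2 hours
salloc --nodes=1 --cpus-per-task=4 --mem=8G --time=02:00:00

# If salloc leaves you in a shell on the login node, launch commands on the
# allocated node with srun, for example an interactive shell:
srun --pty bash

# Type exit to release the allocation when you are done
```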
Try it yourself:
Please see Submitting a Slurm Job for an example of running interactive jobs via `salloc`.
sbatch🔗
Once you have reasonable confidence that your analysis script is working as expected, you can submit long running jobs in batch mode, allowing you to queue all the tasks associated with your job to run when the resources become available on the cluster. This is the primary way you should be interacting with the compute cluster.
Generally, you will use the `sbatch` command with a bash script which executes
one or more additional scripts that comprise your job. You can use this script to
launch analyses in a variety of languages such as R, Python, or Julia. For
more information on the precompiled software available on the cluster via the
`module` system, please see the documentation for the module system.
To make it easy for you to get started with batch jobs on H4H, we have included
the `templates/hello_slurm.sh` script to use as a template for more complex
jobs. This file includes some of the basic Slurm header directives which allow
you to specify the resources, inputs and outputs for your job without needing
to pass additional command line options via `sbatch`.
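As an illustration only (this is not the contents of `templates/hello_slurm.sh`, and all directive values are placeholders), a minimal batch script might look like:

```bash
#!/bin/bash
#SBATCH --job-name=hello_slurm     # name shown in squeue/sacct
#SBATCH --cpus-per-task=1          # CPUs requested for the job
#SBATCH --mem=1G                   # total memory requested
#SBATCH --time=00:10:00            # wall-clock limit (HH:MM:SS)
#SBATCH --output=%x_%j.out         # log file named <job-name>_<job-id>.out

echo "Hello from $(hostname)"
```

You would then submit it with `sbatch hello_slurm.sh`, and the job runs when resources become available.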
Try it yourself:
Please see Submitting a Slurm Job for an example of running batch jobs via `sbatch`.
Monitoring Jobs in the Queue🔗
squeue🔗
This command is used to view the current job queue on the cluster. By default it only shows jobs you have submitted. Additional options are available to display all jobs on the cluster, format the output of the command to include more or less information, and much more.
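For example (the format string is just one possibility; see `man squeue` for the full list of fields):

```bash
# Show only your own jobs (the -u filter is explicit in case site defaults differ)
squeue -u $USER

# Customize the output: job id, name, state, elapsed time, and reason/node list
squeue -u $USER --format="%.10i %.20j %.8T %.10M %.20R"
```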
Try it yourself:
Please see Submitting a Slurm Job for an example of using `squeue` to view submitted jobs.
scancel🔗
Once you have submitted a batch job to the queue, you may want to cancel it if
you notice an issue with your job script. Basic usage of `scancel` is to pass it
the job id directly (e.g., `scancel <job_id>`); you can also select jobs by
name with `--name`/`-n`. This information can easily be retrieved via `squeue`.
Additional options to cancel jobs based on their current state, partition, node, etc. are available for this command. Review the documentation for more information.
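For example (the job id and job name below are placeholders):

```bash
# Cancel a single job by its numeric job id (first column of squeue)
scancel 1234567

# Cancel all of your jobs that share a given job name
scancel --name=hello_slurm
```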
Try it yourself:
Please see Submitting a Slurm Job for an example of using `scancel` to cancel a batch job.
sattach🔗
This command allows you to attach to a currently running batch job or job step
to monitor its progress. The most basic usage is `sattach <job_id>.<step_id>`,
where the `job_id` is the numeric identifier (first column) in `squeue` and
`step_id` is the job step (the numeric value after the dot). To attach to a job
without steps, just use a step id of 0.
In general, it is a better idea to set up logging for your job by specifying a log file in your Slurm header (see Submitting a Slurm Job for details). However, this command can be useful for interactive debugging of a submitted job or to check on the progress of your job without writing excessive amounts of logging information to disk.
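For example (the job id is a placeholder):

```bash
# Attach to step 0 of job 1234567 to watch its standard output and error
sattach 1234567.0
```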
Try it yourself:
Please see Submitting a Slurm Job for an example of attaching to a running job via `sattach`.
Viewing Job History🔗
sacct🔗
This command allows you to review various aspects of previously submitted jobs on the cluster and can be used to help diagnose why a job failed, determine the resources it used, access additional information like the total run time, and much more.
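For example (the field list is just one possibility; run `sacct --helpformat` to see all available fields, and the job id is a placeholder):

```bash
# Summarize your recent jobs
sacct --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,ExitCode

# Inspect a single job by id
sacct -j 1234567 --format=JobID,State,Elapsed,MaxRSS,ExitCode
```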
Try it yourself:
Please see Submitting a Slurm Job for an example of reviewing job history with `sacct`.
sstat🔗
This command is similar to `sacct` but displays resource utilization for a
currently running job. This can be useful to determine if your job is close to
exceeding its resource allocation or to monitor resource usage in real time.
Note, however, that it takes time to gather these statistics, and for short
running jobs (e.g., < 1 hr) it is unlikely you will see any statistics.
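For example (placeholder job id; the field list is one possibility):

```bash
# Resource usage for the steps of a running job (steps appear as <jobid>.<stepid>)
sstat --jobs=1234567 --format=JobID,AveCPU,MaxRSS,MaxDiskRead,MaxDiskWrite
```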
Try it yourself:
Please see Submitting a Slurm Job for an example of viewing current job status with `sstat`.
seff🔗
This is not technically a Slurm command, but it is a useful Perl script included with recent Slurm releases that makes it easy to determine how efficiently your job used the resources you requested. This is very useful for scaling your jobs down to the minimum resources they need, which in turn gives them the highest possible queue priority.
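For example (placeholder job id):

```bash
# Report CPU and memory efficiency for a completed job
seff 1234567
```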
Try it yourself:
Please see Submitting a Slurm Job for an example of summarizing job efficiency with `seff`.
Scheduling Regular Jobs🔗
scrontab🔗
The Slurm scheduler has its own implementation of the Unix job scheduling system CRON. As such, you can schedule jobs to run at some predefined interval. This is useful for housekeeping tasks on the cluster, such as auditing disk usage or automating data archiving. While probably not necessary for regular users, this command is essential for Data Managers/Coordinators to automate otherwise tedious tasks.
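As a sketch (the schedule and script path are purely illustrative):

```bash
# Open your Slurm crontab for editing; entries use standard cron syntax,
# and #SCRON lines pass sbatch-style options to the scheduled job
scrontab -e

# Example entry: run an archiving script every Sunday at 02:00
#SCRON --time=01:00:00
#SCRON --mem=1G
0 2 * * 0 ~/scripts/archive_data.sh
```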
Configuring Command Defaults🔗
slurm.conf🔗
A `slurm.conf` file can be used to set environment variables that modify
the defaults for many of the commands discussed above. This file can be useful to
pre-specify your preferred list of fields or formatting for commands such
as `sacct`, `sstat`, `squeue`, etc. Please see the Slurm Scheduler Docs
for more information on customizing your default Slurm commands.
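As a sketch of the kind of defaults involved (the exact environment variables honoured depend on your Slurm version; check each command's man page):

```bash
# Pre-set the default output fields for squeue, sacct and sstat
export SQUEUE_FORMAT="%.10i %.20j %.8T %.10M %.20R"
export SACCT_FORMAT="JobID,JobName,State,Elapsed,MaxRSS"
export SSTAT_FORMAT="JobID,AveCPU,MaxRSS"
```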