Basic Slurm Commands🔗

Slurm commands behave like standard command-line utilities, so you can view more in-depth documentation for any of them directly from the terminal:

  • man <command> (e.g., man sbatch) to open the full documentation with less, OR
  • <command> --help (e.g., sbatch --help) to print a brief overview of common options

For a comprehensive tutorial on Slurm commands, visit the Slurm Documentation.

Note

It is assumed that you have a good understanding of the core cluster concepts in Slurm. If you are unfamiliar with these concepts, please refer to the Introduction to Slurm.

Viewing Cluster State🔗

sinfo🔗

This command provides a summary of the current cluster state, including the status of the various Partitions and Nodes. The STATE column indicates the current state of the resource for that row. The --format option can be used to get the full range of available information about each Partition or Node, including things like the amount of CPU/Memory/GPU available and currently in use.

Try it yourself:

  1. View the manual page for sinfo

    Solution
    man sinfo
    
  2. List the names of the unique available partitions

    Solution
    sinfo | awk '{print $1}' | uniq
    
  3. Retrieve the partition name, max number of CPUs, max job length and total memory available for the superhimem partition (HINT: Use --format or --Format)

    Solution
    sinfo -p superhimem --format="%P %c %l %m" | column -t
    

Requesting Compute Resources🔗

srun🔗

This command runs the given shell command on the cluster, making any necessary allocations based on the options provided. The allocation lasts as long as the command takes to run and is blocking (i.e., you can't run any other commands until it returns).

While this can be useful for housekeeping tasks that require more resources than are available on the Login or Data node, it is generally better practice to write reusable Slurm scripts and deploy them via salloc or sbatch.

Try it yourself:

  1. Print a summary of the srun command options

    Solution
    srun --help
    
  2. Submit a job via srun to list files in your home directory

    Solution
    srun -t 0:05:0 ls $HOME
    
  3. Resubmit this job, but this time request two CPUs and 100 MB of RAM

    Solution
    srun -t 0:05:0 -c 2 --mem=100M ls $HOME
    

salloc🔗

When developing a script to run on the cluster, it is often useful to get access to a remote virtual machine to experiment with running your code. This is essential for debugging long-running jobs, but can also be used for relatively short (a few hours) interactive jobs analyzing or visualizing data on the cluster.

Running the salloc command will allocate a job for you, providing command-line access to the virtual machine associated with that job. Various levels of resources can be requested; see man salloc for more details.
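
For example, the following might request a small two-hour interactive allocation (the resource amounts are placeholders; adjust them to your needs):

    salloc -c 2 --mem=4G -t 2:00:00

Once the allocation is granted you can work interactively within it; typing exit ends the session and releases the allocation.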

Try it yourself:

Please see Submitting a Slurm Job for an example of running interactive jobs via salloc.

sbatch🔗

Once you have reasonable confidence that your analysis script is working as expected, you can submit long-running jobs in batch mode, allowing you to queue all the tasks associated with your job to run when the resources become available on the cluster. This is the primary way you should be interacting with the compute cluster.

Generally, you will use the sbatch command with a bash script which executes one or more additional scripts that comprise your job. You can use this script to launch analyses in a variety of languages such as R, Python, or Julia. For more information on the precompiled software available on the cluster via the module system, please see the module system documentation.

To make it easy for you to get started with batch jobs on H4H, we have included the templates/hello_slurm.sh script to use as a template for more complex jobs. This file includes some of the basic Slurm header directives which allow you to specify the resources, inputs and outputs for your job without needing to pass additional command-line options via sbatch.
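
As a rough sketch (not a copy of templates/hello_slurm.sh), a minimal batch script might look like the following; the job name, resource amounts, and log file name are placeholders:

    #!/bin/bash
    #SBATCH --job-name=hello_slurm
    #SBATCH --time=0:10:00
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=1G
    #SBATCH --output=hello_slurm_%j.log

    echo "Running on $(hostname)"

The script would then be submitted with sbatch <script_name>.sh, and any #SBATCH directive can still be overridden by passing the equivalent option on the sbatch command line.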

Try it yourself:

Please see Submitting a Slurm Job for an example of running batch jobs via sbatch.

Monitoring Jobs in the Queue🔗

squeue🔗

This command is used to view the current job queue on the cluster. By default it only shows jobs you have submitted. Additional options are available to display all jobs on the cluster, format the output of the command to include more or less information, and much more.
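
For example, to list your own jobs explicitly, first with the default columns and then with a custom set of columns (the field codes are standard squeue format specifiers):

    squeue -u $USER
    squeue -u $USER --format="%.10i %.9P %.20j %.8T %.10M %.6D %R"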

Try it yourself:

Please see Submitting a Slurm Job for an example of using squeue to view submitted jobs.

scancel🔗

Once you have submitted a batch job to the queue, you may want to cancel it if you notice an issue with your job script. Basic usage of scancel requires you to supply your job id (passed as a command-line argument) or job name (-n/--name). This information can easily be retrieved via squeue.

Additional options to cancel jobs based on their current state, partition, node, etc. are available for this command. Review the documentation for more information.
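
For example (the job id and job name below are placeholders):

    scancel 123456           # cancel a job by its id
    scancel -n my_job_name   # cancel a job by its name
    scancel -u $USER         # cancel all of your queued and running jobs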

Try it yourself:

Please see Submitting a Slurm Job for an example of using scancel to cancel a batch job.

sattach🔗

This command allows you to attach to a currently running batch job or job step to monitor its progress. The most basic usage is sattach <job_id>.<step_id>, where the job_id is the numeric identifier (first column) in squeue and step_id is the job step (the numeric value after the dot). To attach to a job without steps, just use a step id of 0.

In general, it is a better idea to set up logging for your job by specifying a log file in your Slurm header (see Submitting a Slurm Job for details). However, this command can be useful for interactive debugging of a submitted job or to check on the progress of your job without writing excessive amounts of logging information to disk.
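
For example, assuming a job with id 123456 is currently running, the following streams the standard output of its first step to your terminal:

    sattach 123456.0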

Try it yourself:

Please see Submitting a Slurm Job for an example of attaching to a running job via sattach.

Viewing Job History🔗

sacct🔗

This command allows you to review various aspects of previously submitted jobs on the cluster and can be used to help diagnose why a job failed, determine the resources it used, access additional information like the total run time, and much more.
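
For example (the job id is a placeholder; all field names are standard sacct fields):

    sacct --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,ExitCode
    sacct -j 123456 --format=JobID,State,Elapsed,MaxRSS,ExitCode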

Try it yourself:

Please see Submitting a Slurm Job for an example of reviewing job history with sacct.

sstat🔗

This command is similar to sacct but displays resource utilization for a currently running job. This can be useful to determine if your job is close to exceeding its resource allocation or to monitor resource usage in real time. Note, however, that it takes time to gather these statistics, and for short-running jobs (e.g., < 1 hr) it is unlikely you will see any statistics.
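
For example, assuming a batch job with id 123456 is currently running (for batch jobs you typically need to point sstat at the batch step explicitly):

    sstat -j 123456.batch --format=JobID,AveCPU,AveRSS,MaxRSS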

Try it yourself:

Please see Submitting a Slurm Job for an example of viewing current job status with sstat.

seff🔗

This is not technically a Slurm command, but a useful Perl script included with recent Slurm releases that makes it easy to determine how efficiently your job used the resources you requested. This is very useful for scaling your jobs down to the minimum resources they need, and therefore getting the maximum possible queue priority.
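
For example, after a job with id 123456 (a placeholder) has completed:

    seff 123456

The output summarizes CPU and memory efficiency relative to what was requested.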

Try it yourself:

Please see Submitting a Slurm Job for an example of summarizing job efficiency with seff.

Scheduling Regular Jobs🔗

scrontab🔗

The Slurm scheduler has its own implementation of the Unix job scheduling system CRON. As such, you can schedule jobs to run at some predefined interval. This is useful for housekeeping tasks on the cluster such as auditing disk usage or automating data archiving. While probably not necessary for regular users, this command is essential for Data Managers/Coordinators to automate otherwise tedious tasks.
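
Entries are edited with scrontab -e and use standard cron syntax, optionally preceded by #SCRON lines that set Slurm options for the recurring job. For example (the script path is a placeholder):

    #SCRON --time=00:30:00
    #SCRON --mem=1G
    0 2 * * * $HOME/scripts/audit_disk_usage.sh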

Configuring Command Defaults🔗

slurm.conf🔗

A slurm.conf file can be used to set environment variables that modify the defaults for many of the commands discussed above. This file can be useful to pre-specify your preferred list of fields or formatting for commands such as sacct, sstat, squeue, etc. Please see the Slurm Scheduler Docs for more information on customizing your default Slurm commands.
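
As one common alternative (assuming your Slurm version honors these variables), similar defaults can be set by exporting environment variables in your shell profile; the formats shown below are just examples:

    # in ~/.bashrc or equivalent
    export SQUEUE_FORMAT="%.10i %.9P %.20j %.8T %.10M %R"
    export SACCT_FORMAT="JobID,JobName,State,Elapsed,MaxRSS"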