Introduction to Slurm🔗

Slurm (Simple Linux Utility for Resource Management) is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

It is used by many of the world's supercomputers and is the default scheduler on the H4H cluster.

Note

Slurm is a complex system with many features. This tutorial covers the basics of using Slurm to submit jobs to a cluster. For more advanced features, please refer to the official Slurm documentation.

Core Cluster Concepts🔗

Slurm manages resources in a cluster using three core concepts: Partitions, Nodes, and Jobs.

(Figure: diagram of Slurm partitions, nodes, and jobs)

Nodes🔗

Nodes are the physical servers on which jobs run. Think of each node as a single computer with a fixed set of computational resources.

A Node is characterized by the following basic properties:

  • Hostname: The name of the node.
  • State: The current state of the node (e.g., idle, allocated, down).
  • CPUs: The number of CPUs available on the node.
  • Memory: The amount of memory available on the node.
  • GPUs: The number of GPUs available on the node.

For each node, there may be one or more jobs running, based on the available compute resources and the size of each job allocation.
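On a cluster login node, you can inspect these node properties with `sinfo` and `scontrol`. The node name `node001` below is a placeholder, not necessarily a real H4H hostname:

```shell
# List every node with its state, CPU count, and memory (long, node-oriented view)
sinfo -N -l

# Show the full property list (CPUs, memory, GPUs/Gres, State) for one node;
# replace "node001" with a hostname taken from the sinfo output above
scontrol show node node001
```

These commands require access to a running Slurm cluster, so the exact output will depend on your site's configuration.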

Partitions🔗

Partitions are logical groupings of nodes (physical servers) based on some shared characteristic or property. For example, a partition may group nodes with large amounts of memory, nodes with GPUs, or nodes reserved for long-running jobs.


A Partition is characterized by the following basic properties:

  • Name: The name of the partition.
  • Nodes: The nodes that belong to the partition.
  • State: The current state of the partition.
  • Max time: The maximum time a job can run in the partition.

Partition States:

  • Idle: No jobs are running in the partition.
  • Allocated: All nodes in the partition are allocated to jobs.
  • Down: All nodes in the partition are down.
  • Mix: Some nodes in the partition are idle and some are allocated.
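The partition properties and states above can be viewed with `sinfo`. The partition name `gpu` below is a placeholder; substitute a partition name that exists on your cluster:

```shell
# Summarize each partition: name, availability, time limit, node count, state, node list
sinfo

# Restrict the summary to a single partition; "gpu" is a placeholder name
sinfo -p gpu
```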

Jobs🔗

Jobs are allocations of specific resources to a user for a pre-specified block of time.

Jobs are the primary way you will interact with the H4H cluster. Jobs can be further subdivided into job steps (or tasks) which can potentially run in parallel within a given job.

A Job is characterized by the following basic properties:

  • Job ID: A unique identifier for the job.
  • State: The current state of the job (e.g., pending, running, completed).
  • Partition: The partition in which the job is running.
  • Nodes: The nodes on which the job is running.
  • Resources: The resources requested by the job (e.g., CPUs, memory, GPUs).
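You can view these job properties with `squeue` and `scontrol`. The job ID `12345` below is a placeholder:

```shell
# List your own jobs with job ID, partition, state, and assigned nodes
squeue -u $USER

# Show every recorded property of one job; replace 12345 with a real job ID
scontrol show job 12345
```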

Slurm Concept Analogy🔗

To better understand the core cluster concepts in Slurm, let's use the analogy of an office workspace: Partitions are departments, Nodes are individual employees, and Jobs are projects or tasks assigned to employees.

Partitions🔗

  • Partitions are like departments in an office building (e.g., Marketing, Finance, IT).
  • Each department groups employees (nodes) based on roles or functions.
  • In a cluster, partitions group nodes based on characteristics (e.g., high memory, GPU access).
  • Just as an employee might work across multiple departments, a node can belong to multiple partitions.

Nodes🔗

  • Nodes are like individual employees in the office.
  • Each employee (node) has tools and skills (a computer, a phone, expertise) to perform tasks, just as a node has resources such as CPUs, memory, and GPUs.
  • Nodes in a cluster are physical servers running jobs.
  • Multiple jobs can run on a single node simultaneously, similar to an employee multitasking.

Jobs🔗

  • Jobs are like projects or tasks assigned to employees.
  • Each project (job) requires specific things to get done (computers, skills, software), just as a job requires specific resources (CPUs, memory, GPUs).
  • When tasks are delegated to an employee, the manager (Slurm Scheduler) determines which employees (nodes) are able to handle the task based on specific requirements.
  • Similarly, when submitting a job to the cluster, you ask for resources and the scheduler assigns the job to nodes that meet those requirements.
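Translating the analogy back into practice: a job is submitted as a batch script whose `#SBATCH` directives declare the resources it needs, and the scheduler places it on nodes that satisfy them. This is a minimal sketch; the partition name, resource amounts, and time limit below are placeholder values, not H4H defaults:

```shell
#!/bin/bash
#SBATCH --job-name=demo        # name shown in squeue
#SBATCH --partition=gpu        # placeholder partition name
#SBATCH --nodes=1              # run on a single node
#SBATCH --cpus-per-task=4      # CPUs requested per task
#SBATCH --mem=8G               # memory requested for the job
#SBATCH --gres=gpu:1           # request one GPU
#SBATCH --time=01:00:00        # maximum run time (HH:MM:SS)

# Each srun below launches a job step within the job's allocation
srun hostname
```

Submit the script with `sbatch job.sh`; Slurm responds with the assigned job ID, which you can then monitor with `squeue`.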