Managed SLURM service

1. Introduction to SLURM

SLURM (Simple Linux Utility for Resource Management) is an open-source job scheduler designed for Linux high-performance computing clusters. It efficiently allocates compute resources to user jobs, provides tools for job submission and monitoring, and manages job queues based on configured policies.

With a controller-worker architecture, SLURM handles everything from single-node interactive sessions to complex multi-node parallel jobs. Its command-line tools (sbatch, srun, squeue) offer a straightforward interface for users, while its modular design allows administrators to customize scheduling policies and resource management to meet specific needs.
SLURM's scalability, reliability, and extensive feature set have made it the preferred workload manager in many academic, research, and commercial HPC environments worldwide.

2. Getting Started

Step 1: GreenNode AI Platform provisions the SLURM cluster and sends the login information to the customer. The following information is required for cluster provisioning:
  1. Type and number of GPUs. CPU and RAM are provisioned according to the corresponding instance type
  2. Size of the Network Volume. This volume is shared across all nodes in the SLURM cluster
  3. SSH public key
After provisioning, GreenNode will send you the IP address and port for logging in to the head node.

Step 2: Verify basic SLURM cluster functions
  1. Log in via SSH to the head node using the information provided by GreenNode AI Platform
  2. Check the SLURM node status:
     scontrol show node
The output tells you how many SLURM worker nodes are in your cluster and how many resources each node has (CPUs, RAM, and GRES for GPUs).
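As a rough illustration of what to look for, the sketch below parses a hypothetical `scontrol show node` excerpt to count worker nodes and total GPUs. The node names, CPU/memory figures, and GRES strings are made-up examples; your cluster's real output will differ.

```shell
# Hypothetical excerpt of `scontrol show node` output, for illustration only;
# real node names, resource counts, and Gres strings depend on your cluster.
sample_output='NodeName=worker-1 CPUs=80 RealMemory=512000 Gres=gpu:8 State=IDLE
NodeName=worker-2 CPUs=80 RealMemory=512000 Gres=gpu:8 State=IDLE'

# Count worker nodes: one NodeName= record per node
node_count=$(printf '%s\n' "$sample_output" | grep -c '^NodeName=')

# Sum GPUs across nodes from the Gres field (format gpu:N)
gpu_total=$(printf '%s\n' "$sample_output" \
  | grep -oE 'Gres=gpu:[0-9]+' | cut -d: -f2 \
  | awk '{s += $1} END {print s}')

echo "nodes: $node_count"
echo "gpus:  $gpu_total"
```

On a live cluster you would pipe the real `scontrol show node` output through the same filters instead of the sample variable.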

You have now finished creating a SLURM cluster on GreenNode AI Platform.

3. Submitting a sample job

Create a file named nccl_benchmark.sbatch with the content below. Change the --nodes value (and the other #SBATCH directives) to match your cluster:

#!/bin/bash
#SBATCH --job-name=nccl_benchmark
#SBATCH --nodes=2                       # Number of nodes
#SBATCH --ntasks-per-node=1             # Number of tasks per node (1 per GPU)
#SBATCH --gpus=2                        # Total number of GPUs for the job
#SBATCH --cpus-per-task=10              # CPU cores per task
#SBATCH --time=01:00:00                 # Time limit (HH:MM:SS)
#SBATCH --mem=8G                        # Memory per node
#SBATCH --partition=main                # Change based on your cluster
#SBATCH --output=nccl_benchmark_%j.out
#SBATCH --error=nccl_benchmark_%j.err

# Set NCCL environment variables for debugging (optional)
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=^lo,docker0

# Run the NCCL benchmark (all_reduce_perf from nccl-tests must be on
# PATH; otherwise give its full path)
srun --mpi=pmi2 \
     all_reduce_perf -b 8 -e 8G -f 2 -g 1 -N 100

Then submit the job with:
  sbatch nccl_benchmark.sbatch
You should get two output files for this SLURM job: nccl_benchmark_{job_id}.err and nccl_benchmark_{job_id}.out
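The file names come from the %j placeholder in the --output and --error directives, which SLURM replaces with the numeric job id at submission time. A minimal sketch of that expansion, using a hypothetical job id:

```shell
# SLURM expands %j in the --output/--error patterns to the numeric job id.
# 1234 is a hypothetical id; sbatch prints the real one when you submit.
job_id=1234
out_file="nccl_benchmark_${job_id}.out"
err_file="nccl_benchmark_${job_id}.err"
echo "$out_file"
echo "$err_file"
```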


When inspecting nccl_benchmark_1.out, you should see at the start of the file the information about the nodes and devices used for this job.


At the end of the file, you can see the benchmark results.


The results indicate that no data corruption occurred.
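The corruption check can be read straight from the benchmark output: nccl-tests prints a summary line reporting the number of out-of-bounds values, and zero means no corruption was detected. The sketch below greps a hypothetical tail of the .out file; the exact wording and numbers are illustrative, so adjust the pattern if your nccl-tests version formats the line differently.

```shell
# Hypothetical tail of nccl_benchmark_{job_id}.out, for illustration only.
sample_tail='# Out of bounds values : 0 OK
# Avg bus bandwidth    : 42.13'

# Extract the out-of-bounds count from the summary line
oob=$(printf '%s\n' "$sample_tail" \
  | grep 'Out of bounds values' | grep -oE '[0-9]+' | head -1)

if [ "$oob" -eq 0 ]; then
  echo "no data corruption detected"
else
  echo "WARNING: $oob out-of-bounds values"
fi
```

On a live cluster you would run the same filters against the real output file, e.g. `grep 'Out of bounds values' nccl_benchmark_1.out`.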