Managed SLURM service

1. Introduction to SLURM

SLURM (Simple Linux Utility for Resource Management) is an open-source job scheduler designed for Linux high-performance computing clusters. It efficiently allocates compute resources to user jobs, provides tools for job submission and monitoring, and manages job queues based on configured policies.

With a controller-worker architecture, SLURM handles everything from single-node interactive sessions to complex multi-node parallel jobs. Its command-line tools (sbatch, srun, squeue) offer a straightforward interface for users, while its modular design allows administrators to customize scheduling policies and resource management to meet specific needs.
SLURM's scalability, reliability, and extensive feature set have made it the preferred workload manager in many academic, research, and commercial HPC environments worldwide.
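The day-to-day workflow maps onto a handful of these commands. A minimal sketch of a typical session follows; the script name, partition name, and job ID are placeholders, not values from your cluster:

```shell
# Submit a batch script to the scheduler (prints the assigned job ID)
sbatch my_job.sbatch

# Launch an interactive shell on one node of the "main" partition
srun --partition=main --nodes=1 --pty bash

# List pending and running jobs for the current user
squeue -u $USER

# Cancel a job by its ID
scancel 12345
```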

2. Getting Started

Step 1: GreenNode AI Platform will provision the SLURM cluster and send the login information to the customer. The following information is required for cluster provisioning:
  1. Type and number of GPUs. CPU and RAM are provisioned according to the corresponding instance type
  2. Size of the Network Volume. This volume is shared across all nodes in the SLURM cluster
  3. SSH public key
After provisioning, GreenNode will send you the IP address and port for logging in to the head node.

Step 2: Verify basic SLURM cluster functionality
  1. Log in via SSH to the head node using the information provided by GreenNode AI Platform
  2. Check the SLURM node status:
     scontrol show node
The output shows how many SLURM worker nodes are in your cluster and how many resources each node has (CPU, RAM, and Gres for GPUs).

Now you have finished creating a SLURM cluster on GreenNode AI Platform
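Beyond scontrol show node, a few quick checks can confirm the cluster is actually usable. The node and GPU counts below are assumptions based on the two-node example in this guide; adjust them to your cluster:

```shell
# Summary of partitions and node states (idle, alloc, down)
sinfo

# Run a trivial command on every node to confirm jobs launch correctly
srun --nodes=2 --ntasks-per-node=1 hostname

# Confirm GPUs are visible from within a job (requests 1 GPU)
srun --gpus=1 nvidia-smi -L
```

If srun hostname prints one hostname per worker node, scheduling and inter-node launch are working.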

3. Submitting a sample job

Create a file named nccl_benchmark.sbatch with the content below, changing --nodes to match the number of nodes in your cluster.

#!/bin/bash
#SBATCH --job-name=nccl_benchmark
#SBATCH --nodes=2                        # Number of nodes
#SBATCH --ntasks-per-node=1              # Number of tasks per node
#SBATCH --gpus=2                         # Total number of GPUs for the job
#SBATCH --cpus-per-task=10               # CPU cores per task
#SBATCH --time=01:00:00                  # Time limit (HH:MM:SS)
#SBATCH --mem=8G
#SBATCH --partition=main                 # Change based on your cluster
#SBATCH --output=nccl_benchmark_%j.out
#SBATCH --error=nccl_benchmark_%j.err

# Set NCCL environment variables for debugging (optional)
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=^lo,docker0

# Run the NCCL all-reduce benchmark (one task per node, 1 GPU per task)
srun --mpi=pmi2 \
    all_reduce_perf -b 8 -e 8G -f 2 -g 1 -N 100

Then submit the job with:
  sbatch nccl_benchmark.sbatch
You should get two output files for this SLURM job: nccl_benchmark_{job_id}.err and nccl_benchmark_{job_id}.out
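While the job runs, you can track it with the standard SLURM tools. A sketch, where job ID 123 stands in for the ID that sbatch printed:

```shell
# Check the job's state in the queue (PD = pending, R = running)
squeue -j 123

# After completion, review accounting data such as elapsed time and exit code
sacct -j 123 --format=JobID,JobName,State,Elapsed,ExitCode

# Follow the benchmark output as it is written
tail -f nccl_benchmark_123.out
```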


When inspecting nccl_benchmark_1.out, the start of the file shows the nodes and devices used for this job.


At the end of the file, you can see the benchmark results. A result with zero reported errors indicates that no data corruption occurred.