Managed SLURM service

1. Introduction to SLURM

SLURM (Simple Linux Utility for Resource Management) is an open-source job scheduler designed for Linux high-performance computing clusters. It efficiently allocates compute resources to user jobs, provides tools for job submission and monitoring, and manages job queues based on configured policies.

With a controller-worker architecture, SLURM handles everything from single-node interactive sessions to complex multi-node parallel jobs. Its command-line tools (sbatch, srun, squeue) offer a straightforward interface for users, while its modular design allows administrators to customize scheduling policies and resource management to meet specific needs.
SLURM's scalability, reliability, and extensive feature set have made it the preferred workload manager in many academic, research, and commercial HPC environments worldwide.

2. Getting Started

Step 1: GreenNode AI Platform provisions the SLURM cluster and sends the login information to the customer. The following information is required for cluster provisioning:
  1. Type and number of GPUs. CPU and RAM are provisioned according to the corresponding instance type
  2. Size of the Network Volume. This volume is shared across all nodes in the SLURM cluster
  3. SSH public key
After provisioning, GreenNode will send you the IP address and port for logging in to the head node.

Step 2: Verify basic SLURM cluster functions
  1. Log in via SSH to the head node using the information provided by GreenNode AI Platform
  2. Check the SLURM node status:
     scontrol show node
The output tells you how many SLURM worker nodes are in your cluster and how many resources each node has (CPUs, RAM, and GRES for GPUs).
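As a rough illustration of what to look for, the sketch below parses a hypothetical `scontrol show node` excerpt to count worker nodes and total GPUs. The node names, CPU/memory figures, and GRES strings are made-up examples; your cluster's real output will differ.

```shell
# Hypothetical excerpt of `scontrol show node` output, for illustration only;
# real node names, resource counts, and Gres strings depend on your cluster.
sample_output='NodeName=worker-1 CPUs=80 RealMemory=512000 Gres=gpu:8 State=IDLE
NodeName=worker-2 CPUs=80 RealMemory=512000 Gres=gpu:8 State=IDLE'

# Count worker nodes: one NodeName= record per node
node_count=$(printf '%s\n' "$sample_output" | grep -c '^NodeName=')

# Sum GPUs across nodes from the Gres field (format gpu:N)
gpu_total=$(printf '%s\n' "$sample_output" \
  | grep -oE 'Gres=gpu:[0-9]+' | cut -d: -f2 \
  | awk '{s += $1} END {print s}')

echo "nodes: $node_count"
echo "gpus:  $gpu_total"
```

On a live cluster you would pipe the real `scontrol show node` output through the same filters instead of the sample variable.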

You have now finished creating a SLURM cluster on GreenNode AI Platform.

3. Submitting a sample job

Create a file named nccl_benchmark.sbatch with the content below. Change the --nodes value (and the other #SBATCH directives) to match your cluster:

#!/bin/bash
#SBATCH --job-name=nccl_benchmark
#SBATCH --nodes=2                       # Number of nodes
#SBATCH --ntasks-per-node=1             # Number of tasks per node (1 per GPU)
#SBATCH --gpus=2                        # Total number of GPUs for the job
#SBATCH --cpus-per-task=10              # CPU cores per task
#SBATCH --time=01:00:00                 # Time limit (HH:MM:SS)
#SBATCH --mem=8G                        # Memory per node
#SBATCH --partition=main                # Change based on your cluster
#SBATCH --output=nccl_benchmark_%j.out
#SBATCH --error=nccl_benchmark_%j.err

# Set NCCL environment variables for debugging (optional)
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=^lo,docker0

# Run the NCCL benchmark (all_reduce_perf from nccl-tests must be on
# PATH; otherwise give its full path)
srun --mpi=pmi2 \
     all_reduce_perf -b 8 -e 8G -f 2 -g 1 -N 100

Then submit the job with:
  sbatch nccl_benchmark.sbatch
You should get two output files for this SLURM job: nccl_benchmark_{job_id}.err and nccl_benchmark_{job_id}.out
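The file names come from the %j placeholder in the --output and --error directives, which SLURM replaces with the numeric job id at submission time. A minimal sketch of that expansion, using a hypothetical job id:

```shell
# SLURM expands %j in the --output/--error patterns to the numeric job id.
# 1234 is a hypothetical id; sbatch prints the real one when you submit.
job_id=1234
out_file="nccl_benchmark_${job_id}.out"
err_file="nccl_benchmark_${job_id}.err"
echo "$out_file"
echo "$err_file"
```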


When inspecting nccl_benchmark_1.out, you should see at the start of the file the information about the nodes and devices used for this job.


At the end of the file, you can see the benchmark results.


The results indicate that no data corruption occurred.
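The corruption check can be read straight from the benchmark output: nccl-tests prints a summary line reporting the number of out-of-bounds values, and zero means no corruption was detected. The sketch below greps a hypothetical tail of the .out file; the exact wording and numbers are illustrative, so adjust the pattern if your nccl-tests version formats the line differently.

```shell
# Hypothetical tail of nccl_benchmark_{job_id}.out, for illustration only.
sample_tail='# Out of bounds values : 0 OK
# Avg bus bandwidth    : 42.13'

# Extract the out-of-bounds count from the summary line
oob=$(printf '%s\n' "$sample_tail" \
  | grep 'Out of bounds values' | grep -oE '[0-9]+' | head -1)

if [ "$oob" -eq 0 ]; then
  echo "no data corruption detected"
else
  echo "WARNING: $oob out-of-bounds values"
fi
```

On a live cluster you would run the same filters against the real output file, e.g. `grep 'Out of bounds values' nccl_benchmark_1.out`.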