Step 2: Verify basic SLURM cluster functionality
- Log in via SSH to the head node using the connection information provided by GreenNode AI Platform.
- Check the Slurm node status:

```bash
scontrol show node
```
The output should tell you how many SLURM worker nodes are in your cluster and what resources each node has (CPUs, RAM, and Gres for GPUs). A sample is shown below.
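The following excerpt is illustrative only, assuming a two-node cluster with one GPU per node; node names, field order, and resource figures will differ on your cluster:

```text
NodeName=worker-1 Arch=x86_64 CPUAlloc=0 CPUTot=10 CPULoad=0.05
   Gres=gpu:1
   RealMemory=64000 AllocMem=0 FreeMem=60000
   State=IDLE ThreadsPerCore=1
   Partitions=main
   ...
NodeName=worker-2 Arch=x86_64 CPUAlloc=0 CPUTot=10 CPULoad=0.03
   Gres=gpu:1
   RealMemory=64000 AllocMem=0 FreeMem=60100
   State=IDLE ThreadsPerCore=1
   Partitions=main
   ...
```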
You have now finished creating a SLURM cluster on GreenNode AI Platform.
Step 3: Submitting a sample job
Create a file named nccl_benchmark.sbatch with the contents below. Change the --nodes value (and the total --gpus value accordingly) to match the number of worker nodes in your cluster.
```bash
#!/bin/bash
#SBATCH --job-name=nccl_benchmark
#SBATCH --nodes=2                   # Number of nodes
#SBATCH --ntasks-per-node=1         # One task per node (one per GPU here)
#SBATCH --gpus=2                    # Total GPUs for the job (here, 1 per node)
#SBATCH --cpus-per-task=10          # CPU cores per task
#SBATCH --time=01:00:00             # Time limit (HH:MM:SS)
#SBATCH --mem=8G                    # Memory per node
#SBATCH --partition=main            # Change based on your cluster
#SBATCH --output=nccl_benchmark_%j.out
#SBATCH --error=nccl_benchmark_%j.err

# Set NCCL environment variables for debugging (optional)
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=^lo,docker0   # Exclude loopback and docker0 interfaces

# Run the NCCL all-reduce benchmark (all_reduce_perf from nccl-tests):
# message sizes from 8 B to 8 GB, doubling each step, one GPU per task
srun --mpi=pmi2 \
    all_reduce_perf -b 8 -e 8G -f 2 -g 1 -N 100
```
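If you are unsure what to put in --partition, you can list the partitions defined on your cluster; the name main above is only a placeholder:

```bash
sinfo -s   # summary view: one line per partition with its state and node count
```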
Then submit the job with sbatch, as shown below.
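sbatch prints the ID that Slurm assigned to the job; the ID below is illustrative. You can then watch the job in the queue until it completes:

```bash
sbatch nccl_benchmark.sbatch
# Submitted batch job 1

squeue -u $USER   # the job disappears from the queue once it has finished
```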
You should get two output files for this Slurm job: nccl_benchmark_{job_id}.out and nccl_benchmark_{job_id}.err.
When inspecting nccl_benchmark_1.out, you should see information about the nodes and GPU devices used for this job at the start of the file.
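As an illustration, the header printed by all_reduce_perf typically looks like the excerpt below; the hostnames, PIDs, and GPU models here are placeholders:

```text
# nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 ...
#
# Using devices
#  Rank  0 Group  0 Pid  12345 on   worker-1 device  0 [0x1b] NVIDIA H100 80GB HBM3
#  Rank  1 Group  0 Pid  12346 on   worker-2 device  0 [0x1b] NVIDIA H100 80GB HBM3
```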
At the end of the file, you can see the results of the test. The results indicate that no data corruption occurred: the #wrong column should be 0 for every message size.
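For reference, the tail of the output typically looks like the trimmed excerpt below (only the out-of-place columns are shown; the timing and bandwidth figures are placeholders):

```text
#       size         count      type   redop    root     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)
  8589934592    2147483648     float     sum      -1    xxxxx   xx.xx   xx.xx       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : xx.xx
```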