Distributed Training: LLaMA-Factory on Managed Slurm

1. Overview

This guide walks you through implementing distributed training with LLaMA-Factory on a Managed Slurm cluster. The documentation covers all essential aspects of the workflow, including environment configuration, efficient job scheduling via Slurm, and critical performance optimizations.

The guide features a practical example: fine-tuning the Llama-3.2-1B model on the Alpaca dataset, a collection of instruction-tuning samples designed to improve instruction following. Using full fine-tuning settings, we provide hands-on instructions for running this training scenario in a distributed computing environment.

Our step-by-step approach ensures you can successfully replicate and adapt this training pipeline for your specific requirements, leveraging the power of Slurm's resource management capabilities.

Key advantages:

Performance Scaling: Slurm intelligently distributes workloads across GPUs/nodes, cutting training time significantly.
Cost Efficiency: Smart scheduling eliminates idle GPU time, maximizing resource utilization and reducing expenses.
Operational Stability: Automated recovery, error management, and monitoring ensure uninterrupted training sessions.
Practical Implementation: Complete fine-tuning workflow includes dataset preparation, YAML configuration, and ready-to-use Slurm scripts.

This guide enables direct replication of optimized LLM training workflows on GreenNode AI Platform infrastructure.

2. Use case

Use case: Training Large Language Models with LLaMA-Factory

The GreenNode AI Platform's Managed Slurm service is designed for training large language models with LLaMA-Factory. It distributes processing across multiple GPUs and compute nodes, reducing training time while maintaining operational stability.

Key Advantages

Enterprise Scalability: Seamlessly handles large-scale models with intelligent GPU resource allocation
Cost Efficiency: Minimizes cloud computing expenses by eliminating idle resources through dynamic scheduling
Operational Reliability: Ensures uninterrupted training with automated job recovery and comprehensive error handling

Experience the power of distributed computing without the complexity of manual configuration.

3. Prerequisites

GreenNode GPU Instances
Managed Slurm cluster

4. Environment Setup

Using the head node login information, log in to the head node via SSH and run the following commands there. They install all packages required to run LLaMA-Factory:

python3 -m venv venv
source venv/bin/activate
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics,deepspeed]"
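
Before moving on, it can help to confirm that the tools this guide relies on are visible on the head node. The following is a minimal sketch, not part of LLaMA-Factory: check_tool is a hypothetical helper name, and the list of tools is an assumption based on the commands used later in this guide. Run it with the virtual environment activated.

```shell
#!/bin/bash
# Report whether a command is available on PATH (hypothetical helper name).
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1 -> $(command -v "$1")"
    else
        echo "missing: $1"
        return 1
    fi
}

# Tools used in this guide; a missing one is reported but does not abort.
for tool in python3 llamafactory-cli sbatch srun; do
    check_tool "$tool" || true
done
```

If llamafactory-cli is reported as missing, the most likely cause is that the virtual environment is not active in the current shell.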

5. Sample Training Task: Fine-tuning LLaMA

To demonstrate a real-world scenario, we will perform a full fine-tune of the Llama-3.2-1B model on the Alpaca dataset for instruction-following tasks.

5.1. Dataset: Alpaca

The Alpaca dataset is a collection of instruction-response pairs created by Stanford researchers to improve language model instruction-following capabilities. Generated using LLM-based techniques, it covers diverse tasks from creative writing to problem-solving. The dataset has become a foundation for fine-tuning language models in the open-source community, with variants like Open Instruct Uncensored Alpaca extending its capabilities for more comprehensive training scenarios.

5.2. Model: LLaMA 3.2-1B

Meta AI's LLaMA 3.2 1B delivers impressive performance within a compact 1 billion parameter footprint, balancing strong reasoning capabilities with efficient resource utilization. Despite its small size, the model benefits significantly from distributed fine-tuning across multiple GPUs, where frameworks like LLaMA-Factory with Slurm orchestration can parallelize the training process, dramatically reducing fine-tuning time while maintaining optimization quality.

If you want to download the model directly from Hugging Face, you can use the following command:

huggingface-cli download meta-llama/Llama-3.2-1B --local-dir=Llama-3.2-1B
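
The meta-llama repositories on Hugging Face are gated, so the download usually requires accepting the model license and authenticating first (for example with huggingface-cli login). After the download finishes, a quick sanity check along these lines can catch an empty or partial directory. This is a sketch only: model_dir_ok is a hypothetical helper, and checking for config.json is an assumption about the minimum a usable model directory contains, not a full integrity check.

```shell
#!/bin/bash
# Return success if the directory looks like a downloaded Hugging Face model.
# Assumption: a usable download contains at least config.json.
model_dir_ok() {
    local dir="$1"
    [ -d "$dir" ] && [ -f "$dir/config.json" ]
}

MODEL_DIR="${MODEL_DIR:-Llama-3.2-1B}"   # local path used in this guide
if model_dir_ok "$MODEL_DIR"; then
    echo "Model directory $MODEL_DIR contains config.json."
else
    echo "Model directory $MODEL_DIR is missing or has no config.json."
fi
```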

6. Preparing training configuration

LLaMA-Factory leverages YAML files for structured training parameter management, simplifying hyperparameter adjustments while ensuring experiment reproducibility.

We will walk you through the process for creating an optimized YAML configuration specifically tailored for fine-tuning the LLaMA 3.2 1B model.

6.1. Sample YAML Configuration for Fine-Tuning LLaMA 3.2 1B

LLaMA-Factory provides various predefined YAML training configuration files, located in LLaMA-Factory/examples. Here is a YAML file for full fine-tuning of LLaMA 3.2 1B on the Alpaca dataset (alpaca_en):

model_name_or_path: Llama-3.2-1B
trust_remote_code: true
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z2_config.json
dataset: alpaca_en
template: llama3
cutoff_len: 512
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 1
output_dir: saves/llama3.2-1b/full/sft
logging_steps: 10
save_strategy: "steps"
save_steps: 100
save_total_limit: 2
plot_loss: true
overwrite_output_dir: true
per_device_train_batch_size: 2
gradient_checkpointing: true
gradient_accumulation_steps: 4
learning_rate: 2.0e

Assuming you downloaded the model as suggested above, point the model path at the local directory:

model_name_or_path: Llama-3.2-1B

We limit training to 1000 samples with:

max_samples: 1000

You can adjust all other options as needed.
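
When adjusting batch settings, it helps to know the effective (global) batch size they imply. The sketch below does that arithmetic for the values in the YAML above and the 2-node, 1-GPU-per-node job configured in section 7. effective_batch is a hypothetical helper, and the step count is only a rough estimate, since the trainer's own rounding may differ by one.

```shell
#!/bin/bash
# Effective batch size = per-device batch * accumulation steps * total GPUs.
effective_batch() {
    local per_device="$1" accum="$2" nodes="$3" gpus_per_node="$4"
    echo $(( per_device * accum * nodes * gpus_per_node ))
}

# per_device_train_batch_size=2, gradient_accumulation_steps=4, 2 nodes x 1 GPU.
global_batch=$(effective_batch 2 4 2 1)
max_samples=1000
# Ceiling division: rough number of optimizer steps for one pass over the data.
approx_steps=$(( (max_samples + global_batch - 1) / global_batch ))
echo "Effective batch size: $global_batch"
echo "Approx. optimizer steps per epoch: $approx_steps"
```

With these values the effective batch size is 16. Because save_steps is 100, a checkpoint is only written once the run passes 100 optimizer steps, which is worth keeping in mind for short runs.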

7. Configuring Slurm for Multi-Node Training

Assuming you have a YAML training configuration file named llama32_training.yaml, create a Slurm script train_llama.sbatch for training on 2 nodes with 1 GPU per node:

#!/bin/bash
#SBATCH --job-name=llm-multinode-training
#SBATCH --nodes=2
#SBATCH --time=4-00:00:00
#SBATCH --gres=gpu:1
#SBATCH -o log/training_%N.out
#SBATCH -e log/training_%N.err
# Get node information and master node
nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST)
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address | cut -d" " -f2)
echo "Master Node: $head_node (IP: $head_node_ip)"
# Create a script for each node to execute
cat > node_script.sh << 'EOF'
#!/bin/bash
# Each node will get its proper SLURM_NODEID in this context
export NNODES=2
export NPROC_PER_NODE=1
export MASTER_ADDR=PLACEHOLDER_MASTER
export MASTER_PORT=29500
export NODE_RANK=$SLURM_NODEID
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=^lo,docker0
export NCCL_TIMEOUT=180000000
export NCCL_BLOCKING_WAIT=1
export NCCL_ASYNC_ERROR_HANDLING=1
export FORCE_TORCHRUN=1
echo "NODE: $(hostname) with RANK: $NODE_RANK starting training"
source venv/bin/activate
llamafactory-cli train llama32_training.yaml
EOF
# Replace placeholder with actual master node
sed -i "s/PLACEHOLDER_MASTER/$head_node/" node_script.sh
chmod +x node_script.sh
# Run the training script on all nodes
srun ./node_script.sh
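
The script above hard-codes NNODES=2 and NPROC_PER_NODE=1, so both must be edited whenever the #SBATCH resources change. As an alternative, the values can be derived from environment variables that Slurm sets inside a job. This is a sketch under assumptions: derive_launch_env and TOTAL_GPUS are hypothetical names, and SLURM_GPUS_ON_NODE may be unset on some clusters or Slurm versions, which is why every variable has a fallback.

```shell
#!/bin/bash
# Derive launcher settings from Slurm's job environment, with fallbacks
# so the snippet also runs outside a job (single node, single GPU).
derive_launch_env() {
    export NNODES="${SLURM_JOB_NUM_NODES:-1}"
    export NPROC_PER_NODE="${SLURM_GPUS_ON_NODE:-1}"
    export NODE_RANK="${SLURM_NODEID:-0}"
    # Informational only; not read by the launcher.
    TOTAL_GPUS=$(( NNODES * NPROC_PER_NODE ))
}

derive_launch_env
echo "nodes=$NNODES gpus_per_node=$NPROC_PER_NODE node_rank=$NODE_RANK total_gpus=$TOTAL_GPUS"
```

If used, these lines would take the place of the corresponding export statements inside node_script.sh, since SLURM_NODEID is only meaningful in the per-node context started by srun.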

Use sbatch to submit the training job:

sbatch train_llama.sbatch
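
The script writes its output to log/training_%N.out and log/training_%N.err, and Slurm does not create missing directories for these paths. A small wrapper such as the following sketch creates the directory first and captures the job ID. It assumes you run it from the directory that contains train_llama.sbatch.

```shell
#!/bin/bash
# Make sure the directory used by the -o / -e options exists before submitting.
mkdir -p log

if command -v sbatch >/dev/null 2>&1; then
    # --parsable prints only the job ID (plus the cluster name, if any).
    job_id=$(sbatch --parsable train_llama.sbatch)
    echo "Submitted job $job_id"
else
    echo "sbatch not found; run this on the Slurm head node."
fi
```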

View the queued job with squeue.

Inspect the log/training_%N.out and log/training_%N.err files, where %N is replaced by the node's short hostname, to follow the training progress.
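
The two checks above can be combined into a short helper. The sketch below lists your jobs when squeue is available and prints the last lines of every log matching the -o pattern from the script. tail_node_logs is a hypothetical helper name, and the log directory and file pattern are taken from the #SBATCH options in this guide.

```shell
#!/bin/bash
# Print the last few lines of every training_*.out log in a directory.
tail_node_logs() {
    local log_dir="${1:-log}" lines="${2:-5}" f
    for f in "$log_dir"/training_*.out; do
        [ -f "$f" ] || continue          # pattern matched nothing
        echo "== $f =="
        tail -n "$lines" "$f"
    done
}

# List your queued and running jobs, if Slurm is available on this machine.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER"
fi
tail_node_logs log 5
```

To follow a single log continuously instead, tail -f on that file works as usual.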

The final result is shown below:

8. Monitoring CPU and GPU Usage During Training

Effective resource monitoring is essential when training LLaMA 3.2 1B on GreenNode AI Platform to ensure optimal performance and stability. Comprehensive system monitoring enables:

Real-time identification of performance bottlenecks including GPU underutilization and CPU processing constraints
Dynamic resource optimization through batch size adjustments and memory allocation refinements
Proactive prevention of training interruptions caused by memory limitations and system failures

GreenNode AI Platform features an integrated monitoring dashboard providing live hardware performance metrics:

Per-GPU utilization rates and processing efficiency
VRAM consumption patterns across distributed nodes
CPU load distribution and thread utilization
I/O performance including disk throughput and network transfer rates

Access these critical metrics directly through your GreenNode AI Platform control panel to maintain peak training efficiency throughout your model development lifecycle.
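
Alongside the dashboard, GPU utilization can also be spot-checked from the command line on a compute node. The sketch below is an illustration rather than a platform feature: flag_idle_gpus is a hypothetical helper, the 30% threshold is an arbitrary assumption, and the script only reports anything when nvidia-smi is installed on the machine where it runs.

```shell
#!/bin/bash
# Print a warning for each GPU whose utilization is below a threshold.
# Input: CSV lines "index, utilization%, memory_used_MiB, memory_total_MiB",
# the format produced by the nvidia-smi query below.
flag_idle_gpus() {
    local threshold="${1:-30}"
    awk -F', *' -v t="$threshold" \
        '$2 + 0 < t { printf "GPU %s underutilized: %s%% (%s/%s MiB)\n", $1, $2, $3, $4 }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
               --format=csv,noheader,nounits | flag_idle_gpus 30
else
    echo "nvidia-smi not found; run this on a GPU compute node."
fi
```

Persistently low utilization during training often points to a small batch size or a data loading bottleneck, which matches the adjustments listed above.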
