- model_name_or_path: Llama-3.2-1B
- trust_remote_code: true
- stage: sft
- do_train: true
- finetuning_type: lora
- deepspeed: examples/deepspeed/ds_z2_config.json
- dataset: alpaca_en
- template: llama3
- cutoff_len: 512
- max_samples: 1000
- overwrite_cache: true
- preprocessing_num_workers: 1
- output_dir: saves/llama3.2-1b/full/sft
- logging_steps: 10
- save_strategy: "steps"
- save_steps: 100
- save_total_limit: 2
- plot_loss: true
- overwrite_output_dir: true
- per_device_train_batch_size: 2
- gradient_checkpointing: true
- gradient_accumulation_steps: 4
- learning_rate: 2.0e-4 # exponent assumed; a typical LoRA learning rate, adjust to your setup
Assuming you have downloaded the model as suggested above, we point the model path to the local directory:
- model_name_or_path: Llama-3.2-1B
We limit the number of training samples to 1,000 with:
- max_samples: 1000
You can adjust any of the other options as needed.
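With this configuration, the effective batch size is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs, i.e. 2 × 4 × 2 = 16 for the 2-node, 1-GPU-per-node setup described in the next section. Before moving to multi-node training, you can sanity-check the configuration with a quick single-node run (a minimal sketch, assuming the options above are saved as llama32_training.yaml, the file name used by the Slurm script below):
- source venv/bin/activate
- llamafactory-cli train llama32_training.yaml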
7. Configuring Slurm for Multi-Node Training
Assuming you have the YAML training configuration file saved as llama32_training.yaml, create a Slurm script train_llama.sbatch for training on 2 nodes with 1 GPU per node:
- #!/bin/bash
- #SBATCH --job-name=llm-multinode-training
- #SBATCH --nodes=2
- #SBATCH --time=4-00:00:00
- #SBATCH --gres=gpu:1
- #SBATCH -o log/training_%N.out
- #SBATCH -e log/training_%N.err
-
- # Get node information and master node
- nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
- nodes_array=($nodes)
- head_node=${nodes_array[0]}
- head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address | cut -d" " -f2)
- echo "Master Node IP: $head_node"
-
- # Create a script for each node to execute (the quoted heredoc keeps variables such as $SLURM_NODEID unexpanded, so each node resolves its own rank at run time)
- cat > node_script.sh << 'EOF'
- #!/bin/bash
- # Each node will get its proper SLURM_NODEID in this context
- export NNODES=2
- export NPROC_PER_NODE=1
- export MASTER_ADDR=PLACEHOLDER_MASTER
- export MASTER_PORT=29500
- export NODE_RANK=$SLURM_NODEID
- export NCCL_IB_DISABLE=0
- export NCCL_SOCKET_IFNAME=^lo,docker0
- export NCCL_TIMEOUT=180000000
- export NCCL_BLOCKING_WAIT=1
- export NCCL_ASYNC_ERROR_HANDLING=1
- export FORCE_TORCHRUN=1
-
- echo "NODE: $(hostname) with RANK: $NODE_RANK starting training"
- source venv/bin/activate
- llamafactory-cli train llama32_training.yaml
- EOF
-
- # Replace placeholder with actual master node
- sed -i "s/PLACEHOLDER_MASTER/$head_node/" node_script.sh
- chmod +x node_script.sh
-
- # Run the training script on all nodes
- srun ./node_script.sh
Use sbatch to submit the training job:
- sbatch train_llama.sbatch
Check the job status in the queue with squeue.
Inspect the training_<node>.out and training_<node>.err files under log/ to follow the training progress.
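For example (the log file names follow the %N pattern set in the #SBATCH directives above; replace <node> with your actual node names):
- squeue -u $USER
- tail -f log/training_<node>.out log/training_<node>.err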
The final result is shown below:
8. Monitoring CPU and GPU Usage During Training
Effective resource monitoring is essential when training LLaMA 3.2 1B on GreenNode AI Platform to ensure optimal performance and stability. Comprehensive system monitoring enables:
- Real-time identification of performance bottlenecks including GPU underutilization and CPU processing constraints
- Dynamic resource optimization through batch size adjustments and memory allocation refinements
- Proactive prevention of training interruptions caused by memory limitations and system failures
GreenNode AI Platform features an integrated monitoring dashboard providing live hardware performance metrics:
- Per-GPU utilization rates and processing efficiency
- VRAM consumption patterns across distributed nodes
- CPU load distribution and thread utilization
- I/O performance including disk throughput and network transfer rates
Access these critical metrics directly through your GreenNode AI Platform control panel to maintain peak training efficiency throughout your model development lifecycle.
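In addition to the dashboard, you can watch the same metrics from a shell on the compute nodes while the job is running (a minimal sketch using standard tools; open a terminal on the node, for example via an interactive session):
- watch -n 1 nvidia-smi # refresh GPU utilization and VRAM usage every second
- htop # inspect CPU load, memory, and per-core utilization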