Training Mode

Single training

Definition

Single training refers to training a machine learning model using a single compute instance or node.

Characteristics

  • Utilizes a single compute instance for model training.

  • Suitable for small to medium-sized datasets.

  • Limited by the compute resources of a single instance, such as CPU, memory, and GPU.

  • Generally faster to set up and execute compared to distributed training.

Use Cases

  • Small to medium-sized datasets where the training process can be completed efficiently on a single machine.

  • Prototyping and experimenting with new models or algorithms.

  • Initial stages of model development where scalability is not a primary concern.
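To make the idea concrete, single training keeps the whole dataset and the entire training loop on one machine. A minimal sketch (plain NumPy gradient descent on a toy linear-regression problem; all names and sizes are illustrative, not platform defaults):

```python
import numpy as np

# Toy dataset that comfortably fits in one machine's memory.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.01, size=200)

# Single-instance training loop: all data, one process, one device.
w = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # full-batch gradient
    w -= lr * grad

print(np.round(w, 2))  # close to the true weights [2.0, -1.0, 0.5]
```

Everything above runs in one process; the only limits are the machine's own CPU, memory, and (if used) GPU.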

Distributed training

Definition

Distributed training involves training a machine learning model across multiple compute instances or nodes simultaneously.

Characteristics

  • Distributes the workload across multiple instances to reduce training time.

  • Scales well with large datasets and complex models.

  • Requires coordination and synchronization between compute nodes.

  • May involve specialized frameworks or libraries for distributed computing, such as TensorFlow's distributed training APIs.

Use Cases

  • Large-scale datasets that cannot be processed efficiently on a single machine.

  • Training deep learning models with complex architectures, such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs).

  • High-performance computing (HPC) environments where distributed computing resources are available.

  • Production environments where rapid model training and deployment are essential for real-time applications.
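The core idea of synchronous data-parallel distributed training can be sketched without a cluster: each worker computes a gradient on its own data shard, and the workers average their gradients before every update. A toy single-process simulation (NumPy; the worker count, shapes, and `local_grad` helper are illustrative assumptions, not a real framework API):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
true_w = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ true_w

n_workers = 4
X_shards = np.array_split(X, n_workers)  # each simulated "node" holds one shard
y_shards = np.array_split(y, n_workers)

def local_grad(w, Xs, ys):
    """Mean-squared-error gradient computed on one worker's shard only."""
    return 2 * Xs.T @ (Xs @ w - ys) / len(ys)

w = np.zeros(4)
lr = 0.1
for _ in range(300):
    # Each worker computes a gradient on its shard (in parallel on a real
    # cluster), then an all-reduce averages them -- the synchronization step.
    grads = [local_grad(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
    w -= lr * np.mean(grads, axis=0)

print(np.round(w, 2))  # converges to the same solution a single machine would
```

With equal shard sizes, the averaged gradient is identical to the full-batch gradient, which is why real frameworks implement this averaging (typically as an all-reduce across devices) rather than changing the mathematics of training.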

When using distributed training, there are several important considerations to keep in mind:

  • Framework Compatibility:

    • Ensure that your machine learning framework supports distributed training. Not all frameworks provide built-in support for distributed computing, so you may need to use specialized libraries or APIs.

    • Verify compatibility with the specific version of the framework you are using, as distributed training capabilities may vary between versions.

  • Code Modification:

    • Distributed training often requires modifications to your training code to enable parallelization and synchronization across multiple compute nodes.

    • You may need to refactor your code to use distributed data parallelism, distributed model parallelism, or other distributed computing techniques depending on your specific requirements.

  • Resource Management:

    • Distributed training typically requires a cluster of compute instances or nodes with high-speed interconnects to efficiently distribute and synchronize training tasks.

    • Ensure that you have sufficient compute resources available to handle the workload, including CPU cores, GPU accelerators, memory, and storage capacity.
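For the resource-management point, a rough back-of-the-envelope estimate of per-node memory can help before launching a job. A hypothetical sketch (the helper name, the overhead factor, and the example sizes are assumptions for illustration, not platform values):

```python
def per_node_gb(dataset_gb: float, n_nodes: int, overhead_factor: float = 2.0) -> float:
    """Rough per-node memory estimate for an evenly sharded dataset.

    overhead_factor is an assumed heuristic covering extra copies made by
    data loading, augmentation, and framework buffers.
    """
    return dataset_gb / n_nodes * overhead_factor

# Example: a 512 GB dataset sharded evenly over 8 nodes.
print(per_node_gb(512, 8))  # 128.0 GB per node
```

If the estimate exceeds what each node provides, you would either add nodes, stream the data from storage instead of loading it fully, or reduce the overhead (e.g., fewer in-memory copies).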

