Training Mode

Single training

Single training refers to training a machine learning model on a single compute instance or node (a minimal sketch follows the lists below).

Characteristics:

  • Utilizes a single compute instance for model training.

  • Suitable for small to medium-sized datasets.

  • Limited by the compute resources of a single instance, such as CPU, memory, and GPU.

  • Generally faster to set up and execute compared to distributed training.

Use Cases:

  • Small to medium-sized datasets where the training process can be completed efficiently on a single machine.

  • Prototyping and experimenting with new models or algorithms.

  • Initial stages of model development where scalability is not a primary concern.
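
A minimal sketch of single-instance training, using scikit-learn on a synthetic dataset (the library, dataset, and hyperparameters here are illustrative choices, not platform requirements):

```python
# Single-instance training: the whole dataset and model fit on one machine,
# so no coordination between nodes is needed.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# A synthetic small/medium dataset that fits comfortably in memory.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Training uses only this node's CPU cores (n_jobs=-1 means all of them).
model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.3f}")
```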

Distributed training

Distributed training involves training a machine learning model across multiple compute instances or nodes simultaneously (a minimal sketch follows the lists below).

Characteristics:

  • Distributes the workload across multiple instances to accelerate training time.

  • Scales well with large datasets and complex models.

  • Requires coordination and synchronization between compute nodes.

  • May involve specialized frameworks or libraries for distributed computing, such as TensorFlow's distributed training APIs.

Use Cases:

  • Large-scale datasets that cannot be efficiently processed on a single machine.

  • Training deep learning models with complex architectures, such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs).

  • High-performance computing (HPC) environments where distributed computing resources are available.

  • Production environments where rapid model training and deployment are essential for real-time applications.
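
Because the characteristics above mention TensorFlow's distributed training APIs, here is a minimal sketch of synchronous data parallelism with tf.distribute.MirroredStrategy on a single multi-GPU instance (multi-node jobs would instead use MultiWorkerMirroredStrategy plus a cluster configuration; the model and data here are illustrative):

```python
# Synchronous data parallelism with TensorFlow's tf.distribute API.
# MirroredStrategy replicates the model on every local GPU (falling back
# to CPU if none are visible) and averages gradients on each step.
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Variables must be created inside the strategy scope so every replica
# holds a mirrored copy of the model.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

# Keras splits each global batch evenly across the replicas.
X = np.random.rand(1024, 20).astype("float32")
y = np.random.randint(0, 2, size=(1024, 1)).astype("float32")
model.fit(X, y, batch_size=64, epochs=1)
```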

When using distributed training, keep the following considerations in mind:

  • Framework Compatibility:

    • Ensure that your machine learning framework supports distributed training. Not all frameworks provide built-in support for distributed computing, so you may need to use specialized libraries or APIs.

    • Verify compatibility with the specific version of the framework you are using, as distributed training capabilities may vary between versions (a quick version check is sketched after this list).

  • Code Modification:

    • Distributed training often requires modifications to your training code to enable parallelization and synchronization across multiple compute nodes.

    • You may need to refactor your code to use distributed data parallelism, distributed model parallelism, or other distributed computing techniques, depending on your specific requirements (see the data-parallel sketch after this list).

  • Resource Management:

    • Distributed training typically requires a cluster of compute instances or nodes with high-speed interconnects to efficiently distribute and synchronize training tasks.

    • Ensure that you have sufficient compute resources available to handle the workload, including CPU cores, GPU accelerators, memory, and storage capacity (see the resource check after this list).
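
For the framework-compatibility point, a quick pre-flight check might look like the following (TensorFlow is used as the example because it is named above; adapt the check to your framework):

```python
# Verify the framework version and that its distributed API is present
# before submitting a distributed job.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

# tf.distribute ships with TensorFlow 2.x; fail fast if this build
# does not provide it.
assert hasattr(tf, "distribute"), "This TensorFlow build lacks tf.distribute"
```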
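For the code-modification point, the sketch below shows the typical refactor to distributed data parallelism, using PyTorch's DistributedDataParallel as one common example (the equivalent TensorFlow change is creating the model inside a strategy scope, as shown earlier). It assumes the script is launched with torchrun, e.g. torchrun --nproc_per_node=4 train.py, with one GPU per worker process:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every worker.
    dist.init_process_group(backend="nccl")  # use "gloo" for CPU-only tests
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    model = torch.nn.Linear(20, 1).to(device)
    # The key refactor: wrap the model so gradients are all-reduced
    # across workers after every backward pass.
    model = DDP(model, device_ids=[local_rank])

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()
    for _ in range(10):
        x = torch.randn(64, 20, device=device)  # stand-in for a sharded DataLoader
        y = torch.randn(64, 1, device=device)
        opt.zero_grad()
        loss_fn(model(x), y).backward()  # gradient synchronization happens here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```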
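And for the resource-management point, an illustrative check of what a node actually offers (the nvidia-smi query assumes NVIDIA GPUs with the driver installed):

```python
# Pre-flight inventory of a node's CPU, disk, and GPU resources.
import os
import shutil
import subprocess

print("CPU cores:", os.cpu_count())
print("Free disk (GB):", shutil.disk_usage("/").free // 2**30)

# nvidia-smi ships with the NVIDIA driver; list attached GPUs and memory.
try:
    gpus = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print("GPUs:\n" + gpus.stdout.strip())
except (FileNotFoundError, subprocess.CalledProcessError):
    print("No NVIDIA GPUs detected on this node.")
```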


Related Articles

  • Distributed Training: LLaMA-Factory on Managed Slurm
  • Local Storage Limits for Notebook, Model Training, and Online Prediction
  • Features
  • How it works?
  • Create a tuning job