Start your Model Training Job

A model training job uses datasets to create and optimize machine learning models. The job runs in a cloud environment, where data scientists run code to build models and tune hyperparameters. Training uses computational resources such as GPUs or TPUs to accelerate the learning process.

Step 1: Access Model Training

Step 2: Configure a training job

1. Basic Configuration

  • Provide a suitable name for your training job, e.g., "MyAITrainingJob."

  • Choose the location for this training job.

2. Resource configuration
Configure the instance type and the data volume for this training job.
  • Instance flavor: Specify the CPU, GPU, and RAM configuration based on your workload.
  • Instance count: The default is 1 instance, which runs the training job in single training mode.

3. Training Container

  • Choose a training framework and version that meet your requirements. To use a supported framework, select the Pre-built container option, then specify the model framework and its version.

  • Training command & arguments

    • Command: Enter the command to execute in the training job, including the location of the code file to run. Ensure the file is accessible and imported correctly.

      1. For instance: “wget https://abc.com/text.python -O t.python && python t.python” (the uppercase -O option sets the name of the downloaded file).
    • Arguments: Set up any necessary arguments for the training job, such as hyperparameters, dataset locations, model configuration, etc. Pass these arguments to the argument field.

      1. Example: “--epochs 10 --batch-size 32”
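To illustrate how a training script might consume the example arguments above, here is a minimal sketch using Python's standard argparse module. The function name parse_args and the default values are assumptions made for this illustration; the platform does not require any particular parsing approach.

```python
import argparse


def parse_args(argv=None):
    """Parse hyperparameters passed through the Arguments field."""
    parser = argparse.ArgumentParser(description="Example training entry point")
    # Flag names mirror the example arguments; the defaults are illustrative.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=32)
    return parser.parse_args(argv)


# Passing an explicit list mimics the Arguments field in this sketch.
args = parse_args(["--epochs", "10", "--batch-size", "32"])
# argparse exposes --batch-size as the attribute batch_size.
print(args.epochs, args.batch_size)
```

In a real script, calling parse_args() with no argument reads the values from the command line instead of the explicit list.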

4. Data Mount (optional)

      4.1 Network volume
To read and write data on a network volume from a training job, follow these steps:
  1. Specify a network volume: Select the network volume that you want to access from the training job. This network volume will be mounted at a predefined folder.
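As a rough sketch of reading and writing files on the mounted volume from training code, the example below uses only the Python standard library. This article does not name the predefined mount folder, so the environment variable TRAINING_VOLUME_DIR, the fallback path, and the artifacts subfolder are all hypothetical; replace them with the folder shown for your job.

```python
import os
from pathlib import Path


def volume_root() -> Path:
    # Hypothetical: TRAINING_VOLUME_DIR is not a platform variable. Point it
    # at the predefined folder where your network volume is mounted.
    return Path(os.environ.get("TRAINING_VOLUME_DIR", "./network-volume"))


def save_artifact(name: str, payload: bytes) -> Path:
    """Write bytes under <volume>/artifacts/ and return the file path."""
    target = volume_root() / "artifacts" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def load_artifact(name: str) -> bytes:
    """Read back a file previously written with save_artifact."""
    return (volume_root() / "artifacts" / name).read_bytes()
```

Resolving the folder in one helper keeps the mount location in a single place, so the rest of the training code does not change if the path differs between jobs.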

Step 3: Start Training

  • Click the "Start training" button at the bottom right corner to run the training job with the specified configuration.

Step 4: Verify the result of the training process

  • Once the job starts running, find it with the “Running” status on the Model Training Dashboard.

  • Navigate to the Monitoring section to view logs generated during the training process, such as loss values, accuracy scores, and other performance indicators.
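For metrics such as loss and accuracy to show up in the logs, the training code has to emit them. The sketch below shows one way to do that with Python's standard logging module. It assumes the platform collects the job's standard output, which this article does not confirm; the helper names and the metric values are placeholders for illustration.

```python
import logging
import sys


def make_logger(name: str = "training") -> logging.Logger:
    """Return a logger that writes timestamped lines to standard output."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def format_metrics(epoch: int, loss: float, accuracy: float) -> str:
    """Build one consistent, easy-to-search log line per epoch."""
    return f"epoch={epoch} loss={loss:.4f} accuracy={accuracy:.4f}"


logger = make_logger()
# Placeholder values for illustration only, not real training results.
logger.info(format_metrics(epoch=1, loss=0.5, accuracy=0.875))
```

A fixed key=value layout makes it easier to scan or filter the log output for a specific metric.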


