Set up resource scalability

When creating an endpoint, you can set up threshold configurations for online prediction and manage scalability based on CPU, RAM, and GPU utilization, and on response latency. To do so, navigate to “Replica Configuration” on the endpoint creation page and follow these step-by-step instructions:

Step 1: Set up replica count

Specify the minimum and maximum number of replica instances that should be available to handle prediction requests. The minimum replica count ensures that a baseline number of instances is always running, while the maximum replica count caps how far the service can scale out.
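As a rough illustration of how these two bounds constrain scaling, here is a minimal Python sketch; the class and field names are hypothetical, not part of the platform's API. Whatever replica target the autoscaler computes, it is always clamped to this range:

```python
from dataclasses import dataclass

@dataclass
class ReplicaBounds:
    """Hypothetical illustration of the minimum/maximum replica settings."""
    min_replicas: int
    max_replicas: int

    def clamp(self, desired: int) -> int:
        """Keep any computed replica target within the configured bounds."""
        return max(self.min_replicas, min(desired, self.max_replicas))

# Example: with bounds [1, 5], a computed target of 8 is capped at 5,
# and a computed target of 0 is raised back to the minimum of 1.
bounds = ReplicaBounds(min_replicas=1, max_replicas=5)
print(bounds.clamp(8))  # -> 5
print(bounds.clamp(0))  # -> 1
```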

Step 2: Define Threshold Parameters (in Advanced Configuration)

Configure the scalability settings to define how the prediction service should scale based on the threshold parameters. Look for the options to set thresholds for CPU, RAM, and GPU utilization, and for response latency, then specify the desired value for each parameter based on your application's requirements and performance goals (a sketch after the list below shows one possible configuration).
  1. For CPU, RAM, and GPU utilization, you can set thresholds as percentages of the available resources. For example, you may set a CPU utilization threshold of 80% to trigger scaling when CPU usage exceeds this level.
  2. For response latency, specify the maximum acceptable response time in milliseconds. This threshold determines when additional instances are required to handle incoming prediction requests.
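The following minimal sketch shows what such a threshold configuration might look like and how it could be checked against observed metrics. The field names, units, and sample values are assumptions made for illustration, not the platform's actual schema:

```python
# Hypothetical threshold configuration (names and units are illustrative only).
thresholds = {
    "cpu_utilization_pct": 80,   # scale out when CPU usage exceeds 80%
    "ram_utilization_pct": 75,   # scale out when RAM usage exceeds 75%
    "gpu_utilization_pct": 80,   # scale out when GPU usage exceeds 80%
    "response_latency_ms": 200,  # scale out when latency exceeds 200 ms
}

# Made-up sample of metrics observed across the current replicas.
observed = {
    "cpu_utilization_pct": 92,
    "ram_utilization_pct": 60,
    "gpu_utilization_pct": 70,
    "response_latency_ms": 250,
}

# Any metric above its threshold signals that more replicas are needed.
breached = [name for name, limit in thresholds.items() if observed[name] > limit]
print(breached)  # -> ['cpu_utilization_pct', 'response_latency_ms']
```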
Step 3: Understand default scaling policies

Based on the threshold parameters and scalability settings, the default scaling policy determines whether to scale the number of replica instances up or down as follows:

  • Formula: desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )].

  • Rule: The formula calculates the desired number of replicas (desiredReplicas) required to bring the metric back to its desired value (desiredMetricValue). It adjusts the replica count based on the ratio between the current metric value (currentMetricValue) and the desired metric value, iteratively scaling up or down until the ratio approaches 1.0, within a tolerance of ±0.1. This ensures that the system dynamically scales to maintain performance and resource utilization, adapting to fluctuating workloads in near real time when scaling up and with a delay of at most 5 minutes when scaling down.

  • For example, suppose you set a threshold that triggers scaling up when CPU utilization exceeds 50%, and you currently have 2 replicas. If CPU utilization rises to 100%, then desiredReplicas = ceil[2 * (100 / 50)] = 4. The other metrics work the same way; a sketch of this calculation appears after this list.

  • Reference: For more information, see https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/.
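To make the formula concrete, here is a small Python sketch of the calculation described above, including the ±0.1 tolerance band and clamping to the configured replica bounds; it reproduces the CPU example (2 replicas, 50% target, 100% observed). It is an illustration of the default policy, not the platform's actual implementation:

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     desired_metric: float,
                     min_replicas: int,
                     max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Apply the default scaling formula described above.

    desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue),
    skipped when the current/desired ratio is within the tolerance band,
    and always clamped to the configured replica bounds.
    """
    ratio = current_metric / desired_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target: no scaling
    return max(min_replicas, min(math.ceil(current_replicas * ratio), max_replicas))

# Worked example from the text: 2 replicas, 50% CPU target, 100% observed -> 4.
print(desired_replicas(2, current_metric=100, desired_metric=50,
                       min_replicas=1, max_replicas=10))
```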
