When creating an endpoint, you can set up threshold configurations for online prediction and manage scalability based on CPU, RAM, and GPU utilization, as well as response latency. To do so, navigate to “Replica Configuration” on the endpoint creation page and follow these step-by-step instructions:
Specify the minimum and maximum number of replica instances available to handle prediction requests. The minimum replica count guarantees that at least that many instances are always running, while the maximum replica count caps how far the endpoint can scale out.
Based on the threshold parameters and scalability settings, your scaling policies determine whether to scale the number of replica instances up or down, by default using:
Formula: desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )].
Rule: The formula calculates the desired number of replicas (desiredReplicas) required to achieve a desired metric value (desiredMetricValue). It continuously adjusts the number of replicas based on the ratio between the current metric value (currentMetricValue) and the desired metric value. The algorithm iteratively scales the number of replicas up or down until the ratio approaches 1.0, with a tolerance of ±0.1. This ensures that the system dynamically scales to maintain optimal performance and resource utilization: scale-up events react to fluctuating workloads in real time, while scale-down events are delayed by up to 5 minutes to avoid thrashing.
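To make the rule concrete, here is a minimal Python sketch of the calculation. The function name `desired_replicas` and the `tolerance` parameter are illustrative, not part of the product's API:

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     desired_metric: float,
                     tolerance: float = 0.1) -> int:
    """Compute the desired replica count from the current metric ratio.

    Mirrors the rule above: if the ratio of the current metric value
    to the desired metric value is within ±tolerance of 1.0, the
    replica count is left unchanged.
    """
    ratio = current_metric / desired_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no scaling needed
    return math.ceil(current_replicas * ratio)
```

For instance, `desired_replicas(2, 100, 50)` returns 4, matching the worked example below the formula, while a ratio of 1.04 falls inside the ±0.1 tolerance band and leaves the replica count untouched.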
For example, you may create a threshold that triggers scaling up when CPU utilization exceeds 50% and the current replica count is 2. When CPU utilization rises to 100%, desiredReplicas = ceil[2 * (100 / 50)] = 4. The same calculation applies to the other metrics.
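The minimum and maximum replica counts configured earlier bound whatever the formula produces. A small sketch of that clamping step, assuming the same calculation as above (the helper name `clamp_replicas` is illustrative, not a real API):

```python
def clamp_replicas(desired: int, min_replicas: int, max_replicas: int) -> int:
    """Keep the computed replica count within the configured bounds.

    min_replicas and max_replicas correspond to the values set under
    "Replica Configuration"; autoscaling never drops below the minimum
    or exceeds the maximum.
    """
    return max(min_replicas, min(desired, max_replicas))
```

With a maximum of 3 replicas, the desired count of 4 from the CPU example above would be capped at 3.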
Reference: For more information, see https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/.