When creating an endpoint, you can set up threshold configurations for online prediction and manage scalability based on CPU, RAM, and GPU utilization, as well as response latency. To do so, navigate to “Replica Configuration” on the endpoint creation page and follow these step-by-step instructions:
Specify the minimum and maximum number of replica instances available to handle prediction requests. The minimum replica count guarantees that at least that many instances are always running, while the maximum replica count caps how far the endpoint can scale out.
Based on the threshold parameters and scalability settings, your scaling policies determine whether to scale the number of replica instances up or down, by default using:
Formula: desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )].
Rule: The formula calculates the desired number of replicas (desiredReplicas) required to achieve a desired metric value (desiredMetricValue). It continuously adjusts the number of replicas based on the ratio between the current metric value (currentMetricValue) and the desired metric value. The algorithm iteratively scales the number of replicas up or down until the ratio approaches 1.0, with a tolerance of ±0.1. This ensures that the system dynamically scales to maintain optimal performance and resource utilization: scale-up events react to fluctuating workloads in real time, while scale-down events are delayed by up to 5 minutes to avoid thrashing.
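To make the rule concrete, here is a minimal Python sketch of the calculation. The function name `desired_replicas` and the `tolerance` parameter are illustrative, not part of the product's API:

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     desired_metric: float,
                     tolerance: float = 0.1) -> int:
    """Compute the desired replica count from the current metric ratio.

    Mirrors the rule above: if the ratio of the current metric value
    to the desired metric value is within ±tolerance of 1.0, the
    replica count is left unchanged.
    """
    ratio = current_metric / desired_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no scaling needed
    return math.ceil(current_replicas * ratio)
```

For instance, `desired_replicas(2, 100, 50)` returns 4, matching the worked example below the formula, while a ratio of 1.04 falls inside the ±0.1 tolerance band and leaves the replica count untouched.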
For example, you may create a threshold that triggers scaling up when CPU utilization exceeds 50% and the current replica count is 2. When CPU utilization rises to 100%, desiredReplicas = ceil[2 * (100 / 50)] = 4. The same calculation applies to the other metrics.
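The minimum and maximum replica counts configured earlier bound whatever the formula produces. A small sketch of that clamping step, assuming the same calculation as above (the helper name `clamp_replicas` is illustrative, not a real API):

```python
def clamp_replicas(desired: int, min_replicas: int, max_replicas: int) -> int:
    """Keep the computed replica count within the configured bounds.

    min_replicas and max_replicas correspond to the values set under
    "Replica Configuration"; autoscaling never drops below the minimum
    or exceeds the maximum.
    """
    return max(min_replicas, min(desired, max_replicas))
```

With a maximum of 3 replicas, the desired count of 4 from the CPU example above would be capped at 3.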
Reference: For more information, see https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/.