Set up resource scalability

When creating an endpoint, you can configure thresholds for online prediction and manage scalability based on CPU, RAM, and GPU utilization and on response latency. To do so, navigate to “Replica Configuration” on the endpoint creation page and follow these step-by-step instructions:

Step 1: Set up replica count

Specify the minimum and maximum number of replica instances that should be available to handle prediction requests. The minimum replica count guarantees that a baseline number of instances is always running, while the maximum replica count caps how far the service can scale out.
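
These two bounds act as a clamp on whatever replica count the autoscaler later computes. The following Python sketch only illustrates that clamping behavior; the function name is hypothetical and this is not the platform's actual implementation:

    def clamp_replicas(desired: int, min_replicas: int, max_replicas: int) -> int:
        """Keep the autoscaler's computed replica count within the configured bounds."""
        return max(min_replicas, min(desired, max_replicas))

    # With min=1 and max=5, a computed value of 8 is capped at 5,
    # and a computed value of 0 is raised to 1 so at least one
    # instance is always available to serve prediction requests.
    print(clamp_replicas(8, 1, 5))  # 5
    print(clamp_replicas(0, 1, 5))  # 1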

Step 2: Define Threshold Parameters (in Advanced Configuration)

Configure the scalability settings to define how the prediction service should scale based on the threshold parameters. Look for options to set thresholds for CPU, RAM, and GPU utilization and for response latency, and choose values that match your application's requirements and performance goals (an illustrative configuration sketch follows the list below).
  1. For CPU, RAM, and GPU utilization, set thresholds as percentages of the available resources. For example, a CPU utilization threshold of 80% triggers scaling when CPU usage exceeds that level.
  2. For response latency, specify the maximum acceptable response time in milliseconds. This threshold determines when additional instances are needed to handle incoming prediction requests.
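
For reference, the thresholds can be pictured as a simple configuration object. The Python sketch below is only illustrative; the field names are hypothetical, and the real values are entered in the Advanced Configuration form rather than through code:

    from dataclasses import dataclass

    @dataclass
    class ScalingThresholds:
        """Illustrative container for the Advanced Configuration thresholds
        (hypothetical field names, not the platform's actual schema)."""
        cpu_utilization_pct: float   # e.g. 80.0: scale when CPU usage exceeds 80%
        ram_utilization_pct: float   # percentage of available memory
        gpu_utilization_pct: float   # percentage of available GPU capacity
        max_latency_ms: float        # maximum acceptable response time in milliseconds

    thresholds = ScalingThresholds(
        cpu_utilization_pct=80.0,
        ram_utilization_pct=75.0,
        gpu_utilization_pct=70.0,
        max_latency_ms=500.0,
    )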
Step 3: Understand the default scaling policies

Based on the threshold parameters and scalability settings you define, the default scaling policy determines when the number of replica instances is scaled up or down:

  • Formula: desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]

  • Rule: The formula computes the number of replicas (desiredReplicas) needed to bring the current metric value (currentMetricValue) in line with the desired metric value (desiredMetricValue). The autoscaler repeatedly adjusts the replica count based on the ratio between the current and desired metric values until that ratio approaches 1.0, within a tolerance of ±0.1. Scale-up decisions are applied in near real time, while scale-down decisions are delayed by up to 5 minutes, so the system adapts to fluctuating workloads while maintaining performance and efficient resource utilization.

  • For example, suppose you set a CPU utilization threshold of 50% and the endpoint is currently running 2 replicas. If CPU utilization rises to 100%, then desiredReplicas = ceil[2 * (100 / 50)] = 4. The other metrics work the same way (see the Python sketch after this list).

  • Reference: For more details, see the Kubernetes Horizontal Pod Autoscaling documentation at https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/.
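
The scaling rule above can be sketched in a few lines of Python. This snippet only illustrates the formula and the ±0.1 tolerance described in this section, assuming simple min/max clamping; it is not the platform's actual autoscaler code:

    import math

    TOLERANCE = 0.1  # no scaling while the metric ratio stays within ±0.1 of 1.0

    def desired_replicas(current_replicas: int,
                         current_metric: float,
                         desired_metric: float,
                         min_replicas: int,
                         max_replicas: int) -> int:
        """Apply desiredReplicas = ceil[currentReplicas * (currentMetric / desiredMetric)],
        skipping the adjustment when the ratio is already close to 1.0."""
        ratio = current_metric / desired_metric
        if abs(ratio - 1.0) <= TOLERANCE:
            return current_replicas  # within tolerance: keep the current count
        return max(min_replicas, min(math.ceil(current_replicas * ratio), max_replicas))

    # The example from this section: CPU threshold 50%, current utilization 100%,
    # 2 replicas running -> ceil[2 * (100 / 50)] = 4 replicas.
    print(desired_replicas(2, current_metric=100, desired_metric=50,
                           min_replicas=1, max_replicas=10))  # 4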
