Local Storage Limits for Notebook, Model Training, and Online Prediction
To ensure optimal performance and cost-efficiency, each compute instance you create on our platform includes a fixed amount of local storage. Exceeding this limit can disrupt your workflow and affect your results. This guide explains how local storage limits work and how to manage your storage usage effectively.
Instance Types Review
Before starting, please review the available instance types below:
| # | Instance Flavor | vCPU | Memory (GB) | vRAM (GB) | Local NVMe Storage |
|---|---|---|---|---|---|
| 1 | g5-standard-16x250-1h100 | 16 | 250 | 80 | 3.75 TB |
| 2 | g5-standard-32x500-2h100 | 32 | 500 | 160 | 7.5 TB |
| 3 | g5-standard-64x1000-4h100 | 64 | 1000 | 320 | 15 TB |
| 4 | g5-standard-128x2000-8h100 | 128 | 2000 | 640 | 30 TB |
Local Storage Limits
- Notebook Instances: Each notebook instance comes with a fixed amount of local NVMe storage, which varies depending on the instance type you choose. This storage is intended for temporary files, code, and data used during your interactive analysis and experimentation.
- Model Training Jobs: For model training jobs, the local NVMe storage limit is calculated per node. If your job uses multiple nodes, the total available storage is the per-node limit multiplied by the number of nodes. This storage is used for storing your training data, model checkpoints, and other intermediate files.
- Online Prediction Endpoints: For model endpoints, the local NVMe storage limit of the chosen instance type represents the total storage available across all nodes in an online prediction endpoint. This storage is primarily used for caching model artifacts and handling incoming requests.
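As an illustration, the per-node limits above can be combined to estimate the total storage available to a multi-node training job. The flavor-to-storage mapping is taken from the instance-type table in this guide; the helper function itself is a hypothetical sketch, not a platform API.

```python
# Per-node local NVMe storage (in GB) for each instance flavor,
# taken from the instance-type table above.
LOCAL_NVME_GB = {
    "g5-standard-16x250-1h100": 3750,
    "g5-standard-32x500-2h100": 7500,
    "g5-standard-64x1000-4h100": 15000,
    "g5-standard-128x2000-8h100": 30000,
}

def total_training_storage_gb(flavor: str, num_nodes: int) -> int:
    """Total local storage for a multi-node training job:
    the per-node limit multiplied by the number of nodes."""
    return LOCAL_NVME_GB[flavor] * num_nodes

# A 4-node job on the 2xH100 flavor gets 4 x 7.5 TB = 30 TB in total.
print(total_training_storage_gb("g5-standard-32x500-2h100", 4))  # 30000
```

Note that for online prediction endpoints the table value is the total across all nodes, so no multiplication applies there.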
Exceeding the Limit
If you exceed the local storage limit for your instance, you may experience the following:
- Read/Write Errors: You might encounter errors when trying to read or write data to the local storage.
- Job Failures: Your training jobs or prediction requests could fail due to insufficient storage space.
- Performance Degradation: Your instance's performance may slow down significantly.
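One defensive pattern against these failures is to check free space before large writes and to catch the "no space left on device" error explicitly rather than letting the job crash. This is a minimal standard-library sketch; the function names are illustrative, not part of any platform SDK.

```python
import errno
import os
import shutil

def has_free_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has at least
    `required_bytes` of free space."""
    return shutil.disk_usage(path).free >= required_bytes

def safe_write(path: str, data: bytes) -> bool:
    """Write `data` to `path` only if it fits on disk; return False
    instead of crashing when local storage is exhausted."""
    target_dir = os.path.dirname(path) or "."
    if not has_free_space(target_dir, len(data)):
        return False
    try:
        with open(path, "wb") as f:
            f.write(data)
        return True
    except OSError as e:
        if e.errno == errno.ENOSPC:  # "No space left on device"
            return False
        raise
```

A training loop can use `safe_write` for checkpoints and fall back to pruning old files or streaming to remote storage when it returns False.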
Managing Local Storage
To avoid exceeding the limit and ensure smooth operation, consider these strategies:
- Choose the Right Instance Type: Select an instance type with sufficient local storage for your expected workload. Refer to our documentation for details on the storage capacity of each instance type.
- Optimize Data Usage:
- Stream data from external sources (e.g., S3 buckets) instead of storing it locally whenever possible.
- Delete temporary files and intermediate results that are no longer needed.
- Use compression techniques to reduce the size of your data.
- Monitor Storage Usage: Regularly check the storage usage of your instances and jobs using the monitoring tools provided by our platform.
- Upgrade Instance: If you consistently need more local storage, consider upgrading to a larger instance type.
Important Note: The local storage provided with each instance is ephemeral. Any data stored locally will be lost if the instance is stopped or terminated. For persistent storage, we recommend using cloud storage services such as S3 buckets.
If you have any questions or need further assistance, please don't hesitate to contact our support team.
Related Articles
Create an Endpoint for online prediction
After training and registering the model, the online prediction component enables the deployment and serving of models to make real-time predictions or inferences on new data. This component provides endpoints or APIs that can be integrated into ...
Import a model registry with custom container
The model registry is a centralized repository for storing trained models, their metadata, versions, and associated artifacts. It allows for version control, management, and organization of models developed during the training phase. This enables ...
Training Mode
Training Mode Definition Characteristics Use Cases Single training Single training refers to training a machine learning model using a single instance or node. Utilizes a single compute instance for model training. Suitable for small to medium-sized ...
Import a model registry with pre-built container
Model Preparation Ensure your machine learning model is packaged and stored in a container image compatible with Triton Inference Server. Upload the container image containing your model to a storage location accessible by our AI Platform. The online ...
Start your Model Training Job
Model training job involves using datasets to create and optimize machine learning models. This process occurs in the cloud environment, where data scientists run code to build models and tune hyperparameters. Training uses computational resources ...