Local Storage Limits for Notebook, Model Training, and Online Prediction
To ensure optimal performance and cost-efficiency, each compute instance you create on our platform includes a fixed amount of local storage. Exceeding this limit can disrupt your workflow and affect your results. This guide explains how local storage limits work and how to manage your storage usage effectively.
Instance Types Review
Before starting, please review the available instance types below:
| # | Instance Flavor | vCPU | Memory (GB) | vRAM (GB) | Local NVMe Storage |
|---|---|---|---|---|---|
| 1 | g5-standard-16x250-1h100 | 16 | 250 | 80 | 3.75 TB |
| 2 | g5-standard-32x500-2h100 | 32 | 500 | 160 | 7.5 TB |
| 3 | g5-standard-64x1000-4h100 | 64 | 1000 | 320 | 15 TB |
| 4 | g5-standard-128x2000-8h100 | 128 | 2000 | 640 | 30 TB |
Local Storage Limits
- Notebook Instances: Each notebook instance comes with a fixed amount of local NVMe storage, which varies depending on the instance type you choose. This storage is intended for temporary files, code, and data used during your interactive analysis and experimentation.
- Model Training Jobs: For model training jobs, the local NVMe storage limit is calculated per node. If your job uses multiple nodes, the total available storage is the per-node limit multiplied by the number of nodes. This storage is used for storing your training data, model checkpoints, and other intermediate files.
- Online Prediction Endpoints: For model endpoints, the local NVMe storage limit of the chosen instance type represents the total storage available across all nodes in an online prediction endpoint. This storage is primarily used for caching model artifacts and handling incoming requests.
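As an illustration, the per-node limits above can be combined to estimate the total storage available to a multi-node training job. The flavor-to-storage mapping is taken from the instance-type table in this guide; the helper function itself is a hypothetical sketch, not a platform API.

```python
# Per-node local NVMe storage (in GB) for each instance flavor,
# taken from the instance-type table above.
LOCAL_NVME_GB = {
    "g5-standard-16x250-1h100": 3750,
    "g5-standard-32x500-2h100": 7500,
    "g5-standard-64x1000-4h100": 15000,
    "g5-standard-128x2000-8h100": 30000,
}

def total_training_storage_gb(flavor: str, num_nodes: int) -> int:
    """Total local storage for a multi-node training job:
    the per-node limit multiplied by the number of nodes."""
    return LOCAL_NVME_GB[flavor] * num_nodes

# A 4-node job on the 2xH100 flavor gets 4 x 7.5 TB = 30 TB in total.
print(total_training_storage_gb("g5-standard-32x500-2h100", 4))  # 30000
```

Note that for online prediction endpoints the table value is the total across all nodes, so no multiplication applies there.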
Exceeding the Limit
If you exceed the local storage limit for your instance, you may experience the following:
- Read/Write Errors: You might encounter errors when trying to read or write data to the local storage.
- Job Failures: Your training jobs or prediction requests could fail due to insufficient storage space.
- Performance Degradation: Your instance's performance may slow down significantly.
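One defensive pattern against these failures is to check free space before large writes and to catch the "no space left on device" error explicitly rather than letting the job crash. This is a minimal standard-library sketch; the function names are illustrative, not part of any platform SDK.

```python
import errno
import os
import shutil

def has_free_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has at least
    `required_bytes` of free space."""
    return shutil.disk_usage(path).free >= required_bytes

def safe_write(path: str, data: bytes) -> bool:
    """Write `data` to `path` only if it fits on disk; return False
    instead of crashing when local storage is exhausted."""
    target_dir = os.path.dirname(path) or "."
    if not has_free_space(target_dir, len(data)):
        return False
    try:
        with open(path, "wb") as f:
            f.write(data)
        return True
    except OSError as e:
        if e.errno == errno.ENOSPC:  # "No space left on device"
            return False
        raise
```

A training loop can use `safe_write` for checkpoints and fall back to pruning old files or streaming to remote storage when it returns False.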
Managing Local Storage
To avoid exceeding the limit and ensure smooth operation, consider these strategies:
- Choose the Right Instance Type: Select an instance type with sufficient local storage for your expected workload. Refer to our documentation for details on the storage capacity of each instance type.
- Optimize Data Usage:
- Stream data from external sources (e.g., S3 buckets) instead of storing it locally whenever possible.
- Delete temporary files and intermediate results that are no longer needed.
- Use compression techniques to reduce the size of your data.
- Monitor Storage Usage: Regularly check the storage usage of your instances and jobs using the monitoring tools provided by our platform.
- Upgrade Instance: If you consistently need more local storage, consider upgrading to a larger instance type.
Important Note: The local storage provided with each instance is ephemeral. Any data stored locally will be lost if the instance is stopped or terminated. For persistent storage, we recommend using cloud storage services such as S3 buckets.
If you have any questions or need further assistance, please don't hesitate to contact our support team.
Related Articles
Create an Endpoint for online prediction
After training and registering the model, the online prediction component enables the deployment and serving of models to make real-time predictions or inferences on new data. This component provides endpoints or APIs that can be integrated into ...
Import a model registry with custom container
The model registry is a centralized repository for storing trained models, their metadata, versions, and associated artifacts. It allows for version control, management, and organization of models developed during the training phase. This enables ...
Training Mode
Training Mode Definition Characteristics Use Cases Single training Single training refers to training a machine learning model using a single instance or node. Utilizes a single compute instance for model training. Suitable for small to medium-sized ...
Import a model registry with pre-built container
Model Preparation Ensure your machine learning model is packaged and stored in a container image compatible with Triton Inference Server. Upload the container image containing your model to a storage location accessible by our AI Platform. The online ...
Start your Model Training Job
Model training job involves using datasets to create and optimize machine learning models. This process occurs in the cloud environment, where data scientists run code to build models and tune hyperparameters. Training uses computational resources ...