"OS Errorr: No space left on device" when trying to load a trained model from S3

SuperYorio · December 25, 2023, 9:08pm

Hello all!

I have been stuck on this for weeks and am genuinely beyond confused. For some context, I was able to successfully train and finetune my CodeLlama-7B and CodeLlama-13B on SageMaker using the instances ml.g5.2xlarge and ml.g5.8xlarge and store these models in my S3 bucket. Then, I was able to effectively deploy my CodeLlama-7B model on the SageMaker Inference Endpoint using the following code in my SageMaker Notebook Instance:

model = HuggingFaceModel(
    model_data="s3://...model.tar.gz",
    entry_point="inference.py",
    source_dir="scripts",
    ... # some versioning parameters
)

predictor = model.deploy(
    endpoint_name="CodeLlama-7B",
    instance_type="ml.g5.2xlarge",
    ...
)

In this code, the model_data points to a file (model.tar.gz) containing my finetuned model and inference.py is a script that holds the functions for inference (model_fn(), predict_fn(), etc.). Everything works beautifully when I deploy my CodeLlama-7B. However, when I replace it with my s3 file containing the CodeLlama-13B, I started receiving the OS Error: Device Out of Space error. Several things I have tried that all still resulted in this same error:

Scaling up the instance_type with a very powerful instance, such as ml.p4d.24xlarge (which is weird because I’ve seen tutorials hosting Llama 2-70B on this instance).
Adding a volume_size parameter in my model.deploy() call with other large instances because ml.g5.* instances don’t support attaching extra volume storage.
Using multi-GPU and setting device_map='auto' when calling .from_pretrained().
Setting the SM_NUM_GPUS variable.
Scale up my Notebook Instance.

Any pointers and guidance would be very much appreciated! @philschmid , just wanted to say I’ve been following a lot of your tutorials and they have been super helpful, thank you so much for all the materials you’ve put out : )

Cheers!

SuperYorio · December 28, 2023, 12:27am

The issue has been solved!

My solution was to download the model on my local machine, unpack and include my custom inference script, and then call the .model() and .deploy()` without including the entry scripts and directories.

Topic		Replies	Views
Sagemaker endpoint - no space left on device with large models Amazon SageMaker	3	4932	February 21, 2023
SageMaker OS Error No Space Left On Device while trying to train Falcon40B Amazon SageMaker	3	1325	August 24, 2023
"no space left on device" when downloading a large model for the Sagemaker training job Amazon SageMaker	4	5044	July 18, 2024
Llama3-8B-Instruct finetuning - No space left on device error Models	3	732	July 20, 2024
InternalServerException when running a model loaded on S3 Amazon SageMaker	4	995	August 6, 2021

"OS Errorr: No space left on device" when trying to load a trained model from S3

Related topics