SageMaker endpoint - no space left on device with large models

I’m trying to launch a SageMaker endpoint using one of the larger pretrained models, but I kept getting out-of-disk errors. I found this very odd, since I was using an instance type that comes with multiple terabytes of disk. Adding this to my inference script:

import subprocess

from transformers.utils import logging


def model_fn(model_dir):
    logging.set_verbosity_info()
    logger = logging.get_logger("model_fn")
    # Log disk usage inside the container to see where the space actually is
    result = subprocess.run(["df", "-kh"], stdout=subprocess.PIPE)
    logger.info(result.stdout.decode())

gave me the output:

Filesystem      Size  Used Avail Use% Mounted on
overlay          52G   31G   22G  59% /
tmpfs            64M     0   64M   0% /dev
tmpfs            94G     0   94G   0% /sys/fs/cgroup
shm              92G   20K   92G   1% /dev/shm
/dev/nvme1n1    3.5T  196K  3.3T   1% /tmp
/dev/nvme0n1p1   52G   31G   22G  59% /etc/hosts
tmpfs            94G   12K   94G   1% /proc/driver/nvidia
devtmpfs         94G     0   94G   0% /dev/nvidia0
tmpfs            94G     0   94G   0% /proc/acpi
tmpfs            94G     0   94G   0% /sys/firmware

This basically told me that everything was going to the 52G overlay filesystem (the container root), while the disk with all the space was mounted at /tmp.

So how do I make my overlay disk larger, or how do I make the model directory use /tmp instead of the default?

My logs tell me

com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000

which I’m guessing is where the model will attempt to download to?

Anyone know the best way to deal with this? Thanks!

Hello @blackknight467,

Thank you for researching this. I’ll forward it to the SageMaker team!

In the meantime, what I can suggest is deploying your model as a model.tar.gz, which is extracted onto the instance. That way you don’t need to download the model inside model_fn and can load it from a local path directly.
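A minimal sketch of that flow with the SageMaker Python SDK (the S3 path, role ARN, framework versions, and instance type below are placeholders, not values from this thread):

from sagemaker.huggingface import HuggingFaceModel

# The model.tar.gz holds the saved weights and tokenizer files (and optionally
# code/inference.py); SageMaker extracts it into model_dir before model_fn runs.
huggingface_model = HuggingFaceModel(
    model_data="s3://my-bucket/models/model.tar.gz",  # placeholder S3 path
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder role
    transformers_version="4.26",  # match these to your container versions
    pytorch_version="1.13",
    py_version="py39",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",  # placeholder instance type
)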

Hi @blackknight467

I had this issue too last week and I worked around it by providing a custom inference script (which I needed anyway) and specifying the cache directory when downloading the model like so:

model = AutoModelForSeq2SeqLM.from_pretrained(model_name, cache_dir="/tmp/model_cache/")

Not ideal, but it did the trick.
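For reference, here is roughly what that looks like in a full custom inference script. The model name and cache path are just examples; model_fn and predict_fn follow the SageMaker Hugging Face Inference Toolkit conventions:

import os

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "google/flan-t5-xl"  # example model id, use your own
CACHE_DIR = "/tmp/model_cache/"   # /tmp is where the large NVMe volume is mounted


def model_fn(model_dir):
    os.makedirs(CACHE_DIR, exist_ok=True)
    # Download into /tmp instead of the small overlay filesystem
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, cache_dir=CACHE_DIR)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, cache_dir=CACHE_DIR)
    return model, tokenizer


def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    inputs = tokenizer(data["inputs"], return_tensors="pt")
    output_ids = model.generate(**inputs)
    return {"generated_text": tokenizer.decode(output_ids[0], skip_special_tokens=True)}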

But I agree that this should work out of the box without any workaround. I might dig a bit deeper into this and let you know what I find.

Cheers
Heiko

That worked perfectly! Thanks!