I’m trying to launch a SageMaker endpoint using one of the larger pretrained models, but I keep getting out-of-disk errors. I found this very odd, since I was using a machine that comes with multiple terabytes of disk. To see where the space was actually going, I logged the disk usage from inside the container:
import subprocess
# Assuming the transformers logging utilities here, since the stdlib
# logging module has no set_verbosity_info() or get_logger()
from transformers.utils import logging

def model_fn(model_dir):
    logging.set_verbosity_info()
    logger = logging.get_logger("model_fn")
    # Log the container's disk usage so it shows up in the endpoint logs
    result = subprocess.run(['df', '-kh'], capture_output=True, text=True)
    logger.info(result.stdout)
This gave me the output:
Filesystem      Size  Used Avail Use% Mounted on
overlay          52G   31G   22G  59% /
tmpfs            64M     0   64M   0% /dev
tmpfs            94G     0   94G   0% /sys/fs/cgroup
shm              92G   20K   92G   1% /dev/shm
/dev/nvme1n1    3.5T  196K  3.3T   1% /tmp
/dev/nvme0n1p1   52G   31G   22G  59% /etc/hosts
tmpfs            94G   12K   94G   1% /proc/driver/nvidia
devtmpfs         94G     0   94G   0% /dev/nvidia0
tmpfs            94G     0   94G   0% /proc/acpi
tmpfs            94G     0   94G   0% /sys/firmware
This basically told me that everything was going to the small overlay disk that backs the container's root filesystem, while the disk with all the space was mounted at /tmp.
So how do I make my overlay disk larger, or how do I make the model directory use /tmp instead of the default?
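If pointing things at /tmp is the way to go, I'm guessing I'd need to set the Hugging Face cache environment variables when creating the model, roughly like this (model artifact, versions, and instance type below are just placeholders, not my real values):

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    model_data="s3://my-bucket/model.tar.gz",  # placeholder artifact
    role=sagemaker.get_execution_role(),
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    env={
        "HF_HOME": "/tmp",             # Hugging Face cache root
        "TRANSFORMERS_CACHE": "/tmp",  # where from_pretrained downloads weights
        "TMPDIR": "/tmp",              # generic temp-file location
    },
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # placeholder instance type
)

But I'm not sure whether the serving container actually respects those for the model download.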
My logs tell me:
com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
which I'm guessing is where the model will attempt to download to?
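Alternatively, I suppose I could force the download location inside model_fn itself, something like this sketch (the model ID is a placeholder):

from transformers import AutoModelForCausalLM, AutoTokenizer

def model_fn(model_dir):
    model_id = "my-org/my-large-model"  # placeholder model ID
    # cache_dir should redirect the weight download onto the big /tmp volume
    tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir="/tmp")
    model = AutoModelForCausalLM.from_pretrained(model_id, cache_dir="/tmp")
    return model, tokenizer

But that feels like a workaround rather than the intended way to do it.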
Anyone know the best way to deal with this? Thanks!