I’m trying to launch a SageMaker endpoint using one of the larger pretrained models, but I kept getting out-of-disk errors. I found this very odd since I was using a machine that comes with multiple terabytes of disk. To investigate, I added a df call to my inference script:
import subprocess

from transformers.utils import logging


def model_fn(model_dir):
    # Log the container's disk usage so I can see where the space actually is
    logging.set_verbosity_info()
    logger = logging.get_logger("model_fn")
    result = subprocess.run(['df', '-kh'], stdout=subprocess.PIPE)
    logger.info(result.stdout.decode())
This gave me the following output:
Filesystem Size Used Avail Use% Mounted on
overlay 52G 31G 22G 59% /
tmpfs 64M 0 64M 0% /dev
tmpfs 94G 0 94G 0% /sys/fs/cgroup
shm 92G 20K 92G 1% /dev/shm
/dev/nvme1n1 3.5T 196K 3.3T 1% /tmp
/dev/nvme0n1p1 52G 31G 22G 59% /etc/hosts
tmpfs 94G 12K 94G 1% /proc/driver/nvidia
devtmpfs 94G 0 94G 0% /dev/nvidia0
tmpfs 94G 0 94G 0% /proc/acpi
tmpfs 94G 0 94G 0% /sys/firmware
This basically told me that everything was going to the overlay disk, and that the disk with all the space was mounted at /tmp.
So how do I make the overlay disk larger, or how do I make the model directory use /tmp instead of the default?
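For reference, this is the kind of workaround I was considering at the top of model_fn, assuming the Hugging Face cache environment variables are what control where the weights get written (the exact paths are just my guess):

import os

# Workaround idea: point the Hugging Face cache and temp files at the
# large NVMe volume before anything big gets downloaded or extracted.
os.environ["HF_HOME"] = "/tmp/hf_home"
os.environ["TRANSFORMERS_CACHE"] = "/tmp/transformers_cache"
os.environ["TMPDIR"] = "/tmp"

But I'm not sure this runs early enough, since the worker seems to fetch the model before model_fn is called.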
My logs tell me:
com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
which I’m guessing is where the model will try to download to?
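If it helps, I could also set the environment when creating the endpoint instead of inside the container; as far as I know the env dict is passed through as container environment variables, but the model path, versions, and instance type below are just placeholders for my actual setup:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

huggingface_model = HuggingFaceModel(
    model_data="s3://my-bucket/model.tar.gz",  # placeholder S3 path
    role=role,
    transformers_version="4.26",               # placeholder framework versions
    pytorch_version="1.13",
    py_version="py39",
    env={
        "HF_HOME": "/tmp/hf_home",  # same idea: push caches onto the big /tmp volume
        "TMPDIR": "/tmp",
    },
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # placeholder instance type
)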
Anyone know the best way to deal with this? Thanks!