"no space left on device" when downloading a large model for a SageMaker training job

Hi Hugging Face community, I encountered a “no space left on device” error when a SageMaker training job tried to download a large model from the Hugging Face Hub. To reproduce the issue, simply replace the model ID with “EleutherAI/gpt-neox-20b” in this notebook. Both g5.8xlarge and p4d.24xlarge instances showed the same error. I tried the following solutions, but neither worked:

  1. Increased volume_size to 500 GB in the HuggingFace estimator. It seems that the volume_size parameter is not used if the instance type is g5 or p4.
  2. Changed the cache directory by setting os.environ['TRANSFORMERS_CACHE'] = "/opt/ml/checkpoints/", as indicated here.

Any suggestion is welcome. Thanks!

@philschmid Calling the AWS hero for help. Much appreciated!

You should be able to modify the training script here to define where to save the model.
/tmp is the best place to store it, since that’s where your “VOLUME_SIZE” will be added.
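
If it helps to confirm which directory actually has the extra space, here is a minimal sketch that prints the free space for a few candidate paths from inside the training script. The helper names (free_gib, report_free_space) and the list of candidate paths are only illustrative assumptions; adjust them to whatever your container mounts.

```python
import os
import shutil


def free_gib(path):
    """Return the free space of the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def report_free_space(paths):
    """Map each path to its free space in GiB, or None if the path is missing."""
    report = {}
    for path in paths:
        try:
            report[path] = free_gib(path)
        except OSError:
            # The path does not exist (or is not accessible) in this container.
            report[path] = None
    return report


if __name__ == "__main__":
    # Candidate locations (assumptions): /tmp, the checkpoint directory tried
    # above, and the user's default cache directory.
    candidates = ["/tmp", "/opt/ml/checkpoints", os.path.expanduser("~/.cache")]
    for path, gib in report_free_space(candidates).items():
        if gib is None:
            print(f"{path}: not found")
        else:
            print(f"{path}: {gib:.1f} GiB free")
```

Running this at the top of the training script should show whether /tmp really has the space you requested before the download starts.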

@philschmid Thanks for the suggestion. I found that I need to also set os.environ['HF_HOME'] = '/tmp' to make it work.
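
For anyone landing here later, a rough sketch of what that looks like at the top of the training script. The helper name configure_hf_cache is just for illustration. As far as I can tell, the variables have to be set before transformers or huggingface_hub is imported, since the cache location is read at import time. Newer transformers versions may also warn that TRANSFORMERS_CACHE is deprecated in favor of HF_HOME.

```python
import os


def configure_hf_cache(cache_root="/tmp"):
    """Point the Hugging Face caches at `cache_root` (hypothetical helper).

    Call this before importing transformers or huggingface_hub.
    """
    os.makedirs(cache_root, exist_ok=True)
    os.environ["HF_HOME"] = cache_root
    os.environ["TRANSFORMERS_CACHE"] = cache_root
    return cache_root


if __name__ == "__main__":
    configure_hf_cache("/tmp")
    # Import transformers only after the environment is configured, e.g.:
    # from transformers import AutoModelForCausalLM
    # model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")
    print("HF_HOME =", os.environ["HF_HOME"])
```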
