Hi Hugging Face community, I get a “no space left on device” error when a SageMaker training job tries to download a large model from the Hugging Face Hub. To reproduce the issue, simply replace the model id with “EleutherAI/gpt-neox-20b” in this notebook. Both g5.8xlarge and p4d.24xlarge instances show the same error. I tried the following solutions, but neither worked:
- Increased volume_size to 500 GB in the HuggingFace estimator. It seems that the volume_size parameter is ignored when the instance type is g5 or p4.
- Changed the cache dir by setting
os.environ['TRANSFORMERS_CACHE'] = "/opt/ml/checkpoints/"
as indicated here.
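To narrow down which volume is actually filling up, a minimal sketch like the one below could be run at the top of the training script. It only uses the Python standard library. The candidate paths other than /opt/ml/checkpoints are my assumptions about where data might land in the container, and `free_space_gb` is a hypothetical helper name, not part of any library. As far as I know, TRANSFORMERS_CACHE is read when transformers is imported, so the variable would have to be set inside the training script (not only in the notebook) and before the import.

```python
import os
import shutil


def free_space_gb(paths):
    """Return {path: free space in GiB} for every path that exists as a directory."""
    report = {}
    for path in paths:
        if os.path.isdir(path):
            # disk_usage reports on the filesystem that contains `path`
            report[path] = shutil.disk_usage(path).free / 1024**3
    return report


if __name__ == "__main__":
    # Assumed candidate locations; only /opt/ml/checkpoints comes from the post.
    candidates = [
        "/opt/ml/checkpoints",
        "/opt/ml/model",
        "/tmp",
        os.path.expanduser("~/.cache"),
    ]
    for path, free in free_space_gb(candidates).items():
        print(f"{path}: {free:.1f} GiB free")
```

If the directory used as the cache reports far less free space than the model needs, the download is probably landing on the small root volume rather than on the larger instance storage.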
Any suggestions are welcome. Thanks!