"No space left on device" when using HuggingFace + SageMaker

@philschmid Sorry, I never meant it that way. Thanks.

Hi All,

Here is my working solution for the space issue.

If we don’t want to save all the checkpoints, we can use the option below.

  1. Increase “volume_size” by setting this parameter in the HuggingFace estimator.
  2. Set the “save_total_limit” parameter in TrainingArguments.
    Ex: save_total_limit = 2

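This means the Trainer keeps at most 2 checkpoints on disk and deletes older ones as new ones are written. With load_best_model_at_end enabled, those are the best checkpoint and the last checkpoint (so that we can resume training from it); without it, they are simply the two most recent checkpoints.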
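A minimal sketch of the TrainingArguments side of this option. The output directory and the save interval are placeholders I picked for illustration, not values from my actual job; only save_total_limit = 2 comes from the steps above.

```python
# Keyword arguments intended for transformers.TrainingArguments.
# Only save_total_limit comes from the post; the rest are illustrative.
training_kwargs = {
    "output_dir": "./checkpoints",  # placeholder path (assumption)
    "save_strategy": "steps",       # write a checkpoint every save_steps steps
    "save_steps": 500,              # assumption: choose what fits your run
    "save_total_limit": 2,          # keep at most 2 checkpoints on disk
}


def build_training_args():
    """Create the TrainingArguments object (requires transformers installed)."""
    # Imported lazily so the dict above can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(**training_kwargs)
```

To always keep the best checkpoint as one of the two, load_best_model_at_end would also need to be enabled, which in turn requires evaluation to run on the same schedule as saving.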

Reference

If we want to save all the checkpoints and use them for future analysis, we can use the option below.

  1. Increase volume_size by setting this parameter in the HuggingFace estimator.
  2. Use checkpointing, which saves all checkpoints in /opt/ml/checkpoints, a directory that is kept in sync with the S3 bucket defined in the HuggingFace estimator.

Set the output_dir parameter in hyperparameters → 'output_dir': '/opt/ml/checkpoints'
Set the checkpoint_s3_uri parameter in the HuggingFace estimator → checkpoint_s3_uri='s3://sm-pipelines/checkpoints'
Set the output_dir parameter in TrainingArguments in the training script, so that checkpoints are written to the /opt/ml/checkpoints directory that is synced with the S3 bucket → output_dir=args.output_dir
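Put together, the three settings above could look roughly like the sketch below. The script name, source directory, instance type, volume size and framework versions are assumptions for illustration only; the output_dir and checkpoint_s3_uri values are the ones from this post.

```python
import argparse

# --- Launcher side (notebook or pipeline step) ---

# SageMaker passes these to the training script as command line arguments.
hyperparameters = {"output_dir": "/opt/ml/checkpoints"}

estimator_kwargs = {
    "entry_point": "train.py",         # hypothetical script name
    "source_dir": "./scripts",         # hypothetical directory
    "instance_type": "ml.p3.2xlarge",  # assumption
    "instance_count": 1,
    "volume_size": 200,                # in GB; assumption, size it for your checkpoints
    "checkpoint_s3_uri": "s3://sm-pipelines/checkpoints",
    "transformers_version": "4.26",    # assumption: use a supported combination
    "pytorch_version": "1.13",         # assumption
    "py_version": "py39",              # assumption
    "hyperparameters": hyperparameters,
}


def build_estimator(role):
    """Create the estimator (requires the sagemaker SDK and an IAM role ARN)."""
    from sagemaker.huggingface import HuggingFace

    return HuggingFace(role=role, **estimator_kwargs)


# --- Training script side (train.py) ---

def parse_args(argv=None):
    """Read output_dir from the hyperparameters passed on the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_dir", type=str, default="/opt/ml/checkpoints")
    # parse_known_args ignores any other hyperparameters the job may pass.
    args, _ = parser.parse_known_args(argv)
    return args
```

In the training script, the parsed value is then passed on as TrainingArguments(output_dir=args.output_dir, ...), so everything the Trainer writes there is uploaded to the checkpoint_s3_uri location.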

Hope this helps.
