@philschmid Sorry, I never meant it that way. Thanks.
Hi All,
Here is my working solution for the disk space issue.
If we don't want to save all the checkpoints, we can go with the option below:
- Increase the volume_size parameter; it needs to be set in the HuggingFace estimator.
- Set the save_total_limit parameter in TrainingArguments.
Ex: save_total_limit = 2
This means only 2 checkpoints are kept: with load_best_model_at_end=True, these are the best checkpoint and the last checkpoint (so we can still resume training from it). See the sketch after this list.
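A minimal sketch of this first option, assuming a standard SageMaker setup; the entry point script, instance settings, and framework versions below are placeholder assumptions, not from the original post:

```python
# Launcher side: a SageMaker HuggingFace estimator with a larger EBS volume.
# entry_point, instance settings, and versions are placeholders.
import sagemaker
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point="train.py",            # hypothetical training script
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=sagemaker.get_execution_role(),
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
    volume_size=100,                   # GB of EBS storage on the training instance
)

# Training script side: cap the number of checkpoints kept on disk.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/opt/ml/model",
    evaluation_strategy="epoch",       # load_best_model_at_end requires matching
    save_strategy="epoch",             # evaluation and save strategies
    save_total_limit=2,                # keep only 2 checkpoints on disk
    load_best_model_at_end=True,       # so the best checkpoint is always retained
)
```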
If we want to save all the checkpoints and use them for future analysis, we can go with the option below:
- Increase the volume_size parameter; it needs to be set in the HuggingFace estimator.
- Use checkpointing, which saves all checkpoints in /opt/ml/checkpoints, a directory that is kept in sync with an S3 bucket defined in the HuggingFace estimator:
  - Set the output_dir parameter in hyperparameters → 'output_dir': '/opt/ml/checkpoints'
  - Set the checkpoint_s3_uri parameter in the HuggingFace estimator → checkpoint_s3_uri='s3://sm-pipelines/checkpoints'
  - Set the output_dir parameter in TrainingArguments so that the checkpoints are saved in the /opt/ml/checkpoints directory that is synced with the S3 bucket → output_dir=args.output_dir
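And a minimal sketch of this checkpointing option, using the paths from the steps above; again, the entry point, instance settings, and framework versions are placeholder assumptions:

```python
# Launcher side: enable SageMaker checkpointing so that /opt/ml/checkpoints
# on the instance is continuously synced to the given S3 location.
import sagemaker
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point="train.py",                       # hypothetical training script
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=sagemaker.get_execution_role(),
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
    volume_size=100,                              # room for all checkpoints
    hyperparameters={"output_dir": "/opt/ml/checkpoints"},
    checkpoint_s3_uri="s3://sm-pipelines/checkpoints",
)

# Training script (train.py) side: read output_dir from the hyperparameters,
# which SageMaker passes to the script as command-line arguments.
import argparse
from transformers import TrainingArguments

parser = argparse.ArgumentParser()
parser.add_argument("--output_dir", type=str, default="/opt/ml/checkpoints")
args, _ = parser.parse_known_args()

training_args = TrainingArguments(
    output_dir=args.output_dir,   # checkpoints land in /opt/ml/checkpoints
    save_strategy="epoch",        # save every epoch; all checkpoints are kept
)
```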
Hope this helps.