"No space left on device" when using HuggingFace + SageMaker

Vinayaks117 · April 27, 2022, 6:47am

@philschmid Sorry, I never meant it that way. Thanks.

Hi All,

Here is my working solution for the space issue.

If we don’t want to keep all the checkpoints, we can go with the option below.

Increase “volume_size”; this parameter is set in the HuggingFace estimator.
Set the “save_total_limit” parameter in TrainingArguments.
Ex: save_total_limit = 2

With load_best_model_at_end=True, this keeps only 2 checkpoints: the best checkpoint and the last checkpoint (so that we can resume training from it). Without that flag, the 2 most recent checkpoints are kept.

Reference
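
For illustration, the first option could look roughly like the sketch below. Only volume_size and save_total_limit come from the steps above; the script name, instance type, container versions, sizes and the output directory are assumptions I picked for the example, so adjust them to your own job. The imports are done inside the helper functions so the settings can be read without the sagemaker or transformers packages installed.

```python
# Sketch only: every concrete value here is an illustrative assumption.

# Keyword arguments for sagemaker.huggingface.HuggingFace (launcher side).
estimator_kwargs = {
    "entry_point": "train.py",         # hypothetical training script
    "instance_type": "ml.p3.2xlarge",  # assumed instance type
    "instance_count": 1,
    "transformers_version": "4.17.0",  # assumed container versions
    "pytorch_version": "1.10.2",
    "py_version": "py38",
    "volume_size": 200,                # storage volume size in GB (assumed value)
}

# Keyword arguments for transformers.TrainingArguments (inside train.py).
training_kwargs = {
    "output_dir": "/opt/ml/model",  # assumed; SageMaker's model directory
    "save_strategy": "steps",
    "save_steps": 500,              # assumed checkpoint interval
    "save_total_limit": 2,          # older checkpoints are deleted
}


def build_estimator(role):
    """Create the estimator; needs the sagemaker SDK and an IAM role ARN."""
    from sagemaker.huggingface import HuggingFace

    return HuggingFace(role=role, **estimator_kwargs)


def build_training_args():
    """Create the TrainingArguments; needs the transformers package."""
    from transformers import TrainingArguments

    return TrainingArguments(**training_kwargs)
```

Calling build_estimator(role).fit(...) from the notebook and build_training_args() inside the training script would then apply both settings.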

If we want to save all the checkpoints and use them for future analysis, we can go with the option below.

Increase volume_size; this parameter is set in the HuggingFace estimator.
Use checkpointing, which saves all checkpoints in /opt/ml/checkpoints. SageMaker keeps that directory in sync with the S3 location defined in the HuggingFace estimator.

Set the output_dir parameter in hyperparameters → 'output_dir': '/opt/ml/checkpoints'
Set the checkpoint_s3_uri parameter in the HuggingFace estimator → checkpoint_s3_uri='s3://sm-pipelines/checkpoints'
Set the output_dir parameter in TrainingArguments so that checkpoints are written to the “/opt/ml/checkpoints” directory, which is synced with the S3 bucket → output_dir=args.output_dir
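
The three settings above can be sketched as follows. This is a minimal outline, assuming the training script reads its hyperparameters with argparse (SageMaker passes them as command-line flags such as --output_dir); the helper names parse_args and build_training_args are made up for the example, and checkpoint_kwargs is meant to be passed to the HuggingFace estimator together with your usual arguments.

```python
import argparse

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # local directory SageMaker syncs to S3

# Launcher side: extra keyword arguments for the HuggingFace estimator.
checkpoint_kwargs = {
    "checkpoint_s3_uri": "s3://sm-pipelines/checkpoints",  # bucket from the post
    "hyperparameters": {"output_dir": CHECKPOINT_DIR},
}


# Training-script side (for example train.py).
def parse_args(argv=None):
    """Read --output_dir from the flags SageMaker passes to the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_dir", type=str, default=CHECKPOINT_DIR)
    # parse_known_args ignores any other hyperparameters passed to the script.
    args, _ = parser.parse_known_args(argv)
    return args


def build_training_args(args):
    """Point the Trainer's checkpoints at the synced directory."""
    from transformers import TrainingArguments  # needs transformers installed

    return TrainingArguments(output_dir=args.output_dir)
```

With this layout every checkpoint the Trainer writes under output_dir ends up in the S3 location, so nothing needs to be copied by hand.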

Hope this helps.
