"No space left on device" when using HuggingFace + SageMaker

Hi, I’m trying to train a model using a HuggingFace estimator in SageMaker but I keep getting this error after a few minutes:

[1,15]: File "pyarrow/ipc.pxi", line 365, in pyarrow.lib._CRecordBatchWriter.write_batch
[1,15]: File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
[1,15]:OSError: [Errno 28] Error writing bytes to file. Detail: [errno 28] No space left on device

I’m not sure what is triggering this problem, because the volume size is large (volume_size=1024).
My hyperparameters are:
{'per_device_train_batch_size': 4,
 'per_device_eval_batch_size': 4,
 'model_name_or_path': 'google/mt5-small',
 'dataset_name': 'mlsum',
 'dataset_config': 'es',
 'text_column': 'text',
 'summary_column': 'summary',
 'max_target_length': 64,
 'do_train': True,
 'do_eval': True,
 'do_predict': True,
 'predict_with_generate': True,
 'output_dir': '/opt/ml/model',
 'num_train_epochs': 3,
 'seed': 7,
 'fp16': True,
 'save_strategy': 'no'}
And my estimator is:

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point='run_summarization.py',  # script
    source_dir='./examples/seq2seq',  # relative path to example
    git_config=git_config,
    instance_type='ml.p3.16xlarge',
    instance_count=2,
    volume_size=1024,
    transformers_version='4.4.2',
    pytorch_version='1.6.0',
    py_version='py36',
    role=role,
    hyperparameters=hyperparameters,
    distribution=distribution
)

Any help would be very much appreciated!


Some more details:

  • I’m calling fit without extra params, just like this:
    huggingface_estimator.fit()
  • The entry point is this public script:
    transformers/run_summarization.py at master · huggingface/transformers · GitHub
  • From traceback I see that the error is happening on line 433:
    load_from_cache_file=not data_args.overwrite_cache,
    (I guess something is happening here but not totally sure what)
  • At the moment I’m not saving checkpoints (to prevent that causing the error), using the param ‘save_strategy’: ‘no’
  • The dataset isn’t that big, 1.7 GB.
  • The model is quite big, but less than 3 GB
  • My volume is 1024 GB

Could you also include your .fit() call so that the example can be reproduced, and a link to run_summarization.py if it is public? Do you have a sense of what could be taking up a lot of storage? Do you checkpoint a large model very frequently, or do you read a large dataset?

Hi Olivier, thanks for your response! I’ve just edited my question including that information. Let me know if you need more data

Hey @LeoCordoba,

Your error is coming from caching the dataset. The datasets library caches the dataset on disk in order to work with it. The default cache_dir is ~/.cache/huggingface/datasets. This directory does not seem to be on the mounted EBS volume.

To fix your problem you need to define a cache_dir in the load_dataset call. You can copy run_summarization.py to your local filesystem and change
line 313
from

datasets = load_dataset(data_args.dataset_name, data_args.dataset_config_name)

to

datasets = load_dataset(data_args.dataset_name, data_args.dataset_config_name, cache_dir="/opt/ml/input")
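To see which filesystem is actually filling up, a small diagnostic can be placed at the top of the training script. This is only a minimal sketch: the helper name free_gb and the list of candidate paths are illustrative assumptions and are not part of run_summarization.py. It relies solely on shutil.disk_usage from the standard library.

```python
import os
import shutil


def free_gb(path):
    """Return the free space (in GB) on the filesystem that holds `path`."""
    path = os.path.abspath(os.path.expanduser(path))
    # Walk up to the nearest existing parent so a missing directory does not raise.
    while not os.path.exists(path):
        path = os.path.dirname(path)
    return shutil.disk_usage(path).free / 1024**3


# Candidate locations (assumed for illustration): the default datasets cache,
# the SageMaker input and model directories, and /tmp.
for candidate in ("~/.cache/huggingface/datasets", "/opt/ml/input", "/opt/ml/model", "/tmp"):
    print(f"{candidate}: {free_gb(candidate):.1f} GB free")
```

If the default cache location reports far less free space than /opt/ml/input, that would be consistent with the cache not living on the large volume.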

Thanks @philschmid ! That makes sense :ok_hand:


Hey @philschmid,
Could you help me with a similar issue here? I am not using load_dataset; instead I’m creating a Dataset from pandas through Dataset.from_pandas.
I’m getting the same error: [Errno 28] No space left on device: '/home/ec2-user/.cache/huggingface'

I have sufficient storage on the instance, and the dataset isn’t more than 20 GB either. Do you have a possible solution for this, given that Dataset.from_pandas doesn’t seem to accept a cache_dir argument?

Any help is appreciated.

Hey @spranjal25,

It is also possible to set the cache directory via an environment variable.

The default cache directory is ~/.cache/huggingface/datasets. Change the cache location by setting the shell environment variable HF_DATASETS_CACHE to another directory:

HF_DATASETS_CACHE="/path/to/another/directory"

Here you can find more documentation on this: Cache management — datasets 1.18.3 documentation

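For scripts that never call load_dataset (for example when using Dataset.from_pandas), the same variable can be set from Python instead of the shell. The following is a sketch under assumptions: the helper name configure_hf_cache and the "datasets" subdirectory are made up for illustration, and the variable should be set before datasets is imported, since to my knowledge the library reads it at import time.

```python
import os


def configure_hf_cache(base_dir):
    """Create a cache directory under `base_dir` and point HF_DATASETS_CACHE at it."""
    datasets_cache = os.path.join(base_dir, "datasets")
    os.makedirs(datasets_cache, exist_ok=True)
    os.environ["HF_DATASETS_CACHE"] = datasets_cache
    return datasets_cache


# Usage (assumed path, taken from the cache_dir suggested earlier in this thread):
#   configure_hf_cache("/opt/ml/input")
#   from datasets import Dataset   # import only after the variable is set
#   ds = Dataset.from_pandas(df)
```

Returning the path makes it easy to log where the cache ended up, which helps when comparing it against the volume that has free space.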

Thanks, that worked. @philschmid

Hi @philschmid

Greeting!!

We are also experiencing “No space left on device” when training a BERT model using a HuggingFace estimator in SageMaker pipelines training job.

Could you please help?
Please let me know if you need any additional details.

bert_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",
    base_job_name=base_job_prefix + "/training",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    volume_size=1024,
    role=role,
    transformers_version="4.11.0",
    pytorch_version="1.9.0",
    py_version="py38",
    hyperparameters=hyperparameters,
    sagemaker_session=sagemaker_session,
)

Additional details:

  • Dataset size = 50 MB
  • The model is quite big, but less than 3 GB
  • My volume is 1024 GB
  • Model checkpoints are being saved in “/opt/ml/model/model_02” directory

Error details:

Saving model checkpoint to /model_02/checkpoint-47866
Configuration saved in /model_02/checkpoint-47866/config.json
Model weights saved in /model_02/checkpoint-47866/pytorch_model.bin
tokenizer config file saved in /model_02/checkpoint-47866/tokenizer_config.json
Special tokens file saved in /model_02/checkpoint-47866/special_tokens_map.json
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 379, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 499, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 256, in <module>
    trainer.train()
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1383, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1487, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1579, in _save_checkpoint
    torch.save(self.optimizer.state_dict(), os.path.join(output_dir, OPTIMIZER_NAME))
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 380, in save
    return
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 259, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:298] . unexpected pos 789763200 vs 789763088
terminate called after throwing an instance of 'c10::Error'
  what(): [enforce fail at inline_container.cc:298] . unexpected pos 789763200 vs 789763088

Thank you.

I have a similar error while tokenizing the dataset:
"[Errno 28] Error writing bytes to file. Detail: [errno 28] No space left on device"

I use the original run_mlm.py script to train an MLM on my data using SageMaker. I tried both estimators, HuggingFace and PyTorch, and hit the same issue. I tried all the solutions mentioned above, but none of them helped.

Running tokenizer on dataset line_by_line #41: 24%|██▍ | 77/320 [01:45<05:32, 1.37s/ba]
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 2354, in _map_single
    writer.write_batch(batch)
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_writer.py", line 496, in write_batch
    self.write_table(pa_table, writer_batch_size)
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_writer.py", line 513, in write_table
    self.pa_writer.write_batch(batch)
  File "pyarrow/ipc.pxi", line 408, in pyarrow.lib._CRecordBatchWriter.write_batch
  File "pyarrow/error.pxi", line 112, in pyarrow.lib.check_status
OSError: [Errno 28] Error writing bytes to file. Detail: [errno 28] No space left on device

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/multiprocess/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 519, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 486, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/datasets/fingerprint.py", line 458, in wrapper
    out = func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 2360, in _map_single
    writer.finalize()
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_writer.py", line 528, in finalize
    self.pa_writer.close()
  File "pyarrow/ipc.pxi", line 445, in pyarrow.lib._CRecordBatchWriter.close
  File "pyarrow/error.pxi", line 112, in pyarrow.lib.check_status
OSError: [Errno 28] Error writing bytes to file. Detail: [errno 28] No space left on device

I solved it today. HF’s run_mlm.py calls the datasets.map() function (link to the line in code), and I added the argument keep_in_memory=True to that call to avoid caching the tokenized dataset on disk.


Hello @Vinayaks117,

Could you please share your training script? The error is probably similar to the others in this thread, meaning you need to adjust the cache_dir for transformers or datasets.

Hi @philschmid

As requested, please find the training script in my GitHub repo. I could not attach it here, so I uploaded it to one of my repos.

Thank you.

Thanks for providing!
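Could you try to adjust the cache dir for transformers and datasets at the top of your script?

import os
# set these before importing transformers or datasets
cache_dir = "cache"
os.makedirs(cache_dir, exist_ok=True)  # os.makedirs returns None, so keep the path separately
os.environ['TRANSFORMERS_CACHE'] = cache_dir
os.environ['HF_DATASETS_CACHE'] = cache_dir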

Thanks for the details.

If we look at the TrainingArguments in the training script, we are saving checkpoints at the end of every epoch:

save_strategy=IntervalStrategy.EPOCH and output_dir = '/opt/ml/model/model_02'

Note: I think we run out of space because we are saving a checkpoint at every epoch.

So should we update cache_dir or os.environ['TRANSFORMERS_CACHE'] in TrainingArguments so that the checkpoints are stored in cache_dir?

cache_dir = "cache"
os.makedirs(cache_dir, exist_ok=True)
os.environ['TRANSFORMERS_CACHE'] = cache_dir
os.environ['HF_DATASETS_CACHE'] = cache_dir

@philschmid Could you please clarify on above query?
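One way to check whether checkpoints are what fills the disk is to measure each checkpoint-* directory under the output directory. This is a minimal standard-library sketch; the helper names dir_size_bytes and checkpoint_sizes are assumptions made for illustration and are not part of transformers. It may also be worth noting that the log above prints paths under /model_02 rather than /opt/ml/model/model_02, so checking where the checkpoints actually land could be informative.

```python
import os


def dir_size_bytes(root):
    """Sum the sizes of all regular files below `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            if not os.path.islink(file_path):  # skip symlinks to avoid double counting
                total += os.path.getsize(file_path)
    return total


def checkpoint_sizes(output_dir):
    """Map each checkpoint-* subdirectory of `output_dir` to its size in bytes."""
    sizes = {}
    if not os.path.isdir(output_dir):
        return sizes
    for entry in sorted(os.listdir(output_dir)):
        path = os.path.join(output_dir, entry)
        if entry.startswith("checkpoint-") and os.path.isdir(path):
            sizes[entry] = dir_size_bytes(path)
    return sizes


# Directory taken from the post above; an empty result means nothing was found there.
for name, size in checkpoint_sizes("/opt/ml/model/model_02").items():
    print(f"{name}: {size / 1024**3:.2f} GB")
```

Multiplying the size of one checkpoint by the number of epochs gives a rough upper bound on what the run would need if no checkpoints are deleted.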

@philschmid I am running into a similar issue, but about halfway through my training.

OSError: [Errno 28] No space left on device
Traceback (most recent call last):
  File "train.py", line 182, in <module>
    trainer.train()
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1328, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1409, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1499, in _save_checkpoint
    torch.save(self.optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
  File "/opt/conda/lib/python3.6/site-packages/torch/serialization.py", line 373, in save
    return
  File "/opt/conda/lib/python3.6/site-packages/torch/serialization.py", line 259, in __exit__
    self.file_like.write_end_of_file()

I have set cache_dir="opt/ml/input" in my load_dataset call. I have also assigned an additional 100 GB of volume. I also tried setting

import os
cache_dir = "cache"
os.makedirs(cache_dir, exist_ok=True)
os.environ['TRANSFORMERS_CACHE'] = cache_dir
os.environ['HF_DATASETS_CACHE'] = cache_dir

I’m also passing the train_volume_size parameter to the estimator:

estimator = HuggingFace(
    entry_point          = 'train.py',             # fine-tuning script used in the training job
    source_dir           = 'embed_source',         # directory where the fine-tuning script is stored
    instance_type        = instance_type,          # instance type used for the training job
    instance_count       = 1,                      # the number of instances used for training
    role                 = get_execution_role(),   # IAM role used in the training job to access AWS resources
    transformers_version = '4.6',                  # the transformers version used in the training job
    pytorch_version      = '1.7',                  # the PyTorch version used in the training job
    py_version           = 'py36',                 # the Python version used in the training job
    hyperparameters      = hyperparameters,        # the hyperparameters used for running the training job
    metric_definitions   = metric_definitions,     # the metric regex definitions used to extract logs
    output_path          = os.path.join(dataconnector.version_s3_prefix, "models"),
    code_location        = os.path.join(dataconnector.version_s3_prefix, "models"),
    train_volume_size    = 100
)

But to no avail - fails after about 2 hours. Any ideas of what else I can do?

In v2 of the SageMaker Python SDK, the estimator parameter for increasing the volume is volume_size, not train_volume_size (see the documentation).
Additionally, you could use checkpointing: files saved to /opt/ml/checkpoints are synced to an S3 bucket that you define in the HuggingFace estimator (HF doc, AWS doc).

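The two suggestions above could be combined roughly as follows. This is a sketch, not a tested job definition: the bucket name is a placeholder, and passing output_dir as a hyperparameter assumes the training script reads it (as the Hugging Face example scripts do). volume_size, checkpoint_s3_uri and checkpoint_local_path are estimator arguments in SageMaker Python SDK v2.

```python
# Keyword arguments for the HuggingFace estimator (SageMaker Python SDK v2).
estimator_kwargs = {
    "volume_size": 100,  # GB; v2 name for what v1 called train_volume_size
    "checkpoint_s3_uri": "s3://your-bucket/checkpoints",  # placeholder bucket
    "checkpoint_local_path": "/opt/ml/checkpoints",  # directory synced to S3
}

# Assumption: the training script accepts an output_dir hyperparameter and
# writes its checkpoints there, so they land in the synced directory.
hyperparameters = {"output_dir": estimator_kwargs["checkpoint_local_path"]}

# estimator = HuggingFace(entry_point="train.py", ...,
#                         hyperparameters=hyperparameters, **estimator_kwargs)
print(estimator_kwargs, hyperparameters)
```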

Thanks, I updated the volume size and added checkpointing. It seems the job fails before I complete the first epoch though. My training data consists of 1.7M short text descriptions (~100 MB) and 23 classes. Would a distributed approach help here? Like in this post

Does your job still fail with the error above, or do you see something different?

Distributed training can help you either speed up your training or make it possible to fine-tune models that do not fit onto a single GPU. Since your corpus doesn’t sound that big, there is no need to go with distributed training yet.