"No space left on device" when using HuggingFace + SageMaker

Hi, I’m trying to train a model using a HuggingFace estimator in SageMaker but I keep getting this error after a few minutes:

[1,15]: File "pyarrow/ipc.pxi", line 365, in pyarrow.lib._CRecordBatchWriter.write_batch
[1,15]: File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
[1,15]:OSError: [Errno 28] Error writing bytes to file. Detail: [errno 28] No space left on device
[1,15]:

I’m not sure what is triggering this problem, since the volume size is already large (volume_size=1024).
My hyperparameters are:
{'per_device_train_batch_size': 4,
'per_device_eval_batch_size': 4,
'model_name_or_path': 'google/mt5-small',
'dataset_name': 'mlsum',
'dataset_config': 'es',
'text_column': 'text',
'summary_column': 'summary',
'max_target_length': 64,
'do_train': True,
'do_eval': True,
'do_predict': True,
'predict_with_generate': True,
'output_dir': '/opt/ml/model',
'num_train_epochs': 3,
'seed': 7,
'fp16': True,
'save_strategy': 'no'}
And my estimator is:

# create the Estimator

huggingface_estimator = HuggingFace(
    entry_point='run_summarization.py',   # script
    source_dir='./examples/seq2seq',      # relative path to example
    git_config=git_config,
    instance_type='ml.p3.16xlarge',
    instance_count=2,
    volume_size=1024,
    transformers_version='4.4.2',
    pytorch_version='1.6.0',
    py_version='py36',
    role=role,
    hyperparameters=hyperparameters,
    distribution=distribution
)

Any help would be very much appreciated!


Some more details:

  • I’m calling fit without extra params, just like this:
    huggingface_estimator.fit()
  • The entry point is this public script:
    transformers/run_summarization.py at master · huggingface/transformers · GitHub
  • From the traceback I can see that the error happens on line 433:
    load_from_cache_file=not data_args.overwrite_cache,
    (I guess something is going on there, but I’m not totally sure what.)
  • At the moment I’m not saving checkpoints (to rule that out as the cause), by setting 'save_strategy': 'no'.
  • The dataset isn’t that big: 1.7 GB.
  • The model is fairly big, but less than 3 GB.
  • My volume is 1024 GB.

Could you also include your .fit() call so that the example can be reproduced, and a link to run_summarization.py if it’s public? Do you have a sense of what could be taking up a lot of storage? Do you checkpoint a large model very frequently, or read a large dataset?

Hi Olivier, thanks for your response! I’ve just edited my question to include that information. Let me know if you need more details.

Hey @LeoCordoba,

Your error comes from dataset caching. The datasets library caches the dataset on disk so it can work with it efficiently. The default cache_dir is ~/.cache/huggingface/datasets, and this directory does not seem to be on the mounted EBS volume.
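
A quick way to confirm where the space is actually going is to compare free space on the two filesystems. This is a hypothetical debugging snippet (not from this thread) you could drop at the top of the training script; it assumes the home directory lives on the container's root filesystem and that /opt/ml is where the EBS volume is mounted:

import shutil

# compare free space on the root filesystem (where ~/.cache lives)
# with the mounted volume under /opt/ml
for path in ("/", "/opt/ml"):
    total, used, free = shutil.disk_usage(path)
    print(f"{path}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")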

To fix this, you need to define a cache_dir in the load_dataset call. You can copy run_summarization.py to your local filesystem and change line 313 from

datasets = load_dataset(data_args.dataset_name, data_args.dataset_config_name)

to

datasets = load_dataset(data_args.dataset_name, data_args.dataset_config_name, cache_dir="/opt/ml/input")
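
For reference, here is a minimal sketch of how the estimator could then pick up the locally edited script: drop git_config and point source_dir at the local folder containing the modified run_summarization.py. The folder name ./scripts/seq2seq is hypothetical, and you will likely also want to copy the example's requirements.txt into it; role, hyperparameters and distribution are the same objects as in the original post.

huggingface_estimator = HuggingFace(
    entry_point='run_summarization.py',
    source_dir='./scripts/seq2seq',   # local copy containing the cache_dir change
    instance_type='ml.p3.16xlarge',
    instance_count=2,
    volume_size=1024,
    transformers_version='4.4.2',
    pytorch_version='1.6.0',
    py_version='py36',
    role=role,
    hyperparameters=hyperparameters,
    distribution=distribution,
)

huggingface_estimator.fit()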

Thanks @philschmid! That makes sense :ok_hand:
