Hi, I’m trying to train a model using a HuggingFace estimator in SageMaker but I keep getting this error after a few minutes:
[1,15]: File "pyarrow/ipc.pxi", line 365, in pyarrow.lib._CRecordBatchWriter.write_batch
[1,15]: File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
[1,15]:OSError: [Errno 28] Error writing bytes to file. Detail: [errno 28] No space left on device
[1,15]:
I’m not sure what is triggering this, because the attached volume is large (volume_size=1024, i.e. 1024 GB).
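To see which filesystem is actually filling up, I’m thinking of logging free space at the start of the entry point. A minimal sketch of my own debugging idea (not something from run_summarization.py):

import shutil

# Log free space on the container's filesystems at startup, to check whether
# the attached EBS volume (mounted at /opt/ml) or the root filesystem / /tmp
# is the one running out of space.
for path in ('/', '/tmp', '/opt/ml'):
    total, used, free = shutil.disk_usage(path)
    print(f'{path}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB')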
My hyperparameters are:
{'per_device_train_batch_size': 4,
 'per_device_eval_batch_size': 4,
 'model_name_or_path': 'google/mt5-small',
 'dataset_name': 'mlsum',
 'dataset_config': 'es',
 'text_column': 'text',
 'summary_column': 'summary',
 'max_target_length': 64,
 'do_train': True,
 'do_eval': True,
 'do_predict': True,
 'predict_with_generate': True,
 'output_dir': '/opt/ml/model',
 'num_train_epochs': 3,
 'seed': 7,
 'fp16': True,
 'save_strategy': 'no'}
And my estimator is:
# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point='run_summarization.py',  # script
    source_dir='./examples/seq2seq',     # relative path to example
    git_config=git_config,
    instance_type='ml.p3.16xlarge',
    instance_count=2,
    volume_size=1024,
    transformers_version='4.4.2',
    pytorch_version='1.6.0',
    py_version='py36',
    role=role,
    hyperparameters=hyperparameters,
    distribution=distribution,
)
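In case it matters, git_config and distribution are defined earlier in my notebook, following the standard HuggingFace SageMaker example. Roughly like this (the exact branch and settings below are from memory, so treat them as assumptions):

git_config = {'repo': 'https://github.com/huggingface/transformers.git',
              'branch': 'v4.4.2'}

# SageMaker distributed data parallel across the two ml.p3.16xlarge instances
distribution = {'smdistributed': {'dataparallel': {'enabled': True}}}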
Any help would be very much appreciated!
Some more details:
- I’m calling fit without extra params, just like this: huggingface_estimator.fit()
- The entry point is this public script: transformers/run_summarization.py at master · huggingface/transformers · GitHub
- From the traceback I can see that the error happens on line 433 of that script:
  load_from_cache_file=not data_args.overwrite_cache,
  (I guess the dataset caching step is writing something to disk there, but I’m not totally sure what; see the sketch after this list.)
- At the moment I’m not saving any checkpoints (to rule that out as the cause), via 'save_strategy': 'no'
- The dataset isn’t that big, 1.7 GB.
- The model is quite big, but less than 3 GB
- My volume is 1024 GB
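To make the failing step concrete, here is a minimal sketch of what I understand line 433 belongs to (heavily simplified from run_summarization.py; preprocess is just a stand-in for the real tokenization function):

from datasets import load_dataset

raw_datasets = load_dataset('mlsum', 'es')

def preprocess(examples):
    # the real script tokenizes the 'text' and 'summary' columns here
    return examples

# Each .map() call materializes its result as Arrow cache files on disk
# (under ~/.cache/huggingface/datasets by default), which is where the
# pyarrow write in the traceback comes from.
train_dataset = raw_datasets['train'].map(
    preprocess,
    batched=True,
    load_from_cache_file=True,  # i.e. not data_args.overwrite_cache
)

If that cache is what’s filling the disk, I believe it can be redirected with the HF_DATASETS_CACHE environment variable, though I haven’t confirmed where that path lives inside the SageMaker container.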