Using Huggingface Trainer in Colab -> Disk Full

anon50011480 · May 3, 2021, 11:27am

Hello everyone!

I thought I’d post this here first, as I am not sure if it is a bug or if I am doing something wrong.
I’m using the huggingface library to train an XLM-R token classifier. I originally wrote the training routine myself, which worked quite well, but I wanted to switch to the trainer for more advanced features like early stopping and easier setting of training arguments.

To prototype my code, I usually run it on a free Google Colab account. While the training process works, the code has crashed several times because the disk of the compute environment runs out of space. This is NOT my Google Drive space, but a separate disk of around 60 GB. I have observed that the used space keeps growing during training, but I have no idea where or what exactly is writing data. Once the disk is full, the code crashes.

The following are my training parameters/callbacks defined:

## Define Callbacks
from transformers import TrainerCallback

class PrinterCallback(TrainerCallback):
    def on_train_begin(self, args, state, control, **kwargs):
        # Bold banner printed once when training starts
        print('\033[1m'+ '=' * 25 + " Model Training " + '=' * 25 + '\033[0m')
    def on_epoch_begin(self, args, state, control, **kwargs):
        # Read from the `state` passed to the callback rather than the global `trainer`
        print('\n'+ '\033[1m'+ '=' * 25 +' Epoch {:} / {:} '.format(int(state.epoch) + 1, int(state.num_train_epochs)) + '=' * 25 + '\033[0m')


## Training parameters
# training arguments

training_args = TrainingArguments(
    output_dir='./checkpoints',           # output directory
    num_train_epochs=5,              # total # of training epochs
    per_device_train_batch_size=32,    # batch size per device during training
    per_device_eval_batch_size=32,     # batch size for evaluation
    warmup_steps=0,                # number of warmup steps for learning rate scheduler
    weight_decay=0,                   # strength of weight decay
    learning_rate=2e-5,               #2e-5 
    logging_dir='./logs',             # directory for storing logs
    evaluation_strategy= "epoch",     #"steps", "epoch", or "no"
    #eval_steps=100,
    save_total_limit=1,
    load_best_model_at_end=False,      #loads the model with the best evaluation score
    metric_for_best_model="weightedF1",
    greater_is_better=True
)

## Start training

# initialize huggingface trainer
trainer = Trainer(
        model=xlmr_model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=xlmr_tokenizer,
        compute_metrics=validate,
        callbacks=[PrinterCallback]
    )

trainer.train()

Any idea what is going wrong here?

Edit: Here is the Error as text from another run; apparently Torch is continuously writing something to disk, but why and what is it?

    ---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
    371             with _open_zipfile_writer(opened_file) as opened_zipfile:
--> 372                 _save(obj, opened_zipfile, pickle_module, pickle_protocol)
    373                 return

6 frames
/usr/local/lib/python3.7/dist-packages/torch/serialization.py in _save(obj, zip_file, pickle_module, pickle_protocol)
    490         num_bytes = storage.size() * storage.element_size()
--> 491         zip_file.write_record(name, storage.data_ptr(), num_bytes)
    492 

OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-36-3435b262f1ae> in <module>()
----> 1 trainer.train()

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
   1170                     self.control = self.callback_handler.on_step_end(self.args, self.state, self.control)
   1171 
-> 1172                     self._maybe_log_save_evaluate(tr_loss, model, trial, epoch)
   1173 
   1174                 if self.control.should_epoch_stop or self.control.should_training_stop:

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in _maybe_log_save_evaluate(self, tr_loss, model, trial, epoch)
   1267 
   1268         if self.control.should_save:
-> 1269             self._save_checkpoint(model, trial, metrics=metrics)
   1270             self.control = self.callback_handler.on_save(self.args, self.state, self.control)
   1271 

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in _save_checkpoint(self, model, trial, metrics)
   1317         elif self.is_world_process_zero() and not self.deepspeed:
   1318             # deepspeed.save_checkpoint above saves model/optim/sched
-> 1319             torch.save(self.optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
   1320             with warnings.catch_warnings(record=True) as caught_warnings:
   1321                 torch.save(self.lr_scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))

/usr/local/lib/python3.7/dist-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
    371             with _open_zipfile_writer(opened_file) as opened_zipfile:
    372                 _save(obj, opened_zipfile, pickle_module, pickle_protocol)
--> 373                 return
    374         _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
    375 

/usr/local/lib/python3.7/dist-packages/torch/serialization.py in __exit__(self, *args)
    257 
    258     def __exit__(self, *args) -> None:
--> 259         self.file_like.write_end_of_file()
    260         self.buffer.flush()
    261 

RuntimeError: [enforce fail at inline_container.cc:274] . unexpected pos 2212230208 vs 2212230096

BramVanroy · May 3, 2021, 5:36pm

There are three default arguments that are relevant here, but seeing that you set save_total_limit=1 I am not sure what else could be being saved…

https://github.com/huggingface/transformers/blob/fe82b1bfa07aa054ef70583a561cb7c3978c697f/src/transformers/training_args.py#L161-L172


save_strategy (:obj:`str` or :class:`~transformers.trainer_utils.IntervalStrategy`, `optional`, defaults to :obj:`"steps"`):
    The checkpoint save strategy to adopt during training. Possible values are:


        * :obj:`"no"`: No save is done during training.
        * :obj:`"epoch"`: Save is done at the end of each epoch.
        * :obj:`"steps"`: Save is done every :obj:`save_steps`.


save_steps (:obj:`int`, `optional`, defaults to 500):
    Number of updates steps before two checkpoint saves if :obj:`save_strategy="steps"`.
save_total_limit (:obj:`int`, `optional`):
    If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in
    :obj:`output_dir`.

Can you see what’s actually on the disk?

anon50011480 · May 3, 2021, 5:38pm

I’ll try setting save_strategy explicitly to "epoch". Right now it is probably saving at the default number of steps and, for whatever reason, can’t delete the saved checkpoints from the Colab/Google Drive disk.
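
A minimal sketch of what setting the strategy explicitly could look like. The `save_strategy` and `save_total_limit` names come from the docstring quoted above, and `./checkpoints` is the output directory from the original post. The helper name `build_training_args` is hypothetical, and nothing here is confirmed to fix the disk usage.

```python
# Checkpoint-related settings only; pass the remaining training arguments as overrides.
CHECKPOINT_KWARGS = {
    "output_dir": "./checkpoints",  # same directory as in the original post
    "save_strategy": "epoch",       # save once per epoch instead of every `save_steps`
    "save_total_limit": 1,          # ask the Trainer to keep only one checkpoint
}


def build_training_args(**overrides):
    """Hypothetical helper: build TrainingArguments with explicit save settings."""
    from transformers import TrainingArguments  # lazy import; requires transformers

    return TrainingArguments(**{**CHECKPOINT_KWARGS, **overrides})
```

For example, `training_args = build_training_args(num_train_epochs=5, learning_rate=2e-5)` would reproduce part of the configuration from the first post with the save strategy pinned to epochs.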

As for "Can you see what’s actually on the disk?":

There is a file explorer built into google colab and I can also explore the filesystem through ipython magic (i.e. using bash); but I didn’t really find where exactly the virtual disk for the python environment is mounted and therefore where the trainer is seemingly writing to (even though it should be working on the Google Drive mount).

Edit: I rechecked, and it appears that after starting the trainer, /root slowly fills up on the Colab disk; however, I cannot see the contents of that mount point. Curiously, save_total_limit=1 also does not seem to limit the checkpoints saved on my Google Drive partition: checkpoints are being stored every 500 steps and only sporadically deleted.
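
One way to narrow down what is growing is to measure directory sizes from Python instead of relying on the file explorer. The following is a rough sketch using only the standard library; the function names are made up for illustration, and scanning `/root` is simply a guess based on the observation above.

```python
import os
import shutil


def dir_size_bytes(path):
    """Sum the sizes of all regular files below `path`, skipping unreadable ones."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                pass  # file vanished or permission denied
    return total


def largest_subdirs(path, top=10):
    """Return (size, path) for the largest immediate subdirectories of `path`."""
    sizes = []
    try:
        entries = list(os.scandir(path))
    except OSError:
        return sizes
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            sizes.append((dir_size_bytes(entry.path), entry.path))
    return sorted(sizes, reverse=True)[:top]


def report(path="/root"):
    """Print overall disk usage and the biggest subdirectories of `path`."""
    usage = shutil.disk_usage("/")
    print(f"disk: {usage.used / 1e9:.1f} GB used of {usage.total / 1e9:.1f} GB")
    for size, sub in largest_subdirs(path):
        print(f"{size / 1e9:8.2f} GB  {sub}")
```

Calling `report("/root")` before training and again after a few hundred steps should show which subdirectory, hidden ones included, accounts for the growth.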

Kforcode · December 25, 2021, 8:23am

I think I found the reason: when the previous checkpoint is deleted, it goes to the Google Drive bin, and the bin does not delete it right away (only after 30 days), so the space stays occupied.
@anon50011480 and @BramVanroy, if you can verify this, we can override the _rotate_checkpoints method of the Trainer to also clean the Drive bin. That should resolve the issue.
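
If the bin hypothesis above holds, one possible workaround that does not need access to the bin is to shrink each file to zero bytes before deleting it, so that whatever lands in the bin is empty. This is only a sketch: `truncate_then_remove` is a hypothetical helper, whether Drive then actually frees the space is an untested assumption, and hooking it into the Trainer's checkpoint rotation is not shown.

```python
import os
import shutil


def truncate_then_remove(checkpoint_dir):
    """Empty every file under `checkpoint_dir`, then delete the directory.

    Assumption: a zero-byte file in the Drive bin takes up no meaningful space.
    """
    for root, _dirs, files in os.walk(checkpoint_dir):
        for name in files:
            # Opening in "w" mode truncates the file to zero bytes.
            with open(os.path.join(root, name), "w"):
                pass
    shutil.rmtree(checkpoint_dir)
```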

sgugger · December 27, 2021, 6:20pm

I’m not sure Colab gives a program enough permissions to go and delete the Drive bin. If that’s possible, we can fix the Trainer directly in the Transformers library (it’s easy to check if we’re in a Colab notebook), but if not, I fear the only solution is to not save anything :-/

Black-Dragon · June 23, 2023, 1:24pm

I am having the same problem. Have any workarounds been found?

Also, I don’t fully understand the underlying reason behind this problem. I mount Google Drive and save models to Drive, so the virtual machine’s storage should not fill up. Even after the models are deleted, they should go into the Bin folder in Drive, yet the VM shows that its storage is full (sorry, I’m not very familiar with how the virtual disk works when mounting). I would appreciate it if anyone could help with this.
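
No confirmed fix appears in this thread. One workaround that avoids rotating checkpoints on Drive at all is to keep `output_dir` on the VM's local disk and copy only the finished model to Drive once. The sketch below assumes that approach; `export_final_model` and the example paths are hypothetical, and the skipped file names are the ones visible in the traceback in the first post.

```python
import shutil


def export_final_model(local_dir, drive_dir):
    """Copy a finished model directory to Drive in one pass.

    Optimizer and scheduler state are skipped because they are only needed
    to resume training. Raises FileExistsError if `drive_dir` already exists.
    """
    skip = shutil.ignore_patterns("optimizer.pt", "scheduler.pt")
    return shutil.copytree(local_dir, drive_dir, ignore=skip)
```

With Drive mounted at `/content/drive` (an assumption about the notebook setup), usage could look like `trainer.save_model("/content/final_model")` followed by `export_final_model("/content/final_model", "/content/drive/MyDrive/xlmr_final")`.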
