I’ll try setting save_strategy explicitly to epoch. Right now it is probably saving every preset number of steps and, for whatever reason, can’t delete the saved checkpoints from the Colab/GDrive disk.
As for “Can you see what’s actually on the disk?”:
There is a file explorer built into Google Colab, and I can also explore the filesystem through IPython magics (i.e. using bash). However, I couldn’t find where exactly the virtual disk for the Python environment is mounted, and therefore where the trainer seems to be writing to (even though it should be working on the Google Drive mount).
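For anyone who wants to check this from a notebook cell rather than the file explorer, here is a rough sketch using only the standard library. It reports how full the filesystem behind a path is and which subdirectories take the most space. The paths in the usage part (/root and /content/drive) are assumptions about a typical Colab session, not something confirmed in this thread, and the helper names are made up for illustration.

```python
import os
import shutil


def disk_report(paths):
    """Return {path: (used_gb, total_gb)} for every path that exists."""
    report = {}
    for path in paths:
        if os.path.exists(path):
            usage = shutil.disk_usage(path)
            report[path] = (usage.used / 1e9, usage.total / 1e9)
    return report


def dir_size_bytes(root):
    """Sum file sizes under root, skipping files that cannot be read."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # unreadable or vanished file, ignore it
    return total


def largest_subdirs(root, top=5):
    """Return up to `top` (size_bytes, path) pairs, largest first."""
    try:
        entries = list(os.scandir(root))
    except OSError:
        return []  # root is missing or not readable
    sizes = [
        (dir_size_bytes(entry.path), entry.path)
        for entry in entries
        if entry.is_dir(follow_symlinks=False)
    ]
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    # Assumed locations; adjust to whatever your session actually uses.
    for path, (used, total) in disk_report(["/root", "/content/drive"]).items():
        print(f"{path}: {used:.1f} GB used of {total:.1f} GB")
```

Calling largest_subdirs("/root") after a few hundred training steps might show which directory is growing, assuming the notebook user is allowed to read it.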
Edit: I rechecked, and it appears that after running the trainer, /root was slowly filling up on the Colab disk; however, I cannot see the contents of that mount point. Curiously, save_total_limit=1 also does not seem to limit the checkpoints saved on my Google Drive partition: checkpoints are being stored every 500 steps and only sporadically deleted.
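To see how many checkpoints are really sitting in the output directory at a given moment, something like the following could help. It is only a sketch: it relies on the Trainer naming its checkpoint folders checkpoint-<step>, and the output_dir path in the usage part is a hypothetical placeholder, not my actual setup.

```python
import os
import re

# Matches folder names such as "checkpoint-500" and captures the step number.
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def list_checkpoints(output_dir):
    """Return (step, path) pairs for checkpoint folders, sorted by step."""
    found = []
    for entry in os.scandir(output_dir):
        match = CHECKPOINT_RE.match(entry.name)
        if match and entry.is_dir():
            found.append((int(match.group(1)), entry.path))
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical path; replace with the output_dir passed to the trainer.
    output_dir = "/content/drive/MyDrive/my-run"
    if os.path.isdir(output_dir):
        for step, path in list_checkpoints(output_dir):
            print(step, path)
```

With save_total_limit=1 one would expect at most one or two entries here at any time, so a longer list would confirm that old checkpoints are not being removed.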