Hi Guys
I am currently training different models on Google Colab Pro and I am running into some storage issues. Is there a way to find out which checkpoints can be deleted? I'm using --save_total_limit="3"
and I try to back up the checkpoint folder to Gdrive every hour. While doing this with save_total_limit,
the files on my notebook get deleted (as they should), but not on my Gdrive (obviously). Is there any way to find out which checkpoints I can manually remove from my Gdrive? Or do I have to check the logs,
e.g. looking for lines like
Deleting older checkpoint [/share/datasets/output_run/checkpoint-9000] due to args.save_total_limit
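(If I ever redirect the training output to a file, I guess I could just grep for those lines. A minimal sketch, assuming a hypothetical trainer.log that contains the cell output:

```python
import re
from pathlib import Path

# Hypothetical log file; only works if the training output was actually saved to disk
log_text = Path("trainer.log").read_text()

# Matches lines like:
# Deleting older checkpoint [/share/datasets/output_run/checkpoint-9000] due to args.save_total_limit
deleted = re.findall(
    r"Deleting older checkpoint \[(.+?)\] due to args\.save_total_limit",
    log_text,
)
print(deleted)  # paths that should be safe to remove from the Gdrive backup too
```

But that only helps if the log survives the crash.)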
I had a lot of runs crash at weird points in the beginning, so I probably back up too much, but that's the way I am xD
While the notebook is still active I can just check the file explorer:
and delete every checkpoint on Gdrive that's different from those ones. But how do I clean up the checkpoints on Gdrive if the notebook crashed?
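For context, this is roughly what I was thinking of running manually against the backup folder to mimic save_total_limit on Gdrive. Just a sketch, assuming the backup lives at /content/drive/MyDrive/output_run and the folders follow the usual checkpoint-<step> naming:

```python
import re
import shutil
from pathlib import Path

# Assumed location of the Gdrive backup (adjust to wherever the drive is mounted)
BACKUP_DIR = Path("/content/drive/MyDrive/output_run")
KEEP_LAST = 3  # mirror --save_total_limit

# Collect checkpoint-<step> folders and sort them by step number
checkpoints = sorted(
    (p for p in BACKUP_DIR.glob("checkpoint-*") if p.is_dir()),
    key=lambda p: int(re.search(r"checkpoint-(\d+)", p.name).group(1)),
)

# Delete everything except the newest KEEP_LAST checkpoints
for old in checkpoints[:-KEEP_LAST]:
    print(f"Removing {old}")
    shutil.rmtree(old)
```

But I'm not sure whether "highest step number" is always the right thing to keep, which is why I'm asking.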
Ty