Saving checkpoints to Google Drive

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="/gdrive/MyDrive/Thesis/GPT2/checkpoints",
    overwrite_output_dir=False,
    num_train_epochs=5,
    per_device_train_batch_size=6,  # previous was 6
    save_steps=100,  # write a checkpoint every 100 steps
    save_total_limit=5,  # keep only the 5 most recent checkpoints
    fp16=True,
    dataloader_drop_last=True,
    # evaluate_during_training=True,
    warmup_steps=200,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    # prediction_loss_only = True
) 

trainer.train()

I want to save the checkpoints directly to my Google Drive. The code above saves checkpoints fine until save_total_limit is reached. After that, old checkpoints are not deleted and new ones do not appear in Drive, even though the console reports that checkpoints were saved and deleted. Any help?

I think you just need to wait a bit and they will show up in Drive. That's what happened for me.

I waited around 20-30 minutes and didn't notice any change.
It works fine when I remove the save_total_limit argument, though.

I figured out a solution:

from google.colab import drive
drive.flush_and_unmount()

Run this code after training is done and it will sync everything to Drive.

This could be a solution, but what if my runtime gets disconnected during training? My checkpoints would be lost. I need the checkpoints to be in Drive right after each save step.
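
One workaround that might help with this, offered as a rough sketch rather than a tested fix: point output_dir at local Colab disk and copy each finished checkpoint folder to Drive yourself. The helper name `copy_new_checkpoints` and any paths you pass to it are made up for illustration; the only thing it relies on is the `checkpoint-<step>` folder naming that the Trainer uses.

```python
import shutil
from pathlib import Path


def copy_new_checkpoints(local_dir, drive_dir):
    """Copy checkpoint-* folders that are not yet present in drive_dir.

    Hypothetical helper, not part of transformers.
    Returns the names of the folders that were copied.
    """
    local_dir, drive_dir = Path(local_dir), Path(drive_dir)
    drive_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for ckpt in sorted(local_dir.glob("checkpoint-*")):
        target = drive_dir / ckpt.name
        # Skip stray files and anything already copied earlier.
        if ckpt.is_dir() and not target.exists():
            shutil.copytree(ckpt, target)
            copied.append(ckpt.name)
    return copied
```

It never deletes anything on the Drive side, so old copies would have to be cleaned up by hand. If your transformers version has `TrainerCallback`, calling it from an `on_save` hook is one place it could go, though I have not verified that this avoids the Drive sync delay.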

Try this: after each save step, interrupt execution in Colab, run the code above to save the checkpoint to Drive, and then restart training from the saved checkpoint.

Can anyone please tell me how to resume training a transformer from where it left off by loading the previously saved checkpoints? It would be really appreciated. Thanks in advance.
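
A hedged sketch of one way to resume: recent versions of transformers accept a `resume_from_checkpoint` argument in `trainer.train()`, so the main job is finding the newest `checkpoint-<step>` folder. The helper `latest_checkpoint` below is a hypothetical name, not a library function, and the usage lines assume `trainer` is built as in the first post. Older releases may use a different argument, so check your installed version.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> folder with the highest step, or None."""
    best_step, best_path = -1, None
    for path in Path(output_dir).glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        # Compare steps numerically so checkpoint-1000 beats checkpoint-200.
        if path.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Usage sketch (assumes `trainer` is built as in the first post):
# ckpt = latest_checkpoint("/gdrive/MyDrive/Thesis/GPT2/checkpoints")
# if ckpt is not None:
#     trainer.train(resume_from_checkpoint=str(ckpt))
# else:
#     trainer.train()
```

In recent versions, transformers also ships `transformers.trainer_utils.get_last_checkpoint`, and `resume_from_checkpoint=True` should pick up the last checkpoint in output_dir, so the helper is mainly useful when you want to see which folder will be used.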