I pre-trained a language model on my own data, and I want to continue pre-training for additional steps from the last checkpoint. I am planning to use the code below to continue the pre-training, but I want to be sure that everything is correct before starting.
Let’s say that I saved all of my files into CRoBERTa.
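Roughly, this is what I plan to run, as a sketch with placeholder paths and hyperparameters (the two-sentence dataset stands in for my real pre-training corpus, and the step counts are made up):

```python
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("CRoBERTa")
model = RobertaForMaskedLM.from_pretrained("CRoBERTa")

# Placeholder corpus; in the real run this is my pre-training dataset.
texts = ["example sentence one", "example sentence two"]
train_dataset = [tokenizer(t, truncation=True, max_length=128) for t in texts]
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

training_args = TrainingArguments(
    output_dir="CRoBERTa",        # same output_dir as the first run
    max_steps=200_000,            # old budget plus the additional steps
    save_steps=10_000,
    logging_dir="CRoBERTa/logs",
    report_to=["tensorboard"],
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

# True tells the Trainer to pick up the last checkpoint-* folder in
# output_dir; a path string would pin it to a specific checkpoint instead.
trainer.train(resume_from_checkpoint=True)
```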
The Trainer will load the last checkpoint it can find, so it won’t necessarily be the one you specified. It will then resume training from there for just the number of steps left in the schedule, so unless you increase max_steps (or num_train_epochs), the result won’t be any different from the model you got at the end of your initial Trainer.train.
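Concretely, a sketch of what happens under the hood (CRoBERTa stands in for your output_dir):

```python
from transformers.trainer_utils import get_last_checkpoint

# With resume_from_checkpoint=True, the Trainer relies on this helper: it
# scans output_dir for sub-folders named "checkpoint-<step>" and returns the
# one with the highest step, or None if nothing matches.
last_checkpoint = get_last_checkpoint("CRoBERTa")
print(last_checkpoint)  # e.g. "CRoBERTa/checkpoint-100000"

# To resume from a specific checkpoint instead, pass its path explicitly:
# trainer.train(resume_from_checkpoint="CRoBERTa/checkpoint-50000")
```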
It does load and train successfully, but when I check my logger (e.g. TensorBoard), the epochs start from 0 every time I train. It’s annoying because the curves keep restarting from the beginning when they should actually continue back-to-back.
Am I doing something wrong, and is there a way to fix this?
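For reference, these are the counters I mean; right after a resumed trainer.train(...) I would expect them to continue rather than reset (a quick sanity check, assuming a Trainer object named trainer):

```python
# After a successful resume, these should continue past the values stored in
# the checkpoint instead of restarting at 0:
print(trainer.state.global_step)
print(trainer.state.epoch)
```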
I have one more question, which is unrelated and which I found no answers for: how do I save model checkpoints in the same format as trainer.train() does?
I know I can use model.save_pretrained('bert-base-uncased'); however, this saves directly into that directory, unlike trainer.train(), which saves into bert-base-uncased/checkpoint-100 … I want a function that will automatically do this based on the current step count; does such a function exist?
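In case it doesn’t, here is the kind of helper I mean; save_checkpoint_like_trainer is a name I made up, not something from transformers:

```python
import os


def save_checkpoint_like_trainer(model, tokenizer, output_dir, step):
    """Save into output_dir/checkpoint-<step>, mimicking the Trainer's layout.

    Note: this only writes the model and tokenizer files; real Trainer
    checkpoints also contain optimizer.pt, scheduler.pt, trainer_state.json,
    etc., which are needed for a full resume.
    """
    checkpoint_dir = os.path.join(output_dir, f"checkpoint-{step}")
    os.makedirs(checkpoint_dir, exist_ok=True)
    model.save_pretrained(checkpoint_dir)
    tokenizer.save_pretrained(checkpoint_dir)
    return checkpoint_dir


# save_checkpoint_like_trainer(model, tokenizer, "bert-base-uncased", 100)
# -> writes into bert-base-uncased/checkpoint-100
```

If a live Trainer is around, trainer.save_model(some_dir) covers the model part and trainer.state.global_step gives the current step count, as far as I can tell.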
Yes, I ran into the same situation. For example, the loss at checkpoint-100 was 0.20, but when I set resume_from_checkpoint to True, training still started from step 1 and the loss was still 0.7, just as if I were starting from the beginning.
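One way I tried to narrow it down (a debugging sketch; the path is made up, assuming the standard checkpoint layout): check whether the checkpoint’s trainer_state.json actually recorded the step and loss you expect.

```python
import json

# Trainer checkpoints include a trainer_state.json with the saved step and
# the logged metric history; this is the state that resuming restores.
with open("output/checkpoint-100/trainer_state.json") as f:
    state = json.load(f)

print(state["global_step"])      # should be 100 for checkpoint-100
print(state["log_history"][-1])  # last logged entry, e.g. the 0.20 loss

# If this looks right but training still restarts at step 1, double-check
# that resume_from_checkpoint is actually passed to trainer.train() and that
# output_dir points at the directory containing the checkpoint-* folders.
```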
I just downloaded my checkpoint files from “checkpoint510” and uploaded them to another machine (the same service on vast.ai), and set up resume-from-checkpoint there as well. But there is still a “no valid checkpoint” error. Can someone please tell me why?
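One thing I am checking (a sketch with guessed paths): apparently the Trainer’s checkpoint search only recognizes sub-folders of output_dir named checkpoint-<number>, with the dash, so a folder that landed as “checkpoint510” would be invisible to it.

```python
import os

# Rename the uploaded folder to the pattern the Trainer's search expects:
os.rename("/workspace/output/checkpoint510", "/workspace/output/checkpoint-510")

# Alternatively, skip the search entirely and point at the folder directly:
# trainer.train(resume_from_checkpoint="/workspace/output/checkpoint-510")
```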