Can someone who has done this before please explain?
Here is what I did: I ran training for 11.5 hours on a Kaggle P100 free GPU while saving checkpoints, limiting it to a single saved checkpoint.
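For reference, this is roughly how I set it up (a sketch, not my exact arguments; the path and `save_steps` value are placeholders, and `save_total_limit=1` is the `TrainingArguments` flag that keeps only the latest checkpoint on disk):

```python
from transformers import TrainingArguments

# Config fragment (approximate, placeholder values):
training_args = TrainingArguments(
    output_dir="/kaggle/working/output",  # placeholder path
    save_strategy="steps",                # save checkpoints periodically
    save_steps=500,                       # placeholder interval
    save_total_limit=1,                   # keep only the latest checkpoint
)
```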
The session ended, so I started a new session and looked up the saved checkpoint using:

```python
import os
import transformers

last_checkpoint = None
if os.path.isdir(training_args.output_dir) and not training_args.overwrite_output_dir:
    last_checkpoint = transformers.trainer_utils.get_last_checkpoint(
        training_args.output_dir
    )
```
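As I understand it, `get_last_checkpoint` essentially just returns the `checkpoint-N` subfolder with the highest step number, or `None` if there isn't one. A rough pure-Python mimic (my approximation, not the library source) shows why checking the result matters: if the folder is empty, you get `None` back and nothing is resumed.

```python
import os
import re
import tempfile

def find_last_checkpoint(output_dir):
    # Approximation of transformers.trainer_utils.get_last_checkpoint:
    # pick the checkpoint-N subfolder with the largest N, else None.
    ckpts = [
        d for d in os.listdir(output_dir)
        if re.fullmatch(r"checkpoint-\d+", d)
        and os.path.isdir(os.path.join(output_dir, d))
    ]
    if not ckpts:
        return None
    latest = max(ckpts, key=lambda d: int(d.split("-")[1]))
    return os.path.join(output_dir, latest)

# Demo with fake checkpoint folders
with tempfile.TemporaryDirectory() as out:
    for step in (500, 1000, 1500):
        os.makedirs(os.path.join(out, f"checkpoint-{step}"))
    print(os.path.basename(find_last_checkpoint(out)))  # checkpoint-1500
```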
I also set `ignore_data_skip=True` to skip straight to the checkpoint, since the Trainer docs say: "If set to True, the training will begin faster (as that skipping step can take a long time)."
Then I started the training from the saved checkpoint:
But now the model is taking exactly the same amount of time to train. It didn't start any faster, and it also began again from step 0.
So I'm confused: am I doing something wrong, or does this flag not work as it should?