Can someone who has done this before please explain?
Here is what I did: I ran training for 11.5 hours on a Kaggle P100 free GPU while saving checkpoints, limiting it to a single saved checkpoint.
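For reference, this is roughly how I set it up (a sketch, not my exact arguments; the path and `save_steps` value are placeholders, and `save_total_limit=1` is the `TrainingArguments` flag that keeps only the latest checkpoint on disk):

```python
from transformers import TrainingArguments

# Config fragment (approximate, placeholder values):
training_args = TrainingArguments(
    output_dir="/kaggle/working/output",  # placeholder path
    save_strategy="steps",                # save checkpoints periodically
    save_steps=500,                       # placeholder interval
    save_total_limit=1,                   # keep only the latest checkpoint
)
```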
The session ended, so I started a new session and looked up the saved checkpoint using:

```python
import os
import transformers

last_checkpoint = None
if os.path.isdir(training_args.output_dir) and not training_args.overwrite_output_dir:
    last_checkpoint = transformers.trainer_utils.get_last_checkpoint(
        training_args.output_dir
    )
```
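As I understand it, `get_last_checkpoint` essentially just returns the `checkpoint-N` subfolder with the highest step number, or `None` if there isn't one. A rough pure-Python mimic (my approximation, not the library source) shows why checking the result matters: if the folder is empty, you get `None` back and nothing is resumed.

```python
import os
import re
import tempfile

def find_last_checkpoint(output_dir):
    # Approximation of transformers.trainer_utils.get_last_checkpoint:
    # pick the checkpoint-N subfolder with the largest N, else None.
    ckpts = [
        d for d in os.listdir(output_dir)
        if re.fullmatch(r"checkpoint-\d+", d)
        and os.path.isdir(os.path.join(output_dir, d))
    ]
    if not ckpts:
        return None
    latest = max(ckpts, key=lambda d: int(d.split("-")[1]))
    return os.path.join(output_dir, latest)

# Demo with fake checkpoint folders
with tempfile.TemporaryDirectory() as out:
    for step in (500, 1000, 1500):
        os.makedirs(os.path.join(out, f"checkpoint-{step}"))
    print(os.path.basename(find_last_checkpoint(out)))  # checkpoint-1500
```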
I also set `ignore_data_skip=True` to skip straight to the checkpoint, since the Trainer docs say: "If set to True, the training will begin faster (as that skipping step can take a long time)."
Then I started the training from the saved checkpoint:
But now the model is taking exactly the same amount of time to train. It didn't start any faster, and it also began again from step 0.
So I'm confused: am I doing something wrong, or does this flag not work as it should?