Can someone who has done this before please explain?
Here is what I did: I ran training for 11.5 hours on a Kaggle P100 free GPU while saving checkpoints, limiting it to a single saved checkpoint.
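For reference, this is roughly how I set it up (a sketch, not my exact arguments; the path and `save_steps` value are placeholders, and `save_total_limit=1` is the `TrainingArguments` flag that keeps only the latest checkpoint on disk):

```python
from transformers import TrainingArguments

# Config fragment (approximate, placeholder values):
training_args = TrainingArguments(
    output_dir="/kaggle/working/output",  # placeholder path
    save_strategy="steps",                # save checkpoints periodically
    save_steps=500,                       # placeholder interval
    save_total_limit=1,                   # keep only the latest checkpoint
)
```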
The session ended, so I started a new session and looked up the saved checkpoint using:

```python
import os
import transformers

last_checkpoint = None
if os.path.isdir(training_args.output_dir) and not training_args.overwrite_output_dir:
    last_checkpoint = transformers.trainer_utils.get_last_checkpoint(
        training_args.output_dir
    )
```
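As I understand it, `get_last_checkpoint` essentially just returns the `checkpoint-N` subfolder with the highest step number, or `None` if there isn't one. A rough pure-Python mimic (my approximation, not the library source) shows why checking the result matters: if the folder is empty, you get `None` back and nothing is resumed.

```python
import os
import re
import tempfile

def find_last_checkpoint(output_dir):
    # Approximation of transformers.trainer_utils.get_last_checkpoint:
    # pick the checkpoint-N subfolder with the largest N, else None.
    ckpts = [
        d for d in os.listdir(output_dir)
        if re.fullmatch(r"checkpoint-\d+", d)
        and os.path.isdir(os.path.join(output_dir, d))
    ]
    if not ckpts:
        return None
    latest = max(ckpts, key=lambda d: int(d.split("-")[1]))
    return os.path.join(output_dir, latest)

# Demo with fake checkpoint folders
with tempfile.TemporaryDirectory() as out:
    for step in (500, 1000, 1500):
        os.makedirs(os.path.join(out, f"checkpoint-{step}"))
    print(os.path.basename(find_last_checkpoint(out)))  # checkpoint-1500
```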
I also set `ignore_data_skip=True` to skip straight to the checkpoint, since the Trainer docs say: "If set to True, the training will begin faster (as that skipping step can take a long time)."
Then I started the training from the saved checkpoint:
But now the model is taking exactly the same amount of time to train. It didn't start any faster, and it also began again from step 0.
So I'm confused: am I doing something wrong, or does this flag not work as it should?