Resume training

I’m just running some initial tests/experiments to get a sense of how my data/images will work with Diffusers, and I’d like to continue training from a saved checkpoint. I modified the script to use:

    if args.overwrite_output_dir:
        model = UNet2DModel(
            # ... original model configuration ...
        )
    else:
        model_dir = args.output_dir + '/unet/'
        model = UNet2DModel.from_pretrained(model_dir)

which does appear to successfully load my checkpoint (based on the initial loss). However, when I run it, I see that the loss is clearly diverging. I’m guessing I don’t have the correct scheduler/optimizer settings, but I’m not sure how to load those.
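For what it’s worth, the usual way to make a resumed run pick up where it left off is to save the optimizer and LR-scheduler state alongside the model weights, and restore both on resume. Here is a minimal sketch of that idea in plain PyTorch; the `Linear` stand-in model, the file name, and the `global_step` value are all illustrative, not taken from the Diffusers training script:

```python
import os
import tempfile

import torch

# Stand-in for the UNet; in the real script this would be the UNet2DModel.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

# ... train for a while, then checkpoint everything, not just the weights ...
path = os.path.join(tempfile.gettempdir(), "training_state.pt")
torch.save(
    {
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "global_step": 123,  # illustrative step count
    },
    path,
)

# On resume: restore optimizer and scheduler state before training continues.
state = torch.load(path)
optimizer.load_state_dict(state["optimizer"])
scheduler.load_state_dict(state["scheduler"])
```

With only `from_pretrained` on the model, the optimizer moments and the scheduler position start from scratch, which matches the divergence described above.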

Any tips appreciated.

Hi @jbmaxwell! I think that’s an interesting use case, would you mind opening an issue in the repo?

Thanks a lot!

Okay, will do.


Thank you!

Actually, it looks like the problem is just that the learning rate doesn’t continue from where it was at the end of the last run: the whole LR schedule runs again, as if training from scratch. So it starts okay, diverges for a while, then starts to converge again.
Probably it’s something I’ve missed… ??
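That behavior is consistent with the LR schedule restarting at step 0. As a rough sketch, here is the shape of a cosine-with-warmup curve (the kind of schedule Diffusers' training examples typically use); all step counts and rates below are illustrative. Resuming at step 0 replays warmup and the high early learning rate, whereas resuming at the saved global step continues on the decayed part of the curve:

```python
import math

def lr_at_step(step, warmup_steps=500, total_steps=10_000, base_lr=1e-4):
    """Illustrative cosine-with-warmup learning-rate curve."""
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Restarting the schedule puts the run back near the peak LR,
# while resuming at the saved step keeps the lower, decayed LR.
lr_restarted = lr_at_step(600)    # shortly after warmup, near peak
lr_resumed = lr_at_step(6_000)    # deep into the decay
```

So one fix is to fast-forward or re-create the scheduler at the saved global step (e.g. by restoring its `state_dict`) rather than letting it start from zero.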