Resume training

I’m just running some initial tests/experiments to get a sense of how my data/images will work with Diffusers, and I’d like to continue training from a saved checkpoint. I modified the script to use:

    if args.overwrite_output_dir:
        model = UNet2DModel(
            # ... original model configuration ...
        )
    else:
        model_dir = args.output_dir + '/unet/'
        model = UNet2DModel.from_pretrained(model_dir)

which does appear to successfully load my checkpoint (based on the initial loss). However, when I run it, I see that the loss is clearly diverging. I’m guessing I don’t have the correct scheduler/optimizer settings, but I’m not sure how to load those.
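For what it’s worth, the usual way to make a resumed run pick up where it left off is to save the optimizer and LR-scheduler state alongside the model weights, and restore both on resume. Here is a minimal sketch of that idea in plain PyTorch; the `Linear` stand-in model, the file name, and the `global_step` value are all illustrative, not taken from the Diffusers training script:

```python
import os
import tempfile

import torch

# Stand-in for the UNet; in the real script this would be the UNet2DModel.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

# ... train for a while, then checkpoint everything, not just the weights ...
path = os.path.join(tempfile.gettempdir(), "training_state.pt")
torch.save(
    {
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "global_step": 123,  # illustrative step count
    },
    path,
)

# On resume: restore optimizer and scheduler state before training continues.
state = torch.load(path)
optimizer.load_state_dict(state["optimizer"])
scheduler.load_state_dict(state["scheduler"])
```

With only `from_pretrained` on the model, the optimizer moments and the scheduler position start from scratch, which matches the divergence described above.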

Any tips appreciated.

Hi @jbmaxwell! I think that’s an interesting use case, would you mind opening an issue in the repo?

Thanks a lot!

Okay, will do.


Thank you!

Actually, it looks like the problem is just that the learning rate doesn’t continue from where it was at the end of the last run: the whole LR schedule runs again, as if training from scratch. So it starts okay, diverges for a while, then starts to converge again.
Probably it’s something I’ve missed… ??
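That behavior is consistent with the LR schedule restarting at step 0. As a rough sketch, here is the shape of a cosine-with-warmup curve (the kind of schedule Diffusers' training examples typically use); all step counts and rates below are illustrative. Resuming at step 0 replays warmup and the high early learning rate, whereas resuming at the saved global step continues on the decayed part of the curve:

```python
import math

def lr_at_step(step, warmup_steps=500, total_steps=10_000, base_lr=1e-4):
    """Illustrative cosine-with-warmup learning-rate curve."""
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Restarting the schedule puts the run back near the peak LR,
# while resuming at the saved step keeps the lower, decayed LR.
lr_restarted = lr_at_step(600)    # shortly after warmup, near peak
lr_resumed = lr_at_step(6_000)    # deep into the decay
```

So one fix is to fast-forward or re-create the scheduler at the saved global step (e.g. by restoring its `state_dict`) rather than letting it start from zero.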