Learning rate schedule with `train_text_to_image.py`

Just wondering: does the learning rate decay in the `train_text_to_image.py` script? I’m resuming from a checkpoint quite a long way into training (360k steps), and I’m still seeing `learning_rate=0.0001` in the progress indicator.

I always assumed that was just showing the initial learning rate, but is it supposed to reflect the actual learning rate at the current step (or epoch)? I’m only asking because improvement in my outputs is extremely slow. I expected it to be slow, but it seems almost conspicuously slow, and if the learning rate isn’t decreasing (and is generally too high), that might explain it… it could be bouncing around the minimum.

The launch process does indicate: “All scheduler states loaded successfully”, so I’m assuming it’s resuming correctly from my checkpoint.

I’ve just started another run, for another 20 epochs, and added `--learning_rate=1e-6` to my command, but the progress still indicates:

```
Steps: 5%|▍ | 11397/244100 [22:26<72:12:34, 1.12s/it, lr=0.0001, step_loss=0.0744]
```

Any clarification appreciated.

UPDATE: Digging in a bit more, comet_ml is showing the learning rate as `1e-6`, which is what I set… so which one is right? It also shows `lr_scheduler` as “constant”, which surprised me. Does diffusion training use a constant learning rate? Is that the best approach?

Hi @jbmaxwell! The default learning rate schedule for that script is `constant`.

You can try other schedulers if you want and see if that helps. I believe a decaying learning rate is usually applied when training very large models with lots of data over many steps.
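If it helps, this is roughly how the script wires up the scheduler via `get_scheduler` from `diffusers.optimization` — a simplified sketch with a toy optimizer and made-up step counts, not the script’s exact code:

```python
import torch
from diffusers.optimization import get_scheduler

# Toy parameter/optimizer standing in for the UNet and its AdamW optimizer.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=1e-4)

# "constant" (the default) keeps the lr fixed for the whole run; schedules
# like "linear", "cosine" or "cosine_with_restarts" change it over the run.
lr_scheduler = get_scheduler(
    "cosine",                  # swap for "constant" to see the lr stay at 1e-4
    optimizer=optimizer,
    num_warmup_steps=500,
    num_training_steps=244_100,
)

for step in range(5):
    optimizer.step()
    lr_scheduler.step()
    # With a warmup-based schedule the lr ramps up first, then decays over the run.
    print(step, lr_scheduler.get_last_lr())
```

With `"constant"` there is no warmup and no decay, so the value shown in the progress bar never changes.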

Regarding setting the learning rate to `1e-6` and not seeing it in the log, I’m not sure what might be going on. Do you think the script could be scaling it? (diffusers/train_text_to_image.py at main · huggingface/diffusers · GitHub) Otherwise it could be a bug; please let us know, and feel free to open an issue on GitHub :slight_smile:
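If I recall correctly, `--scale_lr` multiplies the base learning rate by the effective batch size, roughly along these lines (the numbers below are made up purely for illustration):

```python
# Rough sketch of what --scale_lr does: the base lr is multiplied by the
# total effective batch size across gradient accumulation and processes.
base_lr = 1e-6
gradient_accumulation_steps = 4
train_batch_size = 8
num_processes = 2  # GPUs/processes launched by accelerate

effective_lr = base_lr * gradient_accumulation_steps * train_batch_size * num_processes
print(effective_lr)  # 6.4e-05, noticeably larger than the value passed on the CLI
```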

Thanks, @pcuenq!

I actually switched to LoRA fine-tuning and it seems to be going well so far… :crossed_fingers:

Awesome! Let us know how it goes :slight_smile:

Were you able to confirm the bug about the learning rate command-line argument?

I haven’t run it again on the standard script since starting LoRA, so I’m not sure. I do know that I didn’t have `--scale_lr` set, so it shouldn’t have been scaling.

I can see about reproducing it when my GPUs free up again.

Just to be clear, it was only when resuming from a checkpoint that I seemed unable to change the learning rate, which I’m guessing might be due to the “constant” lr schedule. If I start from the pretrained model I can set the learning rate as expected.

Okay, just confirming: I cannot change the learning rate when resuming from a checkpoint. I don’t think this is actually a “bug”, though, just a consequence of resuming with the constant scheduler. It makes sense that the “schedule” is essentially fixed once a run starts, so I’m not suggesting this should be editable… it just would have been convenient in my case! :slight_smile:
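For anyone hitting the same thing, the mechanism (as far as I understand it) is just that restoring the checkpoint also restores the optimizer/scheduler state, which carries the old learning rate with it. A minimal PyTorch-only sketch of the idea, using plain `torch.save`/`torch.load` as stand-ins for accelerate’s `save_state`/`load_state`:

```python
import torch

# The old run's optimizer (lr=1e-4) gets saved as part of the checkpoint.
params = [torch.nn.Parameter(torch.zeros(1))]
old_optimizer = torch.optim.AdamW(params, lr=1e-4)
torch.save(old_optimizer.state_dict(), "optimizer.pt")  # stand-in for save_state

# On resume, a fresh optimizer is built with the new --learning_rate...
new_optimizer = torch.optim.AdamW(params, lr=1e-6)
# ...but loading the checkpointed state puts the old lr back in its param groups.
new_optimizer.load_state_dict(torch.load("optimizer.pt"))  # stand-in for load_state
print(new_optimizer.param_groups[0]["lr"])  # 0.0001 again

# If you really did want the new value, you could overwrite it after loading:
for group in new_optimizer.param_groups:
    group["lr"] = 1e-6
```

With the constant scheduler, that restored value is then simply held for the rest of the run, which matches what I saw in the progress bar.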

I’m going to save the pipeline and start again with the new learning rate, just to see how it progresses.
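In case anyone wants to do the same: once the pipeline is saved you can point a fresh run at that directory instead of the original base model. Something like this (the path is just a placeholder for wherever the fine-tuned pipeline ends up):

```python
from diffusers import StableDiffusionPipeline

# "path/to/saved-pipeline" is a placeholder. A fresh training run can use this
# directory as --pretrained_model_name_or_path, which builds a brand-new
# optimizer and therefore picks up whatever --learning_rate is passed this time.
pipeline = StableDiffusionPipeline.from_pretrained("path/to/saved-pipeline")
image = pipeline("a quick sanity-check prompt").images[0]
image.save("sanity_check.png")
```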
