Learning rate schedule with `train_text_to_image.py`

Just wondering: does the learning rate decay in the `train_text_to_image.py` script? I’m resuming from a checkpoint quite a long way into training (360k steps), and I’m still seeing `learning_rate=0.0001` in the progress indicator.

I always assumed that was just showing the initial learning rate, but is it supposed to reflect the actual learning rate at the current step (or epoch)? I’m only asking because improvement in my outputs is extremely slow. I expected it to be slow, but it seems almost conspicuously slow, and if the learning rate isn’t decreasing (and is generally too high), that might explain it… it could be bouncing around the minimum.

The launch process does indicate: “All scheduler states loaded successfully”, so I’m assuming it’s resuming correctly from my checkpoint.

I’ve just started another run, for another 20 epochs, and added `--learning_rate=1e-6` to my command, but the progress still indicates:

```
Steps: 5%|▍ | 11397/244100 [22:26<72:12:34, 1.12s/it, lr=0.0001, step_loss=0.0744]
```

Any clarification appreciated.

UPDATE: Digging in a bit more, comet_ml is showing the learning rate as `1e-6`, which is what I set… so which one is right? It also shows `lr_scheduler` as “constant”, which surprised me. Does diffusion training use a constant learning rate? Is that the best approach?

Hi @jbmaxwell! The default learning rate schedule for that script is `constant`.

You can try other schedulers if you want and see if that helps. I believe a decaying learning rate is usually applied when training very large models with lots of data over many steps.
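If it helps, this is roughly how the script wires up the scheduler via `get_scheduler` from `diffusers.optimization` — a simplified sketch with a toy optimizer and made-up step counts, not the script’s exact code:

```python
import torch
from diffusers.optimization import get_scheduler

# Toy parameter/optimizer standing in for the UNet and its AdamW optimizer.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=1e-4)

# "constant" (the default) keeps the lr fixed for the whole run; schedules
# like "linear", "cosine" or "cosine_with_restarts" change it over the run.
lr_scheduler = get_scheduler(
    "cosine",                  # swap for "constant" to see the lr stay at 1e-4
    optimizer=optimizer,
    num_warmup_steps=500,
    num_training_steps=244_100,
)

for step in range(5):
    optimizer.step()
    lr_scheduler.step()
    # With a warmup-based schedule the lr ramps up first, then decays over the run.
    print(step, lr_scheduler.get_last_lr())
```

With `"constant"` there is no warmup and no decay, so the value shown in the progress bar never changes.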

Regarding setting the learning rate to `1e-6` and not seeing it in the log, I’m not sure what might be going on. Do you think the script could be scaling it? (diffusers/train_text_to_image.py at main · huggingface/diffusers · GitHub) Otherwise it could be a bug; please let us know, and feel free to open an issue on GitHub :slight_smile:
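If I recall correctly, `--scale_lr` multiplies the base learning rate by the effective batch size, roughly along these lines (the numbers below are made up purely for illustration):

```python
# Rough sketch of what --scale_lr does: the base lr is multiplied by the
# total effective batch size across gradient accumulation and processes.
base_lr = 1e-6
gradient_accumulation_steps = 4
train_batch_size = 8
num_processes = 2  # GPUs/processes launched by accelerate

effective_lr = base_lr * gradient_accumulation_steps * train_batch_size * num_processes
print(effective_lr)  # 6.4e-05, noticeably larger than the value passed on the CLI
```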

Thanks, @pcuenq!

I actually switched to LoRA fine-tuning and it seems to be going well so far… :crossed_fingers:

Awesome! Let us know how it goes :slight_smile:

Were you able to confirm the bug about the learning rate command-line argument?

I haven’t run it again on the standard script since starting LoRA, so I’m not sure. I do know that I didn’t have `--scale_lr` set, so it shouldn’t have been scaling.

I can see about reproducing it when my GPUs free up again.

Just to be clear, it was only when resuming from a checkpoint that I seemed unable to change the learning rate, which I’m guessing might be due to the “constant” lr schedule. If I start from the pretrained model I can set the learning rate as expected.

Okay, just confirming: I cannot change the learning rate when resuming from a checkpoint. I don’t think this is actually a “bug”, though, just a consequence of resuming with the constant scheduler. It makes sense that the “schedule” is essentially fixed once a run starts, so I’m not suggesting this should be editable… it just would have been convenient in my case! :slight_smile:
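For anyone hitting the same thing, the mechanism (as far as I understand it) is just that restoring the checkpoint also restores the optimizer/scheduler state, which carries the old learning rate with it. A minimal PyTorch-only sketch of the idea, using plain `torch.save`/`torch.load` as stand-ins for accelerate’s `save_state`/`load_state`:

```python
import torch

# The old run's optimizer (lr=1e-4) gets saved as part of the checkpoint.
params = [torch.nn.Parameter(torch.zeros(1))]
old_optimizer = torch.optim.AdamW(params, lr=1e-4)
torch.save(old_optimizer.state_dict(), "optimizer.pt")  # stand-in for save_state

# On resume, a fresh optimizer is built with the new --learning_rate...
new_optimizer = torch.optim.AdamW(params, lr=1e-6)
# ...but loading the checkpointed state puts the old lr back in its param groups.
new_optimizer.load_state_dict(torch.load("optimizer.pt"))  # stand-in for load_state
print(new_optimizer.param_groups[0]["lr"])  # 0.0001 again

# If you really did want the new value, you could overwrite it after loading:
for group in new_optimizer.param_groups:
    group["lr"] = 1e-6
```

With the constant scheduler, that restored value is then simply held for the rest of the run, which matches what I saw in the progress bar.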

I’m going to save the pipeline and start again with the new learning rate, just to see how it progresses.
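In case anyone wants to do the same: once the pipeline is saved you can point a fresh run at that directory instead of the original base model. Something like this (the path is just a placeholder for wherever the fine-tuned pipeline ends up):

```python
from diffusers import StableDiffusionPipeline

# "path/to/saved-pipeline" is a placeholder. A fresh training run can use this
# directory as --pretrained_model_name_or_path, which builds a brand-new
# optimizer and therefore picks up whatever --learning_rate is passed this time.
pipeline = StableDiffusionPipeline.from_pretrained("path/to/saved-pipeline")
image = pipeline("a quick sanity-check prompt").images[0]
image.save("sanity_check.png")
```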
