Why does the textual inversion example scale the learning rate?

Hi, is there any specific reason why the textual inversion training example uses --scale_lr while other examples like text_to_image or controlnet don’t?

Here is the example script for textual inversion (note that it passes --scale_lr):

export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export DATA_DIR="./cat"

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --push_to_hub \
  --output_dir="textual_inversion_cat"

More questions related to tuning learning rate and batch size in general:

  1. Should I also use scale_lr if I’m fine-tuning a model on a GPU that can fit more samples, like an A100 80GB? For example, I can fit 24 samples in a batch. Does that mean I should always pick the biggest batch I can fit on the GPU?
  2. Is there a study or experimental result on the relationship between convergence time, batch size, and learning rate? Specifically, I just want to know how quickly you can converge if you have 8 GPUs. I’m sure it’s not going to be 8 times as fast, but I’d like a ballpark estimate.
  3. Is there a general guideline for configuring the learning rate and batch size for the quickest, highest-quality convergence when fine-tuning Stable Diffusion?

Hey @offchan, good questions!

This is definitely not my area of expertise but I’ll do my best to answer your questions.

First off, there’s no hard rule for whether you should or shouldn’t scale the learning rate, and no set guidelines for picking a learning rate when fine-tuning Stable Diffusion.

There’s no particular reason why it was enabled in one example and not the others.

I think about it like this: if you’re more confident that you’re going to have high-quality gradient estimates (which you can get via larger batch sizes and more gradient accumulation steps), you’ll want a higher learning rate. One way to get that higher learning rate is to scale it with the effective batch size.
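For context (and I’d double-check the script itself for the exact behavior), my understanding is that --scale_lr simply multiplies the base learning rate by the effective batch size, i.e. per-GPU batch size × gradient accumulation steps × number of processes. A minimal sketch of that logic, not the exact script code:

# Sketch of how --scale_lr behaves as I understand it; check
# textual_inversion.py in diffusers for the exact line.
def scale_learning_rate(base_lr, train_batch_size, gradient_accumulation_steps, num_processes):
    # Effective batch size = samples contributing to each optimizer step.
    effective_batch_size = train_batch_size * gradient_accumulation_steps * num_processes
    return base_lr * effective_batch_size

# With the example script above (1 GPU): 5e-4 * 1 * 4 * 1 = 2e-3
print(scale_learning_rate(5.0e-04, train_batch_size=1, gradient_accumulation_steps=4, num_processes=1))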

So it really depends on your data: whether it’s going to give you high-quality gradients, and how much additional wall-clock time it takes to get those higher-quality gradients. I.e. I can use 8 GPUs in data parallel to 8x my batch size, but now there’s gradient synchronization overhead, so each weight update is going to take longer. If my data is simple enough that I can already get good enough gradient estimates, do I need the additional overhead to get the better estimates?
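To make that synchronization cost concrete, here is a minimal sketch of what a data-parallel weight update pays for (my own illustration, not code from the training scripts; real DDP implementations overlap this communication with the backward pass):

import torch.distributed as dist

# Assumes a process group has already been initialized (e.g. by accelerate or torchrun).
def data_parallel_step(model, optimizer, compute_loss, local_batch):
    # Each GPU computes gradients on its own shard of the global batch.
    optimizer.zero_grad()
    loss = compute_loss(model, local_batch)
    loss.backward()

    # Gradients are then averaged across all GPUs before the optimizer step.
    # This all_reduce is the synchronization overhead mentioned above.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    # Every GPU then applies the identical, averaged update.
    optimizer.step()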

Anyway, a bit of a rambly answer to say there’s no easy answer here. Once again, this is a bit outside my expertise, so please take it with a grain of salt. Happy to have anyone else join in and correct me :slight_smile:


Hi @williamberman, your answer is very valuable even if you don’t consider yourself an expert in this. It makes the current state of things clearer. You are basically saying that if the data is simple, then a small batch will produce good enough gradients, which makes sense to me. It means you won’t get much benefit from scaling up if your dataset is simple.

AFAIK, when you want to scale your training, you can only increase the batch size; you cannot take more frequent training steps with a small batch size. More machines cannot increase the throughput of training steps per second, but they can increase the quality of the gradients, like you said.

In theory, frequent training steps with a small batch size should make the model converge faster (when that’s possible), according to an answer by Ian Goodfellow. Unfortunately, data-parallel training doesn’t give you these more frequent steps.

I guess my main question is more about how fast you can converge as you increase the number of machines/GPUs. If you properly tune the learning rate for 1 GPU and for 8 GPUs, how fast can you expect each option to converge? Do you have any anecdotal experience or data you’ve seen somewhere that compares this? Let’s say you are just training on the LAION-2B dataset.

There’s one scenario that has an easy answer. If you have

  • 1 GPU with batch size 4 and 8 gradient accumulation steps
  • 8 GPUs each with batch size 4
    then the speed gain of 8 GPUs would be approximately 8 times, or a bit lower, because the effective batch size of both options is the same (see the quick calculation below).

But an interesting case is when you do fewer gradient accumulation steps on the 1 GPU (or none at all). I think that will converge faster, and the speed gain from 8 GPUs will then be significantly less than 8 times.
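To put rough numbers on the first scenario, here is a back-of-the-envelope sketch (my own illustration; it ignores gradient-synchronization overhead):

# Both setups see the same number of samples per optimizer step.
def effective_batch_size(per_gpu_batch, grad_accum_steps, num_gpus):
    return per_gpu_batch * grad_accum_steps * num_gpus

one_gpu   = effective_batch_size(per_gpu_batch=4, grad_accum_steps=8, num_gpus=1)  # 32
eight_gpu = effective_batch_size(per_gpu_batch=4, grad_accum_steps=1, num_gpus=8)  # 32

# Same effective batch size, but the single GPU runs its 8 forward/backward passes
# sequentially while the 8 GPUs run theirs concurrently, so the ideal per-step
# speedup is ~8x, reduced in practice by the gradient all-reduce.
print(one_gpu, eight_gpu)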


This is only the case for data parallelism. There are alternative ways to leverage more hardware that can increase throughput; see model parallelism and pipeline parallelism. Data parallelism is just the easiest form of parallelism to implement, since you just replicate the whole model.

Yes, diminishing marginal returns from more precise gradient estimates is a good rule of thumb I forgot to mention. I would state it more as: “smaller batch sizes, when they provide sufficient estimates of the gradient, can lead to faster convergence”.

Yes, understood. Unfortunately, this is where I lack the most experience, so I don’t have a good answer for you :slight_smile:

I’d be hesitant to claim a speed gain of n when adding n GPUs in data parallel compared to 8 gradient accumulation steps. In the multi-GPU case, you have inherent overhead that also depends on your node topology: you have to synchronize gradients between the different GPUs, and you can end up with large tail latencies. In the single-GPU gradient accumulation case, you have to store all the pre-computed weight updates somewhere, I think? You might end up with a lot of memory spilled to CPU or disk, which could be costly. Idk, this is too far outside my usual bandwidth to make strong claims about.


Do people typically use model parallelism or pipeline parallelism? For example, when Stable Diffusion was trained, did they use those techniques, or just data parallelism?

AFAIK they’re more frequently used in the LLM world. For these latent diffusion models, since they’re pretty small, I think they usually just use data parallel, but don’t quote me on that :slight_smile: