GPU memory usage of optimizer states when using LoRA

Hi, recently I have been using LoRA (via peft) + transformers Trainer + DeepSpeed (ZeRO-3) to fine-tune my model (around 7B params). Before this, I tried full-parameter fine-tuning as well.
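For reference, my setup looks roughly like the sketch below (the model name and LoRA hyperparameters here are placeholders, not my exact config):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

# Placeholder model name -- substitute your own ~7B checkpoint
model = AutoModelForCausalLM.from_pretrained("my-7b-model")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # depends on the model architecture
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of the 7B params are trainable
```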

The weird thing is that LoRA does not seem to reduce GPU memory usage compared with full-parameter fine-tuning.

My question is: when using LoRA, I checked that most parameters have requires_grad set to False. Will this help reduce the Adam optimizer's GPU memory usage? I assume setting requires_grad to False reduces the memory used by gradient tensors, but the overall GPU memory usage did not drop much, so I wonder whether the memory taken by the optimizer states is still the same as before. Does the optimizer keep states for model weights that do not require gradients?
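To make my assumption concrete, here is a toy sketch in plain PyTorch (not the Trainer's actual code) of what I expect: if the optimizer is only given the parameters with requires_grad=True, the frozen base weights should carry no Adam moment tensors.

```python
import torch

frozen = torch.nn.Linear(1024, 1024)   # stands in for a frozen base weight
frozen.weight.requires_grad_(False)
frozen.bias.requires_grad_(False)
lora_like = torch.nn.Linear(1024, 8)   # stands in for a small LoRA matrix

# Build the optimizer only over trainable parameters
trainable = [p for p in list(frozen.parameters()) + list(lora_like.parameters())
             if p.requires_grad]
optimizer = torch.optim.AdamW(trainable)

x = torch.randn(2, 1024)
loss = lora_like(frozen(x)).sum()
loss.backward()
optimizer.step()

# Only the LoRA-like parameters carry exp_avg / exp_avg_sq buffers
print(len(optimizer.state), "parameters have optimizer state")  # -> 2 (weight + bias)
```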

Hi,
Sorry for diverging from your question, but I can't find much info about this online. How do you merge the adapters resulting from LoRA + ZeRO-3 fine-tuning back into the base model?

Solved! I finally found out that it was because I didn't set gradient_checkpointing=True during my LoRA training, so the activations were taking a lot of GPU memory!
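In case anyone else hits this, roughly what I changed is sketched below (the output dir and DeepSpeed config path are placeholders for my real ones):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    gradient_checkpointing=True,       # trade extra compute for much lower activation memory
    per_device_train_batch_size=1,
    deepspeed="ds_zero3_config.json",  # placeholder path to the ZeRO-3 config
)

# With PEFT models it may also be necessary to enable gradients on the inputs,
# otherwise the checkpointed activations have nothing to backprop through:
# model.enable_input_require_grads()
# model.gradient_checkpointing_enable()
```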

Ah, I just call the PeftModel's .merge_and_unload() method. It changes the base model's weights in place.
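Roughly like this (the paths are placeholders; with ZeRO-3 you may first need to consolidate the sharded weights into a full checkpoint before merging):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/base-model")
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

merged = model.merge_and_unload()           # folds the LoRA deltas into the base weights
merged.save_pretrained("path/to/merged-model")
```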
