Hi, I have recently been using LoRA (via peft) + the transformers Trainer + DeepSpeed (ZeRO-3) to fine-tune my model (around 7B params). Before this, I tried full-parameter fine-tuning as well.
The weird thing is that LoRA does not seem to reduce GPU memory usage compared with full-parameter fine-tuning.
My question is: when using LoRA, I checked that most parameters have `requires_grad` set to `False`. Will this help reduce the Adam optimizer's GPU memory usage? I would think setting `requires_grad` to `False` should at least cut the memory used by the gradient tensors, but overall GPU memory did not drop much, so I am wondering whether that is because the memory taken by the optimizer states is still the same as before. Does the optimizer keep states for model weights that do not require gradients?
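For context, here is roughly how I sanity-checked which parameters actually end up with optimizer states. This is a minimal sketch outside of Trainer/DeepSpeed, using a small model and placeholder LoRA settings just for illustration (not my real 7B config):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Small model just to illustrate; my real setup uses a ~7B checkpoint.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Placeholder LoRA settings, not my actual config.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters should require grad

# Build an optimizer over the trainable parameters only.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# The optimizer only tracks the parameters it was given, so Adam states
# (exp_avg / exp_avg_sq) will only ever be allocated for these.
n_tracked = sum(
    p.numel() for group in optimizer.param_groups for p in group["params"]
)
print(f"params tracked by the optimizer: {n_tracked:,}")
```

In this plain-PyTorch setup only the LoRA parameters go into the optimizer, so I would expect the Adam states to be tiny; that is why the mostly unchanged GPU memory under ZeRO-3 confuses me.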