I was LoRA finetuning a Llama 70B model and turned on `gradient_checkpointing: True` in my training config, but it has no effect on memory consumption at all; I see no difference whether I set the flag to False or True. Any idea why that would be the case?
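For context, my setup looks roughly like this (a minimal sketch assuming Hugging Face Transformers + PEFT with the Trainer; the model id, LoRA settings, and batch size here are illustrative, not my exact config):

```python
# Minimal sketch of the setup (illustrative values, not my exact config).
import torch
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",   # placeholder 70B checkpoint
    torch_dtype=torch.bfloat16,
)

# LoRA adapters on the attention projections; only these are trainable.
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,   # the flag that seems to make no difference
)
# trainer = Trainer(model=model, args=args, train_dataset=...)  # dataset omitted
```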
Isn’t it because LoRA only has trainable parameters on the order of a few tens of millions, which isn’t significant memory-wise?
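Rough back-of-the-envelope numbers to show the scale I mean (assuming bf16 storage and a typical adapter size; these figures are illustrative, not measured):

```python
# Order-of-magnitude comparison: frozen base weights vs. LoRA trainable params
# (illustrative numbers, assuming bf16 storage at 2 bytes per parameter).
base_params = 70e9   # Llama 70B frozen base weights
lora_params = 40e6   # LoRA adapters, a few tens of millions of params

bytes_per_param = 2  # bf16
print(f"base weights : ~{base_params * bytes_per_param / 1e9:.0f} GB")  # ~140 GB
print(f"LoRA adapters: ~{lora_params * bytes_per_param / 1e6:.0f} MB")  # ~80 MB
print(f"trainable fraction: {lora_params / base_params:.2%}")           # ~0.06%
```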