Hi, I am trying to finetune a 35B model with LoRA (r = 64, alpha = 64). My per-device batch size is 2 with gradient accumulation of 2, and I am using 8 A100 80GB GPUs with DeepSpeed ZeRO-2. I estimated that 3 GPUs should be enough for this, but I cannot even fit it on 8 GPUs; I keep getting CUDA OOM errors. I am unable to figure out why this discrepancy exists. It would be great if someone could explain why this is happening.
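For reference, this is roughly how I arrived at the 3-GPU estimate. It is only a back-of-envelope sketch: the ~0.5% trainable-parameter fraction for r=64 adapters, the bf16/fp32 breakdown, and ignoring activations are all my own assumptions, not measured numbers.

```python
# Back-of-envelope memory estimate for LoRA-finetuning a 35B model.
# Assumptions (not measured): bf16 base weights, ~0.5% of parameters
# trainable as r=64 LoRA adapters, Adam states in fp32 for the adapter
# weights only. Activation memory is not included.

GB = 1e9

n_params = 35e9                      # base model parameters
base_weights = n_params * 2          # bf16 -> ~70 GB (frozen)

lora_params = 0.005 * n_params       # ~175M trainable adapter params (rough guess)
lora_weights = lora_params * 2       # bf16 adapter weights
lora_grads = lora_params * 2         # bf16 gradients for the adapters
adam_states = lora_params * 4 * 2    # fp32 momentum + variance for the adapters

total = base_weights + lora_weights + lora_grads + adam_states
print(f"~{total / GB:.0f} GB for weights + adapters + optimizer state")
# -> roughly 72 GB, so 3 x 80 GB A100s (240 GB total) looked like ample
#    headroom even after leaving generous room for activations at batch size 2.
```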