Hi, I am trying to fine-tune a 35B model using LoRA (r = 64, alpha = 64). My per-device batch size is 2 with gradient accumulation of 2, and I am training on 8 A100 80GB GPUs with DeepSpeed ZeRO-2. I estimated this would fit on 3 GPUs, but I cannot even run it on 8: I keep getting CUDA OOM errors. I am unable to figure out why this discrepancy exists. It would be great if someone could explain why this is happening.
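For anyone reading along, here is a rough back-of-envelope sketch of where the memory may be going. It assumes bf16 base weights and a ~0.5% LoRA trainable fraction (both assumptions, not from the original post). The key point is that ZeRO-2 shards gradients and optimizer states but not the model weights, so every GPU holds a full copy of the frozen 35B base model.

```python
# Back-of-envelope per-GPU memory estimate for LoRA + ZeRO-2.
# A sketch only: real usage also depends on sequence length,
# activation checkpointing, and framework/CUDA overhead.

GB = 1024**3

n_params = 35e9          # base model parameters
bytes_per_param = 2      # bf16/fp16 weights

# ZeRO-2 does NOT shard model weights: each GPU keeps a full bf16 copy.
weights_per_gpu = n_params * bytes_per_param / GB   # ~65 GiB of 80 GiB

# Assumed: LoRA trainables are ~0.5% of the base model (r=64 on the
# usual attention/MLP projections is typically in this ballpark).
lora_params = 0.005 * n_params
# fp32 master copy + Adam m and v = 12 bytes per trainable parameter.
lora_states_per_gpu = lora_params * 12 / GB

print(f"base weights per GPU:  {weights_per_gpu:.1f} GiB")
print(f"LoRA optimizer states: {lora_states_per_gpu:.1f} GiB")
```

So the bf16 weights alone occupy roughly 65 of the 80 GiB on every GPU, and activations plus gradients for the LoRA path push past the limit regardless of how many GPUs are added. A "3 GPUs should be enough" estimate only holds if the weights themselves are sharded (ZeRO-3) or quantized (QLoRA).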