CUDA OOM when LoRA fine-tuning a 35B model

Hi, I am trying to fine-tune a 35B model with LoRA (r = 64, alpha = 64). My per-device batch size is 2 with gradient accumulation of 2, and I am training on 8x A100 80GB GPUs with DeepSpeed ZeRO-2. I estimated that 3 GPUs should be enough for this, but I cannot even get it to run on 8 GPUs: I keep hitting CUDA OOM. I can't figure out why this discrepancy exists. It would be great if someone could explain what is happening.
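
For context, my rough estimate went something like this (assuming the frozen base weights sit in half precision): 35B params x 2 bytes is about 70 GB, plus a comparatively tiny LoRA adapter with its gradients and optimizer states, which I expected to fit across 3 x 80 GB A100s. Below is approximately what my training script looks like; the model path, LoRA target modules, and the toy dataset are placeholders, not my real values:

```python
# Minimal sketch of my setup (model path, target modules, and the toy
# dataset below are placeholders). Launched with:
#   deepspeed --num_gpus 8 train.py
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "path/to/35b-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

# LoRA: r = alpha = 64, base weights frozen, only adapter weights trained.
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],  # placeholder module names
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# DeepSpeed ZeRO stage-2 config passed inline as a dict.
ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 2,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,   # batch size 2
    gradient_accumulation_steps=2,   # grad accumulation 2
    bf16=True,
    deepspeed=ds_config,
    num_train_epochs=1,
    logging_steps=10,
)

# Toy stand-in dataset so the sketch is self-contained.
train_dataset = Dataset.from_dict({"text": ["hello world"] * 64}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```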