Model is getting loaded unevenly on GPUs

I am trying to finetune the Llama 3 8B model using PEFT QLoRA.
I load the GPTQ model with `AutoModelForCausalLM`, but it gets loaded unevenly across the GPUs, which prevents me from using a batch size of more than 1.
I am using a context length of 8192 or 4096; with 4096 I cannot go above a batch size of 2.
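For context, a minimal sketch of the kind of loading setup involved. The checkpoint name, the GPU count, the 20 GiB cap, and `device_map="auto"` are placeholders/assumptions, not my exact command; `max_memory` is the knob I would expect to use to force an even split across the cards:

```python
def balanced_max_memory(num_gpus: int, cap_gib: int) -> dict:
    """Cap every GPU at the same budget. Passed as `max_memory` to
    `from_pretrained`, this nudges device_map="auto" toward an even split,
    leaving headroom on each card for activations at batch size > 1."""
    return {i: f"{cap_gib}GiB" for i in range(num_gpus)}

print(balanced_max_memory(2, 20))  # → {0: '20GiB', 1: '20GiB'}

# The actual load would then look roughly like this (not run here,
# since it needs GPUs and the model weights):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "some-gptq-checkpoint",        # placeholder checkpoint name
#     device_map="auto",
#     max_memory=balanced_max_memory(2, 20),
# )
```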
[Screenshot 2024-07-08 at 7.31.22 PM]

Please help.

Another example. This happens with all models, whether GPTQ or bnb.