Model loads unevenly across GPUs with AutoModelForCausalLM

I am trying to do SFT with a context length of 4096.
The same setup works perfectly with Llama3 70B: model and cache loading is balanced across all GPUs.
But when loading Qwen2 9B or Llama3 8B for fine-tuning, the memory usage is uneven.
I can't even run a batch size of 2 on 4 A10 GPUs.
[Screenshot 2024-06-22: uneven per-GPU memory usage when loading with a batch size of 1]
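For reference, here is a minimal sketch of the kind of workaround I have been trying: capping per-GPU memory via `max_memory` so that `device_map="auto"` spreads the shards instead of filling one card first. The model id and the memory figures are placeholders (assuming 4x A10 with 24 GB each, leaving headroom for activations), not my exact code:

```python
# Sketch: build a per-device memory cap for device_map="auto" sharding.
# Figures are assumptions for 4 x A10 (24 GB); adjust for your hardware.

def build_max_memory(n_gpus, per_gpu="20GiB", cpu="64GiB"):
    """Build the max_memory dict accepted by from_pretrained."""
    caps = {i: per_gpu for i in range(n_gpus)}
    caps["cpu"] = cpu  # allow spill-over to CPU RAM if a GPU cap is hit
    return caps

max_memory = build_max_memory(4)  # leaves ~4 GiB headroom per GPU

# The actual loading call would then look like (placeholder model id):
# from transformers import AutoModelForCausalLM
# import torch
# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Meta-Llama-3-8B",   # placeholder model id
#     torch_dtype=torch.bfloat16,
#     device_map="auto",              # shard layers across visible GPUs
#     max_memory=max_memory,          # per-device caps built above
# )
```

Even with caps like these, the allocation still ends up skewed for the smaller models.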

Please help.