CUDA OOM on 4 A6000s (142 GB of VRAM) even after using ZeRO-3, QLoRA, Accelerate, max_token_length

I'm trying to SFT Qwen2.5-VL-3B-Instruct, but I keep hitting the same error over and over. I've looked at all the past threads and tried their solutions, but it's just not working. I don't think downgrading to a smaller model will do any good, because the error comes during attention, which is quadratic in sequence length N, not in model size (rough math below the traceback).

Maybe it's an issue with my collate_fn, but I can't find anything. I've even chopped max_token_length down to 1024 and I get the exact same error, so I feel like something else is going on. Here are the relevant files I'm running. I've been so, so stuck on this!!
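
This is the kind of check I've been trying to add (a minimal sketch; `collate_fn` is the collator from my Train.py, and `data_collator` assumes the HF Trainer API):

```python
import torch

# Wrap the collator and log the shapes it actually hands to the model,
# to confirm max_token_length really caps what reaches attention.
def debug_collate(batch):
    out = collate_fn(batch)  # my existing collator from Train.py
    for key, value in out.items():
        if torch.is_tensor(value):
            print(f"{key}: {tuple(value.shape)}")
    return out

# then pass data_collator=debug_collate (or collate_fn=debug_collate in the
# DataLoader) in place of the original collator
```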

training script: Train.py - Pastebin.com
deepspeed config: deepseed.json - Pastebin.com

error:
[rank2]: attn_output = F.scaled_dot_product_attention(q, k, v, attention_mask, dropout_p=0.0)
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 75.03 GiB. GPU 2 has a total capacity of 47.53 GiB of which 35.91 GiB is free. Including non-PyTorch memory, this process has 11.61 GiB memory in use. Of the allocated memory 10.68 GiB is allocated by PyTorch, and 413.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management
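
For what it's worth, this is the rough math I'm using to convince myself it's an attention-size problem and not a model-size problem (batch size, head count, and score dtype below are my guesses, not values pulled from the run):

```python
import math

# Back-of-the-envelope for that 75.03 GiB allocation. If SDPA ends up on the
# math backend (which can happen when an explicit attention_mask is passed),
# it materializes the full (batch, heads, N, N) score matrix.
batch_size = 1        # per-GPU micro-batch (assumption)
num_heads = 16        # Qwen2.5-VL-3B attention heads (assumption)
bytes_per_score = 4   # fp32 scores (assumption; bf16 would be 2)

alloc_bytes = 75.03 * 1024**3  # the allocation from the traceback

# alloc = batch * heads * N^2 * bytes  =>  solve for N
seq_len = math.sqrt(alloc_bytes / (batch_size * num_heads * bytes_per_score))
print(f"implied sequence length: ~{seq_len:,.0f} tokens")
# roughly 35,000 tokens, far above max_token_length=1024
```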


It seems that using ZeRO-2 instead of ZeRO-3 may work in some cases.
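
Something along these lines (a sketch assuming the HF Trainer path; only the ZeRO stage changes, the rest of your deepseed.json can stay as it is — a dict works for `deepspeed=` as well as a .json path):

```python
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 2,  # ZeRO-2 shards optimizer state and gradients, not params
        "offload_optimizer": {"device": "cpu"},  # optional, trades VRAM for CPU RAM
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="out",
    deepspeed=ds_config,
)
```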