Hello Team,
I am following the instructions from https://github.com/QwenLM/Qwen2.5-VL/tree/main/qwen-vl-finetune to finetune Qwen 2.5 VL with DeepSpeed on the TIGER-Lab/VisualWebInstruct dataset.
I am using an AWS g5.12xlarge instance, which has 4 A10G GPUs with 24 GB of VRAM each.
The training does not proceed at all and just hangs; after 30 minutes the NCCL watchdog kills it with a collective timeout (see the logs below).
The data init script is a simple modification of the original and is as follows:
```
VISUALWEBINSTRUCT = {
    "annotation_path": "/home/asaha/VisualWebInstruct/mixed_conversation.jsonl",
    "data_path": "/home/asaha/VisualWebInstruct/images",
}

data_dict = {
    "visualwebinstruct": VISUALWEBINSTRUCT,
}
```
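Before launching, I also want to sanity-check that the annotation file parses and that the referenced images actually exist, since a bad path here can stall the data pipeline silently. The sketch below is my own (not from the repo) and assumes the LLaVA-style record layout used by qwen-vl-finetune, where each JSONL record may carry an "image" field with paths relative to `data_path`; the key names may need adjusting if mixed_conversation.jsonl differs.
```
import json
import os

ANNOTATION_PATH = "/home/asaha/VisualWebInstruct/mixed_conversation.jsonl"
DATA_PATH = "/home/asaha/VisualWebInstruct/images"

checked = 0
missing = 0
with open(ANNOTATION_PATH) as f:
    for i, line in enumerate(f):
        if i >= 1000:  # only sample the first 1000 records
            break
        record = json.loads(line)
        # Assumption: "image" is either a single relative path or a list of them.
        images = record.get("image", [])
        if isinstance(images, str):
            images = [images]
        for rel in images:
            checked += 1
            if not os.path.exists(os.path.join(DATA_PATH, rel)):
                missing += 1
                print(f"missing: {rel}")

print(f"checked {checked} image paths, {missing} missing")
```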
The training init script is a simple modification of the original and is as follows:
```
#!/bin/bash
# Enable error handling
set -e
# Distributed training configuration
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
MASTER_PORT=${MASTER_PORT:-$(shuf -i 20001-29999 -n 1)}
NNODES=${WORLD_SIZE:-1}
NPROC_PER_NODE=4
export HF_HOME="<some folder>hf_cache"
# Add debugging flags (exported so the torchrun worker processes see them)
export TORCH_DISTRIBUTED_DEBUG=INFO
export NCCL_DEBUG=INFO
export PYTHONUNBUFFERED=1
# DeepSpeed configuration
deepspeed=./scripts/zero3.json
# Model configuration
llm=Qwen/Qwen2.5-VL-3B-Instruct # Using HuggingFace model ID
# Training hyperparameters
lr=2e-7
batch_size=4 # Reduced batch size
grad_accum_steps=4 # Increased gradient accumulation
# Training entry point
entry_file=qwenvl/train/train_qwen.py
# Dataset configuration (replace with public dataset names)
datasets="visualwebinstruct%100"
# Output configuration
run_name="qwen2vl-baseline"
output_dir=./output
# Training arguments
args="
--deepspeed ${deepspeed} \
--model_name_or_path "${llm}" \
--dataset_use ${datasets} \
--data_flatten True \
--tune_mm_vision False \
--tune_mm_mlp True \
--tune_mm_llm True \
--output_dir ${output_dir} \
--num_train_epochs 1 \
--per_device_train_batch_size ${batch_size} \
--per_device_eval_batch_size $((batch_size*2)) \
--gradient_accumulation_steps ${grad_accum_steps} \
--max_pixels 50176 \
--min_pixels 784 \
--eval_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 1 \
--learning_rate ${lr} \
--weight_decay 0 \
--warmup_ratio 0.03 \
--max_grad_norm 1 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--model_max_length 8192 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--run_name ${run_name} \
--bf16 True \
--report_to none"
# Create log directory if it doesn't exist
if [ ! -d "logs" ]; then
    mkdir -p logs
fi
# Launch training with proper logging
torchrun \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    --log_dir=logs \
    ${entry_file} ${args}
```
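Since the hang ends in an `_ALLGATHER_BASE` timeout (logs below), I want to rule out a plain NCCL/topology problem on the instance before digging into the training code. Below is a minimal all_gather smoke test I put together (my own sketch, not from the Qwen repo); saving it as e.g. `nccl_check.py` (hypothetical name) and running `torchrun --nproc_per_node=4 nccl_check.py` should exercise the same collective that times out.
```
import os

import torch
import torch.distributed as dist


def main():
    # torchrun provides LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR/PORT.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Exercise the same collective (all_gather into a flat tensor) that
    # the training run times out on.
    x = torch.full((1024,), dist.get_rank(), device="cuda", dtype=torch.float32)
    out = torch.empty(1024 * dist.get_world_size(), device="cuda", dtype=torch.float32)
    dist.all_gather_into_tensor(out, x)
    torch.cuda.synchronize()

    print(f"rank {dist.get_rank()}: all_gather OK, sum={out.sum().item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```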
**Logs:**
```
W0504 20:47:28.542000 144230 torch/distributed/run.py:792] *****************************************
[2025-05-04 20:47:33,161] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-04 20:47:33,209] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-04 20:47:33,246] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-04 20:47:33,264] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-04 20:47:35,214] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-05-04 20:47:35,214] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-05-04 20:47:35,214] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-05-04 20:47:35,214] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-05-04 20:47:35,214] [INFO] [comm.py:689:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-05-04 20:47:35,853] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-05-04 20:47:36,325] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-05-04 20:47:36,329] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-05-04 20:47:36,406] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-05-04 20:47:37,583] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 825, num_elems = 4.07B
Loading checkpoint shards: 100%|██████████| 2/2 [00:54<00:00, 27.26s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:54<00:00, 27.26s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:54<00:00, 27.26s/it]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Loading checkpoint shards: 100%|██████████| 2/2 [00:55<00:00, 27.62s/it]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Vision Module - Attention Blocks:
Trainable Block Indices: None
Non-Trainable Block Indices: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
Merger Module Trainable: True
LLM Module - Embed Tokens Trainable: True
LLM Module - Trainable Layer Indices: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35]
LLM Module - Non-Trainable Layer Indices: None
Parameter Offload: Total persistent parameters: 755712 in 408 params
0%| | 0/15688 [00:00<?, ?it/s]/home/asaha/Qwen2.5-VL/qwen-vl-finetune/qwen_dpsp/lib/python3.10/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/home/asaha/Qwen2.5-VL/qwen-vl-finetune/qwen_dpsp/lib/python3.10/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/home/asaha/Qwen2.5-VL/qwen-vl-finetune/qwen_dpsp/lib/python3.10/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/home/asaha/Qwen2.5-VL/qwen-vl-finetune/qwen_dpsp/lib/python3.10/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
[rank4]:[E505 07:54:26.886357839 ProcessGroupNCCL.cpp:629] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800044 milliseconds before timing out.
[rank2]:[E505 07:54:26.886354198 ProcessGroupNCCL.cpp:629] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800043 milliseconds before timing out.
[rank7]:[E505 07:54:26.886357814 ProcessGroupNCCL.cpp:629] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800039 milliseconds before timing out.
[rank1]:[E505 07:54:26.886360570 ProcessGroupNCCL.cpp:629] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800047 milliseconds before timing out.
[rank5]:[E505 07:54:26.886362121 ProcessGroupNCCL.cpp:629] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800043 milliseconds before timing out.
[rank3]:[E505 07:54:26.886367208 ProcessGroupNCCL.cpp:629] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800046 milliseconds before timing out.
[rank0]:[E505 07:54:26.886352680 ProcessGroupNCCL.cpp:629] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800029 milliseconds before timing out.
[rank4]:[E505 07:54:26.945450765 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 4] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
[rank2]:[E505 07:54:26.945453676 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 2] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
[rank7]:[E505 07:54:26.945456073 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 7] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
[rank0]:[E505 07:54:26.945450113 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 0] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
[rank1]:[E505 07:54:26.945458867 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 1] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
[rank3]:[E505 07:54:26.945463933 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 3] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
[rank5]:[E505 07:54:26.945459141 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 5] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
```
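To see where each rank is actually stuck during the 30-minute window before the watchdog fires, one thing I can do (my own idea, not part of the Qwen scripts) is have every worker dump its Python stacks periodically via the standard-library `faulthandler` module, for example by adding this near the top of `qwenvl/train/train_qwen.py` so the traces land in the per-rank log files:
```
import faulthandler
import sys

# Dump every thread's Python traceback to stderr every 10 minutes so the
# per-rank logs show where the process is blocked during the NCCL hang.
faulthandler.dump_traceback_later(600, repeat=True, file=sys.stderr)
```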
**nvidia-smi**
```
Sun May 4 21:19:47 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
| 0% 31C P0 67W / 300W | 10361MiB / 23028MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
| 0% 29C P0 63W / 300W | 10363MiB / 23028MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
| 0% 29C P0 65W / 300W | 10369MiB / 23028MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 30C P0 63W / 300W | 10369MiB / 23028MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2206 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 149175 C ...l-finetune/qwen_dpsp/bin/python3.10 10340MiB |
| 1 N/A N/A 2206 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 149176 C ...l-finetune/qwen_dpsp/bin/python3.10 10342MiB |
| 2 N/A N/A 2206 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 149177 C ...l-finetune/qwen_dpsp/bin/python3.10 10348MiB |
| 3 N/A N/A 2206 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 149178 C ...l-finetune/qwen_dpsp/bin/python3.10 10348MiB |
+---------------------------------------------------------------------------------------+
```