Cuda OOM on 4 A6000s (142 GB of VRAM) even after using Zero3, Qlora, Accelerate, Max_token_length

John6666 · May 8, 2025, 5:20am

It seems that using zero2 instead of zero3 may work in some cases.
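For reference, here is a minimal sketch of what a ZeRO-2 config might look like when generated from Python. The helper name `build_zero2_config`, the file name `zero2.json`, and the bucket sizes are assumptions for illustration, not values from this thread. The `"auto"` entries assume the Hugging Face Trainer integration, which fills them in from the training arguments.

```python
import json


def build_zero2_config(bucket_size: int = 200_000_000) -> dict:
    """Return a DeepSpeed config dict using ZeRO stage 2 (hypothetical helper).

    ZeRO-2 shards optimizer states and gradients but keeps a full copy of the
    parameters on every GPU, so it avoids the per-layer parameter all-gathers
    that ZeRO-3 needs, at the cost of more memory per GPU.
    """
    return {
        "bf16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 2,
            "overlap_comm": True,
            "contiguous_gradients": True,
            # Bucket sizes are illustrative; smaller buckets lower peak memory.
            "allgather_bucket_size": bucket_size,
            "reduce_bucket_size": bucket_size,
        },
        # "auto" lets the Trainer integration copy values from TrainingArguments.
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "train_batch_size": "auto",
        "gradient_clipping": "auto",
    }


if __name__ == "__main__":
    # Print the JSON; redirect it to a file such as zero2.json (assumed name).
    print(json.dumps(build_zero2_config(), indent=2))
```

If this output is saved to a file, the training script would point `--deepspeed` at that file instead of the ZeRO-3 one. Whether this resolves a given OOM or hang depends on the model and setup.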

Qwen-2.5-VL-7B finetuning issue

opened 11:51AM - 13 Feb 25 UTC

closed 04:15AM - 18 Feb 25 UTC

Hi, I got the following issues while finetuning Qwen-2.5-VL-Instruct.

1. The en…vironment.yaml file expects `transformers==4.48.0` and as far as I know, `Qwen2_5_VLForConditionalGeneration` cannot be imported from this version
2. When I updated the transformer to `git+https://github.com/huggingface/transformers`, it gives me an error

```
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/train/Qwen2-VL-Finetune/src/training/train.py", line 224, in <module>
[rank0]:     train()
[rank0]:   File "/root/train/Qwen2-VL-Finetune/src/training/train.py", line 199, in train
[rank0]:     trainer.train()
[rank0]:   File "/opt/conda/envs/qwen2/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/opt/conda/envs/qwen2/lib/python3.10/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:   File "/opt/conda/envs/qwen2/lib/python3.10/site-packages/transformers/trainer.py", line 3698, in training_step
[rank0]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank0]:   File "/opt/conda/envs/qwen2/lib/python3.10/site-packages/transformers/trainer.py", line 3759, in compute_loss
[rank0]:     outputs = model(**inputs)
[rank0]:   File "/opt/conda/envs/qwen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/qwen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/qwen2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/qwen2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1899, in forward
[rank0]:     loss = self.module(*inputs, **kwargs)
[rank0]:   File "/opt/conda/envs/qwen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/qwen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/opt/conda/envs/qwen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/qwen2/lib/python3.10/site-packages/peft/peft_model.py", line 563, in forward
[rank0]:     return self.get_base_model()(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/qwen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/qwen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/opt/conda/envs/qwen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/root/train/Qwen2-VL-Finetune/src/training/monkey_patch_forward.py", line 222, in qwen2_5_mixed_modality_forward
[rank0]:     self.visual(torch.zeros(14903, 1176), gird_thw=torch.Tensor([[1, 98, 146]]))
[rank0]:   File "/opt/conda/envs/qwen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/qwen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/opt/conda/envs/qwen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]: TypeError: Qwen2_5_VisionTransformerPretrainedModel.forward() got an unexpected keyword argument 'gird_thw'
```

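The traceback in the issue above ends with a keyword-argument mismatch: the patched forward passes `gird_thw`, and the vision model's `forward()` rejects it. The sketch below reproduces that class of error without any ML dependencies. `visual_forward` is a hypothetical stand-in, not the real `Qwen2_5_VisionTransformerPretrainedModel.forward`, and the parameter name `grid_thw` is an assumption about the intended spelling.

```python
import inspect


def visual_forward(hidden_states, grid_thw):
    """Hypothetical stand-in for a vision tower's forward()."""
    return len(hidden_states), grid_thw


def call_with_checked_kwargs(fn, *args, **kwargs):
    """Call fn, but name the offending keyword(s) and the accepted ones first."""
    accepted = list(inspect.signature(fn).parameters)
    unknown = [name for name in kwargs if name not in accepted]
    if unknown:
        raise TypeError(
            f"{fn.__name__}() got unexpected keyword(s) {unknown}; "
            f"accepted parameters are {accepted}"
        )
    return fn(*args, **kwargs)


if __name__ == "__main__":
    patches = [[0.0] * 4 for _ in range(3)]  # toy input, not real pixel data
    try:
        # Misspelled keyword, mirroring the call shown in the traceback.
        call_with_checked_kwargs(visual_forward, patches, gird_thw=[[1, 98, 146]])
    except TypeError as err:
        print("rejected:", err)
    # Assumed correct spelling.
    print(call_with_checked_kwargs(visual_forward, patches, grid_thw=[[1, 98, 146]]))
```

A check like this only makes the mismatch easier to read. The actual fix would be in whichever code passes the misspelled keyword.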
github.com/QwenLM/Qwen2.5-VL

Finetuning Qwen 2.5 VL 3B with Deepspeed is hanging and not progressing on a g5.12xlarge (4 A10 Gpus)

opened 04:01AM - 05 May 25 UTC

anindya-saha

Hello Team,

I am following the instructions from https://github.com/QwenLM/Qwe…n2.5-VL/tree/main/qwen-vl-finetune to finetune Qwen 2.5 VL with DeepSpeed on the TIGER-Lab/VisualWebInstruct dataset. I am using an AWS g5.12xlarge instance, which has 4 A10 GPUs with 24 GB VRAM each. The training does not proceed at all and just hangs.

The data init script is a simple modification of the original and is as follows:

```
VISUALWEBINSTRUCT = {
    "annotation_path": "/home/asaha/VisualWebInstruct/mixed_conversation.jsonl",
    "data_path": "/home/asaha/VisualWebInstruct/images",
}

data_dict = {
    "visualwebinstruct": VISUALWEBINSTRUCT,
}
```

The training init script is a simple modification of the original and is as follows:

```
#!/bin/bash

# Enable error handling
set -e

# Distributed training configuration
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
MASTER_PORT=${MASTER_PORT:-$(shuf -i 20001-29999 -n 1)}
NNODES=${WORLD_SIZE:-1}
NPROC_PER_NODE=4

HF_HOME="<some folder>hf_cache"

# Add debugging flags
TORCH_DISTRIBUTED_DEBUG=INFO
NCCL_DEBUG=INFO
PYTHONUNBUFFERED=1

# DeepSpeed configuration
deepspeed=./scripts/zero3.json

# Model configuration
llm=Qwen/Qwen2.5-VL-3B-Instruct  # Using HuggingFace model ID

# Training hyperparameters
lr=2e-7
batch_size=4  # Reduced batch size
grad_accum_steps=4  # Increased gradient accumulation

# Training entry point
entry_file=qwenvl/train/train_qwen.py

# Dataset configuration (replace with public dataset names)
datasets="visualwebinstruct%100"

# Output configuration
run_name="qwen2vl-baseline"
output_dir=./output

# Training arguments
args="
    --deepspeed ${deepspeed} \
    --model_name_or_path "${llm}" \
    --dataset_use ${datasets} \
    --data_flatten True \
    --tune_mm_vision False \
    --tune_mm_mlp True \
    --tune_mm_llm True \
    --output_dir ${output_dir} \
    --num_train_epochs 1 \
    --per_device_train_batch_size ${batch_size} \
    --per_device_eval_batch_size $((batch_size*2)) \
    --gradient_accumulation_steps ${grad_accum_steps} \
    --max_pixels 50176 \
    --min_pixels 784 \
    --eval_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate ${lr} \
    --weight_decay 0 \
    --warmup_ratio 0.03 \
    --max_grad_norm 1 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --model_max_length 8192 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --run_name ${run_name} \
    --bf16 True \
    --report_to none"

# Create log directory if it doesn't exist
if [ ! -d "logs" ]; then
    mkdir -p logs
fi

# Launch training with proper logging
torchrun \
    --nproc_per_node=${NPROC_PER_NODE} \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    --log_dir=logs \
    ${entry_file} ${args}
```

**Logs:**

```
W0504 20:47:28.542000 144230 torch/distributed/run.py:792] *****************************************
[2025-05-04 20:47:33,161] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-04 20:47:33,209] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-04 20:47:33,246] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-04 20:47:33,264] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-04 20:47:35,214] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-05-04 20:47:35,214] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-05-04 20:47:35,214] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-05-04 20:47:35,214] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-05-04 20:47:35,214] [INFO] [comm.py:689:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-05-04 20:47:35,853] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-05-04 20:47:36,325] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-05-04 20:47:36,329] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-05-04 20:47:36,406] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-05-04 20:47:37,583] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 825, num_elems = 4.07B
Loading checkpoint shards: 100%|██████████| 2/2 [00:54<00:00, 27.26s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:54<00:00, 27.26s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:54<00:00, 27.26s/it]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Loading checkpoint shards: 100%|██████████| 2/2 [00:55<00:00, 27.62s/it]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Vision Module - Attention Blocks:
Trainable Block Indices: None
Non-Trainable Block Indices: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
Merger Module Trainable: True
LLM Module - Embed Tokens Trainable: True
LLM Module - Trainable Layer Indices: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35]
LLM Module - Non-Trainable Layer Indices: None
Parameter Offload: Total persistent parameters: 755712 in 408 params
  0%|          | 0/15688 [00:00<?, ?it/s]/home/asaha/Qwen2.5-VL/qwen-vl-finetune/qwen_dpsp/lib/python3.10/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
/home/asaha/Qwen2.5-VL/qwen-vl-finetune/qwen_dpsp/lib/python3.10/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
/home/asaha/Qwen2.5-VL/qwen-vl-finetune/qwen_dpsp/lib/python3.10/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
/home/asaha/Qwen2.5-VL/qwen-vl-finetune/qwen_dpsp/lib/python3.10/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
[rank4]:[E505 07:54:26.886357839 ProcessGroupNCCL.cpp:629] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800044 milliseconds before timing out.
[rank2]:[E505 07:54:26.886354198 ProcessGroupNCCL.cpp:629] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800043 milliseconds before timing out.
[rank7]:[E505 07:54:26.886357814 ProcessGroupNCCL.cpp:629] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800039 milliseconds before timing out.
[rank1]:[E505 07:54:26.886360570 ProcessGroupNCCL.cpp:629] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800047 milliseconds before timing out.
[rank5]:[E505 07:54:26.886362121 ProcessGroupNCCL.cpp:629] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800043 milliseconds before timing out.
[rank3]:[E505 07:54:26.886367208 ProcessGroupNCCL.cpp:629] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800046 milliseconds before timing out.
[rank0]:[E505 07:54:26.886352680 ProcessGroupNCCL.cpp:629] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800029 milliseconds before timing out.
[rank4]:[E505 07:54:26.945450765 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 4] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
[rank2]:[E505 07:54:26.945453676 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 2] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
[rank7]:[E505 07:54:26.945456073 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 7] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
[rank0]:[E505 07:54:26.945450113 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 0] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
[rank1]:[E505 07:54:26.945458867 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 1] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
[rank3]:[E505 07:54:26.945463933 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 3] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
[rank5]:[E505 07:54:26.945459141 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 5] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
```

**nvidia-smi**

```
Sun May 4 21:19:47 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1B.0 Off |                    0 |
|  0%   31C    P0              67W / 300W |  10361MiB / 23028MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G                    On  | 00000000:00:1C.0 Off |                    0 |
|  0%   29C    P0              63W / 300W |  10363MiB / 23028MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G                    On  | 00000000:00:1D.0 Off |                    0 |
|  0%   29C    P0              65W / 300W |  10369MiB / 23028MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   30C    P0              63W / 300W |  10369MiB / 23028MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2206      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A    149175      C   ...l-finetune/qwen_dpsp/bin/python3.10    10340MiB |
|    1   N/A  N/A      2206      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A    149176      C   ...l-finetune/qwen_dpsp/bin/python3.10    10342MiB |
|    2   N/A  N/A      2206      G   /usr/lib/xorg/Xorg                            4MiB |
|    2   N/A  N/A    149177      C   ...l-finetune/qwen_dpsp/bin/python3.10    10348MiB |
|    3   N/A  N/A      2206      G   /usr/lib/xorg/Xorg                            4MiB |
|    3   N/A  N/A    149178      C   ...l-finetune/qwen_dpsp/bin/python3.10    10348MiB |
+---------------------------------------------------------------------------------------+
```
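To put the numbers in that report in perspective, here is a small back-of-the-envelope sketch using only values that appear in the script and logs: 4 processes, a per-device batch of 4, 4 gradient accumulation steps, 15688 total steps, and a 1800000 ms NCCL timeout. The function names are made up for illustration. The sample estimate assumes one epoch, that every optimizer step consumes a full batch, and that `--data_flatten` does not change how steps are counted.

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Samples consumed per optimizer step across all GPUs."""
    return per_device_batch * grad_accum_steps * num_gpus


def approx_samples(total_steps: int, per_device_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Rough upper bound on samples seen, assuming every step uses a full batch."""
    return total_steps * effective_batch_size(per_device_batch, grad_accum_steps, num_gpus)


def timeout_minutes(timeout_ms: int) -> float:
    """Convert the NCCL watchdog timeout from milliseconds to minutes."""
    return timeout_ms / 60_000


if __name__ == "__main__":
    # Values taken from the training script and logs in the issue.
    print("effective batch:", effective_batch_size(4, 4, 4))
    print("approx samples:", approx_samples(15688, 4, 4, 4))
    print("watchdog timeout (min):", timeout_minutes(1_800_000))
```

So the watchdog fired after roughly half an hour of waiting on a single all-gather, which matches the hang described above rather than ordinary slow progress.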
