Hello Team,
I am following the instructions from https://github.com/QwenLM/Qwen2.5-VL/tree/main/qwen-vl-finetune to finetune Qwen 2.5 VL with DeepSpeed on the TIGER-Lab/VisualWebInstruct dataset.
I am using an AWS g5.12xlarge instance, which has 4 A10G GPUs with 24 GB of VRAM each.
The training does not proceed at all and just hangs; after 30 minutes the NCCL watchdog kills it with a collective timeout (see the logs below).
The data init script is a simple modification of the original and is as follows:
```
VISUALWEBINSTRUCT = {
    "annotation_path": "/home/asaha/VisualWebInstruct/mixed_conversation.jsonl",
    "data_path": "/home/asaha/VisualWebInstruct/images",
}

data_dict = {
    "visualwebinstruct": VISUALWEBINSTRUCT,
}
```
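Before launching, I also want to sanity-check that the annotation file parses and that the referenced images actually exist, since a bad path here can stall the data pipeline silently. The sketch below is my own (not from the repo) and assumes the LLaVA-style record layout used by qwen-vl-finetune, where each JSONL record may carry an "image" field with paths relative to `data_path`; the key names may need adjusting if mixed_conversation.jsonl differs.
```
import json
import os

ANNOTATION_PATH = "/home/asaha/VisualWebInstruct/mixed_conversation.jsonl"
DATA_PATH = "/home/asaha/VisualWebInstruct/images"

checked = 0
missing = 0
with open(ANNOTATION_PATH) as f:
    for i, line in enumerate(f):
        if i >= 1000:  # only sample the first 1000 records
            break
        record = json.loads(line)
        # Assumption: "image" is either a single relative path or a list of them.
        images = record.get("image", [])
        if isinstance(images, str):
            images = [images]
        for rel in images:
            checked += 1
            if not os.path.exists(os.path.join(DATA_PATH, rel)):
                missing += 1
                print(f"missing: {rel}")

print(f"checked {checked} image paths, {missing} missing")
```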
The training init script is a simple modification of the original and is as follows:
```
#!/bin/bash
# Enable error handling
set -e
# Distributed training configuration
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
MASTER_PORT=${MASTER_PORT:-$(shuf -i 20001-29999 -n 1)}
NNODES=${WORLD_SIZE:-1}
NPROC_PER_NODE=4
export HF_HOME="<some folder>hf_cache"
# Add debugging flags (exported so the torchrun worker processes see them)
export TORCH_DISTRIBUTED_DEBUG=INFO
export NCCL_DEBUG=INFO
export PYTHONUNBUFFERED=1
# DeepSpeed configuration
deepspeed=./scripts/zero3.json
# Model configuration
llm=Qwen/Qwen2.5-VL-3B-Instruct # Using HuggingFace model ID
# Training hyperparameters
lr=2e-7
batch_size=4 # Reduced batch size
grad_accum_steps=4 # Increased gradient accumulation
# Training entry point
entry_file=qwenvl/train/train_qwen.py
# Dataset configuration (replace with public dataset names)
datasets="visualwebinstruct%100"
# Output configuration
run_name="qwen2vl-baseline"
output_dir=./output
# Training arguments
args="
--deepspeed ${deepspeed} \
--model_name_or_path "${llm}" \
--dataset_use ${datasets} \
--data_flatten True \
--tune_mm_vision False \
--tune_mm_mlp True \
--tune_mm_llm True \
--output_dir ${output_dir} \
--num_train_epochs 1 \
--per_device_train_batch_size ${batch_size} \
--per_device_eval_batch_size $((batch_size*2)) \
--gradient_accumulation_steps ${grad_accum_steps} \
--max_pixels 50176 \
--min_pixels 784 \
--eval_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 1 \
--learning_rate ${lr} \
--weight_decay 0 \
--warmup_ratio 0.03 \
--max_grad_norm 1 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--model_max_length 8192 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--run_name ${run_name} \
--bf16 True \
--report_to none"
# Create log directory if it doesn't exist
if [ ! -d "logs" ]; then
    mkdir -p logs
fi
# Launch training with proper logging
torchrun \
    --nproc_per_node=${NPROC_PER_NODE} \
    --nnodes=${NNODES} \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    --log_dir=logs \
    ${entry_file} ${args}
```
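Since the hang ends in an `_ALLGATHER_BASE` timeout (logs below), I want to rule out a plain NCCL/topology problem on the instance before digging into the training code. Below is a minimal all_gather smoke test I put together (my own sketch, not from the Qwen repo); saving it as e.g. `nccl_check.py` (hypothetical name) and running `torchrun --nproc_per_node=4 nccl_check.py` should exercise the same collective that times out.
```
import os

import torch
import torch.distributed as dist


def main():
    # torchrun provides LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR/PORT.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Exercise the same collective (all_gather into a flat tensor) that
    # the training run times out on.
    x = torch.full((1024,), dist.get_rank(), device="cuda", dtype=torch.float32)
    out = torch.empty(1024 * dist.get_world_size(), device="cuda", dtype=torch.float32)
    dist.all_gather_into_tensor(out, x)
    torch.cuda.synchronize()

    print(f"rank {dist.get_rank()}: all_gather OK, sum={out.sum().item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```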
**Logs:**
```
W0504 20:47:28.542000 144230 torch/distributed/run.py:792] *****************************************
[2025-05-04 20:47:33,161] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-04 20:47:33,209] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-04 20:47:33,246] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-04 20:47:33,264] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-04 20:47:35,214] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-05-04 20:47:35,214] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-05-04 20:47:35,214] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-05-04 20:47:35,214] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-05-04 20:47:35,214] [INFO] [comm.py:689:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-05-04 20:47:35,853] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-05-04 20:47:36,325] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-05-04 20:47:36,329] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-05-04 20:47:36,406] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-05-04 20:47:37,583] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 825, num_elems = 4.07B
Loading checkpoint shards: 100%|██████████| 2/2 [00:54<00:00, 27.26s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:54<00:00, 27.26s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:54<00:00, 27.26s/it]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Loading checkpoint shards: 100%|██████████| 2/2 [00:55<00:00, 27.62s/it]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Vision Module - Attention Blocks:
Trainable Block Indices: None
Non-Trainable Block Indices: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
Merger Module Trainable: True
LLM Module - Embed Tokens Trainable: True
LLM Module - Trainable Layer Indices: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35]
LLM Module - Non-Trainable Layer Indices: None
Parameter Offload: Total persistent parameters: 755712 in 408 params
0%| | 0/15688 [00:00<?, ?it/s]/home/asaha/Qwen2.5-VL/qwen-vl-finetune/qwen_dpsp/lib/python3.10/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/home/asaha/Qwen2.5-VL/qwen-vl-finetune/qwen_dpsp/lib/python3.10/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/home/asaha/Qwen2.5-VL/qwen-vl-finetune/qwen_dpsp/lib/python3.10/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/home/asaha/Qwen2.5-VL/qwen-vl-finetune/qwen_dpsp/lib/python3.10/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
[rank4]:[E505 07:54:26.886357839 ProcessGroupNCCL.cpp:629] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800044 milliseconds before timing out.
[rank2]:[E505 07:54:26.886354198 ProcessGroupNCCL.cpp:629] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800043 milliseconds before timing out.
[rank7]:[E505 07:54:26.886357814 ProcessGroupNCCL.cpp:629] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800039 milliseconds before timing out.
[rank1]:[E505 07:54:26.886360570 ProcessGroupNCCL.cpp:629] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800047 milliseconds before timing out.
[rank5]:[E505 07:54:26.886362121 ProcessGroupNCCL.cpp:629] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800043 milliseconds before timing out.
[rank3]:[E505 07:54:26.886367208 ProcessGroupNCCL.cpp:629] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800046 milliseconds before timing out.
[rank0]:[E505 07:54:26.886352680 ProcessGroupNCCL.cpp:629] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2922, OpType=_ALLGATHER_BASE, NumelIn=614880, NumelOut=4919040, Timeout(ms)=1800000) ran for 1800029 milliseconds before timing out.
[rank4]:[E505 07:54:26.945450765 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 4] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
[rank2]:[E505 07:54:26.945453676 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 2] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
[rank7]:[E505 07:54:26.945456073 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 7] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
[rank0]:[E505 07:54:26.945450113 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 0] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
[rank1]:[E505 07:54:26.945458867 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 1] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
[rank3]:[E505 07:54:26.945463933 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 3] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
[rank5]:[E505 07:54:26.945459141 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 5] failure detected by watchdog at work sequence id: 2922 PG status: last enqueued work: 2922, last completed work: 2921
```
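To see where each rank is actually stuck during the 30-minute window before the watchdog fires, one thing I can do (my own idea, not part of the Qwen scripts) is have every worker dump its Python stacks periodically via the standard-library `faulthandler` module, for example by adding this near the top of `qwenvl/train/train_qwen.py` so the traces land in the per-rank log files:
```
import faulthandler
import sys

# Dump every thread's Python traceback to stderr every 10 minutes so the
# per-rank logs show where the process is blocked during the NCCL hang.
faulthandler.dump_traceback_later(600, repeat=True, file=sys.stderr)
```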
**nvidia-smi**
```
Sun May 4 21:19:47 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
| 0% 31C P0 67W / 300W | 10361MiB / 23028MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
| 0% 29C P0 63W / 300W | 10363MiB / 23028MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
| 0% 29C P0 65W / 300W | 10369MiB / 23028MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 30C P0 63W / 300W | 10369MiB / 23028MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2206 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 149175 C ...l-finetune/qwen_dpsp/bin/python3.10 10340MiB |
| 1 N/A N/A 2206 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 149176 C ...l-finetune/qwen_dpsp/bin/python3.10 10342MiB |
| 2 N/A N/A 2206 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 149177 C ...l-finetune/qwen_dpsp/bin/python3.10 10348MiB |
| 3 N/A N/A 2206 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 149178 C ...l-finetune/qwen_dpsp/bin/python3.10 10348MiB |
+---------------------------------------------------------------------------------------+
```