Whisper medium fine-tuning on an RTX 4090: GPU mostly stays idle

I am fine-tuning Whisper medium using this guide. [Important detail: the GPU is connected through Thunderbolt 4.] The data is Mozilla Common Voice 17 (approx. 9000 test, 4000 train). The GPU mostly stays idle while the CPU (Core i7, 13th gen) works at 10-30 percent the whole time.
Does that mean I do not have enough CPU resources to feed the GPU?
Should I set something like dataloader_num_workers to 2 or 4, as suggested in this post? (A sketch of what I mean is below.)
Does the Thunderbolt connection have something to do with it?
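
For reference, the change I have in mind from that post would just be one extra argument in the guide's training arguments (a sketch with a hypothetical worker count, not something I have tested yet):

from transformers import Seq2SeqTrainingArguments

# Sketch only (not tested): the guide's arguments plus dataloader_num_workers,
# which sets how many CPU worker processes the Trainer's DataLoader uses
# to prepare batches for the GPU. The value 4 is a hypothetical choice.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-ur",
    per_device_train_batch_size=16,
    dataloader_num_workers=4,
    fp16=True,
    # ... the remaining arguments exactly as in the guide ...
)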


I don't know anything about Thunderbolt, but here are some ideas:

  • The "medium" model is compute-intensive, so each batch is loaded quickly and then computed on for much longer; data loading alone should not be the problem.
  • The dataset is quite small and the GPU is quite powerful, so with a reasonable batch size one epoch should finish relatively quickly.
  • Thunderbolt 4 has a 40 Gbps theoretical limit, and in practice you can easily get 2 GB/s transfers, which is more than enough for your case.

I think we cannot deduce more without the following info:

  • Which language is it (language code)? Which splits do you use, the default ones?
  • What are your training parameters?
  • Are you sure you are using the GPU build of PyTorch? (A quick check is sketched after this list.)
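
For that last point, a quick sanity check would be something like this (minimal sketch):

import torch

# If this prints a "+cpu" build or False below, training is running on the CPU.
print(torch.__version__)                   # a CUDA build looks like "2.x.y+cuXXX"
print(torch.cuda.is_available())           # should be True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # should show the RTX 4090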

Thanks for your reply.
The language is Urdu, and the training parameters are exactly as in the original guide; I am pasting them here as well:

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-ur",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=5000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
)

I tried increasing the batch size to 32, and also tweaking gradient_accumulation_steps (for example a value of 4) to increase the effective batch size, but progress stalled and the estimated time grew after every such attempt. Since I do not understand these settings well, I kept them as they were. With these default settings I have spent 7.5 hours and am at 93% of training at this moment (Whisper medium). My GPU only shows an occasional spike (the graph in Task Manager is mostly empty; this is Windows 11), 19.5 of 24 GB VRAM is in use, and the CPU sits constantly at 10-30% (mostly around 20%). The PyTorch version is '2.5.1+cu124' (I installed it by selecting CUDA 12.4 on the Start Locally page).
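
For context, my understanding is that the effective batch size per optimizer step is per_device_train_batch_size multiplied by gradient_accumulation_steps, which is what the comment in the snippet above refers to. A small sketch with hypothetical values:

# Sketch (hypothetical values): the effective batch size per optimizer step
# is per_device_train_batch_size * gradient_accumulation_steps.
configs = {
    "guide default": {"per_device_train_batch_size": 16, "gradient_accumulation_steps": 1},
    "same effective batch, less VRAM": {"per_device_train_batch_size": 8, "gradient_accumulation_steps": 2},
}
for name, cfg in configs.items():
    effective = cfg["per_device_train_batch_size"] * cfg["gradient_accumulation_steps"]
    print(f"{name}: effective batch size = {effective}")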
Edit: my dataset split:

from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

common_voice["train"] = load_dataset("mozilla-foundation/common_voice_17_0", "ur", split="train+validation", trust_remote_code=True)

common_voice["test"] = load_dataset("mozilla-foundation/common_voice_17_0", "ur", split="test", trust_remote_code=True)
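
Printing the DatasetDict after loading shows the number of rows per split, which is where my approximate counts above come from:

# Quick check of the split sizes after loading.
print(common_voice)
print({split: ds.num_rows for split, ds in common_voice.items()})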

Sorry for the late reply. I'm guessing you are looking at the general "utilization" figure.

  1. The Windows 11 Task Manager does not show CUDA usage by default.
  2. AFAIK that "utilization" number does not take CUDA usage into account.
  3. By default it gives a summary view.

So that we are speaking of the same measure:

  1. Disable the summary view: right-click the GPU on the left.
  2. Disable HW acceleration (the hardware-accelerated GPU scheduling setting in Windows).
  3. Now you can select CUDA to see the actual utilization. IIRC the activity usually shows up in the Cuda, Copy 1 and Copy 2 engines, so select them from the top left.

A better tool is nvidia-smi, of course.
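
If you would rather stay inside Python than watch nvidia-smi, a rough check like this also works (minimal sketch; torch.cuda.utilization() needs the nvidia-ml-py package installed):

import time

import torch

# Polls device-wide CUDA utilization via NVML once per second,
# independently of what Task Manager chooses to display.
for _ in range(30):
    print(f"CUDA utilization: {torch.cuda.utilization()}%")
    time.sleep(1)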


Thanks, that's what is happening. This is the CUDA utilization while I am running the training:

