I am fine-tuning Whisper medium using this guide. [Important detail: the GPU is connected through Thunderbolt 4.] The data is Mozilla Common Voice 17 (approx. 9000 test, 4000 train examples). The GPU mostly stays idle while the CPU (Core i7, 13th gen) runs at 10-30% the whole time.
Does that mean I do not have enough CPU resources to feed the GPU?
Should I set something like dataloader_num_workers to 2 or 4, as suggested in this post?
Does a Thunderbolt bottleneck have something to do with it?
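For context, what I mean is adding something like this to the training arguments (just a sketch; 2 or 4 are the values from that post, I have not tried it yet):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-ur",
    per_device_train_batch_size=16,
    dataloader_num_workers=4,  # extra worker processes to prepare batches for the GPU
    # ... all other arguments as in the guide ...
)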
I don't know anything about Thunderbolt, but here are some ideas:
- The "medium" model is compute-intensive, so each batch is loaded quickly and the GPU then computes on it for much longer. Loading should not be the problem.
- The dataset is quite small and the GPU is quite powerful, so with a reasonable batch size one epoch should finish relatively quickly.
- Thunderbolt 4 has a 40 Gbps theoretical limit, but you can easily get 2 GB/s transfers, which is more than enough for your case (see the rough estimate below).
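A quick back-of-envelope check (my own numbers: 30-second clips padded to Whisper's fixed 80 × 3000 log-mel input, fp16, and an assumed 2 GB/s effective link speed):

# Rough per-batch transfer cost over Thunderbolt 4 (assumed ~2 GB/s effective).
mel_bins, frames = 80, 3000        # Whisper's fixed log-mel input for a 30 s window
bytes_per_value = 2                # fp16
batch_size = 16

batch_bytes = mel_bins * frames * bytes_per_value * batch_size
transfer_s = batch_bytes / (2 * 1024**3)
print(f"{batch_bytes / 1e6:.1f} MB per batch, ~{transfer_s * 1e3:.1f} ms to transfer")
# roughly 7-8 MB and a few milliseconds, so the link is not the bottleneck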
I think we cannot deduce more without the following info:
- Which language is it (language code)? Which splits do you use? Default ones?
- What are your training parameters?
- Are you sure you are using the GPU build of pytorch? (A quick way to check is below.)
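Something like this confirms that CUDA is actually visible to pytorch (minimal check, nothing specific to your setup assumed):

import torch

print(torch.__version__)              # a CUDA build ends in something like "+cu124"
print(torch.cuda.is_available())      # should be True
print(torch.cuda.get_device_name(0))  # should name your GPU
# If is_available() is False, the CPU-only build is installed and training falls back to the CPU.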
Thanks for your reply.
The language is Urdu, and the training parameters are exactly as in the original guide. I'm also pasting them here:
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-ur",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=5000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
)
I tried to increase the batch size to 32, and also tweaked gradient_accumulation_steps to increase the effective batch size (for example a value of 4), but progress stalled and the estimated time increased after every such attempt. Since I do not understand these settings, I kept them as they were. With these default settings, I have spent 7.5 hours to get to 93% of training at this moment (whisper medium). My GPU occasionally shows a spike and that's it (the graph in Task Manager is mostly empty; this is Windows 11), 19.5 of 24 GB VRAM is in use, and the CPU stays at 10-30% (mostly around 20%) the whole time. The pytorch version is "2.5.1+cu124" (I installed it by selecting CUDA 12.4 on the "Start Locally" page).
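For reference, this is how I understand the two parameters relate (just my reading, so please correct me if wrong): the effective batch size is per_device_train_batch_size × gradient_accumulation_steps, and since max_steps stays at 5000, a bigger effective batch also means more samples per run, which would explain the longer ETA.

# Effective batch size on a single GPU: per-device batch size times accumulation steps.
per_device_train_batch_size = 16
gradient_accumulation_steps = 2      # accumulate gradients over 2 batches before each optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)          # 32, like batch_size=32 but without the extra VRAM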
Edit: my dataset split:
from datasets import load_dataset, DatasetDict
common_voice = DatasetDict()
common_voice["train"] = load_dataset("mozilla-foundation/common_voice_17_0", "ur", split="train+validation", trust_remote_code=True)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_17_0", "ur", split="test", trust_remote_code=True)
Sorry for the late reply. I'm guessing you are looking at the general "utilization" graph.
- Win 11 Task Manager does not show CUDA usage by default.
- AFAIK that "utilization" figure does not take CUDA usage into account.
- By default it shows the summary view.
So that we are speaking of the same measure:
- Disable the summary view: right-click the GPU on the left.
- Disable HW acceleration.
- Now you can select CUDA to see the actual utilization. IIRC stuff usually happens in Cuda, Copy 1 and Copy 2, so select them from the top left.
A better tool is nvidia-smi, of course… (a minimal example below)
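For example, this polls GPU utilization and memory once a second while training runs (just a thin Python wrapper; running nvidia-smi -l 1 directly in a terminal works just as well):

import subprocess

# Print GPU utilization and memory every second until interrupted with Ctrl+C.
subprocess.run([
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv",
    "--loop=1",
])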